Hypothesis

When Hypothesis finds an example triggering a bug, it tries to shrink the example down to something simpler that triggers it. This is a pretty common feature, and most property-based testing libraries implement something similar (though there are a number of differences between them). Stand-alone test case reducers are also fairly common, as it's a useful thing to be able to do when reporting bugs in external projects - rather than submitting a giant file triggering the bug, a good test case reducer can often shrink it down to a couple of lines.

But there's a problem with doing this: How do you know that the bug you started with is the same as the bug you ended up with?

This isn't just an academic question. It's very common for the bug you started with to slip to another one.

Consider for example, the following test:

from hypothesis import given, strategies as st


def mean(ls):
    return sum(ls) / len(ls)


@given(st.lists(st.floats()))
def test(ls):
    assert min(ls) <= mean(ls) <= max(ls)

This has a number of interesting ways to fail: We could pass NaN, we could pass [-float('inf'), +float('inf')], we could pass numbers which trigger a precision error, etc.

But after test case reduction, we'll pass the empty list and it will fail because we tried to take the min of an empty sequence.

This isn't necessarily a huge problem - we're still finding a bug after all (though in this case as much in the test as in the code under test) - and sometimes it's even desirable - you find more bugs this way, and sometimes they're ones that Hypothesis would have missed - but often it's not, and an interesting and rare bug slips to a boring and common one.

Historically Hypothesis has had a better answer to this than most - because of the Hypothesis example database, all intermediate bugs are saved and a selection of them will be replayed when you rerun the test. So if you fix one bug then rerun the test, you'll find the other bugs that were previously being hidden from you by that simpler bug.

But that's still not a great user experience - it means that you're not getting nearly as much information as you could be, and you're fixing bugs in Hypothesis's priority order rather than yours. Wouldn't it be better if Hypothesis just told you about all of the bugs it found and you could prioritise them yourself?

Well, as of Hypothesis 3.29.0, released a few weeks ago, now it does!

If you run the above test now, you'll get the following:

Falsifying example: test(ls=[nan])
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 671, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 120, in run
    return test(*args, **kwargs)
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 531, in timed_test
    result = test(*args, **kwargs)
  File "broken.py", line 9, in test
    assert min(ls) <= mean(ls) <= max(ls)
AssertionError

Falsifying example: test(ls=[])
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 671, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 120, in run
    return test(*args, **kwargs)
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 531, in timed_test
    result = test(*args, **kwargs)
  File "broken.py", line 9, in test
    assert min(ls) <= mean(ls) <= max(ls)
ValueError: min() arg is an empty sequence

You can add @seed(67388524433957857561882369659879357765) to this test to reproduce this failure.
Traceback (most recent call last):
  File "broken.py", line 12, in <module>
    test()
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 815, in wrapped_test
    state.run()
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 732, in run
    len(self.falsifying_examples,)))
hypothesis.errors.MultipleFailures: Hypothesis found 2 distinct failures.

(The stack traces are a bit noisy, I know. We have an issue open about cleaning them up).

All of the different bugs are minimized simultaneously and take full advantage of Hypothesis's example shrinking, so each bug is as easy (or hard) to read as if it were the only bug we'd found.

This isn't perfect: The heuristic we use for determining if two bugs are the same is whether they have the same exception type and the exception is thrown from the same line. This will necessarily conflate some bugs that are actually different - for example, [float('nan')], [-float('inf'), float('inf')] and [3002399751580415.0, 3002399751580415.0, 3002399751580415.0] each trigger the assertion in the test, but they are arguably "different" bugs.

But that's OK. The heuristic is deliberately conservative - the point is not that it can distinguish whether any two examples are the same bug, just that any two examples it distinguishes are different enough that it's interesting to show both, and this heuristic definitely manages that.

As far as I know this is a first in property-based testing libraries (though something like it is common in fuzzing tools, and theft is hot on our tail with something similar) and there's been some interesting related but mostly orthogonal research in Erlang QuickCheck.

It was also surprisingly easy.

A lot of things went right in writing this feature, some of them technical, some of them social, somewhere in between.

The technical ones are fairly straightforward: Hypothesis's core model turned out to be very well suited to this feature. Because Hypothesis has a single unified intermediate representation which defines a total ordering for simplicity, adapting Hypothesis to shrink multiple things at once was quite easy - whenever we attempt a shrink and it produces a different bug than the one we were looking for, we compare it to our existing best example for that bug and replace it if the current one is better (or we've discovered a new bug). We then just repeatedly run the shrinking process for each bug we know about until they've all been fully shrunk.

This is in a sense not surprising - I've been thinking about the problem of multiple-shrinking for a long time and, while this is the first time it's actually appeared in Hypothesis, the current choice of model was very much informed by it.

The social ones are perhaps more interesting. Certainly I'm very pleased with how they turned out here.

The first is that this work emerged tangentially from the recent Stripe funded work - Stripe paid me to develop some initial support for testing Pandas code with Hypothesis, and I observed a bunch of bug slippage happening in the wild while I was testing that (it turns out there are quite a lot of ways to trigger exceptions from Pandas - they weren't really Pandas bugs so much as bugs in the Pandas integration, but they still slipped between several different exception types), so that was what got me thinking about this problem again.

Not by accident, this feature also greatly simplified the implementation of the new deadline feature that Smarkets funded, which was going to have to have a lot of logic about how deadlines and bugs interacted, but all that went away as soon as we were able to handle multiple bugs sensibly.

This has been a relatively consistent theme in Hypothesis development - practical problems tend to spark related interesting theoretical developments. It's not a huge exaggeration to say that the fundamental Hypothesis model exists because I wanted to support testing Django nicely. So the recent funded development from Stripe and Smarkets has been a great way to spark a lot of seemingly unrelated development and improve Hypothesis for everyone, even outside the scope of the funded work.

Another thing that really helped here is our review process, and the review from Zac in particular.

This wasn't the feature I originally set out to develop. It started out life as a much simpler feature that used much of the same machinery, and just had a goal of avoiding slipping to new errors all together. Zac pushed back with some good questions around whether this was really the correct thing to do, and after some experimentation and feedback I eventually hit on the design that lead to displaying all of the errors.

Our review handbook emphasises that code review is a collaborative design process, and I feel this was a particularly good example of that. We've created a great culture of code review, and we're reaping the benefits (and if you want to get in on it, we could always use more people able and willing to do review...).

All told, I'm really pleased with how this turned out. I think it's a nice example of getting a lot of things right up front and this resulting in a really cool new feature.

I'm looking forward to seeing how it behaves in the wild. If you notice any particularly fun examples, do let me know, or write up a post about them yourself!