Hypothesis

Test faster, fix more

technical

These are articles that are primarily of interest to people who are actually going to want to write code using Hypothesis. You’re welcome to read it anyway if you’re not, but it might not be your thing.

The Threshold Problem

In my last post I mentioned the problem of bug slippage: When you start with one bug, reduce the test case, and end up with another bug.

I’ve run into another related problem twice now, and it’s not one I’ve seen talked about previously.

The problem is this: Sometimes shrinking makes a bug seem much less interesting than it actually is.

Read More

When multiple bugs attack

When Hypothesis finds an example triggering a bug, it tries to shrink the example down to something simpler that triggers it. This is a pretty common feature, and most property-based testing libraries implement something similar (though there are a number of differences between them). Stand-alone test case reducers are also fairly common, as it’s a useful thing to be able to do when reporting bugs in external projects - rather than submitting a giant file triggering the bug, a good test case reducer can often shrink it down to a couple of lines.

But there’s a problem with doing this: How do you know that the bug you started with is the same as the bug you ended up with?

This isn’t just an academic question. It’s very common for the bug you started with to slip to another one.

Consider for example, the following test:

from hypothesis import given, strategies as st

def mean(ls):
    return sum(ls) / len(ls)


@given(st.lists(st.floats()))
def test(ls):
    assert min(ls) <= mean(ls) <= max(ls)

This has a number of interesting ways to fail: We could pass NaN, we could pass [-float('inf'), +float('inf')], we could pass numbers which trigger a precision error, etc.

But after test case reduction, we’ll pass the empty list and it will fail because we tried to take the min of an empty sequence.

This isn’t necessarily a huge problem - we’re still finding a bug after all (though in this case as much in the test as in the code under test) - and sometimes it’s even desirable - you find more bugs this way, and sometimes they’re ones that Hypothesis would have missed - but often it’s not, and an interesting and rare bug slips to a boring and common one.

Historically Hypothesis has had a better answer to this than most - because of the Hypothesis example database, all intermediate bugs are saved and a selection of them will be replayed when you rerun the test. So if you fix one bug then rerun the test, you’ll find the other bugs that were previously being hidden from you by that simpler bug.

But that’s still not a great user experience - it means that you’re not getting nearly as much information as you could be, and you’re fixing bugs in Hypothesis’s priority order rather than yours. Wouldn’t it be better if Hypothesis just told you about all of the bugs it found and you could prioritise them yourself?

Well, as of Hypothesis 3.29.0, released a few weeks ago, now it does!

If you run the above test now, you’ll get the following:

Falsifying example: test(ls=[nan])
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 671, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 120, in run
    return test(*args, **kwargs)
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 531, in timed_test
    result = test(*args, **kwargs)
  File "broken.py", line 9, in test
    assert min(ls) <= mean(ls) <= max(ls)
AssertionError

Falsifying example: test(ls=[])
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 671, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 120, in run
    return test(*args, **kwargs)
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 531, in timed_test
    result = test(*args, **kwargs)
  File "broken.py", line 9, in test
    assert min(ls) <= mean(ls) <= max(ls)
ValueError: min() arg is an empty sequence

You can add @seed(67388524433957857561882369659879357765) to this test to reproduce this failure.
Traceback (most recent call last):
  File "broken.py", line 12, in <module>
    test()
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 815, in wrapped_test
    state.run()
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 732, in run
    len(self.falsifying_examples,)))
hypothesis.errors.MultipleFailures: Hypothesis found 2 distinct failures.

(The stack traces are a bit noisy, I know. We have an issue open about cleaning them up).

All of the different bugs are minimized simultaneously and take full advantage of Hypothesis’s example shrinking, so each bug is as easy (or hard) to read as if it were the only bug we’d found.

This isn’t perfect: The heuristic we use for determining if two bugs are the same is whether they have the same exception type and the exception is thrown from the same line. This will necessarily conflate some bugs that are actually different - for example, [float('nan')], [-float('inf'), float('inf')] and [3002399751580415.0, 3002399751580415.0, 3002399751580415.0] each trigger the assertion in the test, but they are arguably “different” bugs.

But that’s OK. The heuristic is deliberately conservative - the point is not that it can distinguish whether any two examples are the same bug, just that any two examples it distinguishes are different enough that it’s interesting to show both, and this heuristic definitely manages that.

As far as I know this is a first in property-based testing libraries (though something like it is common in fuzzing tools, and theft is hot on our tail with something similar) and there’s been some interesting related but mostly orthogonal research in Erlang QuickCheck.

It was also surprisingly easy.

A lot of things went right in writing this feature, some of them technical, some of them social, somewhere in between.

The technical ones are fairly straightforward: Hypothesis’s core model turned out to be very well suited to this feature. Because Hypothesis has a single unified intermediate representation which defines a total ordering for simplicity, adapting Hypothesis to shrink multiple things at once was quite easy - whenever we attempt a shrink and it produces a different bug than the one we were looking for, we compare it to our existing best example for that bug and replace it if the current one is better (or we’ve discovered a new bug). We then just repeatedly run the shrinking process for each bug we know about until they’ve all been fully shrunk.

This is in a sense not surprising - I’ve been thinking about the problem of multiple-shrinking for a long time and, while this is the first time it’s actually appeared in Hypothesis, the current choice of model was very much informed by it.

The social ones are perhaps more interesting. Certainly I’m very pleased with how they turned out here.

The first is that this work emerged tangentially from the recent Stripe funded work - Stripe paid me to develop some initial support for testing Pandas code with Hypothesis, and I observed a bunch of bug slippage happening in the wild while I was testing that (it turns out there are quite a lot of ways to trigger exceptions from Pandas - they weren’t really Pandas bugs so much as bugs in the Pandas integration, but they still slipped between several different exception types), so that was what got me thinking about this problem again.

Not by accident, this feature also greatly simplified the implementation of the new deadline feature that Smarkets funded, which was going to have to have a lot of logic about how deadlines and bugs interacted, but all that went away as soon as we were able to handle multiple bugs sensibly.

This has been a relatively consistent theme in Hypothesis development - practical problems tend to spark related interesting theoretical developments. It’s not a huge exaggeration to say that the fundamental Hypothesis model exists because I wanted to support testing Django nicely. So the recent funded development from Stripe and Smarkets has been a great way to spark a lot of seemingly unrelated development and improve Hypothesis for everyone, even outside the scope of the funded work.

Another thing that really helped here is our review process, and the review from Zac in particular.

This wasn’t the feature I originally set out to develop. It started out life as a much simpler feature that used much of the same machinery, and just had a goal of avoiding slipping to new errors all together. Zac pushed back with some good questions around whether this was really the correct thing to do, and after some experimentation and feedback I eventually hit on the design that lead to displaying all of the errors.

Our review handbook emphasises that code review is a collaborative design process, and I feel this was a particularly good example of that. We’ve created a great culture of code review, and we’re reaping the benefits (and if you want to get in on it, we could always use more people able and willing to do review…).

All told, I’m really pleased with how this turned out. I think it’s a nice example of getting a lot of things right up front and this resulting in a really cool new feature.

I’m looking forward to seeing how it behaves in the wild. If you notice any particularly fun examples, do let me know, or write up a post about them yourself!

Read More

Moving Beyond Types

If you look at the original property-based testing library, the Haskell version of QuickCheck, tests are very closely tied to types: The way you typically specify a property is by inferring the data that needs to be generated from the types the test function expects for its arguments.

This is a bad idea.

Read More

Solving the Water Jug Problem from Die Hard 3 with TLA+ and Hypothesis

This post was originally published on the author’s personal site. It is reproduced here with his permission.

In the movie Die Hard with a Vengeance (aka Die Hard 3), there is this famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and 5 gallon jug, how do you measure out exactly 4 gallons of water?

(The video title is wrong. It's Die Hard 3.)

Read More

Hypothesis for Computer Science Researchers

I’m in the process of trying to turn my work on Hypothesis into a PhD and I realised that I don’t have a good self-contained summary as to why researchers should care about it.

So this is that piece. I’ll try to give a from scratch introduction to the why and what of Hypothesis. It’s primarily intended for potential PhD supervisors, but should be of general interest as well (especially if you work in this field).

Why should I care about Hypothesis from a research point of view?

The short version:

Hypothesis takes an existing effective style of testing (property-based testing) which has proven highly effective in practice and makes it accessible to a much larger audience. It does so by taking several previously unconnected ideas from the existing research literature on testing and verification, and combining them to produce a novel implementation that has proven very effective in practice.

The long version is the rest of this article.

Read More

How Hypothesis Works

Hypothesis has a very different underlying implementation to any other property-based testing system. As far as I know, it’s an entirely novel design that I invented.

Central to this design is the following feature set which every Hypothesis strategy supports automatically (the only way to break this is by having the data generated depend somehow on external global state):

  1. All generated examples can be safely mutated
  2. All generated examples can be saved to disk (this is important because Hypothesis remembers and replays previous failures).
  3. All generated examples can be shrunk
  4. All invariants that hold in generation must hold during shrinking ( though the probability distribution can of course change, so things which are only supported with high probability may not be).

(Essentially no other property based systems manage one of these claims, let alone all)

The initial mechanisms for supporting this were fairly complicated, but after passing through a number of iterations I hit on a very powerful underlying design that unifies all of these features.

It’s still fairly complicated in implementation, but most of that is optimisations and things needed to make the core idea work. More importantly, the complexity is quite contained: A fairly small kernel handles all of the complexity, and there is little to no additional complexity (at least, compared to how it normally looks) in defining new strategies, etc.

This article will give a high level overview of that model and how it works.

Read More

Compositional shrinking

In my last article about shrinking, I discussed the problems with basing shrinking on the type of the values to be shrunk.

In writing it though I forgot that there was a halfway house which is also somewhat bad (but significantly less so) that you see in a couple of implementations.

This is when the shrinking is not type based, but still follows the classic shrinking API that takes a value and returns a lazy list of shrinks of that value. Examples of libraries that do this are theft and QuickTheories.

This works reasonably well and solves the major problems with type directed shrinking, but it’s still somewhat fragile and importantly does not compose nearly as well as the approaches that Hypothesis or test.check take.

Ideally, as well as not being based on the types of the values being generated, shrinking should not be based on the actual values generated at all.

This may seem counter-intuitive, but it actually works pretty well.

Read More

Integrated vs type based shrinking

One of the big differences between Hypothesis and Haskell QuickCheck is how shrinking is handled.

Specifically, the way shrinking is handled in Haskell QuickCheck is bad and the way it works in Hypothesis (and also in test.check and EQC) is good. If you’re implementing a property based testing system, you should use the good way. If you’re using a property based testing system and it doesn’t use the good way, you need to know about this failure mode.

Unfortunately many (and possibly most) implementations of property based testing are based on Haskell’s QuickCheck and so make the same mistake.

Read More

Another invariant to test for encoders

The encode/decode invariant is one of the most important properties to know about for testing your code with Hypothesis or other property-based testing systems, because it captures a very common pattern and is very good at finding bugs.

But how do you go beyond it? If encoders are that common, surely there must be other things to test with them?

Read More

Hypothesis vs. Eris

Eris is a library for property-based testing of PHP code, inspired by the mature frameworks that other languages provide like QuickCheck, Clojure’s test.check and of course Hypothesis.

Here is a side-by-side comparison of some basic and advanced features that have been implemented in both Hypothesis and Eris, which may help developers coming from either Python or PHP and looking at the other side.

Read More

How many times will Hypothesis run my test?

This is one of the most common first questions about Hypothesis.

People generally assume that the number of tests run will depend on the specific strategies used, but that’s generally not the case. Instead Hypothesis has a fairly fixed set of heuristics to determine how many times to run, which are mostly independent of the data being generated.

But how many runs is that?

The short answer is 200. Assuming you have a default configuration and everything is running smoothly, Hypothesis will run your test 200 times.

The longer answer is “It’s complicated”. It will depend on the exact behaviour of your tests and the value of some settings. In this article I’ll try to clear up some of the specifics of which settings affect the answer and how.

Read More

Generating recursive data

Sometimes you want to generate data which is recursive. That is, in order to draw some data you may need to draw some more data from the same strategy. For example we might want to generate a tree structure, or arbitrary JSON.

Hypothesis has the recursive function in the hypothesis.strategies module to make this easier to do. This is an article about how to use it.

Read More

How do I use pytest fixtures with Hypothesis?

pytest is a great test runner, and is the one Hypothesis itself uses for testing (though Hypothesis works fine with other test runners too).

It has a fairly elaborate fixture system, and people are often unsure how that interacts with Hypothesis. In this article we’ll go over the details of how to use the two together.

Read More

Calculating the mean of a list of numbers

Consider the following problem:

You have a list of floating point numbers. No nasty tricks - these aren’t NaN or Infinity, just normal “simple” floating point numbers.

Now: Calculate the mean (average). Can you do it?

It turns out this is a hard problem. It’s hard to get it even close to right. Lets see why.

Read More

Testing as a Complete Specification

Sometimes you’re lucky enough to have problems where the result is completely specified by a few simple properties.

This doesn’t necessarily correspond to them being easy! Many such problems are actually extremely fiddly to implement.

It does mean that they’re easy to test though. Lets see how.

Read More

Testing Configuration Parameters

A lot of applications end up growing a complex configuration system, with a large number of different knobs and dials you can turn to change behaviour. Some of these are just for performance tuning, some change operational concerns, some have other functions.

Testing these is tricky. As the number of parameters goes up, the number of possible configuration goes up exponentially. Manual testing of the different combinations quickly becomes completely unmanageable, not to mention extremely tedious.

Fortunately, this is somewhere where property-based testing in general and Hypothesis in particular can help a lot.

Read More

Evolving toward property-based testing with Hypothesis

Many people are quite comfortable writing ordinary unit tests, but feel a bit confused when they start with property-based testing. This post shows how two ordinary programmers started with normal Python unit tests and nudged them incrementally toward property-based tests, gaining many advantages on the way.

Read More

Testing Optimizers

We’ve previously looked into testing performance optimizations using Hypothesis, but this article is about something quite different: It’s about testing code that is designed to optimize a value. That is, you have some function and you want to find arguments to it that maximize (or minimize) its value.

As well as being an interesting subject in its own right, this will also nicely illustrate the use of Hypothesis’s data() functionality, which allows you to draw more data after the test has started, and will introduce a useful general property that can improve your testing in a much wider variety of settings.

Read More

Exploring Voting Systems with Hypothesis

Hypothesis is, of course, a library for writing tests.

But from an implementation point of view this is hardly noticeable. Really it’s a library for constructing and exploring data and using it to prove or disprove hypotheses about it. It then has a small testing library built on top of it.

It’s far more widely used as a testing library, and that’s really where the focus of its development lies, but with the find function you can use it just as well to explore your data interactively.

In this article we’ll go through an example of doing this, by using it to take a brief look at one of my other favourite subjects: Voting systems.

Read More

Generating the right data

One thing that often causes people problems is figuring out how to generate the right data to fit their data model. You can start with just generating strings and integers, but eventually you want to be able to generate objects from your domain model. Hypothesis provides a lot of tools to help you build the data you want, but sometimes the choice can be a bit overwhelming.

Here’s a worked example to walk you through some of the details and help you get to grips with how to use them.

Read More

Testing performance optimizations

Once you’ve flushed out the basic crashing bugs in your code, you’re going to want to look for more interesting things to test.

The next easiest thing to test is code where you know what the right answer is for every input.

Obviously in theory you think you know what the right answer is - you can just run the code. That’s not very helpful though, as that’s the answer you’re trying to verify.

But sometimes there is more than one way to get the right answer, and you choose the one you run in production not because it gives a different answer but because it gives the same answer faster.

Read More

Rule Based Stateful Testing

Hypothesis’s standard testing mechanisms are very good for testing things that can be considered direct functions of data. But supposed you have some complex stateful system or object that you want to test. How can you do that?

In this article we’ll see how to use Hypothesis’s rule based state machines to define tests that generate not just simple data, but entire programs using some stateful object. These will give the same level of boost to testing the behaviour of the object as you get to testing the data it accepts.

Read More

QuickCheck in Every Language

There are a lot of ports of QuickCheck, the original property based testing library, to a variety of different languages.

Some of them are good. Some of them are very good. Some of them are OK. Many are not.

I thought it would be worth keeping track of which are which, so I've put together a list.

Read More

The Encode/Decode invariant

One of the simplest types of invariant to find once you move past just fuzzing your code is asserting that two different operations should produce the same result, and one of the simplest instances of that is looking for encode/decode pairs. That is, you have some function that takes a value and encodes it as another value, and another that is supposed to reverse the process.

This is ripe for testing with Hypothesis because it has a natural completely defined specification: Encoding and then decoding should be exactly the same as doing nothing.

Lets look at a concrete example.

Read More

Anatomy of a Hypothesis Based Test

What happens when you run a test using Hypothesis? This article will help you understand.

Read More

Getting started with Hypothesis

Hypothesis will speed up your testing process and improve your software quality, but when first starting out people often struggle to figure out exactly how to use it.

Until you’re used to thinking in this style of testing, it’s not always obvious what the invariants of your code actually are, and people get stuck trying to come up with interesting ones to test.

Fortunately, there’s a simple invariant which every piece of software should satisfy, and which can be remarkably powerful as a way to uncover surprisingly deep bugs in your software.

Read More