These are articles that are primarily of interest to people who are actually going
to want to write code using Hypothesis. You’re welcome to read them anyway if you’re
not, but they might not be your thing.
In the movie Die Hard with a Vengeance (aka Die Hard 3), there is a famous scene
where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced
to solve a problem or be blown up: given a 3-gallon jug and a 5-gallon jug, how
do you measure out exactly 4 gallons of water?
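The puzzle is small enough to check by brute force. The sketch below is plain Python, not a Hypothesis test: it models the six jug operations as state transitions on a hypothetical `(big, small)` pair and runs a breadth-first search for the shortest sequence of operations that leaves 4 gallons in the big jug. The names `moves` and `solve` are illustrative only.

```python
from collections import deque

BIG, SMALL = 5, 3  # jug capacities in gallons


def moves(state):
    """Yield (action, new_state) for each operation allowed in the puzzle."""
    big, small = state
    yield "fill big", (BIG, small)
    yield "fill small", (big, SMALL)
    yield "empty big", (0, small)
    yield "empty small", (big, 0)
    pour = min(big, SMALL - small)  # pour until big is empty or small is full
    yield "pour big into small", (big - pour, small + pour)
    pour = min(small, BIG - big)  # pour until small is empty or big is full
    yield "pour small into big", (big + pour, small - pour)


def solve(target=4):
    """Breadth-first search for the shortest action list leaving `target` in the big jug."""
    start = (0, 0)
    previous = {start: None}  # state -> (predecessor state, action taken)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state[0] == target:
            path = []
            while previous[state] is not None:
                state, action = previous[state]
                path.append(action)
            return path[::-1]
        for action, nxt in moves(state):
            if nxt not in previous:
                previous[nxt] = (state, action)
                queue.append(nxt)
    return None
```

Because breadth-first search explores states in order of distance from `(0, 0)`, the first plan it returns is one of the shortest, here six operations long.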
So this is that piece. I’ll try to give a from-scratch introduction to the why and what of Hypothesis. It’s primarily
intended for potential PhD supervisors, but should be of general interest as well (especially if you
work in this field).
Why should I care about Hypothesis from a research point of view?
The short version:
Hypothesis takes an existing style of testing (property-based testing) that has proven highly effective in practice
and makes it accessible to a much larger audience. It does so by combining several previously unconnected ideas from the
research literature on testing and verification into a novel implementation that has proven very effective.
Hypothesis has a very different underlying implementation to any other
property-based testing system. As far as I know, it’s an entirely novel
design that I invented.
Central to this design is the following feature set which every
Hypothesis strategy supports automatically (the only way to break
this is by having the data generated depend somehow on external
All generated examples can be safely mutated.
All generated examples can be saved to disk (this is important because
Hypothesis remembers and replays previous failures).
All generated examples can be shrunk.
All invariants that hold in generation must hold during shrinking
(though the probability distribution can of course change, so things
which are only supported with high probability may not be).
(Essentially no other property-based system manages even one of these,
let alone all of them.)
The initial mechanisms for supporting this were fairly complicated, but
after passing through a number of iterations I hit on a very powerful
underlying design that unifies all of these features.
It’s still fairly complicated in implementation, but most of that is
optimisations and things needed to make the core idea work. More
importantly, the complexity is quite contained: A fairly small kernel
handles all of the complexity, and there is little to no additional
complexity (at least, compared to how it normally looks) in defining
new strategies, etc.
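The excerpt doesn't spell the design out, so here is a toy sketch of one way to get all four properties at once. It is loosely in the spirit of generating every value from a recorded sequence of choices, and is my own simplification, not Hypothesis's internals. Strategies only ever read integers from a replayable list, so mutating, saving and shrinking all operate on that list; any list the strategy accepts rebuilds a value through the strategy itself, so generator invariants (here, every element in `[0, 100)`) survive shrinking. `ChoiceData`, `small_int_lists`, `generate_failing` and `shrink` are hypothetical names.

```python
import random


class Overrun(Exception):
    """Raised when a strategy asks for more choices than the sequence holds."""


class ChoiceData:
    """Hands out choices from a fixed, replayable list of non-negative ints."""

    def __init__(self, choices):
        self.choices = choices
        self.index = 0

    def draw(self, n):
        """Return a value in range(n) derived from the next recorded choice."""
        if self.index >= len(self.choices):
            raise Overrun()
        value = self.choices[self.index] % n
        self.index += 1
        return value


def small_int_lists(data):
    """Toy strategy: every element is in [0, 100) by construction."""
    result = []
    while data.draw(2):  # an odd choice means "add another element"
        result.append(data.draw(100))
    return result


def run(strategy, choices):
    """Rebuild a value from choices, or None if the choices run out."""
    try:
        return strategy(ChoiceData(choices))
    except Overrun:
        return None


def generate_failing(strategy, fails, seed=0, attempts=1000):
    """Draw random choice sequences until one produces a failing value."""
    rng = random.Random(seed)
    for _ in range(attempts):
        data = ChoiceData([rng.randrange(256) for _ in range(64)])
        try:
            value = strategy(data)
        except Overrun:
            continue
        if fails(value):
            return data.choices[:data.index]  # keep only the choices used
    return None


def candidates(choices):
    """Simpler choice sequences: first deletions, then lowered entries."""
    for i in range(len(choices)):
        yield choices[:i] + choices[i + 1:]
    for i in range(len(choices)):
        for smaller in range(choices[i]):
            yield choices[:i] + [smaller] + choices[i + 1:]


def shrink(strategy, choices, fails):
    """Greedily simplify the choice sequence (never the value) while it fails."""
    while True:
        for candidate in candidates(choices):
            value = run(strategy, candidate)
            if value is not None and fails(value):
                choices = candidate
                break
        else:
            return choices
```

Every accepted candidate is shorter, or the same length with one entry lowered, so the loop terminates. The shrunk result is a plain list of ints, which is exactly what you would write to disk and replay later with `run`.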
This article will give a high level overview of that model and how
In writing it, though, I forgot that there was a halfway house, also
somewhat bad (but significantly less so), that you see in a couple
This is when the shrinking is not type-based, but still follows the
classic shrinking API that takes a value and returns a lazy list of
shrinks of that value. Examples of libraries that do this are
This works reasonably well and solves the major problems with type-directed
shrinking, but it’s still somewhat fragile and, importantly, does not
compose nearly as well as the approaches that Hypothesis or test.check take.
Ideally, as well as not being based on the types of the values being
generated, shrinking should not be based on the actual values generated.
This may seem counter-intuitive, but it actually works pretty well.
One of the big differences between Hypothesis and Haskell QuickCheck is
how shrinking is handled.
Specifically, the way shrinking is handled in Haskell QuickCheck is bad
and the way it works in Hypothesis (and also in test.check and EQC) is
good. If you’re implementing a property-based testing system, you should
use the good way. If you’re using a property-based testing system and it
doesn’t use the good way, you need to know about this failure mode.
Unfortunately, many (possibly most) implementations of property-based
testing are based on Haskell’s QuickCheck and so make the same mistake.
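A small illustration of the failure mode, under simplifying assumptions: a generator that only ever produces even numbers (`2 * k` for a drawn `k`), a property that fails for inputs above 100, and a toy type-directed integer shrinker in the spirit of QuickCheck's `shrink` (not its actual implementation). Shrinking the *value* by its type wanders into odd numbers the generator could never produce; shrinking the underlying draw `k` and mapping it through `2 * k` again keeps the invariant. That second approach is only loosely analogous to what Hypothesis and test.check do.

```python
def shrink_int(n):
    """Toy type-directed shrinker for non-negative ints: 0 first, then values closer to n."""
    gap = n
    while gap > 0:
        yield n - gap
        gap //= 2


def shrink(value, fails, shrinker):
    """Move to the first shrink candidate that still fails, until none does."""
    while True:
        for candidate in shrinker(value):
            if fails(candidate):
                value = candidate
                break
        else:
            return value


def fails(n):
    """The property under test fails for inputs larger than 100."""
    return n > 100


k = 75  # the generator drew k = 75 ...
generated = 2 * k  # ... and produced the failing even input 150

# Shrinking the value by its type: candidates are arbitrary ints, odd ones included.
by_type = shrink(generated, fails, shrink_int)  # → 101, which 2 * k can never produce

# Shrinking the draw instead: every candidate goes back through 2 * k.
by_draw = 2 * shrink(k, lambda j: fails(2 * j), shrink_int)  # → 102
```

In a real test suite the shrunk value might then fail for a completely different reason (say, an assertion that the input is even), so the reported bug would not be the one you found.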
The encode/decode invariant is one of the most important properties to know about for testing your code
with Hypothesis or other property-based testing systems, because it captures a very common pattern and is
very good at finding bugs.
But how do you go beyond it? If encoders are that common, surely there must be other things
to test with them?
Eris is a library for property-based testing of PHP code, inspired by the mature frameworks that other languages provide, such as QuickCheck, Clojure’s test.check and, of course, Hypothesis.
Here is a side-by-side comparison of some basic and advanced features that have been implemented in both Hypothesis and Eris, which may help developers coming from either Python or PHP and looking at the other side.
This is one of the most common first questions about Hypothesis.
People often assume that the number of tests run will depend on
the specific strategies used, but that’s generally not the case.
Instead, Hypothesis has a fairly fixed set of heuristics to determine
how many times to run, which are mostly independent of the data.
But how many runs is that?
The short answer is 200. Assuming you have a default configuration
and everything is running smoothly, Hypothesis will run your test
The longer answer is “It’s complicated”: it depends on the exact
behaviour of your tests and the values of some settings. In this article
I’ll try to clear up some of the specifics that affect the answer,
and how.
Sometimes you want to generate data which is recursive.
That is, in order to draw some data you may need to draw
some more data from the same strategy. For example we might
want to generate a tree structure, or arbitrary JSON.
Hypothesis has the recursive function in the hypothesis.strategies
module to make this easier to do. This is an article about how to
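As a sketch of how `hypothesis.strategies.recursive` fits together: you give it a base strategy for the leaves and a function that, given a strategy for children, builds a strategy for containers of them. The example generates JSON-like values (using integers rather than floats so NaN doesn't break equality) and checks a round trip through the standard `json` module; the strategy and test names are illustrative.

```python
import json

from hypothesis import given, strategies as st

# Leaves are scalars; each recursion level wraps children in a list or a dict.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=20,  # caps the size of each generated tree
)


@given(json_values)
def test_json_round_trip(value):
    assert json.loads(json.dumps(value)) == value
```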
A lot of applications end up growing a complex configuration system,
with a large number of different knobs and dials you can turn to change
behaviour. Some of these are just for performance tuning, some change
operational concerns, some have other functions.
Testing these is tricky. As the number of parameters goes up, the number
of possible configurations goes up exponentially. Manual testing of the
different combinations quickly becomes completely unmanageable, not
to mention extremely tedious.
Fortunately, this is somewhere where property-based testing in general
and Hypothesis in particular can help a lot.
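One common pattern, sketched here with a hypothetical function and hypothetical knobs: generate whole configurations with `st.fixed_dictionaries`, then check a property that should hold for every combination. For knobs that are meant to be purely about performance, a natural property is that they never change the result.

```python
from hypothesis import given, strategies as st

# Hypothetical configuration knobs; each key gets its own strategy.
configs = st.fixed_dictionaries({
    "chunk_size": st.integers(min_value=1, max_value=64),
    "reverse": st.booleans(),
})


def chunked_sum(values, config):
    """Hypothetical code under test: the knobs should affect speed, never the answer."""
    if config["reverse"]:
        values = values[::-1]
    size = config["chunk_size"]
    return sum(sum(values[start:start + size]) for start in range(0, len(values), size))


@given(st.lists(st.integers()), configs)
def test_config_does_not_change_result(values, config):
    assert chunked_sum(values, config) == sum(values)
```

Hypothesis explores combinations of knob settings for you and shrinks any failure to a small configuration, which is where the exponential blow-up of manual testing stops mattering.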
Many people are quite comfortable writing ordinary unit tests, but feel a bit
confused when they start with property-based testing. This post shows how two
ordinary programmers started with normal Python unit tests and nudged them
incrementally toward property-based tests, gaining many advantages on the way.
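The article's own examples aren't reproduced here, but the general shape of that nudge can be sketched with Python's built-in `sorted` standing in for the code under test: keep the hand-picked input as an `@example`, let `@given` supply many more, and replace the single exact expected answer with properties that hold for every input.

```python
from collections import Counter

from hypothesis import example, given, strategies as st


# Step 1: an ordinary unit test with one hand-picked input and answer.
def test_sorted_fixed_example():
    assert sorted([3, 1, 2]) == [1, 2, 3]


# Step 2: the same check generalised to properties of any input list.
@given(st.lists(st.integers()))
@example([3, 1, 2])  # the original hand-picked case is still always run
def test_sorted_properties(xs):
    result = sorted(xs)
    assert all(a <= b for a, b in zip(result, result[1:]))  # non-decreasing
    assert Counter(result) == Counter(xs)  # same elements, same counts
```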
We’ve previously looked into testing performance optimizations
using Hypothesis, but this
article is about something quite different: It’s about testing code
that is designed to optimize a value. That is, you have some function
and you want to find arguments to it that maximize (or minimize) its
As well as being an interesting subject in its own right, this will also
nicely illustrate the use of Hypothesis’s data() functionality, which
allows you to draw more data after the test has started, and will
introduce a useful general property that can improve your testing in
a much wider variety of settings.
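A sketch of what `st.data()` makes possible here, under assumptions of my own (the article's actual property is not shown in this excerpt): a hypothetical hill-climbing optimizer over integers, tested on a concave function where the local optimum is also the global one. Because `data.draw` can be called mid-test, later draws can depend on earlier ones, and the test can go looking for a point that beats the optimizer's answer after seeing it.

```python
from hypothesis import given, strategies as st


def hill_climb(f, lo, hi, start):
    """Hypothetical optimizer under test: step to a better neighbour until none improves."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbours, key=f, default=x)
        if f(best) <= f(x):
            return x
        x = best


@given(st.data())
def test_no_drawn_point_beats_optimizer(data):
    lo = data.draw(st.integers(-100, 100), label="lo")
    hi = data.draw(st.integers(lo, lo + 200), label="hi")  # depends on lo
    centre = data.draw(st.integers(-150, 150), label="centre")

    def f(x):
        return -((x - centre) ** 2)  # concave: single peak at centre

    start = data.draw(st.integers(lo, hi), label="start")
    best = hill_climb(f, lo, hi, start)
    # Draw extra points after seeing the result: none should score higher.
    for _ in range(5):
        x = data.draw(st.integers(lo, hi))
        assert f(x) <= f(best)
```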
Hypothesis is, of course, a library for writing tests.
But from an implementation point of view this is hardly noticeable.
Really it’s a library for constructing and exploring data and using it
to prove or disprove hypotheses about it. It then has a small testing
library built on top of it.
It’s far more widely used as a testing library, and that’s really where
the focus of its development lies, but with the find function you can
use it just as well to explore your data interactively.
In this article we’ll go through an example of doing this, by using it
to take a brief look at one of my other favourite subjects: Voting
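The article's own voting example isn't in this excerpt. As an assumed stand-in, here is `hypothesis.find` used to search for a Condorcet paradox: an odd number of ranked ballots over three candidates where no candidate beats every other head to head. `find` returns a shrunk example satisfying the condition, or raises if it can't find one within the example budget, which is raised here to make that unlikely.

```python
from hypothesis import find, settings, strategies as st

CANDIDATES = ["A", "B", "C"]


def beats(ballots, x, y):
    """True if a strict majority of ballots rank x above y."""
    wins = sum(1 for ballot in ballots if ballot.index(x) < ballot.index(y))
    return 2 * wins > len(ballots)


def condorcet_winner(ballots):
    """Return the candidate who beats every other head to head, or None."""
    for x in CANDIDATES:
        if all(beats(ballots, x, y) for y in CANDIDATES if y != x):
            return x
    return None


# With an odd number of voters there are no pairwise ties, so "no winner" means a cycle.
paradox = find(
    st.lists(st.permutations(CANDIDATES), min_size=1),
    lambda ballots: len(ballots) % 2 == 1 and condorcet_winner(ballots) is None,
    settings=settings(max_examples=2000, database=None),
)
```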
One thing that often causes people problems is figuring out how to generate
the right data to fit their data model. You can start with just generating
strings and integers, but eventually you want to be able to generate objects
from your domain model. Hypothesis provides a lot of tools to help you build the
data you want, but sometimes the choice can be a bit overwhelming.
Here’s a worked example to walk you through some of the details and help you get to grips with how to use
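One of those tools is `st.builds`, which calls a class or function with arguments drawn from the strategies you give it. The `User` dataclass and its field constraints below are hypothetical, chosen only to show the shape.

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st


@dataclass
class User:  # hypothetical domain object
    name: str
    age: int
    email: str


# Each keyword argument of User gets its own strategy.
users = st.builds(
    User,
    name=st.text(min_size=1),
    age=st.integers(min_value=0, max_value=130),
    email=st.emails(),
)


@given(users)
def test_users_respect_model(user):
    assert user.name and 0 <= user.age <= 130 and "@" in user.email
```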
Hypothesis’s standard testing mechanisms are very good for testing things that can be
considered direct functions of data. But suppose you have some complex stateful
system or object that you want to test. How can you do that?
In this article we’ll see how to use Hypothesis’s rule based state machines to define
tests that generate not just simple data, but entire programs using some stateful
object. These will give the same level of boost to testing the behaviour of the
object as you get to testing the data it accepts.
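A minimal sketch of a rule-based state machine, assuming a hypothetical `TwoStackQueue` as the system under test and `collections.deque` as a trusted model: each `@rule` is an operation Hypothesis may call in any order, `@precondition` stops `pop` from running on an empty queue, and the `@invariant` is checked after every step.

```python
from collections import deque

from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, precondition, rule


class TwoStackQueue:
    """Hypothetical system under test: a FIFO queue built from two lists."""

    def __init__(self):
        self._in, self._out = [], []

    def push(self, x):
        self._in.append(x)

    def pop(self):
        if not self._out:
            while self._in:
                self._out.append(self._in.pop())
        return self._out.pop()

    def __len__(self):
        return len(self._in) + len(self._out)


class QueueMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.queue = TwoStackQueue()
        self.model = deque()  # trusted reference implementation

    @rule(x=st.integers())
    def push(self, x):
        self.queue.push(x)
        self.model.append(x)

    @precondition(lambda self: len(self.model) > 0)
    @rule()
    def pop(self):
        assert self.queue.pop() == self.model.popleft()

    @invariant()
    def lengths_agree(self):
        assert len(self.queue) == len(self.model)


# Exposes the machine as a unittest.TestCase for test runners.
TestQueue = QueueMachine.TestCase
```

If the implementation were wrong, Hypothesis would report a shrunk sequence of `push` and `pop` calls that reproduces the bug.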
One of the simplest types of invariant to find once you move past
just fuzzing your code is asserting that two different operations should
produce the same result, and one of the simplest instances of that is
looking for encode/decode pairs. That is, you have some function that takes a
value and encodes it as another value, and another that is supposed to reverse the process.
This is ripe for testing with Hypothesis because it has a natural, completely defined
specification: encoding and then decoding should be exactly the same as doing nothing.
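A minimal sketch of that specification as a test, using run-length encoding as the encoder; `encode` and `decode` here are illustrative implementations, not code from the article.

```python
from hypothesis import given, strategies as st


def encode(text):
    """Run-length encode a string into a list of (character, count) pairs."""
    runs = []
    for char in text:
        if runs and runs[-1][0] == char:
            runs[-1] = (char, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((char, 1))  # start a new run
    return runs


def decode(runs):
    """Inverse of encode: expand each (character, count) pair."""
    return "".join(char * count for char, count in runs)


@given(st.text())
def test_decode_inverts_encode(s):
    # Encoding then decoding should be exactly the same as doing nothing.
    assert decode(encode(s)) == s
```

The property says nothing about what the encoded form looks like, which is why it applies to almost any encoder without writing a separate oracle.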