Hypothesis

Test faster, fix more

The Hypothesis continuous release process

If you watch the Hypothesis changelog, you’ll notice the rate of releases sped up dramatically in 2017. We released over a hundred different versions, sometimes multiple times a day.

This is all thanks to our continuous release process. We’ve completely automated the process of releasing, so every pull request that changes code gets a new release, without any human input. In this post, I’ll explain how our continuous releases work, and why we find it so useful.

Read More

Smarkets's funding of Hypothesis

Happy new year everybody!

In this post I’d like to tell you about one of the nice things that happened in 2017: the Hypothesis work that was funded by Smarkets. Smarkets are an exchange for peer-to-peer trading of bets but, more importantly for us, they are fairly heavy users of Hypothesis for the Python part of their stack.

Read More

The Threshold Problem

In my last post I mentioned the problem of bug slippage: When you start with one bug, reduce the test case, and end up with another bug.

I’ve run into another related problem twice now, and it’s not one I’ve seen talked about previously.

The problem is this: Sometimes shrinking makes a bug seem much less interesting than it actually is.

Read More

When multiple bugs attack

When Hypothesis finds an example triggering a bug, it tries to shrink the example down to something simpler that triggers it. This is a pretty common feature, and most property-based testing libraries implement something similar (though there are a number of differences between them). Stand-alone test case reducers are also fairly common, as it’s a useful thing to be able to do when reporting bugs in external projects - rather than submitting a giant file triggering the bug, a good test case reducer can often shrink it down to a couple of lines.

But there’s a problem with doing this: How do you know that the bug you started with is the same as the bug you ended up with?

This isn’t just an academic question. It’s very common for the bug you started with to slip to another one.

Consider, for example, the following test:

from hypothesis import given, strategies as st

def mean(ls):
    # A deliberately naive mean - watch what happens to it below.
    return sum(ls) / len(ls)


@given(st.lists(st.floats()))
def test(ls):
    assert min(ls) <= mean(ls) <= max(ls)

This has a number of interesting ways to fail: We could pass NaN, we could pass [-float('inf'), +float('inf')], we could pass numbers which trigger a precision error, etc.

But after test case reduction, we’ll pass the empty list and it will fail because we tried to take the min of an empty sequence.

This isn’t necessarily a huge problem - we’re still finding a bug, after all (though in this case it’s as much in the test as in the code under test) - and sometimes it’s even desirable: you find more bugs this way, and sometimes they’re ones that Hypothesis would otherwise have missed. But often it isn’t, and an interesting, rare bug slips to a boring, common one.

Historically Hypothesis has had a better answer to this than most - because of the Hypothesis example database, all intermediate bugs are saved and a selection of them will be replayed when you rerun the test. So if you fix one bug then rerun the test, you’ll find the other bugs that were previously being hidden from you by that simpler bug.

But that’s still not a great user experience - it means that you’re not getting nearly as much information as you could be, and you’re fixing bugs in Hypothesis’s priority order rather than yours. Wouldn’t it be better if Hypothesis just told you about all of the bugs it found and you could prioritise them yourself?

Well, as of Hypothesis 3.29.0, released a few weeks ago, now it does!

If you run the above test now, you’ll get the following:

Falsifying example: test(ls=[nan])
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 671, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 120, in run
    return test(*args, **kwargs)
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 531, in timed_test
    result = test(*args, **kwargs)
  File "broken.py", line 9, in test
    assert min(ls) <= mean(ls) <= max(ls)
AssertionError

Falsifying example: test(ls=[])
Traceback (most recent call last):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 671, in run
    print_example=True, is_final=True
  File "/home/david/hypothesis-python/src/hypothesis/executors.py", line 58, in default_new_style_executor
    return function(data)
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 120, in run
    return test(*args, **kwargs)
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 531, in timed_test
    result = test(*args, **kwargs)
  File "broken.py", line 9, in test
    assert min(ls) <= mean(ls) <= max(ls)
ValueError: min() arg is an empty sequence

You can add @seed(67388524433957857561882369659879357765) to this test to reproduce this failure.
Traceback (most recent call last):
  File "broken.py", line 12, in <module>
    test()
  File "broken.py", line 8, in test
    def test(ls):
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 815, in wrapped_test
    state.run()
  File "/home/david/hypothesis-python/src/hypothesis/core.py", line 732, in run
    len(self.falsifying_examples,)))
hypothesis.errors.MultipleFailures: Hypothesis found 2 distinct failures.

(The stack traces are a bit noisy, I know. We have an issue open about cleaning them up.)

All of the different bugs are minimized simultaneously and take full advantage of Hypothesis’s example shrinking, so each bug is as easy (or hard) to read as if it were the only bug we’d found.

This isn’t perfect: The heuristic we use for determining if two bugs are the same is whether they have the same exception type and the exception is thrown from the same line. This will necessarily conflate some bugs that are actually different - for example, [float('nan')], [-float('inf'), float('inf')] and [3002399751580415.0, 3002399751580415.0, 3002399751580415.0] each trigger the assertion in the test, but they are arguably “different” bugs.

But that’s OK. The heuristic is deliberately conservative - the point is not that it can distinguish whether any two examples are the same bug, just that any two examples it distinguishes are different enough that it’s interesting to show both, and this heuristic definitely manages that.

As far as I know this is a first in property-based testing libraries (though something like it is common in fuzzing tools, and theft is hot on our tail with something similar) and there’s been some interesting related but mostly orthogonal research in Erlang QuickCheck.

It was also surprisingly easy.

A lot of things went right in writing this feature - some of them technical, some of them social, and some somewhere in between.

The technical ones are fairly straightforward: Hypothesis’s core model turned out to be very well suited to this feature. Because Hypothesis has a single unified intermediate representation that comes with a total ordering by simplicity, adapting Hypothesis to shrink multiple things at once was quite easy: whenever a shrink attempt produces a different bug than the one we were looking for, we compare it to our existing best example for that bug and replace it if the new one is simpler (or record it if we’ve discovered a new bug). We then just repeatedly run the shrinking process for each bug we know about until they’ve all been fully shrunk.
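To make that a little more concrete, here is a rough sketch of the bookkeeping involved. This is not Hypothesis’s actual code - run_test, shrinks_of, better_than and the (exception type, line) key are simplified stand-ins - but it has the same shape as the loop described above:

def shrink_all(initial_example, run_test, shrinks_of, better_than):
    # Illustrative sketch only, not Hypothesis internals.
    # `run_test` returns None on success, or a failure with .exception and .line.
    failure = run_test(initial_example)
    # Map from "which bug is this?" to the simplest example known for it.
    best = {(type(failure.exception), failure.line): initial_example}

    changed = True
    while changed:
        changed = False
        for bug, example in list(best.items()):
            for candidate in shrinks_of(example):
                result = run_test(candidate)
                if result is None:
                    continue  # candidate passes, so it's no use for shrinking
                key = (type(result.exception), result.line)
                # Either a brand new bug, or a simpler example of a known one.
                if key not in best or better_than(candidate, best[key]):
                    best[key] = candidate
                    changed = True
    return best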

This is in a sense not surprising - I’ve been thinking about the problem of multiple-shrinking for a long time and, while this is the first time it’s actually appeared in Hypothesis, the current choice of model was very much informed by it.

The social ones are perhaps more interesting. Certainly I’m very pleased with how they turned out here.

The first is that this work emerged tangentially from the recent Stripe-funded work. Stripe paid me to develop some initial support for testing Pandas code with Hypothesis, and I observed a bunch of bug slippage happening in the wild while testing that (it turns out there are quite a lot of ways to trigger exceptions from Pandas - they weren’t really Pandas bugs so much as bugs in the Pandas integration, but they still slipped between several different exception types). That was what got me thinking about this problem again.

Not by accident, this feature also greatly simplified the implementation of the new deadline feature that Smarkets funded, which would otherwise have needed a lot of logic about how deadlines and bugs interacted - all of that went away as soon as we were able to handle multiple bugs sensibly.

This has been a relatively consistent theme in Hypothesis development - practical problems tend to spark related interesting theoretical developments. It’s not a huge exaggeration to say that the fundamental Hypothesis model exists because I wanted to support testing Django nicely. So the recent funded development from Stripe and Smarkets has been a great way to spark a lot of seemingly unrelated development and improve Hypothesis for everyone, even outside the scope of the funded work.

Another thing that really helped here is our review process, and the review from Zac in particular.

This wasn’t the feature I originally set out to develop. It started out life as a much simpler feature that used much of the same machinery and just had the goal of avoiding slipping to new errors altogether. Zac pushed back with some good questions about whether this was really the correct thing to do, and after some experimentation and feedback I eventually hit on the design that led to displaying all of the errors.

Our review handbook emphasises that code review is a collaborative design process, and I feel this was a particularly good example of that. We’ve created a great culture of code review, and we’re reaping the benefits (and if you want to get in on it, we could always use more people able and willing to do review…).

All told, I’m really pleased with how this turned out. I think it’s a nice example of getting a lot of things right up front and having that pay off in a really cool new feature.

I’m looking forward to seeing how it behaves in the wild. If you notice any particularly fun examples, do let me know, or write up a post about them yourself!

Read More

Moving Beyond Types

If you look at the original property-based testing library, the Haskell version of QuickCheck, tests are very closely tied to types: The way you typically specify a property is by inferring the data that needs to be generated from the types the test function expects for its arguments.

This is a bad idea.
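For contrast, here is roughly what the Hypothesis style looks like: the strategies are spelled out explicitly in the @given decorator and nothing is inferred from the function’s types (a small illustrative test, not taken from the post):

from hypothesis import given, strategies as st

# The strategies, not the type signature, decide what gets generated.
@given(st.text(), st.text())
def test_concatenation_adds_lengths(s, t):
    assert len(s + t) == len(s) + len(t)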

Read More

Solving the Water Jug Problem from Die Hard 3 with TLA+ and Hypothesis

This post was originally published on the author’s personal site. It is reproduced here with his permission.

In the movie Die Hard with a Vengeance (aka Die Hard 3), there is this famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and 5 gallon jug, how do you measure out exactly 4 gallons of water?


Read More

Hypothesis for Computer Science Researchers

I’m in the process of trying to turn my work on Hypothesis into a PhD and I realised that I don’t have a good self-contained summary as to why researchers should care about it.

So this is that piece. I’ll try to give a from scratch introduction to the why and what of Hypothesis. It’s primarily intended for potential PhD supervisors, but should be of general interest as well (especially if you work in this field).

Why should I care about Hypothesis from a research point of view?

The short version:

Hypothesis takes an existing style of testing (property-based testing) that has proven highly effective in practice and makes it accessible to a much larger audience. It does so by taking several previously unconnected ideas from the existing research literature on testing and verification and combining them into a novel implementation that works very well in practice.

The long version is the rest of this article.

Read More

How Hypothesis Works

Hypothesis has a very different underlying implementation from any other property-based testing system. As far as I know, it’s an entirely novel design that I invented.

Central to this design is the following feature set, which every Hypothesis strategy supports automatically (the only way to break this is by having the generated data depend somehow on external global state):

  1. All generated examples can be safely mutated.
  2. All generated examples can be saved to disk (this is important because Hypothesis remembers and replays previous failures).
  3. All generated examples can be shrunk.
  4. All invariants that hold during generation also hold during shrinking (though the probability distribution can of course change, so things which are only guaranteed with high probability may not be preserved).

(Essentially no other property-based testing system manages even one of these, let alone all four.)

The initial mechanisms for supporting this were fairly complicated, but after passing through a number of iterations I hit on a very powerful underlying design that unifies all of these features.

It’s still fairly complicated in implementation, but most of that is optimisations and things needed to make the core idea work. More importantly, the complexity is quite contained: A fairly small kernel handles all of the complexity, and there is little to no additional complexity (at least, compared to how it normally looks) in defining new strategies, etc.

This article will give a high level overview of that model and how it works.

Read More

Compositional shrinking

In my last article about shrinking, I discussed the problems with basing shrinking on the type of the values to be shrunk.

In writing it, though, I forgot that there is a halfway house you see in a couple of implementations, which is also somewhat bad (but significantly less so).

This is when the shrinking is not type based, but still follows the classic shrinking API that takes a value and returns a lazy list of shrinks of that value. Examples of libraries that do this are theft and QuickTheories.
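For reference, that classic API is essentially a function from a value to a lazy sequence of smaller values. Here is an illustrative sketch of what such a shrinker tends to look like (not code from either library):

def shrink_int(n):
    # Yield progressively "simpler" integers: zero first, then values closer to n.
    if n != 0:
        yield 0
    delta = abs(n) // 2
    while delta > 0:
        yield n - delta if n > 0 else n + delta
        delta //= 2

def shrink_list(ls):
    # Try deleting each element, then shrinking each element in place.
    for i in range(len(ls)):
        yield ls[:i] + ls[i + 1:]
    for i, x in enumerate(ls):
        for smaller in shrink_int(x):
            yield ls[:i] + [smaller] + ls[i + 1:]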

This works reasonably well and solves the major problems with type directed shrinking, but it’s still somewhat fragile and importantly does not compose nearly as well as the approaches that Hypothesis or test.check take.

Ideally, as well as not being based on the types of the values being generated, shrinking should not be based on the actual values generated at all.

This may seem counter-intuitive, but it actually works pretty well.

Read More

Integrated vs type based shrinking

One of the big differences between Hypothesis and Haskell QuickCheck is how shrinking is handled.

Specifically, the way shrinking is handled in Haskell QuickCheck is bad and the way it works in Hypothesis (and also in test.check and EQC) is good. If you’re implementing a property based testing system, you should use the good way. If you’re using a property based testing system and it doesn’t use the good way, you need to know about this failure mode.

Unfortunately many (and possibly most) implementations of property based testing are based on Haskell’s QuickCheck and so make the same mistake.

Read More

3.6.0 Release of Hypothesis for Python

This is a release announcement for the 3.6.0 release of Hypothesis for Python. It’s a bit of an emergency release.

Hypothesis 3.5.0 inadvertently added a dependency on GPLed code (see below for how this happened) which this release removes. This means that if you are running Hypothesis 3.5.x then there is a good chance you are in violation of the GPL and you should update immediately.

Apologies for any inconvenience this may have caused.

Read More

Another invariant to test for encoders

The encode/decode invariant is one of the most important properties to know about for testing your code with Hypothesis or other property-based testing systems, because it captures a very common pattern and is very good at finding bugs.

But how do you go beyond it? If encoders are that common, surely there must be other things to test with them?

Read More

Seeking funding for deeper integration between Hypothesis and pytest

Probably the number one complaint I hear from Hypothesis users is that it “doesn’t work” with py.test fixtures. This isn’t true, but it does have one very specific limitation in how it works that annoys people: it only runs function-scoped fixtures once for the entire test, not once per example. Because of the way people use function-scoped fixtures for handling stateful things like databases, this often causes problems.

I’ve been maintaining for a while that this is impossible to fix without some changes on the pytest end.

The good news is that this turns out not to be the case. After some conversations with pytest developers, some examining of other pytest plugins, and a bunch of prototyping, I’m pretty sure it’s possible. It’s just really annoying and a lot of work.

So that’s the good news. The bad news is that this isn’t going to happen without someone funding the work.

I’ve now spent about a week of fairly solid work on this, and what I’ve got is quite promising: the core objective of running pytest fixtures for every example works fairly seamlessly.

But it’s now in the long tail of problems that will need to be squashed before this counts as an actual production ready releasable piece of work. A number of things don’t work. For example, currently it’s running some module scoped fixtures once per example too, which it clearly shouldn’t be doing. It also currently has some pretty major performance problems that are bad enough that I would consider them release blocking.

As a result I’d estimate there’s easily another 2-3 weeks of work needed to get this out the door.

Which brings us to the crux of the matter: 2-3 additional weeks of free work on top of the one I’ve already done is 3-4 weeks more free work than I particularly want to do on this feature, so without sponsorship it’s not getting finished.

I typically charge £400/day for work on Hypothesis (this is heavily discounted off my normal rates), so 2-3 weeks comes to £4000 to £6000 (roughly $5000 to $8000) that has to come from somewhere.

I know there are a number of companies out there using pytest and Hypothesis together. I know from the amount of complaining about this integration that this is a real problem you’re experiencing. So, I think this money should come from those companies. Besides helping to support a tool you’ve already got a lot of value out of, this will expand the scope of what you can easily test with Hypothesis a lot, and will be hugely beneficial to your bug finding efforts.

This is a model that has worked well before with the funding of the recent statistics work by Jean-Louis Fuchs and Adfinis-SyGroup, and I’m confident it can work well again.

If you work at such a company and would like to talk about funding some or part of this development, please email me at [email protected].

Read More

3.5.0 and 3.5.1 Releases of Hypothesis for Python

This is a combined release announcement for two releases. 3.5.0 was released yesterday, and 3.5.1 has been released today after some early bug reports in 3.5.0.

Changes

3.5.0 - 2016-09-22

This is a feature release.

  • fractions() and decimals() strategies now support min_value and max_value parameters. Thanks go to Anne Mulhern for the development of this feature.
  • The Hypothesis pytest plugin now supports a --hypothesis-show-statistics parameter that gives detailed statistics about the tests that were run. Huge thanks to Jean-Louis Fuchs and Adfinis-SyGroup for funding the development of this feature.
  • There is a new event() function that can be used to add custom statistics (a brief example follows below).
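As a rough illustration of the new statistics hooks (not an example from the release itself), a test can label its examples with event() and the counts then show up when you run pytest with --hypothesis-show-statistics:

from hypothesis import event, given, strategies as st

@given(st.integers())
def test_round_trips_through_str(n):
    # event() records a label per example; tallies appear in the statistics output.
    event("negative" if n < 0 else "non-negative")
    assert int(str(n)) == n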

Additionally there have been some minor bug fixes:

  • In some cases Hypothesis should produce fewer duplicate examples (this will mostly only affect cases with a single parameter).
  • py.test command line parameters are now under an option group for Hypothesis (thanks to David Keijser for fixing this)
  • Hypothesis would previously error if you used function annotations on your tests under Python 3.4.
  • The repr of many strategies using lambdas has been improved to include the lambda body (this was previously supported in many but not all cases).

3.5.1 - 2016-09-23

This is a bug fix release.

  • Hypothesis now runs cleanly in -B and -BB modes, avoiding mixing bytes and unicode.
  • unittest.TestCase tests would not have shown up in the new statistics mode. Now they do.
  • Similarly, stateful tests would not have shown up in statistics and now they do.
  • Statistics now print with pytest node IDs (the names you’d get in pytest verbose mode).

Notes

Aside from the above changes, there are a couple of big things behind the scenes of this release that make it a big deal.

The first is that the flagship chunk of work, statistics, is a long-standing want-to-have that had never quite been prioritised. By funding it, Jean-Louis and Adfinis-SyGroup successfully bumped it up to the top of the priority list, making it the first funded feature in Hypothesis for Python!

Another change, less significant but still important, is that this release marks the first real break with an unofficial Hypothesis for Python policy of not having any dependencies other than the standard library and backports: this release adds a dependency on the uncompyle6 package. This may seem like an odd choice, but it was invaluable for fixing the repr behaviour, which in turn was really needed for providing good statistics for filter and recursive strategies.

Read More

Hypothesis vs. Eris

Eris is a library for property-based testing of PHP code, inspired by the mature frameworks that other languages provide like QuickCheck, Clojure’s test.check and of course Hypothesis.

Here is a side-by-side comparison of some basic and advanced features that have been implemented in both Hypothesis and Eris, which may help developers coming from either Python or PHP and looking at the other side.

Read More

How many times will Hypothesis run my test?

This is one of the most common first questions about Hypothesis.

People generally assume that the number of tests run will depend on the specific strategies used, but that’s generally not the case. Instead Hypothesis has a fairly fixed set of heuristics to determine how many times to run, which are mostly independent of the data being generated.

But how many runs is that?

The short answer is 200. Assuming you have a default configuration and everything is running smoothly, Hypothesis will run your test 200 times.

The longer answer is “It’s complicated”. It will depend on the exact behaviour of your tests and the value of some settings. In this article I’ll try to clear up some of the specifics of which settings affect the answer and how.
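The most direct of those settings is max_examples. For instance (an illustrative snippet; the article covers how the other settings interact with it):

from hypothesis import given, settings, strategies as st

# Ask Hypothesis for more examples than the default for this one test.
@settings(max_examples=500)
@given(st.text())
def test_strip_is_idempotent(s):
    assert s.strip().strip() == s.strip()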

Read More

Generating recursive data

Sometimes you want to generate data which is recursive. That is, in order to draw some data you may need to draw some more data from the same strategy. For example we might want to generate a tree structure, or arbitrary JSON.

Hypothesis has the recursive function in the hypothesis.strategies module to make this easier to do. This is an article about how to use it.
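As a taste of what that looks like, here is a small illustrative strategy for arbitrary JSON-like values: a base case of scalars, extended with lists and string-keyed dictionaries of whatever has already been built:

from hypothesis import strategies as st

json_values = st.recursive(
    # Base case: JSON scalars.
    st.none() | st.booleans() | st.floats(allow_nan=False) | st.text(),
    # Extension: lists of children, or objects mapping strings to children.
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
)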

Read More

How do I use pytest fixtures with Hypothesis?

pytest is a great test runner, and is the one Hypothesis itself uses for testing (though Hypothesis works fine with other test runners too).

It has a fairly elaborate fixture system, and people are often unsure how that interacts with Hypothesis. In this article we’ll go over the details of how to use the two together.
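As a quick preview, the basic pattern is simply to accept the fixture as an extra argument alongside the ones @given fills in. A small illustrative example (keeping in mind the caveat that function-scoped fixtures run once per test, not once per example):

import pytest
from hypothesis import given, strategies as st

@pytest.fixture
def separator():
    # Runs once per test function, not once per generated example.
    return ","

@given(xs=st.lists(st.text(alphabet="abc"), min_size=1))
def test_join_then_split_round_trips(separator, xs):
    assert separator.join(xs).split(separator) == xs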

Read More

What is Hypothesis?

Hypothesis is a library designed to help you write what are called property-based tests.

The key idea of property-based testing is that rather than writing a test that checks just a single scenario, you write tests that describe a range of scenarios and then let the computer explore the possibilities for you, instead of having to hand-write every one yourself.

In order to contrast this with the sort of tests you might be used to, when talking about property-based testing we tend to describe the normal sort of testing as example-based testing.

Property-based testing can be significantly more powerful than example-based testing, because it automates the most time-consuming part of writing tests - coming up with the specific examples - and will usually perform it better than a human would. This allows you to focus on the parts that humans are better at: understanding the system, its range of acceptable behaviours, and how they might break.

You don’t need a library to do property-based testing. If you’ve ever written a test which generates some random data and uses it for testing, that’s a property-based test. But having a library can help you a lot, making your tests easier to write, more robust, and better at finding bugs. In the rest of this article we’ll see how.
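To make that concrete, here is about the smallest possible Hypothesis test - a single property of the built-in sorted function, checked against a whole range of generated inputs (an illustrative snippet):

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # One property, checked across many generated examples.
    assert sorted(sorted(xs)) == sorted(xs)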

Read More

3.4.2 Release of Hypothesis for Python

This is a bug fix release, fixing a number of problems with the settings system:

  • Test functions defined using @given can now be called from other threads (Issue #337)
  • Attempting to delete a settings property would previously have silently done the wrong thing. Now it raises an AttributeError.
  • Creating a settings object with a custom database_file parameter was silently getting ignored and the default was being used instead. Now it’s not.

Notes

For historic reasons, _settings.py had been excluded from the requirement to have 100% branch coverage. Issue #337 would have been caught by a coverage requirement: the code in question simply couldn’t have worked, but it was not covered by any tests, so it slipped through.

As part of the general principle that bugs shouldn’t just be fixed without addressing the reason why the bug slipped through in the first place, I decided to impose the coverage requirements on _settings.py as well, which is how the other two bugs were found. Both of these had code that was never run during tests - in the case of the deletion bug there was a __delete__ descriptor method that was never being run, and in the case of the database_file one there was a check later that could never fire because the internal _database field was always being set in __init__.

I feel like this experiment thoroughly validated that 100% coverage is a useful thing to aim for. Unfortunately it also pointed out that the settings system is much more complicated than it needs to be. I’m unsure what to do about that - some of its functionality is a bit too baked into the public API to lightly change, and I don’t think it’s worth breaking that just to simplify the code.

Read More

Hypothesis for Python 3.4.1 Release

This is a bug fix release for a single bug:

  • On Windows when running two Hypothesis processes in parallel (e.g. using pytest-xdist) they could race with each other and one would raise an exception due to the non-atomic nature of file renaming on Windows and the fact that you can’t rename over an existing file. This is now fixed.

Notes

My tendency of doing immediate patch releases for bugs is unusual but generally seems to be appreciated. In this case this was a bug that was blocking a py.test merge.

I suspect this is not the last bug around atomic file creation on Windows. Cross platform atomic file creation seems to be a harder problem than I would have expected.

Read More

Calculating the mean of a list of numbers

Consider the following problem:

You have a list of floating point numbers. No nasty tricks - these aren’t NaN or Infinity, just normal “simple” floating point numbers.

Now: Calculate the mean (average). Can you do it?

It turns out this is a hard problem. It’s hard to get it even close to right. Let’s see why.
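To give just one taste of the trouble ahead: the obvious sum(ls) / len(ls) can overflow to infinity even when every input and the true mean are perfectly representable floats.

>>> ls = [9e307, 9e307]
>>> sum(ls) / len(ls)  # the true mean is 9e307, but the intermediate sum overflows
inf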

Read More

Testing as a Complete Specification

Sometimes you’re lucky enough to have problems where the result is completely specified by a few simple properties.

This doesn’t necessarily correspond to them being easy! Many such problems are actually extremely fiddly to implement.

It does mean that they’re easy to test though. Let’s see how.

Read More

Testing Configuration Parameters

A lot of applications end up growing a complex configuration system, with a large number of different knobs and dials you can turn to change behaviour. Some of these are just for performance tuning, some change operational concerns, some have other functions.

Testing these is tricky. As the number of parameters goes up, the number of possible configurations goes up exponentially. Manual testing of the different combinations quickly becomes completely unmanageable, not to mention extremely tedious.

Fortunately, this is somewhere where property-based testing in general and Hypothesis in particular can help a lot.
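The basic move is to generate entire configurations as data. Here is a hedged sketch of what that can look like - the configuration schema and the service being started are made up purely for illustration:

from hypothesis import given, strategies as st

class Service(object):
    # Stand-in for the real system under test.
    def __init__(self, config):
        self.config = config

    def is_healthy(self):
        return True

configs = st.fixed_dictionaries({
    "cache_size": st.integers(min_value=0, max_value=10000),
    "compression": st.booleans(),
    "log_level": st.sampled_from(["DEBUG", "INFO", "WARNING", "ERROR"]),
})

@given(configs)
def test_service_starts_with_any_configuration(config):
    service = Service(config)
    assert service.is_healthy()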

Read More

Evolving toward property-based testing with Hypothesis

Many people are quite comfortable writing ordinary unit tests, but feel a bit confused when they start with property-based testing. This post shows how two ordinary programmers started with normal Python unit tests and nudged them incrementally toward property-based tests, gaining many advantages on the way.

Read More

Guest Posts Welcome

I would like to see more posters on the hypothesis.works blog. I’m particularly interested in experience reports from people who use Hypothesis in the wild. Could that be you?

Read More

Testing Optimizers

We’ve previously looked into testing performance optimizations using Hypothesis, but this article is about something quite different: It’s about testing code that is designed to optimize a value. That is, you have some function and you want to find arguments to it that maximize (or minimize) its value.

As well as being an interesting subject in its own right, this will also nicely illustrate the use of Hypothesis’s data() functionality, which allows you to draw more data after the test has started, and will introduce a useful general property that can improve your testing in a much wider variety of settings.
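As a preview of data(), here is a small illustrative test (not taken from the article) where a later draw depends on an earlier one:

from hypothesis import given, strategies as st

@given(st.data())
def test_can_index_into_a_drawn_list(data):
    # Draw a list first, then draw an index that depends on its length.
    xs = data.draw(st.lists(st.integers(), min_size=1))
    i = data.draw(st.integers(min_value=0, max_value=len(xs) - 1))
    assert xs[i] in xs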

Read More

Exploring Voting Systems with Hypothesis

Hypothesis is, of course, a library for writing tests.

But from an implementation point of view this is hardly noticeable. Really it’s a library for constructing and exploring data and using it to prove or disprove hypotheses about it. It then has a small testing library built on top of it.

It’s far more widely used as a testing library, and that’s really where the focus of its development lies, but with the find function you can use it just as well to explore your data interactively.
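For instance, find takes a strategy and a predicate and hands back a minimal example satisfying it, which makes it handy for this kind of interactive exploration (a quick illustration):

from hypothesis import find, strategies as st

# A minimal list of integers whose sum is at least 10 - typically just [10].
find(st.lists(st.integers()), lambda xs: sum(xs) >= 10)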

In this article we’ll go through an example of doing this, by using it to take a brief look at one of my other favourite subjects: Voting systems.

Read More

Announcing Hypothesis Legacy Support

For a brief period, Python 2.6 was supported in Hypothesis for Python. Because Python 2.6 has been end-of-lifed for some time, I decided this wasn’t a priority, and support was dropped in Hypothesis 2.0.

I’ve now added it back, but under a more restrictive license.

If you want to use Hypothesis on Python 2.6, you can now do so by installing the hypothesislegacysupport package. This will allow you to run Hypothesis on Python 2.6.

Note that by default this is licensed under the GNU Affero General Public License 3.0. If you want to use it in commercial software you will likely want to buy a commercial license. Email us at [email protected] to discuss details.

Read More

What is Property Based Testing?

I get asked this a lot, and I write property-based testing tools for a living, so you’d think I’d have a good answer, but historically I haven’t. This is my attempt to fix that.

Historically the definition of property based testing has been “The thing that QuickCheck does”. As a working definition this has served pretty well, but the problem is that it makes us unable to distinguish what the essential features of property-based testing are and what are just accidental features that appeared in the implementations that we’re used to.

As the author of a property based testing system which diverges quite a lot from QuickCheck, this troubles me more than it would most people, so I thought I’d set out some of my thoughts on what property based testing is and isn’t.

This isn’t intended to be definitive, and it will evolve over time as my thoughts do, but it should provide a useful grounding for further discussion.

Read More

Generating the right data

One thing that often causes people problems is figuring out how to generate the right data to fit their data model. You can start with just generating strings and integers, but eventually you want to be able to generate objects from your domain model. Hypothesis provides a lot of tools to help you build the data you want, but sometimes the choice can be a bit overwhelming.

Here’s a worked example to walk you through some of the details and help you get to grips with how to use them.
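One of the workhorses for this is st.builds(), which constructs your own objects from strategies for their fields. A tiny sketch, with a made-up Project type standing in for a real domain model:

from collections import namedtuple
from hypothesis import given, strategies as st

# A made-up domain object, purely for illustration.
Project = namedtuple("Project", ("name", "collaborators"))

projects = st.builds(
    Project,
    name=st.text(min_size=1),
    collaborators=st.lists(st.text(min_size=1), unique=True),
)

@given(projects)
def test_every_project_has_a_name(project):
    assert project.name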

Read More

You Don't Need Referential Transparency

It’s a common belief that in order for property based testing to be useful, your code must be referentially transparent. That is, it must be a pure function with no side effects that just takes input data and produces output data and is solely defined by what input data produces what output data.

This is, bluntly, complete and utter nonsense with no basis in reality.

Read More

Testing performance optimizations

Once you’ve flushed out the basic crashing bugs in your code, you’re going to want to look for more interesting things to test.

The next easiest thing to test is code where you know what the right answer is for every input.

Obviously in theory you think you know what the right answer is - you can just run the code. That’s not very helpful though, as that’s the answer you’re trying to verify.

But sometimes there is more than one way to get the right answer, and you choose the one you run in production not because it gives a different answer but because it gives the same answer faster.
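The resulting test shape is very simple: run both implementations and assert that they agree. A sketch, using the built-in sum as a stand-in for the fast production version and a deliberately naive loop as the reference:

from hypothesis import given, strategies as st

def sum_reference(xs):
    # Obviously-correct (and slow) reference implementation.
    total = 0
    for x in xs:
        total += x
    return total

@given(st.lists(st.integers()))
def test_fast_sum_matches_reference(xs):
    assert sum(xs) == sum_reference(xs)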

Read More

Rule Based Stateful Testing

Hypothesis’s standard testing mechanisms are very good for testing things that can be considered direct functions of data. But suppose you have some complex stateful system or object that you want to test. How can you do that?

In this article we’ll see how to use Hypothesis’s rule-based state machines to define tests that generate not just simple data, but entire programs that exercise some stateful object. This gives the same kind of boost to testing the object’s behaviour as you get for testing the data it accepts.
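To give a flavour of what that looks like, here is a minimal rule-based machine that exercises a plain Python list as the stateful object (illustrative only; the article goes into much more depth):

from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule

class ListMachine(RuleBasedStateMachine):
    def __init__(self):
        super(ListMachine, self).__init__()
        self.contents = []

    @rule(x=st.integers())
    def append(self, x):
        self.contents.append(x)
        assert self.contents[-1] == x

    @rule()
    def clear(self):
        self.contents = []
        assert not self.contents

# The generated test case runs under your normal test runner.
TestListMachine = ListMachine.TestCase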

Read More

QuickCheck in Every Language

There are a lot of ports of QuickCheck, the original property based testing library, to a variety of different languages.

Some of them are good. Some of them are very good. Some of them are OK. Many are not.

I thought it would be worth keeping track of which are which, so I've put together a list.

Read More

The Purpose of Hypothesis

What is Hypothesis for?

From the perspective of a user, the purpose of Hypothesis is to make it easier for you to write better tests.

From my perspective as the primary author, that is of course also a purpose of Hypothesis. I write a lot of code, it needs testing, and the idea of trying to do that without Hypothesis has become nearly unthinkable.

But, on a large scale, the true purpose of Hypothesis is to drag the world kicking and screaming into a new and terrifying age of high quality software.

Read More

The Encode/Decode invariant

One of the simplest types of invariant to find once you move past just fuzzing your code is asserting that two different operations should produce the same result, and one of the simplest instances of that is looking for encode/decode pairs. That is, you have some function that takes a value and encodes it as another value, and another that is supposed to reverse the process.

This is ripe for testing with Hypothesis because it has a natural completely defined specification: Encoding and then decoding should be exactly the same as doing nothing.

Let’s look at a concrete example.
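Roughly, such a test has this shape - here using the standard library’s json module as a stand-in codec (the post’s own worked example is different):

import json
from hypothesis import given, strategies as st

@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trips(d):
    # Decoding the encoded value should give back exactly what we started with.
    assert json.loads(json.dumps(d)) == d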

Read More

Anatomy of a Hypothesis Based Test

What happens when you run a test using Hypothesis? This article will help you understand.

Read More

The Economics of Software Correctness

You have probably never written a significant piece of correct software.

That’s not a value judgement. It’s certainly not a criticism of your competence. I can say with almost complete confidence that every non-trivial piece of software I have written contains at least one bug. You might have written small libraries that are essentially bug free, but the chance that you have written a non-trivial bug free program is tantamount to zero.

I don’t even mean this in some pedantic academic sense. I’m talking about behaviour where, if someone spotted it and pointed it out to you, you would probably admit that it’s a bug. It might even be a bug that you cared about.

Why is this?

Read More

Getting started with Hypothesis

Hypothesis will speed up your testing process and improve your software quality, but when first starting out people often struggle to figure out exactly how to use it.

Until you’re used to thinking in this style of testing, it’s not always obvious what the invariants of your code actually are, and people get stuck trying to come up with interesting ones to test.

Fortunately, there’s a simple invariant which every piece of software should satisfy, and which can be remarkably powerful as a way to uncover surprisingly deep bugs in your software.

Read More