Hypothesis

Test faster, fix more

python

Articles that use the Python version of Hypothesis. Many of these will illustrate more widely applicable principles.

Solving the Water Jug Problem from Die Hard 3 with TLA+ and Hypothesis

This post was originally published on the author’s personal site. It is reproduced here with his permission.

In the movie Die Hard with a Vengeance (aka Die Hard 3), there is this famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and 5 gallon jug, how do you measure out exactly 4 gallons of water?

(The video title is wrong. It's Die Hard 3.)
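The puzzle itself is small enough to check by brute force. The sketch below is a plain breadth-first search over jug states, not the TLA+ or Hypothesis approach the article describes; the names `CAPACITIES`, `moves` and `solve` are made up for illustration.

```python
from collections import deque

CAPACITIES = (3, 5)  # gallons: (small jug, big jug)
TARGET = 4           # gallons we want to end up with in the big jug


def moves(state):
    """Yield (description, next_state) for every legal move from `state`."""
    small, big = state
    cap_small, cap_big = CAPACITIES
    yield "fill small", (cap_small, big)
    yield "fill big", (small, cap_big)
    yield "empty small", (0, big)
    yield "empty big", (small, 0)
    poured = min(small, cap_big - big)
    yield "pour small into big", (small - poured, big + poured)
    poured = min(big, cap_small - small)
    yield "pour big into small", (small + poured, big - poured)


def solve(start=(0, 0)):
    """Breadth-first search for a shortest move sequence that reaches TARGET."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state[1] == TARGET:
            return path
        for name, nxt in moves(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None  # no reachable state has TARGET gallons in the big jug


solution = solve()
print(solution)
```

Because the search is breadth-first, the returned list is a shortest sequence of moves.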

Read More

Hypothesis for Computer Science Researchers

I’m in the process of trying to turn my work on Hypothesis into a PhD and I realised that I don’t have a good self-contained summary as to why researchers should care about it.

So this is that piece. I’ll try to give a from scratch introduction to the why and what of Hypothesis. It’s primarily intended for potential PhD supervisors, but should be of general interest as well (especially if you work in this field).

Why should I care about Hypothesis from a research point of view?

The short version:

Hypothesis takes an existing style of testing (property-based testing) that has proven highly effective in practice and makes it accessible to a much larger audience. It does so by taking several previously unconnected ideas from the research literature on testing and verification and combining them into a novel implementation that also works very well in practice.

The long version is the rest of this article.

Read More

How Hypothesis Works

Hypothesis has a very different underlying implementation to any other property-based testing system. As far as I know, it’s an entirely novel design that I invented.

Central to this design is the following feature set which every Hypothesis strategy supports automatically (the only way to break this is by having the data generated depend somehow on external global state):

  1. All generated examples can be safely mutated
  2. All generated examples can be saved to disk (this is important because Hypothesis remembers and replays previous failures).
  3. All generated examples can be shrunk
  4. All invariants that hold in generation must hold during shrinking (though the probability distribution can of course change, so things which are only supported with high probability may not be).

(Essentially no other property-based testing system manages even one of these claims, let alone all of them.)

The initial mechanisms for supporting this were fairly complicated, but after passing through a number of iterations I hit on a very powerful underlying design that unifies all of these features.

It’s still fairly complicated in implementation, but most of that is optimisations and things needed to make the core idea work. More importantly, the complexity is quite contained: A fairly small kernel handles all of the complexity, and there is little to no additional complexity (at least, compared to how it normally looks) in defining new strategies, etc.

This article will give a high level overview of that model and how it works.

Read More

Compositional shrinking

In my last article about shrinking, I discussed the problems with basing shrinking on the type of the values to be shrunk.

In writing it though I forgot that there was a halfway house which is also somewhat bad (but significantly less so) that you see in a couple of implementations.

This is when the shrinking is not type based, but still follows the classic shrinking API that takes a value and returns a lazy list of shrinks of that value. Examples of libraries that do this are theft and QuickTheories.
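To make the "classic shrinking API" concrete, here is a minimal sketch in Python. It is not the API of theft or QuickTheories; `shrink_int` and `shrink` are hypothetical names, and the candidate order is just one reasonable choice. The shrinker takes a value and lazily yields simpler candidates, and a driver loop greedily keeps the first candidate that still fails.

```python
def shrink_int(n):
    """Lazily yield candidates smaller than the non-negative integer `n`."""
    if n > 0:
        yield 0
    step = n // 2
    while step > 0:
        yield n - step
        step //= 2


def shrink(value, shrinker, still_fails):
    """Greedily replace `value` with the first shrink that still fails."""
    improved = True
    while improved:
        improved = False
        for candidate in shrinker(value):
            if still_fails(candidate):
                value = candidate
                improved = True
                break
    return value


# A pretend test that fails for every integer of at least 17.
minimal = shrink(1000, shrink_int, lambda n: n >= 17)
print(minimal)
```

Note that the shrinker only ever sees the value, not how it was generated, so any constraint the generator enforced has to be re-enforced by hand in the shrinker.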

This works reasonably well and solves the major problems with type directed shrinking, but it’s still somewhat fragile and importantly does not compose nearly as well as the approaches that Hypothesis or test.check take.

Ideally, as well as not being based on the types of the values being generated, shrinking should not be based on the actual values generated at all.

This may seem counter-intuitive, but it actually works pretty well.

Read More

Integrated vs type based shrinking

One of the big differences between Hypothesis and Haskell QuickCheck is how shrinking is handled.

Specifically, the way shrinking is handled in Haskell QuickCheck is bad and the way it works in Hypothesis (and also in test.check and EQC) is good. If you’re implementing a property based testing system, you should use the good way. If you’re using a property based testing system and it doesn’t use the good way, you need to know about this failure mode.

Unfortunately many (and possibly most) implementations of property based testing are based on Haskell’s QuickCheck and so make the same mistake.

Read More

3.6.0 Release of Hypothesis for Python

This is a release announcement for the 3.6.0 release of Hypothesis for Python. It’s a bit of an emergency release.

Hypothesis 3.5.0 inadvertently added a dependency on GPLed code (see below for how this happened) which this release removes. This means that if you are running Hypothesis 3.5.x then there is a good chance you are in violation of the GPL and you should update immediately.

Apologies for any inconvenience this may have caused.

Read More

Another invariant to test for encoders

The encode/decode invariant is one of the most important properties to know about for testing your code with Hypothesis or other property-based testing systems, because it captures a very common pattern and is very good at finding bugs.

But how do you go beyond it? If encoders are that common, surely there must be other things to test with them?

Read More

Seeking funding for deeper integration between Hypothesis and pytest

Probably the number one complaint I hear from Hypothesis users is that it “doesn’t work” with py.test fixtures. This isn’t true, but it does have one very specific limitation in how it works that annoys people: It only runs function scoped fixtures once for the entire test, not once per example. Because of the way people use function scoped fixtures for handling stateful things like databases, this often causes people problems.

I’ve been maintaining for a while that this is impossible to fix without some changes on the pytest end.

The good news is that this turns out not to be the case. After some conversations with pytest developers, some examining of other pytest plugins, and a bunch of prototyping, I’m pretty sure it’s possible. It’s just really annoying and a lot of work.

So that’s the good news. The bad news is that this isn’t going to happen without someone funding the work.

I’ve now spent about a week of fairly solid work on this, and what I’ve got is quite promising: The core objective of running pytest fixtures for every example works fairly seamlessly.

But it’s now in the long tail of problems that will need to be squashed before this counts as an actual production ready releasable piece of work. A number of things don’t work. For example, currently it’s running some module scoped fixtures once per example too, which it clearly shouldn’t be doing. It also currently has some pretty major performance problems that are bad enough that I would consider them release blocking.

As a result I’d estimate there’s easily another 2-3 weeks of work needed to get this out the door.

Which brings us to the crux of the matter: 2-3 additional weeks of free work on top of the one I’ve already done is 3-4 weeks more free work than I particularly want to do on this feature, so without sponsorship it’s not getting finished.

I typically charge £400/day for work on Hypothesis (this is heavily discounted off my normal rates), so 2-3 weeks comes to £4000 to £6000 (roughly $5000 to $8000) that has to come from somewhere.

I know there are a number of companies out there using pytest and Hypothesis together. I know from the amount of complaining about this integration that this is a real problem you’re experiencing. So, I think this money should come from those companies. Besides helping to support a tool you’ve already got a lot of value out of, this will expand the scope of what you can easily test with Hypothesis a lot, and will be hugely beneficial to your bug finding efforts.

This is a model that has worked well before with the funding of the recent statistics work by Jean-Louis Fuchs and Adfinis-SyGroup, and I’m confident it can work well again.

If you work at such a company and would like to talk about funding some or part of this development, please email me at drmaciver@hypothesis.works.

Read More

3.5.0 and 3.5.1 Releases of Hypothesis for Python

This is a combined release announcement for two releases. 3.5.0 was released yesterday, and 3.5.1 has been released today after some early bug reports in 3.5.0.

Changes

3.5.0 - 2016-09-22

This is a feature release.

  • fractions() and decimals() strategies now support min_value and max_value parameters. Thanks go to Anne Mulhern for the development of this feature.
  • The Hypothesis pytest plugin now supports a --hypothesis-show-statistics parameter that gives detailed statistics about the tests that were run. Huge thanks to Jean-Louis Fuchs and Adfinis-SyGroup for funding the development of this feature.
  • There is a new event() function that can be used to add custom statistics.
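A minimal sketch of how the bounded fractions() strategy and event() might be used together. The test name and the event labels are made up for illustration.

```python
from fractions import Fraction

from hypothesis import event, given, strategies as st


@given(st.fractions(min_value=0, max_value=1))
def test_fraction_stays_in_unit_interval(x):
    # event() attaches a label to this example; the labels are counted
    # and reported when statistics are shown.
    event("zero" if x == 0 else "non-zero")
    assert isinstance(x, Fraction)
    assert 0 <= x <= 1
```

Running this under pytest with --hypothesis-show-statistics should list how often each label was recorded.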

Additionally there have been some minor bug fixes:

  • In some cases Hypothesis should produce fewer duplicate examples (this will mostly only affect cases with a single parameter).
  • py.test command line parameters are now under an option group for Hypothesis (thanks to David Keijser for fixing this)
  • Hypothesis would previously error if you used function annotations on your tests under Python 3.4.
  • The repr of many strategies using lambdas has been improved to include the lambda body (this was previously supported in many but not all cases).

3.5.1 - 2016-09-23

This is a bug fix release.

  • Hypothesis now runs cleanly in -b and -bb modes, avoiding mixing bytes and unicode.
  • unittest.TestCase tests would not have shown up in the new statistics mode. Now they do.
  • Similarly, stateful tests would not have shown up in statistics and now they do.
  • Statistics now print with pytest node IDs (the names you’d get in pytest verbose mode).

Notes

Aside from the above changes, there are a couple big things behind the scenes of this release that make it a big deal.

The first is that the flagship chunk of work, statistics, is a long-standing want to have that has never quite been prioritised. By funding it, Jean-Louis and Adfinis-SyGroup successfully bumped it up to the top of the priority list, making it the first funded feature in Hypothesis for Python!

Another, less significant but still important, is that this release marks the first real break with an unofficial Hypothesis for Python policy of not having any dependencies other than the standard library and backports. This release adds a dependency on the uncompyle6 package. This may seem like an odd choice, but it was invaluable for fixing the repr behaviour, which in turn was really needed for providing good statistics for filter and recursive strategies.

Read More

Hypothesis vs. Eris

Eris is a library for property-based testing of PHP code, inspired by the mature frameworks that other languages provide like QuickCheck, Clojure’s test.check and of course Hypothesis.

Here is a side-by-side comparison of some basic and advanced features that have been implemented in both Hypothesis and Eris, which may help developers coming from either Python or PHP and looking at the other side.

Read More

How many times will Hypothesis run my test?

This is one of the most common first questions about Hypothesis.

People generally assume that the number of tests run will depend on the specific strategies used, but that’s generally not the case. Instead Hypothesis has a fairly fixed set of heuristics to determine how many times to run, which are mostly independent of the data being generated.

But how many runs is that?

The short answer is 200. Assuming you have a default configuration and everything is running smoothly, Hypothesis will run your test 200 times.

The longer answer is “It’s complicated”. It will depend on the exact behaviour of your tests and the value of some settings. In this article I’ll try to clear up some of the specifics of which settings affect the answer and how.

Read More

Generating recursive data

Sometimes you want to generate data which is recursive. That is, in order to draw some data you may need to draw some more data from the same strategy. For example we might want to generate a tree structure, or arbitrary JSON.

Hypothesis has the recursive function in the hypothesis.strategies module to make this easier to do. This is an article about how to use it.
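As a rough sketch of what that looks like for JSON-like data: recursive takes a base strategy for the leaves and a function that extends a strategy for children into a strategy for containers. The round-trip test and the choice of `max_leaves=10` are illustrative assumptions, and floats are left out to sidestep NaN.

```python
import json

from hypothesis import given, strategies as st

# Leaves: the non-container JSON values (floats omitted on purpose).
json_leaves = st.none() | st.booleans() | st.integers() | st.text()

# Containers are built out of whatever strategy generates their children.
json_values = st.recursive(
    json_leaves,
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=10,
)


@given(json_values)
def test_json_round_trip(value):
    assert json.loads(json.dumps(value)) == value
```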

Read More

How do I use pytest fixtures with Hypothesis?

pytest is a great test runner, and is the one Hypothesis itself uses for testing (though Hypothesis works fine with other test runners too).

It has a fairly elaborate fixture system, and people are often unsure how that interacts with Hypothesis. In this article we’ll go over the details of how to use the two together.

Read More

What is Hypothesis?

Hypothesis is a library designed to help you write what are called property-based tests.

The key idea of property based testing is that rather than writing a test that tests just a single scenario, you write tests that describe a range of scenarios and then let the computer explore the possibilities for you rather than having to hand-write every one yourself.

In order to contrast this with the sort of tests you might be used to, when talking about property-based testing we tend to describe the normal sort of testing as example-based testing.

Property-based testing can be significantly more powerful than example-based testing, because it automates the most time-consuming part of writing tests - coming up with the specific examples - and will usually perform it better than a human would. This allows you to focus on the parts that humans are better at - understanding the system, its range of acceptable behaviours, and how they might break.

You don’t need a library to do property-based testing. If you’ve ever written a test which generates some random data and uses it for testing, that’s a property-based test. But having a library can help you a lot, making your tests easier to write, more robust, and better at finding bugs. In the rest of this article we’ll see how.
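For instance, a hand-rolled property-based test using nothing but the standard library might look like the sketch below. The function name, value ranges and trial count are arbitrary choices for illustration.

```python
import random
from collections import Counter


def check_sorted_properties(trials=200, seed=0):
    """Check two properties of sorted() on randomly generated lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        result = sorted(xs)
        # Property 1: the output is in non-decreasing order.
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Property 2: the output has exactly the same elements as the input.
        assert Counter(result) == Counter(xs)
    return trials


check_sorted_properties()
```

What this version lacks is everything around the generation: when an assertion fails you get whatever large random list happened to trigger it, with no shrinking and no record of the failure for the next run.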

Read More

3.4.2 Release of Hypothesis for Python

This is a bug fix release, fixing a number of problems with the settings system:

  • Test functions defined using @given can now be called from other threads (Issue #337)
  • Attempting to delete a settings property would previously have silently done the wrong thing. Now it raises an AttributeError.
  • Creating a settings object with a custom database_file parameter was silently getting ignored and the default was being used instead. Now it’s not.

Notes

For historic reasons, _settings.py had been excluded from the requirement to have 100% branch coverage. Issue #337 would have been caught by a coverage requirement: the code in question simply couldn’t have worked, but it was not covered by any tests, so it slipped through.

As part of the general principle that bugs shouldn’t just be fixed without addressing the reason why the bug slipped through in the first place, I decided to impose the coverage requirements on _settings.py as well, which is how the other two bugs were found. Both of these had code that was never run during tests - in the case of the deletion bug there was a __delete__ descriptor method that was never being run, and in the case of the database_file one there was a check later that could never fire because the internal _database field was always being set in __init__.

I feel like this experiment thoroughly validated that 100% coverage is a useful thing to aim for. Unfortunately it also pointed out that the settings system is much more complicated than it needs to be. I’m unsure what to do about that - some of its functionality is a bit too baked into the public API to lightly change, and I don’t think it’s worth breaking that just to simplify the code.

Read More

Hypothesis for Python 3.4.1 Release

This is a bug fix release for a single bug:

  • On Windows when running two Hypothesis processes in parallel (e.g. using pytest-xdist) they could race with each other and one would raise an exception due to the non-atomic nature of file renaming on Windows and the fact that you can’t rename over an existing file. This is now fixed.

Notes

My tendency of doing immediate patch releases for bugs is unusual but generally seems to be appreciated. In this case this was a bug that was blocking a py.test merge.

I suspect this is not the last bug around atomic file creation on Windows. Cross platform atomic file creation seems to be a harder problem than I would have expected.

Read More

Calculating the mean of a list of numbers

Consider the following problem:

You have a list of floating point numbers. No nasty tricks - these aren’t NaN or Infinity, just normal “simple” floating point numbers.

Now: Calculate the mean (average). Can you do it?

It turns out this is a hard problem. It’s hard to get it even close to right. Let’s see why.
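One quick way to see the difficulty, as a sketch using only the standard library: a property you might reasonably expect is that the mean lies between the smallest and largest element, yet the obvious implementation breaks it for two perfectly finite floats.

```python
import math
import statistics
import sys

# Two ordinary finite floats; no NaN, no infinity.
xs = [sys.float_info.max, sys.float_info.max]

# The obvious implementation: the intermediate sum overflows.
naive = sum(xs) / len(xs)
print(naive)  # inf

# statistics.mean sums exactly before dividing, so it stays in range here.
careful = statistics.mean(xs)
print(min(xs) <= careful <= max(xs))
```

Overflow is only one of the ways this can go wrong; rounding in the sum can cause subtler violations.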

Read More

Testing as a Complete Specification

Sometimes you’re lucky enough to have problems where the result is completely specified by a few simple properties.

This doesn’t necessarily correspond to them being easy! Many such problems are actually extremely fiddly to implement.

It does mean that they’re easy to test though. Let’s see how.

Read More

Testing Configuration Parameters

A lot of applications end up growing a complex configuration system, with a large number of different knobs and dials you can turn to change behaviour. Some of these are just for performance tuning, some change operational concerns, some have other functions.

Testing these is tricky. As the number of parameters goes up, the number of possible configurations goes up exponentially. Manual testing of the different combinations quickly becomes completely unmanageable, not to mention extremely tedious.

Fortunately, this is somewhere where property-based testing in general and Hypothesis in particular can help a lot.

Read More

Evolving toward property-based testing with Hypothesis

Many people are quite comfortable writing ordinary unit tests, but feel a bit confused when they start with property-based testing. This post shows how two ordinary programmers started with normal Python unit tests and nudged them incrementally toward property-based tests, gaining many advantages on the way.

Read More

Testing Optimizers

We’ve previously looked into testing performance optimizations using Hypothesis, but this article is about something quite different: It’s about testing code that is designed to optimize a value. That is, you have some function and you want to find arguments to it that maximize (or minimize) its value.

As well as being an interesting subject in its own right, this will also nicely illustrate the use of Hypothesis’s data() functionality, which allows you to draw more data after the test has started, and will introduce a useful general property that can improve your testing in a much wider variety of settings.
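As a small illustration of data() on its own (this is not the optimizer example from the article, and the test name is made up): the index below can only be chosen once the list is known, so it is drawn inside the test body.

```python
from hypothesis import given, strategies as st


@given(st.lists(st.integers(), min_size=1), st.data())
def test_max_is_at_least_any_element(xs, data):
    best = max(xs)
    # The valid index range depends on xs, so draw it after the test starts.
    i = data.draw(st.integers(min_value=0, max_value=len(xs) - 1))
    assert best >= xs[i]
```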

Read More

Exploring Voting Systems with Hypothesis

Hypothesis is, of course, a library for writing tests.

But from an implementation point of view this is hardly noticeable. Really it’s a library for constructing and exploring data and using it to prove or disprove hypotheses about it. It then has a small testing library built on top of it.

It’s far more widely used as a testing library, and that’s really where the focus of its development lies, but with the find function you can use it just as well to explore your data interactively.
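A minimal sketch of find, unrelated to voting systems: give it a strategy and a predicate, and it returns a simple example that satisfies the predicate. The predicate here is an arbitrary choice for illustration.

```python
from hypothesis import find, strategies as st

# Ask for a simple list of integers whose sum is at least 10.
example = find(st.lists(st.integers()), lambda xs: sum(xs) >= 10)
print(example)
```

The result is shrunk in the same way a failing test case would be, so it tends to be a very small list.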

In this article we’ll go through an example of doing this, by using it to take a brief look at one of my other favourite subjects: Voting systems.

Read More

Announcing Hypothesis Legacy Support

For a brief period, Python 2.6 was supported in Hypothesis for Python. Because Python 2.6 has been end of lifed for some time, I decided this wasn’t a priority and support was dropped in Hypothesis 2.0.

I’ve now added it back, but under a more restrictive license.

If you want to use Hypothesis on Python 2.6, you can now do so by installing the hypothesislegacysupport package. This will allow you to run Hypothesis on Python 2.6.

Note that by default this is licensed under the GNU Affero General Public License 3.0. If you want to use it in commercial software you will likely want to buy a commercial license. Email us at licensing@hypothesis.works to discuss details.

Read More

Generating the right data

One thing that often causes people problems is figuring out how to generate the right data to fit their data model. You can start with just generating strings and integers, but eventually you want to be able to generate objects from your domain model. Hypothesis provides a lot of tools to help you build the data you want, but sometimes the choice can be a bit overwhelming.

Here’s a worked example to walk you through some of the details and help you get to grips with how to use them.

Read More

Testing performance optimizations

Once you’ve flushed out the basic crashing bugs in your code, you’re going to want to look for more interesting things to test.

The next easiest thing to test is code where you know what the right answer is for every input.

Obviously in theory you think you know what the right answer is - you can just run the code. That’s not very helpful though, as that’s the answer you’re trying to verify.

But sometimes there is more than one way to get the right answer, and you choose the one you run in production not because it gives a different answer but because it gives the same answer faster.
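In test form, the idea might be sketched like this: keep the slow, obviously correct version around as a reference and assert that the fast one always agrees with it. Both functions here are hypothetical stand-ins, not code from the article.

```python
from hypothesis import given, strategies as st


def slow_unique(xs):
    """Reference version: quadratic, but easy to see is correct."""
    out = []
    for x in xs:
        if x not in out:
            out.append(x)
    return out


def fast_unique(xs):
    """Optimized version: tracks seen values in a set for fast lookups."""
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


@given(st.lists(st.integers()))
def test_fast_matches_slow(xs):
    assert fast_unique(xs) == slow_unique(xs)
```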

Read More

Rule Based Stateful Testing

Hypothesis’s standard testing mechanisms are very good for testing things that can be considered direct functions of data. But suppose you have some complex stateful system or object that you want to test. How can you do that?

In this article we’ll see how to use Hypothesis’s rule based state machines to define tests that generate not just simple data, but entire programs using some stateful object. These will give the same level of boost to testing the behaviour of the object as you get to testing the data it accepts.
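A minimal sketch of what such a state machine can look like. The "system under test" here is just a collections.deque checked against a plain list model; the class and rule names are made up for illustration.

```python
from collections import deque

from hypothesis import strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine,
    invariant,
    precondition,
    rule,
    run_state_machine_as_test,
)


class QueueMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.queue = deque()  # stands in for the system under test
        self.model = []       # simple reference model

    @rule(value=st.integers())
    def enqueue(self, value):
        self.queue.append(value)
        self.model.append(value)

    @precondition(lambda self: len(self.model) > 0)
    @rule()
    def dequeue(self):
        assert self.queue.popleft() == self.model.pop(0)

    @invariant()
    def sizes_agree(self):
        assert len(self.queue) == len(self.model)


# Exposes the machine as a unittest-style test case for test runners.
TestQueue = QueueMachine.TestCase

if __name__ == "__main__":
    run_state_machine_as_test(QueueMachine)
```

Each run executes a generated sequence of enqueue and dequeue steps, and a failure is reported as a shrunk sequence of steps that reproduces it.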

Read More

The Encode/Decode invariant

One of the simplest types of invariant to find once you move past just fuzzing your code is asserting that two different operations should produce the same result, and one of the simplest instances of that is looking for encode/decode pairs. That is, you have some function that takes a value and encodes it as another value, and another that is supposed to reverse the process.

This is ripe for testing with Hypothesis because it has a natural completely defined specification: Encoding and then decoding should be exactly the same as doing nothing.

Let’s look at a concrete example.
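A minimal sketch of the invariant, using a hypothetical run-length encoder rather than the code from the full article:

```python
from hypothesis import given, strategies as st


def encode(text):
    """Run-length encode a string into [character, count] pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs


def decode(runs):
    """Reverse encode(): expand each [character, count] pair."""
    return "".join(ch * count for ch, count in runs)


@given(st.text())
def test_decode_inverts_encode(text):
    assert decode(encode(text)) == text
```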

Read More

Anatomy of a Hypothesis Based Test

What happens when you run a test using Hypothesis? This article will help you understand.

Read More

Getting started with Hypothesis

Hypothesis will speed up your testing process and improve your software quality, but when first starting out people often struggle to figure out exactly how to use it.

Until you’re used to thinking in this style of testing, it’s not always obvious what the invariants of your code actually are, and people get stuck trying to come up with interesting ones to test.

Fortunately, there’s a simple invariant which every piece of software should satisfy, and which can be remarkably powerful as a way to uncover surprisingly deep bugs in your software.

Read More