Hypothesis

Test faster, fix more

Happy new year everybody!

In this post I’d like to tell you about one of the nice things that happened in 2017: the Hypothesis work that was funded by Smarkets. Smarkets are an exchange for peer-to-peer trading of bets but, more importantly for us, they are fairly heavy users of Hypothesis for the Python part of their stack.

Smarkets approached me a while back to talk about possibly funding some Hypothesis development. We talked a bit about what their biggest pain points with Hypothesis were, and it emerged that they were having a lot of trouble with Hypothesis performance.

We talked a bit about how to speed Hypothesis up, and I suggested some diagnostic tests they could do to see where the problem was. It fairly quickly emerged that they mostly weren’t having trouble with Hypothesis performance after all - Hypothesis was indeed much slower than it should have been in their use case, but more than an order of magnitude more time was being spent in their test code than in Hypothesis itself.

So on further discussion we decided that actually their big problem was not the performance of Hypothesis per se, but instead the legibility of Hypothesis performance problems - when tests using Hypothesis were slow, it was non-obvious why that might be the case, and it might be extremely difficult to reproduce the problem.

This is the sort of problem where it’s really useful to have user feedback and funding, because it’s more or less a non-problem for me and - to a lesser extent - for anyone who already works on Hypothesis. Because we’ve got much deeper knowledge of the internals and the failure modes, we’re largely just used to working around these issues. Better feedback from Hypothesis would still be helpful to us, but it’s not essential.

So, given that in the normal course of things Hypothesis development is mostly driven by what we feel like working on, this is really the sort of work that will only happen with a source of external funding for it. Thus it’s really great that Smarkets were willing to step up and fund it!

After some discussion we ended up settling on four features that would significantly improve the situation for them:

Identification of examples causing slow tests

This was the introduction of the deadline feature, which causes Hypothesis to treat slow tests as failures - when a deadline is set, any test execution that takes longer than that deadline (not counting data generation) raises a DeadlineExceeded error.

This is a bit of a blunt instrument, but it is very effective at getting test runtime under control!
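As an illustration, here is a minimal sketch of what using it looks like (the 200 millisecond value and the test itself are invented for this example):

    from hypothesis import given, settings, strategies as st

    # The deadline here is 200 milliseconds - an arbitrary choice for this
    # sketch, not a recommendation.
    @settings(deadline=200)
    @given(st.lists(st.integers()))
    def test_sorting_is_fast_enough(ls):
        # If this body takes longer than the deadline (data generation time
        # is not counted), Hypothesis raises DeadlineExceeded and reports
        # the offending example as a failure.
        assert sorted(sorted(ls)) == sorted(ls)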

There turned out to be some amusing complications in developing it, so this feature was spread over a number of releases as we found and worked out its various kinks (3.27.0, 3.31.1, 3.38.2).

One of the interesting problems we found was that deadlines have a threshold problem - because the shrinking process tends to find examples which are just on the cusp of failure, often when you rerun a fully shrunk example it doesn’t fail!

I went back and forth on this for a while, but in the end the best solution turned out to be a simple one - raise the deadline during example generation and shrinking, then replay with the actual set deadline.

This does mean that tests that are right on the cusp of being too slow may pass artificially, but that’s substantially better than introducing flaky failures.

On top of this we also needed to make this feature play well with inline data generation - the complicating factor was that data generation is much faster when replaying examples or shrinking than it is during initial generation (the generation logic is rather too complicated; I have some long-term plans to simplify it, which would make this difference largely go away). Fortunately, the work done for the next feature made this easy to do.
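For reference, “inline data generation” here means drawing values inside the test body via the data() strategy, along these lines (the test logic is made up for this sketch):

    from hypothesis import given, strategies as st

    @given(st.data())
    def test_removing_an_element_shortens_the_list(data):
        # Values are drawn inside the test body, so Hypothesis only learns
        # about them while the test is actually running.
        ls = data.draw(st.lists(st.integers(), min_size=1))
        x = data.draw(st.sampled_from(ls))
        original_length = len(ls)
        ls.remove(x)
        assert len(ls) == original_length - 1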

Breakdown of generation time in statistics

This was a fairly simple change, prompted by the initial confusion we had in diagnosing Smarkets’ test problems: if you can’t tell the difference between slow data generation and slow tests, your initial guesses about performance may be very misleading! So this change updated the statistics reporting system to report what fraction of time is spent in data generation. If you run tests with statistics you’ll now see a line like the following:

- Fraction of time spent in data generation: ~ 12%
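(For anyone who hasn’t used the statistics reporting before: with the pytest plugin it’s enabled by the --hypothesis-show-statistics flag, along the lines of the following - the test path here is just a placeholder.)

    python -m pytest tests/test_example.py --hypothesis-show-statistics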

A small change, but a very helpful one! This landed in Hypothesis 3.38.4.

Health check overhaul

Hypothesis has had a health check system for a while. Its general goal is to suggest that you might not want to do that thing you’re doing - roughly analogous to compiler warnings. It’s useful for guiding you into correct use of Hypothesis and helping you avoid things that might degrade the quality of your testing.

It’s historically had some problems: in particular, the main health checks did not actually run your tests! They just ran the data generation with the test stubbed out. This meant that you could very easily accidentally bypass the health checks by e.g. using inline data generation, or doing your filtering with assume.
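For example, a test along these lines (invented for this sketch) does all of its filtering with assume, which the old health checks would never have seen:

    from hypothesis import assume, given, strategies as st

    @given(st.integers())
    def test_even_numbers_round_trip_through_strings(n):
        # The filtering happens inside the test body, which the old health
        # checks never executed, so heavy filtering like this could slip
        # past them unnoticed.
        assume(n % 2 == 0)
        assert int(str(n)) == n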

It also had the problem that it wasn’t running the real data generation algorithm but instead an approximation to it. This meant that things that would work fine in practice would sometimes fail health checks.

This piece of work was an overhaul of the health check system to solve these problems and to expand the scope of the problems it could find.

It ended up working very well - so well, in fact, that it found some problems in Hypothesis’s built-in library of strategies!

It was split across a number of releases:

  • In 3.37.0 we deprecated a number of existing health checks that no longer did anything useful.
  • In 3.38.0 I overhauled the health check system to be based on actual test execution, solving its existing limitations.
  • In 3.39.0 I added a new health check that tests whether the smallest example for the test is too large to allow reasonable testing - accidentally generating very large examples is a common source of performance bugs (see the sketch after this list).
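As a sketch of the sort of thing that last health check is aimed at (the strategy and test are invented for this example), consider a strategy whose smallest possible example is already enormous:

    from hypothesis import given, strategies as st

    # Even the minimal example is a 10,000 element list, so every test case
    # is expensive and shrinking can't make failures much smaller.
    @given(st.lists(st.integers(), min_size=10000))
    def test_sum_is_order_independent(ls):
        assert sum(ls) == sum(reversed(ls))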

The health check in 3.39.0 turned out to catch a major problem in Hypothesis’s handling of blacklist_characters and some regular expression constructs, so prior to that we had to release 3.38.8 to fix those!

Overall I’m much happier with the new health check system and think it does a much better job of shaping user behaviour to get better results out of Hypothesis.

Printing reproduction steps

Historically output from Hypothesis has looked something like this:

Falsifying example: test_is_minimal(ls=[0], v=1)

Or, if you had a failed health check, like the following:

hypothesis.errors.FailedHealthCheck: It looks like your strategy is filtering out a lot of data. Health check found 50 filtered examples but only 0 good ones. This will make your tests much slower, and also will probably distort the data generation quite a lot. You should adapt your strategy to filter less. This can also be caused by a low max_leaves parameter in recursive() calls.

See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.filter_too_much to the suppress_health_check settings for this test.

This is fine if you’re running the tests locally, but if your failure is on CI this can be difficult to reproduce. If you got a falsifying example, you’re only able to reproduce it if all of your arguments have sensible reprs (which may not be the case even if you restrict yourself to Hypothesis’s built-in strategies - e.g. using inline data generation prevents it!). If you got a health check failure, there’s nothing that helps you reproduce it at all!

So the proposed feature for this was to print out the random seed that produced the failure, along with instructions for reusing it:

You can add @seed(302934307671667531413257853548643485645) to this test or run pytest with --hypothesis-seed=302934307671667531413257853548643485645 to reproduce this failure.
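Applying that suggestion looks something like this (the test is invented for this sketch; the seed value is whatever the failure output printed):

    from hypothesis import given, seed, strategies as st

    # The seed pins Hypothesis's random state to the one that produced the
    # original failure, so the same examples are generated again.
    @seed(302934307671667531413257853548643485645)
    @given(st.lists(st.integers()))
    def test_reversing_twice_is_a_noop(ls):
        assert list(reversed(list(reversed(ls)))) == ls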

Seed printing was a great idea, and seemed to work out pretty well when we introduced it in 3.30.0, but on heavier use in the wild it turned out to have some fairly major problems!

The big issue is that in order to reproduce Hypothesis’s behaviour on a given run you need to know not just the random seed that got you there, but also the state of Hypothesis’s example database! Hypothesis maintains a cache of many of the previously run examples, and uses it to inform the testing by replaying test cases that e.g. failed the last time they were run, or covered some hard-to-reach line in the code. Even for examples that don’t come from the database, Hypothesis under the hood is a mutation-based fuzzer, so all the examples it finds will depend on the examples it loaded.

The initial solution to this (3.40.0) was just to turn off seed printing when its output would be misleading. This worked, but was fairly non-ideal even just for Smarkets - they do use the database in their CI, so the seed would frequently not be printed at all.

After some discussion, I decided that the feature wasn’t nearly as useful as intended, so I threw in an extra freebie feature to make up the gap in functionality: @reproduce_failure. This uses Hypothesis’s internal format to replicate the functionality of the database in a way that is easy to copy and paste into your code. It took some careful designing to make it usable - my big concern was that people would leave this in their code, blocking future upgrades to Hypothesis - but in the end I’m reasonably happy with the result.
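In use it looks roughly like this (the version string and payload below are placeholders - Hypothesis prints the real values alongside the failure, and they are only meaningful for that exact Hypothesis version):

    from hypothesis import given, reproduce_failure, strategies as st

    # Placeholder version and payload; copy the real ones from Hypothesis's
    # output. The decorator replays that exact example, and is meant to be
    # removed again once the bug is fixed.
    @reproduce_failure('3.40.0', b'AAAA...')
    @given(st.lists(st.integers()))
    def test_sum_of_squares_is_nonnegative(ls):
        assert sum(x * x for x in ls) >= 0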

As a bonus, the work here allowed me to sort out one big concern about seed printing: We still needed a way to reproduce health check failures when the database was being used. The solution was in the end easy: We just don’t use the examples from the database in the part where the main health checks are running. This still leaves a few health checks which could theoretically be hard to reproduce (the main one is the hung test health check, but that one tends to reproduce fairly reliably on any seed if you have deadlines on).

So this leaves us with a state where health check failures will suggest @seed and example failures will suggest @reproduce_failure where necessary (the linked documentation spells this out in more detail).

Ten releases later

In the end the Smarkets work came to a total of exactly 10 releases, some larger than others.

The end result has been very beneficial, and not just to Smarkets! I’ve had several users report back improvements to their tests as a result of the new health checks, and I’ve personally found the @reproduce_failure feature remarkably useful.

I’m very happy to have done this work, and am grateful to Smarkets for funding it.

I think this sort of arrangement, where commercial users fund the “boring” features that are very useful for people using the tool at scale but that maintainers are unlikely to work on under their own initiative, is a very good one, and I hope we’ll do more of it in future.

As a bonus, Smarkets kindly agreed to put online the talk I gave them about Hypothesis (largely intended to raise awareness of it and property-based testing among their teams who aren’t using it yet). If you want to learn more about some of the philosophy and practice behind using this sort of testing, or want something to send to people who aren’t convinced yet, you can watch it here.