This post was originally published on the author’s personal site. It is reproduced here with his permission.

In the movie Die Hard with a Vengeance (aka Die Hard 3), there is a famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: given a 3-gallon jug and a 5-gallon jug, how do you measure out exactly 4 gallons of water?

Apparently, you can solve this problem using a formal specification language like TLA+. I don’t know much about this topic, but it appears that a formal specification language is much like a programming language in that it lets you describe the behavior of a system. However, it’s much more rigorous and builds on mathematical techniques that enable you to reason more effectively about the behavior of the system you’re describing than you can with a typical programming language.

In a recent discussion on Hacker News about TLA+, I came across this comment, which linked to a fun and simple example showing how to solve the Die Hard 3 problem with TLA+. I had to watch the first two lectures from Leslie Lamport’s video course on TLA+ to understand the example well, but once I did, I was reminded of the idea of property-based testing and, specifically, Hypothesis.

So what’s property-based testing? It’s a powerful way of testing your logic by giving your machine a high-level description of how your code should behave and letting it generate test cases automatically to see if that description holds. Compare that to traditional unit testing, for example, where you manually code up specific inputs and outputs and make sure they match.
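
To make that contrast concrete, here is a minimal sketch using Hypothesis's given decorator. The function under test (Python's built-in sorted) and the two properties are illustrative choices made for this example; they are not part of the Die Hard problem.

```python
from collections import Counter

from hypothesis import given, strategies as st


def test_sorted_example():
    # Traditional unit test: one hand-picked input, one expected output.
    assert sorted([3, 1, 2]) == [1, 2, 3]


@given(st.lists(st.integers()))
def test_sorted_is_ordered_permutation(xs):
    # Property-based test: Hypothesis generates many lists for us.
    result = sorted(xs)
    # Property 1: every element is <= its successor.
    assert all(a <= b for a, b in zip(result, result[1:]))
    # Property 2: sorting only rearranges elements; none are added or lost.
    assert Counter(result) == Counter(xs)
```

The first test pins down a single case. The second describes what any correct result must look like and leaves the choice of inputs to Hypothesis.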

How not to Die Hard with Hypothesis

Hypothesis has an excellent implementation of property-based testing for Python. I thought to myself: I wonder if you can write that Die Hard specification using Hypothesis? As it turns out, Hypothesis supports stateful testing, and I was able to port the TLA+ example to Python pretty easily:

from hypothesis import note, settings
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

class DieHardProblem(RuleBasedStateMachine):
    # Initial state: both jugs are empty.
    small = 0
    big = 0

    # Each @rule() is a transition Hypothesis may take at any step.
    @rule()
    def fill_small(self):
        self.small = 3

    @rule()
    def fill_big(self):
        self.big = 5

    @rule()
    def empty_small(self):
        self.small = 0

    @rule()
    def empty_big(self):
        self.big = 0

    @rule()
    def pour_small_into_big(self):
        old_big = self.big
        self.big = min(5, self.big + self.small)
        self.small = self.small - (self.big - old_big)

    @rule()
    def pour_big_into_small(self):
        old_small = self.small
        self.small = min(3, self.small + self.big)
        self.big = self.big - (self.small - old_small)

    # Each @invariant() is checked after every step.
    @invariant()
    def physics_of_jugs(self):
        assert 0 <= self.small <= 3
        assert 0 <= self.big <= 5

    @invariant()
    def die_hard_problem_not_solved(self):
        note("> small: {s} big: {b}".format(s=self.small, b=self.big))
        assert self.big != 4

# The default number of examples is sometimes not enough for Hypothesis
# to find a falsifying example, so raise it.
DieHardTest = DieHardProblem.TestCase
DieHardTest.settings = settings(max_examples=2000)

Calling pytest on this file quickly digs up a solution:

self = DieHardProblem({})

    def die_hard_problem_not_solved(self):
        note("> small: {s} big: {b}".format(s=self.small, b=self.big))
>       assert self.big != 4
E       AssertionError: assert 4 != 4
E        +  where 4 = DieHardProblem({}).big

how-not-to-die-hard-with-hypothesis.py:17: AssertionError
----------------------------- Hypothesis -----------------------------
> small: 0 big: 0
Step #1: fill_big()
> small: 0 big: 5
Step #2: pour_big_into_small()
> small: 3 big: 2
Step #3: empty_small()
> small: 0 big: 2
Step #4: pour_big_into_small()
> small: 2 big: 0
Step #5: fill_big()
> small: 2 big: 5
Step #6: pour_big_into_small()
> small: 3 big: 4
====================== 1 failed in 0.22 seconds ======================

What’s Going on Here

The code and test output are pretty self-explanatory, but here’s a recap of what’s going on:

We’re defining a state machine. That state machine has an initial state (two empty jugs) along with some possible transitions. Those transitions are captured with the rule() decorator. The initial state and possible transitions together define how our system works.

Next we define invariants, which are properties that must always hold true in our system. Our first invariant, physics_of_jugs, says that the jugs must hold an amount of water that makes sense. For example, the big jug can never hold more than 5 gallons of water.
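
Because there are only 4 x 6 = 24 possible jug states, an invariant like this can also be checked by hand over every state. The sketch below does that for the two pour transitions in plain Python, with no Hypothesis involved. The function names are made up for this illustration, and the "no water is created or lost" check is an extra property added here, not one from the original spec.

```python
SMALL_CAP, BIG_CAP = 3, 5


def pour_small_into_big(small, big):
    # Same arithmetic as the rule in the state machine, as a pure function.
    new_big = min(BIG_CAP, big + small)
    return small - (new_big - big), new_big


def pour_big_into_small(small, big):
    new_small = min(SMALL_CAP, small + big)
    return new_small, big - (new_small - small)


def check_pour_invariants():
    """Check both pours from every physically valid state; return the count."""
    checked = 0
    for small in range(SMALL_CAP + 1):
        for big in range(BIG_CAP + 1):
            for pour in (pour_small_into_big, pour_big_into_small):
                new_small, new_big = pour(small, big)
                # physics_of_jugs: neither jug goes negative or overflows.
                assert 0 <= new_small <= SMALL_CAP
                assert 0 <= new_big <= BIG_CAP
                # Extra property (an assumption of this sketch):
                # pouring never creates or destroys water.
                assert new_small + new_big == small + big
                checked += 1
    return checked
```

Calling check_pour_invariants() runs 48 checks (24 states times 2 pours) and raises an AssertionError if any of them fails.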

Our next invariant, die_hard_problem_not_solved, is where it gets interesting. Here we’re declaring that the problem of getting exactly 4 gallons in the big jug cannot be solved. Since Hypothesis’s job is to test our logic for bugs, it will give our state machine a thorough shakedown and see if we ever violate our invariants. In other words, we’re basically goading Hypothesis into solving the Die Hard problem for us.
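
The same "find a state that breaks the invariant" idea can be sketched without any testing library, since the state space here is tiny. The code below is a plain breadth-first search over the state machine. To be clear, this is a swapped-in technique for illustration and not how Hypothesis goes about it; successors and shortest_violation are hypothetical names.

```python
from collections import deque

SMALL_CAP, BIG_CAP = 3, 5


def successors(state):
    """Yield (action_name, next_state) for every rule of the state machine."""
    small, big = state
    yield "fill_small", (SMALL_CAP, big)
    yield "fill_big", (small, BIG_CAP)
    yield "empty_small", (0, big)
    yield "empty_big", (small, 0)
    poured = min(small, BIG_CAP - big)
    yield "pour_small_into_big", (small - poured, big + poured)
    poured = min(big, SMALL_CAP - small)
    yield "pour_big_into_small", (small + poured, big - poured)


def shortest_violation(start=(0, 0)):
    """Return the shortest list of actions that reaches big == 4, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state[1] == 4:  # the "not solved" invariant is violated here
            return path
        for action, next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, path + [action]))
    return None


print(shortest_violation())
# ['fill_big', 'pour_big_into_small', 'empty_small',
#  'pour_big_into_small', 'fill_big', 'pour_big_into_small']
```

The search returns the same six steps that appear in the Hypothesis output above. Exhaustive search only works because this problem has so few states; the appeal of a tool like Hypothesis is that you write the rules and invariants once and do not have to write the search yourself.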

I’m not entirely clear on how Hypothesis does its work, but I know the basic summary is this: It takes the program properties we’ve specified (including things like rules, invariants, data types, and function signatures) and generates data or actions to probe the behavior of our program. If Hypothesis finds a piece of data or a sequence of actions that gets our program to violate its stated properties, it tries to whittle that down to a minimal falsifying example, i.e. something that exposes the same problem in as few steps as possible. This makes it much easier for you to understand how Hypothesis broke your code.
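
The whittling-down step can be illustrated with a deliberately naive shrinker: take a failing sequence of actions and keep deleting single steps for as long as the failure still reproduces. This is only a toy under stated assumptions. Hypothesis's real shrinker is considerably more sophisticated, and step, violates, and shrink are hypothetical helpers written for this sketch.

```python
SMALL_CAP, BIG_CAP = 3, 5


def step(state, action):
    """Apply one named action to a (small, big) state."""
    small, big = state
    if action == "fill_small":
        return SMALL_CAP, big
    if action == "fill_big":
        return small, BIG_CAP
    if action == "empty_small":
        return 0, big
    if action == "empty_big":
        return small, 0
    if action == "pour_small_into_big":
        poured = min(small, BIG_CAP - big)
        return small - poured, big + poured
    if action == "pour_big_into_small":
        poured = min(big, SMALL_CAP - small)
        return small + poured, big - poured
    raise ValueError("unknown action: {}".format(action))


def violates(actions):
    """Replay actions from two empty jugs; True if big ever holds 4."""
    state = (0, 0)
    for action in actions:
        state = step(state, action)
        if state[1] == 4:
            return True
    return False


def shrink(actions):
    """Greedily delete single steps while the failure still reproduces."""
    actions = list(actions)
    improved = True
    while improved:
        improved = False
        for i in range(len(actions)):
            candidate = actions[:i] + actions[i + 1:]
            if violates(candidate):
                actions = candidate
                improved = True
                break
    return actions


# A failing sequence padded with redundant steps.
noisy = [
    "fill_small", "empty_small", "fill_big", "pour_big_into_small",
    "empty_small", "empty_small", "pour_big_into_small", "fill_big",
    "fill_big", "pour_big_into_small", "empty_big",
]
print(shrink(noisy))
# ['fill_big', 'pour_big_into_small', 'empty_small',
#  'pour_big_into_small', 'fill_big', 'pour_big_into_small']
```

Starting from eleven steps, the greedy deletion ends at the six-step sequence. In general this approach only guarantees that no single step can be removed, which is weaker than finding a globally shortest example.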

Hypothesis’s output above tells us that it was able to violate the die_hard_problem_not_solved invariant and provides us with a minimal reproduction showing exactly how it did so. That reproduction is our solution to the problem. It’s also how McClane and Carver did it in the movie!

Final Thoughts

All in all, I was pretty impressed with how straightforward it was to translate the TLA+ example into Python using Hypothesis. And when Hypothesis spit out the solution, I couldn’t help but smile. It’s pretty cool to see your computer essentially generate a program that solves a problem for you. And the Python version of the Die Hard “spec” is not much more verbose than the original in TLA+, though TLA+’s notation for current vs. next value (e.g. small vs. small') is elegant and cuts out the need to have variables like old_small and old_big.
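
One way to get closer to that current-versus-next style in Python is to treat the state as an immutable value and have each action return the next state, so nothing is overwritten halfway through a computation. The sketch below is an assumption-laden illustration rather than a rewrite of the Hypothesis test: Jugs and the two functions are hypothetical names, and the pours are expressed via the amount poured instead of old_small and old_big.

```python
from typing import NamedTuple


class Jugs(NamedTuple):
    small: int = 0
    big: int = 0


def pour_big_into_small(jugs: Jugs) -> Jugs:
    # `jugs` plays the role of the unprimed (current) variables;
    # the returned value plays the role of the primed (next) ones.
    poured = min(jugs.big, 3 - jugs.small)
    return Jugs(small=jugs.small + poured, big=jugs.big - poured)


def pour_small_into_big(jugs: Jugs) -> Jugs:
    poured = min(jugs.small, 5 - jugs.big)
    return Jugs(small=jugs.small - poured, big=jugs.big + poured)


print(pour_big_into_small(Jugs(small=0, big=5)))  # Jugs(small=3, big=2)
```

Because the current state is never mutated, both the old and new values are available at once, which is what makes the temporary variables unnecessary.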

I don’t know how Hypothesis compares to TLA+ in a general sense. I’ve only just started to learn about property-based testing and TLA+, and I wonder if they have a place in the work that I do these days, which is mostly Data Engineering-type stuff. Still, I found this little exercise fun, and I hope you learned something interesting from it.

Thanks to Julia, Dan, Laura, Anjana, and Cip for reading drafts of this post.