A lot of applications end up growing a complex configuration system, with a large number of different knobs and dials you can turn to change behaviour. Some of these are just for performance tuning, some change operational concerns, some have other functions.

Testing these is tricky. As the number of parameters goes up, the number of possible configurations goes up exponentially. Manual testing of the different combinations quickly becomes completely unmanageable, not to mention extremely tedious.

Fortunately, this is an area where property-based testing in general and Hypothesis in particular can help a lot.

Configuration parameters almost all have one thing in common: For the vast majority of things, they shouldn’t change the behaviour. A configuration parameter is rarely going to be a complete reskin of your application.

This means that they are relatively easy to test with property-based testing. You take an existing test - either one that is already using Hypothesis or a normal example-based test - and you vary some configuration parameters and make sure the test still passes.
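To make that concrete, here's a minimal sketch of the pattern on a toy function. Both chunked_sum and its chunk_size knob are made up for illustration; the point is only that a tuning parameter is drawn by Hypothesis alongside the input and the result is expected to stay the same.

```python
from hypothesis import given
import hypothesis.strategies as st


def chunked_sum(values, chunk_size=1024):
    """Sum values in batches. chunk_size is a hypothetical tuning knob
    that should affect performance only, never the result."""
    total = 0
    for start in range(0, len(values), chunk_size):
        total += sum(values[start:start + chunk_size])
    return total


# Vary the configuration parameter together with the input data and
# check that the behaviour we care about does not change.
@given(values=st.lists(st.integers()), chunk_size=st.integers(1, 64))
def test_chunk_size_does_not_change_result(values, chunk_size):
    assert chunked_sum(values, chunk_size=chunk_size) == sum(values)
```

The test body is the same assertion you would write for a single fixed chunk_size; the only change is that the knob is now an argument supplied by a strategy.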

This turns out to be remarkably effective. Here’s an example where I used this technique and found some bugs in the Argon2 password hashing library, using Hynek’s CFFI based bindings.

The idea of password hashing is straightforward: Given a password, you can create a hash against which the password can be verified without ever storing the password (after all, you’re not storing passwords in plain text on your servers, right?). Although straightforward to describe, there’s a lot of difficulty in making a good implementation of this. Argon2 is a fairly recent one which won the Password Hashing Competition so should be fairly good.
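If the hash-then-verify idea is unfamiliar, here's a rough sketch of its shape using only the standard library. Note that this uses PBKDF2 from hashlib rather than Argon2, and the function names hash_password and verify_password are invented for this illustration; it is not how the argon2 bindings work internally.

```python
import hashlib
import hmac
import os


def hash_password(password, iterations=100_000):
    """Return (salt, iterations, digest); the password itself is not kept."""
    salt = os.urandom(16)  # fresh random salt for every hash
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(stored, password):
    """Recompute the digest with the stored salt and compare."""
    salt, iterations, digest = stored
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    # compare_digest avoids leaking information through timing
    return hmac.compare_digest(candidate, digest)
```

The property we care about is simply that a password verifies against its own hash and a different password does not.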

We can verify that hashing works correctly fairly immediately using Hypothesis:

from argon2 import PasswordHasher

from hypothesis import given
import hypothesis.strategies as st

class TestPasswordHasherWithHypothesis(object):
    # Hypothesis supplies arbitrary text values for password
    @given(password=st.text())
    def test_a_password_verifies(self, password):
        ph = PasswordHasher()
        hash = ph.hash(password)
        assert ph.verify(hash, password)

This takes an arbitrary text password, hashes it and verifies it against the generated hash.

This passes. So far, so good.

But as you probably expected from its context here, argon2 has quite a lot of different parameters to it. We can expand the test to vary them and see what happens:

from argon2 import PasswordHasher

from hypothesis import given, assume
import hypothesis.strategies as st

class TestPasswordHasherWithHypothesis(object):
    @given(
        password=st.text(),
        time_cost=st.integers(1, 10),
        parallelism=st.integers(1, 10),
        memory_cost=st.integers(8, 2048),
        hash_len=st.integers(12, 1000),
        salt_len=st.integers(8, 1000),
    )
    def test_a_password_verifies(
        self, password, time_cost, parallelism, memory_cost, hash_len, salt_len,
    ):
        # Discard combinations that argon2 rejects as invalid
        assume(parallelism * 8 <= memory_cost)
        ph = PasswordHasher(
            time_cost=time_cost, parallelism=parallelism,
            memory_cost=memory_cost,
            hash_len=hash_len, salt_len=salt_len,
        )
        hash = ph.hash(password)
        assert ph.verify(hash, password)

These parameters are mostly intended to vary the difficulty of calculating the hash. Honestly I’m not entirely sure what all of them do. Fortunately for the purposes of writing this test, understanding is optional.

In terms of how I chose the specific strategies to get there, I just picked some plausible looking parameter ranges and adjusted them until I wasn’t getting validation errors (I did look for documentation, I promise). The assume() call comes from reading the argon2 source to try to find out what the valid range of parallelism was.
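An alternative to the assume() call is to build the constraint into the strategy itself, so that every generated example is valid and nothing gets discarded. Here's a sketch using st.composite; the helper name cost_parameters is made up, and the constraint is just the parallelism * 8 <= memory_cost condition from the test above.

```python
from hypothesis import given
import hypothesis.strategies as st


@st.composite
def cost_parameters(draw):
    """Hypothetical helper: draw parallelism first, then a memory_cost
    that is guaranteed to satisfy parallelism * 8 <= memory_cost."""
    parallelism = draw(st.integers(1, 10))
    memory_cost = draw(st.integers(8 * parallelism, 2048))
    return parallelism, memory_cost


# Sanity check that the strategy only produces valid combinations.
@given(costs=cost_parameters())
def test_generated_costs_satisfy_constraint(costs):
    parallelism, memory_cost = costs
    assert parallelism * 8 <= memory_cost
    assert 8 <= memory_cost <= 2048
```

For a constraint this loose, assume() is perfectly fine. Building it into the strategy tends to matter more when most random combinations would otherwise be rejected.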

This ended up finding two bugs, which I duly reported to Hynek, but they actually turned out to be upstream bugs!

In both cases, a password would no longer validate against itself:

Falsifying example: test_a_password_verifies(
    password='', time_cost=1, parallelism=1, memory_cost=8, hash_len=4,
)

Falsifying example: test_a_password_verifies(
    password='', time_cost=1, parallelism=1, memory_cost=8,
    hash_len=513, salt_len=8
)

(I found the second one by manually determining that the first bug happened whenever hash_len < 12 and ruling that case out.)

One interesting thing about both of these bugs is that they’re actually not bugs in the Python library but are both upstream bugs in the C library it wraps. I hadn’t set out to do that when I wrote these tests, but it nicely validates that Hypothesis is rather useful for testing C libraries as well as Python, given how easy they are to bind to with CFFI.