Verified binary multiplication for GHASH
Exploring formal verification (part 3)
Previously I introduced some very basic Cryptol and SAWScript, and explained how to reason about the correctness of constant-time integer multiplication written in C/C++.
In this post I will touch on using formal verification as part of the code review process, in particular show how, by using the Software Analysis Workbench, we saved ourselves hours of debugging when rewriting the GHASH implementation for NSS.
What’s GHASH again?
GHASH is part of the Galois/Counter Mode, a mode of operation for block ciphers. AES-GCM for example uses AES as the block cipher for encryption, and appends a tag generated by the GHASH function, thereby ensuring integrity and authenticity.
The core of GHASH is multiplication in GF(2128), a characteristic-two finite field with coefficients in GF(2); they’re either zero or one. Polynomials in GF(2m) can be represented as m-bit numbers, with each bit corresponding to a term’s coefficient. In GF(23) for example, x^2 + 1
may be represented as the binary number 0b101 = 5
.
Additions and subtractions in finite fields are “carry-less” because the coefficients must be in GF(p), for any GF(pm). As x * y
is equivalent to adding x
to itself y
times, we can call multiplication in finite fields “carry-less” too. In GF(2) addition is simply XOR, so we can say that multiplication in GF(2m) is equal to binary multiplication without carries.
Note that the term carry-less only makes sense when talking about GF(2m) fields that are easily represented as binary numbers. Otherwise one would rather talk about multiplication in finite fields without comparing it to standard integer multiplication.
Franziskus’ post nicely describes why and how we updated our AES-GCM code in NSS. In case a user’s CPU is not equipped with the Carry-less Multiplication (CLMUL) instruction set, we need to provide a fallback and implement carry-less, constant-time binary multiplication ourselves, using standard integer multiplication with carry.
bmul() for 32-bit machines
The basic implementation of our binary multiplication algorithm is taken straight from Thomas Pornin’s excellent constant-time crypto post. To support 32-bit machines the best we can do is multiply two uint32_t
numbers and store the result in a uint64_t
.
For the full GHASH, Karatsuba decomposition is used: multiplication of two 128-bit integers is broken down into nine calls to bmul32(x, y, ...)
. Let’s take a look at the actual implementation:
Thomas’ explanation is not too hard to follow. The main idea behind the algorithm are the bitmasks m1 = 0b00010001...
, m2 = 0b00100010...
, m4 = 0b01000100...
, and m8 = 0b10001000...
. They respectively have the first, second, third, and fourth bit of every nibble set. This leaves “holes” of three bits between each “data bit”, so that with those applied at most a quarter of the 32 bits are equal to one.
Per standard integer multiplication, eight times eight bits will at most add eight carry bits of value one together, thus we need sufficiently sized holes per digit that can hold the value 8 = 0b1000
. Three-bit holes are big enough to prevent carries from “spilling” over, they could even handle up to 15 = 0b1111
data bits in each of the two integer operands.
Review, tests, and verification
The first version of the patch came with a bunch of new tests, the vectors taken from the GCM specification. We previously had no such low-level coverage, all we had were a number of high-level AES-GCM tests.
When reviewing, after looking at the patch itself and applying it locally to see whether it builds and tests succeed, the next step I wanted to try was to write a Cryptol specification to prove the correctness of bmul32()
. Thanks to the built-in pmult
function that took only a few minutes.
The SAWScript needed to properly parse the LLVM bitcode and formulate the equivalence proof is straightforward, it’s basically the same as shown in the previous post.
Compile to bitcode and run SAW. After just a few seconds it will tell us it succeeded in proving equivalency of both implementations.
bmul() for 64-bit machines
bmul32()
is called nine times, each time performing 16 multiplications. That’s 144 multiplications in total for one GHASH evaluation. If we had a bmul64()
for 128-bit multiplication with uint128_t
we’d need to call it only thrice.
The naive approach taken in the first patch revision was to just double the bitsize of the arguments and variables, and also extend the bitmasks. If you paid close attention to the previous section you might notice a problem here already. If not, it will become clear in a few moments.
Tests and another equivalence proof
The above version of bmul64()
passed the GHASH test vectors with flying colors. That tricked reviewers into thinking it looked just fine, even if they just learned about the basic algorithm idea. Fallible humans. Let’s update the proofs and see what happens.
Instead of hardcoding bmul
for 32-bit integers we use polymorphic types m
and n
to denote the size in bits. m
is mostly a helper to make it a tad more readable. We can now reason about carry-less n-bit binary multiplication.
Duplicating the SAWScript spec and running :s/32/64
is easy, but certainly nicer is adding a function that takes n
as a parameter and returns a spec for n-bit arguments.
We use two instances of the bmul
spec to prove correctness of bmul32()
and bmul64()
sequentially. The second verification will take a lot longer before yielding results.
Proof failed. As you probably expected by now, the bmul64()
implementation is erroneous and SAW gives us a specific counterexample to investigate further. It took us a while to understand the failure but it seems very obvious in hindsight.
Fixing the bmul64() bitmasks
As already shown above, bitmasks leaving three-bit holes between data bits can avoid carry-spilling for up to two 15-bit integers. Using every fourth bit of a 64-bit argument however yields 16 data bits each, and carries can thus override data bits. We need bitmasks with four-bit holes.
m1
, …, m5
are the new bitmasks. m1
equals 0b0010000100001...
, the others are each shifted by one. As the number of data bits per argument is now 64/5 <= n < 64/4
we need 5*5 = 25
multiplications. With three calls to bmul64()
that’s 75 in total.
Run SAW again and, after about an hour, it will tell us it successfully verified @bmul64.
You might want to take a look at Thomas Pornin’s version of bmul64()
. This basically is the faulty version that SAW failed to verify, he however works around the overflow by calling it twice, passing arguments reversed bitwise the second time. He invokes bmul64()
six times, which results in a total of 96 multiplications.
Some final thoughts
One of the takeaways is that even an implementation passing all test vectors given by a spec doesn’t need to be correct. That is not too surprising, spec authors can’t possibly predict edge cases from implementation approaches they haven’t thought about.
Using formal verification as part of the review process was definitely a wise decision. We likely saved hours of debugging intermittently failing connections, or random interoperability problems reported by early testers. I’m confident this wouldn’t have made it much further down the release line.
We of course added an extra test that covers that specific flaw but the next step definitely should be proper CI integration. The Cryptol code has already been written and there is no reason to not run it on every push. Verifying the full GHASH implementation would be ideal. The Cryptol code is almost trivial:
Proving the multiplication of two 128-bit numbers for a 256-bit product will unfortunately take a very very long time, or maybe not finish at all. Even if it finished after a few days that’s not something you want to automatically run on every push. Running it manually every time the code is touched might be an option though.