Verified binary multiplication for GHASH

Exploring formal verification (part 3)

Previously I introduced some very basic Cryptol and SAWScript, and explained how to reason about the correctness of constant-time integer multiplication written in C/C++.

In this post I will touch on using formal verification as part of the code review process, and in particular show how the Software Analysis Workbench saved us hours of debugging when we rewrote the GHASH implementation for NSS.

What’s GHASH again?

GHASH is part of the Galois/Counter Mode, a mode of operation for block ciphers. AES-GCM for example uses AES as the block cipher for encryption, and appends a tag generated by the GHASH function, thereby ensuring integrity and authenticity.

The core of GHASH is multiplication in GF(2^128), a characteristic-two finite field with coefficients in GF(2); they’re either zero or one. Polynomials in GF(2^m) can be represented as m-bit numbers, with each bit corresponding to a term’s coefficient. In GF(2^3) for example, x^2 + 1 may be represented as the binary number 0b101 = 5.

Additions and subtractions in finite fields are “carry-less” because the coefficients must be in GF(p), for any GF(p^m). As x * y is equivalent to adding x to itself y times, we can call multiplication in finite fields “carry-less” too. In GF(2) addition is simply XOR, so we can say that multiplication in GF(2^m) is equal to binary multiplication without carries.

Note that the term carry-less only makes sense when talking about GF(2^m) fields that are easily represented as binary numbers. Otherwise one would rather talk about multiplication in finite fields without comparing it to standard integer multiplication.
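To make “binary multiplication without carries” concrete, here is a naive shift-and-XOR version. This is only an illustration, not part of the NSS patch, and it is not constant-time since it branches on bits of y:

#include <stdint.h>

/* Naive carry-less multiplication: schoolbook binary multiplication,
 * but partial products are combined with XOR instead of addition.
 * The secret-dependent branch makes this unsuitable for real
 * crypto code. */
uint64_t
clmul32_naive(uint32_t x, uint32_t y)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < 32; i++) {
        if ((y >> i) & 1) {
            r ^= (uint64_t)x << i;
        }
    }
    return r;
}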

Franziskus’ post nicely describes why and how we updated our AES-GCM code in NSS. In case a user’s CPU is not equipped with the Carry-less Multiplication (CLMUL) instruction set, we need to provide a fallback and implement carry-less, constant-time binary multiplication ourselves, using standard integer multiplication with carry.

bmul() for 32-bit machines

The basic implementation of our binary multiplication algorithm is taken straight from Thomas Pornin’s excellent constant-time crypto post. To support 32-bit machines the best we can do is multiply two uint32_t numbers and store the result in a uint64_t.

For the full GHASH, Karatsuba decomposition is used: multiplication of two 128-bit integers is broken down into nine calls to bmul32(x, y, ...). Let’s take a look at the actual implementation:

/* Binary multiplication x * y = r_high << 32 | r_low. */
void
bmul32(uint32_t x, uint32_t y, uint32_t *r_high, uint32_t *r_low)
{
    uint32_t x0, x1, x2, x3;
    uint32_t y0, y1, y2, y3;
    uint32_t m1 = (uint32_t)0x11111111;
    uint32_t m2 = (uint32_t)0x22222222;
    uint32_t m4 = (uint32_t)0x44444444;
    uint32_t m8 = (uint32_t)0x88888888;
    uint64_t z0, z1, z2, z3;
    uint64_t z;

    /* Apply bitmasks. */
    x0 = x & m1;
    x1 = x & m2;
    x2 = x & m4;
    x3 = x & m8;
    y0 = y & m1;
    y1 = y & m2;
    y2 = y & m4;
    y3 = y & m8;

    /* Integer multiplication (16 times). */
    z0 = ((uint64_t)x0 * y0) ^ ((uint64_t)x1 * y3) ^
         ((uint64_t)x2 * y2) ^ ((uint64_t)x3 * y1);
    z1 = ((uint64_t)x0 * y1) ^ ((uint64_t)x1 * y0) ^
         ((uint64_t)x2 * y3) ^ ((uint64_t)x3 * y2);
    z2 = ((uint64_t)x0 * y2) ^ ((uint64_t)x1 * y1) ^
         ((uint64_t)x2 * y0) ^ ((uint64_t)x3 * y3);
    z3 = ((uint64_t)x0 * y3) ^ ((uint64_t)x1 * y2) ^
         ((uint64_t)x2 * y1) ^ ((uint64_t)x3 * y0);

    /* Merge results. */
    z0 &= ((uint64_t)m1 << 32) | m1;
    z1 &= ((uint64_t)m2 << 32) | m2;
    z2 &= ((uint64_t)m4 << 32) | m4;
    z3 &= ((uint64_t)m8 << 32) | m8;
    z = z0 | z1 | z2 | z3;
    *r_high = (uint32_t)(z >> 32);
    *r_low = (uint32_t)z;
}

Thomas’ explanation is not too hard to follow. The main idea behind the algorithm is the set of bitmasks m1 = 0b00010001..., m2 = 0b00100010..., m4 = 0b01000100..., and m8 = 0b10001000.... They respectively have the first, second, third, and fourth bit of every nibble set. This leaves “holes” of three bits between the data bits, so that with a mask applied at most a quarter of the 32 bits are set.

With a mask applied, each operand has at most eight data bits set. In standard integer multiplication, each digit of the result then accumulates at most eight partial-product bits of value one, so every digit needs a hole big enough to hold the value 8 = 0b1000. Three-bit holes are big enough to prevent carries from “spilling” over; they could even handle up to 15 = 0b1111 data bits in each of the two integer operands.
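The Karatsuba decomposition mentioned earlier is not shown in the listing, but the idea is simple: since addition in GF(2) is XOR, the schoolbook middle term xh*yl ^ xl*yh can be rewritten as (xh^xl)*(yh^yl) ^ xh*yh ^ xl*yl, so one level of decomposition needs three instead of four smaller multiplications. Here is a sketch of how three bmul32() calls could form one 64-bit carry-less product; bmul64_karatsuba is a made-up name and the actual NSS code may arrange this differently:

/* One level of carry-less Karatsuba: a 64x64-bit product from
 * three bmul32() calls. Sketch only. */
void
bmul64_karatsuba(uint64_t x, uint64_t y, uint64_t *r_high, uint64_t *r_low)
{
    uint32_t xh = (uint32_t)(x >> 32), xl = (uint32_t)x;
    uint32_t yh = (uint32_t)(y >> 32), yl = (uint32_t)y;
    uint32_t hh1, hh0, ll1, ll0, mm1, mm0;
    uint64_t hh, ll, mm;

    bmul32(xh, yh, &hh1, &hh0);           /* high halves */
    bmul32(xl, yl, &ll1, &ll0);           /* low halves */
    bmul32(xh ^ xl, yh ^ yl, &mm1, &mm0); /* combined halves */

    hh = (uint64_t)hh1 << 32 | hh0;
    ll = (uint64_t)ll1 << 32 | ll0;
    /* Recover the middle term xh*yl ^ xl*yh. */
    mm = ((uint64_t)mm1 << 32 | mm0) ^ hh ^ ll;

    /* The result is hh << 64 ^ mm << 32 ^ ll, split into two words. */
    *r_high = hh ^ (mm >> 32);
    *r_low = ll ^ (mm << 32);
}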

Review, tests, and verification

The first version of the patch came with a bunch of new tests, the vectors taken from the GCM specification. We previously had no such low-level coverage; all we had were a number of high-level AES-GCM tests.
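A cheap way to extend such coverage, sketched here purely as an illustration and not part of the patch, is to compare bmul32() against the naive shift-and-XOR reference from earlier on random inputs:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical spot check: bmul32() must agree with the naive
 * clmul32_naive() reference defined above. */
void
check_bmul32_random(unsigned iterations)
{
    for (unsigned i = 0; i < iterations; i++) {
        /* rand() may yield as few as 15 random bits, so stitch three
         * calls together to cover all 32 bit positions. */
        uint32_t x = ((uint32_t)rand() << 30) ^ ((uint32_t)rand() << 15) ^
                     (uint32_t)rand();
        uint32_t y = ((uint32_t)rand() << 30) ^ ((uint32_t)rand() << 15) ^
                     (uint32_t)rand();
        uint32_t hi, lo;

        bmul32(x, y, &hi, &lo);
        assert((((uint64_t)hi << 32) | lo) == clmul32_naive(x, y));
    }
}

Checks like this complement the spec’s vectors but, unlike a proof, only ever sample the input space.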

When reviewing, after looking at the patch itself and applying it locally to check that it builds and the tests pass, the next step I wanted to try was to write a Cryptol specification to prove the correctness of bmul32(). Thanks to Cryptol’s built-in pmult function, that took only a few minutes.

m <- llvm_load_module "bmul.bc";

let {{
  bmul32 : [32] -> [32] -> ([32], [32])
  bmul32 a b = (take`{32} prod, drop`{32} prod)
      where prod = pad (pmult a b)
            pad x = zero # x
}};

The SAWScript needed to properly parse the LLVM bitcode and formulate the equivalence proof is straightforward; it’s basically the same as shown in the previous post.

llvm_verify m "bmul32" [] do {
  x <- llvm_var "x" (llvm_int 32);
  y <- llvm_var "y" (llvm_int 32);
  llvm_ptr "r_high" (llvm_int 32);
  r_high <- llvm_var "*r_high" (llvm_int 32);
  llvm_ptr "r_low" (llvm_int 32);
  r_low <- llvm_var "*r_low" (llvm_int 32);

  let res = {{ bmul32 x y }};
  llvm_ensure_eq "*r_high" {{ res.0 }};
  llvm_ensure_eq "*r_low" {{ res.1 }};

  llvm_verify_tactic abc;
};

Compile to bitcode and run SAW. After just a few seconds it will tell us that it succeeded in proving the equivalence of both implementations.
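Assuming the C file is named bmul.c (the post doesn’t fix a name), Clang’s -emit-llvm flag produces the bitcode that llvm_load_module expects:

$ clang -c -emit-llvm -o bmul.bc bmul.c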

$ saw bmul.saw
Loading module Cryptol
Loading file "bmul.saw"
Successfully verified @bmul32

bmul() for 64-bit machines

bmul32() is called nine times, each time performing 16 multiplications. That’s 144 multiplications in total for one GHASH evaluation. If we had a bmul64() for 128-bit multiplication with uint128_t we’d need to call it only thrice.

The naive approach taken in the first patch revision was to just double the bit size of the arguments and variables, and to extend the bitmasks accordingly. If you paid close attention to the previous section you might notice the problem here already. If not, it will become clear in a few moments.

typedef unsigned __int128 uint128_t;

/* Binary multiplication x * y = r_high << 64 | r_low. */
void
bmul64(uint64_t x, uint64_t y, uint64_t *r_high, uint64_t *r_low)
{
    uint64_t x0, x1, x2, x3;
    uint64_t y0, y1, y2, y3;
    uint64_t m1 = (uint64_t)0x1111111111111111;
    uint64_t m2 = (uint64_t)0x2222222222222222;
    uint64_t m4 = (uint64_t)0x4444444444444444;
    uint64_t m8 = (uint64_t)0x8888888888888888;
    uint128_t z0, z1, z2, z3;
    uint128_t z;

    /* Apply bitmasks. */
    x0 = x & m1;
    x1 = x & m2;
    x2 = x & m4;
    x3 = x & m8;
    y0 = y & m1;
    y1 = y & m2;
    y2 = y & m4;
    y3 = y & m8;

    /* Integer multiplication (16 times). */
    z0 = ((uint128_t)x0 * y0) ^ ((uint128_t)x1 * y3) ^
         ((uint128_t)x2 * y2) ^ ((uint128_t)x3 * y1);
    z1 = ((uint128_t)x0 * y1) ^ ((uint128_t)x1 * y0) ^
         ((uint128_t)x2 * y3) ^ ((uint128_t)x3 * y2);
    z2 = ((uint128_t)x0 * y2) ^ ((uint128_t)x1 * y1) ^
         ((uint128_t)x2 * y0) ^ ((uint128_t)x3 * y3);
    z3 = ((uint128_t)x0 * y3) ^ ((uint128_t)x1 * y2) ^
         ((uint128_t)x2 * y1) ^ ((uint128_t)x3 * y0);

    /* Merge results. */
    z0 &= ((uint128_t)m1 << 64) | m1;
    z1 &= ((uint128_t)m2 << 64) | m2;
    z2 &= ((uint128_t)m4 << 64) | m4;
    z3 &= ((uint128_t)m8 << 64) | m8;
    z = z0 | z1 | z2 | z3;
    *r_high = (uint64_t)(z >> 64);
    *r_low = (uint64_t)z;
}

Tests and another equivalence proof

The above version of bmul64() passed the GHASH test vectors with flying colors. That tricked reviewers into thinking it was just fine, even though they had only just learned about the basic idea behind the algorithm. Fallible humans. Let’s update the proofs and see what happens.

bmul : {n,m} (fin n, n >= 1, m == n*2 - 1) => [n] -> [n] -> ([n], [n])
bmul a b = (take`{n} prod, drop`{n} prod)
    where prod = pad (pmult a b : [m])
          pad x = zero # x

Instead of hardcoding bmul for 32-bit integers we use the polymorphic size n to denote the width in bits; m is mostly a helper to make the signature a tad more readable. We can now reason about carry-less n-bit binary multiplication.

Duplicating the SAWScript spec and running :s/32/64 would be easy, but it’s certainly nicer to add a function that takes n as a parameter and returns a spec for n-bit arguments.

let SpecBinaryMul n = do {
  x <- llvm_var "x" (llvm_int n);
  y <- llvm_var "y" (llvm_int n);
  llvm_ptr "r_high" (llvm_int n);
  r_high <- llvm_var "*r_high" (llvm_int n);
  llvm_ptr "r_low" (llvm_int n);
  r_low <- llvm_var "*r_low" (llvm_int n);

  let res = {{ bmul x y }};
  llvm_ensure_eq "*r_high" {{ res.0 }};
  llvm_ensure_eq "*r_low" {{ res.1 }};

  llvm_verify_tactic abc;
};

llvm_verify m "bmul32" [] (SpecBinaryMul 32);
llvm_verify m "bmul64" [] (SpecBinaryMul 64);

We use two instances of the bmul spec to prove correctness of bmul32() and bmul64() sequentially. The second verification will take a lot longer before yielding results.

$ saw bmul.saw
Loading module Cryptol
Loading file "bmul.saw"
Successfully verified @bmul32
When verifying @bmul64:
Proof of Term *(Term Ident "r_high") failed.
Counterexample:
  %x: 15554860936645695441
  %y: 17798150062858027007
  lss__alloc0: 262144
  lss__alloc1: 8
Term *(Term Ident "r_high")
Encountered:  5413984507840984561
Expected:     5413984507840984531
saw: user error ("llvm_verify" (bmul.saw:31:1):
Proof failed.)

Proof failed. As you probably expected by now, the bmul64() implementation is erroneous, and SAW gives us a specific counterexample to investigate further. It took us a while to understand the failure, but it seems very obvious in hindsight.

Fixing the bmul64() bitmasks

As already shown above, bitmasks leaving three-bit holes between data bits can absorb the carries for operands with up to 15 data bits each. Using every fourth bit of a 64-bit argument however yields 16 data bits per operand, so a digit sum can reach 16 = 0b10000 and the carry spills into the next data bit. We need bitmasks with four-bit holes.
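The counting argument is easy to check with a small standalone illustration; this is not NSS code and uses the GCC/Clang popcount builtin:

#include <assert.h>
#include <stdint.h>

int
main(void)
{
    /* Old spacing: every fourth bit, i.e. 16 data bits per operand.
     * A digit sum of 16 = 0b10000 overflows a data bit plus its
     * three-bit hole. */
    assert(__builtin_popcountll(0x1111111111111111) == 16);

    /* New spacing: every fifth bit, i.e. at most 13 data bits per
     * operand, which a data bit plus a four-bit hole can absorb. */
    assert(__builtin_popcountll(0x1084210842108421) == 13);
    return 0;
}

With that in mind, here’s the fixed implementation: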

/* Binary multiplication x * y = r_high << 64 | r_low. */
void
bmul64(uint64_t x, uint64_t y, uint64_t *r_high, uint64_t *r_low)
{
    uint128_t x1, x2, x3, x4, x5;
    uint128_t y1, y2, y3, y4, y5;
    uint128_t r, z;

    /* Define bitmasks with 4-bit holes. */
    uint128_t m1 = (uint128_t)0x2108421084210842 << 64 | 0x1084210842108421;
    uint128_t m2 = (uint128_t)0x4210842108421084 << 64 | 0x2108421084210842;
    uint128_t m3 = (uint128_t)0x8421084210842108 << 64 | 0x4210842108421084;
    uint128_t m4 = (uint128_t)0x0842108421084210 << 64 | 0x8421084210842108;
    uint128_t m5 = (uint128_t)0x1084210842108421 << 64 | 0x0842108421084210;

    /* Apply bitmasks. */
    x1 = x & m1;
    y1 = y & m1;
    x2 = x & m2;
    y2 = y & m2;
    x3 = x & m3;
    y3 = y & m3;
    x4 = x & m4;
    y4 = y & m4;
    x5 = x & m5;
    y5 = y & m5;

    /* Integer multiplication (25 times) and merge results. */
    z = (x1 * y1) ^ (x2 * y5) ^ (x3 * y4) ^ (x4 * y3) ^ (x5 * y2);
    r = z & m1;
    z = (x1 * y2) ^ (x2 * y1) ^ (x3 * y5) ^ (x4 * y4) ^ (x5 * y3);
    r |= z & m2;
    z = (x1 * y3) ^ (x2 * y2) ^ (x3 * y1) ^ (x4 * y5) ^ (x5 * y4);
    r |= z & m3;
    z = (x1 * y4) ^ (x2 * y3) ^ (x3 * y2) ^ (x4 * y1) ^ (x5 * y5);
    r |= z & m4;
    z = (x1 * y5) ^ (x2 * y4) ^ (x3 * y3) ^ (x4 * y2) ^ (x5 * y1);
    r |= z & m5;

    *r_high = (uint64_t)(r >> 64);
    *r_low = (uint64_t)r;
}

m1, …, m5 are the new bitmasks. m1 equals 0b...0000100001, i.e. every fifth bit is set, and the others are each shifted left by one. Each mask now selects at most ⌈64/5⌉ = 13 data bits per operand, and a digit sum of at most 13 fits comfortably into a data bit plus its four-bit hole. With data bits spaced five apart we need 5*5 = 25 multiplications. With three calls to bmul64() that’s 75 in total.
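To convince yourself that the mask constants match that description you can rebuild them from scratch; a hypothetical sanity check, assuming a compiler with __int128 support:

#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t;

int
main(void)
{
    /* Base pattern: every fifth bit set. m1..m5 should be that
     * pattern shifted left by 0..4; bits pushed past position 127
     * simply drop out. */
    uint128_t base = 0;
    uint128_t m[5] = {
        (uint128_t)0x2108421084210842 << 64 | 0x1084210842108421,
        (uint128_t)0x4210842108421084 << 64 | 0x2108421084210842,
        (uint128_t)0x8421084210842108 << 64 | 0x4210842108421084,
        (uint128_t)0x0842108421084210 << 64 | 0x8421084210842108,
        (uint128_t)0x1084210842108421 << 64 | 0x0842108421084210,
    };

    for (int i = 0; i < 128; i += 5)
        base |= (uint128_t)1 << i;
    for (int k = 0; k < 5; k++)
        assert(m[k] == base << k);
    return 0;
}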

Run SAW again and, after about an hour, it will tell us it successfully verified @bmul64.

$ saw bmul.saw
Loading module Cryptol
Loading file "bmul.saw"
Successfully verified @bmul32
Successfully verified @bmul64

You might want to take a look at Thomas Pornin’s version of bmul64(). It basically is the faulty version that SAW failed to verify; he however works around the overflow by calling it twice, the second time with both arguments bit-reversed. He thus invokes bmul64() six times, which results in a total of 96 multiplications.
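Roughly speaking, his bmul64() returns only the low 64 bits of the carry-less product, and those the four-bit spacing still gets right: the single digit sum that can reach 16 sits high enough that its stray carry lands in the discarded upper half. Since clmul(rev(x), rev(y)) is the bit-reversal of clmul(x, y), the upper half can then be recovered with a second, bit-reversed call. A sketch of the idea, with a hypothetical bmul64_low() standing in for his truncated routine:

#include <stdint.h>

/* Hypothetical: the masked routine from above, returning only the
 * (correct) low 64 bits of the carry-less product. */
uint64_t bmul64_low(uint64_t x, uint64_t y);

/* Reverse the bits of a 64-bit word. */
static uint64_t
rev64(uint64_t x)
{
    x = ((x & 0x5555555555555555) << 1) | ((x >> 1) & 0x5555555555555555);
    x = ((x & 0x3333333333333333) << 2) | ((x >> 2) & 0x3333333333333333);
    x = ((x & 0x0f0f0f0f0f0f0f0f) << 4) | ((x >> 4) & 0x0f0f0f0f0f0f0f0f);
    x = ((x & 0x00ff00ff00ff00ff) << 8) | ((x >> 8) & 0x00ff00ff00ff00ff);
    x = ((x & 0x0000ffff0000ffff) << 16) | ((x >> 16) & 0x0000ffff0000ffff);
    return (x << 32) | (x >> 32);
}

/* Full 128-bit carry-less product from two truncated calls. The
 * second call computes the product of the reversed operands; its
 * low word, reversed and shifted, is the missing high half. */
void
bmul64_full(uint64_t x, uint64_t y, uint64_t *r_high, uint64_t *r_low)
{
    *r_low = bmul64_low(x, y);
    *r_high = rev64(bmul64_low(rev64(x), rev64(y))) >> 1;
}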

Some final thoughts

One of the takeaways is that even an implementation passing all test vectors given by a spec isn’t necessarily correct. That is not too surprising; spec authors can’t possibly predict edge cases of implementation approaches they haven’t thought about.

Using formal verification as part of the review process was definitely a wise decision. We likely saved hours of debugging intermittently failing connections, or random interoperability problems reported by early testers. I’m confident this wouldn’t have made it much further down the release line.

We of course added an extra test that covers that specific flaw, but the next step should definitely be proper CI integration. The Cryptol code has already been written, and there is no reason not to run it on every push. Verifying the full GHASH implementation would be ideal. The Cryptol code is almost trivial:

ghash : [128] -> [128] -> [128] -> ([64], [64])
ghash h x buf = (take`{64} res, drop`{64} res)
    where prod = pmod (pmult (reverse h) xor) <|x^^128 + x^^7 + x^^2 + x + 1|>
          xor = (reverse x) ^ (reverse buf)
          res = reverse prod

Proving the multiplication of two 128-bit numbers with a 256-bit product will unfortunately take a very, very long time, or maybe not finish at all. Even if it finished after a few days, that’s not something you want to run automatically on every push. Running it manually every time the code is touched might be an option though.