Tim Taubert

Bitslicing With Quine-McCluskey

2018-08-27T15:00:00+02:00

Part one gave a short introduction of bitslicing as a concept, talked about its use cases, truth tables, software multiplexers, LUTs, and manual optimization.

The second covered Karnaugh mapping, a visual method to simplify Boolean algebra expressions that takes advantage of humans’ pattern-recognition capability, but is unfortunately limited to at most four inputs in its original variant.

Part three will introduce the Quine-McCluskey algorithm, a tabulation method that, in combination with Petrick’s method, can minimize circuits with an arbitrary number of input values. Both are relatively simple to implement in software.

Part 1: Bitslicing, An Introduction
Part 2: Bitslicing with Karnaugh maps
Part 3: Bitslicing with Quine-McCluskey

The Quine-McCluskey algorithm

Here is the 3-to-2-bit S-box from the previous posts again:

uint8_t SBOX[] = { 1, 0, 3, 1, 2, 2, 3, 0 };

Without much ado, we’ll jump right in and bitslice functions for both its output bits in parallel. You’ll probably recognize a few similarities to K-maps, except that the steps are rather mechanical and don’t require visual pattern-recognition abilities.

Step 1: Listing minterms

The lookup table SBOX[] can be expressed as the Boolean functions f_L(a,b,c) and f_R(a,b,c). Here are their truth tables, with each combination of inputs assigned a symbol m_i. Rows m₀-m₇ will be called minterms.

f_L(a,b,c)
	a	b	c	f_L
m₀	0	0	0	0
m₁	0	0	1	0
m₂	0	1	0	1
m₃	0	1	1	0
m₄	1	0	0	1
m₅	1	0	1	1
m₆	1	1	0	1
m₇	1	1	1	0

f_R(a,b,c)
	a	b	c	f_R
m₀	0	0	0	1
m₁	0	0	1	0
m₂	0	1	0	1
m₃	0	1	1	1
m₄	1	0	0	0
m₅	1	0	1	0
m₆	1	1	0	1
m₇	1	1	1	0

We’re interested only in the minterms where the function evaluates to 1 and will ignore all others. Boolean functions can already be constructed with just those tables. In Boolean algebra, OR can be expressed as addition, AND as multiplication. The negation of x is represented by x.

f_L(a,b,c) = ∑ m(2,4,5,6)
          = m₂ + m₄ + m₅ + m₆
          = abc + abc + abc + abc

f_R(a,b,c) = ∑ m(0,2,3,6)
          = m₀ + m₂ + m₃ + m₆
          = abc + abc + abc + abc

Well, that’s a start. Translated into C, these functions would be constant-time but not even close to minimal.

Step 2: Bit Buckets

Now that we have all these minterms, we’ll put them in separate buckets based on the number of 1s in their inputs a, b, and c.

f_L(a,b,c)
# of 1s	minterm	binary
1	m₂	010
	m₄	100
2	m₅	101
	m₆	110

f_R(a,b,c)
# of 1s	minterm	binary
0	m₀	000
1	m₂	010
2	m₃	011
	m₆	110

The reasoning here is the same as the Gray code ordering for Karnaugh maps. If we start with the minterms in the first bucket n, only bucket n+1 might contain matching minterms where only a single variable changes. They can’t be in any of the other buckets.

Step 3: Merging minterms

Why would you even look for pairs of minterms with a one-variable difference? Because they can be merged to simplify our expression. These combinations are called minterms of size 2.

All minterms have output 1, so if the only difference is exactly one input variable, then the output is independent of it. For example, (a & ~b & c) | (a & b & c) can be reduced to just a & c, the expression value is independent of b.

f_L(a,b,c)
# of 1s	minterm	binary	size-2
1	m₂	010	m_2,6	—10
	m₄	100	m_4,5	10—
			m_4,6	1—0
2	m₅	101
	m₆	110

f_R(a,b,c)
# of 1s	minterm	binary	size-2
0	m₀	000	m_0,2	0—0
1	m₂	010	m_2,3	01—
			m_2,6	—10
2	m₃	011
	m₆	110

Always start with the minterms in the very first bucket at the top of the table. For every minterm in bucket n, we try to find a minterm in bucket n+1 with a one-bit difference in the binary column. Any matches will be recorded as pairs and entered into the size-2 column of bucket n.

m₂=010 and m₆=110 for example differ in only the first input variable, a. They merge into m_2,6=—10, with a dash marking the position of the irrelevant input bit.

Once all minterms were combined (as far as possible), we’ll continue with the next size. Minterms of size bigger than 1 have dashes for irrelevant input bits and it’s important to treat those as a “third bit value”. In other words, their dashes must be at the same positions, otherwise they can’t be merged.

There’s nothing left to merge for f_L(a,b,c) as all its size-2 minterms are in the first bucket. For f_R(a,b,c), none of the size-2 minterms in the first bucket match any of those in the second, their dashes are all in different positions.

Step 4: Prime Implicants

All minterms from the previous step that can’t be combined any further are called prime implicants. Entering them into a table let’s us check how well they cover the original minterms determined by step 1.

If any prime implicant is the only one to cover a minterm, it’s called an essential prime implicant (marked with an asterisk). It’s essential because it must be included in the resulting minimal form, otherwise we’d miss one of the input values combinations.

f_L(a,b,c)
	m₂	m₄	m₅	m₆	abc
m_2,6*	x			x	-10
m_4,5*		x	x		10-
m_4,6		x		x	1-0

f_R(a,b,c)
	m₀	m₂	m₃	m₆	abc
m_0,2*	x	x			0-0
m_2,3*		x	x		01-
m_2,6*		x		x	-10

Prime implicant m_2,6* on the left for example is the only one that covers m₂. m_4,5* is the only one that covers m₅. Not only is m_4,6 not essential, but we actually don’t need it at all: m₄ and m₆ are already covered by the essential prime implicants. All prime implicants of f_R(a,b,c) are essential, so we need all of them.

When bitslicing functions with many input variables it may happen that you are left with a number of non-essential prime implicants that can be combined in various ways to cover the missing minterms. Petrick’s method helps finding a minimum solution. It’s tedious to do manually, but not hard to automate.

Step 5: Minimal Forms

Finally, we derive minimal forms of our Boolean functions by looking at the abc column of the essential prime implicants. Input variables marked with dashes are ignored.

f_L(a,b,c) = m_2,6 + m_4,5 = bc + ab

The code for SBOXL() with 8-bit inputs:

uint8_t SBOXL(uint8_t a, uint8_t b, uint8_t c) {
  return (b & ~c) | (a & ~b);
}

f_R(a,b,c), reduced to the combination of its three essential prime implicants:

f_R(a,b,c) = m_0,2 + m_2,3 + m_2,6 = ac + ab + bc

And SBOXR() as expected:

uint8_t SBOXR(uint8_t a, uint8_t b, uint8_t c) {
  return (~a & ~c) | (~a & b) | (b & ~c);
}

Combining SBOXL() and SBOXR() yields the familiar version of SBOX(), after eliminating common subexpressions and taking out common factors.

void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t* l, uint8_t* r) {
  uint8_t na = ~a;
  uint8_t nb = ~b;
  uint8_t nc = ~c;

  uint8_t t0 = b & nc;
  uint8_t t1 = b | nc;

  *l = (a & nb) | t0;
  *r = (na & t1) | t0;
}

Bitslicing a DES S-box

When I started writing this blog post I thought it would be nice to ditch the small S-box from the previous posts, and naively bitslice a “real” S-box, like the ones used in DES.

But these are 6-to-4-bit S-boxes, how much more effort can it be? As it turns out, humans are terrible at understanding exponential growth. Here are my intermediate results after an hour of writing, trying to bitslice just one of the four output bits:

I gave up when I spotted a few mistakes that would likely lead to a non-minimal solution. Bitslicing a function with that many input variables manually is laborious and probably not worth it, except that it definitely helped me understand the steps of the algorithm better.

As mentioned in the beginning, Quine-McCluskey and Petrick’s method can be implemented in software rather easily, so that’s what I did instead. I’ll explain how, and what to consider, in the next post.

Bitslicing With Karnaugh Maps

2018-08-18T15:00:00+02:00

Bitslicing, in cryptography, is the technique of converting arbitrary functions into logic circuits, thereby enabling fast, constant-time implementations of cryptographic algorithms immune to cache and timing-related side channel attacks.

My last post Bitslicing, An Introduction showed how to convert an S-box function into truth tables, then into a tree of multiplexers, and finally how to find the lowest possible gate count through manual optimization.

Today’s post will focus on a simpler and faster method. Karnaugh maps help simplifying Boolean algebra expressions by taking advantage of humans’ pattern-recognition capability. In short, we’ll bitslice an S-box using K-maps.

Part 1: Bitslicing, An Introduction
Part 2: Bitslicing with Karnaugh maps
Part 3: Bitslicing with Quine-McCluskey

A tiny S-box

Here again is the 3-to-2-bit S-box function from the previous post.

uint8_t SBOX[] = { 1, 0, 3, 1, 2, 2, 3, 0 };

An AES-inspired S-box that interprets three input bits as a polynomial in GF(2³) and computes its inverse mod P(x) = x³ + x² + 1, with 0^-1 := 0. The result plus (x² + 1) is converted back into bits and the MSB is dropped.

This S-box can be represented as a function of three Boolean variables, where f(0,0,0) = 0b01, f(0,0,1) = 0b00, f(0,1,0) = 0b11, etc. Each output bit can be represented by its own Boolean function where f_L(0,0,0) = 0 and f_R(0,0,0) = 1, f_L(0,0,1) = 0 and f_R(0,0,1) = 0, …

A truth table per output bit

Each output bit has its own Boolean function, and therefore also its own thruth table. Here are the truth tables for the Boolean functions f_L(a,b,c) and f_R(a,b,c):

SBOX(a,b,c)
abc	out
000	01
001	00
010	11
011	01
100	10
101	10
110	11
111	00

f_L(a,b,c)
abc	out
000	0
001	0
010	1
011	0
100	1
101	1
110	1
111	0

f_R(a,b,c)
abc	out
000	1
001	0
010	1
011	1
100	0
101	0
110	1
111	0

Whereas previously at this point we built a tree of multiplexers out of each truth table, we’ll now build a Karnaugh map (K-map) per output bit.

Karnaugh Maps

The values of f_L(a,b,c) and f_R(a,b,c) are transferred onto a two-dimensional grid with the cells ordered in Gray code. Each cell position represents one possible combination of input bits, while each cell value represents the value of the output bit.

The row and column indices (a) and (b || c) are ordered in Gray code rather than binary numerical order to ensure only a single variable changes between each pair of adjacent cells. Otherwise, products of predicates (a & b, a & c, …) would scatter.

These products are what you want to find to get a minimum length representation of the truth function. If the output bit is the same at two adjacent cells, then it’s independent of one of the two input variables, because (a & ~b) | (a & b) = a.

Spotting patterns

The heart of simplifying Boolean expressions via K-maps is finding groups of adjacent cells with value 1. The rules are as follows:

Groups are rectangles of 2ⁿ cells with value 1.
Groups may not include cells with value 0.
Each cell with value 1 must be in at least one group.
Groups may be horizontal or vertical, not diagonal.
Each group should be as large as possible.
There should be as few groups as possible.
Groups may overlap.

First, we mark all cells with value 1. We then form a red group for the two horizontal groups of size 2¹. The two vertical groups are marked with green, also of size 2¹.

On f_R’s K-map on the right, the red and green group overlap. As per the rules above, that’s perfectly fine. The cell at abc=110 can’t be without a group and we’re instructed to form the largest groups possible, so they overlap.

But wait, you say, what’s going on with the blue rectangle on the right?

Wrapping around

A somewhat unexpected property of K-maps is that they’re not really grids, but actually toruses. In plain English: they wrap around the top, bottom, and the sides.

Look at this neat animation on Wikipedia that demonstrates how a rectangle can turn into a ~~donut~~torus. Adjacent thus has a special definition here: cells on the very right touch those on the far left, as do those at the very top and bottom.

Another way to understand this property is to imagine that the columns don’t start at 00 but rather at 01, and so we rotate the whole K-map by one to the left. Then the rectangles wouldn’t need to wrap around and they would all fit on the grid nicely.

Now that all cells with a 1 have been assigned to as few groups as possible, let’s get our hands dirty and write some code.

A bitsliced SBOX() function

K-maps are read groupwise: we look at each cell’s position and focus on the input values that do not change throughout the group. Values that do change are ignored.

One function for f_L(a,b,c) ...

The red group covers the cells at position 100 and 101. The values a=1 and b=0 are constant, they will be included into the group’s term. The value of c changes and is therefore irrelevant. The term is (a & ~b).

The green group covers the cells at 010 and 110. We ignore a, and include b=1 and c=0. The term is (b & ~c).

SBOXL() is the disjunction of the group terms we collected from the K-map. It lists all possible combinations of input values that lead to output value 1.

uint8_t SBOXL(uint8_t a, uint8_t b, uint8_t c) {
  return (a & ~b) | (b & ~c);
}

... and another one for f_R(a,b,c)

The red group covers the cells at 011 and 010. The term is (~a & b).

The green group covers the cells at 010 and 110. The term is (b & ~c).

The blue group covers the cells at 000 and 010. The term is (~a & ~c).

uint8_t SBOXR(uint8_t a, uint8_t b, uint8_t c) {
  return (~a & b) | (b & ~c) | (~a & ~c);
}

Great, that’s all we need! Now we can merge those two functions and compare that to the result of the previous post.

Putting it all together

The first three variables ensure that we negate inputs only once. t0 replaces the common subexpression b & nc. Any optimizing compiler would do the same.

void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t* l, uint8_t* r) {
  uint8_t na = ~a;
  uint8_t nb = ~b;
  uint8_t nc = ~c;

  uint8_t t0 = b & nc;

  *l = (a & nb) | t0;
  *r = (na & b) | (na & nc) | t0;
}

Ten gates. That’s one more than the manually optimized version from the last post. What’s missing? Turns out that K-maps sometimes don’t yield the minimal form and we have to simplify further by taking out common factors.

The conjunctions in the term (na & b) | (na & nc) have the common factor na and, due to the Distributivity Law, can be rewritten as na & (b | nc). That removes one of the AND gates and leaves two.

void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t* l, uint8_t* r) {
  uint8_t na = ~a;
  uint8_t nb = ~b;
  uint8_t nc = ~c;

  uint8_t t0 = b & nc;
  uint8_t t1 = b | nc;

  *l = (a & nb) | t0;
  *r = (na & t1) | t0;
}

Nine gates. That’s exactly what we achieved by tedious artisanal optimization.

More than four inputs

K-maps are neat and trivial to use once you’ve worked through an example yourself. They yield minimal circuits fast, compared to manual optimization where the effort grows exponentially with the number of terms.

There is one downside though, and it’s that the original variant of a K-map can’t be used with more than four input variables. There are variants that do work with more than four variables but they actually make it harder to spot groups visually.

The Quine–McCluskey algorithm is functionally identical to K-maps but can handle an arbitrary number of input variables in its original variant – although the running time grows exponentially with the number of variables. Not too problematic for us, S-boxes usually don’t have too many inputs anyway…

Bitslicing, an Introduction

2018-08-15T14:00:00+02:00

Bitslicing (in software) is an implementation strategy enabling fast, constant-time implementations of cryptographic algorithms immune to cache and timing-related side channel attacks.

This post intends to give a brief overview of the general technique, not requiring much of a cryptographic background. It will demonstrate bitslicing a small S-box, talk about multiplexers, LUTs, Boolean functions, and minimal forms.

Part 1: Bitslicing, An Introduction
Part 2: Bitslicing with Karnaugh maps
Part 3: Bitslicing with Quine-McCluskey

What is bitslicing?

Matthew Kwan coined the term about 20 years ago after seeing Eli Biham present his paper A Fast New DES Implementation in Software. He later published Reducing the Gate Count of Bitslice DES showing an even faster DES building on Biham’s ideas.

The basic concept is to express a function in terms of single-bit logical operations – AND, XOR, OR, NOT, etc. – as if you were implementing a logic circuit in hardware. These operations are then carried out for multiple instances of the function in parallel, using bitwise operations on a CPU.

In a bitsliced implementation, instead of having a single variable storing a, say, 8-bit number, you have eight variables (slices). The first storing the left-most bit of the number, the next storing the second bit from the left, and so on. The parallelism is bounded only by the target architecture’s register width.

What’s it good for?

Biham applied bitslicing to DES, a cipher designed to be fast in hardware. It uses eight different S-boxes, that were usually implemented as lookup tables. Table lookups in DES however are rather inefficient, since one has to collect six bits from different words, combine them, and afterwards put each of the four resulting bits in a different word.

Speed

In classical implementations, these bit permutations would be implemented with a combination of shifts and masks. In a bitslice representation though, permuting bits really just means using the “right” variables in the next step; this is mere data routing, which is resolved at compile-time, with no cost at runtime.

Additionally, the code is extremely linear so that it usually runs well on heavily pipelined modern CPUs. It tends to have a low risk of pipeline stalls, as it’s unlikely to suffer from branch misprediction, and plenty of opportunities for optimal instruction reordering for efficient scheduling of data accesses.

Parallelization

With a register width of n bits, as long as the bitsliced implementation is no more than n times slower to run a single instance of the cipher, you end up with a net gain in throughput. This only applies to workloads that allow for parallelization. CTR and ECB mode always benefit, CBC and CFB mode only when decrypting.

Constant execution time

Constant-time, secret independent computation is all the rage in modern applied cryptography. Bitslicing is interesting because by using only single-bit logical operations the resulting code is immune to cache and timing-related side channel attacks.

Fully Homomorphic Encryption

The last decade brought great advances in the field of Fully Homomorphic Encryption (FHE), i.e. computation on ciphertexts. If you have a secure crypto scheme and an efficient NAND gate you can use bitslicing to compute arbitrary functions of encrypted data.

Bitslicing a small S-box

Let’s work through a small example to see how one could go about converting arbitrary functions into a bunch of Boolean gates.

Imagine a 3-to-2-bit S-box function, a component found in many symmetric encryption algorithms. Naively, this would be represented by a lookup table with eight entries, e.g. SBOX[0b000] = 0b01, SBOX[0b001] = 0b00, etc.

uint8_t SBOX[] = { 1, 0, 3, 1, 2, 2, 3, 0 };

This AES-inspired S-box interprets three input bits as a polynomial in GF(2³) and computes its inverse mod P(x) = x³ + x² + 1, with 0^-1 := 0. The result plus (x² + 1) is converted back into bits and the MSB is dropped.

You can think of the above S-box’s output as being a function of three Boolean variables, where for instance f(0,0,0) = 0b01. Each output bit can be represented by its own Boolean function, i.e. f_L(0,0,0) = 0 and f_R(0,0,0) = 1.

LUTs and Multiplexers

If you’ve dealt with FPGAs before you probably know that these do not actually implement Boolean gates, but allow Boolean algebra by programming Look-Up-Tables (LUTs). We’re going to do the reverse and convert our S-box into trees of multiplexers.

Multiplexer is just a fancy word for data selector. A 2-to-1 multiplexer selects one of two input bits. A selector bit decides which of the two inputs will be passed through.

bool mux(bool a, bool b, bool s) {
  return s ? b : a;
}

Here are the LUTs, or rather truth tables, for the Boolean functions f_L(a,b,c) and f_R(a,b,c):

SBOX(a,b,c)
abc	out
000	01
001	00
010	11
011	01
100	10
101	10
110	11
111	00

f_L(a,b,c)
abc	out
000	0
001	0
010	1
011	0
100	1
101	1
110	1
111	0

f_R(a,b,c)
abc	out
000	1
001	0
010	1
011	1
100	0
101	0
110	1
111	0

The truth table for f_L(a,b,c) is (0, 0, 1, 0, 1, 1, 1, 0) or 2E_h. We can also call this the LUT-mask in the context of an FPGA. For each output bit of our S-box we need a 3-to-1 multiplexer, and that in turn can be represented by 2-to-1 multiplexers.

Multiplexers in Software

Let’s take the mux() function from above and make it constant-time. As stated earlier, bitslicing is competitive only through parallelization, so, for demonstration, we’ll use uint8_t arguments to later compute eight S-box lookups in parallel.

uint8_t mux(uint8_t a, uint8_t b, uint8_t s) {
  return (a & ~s) | (b & s);
}

If the n-th bit of s is zero it selects the n-th bit in a, if not it forwards the n-th bit in b. The wider the target architecture’s registers, the bigger the theoretical throughput – but only if the workload can take advantage of the level of parallelization.

A first implementation

The two output bits will be computed separately and then assembled into the final value returned by SBOX(). Each multiplexer in the above diagram is represented by a mux() call. The first four take the LUT-masks 2E_h and B2_h as inputs.

The diagram shows Boolean functions that only work with single-bit parameters. We use uint8_t, so instead of 1 we need to use ~0 to get 0b11111111.

uint8_t SBOXL(uint8_t a, uint8_t b, uint8_t c) {
  uint8_t c0 = mux( 0,  0, c);
  uint8_t c1 = mux(~0,  0, c);
  uint8_t c2 = mux(~0, ~0, c);
  uint8_t c3 = mux(~0,  0, c);

  uint8_t b0 = mux(c0, c1, b);
  uint8_t b1 = mux(c2, c3, b);

  return mux(b0, b1, a);
}

uint8_t SBOXR(uint8_t a, uint8_t b, uint8_t c) {
  uint8_t c0 = mux(~0,  0, c);
  uint8_t c1 = mux(~0, ~0, c);
  uint8_t c2 = mux( 0,  0, c);
  uint8_t c3 = mux(~0,  0, c);

  uint8_t b0 = mux(c0, c1, b);
  uint8_t b1 = mux(c2, c3, b);

  return mux(b0, b1, a);
}

void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t* l, uint8_t* r) {
  *l = SBOXL(a, b, c);
  *r = SBOXR(a, b, c);
}

That wasn’t too hard. SBOX() is constant-time and immune to cache timing attacks. Not counting the negation of constants (~0) we have 42 gates in total and perform eight lookups in parallel.

Assuming, for simplicity, that a table lookup is just one operation, the bitsliced version is about five times as slow. If we had a workflow that allowed for 64 parallel S-box lookups we could achieve eight times the current throughput by using uint64_t variables.

A better mux() function

mux() currently needs three operations. Here’s another variant using XOR:

uint8_t mux(uint8_t a, uint8_t b, uint8_t s) {
  uint8_t c = a ^ b;
  return (c & s) ^ a;
}

Now there still are three gates, but the new version lends itself often to easier optimization as we might be able to precompute a ^ b and reuse the result.

Simplifying the circuit

Let’s optimize our circuit manually by following these simple rules:

mux(a, a, s) reduces to a.
Any X AND ~0 will always be X.
Anything AND 0 will always be 0.
mux() with constant inputs can be reduced.

With the new mux() variant there are a few XOR rules to follow as well:

Any X XOR X reduces to 0.
Any X XOR 0 reduces to X.
Any X XOR ~0 reduces to ~X.

Inline the remaining mux() calls, eliminate common subexpressions, repeat.

void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t* l, uint8_t* r) {
  uint8_t na = ~a;
  uint8_t nb = ~b;
  uint8_t nc = ~c;

  uint8_t t0 = nb & a;
  uint8_t t1 = nc & b;
  uint8_t t2 = b | nc;
  uint8_t t3 = na & t2;

  *l = t0 | t1;
  *r = t1 | t3;
}

Using the laws of Boolean algebra and the rules formulated above I’ve reduced the circuit to nine gates (down from 42!). We actually couldn’t simplify it any further.

Circuit Minimization

Finding the minimal form of a Boolean function is an NP-complete problem. Manual optimization is tedious but doable for a tiny S-box such as the example used in this post. It will not be as easy for multiple 6-to-4-bit S-boxes (DES) or an 8-to-8-bit one (AES).

There are simpler and faster ways to build those circuits, and deterministic algorithms to check whether we reached the minimal form. One of those is covered in the next post Bitslicing with Karnaugh maps.

Verified Binary Multiplication for GHASH

2017-06-29T19:45:57+02:00

Previously I introduced some very basic Cryptol and SAWScript, and explained how to reason about the correctness of constant-time integer multiplication written in C/C++.

In this post I will touch on using formal verification as part of the code review process, in particular show how, by using the Software Analysis Workbench, we saved ourselves hours of debugging when rewriting the GHASH implementation for NSS.

What’s GHASH again?

GHASH is part of the Galois/Counter Mode, a mode of operation for block ciphers. AES-GCM for example uses AES as the block cipher for encryption, and appends a tag generated by the GHASH function, thereby ensuring integrity and authenticity.

The core of GHASH is multiplication in GF(2¹²⁸), a characteristic-two finite field with coefficients in GF(2); they’re either zero or one. Polynomials in GF(2^m) can be represented as m-bit numbers, with each bit corresponding to a term’s coefficient. In GF(2³) for example, x^2 + 1 may be represented as the binary number 0b101 = 5.

Additions and subtractions in finite fields are “carry-less” because the coefficients must be in GF(p), for any GF(p^m). As x * y is equivalent to adding x to itself y times, we can call multiplication in finite fields “carry-less” too. In GF(2) addition is simply XOR, so we can say that multiplication in GF(2^m) is equal to binary multiplication without carries.

Note that the term carry-less only makes sense when talking about GF(2^m) fields that are easily represented as binary numbers. Otherwise one would rather talk about multiplication in finite fields without comparing it to standard integer multiplication.

Franziskus’ post nicely describes why and how we updated our AES-GCM code in NSS. In case a user’s CPU is not equipped with the Carry-less Multiplication (CLMUL) instruction set, we need to provide a fallback and implement carry-less, constant-time binary multiplication ourselves, using standard integer multiplication with carry.

bmul() for 32-bit machines

The basic implementation of our binary multiplication algorithm is taken straight from Thomas Pornin’s excellent constant-time crypto post. To support 32-bit machines the best we can do is multiply two uint32_t numbers and store the result in a uint64_t.

For the full GHASH, Karatsuba decomposition is used: multiplication of two 128-bit integers is broken down into nine calls to bmul32(x, y, ...). Let’s take a look at the actual implementation:

/* Binary multiplication x * y = r_high << 32 | r_low. */
void
bmul32(uint32_t x, uint32_t y, uint32_t *r_high, uint32_t *r_low)
{
    uint32_t x0, x1, x2, x3;
    uint32_t y0, y1, y2, y3;
    uint32_t m1 = (uint32_t)0x11111111;
    uint32_t m2 = (uint32_t)0x22222222;
    uint32_t m4 = (uint32_t)0x44444444;
    uint32_t m8 = (uint32_t)0x88888888;
    uint64_t z0, z1, z2, z3;
    uint64_t z;

    /* Apply bitmasks. */
    x0 = x & m1;
    x1 = x & m2;
    x2 = x & m4;
    x3 = x & m8;
    y0 = y & m1;
    y1 = y & m2;
    y2 = y & m4;
    y3 = y & m8;

    /* Integer multiplication (16 times). */
    z0 = ((uint64_t)x0 * y0) ^ ((uint64_t)x1 * y3) ^
         ((uint64_t)x2 * y2) ^ ((uint64_t)x3 * y1);
    z1 = ((uint64_t)x0 * y1) ^ ((uint64_t)x1 * y0) ^
         ((uint64_t)x2 * y3) ^ ((uint64_t)x3 * y2);
    z2 = ((uint64_t)x0 * y2) ^ ((uint64_t)x1 * y1) ^
         ((uint64_t)x2 * y0) ^ ((uint64_t)x3 * y3);
    z3 = ((uint64_t)x0 * y3) ^ ((uint64_t)x1 * y2) ^
         ((uint64_t)x2 * y1) ^ ((uint64_t)x3 * y0);

    /* Merge results. */
    z0 &= ((uint64_t)m1 << 32) | m1;
    z1 &= ((uint64_t)m2 << 32) | m2;
    z2 &= ((uint64_t)m4 << 32) | m4;
    z3 &= ((uint64_t)m8 << 32) | m8;
    z = z0 | z1 | z2 | z3;
    *r_high = (uint32_t)(z >> 32);
    *r_low = (uint32_t)z;
}

Thomas’ explanation is not too hard to follow. The main idea behind the algorithm are the bitmasks m1 = 0b00010001..., m2 = 0b00100010..., m4 = 0b01000100..., and m8 = 0b10001000.... They respectively have the first, second, third, and fourth bit of every nibble set. This leaves “holes” of three bits between each “data bit”, so that with those applied at most a quarter of the 32 bits are equal to one.

Per standard integer multiplication, eight times eight bits will at most add eight carry bits of value one together, thus we need sufficiently sized holes per digit that can hold the value 8 = 0b1000. Three-bit holes are big enough to prevent carries from “spilling” over, they could even handle up to 15 = 0b1111 data bits in each of the two integer operands.

Review, tests, and verification

The first version of the patch came with a bunch of new tests, the vectors taken from the GCM specification. We previously had no such low-level coverage, all we had were a number of high-level AES-GCM tests.

When reviewing, after looking at the patch itself and applying it locally to see whether it builds and tests succeed, the next step I wanted to try was to write a Cryptol specification to prove the correctness of bmul32(). Thanks to the built-in pmult function that took only a few minutes.

m <- llvm_load_module "bmul.bc";

let {{
  bmul32 : [32] -> [32] -> ([32], [32])
  bmul32 a b = (take`{32} prod, drop`{32} prod)
      where prod = pad (pmult a b)
            pad x = zero # x
}};

The SAWScript needed to properly parse the LLVM bitcode and formulate the equivalence proof is straightforward, it’s basically the same as shown in the previous post.

llvm_verify m "bmul32" [] do {
  x <- llvm_var "x" (llvm_int 32);
  y <- llvm_var "y" (llvm_int 32);
  llvm_ptr "r_high" (llvm_int 32);
  r_high <- llvm_var "*r_high" (llvm_int 32);
  llvm_ptr "r_low" (llvm_int 32);
  r_low <- llvm_var "*r_low" (llvm_int 32);

  let res = {{ bmul32 x y }};
  llvm_ensure_eq "*r_high" {{ res.0 }};
  llvm_ensure_eq "*r_low" {{ res.1 }};

  llvm_verify_tactic abc;
};

Compile to bitcode and run SAW. After just a few seconds it will tell us it succeeded in proving equivalency of both implementations.

$ saw bmul.saw
Loading module Cryptol
Loading file "bmul.saw"
Successfully verified @bmul32

bmul() for 64-bit machines

bmul32() is called nine times, each time performing 16 multiplications. That’s 144 multiplications in total for one GHASH evaluation. If we had a bmul64() for 128-bit multiplication with uint128_t we’d need to call it only thrice.

The naive approach taken in the first patch revision was to just double the bitsize of the arguments and variables, and also extend the bitmasks. If you paid close attention to the previous section you might notice a problem here already. If not, it will become clear in a few moments.

typedef unsigned __int128 uint128_t;

/* Binary multiplication x * y = r_high << 64 | r_low. */
void
bmul64(uint64_t x, uint64_t y, uint64_t *r_high, uint64_t *r_low)
{
    uint64_t x0, x1, x2, x3;
    uint64_t y0, y1, y2, y3;
    uint64_t m1 = (uint64_t)0x1111111111111111;
    uint64_t m2 = (uint64_t)0x2222222222222222;
    uint64_t m4 = (uint64_t)0x4444444444444444;
    uint64_t m8 = (uint64_t)0x8888888888888888;
    uint128_t z0, z1, z2, z3;
    uint128_t z;

    /* Apply bitmasks. */
    x0 = x & m1;
    x1 = x & m2;
    x2 = x & m4;
    x3 = x & m8;
    y0 = y & m1;
    y1 = y & m2;
    y2 = y & m4;
    y3 = y & m8;

    /* Integer multiplication (16 times). */
    z0 = ((uint128_t)x0 * y0) ^ ((uint128_t)x1 * y3) ^
         ((uint128_t)x2 * y2) ^ ((uint128_t)x3 * y1);
    z1 = ((uint128_t)x0 * y1) ^ ((uint128_t)x1 * y0) ^
         ((uint128_t)x2 * y3) ^ ((uint128_t)x3 * y2);
    z2 = ((uint128_t)x0 * y2) ^ ((uint128_t)x1 * y1) ^
         ((uint128_t)x2 * y0) ^ ((uint128_t)x3 * y3);
    z3 = ((uint128_t)x0 * y3) ^ ((uint128_t)x1 * y2) ^
         ((uint128_t)x2 * y1) ^ ((uint128_t)x3 * y0);

    /* Merge results. */
    z0 &= ((uint128_t)m1 << 64) | m1;
    z1 &= ((uint128_t)m2 << 64) | m2;
    z2 &= ((uint128_t)m4 << 64) | m4;
    z3 &= ((uint128_t)m8 << 64) | m8;
    z = z0 | z1 | z2 | z3;
    *r_high = (uint64_t)(z >> 64);
    *r_low = (uint64_t)z;
}

Tests and another equivalence proof

The above version of bmul64() passed the GHASH test vectors with flying colors. That tricked reviewers into thinking it looked just fine, even if they just learned about the basic algorithm idea. Fallible humans. Let’s update the proofs and see what happens.

bmul : {n,m} (fin n, n >= 1, m == n*2 - 1) => [n] -> [n] -> ([n], [n])
bmul a b = (take`{n} prod, drop`{n} prod)
    where prod = pad (pmult a b : [m])
          pad x = zero # x

Instead of hardcoding bmul for 32-bit integers we use polymorphic types m and n to denote the size in bits. m is mostly a helper to make it a tad more readable. We can now reason about carry-less n-bit binary multiplication.

Duplicating the SAWScript spec and running :s/32/64 is easy, but certainly nicer is adding a function that takes n as a parameter and returns a spec for n-bit arguments.

let SpecBinaryMul n = do {
  x <- llvm_var "x" (llvm_int n);
  y <- llvm_var "y" (llvm_int n);
  llvm_ptr "r_high" (llvm_int n);
  r_high <- llvm_var "*r_high" (llvm_int n);
  llvm_ptr "r_low" (llvm_int n);
  r_low <- llvm_var "*r_low" (llvm_int n);

  let res = {{ bmul x y }};
  llvm_ensure_eq "*r_high" {{ res.0 }};
  llvm_ensure_eq "*r_low" {{ res.1 }};

  llvm_verify_tactic abc;
};

llvm_verify m "bmul32" [] (SpecBinaryMul 32);
llvm_verify m "bmul64" [] (SpecBinaryMul 64);

We use two instances of the bmul spec to prove correctness of bmul32() and bmul64() sequentially. The second verification will take a lot longer before yielding results.

$ saw bmul.saw
Loading module Cryptol
Loading file "bmul.saw"
Successfully verified @bmul32
When verifying @bmul64:
Proof of Term *(Term Ident "r_high") failed.
Counterexample:
  %x: 15554860936645695441
  %y: 17798150062858027007
  lss__alloc0: 262144
  lss__alloc1: 8
Term *(Term Ident "r_high")
Encountered:  5413984507840984561
Expected:     5413984507840984531
saw: user error ("llvm_verify" (bmul.saw:31:1):
Proof failed.)

Proof failed. As you probably expected by now, the bmul64() implementation is erroneous and SAW gives us a specific counterexample to investigate further. It took us a while to understand the failure but it seems very obvious in hindsight.

Fixing the bmul64() bitmasks

As already shown above, bitmasks leaving three-bit holes between data bits can avoid carry-spilling for up to two 15-bit integers. Using every fourth bit of a 64-bit argument however yields 16 data bits each, and carries can thus override data bits. We need bitmasks with four-bit holes.

/* Binary multiplication x * y = r_high << 64 | r_low. */
void
bmul64(uint64_t x, uint64_t y, uint64_t *r_high, uint64_t *r_low)
{
    uint128_t x1, x2, x3, x4, x5;
    uint128_t y1, y2, y3, y4, y5;
    uint128_t r, z;

    /* Define bitmasks with 4-bit holes. */
    uint128_t m1 = (uint128_t)0x2108421084210842 << 64 | 0x1084210842108421;
    uint128_t m2 = (uint128_t)0x4210842108421084 << 64 | 0x2108421084210842;
    uint128_t m3 = (uint128_t)0x8421084210842108 << 64 | 0x4210842108421084;
    uint128_t m4 = (uint128_t)0x0842108421084210 << 64 | 0x8421084210842108;
    uint128_t m5 = (uint128_t)0x1084210842108421 << 64 | 0x0842108421084210;

    /* Apply bitmasks. */
    x1 = x & m1;
    y1 = y & m1;
    x2 = x & m2;
    y2 = y & m2;
    x3 = x & m3;
    y3 = y & m3;
    x4 = x & m4;
    y4 = y & m4;
    x5 = x & m5;
    y5 = y & m5;

    /* Integer multiplication (25 times) and merge results. */
    z = (x1 * y1) ^ (x2 * y5) ^ (x3 * y4) ^ (x4 * y3) ^ (x5 * y2);
    r = z & m1;
    z = (x1 * y2) ^ (x2 * y1) ^ (x3 * y5) ^ (x4 * y4) ^ (x5 * y3);
    r |= z & m2;
    z = (x1 * y3) ^ (x2 * y2) ^ (x3 * y1) ^ (x4 * y5) ^ (x5 * y4);
    r |= z & m3;
    z = (x1 * y4) ^ (x2 * y3) ^ (x3 * y2) ^ (x4 * y1) ^ (x5 * y5);
    r |= z & m4;
    z = (x1 * y5) ^ (x2 * y4) ^ (x3 * y3) ^ (x4 * y2) ^ (x5 * y1);
    r |= z & m5;

    *r_high = (uint64_t)(r >> 64);
    *r_low = (uint64_t)r;
}

m1, …, m5 are the new bitmasks. m1 equals 0b0010000100001..., the others are each shifted by one. As the number of data bits per argument is now 64/5 <= n < 64/4 we need 5*5 = 25 multiplications. With three calls to bmul64() that’s 75 in total.

Run SAW again and, after about an hour, it will tell us it successfully verified @bmul64.

$ saw bmul.saw
Loading module Cryptol
Loading file "bmul.saw"
Successfully verified @bmul32
Successfully verified @bmul64

You might want to take a look at Thomas Pornin’s version of bmul64(). This basically is the faulty version that SAW failed to verify, he however works around the overflow by calling it twice, passing arguments reversed bitwise the second time. He invokes bmul64() six times, which results in a total of 96 multiplications.

Some final thoughts

One of the takeaways is that even an implementation passing all test vectors given by a spec doesn’t need to be correct. That is not too surprising, spec authors can’t possibly predict edge cases from implementation approaches they haven’t thought about.

Using formal verification as part of the review process was definitely a wise decision. We likely saved hours of debugging intermittently failing connections, or random interoperability problems reported by early testers. I’m confident this wouldn’t have made it much further down the release line.

We of course added an extra test that covers that specific flaw but the next step definitely should be proper CI integration. The Cryptol code has already been written and there is no reason to not run it on every push. Verifying the full GHASH implementation would be ideal. The Cryptol code is almost trivial:

ghash : [128] -> [128] -> [128] -> ([64], [64])
ghash h x buf = (take`{64} res, drop`{64} res)
    where prod = pmod (pmult (reverse h) xor) <|x^^128 + x^^7 + x^^2 + x + 1|>
          xor = (reverse x) ^ (reverse buf)
          res = reverse prod

Proving the multiplication of two 128-bit numbers for a 256-bit product will unfortunately take a very very long time, or maybe not finish at all. Even if it finished after a few days that’s not something you want to automatically run on every push. Running it manually every time the code is touched might be an option though.

The Future of Session Resumption

2017-02-15T18:00:00+01:00

A while ago I wrote about the state of server-side session resumption implementations in popular web servers using OpenSSL. Neither Apache, nor Nginx or HAproxy purged stale entries from the session cache or rotated session tickets automatically, potentially harming forward secrecy of resumed TLS session.

Enabling session resumption is an important tool for speeding up HTTPS websites, especially in a pre-HTTP/2 world where a client may have to open concurrent connections to the same host to quickly render a page. Subresource requests would ideally resume the session that for example a GET / HTTP/1.1 request started.

Let’s take a look at what has changed in over two years, and whether configuring session resumption securely has gotten any easier. With the TLS 1.3 spec about to be finalized I will show what the future holds and how these issues were addressed by the WG.

Did web servers react?

No, not as far as I’m aware. None of the three web servers mentioned above has taken steps to make it easier to properly configure session resumption. But to be fair, OpenSSL didn’t add any new APIs or options to help them either.

All popular TLS 1.2 web servers still don’t evict cache entries when they expire, keeping them around until a client tries to resume — for performance or ease of implementation. They generate a session ticket key at startup and will never automatically rotate it so that admins have to manually reload server configs and provide new keys.

The Caddy web server

I want to seize the chance and positively highlight the Caddy web server, a relative newcomer with the advantage of not having any historical baggage, that enables and configures HTTPS by default, including automatically acquiring and renewing certificates.

Version 0.8.3 introduced automatic session ticket key rotation, thereby making session tickets mostly forward secure by replacing the key every ~10 hours. Session cache entries though aren’t evicted until access just like with the other web servers.

But even for “traditional” web servers all is not lost. The TLS working group has known about the shortcomings of session resumption for a while and addresses those with the next version of TLS.

1-RTT handshakes by default

One of the many great things about TLS 1.3 handshakes is that most connections should take only a single round-trip to establish. The client sends one or more KeyShareEntry values with the ClientHello, and the server responds with a single KeyShareEntry for a key exchange with ephemeral keys.

If the client sends no or only unsupported groups, the server will send a HelloRetryRequest message with a NamedGroup selected from the ones supported by the client. The connection will fall back to two round-trips.

That means you’re automatically covered if you enable session resumption only to reduce network latency, a normal handshake is as fast as 1-RTT resumption in TLS 1.2. If you’re worried about computational overhead from certificate authentication and key exchange, that still might be a good reason to abbreviate handshakes.

Pre-shared keys in TLS 1.3

Session IDs and session tickets are obsolete since TLS 1.3. They’ve been replaced by a more generic PSK mechanism that allows resuming a session with a previously established shared secret key.

Instead of an ID or a ticket, the client will send an opaque blob it received from the server after a successful handshake in a prior session. That blob might either be an ID pointing to an entry in the server’s session cache, or a session ticket encrypted with a key known only to the server.

enum { psk_ke(0), psk_dhe_ke(1), (255) } PskKeyExchangeMode;

struct {
   PskKeyExchangeMode ke_modes<1..255>;
} PskKeyExchangeModes;

Two PSK key exchange modes are defined, psk_ke and psk_dhe_ke. The first signals a key exchange using a previously shared key, it derives a new master secret from only the PSK and nonces. This basically is as (in)secure as session resumption in TLS 1.2 if the server never rotates keys or discards cache entries long after they expired.

The second psk_dhe_ke mode additionally incorporates a key agreed upon using ephemeral Diffie-Hellman, thereby making it forward secure. By mixing a shared (EC)DHE key into the derived master secret, an attacker can no longer pull an entry out of the cache, or steal ticket keys, to recover the plaintext of past resumed sessions.

Note that 0-RTT data cannot be protected by the DHE secret, the early traffic secret is established without any input from the server and thus derived from the PSK only.

TLS 1.2 is surely here to stay

In theory, there should be no valid reason for a web client to be able to complete a TLS 1.3 handshake but not support psk_dhe_ke, as ephemeral Diffie-Hellman key exchanges are mandatory. An internal application talking TLS between peers would likely be a legitimate case for not supporting DHE.

But also for TLS 1.3 it might make sense to properly configure session ticket key rotation and cache turnover in case the odd client supports only psk_ke. It still makes sense especially for TLS 1.2, it will be around for probably longer than we wish and imagine.

Simple Cryptol Specifications

2017-02-07T16:00:00+01:00

In the previous post I showed how to prove equivalence of two different implementations of the same algorithm. This post will cover writing an algorithm specification in Cryptol to prove the correctness of a constant-time C/C++ implementation.

Apart from rather simple Cryptol I’m also going to introduce SAW’s llvm_verify function that allows much more complex verification. We need this as our function will not only take scalar inputs but also store the result of the computation using pointer arguments.

Constant-time multiplication

Part 1 dealt with addition, in part 2 we’re going to look at multiplication. Let’s implement a function mul(a, b, *hi, *lo) that multiplies a and b, and stores the eight most significant bits of the product in *hi, and the eight LSBs in *lo.

This time we’ll make it run in constant time right away and won’t bother implementing a trivial version first. Instead, we will write a Cryptol specification to verify LLVM bitcode afterwards — you will be amazed how simple that is.

Some helper functions

The first two functions of our C/C++ implementation will seem familiar if you’ve read the previous part of the series. msb hasn’t changed, and ge is the negated version of lt. nz returns 0xff if the given argument x is non-zero, 0 otherwise.

cmul.c[gist.github.com/ttaubert/c742ba7adf040e14ff21e111a929f5b8#file-cmul-c]

// 0xff if MSB(x) = 1 else 0x00
uint8_t msb(uint8_t x) {
  return 0 - (x >> (8 * sizeof(x) - 1));
}

// 0xff if a >= b else 0x00
uint8_t ge(uint8_t a, uint8_t b) {
  return ~msb(a ^ ((a ^ b) | ((a - b) ^ b)));
}

// 0xff if x > 0 else 0x00
uint8_t nz(uint8_t x) {
  return ~msb(~x & (x - 1));
}

uint8_t add(uint8_t a, uint8_t b, uint8_t *carry) {
  *carry = msb(ge(a, 0 - b) & nz(b)) & 1;
  return a + b;
}

Our add function that previously dealt with overflows by capping at UINT8_MAX is a little more mature now and will set *carry = 1 when an overflow occurs.

The core of the algorithm

mul(a, b, *hi, *lo), using all the helper functions we defined above, implements standard long multiplication, i.e. four multiplications per function call. We split the two 8-bit arguments into two 4-bit halves, multiply and add a few times, and then store two 8-bit results at the addresses pointed to by hi and lo.

cmul.c[gist.github.com/ttaubert/c742ba7adf040e14ff21e111a929f5b8#file-cmul-c]

void mul(uint8_t a, uint8_t b, uint8_t *hi, uint8_t *lo) {
  uint8_t a1 = a >> 4, a0 = a & 0xf;
  uint8_t b1 = b >> 4, b0 = b & 0xf;
  uint8_t z0 = a0 * b0;
  uint8_t z2 = a1 * b1;

  uint8_t z1, z1carry, carry, trash;
  z1 = add(a0 * b1, a1 * b0, &z1carry);
  *lo = add(z1 << 4, z0, &carry);
  *hi = add(z2, (z1 >> 4) + carry, &trash);
  *hi = add(*hi, z1carry << 4, &trash);
}

It’s relatively easy to see that a * b can be rewritten as (a1 * 2^4 + a0) * (b1 * 2^4 + b0), all four variables being 4-bit integers. After multiplying and rearranging you’ll get an equation that’s very similar to mul above. Here’s a good introduction to computing with long integers if you want to know more.

$ clang -c -emit-llvm -o cmul.bc cmul.c

Compile the code to LLVM bitcode as before so that we can load it into SAW later.

The Cryptol specification

To automate verification we’ll again write a SAW script. It will contain the necessary verification commands and details, as well as a Cryptol specification.

The specification doesn’t need to be constant-time, all it needs to be is correct and as simple as possible. We declare a function mul taking two 8-bit integers and returning a tuple containing two 8-bit integers. Read the notation [8] as “sequence of 8 bits”.

cmul.saw[gist.github.com/ttaubert/c742ba7adf040e14ff21e111a929f5b8#file-cmul-saw]

m <- llvm_load_module "cmul.bc";

let {{
  mul : [8] -> [8] -> ([8], [8])
  mul a b = (take`{8} prod, drop`{8} prod)
      where prod = (pad a) * (pad b)
            pad x = zero # x
}};

The built-in function take`{n} x returns a sequence with only the first n items of x. drop`{n} x returns sequence x without the first n items. zero is a special value that has a number of use cases, here it represents a flexible sequence of all zero bits. # is the append operator for sequences.

The first line of the definition gives the return value, a tuple with the first and the last 8 bits of prod. The Cryptol type system can automatically infer that the variable prod must hold a 16-bit sequence if the result of the take`{8} and drop`{8} function calls is a sequence of 8 bits each.

prod is the result of multiplying the zero-padded arguments a and b. zero # x appends x to 8 zero bits, and that number is again determined by the type system. If you want to learn more about the language, take a look at Programming Cryptol.

That’s about as simple as it gets. We multiply two 8-bit integers and out comes a 16-bit integer, split into two halves. Now let’s use the specification to verify our constant-time implementation.

SAW’s llvm_verify function

We will add LLVM SAW instructions to the same file that contains the Cryptol code from above. The llvm_verify call here takes module m, extracts the symbol "mul", and uses the body given after do for verification.

We need to declare all symbolic inputs as given by our C/C++ implementation. With llvm_var we tell SAW that "a" and "b" are 8-bit integer arguments, and map those to the SAW variables a and b.

The arguments "hi" and "lo" are declared as pointers to 8-bit integers using llvm_ptr. And because we want to dereference the pointers and refer to their values later we declare "*hi" and "*lo" as 8-bit integers too.

cmul.saw[gist.github.com/ttaubert/c742ba7adf040e14ff21e111a929f5b8#file-cmul-saw]

llvm_verify m "mul" [] do {
  a <- llvm_var "a" (llvm_int 8);
  b <- llvm_var "b" (llvm_int 8);

  llvm_ptr "hi" (llvm_int 8);
  hi <- llvm_var "*hi" (llvm_int 8);
  llvm_ptr "lo" (llvm_int 8);
  lo <- llvm_var "*lo" (llvm_int 8);

  let res = {{ mul a b }};
  llvm_ensure_eq "*hi" {{ res.0 }};
  llvm_ensure_eq "*lo" {{ res.1 }};

  llvm_verify_tactic abc;
};

We specify no constraints for any of the arguments and expect the verification to consider all possible inputs. I will talk a bit more about such constraints and how these are useful in a later post.

With llvm_ensure_eq we tell SAW what values we expect after symbolic execution. We expect "*hi" to be equal to the first 8-bit integer element of the tuple returned by mul, and "*lo" to be equal to the second 8-bit integer.

llvm_verify_tactic chooses UC Berkely’s ABC tool again and off we go.

Verification with SAW

Again, make sure you have saw and z3 in your $PATH. If you haven’t downloaded the binaries yet, take a look at the early sections of the previous post.

$ saw cmul.saw
Loading module Cryptol
Loading file "cmul.saw"
Successfully verified @mul

Successfully verified @mul. SAW tells us that for all possible inputs a and b, and actually hi and lo too, our constant-time C/C++ implementation behaves as stated by the SAW verification script and is thereby equivalent to our Cryptol specification.

Next: Finding bugs and more LLVM commands

In the next post I’m going to introduce and write more Cryptol, talk about specifying constraints on LLVM arguments and return values, and provide an example for finding bugs in a real-world codebase.

And while you wait, why not try your hand at optimizing mul to use only three instead of four multiplications with the Karatsuba algorithm? You can reuse the above Cryptol specification to verify you got it right.

Equivalence Proofs With SAW

2017-01-26T16:00:00+01:00

This is the first of a small series of posts that will scratch the surface of the world of formal verification. I will mainly use SAW, the Software Analysis Workbench, and Cryptol, a DSL for specifying crypto algorithms. Both are powerful tools for verifying C, C++, and even Rust code, i.e. almost anything that compiles to LLVM bitcode.

Verifying the implementation of a specific algorithm not only helps you weed out bugs early, it lets you prove that your code is correct and contains no further bugs - assuming you made no mistakes writing your algorithm specification in the first place.

Even if you don’t know a lot about formal verification, or anything, it’s easy to get started experimenting with Cryptol and SAW, and get a glimpse of what’s possible.

In this first post I’ll show how you can use SAW to prove equality of multiple implementations of the same algorithm, potentially written in different languages.

Setting up your workspace

To get started, download the latest SAW and Z3, as well as clang 3.8:

SAW: http://saw.galois.com/builds/nightly/
Z3: https://github.com/Z3Prover/z3/releases
LLVM 3.8: http://releases.llvm.org/download.html

You need clang 3.8, later versions seem currently not supported. Xcode’s latest clang would (probably) work for this small example but give you headaches with more advanced verification later on.

Unzip and copy the tools someplace you like, just don’t forget to update your $PATH environment variable. Especially if you already have clang on your system.

Let’s start with a simple example.

Unsigned addition without overflow

We define an addition function add(a, b) that takes two uint8_t arguments and returns a uint8_t. It deals with overflows so that 123 + 200 = 255, that is it caps the number at UINT8_MAX instead of wrapping around.

add.c[gist.github.com/ttaubert/ecf5b710e849ddfefa81c14a70631eec#file-add-c]

uint8_t add(uint8_t a, uint8_t b) {
  uint8_t sum = a + b;
  return sum < a ? UINT8_MAX : sum;
}

That’s such a trivial function that we probably wouldn’t write a test for it. If it compiles we’re somewhat confident it’ll work just fine:

$ clang -c -emit-llvm -o add.bc add.c

Note that the above command will not produce a binary or shared library, but instead instruct clang to emit LLVM bitcode and store it in add.bc. We’ll feed this into SAW in a minute.

Constant-time addition

Now imagine that we actually want to use add as part of a bignum library to implement cryptographic algorithms, and thus want it to have a constant runtime, independent of the arguments given. Here’s how you could do this:

cadd.c[gist.github.com/ttaubert/ecf5b710e849ddfefa81c14a70631eec#file-cadd-c]

// 0xff if MSB(x) = 1 else 0x00
uint8_t msb(uint8_t x) {
  return 0 - (x >> (8 * sizeof(x) - 1));
}

// 0xff if a < b else 0x00
uint8_t lt(uint8_t a, uint8_t b) {
  return msb(a ^ ((a ^ b) | ((a - b) ^ b)));
}

uint8_t add(uint8_t a, uint8_t b) {
  return (a + b) | lt(a + b, a);
}

If a + b < a, i.e. the addition overflows, lt(a + b, a) will return 0xff and change the return value into UINT8_MAX = 0xff. Otherwise it returns 0 and the return value will simply be a + b. That’s easy enough, but did we get msb and lt right?

$ clang -c -emit-llvm -o cadd.bc cadd.c

Let’s compile the constant-time add function to LLVM bitcode too and use SAW to prove that both our addition functions are equivalent to each other.

Writing the SAW script

SAW executes scripts to automate theorem proving, and we need to write one in order to check that our two implementations are equivalent. The first thing our script does is load the LLVM bitcode from the files we created earlier, add.bc and cadd.bc, as modules into the variables m1 and m2, respectively.

add.saw[gist.github.com/ttaubert/ecf5b710e849ddfefa81c14a70631eec#file-add-saw]

m1 <- llvm_load_module "add.bc";
m2 <- llvm_load_module "cadd.bc";

add <- llvm_extract m1 "add" llvm_pure;
cadd <- llvm_extract m2 "add" llvm_pure;

let thm = {{ \x y -> add x y == cadd x y }};
prove_print abc thm;

Next, we’ll extract the add functions defined in each of these modules and store them in add and cadd, the latter being our constant-time implementation. llvm_pure indicates that a function always returns the same result given the same arguments, and thus has no side-effects.

Last, we define a theorem thm stating that for all arguments x and y both functions have the same return value, that they are equivalent to each other. We choose to prove this theorem with the ABC tool from UC Berkeley.

We’re all set now, time to actually prove something.

Proving equivalence

Make sure you have saw and z3 in your $PATH. Run SAW and pass it the file we created in the previous section — it will execute the script and automatically prove our theorem.

$ saw add.saw
Loading module Cryptol
Loading file "add.saw"
Valid

Valid, that was easy. Maybe too easy. Would SAW even detect if we sneak a minor mistake into the program? Let’s find out…

 uint8_t lt(uint8_t a, uint8_t b) {
-  return msb(a ^ ((a ^ b) | ((a - b) ^ b)));
+  return msb(a ^ ((a ^ b) | ((a + b) ^ b)));
 }

The diff above changes the behavior of lt just slightly, a bug that we could have introduced by accident. Let’s run SAW again and see whether it spots it:

$ saw add.saw
Loading module Cryptol
Loading file "add.saw"
saw: user error ("prove_print" (add.saw:8:1):
prove: 1 unsolved subgoal(s)
Invalid: [x = 240, y = 0])

Invalid! The two functions disagree on the return value at [x = 240, y = 0]. SAW of course doesn’t know which function is at fault, but we are confident enough in our reference implementation to know where to look.

I can’t possibly explain how this all works in detail, but I can hopefully give you a rough idea. What SAW does is parse the LLVM bitcode and symbolically execute it on symbolic inputs to translate it into a circuit representation.

This circuit is then, together with our theorems, fed into a theorem prover. Z3 is an automated theorem prover, and ABC a tool for logic synthesis and verification; both are able to prove equality using automated reasoning.

Next: Some Cryptol and more SAW

In the second post I talk about verifying the implementation of a slightly more complex function, also written in C/C++, and show how you can use Cryptol to write a simple specification, as well as introduce more advanced SAW commands for verification.

If you found this interesting, play around with the examples above and come up with your own. Write a straightforward implementation of an algorithm that you can be certain to get right and then optimize it, make it constant-time, or change it in any other way and see how SAW behaves.

Notes on HACS 2017

2017-01-17T15:00:00+01:00

Real World Crypto is probably one of my favorite conferences. It’s a fine mix of practical and theoretical talks, plus a bunch of great hallway, lunch, and dinner conversations. It was broadcasted live for the first time this year, and the talks are available online. But I’m not going to talk more about RWC, others have covered it perfectly.

The HACS workshop

What I want to tell you about is a lesser-known event that took place right after RWC, called HACS - the High Assurance Crypto Software workshop. An intense, highly interactive two-day workshop in its second year, organized by Ben Laurie, Gilles Barthe, Peter Schwabe, Meredith Whittaker, and Trevor Perrin.

Its stated goal is to bring together crypto-implementers and verification people from open source, industry, and academia; introduce them and their projects to each other, and develop practical collaborations that improve verification of crypto code.

The projects & people

The formal verification community was represented by projects such as miTLS, HACL*, Project Everest, Z3, VeriFast, tis-interpreter, ct-verif, Cryptol/SAW, Entroposcope, and other formal verification and synthesis projects based on Coq or F*.

Crypto libraries were represented by one or multiple maintainers of OpenSSL, BoringSSL, Bouncy Castle, NSS, BearSSL, *ring*, and s2n. Other invited projects included LLVM, Tor, libFuzzer, BitCoin, and Signal. (I’m probably missing a few, sorry.)

Additionally, there were some attendants not directly involved with any of the above projects but who are experts in formal verification or synthesis, constant-time implementation of crypto algorithms, fast arithmetic in assembler, elliptic curves, etc.

All in all, somewhere between 70 and 80 people.

HACS - Day 1

After short one-sentence introductions on early Saturday morning we immediately started with simultaneous round-table discussions, focused on topics such as “The state of crypto libraries”, “Challenges in implementing crypto libraries”, “Efficient fuzzing”, “TLS implementation woes”, “The LLVM ecosystem”, “Fast and constant-time low-level algorithm implementations”, “Formal verification/synthesis with Coq”, and others.

These discussions were hosted by a rotating set of people, not always leading by pure expertise, sometimes also moderating, asking questions, and making sure we stay on track. We did this until lunch, and continued to talk over food with the people we just met. For the rest of the day, discussions became longer and more focused.

By this point people slowly started to sense what it is they want to focus on this weekend. They got to meet most of the other attendants, found out about their skills, projects, and ideas; thought about possibilities for collaboration on projects for this weekend or the months to come.

In the evening we split into groups and went for dinner. Most people’s brains were probably somewhat fried (as was mine) after hours of talking and discussing. Everyone was so engaged that you not once found the time to take out your laptop or phone, or had the desire to do so, which was great.

HACS - Day 2

The second day, early Sunday morning, continued much like the previous. We started off with a brainstorming session for what we think the group should be working on. The rest of the day was filled with long and focused discussion that were mostly a continuation from the day before.

A highlight of the day was the skill sharing session, where participants could propose a specific skill to share with others. If you didn’t find something to share you could be one of the 50% of the group that gets to learn from others.

My lucky pick was Chris Hawblitzel from Microsoft Research, who did his best to explain to me (in about 45 minutes) how Z3 works, what its limitations are, and what higher-level languages exist that make it a little more usable. Thank you, Chris!

We ended the day with signing up for one or multiple projects for the last day.

HACS - Day 3

The third day of the workshop was optional, a hacking day with maybe roughly 50% attendance. Some folks took the chance to arrive a little later after two days of intense discussions and socializing. By now you knew most people’s names, and you better did because no one cared to wear name tags anymore.

It was the time to get together with the people from the projects you signed up for and get your laptop out (if needed). I can’t possibly remember all the things people worked on but here are a few examples:

Verify DRBG implementations, various other crypto algorithms, and/or integrate synthesized implementations for different crypto libraries.
Brainstorm and experiment with a generic comparative fuzzing API for libFuzzer.
Come up with an ASCII representation for TLS records, similar to DER ASCII, that could be used to write TLS implementation tests or feed fuzzers.
Start fuzzing projects like BearSSL and Tor. I do remember that at least BearSSL quickly found a tiny (~900 byte) buffer overflow :)

See you again next year?

I want to thank all the organizers (and sponsors) for spending their time (or money) planning and hosting such a great event. It always pays off to bring communities closer together and foster collaboration between projects and individuals.

I got to meet dozens of highly intelligent and motivated people, and left with a much bigger sense of community. I’m grateful to all the attendants that participated in discussions and projects, shared their skills, asked hard questions, and were always open to suggestions from others.

I hope to be invited again to future workshops and check in on the progress we’ve made at improving the verification and quality assurance of crypto code across the ecosystem.

TLS Version Intolerance

2016-09-30T16:00:00+02:00

A few weeks ago I listened to Hanno Böck talk about TLS version intolerance at the Berlin AppSec & Crypto Meetup. He explained how with TLS 1.3 just around the corner there again are growing concerns about faulty TLS stacks found in HTTP servers, load balancers, routers, firewalls, and similar software and devices.

I decided to dig a little deeper and will use this post to explain version intolerance, how version fallbacks work and why they’re insecure, as well as describe the downgrade protection mechanisms available in TLS 1.2 and 1.3. It will end with a look at version negotiation in TLS 1.3 and a proposal that aims to prevent similar problems in the future.

What is version intolerance?

Every time a new TLS version is specified, browsers usually are the fastest to implement and update their deployments. Most major browser vendors have a few people involved in the standardization process to guide the standard and give early feedback about implementation issues.

As soon as the spec is finished, and often far before that feat is done, clients will have been equipped with support for the new TLS protocol version and happily announce this to any server they connect to:

Client: Hi! The highest TLS version I support is 1.2.
Server: Hi! I too support TLS 1.2 so let’s use that to communicate.
[TLS 1.2 connection will be established.]

In this case the highest TLS version supported by the client is 1.2, and so the server picks it because it supports that as well. Let’s see what happens if the client supports 1.2 but the server does not:

Client: Hi! The highest TLS version I support is 1.2.
Server: Hi! I only support TLS 1.1 so let’s use that to communicate.
[TLS 1.1 connection will be established.]

This too is how it should work if a client tries to connect with a protocol version unknown to the server. Should the client insist on any specific version and not agree with the one picked by the server it will have to terminate the connection.

Unfortunately, there are a few servers and more devices out there that implement TLS version negotiation incorrectly. The conversation might go like this:

Client: Hi! The highest TLS version I support is 1.2.
Server: ALERT! I don’t know that version. Handshake failure.
[Connection will be terminated.]

Or:

Client: Hi! The highest TLS version I support is 1.2.
Server: TCP FIN! I don’t know that version.
[Connection will be terminated.]

Or even worse:

Client: Hi! The highest TLS version I support is 1.2.
Server: (I don’t know this version so let’s just not respond.)
[Connection will hang.]

The same can happen with the infamous F5 load balancer that can’t handle ClientHello messages with a length between 256 and 512 bytes. Other devices abort the connection when receiving a large ClientHello split into multiple TLS records. TLS 1.3 might actually cause more problems of this kind due to more extensions and client key shares.

What are version fallbacks?

As browsers usually want to ship new TLS versions as soon as possible, more than a decade ago vendors saw a need to prevent connection failures due to version intolerance. The easy solution was to decrease the advertised version number by one with every failed attempt:

Client: Hi! The highest TLS version I support is 1.2.
Server: ALERT! Handshake failure. (Or FIN. Or hang.)
[TLS version fallback to 1.1.]
Client: Hi! The highest TLS version I support is 1.1.
Server: Hi! I support TLS 1.1 so let’s use that to communicate.
[TLS 1.1 connection will be established.]

A client supporting everything from TLS 1.0 to TLS 1.2 would start trying to establish a 1.2 connection, then a 1.1 connection, and if even that failed a 1.0 connection.

Why are these insecure?

What makes these fallbacks insecure is that the connection can be downgraded by a MITM, by sending alerts or TCP packets to the client, or blocking packets from the server. To the client this is indistinguishable from a network error.

The POODLE attack is one example where an attacker abuses the version fallback to force an SSL 3.0 connection. In response to this browser vendors disabled version fallbacks to SSL 3.0, and then SSL 3.0 entirely, to prevent even up-to-date clients from being exploited. Insecure version fallback in browsers pretty much break the actual version negotiation mechanisms.

Version fallbacks have been disabled since Firefox 37 and Chrome 50. Browser telemetry data showed it was no longer necessary as after years, TLS 1.2 and correct version negotiation was deployed widely enough.

The TLS_FALLBACK_SCSV cipher suite

You might wonder if there’s a secure way to do version fallbacks, and other people did so too. Adam Langley and Bodo Möller proposed a special cipher suite in RFC 7507 that would help a client detect whether the downgrade was initiated by a MITM.

Whenever the client includes TLS_FALLBACK_SCSV {0x56, 0x00} in the list of cipher suites it signals to the server that this is a repeated connection attempt, but this time with a version lower than the highest it supports, because previous attempts failed. If the server supports a higher version than advertised by the client, it MUST abort the connection.

The drawback here however is that a client even if it implements fallback with a Signaling Cipher Suite Value doesn’t know the highest protocol version supported by the server, and whether it implements a TLS_FALLBACK_SCSV check. Common web servers will likely be updated faster than others, but router or load balancer manufacturers might not deem it important enough to implement and ship updates for.

Signatures in TLS 1.2

It’s been long known to be problematic that signatures in TLS 1.2 don’t cover the list of cipher suites and other messages sent before server authentication. They sign the ephemeral DH parameters sent by the server and include the *Hello.random values as nonces to prevent replay attacks:

h = Hash(ClientHello.random + ServerHello.random + ServerParams)

Signing at least the list of cipher suites would have helped prevent downgrade attacks like FREAK and Logjam. TLS 1.3 will sign all messages before server authentication, even though it makes Transcript Collision Attacks somewhat easier to mount. With SHA-1 not allowed for signatures that will hopefully not become a problem anytime soon.

Downgrade Sentinels in TLS 1.3

With neither the client version nor its cipher suites (for the SCSV) included in the hash signed by the server’s certificate in TLS 1.2, how do you secure TLS 1.3 against downgrades like FREAK and Logjam? Stuff a special value into ServerHello.random.

The TLS WG decided to put static values (sometimes called downgrade sentinels) into the server’s nonce sent with the ServerHello message. TLS 1.3 servers responding to a ClientHello indicating a maximum supported version of TLS 1.2 MUST set the last eight bytes of the nonce to:

0x44 0x4F 0x57 0x4E 0x47 0x52 0x44 0x01

If the client advertises a maximum supported version of TLS 1.1 or below the server SHOULD set the last eight bytes of the nonce to:

0x44 0x4F 0x57 0x4E 0x47 0x52 0x44 0x00

If not connecting with a downgraded version, a client MUST check whether the server nonce ends with any of the two sentinels and in such a case abort the connection. The TLS 1.3 spec here introduces an update to TLS 1.2 that requires servers and clients to update their implementation.

Unfortunately, this downgrade protection relies on a ServerKeyExchange message being sent and is thus of limited value. Static RSA key exchanges are still valid in TLS 1.2, and unless the server admin disables all non-forward-secure cipher suites the protection can be bypassed.

The comeback of insecure fallbacks?

Current measurements show that enabling TLS 1.3 by default would break a significant fraction of TLS handshakes due to version intolerance. According to Ivan Ristić, as of July 2016, 3.2% of servers from the SSL Pulse data set reject TLS 1.3 handshakes.

This a very high number and would affect way too many people. Alas, with TLS 1.3 we have only limited downgrade protection for forward-secure cipher suites. And that is assuming that most servers either support TLS 1.3 or update their 1.2 implementations. TLS_FALLBACK_SCSV, if supported by the server, will help as long as there are no attacks tampering with the list of cipher suites.

The TLS working group has been thinking about how to handle intolerance without bringing back version fallbacks, and there might be light at the end of the tunnel.

Version negotiation with extensions

The next version of the proposed TLS 1.3 spec, draft 16, will introduce a new version negotiation mechanism based on extensions. The current ClientHello.version field will be frozen to TLS 1.2, i.e. {3, 3}, and renamed to legacy_version. Any number greater than that MUST be ignored by servers.

To negotiate a TLS 1.3 connection the protocol now requires the client to send a supported_versions extension. This is a list of versions the client supports, in preference order, with the most preferred version first. Clients MUST send this extension as servers are required to negotiate TLS 1.2 if it’s not present. Any version number unknown to the server MUST be ignored.

This still leaves potential problems with big ClientHello messages or choking on unknown extensions unaddressed, but according to David Benjamin the main problem is ClientHello.version. We will hopefully be able to ship browsers that have TLS 1.3 enabled by default, without bringing back insecure version fallbacks.

However, it’s not unlikely that implementers will screw up even the new version negotiation mechanism and we’ll have similar problems in a few years down the road.

GREASE-ing the future

David Benjamin, following Adam Langley’s advice to have one joint and keep it well oiled, proposed GREASE (Generate Random Extensions And Sustain Extensibility), a mechanism to prevent extensibility failures in the TLS ecosystem.

The heart of the mechanism is to have clients inject “unknown values” into places where capabilities are advertised by the client, and the best match selected by the server. Servers MUST ignore unknown values to allow introducing new capabilities to the ecosystem without breaking interoperability.

These values will be advertised pseudo-randomly to break misbehaving servers early in the implementation process. Proposed injection points are cipher suites, supported groups, extensions, and ALPN identifiers. Should the server respond with a GREASE value selected in the ServerHello message the client MUST abort the connection.

Continuous Integration for NSS

2016-08-09T16:00:00+02:00

The following image shows our TreeHerder dashboard after pushing a changeset to the NSS repository. It is the result of only a few weeks of work (on our side):

Based on my experience from building a Taskcluster CI for NSS over the last weeks, I want to share a rough outline of the process of setting this up for basically any Mozilla project, using NSS as an example.

What is the goal?

The development of NSS has for a long time been heavily supported by a fleet of buildbots. You can see them in action by looking at our waterfall diagram showing the build and test statuses of the latest pushes to the NSS repository.

Unfortunately, this setup is rather complex and the bots are slow. Build and test tasks are run sequentially and so on some machines it takes 10-15 hours before you will be notified about potential breakage.

The first thing that needs to be done is to replicate the current setup as good as possible and then split monolithic test runs into many small tasks that can be run in parallel. Builds will be prepared by build tasks, test tasks will later download those pieces (called artifacts) to run tests.

A good turnaround time is essential, ideally one should know whether a push broke the tree after not more than 15-30 minutes. We want a TreeHerder dashboard that gives a good overview of all current build and test tasks, as well as an IRC and email notification system so we don’t have to watch the tree all day.

Docker for Linux tasks

To build and test on Linux, Taskcluster uses Docker. The build instructions for the image containing all NSS dependencies, as well as the scripts to build and run tests, can be found in the automation/taskcluster/docker directory.

For a start, the fastest way to get something up and running (or building) is to use ADD in the Dockerfile to bake your scripts into the image. That way you can just pass them as the command in the task definition later.

# Add build and test scripts.
ADD bin /home/worker/bin
RUN chmod +x /home/worker/bin/*

Once you have NSS and its tests building and running in a local Docker container, the next step is to run a Taskcluster task in the cloud. You can use the Task Creator to spawn a one-off task, experiment with your Docker image, and with the task definition. Taskcluster will automatically pull your image from Docker Hub:

{

  "created": " ... ",
  "deadline": " ... ",
  "payload": {
    "image": "ttaubert/nss-ci:0.0.21",
    "command": [
      "/bin/bash",
      "-c",
      "bin/build.sh"
    ],
    "maxRunTime": 3600
  },

}

Docker and task definitions are well-documented, so this step shouldn’t be too difficult and you should be able to confirm everything runs fine. Now instead of kicking off tasks manually the next logical step is to spawn tasks automatically when changesets are pushed to the repository.

Using taskcluster-github

Triggering tasks on repository pushes should remind you of Travis CI, CircleCI, or AppVeyor, if you worked with any of those before. Taskcluster offers a similar tool called taskcluster-github that uses a configuration file in the root of your repository for task definitions.

If your master is a Mercurial repository then it’s very helpful that you don’t have to mess with it until you get the configuration right, and can instead simply create a fork on GitHub. The documentation is rather self-explanatory, and the task definition is similar to the one used by the Task Creator.

Once the WebHook is set up and receives pings, a push to your fork will make “Lisa Lionheart”, the Taskcluster bot, comment on your push and leave either an error message or a link to the task graph. If on the first try you see failures about missing scopes you are lacking permissions and should talk to the nice folks over in #taskcluster.

Move scripts into the repository

Once you have a GitHub fork spawning build and test tasks when pushing you should move all the scripts you wrote so far into the repository. The only script left on the Docker image would be a script that checks out the hg/git repository and then uses the scripts in the tree to build and run tests.

This step will pay off very early in the process, rebuilding and pushing the Docker image to Docker Hub is something that you really don’t want to do too often. All NSS scripts for Linux live in the automation/taskcluster/scripts directory.

#!/usr/bin/env bash

set -v -e -x

if [ $(id -u) = 0 ]; then
    # Drop privileges by re-running this script.
    exec su worker $0 $@
fi

# Do things here ...

Use the above snippet as a template for your scripts. It will set a few flags that help with debugging later, drop root privileges, and rerun it as the unprivileged worker user. If you need to do things as root before building or running tests, just put them before the exec su ... call.

Split build and test runs

Taskcluster encourages many small tasks. It’s easy to split the big monolithic test run I mentioned at the beginning into multiple tasks, one for each test suite. However, you wouldn’t want to build NSS before every test run again, so we should build it only once and then reuse the binary. Taskcluster allows to leave artifacts after a task run that can then be downloaded by subtasks.

# Build.
cd nss && make nss_build_all

# Package.
mkdir artifacts
tar cvfjh artifacts/dist.tar.bz2 dist

The above snippet builds NSS and creates an archive containing all the binaries and libraries. You need to let Taskcluster know that there’s a directory with artifacts so that it picks those up and makes them available to the public.

{

  "created": " ... ",
  "deadline": " ... ",
  "payload": {
    "image": "ttaubert/nss-ci:0.0.21",
    "artifacts": {
      "public": {
        "type": "directory",
        "path": "/home/worker/artifacts",
        "expires": " ... "
      }
    },
    "command": [
      ...
    ],
    "maxRunTime": 3600
  },

}

The test task then uses the $TC_PARENT_TASK_ID environment variable to determine the correct download URL, unpacks the build and starts running tests. Making artifacts automatically available to subtasks, without having to pass the parent task ID and build a URL, will hopefully be added to Taskcluster in the future.

# Fetch build artifact.
curl --retry 3 -Lo dist.tar.bz2 https://queue.taskcluster.net/v1/task/$TC_PARENT_TASK_ID/artifacts/public/dist.tar.bz2
tar xvjf dist.tar.bz2

# Run tests.
cd nss/tests && ./all.sh

Writing decision tasks

Specifying task dependencies in your .taskcluster.yml file is unfortunately not possible at the moment. Even though the set of builds and tasks you want may be static you can’t create the necessary links without knowing the random task IDs assigned to them.

Your only option is to create a so-called decision task. A decision task is the only task defined in your .taskcluster.yml file and started after you push a new changeset. It will leave an artifact in the form of a JSON file that Taskcluster picks up and uses to extend the task graph, i.e. schedule further tasks with appropriate dependencies. You can use whatever tool or language you like to generate these JSON files, e.g. Python, Ruby, Node, …

task:
  payload:
    image: "ttaubert/nss-ci:0.0.21"

    maxRunTime: 1800

    artifacts:
      public:
        type: "directory"
        path: "/home/worker/artifacts"
        expires: "7 days"

    graphs:
      - /home/worker/artifacts/graph.json

All task graph definitions including the Node.JS build script for NSS can be found in the automation/taskcluster/graph directory. Depending on the needs of your project you might want to use a completely different structure. All that matters is that in the end you produce a valid JSON file. Slightly more intelligent decision tasks can be used to implement features like try syntax.

mozilla-taskcluster for Mercurial projects

If you have all of the above working with GitHub but your main repository is hosted on hg.mozilla.org you will want to have Mercurial spawn decision tasks when pushing.

The Taskcluster team is working on making .taskcluster.yml files work for Mozilla-hosted Mercurial repositories too, but while that work isn’t finished yet you have to add your project to mozilla-taskcluster. mozilla-taskcluster will listen for pushes and then kick off tasks just like the WebHook.

TreeHerder Configuration

A CI is no CI without a proper dashboard. That’s the role of TreeHerder at Mozilla. Add your project to the end of the repository.json file and create a new pull request. It will usually take a day or two after merging until your change is deployed and your project shows up in the dashboard.

TreeHerder gets the per-task configuration from the task definition. You can configure the symbol, the platform and collection (i.e. row), and other parameters. Here’s the configuration data for the green B at the start of the fifth row of the image at the top of this post:

{

  "created": " ... ",
  "deadline": " ... ",
  "payload": {
    ...
  },
  "extra": {
    "treeherder": {
      "jobKind": "build",
      "symbol": "B",
      "build": {
        "platform": "linux64"
      },
      "machine": {
        "platform": "linux64"
      },
      "collection": {
        "debug": true
      }
    }
  }

}

IRC and email notifications

Taskcluster is a very modular system and offers many APIs. It’s built with mostly Node, and thus there are many Node libraries available to interact with the many parts. The communication between those is realized by Pulse, a managed RabbitMQ cluster.

The last missing piece we wanted is an IRC and email notification system, a bot that notifies about failures on IRC and sends emails to all parties involved. It was a piece of cake to write nss-tc that uses Taskcluster Node.JS libraries and Mercurial JSON APIs to connect to the task queue and listen for task definitions and failures.

A rough overview

I could have probably written a detailed post for each of the steps outlined here but I think it’s much more helpful to start with an overview of what’s needed to get the CI for a project up and running. Each step and each part of the system is hopefully more obvious now if you haven’t had too much interaction with Taskcluster and TreeHerder so far.

Thanks to the Taskcluster team, especially John Ford, Greg Arndt, and Pete Moore! They helped us pull this off in a matter of weeks and besides Linux builds and tests we already have Windows tasks, static analysis, ASan+LSan, and are in the process of setting up workers for ARM builds and tests.

The Evolution of Signatures in TLS

2016-07-26T16:00:00+02:00

This post will take a look at the evolution of signature algorithms and schemes in the TLS protocol since version 1.0. I at first started taking notes for myself but then decided to polish and publish them, hoping that others will benefit as well.

(Let’s ignore client authentication for simplicity.)

Signature algorithms in TLS 1.0 and TLS 1.1

In TLS 1.0 as well as TLS 1.1 there are only two supported signature schemes: RSA with MD5/SHA-1 and DSA with SHA-1. The RSA here stands for the PKCS#1 v1.5 signature scheme, naturally.

select (SignatureAlgorithm)
{
    case rsa:
        digitally-signed struct {
            opaque md5_hash[16];
            opaque sha_hash[20];
        };
    case dsa:
        digitally-signed struct {
            opaque sha_hash[20];
        };
} Signature;

An RSA signature signs the concatenation of the MD5 and SHA-1 digest, the DSA signature only the SHA-1 digest. Hashes will be computed as follows:

h = Hash(ClientHello.random + ServerHello.random + ServerParams)

The ServerParams are the actual data to be signed, the *Hello.random values are prepended to prevent replay attacks. This is the reason TLS 1.3 puts a downgrade sentinel at the end of ServerHello.random for clients to check.

The ServerKeyExchange message containing the signature is sent only when static RSA/DH key exchange is not used, that means we have a DHE_* cipher suite, an RSA_EXPORT_* suite downgraded due to export restrictions, or a DH_anon_* suite where both parties don’t authenticate.

Signature algorithms in TLS 1.2

TLS 1.2 brought bigger changes to signature algorithms by introducing the signature_algorithms extension. This is a ClientHello extension allowing clients to signal supported and preferred signature algorithms and hash functions.

enum {
    none(0), md5(1), sha1(2), sha224(3), sha256(4), sha384(5), sha512(6)
} HashAlgorithm;

enum {
    anonymous(0), rsa(1), dsa(2), ecdsa(3)
} SignatureAlgorithm;

struct {
    HashAlgorithm hash;
    SignatureAlgorithm signature;
} SignatureAndHashAlgorithm;

If a client does not include the signature_algorithms extension then it is assumed to support RSA, DSA, or ECDSA (depending on the negotiated cipher suite) with SHA-1 as the hash function.

Besides adding all SHA-2 family hash functions, TLS 1.2 also introduced ECDSA as a new signature algorithm. Note that the extension does not allow to restrict the curve used for a given scheme, P-521 with SHA-1 is therefore perfectly legal.

A new requirement for RSA signatures is that the hash has to be wrapped in a DER-encoded DigestInfo sequence before passing it to the RSA sign function.

DigestInfo ::= SEQUENCE {
    digestAlgorithm DigestAlgorithm,
    digest OCTET STRING
}

This unfortunately led to attacks like Bleichenbacher’06 and BERserk because it turns out handling ASN.1 correctly is hard. As in TLS 1.1, a ServerKeyExchange message is sent only when static RSA/DH key exchange is not used. The hash computation did not change either:

h = Hash(ClientHello.random + ServerHello.random + ServerParams)

Signature schemes in TLS 1.3

The signature_algorithms extension introduced by TLS 1.2 was revamped in TLS 1.3 and MUST now be sent if the client offers a single non-PSK cipher suite. The format is backwards compatible and keeps some old code points.

enum {
    /* RSASSA-PKCS1-v1_5 algorithms */
    rsa_pkcs1_sha1 (0x0201),
    rsa_pkcs1_sha256 (0x0401),
    rsa_pkcs1_sha384 (0x0501),
    rsa_pkcs1_sha512 (0x0601),

    /* ECDSA algorithms */
    ecdsa_secp256r1_sha256 (0x0403),
    ecdsa_secp384r1_sha384 (0x0503),
    ecdsa_secp521r1_sha512 (0x0603),

    /* RSASSA-PSS algorithms */
    rsa_pss_sha256 (0x0700),
    rsa_pss_sha384 (0x0701),
    rsa_pss_sha512 (0x0702),

    /* EdDSA algorithms */
    ed25519 (0x0703),
    ed448 (0x0704),

    /* Reserved Code Points */
    private_use (0xFE00..0xFFFF)
} SignatureScheme;

Instead of SignatureAndHashAlgorithm, a code point is now called a SignatureScheme and tied to a hash function (if applicable) by the specification. TLS 1.2 algorithm/hash combinations not listed here are deprecated and MUST NOT be offered or negotiated.

New code points for RSA-PSS schemes, as well as Ed25519 and Ed448-Goldilocks were added. ECDSA schemes are now tied to the curve given by the code point name, to be enforced by implementations. SHA-1 signature schemes SHOULD NOT be offered, if needed for backwards compatibility then only as the lowest priority after all other schemes.

The current draft-13 lists RSASSA-PSS as the only valid signature algorithm allowed to sign handshake messages with an RSA key. The rsa_pkcs1_* values solely refer to signatures which appear in certificates and are not defined for use in signed handshake messages.

To prevent various downgrade attacks like FREAK and Logjam the computation of the hashes to be signed has changed significantly and covers the complete handshake, up until CertificateVerify:

h = Hash(Handshake Context + Certificate) + Hash(Resumption Context)

This includes amongst other data the client and server random, key shares, the cipher suite, the certificate, and resumption information to prevent replay and downgrade attacks. With static key exchange algorithms gone the CertificateVerify message is now the one carrying the signature.

Six Months as a Security Engineer

2016-05-13T18:00:00+02:00

It’s been a little more than six months since I officially switched to the Security Engineering team here at Mozilla to work on NSS and related code. I thought this might be a good time to share what I’ve been up to in a short status update:

Removed SSLv2 code from NSS

NSS contained quite a lot of SSLv2-specific code that was waiting to be removed. It was not compiled by default so there was no way to enable it in Firefox even if you wanted to. The removal was rather straightforward as the protocol changed significantly with v3 and most of the code was well separated. Good riddance.

Added ChaCha20/Poly1305 cipher suites to Firefox

Adam Langley submitted a patch to bring ChaCha20/Poly1305 cipher suites to NSS already two years ago but at that time we likely didn’t have enough resources to polish and land it. I picked up where he left and updated it to conform to the slightly updated specification. Firefox 47 will ship with two new ECDHE/ChaCha20 cipher suites enabled.

RSA-PSS for TLS v1.3 and the WebCrypto API

Ryan Sleevi, also a while ago, implemented RSA-PSS in freebl, the lower cryptographic layer of NSS. I hooked it up to some more APIs so Firefox can support RSA-PSS signatures in its WebCrypto API implementation. In NSS itself we need it to support new handshake signatures in our experimental TLS v1.3 code.

Improve continuous integration for NSS

Kai Engert from RedHat is currently doing a hell of a job maintaining quite a few buildbots that run all of our NSS tests whenever someone pushes a new changeset. Unfortunately the current setup doesn’t scale too well and the machines are old and slow.

Similar to e.g. Travis CI, Mozilla maintains its own continuous integration and release infrastructure, called TaskCluster. Using TaskCluster we now have an experimental Docker image that builds NSS/NSPR and runs all of our 17 (so far) test suites. The turnaround time is already very promising. This is an ongoing effort, there are lots of things left to do.

Joined the WebCrypto working group

I’ve been working on the Firefox WebCrypto API implementation for a while, long before I switched to the Security Engineering team, and so it made sense to join the working group to help finalize the specification. I’m unfortunately still struggling to carve out more time for involvement with the WG than just attending meetings and representing Mozilla.

Added HKDF to the WebCrypto API

The main reason the WebCrypto API in Firefox did not support HKDF until recently is that no one found the time to implement it. I finally did find some time and brought it to Firefox 46. It is fully compatible to Chrome’s implementation (RFC 5869), the WebCrypto specification still needs to be updated to reflect those changes.

Added SHA-2 for PBKDF2 in the WebCrypto API

Since we shipped the first early version of the WebCrypto API, SHA-1 was the only available PRF to be used with PBKDF2. We now support PBKDF2 with SHA-2 PRFs as well.

Improved the Firefox WebCrypto API threading model

Our initial implementation of the WebCrypto API would naively spawn a new thread every time a crypto.subtle.* method was called. We now use a thread pool per process that is able to handle all incoming API calls much faster.

Added WebCrypto API to Workers and ServiceWorkers

After working on this on and off for more than six months, so even before I officially joined the security engineering team, I managed to finally get it landed, with a lot of help from Boris Zbarsky who had to adapt our WebIDL code generation quite a bit. The WebCrypto API can now finally be used from (Service)Workers.

What’s next?

In the near future I’ll be working further on improving our continuous integration infrastructure for NSS, and clean up the library and its tests. I will hopefully find the time to write more about it as we progress.

A Fast, Constant-time AEAD for TLS

2016-04-29T15:00:00+02:00

The only TLS v1.2+ cipher suites with a dedicated AEAD scheme are the ones using AES-GCM, a block cipher mode that turns AES into an authenticated cipher. From a cryptographic point of view these are preferable to non-AEAD-based cipher suites (e.g. the ones with AES-CBC) because getting authenticated encryption right is hard without using dedicated ciphers.

For CPUs without the AES-NI instruction set, constant-time AES-GCM however is slow and also hard to write and maintain. The majority of mobile phones, and mostly cheaper devices like tablets and notebooks on the market thus cannot support efficient and safe AES-GCM cipher suite implementations.

Even if we ignored all those aforementioned pitfalls we still wouldn’t want to rely on AES-GCM cipher suites as the only good ones available. We need more diversity. Having widespread support for cipher suites using a second AEAD is necessary to defend against weaknesses in AES or AES-GCM that may be discovered in the future.

ChaCha20 and Poly1305, a stream cipher and a message authentication code, were designed with fast and constant-time implementations in mind. A combination of those two algorithms yields a safe and efficient AEAD construction, called ChaCha20/Poly1305, which allows TLS with a negligible performance impact even on low-end devices.

Firefox 47 will ship with two new ECDHE/ChaCha20 cipher suites as specified in the latest draft. We are looking forward to see the adoption of these increase and will, as a next step, work on prioritizing them over AES-GCM suites on devices not supporting AES-NI.

Build Your Own Signal Desktop

2016-01-15T15:00:00+01:00

The Signal Private Messenger is great. Use it. It’s probably the best secure messenger on the market. When recently a desktop app was announced people were eager to join the beta and even happier when an invite finally showed up in their inbox. So was I, it’s a great app and works surprisingly well for an early version.

The only problem is that it’s a Chrome App. Apart from excluding folks with other browsers it’s also a shitty user experience. If you too want your messaging app not tied to a browser then let’s just build our own standalone variant of Signal Desktop.

NW.js beta with Chrome App support

Signal Desktop is a Chrome App, so the easiest way to turn it into a standalone app is to use NW.js. Conveniently, their next release v0.13 will ship with Chrome App support and is available for download as a beta version.

First, make sure you have git and npm installed. Then open a terminal and prepare a temporary build directory to which we can download a few things and where we can build the app:

$ mkdir signal-build
$ cd signal-build

[OS X] Packaging Signal and NW.js

Download the latest beta of NW.js and unzip it. We’ll extract the application and use it as a template for our Signal clone. The NW.js project does unfortunately not seem to provide a secure source (or at least hashes) for their downloads.

$ wget http://dl.nwjs.io/v0.14.4/nwjs-sdk-v0.14.4-osx-x64.zip
$ unzip nwjs-sdk-v0.14.4-osx-x64.zip
$ cp -r nwjs-sdk-v0.14.4-osx-x64/nwjs.app SignalPrivateMessenger.app

Next, clone the Signal repository and use NPM to install the necessary modules. Run the grunt automation tool to build the application.

$ git clone https://github.com/WhisperSystems/Signal-Desktop.git
$ cd Signal-Desktop/
$ npm install
$ node_modules/grunt-cli/bin/grunt

Finally, simply to copy the dist folder containing all the juicy Signal files into the application template we created a few moments ago.

$ cp -r dist ../SignalPrivateMessenger.app/Contents/Resources/app.nw
$ open ..

The last command opens a Finder window. Move SignalPrivateMessenger.app to your Applications folder and launch it as usual. You should now see a welcome page!

[Linux] Packaging Signal and NW.js

The build instructions for Linux aren’t too different but I’ll write them down, if just for convenience. Start by cloning the Signal Desktop repository and build.

$ git clone https://github.com/WhisperSystems/Signal-Desktop.git
$ cd Signal-Desktop/
$ npm install
$ node_modules/grunt-cli/bin/grunt

The dist folder contains the app, ready to be launched. zip it and place the resulting package somewhere handy.

$ cd dist
$ zip -r ../../package.nw *

Back to the top. Download the NW.js binary, extract it, and change into the newly created directory. Move the package.nw file we created earlier next to the nw binary and we’re done. The nwjs-sdk-v0.13.0-beta3-linux-x64 folder does now contain the standalone Signal app.

$ cd ../..
$ wget http://dl.nwjs.io/v0.14.4/nwjs-sdk-v0.14.4-linux-x64.tar.gz
$ tar xfz nwjs-sdk-v0.14.4-linux-x64.tar.gz
$ cd nwjs-sdk-v0.14.4-linux-x64
$ mv ../package.nw .

Finally, launch NW.js. You should see a welcome page!

$ ./nw

If you see something, file something

Our standalone Signal clone mostly works, but it’s far from perfect. We’re pulling from master and that might bring breaking changes that weren’t sufficiently tested.

We don’t have the right icons. The app crashes when you click a media message. It opens a blank popup when you click a link. It’s quite big because also NW.js has bugs and so we have to use the SDK build for now. In the future it would be great to have automatic updates, and maybe even signed builds.

Remember, Signal Desktop is beta, and completely untested with NW.js. If you want to help file bugs, but only after checking that those affect the Chrome App too. If you want to fix a bug only occurring in the standalone version it’s probably best to file a pull request and cross fingers.

Is this secure?

Great question! I don’t know. I would love to get some more insights from people that know more about the NW.js security model and whether it comes with all the protections Chromium can offer. Another interesting question is whether bundling Signal Desktop with NW.js is in any way worse (from a security perspective) than installing it as a Chrome extension. If you happen to have an opinion about that, I would love to hear it.

Another important thing to keep in mind is that when building Signal on your own you will possibly miss automatic and signed security updates from the Chrome Web Store. Keep an eye on the repository and rebuild your app from time to time to not fall behind too much.

More Privacy, Less Latency

2015-11-16T18:00:00+01:00

Please note that this post is about draft-11 of the TLS v1.3 standard.

TLS must be fast. Adoption will greatly benefit from speeding up the initial handshake that authenticates and secures the connection. You want to get the protocol out of the way and start delivering data to visitors as soon as possible. This is crucial if we want the web to succeed at deprecating non-secure HTTP.

Let’s start by looking at full handshakes as standardized in TLS v1.2, and then continue to abbreviated handshakes that decrease connection times for resumed sessions. Once we understand the current protocol we can proceed to proposals made in the latest TLS v1.3 draft to achieve full 1-RTT and even 0-RTT handshakes.

It helps if you already have a rough idea of how TLS and Diffie-Hellman work as I can’t go into every detail. The focus of this post is on comparing current and future handshakes and I might omit a few technicalities to get basic ideas across more easily.

Full TLS 1.2 Handshake (static RSA)

Static RSA is a straightforward key exchange method, available since SSLv2. After sharing basic protocol information via the ClientHello and ServerHello messages the server sends its certificate to the client. ServerHelloDone signals that for now there will be no further messages until the client responds.

The client then encrypts the so-called premaster secret with the server’s public key found in the certificate and wraps it in a ClientKeyExchange message. ChangeCipherSpec signals that from now on messages will be encrypted. Finished, the first message to be encrypted and the client’s last message of the handshake, contains a MAC of all handshake messages exchanged thus far to prove that both parties saw the same messages, without interference from a MITM.

The server decrypts the premaster secret found in the ClientKeyExchange message using its certificate’s private key, and derives the master secret and communication keys. It then too signals a switch to encrypted communication and completes the handshake. It takes two round-trips to establish a connection.

Authentication: With static RSA key exchanges, the connection is authenticated by encrypting the premaster secret with the server certificate’s public key. Only the server in possession of the private key can decrypt, correctly derive the master secret, and send an encrypted Finished message with the right MAC.

The simplicity of static RSA has a serious drawback: it does not offer forward secrecy. If a passive adversary records all traffic to a server then every recorded TLS session can be broken later by obtaining the certificate’s private key.

This key exchange method will be removed in TLS v1.3.

Full TLS 1.2 Handshake (ephemeral DH)

A full handshake using (Elliptic Curve) Diffie-Hellman to exchange ephemeral keys is very similar to the flow of static RSA. The main difference is that after sending the certificate the server will also send a ServerKeyExchange message. This message contains either the parameters of a DH group or of an elliptic curve, paired with an ephemeral public key computed by the server.

The client too computes an ephemeral public key compatible with the given parameters and sends it to the server. Knowing their private keys and the other party’s public key both sides should now share the same premaster secret and can derive a shared master secret.

Authentication: With (EC)DH key exchanges it’s still the certificate that must be signed by a CA listed in the client’s trust store. To authenticate the connection the server will sign the parameters contained in ServerKeyExchange with the certificate’s private key. The client verifies the signature with the certificate’s public key and only then proceeds with the handshake.

Abbreviated Handshakes in TLS 1.2

Since SSLv2 clients have been able to use session identifiers as a way to resume previously established TLS/SSL sessions. Session resumption is important because a full handshake can take time: it has a high latency as it needs two round-trips and might involve expensive computation to exchange keys, or sign and verify certificates.

Session IDs, assigned by the server, are unique identifiers under which both parties store the master secret and other details of the connection they established. The client may include this ID in the ClientHello message of the next handshake to short-circuit the negotiation and reuse previous connection parameters.

If the server is willing and able to resume the session it responds with a ServerHello message including the Session ID given by the client. This handshake is effectively 1-RTT as the client can send application data immediately after the Finished message.

Sites with lots of visitors will have to manage and secure big session caches, or risk pushing out saved sessions too quickly. A setup involving multiple load-balanced servers will need to securely synchronize session caches across machines. The forward secrecy of a connection is bounded by how long session information is retained on servers.

Session tickets, created by the server and stored by the client, are blobs containing all necessary information about a connection, encrypted by a key only known to the server. If the client presents this tickets with the ClientHello message, and proves that it knows the master secret stored in the ticket, the session will be resumed.

A server willing and able to decrypt the given ticket responds with a ServerHello message including an empty SessionTicket extension, otherwise the extension would be omitted completely. As with session IDs, the client will start sending application data immediately after the Finished message to achieve 1-RTT.

To not affect the forward secrecy provided by (EC)DHE suites session ticket keys should be rotated periodically, otherwise stealing the ticket key would allow recovering recorded sessions later. In a setup with multiple load-balanced servers the main challenge here is to securely generate, rotate, and synchronize keys across machines.

Authentication: Both session resumption mechanisms retain the client’s and server’s authentication states as established in the session’s initial handshake. Neither the server nor the client have to send and verify certificates a second time, and thus can reduce connection times significantly, especially when dealing with RSA certificates.

Full Handshakes in TLS 1.3

The first good news about handshakes in TLS v1.3 is that static RSA key exchanges are no longer supported. Great! That means we can start with full handshakes using forward-secure Diffie-Hellman.

Another important change is the removal of the ChangeCipherSpec protocol (yes, it’s actually a protocol, not a message). With TLS v1.3 every message sent after ServerHello is encrypted with the so-called ephemeral secret to lock out passive adversaries very early in the game. EncryptedExtensions carries Hello extension data that must be encrypted because it’s not needed to set up secure communication.

The probably most important change with regard to 1-RTT is the removal of the ServerKeyExchange and ClientKeyExchange messages. The DH parameters and public keys are now sent in special KeyShare extensions, a new type of extension to be included in the ServerHello and ClientHello messages. Moving this data into Hello extensions keeps the handshake compatible with TLS v1.2 as it doesn’t change the order of messages.

The client sends a list of KeyShareEntry values, each consisting of a named (EC)DH group and an ephemeral public key. If the server accepts it must respond with one of the proposed groups and its own public key. If the server does not support any of the given key shares the server will request retrying the handshake or abort the connection with a fatal handshake_failure alert.

Authentication: The Diffie-Hellman parameters itself aren’t signed anymore, authentication will be a tad more explicit in TLS v1.3. The server sends a CertificateVerify message that contains a hash of all handshake message exchanged so far, signed with the certificate’s private key. The client then simply verifies the signature with the certificate’s public key.

Session Resumption in TLS 1.3 (PSK)

Session resumption via identifiers and tickets is obsolete in TLS v1.3. Both methods are replaced by a pre-shared key (PSK) mode. A PSK is established on a previous connection after the handshake is completed, and can then be presented by the client on the next visit.

The client sends one or more PSK identities as opaque blobs of data. They can be database lookup keys (similar to Session IDs), or self-encrypted and self-authenticated values (similar to Session Tickets). If the server accepts one of the given PSK identities it replies with the one it selected. The KeyShare extension is sent to allow servers to ignore PSKs and fall back to a full handshake.

Forward secrecy can be maintained by limiting the lifetime of PSK identities sensibly. Clients and servers may also choose an (EC)DHE cipher suite for PSK handshakes to provide forward secrecy for every connection, not just the whole session.

Authentication: As in TLS v1.2, the client’s and server’s authentication states are retained and both parties don’t need to exchange and verify certificates again. A regular PSK handshake initiating a new session, instead of resuming, omits certificates completely.

Session resumption still allows significantly faster handshakes when using RSA certificates and can prevent user-facing client authentication dialogs on subsequent connections. However, the fact that it requires a single round-trip just like a full handshake might make it less appealing, especially if you have an ECDSA or EdDSA certificate and do not require client authentication.

Zero-RTT Handshakes in TLS 1.3

The latest draft of the specification contains a proposal to let clients encrypt application data and include it in their first flights. On a previous connection, after the handshake completes, the server would send a ServerConfiguration message that the client can use for 0-RTT handshakes on subsequent connections. The configuration includes a configuration identifier, the server’s semi-static (EC)DH parameters, an expiration date, and other details.

With the very first TLS record the client sends its ClientHello and, changing the order of messages, directly appends application data (e.g. GET / HTTP/1.1). Everything after the ClientHello will be encrypted with the static secret, derived from the client’s ephemeral KeyShareEntry and the semi-static DH parameters given in the server’s configuration. The end_of_early_data alert indicates the end of the flight.

The server, if able and willing to decrypt, responds with its default set of messages and immediately appends the contents of the requested resource. That’s the same round-trip time as for an unencrypted HTTP request. All communication following the ServerHello will again be encrypted with the ephemeral secret, derived from the client’s and server’s ephemeral key shares. After exchanging Finished messages the server will be re-authenticated, and traffic encrypted with keys derived from the master secret.

Security of 0-RTT Handshakes

At first glance, 0-RTT mode seems similar to session resumption or PSK, and you might wonder why one wouldn’t merge these mechanisms. The differences however are subtle but important, and the security properties of 0-RTT handshakes are weaker than those for other kinds of TLS data:

1. To protect against replay attacks the server must incorporate a server random into the master secret. That is unfortunately not possible before the first round-trip and so the poor server can’t easily tell whether it’s a valid request or an attacker replaying a recorded conversation. Replay protection will be in place again after the ServerHello message is sent.

2. The semi-static DH share given in the server configuration, used to derive the static secret and encrypt first flight data, defies forward secrecy. We need at least one round-trip to establish the ephemeral secret. As configurations are shared between clients, and recovering the server’s DH share becomes more attractive, expiration dates should be limited sensibly. The maximum allowed validity is 7 days.

3. If the server’s DH share is compromised a MITM can tamper with the 0-RTT data sent by the client, without being detected. This does not extend to the full session as the client can retrospectively authenticate the server via the remaining handshake messages.

Defending against Replay Attacks

Thwarting replay attacks without input from the server is fundamentally very expensive. It’s important to understand that this is a generic problem, not an issue with TLS in particular, so alas one can’t just borrow another protocol’s 0-RTT model and put that into TLS.

It is possible to have servers keep a list of every ClientRandom they have received in a given time window. Upon receiving a ClientHello the server checks its list and rejects replays if necessary. This list must be globally and temporally consistent as there are possible attack vectors due to TLS’ reliable delivery guarantee if an attacker can force a server to lose its state, as well as with multiple servers in loosely-synchronized data centers.

Maintaining a consistent global state is possible, but only in some limited circumstances, namely for very sophisticated operators or situations where there is a single server with good state management. We will need something better.

Removing Anti-Replay Guarantee

A possible solution might be a TLS stack API to let applications designate certain data as replay-safe, for example GET / HTTP/1.1 assuming that GET requests against a given resource are idempotent.

let c = new TLSConnection(...);
c.setReplayable0RTTData("GET / ...");
c.connect();

Applications can, before opening the connection, specify replayable 0-RTT data to send on the first flight. If the server ignores the given 0-RTT data, the TLS stack automatically replays it after the first round-trip.

Removing Reliable Delivery Guarantee

Another way of achieving the same outcome would be a TLS stack API that again lets applications designate certain data as replay-safe, but does not automatically replay if the server ignores it. The application can decide to do this manually if necessary.

let c = new TLSConnection(...);
c.setUnreliable0RTTData("GET / ...");
c.connect();

if (c.delivered0RTTData()) {
  // Things are cool.
} else {
  // Try to figure out whether to replay or not.
}

Both of these APIs are early proposals and the final version of the specification might look very different from what we can see above. Though, as 0-RTT handshakes are a charter goal, the working group will very likely find a way to make them work.

Summing up

TLS v1.3 will bring major improvements to handshakes, how exactly will be finalized in the coming months. They will be more private by default as all information not needed to set up a secure channel will be encrypted as early as possible. Clients will need only a single round-trip to establish secure and authenticated connections to servers they never spoke to before.

Static RSA mode will no longer be available, forward secrecy will be the default. The two session resumption standards, session identifiers and session tickets, are merged into a single PSK mode which will allow streamlining implementations.

The proposed 0-RTT mode is promising, for custom application communication based on TLS but also for browsers, where a GET / HTTP/1.1 request to your favorite news page could deliver content blazingly fast as if no TLS was involved. The security aspects of zero round-trip handshakes will become more clear as the draft progresses.

A Firefox OS Password Storage

2015-05-18T15:00:00+02:00

My esteemed colleague Frederik Braun recently took on to rewrite the module responsible for storing and checking passcodes that unlock Firefox OS phones. While we are still working on actually landing it in Gaia I wanted to seize the chance to talk about this great use case of the WebCrypto API in the wild and highlight a few important points when using password-based key derivation (PBKDF2) to store passwords.

The Passcode Module

Let us take a closer look at not the verbatim implementation but at a slightly simplified version. The API offers the only two operations such a module needs to support: setting a new passcode and verifying that a given passcode matches the stored one.

let Passcode = {
  store(code) {
    // ...
  },

  verify(code) {
    // ...
  }
};

When setting up the phone for the first time - or when changing the passcode later - we call Passcode.store() to write a new code to disk. Passcode.verify() will help us determine whether we should unlock the phone. Both methods return a Promise as all operations exposed by the WebCrypto API are asynchronous.

Passcode.store("1234").then(() => {
  return Passcode.verify("1234");
}).then(valid => {
  console.log(valid);
});

// Output: true

Make the passcode look “random”

The module should absolutely not store passcodes in the clear. We will use PBKDF2 as a pseudorandom function (PRF) to retrieve a result that looks random. An attacker with read access to the part of the disk storing the user’s passcode should not be able to recover the original input, assuming limited computational resources.

The function deriveBits() is a PRF that takes a passcode and returns a Promise resolving to a random looking sequence of bytes. To be a little more specific, it uses PBKDF2 to derive pseudorandom bits.

function deriveBits(code) {
  // Convert string to a TypedArray.
  let bytes = new TextEncoder("utf-8").encode(code);

  // Create the base key to derive from.
  let importedKey = crypto.subtle.importKey(
    "raw", bytes, "PBKDF2", false, ["deriveBits"]);

  return importedKey.then(key => {
    // Salt should be at least 64 bits.
    let salt = crypto.getRandomValues(new Uint8Array(8));

    // All required PBKDF2 parameters.
    let params = {name: "PBKDF2", hash: "SHA-1", salt, iterations: 5000};

    // Derive 160 bits using PBKDF2.
    return crypto.subtle.deriveBits(params, key, 160);
  });
}

Choosing PBKDF2 parameters

As you can see above PBKDF2 takes a whole bunch of parameters. Choosing good values is crucial for the security of our passcode module so it is best to take a detailed look at every single one of them.

Select a cryptographic hash function

PBKDF2 is a big PRF that iterates a small PRF. The small PRF, iterated multiple times (more on why this is done later), is fixed to be an HMAC construction; you are however allowed to specify the cryptographic hash function used inside HMAC itself. To understand why you need to select a hash function it helps to take a look at HMAC’s definition, here with SHA-1 at its core:

HMAC-SHA-1(k, m) = SHA-1((k ⊕ opad) + SHA-1((k ⊕ ipad) + m))

The outer and inner padding opad and ipad are static values that can be ignored for our purpose, the important takeaway is that the given hash function will be called twice, combining the message m and the key k. Whereas HMAC is usually used for authentication PBKDF2 makes use of its PRF properties, that means its output is computationally indistinguishable from random.

deriveBits() as defined above uses SHA-1 as well, and although it is considered broken as a collision-resistant hash function it is still a safe building block in the HMAC-SHA-1 construction. HMAC only relies on a hash function’s PRF properties, and while finding SHA-1 collisions is considered feasible it is still believed to be a secure PRF.

That said, it would not hurt to switch to a secure cryptographic hash function like SHA-256. Chrome supports other hash functions for PBKDF2 today, Firefox unfortunately has to wait for an NSS fix before those can be unlocked for the WebCrypto API.

Pass a random salt

The salt is a random component that PBKDF2 feeds into the HMAC function along with the passcode. This prevents an attacker from simply computing the hashes of for example all 8-character combinations of alphanumerics (~5.4 PetaByte of storage for SHA-1) and use a huge lookup table to quickly reverse a given password hash. Specify 8 random bytes as the salt and the poor attacker will have to suddenly compute (and store!) 2⁶⁴ of those lookup tables and face 8 additional random characters in the input. Even without the salt the effort to create even one lookup table would be hard to justify because chances are high you cannot reuse it to attack another target, they might be using a different hash function or combine two or more of them.

The same goes for Rainbow Tables. A random salt included with the password would have to be incorporated when precomputing the hash chains and the attacker is back to square one where she has to compute a Rainbow Table for every possible salt value. That certainly works ad-hoc for a single salt value but preparing and storing 2⁶⁴ of those tables is impossible.

The salt is public and will be stored in the clear along with the derived bits. We need the exact same salt to arrive at the exact same derived bits later again. We thus have to modify deriveBits() to accept the salt as an argument so that we can either generate a random one or read it from disk.

function deriveBits(code, salt) {
  // Convert string to a TypedArray.
  let bytes = new TextEncoder("utf-8").encode(code);

  // Create the base key to derive from.
  let importedKey = crypto.subtle.importKey(
    "raw", bytes, "PBKDF2", false, ["deriveBits"]);

  return importedKey.then(key => {
    // All required PBKDF2 parameters.
    let params = {name: "PBKDF2", hash: "SHA-1", salt, iterations: 5000};

    // Derive 160 bits using PBKDF2.
    return crypto.subtle.deriveBits(params, key, 160);
  });
}

Keep in mind though that Rainbow tables today are mainly a thing from the past where password hashes were smaller and shittier. Salts are the bare minimum a good password storage scheme needs, but they merely protect against a threat that is largely irrelevant today.

Specify a number of iterations

As computers became faster and Rainbow Table attacks infeasible due to the prevalent use of salts everywhere, people started attacking password hashes with dictionaries, simply by taking the public salt value and passing that combined with their educated guess to the hash function until a match was found. Modern password schemes thus employ a “work factor” to make hashing millions of password guesses unbearably slow.

By specifying a sufficiently high number of iterations we can slow down PBKDF2’s inner computation so that an attacker will have to face a massive performance decrease and be able to only try a few thousand passwords per second instead of millions.

For a single-user disk or file encryption it might be acceptable if computing the password hash takes a few seconds; for a lock screen 300-500ms might be the upper limit to not interfere with user experience. Take a look at this great StackExchange post for more advice on what might be the right number of iterations for your application and environment.

A much more secure version of a lock screen would allow to not only use four digits but any number of characters. An additional delay of a few seconds after a small number of wrong guesses might increase security even more, assuming the attacker cannot access the PRF output stored on disk.

Determine the number of bits to derive

PBKDF2 can output an almost arbitrary amount of pseudo-random data. A single execution yields the number of bits that is equal to the chosen hash function’s output size. If the desired number of bits exceeds the hash function’s output size PBKDF2 will be repeatedly executed until enough bits have been derived.

function getHashOutputLength(hash) {
  switch (hash) {
    case "SHA-1":   return 160;
    case "SHA-256": return 256;
    case "SHA-384": return 384;
    case "SHA-512": return 512;
  }

  throw new Error("Unsupported hash function");
}

Choose 160 bits for SHA-1, 256 bits for SHA-256, and so on. Slowing down the key derivation even further by requiring more than one round of PBKDF2 will not increase the security of the password storage.

Do not hard-code parameters

Hard-coding PBKDF2 parameters - the name of the hash function to use in the HMAC construction, and the number of HMAC iterations - is tempting at first. We however need to be flexible if for example it turns out that SHA-1 can no longer be considered a secure PRF, or you need to increase the number of iterations to keep up with faster hardware.

To ensure future code can verify old passwords we store the parameters that were passed to PBKDF2 at the time, including the salt. When verifying the passcode we will read the hash function name, the number of iterations, and the salt from disk and pass those to deriveBits() along with the passcode itself. The number of bits to derive will be the hash function’s output size.

function deriveBits(code, salt, hash, iterations) {
  // Convert string to a TypedArray.
  let bytes = new TextEncoder("utf-8").encode(code);

  // Create the base key to derive from.
  let importedKey = crypto.subtle.importKey(
    "raw", bytes, "PBKDF2", false, ["deriveBits"]);

  return importedKey.then(key => {
    // Output length in bits for the given hash function.
    let hlen = getHashOutputLength(hash);

    // All required PBKDF2 parameters.
    let params = {name: "PBKDF2", hash, salt, iterations};

    // Derive |hlen| bits using PBKDF2.
    return crypto.subtle.deriveBits(params, key, hlen);
  });
}

Storing a new passcode

Now that we are done implementing deriveBits(), the heart of the Passcode module, completing the API is basically a walk in the park. For the sake of simplicity we will use localforage as the storage backend. It provides a simple, asynchronous, and Promise-based key-value store.

//

Tim Taubert

Bitslicing With Quine-McCluskey

The Quine-McCluskey algorithm

Step 1: Listing minterms

Step 2: Bit Buckets

Step 3: Merging minterms

Step 4: Prime Implicants

Step 5: Minimal Forms

Bitslicing a DES S-box

Bitslicing With Karnaugh Maps

A tiny S-box

A truth table per output bit

Karnaugh Maps

Spotting patterns

Wrapping around

A bitsliced SBOX() function

One function for fL(a,b,c) ...

... and another one for fR(a,b,c)

Putting it all together

More than four inputs

Bitslicing, an Introduction

What is bitslicing?

What’s it good for?

Speed

Parallelization

Constant execution time

Fully Homomorphic Encryption

Bitslicing a small S-box

LUTs and Multiplexers

Multiplexers in Software

A first implementation

A better mux() function

Simplifying the circuit

Circuit Minimization

Verified Binary Multiplication for GHASH

What’s GHASH again?

bmul() for 32-bit machines

Review, tests, and verification

bmul() for 64-bit machines

Tests and another equivalence proof

Fixing the bmul64() bitmasks

Some final thoughts

The Future of Session Resumption

Did web servers react?

The Caddy web server

1-RTT handshakes by default

Pre-shared keys in TLS 1.3

TLS 1.2 is surely here to stay

Simple Cryptol Specifications

Constant-time multiplication

Some helper functions

The core of the algorithm

The Cryptol specification

SAW’s llvm_verify function

Verification with SAW

Next: Finding bugs and more LLVM commands

Equivalence Proofs With SAW

Setting up your workspace

Unsigned addition without overflow

Constant-time addition

Writing the SAW script

Proving equivalence

Next: Some Cryptol and more SAW

Notes on HACS 2017

The HACS workshop

The projects & people

HACS - Day 1

HACS - Day 2

HACS - Day 3

See you again next year?

TLS Version Intolerance

What is version intolerance?

What are version fallbacks?

Why are these insecure?

The TLS_FALLBACK_SCSV cipher suite

Signatures in TLS 1.2

Downgrade Sentinels in TLS 1.3

The comeback of insecure fallbacks?

Version negotiation with extensions

GREASE-ing the future

One function for f_L(a,b,c) ...

... and another one for f_R(a,b,c)