The second part covered Karnaugh mapping, a visual method to simplify Boolean algebra expressions that takes advantage of humans’ pattern-recognition capability, but is unfortunately limited to at most four inputs in its original variant.
Part three will introduce the Quine-McCluskey algorithm, a tabulation method that, in combination with Petrick’s method, can minimize circuits with an arbitrary number of input variables. Both are relatively simple to implement in software.
Part 1: Bitslicing, An Introduction
Part 2: Bitslicing with Karnaugh maps
Part 3: Bitslicing with Quine-McCluskey
Here is the 3-to-2-bit S-box from the previous posts again:
Without much ado, we’ll jump right in and bitslice functions for both its output bits in parallel. You’ll probably recognize a few similarities to K-maps, except that the steps are rather mechanical and don’t require visual pattern-recognition abilities.
The lookup table SBOX[]
can be expressed as the Boolean functions
fL(a,b,c) and fR(a,b,c). Here are their truth tables,
with each combination of inputs assigned a symbol mi. Rows
m0-m7 will be called minterms.
| | a | b | c | fL |
|---|---|---|---|---|
| m0 | 0 | 0 | 0 | 0 |
| m1 | 0 | 0 | 1 | 0 |
| m2 | 0 | 1 | 0 | 1 |
| m3 | 0 | 1 | 1 | 0 |
| m4 | 1 | 0 | 0 | 1 |
| m5 | 1 | 0 | 1 | 1 |
| m6 | 1 | 1 | 0 | 1 |
| m7 | 1 | 1 | 1 | 0 |
| | a | b | c | fR |
|---|---|---|---|---|
| m0 | 0 | 0 | 0 | 1 |
| m1 | 0 | 0 | 1 | 0 |
| m2 | 0 | 1 | 0 | 1 |
| m3 | 0 | 1 | 1 | 1 |
| m4 | 1 | 0 | 0 | 0 |
| m5 | 1 | 0 | 1 | 0 |
| m6 | 1 | 1 | 0 | 1 |
| m7 | 1 | 1 | 1 | 0 |
We’re interested only in the minterms where the function evaluates to 1 and
will ignore all others. Boolean functions can already be constructed with just
those tables. In Boolean algebra, OR can be expressed as addition, AND as
multiplication. The negation of x is written x̄.

fL(a,b,c) = ∑ m(2,4,5,6) = m2 + m4 + m5 + m6 = āb̄c̄̄ + ab̄c̄ + ab̄c + abc̄

fR(a,b,c) = ∑ m(0,2,3,6) = m0 + m2 + m3 + m6 = āb̄c̄ + ābc̄ + ābc + abc̄
Well, that’s a start. Translated into C, these functions would be constant-time but not even close to minimal.
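Such a direct translation might look like this. It is a sketch only, using the bitsliced `uint8_t` convention from the earlier parts, where each variable carries one input bit of eight parallel S-box lookups:

```c
#include <stdint.h>

// Naive sum-of-minterms form, straight from the truth tables.
// Constant-time, but far from minimal: 20 gates per function.
uint8_t fL(uint8_t a, uint8_t b, uint8_t c) {
  return (~a & b & ~c)   // m2
       | (a & ~b & ~c)   // m4
       | (a & ~b & c)    // m5
       | (a & b & ~c);   // m6
}

uint8_t fR(uint8_t a, uint8_t b, uint8_t c) {
  return (~a & ~b & ~c)  // m0
       | (~a & b & ~c)   // m2
       | (~a & b & c)    // m3
       | (a & b & ~c);   // m6
}
```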
Now that we have all these minterms, we’ll put them in separate buckets based
on the number of 1s in their inputs a, b, and c.
| # of 1s | minterm | binary |
|---|---|---|
| 1 | m2 | 010 |
| | m4 | 100 |
| 2 | m5 | 101 |
| | m6 | 110 |
| # of 1s | minterm | binary |
|---|---|---|
| 0 | m0 | 000 |
| 1 | m2 | 010 |
| 2 | m3 | 011 |
| | m6 | 110 |
The reasoning here is the same as the Gray code ordering for Karnaugh maps. If we start with the minterms in the first bucket n, only bucket n+1 might contain matching minterms where only a single variable changes. They can’t be in any of the other buckets.
Why would you even look for pairs of minterms with a one-variable difference? Because they can be merged to simplify our expression. These combinations are called minterms of size 2.
All minterms have output 1, so if the only difference is exactly one input
variable, then the output is independent of it. For example,
(a & ~b & c) | (a & b & c) can be reduced to just a & c; the expression’s
value is independent of b.
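In software, the “differs in exactly one bit” test is a one-liner. A possible helper (the name is hypothetical): two minterms can be merged iff their XOR is a nonzero power of two.

```c
#include <stdint.h>

// Two minterms can be merged iff their binary representations differ
// in exactly one bit, i.e. their XOR is a nonzero power of two.
int can_merge(uint8_t x, uint8_t y) {
  uint8_t d = x ^ y;
  return d != 0 && (d & (d - 1)) == 0;
}
```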
| # of 1s | minterm | binary | size-2 | binary |
|---|---|---|---|---|
| 1 | m2 | 010 | m2,6 | —10 |
| | m4 | 100 | m4,5 | 10— |
| | | | m4,6 | 1—0 |
| 2 | m5 | 101 | | |
| | m6 | 110 | | |
| # of 1s | minterm | binary | size-2 | binary |
|---|---|---|---|---|
| 0 | m0 | 000 | m0,2 | 0—0 |
| 1 | m2 | 010 | m2,3 | 01— |
| | | | m2,6 | —10 |
| 2 | m3 | 011 | | |
| | m6 | 110 | | |
Always start with the minterms in the very first bucket at the top of the table. For every minterm in bucket n, we try to find a minterm in bucket n+1 with a one-bit difference in the binary column. Any matches will be recorded as pairs and entered into the size-2 column of bucket n.
m2=010 and m6=110 for example differ in only the first input variable, a. They merge into m2,6=—10, with a dash marking the position of the irrelevant input bit.
Once all minterms have been combined as far as possible, we’ll continue with the next size. Minterms of size greater than 1 have dashes for irrelevant input bits, and it’s important to treat those as a “third bit value”. In other words, their dashes must be at the same positions, otherwise they can’t be merged.
There’s nothing left to merge for fL(a,b,c) as all its size-2 minterms are in the first bucket. For fR(a,b,c), none of the size-2 minterms in the first bucket match any of those in the second: their dashes are all in different positions.
All minterms from the previous step that can’t be combined any further are called prime implicants. Entering them into a table lets us check how well they cover the original minterms determined by step 1.
If any prime implicant is the only one to cover a minterm, it’s called an essential prime implicant (marked with an asterisk). It’s essential because it must be included in the resulting minimal form, otherwise we’d miss one of the input value combinations.
| | m2 | m4 | m5 | m6 | abc |
|---|---|---|---|---|---|
| m2,6* | x | | | x | -10 |
| m4,5* | | x | x | | 10- |
| m4,6 | | x | | x | 1-0 |
| | m0 | m2 | m3 | m6 | abc |
|---|---|---|---|---|---|
| m0,2* | x | x | | | 0-0 |
| m2,3* | | x | x | | 01- |
| m2,6* | | x | | x | -10 |
Prime implicant m2,6* on the left for example is the only one that covers m2. m4,5* is the only one that covers m5. Not only is m4,6 not essential, but we actually don’t need it at all: m4 and m6 are already covered by the essential prime implicants. All prime implicants of fR(a,b,c) are essential, so we need all of them.
When bitslicing functions with many input variables it may happen that you are left with a number of non-essential prime implicants that can be combined in various ways to cover the missing minterms. Petrick’s method helps find a minimal solution. It’s tedious to do manually, but not hard to automate.
Finally, we derive minimal forms of our Boolean functions by looking at the abc column of the essential prime implicants. Input variables marked with dashes are ignored.
fL(a,b,c) = m2,6 + m4,5 = bc̄ + ab̄
The code for SBOXL()
with 8-bit inputs:
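A minimal sketch of what that function could look like, assuming the bitsliced `uint8_t` convention from the earlier parts:

```c
#include <stdint.h>

// fL = b·c̄ + a·b̄, read off the essential prime implicants m2,6 and m4,5.
uint8_t SBOXL(uint8_t a, uint8_t b, uint8_t c) {
  return (b & ~c) | (a & ~b);
}
```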
fR(a,b,c), reduced to the combination of its three essential prime implicants:
fR(a,b,c) = m0,2 + m2,3 + m2,6 = āc̄ + āb + bc̄
And SBOXR()
as expected:
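Again as a hedged sketch, with the three essential prime implicants ORed together:

```c
#include <stdint.h>

// fR = ā·c̄ + ā·b + b·c̄, one term per essential prime implicant.
uint8_t SBOXR(uint8_t a, uint8_t b, uint8_t c) {
  return (~a & ~c) | (~a & b) | (b & ~c);
}
```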
Combining SBOXL() and SBOXR() yields the familiar version of SBOX(), after
eliminating common subexpressions and taking out common factors.
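A sketch of the combined function; the pointer-out signature is an assumption, the point is the shared b & ~c term and the factored-out ~a:

```c
#include <stdint.h>

// Nine gates total: t0 = b & nc is shared between both output bits,
// and fR's remaining terms share the factor na.
void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t *l, uint8_t *r) {
  uint8_t na = ~a, nb = ~b, nc = ~c;
  uint8_t t0 = b & nc;           // common to fL and fR
  *l = t0 | (a & nb);            // fL = b·c̄ + a·b̄
  *r = t0 | (na & (b | nc));     // fR = b·c̄ + ā·(b + c̄)
}
```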
When I started writing this blog post I thought it would be nice to ditch the small S-box from the previous posts, and naively bitslice a “real” S-box, like the ones used in DES.
But these are 6-to-4-bit S-boxes; how much more effort can it be? As it turns out, humans are terrible at understanding exponential growth. Here are my intermediate results after an hour of writing, trying to bitslice just one of the four output bits:
I gave up when I spotted a few mistakes that would likely lead to a non-minimal solution. Bitslicing a function with that many input variables manually is laborious and probably not worth it, although it definitely helped me understand the steps of the algorithm better.
As mentioned in the beginning, Quine-McCluskey and Petrick’s method can be implemented in software rather easily, so that’s what I did instead. I’ll explain how, and what to consider, in the next post.
My last post Bitslicing, An Introduction showed how to convert an S-box function into truth tables, then into a tree of multiplexers, and finally how to find the lowest possible gate count through manual optimization.
Today’s post will focus on a simpler and faster method. Karnaugh maps help simplifying Boolean algebra expressions by taking advantage of humans’ pattern-recognition capability. In short, we’ll bitslice an S-box using K-maps.
Part 1: Bitslicing, An Introduction
Part 2: Bitslicing with Karnaugh maps
Part 3: Bitslicing with Quine-McCluskey
Here again is the 3-to-2-bit S-box function from the previous post.
An AES-inspired S-box that interprets three input bits as a polynomial in GF(2^3) and computes its inverse mod P(x) = x^3 + x^2 + 1, with 0^-1 := 0. The result plus (x^2 + 1) is converted back into bits and the MSB is dropped.
This S-box can be represented as a function of three Boolean variables, where f(0,0,0) = 0b01, f(0,0,1) = 0b00, f(0,1,0) = 0b11, etc. Each output bit can be represented by its own Boolean function where fL(0,0,0) = 0 and fR(0,0,0) = 1, fL(0,0,1) = 0 and fR(0,0,1) = 0, …
Each output bit has its own Boolean function, and therefore also its own truth table. Here are the truth tables for the Boolean functions fL(a,b,c) and fR(a,b,c):
abc | out |
---|---|
000 | 01 |
001 | 00 |
010 | 11 |
011 | 01 |
100 | 10 |
101 | 10 |
110 | 11 |
111 | 00 |
abc | out |
---|---|
000 | 0 |
001 | 0 |
010 | 1 |
011 | 0 |
100 | 1 |
101 | 1 |
110 | 1 |
111 | 0 |
abc | out |
---|---|
000 | 1 |
001 | 0 |
010 | 1 |
011 | 1 |
100 | 0 |
101 | 0 |
110 | 1 |
111 | 0 |
Whereas previously at this point we built a tree of multiplexers out of each truth table, we’ll now build a Karnaugh map (K-map) per output bit.
The values of fL(a,b,c) and fR(a,b,c) are transferred onto a two-dimensional grid with the cells ordered in Gray code. Each cell position represents one possible combination of input bits, while each cell value represents the value of the output bit.
The row and column indices (a) and (b || c) are ordered in Gray code rather
than binary numerical order to ensure only a single variable changes between
each pair of adjacent cells. Otherwise, products of predicates
(a & b
, a & c
, …) would scatter.
These products are what you want to find to get a minimum length representation
of the truth function. If the output bit is the same at two adjacent cells,
then it’s independent of one of the two input variables, because
(a & ~b) | (a & b) = a
.
The heart of simplifying Boolean expressions via K-maps is finding groups of
adjacent cells with value 1. The rules are as follows:

- Groups may contain only cells with value 1, never cells with value 0.
- Groups must be rectangular, and their size a power of two.
- Every cell with value 1 must be in at least one group.
- Groups should be as large, and as few, as possible. They may overlap.

First, we mark all cells with value 1. We then form a red group for each of
the two horizontal pairs, of size 2^1. The two vertical pairs are marked with
green, also of size 2^1.
On fR’s K-map on the right, the red
and green group overlap. As per the rules
above, that’s perfectly fine. The cell at abc=110
can’t be without a group
and we’re instructed to form the largest groups possible, so they overlap.
But wait, you say, what’s going on with the blue rectangle on the right?
A somewhat unexpected property of K-maps is that they’re not really grids, but actually toruses. In plain English: they wrap around the top, bottom, and the sides.
Look at this neat animation on Wikipedia
that demonstrates how a rectangle can turn into a donut, i.e. a torus. Adjacent
thus has a special definition here: cells on the very right touch those on the
far left, as do those at the very top and bottom.
Another way to understand this property is to imagine that the columns don’t
start at 00
but rather at 01
, and so we rotate the whole K-map by one to
the left. Then the rectangles wouldn’t need to wrap around and they would all
fit on the grid nicely.
Now that all cells with a 1
have been assigned to as few groups as possible,
let’s get our hands dirty and write some code.
K-maps are read groupwise: we look at each cell’s position and focus on the input values that do not change throughout the group. Values that do change are ignored.
The red group covers the cells at position
100
and 101
. The values a=1
and b=0
are constant; they will be included
in the group’s term. The value of c
changes and is therefore irrelevant.
The term is (a & ~b)
.
The green group covers the cells at 010
and 110
. We ignore a
, and include b=1
and c=0
. The term is (b & ~c)
.
SBOXL()
is the disjunction of the group terms we collected from the K-map. It
lists all possible combinations of input values that lead to output value 1
.
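A sketch of that disjunction, again with bitsliced `uint8_t` arguments:

```c
#include <stdint.h>

// fL as the OR of the red group (a & ~b) and the green group (b & ~c).
uint8_t SBOXL(uint8_t a, uint8_t b, uint8_t c) {
  return (a & ~b) | (b & ~c);
}
```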
The red group covers the cells at 011
and 010
. The term is (~a & b)
.
The green group covers the cells at 010
and 110
. The term is (b & ~c)
.
The blue group covers the cells at 000
and 010
. The term is (~a & ~c)
.
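And the right output bit, ORing the three group terms collected above (a sketch in the same style):

```c
#include <stdint.h>

// fR as the OR of the red (~a & b), green (b & ~c) and blue (~a & ~c) groups.
uint8_t SBOXR(uint8_t a, uint8_t b, uint8_t c) {
  return (~a & b) | (b & ~c) | (~a & ~c);
}
```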
Great, that’s all we need! Now we can merge those two functions and compare that to the result of the previous post.
The first three variables ensure that we negate inputs only once. t0
replaces
the common subexpression b & nc
. Any optimizing compiler would do the same.
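A sketch of that merged version, with the shared negations and t0 hoisted out (the pointer-based output signature is an assumption):

```c
#include <stdint.h>

// Ten gates: three NOTs, four ANDs, three ORs.
void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t *l, uint8_t *r) {
  uint8_t na = ~a, nb = ~b, nc = ~c;   // negate each input only once
  uint8_t t0 = b & nc;                 // common subexpression b & ~c
  *l = (a & nb) | t0;
  *r = (na & b) | t0 | (na & nc);
}
```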
Ten gates. That’s one more than the manually optimized version from the last post. What’s missing? Turns out that K-maps sometimes don’t yield the minimal form and we have to simplify further by taking out common factors.
The conjunctions in the term (na & b) | (na & nc)
have the common factor na
and, due to the Distributivity Law, can be rewritten as na & (b | nc)
. That
removes one of the AND gates and leaves two.
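The factored version might then look like this (again a sketch with an assumed pointer-out signature):

```c
#include <stdint.h>

// Nine gates: (na & b) | (na & nc) rewritten as na & (b | nc).
void SBOX(uint8_t a, uint8_t b, uint8_t c, uint8_t *l, uint8_t *r) {
  uint8_t na = ~a, nb = ~b, nc = ~c;
  uint8_t t0 = b & nc;
  *l = (a & nb) | t0;
  *r = (na & (b | nc)) | t0;
}
```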
Nine gates. That’s exactly what we achieved by tedious artisanal optimization.
K-maps are neat and trivial to use once you’ve worked through an example yourself. They yield minimal circuits fast, compared to manual optimization where the effort grows exponentially with the number of terms.
There is one downside though, and it’s that the original variant of a K-map can’t be used with more than four input variables. There are variants that do work with more than four variables but they actually make it harder to spot groups visually.
The Quine–McCluskey algorithm is functionally identical to K-maps but can handle an arbitrary number of input variables in its original variant – although the running time grows exponentially with the number of variables. Not too problematic for us, S-boxes usually don’t have too many inputs anyway…
This post intends to give a brief overview of the general technique, not requiring much of a cryptographic background. It will demonstrate bitslicing a small S-box, talk about multiplexers, LUTs, Boolean functions, and minimal forms.
Part 1: Bitslicing, An Introduction
Part 2: Bitslicing with Karnaugh maps
Part 3: Bitslicing with Quine-McCluskey
Matthew Kwan coined the term about 20 years ago after seeing Eli Biham present his paper A Fast New DES Implementation in Software. He later published Reducing the Gate Count of Bitslice DES showing an even faster DES building on Biham’s ideas.
The basic concept is to express a function in terms of single-bit logical operations – AND, XOR, OR, NOT, etc. – as if you were implementing a logic circuit in hardware. These operations are then carried out for multiple instances of the function in parallel, using bitwise operations on a CPU.
In a bitsliced implementation, instead of having a single variable storing a, say, 8-bit number, you have eight variables (slices). The first storing the left-most bit of the number, the next storing the second bit from the left, and so on. The parallelism is bounded only by the target architecture’s register width.
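A hypothetical helper makes the representation concrete: it transposes eight 8-bit numbers into eight slices, where slice k holds bit k (counted from the left) of every number. The name and layout are illustrative, not from the original post:

```c
#include <stdint.h>

// Transpose eight 8-bit values into eight bit-slices.
// Bit i of slice[k] is the k-th bit (from the left) of num[i].
void to_slices(const uint8_t num[8], uint8_t slice[8]) {
  for (int k = 0; k < 8; k++) {
    slice[k] = 0;
    for (int i = 0; i < 8; i++)
      slice[k] |= (uint8_t)(((num[i] >> (7 - k)) & 1) << i);
  }
}
```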
Biham applied bitslicing to DES, a cipher designed to be fast in hardware. It uses eight different S-boxes that were usually implemented as lookup tables. Table lookups in DES however are rather inefficient, since one has to collect six bits from different words, combine them, and afterwards put each of the four resulting bits in a different word.
In classical implementations, these bit permutations would be implemented with a combination of shifts and masks. In a bitslice representation though, permuting bits really just means using the “right” variables in the next step; this is mere data routing, which is resolved at compile-time, with no cost at runtime.
Additionally, the code is extremely linear so that it usually runs well on heavily pipelined modern CPUs. It tends to have a low risk of pipeline stalls, as it’s unlikely to suffer from branch misprediction, and plenty of opportunities for optimal instruction reordering for efficient scheduling of data accesses.
With a register width of n bits, as long as the bitsliced implementation is no more than n times slower to run a single instance of the cipher, you end up with a net gain in throughput. This only applies to workloads that allow for parallelization. CTR and ECB mode always benefit, CBC and CFB mode only when decrypting.
Constant-time, secret independent computation is all the rage in modern applied cryptography. Bitslicing is interesting because by using only single-bit logical operations the resulting code is immune to cache and timing-related side channel attacks.
The last decade brought great advances in the field of Fully Homomorphic Encryption (FHE), i.e. computation on ciphertexts. If you have a secure crypto scheme and an efficient NAND gate you can use bitslicing to compute arbitrary functions of encrypted data.
Let’s work through a small example to see how one could go about converting arbitrary functions into a bunch of Boolean gates.
Imagine a 3-to-2-bit S-box function, a
component found in many symmetric encryption algorithms. Naively, this would be
represented by a lookup table with eight entries, e.g. SBOX[0b000] = 0b01
,
SBOX[0b001] = 0b00
, etc.
This AES-inspired S-box interprets three input bits as a polynomial in GF(2^3) and computes its inverse mod P(x) = x^3 + x^2 + 1, with 0^-1 := 0. The result plus (x^2 + 1) is converted back into bits and the MSB is dropped.
You can think of the above S-box’s output as being a function of three Boolean variables, where for instance f(0,0,0) = 0b01. Each output bit can be represented by its own Boolean function, i.e. fL(0,0,0) = 0 and fR(0,0,0) = 1.
If you’ve dealt with FPGAs before you probably know that these do not actually implement Boolean gates, but allow Boolean algebra by programming Look-Up-Tables (LUTs). We’re going to do the reverse and convert our S-box into trees of multiplexers.
Multiplexer is just a fancy word for data selector. A 2-to-1 multiplexer selects one of two input bits. A selector bit decides which of the two inputs will be passed through.
Here are the LUTs, or rather truth tables, for the Boolean functions fL(a,b,c) and fR(a,b,c):
abc | out |
---|---|
000 | 01 |
001 | 00 |
010 | 11 |
011 | 01 |
100 | 10 |
101 | 10 |
110 | 11 |
111 | 00 |
abc | out |
---|---|
000 | 0 |
001 | 0 |
010 | 1 |
011 | 0 |
100 | 1 |
101 | 1 |
110 | 1 |
111 | 0 |
abc | out |
---|---|
000 | 1 |
001 | 0 |
010 | 1 |
011 | 1 |
100 | 0 |
101 | 0 |
110 | 1 |
111 | 0 |
The truth table for fL(a,b,c) is (0, 0, 1, 0, 1, 1, 1, 0) or 2Eh. We can also call this the LUT-mask in the context of an FPGA. For each output bit of our S-box we need an 8-to-1 multiplexer, and that in turn can be represented by 2-to-1 multiplexers.
Let’s take the mux()
function from above and make it constant-time. As stated
earlier, bitslicing is competitive only through parallelization, so, for
demonstration, we’ll use uint8_t
arguments to later compute eight
S-box lookups in parallel.
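One possible constant-time variant, operating on all eight slices at once:

```c
#include <stdint.h>

// Constant-time 2-to-1 multiplexer over eight parallel bit-slices:
// for each bit position, selects a's bit where s is 0 and b's bit where s is 1.
uint8_t mux(uint8_t a, uint8_t b, uint8_t s) {
  return (~s & a) | (s & b);
}
```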
If the n-th bit of s
is zero it selects the n-th bit in a
, if not it
forwards the n-th bit in b
. The wider the target architecture’s registers,
the bigger the theoretical throughput – but only if the workload can take
advantage of the level of parallelization.
The two output bits will be computed separately and then assembled into the
final value returned by SBOX()
. Each multiplexer in the above diagram is
represented by a mux()
call. The first four take the LUT-masks
2Eh and B2h as inputs.
The diagram shows Boolean functions that only work with single-bit parameters.
We use uint8_t
, so instead of 1
we need to use ~0
to get 0b11111111
.
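One possible shape of the two mux trees, a sketch only: seven mux() calls per output bit, 42 gates in total, with the first four muxes fed constants that spell out the LUT-masks 2Eh and B2h.

```c
#include <stdint.h>

static uint8_t mux(uint8_t a, uint8_t b, uint8_t s) {
  return (~s & a) | (s & b);
}

// LUT-mask 2Eh: fL(m0..m7) = 0,0,1,0,1,1,1,0.
uint8_t SBOXL(uint8_t a, uint8_t b, uint8_t c) {
  uint8_t z = 0, o = (uint8_t)~0;       // constant 0 and 1 slices
  uint8_t t0 = mux(z, z, c);            // fL(0,0,c)
  uint8_t t1 = mux(o, z, c);            // fL(0,1,c)
  uint8_t t2 = mux(o, o, c);            // fL(1,0,c)
  uint8_t t3 = mux(o, z, c);            // fL(1,1,c)
  return mux(mux(t0, t1, b), mux(t2, t3, b), a);
}

// LUT-mask B2h: fR(m0..m7) = 1,0,1,1,0,0,1,0.
uint8_t SBOXR(uint8_t a, uint8_t b, uint8_t c) {
  uint8_t z = 0, o = (uint8_t)~0;
  uint8_t t0 = mux(o, z, c);            // fR(0,0,c)
  uint8_t t1 = mux(o, o, c);            // fR(0,1,c)
  uint8_t t2 = mux(z, z, c);            // fR(1,0,c)
  uint8_t t3 = mux(o, z, c);            // fR(1,1,c)
  return mux(mux(t0, t1, b), mux(t2, t3, b), a);
}
```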
That wasn’t too hard. SBOX()
is constant-time and immune to cache timing
attacks. Not counting the negation of constants (~0
) we have 42 gates in total
and perform eight lookups in parallel.
Assuming, for simplicity, that a table lookup is just one operation, the
bitsliced version is about five times as slow. If we had a workload that
allowed for 64 parallel S-box lookups we could achieve eight times the
current throughput by using uint64_t
variables.
mux()
currently needs three operations. Here’s another variant using XOR:
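A possible XOR-based variant: if s is all-zeros this yields a, if s is all-ones it yields a ^ (a ^ b) = b.

```c
#include <stdint.h>

// XOR variant of the bitsliced multiplexer, still three gates.
uint8_t mux(uint8_t a, uint8_t b, uint8_t s) {
  return a ^ (s & (a ^ b));
}
```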
Now there are still three gates, but the new version often lends itself to
easier optimization, as we might be able to precompute a ^ b
and reuse the
result.
Let’s optimize our circuit manually by following these simple rules:

- mux(a, a, s) reduces to a.
- X AND ~0 will always be X.
- X AND 0 will always be 0.
- mux() with constant inputs can be reduced.

With the new mux() variant there are a few XOR rules to follow as well:

- X XOR X reduces to 0.
- X XOR 0 reduces to X.
- X XOR ~0 reduces to ~X.

Inline the remaining mux() calls, eliminate common subexpressions, repeat.
Using the laws of Boolean algebra and the rules formulated above I’ve reduced the circuit to nine gates (down from 42!). We actually couldn’t simplify it any further.
Finding the minimal form of a Boolean function is an NP-complete problem. Manual optimization is tedious but doable for a tiny S-box such as the example used in this post. It will not be as easy for multiple 6-to-4-bit S-boxes (DES) or an 8-to-8-bit one (AES).
There are simpler and faster ways to build those circuits, and deterministic algorithms to check whether we reached the minimal form. One of those is covered in the next post Bitslicing with Karnaugh maps.
In this post I will touch on using formal verification as part of the code review process, in particular show how, by using the Software Analysis Workbench, we saved ourselves hours of debugging when rewriting the GHASH implementation for NSS.
GHASH is part of the Galois/Counter Mode, a mode of operation for block ciphers. AES-GCM for example uses AES as the block cipher for encryption, and appends a tag generated by the GHASH function, thereby ensuring integrity and authenticity.
The core of GHASH is multiplication in GF(2^128), a characteristic-two finite field with coefficients in GF(2); they’re either zero or one. Polynomials in GF(2^m) can be represented as m-bit numbers, with each bit corresponding to a term’s coefficient. In GF(2^3) for example, x^2 + 1 may be represented as the binary number 0b101 = 5.
Additions and subtractions in finite fields are “carry-less” because the coefficients must be in GF(p), for any GF(p^m). As x * y is equivalent to adding x to itself y times, we can call multiplication in finite fields “carry-less” too. In GF(2) addition is simply XOR, so we can say that multiplication in GF(2^m) is equal to binary multiplication without carries.

Note that the term carry-less only makes sense when talking about GF(2^m) fields that are easily represented as binary numbers. Otherwise one would rather talk about multiplication in finite fields without comparing it to standard integer multiplication.
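As an illustration, carry-less binary multiplication can be written as a shift-and-XOR loop. This is a naive sketch only; it branches on y, so unlike the implementations discussed below it is not constant-time:

```c
#include <stdint.h>

// Schoolbook carry-less multiplication: shift-and-XOR instead of
// shift-and-add. Illustration only, NOT constant-time.
uint64_t clmul(uint32_t x, uint32_t y) {
  uint64_t r = 0;
  for (int i = 0; i < 32; i++)
    if ((y >> i) & 1)
      r ^= (uint64_t)x << i;
  return r;
}
```

For example, 0b101 times 0b11 (that is, (x^2 + 1)(x + 1)) yields x^3 + x^2 + x + 1 = 0b1111.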
Franziskus’ post nicely describes why and how we updated our AES-GCM code in NSS. In case a user’s CPU is not equipped with the Carry-less Multiplication (CLMUL) instruction set, we need to provide a fallback and implement carry-less, constant-time binary multiplication ourselves, using standard integer multiplication with carry.
The basic implementation of our binary multiplication algorithm is taken straight from Thomas Pornin’s excellent constant-time crypto post. To support 32-bit machines the best we can do is multiply two uint32_t
numbers and store the result in a uint64_t
.
For the full GHASH, Karatsuba decomposition is used: multiplication of two 128-bit integers is broken down into nine calls to bmul32(x, y, ...)
. Let’s take a look at the actual implementation:
Thomas’ explanation is not too hard to follow. The main idea behind the algorithm are the bitmasks m1 = 0b00010001...
, m2 = 0b00100010...
, m4 = 0b01000100...
, and m8 = 0b10001000...
. They respectively have the first, second, third, and fourth bit of every nibble set. This leaves “holes” of three bits between each “data bit”, so that with those applied at most a quarter of the 32 bits are equal to one.
Per standard integer multiplication, eight times eight bits will at most add eight carry bits of value one together, thus we need sufficiently sized holes per digit that can hold the value 8 = 0b1000
. Three-bit holes are big enough to prevent carries from “spilling” over, they could even handle up to 15 = 0b1111
data bits in each of the two integer operands.
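A reconstruction of that algorithm, after Thomas Pornin’s constant-time design; variable names and details may differ from the actual NSS patch. The four masks split each operand into quarters, sixteen integer multiplications are XORed together per result residue class, and the final masks discard any carries that spilled into the holes:

```c
#include <stdint.h>

// Constant-time carry-less 32x32 -> 64-bit multiplication using
// standard integer multiplication with three-bit holes between data bits.
uint64_t bmul32(uint32_t x, uint32_t y) {
  uint32_t x0 = x & 0x11111111, x1 = x & 0x22222222,
           x2 = x & 0x44444444, x3 = x & 0x88888888;
  uint32_t y0 = y & 0x11111111, y1 = y & 0x22222222,
           y2 = y & 0x44444444, y3 = y & 0x88888888;
  // For each bit position p, the partial products landing at p mod 4
  // are collected; carries stay within the three-bit holes.
  uint64_t z0 = ((uint64_t)x0 * y0) ^ ((uint64_t)x1 * y3) ^
                ((uint64_t)x2 * y2) ^ ((uint64_t)x3 * y1);
  uint64_t z1 = ((uint64_t)x0 * y1) ^ ((uint64_t)x1 * y0) ^
                ((uint64_t)x2 * y3) ^ ((uint64_t)x3 * y2);
  uint64_t z2 = ((uint64_t)x0 * y2) ^ ((uint64_t)x1 * y1) ^
                ((uint64_t)x2 * y0) ^ ((uint64_t)x3 * y3);
  uint64_t z3 = ((uint64_t)x0 * y3) ^ ((uint64_t)x1 * y2) ^
                ((uint64_t)x2 * y1) ^ ((uint64_t)x3 * y0);
  // Mask away the carry bits in the holes and recombine.
  return (z0 & 0x1111111111111111ULL) | (z1 & 0x2222222222222222ULL) |
         (z2 & 0x4444444444444444ULL) | (z3 & 0x8888888888888888ULL);
}
```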
The first version of the patch came with a bunch of new tests, the vectors taken from the GCM specification. We previously had no such low-level coverage, all we had were a number of high-level AES-GCM tests.
When reviewing, after looking at the patch itself and applying it locally to see whether it builds and tests succeed, the next step I wanted to try was to write a Cryptol specification to prove the correctness of bmul32()
. Thanks to the built-in pmult
function that took only a few minutes.
The SAWScript needed to properly parse the LLVM bitcode and formulate the equivalence proof is straightforward, it’s basically the same as shown in the previous post.
Compile to bitcode and run SAW. After just a few seconds it will tell us it succeeded in proving equivalency of both implementations.
bmul32()
is called nine times, each time performing 16 multiplications. That’s 144 multiplications in total for one GHASH evaluation. If we had a bmul64()
for 128-bit multiplication with uint128_t
we’d need to call it only thrice.
The naive approach taken in the first patch revision was to just double the bitsize of the arguments and variables, and also extend the bitmasks. If you paid close attention to the previous section you might notice a problem here already. If not, it will become clear in a few moments.
The above version of bmul64()
passed the GHASH test vectors with flying colors. That tricked reviewers into thinking it looked just fine, even if they just learned about the basic algorithm idea. Fallible humans. Let’s update the proofs and see what happens.
Instead of hardcoding bmul
for 32-bit integers we use polymorphic types m
and n
to denote the size in bits. m
is mostly a helper to make it a tad more readable. We can now reason about carry-less n-bit binary multiplication.
Duplicating the SAWScript spec and running :s/32/64
is easy, but certainly nicer is adding a function that takes n
as a parameter and returns a spec for n-bit arguments.
We use two instances of the bmul
spec to prove correctness of bmul32()
and bmul64()
sequentially. The second verification will take a lot longer before yielding results.
Proof failed. As you probably expected by now, the bmul64()
implementation is erroneous and SAW gives us a specific counterexample to investigate further. It took us a while to understand the failure but it seems very obvious in hindsight.
As already shown above, bitmasks leaving three-bit holes between data bits can avoid carry-spilling for up to two 15-bit integers. Using every fourth bit of a 64-bit argument however yields 16 data bits each, and carries can thus override data bits. We need bitmasks with four-bit holes.
m1
, …, m5
are the new bitmasks. m1
equals 0b0010000100001...; the others are each shifted by one. As the number of data bits per argument is now 64/5 <= n < 64/4
we need 5*5 = 25
multiplications. With three calls to bmul64()
that’s 75 in total.
Run SAW again and, after about an hour, it will tell us it successfully verified @bmul64.
You might want to take a look at Thomas Pornin’s version of bmul64()
. This basically is the faulty version that SAW failed to verify, he however works around the overflow by calling it twice, passing arguments reversed bitwise the second time. He invokes bmul64()
six times, which results in a total of 96 multiplications.
One of the takeaways is that even an implementation passing all test vectors given by a spec doesn’t need to be correct. That is not too surprising, spec authors can’t possibly predict edge cases from implementation approaches they haven’t thought about.
Using formal verification as part of the review process was definitely a wise decision. We likely saved hours of debugging intermittently failing connections, or random interoperability problems reported by early testers. I’m confident this wouldn’t have made it much further down the release line.
We of course added an extra test that covers that specific flaw but the next step definitely should be proper CI integration. The Cryptol code has already been written and there is no reason to not run it on every push. Verifying the full GHASH implementation would be ideal. The Cryptol code is almost trivial:
Proving the multiplication of two 128-bit numbers for a 256-bit product will unfortunately take a very very long time, or maybe not finish at all. Even if it finished after a few days that’s not something you want to automatically run on every push. Running it manually every time the code is touched might be an option though.
Enabling session resumption is an important tool for speeding up HTTPS websites, especially in a pre-HTTP/2 world where a client may have to open concurrent connections to the same host to quickly render a page. Subresource requests would ideally resume the session that for example a GET / HTTP/1.1
request started.
Let’s take a look at what has changed in over two years, and whether configuring session resumption securely has gotten any easier. With the TLS 1.3 spec about to be finalized I will show what the future holds and how these issues were addressed by the WG.
No, not as far as I’m aware. None of the three web servers mentioned above has taken steps to make it easier to properly configure session resumption. But to be fair, OpenSSL didn’t add any new APIs or options to help them either.
All popular TLS 1.2 web servers still don’t evict cache entries when they expire, keeping them around until a client tries to resume — for performance or ease of implementation. They generate a session ticket key at startup and will never automatically rotate it so that admins have to manually reload server configs and provide new keys.
I want to seize the chance and positively highlight the Caddy web server, a relative newcomer with the advantage of not having any historical baggage, that enables and configures HTTPS by default, including automatically acquiring and renewing certificates.
Version 0.8.3 introduced automatic session ticket key rotation, thereby making session tickets mostly forward secure by replacing the key every ~10 hours. Session cache entries though aren’t evicted until access just like with the other web servers.
But even for “traditional” web servers all is not lost. The TLS working group has known about the shortcomings of session resumption for a while and addresses those with the next version of TLS.
One of the many great things about TLS 1.3 handshakes is that most connections should take only a single round-trip to establish. The client sends one or more KeyShareEntry
values with the ClientHello
, and the server responds with a single KeyShareEntry
for a key exchange with ephemeral keys.
If the client sends no or only unsupported groups, the server will send a HelloRetryRequest
message with a NamedGroup
selected from the ones supported by the client. The connection will fall back to two round-trips.
That means you’re automatically covered if you enable session resumption only to reduce network latency, a normal handshake is as fast as 1-RTT resumption in TLS 1.2. If you’re worried about computational overhead from certificate authentication and key exchange, that still might be a good reason to abbreviate handshakes.
Session IDs and session tickets are obsolete since TLS 1.3. They’ve been replaced by a more generic PSK mechanism that allows resuming a session with a previously established shared secret key.
Instead of an ID or a ticket, the client will send an opaque blob it received from the server after a successful handshake in a prior session. That blob might either be an ID pointing to an entry in the server’s session cache, or a session ticket encrypted with a key known only to the server.
Two PSK key exchange modes are defined, psk_ke
and psk_dhe_ke
. The first signals a key exchange using a previously shared key, it derives a new master secret from only the PSK and nonces. This basically is as (in)secure as session resumption in TLS 1.2 if the server never rotates keys or discards cache entries long after they expired.
The second mode, psk_dhe_ke, additionally incorporates a key agreed upon using ephemeral Diffie-Hellman, thereby making it forward secure. By mixing a shared (EC)DHE key into the derived master secret, an attacker can no longer pull an entry out of the cache, or steal ticket keys, to recover the plaintext of past resumed sessions.
Note that 0-RTT data cannot be protected by the DHE secret: the early traffic secret is established without any input from the server and is thus derived from the PSK only.
In theory, there should be no valid reason for a web client to be able to complete a TLS 1.3 handshake but not support psk_dhe_ke, as ephemeral Diffie-Hellman key exchanges are mandatory. An internal application talking TLS between peers would likely be a legitimate case for not supporting DHE.
But even for TLS 1.3 it might make sense to properly configure session ticket key rotation and cache turnover, in case the odd client supports only psk_ke. This holds especially for TLS 1.2, which will probably be around for longer than we wish and imagine.
Apart from rather simple Cryptol I’m also going to introduce SAW’s llvm_verify function that allows much more complex verification. We need this as our function will not only take scalar inputs but also store the result of the computation using pointer arguments.
Part 1 dealt with addition; in part 2 we’re going to look at multiplication. Let’s implement a function mul(a, b, *hi, *lo) that multiplies a and b, and stores the eight most significant bits of the product in *hi and the eight LSBs in *lo.
This time we’ll make it run in constant time right away and won’t bother implementing a trivial version first. Instead, we will write a Cryptol specification to verify LLVM bitcode afterwards — you will be amazed how simple that is.
The first two functions of our C/C++ implementation will seem familiar if you’ve read the previous part of the series. msb hasn’t changed, and ge is the negated version of lt. nz returns 0xff if the given argument x is non-zero, 0 otherwise.
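Here is a sketch of what these helpers might look like in C. The exact bodies are assumptions; the borrow-based lt() is one common branch-free construction and not necessarily the post's original.

```c
#include <stdint.h>

/* Sketch of the helpers named in the text; bodies are assumptions. */

/* 0xff if the most significant bit of x is set, 0 otherwise. */
static uint8_t msb(uint8_t x) { return (uint8_t)(0 - (x >> 7)); }

/* 0xff if a < b, 0 otherwise: the MSB of this expression is the
 * borrow bit of the subtraction a - b. */
static uint8_t lt(uint8_t a, uint8_t b) {
    return msb((~a & b) | ((~a | b) & (uint8_t)(a - b)));
}

/* ge is the negated version of lt, as the text says. */
static uint8_t ge(uint8_t a, uint8_t b) { return (uint8_t)~lt(a, b); }

/* 0xff if x is non-zero: for any x != 0, x | -x has its MSB set. */
static uint8_t nz(uint8_t x) { return msb(x | (uint8_t)(0 - x)); }
```

All four are branch-free, so their runtime does not depend on the argument values.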
Our add function that previously dealt with overflows by capping at UINT8_MAX is a little more mature now and will set *carry = 1 when an overflow occurs.
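A possible shape of the carry-aware add, as a sketch: the post's version builds on the 8-bit helpers, while this one takes the carry from a 16-bit intermediate for brevity.

```c
#include <stdint.h>

/* Sketch of the carry-aware add; the carry comes from a 16-bit
 * intermediate here, which is easy to see at a glance. */
static uint8_t add(uint8_t a, uint8_t b, uint8_t *carry) {
    uint16_t sum = (uint16_t)a + (uint16_t)b;
    *carry = (uint8_t)(sum >> 8);  /* 1 on overflow, 0 otherwise */
    return (uint8_t)sum;           /* low eight bits, wrapped */
}
```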
mul(a, b, *hi, *lo), using all the helper functions we defined above, implements standard long multiplication, i.e. four multiplications per function call. We split the two 8-bit arguments into two 4-bit halves, multiply and add a few times, and then store two 8-bit results at the addresses pointed to by hi and lo.
It’s relatively easy to see that a * b can be rewritten as (a1 * 2^4 + a0) * (b1 * 2^4 + b0), all four variables being 4-bit integers. After multiplying and rearranging you’ll get an equation that’s very similar to mul above. Here’s a good introduction to computing with long integers if you want to know more.
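To make the four-multiplication structure concrete, here is a hedged sketch of mul. It widens to uint16_t internally for clarity; the post's constant-time version sticks to 8-bit operations and the helpers above.

```c
#include <stdint.h>

/* Sketch of schoolbook mul with four 4x4-bit partial products. */
static void mul(uint8_t a, uint8_t b, uint8_t *hi, uint8_t *lo) {
    uint16_t a1 = a >> 4, a0 = a & 0x0f;
    uint16_t b1 = b >> 4, b0 = b & 0x0f;

    /* a * b = (a1*2^4 + a0) * (b1*2^4 + b0)
             = a1*b1*2^8 + (a1*b0 + a0*b1)*2^4 + a0*b0 */
    uint16_t prod = (uint16_t)((a1 * b1) << 8)
                  + (uint16_t)((a1 * b0 + a0 * b1) << 4)
                  + (uint16_t)(a0 * b0);

    *hi = (uint8_t)(prod >> 8);   /* eight most significant bits */
    *lo = (uint8_t)(prod & 0xff); /* eight least significant bits */
}
```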
Compile the code to LLVM bitcode as before so that we can load it into SAW later.
To automate verification we’ll again write a SAW script. It will contain the necessary verification commands and details, as well as a Cryptol specification.
The specification doesn’t need to be constant-time; all it needs to be is correct and as simple as possible. We declare a function mul taking two 8-bit integers and returning a tuple containing two 8-bit integers. Read the notation [8] as “sequence of 8 bits”.
The built-in function take`{n} x returns a sequence with only the first n items of x. drop`{n} x returns sequence x without the first n items. zero is a special value that has a number of use cases; here it represents a flexible sequence of all zero bits. # is the append operator for sequences.
The first line of the definition gives the return value, a tuple with the first and the last 8 bits of prod. The Cryptol type system can automatically infer that the variable prod must hold a 16-bit sequence if the result of the take`{8} and drop`{8} function calls is a sequence of 8 bits each.
prod is the result of multiplying the zero-padded arguments a and b. zero # x prepends 8 zero bits to x, and that number is again determined by the type system. If you want to learn more about the language, take a look at Programming Cryptol.
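For readers more comfortable with C than Cryptol, the behavior the specification describes boils down to the following reference model (a sketch for illustration, not the constant-time implementation):

```c
#include <stdint.h>

/* The spec's meaning in C: zero-pad both arguments to 16 bits,
 * multiply, then split the product. */
static void mul_spec(uint8_t a, uint8_t b, uint8_t *hi, uint8_t *lo) {
    uint16_t prod = (uint16_t)a * (uint16_t)b; /* (zero # a) * (zero # b) */
    *hi = (uint8_t)(prod >> 8);                /* take`{8} prod */
    *lo = (uint8_t)(prod & 0xff);              /* drop`{8} prod */
}
```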
That’s about as simple as it gets. We multiply two 8-bit integers and out comes a 16-bit integer, split into two halves. Now let’s use the specification to verify our constant-time implementation.
We will add LLVM SAW instructions to the same file that contains the Cryptol code from above. The llvm_verify call here takes module m, extracts the symbol "mul", and uses the body given after do for verification.
We need to declare all symbolic inputs as given by our C/C++ implementation. With llvm_var we tell SAW that "a" and "b" are 8-bit integer arguments, and map those to the SAW variables a and b.
The arguments "hi" and "lo" are declared as pointers to 8-bit integers using llvm_ptr. And because we want to dereference the pointers and refer to their values later, we declare "*hi" and "*lo" as 8-bit integers too.
We specify no constraints for any of the arguments and expect the verification to consider all possible inputs. I will talk a bit more about such constraints and how these are useful in a later post.
With llvm_ensure_eq we tell SAW what values we expect after symbolic execution. We expect "*hi" to be equal to the first 8-bit integer element of the tuple returned by mul, and "*lo" to be equal to the second 8-bit integer.
llvm_verify_tactic chooses UC Berkeley’s ABC tool again and off we go.
Again, make sure you have saw and z3 in your $PATH. If you haven’t downloaded the binaries yet, take a look at the early sections of the previous post.
Successfully verified @mul. SAW tells us that for all possible inputs a and b, and actually hi and lo too, our constant-time C/C++ implementation behaves as stated by the SAW verification script and is thereby equivalent to our Cryptol specification.
In the next post I’m going to introduce and write more Cryptol, talk about specifying constraints on LLVM arguments and return values, and provide an example for finding bugs in a real-world codebase.
And while you wait, why not try your hand at optimizing mul to use only three instead of four multiplications with the Karatsuba algorithm? You can reuse the above Cryptol specification to verify you got it right.
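One hypothetical way to approach the exercise (my sketch, not from the post): Karatsuba trades the two middle products for a single multiplication of the half-sums, since a1*b0 + a0*b1 = (a1 + a0)*(b1 + b0) - a1*b1 - a0*b0.

```c
#include <stdint.h>

/* Karatsuba variant with only three 4x4-bit multiplications. */
static void mul_karatsuba(uint8_t a, uint8_t b, uint8_t *hi, uint8_t *lo) {
    uint16_t a1 = a >> 4, a0 = a & 0x0f;
    uint16_t b1 = b >> 4, b0 = b & 0x0f;

    uint16_t z2 = a1 * b1;                          /* multiply #1 */
    uint16_t z0 = a0 * b0;                          /* multiply #2 */
    uint16_t z1 = (a1 + a0) * (b1 + b0) - z2 - z0;  /* multiply #3 */

    uint16_t prod = (uint16_t)((z2 << 8) + (z1 << 4) + z0);
    *hi = (uint8_t)(prod >> 8);
    *lo = (uint8_t)(prod & 0xff);
}
```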
Verifying the implementation of a specific algorithm not only helps you weed out bugs early; it lets you prove that your code is correct and contains no further bugs, assuming you made no mistakes writing your algorithm specification in the first place.
Even if you know little or nothing about formal verification, it’s easy to get started experimenting with Cryptol and SAW, and get a glimpse of what’s possible.
In this first post I’ll show how you can use SAW to prove equality of multiple implementations of the same algorithm, potentially written in different languages.
To get started, download the latest SAW and Z3, as well as clang 3.8:
You need clang 3.8; later versions currently seem to be unsupported. Xcode’s latest clang would probably work for this small example but give you headaches with more advanced verification later on.
Unzip and copy the tools someplace you like, just don’t forget to update your $PATH environment variable, especially if you already have clang on your system.
Let’s start with a simple example.
We define an addition function add(a, b) that takes two uint8_t arguments and returns a uint8_t. It deals with overflows so that 123 + 200 = 255; that is, it caps the number at UINT8_MAX instead of wrapping around.
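A straightforward version might look like this sketch. It branches on the comparison, which is fine here; constant time comes later.

```c
#include <stdint.h>

/* Trivial saturating add: cap at UINT8_MAX instead of wrapping. */
static uint8_t add(uint8_t a, uint8_t b) {
    uint16_t sum = (uint16_t)a + (uint16_t)b;
    return (sum > UINT8_MAX) ? UINT8_MAX : (uint8_t)sum;
}
```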
That’s such a trivial function that we probably wouldn’t write a test for it. If it compiles we’re somewhat confident it’ll work just fine:
Note that the above command will not produce a binary or shared library, but instead instruct clang to emit LLVM bitcode and store it in add.bc. We’ll feed this into SAW in a minute.
Now imagine that we actually want to use add as part of a bignum library to implement cryptographic algorithms, and thus want it to have a constant runtime, independent of the arguments given. Here’s how you could do this:
If a + b < a, i.e. the addition overflows, lt(a + b, a) will return 0xff and change the return value into UINT8_MAX = 0xff. Otherwise it returns 0 and the return value will simply be a + b. That’s easy enough, but did we get msb and lt right?
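Putting the pieces together, the constant-time variant could look roughly like this. The helper bodies are assumptions; only the overall shape (sum | lt(sum, a)) follows the text.

```c
#include <stdint.h>

/* 0xff if the MSB of x is set, 0 otherwise. */
static uint8_t msb(uint8_t x) { return (uint8_t)(0 - (x >> 7)); }

/* 0xff if a < b, 0 otherwise: MSB of the borrow of a - b, branch-free. */
static uint8_t lt(uint8_t a, uint8_t b) {
    return msb((~a & b) | ((~a | b) & (uint8_t)(a - b)));
}

/* If a + b overflows, lt(a + b, a) is 0xff and ORs the result up to
 * UINT8_MAX; otherwise it is 0 and a + b passes through unchanged. */
static uint8_t cadd(uint8_t a, uint8_t b) {
    uint8_t sum = (uint8_t)(a + b);
    return sum | lt(sum, a);
}
```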
Let’s compile the constant-time add function to LLVM bitcode too and use SAW to prove that both our addition functions are equivalent to each other.
SAW executes scripts to automate theorem proving, and we need to write one in order to check that our two implementations are equivalent. The first thing our script does is load the LLVM bitcode from the files we created earlier, add.bc and cadd.bc, as modules into the variables m1 and m2, respectively.
Next, we’ll extract the add functions defined in each of these modules and store them in add and cadd, the latter being our constant-time implementation. llvm_pure indicates that a function always returns the same result given the same arguments, and thus has no side-effects.
Last, we define a theorem thm stating that for all arguments x and y both functions have the same return value, i.e. that they are equivalent to each other. We choose to prove this theorem with the ABC tool from UC Berkeley.
We’re all set now, time to actually prove something.
Make sure you have saw and z3 in your $PATH. Run SAW and pass it the file we created in the previous section — it will execute the script and automatically prove our theorem.
Valid, that was easy. Maybe too easy. Would SAW even detect if we sneak a minor mistake into the program? Let’s find out…
The diff above changes the behavior of lt just slightly, a bug that we could have introduced by accident. Let’s run SAW again and see whether it spots it:
Invalid! The two functions disagree on the return value at [x = 240, y = 0]. SAW of course doesn’t know which function is at fault, but we are confident enough in our reference implementation to know where to look.
I can’t possibly explain how this all works in detail, but I can hopefully give you a rough idea. What SAW does is parse the LLVM bitcode and symbolically execute it on symbolic inputs to translate it into a circuit representation.
This circuit is then, together with our theorems, fed into a theorem prover. Z3 is an automated theorem prover, and ABC a tool for logic synthesis and verification; both are able to prove equality using automated reasoning.
In the second post I talk about verifying the implementation of a slightly more complex function, also written in C/C++, and show how you can use Cryptol to write a simple specification, as well as introduce more advanced SAW commands for verification.
If you found this interesting, play around with the examples above and come up with your own. Write a straightforward implementation of an algorithm that you can be certain to get right and then optimize it, make it constant-time, or change it in any other way and see how SAW behaves.
What I want to tell you about is a lesser-known event that took place right after RWC, called HACS - the High Assurance Crypto Software workshop. An intense, highly interactive two-day workshop in its second year, organized by Ben Laurie, Gilles Barthe, Peter Schwabe, Meredith Whittaker, and Trevor Perrin.
Its stated goal is to bring together crypto-implementers and verification people from open source, industry, and academia; introduce them and their projects to each other, and develop practical collaborations that improve verification of crypto code.
The formal verification community was represented by projects such as miTLS, HACL*, Project Everest, Z3, VeriFast, tis-interpreter, ct-verif, Cryptol/SAW, Entroposcope, and other formal verification and synthesis projects based on Coq or F*.
Crypto libraries were represented by one or multiple maintainers of OpenSSL, BoringSSL, Bouncy Castle, NSS, BearSSL, *ring*, and s2n. Other invited projects included LLVM, Tor, libFuzzer, BitCoin, and Signal. (I’m probably missing a few, sorry.)
Additionally, there were some attendants not directly involved with any of the above projects but who are experts in formal verification or synthesis, constant-time implementation of crypto algorithms, fast arithmetic in assembler, elliptic curves, etc.
All in all, somewhere between 70 and 80 people.
After short one-sentence introductions on early Saturday morning we immediately started with simultaneous round-table discussions, focused on topics such as “The state of crypto libraries”, “Challenges in implementing crypto libraries”, “Efficient fuzzing”, “TLS implementation woes”, “The LLVM ecosystem”, “Fast and constant-time low-level algorithm implementations”, “Formal verification/synthesis with Coq”, and others.
These discussions were hosted by a rotating set of people, not always leading by pure expertise, sometimes also moderating, asking questions, and making sure we stay on track. We did this until lunch, and continued to talk over food with the people we just met. For the rest of the day, discussions became longer and more focused.
By this point people slowly started to sense what it is they want to focus on this weekend. They got to meet most of the other attendants, found out about their skills, projects, and ideas; thought about possibilities for collaboration on projects for this weekend or the months to come.
In the evening we split into groups and went for dinner. Most people’s brains were probably somewhat fried (as was mine) after hours of talking and discussing. Everyone was so engaged that you not once found the time to take out your laptop or phone, or had the desire to do so, which was great.
The second day, early Sunday morning, continued much like the previous one. We started off with a brainstorming session on what we thought the group should be working on. The rest of the day was filled with long and focused discussions that were mostly a continuation from the day before.
A highlight of the day was the skill sharing session, where participants could propose a specific skill to share with others. If you didn’t find something to share you could be one of the 50% of the group that gets to learn from others.
My lucky pick was Chris Hawblitzel from Microsoft Research, who did his best to explain to me (in about 45 minutes) how Z3 works, what its limitations are, and what higher-level languages exist that make it a little more usable. Thank you, Chris!
We ended the day with signing up for one or multiple projects for the last day.
The third day of the workshop was optional, a hacking day with maybe roughly 50% attendance. Some folks took the chance to arrive a little later after two days of intense discussions and socializing. By now you knew most people’s names, and you’d better have, because no one cared to wear name tags anymore.
It was the time to get together with the people from the projects you signed up for and get your laptop out (if needed). I can’t possibly remember all the things people worked on but here are a few examples:
I want to thank all the organizers (and sponsors) for spending their time (or money) planning and hosting such a great event. It always pays off to bring communities closer together and foster collaboration between projects and individuals.
I got to meet dozens of highly intelligent and motivated people, and left with a much bigger sense of community. I’m grateful to all the attendants that participated in discussions and projects, shared their skills, asked hard questions, and were always open to suggestions from others.
I hope to be invited again to future workshops and check in on the progress we’ve made at improving the verification and quality assurance of crypto code across the ecosystem.
I decided to dig a little deeper and will use this post to explain version intolerance, how version fallbacks work and why they’re insecure, as well as describe the downgrade protection mechanisms available in TLS 1.2 and 1.3. It will end with a look at version negotiation in TLS 1.3 and a proposal that aims to prevent similar problems in the future.
Every time a new TLS version is specified, browsers usually are the fastest to implement and update their deployments. Most major browser vendors have a few people involved in the standardization process to guide the standard and give early feedback about implementation issues.
As soon as the spec is finished, and often far before that feat is done, clients will have been equipped with support for the new TLS protocol version and happily announce this to any server they connect to:
Client: Hi! The highest TLS version I support is 1.2.
Server: Hi! I too support TLS 1.2 so let’s use that to communicate.
[TLS 1.2 connection will be established.]
In this case the highest TLS version supported by the client is 1.2, and so the server picks it because it supports that as well. Let’s see what happens if the client supports 1.2 but the server does not:
Client: Hi! The highest TLS version I support is 1.2.
Server: Hi! I only support TLS 1.1 so let’s use that to communicate.
[TLS 1.1 connection will be established.]
This too is how it should work if a client tries to connect with a protocol version unknown to the server. Should the client insist on any specific version and not agree with the one picked by the server it will have to terminate the connection.
Unfortunately, there are a few servers and more devices out there that implement TLS version negotiation incorrectly. The conversation might go like this:
Client: Hi! The highest TLS version I support is 1.2.
Server: ALERT! I don’t know that version. Handshake failure.
[Connection will be terminated.]
Or:
Client: Hi! The highest TLS version I support is 1.2.
Server: TCP FIN! I don’t know that version.
[Connection will be terminated.]
Or even worse:
Client: Hi! The highest TLS version I support is 1.2.
Server: (I don’t know this version so let’s just not respond.)
[Connection will hang.]
The same can happen with the infamous F5 load balancer that can’t handle
ClientHello
messages with a length between 256 and 512 bytes. Other devices
abort the connection when receiving a large ClientHello
split into multiple
TLS records. TLS 1.3 might actually cause more problems of this kind due to
more extensions and client key shares.
As browsers usually want to ship new TLS versions as soon as possible, more than a decade ago vendors saw a need to prevent connection failures due to version intolerance. The easy solution was to decrease the advertised version number by one with every failed attempt:
Client: Hi! The highest TLS version I support is 1.2.
Server: ALERT! Handshake failure. (Or FIN. Or hang.)
[TLS version fallback to 1.1.]
Client: Hi! The highest TLS version I support is 1.1.
Server: Hi! I support TLS 1.1 so let’s use that to communicate.
[TLS 1.1 connection will be established.]
A client supporting everything from TLS 1.0 to TLS 1.2 would start trying to establish a 1.2 connection, then a 1.1 connection, and if even that failed a 1.0 connection.
What makes these fallbacks insecure is that the connection can be downgraded by a MITM, by sending alerts or TCP packets to the client, or blocking packets from the server. To the client this is indistinguishable from a network error.
The POODLE attack is one example where an attacker abuses the version fallback to force an SSL 3.0 connection. In response, browser vendors disabled version fallbacks to SSL 3.0, and then SSL 3.0 entirely, to prevent even up-to-date clients from being exploited. Insecure version fallbacks in browsers pretty much break the actual version negotiation mechanisms.
Version fallbacks have been disabled since Firefox 37 and Chrome 50. Browser telemetry data showed they were no longer necessary: after years, TLS 1.2 and correct version negotiation were deployed widely enough.
You might wonder if there’s a secure way to do version fallbacks, and other people did so too. Adam Langley and Bodo Möller proposed a special cipher suite in RFC 7507 that would help a client detect whether the downgrade was initiated by a MITM.
Whenever the client includes TLS_FALLBACK_SCSV {0x56, 0x00}
in the list of
cipher suites it signals to the server that this is a repeated connection
attempt, but this time with a version lower than the highest it supports,
because previous attempts failed. If the server supports a higher version
than advertised by the client, it MUST abort the connection.
The drawback here, however, is that a client, even if it implements fallback with a Signaling Cipher Suite Value, doesn’t know the highest protocol version supported by the server, nor whether it implements a TLS_FALLBACK_SCSV check.
Common web servers will likely be updated faster than others, but router or
load balancer manufacturers might not deem it important enough to implement
and ship updates for.
It’s been long known to be problematic that signatures in TLS 1.2 don’t cover
the list of cipher suites and other messages sent before server authentication.
They sign the ephemeral DH parameters sent by the server and include the
*Hello.random
values as nonces to prevent replay attacks:
Signing at least the list of cipher suites would have helped prevent downgrade attacks like FREAK and Logjam. TLS 1.3 will sign all messages before server authentication, even though it makes Transcript Collision Attacks somewhat easier to mount. With SHA-1 not allowed for signatures that will hopefully not become a problem anytime soon.
With neither the client version nor its cipher suites (for the SCSV) included
in the hash signed by the server’s certificate in TLS 1.2, how do you secure
TLS 1.3 against downgrades like FREAK and Logjam? Stuff a special value into
ServerHello.random
.
The TLS WG decided to put static values (sometimes called downgrade sentinels)
into the server’s nonce sent with the ServerHello
message. TLS 1.3 servers
responding to a ClientHello
indicating a maximum supported version of TLS 1.2
MUST set the last eight bytes of the nonce to:
If the client advertises a maximum supported version of TLS 1.1 or below the server SHOULD set the last eight bytes of the nonce to:
If not connecting with a downgraded version, a client MUST check whether the server nonce ends with any of the two sentinels and in such a case abort the connection. The TLS 1.3 spec here introduces an update to TLS 1.2 that requires servers and clients to update their implementation.
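A client-side check could look roughly like this sketch. The sentinel bytes spell "DOWNGRD" followed by 0x01 (TLS 1.2) or 0x00 (TLS 1.1 and below), per the TLS 1.3 draft (later RFC 8446).

```c
#include <stdint.h>
#include <string.h>

/* The two downgrade sentinels: "DOWNGRD" plus a final 0x01 or 0x00. */
static const uint8_t kDowngrade12[8] =
    {0x44, 0x4f, 0x57, 0x4e, 0x47, 0x52, 0x44, 0x01};
static const uint8_t kDowngrade11[8] =
    {0x44, 0x4f, 0x57, 0x4e, 0x47, 0x52, 0x44, 0x00};

/* Returns 1 if the 32-byte ServerHello.random ends in a downgrade
 * sentinel; a client that did not intend to downgrade must then abort. */
static int downgrade_detected(const uint8_t server_random[32]) {
    const uint8_t *tail = server_random + 24;
    return memcmp(tail, kDowngrade12, 8) == 0 ||
           memcmp(tail, kDowngrade11, 8) == 0;
}
```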
Unfortunately, this downgrade protection relies on a ServerKeyExchange
message being sent and is thus of limited value. Static RSA key exchanges
are still valid in TLS 1.2, and unless the server admin disables all
non-forward-secure cipher suites the protection can be bypassed.
Current measurements show that enabling TLS 1.3 by default would break a significant fraction of TLS handshakes due to version intolerance. According to Ivan Ristić, as of July 2016, 3.2% of servers from the SSL Pulse data set reject TLS 1.3 handshakes.
This is a very high number and would affect way too many people. Alas, with TLS
1.3 we have only limited downgrade protection for forward-secure cipher
suites. And that is assuming that most servers either support TLS 1.3 or
update their 1.2 implementations. TLS_FALLBACK_SCSV
, if supported by the
server, will help as long as there are no attacks tampering with the list
of cipher suites.
The TLS working group has been thinking about how to handle intolerance without bringing back version fallbacks, and there might be light at the end of the tunnel.
The next version of the proposed TLS 1.3 spec, draft 16, will introduce a new version negotiation mechanism based on extensions. The current ClientHello.version field will be frozen to TLS 1.2, i.e. {3, 3}, and renamed to legacy_version. Any number greater than that MUST be ignored by servers.
To negotiate a TLS 1.3 connection the protocol now requires the client to send a supported_versions extension. This is a list of versions the client supports, in preference order, with the most preferred version first. Clients MUST send this extension, as servers are required to negotiate TLS 1.2 if it’s not present. Any version number unknown to the server MUST be ignored.
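A hypothetical server-side sketch of this selection logic; the version constants and the abort-on-no-match return value are my assumptions, not spec text.

```c
#include <stddef.h>
#include <stdint.h>

#define TLS12 0x0303
#define TLS13 0x0304 /* final codepoint; drafts used different values */

static int server_supports(uint16_t version) {
    return version == TLS12 || version == TLS13;
}

/* Walk the client's supported_versions list in preference order and pick
 * the first version the server knows; unknown values are simply skipped.
 * Returns the negotiated version, or 0 to signal a handshake failure. */
static uint16_t negotiate(const uint16_t *client_versions, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (server_supports(client_versions[i]))
            return client_versions[i];
    return 0;
}
```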
This still leaves potential problems with big ClientHello messages or choking on unknown extensions unaddressed, but according to David Benjamin the main problem is ClientHello.version.
We will hopefully be able to ship browsers that have TLS 1.3 enabled by default,
without bringing back insecure version fallbacks.
However, it’s not unlikely that implementers will screw up even the new version negotiation mechanism and we’ll have similar problems in a few years down the road.
David Benjamin, following Adam Langley’s advice to have one joint and keep it well oiled, proposed GREASE (Generate Random Extensions And Sustain Extensibility), a mechanism to prevent extensibility failures in the TLS ecosystem.
The heart of the mechanism is to have clients inject “unknown values” into places where capabilities are advertised by the client, and the best match selected by the server. Servers MUST ignore unknown values to allow introducing new capabilities to the ecosystem without breaking interoperability.
These values will be advertised pseudo-randomly to break misbehaving servers early in the implementation process. Proposed injection points are cipher suites, supported groups, extensions, and ALPN identifiers. Should the server respond with a GREASE value selected in the ServerHello message, the client MUST abort the connection.
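GREASE values follow a fixed pattern (0x0a0a, 0x1a1a, ..., 0xfafa: both bytes identical, low nibble 0xA), so a recognizer is a one-liner. This sketch follows the proposal (later RFC 8701).

```c
#include <stdint.h>

/* Returns 1 if v is one of the sixteen reserved GREASE values. */
static int is_grease(uint16_t v) {
    return (v >> 8) == (v & 0xff) && (v & 0x0f) == 0x0a;
}
```

A server can use such a check to make sure it never accidentally selects a GREASE value, and a client to detect a misbehaving server.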
Based on my experience from building a Taskcluster CI for NSS over the last weeks, I want to share a rough outline of the process of setting this up for basically any Mozilla project, using NSS as an example.
The development of NSS has for a long time been heavily supported by a fleet of buildbots. You can see them in action by looking at our waterfall diagram showing the build and test statuses of the latest pushes to the NSS repository.
Unfortunately, this setup is rather complex and the bots are slow. Build and test tasks are run sequentially and so on some machines it takes 10-15 hours before you will be notified about potential breakage.
The first thing that needs to be done is to replicate the current setup as closely as possible and then split monolithic test runs into many small tasks that can be run in parallel. Builds will be prepared by build tasks; test tasks will later download those pieces (called artifacts) to run tests.
A good turnaround time is essential, ideally one should know whether a push broke the tree after not more than 15-30 minutes. We want a TreeHerder dashboard that gives a good overview of all current build and test tasks, as well as an IRC and email notification system so we don’t have to watch the tree all day.
To build and test on Linux, Taskcluster uses Docker. The build instructions for the image containing all NSS dependencies, as well as the scripts to build and run tests, can be found in the automation/taskcluster/docker directory.
For a start, the fastest way to get something up and running (or building) is to use ADD in the Dockerfile to bake your scripts into the image. That way you can just pass them as the command in the task definition later.
Once you have NSS and its tests building and running in a local Docker container, the next step is to run a Taskcluster task in the cloud. You can use the Task Creator to spawn a one-off task, experiment with your Docker image, and with the task definition. Taskcluster will automatically pull your image from Docker Hub:
Docker and task definitions are well-documented, so this step shouldn’t be too difficult and you should be able to confirm everything runs fine. Now instead of kicking off tasks manually the next logical step is to spawn tasks automatically when changesets are pushed to the repository.
Triggering tasks on repository pushes should remind you of Travis CI, CircleCI, or AppVeyor, if you worked with any of those before. Taskcluster offers a similar tool called taskcluster-github that uses a configuration file in the root of your repository for task definitions.
If your master is a Mercurial repository then it’s very helpful that you don’t have to mess with it until you get the configuration right, and can instead simply create a fork on GitHub. The documentation is rather self-explanatory, and the task definition is similar to the one used by the Task Creator.
Once the WebHook is set up and receives pings, a push to your fork will make “Lisa Lionheart”, the Taskcluster bot, comment on your push and leave either an error message or a link to the task graph. If on the first try you see failures about missing scopes you are lacking permissions and should talk to the nice folks over in #taskcluster.
Once you have a GitHub fork spawning build and test tasks when pushing you should move all the scripts you wrote so far into the repository. The only script left on the Docker image would be a script that checks out the hg/git repository and then uses the scripts in the tree to build and run tests.
This step will pay off very early in the process, rebuilding and pushing the Docker image to Docker Hub is something that you really don’t want to do too often. All NSS scripts for Linux live in the automation/taskcluster/scripts directory.
Use the above snippet as a template for your scripts. It will set a few flags that help with debugging later, drop root privileges, and rerun it as the unprivileged worker user. If you need to do things as root before building or running tests, just put them before the exec su ... call.
Taskcluster encourages many small tasks. It’s easy to split the big monolithic test run I mentioned at the beginning into multiple tasks, one for each test suite. However, you wouldn’t want to rebuild NSS before every test run, so we should build it only once and then reuse the binary. Taskcluster allows a task to leave artifacts after a run that can then be downloaded by subtasks.
The above snippet builds NSS and creates an archive containing all the binaries and libraries. You need to let Taskcluster know that there’s a directory with artifacts so that it picks those up and makes them available to the public.
The test task then uses the $TC_PARENT_TASK_ID environment variable to determine the correct download URL, unpacks the build and starts running tests. Making artifacts automatically available to subtasks, without having to pass the parent task ID and build a URL, will hopefully be added to Taskcluster in the future.
Specifying task dependencies in your .taskcluster.yml file is unfortunately not possible at the moment. Even though the set of builds and tasks you want may be static you can’t create the necessary links without knowing the random task IDs assigned to them.
Your only option is to create a so-called decision task. A decision task is the only task defined in your .taskcluster.yml file and started after you push a new changeset. It will leave an artifact in the form of a JSON file that Taskcluster picks up and uses to extend the task graph, i.e. schedule further tasks with appropriate dependencies. You can use whatever tool or language you like to generate these JSON files, e.g. Python, Ruby, Node, …
All task graph definitions including the Node.JS build script for NSS can be found in the automation/taskcluster/graph directory. Depending on the needs of your project you might want to use a completely different structure. All that matters is that in the end you produce a valid JSON file. Slightly more intelligent decision tasks can be used to implement features like try syntax.
If you have all of the above working with GitHub but your main repository is hosted on hg.mozilla.org you will want to have Mercurial spawn decision tasks when pushing.
The Taskcluster team is working on making .taskcluster.yml files work for Mozilla-hosted Mercurial repositories too, but while that work isn’t finished yet you have to add your project to mozilla-taskcluster. mozilla-taskcluster will listen for pushes and then kick off tasks just like the WebHook.
A CI is no CI without a proper dashboard. That’s the role of TreeHerder at Mozilla. Add your project to the end of the repository.json file and create a new pull request. It will usually take a day or two after merging until your change is deployed and your project shows up in the dashboard.
TreeHerder gets the per-task configuration from the task definition. You can configure the symbol, the platform and collection (i.e. row), and other parameters. Here’s the configuration data for the green B at the start of the fifth row of the image at the top of this post:
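The exact configuration block isn’t reproduced here, but to illustrate the idea, a hypothetical TreeHerder section of a task definition might look like the following (expressed as a Python dict; the field names are illustrative and depend on the TreeHerder/Taskcluster versions in use):

```python
# Hypothetical per-task TreeHerder configuration: which symbol to show,
# and which platform/collection row the task appears in.
treeherder_config = {
    "symbol": "B",                       # letter shown in the dashboard
    "machine": {"platform": "linux64"},  # row: platform...
    "collection": {"opt": True},         # ...and build collection
}
```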
Taskcluster is a very modular system and offers many APIs. It’s mostly built with Node, and thus there are many Node libraries available to interact with its many parts. Communication between those parts is handled by Pulse, a managed RabbitMQ cluster.
The last missing piece we wanted is an IRC and email notification system: a bot that reports failures on IRC and sends emails to all parties involved. It was a piece of cake to write nss-tc, which uses the Taskcluster Node.js libraries and Mercurial’s JSON APIs to connect to the task queue and listen for task definitions and failures.
I could have probably written a detailed post for each of the steps outlined here but I think it’s much more helpful to start with an overview of what’s needed to get the CI for a project up and running. Each step and each part of the system is hopefully more obvious now if you haven’t had too much interaction with Taskcluster and TreeHerder so far.
Thanks to the Taskcluster team, especially John Ford, Greg Arndt, and Pete Moore! They helped us pull this off in a matter of weeks and besides Linux builds and tests we already have Windows tasks, static analysis, ASan+LSan, and are in the process of setting up workers for ARM builds and tests.
(Let’s ignore client authentication for simplicity.)
In TLS 1.0 as well as TLS 1.1 there are only two supported signature schemes: RSA with MD5/SHA-1 and DSA with SHA-1. The RSA here stands for the PKCS#1 v1.5 signature scheme, naturally.
An RSA signature signs the concatenation of the MD5 and SHA-1 digests; a DSA signature covers only the SHA-1 digest. The hashes are computed as follows:
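In Python terms, the signed data can be sketched roughly like this (a simplification; the function and parameter names are mine):

```python
import hashlib

# Sketch of the TLS 1.0/1.1 ServerKeyExchange hash computation:
# RSA signs MD5(randoms + params) || SHA1(randoms + params),
# DSA signs only the SHA-1 digest.
def signed_data(client_random, server_random, server_params, rsa=True):
    data = client_random + server_random + server_params
    sha1 = hashlib.sha1(data).digest()
    if rsa:
        return hashlib.md5(data).digest() + sha1  # 16 + 20 = 36 bytes
    return sha1
```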
The ServerParams are the actual data to be signed; the *Hello.random values are prepended to prevent replay attacks. This is the reason TLS 1.3 puts a downgrade sentinel at the end of ServerHello.random for clients to check.
The ServerKeyExchange message containing the signature is sent only when static RSA/DH key exchange is not used: that means we have a DHE_* cipher suite, an RSA_EXPORT_* suite downgraded due to export restrictions, or a DH_anon_* suite where neither party authenticates.
TLS 1.2 brought bigger changes to
signature algorithms by introducing the signature_algorithms extension.
This is a ClientHello
extension allowing clients to signal supported and
preferred signature algorithms and hash functions.
If a client does not include the signature_algorithms
extension then it is
assumed to support RSA, DSA, or ECDSA (depending on the negotiated cipher suite)
with SHA-1 as the hash function.
Besides adding all SHA-2 family hash functions, TLS 1.2 also introduced ECDSA as a new signature algorithm. Note that the extension does not allow restricting the curve used for a given scheme; P-521 with SHA-1 is therefore perfectly legal.
A new requirement for RSA signatures is that the hash has to be wrapped in a
DER-encoded DigestInfo
sequence before passing it to the RSA sign function.
This unfortunately led to attacks like Bleichenbacher’06
and BERserk
because it turns out handling ASN.1 correctly is hard. As in TLS 1.1, a
ServerKeyExchange
message is sent only when static RSA/DH key exchange is not
used. The hash computation did not change either:
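A rough Python sketch of the TLS 1.2 variant, assuming SHA-256 was negotiated: the hash covers the same concatenation as before, and for RSA it is wrapped in the standard DER-encoded DigestInfo from PKCS#1 before signing.

```python
import hashlib

# Standard DER DigestInfo prefix for SHA-256 (from PKCS#1).
SHA256_DIGEST_INFO_PREFIX = bytes.fromhex(
    "3031300d060960864801650304020105000420")

# Sketch: the negotiated hash over randoms + params, wrapped in
# DigestInfo as required for TLS 1.2 RSA signatures.
def rsa_signed_digest(client_random, server_random, server_params):
    digest = hashlib.sha256(
        client_random + server_random + server_params).digest()
    return SHA256_DIGEST_INFO_PREFIX + digest
```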
The signature_algorithms
extension introduced by TLS 1.2 was revamped in
TLS 1.3 and MUST now
be sent if the client offers at least one non-PSK cipher suite. The format is
backwards compatible and keeps some old code points.
Instead of SignatureAndHashAlgorithm
, a code point is now called a
SignatureScheme
and tied to a hash function (if applicable) by the
specification. TLS 1.2 algorithm/hash combinations not listed here
are deprecated and MUST NOT be offered or negotiated.
New code points for RSA-PSS schemes, as well as Ed25519 and Ed448-Goldilocks, were added. ECDSA schemes are now tied to the curve given by the code point name, to be enforced by implementations. SHA-1 signature schemes SHOULD NOT be offered; if needed for backwards compatibility, they may be offered only at the lowest priority, after all other schemes.
The current draft-13 lists RSASSA-PSS as the only valid signature algorithm allowed to sign handshake messages with an RSA key. The rsa_pkcs1_* values solely refer to signatures which appear in certificates and are not defined for use in signed handshake messages.
To prevent various downgrade attacks like FREAK and Logjam the computation of the hashes to be signed
has changed significantly and covers the complete handshake, up until
CertificateVerify
:
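Conceptually, the data being signed is now a running hash over every handshake message exchanged so far. A simplified Python sketch:

```python
import hashlib

# Rough sketch: the TLS 1.3 CertificateVerify signature covers a hash
# of the whole handshake transcript up to that point, not just the
# key-exchange parameters.
def transcript_hash(handshake_messages):
    h = hashlib.sha256()
    for msg in handshake_messages:  # ClientHello, ServerHello, ...
        h.update(msg)
    return h.digest()
```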
This includes amongst other data the client and server random, key shares, the cipher suite, the certificate, and resumption information to prevent replay and downgrade attacks. With static key exchange algorithms gone the CertificateVerify message is now the one carrying the signature.
NSS contained quite a lot of SSLv2-specific code that was waiting to be removed. It was not compiled by default so there was no way to enable it in Firefox even if you wanted to. The removal was rather straightforward as the protocol changed significantly with v3 and most of the code was well separated. Good riddance.
Adam Langley submitted a patch to bring ChaCha20/Poly1305 cipher suites to NSS already two years ago, but at that time we likely didn’t have enough resources to polish and land it. I picked up where he left off and updated it to conform to the slightly updated specification. Firefox 47 will ship with two new ECDHE/ChaCha20 cipher suites enabled.
Ryan Sleevi, also a while ago, implemented RSA-PSS in freebl
, the lower
cryptographic layer of NSS. I hooked it up to some more APIs so Firefox can
support RSA-PSS signatures in its WebCrypto API implementation. In NSS itself
we need it to support new handshake signatures in our experimental TLS v1.3
code.
Kai Engert from Red Hat is currently doing a hell of a job maintaining quite a few buildbots that run all of our NSS tests whenever someone pushes a new changeset. Unfortunately the current setup doesn’t scale too well and the machines are old and slow.
Similar to e.g. Travis CI, Mozilla maintains its own continuous integration and release infrastructure, called TaskCluster. Using TaskCluster we now have an experimental Docker image that builds NSS/NSPR and runs all of our 17 (so far) test suites. The turnaround time is already very promising. This is an ongoing effort, there are lots of things left to do.
I’ve been working on the Firefox WebCrypto API implementation for a while, long before I switched to the Security Engineering team, and so it made sense to join the working group to help finalize the specification. I’m unfortunately still struggling to carve out more time for involvement with the WG than just attending meetings and representing Mozilla.
The main reason the WebCrypto API in Firefox did not support HKDF until recently is that no one found the time to implement it. I finally did find some time and brought it to Firefox 46. It is fully compatible with Chrome’s implementation (RFC 5869); the WebCrypto specification still needs to be updated to reflect those changes.
Since we shipped the first early version of the WebCrypto API, SHA-1 was the only available PRF to be used with PBKDF2. We now support PBKDF2 with SHA-2 PRFs as well.
Our initial implementation of the WebCrypto API would naively spawn a new thread
every time a crypto.subtle.*
method was called. We now use a thread pool per
process that is able to handle all incoming API calls much faster.
After working on this on and off for more than six months, so even before I officially joined the security engineering team, I managed to finally get it landed, with a lot of help from Boris Zbarsky who had to adapt our WebIDL code generation quite a bit. The WebCrypto API can now finally be used from (Service)Workers.
In the near future I’ll be working further on improving our continuous integration infrastructure for NSS, and clean up the library and its tests. I will hopefully find the time to write more about it as we progress.
For CPUs without the AES-NI instruction set, however, constant-time AES-GCM is slow, and also hard to write and maintain. The majority of mobile phones, and most cheaper devices like tablets and notebooks on the market, thus cannot support efficient and safe AES-GCM cipher suite implementations.
Even if we ignored all those aforementioned pitfalls we still wouldn’t want to rely on AES-GCM cipher suites as the only good ones available. We need more diversity. Having widespread support for cipher suites using a second AEAD is necessary to defend against weaknesses in AES or AES-GCM that may be discovered in the future.
ChaCha20 and Poly1305, a stream cipher and a message authentication code, were designed with fast and constant-time implementations in mind. A combination of those two algorithms yields a safe and efficient AEAD construction, called ChaCha20/Poly1305, which allows TLS with a negligible performance impact even on low-end devices.
Firefox 47 will ship with two new ECDHE/ChaCha20 cipher suites as specified in the latest draft. We are looking forward to seeing their adoption increase and will, as a next step, work on prioritizing them over AES-GCM suites on devices that don’t support AES-NI.
The only problem is that it’s a Chrome App. Apart from excluding folks with other browsers it’s also a shitty user experience. If you too want your messaging app not tied to a browser then let’s just build our own standalone variant of Signal Desktop.
Signal Desktop is a Chrome App, so the easiest way to turn it into a standalone app is to use NW.js. Conveniently, their next release v0.13 will ship with Chrome App support and is available for download as a beta version.
First, make sure you have git
and npm
installed. Then open a terminal and
prepare a temporary build directory to which we can download a few things and
where we can build the app:
Download the latest beta of NW.js and unzip it. We’ll extract the application and use it as a template for our Signal clone. The NW.js project unfortunately does not seem to provide a secure source (or at least hashes) for their downloads.
Next, clone the Signal repository and use NPM to install the necessary modules.
Run the grunt
automation tool to build the application.
Finally, simply copy the dist folder containing all the juicy Signal files into the application template we created a few moments ago.
The last command opens a Finder window. Move SignalPrivateMessenger.app
to
your Applications folder and launch it as usual. You should now see a welcome
page!
The build instructions for Linux aren’t too different but I’ll write them down, if just for convenience. Start by cloning the Signal Desktop repository and building the app.
The dist
folder contains the app, ready to be launched. zip
it and place
the resulting package somewhere handy.
Back to the top. Download the NW.js binary, extract it, and change into the
newly created directory. Move the package.nw
file we created earlier next to
the nw
binary and we’re done. The nwjs-sdk-v0.13.0-beta3-linux-x64
folder
does now contain the standalone Signal app.
Finally, launch NW.js. You should see a welcome page!
Our standalone Signal clone mostly works, but it’s far from perfect. We’re pulling from master and that might bring breaking changes that weren’t sufficiently tested.
We don’t have the right icons. The app crashes when you click a media message. It opens a blank popup when you click a link. It’s quite big because NW.js has bugs too, so we have to use the SDK build for now. In the future it would be great to have automatic updates, and maybe even signed builds.
Remember, Signal Desktop is beta, and completely untested with NW.js. If you want to help, file bugs, but only after checking that they affect the Chrome App too. If you want to fix a bug that occurs only in the standalone version, it’s probably best to file a pull request and cross your fingers.
Great question! I don’t know. I would love to get some more insights from people that know more about the NW.js security model and whether it comes with all the protections Chromium can offer. Another interesting question is whether bundling Signal Desktop with NW.js is in any way worse (from a security perspective) than installing it as a Chrome extension. If you happen to have an opinion about that, I would love to hear it.
Another important thing to keep in mind is that when building Signal on your own you will possibly miss automatic and signed security updates from the Chrome Web Store. Keep an eye on the repository and rebuild your app from time to time to not fall behind too much.
Please note that this post is about draft-11 of the TLS v1.3 standard.
TLS must be fast. Adoption will greatly benefit from speeding up the initial handshake that authenticates and secures the connection. You want to get the protocol out of the way and start delivering data to visitors as soon as possible. This is crucial if we want the web to succeed at deprecating non-secure HTTP.
Let’s start by looking at full handshakes as standardized in TLS v1.2, and then continue to abbreviated handshakes that decrease connection times for resumed sessions. Once we understand the current protocol we can proceed to proposals made in the latest TLS v1.3 draft to achieve full 1-RTT and even 0-RTT handshakes.
It helps if you already have a rough idea of how TLS and Diffie-Hellman work as I can’t go into every detail. The focus of this post is on comparing current and future handshakes and I might omit a few technicalities to get basic ideas across more easily.
Static RSA is a straightforward key exchange method, available since
SSLv2. After
sharing basic protocol information via the ClientHello
and ServerHello
messages the server sends its certificate to the client. ServerHelloDone
signals that for now there will be no further messages until the client
responds.
The client then encrypts the so-called premaster secret with the server’s
public key found in the certificate and wraps it in a ClientKeyExchange
message. ChangeCipherSpec
signals that from now on messages will be encrypted.
Finished
, the first message to be encrypted and the client’s last message of
the handshake, contains a MAC of all handshake messages exchanged thus far to
prove that both parties saw the same messages, without interference from a MITM.
The server decrypts the premaster secret found in the ClientKeyExchange
message using its certificate’s private key, and derives the master secret and
communication keys. It then too signals a switch to encrypted communication
and completes the handshake. It takes two round-trips to establish a
connection.
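For illustration, the master secret derivation both sides perform can be sketched in Python. The PRF below follows RFC 5246’s P_SHA256 construction (technically a TLS 1.2 detail; earlier versions use an MD5/SHA-1 combination, but the idea is the same), and the helper names are mine:

```python
import hmac, hashlib

# P_SHA256 from RFC 5246: expand a secret and seed into an arbitrary
# number of pseudorandom bytes via chained HMAC calls.
def p_sha256(secret, seed, length):
    out, a = b"", seed
    while len(out) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()
        out += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return out[:length]

# Both parties derive the same 48-byte master secret from the premaster
# secret and the exchanged random values.
def master_secret(premaster, client_random, server_random):
    seed = b"master secret" + client_random + server_random
    return p_sha256(premaster, seed, 48)
```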
Authentication: With static RSA key exchanges, the connection is
authenticated by encrypting the premaster secret with the server certificate’s
public key. Only the server in possession of the private key can decrypt,
correctly derive the master secret, and send an encrypted Finished
message
with the right MAC.
The simplicity of static RSA has a serious drawback: it does not offer forward secrecy. If a passive adversary records all traffic to a server then every recorded TLS session can be broken later by obtaining the certificate’s private key.
This key exchange method will be removed in TLS v1.3.
A full handshake using (Elliptic Curve)
Diffie-Hellman to
exchange ephemeral keys is very similar to the flow of static RSA. The main
difference is that after sending the certificate the server will also send a
ServerKeyExchange
message. This message contains either the parameters of a
DH group or of an elliptic curve, paired with an ephemeral public key computed
by the server.
The client too computes an ephemeral public key compatible with the given parameters and sends it to the server. Knowing their private keys and the other party’s public key both sides should now share the same premaster secret and can derive a shared master secret.
Authentication: With (EC)DH key exchanges it’s still the certificate that
must be signed by a CA listed in the client’s trust store. To authenticate the
connection the server will sign the parameters contained in ServerKeyExchange
with the certificate’s private key. The client verifies the signature with the
certificate’s public key and only then proceeds with the handshake.
Since SSLv2, clients have been able to use session identifiers as a way to resume previously established TLS/SSL sessions. Session resumption is important because a full handshake can take time: it has a high latency as it needs two round-trips and might involve expensive computation to exchange keys, or to sign and verify certificates.
Session IDs, assigned
by the server, are unique identifiers under which both parties store the master
secret and other details of the connection they established. The client may
include this ID in the ClientHello
message of the next handshake to
short-circuit the negotiation and reuse previous connection parameters.
If the server is willing and able to resume the session it responds with a
ServerHello
message including the Session ID given by the client. This
handshake is effectively 1-RTT as the client can send application data
immediately after the Finished
message.
Sites with lots of visitors will have to manage and secure big session caches, or risk pushing out saved sessions too quickly. A setup involving multiple load-balanced servers will need to securely synchronize session caches across machines. The forward secrecy of a connection is bounded by how long session information is retained on servers.
Session tickets, created by the server
and stored by the client, are blobs containing all necessary information about
a connection, encrypted by a key only known to the server. If the client
presents this ticket with the ClientHello
message and proves that it knows
the master secret stored in the ticket, the session will be resumed.
A server willing and able to decrypt the given ticket responds with a
ServerHello
message including an empty SessionTicket extension; otherwise
the extension would be omitted completely. As with session IDs, the client will
start sending application data immediately after the Finished
message to
achieve 1-RTT.
To not affect the forward secrecy provided by (EC)DHE suites session ticket keys should be rotated periodically, otherwise stealing the ticket key would allow recovering recorded sessions later. In a setup with multiple load-balanced servers the main challenge here is to securely generate, rotate, and synchronize keys across machines.
Authentication: Both session resumption mechanisms retain the client’s and server’s authentication states as established in the session’s initial handshake. Neither the server nor the client have to send and verify certificates a second time, and thus can reduce connection times significantly, especially when dealing with RSA certificates.
The first good news about handshakes in TLS v1.3 is that static RSA key exchanges are no longer supported. Great! That means we can start with full handshakes using forward-secure Diffie-Hellman.
Another important change is the removal of the ChangeCipherSpec
protocol
(yes, it’s actually a protocol, not a message). With TLS v1.3 every message
sent after ServerHello
is encrypted with the so-called
ephemeral secret to lock
out passive adversaries very early in the game. EncryptedExtensions
carries
Hello extension data that must be encrypted because it’s not needed to set up
secure communication.
The probably most important change with regard to 1-RTT is the removal of the
ServerKeyExchange
and ClientKeyExchange
messages. The DH parameters and
public keys are now sent in special KeyShare extensions, a new type of
extension to be included in the ServerHello
and ClientHello
messages.
Moving this data into Hello extensions keeps the handshake compatible with TLS
v1.2 as it doesn’t change the order of messages.
The client sends a list of KeyShareEntry values, each consisting of a named
(EC)DH group and an ephemeral public key. If the server accepts it must respond
with one of the proposed groups and its own public key. If the server does not
support any of the given key shares the server will request retrying the
handshake or abort the connection with a fatal handshake_failure
alert.
Authentication: The Diffie-Hellman parameters themselves aren’t signed anymore; authentication is a tad more explicit in TLS v1.3. The server sends a CertificateVerify message that contains a hash of all handshake messages exchanged so far, signed with the certificate’s private key. The client then simply verifies the signature with the certificate’s public key.
Session resumption via identifiers and tickets is obsolete in TLS v1.3. Both methods are replaced by a pre-shared key (PSK) mode. A PSK is established on a previous connection after the handshake is completed, and can then be presented by the client on the next visit.
The client sends one or more PSK identities as opaque blobs of data. They can be database lookup keys (similar to Session IDs), or self-encrypted and self-authenticated values (similar to Session Tickets). If the server accepts one of the given PSK identities it replies with the one it selected. The KeyShare extension is sent to allow servers to ignore PSKs and fall back to a full handshake.
Forward secrecy can be maintained by limiting the lifetime of PSK identities sensibly. Clients and servers may also choose an (EC)DHE cipher suite for PSK handshakes to provide forward secrecy for every connection, not just the whole session.
Authentication: As in TLS v1.2, the client’s and server’s authentication states are retained and both parties don’t need to exchange and verify certificates again. A regular PSK handshake initiating a new session, instead of resuming, omits certificates completely.
Session resumption still allows significantly faster handshakes when using RSA certificates and can prevent user-facing client authentication dialogs on subsequent connections. However, the fact that it requires a single round-trip just like a full handshake might make it less appealing, especially if you have an ECDSA or EdDSA certificate and do not require client authentication.
The latest draft of the specification contains a proposal to let clients
encrypt application data and include it in their first flights. On a previous
connection, after the handshake completes, the server would send a
ServerConfiguration
message that the client can use for
0-RTT handshakes
on subsequent connections. The
configuration
includes a configuration identifier, the server’s semi-static (EC)DH parameters,
an expiration date, and other details.
With the very first TLS record the client sends its ClientHello
and, changing
the order of messages, directly appends application data (e.g. GET / HTTP/1.1
).
Everything after the ClientHello
will be encrypted with the
static secret, derived from
the client’s ephemeral KeyShareEntry and the semi-static DH parameters given
in the server’s configuration. The end_of_early_data
alert indicates the end
of the flight.
The server, if able and willing to decrypt, responds with its default set of
messages and immediately appends the contents of the requested resource. That’s
the same round-trip time as for an unencrypted HTTP request. All communication
following the ServerHello
will again be encrypted with the ephemeral secret,
derived from the client’s and server’s ephemeral key shares. After exchanging
Finished
messages the server will be re-authenticated, and traffic encrypted
with keys derived from the master secret.
At first glance, 0-RTT mode seems similar to session resumption or PSK, and you might wonder why one wouldn’t merge these mechanisms. The differences however are subtle but important, and the security properties of 0-RTT handshakes are weaker than those for other kinds of TLS data:
1. To protect against replay attacks the server must incorporate a server
random into the master secret. That is unfortunately not possible before the
first round-trip and so the poor server can’t easily tell whether it’s a valid
request or an attacker replaying a recorded conversation. Replay protection
will be in place again after the ServerHello
message is sent.
2. The semi-static DH share given in the server configuration, used to derive the static secret and encrypt first flight data, defies forward secrecy. We need at least one round-trip to establish the ephemeral secret. As configurations are shared between clients, and recovering the server’s DH share becomes more attractive, expiration dates should be limited sensibly. The maximum allowed validity is 7 days.
3. If the server’s DH share is compromised a MITM can tamper with the 0-RTT data sent by the client, without being detected. This does not extend to the full session as the client can retrospectively authenticate the server via the remaining handshake messages.
Thwarting replay attacks without input from the server is fundamentally very expensive. It’s important to understand that this is a generic problem, not an issue with TLS in particular, so alas one can’t just borrow another protocol’s 0-RTT model and put that into TLS.
It is possible to have servers keep a list of every ClientRandom they have
received in a given time window. Upon receiving a ClientHello
the server
checks its list and rejects replays if necessary. This list must be globally
and temporally consistent as there are
possible attack vectors
due to TLS’ reliable delivery guarantee if an attacker can force a server to
lose its state, as well as with multiple servers in loosely-synchronized data
centers.
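A single-server version of such a replay cache is easy to sketch in Python; the hard part, as noted above, is making this state globally and temporally consistent:

```python
import time

# Sketch of a time-windowed ClientRandom replay cache. A real deployment
# would need this state to be consistent across servers and data centers.
class ReplayCache:
    def __init__(self, window_seconds=10):
        self.window = window_seconds
        self.seen = {}  # ClientRandom -> time first seen

    def check_and_store(self, client_random, now=None):
        now = time.monotonic() if now is None else now
        # Drop entries that fell out of the time window.
        self.seen = {r: t for r, t in self.seen.items()
                     if now - t < self.window}
        if client_random in self.seen:
            return False  # replay: reject the 0-RTT data
        self.seen[client_random] = now
        return True
```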
Maintaining a consistent global state is possible, but only in some limited circumstances, namely for very sophisticated operators or situations where there is a single server with good state management. We will need something better.
A possible solution might be a TLS stack API to let applications designate
certain data as replay-safe, for example GET / HTTP/1.1
assuming that GET
requests against a given resource are idempotent.
Applications can, before opening the connection, specify replayable 0-RTT data to send on the first flight. If the server ignores the given 0-RTT data, the TLS stack automatically replays it after the first round-trip.
Another way of achieving the same outcome would be a TLS stack API that again lets applications designate certain data as replay-safe, but does not automatically replay if the server ignores it. The application can decide to do this manually if necessary.
Both of these APIs are early proposals and the final version of the specification might look very different from what we can see above. Though, as 0-RTT handshakes are a charter goal, the working group will very likely find a way to make them work.
TLS v1.3 will bring major improvements to handshakes, how exactly will be finalized in the coming months. They will be more private by default as all information not needed to set up a secure channel will be encrypted as early as possible. Clients will need only a single round-trip to establish secure and authenticated connections to servers they never spoke to before.
Static RSA mode will no longer be available, forward secrecy will be the default. The two session resumption standards, session identifiers and session tickets, are merged into a single PSK mode which will allow streamlining implementations.
The proposed 0-RTT mode is promising, for custom application communication
based on TLS but also for browsers, where a GET / HTTP/1.1
request to your
favorite news page could deliver content blazingly fast as if no TLS was
involved. The security aspects of zero round-trip handshakes will become more
clear as the draft progresses.
Let us take a closer look, not at the verbatim implementation, but at a slightly simplified version. The API offers only the two operations such a module needs to support: setting a new passcode and verifying that a given passcode matches the stored one.
When setting up the phone for the first time - or when changing the passcode
later - we call Passcode.store()
to write a new code to disk.
Passcode.verify()
will help us determine whether we should unlock the phone.
Both methods return a Promise as all operations exposed by the WebCrypto API
are asynchronous.
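For readers who prefer something runnable outside the browser, here is a rough Python equivalent of the two operations. The WebCrypto original is asynchronous and writes to disk; the iteration count below is an arbitrary placeholder and the salt handling is discussed in detail further down.

```python
import hashlib, hmac, os

ITERATIONS = 100_000  # placeholder; choose via measurement (see below)
_storage = {}         # stands in for writing salt + derived bits to disk

def store(passcode):
    salt = os.urandom(8)
    bits = hashlib.pbkdf2_hmac("sha1", passcode.encode(), salt, ITERATIONS)
    _storage["record"] = (salt, bits)

def verify(passcode):
    salt, bits = _storage["record"]
    candidate = hashlib.pbkdf2_hmac("sha1", passcode.encode(), salt,
                                    ITERATIONS)
    return hmac.compare_digest(candidate, bits)  # constant-time compare
```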
The module should absolutely not store passcodes in the clear. We will use PBKDF2 as a pseudorandom function (PRF) to retrieve a result that looks random. An attacker with read access to the part of the disk storing the user’s passcode should not be able to recover the original input, assuming limited computational resources.
The function deriveBits()
is a PRF that takes a passcode and returns a Promise
resolving to a random looking sequence of bytes. To be a little more specific,
it uses PBKDF2 to derive pseudorandom bits.
As you can see above PBKDF2 takes a whole bunch of parameters. Choosing good values is crucial for the security of our passcode module so it is best to take a detailed look at every single one of them.
PBKDF2 is a big PRF that iterates a small PRF. The small PRF, iterated multiple times (more on why this is done later), is fixed to be an HMAC construction; you are however allowed to specify the cryptographic hash function used inside HMAC itself. To understand why you need to select a hash function it helps to take a look at HMAC’s definition, here with SHA-1 at its core:
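Spelled out in code, with SHA-1’s 64-byte block size, the definition reads as follows. This is a teaching sketch only; real applications should use a vetted HMAC implementation.

```python
import hashlib

# HMAC-SHA1(k, m) = SHA1((k XOR opad) || SHA1((k XOR ipad) || m))
def hmac_sha1(key, message):
    block_size = 64  # SHA-1 block size in bytes
    if len(key) > block_size:
        key = hashlib.sha1(key).digest()
    key = key.ljust(block_size, b"\x00")
    opad = bytes(b ^ 0x5C for b in key)
    ipad = bytes(b ^ 0x36 for b in key)
    inner = hashlib.sha1(ipad + message).digest()
    return hashlib.sha1(opad + inner).digest()
```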
The outer and inner padding opad
and ipad
are static values that can be
ignored for our purpose, the important takeaway is that the given hash function
will be called twice, combining the message m
and the key k
. Whereas HMAC is usually used for authentication, PBKDF2 makes use of its PRF properties; that means its output is computationally indistinguishable from random.
deriveBits()
as defined above uses SHA-1
as well, and although it is considered broken
as a collision-resistant
hash function it is still a safe building block in the HMAC-SHA-1 construction.
HMAC only relies on a hash function’s PRF properties, and while finding SHA-1
collisions is considered feasible, it is still believed to be a secure PRF.
That said, it would not hurt to switch to a secure cryptographic hash function like SHA-256. Chrome supports other hash functions for PBKDF2 today, Firefox unfortunately has to wait for an NSS fix before those can be unlocked for the WebCrypto API.
The salt is a random component that PBKDF2 feeds into the HMAC function along with the passcode. This prevents an attacker from simply computing the hashes of, for example, all 8-character combinations of alphanumerics (~5.4 petabytes of storage for SHA-1) and using a huge lookup table to quickly reverse a given password hash. Specify 8 random bytes as the salt and the poor attacker will suddenly have to compute (and store!) 2^64 of those lookup tables and face 8 additional random characters in the input. Even without the salt, the effort to create even one lookup table would be hard to justify, because chances are high you cannot reuse it to attack another target; they might be using a different hash function or combine two or more of them.
The same goes for Rainbow Tables. A random salt included with the password would have to be incorporated when precomputing the hash chains, and the attacker is back to square one where she has to compute a Rainbow Table for every possible salt value. That certainly works ad hoc for a single salt value, but preparing and storing 2^64 of those tables is impossible.
The salt is public and will be stored in the clear along with the derived bits.
We need the exact same salt to arrive at the exact same derived bits later
again. We thus have to modify deriveBits()
to accept the salt as an argument
so that we can either generate a random one or read it from disk.
Keep in mind though that Rainbow Tables are mainly a relic of a past in which password hashes were smaller and shittier. Salts are the bare minimum a good password storage scheme needs, but they merely protect against a threat that is largely irrelevant today.
As computers became faster and Rainbow Table attacks infeasible due to the prevalent use of salts everywhere, people started attacking password hashes with dictionaries, simply by taking the public salt value and passing that combined with their educated guess to the hash function until a match was found. Modern password schemes thus employ a “work factor” to make hashing millions of password guesses unbearably slow.
By specifying a sufficiently high number of iterations we can slow down PBKDF2’s inner computation so that an attacker will have to face a massive performance decrease and be able to only try a few thousand passwords per second instead of millions.
For a single-user disk or file encryption it might be acceptable if computing the password hash takes a few seconds; for a lock screen 300-500ms might be the upper limit to not interfere with user experience. Take a look at this great StackExchange post for more advice on what might be the right number of iterations for your application and environment.
A much more secure version of a lock screen would allow not just four digits but any number of characters. An additional delay of a few seconds after a small number of wrong guesses might increase security even more, assuming the attacker cannot access the PRF output stored on disk.
PBKDF2 can output an almost arbitrary amount of pseudo-random data. A single execution yields the number of bits that is equal to the chosen hash function’s output size. If the desired number of bits exceeds the hash function’s output size PBKDF2 will be repeatedly executed until enough bits have been derived.
Choose 160 bits for SHA-1, 256 bits for SHA-256, and so on. Slowing down the key derivation even further by requiring more than one round of PBKDF2 will not increase the security of the password storage.
Hard-coding PBKDF2 parameters - the name of the hash function to use in the HMAC construction, and the number of HMAC iterations - is tempting at first. We do however need to stay flexible, in case for example SHA-1 can no longer be considered a secure PRF, or the number of iterations has to be increased to keep up with faster hardware.
To ensure future code can verify old passwords we store the parameters that
were passed to PBKDF2 at the time, including the salt. When verifying the
passcode we will read the hash function name, the number of iterations, and the
salt from disk and pass those to deriveBits()
along with the passcode itself.
The number of bits to derive will be the hash function’s output size.
Now that we are done implementing deriveBits()
, the heart of the Passcode
module, completing the API is basically a walk in the park. For the sake of
simplicity we will use localforage
as the storage backend. It provides a simple, asynchronous, and Promise-based
key-value store.
We generate a new random salt for every new passcode. The derived bits are
stored along with the salt, the hash function name, and the number of
iterations. HASH
and ITERATIONS
are constants that provide default values
for our PBKDF2 parameters and can be updated whenever desired. The Promise
returned by Passcode.store()
will resolve when all values have been
successfully stored in the backend.
To verify a passcode all values and parameters stored by Passcode.store()
will have to be read from disk and passed to deriveBits()
. Comparing the
derived bits with the value stored on disk tells whether the passcode is valid.
compare()
does not have to be constant-time. Even if the attacker learns
the first byte of the final digest stored on disk she cannot easily produce
inputs to guess the second byte - the opposite would imply knowing the
pre-images of all those two-byte values. She cannot do better than submitting
simple guesses that become harder the more bytes are known. For a successful
attack all bytes have to be recovered, which in turn means a valid pre-image
for the full final digest needs to be found.
If it makes you feel any better, you can of course implement compare()
as a
constant-time operation. This might be tricky though given that all modern
JavaScript engines optimize code heavily.
Both bcrypt and scrypt are probably better alternatives to PBKDF2. Bcrypt automatically embeds the salt and cost factor into its output, and most APIs are clever enough to parse and use those parameters when verifying a given password.
Scrypt implementations can usually generate a random salt securely - one less thing for you to worry about. The most important aspect of scrypt though is that it can consume a lot of memory when computing the password hash, which makes cracking passwords using ASICs or FPGAs close to impossible.
The Web Cryptography API unfortunately supports neither of the two algorithms, and there are currently no proposals to add them. In the case of scrypt it might also be somewhat controversial to allow a website to consume arbitrary amounts of memory.
After you have finished reading this one, please also read the follow-up post that covers session resumption changes in TLS 1.3.
The probably oldest complaint about TLS is that its handshake is slow and that, together with the transport encryption, it has a lot of CPU overhead. This certainly is no longer true if configured correctly.
One of the most important features to improve user experience for visitors accessing your site via TLS is session resumption. Session resumption is the general idea of avoiding a full TLS handshake by storing the secret information of previous sessions and reusing those when connecting to a host the next time. This drastically reduces latency and CPU usage.
Enabling session resumption in web servers and proxies can however easily compromise forward secrecy. To find out why having a de-facto standard TLS library (i.e. OpenSSL) can be a bad thing, and how to avoid botching PFS, let us take a closer look at forward secrecy and the current state of server-side implementation of session resumption features.
(Perfect) Forward Secrecy is an important part of modern TLS setups. The core of it is to use ephemeral (short-lived) keys for key exchange so that an attacker gaining access to a server cannot use any of the keys found there to decrypt past TLS sessions they may have recorded previously.
We must not use a server’s RSA key pair, whose public key is contained in the certificate, for key exchanges if we want PFS. This key pair is long-lived and will most likely outlive certificate expiration dates, as you would just use the same key pair to generate a new certificate after the current one expires. If the server is compromised it would be far too easy to determine the location of the private key on disk or in memory and use it to decrypt recorded TLS sessions from the past.
Using Diffie-Hellman key exchanges where key generation is a lot cheaper we can use a key pair exactly once and discard it afterwards. An attacker with access to the server can still compromise the authentication part as shown above and MITM everything from here on using the certificate’s private key, but past TLS sessions stay protected.
TLS provides two session resumption features: Session IDs and Session Tickets. To better understand how those can be attacked it is worth looking at them in more detail.
In a full handshake the server sends a Session ID as part of the “hello” message. On a subsequent connection the client can use this session ID and pass it to the server when connecting. Because both server and client have saved the last session’s “secret state” under the session ID they can simply resume the TLS session where they left off.
To support session resumption via session IDs the server must maintain a cache that maps past session IDs to those sessions’ secret states. The cache itself is the main weak spot: stealing its contents allows an attacker to decrypt all sessions whose session IDs it contains.
The forward secrecy of a connection is thus bounded by how long the session information is retained on the server. Ideally, your server would use a medium-sized cache that is purged daily. Purging your cache might however not help if the cache lives on persistent storage, as it might be feasible to restore deleted data from it. An in-memory store should be more resistant to this kind of attack if it turns over about once a day and ensures old data is overwritten properly.
The second mechanism to resume a TLS session are Session Tickets. This extension transmits the server’s secret state to the client, encrypted with a key only known to the server. That ticket key is protecting the TLS connection now and in the future and is the weak spot an attacker will target.
The client will store its secret information for a TLS session along with the ticket received from the server. By transmitting that ticket back to the server at the beginning of the next TLS connection both parties can resume their previous session, given that the server can still access the secret key that was used to encrypt.
We ideally want the same secrecy bounds for Session Tickets as for Session IDs. To achieve this we need to ensure that the key used to encrypt tickets is rotated about daily. Just like the session cache, it should not live on persistent storage so as not to leave any trace.
Now that we have determined how we ideally want session resumption features to be configured, we should take a look at a few popular web servers and load balancers to see whether that is supported, starting with Apache.
The Apache HTTP Server offers the
SSLSessionCache directive
to configure the cache that contains the session IDs of previous TLS sessions
along with their secret state. You should use shmcb
as the storage type: a
high-performance cyclic buffer inside a shared memory segment in RAM. It will
be shared between all threads or processes and allow session resumption no
matter which of those handles the visitor’s request.
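Such a cache could be configured like this; path and size are examples:

```apache
# Cyclic buffer in shared memory, 512 KiB in size.
SSLSessionCache shmcb:/path/to/ssl_gcache_data(524288)
```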
The example shown above establishes an in-memory cache via the path
/path/to/ssl_gcache_data
with a size of 512 KiB. Depending on
the amount of daily visitors the cache size might be too small (i.e. have a
high turnover rate) or too big (i.e. have a low turnover rate).
We ideally want a cache that turns over daily and there is no really good way to determine the right session cache size. What we really need is a way to tell Apache the maximum time an entry is allowed to stay in the cache before it gets overridden. This must happen regardless of whether the cyclic buffer has actually cycled around yet and must be a periodic background job to ensure the cache is purged even when there have not been any requests in a while.
You might wonder whether the
SSLSessionCacheTimeout
directive can be of any help here - unfortunately no. The timeout is only checked when a session ID is given at the start of a TLS connection. It does not cause entries to be purged from the session cache.
While Apache offers the SSLSessionTicketKeyFile directive to specify a key file that should contain 48 random bytes, it is recommended to not specify one at all. Apache will simply generate a random key on startup and use that to encrypt session tickets for as long as it is running.
The good thing about this is that the session ticket key will not touch persistent storage; the bad thing is that it will never be rotated. Generated once on startup, it is only discarded when Apache restarts. For most of the servers out there that means they use the same key for months, if not years.
To provide forward secrecy we need to rotate the session ticket key about daily, and current Apache versions provide no way of doing that. The only way to achieve it might be a cron job that gracefully restarts Apache daily to ensure a new key is generated. That does not sound like a real solution though, and nothing ensures the old key is properly overwritten.
Changing the key file while Apache is running does not do it either; you would
still need to gracefully restart the service to apply the new key. And do not
forget that if you use a key file it should be stored on a temporary file
system like tmpfs
.
Although disabling session tickets will undoubtedly have a negative performance impact, for the time being you will need to do that in order to provide forward secrecy:
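With a new enough Apache and OpenSSL this can presumably be done via SSLOpenSSLConfCmd:

```apache
# Requires Apache 2.4.8+ built against OpenSSL 1.0.2+.
SSLOpenSSLConfCmd Options -SessionTicket
```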
Ivan Ristic adds that to disable session tickets for Apache using
SSLOpenSSLConfCmd
, you have to be running OpenSSL 1.0.2 which has not been released yet. If you want to disable session tickets with earlier OpenSSL versions, Ivan has a few patches for the Apache 2.2.x and Apache 2.4.x branches.
To securely support session resumption via tickets Apache should provide a configuration directive to specify the maximum lifetime for session ticket keys, at least if auto-generated on startup. That would allow us to simply generate a new random key and override the old one daily.
Another very popular web server is Nginx. Let us see how that compares to Apache when it comes to setting up session resumption.
Nginx offers the ssl_session_cache directive
to configure the TLS session cache. The type of the cache should be shared
to
share it between multiple workers:
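For example:

```nginx
ssl_session_cache shared:SSL:10m;
```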
The above line establishes an in-memory cache with a size of 10 MB. We again have no real idea whether 10 MB is the right size for the cache to turn over daily. Just as Apache, Nginx should provide a configuration directive to allow cache entries to be purged automatically after a certain time. Any entries not purged properly could simply be read from memory by an attacker with full access to the server.
You guessed right, the
ssl_session_timeout
directive again only applies when trying to resume a session at the beginning of a connection. Stale entries will not be removed automatically after they time out.
Nginx allows specifying a session ticket key file using the ssl_session_ticket_key directive, and again you are probably better off not specifying one and having the service generate a random key on startup. The session ticket key will never be rotated and might be used to encrypt session tickets for months, if not years.
Nginx, too, provides no way to automatically rotate keys. Reloading its configuration daily using a cron job might work but does not come close to a real solution either.
The best you can do to provide forward secrecy to visitors is thus again to switch off session ticket support until a proper solution is available.
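Recent Nginx versions provide a directive for exactly that:

```nginx
ssl_session_tickets off;
```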
HAproxy, a popular load balancer, suffers from basically the same problems as Apache and Nginx. All of them rely on OpenSSL’s TLS implementation.
The size of the session cache can be set using the tune.ssl.cachesize directive that accepts a number of “blocks”. The HAproxy documentation tries to be helpful and explain how many blocks would be needed per stored session but we again cannot ensure an at least daily turnover. We would need a directive to automatically purge entries just as for Apache and Nginx.
And yes, the
tune.ssl.lifetime
directive does not affect how long entries are persisted in the cache.
HAproxy does not allow configuring session ticket parameters. It implicitly supports this feature because OpenSSL enables it by default. HAproxy will thus always generate a session ticket key on startup and use it to encrypt tickets for the whole lifetime of the process.
A graceful daily restart of HAproxy might be the only way to trigger key rotation. This is a pure assumption though, please do your own testing before using that in production.
You can disable session ticket support in HAproxy using the no-tls-tickets directive:
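The directive goes on the bind line; the certificate path is a placeholder:

```haproxy
bind :443 ssl crt /path/to/cert.pem no-tls-tickets
```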
A previous version of the post said it would be impossible to deactivate session tickets. Thanks to the HAproxy team for correcting me!
If you have multiple web servers that act as front-ends for a fleet of back-end servers, you will unfortunately not get away with leaving the session ticket key file unspecified and using a dirty hack that reloads the service configuration at midnight.
Sharing a session cache between multiple machines using memcached is possible, but with session tickets you “only” have to share one or more session ticket keys, not the whole cache. Clients take care of storing and discarding tickets for you.
Twitter wrote a great post about how they manage multiple web front-ends and distribute session ticket keys securely to each of their machines. I suggest reading that if you are planning to have a similar setup and support session tickets to improve response times.
Keep in mind though that Twitter had to write their own web server to handle forward secrecy in combination with session tickets properly and this might not be something you want to do yourselves.
It would be great if either OpenSSL or all of the popular web servers and load balancers would start working towards helping to provide forward secrecy by default and server admins could get rid of custom front-ends or dirty hacks to rotate keys.
The most interesting part to me however is that Facebook brute-forced a custom hidden service address, as it never occurred to me that this is something you might want to do. Again ignoring the pros and cons of doing that, investigating the how seems like a fun exercise to get more familiar with the WebCrypto API, if that is still unknown territory to you.
Names for Tor hidden services are meant to be self-authenticating. When creating a hidden service Tor generates a new 1024 bit RSA key pair and then computes the SHA-1 digest of the public key. The .onion name will be the Base32-encoded first half of that digest.
By using a hash of the public key as the URL to contact a hidden service you can easily authenticate it and bypass the existing CA structure. These 80 bits are sufficient to prevent collisions: even with a birthday attack (and thus an effective entropy of 40 bits) you can only find a random collision, not the key pair matching a specific .onion name.
So how did Facebook manage to come up with a public key resulting in
facebookcorewwwi.onion
? The answer is that they were incredibly lucky.
You can brute-force .onion names matching a specific pattern using tools like Shallot or Scallion. Those will generate key pairs until they find one resulting in a matching URL. That is reasonably fast for patterns of 1-5 characters. Finding a 6-character pattern takes on average 30 minutes, and for just 7 characters you might need to let it run for a full day.
Coming up with an .onion name starting with an 8-character pattern like
facebook
would thus take even longer or need a lot more resources. As a
Facebook engineer confirmed
they indeed got extremely lucky: they generated a few keys matching the pattern,
picked the best and then just needed to come up with an explanation for the
corewwwi
part to let users memorize it better.
Without taking a closer look at “Shallot” or “Scallion” let us go with a naive approach. We do not need to create another tool to find .onion names in the browser (the existing ones work great) but it is a good opportunity to again show what you can do with the WebCrypto API in the browser.
To generate a random name for a Tor hidden service we first need to generate a new 1024 bit RSA key just as Tor would do:
generateKey() returns a Promise that resolves to the new key pair. The second argument specifies that we want the key to be exportable, as we need to export it to check for pattern matches. We will not actually use the key to sign or verify data, but we need to specify valid usages for the public and private keys.
To check whether a generated public key matches a specific pattern we of course have to compute the hash for the .onion URL:
We first use exportKey() to get an SPKI representation of the public key, use digest() to compute the SHA-1 digest of that, and finally pass it to base32() to Base32-encode the first half of that digest.
Note: base32() is an RFC 3548 compliant Base32 implementation. chrisumbel/thirty-two is a good one that unfortunately does not support ArrayBuffers; I will use a slightly adapted version of it in the example code.
The only thing missing now is a function that checks for pattern matches and loops until we find one:
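A minimal version of that loop, building on generateRSAKey() and computeOnionHash(); here it resolves to the matching name and key pair:

```javascript
async function findOnionName(pattern) {
  for (;;) {
    const keys = await generateRSAKey();
    const name = await computeOnionHash(keys.publicKey);
    // Keep generating keys until the Base32 name matches the pattern.
    if (pattern.test(name)) {
      return {name, keys};
    }
  }
}
```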
We simply use generateRSAKey() and computeOnionHash() as defined before. In case of a pattern match we export the PKCS8 private key information, encode it as Base64 and format it nicely:
Note: base64() refers to an existing Base64 implementation that can deal with ArrayBuffers. niklasvh/base64-arraybuffer is a good one that I will use in the example code.
What is logged to the console can be directly used to replace any random key that Tor has assigned before. Here is how you would use the code we just wrote:
The Promise returned by findOnionName() will not resolve until a match is found. When generating lots of keys, Firefox currently sometimes fails with a “transient error” that needs to be investigated. If you want a loop that keeps running despite that error you could simply restart the search in the error handler.
https://gist.github.com/ttaubert/389255d724f219f76900
Include it in a minimal web site and have the Web Console open. It will run in Firefox 33+ and Chrome 37+ with the WebCrypto API explicitly enabled (if necessary).
As said before, the approach shown above is quite naive and thus very slow. The easiest optimization to implement might be to spawn multiple web workers and let them search in parallel.
We could also speed up finding keys by not regenerating the whole RSA key every loop iteration but instead increasing the public exponent by 2 (starting from 3) until we find a match and then check whether that produces a valid key pair. If it does not we can just continue.
Lastly, the current implementation does not perform any safety checks that Tor might run on the generated key. All of these points would be great reasons for a follow-up post.
Important: You should use the keys generated with this code to run a hidden service only if you trust the host that serves it. Getting your keys off of someone else’s web server is a terrible idea. Do not be that guy or gal.
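For reference, an HPKP response header has roughly the following shape; the Base64 values here are placeholders, not real pins:

```http
Public-Key-Pins:
  pin-sha256="<Base64 SPKI hash of a key in your current chain>";
  pin-sha256="<Base64 SPKI hash of your backup key>";
  max-age=15768000;
  includeSubDomains
```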
You can see that it specifies two pin-sha256 values, that is, the pins of two public keys. One is the pin of any public key in your current certificate chain, and the other is the pin of any public key not in your current certificate chain. The latter is a backup in case your certificate expires or has to be revoked.
It is definitely not obvious which public keys you should pin and what a good backup pin would be. Let us answer those questions by starting with a more detailed overview of how public key pinning and TLS certificates work.
Let us go back to the beginning and start by taking a closer look at RSA keys:
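The command in question was presumably along these lines:

```shell
# Generate a 2048 bit RSA key pair and print it to stdout.
openssl genrsa 2048
```

Note that OpenSSL 3.x prints a PKCS#8 `-----BEGIN PRIVATE KEY-----` header by default instead of the traditional one discussed below.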
The above command generates a 2048 bit RSA key and prints it to the console.
Although it says -----BEGIN RSA PRIVATE KEY-----
it does not only return the
private key but an
ASN.1 structure
that also contains the public key - we thus actually generated an RSA key pair.
A common misconception when learning about keys and certificates is that the RSA key itself expires for a given certificate. RSA keys however never expire - after all, they are just numbers. Only the certificate containing the public key can expire, and only the certificate can be revoked. Keys “expire” or are “revoked” only once there are no more valid certificates using the public key, and you have thrown away the keys and stopped using them altogether.
When you submit a Certificate Signing Request (CSR) containing your public key to a Certificate Authority, it will issue a valid certificate. That certificate will again contain the public key of the RSA key pair we generated above, plus an expiration date. Both the public key and the expiration date are signed by the CA, so that modifying either of the two would render the certificate invalid immediately.
For simplicity I left out a few other fields that X.509 certificates contain to properly authenticate TLS connections, for example your server’s hostname and other details.
The whole purpose of public key pinning is to detect when the public key of a certificate for a specific host has changed. That may happen when an attacker compromises a CA such that they are able to issue valid certificates for any domain. A foreign CA might also just be the attacker - think of state-owned CAs that you do not want to be able to MITM your site. The only way to stop an attacker who intercepts a connection from a visitor to your server with a forged certificate is to detect that the public key has changed.
After establishing a TLS session with the server, the browser will look up any stored pins for the given hostname and check whether any of those stored pins match any of the SPKI fingerprints (the output of applying SHA-256 to the public key information) in the certificate chain. The connection must be terminated immediately if pin validation fails.
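One common way to compute such an SPKI fingerprint is with the openssl CLI; the snippet generates a throwaway self-signed certificate first so it is self-contained:

```shell
# Create a throwaway self-signed certificate to work with.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=example.net" \
  -keyout key.pem -out cert.pem -days 1

# Extract the public key, DER-encode the SubjectPublicKeyInfo,
# hash it with SHA-256 and Base64-encode the result: a pin-sha256 value.
openssl x509 -in cert.pem -pubkey -noout |
  openssl pkey -pubin -outform der |
  openssl dgst -sha256 -binary |
  base64
```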
A valid certificate that passed all basics checks will be accepted if the browser could not find any pins stored for the current hostname. This might happen if the site does not support public key pinning and does not send any HPKP headers at all, or if this is the first time visiting and the server has not seen the HPKP header yet in a previous visit.
If your certificate expires or an attacker steals the private key, you will have to replace (and possibly revoke) the leaf certificate. This might invalidate your pin, and the constraints for obtaining a new valid certificate are the same as for an attacker trying to impersonate you and intercept TLS sessions.
Pin validation requires checking the SPKI fingerprints of all certificates in the chain and will succeed if any of the public keys matches any of the pins. When for example StartSSL signed your certificate you have another intermediate Class 1 or 2 certificate and their root certificate in the chain. The browser trusts only the root certificate but the intermediate ones are signed by the root certificate. The intermediate certificate in turn signs the certificate deployed on your server and that is called a chain of trust.
If you pinned your leaf certificate then the only way to recover is your backup pin - whatever this points to must be included in your new certificate chain if you want to allow users that stored your pin from previous connections back on your server.
An easier solution would be available if you provided the SPKI fingerprint of StartSSL’s Class 1 intermediate certificate. To construct a new valid certificate chain you simply have to ask StartSSL to re-issue a new certificate for a new or your current key. This comes at the price of a slightly bigger attack surface as someone that stole the private key of the CA’s intermediate certificate would be able to impersonate your site and pass key pinning checks.
Another possibility is pinning StartSSL’s root certificate. Any certificate issued by StartSSL would let you construct a new valid certificate chain. Again, this slightly increases the attack vector as any compromised intermediate or root certificate would allow to impersonate your site and pass pinning checks.
Given all of the above scenarios you might ask which key would be the best to pin, and the answer is: it depends. You can pin one or all of the public keys in your certificate chain and that will work. The specification requires you to have at least two pins, so you must include the SPKI hash of another CA’s root certificate, another CA’s intermediate certificate (a different tier of your current CA would also work), or another leaf certificate. The only requirement is that this pin is not equal to the hash of any of the certificates in the current chain. The poor browser cannot tell whether you gave it a valid and useful backup pin so it will happily accept random values too.
Pinning to a small set of CAs that you are comfortable with helps you reduce the risk to yourself. Pinning just your leaf certificates is only advised if you are really certain that this is for you. It is a little like driving without a seatbelt and might work most of the time. If something goes wrong it usually goes really wrong and you want to avoid that.
Pinning only your own leaf certs also bears the risk of creating a backup key that adheres to ancient standards and could not be used anymore when you have to replace your current certificate. Assume it was three years ago, and your backup was a 1024-bit RSA key pair. You pin for a year, and your certificate expires. You go to a CA and say “Hey, re-issue my cert for Key A”, and they say “No, your key is too small/weak”. You then say “Ah, but what about my backup key?” - and that also gets rejected because it is too short. In effect, because you only pinned to keys under your control you are now bricked.
Last weekend I finally deployed TLS for timtaubert.de
and decided to write up
what I learned on the way hoping that it would be useful for anyone doing the
same. Instead of only giving you a few buzz words I want to provide background
information on how TLS and certain HTTP extensions work and why you should use
them or configure TLS in a certain way.
One thing that bugged me was that most posts only describe what to do but not necessarily why to do it. I hope you appreciate me going into a little more detail to end up with the bigger picture of what TLS currently is, so that you will be able to make informed decisions when deploying yourselves.
To follow this post you will need some basic cryptography knowledge. Whenever you do not know or understand a concept you should probably just head over to Wikipedia and take a few minutes or just do it later and maybe re-read the whole thing.
Disclaimer: I am not a security expert or cryptographer but did my best to research this post thoroughly. Please let me know of any mistakes I might have made and I will correct them as soon as possible.
I read Andy Wingo’s blog post too and I really liked it. Everything he says in there is true. But what is also true is that TLS with the few add-ons is all we have nowadays and we better make the folks working for the NSA earn their money instead of not trying to encrypt traffic at all.
After you finished reading this page, maybe go back to Andy’s post and read it again. You might have a better understanding of what he is ranting about than you had before if the details of TLS are still dark matter to you.
Every TLS connection starts with both parties sharing their supported TLS versions and cipher suites. As the next step the server sends its X.509 certificate to the browser.
The following certificate checks need to be performed:
All of these are obviously crucial checks. To query a certificate’s revocation status the browser will use the Online Certificate Status Protocol (OCSP), which I will describe in more detail in a later section.
After the certificate checks are done and the browser ensured it is talking to the right host both sides need to agree on secret keys they will use to communicate with each other.
A simple key exchange would be to let the client generate a master secret and encrypt it with the server’s public RSA key given by the certificate. Both client and server would then use that master secret to derive symmetric encryption keys to be used throughout this TLS session. An attacker could however simply record the handshake and session for later, when breaking the key has become feasible or the machine turns out to be susceptible to a vulnerability. They could then use the server’s private key to recover the whole conversation.
When using (Elliptic Curve) Diffie-Hellman as the key exchange mechanism both sides have to collaborate to generate a master secret. They generate DH key pairs (which is a lot cheaper than generating RSA keys) and send their public key to the other party. With the private key and the other party’s public key the shared master secret can be calculated and then again be used to derive session keys. We can provide Forward Secrecy when using ephemeral DH key pairs. See the section below on how to enable it.
We could in theory also provide forward secrecy with an RSA key exchange if the server generated an ephemeral RSA key pair, shared its public key, and then waited for the master secret to be sent by the client. As hinted above, however, RSA key generation is very expensive and does not scale in practice. That is why RSA key exchanges are not a practical option for providing forward secrecy.
After both sides have agreed on session keys the TLS handshake is done and they can finally start to communicate using symmetric encryption algorithms like AES that are much faster than asymmetric algorithms.
Now that we understand authenticity is an integral part of TLS we know that in order to serve a site via TLS we first need a certificate. The TLS protocol can encrypt traffic between two parties just fine but the certificate provides the necessary authentication towards visitors.
Without a certificate a visitor could securely talk to either us, the NSA, or a different attacker but they probably want to talk to us. The certificate ensures by cryptographic means that they established a connection to our server.
If you want a cheap certificate, have no specific needs, and only a single subdomain (e.g. www) then StartSSL is an easy option. Do of course feel free to take a look at different authorities - their services and prices will vary heavily.
In the chain of trust the CA plays an important role: by verifying that you are the rightful owner of your domain and signing your certificate it will let browsers trust your certificate. The browsers do not want to do all this verification themselves so they defer it to the CAs.
For your certificate you will need an RSA key pair, a public and private key. The public key will be included in your certificate and thus also signed by the CA.
The example below shows how you can use OpenSSL on the command line to generate a key for your domain. Simply replace `example.com` with the domain of your website. `example.com.key` will be your new RSA key and `example.com.csr` will be the Certificate Signing Request that your CA needs to generate your certificate.
We will use a SHA-256 based signature for integrity as Firefox and Chrome will phase out support for SHA-1 based certificates soon. The RSA keys used to authenticate your website will use a 4096 bit modulus. If you need to handle a lot of traffic or your server has a weak CPU you might want to use 2048 bit. Never go below that as keys smaller than 2048 bit are considered insecure.
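Putting the above together, a single OpenSSL invocation along these lines produces both files. This is a sketch; `example.com` and the `-subj` value are placeholders, and exact flags may vary slightly between OpenSSL versions.

```shell
# Generate a 4096-bit RSA key (example.com.key) and a SHA-256-signed
# CSR (example.com.csr) in one step. The -subj value is a placeholder;
# without it, OpenSSL prompts interactively for the subject fields.
openssl req -new -newkey rsa:4096 -nodes -sha256 \
    -subj "/CN=example.com" \
    -keyout example.com.key -out example.com.csr
```

`-nodes` leaves the private key unencrypted so the web server can read it at startup without a passphrase prompt; protect it with file permissions instead.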
Sign up with the CA you chose and, depending on how they handle this process, you will probably first have to verify that you are the rightful owner of the domain that you claim to possess. StartSSL will do that by sending a token to `postmaster@example.com` (or similar) and then asking you to confirm the receipt of that token.
Now that you have signed up and are the verified owner of `example.com`, you simply submit the `example.com.csr` file to request the generation of a certificate for your domain. The CA will sign your public key and the other information contained in the CSR with their private key, and you can finally download the certificate to `example.com.crt`.
Upload the .crt and .key files to your web server. Be aware that any intermediate certificate in the CA’s chain must be included in the .crt file as well - you can just `cat` them together. StartSSL’s free tier has an intermediate Class 1 certificate - make sure to use the SHA-256 version of it. All files should be owned by root and must not be readable by anyone else. Configure your web server to use them and you should have TLS up and running out-of-the-box.
To properly deploy TLS you will want to provide (Perfect) Forward Secrecy. Without forward secrecy TLS still seems to secure your communication today, it might however not if your private key is compromised in the future.
If a powerful adversary (think NSA) records all communication between a visitor and your server, they can decrypt all this traffic years later by stealing your private key or going the “legal” way to obtain it. This can be prevented by using short-lived (ephemeral) keys for key exchanges that the server will throw away after a short period.
Using RSA with your certificate’s private and public keys for key exchanges is off the table as generating a 2048+ bit prime is very expensive. We thus need to switch to ephemeral (Elliptic Curve) Diffie-Hellman cipher suites. For DH you can generate a 2048 bit parameter once, choosing a private key afterwards is cheap.
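Generating the parameter file is a single OpenSSL invocation; `dhparam.pem` is just the filename this post uses throughout.

```shell
# Generate a 2048-bit Diffie-Hellman parameter file. This can take a
# while, since OpenSSL searches for a safe prime.
openssl dhparam -out dhparam.pem 2048
```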
Simply upload `dhparam.pem` to your server and instruct the web server to use it for Diffie-Hellman key exchanges. When using ECDH the predefined elliptic curve represents this parameter and no further action is needed.
Apache unfortunately does not support custom DH parameters: the parameter size is fixed at 1024 bit and is not user-configurable. This will hopefully be fixed in future versions.
One of the most important mechanisms to improve TLS performance is Session Resumption. In a full handshake the server sends a Session ID as part of the “hello” message. On a subsequent connection the client can use this session ID and pass it to the server when connecting. Because both the server and the client have saved the last session’s “secret state” under the session ID they can simply resume the TLS session where they left off.
Now you might notice that this could violate forward secrecy as a compromised server might reveal the secret state for all session IDs if the cache is just large enough. The forward secrecy of a connection is thus bounded by how long the session information is retained on the server. Ideally, your server would use a medium-sized in-memory cache that is purged daily.
Apache lets you configure the session cache using the `SSLSessionCache` directive; you should use the high-performance cyclic buffer `shmcb`. Nginx has the `ssl_session_cache` directive; you should use a `shared` cache that is shared between workers. The right size of those caches depends on the amount of traffic your server handles. You want browsers to resume TLS sessions, but you also want to get rid of old sessions roughly daily.
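For Nginx, a configuration in the spirit of the advice above might look like this; the cache size and timeout are illustrative values, not authoritative recommendations.

```nginx
# A 10 MB shared cache holds tens of thousands of sessions and is
# shared between all worker processes; sessions expire after one day.
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
```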
The second mechanism to resume a TLS session is Session Tickets. This extension transmits the server’s secret state to the client, encrypted with a key only known to the server. That ticket key is protecting the TLS connection now and in the future.
This can likewise violate forward secrecy if the key used to encrypt session tickets is compromised. The ticket (just as the session cache) contains all of the server’s secret state and would allow an attacker to reveal the whole conversation.
Nginx and Apache by default generate a session ticket key at startup and unfortunately provide no way to rotate it. If your server is running for months without a restart then you will use the same session ticket key for months, and breaking into your server could reveal every recorded TLS conversation since the web server was started.
Neither Nginx nor Apache has a sane way to work around this. Nginx might be able to rotate the key by reloading the server config, which is rather easy to automate with a cron job. Make sure to test that this actually works before relying on it though.
Thus if you really want to provide forward secrecy you should disable session tickets using `ssl_session_tickets off` for Nginx and `SSLOpenSSLConfCmd Options -SessionTicket` for Apache.
Mozilla’s guide on server-side TLS provides a great list of modern cipher suites that need to be put in your web server’s configuration. The combinations below are unfortunately only supported by modern browsers; for broader client support you might want to consider using the “intermediate” list.
All these cipher suites start with (EC)DHE which means they only support ephemeral Diffie-Hellman key exchanges for forward secrecy. The last line discards non-authenticated key exchanges, null-encryption (cleartext), legacy weak ciphers marked exportable by US law, weak ciphers (3)DES and RC4, weak MD5 signatures, and pre-shared keys.
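As a sketch of what such a configuration looks like in Nginx - the concrete suites here are illustrative only; take the authoritative, current string from Mozilla’s guide:

```nginx
# Ephemeral (EC)DHE suites first, followed by a blacklist matching the
# description above (aNULL, eNULL, EXPORT, (3)DES, RC4, MD5, PSK).
ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:!aNULL:!eNULL:!EXPORT:!DES:!3DES:!RC4:!MD5:!PSK';
```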
Note: To ensure that the order of cipher suites is respected you need to set `ssl_prefer_server_ciphers on` for Nginx or `SSLHonorCipherOrder on` for Apache.
Now that your server is configured to accept TLS connections you still want to support HTTP connections on port 80 to redirect old links and folks typing `example.com` in the URL bar to your shiny new HTTPS site.
At this point however a Man-In-The-Middle (or Woman-In-The-Middle) attack can easily intercept and modify traffic to deliver a forged HTTP version of your site to a visitor. The poor visitor might never know because they did not realize you offer TLS connections now.
To ensure your users are secure the next time they visit your site, you want to send an HSTS header to enforce strict transport security. After receiving this header the browser will not even try to establish an HTTP connection next time but connect to your website directly via TLS.
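In Nginx, for example, the header comes down to a single directive; the specific token values are discussed right below.

```nginx
# Send HSTS with every HTTPS response; browsers ignore it over plain HTTP.
add_header Strict-Transport-Security "max-age=15768000; includeSubDomains; preload";
```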
Sending these headers over an HTTPS connection (they will be ignored via HTTP) lets the browser remember that this domain wants strict transport security for the next six months (~15768000 seconds). The `includeSubDomains` token enforces TLS connections for every subdomain of your domain, and the non-standard `preload` token will be required for the next section.
If after deploying TLS the very first connection of a visitor is genuine we are fine. Your server will send the HSTS header over TLS and the visitor’s browser remembers to use TLS in the future. The very first connection and every connection after the HSTS header expires however are still vulnerable to a MITM attack.
To prevent this, Firefox and Chrome share an HSTS Preload List that basically includes HSTS headers for all sites that would send that header anyway. Before connecting to a host, Firefox and Chrome check whether the domain is in the list and, if so, will not even try using an insecure HTTP connection.
Including your page in that list is easy: just submit your domain using the HSTS Preload List submission form. Your HSTS header must be set up correctly and contain the `includeSubDomains` and `preload` tokens to be accepted.
OCSP - using an external server provided by the CA to check whether the certificate given by the server was revoked - might sound like a great idea at first. On second thought it actually sounds rather terrible. First, the CA providing the OCSP server suddenly has to be able to handle a lot of requests: every client opening a connection to your server will want to know whether your certificate was revoked before talking to you.
Second, the browser contacting a CA and passing the certificate is an easy way to monitor a user’s browsing behavior. If all CAs worked together they probably could come up with a nice data set of TLS sites that people visit, when and in what order (not that I know of any plans they actually wanted to do that).
OCSP Stapling is a TLS extension that enables the server to query its certificate’s revocation status at regular intervals in the background and send an OCSP response with the TLS handshake. The stapled response itself cannot be faked as it needs to be signed with the CA’s private key. Enabling OCSP stapling thus improves performance and privacy for your visitors immediately.
You need to create a certificate file that contains your CA’s root certificate prepended by any intermediate certificates that might be in your CA’s chain. StartSSL has an intermediate certificate for Class 1 (the free tier) - make sure to use the one with the SHA-256 signature.
Pass the file to Nginx using the `ssl_trusted_certificate` directive, and to Apache using the `SSLCACertificateFile` directive.
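For Nginx, enabling stapling might look like the snippet below; the file path is a placeholder for the chain file described above.

```nginx
ssl_stapling on;
# Verify stapled responses against the chain file built above.
ssl_stapling_verify on;
ssl_trusted_certificate /etc/nginx/ssl/ca-chain.pem;  # placeholder path
# Nginx needs a resolver to reach the CA's OCSP responder.
resolver 8.8.8.8;
```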
OCSP stapling is unfortunately not a silver bullet either. If a browser does not know in advance that it will receive a stapled response, an attacker might simply redirect HTTPS traffic to their own server and block any traffic to the OCSP server (in which case browsers soft-fail). Adam Langley explains all possible attack vectors in great detail.
One solution might be the proposed OCSP Must Staple Extension. This would add another field to the certificate issued by the CA that says a server must provide a stapled OCSP response. The problem here is that the proposal has expired, and in practice it would take years for CAs to support it.
Another solution would be to implement a header similar to HSTS that lets the browser remember to require a stapled OCSP response when connecting next time. This however has the same first-connection problem as HSTS, and we might have to maintain an “OCSP-Must-Staple Preload List”. As of today there is unfortunately no immediate solution in sight.
Even with all those security checks when receiving the server’s certificate we would still be completely out of luck in case your CA’s private key is compromised or your CA simply fucks up. We can prevent these kinds of attacks with an HTTP extension called Public Key Pinning.
Key pinning is a trust-on-first-use (TOFU) mechanism. The first time a browser connects to a host it lacks the information necessary to perform “pin validation”, so it will not be able to detect and thwart a MITM attack. This feature only allows detection of these kinds of attacks after the first connection.
Creating an HPKP header is easy, all you need to do is to compute the base64-encoded “SPKI fingerprint” of your server’s certificate. An SPKI fingerprint is the output of applying SHA-256 to the public key information contained in your certificate.
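The fingerprint can be computed with an OpenSSL pipeline like the one below. To make the sketch self-contained it first creates a throwaway self-signed certificate; in practice you would run only the second command against your real `example.com.crt`.

```shell
# For demonstration only: create a throwaway self-signed certificate.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=example.com" \
    -keyout example.com.key -out example.com.crt 2>/dev/null

# Extract the public key, DER-encode it, hash it with SHA-256, and
# base64-encode the result: the SPKI fingerprint.
openssl x509 -in example.com.crt -pubkey -noout \
    | openssl pkey -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -binary \
    | openssl enc -base64
```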
The result of running the above command can be directly used as the pin-sha256 values for the Public-Key-Pins header as shown below:
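A resulting header might look like the following; both pin values here are placeholders for the real base64 fingerprints.

```nginx
# Two pins: the current certificate's SPKI fingerprint plus a backup.
# "base64primary=" and "base64backup=" are placeholder values.
add_header Public-Key-Pins 'pin-sha256="base64primary="; pin-sha256="base64backup="; max-age=15768000; includeSubDomains';
```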
Upon receiving this header the browser knows that it has to store the given pins for the next six months (max-age=15768000) and to discard any certificate whose SPKI fingerprint does not match one of them. We specified the `includeSubDomains` token so the browser will verify pins when connecting to any subdomain as well.
It is considered good practice to include at least a second pin, the SPKI fingerprint of a backup RSA key that you can generate exactly as the original one:
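A sketch using hypothetical filenames: generate the backup key, then compute its SPKI fingerprint directly from the key’s public half.

```shell
# Generate the 4096-bit backup RSA key.
openssl genrsa -out example.com.backup.key 4096 2>/dev/null

# Its SPKI fingerprint, computed from the public key, becomes the
# second pin-sha256 value in the Public-Key-Pins header.
openssl rsa -in example.com.backup.key -pubout -outform der 2>/dev/null \
    | openssl dgst -sha256 -binary \
    | openssl enc -base64
```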
In case your private key is compromised you might need to revoke your current certificate and request the CA to issue a new one. The old pin however would still be stored in browsers for six months which means they would not be able to connect to your site. By sending two pin-sha256 values the browser will later accept a TLS connection when any of the stored fingerprints match the given certificate.
In the past years (and especially the last year) a few attacks on SSL/TLS were published. Some of those attacks can be worked around on the protocol or crypto library level so that you basically do not have to worry as long as your web server is up to date and the visitor is using a modern browser. A few attacks however need to be thwarted by configuring your server properly.
BEAST is an attack that only affects TLSv1.0. Exploiting this vulnerability is possible but rather difficult. You can either disable TLSv1.0 completely - which is certainly the preferred solution although you might neglect folks with old browsers on old operating systems - or you can just not worry. All major browsers have implemented workarounds so that it should not be an issue anymore in practice.
BREACH is a security exploit against HTTPS when using HTTP compression. BREACH is based on CRIME but, unlike CRIME - which can be defended against by turning off TLS compression (the default for Nginx and Apache nowadays) - BREACH can only be prevented by turning off HTTP compression. Other mitigations are using cross-site request forgery (CSRF) protection or disabling HTTP compression selectively based on headers sent by the application.
POODLE is yet another padding oracle attack on TLS. Luckily it only affects the predecessor of TLS which is SSLv3. The only solution when deploying a new server is to just disable SSLv3 completely. Fortunately, we already excluded SSLv3 in our list of preferred ciphers previously. Firefox 34 will ship with SSLv3 disabled by default, Chrome and others will hopefully follow soon.
Thanks for reading, and I am really glad you made it this far! I hope this post did not discourage you from deploying TLS - after all, getting your setup right is the most important thing. And it certainly is better to know what you are getting yourself into than to leave your visitors unprotected.
If you want to read even more about setting up TLS, the Mozilla Wiki page on Server-Side TLS has more information and proposed web server configurations.
Thanks a lot to Frederik Braun for taking the time to proof-read this post and helping to clarify a few things!