Preparation for 0.6.0 (#517)
Co-authored-by: David Chisnall <davidchisnall@users.noreply.github.com>
Co-authored-by: Robert Norton <1412774+rmn30@users.noreply.github.com>
Co-authored-by: Nathaniel Wesley Filardo <nfilardo@microsoft.com>
Co-authored-by: Istvan Haller <31476121+ihaller@users.noreply.github.com>
5 people authored May 9, 2022
1 parent 5906b14 commit d5c732f
Showing 21 changed files with 3,062 additions and 46 deletions.
19 changes: 15 additions & 4 deletions README.md
@@ -29,13 +29,24 @@ scenarios that can be problematic for other allocators:
Both of these can cause massive reductions in performance of other allocators, but
do not for snmalloc.

Comprehensive details about snmalloc's design can be found in the
[accompanying paper](snmalloc.pdf), and differences between the paper and the
current implementation are [described here](difference.md).
Since writing the paper, the performance of snmalloc has improved considerably.
The implementation of snmalloc has evolved significantly since the [initial paper](snmalloc.pdf).
The mechanism for returning memory to remote threads has remained, but most of the meta-data layout has changed.
We recommend you read [docs/security](./docs/security/README.md) to find out about the current design, and
if you want to dive into the code, [docs/AddressSpace.md](./docs/AddressSpace.md) provides a good overview of the allocation and deallocation paths.

[![snmalloc CI](https://github.com/microsoft/snmalloc/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/microsoft/snmalloc/actions/workflows/main.yml)

# Hardening

There is a hardened version of snmalloc, which provides:

* randomisation of the allocations' relative locations,
* separation of most meta-data from the allocations, protected with guard pages,
* a novel encoding of all in-band meta-data that can detect corruption, and
* a `memcpy` that automatically checks the bounds relative to the underlying malloc.

A more comprehensive write up is in [docs/security](./docs/security/README.md).

# Further documentation

- [Instructions for building snmalloc](docs/BUILDING.md)
42 changes: 0 additions & 42 deletions difference.md

This file was deleted.

130 changes: 130 additions & 0 deletions docs/security/FreelistProtection.md
@@ -0,0 +1,130 @@
# Protecting meta-data

Corrupting an allocator's meta-data is a common pattern for increasing the power of a use-after-free or out-of-bounds write vulnerability.
If you can corrupt the allocator's meta-data, then you can take a control gadget in one part of a system, and use it to affect other parts of the system.
There are various approaches to protecting allocator meta-data; the most common are:

* make the allocator meta-data hard to find through randomisation
* use completely separate ranges of memory for meta-data and allocations
* surround meta-data with guard pages
* add some level of encryption/checksumming

With the refactoring of the page table ([described earlier](./VariableSizedChunks.md)), we can put all the slab meta-data in completely separate regions of memory from the allocations.
We maintain this separation over time, and never allow memory that has been used for allocations to become meta-data and vice versa.
Within the meta-data regions, we add randomisation to make the data hard to find, and add large guard regions around the meta-data.
By using completely separate regions of memory for allocations and meta-data we ensure that no dangling allocation can refer to current meta-data.
This is particularly important for CHERI, as it means a UAF cannot be used to corrupt allocator meta-data.

But there is one super important bit that still remains: free lists.

## What are free lists?

Many allocators chain together unused allocations into a linked list.
This is remarkably space efficient, as it doesn't require meta-data proportional to the number of allocations on a slab.
The disused objects can be chained into either a stack or a queue.
However, the key problem is that neither randomisation nor guard pages can be used to protect this _in-band_ meta-data.

In snmalloc, we have introduced a novel technique for protecting this data.

## Protecting a free queue.

The idea is remarkably simple: a doubly linked list is far harder to corrupt than a singly linked list, because you can check its invariant:
```
x.next.prev == x
```
In every kind of free list in snmalloc, we encode both the forward and backward pointers in our lists.
For the forward direction, we use an [involution](https://en.wikipedia.org/wiki/Involution_(mathematics)), `f`, such as XORing a randomly chosen value:
```
f(a) = a XOR k0
```
For the backward direction, we use a more complex, two-argument function
```
g(a, b) = (a XOR k1) * (b XOR k2)
```
where `k1` and `k2` are two randomly chosen 64-bit values.
The encoded back pointer of the node after `x` in the list is `g(x, f(x.next))`, which gives a value that is hard to forge and still encodes the back edge relationship.

As we build the list, we add this value to the disused object, and when we consume the free list later, we check the value is correct.
Importantly, the order of construction and consumption has to be the same, which means we can only use queues, not stacks.

The checks give us a way to detect that the list has not been corrupted.
In particular, use-after-free or out-of-bounds writes to either the `next` or `prev` value are highly likely to be detected later.
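
To make the encoding concrete, here is a minimal sketch of how the forward and backward encodings described above could be built and checked; it is not snmalloc's actual `freelist.h` code, and the keys `k0`, `k1`, `k2` and the `FreeObject` layout are illustrative stand-ins (snmalloc chooses its keys randomly at runtime):

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative per-allocator keys; the real ones are chosen randomly.
constexpr uintptr_t k0 = 0x9e3779b97f4a7c15;
constexpr uintptr_t k1 = 0xc2b2ae3d27d4eb4f;
constexpr uintptr_t k2 = 0x165667b19e3779f9;

// Involution for the forward direction: f(f(a)) == a.
inline uintptr_t f(uintptr_t a) { return a ^ k0; }

// Two-argument signing function for the backward direction.
inline uintptr_t g(uintptr_t a, uintptr_t b) { return (a ^ k1) * (b ^ k2); }

struct FreeObject
{
  uintptr_t next_encoded; // f(address of the next object)
  uintptr_t signed_prev;  // g(prev, f(prev.next)), written by the enqueuer
};

// Append `next` after the current tail `curr` while building the queue.
inline void enqueue(FreeObject* curr, FreeObject* next)
{
  curr->next_encoded = f(reinterpret_cast<uintptr_t>(next));
  // The back-edge signature for the node after curr is g(curr, f(curr.next)).
  next->signed_prev = g(reinterpret_cast<uintptr_t>(curr), curr->next_encoded);
}

// Take `curr` from the front of the queue; `expected_signed_prev` was
// computed when its predecessor was dequeued (or seeded for the head).
inline FreeObject* dequeue(FreeObject* curr, uintptr_t& expected_signed_prev)
{
  if (curr->signed_prev != expected_signed_prev)
    abort(); // corruption (e.g. UAF write or double free) detected
  // Remember the signature the following element must carry.
  expected_signed_prev =
    g(reinterpret_cast<uintptr_t>(curr), curr->next_encoded);
  return reinterpret_cast<FreeObject*>(f(curr->next_encoded));
}
```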

## Double free protection

This encoding also provides great double-free protection.
If you free twice, it will corrupt the `prev` pointer, and thus when we come to reallocate that object later, we will detect the double free.
The following animation shows the effect of a double free:

![Double free protection example](./data/doublefreeprotection.gif)

This is a weak protection as it is lazy: snmalloc only raises an error when the object is reused, so a `malloc` can fail due to an earlier double free. But we are only aiming to make exploits harder; this is not a bug-finding tool.


## Where do we use this?

Everywhere we link disused objects: (1) the per-slab free queues, and (2) the per-allocator message queues used to return freed allocations to other threads.
snmalloc has always used queues for returning memory to other threads; we only had to refactor the per-slab free lists to be queues rather than stacks, which was fairly straightforward.
The code for the free lists can be found here:

[Code](https://github.com/microsoft/snmalloc/blob/main/src/snmalloc/mem/freelist.h)

The idea could easily be applied to other allocators, and we're happy to discuss this.

## Finished assembly

So let's look at what costs we incur from this.
Checks are added both when constructing the queues and when taking elements from them.
Here we show the assembly for taking from a per-slab free list, which is integrated into the fast path of allocation:
```x86asm
<malloc(unsigned long)>:
lea rax,[rdi-0x1] # Check for small size class
cmp rax,0xdfff # | zero is considered a large size
ja SLOW_SIZE # | to remove from fast path.
shr rax,0x4 # Lookup size class in table
lea rcx,[size_table] # |
movzx edx,BYTE PTR [rax+rcx*1] # |
mov rdi,rdx #+Calculate index into free lists
shl rdi,0x4 #+| (without checks this is a shift by
# | 0x3, and can be fused into an lea)
mov r8,QWORD PTR [rip+0xab9b] # Find thread local allocator state
mov rcx,QWORD PTR fs:0x0 # |
add rcx,r8 # |
add rcx,rdi # Load head of free list for size class
mov rax,QWORD PTR fs:[r8+rdi*1] # |
test rax,rax # Check if free list is empty
je SLOW_PATH_REFILL # |
mov rsi,QWORD PTR fs:0x0 # Calculate location of free list structure
add rsi,r8 # | rsi = fs:[r8]
mov rdx,QWORD PTR fs:[r8+0x2e8] #+Load next pointer key
xor rdx,QWORD PTR [rax] # Load next pointer
prefetcht0 BYTE PTR [rdx] # Prefetch next object
mov QWORD PTR [rcx],rdx # Update head of free list
mov rcx,QWORD PTR [rax+0x8] #+Check signed_prev value is correct
cmp rcx,QWORD PTR fs:[r8+rdi*1+0x8] #+|
jne CORRUPTION_ERROR #+|
lea rcx,[rdi+rsi*1] #+Calculate signed_prev location
add rcx,0x8 #+| rcx = fs:[r8+rdi*1+0x8]
mov rsi,QWORD PTR fs:[r8+0x2d8] #+Calculate next signed_prev value
add rsi,rax #+|
add rdx,QWORD PTR fs:[r8+0x2e0] #+|
imul rdx,rsi #+|
mov QWORD PTR [rcx],rdx #+Store signed_prev for next entry.
ret
```
The extra instructions specific to handling the checks are marked with `+`.
As you can see, the protected fast path is about twice the length of the unprotected one, but the checks add only a single branch, one multiplication, five additional loads, and one store.
The loads only involve one additional cache line for key material.
Overall, the cost is surprisingly low.

Note: the free list header now contains the value that `prev` should hold, which leads to slightly worse x86 codegen.
For instance, the checks introduce `shl rdi,0x4`; without the checks this is a shift by `0x3` that is fused into an `lea` instruction.

## Conclusion

This approach provides a strong defense against corruption of the free lists used in snmalloc.
This means all inline meta-data has corruption detection.
The check gives a remarkably simple form of double-free detection, with far lower memory overhead than an allocation bitmap.

[Next we show how to randomise the layout of memory in snmalloc, and thus make it harder to guess the relative address of a pair of allocations.](./Randomisation.md)
151 changes: 151 additions & 0 deletions docs/security/GuardedMemcpy.md
@@ -0,0 +1,151 @@
# Providing a guarded memcpy

Out of bounds errors are a serious problem for systems.
We did some analysis of the Microsoft Security Response Center data to look at the out-of-bounds heap corruption, and found a common culprit: `memcpy`.
Of the OOB writes that were categorised as leading to remote code execution (RCE), 1/3 of them had a block copy operation like memcpy as the initial source of corruption.
This makes any mitigation to `memcpy` extremely high-value.

Now, if a `memcpy` crosses a boundary of a `malloc` allocation, then we have a well-defined error in the semantics of the program.
No sensible program should do this.
So let's see how we detect this with snmalloc.


## What is `memcpy`?

So `memcpy(dst, src, len)` copies `len` bytes from `src` to `dst`.
For this to be valid, we can check:
```
if (src is managed by snmalloc)
check(remaining_bytes(src) >= len)
if (dst is managed by snmalloc)
check(remaining_bytes(dst) >= len)
```
Now, the first `if` is checking for reading beyond the end of the object, and the second is checking for writing beyond the end of the destination object.
By default, for release checks we only check that `dst` is big enough.
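
As a concrete (and deliberately simplified) illustration of the check above, here is a sketch of a guarded wrapper; `is_snmalloc_managed` and `remaining_bytes` are hypothetical stand-ins for the chunk-map lookups that snmalloc actually performs:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Placeholder lookups: the real versions are O(1) chunk-map queries,
// as shown in the finished assembly later in this document.
static bool is_snmalloc_managed(const void*) { return false; }
static size_t remaining_bytes(const void*) { return SIZE_MAX; }

void* guarded_memcpy(void* dst, const void* src, size_t len)
{
  // Reading past the end of the source object?
  if (is_snmalloc_managed(src) && remaining_bytes(src) < len)
    abort();
  // Writing past the end of the destination object?
  // (This is the only check enabled by default for release checks.)
  if (is_snmalloc_managed(dst) && remaining_bytes(dst) < len)
    abort();
  return memcpy(dst, src, len);
}
```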


## How can we implement `remaining_bytes`?

In the previous [page](./VariableSizedChunks.md), we discussed how we enable variable sized slabs.
Let's consider how that representation enables us to quickly find the start/end of any object.

All slab sizes are powers of two, and a given slab's lowest address will be naturally aligned for the slab's size.
(For brevity, slabs are sometimes said to be "naturally aligned (at) powers of two".)
That is, if `x` is the start of a slab of size `2^n`, then `x % (2^n) == 0`.
This means that a single mask can be used to find the offset into a slab.
As the objects are laid out contiguously, we can also get the offset within the object with a modulus operation, so `remaining_bytes(p)` is effectively:
```
object_size - ((p % slab_size) % object_size)
```

Well, as anyone will tell you, division/modulus on a fast path is a non-starter.
The first modulus is easy to deal with: we can replace `% slab_size` with a bit-wise mask.
However, as `object_size` can be non-power-of-two values, we need to work a little harder.

## Reciprocal division to the rescue

When you have a finite domain, you can lower divisions into a multiply and shift.
By pre-calculating `c = (((2^n) - 1)/size) + 1`, the division `x / size` can instead be computed by
```
(x * c) >> n
```
The choice of `n` has to be done carefully for the possible values of `x`, but with a large enough `n` we can make this work for all slab offsets and sizes.

Now, from the division, we can calculate the modulus by multiplying the result of the division
by the size, and then subtracting that from the original value:
```
x - (((x * c) >> n) * size)
```
and thus `remaining_bytes(x)` is:
```
(((x * c) >> n) * size) + size - x
```

There is a great article that explains this in more detail by [Daniel Lemire](https://lemire.me/blog/2019/02/20/more-fun-with-fast-remainders-when-the-divisor-is-a-constant/).

Making sure you have everything correct is tricky, but thankfully computers are fast enough to check all possibilities.
In snmalloc, we have a test program that verifies, for all possible slab offsets and all object sizes, that our optimised result is equivalent to the original modulus.

We build the set of constants per sizeclass using `constexpr`, which enables us to determine the end of an object in a handful of instructions.
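
A minimal sketch of this scheme is shown below, using the shift of `0x36` (54) that appears in the finished assembly further down, a hypothetical 16 KiB slab, and a handful of illustrative object sizes; it also performs the kind of exhaustive comparison against the plain modulus that snmalloc's test does:

```cpp
#include <cassert>
#include <cstdint>

// Shift used for the reciprocal; 0x36 (54) appears in the assembly below.
constexpr unsigned N = 54;

// c = (((2^N) - 1) / size) + 1, as defined above.
constexpr uint64_t reciprocal(uint64_t size)
{
  return (((uint64_t(1) << N) - 1) / size) + 1;
}

// offset is the pointer's offset within its slab (p & slab_mask).
constexpr uint64_t remaining_bytes(uint64_t offset, uint64_t size, uint64_t c)
{
  uint64_t div = (offset * c) >> N;    // offset / size
  return (div * size) + size - offset; // size - (offset % size)
}

int main()
{
  constexpr uint64_t slab_size = 1 << 14; // hypothetical 16 KiB slab
  const uint64_t sizes[] = {16, 48, 96, 224, 1024};
  // Exhaustively verify the optimised form against the plain modulus,
  // in the spirit of snmalloc's own test program.
  for (uint64_t size : sizes)
  {
    uint64_t c = reciprocal(size);
    for (uint64_t offset = 0; offset < slab_size; offset++)
      assert(remaining_bytes(offset, size, c) == size - (offset % size));
  }
  return 0;
}
```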

## Non-snmalloc memory.

The `memcpy` function is not just called on memory that is received from `malloc`.
This means we need our lookup to work on all memory and, where the memory is not managed by snmalloc, to assume the operation is correct.
We ensure that the `0` value in the chunk map is interpreted as an object covering the whole of the address space, which preserves compatibility.

To achieve this nicely, we map 0 to a slab that covers the whole of the address space, and consider there to be a single object in this space.
This works by setting the reciprocal constant to 0, so the division term is always zero and `remaining_bytes` simply reports the distance to the end of the address space.

There is a second complication: `memcpy` can be called before `snmalloc` has been initialised.
So we need a check for this case.

## Finished Assembly

The finished assembly for checking the destination length in `memcpy` is:

```x86asm
<memcpy_guarded>:
mov rax,QWORD PTR [rip+0xbfa] # Load Chunk map base
test rax,rax # Check if chunk map is initialised
je DONE # |
mov rcx,rdi # Get chunk map entry
shr rcx,0xa # |
and rcx,0xfffffffffffffff0 # |
mov rax,QWORD PTR [rax+rcx*1+0x8] # Load sizeclass
and eax,0x7f # |
shl rax,0x5 # |
lea r8,[sizeclass_meta_data] # |
mov rcx,QWORD PTR [rax+r8*1] # Load object size
mov r9,QWORD PTR [rax+r8*1+0x8] # Load slab mask
and r9,rdi # Offset within slab
mov rax,QWORD PTR [rax+r8*1+0x10] # Load modulus constant
imul rax,r9 # Perform reciprocal modulus
shr rax,0x36 # |
imul rax,rcx # |
sub rcx,r9 # Find distance to end of object.
add rcx,rax # |
cmp rcx,rdx # Compare to length of memcpy.
jb ERROR # |
DONE:
jmp <memcpy>
ERROR:
ud2 # Trap
```

## Performance

We measured the overhead of adding checks to various sizes of `memcpy`s.
We did a batch of 1000 `memcpy`s, and measured the time with and without checks.
The benchmark code can be found here: [Benchmark Code](../../src/test/perf/memcpy/)

![Performance graphs](./data/memcpy_perf.png)

As you can see, the overhead for small copies can be significant (60% on a single byte `memcpy`), but the overhead rapidly drops and is mostly in the noise once you hit 128 bytes.

When we actually apply this to more realistic examples, we can see a small overhead, which for many examples is not significant.
We compared snmalloc (`libsnmallocshim.so`) to snmalloc with just the checks enabled for bounds of the destination of the `memcpy` (`libsnmallocshim-checks-memcpy-only`) on the applications contained in mimalloc-bench.
The results of this comparison are in the following graph:

![Performance Graphs](./data/perfgraph-memcpy-only.png)

The worst regression is for `redis`, with a 2-3% slowdown relative to snmalloc running without memcpy checks.
However, given that this benchmark runs 20% faster than jemalloc, we believe the feature can be switched on for production workloads.

## Conclusion

We have an efficient check we can add to any block memory operation to prevent corruption.
The cost on small copies will be higher due to the number of arithmetic instructions, but as the copies grow the overhead diminishes.
The memory overhead for adding checks is almost zero as all the dynamic meta-data was already required by snmalloc to understand the memory layout, and the small cost for lookup tables in the binary is negligible.

The idea can easily be applied to other block operations in libc; we have just done `memcpy` as a proof of concept.
If the feature were tightly coupled with libc, then the initialisation check could also be removed, improving performance.

[Next, we look at how to defend the internal structures of snmalloc against corruption due to memory safety violations.](./FreelistProtection.md)


# Thanks

The research behind this has involved a lot of discussions with a lot of people.
We are particularly grateful to Andrew Paverd, Joe Bialek, Matt Miller, Mike Macelletti, Rohit Mothe, Saar Amar and Swamy Nagaraju for countless discussions on guarded memcpy, its possible implementations and applications.