Cheri Concentrate

Reference:

Overview

Cheri Concentrate(CC) is a compression scheme applied to CHERI. CC achieves the best published region encoding efficiency, solves important pipeline problems caused by a decompressed register file.

Problem

The object bounds and permission information encoded in capability pointers cause the largest overhead among all overheads. Thus need a new encoding scheme to reduce overhead, i.e a method for compression and decompression.

Challenge

  • Pipeline optimization in terms of the hardware changes for the encoding scheme.
  • Minimize the delay of bound checking for integration with standard processor designs.
  • Minimize the semantic restrictions to support legacy code.

Solution

  • Floating-point bounds encoding.
  • Full toolchain support: compilers, binaries, runtime environment.
  • Representable Region with power-of-two bounds, beyonds object bounds.
  • Representability Check with low delay, comparable to a pointer add.

Design

128-bit cap format

64-bit address + 16 permission bits (4 user defined + 12 hardware defined) + 18-bit object type + 27 bits that encode the bounds relative

  • $MW$: mantissa width, a parameter of the encoding that determines the precision of the bounds. For 128-bit capabilities we use $MW=14$, but this could be adjusted depending on the number of bits available in the capability format.

  • $B$ and $T$ are $MW$-bit values that are substituted into the capability address to form the base and top. They are stored in a slightly compressed form in the encoding, in one of the two formats depending on the $I_E$ bit.

  • $I_E$ is the internal exponent bit selects between two formats.

    • $I_E=0$: the exponent is implied to be zero and the full width of $B$ and $T$ are used;
    • $I_E=1$: an exponent is stored instead of the lower three bits of $B$ and $T$ fields, reducing the precision available by three bits.
  • $E$ is the 6-bit exponent. It determines the position at which $B$ and $T$ are inserted in $a$. Larger values allow larger regions to be encoded but impose stricter alignment restrictions on the bounds.

Address $a$ is splitted into three parts:

  • $a_{top}=a[63:E+14]$
  • $a_{mid}=a[E+13:E]$
  • $a_{low}=a[E-1:0]$.

computing the top and base from a with E/T/B

Replace $a_{mid}$ with $B$ and $T$, and clearing the lower $E$ bits.

Adjust $a_{top}$ with corrections $c_b$ and $c_t$.

Evaluation

Representability

Mac OS X 10.9 dtrace framework, collect traces from every allocator found in six real-world applications:

  • Chrome 38.0.2125
  • Firefox 31
  • Apache 2.4
  • iTunes 12
  • MPlayer build #127
  • mySQL 5.

Allocators included many forms of malloc(), serveral application-specific allocators, driver internal allocators, and many other variants.

Instruction Frequency Study

  • Duktape Javascript interpreter running Splay benchmark from the Octane suite
  • SQLite benchmark for LevelIDB project
  • P7Zip benchmark from LLVM
  • boot of FreeBSD with all user-space processes compiled in a pure-capability mode.

Each case we traced around 1 billion user-space instructions from the FPGA implementation, about 10 seconds of exection time on our 100MHz processor, sampled throughout the benchmark.

Results:

  • Pointer-size loads constitue up to 12% of common programs.
    • pointer loads should minimize additional delay caused by an unpack operation to decode capabilities into the register file or else risk greatly impacting performance.
  • Poniter arithmetic commonly constitutes over 10% of executed instructions.
    • pointer add must remain simple, fast, and energy-efficient, in the face of new requirements imposed by capabilities.
  • Load and stores of data and capabilities constituting as much as 35% of common programs.
    • The bounds check on the offset addressing operation must not impede the critical path, or else would risk greatly impacting program performance.

Concentrate pipeline

Unpack, pointer add, bounds check.

Complete bounds decoding (unpack):

  • 2013 CCS LowFat 1, on Altera Stratix V 5SGXEA7N2F45C2 FGPA: 4.47ns.
    • on their Xilinx Virtex 6: 4ns
    • Too complex to be performed after memory load and before register write back
  • CHERI CC, on Altera Stratix V 5SGXEA7N2F45C2 FGPA: 1.70ns

Pointer add: 2.74ns –> 2.89ns: slightly longer than Low-fat’s pointer add.

Bounds check: 1.88ns –> 3.85ns: longer than lowfat (but “fits comfortably into the execution path of our pipeline (which is parallel to cache lookup)”).

Execution performance

Security policies:

  • Spatial memory safety: use capabilities for all data pointers (including heap and stack allocations)
  • CFI: use capabilities for return addresses and function pointers 2

Small benchmarks:

  • MiBench
  • Olden suite

Larger benchmarks:

  • P7Zip 16.02
  • Octane with Duktape 1.4.0
  • Sqlite3 3.21.0
  • SPECint 2006
Created Jun 27, 2019 // Last Updated Dec 6, 2022

If you could revise
the fundmental principles of
computer system design
to improve security...

... what would you change?