Reference:
[ISAv9-draft, Chapter 3.5.4, CHERI Concentrate Compression]()
CHERI Concentrate (CC) is a compression scheme applied to CHERI capabilities. CC achieves the best published region encoding efficiency and solves important pipeline problems caused by a decompressed register file.
The object bounds and permission information encoded in capability pointers account for the largest share of the overhead. This motivates a new encoding scheme to reduce it, i.e. a method for compressing and decompressing the bounds.
64-bit address + 16 permission bits (4 user-defined + 12 hardware-defined) + 18-bit object type + 27 bits that encode the bounds relative to the address.
$MW$: mantissa width, a parameter of the encoding that determines the precision of the bounds. For 128-bit capabilities we use $MW=14$, but this could be adjusted depending on the number of bits available in the capability format.
$B$ and $T$ are $MW$-bit values that are substituted into the capability address to form the base and top. They are stored in slightly compressed form in the encoding, in one of two formats depending on the $I_E$ bit.
$I_E$, the internal exponent bit, selects between the two formats.
$E$ is the 6-bit exponent. It determines the position at which $B$ and $T$ are inserted in $a$. Larger values allow larger regions to be encoded but impose stricter alignment restrictions on the bounds.
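As a quick summary of the fields above, here is a minimal sketch of a 128-bit CC capability as a record of its decoded fields. The widths come from the list above; the record name, field names, and grouping are illustrative assumptions, not the ISA's exact bit layout.

```python
from dataclasses import dataclass

MW = 14  # mantissa width used by the 128-bit capability format


@dataclass
class CCCapFields:
    """Decoded fields of a 128-bit CHERI Concentrate capability.

    Field widths follow the description above; the actual in-memory bit
    layout is ISA-version specific and not modelled here (hypothetical
    names and grouping).
    """
    address: int        # a: full 64-bit address
    perms: int          # 16 permission bits (4 user-defined + 12 hardware-defined)
    otype: int          # 18-bit object type
    internal_exp: bool  # I_E: selects between the two bounds formats
    E: int              # 6-bit exponent (0 when I_E is clear)
    B: int              # MW-bit base field
    T: int              # MW-bit top field
```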
The address $a$ is split into three parts: $a_{top} = a[63:E+MW]$, $a_{mid} = a[E+MW-1:E]$, and $a_{low} = a[E-1:0]$.
Replace $a_{mid}$ with $B$ (for the base) or $T$ (for the top), and clear the lower $E$ bits.
Adjust $a_{top}$ with corrections $c_b$ and $c_t$ (each $-1$, $0$, or $+1$), which account for the bounds lying in the neighbouring $2^{E+MW}$-aligned region; a decode sketch follows below.
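Putting the three steps together, here is a minimal decode sketch. It assumes $B$ and $T$ are already reconstructed as full $MW$-bit values and uses $R = B - 2^{MW-2}$ as the boundary of the representable region; names like `decode_bounds` are illustrative, and ISA corner cases (e.g. regions reaching the top of the address space) are ignored.

```python
MW = 14  # mantissa width for 128-bit capabilities


def decode_bounds(a: int, B: int, T: int, E: int) -> tuple[int, int]:
    """Recover (base, top) from address a and the compressed fields B, T, E.

    Sketch of the three steps above:
      1. split a into a_top | a_mid | a_low at bit positions E+MW and E,
      2. substitute B / T for a_mid and clear the low E bits,
      3. correct a_top by c_b / c_t, found by comparing a_mid, B, T against
         R = B - 2^(MW-2) (assumed boundary of the representable region).
    """
    a_mid = (a >> E) & ((1 << MW) - 1)
    a_top = a >> (E + MW)

    R = (B - (1 << (MW - 2))) % (1 << MW)

    def correction(field: int) -> int:
        # +1 if the field has wrapped past the 2^(E+MW) boundary but the
        # address has not, -1 in the opposite case, 0 otherwise.
        return (1 if field < R else 0) - (1 if a_mid < R else 0)

    base = ((a_top + correction(B)) << (E + MW)) | (B << E)
    top = ((a_top + correction(T)) << (E + MW)) | (T << E)
    return base, top


if __name__ == "__main__":
    # Toy example: E = 4, so bounds are 16-byte aligned; the region
    # [0x1000, 0x2000) around address 0x1234 round-trips correctly here.
    E = 4
    B = (0x1000 >> E) & ((1 << MW) - 1)
    T = (0x2000 >> E) & ((1 << MW) - 1)
    print([hex(x) for x in decode_bounds(0x1234, B, T, E)])
```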
Using the Mac OS X 10.9 DTrace framework, the authors collected traces from every allocator found in six real-world applications. The allocators included many forms of malloc(), several application-specific allocators, driver-internal allocators, and many other variants.
In each case they traced around 1 billion user-space instructions on the FPGA implementation, about 10 seconds of execution time on their 100 MHz processor, sampled throughout the benchmark.
Results:
A design that decompresses capabilities into the register file must pipeline the unpack operation that decodes them, or else risk greatly impacting performance. The operations timed: unpack, pointer add, bounds check.
Complete bounds decoding (unpack):
Pointer add: 2.74 ns → 2.89 ns; slightly longer than Low-fat's pointer add (a sketch of the representability check behind this follows below).
Bounds check: 1.88 ns → 3.85 ns; longer than Low-fat's, but it “fits comfortably into the execution path of our pipeline (which is parallel to cache lookup)”.
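Pointer add stays cheap because CC does not re-derive the bounds on every add: it only has to confirm that the new address is still inside the representable space, so the same $B$, $T$, $E$ fields keep decoding to the same region. A minimal, non-optimized sketch of that check, assuming the same $R = B - 2^{MW-2}$ formulation as above (the hardware uses a faster comparison, and wrap-around at the ends of the address space is ignored):

```python
MW = 14  # mantissa width for 128-bit capabilities


def representable_space(a: int, B: int, E: int) -> tuple[int, int]:
    """Return [rep_base, rep_top): the 2^(E+MW)-byte window within which
    the address may move without changing how B, T, E decode
    (sketch; assumes R = B - 2^(MW-2))."""
    a_mid = (a >> E) & ((1 << MW) - 1)
    a_top = a >> (E + MW)
    R = (B - (1 << (MW - 2))) % (1 << MW)
    # The window starts where the middle bits equal R, at or below a.
    space_top_bits = a_top if a_mid >= R else a_top - 1
    rep_base = (space_top_bits << (E + MW)) | (R << E)
    return rep_base, rep_base + (1 << (E + MW))


def pointer_add_ok(a: int, increment: int, B: int, E: int) -> bool:
    """Check that the new address stays representable (the test performed
    alongside the add in the pipeline; here written the slow, obvious way)."""
    rep_base, rep_top = representable_space(a, B, E)
    return rep_base <= a + increment < rep_top


if __name__ == "__main__":
    # Toy example: bounds based at 0x41000 with E = 4.
    E = 4
    B = (0x41000 >> E) & ((1 << MW) - 1)
    print(pointer_add_ok(0x41234, 0x5000, B, E))   # small step: stays representable
    print(pointer_add_ok(0x41234, 0x40000, B, E))  # large step: leaves the window
```

If the check fails, the capability can no longer encode its bounds; in CHERI the result's tag is cleared (making the capability unusable) rather than raising an exception, so the add itself always completes.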
Security policies:
Small benchmarks:
Larger benchmarks:
If you could revise the fundamental principles of computer system design to improve security... what would you change?