Portable NaCl

Portable Native Client[^1]: control flow and memory integrity with average performance overhead of under 5% on ARM and 7% on x86-64.

Introduction

About previous SFI on CISC

Earlier control+store SFI on x86-32 reported about 25% overhead, which the authors considered excessive.

“As we continued our exploration of ARM SFI and sought to understand ARM behavior relative to x86 behavior, we could not adequately explain the observed performance gap between ARM SFI at under 10% overhead with the overhead on x86-32 in terms of instruction set differences. With further study we understood that the prior implementations for x86-32 may have suffered from suboptimal instruction selection and overly pessimistic alignment.”

ARM ISA and binary format

ARM has only two instruction widths: 16-bit Thumb instructions and 32-bit ARM instructions.

ARM binaries commonly embed read-only data in the text segment. Such data in executable memory regions must be isolated to ensure it cannot be used to invoke system call instructions or other instructions incompatible with our sandboxing scheme.

Indirect control flow and memory reference

Indirect control flow and memory references must be constrained to within the untrusted memory region, achieved through sandboxing instructions.

Page protection to replace segmentation (considered, but not used)

Page-table protection would prevent the untrusted code from manipulating trusted data; SFI is still required to enforce control-flow restrictions.

Hence, page protection only avoids data SFI; the control flow SFI persists.

But it depends on an OS-based protection mechanism. This OS interaction is complicated by the requirement for multiple threads that transition independently between untrusted and trusted execution.

High complexity and overhead, with small potential performance gain ==> Not suitable.

Arch Design

SFI Schemes on x86-32, ARM, x86-64

  • All use alignment masks on control flow target addresses;

    • on ARM/x86-64, use high-order address bits to limit control flow targets to logical zero-based virtual address range;
  • Data mask

    • No data mask on x86-32; use segmentation instead;
    • ARM/x86-64: combine masking and guard pages to keep stores within the valid address range for untrusted data. Loads are not sandboxed, so untrusted code can read outside the sandbox.
    • Explicit instruction data mask for ARM:
    • Implicit in result width on x86-64:
  • Data type: ILP32

    • Int, Long, Pointer are all 32 bit.
    • same as x86-32, for portability between systems.
    • can improve performance on x86-64 systems.
  • Instruction sequences

  • Address space layout
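
As a rough illustration of the two data-mask styles in the list above, the Python model below treats registers as plain integers. The names `arm_explicit_mask`, `x86_64_implicit_mask`, and `sandbox_base` are hypothetical, and the idea that x86-64 adds the truncated 32-bit value to a 4 GB-aligned base is an assumption for this sketch, not something these notes spell out.

```python
MASK32 = 0xFFFFFFFF          # width of a 32-bit register
ARM_DATA_MASK = 0xC0000000   # top two bits; any bit set here means address >= 1 GB


def arm_explicit_mask(addr):
    """Model an explicit mask instruction: clear the top two address bits."""
    return addr & ~ARM_DATA_MASK & MASK32


def x86_64_implicit_mask(sandbox_base, reg64):
    """Model the implicit mask: a 32-bit result keeps only the low 32 bits,
    so adding it to a 4 GB-aligned base stays inside one 4 GB window.
    (sandbox_base is a hypothetical parameter for this sketch.)"""
    assert sandbox_base % (1 << 32) == 0, "base assumed to be 4 GB aligned"
    return sandbox_base + (reg64 & MASK32)
```

In this model the ARM mask costs an explicit operation per address, while the x86-64 mask falls out of using a 32-bit destination register.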

Impl on ARM

ARM designs:

  • condition codes that can be used to predicate most instructions: a predicated instruction executes only if the condition flags satisfy its condition field, and otherwise has no effect.

ARM goals:

  • No forbidden instructions in untrusted code;
  • No store above 1GB in untrusted code;
  • No jump above 1GB in untrusted code.

Extensions to Wahbe et al.[^2]

  • reserve no registers for holding sandboxed addresses; instead requiring they are computed and checked in a single instruction: ``;
  • ensure integrity of multi-instruction sandboxing by ???, with adaptation to further prevent execution of embedded data;
  • use ARM’s fully predicated instruction set to introduce an alternative data address sandboxing sequence that replaces a data dependency with a control dependency, preventing pipeline stalls and giving lower overhead on multiple-issue and out-of-order microarchitectures. ??? why

Code layout: 16-byte bundles (four instructions each); all ARM instructions, no Thumb; data bundles start with an invalid instruction to prevent execution as code.

Validation:

  • direct branch: confirms the target is a valid instruction (bundle start);
  • indirect branch: writing to r15 (the PC) is forbidden; only explicit branch instructions such as bx r0 and their conditional equivalents are allowed, and the target register must first have its two most significant bits and four least significant bits cleared:

    bic r0, r0, #0xc000000f   @ clear the top 2 and bottom 4 bits of the target
    bx r0                     @ branch to the masked, bundle-aligned address

    @ pop {pc} is replaced with:
    pop {lr}
    bic lr, lr, #0xc000000f
    bx lr

Note on the code above: there is a data dependency between the masking instruction and the bx branch. This pattern (generating an address via the ALU and immediately jumping to it) is sufficiently common in ARM code that modern ARM implementations[^3] can dispatch the sequence without stalling.
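
The effect of the mask in the branch sequence can be checked with a small Python model. This is only a sketch of the arithmetic; `sandbox_branch_target` is a hypothetical name, and registers are modeled as 32-bit integers.

```python
BRANCH_MASK = 0xC000000F   # bits cleared by the bic in the branch sequence
BUNDLE_SIZE = 16           # bytes per bundle
ONE_GB = 1 << 30


def sandbox_branch_target(target):
    """Clear the top 2 bits (stay below 1 GB) and the low 4 bits
    (land on a 16-byte bundle boundary) of a 32-bit branch target."""
    return target & ~BRANCH_MASK & 0xFFFFFFFF
```

Whatever value the untrusted code puts in the register, the masked result is below 1 GB and is a multiple of the bundle size, so the branch can only reach the start of a bundle.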

Data Stores: check within 1 GB.

    tst r0, #0xc0000000
    streq r1, [r0, #12]

==> Using tst rather than bic here avoids a data dependency between the guard instruction and the store, eliminating a two-cycle address-generation stall on Cortex-A8 that would otherwise triple the cost of the added instruction.

(Again this is ARM’s fully predicated instruction set).
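
To make the predication concrete, the following Python sketch models the tst/streq pair. The function name `guarded_store` and the use of a dict as memory are assumptions for illustration only.

```python
def guarded_store(memory, addr, value, offset=0):
    """Model of: tst addr, #0xc0000000 ; streq value, [addr, #offset].

    tst sets the Z flag when (addr & mask) == 0; streq performs the
    store only when Z is set. The address register itself is never
    modified, so the store does not wait on the result of the guard.
    """
    z_flag = (addr & 0xC0000000) == 0
    if z_flag:
        memory[addr + offset] = value
        return True
    return False  # predicate failed: the store is skipped
```

A store through a base address at or above 1 GB is silently skipped in this model instead of being redirected, which is the control dependency that replaces the data dependency of a bic-based mask.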

Guard pages: stores may only use base-plus-displacement addressing with an immediate displacement (register-plus-register addressing is forbidden). The immediate is a 12-bit offset, so a checked base address can still overflow or underflow the 1 GB region by up to 4095 bytes; guard pages trap such accesses.
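
The interaction between the base check and the displacement can be sketched in Python. This model assumes the 12-bit immediate offset of ARM load/store instructions (at most 4095 bytes in either direction), one 4096-byte guard page just above 1 GB, and that an underflow wraps to the top of the 32-bit address space; the guard placement and the name `classify_store` are assumptions for illustration.

```python
ONE_GB = 1 << 30
ADDR_SPACE = 1 << 32
MAX_DISP = 4095      # assumed 12-bit immediate offset
GUARD_SIZE = 4096    # assumed guard size: one 4 KB page


def classify_store(base, disp):
    """Return where a guarded store with an immediate displacement lands."""
    if base & 0xC0000000:
        return "skipped"                  # base fails the tst guard
    if not -MAX_DISP <= disp <= MAX_DISP:
        raise ValueError("displacement not encodable")
    ea = (base + disp) % ADDR_SPACE       # 32-bit wraparound
    if ea < ONE_GB:
        return "sandbox"
    in_upper_guard = ONE_GB <= ea < ONE_GB + GUARD_SIZE
    in_lower_guard = ea >= ADDR_SPACE - GUARD_SIZE
    return "guard" if in_upper_guard or in_lower_guard else "escape"
```

Under these assumptions no combination of a checked base and an encodable displacement yields "escape": the worst cases land inside a guard page and trap.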

Stack SP: within 1 GB.

LLVM 2.6 for ARM; faster than GCC.

Impl on x86-64

x86-64 features:

  • 8 new general purpose registers: r8 - r15.

Rules:

  • 4 GB aligned region; flanked above/below by 10 x 4GB regions. ??? why 10?
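
A back-of-envelope check on the guard size, under assumptions that are not stated in these notes: memory operands have the x86-64 form base + index*scale + disp32, the index is a zero-extended 32-bit value, the scale is at most 8, and the displacement is a signed 32-bit value. The function name `worst_case_offset` is hypothetical.

```python
GB = 1 << 30
GUARD_BYTES = 10 * 4 * GB   # the 10 x 4 GB flanking region from the rule above


def worst_case_offset(index_bits=32, max_scale=8, disp_bits=32):
    """Largest positive offset from the base for base + index*scale + disp."""
    max_index = (1 << index_bits) - 1
    max_disp = (1 << (disp_bits - 1)) - 1
    return max_index * max_scale + max_disp
```

Under these assumptions the worst case is just under 34 GB, which exceeds 32 GB but fits inside 40 GB. That is consistent with a 10 x 4 GB guard, although it does not by itself explain why 10 was chosen over 9.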

Takeaways


  1. Adapting Software Fault Isolation to Contemporary CPU Architectures. USENIX Security Symposium, 2010.
  2. R. Wahbe, S. Lucco, T.E. Anderson, and S.L. Graham. Efficient software-based fault isolation. ACM SIGOPS Operating Systems Review, 1993.
Created Aug 13, 2019 // Last Updated Dec 19, 2020
