Into the Depths of C: Elaborating the De Facto Standards

References:

  • Memarian, K., Matthiesen, J., Lingard, J., Nienhuis, K., Chisnall, D., Watson, R.N.M. and Sewell, P., 2016. Into the depths of C: elaborating the de facto standards. ACM SIGPLAN Notices, 51(6), pp. 1-15.
  • Cerberus

C remains central to our computing infrastructure. It is notionally defined by ISO standards, but in reality the properties of C assumed by systems code and those implemented by compilers have diverged, both from the ISO standards and from each other, and none of these are clearly understood.

“We articulate many specific questions, build a suite of semantic test cases, gather experimental data from multiple implementations, and survey what C experts believe about the de facto standards.”

“We identify questions where there is a consensus (either following ISO or differing) and where there are conflicts.”

“We describe a formal model, Cerberus, for large parts of C. Cerberus is parameterised on its memory model; it is linkable either with a candidate de facto memory object model, under construction, or with an operational C11 concurrency model; it is defined by elaboration to a much simpler Core language for accessibility, and it is executable as a test oracle on small examples.”

“This should provide a solid basis for discussion of what mainstream C is now: what programmers and analysis tools can assume and what compilers aim to implement. Ultimately we hope it will be a step towards clear, consistent, and accepted semantics for the various use-cases of C.”

Problems of Current C

C is notionally defined by

  • the ANSI/ISO standards: C89/C90, C99, and C11

A number of research projects have worked to formalise aspects of these [c3,c16,c18,c25-29,c38-40].

But the question of what C is in reality is much more complex, and more problematic, than the existence of an established standard would suggest; formalisation alone is not enough.

Problem 1: de facto vs ISO

In practice, there are multiple de facto standards, embodied in:

  • the behaviours of the various mainstream compilers;
  • the assumptions that systems programmers make about the behaviour they can rely on;
  • the assumptions implicit in the corpus of existing C code;
  • the assumptions implicit in C analysis tools.

These all differ from each other, and from the ISO standard, despite the latter’s stated intent to “codify the common, existing definition of C”.

Over time, C implementations have tended to become more aggressively optimising, exploiting situations that the standards now regard as undefined behaviour. This can break code that used to work correctly in practice, sometimes with security implications [1][2].

Critical infrastructure code, including the Linux and FreeBSD kernels, depends on idioms that current compilers do not in general support, or that the ISO standards do not permit; hence compiler flags such as -fno-strict-aliasing and -fno-strict-overflow, which turn off particular analyses.

Developers of static and dynamic analysis tools must make ad hoc semantic choices to avoid too many “false” positives: reports that run contrary to their customers’ expectations even where they flag real violations of the ISO standard [3]. This knowledge of the de facto standards is often left implicit in the tool implementations, and making it precise requires an articulation of the possible choices.

The space of mainstream C implementations has become simpler than the one the C standard was originally written to cope with. For example, mainstream hardware can now reasonably be assumed to have 8-bit bytes, two’s-complement arithmetic, and (often) non-segmented memory, but the ISO standard does not take any of this into account.
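
Where one is willing to bake in these assumptions, they can be checked at compile time; a minimal sketch using C11 static assertions (the checks are our illustration, not from the paper):

// illustrative compile-time checks of the "simpler" implementation space
#include <limits.h>

_Static_assert(CHAR_BIT == 8, "assumes 8-bit bytes");
_Static_assert((-1 & 3) == 3, "assumes two's-complement arithmetic");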

Problem 2: no precise and accessible specification

The ISO standard is hard to interpret precisely.

C tries to solve a challenging problem: to simultaneously enable programmers to write high-performance hand-crafted low-level code, provide portability between widely varying target machine architectures, and support sophisticated compiler optimizations.

It does so by providing operations both on abstract values and on their underlying concrete representations. The interactions between these are delicate, as are the dynamic properties relating to memory and type safety, aliasing, concurrency, and so on; they have, unsurprisingly, proved difficult to characterise precisely in the prose specification style of the standards.

Even those few who are very familiar with the standard often struggle with the subtleties of C, as witnessed by the committee’s internal mailing-list discussion of what the correct interpretation of the standard should be, the number of requests for clarification made in the form of defect reports [4], and the inconclusive discussions of whether compiler anomalies are bugs w.r.t. the standard.

The prose standard is not executable as a test oracle, which would let one simply compute the set of all allowed behaviours of any small test case, so such discussions have to rely on exegesis of the standard text.

The goal of ANSI C89, to “develop a clear, consistent, and unambiguous Standard for the C programming language”, thus remains an open problem.

Problem 3: integrating C11 concurrent and sequential semantics

There is also the technical challenge of dealing with concurrency.

Following C++, the C standard finally addressed concurrency in C11.

Batty et al. worked to ensure that the standard text of the concurrency model was written in close correspondence to a formalisation.

But the concurrency model is in an axiomatic style, while the rest of C11 is more naturally expressed in an operational semantics; it is hard to integrate the two.

Summary

Current C is a mess: divergence among the de facto and ISO standards; the prose form, ambiguities, and complexities of the ISO standards; and the lack of an integrated treatment of concurrency.

Contributions of this paper

  1. a detailed investigation into the possible semantics for pointers and memory in C (its memory object model);

  2. a formal semantics, Cerberus, for large parts of C.

C pointers and memory

The paper focuses on the C pointer and memory model, where there seems to be the most divergence and variation among the multiple de facto and ISO standards.

  • design space: 85 questions, 196 hand-written semantic test cases;
    • an 80+ page document [5][6]
  • de facto standards: what systems programmers and compiler writers believe about compiler behaviour and extant code;
  • experimental data from the test suite, for GCC/Clang/TrustInSoft/KCC;
  • “In ongoing work, we are building a candidate formal model capturing one plausible view of the de facto standards, as a reference for discussion.”

Proved useful in practice: “we applied our analysis and test suite to the C dialect supported by Watson et al.’s experimental CHERI processor [7][8], which implements unforgeable, bounds-checked C pointers (capabilities)”.

Formal Semantics for C

Cerberus: capture the ISO text for large parts of C as clearly as possible.

  • Parameterised memory model: can be instantiated with the candidate formal model that we are developing or, in future, other alternative memory object models.

    • Can also be instantiated with an operational C/C++11 concurrency model, though not yet combined with a full memory object model.
  • Executable as a test oracle, to explore all behaviours or single paths of test programs;

Cerberus aims to cover essentially all the material of Section 5 Environment and 6 Language of the ISO C11 standard, both syntax and semantics, except: preprocessor features, C11 character-set features, floating-point and complex types (beyond simple float constants), user-defined variadic functions (we do cover printf), bitfields, volatile, restrict, generic selection, register, flexible array members, some exotic initialisation forms, signals, longjmp, multiple translation units, and aspects where our candidate de facto memory object model intentionally differs from the standard.

It supports only a small part of the standard library. Threads, atomic types, and atomic operations are supported only with a more restricted memory object model.

Cerberus computation semantics:

  • expressed by an elaboration from a fully type-annotated C AST into a carefully defined Core language: a typed call-by-value calculus with constructs to model certain aspects of the C dynamic semantics;
  • handles subtleties such as C evaluation order, integer promotions, and the associated implementation-defined and undefined behaviour;
  • closely follows the ISO standard text.

Cerberus front-end:

  • C parser
  • desugaring phase
  • type checker
  • re-implemented, to avoid building in semantic choices about C that are implicit in the transformations from C source to AST done by compiler or CIL front-ends.
  • a preliminary experiment in translation validation (in Coq) for the front-end of Clang, for very simple programs.

Key Disagreements

Of the 85 questions:

  • for 39 the ISO standard is unclear;
  • for 27 the de facto standards are unclear, in some cases with significant differences between usage and implementation;
  • for 27 there are significant differences between the ISO and the de facto standards.

Pointer Provenance

From ISO WG14 Defect Report DR260 Committee Response [^c53]: “The implementation is entitled to take account of the provenance of a pointer value when determining what actions are and are not defined.”

// example provenance_basic_global_yx.c
#include <stdio.h>
#include <string.h>
int y=2, x=1;
int main() {
    int *p = &x + 1;
    int *q = &y;
    printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
    if (memcmp(&p, &q, sizeof(p)) == 0) {
        *p = 11; // does this have undefined behaviour?
        printf("x=%d y=%d *p=%d *q=%d\n",x,y,*p,*q);
    }
    return 0;
}

The above code can produce different behaviours depending on how the compiler treats pointer provenance:

If x and y happen to be allocated in adjacent memory, &x+1 and &y will have bitwise-identical runtime representations, the memcmp will succeed, and p (derived from a pointer to x) will have the same representation value as a pointer to y (a different object) at the point of the update *p=11.

In a concrete semantics we would expect to see x=1 y=11 *p=11 *q=11, but GCC produces x=1 y=2 *p=11 *q=2, ICC produces x=1 y=2 *p=11 *q=11. This suggests that GCC is reasoning, from provenance information, that *p does not alias with y or *q, and hence that the initial value of y=2 can be propagated to the final printf. Note that this is not a type-based aliasing issue: the pointers are of the same type, and the GCC result is not affected by -fno-strict-aliasing.

DR260 suggests a semantics in which pointer values include not just a concrete address but also provenance information, erased at runtime in conventional implementations, including a unique ID for the original allocation. This can be used in the semantics for memory accesses to check that the address used is consistent with the original allocation, which here lets the *p=11 access be regarded as undefined behaviour.
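
A minimal sketch of such a provenance semantics (the names and representation here are our own illustration, not the paper’s formal model): pointer values pair an allocation ID with a concrete address, and each access is checked against the allocation named by that ID.

// illustrative sketch of provenance-carrying pointer values
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t  alloc_id; /* provenance: ID of the original allocation */
    uintptr_t addr;     /* concrete address (all that survives at runtime) */
} ptr_value;

typedef struct {
    uint64_t  id;   /* unique allocation ID */
    uintptr_t base; /* start address */
    size_t    size; /* size in bytes */
} allocation;

/* an access of n bytes is defined only if the address lies within
   the allocation that the pointer's provenance refers to */
bool access_ok(ptr_value p, const allocation *a, size_t n) {
    return p.alloc_id == a->id
        && p.addr >= a->base
        && p.addr + n <= a->base + a->size;
}

Under this view, the *p=11 above fails the check: p carries the provenance of x’s allocation while its address falls within y.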

Undefined behaviour: the program is erroneous, and compilers are notionally entirely unconstrained in how they treat it. This makes the above compiler analyses and optimisations (vacuously) correct in this case.

==> “This general pattern is typical for C: regarding situations as undefined behaviour puts an obligation on programmers to avoid them, but permits compilers to make strong assumptions when optimising.”

A (block-ID, offset) semantics for C pointers leads to many more vexed questions, e.g.:

Q25: Can we do relational comparison (with <, >, <=, or >= ) of two pointers to separately allocated objects (of compatible object types)?

  • ISO clearly prohibits this, but surveys show that it is widely used, e.g. for global lock orderings and for collection-implementation orderings.
    • For mainstream C semantics, it seems more useful to permit it than to take the strict-ISO view that all occurrences are bugs (a sketch of the idiom follows this list).
    • In our model, we can easily regard these relational comparisons as ignoring the provenance information.
  • Ways concrete address values are exposed to programs:
    • Relational comparison of two pointers from different allocated objects, as above.
    • Explicit cast to integer types;
    • IO of pointers;
    • Examination of pointer representation bytes.
  • Therefore, abstract pointer values must also contain concrete addresses, in contrast to earlier C semantics that had only block IDs, and the semantics must let those addresses be chosen nondeterministically in any way a reasonable implementation might.
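
For example (our illustration of the lock-ordering idiom; ISO makes the < comparison here undefined):

// illustrative: global lock ordering via relational comparison of
// pointers to separately allocated objects
#include <stdio.h>

struct lock { int taken; };
struct lock lock_a, lock_b;

void lock_two(struct lock *l1, struct lock *l2) {
    struct lock *first  = (l1 < l2) ? l1 : l2; /* canonical address order */
    struct lock *second = (l1 < l2) ? l2 : l1;
    first->taken = 1;  /* stand-in for a real lock acquire */
    second->taken = 1;
}

int main(void) {
    lock_two(&lock_a, &lock_b);
    printf("acquired both locks in a consistent order\n");
    return 0;
}

In the provenance semantics sketched above, the comparison simply ignores provenance and compares the concrete addresses.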

Q9: Can one make a usable offset between two separately allocated objects by inter-object integer or pointer subtraction, and make a usable pointer to the second by adding the offset to a pointer to the first?

  • Here there are conflicts between C usage on one side and compilers and ISO on the other.
  • The idiom is not common in practice, but:
    • the Linux and FreeBSD kernels use it for per-CPU variables (sketched after this list);
    • current compilers sometimes do optimise based on an assumption, in points-to analysis, that inter-object pointer arithmetic does not occur.
  • How to resolve this?
    • Declare the usage buggy and require it to be rewritten: but that loses performance.
    • Globally disable the relevant optimisations, e.g. -fno-tree-pta for GCC: but that is too blunt an instrument.
    • Adapt the compiler analyses to treat inter-object pointer subtractions as giving integer offsets that have the power to move between objects, i.e. a multiple-provenance semantics rather than singleton provenances: but compilers would have to be conservative wherever they could not determine that pointer subtractions are intra-object, which might also be too costly.
    • Require such usages to be explicitly annotated, e.g. with an attribute on the resulting pointer declaring that it might alias with anything.
  • None of these is wholly satisfactory.
    • We suspect the last is the most achievable.
    • For the moment, our candidate formal model forbids this idiom.
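
A contrived sketch of the per-CPU shape of this idiom (our illustration; undefined behaviour under ISO, and forbidden by the candidate model):

// illustrative: inter-object arithmetic in the style of per-CPU variables
#include <stddef.h>
#include <stdio.h>

int template_counter; /* "template" copy of a per-CPU variable */
_Alignas(int) char cpu0_area[256], cpu1_area[256];

int *per_cpu(int *tmpl, char *cpu_area) {
    /* inter-object subtraction: offset of the variable within the
       template region, re-applied to a separately allocated region */
    ptrdiff_t off = (char *)tmpl - (char *)&template_counter;
    return (int *)(cpu_area + off);
}

int main(void) {
    *per_cpu(&template_counter, cpu0_area) = 1;
    *per_cpu(&template_counter, cpu1_area) = 2;
    printf("%d %d\n", *(int *)cpu0_area, *(int *)cpu1_area);
    return 0;
}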

Q5: Must provenance information be tracked via casts to integer types and integer arithmetic?

  • It is common to cast pointers to integer types and back, and to do arithmetic on the integers, e.g. to store information in unused bits of a pointer (sketched after the GCC quote below).
  • To get a coherent model, there are two possible options:
    • regard the result of casting from an integer to a pointer as having a wildcard provenance, or
    • track provenance through integer operations.
  • Our formal model associates provenances with all integer values, following the same at-most-one provenance model we use for pointers.

GCC documentation states: “When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined”.
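
For example (our illustration), the common low-bit tagging idiom depends on the answer to Q5: the pointer is only usable after the round-trip if provenance is tracked through the integer operations.

// illustrative: storing a mark bit in the low bit of a pointer,
// relying on provenance surviving uintptr_t arithmetic
#include <stdint.h>

typedef struct node { struct node *next; int v; } node;

node *mark(node *p)      { return (node *)((uintptr_t)p | 1u); }
node *unmark(node *p)    { return (node *)((uintptr_t)p & ~(uintptr_t)1); }
int   is_marked(node *p) { return (int)((uintptr_t)p & 1u); }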

Q2: Can equality testing on pointers be affected by pointer provenance information?

  • ISO allows equality testing between pointers to different objects of compatible types, but leaves open whether the result might depend on provenance.
  • Running our test suite with GCC:
    • two pointers with the same runtime representation but different provenances are regarded as different by GCC ==> compilers do take advantage of provenance information when it is statically available (a minimal test follows this list).
  • This can be soundly modelled by making a nondeterministic choice, at each such comparison, of whether to take provenance into account.
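
A minimal test in the style of the paper’s suite (our rendering, assuming x and y may be allocated adjacently):

// illustrative: pointer equality possibly affected by provenance
#include <stdio.h>

int x = 1, y = 2;

int main(void) {
    int *p = &x + 1; /* one-past-the-end of x */
    int *q = &y;
    /* even if p and q have identical runtime representations, a
       compiler may evaluate p == q to false based on provenance */
    printf("%s\n", (p == q) ? "equal" : "not equal");
    return 0;
}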

Out-of-Bounds Pointers

Only very limited pointer arithmetic is permitted by the ISO standard:

  • within an array;
  • among the members of a struct;
  • access to representation bytes;
  • one-past an object.

However, in practice, “it seems to be common to transiently construct out-of-bounds pointers”; see Table 1 of the Beyond the PDP-11 paper [^c11]. On the other hand, the textual survey comments reveal that some compiler developers believe compilers may optimise on the assumption that this does not occur, and that there can be issues with overflow in pointer arithmetic and with large allocations.

Many experts believe such pointers will work, so long as they are brought back in bounds before being used to access memory.

“For our formal model, as in CHERI, we tentatively choose to permit arbitrary pointer arithmetic: out-of-bounds pointers can be constructed, with undefined behaviour occurring if an access-time check w.r.t. the bounds associated to their provenance fails.”
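
For instance (our illustration), a transiently out-of-bounds pointer brought back in bounds before any access:

// illustrative: transient out-of-bounds pointer arithmetic
// (even constructing a-1 is undefined behaviour under ISO)
#include <stdio.h>

int a[4] = {10, 20, 30, 40};

int main(void) {
    int *p = a - 1;     /* out of bounds, but not accessed */
    p = p + 2;          /* back in bounds before the access */
    printf("%d\n", *p); /* reads a[1] */
    return 0;
}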

Pointer Copying

  • Is provenance preserved when copying an entire pointer?
  • What about copying only some bits of a pointer?

“Windows /GS stack cookies do this all the time to protect the return address. The return address is encrypted on the stack, and decrypted as part of the function epilogue” and a “JIT that stores 64-bit virtual ptrs as their hardware based 48-bits”.

Our candidate formal model:

  • should permit copying pointer values via indirect dataflow, as the representation bytes or bits carry the original provenance (see the sketch after this list):
    • combining values with the same provenance preserves that provenance;
    • the access-time check compares the recalculated address with that of the original allocation;
  • should not permit copying via indirect control flow (e.g. making a branch based on each bit of a pointer value);
  • does not require all of the original bits to flow to the result.
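
For instance (our illustration), a user-written bytewise copy of a pointer is permitted under this model, since the representation bytes carry the provenance:

// illustrative: copying a pointer value via its representation bytes
#include <stddef.h>
#include <stdio.h>

static void bytewise_copy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
}

int main(void) {
    int x = 1;
    int *p = &x, *q;
    bytewise_copy(&q, &p, sizeof p);
    *q = 2;            /* ok under the model: q carries p's provenance */
    printf("%d\n", x); /* prints 2 */
    return 0;
}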

We view this as a plausible de facto semantics, but more work is needed to see whether it is really compatible with current compiler analysis implementations.

Unspecified Values

C does not require variables and memory to be initialised.

(In the free-text survey responses) the only real use cases seem to be copying a partially initialised struct and, more rarely, comparing against one.
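
For instance (our illustration), such a copy reads the uninitialised members:

// illustrative: copying a partially initialised struct
struct point { int x, y, z; };

void copy_partial(void) {
    struct point a;
    a.x = 1;            /* a.y and a.z are left uninitialised */
    struct point b = a; /* the copy also reads the uninitialised members */
    (void)b;
}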

The survey asked about the semantics of reading an uninitialised variable or struct member (whether due to a bug or intentionally, to copy, output, hash, or set some bits of a partially initialised value):

  1. undefined behaviour (meaning that the compiler is free to arbitrarily miscompile the program, with or without a warning); 139 (43%)
  2. going to make the result of any expression involving that value unpredictable; 42 (13%)
  3. going to give an arbitrary and unstable value (maybe with a different value if you read again); 21 (6%)
  4. going to give an arbitrary but stable value (with the same value if you read again); 112 (35%)

GCC and Clang do perform SSA transformations that can make uninitialised values unstable (option 3).

Clang is moving towards (option 1).

ISO:

  • indeterminate value: either
    • an unspecified value, “a valid value [where ISO] imposes no requirements on which value is chosen in any instance”, or
    • a trap representation.

Unspecified Padding Values

What values do structure padding bytes hold?

Possible semantics:

  1. padding is regarded as always holding unspecified values, irrespective of any byte writes to it, so the compiler can arbitrarily write to padding at any point;
  2. structure member writes are deemed to also write unspecified values over subsequent padding;
  3. structure member writes nondeterministically either write zeros over subsequent padding or leave it unchanged;
  4. structure copies might copy padding, but structure member writes never touch padding.

MSVC – 4.

Clang – Not 1, not 2.

GCC – 1 or 2.

IBM mainframe compiler – 1 or 2.
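
A test in the spirit of the paper’s padding tests (our rendering) that can distinguish these options:

// illustrative: does a member write clobber subsequent padding?
#include <stdio.h>
#include <string.h>

struct s { char c; /* padding likely follows c */ int i; };

int main(void) {
    struct s v;
    memset(&v, 0xFF, sizeof v); /* write every byte, padding included */
    v.c = 'a';                  /* options 2/3: may overwrite padding;
                                   option 4: padding left untouched */
    unsigned char b = *((unsigned char *)&v + 1); /* first padding byte */
    printf("padding byte after member write: 0x%02x\n", b);
    return 0;
}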

Effective Types

C99 introduced effective types to permit compilers to do optimisations driven by type-based alias analysis (TBAA), ruling out programs involving unannotated aliasing of references to different types by regarding them as having undefined behaviour.

This is one of the less clear, less well-understood, and more controversial aspects of C.

Example conflict:

Q75: Can an unsigned character array with static or automatic storage duration be used (in the same way as a malloc’d region) to hold values of other types?

  • 201 (65%) of respondents know of real code that relies on it.
  • ISO disallows it.
  • GCC: “No, this is not safe (if it’s visible to the compiler that the memory in question has unsigned char as its declared type)”.
  • Our candidate formal model focuses on the C used by systems code, often compiled with -fno-strict-aliasing to turn TBAA off, and so should permit this and related idioms (sketched below).
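
A sketch of the idiom in question (our illustration; the ISO effective-type rules disallow it):

// illustrative: a declared unsigned char array used as storage for
// another type, as a malloc'd region would be
#include <stdio.h>

static _Alignas(int) unsigned char buf[sizeof(int)];

int main(void) {
    int *p = (int *)buf;
    *p = 42;            /* store an int into unsigned-char-typed storage */
    printf("%d\n", *p);
    return 0;
}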

Memory Semantics of C Analysis Tools

The tests were run with three groups of tools:

  • Clang’s memory, address, and undefined-behaviour sanitisers (MSan, ASan, UBSan),
    • intended to identify uses of uninitialized memory values, invalid memory accesses, and other undefined behaviours;
  • the TrustInSoft tis-interpreter,
    • based on the Frama-C value analysis;
  • Hathhorn et al.’s KCC.

The results from these tools are radically different.

Clang Sanitizers

Only a few of the tests triggered warnings.

  • All 13 of the structure-padding tests and 9 of the other unspecified-value tests ran without any sanitiser warnings.

TrustInSoft

TrustInSoft aims for a tight semantics, to detect enough cases to ensure “that any program that executes correctly in tis-interpreter behaves deterministically when compiled and produces the same results”.

  • In many places it follows a much stricter notion of C than our candidate de facto model,
    • e.g. flagging most of the unspecified-value tests, and not permitting comparison of pointer representations.
  • In some others it coincides with our candidate de facto model but differs from ISO,
    • e.g. assuming null-pointer representations are zero.

KCC

KCC detected two potential alignment errors in earlier versions of our tests.

But it gave:

  • “execution failed”, with no further details, for the tests of 20 of our questions;
  • “translation failed” for one;
  • a segfault at runtime for one;
  • results contrary to our reading of the ISO standard for at least 6.

KCC exhibited a very strict semantics for reading uninitialised values (but not for padding bytes), and permitted some tests that ISO effective types forbid.

Memory Semantics of CHERI C

Research that has focused on memory-safe implementations of C:

  • [^c14]
  • [^c15]
  • [^c22]
  • [^c34]
  • [^c35]
  • Intel MPX [^c21]

But to date, no work in this area has clearly defined the interpretation of C that it aims to implement.

Cerberus

Validation

More

  • Undef
  • References: [1], [2] (listed below).


  1. X. Wang, H. Chen, A. Cheung, Z. Jia, N. Zeldovich, and M. F. Kaashoek. Undefined behavior: what happened to my code? In Proc. APSYS, 2012 ↩
  2. X. Wang, N. Zeldovich, M. F. Kaashoek, and A. Solar-Lezama. Towards optimization-safe systems: Analyzing the impact of undefined behavior. In Proc. SOSP, 2013. ↩
  3. A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-Gros, A. Kamsky, S. McPeak, and D. Engler. A few billion lines of code later: using static analysis to find bugs in the real world. Commun. ACM, 53(2), 2010. ↩
  4. WG14. Defect report summary for ISO/IEC 9899:1999. ↩
  5. K. Memarian, J. Matthiesen, K. Nienhuis, V. B. F. Gomes, J. Lingard, D. Chisnall, R. N. M. Watson, and P. Sewell. Cerberus, 2016. www.cl.cam.ac.uk/~pes20/cerberus, www.repository.cam.ac.uk/handle/1810/255730 ↩
  6. D. Chisnall, J. Matthiesen, K. Memarian, K. Nienhuis, P. Sewell, and R. N. M. Watson. C memory object and value semantics: the space of de facto and ISO standards, 2016. http://www.cl.cam.ac.uk/~pes20/cerberus/. ↩
  7. R. N. M. Watson, P. G. Neumann, J. Woodruff, M. Roe, J. Anderson, D. Chisnall, B. Davis, A. Joannou, B. Laurie, S. W. Moore, S. J. Murdoch, and R. Norton. Capability hardware enhanced RISC instructions: CHERI instruction-set architecture. Technical Report UCAM-CL-TR-864, University of Cambridge, Computer Laboratory, November 2015. ↩
  8. R. N. M. Watson, J. Woodruff, P. G. Neumann, S. W. Moore, J. Anderson, D. Chisnall, N. H. Dave, B. Davis, K. Gudka, B. Laurie, S. J. Murdoch, R. Norton, M. Roe, S. Son, and M. Vadera. CHERI: A hybrid capability-system architecture for scalable software compartmentalization. In IEEE Symposium on Security and Privacy, SP, 2015. ↩