C remains central to our computing infrastructure. It is notionally defined by ISO standards, but in reality the properties of C assumed by systems code and those implemented by compilers have diverged, both from the ISO standards and from each other, and none of these are clearly understood.
“We articulate many specific questions, build a suite of semantic test cases, gather experimental data from multiple implementations, and survey what C experts believe about the de facto standards.”
“We identify questions where there is a consensus (either following ISO or differing) and where there are conflicts.”
“We describe a formal model, Cerberus, for large parts of C. Cerberus is parameterised on its memory model; it is linkable either with a candidate de facto memory object model, under construction, or with an operational C11 concurrency model; it is defined by elaboration to a much simpler Core language for accessibility, and it is executable as a test oracle on small examples.”
“This should provide a solid basis for discussion of what mainstream C is now: what programmers and analysis tools can assume and what compilers aim to implement. Ultimately we hope it will be a step towards clear, consistent, and accepted semantics for the various use-cases of C.”
C is notionally defined by the ISO standards, and a number of research projects have worked to formalise aspects of these [c3,c16,c18,c25-29,c38-40].
But the question of what C is in reality is much more complex, and more problematic, than the existence of an established standard would suggest; formalisation alone is not enough.
In practice, there are multiple de facto standards: the C assumed by systems code, the C implemented by compilers, and the C assumed by analysis tools. They all differ from each other, and from the ISO standard, despite the latter’s stated intent to “codify the common, existing definition of C”.
Over time, C implementations have tended to become more aggressively optimising, exploiting situations which the standards now regard as undefined behaviour. This can break code that used to work correctly in practice, sometimes with security implications [12].
Critical infrastructure code, including the Linux and FreeBSD kernels, depends on idioms that current compilers do not in general support, or that the ISO standards do not permit.
Such code relies on compiler flags, such as -fno-strict-aliasing and -fno-strict-overflow, that turn off particular analyses.
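A minimal illustration of my own (not a test from the paper), assuming a typical optimising compiler: a signed-overflow check of the kind these flags are meant to protect.

// overflow_check.c — hypothetical illustration, not from the paper
#include <limits.h>
#include <stdio.h>

int next_index(int x) {
  // Signed overflow is undefined behaviour in ISO C, so an optimising compiler
  // may assume x + 1 > x always holds and delete this check entirely;
  // -fno-strict-overflow (or -fwrapv) stops it exploiting that assumption.
  if (x + 1 < x)
    return 0;
  return x + 1;
}

int main(void) {
  printf("%d\n", next_index(INT_MAX));
  return 0;
}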
Developers of static and dynamic analysis tools must make ad hoc semantic choices to avoid too many “false” positives that are contrary to their customers’ expectations, even where these are real violations of the ISO standards [3]. This knowledge of the de facto standards is often left implicit in the tool implementations, and making it precise requires an articulation of the possible choices.
The space of mainstream C implementations has become simpler than the one that the C standard was originally written to cope with. For example, mainstream hardware can now reasonably be assumed to have 8-bit bytes, two’s-complement arithmetic, and (often) non-segmented memory, but the ISO standard does not take any of this into account.
The ISO standard is hard to interpret precisely.
C tries to solve a challenging problem: to simultaneously enable programmers to write high-performance hand-crafted low-level code, provide portability between widely varying target machine architectures, and support sophisticated compiler optimisations.
It does so by providing operations both on abstract values and on their underlying concrete representations. The interactions between these are delicate, as are the dynamic properties relating to memory and type safety, aliasing, concurrency, and so on; they have, unsurprisingly, proved difficult to characterise precisely in the prose specification style of the standards.
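A small sketch of my own of that split between abstract values and their representations: the same int used as an arithmetic value and inspected byte by byte.

// representation_bytes.c — my illustration, not from the paper
#include <stdio.h>
#include <stddef.h>

int main(void) {
  int v = 0x01020304;                          // the abstract (arithmetic) value
  unsigned char *bytes = (unsigned char *)&v;  // its concrete object representation
  for (size_t i = 0; i < sizeof v; i++)        // exposes size and byte order
    printf("%02x ", (unsigned)bytes[i]);
  printf("\n");
  return 0;
}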
Even those few who are very familiar with the standard often struggle with the subtleties of C, as witnessed by the committee’s internal mailing-list discussion of what the correct interpretation of the standard should be, the number of requests for clarification made in the form of defect reports [4], and the inconclusive discussions of whether compiler anomalies are bugs w.r.t. the standard.
The prose standard is not executable as a test oracle, which would let one simply compute the set of all allowed behaviours of any small test case, so such discussions have to rely on exegesis of the standard text.
The goal of ANSI C89, to “develop a clear, consistent, and unambiguous Standard for the C programming language”, thus remains an open problem.
A further technical challenge: dealing with concurrency.
Following C++, the C standard finally addressed concurrency in C11.
Batty et al. worked to ensure that the standard text of the concurrency model is written in close correspondence to a formalisation.
But the concurrency model is in an axiomatic style, while the rest of C11 is more naturally expressed in an operational semantics; it is hard to integrate the two.
Current C is a mess: divergence among the de facto and ISO standards; the prose form, ambiguities, and complexities of the ISO standards; and the lack of an integrated treatment of concurrency.
The paper gives a detailed investigation into the possible semantics for pointers and memory in C (its memory object model), and a formal semantics, Cerberus, for large parts of C. The focus is on C pointers and the memory object model, where there seems to be the most divergence and variation among the multiple de facto and ISO standards.
Proved useful in practice: “we applied our analysis and test suite to the C dialect supported by Watson et al.’s experimental CHERI processor [78], which implements unforgeable, bounds-checked C pointers (capabilities)”.
Cerberus: Capture ISO text for large parts of C as clearly as possible.
Parameterised memory model: can be instantiated with the candidate formal model that we are developing or, in future, other alternative memory object models.
Executable as a test oracle, to explore all behaviours or single paths of test programs;
Cerberus aims to cover essentially all the material of Sections 5 (Environment) and 6 (Language) of the ISO C11 standard, both syntax and semantics, except: preprocessor features, C11 character-set features, floating-point and complex types (beyond simple float constants), user-defined variadic functions (we do cover printf), bitfields, volatile, restrict, generic selection, register, flexible array members, some exotic initialisation forms, signals, longjmp, multiple translation units, and aspects where our candidate de facto memory object model intentionally differs from the standard. It supports only a small part of the standard libraries. Threads, atomic types, and atomic operations are supported only with a more restricted memory object model.
Cerberus computation semantics:
Cerberus front-end:
Of the 85 questions:
From ISO WG14 Defect Report DR260 Committee Response [^c53]: “The implementation is entitled to take account of the provenance of a pointer value when determining what actions are and are not defined.”
// example provenance_basic_global_yx.c
#include <stdio.h>
#include <string.h>
int y=2, x=1;
int main() {
  int *p = &x + 1;
  int *q = &y;
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11; // does this have undefined behaviour?
    printf("x=%d y=%d *p=%d *q=%d\n",x,y,*p,*q);
  }
  return 0;
}
The above code behaves differently depending on how the compiler treats pointer provenance:
If x and y happen to be allocated in adjacent memory, &x+1 and &y will have bitwise-identical runtime representation values, the memcmp will succeed, and p (derived from a pointer to x) will have the same representation value as the pointer to y (a different object) at the point of the update *p=11. In a concrete semantics we would expect to see x=1 y=11 *p=11 *q=11, but GCC produces x=1 y=2 *p=11 *q=2, and ICC produces x=1 y=2 *p=11 *q=11. This suggests that GCC is reasoning, from provenance information, that *p does not alias with y or *q, and hence that the initial value y=2 can be propagated to the final printf. Note that this is not a type-based aliasing issue: the pointers are of the same type, and the GCC result is not affected by -fno-strict-aliasing.
DR260 suggests a semantics in which pointer values include not just a concrete address but also provenance information, erased at runtime in conventional implementations, including a unique ID from the original allocation. This can be used in the semantics for memory accesses to check that the address used is consistent with the original allocation, which here lets the *p=11 access be regarded as undefined behaviour.
Undefined behaviour: the program is erroneous, and compilers are notionally entirely unconstrained in how they treat it. This makes the above compiler analyses and optimisations (vacuously) correct in this case.
==> “This general pattern is typical for C: regarding situations as undefined behaviour puts an obligation on programmers to avoid them, but permits compilers to make strong assumptions when optimising.”
The (block-ID, offset) semantic model for pointers in C leads to many more vexed questions, e.g.:
Q25: Can we do relational comparison (with <, >, <=, or >=) of two pointers to separately allocated objects (of compatible object types)?
Q9: Can one make a usable offset between two separately allocated objects by inter-object integer or pointer subtraction, and then make a usable pointer to the second by adding the offset to a pointer to the first?
Q5: Must provenance information be tracked via casts to integer types and integer arithmetic? (cf. -fno-tree-pta for GCC)
GCC documentation states: “When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined”.
Q2: Can equality testing on pointers be affected by pointer provenance information?
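A sketch of my own of the kinds of program these questions are about; their status is exactly what is contested, so this is not presented as correct C.

// vexed_questions.c — my illustration of Q25, Q9, Q5 and Q2
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

int main(void) {
  int x = 1, y = 2;            // separately allocated objects
  int *p = &x, *q = &y;
  if (p < q)                   // Q25: relational comparison across allocations
    printf("&x is below &y\n");
  ptrdiff_t off = q - p;       // Q9: inter-object subtraction (UB per ISO)...
  int *r = p + off;            // ...used to construct a pointer to the other object
  uintptr_t i = (uintptr_t)r;  // Q5: is provenance tracked through a cast to integer
  int *s = (int *)i;           //     and back (cf. the GCC documentation quoted above)?
  if (s == q)                  // Q2: can this equality test observe provenance?
    printf("*s = %d\n", *s);
  return 0;
}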
Only very limited pointer arithmetic is permitted by the ISO standard.
However, in practice, “it seems to be common to transiently construct out-of-bounds pointers”. See the Beyond the PDP-11 paper, Table 1 [^c11]. On the other hand, the textual comments reveal that some compiler developers believe that compilers may optimise assuming that this does not occur, and there can be issues with overflow in pointer arithmetic and with large allocations.
So long as such pointers are brought back in-bounds before being used to access memory, many experts believe this will work.
“For our formal model, as in CHERI, we tentatively choose to permit arbitrary pointer arithmetic: out-of-bounds pointers can be constructed, with undefined behaviour occurring if an access-time check w.r.t. the bounds associated to their provenance fails.”
“Windows /GS stack cookies do this all the time to protect the return address. The return address is encrypted on the stack, and decrypted as part of the function epilogue” and a “JIT that stores 64-bit virtual ptrs as their hardware based 48-bits”.
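A sketch of my own of the transient out-of-bounds idiom under that candidate model: the out-of-bounds pointer is constructed but never dereferenced until it is back in bounds.

// transient_oob.c — my illustration, not a test from the paper
#include <stdio.h>

int main(void) {
  int a[4] = {0, 1, 2, 3};
  int *p = a - 1;         // out of bounds at construction (UB per ISO),
                          // but never used for an access while out of bounds
  p = p + 2;              // back in bounds: points to a[1]
  printf("%d\n", *p);     // under the candidate model, only this access is checked
  return 0;
}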
Our candidate formal model:
We view this (our formal model) as a plausible de facto semantics, but more work is needed to see if it is really compatible with current compiler analysis implementations.
C does not require variables and memory to be initialised.
Survey on the semantics of reading an uninitialised variable or struct member (either due to a bug or intentionally, to copy, output, hash, or set some bits of a partially initialised value):
In the text responses, the only real use cases seem to be copying a partially initialised struct and (more rarely) comparing against one.
GCC and Clang do perform SSA transformations that make uninitialised values unstable (2).
Clang is moving towards (1).
ISO …
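A sketch of my own of the main use case above, copying a partially initialised struct; which semantics applies to the reads of the uninitialised bytes is exactly the survey question.

// partial_init_copy.c — my illustration, not a test from the paper
#include <stdio.h>
#include <string.h>

struct s { char tag; int value; };   // padding likely follows tag

int main(void) {
  struct s a;
  a.tag = 'x';                       // a.value and any padding bytes stay uninitialised
  struct s b;
  memcpy(&b, &a, sizeof a);          // the copy reads those uninitialised bytes
  printf("%c\n", b.tag);
  return 0;
}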
Padding values.
Possible semantics:
MSVC – 4.
Clang – Not 1, not 2.
GCC – 1 or 2.
IBM mainframe compiler – 1 or 2.
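A sketch of my own of why padding-byte values matter, e.g. when structs are compared with memcmp; what memcmp observes depends on which of the possible semantics a compiler implements.

// padding_compare.c — my illustration, not a test from the paper
#include <stdio.h>
#include <string.h>

struct s { char c; int i; };       // padding bytes typically sit between c and i

int main(void) {
  struct s a, b;
  memset(&a, 0x00, sizeof a);      // padding in a set to 0x00
  memset(&b, 0xff, sizeof b);      // padding in b set to 0xff
  a.c = b.c = 'x';                 // all named members made equal...
  a.i = b.i = 1;
  // ...so the result below hinges on whether the member stores reset the padding
  printf("memcmp = %d\n", memcmp(&a, &b, sizeof a));
  return 0;
}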
C99 introduced effective types to permit compilers to do optimisations driven by type-based alias analysis (TBAA), ruling out programs involving unannotated aliasing of references to different types by regarding them as having undefined behaviour.
This is one of the less clear, less well-understood, and more controversial aspects of C.
Example conflict:
Q75: Can an unsigned character array with static or automatic storage duration be used (in the same way as a malloc’d region) to hold values of other types?
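A sketch of my own of the Q75 idiom: a static unsigned char array used, like a malloc’d region, to hold a value of another type (alignment is forced here to keep that separate concern out of the way).

// char_array_storage.c — my illustration, not a test from the paper
#include <stdio.h>
#include <string.h>

static _Alignas(int) unsigned char buf[sizeof(int)];  // static storage, aligned for int

int main(void) {
  int v = 42;
  memcpy(buf, &v, sizeof v);   // store an int's representation into the array
  int *p = (int *)buf;         // ISO effective-type rules make this access contested;
  printf("%d\n", *p);          // with TBAA off (-fno-strict-aliasing) many expect it to work
  return 0;
}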
Compilers invoked with -fno-strict-aliasing turn TBAA off, and so should permit this and related idioms.
The tests were run with three groups of tools:
tis-interpreter
The results from these tools are radically different.
Only a few of the tests triggered warnings.
…
TrustInSoft aims for a tight semantics, to detect enough cases to ensure “that any program that executes correctly in tis-interpreter behaves deterministically when compiled and produces the same results”.
KCC detected two potential alignment errors in earlier versions of our tests.
But it gave …
KCC exhibited a very strict semantics for reading uninitialised values (but not for padding bytes), and permitted some tests that ISO effective types forbid.
Research that has focused on memory-safe implementations of C:
But to date, no work in this area has clearly defined the interpretation of C that it aims to implement.
References:
X. Wang, H. Chen, A. Cheung, Z. Jia, N. Zeldovich, and M. F. Kaashoek. Undefined behavior: what happened to my code? In Proc. APSYS, 2012.
X. Wang, N. Zeldovich, M. F. Kaashoek, and A. Solar-Lezama. Towards optimization-safe systems: Analyzing the impact of undefined behavior. In Proc. SOSP, 2013.