Symbol Table Section


Q&A

  • How does LLVM generate the Symbol Table section in an object file?

Symbol Table Section

The .symtab section holds a symbol table, which carries the information needed to locate and relocate a program’s symbolic definitions and references.

The first entry (index 0) is always the undefined symbol.

If a file has a loadable segment that includes the symbol table, the section’s attributes will include the SHF_ALLOC bit; otherwise the bit will be off.

// One symbol table entry.
typedef struct {
	Elf32_Word		st_name;  // index into the symbol string table section, where the symbol's name is stored.
	Elf32_Addr		st_value; // value of the associated symbol; may be an address, an absolute value, etc.
	Elf32_Word		st_size;  // symbol's size, e.g. a data object's size; 0 if the size is unknown or the symbol has none.
	uint8_t			st_info;  // symbol's type and binding attributes.
	uint8_t			st_other; // symbol visibility in the current gABI (older specifications define it as 0).
	Elf32_Half		st_shndx; // every symbol is defined in relation to some section; this member holds the relevant section header table index.
} Elf32_Sym;
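
As a minimal sketch of how st_name is used, the following C fragment looks up a symbol's name in a string table. The names struct sym32 and symbol_name are hypothetical and chosen only for illustration (real code would normally use Elf32_Sym from a system header such as <elf.h> where one is available), and the bounds checks are an assumption about how one might guard against malformed input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of Elf32_Sym, declared locally so the sketch is
 * self-contained. */
struct sym32 {
    uint32_t st_name;   /* offset of the name inside the string table */
    uint32_t st_value;
    uint32_t st_size;
    uint8_t  st_info;
    uint8_t  st_other;
    uint16_t st_shndx;
};

/* Returns the symbol's name, or NULL if st_name does not point at a
 * NUL-terminated string inside the table. */
static const char *symbol_name(const char *strtab, size_t strtab_size,
                               const struct sym32 *sym)
{
    if (sym->st_name >= strtab_size)
        return NULL;
    const char *name = strtab + sym->st_name;
    if (memchr(name, '\0', strtab_size - sym->st_name) == NULL)
        return NULL;
    return name;
}
```

An st_name of 0 refers to the empty string at the start of the string table, which is how unnamed entries such as the initial undefined symbol are represented.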

Symbol Table Entry – Symbol Info

The st_info member holds the type and binding attributes for the symbol.

  • Binding determines the linkage visibility and behavior.
    • STB_LOCAL/0, local symbols are not visible outside of the object file containing their definition. Local symbols of the same name may exist in multiple files without interfering with each other.
    • STB_GLOBAL/1, global symbols are visible to all object files being combined. One file’s definition of a global symbol will satisfy another file’s undefined reference to the same global symbol.
    • STB_WEAK/2, weak symbols resemble global symbols, but their definitions have lower precedence. A global definition of the same name overrides a weak one without a multiple-definition error, and an unresolved weak symbol takes the value 0 instead of causing a link error.
    • STB_LOPROC/13 -- STB_HIPROC/15, processor-specific semantics.
  • Type provides a general classification for the associated entity.
    • STT_NOTYPE/0, no type is specified for the symbol.
    • STT_OBJECT/1, a data object, such as a variable, an array, and so on.
    • STT_FUNC/2, a function or other executable code.
    • STT_SECTION/3, a section symbol. This type of entry is primarily used for relocation and normally has STB_LOCAL binding.
    • STT_FILE/4, a file symbol. It has STB_LOCAL binding, its section index is SHN_ABS, and it precedes the other STB_LOCAL symbols for the file, if it is present.
    • STT_LOPROC/13 -- STT_HIPROC/15, processor-specific semantics.
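
The binding occupies the high four bits of st_info and the type the low four bits. The helpers below are a small sketch of that packing; they mirror the ELF32_ST_BIND, ELF32_ST_TYPE, and ELF32_ST_INFO macros from the ELF specification, but the function names themselves are hypothetical.

```c
#include <stdint.h>

/* Binding lives in the high 4 bits of st_info, type in the low 4 bits. */
static inline uint8_t st_bind(uint8_t info) { return info >> 4; }
static inline uint8_t st_type(uint8_t info) { return info & 0x0f; }

/* Packs a binding and a type back into a single st_info byte. */
static inline uint8_t st_make_info(uint8_t bind, uint8_t type)
{
    return (uint8_t)((bind << 4) | (type & 0x0f));
}
```

For example, a global function (STB_GLOBAL/1, STT_FUNC/2) is encoded as st_info 0x12.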

Symbol Table Entry – Symbol Value

Symbol table entries for different object file types have slightly different interpretation for the symbol value st_value member.

  • In relocatable files, there are two possibilities:
    • For a symbol whose section index is SHN_COMMON, st_value holds alignment constraints. In this case, the symbol labels a common block that has not yet been allocated, and its value gives alignment constraints similar to a section’s sh_addralign member. That is, the link editor will allocate the storage for the symbol at an address that is a multiple of st_value. The symbol’s size tells how many bytes are required.
    • For a defined symbol, st_value holds a section offset. That is, st_value is an offset from the beginning of the section that st_shndx identifies.
  • In executable and shared object files, st_value holds a virtual address. To make these files’ symbols more useful for the dynamic linker, the section offset (file interpretation) gives way to a virtual address (memory interpretation) for which the section number is irrelevant.
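
The two relocatable-file interpretations can be sketched in C as follows. Both function names are hypothetical, and the model is deliberately simplified: it assumes the linker has already chosen a base address for each section and tracks a single "next free address" when placing common blocks.

```c
#include <stdint.h>

/* Defined symbol in a relocatable file: st_value is an offset from the
 * start of its section, so the final address is base plus offset. */
static uint32_t defined_symbol_address(uint32_t section_base, uint32_t st_value)
{
    return section_base + st_value;
}

/* SHN_COMMON symbol: st_value is an alignment. Round next_free up to the
 * nearest multiple of it; 0 and 1 are treated as "no constraint". */
static uint32_t place_common(uint32_t next_free, uint32_t st_value)
{
    if (st_value <= 1)
        return next_free;
    uint32_t rem = next_free % st_value;
    return rem == 0 ? next_free : next_free + (st_value - rem);
}
```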

Symbols for Special Sections

Each symbol is assigned to some section of the object file, denoted by the section field st_shndx, which is an index into the section header table.

However, there are three special pseudosections that do not have entries in the section header table (note that these pseudosections exist only in relocatable object files; they are removed in executable object files):

  • ABS is for symbols that should not be relocated.
  • UNDEF is for undefined symbols – that is, symbols that are referenced in this object module but defined elsewhere.
  • COMMON is for uninitialized data objects that are not yet allocated. For COMMON symbols, the st_value field gives the alignment requirement, and st_size gives the minimum size.
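
These pseudosections correspond to reserved st_shndx values defined by the ELF specification (SHN_UNDEF is 0, SHN_ABS is 0xfff1, SHN_COMMON is 0xfff2). A minimal sketch of classifying a symbol by that field follows; the enumerator and function names are hypothetical, picked to avoid clashing with the macros a system ELF header may define.

```c
#include <stddef.h>
#include <stdint.h>

/* Reserved section index values from the ELF specification. */
enum {
    IDX_UNDEF  = 0,
    IDX_ABS    = 0xfff1,
    IDX_COMMON = 0xfff2
};

/* Returns the pseudosection name for st_shndx, or NULL when the value is
 * not one of the three discussed here (this sketch does not distinguish
 * ordinary section indices from other reserved values). */
static const char *pseudosection(uint16_t st_shndx)
{
    switch (st_shndx) {
    case IDX_UNDEF:  return "UNDEF";
    case IDX_ABS:    return "ABS";
    case IDX_COMMON: return "COMMON";
    default:         return NULL;
    }
}
```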

The distinction between COMMON and .bss is subtle. When common symbols are enabled (-fcommon, which was the default before GCC 10), GCC assigns symbols in relocatable object files to COMMON and .bss using the following convention:

  • COMMON: Uninitialized global variables.
  • .bss: Uninitialized static variables, and global or static variables that are initialized to zero.
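
The convention can be illustrated with a few file-scope definitions. This is a sketch, assuming GCC-style behavior with common symbols enabled; the variable names are made up, and the actual placement can be checked on a given toolchain by inspecting the object file with a tool such as nm or readelf.

```c
int tentative;              /* uninitialized global: COMMON with -fcommon,
                               .bss with -fno-common */
static int file_local;      /* uninitialized static: .bss */
int zero_global = 0;        /* global initialized to zero: .bss */
static int zero_static = 0; /* static initialized to zero: .bss */
int initialized = 42;       /* nonzero initializer: .data, for contrast */

/* Touches every variable so none of them is reported as unused. */
int sum_all(void)
{
    return tentative + file_local + zero_global + zero_static + initialized;
}
```

Whichever section a variable lands in, the C language still guarantees that objects with static storage duration and no initializer start out as zero.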

Linking – Symbol Resolution

The linker resolves symbol references by associating each reference with exactly one symbol definition from the symbol tables of its input relocatable object files.

Symbol resolution is straightforward for references to local symbols that are defined in the same module as the reference. The compiler allows only one definition of each local symbol per module. The compiler also ensures that static local variables, which get local linker symbols, have unique names.
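
As a small illustration of the static local case, the two functions below each declare a static variable named x. They are distinct objects, so the compiler has to emit two differently named local linker symbols for them; the exact naming scheme (often the source name plus a numeric suffix) is toolchain-specific and is not assumed here.

```c
int f(void)
{
    static int x = 0;    /* local linker symbol, given a unique name */
    return ++x;
}

int g(void)
{
    static int x = 100;  /* a different object despite the same source name */
    return ++x;
}
```

Calling f twice and then g yields 1, 2, and 101, which shows that the two x objects do not share storage.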

Resolving references to global symbols, however, is trickier. When the compiler encounters a symbol (either a variable or a function name) that is not defined in the current module, it assumes that the symbol is defined in some other module, generates a linker symbol table entry, and leaves it for the linker to handle.

Resolve duplicate symbol names

Reference: Computer Systems: A Programmer’s Perspective, Chapter 7.6.1

The input to the linker is a collection of relocatable object modules. Each of these modules defines a set of symbols, some of which are local (visible only to the module that defines them), and some of which are global (visible to other modules).

If multiple modules define global symbols with the same name, the Linux compilation system uses the following rules (which can cause unintentional bugs).

  • At compile time, the compiler exports each global symbol to the assembler as either strong or weak, and the assembler encodes this information implicitly in the symbol table of the relocatable object file.
    • Functions and initialized global variables get strong symbols.
    • Uninitialized global variables get weak symbols.
  • At linking time, Linux linkers use the following rules for dealing with duplicate symbol names:
    1. Multiple strong symbols with the same name are not allowed.
    2. Given a strong symbol and multiple weak symbols with the same name, choose the strong symbol.
    3. Given multiple weak symbols with the same name, choose any of the weak symbols.
    4. Symbols with the same name may be defined with different types in different modules; the linker allows this, at most with a warning. For example, if one file defines a strong symbol as an int and another file declares a weak symbol of the same name as a double, the linker resolves both to the address of the int. A store through the double then overwrites the int and, where a double is larger than an int, the memory that follows it (see CS:APP, Chapter 7.6.1).
      • Using -fno-common or -Werror can help to avoid some of these bugs.
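
Rules 1 to 3 above can be sketched as a single decision function. This is an illustrative model rather than any real linker's code: the type and function names are hypothetical, and for the weak-versus-weak case, where any choice is permitted, the sketch arbitrarily keeps the definition seen first.

```c
/* Strength of a global symbol definition, in the sense used above. */
enum sym_strength { SYM_WEAK, SYM_STRONG };

/* What the linker does when a second definition of a name arrives. */
enum sym_outcome { KEEP_EXISTING, TAKE_INCOMING, ERROR_DUPLICATE };

static enum sym_outcome resolve_duplicate(enum sym_strength existing,
                                          enum sym_strength incoming)
{
    /* Rule 1: multiple strong symbols with the same name are not allowed. */
    if (existing == SYM_STRONG && incoming == SYM_STRONG)
        return ERROR_DUPLICATE;
    /* Rule 2: a strong symbol wins over weak ones. */
    if (incoming == SYM_STRONG)
        return TAKE_INCOMING;
    /* Rule 2 (existing is strong) or rule 3 (both weak, any choice is
     * allowed; this sketch keeps the first one). */
    return KEEP_EXISTING;
}
```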

Linking with Static Libraries

Reference: Computer Systems: A Programmer’s Perspective, Chapter 7.6.2

Related modules can be packed into a single file called a static library, which can then be supplied as input to the linker. When the system builds the output executable, the linker copies only the object modules in the library that are referenced by the application program.

In a static library, related functions are compiled into separate object modules and then packaged in a single static library file.

At link time, the linker will only copy the object modules that are referenced by the program, which reduces the size of the executable on disk and in memory.

On Linux systems, static libraries are stored on disk in a particular file format known as an archive. An archive is a collection of concatenated relocatable object files, with a header that describes the size and location of each member object file. Archive filenames are denoted with the .a suffix.

Use the ar tool to create a static library: ar rcs libvector.a addvec.o multvec.o

When a static library (say libvector.a) is linked into a program (say main.c):

  • If main.o references a symbol defined by addvec.o, the linker copies addvec.o into the executable.
  • If main.o references no symbol defined by multvec.o, the linker does not copy that module into the executable.

To resolve references using static libraries, the linker will scan the relocatable object files and archives left to right in the same sequential order that they appear on the compiler driver’s command line. During this scan, the linker maintains a set E of relocatable object files that will be merged to form the executable, a set U of unresolved symbols (i.e., symbols referred to but not yet defined), and a set D of symbols that have been defined in previous input files:

  • Initially, E, U, and D are empty.
  • For each input file f on the command line, the linker determines if f is an object file or an archive.
    • If f is an object file, the linker adds f to E, updates U and D to reflect the symbol definitions and references in f, and proceeds to the next input file.
    • If f is an archive, the linker attempts to match the unresolved symbols in U against the symbols defined by the members of the archive. If any member m defines a symbol that resolves a reference in U, then m is added to E, and the linker updates U and D to reflect the symbol definitions and references in m. This process iterates over the member object files in the archive until a fixed point is reached where U and D no longer change. At this point, any member object files not contained in E are simply discarded and the linker proceeds to the next file.
  • If U is nonempty when the linker finishes scanning the input files on the command line, it prints an error and terminates. Otherwise, it merges and relocates the object files in E to build the output executable file.
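
The scan described above can be sketched in C with a toy model of object files. Everything here is hypothetical and simplified for illustration: symbols are plain strings, each object lists at most a few definitions and references, the sets E, U, and D are small fixed-size arrays, and duplicate definitions are not diagnosed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 32
#define MAX_SYMS  4

/* Toy relocatable object: NULL-terminated lists of symbol names. */
struct object {
    const char *name;
    const char *defs[MAX_SYMS];
    const char *refs[MAX_SYMS];
};

/* One command-line input: a single object, or an archive of members. */
struct input {
    bool is_archive;
    const struct object *members;
    size_t nmembers;
};

struct set { const char *items[MAX_ITEMS]; size_t n; };
struct link_state { struct set E, U, D; };

static bool set_has(const struct set *s, const char *x)
{
    for (size_t i = 0; i < s->n; i++)
        if (strcmp(s->items[i], x) == 0)
            return true;
    return false;
}

static void set_add(struct set *s, const char *x)
{
    if (!set_has(s, x) && s->n < MAX_ITEMS)
        s->items[s->n++] = x;
}

static void set_remove(struct set *s, const char *x)
{
    for (size_t i = 0; i < s->n; i++)
        if (strcmp(s->items[i], x) == 0) {
            s->items[i] = s->items[--s->n];
            return;
        }
}

/* Adds o to E, then updates D and U from its definitions and references. */
static void add_object(struct link_state *st, const struct object *o)
{
    set_add(&st->E, o->name);
    for (size_t i = 0; i < MAX_SYMS && o->defs[i]; i++) {
        set_add(&st->D, o->defs[i]);
        set_remove(&st->U, o->defs[i]);
    }
    for (size_t i = 0; i < MAX_SYMS && o->refs[i]; i++)
        if (!set_has(&st->D, o->refs[i]))
            set_add(&st->U, o->refs[i]);
}

/* True if o defines at least one symbol that is currently unresolved. */
static bool resolves_something(const struct link_state *st,
                               const struct object *o)
{
    for (size_t i = 0; i < MAX_SYMS && o->defs[i]; i++)
        if (set_has(&st->U, o->defs[i]))
            return true;
    return false;
}

/* Scans inputs left to right; returns true if U is empty at the end. */
static bool link_inputs(struct link_state *st, const struct input *in, size_t n)
{
    *st = (struct link_state){0};
    for (size_t f = 0; f < n; f++) {
        if (!in[f].is_archive) {
            add_object(st, &in[f].members[0]);
            continue;
        }
        /* Iterate over the archive members until a fixed point is reached. */
        for (bool changed = true; changed; ) {
            changed = false;
            for (size_t m = 0; m < in[f].nmembers; m++) {
                const struct object *o = &in[f].members[m];
                if (!set_has(&st->E, o->name) && resolves_something(st, o)) {
                    add_object(st, o);
                    changed = true;
                }
            }
        }
    }
    return st->U.n == 0;
}
```

Because U is empty when an archive is scanned before any object that references it, this model also shows why the order of files on the command line matters: placing libvector.a before main.o leaves addvec unresolved.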

Misc

Common Symbol

Common symbols exist for backward compatibility. Relying on them is now considered bad practice.

Common symbols are a feature that allows a programmer to define several variables of the same name in different source files. This is in contrast with the more popular way of using extern to reference a variable defined in another file.

Use the GCC flag -fno-common to avoid these multiply defined symbols; it prevents the compiler from placing such variables in the common section.

// lld/ELF/Symbols.h

// Represents a common symbol.
//
// On Unix, it is traditionally allowed to write variable definitions
// without initialization expressions (such as "int foo;") to header
// files. Such definition is called "tentative definition".
//
// Using tentative definition is usually considered a bad practice
// because you should write only declarations (such as "extern int
// foo;") to header files. Nevertheless, the linker and the compiler
// have to do something to support bad code by allowing duplicate
// definitions for this particular case.
//
// Common symbols represent variable definitions without initializations.
// The compiler creates common symbols when it sees variable definitions
// without initialization (you can suppress this behavior and let the
// compiler create a regular defined symbol by -fno-common).
//
// The linker allows common symbols to be replaced by regular defined
// symbols. If there are remaining common symbols after name resolution is
// complete, they are converted to regular defined symbols in a .bss
// section. (Therefore, the later passes don't see any CommonSymbols.)
class CommonSymbol : public Symbol {
public:
  CommonSymbol(InputFile *file, StringRefZ name, uint8_t binding,
               uint8_t stOther, uint8_t type, uint64_t alignment, uint64_t size)
      : Symbol(CommonKind, file, name, binding, stOther, type),
        alignment(alignment), size(size) {}

  static bool classof(const Symbol *s) { return s->isCommon(); }

  uint32_t alignment;
  uint64_t size;
};
Created Jul 15, 2020 // Last Updated Aug 2, 2020