1. The Link Layer
At the link layer, represented by the diabloobject library, there are five important data structures:
- t_object: an object file (relocatable or executable)
- t_relocatable: the base class for all relocatable entities
- t_section: an object file section (code or data or ...). This is a relocatable entity and thus derived from t_relocatable.
- t_symbol: a "label" attached to a relocatable entity
- t_reloc: a relocation. This structure conveys information about the relations existing between different relocatable entities.
1.1. Objects and sections
Each object file contains a number of sections. Each section has a type, indicating the kind of information it carries:
- code sections hold machine code instructions. They are typically read-only.
- rodata sections hold constant (read-only) data.
- data sections hold mutable data.
- bss sections hold zero-initialized mutable data.
- note sections contain some information needed by the OS loader to correctly load the program. They are not visible to the program code and their contents do not influence the execution of the program.
An object file can contain other section types as well (e.g. debug sections), but they are not necessary for the correct running of the program, and thus are ignored by Diablo. The t_object structure holds pointers to all of the sections of the object file it represents. These are stored in arrays per section type; you can access them through the OBJECT_{CODE|RODATA|DATA|BSS|NOTE} getters.
The t_section structure contains all information needed to describe a section: its type, its size and address, its alignment constraints, etc. The SECTION_DATA field holds a pointer to the section's contents. The exception is bss sections: their contents are not represented in Diablo, as they contain only zeroes anyway. For code sections, the SECTION_DATA pointer points to different things depending on the state of the section: the raw section contents, an array of disassembled instructions, or the control flow graph of the program.
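The special handling of bss sections can be illustrated with a small, self-contained sketch. The types and the helper below (sketch_section, sketch_byte_at) are hypothetical stand-ins invented for this example; they are not Diablo's real t_section or SECTION_DATA. The sketch only assumes that a bss section records a size but stores no contents.

```c
#include <stddef.h>

/* Hypothetical section kinds, mirroring the types listed above. */
typedef enum { SK_CODE, SK_RODATA, SK_DATA, SK_BSS, SK_NOTE } sketch_type;

/* Hypothetical, simplified section record (not Diablo's t_section). */
typedef struct {
    const char *name;
    sketch_type type;
    size_t size;                /* size in bytes */
    const unsigned char *data;  /* stored contents; NULL for bss */
} sketch_section;

/* Return the byte at offset off, or -1 if off is out of range.
 * A bss section has no stored contents, so every byte reads as zero. */
int sketch_byte_at(const sketch_section *s, size_t off)
{
    if (off >= s->size)
        return -1;
    if (s->type == SK_BSS || s->data == NULL)
        return 0;
    return s->data[off];
}
```

With this model, reading any in-range offset of a 16-byte bss section yields 0 even though no buffer was ever allocated for it.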
1.2. Sub- and parent objects and sections
The executable file to be rewritten by Diablo is called the parent object. This executable is created by linking together a number of relocatable object files and libraries. These object files are represented in Diablo as well, and are called subobjects. Likewise, the sections of the parent object are referred to as parent sections, and the sections from the subobjects are subsections. All subsections are mapped to a parent section. There are several ways to navigate this hierarchy:
- OBJECT_FOREACH_SUBOBJECT: for a parent object, iterate over all subobjects.
- OBJECT_FOREACH_SECTION: for a given object (either parent or subobject), iterate over all sections. It is also possible to iterate over sections of a specific type.
- SECTION_PARENT_SECTION: returns the parent section of a subsection.
- There is no easy way to iterate directly over all subsections of a given parent section, but the function SectionGetSubsections() returns the list of all subsections for a parent section. You can then just iterate over this list.
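The parent/subsection relation described above can be modelled in a few lines of plain C. This is only a sketch: hier_section, hier_attach and hier_total_sub_size are hypothetical names made up for illustration and do not correspond to Diablo's macros or to SectionGetSubsections(). It assumes each subsection keeps a pointer to its parent and each parent keeps a linked list of its subsections.

```c
#include <stddef.h>

/* Hypothetical model of a section in the parent/subsection hierarchy. */
typedef struct hier_section {
    const char *name;
    size_t size;
    struct hier_section *parent;     /* NULL for a parent section */
    struct hier_section *first_sub;  /* head of the subsection list (parents only) */
    struct hier_section *next_sub;   /* next subsection mapped to the same parent */
} hier_section;

/* Map a subsection to a parent section by prepending it to the parent's list. */
void hier_attach(hier_section *parent, hier_section *sub)
{
    sub->parent = parent;
    sub->next_sub = parent->first_sub;
    parent->first_sub = sub;
}

/* Walk all subsections of a parent section and add up their sizes. */
size_t hier_total_sub_size(const hier_section *parent)
{
    size_t total = 0;
    for (const hier_section *s = parent->first_sub; s != NULL; s = s->next_sub)
        total += s->size;
    return total;
}
```

Going up the hierarchy is a single pointer dereference, while going down requires walking a list, which matches the asymmetry in the navigation options listed above.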
For Diablo, a subsection is considered to be a fundamental unit of information: the contents of a subsection cannot be divided into smaller blocks. Code sections are the only exception to this rule: they can be subdivided into individual instructions, but this is done in the flowgraph layer.
1.3. Relocatable entities
A relocatable entity is, simply put, any program entity that has an address. Examples are program sections, but also instructions and basic blocks (these will be introduced in the section about the flowgraph layer). The t_relocatable structure is the common base class for all these relocatable entities. It contains some information that is common to all types of relocatable entities: the current address, the original address (its address in the input program), the size of the entity, a list of relocations originating from this entity and a list of relocations referring to this entity.
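In C, a "base class" of this kind is commonly expressed by embedding the base structure as the first member of each derived structure. The sketch below shows that general technique only; demo_relocatable and demo_section are hypothetical, heavily reduced types, and the field names are assumptions rather than Diablo's actual layout (the relocation lists are omitted).

```c
#include <stdint.h>

/* Hypothetical base record holding the fields common to all entities. */
typedef struct {
    uint64_t caddress;     /* current address */
    uint64_t old_address;  /* address in the input program */
    uint64_t size;         /* size of the entity in bytes */
} demo_relocatable;

/* Hypothetical derived record: the base must be the first member so that
 * a pointer to the section can be used as a pointer to the base. */
typedef struct {
    demo_relocatable base;
    const char *name;
} demo_section;

/* Generic helpers that work on any entity through its base part. */
uint64_t demo_end_address(const demo_relocatable *r)
{
    return r->caddress + r->size;
}

int demo_has_moved(const demo_relocatable *r)
{
    return r->caddress != r->old_address;
}
```

Because the base is the first member, both &sec.base and (demo_relocatable *)&sec refer to the same object, so code written against the base type works for sections, instructions and basic blocks alike.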
1.4. Symbols
To identify functions and variables (among other things), object files contain symbols that act as labels attached to relocatable entities. In Diablo, these symbols are represented by a t_symbol structure. This structure contains the name of the symbol and its type, and indicates whether or not this is a global symbol.
1.5. Relocations
The task of a linker is to create a working program out of all the different object files and libraries supplied by the developer. To do this, the linker has to resolve the dependencies between the input objects: if code in one object file calls a function in another object file, the linker has to look up this function and write the correct address for the function in the call instruction. The object files represent their dependencies as relocations: these structures identify the source of the reference, what is referenced, and how the address that needs to be filled in is to be computed (e.g. as an absolute address, as a pc-relative offset, ...).
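The two computations mentioned above can be made concrete with a short sketch. It follows the common ELF convention in which an absolute relocation evaluates to S + A and a pc-relative one to S + A - P, where S is the address of the referenced entity, A a constant addend, and P the address of the location being patched. The names reloc_kind and reloc_value are hypothetical, and the 32-bit width is an assumption chosen for the example.

```c
#include <stdint.h>

/* Hypothetical relocation kinds for this example. */
typedef enum { RK_ABSOLUTE, RK_PC_RELATIVE } reloc_kind;

/* Compute the value to write at the patched location.
 * target: address of the referenced entity (S)
 * addend: constant offset stored with the relocation (A)
 * place:  address of the field being patched (P)
 * Arithmetic is unsigned and wraps modulo 2^32, so a backward
 * pc-relative reference yields the two's complement encoding. */
uint32_t reloc_value(reloc_kind kind, uint32_t target, int32_t addend, uint32_t place)
{
    uint32_t value = target + (uint32_t)addend;
    if (kind == RK_PC_RELATIVE)
        value -= place;
    return value;
}
```

For a target at 0x8000 with addend 4, the absolute form gives 0x8004; patched at address 0x1000, the pc-relative form gives 0x7004.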
This is a very important data structure, as it models the dependencies between different relocatable entities. If a relocatable entity is not referenced by any relocation, it is in effect unreachable and can be removed from the program. This is because no other part of the program can produce the address of this relocatable entity if there is no relocation pointing to it. If the address is never produced, the contents of the entity can never be used in the program.
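The rule above can be sketched as a simple pass over the relocations. The code is a minimal illustration, not Diablo's algorithm: entities are represented by array indices, ref_edge and ref_mark_unreferenced are hypothetical names, and the sketch deliberately ignores entities that must be kept for other reasons (the program entry point is one such case, which is an assumption not stated in this section).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical relocation: entity 'from' contains a reference to entity 'to'.
 * Both are indices into an implicit array of n_entities entities. */
typedef struct {
    size_t from;
    size_t to;
} ref_edge;

/* Fill unreferenced[i] with true for every entity that no relocation
 * points to, and return how many such entities there are. */
size_t ref_mark_unreferenced(const ref_edge *relocs, size_t n_relocs,
                             bool *unreferenced, size_t n_entities)
{
    size_t count = 0;

    for (size_t i = 0; i < n_entities; i++)
        unreferenced[i] = true;

    /* Any entity that is the target of a relocation is referenced. */
    for (size_t i = 0; i < n_relocs; i++)
        if (relocs[i].to < n_entities)
            unreferenced[relocs[i].to] = false;

    for (size_t i = 0; i < n_entities; i++)
        if (unreferenced[i])
            count++;

    return count;
}
```

Removing an entity also removes the relocations that originate from it, so repeating the pass after each removal could expose further unreferenced entities.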
The t_reloc structure in Diablo represents such a relocation. It has a FROM field pointing to the relocatable entity that contains the reference and a TO_RELOCATABLE field that points to the referenced relocatable entity. Diablo uses a special stack-based language to describe how the relocation should be computed; the details of this language are documented separately.
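To give a feel for what a stack-based description of a relocation can look like, here is a tiny evaluator. It is purely illustrative: the opcodes (CALC_PUSH_TARGET and so on) and the function calc_eval are invented for this sketch and are not the syntax or semantics of Diablo's actual relocation language.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes of a toy stack language. */
typedef enum {
    CALC_PUSH_TARGET,  /* push the address of the referenced entity */
    CALC_PUSH_ADDEND,  /* push the constant addend */
    CALC_PUSH_PLACE,   /* push the address of the patched location */
    CALC_ADD,          /* pop rhs, pop lhs, push lhs + rhs */
    CALC_SUB           /* pop rhs, pop lhs, push lhs - rhs */
} calc_op;

#define CALC_STACK_MAX 16

/* Evaluate prog and store the single remaining stack value in *out.
 * Returns 0 on success, -1 if the program is malformed. */
int calc_eval(const calc_op *prog, size_t n, uint32_t target,
              uint32_t addend, uint32_t place, uint32_t *out)
{
    uint32_t stack[CALC_STACK_MAX];
    size_t sp = 0;

    for (size_t i = 0; i < n; i++) {
        switch (prog[i]) {
        case CALC_PUSH_TARGET:
        case CALC_PUSH_ADDEND:
        case CALC_PUSH_PLACE:
            if (sp == CALC_STACK_MAX)
                return -1;  /* stack overflow */
            stack[sp++] = prog[i] == CALC_PUSH_TARGET ? target
                        : prog[i] == CALC_PUSH_ADDEND ? addend
                        : place;
            break;
        case CALC_ADD:
        case CALC_SUB: {
            if (sp < 2)
                return -1;  /* stack underflow */
            uint32_t rhs = stack[--sp];
            uint32_t lhs = stack[--sp];
            stack[sp++] = prog[i] == CALC_ADD ? lhs + rhs : lhs - rhs;
            break;
        }
        default:
            return -1;      /* unknown opcode */
        }
    }
    if (sp != 1)
        return -1;          /* program must leave exactly one value */
    *out = stack[0];
    return 0;
}
```

In this toy language a pc-relative reference would be the program "push target, push addend, add, push place, sub". Describing the computation as data rather than as hard-coded cases is what lets one evaluator serve many relocation types.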
Each relocatable entity has a list of relocations that refer to this entity (RELOCATABLE_REFED_BY), and a list of relocations coming from this entity (RELOCATABLE_REFERS_TO). These are singly-linked lists of t_reloc_ref structures. The actual relocation can then be accessed through the ->rel field of this structure. The following example code iterates over all relocations referring to a given relocatable entity:
    t_relocatable *r = ...;
    t_reloc_ref *rr;

    /* walk the singly-linked list of references to r */
    for (rr = RELOCATABLE_REFED_BY(r); rr; rr = rr->next)
    {
      t_reloc *rel = rr->rel;
      // do something with rel
    }
Note: Initially, just like in a real linker, a relocation in Diablo actually points to a symbol instead of a relocatable entity. After the symbol resolution phase, this layer of indirection is removed for convenience. Symbols nevertheless stay attached to the relocatable entities, as they can be used later on to look up the location of functions and data structures in the program.
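The effect of the symbol resolution phase can be sketched as follows. The types and functions (sym_entity, sym_symbol, sym_reloc, sym_resolve_all, sym_lookup) are hypothetical and greatly simplified; they only illustrate the idea of replacing the symbol indirection with a direct pointer while keeping the symbols available for name lookups, and do not reflect Diablo's actual data layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical relocatable entity. */
typedef struct {
    int id;
} sym_entity;

/* Hypothetical symbol: a named label attached to an entity. */
typedef struct {
    const char *name;
    sym_entity *base;
} sym_symbol;

/* Hypothetical relocation: refers to a symbol before resolution and
 * directly to an entity afterwards. */
typedef struct {
    sym_symbol *to_symbol;
    sym_entity *to_entity;
} sym_reloc;

/* Replace every symbol reference by a direct pointer to its entity. */
void sym_resolve_all(sym_reloc *relocs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (relocs[i].to_symbol != NULL) {
            relocs[i].to_entity = relocs[i].to_symbol->base;
            relocs[i].to_symbol = NULL;
        }
    }
}

/* Symbols remain usable after resolution: find an entity by name,
 * or return NULL if no symbol with that name exists. */
sym_entity *sym_lookup(const sym_symbol *symbols, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(symbols[i].name, name) == 0)
            return symbols[i].base;
    return NULL;
}
```

After sym_resolve_all() runs, code that follows a relocation no longer needs to touch the symbol table, while sym_lookup() still answers questions such as "where is main?".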