The Flowgraph Layer

Submitted by admin on Fri, 04/27/2018 - 14:21

2. The Flowgraph Layer

On top of the diabloobject library, Diablo also offers the diabloflowgraph library. This library offers a more detailed view of the program code sections by splitting them in individual instructions and constructing a control flow graph. This control flow graph, combined with the data dependency graph from the link layer, forms the Augmented Whole-Program Control Flow Graph (AWPCFG), which is Diablo's most detailed representation of the program that is being rewritten.

The basic data structures in this layer are:

  1. t_cfg: the control flow graph.
  2. t_ins: a program instruction.
  3. t_bbl: a basic block. This is a group of instructions that is guaranteed to be executed together. These blocks are the nodes of the control flow graph.
  4. t_cfg_edge: a control flow graph edge. These edges model the possible control flow between the basic blocks.
  5. t_function: a function is a single-entry group of basic blocks, roughly corresponding to the original program procedures.

2.1. Instructions and basic blocks

The t_ins structure represents the architecture-independent information about an instruction. This type is derived from t_relocatable. It contains some information about the registers that are used and defined by this instruction, the instruction type, some architecture-independent instruction flags, etc. Each architecture backend derives an architecture-specific t_arch_ins structure from t_ins (e.g. t_arm_ins, t_i386_ins). This structure represents the architecture-specific information about the instruction: opcode, operands, etc.

A basic block is defined as "A sequence of instructions with a single entry point, single exit point, and no internal branches". As a consequence, once control flow enters the first instruction of a basic block, all other instructions in the basic block are guaranteed to be executed without interruption. This property makes the basic block ideal as node in the control flow graph. The t_bbl structure is, like the t_ins structure, derived from t_relocatable. This means basic blocks have addresses (typically the address of their first instruction) and a size (the sum of the sizes of all the block's instructions).

As a rule, relocations reference basic blocks and not instructions. Relocations to code imply that the referenced code can be called or jumped to. Even if the instruction at the referenced address is changed or deleted, the reference should keep existing (and move to the next instruction). The easiest way to accomplish this is to let the relocation reference the basic block instead of the instruction: no matter how many instructions are removed from the block, the relocation will always point to the first instruction of the remaining block.

On the other hand, relocations always come from instructions and never from basic blocks. The fact that some relocatable entity is referenced depends very much on the referencing instruction. If the instruction is removed, the reference should be removed as well. Hence, the relocation comes from the instruction itself and not from the containing basic block.

Note: On some architectures(e.g. ARM), there is data interspersed with the instructions in the code section. This data is represented in so-called data instructions, which are just dummy instructions that hold the data. These data instructions are grouped in data blocks, which are also represented as t_bbl structures.

2.2. CFG Edges

The control flow graph edges model all possible control flow between the basic blocks. The possible edge types are:

  • ET_FALLTHROUGH: control falls through to the next block, either because there is no control flow instruction or the instruction is conditional and this edge models the "not-taken" path.
  • ET_JUMP: control flows because of a (possibly conditional) jump instruction.
  • ET_CALL: function call
  • ET_SWI: system call (software interrupt)
  • ET_SWITCH: the basic block at the head of the edge implements a switch statement. ET_SWITCH edges connect all possible cases to the switch block.
  • ET_RETURN: this edge models a function return.
  • ET_COMPENSATING: this edge compensates for non-call interprocedural control flow edges. This will be explained in the next section.
  • ET_IPJUMPET_IPFALLTHRUET_IPSWITCH: interprocedural variants of the regular edges. This will also be explained in the next section.

Interprocedural edges are divided in two categories: call edges and interprocedural jump, switch or fallthrough edges are forward interprocedural edges, return and compensating edges are backward interprocedural edges. The interprocedural edges are paired: call edges have corresponding return edges and all other types of forward interprocedural edges have corresponding compensating edges. If a function does not return, an incoming forward interprocedural edge need not have a corresponding edge. However, every backward interprocedural edge should be paired to a forward interprocedural edge: it is impossible to return from a function that hasn't been entered.

2.3. Functions

Diablo's control flow graph is partitioned into functions. These correspond roughly to the procedures from the original program, but not entirely. Functions in Diablo are single-entry and single-exit regions. The single-entry requirement is met by starting a new function with every block that has incoming call edges, and all blocks that have incoming edges from two different functions. The single-exit requirement is met by introducing a dummy return block at the end of each function. All basic blocks that end in a return instruction are connected to this return block by ET_JUMP edges. The actual ET_RETURN edges then flow from the dummy return block to the function return site. If the function never returns (e.g. because it calls exit() on every code path), there will be no return block.

Unlike in source code procedures, it is very much possible for control flow to enter a function by other means than a function call. Hand-written assembler code (which can be found in any C library) need not follow the conventions compiler-generated code follows, and control flow can go in all kinds of unconventional ways. Optimizing compilers can perform tail-call optimization, thus creating interprocedural jumps between procedures. All of this is compounded by the single-entry requirement Diablo imposes on functions: once some interprocedural edges exist, new functions need to be created starting from their entry point, leading to more interprocedural edges, and so on. Hence the need for the interprocedural edge types introduced in the previous section. However, the presence of interprocedural control flow poses a new problem: if control flows from function A to function B without going through a call/return pair, how can it return to A's caller upon return of B? For this we would need to add extra return edges from B's return block to all possible return sites of function A. This would be very complicated. Fortunately, a simpler solution exists: for every forward interprocedural (non-call) edge from A to B a corresponding compensating edge is added from B's return block to A's return block. When B returns, the control can flow back to A's return block and from there on to the return sites of function A.

2.4. Unknown control flow

Unfortunately, it is usually impossible to build a completely accurate control flow graph. In each program there are a number of indirect control flow transfers. If Diablo cannot determine the possible targets of these transfers, it should conservatively assume that these transfers can go to any instruction in the program. Of course, this would make it nearly impossible to do any program rewriting as the control flow graph would be too conservative. Fortunately, we can be more precise: an indirect control transfer can only target some piece of code if the address of this piece of code is stored somewhere in the program. As all stored program addresses are represented by relocations, indirect control flow transfers can only target basic block referenced by relocations.

Still, if there are n indirect control transfers and m possible targets, this would mean there would be n*m unknown control flow edges. As both n and m are typically large values, this would be very impractical. Therefore, the unknown node (AKA hell node) is introduced. This is a dummy control flow graph node that has incoming edges from all indirect control flow transfers and outgoing edges to all possible indirect control flow targets. This way, only n+m edges are needed to represent all possible indirect control flow. For analyses, it suffices to make worst-case assumptions for the unknown node. This will guarantee conservative treatment of all indirect control flow.

In practice, Diablo distinguishes different unknown nodes, for different types of unknown control flow:

  • the hell node models general unknown control flow
  • the call hell node models indirect procedure calls
  • the swi hell node models system calls: its incoming edges are ET_SWI edges
  • the longjmp hell node models longjmp-type control flow

As mentioned before, all possible targets of indirect control flow are marked by relocations. These relocations contain a reference to their corresponding unknown control flow edge. If the relocation is removed from the program, the unknown control flow edge is automatically removed as well.

2.5. Navigating the hierarchy

This section summarizes the ways in which the hierarchy can be navigated.

2.5.1. from t_cfg

The t_cfg structure keeps unordered lists of all basic blocks, edges and functions. These can be accessed through the CFG_FOREACH_{BBL|EDGE|FUNCTION} iterators.

A CFG keeps a pointer to the code section it was derived from in the CFG_SECTION field.

2.5.2. from t_function

The t_function structure keeps a list of all basic blocks associated with this function. This list is in no particular order, except for the first block, which is guaranteed to be the entry block of the function, and the last block, which is the function's dummy return block (if there is any). The function's entry can always be accessed through FUNCTION_BBL_FIRST. The exit block can be accessed with the FunctionGetExitBlock() function, which returns NULL if the function has no exit block, and otherwise returns FUNCTION_BBL_LAST.
The list can be iterated with the FUNCTION_FOREACH_BBL iterator.

A function keeps a pointer to its CFG in the FUNCTION_CFG field.

Note: the following code fragments have different results:

         //do something


    //do something

The reason is that functions never contain data blocks. Directly iterating over all basic blocks in the control flow graph includes the data blocks, iterating over all basic blocks in all functions includes only the code blocks.

2.5.3. from t_bbl

The list of instructions in the basic block can be iterated over with the BBL_FOREACH_INS iterator. This list is of course ordered.

It is also possible to iterate over the predecessor and successor edges of a basic block with the BBL_FOREACH_{PRED|SUCC}_EDGE. These lists are unordered.

A basic block keeps pointers to its CFG and its function in the BBL_CFG and BBL_FUNCTION fields.

2.5.4. from t_cfg_edge

This structure keeps pointers to the head of the edge (CFG_EDGE_HEAD), its tail (CFG_EDGE_TAIL) and, for interprocedural edges, its corresponding edge (CFG_EDGE_CORR).

2.5.5. from t_ins

This structure keeps a pointer to its basic block (INS_BBL) and to the next and previous instructions in the basic block (INS_INEXT and INS_IPREV).