There have been many publications about control flow reconstruction from binaries/at link-time/from assembly. At link-time the algorithm basically goes something like this:
- disassemble the binary
- mark all basic block leaders (program entry point, successors of
control transfer instructions, targets of control transfer
instructions). - extract basic blocks (for each leader, put the instructions
starting at that leader, up to but not including the next leader as
a node in the CFG, the nodes are called basic blocks) - connect basic blocks with the right types of edges in the
graph-structure
Some of these steps can be really difficult.
Step 1 (disassembling) can be very cumbersome if there is data mixed in between the instructions, especially on architectures with a variable width instruction length.
Step 2 is easy for direct control flow transfers as the target of these instructions is encoded in the instruction itself. For indirect control flow all link-time/post-link-time systems use relocation information to identify possible control flow targets. Unfortunately, relocations only identify which values in the program are pointers, they do not specify how they will be used. Consider the following code fragment:
#include <stdio.h>
int
just_some_function ()
{
printf ("in just some function\n");
}
void *fake_address = ((char *) just_some_function) - 1000;
void
main ()
{
int (*real_address) () = ((char *) fake_address) + 1000;
real_address ();
}
fake_address
is a function pointer, but by simply looking at its value, one cannot determine which function will be jumped to when the pointer is dereferenced (with the appropriate offset of course).
On open-source toolchains we can avoid complicated analyses to overcome the problems with both step 1 and step 2, by patching the assembler (the safest way to handle it). See the FAQ item: "Why do I need to patch my toolchain/use a different toolchain? for more details. On other toolchains we try to use properties of the toolchain itself to overcome the problems.
Step 4 is non-trivial on architectures that feature an exposed program counter, like the ARM. If an instruction moves a value into the program counter, one can never be sure whether this is just an indirect jump or actually an indirect function call. In such cases, we rely on pattern matching heuristics to determine the right type of edge.