Next: Abstract Stack Machines Up: Compiler Theory: Intermediate Code Generation Previous: Compiler Theory: Intermediate Code Generation

Intermediate Languages

AN INTERMEDIATE REPRESENTATION (IR) is a language for an abstract machine (or a language that can be easily evaluated by an abstract machine)

It should not include too much machine specific detail.
It provides a separation between front and back ends which helps compiler portability. Code generation and assignment of temporaries to registers are clearly separated from semantic analysis.
It allows optimization independent of the target machine. Intermediate representations are by design more abstract and uniform, so optimization routines are simpler.

Properties of good intermediate representations.

Convenient to produce in the semantic analysis phase.
Convenient to translate into code for the desired target architecture.
Each construct must have a clear and simple meaning such that optimizing transformations that rewrite the IR can be easily specified and implemented.

TREE-BASED INTERMEDIATE REPRESENTATIONS. The most obvious way to present the information gained from lexical and syntax analysis is a syntax tree. In C, the representation can be handled using a struct for each node:

struct node {
    int nodetype;
    struct node *field1;
    struct node *field2;
    ...................
}

The nodetype field contains a small integer to identify the nonterminal corresponding to the node.
The other fields point to subtrees.
Unused fields may have null pointers.

Here's a C routine for allocating storage for the nodes.

struct node *mknode (int type,
                     struct node *f1, 
                     struct node *f2, 
                     struct node *f3) 
{
   struct node *t = (struct node *) malloc(sizeof(struct node)); 
   t -> nodetype = type; 
   t -> field1 = f1; 
   t -> field2 = f2; 
   t -> field3 = f3; 
   return t; 
}

The routine mknode can be used to build the syntax tree using a syntax-directed definition where each non-terminal E has a synthesized attribute E.ptr that points to the root of its syntax tree.

In practice we do not hold individual characters (letters or characters) at leaves of the tree:

i.e. we do not build subtrees for characters of a variable name.
The lexer groups these together into strings or integers.
thus we can use string or integer fields in the struct.

We can extend our definition of the struct node from earlier to include name and value fields

struct node {
    int nodetype;
    struct node *field1;
    struct node *field2;
    struct node *field3;
    char *name;
    int value;
}

Example 1 A syntax-directed definition to produce syntax trees for assignment statements could be

Production	Semantic Rule
S $\longmapsto$ $\bf id$ : = E	S.ptr := make_node(' $\bf assign$ ', make_leaf( $\bf id$ , $\bf id$ .entry), E.ptr)
E $\longmapsto$ E₁ + E₂	E.ptr := make_node(' $\bf +{^\prime}$ , E₁.ptr, E₂.ptr)
E $\longmapsto$ E₁E₂*	E.ptr := make_node(' $\bf {^\prime}$ , E₁.ptr, E₂.ptr)*
E $\longmapsto$ - E₁	E.ptr := make_node(' $\bf unimus$ ', E₁.ptr)
E $\longmapsto$ (E₁)	E.ptr := E₁.ptr
E $\longmapsto$ $\bf id$	E.ptr := make_leaf( $\bf id$ , $\bf id$ .entry)

POSTFIX NOTATION is a linearized representation of a syntax tree.

It is a list of the nodes of the tree in which every node appears immediately after its children.
The edges of a syntax tree do not appear explicitely in postfix notation.
However they can be recovered from
- the order in which the nodes appear and
- the number of operands that the operator at a node expects.
- The recovery of edges is similar to the evaluation, using a stack, of an expression in postfix notation.

Next: Abstract Stack Machines Up: Compiler Theory: Intermediate Code Generation Previous: Compiler Theory: Intermediate Code Generation

Marc Moreno Maza
2004-12-02