Wednesday, September 14, 2011

AST Transformations: Compiler Phases and Syntax Trees

A compiler is a strange beast and seen by many as one of the last domains of black magic. And in earlier times it definitely was one of the most complex fields in Computer Science. But today compilers are things to play with as easily as a graphics program (some would say graphics programs are more complex).

There are two basic tricks that have moved compilers from the area of immensely complex problems to something that can be understood without a degree in Rocket Science. The first is to divide the work in different phases that are executed sequentially. The second is to use a tree-based representation of the program that is to be translated. These syntax trees are an abstraction of the real input and are thus named AST (Abstract Syntax Tree).

If we want to create AST transformations we should understand both, the different compiler phases and the tree-based representation.

A compiler's work can be divided in three different stages, first of all the frontend stage in which the program is read and transformed into the tree-based representation, the second middle-end stage in which the tree then is transformed and optimized, and the third and final backend stage in which the tree is instrumented with the instructions (in our case the byte code) and then written out (to memory or disk).

In the case of a compiler that creates byte code for the Java Virtual Machine the second stage is not really that important because most of the optimization work is done by the JIT compiler. In Groovy this holds true as well. Groovy++ though is a bit of an exception; the static compiler for Groovy does much work in this second area.

The Groovy compiler has nine phases. Six of these phases can be seen as part of the frontend stage; they focus on creating the full abstract syntax tree, check its correctness with regard to syntax and semantics  and complete it (e.g., add additional nodes where needed). These phases are:

  • Initialization
    In this phase the necessary files are opened, the needed resources are allocated and the environment is set up.
  • Parsing
    Here a first tree representation of the input data is created. This includes the lexical analysis, identifying words from the language and the parsing of these words according to the Groovy grammar.  This representation is still very much based on the source code i.e., a concrete representation or Concrete Syntax Tree (CST).
  • Conversion
    This phase takes the CST and transforms it to a more abstract representation, the Abstract Syntax Tree or AST.
  • Semantic Analysis
    In every language there are statements for which the normal grammar cannot decide whether or not they are correct. So in this phase the Groovy compiler, by hand, checks all these cases. This includes resolving classes, static imports and scope of variables. When this phase is complete the input is seen as valid source.
  • Canonicalization
    In this phase access from inner classes to the surrounding scope is resolved. Interestingly, the Groovy++ compiler does most of its work here. 
Which phase is suited best depends on what we want to do with the AST transformation. If we need information about local variables or need to access the already resolved classes, then "Semantic Analysis" is the earliest phase that we can use. "Canonicalization" could also be suitable in this case, because this phase could be seen as the phase representing the second middle-end stage. If you do not need the information provided by these phases or if you would like to introduce and use local variables without having to think about scope then the earlier "Conversion" phase would be the best choice. I personally prefer the "Semantic Analysis" phase.

The Backend stage has to create the bytecode representation from the AST. This again is done in different phases:

  • Instruction SelectionClass Generation
    These phases are also called "Class Generation 1" and "Class Generation 2". They are responsible for two major steps on the way to the final class. The first, and more important step (at least for the AST transformations) completes the AST and fills in everything which the programmer can so conveniently ignore (e.g. optional return statements). The second is the selection of the target bytecode set (which could be "Pre JDK5" or "Post JDK5" and following that the instrumentation of the AST with the bytecodes. Now that the AST is complete and instrumented with the correct instruction set the bytecode can be generated in memory.
  • Output
    The name is the game. This phase writes the bytecode that has been generated in the last phase to the output.
  • Finalization
    This phase is simply for housekeeping purposes. Allocated resources are freed, the environment is cleaned, files are closed etc., to guarantee that everything is in order when the compiler terminates. This is especially important since the compiler might not have been called from the command line, but instead e.g. inside an application server that tries to deliver a groovy resource through the template engine.
The default compiler phase when creating AST Transformations with the ASTBulder (which is absolutely preferrable to the alternatives) is "Class Generation", which is quite late in the process.  Furthermore, if you create only fragments of code with the AST Builder that you later want to introduce into another AST as part of your transformation there might be unwanted instructions that you do not get if you choose an earlier compiler phase. We will revisit the ASTBuilder in future blog entries.

Before you think that I know every last detail of the Groovy compiler, let me tell you that Jochen "Blackdrag" Theodorou helped me and corrected the detail information in this entry.

Now that you know about the different phases of the compiler and their respective use, stay tuned for the next entries in which we actually start to play with the compiler.

Jochens presentation of working with the compiler.

Some Interesting Classes (see Groovy sourcecode)
Groovy Grammar groovy/trunk/groovy/groovy-core/src/main/org/codehaus/groovy/antlr/groovy.g
InnerClasseVisitor org.codehaus.groovy.classgen.InnerClassVisitor
Selection of Instruction Set   org.codehaus.groovy.classgen.AsmClassGenerator
Adding Return Statements org.codehaus.groovy.classgen.ReturnAdder
Compiler Phases org.codehaus.groovy.control.Phases

No comments: