BY THOMAS NIEMANN
This document explains how to construct a compiler using lex and yacc. Lex and yacc are tools used to generate lexical analyzers and parsers. I assume you can program in C and understand data structures such as linked lists and trees. The introduction describes the basic building blocks of a compiler and explains the interaction between lex and yacc. The next two sections describe lex and yacc in more detail. With this background, we construct a sophisticated calculator that implements conventional arithmetic operations and control statements such as if-else and while. With minor changes, we convert the calculator into a compiler for a stack-based machine. The remaining sections discuss issues that commonly arise in compiler writing. Source code for the examples may be downloaded from the web site listed below.

Permission to reproduce portions of this document is given provided the web site listed below is referenced, and no additional restrictions apply. Source code, when part of a software project, may be used freely without reference to the author.
THOMAS NIEMANN Portland, Oregon email@example.com http://members.xoom.com/thomasn
1. INTRODUCTION
2. LEX
   2.1 Theory
   2.2 Practice
3. YACC
   3.1 Theory
   3.2 Practice, Part I
   3.3 Practice, Part II
4. CALCULATOR
   4.1 Description
   4.2 Include File
   4.3 Lex Input
   4.4 Yacc Input
   4.5 Interpreter
   4.6 Compiler
5. MORE LEX
   5.1 Strings
   5.2 Reserved Words
   5.3 Debugging Lex
6. MORE YACC
   6.1 Recursion
   6.2 If-Else Ambiguity
   6.3 Error Messages
   6.4 Inherited Attributes
   6.5 Embedded Actions
   6.6 Debugging Yacc
Until 1975, writing a compiler was a very time-consuming process. Then Lesk and Johnson published papers on lex and yacc. These utilities greatly simplify compiler writing. Implementation details for lex and yacc may be found in Aho. Lex and yacc are available from

• Mortice Kern Systems (MKS), at http://www.mks.com,
• GNU flex and bison, at http://www.gnu.org,
• Ming, at http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32,
• Cygnus, at http://www.cygnus.com/misc/gnu-win32, and
• me, at http://members.xoom.com/thomasn/y_gnu.zip (executables), and http://members.xoom.com/thomasn/y_gnus.zip (source for y_gnu.zip).
The version from MKS is a high-quality commercial product that retails for about $300US. GNU software is free. Output from flex may be used in a commercial product, and, as of version 1.24, the same is true for bison. Ming and Cygnus are 32-bit Windows95/NT ports of the GNU software. My version is based on Ming’s, but is compiled with Visual C++ and includes a minor bug fix in the file handling routine. If you download my version, be sure to retain directory structure when you unzip.

[Figure: the source code a = b + c * d enters the lexical analyzer, which emits the tokens id1 = id2 + id3 * id4. A syntax tree is built from the tokens, and the code generator then emits the instructions load id3, mul id4, add id2, store id1.]

Figure 1-1: Compilation Sequence
Lex generates C code for a lexical analyzer, or scanner. It uses patterns that match strings in the input and converts the strings to tokens. Tokens are numerical representations of strings, and simplify processing. This is illustrated in Figure 1-1. As lex finds identifiers in the input stream, it enters them in a symbol table. The symbol table may also contain other information such as data type (integer or real) and location of the variable in memory. All subsequent references to identifiers refer to the appropriate symbol table index.

Yacc generates C code for a syntax analyzer, or parser. Yacc uses grammar rules that allow it to analyze tokens from lex and create a syntax tree. A syntax tree imposes a hierarchical structure on tokens. For example, operator precedence and associativity are apparent in the syntax...