Goal of the project: make the compiler able to compile itself to demonstrate that the compiler actually works.
- The choices
- The global architecture
- The C compiler
- The assembler
- The virtual machine
- The results
We considered the compilation of two types of language:
- CAML
- C
We chose C for personal taste and to test the compiler on a wide range of sources.
The big choices:
- No preprocessor: we use
gcc -E
- No floats
- No error recovery
- Use
flex (lex)
andbison (yacc)
for the parser. So the compiler must be able to accept the code generated by these 2 utilities. - Management of separate modules, so a link editing system is necessary.
- As far as possible, we stay in compliance with ANSI C.
- Using a stack machine. Simple, but prevents the evolution of the compiler.
- On the question of performance, we are dependent on the hardware: pointer size, data alignment, and its endianness.
- We finally impose the constraint of depending on a minimum set of
functions in the standard libraries: we recompile part of the code
for
stdlib.h
,stdio.h
,stdarg.h
andstring.h
.
Compiler C (fbcc
): (4700 lines)
- Lexer
- Parser
- Declaration Management
- Data production for static variables
- Typing expressions
- Propagation of constants
- Assembly code generation
Assembler (fbas
): tiny (700 lines)
- Lexer - Parser
- Management of exported, global, or private symbols
- Generations of a relocatable executable with relocation table
Virtual Machine (fbvm
): (900 lines)
- Loading the executable and relocation
- Emulation of instructions
- Calling some functions of the standard C library thanks to traps
Libraries (startup.s
, fblib.c
): (900 lines)
- Code for startup
- Compile functions
printf
andstdarg.h
- Compiling the functions of
string.h
ANSI features not implemented:
- Types
long
,float
anddouble
const
andvolatile
type qualifiers- Structure assignments, value passing and structure return
- Characters of type
wchar_t
- Parsing constants of type
unsigned int
- Some forms of static initializations
- Some type checks in expressions and control structures.
Examples of implemented features:
- Old and new feature prototypes, with
...
to indicate a function with a number of variable parameters - Blocks in functions
- Multidimensional tables,
struct
,union
,enum
, pointers on functions - Complete management of
typedef
- All control instructions, including
goto
- Separate symbol tables for
struct
tagsunion
, andenum
, the structure fields, the labels ofgoto
- Static and Dynamic Initializations Allowed
- Typing fully ANSI compliant for almost all operators
- There is no intermediate representation for code other than that of expressions. The compiler could be seen as a first step to an intermediate code generator
- All symbol tables are managed with hash tables
- Virtually all data is stored in lists. This simplifies memory management and avoids having to manage a large number of data structures.
- The stack model is the same as the standard C model. We use a dynamic link and we also save the size of arguments passed in parameters.
The syntax of declarations requires the construction of an intermediate representation for types. We use almost the same representation for the definitive type.
type: == (base_type)
| (TYPE_POINTER) + type
| (TYPE_ARRAY dim) + type
| (TYPE_STRUCT sym) | (TYPE_UNION sym) | (TYPE_ENUM sym)
| (TYPE_FUNC func_type var_list) + type
base_type: == TYPE_CHAR | TYPE_UCHAR
| TYPE_SHORT | TYPE_USHORT
| TYPE_INT | TYPE_UINT
func_type: == FUNC_ELLIPSIS | FUNC_OLD | FUNC_NEW
var_list: == var1 + ... + varN
var: == ((name) var_storage type var_init)
var_storage: == STORAGE_DEFAULT | STORAGE_AUTO | STORAGE_REGISTER
| STORAGE_STATIC | STORAGE_EXTERN
var_init: == (INIT_EXPR expr) | (INIT_LIST var_init1 ... var_initN)
var_location: == VAR_STACK | VAR_DATA
sym_var: == (SYM_VAR var_storage type (var_location var_offset))
sym_field_struct: == (offset type)
sym_typedef: == (SYM_TYPEDEF type)
sym_struct: == (TYPE_STRUCT -1) / *if not defined */
| (TYPE_STRUCT symbol_table size align)
sym_enum_const: == (SYM_ENUM_CONST val)
expr: == (type tag expr1 ... exprN)
expr_ident: == (type EXPR_IDENT sym)
expr_call: == (type EXPR_CALL expr_func n param1 paramN)
expr_int: == (type EXPR_INT n)
expr_str: == (type EXPR_STR str1 ... strN)
cast_expr: == (type EXPR_CAST expr)
etc.
- Generate code and data in 2 segments
.text
and.data
- Data:
.byte
num,.short
num,.int
expr,.align
num,.zero
num - Tags:
.equ
sym, num, sym:,.globl
sym - Expressions of the form: sym, num, sym + num
- Start of a new module:
.module
For simplicity, the assembler is used as a linker. A symbol can be external
(not defined at present), global (thanks to the directive .globl
) or private.
The .module
directive clears all private symbols.
The generated executable contains a relocation table because we want the pointers of the virtual machine to be compatible with real pointers.
It is a stack-based machine. The fbvmspec.h
file gives the relevant information
on the architecture (endianness, alignment, size of the base types ,
the stack model). It has been assumed that the size of pointers are
the same size as int
.
reading mem: ld_b, ld_ub, ld_w, ld_uw, ld_i
mem writing: st_b, st_w, st_i
arithmetic: add_i, sub_i, mul_i, mul_ui, div_i,
div_ui, mod_i, mod_ui, NEG_I
comparisons: cmplt_i, cmple_i, cmpge_i, cmpgt_i, cmpeq_i,
cmpne_i, cmplt_ui, cmple_ui, cmpge_ui, cmpgt_ui
logical: and_i, or_i, xor_i, not_i,
shl_i, SHR_I, shr_ui
conversions: cvt_i_b, cvt_i_ub, cvt_i_w, cvt_i_uw
constants: li_i n, libp_i n
jumps: jeq_i n, jne_i n, switch_i, jmp n
functions: jsr n, rts
stack management: dup, pop, addsp n
system: libcall n
- The compiler compiles itself, and the compiled version recompiles itself giving the same code. The speed is reasonable. To get to this point, it required not just the working compiler, but also the creation of the whole compilation and execution environment.
- Types in C are extremely complex. Surprisingly, the compiler for such a complex langauge can be quite simple.
- The choice of a stack-based virtual machine was not optimal. We did not think we had the time to generate intermediate code and then assembly code for a register0based machine. With hindsight, if such a choice had been done from the beginning, we could have completed this task. This would have allowed more work on the optimization of the generated code.
- You should rather see this project as an intermediate code generator for the C language and not as a complete compiler.