Skip to content

Latest commit

 

History

History
235 lines (175 loc) · 7.26 KB

fbcc.md

File metadata and controls

235 lines (175 loc) · 7.26 KB

Construction of a C Compiler: Fabrice Bellard

Goal of the project: make the compiler able to compile itself to demonstrate that the compiler actually works.

  • The choices
  • The global architecture
  • The C compiler
  • The assembler
  • The virtual machine
  • The results

The Choices

We considered the compilation of two types of language:

  • CAML
  • C

We chose C for personal taste and to test the compiler on a wide range of sources.

The big choices:

  • No preprocessor: we use gcc -E
  • No floats
  • No error recovery
  • Use flex (lex) and bison (yacc) for the parser. So the compiler must be able to accept the code generated by these 2 utilities.
  • Management of separate modules, so a link editing system is necessary.
  • As far as possible, we stay in compliance with ANSI C.
  • Using a stack machine. Simple, but prevents the evolution of the compiler.
  • On the question of performance, we are dependent on the hardware: pointer size, data alignment, and its endianness.
  • We finally impose the constraint of depending on a minimum set of functions in the standard libraries: we recompile part of the code for stdlib.h, stdio.h, stdarg.h and string.h.

General Architecture

Compiler C (fbcc): (4700 lines)

  • Lexer
  • Parser
  • Declaration Management
  • Data production for static variables
  • Typing expressions
  • Propagation of constants
  • Assembly code generation

Assembler (fbas): tiny (700 lines)

  • Lexer - Parser
  • Management of exported, global, or private symbols
  • Generations of a relocatable executable with relocation table

Virtual Machine (fbvm): (900 lines)

  • Loading the executable and relocation
  • Emulation of instructions
  • Calling some functions of the standard C library thanks to traps

Libraries (startup.s, fblib.c): (900 lines)

  • Code for startup
  • Compile functions printf and stdarg.h
  • Compiling the functions of string.h

The Compiler

ANSI features not implemented:

  • Types long, float and double
  • const and volatile type qualifiers
  • Structure assignments, value passing and structure return
  • Characters of type wchar_t
  • Parsing constants of type unsigned int
  • Some forms of static initializations
  • Some type checks in expressions and control structures.

Examples of implemented features:

  • Old and new feature prototypes, with ... to indicate a function with a number of variable parameters
  • Blocks in functions
  • Multidimensional tables, struct, union, enum, pointers on functions
  • Complete management of typedef
  • All control instructions, including goto
  • Separate symbol tables for struct tags union, and enum, the structure fields, the labels of goto
  • Static and Dynamic Initializations Allowed
  • Typing fully ANSI compliant for almost all operators

Implementation Details

  • There is no intermediate representation for code other than that of expressions. The compiler could be seen as a first step to an intermediate code generator
  • All symbol tables are managed with hash tables
  • Virtually all data is stored in lists. This simplifies memory management and avoids having to manage a large number of data structures.
  • The stack model is the same as the standard C model. We use a dynamic link and we also save the size of arguments passed in parameters.

The syntax of declarations requires the construction of an intermediate representation for types. We use almost the same representation for the definitive type.

Types

type: == (base_type)
        | (TYPE_POINTER) + type
        | (TYPE_ARRAY dim) + type
        | (TYPE_STRUCT sym) | (TYPE_UNION sym) | (TYPE_ENUM sym)
        | (TYPE_FUNC func_type var_list) + type

base_type: == TYPE_CHAR | TYPE_UCHAR
             | TYPE_SHORT | TYPE_USHORT
             | TYPE_INT | TYPE_UINT

func_type: == FUNC_ELLIPSIS | FUNC_OLD | FUNC_NEW

var_list: == var1 + ... + varN

var: == ((name) var_storage type var_init)

var_storage: == STORAGE_DEFAULT | STORAGE_AUTO | STORAGE_REGISTER
               | STORAGE_STATIC | STORAGE_EXTERN

var_init: == (INIT_EXPR expr) | (INIT_LIST var_init1 ... var_initN)

The Symbol Table

var_location: == VAR_STACK | VAR_DATA

sym_var: == (SYM_VAR var_storage type (var_location var_offset))

sym_field_struct: == (offset type)

sym_typedef: == (SYM_TYPEDEF type)

sym_struct: == (TYPE_STRUCT -1) / *if not defined */
              | (TYPE_STRUCT symbol_table size align)

sym_enum_const: == (SYM_ENUM_CONST val)

Expressions

expr: == (type tag expr1 ... exprN)

expr_ident: == (type EXPR_IDENT sym)

expr_call: == (type EXPR_CALL expr_func n param1 paramN)

expr_int: == (type EXPR_INT n)

expr_str: == (type EXPR_STR str1 ... strN)

cast_expr: == (type EXPR_CAST expr)

etc.

The Assembler

  • Generate code and data in 2 segments .text and .data
  • Data: .byte num, .short num, .int expr, .align num, .zero num
  • Tags: .equ sym, num, sym:, .globl sym
  • Expressions of the form: sym, num, sym + num
  • Start of a new module: .module

For simplicity, the assembler is used as a linker. A symbol can be external (not defined at present), global (thanks to the directive .globl) or private.

The .module directive clears all private symbols.

The generated executable contains a relocation table because we want the pointers of the virtual machine to be compatible with real pointers.

The Virtual Machine

It is a stack-based machine. The fbvmspec.h file gives the relevant information on the architecture (endianness, alignment, size of the base types , the stack model). It has been assumed that the size of pointers are the same size as int.

The VM Instructions

reading mem:      ld_b, ld_ub, ld_w, ld_uw, ld_i
mem writing:      st_b, st_w, st_i
arithmetic:       add_i, sub_i, mul_i, mul_ui, div_i,
                  div_ui, mod_i, mod_ui, NEG_I
comparisons:      cmplt_i, cmple_i, cmpge_i, cmpgt_i, cmpeq_i,
	          cmpne_i, cmplt_ui, cmple_ui, cmpge_ui, cmpgt_ui
logical:          and_i, or_i, xor_i, not_i,
                  shl_i, SHR_I, shr_ui
conversions:      cvt_i_b, cvt_i_ub, cvt_i_w, cvt_i_uw
constants:        li_i n, libp_i n
jumps:            jeq_i n, jne_i n, switch_i, jmp n
functions:        jsr n, rts
stack management: dup, pop, addsp n
system:           libcall n

Conclusion

  • The compiler compiles itself, and the compiled version recompiles itself giving the same code. The speed is reasonable. To get to this point, it required not just the working compiler, but also the creation of the whole compilation and execution environment.
  • Types in C are extremely complex. Surprisingly, the compiler for such a complex langauge can be quite simple.
  • The choice of a stack-based virtual machine was not optimal. We did not think we had the time to generate intermediate code and then assembly code for a register0based machine. With hindsight, if such a choice had been done from the beginning, we could have completed this task. This would have allowed more work on the optimization of the generated code.
  • You should rather see this project as an intermediate code generator for the C language and not as a complete compiler.