tutor2.txt



                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           24 July 1988


                   Part II: EXPRESSION PARSING


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


GETTING STARTED

If you've read the introduction document to this series, you will
already know what  we're  about.    You will also have copied the
cradle software  into your Turbo Pascal system, and have compiled
it.  So you should be ready to go.


The purpose of this article is for us to learn  how  to parse and
translate mathematical expressions.  What we would like to see as
output is a series of assembler-language statements  that perform
the desired actions.    For purposes of definition, an expression
is the right-hand side of an equation, as in

               x = 2*y + 3/(4*z)

In the early going, I'll be taking things in _VERY_  small steps.
That's  so  that  the beginners among you won't get totally lost.
There are also  some  very  good  lessons to be learned early on,
that will serve us well later.  For the more experienced readers:
bear with me.  We'll get rolling soon enough.

SINGLE DIGITS

In keeping with the whole theme of this series (KISS, remember?),
let's start with the absolutely most simple case we can think of.
That, to me, is an expression consisting of a single digit.

Before starting to code, make sure you have a  baseline  copy  of
the  "cradle" that I gave last time.  We'll be using it again for
other experiments.  Then add this code:


{---------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;
{---------------------------------------------------------------}


And add the  line  "Expression;"  to  the main program so that it
reads:


{---------------------------------------------------------------}
begin
   Init;
   Expression;
end.
{---------------------------------------------------------------}


Now  run  the  program. Try any single-digit number as input. You
should get a single line of assembler-language output.    Now try
any  other character as input, and you'll  see  that  the  parser
properly reports an error.


CONGRATULATIONS! You have just written a working translator!

OK, I grant you that it's pretty limited. But don't brush  it off
too  lightly.  This little "compiler" does,  on  a  very  limited
scale,  exactly  what  any larger compiler does:    it  correctly
recognizes legal  statements in the input "language" that we have
defined for it, and  it  produces  correct,  executable assembler
code,  suitable  for  assembling  into  object  format.  Just  as
importantly,  it correctly  recognizes  statements  that  are NOT
legal, and gives a  meaningful  error message.  Who could ask for
more?  As we expand our  parser,  we'd better make sure those two
characteristics always hold true.

There  are  some  other  features  of  this  tiny  program  worth
mentioning.    First,  you  can  see that we don't separate  code
generation from parsing ...  as  soon as the parser knows what we
want  done, it generates the object code directly.    In  a  real
compiler, of course, the reads in GetChar would be  from  a  disk
file, and the writes to another  disk  file, but this way is much
easier to deal with while we're experimenting.

Also note that an expression must leave a result somewhere.  I've
chosen the  68000  register  DO.    I  could have made some other
choices, but this one makes sense.


BINARY EXPRESSIONS

Now that we have that under our belt,  let's  branch  out  a bit.
Admittedly, an "expression" consisting of only  one  character is
not going to meet our needs for long, so let's see what we can do
to extend it. Suppose we want to handle expressions of the form:

                         1+2
     or                  4-3
     or, in general, <term> +/- <term>

(That's a bit of Backus-Naur Form, or BNF.)

To do this we need a procedure that recognizes a term  and leaves
its   result   somewhere,  and  another   that   recognizes   and
distinguishes  between   a  '+'  and  a  '-'  and  generates  the
appropriate code.  But if Expression is going to leave its result
in DO, where should Term leave its result?    Answer:    the same
place.  We're  going  to  have  to  save the first result of Term
somewhere before we get the next one.

OK, basically what we want to  do  is have procedure Term do what
Expression was doing before.  So just RENAME procedure Expression
as Term, and enter the following new version of Expression:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   EmitLn('MOVE D0,D1');
   case Look of
    '+': Add;
    '-': Subtract;
   else Expected('Addop');
   end;
end;
{--------------------------------------------------------------}


Next, just above Expression enter these two procedures:


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD D1,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
end;
{-------------------------------------------------------------}


When you're finished with that,  the order of the routines should
be:

 o Term (The OLD Expression)
 o Add
 o Subtract
 o Expression

Now run the program.  Try any combination you can think of of two
single digits,  separated  by  a  '+' or a '-'.  You should get a
series of four assembler-language instructions out  of  each run.
Now  try  some  expressions with deliberate errors in them.  Does
the parser catch the errors?

Take  a  look  at the object  code  generated.    There  are  two
observations we can make.  First, the code generated is  NOT what
we would write ourselves.  The sequence

        MOVE #n,D0
        MOVE D0,D1

is inefficient.  If we were  writing  this code by hand, we would
probably just load the data directly to D1.

There is a  message  here:  code  generated by our parser is less
efficient  than the code we would write by hand.  Get used to it.
That's going to be true throughout this series.  It's true of all
compilers to some extent.  Computer scientists have devoted whole
lifetimes to the issue of code optimization, and there are indeed
things that can be done to improve the quality  of  code  output.
Some compilers do quite well, but  there  is a heavy price to pay
in complexity, and it's  a  losing  battle  anyway ... there will
probably never come a time when  a  good  assembler-language pro-
grammer can't out-program a compiler.    Before  this  session is
over, I'll briefly mention some ways that we can do a  little op-
timization,  just  to  show you that we can indeed improve things
without too much trouble.  But remember, we're here to learn, not
to see how tight we can make  the  object  code.    For  now, and
really throughout  this  series  of  articles,  we'll  studiously
ignore optimization and  concentrate  on  getting  out  code that
works.

Speaking of which: ours DOESN'T!  The code is _WRONG_!  As things
are working  now, the subtraction process subtracts D1 (which has
the FIRST argument in it) from D0 (which has the second).  That's
the wrong way, so we end up with the wrong  sign  for the result.
So let's fix up procedure Subtract with a  sign-changer,  so that
it reads


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB D1,D0');
   EmitLn('NEG D0');
end;
{-------------------------------------------------------------}


Now  our  code  is even less efficient, but at least it gives the
right answer!  Unfortunately, the  rules that give the meaning of
math expressions require that the terms in an expression come out
in an inconvenient  order  for  us.    Again, this is just one of
those facts of life you learn to live with.   This  one will come
back to haunt us when we get to division.

OK,  at this point we have a parser that can recognize the sum or
difference of two digits.    Earlier,  we  could only recognize a
single digit.  But  real  expressions can have either form (or an
infinity of others).  For kicks, go back and run the program with
the single input line '1'.

Didn't work, did it?   And  why  should  it?    We  just finished
telling  our  parser  that the only kinds of expressions that are
legal are those  with  two  terms.    We  must  rewrite procedure
Expression to be a lot more broadminded, and this is where things
start to take the shape of a real parser.


GENERAL EXPRESSIONS

In the  REAL  world,  an  expression  can  consist of one or more
terms, separated  by  "addops"  ('+'  or  '-').   In BNF, this is
written

          <expression> ::= <term> [<addop> <term>]*


We  can  accomodate  this definition of an  expression  with  the
addition of a simple loop to procedure Expression:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,D1');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}


NOW we're getting somewhere!   This version handles any number of
terms, and it only cost us two extra lines of code.  As we go on,
you'll discover that this is characteristic  of  top-down parsers
... it only takes a few lines of code to accomodate extensions to
the  language.    That's  what  makes  our  incremental  approach
possible.  Notice, too, how well the code of procedure Expression
matches the BNF definition.   That, too, is characteristic of the
method.  As you get proficient in the approach, you'll  find that
you can turn BNF into parser code just about as  fast  as you can
type!

OK, compile the new version of our parser, and give it a try.  As
usual,  verify  that  the  "compiler"   can   handle   any  legal
expression,  and  will  give a meaningful error  message  for  an
illegal one.  Neat, eh?  You might note that in our test version,
any error message comes  out  sort of buried in whatever code had
already been  generated. But remember, that's just because we are
using  the  CRT  as  our  "output  file"  for   this   series  of
experiments.  In a production version, the two  outputs  would be
separated ... one to the output file, and one to the screen.


USING THE STACK

At  this  point  I'm going to  violate  my  rule  that  we  don't
introduce any complexity until  it's  absolutely  necessary, long
enough to point out a problem with the code we're generating.  As
things stand now, the parser  uses D0 for the "primary" register,
and D1 as  a place to store the partial sum.  That works fine for
now,  because  as  long as we deal with only the "addops" '+' and
'-', any new term can be added in as soon as it is found.  But in
general that isn't true.  Consider, for example, the expression

               1+(2-(3+(4-5)))

If we put the '1' in D1, where  do  we  put  the  '2'?    Since a
general expression can have any degree of complexity, we're going
to run out of registers fast!

Fortunately,  there's  a  simple  solution.    Like  every modern
microprocessor, the 68000 has a stack, which is the perfect place
to save a variable number of items. So instead of moving the term
in D0 to  D1, let's just push it onto the stack.  For the benefit
of  those unfamiliar with 68000 assembler  language,  a  push  is
written

               -(SP)

and a pop,     (SP)+ .


So let's change the EmitLn in Expression to read:

               EmitLn('MOVE D0,-(SP)');

and the two lines in Add and Subtract to

               EmitLn('ADD (SP)+,D0')

and            EmitLn('SUB (SP)+,D0'),

respectively.  Now try the parser again and make sure  we haven't
broken it.

Once again, the generated code is less efficient than before, but
it's a necessary step, as you'll see.


MULTIPLICATION AND DIVISION

Now let's get down to some REALLY serious business.  As  you  all
know,  there  are  other  math   operators   than   "addops"  ...
expressions can also have  multiply  and  divide operations.  You
also  know  that  there  is  an implied operator  PRECEDENCE,  or
hierarchy, associated with expressions, so that in  an expression
like

                    2 + 3 * 4,

we know that we're supposed to multiply FIRST, then  add.    (See
why we needed the stack?)

In the early days of compiler technology, people used some rather
complex techniques to insure that the  operator  precedence rules
were  obeyed.    It turns out,  though,  that  none  of  this  is
necessary ... the rules can be accommodated quite  nicely  by our
top-down  parsing technique.  Up till now,  the  only  form  that
we've considered for a term is that of a  single  decimal  digit.

More generally, we  can  define  a  term as a PRODUCT of FACTORS;
i.e.,

          <term> ::= <factor>  [ <mulop> <factor ]*

What  is  a factor?  For now, it's what a term used to be  ...  a
single digit.

Notice the symmetry: a  term  has the same form as an expression.
As a matter of fact, we can  add  to  our  parser  with  a little
judicious  copying and renaming.  But  to  avoid  confusion,  the
listing below is the complete set of parsing routines.  (Note the
way we handle the reversal of operands in Divide.)


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Factor;
begin
   EmitLn('MOVE #' + GetNum + ',D0')
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      else Expected('Mulop');
      end;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   Term;
   while Look in ['+', '-'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}


Hot dog!  A NEARLY functional parser/translator, in only 55 lines
of Pascal!  The output is starting to look really useful,  if you
continue to overlook the inefficiency,  which  I  hope  you will.
Remember, we're not trying to produce tight code here.


PARENTHESES

We  can  wrap  up this part of the parser with  the  addition  of
parentheses with  math expressions.  As you know, parentheses are
a  mechanism to force a desired operator  precedence.    So,  for
example, in the expression

               2*(3+4) ,

the parentheses force the addition  before  the  multiply.   Much
more importantly, though, parentheses  give  us  a  mechanism for
defining expressions of any degree of complexity, as in

               (1+2)/((3+4)+(5-6))

The  key  to  incorporating  parentheses  into our parser  is  to
realize that  no matter how complicated an expression enclosed by
parentheses may be,  to  the  rest  of  the world it looks like a
simple factor.  That is, one of the forms for a factor is:

          <factor> ::= (<expression>)

This is where the recursion comes in. An expression can contain a
factor which contains another expression which contains a factor,
etc., ad infinitum.

Complicated or not, we can take care of this by adding just a few
lines of Pascal to procedure Factor:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}


Note again how easily we can extend the parser, and how  well the
Pascal code matches the BNF syntax.

As usual, compile the new version and make sure that it correctly
parses  legal sentences, and flags illegal  ones  with  an  error
message.


UNARY MINUS

At  this  point,  we have a parser that can handle just about any
expression, right?  OK, try this input sentence:

                         -1

WOOPS!  It doesn't work, does it?   Procedure  Expression expects
everything to start with an integer, so it coughs up  the leading
minus  sign.  You'll find that +3 won't  work  either,  nor  will
something like

                    -(3-2) .

There  are  a  couple of ways to fix the problem.    The  easiest
(although not necessarily the best)  way is to stick an imaginary
leading zero in  front  of  expressions  of this type, so that -3
becomes 0-3.  We can easily patch this into our  existing version
of Expression:


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      else Expected('Addop');
      end;
   end;
end;
{--------------------------------------------------------------}


I TOLD you that making changes  was  easy!   This time it cost us
only  three  new lines of Pascal.   Note  the  new  reference  to
function IsAddop.  Since the test for an addop appeared  twice, I
chose  to  embed  it in the new function.  The  form  of  IsAddop
should be apparent from that for IsAlpha.  Here it is:


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;
{--------------------------------------------------------------}


OK, make these changes to the program and recompile.   You should
also include IsAddop in your baseline copy of the cradle.   We'll
be needing  it  again  later.   Now try the input -1 again.  Wow!
The efficiency of the code is  pretty  poor ... six lines of code
just for loading a simple constant ... but at least it's correct.
Remember, we're not trying to replace Turbo Pascal here.

At this point we're just about finished with the structure of our
expression parser.   This version of the program should correctly
parse and compile just about any expression you care to  throw at
it.    It's still limited in that  we  can  only  handle  factors
involving single decimal digits.    But I hope that by now you're
starting  to  get  the  message  that we can  accomodate  further
extensions  with  just  some  minor  changes to the parser.   You
probably won't be  surprised  to  hear  that a variable or even a
function call is just another kind of a factor.

In  the next session, I'll show you just how easy it is to extend
our parser to take care of  these  things too, and I'll also show
you just  how easily we can accomodate multicharacter numbers and
variable names.  So you see,  we're  not  far at all from a truly
useful parser.


A WORD ABOUT OPTIMIZATION

Earlier in this session, I promised to give you some hints  as to
how we can improve the quality of the generated code.  As I said,
the  production of tight code is not the  main  purpose  of  this
series of articles.  But you need to at least know that we aren't
just  wasting our time here ... that we  can  indeed  modify  the
parser further to  make  it produce better code, without throwing
away everything we've done to date.  As usual, it turns  out that
SOME optimization is not that difficult to do ... it simply takes
some extra code in the parser.

There are two basic approaches we can take:

  o Try to fix up the code after it's generated

    This is  the concept of "peephole" optimization.  The general
    idea it that we  know  what  combinations of instructions the
    compiler  is  going  to generate, and we also know which ones
    are pretty bad (such as the code for -1, above).    So all we
    do  is  to   scan   the  produced  code,  looking  for  those
    combinations, and replacing  them  by better ones.  It's sort
    of   a   macro   expansion,   in   reverse,   and   a  fairly
    straightforward  exercise  in   pattern-matching.   The  only
    complication,  really, is that there may be  a  LOT  of  such
    combinations to look for.  It's called  peephole optimization
    simply because it only looks at a small group of instructions
    at a time.  Peephole  optimization can have a dramatic effect
    on  the  quality  of the code,  with  little  change  to  the
    structure of the compiler  itself.   There is a price to pay,
    though,  in  both  the  speed,   size, and complexity of  the
    compiler.  Looking for all those combinations calls for a lot
    of IF tests, each one of which is a source of error.  And, of
    course, it takes time.

     In  the  classical  implementation  of a peephole optimizer,
    it's done as a second pass to the compiler.  The  output code
    is  written  to  disk,  and  then  the  optimizer  reads  and
    processes the disk file again.  As a matter of fact,  you can
    see that the optimizer could  even be a separate PROGRAM from
    the compiler proper.  Since the optimizer only  looks  at the
    code through a  small  "window"  of  instructions  (hence the
    name), a better implementation would be to simply buffer up a
    few lines of output, and scan the buffer after each EmitLn.

  o Try to generate better code in the first place

    This approach calls for us to look for  special  cases BEFORE
    we Emit them.  As a trivial example,  we  should  be  able to
    identify a constant zero,  and  Emit a CLR instead of a load,
    or even do nothing at all, as in an add of zero, for example.
    Closer to home, if we had chosen to recognize the unary minus
    in Factor  instead of in Expression, we could treat constants
    like -1 as ordinary constants,  rather  then  generating them
    from  positive  ones.   None of these things are difficult to
    deal with ... they only add extra tests in the code, which is
    why  I  haven't  included them in our program.  The way I see
    it, once we get to the point that we have a working compiler,
    generating useful code  that  executes, we can always go back
    and tweak the thing to tighten up the code produced.   That's
    why there are Release 2.0's in the world.

There IS one more type  of  optimization  worth  mentioning, that
seems to promise pretty tight code without too much hassle.  It's
my "invention" in the  sense  that I haven't seen it suggested in
print anywhere, though I have  no  illusions  that  it's original
with me.

This  is to avoid such a heavy use of the stack, by making better
use of the CPU registers.  Remember back when we were  doing only
addition  and  subtraction,  that we used registers  D0  and  D1,
rather than the stack?  It worked, because with  only  those  two
operations, the "stack" never needs more than two entries.

Well,  the 68000 has eight data registers.  Why not use them as a
privately managed stack?  The key is to recognize  that,  at  any
point in its processing,  the  parser KNOWS how many items are on
the  stack, so it can indeed manage it properly.  We can define a
private "stack pointer" that keeps  track  of  which  stack level
we're at, and addresses the  corresponding  register.   Procedure
Factor,  for  example,  would  not  cause data to be loaded  into
register  D0,  but   into  whatever  the  current  "top-of-stack"
register happened to be.

What we're doing in effect is to replace the CPU's RAM stack with
a  locally  managed  stack  made  up  of  registers.    For  most
expressions, the stack level  will  never  exceed eight, so we'll
get pretty good code out.  Of course, we also  have  to deal with
those  odd cases where the stack level  DOES  exceed  eight,  but
that's no problem  either.    We  simply let the stack spill over
into the CPU  stack.    For  levels  beyond eight, the code is no
worse  than  what  we're generating now, and for levels less than
eight, it's considerably better.

For the record, I  have  implemented  this  concept, just to make
sure  it  works  before  I  mentioned  it to you.  It does.    In
practice, it turns out that you can't really use all eight levels
... you need at least one register free to  reverse  the  operand
order for division  (sure  wish  the  68000 had an XTHL, like the
8080!).  For expressions  that  include  function calls, we would
also need a register reserved for them. Still, there  is  a  nice
improvement in code size for most expressions.

So, you see, getting  better  code  isn't  that difficult, but it
does add complexity to the our translator ...  complexity  we can
do without at this point.  For that reason,  I  STRONGLY  suggest
that we continue to ignore efficiency issues for the rest of this
series,  secure  in  the knowledge that we can indeed improve the
code quality without throwing away what we've done.

Next lesson, I'll show you how to deal with variables factors and
function calls.  I'll also show you just how easy it is to handle
multicharacter tokens and embedded white space.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************