Parsing Numerical and Address
Expressions in ANS Forth

Version 0.5.5

David N. Williams
MCTP
University of Michigan
Ann Arbor, MI 48109-1120

Revised: April 24, 2009


Copyright © 1996, 2006–2009 by David N. Williams.

This work is licensed under the Creative Commons Attribution-Share Alike 2.5 License.


Contents

1. Introduction
 
2. Backus-Naur grammar notation
 
3. Numerical expression grammar
3.1 Expression and map lists and evaluation algorithm
3.2 Finite state machine
 
4. Address expression grammar
 
5. Backus-Naur Forth templates
5.1 Normal and upto syntax elements
5.2 Basic or forms
5.3 Basic and forms and backtracking
5.4 Upto forms
5.5 Merges
5.6 Template summary

1. Introduction

This document describes in a general way the parsing and evaluation schemes for numerical and address expressions used by an assembly language source to source translator, currently being written in Forth for pfe, with some primitives in C. The source language is that of Martinus Veltman's AAma (Ann Arbor macro assembler) for Motorola 680x0 assembly language. The target language is that of the Apple GNU assembler for the PowerPC, and perhaps also for the IA-32 or Intel 64.

Although it originally motivated this article, much of the discussion is only superficially specific to AAma assembly language. That applies especially to the algorithm for numerical expression evaluation and to the section on Backus-Naur templates in Forth. The AAma numerical and address expression grammars provide examples for two approaches to parsing and evaluation, one using a finite state machine, and the other using automatically generated, backtracking code for Backus-Naur forms.

Earlier (1994-1999), versions of the translator (incomplete then as now) only did recognition of valid expression syntax, with no evaluation. Part of the Forth code for that was automatically generated (July, 1996) from Backus-Naur grammars equivalent to those given below, using code templates similar to those described in this document. A mostly ANS Forth compatible revision of the parser generator, which uses the current version of the templates, is implemented in bnaut.fs. Links to examples of generating and generated code can be found in the parsing expressions section of our Forth strings page.

The current version of the translator does recognition and evaluation together. Numerical expressions evaluate to actual cell-size integers. Address expressions cannot evaluate to actual numerical addresses because the translator is not an assembler; instead they evaluate to an AAma "block signature".

When we started this work, we were aware of three treatments of parsing in Forth, all of exceptional quality:

We have also consulted several standard books on parsing, but we do not claim to have achieved expert status in its theory.

2. Backus-Naur grammar notation

::= <==> "is of the form"
{ } <==> "none or any number of"
{n: } <==> "none or up to n of" (n > 0)
[ ] <==> "none or one of" (same as {1: })
( ) <==> group bnf elements
| <==> or separator for allowed forms

When the order of phrases separated by "|" is logically irrelevant, we write them in the order we expect to be most efficient for recognition. We use single quotes for string literals. An and sequence is represented by stringing elements together, separated by whitespace.

3. Numerical expression grammar

numexpr ::= [termop] num-term {termop num-term}
num-term ::= num-factor {factorop num-factor}
num-factor ::= num | '[' num-expr ']' [func-name] '[' num-expr ']'
num ::= dec# | hex# | num-name | asc# | oct# | bin#

The num-name terminal in this grammar is any name in a Forth wordlist for numerical expression names defined by the AAma pseudo-op, .SET. The remaining terminals are defined by the following subgrammars:

termop ::= '-' | '+'
factorop ::= '*' | '/' | '&' | '!' | '^' | '<<' | '>>'
func-name ::= '.Ieq' | '.Ine' | '.Ilt' | '.Ile' | '.Igt' | '.Imi' | '.Ipl' | '.Asc'
dec# ::= dec-digit {dec-digit}
hex# ::= '0x' hex-digit {hex-digit}
asc# ::= '"' char-byte {char-byte} '"'
oct# ::= '0y' octal-digit {octal-digit}
bin# ::= '0z' binary-digit {binary-digit}

The prefixes 0x, 0y, and 0z are actually taken to be case insensitive. Restrictions on the sizes of the numbers are imposed by the semantics (evaluation).

3.1 Expression and map lists and evaluation algorithm

The finite state machine for numerical evaluation manipulates four list instances for each level of nested brackets. They describe the state of an expression at a given level, including a list for numbers and three lists for binary operators with three precedences.

All four lists are double-ended, so their tails can be appended to, and they can be traversed starting at their heads. That is, they act as queues.

The three operator lists are single-linked, because there is no need to delete individual nodes. The operator nodes contain the execution token of the binary operation and a pointer to the node in the number list containing the leftmost number of the pair on which it operates.

The number list is double-linked, because the algorithm described below deletes the left-number node once an operation is performed.

The code also uses a single-ended, single-linked list of maps, one for each level of expression nesting, which acts as a LIFO stack. The top of the stack, i.e., the list head, is used for the currently active subexpression.

Nodes in the map list include head and tail pointers for the four expression list instances described above, plus the execution array index for the numerical function to be applied to the expression when its level is completed. If the level was started by an actual or implicit, simple left bracket, the function is a no-op. But levels are also started by numerical functions, whose arguments are numerical expressions.

A subexpression can be evaluated in principle when it is well-formed, i.e., when its number and three binary operator lists contain one more number than the total number of operators. It can be evaluated legally only when its level has been properly terminated.

To evaluate a well-formed subexpression, visit the nodes of the first precedence list in order until it is exhausted. At each node, perform the operation on the two numbers it sits between in the expression, the left one at the number node to which the operation node points, and the right one at the number node's successor in the number list. Delete the left-number node and replace the number in its successor by the result of the operation.

Repeat for the remaining operator lists in order of their precedence. At this point the single node remaining in the number list contains the evaluation.

3.2 Finite state machine

The finite state machine for recognition and evaluation of numerical expressions is simple in that all exclusive inputs have a unique action and state transition. Instead of representing the machine by a 2x2 maxtrix, with rows labeled by the state and columns by inputs and transitions, we represent it as a one-dimensional array of states, each of which has a one-dimensional array of actions that combine input recognition, manipulation of expression and map lists, and state transition.

The initial input string is assumed to contain nothing beyond the expression. When the input to an action is recognized at the beginning of the input string, it is parsed away from the next input. The treatment of whitespace is not described here.

Numerical machine skeleton:
0. map-opened: termop >2 num >1 [func]lbrak >0
1. num-added: op >2 eostring >exit rbrak >1
2. op-added: num >1 [func]lbrak >0

The machine works like this:

4. Address expression grammar

AAma imposes a block structure on the code. Except for external global addresses, every address and address expression is bound to a block. An address expression may contain sums and differences of addresses and numbers, as long as it resolves to an unscaled address plus a numerical offset, or, as a special case, simply a numerical offset. We call this block closure.

An external global may appear in an address expression only once, and the expression must resolve to the unscaled global plus a numerical offset.

The translator evaluates an address expression by computing its block signature, i.e., that it closes on a block or numerical offset, or resolves to an external global plus a numerical offset. It does not reduce numerical subexpressions to actual numbers.

Addresses themselves appear only in a linear fashion in address expressions. They are not allowed to be individually scaled by numerical factors, but combinations of simple sums and differences that resolve to a numerical offset may be scaled. To facilitate parsing in this situation, we require brackets when such combinations are to be scaled; and we require that brackets be used only for simple combinations that must resolve to a numerical offset, with no nesting. We also require that numbers appear only as simple numbers or names of numerical expressions defined by the .SET pseudo-op (recognized by the num-name terminal below). Since names for address expressions defined by the .EQU pseudo-op may also occur (recognized by the addr-ref terminal below), the absence of nested brackets is not a semantic restriction.
addr-expr ::= [termop] (addr-monom | addr) {termop (addr-monom | addr)}
addr-monom ::= [num {factorop num} '*'] addr-bracket {factorop num}
addr-bracket ::= '[' simple-addr-expr ']'
simple-addr-expr ::= [termop] addr {termop addr}
addr ::= addr-ref | temp-ref | num | '.'
num ::= dec# | hex# | num-name | asc# | oct# | bin#

The '.' literal in the addr production is the assembler label for the current location. It occurs last because AAma allows leading dots in several types of names recognized by addr-ref and num-name.

Evaluation of an address expression obeying this grammar imposes the further condition that addr-bracket must resolve to a numerical offset, and hence so does addr-monom. Our implementation of evaluation requires an initialization each time the start production addr-expr is called, which we mention because failure to do so was for us a hard bug to find.

5. Backus-Naur Forth templates

This section organizes bnf's into templates suitable for automatic Forth code generation. The scheme is used by the parser generator implemented in bnaut.fs. The templates implemented there are applied to the address expression grammar above in aexpr.fs. The resulting generated code is in aexpr.out.

The discussion in this section is improved from our earlier understanding, logically tightened, but also more discursive because we had to work hard to reunderstand some of the concepts we hadn't written enough about before.

Our notation is that s, or a .s suffix, stands for an ANS Forth string, appearing on the data stack as two cells, (addr len). We use corresponding words sdup, sdrop, and sswap, which are synonyms for 2DUP, 2DROP, and 2SWAP.

Formerly we used nonstandard

  [N]IF...[N]THENIF...{[N]THENIF}...ELSE...THEN
constructions due to Harralson, which are ideally suited to parsing. Here the brackets and braces are Backus-Naur notation. Now we use instead quasi ANS Forth
  COND...[N]IF...{[N]IF}...ELSES...THEN
constructions, where COND is due to Wil Baden, and ELSES is inspired by his THENS. The [N]IF's pass control to the clause between ELSES and THEN at the first failure. If none of the clauses before the last [N]IF fails, the clause just after it also executes, and control skips the ELSES clause and passes to just after THEN.

COND, THENS, and ELSES are quasi ANS Forth because, while their definition uses details of the underlying control-flow stack implementation, they can be defined in a fairly generic way that works on a number of systems. For example, for many systems the control-flow stack is implemented on the data stack, and zero is not used in syntax checks for control-flow syntax. In such cases, Baden defines

  0 constant COND immediate

  : THENS  ( CS: 0 orig_1 ... orig_n -- )
    BEGIN ?dup WHILE postpone THEN REPEAT ; immediate
This works with pfe 0.33.61, whose control flow orig items occupy two cells on the data stack, and with gforth 0.6.2, where they occupy three cells. For those systems, ELSES can be defined as:
  : ELSES  ( CS: 0 orig_1 ... orig_n -- orig )  \ pfe
    postpone AHEAD
    ( CS: orig) 2>r
    BEGIN ?dup WHILE postpone THEN REPEAT
    2r> ( CS: orig) ; immediate

  : ELSES  ( CS: 0 orig_1 ... orig_n -- orig )  \ gforth
    postpone AHEAD
    ( CS: orig) 2>r >r
    BEGIN ?dup WHILE postpone THEN REPEAT
    r> 2r> ( CS: orig) ; immediate

5.1 Normal and upto syntax elements

We distinguish two types of syntax elements, normal and upto. We hope it is not too confusing that we use the term syntax element both for abstract Backus-Naur elements and for Forth words or templates that implement them.

Normal syntax elements have the stack pattern:

  ( s -- after.s true | s false )
Here after.s is usually a proper right substring of s, or the empty string. Normal syntax elements may be combinations of other normal elements, sometimes with upto elements, as long as the combination has the same stack pattern. We call this kind of pattern, where the leading part of the string s in the true case is thrown away, a recognition pattern.

That's opposed to a separation pattern

  ( s -- after.s hit.s true | s false )
where the leading part of s in the true case is kept, hit.s. Separation patterns have no direct role in this discussion; we just mention that we find them convenient where evaluation is involved.

Any participation of a syntax element in the semantics of evaluation is assumed here to occur only as a side effect, and only in the true case. Any hit.s from an intermediate separation pattern is consumed.

We emphasize that the false case is supposed to have no net semantic side effect. The arrangement to accomplish this is called backtracking. In the case of combinations, it entails unwinding any semantic side effects accumulated by elements that were satisfied before the combination failed.

The upto syntax elements associated with a normal syntax element b are {b}, [b], and {n: b}. The normal recognition pattern is not appropriate for them, since failure to recognize b satisfies the upto condition. We could live with an always true recognition pattern:

  ( s -- s' true )
where s' is after.s when b is actually found an allowed number of times, and s when it is not found at all; but that turns out to be suboptimal when rules for the occurrence of upto's are considered. Instead, we use the pattern with true understood:
  ( s -- after.s | s )
Here after.s is the part of s following the satisfaction of b up to the allowed number of times, and s results when b is not satisfied at all.

5.2 Basic or forms

In this section we begin to consider combinations of normal syntax elements b, which we understand to be any elements with the normal syntax pattern. A normal element may be a terminal implemented as a named Forth word with the normal stack pattern, or it may be any of the normal template combinations discussed below. When there are semantic as well as recognition effects, we also demand that normal elements have normal backtracking behavior, discussed further in the next section.

The or sequence is the simplest combination, in that it requires no backtracking beyond that implicit in the normal stack pattern of its elements. The Forth template for Backus-Naur or forms with normal elements bi is

  \ b1 | ... | bn  ( s -- after.s true | s false )
  COND
    b1 NIF
    b2 NIF
    ...
    bn ELSES true THEN
Backtracking is automatic because at most one of the b's will be satisfied, and there is no semantic accumulation from other b's to be undone if the total form is unsatisfied.

5.3 Basic and forms and backtracking

The and sequence requires explicit backtracking, which we implement with the help of three words save-sem (save semantics), 'tis, and 'taint.

In the simplest case, where there is only recognition and no semantic accumulation, they can be defined as:

  : save-sem  ( s -- s s )                  sdup ;
  : 'tis      ( s after.s -- after.s true ) sswap sdrop true ;
  : 'taint    ( s after.s -- s false )      sdrop false ;

When there are semantic side effects, these definitions would occur as factors in more elaborate definitions. The model we have in mind is to do semantic accumulation by pushing items onto one or more semantic stacks, and to use a backtracking stack to record the semantic stack pointers for unwinding the accumulation.

Then save-sem would have the extra function of pushing a copy of the current semantic stack pointers onto the backtracking stack, 'tis would have the extra function of dropping that information from the backtracking stack, leaving the accumulation on the semantic stacks intact, and 'taint would have the extra function of restoring the semantic stack pointers to the values popped from the backtracking stack, thus dropping the semantic accumulation since the last save-sem.

The backtracking stack can also serve to define frames on the semantic stacks, to be replaced by evaluation of subexpressions at the successful completion of bracket levels.

Our policy is to leave the primitive definitions of save-sem, 'tis, and 'taint alone, and to use a text editor to replace the names in generated code with those of words that also handle the semantic and backtracking stacks.

With these words, the Forth template for Backus-Naur and forms with normal elements bi is:

  \ b1 ... bn  ( s -- after.s true | s false )
  COND save-sem
    b1 IF
    ...
    bn IF
    'tis ELSES 'taint THEN
Note that when n = 1, the backtracking logic in this template is redundant, because the output of the lone bi is already backtracked when false. We'll come back to that later.

We emphasize that the net stack effect of both or and and templates is the same as that of a normal element. Furthermore, if save-sem, 'tis, and 'taint properly backtrack the semantic state, semantic side effects occur when the result is true.

If we furthermore assume that the semantic side effects in the true branches of the b's are properly designed, then these templates recurse, in the sense that they may themselves appear wherever b's occur in such templates.

We call and and or templates normal templates because they produce normal elements, and can themselves be used as normal arguments in templates.

5.4 Upto forms

Here are Forth templates for the three bnf upto's for a normal syntax element b, with the flagless stack effect mentioned earlier:
  \ {b}     ( s -- after.s | s )
  BEGIN b NUNTIL

  \ {n: b}  ( s -- after.s | s )
  n 0 DO b drop LOOP

  \ [b]     ( s -- after.s | s )
  b drop
We use u labels for any of these, whether considered as abstract upto elements or as Forth templates.

Any element following a u element in an or sequence would be ignored, because an upto is always satisfied; so we propose no template that or's upto elements with subsequent normal or upto elements.

Upto templates can be turned into normal templates by simply appending true:

  u true

Here u can also be an upto merge, discussed in the next section.

5.5 Merges

And subsequences of u's have a nice property: a simple code sequence of Forth u templates represents both the recognition and semantic side effects of the corresponding bnf and sequence, and has a combined stack effect of the same form as that of a single element:

  ( s -- after.s | s )

So now let's use u labels, not just for individual upto elements or templates, but also for and sequences of upto elements or code sequences of their templates. We call such a sequence an upto merge.

A Forth code sequence

  u1 ... un b
of uptos and a trailing normal element almost has the stack effect of a normal and template. The stack effect is wrong when the normal element is not satisfied, if any of the upto elements produces parsing motion; and any backtracking done when the b is not satisfied leaves behind any semantics accumulated by the u's. We call such a sequence a seminormal merge.

Seminormal merges work fine in the basic and template, because its backtracking logic undoes the semantic accumulation of previous elements, restoring the semantic state to that at the beginning of the entire sequence.

To help keep track of when seminormal merges are legal in templates, we introduce w labels to represent arguments that can be either normal elements or seminormal merges. In the template summary section, we rewrite the basic and template with w arguments, which covers all cases except those with a trailing upto argument.

To cover that, we add a normal and template with an explicit upto argument:

  \ w1 ... wn u  ( s -- after.s true | s false )
  COND save-sem
    w1 IF
    ...
    wn IF
    u 'tis ELSES 'taint THEN
This can be simplified when n = 1 and the w argument is really a normal element:
  \ b u  ( s -- after.s true | s false )
    b IF
    u true ELSE false THEN

5.6 Template summary

Except for one simplification noted below, we list here all the templates defined so far, expressed in terms of normal b arguments, upto u arguments, and normal or seminormal w arguments. A normal argument is any normal template, an upto argument is any merge of upto templates, and a seminormal argument is any merge of upto templates with a trailing normal template.

Recursion ultimately defines everything in terms of basic, normal elements, which are to have normal stack patterns and semantic side effects.

Normal templates: ( s -- after.s true | s false )

  1. A Forth word with appropriate syntax recognition and semantic backtracking.

  2. An or template built from two or more normal templates:
      \ b1 | ... | bn
      COND
        b1     NIF
        ...
        b(n-1) NIF
        bn     ELSES true THEN
    

  3. An and template built from one or more seminormal arguments:
      \ w1 ... wn
      COND save-sem
        w1 IF
        ...
        wn IF
        'tis ELSES 'taint THEN
    

    For n = 1, this template should be used only with a seminormal argument, which needs the extra backtracking.

  4. An upto template followed by true:
      \ u
      u true
    

  5. An and template built from one or more normal templates and a trailing upto template:
      \ w1 ... wn u
      COND save-sem
        w1 IF
        ...
        wn IF
        u 'tis ELSES 'taint THEN
    

    Simplification for n = 1 and normal w:

      \ b u
        b IF
        u true ELSE false THEN
    

Upto templates: ( s -- after.s | s ) (true implicit for both branches)
  1. A template for any number of repetitions of a normal element, including none:
      \ {b}
      BEGIN b NUNTIL
    

  2. A template for up to n > 0 repetitions of a normal element, including none:
      \ {n: b}
      n 0 DO b drop LOOP
    

  3. A template for up to one normal element, including none:
      \ [b]
      b drop
    

  4. An upto merge:
      \ u1 ... un
      u1 ... un