% Introduction
% intro.tex
% David.N.Williams@umich.edu
% Last revision:  July 15, 2000

\section{Introduction}\label{intro}

Not only does Forth lend itself to a variety of structure schemes, but
simple schemes can be very effective for the application at hand.  The
elegant implementation in Anton Ertl's Gray parser generator
\cite{aertl:94} is a good example of this occasional approach, where
the implementation is precisely suited to its use.\footnote%
{According to Ertl, inspired by earlier work possibly by John Hayes.}
More elaborate schemes, such as that in the Forth Scientific Library
(FSL) project \cite{scarter:95} and those in this document, can be
both ANS Forth portable and reasonably efficient at runtime.

Taken as a whole, the Structure word set in this document is not
simple.  It is designed to be compatible with C structures, with the
aim of supporting the use of C/POSIX libraries.  Actually, the
discussion is more elaborate than the code, and the code would be a
good deal simpler if it did not cover bit-fields.  If stripped to
Peters' elegant core design, it becomes fairly simple and effective
for Forth-centric applications.

We attempt to control the complications a bit in the implementation 
file {\tt cstruct.fs} by dividing it into sections, with flags for 
conditional compilation of various features.

Being compatible with C structures does not mean that the syntax has
to be derived from C, but it does fix the layout of structure data. 
For example, the POSIX standard function {\tt localtime()} converts
the output of the POSIX time function into year, month, day, hours,
minutes, etc., and actually stores into a {\tt struct tm}.  If this
function is called from Forth (glue for that is not discussed here),
the Forth structure words have to know how to build and access a
standard {\tt struct tm}.

This is a bit of a problem, because standard C programs
\cite[6.5.2.1]{ansic:90} are not allowed to assume much about the
packing or alignment of the memory layout of structures, although the
ordering of structure elements in memory has to be the same as in the
structure declaration \cite[6.5.2.1]{ansic:90}, \cite[p.~213]{k&r:88}.%
\footnote{The ordering of structure elements in particular structure
{\em declarations} in standard C/POSIX libraries is however {\em not}
prescribed \cite[p.~551]{dlevine:91}; a POSIX standard program is not
supposed to declare the structures in the standard library
independently, but is expected to get them directly or indirectly from
the system header files.}

The approach we attempt for layout is based on that of the GNU CC
compiler \cite{rstallman:94}, which uses implementation-dependent
macro expressions and constants to parametrize the underlying system. 
As far as we can see, even the major system not (yet) covered by GNU,
namely Mac OS, can be handled this way.  The GNU source indicates that
their scheme also covers other languages.

Although we intend to cover all standard kinds of C structure fields
for the GNU machines, our attempt is not quite comprehensive.  For
example, we allow bit-fields and arrays as structure elements, but we
don't include any data access words for them.  But mainly, we can't
claim a comprehensive understanding of the GNU structure layouts,
which are not definitively documented in any one place, as far as we
know.  The GNU source is a complex body of knowledge, and even with a
lot of embedded documentation, is not exactly your normal bedtime
read.  Although somewhat C literate, we are not C expert.  We hope we
have understood the essential part of it.  Our fundamental
understanding is that, with the exception of bit-fields, any structure
layout can be reproduced by specifying a minimum structure alignment,
a structure size minimum roundup, and the sizes and alignments of all
basic data types ({\tt char}, {\tt short}, {\tt int}, {\tt long},
\xldots), plus pointer types, which we call ``atomic'' data types.%
\footnote{We use the word ``alignment'' to mean a number of address
units of which the address of a data block in memory is a multiple,
like 1, 2, 4, 8, \xldots In standard C \cite[3.1,3.4]{ansic:90}, bytes
are not necessarily 8 bits, but as far as alignment is concerned are
effectively address units.  We always mean 8 bits when we say
``byte''.  The number of bits in an address unit is system dependent.}

We state the algorithm we actually follow in Section~\ref{layout}.

Except for bit-fields, the system interface is in the file {\tt
machine.fs}.  The version we provide works for the Amiga 3000
(Motorola 68030), NeXT (Motorola 68040), and Macintosh (Motorola
680x0).  What we aim at there is a sufficient parametrization, not a
direct mapping of the GNU macros and constants for system parameters.

After some testing with a GNU CC compiler on a NeXT with results that
surprised us, we decided to take seriously the warnings in Kernighan
and Ritchie \cite[pp.~150, 213]{k&r:88} that bit-fields are especially
implementation dependent in standard C. They should be rare, but do
occur in several Berkeley UNIX header files.  Although there seem to
be none among the required structure elements in POSIX structures,
there also seems to be no guarantee that they do not occur legally as
nonrequired elements in particular implementations of required
structures.

Our approach is to make no attempt to map the GNU parameters that affect
bit-fields, such as \verb|PCC_BIT_FIELD_TYPE_MATTERS|, but to make Forth
bit-fields rich enough to reproduce any possibility, and leave it up to
the user to test what a particular system does and supply the appropriate
Forth syntax.  In other words, for bit-fields the Forth syntax may vary
with the underlying system, while for ``normal'' structure elements the
syntax is system independent.

The implementation here addresses several issues:

\paragraph{\em Nesting of structures and unions to any depth.\ }
This is not uncommon in Forth (at least for structures), e.g., the FSL
implementation \cite{scarter:95} does it as does the Forth-83
implementation in Dick Pountain's book on object-oriented Forth
\cite{dpountain:87}.

\paragraph{\em Orthogonality of field names.\ }
I.e., allowed use of identical names in different structures.

For these two issues, we adopt the solution in Randolph Peters'
implementation in Pocket Forth \cite{rpeters:93}.  We translate that
part of his scheme to ANS Forth, with only cosmetic modifications. 
Nesting of structures is essential for C layout compatibility. 
Orthogonality of field names is not; but it's nice to have, and should
make it easier to translate between C and Forth structure definitions. 
Peters' is the only Forth implementation of reusable field names that
we know about.

We do not implement independence of field names from other Forth names
that might be found first in the search order.  That would not be hard
to do, say, by searching only a tokens word list when a field name is
wanted.

\paragraph{\em Field typing.\ }
We elaborate aspects of Peters' scheme and the typing scheme in the
FSL \cite{scarter:95} implementation, which are similar in spirit, and
both of which lend themselves to alignment.  Peters includes
unstructured data fields of any size and substructures, and the FSL
also has {\tt integer:}, {\tt float:}, and {\tt array:} fields, plus
unions.  Besides these, we include bit-fields and many of the scalar C
types, which we call atomic types.  We do not make the signed/unsigned
distinction, leaving that to the user; and we include only a single
pointer type, which is taken to have the same size and alignment as
{\tt void*} (also the same as {\tt char*}) in standard C
\cite[6.1.2.5]{ansic:90}.  The focus for us is on sizes and
alignments.  Having a single generic pointer type should be sufficient
in that respect for many systems.  For systems where that isn't true,
we make it possible for the user to define his own atomic types, with
their own sizes and alignments.

\paragraph{\em Binding of structure data references.\ }
The FSL scheme pays attention to this issue, the point being that a
straightforward elaboration of the syntax for variables to access data
in structure instances means extra overhead, both in code size and
execution time.  And that on the other hand structure and substructure
field names are fixed, so the overhead can be avoided in a word
definition where the structure instance is known at compile time by
precomputing and compiling the field address (early binding).  Dick
Pountain [9] discusses that, too.

Since this is likely to be a common situation, we provide some
explicit early binding operators in a Structure Extensions word set,
including {\tt ]@}, {\tt ]!}, {\tt ]c@}, {\tt ]c!}, and a few others. 
This kind of peephole optimization is a minor issue for ordinary
variables, and a portable implementation is unlikely to make them more
efficient.  If written in assembly language, such words could be
implemented for explicit optimization of ordinary variables as well. 
For structures, even a portable optimization can be significant.

Otherwise the binding is late, corresponding to the direct
generalization of normal syntax for variables.  For array fields one
might want to mix the two, binding the address of an array field
early, and an index into the array late.

\paragraph{\em Data reference syntax.\ }
There are two basic syntax choices for referring to data in a
structure or union instance, where the field names are mentioned
before or after the instance name, and also two basic orderings for
the names in each.  Peters puts the field names before, ordered from
deeper structure nesting on the left to shallower on the right, which
makes a natural chain, including the parent structure itself furthest
to the right.  That is expressive for his examples, (reproduced
later), and is also natural for implementing his substructure scheme. 
The FSL scheme follows the C style of putting field names after the
instance name, ordered towards deeper on the right, with the parent
structure furthest to the left, again a natural chain including the
parent.  We reject the other two, ``unnatural'' orderings, somewhat
arbitrarily, because syntactic conventions could probably be
discovered that would make them seem natural.

Surely neither of the remaining ``natural orderings'' optimizes
expressiveness in all situations, so we cop out and do both.  We do
adopt the field prefix mode for the more primitive words in our list. 
This interacts with the next issue.

\paragraph{\em Action of named structure instances.\ }
Should the execution of a named structure instance simply leave an
instance pointer, or should it resolve nested field names and leave a
field address and/or size, or fetch a value according to its data
type, or perform a method, or what?  A general purpose scheme should
at least not hinder any of these possibilities.  Clearly whether field
names precede or follow the naming of an instance interacts with this
question.

We provide a named-instance defining word {\tt typeof} intended for
use with {\tt DOES>} to make defining words with various actions.  We
also define two generic words of that sort, \verb|{}structof| and
\verb|structof{}|, designed for use with address and data access
operations to be described later.

\clearemptydoublepage