% Introduction % intro.tex % David.N.Williams@umich.edu % Last revision: July 15, 2000 \section{Introduction}\label{intro} Not only does Forth lend itself to a variety of structure schemes, but simple schemes can be very effective for the application at hand. The elegant implementation in Anton Ertl's Gray parser generator \cite{aertl:94} is a good example of this occasional approach, where the implementation is precisely suited to its use.\footnote% {According to Ertl, inspired by earlier work possibly by John Hayes.} More elaborate schemes, such as that in the Forth Scientific Library (FSL) project \cite{scarter:95} and those in this document, can be both ANS Forth portable and reasonably efficient at runtime. Taken as a whole, the Structure word set in this document is not simple. It is designed to be compatible with C structures, with the aim of supporting the use of C/POSIX libraries. Actually, the discussion is more elaborate than the code, and the code would be a good deal simpler if it did not cover bit-fields. If stripped to Peters' elegant core design, it becomes fairly simple and effective for Forth-centric applications. We attempt to control the complications a bit in the implementation file {\tt cstruct.fs} by dividing it into sections, with flags for conditional compilation of various features. Being compatible with C structures does not mean that the syntax has to be derived from C, but it does fix the layout of structure data. For example, the POSIX standard function {\tt localtime()} converts the output of the POSIX time function into year, month, day, hours, minutes, etc., and actually stores into a {\tt struct tm}. If this function is called from Forth (glue for that is not discussed here), the Forth structure words have to know how to build and access a standard {\tt struct tm}. This is a bit of a problem, because standard C programs \cite[6.5.2.1]{ansic:90} are not allowed to assume much about the packing or alignment of the memory layout of structures, although the ordering of structure elements in memory has to be the same as in the structure declaration \cite[6.5.2.1]{ansic:90}, \cite[p.~213]{k&r:88}.% \footnote{The ordering of structure elements in particular structure {\em declarations} in standard C/POSIX libraries is however {\em not} prescribed \cite[p.~551]{dlevine:91}; a POSIX standard program is not supposed to declare the structures in the standard library independently, but is expected to get them directly or indirectly from the system header files.} The approach we attempt for layout is based on that of the GNU CC compiler \cite{rstallman:94}, which uses implementation-dependent macro expressions and constants to parametrize the underlying system. As far as we can see, even the major system not (yet) covered by GNU, namely Mac OS, can be handled this way. The GNU source indicates that their scheme also covers other languages. Although we intend to cover all standard kinds of C structure fields for the GNU machines, our attempt is not quite comprehensive. For example, we allow bit-fields and arrays as structure elements, but we don't include any data access words for them. But mainly, we can't claim a comprehensive understanding of the GNU structure layouts, which are not definitively documented in any one place, as far as we know. The GNU source is a complex body of knowledge, and even with a lot of embedded documentation, is not exactly your normal bedtime read. Although somewhat C literate, we are not C expert. We hope we have understood the essential part of it. Our fundamental understanding is that, with the exception of bit-fields, any structure layout can be reproduced by specifying a minimum structure alignment, a structure size minimum roundup, and the sizes and alignments of all basic data types ({\tt char}, {\tt short}, {\tt int}, {\tt long}, \xldots), plus pointer types, which we call ``atomic'' data types.% \footnote{We use the word ``alignment'' to mean a number of address units of which the address of a data block in memory is a multiple, like 1, 2, 4, 8, \xldots In standard C \cite[3.1,3.4]{ansic:90}, bytes are not necessarily 8 bits, but as far as alignment is concerned are effectively address units. We always mean 8 bits when we say ``byte''. The number of bits in an address unit is system dependent.} We state the algorithm we actually follow in Section~\ref{layout}. Except for bit-fields, the system interface is in the file {\tt machine.fs}. The version we provide works for the Amiga 3000 (Motorola 68030), NeXT (Motorola 68040), and Macintosh (Motorola 680x0). What we aim at there is a sufficient parametrization, not a direct mapping of the GNU macros and constants for system parameters. After some testing with a GNU CC compiler on a NeXT with results that surprised us, we decided to take seriously the warnings in Kernighan and Ritchie \cite[pp.~150, 213]{k&r:88} that bit-fields are especially implementation dependent in standard C. They should be rare, but do occur in several Berkeley UNIX header files. Although there seem to be none among the required structure elements in POSIX structures, there also seems to be no guarantee that they do not occur legally as nonrequired elements in particular implementations of required structures. Our approach is to make no attempt to map the GNU parameters that affect bit-fields, such as \verb|PCC_BIT_FIELD_TYPE_MATTERS|, but to make Forth bit-fields rich enough to reproduce any possibility, and leave it up to the user to test what a particular system does and supply the appropriate Forth syntax. In other words, for bit-fields the Forth syntax may vary with the underlying system, while for ``normal'' structure elements the syntax is system independent. The implementation here addresses several issues: \paragraph{\em Nesting of structures and unions to any depth.\ } This is not uncommon in Forth (at least for structures), e.g., the FSL implementation \cite{scarter:95} does it as does the Forth-83 implementation in Dick Pountain's book on object-oriented Forth \cite{dpountain:87}. \paragraph{\em Orthogonality of field names.\ } I.e., allowed use of identical names in different structures. For these two issues, we adopt the solution in Randolph Peters' implementation in Pocket Forth \cite{rpeters:93}. We translate that part of his scheme to ANS Forth, with only cosmetic modifications. Nesting of structures is essential for C layout compatibility. Orthogonality of field names is not; but it's nice to have, and should make it easier to translate between C and Forth structure definitions. Peters' is the only Forth implementation of reusable field names that we know about. We do not implement independence of field names from other Forth names that might be found first in the search order. That would not be hard to do, say, by searching only a tokens word list when a field name is wanted. \paragraph{\em Field typing.\ } We elaborate aspects of Peters' scheme and the typing scheme in the FSL \cite{scarter:95} implementation, which are similar in spirit, and both of which lend themselves to alignment. Peters includes unstructured data fields of any size and substructures, and the FSL also has {\tt integer:}, {\tt float:}, and {\tt array:} fields, plus unions. Besides these, we include bit-fields and many of the scalar C types, which we call atomic types. We do not make the signed/unsigned distinction, leaving that to the user; and we include only a single pointer type, which is taken to have the same size and alignment as {\tt void*} (also the same as {\tt char*}) in standard C \cite[6.1.2.5]{ansic:90}. The focus for us is on sizes and alignments. Having a single generic pointer type should be sufficient in that respect for many systems. For systems where that isn't true, we make it possible for the user to define his own atomic types, with their own sizes and alignments. \paragraph{\em Binding of structure data references.\ } The FSL scheme pays attention to this issue, the point being that a straightforward elaboration of the syntax for variables to access data in structure instances means extra overhead, both in code size and execution time. And that on the other hand structure and substructure field names are fixed, so the overhead can be avoided in a word definition where the structure instance is known at compile time by precomputing and compiling the field address (early binding). Dick Pountain [9] discusses that, too. Since this is likely to be a common situation, we provide some explicit early binding operators in a Structure Extensions word set, including {\tt ]@}, {\tt ]!}, {\tt ]c@}, {\tt ]c!}, and a few others. This kind of peephole optimization is a minor issue for ordinary variables, and a portable implementation is unlikely to make them more efficient. If written in assembly language, such words could be implemented for explicit optimization of ordinary variables as well. For structures, even a portable optimization can be significant. Otherwise the binding is late, corresponding to the direct generalization of normal syntax for variables. For array fields one might want to mix the two, binding the address of an array field early, and an index into the array late. \paragraph{\em Data reference syntax.\ } There are two basic syntax choices for referring to data in a structure or union instance, where the field names are mentioned before or after the instance name, and also two basic orderings for the names in each. Peters puts the field names before, ordered from deeper structure nesting on the left to shallower on the right, which makes a natural chain, including the parent structure itself furthest to the right. That is expressive for his examples, (reproduced later), and is also natural for implementing his substructure scheme. The FSL scheme follows the C style of putting field names after the instance name, ordered towards deeper on the right, with the parent structure furthest to the left, again a natural chain including the parent. We reject the other two, ``unnatural'' orderings, somewhat arbitrarily, because syntactic conventions could probably be discovered that would make them seem natural. Surely neither of the remaining ``natural orderings'' optimizes expressiveness in all situations, so we cop out and do both. We do adopt the field prefix mode for the more primitive words in our list. This interacts with the next issue. \paragraph{\em Action of named structure instances.\ } Should the execution of a named structure instance simply leave an instance pointer, or should it resolve nested field names and leave a field address and/or size, or fetch a value according to its data type, or perform a method, or what? A general purpose scheme should at least not hinder any of these possibilities. Clearly whether field names precede or follow the naming of an instance interacts with this question. We provide a named-instance defining word {\tt typeof} intended for use with {\tt DOES>} to make defining words with various actions. We also define two generic words of that sort, \verb|{}structof| and \verb|structof{}|, designed for use with address and data access operations to be described later. \clearemptydoublepage