A Model for Dynamic Strings

David N. Williams
MCTP
University of Michigan
Ann Arbor, MI 48109-1120

Last revision: April 27, 2010

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation, with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license may be found at http://www.gnu.org/copyleft/fdl.html

Abstract

We describe the concepts of a model for dynamic strings in Forth. It includes string stacks, string stack frames and frame stacks, and string spaces with (relatively) fast garbage collection.

1. Introduction
2. General Architecture
3. String Space Header
4. Garbage Collection
5. Basic Operations on Strings
6. String Frames
7. Multiple String Spaces
8. References

1. Introduction

The price of simplicity and transparency in the use of dynamic strings is a certain complexity in the underlying structure. In principle the user is not supposed to have to know the details, but as usual it helps to understand them. We aim to provide such a guide here.

In this article we discuss the architecture of our dynamic strings package for Forth.¹ Specifications for the Dynamic-Strings word list can be found in a separate document [1].

The model uses a string stack in a string buffer with garbage collection. We describe the layout in Sections 2 and 3. Another approach we've heard of that also includes garbage collection is based on elegant LISP-like list processing algorithms, manipulating pointers to string fragments. We learned about that in a proposal by Rj Brown [2] to improve on the fstrings Forth-83 package by George Hawkins [3]. If we're not misinterpreting a discussion by Marcus Gabriel [4] describing his adaptation of fstrings to Multi-Forth [5], Rj Brown's proposal goes back to Donald Knuth [6]. Maybe our more pedestrian scheme is also more straightforward and easier to think about.

The model is intended to work with, not replace, ANS Forth strings, represented by address, length pairs on the data stack. That representation is one of the strong features of ANS Forth. The two are roughly complementary in the following way: ANS Forth strings are especially good for analysis and parsing of strings and substrings, while dynamic strings are especially good for putting strings together and keeping them available while they're needed, then reclaiming their memory when they're not.

ANS Forth strings may or may not have a counted string representation in memory. In the model we use a variant of counted strings called measured strings, where the size of the count field is implementation defined.

Section 4 covers our "relatively fast" algorithm for garbage collection. Based on the terminology in Paul Wilson's [7] garbage collection survey, our method appears to be a variant of mark-compact. It is pretty efficient for several reasons:

Our situation is simpler than that of garbage collection in general, because strings contain no references to the locations of other strings.
Marking, whether of garbage or of live, dynamic strings, is efficient because live strings have references in at most one variable and/or on the string stack. The only scan required is that of the string stack, which is short in normal Forth practise. And the scan occurs only when a dynamic reference is overwritten in a variable, which is further minimized by the normal Forth practise of using few variables.
Compacting is efficient because only the (short) string stack has to be scanned to update references to live, dynamic strings that move.
External (nondynamic) strings do not participate directly in the marking or compacting process at all, but can still be manipulated on the string stack. They do participate indirectly because pushing their references onto the string stack can trigger a garbage collection.

In Section 5, we describe the operations of pushing, popping, copying, and concatenation for strings in the context of garbage collection.

String stack frames and macros with arguments are briefly discussed in Section 6. Wil Baden has an especially pleasing :-) implementation of macros with arguments in Standard Forth [8]. See also the string words in his tool belt [9].

Finally, Section 7 suggests a possible use for multiple string spaces.

¹The model was developed in our personal Forth implementation, called ^Forth ("hat-forth"), with Z80/CPM and Motorola 680x0 (NeXT and Macintosh) assembly language versions. The string package was coded for both cpu's, with the Z80 version last revised in August, 1985, and the Motorola version last revised in June, 1999. We have switched to gcc-based Forth's, and the current version of the string package (0.7.2) is coded in C as a pfe loadable module, and in ANS Forth as well.

2. General Architecture

The following assumes whatever alignments are appropriate for the host system. Alignments for strings could in principle be different from the alignment of cells in the string stack. We call the latter cell alignment or stack alignment, which is not necessarily a multiple of cell size. A dynamic string itself has three fields, to be described later, and our implementation does assume no alignment gaps between these fields.

1. String spaces are contiguous regions of memory organized from low to high memory as follows:

header a string space header structure instance

frame stack a short stack for string frame entries

string buffer holds dynamic strings that build up towards high memory and a string stack that builds down

The string space header structure and the structure of the frame stack are described in Section 3.

The string buffer is managed by several pointers in the header, especially the string break, which is the first stack-aligned address following the dynamic strings. The string stack pointer is² the address of the most accessible stack entry. When the string break is the same as the string stack pointer, the string buffer is full; and garbage collection must occur before anything more can be put into it.

The string stack is a last-in-first-out (LIFO) stack. The top of the stack is the most accessible entry, with the lowest memory address. The bottom of the stack is fixed, at the high end of the string space in terms of memory addresses. The top moves down and up in memory, between the break and the bottom in high memory. The break itself may move up and down as more dynamic strings are created or as garbage is collected.

2. Strings, or measured strings, consist of two fields (plus possible pad bytes):

count field = n at least 1 byte, not more than 1 cell

body field n bytes, e.g., n < = 255, n < = 64K-1, or n < = 4,096M-1

possible pad to align null bytes

We call the measured string address the MSA; it is the same as the count field address. The contents of the count and body fields are called the string value.

The empty string has the string value consisting only of a count field equal to zero and an empty body field.

The count field needs special mention. We have used the terminology measured string rather than counted string, because in ANS Forth a counted string has a one-character count. The quasi-official situation is that words dealing with counted string addresses on the data stack are deprecated in favor of Forth strings, but counted strings are not discouraged as an implementation practise for strings stored in memory [10, A.3.1.3.4]. Longer counted strings are not uncommon, according to discussions in the comp.lang.forth newsgroup. But it is clear that character-counted strings will remain in a number of systems; and we signal the difference by using the measured string terminology.

As usual in Forth, there is no restriction that bytes in the string body have to be ASCII characters, and there is no mandatory terminating null byte. The class of null terminated strings can be implemented as a subclass of measured strings with the null terminator included in the count.

3. String values may reside in the string buffer or outside of string space. A string value in the string buffer is called a dynamic string, and one that is outside is called an external string. Dynamic strings may be either bound or garbage; the distinction will be clarified as we go along.

4. A string variable is an ordinary Forth variable that holds a measured string address (MSA), and which is initialized to the empty string when defined. Occasionally a string address held in a variable (or on the string stack) is called a forward link.

5. Dynamic strings in string space are preceded by a backward link field cell, containing a backward link address. There are exactly three kinds of backward links, the first two of which correspond to bound strings:

The address of a string variable's data field, which in turn holds the address of the string that follows the backward link field (i.e., the MSA).
The address of a cell in the string stack, which in turn holds the string MSA.
Zero, indicating that the following string is garbage, eligible for deletion ("garbage collection") when more space is needed.

6. The string stack consists of cell entries building down towards the string break. Each entry is the MSA either of a bound string or of an external string.

7. We can now refine the specification of backward links. Every dynamic string which is not garbage has a backward link that either points to a string variable which points to the string following the backward link, or which points to a string stack entry which points to the string. We say that any nongarbage, dynamic string is either bound to a variable or bound to the string stack (not to both, because there is only one backward link for each dynamic string).

The MSA of a string which is bound to a variable may also appear any number of times on the string stack, but must not appear in other string variables. If a string is bound to the string stack, its address must not be held by any string variable. It may appear any number of times on the stack, but the backward link must point to the deepest stack entry.

In other words, a bound string is bound exactly once. Copies "by reference" may occur on the string stack. Storing a bound string in more than one string variable entails actual duplication of the string value.

It is valid for string variables and string stack entries to point to external strings, outside of string space. We do not call such strings "bound" because (we presume) they do not have backward link fields pointing to a variable or the stack.

8. We use the convenient language that a string is "in" or "held by" or "stored in" a variable or "on" or "held by" the string stack when in fact it is the string MSA that is there. A bound string is stored in a string variable if and only if it is bound to it.

9. A bound string value may change only in ways that do not change its count field. In fact, there is no mechanism within the string package itself to change an existing bound string value at all.

²Depending on the meaning of is!

3. String Space Header

Here we list the structure of the string space header, mentioned in the last section, as it currently appears in our C implementation. It depends on structures for measured strings, dynamic strings, and the string frame stack, which we list first.

In the p4_MStr and p4_DStr structures below, p4_MCount is the system-dependent type of the count field, e.g., unsigned long. The one-character body field is a stand-in for the string body:

struct p4_MStr          /* measured string */
{
  p4_MCount count;      /* size of string body without padding */
  char body;
};

struct p4_DStr          /* dynamic string */
{
  p4_MStr **backlink;
  p4_MCount count;
  char body;
};

The structure of frame stack entries in our implementation describes the address of a block of string stack entries and the number of strings in the block. The p4ucell type for the number of strings is that of an unsigned Forth cell, usually an unsigned int. Certainly that is much larger than needed, but the overhead is negligible. The frame stack itself can be short. We got by for a long time with only one entry, and would expect two to cover many of the remaining cases.

struct p4_StrFrame     /* frame stack item */
{  
  p4_MStr **top;       /* pointer to frame top */
  p4ucell num;         /* number of strings in frame */
};

That brings us to the header structure itself:

struct p4_StrSpace
{
  size_t size;          /* size of string buffer plus stack,
                           excluding guard null */
  size_t numframes;     /* maximum number of string stack frames */
  p4_DStr *buf;         /* pointer to start of string buffer */
  p4_DStr *sbreak;      /* pointer to next available space after
                           strings, normally aligned */
  p4_MStr **sp;         /* string stack pointer */
  p4_MStr **sp0;        /* initial string stack pointer, address of
                           last cell in string space */
  p4_StrFrame *fbreak;  /* top of frame stack limit pointer */
  p4_StrFrame *fp;      /* frame stack pointer */
  p4_StrFrame *fp0;     /* initial frame stack pointer */
  p4_MStr *cat_str;     /* pointer to the last string in the string
                           buffer, if concatenation is in progress,
                           else NULL */
  short garbage_flag;   /* true when there is garbage */
  short garbage_lock;   /* true when collection not permitted */
  short args_flag;      /* true when compiling with macro args */
};

4. Garbage Collection

Garbage collection is triggered by any (legal) operation that tries to insert new data into the string buffer and finds that there is not enough room.

The garbage flag in the string space header is tested, and if there is garbage, it is collected. Garbage strings are marked by null backward links. Nongarbage strings are bound by their backward links, pointing either to a string variable data field or to an entry on the string stack (which has to be the deepest if there are several).

Garbage collection fills the gaps occupied by garbage strings by moving any nongarbage strings after the first garbage string down, one at a time. The backward link of a string that is moved does not change, but forward links, either in at most one string variable, and/or possibly several on the string stack, are updated to point to the new MSA.

This algorithm is "fast" because no bound string is moved more than once, and because the backward links make it unnecessary to scan a list of string variables. It is, however, necessary to scan the string stack for multiple references to each string bound to a variable or the stack.

There are situations where trouble may occur if one is not careful. Coding the string primitives is slightly tricky, to avoid having an operation complete before a garbage collection occurs, causing the MSA of a bound string to be remapped. In other words, string primitives should be modified cautiously.

A more likely problem is the transfer of a bound string to the data stack for some manipulation. The MSA on the data stack may become invalid if a dynamic string operation is performed in the meantime which causes garbage collection. The simplest rule is to avoid dynamic string operations while manipulating bound strings on the data stack. But we do include a locking mechanism by which the user can explicitly disallow or repermit garbage collection.

5. Basic Operations on Strings

Garbage collection is supposed to be transparent to the other operations on strings, and to the user (except when it is impossible and leads to an abort); but it certainly affects the implementation. Here we describe the basic operations, including the role of copying, in the context of garbage collection. They correspond roughly to primitive C functions or macros in the implementation.

One of course wants to avoid copying strings when unnecessary. The string stack is the basic tool for string manipulation without copying. External strings are never copied, as long as they remain external (not bound). Bound strings have exactly one binding, either to a variable or to the string stack, and they are copied once, when they are created in the string buffer. A string that is either external or already bound may appear on the string stack any number of times without further copying.

Concatenation is a little special, as we discuss below.

In this discussion, an existing string is either a bound or external measured string that already exists. A new string is a measured string being created in the string buffer and being bound to a string variable or the string stack.

5.1 Pushes and Pops

1. Push existing string. The MSA is simply pushed onto the string stack in the normal way with stacks, after garbage collection if there is not enough room on the stack. No backward link is changed, no string value is copied, and the operation can be done multiple times for the same string.

2. Push with copy. Garbage collection is triggered if there is not room for a copy of the string and its stack entry in the string buffer. A new string is then created by copying the old string value and binding it to the stack, where the new MSA is pushed.

The old string may be a previously unmeasured, possibly transient Forth string. If so, a push with copy is one way of making it measured and not transient.

3. Pop existing string. The most accessible MSA is popped from the string stack in the normal way with stacks. That is a valid operation all by itself, except in one circumstance, when the deepest (and therefore only) entry of a dynamic string bound to the stack is being popped. Then the backward link must be changed, either by binding it to a variable in which the string is being stored, or by replacing it with a garbage null. The latter is the only case in which popping a string destroys its value. No garbage collection occurs.

4. Pop with copy. The string has to exist, so the operation above applies. There are two copy situations. One is an explicitly evoked copy into data space to make an external string, and does not involve garbage collection. The other is an automatic copy that occurs when a string already bound to a variable is stored into another variable. Then a new copy is made in the string buffer, possibly after garbage collection, and bound to the new variable.

5.2 String Concatenation

Iterative string concatenation is a key operation in our package. To avoid wasteful recopying of previously concatenated pieces, we lock the string buffer against all other operations that make new strings while concatenation is in progress. That ensures that the previous piece remains the last string in the buffer, to which the next piece can be appended.

To get the benefit of garbage collection while concatenation is in progress, we bind the unfinished piece to a string-like private variable in the string space header. It is only string-like because it is allowed to hold zero, which is not expected ever to be an MSA. Zero is the signal that the string buffer is unlocked for normal copy operations.

Concatenation is terminated by pushing the hidden string onto the string stack, and storing zero into the private variable.

6. String Frames

The Dynamic-Strings word list includes macros with arguments, based on string frames and a string frame stack. We like to call them "smart macros" because they can execute any Forth code, including code that is conditional on the arguments or that modifies them. They are aimed more at general text manipulation, like source to source translation, than at programming shortcuts for Forth itself. Details on the macros are covered in the glossary, so we won't go into them here. But we do want to say a little about string stack frames.

Although string stack frames are not just for macro arguments, that is our main use for them at the moment. Our Motorola 680x0 version did have macros, with no frame stack and at most one string stack frame, which was used for one level of macro arguments at compile-time or at run-time, or for one level of locals arguments at compile-time. That meant that macro labels and locals labels could not be used in the same definition, which we really didn't find to be much of a restriction. But it also meant no nesting of macro arguments at run-time.³ While that restriction might be encountered rarely, we thought it worth removing, especially since it was easy to add a string frame stack. So we did.

³Locals could be nested at run-time as usual, because they used the return stack both for the local frame and its description.

7. Multiple String Spaces

With very little programming or execution overhead, the design allows multiple string spaces in the same application, with a string stack for each. Dynamic string functions work from the "current" instance of a string space, somewhat analogous to old-style vocabularies with CURRENT and CONTEXT.

It might be handy for a set of procedures to share a private string space, leaving free a public string space for normal operations. A possibly contrived example might be a set of macros that get executed upon expansion, but which might induce garbage collection that would shatter the interpetation input stream. One way around would be to keep the macro execution string space separate and unavailable to interpretation of normal dynamic string words.

A more plausible example is that of multiple concatenation streams, like m4 deferred macros.

8. References

[1] David N. Williams, Dynamic-Strings Words: dstrings.html

[2] Rj Brown, February, 1988: gstrings.pro

[3] George T. Hawkins, August, 1987: fstrings.txt, fstrings.scr, fstest.scr

[4] Marcus D.Gabriel, quoted by Richard Thomson in: readme.txt

[5] Marcus D. Gabriel, February, 1988: mfstrings.scr

[6] Donald E. Knuth, The Art of Computer Programming, vol. 1, Fundamental Algorithms, third edition, (Addison-Wesley, Reading, Massachusetts, 1997).

[7] Paul R. Wilson, "Uniprocessor Garbage Collection Techniques", in International Workshop on Memory Management, St. Malo, France, September 1992 (proceedings published as Springer-Verlag Lecture Notes in Computer Science no. 637: gcsurvey.ps

[8] Wil Baden, "Pattern Expansion", comp.lang.forth, May 23, 24, 2000.

[9] Wil Baden, Tool Belt, August, 2000: tool2000.txt

[10] Technical Committee X3J14, Programming Languages---Forth, ANSI X3.215-1994 (American National Standards Institute, 1994): dpans.html

header	a string space header structure instance
frame stack	a short stack for string frame entries
string buffer	holds dynamic strings that build up towards high memory and a string stack that builds down

count field = n	at least 1 byte, not more than 1 cell
body field	n bytes, e.g., n < = 255, n < = 64K-1, or n < = 4,096M-1
possible pad to align	null bytes