Dynamic-Strings Words

Version 1.0.0

David N. Williams
MCTP
University of Michigan
Ann Arbor, MI 48109-1120

Last Revision: June 2, 2011


Copyright © 2001, 2002, 2004, 2006, 2008, 2010, 2011 by David N. Williams.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation, with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license may be found at http://www.gnu.org/copyleft/fdl.html.


Contents

1. Introduction
 
2. Terminology
 
3. Constants
MAX-MCOUNT  /MCOUNT  \n$  empty$
 
4. Variables
dstrings
 
5. ANS Forth and measured string extensions
(m!)  parse>s  s`  m,s
-mcount  mcount  mcount@  mcount!
 
6. Comments
($:
 
7. String space
$unused  /$buf
make-$space  max-#$frames  0strings
$garbage?  $gc-off  $gc-on  $gc-lock@  $gc-lock!
collect-$garbage
 
8. Load and store
$!  $"  $@  $`  $constant  $variable  parse>$
 
9. String stack
$.  $type  $depth
$drop  $dup  $swap  $nip  $over  $tuck
$2drop  $2dup  $pick  $exchange
$s>  $,s  $s@  >$s  >$s-copy
$pop  $push-ext
 
10. Concatenation
$+  $+"  $+`  ENDCAT  parse-s+  s+
 
11. Arguments
#$args  $ARGS{  $frame  $frame-depth
drop-$frame  find-$arg  th-$arg

1. Introduction

This is the Dynamic-Strings word glossary for our dynamic strings package, which includes string stacks, string stack frames, and string frame stacks, in string spaces with relatively fast garbage collection. See the separate article A Model for Dynamic Strings for a description of the ideas. There are currently two implementations, one in C for pfe, and another in ANS Forth. Both provide extra words not in the Dynamic-Strings glossary.

These words are intended to work with, not replace, ANS Forth string words, which act on ANS Forth strings, represented by address, length pairs on the data stack. This representation is a strong feature of ANS Forth. ANS Forth strings are roughly complementary to dynamic strings in the following way: the former are especially good for analysis and parsing of strings and substrings, while the latter are especially good for putting pieces of strings together and keeping them available while they're needed, then reclaiming their memory when they're not.

String frames allow string macro words to use arguments, and string frame stacks allow such macro words to use other macro words with arguments. The implementation of arguments, which are a string variant of locals, requires either patching the Forth text interpreter, which is what the pfe implementation does, or intercepting it in words that use arguments, which is what the ANS Forth implementation does with its argument interpreter.

Multiple string spaces are included because the programming overhead is negligible, and they allow multiple string concatenation streams.

It is understood that the specification for all words includes string stack and frame stack bounds checking. All operations that would push onto the string stack or copy a string value into string space throw an error if there is not enough room, even after garbage collection. Errors are also thrown before pop operations that would cause the string stack or frame stack to underflow.

The word specifications that follow use a specific, counted string variant (the "measured" string) for strings in memory, and use the corresponding count field addresses to represent strings on the string stack and in string variables. The specifications could have been more abstract, in terms of string data objects with unspecified implementation details. We felt that wouldn't have been as easy to understand, and could be done later anyway if it really turned out to be a good idea.

But we do want to mention that, with the exception of a few words in the ANS Forth and measured string extensions section, and the two words $pop and $push-ext, all of the words can be used without knowledge of the internal representation. They neither put a count field address on the data stack nor expect one there, and indeed they are intended to encourage the expectation that a string should normally appear on the data stack in its ANS Forth string representation.

2. Terminology

DFA: Standard forth abbreviation for "data field address".

ANS Forth string: A string represented as a body address and length (c-addr u) on the data stack. The address alignment need only be that of a character, and the string may or may not be stored as a counted string or a measured string.

s: Shorthand for an ANS Forth string on the data stack. Often used as a modifier, as in a.s.

measured string, mstring, mcount: A measured string is like an ANS Forth counted string with a count field followed by a body field, but with the count field aligned, and with its size determined by the system rather than specified as the size of one character. Canonical sizes are one character and one cell, and count fields larger than one cell are excluded. Since version 0.7.6, the one character size is strongly deprecated. The counts themselves count characters. We use "mstring" as short for "measured string" and "mcount" as short for "measured string count".

mcount overflow: An attempt to store an ANS Forth string as a measured string when its length is too large for the mcount field. This can only happen when the mcount field is smaller than a cell.

MSA: The address of a measured string, the same as that of its count field.

$: Shorthand for a measured string or a MSA, and a designator to distinguish dynamic string words.

string buffer: A contiguous memory region for dynamic strings and the string stack. Dynamic strings are stored building up from its lower end in memory, and stack entries build down from its upper end.

bound string: A measured string in the string buffer with a back link that points to a string variable or a string stack entry.

garbage string: A measured string in the string buffer with a null back link, marking it as unbound and deletable.

dynamic string: A string that is either bound or garbage (and thus in the string buffer).

fixed external string: An external string is a measured string outside of the string buffer. A fixed external string does not move. So, for example, it cannot be a dynamic string in a noncurrent string space where it could be moved by garbage collection.

string stack: A LIFO stack that holds the MSA's of bound or fixed external strings. It builds down from a fixed address at the higher end of the string buffer to its top in lower memory, which limits the movable end of the region where dynamic strings are stored.

string frame: A designated, contiguous block of entries on the string stack. String frame blocks do not overlap.

string frame stack: A LIFO stack of string frame designations.

string space: A contiguous memory region that includes a string frame stack and a string buffer.

current string space: The string space where all dynamic string operations take place. Its address is held by the variable dstrings. Unless otherwise stated, "string space", "string stack", "string buffer", etc., always refer to the current string space.

concatenation: Appending a string to the previous concatenation in the string buffer. Concatenation is initiated by the first attempt to concatenate, and is terminated by ENDCAT, which leaves the result on the string stack, or the empty string if there is no concatenation in progress. Concatenating the empty string is a special case which does not change the concatenation state in any way; in particular, it cannot initiate a new concatenation.

3. Constants

MAX-MCOUNT ( -- u )

Leave the largest mstring count provided by the implementation. Canonical values are 255 and 232-1. "max-m-count"

/MCOUNT ( -- u )

Leave the size of the mstring count field provided by the implementation in ANS Forth character units, nominally bytes. Canonical sizes are one character and one cell. "slash-m-count"

empty$ ($: -- empty$ )

Push the MSA of a fixed external representation of the empty string onto the string stack. "empty-string"

\n$ ($: -- newline$ )

Push onto the string stack the MSA of a fixed external string whose body is the C newline character, ASCII 10. "newline-string"

4. Variables

dstrings ( -- dfa )

A Forth variable that holds the address of the current string space, where all dynamic string operations take place. "d-strings"

5. ANS Forth and measured string extensions

Several of the words in this section violate the normal expectation that only ANS Forth strings should appear on the data stack.

(m!) ( s msa -- )

(M!) is the same as Wil Baden's PLACE, except it assumes the buffer address msa to be aligned, and stores s as a measured string, zero-filled to trailing alignment. As with PLACE, it is assumed that the mstring copy does not clobber the old string, and there is no check for room starting at msa. "parens-m-store"

parse>s compile:
run:
interpret:
( "ccc<char>" char -- )
( -- s )
( "ccc<char>" char -- s )

Parse the parse area up to the first occurrence of char, which is parsed away. If executing in compilation mode, append run-time semantics to the current definition that leaves the ANS Forth string representation on the stack. In interpretation mode, leave the ANS Forth string representation for a stored mstring copy, which may be transient in the style of S". In either mode, the stored string is a mstring. "parse-to-s"

s` compile:
run:
interpret:
( "ccc<`>" -- )
( -- s )
( "ccc<`>" -- s )

An immediate version of parse>s where the delimiter is `, with the same transience property. "s-back-tick"

m,s ( addr len -- addr' len )

Allot room and store the ANS Forth string into aligned data space as a measured string, leaving data space aligned; and leave the length and new body address. It is assumed that len is unsigned. An error is thrown if there is an mcount overflow. "m-comma-s"

Note: m,s differs from STRING, in Wil Baden's Tool Belt in that it stores a measured string instead of a counted string, and leaves the ANS Forth string representation of the stored string instead of nothing.

-mcount ( addr len -- msa )

Assume the input ANS Forth string to be represented as a measured string. Leave its MSA. "minus-m-count"

mcount ( msa -- addr len )

Convert the input measured string to its ANS Forth string representation. "m-count"

mcount@ ( msa -- len )

Leave the length of the input measured string. "m-count-fetch"

mcount! ( len msa -- )

Store len in the measured string count field at msa. "m-count-store"

6. Comments

($: ( -- )

A synonym for "(" including any immediacy, and allowing multiple lines before the terminating right parenthesis, as specified for "(" in the File-Access word set. "paren-string-colon"

7. String space

$unused ( -- u )

Leave the number of address units available for dynamic strings and string stack entries in the string buffer. "string-unused"

/$buf ( -- u )

Leave the size in address units allocated for the current string buffer. "slash-string-buf"

make-$space ( size #frames -- addr )

Allocate and initialize a string space with size address units available for the string buffer including the string stack, and with a string frame stack for frame description entries holding up to #frames. The size is rounded up to address alignment, and the buffer begins and ends with address alignment. Return addr, the address of the string space. The standard word free with addr as input can be used to release the space. "make-string-space"

max-#$frames ( -- u )

Leave the number of string frames allowed on the string frame stack for the current string space. "max-number-string-frames"

0strings ( -- )

Set all string variables holding bound string values in string space to the empty string, and clear string space including the string buffer, string stack, and string stack frames. "zero-strings"

$gc-off ( -- )

Disable garbage collection in the current string space. An error will be thrown if garbage collection is attempted. "string-g-c-off"

$gc-on ( -- )

Enable garbage collection in the current string space. This is the default. "string-g-c-on"

$gc-lock@ ( -- flag )

Fetch the dstring garbage collection "off" state. Intended for saving the off state for later restoration after a usage of $gc-on or $gc-off. "string-g-c-lock-fetch"

$gc-lock! ( flag -- )

Set the dstring garbage collection "off" state according to flag. Intended for restoring the off state previously fetched by gc-lock@. "string-g-c-lock-store"

$garbage? ( -- flag )

Leave true if there is garbage in the current string space. Not normally used, since garbage collection is transparent. "string-garbage-question"

collect-$garbage ( -- collected.flag )

If string space is not marked as containing garbage, return false. If there is garbage, throw an error when garbage collection is disabled. Otherwise remove the garbage and return true. Garbage collection is "transparent", so the user would not normally use this word. "collect-string-garbage"

Note: Garbage strings are marked by null backward links. Nongarbage strings are bound by their backward links, pointing either to a string variable DFA or to an entry on the string stack (the deepest if there are several identical references).

Garbage collection fills the gaps occupied by garbage strings by moving any nongarbage strings to lower memory one at a time. The backward link of a string that is moved does not change, but the forward links, in at most one string variable, and/or possibly several on the string stack, are updated to point to the new MSA. This algorithm is "fast" because the backward links make it unnecessary to scan a list of string variables; and because no string is moved more than once. It does, however, require a scan of the string stack for each string that moves, unless it is the current concatenation string, which is guaranteed not to be on the string stack.

8. Load and store

$! ( $var.dfa $: a$ -- )

Store the string whose MSA is on the string stack in the string variable whose DFA is on the parameter stack. "string-store"

Note: The only situation in which $! copies the string value is when it is a bound string already stored in another variable. In that case, the new copy is the one that is stored in the variable. In particular, external strings are not copied.

Note: If the string value held by the string variable on entry is a bound string that is also referenced deeper on the string stack, its back link is reset to point to the deepest string stack reference. If it is a bound string not deeper on the string stack and not identical to the input string, its back link is set to zero, making it garbage. If it is an external string, its MSA in the variable is simply written over by that popped from the string stack.

$" compile:
run:
interpret:
( "ccc<">" -- )
($: -- ccc$ )
( "ccc<">" -- $: ccc$ )
$` compile:
run:
interpret:
( "ccc<`>" -- )
($: -- ccc$ )
( "ccc<`>" -- $: ccc$ )

Immediate versions of parse>$ where the delimiter is ", respectively, `. In particular, unlike S", a nontransient copy is stored when interpreting. "string-quote", "string-back-tick"

$@ ( $var.dfa -- $: a$ )

Leave the MSA of the string held by the string variable. The string is not copied. "string-fetch"

$constant
"name" execution:
( "name" $: a$ -- )
($: -- ext$ )

Create a definition for "name" which pushes an external copy of a$ onto the string stack, without copying it into the string buffer. When a$ is already external, the definition does not make a new copy; otherwise it copies a$ into data space as a measured string. "string-constant"

For example:

     $" This is a sample string." $constant sample$

Warning: If a$ is external because it is in another string space, it may move or disappear at the next garbage collection in that space, making the MSA pushed by "name" invalid. This can be avoided by sandwiching sections of code where this could occur between $gc-off and $gc-on.

$variable
"name" execution:
( "name" -- )
( -- $var.dfa )

Create an ordinary ANS Forth variable and initialize it to the address of a fixed, external, measured representation of the empty string, such as that pushed onto the string stack by empty$. "string-variable"

parse>$ compile:
run:
interpret:
( "ccc<char>" char -- )
($: -- ccc$ )
( "ccc<char>" char -- $: ccc$ )

Parse the parse area up to the first occurrence of char, which is parsed away, and store the string as an external measured string. If executing in compilation mode, append run-time semantics to the current definition that leaves the MSA on the string stack. In interpretation mode, leave the MSA on the string stack, where the stored copy, unlike parse>s, is required to be nontransient. An error is thrown if there is an mcount overflow. "parse-to-string"

9. String stack

$. ($: a$ -- )
$type ($: a$ -- )

These words are synonyms, with $type deprecated. Display the string on the terminal. If the system implementation of TYPE has its output vectored, $. uses the same vector. "string-dot", "string-type"

$depth ( -- u )

Leave the number of items on the string stack. "string-depth"

$drop ($: a$ -- )

Drop the topmost string stack entry, marking it as garbage if it is initially bound to the top of the string stack. "string-drop"

$dup ($: a$ -- a$ a$ )

Leave a copy of the topmost string stack entry. The string value is not copied. "string-dupe"

$swap ($: a$ b$ -- b$ a$ )

Exchange the two most accessible strings on the string stack. Throw an error if there are less than two strings on the stack. Neither string value is copied. "string-swap"

$nip ($: a$ b$ -- b$ )

Drop the next to top item from the string stack. "string-nip"

Note: A system-level implementation may be omitted. Because of essential string space bookkeeping, it can be little more efficient than the high-level definition:

 
      : $nip  $swap $drop ;

$over ($: a$ b$ -- a$ b$ a$ )

Leave a copy of the next most accessible string stack entry on top of the string stack. The string value is not copied. "string-over"

$tuck ($: a$ b$ -- b$ a$ b$)

Copy the top string stack item just below the second item. The string value is not copied. "string-tuck"

Note: A system-level implementation may be omitted. Because of essential string space bookkeeping, it can be little more efficient than the high-level definition:

 
      : $tuck  $swap $over ;

$2drop ($: a$ b$ -- )

Drop the two topmost string stack entries, marking them as garbage if appropriate. "string-two-drop"

$2dup ($: a$ b$ -- a$ b$ a$ b$ )

Leave copies of the two topmost string stack entries. The string values are not copied. "string-two-dupe"

$pick ( u -- )
($: au$ ... a0$ -- au$ ... a0$ au$ )

Copy the u-th string stack entry to the top of the string stack. The string value is not copied. Throw an error if the input string stack does not have at least u+1 items. "string-pick"

$exchange ( i j -- )
($: maxth$ ... minth$ ...
   -- minth$ ... maxth$ ... )

Exchange the ith and jth strings on the string stack, where the top is the 0th. Throw an error if there are not at least max[i,j]+1 strings on the stack. Neither string value is copied. "string-exchange"

$s> ($: a$ -- s: a.s )

Drop a$ from the string stack and leave it as an ANS Forth string a.s, without copying. "string-s-from"

Warning: If a$ is a bound string, it may move or disappear at the next garbage collection, making a.s invalid. This can be avoided by sandwiching sections of code where this could occur between $gc-off and $gc-on.

$,s ($: a$ -- s: a.s )

Drop a$ from the string stack, copy it into data space as a measured string, and leave it as an ANS Forth string a.s. "string-comma-s"

$s@ ($: a$ -- a$ s: a.s)

Leave the string stack unchanged, and leave the string body address and length on the data stack. "string-s-fetch"

Warning: If a$ is a bound string, it may move at the next garbage collection, making a.s invalid. This can be avoided by sandwiching sections of code where this could occur between $gc-off and $gc-on.

>$s ( a.s -- $: a$ )

Push the external ANS Forth string a.s onto the string stack, without copying the string value into the string buffer. It is an unchecked error if the ANS Forth string a.s is not stored as a fixed, external measured string. "to-string-s"

Warning: If the string value of a.s is actually in the string buffer, the push operation may generate a garbage collection that invalidates its MSA.

>$s-copy ( a.s -- $: a$ )

Copy the external string value whose body address and count are on the parameter stack into the string buffer and push it onto the string stack. Errors are thrown if there is an mcount overflow, if there is not enough room in string space, even after garbage collection, or if there is an unterminated string concatenation. The input external string need not exist as a measured string. "to-string-s-copy"

Warning: This word should not be used when the input string is a bound string because the copy operation may generate a garbage collection which invalidates its MSA.

$pop ($: msa -- s: msa)

Abort if the string stack would underflow when popped.

Otherwise pop the top of the string stack and push it onto the data stack.

If the string is in the current string space and initially bound to the top of the string stack, mark it as garbage by setting its back link to NULL and set the garbage flag. "string-pop"

Note: This word violates the normal expectation that only ANS Forth strings should appear on the data stack.

Warning: If msa is the address of a bound string, it may become invalid at the next garbage collection. This can be avoided by sandwiching sections of code where this could occur between $gc-off and $gc-on.

$push-ext ( msa -- $: msa )

Pop an external mstring address from the data stack and push it onto the string stack after checking for room, invoking garbage collection if necessary. Not to be used with a dynamic string because a garbage collection can invalidate its address.

Note: This word violates the normal expectation that only ANS Forth strings should appear on the data stack.

10. Concatenation

$+ ($: a$ -- )
s+ ( a.s -- )

If the input is the empty string, drop it and do nothing else.

In particular, do not start a new concatenation, which would lock string space against new, nonconcatenating copies.

Append the string body to the end of the string currently being concatenated as the last string in the string buffer, and update its count field. If there is no concatenating string, start one. An error is thrown if there is not enough room in the string buffer even after a garbage collection, or if there is an mcount overflow. An ambiguous condition results if the combined string would be too large to calculate its length in a cell. When there is a concatenating string, only concatenation can copy a string into the string buffer. Pushes onto the string stack without copy are still allowed. "string-plus", "s-plus"

Note: A garbage collection triggered by $+ does not cause a problem, even if a$ is in the string buffer.

Warning: In contrast, a garbage collection triggered by s+ can invalidate a.s if it is in the string buffer. The normal use is with fixed, external strings. To use s+ with strings in the string buffer, it is essential to turn off garbage collection with $gc-off.

Note: For s+, the ANS Forth string a.s does not have to exist as a measured string.

$+" compile:
run:
interpret:
( "ccc<">" -- )
( -- )
( "ccc<">" -- )
$+` compile:
run:
interpret:
( "ccc<`>" -- )
( -- )
( "ccc<`>" -- )

These words are immediate. In compilation mode they append run-time semantics to the current definition that concatenates the quoted, respectively, back-ticked string, according to the specification for $+. In interpretation mode they concatenate the string. "string-plus-quote", "string-plus-back-tick"

ENDCAT ($: -- cat$|empty$ )

If there is no concatenating string, do nothing but leave the empty string. If there is, leave it as a string bound to the top of the string stack, and terminate concatenation, permitting normal copies into the string buffer. "end-cat"

Note: Concatenation can be interrupted and resumed like this:

 
      ENDCAT ... $+

where the intervening action must ensure that the string pushed by ENDCAT is on top of the string stack for $+. There may be overhead from a recopy of the interrupted concatenation, if any new strings have been produced in string space, even if marked as garbage. If the intervening action leaves a new concatenation in progress, the old is concatenated to the new (unless there was no old concatenation in progress).

parse-s+ compile:
run:
interpret:
( "ccc<char>" char -- )
( -- )
( "ccc<char>" char -- )

Parse the parse area up to the first occurrence of char, which is parsed away. If executing in compilation mode, append run-time semantics to the current definition that concatenates the characters parsed from the string. Otherwise concatenate the characters. "parse-s-plus"

11. Arguments

#$args ( -- u )

Leave the number of entries in the topmost string frame. Throw an error if the frame stack is empty. "number-string-args"

$ARGS{ compile:
run:
( "arg1 arg2 ... argn <}>" -- )
($: arg1'$ arg2'$ ... argn'$ -- )

Immediate and compilation-only.

Copy the argument strings across lines to the string buffer, push them onto the string stack with argn the most accessible, and make them into the top compile-time string stack frame. Compile the run-time code to make an argument frame out of the n most accessible run-time string stack entries.

Start intercepting the input stream across lines. For any argument word token, compile run-time code that concatenates the corresponding string in the run-time frame. For any nonargument word token, compile its normal run-time semantics. At the semicolon terminating the definition, drop the compile-time argument frame, compile code to drop the run-time argument frame, perform the normal system semicolon semantics, and stop intercepting the input stream. "string-args-brace"

Syntax for defining a string macro george:

 
      : george  ($: a$ b$ c$ -- cat$ )
        $ARGS{ arg1 arg2 arg3 }
        $+" This is arg1:  " arg1 $+" ." ENDCAT ;

The blank following the last argument is required. For a macro with no arguments, $ARGS{...} does nothing but add useless overhead and should be omitted. Two of the arguments in this example are ignored and could have been left out. It is also normal to use a $ARGS{...} word without ENDCAT before the colon as a step in a concatenation that is terminated elsewhere.

Sample syntax using the string macro george:

 
      $" bill"  $" sue"  $" marie"  george $.

The resulting display is:

 
      This is arg1:  bill.

Note: The code between $ARGS{...} and the terminating semicolon is allowed to produce a net increase in the runtime string stack depth, but is not allowed to change the part of the string stack present just after $ARGS{...} (which includes the frame), beyond what is induced by garbage collection.

Note: Macro argument labels must be distinct from each other and from any local labels that appear in the same definition, and there is no requirement to check for that.

Note: At the moment the semantics of $ARGS{ is undefined before DOES>.

$frame ( u -- )

Push the description of a string stack frame starting at the top of the string stack and containing u entries onto the string frame stack. Errors are thrown if the frame stack would overflow or if the depth of the string stack above the top frame, if there is one, is less than u. The value u = 0 is allowed. "string-frame"

Note: The current implementations push u and the string stack pointer onto the frame stack.

Note: Any changes to the string stack in or below the string frame before it is dropped, except for those induced by garbage collection, result in an ambiguous condition.

$frame-depth ( -- u )

Leave the number of string frames currently on the string frame stack. "string-frame-depth"

drop-$frame ($: frame*$ i*$ -- i*$ )

Drop the topmost string frame from the string frame stack and string stack, and the corresponding strings, frame*$, from the string stack. An error is thrown if either stack would underflow. The cases where the frame has zero entries on the string stack and/or there are zero or more items on the string stack above the top frame item are handled properly. "drop-string-frame"

find-$arg ( a.s -- u true | false )

Leave true and its index u in the top string frame if the ANS Forth string matches an element of the frame, else leave false. The index of the top frame element is zero. "find-string-arg"

th-$arg ( u -- $: arg_u$ )

Leave the u-th string in the topmost string frame, where the index u of the top element is zero. Throw an error if the frame stack is empty or if the top frame contains less than u + 1 strings. "th-string-arg"