Measured String Libraries

Version 0.8.5

David N. Williams

Last Revision: October 29, 2010


Copyright © 2002, 2007–2010 David N. Williams

This document is licensed under the Creative Commons Attribution-Share Alike 2.5 License.

The libraries it describes are free software, which can be redistributed and/or modified under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or any later version.


Contents

0  Terminology
1  Introduction
2  Naming Conventions
3  Loading and Testing
4  Glossaries

4.1 stack representation
mcount  -mcount  /m
 
4.2 m+ words
<m+>  m+room?  m+
 
4.3 m! words
<m!>  mroom?  m!
 
4.4 comma words
m,  m,s  <m+,>  ?mcontig-data  m+,
 
4.5 null termination
<null-m+>  null-m+  null-m+,
 
4.6 multiline input
eol-s
upto-m!  upto-m+  m!"  m!`  m+"  m+`
upto-m,s  m,s"  m,s`  m"<lines><char>"  m"  m`
s-upto-m+,  s-upto-m,s  |s|-upto-m,s
 
not implemented
{}upto-m!  {}upto-m+  s-upto-m!  s-upto-m+"
upto-m+,  m+,"  m+,`
{}upto-m,s  {}upto-m+,  s-upto-m+,  |s|-upto-m+,

0  Terminology

c[u]: The number of address units for u characters. The result of ( u) CHARS.

string: A "normal" string, i.e., a sequence of ANS Forth chars.

fstring: The ANS Forth pair representation ( c-addr u) of a string. Note that u is measured in characters, not address units. This is the fundamental ANS Forth description of a string.

fstring body: The character data starting at c-addr and ending at c-addr + c[u] - 1, unless u is zero.

s or .s suffix: Short for an fstring, especially on the data stack.

mstring or measured string: A counted string whose count field is an aligned cell, instead of the single character used for the count in an ANS Forth counted string. The count is the number of characters in the string body. The string body field begins immediately after the count field.

m or .m suffix: Short for the address of an mstring.

mstring buffer or mbuffer: A contiguous region of memory aligned at its beginning, used for storing an mstring. An mbuffer is a structure instance: the number of characters is in the first cell, followed by the corresponding sequence of characters. Both the count and the characters must fit into the mbuffer.

mbuf, /mbuf: In stack comments mbuf is the address of an mbuffer. The size of the mbuffer structure instance in address units is /mbuf, assumed to be large enough to contain at least a count field. Note the distinction from a character buffer, designated by cbuf and #cbuf in parsing.fs.

host eol: The sequence of end of line characters used by the host implementation of WRITE-LINE.

whitespace: In this implementation, any of the ASCII characters 0x0 ... 0x20, called ws characters.

up to: Up to but not including any trailing delimiter, whether character or string.

1  Introduction

The measured string libraries are mostly ANS Forth compatible up to case dependence. The code is intended to be character clean.

Some of the words already existed in dstrings.fs and the corresponding pfe loadable module dstrings-ext.c.

Some were derived from Wil Baden's ToolBelt, which we believe to be in the public domain.

The library takes a particular point of view about the storage of ANS Forth strings, aka fstrings, as measured strings. Namely, a measured string is stored at an aligned address, with a leading one-cell count field for the number of characters, followed by the characters in the string body. Unlike their treatment in the dstrings libraries, mstrings here are not zero filled to trailing alignment.

Because the names had to change anyway, to reflect the measured rather than byte-counted string storage model, we also made it normal behavior for words that store into a buffer to check for overflow and leave a flag. See the Naming Conventions section for a discussion of the names.

This version of the library does not automatically append a trailing null character to measured strings, but does include words for doing that.

Words that store or concatenate an fstring into an mbuffer do not check whether the source and target regions overlap. The outcome is unspecified if they do.

When mstrings are stored or concatenated into data space, the address of the next available data slot may be left unaligned, just as with C,. The system is expected to detect any data space overflow.

2  Naming Conventions

The starting point is the following set of three Baden ToolBelt words for storing fstrings as counted strings, shown with the corresponding names for storing them as mstrings. The mstring actions are the same as those of the ToolBelt words except for alignment and the size of the count field:

toolbelt    mstring
PLACE       <m!>
APPEND      <m+>
STRING,     m,

Since we needed to change the names anyway, to reflect measured rather than counted string storage, we took the opportunity to also make it normal behavior to check for overflow and leave a flag when storing into a buffer. The corresponding nonchecking words have surrounding < > in their names, indicating their use as factors where overflow is under control.

On the other hand, there is no explicit overflow checking for comma words that store into data space, because it is left up to the system to handle that. Comma word names with surrounding < > indicate factor usage not related to overflow checking.

The normal behavior for comma words in Forth-94 is to leave nothing on the data stack, but that's not always convenient for fstrings stored as mstrings. We adopt the convention that a trailing ,s in the name of a word means it leaves the fstring description on the stack, and that a comma without a trailing s means it does not.

Although normal data stack parameters for strings in ANS Forth are fstring parameters, some words do have mstring addresses as stack inputs or outputs. An m in a word name is unrelated to that; it just indicates the mstring storage format.

The names of words that store the input stream across lines (or blocks) up to a character, set, or string-pattern delimiter contain upto, {}upto, or s-upto. The exceptions are words with a double quote or back-tick in the name, which store across lines up to that quote character. |s| in a name indicates that the delimiting string is whitespace delimited on both ends.

All words that store from the input stream leave it positioned just after the trailing delimiter, or just after the first of any trailing whitespace characters for |s| patterns.

3  Loading and Testing

The library collection contains six libraries, any of which loads any of the others that it needs. Any subset may be loaded in any order with REQUIRED.

Here are the ANS Forth source libraries:
mstrings-srep.fs
mstrings-cat.fs
mstrings-store.fs
mstrings-comma.fs
mstrings-linput.fs
mstrings-0end.fs

Among these, mstrings-srep.fs is standalone, and does not depend on any of the others.

The first five are listed in order of increasing inclusion, with each after the first loading all that precede it.

The mstrings-linput.fs library also loads our parsing library, with parsing.fs as the default, and with the pfe version based on parsing-ext.c as a selectable alternative.

The mstrings-0end.fs library depends only on mstrings-srep.fs.

There is also a set of Hayes-style ttester tests in mstrings-test.fs, which runs correctly with gforth, iForth, and pfe.

The test file checks that the end of line character sequence defined by the fstring constant eol-s in mstrings-linput.fs agrees with the host eol. The Unix eol is the default, which works with the Forth systems just mentioned when they run on OS X and Linux. Other variants can be uncommented in mstrings-linput.fs.

4  Glossaries

4.1  stack representation

Library   mstrings-srep.fs
Loads nothing

Note that -mcount, the inverse of mcount, is meaningful only when s is stored as an mstring.

mcount ( m -- s )

Assume that m corresponds to an mstring. Leave its fstring representation. Also in the dstrings library.

-mcount ( s -- m )

Leave the mstring address corresponding to the fstring. Also in the dstrings library.

/m ( len -- c[len]+cell )

Leave the combined size in address units of the count and body fields of an mstring containing len characters.
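
As an illustration of the layout these three words rely on, here is a minimal sketch in ANS Forth. The my- names are hypothetical stand-ins written for this example, not the library source, and the code assumes only the count-cell-then-body layout described under Terminology.

```forth
\ Hypothetical stand-ins for mcount, -mcount and /m (not the library source).
: my-mcount  ( m -- c-addr u )  dup cell+ swap @ ;   \ body address, then count
: my--mcount ( c-addr u -- m )  drop 1 cells - ;     \ step back over the count cell
: my-/m      ( len -- n )       chars cell+ ;        \ c[len]+cell

\ Build an mstring by hand: CREATE aligns, then one count cell and three chars.
create demo  3 ,  char a c,  char b c,  char c c,

demo my-mcount type        \ displays the three stored characters
```

Note that my--mcount simply discards the length, which is why the inverse only makes sense for an fstring that really is the body of an mstring.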

4.2  m+ words

Library   mstrings-cat.fs
Loads mstrings-srep.fs

A sequence of mstring concatenations can be initialized by invoking: ( 0 m) !

<m+> ( s m -- )

Append the fstring to the mstring without checking for room, and without checking for source and target region overlap.

m+room? ( len mbuf /mbuf -- len mbuf flag )

Assume that the mbuffer contains an mstring. Test whether /mbuf is large enough to append len characters. The flag is the value of the predicate  c[len+count]+cell <= /mbuf.

m+ ( s mbuf /mbuf -- flag )

Assume that the mbuffer contains an mstring. If  c[s.len+m.count]+cell <= /mbuf, append the body of s to the mstring and leave true. Else append nothing and leave false. Note that there is an ANS Forth Double-Number word with the same name.
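
The relationship between the three words can be sketched as follows. This is a hedged illustration in ANS Forth, not the library source: the my- names, buf, and /buf are hypothetical, and the definitions follow only the stack effects and predicates stated above.

```forth
: my-mcount ( m -- c-addr u )  dup cell+ swap @ ;

\ Unchecked append, modelled on the description of <m+>.
: my-<m+> ( c-addr u m -- )
  2dup 2>r                   \ c-addr u m        R: u m
  my-mcount chars +          \ c-addr u dest     first free char of the body
  swap chars move            \ copy u characters
  2r> +! ;                   \ add u to the count

\ Room test: c[len+count]+cell <= /mbuf
: my-m+room? ( len mbuf /mbuf -- len mbuf flag )
  >r  2dup @ +  chars cell+  r> u> 0= ;

\ Checked append: store and leave true, or store nothing and leave false.
: my-m+ ( c-addr u mbuf /mbuf -- flag )
  my-m+room? if  my-<m+> true  else  drop 2drop false  then ;

4 chars cell+ constant /buf  \ hypothetical size: count cell plus four chars
create buf  /buf allot
0 buf !                      \ ( 0 m) ! starts an empty mstring
s" abc" buf /buf my-m+ .     \ true flag: three characters fit
s" de"  buf /buf my-m+ .     \ false flag: five would not fit
buf my-mcount type           \ the body is still abc
```

Keeping the room test as a separate factor lets a caller that already knows the size call the unchecked word directly, which is the role the naming conventions assign to the < > words.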

4.3  m! words

Library   mstrings-store.fs
Loads mstrings-cat.fs

<m!> ( s a-addr -- )

Store the fstring as an mstring at a-addr, without checking for room. Also in the dstrings library.

mroom? ( len mbuf /mbuf -- len mbuf flag )

The mbuf argument is not used. Test whether /mbuf is large enough to contain an mstring with body length len. The flag is the value of the predicate  c[len]+cell <= /mbuf.

m! ( s mbuf /mbuf -- flag )

If  c[s.len]+cell <= /mbuf, store s as an mstring at mbuf and leave true. Else store nothing and leave false.
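
A comparable sketch for the store words, again with hypothetical my- names and buffer, and assuming only the behavior stated in the glossary entries above. The library's own definitions may be factored differently.

```forth
\ Unchecked store, modelled on the description of <m!>.
: my-<m!> ( c-addr u a-addr -- )
  2dup !                     \ write the count cell
  cell+ swap chars move ;    \ copy the body right after it

\ Room test: c[len]+cell <= /mbuf (the mbuf argument is only passed through).
: my-mroom? ( len mbuf /mbuf -- len mbuf flag )
  >r  over chars cell+  r> u> 0= ;

: my-m! ( c-addr u mbuf /mbuf -- flag )
  my-mroom? if  my-<m!> true  else  drop 2drop false  then ;

4 chars cell+ constant /buf  \ hypothetical size: count cell plus four chars
create buf  /buf allot
s" abcd"  buf /buf my-m! .   \ true flag: the string exactly fills the mbuffer
s" abcde" buf /buf my-m! .   \ false flag: nothing is stored
buf @ .                      \ the count is still 4
```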

4.4  comma words

Library   mstrings-comma.fs
Loads mstrings-store.fs

m, ( s -- )

Store the fstring as an mstring in data space.

m,s ( s -- s' )

Store the fstring as an mstring in data space and leave its fstring representation. Also in the dstrings library.

<m+,> ( s m -- )

The same as m+, below, without the contiguity check.

?mcontig-data ( m -- )

Throw an error if the next available data space is not contiguous with the mstring.

m+, ( s m -- )

Assume that m is the address of an mstring in data space. If s has nonzero length and the next available data space is not contiguous with the end of the mstring, throw an error. Otherwise append the body of s to that of m and adjust the count.
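
The contiguity requirement is easiest to see in code. The following is a minimal sketch under stated assumptions: the my- names are hypothetical, and ABORT" stands in for whatever error the library actually throws, which this document does not specify.

```forth
: my-mcount ( m -- c-addr u )  dup cell+ swap @ ;

\ Lay down count and body at HERE, modelled on the description of m, .
: my-m, ( c-addr u -- )
  align  dup ,                     \ aligned one-cell count
  here swap chars dup allot        \ c-addr dest n
  move ;

: my-m,s ( c-addr u -- c-addr' u )
  align here >r  my-m,  r> my-mcount ;

\ Complain unless HERE is exactly the end of the mstring body.
: my-?mcontig-data ( m -- )
  my-mcount chars +  here <>  abort" mstring not contiguous with HERE" ;

\ Unchecked append in data space.
: my-<m+,> ( c-addr u m -- )
  >r  here over chars allot        \ c-addr u dest     R: m
  over >r                          \ R: m u
  swap chars move                  \ copy the body
  r> r> +! ;                       \ add u to the count

: my-m+, ( c-addr u m -- )
  over if  dup my-?mcontig-data  my-<m+,>  else  drop 2drop  then ;

: demo ( -- )
  s" abc" my-m,s  drop 1 cells -   \ m
  s" def" 2 pick my-m+,            \ HERE still follows the body, so this appends
  my-mcount type ;
demo                               \ displays abcdef
```

The demonstration runs inside a definition on purpose: on many systems, creating a new definition between the two steps would advance HERE and break contiguity.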

4.5  null termination

Library   mstrings-0end.fs
Loads mstrings-srep.fs

These words append a null character to an mstring, without changing its count. Assuming no other embedded nulls, the resulting string body is a C string when sizeof(char) is the same as 1 CHARS.

The action of null-m+, agrees with its name only when the next available slot in data space is contiguous with the end of an mstring. That condition is not checked.

<null-m+> ( m -- )

Append a null character to the mstring at m without changing the count, and without checking for room.

null-m+ ( mbuf /mbuf -- flag )

Assume the mbuffer contains an mstring. If there is room in the buffer, append a null character without changing the count and leave true. Else append nothing and leave false.

null-m+, ( -- )

Assume the next available data space is contiguous with the end of an mstring, and append a null character to it. The count is not changed.
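
A short sketch of the buffer variants may help. It assumes only what is stated above: the null goes just past the body, the count is untouched, and the checked word needs room for one extra character. The my- names, buf, and /buf are hypothetical.

```forth
: my-mcount ( m -- c-addr u )  dup cell+ swap @ ;

\ Write a null just past the body; the count cell is left alone.
: my-<null-m+> ( m -- )  my-mcount chars +  0 swap c! ;

\ Checked version: room is needed for count+1 characters plus the count cell.
: my-null-m+ ( mbuf /mbuf -- flag )
  over @ 1+ chars cell+            \ mbuf /mbuf needed
  u< if  drop false  else  my-<null-m+> true  then ;

\ An mbuffer holding abc, with one spare character slot at the end.
create buf  3 ,  char a c,  char b c,  char c c,  1 chars allot
4 chars cell+ constant /buf

buf /buf my-null-m+ .              \ true flag: the null fits
buf @ .                            \ the count is still 3
```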

4.6  multiline input

Library   mstrings-linput.fs
Loads mstrings-comma.fs
parsing.fs
or pfe parsing-ext

eol-s ( -- host.eol.s )

A constant that leaves the fstring consisting of the host eol. The fstring is stored in memory as an mstring.

upto-m+ ( "<lines><char>" mbuf /mbuf char
     -- room? found? )

Append the input stream across lines, up to the first occurrence of char, to the mstring in the mbuffer, including the host eol for line ends.

The room? flag is true when the mbuffer is sufficient, and the found? flag is true when the delimiting character is found in the input stream before a REFILL failure.

When both flags are true, the input stream is left with the delimiting character parsed away.

When one or both of the flags is false, the input stream is positioned just after the last character stored, if that was within a line, or at the beginning of the next line, if that was the last character of the host eol sequence. Only full fragments within a line are stored, up to the delimiting character or the end of the line, whichever comes first.

upto-m! ( "<lines><char>" mbuf /mbuf char
     -- room? found? )

The same as upto-m+ except that the input stream across lines is stored as an mstring into the mbuffer instead of being appended to an mstring initially there.

m!" m!` m+" m+` ( "<lines><char>" mbuf /mbuf -- flag )

Versions of upto-m+ and upto-m! where char is the quote or back-tick as indicated by the name.

upto-m,s ( "<lines><char>" char -- s )

Store the string across lines, up to the first occurrence of char, as an mstring in data space, including the host eol for line ends. Leave its fstring representation. Leave the input stream positioned just after the trailing char. Throw an error if char is not found.
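
The across-lines behavior can be sketched with PARSE and REFILL. This is an illustrative reconstruction, not the library source: the my- names, parse-found?, and eol-char are hypothetical, the eol is assumed to be a single LF, and ABORT" stands in for the unspecified thrown error.

```forth
: my-mcount ( m -- c-addr u )  dup cell+ swap @ ;

\ Unchecked append at HERE; m must end exactly at HERE.
: my-<m+,> ( c-addr u m -- )
  >r  here over chars allot        \ c-addr u dest     R: m
  over >r                          \ R: m u
  swap chars move                  \ copy the body
  r> r> +! ;                       \ add u to the count

create eol-char  10 c,             \ assumption: Unix eol, a single LF
: my-eol-s ( -- c-addr u )  eol-char 1 ;

\ PARSE, plus a flag telling whether the delimiter was actually seen:
\ >IN moves one position further than the string length only if it was.
: parse-found? ( char "ccc<char>" -- c-addr u found? )
  >in @ >r  parse  dup  >in @ r> -  < ;

: my-upto-m,s ( char "<lines><char>" -- c-addr u )
  align here  0 ,  swap            \ m char   empty mstring to grow
  begin
    2dup parse-found? >r           \ m char m c-addr u   R: found?
    rot my-<m+,>                   \ m char   line fragment appended
    r> 0=
  while                            \ the line ended before the delimiter
    my-eol-s 3 pick my-<m+,>       \ keep the line end in the stored string
    refill 0= abort" delimiter not found"
  repeat
  drop my-mcount ;

char | my-upto-m,s first line
second line| type
```

When this is loaded from a file, the last two lines store both lines of text with an LF between them and then display the result. The sketch leaves a partially built mstring in data space if the delimiter is never found.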

m,s" m,s` ( "<lines><char>" -- s )

Versions of upto-m,s where char is the quote or back-tick as indicated by the name.

m"<lines><char>" ( "<lines><char>" char -- )   \ compile
     ( -- s )   \ run

Throw an error if not compiling. The interpretive version of this word is not implemented.

When compiling, store the input stream across lines into data space as an mstring according to the specification for upto-m,s, and append code to the definition which leaves its fstring representation on the data stack. Throw an error if the delimiting character is not found in the input stream.

m" m` ( "<lines><char>" -- )   \ compile
     ( -- s )   \ run

Immediate versions of m"<lines><char>" where char is the quote or back-tick as indicated by the name.

s-upto-m+, ( "<lines><pat.body>" pat.s m -- )

Assume that pat.s is not empty. Concatenate the input stream across lines, up to the first occurrence of the characters in pat.body, onto the data space mstring m, including line ends consisting of the host eol. Leave the input stream positioned just after the trailing character of the copy of pat.body found in the input stream. Abort if the next available data space is not contiguous with m, or if the pattern is not found.

s-upto-m,s ( "<lines><pat.body>" pat.s -- lines.s )

Assume that pat.s is not empty. Store the input stream across lines, up to the first occurrence of the pattern body, as an mstring in data space, including line ends consisting of the host eol. The fstring representation of the mstring is lines.s. Leave the input stream positioned just after the trailing character of the copy of pat.body found in the input stream. Abort if the pattern is not found.

|s|-upto-m,s ( "<lines><ws><pat.body><ws>" pat.s
     -- lines.s )

Assume that pat.s is nonempty. Store the input stream across lines as an mstring in data space, up to the first occurrence of the pattern body, delimited by whitespace or the beginning or end of a line, minus the last of any whitespace characters immediately preceding it. Line ends consisting of the host eol are included in the stored mstring. The fstring representation of the mstring is lines.s. Leave the input stream positioned just after the pattern and the first of any immediately trailing whitespace characters. Abort if the whitespace delimited pattern is not found.