Parsing Library

Version 1.0.0

David N. Williams

Last Revision: June 3, 2011


Copyright © 2010, 2011 David N. Williams

This document is licensed under the Creative Commons Attribution-Share Alike 2.5 License.

The library it describes is implemented as free software, which can be redistributed and/or modified under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or any later version.


Contents

0  Terminology
1  Introduction
2  Naming conventions
3  ANS Forth string words

3.1 tool belt 2002
string/  end-c@  skip  scan
trim|
aka trim  scan-back aka back  white? aka Is-White?
 
3.2 tool belt 2000
/split  s-starts aka starts?  s-ends aka ends?
 
3.3 other string extensions
cut-first  cut-last  keep-first  keep-last
skip-back  |trim  |trim|  seek
{}in  {}index  {}scan  {}seek  {}starts  {}ends
separate  {}separate  s-separate  s-after
first-word  separate-word
white-delimited?
pfe only
s-starts-nc  s-ends-nc
not implemented
starts  starts-nc  ends  ends-nc
{}skip  s-skip  s-scan  s-seek
 
4  Input stream words

4.1 parse area
parse-area@  parse-area!  empty-parse-area  parse-area-empty?
parse-name  preparse-name  parse-name-away  {}parse
>in++  >in--
tests only
>in+  >in-
pfe only
space|punct-parse
not implemented
s-parse  parse-away  {}parse-away
 
4.2 across lines or blocks
\\  ?emit-cr  next-instream-name
s-seek-instream  |s|-seek-instream  (*
uncommentable
((  ---
not implemented
skip-instream  scan-instream  seek-instream
{}skip-instream  {}scan-instream  {}seek-instream

0  Terminology

c[n]: The number of address units for n characters. The result of ( n) CHARS.

fstring: The ANS Forth pair representation (addr len) of a string, especially on the data stack. Note that len is a u measured in characters, not address units, and addr is a c-addr. This is the fundamental ANS Forth description of a string, and we often just say "string".

fstring body: The character data starting at addr and ending at addr + c[len] - 1, unless len is zero.

s or .s suffix: Short for the pair representation (addr len) of an fstring.

x.len: The length part of x = (addr len).

x.addr: The address part of x = (addr len).

s = s', or "s is the same as s'": If s is represented by (addr len) and s' by (addr' len'), then addr = c-addr' and u = u'. That is, the two fstrings have the same representation.

s' is a substring of s: addraddr'addr + c[len], and addr' + c[len'] &le addr + c[len].

s is equivalent to s': The lengths of s and s' are the same. The addresses of the string bodies need not be the same, but if the lengths are nonzero, the bodies have the same sequence of characters.

right empty string: For s = (addr len), the right empty string is (addr+c[len] 0), just off its end. According to the definition of a substring, the right empty string of s is a substring of s, even though there is no character of s at its address. That includes the case where s is empty.

left empty string: For s = (addr len), the left empty string is (addr 0), at its beginning. According to the definition of a substring, the left empty string is a substring of s, although there is no character of s at its address. That includes the case where s is empty.

Certain specifications in the Tool Belt imply one of the above when an empty string results. Our specs could have been written to be independent of such implementation details, but we found it clearer and simpler to adopt the same practice.

whitespace: Any of the ASCII characters 0x0 ... 0x20, called "ws" characters.

word in s: A substring of s satisfying all of the following:
1. it is not empty;
2. it contains no whitespace characters;
3. it is preceded by a whitespace character or is at the beginning of s;
4. it is followed by a whitespace character or is at the end of s.

input stream: The input source across lines.

"<name>", "<char>", "<pat>", "<eof>: In stack comments, the next word, character, string pattern, or end of file in the parse area, or in the input stream if the comment applies across lines.

cbuf, #cbuf: In stack comments cbuf is the address of a character aligned buffer. The size of the buffer in characters is #cbuf. This is distinct from mbuf and /mbuf used for measured strings in the mstrings library, where /mbuf is the size of a buffer structure instance.

1  Introduction

This library is for parsing ANS Forth strings, the parse area of the input stream, and the input stream across lines. Since their implementation is based on REFILL, words that act across lines are expected to work across blocks as well, but have not been tested with blocks. All words in the library ignore any counted or measured string model for storage in memory.

Much of the functionality of the glossary has been implemented independently by others. The most direct sources are our ^Forth Motorola 680x0 parsing words going back to before 1996, and Wil Baden's Tool Belt 2000 or 2002, with some name changes. All references to "Tool Belt 2002" in this file refer to the 2003-02-25 revision. A no doubt incomplete list of other sources includes Michael Gassanenko, George Hawkins, Bruce McFarling, Frederic Warren, and Leo Wong.

The library has two implementations:

At present the pfe module defines a few more words than the ANS Forth code. Each implementation defines its own version string, called PARSING-LIB-S, which is not mentioned in the glossaries below.

There are also Hayes-style tests based on Anton Ertl's ttester.

2  Naming conventions

Our naming policy tries to distinguish three kinds of parsing: ANS Forth strings, the parse area, and the input stream (across line ends). Words with "parse" or ">in" in the name parse the parse area. Words with "instream" in the name parse the input stream across lines or blocks. Words with neither, with exceptions like (*, parse strings.

Words that take a single character or whitespace as parsing delimiter have no prefix. Those that take a string pattern have the prefix "s-", and those that take any of a null-terminated set of characters have the prefix "{}".

In words with "scan" in the name, motion stops at the first character of the delimiter, and no flag is left. In words with "seek" in the name, motion stops at the first character after the delimiter, or off end, and a flag is left. In words with "separate" in the name, the delimiter is excluded from the output, and a flag is left.

3  ANS Forth string words

3.1  tool belt 2002

In the 2000 version of his Tool Belt, Wil Baden uses a BL prefix to distinguish words where whitespace is a delimiter, rather than the literal BL character. In the 2002 version he eliminates the BL prefix and does conditional processing for whitespace, using constructions like SCAN[ ... ]SCAN, based on his elegant COND ... THENS.

We see a point to maintaining distinct words, which are essentially factors in his consolidated definitions, but we have decided to follow the newer scheme. It seems simpler to us to maintain one slightly more complicated word than two slightly different words. In any case the words in question are good candidates for code words.

Although we do not use his COND ... THENS constructions, we generally advocate his suggestions for enhancing the control flow stack beyond the ANS Forth standard.

Not all of the Tool Belt string-parsing words are included here, and we change a few names. For some of his words, we revert to the Tool Belt 2000 specification; they are in the next section.

The definition of /string is included here for reference only.

/string ( addr len i -- addr+c[i] len-i ) ANS94

In the normal case, where i is nonnegative and len - i is nonnegative, the output string is the last len - i characters of the input string.

If i is nonnegative and len - i is negative, an ambiguous condition exists.

If i is negative, the input string consists of the last len-i characters of the output string, unless that doesn't make sense.

string/ ( addr len i -- addr+c[len-i] i ) 2002

In the normal case, where i is nonnegative and len - i is nonnegative, the output string has the last i characters of the input string.

end-c@ ( s -- char ) 2002

Assume the input string is not empty. Leave its last character.

skip ( addr len char -- addr+c[i] len-i ) 2002

Advance past leading occurrences of char, or whitespace in case char is BL. If there are no occurrences, the result is the right empty string.

scan ( addr len char -- addr+c[i] len-i ) 2002

Look for char in the input string, or whitespace if char is BL. The output string either begins at the first occurrence of char, respectively, whitespace, or is the right empty string.

trim| ( addr len -- addr len-i ) 2002 TRIM

Trim trailing whitespace from the input string. If the input is all whitespace, the result is the left empty string.

Note: We prefer the name TRIM| to the Tool Belt's TRIM (the same in 2000 and 2002), because a more natural meaning for TRIM might be to remove both leading and trailing whitespace. The | prefix/suffix for TRIM was used by George T. Hawkins in his 1987 FStrings Package. We adopt that practice. TRIM| is equivalent to the phrase:

     bl skip-back

scan-back ( addr len char -- addr len-i ) 2002 BACK

Scan the input string backwards for char, or whitespace if char is BL. The output string either ends with the first backward occurrence of char, respectively, whitespace, or is the left empty string.

Note: We prefer the name SCAN-BACK to the Tool Belt's BACK, because with no qualification, BACK could just as well mean SKIP-BACK.

white? ( char -- flag ) 2002 Is-White?

Leave true if char is whitespace, else leave false.

3.2  tool belt 2000

/split ( addr len addr+c[i] len-i
    -- addr+c[i] len-i addr i )
2000

Split the string (addr len) at the place given by (addr+c[i] len-i). Wil Baden's cut-split. Tool Belt 2002 makes this a deprecated synonym for CHOP.

s-starts ( s pat.s -- s flag ) 2000 STARTS?

Leave true if the string s starts with the character sequence in pat.s. The flag is true if pat.s is empty or if both s and pat.s are empty.

Note: Tool Belt 2002 changes this to a case-insensitive match, which we avoid here. Both the 2000 and 2002 implementations assume that pat.s.lens.len, but this implementation does not.

s-ends ( s pat.s -- s flag ) 2000 ENDS?

Leave true if the string s ends with the character sequence in pat.s. The flag is true if pat.s is empty or if both s and pat.s are empty.

Note: Tool Belt 2002 changes this to a case-insensitive match, which we avoid here. Both the 2000 and 2002 implementations assume that pat.s.lens.len, but this implementation does not.

3.3  other string extensions

cut-first
cut-last
keep-first
keep-last
( addr len i -- addr+c[i] len-i )
( addr len i -- addr len-i )
( addr len i -- addr i )
( addr len i -- addr+c[len-i] i )

The stack diagrams are self-explanatory, except that when the index i is negative for CUT-FIRST or CUT-LAST, characters are actually added.

An ambiguous condition exists for any of these words if the input or output string length is negative, or if the output string is longer than the input string and the extra characters are not known to be present in memory.

skip-back ( addr len char -- addr len-i )

Trim trailing instances of char from the input string, or of whitespace if char is BL. If (addr len) is all whitespace, the result is the left empty string.

|trim ( s -- s' )

Leave the fstring s' resulting from trimming leading whitespace from s. If s is all whitespace or empty, s' is its right empty string.

|trim| ( s -- s' )

Leave the fstring s' resulting from trimming leading and trailing whitespace from s. If s is all whitespace or empty, s' is its right empty string.

seek ( addr len char -- addr+c[i] len-i found? )

Look for char in the input string, or whitespace if char is BL. If the character is found, the output string begins just after its first occurrence, or is the right empty string, and found? is true. Else the output string is the right empty string and found? is false.

{}in ( char &chars -- char|0 )

Leave char if it is in the null-terminated list of characters starting at &chars. Otherwise leave zero.

{}index ( char &chars -- index|-1 )

If char is in the null-terminated list of characters at &chars, leave its index. Else leave -1. The first character has index zero.

{}scan ( s &chars -- s' )

Leave the substring s' of s starting at the first occurrence of one of the characters in the null-terminated string starting at &chars. If no such character is found and s is not empty, s' is empty. If s is empty, s' is s.

If BL is in the set of delimiters, it is treated literally, not as a shorthand for whitespace.

In this implementation, when empty, s' is the right empty string.

{}seek ( s &chars -- s' char|0 )

Scan s for the first occurrence of any of the characters in the null-terminated string starting at &chars. If the character is found, leave it, with s' the part of s that follows it, or the right empty string if it is found at the end of s. If s is empty, leave s and zero. If s is nonempty and the character is not found, leave zero and the right empty string.

{}starts ( s &chars -- s char|0 )

If the string is not empty and its first character is in the null-terminated list of characters, leave the character and the original string. Else leave zero and the original string.

{}ends ( s &chars -- s char|0 )

If the string is not empty and its last character is in the null-terminated list of characters, leave the character and the original string. Else leave zero and the original string.

separate ( s char -- after.s before.s true | s false )

Scan s for the first occurrence of char, or white space in case char is BL. If found, leave true and the strings after and before char, which may be the right, respectively, and/or left, empty string. If not found, leave s and false.

{}separate ( s &chars -- after.s before.s char | s 0 )

Scan s for the first occurrence of any of the null-terminated sequence of characters at &chars. If a character is found, leave it as char, and leave the strings after and before, which may be the right, respectively, and/or left, empty string. If not found, leave the original string and zero.

s-separate ( s pat.s -- after.s before.s true | s false )

Scan s for the first occurrence of the pattern string. If found, leave true and the strings before and after the pattern. If not, leave false and the original string.

The result for an empty pat.s is to leave true, with s for after.s and the left empty string of s for before.s.

s-after ( s s' -- after.s )

Assume s' to be a substring of s. Leave after.s, the substring of s that follows s'. The calculation is:

    after.s = (addr'+c[len'] [addr+c[len]-c[len']-addr']/c[1])

first-word ( s -- word.s | right.empty.s )

Leave the first word in s. If there is none, leave the right empty string.

separate-word ( s -- after.s word.s )

Leave word.s as a substring of s, the first word in s. Leave after.s as the substring of s that follows the word, after skipping the first of any whitespace characters that trail the word. If after.s or word.s is empty, it is the right empty string.

white-delimited? ( s cbuf #cbuf -- flag )

Assume s to be a substring of the characters in the buffer, including the left and right empty strings. If the beginning or end of s is either at the beginning or end of the buffer, or is preceded or followed, respectively, by a whitespace character in the buffer, leave true. Else leave false.

The following words are currently defined only in the pfe external parsing module.

s-starts-nc ( s uc.pat.s -- s flag )

Leave true if the string s starts with uc.pat.s. The pattern string is assumed to be all upper case. Characters at the beginning of s are forced to upper case before comparison, without changing s.

Note: The flag is true if pat.s is empty or if both s and pat.s are empty.

s-ends-nc ( s uc.pat.s -- s flag )

Leave true if the string s ends with uc.pat.s. The pattern string is assumed to be all upper case. Characters at the beginning of s are forced to upper case before comparison, without changing s.

Note: The flag is true if pat.s is empty or if both s and pat.s are empty.

The following words are currently not implemented.

starts ( s char -- s flag )
starts-nc ( s uc.char -- s flag )
ends ( s char -- s flag )
ends-nc ( s uc.char -- s flag )
{}skip ( addr len &chars -- addr+c[i] len-i )
s-skip ( addr len pat.s -- addr+c[i] len-i )
s-scan ( addr len pat.s -- addr+c[i] len-i )
s-seek ( addr len pat.s -- addr+c[i] len-i flag )

4  Input stream words

4.1  parse area

The words PARSE-AREA@ and PARSE-AREA! provide the basic interface for operating on the parse area with fstring words. For example:

   : parse-area-skip  ( char -- )
     ( char) >r parse-area@ r> skip parse-area! ;
parse-area@ ( -- unparsed.s )

Leave the parse area portion of the input buffer as a string. Factored from Bernd Paysan's version of Frederick Warren's $>, comp.lang.forth, March, 2000. Named by Michael Gassanenko.

parse-area! ( unparsed.s -- )

The unparsed.s string is assumed to lie at the end of the input buffer. Advance the input stream so that unparsed.s is the parse area. Named by Michael Gassanenko.

empty-parse-area ( -- )

Exhaust the input buffer.

Note: REFILL can't always be used for this. When the source is a string, REFILL does nothing. When it is a file, one may not want the current input buffer to be written over.

parse-area-empty? ( -- flag )

The flag is true if the parse area is empty, else false.

parse-name ( "<name>" -- word.s ) CORE EXT

Leave the next word in the parse area, and advance the input stream to just after the first of any trailing whitespace characters. If there is no word in the parse area, word.s is the right empty string in this implementation, and the parse area is left exhausted.

Note: Forth 200x does not require that the empty string be the right empty string. The native version of PARSE-NAME does produce the right empty string for gforth and pfe, but not always for iForth.

preparse-name ( "<name>" -- word.s )

Get the next word in the parse area without advancing the input stream. Leo Wong's PREPARSE.

parse-name-away ( "<name>" -- )

When the parse area is not all white space, parse away the next word and the first of any trailing whitespace characters. Otherwise empty the parse area. A logical companion to PREPARSE-NAME.

{}parse ( "<char>" &chars -- parsed.s char|0 )

Scan the parse area for any of the null-terminated sequence of characters at &chars. If a delimiting character is found, leave it as char. If a delimiting character is not found, leave zero instead.

If a delimiting character is found, the parse area is left positioned just after it, and parsed.s represents the former part of the parse area up to just before char. If a delimiting character is not found, the input stream is left exhausted, and parsed.s represents the former parse area.

>in++ ( -- )

Step the parse area one character forward, unless it is exhausted.

>in-- ( -- )

Step the parse area one character back, unless it is positioned at the start of the input buffer.

The next two words are currently defined only in parsing-test.fs.

>in+ ( u -- )

Advance the parse area by the minimum of u characters or its size.

>in- ( u -- )

Step the parse area back by the minimum of u or the number of characters preceding it in the input buffer.

The following word is currently defined only in the pfe external parsing module.

space|punct-parse ( "<char>"-- parsed.s found.char|0 )

Scan the current SOURCE buffer for a whitespace or any printing character except a letter or digit, beginning at the current position in the buffer. If found, leave it as found.char, with blank substituted for any whitespace character, else leave zero.

If one of the characters is found, the input stream is left positioned just after it, and parsed.s is the initially unparsed part of the input stream up to just before found.char.

If none is found, the input stream is left exhausted, and parsed.s is the initially unparsed part of the input stream.

The following words are currently not implemented.

s-parse ( "<pat>" pat.s -- parsed.s flag )
parse-away ( "<char>" char -- )
{}parse-away ( "<char>" &chars -- )

4.2  across lines or blocks

With the exception of \\, all across-line words in this section echo CR's when the lines are being entered interactively at the terminal. That feature has been tested by copying and pasting the file parsing-test.fs into a terminal window in which pfe or gforth is running. See parsing-test.fs for an explanation.

\\ ( "<eof>" -- ) 2002

Advance to the end of the input stream. The implementation uses REFILL, which does nothing but return false when the source is a string, so \\ covers the string case, too. It should not be used when the input source is the terminal, as then it causes an infinite loop.

?emit-cr ( -- )

Emit a CR if the input source is the terminal.

next-instream-name ( "<name>" -- word.s ) 2002 Next-Word

Leave the next word in the input stream, and advance the input stream to just after the first of any trailing whitespace characters. If word.s is empty, at most whitespace was found. A parsing implementation of Wil Baden's Next-Word, plus conditional echoing of CR's.

The behavior of S-SEEK-INSTREAM and |S|-SEEK-INSTREAM for empty string input is not specified below, but should be derivable from that of SEARCH. We have not thought through what might be a desirable behavior.

s-seek-instream ( "<pat>" pat.s -- flag )

Assume pat.s to be nonempty. Advance the input stream to just after the next occurrence of the string, matching embedded whitespace characters exactly, and leave true if found. If the string is not found after REFILL attempts, empty the parse area and leave false.

|s|-seek-instream ( "<pat>" pat.s -- flag )

Assume pat.s to be nonempty. Advance the input stream to just after the next occurrence of the string, matching embedded whitespace characters exactly, delimited by whitespace on both ends; and leave true if found. If the delimited string is not found after REFILL attempts, empty the parse area and leave false.

(* ( -- )

Advance the input stream to just after next occurrence of the whitespace delimited string "*)".

The point of making comment terminators whitespace delimited is explained by Stephen Pelc in comp.lang.forth, Multi-line comment markers, August 9, 2005:

It then makes sense to require the comment block end marker also to be whitespace delimited to prevent problems with Forth word names in the text.

The following two words are currently uncommentable.

((
---
( -- )
( -- )

Advance the input stream to just after next occurrence of the whitespace delimited string "))", respectively, "---".

The following words are currently not implemented.

skip-instream ( "<char>" char -- )
scan-instream ( "<char>" char -- )
seek-instream ( "<char>" char -- flag )
{}skip-instream ( "<char>" &chars -- char|0 )
{}scan-instream ( "<char>" &chars -- char|0 )
{}seek-instream ( "<char>" &chars -- char|0 )