( Title: Parsing ANS Forth Strings and the Input Stream File: parsing.fs Test file: parsing-test.fs Log File: parsing.log Author: David N. Williams License: LGPL Version: 0.8.3 Revised: March 22, 2009 Some of this file is derived from Wil Baden's ToolBelt, which we believe to be in the public domain. It also contains words from various other sources, which we believe to be compatible with the LGPL. For the sake of the LGPL, the rest is ) \ Copyright (C) 1996-2009 David N. Williams ( This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or at your option any later version. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. If you take advantage of the option in the LGPL to put a particular version of this library under the GPL, the author[s] would regard it as polite if you would put any direct modifications under the LGPL as well, and include a copy of this request near the beginning of the modified library source. A "direct modification" is one that enhances or extends the library in line with its original concept, as opposed to developing a distinct application or library which might use it. This file is mostly ANS Forth compatible up to case dependence and [UNDEFINED]. This library is for parsing ANS Forth strings, the parse area of the input stream, and the input stream across lines or blocks. Part of it is based on ^Forth Motorola 680x0 parsing words going back to before 1996. Other parts are from Wil Baden's ToolBelt 2000 or 2002, with some name changes. All references to "ToolBelt 2002" in this file refer to the 2003-02-25 revision. The words in this library ignore any counted or measured string model for storage in memory. See the file parsing-ext.c for a C implementation as a pfe external module, and the file parsing-test.fs for Hayes-style tests based on Anton Ertl's ttester.fs. Since they are based on REFILL, words that act across lines are expected to work across blocks as well, but have not been tested with blocks. The code is intended to be character clean. ) decimal : PARSING-LIB-S ( -- version.s ) s" ANS Forth parsing library dnw-0.8.3" ; \ *** ANS FORTH STRING WORDS ( ToolBelt 2002: STRING/ END-C@ SKIP SCAN TRIM| [aka TRIM] SCAN-BACK [aka BACK] WHITE? [aka Is-White] ToolBelt 2000: /SPLIT S-STARTS [aka STARTS?] S-ENDS [aka ENDS?] Other string extensions: CUT-FIRST CUT-LAST KEEP-FIRST KEEP-LAST SKIP-BACK |TRIM |TRIM| SEEK {}IN {}SCAN {}SEEK {}STARTS {}ENDS SEPARATE {}SEPARATE S-SEPARATE S-AFTER FIRST-WORD SEPARATE-WORD WHITE-DELIMITED? Not implemented: STARTS STARTS-NC ENDS ENDS-NC {}SKIP S-SKIP S-SCAN S-SEEK ) \ *** INPUT STREAM WORDS ( Parse area: PARSE-AREA@ PARSE-AREA! EMPTY-PARSE-AREA PARSE-AREA-EMPTY? PARSE-NAME PREPARSE-NAME PARSE-NAME-AWAY >IN++ >IN-- Defined in parsing-test.fs: >IN+ >IN- Not implemented: SPACE|PUNCT-PARSE {}PARSE S-PARSE PARSE-AWAY {}PARSE-AWAY Across lines or blocks: \\ ?EMIT-CR NEXT-INSTREAM-NAME S-SEEK-INSTREAM |S|-SEEK-INSTREAM (* Not implemented: SKIP-INSTREAM SCAN-INSTREAM SEEK-INSTREAM {}SKIP-INSTREAM {}SCAN-INSTREAM {}SEEK-INSTREAM Uncommentable: (( --- ) \ *** UTILITY WORDS [UNDEFINED] -rot [IF] : -rot ( a b c -- c a b ) rot rot ; [THEN] [UNDEFINED] rdrop [IF] : rdrop ( r: a -- ) s" r> drop" evaluate ; immediate [THEN] [UNDEFINED] 2rdrop [IF] : 2rdrop ( r: a b -- ) s" 2r> 2drop" evaluate ; immediate [THEN] [UNDEFINED] char- [IF] : char- ( c-addr -- c-addr' ) [ 1 chars ] literal - ; [THEN] [UNDEFINED] char/ [IF] : char/ ( c[x] -- x ) [ 1 chars ] literal / ; [THEN] [UNDEFINED] chars+ [IF] : chars+ ( x n -- x+c[n] ) chars + ; [THEN] [UNDEFINED] chars- [IF] : chars- ( x n -- x-c[n] ) chars - ; [THEN] [UNDEFINED] [VOID] [IF] : [VOID] ( -- false ) false ; immediate [THEN] [UNDEFINED] [TRUE] [IF] : [TRUE] ( -- true ) true ; immediate [THEN] \ *** TABLE OF CONTENTS ( 0 NOTATION AND TERMINOLOGY 1 ANS FORTH STRINGS 1.1 TOOLBELT 2002 1.2 TOOLBELT 2000 1.3 OTHER STRING EXTENSIONS 2 INPUT STREAM 2.1 PARSE AREA 2.2 ACROSS LINES OR BLOCKS ) \ *** 0 NOTATION AND TERMINOLOGY ( c[n]: The number of address units for n characters. The result of CHARS. fstring: ANS Forth string represented as [addr len], where len is the number of characters and is assumed to be nonnegative. Most often we just say "string". s, or .s suffix: Short for the pair representation [addr len], an fstring. x.len: The len part of x = [addr len]. x.addr: The addr part of x = [addr len]. s = s', or "s is the same as s'": If s is represented by [addr len] and s' by [addr' len'], then addr = addr' and len = len'. That is, the two fstrings have the same representation. s' is a substring of s: addr <= addr' <= addr+c[len] and addr'+c[len'] <= addr+c[len]. s is equivalent to s': The lengths of s and s' are the same. The addresses of the string bodies need not be the same, but if the lengths are nonzero, the bodies have the same sequence of characters. right empty string: For s = [addr len], the right empty string is [addr+c[len] 0], just off its end. According to the definition of a substring, the right empty string of s is a substring of s, even though there is no character of s at its address. That includes the case where s is empty. left empty string: For s = [addr len], the left empty string is [addr 0], at its beginning. According to the definition of a substring, the left empty string is a substring of s, although there is no character of s at its address. That includes the case where s is empty. Certain specifications in the ToolBelt imply one of these when an empty string results. Our specs could have been written to be independent of such implementation details, but we find it clearer and simpler to adopt the same practice. whitespace: In this implementation, any of the ASCII characters 0x0 ... 0x20. word in s: A substring of s satisfying all of the following: 1. it is not empty; 2. it contains no whitespace characters; 3. it is preceded by a whitespace character or is at the beginning of s; 4. it is followed by a whitespace character or is at the end of s. "": In stack comments, the next word in the parse area, or the next word in the input stream if the comment applies across lines. cbuf, #cbuf: In stack comments cbuf is the address of a character aligned buffer. The size of the buffer in characters is #cbuf. This is distinct from "mbuf" and "/mbuf" used for measured strings in mstrings.fs, where /mbuf is the size of a structure instance. ) \ *** 1 ANS FORTH STRINGS \ *** 1.1 TOOLBELT 2002 ( In the 2000 version of his ToolBelt, Wil Baden uses a BL prefix to distinguish words where whitespace is a delimiter, rather than the literal BL character. In the 2002 version he eliminates the BL prefix and does conditional processing for whitespace, using constructions like SCAN[ ... ]SCAN, based on his elegant COND ... THENS. We see a point to maintaining distinct words, which are essentially factors in his consolidated definitions, but we have decided to follow the newer scheme. It seems simpler to us to maintain one slightly more complicated word than two slightly different words. In any case the words in question are good candidates for code words. Although we do not use his COND ... THENS constructions, we generally advocate his suggestions for enhancing the control flow stack beyond the ANS Forth standard. Not all of the ToolBelt string-parsing words are included here, and we change a few names. For some of his words, we revert to the ToolBelt 2000 specification; they are in the next section. ) 0 [IF] \ for reference : /string ( addr len i -- addr+c[i] len-i ) \ ANS 94 ( In the normal case, where i is nonnegative and len-i is nonnegative, the output string is the last len-i characters of the input string. If i is nonnegative and len-i is negative, an ambiguous condition exists. If i is negative, the input string consists of the last len-i characters of the output string, unless that doesn't make sense. ) [THEN] : string/ ( addr len i -- addr+c[len-i] i ) \ 2002 ( In the "normal case", where i is nonnegative and len-i is nonnegative, the output string has the last i characters of the input string. ) \ >r + r@ - r> ; \ toolbelt version not char clean dup >r - chars+ r> ; : end-c@ ( s -- char ) \ 2002 ( Assume the input string is not empty. Leave its last character. ) 1- chars+ c@ ; : white? ( char -- flag ) 33 u< ; \ 2002 Is-White : skip ( addr len char -- addr+c[i] len-i ) \ 2002 ( Advance past leading occurrences of char, or whitespace in case char is BL. If there are no occurrences, the result is the right empty string. ) ( char) dup bl = IF drop BEGIN dup WHILE over c@ white? WHILE 1 /string REPEAT THEN ELSE ( char) >r BEGIN dup WHILE over c@ r@ = WHILE 1 /string REPEAT THEN rdrop THEN ; : scan ( addr len char -- addr+c[i] len-i ) \ 2002 ( Look for char in the input string, or whitespace if char is BL. The output string either begins at the first occurrence of char, respectively, whitespace, or is the right empty string. ) ( char) dup bl = IF drop BEGIN dup WHILE over c@ white? 0= WHILE 1 /string REPEAT THEN ELSE ( char) >r BEGIN dup WHILE over c@ r@ - WHILE 1 /string REPEAT THEN rdrop THEN ; : trim| ( addr len -- addr len-i ) \ 2002 TRIM ( Trim trailing whitespace from the input string. If the input is all whitespace, the result is the left empty string. Note: We prefer the name TRIM| to the ToolBelt's TRIM [the same in 2000 and 2002], because a more natural meaning for TRIM might be to remove both leading and trailing whitespace. The "|" prefix/suffix for "TRIM" was used by George T. Hawkins in his 1987 FStrings Package: http://www-personal.umich.edu/~williams/archive/forth/strings/fstrings/fstrings.txt We adopt that practice. TRIM| is equivalent to the phrase: BL SKIP-BACK ) BEGIN ( len) dup WHILE 1- 2dup chars+ c@ white? 0= UNTIL 1+ THEN ; : scan-back ( addr len char -- addr len-i ) \ 2002 BACK ( Scan the input string backwards for char, or whitespace if char is BL. The output string either ends with the first backward occurrence of char, respectively, whitespace, or is the left empty string. Note: We prefer the name SCAN-BACK to the ToolBelt's BACK, because with no qualification, BACK could just as well mean SKIP-BACK. ) ( char) dup bl = IF drop BEGIN ( len) dup WHILE 1- 2dup chars+ c@ white? UNTIL 1+ THEN ELSE ( char) >r BEGIN ( len) dup WHILE 1- 2dup chars+ c@ r@ = UNTIL 1+ THEN rdrop THEN ; \ *** 1.2 TOOLBELT 2000 : /split ( addr len addr+c[i] len-i -- addr+c[i] len-i addr i ) \ 2000 ( Split the character string [addr len] at the place given by [addr+c[i] len-i]. Wil Baden's "cut-split". ToolBelt 2002 makes this a deprecated synonym for CHOP. ) dup >r 2swap r> - ; : s-starts ( s pat.s -- s flag ) \ 2000 STARTS? ( Leave true if the string s starts with the character sequence in pat.s. The flag is true if pat.s is empty or if both s and pat.s are empty. Note: ToolBelt 2002 changes this to a case-insensitive match, which we avoid here. Both the 2000 and 2002 implementations assume that pat.s.len <= s.len, but this implementation does not. ) ( pat.len) dup >r 2over ( s pat.s s) dup r@ ( s.len pat.s.len) u< IF 2drop 2drop rdrop false ELSE r> min compare 0= THEN ; : s-ends ( s pat.s -- s flag ) \ 2000 ENDS? ( Leave true if the string s ends with the character sequence in pat.s. The flag is true if pat.s is empty or if both s and pat.s are empty. Note: ToolBelt 2002 changes this to a case-insensitive match, which we avoid here. Both the 2000 and 2002 implementations assume that pat.s.len <= s.len, but this implementation does not. ) ( pat.len) dup >r 2over ( s pat.s s) dup r@ ( s.len pat.s.len) u< IF 2drop 2drop rdrop false ELSE r> string/ compare 0= THEN ; \ *** 1.3 OTHER STRING EXTENSIONS \ Assume that len and the output length are nonnegative: : cut-first ( addr len i -- addr+c[i] len-i ) /string ; : cut-last ( addr len i -- addr len-i ) - ; : keep-first ( addr len i -- addr i ) nip ; : keep-last ( addr len i -- addr+c[len-i] i ) string/ ; : skip-back ( addr len char -- addr len-i | empty.s ) ( Trim trailing instances of char from the input string, or of whitespace if char is BL. If [addr len] is all whitespace or empty, the result is empty.s = [addr 0]. ) ( char) dup bl = IF drop BEGIN ( len) dup WHILE 1- 2dup chars+ c@ white? 0= UNTIL 1+ THEN ELSE ( char) >r BEGIN ( len) dup WHILE 1- 2dup chars+ c@ r@ <> UNTIL 1+ THEN rdrop THEN ; : |trim ( s -- s' ) ( Leave the fstring s' resulting from trimming leading whitespace from s. If s is all whitespace or empty, s' is its right empty string. ) bl skip ; : |trim| ( s -- s' ) ( Leave the fstring s' resulting from trimming leading and trailing whitespace from s. If s is all whitespace or empty, s' is its right empty string. ) |trim trim| ; : seek ( addr len char -- addr+c[i] len-i found? ) ( Look for char in the input string, or whitespace if char is BL. If the character is found, the output string begins just after its first occurrence, or is the right empty string, and found? is true. Else the output string is the right empty string and found? is false. ) scan dup IF 1 /string true ELSE false THEN ; : {}in ( char &chars -- char|0 ) ( Leave char if it is in the null-terminated list of characters starting at &chars. Otherwise leave zero. ) ( &chars) >r BEGIN ( char) r@ c@ dup WHILE r> char+ >r over = UNTIL ELSE 2drop 0 THEN rdrop ; : {}scan ( s &chars -- s' ) ( Leave the substring s' of s starting at the first occurrence of one of the characters in the null-terminated string starting at &chars. If no such character is found and s is not empty, s' is empty. If s is empty, s' is s. If BL is in the set of delimiters, it is treated literally, not as a shorthand for whitespace. In this implementation, when empty, s' is the right empty string. ) ( &chars) >r BEGIN ( len) dup WHILE over c@ r@ {}in 0= WHILE 1 /string REPEAT THEN rdrop ; : {}seek ( s &chars -- s' char|0 ) ( Scan s for the first occurrence of any of the characters in the null-terminated string starting at &chars. If the character is found, leave it, with s' the part of s that follows it, or the right empty string if it is found at the end of s. If s is empty, leave s and zero. If s is nonempty and the character is not found, leave zero and the right empty string. ) {}scan dup IF over c@ >r 1 /string r> ELSE 0 THEN ; : {}starts ( s &chars -- s char|0 ) ( If the string is not empty and its first character is in the null-terminated list of characters, leave the character and the original string. Else leave zero and the original string. ) ( &chars) >r 2dup IF c@ r> {}in ELSE drop rdrop 0 THEN ; : {}ends ( s &chars -- s char|0 ) ( If the string is not empty and its last character is in the null-terminated list of characters, leave the character and the original string. Else leave zero and the original string. ) ( &chars) >r 2dup IF over 1- chars+ c@ r> {}in ELSE drop rdrop 0 THEN ; ( Our basic delimiter split word for strings is SEPARATE, which is analogous to PARSE for the input stream. Conceptually they are the same: extract the first factor and include motion to the second factor. And both are open splits, where the delimiting character or string is not included in the factors. ) : separate ( s char -- after.s before.s true | s false ) ( Scan s for the first occurrence of char, or white space in case char is BL. If found, leave true and the strings after and before char, which may be the right, respectively, and/or left, empty string. If not found, leave s and false. ) ( char) >r ( s) 2dup r> scan ( s [char+after].s) dup IF 1 /string ( s after.s) /split ( after.s [before+char].s) 1- true ELSE 2drop false THEN ; : {}separate ( s &chars -- after.s before.s char | s 0 ) ( Scan s for the first occurrence of any of the null-terminated sequence of characters at &chars. If a character is found, leave it as char, and leave the strings after and before, which may be the right, respectively, and/or left, empty string. If not found, leave the original string and zero. ) ( &chars) >r ( s) 2dup r> {}seek ( char|0) dup IF ( char) >r ( s after.s) /split 1- r> ELSE -rot 2drop THEN ; : s-separate ( s pat.s -- after.s before.s true | s false ) ( Scan s for the first occurrence of the pattern string. If found, leave true and the strings before and after the pattern. If not, leave false and the original string. If SEARCH has the behavior assumed in the preamble, the result for an empty pat.s is to leave true, with s for after.s and the left empty string of s fore before.s. ) locals| pat.len pat.addr s.len s.addr | s.addr s.len pat.addr pat.len search ( &found #rem true | s false ) IF ( &found #rem) dup >r ( &found #rem) pat.len /string ( after.s) s.addr s.len r> ( #rem) - ( before.s) true ELSE ( s) false THEN ; : s-after ( s s' -- after.s) ( Assume s' to be a substring of s. Leave after.s, the substring of s that follows s'. The calculation is after.s = [addr'+c[len'],[addr+c[len]-c[len']-addr']/c[1]]. ) chars+ -rot chars+ ( addr'+c[len'] addr+c[len]) over - char/ ; : first-word ( s -- word.s | right.empty.s ) ( Leave the first word in s. If there is none, leave the right empty string. ) bl skip 2dup bl scan nip - ; : separate-word ( s -- after.s word.s ) ( Leave word.s as a substring of s, the first word in s. Leave after.s as the substring of s that follows the word, after skipping the first of any whitespace characters that trail the word. If after.s or word.s is empty, it is the right empty string. ) 2dup first-word 2dup 2>r s-after ( after.len) dup IF 1 /string THEN 2r> ; : white-delimited? ( s cbuf #cbuf -- flag ) ( Assume that s is a substring of the characters in the buffer, including the left and right empty strings. If the beginning or end of s is either at the beginning or end of the buffer, or is preceded or followed, respectively, by a whitespace character in the buffer, leave true. Else leave false. ) LOCALS| #cbuf cbuf len addr | addr cbuf <> IF addr char- c@ white? ELSE true THEN addr len chars+ dup cbuf #cbuf chars+ <> IF c@ white? ELSE drop true THEN and ; \ *** 2 INPUT STREAM \ *** 2.1 PARSE AREA 0 [IF] The words PARSE-AREA@ and PARSE-AREA! provide the basic interface for operating on the parse area with fstring words. For example, : parse-area-skip ( char -- ) \ aka PARSE-AWAY ( char) >r parse-area@ r> skip parse-area! ; It is left to the user to define such words as needed. [THEN] : parse-area@ ( -- unparsed.s ) ( Leave the parse area portion of the input buffer as a string. Factored from Bernd Paysan's version of Frederick Warren's $>, comp.lang.forth, March, 2000. Named by Michael Gassanenko. ) source ( &inbuf #inbuf) >in @ ( #parsed) /string ; : parse-area! ( unparsed.s -- ) ( The unparsed.s string is assumed to lie at the end of the input buffer. Advance the input stream so that unparsed.s is the parse area. Named by Michael Gassanenko. ) ( addr len) source nip ( #tib) swap - >in ! ( addr) drop ; : empty-parse-area ( -- ) ( Exhaust the input buffer. Note: REFILL can't always be used for this. When the source is a string, REFILL does nothing. When it is a file, one may not want the current input buffer to be written over. ) source ( addr #chars) >in ! ( addr) drop ; : parse-area-empty? ( -- flag ) parse-area@ nip 0= ; : >in++ ( -- ) ( Step the parse area one character forward, unless it is exhausted. ) source nip >in @ 1+ min >in ! ; : >in-- ( -- ) ( Step the parse area one character back, unless it is positioned at the start of the input buffer. ) >in @ 1- 0 max >in ! ; : parse-name ( "" -- word.s ) ( Leave the next word in the parse area, and advance the input stream to just after the first of any trailing whitespace characters. If there is no word in the parse area, word.s is empty [the right empty string in this implementation], and the parse area is left exhausted. This is the same as the specification in the ANS Forth standard informative annex A.6.2.2008 for PARSE-WORD. It has been proposed by Anton Ertl for the future standard with the new name, because the old name has conflicting common practice. ) parse-area@ separate-word 2swap parse-area! ; : preparse-name ( "" -- word.s ) ( Get the next word in the parse area without advancing the input stream. Leo Wong's PREPARSE. ) >in @ >r parse-name r> >in ! ; : parse-name-away ( "" -- ) ( When the parse area is not all white space, parse away the next word and the first of any trailing whitespace characters. Otherwise empty the parse area. A logical companion to PREPARSE-NAME. ) parse-name 2drop ; \ *** 2.2 ACROSS LINES OR BLOCKS ( With the exception of \\, all across-line words in this section echo CR's when the lines are being entered interactively at the terminal. That feature has been tested by copying and pasting the file parsing-test.fs into a terminal window in which pfe or gforth is running. See parsing-test.fs for an explanation. The behavior of S-SEEK-INSTREAM and |S|-SEEK-INSTREAM for the empty string is not specified below, but should be derivable from that of SEARCH. We have not thought through what might be a desirable behavior. ) : \\ ( "..." -- ) \ 2002 ( Since REFILL does nothing but return false when the source is a string, the following covers the string case, too. It should not be used when the input source is the terminal, as then it causes an infinite loop. We learned to recognize the equivalence of the phrase "-1 PARSE 2DROP" to EMPTY-PARSE-AREA [inside the loop in the ToolBelt version] from Bruce McFarling, Factoring and Visibility [redux], nntp://comp.lang.forth, 12/4/06. ) empty-parse-area BEGIN refill 0= UNTIL ; : ?emit-cr ( -- ) source-id 0= IF cr THEN ; : next-instream-name ( "" -- word.s ) \ Wil Baden's NEXT-WORD ( Leave the next word in the input stream, and advance the input stream to just after the first of any trailing whitespace characters. If word.s is empty, at most whitespace was found. A parsing implementation of Wil Baden's NEXT-WORD, plus conditional echoing of CR's. ) BEGIN parse-name dup IF EXIT THEN ?emit-cr refill WHILE 2drop REPEAT ; : s-seek-instream ( s -- flag ) ( Advance the input stream to just after the next occurrence of the string across lines, matching embedded whitespace characters exactly, and leave true if found. If the string is not found after REFILL attempts, empty the parse area and leave false. ) ( s) 2>r BEGIN parse-area@ 2r@ ( pa.s s) search ( pa.s false | &found #rem true) 0= WHILE \ not found ( pa.s) 2drop ?emit-cr refill 0= IF \ end of file 2rdrop empty-parse-area false EXIT THEN REPEAT \ found r> /string parse-area! rdrop true ; : |s|-seek-instream ( s -- flag ) ( Advance the input stream to just after the next occurrence of the string across lines, matching embedded whitespace characters exactly, delimited by whitespace on both ends; and leave true if found. If the delimited string is not found after REFILL attempts, empty the parse area and leave false. ) ( s) 2>r BEGIN 2r@ s-seek-instream WHILE parse-area@ drop r@ chars- r@ source white-delimited? IF 2rdrop true EXIT THEN >in++ \ jog for next seek REPEAT 2rdrop false ; : (* ( -- ) s" *)" |s|-seek-instream 0= ABORT" ***Missing *) terminator"; immediate (* The ABORT" in the definition above might catch a nonwhitespace delimited "*)". The point of making comment terminators whitespace delimited is explained by Stephen Pelc in comp.lang.forth, "Multi-line comment markers", August 9, 2005: It then makes sense to require the comment block end marker also to be whitespace delimited to prevent problems with Forth word names in the text. *) \ : (( ( -- ) s" ))" |s|-seek-instream drop ; immediate \ : --- ( -- ) s" ---" |s|-seek-instream drop ; immediate