GSTRINGS -- A Proposed Incremental Improvement Over FSTRINGS By Rj Brown Elijah Laboratories Inc. 201 West High Street P. O. Box 833 Warsaw KY 41095 1 606 567-4613 February 6, 1988 Abstract This paper describes a proposed string handling package for the Forth computer programming language. The package had not been written as of the date of the paper. Readers' comments and suggestions are welcome and would be appreciated. Before a package as ambitious as this is implemented, it is good to get a design review from one's peers. Please address all responses to this paper to RJ BROWN c/o the East Coast Forth Board (ECFB) 1 703 442-8695. Introduction George Hawkins has proposed a set of string operators for the Forth language that has been well received in the Forth community [Tracey, Martin. DDJ Dec 87, p 149]. His FORTH-83 standard implementation of FSTRINGS is in the public domain and is available by downloading FSTRINGS.ARC. He has also implemented the same word set as CODE words for MasterFORTH. Mr. Hawkins' word naming conventions and the functions they perform are extremely well thought out. His definition of what the words do is beautifully isolated from the details of the implementation itself. This is precisely what makes the package so useful: it may be ported from one system to another, with the local implementation taking advantage of the peculiarities of the host environment, without any visibility of functional difference on the part of the applications programmer as he ports his application from one system to another. The baseline FORTH-83 compatible implementation of FSTRINGS is rather conventional in its method of storing strings: the burden is placed on the user to allocate enough room for any string that his code might place in a string variable. This is similar to the way C does it, and not unlike many old string manipulation packages that were written in FORTRAN back in the good old days. The actual manipulations are performed by shuffling bytes around in these string variable buffers. Proposal The proposed GSTRINGS package would use the Hawkins word set for the most part, but the implementation would be quite different. Since FSTRINGS uses the $ (a shift-4) as the initial character in all of the words in its user-visable word set, GSTRINGS would use % (a shift-5) to avoid confusion and to allow the two packages to coexist. Indeed, GSTRINGS will probably use FSTRINGS in its own implementation! Instead of copying and moving bytes to effect most of the string manipulation, this would be done by manipulating pointers. The application view of a string would be a single Forth memory cell that would contain an offset into a String Control Region (SCR) (which could be implemented on Intel 80x8x family processors as a separate memory segment). The SCR would contain a set of String Indirect Reference nodes (SIRs). The first such node would point to the next one chronologically used, and that to the next one after it, etc. This way, the package always knows all the SIRs currently in use. SIRs would be dynamically allocated by the package as needed. One cause for allocation would be a user request to create a string. Each SIR would have a SIR Active Link field (SAL) to chain active SIRs together, a String Actual Reference (SAR), and a String Deferred Reference (SDR). The names SAR and CDR are deliberately reminiscient of the CAR and CDR of Lisp, for reasons that will soon become clear. The SAR and SDR are identical structures: String Identifiers (SIDs). The SID is composed of a pointer and a length. The pointer is the offset of the start of the actual string buffer in the String Dynamic Pool (SDP). The length is the length of that SID's reference to the portion of the SDP starting at the pointer's offset. If the pointer of a SID is NULL, then if the length is zero, then that SID refers to the NULL STRING, but if the length is non-zero, then the length contains the offset of another SID in the SCR. This storage mechanism allows concatenation to be performed by simply allocating a new SID that points to the two original SIDs, the first in the SAR, and the second in the SDR. Likewise, deletion from the right means simply shortening the length of a SID, while deletion from the left means shortening the length and augmenting the pointer to match. Rotation and all of the other fancy manipulations provided by FSTRINGS can almost always be done much faster in GSTRINGS, since all that is usually required is a pointer manipulation. Another nice feature provided by GSTRINGS is the automatic dynamic allocation of string storage by the package. Strings no longer are bound in size by the [lack of] foresight of the programmer. Large string buffers only exist when they are full of something, and even then thay may be shared among multiple strings. Consider the folowing code sequence: %VARIABLE FOO %VARIABLE BAR \ a couple of nice strings %" Once upon a midnight dreary " FOO %%! \ put it in FOO FOO BAR %%! \ now FOO and BAR refer to the same string FOO 10 %|TRIM \ now FOO is " a midnight dreary " FOO %. \ print FOO a midnight dreary BAR %. \ print BAR Once upon a midnight dreary Notice that although BAR got its value by refering to the same SIR as FOO, after the pruning by %|TRIM, FOO got a new SIR, allowing BAR to remain unscathed. We may also now remove the FSTRINGS restrictions on inserting a string into itself, etc. because the original SIR is left undisturbed by the operation. All this bliss does not come without a price, however. The ultimate penalty for gobbling up memory with wild abandon is garbage collection. GSTRINGS is designed with garbage collection in mind, though. Remember how all the SIRs that are referenced by the user's application are chained together? This permits the package to determine which SIRs are in use and which are not: a SIR is in use only if it is on the active chain, or if it is pointed to by the SAR or SDR of any SIR on the active chain. In Lisp, this active chain is called the OBLIST, so in GSTRINGS, we might as well call it the SOBLIST. Once all the active SIRS have been determined, they may be relocated down to the start of the SCR to occupy contiguous locations, followed by the free space in the SCR. The task of compacting the SDP is then performed. The SOBLIST is traversed, and a map is built of all the occupied regions and holes in the SDP. This list can [hopefully, if there is enough memory!] be built in the free space of the SCR. Each different valued pointer referencing the SDP must be indicated in the map. Next, this map of the SDP is traversed, and the actual occupied space in the SDP is compacted down towards the contiguous locations at the begining of the SDP, and, simultaneously, the in-use area of the SCR that was previously compacted is adjusted so that all pointers are corrected to point to the new location of the string space to which they refer. This completes the garbage collection process. We are left with all the free space in the SCR and SDP following all the in-use space. An interesting capability of GSTRINGS (paralleling Lisp) is that since the user's view of a string is linear, but the underlying reality is tree-structured, and since the user only sees the offset of a SIR, it follows that two different user string variables can actually point to the same SIR. This sharing of structure is a powerful tool, and like any power tool, it must be handled carefully, or serious damage can be done! Lisp provides a set of special forms that modify the list structure itself. These are called REPLACA and REPLACD, because they replace the contents of the CAR and CDR pointers, respectively. GSTRINGS will provide analogous operations, to be called %!A and %!D. These words will permit the deliberate sharing of structure, if the programmer determines this to be useful. Consider the following: VARIABLE BAZ %" This is the forest primevil, " BAZ %%! VARIABLE BOK BAZ BOK %%! %" the whispering pines and the hemlocks. " BOK %!D BAZ %. This is the forest primevil, the whispering pines and the hemlocks. Notice how a change to BOK produced a change in BAZ also. This is because they both refer to the same SIR, and that SIR was modified by the %!D operation. %NULL BAZ %!A BOK %. the whispering pines and the hemlocks. Here we have set the SAR of that same SIR to NULL, causing both BAZ and BOK to loose the first line of the poem. This sharing of structure is not recommended for novices, but it can be extremely powerful when an algorithm is especially designed to use it. The Boyer-Moore structure sharing resolution inference algorithm is a prime example. A version of it written in Lisp is available from the ECFB as BRIE.ARC [Brown, R. "The Boca Raton Inference Engine" DDJ Apr 86]. The same techniques are applicable with FSTRINGS using %!A and %!D. Conclusion This paper has presented a proposed string manipulation package that forms an extension of FSTRINGS. It differs from FSTRINGS primarily in the dynamic allocation of string storage, as compared to static allocation, and in the ability for strings to share structure. The implementation is designed with certain considerations made for the peculiarities of the Intel family of 80x8x microprocessors, but does not depend on any features of these machines. The use of separate memory segments for the SCR and SDP, and the use of offsets only in the list structures, permits its use with MS-Windows, where the SCR and SDP can be relocatable memory objects. If the SAR and SDP are contained in a separate library containing most of the GSTRINGS code, then multiple Windows applications can share a common string space and use it to pass information between themselves. There are times when garbage collection is simply not permissable, such as in certain real-time situations. Likewise, there are times when the additional complexity and run-time memory requirements of dynamic string allocation are overkill. In these instances, FSTRINGS is heartily recommended. But in those cases where ease of string manipulation without burdensome manual memory management are important, or where the run-time requirements of a simplistic implementation such as FSTRINGS can be alieviated by the more sophisticated pointer manipulation techniques GSTRINGS, this package has a definite advantage. Indeed, in implementations where off-loading the entire string memory space to a separate memory segment has advantages, such as on Intel processors, and especially under MS-Windows, GSTRINGS will probably be the preferred choice.