Archive Production Notes

David N. Williams
November 8, 2006

The problem is how to implement an analog of an anonymous FTP site as a web archive hosted by an ordinary internet service provider.

We have tried to satisfy several constraints:

Text files

Current browsers handle several kinds of files with aplomb, but plain text files are an exception. So far, we have tried the following:

Directories

We're a little embarrassed to admit that we were pretty far along in this project when we learned that there is such a thing as web directory browsing. We knew that we got a directory listing when we visited anonymous FTP URLs with our browser, but we thought that was special to FTP.

With some inconclusive googling, we got the impression that this may be a web server option. But if so, it was always turned on at the sites we were able to test. So maybe the scheme with index.html files we're about to discuss is overkill.

We handle directories, first of all, by keeping archived files in directories according to the desired tree structure. In each directory, we put a directory file named index.html, which contains a list of links to the remaining files and subdirectories.
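As a rough illustration of the directory file idea, here is a minimal Python sketch that builds the text of an index.html from whatever is in a directory. It is not the tool used for this archive; the function name build_index, the page layout, and the choice to skip hidden entries are all assumptions made for the example.

```python
import html
import os
from urllib.parse import quote


def build_index(directory, title):
    """Return the text of an index.html that links to everything else in `directory`."""
    items = []
    for name in sorted(os.listdir(directory)):
        if name == "index.html" or name.startswith("."):
            continue  # skip the directory file itself and hidden entries
        if os.path.isdir(os.path.join(directory, name)):
            # Point at the subdirectory's own index.html so the link also
            # works when a local copy is browsed without a web server.
            href, label = quote(name) + "/index.html", name + "/"
        else:
            href, label = quote(name), name
        items.append('<li><a href="%s">%s</a></li>' % (href, html.escape(label)))
    return (
        "<html>\n<head><title>%s</title></head>\n<body>\n<h1>%s</h1>\n<ul>\n%s\n</ul>\n</body>\n</html>\n"
        % (html.escape(title), html.escape(title), "\n".join(items))
    )
```

Writing the returned string to index.html in each directory of the tree, for example while walking it with os.walk, would give one directory page per directory, to which commentary can then be added by hand.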

An advantage of this scheme is that it is easy to include formatted commentary on the directory page.

In addition, we use server side includes to document the modification dates of the files and subdirectories, and the sizes of the files. There are a couple of disadvantages that we live with:

Note that directory pages are .html files, not .shtml files. This used to be contrary to recommended practice, the thought being that configuring an SSI-enabled web server to parse all .html files for SSI commands, rather than just parsing the .shtml files, would be bad for bandwidth. However, a spokesperson for Information Technology Central Services at the University of Michigan has told us that any extra overhead for parsing all pages is "absolutely trivial", and that they have parsed them that way since 1994.

When this archive is accessed from its host at UM, the SSI commands are thus recognized even though they are served from .html directory pages. The .html extension also keeps our browser from being altogether confused when we preview the pages at home, where we see everything but the modification dates and sizes.
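The SSI commands themselves are not reproduced in these notes. As a hedged sketch, assuming an Apache-style server whose include module supports the flastmod and fsize directives, one line of a directory page might be generated as follows. The helper name ssi_entry and the line layout are hypothetical.

```python
import html


def ssi_entry(name, is_dir=False):
    """Return one listing line: a link, then SSI directives for date and size."""
    href = name + "/index.html" if is_dir else name
    parts = [
        '<a href="%s">%s</a>' % (html.escape(href, quote=True), html.escape(name)),
        # Replaced by the server with the entry's last modification date.
        '<!--#flastmod file="%s" -->' % name,
    ]
    if not is_dir:
        # Replaced by the server with the file's size; omitted for directories.
        parts.append('<!--#fsize file="%s" -->' % name)
    return " ".join(parts)
```

Because each directive has the form of an HTML comment, a browser that opens the page directly from disk simply shows nothing in its place, which would be consistent with dates and sizes going missing in a local preview.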

We prefer the index.html display over that of "direct" directory browsing without index.html pages. In particular, it allows us to hide any extra .txt extensions or .html extensions (which we no longer use) on wrapped text file names.

An advantage of naming the directory file index.html instead of dir.html, as we used to, is that the absolute URL for reaching a directory from outside does not have to include it, since index.html is found automatically. So the URL can be that much shorter. The name index.html does have to be appended to relative URLs, which are essential for navigating a copy of the archive for maintenance on a home machine.

Archive Maintenance

Compared to just keeping a master copy of an FTP directory tree archive on a local machine, maintaining an HTML-based archive in our scheme is a bit more hassle. After some trial and error, we do it at home as follows.

We find it less confusing to have the local (master) copy of the archive directory tree pretty much an exact replica of the tree at the server, except that it can be very convenient to use soft links in the local archive tree to point to files elsewhere.

In particular, we define text files that need an extra .txt extension for browser display as links to the files without the extension.
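A soft link of this kind can be made by hand with ln -s. For illustration only, the same step might look like this in Python on a Unix-like system; the function name link_with_txt and the decision to link to an absolute path are assumptions, not a description of how the archive was actually built.

```python
import os


def link_with_txt(source, archive_dir):
    """Create <archive_dir>/<name>.txt as a soft link to `source`; return the link path."""
    link = os.path.join(archive_dir, os.path.basename(source) + ".txt")
    if os.path.islink(link):
        os.remove(link)  # replace a stale link, but never a regular file
    # Use an absolute target so the link works wherever the original lives.
    os.symlink(os.path.abspath(source), link)
    return link
```

For example, link_with_txt("src/README", "archive/forth") would leave archive/forth/README.txt pointing at the original README, so edits to the original show up in the archive copy without further work.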

The situation is a little more complicated with HTML-wrapped text files. Although we don't use them anymore, the following describes how we used to handle them.

The HTML-wrapped text files in the local archive were links, and both the wrapped and unwrapped files were elsewhere. Since it's unpleasant if the wrappers are not easily reached for updating when the text files are changed, we made an invisible subdirectory called .texthtml in each directory where the original of a text file in the archive resided. We put the HTML-wrapped text file in the invisible directory, with a soft link to it in the appropriate archive directory. Then the archive itself didn't have to be changed when the original file was edited and the wrapper updated.

We used our text2html script to update the wrapper, which the script put in the right place by prepending .texthtml/ to the output file name. For the case where every file in the source directory was a text file destined for the archive, we used a text2html-all script which updated all the wrappers in the directory at once.
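The text2html script is not shown in these notes. The following Python sketch only illustrates the general idea of wrapping a text file in HTML and placing the result under .texthtml/ next to the original. The function name wrap_text, the pre-formatted layout, and the .html suffix on the output name are assumptions.

```python
import html
import os


def wrap_text(source):
    """Write an HTML wrapper for `source` into .texthtml/ beside it; return its path."""
    with open(source, encoding="utf-8") as f:
        body = html.escape(f.read())  # protect <, > and & in the text
    name = os.path.basename(source)
    outdir = os.path.join(os.path.dirname(source) or ".", ".texthtml")
    os.makedirs(outdir, exist_ok=True)
    out = os.path.join(outdir, name + ".html")
    with open(out, "w", encoding="utf-8") as f:
        f.write(
            "<html>\n<head><title>%s</title></head>\n<body>\n<pre>\n%s</pre>\n</body>\n</html>\n"
            % (html.escape(name), body)
        )
    return out
```

Rerunning it after the original is edited overwrites the wrapper in place, so a soft link in the archive that points at the wrapper does not need to change.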

To automate the soft links a bit, we tried to arrange it so that all the wrappers in a given .texthtml directory were destined for the same archive directory. Then we could cd to the archive directory and execute a script ln.texthtml to link them all at once. This was only done when the archive tree structure itself was being rearranged, or built for the first time.

Finally, when we wanted to automate the production of archive index.html files, we used mkindexfile.fs, a Forth interactive program written for pfe. An example is the source output for the archive forth/cstructures directory. Although it has probably been edited a little since it was generated, it should be pretty close to the original.