Archive Production Notes

David N. Williams
November 8, 2006

The problem is how to implement an analog of an anonymous FTP site as a web archive hosted by an ordinary internet service provider.

We have tried to satisfy several constraints:

As far as possible, use simple HTML, so the archive can be maintained with a web browser and text editor on the archivist's home machine, and so mirrors can work well without special server configuration assumptions.
Handle ordinary text files so they both display and download well.
Emulate anonymous FTP directories.

Text files

Current browsers handle several kinds of files with aplomb, but plain text files are an exception. So far, we have tried the following:

Leaving them alone. Text files with certain extensions, like those that are already .txt, Forth files with .fs, and C files with .c and .h, display well enough in our browsers with no processing that we leave them alone, except maybe for expanding tabs.
There is an Apache server-side method for redefining the effect of mime-types like application/x-tex so they are treated as text/plain. At httpd.apache.org we learned that AddType is one of the directives that can be overridden in a hidden .htaccess file. Based on a sample file we saw somewhere on the web, we tried the following .htaccess file:
```
   # tex
   AddType text/plain .tex
```
That's supposed to work for the directory in which it resides, and any subdirectories, and indeed it does with the University of Michigan server. We learned from Ulrike Fischer in news://comp.text.tex that the same effect can be obtained in the Opera browser by using the mime-type "property" menu, but we have not figured out how to get that method to work with Firefox or Safari.
Unfortunately, the .htaccess file has no effect when previewing our copy of the archive on our local Mac OS X system, but we use it anyway for .tex files.
Appending .txt to the file name and using <a href="name.txt">name</a> tags for references. We don't always like the resulting display in a couple of our browsers, and it has the disadvantage that the .txt extension has to be stripped from the file name by hand when saving the display as a text file from the browser window. But we nearly always do this when the first method doesn't work.
Referencing instead a .shtml wrapper file for the file, containing a server side include (SSI). The wrapper file has the form:
```
   <html><head><title>name</title></head>
   <body bgcolor=#FFFFFF><pre>
   <!--#include file="name" -->
   </pre></body></html>
```
This gives a good display when served by an SSI-enabled host, but makes maintenance on a home machine unpleasant, because browsers reading the file locally ignore the include command, treating it as an HTML comment. SSI-unaware mirrors will not display the file either. That there's an extra wrapper file for each text file is probably not a big deal, since the wrappers are small and can be produced with a tool. But we do not currently use this method.
Actually embedding the file in an HTML wrapper. That is, the SSI command line
```
   <!--#include file="name" -->
```
in the .shtml wrapper file above is replaced by the body of the text file, and the resulting file is named with a .html extension. The storage overhead is again small, and the wrapped files can again be produced with a tool.
There is the disadvantage that all instances of "&", "<", and ">" in the text body have to be replaced by the HTML entities "&amp;", "&lt;", and "&gt;", and that tabs have to be expanded. But that can be done automatically as part of a simple shell script that also wraps the file. (Indeed the included files in the SSI method would also have to be processed this way beforehand.) We'll say a word about the use of the script later.
This method gives a good display. Downloading works pretty well with the Save As Text feature common in browsers, except that browsers vary in whether they add extra blank lines before and/or after the body of the text. (We regard that as a browser bug.) Just as with the .txt scheme, one has to manually remove the extra file name extension, .html, when downloading as text, at least for our browsers.
The method allows previewing with a browser at home, and the server doesn't have to use SSI to display the files, but currently we prefer the .txt method.
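The escaping and wrapping steps of the embedding method can be sketched in a few lines. This is not the shell script mentioned above, which is not reproduced in these notes; it is a minimal Python illustration, and the function name wrap_text and the 8-column tab stop are assumptions.

```python
import html


def wrap_text(name, body, tabsize=8):
    """Return an HTML page that shows `body` verbatim inside <pre>.

    `name` becomes the page title. `tabsize` is an assumed tab stop.
    """
    # Expand tabs first, so columns are computed on the raw text.
    expanded = body.expandtabs(tabsize)
    # Replace &, <, > by &amp;, &lt;, &gt;. Quotes need no escaping
    # in element content, so quote=False leaves them alone.
    escaped = html.escape(expanded, quote=False)
    return (
        f"<html><head><title>{html.escape(name)}</title></head>\n"
        "<body bgcolor=#FFFFFF><pre>\n"
        f"{escaped}"
        "</pre></body></html>\n"
    )
```

For example, wrap_text("demo.c", text) would return the contents of a demo.c.html wrapper, ready to be written to disk.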

Directories

We're a little embarrassed to admit that we were pretty far along in this project when we learned that there is such a thing as web directory browsing. We knew that we got a directory listing when we visited anonymous FTP URLs with our browser, but we thought that was special for FTP.

With some inconclusive googling, we got the impression that this may be a web server option. But if so, it was always turned on at the sites we were able to test. So maybe the scheme with index.html files we're about to discuss is overkill.

We handle directories, first of all, by keeping archived files in directories according to the desired tree structure. In each directory, we put a directory file named index.html, which contains a list of links to the remaining files and subdirectories.
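As an illustration only, a directory file of this kind could be generated along the following lines. This is a Python sketch, not the tool used for the archive; the function name make_index, the plain bulleted layout, and the choice to skip hidden entries are all assumptions.

```python
import html
import os
from urllib.parse import quote


def make_index(directory):
    """Build the text of an index.html page listing `directory`."""
    items = []
    for entry in sorted(os.listdir(directory)):
        # Skip the index page itself and hidden entries such as .htaccess.
        if entry == "index.html" or entry.startswith("."):
            continue
        if os.path.isdir(os.path.join(directory, entry)):
            # Name index.html explicitly, so the relative link also
            # works when the page is opened from the local file system.
            href, label = f"{entry}/index.html", f"{entry}/"
        else:
            href, label = entry, entry
        items.append(f'<li><a href="{quote(href)}">{html.escape(label)}</a></li>')
    title = html.escape(os.path.basename(os.path.abspath(directory)))
    return (
        f"<html><head><title>{title}</title></head>\n<body>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul>\n</body></html>\n"
    )
```

The returned string would be written to index.html in the same directory. Commentary can then be added to the page by hand.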

An advantage of this scheme is that it is easy to include formatted commentary on the directory page.

In addition, we use server side includes to document the modification dates of the files and subdirectories, and the sizes of the files. There are a couple of disadvantages that we live with:

This information cannot be previewed with a browser on a local machine, nor displayed by an SSI-unaware server. We regard that as relatively unimportant.
More serious is that it complicates the HTML formatting of the directory page. For example, we use tables to define date and size fields for directory items.
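To make the table layout concrete, here is a hedged Python sketch that emits one table row per file, with SSI directives in the date and size cells. The notes do not say which directives the archive uses; flastmod and fsize are the standard Apache mod_include directives for this purpose and are assumed here. The names ssi_row and ssi_table are hypothetical.

```python
def ssi_row(name):
    """Return one table row for `name`: link, modification date, size.

    Assumes `name` is a plain file name (letters, digits, dots,
    hyphens), so no escaping is needed in the HTML or the directives.
    """
    return (
        "<tr>"
        f'<td><a href="{name}">{name}</a></td>'
        # The server fills in the date; a local browser sees a comment.
        f'<td><!--#flastmod file="{name}" --></td>'
        # Likewise for the file size.
        f'<td align="right"><!--#fsize file="{name}" --></td>'
        "</tr>"
    )


def ssi_table(names):
    """Return a table with one ssi_row per file name."""
    rows = "\n".join(ssi_row(n) for n in names)
    return f"<table>\n{rows}\n</table>\n"
```

When such a page is previewed locally, the date and size cells are simply empty, which matches the behavior described above.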

Note that directory pages are .html files, not .shtml files. This used to be contrary to recommended practice, the thought being that configuring an SSI-enabled web server to parse all .html files for SSI commands, rather than just parsing the .shtml files, would be bad for bandwidth. However, a spokesperson for Information Technology Central Services at the University of Michigan has told us that any extra overhead for parsing all pages is "absolutely trivial", and that they have parsed them that way since 1994.

When this archive is accessed from its host at UM, the SSI commands are thus recognized even though they are served from .html directory pages. And because the pages are .html, our browser is not altogether confused when previewing them at home: we see everything but the modification dates and sizes.

We prefer the index.html display over that of "direct" directory browsing without index.html pages. In particular, it allows us to hide any extra .txt extensions or .html extensions (which we no longer use) on wrapped text file names.

An advantage of naming the directory file index.html instead of dir.html, as we used to, is that the absolute URL for reaching a directory from outside does not have to include it, since index.html is found automatically. So the URL can be that much shorter. The name index.html does have to be appended to relative URLs, which are essential for navigating a copy of the archive for maintenance on a home machine.

Archive Maintenance

Compared to just keeping a master copy of an FTP directory tree archive on a local machine, maintaining an HTML-based archive in our scheme is a bit more hassle. After some trial and error, we do it at home as follows.

We find it less confusing to have the local (master) copy of the archive directory tree pretty much an exact replica of the tree at the server, except that it can be very convenient to use soft links in the local archive tree to point to files elsewhere.

In particular, we define text files that need an extra .txt extension for browser display as links to the files without the extension.
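A minimal Python sketch of creating such a link follows. The function name link_with_txt is hypothetical, and the use of an absolute link target is an assumption; the notes do not say whether the links are relative or absolute.

```python
import os


def link_with_txt(original, archive_dir):
    """Create archive_dir/<name>.txt as a soft link to `original`.

    Returns the path of the link. An existing link of the same name
    is replaced; a regular file of that name is left untouched.
    """
    link = os.path.join(archive_dir, os.path.basename(original) + ".txt")
    if os.path.islink(link):
        os.remove(link)  # replace a stale link from an earlier run
    elif os.path.exists(link):
        return link      # do not clobber a real file
    os.symlink(os.path.abspath(original), link)
    return link
```

With this arrangement the original file keeps its own name, and only the archive copy carries the extra extension.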

The situation is a little more complicated with HTML-wrapped text files. Although we don't use them anymore, the following describes how we used to handle them.

The HTML-wrapped text files in the local archive were links, and both the wrapped and unwrapped files were elsewhere. Since it's unpleasant if the wrappers are not easily reached for updating when the text files are changed, we made an invisible subdirectory called .texthtml in each directory where the original of a text file in the archive resided. We put the HTML-wrapped text file in the invisible directory, with a soft link to it in the appropriate archive directory. Then the archive itself didn't have to be changed when the original file was edited and the wrapper updated.

We used our text2html script to update the wrapper, which the script put in the right place by prepending .texthtml/ to the output file name. For the case where every file in the source directory was a text file destined for the archive, we used a text2html-all script which updated all the wrappers in the directory at once.

To automate the soft links a bit, we tried to arrange it so that all the wrappers in a given .texthtml directory were destined for the same archive directory. Then we could cd to the archive directory and execute a script ln.texthtml to link them all at once. This was only done when the archive tree structure itself was being rearranged, or built for the first time.
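The bulk linking step can be sketched as follows. This is a Python illustration of the idea, not the ln.texthtml script itself, whose contents are not shown in these notes. The function name link_wrappers and the rule of selecting files by the .html suffix are assumptions.

```python
import os


def link_wrappers(texthtml_dir, archive_dir):
    """Soft-link every .html wrapper in `texthtml_dir` into `archive_dir`.

    Returns the list of links created, in sorted order.
    """
    created = []
    for entry in sorted(os.listdir(texthtml_dir)):
        if not entry.endswith(".html"):
            continue  # ignore anything that is not a wrapper
        link = os.path.join(archive_dir, entry)
        if os.path.islink(link):
            os.remove(link)  # refresh a link left by an earlier run
        elif os.path.exists(link):
            continue         # never replace a real file or directory
        target = os.path.abspath(os.path.join(texthtml_dir, entry))
        os.symlink(target, link)
        created.append(link)
    return created
```

Because the links point at the wrappers rather than copying them, regenerating a wrapper in .texthtml leaves the archive directory itself unchanged.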

Finally, when we wanted to automate the production of archive index.html files, we used mkindexfile.fs, a Forth interactive program written for pfe. An example is the source output for the archive forth/cstructures directory. Although it has probably been edited a little since it was generated, it should be pretty close to the original.