Interim Progress Report

25 April 1996

Contents

  1. The Project
    1. The Objectives
    2. The Progress to Date
    3. The Decisions
    4. The Process
  2. Reflections
    1. On the decisions
    2. On the process
  3. Prospects
  4. Bibliography
  5. Attachments
    1. File ~pfs/codebook.html
      Collecting point and index for working documents
    2. File ~pfs/cb/status.html
      Table providing latest revision information on all working documents.
    3. File ~pfs/cb/bb.html
      Message and announcement archive for reference and historical purposes.
    4. File ~pfs/cb/cb1.dtd
      Slightly obsolete version of standalone CodeBook dtd.
    5. File ~pfs/cb/elements.html
      List of elements declared in the CodeBook dtd, with short description and section number in the dtd.
    6. File ~pfs/cb/descrips.html
      List of code book elements rearranged in order of descriptive caption.
    7. File ~pfs/cb/cb-tei.html
      Documentation of steps taken to integrate standalone CodeBook dtd into TEI as a "base tag set."
    8. File ~pfs/cb/tei/teicb1.dtd
      First attempt (now obsolete) to convert standalone CodeBook dtd to parameterized form required by TEI in its modules.

The Project

The larger project of which the CodeBook DTD is a part is an attempt by the Inter-university Consortium for Political and Social Research (ICPSR), a consortium comprising some of the most significant statistical archives and data-producing agencies, to make uniformly searchable and accessible the statistical codebooks issued by the agencies that produce, publish, or archive statistical data sets. Many of the details, and even the general shape, of this larger project and of the larger SGML-based system it presupposes remain invisible to those of us who are engaged with the dtd proper, especially as regards the distribution of responsibility, the nature of the funding, and the possible diversity of implementation among the member institutions of the ICPSR.

The more limited responsibilities assigned to the dtd group include the development of an SGML Document Type Definition suitable for the encoding of data-archive code books or data dictionaries; the creation of all the ancillary files and style sheets required to make the dtd usable; the creation of all the documentation required by taggers, presumptively with little or no knowledge of SGML, in interpreting the dtd while tagging codebooks; and cooperation with programmers in integrating the dtd and its attachments into larger systems. The dtd should be suitable not only for tagging new codebooks but also for retrospective conversion of old ones; it should apply not only to print codebooks but also to digital-only documentation. It should be usable not only by American and Canadian but also by international data archives. And it should not only accommodate the extremely variable structures of existing codebooks, but impose enough constraints to promote increased consistency among future codebooks.

Objectives

The objectives of the group, though modified by experience, have in general comprised the following:

Progress to Date

A good start has been made on these objectives, but only that.

Decisions

Countless small decisions have been made in the course of the project. Among the most important in their influence on the SGML implementation should be numbered the following:

Content-oriented element selection and structure.
If the choice in SGML development is among procedural, presentational, structural, and informational markup, the choice has been (with the ICPSR Committee's strong support) to use an exclusively informational structure at all but the lowest levels. Whether this decision can be sustained in practice, or must be modified by the inclusion of some upper-level tags to represent divisions like chapters or even page numbers, remains to be seen, but the sense of the Committee is certainly that (in the words of one member) "our business is metadata, not text." The immediate implications of this decision for the dtd structure include not only the selection of exclusively informational tags at the upper levels, but the abandonment of any sort of constraints on sequence or repeatability for most of the elements. The creator of a structurally-oriented dtd (like the default text structure of TEI) has the luxury of knowing that in most texts some matter in the front will be followed by some matter in the middle and then by some matter at the back, and that there will be some sort of internal scheme of chapters or books or stanzas to tag. An informational tagger, on the other hand, can never count on the author appearing before the title, or the data files description before the data description, or the study description before the questionnaire, or on any kind of information appearing without being commingled with other kinds. The lack of predictable sequence rules out constraints on order; the distribution, mingling, and interruption of given kinds of information by other kinds exclude restraints on repeatability.
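
The contrast between the two kinds of content model can be sketched in SGML declarations. This is only an illustration, not an excerpt from the CodeBook dtd: the first declaration follows the front-body-back pattern of the TEI default text structure, and the children titl and author in the second are hypothetical stand-ins for informational elements.

```sgml
<!-- Structural model: parts occur in a fixed order, each at most once. -->
<!ELEMENT text     - - (front?, body, back?)>

<!-- Informational model: a repeatable OR group, so the children may
     occur in any order, any number of times, freely intermingled. -->
<!ELEMENT stdyDscr - - (titl | author | otherMat)*>
```

The price of the second form is that a validating parser can no longer catch a missing or misplaced element, which is exactly the loss of constraint described above.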

Concentration on retrospective conversion.
In keeping with this philosophy, it is the sense of the Committee that it is vain to put much value on the possibility of creating a normative codebook structure that entails constraints on sequence. Even if such a beast were created, the institutional habits even of the ICPSR member institutions are too deeply ingrained to allow them readily to adopt an unfamiliar format. Instead, we are urged to make the structure loose enough to accommodate the varied codebooks already in print and on line, and not to worry that this retrospectively-induced laxity will induce prospective laxity as well. My own suggestion in this regard was to emulate the TEI "Dictionaries" module, with which I have spent hundreds of hours, in its solution to the same problem. Older dictionaries, which exist in a bewildering variety of structures, may be tagged with what is effectively an entirely different dtd invoked by use of the "entryFree" instead of the "entry" element at the entry level of the dictionary. In entryFree mode, virtually any element can appear virtually anywhere. Prospective encoders, on the other hand, are encouraged to use the more restrictive "entry" tag and its contents. Such a division may still be incorporated into CodeBook.dtd, but its inclusion is of fairly low priority.
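
By analogy with entry and entryFree, such a division in CodeBook might look something like the following. The names var, varFree, and qstn are invented for illustration, as are both content models; nothing of the kind has yet been written into the dtd.

```sgml
<!-- Hypothetical strict form, for newly written codebooks:
     children must appear in this sequence. -->
<!ELEMENT var     - - (label, qstn?, stat*)>

<!-- Hypothetical free form, for retrospective conversion:
     the same children plus character data, in any order. -->
<!ELEMENT varFree - - (#PCDATA | label | qstn | stat)*>
```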

Use of TEI parameter entities.
The TEI parameter entities for low-level elements (chiefly %phrase.seq; %paraContent; and %specialPara;) have, at my suggestion, been used as the tip-of-the-branch content models for almost all of the CodeBook trees. Each entity includes #PCDATA, but also includes a wealth of other elements (presentational, structural, and informational), that thereby become available to the tagger with little effort from the dtd-creator. The lowest of the three, %phrase.seq; is described in the TEI guidelines as consisting of "broth" (as opposed to chunks or soup), or phrase-level elements like:
abbr        abbreviation
address     address
date        date
dateRange   dateRange
emph        emphasis
foreign     foreign
hi          highlighted
name        name
title       title
ptr         pointer
lang        language
Moving up to the paragraph-content level with %paraContent; allows inclusion of lists, tables, figures, captions, and notes, among many other things. %specialPara; includes all of these plus paragraphs themselves, and is capable of handling all the presentational and structural tagging of most running prose found in a code book, as well as that of any figures, tables, or graphs that interrupt it (though most of those tend to be excluded from the digital versions of codebooks in any case, which are mostly mere ASCII at present).
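
In practice this means that a CodeBook element declaration can hand its whole content model over to one of the TEI entities. The sketch below assumes that the TEI entity declarations have already been read in; the pairing of each element with a particular entity is illustrative rather than a record of the actual dtd, and notes is a hypothetical element name.

```sgml
<!-- Phrase-level content only: character data plus abbr, date, name, etc. -->
<!ELEMENT label - - (%phrase.seq;)>

<!-- Paragraph-level content: adds lists, tables, figures, and the like. -->
<!ELEMENT notes - - (%paraContent;)>

<!-- Adds paragraphs themselves, for stretches of running prose. -->
<!ELEMENT txt   - - (%specialPara;)>
```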

Creation of a TEI base tag set.
In order to meet the simultaneous needs to (1) create an informationally-based dtd (quite opposed to most of TEI's structurally-oriented treatment of text, not data); and (2) make use of the TEI facility with presentational and structural elements, I proposed that we convert the codeBook dtd into a simulacrum of a TEI base tag set, a close enough facsimile to convince TEI that it belonged, and invoke it (using the teikey2.ent 'INCLUDE'/'IGNORE' values) alongside at least the additional tag set for figures and graphs (for obvious reasons!) and perhaps also the additional set for names and dates. I have not yet decided whether to invoke the TEI default text structure (front-body-back; div/div/div), mostly because, while seeing its usefulness for the representation of such features in the codebooks, I do not yet see a way to allow this structure to coexist happily with the CodeBook's content-based "bibliographic info--study description--data files description--data description" structure. Some experimentation will be necessary. The TEI header should also be mentioned: its advantages lie in its control of bibliographic information and its potential as a source for the automated generation of MARC records; its corresponding drawbacks are its restrictiveness and the obligation it imposes on the tagger to catalog the item, both of which can be vexing.
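
If the conversion succeeds, a codebook instance might select its tag sets in the document type declaration subset roughly as follows. TEI.figures and TEI.names.dates are the TEI's own entities for the two additional tag sets; TEI.codebook is a hypothetical name for the new base tag set, and it would have an effect only after the TEI driver files had been modified to recognize it.

```sgml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- Hypothetical entity selecting the CodeBook base tag set. -->
  <!ENTITY % TEI.codebook    'INCLUDE'>
  <!-- Additional tag sets supplied by the TEI. -->
  <!ENTITY % TEI.figures     'INCLUDE'>
  <!ENTITY % TEI.names.dates 'INCLUDE'>
]>
```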

Limited use of generic tags.
The document analyses have been consistent in proposing distinct tags for every kind of information, and even for the same kind of information if its focus or hierarchical placement varied. A strict and literal translation of the analyses would have required separate tags even for general catchall elements like "other materials": there would be an "other materials--study level" tag alongside an "other materials--data level" tag. "Record label" is to be distinguished from "variable label," and so on. While this is a defensible practice in part (it is a mistake, as Maler points out, to think that adding an attribute to an element entails any less overhead for the coder than adding a completely new element: in many authoring systems it may well involve more), in part it also results in a needless proliferation of elements for what is basically the same kind of information. It may be that a certain unfamiliarity or uneasiness with SGML has left the analyzers unhappy with the thought of relying on context for information, with the thought that "otherMat" nested inside "stdyDscr" represents "Study-Description-Other-Materials" just as clearly as a separate "S-D-O-M" tag would do, and therefore somewhat resistant to exploiting the contextual powers of an SGML text base. In any case, though hesitant to strike out too much on my own, I have converted a half dozen tags or so to more generic tags (otherMat, label, txt, stat[istic], etc.), though supplying each of them with an optional "level" attribute for the timid.
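
A sketch of one such generic element follows. The values offered for the level attribute, and the choice of content model, are assumptions made for the sake of the example, not the dtd's actual declarations.

```sgml
<!ELEMENT otherMat - - (%specialPara;)>
<!-- level is optional (#IMPLIED); context normally makes it redundant. -->
<!ATTLIST otherMat
          level  (study | file | data)  #IMPLIED>
```

Nested inside stdyDscr, an otherMat element is already identifiable as study-level material; an encoder who prefers to say so explicitly could add level=study to the start-tag.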

Limited use of attributes, nesting, and parameter entities.
Attributes have been used overwhelmingly for one purpose: either to constrain possible values automatically (via declared values in the ATTLIST), or to allow control of possible values through the use of external authority control. In a few cases, the value in the text itself is to be simply deleted and replaced by an empty element with a controlled-value attribute; in most, the form in the text will be allowed to stand, but will be supplemented by a controlled attribute value. For all that, attributes are used relatively sparingly, being reserved for technical specifications (file type, record length, etc.), geographic and date information (date covered, country), and bibliographic information (names), though much of the latter ground is covered by the TEI header.

The TEI %a.global; attribute set is also used throughout, in order to provide the basic rend, resp, lang, and id attributes.
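
The two patterns of attribute use described above might be declared along the following lines. The element names, attribute names, and permitted values are all hypothetical; only the use of %a.global; and the general approach are taken from the dtd.

```sgml
<!-- Text value replaced outright: an empty element whose attribute
     must take one of the declared values. -->
<!ELEMENT fileType - O EMPTY>
<!ATTLIST fileType
          %a.global;
          type  (ascii | ebcdic | binary)  #REQUIRED>

<!-- Text value left standing, supplemented by a normalized code
     to be checked against an external authority list. -->
<!ELEMENT nation   - - (#PCDATA)>
<!ATTLIST nation
          %a.global;
          code  CDATA  #IMPLIED>
```

A parser enforces the first kind of constraint by itself; the second depends on some process outside the dtd to compare the code values with the authority file.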

Consistent naming. Eight-character names.
The first is a convenience to all, but especially to the tagger, whose convenience above all must be considered (since manual tagging represents the bulk of the costs of any SGML conversion). The latter is the SGML default, adhered to on principle. Further consideration for user convenience will have to wait on trials with tagging real codebooks.

The Process

The information management process, as it has developed, can be more briefly described. John Brandt acts as supervisor, sets meeting agendas, and funnels information into the group from the ICPSR Committee and from Ann Green. Ann works chiefly with the document analyses, revising them constantly against real-world documents. She posts her changes to her own server at Yale, whence they are picked up by John and contribute to the revised versions that he posts to his server at Michigan. At the moment, this process is somewhat confused, and it is difficult to know where at any one time one might find the latest versions, or what changes have gone into them. I work primarily with the dtd itself, its documentation, and its amalgamation into TEI; prospectively also the creation of visual representations of the dtd for presentation to ICPSR Committee meetings. In the meantime, I post new versions of all the relevant documents to a page in my file space (linked to John's and Ann's), and pick up material on which to base revisions from their pages, as well as from e-mail from anyone in the group. I have just begun to archive this e-mail at the same site, and to implement a careful version-control scheme. A summary of version information for my various documents is available here as well. Nancy Vlahakis, after having contributed large chunks of the DTD, has moved for the time being to test-tagging codebooks, prospectively also to creating a Panorama style sheet. Though it has proven difficult to coordinate our activities (since none of us can give more than a few hours a week to it, usually in spurts), the use of updated web pages and occasional meetings has by and large worked. Movement of information against the flow (say from me to Ann) has been more problematic.

Some Reflections

On the Decisions

On the Process

If one considers the personnel components of the dtd development team in terms of Maler's ideal list of required team members (Maler, 72), it is apparent that some members are missing, or seldom present, and that each of us has taken on different roles in passing. She suggests that this is an acceptable situation, so long as each role is clearly distinguished and consciously taken on. Here we have been remiss, and it is partly in response to her suggestions that I have recently and deliberately taken on the role of "recordist," alongside my basic roles of "implementor," (through seeking outside advice) "guest expert," and occasionally "facilitator." John has been "project leader" and "facilitator," occasionally "recordist"; Ann Green perhaps chiefly "user group" representative. The lack of a visible "project manager," and the lack of real input from users, reviewers, and guest experts (again, using Maler's terminology) has been our weak point as regards roles and personnel.

When we turn to her discussion of schedule and budget, however, things become even more grim. The factors making for a slow expensive project, she says, include:

Our documents are often very long, very complex, and extremely various. We are constrained in time and money (having essentially none of either). It is hard to say who the project leader is, exactly. John and Ann are available from the design team, others more rarely. The most competent dtd implementor on the staff is me! The discipline is severely affected by lack of time, though we do better than most. Considering that we would find it difficult to come up with the answers to any of her vital questions (some of them because this is a consortium-based, not a corporate-based, project), it is amazing that we have gotten as far as we have (Maler, 71):

Prospects

The critical tests for this project all lie in the future, the most important within the next few weeks:

If the answers to these are "yes," and I have every reason to suppose that they are, most of the rest of the project will come along happily and affirmatively in their wake. The basis of user documentation is already made; the display and style mechanisms are trivial to complete; and even the HTML filtering, though hardly trivial, is an established technology and technique. We should by summer's end be able to produce what Maler calls the minimum deliverables of a dtd development project:

Bibliography

We have been most deeply engaged with the documents themselves: the sample codebooks (especially ICPSR reprints of US Census data dictionaries), the dtd, and the document analyses. The following have, however, also been essential resources:

Goldfarb, Charles. The SGML Handbook. Oxford, 1990.
A standard book useful for handy access to the SGML standard and its tricks.

Maler, Eve, and Jeanne El Andaloussi. Developing SGML DTDs: From Text to Model to Markup. Englewood Cliffs: Prentice Hall, 1995.
Though given to a little self-indulgence and silliness, this is not only a very readable and intelligent book, but a very practical one aimed at almost exactly our level of expertise and need. Apart from Dale Waldt and Brian Travis, The SGML Implementation Guide (Springer Verlag, 1995), a more technical and systems-oriented book, I know of no other book that addresses the question of practical dtd development at any length.

Maler, Eve. Tutorial on DTD Construction. Meeting of the MidWest SGML Forum, Ann Arbor, MI, 25 January 1996.

Sperberg-McQueen, C.M. and Lou Burnard. Guidelines for Electronic Text Encoding and Interchange. [TEI P3] Chicago and Oxford, 1994.
Has set the standard for good SGML dtd documentation, though any attempt to modify the TEI dtd still requires careful study of the files themselves.

Van Herwijnen, Eric. Practical SGML. 2nd ed. Dordrecht: Kluwer Academic, 1994.
Despite some improvements from the first edition, Van Herwijnen still provides less than meets the eye. His discussions of dtd development consist mostly of hints: useful hints but still hints. He is better at some of the more esoteric topics: marked sections, graphics, EDI, math notation, and references. For this reader at least he has a knack of making the familiar unfamiliar and the simple opaque, but his is still a handy second-choice book to have around.

I have also examined the EAD.dtd, rather cursorily, as well as TEILITE.dtd, FINDAID dtd, and MARC.dtd. I hope to have a look at the just-released EAD tagging guidelines soon, but have not yet been able to do so.

Attachments

See the following pages for illustrative documents.


Paul Schaffner :: 25 April 1996