Interim Progress Report

25 April 1996

Contents

The Project
Reflections
1. On the decisions
2. On the process
Prospects
Bibliography
Attachments
1. File ~pfs/codebook.html
  Collecting point and index for working documents
2. File ~pfs/cb/status.html
  Table providing latest revision information on all working documents.
3. File ~pfs/cb/bb.html
  Message and announcement archive for reference and historical purposes.
4. File ~pfs/cb/cb1.dtd
  Slightly obsolete version of standalone CodeBook dtd.
5. File ~pfs/cb/elements.html
  List of elements declared in the CodeBook dtd, with short description and section number in the dtd.
6. File ~pfs/cb/descrips.html
  List of code book elements rearranged in order of descriptive caption.
7. File ~pfs/cb/cb-tei.html
  Documentation of steps taken to integrate standalone CodeBook dtd into TEI as a "base tag set."
8. File ~pfs/cb/tei/teicb1.dtd
  First attempt (now obsolete) to convert standalone CodeBook dtd to parameterized form required by TEI in its modules.

The Project

The larger project of which the CodeBook DTD is a part is an attempt by the Inter-university Consortium for Political and Social Research (ICPSR), a consortium comprising some of the most significant statistical archives and data-producing agencies, to make uniformly searchable and accessible the statistical codebooks issued by the agencies that produce, publish, or archive statistical data sets. Many of the details, and even the general shape, of this larger project and of the larger SGML-based system that it presupposes remain invisible to those of us who are engaged with the dtd proper, especially as regards the distribution of responsibility, the nature of the funding, and the possible diversity of implementation among the member institutions of the ICPSR.

The more limited responsibilities assigned to the dtd group include the development of an SGML Document Type Definition suitable for the encoding of data-archive code books or data dictionaries; the creation of all the ancillary files and style sheets required to make the dtd usable; the creation of all the documentation required by taggers, presumptively with little or no knowledge of SGML, in interpreting the dtd while tagging codebooks; and cooperation with programmers in integrating the dtd and its attachments into larger systems. The dtd should be suitable not only for tagging new codebooks but also for retrospective conversion of old ones; it should apply not only to print codebooks but also to digital-only documentation. It should be usable not only by American and Canadian but also by international data archives. And it should not only accommodate the extremely variable structures of existing codebooks, but impose enough constraints to promote increased consistency among future codebooks.

Objectives

The objectives of the group, though modified by experience, have in general comprised the following:

Interpret the code book document analyses produced by the ICPSR committee and written by David Barber (Michigan), John Brandt (Michigan), and Ann Green (Yale). Clarify obscurities in the analyses through meetings and correspondence. Elicit the continuing input of data specialists from ICPSR institutions regarding faults in the analyses.
Revise these analyses in the light of our own analyses of existing code books, the experience of tagging them with interim versions of the dtd, and the experience of translating the analyses into SGML.
Translate the analyses into a valid SGML dtd, being careful both to honor the intentions of the analyzers and to exercise a discretion schooled in the options available under SGML.
Learn enough SGML to make the previous step possible. Achieve a conceptual grasp of the options available, both as regards document analysis and as regards SGML implementation.
Map out and implement a practical procedure or work-flow management scheme.
Document the procedure and the process so that every decision and every declaration remains transparent both among ourselves and to later users.
Determine the extent to which TEI (or EAD or MARC) SGML elements, sections, entities, etc., may be borrowed for describing codebooks. Should it prove feasible, incorporate CodeBook.dtd within TEI itself, with the prospect of having it eventually accepted as a TEI module in future releases of TEI.
Tag sample codebooks for demonstration, tutorial, and experimental purposes: to determine if the dtd is adequate, to show that it is (in the face of informed experience), and to teach taggers how to implement the dtd.
Write a Panorama style sheet so that the samples can be displayed.
Create visual displays of the dtd hierarchy using Near and Far or Near and Far Lite.
Work with programmers to create scripts that will search the tagged codebooks via a CGI call, and generate SGML and SGML-filtered-to-HTML output online.
Create full user documentation.

Progress to Date

A good start has been made on these objectives, but only that.

Interpretation. The analyses have been interpreted. Changes continue to be made both in the analyses and in the interpretations through (frequent) correspondence, chiefly from Ann Green at Yale, and (rare) meetings of the ICPSR Committee. Obscurities nevertheless remain, most of them annotated as such within the dtd, but some probably unrealized and unacknowledged.
Revision. Some minimal revision (or at least loose interpretation) of the analyses appears in the dtd, but more as the result of general considerations of logic and flexibility than of specific experiences with non-compliant codebooks. We have only just begun to compare the dtd with codebooks. The revisions that have been made in the course of translation, though thoroughly annotated in the dtd, have not made their way back into the analyses, since there is little provision in our working procedures for changes flowing (as it were) backward. There should be.
Translation. As of last night, a "complete" dtd has been created, complete in that every element has been declared, supplied with a content model, and equipped with an attributes list--and every piece of the document analyses has been represented in the dtd. More recent changes in the analyses have yet to be incorporated, the resultant version has yet to be converted to a TEI-compliant modular form, and it has yet to parse. But the basic translation has been done.
Learning SGML. I for one am much more confident with SGML/dtd basics than I was two months ago, and the process has certainly been a genuine learning experience for all concerned. Whether our conceptual grasp is adequate to the task has yet to be determined.
Development of procedures. Though our procedures are still largely dictated by the fact that all of us are contributing to the project only our "spare" time after 100 or 120 hour/week work schedules, and the governing principles are still flexibility and consensus, a more rigorous delegation of tasks in the past few weeks has generated a productivity noticeably greater than that apparent during the more haphazardly arranged weeks that preceded. Duplicate effort has been eliminated, and the results of each contributor's labors are available in predictable ways. I have within the past few days created a version control system that will govern and document every change in the dtd and its derivative documents and do so transparently. A similar system for the analyses is in the works.
Documentation. Documentation has grown up with our procedural maturity, and contributed to it. The dtd itself is physically laid out in a way conducive to human legibility, nearly every tag is annotated in some way, and all changes, either from one version to another or from the analyses to the SGML translation, are fully explained and defended. Eventually, these comments, growing as lengthy as they are, should be removed from the dtd and placed in a separate file attached to each element (perhaps in the manner recommended by Eve Maler, who suggests that a separate form should be filled out for every element explaining its content and defending its necessity and appropriateness). I have created several web pages, some within the past few days, that extend the explicit documentation further: documenting the steps taken to integrate the CodeBook dtd into TEI, listing in one convenient place the revision status of all documents, and archiving the more important e-mail messages and announcements by date and subject. The alphabetical elements list, created first simply in order to assure consistency in naming practices, to avoid conflict with TEI names, to generate the list necessary for inclusion in one of the TEI entity files, and to act as an index to the dtd (by section number), will also serve as the basis of the user documentation. When the dtd reaches sufficient stability that we can begin on user documentation, the individual bits of dtd code will be pasted into the alphabetical list of elements and explained by example--the bulk of the necessary documentation, though hardly all of it. Additional documentation (e.g. controlled vocabulary lists and thesauri, attribute lists, etc.) is proposed but not yet created.
Incorporation of TEI elements. (Or determination otherwise of the usefulness of TEI.) I believe that the method that I proposed to the group for incorporation of CodeBook into TEI, and have since then pursued, will prove a practical one. Another five to ten hours of conversion time, and we should be able to see if the CodeBook/TEI amalgam will parse as a single DTD. If it does, CodeBook will have at its disposal a host of well-defined, thoroughly bug-tested, and well-documented tags, mostly at the lowest levels, and will to that extent be spared a very considerable amount of development effort and time.
Sample tagging. Others in the group have been responsible this week for tagging sample codebooks with our tags. I have not yet seen the results.
Stylesheet-Display-SGML filtering-User Guidelines. All of these are still in the planning stages; all depend on having a valid, parsable, stable dtd to begin with and will have to be postponed till that date.

Decisions

Countless small decisions have been made in the course of the project. Among the most important in their influence on the SGML implementation should be numbered the following:

Content-oriented element selection and structure.

If the choice in SGML development is between procedural, presentational, structural, and informational markup, the choice has been (with the ICPSR Committee's strong support) to use an exclusively informational structure at all but the lowest levels. Whether this decision can be sustained in practice, or must be modified by the inclusion of some upper-level tags to represent divisions like chapters or even page numbers, remains to be seen, but the sense of the Committee is certainly that (in the words of one member) "our business is metadata, not text." The immediate implications of this decision for the dtd structure include not only the selection of exclusively informational tags at the upper levels, but the abandonment of any sort of constraints on sequence or repeatability for most of the elements. The creator of a structurally-oriented dtd (like the default text structure of TEI) has the luxury of knowing that in most texts some matter in the front will be followed by some matter in the middle and then by some matter at the back, and that there will be some sort of internal scheme of chapters or books or stanzas to tag. An informational tagger, on the other hand, can never count on the author appearing before the title, or the data files description before the data description, or the study description before the questionnaire, or on any kind of information appearing without being commingled with other kinds. The lack of predictable sequence rules out constraints on order; the distribution, mingling, and interruption of given kinds of information by other kinds exclude constraints on repeatability.
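
The contrast can be sketched in DTD terms. The two declarations below are illustrative only and are not taken from the working dtd: cbStrict and cbLoose are invented names, and docDscr, stdyDscr, fileDscr, and dataDscr are assumed stand-ins for the four top-level kinds of information, taken to be declared elsewhere.

```sgml
<!-- Hypothetical names throughout; not the working CodeBook dtd. -->

<!-- Structural style: fixed order, each division exactly once. -->
<!ELEMENT cbStrict - - (docDscr, stdyDscr, fileDscr, dataDscr)>

<!-- Informational style: any division, in any order, any number of times. -->
<!ELEMENT cbLoose  - - (docDscr | stdyDscr | fileDscr | dataDscr)*>
```

The second model accepts any existing codebook, however its information is interleaved. It will just as readily accept a document that omits an entire division, and that is the price of the looseness.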

Concentration on retrospective conversion.

In keeping with this philosophy, it is the sense of the Committee that it is vain to put much value on the possibility of creating a normative codebook structure that entails constraints on sequence. Even if such a beast were created, the institutional habits even of the ICPSR member institutions are too deeply ingrained to allow them readily to adopt an unfamiliar format. Instead, we are urged to make the structure loose enough to accommodate the varied codebooks already in print and on line, and not to worry that this retrospectively-induced laxity will induce prospective laxity as well. My own suggestion in this regard was to emulate the TEI "Dictionaries" module, with which I have spent hundreds of hours, in its solution to the same problem. Older dictionaries, which exist in a bewildering variety of structures, may be tagged with what is effectively an entirely different dtd invoked by use of the "entryFree" instead of the "entry" element at the entry level of the dictionary. In entryFree mode, virtually any element can appear virtually anywhere. Prospective encoders, on the other hand, are encouraged to use the more restrictive "entry" tag and its contents. Such a division may still be incorporated into CodeBook.dtd, but its inclusion is of fairly low priority.
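
A comparable strict/free pair for CodeBook might be sketched as follows. This is an assumption about how such a division could look, not a description of anything yet in the dtd: stdyFree and the stdyPart entity are invented names, and titl, otherMat, and txt are assumed to be declared elsewhere.

```sgml
<!-- Hypothetical sketch: stdyFree and the stdyPart entity are invented
     names; titl, otherMat and txt are assumed to be declared elsewhere. -->
<!ENTITY % stdyPart "titl | otherMat | txt">

<!-- Prospective encoding: title first, then the remaining parts. -->
<!ELEMENT stdyDscr - - (titl, (otherMat | txt)*)>

<!-- Retrospective encoding: the same parts, plus text, in any order. -->
<!ELEMENT stdyFree - - (#PCDATA | %stdyPart;)*>
```

Keeping the list of parts in one parameter entity means that an element added later becomes available in the free model without a second edit.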

Use of TEI parameter entities.

The TEI parameter entities for low-level elements (chiefly %phrase.seq; %paraContent; and %specialPara;) have, at my suggestion, been used as the tip-of-the-branch content models for almost all of the CodeBook trees. Each entity includes #PCDATA, but also includes a wealth of other elements (presentational, structural, and informational), that thereby become available to the tagger with little effort from the dtd-creator. The lowest of the three, %phrase.seq; is described in the TEI guidelines as consisting of "broth" (as opposed to chunks or soup), or phrase-level elements like:

abbr        abbreviation
address     address
date        date
dateRange   dateRange
emph        emphasis
foreign     foreign
hi          highlighted
name        name
title       title
ptr         pointer
lang        language

Moving up to the paragraph-content level with %paraContent; allows inclusion of lists, tables, figures, captions, and notes, among many other things. %specialPara; includes all of these plus paragraphs themselves, and is capable of handling all the presentational and structural tagging of most running prose found in a code book, as well as that of any figures, tables, or graphs that interrupt it (though most of those tend to be excluded from the digital versions of codebooks in any case, which are mostly mere ASCII at present).
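
As a rough illustration of how these entities serve as tip-of-the-branch content models, a leaf element simply delegates its content to one of them. The pairing of element with entity below is assumed for the sake of the example and may not match the working dtd; the entities themselves must already have been declared by the TEI files.

```sgml
<!-- Hypothetical pairings; the three TEI parameter entities are assumed
     to be declared before this point. -->

<!-- A short label: text and phrase-level elements only. -->
<!ELEMENT label    - - (%phrase.seq;)>

<!-- A block of running prose: paragraphs, lists, tables, notes. -->
<!ELEMENT txt      - - (%specialPara;)>
```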

Creation of a TEI base tag set.

In order to meet the simultaneous needs to (1) create an informationally-based dtd (quite opposed to most of TEI's structurally-oriented treatment of text, not data); and (2) make use of the TEI facility with presentational and structural elements, I proposed that we convert the CodeBook dtd into a simulacrum of a TEI base tag set, a close enough facsimile to convince TEI that it belonged, and invoke it (using the teikey2.ent 'INCLUDE'/'IGNORE' values) alongside at least the additional tag set for figures and graphs (for obvious reasons!) and perhaps also the additional set for names and dates. I have not yet decided whether to invoke the TEI default text structure (front-body-back; div/div/div), mostly because, while seeing its usefulness for the representation of such features in the codebooks, I do not yet see a way to allow this structure to coexist happily with the CodeBook's content-based "bibliographic info--study description--data files description--data description" structure. Some experimentation will be necessary. The TEI header should also be mentioned: its advantages lie in its control of bibliographic information and its possible use as a source for the automated generation of MARC records; its corresponding drawbacks are its restrictiveness and the obligation it imposes on the tagger to catalog the item, both of which can be vexing.
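
In a tagged document, the invocation might look something like the sketch below. TEI.figures and TEI.names.dates are the standard TEI switches for the two additional tag sets; TEI.codebook is a hypothetical name for the new base tag set, which would first have to be added to the TEI entity files with a default value of 'IGNORE'. The system identifier is likewise an assumption about local file layout.

```sgml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- Hypothetical switch for the CodeBook base tag set. -->
  <!ENTITY % TEI.codebook    'INCLUDE'>
  <!-- Standard TEI additional tag sets. -->
  <!ENTITY % TEI.figures     'INCLUDE'>
  <!ENTITY % TEI.names.dates 'INCLUDE'>
]>
```

Because declarations in the document's own subset are read first, these 'INCLUDE' values take precedence over the 'IGNORE' defaults in the TEI files.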

Limited use of generic tags

The document analyses have been consistent in proposing distinct tags for every kind of information, and even for the same kind of information if its focus or hierarchical placement varied. A strict and literal translation of the analyses would have required separate tags even for general catchall elements like "other materials": there would be an "other materials--study level" tag alongside an "other materials--data level" tag. "Record label" is to be distinguished from "variable label," and so on. While this is a defensible practice in part (it is a mistake, as Maler points out, to think that adding an attribute to an element entails any less overhead for the coder than adding a completely new element: in many authoring systems it may well involve more), in part it also results in a needless proliferation of elements for what is basically the same kind of information. It may be that a certain unfamiliarity or uneasiness with SGML has left the analyzers unhappy with the thought of relying on context for information, with the thought that "otherMat" nested inside "stdyDscr" represents "Study-Description-Other-Materials" just as clearly as a separate "S-D-O-M" tag would do, and therefore somewhat resistant to exploiting the contextual powers of an SGML text base. In any case, though hesitant to strike out too much on my own, I have converted a half dozen tags or so to more generic tags (otherMat, label, txt, stat[istic], etc.), though supplying each of them with an optional "level" attribute for the timid.
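
A sketch of one such generic element follows. The content model and the list of level values are assumptions made for illustration ("study" and "data" echo the analyses; "file" is a guess), not the declarations in the working dtd.

```sgml
<!-- Hypothetical declaration: one generic element, optional level. -->
<!ELEMENT otherMat - - (#PCDATA)>
<!ATTLIST otherMat
          level    (study | file | data)   #IMPLIED>
```

In a tagged codebook, context alone may carry the distinction, or the attribute may state it outright (the content here is invented):

```sgml
<stdyDscr>
  <otherMat>Interviewer instructions</otherMat>
</stdyDscr>
<dataDscr>
  <otherMat level="data">Show card for question 12</otherMat>
</dataDscr>
```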

Limited use of attributes, nesting, and parameter entities.

Attributes have been used overwhelmingly for one purpose, the control of values: either to constrain possible values automatically (via declared values in the ATTLIST), or to allow control of possible values through the use of external authority control. In a few cases, the value in the text itself is to be simply deleted and replaced by an empty element with a controlled-value attribute; in most, the form in the text will be allowed to stand, but will be supplemented by a controlled attribute value. For all that, attributes are used relatively sparingly, being reserved for technical specifications (file type, record length, etc.), geographic and date information (date covered, country), and bibliographic information (names), though much of the latter ground is covered by the TEI header.

The TEI %a.global; attribute set is also used throughout, in order to provide the basic rend, resp, lang, and id attributes.
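
The two uses of attributes described above might be declared along the following lines. The element names, attribute names, and value lists are all invented for the example; only the %a.global; and %phrase.seq; entities are TEI's, and both are assumed to be declared already.

```sgml
<!-- Hypothetical elements, attribute names and values. -->

<!-- Case 1: the printed value is dropped; an empty element carries a
     value constrained by the ATTLIST itself. -->
<!ELEMENT fileType - O EMPTY>
<!ATTLIST fileType
          %a.global;
          charset  (ascii | ebcdic | other)  #REQUIRED>

<!-- Case 2: the printed form stands, supplemented by a value to be
     checked against an external authority list. -->
<!ELEMENT nation   - - (%phrase.seq;)>
<!ATTLIST nation
          %a.global;
          code     CDATA                     #IMPLIED>
```

In the first case the parser enforces the vocabulary. In the second it cannot, since CDATA accepts any string, so the check against the authority list has to happen outside SGML.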

Consistent naming. Eight-character names.

Consistent naming is a convenience to all, but especially to the tagger, whose convenience above all must be considered (since manual tagging represents the bulk of the costs of any SGML conversion). The eight-character limit on names is the SGML default (the NAMELEN quantity of the reference concrete syntax), adhered to on principle. Further consideration for user convenience will have to wait on trials with tagging real codebooks.

The Process

The information management process, as it has developed, can be more briefly described. John Brandt acts as supervisor, sets meeting agendas, and funnels information into the group from the ICPSR Committee and from Ann Green. Ann works chiefly with the document analyses, revising them constantly against real-world documents. She posts her changes to her own server at Yale, whence they are picked up by John and contribute to the revised versions that he posts to his server at Michigan. At the moment, this process is somewhat confused, and it is difficult to know where at any one time one might find the latest versions, or what changes have gone into them. I work primarily with the dtd itself, its documentation, and its amalgamation into TEI; prospectively also the creation of visual representations of the dtd for presentation to ICPSR Committee meetings. In the meantime, I post new versions of all the relevant documents to a page in my file space (linked to John's and Ann's), and pick up material on which to base revisions from their pages, as well as from e-mail from anyone in the group. I have just begun to archive this e-mail at the same site, and to implement a careful version-control scheme. A summary of version information for my various documents is available here as well. Nancy Vlahakis, after having contributed large chunks of the DTD, has moved for the time being to test-tagging codebooks, prospectively also to creating a Panorama style sheet. Though it has proven difficult to coordinate our activities (since none of us can give more than a few hours a week to it, usually in spurts), the use of updated web pages and occasional meetings has by and large worked. Movement of information against the flow (say from me to Ann) has been more problematic.

Some Reflections

On the Decisions

Content-based coding. The real problems with this approach will only become visible when it is seriously tested against real documents. Some features of it give one pause:
- The fact that we have created a "loose" structure first will probably make it more difficult to add a more constrained structure later. (Maler, for example, counsels that if one is going to create both, the strict structure should be constructed first.)
- The infinite variability of the codebooks' prose and the lack of obvious connection between what appears on the page and the available tags may present the tagger with some very puzzling decisions, and perhaps tempt him or her into "tag abuse" (use of inappropriate tags for convenience' sake). The problem with informational, as opposed to presentational or even structural, tagging, is that it demands nothing short of full understanding on the part of the tagger, understanding both of the dtd and its intended application, and of the document being tagged. Everyone recognizes a chapter or a list; a study-level range, weighting, variable group, or universe is usually a little harder to pick out.
- The need to jump from one upper branch of the dtd to another several times in a single page or paragraph is not inconceivable and would make for some very laborious tagging, or for some difficult decisions in deciding on the desirable granularity of the tagging.
- Mismatch between structural and informational items creates some unpleasant problems for the tagger.
- Treating a text as a database ignores its textual nature, at the risk of assuming a more logical (and vocabulary-controlled) text than really exists; alternatively, it imposes burdens on the coder by requiring them to modify or discard the text as printed.
- Some problems are created when information is repeated in different forms. The title may for example be tagged as "titl" a dozen times, each time slightly differently. The TEI header helps here.
- Should anyone wish to generate printed codebooks from the SGML file, they may find it difficult to approximate the appearance of the original.
- The actual placement of metadata elements has been entirely at the discretion of the Committee and its appointed subject experts. I have not been in a position (because of both my specific role and my ignorance of statistics) to comment on them or assess their aptness.
Use of TEI entities. Some advantages of using the TEI entities lie in the attributes attached to the tags. Using "date," for example, allows for control of the date form with the "value" attribute regardless of the way that the date is expressed in the text; similarly, using "abbr" allows for control of the full form (say, of an agency) using the "expan" attribute. This feature of the TEI elements invoked by the TEI entities has been considered in examining the question of authority control and consistency, and in several such cases has been relied on to provide the necessary control.
Creation of TEI base tag set. This method seems to be frowned on by both the TEI Guidelines themselves (Sperberg-McQueen and Burnard) and by Maler, the former preferring that users take advantage of the TEI's built-in methods of adding elements, extending attribute lists, and modifying content models, and warning that the creation of an entire new tag set is possible but requires intimate knowledge of the inner workings of TEI and the relationship between core, base, and additional tag sets (Guidelines, 42). The latter suggests that modification of an existing dtd is warranted when the purpose is compatibility and interchange capability, not when it is simply convenience. However, the particular approach that we have taken, midway between writing a dtd from scratch and modifying an existing one, seems to obviate both criticisms: simple extension of TEI would not suit CodeBook's needs, and the gains obtained by using TEI are considerably more than simply convenience. They include, for example, access to TEI's customization scheme and documentation.
Use of Attributes. The current policy seems a good compromise, except that we should perhaps consider adding "role" to the TEI %a.global; attribute set, as Maler suggests: a very handy element extender.
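
The attribute-based control described above under "Use of TEI entities" might look like this in a tagged codebook. The sample text is invented; only the elements and their attributes (date with value, abbr with expan) come from TEI, and the year-month-day form of the date is an assumption about the standard form a project would settle on.

```sgml
<!-- Invented content; the printed form stands, the attribute normalizes. -->
<date value="1996-04-25">25 April 1996</date>
<abbr expan="Bureau of Labor Statistics">BLS</abbr>
```

A search can then be run against the attribute values, so that "25 April 1996" and "4/25/96" are retrieved together if both carry the same value.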

On the Process

If one considers the personnel components of the dtd development team in terms of Maler's ideal list of required team members (Maler, 72), it is apparent that some members are missing, or seldom present, and that each of us has taken on different roles in passing. She suggests that this is an acceptable situation, so long as each role is clearly distinguished and consciously taken on. Here we have been remiss, and it is partly in response to her suggestions that I have recently and deliberately taken on the role of "recordist," alongside my basic roles of "implementor," (through seeking outside advice) "guest expert," and occasionally "facilitator." John has been "project leader" and "facilitator," occasionally "recordist"; Ann Green perhaps chiefly "user group" representative. The lack of a visible "project manager," and the lack of real input from users, reviewers, and guest experts (again, using Maler's terminology) has been our weak point as regards roles and personnel.

When we turn to her discussion of schedule and budget, however, things become even more grim. The factors making for a slow, expensive project, she says, include:

Long, complex documents.
Complexity and variety of document structure.
Constraints on the project.
Incompetence of the project leader.
Unavailability of members of the design team (i.e. the document analyzers).
Incompetence of the DTD implementor.
Lack of discipline on the team in following methodology, documenting everything, and seriously reviewing all documents and code delivered.

Our documents are often very long, very complex, and extremely various. We are constrained in time and money (having essentially none of either). It is hard to say who the project leader is, exactly. John and Ann are available from the design team, others more rarely. The most competent dtd implementor on the staff is me! The discipline is severely affected by lack of time, though we do better than most. Considering that we would find it difficult to come up with the answers to any of her vital questions (some of them because this is a consortium-based, not a corporate-based, project), it is amazing that we have gotten as far as we have (Maler, 71):

Who makes decisions?
Who funds the project?
Who decides who is involved?
Who is actually concerned?
Who informs the managers of these people?
Who is in charge and responsible for the results?

Prospects

The critical tests for this project all lie in the future, the most important within the next few weeks:

Can the CodeBook DTD + TEI be made to parse?
Have we understood TEI well enough to integrate this added module into it?
Can entire code books be readily marked up using CodeBook tags?
Will the results parse?

If the answers to these are "yes," and I have every reason to suppose that they are, most of the rest of the project will come along happily and affirmatively in their wake. The basis of user documentation is already made; the display and style mechanisms are trivial to complete; and even the HTML filtering, though hardly trivial, is an established technology and technique. We should by summer's end be able to produce what Maler calls the minimum deliverables of a dtd development project:

The document analysis report (most of this in hand, though all needs and rationales are not yet expressed therein).
The dtd code itself, and a demonstration of its validity.
The DTD user and maintenance documentation.
Test documents marked up with the DTD.

Bibliography

We have been most deeply engaged with the documents themselves: the sample codebooks (especially ICPSR reprints of US Census data dictionaries), the dtd, and the document analyses. The following have, however, also been essential resources:

Goldfarb, Charles. The SGML Handbook. Oxford, 1990. A standard book useful for handy access to the SGML standard and its tricks.
Maler, Eve, and Jeanne El Andaloussi. Developing SGML DTDs: From Text to Model to Markup. Englewood Cliffs: Prentice Hall, 1995. Though given to a little self-indulgence and silliness, this is not only a very readable and intelligent book, but a very practical one aimed at almost exactly our level of expertise and need. Apart from Dale Waldt and Brian Travis, The SGML Implementation Guide (Springer Verlag, 1995), a more technical and systems-oriented book, I know of no other book that addresses the question of practical dtd development at any length.
Maler, Eve. Tutorial on DTD Construction. Meeting of the MidWest SGML Forum, Ann Arbor, MI, 25 January 1996.
Sperberg-McQueen, C.M. and Lou Burnard. Guidelines for Electronic Text Encoding and Interchange. [TEI P3] Chicago and Oxford, 1994. Has set the standard for good SGML dtd documentation, though any attempt to modify the TEI dtd still requires careful study of the files themselves.
Van Herwijnen, Eric. Practical SGML. 2nd ed. Dordrecht: Kluwer Academic, 1994. Despite some improvements from the first edition, Van Herwijnen still provides less than meets the eye. His discussions of dtd development consist mostly of hints: useful hints but still hints. He is better at some of the more esoteric topics: marked sections, graphics, EDI, math notation, and references. For this reader at least he has a knack of making the familiar unfamiliar and the simple opaque, but his is still a handy second-choice book to have around.

I have also examined the EAD.dtd, rather cursorily, as well as TEILITE.dtd, FINDAID dtd, and MARC.dtd. I hope to have a look at the just-released EAD tagging guidelines soon, but have not yet been able to do so.

Attachments

See the following pages for illustrative documents.

Paul Schaffner :: 25 April 1996