Minimum standards for text capture


Our standard minimum accuracy rate for text is 99.995%, which amounts to one error in 20,000 characters or better.


The character-level accuracy rate must be substantiated by sampling and proofreading of at least 5% of the data (as measured in both pages and bytes). The only sanction we employ to guarantee this rate is rejection of non-compliant data: any text that does not meet specification is rejected and must be corrected or redone and resubmitted.
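
The arithmetic behind the rate and the sample size can be sketched as follows. This is a minimal illustration only, not a description of our actual tooling: the function names are hypothetical, and only the character (byte) side of the 5% sample is modeled, not the page count.

```python
CHARS_PER_ALLOWED_ERROR = 20_000  # 99.995% accuracy = 1 error per 20,000 characters
SAMPLE_DIVISOR = 20               # a 5% sample = 1/20 of the text


def min_sample_chars(total_chars: int) -> int:
    """Smallest sample, in characters, that covers at least 5% of the text."""
    return -(-total_chars // SAMPLE_DIVISOR)  # integer ceiling division


def sample_passes(sample_chars: int, errors_found: int) -> bool:
    """True if the error rate in the sample is no worse than 1 in 20,000."""
    # Integer comparison avoids floating-point rounding at the boundary.
    return errors_found * CHARS_PER_ALLOWED_ERROR <= sample_chars
```

Under these assumptions, a text of 2,000,000 characters needs a sample of at least 100,000 characters, and that sample may contain at most five errors.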

Fine print

Both 'errors' and 'characters' are admittedly imprecise terms. In practice, we treat 'characters' as equivalent to bytes of encoded text, after all markup has been removed and all multi-byte characters or character entities reduced to single bytes. This is a generous policy, in that it counts as characters some bytes (e.g. newlines) that are not subject to error.
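
A rough sketch of this way of counting characters is given below. The function name and the regular expressions are illustrative assumptions: they treat anything between angle brackets as markup and anything of the form &name; or &#number; as a single character, and will miscount in edge cases (for example, a literal ">" inside an attribute value).

```python
import re

TAG = re.compile(r"<[^>]*>")                  # any SGML/XML tag (simplified)
ENTITY = re.compile(r"&#?[A-Za-z0-9.\-]+;")   # named or numeric character entity


def countable_characters(marked_up: str) -> int:
    """Count characters after removing markup and collapsing entities."""
    text = TAG.sub("", marked_up)
    text = ENTITY.sub("x", text)  # each entity counts as one character
    # len() counts code points, so a multi-byte character counts once;
    # newlines and spaces are counted too, as in the policy above.
    return len(text)
```

For example, the string `<P>caf&eacute;</P>` followed by a newline counts as five characters: four for the word and one for the newline.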

'Errors' of transcription are even harder to define. In general, any omitted letter, mistranscribed letter, or wrongly inserted letter is a single error, as are pairs of transposed characters. Mistaking one letter for two or vice versa is usually treated as a single error. Omission or duplication of entire words or spans of text is variously treated; in most cases each omission or duplication is treated as an error, regardless of length, though if such prove common, we always reserve the right to regard each omission as equal to the sum of omitted characters, and would reject such files in any case. Spacing errors are variously regarded, depending on the nature of the source material; early books, in particular, often display highly variable spacing: we expect keyers to understand the material well enough to normalize such spacing, though we do not penalize them when they fail to do so and instead follow the misleading visual clues in the source. In more modern books, with conventional and significant spacing, we generally treat spacing errors as errors of transcription. 'Global' errors (mistakenly taking one character for another through a whole book) are regarded contextually, and with some attention both to how understandable the error is and to how difficult it would be to correct, short of rejecting the text.

Poor-quality source material ('noisy' sources) introduces another subjective factor into the calculation, since it introduces the obligation to distinguish between excusable and inexcusable errors--those that keyers could not, or could, reasonably be expected to have avoided. A summary of our policy in this regard is available at

Noisy source material also introduces the problem of illegibility, since assessing the quality of text based purely on the number of errors provides a strong incentive to vendors to maintain a very high legibility threshold, to capture only such letters as can be identified without question, and to mark all others as illegible. Our interest is in favor of lowering that threshold and obtaining a more complete text, even at the risk of introducing some error. Accordingly, in projects with noisy source material, our policy has been to count unwarranted resort to the illegible flag as itself an inexcusable error of transcription; to count bad guesses as excusable errors; and even to count particularly brilliant guesses as offsetting some other errors, particularly minor errors such as case and punctuation errors.

Finally, the unit of material subject to rejection or acceptance may vary from project to project. In most cases, we reject or accept individual items at the document or book level, though this sometimes means very small sample sizes and some consequent adjustments. We always reserve the right to batch items together for sampling and to reject (or accept) at this batch level.

Character capture


At the minimum, we capture all text that uses characters based on the Latin alphabet (whether or not they have diacritics or other attachments); all standard symbols; unusual symbols when they can be identified and are found in running text; and most standard non-alphabetic conventional signs, 'dingbats,' and similar symbols, at least when they are part of the text stream.

That is, every printed character (considered with respect to the universe of characters) must be captured distinctively, so long as it belongs to the Latin alphabet and its customary extensions, including all standard diacritics and supplementary characters (e.g. thorn, yogh, eth, j, v, w, and of course Arabic numerals); to the set of common conventional signs and symbols (para, sect, *, etc.); or to the modern set of punctuation marks (,.; etc.). Also captured should be any recognizable sign or symbol that occurs regularly within the books in question, intermingled with such Latin characters (etc.), even if a novel character encoding has to be invented for it. In practice, almost everything we capture belongs to Latin-1 (ISO 8859-1), with occasional use of the other most commonly used ISO character sets (ISOlat2, ISOtech, ISOpub, ISOdia, ISOnum), and occasionally supplemented by characters from other alphabets or symbolisms. In Unicode terms, this represents a subset of the Latin, supplemented Latin, extended Latin, punctuation, general punctuation, number, diacritic, and symbol blocks, with rare extension to (say) Greek, Hebrew, IPA, and Cyrillic. Use of certain 'overloaded' ASCII-based characters should be discussed and decided on a case-by-case basis. In general, less ambiguity is better, but we are accustomed to using the traditional ambiguous forms in many cases, e.g. HEX 27 (') as both apostrophe and single opening or closing quote (and occasionally also for minutes or prime); HEX 22 (") as both opening and closing quote mark; or HEX 2D (-) as hyphen, figure-dash, and minus-sign. In one or two cases, we customarily divert some of these overused forms to idiosyncratic purposes, e.g. HEX 7C (|) to indicate an EOL hyphen or HEX 5E (^) to flag a superscripted character; any such use must be declared in advance.

Capture of extended text in non-Latin alphabets is optional.

Fine print

Our policy with regard to character capture is largely pragmatic in several regards: on the one hand, our goal is to capture at least enough information about particular characters to make them practical to search for and retrieve; but on the other, we acknowledge that in many cases we are constrained by the limits of the methods by which we produce the text, whether that be OCR or keying. Outside producers of text are welcome to exceed our minimum standard: to distinguish alloglyphs; to split homoglyphs; or whatever suits their purpose: more information is always better than less.

By and large, we prefer to define characters by their functional and semantic value, not by their appearance. This means that, for example, we do not generally distinguish between different forms of "a", or between short and tall "s"--unless that distinction has some significance in the book in question. If the relationship between two forms can be phrased as "x and y are forms of the letter z," then we in general prefer to capture x and y as "z". There are at least five exceptions to this rule, mostly pragmatic concessions:

  • if the same visual glyph can be regarded as representing more than one functional 'character,' we are usually content to record only one character. We attempt to split it on semantic grounds only when practical: e.g. when the occurrences are few or the contexts are readily defined and located.

  • if the same visual glyph historically represents the reflex of several functionally distinct characters, we rarely attempt to undo this historical merger. E.g., we do not attempt to distinguish the "z" of "viz." or "oz." (where it is historically an abbreviation mark) from the "z" of "zero."

  • if the printer (or author) has used the wrong letter in error, we do not attempt to replace it with the right one, regardless of the nature of the error (e.g. a 'spelling' error or a 'typographical' error). The sole exceptions are
    1. inverted or physically displaced type, where we reconstruct what was intended if we can; and
    2. 'hijacked' or substitute characters, when a printer advisedly uses one almost-right character in place of the presumably unavailable right piece of type. In that case, if we can plausibly interpret the result as an alloglyph of the correct character we capture it that way.

  • if we do not know what meaning was intended, we capture with the nearest visual equivalent.

  • and in some projects we give special treatment to some special forms of letters, e.g. decorated initials.

Optionally omitted character types

Locally, there are forms of text that we leave uncaptured, but outside suppliers are again free to exceed our standard and capture it all. We do not generally attempt to capture handwritten additions or corrections to printed works; running text in non-Latin (e.g. Greek, Arabic) alphabets (that is, a word or more, as opposed to the use of individual characters used as symbols or sigla), with a few exceptions; and text in idiosyncratic symbolic systems, e.g. a novel system of shorthand or a personal cipher.


Finally, any text received must resolve in some logical manner the problem of end-of-line hyphens. Locally, our solutions depend on the nature of the material. In modern materials (i.e., 19th century or later), we typically remove end-of-line soft hyphens and preserve the relatively few EOL hard hyphens. In older material, with uncertain hyphenation rules, we preserve all hyphens but capture EOL and other hyphens using separate characters or markup.
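
For older material, the approach of preserving every hyphen while keeping end-of-line hyphens distinct might be sketched like this. It is an illustration under stated assumptions, not our production process: the function name is hypothetical, and we assume here that the vertical bar (HEX 7C, mentioned above under character capture) stands in place of the end-of-line hyphen, and that lines are rejoined with no space after it.

```python
from typing import Iterable


def mark_eol_hyphens(lines: Iterable[str], marker: str = "|") -> str:
    """Join printed lines, recording end-of-line hyphens with a distinct character."""
    out = []
    for line in lines:
        stripped = line.rstrip()
        if stripped.endswith("-"):
            # End-of-line hyphen: keep it, but as the marker character,
            # and join directly to the next line.
            out.append(stripped[:-1] + marker)
        else:
            # Ordinary line end: becomes a single space.
            out.append(stripped + " ")
    return "".join(out).rstrip()
```

Given the lines "the king-", "dom of heauen" and "is at hand", this yields "the king|dom of heauen is at hand"; hyphens inside a line (as in "well-beloved") are left untouched. Deciding which end-of-line hyphens are soft, as required for modern material, needs linguistic judgment and is not attempted here.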

Character encoding

We can accept most forms of character encoding that are unambiguous and readily convertible to other forms, so long as we are supplied with an inventory and explanation of characters encoded. We are most familiar with, and readiest to deal with, documents encoded in the traditional SGML manner using ISO 646 (US-ASCII) supplemented by SDATA character entities, preferably but not necessarily exclusively those in the standard ISO character entity sets. We find that this method gives us ready control over the characters in our texts: our SGML declarations forbid 8-bit characters; and our DTDs limit character entities to the specified sets.
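
As an illustration of the 7-bit-plus-entities approach, the sketch below rewrites every non-ASCII character as a named entity. Two assumptions are worth stating plainly: the function name is hypothetical, and Python's table of HTML 4 entity names is used as a stand-in for the ISO entity sets (the HTML names were largely borrowed from those sets, but the two are not identical). Characters with no name in that table fall back to a numeric reference, which in practice would signal that an entity name needs to be agreed and declared.

```python
from html.entities import codepoint2name


def to_ascii_with_entities(text: str) -> str:
    """Rewrite non-ASCII characters as character entities, leaving ASCII as is."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)                        # 7-bit characters pass through
        elif cp in codepoint2name:
            parts.append(f"&{codepoint2name[cp]};")  # e.g. U+00E1 -> &aacute;
        else:
            parts.append(f"&#x{cp:X};")             # no known name: numeric fallback
    return "".join(parts)
```

The result contains only 7-bit characters, so it can be checked mechanically (for example with `str.isascii()`) before delivery.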

Any other form of encoding would require some discussion beforehand, and should allow the same level of control. We should have no trouble accepting documents encoded using ISO 8859-1 (Latin-1) whether or not supplemented by character entities; XML-style numeric (Unicode) character entities; XML general entities equated with ISO entity names in braces (e.g. á => {aacute}); Universal Character Names (UCNs; aka Java Unicode escapes); and Unicode/UTF-8 encoding. The last should be used only if the actual codepoints used are constrained to a defined subset of Unicode and listed in a character inventory, for reasons already given.
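
One way to check a UTF-8 text against its declared character inventory is sketched below. The function name, the choice to normalize to NFC first, and the shape of the report are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def outside_inventory(text: str, allowed: set) -> list:
    """Report non-ASCII characters that are not in the declared inventory."""
    # Normalize first so that precomposed and decomposed forms compare alike.
    text = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in text if ord(ch) > 127 and ch not in allowed)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in sorted(counts.items())
    ]
```

An empty report means every non-ASCII codepoint in the text appears in the inventory; any entry in the report identifies a codepoint to be either added to the inventory or corrected in the text.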

Minimum Markup


Our minimum standard for markup is roughly that described in level 4 of the "TEI-in-libraries" guidelines: light markup, chiefly structural but falling back on presentational markup when necessary; calculated to allow intelligible display, efficient navigation, the discrimination of all salient features, automatic linking to corresponding page images, and the automatic extraction of a table of contents or outline directly from the marked-up structure.

We welcome densely encoded text, but recognize that accurate, intelligent, interpretive markup costs a lot. The deeper and denser it goes, the more it costs. As a result, given our institutional interest in bulk production as opposed to hand-crafted work, we have tended to reduce costs as much as possible by restricting markup to only:

  1. what is essential to intelligible display;
  2. what is essential for intelligent navigation;
  3. what is most useful for the most common searches or search restrictions; and
  4. what is feasibly extracted from the original.

The same would be our minimum standard for text accepted from outside.


In practice, assuming that the books are reasonably modern in convention and general in type, we expect at least the following features to be marked up:

Our standard DTD also makes provision for some more detailed markup that we require only occasionally, but only when the matter justifies it and makes it feasible, such as:

Our present DTD does not allow for the following, but easily could be made to.

Markup language

At the production level, we are best able to work with SGML-encoded text, but are happy to accept XML as well. In either case, our minimum standard would specify only convertibility. That is, we are happy to accept either SGML or XML, so long as the dtd and document instances can be readily converted from one to the other without significant loss of information or technical impediments.

Any XML or SGML features that would hinder such conversion would require discussion beforehand.

Markup rules recommendations

We are a TEI shop and expect markup in most cases to employ at least TEI-based semantics, if not a fully TEI-compliant dtd. This is perhaps not a minimum requirement, but is certainly a strong preference. For most projects, especially those involving outsourced keying and coding, we have come to rely on a small subset of TEI that we refer to as our 'vendor' dtd (now in version 2), and derivatives thereof. We have found it adequate to an extremely wide range of material and readily adaptable even to some specialized texts. 'Vendor' and its derivatives handle the encoding for projects as diverse as EEBO/Evans, the Corpus of Middle English, and the Encyclopedic Survey of the University of Michigan. Nevertheless, we realize that projects may require more specialized markup that cannot be readily or efficiently captured using the vendor dtd, and will be glad to discuss more specialized markup schemes; we've used such schemes ourselves for a number of small-scale text conversion projects, such as Knight's American Mechanical Dictionary.

Michigan's conventions for text capture

We recommend a text capture and encoding regime along the lines described below, but recognize that particular classes of material may require a different approach.

NOTE: the material that follows is a much abridged and slightly altered version of the documentation for the TCP (EEBO/Evans) projects. The full version (complete with TCP peculiarities), with much additional documentation including extensive sample files, can be found online at Documentation for other and earlier DLPS projects, including version 1 of the vendor dtd and keying and coding instructions, are linked from Much of this material is obsolete, but still contains some useful examples, especially in the case of Knight's dictionary.


All DLPS dtds potentially exist in at least two versions: (1) a limited version intended for use by conversion firms (V); and (2) an inhouse version intended for use by inhouse markup reviewers (R). All DLPS dtds potentially exist in at least two formats: (1) as an SGML dtd (S); and (2) as an XML dtd (X). The latest version of our aforementioned 'vendor' dtd exists presently only as a reviewer version in SGML format under the name "vndr2rs.dtd" (or may be invoked with the public ID DOCTYPE ETS PUBLIC "-//UMDLPS//DTD Proof 2.0//EN"). It may be found online at [URL]. This dtd is an extract from TEI P3/P4 (with some slight modifications) and uses TEI semantics; the TEI guidelines (TEI P3 or P4) may be safely used as a general guide to the meaning of particular tags, though local usage may dictate some specific practices. TEI P3 documentation is available online at Michigan: .

Alphabetical list of tags in vendor dtd with brief descriptions

Note: grayed-out elements (marked also with a *) should be used cautiously, rarely, or not at all.

Element | Used to tag a... | Brief description
<abbr> | word | Contains an abbreviation, especially a word that uses a diacritic or an abbreviation mark (e.g. overlining, etc.) for which no character-level provision has been made. Optional EXPAN attribute contains the expanded form.
*<add> | span | Contains material added after printing, usually by hand.
<argument> | head/foot | Contains a summary (in prose or verse) found at the head or foot of a division. Often labeled "Argument". Sometimes extended to similar material.
*<author> | bibl part | Contains name of author, editor, etc. (role defined by attrib ROLE), espec. within bibliographic citation
<b> | span | Contains bold-face text. Equals <HI REND="b">. May be used instead of HI or mixed with it.
<back> | part of text | Contains the "back" matter belonging to a given <TEXT>. Compare <BODY>, <FRONT>.
<bibl> | span | Contains a bibliographic citation. Usually obligatory only within <EPIGRAPH>, but may be used almost anywhere, e.g. in association with quotations or in bibliographic footnotes. May contain AUTHOR TITLE DATE IMPRINT
<body> | part of text | Contains the main body of a given <TEXT>. Compare <FRONT>, <BACK>.
<byline> | head/foot | Contains the 3rd-person statement of authorship of a given text division; not always easy to distinguish from SIGNED, a 1st-person attribution.
<cell> | part of table | Table cell. Use like HTML <TD>. "ROWS" attribute = HTML "ROWSPAN"; "COLS" attribute = HTML "COLSPAN". Cells containing headings or labels (=HTML <TH>) should add attribute ROLE="label".
<closer> | head/foot | Appears at foot of text division; corresponds to <OPENER> at head of division. Used especially when there is internal structure, e.g. <SIGNED>, <SALUTE>. Compare <TRAILER>
<date> | span | Contains a date. Usually obligatory only within <DATELINE> and <HEAD>, but may appear almost anywhere.
<dateline> | head/foot | Use within OPENER or CLOSER. Contains span of text at head or foot of text division indicating the circumstances of writing (especially the place and/or date).
*<del> | span | Contains material deleted (e.g. scratched out) after printing, usually by hand.
<div1> | part of front,back,body | A subdivision of <FRONT> <BACK> or <BODY>.
<div2> | part of div1 | A subdivision of <DIV1>
<div3> | part of div2 | A subdivision of <DIV2>
<div4> | part of div3 | A subdivision of <DIV3>
<div5> | part of div4 | A subdivision of <DIV4>
<div6> | part of div5 | A subdivision of <DIV5>
<div7> | part of div6 | A subdivision of <DIV6>
<document> | entire item | Toplevel container for entire item, temporary header excluded.
*<emph> | span | Contains text set apart, e.g. by typeform, as being emphatic.
<epigraph> | head/foot | Contains quotation or motto (whether or not accompanied by <BIBL>) at head or foot of text division. Use also for scriptural quotations at head of sermons or commentary chapters.
<figDesc> | description of figure | Used with controlled vocabulary to indicate form and content of illustrations especially when other means of identification (e.g. caption) is lacking, especially for maps and portraits: <FIGDESC>Map of Africa</FIGDESC>
<figure> | illustration 'event' | Marks location of illustrations within text. Captions (or similar text) attached to an illustration are captured within <HEAD> <P> or <L> tags within the <FIGURE> tag. <FIGURE> may nest.
*<foreign> | span | Contains text set apart, e.g. by typeform, as being in a language other than the primary one. Has attrib LANG
<front> | part of text | Contains the "front" matter of a given <TEXT>. Compare <BODY>, <BACK>.
*<fw> | span | Occurs only in non-empty variant of PB and MILESTONE tags; contains literal text of material associated with milestone or page break, e.g. page number, running header, or milestone number.
<gap> | gap 'event' | Empty tag used to mark the location and nature of material present but not captured (DESC="music" DESC="math", DESC="foreign" DESC="intruder" DESC="duplicate" etc.), or of material that is missing or illegible (DESC="missing" DESC="illegible" DESC="blank").
<group> | group of texts | Used to group <TEXT>s if item consists of more than one separate <TEXT> (usu. signalled by separate title page)
<head> | head/foot | Contains heading for a text division (<DIV>), a stanza (<LG>), an argument (<ARGUMENT>), or a list or table (<LIST>, <TABLE>). Appears at the top (head) of the structural division. Compare <TRAILER>. Also used to capture the caption of an illustration.
<hi> | span | Contains text that is designed to be set apart for some reason from the surrounding text, unless the reason is specified by use of a structural tag (e.g. <HEAD>). Attribute REND indicates presentation (e.g. REND="i" (italic), REND="b" (bold), REND="u" (underline), REND="marginal quotes" (marg. quotes).) Cp. <I> <B> <U> (may be used instead of I, B, U, or mixed with them)
<i> | span | Contains italic text. Equals <HI REND="i">. May be used instead of HI or mixed with it.
<idg> | (id elements) | Contains ID numbers CAT, VID, BIBNO used to identify item; ID attribute contains tracking number or primary unique item identifier.
*<imprint> | bibl part | Contains imprint info (publisher, pubplace, date) within bibliographic citation.
<insert> | (macro) | Shortcut for <Q><TEXT><BODY><DIV0>: for inserted textual objects such as quoted documents and letters. Defined identically to *DIV0.
<item> | part of list | Contains an item in an ordered or unordered list (where it may contain a <LABEL>); or the second item in a dictionary list or list of pairings (in which the first item is tagged as <LABEL>). ATTRIB ROLE="label" when item serves as heading to list column.
<l> | verse structure | Contains a partial or complete line of verse. Often part of <LG>.
<label> | part of list | Contains a label attached to an item in a list. May either be paired with the <ITEM> or contained within it. When paired, may use ATTRIB ROLE="label" to tag label used as column header.
*<lb> | line-break 'event' | Used rarely. Empty tag used to indicate a line break. Use only if there is no other way to indicate the relationship between the material before and after the break.
<lg> | verse structure | Contains a group of verse lines that form a structural unit, e.g. a stanza, refrain, or verse paragraph. May nest.
<list> | special format | Contains an ordered or unordered list of short items (long items can usually be treated as paragraphs) [cp. HTML <UL>, <OL>]; or a list of label-item pairs [cp. HTML <DL>].
<milestone> | milestone 'event' | Empty tag used to indicate a numbered stage in an (often non-structural) series (e.g. page number in a different edition; year in a running chronology). Optional non-empty version contains literal text of milestone in FW tag.
*<name> | span | Contains a name. Use attrib TYPE to indicate personal, place, etc.; attrib NORM to contain controlled normalized version.
<note> | note | Contains most material that appears with but stands outside the main text flow, whether or not anchored by footnote numbers, etc. Use attrib "N" to record marker; attrib "PLACE" to record location of note relative to text block. Generally NOT used for end notes, since that would require excessive text displacement.
<opener> | head/foot | Appears at head of text division; corresponds to <CLOSER> at foot of division. Used especially when there is internal structure, e.g. <SIGNED>, <SALUTE>. Compare <HEAD>
<p> | prose structure | Paragraph. Basic unit of prose structure. Use "N" attrib. if numbered.
<pb> | page-break 'event' | Placed at beginning of each page to mark page-break "event". "N" attribute used to capture printed page numbers; "REF" attrib. contains number of image on which page appears. Optional non-empty version contains literal text of page number or running heads in FW tag.
<pscript> | postscript/prescript (material appended to DIV) | Used to capture self-standing block of text (perhaps with heading and its own signature) that is added to the end of a division after the usual closing elements (e.g. the signature or dateline), especially postscripts to letters. Contained within CLOSER or OPENER and defined otherwise the same as a DIV7
*<ptr> | linking tag | Empty pointer to another location in the document, referenced with TARGET attrib.
*<publisher> | imprint part | Contains publisher name within imprint
*<pubplace> | imprint part | Contains publication place within imprint.
<q> | span | Contains "block" quotations of all kinds (even if set off by typographic cues other than indentation). Use also as an alternative to INSERT to embed quoted documents within prose, using <Q><TEXT><BODY>... etc.
*<ref> | linking tag | Contains material constituting a cross-reference to another location in the document, referenced with TARGET attrib.
<row> | part of table | Table row = HTML <TR>.
<salute> | head/foot | Greeting attached to letter or letter-like text division, placed within <OPENER> or <CLOSER>: 'my lord,' 'dear sister.'
<signed> | head/foot | Signature statement (1st-person attribution) attached to letter or letter-like text division, or to other 'verbal actions' (e.g. a praise poem, will, or proclamation); placed within <OPENER> or <CLOSER>. 'Your friend,' 'Yours always'. Actual name optionally tagged as NAME.
<sp> | drama structure | "Speech"--the basic unit of drama or drama-like texts; normally headed by <SPEAKER> element.
<speaker> | drama structure | When found at the head of a speech, contains the name or designation of the speaker (or speakers)
<stage> | drama structure | Contains stage directions of all kinds, whether within the text or in the margin.
<table> | special format | Contains tabular material that cannot easily be made intelligible without retaining the two-dimensional layout of the original page. Tables containing nothing searchable (e.g. all symbols or numbers) may be omitted and captured as <FIGURE>.
*<term> | keyword | Contains controlled keywords (index terms) in keywords tag.
<text> | self-standing item | Usually=the whole book. May also be used to tag embedded documents that are substantially complete. (cp. <INSERT> <Q> and <GROUP>).
*<title> | bibl part | Contains document title, espec. within bibliographic citation
<trailer> | head/foot | Contains heading for a text division (<DIV>) a <TABLE>, or a <LIST>. Appears at the bottom (foot) of the structural division. Compare <HEAD>.
<u> | span | Contains underlined text. Equals <HI REND="u">. May be used instead of HI or mixed with it.
*<unclear> | span | Used rarely to contain text that is difficult to read but has nevertheless been partly or completely captured with some degree of doubt.

Principles and practices

Specialized vs. general markup. As a rule, if it is not clear that something qualifies for specialized treatment, it can safely be captured as straight text. If you're not sure whether an elaborate treatment is justified, use the simpler treatment instead. This is almost always the safe thing to do: we don't lose any text that way, and we don't perpetrate any incorrect markup: better LESS markup than WRONG markup.

Page-image ID numbers. The beginning of each page (including the first page and all blank pages) should be recorded with a <PB> tag. The REF attribute of the <PB> tag is required: its value should be the filename or number of the page-image file, such as will provide unambiguous reference to the appropriate page image. E.g., a page appearing on the third page image will begin with <PB REF="3">; a page appearing on the seventh page-image might begin with <PB REF="7">. If it is necessary for some reason to capture the contents of the images in an altered sequence (e.g. because the scans were taken out of order), the REF values must still reflect the original sequence, e.g. as reflected in the filenames of the tiff files.

File naming. The text captured from each book should be returned as a single file, [idnumber].sgm (zipped up either singly or as a batch in a standard .zip file). A resubmission of a file previously submitted should insert a "rev" (for "revised") in the filename, e.g. WB1187.rev.sgm.
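
The naming rule is simple enough to express in a few lines. The helper below is a hypothetical illustration (its name and parameters are not part of any delivered tooling):

```python
def delivery_filename(id_number: str, revised: bool = False) -> str:
    """Build the file name for a captured text: [idnumber].sgm or [idnumber].rev.sgm."""
    # A resubmission inserts "rev" between the ID number and the extension.
    return f"{id_number}.rev.sgm" if revised else f"{id_number}.sgm"
```

For instance, a first submission of item WB1187 would be named WB1187.sgm, and its resubmission WB1187.rev.sgm.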


Full bibliographic metadata is normally stored separately from the text, e.g. in a TEI header or (better) in a MARC record from which a TEI header can be automatically generated, and is not regarded as part of text encoding per se.

The text itself as delivered must contain some bare minimum administrative metadata, with the option to provide more.




Material to record

With a few standard exceptions noted below, the text will be recorded in its entirety, first page to last, in the order it was intended to be read (top left to bottom right, left column before right column, etc.).

The chief (and rare) exception is parallel texts. Running parallel texts, printed in a multi-column, multi-row, or facing-page arrangement, or some combination thereof, need to be treated as separate texts (normally, separate <DIV>s, sometimes perhaps separate <TEXT>s), each one recorded until its end and not restarted on each page. Notes and other material relating to only one of the texts on a page need to be embedded in that text, not in any of the others. If a single heading or figure applies to more than one of the parallel texts, it should be recorded at the appropriate place in each text to which it applies.

Partial or fragmentary parallel texts will normally be broken primarily at the chapter or section level (e.g. <DIV1 TYPE="chapter">), then into parallel versions of that chapter (e.g. <DIV2 TYPE="version">) when necessary. But full parallel texts (e.g. an entire Latin-English parallel New Testament, or a Latin-English parallel Boethius) will normally be broken primarily into versions first (<DIV1 TYPE="version">), then each version into its chapters (<DIV2 TYPE="chapter">).

All material should be recorded in the form in which it appears in the book: do not attempt to correct spelling or typographic errors (except upside-down letters or physically displaced text). Spaces between words should always consist of one space character. Spacing around punctuation should be normalized, either to modern standards or to contemporary ones, if they clearly differ.

Material to record as attribute values

  1. 'Milestone' information

    Page numbers as printed in the book will be preserved only as the value of the "N" attribute of the <PB> (page-break) tag (unless the non-empty form of the PB tag is chosen). Unnumbered pages should receive a <PB> tag with the N attribute omitted. Incorrect page numbers, if they arise from typographic error, should be recorded just as they appear (otherwise: see comments on out-of-order pages). Page numbers will usually consist of arabic or roman numerals, but may also appear as letters or letter-number combinations. If there appear to be multiple separate paginations, choose one to record with the <PB> tag; record the other with a <MILESTONE> tag. Ignore any typographic elements used to set off the page number. E.g. -2-, {p. 2}, and PAGE 2 should all be recorded as <PB N="2">; (ccii) .cc.ii. and -ccii- should all be recorded as <PB N="ccii">; etc.

    Placement of <PB> tags. The rules are: (1) "pages always break at the top"; that is, <PB> tags will be inserted in the text at the actual location of the page break (the "top" of the page), regardless of the location of the page number on the printed page. (2) "Divisions begin at page breaks; they don't end there"; that is, if a structural break of some kind coincides with the page break (e.g., if a new section (<DIV>), paragraph, stanza, etc., begins at the head of the new page), the <PB> tag should be tucked inside the opening tag for the first non-empty structural element, neither inside the closing tag for the old division nor between the two divisions. And (3) "Words cannot break at page breaks"; that is, if a hyphenated word straddles a page break, finish the word and any attached punctuation, then insert the <PB> tag. Treat the hyphen as any other end-of-line hyphen.

    In parallel texts, material on a single page is often recorded at widely separated points in the data stream (once in each parallel <DIV>). In that case, the <PB> tag, including the page number, should be repeated, i.e., recorded in both <DIV>s. (The same may be done in the case of footnotes that carry over to the next page, but in fact we generally omit the extra PB in that case.)

    Foliation. Some books may be foliated instead of paginated, i.e., every leaf may receive a number, rather than every page (in which case, typically, the back page of each leaf has no number). Record a foliated book in the same way as a paginated book, supplying the folio number as the value of the "N" attribute of the <PB> tag. A typical page sequence in this kind of book will look like this:

    <PB N="iij">
    <PB N="iv">
    <PB N="v">

    The folio number may be explicitly labeled as such ("Fol.xvii." or "Folio .cxli."). Discard the label and punctuation and record only the actual number (<PB N="xvii">, <PB N="cxli">), again unless you are using the non-empty PB option.

    Page breaks in unbreakable objects. Occasionally an object such as a table will spread across a two-page opening so that the opening becomes in effect a single page. (This is different from a table that is simply continued from one page to the next.) In this case there is no sensible place to insert the <PB> tag that marks the break between the left and right pages, so it should be inserted before the unbreakable object, with a double "N" value.

    E.g., if a single table is spread across pp. 46 and 47 (both of them on image 22), the tagging should look like this:

    <PB REF="22" N="46-47">
    <TABLE> ... </TABLE>

    Objects that span two or more IMAGES (as opposed to pages) are another matter. This happens fairly commonly with large fold-outs, which may have been filmed in sections. In some cases, it may be possible to break these up into separate objects. In that case, each piece of the original foldout will be tagged as a separate (e.g.) <TABLE> with intervening <PB> tags to indicate the appropriate image on which the piece appears. In other cases, it is more feasible to treat the set of images as a single unbreakable object and insert the PBs in a group at the beginning of it.

    Other (largely) non-structural numerations and alternative numerations. If the book contains some other running numeration system alongside folio or page references, you may use the <MILESTONE> element to record it. If the nature of the unit is obvious (e.g. chapter 1 -- chapter 2 -- [etc.]), you may use the "unit" attribute to capture that information: <MILESTONE UNIT="chapter" N="1"> <MILESTONE UNIT="chapter" N="2">. Particularly complex sets of MILESTONEs often appear in works with a running marginal chronology, or set of chronologies; in these books, one sometimes finds a suitable UNIT value at the head of the column of marginal years: <MILESTONE UNIT="year before Christ" N="1234">. Note that this applies only to a sequence; occasional notes of this sort should be recorded simply as <NOTE>. If in doubt whether a set of numbers represents <MILESTONE>s or <NOTE>s, use <NOTE>. (Milestones can of course also be found embedded in notes that contain additional information.) Some books contain conflicting structural enumerations, e.g. a system of proposition numbers in the margins that does not correspond with the chapter numbers; the former may be recorded using <MILESTONE> tags.

    Some books mark the fine structure of the book or of the book's argument with marginal sequences of numbers. In many cases, such small units of structure (without headings) do not merit tagging as DIVs and the marginal indications can be readily tagged as MILESTONEs.

    Biblical verse numbers inserted in the text of a translation or paraphrase (either verse or prose) are usually most readily tagged as MILESTONEs (<MILESTONE UNIT="verse" N="14">).

  2. Structural numerations

    Line numbers in verse should be recorded only as the value of the "N" attribute of the <L> tag. Record in this fashion only line numbers actually printed in the book, and use the form of the number that appears in the book. (Line numbers in prose should usually be regarded as non-structural--that is, they do not correspond to any structure that we are tagging--and recorded as milestones, as above.)

    Stanza, chapter, section numbers, etc. (that is, sequential numbers that appear in the headings to <LG>s and numbered <DIV>s) should be included as they appear in the book as part of the text surrounded by the appropriate <HEAD> tag, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the appropriate <DIV> or <LG> tag.

    <DIV2 TYPE="chapter" N="5"><HEAD>Chapter V.</HEAD>

    <LG N="14"><HEAD>Stanza XIV.</HEAD>
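
    Where the heading carries a roman numeral, the arabic value for the "N" attribute can be derived mechanically. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the only early-print peculiarity it handles is a final "j" standing for "i" (as in "iij").

```python
_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_arabic(numeral: str) -> int:
    """Convert a roman numeral, including forms such as 'iij', to an int."""
    # Drop a trailing full stop ("V.") and treat 'j' as 'i'.
    s = numeral.strip().strip(".").lower().replace("j", "i")
    if not s or any(ch not in _VALUES for ch in s):
        raise ValueError(f"not a roman numeral: {numeral!r}")
    total = 0
    # Pair each letter with its successor; the last letter is paired with " ".
    for ch, nxt in zip(s, s[1:] + " "):
        value = _VALUES[ch]
        # A smaller value before a larger one is subtracted (IV = 4, XC = 90).
        if nxt != " " and value < _VALUES[nxt]:
            total -= value
        else:
            total += value
    return total


chapter = roman_to_arabic("V.")
stanza = roman_to_arabic("XIV.")
print(f'<DIV2 TYPE="chapter" N="{chapter}"><HEAD>Chapter V.</HEAD>')
print(f'<LG N="{stanza}"><HEAD>Stanza XIV.</HEAD>')
```

    The heading text itself is still transcribed as printed; only the attribute value is normalized.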

    Note that though the chief use of the N attribute is to record numbers, it can be used (guardedly) to record any comparable information, especially if it is sequential: alphabetical sequences (N="a" N="b"); names of countries; reigns of kings; in the latter cases, the N attribute can serve a normalizing function: <DIV1 TYPE="reign" N="Edward I">.

    Paragraph numbers (sequential numbers appearing at the beginning of a series of paragraphs that you have not chosen to regard as <DIV>s) should be included as they appear in the book as part of the text surrounded by the <P> tags, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the <P> tag.

    Item numbers and label numbers in lists should be recorded as part of the text included within the <ITEM> (or <LABEL>) tags. They may additionally, if feasible, be recorded and normalized as values of the attribute "N". See below under "Lists and Tables."

    Enumerations in tables may be variously treated: given a column of their own, left as part of the text in a row, or even made part of an embedded <LIST>, whichever adequately represents the information most simply and efficiently. It is usually best to include the numbers as part of the text. See below under "Lists and Tables."

  3. Other attributes

    Language. Supply a value for the LANG attribute of numbered <DIV>s and of whole <TEXT>s, but do so only if the bulk of the text (barring notes) in that <DIV> or in that <TEXT> is in the indicated language. Supply the attribute at the highest level at which it applies: e.g., if an entire text is in Latin, add LANG="lat" to the <TEXT> tag, but not to all the <DIV> tags within that <TEXT>; if one of the <DIV1>s in a text is in Latin and another is in English, assign LANG="lat" to one of the <DIV1>s and LANG="eng" to the other; and so on. Assume that the LANG property is inherited. Optionally: mark units as small as LG, Q, and FOREIGN with LANG attributes.

    Assign multiple LANG values to the same <DIV> or <TEXT> only if it contains two or more languages in some kind of organized relationship. E.g., a bilingual Latin/English dictionary should be coded as <TEXT LANG="lat;eng"> (with a semicolon between the two codes). Use the USMARC 3-letter language codes published by the Library of Congress. (These are identical to the 3-letter codes contained in the ISO standard 639-2.) Unless the scope and nature of the project require it, do not normally attempt to differentiate between forms of the same language: e.g., record LANG="fre" for French texts and LANG="eng" for English ones, not LANG="frm" ('Middle French') or LANG="enm" ('Middle English').
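
    As a rough illustration of the two rules above (semicolon-separated codes, and no distinction between historical forms of a language), a LANG value could be assembled as follows. The function name and the small mapping table are assumptions for this sketch; the table contains only the two pairs mentioned in the text and is not a complete list.

```python
# Historical forms folded into the general code. Only the two pairs named
# in the guidelines are listed; any further entries would be project decisions.
_COLLAPSE = {"frm": "fre", "enm": "eng"}


def lang_value(codes) -> str:
    """Join 3-letter language codes with semicolons, folding period variants."""
    result = []
    for code in codes:
        code = code.strip().lower()
        if len(code) != 3 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"expected a 3-letter code, got {code!r}")
        code = _COLLAPSE.get(code, code)
        if code not in result:  # avoid "eng;eng" after folding
            result.append(code)
    return ";".join(result)


print(f'<TEXT LANG="{lang_value(["lat", "eng"])}">')
```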

    TYPEs of DIV. Supply a value for the TYPE attribute of numbered <DIV> elements if the appropriate value is obvious; otherwise, omit the attribute entirely. You may find it useful to consult our (aging) TCP list of common and preferred DIV TYPEs. If you do supply a value, use these rules:

    1. Use the designation supplied by the book itself. "Chapter 3" should be recorded as <DIV1 TYPE="chapter">

    2. Use lower-case throughout ("chapter" not "Chapter") unless the value includes a proper name.

    3. If the designation is not in English, and there is a ready equivalent in English, use the English. E.g., for "pars" or "partie" use "part"; for "capitulum" or "chapitre" or "cm." or "chapt." or "cap." use "chapter".

      If the designation in the book is a verbose version of a common English term, use the simpler form. E.g., if the book says "Prefatory Remarks by the Author," you shouldn't be afraid to translate this into <DIV1 TYPE="preface">

      As a fallback, use whatever is printed.

    4. If there is no designation in the book, and the <DIV> is used to mark a series of items of similar type, use a term describing the form or genre shared by the items. E.g., in a book of poems, use
               <DIV1 TYPE="poem">
               <DIV1 TYPE="poem">
      See further under Poetry, below.

    5. If there is no designation in the book, and the <DIV> is used to mark a series of items of dissimilar type, or if there is no series at all, just use a term that describes the form of the item as generically as can be (<DIV1 TYPE="letter">; <DIV1 TYPE="preface">).

    6. The above criteria are intended mainly for vendors, who have the option of omitting the TYPE attribute. Reviewers have no such option and should consult the much more thorough discussion of the assignment of DIV types.

    Provide attribute values only when instructed to and when there is specific information to supply. Do not supply values of this sort: TYPE="unknown" or TYPE="unspecified".
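
    Rules 1 to 3 and the ban on placeholder values lend themselves to a small lookup. The following is a minimal sketch, not the project's actual tooling: the function names are hypothetical, the mapping holds only the equivalents quoted in rule 3, and the code lower-cases everything, so values containing a proper name (the exception in rule 2) would need to be handled by hand.

```python
# Non-English or abbreviated designations mapped to the English term (rule 3).
_ENGLISH = {
    "pars": "part", "partie": "part",
    "capitulum": "chapter", "chapitre": "chapter",
    "cm": "chapter", "chapt": "chapter", "cap": "chapter",
}
# Placeholder values that must never be supplied.
_PLACEHOLDERS = {"unknown", "unspecified"}


def div_type(designation):
    """Return a TYPE value, or None if the attribute should be omitted."""
    if designation is None:
        return None
    key = designation.strip().rstrip(".").lower()  # rule 2: lower-case
    if not key or key in _PLACEHOLDERS:
        return None
    return _ENGLISH.get(key, key)  # fallback: whatever is printed


def div1_tag(designation=None) -> str:
    """Build an opening <DIV1> tag, omitting TYPE when there is no value."""
    value = div_type(designation)
    return f'<DIV1 TYPE="{value}">' if value else "<DIV1>"


for printed in ["Chapter", "cap.", "Partie", "unknown", None]:
    print(repr(printed), "->", div1_tag(printed))
```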

Material not to record at all, barring project-specific exceptions.

  1. Running headers and footers.

  2. Catchwords and quire signatures.

  3. All other text that is simply an artifact of the printing process.

  4. Handwritten notes or other handwritten material [unless the option to use DEL and ADD has been exercised].

  5. Text within illustrations (except usefully searchable captions and similar material).

  6. Separator lines and similar typographic flourishes.

  7. Most formatting not essential to sense. See below.

Material to record only as an empty tag, quasi-empty tag, or flag character marking the location

  1. Illustrations without labels or captions should be captured as an empty <FIGURE> tag (<FIGURE></FIGURE>). Captions or similar labels usefully searchable or useful for identifying the content of the illustration should be captured as a <HEAD> (if a genuine caption) or simply within <P> tags (etc.) placed within an otherwise empty <FIGURE> tag (<FIGURE><HEAD>The meaning of the Embleme.</HEAD></FIGURE>).

    Captions. It is not always easy to distinguish between captions and other text within the illustration. Captions may appear below the illustration, above it, in a circle around it, or even within it (e.g. on a "shield" or similar device), and may often be distinguished from other text by the fact that they provide a summary identification or description of the illustration. If in doubt, assuming that the text can be read, capture it.

    Mixed text and illustration (e.g. where the woodcut frames the text, or where a block of text (e.g. a poem) is printed by means of woodcut) can in most cases be captured by treating the illustration per se and the text as separate items. In the case of a poem printed by means of woodcut at the bottom of a larger illustration, for example, it is often easiest to capture like this: <P><FIGURE></FIGURE></P><LG><L>... </L></LG>.

    In-line illustrations, if they are truly in-line (that is, can be unambiguously located within a line of text) should be inserted (as <FIGURE>) within the text at the appropriate spot. If the appropriate location is not quite so obvious (e.g., an illustration occupying two or three lines of text inserted in the text or placed in the margin), use the rules for marginal notes (below). That is: if the correct location can be identified easily (e.g. by an identifying phrase, "as shown in this figure:") place the <FIGURE> tag within the text at that point; if not, simply place it after the nearest sentence-ending punctuation (e.g. a period or colon).

  2. Complex tables with no searchable textual content if they are difficult to capture may be treated as if they were illustrations. See further below under tables.

  3. Missing pages and large chunks of missing text. A span of text that appears to be more or less completely missing (e.g. because of a missing page image, or a torn page, or even if it appears to have been accidentally left blank), should be marked with the empty tag <GAP DESC="missing"> or <GAP DESC="missing" EXTENT="1 page">, as the case may be. Individual words or letters that are missing should normally be treated using the illegibility flag (@) described below.

    This should be distinguished from spaces deliberately left blank. If these are significant and occur within the text, e.g. as blank spaces left to be filled in by hand in a legal or commercial form, capture these as <GAP DESC="blank">.

  4. Duplicate pages may arise from many causes (e.g. in image sets based on microfilm they are fairly common because the original microfilm photographers often retook doubtful shots). When such duplication is noticed, capture only one copy of each page, representing each uncaptured page with a <GAP DESC="duplicate"> tag. We always attempt to capture the book as originally printed or intended, not the accidents attendant on the creation of intervening surrogates.
    There is no firm rule as to *which* copy to keep and which to <GAP> out, except that it would be sensible to keep the better copy and exclude the worse one. Often it will be the second copy which is the better (it is because the photographers thought there might be something wrong with the first copy that they made a second copy). If there is a duplicate run of images, and one is complete and the other incomplete, normally you should keep the complete set and exclude the incomplete set. If the situation gets more complicated, e.g. if both sets are incomplete, but are missing different pages, or if only one set is complete, but includes some bad images that can be replaced by images from the incomplete set, you may have to mix and match. In any case, the desired result is: the best possible text, from the best images, in the right order.

    Any images that are given the <GAP> treatment should be represented by separate <GAP> tags for each page (not each image), rather than attempting to represent a span of pages or a span of images with a single <GAP> tag. This is so that each image, regardless of whether it is captured or not, will still be represented in the text by a <PB> tag. Each <PB> tag should, of course, point to the actual image number using the REF attribute.
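
    The one-GAP-per-page rule can be sketched as follows. The function name and the input shape (one (image number, printed page number) pair per skipped page) are assumptions for illustration, as is the choice to emit each <PB> immediately before its <GAP>.

```python
def duplicate_page_gaps(pages) -> str:
    """Emit one <PB> and one <GAP> per skipped page.

    pages: iterable of (image_ref, printed_number_or_None), one entry per
    page, so an image showing a two-page opening contributes two entries.
    """
    lines = []
    for image_ref, number in pages:
        attrs = f'REF="{image_ref}"'
        if number:  # unnumbered pages get no N attribute
            attrs += f' N="{number}"'
        lines.append(f"<PB {attrs}>")
        lines.append('<GAP DESC="duplicate">')
    return "\n".join(lines)


# Image 22 shows pp. 46 and 47; image 23 shows an unnumbered page.
print(duplicate_page_gaps([(22, "46"), (22, "47"), (23, None)]))
```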

  5. "Foreign" (non-Latin) alphabets. Extended text in a non-Latin alphabet. Though individual letters (e.g. Greek or Hebrew letters used as manuscript sigla, symbols, reference marks, or abbreviations) should be recorded as special characters, using character entities (see discussion of Characters, below), entire words or extended passages in a non-Latin alphabet (Cyrillic, Hebrew, Greek, Arabic, etc.) should be recorded simply as <GAP DESC="foreign">, without transcribing the word(s) themselves. The tags cannot contain any text, though any notes, milestones, page-breaks, etc. that appear within the passage should be recorded as usual, using <GAP> tags before and after the interrupting milestones as necessary.

    Surrounding structures should be preserved if possible, at the highest level that applies. A line of verse quoted in Greek, for example, should be recorded as <Q><L><GAP DESC="foreign"></L></Q>; a paragraph in Greek as <P><GAP DESC="foreign"></P>; and a stanza in Greek as <LG><GAP DESC="foreign"></LG>.

    [Image: example of mixed Greek-English text]

    Record as: the semicircle .18.15, <GAP DESC="foreign"> .21.7, <GAP DESC="foreign"> .23

  6. The presence of musical notation should be recorded with the <GAP> tag, with the value of the "DESC" attribute assigned as "music": <GAP DESC="music">.
    Extended spans of music should be captured using a single <GAP> tag, so long as other material (such as text, illustrations, or a page-break) does not interrupt.

    Lyrics printed between lines of music should be recorded as ordinary prose. At every point at which the line of lyrics ends and a line or two of musical notation appears, insert within the running prose a <GAP DESC="music"> tag.

  7. Any mathematical formulas or mathematical notation too complicated (or too dependent on two-dimensional layout) to be rendered as plain text should be recorded with the <GAP> tag, with the value of the "DESC" attribute set to "math": <GAP DESC="math">.

  8. Illegible text, missing and damaged text, or clear but unrecognized symbols all will require some attention from reviewers. Below are the guidelines for conversion firms. Reviewers should go beyond these instructions to examine the data supplied by the conversion firms, and resolve illegibilities if possible and feasible, if necessary using the <UNCLEAR> tag to surround any words (words only, not individual letters) that are worth capturing but cannot be captured with confidence because all or part of the word is illegible or nearly so. Any remaining illegibilities should be converted into <GAP> tags, with the DESC set to "illegible"; the EXTENT set to (e.g.) "1 letter" or "2 words" or similar indications of extent; and the RESP set to indicate whatever agency decided that the text was illegible. Reviewers should similarly examine all characters flagged (by #) as clear but unknown symbols. They should be resolved, if possible and feasible, into characters; if not, converted into <GAP DESC="symbol" EXTENT="1 character"> (etc., mutatis mutandis).


    Illegible text that cannot be read, for whatever reason, should be marked using variations on the "@" symbol:
    @ = individual character or characters, less than a word.
    @word@ = a whole word
    @span@ = any span of two or more words, or when it is not clear how many words are involved; in any case less than a page.
    @page@ = a whole page.
    Additional variants are possible if it proves useful to flag some other piece of the structure as unreadable, e.g.:
    @para@ = illegible paragraph
    @line@ = illegible line of verse or prose

    Unknown symbols or characters, if they can be distinguished from illegible characters, should preferably be recorded as "#" instead of "@".
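
    A reviewer who wants a quick tally of the flags in a keyed file might count them along these lines. This is a sketch under assumptions: the function name is hypothetical, the text is assumed to contain no literal "@" or "#" other than flags, and a "#" directly after "&" is skipped on the assumption that it belongs to a numeric character reference.

```python
import re
from collections import Counter

# Named flags are tried first so that "@word@" is not read as two bare "@"s.
_FLAG = re.compile(r"@(?:word|span|page|para|line)@|@|(?<!&)#")


def count_flags(text: str) -> Counter:
    """Count illegibility ('@...') and unknown-symbol ('#') flags in text."""
    return Counter(match.group(0) for match in _FLAG.finditer(text))


sample = "th@ quick @word@ jumps @span@ # end @@"
for flag, total in sorted(count_flags(sample).items()):
    print(flag, total)
```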

    The illegibility threshold. Two extremes should be avoided as far as possible: (1) using the illegibility markers promiscuously to avoid capturing text about which there is some difficulty; and (2) "creative" capture of text that really cannot be read, simply in order to avoid using the illegibility marker. We have prepared some examples of both overuse (EXAMPLE SET 1) and underuse (EXAMPLE SET 1; EXAMPLE SET 2; see also the bottom of SET 3) of the illegibility markers. It is admittedly not always easy to tell when a letter can be recognized with sufficient confidence to make its capture reliable.

Large structures


One text or many? Most works will consist of a single <TEXT> containing a single <BODY> element (optionally also a <FRONT> and/or <BACK> element for front and back matter respectively). Some works will consist instead of a <GROUP> element that contains multiple <TEXT>s (each <TEXT> with its own <BODY> and, optionally, <FRONT> and <BACK>). The GROUP element will normally be reserved for items that contain several works published or bound together, each with its own title page, that were originally printed separately, e.g. the collected works of an author.

The DTD allows for two options that cover most real books:

Embedded texts (i.e., documents of one sort or another embedded in a larger work), can often be successfully captured as quoted texts, using <Q><TEXT> ... </TEXT></Q> or the equivalent <INSERT> element. See below under quotations.


The <BODY> (and, if necessary, the <FRONT> and <BACK> elements) will normally be divided into numbered <DIV>s corresponding to the main divisions of the text. Very simple documents, on the other hand, with no internal division (a work consisting of a single poem, for example, or a tract containing only a series of paragraphs) do not require <DIV>s at all: <BODY><P> is sufficient. Use no more <DIV> layers than necessary.


The numbered <DIV> elements, from <DIV1> to <DIV7>, represent a hierarchy: the <BODY> (as also the <FRONT> and <BACK> matter) is subdivided into <DIV1>s; <DIV1>s, if necessary, are subdivided into <DIV2>s, and so on. <DIV>s divide into parts: with few exceptions, you need to have more than one of something to call it a <DIV>.

Individual small texts embedded within a larger work (e.g. entire poems quoted within a chapter of a treatise) should usually not be tagged as <DIV>s but should instead be placed within <Q> tags. The <Q> element may if necessary contain an entire <TEXT>, with its own <BODY>, <FRONT>, <BACK>, numbered <DIV>s etc. The <INSERT> element is a shortcut for this structure. See further under quotations, below.

Useful clues to the DIV structure include:

Weaker evidence for <DIV>s includes:

In general, these are not sufficient to establish a <DIV> and should instead be recorded as ordinary text. Numbered paragraphs, for example, should simply retain the number as part of the paragraph (and as the value of the "N" attribute of the <P> tag), but there is no need to call the number a <HEAD> and therefore make the <P> a <DIV>.

<P N="3">¶ III. In the third place, the Calvinist partie striveth ...

Marginal "headings" that you decide not to treat as <HEAD>s can usually be encoded either as <NOTE>s, with the PLACE attribute set to "marg", or (if they contain a sequential numeration) as <MILESTONE>s.

TYPES of DIVs. See above under "attributes."

Front matter

Front matter (material to include in the <FRONT> element) typically includes title pages, dedications, tables of contents, prefaces, prologues, honorific poems and prose blurbs (encomia), remarks "to the Reader", etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc., just as with <BODY>.

Title pages do not require special tags. Each title page should be recorded as a numbered <DIV> within the <FRONT> element. Include both the front and back (recto and verso), if there is material there to record. If there are multiple title pages, record each in a separate <DIV>. Most title pages can be recorded as simple blocks of prose text (recorded with <P>s). Other structural tags (e.g. <HEAD> or <EPIGRAPH>) should be avoided; verse quotations and illustrations on the title page should of course be recorded as such, using <LG>, <L>, <Q>, and <FIGURE>. Entirely engraved title pages should place the text for the entire page within a <FIGURE> tag and supply the TYPE value TYPE="engraved title page".

Back matter

Back matter (material to include in the <BACK> element) typically includes indexes, glossaries, colophons, afterwords, appendices, etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc., as with <BODY> and <FRONT>.


In general

Do not attempt to record the physical appearance of the page (centering, extra spaces, justification, type face, type size, etc.), though such cues may and should be used to determine the beginning and end of divisions within the text, the distinction between text and notes, etc. On type forms, see the special instructions below about use of the <HI> tag and its relatives.

Record physical line-breaks (with the <LB> tag) only (1) if the text is unintelligible without a break; and (2) if the break is not reflected by a structural tag. Many times, it is better to repeat a tag than to insert a line break in the middle of one; but more often it is possible to get by without doing either, especially if there is any punctuation at the line break. E.g., record this:

            CHAP. XI.
  Some Advantages and Helps for raising
  and affecting the Soul by Meditation.
like this:
  <HEAD>CHAP. XI.</HEAD>
  <HEAD>Some Advantages and Helps for raising and affecting
        the Soul by Meditation.</HEAD>
or, better, like this:
  <HEAD>CHAP. XI. Some Advantages and Helps for raising and
        affecting the Soul by Meditation.</HEAD>
but NOT like this:
  <HEAD>CHAP. XI.
  <LB>Some Advantages and Helps for raising and
        affecting the Soul by Meditation.</HEAD>

However, some loosely formatted text can only be rendered intelligible by use of <LB> tags, in much the same circumstances that would require the use of the <PRE> tag in HTML. Common uses include the capture of inscriptions with significant line breaks, and quasi-mathematical text such as syllogisms:

Rover is a dog<LB>
All dogs have tails<LB>
ERGO Rover has a tail

Paragraph breaks should be recorded with <P> in prose and with <LG> (line-group or stanza) in verse.

Typeface changes and typeforms

Set the default typeform for a given region using the REND attribute of the various structural tags, e.g. <DIV1 TYPE="dedication" REND="italic">. Text in roman type is the default and does not need to be marked. Set default typeforms at the highest level that applies (much as when assigning the LANG attribute).

Mark text in a different typeform within a given region by using a mixture of the I, B, U, and HI tags. I, B, and U are simply shortcuts for HI REND="italic", HI REND="bold", and HI REND="underlined". In practice, it is often easiest to use the separate I, B, and U tags, and reserve HI for typeforms not otherwise accounted for.

Treat I, B, U, and HI as cancelling the value of the REND attribute of the structural element. But treat I, B, U, and HI as cumulative with respect to each other, much as is done in HTML: <I><B> = bold italic.

Alternatively, treat the tags as mutually exclusive and use the REND attribute of HI to indicate combined typeforms, separating the values with a semicolon. E.g., <HI REND="bold;italic">.

When punctuation coincides with the end of a span marked by the <HI> elements, and there is doubt as to whether the punctuation belongs inside or outside the closing tag, place it within the closing </HI> tag:

<HI>Sillepsis,</HI> or the Double supply.

Record superscripted and subscripted text using the keyboard "circumflex" or "caret" character (^ = DECIMAL 94, HEX 5E) before each superscripted character (^a;, ^b; 5^t^h; 2^n^d) and the same "caret" character doubled (^^) before each subscripted character (i.e., ^^a;, ^^b;, etc.), including punctuation characters.
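
The caret convention is easy to apply mechanically. The helper names below are hypothetical and the sketch simply prefixes every character, punctuation included, as the rule above requires.

```python
def superscript(chars: str) -> str:
    """Prefix each superscripted character with a single caret."""
    return "".join("^" + ch for ch in chars)


def subscript(chars: str) -> str:
    """Prefix each subscripted character with a doubled caret."""
    return "".join("^^" + ch for ch in chars)


print("5" + superscript("th"))  # 5^t^h
print("2" + superscript("nd"))  # 2^n^d
print("H" + subscript("2") + "O")  # H^^2O
```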

Record large initials, "drop caps," etc., as ordinary capital letters.

Record "small caps" as ordinary capital letters.

Record vertical text (text printed perpendicularly to the main text) as if it were horizontal.

Block quotations

<Q>s are used for block quotations, whether of prose or verse. Don't use them for ordinary "inline quotations."

"Block quotations" include both quotations that are set off from the main text by indentation and blank lines (in the modern way) and also lengthy quotations that are set off by the use of other typographic cues such as a change of typeface (if unambiguously marking a block quotation). If you're not sure if a block of text is a <Q>, simply record the appearance of the text (using, e.g. <P> and <HI>).

See below for the special problem of marginal quotation marks or marginal inverted commas.

<Q>s are usually the best way to tag even very substantial items embedded in prose, e.g. a poem or a document of some kind quoted within a chapter, or within a note, or within an introduction.

<Q> can if necessary even contain an entire <TEXT>, with its own <FRONT> matter, <BODY>, <DIV> structure, and so on. Use <Q> for such embedded items (or the stand-in tag <INSERT>), rather than trying to treat them as <DIV>s of the main text (unless that's really what they are). Treating them as <DIV>s forces you to treat all the material surrounding them as <DIV>s too, at the same level.
Prefer this:

       <DIV1 TYPE="introduction">
       <P>blah blah</P>
       <P>blah blah</P>
         <Q>here's a poem</Q>
       <P>blah blah</P>
to this:

       <DIV1 TYPE="introduction">
       <DIV2 TYPE="stuff before the poem">
         <P>blah blah</P>
         <P>blah blah</P>
       <DIV2 TYPE="poem">
         <LG><L>here's a poem</L></LG>
       <DIV2 TYPE="stuff after the poem">
         <P>blah blah</P>

Block quotations accompanied by citations should record the quotation within <Q> tags and the citation within <BIBL> tags.

Notes, etc.

Most material that is set off from the main body of the text but is adjacent and related to it can be safely tagged as <NOTE>. (But arguments (summaries at the head of <DIV>s), salutations, and speaker names and stage directions in drama are among the note-like features that have their own tags. Also see above concerning MILESTONEs.)

With the exception of end notes, record each note at the point in the main text to which it relates, set off by appropriate tags, not at the point where it appears on the page.

A note that spills onto the next page needs to be treated as a single note, not two, and should be placed in the text where it applies.

  1. Notes tied to points

    If the note points to a place in the text which is marked with a flag of some kind (e.g. a footnote reference number, an asterisk (*), etc.), discard this marker from both note and text once it has served its purpose by locating the <NOTE> in the right place in the text. The marker should be preserved only as the value of the "N" attribute of the <NOTE> tag. Notes that use non-alphabetical symbols such as "daggers," section-marks, paragraph marks, etc., should preserve those characters too in the "N" attribute if possible, using character entities or however the character would be represented elsewhere in the text, like this: <NOTE N="&dagger;">. If the character is not recognized as corresponding to a readily available character entity, supply "#" or "@" as the value, using the rules for unrecognized symbols. If the note contains a marker, but the text does not, or vice versa, act as if the marker were present in both places. If the note contains a marker that is different from the one in the text, use one or the other (usually the one that makes most sense in the local sequence) and ignore the other.

    Sometimes notes can be accurately placed only by noting their sequence. There may be three marginal notes on a page, for example, matched by three asterisks in the text; the first note is inserted at the first asterisk, the second note at the second asterisk, and the third note at the third asterisk.
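
    Placement by sequence can be illustrated with a short routine that swaps the first marker for the first note, the second for the second, and so on. It is a sketch only: the function name is hypothetical, the marker is assumed to occur nowhere else in the text, and the routine refuses to guess when the counts disagree.

```python
def place_notes_by_sequence(text: str, notes, marker: str = "*",
                            place: str = "marg") -> str:
    """Replace the k-th marker in text with the k-th note, tagged as <NOTE>."""
    parts = text.split(marker)
    if len(parts) - 1 != len(notes):
        raise ValueError(
            f"{len(parts) - 1} marker(s) in text but {len(notes)} note(s)"
        )
    out = [parts[0]]
    for note, rest in zip(notes, parts[1:]):
        # The marker survives only as the value of the N attribute.
        out.append(f'<NOTE PLACE="{place}" N="{marker}">{note}</NOTE>')
        out.append(rest)
    return "".join(out)


print(place_notes_by_sequence(
    "He fled* and later returned* to his see.",
    ["Under Diocletian.", "After his death."],
))
```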

    If the note is keyed to the text by line number, verse number, etc., place the note at the end of the line (etc.) to which it applies, and discard the literal number from the note, if that can be done without loss of clarity.

    Use the "PLACE" attribute of the <NOTE> tag to indicate where the note appears on the page:
    • PLACE="marg" in margin or adjacent to the text (even if part of it runs across the whole page because of lack of room in the margin, or if it is set into the edge of the text as a "shoulder" note)
    • PLACE="foot" in a footnote, below the text
    • PLACE="inter" interlinearly (between the lines of text)

    If there are multiple distinguishable sets of notes in the same location (two sets of footnotes, for example; or multiple sets of marginal notes marked by different kinds of flags, one set marked by numbers, one by letters), distinguish them by using the SERIES attribute with something distinctive as its value (usually a simple number): PLACE="foot" SERIES="1" and PLACE="foot" SERIES="2" for example.

    Example of book with two sets of marginal notes, one keyed to letters, one to numbers; record them as <NOTE PLACE="marg" SERIES="1"> and <NOTE PLACE="marg" SERIES="a">
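
    A small builder can keep the PLACE and SERIES attributes consistent. This is a hedged sketch: the function name is hypothetical, and the set of PLACE values is limited to the three listed above, although a project may well allow others.

```python
# The three PLACE values listed above; a project may define more.
_PLACES = {"marg", "foot", "inter"}


def note_tag(place: str, n=None, series=None) -> str:
    """Build an opening <NOTE> tag with PLACE and optional SERIES and N."""
    if place not in _PLACES:
        raise ValueError(f"unexpected PLACE value: {place!r}")
    attrs = [f'PLACE="{place}"']
    if series is not None:
        attrs.append(f'SERIES="{series}"')
    if n is not None:
        attrs.append(f'N="{n}"')
    return "<NOTE " + " ".join(attrs) + ">"


print(note_tag("marg", n="3", series="1"))
print(note_tag("marg", n="b", series="a"))
```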

    Notes that apply to two (or more) distinct loci or lines should be reproduced and inserted at *both* (or all) the relevant points.

    These need to be distinguished from notes that apply to a span of loci or lines; notes applying to a span of lines should be placed after the last line in the span with indications of the length of the span (e.g., "14-23" [with reference to line numbers] or "*-*" [with reference to two "*" flags in the text]) retained.

  2. Notes tied to regions, divs, etc.

    1. Mostly in verse:

      A note that appears next to a single verse line or set of lines and seems to relate to that line (or set) should be placed at the end of the line(s) in question.

      A note that relates to a specified group of lines, verses, etc., should be moved into the text at the end of the last item to which it applies. If there are line numbers, the line number indication in the note should be preserved. If physical arrangement, rather than explicit line numbers, serves to specify the line or verse number range, and there are line numbers in the verse, supply the appropriate number range in brackets at the beginning of the note.

      Notes referenced to a line (verse, etc.) number followed by "f." ("2365 f." meaning "line 2365 and following") should be treated as notes referenced to a span of two lines (in this case, 2365-66), that is, placed at the end of the second line (2366), with the full line reference preserved in the note: <NOTE PLACE="foot">2365 f.: ... </NOTE>

    2. Mostly in prose:

      A note that seems to relate to an entire text division (e.g. a <DIV> or <P>) should be inserted at the beginning of the text that comprises that division, or at the end of the <HEAD> if that is more convenient (and if the division has one). E.g. a marginal note applying to a paragraph as a whole may be inserted at the beginning of the paragraph. This occurs commonly in books that contain a running summary or set of running headers in the margins: if these are not treated as <HEAD>s or <ARGUMENT>s, they should be treated as <NOTE>s (PLACE="marg") and inserted at the beginning of the section to which they apply. If the summary is found centered at the head of the text proper (instead of in the margin), it should usually be given a tag of its own and tagged as <ARGUMENT> or <HEAD> (see below under "Heads").

      A marginal note in a prose text that seems to apply vaguely to the material next to which it is placed should be inserted at the end of the nearest sentence (as marked by punctuation--a period, semicolon, or colon), or at some other break in the text if that seems more appropriate.
      In the case of notes that supply bibliographic citations, similarity of wording between note and text may provide a clue as to the best place to insert the note, as in this example:

            Democ.Instit.    Antonius Demochares saith of him, that he was exiled
            Christ.relig.    in the persecution under Diocletian, and that he
                             returned from banishment after the death of Diocletian
                             and Licinius, and recovered his Bishoprick again,
                             where he continued until the reign of Iulian.

      <P>Antonius Demochares saith of him,<NOTE PLACE="marg">Democ. Instit. Christ. relig.</NOTE> that he was exiled in the persecution under Diocletian, and that he returned from banishment after the death of Diocletian and Licinius, and recovered his Bishoprick again, where he continued until the reign of Iulian.</P>

      A note that relates generally to the material on a page, or for which the appropriate place cannot readily be determined, should be attached to the last line of text at the bottom of the page.

  3. End notes

    End notes (whether appearing at the end of the book or at the end of the section), especially if they occupy any considerable space (a page or more) should not be inserted in the main text, but should instead be captured on the page and in the place where they appear. Depending on their extent, they may be captured as <P>s, <DIV>s, or even list <ITEM>s. All reference information in the note (e.g. footnote number, brief quotation of words from the main text) should be left in place.

    Additionally and optionally, provide links from the main text to the end notes using the <PTR> or <REF> tags. You will need to supply each note with a unique ID based on the ID number for the book as a whole (e.g. ID="A12345-page17-note3"), and reference that ID in the TARGET attribute of the PTR or REF element. The PTR element is an empty element, to be used when there is no particular literal text serving as the cross-reference; the REF element contains any literal text that serves as a cross-reference, e.g. a note marker: <REF TARGET="A12345-page17-note3">^3</REF>.
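The ID and cross-reference pattern described above can be sketched in a few lines of Python. This is only an illustration: the helper names `note_id`, `ref_tag`, and `ptr_tag` are hypothetical, and the serialization of the empty PTR element as a bare start tag is an assumption about SGML output rather than something these guidelines specify.

```python
def note_id(book_id: str, page: int, note: int) -> str:
    """Build a unique note ID from the book ID, following the pattern A12345-page17-note3."""
    return f"{book_id}-page{page}-note{note}"


def ref_tag(target: str, marker: str) -> str:
    """REF wraps the literal text that serves as the cross-reference (e.g. a note marker)."""
    return f'<REF TARGET="{target}">{marker}</REF>'


def ptr_tag(target: str) -> str:
    """PTR is empty: used when no literal text serves as the cross-reference.
    Assumption: written as a bare SGML start tag with no end tag."""
    return f'<PTR TARGET="{target}">'


target = note_id("A12345", 17, 3)
print(ref_tag(target, "^3"))
print(ptr_tag(target))
```

Generating IDs from the book ID, page, and note number in one place keeps the ID on the note and the TARGET in the main text from drifting apart.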

Reference numbers in the text that point to something other than a note (e.g. to some part of an illustration), or for which the target cannot be found, should simply be recorded as part of the text.

Passages of verse (especially 2 or more lines, quoted and arranged as verse) within a note will normally be most readily coded as a quotation (<Q>) containing <L>s or <LG>s, embedded within the <NOTE> element.

Notes comprising a running interlinear commentary or interlinear gloss pose special problems.
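As an illustration of the rule given earlier for notes referenced with "f." (treated as a span of two lines, placed at the end of the second, with the full reference preserved), here is a minimal Python sketch. The function names are hypothetical, and the sketch assumes the reference has the simple form "number f."; anything else is rejected rather than guessed at.

```python
import re
from typing import Tuple


def following_span(reference: str) -> Tuple[int, int]:
    """Turn a reference such as '2365 f.' into the two-line span (2365, 2366)."""
    match = re.fullmatch(r"\s*(\d+)\s*f\.\s*", reference)
    if match is None:
        raise ValueError(f"not an 'f.' reference: {reference!r}")
    start = int(match.group(1))
    return start, start + 1


def placement_line(reference: str) -> int:
    """The note is inserted at the end of the second line of the span."""
    return following_span(reference)[1]


def note_tag(reference: str, text: str, place: str = "foot") -> str:
    """Keep the full line reference at the start of the note text."""
    return f'<NOTE PLACE="{place}">{reference}: {text}</NOTE>'


print(placement_line("2365 f."))
print(note_tag("2365 f.", "..."))
```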

Lists and tables

In general, prefer to record itemized sequences as <LIST>s rather than <TABLE>s if possible. Use <TABLE> when the material cannot be readily understood without the spatial organization that tables provide. It is sometimes possible to capture items outside the main text flow as <NOTE>s or <MILESTONE>s rather than resorting to a <TABLE>.

The TEI 'list' element may be compared to a merger of the HTML OL, UL, and DL elements. It contains essentially two content models, one consisting of a sequence of ITEMs, the other consisting of a series of LABEL-ITEM pairs. The former is more commonly used.

Numbered sequences of items when the items themselves are blocks of text of considerable size (numbered paragraphs, for example) should not be treated as lists, but simply as numbered paragraphs (<P N="3">3. ...).

Complex lists (lists within lists) should be encoded with nested <LIST> tags, i.e. a <LIST> tag within an <ITEM> of another <LIST>:

<LIST>
<ITEM> .. </ITEM>
<ITEM>
   <LIST>
      <ITEM> .. </ITEM>
      <ITEM> .. </ITEM>
   </LIST>
</ITEM>
</LIST>

Outline structures, genealogical trees and similar tree structures, and complex formatting involving braces can often best be tagged as nested lists, sometimes nested to a very deep level.

Treat any numbers that enumerate items in a list as part of the text of that item; do not record them with separate <LABEL> tags, though you may (optionally) also include them in normalized form as the N attribute of the ITEM tag. E.g.:

      <ITEM>1. Avarice</ITEM>
      <ITEM>2. Sloth</ITEM>
      <ITEM>3. Pride</ITEM>

or, with the optional N attribute:

      <ITEM N="1">1. Avarice</ITEM>
      <ITEM N="2">2. Sloth</ITEM>
      <ITEM N="3">3. Pride</ITEM>
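A small Python sketch of this rule, for illustration only: the enumerator always stays in the item text, and the N attribute is added only on request. The function name `item_tag` is hypothetical, and the sketch assumes enumerators are arabic numerals followed by a period or closing parenthesis; other numbering styles are simply left without an N value.

```python
import re


def item_tag(text: str, with_n: bool = False) -> str:
    """Wrap a list entry in <ITEM>, keeping its enumerator as part of the text."""
    if with_n:
        # Assumption: the enumerator is an arabic numeral followed by '.' or ')'.
        match = re.match(r"\s*(\d+)[.)]", text)
        if match:
            normalized = str(int(match.group(1)))  # e.g. "03" becomes "3"
            return f'<ITEM N="{normalized}">{text}</ITEM>'
    return f"<ITEM>{text}</ITEM>"


for entry in ["1. Avarice", "2. Sloth", "3. Pride"]:
    print(item_tag(entry, with_n=True))
```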

Typical indexes and tables of contents can be readily tagged using simple lists containing only <ITEM>s, especially if there is punctuation between the items and the page numbers. Always prefer this option if possible. E.g.:

<ITEM>Malva, Wild Mallow, 46.</ITEM>
<ITEM>Maple, 87, 91.</ITEM>
<ITEM>March Mallows, 59.</ITEM>
<ITEM>Matricaria, Featherfew, 54.</ITEM>
<ITEM>Meadow Saffron, 19.</ITEM>
<ITEM>Medune celebrated, 35.</ITEM>
<ITEM>Meleagris, checquer'd Daffedil, 52.</ITEM>
<ITEM>Melilot, Plaister Claver, 46.</ITEM>
<ITEM>Melissa, Balm, 59.</ITEM>

The page number is technically an internal cross-reference and may therefore be tagged using the <REF> tag, especially if it is desired to provide links from the table of contents to the items listed therein.

Even when punctuation is lacking (e.g. when the indexed item is left justified and the page number right justified), simple <ITEM>s will often do. Here is an example without punctuation (and including some nested lists):

  <ITEM>at variance with himself &c. 24</ITEM>
  <ITEM>An inbred malice in him 48</ITEM>
  <ITEM>Pindars account of him 97</ITEM>
  <ITEM>Vnable to judge of crimes 229</ITEM>
  <ITEM>He hath a will but not the power to resist God 125</ITEM>
  <ITEM>Prone to aggravate his own afflictions 254</ITEM>
<ITEM>Masanissa, his famous plot. 142</ITEM>
  <ITEM>what it is 68</ITEM>
  <ITEM>How it differs from pitty Ib.</ITEM>
<ITEM>Michael Ducas, the great plague in his reign 267,268</ITEM>
<ITEM>Mithridates, his cruelty 276</ITEM>

OPTIONALLY, lists of pairs may be tagged with the element pair <LABEL> and <ITEM> (in that order). If you use this option, you may omit any "leader" (e.g. a dot leader) between the paired items. E.g.:

The Prince...............Jn. Longfellow
The Pauper...............Thomas Goodrich
Joan the Tappester........Jack Smithson

      <LABEL>The Prince</LABEL><ITEM>Jn. Longfellow</ITEM>
      <LABEL>The Pauper</LABEL><ITEM>Thomas Goodrich</ITEM>
      <LABEL>Joan the Tappester</LABEL><ITEM>Jack Smithson</ITEM>

Such 2-column table-like lists, if they contain column headers, may distinguish them by use of the ROLE attribute on the ITEM and LABEL element, thus:

(The character)          (The Player)
The Prince...............Jn. Longfellow
The Pauper...............Thomas Goodrich
Joan the Tappester........Jack Smithson

      <LABEL ROLE="label">(The character)</LABEL><ITEM ROLE="label">(The Player)</ITEM>
      <LABEL>The Prince</LABEL><ITEM>Jn. Longfellow</ITEM>
      <LABEL>The Pauper</LABEL><ITEM>Thomas Goodrich</ITEM>
      <LABEL>Joan the Tappester</LABEL><ITEM>Jack Smithson</ITEM>
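The optional LABEL/ITEM treatment of leader-separated pairs can be sketched in Python as follows. This is a hedged illustration, not a prescribed tool: the name `pair_to_tags` is hypothetical, and the sketch assumes that a leader is a run of two or more periods, so that a single period inside a name (as in "Jn. Longfellow") is left alone.

```python
import re

# Assumption: a "leader" is a run of two or more periods, optionally padded by spaces.
LEADER = re.compile(r"\s*\.{2,}\s*")


def pair_to_tags(line: str) -> str:
    """Split one leader-separated line into a LABEL/ITEM pair, dropping the leader."""
    parts = LEADER.split(line.strip(), maxsplit=1)
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"no leader-separated pair found in {line!r}")
    label, item = parts
    return f"<LABEL>{label}</LABEL><ITEM>{item}</ITEM>"


cast = [
    "The Prince...............Jn. Longfellow",
    "The Pauper...............Thomas Goodrich",
    "Joan the Tappester........Jack Smithson",
]
for line in cast:
    print(pair_to_tags(line))
```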

Tables should be recorded as you would using HTML tables, oriented by row, with the number of columns determined by the number of cells within the row. Use the spatial organization of the text to determine the number of rows and columns (not necessarily reflected in printed border lines). The ROWS and COLS attributes of the <CELL> tag should be used just like the ROWSPAN and COLSPAN attributes of the <TD> in HTML to indicate cells that extend across two or more rows or columns. Cells that contain a heading or label for a row (or column) should receive the attribute ROLE="label".

DLPS dtd               HTML equivalent
<ROW>                  <TR>
<CELL ROLE="label">    <TH>
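To make the correspondence with HTML concrete, the following Python sketch converts a small table in this markup into HTML using the standard library's `xml.etree.ElementTree`. It rests on several assumptions: the sample is written as well-formed XML (real files may be SGML), the function name `to_html` and the sample table are illustrative only, and nested tables or lists inside cells are flattened to plain text rather than converted.

```python
import xml.etree.ElementTree as ET

# ROWS/COLS on <CELL> play the role of rowspan/colspan on an HTML cell.
ATTR_MAP = {"ROWS": "rowspan", "COLS": "colspan"}


def to_html(table: ET.Element) -> ET.Element:
    """Map <ROW> to <tr>, label cells to <th>, and other cells to <td>."""
    html_table = ET.Element("table")
    for row in table.findall("ROW"):
        tr = ET.SubElement(html_table, "tr")
        for cell in row.findall("CELL"):
            tag = "th" if cell.get("ROLE") == "label" else "td"
            out = ET.SubElement(tr, tag)
            out.text = "".join(cell.itertext())  # flattens any nested markup
            for source, destination in ATTR_MAP.items():
                if cell.get(source) is not None:
                    out.set(destination, cell.get(source))
    return html_table


sample = ET.fromstring(
    "<TABLE>"
    '<ROW><CELL ROLE="label" COLS="2">On the fyrst Sonday in Aduent.</CELL></ROW>'
    "<ROW><CELL>Rom. xiii.</CELL><CELL>And for as muche as we knowe</CELL></ROW>"
    "</TABLE>"
)
html = ET.tostring(to_html(sample), encoding="unicode")
print(html)
```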

Particularly complex tables may be recorded (again as in HTML) with nested <TABLE> tags, i.e., a <TABLE> within a <CELL>, or by combinations of <LIST> and <TABLE>, i.e. a <LIST> within a table <CELL> or a <TABLE> within a list <ITEM>.

Physical arrangements that cannot easily be accommodated by our simple table model (e.g., labels with text running vertically) may need to be adapted and adjusted until they fit; it is more important to preserve the relationships between the items in the table than to preserve its exact layout.

Tables that continue from one page to the next may be tagged as one continuous table, with an embedded <PB> tag, especially if its headings are not repeated on the new page. If the headings are repeated, it is usually easier to close the old <TABLE> and open a new one on the new page.

These are to be distinguished from tables that spread across a page. See above under "Page breaks in unbreakable objects."

Difficult-to-capture complex tables containing only numbers or symbols (i.e., without any substantial textual content worth searching) may optionally be captured as <FIGURE> as if they were illustrations. Note, however, that just as with the captions attached to "real" <FIGURE>s, the heading for the tables should be included within a <HEAD> tag inside the <FIGURE> tag. For example, this table may be tagged like this: <FIGURE><HEAD>A Table of Houses for the Latitude of 51.degr.34. min. <HI>Sol in Aries.</HI></HEAD></FIGURE>
[Illustration: table to treat as figure]

Here is a sample simple table (this one is simple enough that it could almost be done as a <LIST>).

[Illustration: example of table]

Recorded as:

<DIV TYPE="table">
<HEAD>By this table, shall ye fynde the Epistles and Gospels, for the Son|daies, and other feastiuall dayes.</HEAD> <P>FOR TO fynde them the sooner, shall ye seke for these capital letters, <HI>A, B, C D,</HI> whi|che stande by the syde of this boke alwaies, On or vnder the letter shall you fynde a crosse &cross;, where the Epistle or the Gospell begynneth, and where the end is, there shal ye find an halfe crosse, @ And the fyrst lyne in this table is alway the e|pistle, and the seconde lyne is alway the Gospell.</P>
<TABLE>
<ROW><CELL ROLE="label" COLS="3">On the fyrst Sonday in Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xiii.</CELL>
<CELL>And for as muche as we knowe</CELL>
</ROW>
<ROW>
<CELL>Math. xxi.</CELL>
<CELL>Nowe when they drew nye vnto</CELL>
</ROW>
<ROW><CELL ROLE="label" COLS="3">On the second sonday in the Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xv.</CELL>
<CELL>what so euer thynges are writen</CELL>
</ROW>
<ROW>
<CELL>Luc. xx.</CELL>
<CELL>And there shall be signes</CELL>
</ROW>
</TABLE>


Headings at the head of text divisions, tables, lists, and stanzas (<DIV>s and <LG>s) should be tagged as <HEAD>. Subheadings may be tagged with a second <HEAD> tag, with the TYPE attribute set to "sub," i.e., <HEAD TYPE="sub">, though in most cases it is probably better to combine the two into one HEAD.

Some headings have special tags (see below). If heading-like material doesn't fall clearly into one of these special categories, use simple <HEAD>.

HEAD has a quite inclusive content model, and may be used to tag objects of various kinds occurring at the head of a division. One such use is the division-beginning illustration occurring before the textual heading of a chapter. E.g. a portrait of "Prince x" occurring at the head of a chapter about Prince x. In such cases, the object may be treated as a special kind of heading (since FIGURE can't appear before HEAD):

<HEAD TYPE="illustration">
<FIGURE>
<HEAD>x, princeps</HEAD>
<FIGDESC>Portrait of Prince x</FIGDESC>
</FIGURE>
</HEAD>

Other heading-like material with its own tags includes:

  1. Arguments: use <ARGUMENT> for summaries or abstracts that appear at the head of a division, often headed by "ARGUMENT" ("Argumentum"; "Tharguement")

  2. Epigraphs: use <EPIGRAPH> for brief quotations (often in verse) or mottos that appear at the head of a division, with or without mention of the author or book from which it is quoted. Epigraphs are frequently centered, in quotation marks, or italics, or all three--but not always. An epigraph always contains a quotation or at least a quasi-quotation (motto or saying). The quotation itself should be tagged as <Q>; any attached bibliographical information on author or title or source should be recorded as <BIBL>:
    "Idlenesse is lesse harmefull then vnprofitable occupation."
    <Q>"Idlenesse is lesse harmefull then vnprofitable occupation."</Q>

    Epigraphs are a common place to find bits of non-roman script; record those bits with <GAP DESC="foreign"> as described above, but place the "foreign" portion inside the <Q> or <BIBL> of the <EPIGRAPH> tag.

    Commentaries and sermons frequently quote a passage of text at the beginning (or at the beginning of each division), then comment on it. These passages may usually be readily encoded as <EPIGRAPH><Q> ... </Q></EPIGRAPH>, though occasionally there may be enough commingled head-like material to force all of it into <HEAD>: <HEAD>A sermon on <BIBL>Rom. 8:28</BIBL> <Q>All thinges worke together for good</Q> with some reflections on Providence.</HEAD>.

  3. Openers: use <OPENER> for introductory phrases at the beginning of an item, especially material that belongs to the stereotyped categories that may be tagged as SALUTE, SIGNED, or DATELINE. OPENER is most commonly used with letters or letter-like documents. If in doubt whether material qualifies as <OPENER> or <HEAD>, call it a <HEAD>, especially if the item would otherwise lack a <HEAD>.

  4. Bylines: use <BYLINE> for 3rd-person attributions of authorship (usually with "by ..."), but only if the phrase is easily separable from other heading material. E.g.: <HEAD>The Defense of Poesie</HEAD><BYLINE>By Sir PHILIP SIDNEY Knight</BYLINE>


Material at the end of a text division that is set off from the main text is normally to be tagged as a <TRAILER> or <CLOSER>. <TRAILER> is the more general tag, corresponding to HEAD and used for material without such internal structures as datelines, salutations, or signatures. Typical <TRAILER>s include "Amen," "Finis," and title-like material such as explicits ("here ends the tract written by Master John Knox"; "Explicit liber de gubernatione Dei."). <CLOSER>, on the other hand, is the counterpart of <OPENER>; it is used when the concluding material includes lengthy or complex information, including datelines, salutations, or signatures, especially in letters. See Letters, below, for examples. Requests for prayer for the author's soul are typically recorded as <CLOSER>s.

Epigraphs and bylines can appear at the foot of a division as well as at its head (see above for a description of epigraphs).

<BYLINE> vs. <SIGNED>. It is not always easy to decide whether to use byline (3rd-person) or signed (1st-person) for ascriptions of authorship. If the phrase actually uses "by" ("By Philip Sidney"), <BYLINE> is the better choice. If the item is a document that is normally signed in order to take effect (a letter, a will, an edict or proclamation), <SIGNED> is better.

Special types of texts

  1. Poetry

    Verse lines. Each verse line should be enclosed in <L> tags. Though the REND attribute can be used to indicate degrees or levels of indentation, we normally do not attempt to record the varying indentation of verse lines, but rather pay attention to indentation only insofar as it indicates a stanza break or a "broken" (carry-over) line (see below).

    Broken (carry-over) lines. Sometimes when a verse line is too long to fit on the page, its last word or two is placed (sometimes marked off with an opening bracket or opening parenthesis) at the end of the next line or at the end of the preceding line (wherever it fits best); or rarely at the end of a line several lines away. Such detached bits of verse lines should be recorded if possible at the end of the line to which they really belong.

    Mary had a little lamb, [snow.
      Its fleece was white as

    <L>Mary had a little lamb,</L>
    <L>Its fleece was white as snow.</L>

    Partial lines occur commonly either when a verse extract is quoted, or when a line of verse is interrupted by some larger feature, e.g. a change of speaker in verse drama. The DTD includes the TEI PART attribute by which it is possible to tag partial lines as initial, medial, or final (PART="I" "M" or "F"), or some similar scheme, but we do not generally require this level of markup.

    Groups of lines (<LG>s).

    1. Within a <DIV TYPE="poem">, <BODY>, or <SP>. When a poem constitutes a <BODY> or is tagged as a numbered <DIV> (or as a dramatic speech <SP>), groups of lines forming the smallest subdivisions of the poem should be enclosed in <LG> ("line-group", i.e. stanza) tags. A poem or speech containing no subdivisions (only ungrouped verse lines) does not need an <LG> tag: the <DIV> (or <SP>) tag provides enough context.

      <DIV1 TYPE="poem">
      <L>When the cat's away</L>
      <L>The mice will play</L>

    2. Interspersed with prose. <LG>s should be used around verse line(s) that alternate with prose paragraphs, so that the <LG> tag and the <P> tag serve a similar grouping function. See further below.

      <P>A stitch in time saves nine.</P>
      <LG>
      <L>When the cat's away</L>
      <L>The mice will play</L>
      </LG>
      <P>Too many cooks spoil the broth</P>

    3. Quoted within a <P>, <NOTE>, etc. When lines of verse occur within <Q> tags (e.g. quoted within a prose paragraph, in a note, or as part of an epigraph), do not place the verse lines inside a <LG> tag unless you have good reason to believe that the lines represent a complete stanza, e.g. if more than one stanza is quoted and you need to separate them; or possibly if the metrical form makes it clear that a whole stanza is quoted. If all you know is that some lines of verse are being quoted, then tag them as verse lines (<L>), period. The <Q> tag provides enough context. See further below.

      <P>John walked along, chanting constantly:
      <Q>
      <L>When the cat's away</L>
      <L>The mice will play</L>
      </Q>
      But no one noticed.</P>

      <P>John walked along, chanting constantly:
      <Q>
      <LG>
      <L>Red rover Red rover,</L>
      <L>Come over Come over</L>
      <L>The bird's on the wing,</L>
      <L>The dog's had his fling.</L>
      </LG>
      </Q>
      But no one noticed.</P>

    Lines vs. line-groups. It is often unclear when a group of lines has enough organization to be called a stanza (line-group, <LG>). If in doubt, err on the side of fewer line-groups rather than more. And be consistent throughout a particular poem, so that a particular structure is not sometimes tagged as a <LG> and sometimes left untagged. Clues to look at include, in decreasing order of significance:


    Strongly suggestive:
    blank divider lines.
    drop caps

    Indicative, but need support:
    verse structure (rhyme; refrains; etc.)
    indentation (but indentation alone is insufficient to justify a <LG> element)
    paragraph signs (¶; but these are also used in many cases without any structural function)

    <LG>s vs. <DIV>s. It is not always easy to distinguish between <LG>s and <DIV>s: both can have headings; both can nest to create a structural hierarchy. Metrical units (true stanzas) are always <LG>s; verse paragraphs of irregular length are frequently best recorded as <LG>s, especially if they are not consistently supplied with headings. On the other hand, <DIV>s should be used for line-groups big enough to have true titles, or to appear in tables of contents. Any poem with its own title deserves its own <DIV>.

    Groups of stanzas within a poem should receive a numbered <DIV> tag. In most cases, you will use only a single level of <LG> (no nesting), and treat it effectively as the lowest-level text division. Any grouping of stanzas is therefore recorded as a <DIV>. (But see comments on songs in plays, etc., below.)

    Entire poems. Each poem will usually be recorded as a <DIV> of the appropriate number (<DIV1> etc.), with TYPE="poem" (or TYPE="sonnet" etc. if you prefer to distinguish forms). Poems may, of course, be subdivided further into <DIV>s and <LG>s of various types. If a poem is quoted within a prose context, it is usually easiest to treat it as a <Q>. See next.

    Poetry mixed with prose. When poetry is truly interspersed with prose, and either the poetry is the predominant form, or there is no clearly predominant form, the prose should be recorded within <P> tags, the verse within <LG> tags. When poetry gives way to prose, close the <LG> and open a <P>; when prose gives way to poetry, close the <P> and open an <LG>, even if the actual prose paragraph, or even the last sentence, is not finished.


    1. Be aware that sometimes the interspersed "prose" is really a <HEAD> or a <TRAILER> to a section of the poetry; it may even be a <NOTE>.

    2. When a group of verse lines is quoted (e.g. in a passage predominantly composed of prose, or in a note), leave the prose <P> open, and embed within it the quoted passage (recorded with <L> and, if appropriate, <LG> tags as usual) within <Q> tags.

    3. An entire poem quoted within a prose context will ordinarily be treated as the <DIV1 TYPE="poem"> of a <BODY> of a <TEXT> quoted within a <Q>. <INSERT TYPE="poem"> amounts to the same thing.

    4. A distinct verse object (a song, poem, or hymn, for example), whether in a verse context or a prose one, but which is not properly a quotation, may by license be tagged as an LG, and subordinate parts of it as nested LGs. The commonest examples are 'SONG's within drama, especially verse drama, which may be tagged as <LG TYPE="song"> with nested LGs for the verses and refrains.

  2. Drama. The drama tag set should be used when converting both ordinary dramatic works (plays, masques, etc.), and other works that employ dramatic conventions. Use drama tags (especially <SP> and <SPEAKER>) whenever the text is consistently dramatic in form and layout. This includes literary dialogs like The Compleat Angler, accounts of debates, trial transcripts, and abstract or allegorical debates, such as between "Protestant" and "Papist" or between "Soul" and "Body"--even though they are not "real" drama. However, 'speeches' (or speech-like divisions) with internal structure, e.g. subordinate sections with headings, will probably have to use a standard DIV structure instead of the dedicated drama tags.

    Aside from a few special tags (below), prose drama should be recorded like other prose (in <P>s, etc.) and verse drama like other verse (in <LG>s, <L>s, etc.), including the rules for interspersed poetry and prose.

    Cast lists. Cast lists (DIV TYPE="dramatis personae") should be recorded like other lists, usually with the <LIST> tag. Cast lists will commonly appear as separate <DIV>s (within the <FRONT> matter of a book if the book contains one play). For complex cast lists, use nested lists and labels to indicate cast groupings.

    Stage directions. Stage directions should be recorded with the <STAGE> element. Stage directions sometimes appear between the columns of a multicolumn text, or in the margin, where they look like notes. In other books, they may be centered (as if they were headings) or indented (as if they were little paragraphs). They are occasionally typographically distinct (in italics; within parentheses; or both).

    Speakers. The name (sometimes abbreviated) of the speaker is recorded with <SPEAKER>. In print, these appear at the head of a speech: e.g. typically above the first line of the speech (sometimes centered), in the margin, in an indented line of its own at the head of the speech, or in italics at the beginning of the first line of the speech. Regardless of where it appears in print, the <SPEAKER> tag is tucked into the beginning of the appropriate <SP> ("speech") tag.

    Additional text associated with the speaker's name may be included in the <SPEAKER> tag, if it cannot readily be disentangled and that is the most convenient way to do it, like this: <SPEAKER>Mr. Jones, chanting in unison with three butchers</SPEAKER>. Readily separable material perhaps belongs rather in STAGE: <SPEAKER>Mr. Jones</SPEAKER><STAGE>descending from garden whilst reading a letter aloud</STAGE>. Multiple names should be enclosed in a single set of <SPEAKER> tags, like this: <SPEAKER>Mr. Jones and Mrs. Smith.</SPEAKER>

    Speeches. The basic unit of drama is the SPEECH (<SP>). A speech normally continues uninterrupted as long as the character speaking it is uninterrupted by another speaker or by the end of a division (act, scene, etc.). If a speech begins or ends in the middle of a verse line or stanza, break the line or stanza: i.e., treat it as two lines (or stanzas), one in one <SP> and one in the next. The PART attribute on L allows such broken lines to be reconstructed, but we do not normally require its use.

    "Songs" and other material specially set off within a speech should not normally be given any special tagging; if they have headings, they may need to be recorded as a nested <LG>. In exceptional cases when they contain an elaborate structure they may be recorded as a quotation (<Q>).

    Prologues and Epilogues should normally be treated as part of the play, recorded as <SP>s like any other speech, though they may sometimes require a numbered <DIV> of their own.

    Acts and Scenes. The act/scene structure should be recorded with appropriately TYPEd and numbered <DIV>s (e.g., <DIV2 TYPE="act" N="3"><HEAD>ACT III</HEAD><DIV3 TYPE="scene" N="4"><HEAD>Scene iv</HEAD><SP>...).

  3. Letters

    Personal letters that appear as text divisions should be treated as <DIV>s just like any other text division (chapters, sections, etc.). Letters quoted within running text (e.g. a letter quoted within the chapter of a book) should be treated like any quoted and inserted document, using either <Q><TEXT><BODY><DIV1 TYPE="letter"> or with the equivalent shortcut, <INSERT TYPE="letter">.

    Note that dedications frequently look like letters, since they contain salutations and signatures, but they're not: treat them as <DIV TYPE="dedication">. (You may, however, still use <OPENER> <CLOSER> <SIGNED> <SALUTE> etc. in such letter-like divisions, if they apply.)

  4. Dictionaries and glossaries. These will be recorded differently depending on the complexity of the entries. Some simple word lists can be recorded either as simple one-column <LIST>s or (more aptly) as two-column 'glossary lists' (compare HTML <DL>). Slightly more complex ones can be recorded with <P>s (one <P> for each entry). But more elaborate dictionaries will require numbered <DIV> elements to represent dictionary entries, if not a specialized DTD. The headword for each entry (with any associated grammatical information) can usually be recorded as a <HEAD> to the <DIV>. Complex entries can be subdivided if necessary into component parts using higher-numbered <DIV>s. For example, the following entries from Cotgrave's French-English Dictionary of 1611 ...
    Affaicter. To trim, tricke, decke, dresse curiously, make neat, spruce, fine; to refine; also, to tame, reclaime, breake, make gentle, bring to ciuilitie.
      Affaicter vn oiseau. To man a hauke throughly.
    Affaicterie: f. A trimming, tricking, decking, neat, quaint, or fine dressing; also, neatnesse, nicenesse, curiositie, quaintnesse; also, a breaking, taming, reclayming, ciuilizing, making gentle; (hence) also, the through manning of a hauke, &c.

    ...can be recorded like this. The encoding of the phrasal subentry for "Affaicter vn oiseau" with a <DIV2> is probably superfluous in this case (a new paragraph with a <HI> heading would do as well); it is encoded more thoroughly here as an example of what can be done with more complex entries if necessary.

        <DIV1 TYPE="entry"><HEAD>Affaicter.</HEAD>
        <P>To trim, tricke, decke, dresse curiously, make neat, spruce,
        fine; to refine; also, to tame, reclaime, breake, make gentle, bring
        to ciuilitie.</P>
        <DIV2 TYPE="subentry">
        <HEAD>Affaicter vn oiseau.</HEAD>
        <P>To man a hauke throughly.</P>
        </DIV2></DIV1>
        <DIV1 TYPE="entry">
        <HEAD>Affaicterie: f.</HEAD>
        <P>A trimming, tricking, decking, neat, quaint, or fine dressing;
        also, neatnesse, nicenesse, curiositie, quaintnesse; also, a breaking,
        taming, reclayming, ciuilizing, making gentle; (hence) also, the through
        manning of a hauke, &c.</P></DIV1>


In general, punctuation should be retained, but its spacing somewhat regularized. When a colon, semicolon, comma, question mark, closing quotation mark, or period falls between words, place a space after it, but none before it. The exception is a period used to set off a number, like this: .lxvi. or .45. In that case it should be spaced as shown; that is, the periods should "hug" the number at front and back, without spaces. When an opening quotation mark falls between words, place a space before it, but none after it. When a virgule falls between words, place a space before and after it. In case of doubt, follow the spacing system of the original as best you can.

Record the various forms of colon, period, comma, semicolon, and virgule (slanted line) with their modern keyboard equivalents ( : . , ; / ); a vertical bar should be recorded using the &verbar; entity (since we have reserved the keyboard character for another purpose).

Question marks vary considerably in form (some of them looking like inverted semicolons); record them all with the standard "?"

Opening and closing double quotation marks should be captured consistently: either (preferably) both should be recorded using the ordinary keyboard double-quote character (" = HEX 22); or the opening quotes should be distinguished from the closing quotes using the &ldquo; and &rdquo; entities.

Opening and closing single quotation marks, as well as apostrophes, should likewise be recorded consistently, either all with the same character, the ordinary keyboard single-quote character (' = HEX 27), or using &lsquo; etc.

Hyphens (and figure-dashes but not other dashes) should normally be recorded using the ordinary hyphen character.

Hyphens at the end of a line should be recorded as the ordinary keyboard "pipe" (vertical bar) character, unless they appear between numerals, when they should be recorded with the ordinary hyphen.
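The end-of-line hyphen rule lends itself to a short Python sketch. The function name `join_lines` is hypothetical, and the sketch makes two assumptions beyond the rule itself: that "between numerals" means a digit immediately before the hyphen and a digit at the start of the next line, and that the two halves of a broken word are joined directly around the pipe, as in forms like "Son|daies" in the sample table above.

```python
from typing import List


def join_lines(lines: List[str]) -> str:
    """Join printed lines, recording end-of-line hyphens as '|' unless between numerals."""
    out = ""
    for index, raw in enumerate(lines):
        line = raw.strip()
        nxt = lines[index + 1].strip() if index + 1 < len(lines) else ""
        if line.endswith("-") and nxt:
            # Assumption: "between numerals" = digit before the hyphen, digit after the break.
            between_numerals = len(line) > 1 and line[-2].isdigit() and nxt[0].isdigit()
            out += line if between_numerals else line[:-1] + "|"
        elif nxt:
            out += line + " "  # ordinary line break becomes a space
        else:
            out += line  # last line: nothing follows
    return out


print(join_lines(["for the Son-", "daies, and other"]))
print(join_lines(["pages 14-", "23 of the book"]))
```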

If there is no end-of-line hyphen, but you think that there should have been (i.e., that a single word has been broken across two lines), place a plus sign, instead of a space, between the two halves: "cro+wn" "pri+nce". We recognize that since this requires interpretation of the text, it must remain an optional instruction subject to the discretion of the vendor.

"Figure" dashes (dashes between numerals) may be recorded using the standard keyboard hyphen character

Other dashes should be recorded using the entity &mdash;, regardless of where they appear, or how long they are.

The "minus" sign (−), if it can be distinguished from the m-dash and hyphen, should be recorded with a character entity (&minus;).

The "times" (multiplication) sign (×), if it can be distinguished from the "X", should be recorded with a character entity (&times;).
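The dash and sign rules above amount to a small character mapping. The sketch below assumes, purely for illustration, that the input already distinguishes these signs as Unicode characters; the table name and function name are hypothetical.

```python
# Assumed Unicode inputs mapped to the captures described above.
SIGN_CAPTURES = str.maketrans({
    "\u2010": "-",        # hyphen -> ordinary hyphen
    "\u2012": "-",        # figure dash -> ordinary hyphen
    "\u2013": "&mdash;",  # en dash -> &mdash; (length is ignored)
    "\u2014": "&mdash;",  # em dash
    "\u2015": "&mdash;",  # horizontal bar
    "\u2212": "&minus;",  # minus sign
    "\u00d7": "&times;",  # multiplication sign
})

def capture_signs(text: str) -> str:
    """Apply the dash, minus and times captures to a string."""
    return text.translate(SIGN_CAPTURES)

print(capture_signs("pp. 12\u201215 \u2014 3 \u00d7 4 \u2212 2"))
```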

Ellipses, whether two characters or many (strings of dots or asterisks indicating omitted or missing text), should be recorded either as ordinary text, using periods or asterisks as appropriate ( . . . . . * * * * * . . . ), or as special ellipsis characters or entities (e.g. &hellip;), so long as one practice or the other is followed consistently.

Some books mark extended quotations by placing quotation marks at the beginning of every quoted line. The same technique is used in other books to mark proverbs and other sententious remarks. E.g.,

     he made reasons...seyenge:  God made alle thynges
   " by reason, and governethe thynges
   " made by reason; the sterres be movede by reason; and so
   " oure naturalle lyfe excedynge from reason by slawthe and
   " ignoraunce awe to be reducede by lawes and reasons.
   " Wherefore thau3he there be somme thynges in the rule of
   " seynte Benedicte, the intellect of whom the dullenesse of my
   " mynde may not comprehende, y suppose hit be beste to 3iffe
   " credence to auctorite. Wherefore also he persuadeth hymselfe ...

     O no (said Cecropia) company confirmes reso-
   " lutions, & lonelines breeds a werines of ones thoughts,
   " and so a sooner consenting to reasonable profers.

Our vendor instructions for this situation are as follows:

In prose, record the first and last of the marginal quotation marks with the special entities &startq; (first mark) and &endq; (last mark). If there is only one such marginal quotation mark (as sometimes happens with short quotations or proverbs), use both entities in sequence (&startq;&endq;).

In verse, simply record the quotation marks using the " character as it appears in the print, preferably followed by a space to distinguish it from other uses of ".

In review, we resolve these as follows:

The verse marks are left alone. The prose marks are removed and the marked block is resolved either into a block quotation using <Q> or into a highlighted section using <HI REND="marginal quotes">.
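Since the review step depends on the vendor's &startq; and &endq; entities arriving in matched pairs, a simple pairing check may be useful before review. This is only a sketch with a hypothetical function name; it does not resolve the marks into <Q> or <HI>, which remains an editorial decision.

```python
import re

def unpaired_marginal_marks(text: str):
    """Return (offset, message) for each &startq; or &endq; lacking a partner."""
    problems = []
    open_at = None  # offset of a &startq; still waiting for its &endq;
    for match in re.finditer(r"&(startq|endq);", text):
        if match.group(1) == "startq":
            if open_at is not None:
                problems.append((open_at, "startq without endq"))
            open_at = match.start()
        else:
            if open_at is None:
                problems.append((match.start(), "endq without startq"))
            open_at = None
    if open_at is not None:
        problems.append((open_at, "startq without endq"))
    return problems

print(unpaired_marginal_marks("a &startq; b &endq; c &startq;&endq; d"))
```

The single-mark case (&startq;&endq; in sequence) counts as a valid pair.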

Braces and brackets that group multiple lines should be ignored if all they do is group portions of ordinary running text, such as poetry. But if they are used to link one piece of text to another, such as frequently in tables and lists, their meaning needs to be interpreted. Sometimes this will require entering text more than once, e.g. if the brace means "this word applies to all these other words," the easiest technique may be simply to apply the word to all of the other words by entering it as many times; sometimes it may require treating the single item as a head or label for a list containing the grouped items; sometimes it may involve attaching a ROWS or COLS attribute to a table <CELL>. Many variations are possible, which the following examples can only suggest.

A brace used like a "ditto" mark, to associate one word repeatedly with a series of items:

chapter [brace] 1 How to build a kite
2 When to fly a kite
3 Famous kite flyers of our time
4 When not to fly a kite
5 "I've flown it: now what?"

may be recorded as follows, by repeating the word:

 <LABEL>chapter 1</LABEL>
  <ITEM>How to build a kite</ITEM>
 <LABEL>chapter 2</LABEL>
  <ITEM>When to fly a kite</ITEM>
 <LABEL>chapter 3</LABEL>
  <ITEM>Famous kite flyers of our time</ITEM>
 <LABEL>chapter 4</LABEL>
  <ITEM>When not to fly a kite</ITEM>
 <LABEL>chapter 5</LABEL>
  <ITEM>"I've flown it: now what?"</ITEM>

A brace used to associate one item, as a head, with a set of other items:

Dramatis Personae
townspeople [brace] Joe
Joan, a noblewoman
John, a philosopher

may be recorded as follows, placing the one item in a <HEAD> tag and the list of items in <LIST> and <ITEM> tags:

<HEAD>Dramatis Personae</HEAD>
 <ITEM>Joan, a Countess</ITEM>
 <ITEM>John, a philosopher</ITEM>

A brace used in a table to place one cell in conjunction with a set of other cells:

In apice trianguli. [brace] Triangulus.
In basi præcedens 3.
Sequens & vltima. 3.

may be recorded using the COLS or ROWS attribute of the <CELL> tag:

<CELL>In apice trianguli.</CELL>
<CELL ROWS="3">Triangulus.</CELL>
<CELL>In basi praecedens 3.</CELL>
<CELL>Sequens &amp; vltima. 3.</CELL>
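The "ditto" use of the brace, where one word is repeated for every grouped item, can be sketched as a small generator of LABEL/ITEM pairs. The function name and the input shape (a word plus a list of number and text pairs) are assumptions made for illustration only.

```python
def expand_brace_label(word, items):
    """Repeat the braced word as a LABEL in front of each numbered ITEM."""
    lines = []
    for number, text in items:
        lines.append(f" <LABEL>{word} {number}</LABEL>")
        lines.append(f"  <ITEM>{text}</ITEM>")
    return "\n".join(lines)

print(expand_brace_label("chapter", [
    (1, "How to build a kite"),
    (2, "When to fly a kite"),
]))
```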

Characters (glyphs)

Basic letter forms. We assume that most letters encountered will belong to the modern standard Latin alphabet, though their appearance may be strange. Books from different periods will raise peculiar issues best addressed individually. Here, for example, are some considerations that apply especially to the capture of early printed books:

Early printing

  1. "u" and "v," though often interchangeable in spelling, should be recorded as they appear ("u" for "u", "v" for "v", without applying modern spelling practice).

  2. Lower-case "j" is really just a variant of lower-case "i", but record the form that seems to be intended ("i" for "i" and "j" for "j") based on the physical appearance of the letter; "j" appears most often paired with "i" in order to distinguish the pair from letters like "u" or "n"; one thus finds roman numerals like this: xvij, xij, or Latin plurals like this: alijs. The dot on the "i" and "j" is often in the form of a slanted line, like an acute accent, but record these letters as ordinary dotted "i" and "j," NOT as &iacute; or &jacute;. Paired "i" and "j" ("ij") can sometimes resemble a "y".

  3. Upper-case "I" and "J", on the other hand, are often difficult if not impossible to distinguish: if uncertain, use "I".

  4. Many books print a pair of "v"s or even a paired "Uv" where we would expect a "w"; do not convert these pairs to "w" but print whatever letters actually appear: "uu" "vv" etc.

Ligatures. Ligatured characters may be variously treated, so long as firm control of the character inventory is maintained and all representations can be readily resolved to combinations of single characters.

A typical capture policy would be to capture as distinctive characters only those ligatures represented in the standard ISO sets (or in Unicode).

However, our preferred policy with most projects is to ignore all ligatures that can plausibly be regarded as merely formal or 'aesthetic'. By this standard, most ligatures should be captured as either two (ct, st, sp, fi, ff, ss, etc.) or three (ssi, ffl, ffi) separate characters, the fact of the ligature itself ignored. Likely exceptions (i.e. ligatures that are normally regarded as being characters in their own right) include chiefly the Latin digraphs ae and oe, the German 'ess-zed' (ss or sz), and perhaps the Dutch ij. Initial AE and OE ligatures, when the rest of the word is in lower-case, can safely be captured as "Ae" (or "Oe") rather than "AE" (or "OE"), e.g.: "Aesop" "Oengus". Be aware that italic fonts especially tend to have ligatures between many more pairs of letters than we are accustomed to seeing. Be aware also that the italic "ae" digraph/ligature usually has no upper bow to the "a" and is easily mistaken for "oe".

This is an "oe": [image: oe]
These are all "ae": [image: ae]

The common form of the "ss" ligature that consists of a tall-s followed by a short-s has sometimes caused problems in recognition. Here are two examples:

[image: ss] = possibility
[image: ss] = Passion
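The preferred ligature policy (split merely aesthetic ligatures, keep the digraphs that count as characters in their own right) can be sketched as a translation table. The sketch assumes Unicode input containing the precomposed Latin ligature characters; the names are hypothetical, and the table covers only ligatures that have Unicode code points.

```python
# Aesthetic ligatures resolved to their separate letters.
AESTHETIC_LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
})

def split_aesthetic_ligatures(text: str) -> str:
    """Split aesthetic ligatures; ae, oe and ess-zed are not in the table and pass through."""
    return text.translate(AESTHETIC_LIGATURES)

print(split_aesthetic_ligatures("\ufb01rst o\ufb03ce \u00e6ther"))
```

An explicit table is used rather than Unicode compatibility normalization, which would also rewrite characters (such as the tall s) that this policy does not mention.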

Fractions. For the fifteen common fractions listed in either ISOpub or ISOnum (namely: &frac12;, &frac14;, &frac34;, &frac18;, &frac38;, &frac58;, &frac78;, &frac13;, &frac23;, &frac15;, &frac25;, &frac35;, &frac45;, &frac16;, &frac56;), use the entity. Otherwise, simply use the "front slash" (virgule) character between the numbers (e.g., 23/47).

NOTE: Some documents use dual dates (e.g. "12/22 Dec. 1635") because of the discrepancy of ten days between the calendars of different countries caused by the adoption of the Gregorian calendar. These are not really fractions at all, though they look like fractions; they should always be recorded using the "slash" method: 12/22. Likewise dual-year dates (e.g. 1651/2 or 1667/68) are frequently printed so that the end of the date looks like a fraction. Again, it is not; these should always be captured using the slash (1651/2; 1667/68).
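The fraction rule reduces to a lookup against the fifteen listed entities, with the slash form as the fallback. A minimal sketch follows; the function and set names are hypothetical, and dual dates such as 12/22 or 1651/2 should never be passed to it, since they are always captured with the slash.

```python
# The fifteen fractions that have ISOpub or ISOnum entities, as (numerator, denominator).
COMMON_FRACTIONS = {
    (1, 2), (1, 4), (3, 4),
    (1, 8), (3, 8), (5, 8), (7, 8),
    (1, 3), (2, 3),
    (1, 5), (2, 5), (3, 5), (4, 5),
    (1, 6), (5, 6),
}

def capture_fraction(numerator: int, denominator: int) -> str:
    """Return the entity for a common fraction, else the slash form."""
    if (numerator, denominator) in COMMON_FRACTIONS:
        return f"&frac{numerator}{denominator};"
    return f"{numerator}/{denominator}"

print(capture_fraction(3, 8), capture_fraction(23, 47))
```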

Ampersands, whether shaped like & or like "7," should be recorded as &amp;.

"Old-style" roman numerals. Of the letters used commonly in Roman numerals (I V X L C D M), two, namely "M" and "D," can appear in a variant form that makes use of an extra character that resembles a backwards-facing letter "c," combined with "I" and regular "c". E.g., this means "M.D.C.":
(Since I can't represent a backwards "c" on the keyboard, I'll use "(" for "c" and ")" for backwards-c in what follows.) "(I)" is a variant form of "M"; "I)" is a variant form of "D" (If you look closely, you'll see that "(|)" almost looks like an "M" and "I)" almost looks like a "D"). When you find this style of Roman numerals, represent the combination "(I)" as "M" and "I)" as "D". For further examples, see the document on roman numerals.
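Using the same keyboard stand-ins as above ("(" for "c" and ")" for backwards-c), the substitution can be sketched in two steps. The function name is hypothetical, and the sketch is meant to be applied only to strings already identified as numerals, since "I)" can also occur in ordinary text.

```python
def normalize_old_style_roman(numeral: str) -> str:
    """Replace the old-style forms "(I)" and "I)" with "M" and "D"."""
    # "(I)" must be handled first, because it contains "I)".
    return numeral.replace("(I)", "M").replace("I)", "D")

print(normalize_old_style_roman("(I).I).C."))
```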

Letters printed upside-down (a common printer's error), if recognized, should be recorded as if turned right side up. Displaced type of any sort should be put back where it belongs, if possible.

Reserved characters

Diacritics. Recognizable letters with diacritics should be recorded using the standard ISO character entity, if available; or if not, composed from the base character plus the appropriate diacritic(s) from the ISO diacritics set. [If using Unicode, prefer the precomposed characters to the multi-byte composed characters.]

An abbreviation stroke over two or more contiguous letters (whether or not it crosses an upright stroke on one of the letters) should be treated as a generic abbreviation mark; i.e., it should not be recorded as a character at all, but the entire word should be placed within <ABBR> tags. Roman numerals (as in dates) are sometimes "overlined" in whole or in part. Do not record the overlining as such, but place the entire numeral within <ABBR> tags. See also the special document on Roman numerals.

Abbreviation symbols. A number of abbreviation symbols, mostly based on ordinary letters, are distinctive enough and consistent enough in appearance to be recognized. Each should be recorded with its own character entity. These should be rare to nonexistent in books after the seventeenth century.

The following table illustrates the commonest abbreviation symbols. More may be added later. Note that some have conditions attached; e.g., the "q3"- or "q;"-like symbol illustrated below means "-que" when it appears at the end of a word, but means something quite different (e.g. "quam," especially if it has a stroke over it) when it stands alone. It should therefore be recorded as &abque; only when it appears at the end of a word.

Symbol | Record as | Meaning | Conditions
per | &abper; | per, par |
 | | | at the end of a word only
que | &abque; | -que | in Latin at the end of a word only; may appear as a separate word in French; to be distinguished from "Esq;" abbreviation for "esquire."
quod | &abquod; | quod/quoth |
sed | &absed; | sed | only when forming a word by itself
con | &abcon; | con- cum- | at the beginning of a word only
rum | &abrum; | -rum | at the end of a word only
 | | | at the end of a word only
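Because several abbreviation entities are valid only in a particular position, captured text can be screened for entities that appear in the wrong place. The sketch below is an assumption-laden illustration: the names are hypothetical, "attached" simply means an alphabetic character is immediately adjacent, and a neighbouring entity is not recognized as part of the word.

```python
import re

# Position conditions taken from the table: end of word, start of word, or standing alone.
POSITION_RULES = {"abque": "end", "abrum": "end", "abcon": "start", "absed": "alone"}
ENTITY = re.compile(r"&(abque|abrum|abcon|absed);")

def misplaced_abbreviations(text: str):
    """Return (offset, entity name) for entities that break their position rule."""
    problems = []
    for match in ENTITY.finditer(text):
        letter_before = text[match.start() - 1 : match.start()].isalpha()
        letter_after = text[match.end() : match.end() + 1].isalpha()
        rule = POSITION_RULES[match.group(1)]
        bad = (
            (rule == "end" and letter_after)
            or (rule == "start" and letter_before)
            or (rule == "alone" and (letter_before or letter_after))
        )
        if bad:
            problems.append((match.start(), match.group(1)))
    return problems

print(misplaced_abbreviations("&abque;am x&abcon;"))
```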

Letters from other alphabets, e.g. Hebrew and Greek, when used singly (as opposed to in whole words or extended text) should be recorded with ISO standard character entities.

Other symbols include alchemical and astrological symbols, which will rarely if ever appear as part of words, but may appear in or as marginal notes, in designations of units of measure, in calendrical tables, etc.

A selection follows.

Symbol | Meaning | Record as

Zodiacal signs
cancer | Cancer | &Cancer;
leo | Leo | &Leo;
virgo | Virgo (may also appear as abbreviation for "minim" ('drop') in medical recipes) | &Virgo;
libra | Libra | &Libra;
scorpio | Scorpio (may also appear as abbreviation for "minim" ('drop') in medical recipes) | &Scorp;

Planetary signs (used in alchemy also for corresponding metals)
sun sign | Sun (or gold) | &Sun;
moon sign | Moon (or silver) | &Moon;
mercury sign | Mercury (the planet or the metal) | &Merc;
venus sign | Venus (or copper) | &Venus;
earth sign | Earth (the planet) | &Earth;
mars sign | Mars (or iron) | &Mars;
jupiter sign | Jupiter (or tin) | &Jupit;
saturn sign | Saturn (or lead) | &Saturn;

Apothecaries' symbols
ounce symbol | ounce (apothecaries' unit of measure) | &ounce;
dram symbol | dram or drachm (apothecaries' unit of measure) | &dram;
scruple symbol | scruple (apothecaries' unit of measure) | &scruple;
recipe symbol | "Recipe" ('take ...') in recipes and prescriptions | &rx; (from ISOpub)
ss (semis) abbreviation | "Semis" ('half') with units of measure | ss (not really a symbol, just the ordinary letter "s" doubled; the second, variant form is rare and should perhaps be marked by <ABBR> tags around the basic "ss" capture.)

Alchemical signs
antimony symbol | antimony | &antimony;
sal armoniac symbol | sal armoniac (in (al)chemical contexts only) | &salarmon;
elemental fire symbol | fire (in (al)chemical contexts only) | &fire;
elemental water symbol | water | &water;
elemental earth symbol | earth (the element) | &earth;
subli- abbreviation symbol | subli- (forming words like "sublimate") | &absubli;
precipi- abbreviation symbol | precip- (forming words like "precipitate") | &abprecipi;
sulphur abbreviation symbol | sulphur or sulphu- (forming words like 'sulphuris') | &sulphur;
oil or oleum symbol | oil or oleum | &oil;
tartar(ic) symbol | tartar (tartrate? tartaric acid?) | &tartar;
vitriol symbol | vitriol (sulphuric acid) or vitrio- (forming words like 'vitriolata') | &vitriol;
salt symbol | salt | &salt;
nitre or saltpetre symbol | nitre or saltpetre (potassium nitrate) | &nitre;

Other signs
cross | cross (any variety: Greek, Latin, Maltese) | &cross;
paragraph sign | capitulum |
right index sign | right-pointing index finger (left-pointing finger also found) |

Symbols and marks not listed here

  1. Recognizable standard symbols should receive the standard ISO character entity if one exists.

  2. Unrecognized symbols and marks, individual characters that cannot be readily identified as one thing or another ("Is this a funny-looking "q" or some kind of symbol?" "Is this a "c" or a "t"?"), and symbols other than those listed here or in the standard ISO character sets, should be recorded either

    1. with the hash character (#) if the symbol is clear enough but is not listed here or in the ISO sets; or

    2. as "@" if you're not sure what to make of it.

  However, do not overuse this expedient: if the same symbol recurs repeatedly in a book, please ask us for help in identifying it; do not simply record dozens or hundreds of examples of the same symbol with "@" or "#".
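One way to notice overuse of the "@" and "#" placeholders is to count them per file. The sketch below is only a crude proxy, since a raw count cannot tell whether the same symbol is recurring; the function name and the default limit of 12 (a stand-in for "dozens") are assumptions, not part of the specification.

```python
from collections import Counter

def overused_placeholders(text: str, limit: int = 12):
    """Return the placeholder characters that occur at least `limit` times."""
    counts = Counter(ch for ch in text if ch in "@#")
    return {ch: n for ch, n in counts.items() if n >= limit}

print(overused_placeholders("@" * 12 + "#"))
```

A file that exceeds the limit is a candidate for asking about the symbol rather than for automatic rejection.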

Elements not described here

A number of options are available in the vendor dtd which are not described here except as entries in the summary list, since they are not part of normal coding specification. These include:

In addition, only minimal or severely restricted instructions are given above for the following elements, which are capable of much wider application than we are accustomed to giving them: