AmVerse FAQ no. 10 : the LANG attribute and FOREIGN tag

Revised 1 March 2001

Where to use the LANG attribute

Under current instructions, the only place you need to use the LANG attribute is on the <FOREIGN> tag. We use <FOREIGN> chiefly to mark the location of text in a non-Roman alphabet (e.g. Greek or Hebrew); in that case its only content is a <GAP> tag.

[Many other tags, e.g. <Q>, have the LANG attribute but we have not been consistently using it except for this one purpose.]

[<FOREIGN> can of course be used to tag any text in a language other than the predominant language of the document, but we have not generally been marking such text unless it is in a non-Roman alphabet. It does no harm to tag (say) Spanish or Latin quotations as <FOREIGN>, but since we make no use of the information at this point, it is more efficient not to record it.]

Validating attributes that refer to IDs (IDREFS)

The LANG attribute (wherever it appears in TEILITE) takes values defined as "IDREFS". This means that the values must correspond exactly and literally with the value of an "ID" attribute somewhere else in the same document. The document won't validate unless there is an ID matching every IDREF in the file.

We satisfy this requirement by inserting a <LANGUSAGE> tag into the <PROFILEDESC> of the <TEIHEADER>. Insert the tag near the beginning of the <PROFILEDESC>, which itself should appear at the end of the <TEIHEADER>. See the moa-tei.dtd or the green books (or the online TEI guidelines) for the exact placement rules.

The <LANGUSAGE> element contains one or more <LANGUAGE> tags. The content of the <LANGUAGE> tag is the name of the language; the ID attribute of the <LANGUAGE> tag is an abbreviation for the language. The value of this ID attribute must be the same abbreviation that you used as the value of the LANG attribute on the <FOREIGN> tag.
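
The rule described in this section (every LANG value needs a matching ID somewhere in the same file) can be spot-checked before running a real parser. What follows is only a minimal sketch in Python, not part of our tools: the function name unmatched_lang_values is made up for illustration, and it assumes that attribute values are always double-quoted, as in the examples in this FAQ. A regular-expression scan is no substitute for validating against the DTD.

```python
import re

# Assumption: attribute values are always written in double quotes.
# Any tag carrying a LANG attribute, e.g. <FOREIGN LANG="grc">.
LANG_REF = re.compile(r'<[A-Za-z][^>]*\bLANG\s*=\s*"([^"]*)"', re.IGNORECASE)
# Any tag carrying an ID attribute, e.g. <LANGUAGE ID="grc">.
ID_DECL = re.compile(r'<[A-Za-z][^>]*\bID\s*=\s*"([^"]*)"', re.IGNORECASE)


def unmatched_lang_values(sgml_text):
    """Return the sorted LANG values that have no identical ID in the text."""
    declared = set(ID_DECL.findall(sgml_text))
    used = set(LANG_REF.findall(sgml_text))
    # The match must be exact and literal, so compare the strings as written.
    return sorted(used - declared)


sample = (
    '<LANGUSAGE><LANGUAGE ID="grc">Ancient Greek</LANGUAGE></LANGUSAGE>'
    '<P><FOREIGN LANG="grc"><GAP></FOREIGN></P>'
    '<P><FOREIGN LANG="tib"><GAP></FOREIGN></P>'
)
print(unmatched_lang_values(sample))  # prints ['tib']
```

An empty list means every LANG value found by the scan has a matching ID; anything else names the codes that still need a <LANGUAGE> tag in the header.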

Three-letter Language Codes

The actual language abbreviations that you use should be the three-letter codes listed in the table of Library of Congress MARC codes for languages, found at:

http://lcweb.loc.gov/marc/languages/

Example

Say that in one of your books you find a line in Greek and another in Tibetan. Since we do not attempt to capture Greek (or Tibetan) characters, simply record the presence of the foreign text with <FOREIGN> tags around a <GAP> tag: <FOREIGN LANG="grc"><GAP></FOREIGN> and <FOREIGN LANG="tib"><GAP></FOREIGN>.

("grc" is the MARC code for ancient, as opposed to modern, Greek. Use the most general code that fits: don't try to distinguish between periods of a language or between different dialects of a language unless the code list requires you to do so. In this case there is no general "greek" code, so we have to decide whether the text is ancient or modern. Since it is a quotation from Plutarch, it's ancient.)

Since you've now inserted two attributes in the document that take a declared value of "IDREFS", the document won't parse unless there are corresponding IDs in the same document.

Insert this in the header at the beginning of the <PROFILEDESC>:

<PROFILEDESC>
<LANGUSAGE>
<LANGUAGE ID="grc">Ancient Greek</LANGUAGE>
<LANGUAGE ID="tib">Tibetan</LANGUAGE>
</LANGUSAGE>

...

</PROFILEDESC>

with a <LANGUAGE> tag for each language used and tagged within that document. The document will now validate.
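
If a document uses several languages, the <LANGUSAGE> block shown above can be drafted from the LANG values already present on the <FOREIGN> tags. The following is a hedged sketch only: build_langusage and the LANGUAGE_NAMES table are hypothetical names, the table holds just the two codes from this example, and the language names for any other code should be taken from the MARC list rather than guessed. It again assumes double-quoted attribute values.

```python
import re

# Assumption: LANG values on <FOREIGN> are always written in double quotes.
FOREIGN_LANG = re.compile(r'<FOREIGN\b[^>]*\bLANG\s*=\s*"([^"]*)"', re.IGNORECASE)

# Hypothetical lookup table; extend it from the MARC code list as needed.
LANGUAGE_NAMES = {"grc": "Ancient Greek", "tib": "Tibetan"}


def build_langusage(body_text, names=LANGUAGE_NAMES):
    """Draft a <LANGUSAGE> block with one <LANGUAGE> tag per code in use."""
    codes = sorted(set(FOREIGN_LANG.findall(body_text)))
    lines = ["<LANGUSAGE>"]
    for code in codes:
        # An unknown code raises KeyError on purpose, so that it gets
        # looked up by hand instead of being silently skipped.
        lines.append('<LANGUAGE ID="%s">%s</LANGUAGE>' % (code, names[code]))
    lines.append("</LANGUSAGE>")
    return "\n".join(lines)


body = (
    '<P><FOREIGN LANG="grc"><GAP></FOREIGN></P>'
    '<P><FOREIGN LANG="tib"><GAP></FOREIGN></P>'
)
print(build_langusage(body))
```

The printed block still has to be pasted into the <PROFILEDESC> by hand, and the document validated afterwards in the usual way.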