AMVERSE: Processing of Encoded Text

  1. Vendor.dtd stage

    1. Validate received *.sgm files against vendor.dtd

      (Either add

      <!DOCTYPE TEXT PUBLIC "-//UMDLPS//DTD Vendor 1.0//EN">

      to head of file or precede filename on NSGMLS command line with "vendor.doctype.sgm" (omitting "-B" switch) if necessary to validate.)

    2. Obtain "real" system ID number from vast MOA database (this is not the same as the "text" id, which is largely duplicated, sometimes in truncated form, in the file name).

      • Database is in markup/process.mdb
      • Open by highlighting the file, then holding down the Shift key while double-clicking
      • Look in mamverse query table (qryMamverseTexts) to find first 8 characters of system id. Record both the 8-digit system ID and the text ID in the file amverse.log.txt.

    3. Change name of file to match full text ID, e.g. ChapmTwoPh.sgm

    4. Enter date in "MarkupWhen" column of ProcessMarkup table of MOA database (process.mdb).
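
The first option in step 1.1 above (adding the DOCTYPE declaration to the head of the file) could be scripted along these lines. This is only a sketch: the function names are hypothetical, and the latin-1 encoding is an assumption chosen because it round-trips every byte value; adjust it to whatever the vendor files actually use.

```python
from pathlib import Path

# Declaration quoted from step 1.1 above.
VENDOR_DOCTYPE = '<!DOCTYPE TEXT PUBLIC "-//UMDLPS//DTD Vendor 1.0//EN">'


def with_vendor_doctype(text: str) -> str:
    """Return text with the vendor DOCTYPE at its head, unless a DOCTYPE is already there."""
    if text.lstrip().upper().startswith("<!DOCTYPE"):
        return text
    return VENDOR_DOCTYPE + "\n" + text


def add_doctype_to_file(path: Path) -> None:
    """Rewrite one .sgm file in place (hypothetical helper; keep a backup copy first)."""
    original = path.read_text(encoding="latin-1")  # encoding is an assumption
    path.write_text(with_vendor_doctype(original), encoding="latin-1")
```

The file still has to be validated with NSGMLS afterwards; the sketch only saves the manual edit.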

  2. Convert to TEILITE

    1. Run script on .sgm file: vnd2tei.pl (found in AmVerse\code)

      This will:

      • change <B> to <HI etc.
      • add doctype
      • put <TEXT> within <TEI.2> .. </>
      • provide placeholder for header
      • remove leading zeroes from "N" attribute of <PB> tag
      • convert most roman numerals in page numbers back to roman form

      redirect output to a file of the same name in Amverse/temp
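
To illustrate one of the changes listed above (removing leading zeroes from the "N" attribute of <PB>), here is a minimal Python sketch. It is not the code of vnd2tei.pl, which may work differently; the function name and the assumption that attribute values are double-quoted are illustrative only.

```python
import re

# Matches a <PB ...> tag and captures the double-quoted value of its N attribute.
PB_N = re.compile(r'(<PB\b[^>]*?\bN=")([^"]*)(")', re.IGNORECASE)


def strip_leading_zeros(sgml: str) -> str:
    """Drop leading zeroes from purely numeric N values; leave other values alone."""
    def fix(match):
        value = match.group(2)
        if value.isdigit():
            value = value.lstrip("0") or "0"  # keep a single 0 for an all-zero value
        return match.group(1) + value + match.group(3)

    return PB_N.sub(fix, sgml)
```

Values that are not purely numeric (for example "R0001") pass through unchanged, so this does not attempt the roman-numeral conversion.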

    2. Obtain header:

      1. Find the "old" MOA headers:

        1. telnet to DLPS4

        2. turn on line-wrap in telnet software

        3. cd to /l1/txt/m/moa/moaidx/raw/monograph

          (FYI this is an alias for either /l/m1 or /l/m2, whichever is current)

        4. invoke pat with "search"

        5. which will give you a PAT prompt: >>

        6. search for the tei header that includes the system ID discovered earlier.

        7. note NOTIS id in file amverse.log.txt

        For example:

                 dlps4% search
                 >> region TEIHEADER incl "1993biog"
                 >> pr
         29512409, ..<TEIHEADER> <FILEDESC> <TITLESTMT> <TITLE>Biography and poetical re
        mains of the late Margaret Miller Davidson / || by Washington Irving.</TITLE> <A
        UTHOR>Davidson, Margaret Miller, || 1823-1838.</AUTHOR> </TITLESTMT> <EXTENT>240
         600dpi TIFF G4 page images</EXTENT> <PUBLICATIONSTMT> <PUBLISHER>University of
        Michigan Digital Library Production Services</PUBLISHER> <PUBPLACE>Ann Arbor, Mi
        chigan</PUBPLACE> <DATE>1998</DATE> <IDNO TYPE="notis">ARH0325</IDNO> <IDNO TYPE
        ="Rootid">mm000116/1993biog/v0000/i000</IDNO> <AVAILABILITY> <P>These pages may
        freely searched and displayed.  Permission must be received for subsequent distr
        ibution in print or electronically.  Please contact moa-info@umich.edu for more
        information.</P> </AVAILABILITY> </PUBLICATIONSTMT> <SOURCEDESC> <BIBL> <AUTHOR>
        Davidson, Margaret Miller, || 1823-1838.</AUTHOR> <TITLE>Biography and poetical
        remains of the late Margaret Miller Davidson / || by Washington Irving.</TITLE>
        <IMPRINT>New ed., rev.</IMPRINT> <IMPRINT>Philadelphia : || Lea and Blanchard, |
        | 1843.</IMPRINT> <EXTENT>248 p. ; || 18 cm.</EXTENT> <NOTE>Bound with Davidson,
         Lucretia Maria. Poetical remains. 1843.</NOTE> <NOTE>Irving, Washington, || 178
        3-1859.</NOTE> </BIBL> </SOURCEDESC> </FILEDESC> <PROFILEDESC> <TEXTCLASS> <KEYW
        ORDS> <TERM>Davidson, Margaret Miller, || 1823-1838.</TERM> </KEYWORDS> </TEXTCL
        ASS> </PROFILEDESC> </TEIHEADER> ..
        >>
        >> quit
        

      2. Find the "new" MOA headers

        1. (telnet to dlps4)

        2. cd to appropriate directory, based on NOTIS id

        3. display header

        4. copy and paste header into local document

        for example:

                 dlps4% cd /l3/obj/a/j/e/aje3060.0001.001/
                 dlps4% ls
                 aje3060.0001.001.hdr  aje3060.0001.001.raw  aje3060.0001.001.txt
                 dlps4% cat aje3060.0001.001.hdr
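
Judging only from the single example above, the directory appears to be built from the first three characters of the lower-cased NOTIS id, followed by the id plus a suffix. The sketch below encodes that guess; the function name is hypothetical, and the default suffix "0001.001" and root "/l3/obj" are simply copied from the example and may not hold for other items.

```python
def object_directory(notis_id: str, suffix: str = "0001.001", root: str = "/l3/obj") -> str:
    """Guess the header directory for a NOTIS id, following the one example shown above."""
    ident = notis_id.lower()
    if len(ident) < 3:
        raise ValueError(f"NOTIS id too short: {notis_id!r}")
    first, second, third = ident[:3]
    # suffix and root are assumptions taken from the aje3060 example
    return f"{root}/{first}/{second}/{third}/{ident}.{suffix}/"
```

Always confirm the result with ls before relying on it.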
                

      3. Fix headers:

        1. copy and paste into empty document.

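        2. tidy up pasted header by changing \n\([^<]\) to \1 (a regular-expression replace: it removes every line break that is not immediately followed by a tag, rejoining the lines broken by the telnet line-wrap)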

        3. replace the existing <EDITORIALDECL N="1"> .. </EDITORIALDECL> with this piece of boilerplate:
            
          <EDITORIALDECL N="4">
          <P>This electronic text file was created by Optical Character Recognition (OCR) from 1-bit 600-dpi Group-4 TIFF page images, proof-read once, encoded in SGML according to the University of Michigan DLPS "vendor" dtd, converted to TEI-compliant SGML encoding using the TEI-lite dtd, and reviewed for correctness and consistency with the AmVerse Project Style Guide. No editing has been done to the content of the original document, except as described below. Encoding complies with the recommendations for Level 4 of the "TEI in Libraries Guidelines."  Digital page images are linked to the text file.</P>
          </EDITORIALDECL>
          </ENCODINGDESC>
          

        4. paste tidied header into .sgm file in place of place-holder

        5. add NOTIS id number as the value of the "ID" attribute of the <TEI.2> element

        6. save locally in /amverse/temp

        7. record completion in amverse.log.txt
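
The tidy-up replacement in step 2 of "Fix headers" (changing \n\([^<]\) to \1) can also be done outside the editor. A minimal Python equivalent, assuming the pasted header is held in a string; the function name is hypothetical:

```python
import re


def unwrap_header(pasted: str) -> str:
    """Remove every line break that is not immediately followed by '<'."""
    pasted = pasted.replace("\r\n", "\n")  # normalize Windows line ends first
    return re.sub(r"\n([^<])", r"\1", pasted)
```

Lines are joined without inserting a space, because the telnet wrap breaks in the middle of words (as in "poetical re" / "mains" in the example output above).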

    3. Validate

      Only error should be lack of TYPE attribute on the DIVs and LGs (set valcat to point to teilite.implied.dtd in order to eliminate error messages from missing TYPEs).

      Also check that <PUBLISHER> precedes <PUBPLACE> in <PUBLICATIONSTMT>

      <HEADER> may need to be changed to <TEIHEADER>

    4. Search for "~" and replace it as necessary, if meaning is clear. Add prominent missing characters to Bill's SGML-generation script (the script inserts "~" for every 8-bit character that it hasn't already converted to a character entity).

    5. Count pages (in TextPad, use search-in-files on files *.sgm in directory c:\amverse\temp, searching for <pb with these options: case-insensitive, binary file, file counts only). Record the page count in amverse.log.txt.
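
The page count in step 5 can be reproduced without TextPad. The sketch below counts case-insensitive occurrences of "<pb" in each *.sgm file of a directory, reading the files as bytes to mirror the "binary file" option; the function name is hypothetical.

```python
import re
from pathlib import Path

# Same search as in step 5: the literal text "<pb", case-insensitive.
PB_START = re.compile(rb"<pb", re.IGNORECASE)


def count_page_breaks(directory: str) -> dict[str, int]:
    """Return {file name: number of '<pb' occurrences} for every *.sgm file in directory."""
    counts = {}
    for path in sorted(Path(directory).glob("*.sgm")):
        counts[path.name] = len(PB_START.findall(path.read_bytes()))
    return counts
```

For example, count_page_breaks(r"c:\amverse\temp") would return one count per file, corresponding to the "file counts only" output.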

  3. Prepare list of pages for proofreading

    1. Generate list of page numbers:

      • Use TextPad (search-in-files, in files *.sgm, binary files, all lines, search for <PB[^>]*>).
      • Copy results to a new file.
      • Check for empty page break tags (supply number if there is none).
      • Reduce to page numbers only (replace all text except the value of the "N" attribute with nothing).
      • Change all non-numerical text to numerical so that Excel can understand it (e.g., change roman numerals "R0001" to "0.0001"; change t.p. verso "000A" to "0.001"; change appendix pages A0001 after final page 232 to 232.0001; etc.).
      • Save with .sgm files as xxx.pages.txt.

    2. Generate sample:

      • Load xxx.pages.txt into Excel.
      • Use Tools/Data Analysis/Sampling to produce a random sample of page numbers: set number of samples to 5% of pages in the book; output as row range in B column.
      • Sort results (Data/Sort) and check for duplicates; if there are duplicates, replace one with another page.
      • Copy page list into a text file, attach boilerplate and print, like this:
             SAMPLE PAGES FOR PROOFING
             (using page numbers as printed in book)
             
             Text=BenjaInfat
             ---------------
             
             1
             15
             

    3. Attach printed list to back of stack of image prints when those are generated (see below).
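
Steps 1 and 2 of this section could be approximated in Python as sketched below. Several things here are assumptions rather than part of the procedure: the function names are hypothetical; only the three conversion examples given in step 1 are implemented (the "etc." cases are not); the 5% sample size is rounded up with a minimum of one page; and random.sample draws without replacement, which replaces Excel's Sampling tool and makes the duplicate check unnecessary.

```python
import random
import re
from decimal import Decimal

# Captures the double-quoted N value of each <PB> tag.
PB_N = re.compile(r'<PB\b[^>]*?\bN="([^"]*)"', re.IGNORECASE)


def numeric_form(label, last_page=None):
    """Convert a page label to the numeric text described in step 1."""
    if label.isdigit():
        return str(int(label))
    if label == "000A":                       # t.p. verso, per the example in step 1
        return "0.001"
    if re.fullmatch(r"R\d+", label):          # roman-numbered page: R0001 -> 0.0001
        return "0." + label[1:]
    if re.fullmatch(r"A\d+", label) and last_page is not None:
        return f"{last_page}." + label[1:]    # appendix page: A0001 -> 232.0001
    raise ValueError(f"no rule for page label {label!r}")


def sample_pages(numbers, percent=5, seed=None):
    """Pick percent% of the pages at random (no repeats) and return them sorted."""
    if not numbers:
        return []
    size = max(1, (len(numbers) * percent + 99) // 100)  # integer ceiling
    chosen = random.Random(seed).sample(numbers, size)
    return sorted(chosen, key=Decimal)
```

PB_N.findall(text) yields the raw labels from one .sgm file; empty N values still need to be supplied by hand before conversion, as in step 1.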

  4. Convert to Author/Editor file, move to Novell space

    1. import into AE with moa-tei.rls
    2. save as *.ae in same amverse/temp directory
    3. copy contents of local amverse/temp directory to work/markup/amverse/markupcomplete

  5. Enter items into Access AmVerse database master file

    1. Record date, short title, NOTIS id, size.
    2. Assign item to a MURP.

  6. Assign items

    1. Place copy of .ae file in the "ToProof" folder of the "AmVerse" folder in the individual user directory belonging to the MURP to whom the book is assigned.
    2. Send e-mail to notify the MURP of the new assignment

  7. Pick up completed items