Electronic Middle English Dictionary


File modified Tuesday, 09-Jan-2001 09:42:52 EST

For other MEC files, see the MEC INDEX


 x-- Find scripts for this process in \mecode\perl\med2mec --x

1. Merge individual files into fascicle-sized files with DOS copy, e.g.,
   "copy w1d.sgm + w2d.sgm + w3d.sgm + w4d.sgm + w5d.sgm w1.med.sgm"

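   Where DOS copy is not available, the merge can be approximated with
   a short Python sketch. The function name merge_files is hypothetical;
   it concatenates the parts byte for byte, in the order given, and
   adds or strips nothing.

```python
from pathlib import Path


def merge_files(parts, target):
    """Concatenate the files in `parts`, in order, into `target`."""
    with open(target, "wb") as out:
        for part in parts:
            out.write(Path(part).read_bytes())


# Example call, using the file names from the DOS command above:
# merge_files(["w1d.sgm", "w2d.sgm", "w3d.sgm", "w4d.sgm", "w5d.sgm"],
#             "w1.med.sgm")
```
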
2. Examine file superficially for novel problems.

   Remove "entry" containing only credits for letter (or fascicle); save in 
   separate file.
   Run script de-ae.pl to strip out AE-style internal newlines.

3. Run med2mec.pl on the file, e.g.
   "perl med2mec.pl w1.med.sgm > w1.mec.sgm"

   * consider adding to this script:

     o  replacement of m-dash with --
     o  replacement of n-dash with -    [this is now done in the BIBREG stage]
     o  replacement of ‘ with `
     o  replacement of ’ with '
     o  replacement of “ with "
     o  replacement of ” with "
     o  replacement of … with ..
     o  ??removal of nested <DEF> tags.
     o  ??removal of <AUTHOR> tags.
   [Adjust beginning and end of file to ensure that it is continuous
   with contiguous files.]
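
   The character replacements proposed in the list above could be
   sketched as a single translation table. This is an illustration in
   Python, not part of med2mec.pl. It assumes the dashes, quotes, and
   ellipsis occur as literal characters; if the file carries them as
   entities, the keys would have to change. The name
   normalize_punctuation is hypothetical.

```python
# Assumption: the characters occur literally, not as SGML entities.
REPLACEMENTS = {
    "\u2014": "--",  # m-dash
    "\u2013": "-",   # n-dash
    "\u2018": "`",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2026": "..",  # ellipsis, replaced as given in the list above
}
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_punctuation(text):
    """Apply every replacement in one pass over the text."""
    return text.translate(_TABLE)
```
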

4. Remember to add all global changes to this script for automatic application to
   subsequent files.

5. Examine the file superficially again.

   Search it for <?  (i.e., any new processing instructions); confirm that
   they can safely be replaced with nil, then do it, and add the
   replacement(s) to med2mec.pl.
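
   A minimal Python sketch of this check: it lists each distinct
   processing instruction with a count, so that new ones stand out
   before anything is deleted. It assumes that a processing instruction
   runs from <? to the next > and does not span a > character. Both
   function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a processing instruction runs from "<?" to the next ">".
PI_PATTERN = re.compile(r"<\?[^>]*>")


def count_processing_instructions(text):
    """Return each distinct processing instruction with its frequency."""
    return Counter(PI_PATTERN.findall(text))


def strip_processing_instructions(text):
    """Replace every processing instruction with nothing."""
    return PI_PATTERN.sub("", text)
```
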

6. Interactively parse the file with PSGMLS/Emacs, being especially
   attentive to affix entries, which will need a little cleaning up,
   especially the <P> tags, which must be within <DEF>
   tags, not vice versa.  Validate.

   Some of the affix entries may need more elaborate tagging in order
   to deal with summary captions leading or preceding the main senses;
   some may need to be tagged like this:

<SENSEGRP><DEF>A suffix and combining element:</DEF>
 <SENSE N="1">
  <DEF>As suffix:
   <DEF><P>(a) towards;</P></DEF>
   <DEF><P>(b) away from;</P></DEF>
  <P>From both OE and OF.</P>
 <SENSE N="2">
  <DEF>As combining element:
   <DEF><P>(a) with nouns;</P></DEF>
   <DEF><P>(b) with adjectives.</P></DEF>

   You may also find:

Failure to close </P>
Extraneous </HI>
Extraneous </TEXT>
New character entities (or variant spellings of entities already in place)

   Add new character entities to 

      copied to \mecode\alt-dtds\medsel1.ent
   \apps\softquad\panoramapub\sdata.map  (delete old sdata.mpc)
      copied to \mecode\catalogs\sdata.map

7. Check for some problems that do not affect parsing, e.g.:

   o Check for sense numbers preserved as literal text, e.g.:

     <SENSE><DEF><P><HI REND="b">1.</HI>  instead of
     <SENSE N="1"><DEF><P>

   o Check for untagged text that should be part of a <NOTE> element,
     as well as for untagged text that should be part of the FORM element,
     which has prematurely closed.

      E.g. look for </FORM>.+  and ^[^<]
     Some examples of prematurely closed FORM sections:
     Also</FORM> <HI REND="b">wak, woke</HI>
     <FORM><ORTH>warantment</ORTH> <POS>n.</POS></FORM> Pl. <HI REND="b">warentmentis.</HI>
     <FORM><ORTH>wā˘rines</ORTH> <POS>n.</POS></FORM>Pl. <HI REND="b">wā˘rinesses.</HI>
     <FORM><ORTH>wā˘r-shēte</ORTH> <POS>n.</POS></FORM>Pl. <HI REND="b">wā˘r(e)shētes.</HI> 
     <FORM><ORTH>wā˘r-wrēthe</ORTH> <POS>n.</POS></FORM>Pl. <HI REND="b">wā˘rewrēthes.</HI>
     <FORM><ORTH>wille</ORTH> <POS>adj.</POS>   </FORM>Comp. (early)
   o It doesn't hurt to look for "contraction(s)" notes at the same time, which
     should be tagged like this (also included in another step below):
      (= <HI REND="b">wenest thou</HI>)
      (?= <HI REND="b">wenest thou</HI>)
     Just search for the word "contraction."
     Ditto "other forms" notes that should be tagged either as part of
     FORM (with HI REND="b") or as a separate NOTE. Search for the
     literal strings "for forms" and "for other forms".

   o Check for <NOTE>s mistagged as part of <FORM>

      E.g. look for </ORTH>[^<]+[Cc]p\.
      and </POS>[^<]+[Cc]p\.

   o Check for <DEF> tags that break anywhere other than at the real break

      E.g. look for <DEF>[^<(A-Z]

   o Look for <ORTH> mistagged as <HI REND="b">

      E.g. look for <FORM>.*<HI REND="b"
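
   The searches in this step can be run in one pass. The sketch below
   is in Python and copies the patterns from the checks above; the
   labels and the function name run_checks are illustrative only. The
   last pattern generalizes the single sense-number example to any
   number, which is an assumption. It works line by line, so it will
   miss anything that spans a line break.

```python
import re

# Patterns copied from the checks above; labels are illustrative.
CHECKS = [
    ("text after </FORM>", re.compile(r"</FORM>.+")),
    ("line starts with untagged text", re.compile(r"^[^<]")),
    ("Cp. note after </ORTH>", re.compile(r"</ORTH>[^<]+[Cc]p\.")),
    ("Cp. note after </POS>", re.compile(r"</POS>[^<]+[Cc]p\.")),
    ("suspicious <DEF> break", re.compile(r"<DEF>[^<(A-Z]")),
    ("bold text on a FORM line", re.compile(r'<FORM>.*<HI REND="b"')),
    # Assumption: generalizes the "1." example to any sense number.
    ("literal sense number",
     re.compile(r'<SENSE><DEF><P><HI REND="b">\d+\.</HI>')),
]


def run_checks(lines):
    """Return (line_number, label) for every check a line matches."""
    hits = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        for label, pattern in CHECKS:
            if pattern.search(stripped):
                hits.append((number, label))
    return hits
```

   The hits are only candidates; each one still needs to be inspected
   and repaired by hand.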

8a. Extract italic text (HI REND="i"); sort and uniq; look for
    text that should be in a <USG> element.
    (There will be a lot of italic text that should be in <TITLE>
    tags, but this will be mostly fixed by later script.)
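
    A minimal Python sketch of the extract, sort, and uniq sequence.
    It assumes that HI elements are not nested inside one another; a
    nested HI would cut the match short. The function name
    italic_counts is hypothetical.

```python
import re
from collections import Counter

# Assumption: no HI element is nested inside an italic HI element.
ITALIC = re.compile(r'<HI REND="i">(.*?)</HI>', re.DOTALL)


def italic_counts(text):
    """Return sorted (string, count) pairs for every italic run."""
    counts = Counter(run.strip() for run in ITALIC.findall(text))
    return sorted(counts.items())
```

    Changing "i" to "b" in the pattern gives the same listing for bold
    text.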

8b. Run ex-usg.pl, sort and uniq the results, fix oddities manually,
   but if possible add a generalized form of the change to med2mec.pl
   so that it is automatically applied to the next fascicle.

   Note that both MED and MEC application of <USG> tags is uncertain
   and inconsistent. Try to develop a more consistent policy acceptable to

   Currently the "Ppl. xxx as yyy" formula is tagged like this (see W1 and V
   files for additional variants):

   <USG REND="norm"><HI REND="i">ppl.</HI> <HI REND="b">xxx</HI> as noun</USG>:

   Currently, certain formulas are tagged differently depending on whether part
   of it is italicised. This could use some work.  E.g.:

   <USG TYPE="syntax" REND="norm">also refl.</USG>
   &mdash;<USG TYPE="syntax" REND="norm">used refl.</USG>


   also <USG TYPE="general">fig.</USG>
   &mdash;used <USG TYPE="general">fig.</USG>

   alongside (?):

   <USG TYPE="general" REND="norm">also <HI REND="i">fig.</HI></USG>
   &mdash;<USG TYPE="general" REND="norm">used <HI REND="i">fig.</HI></USG>

9. Validate.

10. Run med-etym, med-form, med-def, and med-q.pl
    to tidy up presentation.

11. Re-validate.

12. Normalize with SGMLNORM and search normalized file(s) for &#[^;]+;
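
    The search on the normalized file can be sketched in Python as
    follows. The pattern is the one given above; the function name
    char_reference_counts is hypothetical. It returns each distinct
    match with a count so that the matches can be reviewed together.

```python
import re
from collections import Counter

# Pattern taken from the step above.
CHAR_REF = re.compile(r"&#[^;]+;")


def char_reference_counts(text):
    """Return each distinct match of the pattern with its frequency."""
    return Counter(CHAR_REF.findall(text))
```
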

13a. Search for untagged features:

     a. Search for untagged <POS>

      Use ex-pos.pl (in perl/medj/) to extract commonest pos regions, 
      or search in files for 
     b. Search for untagged <LANG> in <ETYM>s. 
        First check for missing spaces before HI in ETYM, searching
        in files for <ETYM>.+[^ *?]<HI .  Repair.
        Extract untagged languages with cp-lang.pl and cp2-lang.pl.
        Search out the results in individual files and repair them.
        Probably better not done globally, since only manual repair
        will recognize, e.g. "Swed. dial." when script has extracted
        only the "Swed.". These scripts are not perfect, but they
        catch all untagged strings in ETYM previously tagged as LANG,
        so long as they are either preceded by a space and followed 
        by a punctuation mark or space, or preceded and followed by
        ETYM tags.
     c. Search for untagged <LANG> in <DEF>s.
        Extract the bracketed bits in DEFs using either ex-dfbrk.pl or langdef.pl
        (which are equivalent). Delete the non-LANG brackets from the resulting
        text file. Turn the trimmed-down text file into a substitution script with 
        the aid of makesub.pl. Run it on the file(s).
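
        The matching rule described under "b" can be illustrated with
        the Python sketch below. It is not the code of cp-lang.pl or
        cp2-lang.pl, whose internals are not given here; it only
        follows the stated rule: report a string previously tagged as
        LANG when it appears in an ETYM either preceded by a space and
        followed by a space or punctuation mark, or as the whole
        content of the ETYM. All names are hypothetical.

```python
import re

ETYM = re.compile(r"<ETYM>(.*?)</ETYM>", re.DOTALL)
LANG = re.compile(r"<LANG>(.*?)</LANG>", re.DOTALL)


def known_languages(tagged_text):
    """Collect every string already tagged as LANG."""
    return set(LANG.findall(tagged_text))


def untagged_languages(text, known):
    """List known language strings that occur untagged inside ETYMs."""
    hits = []
    if not known:
        return hits
    # Longest names first, so a longer known name wins over its prefix.
    names = "|".join(re.escape(k)
                     for k in sorted(known, key=len, reverse=True))
    inside = re.compile(r"(?<= )(" + names + r")(?=[ ,;:.)\]])")
    for etym in ETYM.findall(text):
        bare = LANG.sub("<LANG/>", etym)  # mask what is already tagged
        if bare.strip() in known:         # whole ETYM is a language name
            hits.append(bare.strip())
        hits.extend(inside.findall(bare))
    return hits
```

        As the note under "b" warns, a rule like this reports only
        "Swed." for "Swed. dial." unless the longer string has itself
        been tagged before, so the results still need manual repair.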
13b. Search for mistagged features; fix LayBrut problem:

     a. Extract <LANG> tags. Examine, repair.
        Check for <LANG> ... / ... </LANG>
        (e.g. <LANG>AF/ONF</LANG>)
        We are tagging these as <LANG>AF</LANG>/<LANG>ONF</LANG>

        Search for </LANG> *dial to find "dial." designations omitted
        from LANG tags. Repair.
     b. Fix LayBrut problem.
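
        The slash repair described under "a" can be sketched in
        Python. The function name split_slashed_lang is hypothetical;
        it rewrites a LANG element that contains one or more slashes
        as separate LANG elements joined by slashes, and leaves every
        other LANG element alone.

```python
import re

# A LANG element whose content contains at least one slash.
SLASHED_LANG = re.compile(r"<LANG>([^<]*/[^<]*)</LANG>")


def split_slashed_lang(text):
    """Turn <LANG>AF/ONF</LANG> into <LANG>AF</LANG>/<LANG>ONF</LANG>."""
    def repl(match):
        parts = [part.strip() for part in match.group(1).split("/")]
        return "/".join("<LANG>" + part + "</LANG>" for part in parts)
    return SLASHED_LANG.sub(repl, text)
```
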
13c. Revalidate.

14. Do standard QC:
     Upload latest version of file(s) to DEV.
     Upload latest version of TEIish and teidict.dtd if necessary.
     Extract regions with QC-online and check for inconsistencies.
     Make corrections concurrently in Emacs.
     Add global changes to one of the preceding scripts, as relevant.

      Extract italics and examine (<HI REND="i">)
      Extract bold text and examine (<HI REND="b">)
      Extract ORTH elements and examine
15. Do standard bib regularization, watching especially the regions
    </TITLE>[^<]+</STNCL> and <AUTHOR>[^<]+</AUTHOR>

    a.  Make sure that st-reg1.pl (in \mecode\perl\medH) contains all the
        changes from the most recent bibreg process. Run it.

    b.  Make sure that Regbib.pl (in \mecode\perl\med2mec)
        contains all the most recent 'maintain' scripts. Run it.

    c.  Extract regions and prepare scripts of necessary changes, following
        the procedure in MED STEPS, stage "H".

   [d.  It doesn't hurt to do two versions of at least parts of "c": one
        version on the results of "a"; one version on the results of "a"
        and "b". Compare them, look for discrepancies indicative of
        problems in the second script (or in both of them!).]
    e.  Extract the OD col. stencils from the list of stencils produced
        at the end of the bib reg process. Regularize these to something
        approaching MED standard (moving refs to BIBL unless it is a
        manuscript; adding asterisks for unpublished items; matching
        other OD-col regularization previously done in order to produce
        mappable stencils, or at least consistent ones; renaming
        stencils to match existing non-OD-col stencils if appropriate.)
    f.  Don't forget to add all your scripts to the main bib reg script,
        for automatic application next time.

16. Look for bracketed quots.; remove brackets in favor of <CIT TYPE="b">

    [See MED STEPS stage "I" for details.]
    NOTE: this step is now more easily done while doing the DATE part
    of bib regularization.

17. Look for parenthetical notes (espec. re: contractions) in <FORM> section.

    [See MED STEPS stage "I" for details.]
    NOTE: much of this step may already have been done while doing QC on
    the FORM section.


Future changes to MEC files
1. Take "Cp." notes out of <FORM> and tag them separately as <NOTE>;
   change the <ORTH> in <NOTE> to <HI REND="b">.
2. Consider multi-tagging/nesting <DEF> elements.
3. Add <AUTHOR> tags (as part of BIB MERGER)
4. Add <POS> tags.
5. Add <LANG> tags.