Electronic Middle English Dictionary

MED PROCESSING: BASIC STEPS

Revised pfs 10 Oct 1998
File modified

See also BASIC STEPS for processing current MED production.

For other MEC files, see the MEC INDEX.


--------------------------------------------------
A. (6-7 pfs; rest jmm or pfs) RECEIVE/SAMPLE FILES
--------------------------------------------------

NOTE: regexp syntax assumed by most regexps below is that of
      TextPad 3.1 (non-POSIX-compliant).

>> Find scripts for this stage in directory /mecode/perl/medA <<

1. Save incoming file in "incoming" directory under letter.

2. Record receipt in Apex summary report file.

2a. If file arrives as text, check for null characters in hex editor.

3. Generate a list of page numbers as a 5% sample of the file, using
   MS Excel. If the file contains a non-continuous run of pages,
   extract a list of page numbers using TextPad (^\^p[0-9]+), save as
   a file (one page number per line), import the file into Excel,
   then sample the cell range. Otherwise, just input the first
   page number in a1, right-click and drag the lower right
   corner of a1 to a100 (etc.), fill-in-series, tools-data analysis-
   sampling-5 samples-range a1:a100. (A Perl alternative is
   sketched below.)
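
   As an alternative to the Excel route, a minimal Perl sketch
   (assuming a file of page numbers, one per line; the names
   sample-pages.pl and pages.txt are hypothetical):

   #!\apps\Perl\bin\Perl.exe
   # sample-pages.pl: print every 20th page number (a 5% sample).
   # usage: perl sample-pages.pl pages.txt
   my $n = 0;
   while (<>) {
       print if $n++ % 20 == 0;    # keep 1 line in 20
   }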

4. Extract sample pages from file using PFE, save as ?-test.txt
   file in sampling subdirectory.

5. Run apx2html.pl on test file, print, proofread.

6. Record sample page numbers, number of errors, etc. in Apex
   summary report.

6a. Look for recurrent character-entity problems, especially
    those that showed up in proofing. A typical problem to
    look for is confusion between i-umlaut and i-macron
    (and sometimes i-acute). Here are examples in "M":

    xxx = apex text has umlaut instead of macron
    zzz = apex text has macron instead of umlaut
    yyy = apex text has macron instead of acute
    www = apex text has nothing instead of acute

     yyy  m!ol
     www  mil
     xxx mil!on
     xxx mist!on,
     xxx  modif!er
     xxx mordicac!on,
     xxx Mor!en
     xxx moc!on
     xxx multipli&-;cac!on
     maybe: mundif!er
     xxx mutac!on
     zzz manl&i:;ce
     zzz manl&i:;

    Sample searches: 

    \^b[^\^]+&il;er[^\^]*
    \^b[^\^]+&il;en[^\^]*
    \^b[^\^]+&il;on[^\^]*
    \^b[^\^]+l&i:;[^\^]*

    Also: 

    o search for and repair unknown characters (&?;), diacritics (\d?\),
      words ($$word$$), and lines ($$line$$)

    o Deal with the Apex conversion report.
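
    A quick survey script (a sketch, not one of the standard scripts;
    the name flag-idia.pl is invented):

    #!\apps\Perl\bin\Perl.exe
    # flag-idia.pl: print, with line number, every headword line (^b)
    # containing one of the confusable i-diacritic entities.
    while (<>) {
        print "$.\t$_" if m#\^b[^\^]*&(il|i:|i/);#;
    }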


7. Report file status to jpw apex.

8. o Run laybrut.pl to change c1300: to BrtOtho: and a1225: to BrtClg:
     (a sketch of this substitution follows at the end of this step).
   o Search for 

       \^u[^\^]+\^[^\^]+\[[^]]*BrtOtho:[^]]+\]  and
       \^u[^\^]+\^[^\^]+\[[^]]*BrtClg:[^]]+\]

     Examine results: confirm that all the affected quots are in fact
     from Lay.Brut. When that is confirmed, remove the "Brt" in PFE.

   o Search for \^u[^\^]+\^[^\^]+\[[^]]*\^d\?*[ac\?]*[0-9]+\^:[^]]+\]
     (or just \^d\?*[ac\?]*[0-9]+\^:); change variant to current
     style, viz. typically a MS id (Rwl:) or a short-title id (^uDSPhilos^u:).

     Run fixvar.pl

     Beginning with E and F (a similar check may be needed on G/H,
     etc.), search also for:
     \^u[^\^]+\^[^\^]+\[[^]]*\^d\?*[ac\?]*[0-9]+\^ *\^u[^\^]+\^:[^]]+\]  and
     \^u[^\^]+\^[^\^]+\[[^]]*\^d\?*[ac\?]*[0-9]+\^ *[^:]*:[^]]+\]
     which look for partial or full stencils in bracketed variant
     reading notes.

     

     Run var1.pl *.brut >> EF.var

     {Replace "[" with "$", ftp the file to the unix box, and run:

     sort +1 -t$ EF.var > EF.var.sort

     then ftp EF.var.sort back.}

     OR just remove all text up to "[" and sort in TextPad.

     Examine, especially for dates in [ ] variants;
     do not confuse bracketed quotes with bracketed variants.



     Script the changes or make them by hand. Add scripted changes
     to fixvar.pl.

     Search for \[[^]:]+: and check again, and again if necessary.

         Some styling decisions; change:

       [^da1500^:    to   [Rwl:  if MS can be determined by date.
       [^da1500^:    to   [a1500 vr. if MS cannot be unambiguously determined.
       [^d1500^:     to   [1500 ed.: if date is of unidentified printed edition not
                                      listed in bib.
       [^d1500^:     to   [LGW Prol.(2):  if date refers to other version
       [^d1500^ Pyson ed. to [1500 Pynson ed.
       [^uB^. 1425. Hrl.  to [^uB^ (Hrl):
       [^d1500^ Hrl:  to     [Hrl:   or [Hrl 145: as necessary.
       [ed. 1500:     to     [1500 ed.:
       [^uWBible(2)^: to     [^uWB(2)^:  et sim. with other current abbreviations.
       [^da1425^ ^uB^: to    [^uB^ (Pep 2014):
       [vr. ^da1400^ ^uRecl.^: to [Pep:
       [^dc1450^ ^uLond.Chron.Cleo.^ 129: to
           [^bc1450^ ^uLond.Chron.Cleo.^ 129:
       [so ^da1425^ AS] to [so also a1425 (AS)]

       In a couple of cases replaced ^d1496^: with <TITLE>Bk.St.Albans (1496)</TITLE>:
       in quots from Treat.Fish. -- check to see if this is right.

   o Check for ^\t\^b[IVX]+\^\. (text) and variants thereon to find
     sensegroups. Modify sensegrp.pl as necessary, then run it.
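
   A sketch of the kind of substitution laybrut.pl performs (assumed
   shape; the real script may be more guarded about context):

   #!\apps\Perl\bin\Perl.exe
   # retag bare Lay.Brut dates in bracketed variant notes.
   while (<>) {
       s#\[c1300:#[BrtOtho:#g;    # Otho MS
       s#\[a1225:#[BrtClg:#g;     # Caligula MS
       print;
   }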

-------------------------------------------------
B. (pfs or jmm) Bundling / Apx -> Simp Conversion
-------------------------------------------------

1. When all files for a letter have been received, use PFE to
   cut and paste them into files matching the specs defined in
   the bundles.txt file.

[ 2. Leave them in a convenient place on DEV, notify Nigel.]
[ 3. Receive processed "simp" files from Nigel.]

2. Ftp them to dns: /work/staff/pfs/med

3. Run the following on each file:

    preprocess0.pl <file> | preprocess1.pl > <outfile>

   e.g.: preprocess0.pl F2.raw | preprocess1.pl > F2.simp

4. Ftp the resultant files back.


---------------------------
C. (jmm) PROCESS SIMP FILES
---------------------------

>> find scripts for this stage in directory /mecode/perl/medC <<

1. Run fix-qg.pl on each file to repair lumped-together quote
   groups.

2. Run fix-lb4w.pl on each file to repair titles mistagged as
   <u> or <lb>.

3. Run simpprep.pl on each file to replace:

   &eolhyphen;     with      {eolhyphen}
   &u:;            with      &uuml;
   &L;             with      &pound;
   &yl;            with      &ymacr;
   &i/;            with      &iacute;
   <!-- page # --> with      (space)
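
   These amount to a chain of global substitutions; a minimal sketch
   of the core of simpprep.pl (assumed, not copied from the script;
   the page-comment pattern in particular is a guess):

   #!\apps\Perl\bin\Perl.exe
   while (<>) {
       s#&eolhyphen;#{eolhyphen}#g;
       s#&u:;#&uuml;#g;
       s#&L;#&pound;#g;
       s#&yl;#&ymacr;#g;
       s#&i/;#&iacute;#g;
       s#<!-- page [0-9]+ --># #g;    # page-number comment -> space
       print;
   }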

3b. Run simpdef.pl and simptab-def.pl on each file.

4. Manually add doctype to top of each file:

   <!DOCTYPE med PUBLIC "-//MED//DTD MEDSIMP 1.0//EN">

5. Interactively parse document using emacs/psgml till each
   file validates. 

   x Search for comments, repair as needed
     (search for: <!-- [^<]+ --> ).

   x Search for (regexp) ^\t to find missing <s><def>...</def> tags;
     these are frequently missing when a def begins with a tab
     character.

   x Note and report any consistent problems--especially ones not
     already noted. If possible, devise a global solution and add it
     to simpprep.pl.

   x Fix all problems noted in the proofreading stage.

   x In E and F: search for <def> beginning "Cp." or "Also"

   x In E and F: rewrite all affix entries to nearly current style
     (search for "prefix" and "suffix")

   x Search for bolded sense numbers as clue to missing <s> tags

-----------------------------------------------------
D. (jmm or pfs) CONVERT MED-simp to TEIish (=TEI v.1)
-----------------------------------------------------              

>> find scripts for this stage in directory /mecode/perl/medD <<

1. Place file "normcat" in working directory
  (find original in mecode\catalogs directory)

2. Run sgmlnorm to add all closing tags:

"sgmlnorm -cnormcat -Dc:\mecode\alt-dtds -n filename > newfile"

  (call target file Bndl#.s1.sgm)

3. Run fix-case.pl to reconvert the tags to lower case.

  (call target file Bndl#.s2.sgm)
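
   A sketch of the kind of substitution fix-case.pl performs (assumed,
   not copied from the script):

   #!\apps\Perl\bin\Perl.exe
   # lowercase element names only; attributes and text are untouched.
   while (<>) {
       s#<(/?)([A-Za-z]+)#"<$1" . lc($2)#ge;
       print;
   }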

4. Run med2tei.pl to convert to the TEIish tagging scheme.

  (call resultant file Bndl#.t1.sgm)

5. Run captags.pl to capitalize all TEIish tags.

  (call resultant file Bndl#.t2.sgm)

6. Run fixtit2.pl to change <TITLE> to <USG> in most necessary cases.

  (call resultant file Bndl#.t3.sgm)

7. Run teiprep.pl to correct prematurely closing stencils; to move
   the opening bracket of bracketed quots. from the <CIT> element to
   the <DATE> element; to add the correct DOCTYPE and <TEXT><BODY>
   tags to each file; and to change the code &#38; (added by the
   normalizer to stray ampersands) to &amp;.

8.  (call target file Bndl#.t4)
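
  The last of the teiprep.pl changes is a one-line substitution;
  a sketch:

  #!\apps\Perl\bin\Perl.exe
  while (<>) {
      s/&#38;/&amp;/g;    # stray ampersands back to &amp;
      print;
  }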

9. Run in-etym.pl to change various tags that appear incorrectly in
   the <ETYM> (viz. <USG>, <TITLE>, and <ORTH>) to <HI...>.

10. Correct each file using the interactive parser in emacs.

  Most of this will probably involve adding fake tags to affix
  entries to get them to parse. You can probably replace
  
        </DEF>^J</SENSE> 

  with

        </DEF><EG><CIT><BIBL></BIBL><Q></Q></CIT></EG>^J</SENSE>

11. Validate.

12. If any other global changes proved necessary, try to add
    them to the teiprep.pl file to expedite the conversion of the
    next bundle or letter. In any case, note them.

-----------------------------------------------------
E. (pfs or jmm) MISC. CLEANUP (produces post-TEI v.2)
-----------------------------------------------------

NB   Consider re-doing some of these steps on QR and T files.

>> find scripts for this stage in directory /mecode/perl/medE <<

>> In Emacs...

1.   Go back to those fake tags added to make the affix entries
     parse (search for <Q></Q>) and insert <P> tags as needed 
     to reproduce MED paragraphing (i.e., look in print MED to
     see where paragraphs are, add <P>...</P> tags to match).

2.   Search for "&#" to find stray keyboard characters converted
     by the normalizer into decimal entities. (The most common is
     the ampersand, &#38;, but this is eliminated earlier by
     teiprep.pl.)

3.   Search for: [[ ]*</Q>   Fix any problems.

4.   Repair anything noted in <!-- comments -->. Remove comments.

>> At command prompt...

[to pfs: Look at S-*.pl files to see if there is anything there that needs to
be added to these tidy scripts....]

5.   Run tidy.pl on all files, in order to:

     x  [Replace page numbers with spaces. This line is still in
        script, but has now also been added to the simpprep.pl 
        script, in order to get rid of the page numbers earlier on]

     x  Change <USG> to <TITLE> when stencils close prematurely (e.g.:
        <TITLE>Doc.</TITLE></STNCL> in <USG>Sur.Soc.4</USG>)
        (this is the same task performed by the standalone script
        bad-usg.pl)

     x  Move MS abbrevs inside stencils
     x  Attach misplaced version numbers to preceding titles     
     x  change <TITLE>*  to *<TITLE>

     x  replace  <TITLE>.           WITH:   .<TITLE>
        replace  </TITLE>.          WITH:   .</TITLE>
        replace  </TITLE></STNCL>.  WITH:   .</TITLE></STNCL>
        replace  </USG>.            WITH:   .</USG>
        
        (this is probably unnecessary these days, since the
        simp conversion routines seem to produce these results
        anyway, but it doesn't hurt)

     x  replace  .</ORTH>           WITH:   </ORTH>.
        replace  ,</ORTH>           WITH:   </ORTH>,

     x  remove colons after <Q> and ensure that a space starts <Q>
     x  change label [Aa]str. to [Aa]stron. and ensure that Chaucer's
        Astr. is tagged as <TITLE>Astr.</TITLE> and Metham's Palm. as
        <TITLE>Palm.</TITLE>

     x  change <DEF>?space-nonequals to <DEF>?nospace-save
     x  change <ETYM>?space to <ETYM>?nospace
     x  change <ORTH>?space to <ORTH>?nospace
     x  change </DATE>?spaces to </DATE> ?
     x  change (? [a-zA-Z] to (?save
     x  change [? [a-zA-Z] to [?save
     x  fix merged quote groups not caught by the earlier fix-qg.pl
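
     Most of these are simple paired substitutions; a sketch of the
     four period/comma moves (assumed shapes, not copied from tidy.pl):

     #!\apps\Perl\bin\Perl.exe
     while (<>) {
         s#<TITLE>\.#.<TITLE>#g;      # <TITLE>.  ->  .<TITLE>
         s#</TITLE>\.#.</TITLE>#g;    # </TITLE>. ->  .</TITLE>
         s#\.</ORTH>#</ORTH>.#g;      # .</ORTH>  ->  </ORTH>.
         s#,</ORTH>#</ORTH>,#g;       # ,</ORTH>  ->  </ORTH>,
         print;
     }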
      

6.   Run tidydef.pl, which does the following within <DEF>:

     x  change nospace?spaces-nonequals to save-space?-save
     x  change space?spaces-nonequals to  space?nonspace-save
     x  change ,nospace to ,space

7.   Run tidyet.pl to tidy the bold <HI>s in the <ETYM> somewhat,
     changing:

     x  ?spaces to space?
     x  ,nospace to ,space
     
     x  <HI REND="b">string1, string2</HI> to
        <HI REND="b">string1</HI>, <HI REND="b">string2</HI>

     x  <HI REND="b">string3.</HI> to <HI REND="b">string3</HI>.
     x  <HI REND="b">*...   to *<HI REND="b">...

8.   Run tidyorth.pl to tidy the <ORTH>s similarly, changing:

     x  ",nospace" within <ORTH>  to:   ",space"
     x  ", "       within <ORTH>  to:   "</ORTH>, <ORTH>"
     x  ". "       within <ORTH>  to:   "</ORTH>. <ORTH>"
     x  " &rarr; " within <ORTH>  to:   "</ORTH> &rarr; <ORTH>"

9.   Run tidyform.pl to:

     x   change <HI rend="b">...</HI> to <ORTH>...</ORTH> in <FORM>
     x   add space after commas in <FORM>
     x   take space away from after ? in <FORM>
     x   move query from inside to outside of <ORTH>

>> In TextPad (and command prompt) ...

10.  Search in files for

                                l[0-9] or [0-9]l 
     and                        1[cdefghijklmnopqrtuvwxyz][^"] 
     and                        [bd-z]1
     and                        ........1b\.[^<].....

     ...generate a list, examine it, and convert the genuine
     errors into a substitution script; run it. If any other
     good search patterns for one/el confusion turn up, add them
     to this list. (An example script is sketched at the end of
     this step.)

     ...and search for:

     <USG> inside of <STNCL>:     </USG>[^<]*</STNCL>

     If enough of these occur, we may need to make a script to handle it;
     in the meantime, it picks up some interesting errors.
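
     An example of such a substitution script (the corrections shown
     are invented placeholders, not real repairs):

     #!\apps\Perl\bin\Perl.exe
     # one/el repairs harvested from the surveyed list.
     while (<>) {
         s#mi1e#mile#g;      # hypothetical: digit one for letter el
         s#1ove#love#g;      # hypothetical
         print;
     }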

>> in EMACS...

11.  Manually change errors noted in the initial sampling of data (you
     were already supposed to do this; this is just a reminder...)


12.  Re-validate.


------------------------------------------------------
F. (jmm or pfs) HYPHEN PROCESS (produces post-TEI v.3)
------------------------------------------------------

>> locate scripts for this stage in /mecode/perl/medF <<

PART 0 (required):

Method: run old scripts on new files.

1. Find old general hyphenkill script ()
2. Examine most recent nonce kill script
3. Add generally useful lines from (2) to (1)
4. Run hyphenkill

5-8. Do the same for the general hyphenharden script ().


PART 1 (required):

Method: Search specific contexts for hyphens; assess and fix those found.

Follow this route:

0. Run scripts from previous volumes
1. Extract stencils using ex-stncl.pl
2. Search in those files for <[^>]+>[^<]*{eolhyphen}[^<]*<[^<]+>
3. Search in all bundle-files for </STNCL>[^<]*{eolhyphen}[^<]*</BIBL>
4. Create substitution scripts: one to remove soft hyphens; one to insert hard ones.
5. Search in all bundle-files for \[[^]]+{eolhyphen}[^]]*:
   or maybe just [^ ]+{eolhyphen}[^ ]+:
6. Examine; add to previous scripts.
7. Search for {eolhyphen}cent
8. Examine; add to previous scripts.
9.  Extract definitions using ex-def.pl
10. Search in those files for [^ ]+{eolhyphen}[^ ,.:?!]+
11. Examine; create a new pair of scripts confined to the <DEF> context.
12. Extract etymologies using ex-ety.pl
13. Search in those files for [^ ]+{eolhyphen}[^ ,.:?!]+
14. Examine; create a new pair of scripts confined to the <ETYM> context.
    (A sketch of this kind of extraction follows this list.)
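
    A sketch of the kind of extraction ex-stncl.pl, ex-def.pl, and
    ex-ety.pl perform (assumed; the real scripts may keep more context):

    #!\apps\Perl\bin\Perl.exe
    # print each <STNCL>...</STNCL> on a line of its own.
    undef $/;                    # slurp the whole file
    $_ = <>;
    print "$1\n" while m#(<STNCL>.*?</STNCL>)#gs;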


NOTE: DON'T FORGET TO ESCAPE METACHARACTERS * . ( ) [ ] + ? { } \

>> to create anchored hyphen-removal scripts in textpad, replace

   \([^|]+\)\|\([^|]+\)

   with

   $victim =~ s#([^a-zA-Z;])\1\\{eolhyphen\\}\2([^a-zA-Z\&])#$1\1\2$2#g;

   and add appropriate lines at top and bottom (see eg pd-def.pl).

>> to create unanchored hyphen-removal scripts in textpad, replace

   \([^|]+\)\|\([^|]+\)

   with

   s#\1\\{eolhyphen\\}\2#\1\2#g;

   and add appropriate lines at top and bottom (see eg killsoft.pl)

>> to create anchored hyphen-hardening scripts in textpad, replace

   \([^|]+\)\|\([^|]+\)

   with

   $victim =~ s#([^a-zA-Z;])\1\\{eolhyphen\\}\2([^a-zA-Z\&])#$1\1-\2$2#g;

   and add appropriate lines at top and bottom (see eg ph-def.pl).

>> to create unanchored hyphen-hardening scripts in textpad, replace

   \([^|]+\)\|\([^|]+\)

   with

   s#\1\\{eolhyphen\\}\2#\1-\2#g

   and add appropriate lines at top and bottom (see eg addhard.pl).

>> top/bottom lines for <DEF>-restricted scripts:

#!\apps\Perl\bin\Perl.exe
$/ = "</DEF>";                     # read one record per closing </DEF>
while (<>) {
    # lift the <DEF>...</DEF> out, leaving a <spot> placeholder
    while (s#(<DEF>(.|\n)*?</DEF>)#<spot>#) {
        $victim = $1;

        ...

        # lowercase the DEF tags so the put-back cannot re-trigger
        $victim =~ s#<(/?)DEF>#<$1def>#g;
        s#<spot>#$victim#;
    }
    s#<(/?)def>#<$1DEF>#g;         # restore tag case
    print;
}

>> top/bottom lines for <ETYM>-restricted scripts:

#!\apps\Perl\bin\Perl.exe
$/ = "</ETYM>";                    # read one record per closing </ETYM>
while (<>) {
    # lift the <ETYM>...</ETYM> out, leaving a <spot> placeholder
    while (s#(<ETYM>(.|\n)*?</ETYM>)#<spot>#) {
        $victim = $1;

        ...

        # lowercase the ETYM tags so the put-back cannot re-trigger
        $victim =~ s#<(/?)ETYM>#<$1etym>#g;
        s#<spot>#$victim#;
    }
    s#<(/?)etym>#<$1ETYM>#g;       # restore tag case
    print;
}

15. Run the scripts on all files.


PART 2 (optional; adds about 4-6 hrs/bundle?; may be deferred):

[1. Run any previous hyphen-elimination script on all files]

2.  Extract list of hyphenated words from first file, searching for

    [^ ]+{eolhyphen}[^ ,.:?!]+

    Convert {eolhyphen} to | for ease of inspection.

3.  Generate substitution scripts and run them on all files. Anchor search
    strings at least at the left end, possibly also at the right. E.g.:

    # remove these hyphens
    s#([^a-zA-Z;])he{eolhyphen}re([^a-zA-Z&])#$1here$2#g;
    s#([^a-zA-Z;])mo{eolhyphen}ne&thorn([^a-zA-Z&])#$1mone&thorn$2#g;
    s#([^a-zA-Z;])&thorn;e{eolhyphen}se([^a-zA-Z&])#$1&thorn;ese$2#g;
    s#([^a-zA-Z;])corrup{eolhyphen}te([^a-zA-Z&])#$1corrupte$2#g;

     x one script to replace {eolhyphen} with nil
       (plus any special changes, e.g. {eolhyphen} to space).
     x one script to replace {eolhyphen} with hyphen
       (most commonly in stencils, where presence of hyphen is
       mandatory), as well as catching any missed by script 1.

4.  Extract list of hyphenated words from second file. Convert
    {eolhyphen} to | for ease of inspection.

5.  Generate substitution scripts. Run them on all files as before.

6.  Extract list of hyphenated words from third file. Convert
    {eolhyphen} to | for ease of inspection.

7.  Generate substitution scripts. Run them on all files as before.

8.  Etc.

PART 3 (required)

1.  Search all files for | in case any are lurking in there....

2.  Change all remaining \{eolhyphen\}s in files to "|"

3.  Add all or some of the 

-------------------------------------------
G. (jmm) QC-online (produces post-TEI v. 4)
-------------------------------------------

1. Fix any leftovers.

   a. Run worse-usg.pl which

      x changes <USG> to <TITLE> globally within <Q>
      x changes quot(s): <USG> to quot(s): <TITLE> within <DEF>
      x changes <USG>PNElem.</USG> to <TITLE>PNElem.</TITLE> within <DEF>

   b. Run orphan-a.pl which

      x changes (a) </DEF><EG> to </DEF><EG n="a">
      x changes (a) [</DEF><EG><CIT><BIBL><STNCL><DATE>  
                  to </DEF><EG n="a"><CIT><BIBL><STNCL><DATE>[
      x changes [</DEF><EG><CIT><BIBL><STNCL><DATE>  
                  to </DEF><EG><CIT><BIBL><STNCL><DATE>[

   Note: orphan-a version 1 failed to fix N/O files because of fewer line-breaks.
   Use orphan-a2.pl instead.
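
   A sketch of the first orphan-a change (assumed shape; the
   line-break-tolerant orphan-a2.pl will differ in detail):

   #!\apps\Perl\bin\Perl.exe
   # move a stray "(a)"-style sense letter into the <EG> n attribute.
   undef $/;
   $_ = <>;
   s#\(([a-z])\) *</DEF><EG>#</DEF><EG n="$1">#g;
   print;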


   c. [In N/O we applied n-lay.pl to change

 [</q></cit><LB><CIT><BIBL><STNCL><DATE>c1300</DATE>:([^\]]+?)\]([^<]+?)<DATE>

      to:

 [Otho:$1]$2</q></cit><LB><CIT><BIBL><STNCL><DATE>]

      (etc.). This must be run BEFORE orphan-a2.pl.]

      In IJKL we avoided this problem at step A, using laybrut.pl.

In M, we searched for

<TITLE>[^<]+</TITLE>[^<]*</STNCL>[^<]+</BIBL> *<Q>.*\[[^]]*\?*[ac]*[0-9][0-9][0-9][0-9]:[^]]*\]]

to see whether all the c1300:'s and a1225:'s were from the Brut;
changed them, then checked for others.


   d. Search files for ([b-z]) *<
   e. Search files for [quot.:</DEF> and [quot.</DEF>

2. Validate.

3. Upload and index.

4. JMM follows his order of examining <FORM> <ORTH> <USG> <Q>
   <DEF> <HI> -- all but <BIBL>, <STNCL>, <TITLE>...

5. Make repairs concurrently with Emacs.

6. Revalidate.

7. Re-upload and reindex as needed. Note problems that can be
   solved globally; insert solutions into earlier automatically-
   applied scripts (if possible): e.g., into teiprep.pl or tidy.pl
   to avoid the problem next time around.

------------------------------------------------------------
H. (pfs) STENCIL/BIBL REGULARIZATION (produces post-TEI v.5)
------------------------------------------------------------
>> Find scripts for this stage in \mecode\perl\medH
>> test new scripts for metacharacters using 

     FIND (REGEXP) ^s#[^#]+[^\\]\.
                   ^s#[^#]+[^\\]\?
                   ^s#[^#]+[^\\]\*
                   ^s#[^#]+[^\\])
                   ^s#[^#]+[^\\](
                   ^s#[^#]+[^\\]\[
                   ^s#[^#]+[^\\]\]

1. Run general regularization script st-reg1.pl

   (add any additional general stencil regularizations to this script).

2. Extract list of <TITLE>[^<]+</TITLE> from the output, sort and uniq.
   (Use textpad: find-in-files, condition text/regexp, type binary,
   detail all matching lines; tools: sort, order ascending, case-
   sensitive, NOT in char.code order, remove duplicate lines).

   Create B-tit.pl (or sim.) script based on problems discovered
   while surveying list. If necessary, look at further context by
   searching for string (use Textpad: find-in-files, condition text,
   NOT regexp, file type text, detail all matching lines; use regexp/
   binary type only if necessary.) Look out for effect of changes on
   neighboring regions, espec. effect of final stop on spacing after
   close of </TITLE>[</STNCL>]. I.e., if you are adding a final period
   to the title, you should go remove the leading space after </STNCL>;
   if you are removing a final period, you need to go ADD a leading
   space after </STNCL>.
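
   The same extract-sort-uniq pass can be scripted (a sketch;
   TextPad's find-in-files does the same job):

   #!\apps\Perl\bin\Perl.exe
   # list each distinct <TITLE> with a count of its occurrences.
   my %seen;
   while (<>) {
       $seen{$1}++ while m#(<TITLE>[^<]+</TITLE>)#g;
   }
   print "$seen{$_}\t$_\n" for sort keys %seen;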

3. If it is substantial, run the script from step (2); otherwise, keep
   adding to it. Extract a list of "beginnings" from the files (from the
   output of the previous script, if it has been run).

   A beginning is defined as </DATE>[^<]*<TITLE>   

   o  If working with text with <AUTHOR> tags, extract a list of authors.
      Like this: </DATE> *<AUTHOR>[^<]+</AUTHOR>[^<]*<TITLE>

   Extract list of "middles" from same output. A middle is defined as
   .</TITLE>[^<]*<TITLE>   (don't forget that leading dot). Replace
   initial ^[0-9a-zA-Z;] with "X". Sort and uniq.

   Extract list of dates from same output. A date is defined as
   <DATE>.*</DATE> Sort and uniq. Examine, then reduce [0-9]
   to 6 to check patterns. (Use .* in date definition, instead of
   [^<]+, in order to find merged quots. and truncated definitions.)

   Extract list of "ends" from same output. An end is defined as 
   .</TITLE>[^<]+</STNCL>  Sort and uniq. Reduce initial character
   to "." or "x" ([0-9a-zA-Z);]). Check spacing. Remove initial character
   and resort/reuniq. Examine.

   OPTIONAL. If enough changes have accumulated, create a script based 
   on errors discovered while surveying these four lists. Run it.
   OTHERWISE, skip to step 4.

4. Extract a list of REFs from the output of step(3). Abstract,
   reduce, sort, uniq, examine. Start by extracting 

   .</TITLE>[^<]*</STNCL>[^<]*</BIBL>

   check the spacing, then systematically reduce (while checking spacing) to
   the ref itself (.*</BIBL>) and reduce the list further by replacing all
   [0-9] with 6. Look for oddballs, create script of replacements, the more
   general (if safe) the better, e.g.

   s#([12]) Cor\. ([0-9][0-9]?\.[0-9]+)</BIBL>#$1 Cor.$2</BIBL>#g;

5. Run ex-stncl.pl, sort and uniq results, examine. Rearrange results in
   this pattern: <TITLE>Acc.Bridgwater (PRO)</TITLE> {date=(1391)} {*}
   Look for discrepancies.

   [to create sortable file, replace <STNCL><DATE>\([^<]+\)</DATE> *\(.*\)
   with \2 {date=\1} and so on--OR just use presort.pl, located in the
   /mecode/perl/merge directory, which does the same thing instantly.]

   
6. Create one last script, "P-mopup.pl" (or sim.), based on errors found
   in step (5).

7. Validate.

8. Add all useful (usually ALL) replacements from the scripts created
   during this stage to st-reg1.pl for use next time, save the new st-reg1.pl.

-------------------------------
I. POST-PROCESSING STAGES (pfs)
-------------------------------

>> Note: scripts for this stage may be found in: \mecode\perl\medI
   !! but check to see that they really apply to your files first !!


1. Replace bracketed quots. by <CIT TYPE="b">

     Run postproc1.pl. This:
       o adds a space after "[;:,]" in "[:;,]?[a-zA-Z]"
       o removes empty <ORTH>s
       o looks for any surviving sense numbers as ORTH
       o moves periods: .</ORTH></FORM> becomes </ORTH>.</FORM>
       o removes spaces at the end of <ETYM>
       o ensures that newlines precede all <EG> <ETYM> <DEF> <CIT> <Q> tags
     
     [Run pre-bra.pl. This may usually be omitted, since all it does is:
       o move any brackets from before to after <DATE>
       o remove spaces from beginning of <DATE>
       o move any "?" from before to after <DATE>
       o move any "-?-" from before to after <DATE>
       o tag stray "-?-" as <DATE>-?-</DATE>
       o move [( from before to after <DATE>
       o move stray sense letters from beginning of <STNCL> into <EG N="">
       o remove stray spaces from beginning of <STNCL> 
         [if this last is all that applies, 
         move relevant lines to one of the other scripts; now moved into postproc1]

     Run brack.pl. This:
       o removes stray newlines inside <CIT></CIT>
       o temporarily removes all newlines inside <CIT></CIT>
       o again moves any brackets before <DATE> into <DATE>
       o changes any <CIT> starting with <DATE>[ into <CIT TYPE="b">,
         whilst removing any concluding ] from that <Q>.
       o reinstates the newline before <Q>
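
     A sketch of the central brack.pl move, for a <CIT> already
     collapsed onto one line (assumed shape, not copied from the
     script):

     #!\apps\Perl\bin\Perl.exe
     undef $/;
     $_ = <>;
     # a <CIT> whose <DATE> begins with [ becomes <CIT TYPE="b">;
     # drop that [ and the ] that closes the <Q>.
     s#<CIT>(<BIBL><STNCL><DATE>)\[([^\n]*?)\] *(</Q></CIT>)#<CIT TYPE="b">$1$2$3#g;
     print;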


2. Turn certain bold comments and explanations in the form section into
   <HI REND="b"> instead of <ORTH>.

   search for <ORTH>[^<]+ ([^<]+</ORTH>
   search for </ORTH> *(
   search for )</ORTH>

   Fix the ones that shouldn't be changed globally (you may add them to
   contract.pl). Check contract.pl: do you really want that "=" added to
   the second global line? Have you removed last time's local changes?
   Have you added new local changes?

   Run contract.pl.

3. Search for <ORTH>[0-9]
   Search for .</ORTH> and ,</ORTH>
   Search for ^Cp  ^See
   Search for </FORM> *Cp
   Search for </FORM>[^<].
   Search for ^[^< ]

[4. defer: Move "Cp. notes" ("see also") out of <ORTH> into a separate <NOTE> field.
    defer: Tag dialect and date labels like this or sim.:

            <LBL DI="SWM" DA="early">early SWM</LBL>

    Note that EF vols. usu. (not always) place labels after form (CD likewise), don't always
    put them in parens, and sometimes bold them. Parts of speech are frequently
    omitted. Parts of speech are almost always omitted in cross references
    (cp. blog (2), not cp. blog n.(2)). "~" is used in the form section,
    often without a clear antecedent/referent, or with a deliberately
    fudged referent (it often seems to stand for a reference to the
    complete forms list of the second element of a compound; hence
    bag-lady, bog ~, beg ~, etc.).
    Cp. notes are often buried between Alsos and Forms: sometimes in
    pars, sometimes not marked by cp. or see also. The range of
    postposited labels is not always clear. The distinction between
    parenthetical and non-parenthetical forms is obscure; pars removed
    by pfs per FMS.

    ---------------------------------------------------------------------------------
    Typical EF problems:

    <ORTH>fadres</ORTH>, (early dat. <ORTH>faderen)</ORTH>.
    , <ORTH>il</ORTH> (N &amp; nEM);
    , <ORTH>euch</ORTH> (mostly W);
    , <ORTH>&aelig;ch</ORTH> (early);
    , <ORTH>a(u)ghtan</ORTH> (if following noun begins with cons.);
    <ORTH>~ ... </ORTH>
    <FORM><ORTH>fl&emacr;&dotb;men(e fremth</ORTH>, 
      <ORTH>fl&emacr;&dotb;men-fremth</ORTH>, <ORTH>fl&emacr;&dotb;menes-fremth</ORTH> 
       phrase &amp; n. Also <ORTH>~ fermth</ORTH>, <ORTH>~ frimth</ORTH>, 
       <ORTH>~ firmth</ORTH>, <ORTH>~ freme</ORTH>, <ORTH>~ firme</ORTH>
    <FORM><ORTH>f&amacr;&breve;der in laue</ORTH>, <ORTH>~ in lei</ORTH>. Also...
    -----------------------------------------------------------------------------------

    defer: Tag <LANG>OF</LANG> etc.
    defer: Tag <POS>n.(2)</POS> etc. 
    defer: Tag <TAX>Canis canis</TAX> etc.          ]


5. Validate.

6. Upload and re-index.

---------------------------------------------------------------------------
J. MERGER PREP
---------------------------------------------------------------------------

>> Note: scripts for this stage may be found in \mecode\perl\merge

For process, see separate file: \merge\reports\mrgsteps.txt (also on the web).