File modified Tuesday, 09-Jan-2001 09:42:52 EST
x-- Find scripts for this process in \mecode\perl\med2mec --x

1. Merge individual files into fascicle-sized files with DOS copy, e.g.,
   "copy w1d.sgm + w2d.sgm + w3d.sgm + w4d.sgm + w5d.sgm w1.med.sgm"

2. Examine file superficially for novel problems. Remove "entry" containing only credits for letter (or fascicle); save in separate file. Run script de-ae.pl to strip out AE-style internal newlines.

3. Run med2mec.pl on the file, e.g.
   "perl med2mec.pl w1.med.sgm > w1.mec.sgm"
   * consider adding to this script:
     o replacement of m-dash with --
     o replacement of n-dash with - [this is now done in the BIBREG stage]
     o replacement of ‘ with `
     o replacement of ’ with '
     o replacement of “ with "
     o replacement of ” with "
     o replacement of … with ..
     o ??removal of nested <DEF> tags.
     o ??removal of <AUTHOR> tags.
   [Adjust beginning and end of file to ensure that it is continuous with contiguous files.]

4. Remember to add all global changes to this script for automatic application to subsequent files.

5. Examine the file superficially again. Search it for <? (i.e., any new processing instructions); confirm that they can safely be replaced with nil, then do it; add replacement(s) to med2mec.pl.

6. Interactively parse the file with PSGMLS/Emacs, being especially attentive to affix entries, which will need a little cleaning up, especially the <P> tags, which must be within <DEF> tags, not vice versa. Validate.
   Some of the affix entries may need more elaborate tagging in order to deal with summary captions leading or preceding the main senses; some may need to be tagged like this:

     <SENSEGRP><DEF>A suffix and combining element:</DEF>
       <SENSE N="1">
         <DEF>As suffix:
           <DEF><P>(a) towards;</P></DEF>
           <DEF><P>(b) away from;</P></DEF>
           <P>From both OE and OF.</P>
         </DEF>
       </SENSE>
       <SENSE N="2">
         <DEF>As combining element:
           <DEF><P>(a) with nouns;</P></DEF>
           <DEF><P>(b) with adjectives.</P></DEF>
         </DEF>
       </SENSE>
     </SENSEGRP>

   You may also find:
     Failure to close </P>
     Extraneous </HI>
     Extraneous </TEXT>
     New character entities (or variant spellings of entities already in place)

   Add new character entities to
     \mecode\dtds\TEIish.dtd
     \mecode\dtds\teidict.dtd
     \mecode\alt-dtds\medsel1.ent copied to \mecode\alt-dtds\medsel1.ent
     \apps\softquad\panoramapub\sdata.map (delete old sdata.mpc) copied to \mecode\catalogs\sdata.map

7. Check for some problems that do not affect parsing, e.g.:
   o Check for sense numbers preserved as literal text, e.g.:
       <SENSE><DEF><P><HI REND="b">1.</HI>
     instead of
       <SENSE N="1"><DEF><P>
   o Check for untagged text that should be part of a <NOTE> element, as well as for untagged text that should be part of the FORM element, which has prematurely closed.
     E.g. look for </FORM>.+ and ^[^<]
     Some examples of prematurely closed FORM sections:
       Also</FORM> <HI REND="b">wak, woke</HI>
       <FORM><ORTH>warantment</ORTH> <POS>n.</POS></FORM> Pl. <HI REND="b">warentmentis.</HI>
       <FORM><ORTH>wā˘rines</ORTH> <POS>n.</POS></FORM>Pl. <HI REND="b">wā˘rinesses.</HI>
       <FORM><ORTH>wā˘r-shēte</ORTH> <POS>n.</POS></FORM>Pl. <HI REND="b">wā˘r(e)shētes.</HI>
       <FORM><ORTH>wā˘r-wrēthe</ORTH> <POS>n.</POS></FORM>Pl. <HI REND="b">wā˘rewrēthes.</HI>
       <FORM><ORTH>wille</ORTH> <POS>adj.</POS> </FORM>Comp. (early)
   o It doesn't hurt to look for "contraction(s" at the same time, which should be tagged like this (also included in another step below):
       (= <HI REND="b">wenest thou</HI>)
       (?= <HI REND="b">wenest thou</HI>)
     Just search for the word "contraction." Ditto "other forms" notes, which should be tagged either as part of FORM (with HI REND="b") or as a separate NOTE. Search for literal strings: "for forms" and "for other forms"
   o Check for <NOTE>s mistagged as part of <FORM>
     E.g. look for </ORTH>[^<]+[Cc]p\. and </POS>[^<]+[Cc]p\.
   o Check for <DEF> tags breaking other than at the real breaking point.
     E.g. look for <DEF>[^<(A-Z]
   o Look for <ORTH> mistagged as <HI REND="b">
     E.g. look for <FORM>.*<HI REND="b"

8a. Extract italic text (HI REND="i"); sort and uniq; look for text that should be in <USG> element. (There will be a lot of italic text that should be in <TITLE> tags, but this will be mostly fixed by later script.)

8b. Run ex-usg.pl, sort and uniq the results, fix oddities manually, but if possible add a generalized form of the change to med2mec.pl so that it is automatically applied to the next fascicle.
   Note that both MED and MEC application of <USG> tags is uncertain and inconsistent. Try to develop a more consistent policy acceptable to Marilyn.
   Currently the "Ppl. xxx as yyy" formula is tagged like this (see W1 and V files for additional variants):
       <USG REND="norm"><HI REND="i">ppl.</HI> <HI REND="b">xxx</HI> as noun</USG>:
   Currently, certain formulas are tagged differently depending on whether part of the formula is italicised. This could use some work. E.g.:
       <USG TYPE="syntax" REND="norm">also refl.</USG>
       —<USG TYPE="syntax" REND="norm">used refl.</USG>
     but also
       <USG TYPE="general">fig.</USG>
       —used <USG TYPE="general">fig.</USG>
     alongside (?):
       <USG TYPE="general" REND="norm">also <HI REND="i">fig.</HI></USG>
       —<USG TYPE="general" REND="norm">used <HI REND="i">fig.</HI></USG>

9. Validate.

10. Run med-etym and med-form and med-def and med-q.pl to tidy up presentation.

11. Re-validate.

12. Normalize with SGMLNORM and search normalized file(s) for &#[^;]+;

13a. Search for untagged features:
   a. Search for untagged <POS>
      Use ex-pos.pl (in perl/medj/) to extract commonest pos regions, or search in files for <ENTRYFREE><FORM><ORTH>[^>]+</ORTH>[^<]+
   b. Search for untagged <LANG> in <ETYM>s.
      First check for missing spaces before HI in ETYM, searching in files for <ETYM>.+[^ *?]<HI . Repair.
      Second, extract untagged languages with cp-lang.pl and cp2-lang.pl. Search out the results in individual files and repair them. Probably better not done globally, since only manual repair will recognize, e.g. "Swed. dial." when script has extracted only the "Swed.". These scripts are not perfect, but they catch all untagged strings in ETYM previously tagged as LANG, so long as they are either preceded by a space and followed by a punctuation mark or space, or preceded and followed by ETYM tags.
   c. Search for untagged <LANG> in <DEF>s.
      Extract the bracketed bits in DEFs using either ex-dfbrk.pl or langdef.pl (which are equivalent). Delete the non-LANG brackets from the resulting text file. Turn the trimmed-down text file into a substitution script with the aid of makesub.pl. Run it on the file(s).

13b. Search for mistagged features; fix LayBrut problem:
   a. Extract <LANG> tags. Examine, repair.
      Check for <LANG> ... / ... </LANG> (e.g. <LANG>AF/ONF</LANG>). We are tagging these as <LANG>AF</LANG>/<LANG>ONF</LANG>
      Search for </LANG> *dial for "dial." designation omitted from LANG tags. Repair.
   b. Fix LayBrut problem.

13c. Revalidate.

14. Do standard QC:
   Upload latest version of file(s) to DEV.
   Upload latest version of TEIish and teidict.dtd if necessary.
   Extract regions with QC-online and check for inconsistencies.
   Make corrections concurrently in Emacs.
   Add global changes to one of the preceding scripts, as relevant.
   IF ONLINE QC IS UNAVAILABLE:
   Extract italics and examine (<HI REND="i">)
   Extract bold text and examine (<HI REND="b">)
   Extract ORTH elements and examine

15. Do standard bib regularization, watching especially the regions </TITLE>[^<]+</STNCL> and <AUTHOR>[^<]+</AUTHOR>
   Viz.:
   a. Make sure that st-reg1.pl (in \mecode\perl\medH) contains all the changes from the most recent bibreg process. Run it.
   b. Make sure that Regbib.pl (in \mecode\perl\med2mec) contains all the most recent 'maintain' scripts. Run it.
   c. Extract regions and prepare scripts of necessary changes, following the procedure in MED STEPS, stage "H".
   [d. It doesn't hurt to do two versions of at least parts of "c": one version on the results of "a"; one version on the results of "a" and "b". Compare them, look for discrepancies indicative of problems in the second script (or in both of them!).]
   e. Extract the OD col. stencils from the list of stencils produced at the end of the bib reg process. Regularize these to something approaching MED standard (moving refs to BIBL unless it is a manuscript; adding asterisks for unpublished items; matching other OD-col regularization previously done in order to produce mappable stencils, or at least consistent ones; renaming stencils to match existing non-OD-col stencils if appropriate).
   f. Don't forget to add all your scripts to the main bib reg script, for automatic application next time.

16. Look for bracketed quots.; remove brackets in favor of <CIT TYPE="b"> [See MED STEPS stage "I" for details.]
   NOTE: this step is now more easily done while doing the DATE part of bib regularization.

17. Look for parenthetical notes (espec. re: contractions) in <FORM> section. [See MED STEPS stage "I" for details.]
   NOTE: much of this step may already have been done while doing QC on the FORM section.
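
The punctuation replacements proposed for med2mec.pl under step 3 could be sketched roughly as follows. This is a Python illustration only: med2mec.pl itself is a Perl script whose contents are not shown here, and the names REPLACEMENTS and normalize_punctuation are hypothetical. The sketch also assumes that the characters occur as literal Unicode; if the files carry SGML entity references instead, the left-hand strings would need to be changed accordingly. The n-dash is left out because step 3 notes that it is now handled in the BIBREG stage.

```python
# Hypothetical sketch of the step 3 punctuation replacements.
# Assumption: the characters appear as literal Unicode, not as SGML entities.
REPLACEMENTS = [
    ("\u2014", "--"),  # m-dash -> two hyphens
    ("\u2018", "`"),   # left single quotation mark -> backtick
    ("\u2019", "'"),   # right single quotation mark -> apostrophe
    ("\u201c", '"'),   # left double quotation mark -> straight quote
    ("\u201d", '"'),   # right double quotation mark -> straight quote
    ("\u2026", ".."),  # ellipsis -> two dots, as listed in step 3
]


def normalize_punctuation(text):
    """Apply each replacement in turn and return the normalized text."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text
```

Keeping the pairs in a single table makes it easy to add each new global change in one place, in the spirit of step 4.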
Future changes to MEC files
---------------------------
1. Take "Cp." notes out of <FORM> and tag them separately as <NOTE>, change the <ORTH> in <NOTE> to <HI REND="b">.
2. Consider multi-tagging/nesting <DEF> elements.
3. Add <AUTHOR> tags (as part of BIB MERGER)
4. Add <POS> tags.
5. Add <LANG> tags.
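
Several of the checks in the procedure (the searches in steps 7 and 12, including the search for "Cp." notes inside FORM that item 1 above would eventually make unnecessary) are plain regular-expression searches. A minimal Python sketch of running them in one pass is given here. The patterns are copied from the steps; the labels, the CHECKS table and the scan function are illustrative names, not existing scripts, and the sketch assumes the file is examined one line at a time, as in an editor search.

```python
import re

# Patterns taken from steps 7 and 12; the labels are hypothetical.
CHECKS = {
    "text after </FORM>": r"</FORM>.+",
    "line starts with untagged text": r"^[^<]",
    "Cp. note after </ORTH>": r"</ORTH>[^<]+[Cc]p\.",
    "Cp. note after </POS>": r"</POS>[^<]+[Cc]p\.",
    "suspicious <DEF> break": r"<DEF>[^<(A-Z]",
    "possible <ORTH> tagged as bold": r'<FORM>.*<HI REND="b"',
    "numeric character reference": r"&#[^;]+;",
}


def scan(lines):
    """Return (line number, label) pairs for every pattern that matches."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        for label, pattern in CHECKS.items():
            if re.search(pattern, line):
                hits.append((lineno, label))
    return hits
```

Called as scan(open("w1.mec.sgm", encoding="utf-8")), this would list candidate lines for manual inspection; the encoding is an assumption, and every hit still needs to be reviewed by eye, since the patterns only flag likely problems.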