A user-designated synonym for a Unix command or sequence of commands. Differs
from a variable in that its value does
not change: e.g., if you designate m to be your
alias for mailx, then typing
m will always run this mail program. Differs from a
script in that scripts are normally
stored in executable files,
while aliases are loaded as part of the shell environment directly (and are thus
simpler and faster). Aliases are a facility provided by the C-shell
(csh) and its successors, like tcsh.
Of ASCII characters, any string composed of only upper- or
lower-case English letters or Arabic numerals.
Downloading files from a
public-access Internet machine, i.e., one which allows a remote user to
log in as "anonymous" and transfer files using ftp even if the user does not
have an account on the machine.
An Internet search facility that searches through directory and file names (and in some instances through
file descriptions) in order to determine whether a particular string is present. If you ask an
Archie server to find the string "phone" it will return the names of
files that include this word, whether it refers to a sound or a
As in mathematical or logical usage, specifying a value to be
operated on by a function or other command. By default, this is usually
interpreted as a file name. In the
command cat message, the argument is
message, which is subcategorized as a file name by
The Air Travel Information System
evaluations were a series of evaluations of speech recognition and
spoken language understanding systems sponsored by DARPA. These evaluations began in 1990
and ended in 1995. They are responsible for the development of a corpus of approximately 20,000
utterances regarding air travel, grouped by speaker, session, and data
collection site. The ATIS corpus is distributed by the Linguistic Data Consortium.
The American Standard Code for
Information Interchange is a standard character set that maps character codes 0 through 127
(low ASCII) onto control
functions, punctuation marks, digits, upper case letters, lower case
letters, and other symbols.
A data file, typically a text
file with hard line breaks, that contains only character codes in the numeric
range 0 to 127 (low ASCII), and interprets them according
to the ASCII standard.
The unstandardized highest half (character codes #128-#255) of
the 256 characters in ASCII. While low ASCII is standard
worldwide, high ASCII characters vary from one hardware
platform to another, or even from one software program to another.
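To illustrate how high ASCII varies, the following minimal sketch decodes a single byte above 127 under three 8-bit character sets. The codec names (latin-1, cp437, mac_roman) are Python's own and are chosen here only as examples; they are assumptions for illustration and are not named in this glossary.

```python
# One "high ASCII" byte, interpreted under three different 8-bit character sets.
# The choice of codecs is an assumption made for illustration.
HIGH_BYTE = bytes([0xE9])  # character code 233, outside low ASCII (0-127)

decoded = {codec: HIGH_BYTE.decode(codec)
           for codec in ("latin-1", "cp437", "mac_roman")}

for codec, char in decoded.items():
    print(f"{codec:10s} -> {char!r}")

# A low ASCII byte, by contrast, decodes identically under all three.
low = {codec: b"A".decode(codec) for codec in decoded}
```

The same byte yields a different character under each codec, while the low ASCII byte for "A" comes out the same everywhere.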
In SGML, a qualifier within the
opening tag for an element which specifies a value for
some named property of that element. In an object-oriented
database, a named property of an object which not only holds information
about a particular instance of an object, but also encapsulates behavior
(such as integrity constraints and a default value) that is true of all
instances of the class of objects.
A facility allowing indirect reference in Unix commands, by using the output of one
command, enclosed within backquote characters (`, ASCII #96), as an argument to another command. For
instance, in the command finger `whoami`, first the
whoami program is run, returning the login id of the
user; this is in turn passed as the argument for the command
finger, which returns information about a user.
Running a computer program without any interaction with the process
as it goes along. Sometimes called background processing.
binary, octal, decimal, hexadecimal:
Four common arithmetic bases (2, 8, 10, and 16, respectively)
widely used in computing. Computers use binary numbers internally, and
octal and hexadecimal numbers are easily converted to binary (and vice
versa). Decimal numbers are the norm in text, as usual; binary numbers,
consisting of only 0 and 1, are easily
recognized; octal numbers (now obsolete) use only the decimal digits
[0-7]; hexadecimal (also called hex)
numbers contain the normal decimal digits [0-9], and add
[A-F] to represent ten through fifteen as single
"digits". These "digits" are pronounced as (English orthographic)
letters, rather than extending conventional morphology; i.e., hex "A5" is
pronounced "A-five", not "*eleventy-five".
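The conversions described above can be tried out with a short sketch. It uses only Python built-ins; the value "A5" is taken from the entry, and the variable names are arbitrary.

```python
# The same value written in the four bases.
value = int("A5", 16)            # parse hexadecimal "A5"

as_binary = format(value, "b")   # base 2
as_octal = format(value, "o")    # base 8
as_decimal = format(value, "d")  # base 10
as_hex = format(value, "X")      # base 16, upper-case digits

# Each hex digit corresponds to exactly four binary digits, which is why
# converting between the two is easy: A -> 1010, 5 -> 0101.
nibbles = [format(int(digit, 16), "04b") for digit in as_hex]

print(as_binary, as_octal, as_decimal, as_hex, nibbles)
```

Joining the four-bit groups gives back the binary form, which is the sense in which hexadecimal is a compact spelling of binary.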
More accurately BinHex 4.0. The standard Macintosh format used
when a binary file must be converted
into an ASCII file so that it may be
safely transferred through a network. It preserves the data fork and
resource fork which all Macintosh files must have. Do not confuse it with
BinHex 5.0, which is not an ASCII format. All BinHex files should by
convention carry the extension .hqx.
Related terms for small units of information. Bit is
an acronym for binary
digit, the smallest possible unit of information: i.e., a single
yes or no (1 or 0), in context. A byte is a unit
consisting of eight bits, in order. There are 2^8 (= 256) possible bytes (combinations of 0 and 1),
and thus 256 possible characters
in ASCII, each with a unique
byte value. Computer memory is normally specified in kilobytes,
megabytes, and gigabytes.
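The arithmetic behind bits and bytes can be checked with a few lines. This is only a sketch; the character "A" is an arbitrary example.

```python
# A byte is eight bits, so there are 2**8 distinct byte values.
BITS_PER_BYTE = 8
possible_bytes = 2 ** BITS_PER_BYTE

# The character "A" has character code 65; shown here as its eight bits.
code = ord("A")
bits = format(code, "08b")

# Going the other way: read the eight bits back as a number, then a character.
round_trip = chr(int(bits, 2))

print(possible_bytes, code, bits, round_trip)
```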
"black box" evaluation:
The evaluation of a complex system by examining only inputs to the
system and outputs from the system, ignoring intermediate results and
browser, or web browser:
A piece of software which retrieves and displays World Wide Web files. It acts as an
interface to Internet protocols
like ftp and http. Common browsers include Netscape,
Internet Explorer, and Mosaic.
Two competing dialects of Unix. BSD is an acronym for
Berkeley Software Distribution, an academic version
developed at the University of California at Berkeley.
SysV stands for System V, a commercial
version originally developed by AT&T. The two systems are
incompatible in some ways, though they are converging in the latest
A character to which an overstriking diacritic is added.
The minimal unit of encoding
for text files. A character typically
corresponds to a single graphic sign, like a letter of the alphabet or a
A numerical code in a data file
which represents a particular character in text.
The full set of character codes used for
encoding a particular language.
A method of text encoding
used by the Oxford Concordance Program
and other software.
The sorting order for all the characters in a character
A unit in a linguistic (i.e., written-language-based) interface to a
computer program or operating system; Unix and DOS have
command-line interfaces, in which the user types commands
which are then executed. Command-line systems are powerful but complex;
they can be added to and customized. They are the earlier of the two
principal user interfaces (the other is the Graphic User
Interface, or GUI).
A single character which is
a composite of two or more other characters. For instance,
à is a composite of a (the base character) and ` (a diacritic).
A list of words, normally in alphabetical order, where each
occurrence of each word is shown with surrounding context and identified
by a reference indicating where it occurs in the text.
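A concordance of the kind just described can be sketched in a few lines. The function name, the bracket notation for the keyword, and the use of line numbers as references are all assumptions made for illustration; real concordance programs offer far richer options.

```python
import re

def concordance(lines, width=20):
    """Return (word, reference, context) triples sorted alphabetically by word.

    The reference used here is simply the 1-based line number.
    """
    entries = []
    for line_no, line in enumerate(lines, start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            start, end = match.span()
            left = line[max(0, start - width):start]
            right = line[end:end + width]
            context = f"{left}[{match.group()}]{right}"
            entries.append((match.group().lower(), line_no, context))
    return sorted(entries)

sample = ["the cat sat", "on the mat"]
for word, ref, context in concordance(sample):
    print(f"{word:5s} {ref}  {context}")
```

Every occurrence gets its own row, so a word that appears twice (here "the") is listed twice, each time with its own context and reference.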
control character, control-shift, Ctrl:
The most common and most standard of the ASCII metacharacters. ASCII keyboards
contain a Shift key, which produces
upper-case characters (# 41H
through 5AH) when pressed, instead of lower-case (# 61H through 7AH). The
Control-Shift key, by analogy, produces Control characters (# 01H
through 1AH). These are non-printing and in principle have standard
uses, though in practice they vary greatly. They are often represented
by prefixing caret (^, ASCII #94) to the appropriate alphabetic
character; thus ^M represents CR or Carriage Return,
sent by the Return key on all keyboards, and by the Enter
key on most.
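The relation between a letter and its control character is a fixed offset in the code table, as this small sketch shows. The helper names are hypothetical.

```python
# Control characters occupy codes 0x01-0x1A, exactly 0x40 below the
# upper-case letters 0x41-0x5A (and 0x60 below lower-case 0x61-0x7A).
def control(letter):
    """Return the control character written ^letter, e.g. control('M') for ^M."""
    return chr(ord(letter.upper()) - 0x40)

def caret_notation(char):
    """Return the ^X spelling of a control character in the range 0x01-0x1A."""
    return "^" + chr(ord(char) + 0x40)

carriage_return = control("M")
print(repr(carriage_return), caret_notation("\r"), caret_notation("\t"))
```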
daemon (less commonly demon):
A pre-activated program that is always ready to perform its task
(as opposed to one that must be called by the system activation software
in response to a specific need). Web
server programs are usually run as daemons, for
The Defense Advanced Research Projects
Agency, a branch of the United States Department of Defense
responsible for a wide range of research and applications development,
and a long-time funder of research in language processing. For a number
of years, in the late 1980s and early 1990s, this organization was known
as ARPA. Its Web site is http://www.darpa.mil/.
A small mark (such as an accent mark) added above, below, before,
or after a base character to modify its pronunciation or
An electronic representation of a page of text or other material
which is a picture of the page, rather than a transcription of the text.
A fax is a digital image, for instance, while the wordprocessor file
that produces the page that is faxed is not, since it can be searched.
A collection of files that are
notionally "in" the same "place." Every Unix user has a
home directory, in which one's files may be stored; it
usually has the same name as the login id of the user, and may be
referenced as $HOME or by the tilde convention
(~ is $HOME, ~jlawler is
jlawler's home directory). At any time in a Unix session, a user has a
current directory, which may be changed with the
cd command. Usually called folder in GUI systems.
Domain Name Server. An Internet machine that
knows the names and IP addresses
of other machines in its subnet. When you attempt to connect to the
Internet, your request goes to a DNS, which translates an address like
emunix.emich.edu into an IP number like
22.214.171.124 and forwards your connection request to that
In computational linguistics and artificial intelligence, a
symbolic representation of the objects and relationships in a particular
segment (domain) of the world.
In Unix, special ASCII
files placed in one's home directory to control various
programs and set customized parameters. Their names begin with period
("dot", ASCII # 46) and are by default not shown by the
ls program. Examples are .cshrc, which
contains commands and definitions for the csh shell; .newsrc, for
customizing news readers like
trn; and .login, which contains commands executed once at the
beginning of each Unix session.
Document Type Definition, the definition of the
markup rules for an SGML document.
A program that allows one to create, modify, and save text files. Virtually all popular editors
(pico, emacs, vi) on Unix
are screen editors, like wordprocessors. Early Unix line
editors (ed, ex) operate with commands instead of direct typing;
i.e., to correct a mistake like fase, you might enter
the command replace s with t, rather than just
overstriking the s with t.
In an SGML file, a single entity
delimited by a start tag and an end
tag. For instance, a title element might be delimited by
<title> and </title>.
The manner in which information is represented in computer data files. Character encoding refers
specifically to the codes used to represent characters. Text encoding refers
specifically to the way in which the structural information in text is
In SGML, a named part of a marked up document. An entity can be
used for a string of characters or a whole file of text. Special characters (like
"Ê") are normally represented by entities (like "&Ecirc;") in
An ASCII control or
metacharacter (#27, ^[) with
its own key on most keyboards, intended
originally to signify escape (v)
(sense 1). While it has been put to a number of different uses over the
decades, it is still often used to pause or terminate a program or
process. Frequently called Meta in some programs, notably
emacs, where it is a common command prefix.
- In general, a collection of electronic text, usually compiled on a
principled or systematic basis for the purposes of linguistic and other
- In computational linguistics, a body of linguistic data, either text
or speech, intended to support the study of linguistic phenomena. This
data may be annotated in some way to enhance its usefulness. Examples of
corpora include the Penn
TreeBank and the ATIS corpus.
A file name that can be used as a
command, consisting either of a script of commands to be executed by
typing the name, or of true compiled binary program code. In the
latter sense (also called binaries), the
executable(s) is sometimes used to distinguish compiled binary code
from its human-readable programming-language source: "He gave me the
executable, but I needed the source files."
In a database, a subdivision of a record which stores information of a
A collection of information encoded in computer-readable form
(normally in bytes) and associated with
a single name by which the computer's operating
system stores and retrieves it.
A type of program especially common in Unix in which a file or
other data stream (by default, the standard input) is read
serially, modified in some regular way, and sent (in modified form) to
some other file or stream (by default, the standard
output), without any change to the original data source. There
are many languages for creating simple text filters in Unix, like
sed, awk, and
Synonym for directory
(metaphorically, a place to put files), used in Macintosh, NeXT, Windows
95, and some other Graphic User Interfaces. See GUI.
A collection of bitmaps or outlines which supply the graphic rendering of every character in a character set.
A subcomponent of an operating
system which gives all programs and data files access to multiple fonts for rendering characters.
The encoding scheme, often proprietary, in which the
information in a file is marked up.
Wordprocessing files created by different software are usually
incompatible in format to some extent. To read one program's files using
a different program requires format translation, which may
be built into a full-featured wordprocessor, but is often a separate
step requiring separate software. Many formats are in use; a frequent
feature of upgrade versions of popular microcomputer software is a
different (and usually incompatible) standard file format, and there are
different standards and versions for different countries and languages.
In a concordance or
similar program, a table showing how many words occur once, twice,
three times, etc., up to the most frequent word.
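Such a table can be built by counting twice: first occurrences per word, then words per occurrence count. The following is a minimal sketch; the function name and sample text are invented for illustration.

```python
from collections import Counter

def frequency_profile(words):
    """Map each frequency to the number of distinct words that occur that often."""
    word_counts = Counter(words)               # word -> number of occurrences
    profile = Counter(word_counts.values())    # frequency -> number of words
    return dict(sorted(profile.items()))

text = "the cat and the dog and the bird"
profile = frequency_profile(text.split())
print(profile)
```

Here three words occur once, one word occurs twice, and one word occurs three times.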
Internet File Transfer Protocol, a way of sending files from
one Internet machine to another.
The discipline of using markup
codes in a text to describe the function or purpose of the elements in the text, rather than
the shape or form of a printed or displayed character, as opposed to a pairing
of form and interpretation.
An Internet search facility, which allows the user to search
through a hierarchically organized set of menus in order to find a
particular file. Gopher menus
categorize files according to content (e.g., "libraries," "phonebooks"),
as determined by a human being, not a computer.
A Graphic User Interface is one invoking
visual rather than linguistic metaphors, often employing menus, non-text
input devices like a mouse or trackball, and icons employing visual
symbolism and metaphor, like a desktop with paper files on it.
Contrasts with command-line
Hidden Markov Model (HMM):
A Hidden Markov Model is a statistical model of the distribution of
"hidden" features, such as phonemes or part-of-speech tags, based on
observable features, such as acoustic segments, or words. The
computational models can be automatically trained from data samples, and
then used to recognize the "hidden" layer, based on the statistical
model derived from the training data.
A word which has the same spelling but different meanings, e.g.
lead as a verb "to lead" and as two different nouns: "a leash",
and the metal.
Hypertext Markup Language is a method of marking a document that is to be
displayed by a web browser. An
application of SGML, it consists primarily
of formatting tags, like <b> for boldface and <i> for italic.
Hypertext transfer protocol. A way of sending hypertext
documents over the Internet.
A non-linear version of text presentation with embedded links to other information. The basis of
the World Wide Web and of the Internet
protocols employed on the Web.
In corpus-based linguistics, an
annotation produced by an
annotation procedure which can be checked against an annotation key.
A list of words, normally in alphabetical order, where each word is
accompanied by a list of references indicating where that word occurs in
the text. Sometimes also called a word index.
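A word index differs from a concordance in that it gives references only, with no context. The sketch below assumes that a reference is just a line number; the function name is hypothetical.

```python
import re
from collections import defaultdict

def word_index(lines):
    """Map each word to the sorted list of line numbers on which it occurs."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z]+", line.lower()):
            index[word].add(line_no)
    # Alphabetical order, as in a printed index.
    return {word: sorted(refs) for word, refs in sorted(index.items())}

index = word_index(["the cat sat", "on the mat", "the end"])
for word, refs in index.items():
    print(word, refs)
```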
In computational linguistics, the process by which information in a
form suitable for entry into a database is generated automatically from
input-output (I/O) redirection:
Process (and capability) allowing a program (typically a filter program) to take its input from
some other program, and/or send its output to another. A characteristic
feature of Unix, much copied in other operating systems. The control
structure implementing this is called a pipe, and the vertical
bar ("|") symbol is used in the Unix command
line to represent this.
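The idea of chaining filters can be imitated inside a single program. The sketch below is an analogy only, not the Unix mechanism itself: Python generators stand in for filter programs, and a hypothetical pipe() helper stands in for the "|" symbol.

```python
def lowercase(stream):
    """Filter: pass each line along in lower case."""
    for line in stream:
        yield line.lower()

def only_containing(stream, text):
    """Filter: pass along only the lines that contain `text`."""
    for line in stream:
        if text in line:
            yield line

def pipe(source, *filters):
    """Connect filters so that each one reads the output of the one before it."""
    stream = source
    for f in filters:
        stream = f(stream)
    return stream

source = ["Phone book", "Address list", "PHONE numbers"]
result = list(pipe(source, lowercase, lambda s: only_containing(s, "phone")))
print(result)
```

As with real filters, the original data source is left unchanged; only the stream passing through is modified.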
The process of searching or querying a text and getting an instant
response. The query is performed on an index which has been built previously.
IP number, IP address:
A four-part number which uniquely identifies an Internet machine,
giving the net and subnet to which it belongs. The IP number
126.96.36.199, for example, designates the Domain Name Server of the
University of Michigan (at press time -- IP numbers are subject to
change without notice), and tells us that it is part of net 35 and
subnet 1. Part of the Internet Protocol.
- To pause a running program
and return control temporarily to the operating system, usually in order
to run some other program. In Unix, the exclamation point (ASCII #33,
!, pronounced "bang") is an escape character that can be used in most
programs to accomplish this.
- To cancel the default (meta-)interpretation of the
following character in a string and interpret it literally
instead. Thus, while the unescaped (meta)expression
"." matches any character, the regular expression
"\." matches a literal period or full stop character
only, because it is escaped by the preceding "\".
In regular expressions, the use of
asterisk (*, ASCII #42) as a special character to
indicate "any number of" the preceding character (including zero, or
"none of"). Combined with the use of the special character
dot (i.e., period, ASCII #46) to represent "any character",
the regular expression idiom ".*" represents "any
string". Named after the logician Stephen Kleene.
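The behavior of the asterisk can be seen in a short sketch using Python's re module, whose syntax for "*" and "." agrees with the description above. The test strings are invented.

```python
import re

# "ab*c": an "a", then any number of "b" (including none), then a "c".
pattern = re.compile(r"ab*c")
matches = [s for s in ["ac", "abc", "abbbc", "adc"] if pattern.fullmatch(s)]

# ".*" stands for "any string", so "c.*t" matches anything from "c" to "t".
any_string = re.compile(r"c.*t")
spans = [s for s in ["ct", "cat", "carrot", "dog"] if any_string.fullmatch(s)]

print(matches, spans)
```

Note that "ac" and "ct" both match: zero repetitions count as "any number of".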
The process of putting words under their dictionary headings, for
example, "go", "going", "gone", "went", under "go".
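In its simplest form this grouping is a table lookup. The sketch below uses a toy table containing only the forms from the example; actual lemmatizers rely on large lexicons and morphological rules, which are not shown here.

```python
# A toy lookup table holding only the example forms of "go".
LEMMAS = {"go": "go", "going": "go", "gone": "go", "went": "go"}

def lemmatize(word):
    """Return the dictionary heading for a word, or the word itself if unknown."""
    return LEMMAS.get(word.lower(), word.lower())

def group_by_lemma(words):
    """Collect word forms under their dictionary headings."""
    groups = {}
    for word in words:
        groups.setdefault(lemmatize(word), []).append(word)
    return groups

groups = group_by_lemma(["went", "going", "gone", "go", "walked"])
print(groups)
```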
A unit of organization in a text file including all the characters up to and including the
line end character (either carriage return, line feed, or both,
depending on operating system).
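The three line-end conventions can be compared directly. This sketch relies on Python's str.splitlines(), which recognizes all three; the sample text is arbitrary.

```python
# The same three lines of text with three different line-end conventions.
lf_text = "one\ntwo\nthree\n"          # line feed
cr_text = "one\rtwo\rthree\r"          # carriage return
crlf_text = "one\r\ntwo\r\nthree\r\n"  # carriage return + line feed

# str.splitlines() recognizes all three conventions.
lines = [text.splitlines() for text in (lf_text, cr_text, crlf_text)]

# keepends=True keeps the line-end character(s) as part of each line.
crlf_lines_with_ends = crlf_text.splitlines(keepends=True)
print(lines, crlf_lines_with_ends)
```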
An embedded URL in a hypertext document. Links are
created in HTML using the <a
..> "anchor" tag, and are
displayed in a browser as
emphasized text (usually blue and underlined). When one clicks on a
link, the browser requests the file
and displays it.
Linguistic Data Consortium:
The LDC is an open consortium of universities,
companies and government research laboratories which creates, collects
and distributes speech and text databases, lexicons, and other resources
for research and development in computational linguistics. It is hosted
at the University of Pennsylvania. Its Web site is http://www.ldc.upenn.edu.
A programmed repetition of a set of instructions, typically with
incrementation of some index value. The instructions will then be
repeated on each member of the indexed set of values. Implemented by
the for, while, or do structures in many computer
In computational linguistics and artificial intelligence, a set of
techniques which allow a computer program to improve its performance
iteratively on a chosen task. See training corpus.
Codes added to the stream of an
encoded text to signal structure,
formatting, or processing commands.
A character or (shift-)key to be interpreted as modifying the
value of the character (or key) following it in a string (or produced simultaneously in
typing), either by prefixing a special character ("^X-Q
terminates the program"), or by interpreting it literally, thus
escaping the default special
interpretation of the following character.
Multipurpose Internet Mail Extensions.
A way of sending files of different
types (e.g., graphics, sound, or word-processor files) via email without
converting them into ASCII, or plain
text. None of the original information will be lost, and, if the
recipient has a MIME-compliant mailer program, it will call up the
proper program needed to display or play the files.
The Message Understanding Conference refers to
a series of evaluations of text-based language processing systems
sponsored by DARPA. These conferences
are responsible for a series of corpora covering increasingly difficult
tasks and subtasks.
Two independent characteristics of desirable operating systems, both found in Unix. A
multi-user system is one that allows several users to run commands simultaneously without having
to take turns. A multi-processing system is one that allows any user to
run several commands simultaneously without having to wait until each
is done (serial processing). Multi-processing is also called
An Internet utility that allows users to download (notionally,
"read") "articles" posted to "newsgroups" by other users interested in
the topic the newsgroup was formed to discuss. Also called "Usenet".
The newsgroup sci.lang, for
example, is dedicated to discussing the science of language. To read
news, you need a news client like
trn and access to a news server, such as those established at
The process of organizing a database in such a way that no piece of
information occurs more than once in the database.
The fundamental unit of information modeling in the
object-oriented paradigm. In principle, there is a
one-to-one correspondence between notional 'objects' in the data model
and the actual entities in the real world which are being modeled. (This
is not true, in general, of the data structures of conventional
programming languages or database systems, and is less true in practice
than in theory of official object-oriented languages and databases.) An
object stores state information (like the field values of a database record; notionally nouns) and it stores
behavioral information (called methods; notionally verbs) about
what computations can be performed on an instance of the object. The
information stored in an object is encapsulated in that it is not
visible directly; it can only be seen by sending a message to the object
which asks it to perform one of its methods.
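A minimal sketch of an object with state and a method follows. The class and its fields are invented for illustration. Note that Python enforces encapsulation only by convention (the leading underscore), so this approximates the strict hiding described above rather than implementing it.

```python
class Record:
    """An object with encapsulated state and a method that acts on it."""

    def __init__(self, headword, definition):
        # State information (notionally nouns). The leading underscore is
        # Python's convention for "not to be accessed directly".
        self._headword = headword
        self._definition = definition

    def describe(self):
        """A method (notionally a verb): what the object can be asked to do."""
        return f"{self._headword}: {self._definition}"

entry = Record("byte", "a unit consisting of eight bits")
message_result = entry.describe()   # "send a message" asking for a method
print(message_result)
```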
A modern paradigm of programming which models information in terms
of objects. Computation occurs when one object receives a
message from another asking it to perform one of its methods,
i.e., special subroutines subcategorized for each type of object. The
object-oriented approach, in which the data and the program behavior are
encapsulated in the objects, contrasts with the conventional approach to
programming, in which a monolithic program operates on data which is
completely separate. Object-oriented programming is more amenable to
A database system which models entities in the real world as
objects and follows the object-oriented
paradigm of programming.
Of software, especially an operating system,
signifying that it conforms to a well-known internal architecture and
set of standards, or that it is not restricted to use on a single brand
of computer, or that it is manufactured and maintained by many vendors,
or some combination of these. Contrasts with proprietary.
operating system (OS):
The basic software that runs a computer, managing all other
software and apportioning computing resources to avoid conflicts.
Windows, DOS, and Unix are examples of operating systems.
optical character recognition (OCR):
A method of creating electronic text by automatically analyzing a
digital image of a page of
text and converting the characters on that page to ASCII text.
Oxford Concordance Program (OCP):
A flexible batch processing program for generating concordances, word lists and indexes from many kinds of texts.
A letter or other character
that does not affect the sorting of words.
A text corpus containing the
same text in multiple languages. Such corpora are used for training corpus-based machine
translation systems, for example. The Rosetta Stone is an example of a
part-of-speech (POS) tagging:
The process of assigning lexical categories (that is,
part-of-speech tags) to words in
linguistic data. This process can be performed automatically with a
high degree of accuracy (above 95% in English) without reference to
higher-level linguistic information such as syntactic structure.
- An individual button on a keyboard; by extension, the character(s) or command(s) it signals.
- In searching, a synonym for search
- In indexing or database
management, the most important field,
in the sense that it uniquely identifies an item.
- In corpus-based linguistics, a
benchmark against which the accuracy of an annotation procedure can be
The Penn Treebank is a corpus of Wall Street Journal documents annotated with
part-of-speech and bracketing information, distributed by the Linguistic
Data Consortium. The Penn Treebank also includes a bracketed version of
the Brown Corpus. Its web site is http://www.cis.upenn.edu/~treebank.
Point-to-point protocol. A way of accessing the
Internet which allows your home machine to act as if it were, itself, an
Internet machine. PPP, for example, allows you to retrieve and display
Internet graphics files. If you access the Internet through a serial
line (formerly the most common type of modem connection), you cannot
use a graphical browser.
In information retrieval or corpus-based linguistics, the number of
answers in an answer set hypothesis which are also in the
answer key, divided by the size of the answer set hypothesis.
Of software, especially an operating
system, signifying that it is manufactured and maintained by
only one vendor, or that it is the only type usable on a particular
computer, or that it does not conform to a widely-accepted standard, or
that its details are secret, or some combination of these. Contrasts
An agreed-upon way of doing things. Internet protocols have been
established for such actions as transmission of information packets (tcp), file transfer
(ftp), and hypertext transfer (http). Any machine which does things according
to these protocols can be a part of the Internet.
In information retrieval or corpus-based linguistics, the number of
answers in an answer set hypothesis which are also in the
answer key, divided by the size of the answer key.
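The two measures can be computed side by side. The sketch assumes the usual definitions: both count the answers shared by the hypothesis and the key, with precision dividing by the size of the hypothesis and recall dividing by the size of the key. The data are hypothetical.

```python
def precision_recall(hypothesis, key):
    """Return (precision, recall) for an answer set hypothesis and an answer key."""
    hypothesis, key = set(hypothesis), set(key)
    correct = len(hypothesis & key)          # answers found in both sets
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall

# Hypothetical data: the system proposed four answers, three of them correct,
# and the key contains six answers.
hypothesis = {"a", "b", "c", "x"}
key = {"a", "b", "c", "d", "e", "f"}
precision, recall = precision_recall(hypothesis, key)
print(precision, recall)
```

A system can score high on one measure and low on the other, which is why the two are normally reported together.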
In a database, a collection of information about a single entity.
regular expression (RE):
A formal syntactic specification widely implemented in the Unix
language family for reference to strings. For example, the regular
expression denoting a string of alphanumerics (i.e., letters or
numbers) is [A-Za-z0-9]*
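The expression above can be tested with Python's re module, which accepts the same bracket and asterisk syntax. The helper name and test strings are assumptions for illustration.

```python
import re

ALPHANUMERIC = re.compile(r"[A-Za-z0-9]*")

def is_alphanumeric(text):
    """True if the whole string consists only of letters and digits."""
    return ALPHANUMERIC.fullmatch(text) is not None

results = {s: is_alphanumeric(s) for s in ["emunix", "Windows95", "sci.lang", ""]}
print(results)
```

Because "*" allows zero repetitions, the empty string also matches; replacing "*" with "+" would require at least one character.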
The process of converting a
stream of encoded
characters to their correct graphic appearance on a terminal
reverse alphabetical order:
Sorting of words by their endings so that, for example, a word list in reverse alphabetical
order begins with words ending in -a. A word list in reverse
alphabetic order is also called a speculum.
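Sorting by endings amounts to sorting on each word spelled backwards, as in this minimal sketch. The word list is arbitrary.

```python
def reverse_alphabetical(words):
    """Sort words by their spelling read from the last letter to the first."""
    return sorted(words, key=lambda word: word[::-1])

words = ["data", "corpus", "tag", "schema", "index", "byte"]
speculum = reverse_alphabetical(words)
print(speculum)
```

Words ending in -a come first, and words sharing an ending (here "schema" and "data") are ordered by the letter before it.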
An Internet machine whose specialized job is finding paths along
the net for information packets. It looks for functional, uncongested
paths to destinations, and sends data along them.
Rich Text Format is a special interchange file
format that can be created and
read by most popular wordprocessors. RTF preserves most formatting
information, and graphics. Since they use only low
ASCII, RTF documents can be usefully transmitted by
The process of creating a digital
image of a page of text or other material. This term is
sometimes also used for optical character
A collection of Unix commands,
structured together as a program and stored as an
executable file. The commands in a script are
interpreted by the shell (normally
sh) and treated as if they were entered in order by the
user at the command line.
Software that forms part of a server/client pair.
Typically, a server resides on a central machine and, when it is
contacted by the client software on a user's machine, sends a particular
type of information. Web servers, for example, send hypertext documents; news servers
send articles posted to newsgroups.
Standard Generalized Markup Language is a
method for generalized
markup that has been adopted by ISO (the International
Organization for Standardization) and is consequently gaining widespread
use in the world of computing.
A shareware Unix and DOS program for validating SGML
A kind of tool program that
parses, interprets, and executes commands, either interactively from
the keyboard, or as a script. DOS
uses a shell called COMMAND.COM; there are several
shells available in Unix: the most common are the original Bourne shell
(sh), used mostly for interpreting scripts, and the
C-shell (csh), the standard for interactive commands and
A character that is not
available in one of the character
sets already supported on a computer system.
standard input, standard output:
The input and output streams for DOS or Unix tool programs. The operating system associates these streams
with each program as it is run. The standard input defaults to the
keyboard, and the standard output to the screen, though both are
frequently redirected to
other programs, or to files.
A (long) string of bytes, which may come from any source,
including a file. Streams are
operated upon by filters and other
programs. "Stream" is often used as an alternative, active metaphor for
"file", when considered in terms of sequential (serial) throughput that
can be redirected.
A sequence of bytes. Since bytes
are used to encode text, "string"
is often used as a synonym for "word" or "phrase" in electronic
text-processing environments. Special uses of the term include
search string (the string to be matched in a searching
operation) and replacement string (the string to be
substituted for occurrences of the search string in a replacement
A separate file that is used with a document
to declare how each generalized text element is to be formatted for display.
A directory that is located inside
another directory. There can be long chains of subdirectories in a file's full path if it is deeply buried in the file
One of a number of parameters that may be set for Unix tool programs, each specifying special
instructions (e.g., with the sort tool, sort
-rn specifies reverse numeric
sort). Each program has its own unique array of possible switches,
invoked on the command-line before
arguments, using a switch prefix
(normally minus sign "-") before the individual letters
indicating the switch settings, thus resembling clitics on the command
verb. May be set by menu or checkbox in a GUI. Also called options or
A string of characters inserted into a text file to
represent a markup code. In SGML, each text element of a given type is
delimited by an opening tag of the form
<type> and a closing tag of the
form </type>. In computational
linguistics, a part-of-speech tag is a
lexical syntactic category associated with a word in a
corpus; a coreference tag is an
annotation indicating the referential dependency of the tagged phrase on
other tagged phrases in the corpus.
In computational linguistics, a set of possible tags
for a given annotation task. For example, a part-of-speech tag set is a
list of lexical syntactic categories which may be associated with
lexical items. Cf. paradigm.
TCP, or Transmission Control Protocol:
A way of transmitting information packets on the Internet so that
those belonging to the same body of data can be identified and
reassembled into their original order.
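The reassembly idea can be illustrated with a toy model. This is not an implementation of TCP: packets are reduced to a sequence number plus a chunk of data, and everything else the protocol does is omitted. The function names are hypothetical.

```python
import random

def split_into_packets(data, size):
    """Cut data into numbered packets: (sequence number, chunk)."""
    return [(seq, data[i:i + size])
            for seq, i in enumerate(range(0, len(data), size))]

def reassemble(packets):
    """Put packets back into their original order and join the chunks."""
    return b"".join(chunk for _, chunk in sorted(packets))

message = b"Any machine which follows the protocol can take part."
packets = split_into_packets(message, size=8)
random.shuffle(packets)            # simulate out-of-order arrival
restored = reassemble(packets)
print(restored == message)
```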
The Text Encoding Initiative is a joint effort
of the Association for Computers and the Humanities, the Association for
Literary and Linguistic Computing, and the Association for Computational
Linguistics to develop
- A list of directories in
which the operating system looks for files. To put a directory in one's
path is to add the directory's name to this list; to put a file
in one's path is to store the file in a directory that is on the
- Used also of the full path or pathname of a file, the
sequential list of directories which locates the file on the disk; the
reference is parsed recursively, like a linguistic tree, e.g., in Unix,
/usr/jlawler/bin/aliases specifies a file named
aliases, which is further specified as being located in
the subdirectory named
bin, which is located in the subdirectory named
jlawler, which is located in the subdirectory named
usr, which is located under the top (root)
directory (always called simply "/").
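The nesting of directories in a full pathname can be inspected with Python's pathlib module. This sketch only analyzes the pathname as a string; it does not look at any actual disk.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/usr/jlawler/bin/aliases")

file_name = path.name                     # the file itself
directories = path.parts[:-1]             # root first, then each subdirectory
chain = [str(p) for p in path.parents]    # innermost directory outward

print(file_name, directories, chain)
```

Reading the chain from first to last follows the same order as the description above: bin, inside jlawler, inside usr, under the root.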