Reimplementation of Yarowsky's Algorithm for Word Sense Disambiguation
Siwei Shen
shens@umich.edu
Abstract
Yarowsky's algorithm uses unsupervised (or at least semi-supervised) learning and decision list
to tackle a wide range of Natural Language Processing tasks, such as lexical ambiguity resolution,
sense disambiguation for polysemous words, and very recently name-entity tagging. Because the
original code is not publicly available, it is helpful for the UMich NLP group to have the algorithm
reimplemented. In this course project, I did such a reimplementation and applied it to the word
sense disambiguation problem. My experimental results provide us with some important insights about
the algorithm as well as some suggestions on extensions and modifications to make it more suitable
for our own problem.
Introduction (Excerpt from (Yarowsky, 1994))
The algorithm avoids the need for costly hand tagged training data by exploiting two powerful properties
of human language:
- One sense per collocation: Nearby words provide strong and consistent clues to the sense of a
target word, conditional on relative distance, order and syntactic relationship.
- One sense per discourse: The sense of a target word is highly consistent within any given document.
Natural language is highly redundant so that the sense of a word is overdetermined by the two
properties above. The algorithm uses these properties to incrementally identify collocations for
target senses of a word, given a few seed collocations for each sense. The use of one-sense-per-discourse
is optional for error correction in the learning process. It can be overridden when local evidence
provided by one-sense-per-collocation is strong. Using a decisien list control structure based on
(Rivest, 1987), this algorithm integrates a wide diversity of potential evidence sources (lemmas,
inflected forms, parts of speech and arbitrary word classes) in a wide diversity of positional
relationships (including local and distant collocations, trigram sequences, and predicate-argument
association). The training procedure computes the word-sense probability distributions for all such
collocations, and orders them by
New data are classified by using the single most predictive piece of disambiguating evidence
that appears in the target context. By not combining probabilities, this decision-list approach
avoids the problematic complex modeling of statistical dependenciescies. The algorithm is
especially well suited for utilizing a large set of highly non-independent evidence. There is
always one sense being identified as the default, and is tagged to a target word if no evidence
can be found in the decision list.
Example
For example, consider the word "plant". Two of its meanings are:
- A building or group of buildings for the manufacture of a product; a factory. The equipment,
including machinery, tools, instruments, and fixtures and the buildings containing them, necessary
for an industrial or manufacturing operation.
- Any of various photosynthetic, eukaryotic, multicellular organisms of the kingdom Plantae
characteristically producing embryos, containing chloroplasts, having cellulose cell walls,
and lacking the power of locomotion.
Some sample representative collocations for these two senses are listed below. Note that the
real seeds used in my code tend to be much longer sentence pieces since we use a window size 5 or 10
to search for collocations.
| Sense |
Seed examples (showing collocations) |
| 1 |
... automated manufacturing plant in Fremont ... |
| 1 |
... chemical manufacturing plant , producing viscose ... |
| 1 |
... the manager of the giant auto plant says Friday that ... |
| 1 |
... lead to the closure of the power plant ... |
| 1 |
... discovered at a St. Louis plant manufacturing ... |
| 2 |
... used to strain microscopic plant life from the ... |
| 2 |
... close-up studies of plant life and natural ... |
| 2 |
... establishment phase of the plant virus life cycle |
| 2 |
... mammals. Animal and plant life are delicately ... |
| 2 |
... beds too salty to support plant life. River ... |
Related Work
My software is a reimplementation of Yarowsky's unsupervised learning algorithm introduced
in (Yarowsky, 1994) and (Yarowsky, 1995). Yarowsky applied this algorithm to different tasks
such as restoring missing accents for Spanish words as well as word sense disambiguation.
I am conducting similar experiments with an aim to verify the claimed properties of the
algorithm.
Abney uses Yarowsky's algorithm for name-entity tagging. He also studies a number of variations
of the algorithm to show that they optimize either likelihood (defined in what seems a natural
way for partially labeled data) or a related objective function that lower-bounds likelihood.
The Algorithm
Step 1: Prepare Data
In a large corpus, identify all examples of the given polysemous word, storing their contexts
as lines in an initially untagged training set.
Step 2: Identify Seeds
For each sense of the word to disambiguate, identify a relatively small number of seed examples
representative of that sense. The remainder of the examples (typically 85-98%) constitute an
untagged residual.
Step 3a: Train Classifier
Train the supervised classification algorithm on the seed sets. Identify collocations within the
user-specified window that reliably partition the seed training data. Rank these collocations by
the purity of their distribution expressed by the log likelihood formula mentioned above.
Step 3b: Get New Seeds
Apply the resulting classifier to the entire training set. Take those members in the residual that
are tagged as some sense with probability above a certain threshold, and add those examples to the
growing seed sets. These additions will contain new collocations that can be used to iteratively
train the classifier and reapply on the previously tagged training set to get improvement.
Step 3c: One-Sense-per-Discourse
Optionally, the one-sense-per-discourse constraint is then used to filter and augment this addition.
Step 3d: Repeat
Repeat Step 3 iteratively until the algorithm converges to a stable residual set. The seed set will
tend to grow, while the residual will tend to shrink.
Step 4: Test Data
The decision list learned in the final step can now be applied to new test data.
Experimental Setup and Data
Experiment to verify code validity
Small tests are conducted to verify that the reimplementation code itself works properly (without
bugs), although results from these dummy tests have no implication of the performance. I use the
word "plant" with its 2 senses (listed above) in my small tests. Check it out at
/clair6/projects/msa/siwei/NLPCourse/test-dummy
Not only does my implementation run smoothly, it also achieves close to 100% accuracy in disambiguating
the small sized test data.
The code and usage guidelines can be located at the bottom of this page.
Experiment to check performance
I use part of the AQUAINT corpora -- the LDC corpus used in Question Answering task in TREC --
in my experiments. The corpora contain news articles from 3 sources spanning over a period of
3-5 years. In my project, news articles from APW and NYT between 1998 and 2000 are used. The
polysemous words tested are "plant" and "light". For "plant", I test 2 senses and 3 senses
whereas "light" tested for 2 senses. In each case, I am creating a training set with about some
thousand instances of the polysemous word, and a similarly sized test set. According to Yarowsky,
around 1% examples of each sense are identified initially as the seed examples.
The data can be located on tangra at:
/clair6/projects/msa/siwei/NLPCourse/plant-seed-2 (seed set for "plant" with 2 senses)
/clair6/projects/msa/siwei/NLPCourse/plant-seed-3 (seed set for "plant" with 3 senses)
/clair6/projects/msa/siwei/NLPCourse/plant-train (training set for "plant" with 2935 instances)
/clair6/projects/msa/siwei/NLPCourse/plant-test (test set for "plant" with 2948 instances)
/clair6/projects/msa/siwei/NLPCourse/light-seed (seed set for "light" with 2 senses)
/clair6/projects/msa/siwei/NLPCourse/light-train (training set for "light" with 1873 instances)
/clair6/projects/msa/siwei/NLPCourse/light-test (test set for "light" with 1949 instances)
Due to the size of the data, I am not providing a direct web link to these files.
Evaluation
I run the implemented algorithm on the 3 test cases mentioned above, i.e. disambiguating "plant"
with 2 senses, "plant" with 3 senses, and "light" with 2 senses. The 3rd sense used in "plant-3"
case is its normal verbal meaning. The 2 senses used for light are 1) (n.) the electromagnetic radiation;
and 2) (adj.) opposite to heavy or dark. In each case, the learned decision list from the training set
is applied on the test data to yield tags for all the instances of the polysemous word. 2 human judeges
are presented with the tagged set and are asked to evalute some randomly selected 1000 instances of the
word. Average is taken between the two human judgments. Instances of the polysemous word that exhibit
senses other than the ones being disambiguated are skipped (for example, light in "light up the candle").
The raw results are worse than what Yarowsky claims in his paper. Being confident that the code itself
is bug-free due to exhaustive tests on small examples, I find the reasons for bad results to be mainly
the follows:
- Manually found seed examples are not representative of the senses, and train/test set contains
much more diverse collocations then what the small seed set can include.
- I forcefully stop the iterative process after a small number of iterations due to limited time.
- The probability threshold for adding new seeds is not wisely picked.
Regardless of unsatisfactory disambiguation accuracy, I have verified some hypotheses about the algorithm.
- Using a larger window size for collocation search yields a better disambiguation accuracy.
- Disambiguating fewer number of senses yields a better accuracy.
- The algorithm is self-correcting.
The first two hypotheses are verified by the numbers in the following table:
| Word - #Senses |
Window Size |
Disambiguation Accuracy |
| plant - 2 |
5 |
69.5% |
| plant - 2 |
10 |
77.3% |
| plant - 3 |
5 |
49.1% |
| plant - 3 |
10 |
55.7% |
| light - 2 |
5 |
75.0% |
| light - 2 |
10 |
76.2% |
The accuracies of "plant-3" cases are particularly low because it is much harder to obtain good initial
collocations for more senses. The 3rd sense I use for "plant" is its verbal meaning. Unfortunately, most
coolocations I find for this sense would imply wrong information for the other two sense, thus attributing
to the low accuracies (most resultant tags are Sense 3). On the other hand, the "light-2" case seems overall
the best in part because the collocations indicative of a noun "light" and those signaling an adjective
"light" are quite different.
To verify the 3rd hypothesis, I purposefully use a wrong seed example in the seed set for "plant-3" (making
a supposedly Sense 1 instance Sense 3). However, the resultant tagging for the training set is able t correct
this mistake (remember that the seeds come from the training set).
Having low accuracy means most likely that the decision list of collocations learned from the algorithm is
not ideal. However, a closer look at the final decision list genreated by the various test cases shows that
the top ranking collocations are in fact fairly good in normal cases. Below is an abbreviated version of the
final decision list in disambiguating "plant" with 2 senses ("plant-2") and using a window size of 5.
[Log Likelihood]:[Collocation Pos]:[Sense]:[Collocation]
8.27893600229198:-1:1:power
7.02108396428914:-1:1:assembly
6.04025471127741:-1:1:manufacturing
5.63478960316925:-1:1:auto
5.59842195899838:+-2-5:1:car
5.56068163101553:+-2-5:2:animal
5.39362754635236:+-2-5:1:worked
5.34710753071747:+-2-5:1:kilometers
5.24702407216049:+1:2:species
5.24702407216049:+1:2:life
5.24702407216049:+-2-5:1:Mexico
5.19295685089021:+-2-5:1:missile
5.13579843705026:+-2-5:1:Motor
5.07517381523383:-1:1:Co
5.07517381523383:+-2-5:1:commercial
5.07517381523383:+-2-5:1:against
5.01063529409626:+-2-5:1:told
5.01063529409626:+-2-5:1:news
5.01063529409626:+-2-5:1:nearly
5.01063529409626:+-2-5:1:including
4.9416424226093:+1:1:last
4.9416424226093:+-2-5:1:County
4.9416424226093:+-2-5:1:3000
4.86753445045558:+-2-5:1:struck
4.86753445045558:+-2-5:1:atomic
4.86753445045558:+-2-5:1:Mich
4.86753445045558:+-2-5:1:200
4.78749174278205:+-2-5:1:central
4.70048036579242:-1:1:Ford
4.70048036579242:+1:1:Henry
4.70048036579242:+-2-5:1:suburb
4.70048036579242:+-2-5:1:shift
4.70048036579242:+-2-5:1:port
4.70048036579242:+-2-5:1:missiles
4.70048036579242:+-2-5:1:eastern
4.70048036579242:+-2-5:1:Michigan
4.70048036579242:+-2-5:1:April
4.70048036579242:+-2-5:1:1996
4.60517018598809:-1:1:transmission
4.60517018598809:+-2-5:2:species
Obviously, the top ranking collocations are all quite valid. For example, if word "power" shows up right
before "plant" (-1 pos), it indicates that the "plant" has Sense 1 ("facility"). Similarly, if "animal"
appears in a window of 5 near "plant", it is more likely that the "plant" means a botanic entity (Sense 2).
However, other nonsense collocations are also present in the decision list such as those numerals. Moreover,
a lot of erroneous tags are caused by high-ranking collocations that are stop words (e.g. "of", "and", "to").
Since such stop words appear frequently and everywhere in any discourse, they having a high-rank can easily
lead to suboptimal tagging of the word instances. This problem is most reflected in "plant-3" test where
most seeds for the verb plant (Sense 3) have the word "to" right before "plant". Therfore, the classifier
is fooled to believe that -1:to is a strong indication of Sense3 for "plant". Unfortunately, many Sense1
instances in the test data also follow the word "to" (for example: ... lead to plant closure...). On the
other hand, it is not trivial to find other seeds of Sense3 plant that do not have "to" before the word!
More decision lists can be located at
/clair6/projects/msa/siwei/NLPCourse/test-plant-2-5/
/clair6/projects/msa/siwei/NLPCourse/test-plant-2-10/
/clair6/projects/msa/siwei/NLPCourse/test-plant-3-5/
/clair6/projects/msa/siwei/NLPCourse/test-plant-3-10/
/clair6/projects/msa/siwei/NLPCourse/test-light-2-5/
/clair6/projects/msa/siwei/NLPCourse/test-light-2-10/
Discussion and Future Work
In this section, I comment on future work, some of which are indeed insights to the original algorithm
and/or suggestions for improvements.
A number of improvements and extensions to the current work are being planned. First, I need to come up
with an efficient way of producing representative seed examples for all the senses being disambiguated.
Alternatively, I can try to make the training and test set containing instances of the polysemous word
that have similar collocations to those in the seed set. It is indeed the difficulty of finding such
good seeds that makes some researchers doubtful about whether Yarowsky's algorithm can be referred to
as unsupervised. My experiments show that randomly generated seeds would most likely perform very badly,
especially for large and diverse corpora like the ones used in this project.
Secondly, tuning up parameters in the algorithm is expected to boost the performance. In particular, the
threshold of introducing new seeds can greatly affect the learning results. Similarly, window size could
also be varied to search for the optimal amount of collocations.
The third thing I have in mind is to incorporate local syntax information into collocation identification.
My experience in multiple sequence alignment tells that syntax could very well be a promising clue in this
task. Especially for collocations that are stop words, adding syntax consideration should be deemed very
necessary.
Finally, I propose dynamic maintenance of the decision lists generated by each iteration of the algorithm.
There is a trade-off between strict versus tight admission of new seeds. In the former case, new collocations
can be missed so that early errors are hard to correct. In the latter case, however, the decision lists
tend to be unnecessarily large and contain collocations with very low log-likelihood.
Bibliography
Yarowsky 1994
Yarowsky 1995
Links
README for Usage