Reimplementation of Yarowsky's Algorithm for Word Sense Disambiguation

Siwei Shen
shens@umich.edu

Abstract

Yarowsky's algorithm uses unsupervised (or at least semi-supervised) learning and decision list to tackle a wide range of Natural Language Processing tasks, such as lexical ambiguity resolution, sense disambiguation for polysemous words, and very recently name-entity tagging. Because the original code is not publicly available, it is helpful for the UMich NLP group to have the algorithm reimplemented. In this course project, I did such a reimplementation and applied it to the word sense disambiguation problem. My experimental results provide us with some important insights about the algorithm as well as some suggestions on extensions and modifications to make it more suitable for our own problem.

Introduction (Excerpt from (Yarowsky, 1994))

The algorithm avoids the need for costly hand tagged training data by exploiting two powerful properties of human language: Natural language is highly redundant so that the sense of a word is overdetermined by the two properties above. The algorithm uses these properties to incrementally identify collocations for target senses of a word, given a few seed collocations for each sense. The use of one-sense-per-discourse is optional for error correction in the learning process. It can be overridden when local evidence provided by one-sense-per-collocation is strong. Using a decisien list control structure based on (Rivest, 1987), this algorithm integrates a wide diversity of potential evidence sources (lemmas, inflected forms, parts of speech and arbitrary word classes) in a wide diversity of positional relationships (including local and distant collocations, trigram sequences, and predicate-argument association). The training procedure computes the word-sense probability distributions for all such collocations, and orders them by


New data are classified by using the single most predictive piece of disambiguating evidence that appears in the target context. By not combining probabilities, this decision-list approach avoids the problematic complex modeling of statistical dependenciescies. The algorithm is especially well suited for utilizing a large set of highly non-independent evidence. There is always one sense being identified as the default, and is tagged to a target word if no evidence can be found in the decision list.

Example

For example, consider the word "plant". Two of its meanings are: Some sample representative collocations for these two senses are listed below. Note that the real seeds used in my code tend to be much longer sentence pieces since we use a window size 5 or 10 to search for collocations.

Sense Seed examples (showing collocations)
1 ... automated manufacturing plant in Fremont ...
1 ... chemical manufacturing plant , producing viscose ...
1 ... the manager of the giant auto plant says Friday that ...
1 ... lead to the closure of the power plant ...
1 ... discovered at a St. Louis plant manufacturing ...
2 ... used to strain microscopic plant life from the ...
2 ... close-up studies of plant life and natural ...
2 ... establishment phase of the plant virus life cycle
2 ... mammals. Animal and plant life are delicately ...
2 ... beds too salty to support plant life. River ...

Related Work

My software is a reimplementation of Yarowsky's unsupervised learning algorithm introduced in (Yarowsky, 1994) and (Yarowsky, 1995). Yarowsky applied this algorithm to different tasks such as restoring missing accents for Spanish words as well as word sense disambiguation. I am conducting similar experiments with an aim to verify the claimed properties of the algorithm.

Abney uses Yarowsky's algorithm for name-entity tagging. He also studies a number of variations of the algorithm to show that they optimize either likelihood (defined in what seems a natural way for partially labeled data) or a related objective function that lower-bounds likelihood.

The Algorithm

Step 1: Prepare Data
In a large corpus, identify all examples of the given polysemous word, storing their contexts as lines in an initially untagged training set.

Step 2: Identify Seeds
For each sense of the word to disambiguate, identify a relatively small number of seed examples representative of that sense. The remainder of the examples (typically 85-98%) constitute an untagged residual.

Step 3a: Train Classifier
Train the supervised classification algorithm on the seed sets. Identify collocations within the user-specified window that reliably partition the seed training data. Rank these collocations by the purity of their distribution expressed by the log likelihood formula mentioned above.

Step 3b: Get New Seeds
Apply the resulting classifier to the entire training set. Take those members in the residual that are tagged as some sense with probability above a certain threshold, and add those examples to the growing seed sets. These additions will contain new collocations that can be used to iteratively train the classifier and reapply on the previously tagged training set to get improvement.

Step 3c: One-Sense-per-Discourse
Optionally, the one-sense-per-discourse constraint is then used to filter and augment this addition.

Step 3d: Repeat
Repeat Step 3 iteratively until the algorithm converges to a stable residual set. The seed set will tend to grow, while the residual will tend to shrink.

Step 4: Test Data
The decision list learned in the final step can now be applied to new test data.

Experimental Setup and Data

Experiment to verify code validity
Small tests are conducted to verify that the reimplementation code itself works properly (without bugs), although results from these dummy tests have no implication of the performance. I use the word "plant" with its 2 senses (listed above) in my small tests. Check it out at
	/clair6/projects/msa/siwei/NLPCourse/test-dummy
Not only does my implementation run smoothly, it also achieves close to 100% accuracy in disambiguating the small sized test data.

The code and usage guidelines can be located at the bottom of this page.

Experiment to check performance
I use part of the AQUAINT corpora -- the LDC corpus used in Question Answering task in TREC -- in my experiments. The corpora contain news articles from 3 sources spanning over a period of 3-5 years. In my project, news articles from APW and NYT between 1998 and 2000 are used. The polysemous words tested are "plant" and "light". For "plant", I test 2 senses and 3 senses whereas "light" tested for 2 senses. In each case, I am creating a training set with about some thousand instances of the polysemous word, and a similarly sized test set. According to Yarowsky, around 1% examples of each sense are identified initially as the seed examples.

The data can be located on tangra at:
	/clair6/projects/msa/siwei/NLPCourse/plant-seed-2	(seed set for "plant" with 2 senses)
	/clair6/projects/msa/siwei/NLPCourse/plant-seed-3	(seed set for "plant" with 3 senses)
	/clair6/projects/msa/siwei/NLPCourse/plant-train	(training set for "plant" with 2935 instances)
	/clair6/projects/msa/siwei/NLPCourse/plant-test		(test set for "plant" with 2948 instances)
	/clair6/projects/msa/siwei/NLPCourse/light-seed		(seed set for "light" with 2 senses)
	/clair6/projects/msa/siwei/NLPCourse/light-train	(training set for "light" with 1873 instances)
	/clair6/projects/msa/siwei/NLPCourse/light-test		(test set for "light" with 1949 instances)
Due to the size of the data, I am not providing a direct web link to these files.

Evaluation

I run the implemented algorithm on the 3 test cases mentioned above, i.e. disambiguating "plant" with 2 senses, "plant" with 3 senses, and "light" with 2 senses. The 3rd sense used in "plant-3" case is its normal verbal meaning. The 2 senses used for light are 1) (n.) the electromagnetic radiation; and 2) (adj.) opposite to heavy or dark. In each case, the learned decision list from the training set is applied on the test data to yield tags for all the instances of the polysemous word. 2 human judeges are presented with the tagged set and are asked to evalute some randomly selected 1000 instances of the word. Average is taken between the two human judgments. Instances of the polysemous word that exhibit senses other than the ones being disambiguated are skipped (for example, light in "light up the candle").

The raw results are worse than what Yarowsky claims in his paper. Being confident that the code itself is bug-free due to exhaustive tests on small examples, I find the reasons for bad results to be mainly the follows: Regardless of unsatisfactory disambiguation accuracy, I have verified some hypotheses about the algorithm. The first two hypotheses are verified by the numbers in the following table:

Word - #Senses Window Size Disambiguation Accuracy
plant - 2 5 69.5%
plant - 2 10 77.3%
plant - 3 5 49.1%
plant - 3 10 55.7%
light - 2 5 75.0%
light - 2 10 76.2%

The accuracies of "plant-3" cases are particularly low because it is much harder to obtain good initial collocations for more senses. The 3rd sense I use for "plant" is its verbal meaning. Unfortunately, most coolocations I find for this sense would imply wrong information for the other two sense, thus attributing to the low accuracies (most resultant tags are Sense 3). On the other hand, the "light-2" case seems overall the best in part because the collocations indicative of a noun "light" and those signaling an adjective "light" are quite different.

To verify the 3rd hypothesis, I purposefully use a wrong seed example in the seed set for "plant-3" (making a supposedly Sense 1 instance Sense 3). However, the resultant tagging for the training set is able t correct this mistake (remember that the seeds come from the training set).

Having low accuracy means most likely that the decision list of collocations learned from the algorithm is not ideal. However, a closer look at the final decision list genreated by the various test cases shows that the top ranking collocations are in fact fairly good in normal cases. Below is an abbreviated version of the final decision list in disambiguating "plant" with 2 senses ("plant-2") and using a window size of 5.
			[Log Likelihood]:[Collocation Pos]:[Sense]:[Collocation]
			8.27893600229198:-1:1:power
			7.02108396428914:-1:1:assembly
			6.04025471127741:-1:1:manufacturing
			5.63478960316925:-1:1:auto
			5.59842195899838:+-2-5:1:car
			5.56068163101553:+-2-5:2:animal
			5.39362754635236:+-2-5:1:worked
			5.34710753071747:+-2-5:1:kilometers
			5.24702407216049:+1:2:species
			5.24702407216049:+1:2:life
			5.24702407216049:+-2-5:1:Mexico
			5.19295685089021:+-2-5:1:missile
			5.13579843705026:+-2-5:1:Motor
			5.07517381523383:-1:1:Co
			5.07517381523383:+-2-5:1:commercial
			5.07517381523383:+-2-5:1:against
			5.01063529409626:+-2-5:1:told
			5.01063529409626:+-2-5:1:news
			5.01063529409626:+-2-5:1:nearly
			5.01063529409626:+-2-5:1:including
			4.9416424226093:+1:1:last
			4.9416424226093:+-2-5:1:County
			4.9416424226093:+-2-5:1:3000
			4.86753445045558:+-2-5:1:struck
			4.86753445045558:+-2-5:1:atomic
			4.86753445045558:+-2-5:1:Mich
			4.86753445045558:+-2-5:1:200
			4.78749174278205:+-2-5:1:central
			4.70048036579242:-1:1:Ford
			4.70048036579242:+1:1:Henry
			4.70048036579242:+-2-5:1:suburb
			4.70048036579242:+-2-5:1:shift
			4.70048036579242:+-2-5:1:port
			4.70048036579242:+-2-5:1:missiles
			4.70048036579242:+-2-5:1:eastern
			4.70048036579242:+-2-5:1:Michigan
			4.70048036579242:+-2-5:1:April
			4.70048036579242:+-2-5:1:1996
			4.60517018598809:-1:1:transmission
			4.60517018598809:+-2-5:2:species
	
Obviously, the top ranking collocations are all quite valid. For example, if word "power" shows up right before "plant" (-1 pos), it indicates that the "plant" has Sense 1 ("facility"). Similarly, if "animal" appears in a window of 5 near "plant", it is more likely that the "plant" means a botanic entity (Sense 2). However, other nonsense collocations are also present in the decision list such as those numerals. Moreover, a lot of erroneous tags are caused by high-ranking collocations that are stop words (e.g. "of", "and", "to"). Since such stop words appear frequently and everywhere in any discourse, they having a high-rank can easily lead to suboptimal tagging of the word instances. This problem is most reflected in "plant-3" test where most seeds for the verb plant (Sense 3) have the word "to" right before "plant". Therfore, the classifier is fooled to believe that -1:to is a strong indication of Sense3 for "plant". Unfortunately, many Sense1 instances in the test data also follow the word "to" (for example: ... lead to plant closure...). On the other hand, it is not trivial to find other seeds of Sense3 plant that do not have "to" before the word!

More decision lists can be located at
		/clair6/projects/msa/siwei/NLPCourse/test-plant-2-5/
		/clair6/projects/msa/siwei/NLPCourse/test-plant-2-10/
		/clair6/projects/msa/siwei/NLPCourse/test-plant-3-5/
		/clair6/projects/msa/siwei/NLPCourse/test-plant-3-10/
		/clair6/projects/msa/siwei/NLPCourse/test-light-2-5/
		/clair6/projects/msa/siwei/NLPCourse/test-light-2-10/

Discussion and Future Work

In this section, I comment on future work, some of which are indeed insights to the original algorithm and/or suggestions for improvements.

A number of improvements and extensions to the current work are being planned. First, I need to come up with an efficient way of producing representative seed examples for all the senses being disambiguated. Alternatively, I can try to make the training and test set containing instances of the polysemous word that have similar collocations to those in the seed set. It is indeed the difficulty of finding such good seeds that makes some researchers doubtful about whether Yarowsky's algorithm can be referred to as unsupervised. My experiments show that randomly generated seeds would most likely perform very badly, especially for large and diverse corpora like the ones used in this project.

Secondly, tuning up parameters in the algorithm is expected to boost the performance. In particular, the threshold of introducing new seeds can greatly affect the learning results. Similarly, window size could also be varied to search for the optimal amount of collocations.

The third thing I have in mind is to incorporate local syntax information into collocation identification. My experience in multiple sequence alignment tells that syntax could very well be a promising clue in this task. Especially for collocations that are stop words, adding syntax consideration should be deemed very necessary.

Finally, I propose dynamic maintenance of the decision lists generated by each iteration of the algorithm. There is a trade-off between strict versus tight admission of new seeds. In the former case, new collocations can be missed so that early errors are hard to correct. In the latter case, however, the decision lists tend to be unnecessarily large and contain collocations with very low log-likelihood.

Bibliography

Yarowsky 1994
Yarowsky 1995

Links

README for Usage