To use my implementation of Yarowsky's algorithm, please follow the steps below.

0. Download and untar yarowsky.tar to your working directory.

1. Prepare Data

   1.0 Pick a polysemous word and some of its senses to disambiguate. Number the senses 1, 2, ..., where 1 stands for the default sense.

   1.1 Training data -- a cluster of discourses, each containing one or more instances of the polysemous word, one discourse per line.

   1.2 Seed examples -- discourses chosen from the training data that are representative of each sense to disambiguate. The number of seed examples for each sense should be roughly 1% of the size of the training set. Tag each instance with its sense by putting "(X)" next to it, where X is the sense number for that instance.

   1.3 Test data -- a cluster of discourses, each containing one or more instances of the polysemous word, one discourse per line. Neither the test set nor the training set is tagged.

2. Run Yarowsky's Algorithm on Training Set

   % perl yarowsky.pl [word] [num_senses] [window_size] [seed_set] [train_set]

   For example, at /clair6/projects/msa/siwei/NLPCourse/

   % perl yarowsky.pl plant 2 5 plant-seed-2 plant-train

   would disambiguate the word "plant" for two of its meanings with window size 5, based on the seed examples in file "plant-seed-2" and the training set "plant-train". The user does not have to tell yarowsky.pl which senses are being used, as long as the seed examples are tagged accordingly (refer to Step 1 above).

   The outputs of this step are a series of decision list files and an output-train file, written to the directory where you run the command. The decision_list_X files are the decision lists produced at each iteration of the algorithm; the one with the maximum value of X is the final version. output-train is a copy of the training set with all instances of the polysemous word tagged with one of the candidate senses.

3. Run Yarowsky's Algorithm on Test Data

   % perl yarowsky-test.pl [word] [num_senses] [window_size] [test_set] [decision_list]

   For example, at /clair6/projects/msa/siwei/NLPCourse/

   % perl yarowsky-test.pl plant 2 5 plant-test decision-list

   Note that the decision-list in this step can be any of the ones generated in Step 2. Usually (and most practically), the final decision list from Step 2 should be used.

   The output of this step is an output-test file, written to the directory where you run the command. It is a copy of the test set with all instances of the polysemous word tagged with one of the candidate senses.

For questions and concerns, contact Siwei Shen at shens@umich.edu
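
Since Step 3 normally takes the final decision list from Step 2, it may help to pick that file automatically. The following is only a minimal sketch and not part of yarowsky.tar: the function name latest_decision_list is hypothetical, and it assumes the files are named decision_list_X with X a plain integer, as described in Step 2.

```shell
# Hypothetical helper (not shipped with yarowsky.tar): print the path of the
# decision_list_X file with the largest X in a directory.
# Assumption: X is a plain non-negative integer suffix.
latest_decision_list() {
    dir="${1:-.}"          # directory to search; defaults to the current one
    best=""
    best_n=-1
    for f in "$dir"/decision_list_*; do
        [ -e "$f" ] || continue            # glob matched nothing
        n="${f##*_}"                       # text after the last underscore
        case "$n" in
            ''|*[!0-9]*) continue ;;       # skip non-numeric suffixes
        esac
        if [ "$n" -gt "$best_n" ]; then    # numeric comparison, so 10 > 9
            best_n="$n"
            best="$f"
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# Possible usage, run from the directory where Step 2 was executed:
#   perl yarowsky-test.pl plant 2 5 plant-test "$(latest_decision_list)"
```

The comparison is numeric rather than alphabetical, so decision_list_10 is preferred over decision_list_9. The function prints nothing and returns a non-zero status if no matching file is found.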