Messeret Gebre-Kristos
Exploratory Data Analysis

Large Corpora

Problem Statement

Using the Diversity Kaleidoscope data (provided by the professor), generate the MontyLingua noun-phrase list from a subset of authors, genres, and schools that you select. This should involve 500 to 1500 documents. If using your own data, do something of similar size. If you're using the DK data, create some lists of noun phrases that reveal something about the authors and genres you've selected. For instance, students may talk about different subjects than administrators. If so, generate lists that tell us about those differences.

Abstract

I sought an answer to the following question:

In diversity-related documents, are the terms used by university administration offices different from those used by individuals in a non-official capacity? Comparing the top 15 most frequent terms used by each author type suggests that the vocabularies are substantially similar across documents authored by individuals and by administration offices.

Data

I used the Diversity Kaleidoscope (DK) data. From that corpus, I needed to extract documents authored by university individuals and administration offices (codes 'a' and 'b' respectively in the second column of the combined data files). I found 412 files authored by individuals and 1126 files from administration offices, for a total of 1538 documents.

Steps

  • Determine what data I need:
  • Since I need to compare terminology utilized by individuals versus that of administration offices, I selected only those documents authored by the subjects in question.

  • Extract the relevant URLs:
  • The 'combined' folder in the 'mcq' (the professor's home) directory has a list of files, one for each university. Each file is a list of URLs to diversity-related web pages. Each line also contains the author name, author type, and genre of the document. I needed to filter out the relevant URLs (those with author codes 'a' and 'b', for individuals and admins respectively), using the following grep expressions:

    grep "|a|" ~mcq/divers/fetchconvert/combined/*.txt
    grep "|b|" ~mcq/divers/fetchconvert/combined/*.txt
  • Fetch the documents related to the relevant URLs:
  • I accomplished this by feeding the URL list to a perl script, fetchdocs.pl. The script fetches the documents and stores them in a specified directory. The call to the perl script and the grep commands are combined as follows:

    grep "|a|" ~mcq/divers/fetchconvert/combined/*.txt | cut -d\| -f4 | ./fetchdocs.pl student/
    grep "|b|" ~mcq/divers/fetchconvert/combined/*.txt | cut -d\| -f4 | ./fetchdocs.pl admin/

    The 'cut' command above extracts the fourth column from the '|'-delimited data piped in from the grep command.
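
    The grep-and-cut pipeline above can be sketched in Python. The column layout (author-type code in the second field, URL in the fourth) follows the description above; the sample lines here are invented for illustration.

```python
# Sketch of the `grep "|a|" ... | cut -d\| -f4` pipeline in Python.
# Assumes pipe-delimited lines with the author-type code in field 2
# and the URL in field 4, as described above.

def extract_urls(lines, author_code):
    """Return the URLs whose author-type field matches author_code."""
    urls = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 4 and fields[1] == author_code:
            urls.append(fields[3])
    return urls

# Made-up sample lines in the combined-file format:
sample = [
    "Some Name|a|news|http://example.edu/page1.html",
    "Office of X|b|report|http://example.edu/page2.html",
]
print(extract_urls(sample, "a"))  # URLs authored by individuals
```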

  • Convert the fetched documents into one format:
  • Now the documents need to be converted to a single common format, which can be done with DROID. Since DROID takes an XML file listing its inputs, a Perl script is run first to generate that list. I ran the following scripts in order to convert all the files to text format.

    perl droidfilelist.pl student
    perl calldroid.pl student
    perl convertdocs.pl droidoutput.xml

    Before running the last command, I created a directory called 'converted' to hold the output from convertdocs.pl.

    The same Perl scripts are then run for the admin directory. Before doing so, I renamed the existing 'converted' directory and made a fresh 'converted' directory to receive the admin output.
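
    As a rough illustration of the file-listing step, a script like droidfilelist.pl presumably walks the fetched-documents directory and collects the paths to hand to DROID. The sketch below is only an assumption about that behavior (it gathers paths; the real script also wraps them in DROID's XML input format, which is not shown):

```python
# Hypothetical sketch: collect file paths under a fetched-documents
# directory, as a file-listing script such as droidfilelist.pl might.
import os

def list_files(root):
    """Return all file paths under root, sorted within each directory."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            paths.append(os.path.join(dirpath, name))
    return paths
```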

  • Parse with Monty Lingua:
  • I ran a Python script (from the mcq directory) on the converted files, once after converting the student documents and again after converting the admin documents (each run reads the current 'converted' directory).

    python ~mcq/montylingua-2.1/python/test-6.py converted > student-monty.txt
    python ~mcq/montylingua-2.1/python/test-6.py converted > admin-monty.txt
  • Analyze Monty Lingua Output:
  • Now I want to isolate the noun phrases from the result files, sort them, and count the frequency of each noun phrase:

    grep "\bnp\b" student-monty.txt > student-np.txt
    cut -f 5 student-np.txt > noun-phrases-student.txt
    sort noun-phrases-student.txt > sorted-noun-phrases-student.txt
    uniq -c sorted-noun-phrases-student.txt | sort -nr > count-of-np.txt
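
    The sort/uniq pipeline above amounts to counting occurrences and ranking by frequency, which can also be sketched in Python (the sample phrases are made up):

```python
# Equivalent of `sort | uniq -c | sort -nr`: count each noun phrase
# and list (phrase, count) pairs from most to least frequent.
from collections import Counter

def rank_phrases(phrases):
    """Return (phrase, count) pairs, most frequent first."""
    return Counter(phrases).most_common()

sample = ["diversity", "the university", "diversity", "students"]
print(rank_phrases(sample))  # "diversity" ranks first with count 2
```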

    After running the same commands for the administration side, I had two frequency-count files, student.txt and admin.txt.

Comments

The top 15 most frequently used terms (ignoring stop words) are visualized here. Most of the terms appear on both lists; terms that appear on only one list are marked with an oval around them. The visualization makes it apparent that there is a high correspondence between the terms used by the two author types. It seems safe to conclude that individuals and academic offices use similar language when discussing the subject of diversity. Still, more nuanced differences might have emerged had I looked beyond the top 15 terms.
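
The comparison described above can be made concrete by intersecting the two top-N lists. A sketch with invented term lists (the real lists come from the frequency-count files):

```python
# Compare two ranked term lists: terms appearing on both, and terms
# unique to each list (the ones that would get an oval in the figure).
def compare_top_terms(terms_a, terms_b):
    shared = [t for t in terms_a if t in set(terms_b)]
    only_a = [t for t in terms_a if t not in set(terms_b)]
    only_b = [t for t in terms_b if t not in set(terms_a)]
    return shared, only_a, only_b

# Hypothetical top terms for each author type:
students = ["diversity", "campus", "culture", "class"]
admins = ["diversity", "campus", "policy", "culture"]
shared, only_students, only_admins = compare_top_terms(students, admins)
print(shared)         # terms on both lists
print(only_students)  # terms unique to individuals
```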