Messeret Gebre-Kristos
Exploratory Data Analysis

Large Corpora

Problem Statement

Using the Diversity Kaleidoscope data (provided by the professor), generate the MontyLingua noun-phrase list from a subset of authors, genres, and schools that you select. This should involve 500 to 1500 documents. If using your own data, do something of similar size. If you're using the DK data, create some lists of noun phrases that reveal something about the authors and genres you've selected. For instance, students may talk about different subjects than administrators. If so, generate lists that tell us about those differences.

Abstract

I sought an answer to the following question:

In diversity-related documents, are the terms used by university administration offices different from those used by individuals in a non-official capacity? Comparing the top 15 most frequent terms used by each author type suggests that the vocabularies are substantially similar across documents authored by individuals and by administration offices.

Data

I used the Diversity Kaleidoscope (DK) data. From that corpus, I needed to extract documents authored by university individuals and administration offices (codes 'a' and 'b' respectively in the second column of the combined data files). I found 412 files authored by individuals and 1126 files from administration offices, for a total of 1538 documents.

Steps

  • Determine what data I need:
  • Since I need to compare terminology utilized by individuals versus that of administration offices, I selected only those documents authored by the subjects in question.

  • Extract the relevant URLs:
  • The 'combined' folder in the 'mcq' (the professor's home) directory has a list of files, one for each university. Each file is a list of URLs to diversity-related web pages. Each line also contains the author name, author type, and genre of the document. I needed to filter out the relevant URLs (those with author codes 'a' and 'b', for individuals and admins respectively), using the following grep expressions:

    grep "|a|" ~mcq/divers/fetchconvert/combined/*.txt
    grep "|b|" ~mcq/divers/fetchconvert/combined/*.txt
  • Fetch the documents related to the relevant URLs:
  • I accomplished this by feeding the URL list to a perl script, fetchdocs.pl. The script fetches the documents and stores them in a specified directory. The call to the perl script and the grep commands are combined as follows:

    grep "|a|" ~mcq/divers/fetchconvert/combined/*.txt | cut -d\| -f4 | ./fetchdocs.pl student/
    grep "|b|" ~mcq/divers/fetchconvert/combined/*.txt | cut -d\| -f4 | ./fetchdocs.pl admin/

    The 'cut' command above extracts the fourth column from the '|'-delimited data piped in from the grep command.
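
    The grep-and-cut pipeline above can be sketched in Python. The column layout (author-type code in the second field, URL in the fourth) follows the description above; the sample lines here are invented for illustration.

```python
# Sketch of the `grep "|a|" ... | cut -d\| -f4` pipeline in Python.
# Assumes pipe-delimited lines with the author-type code in field 2
# and the URL in field 4, as described above.

def extract_urls(lines, author_code):
    """Return the URLs whose author-type field matches author_code."""
    urls = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 4 and fields[1] == author_code:
            urls.append(fields[3])
    return urls

# Made-up sample lines in the combined-file format:
sample = [
    "Some Name|a|news|http://example.edu/page1.html",
    "Office of X|b|report|http://example.edu/page2.html",
]
print(extract_urls(sample, "a"))  # URLs authored by individuals
```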

  • Convert the fetched documents into one format:
  • Now the documents need to be converted to a single common format, which can be done with DROID. Since DROID takes an XML file listing its inputs, a Perl script is run first to generate that list. I ran the following scripts in order to convert all the files to text format.

    perl droidfilelist.pl student
    perl calldroid.pl student
    perl convertdocs.pl droidoutput.xml

    Before running the last command, I created a directory called 'converted' to hold the output from convertdocs.pl.

    The same Perl scripts are then run for the admin directory. Before doing so, I renamed the existing 'converted' directory and made a fresh 'converted' directory to receive the admin output.
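
    As a rough illustration of the file-listing step, a script like droidfilelist.pl presumably walks the fetched-documents directory and collects the paths to hand to DROID. The sketch below is only an assumption about that behavior (it gathers paths; the real script also wraps them in DROID's XML input format, which is not shown):

```python
# Hypothetical sketch: collect file paths under a fetched-documents
# directory, as a file-listing script such as droidfilelist.pl might.
import os

def list_files(root):
    """Return all file paths under root, sorted within each directory."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            paths.append(os.path.join(dirpath, name))
    return paths
```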

  • Parse with Monty Lingua:
  • I ran a Python script (from the mcq directory) on the converted files, once after converting the student documents and again after converting the admin documents (each run reads the current 'converted' directory).

    python ~mcq/montylingua-2.1/python/test-6.py converted > student-monty.txt
    python ~mcq/montylingua-2.1/python/test-6.py converted > admin-monty.txt
  • Analyze Monty Lingua Output:
  • Now I want to isolate the noun phrases from the result files, sort them, and count the frequency of each noun phrase:

    grep "\bnp\b" student-monty.txt > student-np.txt
    cut -f 5 student-np.txt > noun-phrases-student.txt
    sort noun-phrases-student.txt > sorted-noun-phrases-student.txt
    uniq -c sorted-noun-phrases-student.txt | sort -nr > count-of-np.txt
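
    The sort/uniq pipeline above amounts to counting occurrences and ranking by frequency, which can also be sketched in Python (the sample phrases are made up):

```python
# Equivalent of `sort | uniq -c | sort -nr`: count each noun phrase
# and list (phrase, count) pairs from most to least frequent.
from collections import Counter

def rank_phrases(phrases):
    """Return (phrase, count) pairs, most frequent first."""
    return Counter(phrases).most_common()

sample = ["diversity", "the university", "diversity", "students"]
print(rank_phrases(sample))  # "diversity" ranks first with count 2
```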

    After running the same commands for the administration side, I had two frequency-count files, student.txt and admin.txt.

Comments

The top 15 most frequently used terms (ignoring stop words) are visualized here. Most of the terms appear on both lists; terms that appear on only one list are marked with an oval around them. The visualization makes it apparent that there is a high correspondence between the terms used by the two author types. It seems safe to conclude that individuals and academic offices use similar language when discussing the subject of diversity. Still, more nuanced differences might have emerged had I looked beyond the top 15 terms.
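
The comparison described above can be made concrete by intersecting the two top-N lists. A sketch with invented term lists (the real lists come from the frequency-count files):

```python
# Compare two ranked term lists: terms appearing on both, and terms
# unique to each list (the ones that would get an oval in the figure).
def compare_top_terms(terms_a, terms_b):
    shared = [t for t in terms_a if t in set(terms_b)]
    only_a = [t for t in terms_a if t not in set(terms_b)]
    only_b = [t for t in terms_b if t not in set(terms_a)]
    return shared, only_a, only_b

# Hypothetical top terms for each author type:
students = ["diversity", "campus", "culture", "class"]
admins = ["diversity", "campus", "policy", "culture"]
shared, only_students, only_admins = compare_top_terms(students, admins)
print(shared)         # terms on both lists
print(only_students)  # terms unique to individuals
```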