Messeret Gebre-Kristos
Exploratory Data Analysis

Regular Expressions

Problem Statement

Elaborate on your input to Monty Lingua from earlier exercises by adding a Perl script designed to catch problematic patterns.

Abstract

In a previous assignment, I fed Monty Lingua a set of converted text files for analysis without cleaning up the input. This time around, I use regular expressions to remove spurious or semantically useless strings.

Data

I take the files which were converted into text files using DROID in assignment 1 and clean them up before analyzing them with Monty Lingua. The files comprise a set of documents authored by administration staff and another set authored by individuals.

Steps

  • Write a Perl script to remove unnecessary terms.
  • My first attempt was to remove every line that contained an unwanted character. The problem with that approach was that it discarded an entire line because of a single illegal character. Next, I parsed each line into space-delimited terms and treated each term individually. After removing all the unwanted terms, the rest of the file is written back to a file of the same name in a different directory ('cleaned-admin' and 'cleaned-student').

    I decided that any characters that are not alphanumeric would not be useful for my inquiry (how much difference is there between the terms used by administration staff and individuals?).

    next if($term =~ m/\W/);
    next if($term =~ m/\A\W/);

    The first regex was my initial attempt to remove non-alphanumeric characters. However, it eliminated words that merely ended with punctuation marks (e.g. 'diversity:', 'staff, faculty, and students.'). The second line is better, because it takes out only terms that start with a non-alphanumeric character.

    Many of the files have long lines made up of underscores. Because the regular expression above doesn't exclude underscores (an underscore counts as a word character, so \W doesn't match it), I had to remove adjacent underscores explicitly.

    next if($term=~ m/(__)+/);

    I also decided to remove digits because I didn't see the value of tallying how many times, say, 2 or 1998 was mentioned.

    Finally, I checked the stopword list to make sure none of its terms are included in the cleaned-up files. I have decided to remove 'I' and 'We' from the stopword list, because I want to verify whether 'we' is used by administration staff more frequently than by individuals.

    The Perl script used for data scrubbing can be accessed here.
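Taken together, the filtering rules above amount to a single filter over each file. The sketch below is only an approximation of the actual script: the stopword list is a small placeholder (the real list, minus 'I' and 'We', is much longer), and file handling is reduced to a STDIN-to-STDOUT pipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder stopword list; the real one (minus 'I' and 'We') is longer.
my %stopwords = map { $_ => 1 } qw(the a an and of to in);

# Return only the terms worth keeping from one input line.
sub clean_line {
    my ($line) = @_;
    my @kept;
    for my $term (split ' ', $line) {
        next if $term =~ m/\A\W/;      # starts with a non-alphanumeric character
        next if $term =~ m/(__)+/;     # adjacent underscores (\W misses these)
        next if $term =~ m/\d/;        # contains digits
        next if $stopwords{ lc $term };
        push @kept, $term;
    }
    return @kept;
}

# When run as a script, act as a filter from STDIN to STDOUT.
unless (caller) {
    while (my $line = <STDIN>) {
        chomp $line;
        my @kept = clean_line($line);
        print join(' ', @kept), "\n" if @kept;
    }
}
```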

  • Parse with Monty Lingua
  • python ~mcq/montylingua-2.1/python/test-6.py cleaned-student > student-monty.txt
    python ~mcq/montylingua-2.1/python/test-6.py cleaned-admin > admin-monty.txt
  • Isolate the noun-phrases
  • grep "\bnp\b" student-monty.txt > student-np.txt
    grep "\bnp\b" admin-monty.txt > admin-np.txt
  • Load data into relational database and create report
  • A Perl script from a previous assignment loads the data into a relational database and writes out a report listing the 25 most frequent terms in each set. The list can be seen here.
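The database step itself is not shown here, but the report boils down to a frequency tally over the noun-phrase terms. A minimal stand-in for that logic (a hash instead of a database; top_terms is a hypothetical helper, not the actual script) might look like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count term frequencies and return the top N as [term, count] pairs,
# most frequent first (ties broken alphabetically). This is a hash-based
# stand-in for the GROUP BY / ORDER BY query the database report runs.
sub top_terms {
    my ($n, @terms) = @_;
    my %count;
    $count{ lc $_ }++ for @terms;
    my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    splice @ranked, $n if @ranked > $n;
    return map { [ $_, $count{$_} ] } @ranked;
}
```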

Comments

The first obvious difference between the old and new lists is that 'i' now has a wide lead in both. This shows up because the word was removed from the stopword list. Yet my expectation that it would be more popular with individual authors than with administration staff has not been borne out.
The new lists have fewer occurrences of meaningless words (such as gif, 1, www), which is a sign of progress in using regular expressions to clean up the data. Otherwise, there is not much difference between the old and new lists: the new lists contain most of the top words that were in the old ones. The next step in this process would have been to incorporate a way of lumping singular and plural forms of a word together. For example, 'minority' and 'minorities' individually make it into the top 15; if they were counted as one word, their combined score would have put the term even higher in the list (top 5).
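A crude way to start lumping singular and plural forms — nothing like a real stemmer such as Porter's, just an illustration of the idea — would be a rule that rewrites '-ies' to '-y' and strips a plain trailing 's':

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fold an English plural into a rough singular form.
# Deliberately crude: handles '-ies' -> '-y' and a plain trailing 's' only;
# a real solution would use a proper stemmer.
sub fold_plural {
    my ($term) = @_;
    $term = lc $term;
    return $term if length($term) <= 3;   # leave short words ('is', 'was') alone
    return $term if $term =~ /ss\z/;      # 'class', 'press' are not plurals
    $term =~ s/ies\z/y/ or $term =~ s/s\z//;
    return $term;
}
```

Applying this before the tally would merge 'minority' and 'minorities' into a single count.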