Messeret Gebre-Kristos
Exploratory Data Analysis

Plotting with R

Problem Statement

After hw2, you should have two (or more) lists of nominal phrases comparing two (or more) sets of documents. You've stored these in a sqlite3 database. For hw3, create a back-to-back histogram showing the number of documents containing each of these phrases in both sets. Connect the sqlite3 database to R to accomplish this.

An example of a back-to-back histogram can be found at

http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=136

although this isn't exactly what your display should look like. Your display should have the nominal phrase on the same line as the histograms, which may be differentiated in any way you see fit. You must use the number of documents in which phrases occur. This number will probably be measured on the x-axis of your back-to-back histograms.

Abstract

I loaded the database from the previous assignment into the statistical software R and generated a back-to-back histogram to compare the term frequency lists for administration and student (individual) authors. The contrasting shapes of the paired histograms (a power-law curve versus jagged lines) highlight the difference in term usage between the two groups of authors. Even so, both groups use broadly similar terms when writing about diversity issues.

Data

I took the database from the previous assignment and cleaned it up a little to remove unnecessary data. Specifically, I removed additional phrases that can be considered stop words.
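The clean-up idea can be illustrated with a minimal R sketch. The actual filtering was done by the Perl script that loads the database, so this is only a stand-in: the `phrases` vector is invented for illustration, and `stop_words` holds just the extra terms named in the steps below (the single digits 1-9, gif, www, edu), not the full stop word list.

```r
# Hypothetical sample of extracted phrases (invented for illustration).
phrases <- c("diversity", "www", "7", "campus community", "gif", "edu", "students")

# Only the additional stop words named in this write-up.
stop_words <- c(as.character(1:9), "gif", "www", "edu")

# Keep a phrase only if it is not one of the stop words.
kept <- phrases[!(tolower(phrases) %in% stop_words)]
print(kept)
```

Matching on the whole phrase, as here, leaves multi-word phrases untouched; whether the Perl script matches whole phrases or individual words is not stated in this write-up.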

Steps

  • Exclude more stop words
  • Inspecting the 50 most frequent terms from the previous assignment, I found a few terms with little semantic value and added them to the stop words list. The terms added are: digits [1-9], gif, www, edu. All of these had shown up among the top 50 phrases before the data clean-up, and I think they bloated the list without adding analytic value. With the updated stop words, I ran the Perl script again to reload the database.

  • Construct the SQL statement and verify it fetches desired data
  • I wanted a table with three columns: phrase/term, term frequency in administration documents, and term frequency in student documents. After several attempts, the following SELECT statement did the trick. (See the result)

  • Load Database into R
  • I downloaded the database to my PC and worked with the version of R installed on my machine.

    Create a working folder
    > mkdir hw3
    > cd hw3

    Load the RSQLite library and instantiate the SQLite engine from the current R session. Allow a maximum of 16 concurrent connections and let it fetch up to 50 records at a time.
    > library(RSQLite)
    > m <- SQLite(max.con = 16, fetch.default.rec = 50)

    Connect to the dk.db database.
    > conn <- dbConnect(m, dbname="dk.db")

    Fetch the data and load the result set into a variable. The query is the SELECT statement described above.
    > query <- dbSendQuery(conn,"SELECT admin_count.phrase as Phrase, ...")
    > result <- fetch(query)

    Clear result set and disconnect
    > dbClearResult(query)
    > dbDisconnect(conn)

    Set background color and plot back-to-back histograms.
    > par("bg"="#999999")
    > barplot(result$"Admin Frequency", horiz=TRUE, space=0, col="#5f83ee", xlim=c(-100,200))
    > barplot(-result$"Student Frequency", horiz=TRUE, space=0, col="#FF9900", add=TRUE)

    Configure axis; add labels and legend.
    > axis(2, at=1:25, labels=result$Phrase, pos=0, col.axis="white", las=2, tick=FALSE, hadj=.5, padj=1, mgp=c(3,0,0))
    > title("Term Frequency for Administrators and Students")
    > legend(80, 15, c("Administration", "Students"), fill=c("#5f83ee", "#FF9900"), bg="white")
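The steps above can be sketched end to end without the database. This is a self-contained illustration under stated assumptions, not the code used for the assignment: the counts are invented, the data frames `admin`, `student`, and `counts` are hypothetical names, and base R's `merge()` stands in for the SQL join that produces the three-column table. The phrase labels are placed at the bar midpoints that `barplot()` returns, so the number of phrases does not have to be hard-coded.

```r
# Hypothetical per-group document counts (invented for illustration).
admin <- data.frame(Phrase = c("diversity", "campus", "community", "faculty"),
                    Admin = c(40, 25, 18, 10), stringsAsFactors = FALSE)
student <- data.frame(Phrase = c("diversity", "campus", "community", "culture"),
                      Student = c(22, 30, 9, 14), stringsAsFactors = FALSE)

# Three-column table: phrase, admin count, student count.
# merge() stands in for the SQL join; phrases missing from one group get 0.
counts <- merge(admin, student, by = "Phrase", all = TRUE)
counts$Admin[is.na(counts$Admin)] <- 0
counts$Student[is.na(counts$Student)] <- 0

# Most frequent admin phrase first, so its bar is drawn at the bottom.
counts <- counts[order(-counts$Admin), ]

# Symmetric x-range so both sides share the same scale.
limit <- max(counts$Admin, counts$Student)
old_par <- par(mar = c(4, 7, 3, 1))  # wider left margin for phrase labels

# Admin bars grow to the right; student bars are negated to grow to the left.
mids <- barplot(counts$Admin, horiz = TRUE, space = 0, col = "#5f83ee",
                xlim = c(-limit, limit), axes = FALSE)
barplot(-counts$Student, horiz = TRUE, space = 0, col = "#FF9900",
        add = TRUE, axes = FALSE)

# Show absolute counts on the x-axis and one phrase label per bar.
ticks <- pretty(c(-limit, limit))
axis(1, at = ticks, labels = abs(ticks))
axis(2, at = mids, labels = counts$Phrase, las = 1, tick = FALSE)
title("Term Frequency for Administrators and Students",
      xlab = "Number of documents")
legend("topright", c("Administration", "Students"),
       fill = c("#5f83ee", "#FF9900"), bg = "white")
par(old_par)
```

Replacing the two toy data frames with the `result` fetched from the database (using its Phrase, "Admin Frequency", and "Student Frequency" columns) should give a comparable display.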

The resulting histogram is depicted here

Comments

The bars on the right side form a smooth curve, whereas the bars on the left side are more jagged. This suggests that the terms used by administration staff and students don't quite match in frequency. Yet the long bars are generally at the bottom and the short bars tend to be at the top on both sides, implying that the discrepancy is generally not very big.

To comment on learning R: I found it very difficult at the beginning because it was a completely alien environment for me. Once I understood the basics (after reading one of the 'help' manuals), I was happy with how much I could do with R. It is clearly a powerful piece of software once you get over the steep learning curve.