Messeret Gebre-Kristos
Exploratory Data Analysis
Home › Relational Database

Relational Database

Problem Statement

Recreate your data activities in a perl script(s) with a sqlite3 database. The script(s) should help you regenerate and reload new data without having to remember your sql statements.

Abstract

I am still tackling the same question as in Assignment 1: Are the terms used by university administration officials the same as those used by private individuals? The only difference is that I am using relational database to analyze the data. I created a SQLite database with four tables, loaded the relevant data from the Diversity Kaleidoscope files and generated the report that compares most frequently used terms in both sets of documents. All of the steps in the process are reproduced in a perl script, for the purpose of automating similar activities in the futre.

Data

I used again the Diversity Kaleidoscope(DK) data.

Steps

  • Create tables
  • I created four tables in the dk.db database: university, document, phrase, author_type.

    CREATE TABLE university(id integer primary key, name text);
    CREATE TABLE author_type(id integer primary key, type text);
    CREATE TABLE document(id integer primary key, univ_id integer, type_id integer, serial_num text);
    CREATE TABLE phrase(id integer primary key, doc_id integer, phrase text);

    'author-type' is a look-up table for giving the author types meaningful names.

  • Load data:
  • To get the list of universities, I used a directory (in ~mcq/...) which has a list of text (.txt) files named after each university. I parsed the university names from the file list and saved them in a text file.

    ls ~mcq/divers/fetchconvert/combined/*.txt | cut -d \/ -f 7 | cut -d . -f 1

    The first part of the command above yields a file path like this: /u/mcq/divers/fetchconvert/combined/umich.edu
    The second part parses the file path using the '/' delimiter and takes the seventh field (umich.edu).
    The last part parses the file name using the '.' delimiter and takes the first field (umich).

    The 'author-type' table is populated with a couple of INSERT statements:

    INSERT INTO author_type (type) values("administrator");
    INSERT INTO author_type (type) values("student");

    I fill the 'document' and 'phrase' tables from text files that hold Monty Lingua output from previous assignment. The text files hold lists of noun-phrases collected from documents authored by administration staff and individuals respectively.
    Here is a partial file to demonstrate the format.

    The code snippet below is used to parse the documents, and fill the tables with the appropriate data.

    Code Snippet

  • Report output:
  • After populating all the tables, I ran SQL statements to top twenty most frequent terms used by administration staff and by individuals.

    SELECT phrase, count(phrase) as frequency FROM phrase,document,author_type WHERE document.id=phrase.doc_id AND document.type_id=author_type.id AND author_type.type='administration' GROUP BY phrase ORDER BY frequency DESC LIMIT 20;

    The same statement is run for documents authored by individuals (with type='student').
    Both lists are compiled in one file.

  • Recreate the steps in perl
  • The whole process (from creating the database to writing the report) is recreated in a perl script. The source code can be accessed here.

Comments

Because the same data is being treated for the same question as Homework 1, the results remain the same. The lists are now a little better because stop words are excluded. I like having the data in a relational database better, because I find it much easier to manipulate and generate reports quickly.