Messeret Gebre-Kristos
Exploratory Data Analysis
Home › Server Files

Server Files

Problem Statement

present information from a server log. The file photo-logs.txt contains several days worth of log data. Create a report relating the popularity of the photos (it's okay to select just the dozen most popular) to hostname, ownership, and geographic information about traffic to the photos. Feel free to invent the best way to do this, but be sure to let us see relationships. We would like to be able to compare the popularity of photos, hosts, ownership, and geographic location, not just see independent lists of these four attributes.

Abstract

From the log server file, I selected only media files and looked at the most popular files. Using GeoIP, a geolocation software from MaxMind , I obtained geographical data about the site visitors. Putting the popularity and geographical information together, it is obvious that most of the visitors were from the Ann Arbor area. Filtering out visitors from Michigan makes geopgraphical information about other visitors more visible.

Data

The data is from the server log containing all entries of http://www.mickmcquaid.com. for the dates: May 1st to 7th, 2006.

Steps

  • Import the server log file from my prof's folder (~mcq)
  • cp ~mcq/si618week4/photo-logs.txt .
  • Parse the log file and load to database
  • First some data scrubbing: I wanted to exclude all entries from webcrawlers. I also wanted to include only entires for files with the following extensions: .jpg, .ppt, .pdf, .mp3, .mov. These should cover most of the media files.

    The relevant data is then parsed and loaded into a relational database table. All the activities in this step are reproduced in a perl script.

  • Get Geographic information
  • I stored a list of all the IP addresses that form the log file in a text file and ran GeoIP on the file to get geographic data of the IP addresses. GeoIP is a software package developed by MaxMind, and used to gather geopgraphic information from IP addresses.

    Again, I ran a perl script to read the IP addresses from the input file 'ip.txt' and write the geographic information to an output file, 'locations.txt'.. (Thanks, Yarun, for explaining how to use the Geo::IP perl module)

  • Load geographic data into the database
  • I created a table and loaded it with the geographic data. CREATE table location(id integer primary key,ip text,country text,region text,city text,postal_code text); .import locations.txt location
  • Collate data from both tables for analysis
  • Now that I have the log data in one table and related geographic information in another, I want to see the geographic location of visitors who viewed the most popular media files.


    I used SQL statement to filter the data. I wanted the results to be written to a text file results.txt. Before I ran the SQL statement in the sqlite3 console, I redirected the output to the desired destiation.
    .output results.txt
    
    SELECT DISTINCT c.url,c.frequency,c.site,
       l.country,l.region,l.city,l.postal_code 
    FROM location l LEFT JOIN 
       (SELECT url,site,count(*) AS frequency 
       FROM log GROUP BY url 
       ORDER BY frequency DESC) c 
       ON c.site=l.ip 
    ORDER BY c.frequency DESC;
    
    .output stdout

    The file with frequency of url's is here.

Comments

That most visitors are from the Ann Arbor area is very obvious. This is expected because the log is about a collection of graduation images at the University of Michigan soon after the graduation ceremony. If this were my website, I would be more interested to see from what places, apart from Michigan, my pages were visited. This information is made more visible when the overwhelming data from Michigan is filtered out. A small modification to the SQL statement above gives a new results file.

Most visitors are still from the United States, although they are more disperesed among the states this time. Visitors from other parts of the world are also more easily distinguishable. Visitors from Malta, Italy, Colombia and Spain top the list excluding the United States.