Do you wish you knew how to get started in analyzing piles of data? Would you know how to retrieve those piles of data in the first place? Do you hope to have better data handling and analysis skills you can use out there in the world after graduating? Do you wish you had these skills handy for your current courses/projects?
601 aims to help students get started with their own data harvesting, processing, and aggregation. Data analysis is crucial to evaluating and designing solutions and applications, as well as to understanding users' information needs and uses. In many cases, the data we need to access is distributed online among many Web pages, stored in a database, or available in a large text file. Often these data (e.g., Web server logs) are too large to obtain and/or process manually. Instead, we need an automated way to gather the data, parse it, and summarize it before we can do more advanced analysis. In this course, you will learn to use Perl and its modules to accomplish these tasks in a way that is quick and easy yet useful and repeatable. The companion half of this half-semester course, SI 618: "Exploratory Data Analysis," teaches how to further glean insights from the data through analysis and visualization.
We will be using ManyEyes to visualize data and share and discuss our results. Please sign up at ManyEyes (you'll need to have or obtain an IBM ID) and join the DRAT topic hub.
In addition to a beginning Perl book, we'll be using Dave Cross's "Data Munging with Perl," which is specific to the kind of data manipulation we have in mind.
Here is a preliminary syllabus:
week 1: introduction to Perl
We'll get started with Perl & ManyEyes, doing some light scripting and visualization.
- Cozens Ch 1-4: First steps in Perl, working with simple values, lists and hashes, loops and decisions
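The Cozens Ch 1-4 topics can be previewed in a few lines. This is a minimal sketch; the week numbers and topic strings are invented for illustration:

```perl
#!/usr/bin/perl
# First steps: a scalar, a list, a hash, a loop, and a decision.
use strict;
use warnings;

my $course = "SI 601";                      # a simple scalar value
my @weeks  = (1 .. 7);                      # a list
my %topics = (1 => "Perl basics",
              4 => "retrieving HTML");      # a hash (associative array)

foreach my $w (@weeks) {                    # a loop
    if (exists $topics{$w}) {               # a decision
        print "$course week $w: $topics{$w}\n";
    }
}
```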
week 2: extracting what you want from textual data
We'll become even more familiar with Perl while learning to parse text: using the split function for columnar text data, and using regular expressions to extract parts of text (e.g., dates and times).
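Both techniques fit in a short script. This is a sketch with a made-up tab-separated log line:

```perl
#!/usr/bin/perl
# split for columnar text, then a regular expression for finer extraction.
use strict;
use warnings;

my $line = "alice\t2008-09-15\tlogged in";          # invented sample record
my ($user, $date, $action) = split /\t/, $line;     # columnar text -> fields
print "$user did '$action' on $date\n";

# pull year, month, and day out of the date field with a regex
if ($date =~ /(\d{4})-(\d{2})-(\d{2})/) {
    print "year=$1 month=$2 day=$3\n";
}
```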
week 3: handling large and/or sensitive data, generating summary statistics
We will cover file I/O, compression, and encryption, as well as how to create anonymized versions of the data using hash functions (and discuss circumstances when obscuring IDs is not enough). We will also cover use of associative arrays in Perl for key-value storage (optionally tying it to a Berkeley DB database file for fast future access). We will use associative arrays and simple counter variables to generate summary statistics about the data on the fly.
- Cross Ch 6: Record-oriented data
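The anonymization and on-the-fly counting described above can be sketched as follows. The records are invented, and the salt is a placeholder; Digest::SHA is a core module. (Tying %page_views to a Berkeley DB file via DB_File would make the counts persistent.)

```perl
#!/usr/bin/perl
# Obscure user IDs with a hash function and tally page views
# in an associative array as we stream through the records.
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my $salt = "change-me";     # without a salt, common IDs can be guessed by rehashing
my %page_views;             # associative array used as a counter

while (my $line = <DATA>) {
    chomp $line;
    my ($user, $page) = split /,/, $line;
    my $anon = substr(sha256_hex($salt . $user), 0, 12);   # anonymized ID
    $page_views{$page}++;                                  # summary stat on the fly
    print "$anon viewed $page\n";
}

foreach my $page (sort keys %page_views) {
    print "$page: $page_views{$page} views\n";
}

__DATA__
alice,/home
bob,/home
alice,/search
```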
week 4: retrieving HTML content
Much good data is out there on the web, but manual retrieval is often prohibitively time-consuming. In this week we will learn how to automate retrieval, using some existing utilities such as wget and HTTrack, which, given starting URLs, will retrieve those pages and other pages linked from the start pages out to a certain depth. We will also use Perl's LWP::UserAgent module to finesse our retrieval or to retrieve specific dynamic content. We will learn how to convert HTML to text (using the HTML::Parser module) and to extract the hyperlinks (using the HTML::LinkExtor module). Finally, we will learn to be polite about our crawling by respecting robots.txt files and slowing down the crawl.
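Here is a sketch of the fetch-and-extract pattern. The HTML snippet is invented so the example is self-contained; in a real crawl you would parse the response body instead, and LWP::RobotUA is a drop-in LWP::UserAgent variant that honors robots.txt for you:

```perl
#!/usr/bin/perl
# Fetch pages with LWP::UserAgent; pull out hyperlinks with HTML::LinkExtor.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my $ua = LWP::UserAgent->new(agent => 'si601-crawler/0.1');

my @links;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;                  # called once per link-bearing tag
    push @links, $attr{href} if $tag eq 'a' && $attr{href};
});

# Real crawl: my $resp = $ua->get($url);
#             $parser->parse($resp->decoded_content);
#             sleep 1;   # throttle between requests to be polite
my $html = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>';
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```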
week 5: retrieving XML content & using web services
Sometimes we are lucky in that the data we are seeking is placed by the content provider into a structured XML container through a web service. For example, websites providing information about movie showtimes, current weather, or other data retrieved from a database at their end can provide structured information in XML format (e.g., what the movie is, where it's playing, and at what time). Websites usually do this so that other web services can inter-operate with them, but it also provides a great way for us to get precisely the data we are looking for. We will learn how to
(a) parse XML to get at the data & work with objects in Perl
(b) use SOAP::Lite in Perl to access the web service in the first place
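Step (a) can be sketched with XML::Simple, which turns an XML document into nested Perl data structures. The showtimes XML below is invented for illustration:

```perl
#!/usr/bin/perl
# Parse an XML response into a Perl data structure and walk it.
use strict;
use warnings;
use XML::Simple;

my $xml = <<'XML';
<showtimes>
  <movie title="WALL-E">
    <showing theater="State" time="7:30pm"/>
    <showing theater="Michigan" time="9:45pm"/>
  </movie>
</showtimes>
XML

# ForceArray keeps single elements as one-item lists, so the loop
# below works whether there is one movie/showing or many.
my $data = XMLin($xml, ForceArray => [ 'movie', 'showing' ]);

foreach my $movie (@{ $data->{movie} }) {
    foreach my $s (@{ $movie->{showing} }) {
        print "$movie->{title} at $s->{theater}, $s->{time}\n";
    }
}
```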
week 6: retrieving data from a SQL database
What if the data is stored in a SQL database that someone is nice enough to grant you access to? We will focus on using existing databases to gather summary statistics and to retrieve specific data (the creation of one’s own database will be covered in the companion course):
a) short intro to SQL: counting and selecting
b) interfacing with SQL through Perl
reading: Using MySQL with Perl
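Parts (a) and (b) come together in DBI, Perl's common database interface. So that this sketch is self-contained it uses an in-memory SQLite database with invented rows; against MySQL only the connect string would change (e.g., "dbi:mysql:database=..."):

```perl
#!/usr/bin/perl
# Counting and selecting from SQL through Perl's DBI module.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

$dbh->do("CREATE TABLE visits (user TEXT, page TEXT)");
my $ins = $dbh->prepare("INSERT INTO visits VALUES (?, ?)");
$ins->execute(@$_)
    for (['alice', '/home'], ['bob', '/home'], ['alice', '/search']);

# a summary statistic: visits per page
my $sth = $dbh->prepare(
    "SELECT page, COUNT(*) FROM visits GROUP BY page ORDER BY page");
$sth->execute;

my %counts;
while (my ($page, $n) = $sth->fetchrow_array) {
    $counts{$page} = $n;
    print "$page: $n\n";
}
$dbh->disconnect;
```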
week 7: fun with large data, Google and Technorati APIs
In this final week we will bring together the techniques learned so far. Using one or two large data sets (e.g., web traffic or social network data), we will flex our muscles and demonstrate how to access, parse, extract, and summarize them.
We will also have some fun with the Google and Technorati APIs to get query results and do some "fun" hacks, again illustrating the techniques learned so far.
- Dornfest et al. Google Hacks Ch 1: Web