Resources

Here are resources I've generated in the course of my work. I welcome questions and suggestions for improving these resources.

software

SnowCrawl

SnowCrawl is a Python library for directed web crawls. Features include saved state for backup, support for threading and a client-server architecture, and a good deal of flexibility.


SnowCrawl is open source and hosted on Google Code.

data

A census of the political web

An index of virtually every English-language political site on the web. This index contains more than 1.8 million web sites, crawled and classified by language (English/non-English) and political content. Of these, roughly 800,000 are political sites. This automated snowball census was conducted on 8/1/2010.

The complete index (107 MB, zipped csv)
1% sample of the index

For a description of the process used to generate this census, please see my working paper: An automated snowball census of the political web.
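To give a rough feel for what a snowball census involves, here is a minimal sketch in Python. It is not the procedure from the working paper: the link graph, the site names, the `is_political` stand-in for a trained classifier, and the choice to follow links only from sites classified as political are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy link graph: site -> sites it links to.
LINKS = {
    "seed.example": ["blog-a.example", "shop.example"],
    "blog-a.example": ["blog-b.example", "seed.example"],
    "blog-b.example": ["news.example"],
    "shop.example": ["ads.example"],
    "news.example": [],
    "ads.example": [],
}

# Hypothetical labels standing in for the output of a trained classifier.
POLITICAL = {"seed.example", "blog-a.example", "blog-b.example", "news.example"}


def is_political(site):
    """Placeholder classifier: look the site up in the hand-made label set."""
    return site in POLITICAL


def snowball_census(seeds, get_links, classify):
    """Breadth-first snowball: classify each site once, expand only positives."""
    index = {}  # site -> True/False classification
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        if site in index:
            continue  # already classified
        index[site] = classify(site)
        if index[site]:
            # Assumption: only political sites contribute new candidates.
            for neighbor in get_links(site):
                if neighbor not in index:
                    queue.append(neighbor)
    return index


census = snowball_census(["seed.example"], lambda s: LINKS.get(s, []), is_political)
```

In this toy run, `census` records every site the crawl reached along with its classification. Sites linked only from non-political pages (here `ads.example`) are never visited, which is what keeps a snowball crawl focused on one topic.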

demos

Python text classification demo

This 99-line Python script trains a text classifier to recognize the difference between Dracula and The Adventures of Huckleberry Finn. It checks accuracy using percent agreement, and generates output that can be used to create a text cloud. The code is lightweight and heavily commented, which makes it an easy introduction to NLP in Python.

Download here
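For readers who want a taste before downloading, here is a separate, much smaller sketch of the same idea. It is not the script itself: the naive Bayes approach with add-one smoothing, the function names, and the toy sentences standing in for passages from the two novels are all my assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Count words per label; labeled_docs is a list of (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        word_counts.setdefault(label, Counter()).update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def percent_agreement(predicted, actual):
    """Percentage of items where the predicted label matches the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)


# Toy stand-ins for passages from the two novels (invented for illustration).
training = [
    ("the count rose from his coffin in the castle at night", "dracula"),
    ("blood and garlic and the vampire in the dark castle", "dracula"),
    ("we drifted down the river on the raft with jim", "huck"),
    ("the raft floated past the town on the big river", "huck"),
]
held_out = [
    ("the vampire returned to the castle", "dracula"),
    ("jim and i slept on the raft", "huck"),
]

word_counts, doc_counts = train(training)
predictions = [classify(text, word_counts, doc_counts) for text, _ in held_out]
accuracy = percent_agreement(predictions, [label for _, label in held_out])
```

Percent agreement here is simply the share of held-out passages whose predicted label matches the true one. With real text, the per-label word counts in `word_counts` are also the kind of output one could feed into a text cloud.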