toolkits

... are nifty software packages, tutorials, etc. that I use to get my work done. I hope these can help you too.

Stata

As much as I like the flexibility and frugality of open source, it's hard to argue with Stata as the workhorse for basic statistics in the social sciences. Yes, SAS is nice too, but it's much more of a commitment to learn your way around the program. No -- SPSS is not a legitimate choice for real quantitative researchers any more. It's a dinosaur. Do not use it unless you're strapped to a coauthor who can't use anything else.


Unfortunately, except for this plug, I have no good non-commercial resources to edify would-be Stata users. Don't forget to use a log / do file for everything!

R

... is an open-source competitor to Stata. I use both programs. Stata is my go-to, but I often fall back on R for its specialized packages and its ability to talk to python (see rpy below). The R homepage is here.


R Pros:

R Cons:


python

... is a powerful, flexible, popular, and extremely well-documented programming language. python is one of three languages approved by Google for Google work at the Googleplex. I use it for web spidering, natural language processing, data aggregation too large or complicated to do in Stata, and cross-platform statistical work.
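
To give a flavor of the data-aggregation side, here is a toy sketch -- the CSV layout and column names are invented for illustration, but this is the kind of collapsing-by-group that gets unwieldy in a stats package:

```python
# Sketch: aggregate a (hypothetical) survey CSV by state.
# The column names and values are made up for illustration.
import csv
import io
from collections import defaultdict

# Stand-in for a real file on disk.
raw = """state,vote_intent
OH,1
OH,0
PA,1
PA,1
"""

totals = defaultdict(lambda: [0, 0])  # state -> [yes votes, respondents]
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["state"]][0] += int(row["vote_intent"])
    totals[row["state"]][1] += 1

shares = {state: yes / n for state, (yes, n) in totals.items()}
print(shares)  # {'OH': 0.5, 'PA': 1.0}
```

Swap the string for an open file handle and this scales to files far larger than memory-hungry alternatives handle comfortably.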


Get started with python at the official python website -- it's all here. Download and installation are absolutely painless.

rpy

... is a python library that lets you call R functions from python. I highly recommend it for graphing and Monte Carlo work. Here is the rpy sourceforge site, where you can download the latest version. The official documentation is pretty technical; you're probably better off starting with this tutorial.
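
A sketch of the flavor, from the tutorial -- this needs R and the rpy package installed to actually run, so treat the details as approximate:

```python
# Sketch only: requires R plus the rpy package. R functions are
# exposed as attributes of the r object.
from rpy import r

draws = r.rnorm(1000)   # call R's rnorm() from python
print(r.mean(draws))    # should be near 0
r.hist(draws)           # let R handle the graphing
```

The appeal for Monte Carlo work is that python drives the simulation loop while R supplies the distributions and plots.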

LaTeX

... is a markup language for producing nicely formatted documents. It's especially good at mathematical notation: equations, proofs, tables, etc. It's also good at managing citations, footnotes, and so on. I am desperately hoping that LaTeX will save me from the sleepless reformatting nightmares I've seen other PhD students go through with their dissertations.
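
For the uninitiated, here is a taste of what a LaTeX source file looks like -- a minimal document of my own invention, not drawn from any particular tutorial:

```latex
\documentclass{article}
\begin{document}

Consider the OLS estimator:
\begin{equation}
  \hat{\beta} = (X'X)^{-1}X'y
\end{equation}

Equations, footnotes,\footnote{Like this one.} and section
numbers are all renumbered automatically when you revise.

\end{document}
```

That automatic renumbering is precisely what saves you from the reformatting nightmares.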


Basic resources for getting started in LaTeX:

public data

The American National Election Study (NES)

The longest running study of electoral opinions and demographics in the world. The NES has been asking Americans who they plan to vote for, how they feel about politics, where they get their news, and so on for more than 50 years. This data set (available for instant public download) is a treasure trove of knowledge about how Americans relate to their government.

The NES study page is here.

The National Annenberg Election Study (NAES)

In every presidential election since 2000, the Annenberg election study has fielded an enormous rolling cross-section survey. Unlike the NES, in which respondents are interviewed once before the election and once after, the Annenberg study interviews dozens to hundreds of people every day of the campaign, with total sample sizes numbering in the hundreds of thousands. This makes the study an ideal source of information about the dynamics of campaigns -- when opinions formed, shifted, and crystallized.

The Annenberg study page is here.
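
The payoff of the rolling cross-section design is that you can collapse interviews to a daily time series. A toy sketch -- the records and field names here are invented, not the NAES codebook's:

```python
# Sketch: collapse a rolling cross-section to daily means.
# The interview records and field names are made up.
from collections import defaultdict

interviews = [
    {"date": "2004-10-01", "fav": 52},
    {"date": "2004-10-01", "fav": 48},
    {"date": "2004-10-02", "fav": 55},
]

by_day = defaultdict(list)
for person in interviews:
    by_day[person["date"]].append(person["fav"])

daily_means = {day: sum(vals) / len(vals)
               for day, vals in sorted(by_day.items())}
print(daily_means)  # {'2004-10-01': 50.0, '2004-10-02': 55.0}
```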

The Cooperative Congressional Election Study (CCES)

The CCES is a collaboration among universities to study congressional elections. "For each survey of 1,000 persons, half of the questionnaire is developed and controlled by each individual research team, and half of the questionnaire is devoted to Common Content." Each team has access to their own questions, and the common content is made publicly available. With 30-some-odd teams in any given election cycle, the total sample is broad enough to shed light on national and district-specific electoral dynamics.

The (somewhat confusing) CCES study site is here.

Time-sharing Experiments in the Social Sciences (TESS)

TESS is a platform for conducting survey experiments. Time shares are free to researchers, and allocated by a merit-based review process. Study proposals and data are made publicly available after six months.

The TESS home page is here.

thomas

The online archives of the Library of Congress are available at thomas.loc.gov. Every bill, resolution, vote, committee report, or treaty on the public record in the last ~20 years is available here. So is the full text of the Congressional record -- every word spoken on the floor of the House or Senate. thomas is probably the best record of the theatre and maneuver of democracy ever assembled.

http://thomas.loc.gov/
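
If you end up spidering thomas, an early chore is pulling bill numbers out of raw text. A sketch -- the sample sentence is made up, and real pages are messier:

```python
# Sketch: pull House/Senate bill numbers out of text with a regex.
# The sample sentence is invented; real thomas pages need more care.
import re

text = "The House passed H.R. 3200 before taking up S. 1679 and H.R. 676."
bills = re.findall(r"\b(?:H\.R\.|S\.)\s*\d+", text)
print(bills)  # ['H.R. 3200', 'S. 1679', 'H.R. 676']
```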

projects

... are utilities and data sets that I've assembled for public use. Everything here is in permanent beta -- it seems good enough to share, but comes with no guarantees. For each project, I will try to document functionality and known bugs and glitches. But I take no responsibility if bugs in my code destroy your computer, inaccuracies in my data ruin your career, misspellings in my comments offend your sensibilities, or anything else here makes you miserable in any way. These are the risks you take.


Of course, if by some miracle my stuff works for you, I'd appreciate credit and citations.

python text classification demo

This 99-line python script trains a text classifier to recognize the difference between Dracula and The Adventures of Huckleberry Finn. It checks accuracy using percent agreement, and generates output that can be used to create a text cloud. The code is lightweight and heavily commented -- perfect for an easy introduction to NLP in python.

Download here
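
The core trick behind the demo fits in a few lines: count word frequencies per author, then score a new passage by which counts fit better. A toy sketch of that idea -- the training texts below are tiny stand-ins, not the actual novels, and this is not the downloadable script itself:

```python
# Toy sketch of a naive Bayes-style word-count classifier.
# Training texts are tiny stand-ins for the real novels.
import math
from collections import Counter

train = {
    "dracula": "the castle was dark and the count waited in the dark",
    "huck": "the river was wide and we drifted down the river on the raft",
}

counts = {label: Counter(text.split()) for label, text in train.items()}
vocab = set().union(*counts.values())

def score(passage, label):
    """Log-probability of the passage under the label's word counts,
    with add-one smoothing so unseen words don't zero everything out."""
    c = counts[label]
    total = sum(c.values())
    return sum(math.log((c[w] + 1) / (total + len(vocab)))
               for w in passage.split())

def classify(passage):
    return max(counts, key=lambda label: score(passage, label))

print(classify("the raft drifted on the river"))  # huck
```

Trained on the full novels instead of these scraps, the same logic is what separates Dracula from Huck Finn.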


A census of the political web

An index of virtually every English-language political site on the web. This index contains more than 1.8 million web sites, crawled and classified by language (English/non-English) and political content. Of these, roughly 600,000 are political sites. This automated snowball census was conducted 8/1/2010.

Note: documentation, source code, and evaluation statistics (reliability, precision, recall, etc.) will be forthcoming shortly. Please contact me with any questions.
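
For readers unfamiliar with the term, a snowball census starts from seed sites, follows their outbound links, and classifies each new site as it turns up. A toy sketch with a stubbed-in link graph and classifier standing in for real crawling and real classification:

```python
# Toy sketch of a snowball crawl. The link graph and classifier
# are invented stand-ins; the real census fetches and parses HTML.
from collections import deque

links = {  # invented link graph: site -> sites it links to
    "seed.org": ["blog-a.com", "shop.com"],
    "blog-a.com": ["blog-b.com", "seed.org"],
    "blog-b.com": [],
    "shop.com": [],
}

def is_political(site):
    return site.startswith(("seed", "blog"))  # stand-in classifier

seen, queue, political = set(), deque(["seed.org"]), []
while queue:
    site = queue.popleft()
    if site in seen:
        continue
    seen.add(site)
    if is_political(site):
        political.append(site)
    queue.extend(links.get(site, []))

print(sorted(political))  # ['blog-a.com', 'blog-b.com', 'seed.org']
```

The `seen` set is what keeps the crawl from looping forever on sites that link back to each other.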

The complete index (107 MB, zipped csv)
1% sample of the index