SI 544 Introduction to Statistics and Data Analysis


Readings, assignments, etc. will be posted to the course CTools website.

Lada Adamic



Fall 2010:

Lectures will be Tuesdays and Thursdays from 8:30 to 10:00 am.
Location: NQ1255

Office hours:
TuTh 10:00-10:30 and Fri 2-3pm in NQ4360

PS7 Simple linear regression and correlation



1. Campaign contributions and state population (100 pts)

For this question, please download the following data from CTools. The data were obtained from http://www.opensecrets.org/races/index.php and contain the 2009-10 campaign contributions, which I aggregated to the candidate level. You will further aggregate them to the state level and merge them with state population data as follows:

# read the candidate-level contributions and the state populations
contribbycand = read.table("fedcampaigncontrib2010.dat", header=TRUE)
population = read.table("population.txt", header=TRUE)

# total the contributions for each combination of race type and state
bystateandtype = aggregate(contribbycand$amount, by=list(contribbycand$type, contribbycand$state), sum)
colnames(bystateandtype) = c("type", "state", "amount")

# one data frame per chamber, with the population columns attached by state
senatebypopulation = merge(population, subset(bystateandtype, type=="senate"), by="state")
housebypopulation = merge(population, subset(bystateandtype, type=="house"), by="state")
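
To see what the aggregate and merge steps produce, here is a small sketch on made-up data. The data frames contrib_toy and population_toy, their values, and the column name population are all assumptions for illustration; the real population.txt may name its columns differently.

```r
# Toy stand-ins for the real files (all values are invented)
contrib_toy <- data.frame(
  state  = c("MI", "MI", "MI", "OH", "OH"),
  type   = c("house", "house", "senate", "house", "senate"),
  amount = c(100, 250, 400, 300, 500)
)
population_toy <- data.frame(state = c("MI", "OH"),
                             population = c(9.9, 11.5))  # hypothetical units

# Sum the amounts within each (type, state) group
by_group <- aggregate(contrib_toy$amount,
                      by = list(contrib_toy$type, contrib_toy$state), sum)
colnames(by_group) <- c("type", "state", "amount")

# Keep the house rows and attach the population column by state
house_toy <- merge(population_toy, subset(by_group, type == "house"), by = "state")
print(house_toy)
```

The two Michigan house rows collapse into a single row with their summed amount, and the merge leaves one row per state.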


A. (20 pts) Run a simple linear regression that models the total amount donated to House candidates in each state as a function of the state's population (using summary(lm())). From the resulting output, report and interpret the slope of the regression line and the coefficient of determination. Can you reject the null hypothesis that there is no correlation between a state's population and the amount of money its candidates receive collectively?
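
As a reminder of where these quantities live in the summary(lm()) output, here is a minimal sketch on synthetic data. The data frame toy, its column names, and the numbers used to generate it are assumptions and have nothing to do with the real contributions.

```r
set.seed(1)
# Synthetic data: 50 hypothetical states, population in arbitrary units
toy <- data.frame(population = runif(50, 0.5, 35))
toy$amount <- 2e6 * toy$population + rnorm(50, sd = 5e6)

fit <- lm(amount ~ population, data = toy)
s <- summary(fit)

slope   <- coef(s)["population", "Estimate"]  # change in amount per unit of population
p_slope <- coef(s)["population", "Pr(>|t|)"]  # p-value for H0: slope = 0
r2      <- s$r.squared                        # fraction of variance explained

# In simple linear regression, R^2 equals the squared Pearson correlation
print(all.equal(r2, cor(toy$population, toy$amount)^2))
```

With a single predictor, the t-test on the slope gives the same p-value as cor.test() on the two variables, so the slope's p-value also answers the question about correlation.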

B. (10 pts) Overlay the fitted regression line, the confidence interval, and the prediction interval on a scatter plot of the House campaign data.
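
One possible way to draw such bands uses predict() with its interval argument, sketched here on synthetic data. The data frame toy, the evaluation grid, and the colors and line types are illustrative assumptions.

```r
set.seed(2)
# Synthetic data, not the real contributions
toy <- data.frame(population = runif(50, 0.5, 35))
toy$amount <- 2e6 * toy$population + rnorm(50, sd = 5e6)
fit <- lm(amount ~ population, data = toy)

# Evaluate the fit on a sorted grid so the bands draw as smooth curves
grid <- data.frame(population = seq(min(toy$population), max(toy$population),
                                    length.out = 100))
conf_band <- predict(fit, newdata = grid, interval = "confidence")
pred_band <- predict(fit, newdata = grid, interval = "prediction")

plot(amount ~ population, data = toy)
abline(fit)  # fitted regression line
matlines(grid$population, conf_band[, c("lwr", "upr")], lty = 2, col = "blue")
matlines(grid$population, pred_band[, c("lwr", "upr")], lty = 3, col = "red")
```

The prediction band is always wider than the confidence band because it also accounts for the scatter of individual observations around the line.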

C. (10 pts) For a state with the same population as Michigan, what is the predicted average amount donated to its candidates running for House seats in Congress?
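
A prediction at a single predictor value can be sketched as follows, again on synthetic data. The value x0 is a hypothetical placeholder rather than Michigan's actual population, and the names toy and x0 are assumptions.

```r
set.seed(3)
# Synthetic data, not the real contributions
toy <- data.frame(population = runif(50, 0.5, 35))
toy$amount <- 2e6 * toy$population + rnorm(50, sd = 5e6)
fit <- lm(amount ~ population, data = toy)

x0 <- 10  # hypothetical population, in the same units as the predictor
est <- predict(fit, newdata = data.frame(population = x0),
               interval = "confidence")  # interval for the mean response at x0
print(est)

# The point estimate is simply intercept + slope * x0
manual <- unname(coef(fit)[1] + coef(fit)[2] * x0)
print(all.equal(unname(est[1, "fit"]), manual))
```

Using interval = "confidence" is appropriate when the question is about the average response at x0; interval = "prediction" would describe a single new state instead.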

D. (20 pts) Run the same regression for the Senate campaign data (no need to submit a plot, but it may be instructive to look at one). Compare the R² (coefficient of determination) of your models of Senate and House campaign contributions. Use your knowledge of American government (or acquire it as appropriate) to explain the difference.

E. (10 pts) For the model of House campaign contributions, plot the residuals of your linear model as a function of state population.

F. (10 pts) In addition, use a qqnorm() plot to evaluate how close to normally distributed the residuals are. Interpret the plot.

G. (20 pts) Test the following hypothesis using the contribbycand data frame. H0: On average, Senate and House candidates obtain equal amounts of money. Show your analysis and a boxplot, and interpret your result.
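
A comparison of two group means might look like the sketch below, which uses synthetic candidate-level data. The data frame cand_toy, the group sizes, and the log-normal parameters are assumptions; choosing Welch's t-test, with a rank-based test as a cross-check, is one reasonable option and not the only acceptable analysis.

```r
set.seed(6)
# Synthetic, right-skewed amounts for two hypothetical groups of candidates
cand_toy <- data.frame(
  type   = rep(c("house", "senate"), times = c(80, 30)),
  amount = c(rlnorm(80, meanlog = 13, sdlog = 1),
             rlnorm(30, meanlog = 14, sdlog = 1))
)

# Side-by-side boxplots; a log axis keeps skewed amounts readable
boxplot(amount ~ type, data = cand_toy, log = "y")

# Welch two-sample t-test (R's default, no equal-variance assumption)
tt <- t.test(amount ~ type, data = cand_toy)
print(tt)

# Rank-based alternative that is less sensitive to skew and outliers
wt <- wilcox.test(amount ~ type, data = cand_toy)
print(wt$p.value)
```

If the amounts are heavily skewed, it is worth checking whether the conclusion changes when the test is run on log(amount) or with the rank-based test.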