UP504 • Multiple Regression
last updated: Wednesday, January 16, 2008 
Assignment One: due Monday, Jan 28
See also: regression notes and additional comments on the assignment [link]
Use the data set provided to run multiple regression using SPSS. Begin by examining summary statistics, scatterplots and correlation matrices to become familiar with the data. (This will strengthen both your model building and your interpretation of the results.) Refer to the examples from class, the Lewis-Beck book, a statistics book, the class notes, and other resources on regression for guidelines on developing a model, testing for significance, and checking whether the assumptions for regression have been met. Remember that a good model is not one with simply a good fit, but instead a model that makes sense theoretically and logically explains variation in the dependent variable.
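If you want to see what a correlation matrix actually computes, outside of the SPSS menus, here is a minimal Python sketch. The variable names and the five rows of values are made up for illustration; they are NOT taken from the assignment data set.

```python
import math

# Hypothetical values for five census tracts (NOT from the assignment data set).
data = {
    "median_value": [120000.0, 185000.0, 95000.0, 240000.0, 160000.0],
    "median_income": [42000.0, 61000.0, 35000.0, 88000.0, 55000.0],
    "pct_owner_occupied": [55.0, 70.0, 30.0, 85.0, 60.0],
}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of columns."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}


matrix = correlation_matrix(data)
for a in data:
    # One printed row per variable, one column per variable.
    print(a.ljust(20), "  ".join(f"{matrix[(a, b)]:6.3f}" for b in data))
```

When reading such a matrix, look at two things: how strongly each candidate independent variable correlates with the dependent variable, and how strongly the independent variables correlate with each other (a possible sign of multicollinearity).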
You are to work in teams of two students. (Note: in group
projects, you are to turn in a single, integrated write-up. All group members
receive the same grade.)
What explains the variation in median housing value by census tract in 2000 (as reported in the 2000 Census)?
Dependent variable: median housing value
We will provide two data sets for two counties in Michigan: Washtenaw County and Wayne County. You may choose either county. (The Washtenaw data set is already available, and we are in the process of cleaning and loading the Wayne County data set; availability TBA.) Note: for maps of the region, see SEMCOG.
                    Washtenaw County                          Wayne County
file name           up504w08assign1wash.sav                   [to be added]
data set location   sdcamp/Public/up504/up504w08assign1.sav
(Note: the suffix .sav indicates an SPSS data file)

FOUR WAYS TO ACCESS THE DATA SET (simply use the one that is easiest for you. Please let us know if any of these don't work and/or if you find a better way.):

1. Double-click on the file name in this directory

2. Download the zip (compressed) version of the file [Note: when you download and unzip/expand this file, you may rename it anything you would like, but be sure that the file has the extension ".sav" so that SPSS can recognize the file as an SPSS data file.]

3. Use Fugu (Mac) or SSH Secure Shell (Windows) to download this data. Host name is "login.itd.umich.edu". User Name is your uniqname. Once you open this connection, select (in the menu) "Operation" > "Go to Folder". Here you will see the hierarchical address for your IFS space. Edit this address to read: /afs/umich.edu/user/s/d/sdcamp/Public/up504. You should now see the data file. Drag this file to where you want it, or right-click to see download options.

4. Use mfile to access the data set at: https://mfile.umich.edu/?path=/afs/umich.edu/user/s/d/sdcamp/Public/html/up504/assign1w08



Thematic Maps of Data: see the thematic maps. (Note how the housing values vary dramatically across census tracts, with Ann Arbor having both some of the highest and lowest values in the county. Note also the patterns in Ypsilanti.)

Notes on data: 
NOTE 1: There are 97 census tracts in the data set (from Census Tract 4001 to 4660). For example, the Art & Architecture building is located in Tract 4022.
NOTE 2: Census Tract 4229 is a rather unusual place (the location of a correctional facility with no household income), and thus you might treat it as an outlier (you can either delete this case or instruct SPSS to exclude it from the analysis). There may be other individual tracts with peculiar characteristics.
You may obtain maps of Washtenaw County census tracts (.pdf format) at:


Data source: We extracted the data from the 2000 U.S. Census, Summary Files 1 and 3 (that is, from answers to both the short-form and long-form Census questionnaires). US CENSUS LINKS:
WHAT TO TURN IN:
1. a statement of the research question and hypothesis
2. the SPSS output of your final model (you do NOT need to turn in the SPSS output from all your exploratory, early version models)
3a. the regression equation
3b. an example of how you would use this equation to make a prediction; use
the actual values from one of the cases and calculate the value for
the dependent variable. Then compare this predicted value to the actual value
for this case. How close an estimate was it?
4. an answer to your research question.
5. A concise discussion of the strengths and shortcomings of your model. Elements of the discussion might include:
You do NOT need to turn in exploratory histograms, scatterplots, descriptive
statistics, the raw data (even if you created new variables), etc. (If you create
new variables, simply provide a concise definition of them somewhere.) You may
well have generated a massive number of pages of output; turn in
only those pages that relate to your final model (and keep the rest as souvenirs).
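For item 3b above (using the regression equation to make a prediction for one case), the arithmetic can be sketched as follows. The intercept, coefficients, variable names, and tract values here are all hypothetical; substitute the B values from your own SPSS output and the values from one of your actual cases.

```python
# Hypothetical fitted equation (NOT results from the assignment data):
#   median_value = 20000 + 2.5 * median_income + 300 * pct_owner_occupied
intercept = 20000.0
coefficients = {"median_income": 2.5, "pct_owner_occupied": 300.0}

# Hypothetical values for one census tract, including its actual median value.
tract = {
    "median_income": 50000.0,
    "pct_owner_occupied": 60.0,
    "median_value": 170000.0,
}


def predict(intercept, coefficients, case):
    """Plug one case's values into the regression equation."""
    return intercept + sum(b * case[name] for name, b in coefficients.items())


predicted = predict(intercept, coefficients, tract)
residual = tract["median_value"] - predicted          # actual minus predicted
pct_error = 100.0 * residual / tract["median_value"]  # error as a share of actual

print(f"predicted = {predicted:,.0f}")
print(f"actual    = {tract['median_value']:,.0f}")
print(f"residual  = {residual:,.0f} ({pct_error:.1f}% of actual)")
```

The residual (actual minus predicted) answers the question "How close an estimate was it?" for that one case.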
"How many variables in the model do I need?" A good question. The simple answer
is: at least 2 independent variables (this is a multiple regression assignment,
after all). There is a tradeoff between a simple model and a complex model
(with more explanatory power). Going from two to three variables may raise the
R-square significantly, and is useful. But eventually the additional explanatory power
of each additional variable diminishes, and the model is no longer easily understood.
(Call this diminishing returns.) You want a final model that is not only strong
and statistically significant, but also makes theoretical sense and is parsimonious
(that is, an economical, frugal use of variables). In practical terms, I would
expect that many of you would develop models with 2-5 interval variables, plus
one or more dummy (nominal) variables.
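One way to see the diminishing returns described above is to compare R-square with adjusted R-square, which SPSS reports alongside R-square in the Model Summary table and which penalizes each added variable. The sketch below uses the standard adjusted R-square formula with n = 97 tracts; the R-square values themselves are invented for illustration and are NOT results from the data set.

```python
N_TRACTS = 97  # number of census tracts in the Washtenaw data set

# Hypothetical R-square values as independent variables are added (NOT real results).
r_square_by_k = {2: 0.55, 3: 0.63, 4: 0.66, 5: 0.67, 6: 0.672}


def adjusted_r_square(r_square, n_cases, n_predictors):
    """Adjusted R-square: shrinks R-square according to the number of predictors."""
    return 1.0 - (1.0 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)


adjusted = {k: adjusted_r_square(r2, N_TRACTS, k) for k, r2 in r_square_by_k.items()}

for k in r_square_by_k:
    print(f"{k} variables: R-square = {r_square_by_k[k]:.3f}, "
          f"adjusted R-square = {adjusted[k]:.3f}")
```

In this made-up series, R-square keeps creeping upward with every added variable, but adjusted R-square is slightly lower with six variables than with five: the sixth variable adds less explanatory power than it costs in parsimony.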
Re outliers and nonlinear transformations: At this stage in learning regression it is simply important to know how both outliers and nonlinear relationships can affect or distort your model (and lower your R-square as well). You can look for signs of outliers and nonlinear relationships both in scatterplots and in the residual analysis (see the options in the regression menu under plots and residuals). But I am NOT expecting you to necessarily do anything about outliers and nonlinear relationships unless they are really obvious. In other words, do pay attention to these issues, but we will not take any points off if you don't do any nonlinear transformations or address outliers.
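As a rough illustration of what a residual analysis looks for, the sketch below standardizes each residual (residual divided by the standard error of the estimate) and flags cases beyond a cutoff. The tract labels, actual values, predicted values, and the cutoff of 2 are all assumptions made for illustration; none of them come from the assignment data.

```python
import math

# Hypothetical actual and predicted median values, in thousands of dollars.
tract_ids = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
actual = [125.0, 181.0, 98.0, 234.0, 162.0, 147.0, 260.0, 110.0, 154.0, 198.0]
predicted = [120.0, 185.0, 95.0, 240.0, 160.0, 150.0, 200.0, 115.0, 150.0, 200.0]
N_PREDICTORS = 2  # assumed number of independent variables in the model
CUTOFF = 2.0      # rule-of-thumb threshold, an assumption rather than a fixed rule


def standardized_residuals(actual, predicted, n_predictors):
    """Residuals divided by the standard error of the estimate."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    degrees_of_freedom = len(residuals) - n_predictors - 1
    std_error = math.sqrt(sum(e * e for e in residuals) / degrees_of_freedom)
    return [e / std_error for e in residuals]


z_scores = standardized_residuals(actual, predicted, N_PREDICTORS)
flagged = [t for t, z in zip(tract_ids, z_scores) if abs(z) > CUTOFF]

for t, z in zip(tract_ids, z_scores):
    print(f"tract {t}: standardized residual = {z:6.2f}")
print("possible outliers:", flagged)
```

A flagged case is only a prompt to look more closely at that tract (as with the correctional facility tract noted earlier), not an automatic reason to delete it.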
OVERALL ADVICE ON MODEL BUILDING:
With this data set, and its multiple variables, aggregated unit of analysis, and complex dependent variable, there
is no single, simple model that will best explain the dependent variable.
As a result, there is no single right answer to the assignment.
(Do not expect to get R-square values of .95; somewhere in the
.50-.75 range is more likely.)
Do experiment with different variables. There is a tradeoff between a simple, theoretically logical model (with a lower R-square) and a more complex model (with a higher R-square but less intuitively easy to understand). Model building is therefore an art as well as a science. In any case, be sure that you meet the basic requirements of regression models (all independent variables are significant, etc.). See Lewis-Beck for this, especially the idea of BLUE: best linear unbiased estimator.
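The requirement that all independent variables be significant can be checked in the SPSS Coefficients table, where t is the coefficient B divided by its standard error. The sketch below repeats that arithmetic on invented rows; the variable names, B values, and standard errors are hypothetical, and the cutoff of 2 is only a rough stand-in for the 5% two-tailed critical value with about 90 degrees of freedom. In practice, read the exact Sig. column in your own output.

```python
# Hypothetical rows from a Coefficients table: name -> (B, Std. Error).
# These are NOT results from the assignment data.
coefficient_table = {
    "median_income": (2.5, 0.4),
    "pct_owner_occupied": (300.0, 120.0),
    "pct_vacant": (-150.0, 200.0),
}

T_CUTOFF = 2.0  # rough rule of thumb, an assumption; use the Sig. column for exact p-values


def t_ratio(b, std_error):
    """t statistic for one coefficient: B divided by its standard error."""
    return b / std_error


t_ratios = {name: t_ratio(b, se) for name, (b, se) in coefficient_table.items()}
not_significant = [name for name, t in t_ratios.items() if abs(t) < T_CUTOFF]

for name, t in t_ratios.items():
    print(f"{name}: t = {t:.2f}")
print("candidates to drop or rethink:", not_significant)
```

A variable that fails this check is a candidate for removal, though theory should still guide whether it belongs in the model.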

http://ssc.utexas.edu/consulting/tutorials/index.html [scroll down for SPSS tutorials]
http://www.ats.ucla.edu/STAT/spss/modules/default.htm
http://www.ats.ucla.edu/stat/spss/notes2/default.htm (this one has short movies showing the steps)
If you want to map residuals, you can get the ESRI data at the site below. Select Michigan, then Washtenaw county, then choose to download the census tracts and whatever else you might want (places, etc.). This is a good resource for base maps in general. You will have to project the file after you download it. I used NAD 1927, State Plane, Michigan South.
http://arcdata.esri.com/data/tiger2000/tiger_download.cfm
You can go to the SPSS website and download a 14-day demo of SPSS onto your own computer. You will have to register with their website, but it's a pretty painless process. Here's the link: http://www.spss.com/spss/index.htm?source=homepage&hpzone=ad_box