UP504 • Multiple Regression

 

last updated: Wednesday, January 16, 2008

Assignment One

due Monday, Jan 28

see also regression notes


Elements of this page:
research question for the assignment
the data set
product to be turned in / elements of the analysis
suggestions
SPSS tasks to know
Links to SPSS Tutorials, etc. [added]

see also additional comments on assignment [link]

Assignment

Use the data set provided to run multiple regression using SPSS. Begin by examining summary statistics, scatterplots and correlation matrices to become familiar with the data. (This will strengthen both your model building and your interpretation of the results.) Refer to the examples from class, the Lewis-Beck book, a statistics book, the class notes, and other resources on regression for guidelines on developing a model, testing for significance, and checking whether the assumptions for regression have been met. Remember that a good model is not one with simply a good fit, but instead a model that makes sense theoretically and logically explains variation in the dependent variable.
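The exploratory step described above (summary statistics and a correlation matrix) is done through the SPSS menus for this assignment. Purely as an illustration of what those numbers are, here is a minimal Python sketch. The variable names and all the values are made up for the example and are NOT taken from the class data set.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical tract-level values, invented for illustration only.
tracts = {
    "median_value": [120000, 185000, 96000, 240000, 150000],
    "median_income": [42000, 61000, 35000, 88000, 52000],
    "pct_owner_occupied": [55.0, 70.0, 40.0, 85.0, 62.0],
}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Summary statistics for each variable.
for name, values in tracts.items():
    print(f"{name}: mean={mean(values):.1f}, sd={stdev(values):.1f}, "
          f"min={min(values)}, max={max(values)}")

# Correlation matrix stored as a nested dictionary.
corr = {a: {b: pearson_r(tracts[a], tracts[b]) for b in tracts} for a in tracts}
for a in tracts:
    print(a, [round(corr[a][b], 2) for b in tracts])
```

A high correlation between two candidate independent variables is worth noticing at this stage, since it can make their separate effects hard to interpret in the final model.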

You are to work in teams of two students. (Note: in group projects, you are to turn in a single, integrated write-up. All group members receive the same grade.)
 


Question to be addressed in your analysis:

What explains the variation in median housing value by Census Tracts in 2000 (as reported in the 2000 Census)?

dependent variable: median housing value

 


The Data Set

We will provide two data sets for two counties in Michigan: Washtenaw County and Wayne County. You may choose either county. (The Washtenaw data set is already available, and we are in the process of cleaning and loading the Wayne County data set -- availability TBA). Note: for maps of the region, see SEMCOG.

                     Washtenaw County                            Wayne County

file name            up504w08assign1wash.sav                     [to be added]

data set location    sdcamp/Public/up504/up504w08assign1.sav

(Note: the suffix .sav indicates an SPSS data file.)

FOUR WAYS TO ACCESS THE DATA SET (simply use the one that is easiest for you. Please let us know if any of these don't work and/or if you find a better way.):

 

 

 

1. Double-click on the file name in this directory

http://www-personal.umich.edu/~sdcamp/up504/assign1w08/

 

 

2. Download the zip (compressed) version of the file. [Note: when you download and unzip/expand this file, you may rename it anything you would like, but be sure that the file has the extension ".sav" so that SPSS can recognize the file as an SPSS data file.]

up504w08assign1wash.sav.zip

 

 

3. Use Fugu (Mac) or SSH Secure Shell (Windows) to download this data.
In these programs you have two sides of a window: one shows the local computer, and the other side shows the remote site (in this case my public space: /afs/umich.edu/user/s/d/sdcamp/Public/up504/).
Select the file up504w04assign1.sav and then select a location (in the local window) to copy it to. You can usually just drag it.

Host name is "login.itd.umich.edu". User Name is your uniqname. Once you open this connection, select (in the menu) "Operation" > "Go to Folder". Here you will see the hierarchical address for your IFS space. Edit this address to read: /afs/umich.edu/user/s/d/sdcamp/Public/up504. You should now see the data file. Drag this file to where you want it, or right-click to see download options.

 

 

 

4. Use mfile to access the data set at: https://mfile.umich.edu/?path=/afs/umich.edu/user/s/d/sdcamp/Public/html/up504/assign1w08

 

 

 

Thematic Maps of Data.

thematic maps (Note how the housing values vary dramatically across census tracts, with Ann Arbor having both some of the highest and lowest values in the county. Note also the patterns in Ypsilanti.)

   

Notes on data:

NOTE 1: there are 97 census tracts in the data set (from Census tract 4001 to 4660). For example, the Art & Architecture building is located in Tract 4022.

NOTE 2: Census Tract 4229 is a rather unusual place (the location of a correctional facility with no household income), and thus you might treat it as an outlier (you can either delete this case or instruct SPSS to exclude it from the analysis). There may be other individual tracts with peculiar characteristics.
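In SPSS you would exclude the case through the menus rather than delete it from the file. The idea itself is just a filter on the tract number, sketched here in Python. Only the tract number 4229 comes from the note above; the records, field names, and values are hypothetical.

```python
# Hypothetical records; only the tract number 4229 comes from the notes.
cases = [
    {"tract": 4001, "median_value": 150000, "median_hh_income": 48000},
    {"tract": 4022, "median_value": 210000, "median_hh_income": 39000},
    {"tract": 4229, "median_value": None, "median_hh_income": None},
]

# Tracts to leave out of the analysis (kept in the raw data, just not used).
EXCLUDED_TRACTS = {4229}

analysis_cases = [c for c in cases if c["tract"] not in EXCLUDED_TRACTS]

print(len(cases), "cases loaded;", len(analysis_cases), "kept for analysis")
```

Keeping the exclusion list separate from the data makes it easy to report which tracts were dropped and why.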

You may obtain maps of Washtenaw County census tracts (.pdf format) at:
http://ftp2.census.gov/plmap/pl_trt/st26_Michigan/c26161_Washtenaw/

 

 

Data source:

We extracted the data from the 2000 U.S. Census, Summary Files 1 and 3 (that is, from answers to both the short form and long form Census questionnaires):

US CENSUS LINKS:
SF-3 technical documentation (long form -- a sample of the population)
SF-1 technical documentation (short form -- a full-count of the population)
We used both American FactFinder and the "Census Engine" to directly access Census DVDs at the UM Library (S.A.N.D. North in the A&AB).
Note: variables starting with "P" refer to characteristics of Persons, while "H" variables refer to characteristics of Housing Units.


Product to be turned in (be sure to include ALL of the following elements):

1. a statement of the research question and hypothesis

2. the SPSS output of your final model (you do NOT need to turn in the SPSS output from all your exploratory, early version models)

3a. the regression equation
3b. an example of how you would use this equation to make a prediction; use the actual values from one of the cases and calculate the value for the dependent variable. Then compare this predicted value to the actual value for this case. How close an estimate was it?

4. an answer to your research question.

5. A concise discussion of the strengths and shortcomings of your model. Elements of the discussion might include:

You do NOT need to turn in exploratory histograms, scatterplots, descriptive statistics, the raw data (even if you created new variables), etc. (If you create new variables, simply provide a concise definition of them somewhere.) You may have generated a massive number of pages of output; turn in only those pages that relate to your final model (and keep the rest as souvenirs).
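For item 3b, the calculation is simply the regression equation evaluated at one case's actual values, followed by a comparison with that case's actual median housing value. A minimal sketch in Python follows. The intercept, coefficients, variable names, and case values are all hypothetical; substitute the numbers from your own SPSS output and data.

```python
# Hypothetical regression equation:
#   median_value = 20000 + 2.5 * median_income + 800 * pct_college
intercept = 20000.0
coefficients = {"median_income": 2.5, "pct_college": 800.0}

# Hypothetical values for one census tract (one case in the data set).
case = {"median_income": 50000.0, "pct_college": 40.0, "median_value": 185000.0}

# Predicted value: intercept plus each coefficient times the case's value.
predicted = intercept + sum(b * case[name] for name, b in coefficients.items())

# Residual: actual minus predicted. A positive residual means the model
# underestimated this tract's housing value.
residual = case["median_value"] - predicted
pct_error = 100 * residual / case["median_value"]

print(f"predicted = {predicted:.0f}")
print(f"actual    = {case['median_value']:.0f}")
print(f"residual  = {residual:.0f} ({pct_error:.1f}% of the actual value)")
```

With these made-up numbers the prediction is 177,000 against an actual value of 185,000, so the model underestimates this tract by 8,000.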
 


A few suggestions:

HOW MANY VARIABLES?

"How many variables in the model do I need?" A good question. The simple answer is: at least 2 independent variables (this is a multiple regression assignment, after all). There is a trade-off between a simple model and a complex model (with more explanatory power). Going from two to three variables may raise the R-square substantially, which is useful. But eventually the additional explanatory power of each additional variable diminishes, and the model is no longer easily understood. (Call this diminishing returns.) You want a final model that is not only strong and statistically significant, but also makes theoretical sense and is parsimonious (that is, an economical, frugal use of variables). In practical terms, I would expect that many of you would develop models with 2-5 interval variables, plus one or more dummy (nominal) variables.
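One common way to see the diminishing returns numerically is the adjusted R-square, which SPSS reports next to R-square in the Model Summary table. It is not required by the assignment text. The sketch below uses the standard formula with n = 97 tracts (from the data notes). The R-square values for each model size are invented for illustration.

```python
def adjusted_r_squared(r2, n_cases, n_predictors):
    """Adjusted R-square: penalizes R-square for the number of predictors."""
    return 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)


n = 97  # number of census tracts in the Washtenaw data set

# (number of independent variables, hypothetical R-square) pairs.
hypothetical_fits = [(2, 0.55), (3, 0.62), (5, 0.65), (10, 0.66)]

for k, r2 in hypothetical_fits:
    adj = adjusted_r_squared(r2, n, k)
    print(f"{k:2d} variables: R-square = {r2:.2f}, adjusted = {adj:.3f}")
```

In this made-up sequence, going from 5 to 10 variables raises R-square slightly but lowers the adjusted R-square, which is one signal that the extra variables are not earning their place.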
 

Re outliers and nonlinear transformations: At this stage in learning regression it is simply important to know how both outliers and nonlinear relationships can affect/distort your model (and lower your R-square as well). You can look for signs of outliers and nonlinear relationships both in scatterplots and in the residual analysis (see the options in the regression menu under plots and residuals). But I am NOT expecting you to necessarily do anything about outliers and nonlinear relationships unless they are really obvious. In other words, do pay attention to these issues, but we will not take any points off if you don't do any nonlinear transformations or address outliers.
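To make the residual check concrete, here is a small Python sketch that fits a line, computes the residuals, and flags cases whose residual is large relative to the standard error of the estimate. It is a simplification under stated assumptions: it uses a single predictor (not multiple regression) to stay short, the data are invented, and the cutoff of 2 is a common rule of thumb rather than a requirement of this assignment.

```python
from math import sqrt
from statistics import mean

# Invented data, in $1000s. The last case is deliberately unusual:
# a middling income paired with a very high housing value.
income = [35, 42, 48, 52, 58, 61, 67, 75, 88, 50]
value = [96, 120, 131, 150, 160, 185, 190, 215, 240, 320]


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


intercept, slope = fit_line(income, value)

# Residual = actual value minus the value predicted by the fitted line.
residuals = [b - (intercept + slope * a) for a, b in zip(income, value)]

# Standard error of the estimate (n - 2 degrees of freedom for one predictor).
std_error = sqrt(sum(e ** 2 for e in residuals) / (len(income) - 2))

# Residuals expressed in units of the standard error.
standardized = [e / std_error for e in residuals]

# Flag cases more than 2 standard errors from the line (rule of thumb).
flagged = [i for i, z in enumerate(standardized) if abs(z) > 2]
print("flagged case indexes:", flagged)
```

A flagged case is a prompt to look at that tract more closely, not an automatic reason to drop it.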

OVERALL ADVICE ON MODEL BUILDING:
With this data set, and its multiple variables, aggregated unit of analysis, and complex dependent variable, there is no single, simple model that will best explain the dependent variable. As a result, there is no single right answer to the assignment. (And do not expect to get R-square values of .95; somewhere in the .50 - .75 range is more likely.)

Do experiment with different variables. There is a trade-off between a simple, theoretically logical model (with a lower R-square) and a more complex model (with a higher R-square, but less intuitively easy to understand). Model building is therefore an art as well as a science. In any case, be sure that you meet the basic requirements of regression models (all independent variables are significant, etc.). See Lewis-Beck for this, especially the idea of BLUE: best linear unbiased estimate.

-----

A Selected List of SPSS Tasks/Manipulations that will be potentially useful for this assignment

Links to SPSS Tutorials

http://ssc.utexas.edu/consulting/tutorials/index.html [scroll down for SPSS tutorials]
http://www.ats.ucla.edu/STAT/spss/modules/default.htm
http://www.ats.ucla.edu/stat/spss/notes2/default.htm (this one has short movies showing the steps)

If you want to map residuals, you can get the ESRI data at the site below. Select Michigan, then Washtenaw county, then choose to download the census tracts and whatever else you might want (places, etc.). This is a good resource for base maps in general. You will have to project the file after you download it. I used NAD 1927, State Plane, Michigan South.
http://arcdata.esri.com/data/tiger2000/tiger_download.cfm

You can go to the SPSS website and download a 14-day demo of SPSS onto your own computer. You will have to register with their website, but it's a pretty painless process. Here's the link: http://www.spss.com/spss/index.htm?source=homepage&hpzone=ad_box
