UP504 • Multiple Regression

last updated: Wednesday, January 16, 2008

Assignment One

due Monday, Jan 28

Elements of this page:
research question for the assignment
the data set
product to be turned in / elements of the analysis
suggestions
SPSS tasks to know
Links to SPSS Tutorials, etc. [added]

Assignment

Use the data set provided to run multiple regression using SPSS. Begin by examining summary statistics, scatterplots and correlation matrices to become familiar with the data. (This will strengthen both your model building and your interpretation of the results.) Refer to the examples from class, the Lewis-Beck book, a statistics book, the class notes, and other resources on regression for guidelines on developing a model, testing for significance, and checking whether the assumptions for regression have been met. Remember that a good model is not one with simply a good fit, but instead a model that makes sense theoretically and logically explains variation in the dependent variable.
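As a rough illustration of the exploratory step, here is a minimal sketch in Python (not SPSS) that computes a small correlation matrix by hand. The variable names and values are invented for illustration and are NOT taken from the assignment data set; the point is only to show what a correlation matrix summarizes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for five tracts (NOT from the assignment data set).
variables = {
    "median_value": [95000, 140000, 180000, 220000, 310000],
    "median_income": [30000, 45000, 52000, 70000, 98000],
    "pct_vacant": [12.0, 8.0, 6.5, 4.0, 3.0],
}

names = list(variables)
# Correlate every variable with every other variable.
matrix = {
    row: {col: pearson_r(variables[row], variables[col]) for col in names}
    for row in names
}

# Print the matrix with one row and one column per variable.
print(" " * 14 + "".join(f"{name:>15}" for name in names))
for row in names:
    print(f"{row:<14}" + "".join(f"{matrix[row][col]:>15.3f}" for col in names))
```

Reading down the column for the dependent variable gives a first impression of which candidate variables move with it, and in which direction.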

You are to work in teams of two students. (Note: in group projects, you are to turn in a single, integrated write-up. All group members receive the same grade.)

Question to be addressed in your analysis:

What explains the variation in median housing value across Census tracts in 2000 (as reported in the 2000 Census)? Dependent variable: median housing value.

The Data Set

We will provide two data sets for two counties in Michigan: Washtenaw County and Wayne County. You may choose either county. (The Washtenaw data set is already available, and we are in the process of cleaning and loading the Wayne County data set -- availability TBA). Note: for maps of the region, see SEMCOG.

	Washtenaw County	Wayne County
file name (Note: the suffix .sav indicates an SPSS data file)	up504w08assign1wash.sav	[to be added]
data set location	sdcamp/Public/up504/up504w08assign1.sav

FOUR WAYS TO ACCESS THE DATA SET (simply use the one that is easiest for you. Please let us know if any of these don't work and/or if you find a better way.):
1. Double-click on the file name in this directory	http://www-personal.umich.edu/~sdcamp/up504/assign1w08/
2. Download the zip (compressed) version of the file [Note: when you download and unzip/expand this file, you may rename it anything you would like, but be sure that the file has the extension ".sav" so that SPSS can recognize the file as an SPSS data file.]	up504w08assign1wash.sav.zip
3. Use Fugu (Mac) or SSH Secure Shell (Windows) to download the data. In these programs the window has two sides: one shows the local computer, and the other shows the remote site (in this case my public space: /afs/umich.edu/user/s/d/sdcamp/Public/up504/). Select the file up504w04assign1.sav and then select a location (in the local window) to copy it to. You can usually just drag it. Host name is "login.itd.umich.edu". User Name is your uniqname. Once you open this connection, select (in the menu) "Operation" > "Go to Folder". Here you will see the hierarchical address for your AFS space. Edit this address to read: /afs/umich.edu/user/s/d/sdcamp/Public/up504. You should now see the data file. Drag this file to where you want it, or right-click to see download options.
4. Use mfile to access the data set at: https://mfile.umich.edu/?path=/afs/umich.edu/user/s/d/sdcamp/Public/html/up504/assign1w08
Thematic Maps of Data.	thematic maps (Note how the housing values vary dramatically across census tracts, with Ann Arbor having both some of the highest and lowest values in the county. Note also the patterns in Ypsilanti.)
Notes on data:	NOTE 1: There are 97 census tracts in the data set (from Census Tract 4001 to 4660). For example, the Art & Architecture building is located in Tract 4022. NOTE 2: Census Tract 4229 is a rather unusual place (the location of a correctional facility with no household income), and thus you might treat it as an outlier. (You can either delete this case or instruct SPSS to exclude it from the analysis.) There may be other individual tracts with peculiar characteristics. You may obtain maps of Washtenaw County census tracts (.pdf format) at: http://ftp2.census.gov/plmap/pl_trt/st26_Michigan/c26161_Washtenaw/
Data source:	We extracted the data from the 2000 U.S. Census, Summary Files 1 and 3 (that is, from answers to both the short-form and long-form Census questionnaires). US CENSUS LINKS: SF-3 technical documentation (long form -- a sample of the population); SF-1 technical documentation (short form -- a full count of the population). We used both American FactFinder and the "Census Engine" to directly access Census DVDs at the UM Library (S.A.N.D. North in the A&AB). Note: variables starting with "P" refer to characteristics of Persons, while "H" variables refer to characteristics of Housing Units.

Product to be turned in (be sure to include ALL of the following elements):

1. a statement of the research question and hypothesis

2. the SPSS output of your final model (you do NOT need to turn in the SPSS output from all your exploratory, early version models)

3a. the regression equation
3b. an example of how you would use this equation to make a prediction; use the actual values from one of the cases and calculate the value for the dependent variable. Then compare this predicted value to the actual value for this case. How close an estimate was it?

4. an answer to your research question.

5. A concise discussion of the strengths and shortcomings of your model. Elements of the discussion might include:

the strength of the model and its statistical significance
the strength and statistical significance of each explanatory variable
any nonlinear transformations you performed (remember to account for this in your prediction equation)
any dummy variables or other variable recodings
how you dealt with any outliers
any concerns about whether regression assumptions were met or violated, such as the problem of multicollinearity (see Lewis-Beck Applied Regression, page 26, for a good discussion of these assumptions)
any variables (either original variables or ones you created through recoding, computing, etc.) that you thought would be significant but were not.
what other variables (not in the model) might explain your dependent variable.

You do NOT need to turn in exploratory histograms, scatterplots, descriptive statistics, the raw data (even if you created new variables), etc. (If you create new variables, simply provide a concise definition of them somewhere.) You may have generated a massive number of pages of output; turn in only those pages that relate to your final model (and keep the rest as souvenirs).
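To illustrate item 3b, here is a minimal sketch in Python of plugging one case into a regression equation and comparing the prediction with the actual value. The coefficients, variable names, and case values are all hypothetical; substitute the unstandardized coefficients (B) from your own SPSS output and the values from one of your own cases.

```python
# Hypothetical coefficients (NOT from the assignment data); replace them with
# the unstandardized B values from your own SPSS "Coefficients" table.
coefficients = {
    "intercept": 40000.0,
    "median_hh_income": 2.1,    # change in housing value per dollar of income
    "pct_college_grad": 900.0,  # change in housing value per percentage point
}

def predict(coefs, case):
    """Apply the regression equation: intercept + b1*x1 + b2*x2 + ..."""
    return coefs["intercept"] + sum(
        b * case[name] for name, b in coefs.items() if name != "intercept"
    )

# One hypothetical case (census tract) and its actual median housing value.
case = {"median_hh_income": 50000.0, "pct_college_grad": 40.0}
actual_value = 175000.0

predicted = predict(coefficients, case)
residual = actual_value - predicted           # actual minus predicted
pct_error = 100.0 * residual / actual_value   # error as a percent of actual

print(f"Predicted: {predicted:,.0f}")
print(f"Actual:    {actual_value:,.0f}")
print(f"Residual:  {residual:,.0f} ({pct_error:.1f}% of actual)")
```

A negative residual means the model overestimates this case; a positive residual means it underestimates it.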

A few suggestions:

You might try recoding some of the data. For example, where appropriate, convert absolute variables to percentages. Use the COMPUTE function under the TRANSFORM menu. (This step will be very important for the analysis.)
I emphasize the benefits of creating new variables that convert absolute variables (e.g., number of children under 5 years old) to percents (e.g., percent of people in the census tract who are under 5 years old). This also applies to racial variables, etc. For example, if you had "Total Population", "Asian Population", "households", etc., all in a model, each is struggling to do the same thing: show a relationship between population size and the dependent variable. This creates multicollinearity problems. It also makes it harder to interpret the coefficients of these variables: are they accounting for size, or for race, etc.? If you recode to a percent, then the size factor is removed, and the variable simply reflects what you want it to reflect.
Convert some interval variables (those that lack a strong relationship with the dependent variable) into dummy (nominal: 0, 1) variables. Use the RECODE function under the TRANSFORM menu. (This procedure uses the "IF" format.)
Convert nominal variables into dummy variables. Use the RECODE function under the TRANSFORM menu. Remember that dummy variables have only two possible values: 0 or 1. (Otherwise it is problematic to interpret the coefficients of dummy variables. That is, don't use "1" and "2" as the values of a dummy variable.)
Do try some nonlinear transformations if the scatterplots suggest a clear, nonlinear relationship.
Do experiment with different variables. There is a trade-off between a simple, theoretically logical model (with a lower R-Square) and a more complex model (with a higher R-Square but less intuitively easy to understand). Model building is therefore an art as well as a science. In any case, be sure that you meet the basic requirements of regression models (all independent variables are significant, etc.). See Lewis-Beck for this, especially the idea of BLUE: best linear unbiased estimator.
Be careful not to include independent variables that are actually dependent on the "dependent" variable.
Avoid making an ecological fallacy (using aggregate data to generalize about individual-level behavior).
If you are ambitious, you can extract more variables yourself from the original data set and add them to the existing data set (not hard to do). But this, of course, is going "beyond the call of duty" and is wholly optional.
You could also experiment with creating new variables based on, for example, any geographic characteristics of the individual cases. (e.g., create a dummy variable: is it in Ann Arbor? in Ypsilanti? does the Tract include part of a college campus? etc.)
IMPORTANTLY, address items 4 and 5 of the product to be turned in: We do not just want to see your SPSS output, but also a thoughtful discussion of the model. What does your model mean in plain English? Do the variables in the model make sense? Are the signs (+/-) of the relationships as expected? What variables seem to be the most important? Do you suspect any intervening or other kinds of indirect or complex relationships?
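The recoding suggestions above (counts to percents, categories to 0/1 dummies) can be sketched as follows. This is Python rather than SPSS, and the tract records, field names, and values are hypothetical; in SPSS the equivalent steps are done with COMPUTE and RECODE under the TRANSFORM menu.

```python
# Hypothetical tract records (NOT from the assignment data set).
tracts = [
    {"tract": "A", "total_pop": 4000, "pop_under5": 200, "place": "Ann Arbor"},
    {"tract": "B", "total_pop": 2500, "pop_under5": 250, "place": "Ypsilanti"},
    {"tract": "C", "total_pop": 0, "pop_under5": 0, "place": "Other"},
]

def percent(part, whole):
    """Return part as a percent of whole, or None (missing) if whole is zero."""
    if whole == 0:
        return None
    return 100.0 * part / whole

for t in tracts:
    # Count -> percent: removes the "size of tract" effect from the variable.
    t["pct_under5"] = percent(t["pop_under5"], t["total_pop"])
    # Nominal -> dummy: exactly two values, 1 (in Ann Arbor) or 0 (not).
    t["in_ann_arbor"] = 1 if t["place"] == "Ann Arbor" else 0

for t in tracts:
    print(t["tract"], t["pct_under5"], t["in_ann_arbor"])
```

Note the guard for a tract with zero population: dividing by zero would fail, so the percent is left missing for that case rather than set to an arbitrary number.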

HOW MANY VARIABLES?

"How many variables in the model do I need?" A good question. The simple answer is: at least two independent variables (this is a multiple regression assignment, after all). There is a trade-off between a simple model and a more complex model (with more explanatory power). Going from two to three variables may raise the R-square substantially, and is useful. But eventually the additional explanatory power of each additional variable diminishes, and the model is no longer easily understood. (Call this diminishing returns.) You want a final model that is not only strong and statistically significant, but also makes theoretical sense and is parsimonious (that is, an economical, frugal use of variables). In practical terms, I would expect that many of you would develop models with 2-5 interval variables, plus one or more dummy (nominal) variables.

Re outliers and nonlinear transformations: At this stage in learning regression it is simply important to know how both outliers and nonlinear relationships can affect/distort your model (and lower your R-square as well). You can look for signs of outliers and nonlinear relationships both in scatterplots and in the residual analysis (see the options in the regression menu under plots and residuals). But I am NOT expecting you to necessarily do anything about outliers and nonlinear relationships unless they are really obvious. In other words, do pay attention to these issues, but we will not take any points off if you don't do any nonlinear transformations or address outliers.
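If you do transform a variable, the prediction step must undo the transformation. The following minimal Python sketch assumes, purely for illustration, that the dependent variable was replaced by its natural log; the choice of transformation, the coefficients, and the case value are all hypothetical.

```python
import math

# Hypothetical model fitted on the LOG of median housing value:
#   ln(median_value) = b0 + b1 * pct_college_grad
b0 = 11.0   # hypothetical intercept (in log units)
b1 = 0.02   # hypothetical coefficient (change in ln(value) per percentage point)

pct_college_grad = 40.0  # hypothetical value for one tract

# The equation predicts the log of the value, not the value itself.
predicted_log_value = b0 + b1 * pct_college_grad

# Undo the transformation to get a prediction in dollars.
predicted_value = math.exp(predicted_log_value)

print(f"Predicted ln(value): {predicted_log_value:.2f}")
print(f"Predicted value:     {predicted_value:,.0f}")
```

Forgetting the last step would report a "housing value" of about 12, which is a quick sign that the prediction is still in log units.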

OVERALL ADVICE ON MODEL BUILDING:
With this data set, and its multiple variables, aggregated unit of analysis, and complex dependent variable, there is no single, simple model that will best explain the dependent variable. As a result, there is no single right answer to the assignment. (Do not expect to get R-square values of .95; somewhere in the .50 to .75 range is more likely.)

-----

A Selected List of SPSS Tasks/Manipulations that will be potentially useful for this assignment

Inserting a New Variable
COMPUTE a new variable
RECODE an existing variable
converting a variable to a dummy variable (with values of zero and one)
DESCRIPTIVE statistics, CORRELATION, HISTOGRAM, scatterplot, REGRESSION
running analysis on a subset of the data cases (i.e., excluding selected cases) -- such as excluding those cases where the dependent variable value is zero. Or excluding specific cases. Use the "Selection Variable" part of the "Linear Regression" window. SPSS will ask you to define a rule. Select a variable here and a rule (such as "not equal to" zero).
You can view "casewise diagnostics" (to look for outliers, etc.) by selecting "Statistics" in the "Linear Regression" window. Select "Casewise diagnostics". You have two choices: view ALL cases, or only those cases whose residual (actual value minus predicted value) is at least x standard deviations from zero. (The default is 3 standard deviations, but you can change this.)
You can better interpret this output if you label each case. To do so, use "Case Labels": you could insert the variable with the case labels here. [You might even create a new variable that labels selected cases with descriptive terms. Remember that this new variable would be formatted as a "String" rather than a "Numeric" variable.]
Looking for systematic patterns in your residuals (e.g., which cases are underestimated and which cases are overestimated) can sometimes point you to additional explanatory variables that are not yet in the model. [Note: if you were very clever, you could import your geocoded residual data into a GIS program and look for spatial patterns in your residuals -- a great way to visualize your regression model's shortcomings and look for other spatial patterns -- but certainly beyond the call of duty for this assignment!]
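The idea behind casewise diagnostics can be sketched in a few lines of Python. This is an illustration of the concept, not SPSS's own code: it assumes a standardized residual is the residual divided by the standard error of the estimate, and the case labels, values, and number of predictors are hypothetical. Because the made-up example has only ten cases, it uses a cutoff of 2 rather than the SPSS default of 3.

```python
import math

def standardized_residuals(actual, predicted, n_predictors):
    """Residuals divided by the standard error of the estimate."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    degrees_of_freedom = len(residuals) - n_predictors - 1
    std_error = math.sqrt(sum(r * r for r in residuals) / degrees_of_freedom)
    return [r / std_error for r in residuals]

# Hypothetical cases (values in $1,000s), NOT from the assignment data set.
labels = [f"T{i:02d}" for i in range(1, 11)]
actual = [100, 120, 130, 150, 160, 175, 190, 210, 230, 400]
predicted = [105, 118, 135, 148, 165, 170, 195, 205, 228, 250]

cutoff = 2.0  # SPSS defaults to 3; lowered here for a tiny made-up sample
z = standardized_residuals(actual, predicted, n_predictors=2)

# Keep only the cases whose standardized residual is beyond the cutoff.
flagged = [
    (label, round(zr, 2)) for label, zr in zip(labels, z) if abs(zr) >= cutoff
]
print(flagged)
```

A positive standardized residual marks a case the model underestimates; labeling the cases, as suggested above, makes it easy to see which tract each flagged residual belongs to.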

Links to SPSS Tutorials

http://ssc.utexas.edu/consulting/tutorials/index.html [scroll down for SPSS tutorials]
http://www.ats.ucla.edu/STAT/spss/modules/default.htm
http://www.ats.ucla.edu/stat/spss/notes2/default.htm (this one has short movies showing the steps)

If you want to map residuals, you can get the ESRI data at the site below. Select Michigan, then Washtenaw county, then choose to download the census tracts and whatever else you might want (places, etc.). This is a good resource for base maps in general. You will have to project the file after you download it. I used NAD 1927, State Plane, Michigan South.
http://arcdata.esri.com/data/tiger2000/tiger_download.cfm

You can go to the SPSS website and download a 14-day demo of SPSS onto your own computer. You will have to register with their website, but it's a pretty painless process. Here's the link: http://www.spss.com/spss/index.htm?source=homepage&hpzone=ad_box