## UP504 • Multiple Regression

last updated: Wednesday, January 16, 2008

## Assignment One

due Monday, Jan 28

# Assignment

Use the data set provided to run multiple regression using SPSS. Begin by examining summary statistics, scatterplots and correlation matrices to become familiar with the data. (This will strengthen both your model building and your interpretation of the results.) Refer to the examples from class, the Lewis-Beck book, a statistics book, the class notes, and other resources on regression for guidelines on developing a model, testing for significance, and checking whether the assumptions for regression have been met. Remember that a good model is not one with simply a good fit, but instead a model that makes sense theoretically and logically explains variation in the dependent variable.
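As a rough illustration of what a correlation matrix reports, here is a minimal sketch in Python (not SPSS). The variable names and values are invented for illustration and do not come from the class data set; `pearson_r` is a helper written for this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five census tracts (invented, in thousands where relevant).
data = {
    "median_value":  [120, 150, 180, 210, 260],
    "median_income": [40, 48, 55, 63, 80],
    "total_pop":     [3000, 4200, 2500, 5100, 3900],
}

# Print the full correlation matrix, one row per variable.
names = list(data)
for a in names:
    row = " ".join(f"{pearson_r(data[a], data[b]):6.2f}" for b in names)
    print(a.ljust(14), row)
```

High correlations with the dependent variable suggest candidate explanatory variables; high correlations between two explanatory variables are an early warning of multicollinearity.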

You are to work in teams of two students. (Note: in group projects, you are to turn in a single, integrated write-up. All group members receive the same grade.)

What explains the variation in median housing value by Census Tracts in 2000 (as reported in the 2000 Census)?

dependent variable: median housing value

# The Data Set

We will provide two data sets for two counties in Michigan: Washtenaw County and Wayne County. You may choose either county. (The Washtenaw data set is already available, and we are in the process of cleaning and loading the Wayne County data set -- availability TBA). Note: for maps of the region, see SEMCOG.

# Product to be turned in (be sure to include ALL of the following elements):

1. a statement of the research question and hypothesis

2. the SPSS output of your final model (you do NOT need to turn in the SPSS output from all your exploratory, early version models)

3a. the regression equation
3b. an example of how you would use this equation to make a prediction; use the actual values from one of the cases and calculate the value for the dependent variable. Then compare this predicted value to the actual value for this case. How close an estimate was it?

5. A concise discussion of the strengths and shortcomings of your model. Elements of the discussion might include:

• the strength of the model and its statistical significance
• the strength and statistical significance of each explanatory variable
• any nonlinear transformations you performed (remember to account for this in your prediction equation)
• any dummy variables or other variable recodings
• how you dealt with any outliers
• any concerns about whether regression assumptions were met or violated, such as the problem of multicollinearity (see Lewis-Beck Applied Regression, page 26, for a good discussion of these assumptions)
• any variables (either original variables or ones you created through recoding, computing, etc.) that you thought would be significant but were not.
• what other variables (not in the model) might explain your dependent variable.
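The arithmetic for item 3b can be sketched as follows. This is a minimal illustration in Python (not SPSS). The coefficients, variable names, and tract values are hypothetical placeholders, not results from the Washtenaw or Wayne data. The sketch also shows one way a nonlinear transformation (here, a natural log of one explanatory variable) would carry through to the prediction.

```python
import math

# Hypothetical fitted equation (all numbers invented for illustration):
#   median_value = 50000 + 2.0*median_income + 10000*ln(pct_college) + 15000*in_city
INTERCEPT = 50000.0
COEFFICIENTS = {
    "median_income": 2.0,        # dollars of housing value per dollar of income
    "ln_pct_college": 10000.0,   # coefficient on the natural log of percent college graduates
    "in_city": 15000.0,          # dummy variable: 1 if the tract is in the city, else 0
}

def predict(case):
    """Plug one case's values into the regression equation."""
    # Apply the same transformation that was used when the model was fitted.
    x = {
        "median_income": case["median_income"],
        "ln_pct_college": math.log(case["pct_college"]),
        "in_city": case["in_city"],
    }
    return INTERCEPT + sum(COEFFICIENTS[name] * x[name] for name in COEFFICIENTS)

# One hypothetical census tract, with its actual (observed) median housing value.
tract = {"median_income": 60000.0, "pct_college": 40.0, "in_city": 1,
         "actual_value": 230000.0}

predicted = predict(tract)
residual = tract["actual_value"] - predicted            # actual minus predicted
pct_error = 100.0 * residual / tract["actual_value"]    # error as a percent of actual

print(f"predicted = {predicted:,.0f}")
print(f"residual  = {residual:,.0f} ({pct_error:.1f}% of actual)")
```

A positive residual means the model underestimated this case; a negative one means it overestimated.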

You do NOT need to turn in exploratory histograms, scatterplots, descriptive statistics, the raw data (even if you created new variables), etc. (If you create new variables, simply provide a concise definition of them somewhere.) You may have generated a massive number of pages of output; turn in only those pages that relate to your final model (and keep the rest as souvenirs).

# A few suggestions:

• You might try recoding some of the data. For example, where appropriate, convert absolute variables to percentages. Use the COMPUTE function under the TRANSFORM menu. (This step will be very important for the analysis.)
• I emphasize the benefits of creating new variables that convert absolute variables (e.g., number of children under 5 years old) to percents (e.g., percent of people in the census tract who are under 5 years old). This also applies to racial variables, etc. For example, if you had "Total Population", "Asian Population", "households", etc. all in a model, each would be struggling to do the same thing: show a relationship between population size and the dependent variable. This creates multicollinearity problems. It also makes it harder to interpret the coefficients of these variables: are they accounting for size, or for race, etc.? If you recode to a percent, then the size factor is removed, and the variable simply reflects what you want it to reflect.
• Convert some interval variables (that lack a strong relationship with the dependent variable) into dummy (nominal: 0, 1) variables. Use the RECODE function under the TRANSFORM menu. (The format for this procedure uses the "IF" format.)
• Convert nominal variables into dummy variables. Use the RECODE function under the TRANSFORM menu. Remember that dummy variables have only two possible values: 0 or 1. (Otherwise it is problematic to interpret the coefficients of dummy variables. That is, don't use "1" and "2" for the values of dummy variables.)
• Do try some nonlinear transformations if the scatterplots suggest a clear, nonlinear relationship.
• Do experiment with different variables. There is a trade-off between a simple, theoretically logical model (with a lower R-Square) and a more complex model (with a higher R-Square but less intuitively easy to understand). Model building is therefore an art as well as a science. In any case, be sure that you meet the basic requirements of regression models (all independent variables are significant, etc.). See Lewis-Beck for this, especially the idea of BLUE: best linear unbiased estimate.
• Be careful not to include independent variables that are actually dependent on the "dependent" variable (that is, variables that are a consequence of housing value rather than a possible cause of it).
• Avoid making an ecological fallacy (using aggregate data to generalize about individual-level behavior).
• If you are ambitious, you can extract more variables yourself from the original data set and add them to the existing data set (not hard to do). But this, of course, is going "beyond the call of duty" and is wholly optional.
• You could also experiment with creating new variables based on, for example, any geographic characteristics of the individual cases. (e.g., create a dummy variable: is it in Ann Arbor? in Ypsilanti? does the Tract include part of a college campus? etc.)
• IMPORTANTLY, address questions number 4 and 5 in the assignment: We do not just want to see your SPSS output, but also a thoughtful discussion of the model. What does your model mean in plain English? Do the variables in the model make sense? Are the signs (+/-) of the relationship as expected? What variables seem to be the most important? Do you suspect any intervening or other kinds of indirect or complex relationships?
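The percent conversion and dummy coding suggested above can be sketched outside SPSS as follows. This is only an illustration in Python of what COMPUTE and RECODE would produce. The field names and tract records are invented, and treating a zero-population tract as missing is an assumption of this sketch, not a rule from the assignment.

```python
# Hypothetical tract records; field names and values are invented for illustration.
tracts = [
    {"tract": "A", "total_pop": 4000, "pop_under_5": 300, "place": "Ann Arbor"},
    {"tract": "B", "total_pop": 2500, "pop_under_5": 100, "place": "Ypsilanti"},
    {"tract": "C", "total_pop": 0,    "pop_under_5": 0,   "place": "Other"},
]

def percent(part, whole):
    """Convert a count to a percent of the total; None when the total is zero."""
    if whole == 0:
        return None  # treat as missing rather than dividing by zero
    return 100.0 * part / whole

for t in tracts:
    # Analogue of COMPUTE: a size-free version of the count variable.
    t["pct_under_5"] = percent(t["pop_under_5"], t["total_pop"])
    # Analogue of RECODE into a dummy: exactly two values, 0 or 1.
    t["in_ann_arbor"] = 1 if t["place"] == "Ann Arbor" else 0

for t in tracts:
    print(t["tract"], t["pct_under_5"], t["in_ann_arbor"])
```

Once the count is expressed as a percent, its coefficient no longer competes with total population to capture tract size.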

# HOW MANY VARIABLES?

"How many variables in the model do I need?" A good question. The simple answer is: at least 2 independent variables (this is a multiple regression assignment, after all). There is a trade-off between a simple model and a complex model (with more explanatory power). Going from two to three variables may raise the R-Square significantly, and is useful. But eventually the additional explanatory power of each additional variable diminishes, and the model is no longer easily understood. (Call this diminishing returns.) You want a final model that is not only strong and statistically significant, but also makes theoretical sense and is parsimonious (that is, an economical, frugal use of variables). In practical terms, I would expect that many of you would develop models with 2-5 interval variables, plus one or more dummy (nominal) variables.
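One way to see diminishing returns is the adjusted R-Square that SPSS reports next to R-Square in the model summary. It discounts R-Square for each added independent variable. The sketch below, in Python, uses the standard formula. The sequence of models and their R-Square values are invented for illustration.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Adjusted R-Square: discounts R-Square for each added independent variable."""
    return 1.0 - (1.0 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Hypothetical sequence of models fitted to the same 100 cases (values invented).
n = 100
models = [(2, 0.550), (3, 0.620), (4, 0.625), (5, 0.626)]  # (number of predictors, R-Square)

for k, r2 in models:
    print(k, r2, round(adjusted_r_square(r2, n, k), 4))
```

In this invented sequence the fifth variable nudges R-Square up slightly but lowers the adjusted value, a sign that it adds complexity without adding much explanation.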

Re outliers and nonlinear transformations: At this stage in learning regression it is simply important to know how both outliers and nonlinear relationships can affect/distort your model (and lower your R-square as well). You can look for signs of outliers and nonlinear relationships both in scatterplots and in the residual analysis (see the options in the regression menu under plots and residuals). But I am NOT expecting you to necessarily do anything about outliers and nonlinear relationships unless they are really obvious. In other words, do pay attention to these issues, but we will not take any points off if you don't do any nonlinear transformations or address outliers.

With this data set, and its multiple variables, aggregated unit of analysis, and complex dependent variable, there is no single, simple model that will best explain the dependent variable. As a result, there is no single right answer to the assignment. (And do not expect to get R-square values of .95; somewhere in the .50 to .75 range is more likely.)

-----

# A Selected List of SPSS Tasks/Manipulations that will be potentially useful for this assignment

• Inserting a New Variable
• COMPUTE a new variable
• RECODE an existing variable
• converting a variable to a dummy variable (with values of zero and one)
• DESCRIPTIVE statistics, CORRELATION, HISTOGRAM, scatterplot, REGRESSION
• running analysis on a subset of the data cases (i.e., excluding selected cases) -- such as excluding those cases where the dependent variable value is zero. Or excluding specific cases. Use the "Selection Variable" part of the "Linear Regression" window. SPSS will ask you to define a rule. Select a variable here and a rule (such as "not equal to" zero).
• You can view "casewise diagnostics" (to look for outliers, etc.) by selecting "Statistics" in the "Linear Regression" window. Select "Casewise diagnostics". You have two choices: either view ALL cases, or simply those whose residuals are at least x standard deviations from zero, that is, cases whose actual value lies far from the predicted value. (The default is 3 standard deviations, but you can change this.) You can better interpret this output if you label each case. To do so, use "Case Labels": you could insert the variable with the case labels here. [You might even create a new variable that labels selected cases with descriptive terms. Remember that this new variable would be formatted as a "String" rather than a "Numeric" variable.]
• Looking for systematic patterns in your residuals (e.g., which cases are underestimated and which cases are overestimated) can sometimes point you to additional explanatory variables that are not yet in the model. [Note: if you were very clever, you could import your geocoded residual data into a GIS program and look for spatial patterns in your residuals -- a great way to visualize your regression model's shortcomings and look for other spatial patterns -- but certainly beyond the call-of-duty for this assignment!]
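To make the casewise-diagnostics idea concrete, here is a minimal sketch in Python of flagging cases by standardized residual. It assumes the residual is standardized by dividing by the standard error of the estimate. The function name `flag_outliers` and all of the values are invented for illustration, and the residuals are made up rather than taken from a fitted model.

```python
import math

def flag_outliers(actual, predicted, n_predictors, threshold=3.0):
    """Return (case index, standardized residual) for cases beyond the threshold."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    # Standard error of the estimate: sqrt(SSE / (n - k - 1)).
    sse = sum(r * r for r in residuals)
    std_error = math.sqrt(sse / (n - n_predictors - 1))
    flagged = []
    for i, r in enumerate(residuals):
        z = r / std_error
        if abs(z) >= threshold:
            flagged.append((i, z))
    return flagged

# Twenty hypothetical tracts: predicted values plus invented residuals.
predicted = [200000 + 5000 * i for i in range(20)]
invented_residuals = [(-1000 if i % 2 else 1000) for i in range(20)]
invented_residuals[7] = 20000   # one tract the model badly underestimates
actual = [p + r for p, r in zip(predicted, invented_residuals)]

for index, z in flag_outliers(actual, predicted, n_predictors=2):
    print(f"case {index}: standardized residual = {z:.2f}")
```

A positive standardized residual marks a case the model underestimates. Listing which tracts are flagged, and where they are, is the first step toward spotting a missing explanatory variable.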