Contents:
Writing an outline
Writing up the results
Dealing with unpredictable, messy, incomplete, or ambiguous data
Special Case: Interpreting and Analyzing Survey Data

UP504 (Campbell)
Winter 2008
University of Michigan
last updated: January 27, 2008

 

Advice for Final Project (and more broadly, for the design and interpretation of quantitative research projects):
 

Writing an outline:

  1. Overall, be precise and rigorous in your outlines. Think it through: what type of evidence and tests do you need to test your hypothesis?
  2. Focus specifically on methodology.  What are the actual steps to do the research?  
  3. Remember that there is a difference between an overall theory or policy question (e.g., Is sustainable development an important goal for urban planning?) and a precise, testable question that you can actually answer in this project (e.g., What types of large U.S. cities are more likely to have an official, municipal "office of sustainable development"?).
  4. Can you actually test your hypothesis using UP504-like methods? Are your hypotheses falsifiable? (Remember that your hypotheses should directly follow from your research question: a hypothesis is an intelligent "best guess" answer to your research question.) Avoid weak hypotheses (e.g., Do poor people earn less money than rich people?).
  5. Are your methods, data and research question compatible? You may have good methods, a nice data set and interesting research questions, but will they work together? This is a critical aspect of your research project, and may require a mid-project adjustment of your question, data set and/or methodology to bring them in harmony with each other.
  6. Narrow your focus.  Ironically, if you take on a more narrow project you can answer broader questions in the end.
  7. Is your unit of analysis appropriate? Sometimes your units of analysis are too aggregated (e.g., using national-level data when city or individual data would be better). One problematic result is an ecological fallacy: drawing conclusions about individuals from patterns that hold only for aggregates.
  8. Do you have enough cases? (The number of needed cases will vary, based on: whether you are using analytical or statistical generalization; the level of confidence you seek; the number and size of subgroups within the data set of interest; etc.) Avoid the temptation of reducing the number of cases just because you think it might save work time. (Often it would take little more time to collect data on, e.g., 200 cities as on just 25 cities. There are scale economies in most data collection.)
  9. Is your data from the right years?
  10. Can you actually find the data?
  11. Can you find good measures of your larger concepts? (Think carefully about the various ways to measure your broader concept. Do not conflate a concept with its measure.) See lecture notes.
  12. If you are doing evaluation research, be sure to identify the counterfactual -- what would have happened WITHOUT the intervention?
  13. If you propose doing a regression model, be sure to remember the strengths and limitations of the technique. It is NOT a "do-all" method, but instead calculates the strength and coefficients of linear relationships. It is also quite sensitive to whether or not the key assumptions of regression analysis have been met. Be careful also of an under-specified model: are you really controlling for all critical factors (e.g., often models of change over time don't take into account demographic compositional change, i.e., that a population's age structure has changed, leading to changes in crime rate, home ownership, etc.)? More basically: do you have the basic elements to do regression (e.g., multivariate data, an interval-scale dependent variable, a large number of cases, etc.)? If your dependent variable is nominal, then you will need to consider alternate methods, such as logistic regression.
  14. If your regression dependent variable consists of values across time (e.g., housing starts from 1970 - 1995;  or monthly unemployment rates over the last ten years), then this is time series analysis, which merits a look at a statistics book on this variation of regression.
  15. Sometimes you need to back off from looking for a causal model and instead be more exploratory or descriptive, looking for patterns and typologies. Be clear about the nature of the relationship between variables: are you simply pointing out a correlation between two or more variables, or are you also arguing for a specific causal relationship?
  16. Group projects are fine, and can be especially rewarding. I expect that the amount of work put into a group project is proportional to the number of group members.
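
The ecological fallacy mentioned in item 7 can be illustrated with a minimal sketch. The three cities and their values below are invented purely for illustration (they are not real data), and `pearson_r` is a helper written for this example. The point is only that a relationship seen in aggregated (city-level) averages can be the opposite of the relationship among individuals within each city.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented individual-level data: (x values, y values) for residents of three cities.
cities = {
    "City A": ([1, 2, 3], [5, 4, 3]),
    "City B": ([4, 5, 6], [8, 7, 6]),
    "City C": ([7, 8, 9], [11, 10, 9]),
}

# Correlation among individuals, computed separately inside each city.
within_city_r = {name: pearson_r(xs, ys) for name, (xs, ys) in cities.items()}

# Correlation among the city averages (the aggregated unit of analysis).
city_mean_x = [sum(xs) / len(xs) for xs, _ in cities.values()]
city_mean_y = [sum(ys) / len(ys) for _, ys in cities.values()]
city_level_r = pearson_r(city_mean_x, city_mean_y)

print("Within-city correlations:", within_city_r)
print("Correlation of city averages:", city_level_r)
```

In this invented case the city averages rise together while, inside every city, higher x goes with lower y. Concluding from the city-level pattern that individuals with higher x have higher y would be an ecological fallacy.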
     


Writing up the results of quantitative research

  1. Make your methodology and your results transparent; let the reader easily see the steps involved.
  2. Help the reader understand the broader context of the research. Why is the research question important? What are the ramifications of the various possible empirical outcomes of the research?
  3. Discuss any potentially significant discretionary methodological choices you made: e.g., the logic of the case selection; excluded outliers; time frame used (i.e., years of data); selection of variables; use of measures to represent broader concepts; etc.
  4. To summarize your findings, distinguish between these different outcomes:
    (a) Findings that confirm the current understanding of the issue [replication of existing theories/assumptions]
    (b) Findings that support your hypothesis and perhaps go against the prevailing understanding/assumptions [this is your intended original contribution]
    (c) Findings that do not support your hypothesis for interesting and explainable reasons [this can be your unexpected but still useful contribution]
    (d) Findings that do not support your hypothesis for unexplainable reasons [this sets your research agenda for the future and keeps you humble]
  5. Note the statistical significance (or lack thereof) where appropriate.


Dealing with unpredictable, messy, incomplete, or ambiguous data

 

1.   It is not unusual to start with a great research question, hypothesis, and a seemingly wonderful data set, only to discover that the data can't easily show the patterns/relationships you were looking for.  This commonly happens when you are running multiple regression.   Some of the sources of this problem are:

*    sometimes there is in fact no underlying relationship between the variables in question.   (The null hypothesis cannot be rejected.)   However, more often there is likely a relationship, yet you can't find it. Therefore, before you conclude that there is no relationship between the variables (and that, e.g., an R-square of 0.12 proves this), consider the following possibilities (and also consider ways to alter your project to address them):

*    incomplete data sets, with lots of missing values (this is a particularly frequent frustration of comparative international data).

*    a unit of analysis at the wrong scale, especially when it is too aggregated (e.g., national data, when the relationship you are looking for is best revealed with city or household data).  This is a common problem, and the best solution is to collect data at a smaller scale if possible/available. (One runs the risk here of committing an ecological fallacy.)

*    not enough cases (for statistical generalization) -- this can be linked to having too aggregated a unit of analysis (see above) -- for example, running a regression on seven counties in SE Michigan will not work; but, running a regression on the hundreds of census tracts in the region can potentially lead to a statistically significant model.
How about the opposite: too many cases? We have spent time in this class examining different sampling strategies, allowing you to statistically generalize from a relatively small sample to a much larger population. However, if you already have access to a large data set (such as 3,000 cases), try first to work with the entire data set. There is no inherent reason why you have to reduce your number of cases through further sampling. If you have access to an enormous data set (e.g., 120,000 cases and 80 variables), yes, you might want to first cut it back to make it easier to handle (e.g., in Excel). You might also be just interested in a specific subset of cases (e.g., only the Michigan counties in a data set of all U.S. counties). Otherwise, take advantage of the large data set's full power by using all the cases.

*    lots of variables, but not the ones you need (an under-specified model)

*    Wrong level of measurement: for example, you have nominal or ordinal data, when you really need interval-scale data (remember that different levels of measurement require different statistical tests).

*    not enough variation in your dependent variable.  (Since a strong model explains most of the variation, with little variation, there is little to explain.)

*    non-linear relationships, where there is no obvious transformation (e.g., taking logarithms) that makes the relationship linear.

*    Using secondary data from poorly constructed survey research (e.g., a questionnaire with ambiguously worded questions).

*    the wrong format for your dependent variable. (therefore, try alternatives). Example? If you are using "average commute time (in minutes)" as a variable and getting weak results, perhaps try to include more information about commute times: e.g., percent of people who commute less than 15 minutes/day; percent over 45 minutes; percent who work at home, etc. Means (averages) can be useful measures of central tendency, but they tend to strip the data of lots of its richness and variation. (For example: two communities may both have mean commute times of 30 minutes, but are otherwise quite different: for the first, everyone commutes 30 minutes a day; and for the second, half the people commute zero minutes while the other half commute for an hour.)

*    etc.
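
The commute-time example in the list above can be checked with a short sketch. The two communities are the hypothetical ones described in the text; the community size of ten people each and the `profile` helper are assumptions made only for this illustration.

```python
from statistics import mean, pstdev

# Community 1: everyone commutes 30 minutes.
# Community 2: half commute zero minutes, half commute 60 minutes.
community_1 = [30] * 10
community_2 = [0] * 5 + [60] * 5


def profile(commutes):
    """Summarize commute times with the mean plus measures that keep the spread."""
    n = len(commutes)
    return {
        "mean": mean(commutes),
        "std_dev": pstdev(commutes),
        "pct_under_15": 100 * sum(c < 15 for c in commutes) / n,
        "pct_over_45": 100 * sum(c > 45 for c in commutes) / n,
    }


profile_1 = profile(community_1)
profile_2 = profile(community_2)

print("Community 1:", profile_1)
print("Community 2:", profile_2)
```

Both communities have the same mean, so a model using only "average commute time" cannot tell them apart. The standard deviation and the percentage measures do distinguish them, which is the reason to try alternative formats for the variable.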
 

2.   What do you do in these situations?
If possible, fill in the missing values, add more variables or cases, disaggregate the data to a smaller scale, try transformations that make relationships linear, convert absolute values to percentages, try dummy variables, etc.    Or, go the other direction and convert the data to nominal or ordinal scales and look for statistically significant patterns there (using chi-square, etc.).
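
Two of the remedies named above, converting absolute values to percentages and creating dummy variables, might look like the following minimal sketch. The tract records, field names, and helper functions are all hypothetical and exist only to show the mechanics.

```python
# Invented records: a nominal variable (area_type) and absolute counts.
tracts = [
    {"tract": "T1", "area_type": "urban", "population": 4000, "renters": 2400},
    {"tract": "T2", "area_type": "suburban", "population": 2500, "renters": 500},
    {"tract": "T3", "area_type": "rural", "population": 800, "renters": 120},
]


def with_percent(record, part_key, total_key, new_key):
    """Return a copy of the record with part/total expressed as a percentage."""
    updated = dict(record)
    updated[new_key] = 100 * record[part_key] / record[total_key]
    return updated


def with_dummies(record, key, categories, reference):
    """Return a copy with one 0/1 dummy per category, omitting the reference category."""
    updated = dict(record)
    for category in categories:
        if category != reference:
            updated[f"{key}_{category}"] = 1 if record[key] == category else 0
    return updated


categories = ["urban", "suburban", "rural"]
prepared = [
    with_dummies(
        with_percent(t, "renters", "population", "pct_renters"),
        "area_type",
        categories,
        reference="rural",
    )
    for t in tracts
]

for row in prepared:
    print(row)
```

One category ("rural" here, an arbitrary choice) is left out as the reference group, so a tract with zeros on both dummies is rural. Including a dummy for every category alongside an intercept would make the dummies perfectly collinear in a regression.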

3.  If that doesn't work, try explaining another dependent variable.
Be flexible:   shift from a hypothetico-deductive mode to an exploratory mode.

4.  If that fails, turn "failure" into the subject of your analysis
You can still write up your analysis.    Take a step back and reflect on the analytical process;  you may have learned more from these apparent "failures" than you think.   Explain why there were no obvious and statistically significant patterns found in the data.  Write a treatise on the politics of data.   Propose a different study that would better answer your research question.   (I once had several students who were preparing a statistical compendium of data from apartheid South Africa;   finding little useful data, they wrote a wonderful paper about the politics of data collection and dissemination in that divided country.)

5.  Moral:  all is not lost if your R-square is low.
That is, for this assignment, if the final results are less than breathtaking, write a thoughtful review and analysis of the steps you took in the research project.   Refer to the concepts and skills from this class.

FINALLY: Remember that the final product of this project is not simple data output, but also an intelligent, reflective discussion of the methodology, the results and their implications, the process of defining measures, selecting variables and cases, finding data, and the shortcomings of the research. Develop an overall narrative that places the research in a larger context.

-------------------------------------
 

SPECIAL CASE:   INTERPRETING AND ANALYZING SURVEY DATA

Survey data present special potentials and difficulties for analysis.  The data are often heavily nominal and ordinal, not what you need for interval-scale analysis (such as correlation and regression).  You often don't have enough cases to reach statistical significance, especially if the survey is an exploratory one (e.g., 25 cases).   You often have a large number of variables (e.g., 50-100 questions asked) and don't know where to begin the analysis.  Here is one strategy:

1.   Start with the simple task of profiling the respondents overall.  Who answered the survey?  Who didn't?   How representative is the sample for the community in question?   How representative is the sample for the larger world?   How complete were the responses?    How many complete vs. partial responses did you get?  It can be very useful to provide a brief overview table or summary of the sample.

2.  Then calculate the marginal totals for each question: that is, how frequently respondents chose each answer.  (Example:  27%  "high", 42% "medium," and 31% "low" to your question about a sense of community.) You might find that incorporating the absolute and percent values for each question directly onto the questionnaire itself (either by hand or typing) is an effective way to communicate this information. (This also allows the reader to see the original format and wording of the questions.)
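
Marginal totals for a single question can be tallied with a few lines of code. This is a minimal sketch: the list of 100 responses is an assumption, constructed only so that the percentages match the "sense of community" example above.

```python
from collections import Counter

# Assumed responses to one question (100 respondents, invented to match the example).
responses = ["high"] * 27 + ["medium"] * 42 + ["low"] * 31

counts = Counter(responses)
total = sum(counts.values())

# Absolute count and percent for each answer category.
marginals = {
    answer: (counts[answer], 100 * counts[answer] / total)
    for answer in ("high", "medium", "low")
}

for answer, (count, percent) in marginals.items():
    print(f"{answer:<8}{count:>4} ({percent:.0f}%)")
```

Repeating this for every question gives the absolute and percent values that can be written onto the questionnaire itself.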

3.  Look for the most interesting variables with significant ranges of answers.  (If 98% answered that they believed that "environmental protection is a high priority for our community," that answer may be useful in and of itself, but what you want are variables with a wider range of answers.)   Then run cross-tabs (cross-tabulations).   For example, a 2x3 table comparing male and female responses to the question about a sense of community.   (Note that this is a nominal variable crossed with an ordinal variable;   any nominal and/or ordinal variables will do;  or you can demote an interval variable to an ordinal variable.)    Note that you can do more than a 2-dimensional cross-tab, e.g., sex by race by "sense of community" (it just gets more complicated).  [Cross-tabs are the nominal/ordinal equivalent of scatterplots for interval data.]

Example of crosstabs:

"How often do you participate in community development events in your neighborhoods?"
Absolute counts (and percent of gender group) listed

          Never       Less than once a month   Once a month or more   Total
Male      12 (63%)    4 (21%)                  3 (16%)                19 (100%)
Female    10 (32%)    9 (29%)                  12 (39%)               31 (100%)
Total     22 (44%)    13 (26%)                 15 (30%)               50 (100%)
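
The cross-tab above can be rebuilt from respondent-level records. In this sketch the 50 records are reconstructed from the counts in the table (the original survey file is not available here), and the variable and function names are illustrative choices.

```python
from collections import Counter

ANSWERS = ["Never", "Less than once a month", "Once a month or more"]

# One (gender, answer) record per respondent, rebuilt from the table's counts.
records = (
    [("Male", ANSWERS[0])] * 12
    + [("Male", ANSWERS[1])] * 4
    + [("Male", ANSWERS[2])] * 3
    + [("Female", ANSWERS[0])] * 10
    + [("Female", ANSWERS[1])] * 9
    + [("Female", ANSWERS[2])] * 12
)

# Cross-tabulate: count each (gender, answer) pair, and total each gender row.
cell_counts = Counter(records)
row_totals = Counter(gender for gender, _ in records)


def row_percent(gender, answer):
    """Percent of a gender group giving this answer, rounded to a whole number."""
    return round(100 * cell_counts[(gender, answer)] / row_totals[gender])


for gender in ("Male", "Female"):
    cells = [
        f"{cell_counts[(gender, answer)]} ({row_percent(gender, answer)}%)"
        for answer in ANSWERS
    ]
    print(gender, cells, "Total:", row_totals[gender])
```

Percentages are computed within each gender group (row percentages), which is what makes the male and female response patterns directly comparable despite the different group sizes.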

4.  If there seems to be a pattern in a cross tab, run a test of significance, such as chi-square (for nominal data), or Kendall's tau, gamma, etc. (for ordinal data).
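
As a rough illustration of the chi-square test, the sketch below applies it by hand to the counts from the cross-tab example on community development events. It uses only the standard library. The closed-form p-value used here is valid only for exactly 2 degrees of freedom, which is what a 2x3 table has; for other table sizes a statistics package would be needed.

```python
import math

# Observed counts: rows are Male, Female; columns are the three answer categories.
observed = [
    [12, 4, 3],
    [10, 9, 12],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Shortcut that holds only when degrees of freedom == 2:
# the upper-tail probability of the chi-square distribution is exp(-x / 2).
if degrees_of_freedom != 2:
    raise ValueError("This shortcut for the p-value requires exactly 2 degrees of freedom.")
p_value = math.exp(-chi_square / 2)

print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}, p = {p_value:.3f}")
```

For this table the statistic comes out at roughly 4.9 with a p-value of about 0.09, so the apparent gender difference would not reach the conventional 0.05 level with only 50 respondents. Because gender is nominal and frequency of participation is ordinal, an ordinal measure may also be worth checking in a statistics package.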

Cross-tabs in SPSS:
The cross-tabs function in SPSS can be found in:
Analyze > descriptive statistics > crosstabs
To run tests of statistical significance on your crosstabs, within the crosstabs dialogue box, click on the "statistics" button. In the statistics window, select the appropriate crosstabs statistic, such as chi-square or tau (for more information on which to use, check a statistics textbook). Do note whether your pair of variables are interval-interval, nominal-nominal, or interval-nominal.

Note: standard tests of difference of means (such as two-mean t-tests and ANOVA) can be found in
Analyze > compare means

Cross-tabs in Excel:
The cross-tabs function in Excel is called "PivotTable" -- one generates a "PivotTable Report".

 

5.  Inductive or deductive:  At this point, you can either use the data to test specific hypotheses, or else present a more exploratory overview of patterns in the data.    If effective, develop graphs to display the most interesting patterns.   Effective analyses of survey data often interweave tables, charts and narrative, telling a story and creating a vivid profile.  An abstract with the key findings is useful.

6.  For advanced users, try models that have a nominal-scale dependent variable, such as a logit model.   It uses a logic similar to that of regression.  (Example:  if you have data on individuals and where each one lived in 1980 and 1990, you could model whether or not each one moved during this period.)
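
To give a feel for the logit idea, here is a minimal sketch that fits a one-predictor logit model by simple gradient ascent on the log-likelihood. Everything specific in it is an assumption for illustration: the twelve records are invented, age in 1980 is an arbitrary choice of predictor, and the learning rate and number of steps are not tuned. In practice a statistics package would do the fitting and report standard errors.

```python
import math

# Invented data: (age in 1980, moved between 1980 and 1990? 1 = yes, 0 = no).
records = [
    (22, 1), (25, 1), (28, 1), (31, 1), (34, 0), (37, 1),
    (40, 0), (45, 0), (50, 1), (55, 0), (60, 0), (65, 0),
]


def scale(age):
    """Center and rescale age so that gradient ascent behaves well."""
    return (age - 40) / 10


def sigmoid(z):
    """Logistic function: maps any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def fit_logit(data, learning_rate=0.1, steps=5000):
    """Estimate intercept and slope by gradient ascent on the log-likelihood."""
    intercept, slope = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_intercept = 0.0
        grad_slope = 0.0
        for age, moved in data:
            x = scale(age)
            error = moved - sigmoid(intercept + slope * x)
            grad_intercept += error
            grad_slope += error * x
        intercept += learning_rate * grad_intercept / n
        slope += learning_rate * grad_slope / n
    return intercept, slope


intercept, slope = fit_logit(records)


def predicted_probability(age):
    """Model's estimated probability that a person of this age moved."""
    return sigmoid(intercept + slope * scale(age))


print(f"intercept = {intercept:.3f}, slope per 10 years of age = {slope:.3f}")
print(f"P(moved | age 25) = {predicted_probability(25):.2f}")
print(f"P(moved | age 60) = {predicted_probability(60):.2f}")
```

The dependent variable is a yes/no outcome, and the model predicts the probability of "yes" rather than a quantity. The sign of the slope is read much like a regression coefficient: with these invented data it is negative, meaning older people are estimated to be less likely to have moved.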

-------------------------------------