An introduction to Stata in 2 pages. Comments, questions and corrections are most welcome and may be sent to Andrew Grogan-Kaylor at agrogan@umich.edu. This document is available on the web at https://agroganweb.wordpress.com/stata-resources/

Commands that you actually type into Stata are represented in `monospace font`. The Stata interface makes it extremely easy to do rapid interactive data analysis. Hit **PAGE-UP** to recall the most recent command, which you can then quickly edit and resubmit. The general idea of most Stata commands is `command variables, options`. Often it is not necessary to use any options since the authors of Stata have done such a good job of thinking about the defaults.[^do-file-editor] x and y refer to variables in your data. The treatment here is intended to be extremely brief, in order to create a kind of “cheat sheet” that can be presented in 2 pages. More documentation on any command is available in the printed or PDF Stata manuals, or by typing `help command`.
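As a minimal sketch of the `command variables, options` pattern, the lines below use `auto.dta`, an example dataset that ships with Stata. The dataset and its variables (`price`, `mpg`) are assumptions chosen for illustration and are not part of this guide.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* General pattern: command variables, options
* First with the defaults, then with one option added
summarize price mpg
summarize price mpg, detail

* Documentation for any command
help summarize
```

The same pattern carries over to the other commands in this guide: start with the defaults and add options after the comma only when you need them.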

`codebook x y` will produce a nicely formatted codebook of selected variables, which is especially useful if you have added variable labels with the `label variable` command. `codebook` is also useful for seeing how numerical values are associated with value labels. `codebook` by itself will list every variable in your data and generate a lot of (probably too much) output.

`lookfor` allows you to find variables whose names or labels contain a specified keyword. This is especially useful in large data sets with many variables. Often abbreviated keywords are the most helpful, e.g. to find a poverty variable, type `lookfor pov`.
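A short sketch of `codebook` and `lookfor` in use, again assuming the `auto.dta` example dataset that ships with Stata (its variable names are not part of this guide):

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* Codebook for two variables; foreign carries value labels
codebook foreign rep78

* Search variable names and labels for an abbreviated keyword
* (here "mil" is meant to pick up the mileage variable)
lookfor mil
```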

`describe` tells you about the contents of specific variables, e.g. `describe x y`. `describe, short` will tell you very basic things about your data, including the number of observations in the data set and the size of your data file.

`recode x (oldvalue = newvalue), generate(z)` will recode a variable into a new variable, which is often a good idea because the original variable is left unchanged.

`recode _all (-99/-1 = .)` will recode all values from -99 to -1 to missing for all variables in your data. `recode x (7/9 = .)` changes 7 through 9 to missing for x. Indeed, `recode` will change specific values in your data to anything you want, not just missing values.
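The following sketch shows `recode` with `generate()` on the `auto.dta` example dataset that ships with Stata. The dataset, the grouping of values, and the new variable name `rep3` are assumptions made for illustration.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* Collapse the 5-point repair record into 3 groups.
* generate() writes the result to a new variable (rep3, a hypothetical
* name) and leaves rep78 untouched.
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* Check the recode by cross-tabulating old against new values
tabulate rep78 rep3, missing
```

Cross-tabulating the old and new variables is a quick way to confirm that every value ended up where you intended.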

It is often convenient to `rename` your variables so that they have more intuitively understandable names, e.g. `rename x depression`.
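A brief sketch of renaming and then labeling a variable, assuming the `auto.dta` example dataset that ships with Stata. The new name `mileage` and the label text are hypothetical.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* Give the variable a more descriptive name (mileage is a hypothetical name)
rename mpg mileage

* Attach a variable label so that codebook and describe output reads well
label variable mileage "Mileage in miles per gallon"

describe mileage
```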

You can create new variables out of old variables using `generate newvar = expression`, e.g. `generate newvar = oldvar1 + oldvar2`. `alpha oldvar1 oldvar2` will calculate Cronbach’s alpha for the scale made up of these variables.
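As a sketch of the syntax only, the lines below build a sum score and compute Cronbach's alpha on the `auto.dta` example dataset that ships with Stata. The three variables do not form a real psychometric scale, and `roominess` is a hypothetical name.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata.
* headroom, trunk and length are not a real scale; they only show the syntax.
sysuse auto, clear

* Sum score built from existing variables (roominess is a hypothetical name)
generate roominess = headroom + trunk + length

* Cronbach's alpha for the same set of items
alpha headroom trunk length
```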

It is sometimes useful to `sort` your data. `sort x` will sort your data by the values of x.

`summarize` gives you basic descriptive statistics for a variable, such as the mean (average). It is especially useful for continuous variables, e.g. `summarize x y` or `summarize x y, detail`.

`tabulate` gives you a frequency distribution for your variable. It is especially useful for categorical variables, e.g. `tabulate x`.

`oneway continuous_var categorical_var, tabulate` gives you a oneway ANOVA of a continuous variable over a categorical factor.

Tabulating two categorical variables together gives you a cross-tabulation of those variables, e.g. `tabulate x y, row col chi2`.

`pwcorr x y, sig` gives you the pairwise correlation of two continuous variables.
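The bivariate commands above can be tried out as follows. This sketch assumes the `auto.dta` example dataset that ships with Stata; the choice of variables is illustrative.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* Frequency distribution of one categorical variable
tabulate rep78

* Cross-tabulation with row and column percentages and a chi-squared test
tabulate rep78 foreign, row col chi2

* Oneway ANOVA of a continuous variable over a categorical factor
oneway mpg rep78, tabulate

* Pairwise correlation with significance level
pwcorr price mpg, sig
```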

`regress y x` regresses y on x. After running many multivariate models, `estat summarize` will give you simple descriptive statistics for the specific sample used in that particular analysis.

`regress y x z` regresses y on x and z. Other regression commands follow a very similar format, `command y x z`, but are beyond the purview of this 2 page guide. `regress y x i.z` regresses y on x and z, treating x as continuous and z as a set of categorical indicator variables.
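A minimal sketch of the three regression forms, assuming the `auto.dta` example dataset that ships with Stata. The outcome and predictors are chosen only to illustrate the syntax.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* One predictor
regress price mpg

* Two continuous predictors
regress price mpg weight

* Continuous predictor plus a categorical predictor entered as indicators
regress price mpg i.rep78

* Descriptive statistics for the estimation sample of the last model
estat summarize
```

Because `rep78` has missing values, the last model is fit on fewer observations than the first two, which is exactly the situation in which `estat summarize` is helpful.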

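`twoway lfit y x, title(".") xtitle(".") ytitle(".")` adds a graph title and axis titles through the `title()`, `xtitle()` and `ytitle()` options; put your own text between the quotes.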

`histogram x` will give you a nice display of one variable. `histogram x, by(y)` may be useful for comparing the distribution of x over the categories of y.

`histogram x, percent` will scale the y-axis more intuitively in terms of percentages. `histogram x, discrete` gives a nicer display for categorical variables. The percent and discrete options can be combined.
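The histogram variants can be compared side by side with a sketch like the one below, which assumes the `auto.dta` example dataset that ships with Stata.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* Default histogram (density on the y-axis)
histogram mpg

* Percentages on the y-axis
histogram mpg, percent

* Categorical variable: one bar per distinct value, in percent
histogram rep78, discrete percent

* Separate panels for each category of another variable
histogram mpg, by(foreign)
```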

`twoway scatter y x` gives you a twoway scatterplot of your data. `twoway lfit y x` will give you a linear fit graph. The two syntaxes may be combined, e.g. `twoway (scatter y x) (lfit y x)`.

`graph bar x, over(y)` is useful for creating a bar graph of a continuous or categorical variable across the categories of a categorical variable. By default the bars show the mean of x.
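As a sketch of the `twoway` and `graph bar` commands, the lines below overlay a scatterplot and a linear fit with custom titles, then draw a bar graph of means. The `auto.dta` example dataset and the title text are assumptions for illustration.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* Scatterplot with an overlaid linear fit, plus titles
* (the title text is made up for this example)
twoway (scatter mpg weight) (lfit mpg weight), title("Mileage by weight") xtitle("Weight (lbs.)") ytitle("Mileage (mpg)")

* Bar graph of mean mpg within each category of rep78
graph bar mpg, over(rep78)
```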

In many cases you may want to look at the results of some calculation for x, or for x and y, over a third variable z. In such cases the `by:` syntax will be especially useful. For example, to look at the correlation of x and y over different values of z, sort the data by z and then prefix the command with `by z:`.

`sort z`

`by z: pwcorr x y, sig`
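The same `by:` pattern on real variables might look like the sketch below, which assumes the `auto.dta` example dataset that ships with Stata. The `bysort` prefix in the last line is standard Stata shorthand for sorting and then using `by:`; it is not mentioned elsewhere in this guide.

```stata
* Illustrative only: auto.dta is an example dataset shipped with Stata
sysuse auto, clear

* Data must be sorted by the grouping variable before by: is used
sort foreign
by foreign: pwcorr price mpg, sig

* bysort combines the sort and by: steps in a single prefix
bysort foreign: summarize price
```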