SI 544 Introduction to Statistics and Data Analysis

Resources

cTools
Readings, assignments, etc. will be posted to the course ctools website

instructor:
Lada Adamic


Schedule

Winter 2008:

Lectures will be on Tuesdays and Thursdays from 9:00 to 10:30 am.
On Thursdays we will usually meet in 409 West Hall; on Tuesdays we will be at the DIAD lab.

Office hours:
Mon 4-5pm
Tues/Thurs 10:30-11:00am

PS5: Inference, outliers

1. Decision rules
You are about to conduct a study and need to decide ahead of time what your decision rule should be (that is, how small the p-value must be for you to reject the null hypothesis). There are two scenarios.

  1. You've hacked together a nifty little improvement for the user interface for your company's e-commerce site. You will randomly expose 1000 customers to the modified version of the site and measure whether they spent more money. You don't want to do this too often (after all, customers tend to get upset if things get rearranged on the site from session to session), but it is not that much effort to continue experimenting with the design. The null hypothesis is that your design did not influence people to spend more money.
  2. You are considering a product by a vendor that would handle all of your electronic payment transactions. You do careful testing of the product on your designated test server in order to determine whether transactions will be processed more quickly. The product costs about the same as your current solution, but switching over has big costs in terms of adjusting your current 'production' system. The null hypothesis is that the product you are considering does not speed up transactions.

For scenarios 1 and 2 separately, please answer the following:

  • What consequence (cost) would a type I error imply, and what would a type II error imply, in each scenario?
  • Given your response to the previous question, what decision rule would you select for each scenario and why?
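
The trade-off behind these two questions can be made concrete with a small calculation. What follows is a minimal Python sketch (the functions named in this assignment are R functions; Python is used here only for illustration), assuming a one-sided z-test with a known standard deviation. The effect size, standard deviation, and sample size are made-up numbers, not values taken from either scenario.

```python
from statistics import NormalDist

def power_one_sided_z(alpha, effect, sigma, n):
    """Probability of rejecting H0 when the true mean shift is `effect`.

    Assumes a one-sided z-test with known standard deviation `sigma`.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma        # true shift in standard-error units
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical numbers chosen only to illustrate the trade-off.
EFFECT, SIGMA, N = 2.0, 30.0, 1000

for alpha in (0.10, 0.05, 0.01):
    power = power_one_sided_z(alpha, EFFECT, SIGMA, N)
    print(f"alpha={alpha:.2f}  P(type I)={alpha:.2f}  P(type II)={1 - power:.3f}")
```

With everything else held fixed, a stricter decision rule lowers the chance of a type I error and raises the chance of a type II error, so the choice depends on which error is more costly in the scenario at hand.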

2. Where is Ann Arbor?

Download the nation-wide library data again: http://www-personal.umich.edu/~ladamic/courses/si544w08/data/libraries.dat.

  • Plot the local government funding vs. federal funding for each library, choosing the appropriate scaling for the axes (or transforming the data as necessary instead).
  • Use points() to highlight and label Ann Arbor on this plot.
  • (extra credit) Use lowess() to draw a smooth curve through the data.
  • Does Ann Arbor fall above or below the trend?
  • Using the identify() function, identify the name, city, and state of an outlier that receives a high level of local funding relative to the federal funding it receives.
  • Submit a single figure showing all of the above.

3. Sampling distribution

Consider the same library data as above. For each of three sample sizes (10, 100, and 500), repeatedly (1000 times each) draw a sample without replacement and compute the sample average of the number of librarians per 1,000 population served by the library.

What are the mean and standard deviation of the entire 'population' of libraries?

What are the mean and standard deviation of the sample means of 10 libraries? 100 libraries? 500 libraries?
From these, construct a 95% confidence interval for each sample size.
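
The resampling procedure described above can be sketched as follows. This is an illustrative Python sketch (the functions named in this assignment are R functions; Python is used here only for illustration), and it runs on a synthetic, right-skewed population rather than the library data. The population size, the lognormal parameters, and the helper name `sampling_distribution` are all assumptions made for the example.

```python
import random
import statistics

random.seed(544)  # fixed seed so the sketch is reproducible

# Synthetic, right-skewed stand-in for "librarians per 1,000 population served".
population = [random.lognormvariate(-1.0, 0.8) for _ in range(5000)]

def sampling_distribution(values, sample_size, n_draws=1000):
    """Means of `n_draws` samples of `sample_size`, each drawn without replacement."""
    return [statistics.fmean(random.sample(values, sample_size))
            for _ in range(n_draws)]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print(f"population: mean={pop_mean:.3f} sd={pop_sd:.3f}")

results = {}
for n in (10, 100, 500):
    means = sampling_distribution(population, n)
    m, s = statistics.fmean(means), statistics.stdev(means)
    # Normal-approximation 95% interval for a sample mean of this size.
    results[n] = (m, s, m - 1.96 * s, m + 1.96 * s)
    print(f"n={n:3d}: mean={m:.3f} sd={s:.3f} "
          f"95% CI=({results[n][2]:.3f}, {results[n][3]:.3f})")
```

The mean of the sample means stays close to the population mean for every sample size, while their standard deviation shrinks as the sample size grows.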

You're trying to decide whether you should collect data on 10, 100, or 500 random libraries in a new survey you are about to conduct to see if the average number of librarians per 1,000 population has changed since the 1996 survey. If you would like your estimate to be within 30% of the actual value with 95% probability, which of the above sample sizes should you use while being as thrifty as possible?
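
One way to frame this choice is to compare the half-width of an approximate 95% interval for the sample mean with 30% of the mean. The Python sketch below (the functions named in this assignment are R functions; Python is used here only for illustration) assumes the standard error is sd / sqrt(n), ignores any finite-population correction, and uses a hypothetical mean and standard deviation rather than values computed from the library data. The helper names are invented for the example.

```python
from math import sqrt

def relative_margin(mean, sd, n, z=1.96):
    """Half-width of the ~95% interval for a sample mean, as a fraction of the mean."""
    return z * sd / sqrt(n) / mean

def smallest_adequate_size(mean, sd, sizes, tolerance=0.30):
    """Smallest candidate size whose relative margin is within `tolerance`, else None."""
    for n in sorted(sizes):
        if relative_margin(mean, sd, n) <= tolerance:
            return n
    return None

# Hypothetical population summary, not computed from the library data.
MEAN, SD = 0.5, 0.6
for n in (10, 100, 500):
    print(f"n={n:3d}: relative margin = {relative_margin(MEAN, SD, n):.3f}")
print("chosen size:", smallest_adequate_size(MEAN, SD, (10, 100, 500)))
```

The same comparison can be made empirically by using the standard deviation of the simulated sample means in place of sd / sqrt(n).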

Finally, draw a single sample of your selected size. Report the mean and construct a 95% confidence interval using the t distribution. Is the actual population mean within your 95% confidence interval?
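
A t-based interval for a single sample can be sketched as below. This is an illustrative Python sketch (the functions named in this assignment are R functions; Python is used here only for illustration) on synthetic data. The population, the sample size of 100, and the helper name `t_confidence_interval` are assumptions for the example, and the critical values are rounded entries from a standard t table rather than computed quantiles.

```python
import random
import statistics
from math import sqrt

# Two-sided 95% critical values of the t distribution from a standard table,
# keyed by degrees of freedom (n - 1) for the three sample sizes in this problem.
T_CRIT_95 = {9: 2.262, 99: 1.984, 499: 1.965}

def t_confidence_interval(sample):
    """95% confidence interval for the mean, using the t distribution."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    half_width = T_CRIT_95[n - 1] * std_err
    return mean - half_width, mean + half_width

random.seed(544)
# Synthetic stand-in for the library data (hypothetical values).
population = [random.lognormvariate(-1.0, 0.8) for _ in range(5000)]
sample = random.sample(population, 100)  # sample size chosen arbitrarily here

low, high = t_confidence_interval(sample)
pop_mean = statistics.fmean(population)
print(f"sample mean={statistics.fmean(sample):.3f}  95% CI=({low:.3f}, {high:.3f})")
print("population mean inside interval:", low <= pop_mean <= high)
```

Over many repeated samples, an interval built this way should cover the population mean about 95% of the time, so a single miss is not by itself evidence of an error.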