Group Members: Wenli Hu, Joyce Jiang, Xi Tian, Ye Yu
April 19th, 2012

Is it possible to collect data from the entire population?
-If so, we can talk about what is true for the entire population
-Often we cannot (time/cost)
-If not, we can use a smaller subset: a SAMPLE

Research Background Introduction

Sampling Methods
1. Simple random sampling
2. Post-stratification
3. Regression
4. Stratified sampling
Conclusion

Pima Indians are American Indians who live today in the Gila River Indian Community (Arizona). They have a high rate of type II diabetes, much higher than the “normal” rate in the US. They are said to be genetically susceptible to diabetes and obesity, and are taken as an example of how genetics can cause diabetes. Pima women seem to have a higher rate than men.

Done by the National Institute of Diabetes and Digestive and Kidney Diseases
Data received: 9 May 1990
Population: 768 Pima Indian women

Tested positive instances: 268
Tested negative instances: 500

Attributes used in our observations
-Plasma glucose concentration at 2 hours in an oral glucose tolerance test
-Age
-Class variable (0 or 1)

Simple random sampling is the simplest method: establish a sample size and randomly select units until the sample size is reached.

• Data set:

We have a list of 532 patients and randomly select 50 of them from this list (without replacement). N=532 n=50
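The selection step can be sketched in Python as below. This is only an illustration, not the code used for the project: the patient IDs 1..532 and the fixed seed are assumptions made so the draw is reproducible.

```python
import random

N = 532  # size of the patient list (from the slide)
n = 50   # sample size (from the slide)

# Hypothetical patient IDs 1..N; the real list would hold the patient records.
patient_ids = list(range(1, N + 1))

# Assumed fixed seed so the same sample is drawn on every run.
rng = random.Random(2012)

# random.sample draws without replacement, so no patient is selected twice.
sample_ids = rng.sample(patient_ids, n)

print(sorted(sample_ids))
```

Every subset of 50 patients has the same chance of being drawn, which is what makes the sample mean an unbiased estimator.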

...

• Data analysis

Advantages

-Simple and unbiased

Disadvantages

-Requires an accurate list of the whole population
-Expensive to conduct

Post-stratification: stratification after selection of the sample. The simple random sample is not balanced with respect to diabetes status.

...

Diabetes        Yes       No
Sample Size     177       355
Glucose Mean    142.69    114.08
Variance        824.40    632.69

= 26.43

Advantages
-Make weighted estimates to ensure proportional representation

Disadvantages
-Requires more information about the population being sampled
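A weighted estimate of this kind can be sketched as follows. The sketch assumes the counts 177 and 355 from the table act as the stratum sizes that define the weights W_h; the variable names are illustrative, and the result is not necessarily the estimate the slides arrived at.

```python
# Figures from the table (diabetes: yes / no).
counts = {"yes": 177, "no": 355}
glucose_mean = {"yes": 142.69, "no": 114.08}

# Assumption: the counts are stratum sizes, so W_h = count_h / total.
total = sum(counts.values())
weights = {h: counts[h] / total for h in counts}

# Post-stratified mean: each stratum mean weighted by its share W_h.
post_stratified_mean = sum(weights[h] * glucose_mean[h] for h in counts)

print(round(post_stratified_mean, 2))
```

Weighting by stratum share corrects for a sample that happens to contain too many or too few diabetic patients.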

Regression estimator: age as auxiliary variable

[Figure: scatter plot of glucose (z$glu, vertical axis) against age (z$age, horizontal axis)]

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  87.3574   13.2850     6.576    3.29e-08 ***
z$age         1.0855    0.3939     2.756    0.00826 **

Y: glucose, X: age

$\bar{X} = 31.61466 \approx 31.6$

$\hat{\mu}_L = a + b\bar{X} = 87.3574 + 1.0855 \times 31.61466 \approx 121.67$

$\hat{V}(\hat{\mu}_L) = \frac{(N-n)\,\mathrm{MSE}}{Nn} = \frac{(532-50) \times 1076.4}{532 \times 50} \approx 19.50$
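The arithmetic for the regression estimate and its variance can be checked with a short Python sketch. All numbers are taken from the slides; the variable names are illustrative, and the standard error is an extra step not shown in the original.

```python
# Values reported on the slides.
a = 87.3574          # fitted intercept
b = 1.0855           # fitted slope for age
mean_age = 31.61466  # mean of the auxiliary variable (age)
N, n = 532, 50       # population and sample sizes
mse = 1076.4         # mean squared error of the fitted regression

# Regression estimate of mean glucose: a + b * mean age.
mu_hat = a + b * mean_age

# Estimated variance with the finite population correction (N - n) / N.
var_hat = (N - n) * mse / (N * n)

# Standard error, added here for convenience.
std_err = var_hat ** 0.5

print(round(mu_hat, 3), round(var_hat, 2), round(std_err, 2))
```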

SAS PROC SURVEYREG
-Performs regression analysis for sample survey data
-Handles survey sample designs, including designs with stratification, clustering, and unequal weighting
-With ESTIMATE statements, you can specify a regression estimator

proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
   estimate '1985 population' Intercept 284 Population75 8200;
run;

Cited from: http://www.math.montana.edu/~jobo/thai/4ratreg.pdf

Stratified Sampling

$n_h = n \cdot \frac{N_h}{N}$

$n_1 = 17$, $n_2 = 33$…
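The allocation can be reproduced with the sketch below. It assumes the stratum sizes are the 177 (diabetes) and 355 (no diabetes) patients from the post-stratification table and that each stratum sample size is rounded to the nearest integer; both are assumptions, since the slides do not spell them out.

```python
# Assumed stratum sizes N_h, taken from the post-stratification table.
stratum_sizes = {"diabetes": 177, "no diabetes": 355}
n = 50  # total sample size

N = sum(stratum_sizes.values())

# Proportional allocation: n_h = n * N_h / N, rounded to whole patients.
# Rounding does not always sum back to n; here it does.
allocation = {h: round(n * N_h / N) for h, N_h in stratum_sizes.items()}

print(allocation)
```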