Essay about Sampling Project

Submitted By Yunjiejiang1

STAT506 Sampling Project

Group Members: Wenli Hu, Joyce Jiang, Xi Tian, Ye Yu
April 19th, 2012



Is it possible to collect data from the entire population?
-If so, we can talk about what is true for the entire population
-Often we cannot (time/cost)
-If not, we can use a smaller subset: a SAMPLE

-Research Background Introduction
-Sampling Methods
  1. Simple random sampling
  2. Post-stratification
  3. Regression
  4. Stratified sampling
-Conclusion









The Pima Indians are American Indians who today live in the Gila River Indian Community in Arizona. They have a rate of type II diabetes much higher than the “normal” rate in the US, and they are said to be genetically susceptible to both diabetes and obesity. For this reason the Pima Indians are taken as an example of how genetics can cause diabetes. Pima women seem to have a higher rate than men.


-Study done by the National Institute of Diabetes and Digestive and Kidney Diseases
-Data received: 9 May 1990
-Population: 768 Pima Indian women
-Tested positive instances: 268
-Tested negative instances: 500



Attributes used in our observations
-Plasma glucose concentration at 2 hours in an oral glucose tolerance test
-Age
-Class variable (0 or 1)


Simple Random Sampling

The simplest method: establish a sample size and randomly select units until that sample size is reached.

• Data set:
We have a list of 532 patients and randomly select 50 of them from this list (without replacement).
N = 532, n = 50
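
The selection step can be sketched in Python. This is a minimal illustration, not the code used in the project: the function name `simple_random_sample`, the patient labels 1 to 532, and the seed are all assumptions made for the example.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw unit labels 1..population_size without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    # random.sample never repeats an element, which is sampling without replacement
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

N, n = 532, 50  # population and sample sizes from the slide
selected = simple_random_sample(N, n, seed=506)  # seed is arbitrary, for reproducibility
print(len(selected), "patients selected; sampling fraction n/N =", round(n / N, 4))
```

Every subset of 50 patients has the same chance of being drawn, which is what makes the sample mean unbiased for the population mean.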

...



Data analysis



Advantages
-Simple and unbiased



Disadvantages
-Requires an accurate list of the whole population
-Expensive to conduct


Post-stratification

Stratification is applied after the sample has been selected. The sample is not balanced with respect to diabetes status.

...

Diabetes    Sample Size    Glucose Mean    Variance
Yes         177            142.69          824.40
No          355            114.08          632.69

= 26.43



Advantages
-Makes weighted estimates to ensure proportional representation

Disadvantages
-Requires more information about the population being sampled
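
The weighting idea can be illustrated with a short Python sketch. The counts (177, 355) and glucose means (142.69, 114.08) are the figures from the diabetes table above. Treating the counts as stratum sizes that sum to N = 532 and using them as weights is an assumption of this sketch, as is the function name `poststratified_mean`.

```python
def poststratified_mean(stratum_sizes, stratum_means):
    """Weight each stratum mean by its share of the population (hypothetical helper)."""
    total = sum(stratum_sizes)
    return sum(size / total * mean for size, mean in zip(stratum_sizes, stratum_means))

# Assumed inputs: counts taken as stratum sizes, glucose means per stratum
sizes = [177, 355]        # diabetes: yes, no
means = [142.69, 114.08]  # mean glucose: yes, no
estimate = poststratified_mean(sizes, means)
print(f"Post-stratified glucose mean: {estimate:.2f}")
```

Under these assumptions the weighted mean comes out at about 123.6, between the two stratum means and closer to the larger "No" group.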



Regression estimator: age as auxiliary variable

[Figure: scatter plot of glucose (z$glu) against age (z$age)]

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    87.3574     13.2850    6.576  3.29e-08 ***
z$age           1.0855      0.3939    2.756   0.00826 **


Y: glucose, X: age

$$\mu_x = 31.61466 \approx 31.6$$

$$\hat{\mu}_L = a + b\,\mu_x = 87.3574 + 1.0855 \times 31.61466 \approx 121.67$$

$$\mathrm{Var}(\hat{\mu}_L) = \frac{(N-n)\,\mathrm{MSE}}{N\,n} = \frac{(532-50) \times 1076.4}{532 \times 50} \approx 19.50$$
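
The arithmetic above can be checked with a small Python sketch. The inputs are the fitted coefficients, the mean age 31.61466, and the MSE 1076.4 quoted above. The function names are hypothetical, and the standard error line is an added convenience rather than a figure from the slides.

```python
import math

def regression_estimate(intercept, slope, aux_mean):
    """Regression estimator of the mean: a + b * (mean of the auxiliary variable)."""
    return intercept + slope * aux_mean

def regression_variance(pop_size, sample_size, mse):
    """Variance formula from the slide: (N - n) * MSE / (N * n)."""
    return (pop_size - sample_size) * mse / (pop_size * sample_size)

a, b = 87.3574, 1.0855       # fitted intercept and slope for age
mu_x = 31.61466              # mean age
N, n, mse = 532, 50, 1076.4  # population size, sample size, mean squared error

mu_hat = regression_estimate(a, b, mu_x)
var_hat = regression_variance(N, n, mse)
print(f"estimate = {mu_hat:.3f}, variance = {var_hat:.2f}, "
      f"standard error = {math.sqrt(var_hat):.2f}")
```

Because the coefficients are rounded to four decimals, the computed estimate agrees with the value above only to within rounding.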


SAS PROC SURVEYREG
-Performs regression analysis for sample survey data
-Handles survey sample designs, including designs with stratification, clustering, and unequal weighting
-With ESTIMATE statements, you can specify a regression estimator



proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
   estimate '1985 population' Intercept 284 Population75 8200;
run;

Cited from: http://www.math.montana.edu/~jobo/thai/4ratreg.pdf

Stratified Sampling

$$n_h = n \cdot \frac{N_h}{N}$$

$$n_1 = 17, \qquad n_2 = 33$$
…
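
The allocation rule can be sketched in Python. The stratum sizes 177 and 355 are assumed to be the diabetes counts from the earlier table (they sum to N = 532), and `proportional_allocation` is a hypothetical helper name.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Allocate the sample to strata in proportion to stratum size: n_h = n * N_h / N."""
    total = sum(stratum_sizes)
    # Rounding to whole units; in general the rounded values may not sum exactly to n
    return [round(sample_size * size / total) for size in stratum_sizes]

stratum_sizes = [177, 355]  # assumed N_h for diabetes: yes, no
allocation = proportional_allocation(stratum_sizes, 50)
print(allocation)
```

With these inputs the unrounded values are about 16.6 and 33.4, which round to the 17 and 33 shown above and still sum to n = 50.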