Covariance and Correlation

Numerical Summary of Data

Pan Chao

November 17, 2014

Numerical summary of data

Covariance and Correlation

Measures of center

Measures of Center

1. Mean: arithmetic average

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Example:

1, 2, 2, 3, 4, 7, 9

$\bar{x} = \frac{1 + 2 + 2 + 3 + 4 + 7 + 9}{7} = 4.$
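
As a quick check of the example, the mean can be computed directly from its definition. This is a minimal Python sketch; the variable names `data` and `x_bar` are chosen here for illustration and are not part of the slides.

```python
from statistics import mean

data = [1, 2, 2, 3, 4, 7, 9]

# Mean as the sum of the observations divided by their count.
x_bar = sum(data) / len(data)

# The standard library gives the same value.
assert x_bar == mean(data)
print(x_bar)  # 4.0
```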

2. Mode: most frequent value in a data set, highest peak.

Example: 2 is the mode in the previous example.

Remark: a data set can have more than one mode.
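
A short Python sketch of the mode, using `statistics.multimode` from the standard library (Python 3.8+), which returns every most-frequent value. The second data set, `two_modes`, is a made-up illustration of the remark and does not come from the slides.

```python
from statistics import multimode

data = [1, 2, 2, 3, 4, 7, 9]
print(multimode(data))  # [2]

# Hypothetical data in which two values are tied for most frequent.
two_modes = [1, 1, 2, 3, 3]
print(multimode(two_modes))  # [1, 3]
```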

3. Median: midpoint of the data such that half of the values are smaller and half of the values are larger.

How to find the median:

1. arrange the data in increasing order (from smallest to largest)

2. count the number of observations, n.

3a. If n is odd, the median is the middle ordered value:

M = $\left(\frac{n+1}{2}\right)$th ordered value

3b. If n is even, the median is the average of the two middle ordered values:

M = average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th ordered values

Example: observations 7, 9, 10, 12, 14 (the sample median is 10)

Example: observations 3, 4, 9, 12, 14, 19 (the sample median is 10.5)
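
The steps above can be sketched in Python as follows. The function name `median` is an illustrative choice; the only subtlety is that Python lists are 0-based, so the $\left(\frac{n+1}{2}\right)$th ordered value sits at index `n // 2`.

```python
def median(values):
    """Median following the steps above (sort, count, pick the middle)."""
    ordered = sorted(values)   # step 1: arrange in increasing order
    n = len(ordered)           # step 2: count the observations
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # step 3a: the ((n + 1) / 2)th value is at 0-based index n // 2
        return ordered[mid]
    # step 3b: average of the (n / 2)th and (n / 2 + 1)th values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 9, 10, 12, 14]))     # 10
print(median([3, 4, 9, 12, 14, 19]))  # 10.5
```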

Example

Bob’s last 20 golf scores, beginning with his last score:

69, 76, 77, 76, 73, 75, 81, 83, 77, 77,

82, 77, 77, 78, 75, 80, 80, 78, 79, 84

1. What is the mode for this data set? (77)

Ordered data: 69, 73, 75, 75, 76, 76, 77, 77, 77, 77, 77, 78, 78, 79, 80, 80, 81, 82, 83, 84

2. Determine the median (77)

3. Calculate Bob’s mean golf score (77.7)
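
The three answers can be checked with Python's standard `statistics` module. This is only a verification sketch; the list name `scores` is an illustrative choice.

```python
from statistics import mean, median, multimode

# Bob's 20 golf scores as listed in the example.
scores = [69, 76, 77, 76, 73, 75, 81, 83, 77, 77,
          82, 77, 77, 78, 75, 80, 80, 78, 79, 84]

print(sorted(scores))     # the ordered data used to read off mode and median
print(multimode(scores))  # [77]
print(median(scores))     # 77.0
print(mean(scores))       # 77.7
```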

Measures of Variability

1. Range: max − min

(simplest, but not always useful)

2. Variance: based on the difference between each observation and the mean.

Population variance:

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Sample variance (almost always):

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Remarks:

Variance is always non-negative (≥ 0)

A variance of 0 means there is no variation, i.e., every observation in the data set has the same value.
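
The two variance formulas differ only in the divisor. The sketch below implements both directly and cross-checks them against `statistics.pvariance` and `statistics.variance`. The function names are illustrative, and the data set is reused from the earlier mean example.

```python
import math
from statistics import pvariance, variance


def population_variance(xs):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


data = [1, 2, 2, 3, 4, 7, 9]

# Cross-check against the standard library.
assert math.isclose(population_variance(data), pvariance(data))
assert math.isclose(sample_variance(data), variance(data))

# A constant data set has no variation, so its variance is 0.
print(sample_variance([5, 5, 5]))  # 0.0
```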

3. Standard deviation: most commonly used for measuring how far observations are from the mean.

Population version: $\sigma = \sqrt{\sigma^2}$

Sample version (almost always): $s = \sqrt{s^2}$

Example

Compute the standard deviation of the data set 0, 2, 4.

i    $x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
1    0        -2                 4
2    2        0                  0
3    4        2                  4

Mean: $\bar{x} = 2$

Variance: $s^2 = \frac{4 + 0 + 4}{3 - 1} = 4$

Standard deviation: $s = \sqrt{4} = 2$
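
The same table can be reproduced step by step in Python. This is a minimal sketch with illustrative variable names, using the sample (n − 1) version of the variance as in the example.

```python
from math import sqrt

data = [0, 2, 4]
n = len(data)

x_bar = sum(data) / n                    # mean
deviations = [x - x_bar for x in data]   # x_i - x_bar
squared = [d ** 2 for d in deviations]   # (x_i - x_bar)^2

s2 = sum(squared) / (n - 1)              # sample variance
s = sqrt(s2)                             # sample standard deviation

# One row per observation: i, x_i, deviation, squared deviation.
for i, (x, d, sq) in enumerate(zip(data, deviations, squared), start=1):
    print(i, x, d, sq)

print(x_bar, s2, s)  # 2.0 4.0 2.0
```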

4. pth percentile: value such that p% of the observations fall at or below it

Median: M = 50th percentile

First quartile: Q1 = 25th percentile

Third quartile: Q3 = 75th percentile

How to find a percentile for data?

1. Order the data in increasing order.

2. Calculate i = np/100, where n is the sample size and p is the percentile.

3a. If i is not an integer, round i up to the next integer. Then take the ith value.

3b. If i is an integer, take the average of the ith and (i + 1)th values.

Example: -20, 1, 23, 25, 32.5, 33, 67

Median = 25

First quartile = 1

Third quartile = 33

Example: 1, 2, 4, 6, 8, 9, 12, 13

Median = 7

First quartile = 3

Third quartile = 10.5
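
The percentile rule above can be sketched in Python as follows. The function name `percentile` is an illustrative choice, and restricting p to 0 < p < 100 is an assumption made here so that the ith and (i + 1)th values always exist. Library routines such as `statistics.quantiles` use different interpolation conventions and can give different quartiles for the same data.

```python
def percentile(values, p):
    """p-th percentile using the rule above (assumes 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)         # step 1: order the data
    n = len(ordered)
    q, r = divmod(n * p, 100)        # step 2: i = n * p / 100 = q + r / 100
    q = int(q)
    if r != 0:
        # step 3a: i is not an integer; round up to q + 1 (0-based index q)
        return ordered[q]
    # step 3b: i is an integer; average the ith and (i + 1)th values
    return (ordered[q - 1] + ordered[q]) / 2


first = [-20, 1, 23, 25, 32.5, 33, 67]
print(percentile(first, 50), percentile(first, 25), percentile(first, 75))
# 25 1 33

second = [1, 2, 4, 6, 8, 9, 12, 13]
print(percentile(second, 50), percentile(second, 25), percentile(second, 75))
# 7.0 3.0 10.5
```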

5. Interquartile Range (IQR): IQR = Q3 − Q1

Outliers: an observation is said to be a suspected outlier if it is

> Q3 + 1.5 × IQR

OR

< Q1 − 1.5 × IQR

Example: 1, 2, 3, 4, 5, 6, 11

M = 4, Q1 = 2, Q3 = 6, IQR = 4

[Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] = [−4, 12], so no observation (not even 11) is a suspected outlier.
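
The 1.5 × IQR rule can be sketched in Python as below. The helper `quartile` applies the slides' percentile rule, and both function names are illustrative choices. The second call uses a made-up data set (11 replaced by 20) to show a value that does get flagged.

```python
def quartile(ordered, p):
    """p-th percentile of already-sorted data, using the rule i = n * p / 100."""
    q, r = divmod(len(ordered) * p, 100)
    if r != 0:
        return ordered[q]                     # i not an integer: round up
    return (ordered[q - 1] + ordered[q]) / 2  # i an integer: average neighbours


def suspected_outliers(values):
    """Return (lower fence, upper fence, observations outside the fences)."""
    ordered = sorted(values)
    q1, q3 = quartile(ordered, 25), quartile(ordered, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return low, high, [x for x in ordered if x < low or x > high]


print(suspected_outliers([1, 2, 3, 4, 5, 6, 11]))  # (-4.0, 12.0, [])

# Hypothetical variation: 20 lies above the upper fence of 12.
print(suspected_outliers([1, 2, 3, 4, 5, 6, 20]))  # (-4.0, 12.0, [20])
```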

Five-number Summary

Min, Q1, Median, Q3, Max

Remark: Divide our