SOME STATISTICAL CONCEPTS

Mean (average)

Variance

Standard deviation

Bivariate data

Covariance

Correlation

Linear regression

Residuals

Additional formulas

Regression effect


This document was prepared by Professor Gary Simon with the advice (and consent) of Professor William Silber, Stern School of Business. If you have comments or suggestions, please send them to either gsimon@stern.nyu.edu or silber@stern.nyu.edu.

Release date 15 JULY 2002


THE MEAN, OR AVERAGE

This document will introduce a number of statistical concepts, and perhaps some of these may be very new to you. Statistical topics can be confusing because identical subject matter can be described in very different terms.

The very same statistical concept can be described in several ways:

1. Data, as numbers.

2. Data, represented algebraically.

2m. Data, represented algebraically, allowing multiple values.

3. Conceptually as random variables with probabilities.

A list of data, as in items 1 through 2m, might be described as a variable. Each column of a data spreadsheet could be called a variable. In item 3, we do not necessarily have data, and we conceptualize the idea as a random variable.

Let’s illustrate the notions first for the concept of the average of a set of values. The average is also called the mean.

1. Consider the list of values 48, 46, 54, 51, 53. The average (or mean) of this list is found as

$$\frac{48 + 46 + 54 + 51 + 53}{5} = \frac{252}{5} = 50.4.$$
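The arithmetic in item 1 can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are assumptions made for the example, and the standard-library `statistics.mean` function is used purely as a cross-check.

```python
import statistics

# The list of values from item 1.
values = [48, 46, 54, 51, 53]

total = sum(values)   # 48 + 46 + 54 + 51 + 53
n = len(values)       # number of items in the list
mean = total / n      # the average (or mean)

print(total, n, mean)

# Cross-check against the standard library.
assert statistics.mean(values) == mean
```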

2. Let n be the number of items in a list. Represent the list as x1, x2, … , xn.

The three dots simply indicate that we’re omitting some of the values.

Note that we’re using x as the symbol for the list items. You’ll frequently see xi as the generic ith element of the list. You can think of the symbol i as a counter or as an index. Some people will describe the list as { xi ; i = 1, …, n }. You can read this as “x-sub-i, as i runs from 1 to n.”

The average of this list is

$$\frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\,(x_1 + x_2 + \cdots + x_n).$$

Since we’ll be adding lists of numbers rather frequently, it helps to create a simple notation for this concept. We use the summation notation $\sum_{i=1}^{n} x_i$ to represent the sum of the x’s, using the index i to enumerate from the starting value i = 1 to the ending value i = n.

Then we have

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

and the average can be written as

$$\frac{1}{n} \sum_{i=1}^{n} x_i = \frac{\sum_{i=1}^{n} x_i}{n}.$$

The symbol i is nothing but a counting convenience. You should note that

$$\sum_{i=1}^{n} x_i = \sum_{j=1}^{n} x_j = \sum_{u=1}^{n} x_u.$$

In nearly every case you’ll encounter, the entire list of n values will be added, and it’s burdensome to keep the notation above and below the $\Sigma$ sign. You can then use $\sum x_i$ as a simpler notation for $x_1 + x_2 + \cdots + x_n$. Again, the symbol i is a mere counting convenience, and so $\sum x_i = \sum x_j = \sum x_u$.
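The summation notation corresponds directly to a loop over an index. The sketch below is only an illustration; the function name `average` and the reuse of the numbers from item 1 are assumptions made for the example. It adds the list items with an explicit counter and confirms that the name of the counter makes no difference.

```python
def average(xs):
    """Return (1/n) times the sum of x_i for i = 1, ..., n (xs must be non-empty)."""
    n = len(xs)
    total = 0
    for i in range(1, n + 1):  # i runs from 1 to n, as in the notation
        total += xs[i - 1]     # Python lists are 0-based, so x_i is xs[i - 1]
    return total / n

xs = [48, 46, 54, 51, 53]

# The index is a mere counting convenience: i, j, and u give the same sum.
sum_i = sum(xs[i] for i in range(len(xs)))
sum_j = sum(xs[j] for j in range(len(xs)))
sum_u = sum(xs[u] for u in range(len(xs)))
assert sum_i == sum_j == sum_u

result = average(xs)
print(sum_i, result)
```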

The symbol $\bar{x}$, which we read as “x bar,” is the most common notation for the average of the x’s. Thus

$$\bar{x} = \frac{1}{n} \sum x_i.$$

2m. It can happen that the list of values x1, x2, … , xn will involve duplications.

Suppose that there are k different values and that we name them as v1, v2, …, vk . Let’s say that v1 occurs n1 times, v2 occurs n2 times, and so on. The data could then be reorganized to look like this:

Value     Multiplicity
v1        n1
v2        n2
v3        n3
.         .
.         .
vk        nk
TOTAL     n

Now

$$\bar{x} = \frac{1}{n} \sum n_i v_i.$$

You will also see this as

$$\bar{x} = \frac{\sum n_i v_i}{\sum n_i}.$$
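The multiplicity form of the average can be sketched in Python as follows. The data list here is hypothetical, chosen only so that some values repeat, and the variable names are assumptions made for the example. The sketch tallies each distinct value with its multiplicity and compares the resulting average with the one computed directly from the raw list.

```python
from collections import Counter

# Hypothetical data with repeated values (not from the original handout).
xs = [3, 5, 3, 8, 5, 3]

# Map each distinct value v_i to its multiplicity n_i.
counts = Counter(xs)

# The multiplicities add up to n, the length of the original list.
n = sum(counts.values())

# Average computed from the (value, multiplicity) table.
mean_from_counts = sum(n_i * v_i for v_i, n_i in counts.items()) / n

# Average computed directly from the raw list, as in item 2.
mean_from_list = sum(xs) / len(xs)

print(dict(counts), n, mean_from_counts, mean_from_list)
```

Both computations give the same result, since grouping equal values only reorders the terms of the sum.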

The formulas in item 2 above are still correct, even if the list involves duplications.

3. There are times in which we consider problems hypothetically, rather than with numbers (as in item 1) or with algebra symbols (as in items 2 and 2m). In the hypothetical form, we’ll consider X as the phenomenon under discussion, and we’ll give X the technical name random variable.

In this style of thinking, X is endowed with randomness. We should write X in upper case. We may have no data yet, but we can still discuss the possible values