Nikhil Shaganti

MID: M07428499

Data Management (Section-001)

Data Background:

The data used in this project come from a paper on the relationship between house prices and clean air, written in the late 1970s by David Harrison of Harvard and Daniel Rubinfeld of the University of Michigan.

The dataset was downloaded from the UCI Machine Learning Repository (http://lib.stat.cmu.edu/datasets/boston_corrected.txt) and concerns housing values in the suburbs of Boston. It has 506 instances of 16 attributes. The following table gives information about the attributes in the dataset:

Attribute Information:

Attribute   Description
CRIM        per capita crime rate by town
ZN          proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS       proportion of non-retail business acres per town
CHAS        Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX         nitric oxides concentration (parts per 10 million)
RM          average number of rooms per dwelling
AGE         proportion of owner-occupied units built prior to 1940
DIS         weighted distances to five Boston employment centers
RAD         index of accessibility to radial highways
TAX         full-value property-tax rate per USD 10,000
PTRATIO     pupil-teacher ratio by town
MEDV        median value of owner-occupied homes in USD 1000's
TOWN        name of town
TRACT       census tract
LON         longitude of census tract
LAT         latitude of census tract

In this project, I explore the housing dataset with the aid of the R statistical package. I would like to see which attributes affect housing prices across Boston, using some well-known machine learning algorithms.

Summary Statistics and Exploratory Data Analysis:

Because the data concern house prices and clean air, let’s start by exploring the variables MEDV and NOX.

Statistically:

Distribution of housing prices in the Boston area (in $1000s):

Distribution of NOX concentration in the Boston area:

We will use these statistics for further analysis.
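As a rough illustration of the kind of summary used here, the sketch below computes the minimum, quartiles, median, mean and maximum of a variable. It is written in Python rather than R, and the values in `medv_sample` are made up for illustration; they are not the Boston data. The function name `five_number_summary` is my own.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles, median, mean and max of a list of numbers."""
    data = sorted(values)
    # "inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(data),
        "q3": q3,
        "max": data[-1],
    }


# Hypothetical MEDV values in $1000s (illustration only, not the real data).
medv_sample = [15.0, 17.5, 20.0, 21.0, 22.5, 24.0, 35.0, 50.0]
summary = five_number_summary(medv_sample)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

A mean that sits well above the median, as in this toy sample, is a first hint that a variable has a long right tail.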

Graphically:

From the above statistics, we can see that the MEDV and NOX variables do not follow a normal distribution. Both appear to be right-skewed.
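One way to put a number on "right-skewed" is the moment-based skewness coefficient, which is positive when the right tail is longer and close to zero for a symmetric sample. The sketch below is a minimal Python version on made-up numbers; the helper name `sample_skewness` and both sample lists are assumptions for illustration, not part of the original analysis.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (positive: longer right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples (illustration only, not the Boston data).
right_skewed = [15.0, 17.5, 20.0, 21.0, 22.5, 24.0, 35.0, 50.0]
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

print("right-skewed sample:", round(sample_skewness(right_skewed), 3))
print("symmetric sample:   ", round(sample_skewness(symmetric), 3))
```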

Correlation Analysis: The following table gives us the correlation of each variable in the dataset with the MEDV variable.

We can see that the average number of rooms (RM) has the highest positive correlation with MEDV, and the pupil-teacher ratio by town (PTRATIO) has the strongest negative correlation. We will analyze these in detail later.
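The correlation table can be thought of as the Pearson coefficient between each column and MEDV. The sketch below, in Python and on a few made-up rows, shows the calculation; the column values are hypothetical and were chosen only so that RM rises and PTRATIO falls with MEDV, as described above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical columns (illustration only, not the Boston data).
columns = {
    "RM": [5.0, 6.0, 6.5, 7.0, 8.0],
    "PTRATIO": [21.0, 19.0, 18.5, 17.0, 15.0],
}
medv = [15.0, 20.0, 22.0, 30.0, 45.0]

# Correlation of every column with MEDV, strongest positive first.
correlations = {name: pearson(values, medv) for name, values in columns.items()}
for name, r in sorted(correlations.items(), key=lambda item: -item[1]):
    print(f"{name:>8}: {r:+.3f}")
```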

Visualizing the data:

This plot gives a rough map of Boston. We can see that many points are clustered around the center. This dense central part might correspond to the city of Boston.

The green points represent the areas with above-average NOX concentration (NOX > 0.55). This seems intuitive, because pollution appears to be higher in the dense central part. This might affect housing prices, which we will analyze in the next plot.

The red points represent the areas with above-average housing prices (MEDV > 21.2). This supports our earlier suspicion that housing prices are affected by pollution: the dense part in the center appears to have relatively lower housing prices.
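The visual comparison of the green and red points can be backed up with a simple cross-tabulation: count how many tracts fall on each side of the two thresholds quoted above (NOX > 0.55 and MEDV > 21.2). The Python sketch below does this on a handful of made-up tracts; the coordinates and values are hypothetical, and only the two thresholds come from the text.

```python
from collections import Counter

NOX_THRESHOLD = 0.55   # above-average NOX, as quoted in the text
MEDV_THRESHOLD = 21.2  # above-average MEDV, as quoted in the text

# Hypothetical tracts (illustration only, not the Boston data).
tracts = [
    {"lon": -71.06, "lat": 42.36, "nox": 0.70, "medv": 14.0},
    {"lon": -71.08, "lat": 42.34, "nox": 0.62, "medv": 18.5},
    {"lon": -71.10, "lat": 42.35, "nox": 0.58, "medv": 23.0},
    {"lon": -71.25, "lat": 42.40, "nox": 0.45, "medv": 25.0},
    {"lon": -71.30, "lat": 42.30, "nox": 0.40, "medv": 33.0},
    {"lon": -70.95, "lat": 42.25, "nox": 0.43, "medv": 19.0},
]


def cross_tabulate(rows):
    """Count tracts by (high NOX?, high MEDV?)."""
    counts = Counter()
    for row in rows:
        key = (row["nox"] > NOX_THRESHOLD, row["medv"] > MEDV_THRESHOLD)
        counts[key] += 1
    return counts


table = cross_tabulate(tracts)
for high_nox in (True, False):
    for high_medv in (True, False):
        print(f"high NOX={high_nox!s:<5} high MEDV={high_medv!s:<5} "
              f"count={table[(high_nox, high_medv)]}")
```

If pollution and prices move in opposite directions, most tracts should land in the two "mixed" cells (high NOX with low MEDV, or low NOX with high MEDV).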

Modelling:

Does the location of the house matter? Let’s check.

Linear Regression:

Performance of linear regression in predicting housing prices by location

The model summary says that the location of the house does matter, but linear regression fails to capture the whole picture. As we can see, the blue points misclassify the data points. In our initial analysis we found that the dense central part has relatively lower housing prices, but linear regression says otherwise, so it can be misleading. In addition, linear regression classifies only points on the left side, completely ignoring the right side of Boston.
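One possible reason for this behaviour is that a linear model in longitude and latitude can only draw a single straight boundary across the map: everything on one side is predicted above the threshold, everything on the other side below. The Python sketch below fits ordinary least squares on a made-up grid where prices are low in the center and high on both edges. The grid, the prices and the helper names are all assumptions for illustration; only the 21.2 threshold comes from the text.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(features, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical map: low prices in the center (lon = 0), high on both edges.
price_by_lon = {-2.0: 32.0, -1.0: 26.0, 0.0: 10.0, 1.0: 16.0, 2.0: 24.0}
points = [(lon, lat) for lat in (0.0, 1.0) for lon in price_by_lon]
prices = [price_by_lon[lon] for lon, _ in points]

coefs = fit_ols(points, prices)
THRESHOLD = 21.2  # above-average MEDV, as quoted in the text
flagged = sorted({lon for lon, lat in points if predict(coefs, (lon, lat)) > THRESHOLD})
actual = sorted(lon for lon, price in price_by_lon.items() if price > THRESHOLD)
print("coefficients [intercept, lon, lat]:", [round(c, 3) for c in coefs])
print("longitudes predicted above threshold:", flagged)
print("longitudes actually above threshold: ", actual)
```

On this toy map the fitted line slopes one way, so the model flags the left side and the cheap center while missing the expensive right edge, which resembles the pattern described above.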

Decision Trees (CART):

(a) CART over-fitting the data

(b) CART after simplification. Lines represent decision boundaries

Initially, CART over-fitted the data. Such a model might fall apart when it tries to predict housing prices for unseen data, so I simplified the model by increasing the bucket size to 50.
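To show why a larger bucket size simplifies the tree, here is a minimal regression-tree sketch in Python. It assumes that "bucket size" means the minimum number of observations allowed in a leaf (as with the `minbucket` control in R's rpart package). The data are made up, splits are tried at observed values rather than midpoints, and the function names are my own, so this is not the exact CART procedure used above.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets, min_bucket):
    """Return (feature, threshold) that most reduces SSE, or None if none does."""
    best, best_cost = None, sse(targets)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if len(left) < min_bucket or len(right) < min_bucket:
                continue  # a leaf would be smaller than the bucket size
            cost = sse(left) + sse(right)
            if cost < best_cost - 1e-12:
                best, best_cost = (f, threshold), cost
    return best


def grow(rows, targets, min_bucket):
    """Recursively grow a regression tree stored as nested dicts."""
    split = best_split(rows, targets, min_bucket)
    if split is None:
        return {"leaf": mean(targets)}
    f, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": grow([r for r, _ in left], [t for _, t in left], min_bucket),
        "right": grow([r for r, _ in right], [t for _, t in right], min_bucket),
    }


def count_leaves(node):
    if "leaf" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Hypothetical (lon, lat) points: a price step at lon = 10 plus small "noise".
rows = [(float(i), float((i * 3) % 20)) for i in range(20)]
targets = [(15.0 if i < 10 else 30.0) + ((i * 7) % 5) - 2 for i in range(20)]

deep_tree = grow(rows, targets, min_bucket=1)    # free to chase the noise
simple_tree = grow(rows, targets, min_bucket=5)  # every leaf keeps >= 5 points
print("leaves with min_bucket=1:", count_leaves(deep_tree))
print("leaves with min_bucket=5:", count_leaves(simple_tree))
```

With a bucket size of 1 the tree keeps splitting until it has memorized the noise, while the larger bucket size caps the number of leaves and keeps only the broad price step.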