Data Mining for Business Intelligence: Multiple Linear Regression Essay

Submitted By juber123

Chapter 6: Multiple Linear Regression

Data Mining for Business Intelligence
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010

Explanatory vs. predictive modeling with regression

Example: prices of Toyota Corollas
Fitting a predictive model
Assessing predictive accuracy
Selecting a subset of predictors (variable selection)

Explanatory Modeling
Goal: Explain the relationship between predictors (explanatory variables) and target
• Familiar use of regression in data analysis
• Multiple linear regression: a linear relationship between a dependent variable Y (response) and a set of predictors
• Model goal: fit the data well and understand the contribution of explanatory variables to the model; model performance is assessed by residual analysis
• Model fitted to the entire dataset

Predictive Modeling
Goal: Predict target values in new data where we have predictor values, but not target values
Classic data mining context
Model goal: optimize predictive accuracy, i.e. how accurately the fitted model can predict new cases
Model is trained on training data, and performance is assessed on validation or test data
Explaining the role of predictors is not the primary purpose (although useful)

Regression Method
• Predict the value of the dependent variable Y based on predictors X1, …, Xp
• Regression coefficients β0, β1, …, βp in the equation:
Y = β0 + β1X1 + β2X2 + … + βpXp + ε
(β0 is the intercept and ε is the noise term)
• Coefficients estimated via the ordinary least squares (OLS) method
• Estimated using the training sample
• Predictive capacity assessed by prediction results on the validation set (average squared error)
• Assumptions: normality, independence, linearity
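
The fit-then-validate procedure above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the model includes an intercept term, solves the OLS normal equations directly, and uses a small made-up dataset rather than the Toyota data. The function names (solve_linear_system, fit_ols, predict, mean_squared_error) are hypothetical helpers, not part of XLMiner or any library.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bp] minimising the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted Y for one record: b0 + b1*x1 + ... + bp*xp."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


def mean_squared_error(coefs, rows, y):
    """Average squared error of the fitted model on the given records."""
    return sum((t - predict(coefs, r)) ** 2 for r, t in zip(rows, y)) / len(y)


# Made-up toy data generated from y = 2 + 3*x1 - x2 (no noise).
train_x = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
train_y = [3.0, 7.0, 6.0, 11.0, 9.0]
valid_x = [(6, 4), (7, 9)]
valid_y = [16.0, 14.0]

coefs = fit_ols(train_x, train_y)                          # training sample only
validation_mse = mean_squared_error(coefs, valid_x, valid_y)
```

The coefficients are estimated from the training records only, and the validation records are touched only when computing the average squared error. Solving the normal equations by hand keeps the sketch dependency-free; with real data a numerical library would normally be preferred.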

Example: Prices of Toyota Corollas
Goal: Predict sale prices of used Toyota Corollas based on their specification
Data: Prices of 1442 used Toyota Corollas, with their specification information (age, mileage, fuel type, engine size)

Data Sample
(showing only the variables to be used in analysis)

Variables Used
Price in Euros
Age in months as of 8/04
KM (kilometers)
Fuel Type (diesel, petrol, CNG)
HP (horsepower)
Metallic color (1=yes, 0=no)
Automatic transmission (1=yes, 0=no)
CC (cylinder volume)
Quarterly_Tax (road tax)
Weight (in kg)

Fuel type is categorical, must be transformed into binary variables
Diesel (1=yes, 0=no)
CNG (1=yes, 0=no)
None needed for “Petrol” (reference category)
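
The binary coding described above might look like this in Python. It is a sketch under stated assumptions: the function name encode_fuel_type and the exact category spellings ("Diesel", "CNG", "Petrol") are chosen for illustration and may not match the labels in the actual data file.

```python
def encode_fuel_type(fuel_type):
    """Map a fuel type to (Diesel, CNG) indicators; Petrol is the reference."""
    if fuel_type not in {"Diesel", "CNG", "Petrol"}:
        raise ValueError(f"unexpected fuel type: {fuel_type!r}")
    return {
        "Diesel": int(fuel_type == "Diesel"),
        "CNG": int(fuel_type == "CNG"),
        # No Petrol column: a Petrol car is the row with both indicators at 0.
    }


encoded = [encode_fuel_type(f) for f in ["Petrol", "Diesel", "CNG"]]
```

Leaving out the Petrol indicator avoids a redundant column: given the other two indicators, it would carry no additional information.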

Subset of the records selected for training partition (limited # of variables shown)

60% training data / 40% validation data
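
A 60/40 random partition of this kind could be sketched as follows. The function name partition, the seed value, and the ten made-up record IDs are assumptions for illustration; the slides do not say how XLMiner draws the partition.

```python
import random


def partition(records, train_fraction=0.6, seed=1):
    """Shuffle the records and split them into training and validation sets."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


record_ids = list(range(1, 11))            # ten made-up record IDs
train_ids, valid_ids = partition(record_ids)
```

Every record lands in exactly one of the two partitions, so the validation records stay unseen during fitting.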
Multiple linear regression model fitted using ONLY training data

The Fitted Regression Model
(XLMiner output)

Predicted Values

Predicted price computed using regression coefficients
Residuals = difference between actual and predicted prices

Error reports

Error for the validation set is usually larger than that of the training set (as expected)
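
As a rough sketch of what an error report summarizes, the snippet below computes residuals and three common summaries from actual and predicted values. The function name error_report, the choice of summaries, and the four made-up price pairs are assumptions for illustration; they are not the values from the Toyota example.

```python
import math


def error_report(actual, predicted):
    """Summarise residuals (actual minus predicted) for one partition."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    sse = sum(r * r for r in residuals)
    return {
        "average_error": sum(residuals) / n,  # signed, so it reveals bias
        "total_sse": sse,                     # total sum of squared errors
        "rmse": math.sqrt(sse / n),           # root mean squared error
    }


# Made-up prices for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 380.0]
report = error_report(actual, predicted)
```

Running the same function on the training and the validation partition gives the two reports that are compared above.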

Distribution of Residuals
Symmetric distribution
Some outliers
Average error = 116
50% errors between

Selecting Subsets of Predictors

Goal: Find parsimonious model (the simplest model that performs sufficiently well)
It may be expensive or impossible to measure all predictors for future predictions
Models with fewer predictors tend to be more robust
Multicollinearity can lead to unstable regression coefficients and hence increase variation in predictions and lower predictive accuracy
Sometimes dropping correlated predictors increases bias (average error)
Trade-off between too few and too many predictors: the bias-variance trade-off

Variable selection methods
• Use domain knowledge; some practical reasons for excluding a predictor:
Expense of collecting future data on predictors
Missing values and inaccurate measurements
Irrelevance to the problem at hand
High correlations

• Two primary methods:
  Exhaustive search
  Partial search algorithms
    • Forward selection
    • Backward elimination
    • Stepwise regression

Exhaustive Search
• All possible subsets of predictors assessed (single, pairs, triplets, etc.)
• Computationally intensive
• Judge by “adjusted