95-791 Spring 2013

Lecture #8 Predictive analytics: Regression

Artur Dubrawski

awd@cs.cmu.edu

This unit

• Good-old correlation scores revisited
• Locally weighted regression
  – As an approximator of non-linear functions
  – As a framework for active/purposive acquisition of data

95-791 Data Mining

Lecture #8 Slide 2

Copyright © 2000-2013 Artur Dubrawski

Correlational scores of association between attributes of data
• Linear
• Rank
• Quadratic
• …

Wouldn't it be great to have a universal formula for computing correlations of all types, no matter how complex the underlying models were (linear, quadratic, …, any kind)... hmmmm… life would be so much more fulfilling then…

Correlation coefficient generalized

• Idea: take your data and apply some function approximator to it (e.g. fit some regression model to it), and compute the following:

$$ R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} $$

$y_i$, $\bar{y}$: from data, $\hat{y}_i$: predicted
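This generalized $R^2$ can be computed for predictions from any fitted approximator. A minimal sketch in Python (the quadratic "model" and the data here are purely illustrative):

```python
def r_squared(y, y_hat):
    """Generalized R^2: 1 - SSE(model) / SSE(mean baseline)."""
    y_mean = sum(y) / len(y)
    sse_model = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    sse_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - sse_model / sse_total

# Works for predictions from *any* approximator, e.g. a quadratic model y = x^2:
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 3.9, 9.2, 15.8]
y_hat = [xi ** 2 for xi in x]  # predictions from some fitted model
print(r_squared(y, y_hat))    # close to 1: the quadratic model explains the data well
```

Predicting the mean for every point gives $R^2 = 0$; a perfect fit gives $R^2 = 1$.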

Using linear regression to predict

Basically, to predict $y_i$ we can use:

• linear regression → linear correlation
• quadratic regression → quadratic correlation
• multiple regression, any kind of non-linear regression, any other function approximator we like,

and we should still be able to compute the corresponding correlation coefficient. Life is perfect!

Generalized correlation

total variation = explained variation + unexplained variation

$$ \sum_{i=1}^{N} (y_i - \bar{y})^2 = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$

total variation: ~variance observed in the training data
explained variation: part of the total variation accounted for (“explained”) by the trained model
unexplained variation: mismatch between the data and the model-based predictions (part of the total variation that is left “unexplained” by the model)

$R^2 = 1.0$ if all of the total variation is explained by the model (X and Y correlated in the sense of the used model)
$R^2 = 0.0$ if nothing is explained (data and model don’t match; X and Y uncorrelated given the used model)
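For ordinary least-squares linear regression with an intercept, this decomposition holds exactly. A quick numerical check, using a closed-form line fit (the data points are illustrative):

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x (closed form, with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = fit_line(x, y)
y_hat = [a + b * xi for xi in x]
my = sum(y) / len(y)

total = sum((yi - my) ** 2 for yi in y)
explained = sum((yh - my) ** 2 for yh in y_hat)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
# total variation == explained + unexplained (up to floating-point error)
print(total, explained + unexplained)
```

For other model classes the cross-term does not vanish in general, which is why the slide after next recommends the adjusted form when comparing models.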

Notes on R2

• $R^2$ is sometimes called the coefficient of determination
• R (sometimes called the index of fit) is a direct estimate of the correlation coefficient r, but only if the model of choice is (single, not multiple) linear regression
• However, R often makes some practical sense even when applied to other models:

$$ R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} $$

numerator: SSE of the current model
denominator: SSE of the simplest (default) model, $\hat{y} = \mathrm{mean}(y_i)$


Notes on R2

• The sum of ‘variations’ two slides back formally holds for basic linear regression. Strictly speaking, in more complex cases one should take the complexity of the model into account when estimating $R^2$ (otherwise, R would tend to increase as we add terms to, e.g., a multiple regression equation, making the comparison between models of different complexities somewhat unfair):

$$ R^2_{\text{adjusted}} = 1 - \frac{\frac{1}{N-p} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2} $$

where $p$ is the number of parameters of the model.

• It is not immediately clear how to figure out the adequate number of parameters p for non-parametric models (like k-NN: should we just substitute k for p?). The statistical literature provides some hints, but in fact statisticians themselves still debate the correct and useful ways of adjusting $R^2$ in non-parametric scenarios
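The adjustment penalizes each residual degree of freedom the model consumes. A small sketch (the choice of data and of p is illustrative; note that the same predictions score lower when attributed to a more complex model):

```python
def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2: ratio of per-degree-of-freedom SSEs; p = number of model parameters."""
    n = len(y)
    y_mean = sum(y) / n
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_hat = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]
print(adjusted_r_squared(y, y_hat, p=2))  # a 2-parameter model
print(adjusted_r_squared(y, y_hat, p=4))  # same fit, claimed to need 4 parameters: lower score
```

With p = 1 the correction factors cancel and the plain $R^2$ is recovered.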

Rank correlation

• Procedure:

1. Rank the values of the attributes (independently for X and Y)
2. Compute a correlation coefficient for the ranks (linear, quadratic, …, you-name-it)

• Able to spot non-linear (but monotonic) relationships using a linear approach

BTW, what is the...
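When the coefficient computed on the ranks is the linear (Pearson) one, this two-step procedure is Spearman's rank correlation. A minimal sketch (tie handling via average ranks is omitted for brevity; the exponential data below just illustrates a monotonic non-linear relationship):

```python
def ranks(values):
    """Rank each value 1..N (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    """Plain linear (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_correlation(x, y):
    # Step 1: rank X and Y independently; step 2: correlate the ranks.
    return pearson(ranks(x), ranks(y))

# y grows like exp(x): non-linear but monotonic, so the rank correlation is ~1
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.7, 7.4, 20.1, 54.6]
print(rank_correlation(x, y))
```

A plain Pearson coefficient on the raw x and y would be well below 1 here; ranking first recovers the perfect monotonic association.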