Thursday, June 22, 2017

Data | Linear regression model

For details, see the Coursera course Linear Regression and Modeling (by Duke University)

Relationship between 2 variables

- explanatory variable (x): independent variable (predictor)
- response variable (y): dependent variable (predicted)

Linear regression model

Least squares line (ls): the line that minimizes the sum of the squared residuals

y = β0 + β1 x

x: explanatory variable
y: response variable
β0: intercept
β1: slope

The least squares line always goes through mean (average) values of (x, y)

Note that the point estimates (estimated from observed data) for β0 and β1 are b0 and b1, respectively.

In R: lm() fits the least squares line, e.g. lm(y ~ x). (ls() only lists the objects in the workspace.)

Key points:

1. Regression line always goes through the centre of the data.
2. Intercept is where the regression line crosses the Y-axis; the expected value of the response variable when the explanatory variable is equal to 0.
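The two key points can be checked numerically. Below is a minimal sketch in Python (not the R workflow used in the course), assuming a small made-up dataset; the values of x and y are hypothetical and serve only to show that the fitted line passes through (mean of x, mean of y).

```python
from statistics import mean

# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Closed-form least squares estimates for a single predictor.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx            # slope
b0 = y_bar - b1 * x_bar   # intercept: value of the line at x = 0

# Evaluating the line at the mean of x gives back the mean of y.
y_at_x_bar = b0 + b1 * x_bar
print(b0, b1, y_at_x_bar, y_bar)
```

Because b0 is defined as y_bar - b1 * x_bar, the line goes through the centre of the data by construction.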

Conditions necessary for linear regression model & fitting least-squares line (ls):

1) linearity
- relationship between the explanatory and the response variable should be linear
- methods for fitting a model to non-linear relationships exist
- check using a scatterplot of the data, or a residual plot

2) nearly normal residuals
- residuals should be nearly normally distributed, centred at 0
- may not be satisfied if there are unusual observations that don't follow the trend of the rest of the data
- check using a histogram or normal probability plot of residuals

3) constant variability: homoscedasticity
- variability of points around the least squares line should be roughly constant
- implies that the variability of residuals around the 0 line should be roughly constant
- check using a residual plot

Point estimates: b0 (intercept) and b1 (slope)

Estimate for the slope (b1):

b1 = R (SDy/SDx)

R: correlation coefficient
SD(y): standard deviation of the response variable
SD(x): standard deviation of the explanatory variable

For each unit increase in x, y is expected to be higher or lower on average by the slope

Ex1: SD(y) = 3.1%, SD(x) = 3.73%
Correlation (R) between variables: -0.75
Slope of the regression line?

b1 = -0.75 (3.1/3.73) = -0.62
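The same arithmetic as a short Python sketch, using the Ex1 values; the variable names are illustrative only.

```python
# Values from Ex1 (both standard deviations are in percentage points).
r = -0.75     # correlation coefficient R
sd_y = 3.1    # SD of the response variable
sd_x = 3.73   # SD of the explanatory variable

# Slope estimate: b1 = R * (SDy / SDx)
b1 = r * (sd_y / sd_x)
print(round(b1, 2))  # -0.62
```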

Estimate for the intercept (b0):

b0 = (average of y) - b1 * (average of x)

When x = 0, y is expected to equal the intercept. The intercept may be meaningless in the context of the data and serve only to adjust the height of the line

Ex2: Average value of y = 11.35%
Average value of x = 86.01%
Intercept of the regression line?

b0 = 11.35 - (-0.62) * 86.01 = 64.68
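A quick Python check of the intercept calculation, assuming the rounded slope of -0.62 from Ex1; the variable names are illustrative only.

```python
# Averages from the example above (percentage points).
y_bar = 11.35
x_bar = 86.01
b1 = -0.62  # rounded slope from Ex1

# Intercept estimate: b0 = mean(y) - b1 * mean(x)
b0 = y_bar - b1 * x_bar
print(round(b0, 2))  # 64.68
```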

Regression line for Ex1 and Ex2:

y = b0 + b1 * x = 64.68 - 0.62 * x

Regression output from R:

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      64.78         6.80      9.52       0.00
x                -0.62         0.08     -7.86       0.00

Linear regression with 1 predictor

y^ = b0 + b1 x*

y^: predicted response variable
x*: (given) explanatory variable

Determine the predicted y if x = 82
Estimates from Ex1 and Ex2: b1 = -0.62, b0 = 64.68 (hand-calculated; the R output reports 64.78, a gap that probably comes from rounding)

Predicted y = 64.68 - (0.62)*82 = 13.84
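The prediction step as a small Python sketch. The function name predict is hypothetical, and the defaults are the hand-calculated estimates b0 = 64.68 and b1 = -0.62.

```python
def predict(x_star, b0=64.68, b1=-0.62):
    """Predicted response y^ = b0 + b1 * x* for a given x*."""
    return b0 + b1 * x_star

print(round(predict(82), 2))  # 13.84
```

A prediction is only as trustworthy as the range of x values the line was fitted on.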

Extrapolation: applying a model estimate to values outside the range of the original data
- sometimes the intercept might be an extrapolation

Correlation coefficient: R (Pearson's R)

Properties of R

- the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
- the sign of the correlation coefficient indicates the direction of association
- the correlation coefficient is always between -1 and 1, -1 indicating perfect negative linear association, +1 indicating perfect positive linear association, and 0 indicating no linear relationship
- the correlation coefficient is unitless
- since the correlation coefficient is unitless, it is not affected by changes in the centre or scale of either variable (such as unit conversions)
- the correlation of X with Y is the same as of Y with X
- the correlation coefficient is sensitive to outliers
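Several of these properties can be verified directly. The sketch below is in Python, with made-up data and a hypothetical helper pearson_r. It checks symmetry, invariance under a shift and positive rescaling of x, and sensitivity to a single outlier.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with a strong negative linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.8, 8.1, 6.3, 3.9, 2.2]

r = pearson_r(x, y)

# Symmetry: the correlation of x with y equals that of y with x.
r_swapped = pearson_r(y, x)

# Unit conversion: shifting x and rescaling it by a positive factor leaves R unchanged.
x_rescaled = [32 + 1.8 * a for a in x]
r_rescaled = pearson_r(x_rescaled, y)

# Outlier sensitivity: changing one point moves R substantially.
y_outlier = y[:-1] + [20.0]
r_outlier = pearson_r(x, y_outlier)

print(r, r_swapped, r_rescaled, r_outlier)
```

Note that rescaling by a negative factor would flip the sign of R while keeping its magnitude.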


R-squared (R^2)

- the percentage of the variability in the response variable explained by the explanatory variable
- strength of the fit of a linear model is most commonly evaluated using R^2
- calculated as the square of the correlation coefficient
- the remainder of the variability is explained by variables not included in the model
- always between 0 and 1
- For a good model, we would like this number to be as close to 100% as possible.
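As a worked example, R^2 for the Ex1 correlation of -0.75 can be computed in one line of Python; the variable names are illustrative only.

```python
r = -0.75           # correlation coefficient from Ex1
r_squared = r ** 2  # squaring removes the sign, so R^2 lies between 0 and 1

print(f"{r_squared:.2%}")  # 56.25%
```

So about 56% of the variability in y is explained by x, and the remaining 44% or so is not accounted for by this model.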

Residual (e)

- leftovers from the model fit
- data = fit + residual
- the difference between the observed (y) and predicted (y^) values of the response variable

residual = observed value - predicted value

e = y − (y^)

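In the linear model above, with x = 81:
predicted y = 64.68 - (0.62)*81 = 14.46

Observed y = 10.3
e = 10.3 - 14.46 = -4.16 (the model overestimates this observation)

In the linear model above, with x = 86:
predicted y = 64.68 - (0.62)*86 = 11.36

Observed y = 16.8
e = 16.8 - 11.36 = 5.44 (the model underestimates this observation)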
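Both residuals can be reproduced with a short Python sketch. The function name residual is hypothetical, and the defaults are the hand-calculated estimates b0 = 64.68 and b1 = -0.62.

```python
def residual(x, y_observed, b0=64.68, b1=-0.62):
    """Residual e = observed y - predicted y for one observation."""
    y_predicted = b0 + b1 * x
    return y_observed - y_predicted

print(round(residual(81, 10.3), 2))  # -4.16
print(round(residual(86, 16.8), 2))  # 5.44
```

A negative residual means the observation lies below the line; a positive residual means it lies above the line.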


