Relationship between 2 variables
- explanatory variables (x): independent variable (predictor)
- response variable (y): dependent variable (predicted)
Linear regression model
Least squares line (ls): the line that minimizes the sum of the squared residuals
x: explanatory variable
y: response variable
β0: intercept
β1: slope
The least squares line always goes through the point of averages: (mean of x, mean of y)
In R: lm() fits the least-squares line (ls() only lists workspace objects, it is not a regression function)
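A minimal R sketch of fitting the least-squares line; the data frame df and its columns x (explanatory) and y (response) are hypothetical placeholders:
  fit <- lm(y ~ x, data = df)   # fit the least-squares line
  coef(fit)                     # returns b0 (intercept) and b1 (slope)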
Key points:
1. Regression line always goes through the centre of the data.
2. Intercept is where the regression line crosses the Y-axis; the expected value of the response variable when the explanatory variable is equal to 0.
Conditions necessary for linear regression model & fitting least-squares line (ls):
1) linearity
- relationship between the explanatory and the response variable should be linear
- methods for fitting a model to non-linear relationships exist
- check using a scatterplot of the data, or a residual plot
2) nearly normal residuals
- residuals should be nearly normally distributed, centred at 0
- may not be satisfied if there are unusual observations that don't follow the trend of the rest of the data
- check using a histogram or normal probability plot of residuals
3) constant variability: homoscedasticity
- variability of points around the least squares line should be roughly constant
- implies that the variability of residuals around the 0 line should be roughly constant
- check using a residual plot
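A sketch of the usual checks for these three conditions in R, assuming the hypothetical fit and data frame df from the earlier sketch:
  plot(df$x, resid(fit)); abline(h = 0)    # residual plot: look for no pattern and constant spread
  hist(resid(fit))                         # residuals should look nearly normal, centred at 0
  qqnorm(resid(fit)); qqline(resid(fit))   # normal probability plot of the residuals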
Point estimates: b0 (intercept) and b1 (slope)
Estimate for the slope (b1):
b1 = R * (SD(y) / SD(x))
R: correlation coefficient
SD(y): standard deviation of the response variable
SD(x): standard deviation of the explanatory variable
For each unit increase in x, y is expected to change on average by the value of the slope (higher if the slope is positive, lower if negative)
Ex1: SD(y) = 3.1%, SD(x) = 3.73%
Correlation (R) between variables: -0.75
Slope of the regression line?
b1 = -0.75 * (3.1/3.73) ≈ -0.62
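The Ex1 slope written out in R, using the summary statistics given above:
  R_xy <- -0.75                # correlation between x and y
  sd_y <- 3.1; sd_x <- 3.73    # standard deviations of y and x
  b1 <- R_xy * (sd_y / sd_x)   # about -0.62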
Estimate for the intercept (b0):
b0 = y (average) - b1 (slope) * x (average)
When x = 0, y is expected to equal the intercept. The intercept may be meaningless in the context of the data, and may only serve to adjust the height of the line
Ex2:
Average value of y = 11.35%
Average value of x = 86.01%
Intercept of the regression line?
b0 = 11.35 - (-0.62) * 86.01 = 64.68
Regression line model for Ex1 and Ex2:
y^ = b0 + b1 * x = 64.68 - 0.62 * x
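The same estimates in R, carrying over the rounded slope and the Ex2 averages (all values come from the worked examples):
  b1 <- -0.62                     # rounded slope from Ex1
  y_bar <- 11.35; x_bar <- 86.01  # averages from Ex2
  b0 <- y_bar - b1 * x_bar        # 11.35 - (-0.62)*86.01, about 64.68
  # fitted line: y_hat = 64.68 - 0.62 * x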
Regression output in R:
              Estimate  Std. Error  t value  Pr(>|t|)
  (Intercept)    64.78        6.80     9.52      0.00
  x              -0.62        0.08    -7.86      0.00
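A table like this is printed by summary() on a fitted model object; a sketch assuming the hypothetical fit from earlier:
  summary(fit)                # full output, including the coefficient table above
  summary(fit)$coefficients   # the same table as a numeric matrix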
Linear regression with 1 predictor
y^ = b0 + b1 x*
y^: predicted response variable
x*: (given) explanatory variable
Ex3:
Determine predicted y if x = 82
Estimates from Ex1 and Ex2: b1 = -0.62, b0 = 64.68
Predicted y = 64.68 - 0.62 * 82 = 13.84
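The Ex3 prediction in R, either by hand with the rounded estimates or with predict() on the hypothetical fitted model:
  b0 <- 64.68; b1 <- -0.62                     # rounded estimates from Ex1 and Ex2
  b0 + b1 * 82                                 # 13.84, by hand
  predict(fit, newdata = data.frame(x = 82))   # same idea using the fitted model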
Extrapolation: applying a model estimate to values outside of the realm of the original data
- sometimes the intercept might be an extrapolation
Correlation coefficient: R (Pearson's R)
Properties of R
- the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
- the sign of the correlation coefficient indicates the direction of association
- the correlation coefficient is always between -1 and 1, -1 indicating perfect negative linear association, +1 indicating perfect positive linear association, and 0 indicating no linear relationship
- the correlation coefficient is unitless
- since the correlation coefficient is unitless, it is not affected by changes in the centre or scale of either variable (such as unit conversions)
- the correlation of X with Y is the same as of Y with X
- the correlation coefficient is sensitive to outliers
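Some of these properties can be checked directly in R; x and y here are any two numeric vectors of the same length:
  all.equal(cor(x, y), cor(y, x))          # symmetric: correlation of x with y equals that of y with x
  all.equal(cor(x, y), cor(2.54 * x, y))   # rescaling x (e.g. a unit conversion) does not change R
  all.equal(cor(x, y), cor(x + 10, y))     # shifting the centre of x does not change R either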
R^2
- the percentage of the variability in the response variable explained by the explanatory variable.
- strength of the fit of a linear model is most commonly evaluated using R^2
- calculated as the square of the correlation coefficient
- the remainder of the variability is explained by variables not included in the model
- always between 0 and 1
- For a good model, we would like this number to be as close to 100% as possible.
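A quick check of this relationship in R, again with the hypothetical fit and data frame df:
  cor(df$x, df$y)^2        # square of the correlation coefficient
  summary(fit)$r.squared   # R^2 reported by the linear model; the two should match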
Residual (e)
- leftovers from the model fit
- data = fit + residual
- the difference between the observed (y) and predicted (y^) values of the response variable
residual = observed value - predicted value
e = y − (y^)
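In R, residuals can be read off the fit or computed from the definition; a sketch with the hypothetical fit and data frame df:
  e <- df$y - predict(fit)                             # observed minus predicted
  all.equal(e, resid(fit), check.attributes = FALSE)   # same values stored in the model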
Ex4:
In linear model, x = 81
predicted y = 64.68 - (0.62)*81 = 14.46
Observed y = 10.3
e = 10.3 - 14.46 = -4.16
Ex5:
In linear model, x = 86
predicted y = 64.68 - (0.62)*86 = 11.36
Observed y = 16.8
e = 16.8 - 11.36 = 5.44
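Both residuals reproduced in R with the rounded estimates from the worked examples:
  b0 <- 64.68; b1 <- -0.62
  10.3 - (b0 + b1 * 81)   # Ex4: 10.3 - 14.46 = -4.16
  16.8 - (b0 + b1 * 86)   # Ex5: 16.8 - 11.36 = 5.44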