In this tutorial we will learn a very important aspect of regression analysis, i.e. residual analysis. Residual analysis is an important tool used by data science experts, and knowing it will take you from an amateur to a pro.
Please go through the following articles as well to understand the basics of regression:
Tutorial : Concept of Linearity in Linear Regression
Tutorial : Linear Regression Construct
R Tutorial : Basic 2 variable Linear Regression
R Tutorial : Multiple Linear Regression
Introduction
Once we have created a regression model, we must check whether the model is valid. Residual analysis is one of the most important steps in understanding whether the model we have created with the given variables is valid or not.
Let's take the example we used in our 2 variable linear regression tutorial (R Tutorial : Basic 2 variable Linear Regression) and build from it. The data set we will use is INCOME-SAVINGS, as in the earlier linear regression tutorial.
We will quickly write R code to create a linear model and then discuss our main topic of residual analysis.
income = read.csv("INCOME-SAVINGS.csv")
linearmodel = lm(SAVINGS ~ INCOME, data = income)
summary(linearmodel)

Output :

Call:
lm(formula = SAVINGS ~ INCOME, data = income)

Residuals:
     Min       1Q   Median       3Q      Max
-13036.3  -4958.9   -316.9   5368.1  16969.3

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.099e+04  2.459e+03  -4.469 0.000235 ***
INCOME       2.970e-01  6.012e-03  49.402  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7327 on 20 degrees of freedom
Multiple R-squared:  0.9919, Adjusted R-squared:  0.9915
F-statistic:  2441 on 1 and 20 DF,  p-value: < 2.2e-16
Now that our basic model is ready, let's understand what residuals are and how to interpret them.
What are Residuals?
Regression models usually take the form

Output = [ Intercept + Coefficient x Input ] + Error

i.e. they consist of two types of terms, deterministic and stochastic:

Output = Deterministic + Stochastic
In this case, the regression equation we generated is
SAVINGS = -10990 + 0.297 * INCOME
This means that for the data used to fit the regression, which already contains the historically observed savings values, we can generate predicted savings and add them to our income data set as SAVINGS_Predicted.
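As a quick sketch of how such a prediction is computed by hand (using the rounded coefficients from the equation above, and a hypothetical income value, so treat the numbers as approximate):

```r
# Hand-computed prediction from the rounded regression coefficients:
# SAVINGS = -10990 + 0.297 * INCOME   (approximation of the fitted model)
intercept <- -10990
slope <- 0.297
income_value <- 100000                           # a hypothetical income
savings_hat <- intercept + slope * income_value  # deterministic part only
savings_hat                                      # 18710
```

In practice we let predict() do this for every row at once, as shown next.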
Now our predictions for each of these data points will not exactly match the observed savings values. The difference between the observed and predicted values is called the residual (or error).
Residual = Observed value – Predicted Value
i.e. in the form written above, the stochastic component is the residual.

In our case, Residual = SAVINGS - SAVINGS_Predicted
income$SAVINGS_Predicted = predict(linearmodel)
income$Residual = income$SAVINGS - income$SAVINGS_Predicted
head(income)

Output:

  YEAR SAVINGS INCOME SAVINGS_Predicted  Residual
1 1974   12298  64968          8303.023  3994.977
2 1975   14196  69233          9569.654  4626.346
3 1976   17320  73824         10933.101  6386.899
4 1977   19995  85267         14331.474  5663.526
5 1978   23601  91507         16184.646  7416.354
6 1979   24213  99632         18597.630  5615.370
Let's plot the predicted values against the residuals.
plot(income$SAVINGS_Predicted, income$Residual, pch = 21, bg = "red", col = "red")
abline(0, 0)
In this plot, the horizontal line y = 0 plays the role of the regression line: any points above it have positive residuals, and points below it have negative residuals.
How to interpret Residuals?
What we should look for here are patterns. If the residuals show a systematic pattern, it means some of the predictor information is leaking into the error term, implying we should look for an additional explanatory variable to include in the model to account for that leaked pattern.
We manually created the residuals and the residual plot here, but the R model has already computed the residuals for us; they are stored in the fitted model object. In our case we can access them as linearmodel$residuals, or equivalently with resid(linearmodel).
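A minimal sketch confirming the two ways of getting residuals agree, using made-up numbers (the INCOME-SAVINGS file is not reproduced here):

```r
# Tiny made-up data set standing in for INCOME-SAVINGS
d <- data.frame(INCOME  = c(60, 70, 80, 90, 100),
                SAVINGS = c(8, 10, 11, 15, 16))
m <- lm(SAVINGS ~ INCOME, data = d)

manual <- d$SAVINGS - predict(m)   # Observed - Predicted, as computed above
stored <- residuals(m)             # residuals kept inside the lm object

isTRUE(all.equal(unname(manual), unname(stored)))   # TRUE
```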
Another very important check is that the residuals follow a roughly normal distribution. We can do so by examining a histogram of the residuals. If the histogram looks approximately normal, that supports the validity of the model.
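A sketch of that normality check, again on simulated residual-like values (hist() and qqnorm() work the same way on linearmodel$residuals):

```r
set.seed(42)
# Simulated values standing in for resid(linearmodel)
r <- rnorm(100, mean = 0, sd = 5)

h <- hist(r, breaks = 10, main = "Histogram of residuals",
          xlab = "Residual")        # roughly bell-shaped if residuals are normal
qqnorm(r); qqline(r)                # points close to the line suggest normality

sum(h$counts) == length(r)          # every residual falls in some bin: TRUE
```

A normal Q-Q plot is a common companion to the histogram: departures from the straight line flag non-normal residuals more sensitively than the histogram alone.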
How to interpret Patterns in Residual Plots?
Let us now look at a few residual plots for other data sets and other models [not necessarily actual linear models; some represent erroneous cases] and see how to interpret them.
Residual Plot ( a )
- Residuals are randomly distributed around the regression line
- Residuals follow normal distribution
- Residuals are Homoscedastic.
- Linear model is valid.
Residual Plot ( b )
- Residuals are non-randomly distributed around the regression line
- Residuals increase as the predicted value increases, which could mean that we are missing a variable or two and some predictive pattern is leaking into the residuals.
- Residuals are Homoscedastic.
- Linear model is not valid (if it has an intercept): check for explanatory variables that might explain the linear residual pattern, or the model has failed to account for an intercept.
- Alternatively, the plot may not belong to a linear model at all, or the model may have been forced to pass through the origin, i.e. a no-intercept model.
Residual Plot ( c )
- Residuals are non-randomly distributed around the regression line
- Residuals are Homoscedastic
- Residuals have a curved pattern.
- Linear model is not valid. A curved residual pattern may mean we have to fit a polynomial of some order to explain the curvature.
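To illustrate plot (c), here is a small simulation on entirely made-up data: a quadratic relationship fitted with a straight line leaves a curved residual pattern, and adding a squared term absorbs it.

```r
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + 0.05 * x^2 + rnorm(50, sd = 2)  # true quadratic trend

m1 <- lm(y ~ x)             # misspecified straight-line fit
m2 <- lm(y ~ x + I(x^2))    # polynomial fit of order 2

plot(fitted(m1), resid(m1)) # curved residual pattern, as in plot (c)
plot(fitted(m2), resid(m2)) # pattern gone: random scatter around zero

summary(m1)$r.squared < summary(m2)$r.squared      # TRUE: quadratic fits better
```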
I sincerely hope the tutorial will be useful in helping everyone understand the validity of a regression model using residual plots. Please send me your feedback and suggestions, and share the knowledge.
Happy Learning !!!
To get more understanding of residual analysis and diagnostic plots, please read R Tutorial : How to use Diagnostic Plots for Regression Models.
Thanks for the nice tutorial.
I have spotted a few mistakes: the residuals for plot (b) and plot (c) are heteroscedastic, not homoscedastic.
In the first code snippet, there's a typo in the following line:
("INCOME-SAVINGS.csv")
Thanks for replying, Ahmed!
Why do you say plots (b) and (c) are heteroscedastic? They have uniform variance along the predictor range, which means homoscedastic.
I have corrected the typos. Sometimes when the WordPress editor sees symbols like " or <, it automatically converts them into HTML entities such as &quot; and &lt;.
Very concise & informative..