This tutorial moves one step beyond two-variable regression to another type of regression: Multiple Linear Regression. We will work through multiple linear regression using an example in R.
Please also read through the following tutorials to become more familiar with R and the background of linear regression.
Technique : Multiple Linear Regression
When to use : When our output variable is numeric
No of variables : > 2
Model Readability : High
For this tutorial we will use the CSV version of the Excel file demandformoney uploaded here. As always, please save the file and then convert it to .csv using Save As in Excel.
A central bank prints paper money each year, and for each year it needs an estimate of how much money to print. The decision is based on various economic indicators such as GDP and the interest rate. We will try to model this problem using multiple linear regression.
Step 1 : Read File
money = read.csv("demandformoney.csv")
str(money)
‘data.frame’: 35 obs. of 5 variables:
$ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 …
$ Money_printed: int 7374 8323 9700 11200 11975 13325 16024 14388 17292 20000 …
$ GDP : int 474131 478918 477392 499120 504914 550379 557258 598885 631839 598974 …
$ Interest_RATE: num 7.25 7.25 7.25 7.25 9 10 10 9 9 10 …
$ WPI : num 14.3 15.1 16.7 20.1 25.1 24.8 25.3 26.6 26.6 31.2 …
Note : We have 4 variables of interest here: Money_printed, GDP, Interest_RATE and WPI. We have deliberately omitted the variable year, but we will leave that discussion for another tutorial post.
Step 2 : Identify the output variable and input variable
We assume Money_printed = f ( GDP , Interest_RATE , WPI ), so :
output variable : Money_printed
Input variables : GDP , Interest_RATE , WPI
Step 3 : Scatter plot
All the Input variables seem to be fairly linearly correlated with our output variable Money_printed except for WPI.
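The scatter plots can be reproduced with a few lines of R. The sketch below is only an illustration: it uses the first ten rows shown in the str(money) output above as a stand-in (the name money_head is made up for this example), so the picture will differ from one drawn on all 35 observations. With the real file loaded, pass the corresponding columns of money instead.

```r
# First ten rows only, copied from the str(money) output above.
# Assumption: with the real file you would use the full `money` data frame.
money_head <- data.frame(
  Money_printed = c(7374, 8323, 9700, 11200, 11975, 13325, 16024, 14388, 17292, 20000),
  GDP           = c(474131, 478918, 477392, 499120, 504914, 550379, 557258, 598885, 631839, 598974),
  Interest_RATE = c(7.25, 7.25, 7.25, 7.25, 9, 10, 10, 9, 9, 10),
  WPI           = c(14.3, 15.1, 16.7, 20.1, 25.1, 24.8, 25.3, 26.6, 26.6, 31.2)
)

# Scatter plot matrix: one panel for every pair of variables
pairs(money_head, main = "Money_printed and its candidate inputs")

# Correlation of each input variable with the output variable
cors <- cor(money_head)["Money_printed", c("GDP", "Interest_RATE", "WPI")]
print(round(cors, 3))
```

The top row of the plot matrix shows Money_printed against each input, which is the view used to judge linearity in this step.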
Step 4 : Construct a Linear model using R
multilinearmodel = lm (Money_printed ~ GDP + Interest_RATE + WPI, data = money)
summary(multilinearmodel)
Call:
lm(formula = Money_printed ~ GDP + Interest_RATE + WPI, data = money)

Residuals:
   Min     1Q Median     3Q    Max
-46875  -7027   1387  15068  46249

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.759e+04  5.052e+04  -0.348  0.73008
GDP            2.975e-01  8.787e-02   3.385  0.00195 **
Interest_RATE -1.626e+04  2.919e+03  -5.570 4.19e-06 ***
WPI            2.943e+01  9.038e+02   0.033  0.97423
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22770 on 31 degrees of freedom
Multiple R-squared:  0.9848,    Adjusted R-squared:  0.9834
F-statistic: 670.4 on 3 and 31 DF,  p-value: < 2.2e-16
Please note the formula used here is Money_printed ~ GDP + Interest_RATE + WPI
i.e Output Variable ~ Input Variable 1 + Input Variable 2 ….
An important point to note is that the P values for the variables GDP and Interest_RATE are significant, i.e. Pr(>|t|) for these variables is less than 0.05.
For the variable WPI, however, the P value is 0.97423, which is far above 0.05 and therefore highly insignificant. This means WPI does not contribute significantly to the model once GDP and Interest_RATE are included, and so we remove it.
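To see where the t value and Pr(>|t|) columns come from, here is a small sketch that recomputes them from the estimates and standard errors printed in the summary above. The variable names are made up for this example, and because the inputs are the rounded figures from the printout, the results only match the summary approximately.

```r
# Estimates and standard errors copied from the summary output above
estimate  <- c(GDP = 2.975e-01, Interest_RATE = -1.626e+04, WPI = 2.943e+01)
std_error <- c(GDP = 8.787e-02, Interest_RATE = 2.919e+03, WPI = 9.038e+02)
residual_df <- 31  # "on 31 degrees of freedom" in the summary

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = residual_df)

# Flag variables that pass the 0.05 cut-off used in this tutorial
significant <- p_value < 0.05

print(data.frame(t_value = round(t_value, 3),
                 p_value = signif(p_value, 3),
                 significant))
```

When the fitted model object is available, coef(summary(multilinearmodel)) returns the same coefficient table directly, so there is no need to retype the numbers.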
Please also note that the Adjusted R Squared is 0.9834, which is fairly good, but by removing an insignificant variable we can expect a marginal increase in Adjusted R Squared as well.
Let's now go through another iteration of creating the model after omitting WPI from the input variables.
multilinearmodel = lm (Money_printed ~ GDP + Interest_RATE , data = money)
summary(multilinearmodel)
Call:
lm(formula = Money_printed ~ GDP + Interest_RATE, data = money)

Residuals:
   Min     1Q Median     3Q    Max
-47055  -7168   1432  14998  46008

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.906e+04  2.192e+04  -0.870    0.391
GDP            3.003e-01  6.913e-03  43.443  < 2e-16 ***
Interest_RATE -1.619e+04  1.977e+03  -8.191 2.35e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22420 on 32 degrees of freedom
Multiple R-squared:  0.9848,    Adjusted R-squared:  0.9839
F-statistic: 1038 on 2 and 32 DF,  p-value: < 2.2e-16
In this second model, please note that the P values of both input variables are less than 0.05, so both are significant variables for the model.
Also note that the Adjusted R Squared value has increased from 0.9834 to 0.9839. The increase is marginal, but it still indicates that removing the variable benefited our model.
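Why does Adjusted R Squared go up when Multiple R-squared stays at 0.9848 in both models? Adjusted R Squared penalises every extra input variable. The sketch below applies the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), to the figures from the two summaries. The function name is made up for this example, and since the R-squared input is rounded to four decimals, the results agree with the printed values only up to rounding.

```r
# Adjusted R squared from R squared, number of observations n,
# and number of input variables p (excluding the intercept)
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n  <- 35      # observations in the data set
r2 <- 0.9848  # Multiple R-squared reported by both models

with_wpi    <- adj_r_squared(r2, n, p = 3)  # GDP + Interest_RATE + WPI
without_wpi <- adj_r_squared(r2, n, p = 2)  # GDP + Interest_RATE

print(c(with_wpi = with_wpi, without_wpi = without_wpi))
```

With the same R squared and one variable fewer, the penalty shrinks, which is why the second model scores slightly higher.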
Ideally we should also examine the distribution of the residuals before finalizing the model. We will skip that check for now and accept the model, since its Adjusted R Squared is acceptable.
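For readers who want to try the residual check anyway, here is one possible sketch. It is not part of the analysis in this post: it fits the two-variable model on only the first ten rows shown in the str(money) output (money_head and fit_head are made-up names), so its numbers will not match the full model. With the real data, call residuals() and fitted() on multilinearmodel instead.

```r
# First ten rows only, copied from the str(money) output above (stand-in data)
money_head <- data.frame(
  Money_printed = c(7374, 8323, 9700, 11200, 11975, 13325, 16024, 14388, 17292, 20000),
  GDP           = c(474131, 478918, 477392, 499120, 504914, 550379, 557258, 598885, 631839, 598974),
  Interest_RATE = c(7.25, 7.25, 7.25, 7.25, 9, 10, 10, 9, 9, 10)
)

fit_head <- lm(Money_printed ~ GDP + Interest_RATE, data = money_head)
res <- residuals(fit_head)

# Residuals vs fitted values: look for curves or a funnel shape
plot(fitted(fit_head), res,
     xlab = "Fitted Money_printed", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test as a rough numeric check of normality
normality <- shapiro.test(res)
print(normality$p.value)
```

A small Shapiro-Wilk p-value or a clear pattern in the first plot would be a reason to revisit the model before relying on it.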
Step 5 : Interpretation of output
From the coefficients of the variables in the output above, we construct our model for predicting Money_printed as:
Money_printed = -19060 + 0.3003 * GDP - 16190 * Interest_RATE
Now when the central bank needs to find how much money to print, it can plug the values of GDP and the interest rate into the equation above to get an estimate.
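As a quick worked example, the equation can be wrapped in a small function. The name predict_money is made up for this sketch, and the coefficients are the rounded values from the second summary. The inputs below are the 1970 values of GDP and Interest_RATE from the str(money) output.

```r
# Fitted equation with the rounded coefficients from the second model
predict_money <- function(gdp, interest_rate) {
  -19060 + 0.3003 * gdp - 16190 * interest_rate
}

# 1970 row from the data: GDP = 474131, Interest_RATE = 7.25
estimate_1970 <- predict_money(gdp = 474131, interest_rate = 7.25)
print(estimate_1970)
```

This gives an estimate of about 5944, against an actual Money_printed of 7374 for 1970, a gap well within the residual standard error of 22420. With the fitted model object at hand, predict(multilinearmodel, newdata = data.frame(GDP = 474131, Interest_RATE = 7.25)) does the same job using the unrounded coefficients.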
Hope you liked the post. Please post your reviews and comments and don’t forget to share.