In this tutorial we will try our hand at a very basic two-variable linear regression using R. We will also learn how to interpret the output R gives us and try out the visualizations needed to interpret a simple linear regression.
Please also read through the following tutorials to get more familiar with R and the background of linear regression.
Technique : 2 variable Linear Regression
When to use : When the output variable is numeric
Number of variables : 2
Model Readability : High
For this tutorial we will use a CSV version of the Excel file uploaded here: INCOME-SAVINGS. As always, please save the file and then convert it to .csv using Save As in Excel.
Step 1 : Read File
income = read.csv("INCOME-SAVINGS.csv")
str(income)

'data.frame':   22 obs. of  3 variables:
 $ YEAR   : int  1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 ...
 $ SAVINGS: int  12298 14196 17320 19995 23601 24213 26881 30896 33787 38091 ...
 $ INCOME : int  64968 69233 73824 85267 91507 99632 123067 142181 157291 185749 ...
So we have 3 variables YEAR, SAVINGS and INCOME
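If the file is not at hand, the loading step can still be tried on a small stand-in. The sketch below rebuilds only the first ten rows shown in the str() output above (the real file has 22 rows), writes them to a temporary CSV, and reads them back with read.csv(). The names sample_rows and income_sample are made up for this illustration.

```r
# First ten rows only, copied from the str() output above.
# The real INCOME-SAVINGS file has 22 rows, so this is a partial stand-in.
sample_rows <- data.frame(
  YEAR    = 1974:1983,
  SAVINGS = c(12298, 14196, 17320, 19995, 23601, 24213, 26881, 30896, 33787, 38091),
  INCOME  = c(64968, 69233, 73824, 85267, 91507, 99632, 123067, 142181, 157291, 185749)
)

# Round-trip through a temporary CSV so read.csv() can be exercised
# without the original file.
tmp <- tempfile(fileext = ".csv")
write.csv(sample_rows, tmp, row.names = FALSE)
income_sample <- read.csv(tmp)

str(income_sample)      # column names and types
head(income_sample, 3)  # first three rows
summary(income_sample)  # min, quartiles, mean and max of each column
```

Besides str(), head() and summary() are quick ways to check that the columns were read as numbers and not as text.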
Step 2 : Identify the output variable and input variable
Since savings = f(income):
Output variable : SAVINGS
Input variable : INCOME
Step 3 : Scatter plot
# x axis = INCOME (input), y axis = SAVINGS (output), matching the axis labels
plot(income$INCOME, income$SAVINGS, xlab = "Income", ylab = "Savings", main = "Savings vs Income", col = "red")
From the plot we can see a positive, roughly linear relationship between Income and Savings, so linear regression is a reasonable way to predict Savings from Income.
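The visual impression can be backed up with a number. As a rough sketch, the snippet below computes the Pearson correlation with cor() and redraws the scatter plot, using only the first ten rows visible in the str() output of Step 1. The value will therefore differ from the one for the full 22-row file, and income_sample is a name invented for this example.

```r
# Partial data: first ten rows from the str() output in Step 1.
income_sample <- data.frame(
  YEAR    = 1974:1983,
  SAVINGS = c(12298, 14196, 17320, 19995, 23601, 24213, 26881, 30896, 33787, 38091),
  INCOME  = c(64968, 69233, 73824, 85267, 91507, 99632, 123067, 142181, 157291, 185749)
)

# Pearson correlation: close to +1 indicates a strong positive linear relationship.
r <- cor(income_sample$INCOME, income_sample$SAVINGS)
print(r)

# Input variable on the x axis, output variable on the y axis.
plot(income_sample$INCOME, income_sample$SAVINGS,
     xlab = "Income", ylab = "Savings",
     main = "Savings vs Income (first ten rows)", col = "red")
```

A correlation near +1 agrees with the upward-sloping cloud of points in the plot.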
Step 4 : Construct a Linear model using R
linearmodel = lm(SAVINGS ~ INCOME, data = income)
summary(linearmodel)
Please note that the first argument to lm() is SAVINGS ~ INCOME. This argument is of type formula and usually has the form
Dependent_Variable ~ Independent_Variables
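A formula is an ordinary R object, so it can be created and inspected on its own before it is passed to lm(). The short sketch below shows this. The second formula, with X1 and X2, uses placeholder names that are not part of this tutorial's data; it only illustrates how more than one independent variable would be written.

```r
# The formula used in this tutorial: SAVINGS is explained by INCOME.
f <- SAVINGS ~ INCOME
class(f)     # the object type is "formula"
all.vars(f)  # variable names, dependent variable first

# Hypothetical example with two independent variables (placeholder names).
g <- Y ~ X1 + X2
all.vars(g)
```

Building the formula does not touch any data. The variable names are only looked up when the formula is used together with a data argument, as in lm(f, data = income).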
Output of summary(linearmodel) :

Call:
lm(formula = SAVINGS ~ INCOME, data = income)

Residuals:
     Min       1Q   Median       3Q      Max
-13036.3  -4958.9   -316.9   5368.1  16969.3

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.099e+04  2.459e+03  -4.469 0.000235 ***
INCOME       2.970e-01  6.012e-03  49.402  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7327 on 20 degrees of freedom
Multiple R-squared:  0.9919,    Adjusted R-squared:  0.9915
F-statistic:  2441 on 1 and 20 DF,  p-value: < 2.2e-16
Step 5 : Interpretation of output
The model generated for us is given below (the numbers in the Estimate column of the output are the coefficients of our linear regression equation):
SAVINGS = -10990 + 0.297 * INCOME
Notice that the p-value for INCOME, i.e. Pr(>|t|), is significant (less than 0.05), so the variable is significant in predicting SAVINGS. If a variable does not have a significant p-value, we may choose to drop it from the model.
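The coefficients and p-values do not have to be read off the printed summary by eye. The sketch below pulls them out of the model object with coef() and summary()$coefficients. It is fitted on only the first ten rows visible in the str() output of Step 1, so its estimates will not match the -10990 and 0.297 obtained from the full 22-row file. The names income_sample, fit, coef_table and p_income are invented for this example.

```r
# Partial data: first ten rows from the str() output in Step 1.
income_sample <- data.frame(
  YEAR    = 1974:1983,
  SAVINGS = c(12298, 14196, 17320, 19995, 23601, 24213, 26881, 30896, 33787, 38091),
  INCOME  = c(64968, 69233, 73824, 85267, 91507, 99632, 123067, 142181, 157291, 185749)
)

fit <- lm(SAVINGS ~ INCOME, data = income_sample)

# Named vector holding the intercept and the slope for INCOME.
coef(fit)

# Matrix with the columns Estimate, Std. Error, t value and Pr(>|t|).
coef_table <- summary(fit)$coefficients
print(coef_table)

# p-value of the INCOME coefficient, compared with the usual 0.05 cut-off.
p_income <- coef_table["INCOME", "Pr(>|t|)"]
p_income < 0.05
```

With the full data set, the same calls on linearmodel return the numbers shown in the summary output above.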
The next number to be aware of is R-squared. In our case the adjusted R-squared is 0.9915, which means the model explains about 99% of the variation in SAVINGS. The ideal R-squared value is domain specific, but as a rule of thumb anything above 70% is usually considered very good, and such a model is taken to be a good one for prediction.
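To make "variation explained" concrete, R-squared can be recomputed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this and compares the result with what summary() reports. It uses only the first ten rows from the str() output of Step 1, so the values differ from the 0.9919 and 0.9915 shown for the full data. All variable names here are invented for the illustration.

```r
# Partial data: first ten rows from the str() output in Step 1.
income_sample <- data.frame(
  YEAR    = 1974:1983,
  SAVINGS = c(12298, 14196, 17320, 19995, 23601, 24213, 26881, 30896, 33787, 38091),
  INCOME  = c(64968, 69233, 73824, 85267, 91507, 99632, 123067, 142181, 157291, 185749)
)

fit <- lm(SAVINGS ~ INCOME, data = income_sample)

# Variation left unexplained by the model vs. total variation in SAVINGS.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((income_sample$SAVINGS - mean(income_sample$SAVINGS))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted R-squared penalises the number of input variables (p = 1 here).
n <- nrow(income_sample)
p <- 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

# The manual values should agree with the ones stored in the summary.
c(manual = r2_manual, from_summary = summary(fit)$r.squared)
c(manual = adj_r2_manual, from_summary = summary(fit)$adj.r.squared)
```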
We will delve into the details of R-squared, t values, residuals and the F-statistic in a subsequent tutorial. For this discussion we can safely ignore them.
The model can be interpreted as: "When Income rises by 1 unit, Savings rise by 0.297 units."
Now, whenever we have a value of INCOME, we can calculate SAVINGS using the equation:
SAVINGS = -10990 + 0.297 * INCOME
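As a worked example, the equation can be wrapped in a small R function and applied to a few incomes. The function name predict_savings and the income values are made up for illustration, and the coefficients are the rounded ones from the equation above.

```r
# Rounded coefficients taken from the fitted equation in this tutorial.
predict_savings <- function(income) {
  -10990 + 0.297 * income
}

# One income value: -10990 + 0.297 * 100000 = 18710
predict_savings(100000)

# The function is vectorised, so several incomes can be passed at once.
predict_savings(c(50000, 100000, 150000))  # 3860, 18710, 33560
```

With the fitted model object available, predict(linearmodel, newdata = data.frame(INCOME = 100000)) does the same job using the unrounded coefficients, so its result can differ slightly from the hand calculation.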
I sincerely hope you enjoyed the tutorial. Please post your feedback and comments, and share other articles on the site.
Till next time, happy learning.