In this tutorial we will try our hand at a very basic two-variable linear regression using R. We will also learn how to interpret the output given by R and try out the visualization needed for interpreting simple linear regression.

Please also read through the following tutorials to get more familiar with R and the background of linear regression.

R : Basic Data Analysis – Part 1

R Tutorial : Intermediate Data Analysis – Part 2

Tutorial : Concept of Linearity in Linear Regression

Tutorial : Linear Regression Construct

Technique : 2 variable Linear Regression

When to use : When our output variable is numeric

No of variables : 2

Model Readability : High

For this tutorial we will use the CSV version of the Excel file uploaded here: INCOME-SAVINGS. As always, please save the file and then convert it to .csv using Save As in Excel.

Step 1 : Read File

income = read.csv ("INCOME-SAVINGS.csv")
str( income )
'data.frame': 22 obs. of 3 variables:
$ YEAR : int 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 ...
$ SAVINGS: int 12298 14196 17320 19995 23601 24213 26881 30896 33787 38091 ...
$ INCOME : int 64968 69233 73824 85267 91507 99632 123067 142181 157291 185749 ...

So we have 22 observations of 3 variables: YEAR, SAVINGS and INCOME.

Step 2 : Identify the output variable and input variable

Since savings = f(income):

Output variable : SAVINGS

Input variable : INCOME

Step 3 : Scatter plot

# x axis: the input variable (INCOME); y axis: the output variable (SAVINGS)
plot(income$INCOME, income$SAVINGS, xlab = "Income", ylab = "Savings", main = "Savings vs Income", col = "red")

[Figure: scatter plot of Savings vs Income]

From the plot it is clear that there is a positive linear relationship between Income and Savings, so we can use linear regression to predict Savings from Income.
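To put a number on what the scatter plot suggests, the correlation between the two columns can be computed with cor(). The sketch below is only illustrative: it assumes a small data frame (named income_head here, a made-up name) that holds just the first ten rows printed by str() in Step 1, so its result will differ from what the full 22-row file gives.

```r
# First ten rows only, copied from the str() output in Step 1
# (assumption: the remaining 12 rows of the file are not used here).
income_head <- data.frame(
  YEAR    = 1974:1983,
  SAVINGS = c(12298, 14196, 17320, 19995, 23601, 24213, 26881, 30896, 33787, 38091),
  INCOME  = c(64968, 69233, 73824, 85267, 91507, 99632, 123067, 142181, 157291, 185749)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship.
r <- cor(income_head$INCOME, income_head$SAVINGS)
print(r)
```

With the full file loaded as in Step 1, the same check would be cor(income$INCOME, income$SAVINGS).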

Step 4 : Construct a Linear model using R

linearmodel = lm (SAVINGS ~ INCOME , data = income)
summary(linearmodel)

Please note that the first argument to lm is SAVINGS ~ INCOME. This argument is of type formula and usually has the form
Dependent_Variable ~ Independent_Variables
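A formula is an ordinary R object, so it can be stored in a variable and inspected before it is handed to lm. A small sketch (the variable names savings_formula and multi_formula are ours, not part of the tutorial's code):

```r
# Build the formula on its own; no data is needed at this point.
savings_formula <- SAVINGS ~ INCOME

class(savings_formula)      # "formula"
all.vars(savings_formula)   # "SAVINGS" "INCOME": dependent variable first

# More independent variables go on the right-hand side, joined with +.
# SAVINGS ~ INCOME + YEAR is shown only to illustrate the syntax.
multi_formula <- SAVINGS ~ INCOME + YEAR
all.vars(multi_formula)     # "SAVINGS" "INCOME" "YEAR"
```

Passing the stored object, as in lm(savings_formula, data = income), is equivalent to writing the formula inline.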

Output of summary(linearmodel) :
Call:
lm(formula = SAVINGS ~ INCOME, data = income)

Residuals:
     Min       1Q   Median       3Q      Max 
-13036.3  -4958.9   -316.9   5368.1  16969.3 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.099e+04  2.459e+03  -4.469 0.000235 ***
INCOME       2.970e-01  6.012e-03  49.402  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7327 on 20 degrees of freedom
Multiple R-squared:  0.9919,	Adjusted R-squared:  0.9915 
F-statistic:  2441 on 1 and 20 DF,  p-value: < 2.2e-16
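
The numbers in this printout can also be pulled out of the fitted object instead of being read off the screen. The sketch below assumes only the first ten rows shown by str() in Step 1 (the names income_head and fit_head are hypothetical), so its estimates will not match the 22-row output above; the point is the accessor functions.

```r
# First ten rows only, copied from the str() output in Step 1.
income_head <- data.frame(
  SAVINGS = c(12298, 14196, 17320, 19995, 23601, 24213, 26881, 30896, 33787, 38091),
  INCOME  = c(64968, 69233, 73824, 85267, 91507, 99632, 123067, 142181, 157291, 185749)
)

fit_head    <- lm(SAVINGS ~ INCOME, data = income_head)
fit_summary <- summary(fit_head)

coef(fit_head)              # named vector: (Intercept) and INCOME
coef(fit_summary)           # matrix with Estimate, Std. Error, t value, Pr(>|t|)
fit_summary$r.squared       # Multiple R-squared
fit_summary$adj.r.squared   # Adjusted R-squared
fit_summary$sigma           # residual standard error
```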

Step 5 : Interpretation of output

The model that is generated for us is given below. Its coefficients come from the Estimate column of the Coefficients table: -1.099e+04 for the intercept and 2.970e-01 for INCOME.

SAVINGS = -10990 + 0.297 * INCOME

Please notice that the p-value for INCOME, i.e. Pr(>|t|), is less than 2e-16, which is far below the usual 0.05 threshold, and hence the variable is significant in predicting SAVINGS. If a variable does not have a significant p-value, we may choose to drop that variable from the model.
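
The t value and p-value in the coefficient table follow from the estimate and its standard error, so they can be checked by hand. A minimal sketch using the rounded figures printed in the summary (the variable names are ours, and small rounding differences from the printout are expected):

```r
# Figures copied from the Coefficients table, rounded as printed.
estimate  <- c(intercept = -1.099e+04, income = 2.970e-01)
std_error <- c(intercept =  2.459e+03, income = 6.012e-03)
df_resid  <- 20   # residual degrees of freedom: 22 observations - 2 coefficients

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 20 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(t_value)
print(p_value)
print(p_value < 0.05)   # TRUE for both coefficients
```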

The next number that we have to be aware of is R-squared. In our case the Adjusted R-squared is 0.9915, which implies that the model explains about 99% of the variation in SAVINGS. What counts as a good R-squared value is domain specific, but as a rough rule of thumb anything above 70% is often treated as very good, and such a model is usually considered suitable for prediction.

We will delve into the details of R-squared, t value, residuals and F statistic in a subsequent tutorial. For this discussion we can safely ignore them.
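
Adjusted R-squared is derived from Multiple R-squared by penalizing for the number of predictors. The sketch below applies the standard formula, 1 - (1 - R^2) * (n - 1) / (n - p - 1), to the rounded values printed in the summary; the variable names are ours.

```r
r_squared <- 0.9919   # Multiple R-squared as printed in the summary
n <- 22               # number of observations
p <- 1                # number of predictors (INCOME)

adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))   # 0.9915
```

With a single predictor and 22 observations the two values are almost identical; the gap widens as more predictors are added relative to the number of observations.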

The model can be interpreted as: "When Income rises by 1 unit, Savings rise by 0.297 units."

Now whenever we have any value of INCOME we can calculate SAVINGS using the equation:

SAVINGS = -10990 + 0.297 * INCOME
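
The equation translates directly into a small R function. This is a sketch using the rounded coefficients above; the function name predict_savings and the income values passed to it are made up for illustration.

```r
# Hypothetical helper: applies SAVINGS = -10990 + 0.297 * INCOME
predict_savings <- function(income) {
  -10990 + 0.297 * income
}

predict_savings(100000)              # 18710
predict_savings(c(100000, 200000))   # 18710 48410

# The slope is the change in predicted savings for a one-unit rise in income.
predict_savings(100001) - predict_savings(100000)   # 0.297, up to floating-point rounding
```

With the fitted object from Step 4, predict(linearmodel, newdata = data.frame(INCOME = 100000)) performs the same calculation using the unrounded coefficients.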

I sincerely hope you enjoyed the tutorial. Please post your feedback and comments, and share other articles on the site.

As a next step in analyzing the model, you should also go through residual analysis, look at the Adjusted R-squared value and interpret the F statistic.

Till next time, happy learning.

Next in the series :

R Tutorial : Multiple Linear Regression

R Tutorial : Residual Analysis for Regression

R Tutorial : How to use Diagnostic Plots for Regression Models
