In this series of articles so far we have seen the basics of machine learning, linearity in regression problems, the construct of linear regression, and 2 variable as well as multiple linear regression. To recap, linear regression can be used when our response variable (aka dependent variable) is a continuous variable.
In many machine learning problems, however, we want to predict whether our output variable belongs to a particular category.
e.g.
All of these are classification problems which fall under the area of Supervised Learning.
Problem 1 and Problem 3 in RED are Binary classification problems since we are classifying the output into 2 classes in both the cases as Yes or No.
Problem 2 and Problem 4 in BLUE are Multi Class Classification problems since we want to classify the output into more than two classes.
Following techniques are popular for Binary Classification problems.
a. Logistic Regression
b. Decision Trees
c. Random Forests
d. Neural Networks
e. Support Vector Machines
Multi class classification problems are popularly tackled using following techniques.
a. Multinomial Logistic Regression
b. Support Vector Machines
c. Neural Networks
d. A popular technique is to split a multi class classification problem into multiple binary classification problems and then model each of the sub problems separately.
We will learn and examine most of these classification techniques in subsequent posts , till then Happy Learning !
]]>In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR).
As a supplement to the textbook, you may also want to watch the excellent course lecture videos (linked below), in which Dr. Hastie and Dr. Tibshirani discuss much of the material. In case you want to browse the lecture content, I’ve also linked to the PDF slides used in the videos. For basics of Machine Learning please read through – Introduction to Machine Learning
I highly recommend that you bookmark this page and use it as a reference whenever required.
Other Preparatory Regression Tutorials
 Tutorial : Concept of Linearity in Linear Regression
 Tutorial : Linear Regression Construct
 R Tutorial : Basic 2 variable Linear Regression
 R Tutorial : Multiple Linear Regression
 R Tutorial : Residual Analysis for Regression
 R Tutorial : How to use Diagnostic Plots for Regression Models
 R Tutorial : Interpretation of R Squared and Adjusted R Squared in Regression
 R Tutorial : How to interpret F Statistic in Regression Models
The article was originally posted at Here.
]]>
We have already seen R Tutorial : Multiple Linear Regression, and as the next step we saw R Tutorial : Residual Analysis for Regression and R Tutorial : How to use Diagnostic Plots for Regression Models. Once our model passes the residual analysis we can go ahead and check R Squared and Adjusted R Squared. As the last step in analysing the model we have to interpret and understand an important measure called the F Statistic.
We have already discussed in R Tutorial : Multiple Linear Regression how to interpret the p-values of the t test for individual predictor variables to check whether they are significant in the model or not.
Instead of judging the coefficients of individual variables on their own using the t test, the F statistic (aka F Test for overall significance in Regression) judges multiple coefficients taken together at the same time.
The model with zero predictor variables is also called the "Intercept Only Model". The F Test for overall significance compares an intercept only regression model with the current model, and then tries to comment on whether the addition of these variables, taken together, is significant enough for them to be in the model.
The hypotheses for the F Test for overall significance can be constructed as:
H0 : The fit of the intercept only model and the current model is the same, i.e. the additional variables, taken together, do not provide value.
Ha : The fit of the intercept only model is significantly worse than that of our current model, i.e. the additional variables do make the model significantly better.
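The comparison described by H0 and Ha can be sketched in R with anova() on two nested models. This is only an illustration: it uses R's built-in mtcars data as a stand-in (an assumption, it is not the data set of this tutorial), and the variable names are made up for the example.

```r
# Built-in mtcars data is a stand-in for the tutorial's data (assumption).
full_model <- lm(mpg ~ wt + hp, data = mtcars)   # current model
null_model <- lm(mpg ~ 1, data = mtcars)         # intercept only model

# anova() on two nested lm fits runs the F test of H0 against Ha.
comparison <- anova(null_model, full_model)
print(comparison)

# The F value from the comparison is the same number that summary() reports
# as the overall F-statistic of the current model.
f_from_anova   <- comparison$F[2]
f_from_summary <- unname(summary(full_model)$fstatistic["value"])
print(c(anova = f_from_anova, summary = f_from_summary))
```

So the F-statistic line at the bottom of summary() is exactly this test against the intercept only model.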
Without going into the actual derivation of the F statistic, here is the short formula for calculating the F statistic of a model:
F = [ R² / (k – 1) ] / [ (1 – R²) / (n – k) ]
where n is the total number of observations and k is the number of variables plus 1 for the intercept.
With reference to the example we took in R Tutorial : Multiple Linear Regression, the F-statistic of multilinearmodel is given in the summary output as:
Multiple R-squared: 0.9848, Adjusted R-squared: 0.9834
F-statistic: 670.4 on 3 and 31 DF, p-value: < 2.2e-16
Here in this example we had:
n = 35 ( Total number of observations )
k = 4 ( no of variables + 1 for intercept )
So the degrees of freedom that we get are
DF Numerator = (k – 1) = 3 : matches the DF provided by the R output
DF Denominator = (n – k) = (35 – 4) = 31 : matches the DF provided by the R output
The F statistic that we get is:
F = [ R² / (k – 1) ] / [ (1 – R²) / (n – k) ]
F = [ 0.9848 / 3 ] / [ 0.0152 / 31 ]
F ≈ 670 : matches the F Statistic provided by R
The p-value of the F Statistic 670 for DF 3 and 31 is extremely small, i.e. smaller than 0.001, so we can reject H0 and say that the variables, taken together, significantly improve the model. In other words, by adding those extra variables we were able to improve the fit of our model significantly compared with the intercept only model.
R squared provides a measure of the strength of the relationship between our predictors and our response variable, but it does not comment on whether the relationship is statistically significant. The F Statistic lets us judge whether that relationship is statistically significant; in other words it comments on whether R² is significant or not.
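The arithmetic above can be checked with a few lines of R. This is a minimal sketch that starts from the rounded R² printed in the summary output, so the result differs slightly from R's own 670.4; the variable names are just for illustration.

```r
# Values taken from the summary output of multilinearmodel.
r_squared <- 0.9848   # rounded, as printed by summary()
n <- 35               # total number of observations
k <- 4                # number of variables + 1 for the intercept

# F = [R^2 / (k - 1)] / [(1 - R^2) / (n - k)]
f_statistic <- (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))

# Upper tail probability of the F distribution with (k - 1, n - k) DF.
p_value <- pf(f_statistic, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(f_statistic)
print(p_value)
```

The small gap between this value and 670.4 comes only from rounding R² to four decimals.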
Hope you have learnt few intricacies of regression models by now. Next up I will be writing about Logistic regression models.
Till then Enjoy Life and Keep Learning !
Other previous articles that you may like –
Tutorial : Concept of Linearity in Linear Regression
Tutorial : Linear Regression Construct
R Tutorial : Basic 2 variable Linear Regression
R Tutorial : Multiple Linear Regression
R Tutorial : Residual Analysis for Regression
R Tutorial : How to use Diagnostic Plots for Regression Models
R Tutorial : Interpretation of R Squared and Adjusted R Squared in Regression
]]>
Once we have fitted our model to data using regression, we have to find out how well our model fits the data. R gives many goodness of fit statistics out of the box when we create a model. In this tutorial we will discuss an important statistic called R Squared ( R² ). We will also try to bust the myths that low R Squared values are always bad and high R Squared values are always good.
By the way, you should look at R Squared only once your model passes the residual analysis described in R Tutorial : Residual Analysis for Regression and R Tutorial : How to use Diagnostic Plots for Regression Models.
R Squared is a measure which tells us how well our regression equation explains observed data values.
R Squared = ( Explained Variation in Observed Values) / (Total variation in Observed Values)
0% <= R Squared <= 100%
So R² = 67% implies that you have a regression equation which can explain 67% of the variation of the observed values around their mean.
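To make the ratio concrete, here is a small sketch that computes R Squared by hand and compares it with the value R reports. It uses R's built-in cars data purely as a stand-in (an assumption, not the data of this series).

```r
# Built-in cars data (speed, dist) is used only for illustration.
fit <- lm(dist ~ speed, data = cars)

observed <- cars$dist

# Total variation of the observed values around their mean.
total_variation <- sum((observed - mean(observed))^2)

# Variation explained by the regression: fitted values around the same mean.
explained_variation <- sum((fitted(fit) - mean(observed))^2)

r_squared <- explained_variation / total_variation

print(r_squared)
print(summary(fit)$r.squared)   # same number, as reported by R
```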
Obviously when you add more predictor variables to regression equation which explain more variance you will get a higher R². Does it mean that when we compare 2 models on same data , the model with higher R² is always better than the model with lower R² ?
The answer is NO . Not always ! More predictor variables in a model implies more complexity which may have a side effect of Over fitting. So pure R² is not a very reliable measure. We need a measure which can tell us in absolute terms whether addition of new variable can explain variance worth of the additional Complexity.
It is for this reason that we use Adjusted R².
Adjusted R² is a measure derived from R² which penalizes each added variable for the additional complexity.
Adjusted R² = 1 – [ (1 – R²) (N – 1) ] / (N – p – 1)
N = Sample Size
p = number of predictors
Please note that p is subtracted in the denominator, so an increase in p means a decrease in Adjusted R² if R² does not increase enough and everything else remains constant.
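The penalty can be seen with a short sketch. It assumes the standard definition Adjusted R² = 1 – (1 – R²)(N – 1)/(N – p – 1) and borrows the numbers from the multiple regression tutorial in this series (R² = 0.9848, N = 35, with 3 and then 2 predictors). The function name adjusted_r_squared is made up for this example.

```r
# Standard Adjusted R^2 formula (assumed); N = sample size, p = predictors.
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

# Same (rounded) R^2, but one predictor fewer in the second call.
adj_three_predictors <- adjusted_r_squared(0.9848, n = 35, p = 3)
adj_two_predictors   <- adjusted_r_squared(0.9848, n = 35, p = 2)

print(adj_three_predictors)
print(adj_two_predictors)
```

With R² unchanged, the model with fewer predictors gets the higher Adjusted R², which is the direction of change seen when WPI was dropped in the multiple regression tutorial. The results agree with R's printed values up to the rounding of R².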
Are low R² values always bad? NO. The desirable range of R² is highly domain dependent. Any model which attempts to predict human behavior is seldom very precise and hence a lower R² is expected, whereas for models in medicine and pharma R² values above 90% are very common.
Are high R² values always good? NO. As mentioned in R Tutorial : Residual Analysis for Regression and R Tutorial : How to use Diagnostic Plots for Regression Models, even if you have a high R², the model is not considered good enough if the residuals show some inherent pattern, are heteroscedastic, or are not normally distributed.
As a next step you should look at interpretation of F Statistic.
Enjoy Life and Keep Learning !
]]>
In the last article, R Tutorial : Residual Analysis for Regression, we looked at how to do residual analysis manually. R by default gives 4 diagnostic plots for regression models. In this article we will look at how to interpret these diagnostic plots. We will use the same data which we used in R Tutorial : Residual Analysis for Regression. The data set can be downloaded from here INCOMESAVINGS. You need to convert the excel file to csv before creating the model.
Let's first create the model.
income = read.csv("INCOMESAVINGS.csv")
linearmodel = lm(SAVINGS ~ INCOME, data = income)

# Change the layout to 2x2 to accommodate all plots
par(mfrow = c(2, 2))
par(mar = rep(2, 4))
# Diagnostic Plots
plot(linearmodel)
Let's now look at the analysis of each of these plots individually.
Before attacking the Residuals vs Leverage plot we must know what influence and leverage are. Let's understand them first.
Influence : The influence of an observation can be thought of in terms of how much the predicted scores would change if the observation were excluded. Cook's Distance is a pretty good measure of the influence of an observation.
Leverage : The leverage of an observation is based on how much the observation's value on the predictor variable differs from the mean of the predictor variable. The more leverage an observation has, the greater the potential that point has in terms of influence.
Now that we are clear on what leverage is, let's analyze our leverage plot and draw inferences.
In this plot the dotted red lines are Cook's distance contours, and the areas of interest for us are the ones outside the dotted lines in the top right or bottom right corner. If any point falls in that region, we say that the observation has high leverage and high influence, i.e. the fitted model would change noticeably if we excluded that point.
It is not always the case, though, that all outliers will have high leverage or vice versa.
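Leverage and Cook's distance can also be inspected as numbers rather than read off the plot. The sketch below uses a small made-up data set (an assumption, not INCOMESAVINGS) in which the last point sits far away from the others on the predictor.

```r
# Hypothetical data: the 10th observation has an extreme predictor value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 40.0)
fit <- lm(y ~ x)

# Leverage: depends only on how far each x is from the mean of x.
leverage <- hatvalues(fit)

# Influence: Cook's distance combines leverage with the size of the residual.
influence <- cooks.distance(fit)

print(round(cbind(leverage, influence), 3))

# Index of the observation with the highest leverage.
highest_leverage <- unname(which.max(leverage))
print(highest_leverage)
```

On a fitted model, plot(fit, which = 5) draws the same Residuals vs Leverage plot discussed above, so the numbers and the picture can be compared side by side.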
In this case observation #22 has high leverage and we have 3 choices
Choice 1 : Justify the inclusion of #22 and keep the model as is
Choice 2 : Include quadratic term as indicated by Residual vs fitted plot and remodel
Choice 3: Exclude observation #22 and remodel.
We will try both Choice #2 and Choice #3 and see what kind of diagnostic plots we get
linearmodel2 = lm(SAVINGS ~ I(INCOME^2) + INCOME, data = income)
summary(linearmodel2)

Output:
Call:
lm(formula = SAVINGS ~ I(INCOME^2) + INCOME, data = income)

Residuals:
   Min     1Q Median     3Q    Max
-10304  -2027    252   2149  10352

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.936e+02  2.485e+03   0.239    0.814
I(INCOME^2) 8.561e-08  1.579e-08   5.423 3.12e-05 ***
INCOME      2.187e-01  1.494e-02  14.644 8.39e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4709 on 19 degrees of freedom
Multiple R-squared: 0.9968, Adjusted R-squared: 0.9965
F-statistic: 2968 on 2 and 19 DF, p-value: < 2.2e-16

# Diagnostic plots
plot(linearmodel2)
In this case our diagnostic plots are much better. The residual trend line is almost horizontal, the spread is almost uniform, and no point has excess leverage. The QQ plot however shows that a few points are not along the Normal line, but that may be acceptable.
We will check another model without quadratic term and excluding observation #22
# exclude #22
income2 = income[1:21, ]
linearmodel3 = lm(SAVINGS ~ INCOME, data = income2)
summary(linearmodel3)

Output:
Call:
lm(formula = SAVINGS ~ INCOME, data = income2)

Residuals:
  Min    1Q Median    3Q   Max
-8601 -3585   1088  4294 15186

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.693e+03  2.064e+03  -4.211 0.000473 ***
INCOME       2.861e-01  5.693e-03  50.253  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5837 on 19 degrees of freedom
Multiple R-squared: 0.9925, Adjusted R-squared: 0.9921
F-statistic: 2525 on 1 and 19 DF, p-value: < 2.2e-16

# Diagnostic plots
plot(linearmodel3)
None of these plots look fine; in fact they are worse than those of our original model.
So far the model with quadratic terms seems to have fared the best in terms of our analysis of diagnostic plots. So we can recommend the model with Quadratic term.
Hope you have enjoyed reading the short tutorial. Keep learning and keep sharing !
Other similar articles of interest
Tutorial : Linear Regression Construct
Tutorial : Concept of Linearity in Linear Regression
R Tutorial : Basic 2 variable Linear Regression
]]>Please go through following articles as well to understand basics of Regression
Tutorial : Concept of Linearity in Linear Regression
Tutorial : Linear Regression Construct
R Tutorial : Basic 2 variable Linear Regression
R Tutorial : Multiple Linear Regression
Once we have created a regression model we must know whether the model is valid or not. Residual analysis is one of the most important steps in understanding whether the model that we have created using regression with the given variables is valid or not.
Let's take the example which we used in our 2 variable linear regression tutorial, R Tutorial : Basic 2 variable Linear Regression, and build from that. The data set which we will use is INCOMESAVINGS, as in the earlier linear regression tutorial.
We will quickly write R code to create a linear model and then we will discuss our main topic of residual analysis.
income = read.csv("INCOMESAVINGS.csv")
linearmodel = lm(SAVINGS ~ INCOME, data = income)
summary(linearmodel)

Output :
Call:
lm(formula = SAVINGS ~ INCOME, data = income)

Residuals:
     Min       1Q   Median       3Q      Max
-13036.3  -4958.9    316.9   5368.1  16969.3

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.099e+04  2.459e+03  -4.469 0.000235 ***
INCOME       2.970e-01  6.012e-03  49.402  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7327 on 20 degrees of freedom
Multiple R-squared: 0.9919, Adjusted R-squared: 0.9915
F-statistic: 2441 on 1 and 20 DF, p-value: < 2.2e-16
Now that we have our basic model ready, let's understand what residuals are and how to interpret them.
Regression models usually are of the form
Output = [ Constant + Input ] + error
i.e it consists of 2 types of terms Deterministic and Stochastic.
i.e Output = Deterministic + Stochastic
In this case the regression equation which we generated is
SAVINGS = –10990 + 0.297 * INCOME
which means that for the data we used for regression, which already has the historical actual values of savings, we can generate predicted savings and add them to our income data set as SAVINGS_Predicted.
Now our predictions for each of these data points will not exactly match the observed savings values. The difference between the observed and predicted values is called the error or residual.
Residual = Observed value – Predicted Value
i.e. in the form written above, the stochastic component is our residual.
In our case Residual = SAVINGS – SAVINGS_Predicted
income$SAVINGS_Predicted = predict(linearmodel)
income$Residual = income$SAVINGS - income$SAVINGS_Predicted
head(income)

Output:
  YEAR SAVINGS INCOME SAVINGS_Predicted Residual
1 1974   12298  64968          8303.023 3994.977
2 1975   14196  69233          9569.654 4626.346
3 1976   17320  73824         10933.101 6386.899
4 1977   19995  85267         14331.474 5663.526
5 1978   23601  91507         16184.646 7416.354
6 1979   24213  99632         18597.630 5615.370
Let's plot these predicted values vs the residuals.
plot(income$SAVINGS_Predicted, income$Residual, pch = 21, bg = "red", col = "red")
abline(0, 0)
In this plot the horizontal line y = 0 represents our regression line. Any points above it have positive residuals and points below it have negative residuals.
What we should look for here is patterns. If we see a set pattern in the residuals, that would mean that some of the predictor information is leaking into the error, implying that we have to look for an explanatory variable to include in the model to account for that leaked pattern.
We manually created the residuals and a residual plot here, but the R model has already computed the residuals for us and they are stored inside the model. In our case we can access them using resid(linearmodel) or linearmodel$residuals.
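As a quick check that the manual calculation and the residuals stored in the model agree, here is a minimal sketch. To stay self-contained it fits a model on only the six rows printed by head(income) above, so its coefficients differ from those of linearmodel; the name income_head is made up for this example.

```r
# Only the six rows shown by head(income); not the full 22 row data set.
income_head <- data.frame(
  SAVINGS = c(12298, 14196, 17320, 19995, 23601, 24213),
  INCOME  = c(64968, 69233, 73824, 85267, 91507, 99632)
)
fit <- lm(SAVINGS ~ INCOME, data = income_head)

# Residual = Observed value - Predicted value, computed by hand.
manual_residual <- income_head$SAVINGS - predict(fit)

# Residuals as stored by R inside the model object.
model_residual <- resid(fit)

print(cbind(manual_residual, model_residual))

# With an intercept in the model the residuals sum to (numerically) zero.
print(sum(model_residual))
```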
Another very important point that we need to check is that our residuals follow a roughly normal distribution. We can do so by checking the histogram of the residuals. A roughly normal histogram supports the validity of the model, provided the residual plot shows no pattern.
Let us now look at a few residual plots for other data sets and other models [not necessarily of actual linear models; they may represent erroneous cases] and let us see how to interpret these residual plots.
I sincerely hope , the tutorial will be useful for everyone in helping them to understand validity of Regression model using residual plots. Please send me your feedback and suggestion and share the knowledge.
Happy Learning !!!
To get more understanding on Residual analysis and diagnostic plots please read R Tutorial : How to use Diagnostic Plots for Regression Models
]]>
Please also read through the following tutorials to get more familiar with R and the background of linear regression.
R : Basic Data Analysis – Part 1
R Tutorial : Intermediate Data Analysis – Part 2
Tutorial : Concept of Linearity in Linear Regression
Tutorial : Linear Regression Construct
R Tutorial : Basic 2 variable Linear Regression
Technique : Multiple Linear Regression
When to use : When our output variable is numeric
No of variables : > 2
Model Readability : High
For this tutorial we will be using csv version of the excel file uploaded here demandformoney. As always please save the file and then convert it to .csv using save as from excel.
Central Bank prints paper money each year. For each year they need an estimate of how much money to be printed. The decision is based on various economic indicators like GDP, Interest rate etc. We will try to model the solution to this problem using Multiple linear regression.
money = read.csv("demandformoney.csv")
str(money)
‘data.frame’: 35 obs. of 5 variables:
$ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 …
$ Money_printed: int 7374 8323 9700 11200 11975 13325 16024 14388 17292 20000 …
$ GDP : int 474131 478918 477392 499120 504914 550379 557258 598885 631839 598974 …
$ Interest_RATE: num 7.25 7.25 7.25 7.25 9 10 10 9 9 10 …
$ WPI : num 14.3 15.1 16.7 20.1 25.1 24.8 25.3 26.6 26.6 31.2 …
Note : We have 4 variables of interest here Money_printed, GDP , Interest_RATE and WPI and we have conveniently omitted the variable year for a reason. But we will leave that discussion out for another tutorial post.
since ;
Money_printed = f ( GDP , Interest_RATE , WPI )
output variable : Money_printed
Input variables : GDP , Interest_RATE , WPI
plot(money,col='red')
All the Input variables seem to be fairly linearly correlated with our output variable Money_printed except for WPI.
multilinearmodel = lm(Money_printed ~ GDP + Interest_RATE + WPI, data = money)
summary(multilinearmodel)

Call:
lm(formula = Money_printed ~ GDP + Interest_RATE + WPI, data = money)

Residuals:
   Min     1Q Median     3Q    Max
-46875  -7027   1387  15068  46249

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.759e+04  5.052e+04   0.348  0.73008
GDP            2.975e-01  8.787e-02   3.385  0.00195 **
Interest_RATE -1.626e+04  2.919e+03  -5.570 4.19e-06 ***
WPI            2.943e+01  9.038e+02   0.033  0.97423
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22770 on 31 degrees of freedom
Multiple R-squared: 0.9848, Adjusted R-squared: 0.9834
F-statistic: 670.4 on 3 and 31 DF, p-value: < 2.2e-16
Please note the formula used here is Money_printed ~ GDP + Interest_RATE + WPI
i.e. Output Variable ~ Input Variable 1 + Input Variable 2 ….
An important point to be noted is that the p-values for the variables GDP and Interest_RATE are significant, i.e. Pr(>|t|) for these variables is less than 0.05.
For the variable WPI however the p-value is 0.97423, which is far greater than 0.05 and hence not significant. What that means is that the variable WPI does not contribute significantly to the model once GDP and Interest_RATE are included, so it is a candidate for removal.
Please also note that the Adjusted R Squared is 0.9834, which is fairly good, but by removing an insignificant variable we can expect a marginal increase in Adjusted R Squared as well.
Let's now go through another iteration of creating a model after omitting WPI from the input variables.
multilinearmodel = lm(Money_printed ~ GDP + Interest_RATE, data = money)
summary(multilinearmodel)
Call:
lm(formula = Money_printed ~ GDP + Interest_RATE, data = money)
Residuals:
   Min     1Q Median     3Q    Max
-47055  -7168   1432  14998  46008

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.906e+04  2.192e+04   0.870    0.391
GDP            3.003e-01  6.913e-03  43.443  < 2e-16 ***
Interest_RATE -1.619e+04  1.977e+03  -8.191 2.35e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22420 on 32 degrees of freedom
Multiple R-squared: 0.9848, Adjusted R-squared: 0.9839
F-statistic: 1038 on 2 and 32 DF, p-value: < 2.2e-16
In this 2nd model please note that P values of both input variables are less than 0.05 and so both are significant variables for the model.
Also please note that Adjusted R Squared value has increased from 0.9834 to 0.9839 which is marginal increase but it still indicates that removing variable actually benefited our model.
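The drop-a-variable iteration can be scripted instead of read off the printed summaries. The sketch below uses simulated data (an assumption, since the demandformoney file is not bundled here); the names sim and noise_var are made up, and noise_var plays the role of a predictor that is unrelated to the output.

```r
set.seed(42)
n <- 35
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise_var = rnorm(n))
# Simulated output depends on x1 and x2 only.
sim$y <- 5 + 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

full_model <- lm(y ~ x1 + x2 + noise_var, data = sim)

# p-values of the individual t tests, as printed in the Pr(>|t|) column.
p_values <- coef(summary(full_model))[, "Pr(>|t|)"]
print(round(p_values, 4))

# Refit without the candidate variable and compare Adjusted R Squared.
reduced_model <- update(full_model, . ~ . - noise_var)
comparison <- c(
  full    = summary(full_model)$adj.r.squared,
  reduced = summary(reduced_model)$adj.r.squared
)
print(comparison)
```

Plain R Squared can never go up when a variable is removed, which is why the comparison is made on Adjusted R Squared.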
We must ideally also consider distribution of our residuals before finalizing the model. But we will ignore this point now and accept the model since it has acceptable Adjusted R Squared.
From the coefficients of variables in the output above we construct our model for predicting Money_printed as –
Money_printed = 19060 + 0.3003 * GDP – 16190 * Interest_RATE
Now when Central Bank needs to find how much Money is to be printed , they can plug in the values of GDP and Interest Rate into the equation above and find an estimate of money to be printed.
But before concluding that the model is good we must go through Residual Analysis , look at Adjusted R Squared values and interpret the F Statistic.
Hope you liked the post. Please post your reviews and comments and don’t forget to share.
Further you can read following tutorials for gaining further understanding
R Tutorial : Residual Analysis for Regression
R Tutorial : How to use Diagnostic Plots for Regression Models
]]>
R : Basic Data Analysis – Part 1
R Tutorial : Intermediate Data Analysis – Part 2
Tutorial : Concept of Linearity in Linear Regression
Tutorial : Linear Regression Construct
Technique : 2 variable Linear Regression
When to use : When our output variable is numeric
No of variables : 2
Model Readability : High
For this tutorial we will be using csv version of the excel file uploaded here INCOMESAVINGS . As always please save the file and then convert it to .csv using save as from excel.
income = read.csv("INCOMESAVINGS.csv")
str(income)

'data.frame': 22 obs. of 3 variables:
$ YEAR : int 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 ...
$ SAVINGS: int 12298 14196 17320 19995 23601 24213 26881 30896 33787 38091 ...
$ INCOME : int 64968 69233 73824 85267 91507 99632 123067 142181 157291 185749 ...
So we have 3 variables YEAR, SAVINGS and INCOME
since : savings = f (income)
output variable = SAVINGS
Input variable : INCOME
plot(income$INCOME, income$SAVINGS, xlab = "Income", ylab = "Savings", main = "Savings vs Income", col = 'red')
From the plot it is clear that we have a positive linear relationship between Income and savings and we can use linear regression to predict Savings given the Incomes.
linearmodel = lm(SAVINGS ~ INCOME, data = income)
summary(linearmodel)
Please Note the first argument to function is SAVINGS ~ INCOME . This argument is of type formula and is usually of the form
Dependent_Variable ~ Independent_Variables
Output of summary(linearmodel) :
Call:
lm(formula = SAVINGS ~ INCOME, data = income)

Residuals:
     Min       1Q   Median       3Q      Max
-13036.3  -4958.9    316.9   5368.1  16969.3

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.099e+04  2.459e+03  -4.469 0.000235 ***
INCOME       2.970e-01  6.012e-03  49.402  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7327 on 20 degrees of freedom
Multiple R-squared: 0.9919, Adjusted R-squared: 0.9915
F-statistic: 2441 on 1 and 20 DF, p-value: < 2.2e-16
The model that is generated for us is (the values in the Estimate column are the coefficients of our linear regression equation):
SAVINGS = –10990 + 0.297 * INCOME
Please notice that the p-value for INCOME, i.e. Pr(>|t|), is significant (i.e. less than 0.05) and hence the variable is significant in predicting SAVINGS. If we do not have a significant p-value for a variable we may choose to ignore that variable.
The next number that we have to be aware of is R-squared. In our case Adjusted R Squared is 0.9915, which implies that the model is able to explain about 99% of the variation in our data. The ideal R-squared value is domain specific, but as a rough rule of thumb a value above 70% is often taken to indicate a model that is good for prediction.
We will delve into details of R Squared , t value , residuals and F statistic in subsequent tutorial. For this discussion we can safely ignore them.
The model can be interpreted as: "When Income rises by 1 unit, the Savings rise by 0.297 units".
Now whenever we have any value of INCOME we can calculate SAVINGS using the equation:
SAVINGS = –10990 + 0.297 * INCOME
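The equation can be wrapped in a small helper to plug in income values. This is only a sketch: predict_savings is a made-up name, and the coefficients are the rounded estimates from the summary with a negative intercept (–10990), which is the sign consistent with the fitted value of about 8303 that predict() gives for the first row (INCOME = 64968) in the residual analysis tutorial of this series.

```r
# Rounded coefficients from summary(linearmodel); intercept is negative.
predict_savings <- function(income_value) {
  -10990 + 0.297 * income_value
}

# First observation of the data set: INCOME = 64968, actual SAVINGS = 12298.
predicted <- predict_savings(64968)
print(predicted)

# Difference between the actual and the predicted savings for that year.
print(12298 - predicted)
```

With the fitted model object available, predict(linearmodel, newdata = data.frame(INCOME = 64968)) does the same job using the unrounded coefficients.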
I sincerely hope you enjoyed the tutorial , please post your feedback and comments and share other articles on the site.
As a next step to analyzing model you should also go through Residual Analysis , look at Adjusted R Squared values and interpret F statistic.
Till next time Happy Learning.
Next in the series :
R Tutorial : Multiple Linear Regression
R Tutorial : Residual Analysis for Regression
R Tutorial : How to use Diagnostic Plots for Regression Models
]]>
Please go through the Tutorial on Concept of Linearity to understand the basic requirement of linear regression viz Linearity.
Let's consider a very simple data set where
Demand = f ( Price )
Price | Demand | Price | Demand
1 | 48 | 3 | 44
1 | 49 | 4 | 35
1 | 50 | 4 | 38
1 | 51 | 4 | 42
2 | 44 | 5 | 36
2 | 45 | 5 | 39
2 | 46 | 5 | 40
2 | 47 | 6 | 32
2 | 48 | 6 | 35
3 | 40 | 6 | 37
3 | 42 | 6 | 36
Using an Excel scatter plot we plot the points and then add a linear trendline, which is nothing but the line given by the best fit linear regression equation.
All the scatter plot points are the actual observed values of demand at a given price. When we created the linear regression line with equation y = –2.852x + 51.59, we essentially created a prediction of demand at each price point, and all our predictions lie on the line represented by the equation. Please note our regression equation is of the form
Ŷ = b1 + b2X
For example our prediction for price 4 is 40, whereas our 3 observations for price 4 have actual demands of 35, 38 and 42. This means that for every point which was observed, we incurred an error when we generated the prediction.
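The trendline and the prediction at price 4 can be reproduced in R from the table above. This is a minimal sketch; the vector names are made up, and the slope R computes may differ from the Excel trendline label in the last decimal.

```r
# Price and demand pairs from the table above (22 observations).
price  <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6)
demand <- c(48, 49, 50, 51, 44, 45, 46, 47, 48, 40, 42, 44,
            35, 38, 42, 36, 39, 40, 32, 35, 37, 36)

fit <- lm(demand ~ price)
print(coef(fit))   # intercept b1 and slope b2 of the fitted line

# Prediction at price 4, and the errors of the three observations there.
predicted_at_4 <- unname(predict(fit, newdata = data.frame(price = 4)))
errors_at_4 <- c(35, 38, 42) - predicted_at_4

print(predicted_at_4)
print(errors_at_4)
```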
Let us represent this error term by ei. Let's represent our actual demand as Yi for each i and our predicted demand for each i as Ŷi. So we can represent our actual values as:
Yi = Ŷi + ei : This can also be written as
Yi = b1 + b2Xi + ei
Or
ei = Yi – b1 – b2Xi ———————— ( I )
Now this is based on a limited, finite sample, so the key question is: can we find b1 and b2 such that our overall error is minimized? The technique for doing this is called Ordinary Least Squares (OLS).
So here is what we want to do
Minimize ∑ei² = ∑ ( Yi – Ŷi )² ———————( II )
Where : Yi = Actual Y value for the ith item
Ŷi = Predicted Y value for the ith item
Now we know from ( I ) above that ei = Yi – b1 – b2Xi, so substituting into ( II ) gives
∑ei² = ∑ ( Yi – b1 – b2Xi )² ——————–( III )
Hence ∑ei² = f(b1,b2)
So for a given set of data, different values of b1 and b2 will give rise to different ei values and thus a different ∑ei².
The OLS method is used to choose b1 and b2 in such a manner that we get the minimum ∑ei². The OLS method uses differential calculus to get b1 and b2: setting the partial derivatives of ∑ei² with respect to b1 and b2 to zero, the values of b1 and b2 that minimize ∑ei² are obtained by solving the following two simultaneous equations :
∑Yi = nb1 + b2 ∑Xi and
∑YiXi = b1 ∑Xi + b2 ∑Xi ²
These are called least Squares Normal Equations. Solving these for b1 and b2 we get –
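Solving the two normal equations gives the standard closed form b2 = ( n∑XiYi – ∑Xi ∑Yi ) / ( n∑Xi² – (∑Xi)² ) and b1 = Ȳ – b2 X̄, where X̄ and Ȳ are the sample means. The sketch below solves the normal equations numerically for the price and demand table above and checks the result against the closed form and against lm(); the vector and variable names are made up for this example.

```r
# Price (X) and demand (Y) from the table earlier in this tutorial.
price  <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6)
demand <- c(48, 49, 50, 51, 44, 45, 46, 47, 48, 40, 42, 44,
            35, 38, 42, 36, 39, 40, 32, 35, 37, 36)

n      <- length(price)
sum_x  <- sum(price)
sum_y  <- sum(demand)
sum_xy <- sum(price * demand)
sum_x2 <- sum(price^2)

# Normal equations in matrix form:
#   sum_y  = n     * b1 + sum_x  * b2
#   sum_xy = sum_x * b1 + sum_x2 * b2
A   <- matrix(c(n, sum_x, sum_x, sum_x2), nrow = 2, byrow = TRUE)
rhs <- c(sum_y, sum_xy)
b_solved <- solve(A, rhs)          # c(b1, b2)

# Closed form solution of the same two equations.
b2 <- (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x^2)
b1 <- mean(demand) - b2 * mean(price)

print(b_solved)
print(c(b1, b2))
print(coef(lm(demand ~ price)))    # lm() arrives at the same estimates
```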
Next in the series :
R Tutorial : Basic 2 variable Linear Regression
R Tutorial : Multiple Linear Regression
R Tutorial : Residual Analysis for Regression
R Tutorial : How to use Diagnostic Plots for Regression Models
Reference : Based on Lectures by Dr. Manish Sinha. ( Associate Prof. SCMHRD )
]]>
In linear regression the term linear can be understood in 2 ways: linearity in the explanatory variables and linearity in the parameters.
Linear regression however always means linearity in parameters, irrespective of linearity in explanatory variables.
A linear regression for 2 variables is represented mathematically as ( u is the error term )
Y = B1 + B2X + u Or
Y = B1 + B2X ² + u
Here the variable X can be non linear i.e X or X² and still we can consider this as a linear regression. However if our parameters are not linear i.e say the regression equation is
Y = B1² + B2²X + u
then this can not be said to represent a linear regression equation.
Model linear in parameters? | Model linear in variables? Yes | Model linear in variables? No
Yes | Linear Model | Linear Model
No | Non Linear Model | Non Linear Model
A function Y = f(X) is said to be linear in X if X appears with a power or index of 1 only (i.e. terms such as X², √X, and so on are excluded) and X is not multiplied or divided by any other variable.
Y is linearly related to X if the rate of change of Y with respect to X (dY/dX) is independent of the value of X.
A function is said to be linear in the parameter, say, B1, if B1 appears with a power of 1 only and is not multiplied or divided by any other parameter (for eg B1 x B2 , or B2 / B1)
To reiterate again – For purpose of Linear regression we are only concerned about linearity of parameters B1, B2 …. and not the actual variables X1, X2 ….
Example
For Log(Yi) = Log(B1) + B2 Log(Xi) + u
B2 enters linearly but B1 does not. However, if we define α = Log(B1), then the model
Log(Yi) = α + B2 Log(Xi) + u
is linear in the parameters α and B2. This implies that we can make the regression equation linear in parameters using a simple transformation.
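The transformation can be tried out in R. The sketch below simulates data from Y = B1 · X^B2 with multiplicative noise (the true values 2 and 1.5 are assumptions chosen for illustration), fits the log-log model with lm(), and converts the intercept back to B1 with exp().

```r
set.seed(1)
true_B1 <- 2
true_B2 <- 1.5

x <- seq(1, 100, length.out = 100)
# Multiplicative noise so that the error is additive after taking logs.
y <- true_B1 * x^true_B2 * exp(rnorm(100, sd = 0.05))

# Log(Y) = alpha + B2 * Log(X) + u is linear in alpha and B2.
fit <- lm(log(y) ~ log(x))

alpha_hat <- unname(coef(fit)[1])
B2_hat    <- unname(coef(fit)[2])
B1_hat    <- exp(alpha_hat)        # undo the transformation alpha = Log(B1)

print(c(B1 = B1_hat, B2 = B2_hat))
```

The model is estimated with ordinary linear regression even though Y is a nonlinear function of X, because after the transformation it is linear in the parameters.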
For other cases we may not have an easy way to transform parameters to their linear form and such equations are hence treated as intrinsically nonlinear and are NOT modeled using linear regression
Next in the series :
Tutorial : Linear Regression Construct
R Tutorial : Basic 2 variable Linear Regression
R Tutorial : Multiple Linear Regression
R Tutorial : Residual Analysis for Regression
R Tutorial : How to use Diagnostic Plots for Regression Models
R Tutorial : How to interpret F Statistic in Regression Models
Reference : Based on Lectures by Dr. Manish Sinha. ( Associate Prof. SCMHRD )
]]>