This tutorial explains why test and validation sets matter for predictive analytics algorithms, and describes techniques that can be used to get a reliable measure of accuracy.

Please make sure you have gone through previous articles in the series –

Necessary – Introduction to Machine Learning and Machine Learning : Accuracy & Memorization vs Learning

Optional – R : Basic Data Analytics – Part 1 and R : Basic Data Analytics – Part 2

In supervised learning problems, another key question is: “Even if we know a measure to judge the accuracy of a model, how do we make sure that the model works well with future, unseen data?” We can measure the performance of a model in a variety of ways; accuracy, error measures, and the area under the ROC curve (AUC) are just a few of these. For ease of discussion, we will assume that the performance of a model is measured using accuracy, which is

Accuracy = (Number of correct predictions) / (Total number of predictions)
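As a minimal sketch of this formula (written in Python rather than the R used elsewhere in this series; the function name `accuracy` and the sample labels are made up for illustration), accuracy is just the share of predictions that match the known outcomes:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the known outcomes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one prediction")
    # Count the positions where the prediction equals the actual value.
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Illustrative data only: 3 of the 5 predictions agree with the actual values.
actual = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes"]
score = accuracy(actual, predicted)
```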

A natural way to estimate performance on future data is to divide the total sample data (for which the actual values of the outcome variable are known) into 2 sets, i.e. a Training set and a Test set. The Training set is used to create the model. Since the model has not seen the Test data, we can then run the model on each item in the Test set and get a prediction for it. Because we already know the actual values for the Test set, we can find the accuracy on the Test set and, to a large extent, say that we will get roughly similar accuracy on future input data as well. An important assumption behind this conclusion is that our Training set and Test set are representative of our future data, and that there is no change in prevalent business conditions.
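The split described above can be sketched as follows. This is an illustrative Python version using only the standard library (the series itself uses R); the helper name `train_test_split`, the 20% test fraction, and the seed are assumptions, not values from the article.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into (training set, test set)."""
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample of 100 rows, represented here by their row numbers.
rows = list(range(100))
train, test = train_test_split(rows)
```

Shuffling before splitting matters when the data is stored in some order (by date or by outcome, say); otherwise the Test set may not resemble the Training set.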

So far so good. However, most modelling techniques have parameters. For example, with a decision tree we can choose to generate the complete tree, which encompasses all combinations of input variables and the outcome for each of these combinations. (Say this tree has 20 levels.) The problem with becoming more and more specific is that this kind of model not only becomes extremely complicated but also overfits the training set. If we calculate the accuracy of predictions on the training set, we will get extremely high accuracy, but the model is likely to perform poorly on the Test set. To make our model generalize, we can decide not to use the full tree but only, say, the top 5, 7, or 10 levels. So in this case the number of levels in the tree-based model is a parameter.

Now the most important question is: how do we know which number of levels generalizes best, so that the model performs best on unseen data? Should we use 2 levels, or 3, or 15, or 20? The easiest solution is to create a model for every candidate number of levels and find the accuracy on our Test set for each of these models. Then we can choose the parameter value that gives us the best accuracy. Say the model performs best with 5 levels; we then choose the tree-based model with 5 levels. It all seems OK so far. But there is a major flaw in this approach.

By using our Test set to determine a parameter, we have unintentionally let the test data influence our model. This means that the Test data is no longer new data for our model, so the accuracy calculated on it is likely to be optimistic and may not represent the accuracy on future unknown data.

A solution to this problem is to divide our sample data into 3 parts:

1. Training Set
2. Validation Set
3. Test Set

The Training set will be used as before, but instead of using the Test set to determine parameters for our model, we will use the Validation set for that purpose. The Test set will be used only as a stand-in for future input data, solely to estimate the accuracy of the model in the real world.
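The three-way workflow can be sketched like this in Python. To keep the example short, a tiny k-nearest-neighbours classifier stands in for the decision tree, with `k` playing the role of the number of levels; the synthetic data, the 180/60/60 split, and the candidate values are all assumptions made for illustration.

```python
import random


def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy_of_knn(train, data, k):
    """Accuracy on `data` of a k-NN model built from `train`."""
    correct = sum(1 for x, y in data if knn_predict(train, x, k) == y)
    return correct / len(data)


# Synthetic data: the label is 1 when x > 0.5, with roughly 15% of labels flipped.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    data.append((x, y))

# The rows are already in random order, so slicing gives a random 3-way split.
train, validation, test = data[:180], data[180:240], data[240:]

# Choose the parameter using the Validation set only.
candidates = [1, 3, 5, 9, 15, 25]
val_scores = {k: accuracy_of_knn(train, validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: val_scores[k])

# The Test set is used exactly once, after the parameter has been fixed.
test_score = accuracy_of_knn(train, test, best_k)
```

The key point is the order of operations: the Test set plays no part in choosing `best_k`, so `test_score` remains an honest estimate for unseen data.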

Usually one of the following 2 techniques is used to arrive at the validation set:

1. Holdout (aka HoldBack)
2. K-Fold Cross Validation (aka Rotation Estimation)

**Holdout (aka HoldBack)**

This is a really simple way of creating a validation set. We simply hold back a small percentage (say around 10%) of the training data as the validation set and train the model using the rest. One problem with this method is that it is sensitive to the way the data is divided: any chance bias in the held-out validation set can have an adverse effect on the parameter we choose, and hence on our model.
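A holdout split might look like the sketch below (Python, standard library only; the helper name `holdout_split`, the row count, and the seeds are hypothetical). Calling it with two different seeds shows where the sensitivity comes from: each seed holds back a different set of rows, so the validation accuracy, and possibly the chosen parameter, can change from one split to the next.

```python
import random


def holdout_split(training_rows, holdout_fraction=0.1, seed=0):
    """Hold back a fraction of the training rows as a validation set."""
    shuffled = list(training_rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(round(len(shuffled) * holdout_fraction)))
    # First part is held back for validation, the rest is used to fit the model.
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical training set of 200 rows, represented by their row numbers.
training_rows = list(range(200))
fit_rows, validation_rows = holdout_split(training_rows, seed=1)

# A different seed will generally hold back a different 10% of the rows.
_, other_validation_rows = holdout_split(training_rows, seed=2)
```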

**K-Fold Cross Validation (aka Rotation Estimation)**

In K-fold cross validation, for a given parameter value, the Training set is randomly divided into K smaller sets of (roughly) equal size. One of these smaller sets is used as the validation set, and the other K-1 sets taken together are used for training the model. This process is repeated K times (K folds), so that each subset acts as the validation set exactly once. After all K subsets have acted as a validation set, the K measures of accuracy (or error) are averaged. This is a very popular technique for validation and parameter tuning because it is much less sensitive than a single holdout to biases in the choice of validation set.
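The procedure can be sketched in plain Python as follows. The helper names (`k_fold_indices`, `cross_val_score`), the toy data, and the "always predict the majority class" model are assumptions for illustration; in practice `fit` would build the real model (for example a tree with a given number of levels).

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, validation_idx) pairs, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def cross_val_score(rows, k, fit, score, seed=0):
    """Average the validation score over the K folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(rows), k, seed):
        model = fit([rows[j] for j in train_idx])
        scores.append(score(model, [rows[j] for j in val_idx]))
    return sum(scores) / len(scores)


# Stand-in model: always predict the most common label seen in training.
def fit_majority(train_rows):
    labels = [y for _, y in train_rows]
    return max(set(labels), key=labels.count)


def score_accuracy(model, rows):
    return sum(1 for _, y in rows if y == model) / len(rows)


# Toy data: 100 rows, of which 75 have label 1 and 25 have label 0.
rows = [(i, 1 if i % 4 else 0) for i in range(100)]
mean_accuracy = cross_val_score(rows, k=5, fit=fit_majority, score=score_accuracy)
```

For parameter tuning, `cross_val_score` would be called once per candidate parameter value, and the value with the best average score would be chosen.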

In a short follow-up post we will see how to divide our data into Test and Training sets using R. We will cover K-fold cross validation as part of the discussion of decision trees.

Till then happy learning.

Cheers!

