Cross Validation: Training, Validation and Testing Splits

Cross validation is a resampling technique that splits the data into training, validation, and testing sets, so that a model can be fit on one portion of the data and evaluated on another.

Splits in this context refer to splitting up the data. It is standard practice to split your data into Training, Validation, and (optionally, although highly recommended) Test splits. For instance, in a dataset of 100,000 images, we might select 90% for training, 9% for validation, and 1% for testing. But what do these splits mean, and how can I make them in code?
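
To make that concrete, here is a minimal sketch of one way to build that 90/9/1 split with scikit-learn's train_test_split; the random arrays below are just stand-ins for a real image dataset.

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((100_000, 32))            # stand-in features for 100,000 images
y = rng.integers(0, 10, size=100_000)    # stand-in labels

# First carve off the 90% training split, then divide the remaining 10%
# into 9% validation and 1% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, train_size=0.9, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.1, random_state=42)

print(len(X_train), len(X_valid), len(X_test))   # 90000 9000 1000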

Differences between Train, Valid, Test

Train/Training Splits

The training split is, as the name implies, the data the model is trained on: the model's parameters are learned from these examples and nothing else.

Valid/Validation Splits

The validation split is used to evaluate the model while it is being fit to the training data. Because the validation set is not used for training itself, it provides a roughly unbiased estimate of model fit, and it is often used to tweak hyperparameters during training and/or to implement early stopping (halting training when validation metrics stop improving, which helps prevent overfitting).
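
To illustrate the early-stopping idea, here is a rough sketch that watches the validation loss after each pass over the training data; the SGD classifier, the synthetic data, and the patience of 3 epochs are all illustrative choices, not requirements.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

model = SGDClassifier(loss="log_loss", random_state=42)
classes = np.unique(y_train)

best_loss, patience, bad_epochs = np.inf, 3, 0
for epoch in range(100):
    model.partial_fit(X_train, y_train, classes=classes)        # one pass over the training split
    val_loss = log_loss(y_valid, model.predict_proba(X_valid))  # check fit on the validation split
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:   # stop once validation loss has stopped improving
        print(f"stopping early at epoch {epoch}, best validation loss {best_loss:.4f}")
        break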

Test/Testing Splits

Test splits are not involved in training a model at all. As such, they are used to assess the generalizability of the model. For instance, after training a model, we use the test data to see how well the model predicts the target values for observations it has never seen. If the model was trained well, the predictions should be pretty close to the true test values; this suggests that the model has generalized well, that is, it is not overfitting too much.
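
As a sketch of keeping the test split completely out of the training loop, the example below (which assumes a k-nearest-neighbors classifier and synthetic data, purely for illustration) tunes a hyperparameter on the validation split and only touches the test split once, at the very end.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune the hyperparameter on the validation split only.
best_k, best_acc = None, 0.0
for k in (1, 5, 15):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_valid, y_valid)
    if acc > best_acc:
        best_k, best_acc = k, acc

# The test split is touched exactly once, at the very end, to estimate generalization.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, "test accuracy:", final_model.score(X_test, y_test))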

As mentioned, test splits are optional. While you do need training and validation sets to build models, test sets are optional indicators of how generalizable a model is. But maybe you don’t want a generalizable model; maybe you are just trying things out. Maybe you don’t plan on deploying this model. Maybe you are the only person who will see it. Maybe you know precisely the data that will be fed into the model, and generalizability isn’t an issue. In general, if you are going to deploy a model, you should use a test set, and it’s also good practice to split your data into train/valid/test right from the beginning of any model building.

Train/Valid/Test Ratios

The eternal question about splits is: what is the correct ratio of splits? Is it 70/25/5? 90/9/1?

There are no hard and fast rules about the correct split ratio. Oftentimes, the practical answer is whatever ratio gives you the desired outcome for your model. For instance, maybe you are developing critical software that must be 99.99% accurate in its predictions; finding a split ratio that reaches that bar is the goal, and the exact ratio you land on matters much less.

However, most models gain accuracy with larger training datasets. As such, a common ratio is something like 80/15/5 or 80/10/10. But if you have a dataset of millions of records, then maybe you are ok with a 60/35/5 split; a larger validation set gives a more reliable estimate for fine tuning the model at every training cycle (or epoch).

It is common practice to start with a training split somewhere between 70-90%, a validation split between 5-25%, and a test split between 5-25%.

Maybe we only have a dataset of 10,000 records. In this case, a larger training split such as 90/5/5 might be more appropriate, because it makes sure we use as much of the data as we can for training (having too few training samples can lead to underfitting).
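
As a quick sanity check on the arithmetic, a 90/5/5 split of a 10,000-record dataset works out as follows (the int() truncation is just one simple way to handle rounding):

n = 10_000
train_frac, valid_frac, test_frac = 0.90, 0.05, 0.05

n_train = int(n * train_frac)       # 9000 records for training
n_valid = int(n * valid_frac)       # 500 records for validation
n_test = n - n_train - n_valid      # 500 records for testing (whatever is left over)
print(n_train, n_valid, n_test)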

Make Sure Your Splits Are Random!

One very important thing to remember about splits is ensuring they are random. For instance, if you have an alphabetically ordered dataset of names and make an 80/15/5 split in order, your training split would get all the names from A to O, your validation split would get the names from O to X, and your test split would only get the Y and Z names. Be sure to shuffle your dataset before organizing it into splits, ensuring randomness across the splits.
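
Here is a small sketch of the problem and the fix, using a short made-up list of names; train_test_split shuffles by default, and setting random_state makes the shuffle reproducible.

from sklearn.model_selection import train_test_split

names = sorted(["Alice", "Bob", "Carol", "Dave", "Erin", "Frank", "Grace",
                "Heidi", "Ivan", "Judy", "Mallory", "Niaj", "Olivia", "Peggy",
                "Rupert", "Sybil", "Trent", "Victor", "Walter", "Zoe"])

# Taking the first 80% of an ordered dataset gives only the early-alphabet names.
naive_train = names[: int(0.8 * len(names))]

# A shuffled split gives every name a chance of landing in each split.
train, test = train_test_split(names, train_size=0.8, shuffle=True, random_state=42)

print(naive_train)
print(train)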

Challenges With Splits

Splits have two primary issues.

  1. The validation estimate of model performance can be highly variable, depending on which observations end up in which split. But how do you know which observations should go into which split?

  2. Since only the training split is used to fit the model, the model is trained on fewer observations than the full dataset contains. Models generally perform worse when trained on fewer observations, so the validation split’s error rate will tend to overestimate the error of a model fit on the entire dataset.

One way to address these challenges is to use leave-one-out cross validation (LOOCV) or k-fold cross validation (k-fold CV).
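
For a taste of what those look like in code, here is a brief sketch using scikit-learn's KFold and LeaveOneOut; the logistic regression model and the synthetic data are just stand-ins.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: every observation is used for validation exactly once,
# so the estimate no longer hinges on a single arbitrary split.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold accuracy:", kfold_scores.mean())

# LOOCV: the extreme case of k-fold where each fold holds a single observation.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo_scores.mean())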

How Can I Train On Unlabeled Data?

When we get neat, structured data that isn’t missing any values, we rejoice! But often, especially today, we get unstructured data with no labels, and lots of it. One of the great challenges is how to label data at scale. Here are a few ways to resolve your data labeling issue.

Manually Label It Yourself

The brute force method might make sense in some cases. If you label 10,000 data points and can then train a model on them that gets ideal test (read: not train or valid) metrics, then maybe this is a decent approach.

The challenge here is that it’s often expensive and time consuming to label data. Usually there are edge cases that are tough to figure out. And on top of all this, how do you know how many data points it will take before you have a model that can achieve acceptable metrics?

Use a Pretrained Model

In some cases, you could use a pretrained model to make predictions on data that you know is easier to predict, and use those predictions as labels to train a new model, OR you could build off of the pretrained model directly. However, in this situation, you would have to trust that the pretrained model’s labels are right.
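
As a rough sketch of that idea, the example below stands in for a "pretrained" model with a classifier fit on a small labeled subset, keeps only its most confident predictions as pseudo-labels, and trains a new model on the combined data; the 0.9 confidence threshold is an arbitrary illustrative choice.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5_000, random_state=1)
X_labeled, y_labeled = X[:500], y[:500]   # the small set we have labels for
X_unlabeled = X[500:]                     # pretend the rest has no labels

pretrained = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Keep only the "easier" predictions, where the model is confident.
proba = pretrained.predict_proba(X_unlabeled)
confident = proba.max(axis=1) >= 0.9
pseudo_labels = pretrained.classes_[proba.argmax(axis=1)[confident]]

# Train a new model on the original labels plus the pseudo-labels.
X_new = np.vstack([X_labeled, X_unlabeled[confident]])
y_new = np.concatenate([y_labeled, pseudo_labels])
new_model = LogisticRegression(max_iter=1000).fit(X_new, y_new)
print("pseudo-labeled examples used:", confident.sum())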

Unsupervised Learning

Unsupervised learning models can be built without labeled data. For example, clustering could be used to assign a label directly to each cluster. Once you have labeled data this way, you could train a supervised model on it. The key issue with this approach is: how well did the unsupervised approach do at categorizing/clustering/labeling the data? There might even be room to oscillate back and forth: use an unsupervised learning approach to label the data, train a supervised learning model, get new data to test both of them on and see how accurate they are, play them off each other, and so on.
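
Here is a small sketch of the first two steps of that loop: cluster the unlabeled data with k-means, treat each cluster id as a label, then train a supervised model on those labels; the choice of k-means, 3 clusters, and a random forest are all illustrative assumptions.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=2_000, centers=3, random_state=7)   # unlabeled data

# Step 1: cluster the unlabeled data and treat each cluster id as a label.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

# Step 2: train a supervised model on the cluster-derived labels and check how
# consistently it reproduces them on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, cluster_labels, test_size=0.2, random_state=7)
clf = RandomForestClassifier(random_state=7).fit(X_train, y_train)
print("agreement with cluster labels on held-out data:", clf.score(X_test, y_test))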

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These are code examples in the form of Jupyter notebooks running in a container that comes with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers, along with how to do so.
#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:train-valid-test

#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:train-valid-test

Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!