13 Step 3: Repeated Cross Validation
Repeated cross validation is a little like Russian nesting dolls for data. It is also simpler than most people expect it to be. Repeated cross validation simply refers to breaking your data into smaller sections to get an average of how well a model does on different subsets of the data. That way you have an average value for whatever you are interested in - whether that be your loss/cost function or predictive accuracy. An average gives a more accurate estimate of outcomes than relying on one partition of the data alone. If you only cut the data once, the training partition you randomly selected could have a disproportionately high number of outliers, or a poor spread of observation values. Cross validation makes the randomization involved in training on a dataset less subject to chance.
The three main types of cross validation are k-fold, leave one out, and nested cross validation. You use them in the following instances:
- Leave one out cross validation (LOOCV) - best to use when your dataset is small (e.g., a few hundred observations).
- K-fold cross validation - usually best to use when your dataset is large (n > 1000). K stands for the number of folds (partitions) the data is split into and can be any number you like; 10-fold cross validation is the most common choice.
- Nested cross validation - best to use when you need to tune hyperparameters like k in KNN models or lambda in regularization techniques.
The value of 1000 observations as the cutoff between small and large isn’t a universal standard. What constitutes large or small means different things to different people depending on the field, the type of model you are running, and whether you need to tune hyperparameters. Yes, this is a vague answer for what is small and what is large. It’s also the most accurate answer. But, to help you out: in psychology, most experiments that don’t involve neuroimaging data from large consortia or genome-wide association studies are usually considered “small” in the data science world, while large language models can be trained on billions of observations.
Now, let’s break down these three types of cross validation to explain them further! All three options simply cut the data you are working with into smaller chunks. Depending on which cross validation technique you choose, the data is cut (partitioned) in slightly different ways.
Leave one out cross validation iteratively fits as many models as you have observations, each time training on all but one observation and then using the one observation left out to test the model. It does this until every observation has been used as the testing data. This is best for small datasets (n < 1000), because once you get up into the thousands of observations the computation required becomes prohibitively expensive and time consuming.
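To make that concrete, here is a minimal hand-rolled sketch of LOOCV in base R. The built-in mtcars data and the simple lm() model are stand-ins chosen for illustration only, not this tutorial's data or model:
errors <- numeric(nrow(mtcars)) # one error per observation
for (i in seq_len(nrow(mtcars))) {
  train <- mtcars[-i, ] # train on all but one observation
  test <- mtcars[i, , drop = FALSE] # the single held-out observation
  fit <- lm(mpg ~ wt + hp, data = train)
  errors[i] <- (predict(fit, newdata = test) - test$mpg)^2 # test on the row left out
}
mean(errors) # average squared error across all n held-out observations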
K-fold cross validation cuts the data into k sections, with k being any positive integer you like. The smallest value k can take is 2, and the largest is the number of observations - if your k is the number of observations, you have reinvented leave one out cross validation! (Fun, right?) Once the data is broken into k sections, k models are run iteratively until every fold has been used as the validation/testing set.
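Here is the same hand-rolled idea as a 5-fold cross validation, again using mtcars and lm() purely as stand-ins:
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars))) # randomly assign each row to a fold
fold_errors <- numeric(k)
for (j in 1:k) {
  train <- mtcars[folds != j, ] # k - 1 folds for training
  test <- mtcars[folds == j, ] # 1 fold held out for testing
  fit <- lm(mpg ~ wt + hp, data = train)
  fold_errors[j] <- mean((predict(fit, newdata = test) - test$mpg)^2)
}
mean(fold_errors) # average error across the k folds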
Nested cross validation runs two cross validations, one within the other (like our nesting dolls!). The outer loop runs exactly like a k-fold cross validation. The inner loop also runs like a k-fold cross validation, but instead of running on the entire dataset minus one fold, it runs within each training set of the outer loop. Each run of the inner loop uses a different set of hyperparameters, like the value of lambda in regularization or the k in a KNN model. When the inner loop concludes, it tells you the optimal hyperparameter - the value that gives the lowest loss/cost function. The outer loop then runs using that chosen hyperparameter to return the best performing model given your data.
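For a sense of what that could look like in mlr, here is one possible sketch. The iris dataset, the candidate k values, and the fold counts are placeholder choices for illustration, not this tutorial's actual setup:
library(mlr) # the KNN learner also needs the class package installed
task <- makeClassifTask(data = iris, target = "Species") # stand-in task
ps <- makeParamSet(makeDiscreteParam("k", values = c(3, 5, 7, 9))) # candidate k values for KNN
inner <- makeResampleDesc(method = "CV", iters = 5) # inner loop: picks the best k
outer <- makeResampleDesc(method = "CV", iters = 10) # outer loop: estimates performance
knn_tuned <- makeTuneWrapper("classif.knn", resampling = inner, par.set = ps, control = makeTuneControlGrid())
res <- resample(learner = knn_tuned, task = task, resampling = outer)
res$aggr # performance of the tuned model averaged over the outer folds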
There are a few other types of cross validation, but they are all variations on cutting things into parts and running small sections of the data through the model. The more computing power and time you have access to, the fancier you can make your cross validation procedure. I won’t go into detail on the more complicated cross validation practices, as the fancier ones do best on very large datasets in the millions, where all the cutting/partitioning can still be meaningful. If your data, like this tutorial’s example, has fewer than 1000 observations, increasing the complexity of your cross validation is unlikely to be meaningful, helpful, or worth the time and computing power.
For our purposes in this tutorial, our training data set only has 563 observations (which is considered small in machine learning), so we will be using leave one out cross validation (LOOCV). LOOCV is not the only or “perfect” method to use with this or your dataset. There is more than one way to skin a cat and many ways to partition a dataset.
There are multiple ways in R to code a cross validation. We will use the makeResampleDesc() function from the mlr package to define which cross validation procedure you are running. The mlr (machine learning in R) package came out in 2016 and includes a suite of tools to create machine learning models quickly and efficiently in R; it is the main package we will be using in this tutorial. All three questions we are investigating use the same size dataset, so we will use LOOCV for all three. I have also shown in a comment how you would define a 10-fold cross validation, as that is the most commonly used cross validation method. In the comment below, the 10-fold procedure is repeated 10 times with the reps argument, resulting in the model being run a hundred times.
loocv <- makeResampleDesc(method = "LOO") #define parameters for cross validation
#kfold_cv <- makeResampleDesc(method = "RepCV", folds = 10, reps = 10, stratify = TRUE) #If you want a different number of folds you can change folds to anything you like. If your number of folds is the same as your number of observations, then you have remade LOOCV!
Above is the only code you need for now. When you run your model, you will set the resampling argument to loocv. We will add loocv to our models when we actually run them, but to show you what it will look like, here is some dummy code:
#model <- resample(learner = knn, task = data, resampling = loocv)
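Once the real resample() call has been run, the result object stores both the averaged and the per-iteration performance. Using the same placeholder name as the dummy code above, you could peek at them like this:
#model$aggr #performance averaged over every left-out observation
#model$measures.test #per-iteration results (one row per held-out observation in LOOCV)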