12 Step 2: Randomly Partition the Data into Training and Test Sections

In this step, we will perform a basic cross validation and set our randomization component. Cross validation occurs when you partition the data into sections or parts. To begin, we must partition that data into two parts: a training and a testing set. You want to fit your models on the training data, and then use the testing data to assess how well the model can predict categories based on new (unseen) data. If you skip this step, you will run into an issue known as data leakage. Data leakage occurs when data from outside the training data is used to build a predictive model. If you were to build the model on all the data and then use a subsection of that to test it, the results you get would not reflect how well that model would predict newly collected data. That model would be too reliant on the specific data you have instead of the truth of the relationship between the variables in the real world, also known as overfitting.

set.seed(1234567) # ensures randomized outputs are reproducible

train_indices <- sample(1:nrow(manydogs_transformed), 0.8 * nrow(manydogs_transformed)) #Grabs a random subset of the data. In this case we asked for 80% of the data to be sampled which is the standard
training_data <- manydogs_transformed[train_indices, ] #Grab the 80% of the data you just pulled out and call it training
testing_data <- manydogs_transformed[-train_indices, ] # Grab the other 20% not in the train indices and call it testing