15  Running your models

Now, after all that explanation, we finally get to the good part: running the actual models!

In 2016 a package called mlr (machine learning in R) came out that provides a suite of tools for building machine learning models quickly and efficiently in R. This package is the only R package we will need for this section of the tutorial. The mlr package breaks building a machine learning model into three smaller steps (a minimal sketch putting the steps together follows the list):

  1. Define the task: This step consists of passing the data into an object and labeling what the outcome variable is.
  2. Define the learner/model: A learner defines the structure of the algorithm. In this step you define/list the model type you want to use (e.g., logistic regression, KNN, random forest) and the specific hyperparameters you want the model to use (e.g., what k to use in a KNN model). Learner is not a general term used throughout machine learning; it is specific to the mlr package. You can also think of this step as defining your model.
  3. Train the model with the appropriate cross validation approach: Put it all together along with the specified cross validation approach. At this step you create an object that takes the task, learner, and any other additional features and uses that information to output the trained model you can use to make predictions.
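
Before we apply these steps to our own data, here is a minimal sketch of what the three steps look like when strung together. The data frame my_data and its outcome column are placeholders, and k = 3 is just an arbitrary example value.

#1. Define the task (my_data and "outcome" are hypothetical placeholders)
task_sketch <- makeClassifTask(data = my_data, target = "outcome")

#2. Define the learner/model (here a KNN with an arbitrary k of 3)
learner_sketch <- makeLearner("classif.knn", par.vals = list("k" = 3))

#3. Train the model, and check it with a cross validation approach
model_sketch <- train(learner = learner_sketch, task = task_sketch)
cv_sketch <- makeResampleDesc("CV", iters = 5)
resample(learner = learner_sketch, task = task_sketch, resampling = cv_sketch)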

15.0.1 Creating the Task

To define the task we will use the makeClassifTask() function, which defines a classification task. If doing regression, you would use the function makeRegrTask(). You then pass the data to the data argument and the name of the outcome variable to the target argument. For our purposes, we will be defining three tasks because we have three research questions, each with slightly different data representing the different outcomes and predictors. Because we only want to pass the data that is absolutely necessary for each task, we have to make those data frames before we can use them to create the task.
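
As a quick hypothetical aside (we won't need it here), a regression task is built the same way, just with makeRegrTask() and a numeric outcome; the data frame and column name below are placeholders.

#Hypothetical example only: a regression task with a numeric outcome
task_regression_sketch <- makeRegrTask(data = some_data, target = "some_numeric_outcome")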

#Research Question #1
#extract only the two columns from the large pre-processed training dataset we made in step 1 that are needed for the research question: training score and the ostensive binary outcome variable
data_RQ_1 <- training_data |>  
  select(training_score, ostensive_binary) |>  
  mutate(ostensive_binary = as.factor(ostensive_binary)) 

#Define the Task for RQ #1
task_RQ_1 <- makeClassifTask(data = data_RQ_1, target = "ostensive_binary")

#Research Question #2
#extract only the two columns from the large pre-processed dataset that are needed for the research question: training score and the nonostensive binary outcome variable
data_RQ_2 <- training_data |> 
  select(training_score, nonostensive_binary) |> 
  mutate(nonostensive_binary = as.factor(nonostensive_binary)) 

#Define the task for RQ #2
task_RQ_2 <- makeClassifTask(data = data_RQ_2, target = "nonostensive_binary")

#Research Question #3
#select the predictor variables we will be using and the outcome variable of nonos_best
data_RQ_3_factor <- training_data |> 
  select(sex:miscellaneous_score, nonos_best) |> 
  mutate(across(c(sex, desexed, purebred, gaze_follow, nonos_best), as.factor))

data_RQ_3_numeric <- data_RQ_3_factor |> 
  mutate(across(c(sex, desexed, purebred, gaze_follow), as.numeric))

#Define the task for RQ #3
task_RQ_3_factor <- makeClassifTask(data = data_RQ_3_factor, target = "nonos_best")

task_RQ_3_numeric <- makeClassifTask(data = data_RQ_3_numeric, target = "nonos_best")

Now we have created all the tasks we will need for every model we run! That is why it is nice to define the data before you create the model: we save ourselves extra coding and make our code more readable.

15.0.2 Creating Models for Question 1

Since the learner defines the model we are using, we will need a learner for each model we plan to run. As we discussed in the previous section, we will be running three models for our first and second research questions and two models for our third.

To create a learner we need to use the makeLearner() function. We will pass to that function the type of algorithm we are using and whether it is classification, regression, or unsupervised. We will only be using classification in this tutorial since we want to classify things into groups, so all of our algorithms will start with classif.
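
If you are ever unsure which string to pass for an algorithm, mlr can list the learners it knows about. This is an optional aside, but something like the following should show the available classification learners (class is one of the columns in the data frame that listLearners() returns):

#Optional: browse the classification learners that mlr provides
available_learners <- listLearners("classif")
head(available_learners$class)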

#Creating the KNN learner for RQ1

knn_learner_RQ_1 <- makeLearner("classif.knn")

Now we have our learner, but it is missing something: the hyperparameter k. We need to tell the model how many adjacent data points we want it to use to classify the validation set. We could guess and just make it 3 because that is what we have seen in some examples, but let's let the math do it for us instead. We can use code to check a bunch of different values of k and then use the value that gives us the best prediction rates. Let's go!

#finding the best value of k

#define values to check. Best to check 1 through 10 to start.
knnParamSpace_RQ_1 <- makeParamSet(makeDiscreteParam("k", values = 1:10))
#tell the computer to search every value of k we offered (1 thru 10)
gridSearch_RQ_1 <- makeTuneControlGrid()
#cross validate for the k finding
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps =20)
#Use the tuneParams function and the task we made above to get the best k value
tunedk_RQ_1 <- suppressMessages({ 
  tuneParams("classif.knn", task = task_RQ_1, resampling = cvForTuning,
                          par.set = knnParamSpace_RQ_1, control = gridSearch_RQ_1, show.info = F)
  }) 

You will notice that the tunedk_RQ_1 code above is wrapped in the suppressMessages() function. This is not necessary when running models yourself and can be deleted from the code. I have added it throughout the tutorial so that many redundant messages aren't printed in the final document. You might want to keep this function in your code if you only want the absolutely necessary outputs.

Now let’s print the result of our tuning:

tunedk_RQ_1
Tune result:
Op. pars: k=9
mmce.test.mean=0.2925956

A k of 9 gave us the best mmce, or mean misclassification error. The mean misclassification error is a great way to check prediction accuracy, as it is simply the proportion of cases classified as the wrong class. So currently we have about 29% of our data being classified incorrectly. Let's add this k of 9 into our learner and keep going!

#Creating the KNN learner for RQ1
knn_learner_RQ_1 <- makeLearner("classif.knn", par.vals = list("k"=9))

Now we need to put it all together to train our model and then perform cross validation. Remember we already made our cross validation argument in the cross validation section: loocv <- makeResampleDesc(method = "LOO"). The resample() function runs the model multiple times with the data from each of the different cross validation sets. We just have to feed it the task, learner, and cross validation arguments we made earlier.

#Training the model while cross validating
KNN_model_RQ_1 <- suppressMessages({ train(learner = knn_learner_RQ_1, task = task_RQ_1)
  })

loocv <- makeResampleDesc(method = "LOO") #cross validation procedure

#Get a sense of how well the model will work in practice with cross validation
KNN_model_RQ_1_loocv <- suppressMessages({ 
  resample(learner = knn_learner_RQ_1, task = task_RQ_1, 
           resampling = loocv, measures = list(mmce, acc))  
  }) 

Now that we have made the model and cross validated it using leave-one-out cross validation, we can look at the loss/cost function to see how our model is performing. In this case our loss/cost functions are accuracy (acc) and mean misclassification error (mmce). The $aggr element gives you the two performance metrics I asked the model to pull out in the measures argument. A performance metric tells you how well the model predicted the correct category. As noted above, mmce is the mean misclassification error, or the proportion of incorrectly classified cases, and acc is the accuracy, or the proportion of correctly classified cases. You will notice that the two together add up to 1 (i.e., 100%), so you really only need one or the other.

#Performance
KNN_model_RQ_1_loocv$aggr
mmce.test.mean  acc.test.mean 
     0.2984014      0.7015986 
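
If you want to see the per-iteration values rather than just the aggregate (for LOOCV that is one row per held-out dog), the resample result stores those as well; here is a quick sketch using the object we just made:

#Per-iteration performance (one row per leave-one-out fold)
head(KNN_model_RQ_1_loocv$measures.test)

#Sanity check: accuracy is just 1 minus the mean misclassification error
1 - KNN_model_RQ_1_loocv$aggr["mmce.test.mean"]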

This model didn't perform particularly well: we are currently classifying about 70% of the data correctly. Let's run some other models and see if we can do better than this.

Next on our list is making a decision tree. The argument you will need to make a decision tree (NOT a random forest, just one tree) is classif.rpart.

#Creating the learner for a Single decision tree
decisiontree_learner_RQ_1_and_2 <- makeLearner("classif.rpart")

Now we can pass our learner to the resample function along with our cross-validation procedure and task.

#Training the model 
decision_tree_model_RQ_1 <- suppressMessages({ 
  train(learner = decisiontree_learner_RQ_1_and_2, 
        task = task_RQ_1)
  })

#Checking how well the model does using cross validating
decision_tree_model_RQ_1_loocv <- suppressMessages({ 
  resample(learner = decisiontree_learner_RQ_1_and_2, 
           task = task_RQ_1, resampling = loocv, 
           measures = list(mmce, acc))
  })

Now let’s check how well the decision tree model is working with the data by looking at our loss/cost function.

#Performance of the Decision Tree
decision_tree_model_RQ_1_loocv$aggr
mmce.test.mean  acc.test.mean 
     0.2841918      0.7158082 

The decision tree did a little better than the KNN model (70% vs 71.6%), but they currently look pretty similar. We won't see the final differences until they have been evaluated on new, unseen data in Step 5. Something to look forward to!

The third and last model we want to test with this first research question is an SVM model. The SVM is the most complex and computationally expensive of our models and has multiple hyperparameters that need tuning. The four hyperparameters that are most important to tune are kernel, cost, degree, and gamma. Each of the four hyperparameters changes the decision boundary slightly. A decision boundary is the hyperplane (often a line or squiggle) that separates the data into two classes.

  • kernel - The kernel defines what type of line you want the algorithm to fit to the data. There are four possible options: a line, a polynomial, a radial (circle), or a sigmoid. You only need to check the last three because a one-degree polynomial is the same as a line, and if you ask the computer to calculate both you are just wasting time and computational power.
  • cost - This controls how tolerant you will be of misclassifications - i.e., how hard or soft the margin of the decision boundary is. A soft margin (smaller cost value) allows misclassifications; a hard margin (larger cost value) allows few or none.
  • degree - This controls how "bendy" the decision boundary is. Higher values mean more bendy. The bendier/squigglier the line, the more closely it fits the training data and usually the less it generalizes to new, unseen data. See the regularization section on overfitting/underfitting.
  • gamma - This controls how much influence the algorithm gives to individual data points. Large values of gamma mean more intricate decision boundaries (i.e., a squigglier line that chases every point instead of generalizing with a smoother one), which makes overfitting more likely.

Now we will define the hyperparameters that we want to tune with the same functions we used to define k up above.

#Create a vector listing the possible kernels to check
kernels <- c("polynomial", "radial", "sigmoid")

svm_parametercheck <- makeParamSet(makeDiscreteParam("kernel", values = kernels),
                                   makeIntegerParam("degree", lower = 1, upper = 3),
                                   makeNumericParam("cost", lower = 0.1, upper = 10),
                                   makeNumericParam("gamma", lower = 0.1, upper = 10))

Now that we have defined what we want the computer to check, we have to tell it how to search through the values. Previously, in the KNN model above, we just had the computer search every value between 1 and 10 with makeTuneControlGrid(), but if we had the computer look through every value we specified above, we would be waiting a very long time. Instead we will ask the computer to randomly search through a smaller subsection of the values with the makeTuneControlRandom() function. You should search through as many as you have computing power for. Where to start depends on how many values you are searching through. Degree can only take 3 values, so if I were only tuning that hyperparameter, 3 iterations would be the same as searching through every value. However, cost and gamma could be any value between 0.1 and 10, which is hundreds of possible values even if we only go to the hundredths place. Start with a number that would reasonably check at least 5% of the values and then run it multiple times. If you consistently get the same values each time you run it, you probably have a high enough maximum iteration value (defined below under maxit). If you keep getting very different values each time you run it, increase the maximum iteration value until you get consistency in your results.

Since this is just an example, though, we will choose a small number so it runs quickly. Let's do that now!

randomsearch <- makeTuneControlRandom(maxit = 50)

Next, we will tell our computer to run code using multiple cores (i.e., CPUs) at once. This helps speed up the computation by using more of the computer. While the myth that humans only use 10% of their brains is false, computers really don't use all of their brain unless you tell them to. Check how many cores your computer has with parallel::detectCores(). My machine has 24 CPUs (as you can see in the tuning output further below), so we'll use all of them with the parallelStartSocket() function. Note that if you use all of your cores, your computer can't really do anything else while it is working, so if you want to check email, browse the web, or keep coding during the computation, leave a core or two out.
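
For example, if you wanted to keep two cores free so the computer stays usable, a sketch like this would do it (below we will go ahead and use all of the cores instead):

#Optional alternative: leave two cores free for other work while tuning runs
parallelStartSocket(cpus = parallel::detectCores() - 2)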

We also need a cross validation set for this search. We could use the cvForTuning we made earlier with 10 folds and 20 repetitions of each, but that would create 200 iterations. With that many iterations cvForTuning is likely to find the best hyperparameters for our question - but it would take more than 30 minutes to complete on my 24 core machine. Instead, we will make a new cross validation approach called cvForTuning2 with only 5 folds and 5 repetitions creating 25 iterations. This will only take 6 minutes.

Throughout this tutorial you will notice that there are many instances where we have to iterate through multiple options per problem. Over time, working with different code and algorithms, you will develop an intuition for the optimal number of iterative steps for different problems. More is always better - except for the costs of time and computational power, and you will often run into diminishing returns on very large cross validation steps. If it is crucial that you get something as correct as possible, you might have to run your computer for a whole week. If you had unlimited computational power, the best option would be to run millions of iterations to be sure you have exactly the right set of hyperparameters. However, one of the primary roles of the data scientist in a non-infinite world is to determine the optimal trade-off among accuracy, time, and computational power. Throughout this tutorial I am trying to give you a sense of that intuition and to point out the decision points to pay attention to as you gain experience, so you can be more transparent - to yourself, as well as to others - when documenting your choices at those decision points. You will not be perfect at this skill overnight; it will develop over time as you become more familiar with machine learning.

#use all of the cpus on your computer
parallelStartSocket(cpus = detectCores())

#cross validation procedure for tuning this SVM algorithm
cvForTuning2 <- makeResampleDesc("RepCV", folds = 5, reps =5)

#Tune the hyperparameters
tunedSvmPars_RQ1 <- suppressMessages({ 
  tuneParams("classif.svm", task = task_RQ_1, 
             resampling = cvForTuning2, 
             par.set = svm_parametercheck, 
             control = randomsearch)
})
tunedSvmPars_RQ1
Tune result:
Op. pars: kernel=polynomial; degree=3; cost=5.74; gamma=9.28
mmce.test.mean=0.2842099

Now we have the answer to what some optimal hyperparameter values will be for the SVM for this task! However, if you run the above code multiple times you will notice that you don’t get the same results. This is because it isn’t searching through every possible option; rather, it tells you the best option from the random hyperparameters it happened to search through this time. This suggests that if this model appears to perform among the best compared to the other models you’ve evaluated, it’s worthwhile to revisit this step and conduct additional cross-validation iterations before settling on the final model hyperparameters.
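
One rough way to check how stable the random search is before committing is simply to rerun the tuning a few times and compare which hyperparameters get chosen; a sketch of that idea is below (it will take several times as long as a single run, and it only uses the objects we created above):

#Hypothetical stability check: rerun the random search a few times and
#compare the chosen hyperparameters across runs
rerun_results <- lapply(1:3, function(i) {
  tuneParams("classif.svm", task = task_RQ_1,
             resampling = cvForTuning2,
             par.set = svm_parametercheck,
             control = randomsearch, show.info = FALSE)$x
})
rerun_results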

I ran the above code 3 times, and it always chose a polynomial for the kernel function, so we could fix the kernel at polynomial and then just re-tune the other three hyperparameters. Across the 3 runs, the algorithm chose a degree of 2 twice and 3 once. The cost and gamma values varied more widely, with cost ranging from 1.99 to 9.28 and gamma ranging from 3.93 to 9.28.

Let’s put the most recent hyperparameters into the learner and then run the model with our cross validation approach!

#Make the SVM Learner
SVM_learner_tuned_RQ_1 <- makeLearner("classif.svm", par.vals = tunedSvmPars_RQ1$x)

#Train the model
SVM_model_RQ_1 <- suppressMessages({ 
  train(learner = SVM_learner_tuned_RQ_1, task = task_RQ_1)
  })

#Check model performance with LOOCV 
SVM_model_RQ_1_loocv <- suppressMessages({ 
  resample(learner = SVM_learner_tuned_RQ_1, 
           task = task_RQ_1, 
           resampling = loocv, 
           measures = list(mmce, acc))
  })

#Output the Lost/Cost functions for this model
SVM_model_RQ_1_loocv$aggr
mmce.test.mean  acc.test.mean 
     0.2824156      0.7175844 

The model and cross validation approach took about 6 and a half minutes to run on my 24-CPU computer using all of the cores in parallel. This model worked about as well as the decision tree, and a little better than our KNN. If we took more time to run a wider range of hyperparameters for degree, cost, and gamma, the accuracy of the final model would most likely increase. If you are working through this with me, why don't you create a cross validation procedure with many more iterations, run it, go make lunch, and then come back and rerun the model code. Did you get a higher accuracy rating? (Also, how was your lunch?)
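
If you do want to try that exercise, a sketch of a heavier tuning run might look like the following; the exact numbers are illustrative, so scale them to your patience and your machine:

#A heavier (and much slower) tuning run for the SVM - the numbers are illustrative
randomsearch_big <- makeTuneControlRandom(maxit = 250)
cvForTuning_big <- makeResampleDesc("RepCV", folds = 10, reps = 10)

tunedSvmPars_RQ1_big <- tuneParams("classif.svm", task = task_RQ_1,
                                   resampling = cvForTuning_big,
                                   par.set = svm_parametercheck,
                                   control = randomsearch_big, show.info = FALSE)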

15.0.3 Creating Models for Question 2

This section will be very similar to the previous one, as the two research questions have similar constraints and the same predictor variable. The purpose of this section is to show you how to make very small changes in what you feed the model in order to change the outcome.

Let’s jump straight into finding the correct value of k that minimizes misclassifications for the KNN model for research question 2. The only thing that we have to change is the task we feed the tuneParams function.

#finding the best value of k

#define values to check. Best to start with checking 1 through 10
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))
#tell the computer to search every value of k we offered (1 thru 10)
gridSearch <- makeTuneControlGrid()
#cross validate for the k finding
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps =20)
#Use the tuneParams function and the task we made above to get the best k value
tunedk_RQ_2 <- suppressMessages({ 
  tuneParams("classif.knn",
             task = task_RQ_2, 
             resampling = cvForTuning, 
             par.set = knnParamSpace, 
             control = gridSearch)
})

tunedk_RQ_2
Tune result:
Op. pars: k=8
mmce.test.mean=0.2344251

The best k for this research question was 8! Let's update the learner to include this new hyperparameter setting.

knn_learner_RQ_2 <- makeLearner("classif.knn", par.vals = list("k"=8))

Now we can run our model with the new learner, task, and cross validation.

#Running the KNN model for RQ 2
KNN_model_RQ_2 <- suppressMessages({ 
  train(learner = knn_learner_RQ_2, task = task_RQ_2)
})

#Running cross validating
KNN_model_RQ_2_loocv <- suppressMessages({ 
  resample(learner = knn_learner_RQ_2, 
           task = task_RQ_2, 
           resampling = loocv, 
           measures = list(mmce, acc))
})
#Looking at how this model did
KNN_model_RQ_2_loocv$aggr
mmce.test.mean  acc.test.mean 
     0.2344583      0.7655417 

Interestingly, this model worked better than the previous KNN model with research question 1 (70% vs 77%). So, training score must be slightly more predictive for the nonostensive task.

For the decision tree model, we can reuse the learner from research question 1, since we didn't have to do any tuning. Therefore, we can simply run the decision tree model immediately with the pre-made learner, the new task, and the cross validation procedure.

#Training the Decision Tree model for RQ #2
decision_tree_model_RQ_2 <- suppressMessages({ 
train(learner = decisiontree_learner_RQ_1_and_2, 
      task = task_RQ_2)
  })
                                     
#Cross validating
decision_tree_model_RQ_2_loocv <- suppressMessages({ 
  resample(learner = decisiontree_learner_RQ_1_and_2, 
           task = task_RQ_2, 
           resampling = loocv, 
           measures = list(acc))
})
#Performance of the Decision Tree
decision_tree_model_RQ_2_loocv$aggr
acc.test.mean 
    0.7655417 

This decision tree had the same accuracy as the KNN model for this research question as far as we can tell from the proportion of correct classifications.

Now we have to run the SVM and tune it with the new hyperparameters that work best on this task. Most of what we defined above will work with this algorithm, so we don’t need to set new hyperparameters or cv values - we just have to feed the new data to what we already made!

#Tuning hyperparameters for SVM model
tunedSvmPars_RQ2 <- tuneParams("classif.svm", task = task_RQ_2, 
                               resampling = cvForTuning2, 
                               par.set = svm_parametercheck, 
                               control = randomsearch)
[Tune] Started tuning learner classif.svm for parameter set:
           Type len Def                    Constr Req Tunable Trafo
kernel discrete   -   - polynomial,radial,sigmoid   -    TRUE     -
degree  integer   -   -                    1 to 3   -    TRUE     -
cost    numeric   -   -                 0.1 to 10   -    TRUE     -
gamma   numeric   -   -                 0.1 to 10   -    TRUE     -
With control class: TuneControlRandom
Imputation value: 1
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; level = mlr.tuneParams; cpus = 24; elements = 50.
[Tune] Result: kernel=polynomial; degree=3; cost=8.28; gamma=2.64 : mmce.test.mean=0.2344627
#show me what hyperparameters worked best
tunedSvmPars_RQ2

#update learner with the chosen hyperparameters
SVM_learner_tuned_RQ_2 <- makeLearner("classif.svm", par.vals = tunedSvmPars_RQ2$x)

#Run the model  
SVM_model_RQ_2 <- suppressMessages({ 
  train(learner = SVM_learner_tuned_RQ_2, task = task_RQ_2)
})

#Run Cross Validation
SVM_model_RQ_2_loocv <- suppressMessages({ 
  resample(learner = SVM_learner_tuned_RQ_2, 
           task = task_RQ_2, 
           resampling = loocv, 
           measures = list(mmce, acc))
})
#performance
SVM_model_RQ_2_loocv$aggr
mmce.test.mean  acc.test.mean 
     0.2344583      0.7655417 

This model was also about the same as the decision tree model on accuracy.

15.0.4 Creating Models for Question 3

The two models we decided fit our model assumptions and data for research question 3 are a KNN model and a random forest model. We will start by tuning the k hyperparameter for our KNN model. As this question has many more predictors than our first two, we will investigate a wider range of k values. If the k that is chosen is at the top of your range (not somewhere in the middle), you haven't searched through a wide enough range of values.

#finding the best value of k

#define values to check. This time we check a wider range: 1 through 50
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:50))
#tell the computer to search every value of k we offered (1 thru 50)
gridSearch <- makeTuneControlGrid()
#cross validate for the k finding
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 50)
#Use the tuneParams function and the task we made above to get the best k value
tunedk_RQ_3 <- tuneParams("classif.knn", task = task_RQ_3_numeric, 
                          resampling = cvForTuning, par.set = knnParamSpace, 
                          control = gridSearch)
[Tune] Started tuning learner classif.knn for parameter set:
      Type len Def                                   Constr Req Tunable Trafo
k discrete   -   - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,1...   -    TRUE     -
With control class: TuneControlGrid
Imputation value: 1
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; level = mlr.tuneParams; cpus = 24; elements = 50.
[Tune] Result: k=42 : mmce.test.mean=0.2433409
tunedk_RQ_3
Tune result:
Op. pars: k=42
mmce.test.mean=0.2433409

In the run shown above the best value was k = 42 (a different run may land on a slightly different k), so rather than hard-coding a number we will pass the tuned value straight into our learner.

#Creating the KNN learner for RQ3 with the tuned k
knn_learner_RQ_3 <- makeLearner("classif.knn", par.vals = tunedk_RQ_3$x)

Now we run the learner with the task and cross validation and get our prediction accuracy.

#Training the model 
KNN_model_RQ_3 <- suppressMessages({ 
  train(learner = knn_learner_RQ_3, task = task_RQ_3_numeric)
})

#cross validating
KNN_model_RQ_3_loocv <- suppressMessages({ 
  resample(learner = knn_learner_RQ_3, 
           task = task_RQ_3_numeric, 
           resampling = loocv, 
           measures = list(acc))
})

#Performance
KNN_model_RQ_3_loocv$aggr
acc.test.mean 
    0.7566607 

The KNN model for research question 3 did about as well as the models trained on fewer predictors (76%).

Next is the random forest algorithm. We can first just make a learner with the randomForest model type.

randomforest_learner_RQ_3 <- makeLearner("classif.randomForest")

Like the SVM, random forest algorithms also have a number of hyperparameters that need to be investigated and tuned to the right values before we run the model. In a random forest, the four hyperparameters that should be tuned as best practice before running the model are: ntree, mtry, nodesize, and maxnodes. (A sketch tuning all four at once follows the list, although below we will only tune mtry.)

  • ntree - the number of trees you want the model to run
  • mtry - the number of features/predictors to sample at each node
  • nodesize - minimum number of observations allowed in a leaf
  • maxnodes - max number of leaves
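
If you did want to tune all four at once, a parameter set might look like the sketch below; the ranges are only illustrative guesses that you would adapt to your own data.

#Illustrative only: a parameter set covering all four random forest hyperparameters
forestParamSpace_full <- makeParamSet(
  makeIntegerParam("ntree", lower = 250, upper = 750),
  makeIntegerParam("mtry", lower = 1, upper = 12),
  makeIntegerParam("nodesize", lower = 1, upper = 10),
  makeIntegerParam("maxnodes", lower = 5, upper = 30))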

We will let the ntree and nodesize values stay at their defaults of 500 and 1 respectively, as we don't have a particularly large or small amount of data that would suggest changing them. These can always be adjusted later, after you have tuned everything else. Before we decide what values to choose for mtry and maxnodes, this is a good time to show you how to check what hyperparameters there are for any specific algorithm. The function getParamSet() lists the parameters and, where applicable, their default values.

getParamSet("classif.randomForest")
                     Type  len   Def   Constr Req Tunable Trafo
ntree             integer    -   500 1 to Inf   -    TRUE     -
mtry              integer    -     - 1 to Inf   -    TRUE     -
replace           logical    -  TRUE        -   -    TRUE     -
classwt     numericvector <NA>     - 0 to Inf   -    TRUE     -
cutoff      numericvector <NA>     -   0 to 1   -    TRUE     -
strata            untyped    -     -        -   -   FALSE     -
sampsize    integervector <NA>     - 1 to Inf   -    TRUE     -
nodesize          integer    -     1 1 to Inf   -    TRUE     -
maxnodes          integer    -     - 1 to Inf   -    TRUE     -
importance        logical    - FALSE        -   -    TRUE     -
localImp          logical    - FALSE        -   -    TRUE     -
proximity         logical    - FALSE        -   -   FALSE     -
oob.prox          logical    -     -        -   Y   FALSE     -
norm.votes        logical    -  TRUE        -   -   FALSE     -
do.trace          logical    - FALSE        -   -   FALSE     -
keep.forest       logical    -  TRUE        -   -   FALSE     -
keep.inbag        logical    - FALSE        -   -   FALSE     -

Now that we know how to look at the tunable parameters, let's go back to deciding what values to use when tuning mtry and maxnodes. We only have 12 predictors, so let's use the full range of 1 to 12 for the number of features to sample at each node. For maxnodes we would have to choose values less than the number of observations, nrow(training_data). For now, let's leave it at NULL so the trees can grow without restriction, because we aren't as worried about time and computing power with a dataset this small. That leaves only one parameter to tune, which we will do below:

forestParamSpace <- makeParamSet(makeIntegerParam("mtry", lower = 1, upper = 12))

randomsearch_100 <- makeTuneControlRandom(maxit = 100)

cvForTuning_randomforest <- makeResampleDesc("CV", iters = 5)

tunedForestPars <- suppressMessages({ 
  tuneParams(learner = randomforest_learner_RQ_3, 
             task = task_RQ_3_factor, 
             resampling = cvForTuning_randomforest,
             par.set = forestParamSpace, 
             control = randomsearch_100)
})
#Show the result
tunedForestPars
Tune result:
Op. pars: mtry=1
mmce.test.mean=0.2415613

And the result was that 1 was the best value to use for mtry. Now we have to update the learner with the tuned parameter, plug everything into the train and resample functions, and make the model.

#Update the learner
randomforest_learner_RQ_3 <- makeLearner("classif.randomForest", par.vals = list("mtry"=1))

#Train the Model
random_forest_model_RQ_3 <- suppressMessages({ 
  train(learner = randomforest_learner_RQ_3, task = task_RQ_3_factor)
})

#Cross validating
random_forest_model_RQ_3_loocv <- suppressMessages({ 
  resample(learner = randomforest_learner_RQ_3, 
           task = task_RQ_3_factor, 
           resampling = loocv, 
           measures = list(mmce, acc))
})
#Performance
random_forest_model_RQ_3_loocv$aggr
mmce.test.mean  acc.test.mean 
     0.2433393      0.7566607 

Both the random forest model and the KNN model performed the same when they were cross validated.
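
As an optional check, you can pull the cross validated accuracies of the two models into one small table and compare them side by side; this sketch only uses objects we created above.

#Side-by-side comparison of the two cross validated models for RQ 3
data.frame(model = c("KNN", "Random forest"),
           accuracy = c(KNN_model_RQ_3_loocv$aggr["acc.test.mean"],
                        random_forest_model_RQ_3_loocv$aggr["acc.test.mean"]))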

15.1 Generic Logistic Regression Example

While the current data did not call for us to use logistic regression, it is still a useful algorithm to know how to use for many potential problems. So, here is code that you can use to run your own logistic regression.

We will use the task we created for research question 1. We will have to make a new learner to define our model, using the classif.logreg argument for logistic regression. Set the argument predict.type equal to "prob" if you want the estimated probabilities of each class output, as well as the class each observation is predicted to be in. Then all you have to do is put it together and cross validate:

LR_learner_RQ_1 <- makeLearner("classif.logreg", predict.type = "prob")

LR_model_RQ_1 <- suppressMessages({ 
  train(LR_learner_RQ_1, task_RQ_1)
})

LR_model_RQ_1_loocv <- suppressMessages({ 
  resample(learner = LR_learner_RQ_1, 
                       task = task_RQ_1,
                       resampling = loocv)
})

LR_model_RQ_1_loocv$aggr
mmce.test.mean 
     0.2841918 

As predicted, this model did NOT do a good job at predicting the correct category for our dogs.
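
Because we set predict.type = "prob", you can also pull out the estimated class probabilities rather than just the predicted classes; here is a quick sketch using the trained model from above (predicting back onto the training task purely for illustration):

#Estimated class probabilities from the logistic regression model
LR_predictions <- predict(LR_model_RQ_1, task = task_RQ_1)
head(getPredictionProbabilities(LR_predictions))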

15.2 Generic Naive Bayes Example

As with logistic regression, we had decided not to use Naive Bayes for this example dataset, but having some base code may be handy for those cases where it would be appropriate. Naive Bayes algorithms only have one hyperparameter to tune, laplace. You typically want to set laplace to a small value; it simply adds that value to all of the feature counts so that feature values never seen with a given class in the training data don't end up with a probability of zero. Those unseen feature values (0s) are instead assigned a small, non-zero probability, which prevents a single zero from wiping out the entire class probability (as anything multiplied by 0 is 0) and improves the robustness of the Naive Bayes classifier.
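
To make the laplace idea concrete, here is a tiny hand-worked sketch of the smoothed conditional probability; the counts are made up.

#Made-up counts: a feature value never seen together with class "yes" in training
count_feature_given_yes <- 0
total_yes <- 40
n_feature_levels <- 2
laplace <- 1

#Without smoothing the probability is 0, which zeroes out the whole product
count_feature_given_yes / total_yes

#With Laplace smoothing it becomes a small, non-zero probability
(count_feature_given_yes + laplace) / (total_yes + laplace * n_feature_levels)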

We will again use the task_RQ_1 task; we just have to make the learner, train the model, and cross validate.

NB_learner_RQ_1 <- makeLearner("classif.naiveBayes", par.vals = list("laplace" = 1))
  
NB_model_RQ_1 <- train(learner = NB_learner_RQ_1, 
                       task = task_RQ_1)

NB_model_RQ_1_loocv <- suppressMessages({ 
  resample(learner = NB_learner_RQ_1, 
                       task = task_RQ_1,
                       resampling = loocv)
})

NB_model_RQ_1_loocv$aggr
mmce.test.mean 
     0.2841918 

You'll notice that we are at about 71.6% accuracy, which is less predictive than most of our models. But this and all of our other models haven't been tested with new data - let's do that now!