#Research Question #1
#extract only the two columns from the large pre-processed training dataset we made in step 1 that are needed for the research question: training score and the ostensive binary outcome variable
data_RQ_1 <- training_data |>
  select(training_score, ostensive_binary) |>
  mutate(ostensive_binary = as.factor(ostensive_binary))

#Define the Task for RQ #1
task_RQ_1 <- makeClassifTask(data = data_RQ_1, target = "ostensive_binary")

#Research Question #2
#extract only the two columns from the large pre-processed dataset that are needed for the research question: training score and the nonostensive binary outcome variable
data_RQ_2 <- training_data |>
  select(training_score, nonostensive_binary) |>
  mutate(nonostensive_binary = as.factor(nonostensive_binary))

#Define the task for RQ #2
task_RQ_2 <- makeClassifTask(data = data_RQ_2, target = "nonostensive_binary")
#Research Question #3
#select the predictor variables we will be using and the outcome variable of nonos_best
data_RQ_3_factor <- training_data |>
  select(sex:miscellaneous_score, nonos_best) |>
  mutate(across(c(sex, desexed, purebred, gaze_follow, nonos_best), as.factor))

data_RQ_3_numeric <- data_RQ_3_factor |>
  mutate(across(c(sex, desexed, purebred, gaze_follow), as.numeric))

#Define the task for RQ #3
task_RQ_3_factor <- makeClassifTask(data = data_RQ_3_factor, target = "nonos_best")
task_RQ_3_numeric <- makeClassifTask(data = data_RQ_3_numeric, target = "nonos_best")
15 Running your models
Now, after all that explanation, we finally get to the good part: running the actual models!
In 2016 a package came out called mlr (machine learning in R) that provides a suite of tools for building machine learning models quickly and efficiently in R. This package is the only R package we will need for this section of the tutorial. The mlr package breaks creating machine learning models into three smaller steps:
- Define the task: This step consists of passing the data into an object and labeling what the outcome variable is.
- Define the learner/model: A learner defines the structure of the algorithm. In this step you define the model type you want to use (e.g., logistic regression, KNN, random forest) and the specific hyperparameters you want the model to use (e.g., what k to use in a KNN model). Learner is not a general term used throughout machine learning; it is specific to the mlr package. You can also think of this step as defining your model.
- Train the model with the appropriate cross validation approach: Put it all together along with the specified cross validation approach. At this step you create an object that takes the task, learner, and any other additional features and uses that information to output the trained model you can use to make predictions. A minimal end-to-end sketch of these three steps is shown just after this list.
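Here is that sketch, run on R's built-in iris data (iris and the toy_ object names are just stand-ins for illustration, not our dog data):

library(mlr)

#1. Define the task: the data plus the name of the outcome column
toy_task <- makeClassifTask(data = iris, target = "Species")

#2. Define the learner: the algorithm plus any hyperparameters
toy_learner <- makeLearner("classif.knn", par.vals = list(k = 3))

#3. Train, and cross validate to see how it might perform on new data
toy_model <- train(learner = toy_learner, task = toy_task)
toy_cv <- makeResampleDesc("CV", iters = 5)
toy_result <- resample(learner = toy_learner, task = toy_task,
                       resampling = toy_cv, measures = list(mmce, acc))
toy_result$aggr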
15.0.1 Creating the Task
To define the task we will use the makeClassifTask() function, which defines a classification task. If doing regression tasks, you would use the function makeRegrTask(). You then pass the data to the data argument and the name of the outcome variable to the target argument. For our purposes, we will be defining three tasks because we have three research questions, each with slightly different data representing the different outcomes and predictors. Because we only want to pass the data that is absolutely necessary for each task, we have to make those data frames before we can use them to create the task.
Now we have created all the tasks we will need for every model we run! That is why it is nice to define the data before you create the model: we save ourselves extra coding and make our code more readable.
15.0.2 Creating Models for Question 1
Since the learner defines the model we are using, we will need a learner for each model we plan to run. As we discussed in the previous section, we will be running three models for our first and second research questions and two models for our third.
To create a learner we use the makeLearner() function. We pass to that function what type of algorithm we are using and whether it is classification, regression, or unsupervised. We will only be using classification for this tutorial since we want to classify things into groups, so all of our algorithms will start with classif.
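If you are ever unsure what the classif. name for an algorithm is, mlr can list the available learners for you. This is purely an optional lookup, assuming mlr is loaded:

#List every classification learner mlr knows about, and the package each one requires
listLearners("classif")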
#Creating the KNN learner for RQ1
knn_learner_RQ_1 <- makeLearner("classif.knn")
Now we have our learner, but it's actually missing something: the hyperparameter of what k should be. We need to tell the model how many adjacent data points we want it to use to classify the validation set. We could guess and just make it 3 because that is what we have seen in some examples, but let's let the math do it for us instead. We can use code to check a bunch of different values and then use the value that gives us the best prediction rates. Let's go!
#finding the best value of k
#define values to check. Best to check 1 through 10 to start.
knnParamSpace_RQ_1 <- makeParamSet(makeDiscreteParam("k", values = 1:10))

#tell the computer to search every value of k we offered (1 thru 10)
gridSearch_RQ_1 <- makeTuneControlGrid()

#cross validate for the k finding
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)

#Use the tuneParams function and the task we made above to get the best k value
tunedk_RQ_1 <- suppressMessages({
  tuneParams("classif.knn", task = task_RQ_1, resampling = cvForTuning,
             par.set = knnParamSpace_RQ_1, control = gridSearch_RQ_1, show.info = F)
})
You will notice that in the tunedk_RQ_1 code I have the suppressMessages() function. This is not necessary when running models yourself and can be deleted from the code. I have added it throughout the tutorial so that many redundant messages aren't printed in the final document. You might want to keep this function in your code if you want only the absolutely necessary outputs.
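A related option, if you prefer not to wrap every call in suppressMessages(): mlr can be told once per session to keep its informational output quiet. This is optional and nothing in the rest of the tutorial depends on it:

#Silence mlr's informational output for the whole session (optional)
configureMlr(show.info = FALSE)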
Now let’s print the result of our tuning:
tunedk_RQ_1
Tune result:
Op. pars: k=9
mmce.test.mean=0.2925956
A k of 9 gave us the best mmce, or mean misclassification error. The mean misclassification error is a great way to check the prediction accuracy, as it is simply the proportion of cases classified as the wrong class. So currently we have about 29% of our data being classified incorrectly. Let's add this k of 9 into our learner and keep going!
#Creating the KNN learner for RQ1
knn_learner_RQ_1 <- makeLearner("classif.knn", par.vals = list("k" = 9))
Now we need to put it all together to train our model and then perform cross validation. Remember we already made our cross validation argument in the cross validation section: loocv <- makeResampleDesc(method = "LOO"). The resample() function runs the model multiple times with the data from each of the different cross validation sets. We just have to feed it the task, learner, and cross validation arguments we made earlier.
#Training the model while cross validating
KNN_model_RQ_1 <- suppressMessages({
  train(learner = knn_learner_RQ_1, task = task_RQ_1)
})

loocv <- makeResampleDesc(method = "LOO") #cross validation procedure

#Get a sense of how well the model will work in practice with cross validation
KNN_model_RQ_1_loocv <- suppressMessages({
  resample(learner = knn_learner_RQ_1, task = task_RQ_1,
           resampling = loocv, measures = list(mmce, acc))
})
Now that we have made the model and cross validated it using leave-one-out cross validation, we can look at the loss/cost function to see how our model is performing. In this case our loss/cost functions are accuracy (acc) and mean misclassification error (mmce). The $aggr element gives you the two performance metrics we asked the model to pull out in the measures argument. A performance metric tells you how well the model predicted the correct category. As noted above, mmce is the mean misclassification error, or the proportion of incorrectly classified cases, and acc is the accuracy, or the proportion of correctly classified cases. You will notice that the two together add up to 100%, so you really only need one or the other.
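To make that relationship concrete, here is a tiny hand-computed illustration with made-up labels (not our dog data):

#Toy example: 5 made-up cases, 2 of them misclassified
truth <- factor(c("yes", "yes", "no", "no", "yes"))
pred  <- factor(c("yes", "no",  "no", "yes", "yes"))
mean(pred != truth)   #mmce: proportion misclassified = 0.4
mean(pred == truth)   #acc: proportion correct = 0.6; the two always sum to 1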
#Performance
KNN_model_RQ_1_loocv$aggr

mmce.test.mean  acc.test.mean
     0.2984014      0.7015986
This model didn’t perform well. We are currently classifying 70% of the data correctly. Let’s run some other models and see if we can do better than this.
Next on our list is making a decision tree. The argument you will need to make a decision tree (NOT a random forest, just one tree) is classif.rpart.
#Creating the learner for a single decision tree
decisiontree_learner_RQ_1_and_2 <- makeLearner("classif.rpart")
Now we can pass our learner to the resample() function along with our cross-validation procedure and task.
#Training the model
decision_tree_model_RQ_1 <- suppressMessages({
  train(learner = decisiontree_learner_RQ_1_and_2,
        task = task_RQ_1)
})

#Checking how well the model does using cross validation
decision_tree_model_RQ_1_loocv <- suppressMessages({
  resample(learner = decisiontree_learner_RQ_1_and_2,
           task = task_RQ_1, resampling = loocv,
           measures = list(mmce, acc))
})
Now let’s check how well the decision tree model is working with the data by looking at our loss/cost function.
#Performance of the Decision Tree
decision_tree_model_RQ_1_loocv$aggr

mmce.test.mean  acc.test.mean
     0.2841918      0.7158082
The decision tree did a little better than the KNN model (70% vs 71.6%), but they currently look pretty similar. We won't see the final differences until they have been evaluated on new, unseen data in Step 5. Something to look forward to!
The third and last model we want to test with this first research question is an SVM model. The SVM is the most complex and computationally expensive model we are using, and it has multiple hyperparameters that need tuning. The four hyperparameters that are most important to tune are kernel, cost, degree, and gamma. Each of the four hyperparameters changes the decision boundary slightly. A decision boundary is the hyperplane (often a line or squiggle) that separates the data into two classes.
- kernel - The kernel defines what type of line you want the algorithm to fit to the data. There are four possible options: a line, a polynomial, a radial (circle), or a sigmoid. You only need to check the last three because a one-degree polynomial is the same as a line, and if you ask the computer to calculate both you are just wasting time and computational power.
- cost - This controls how tolerant you will be of misclassifications - i.e., how hard or soft the margin of the decision boundary is. If it's soft (smaller number) it allows misclassifications. If it's hard (larger number) it doesn't allow any misclassifications.
- degree - This controls how "bendy" the decision boundary is. Higher values mean more bendy. The bendier/squigglier the line, the more it fits the training data and usually the less it generalizes to new unseen data. See the regularization section on overfitting/underfitting.
- gamma - This controls how much influence the algorithm gives to individual data points. Large values of gamma mean more intricate decision boundaries (i.e., a more squiggly line that maps every point instead of generalizing with a more linear line) and a higher chance of overfitting. (A quick way to double-check these hyperparameter names in mlr is shown right after this list.)
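That check looks like this (it assumes mlr is loaded; we will use the same function for the random forest later on):

#Optional: list the tunable hyperparameters mlr exposes for the SVM learner
getParamSet("classif.svm")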
Now we will define the hyperparameters that we want to tune with the same functions we used to define k up above.
#Create a vector listing the possible kernels you can check
kernels <- c("polynomial", "radial", "sigmoid")

svm_parametercheck <- makeParamSet(makeDiscreteParam("kernel", values = kernels),
                                   makeIntegerParam("degree", lower = 1, upper = 3),
                                   makeNumericParam("cost", lower = 0.1, upper = 10),
                                   makeNumericParam("gamma", lower = 0.1, upper = 10))
Now that we have defined what we want the computer to check, we have to tell it how to search through the values. Previously, in the KNN model above, we just had the computer search every value between 1 and 10 with makeTuneControlGrid(), but if we had the computer look through every value we specified above, we would be waiting a very long time. Instead we will ask the computer to randomly search through a smaller subsection of the values with the makeTuneControlRandom() function. You should search through as many combinations as you have computing power for. Where to start depends on how many values you are searching through. Degree can only take 3 values, so if I was only tuning that hyperparameter, 3 iterations would be the same as searching through every value. However, cost and gamma could be any value between 0.1 and 10, which is hundreds of possible values if we only go to the hundredths place. Start with a number that would reasonably check at least 5% of the values and then run it multiple times. If you consistently get the same values each time you run it, you probably have a high enough maximum iteration value (defined below under maxit). If you keep getting very different values each time you run it, increase the maximum iteration value iteratively until you have consistency in results. Since this is just an example, though, we will choose a small number so it runs quickly. Let's do that now!
randomsearch <- makeTuneControlRandom(maxit = 50)
Next, we will tell our computer to run code using multiple cores (i.e., CPUs) at once. This helps speed up the computation process by using more of the computer. While the myth that humans only use 10% of their brains is false, computers really don't use all of their brain unless you tell them to. Check how many cores your computer has with parallel::detectCores(). My machine has 24 CPUs (as you can see in the output below), so we'll use all of them with the parallelStartSocket() function. Note that if you use all of your cores, your computer can't really do anything else while it is working, so if you want to check email, browse the web, or keep coding during the computation, leave a core or two out.
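For example, a small variant that leaves one core free and then shuts the workers down afterwards might look like this (parallelStop() comes from the same parallelMap package that provides parallelStartSocket()):

#Leave one core free so the computer stays usable while it works
library(parallelMap)
library(parallel)
parallelStartSocket(cpus = detectCores() - 1)

#...run your tuning/resampling here...

#Shut the parallel workers down when you are finished
parallelStop()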
We also need a cross validation procedure for this search. We could use the cvForTuning we made earlier with 10 folds and 20 repetitions, but that would create 200 iterations. With that many iterations cvForTuning is likely to find the best hyperparameters for our question - but it would take more than 30 minutes to complete on my 24-core machine. Instead, we will make a new cross validation approach called cvForTuning2 with only 5 folds and 5 repetitions, creating 25 iterations. This will only take about 6 minutes.
Throughout this tutorial you will notice that there are many instances where we have to iteratively work through multiple options per problem. Over time, working with different code and algorithms, you will develop an intuition for the optimal number of iterative steps for different problems. More is always better - except for the costs of time and computational power, and you will often run into diminishing returns on very large cross validation runs. If it is crucial to get something as correct as possible, you might have to run your computer for a whole week; if you had unlimited computational power, the best option would be to run millions of iterations to be sure you have exactly the right set of hyperparameters. However, one of the primary roles of the data scientist in a non-infinite world is to find the optimal trade-off between accuracy, time, and power. Throughout this tutorial I am trying to give you a sense of that intuition by pointing out the decision points to pay attention to as you gain experience, so you can be more transparent - to yourself, as well as to others - when documenting your choices at those decision points. You will not be perfect at this skill overnight; it will develop over time as you become more familiar with machine learning.
#use all of the cpus on your computer
parallelStartSocket(cpus = detectCores())

#cross validation procedure for tuning this SVM algorithm
cvForTuning2 <- makeResampleDesc("RepCV", folds = 5, reps = 5)

#Tune the hyperparameters
tunedSvmPars_RQ1 <- suppressMessages({
  tuneParams("classif.svm", task = task_RQ_1,
             resampling = cvForTuning2,
             par.set = svm_parametercheck,
             control = randomsearch)
})
tunedSvmPars_RQ1
Tune result:
Op. pars: kernel=polynomial; degree=3; cost=5.74; gamma=9.28
mmce.test.mean=0.2842099
Now we have an answer for the optimal hyperparameter values for the SVM on this task! However, if you run the above code multiple times you will notice that you don't get the same results each time. This is because it isn't searching through every possible option; rather, it tells you the best option from the random hyperparameter combinations it happened to search through this time. So if this model ends up performing among the best of the models you've evaluated, it's worthwhile to revisit this step and conduct additional cross-validation iterations before settling on the final model hyperparameters.
I ran the above code 3 times, and it always chose a polynomial for the kernel function, so we could fix the kernel at polynomial and just re-tune the other three hyperparameters. Across the 3 runs, the algorithm chose a degree of 2 twice and 3 once. The cost and gamma values varied more widely: cost ranged from 1.99 to 9.28, and gamma ranged from 3.93 to 9.28.
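If you would rather automate that kind of stability check than rerun the chunk by hand, here is a rough sketch using the objects we defined above (the number of reruns and the stability_check name are arbitrary):

#Rerun the random search a few times and compare the chosen hyperparameters.
#If these differ wildly between runs, increase maxit in makeTuneControlRandom().
stability_check <- lapply(1:3, function(i) {
  tuneParams("classif.svm", task = task_RQ_1, resampling = cvForTuning2,
             par.set = svm_parametercheck, control = randomsearch,
             show.info = FALSE)$x
})
stability_check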
Let’s put the most recent hyperparameters into the learner and then run the model with our cross validation approach!
#Make the SVM Learner
SVM_learner_tuned_RQ_1 <- makeLearner("classif.svm", par.vals = tunedSvmPars_RQ1$x)

#Train the model
SVM_model_RQ_1 <- suppressMessages({
  train(learner = SVM_learner_tuned_RQ_1, task = task_RQ_1)
})

#Check model performance with LOOCV
SVM_model_RQ_1_loocv <- suppressMessages({
  resample(learner = SVM_learner_tuned_RQ_1,
           task = task_RQ_1,
           resampling = loocv,
           measures = list(mmce, acc))
})
#Output the Loss/Cost functions for this model
SVM_model_RQ_1_loocv$aggr

mmce.test.mean  acc.test.mean
     0.2824156      0.7175844
The model and cross validation approach took about 6 and a half minutes to run on my 24-CPU computer using all of the cores in parallel. This model worked about as well as the decision tree, but better than our KNN. If we took more time to run a wider range of hyperparameters for degree, cost, and gamma, the accuracy of the final model would most likely increase. If you are working through this with me, why don't you create a cross validation procedure with many more iterations, run it, then go make lunch, come back, and rerun the model code. Did you get a higher accuracy rating? (Also, how was your lunch?)
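If you want to try that bigger run, a sketch might look like the following; the _big object names and the exact numbers (500 random draws, 10 folds by 10 repeats) are only illustrative, so scale them to your hardware and patience:

#A wider random search with a heavier cross validation scheme - expect a long wait
randomsearch_big <- makeTuneControlRandom(maxit = 500)
cvForTuning_big  <- makeResampleDesc("RepCV", folds = 10, reps = 10)

tunedSvmPars_RQ1_big <- tuneParams("classif.svm", task = task_RQ_1,
                                   resampling = cvForTuning_big,
                                   par.set = svm_parametercheck,
                                   control = randomsearch_big,
                                   show.info = FALSE)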
15.0.3 Creating Models for Question 2
This section will be very similar to the previous learners section, as they have similar constraints and the same predictor variable. The purpose of this section is to show you how to make very small changes in what you feed the model to change the outcome.
Let's jump straight into finding the correct value of k that minimizes misclassifications for the KNN model for research question 2. The only thing that we have to change is the task we feed the tuneParams() function.
#finding the best value of k
#define values to check. Best to start with checking 1 through 10
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))

#tell the computer to search every value of k we offered (1 thru 10)
gridSearch <- makeTuneControlGrid()

#cross validate for the k finding
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)

#Use the tuneParams function and the task we made above to get the best k value
tunedk_RQ_2 <- suppressMessages({
  tuneParams("classif.knn",
             task = task_RQ_2,
             resampling = cvForTuning,
             par.set = knnParamSpace,
             control = gridSearch)
})
tunedk_RQ_2
Tune result:
Op. pars: k=8
mmce.test.mean=0.2344251
The best k for this research question was 8! Let's update the learner to include this new hyperparameter setting.
<- makeLearner("classif.knn", par.vals = list("k"=8)) knn_learner_RQ_2
Now we can run our model with the new learner, task, and cross validation.
#Running the KNN model for RQ 2
KNN_model_RQ_2 <- suppressMessages({
  train(learner = knn_learner_RQ_2, task = task_RQ_2)
})

#Running cross validation
KNN_model_RQ_2_loocv <- suppressMessages({
  resample(learner = knn_learner_RQ_2,
           task = task_RQ_2,
           resampling = loocv,
           measures = list(mmce, acc))
})
#Looking at how this model did
KNN_model_RQ_2_loocv$aggr

mmce.test.mean  acc.test.mean
     0.2344583      0.7655417
Interestingly, this model worked better than the previous KNN model with research question 1 (70% vs 77%). So, training score must be slightly more predictive for the nonostensive task.
To run a decision tree model, we can use the same learner as for research question 1, since we didn't have to do any tuning. Therefore, we can simply run the decision tree model immediately with the pre-made learner, task, and cross validation procedure.
#Training the Decision Tree model for RQ #2
decision_tree_model_RQ_2 <- suppressMessages({
  train(learner = decisiontree_learner_RQ_1_and_2,
        task = task_RQ_2)
})

#Cross validating
decision_tree_model_RQ_2_loocv <- suppressMessages({
  resample(learner = decisiontree_learner_RQ_1_and_2,
           task = task_RQ_2,
           resampling = loocv,
           measures = list(acc))
})
#Performance of the Decision Tree
decision_tree_model_RQ_2_loocv$aggr

acc.test.mean
    0.7655417
This decision tree had the same accuracy as the KNN model for this research question as far as we can tell from the proportion of correct classifications.
Now we have to run the SVM and tune it with the new hyperparameters that work best on this task. Most of what we defined above will work with this algorithm, so we don’t need to set new hyperparameters or cv values - we just have to feed the new data to what we already made!
#Tuning hyperparameters for SVM model
tunedSvmPars_RQ2 <- tuneParams("classif.svm", task = task_RQ_2, resampling = cvForTuning2,
                               par.set = svm_parametercheck, control = randomsearch)
[Tune] Started tuning learner classif.svm for parameter set:
Type len Def Constr Req Tunable Trafo
kernel discrete - - polynomial,radial,sigmoid - TRUE -
degree integer - - 1 to 3 - TRUE -
cost numeric - - 0.1 to 10 - TRUE -
gamma numeric - - 0.1 to 10 - TRUE -
With control class: TuneControlRandom
Imputation value: 1
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; level = mlr.tuneParams; cpus = 24; elements = 50.
[Tune] Result: kernel=polynomial; degree=3; cost=8.28; gamma=2.64 : mmce.test.mean=0.2344627
#show me what hyperparameters worked best
tunedSvmPars_RQ2

#update learner with the chosen hyperparameters
SVM_learner_tuned_RQ_2 <- makeLearner("classif.svm", par.vals = tunedSvmPars_RQ2$x)
#Run the model
SVM_model_RQ_2 <- suppressMessages({
  train(learner = SVM_learner_tuned_RQ_2, task = task_RQ_2)
})

#Run Cross Validation
SVM_model_RQ_2_loocv <- suppressMessages({
  resample(learner = SVM_learner_tuned_RQ_2,
           task = task_RQ_2,
           resampling = loocv,
           measures = list(mmce, acc))
})
#performance
SVM_model_RQ_2_loocv$aggr

mmce.test.mean  acc.test.mean
     0.2344583      0.7655417
This model was also about the same as the decision tree model on accuracy.
15.0.4 Creating Models for Question 3
The two models we decided fit our model assumptions and data for research question 3 are a KNN model and a random forest model. We will start by tuning the k hyperparameter for our KNN model. As this question has many more predictors than our first two, we will investigate a wider range of k values. If the chosen k is at the top of your range (not somewhere in the middle), you haven't searched through a wide enough range of values.
#finding the best value of k
#define values to check. For this question we check 1 through 50.
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:50))

#tell the computer to search every value of k we offered (1 thru 50)
gridSearch <- makeTuneControlGrid()

#cross validate for the k finding
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 50)

#Use the tuneParams function and the task we made above to get the best k value
tunedk_RQ_3 <- tuneParams("classif.knn", task = task_RQ_3_numeric, resampling = cvForTuning,
                          par.set = knnParamSpace, control = gridSearch)
[Tune] Started tuning learner classif.knn for parameter set:
Type len Def Constr Req Tunable Trafo
k discrete - - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,1... - TRUE -
With control class: TuneControlGrid
Imputation value: 1
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; level = mlr.tuneParams; cpus = 24; elements = 50.
[Tune] Result: k=42 : mmce.test.mean=0.2433409
tunedk_RQ_3
Tune result:
Op. pars: k=42
mmce.test.mean=0.2433409
The answer was 35 so we will now use that value of k for our learner.
#Creating the KNN learner for RQ3
knn_learner_RQ_3 <- makeLearner("classif.knn", par.vals = list("k" = 35))
Now we run the learner with the task and cross validation and get our prediction accuracy.
#Training the model
KNN_model_RQ_3 <- suppressMessages({
  train(learner = knn_learner_RQ_3, task = task_RQ_3_numeric)
})

#cross validating
KNN_model_RQ_3_loocv <- suppressMessages({
  resample(learner = knn_learner_RQ_3,
           task = task_RQ_3_numeric,
           resampling = loocv,
           measures = list(acc))
})
#Performance
KNN_model_RQ_3_loocv$aggr

acc.test.mean
    0.7566607
The KNN model for research question 3 did about as well as the models trained on fewer predictors (76%).
Next is the random forest algorithm. We can first just make a learner with the randomForest model type.

randomforest_learner_RQ_3 <- makeLearner("classif.randomForest")
Like SVM, random forest algorithms also have a number of hyperparameters that need to be investigated and tuned before we run the model. In random forest, the four hyperparameters that should be tuned as best practice are ntree, mtry, nodesize, and maxnodes.
- ntree - the number of trees you want the model to run
- mtry - the number of features/predictors to sample at each node
- nodesize - the minimum number of observations allowed in a leaf
- maxnodes - the maximum number of leaves
We will let ntree and nodesize stay at their defaults of 500 and 1, respectively, as we don't have a particularly large or small amount of data that would suggest changing the default values. These can always be adjusted later after you have tuned everything else. Before we decide what values to choose for mtry and maxnodes, this is a good time to show you how to check what hyperparameters there are for any specific algorithm. The function getParamSet() lists the parameters and the default values where applicable.
getParamSet("classif.randomForest")
Type len Def Constr Req Tunable Trafo
ntree integer - 500 1 to Inf - TRUE -
mtry integer - - 1 to Inf - TRUE -
replace logical - TRUE - - TRUE -
classwt numericvector <NA> - 0 to Inf - TRUE -
cutoff numericvector <NA> - 0 to 1 - TRUE -
strata untyped - - - - FALSE -
sampsize integervector <NA> - 1 to Inf - TRUE -
nodesize integer - 1 1 to Inf - TRUE -
maxnodes integer - - 1 to Inf - TRUE -
importance logical - FALSE - - TRUE -
localImp logical - FALSE - - TRUE -
proximity logical - FALSE - - FALSE -
oob.prox logical - - - Y FALSE -
norm.votes logical - TRUE - - FALSE -
do.trace logical - FALSE - - FALSE -
keep.forest logical - TRUE - - FALSE -
keep.inbag logical - FALSE - - FALSE -
Now that we know how to look at the tunable parameters, let's go back to deciding what values to use in tuning mtry and maxnodes. We only have 12 predictors, so let's do the full range of 1 to 12 for the number of features to sample at each node. For maxnodes we would have to choose values less than the number of observations, nrow(training_data). For now, let's leave it at NULL so the trees can grow without restriction, because we aren't as worried about time and computing power with a dataset this small. That leaves us only one parameter to tune, which we will do below:
forestParamSpace <- makeParamSet(makeIntegerParam("mtry", lower = 1, upper = 12))

randomsearch_100 <- makeTuneControlRandom(maxit = 100)

cvForTuning_randomforest <- makeResampleDesc("CV", iters = 5)

tunedForestPars <- suppressMessages({
  tuneParams(learner = randomforest_learner_RQ_3,
             task = task_RQ_3_factor,
             resampling = cvForTuning_randomforest,
             par.set = forestParamSpace,
             control = randomsearch_100)
})
#Show the result
tunedForestPars
Tune result:
Op. pars: mtry=1
mmce.test.mean=0.2415613
And the result was that 1 was the best value to use for mtry. Now we have to update the learner with the tuned parameter, plug everything into the resample() function, and make the model.
#Update the learner
randomforest_learner_RQ_3 <- makeLearner("classif.randomForest", par.vals = list("mtry" = 1))

#Train the Model
random_forest_model_RQ_3 <- suppressMessages({
  train(learner = randomforest_learner_RQ_3, task = task_RQ_3_factor)
})

#Cross validating
random_forest_model_RQ_3_loocv <- suppressMessages({
  resample(learner = randomforest_learner_RQ_3,
           task = task_RQ_3_factor,
           resampling = loocv,
           measures = list(mmce, acc))
})
#Performance
random_forest_model_RQ_3_loocv$aggr

mmce.test.mean  acc.test.mean
     0.2433393      0.7566607
Both the random forest model and the KNN model performed the same when they were cross validated.
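If you later want to constrain tree size as well, rather than leaving maxnodes at NULL, a wider random forest search space might look like the sketch below. The ranges and _wide object names are illustrative rather than recommendations:

#Tune mtry, nodesize, and maxnodes together instead of only mtry
forestParamSpace_wide <- makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = 12),
  makeIntegerParam("nodesize", lower = 1, upper = 10),
  makeIntegerParam("maxnodes", lower = 5, upper = 30)
)

tunedForestPars_wide <- tuneParams(learner = "classif.randomForest",
                                   task = task_RQ_3_factor,
                                   resampling = cvForTuning_randomforest,
                                   par.set = forestParamSpace_wide,
                                   control = randomsearch_100,
                                   show.info = FALSE)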
15.1 Generic Logistic Regression Example
While the current data did not call for us to use logistic regression, it is still a useful algorithm to know how to use for many potential problems. So, here is code that you can use to run your own logistic regression.
We will use the task we created for research question 1. We will have to make a new learner to define our model using the classif.logreg argument for logistic regression. Set the argument predict.type equal to prob if you want the estimated probabilities of each class outputted, as well as what class each observation is predicted to be in. Then all you have to do is put it together and cross validate:
<- makeLearner("classif.logreg", predict.type = "prob")
LR_learner_RQ_1
<- suppressMessages({
LR_model_RQ_1 train(LR_learner_RQ_1, task_RQ_1)
})
<- suppressMessages({
LR_model_RQ_1_loocv resample(learner = LR_learner_RQ_1,
task = task_RQ_1,
resampling = loocv)
})
LR_model_RQ_1_loocv$aggr

mmce.test.mean
     0.2841918
As predicted, this model did NOT do a good job at predicting the correct category for our dogs.
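Because we set predict.type = "prob", we can also pull the estimated class probabilities out of the trained model. As a quick illustration (re-predicting on the training task here purely for demonstration, not as an evaluation):

#Get predictions, including class probabilities, from the trained model
LR_predictions <- predict(LR_model_RQ_1, task = task_RQ_1)

#The estimated probability of the positive class for the first few dogs
head(getPredictionProbabilities(LR_predictions))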
15.2 Generic Naive Bayes Example
As with logistic regression, we decided not to use Naive Bayes for this example dataset, but having some base code may be handy for those cases where it would be appropriate. Naive Bayes algorithms have only one hyperparameter to tune, laplace. You typically want to set laplace to a small value; it simply adds that value to all of the feature counts to handle the issue of a zero probability for some features. Unseen feature values (0s) are then assigned a small, non-zero probability, which prevents a single zero from zeroing out the whole product of probabilities (as anything multiplied by 0 is 0) and improves the robustness of the Naive Bayes classifier.
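As a toy illustration of what that smoothing does (the counts here are made up, not from our data):

#Suppose a feature level was never seen within one class during training
count_in_class <- 0    #times this feature level appeared in the class
n_in_class     <- 20   #training observations of that class
n_levels       <- 2    #levels this feature can take

#Without smoothing the conditional probability is 0, which zeroes out the whole product
count_in_class / n_in_class

#With laplace = 1, every level gets a small non-zero probability instead
(count_in_class + 1) / (n_in_class + 1 * n_levels)   #1/22, roughly 0.045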
I will use the task_RQ_1 task, and then we just have to make the learner, train the model, and cross validate.
<- makeLearner("classif.naiveBayes", par.vals = list("laplace" = 1))
NB_learner_RQ_1
<- train(learner = NB_learner_RQ_1,
NB_model_RQ_1 task = task_RQ_1)
<- suppressMessages({
NB_model_RQ_1_loocv resample(learner = NB_learner_RQ_1,
task = task_RQ_1,
resampling = loocv)
})
NB_model_RQ_1_loocv$aggr

mmce.test.mean
     0.2841918
You'll notice that we are at about 71.6% accuracy, which is less predictive than most of our models. But this and all our other models haven't been tested on new data - let's do that now!