3 What is Classification Machine Learning and How Do You Do It?
So, if that’s what machine learning is: what is classification machine learning? Great question! I’m so glad you asked! In this tutorial we are interested in data sets where the outcome variable (aka dependent variable) is discrete or ordinal, and we are therefore trying to predict what category an observation will end up in. That is called classification. Classification falls under the supervised learning umbrella because there are predictors that are either continuous, categorical, or binary, and we are interested in classifying these discrete categorical outcome variable based on predictors*.
To conceptualize what you should do on a classification machine learning journey let’s break it down into 5 simple parts:
1. Explore data and check assumptions
2. Randomly partition the data into training and test sets
3. Run repeated cross validation
4. Choose an algorithm based on variable types and assumptions and run models with hyperparameters
5. Assess how well each model performs on the testing data set
YOU DID MACHINE LEARNING!
So simple, right? Okay, so maybe you aren’t ready to go run models on your own yet ;) Let’s discuss what each step does and why before we jump into how to do it:
- Explore data and check assumptions: Before you can run any models you need to explore your dataset to understand what types of variables you have, clean up the data, and deal with any missing values. Once these steps are complete you must check if the properties of your dataset meet the necessary criteria for each potential algorithm. Each algorithm has its own set of assumptions that must be met for it to work as intended. If the assumptions of an algorithm are violated in the working dataset than models created by that specific algorithm are often useless**. If your model assumes condition x, but your data violates condition x, then the model outcomes will be less predictive and, in some cases, essentially worthless. When assumptions are not met, the model’s outputs can be biased, inefficient, or inaccurate, leading to poor performance or incorrect conclusions. Once you know which algorithms assumptions are met or violated you can choose the appropriate algorithms to use in step 4.
- Randomly partition the data: Breaking data into training and test sets allows you to reserve the testing data to evaluate how the model you create performs on new, unseen data. The test set is the unseen data, and the training data is what you use to estimate model parameters. TO partition the data into training and test sets, it makes sense to randomly partition them to avoid any weird order effect that may go along with data collection. But a problem with randomly partitioning data is that every time you run the analysis, you’ll use a different random partition, which can lead to a slightly different result. This means that your analysis is not fully reproducible. So anyone running your code (including you, when you comeback to rerun your analyses!) will not get exactly the same results. To ensure reproducible results, you should set a seed (i.e., a fixed starting point for the randomization process). Anyone who uses that seed will use the same randomization process.
- Run repeated cross validation: Above you partitioned the data once, which is the most basic form of cross validation. However, to partitioning the data only once can result in weird, uneven distributions of data between the training and testing sets. Instead, we run a bunch of these partitions (called repeated cross validation) and average over them to reduce bias and variance. If you forgo this step, you will likely get a more biased estimation of which model is going to accurately predict new data because of random noise in the training data, rather than the true relationship between your predictors and outcome variables as they exist in the world outside your sample.
- Choose an algorithm and run with hyperparameters: Using the characteristics of your data (number of predictors, type of outcome variables, etc.) and model assumptions (collinearity, outliers, normality, etc.) you can narrow down your analysis to a number of models that you could use to make predictions with a specific dataset. However, you cannot know what single model will run best until you actually run the models and compare the outcomes. This step also includes adding in any necessary hyperparameters the computer needs to run a specific model.
- Assess model performance: After you have run all the models that you are interested in, you then need to determine how well each model predicted the test set or unseen data. Generalization error is the term for this. It is a measure of how well a models predictions match the true values in the training data. In things like a logistic regression this can be as simple as calculating the percent of categories predicted correctly, or as complex as creating an ROC graph. Basically, in this step we are using standardized metrics to evaluate how well each possible model predicted the testing data.
* Predictor, feature, and independent variable all mean the same thing. Depending on which machine learning resource you are looking at they could choose to use any of these terms. Predictor and feature are most frequently used in this tutorial.
** You will notice the words algorithm and model used throughout this document. They refer to distinct pieces of information and are not interchangeable. When the word algorithm is used it refers to a framework/class of machine learning algorithm that could possibly be used to create a specific model. A model refers to the actual model complete with all structures and hyperparameters. For examples an algorithm type is linear regression but a model is a a linear regression run with outcome x, predictors, a b and c and a 10-fold cross validation procedure.