11  Regularization AKA how to fix overfitting and collinearity

Now, if you consult other machine learning texts, you will notice one word that comes up often: regularization. The type of regularization discussed here only applies to linear models, because it is predicated on changing the slope of a fitted line to improve how well the model predicts new, unseen data. For our purposes, we will not be able to use regularization in this tutorial: linear regression is not a classification-based model, the assumptions weren't met for logistic regression, and tuning regularization with support vector machines is outside the scope of this tutorial. If you have no interest in regularization, you can skip this section. I've included it because regularization is a key concept in machine learning, commonly applied whenever linear or logistic regression is used.

Regularization helps us solve two problems that plague statistics: collinearity and overfitting. Often, after you collect your data, you will find that some of your predictors are collinear, meaning that two or more of your predictor variables are highly correlated with each other. Overfitting means that your model maps too closely onto your training data and isn't flexible enough to make accurate predictions for new, unseen data. All data contains two sources of variation: the true relationships, and the error in your sample (i.e., what is true of your sample but is not true of the population). The smaller your sample size, the more likely it is that you have captured noise rather than the true relationship between variables. The graphic below helps illustrate how overfitting creates a model that relies on error rather than the true relationship.

[Figure: overfitting. Source: MathWorks (https://www.mathworks.com/discovery/overfitting)]
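If you want to see overfitting in action rather than just in a picture, here is a minimal sketch in Python with scikit-learn on simulated data (not this tutorial's dataset, and this tutorial's own code may use different tools): a very flexible model scores better on the data it was trained on but worse on new, unseen data than a plain straight line.

```python
# A minimal sketch of overfitting on simulated data (not the tutorial's
# dataset); assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Small, noisy training sample drawn from a simple linear truth: y = 2x + noise.
x_train = rng.uniform(0, 1, size=20).reshape(-1, 1)
y_train = 2 * x_train.ravel() + rng.normal(0, 0.3, size=20)
# A larger "unseen" sample from the same truth, used only for evaluation.
x_test = rng.uniform(0, 1, size=200).reshape(-1, 1)
y_test = 2 * x_test.ravel() + rng.normal(0, 0.3, size=200)

# A very flexible model (degree-9 polynomial) chases the noise in the small
# training sample; a plain straight line captures the true relationship.
wiggly = make_pipeline(PolynomialFeatures(degree=9), LinearRegression()).fit(x_train, y_train)
straight = LinearRegression().fit(x_train, y_train)

for name, model in [("degree-9 polynomial", wiggly), ("straight line", straight)]:
    print(name,
          "train R^2:", round(r2_score(y_train, model.predict(x_train)), 2),
          "test R^2:", round(r2_score(y_test, model.predict(x_test)), 2))
```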

You can address both of these issues at once with something called regularization. Regularization simply adds a penalty term, lambda, that is applied to the slope of the fitted line. This adjustment to the slope adds a little bit of error to the model. This user-added error is a necessary evil: it combats some of the noise that creeps into a model when the data used to build it is not totally representative of the underlying relationship between variables. The choice to regularize is up to you, and depends on how much risk tolerance or uncertainty you want to accept in your model outcomes. If you have millions of data points from a highly representative sample, adding that extra error will probably move you further from the truth of the relationship. If, as in this example, you have fewer than 1,000 subjects and are trying to generalize to the species level, you should probably use regularization, because the sample is small and not totally representative of your population.

There are three types of regularization that are commonly used:

  1. Lasso Regularization - best to use if you know some of the predictors in your model are useless
  2. Ridge Regularization - best to use if you know all of the predictors in your model are useful
  3. Elastic Net Regularization - best to use if you don't know how useful/useless your predictors are, because it gives you the best of both worlds (chill it out, take it slow, then you rock out the show) - it adds the lasso penalty and the ridge penalty together to form a new penalty called elastic net.

In each case, applying regularization creates a new quantity to minimize: the loss/cost function plus a lambda value multiplied by a transformation of the slope of the fitted line. How the slope is transformed is where the three types of regularization differ. If you got a little lost in that last sentence, don't despair; we will describe each of the three parts that contribute to regularization in detail. I'll clarify the concept using an example you are likely familiar with: linear regression.
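In symbols, the general recipe looks something like this (a sketch in the informal notation of this section; P(slope) stands for whichever slope transformation the chosen method uses):

```latex
% General shape of a regularized cost function (a sketch in the notation of
% this section; P(slope) is the slope transformation that differs by method)
\text{regularized cost} \;=\;
  \underbrace{\text{loss/cost function}}_{\text{how badly the line fits}}
  \;+\;
  \underbrace{\lambda}_{\text{penalty strength}} \times
  \underbrace{P(\text{slope})}_{\text{transformation of the slope}}
```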

Loss/cost functions provide a single value that quantifies the difference between the predicted values and the actual values, measuring the model's performance. In a linear regression model, our loss/cost function is the sum of squared residuals (calculated by adding together the squared differences between each data point and where the line says that data point should be), which represents how well the line predicted the data. If you have a small loss/cost value, then your model is very good at accurately predicting outcome values. Therefore, you want to minimize the loss/cost value as much as possible: a small loss/cost function is the goal.
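Written out as a formula (where y_i is the observed value for observation i and ŷ_i is the value the fitted line predicts for it):

```latex
% Sum of squared residuals: add up the squared gap between each observed
% value y_i and the value the fitted line predicts for it, \hat{y}_i
\text{SSR} \;=\; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```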

The second term, lambda, is a penalty term that adds a small amount of error to the slope of your line. This error helps combat overfitting by making the slope slightly less steep (shrinking it toward zero). By adjusting the slope slightly, you make it less dependent on the current training data and more likely to accurately predict the testing data (AKA you prevent overfitting). Lambda can be any positive number; the larger the lambda, the larger the penalty or error you are adding to the model. The value of lambda for a particular model is determined by iteratively going through various values of lambda, testing each one against held-out data (typically with cross-validation), until you find the lambda value that minimizes prediction error on data the model wasn't trained on. In this way, you are balancing the trade-off between fitting the training data well and overfitting.
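In practice, you rarely loop over lambda values by hand; most libraries search a grid of values for you. Here is a minimal sketch with scikit-learn (where lambda is called alpha); the data below is simulated purely for illustration.

```python
# A minimal sketch of choosing lambda (called alpha in scikit-learn) by
# cross-validation, on simulated data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # 5 predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only 2 of them matter

# LassoCV tries a grid of alpha (lambda) values and keeps the one that
# minimizes cross-validated prediction error on held-out folds.
model = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen lambda (alpha):", model.alpha_)
print("slopes:", model.coef_)
```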

The difference between the three types of regularization occurs in the third term, which represents how each one transforms the slope of each predictor. In Lasso regularization, you multiply lambda by the absolute value of the slope (i.e., |slope|). In Ridge regularization, you multiply lambda by the squared slope (i.e., slope²). In elastic net, you multiply lambda by the two penalties added together (i.e., |slope| + slope²).
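Written out for a model with several predictors (a sketch in the notation of this section, summing over every predictor's slope; note that in practice elastic net usually gives the two penalty pieces their own weights or a mixing parameter):

```latex
% The three penalized cost functions, in the notation of this section
\text{Lasso:}       \quad \text{SSR} + \lambda \sum_{j} \lvert \text{slope}_j \rvert \\
\text{Ridge:}       \quad \text{SSR} + \lambda \sum_{j} \text{slope}_j^{\,2} \\
\text{Elastic Net:} \quad \text{SSR} + \lambda \sum_{j} \left( \lvert \text{slope}_j \rvert + \text{slope}_j^{\,2} \right)
```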

The reason you use Ridge regularization when you believe all your variables are useful, and Lasso when you suspect some are not, is the difference between squaring values in Ridge and taking absolute values in Lasso. Essentially, it is possible for a predictor's slope to shrink to exactly 0 when you take the absolute value, but when you square a term it can only ever get very small, never exactly 0. Because of this, Lasso and Elastic Net regularization can set some predictors to exactly 0, eliminating those predictors from the model. This also addresses collinearity: when two or more predictors are collinear, the regularization step will set all but one of the collinear predictors to 0, so only one of them remains in the final model. Ridge, on the other hand, will always keep a value for each predictor, even if it is very small; in Ridge regularization, the smaller a predictor's value, the less variance that predictor accounts for in the model. In this way, Lasso and Elastic Net regularization let the model eliminate useless or collinear predictors for you instead of you removing them in a semi-arbitrary way (NICE!).
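Here is a minimal sketch of that behavior, again with scikit-learn on simulated data (one useless predictor and one nearly-duplicate, collinear predictor): Lasso tends to drive unneeded slopes to exactly 0, while Ridge only shrinks them.

```python
# A minimal sketch contrasting Lasso and Ridge on data with one useless and
# one collinear predictor (simulated data, not the tutorial's dataset).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly identical to x1 (collinear)
x3 = rng.normal(size=n)                    # useless predictor
X = np.column_stack([x1, x2, x3])
y = 4 * x1 + rng.normal(size=n)            # only x1 actually drives the outcome

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
# Lasso typically zeroes out the useless predictor and one of the collinear pair.
print("Lasso slopes:", lasso.coef_)
# Ridge keeps all three slopes nonzero (the collinear pair share the effect).
print("Ridge slopes:", ridge.coef_)
```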

Two really important limitations of regularization are:

  1. When working with collinear variables, because of the way the math works behind the scenes, Lasso and Elastic Net will typically keep only one of the collinear variables, and which one is kept can depend on the order in which the variables enter the model, so be mindful of how you order the model variables. In practical terms, this means you should enter terms into the model in order of importance based on domain knowledge.
  2. These regularization techniques only work with models where the relationship between the predictors and the outcome (or the model's parameters) is linear. Of the methods mentioned in this tutorial, the only algorithms regularization works with are linear regression, logistic regression, and support vector machines (see the sketch below).
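For reference, here is a minimal sketch of where the regularization "knob" lives for these three model types in scikit-learn; the parameter names below are scikit-learn's, and other libraries name them differently.

```python
# A minimal sketch of turning on regularization for the three compatible
# model types in scikit-learn (other libraries name these knobs differently).
from sklearn.linear_model import ElasticNet, Lasso, LogisticRegression, Ridge
from sklearn.svm import LinearSVC

# Linear regression: choose the penalty by choosing the estimator;
# alpha plays the role of lambda.
linear_lasso = Lasso(alpha=0.1)
linear_ridge = Ridge(alpha=0.1)
linear_enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of lasso and ridge

# Logistic regression: penalty can be "l2" (ridge, the default), "l1" (lasso,
# needs solver="liblinear" or "saga"), or "elasticnet" (needs solver="saga").
# C is the inverse of lambda: smaller C means a stronger penalty.
logit_ridge = LogisticRegression(penalty="l2", C=1.0)

# Support vector machine: C is again an inverse penalty strength.
svm = LinearSVC(C=1.0)
```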