18 Glossary of Terms
- 4 Types of Variables - In data their are 4 variable types, nominal, ordinal, discrete, and continuous. Nominal and ordinal are both qualitative meaning that the numbers listed represent a category or concept. Nominal data are categories without any order. For example, gender is a question that is categorical in nature because a gender of 1 means that 1 in representing an idea. The concepts of men and women are not in any specific order or ranking. Ordinal data is data that is categorical but ordered in some way. For example, being in a high, medium, or low category mean something in relation to each other. High must be in order above medium and low making it ordinal. Quantitative variables or variables that represent numerically countable things come in discrete or continuous. Continuous means that this number could be anything infinitily on the number line while discrete means that it must be an integer.
- Binary Variables - A binary variable is a type of categorical/quantitative variable that only has two values. For instance, all variables with yes/no answers are binary.
- Collinearity - Collinearity occurs when two variables are correlated with each other. This can cause issues in many analyses because it challenges underlying assumptions that variables are independent of each other.
- Decision Boundary - A decision Boundary is a line or plane that delineates the boundary between data points. In linear regression it is a line that assumes above the line is one classification and below the line is another.
- Feature Selection - Feature Selection is a process that eliminates unnecessary features (AKA predictors or variables) from the data to help a model perform as well as possible with given data.
- Generalization error - Generalization error is a measures of how well an algorithm performs on the testing data. Reminder: testing data is held out at the train and test split step at the beginning of a machine learning project.
- Hyperparameters - Hyperparameters are any variable in a model that changes the accuracy and precision of a model that are not learned from the data. This is in contrast to variables in the model that are learned from the data (i.e. parameters). For example, in a random forest model the number of decision trees that you want the computer to run is a hyperparameter but the best predictor to sample at the first node of each tree is a parameter as that is what the model learns using the data. Thinking more about the best start value for different types of hyperparameters is a whole book/tutorial in itself! In fact someone wrote a book on how to tune hyperparameters in R that is freely available. Please see this book and seek out additional resources on your own when tuning hyperparameters for your own analyses.
- Lost/Cost Function - A loss/cost function quantifies the difference between the predicted values and the actual values, measuring the model’s performance. The goal of machine learning is to minimize the loss function to improve the model’s accuracy and generalizability.There are many different types of functions that measure model performance so loss/cost function is an umbrella term. A loss function refers to the performance of a single data point while a cost function refers to the average performance across a dataset.
18.1 Package Versions
Below I have listed the R and package versions I am using. If you are reading this tutorial in the future after updates have broken code, you can revert back.
- R Version
- 4.3.3
- tidyverse
- Version 2.0.0
- knitr
- Version 1.46
- here
- Version 1.0.1
- janitor
- Version 2.2.0
- ggpubr
- Version 0.6.0
- car
- Version 3.1.2
- knitr
- Version 1.46
- mlr
- Version 2.19.1
- parallelMap
- Version 1.5.1
- parallel
- Version 4.3.3
- randomForest
- Version 4.7.1.1