Most of the following is paraphrased from The Elements of Statistical Learning, which is available as a PDF.
Variables can be inputs or outputs
- Inputs can also be called: predictors, independent variables, features
- Outputs can also be called: responses, dependent variables
Variables can be quantitative or qualitative
- Qualitative variables are also known as categorical variables, discrete variables, factors
- Mathematically, qualitative variables are represented through numerical codes.
- When there are only two categories or classes, the variable can be represented as a boolean with values of 0 or 1.
- There are multiple methods for coding multiclass categorical variables. See: dummy variables and scikit-learn's encoder classes (a minimal sketch follows this list).
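A rough sketch of both coding approaches, assuming pandas and scikit-learn are installed; the toy "color" column is made up, and OneHotEncoder is just one of scikit-learn's encoder classes:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A qualitative (categorical) variable with three categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Dummy coding with pandas: one 0/1 indicator column per category.
print(pd.get_dummies(df["color"]))

# The same coding with one of scikit-learn's encoder classes.
enc = OneHotEncoder()
codes = enc.fit_transform(df[["color"]]).toarray()
print(enc.categories_)
print(codes)
```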
Supervised learning: "using inputs to predict the values of the outputs"
- Regression: predicting quantitative outputs
- Classification: predicting qualitative outputs
- Classifier: algorithm used to solve a classification problem
- The term classification usually refers specifically to the supervised setting
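A small illustration of the regression/classification split, assuming scikit-learn; the toy data and the choice of LinearRegression and LogisticRegression are arbitrary, not from the book:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # inputs (predictors/features)

# Regression: the output is quantitative (a real number).
y_quantitative = np.array([1.1, 1.9, 3.2, 3.9])
reg = LinearRegression().fit(X, y_quantitative)
print(reg.predict([[2.5]]))  # a real-valued prediction

# Classification: the output is qualitative (a class coded 0 or 1).
y_qualitative = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_qualitative)
print(clf.predict([[2.5]]))  # a predicted class label
```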
Unsupervised learning: "describe associations and patterns among a set of inputs"
- Clustering: using unsupervised learning to group inputs into categories
- Unsupervised learning is often used to generate new input variables for supervised learning.
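A minimal clustering sketch, assuming scikit-learn's KMeans on made-up data; the last step shows the common trick of feeding cluster assignments back in as a new input variable for a supervised model:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # inputs only, with no known outputs

# Clustering: group the inputs into categories without any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # the discovered group for each input row

# The cluster assignment can then serve as a new (categorical) input
# variable for a downstream supervised model.
X_augmented = np.column_stack([X, kmeans.labels_])
print(X_augmented[:3])
```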
How supervised learning works:
- Training data: set of input data with known output values used to learn a prediction rule or model
- Test data: set of input data with known output values used to evaluate the performance of a given model
- A model encodes a hypothesis about the relationship between the inputs and outputs, usually in the form of a mathematical formula
- This formula has parameters or coefficients that need to be fit to the training data
- Loss function: used to measure the fit, i.e. how far the model's predictions are from the known output values
- Training a model: finding the set of parameters that optimizes the fit by minimizing the loss function
- There are many optimization algorithms available; one of the most popular is stochastic gradient descent (a worked sketch follows this list)
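To make those pieces concrete, here is a hand-rolled sketch: a linear model fit to synthetic training data by minimizing a squared-error loss with plain (full-batch) gradient descent. The data, learning rate, and iteration count are arbitrary assumptions; a stochastic variant would compute the gradient on random mini-batches instead of the full training set.

```python
import numpy as np

# Training data: inputs X with known output values y (synthetic here).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Model (hypothesis): y ≈ X @ w, with parameters w to be fit.
w = np.zeros(3)

def loss(w):
    # Squared-error loss: how far predictions are from the known outputs.
    return np.mean((X @ w - y) ** 2)

# Training: adjust w to minimize the loss via gradient descent.
learning_rate = 0.1
for _ in range(500):
    gradient = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * gradient

print(loss(w))  # should be close to the noise level
print(w)        # should be close to true_w
```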
Fernández-Delgado et al. (2014) use the term two-class data sets to describe binary classification problems, meaning the output being predicted has only two categories. Multiclass classification is a trickier problem: the output being predicted has more than two categories.
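A tiny illustration of the two settings, assuming scikit-learn's LogisticRegression (which accepts both binary and multiclass targets) and made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Two-class (binary) problem: the output has exactly two categories.
y_binary = np.array([0, 0, 0, 1, 1, 1])
print(LogisticRegression().fit(X, y_binary).predict([[2.5]]))

# Multiclass problem: the output has more than two categories, which
# generally needs extra machinery (e.g. one-vs-rest) or a natively
# multiclass model.
y_multiclass = np.array([0, 0, 1, 1, 2, 2])
print(LogisticRegression().fit(X, y_multiclass).predict([[2.5]]))
```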
Some useful background context: the no free lunch theorem states that no single type of classifier is best for all classification problems.
The paper evaluates 179 classifiers on 121 classification problems to compare their overall performance empirically.