
What is a classifier?

Most of the following is paraphrased from *The Elements of Statistical Learning*, which is available as a PDF.

Variables can be inputs or outputs

  • Inputs can also be called: predictors, independent variables, features
  • Outputs can also be called: responses, dependent variables

Variables can be quantitative or qualitative

  • Qualitative variables are also known as categorical variables, discrete variables, factors
  • Mathematically, qualitative variables are represented through numerical codes.
    • When there are only two categories or classes, the variable can be represented as a binary indicator with values of 0 or 1.
    • There are multiple methods for coding categorical variables with more than two classes; see dummy variables and scikit-learn's encoder classes (a minimal encoding sketch follows this list).
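
A minimal sketch of both encodings, assuming pandas and scikit-learn (the "color" column here is invented purely for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Dummy/one-hot coding with pandas: one 0/1 column per category
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies)

# The same idea with scikit-learn's OneHotEncoder (returns a sparse
# matrix by default, converted to a dense array here for printing)
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]]).toarray()
print(encoder.categories_)
print(encoded)
```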

Supervised learning: "using inputs to predict the values of the outputs"

  • Regression: predicting quantitative outputs
  • Classification: predicting qualitative outputs
  • Classifier: algorithm used to solve a classification problem
  • The term classification usually refers specifically to the supervised setting (a minimal sketch contrasting regression and classification follows this list)
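
A minimal sketch contrasting the two kinds of supervised output, using scikit-learn and toy data invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # inputs (features)

# Regression: the output is quantitative (a real number)
y_quantitative = np.array([1.1, 1.9, 3.2, 3.9])
regressor = LinearRegression().fit(X, y_quantitative)
print(regressor.predict([[2.5]]))  # a real-valued prediction

# Classification: the output is qualitative (a class label)
y_qualitative = np.array([0, 0, 1, 1])
classifier = LogisticRegression().fit(X, y_qualitative)
print(classifier.predict([[2.5]]))  # a predicted class, 0 or 1
```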

Unsupervised learning: "describe associations and patterns among a set of inputs"

  • Clustering: using unsupervised learning to group inputs into categories
  • Unsupervised learning is often used to generate new input variables for supervised learning (a small sketch of this pattern follows this list).
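
A minimal sketch of this pattern, assuming scikit-learn's KMeans and toy data: the clusters are found without any output labels, and the resulting assignments are appended as a new input column.

```python
import numpy as np
from sklearn.cluster import KMeans

# Inputs only; no output values are involved
X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])

# Group the inputs into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_labels = kmeans.labels_  # e.g. [0, 0, 1, 1]

# Use the cluster assignments as an extra feature for a later supervised model
X_augmented = np.column_stack([X, cluster_labels])
print(X_augmented)
```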

How supervised learning works:

  • Training data: set of input data with known output values used to learn a prediction rule or model
  • Test data: set of input data with known output values used to evaluate the performance of a given model
  • A model presents a hypothesis about the relationship between the outputs and inputs, usually in the form of a mathematical formula
    • This formula has parameters or coefficients that need to be fit to the training data
    • Loss function: used to measure the fit, i.e. evaluate the accuracy of the model's predictions
    • Training a model: finding the set of parameters that optimizes the fit by minimizing the loss function
    • There are many optimization algorithms available; one of the most popular is stochastic gradient descent (used in the sketch below)
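
Putting those pieces together, here is a minimal sketch of the workflow; the iris dataset, the 25% test split, and SGDClassifier with its default hinge loss are choices made purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Training data: used to fit the model's parameters
# Test data: held out to evaluate the fitted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# SGDClassifier fits a linear model by minimizing its loss function with
# stochastic gradient descent (other losses can be chosen via `loss`)
model = SGDClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluate the predictions on data the model has never seen
print(accuracy_score(y_test, model.predict(X_test)))
```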

Fernández-Delgado et al. (2014) use the term two-class data sets to describe binary classification problems, where the output being predicted has only two categories. Multiclass classification is a trickier problem: the output being predicted has more than two categories.
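
As a small illustration of the difference, here is a sketch using two datasets bundled with scikit-learn (chosen only because they ship with the library): one has a two-class output, the other a multiclass output.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_iris

# Two-class ("binary") problem: the output has exactly two categories
_, y_binary = load_breast_cancer(return_X_y=True)
print(np.unique(y_binary))  # [0 1]

# Multiclass problem: the output has more than two categories
_, y_multiclass = load_iris(return_X_y=True)
print(np.unique(y_multiclass))  # [0 1 2]
```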

Some useful background context: the no free lunch theorem states that no single type of classifier is best for all classification problems.

The paper evaluates 179 classifiers on 121 classification problems to empirically compare their overall performance.
