# 5 metrics you need to know in Classification

Performance evaluation plays a crucial role in machine learning. How we evaluate machine learning models affects which model we finally pick and, most importantly, the performance of the resulting application.

In this post, we are going to introduce 5 common metrics for evaluating classifiers, and we hope this will help you understand the pros and cons of each metric.

## Confusion Matrix

The confusion matrix is an important tool for assessing the performance of a classifier. It is a table representation of the classifier's results: the rows represent the predicted class and the columns represent the actual class.

The confusion matrix of a binary classifier consists of 4 entries: **True Positive (TP)**, **True Negative (TN)**, **False Positive (FP)** and **False Negative (FN)**.

Imagine we are evaluating a red panda classifier that distinguishes red pandas from raccoons.

- **True Positives** are red pandas classified as red pandas
- **True Negatives** are raccoons classified as raccoons
- **False Positives** are raccoons classified as red pandas
- **False Negatives** are red pandas classified as raccoons
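The four entries above can be tallied with a few lines of code. This is a minimal sketch; the `confusion_counts` helper and the example labels are made up for illustration:

```python
# Tally the four confusion-matrix entries for a binary classifier.
def confusion_counts(actual, predicted, positive="red panda"):
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # red panda classified as red panda
        elif a == positive:
            fn += 1  # red panda classified as raccoon
        elif p == positive:
            fp += 1  # raccoon classified as red panda
        else:
            tn += 1  # raccoon classified as raccoon
    return tp, tn, fp, fn

actual    = ["red panda", "red panda", "raccoon",   "raccoon"]
predicted = ["red panda", "raccoon",   "red panda", "raccoon"]
print(confusion_counts(actual, predicted))  # (1, 1, 1, 1)
```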

## Accuracy

Accuracy is the ratio between the number of correctly classified objects and the size of the entire data set.

It is defined as:

$$Accuracy = \frac{TP+TN}{TP+FP+TN+FN}$$

Accuracy is a good metric for storytelling due to its simplicity. However, it is also a very coarse metric, and in some situations it may not truly reflect the performance of the model.

For example, in the case of fraud detection, the positive class (fraudulent events) is very rare compared to the entire population. Let's say 1 out of 10,000 events is fraudulent. A dummy classifier that simply labels everything non-fraudulent can score a high accuracy:

- TP = 0, FP = 0
- TN = 9,999, FN = 1
- TP + FP + TN + FN = 10,000

$$Accuracy = \frac{9,999}{10,000} = 99.99\%$$
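The dummy-classifier arithmetic above can be checked directly. A minimal sketch, with the `accuracy` helper defined here for illustration:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Dummy fraud classifier: labels all 10,000 events non-fraudulent,
# of which exactly 1 is actually fraudulent.
print(accuracy(tp=0, tn=9_999, fp=0, fn=1))  # 0.9999
```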

## Recall

Recall is the ratio between the **number of correctly identified positive samples** and **size of all positive samples**:

$$Recall = \frac{TP}{P} = \frac{TP}{TP+FN}$$

It reflects how many of the positive samples can be “recalled” by the classifier.

Recall is also called sensitivity or true positive rate.
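In code, recall falls straight out of the confusion-matrix entries. The counts below are made up for the red panda example:

```python
# Recall = TP / (TP + FN): the share of actual red pandas
# that the classifier managed to find.
def recall(tp, fn):
    return tp / (tp + fn)

print(recall(tp=8, fn=2))  # 0.8 -- 8 of the 10 red pandas were found
```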

## Precision

Precision is the ratio between the **number of correctly identified positive samples** and the **number of samples predicted as positive**:

$$Precision = \frac{TP}{TP+FP}$$

It reflects how precise the classifier is when classifying positive samples.
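As with recall, precision is a one-liner over the confusion-matrix entries; the counts are again made up for illustration:

```python
# Precision = TP / (TP + FP): of everything labelled "red panda",
# the share that really was a red panda.
def precision(tp, fp):
    return tp / (tp + fp)

print(precision(tp=8, fp=2))  # 0.8 -- 8 of 10 "red panda" labels were right
```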

## AUC

AUC stands for **Area Under the ROC (Receiver Operating Characteristic) Curve**.

The ROC curve is a representation of the performance of a classifier under different decision thresholds: it plots the true positive rate (recall) against the false positive rate, one point per threshold.

Imagine our red panda classifier version 2.0 outputs a probability **P(red panda)** instead of a class (red panda or not). The confidence level of the classification can be adjusted by picking a different threshold.

For instance, we can set a 20% threshold to get better recall (because more samples are classified as red pandas), or we can set an 80% threshold to make the classifier more precise (ideally, samples that are less likely to be red pandas receive lower probabilities). Each threshold's performance can be turned into a data point; connecting these points gives a curve:

AUC is the area under this curve.

The ROC curve always starts at (0, 0) and ends at (1, 1). These two points correspond to thresholds of 100% and 0% respectively: at 100% nothing is classified as positive, and at 0% everything is.
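The whole procedure can be sketched in a few lines: sweep the threshold, record one (FPR, TPR) point per threshold, and integrate with the trapezoidal rule. The helper names, labels (1 for red panda, 0 for raccoon) and scores are all made up for illustration:

```python
# One (FPR, TPR) point per decision threshold.
def roc_points(y_true, scores, thresholds):
    pos = sum(y_true)            # number of actual positives
    neg = len(y_true) - pos      # number of actual negatives
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

# Area under the curve via the trapezoidal rule, sorted by FPR.
def auc(points):
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

y_true = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2]  # hypothetical P(red panda) outputs
pts = roc_points(y_true, scores, thresholds=[1.1, 0.8, 0.5, 0.3, 0.0])
print(auc(pts))  # 1.0 -- these scores separate the classes perfectly
```

Note that the threshold above the maximum score yields (0, 0) and the threshold of 0 yields (1, 1), matching the endpoints described above.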

## F-Score

F-score is a metric that takes both recall and precision into account. The idea is to have a metric that only scores high when recall and precision are both high.

This metric is extremely useful in situations where recall and precision are both important, for instance conversion rate estimation in digital marketing.

The F-Score is defined as

$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}$$

The value \(\beta\) controls the weighting between recall and precision: a higher \(\beta\) weights recall more heavily, and vice versa.

Normally the value of \(\beta\) is either \(0.5\), \(1\) or \(2\).
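The formula translates directly into code. A minimal sketch; the precision and recall values below are made up for illustration:

```python
# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
def f_score(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_score(0.8, 0.5), 3))           # 0.615 -- F1, the harmonic mean
print(round(f_score(0.8, 0.5, beta=2), 3))   # F2 weights recall more heavily
```

Because recall (0.5) is the weaker of the two here, the F2 score lands closer to it than F1 does.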

## References

- Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig