## Classification Basics

To map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold).
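As a minimal sketch of thresholding, the snippet below maps predicted probabilities to a binary class. The function name `classify`, the default threshold of 0.5, and the choice to treat a value equal to the threshold as positive are illustrative assumptions, not something the text prescribes.

```python
def classify(probability, threshold=0.5):
    """Return 1 (positive) if the probability reaches the threshold, else 0.

    The 0.5 default and the >= comparison are assumptions for illustration;
    a real threshold is problem-dependent and should be tuned.
    """
    return 1 if probability >= threshold else 0


scores = [0.05, 0.40, 0.62, 0.97]          # hypothetical model outputs
default_classes = [classify(p) for p in scores]
strict_classes = [classify(p, threshold=0.9) for p in scores]

print(default_classes)  # [0, 0, 1, 1]
print(strict_classes)   # [0, 0, 0, 1]
```

Raising the threshold turns some positive predictions into negative ones, which is why the threshold has to be chosen with the cost of each error type in mind.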

### Evaluation Metrics

A **true positive** is an outcome where the model correctly predicts the positive class. Similarly, a **true negative** is an outcome where the model correctly predicts the negative class.

A **false positive** is an outcome where the model incorrectly predicts the positive class. A **false negative** is an outcome where the model incorrectly predicts the negative class.

### Accuracy

Formally, accuracy has the following definition:

$$ Accuracy = \frac{Number \ of \ correct \ predictions}{Total \ number \ of \ predictions} $$

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

Where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.

```
TP FP
FN TN
```

Accuracy alone doesn’t tell the full story when you’re working with a **class-imbalanced** data set.
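To illustrate why accuracy can mislead on imbalanced data, here is a small sketch that applies the formula above to a hypothetical data set of 100 examples with only 9 positives. The counts are made up for the example.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical data set: 9 positive and 91 negative examples.
# A model that always predicts "negative" gets every negative right
# and every positive wrong: TP=0, FN=9, TN=91, FP=0.
always_negative = accuracy(tp=0, tn=91, fp=0, fn=9)

print(always_negative)  # 0.91
```

The model scores 91% accuracy without identifying a single positive example, which is the gap that precision and recall are meant to expose.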

### Precision and Recall

#### Precision

What proportion of positive identifications was actually correct?

It’s defined as follows:

$$ Precision = \frac{TP}{TP + FP} $$

A model that produces no false positives has a precision of 1.0.

#### Recall

What proportion of actual positives was identified correctly?

It’s defined as follows:

$$ Recall = \frac{TP}{TP + FN} $$

A model that produces no false negatives has a recall of 1.0.
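The two definitions can be sketched directly from labels and predictions. The helper names and the sample data below are hypothetical; returning `nan` when a denominator is zero is one possible convention, chosen here only to avoid a division error.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, FN, TN for binary labels and predictions (1 = positive)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    """TP / (TP + FP); nan if the model made no positive predictions."""
    return tp / (tp + fp) if (tp + fp) else float("nan")


def recall(tp, fn):
    """TP / (TP + FN); nan if the data contains no actual positives."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


labels      = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical ground truth
predictions = [1, 0, 1, 1, 0, 0, 0, 1]   # hypothetical model output

tp, fp, fn, tn = confusion_counts(labels, predictions)
print(tp, fp, fn, tn)                    # 3 1 1 3
print(precision(tp, fp), recall(tp, fn)) # 0.75 0.75
```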

### ROC Curve and AUC

#### ROC Curve

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives. The following figure shows a typical ROC curve:

This curve plots two parameters:

- **True Positive Rate (TPR)** is a synonym for recall and is therefore defined as follows: $$ TPR = \frac{TP}{TP + FN} $$
- **False Positive Rate (FPR)** is defined as follows: $$ FPR = \frac{FP}{FP + TN} $$
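The two rates above can be turned into ROC points by sweeping the threshold over the observed scores. This is a simple quadratic-time sketch for illustration, not an optimized implementation; the function name and the sample data are hypothetical, and it assumes the data contains at least one positive and one negative example.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest.

    An example is predicted positive when its score >= threshold.
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [0, 0, 1, 0, 1, 1]                  # hypothetical ground truth
scores = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]     # hypothetical model scores

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each step down in threshold can only add positive predictions, so both coordinates are non-decreasing and the curve ends at (1, 1).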

#### AUC

AUC stands for “Area Under the ROC Curve”. That is, AUC measures the entire two-dimensional area underneath the ROC curve (think integral calculus).

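AUC represents the probability that the model ranks a random positive example higher than a random negative example. If the examples are laid out from left to right in ascending order of predicted score, this is the probability that a random positive (green) example is positioned to the right of a random negative (red) example.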

AUC ranges in value from 0 to 1:

- A model whose predictions are 100% wrong has an AUC of 0.0.
- A model whose predictions are 100% correct has an AUC of 1.0.

AUC is desirable for the following two reasons:

- AUC is **scale-invariant**. It measures how well predictions are ranked, rather than their absolute values.
- AUC is **classification-threshold-invariant**. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
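The ranking interpretation and the scale invariance can both be checked with a brute-force sketch that compares every positive example with every negative one. The function name and data are hypothetical, and counting ties as half a win is a common convention assumed here rather than something stated in the text.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win (an assumed convention). Requires at least one
    positive and one negative example.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 0, 1, 1]                  # hypothetical ground truth
scores = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]     # hypothetical model scores

auc = pairwise_auc(labels, scores)           # 8 of 9 pairs are ranked correctly

# Rescaling the scores changes their absolute values but not their order,
# so the AUC is unchanged.
squashed = [s / 10 for s in scores]
auc_squashed = pairwise_auc(labels, squashed)

print(auc, auc_squashed)
```

No threshold appears anywhere in the computation, which is the threshold invariance described above.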

#### Two Caveats of AUC

Scale invariance is not always desirable. For example, sometimes we really do need well-calibrated probability outputs, and AUC won’t tell us about that.

Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. AUC isn’t a useful metric for that kind of optimization.

### Prediction Bias

**Prediction bias** is a quantity that measures how far apart the average of predictions and the average of observations are. It’s defined as follows:

$$ prediction\ bias = average \ of \ predictions - average \ of \ labels \ in \ data \ set $$

Possible root causes of prediction bias are:

- Incomplete feature set
- Noisy data set
- Buggy pipeline
- Biased training sample
- Overly strong regularization

Don’t try to reduce prediction bias by adding a calibration layer that adjusts your model’s output. That fixes the symptom rather than the cause.

Note: A good model will usually have near-zero bias. That said, a low prediction bias doesn’t prove that your model is good. A really terrible model could have zero prediction bias. For example, a model that just predicts the mean value for all examples would be a bad model, despite having zero bias.
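The point in the note can be sketched numerically: a model that outputs the label mean for every example has zero prediction bias even though it cannot tell examples apart. The function name and data below are hypothetical.

```python
def prediction_bias(predictions, labels):
    """Average of predictions minus average of labels."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


labels = [0, 0, 0, 1]                        # hypothetical labels, mean 0.25

# A "model" that ignores its input and always predicts the label mean.
mean_label = sum(labels) / len(labels)
constant_model = [mean_label] * len(labels)

# A model that systematically over-predicts.
overconfident_model = [0.5, 0.5, 0.5, 0.5]

print(prediction_bias(constant_model, labels))       # 0.0
print(prediction_bias(overconfident_model, labels))  # 0.25
```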

#### Bucketing and Prediction Bias

Prediction bias for logistic regression only makes sense when grouping enough examples together to be able to compare a predicted value to observed values.

You can form buckets in the following ways:

- Linearly breaking up the target predictions.
- Forming quantiles.
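As a rough sketch of the quantile approach, the snippet below sorts examples by prediction, splits them into equal-count buckets, and compares the mean prediction with the mean label in each bucket. The function name, the sample data, and the choice to put leftover examples in the last bucket are assumptions made for illustration.

```python
def bucket_calibration(predictions, labels, num_buckets):
    """Return (mean prediction, mean label) for each equal-count bucket.

    Examples are sorted by prediction. Any remainder goes into the last
    bucket. Assumes len(predictions) >= num_buckets.
    """
    pairs = sorted(zip(predictions, labels))
    size = len(pairs) // num_buckets
    rows = []
    for b in range(num_buckets):
        start = b * size
        end = len(pairs) if b == num_buckets - 1 else start + size
        chunk = pairs[start:end]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_label = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, mean_label))
    return rows


predictions = [0.1, 0.2, 0.2, 0.3, 0.7, 0.8, 0.8, 0.9]  # hypothetical
labels      = [0,   0,   0,   1,   1,   1,   0,   1]    # hypothetical

rows = bucket_calibration(predictions, labels, num_buckets=2)
for mean_pred, mean_label in rows:
    print(f"mean prediction={mean_pred:.2f}  mean label={mean_label:.2f}")
```

A bucket whose mean prediction is far from its mean label points to the score range where the model is poorly calibrated.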

## Regularization for Sparsity: L1 Regularization

### Problem

Sparse vectors often contain many dimensions. Creating a feature cross results in even more dimensions. Given such high-dimensional feature vectors, model size may become huge and require huge amounts of RAM.

### Solution

In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. **A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.** We can encode this idea into the optimization problem solved at training time by adding an appropriately chosen regularization term.

#### L0 and L2 Regularization

L0 regularization penalizes the count of non-zero coefficient values in a model. Though this count-based approach is intuitively appealing, it would turn our convex optimization problem into a non-convex optimization problem that’s NP-hard, so it’s not something we can use effectively in practice.

**L2 regularization encourages weights to be small, but doesn’t force them to exactly 0.0.**

#### L1 Regularization

L1 regularization serves as an approximation to L0, but has the advantage of being convex and thus efficient to compute. **So we can use L1 regularization to encourage many of the uninformative coefficients in our model to be exactly 0, and thus reap RAM savings at inference time**.

### L1 vs L2 Regularization

L2 and L1 penalize weights differently:

- L2 penalizes $$ weight^2 $$.
- L1 penalizes $$ \lvert weight \rvert $$.

L2 and L1 have different derivatives:

- The derivative of L2 is 2 * weight. It acts like a force that removes x% of the weight every time, so L2 does not normally drive weights to zero.
- The derivative of L1 is k (a constant whose value is independent of weight). It acts like a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out.
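The difference between the two forces can be sketched with a toy update that applies only the regularization term and ignores the gradient of the data loss. The function names and the single `rate` parameter (standing in for learning rate times regularization strength) are hypothetical simplifications.

```python
def l2_step(weight, rate):
    """Shrink the weight by a fixed fraction: the gradient of weight^2 is 2 * weight."""
    return weight - rate * 2 * weight


def l1_step(weight, rate):
    """Move the weight toward 0 by a constant amount.

    If the step would cross 0, the weight is set to exactly 0.
    """
    if abs(weight) <= rate:
        return 0.0
    return weight - rate if weight > 0 else weight + rate


w_l1 = w_l2 = 0.5   # same starting weight for both penalties
for _ in range(10):
    w_l1 = l1_step(w_l1, rate=0.1)
    w_l2 = l2_step(w_l2, rate=0.1)

print(w_l1)  # 0.0
print(w_l2)  # small but still positive
```

After the same number of steps, the L1 weight has reached exactly 0 while the L2 weight has only decayed geometrically, which matches the behavior described above.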
