## Concepts

Google Online Machine Learning Course

### Machine Learning

ML systems learn how to combine input to produce useful predictions on never-before-seen data.

#### Terminology

- Labels
- Features
- Examples

- labeled examples

`labeled examples: {features, label}: (x, y)`

- unlabeled examples

`unlabeled examples: {features, ?}: (x, ?)`

- Models

A model defines the relationship between features and label. Two phases of a model’s life:

- Training
- Inference

Regression vs. Classification

- Regression: A regression model predicts continuous values.
- Classification: A classification model predicts discrete values.

### Linear Regression

- Definition: $$ y = b + Wx $$
- L2 Loss: $$ L_2\ \text{loss} = \sum_{(x,y) \in D}(y - \text{prediction}(x))^2 $$
- The sum runs over all examples in the training set.
- Mean squared error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples, N. $$ MSE = \frac{1}{N} \sum_{(x,y) \in D}(y - \text{prediction}(x))^2 $$
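
The two formulas above can be sketched in a few lines of plain Python. The dataset, the `linear_model` function, and its values `b = 1` and `w = 2` are made up for illustration and are not part of the course material.

```python
def l2_loss(examples, predict):
    """Sum of squared errors over all (x, y) pairs in the dataset."""
    return sum((y - predict(x)) ** 2 for x, y in examples)


def mse(examples, predict):
    """Mean squared error: the L2 loss divided by the number of examples."""
    return l2_loss(examples, predict) / len(examples)


# Hypothetical toy dataset of (feature, label) pairs.
dataset = [(0.0, 1.0), (1.0, 3.5), (2.0, 4.0)]


def linear_model(x, b=1.0, w=2.0):
    """Assumed model y' = b + w * x with hand-picked b and w."""
    return b + w * x


print(l2_loss(dataset, linear_model))  # squared errors are 0, 0.25 and 1
print(mse(dataset, linear_model))      # the same sum divided by 3 examples
```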

### Reducing Loss

#### Iterative trial-and-error process

A machine learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until it arrives at the weights and bias with the lowest possible loss.

#### Gradient Descent

Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.

In machine learning, gradients are used in gradient descent. We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.

A gradient is a vector, so it has two characteristics:

- a direction
- a magnitude

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

Gradient descent then repeats this process, edging ever closer to the minimum.
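
As a minimal sketch of this loop, the code below fits the linear model `y' = b + w * x` by repeatedly stepping against the gradient of the MSE. The data, the starting point of zero, the step count, and the fixed `learning_rate` multiplier (covered in the Learning Rate section) are all illustrative assumptions.

```python
def mse_gradient(examples, b, w):
    """Gradient of MSE with respect to (b, w) for the model y' = b + w * x."""
    n = len(examples)
    grad_b = sum(-2 * (y - (b + w * x)) for x, y in examples) / n
    grad_w = sum(-2 * x * (y - (b + w * x)) for x, y in examples) / n
    return grad_b, grad_w


def gradient_descent(examples, learning_rate=0.1, steps=1000):
    b, w = 0.0, 0.0  # initial guess for bias and weight
    for _ in range(steps):
        grad_b, grad_w = mse_gradient(examples, b, w)
        # Step in the direction of the negative gradient to reduce loss.
        b -= learning_rate * grad_b
        w -= learning_rate * grad_w
    return b, w


# Hypothetical data generated from y = 1 + 2x, so the minimum is at b = 1, w = 2.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
b, w = gradient_descent(data)
print(b, w)  # both values should end up very close to 1 and 2
```

Because MSE for a linear model is convex, the loop heads toward the single minimum regardless of the initial guess, provided the step multiplier is small enough.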

#### Learning Rate

Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
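
Written as an update rule, one step might read as follows. The symbols $$ \eta $$ (learning rate), $$ w $$ (current point), and $$ L $$ (loss function) are assumed here for illustration and do not appear elsewhere in these notes.

$$ w_{\text{next}} = w - \eta \, \nabla L(w) $$

For example, with a gradient magnitude of 2.5 and a learning rate of 0.01, the next point lies 0.025 away from the current one.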

**Hyperparameters** are the knobs that programmers tweak in ML algorithms.

##### Small Learning Rate

##### Large Learning Rate

##### Right Learning Rate

##### Ideal Learning Rate

The ideal learning rate in one dimension is

$$ \frac{1}{f^{\prime\prime}(x)} $$

(the inverse of the second derivative of f(x) at x).

The ideal learning rate for two or more dimensions is the inverse of the Hessian (the matrix of second partial derivatives).

In practice, finding a “perfect” (or near-perfect) learning rate is not essential for successful model training. The goal is to find a learning rate large enough that gradient descent converges efficiently, but not so large that it never converges.
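
The effect of the learning rate can be illustrated on a one-dimensional quadratic. The sketch below assumes the function f(x) = (x - 3)^2, chosen only because its second derivative is the constant 2, which makes the ideal rate 1/2; the specific rates 0.01 and 1.1 are likewise arbitrary examples of "too small" and "too large".

```python
def descend(learning_rate, start=0.0, steps=20):
    """Run gradient descent on f(x) = (x - 3)^2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * (x - 3)  # f'(x)
        x -= learning_rate * gradient
    return x


small = descend(0.01)          # moves toward 3, but slowly
ideal = descend(0.5, steps=1)  # 1 / f''(x) = 1 / 2 reaches the minimum in one step
large = descend(1.1)           # overshoots further on every step and diverges

print(small, ideal, large)
```

For this function any rate below 1 converges, which matches the point above: the rate does not need to be perfect, only large enough to make progress and small enough to avoid divergence.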

##### Stochastic gradient descent (SGD)

SGD uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term “stochastic” indicates that the one example comprising each batch is chosen at random.

##### Mini-batch stochastic gradient descent (mini-batch SGD)

Mini-batch SGD is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Performing gradient descent on a small batch, or even a batch of one example, is usually more efficient than using the full batch. After all, finding the gradient of one example is far cheaper than finding the gradient of millions of examples.
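
A rough sketch of both variants follows, using only the standard library. The function name `minibatch_sgd`, the four-example dataset, the seed, and the hyperparameter values are assumptions made for illustration; setting `batch_size=1` gives plain SGD, and real mini-batches are much larger than the two examples used here.

```python
import random


def minibatch_sgd(examples, batch_size, learning_rate=0.05, steps=2000, seed=0):
    """Fit y' = b + w * x, estimating the MSE gradient from a random batch."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    b, w = 0.0, 0.0
    for _ in range(steps):
        # Choose the batch at random; this is the "stochastic" part.
        batch = rng.sample(examples, batch_size)
        grad_b = sum(-2 * (y - (b + w * x)) for x, y in batch) / batch_size
        grad_w = sum(-2 * x * (y - (b + w * x)) for x, y in batch) / batch_size
        b -= learning_rate * grad_b
        w -= learning_rate * grad_w
    return b, w


# Hypothetical noise-free data generated from y = 1 + 2x.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

sgd_b, sgd_w = minibatch_sgd(data, batch_size=1)  # plain SGD
mb_b, mb_w = minibatch_sgd(data, batch_size=2)    # mini-batch SGD
print(sgd_b, sgd_w, mb_b, mb_w)
```

Each step touches only `batch_size` examples instead of the whole dataset, which is where the cost saving comes from.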