## Linear Regression

Google Online Machine Learning Course

TensorFlow consists of the following two components:

• a graph protocol buffer
• a runtime that executes the distributed graph

### Pseudocode for Linear Regression

    import tensorflow as tf

    # Set up a linear regressor. `feature_columns`, `train_input_fn`, and
    # `predict_input_fn` are assumed to be defined elsewhere.
    regressor = tf.estimator.LinearRegressor(feature_columns=feature_columns)

    # Train the model on some example data.
    regressor.train(input_fn=train_input_fn, steps=2000)

    # Use it to predict.
    predictions = regressor.predict(input_fn=predict_input_fn)


## Generalization

### Overfitting

Overfitting occurs when a model tries to fit the training data so closely that it does not generalize well to new data.

### Ockham’s Razor Principle

The less complex a model is, the more likely that a good empirical result is not just due to the peculiarities of our sample.

### Three Assumptions

1. We draw examples independently and identically distributed (i.i.d.) at random from the distribution.
2. The distribution is stationary: it doesn’t change over time.
3. We always draw from the same distribution, for the training, validation, and test sets alike.

## Split Data

1. Divide the data set into three subsets:

• training set: a subset used to train the model.
• validation set: a subset used to evaluate the model during training and to tune hyperparameters.
• test set: a subset used to test the trained model.
2. Never train on test data.
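The three-way split above can be sketched in plain Python. The function name `split_dataset`, the 70/15/15 percentages, and the fixed seed are illustrative assumptions, not values from the course.

```python
import random


def split_dataset(examples, train_pct=70, validation_pct=15, seed=0):
    """Shuffle examples and split them into train, validation, and test lists.

    The percentages are assumptions for illustration; whatever is left after
    the training and validation slices becomes the test set.
    """
    shuffled = list(examples)
    # Shuffle first so each subset is drawn from the same distribution.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


train_set, validation_set, test_set = split_dataset(range(100))
```

Because the three slices are taken from one shuffled list, no example can land in more than one subset, which is what keeps test data out of training.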

## Feature Engineering

### Mapping Raw Data to Features

#### Numeric Values

Integer and floating-point data don’t need a special encoding because they can be multiplied by a numeric weight.

#### Categorical Values

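Categorical values are typically represented with a one-hot encoding: a vector with one element per possible value, set to 1 for the value that is present and 0 everywhere else. One-hot encoding also extends to numeric data that you do not want to multiply directly by a weight, such as a postal code.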
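As a minimal sketch of one-hot encoding, the snippet below maps a postal code onto a small, made-up vocabulary. The helper `one_hot` and the three postal codes are hypothetical and serve only as illustration.

```python
def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


# Hypothetical vocabulary: postal codes are treated as categories, not numbers.
postal_codes = ["10001", "60601", "94043"]
encoded = one_hot("60601", postal_codes)
```

Each position then gets its own weight, so the model never multiplies the postal code itself by a number.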

### Sparse Representation

If a feature has a very large number of categorical values (say, 1,000,000 or more), a sparse representation, in which only nonzero values are stored, can represent the feature efficiently. An independent model weight is still learned for each feature value.
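One way to picture a sparse representation is a dictionary that maps the index of each nonzero entry to its value. The sketch below assumes that layout; the street names, weights, and helper names are hypothetical.

```python
def to_sparse(value, vocabulary_index):
    """Represent a one-hot vector by storing only its single nonzero entry."""
    return {vocabulary_index[value]: 1.0}


def sparse_dot(sparse_features, weights):
    """Dot product that touches only the stored (nonzero) entries."""
    return sum(v * weights[i] for i, v in sparse_features.items())


# Hypothetical vocabulary; a real one could hold a million entries.
street_names = ["Main Street", "Shorebird Way", "Rengstorff Avenue"]
vocabulary_index = {name: i for i, name in enumerate(street_names)}

# There is still one independent weight per feature value.
weights = [0.5, -1.25, 2.0]

sparse = to_sparse("Shorebird Way", vocabulary_index)
score = sparse_dot(sparse, weights)
```

The weight list stays as long as the vocabulary, but each example costs only as much storage and computation as it has nonzero entries.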

## Qualities of Good Features

1. Avoid rarely used discrete feature values:
• Good feature values should appear more than 5 or so times in a data set.
• If a feature’s value appears only once or very rarely, the model can’t make predictions based on that feature.
2. Prefer clear and obvious meanings.
3. Don’t mix “magic” values with actual data.
• For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing.
• For continuous variables, ensure missing values do not affect the model by using the mean value of the feature’s data.
4. Account for upstream instability.
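The two ways of handling missing values described in item 3 can be sketched as follows. Using `None` to mark a missing entry and `"MISSING"` as the extra category are assumptions made for this example.

```python
def fill_missing_continuous(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def fill_missing_discrete(values, missing_token="MISSING"):
    """Replace None entries with a dedicated extra category."""
    return [missing_token if v is None else v for v in values]


ratings = fill_missing_continuous([4.0, None, 2.0, 3.0])
colors = fill_missing_discrete(["red", None, "blue"])
```

Neither function puts a "magic" number such as -1 into the data, so the model cannot mistake a placeholder for a real measurement.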

## Cleaning Data

1. Scaling feature values: Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1).
2. Handling extreme outliers
3. Binning or bucketizing: Instead of keeping one floating-point feature, it is sometimes better to represent it as several distinct boolean features, one per bin. Python code demo:

        import numpy as np
        import pandas as pd
        import tensorflow as tf

        # `california_housing_dataframe` is assumed to be a pandas DataFrame
        # that was loaded elsewhere.

        def select_and_transform_features(source_df):
            # One [lower, upper) latitude range per degree, from 32 up to 44.
            LATITUDE_RANGES = zip(range(32, 44), range(33, 45))
            selected_examples = pd.DataFrame()
            selected_examples["median_income"] = source_df["median_income"]
            for r in LATITUDE_RANGES:
                # Boolean (1.0/0.0) feature: does the latitude fall in this range?
                selected_examples["latitude_%d_to_%d" % r] = source_df["latitude"].apply(
                    lambda lat: 1.0 if r[0] <= lat < r[1] else 0.0)
            return selected_examples

        def get_quantile_based_boundaries(feature_values, num_buckets):
            # Evenly spaced quantiles give buckets with roughly equal counts.
            boundaries = np.arange(1.0, num_buckets) / num_buckets
            quantiles = feature_values.quantile(boundaries)
            return [quantiles[q] for q in quantiles.keys()]

        # Divide households into 7 buckets.
        households = tf.feature_column.numeric_column("households")
        bucketized_households = tf.feature_column.bucketized_column(
            households, boundaries=get_quantile_based_boundaries(
                california_housing_dataframe["households"], 7))

        # Divide longitude into 10 buckets.
        longitude = tf.feature_column.numeric_column("longitude")
        bucketized_longitude = tf.feature_column.bucketized_column(
            longitude, boundaries=get_quantile_based_boundaries(
                california_housing_dataframe["longitude"], 10))

4. Scrubbing

• Omitted values
• Duplicate examples
• Bad feature values

Once detected, you typically “fix” bad examples by removing them from the data set. In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate.
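Several of the cleaning steps above fit in a few lines of plain Python. The sketch below shows min-max scaling, clipping as one possible way to handle extreme outliers, and removal of duplicate examples. The function names and sample values are illustrative assumptions, and clipping is only one of several outlier strategies.

```python
def min_max_scale(values):
    """Scale values linearly into the range 0 to 1.

    Assumes the values are not all identical (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Cap extreme values so outliers cannot dominate the feature's range."""
    return [min(max(v, lower), upper) for v in values]


def drop_duplicates(examples):
    """Keep the first occurrence of each example, preserving order."""
    seen = set()
    unique = []
    for example in examples:
        if example not in seen:
            seen.add(example)
            unique.append(example)
    return unique


scaled = min_max_scale([100, 500, 900])
clipped = clip([1, 5, 50], lower=0, upper=10)
deduplicated = drop_duplicates([(1, 2), (1, 2), (3, 4)])
```

Clipping before scaling is often useful: a single extreme value would otherwise squeeze every other example into a narrow part of the 0 to 1 range.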

## Feature Crosses: Encoding Nonlinearity

A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together. (The term cross comes from cross product.) Let’s create a feature cross named $x_3$ by crossing $x_1$ and $x_2$: $$x_3 = x_1 x_2$$

Then the linear formula becomes: $$y = b + w_1 x_1 + w_2 x_2 + w_3 x_3$$

A linear algorithm can learn the weight $w_3$ just as it learns $w_1$ and $w_2$. In other words, although $x_3$ encodes nonlinear information, you don’t need to change how the linear model trains to determine the value of $w_3$.
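A small example makes this concrete. The four points below follow an XOR-like pattern: the label is positive when $x_1$ and $x_2$ have the same sign, so no linear model on $x_1$ and $x_2$ alone can separate them. With the cross $x_3 = x_1 x_2$ added, a plain linear formula does. The data and the hand-picked weights are assumptions chosen for illustration, not learned values.

```python
# XOR-like data: the label is +1 when x1 and x2 share a sign, else -1.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]


def add_cross(x1, x2):
    """Append the feature cross x3 = x1 * x2 to the raw features."""
    return (x1, x2, x1 * x2)


def linear_model(features, weights, bias):
    """y = b + w1*x1 + w2*x2 + w3*x3, an ordinary linear formula."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hand-picked weights (not learned): only the crossed feature is used.
weights = (0.0, 0.0, 1.0)
predictions = [linear_model(add_cross(x1, x2), weights, bias=0.0)
               for x1, x2 in points]
```

The sign of every prediction matches its label, even though the model itself is still linear in its inputs.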

### Kinds of Feature Crosses

• [A * B]: a feature cross formed by multiplying the values of two features.
• [A * B * C * D * E]: a feature cross formed by multiplying the values of five features.
• [A * A]: a feature cross formed by squaring a single feature.

In practice, machine learning models seldom cross continuous features. However, machine learning models do frequently cross one-hot feature vectors. To cross two continuous values, we can bucketize them. Think of feature crosses of one-hot feature vectors as logical conjunctions.
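To see why a cross of one-hot vectors behaves like a logical conjunction, the sketch below crosses two bucketized features by taking the flattened outer product of their one-hot vectors. The bucket counts and the name `cross_one_hot` are assumptions for this example.

```python
def cross_one_hot(a, b):
    """Flattened outer product of two one-hot vectors.

    The entry at index i * len(b) + j is 1 only when a[i] == 1 AND b[j] == 1.
    """
    return [x * y for x in a for y in b]


# Hypothetical buckets: latitude falls in bucket 1 of 3, longitude in bucket 0 of 2.
latitude_bucket = [0, 1, 0]
longitude_bucket = [1, 0]
crossed = cross_one_hot(latitude_bucket, longitude_bucket)
```

The result has 3 × 2 = 6 entries with exactly one set, so the model can learn a separate weight for each (latitude bucket, longitude bucket) pair.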

Linear learners scale well to massive data. Using feature crosses on massive data sets is one efficient strategy for learning highly complex models. Neural networks provide another strategy.

## Regularization

Instead of aiming only to minimize loss (empirical risk minimization): $$\text{minimize}(\text{Loss}(\text{Data} \mid \text{Model}))$$

We’ll minimize loss + complexity, which is called structural risk minimization: $$\text{minimize}(\text{Loss}(\text{Data} \mid \text{Model}) + \lambda \, \text{complexity}(\text{Model}))$$

where $\lambda$ is the regularization rate.

Now the training optimization algorithm is a function of two terms:

• the loss term, which measures how well the model fits the data.
• the regularization term, which measures model complexity.
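The two-term objective can be written out directly. In the sketch below, the loss is mean squared error and the complexity is the sum of squared weights. Both choices, along with the tiny data set and the value of `lam`, are assumptions made for illustration.

```python
def squared_loss(weights, bias, examples):
    """Mean squared error of a linear model over (features, label) pairs."""
    total = 0.0
    for features, label in examples:
        prediction = bias + sum(w * x for w, x in zip(weights, features))
        total += (prediction - label) ** 2
    return total / len(examples)


def sum_of_squares(weights):
    """One possible complexity measure: the sum of squared weights."""
    return sum(w * w for w in weights)


def structural_risk(weights, bias, examples, lam, complexity=sum_of_squares):
    """Loss term plus lam times the complexity term."""
    return squared_loss(weights, bias, examples) + lam * complexity(weights)


# Hypothetical data that the weights below fit exactly.
examples = [((1.0, 2.0), 5.0), ((2.0, 0.0), 2.0)]
weights, bias = (1.0, 2.0), 0.0

empirical = structural_risk(weights, bias, examples, lam=0.0)
regularized = structural_risk(weights, bias, examples, lam=0.1)
```

With `lam=0.0` the objective reduces to the plain loss. With `lam=0.1` the same weights score worse even though they fit the data perfectly, because the complexity term now contributes.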

### Model Complexity

• Model complexity as a function of the weights of all the features in the model.
• Model complexity as a function of the total number of features with nonzero weights.

#### L_2 Regularization

It defines the regularization term as the sum of the squares of all the feature weights:

$$L_2 = \|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_n^2$$

Benefits:

• Encourages weight values toward 0 (but not exactly 0).
• Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.
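The following sketch computes the $L_2$ term and applies gradient steps on the penalty alone, ignoring the loss term, to show how it pulls weights toward 0 without reaching it. The weights, `lam`, and `learning_rate` are illustrative assumptions.

```python
def l2_penalty(weights):
    """Sum of the squares of all the weights."""
    return sum(w * w for w in weights)


def l2_decay_step(weights, lam, learning_rate):
    """One gradient step on lam * l2_penalty only.

    The gradient of lam * w**2 with respect to w is 2 * lam * w, so each
    weight shrinks in proportion to its own size.
    """
    return [w - learning_rate * 2.0 * lam * w for w in weights]


weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]
penalty = l2_penalty(weights)

shrunk = weights
for _ in range(10):
    shrunk = l2_decay_step(shrunk, lam=0.5, learning_rate=0.1)
```

The single large weight (5.0) contributes 25 of the total penalty, so it is also the weight that shrinks most in absolute terms. Every weight gets smaller with each step, but none becomes exactly 0.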

#### Early Stopping

Early stopping means ending training before the model fully converges. When it is used, the effects of changes to the regularization parameters can be confounded with the effects of changes in learning rate or number of iterations.
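One common way to put early stopping into practice is to watch the validation loss and stop once it has failed to improve for a few epochs in a row. That "patience" rule is an assumption of this sketch rather than something the notes prescribe, and the loss values are made up.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the last epoch index is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_losses) - 1


# Hypothetical validation losses: they improve, then start to rise.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 1.2]
stop_epoch = early_stopping_epoch(losses, patience=2)
```

Stopping at the point where validation loss turns upward limits how closely the model can fit the training data, which is why early stopping acts as a form of regularization.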