## Matrix Factorization

The user-item matrix, also called the ratings matrix R, is N * M, where N is the number of users and M is the number of items.

• Key: W and U should be much smaller than R
• Actually, we can’t store R in memory, since it is too large.
• But we can represent it using a special data structure: Dict{(u,m) -> r}
• If N = 130K, M = 26K
• N * M = 3.38 billion
• # ratings = 20 million
• Fraction of space used: 20 million / 3.38 billion ≈ 0.006
• This is called a “Sparse representation”
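A minimal sketch of this representation in Python (the triples in `raw` are made-up example data, not from the actual dataset):

```python
# Sparse representation: store only the observed ratings in a dict
# keyed by (user, movie) instead of a dense N x M array.
raw = [(0, 10, 4.5), (0, 99, 3.0), (1, 10, 5.0)]  # (user, movie, rating)

ratings = {}  # Dict{(u, m) -> r}
for u, m, r in raw:
    ratings[(u, m)] = r

# Lookups are O(1); missing entries are simply absent, not stored as zeros.
print(ratings.get((0, 10)))  # 4.5
print(ratings.get((1, 99)))  # None -> user 1 hasn't rated movie 99
```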

### Matrix Factorization

Matrix factorization splits the ratings matrix into the product of 2 other matrices.

We call it “R hat” because it only approximates R: it is our model of R. The formula is shown below:

$$\hat{R} = W U^T$$

So we would like W and U to be very skinny:

• W (N * K): the users matrix
• U (M * K): the movies matrix
• K: somewhere from 10 to 50

#### Some Calculations

If K = 10, N = 130K, M = 26K, then:

• W: 130K * 10 = 1.3 million
• U: 26K * 10 = 0.26 million
• Total size of W + U = 1.3 million + 0.26 million = 1.56 million
• Size relative to the full matrix: 1.56 million / 3.38 billion ≈ 0.0005

This is good: we want # parameters < # data points.
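The arithmetic above as a quick sanity check (numbers taken directly from the text):

```python
N, M, K = 130_000, 26_000, 10

dense_entries = N * M          # 3.38 billion entries in the full matrix
params = N * K + M * K         # |W| + |U| = 1.56 million parameters
print(params / dense_entries)  # ~0.0005

num_ratings = 20_000_000
print(params < num_ratings)    # True: # parameters < # data points
```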

In practice, we never compute the full product WU^T, which would materialize the entire N * M matrix R hat, except perhaps on a small subset of the data.

Instead, we calculate one rating at a time, e.g., the score of user i for item j. It’s just a dot product between 2 vectors of size K:

$$\hat{r}_{ij} = w_i^T u_j, \quad \hat{r}_{ij} = \hat{R}[i,j], \quad w_i = W[i], \quad u_j = U[j]$$
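In code, a single prediction looks like this; a sketch assuming W and U are NumPy arrays of shapes (N, K) and (M, K):

```python
import numpy as np

N, M, K = 100, 50, 10
rng = np.random.default_rng(0)
W = rng.normal(size=(N, K))  # users matrix
U = rng.normal(size=(M, K))  # movies matrix

def predict(i, j):
    # r_hat[i, j] = w_i . u_j: an O(K) dot product; no N x M matrix needed
    return W[i].dot(U[j])

print(predict(3, 7))
```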

### Why does this make sense?

Singular Value Decomposition (SVD) states that a matrix X can be decomposed into 3 separate matrices multiplied together:

$$X = USV^T$$

Where:

• X (N * M), U (N * K), S (K * K), V (M * K)
• If I multiply U by S, I just get another N * K matrix:
• Then X is a product of 2 matrices, just like matrix factorization
• Or equivalently, I could combine S with V^T
• R (ratings matrix) is sparse
• If U, S, and V can properly approximate a fully dense X matrix, then surely a factorization can approximate a mostly empty R matrix!
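A quick numerical check with NumPy (here `U_svd`, `s`, `Vt` are the SVD factors, not the movies matrix U from above, and X is random example data): keeping the top K singular values and folding S into the left factor gives exactly the two-matrix form of matrix factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 15))
K = 5

# Full SVD: X = U S V^T
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the top K components and absorb S into the left factor.
W = U_svd[:, :K] * s[:K]  # (N, K): the "US" product
V = Vt[:K, :].T           # (M, K)

X_hat = W @ V.T           # rank-K approximation of X, same form as W U^T
print(np.linalg.norm(X - X_hat))  # approximation error
```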

### Interpretation

• Each of the K elements in w_i and u_j is a feature
• Let’s suppose K = 5, and the features are:
• Action
• Comedy
• Romance
• Horror
• Animation
• w_i[1] is how much user i likes action
• w_i[2] is how much user i likes comedy, etc
• u_j[1] is how much movie_j contains action
• u_j[2] is how much movie_j contains comedy, etc

Then w_i^T u_j measures how well user i’s preferences correlate with movie j’s attributes:

$$w_i^T u_j = \| w_i \| \, \| u_j \| \cos{\theta} \propto \text{sim}(i, j)$$

In reality, each feature is latent, and K is the latent dimensionality, a.k.a. “hidden causes”. We don’t know the meaning of any feature without inspecting it.

### Dimensionality Reduction

MF reduces the dimensionality of R: each user and each movie is summarized by a K-dimensional vector instead of a full row or column of R.

### How to Learn W and U

#### Loss Function

We use the squared error as the loss function:

$$J = \sum_{(i,j) \in \Omega} (r_{ij} - \hat{r}_{ij})^2 = \sum_{(i,j) \in \Omega} (r_{ij} - w_i^T u_j)^2$$

$\Omega$ = the set of pairs (i, j) where user i rated movie j.
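Computing J over only the observed pairs, using the sparse dict representation from earlier (a sketch; the shapes and sample data are assumptions):

```python
import numpy as np

def loss(ratings, W, U):
    # Sum of squared errors over Omega: the observed (i, j) pairs only.
    return sum((r - W[i].dot(U[j])) ** 2
               for (i, j), r in ratings.items())

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # (N, K)
U = rng.normal(size=(5, 3))           # (M, K)
ratings = {(0, 1): 4.0, (2, 3): 2.5}  # Dict{(u, m) -> r}
print(loss(ratings, W, U))
```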

#### Solving for W

We want to isolate w_i, but it’s stuck inside a dot product. Taking the derivative of J with respect to w_i and setting it to zero gives the following, where $\Psi$ is the set of movies rated by user i:

$$\sum_{j \in \Psi} (w_i^T u_j) u_j = \sum_{j \in \Psi} r_{ij} u_j$$

The dot product is commutative ($w_i^T u_j = u_j^T w_i$):

$$\sum_{j \in \Psi} (u_j^T w_i) u_j = \sum_{j \in \Psi} r_{ij} u_j$$

The result of a dot product is a scalar, so scalar * vector = vector * scalar:

$$\sum_{j \in \Psi} u_j (u_j^T w_i) = \sum_{j \in \Psi} r_{ij} u_j$$

Drop the brackets:

$$\sum_{j \in \Psi} u_j u_j^T w_i = \sum_{j \in \Psi} r_{ij} u_j$$

w_i doesn’t depend on the summation index j, so it can be factored out of the sum; add brackets to group it:

$$\left( \sum_{j \in \Psi} u_j u_j^T \right) w_i = \sum_{j \in \Psi} r_{ij} u_j$$

Now it’s just $Ax = b$ with $A = \sum_{j \in \Psi} u_j u_j^T$, $x = w_i$, and $b = \sum_{j \in \Psi} r_{ij} u_j$, which we know how to solve:

```python
x = np.linalg.solve(A, b)
```


$$w_i = \left( \sum_{j \in \Psi} u_j u_j^T \right)^{-1} \sum_{j \in \Psi} r_{ij} u_j$$
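One user’s update as code, a sketch assuming `U` is the (M, K) movies matrix and `rated` holds that user’s ratings (the set $\Psi$). In practice a small regularization term is usually added to A so it stays invertible; that term is not part of the derivation above.

```python
import numpy as np

def solve_w_i(U, rated):
    """Solve (sum_j u_j u_j^T) w_i = sum_j r_ij u_j for one user.

    U     : (M, K) movies matrix
    rated : dict {j -> r_ij} of the movies this user rated (the set Psi)
    """
    K = U.shape[1]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for j, r in rated.items():
        A += np.outer(U[j], U[j])  # u_j u_j^T
        b += r * U[j]              # r_ij u_j
    return np.linalg.solve(A, b)   # w_i = A^{-1} b

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))
print(solve_w_i(U, {0: 4.0, 2: 3.5, 5: 5.0}))
```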

#### Solving for U

By symmetry, the same derivation with the roles of users and movies swapped gives:

$$u_j = \left( \sum_{i \in \Omega_j} w_i w_i^T \right)^{-1} \sum_{i \in \Omega_j} r_{ij} w_i$$

where $\Omega_j$ is the set of users who rated movie j.
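Alternating the two updates until convergence is alternating least squares (ALS); a minimal sketch under the same assumptions as above, with a regularization term `lam * I` added (an addition, not derived here) to keep each K x K system invertible:

```python
import numpy as np

def als(ratings, N, M, K=10, epochs=10, lam=0.1):
    """Alternate between solving for every w_i and every u_j."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(N, K))
    U = rng.normal(scale=0.1, size=(M, K))

    # Index the sparse ratings dict by user and by movie.
    by_user = {i: [] for i in range(N)}
    by_movie = {j: [] for j in range(M)}
    for (i, j), r in ratings.items():
        by_user[i].append((j, r))
        by_movie[j].append((i, r))

    reg = lam * np.eye(K)  # regularization keeps A invertible
    for _ in range(epochs):
        for i, entries in by_user.items():   # update each w_i
            A, b = reg.copy(), np.zeros(K)
            for j, r in entries:
                A += np.outer(U[j], U[j])
                b += r * U[j]
            W[i] = np.linalg.solve(A, b)
        for j, entries in by_movie.items():  # update each u_j
            A, b = reg.copy(), np.zeros(K)
            for i, r in entries:
                A += np.outer(W[i], W[i])
                b += r * W[i]
            U[j] = np.linalg.solve(A, b)
    return W, U

# Toy usage with made-up ratings:
ratings = {(0, 1): 5.0, (1, 1): 4.0, (1, 2): 2.0, (2, 0): 3.0}
W, U = als(ratings, N=3, M=3, K=2, epochs=5)
print(W @ U.T)  # reconstructed ratings (only observed entries were trained)
```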