import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from fastai.collab import *
from fastai.tabular.all import *
Collaborative Filtering (AKA recommender systems)
Here’s a link to the accompanying slideshow for the embeddings section of the talk.
This meet took place at 6:30pm on Thursday 19th October, at the Community Futures office in Valleycliffe (just across from the Backyard brew pub).
The rundown:
Peter gave a short talk on Transformers - the transformative neural architecture behind ChatGPT, freaky-image-generators, and a bunch of other things.
Mike walked us through a blog he wrote on embeddings and collaborative filtering models (AKA Recommender Systems).
Afterwards we had a discussion and a trip to the pub!
What is an embedding layer and how does it work?
In this example I’ll make a collaborative filtering model (recommender system) which uses an entity embedding as part of a system for recommending books to users.
Embeddings are a neat way to take a large number of individual items (users, products, locations for example), and represent each item using an n-dimensional vector instead of using its unique id. At first this might sound like it would increase the size and complexity of the model - since each item now needs an additional vector representation - but in fact this process reduces the number of individual inputs the model needs to see to be able to make predictions.
For example, suppose we had 1000 book titles. Without an embedding layer the model would need to see each unique ID and learn the difference between them. An embedding vector for each of these book titles might be 2 dimensions deep, and might encode each book’s sci-fi-ness and its length. This means we could feed this two dimensional embedding vector as input to the model rather than the 1000 individual titles. Since those inputs represent something real about the book, that might be enough information to make sensible predictions with. In a sense the embedding compresses information about each of the N inputs into an n dimensional vector.
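To make the idea concrete, here is a minimal sketch of an embedding as a lookup table. Apart from The Little Prince, the titles are invented for illustration, and so are the two factor values (sci-fi-ness and length); in a real model these numbers are learned rather than written by hand.

```python
# Hypothetical 2-dimensional embedding: (sci-fi-ness, length). All values invented.
book_ids = {"Dune": 0, "The Little Prince": 1, "War and Peace": 2}

embedding = [
    [0.9, 0.7],   # Dune: very sci-fi, fairly long
    [0.2, 0.1],   # The Little Prince: not very sci-fi, short
    [0.0, 1.0],   # War and Peace: not sci-fi, very long
]

def embed(title):
    """Look up the embedding vector for a title via its integer ID."""
    return embedding[book_ids[title]]

vector = embed("The Little Prince")
print(vector)  # the model sees 2 numbers instead of a unique ID
```

However many titles there are, the model only ever receives two numbers per book.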
In this blog post I’ll follow a similar process to the one outlined in the fast.ai course, which used the MovieLens dataset. I’ll aim to explain some nuances about embedding layers, since I found this concept pretty confusing at first. Now that I’ve got my head around them I’m pretty amazed at how elegant, powerful and useful embeddings can be, and I’m excited to start trying out creative uses for embeddings.
Read more on embeddings in this paper: Guo, Cheng et al. “Entity Embeddings of Categorical Variables”
def display_all(df):
    # option_context temporarily lifts pandas' display limits (None = no limit)
    with pd.option_context('display.max_columns', None, 'display.max_rows', None):
        print(df)
path = Path('/kaggle/input/book-recommendation-dataset/')
Loading the data into Pandas
It doesn’t look like much, but the Ratings.csv file contains all the data we need to train a collaborative filtering model: a user column, the ISBN of a book, and the rating a user gave for that book.
It will be easier for us to understand if we can replace the book’s ISBN with its title, so the Books.csv file is used to find the titles.
ratings = pd.read_csv(path/'Ratings.csv')
books = pd.read_csv(path/'Books.csv', low_memory=False)
ratings.head()
 | User-ID | ISBN | Book-Rating |
---|---|---|---|
0 | 276725 | 034545104X | 0 |
1 | 276726 | 0155061224 | 5 |
2 | 276727 | 0446520802 | 0 |
3 | 276729 | 052165615X | 3 |
4 | 276729 | 0521795028 | 6 |
Scaling
Here I’m dividing all the ratings by 10 so they lie between 0 and 1 instead of 0 and 10. I wanted to see what effect this had on the loss during training. It reduced the loss by an order of magnitude. This isn’t a meaningful increase in accuracy; it just means that the size of the errors is correspondingly lower since we’re operating within a smaller target range. Regardless, I decided to keep the ratings scaled between 0 and 1 since I think this scale is just as easy to understand, plus there might be some benefit to this for models with more features. Click here to read more about scaling.
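A quick way to see why the loss shrinks without the model getting any better is to compute the same errors on both scales. This is a sketch with made-up ratings and predictions, and it assumes the predictions shrink by exactly the same factor as the targets, which a real trained model will only do approximately. Under that assumption the mean squared error shrinks by the square of the scaling factor, while the ordering of the predictions is unchanged.

```python
def mse(preds, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Made-up ratings on the original 0-10 scale, with made-up predictions
targets = [0, 5, 8, 10]
preds = [1, 4, 6, 9]

scaled_targets = [t / 10 for t in targets]
scaled_preds = [p / 10 for p in preds]

loss_raw = mse(preds, targets)                    # errors measured in rating points
loss_scaled = mse(scaled_preds, scaled_targets)   # the same errors on a 0-1 scale

print(loss_raw, loss_scaled, loss_raw / loss_scaled)
```

The factor seen in practice depends on the loss function and on how the model trains, so the ratio here is only indicative.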
ratings['Book-Rating'] = ratings['Book-Rating'].divide(10)
ratings = ratings.merge(books)
ratings.head()
 | User-ID | ISBN | Book-Rating | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
---|---|---|---|---|---|---|---|---|---|---|
0 | 276725 | 034545104X | 0.0 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg |
1 | 2313 | 034545104X | 0.5 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg |
2 | 6543 | 034545104X | 0.0 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg |
3 | 8680 | 034545104X | 0.5 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg |
4 | 10314 | 034545104X | 0.9 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg | http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg |
keep_list = ['User-ID', 'Book-Title', 'Book-Rating',]
del_list = ratings.columns.drop(keep_list)
ratings = ratings.drop(del_list, axis = 1)
ratings = ratings[keep_list] # changes the order
ratings = ratings.rename(columns={'User-ID': 'user', 'Book-Title': 'title', 'Book-Rating': 'rating'})
ratings.head()
 | user | title | rating |
---|---|---|---|
0 | 276725 | Flesh Tones: A Novel | 0.0 |
1 | 2313 | Flesh Tones: A Novel | 0.5 |
2 | 6543 | Flesh Tones: A Novel | 0.0 |
3 | 8680 | Flesh Tones: A Novel | 0.5 |
4 | 10314 | Flesh Tones: A Novel | 0.9 |
Now we’ve got a table of book titles, ratings and user IDs. Let’s make a fastai DataLoaders object.
The dataloaders object specifies a way of getting a series of mini batches (training and validation) from a dataset. Here our model will be a collaborative filtering model, which is a little different to what we’ve seen before with image recognition problems. In this case we’ll be using the book rating as the label, and the book-title and user-id as the input features.
Embeddings
Since there are hundreds of thousands of individual user IDs, and many more book titles, it will be useful to compress this data in some way - in a way which keeps the relevant information about each user and book, but doesn’t require the model to learn each individual user ID or book title. This is where Embeddings come in handy.
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=16)
Now let’s make a dataloaders object
The dataloaders object gives us a quick way of getting a batch of features and labels from separate training and validation datasets. Below we can see 16 pairings of input features (users with book titles) and the corresponding label for each pairing, which is the rating the user gave for the book. 16 is the batch size, which I’ve chosen to be a small number for displaying here, but I’ll change it to 64 later and experiment with different batch sizes.
dls.one_batch()
(tensor([[ 47626, 124992],
[ 3209, 208661],
[ 57702, 66456],
[ 53134, 126270],
[ 64848, 162789],
[ 30800, 217912],
[ 39640, 125673],
[ 77900, 174712],
[ 3209, 182773],
[ 61918, 76457],
[ 67043, 68012],
[ 21495, 104256],
[ 37042, 49691],
[ 30890, 172413],
[ 37491, 126020],
[ 30687, 108804]]),
tensor([[0.0000],
[0.7000],
[0.0000],
[0.8000],
[0.0000],
[0.8000],
[1.0000],
[0.8000],
[0.0000],
[0.0000],
[0.0000],
[0.0000],
[0.0000],
[0.0000],
[0.3000],
[0.8000]]))
dls.valid.show_batch()
 | user | title | rating |
---|---|---|---|
0 | 16488 | Breathing Lessons | 0.8 |
1 | 123790 | Stitch 'N Bitch: The Knitter's Handbook | 0.0 |
2 | 181176 | Lightning (Henry Holt Mystery Series) | 0.0 |
3 | #na# | Au Bonheur Des Ogres | 0.0 |
4 | 37712 | The Da Vinci Code | 0.6 |
5 | 61211 | The Five People You Meet in Heaven | 0.0 |
6 | 237856 | 365 Ways to Become a Millionaire: (Without Being Born One) | 0.8 |
7 | #na# | More Hours in My Day | 0.7 |
8 | 1733 | It'S All In The Game (Harlequin Superromance No. 302) | 0.5 |
9 | 171604 | Twilight Ecstasy (Heartlines) | 0.4 |
Take a sample
To speed up development and testing, we’ll work with a random sample of 300,000 ratings (rows) from the dataset.
number_of_samples = 300000
df = ratings.sample(number_of_samples)
dls = CollabDataLoaders.from_df(df, item_name='title', bs=64)
Crosstab
Here’s a crosstab representation of the data. This is how we’ll think of the data, though in reality all the model will see is one batch from the dataloaders object at a time. Note that the table is very sparsely populated - this is because most users haven’t read many of the books in the table.
sdf = df.sample(50)
pd.crosstab(sdf.user, sdf.title, values=sdf.rating, aggfunc='max').head()
title | A Savior Worth Having | Among men and beasts | Back Roads | Betrayals : Book Four of the Blending (The Blending, Book 4) | Beyond Chaos: One Man's Journey Alongside His Chronically Ill Wife | Big Shoe, Little Shoe | Chameleon | Clans of the Alphane Moon | Cowboy Feng's Space Bar and Grille | Dating Without Novocaine (Red Dress Ink) | ... | The Lovely Bones: A Novel | The Magic School Bus Lost in the Solar System (Magic School Bus (Paperback)) | The Magician's Nephew (rack) (Narnia) | The Reptile Room (A Series of Unfortunate Events, Book 2) | The Rock Says... | The Street Lawyer | The Wind Done Gone: A Novel | When We Were Orphans (Vintage International (Paperback)) | While I Was Gone | Wild Rose of Ruby Canyon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user | |||||||||||||||||||||
1249 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11630 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11676 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.8 |
17950 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21659 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN |
5 rows × 50 columns
Creating an embedding matrix
Since we have a very large number of categorical input features, we need some way of compressing this information. We’ll create two matrices of latent factors: one for the users and one for the books. Each row of these matrices is a vector of factors, where each factor represents something about a book, or something about a user.
For example, we’ll begin by creating one matrix with 5 factors for every user, and another with 5 factors for every book (the exact sizes are printed below the code). Conceptually you can imagine these slotting in to the right of the user column, and below the title column, in such a way that each book, and each user, will have its own unique set of 5 factors. These factors will initially be random numbers, but as the model trains, they will start to encode something meaningful about users’ preferences, and something about books’ qualities. We won’t decide what these factors mean; that will be learned by the model during training.
Let’s go ahead and make these matrices.
n_users = len(dls.classes['user'])
n_titles = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
title_factors = torch.randn(n_titles, n_factors)
user_factors.size(), title_factors.size()
(torch.Size([40705, 5]), torch.Size([116160, 5]))
Looking at the features
To make a forward pass through the model, we’ll take the dot product of some user factors with some title factors. If the vectors are similar, then it means that the user’s tastes are matched to the book’s qualities. Let’s take a look at this more closely:
Suppose user A has the factors (0, 1, 0.5, 0, -1) ,
and book N has the factors (0, 1, 0.6, 1, -1)
Since most of these factors are similar, except at index 3, we’ll get an output which is more positive, indicating that the book is a good match for the user. If the factors were all opposite to one another, we’d get a more negative output; perhaps not such a good match. The factors in this case might encode something like this:
‘written in English’,
‘short book’,
‘written in the past 20 years’,
‘written by Terry Pratchett’,
‘contains dragons’.
But the factors are learned automatically as the model trains.
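The worked example for user A and book N can be checked in a few lines of plain Python. The `dot` helper and the `opposite_user` vector are names introduced here for illustration; they are not part of the model built later.

```python
user_a = [0, 1, 0.5, 0, -1]
book_n = [0, 1, 0.6, 1, -1]

def dot(u, v):
    """Sum of the elementwise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

match = dot(user_a, book_n)        # 0 + 1 + 0.3 + 0 + 1, roughly 2.3

# A hypothetical user with the opposite sign on every factor scores negatively
opposite_user = [-x for x in user_a]
mismatch = dot(opposite_user, book_n)

print(match, mismatch)
```

The mismatch at index 3 contributes nothing here because user A’s factor is 0 there, so that feature neither helps nor hurts the score.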
Dot Product, Vectors and Scalars
For a quick refresher on vectors, scalars and dot products, see https://www.mathsisfun.com/algebra/vectors-dot-product.html
In short, if you imagine two vectors on a plane, the dot product returns a scalar value describing how much these vectors overlap, or more accurately, the length of one vector’s component along the other, multiplied by the length of that other vector.
Let’s try this out.
vector_a = torch.tensor([1, 0, 0.5])
vector_b = torch.tensor([1, 1, 0.1])
The dot product is just the sum of the products of the corresponding components, like so: a1b1 + a2b2 + a3b3.
So the dot product of these vectors would be 1×1 + 0×1 + 0.5×0.1 = 1 + 0 + 0.05 = 1.05.
In Python this is just the sum of an elementwise multiplication, which is also identical to a matrix multiplication of the two vectors.
# Sum of elementwise multiplication
(vector_a*vector_b).sum()
tensor(1.0500)
# Matrix multiply
vector_a@vector_b
tensor(1.0500)
What’s an embedding layer?
The Embedding class here creates an embedding matrix, just like we did above. It also provides a way of indexing into the matrix to get the vector at a specific index.
The input x in this case is one batch of user IDs and book titles, with the shape bs x 2. When we pass the input to the embedding layer, we’ll get back the vectors containing the factors for that batch of inputs.
The matrix multiply way of doing this is to one-hot encode the input indices as a matrix with one row per input and one column per category, then do a matrix multiply of this one-hot encoded matrix with the embedding matrix. With a batch size of 16 and 5 factors, the result would be a 16x5 matrix of feature vectors, one for each of the inputs in the minibatch. An embedding layer provides a way to get the embedding vectors out of the embedding matrix using indexes, in a way which behaves just like the matrix multiplication, without the need to build the one-hot encoded matrix with all those redundant zeros.
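The equivalence between the two routes can be checked on a toy example. This is a NumPy sketch rather than the fastai Embedding class itself; the sizes, the random matrix and the variable names are all chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_factors = 6, 5                        # tiny sizes, chosen for illustration
embedding_matrix = rng.normal(size=(n_items, n_factors))

indices = np.array([4, 0, 4, 2])                 # a mini-batch of item indices

# Route 1: one-hot encode each index, then matrix multiply
one_hot = np.zeros((len(indices), n_items))
one_hot[np.arange(len(indices)), indices] = 1.0
via_matmul = one_hot @ embedding_matrix

# Route 2: index straight into the matrix (what an embedding layer does)
via_lookup = embedding_matrix[indices]

print(via_matmul.shape, np.allclose(via_matmul, via_lookup))
```

Both routes return one row of factors per input; the lookup simply skips the multiplications by zero.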
There may be some relation between an embedding for a particular book, and an embedding for a particular user, which correlates with the rating that user gave to that book. When we train this model, we’re trying to learn the set of parameters for the embeddings for each book and user, such that the dot product of the book embeddings and the user embeddings is close to the actual ratings a user gave for a particular book.
Our model’s forward method needs to make rating predictions by doing an elementwise multiplication of the user embedding and the book embedding, then sum over this to predict an overall rating. This predicted rating will be compared with the actual rating the user gave the book, then the initially random weights in the embedding matrix will be updated using stochastic gradient descent to create a better embedding.
Through this process the embedding will come to represent some real world features about the data, which relate to the ratings which people gave to books. These features might not be named or explicitly stated by the user, but rather they’ll be discovered by the network as its parameters automatically adjust to minimise the output of the loss function.
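The training idea described above can be sketched without fastai at all. The following is a toy NumPy version, assuming a handful of invented ratings, plain per-sample gradient descent and a hand-derived gradient of the squared error; fastai’s Learner uses mini-batches, automatic differentiation and a different optimizer, so treat this only as an illustration of the principle.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_titles, n_factors = 4, 5, 3           # toy sizes, not the real dataset

# (user index, title index, rating in [0, 1]); invented for illustration
ratings = [(0, 1, 0.8), (0, 3, 0.1), (1, 1, 0.9), (2, 0, 0.5), (3, 4, 1.0), (2, 3, 0.2)]

user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
title_factors = rng.normal(scale=0.1, size=(n_titles, n_factors))

def mse(uf, tf):
    """Mean squared error of the dot-product predictions over all ratings."""
    return np.mean([(uf[u] @ tf[t] - r) ** 2 for u, t, r in ratings])

loss_before = mse(user_factors, title_factors)

lr = 0.1
for epoch in range(500):
    for u, t, r in ratings:
        err = user_factors[u] @ title_factors[t] - r     # prediction minus target
        grad_u = 2 * err * title_factors[t]              # gradient of err**2 w.r.t. the user vector
        grad_t = 2 * err * user_factors[u]               # gradient of err**2 w.r.t. the title vector
        user_factors[u] -= lr * grad_u
        title_factors[t] -= lr * grad_t

loss_after = mse(user_factors, title_factors)
print(loss_before, loss_after)
```

Nobody tells the factors what to mean; they drift to whatever values make the dot products match the ratings.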
Making a PyTorch dot product model
class DotProduct(Module):
    def __init__(self, n_users, n_titles, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.title_factors = Embedding(n_titles, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        titles = self.title_factors(x[:,1])
        return (users*titles).sum(dim=1)
x,y = dls.one_batch()
x.shape, y.shape
(torch.Size([64, 2]), torch.Size([64, 1]))
model = DotProduct(n_users, n_titles, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.230327 | 0.225140 | 00:28 |
1 | 0.182791 | 0.224386 | 00:27 |
2 | 0.089434 | 0.226122 | 00:28 |
3 | 0.046783 | 0.222329 | 00:27 |
4 | 0.024537 | 0.222625 | 00:27 |
Making the training process more efficient
Training on the entire dataset took 3 mins per epoch.
When I first ran this model it took 15 mins for 5 epochs. The model was still converging after 5 epochs but this is too slow for experimentation - we should find a sample size which allows some convergence, but which we don’t have to wait forever to train.
For the next run, I took a random sample of 300,000 ratings from the dataset. This reduced the training time but also hurt convergence - the loss measured on the validation set remained high. We need a way of reducing the size of the dataset while retaining most of the useful information.
Sample only popular books and users with lots of entries.
Deliberately selecting from the most read titles, and the most active readers could be a way of getting the information density up a little. This is definitely a design decision which should be scrutinized, since it biases the system towards more popular items, but it could be a good way to jumpstart training.
Plus it doesn’t make a lot of sense to be training a collaborative filtering model on users who have read only one book: there wouldn’t be any second item to look up and recommend for another user who has read the same book.
book_count = len(set(ratings.title))
popular_books = ratings.title.value_counts()[:1000].keys()

reader_count = len(set(ratings.user))
avid_readers = ratings.user.value_counts()[:1000].keys()
len(ratings)
1031136
Overwriting the variable dense_df with this new selection
dense_df = ratings[ratings.title.isin(popular_books)]
dense_df = (dense_df[dense_df.user.isin(avid_readers)])
print(len(dense_df))
76402
Now we’ve got the number of samples in the database down to 76402, and it only contains the top 1000 readers and the top 1000 books.
Make a new dataloaders object to draw training and validation samples from this new dataframe.
dense_dls = CollabDataLoaders.from_df(dense_df, item_name='title', bs=64)
n_users = len(dense_dls.classes['user'])
n_titles = len(dense_dls.classes['title'])
model = DotProduct(n_users, n_titles, n_factors=50)
Let’s see how the model trains now
learn = Learner(dense_dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 1e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.175163 | 0.170583 | 00:06 |
1 | 0.129468 | 0.121618 | 00:07 |
2 | 0.106404 | 0.113181 | 00:06 |
3 | 0.098074 | 0.111734 | 00:06 |
4 | 0.094152 | 0.111598 | 00:06 |
Great - the training now takes only 6-7s per epoch, and we’re still seeing convergence after 5 epochs. Let’s try to improve from here.
Adding intentional Bias
So far our model only multiplies two vectors elementwise and then adds up these contributions (the dot product). To improve the model we should add bias. This will allow us to represent the overall bias of a particular book or user. For example, a book might be extremely short and extremely sci-fi, but also be generally terrible. Even for a reader who loves short sci-fi books, if the book is generally terrible they probably won’t enjoy it. Conversely there might be a book which is very sci-fi but also so good that even non-sci-fi fans enjoy it. We can represent this overall bias of the book by adding a learned scalar (positive or negative) to the result of the dot product.
The bias for the user embedding lets us represent users who, on average, give a higher or lower rating than other users across the board.
Let’s give this a go below
class DotProductBias(Module):
    def __init__(self, n_users, n_titles, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.title_factors = Embedding(n_titles, n_factors)
        self.title_bias = Embedding(n_titles, 1)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        titles = self.title_factors(x[:,1])
        result = (users*titles).sum(dim=1, keepdim=True)
        result += self.user_bias(x[:,0]) + self.title_bias(x[:,1])
        return(result)
Here we’re just adding another embedding to represent the bias for each user and each book. This scalar value is added to the prediction for a user and book combination.
Initially I added a sigmoid to the output to keep the predictions between 0 and 1.1. Using an upper limit of 1.1 allows prediction of the number 1, which would otherwise be impossible to reach, since the sigmoid function squashes all inputs from -inf to inf into the open interval between 0 and 1.
In practice what happened was that all the predicted ratings were between ~0.4 and ~0.5. Removing the sigmoid on the outputs fixed this, and the predicted ratings now fall roughly between 0 and 1, perhaps because I’ve pre-scaled the ratings to lie within this range.
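For reference, here is a plain-Python sketch of a range-limited sigmoid of the kind described above. fastai ships a `sigmoid_range` function for this purpose; the version below is a stand-in written for illustration, not fastai’s source, and `x_for_one` is a name introduced here.

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

def sigmoid_range(x, low, high):
    """Rescale the sigmoid so outputs lie in the open interval (low, high)."""
    return sigmoid(x) * (high - low) + low

# With an upper limit of 1.0, a prediction of exactly 1 would need an infinite input;
# with 1.1, the finite input ln(10) is enough.
x_for_one = math.log(10)
print(sigmoid_range(x_for_one, 0, 1.1))   # approximately 1.0
print(sigmoid_range(0.0, 0, 1.1))         # an activation of 0 maps to the midpoint, 0.55
```

Activations that stay close to zero therefore produce predictions bunched around the middle of the range.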
Weight decay
L2 regularization, also called weight decay, is also used here. L2 regularization penalizes large weights in the model by adding to the loss function the sum of all the weights squared, multiplied by a coefficient (wd). This helps reduce overfitting by reducing the chance of any individual weight becoming very large. This will slow down the training of the model, but it will also produce a model which generalizes better - the model will find general patterns rather than producing an overly complex and overfit function which only represents items in the training set.
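The penalty can be sketched in a couple of lines. This shows the idea as described above, with invented weights and an invented base loss; fastai applies the decay inside the optimizer step rather than by literally adding this term to the loss, so this is not the library’s implementation.

```python
def l2_penalized_loss(base_loss, weights, wd):
    """Add wd * (sum of squared weights) to a base loss value."""
    return base_loss + wd * sum(w ** 2 for w in weights)

small = [0.1, -0.2, 0.1]     # hypothetical modest weights
large = [3.0, -4.0, 0.1]     # hypothetical weights with two very large entries

# Same base loss, but the large weights are penalized far more heavily
print(l2_penalized_loss(0.11, small, wd=0.1))
print(l2_penalized_loss(0.11, large, wd=0.1))
```

Because the penalty grows with the square of each weight, the optimizer is pushed to keep weights small unless a large weight buys a real reduction in prediction error.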
I have left
model = DotProductBias(n_users, n_titles, n_factors=50)
learn = Learner(dense_dls, model, loss_func=MSELossFlat()).to_fp16()
learn.fit_one_cycle(5, 1e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.139426 | 0.142063 | 00:08 |
1 | 0.120957 | 0.115809 | 00:07 |
2 | 0.099875 | 0.111417 | 00:07 |
3 | 0.089681 | 0.110612 | 00:07 |
4 | 0.084803 | 0.110506 | 00:08 |
Now that we have the model trained, we can get predictions for any pairing of users and books. The model output is the rating which the model predicts for that user-book combo. Here’s a demonstration which uses one batch of data, so it’s just a random pairing of users with books.
batch = dense_dls.one_batch()[0].to('cuda')
Use get_device to check that the tensor is on the GPU (-1 = cpu, 0 = cuda:0)
batch.get_device()
0
Passing a batch of inputs (user/title pairs) gives us an output of predictions for those pairings.
model(batch)[:10]
tensor([[ 0.0823],
[ 0.1963],
[ 0.1214],
[ 0.1322],
[ 0.0816],
[ 0.2963],
[ 0.0160],
[ 0.2217],
[ 0.0436],
[-0.0102]], device='cuda:0', grad_fn=<SliceBackward0>)
Looking at the factors for a batch of users
Here we can see the indices of a batch of users. Each one of these users has a corresponding set of factors which are accessed by passing these indices to the Embedding instance called user_factors.
batch[:,0]
tensor([748, 32, 929, 9, 481, 864, 243, 19, 357, 434, 564, 699, 361, 435,
326, 570, 497, 296, 659, 706, 722, 658, 64, 875, 324, 589, 73, 226,
660, 351, 861, 120, 703, 708, 662, 85, 49, 694, 297, 39, 83, 246,
657, 910, 782, 194, 174, 96, 265, 819, 140, 956, 804, 896, 805, 601,
697, 535, 256, 584, 984, 243, 489, 785], device='cuda:0')
Thinking about latent factors as components of a vector in an n-dimensional feature space
Here are the factors for each of the users in the batch:
model.user_factors(batch[:,0])
tensor([[ 0.0817, -0.0264, 0.0331, ..., -0.0489, 0.0411, -0.0605],
[ 0.2433, 0.1127, 0.0474, ..., -0.0423, -0.0127, 0.0787],
[ 0.0681, -0.0004, -0.0345, ..., -0.0058, -0.0629, -0.0205],
...,
[-0.0815, 0.0103, -0.0141, ..., 0.0457, 0.0288, 0.0169],
[-0.0230, 0.0015, -0.0673, ..., -0.0892, 0.0012, -0.0144],
[-0.0735, -0.1434, -0.0621, ..., 0.0003, 0.0648, 0.1412]],
device='cuda:0', grad_fn=<EmbeddingBackward0>)
Each of these numbers represents a learned latent factor for that user. The latent factors can be thought of as the components of a vector in n-dimensional space, where each number is the contribution along a different axis. The axes are all orthogonal to one another. They can represent things like taste, genre, age etc.
For example: if user A has 3 latent factors x, y, z, and these have values 1, 0.2, -0.9, then we can imagine a vector in 3d space which extends along the x dimension by 1, along y by 0.2, and extends negatively along the z dimension by 0.9.
Another user, or book title, might point in a very similar direction. This would mean that their factors overlap a lot and tend not to cancel out.
Each of these dimensions could code for something like ‘enjoys horror books’, ‘enjoys shorter books’ or ‘is younger’.
If there was another user whose factors were -1, 0.2, 1, we might say that they have the opposite taste for horror stories, the same liking for shorter books, and sit at the opposite end of the age axis.
The latent factors encode for real world meaning, but the factors themselves aren’t chosen by the engineer when setting up the neural network - rather they emerge from the relationships between books, users and ratings as the model trains.
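The two example users can be compared numerically. This sketch uses plain Python; `dot` and `cosine` are helper names introduced here, and cosine similarity is one common way of measuring whether two vectors point the same way, not something the model itself computes.

```python
import math

user_a = [1, 0.2, -0.9]     # user A from the example
user_b = [-1, 0.2, 1]       # the second user from the example

def dot(u, v):
    """Sum of the elementwise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product normalized by vector lengths: 1 = same direction, -1 = opposite."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

print(dot(user_a, user_b))       # -1 + 0.04 - 0.9, roughly -1.86
print(cosine(user_a, user_b))    # roughly -0.96: nearly opposite directions
print(cosine(user_a, user_a))    # a vector compared with itself gives roughly 1
```

The shared liking for shorter books contributes a small positive term, but it is swamped by the two axes on which the users disagree.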
Using the trained model
Finding the books with the highest bias
Here’s a list of books with a high bias: they end up having a higher rating across the board, regardless of the specific features which were learned to describe the books. Intuitively this suggests that they’re high quality, since they get consistently high ratings whatever their genre and the users’ tastes.
books_bias = learn.model.title_bias.weight.squeeze()
idxs = books_bias.argsort(descending=True)[:20]
[dense_dls.classes['title'][i] for i in idxs]
['Harry Potter and the Prisoner of Azkaban (Book 3)',
"Harry Potter and the Sorcerer's Stone (Book 1)",
'Harry Potter and the Chamber of Secrets (Book 2)',
'To Kill a Mockingbird',
'Harry Potter and the Order of the Phoenix (Book 5)',
'The Secret Garden',
'A Wrinkle in Time',
"Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))",
'Harry Potter and the Goblet of Fire (Book 4)',
'The Fellowship of the Ring (The Lord of the Rings, Part 1)',
'The Little Prince',
'Fahrenheit 451',
"Where the Heart Is (Oprah's Book Club (Paperback))",
'Lord of the Flies',
'The Lovely Bones: A Novel',
'Anne Frank: The Diary of a Young Girl',
'The Color Purple',
"The Handmaid's Tale",
'One for the Money (A Stephanie Plum Novel)',
'The Da Vinci Code']
Making recommendations for a single user.
We know how to get rating predictions for a single batch: take the dot product of the user factors and title factors for each user/title pairing in the batch. To get predictions for a single user, we’d just need to replace all the user IDs with the ID for that single user. Let’s try this:
Now that we have a trained model, to make a recommendation we need to do 2 things:
- Find out which books the user has read already. This is just so that we’re not recommending books they’ve already read.
- Create a tensor of pairs which contain user IDs and book titles. These will be passed to the DotProductBias forward() method, which takes the dot product of a user-id book-title combination. We need to make the user IDs all the same (11676), and calculate these dot products for every book the user hasn’t yet read. Once this calculation is performed, we’ll have a prediction of what rating this user might give if they were to read these books. Based on these predictions we can recommend the books which get the highest predicted rating.
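The first of the two steps, leaving out books the user has already read, can be sketched as follows. The titles, scores and the `recommend` helper are all hypothetical and stand in for the model’s predictions and the user’s rating history.

```python
# Hypothetical predicted ratings for one user, keyed by book title
predicted = {
    "Book A": 0.91,
    "Book B": 0.85,
    "Book C": 0.40,
    "Book D": 0.78,
}
already_read = {"Book A", "Book D"}    # titles this user has rated before

def recommend(predicted, already_read, k=2):
    """Return the k unread titles with the highest predicted rating."""
    unread = {title: score for title, score in predicted.items() if title not in already_read}
    return sorted(unread, key=unread.get, reverse=True)[:k]

print(recommend(predicted, already_read))   # ['Book B', 'Book C']
```

Filtering before ranking means a highly rated book the user already owns can never crowd out a genuinely new suggestion.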
Let’s take a look at the user who has read the most books:
dense_df.user.value_counts()[:1000].keys()
Int64Index([ 11676, 35859, 76352, 16795, 153662, 102967, 238120, 23768,
230522, 55492,
...
69808, 4385, 168464, 164465, 227250, 35433, 241198, 173632,
133868, 72352],
dtype='int64', length=997)
user 11676
This is the ID of the user we’re trying to recommend books for.
We made an embedding using a subset of the 1000 top users - so we need a way to find which index this ID is at:
def get_index(cat, dataloader):
    'get the index of a category from a dataloader'
    for i, j in enumerate(dataloader):
        if j == cat:
            return i

get_index(11676, dense_dls.classes['user'])
32
Let’s confirm that this works by testing it on a book title:
get_index('The Little Prince', dense_dls.classes['title'])
793
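The linear search above is fine for a one-off lookup. If many lookups were needed, one option would be to build a value-to-index dictionary once. The sketch below uses a made-up class list laid out like the ones in the dataloaders, with a `#na#` placeholder first; I believe fastai’s class lists also expose a comparable mapping through an `o2i` attribute, but I have not relied on that here.

```python
# Hypothetical class list in the same layout as dense_dls.classes['user']:
# position 0 is a placeholder, followed by the category values
classes = ["#na#", 254, 2276, 11676, 35859]

# Build the value -> index mapping once; every later lookup is a dictionary access
index_of = {value: i for i, value in enumerate(classes)}

print(index_of[11676])   # 3 in this toy list
```

The index returned is the row of the embedding matrix, not the original user ID, which is why the two numbers differ.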
We’re going to check for book recommendations for user 11676, who is at index 32 in our dense dataloaders object.
user_index = 32
n_books = len(dense_dls.cats['title'].unique())
user_idxs = torch.full((n_books, 1), user_index, dtype=int).cuda()
book_idxs = torch.linspace(1, n_books, n_books, dtype=int).unsqueeze(1).cuda()
user_books_tensor = torch.cat((user_idxs, book_idxs), -1)
user_books_tensor
tensor([[ 32, 1],
[ 32, 2],
[ 32, 3],
...,
[ 32, 998],
[ 32, 999],
[ 32, 1000]], device='cuda:0')
Now we have a tensor pairing the user at index 32 with each of the book indices from 1 to 1000. Passing this into our model’s forward() method will calculate the dot product of this user’s latent factors vector with the latent factors for each book in the dataset. This dot product is the rating prediction.
recommendations = model(user_books_tensor)
top_10 = recommendations.argsort(0, descending=True)[:10]
top_10
tensor([[307],
[174],
[235],
[485],
[363],
[ 1],
[ 41],
[861],
[597],
[125]], device='cuda:0')
Now we have the indices of the top 10 recommended books for this user. Finally we can look up these top indices in the dataloaders classes to get the titles.
dense_dls.classes['title'][top_10]
(#10) ['Harry Potter and the Chamber of Secrets (Book 2)','Crazy for You','Empire Falls','One True Thing','Isle of Dogs','1984','A Widow for One Year','The Secret','Sisterhood of the Traveling Pants','By the Light of the Moon']
Let’s take a look at some of the books this user has already given the top rating:
dense_df.loc[dense_df.user==11676].loc[dense_df.rating==1][:10]
 | user | title | rating |
---|---|---|---|
69 | 11676 | The Notebook | 1.0 |
189 | 11676 | A Painted House | 1.0 |
12730 | 11676 | Harry Potter and the Chamber of Secrets (Book 2) | 1.0 |
13401 | 11676 | Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback)) | 1.0 |
19165 | 11676 | The Sweet Potato Queens' Book of Love | 1.0 |
19881 | 11676 | Dreamcatcher | 1.0 |
20480 | 11676 | Fight Club | 1.0 |
26049 | 11676 | 1st to Die: A Novel | 1.0 |
26691 | 11676 | The Hot Zone | 1.0 |
28195 | 11676 | The Girl Who Loved Tom Gordon | 1.0 |
recommendations.min(), recommendations.max()
(tensor(0.1069, device='cuda:0', grad_fn=<MinBackward1>),
tensor(0.9967, device='cuda:0', grad_fn=<MaxBackward1>))
Finding ‘book buddies’
We can use the same approach to pair users with the people they’re most similar to. If two readers in the model have nearly the same set of latent factors as one another, then they have very similar tastes in books. I remember the Last.FM music recommendation software had a feature where you could see your ‘musical neighbours’ and see what music they’d been listening to. This likely used a similar collaborative filtering system.
To find two similar readers, we could use the following approach:
- pick a user
- apply the same process as above but instead of calculating the dot product of this user with every book, calculate the dot product of the user with every other user. If their latent factors are similar they’re likely to have similar tastes in books.
Let’s give this a go!
n_users = len(dense_dls.cats['user'].unique())
user_idxs = torch.full((n_users, 1), user_index, dtype=int).cuda()
all_users = torch.tensor(dense_dls.cats['user'].values).unique().unsqueeze(1).cuda()
pairs = torch.cat((user_idxs, all_users), -1)
top_10_indices = model(pairs).argsort(0)[:10]

top_10_buddies = dense_dls.classes['user'][top_10_indices]
top_10_buddies
(#10) [173835,153662,191187,146230,104429,75860,108285,93047,146175,204591]
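Since the model’s forward() method reads the second column of its input as a title index, another way to follow the approach described above is to compare the user factor vectors with each other directly. Here is a NumPy sketch on random stand-in data, assuming cosine similarity as the measure and sorting from most to least similar; `most_similar` and the planted duplicate user are inventions for the example, not part of the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
user_factors = rng.normal(size=(8, 5))          # stand-in for learned user vectors
user_factors[6] = user_factors[2] * 1.5         # plant a user with the same taste as user 2

def most_similar(factors, user_index, k=3):
    """Indices of the k users whose vectors point most nearly the same way."""
    unit = factors / np.linalg.norm(factors, axis=1, keepdims=True)
    similarity = unit @ unit[user_index]        # cosine similarity with every user
    similarity[user_index] = -np.inf            # never return the user themselves
    return np.argsort(similarity)[::-1][:k]

print(most_similar(user_factors, user_index=2))
```

With a trained model, the stand-in matrix would be replaced by the weights of the user embedding, and the returned indices looked up in the user classes as before.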
Conclusion
In this post I’ve covered:
- how to load a dataset into fastai
- how to take the dot product of two vectors
- building a custom PyTorch model which inherits from PyTorch’s Module class and contains a forward() method and a couple of embedding layers for the input features
- training a model on a denser subset of the data to enable faster model training
- the role of bias and factors in the embedding matrices
- how to get a book recommendation for a given user
- how to find users with similar tastes in the dataset.
I was able to train a dot product based model with embedding layers on both the input features, and get book recommendations for a given user.
References
http://fast.ai
Guo, Cheng et al. “Entity Embeddings of Categorical Variables”