A librarian’s job is pretty hard so I am trying to use AI to simplify it! Source: pixabay

How Simple Is It to Build an End-to-End Item-Based Recommender System?

Collect data, train a model and deploy it using popular data science libraries

Yan Gobeil
Jan 21 · 12 min read

Many websites nowadays use some kind of recommendation system to guide their customers towards interesting products and make them buy more. The most advanced ones (like those used by Amazon, Netflix or YouTube) can be very complicated, but it is surprisingly simple to build a decent recommendation engine from start to finish for small-scale applications. This is what I want to show in this article.

Example of two different recommendation systems used on Amazon.

The motivation for this project comes from my fiancée, who works in a library. For places that promote learning and information, it is striking how far behind libraries can be when it comes to technology. The best example is that they prefer calling people to tell them that their books are ready for pickup instead of sending automatic emails. Recommendation systems are obviously far down their priority list, even though librarians spend a large amount of time trying to find books that specific people would like to read.

With that in mind, I decided to make a POC (proof of concept) book recommendation API to show how simple it can be to build and how easy it is for anyone to use. It is an example of why no one should pass on the opportunity to improve their online services. Pretty much the whole procedure that I describe can be applied to almost any other topic, not just books. Here is a summary of the steps that I will explain:

  • Collect the data by scraping the web using beautifulsoup

The whole project is written in Python 3.7 and the code can be found on my github. The web app is available here and the API is presented on RapidAPI.

Collecting the data

Usually the best recommender systems leverage users’ data to get good results. For example, if you really liked the Lord of the Rings movies, Netflix will recommend movies liked by other people who also liked LOTR. Another method is to recommend movies that are similar in content to the movies that you liked. The second method usually gives weaker recommendations but is a better fit for my project for a few reasons. First, I want the API to be universal so I don’t want to focus on user data from a specific platform, which I don’t have anyway. Second, user-based recommendations need to be updated often to account for new interactions, which is too much work for a simple POC. Finally, libraries often insist on not using user data for privacy reasons, so I want to show that AI is still powerful without compromising privacy.

One thing to mention about web scraping is that not every website allows it. Be sure to read the terms and conditions of the site that you want to get data from to see to what extent web scraping is allowed. If you insist on scraping a website that doesn’t allow it, make sure that you don’t overload the servers (for example by pausing between requests in your code to imitate human users) and that you don’t use the data commercially.

Given the goal of the project, the data that I need is a list of books with varied information about each of them. I want to focus on novels written in French and I found out that leslibraires.com has everything I need. In addition, the source code of its pages is very simple, which makes it easy to scrape. This is an important criterion because some websites are built with complicated JavaScript that makes it very hard to extract information programmatically.

There are already many good tutorials about web scraping with beautifulsoup (for example this one) so I don’t want to detail all of my process. In general, my strategy is to go through the search pages for the most popular literature books to gather the webpages for as many books as possible. Then for each of these books I collect the desired data from the html code.

Results from book search on leslibraires.com and example of book page.
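The scraping script itself is not reproduced here, but the strategy described above can be sketched roughly as follows. This is only an illustration: the CSS selectors in parse_book_page and the function names are hypothetical and would have to be adapted to the real html of the site. The two third-party libraries (requests and beautifulsoup) are imported inside the functions that need them, so the crawl loop can be tried offline with stand-in functions.

```python
import time


def fetch(url):
    """Download one page and return its html as text."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_book_page(html):
    """Extract one book from the html of its page (selectors are hypothetical)."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    return {
        "title": text_of("h1.title"),
        "author": text_of("a.author"),
        "ISBN": text_of("span.isbn"),
        "summary": text_of("div.summary"),
    }


def scrape_books(urls, fetch=fetch, parse=parse_book_page, delay=1.0):
    """Visit every url politely and return the list of parsed books."""
    books = []
    for url in urls:
        try:
            books.append(parse(fetch(url)))
        except Exception as error:  # one bad page should not stop the crawl
            print(f"Skipping {url}: {error}")
        time.sleep(delay)  # pause between requests so the server is not overloaded
    return books
```

Calling scrape_books(list_of_book_urls) would return a list of dictionaries with the same keys as the example book shown next.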

Here is an example of one of the 15k books that I collected using this script:

{'title': 'Kukum',
'author': 'Michel Jean',
'ISBN': '9782764813447',
'summary': "Ce roman retrace le parcours d'Almanda Siméon, une orpheline qui va partager sa vie avec les Innus de Pekuakami. Amoureuse d'un jeune Innu, elle réussira à se faire accepter. Elle apprendra l'existence nomade et la langue, et brisera les barrières imposées aux femmes autochtones. Almanda et sa famille seront confrontées à la perte de leurs terres et subiront l'enfermement des réserves et la violence des pensionnats. Racontée sur un ton intimiste, l'histoire de cette femme, qui se déroule sur un siècle, exprime l'attachement aux valeurs ancestrales des Innus et au besoin de liberté qu'éprouvent les peuples nomades, encore aujourd'hui."}

Encoding the data

With the biggest part of the project done, the next step is to figure out how to determine which books are similar to each other. This could be done with handmade criteria based, for example, on the author, the genre, the number of pages and/or the year of publication. However, I decided to use a more generic approach based on deep learning. The goal is to convert each book into a fixed-length vector to have a more mathematical way of calculating similarity. The question is now how to generate these vectors. The answer is the main difference between recommendation systems for different topics: once the vectors are found, the rest of the procedure is exactly the same.

In my opinion, the piece of data that I extracted that contains the most information about the books is the summary. Fortunately for me, there are many different methods to convert text into vectors, which are called embeddings in this context. Using recent advances in deep learning, people at Google have trained a language model called Universal Sentence Encoder that does exactly what I want: it converts any text into vectors of length 512 that encode the meaning of the text. There is even a multilingual version of the model, which is important to me because my summaries are in French.

Example of similarity computed using the Universal Sentence Encoder (from TFHub)

Using this model is as simple as can be since it was made available on tensorflow hub, “a repository of trained machine learning models ready for fine-tuning and deployable anywhere”. The following code shows how to calculate the vector embedding for each of the books in the dataset.
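As a rough sketch of this step (not necessarily the exact code of the project), the embeddings could be computed as below. The model URL is my assumption of the TF Hub address of the multilingual encoder, the function names and batch size are hypothetical, and the books are assumed to be dictionaries with 'ISBN' and 'summary' keys like the example above. The multilingual model needs the tensorflow_text package to be imported before loading.

```python
# Assumed TF Hub address of the multilingual Universal Sentence Encoder
MODEL_URL = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"


def load_encoder(url=MODEL_URL):
    """Download the model (or load it from the local cache)."""
    import tensorflow_hub as hub  # third-party: pip install tensorflow-hub
    import tensorflow_text  # noqa: F401  registers ops used by the multilingual model

    return hub.load(url)


def embed_books(books, encoder, batch_size=32):
    """Return a dictionary mapping each ISBN to the embedding of its summary."""
    embeddings = {}
    for start in range(0, len(books), batch_size):
        batch = books[start:start + batch_size]
        vectors = encoder([book["summary"] for book in batch])
        # TensorFlow returns a tensor; turn it into a NumPy array when possible
        if hasattr(vectors, "numpy"):
            vectors = vectors.numpy()
        for book, vector in zip(batch, vectors):
            embeddings[book["ISBN"]] = [float(x) for x in vector]
    return embeddings
```

With the real model, embed_books(books, load_encoder()) should give one vector of length 512 per book. Working in batches keeps the memory usage reasonable for long lists of summaries.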

Calculating similarities

With a vector in hand for each book, there are a few different metrics that can be used to calculate similarities. Since it is not obvious at first which one performs best for a given dataset and encoding model, I wanted to try two of the most common ones. The first one is the Euclidean distance, which calculates the distance between the points represented by the vectors. How well it works depends heavily on whether the scale of the vectors is meaningful. The second metric is cosine similarity, which is the cosine of the angle between the vectors and therefore ignores their lengths. See this article for a more detailed comparison of the two metrics.

Example of the two different metrics. Here d is Euclidean distance and θ is the angle whose cosine is taken for cosine similarity. Image made by cmry.
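To make the difference concrete, here is a small pure Python illustration with made-up two-dimensional vectors (real embeddings have 512 dimensions). The vector named scaled points in the same direction as short but is ten times longer, while other is close to short but points in a different direction.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


short = [1.0, 2.0]
scaled = [10.0, 20.0]  # same direction as short, ten times longer
other = [2.0, 1.0]     # near short, but pointing elsewhere

print(euclidean_distance(short, scaled), euclidean_distance(short, other))
print(cosine_similarity(short, scaled), cosine_similarity(short, other))
```

According to the Euclidean distance, other is the closest neighbour of short, whereas cosine similarity considers scaled to be a perfect match because the length is ignored. This is why the two metrics can produce different recommendations from the same embeddings.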

Using scikit-learn, the metrics can easily be calculated and a list of the most similar books can be found for each book. This process takes a few seconds and the time scales up with the number of books so it’s preferable to do all the calculations once and save the results for later use.
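A possible way to do this precomputation is sketched below, under a few assumptions: the function names are mine, the vectors are given in the same order as the list of ISBNs, and the results are saved as JSON (any other format would work). The pairwise scores come from cosine_similarity and euclidean_distances in sklearn.metrics.pairwise.

```python
import json


def similarity_matrix(vectors, metric="cosine"):
    """Pairwise scores for all books, where a higher score means more similar."""
    # third-party: pip install scikit-learn
    from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

    if metric == "cosine":
        return cosine_similarity(vectors)
    # a small distance means similar books, so flip the sign to get a score
    return -euclidean_distances(vectors)


def top_similar(isbns, scores, n=5):
    """Map each ISBN to the n most similar other ISBNs, best first."""
    recommendations = {}
    for i, isbn in enumerate(isbns):
        ranked = sorted(range(len(isbns)), key=lambda j: scores[i][j], reverse=True)
        # drop the book itself, which is always its own best match
        recommendations[isbn] = [isbns[j] for j in ranked if j != i][:n]
    return recommendations


def save_recommendations(recommendations, path):
    """Store the results once so the web app only has to look them up."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(recommendations, file)
```

The pure Python sort keeps the sketch easy to read but gets slow for thousands of books; applying numpy.argsort to each row of the matrix gives the same ranking much faster.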

Visualizing the recommendations

At this point the recommendation system is finished. For each book we are able to get a list of books that are similar to it. Of course searching manually in a dictionary with 15k entries is not super efficient so building a web interface to visualize the recommendations is essential. This can be done easily with flask and some knowledge of web development. I actually explained how to do this in my previous article about flask. Keep in mind that this strategy works for very simple applications. For more complex needs, one should switch to the usual frontend-backend setup used in web development, using for example a javascript framework for the frontend.

The idea is to make a webpage template with some html/css that contains variables which are linked to a python script. Let’s first see what my template looks like.

There are a few things to explain here. First, I don’t want to bother myself with any css so I am using bootstrap. This is a web dev library containing a list of premade css classes. Using it is as simple as loading the library in the head of the html file and adding classes to the html tags. If you don’t know how it works and want a quick intro, W3Schools has a nice tutorial. Otherwise don’t be intimidated, pure html can also do the job.

The next important thing used in the template is jinja logic. This is enclosed in curly brackets {} and is where the data from the python script is used. Three different types of logic are used in this project:

  • Using the value of a variable in the html code with double curly braces {{ variable }}.
  • Looping over the elements of a list with a for loop
{% for element in list %}
<p>{{ element }}</p>
{% endfor %}
  • Checking if a variable is defined with an if statement and displaying a part of the page only if it is defined
{% if variable is defined %}
<p>This shows only if variable is defined</p>
{% endif %}

Once the template is ready, a flask app must be made to interact with it and feed it data.

The interaction happens in a few stages. First, a bit of python code is run (to get the list of books) and the result is sent to the template to display. Then when someone clicks on the button in the web app, some data is sent back to python (selected book) and the code in the POST section is run to compute the recommendations. The info is finally sent back to the web app, which displays the recommendations. You can explore this web app here.
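The stages described above can be sketched with a stripped-down app like the one below. It is a simplified stand-in rather than the code of the project: the template is plain html without bootstrap, it is kept in a Python string instead of a separate file, and the names page_context and create_app as well as the form field 'book' are hypothetical. The recommendations are assumed to be precomputed in a dictionary that maps each title to a list of similar titles.

```python
TEMPLATE = """
<!doctype html>
<html>
  <body>
    <form method="post">
      <select name="book">
        {% for title in titles %}
        <option>{{ title }}</option>
        {% endfor %}
      </select>
      <button type="submit">Recommend</button>
    </form>
    {% if recommendations is defined %}
    <p>Books similar to {{ selected }}:</p>
    <ul>
      {% for title in recommendations %}
      <li>{{ title }}</li>
      {% endfor %}
    </ul>
    {% endif %}
  </body>
</html>
"""


def page_context(recommendations_by_title, selected=None):
    """Build the variables sent to the template for one request."""
    context = {"titles": sorted(recommendations_by_title)}
    if selected is not None:
        # only defined after a POST, so the template hides the list on first load
        context["selected"] = selected
        context["recommendations"] = recommendations_by_title.get(selected, [])
    return context


def create_app(recommendations_by_title):
    """Create the flask app around precomputed recommendations."""
    from flask import Flask, render_template_string, request  # pip install flask

    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        selected = request.form.get("book") if request.method == "POST" else None
        context = page_context(recommendations_by_title, selected)
        return render_template_string(TEMPLATE, **context)

    return app
```

On a GET request only the list of titles is sent, so the block guarded by the if statement stays hidden. After the form is submitted, the same template is rendered again with the selected book and its recommendations. To try it, create the app with create_app(my_recommendations) and call its run() method.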

To launch the app locally, just run the following command

python app.py

and the app can be accessed at http://127.0.0.1:5000/ in any web browser.

With this in hand, we can compare the quality of the recommendations with the two different metrics. An example that I like to try is The Silmarillion by J.R.R. Tolkien, author of the Lord of the Rings. Comparing the recommendations, it is pretty clear that the cosine similarity recommendations are more relevant than the Euclidean distance ones.

Recommendations for The Silmarillion using euclidean distance. The books don’t seem too related.
Recommendations for The Silmarillion using cosine similarity. The results seem to be good since most of the suggestions are from the same author.

Structuring the data into a REST API

A web app is very useful for users who want one or two recommendations, but is not very efficient for people who want to use the engine in their own app. This is where a REST API becomes useful. It makes the data available programmatically so a script can access it at will. This can be done using the same setup in flask as the web app.

It is important to have a simple unique identifier for each book to remove any ambiguity when calling the API and interpreting the returns. Fortunately such a thing exists: the ISBN number. The goal is then to have an endpoint where someone requests recommendations for a given ISBN and receives a list of similar ISBNs in return. Here is the code added to the above flask app for the new endpoint:

There are four possible arguments available when making a call:

  • ISBN of the book requested. This is the only one that is required.

A series of conditions is checked to make sure that the arguments are in the right format. For example, the metric can only be ‘cosine’ or ‘euclidean’. If any of these conditions is violated, the API returns a 400 error with a description of the problem. If everything is correct, the API returns a list of recommendations with status code 200. Finally, if some other problem occurs, the python error is returned with a 500 error.
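The behaviour described above could be implemented along the following lines. This is a minimal sketch and not the endpoint of the project: the route /api/recommendations, the argument names 'isbn' and 'metric' and the function names are assumptions, only the two arguments described in the text are handled, and treating an unknown ISBN as a 400 error is my own choice. The recommendations are assumed to be stored as a dictionary of the form recommendations[metric][isbn].

```python
ALLOWED_METRICS = ("cosine", "euclidean")


def validate_args(args, known_isbns):
    """Return (params, None) for a valid call and (None, message) otherwise."""
    isbn = args.get("isbn")
    if not isbn:
        return None, "The 'isbn' argument is required."
    if isbn not in known_isbns:
        return None, f"Unknown ISBN: {isbn}"
    metric = args.get("metric", "cosine")  # cosine similarity is the default
    if metric not in ALLOWED_METRICS:
        return None, "The 'metric' argument must be 'cosine' or 'euclidean'."
    return {"isbn": isbn, "metric": metric}, None


def create_api(recommendations):
    """Create a flask app exposing recommendations[metric][isbn] as JSON."""
    from flask import Flask, jsonify, request  # third-party: pip install flask

    app = Flask(__name__)

    @app.route("/api/recommendations")
    def recommend():
        try:
            params, error = validate_args(request.args, recommendations["cosine"])
            if error:
                return jsonify({"error": error}), 400
            return jsonify(recommendations[params["metric"]][params["isbn"]]), 200
        except Exception as exc:  # anything unexpected becomes a 500 error
            return jsonify({"error": str(exc)}), 500

    return app
```

Keeping the checks in a plain function makes them easy to test without starting a server, and the route itself only has to translate the result into a status code.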

Let’s see an example of recommendations for the same Tolkien book as before. The distance metric is not specified so the one used is cosine similarity by default.


Making the recommendations available via heroku

Having a beautiful web app and a functional API is nice, but if no one can use them why bother? This is where heroku comes in. Deploying your code on heroku makes it available to everyone on the internet. All you need is a free account, which can be created here. Deploying your flask app on heroku is surprisingly simple. In my opinion, the best way of doing it is via github. You put your code in a github repository and link this repo to a heroku app. You can even enable automatic deployments to keep your app up to date whenever you push modifications to github.

The only addition that the code needs is a file called Procfile that contains the following line:

web: gunicorn app:app

This assumes that the flask app is located in app.py and that the Flask object in that file is also named app (gunicorn reads app:app as module:variable). You also have to make sure to include a requirements.txt file with all the libraries used in your flask code, including gunicorn. Then once the code is pushed to github, go to the heroku dashboard and create a new app. The app name will be used in the URL so make sure that you like it.

Steps to create a new app on heroku. First click on new app and then choose a name for your app.

By default the app will be created with free resources, which are enough for basic apps. The only drawback is that the app goes to sleep after some time without traffic, so the first request after a while takes a few extra seconds to load. Next, connect the app to your github repository and deploy the code located there.

In the deploy tab, link the github repository where the code is stored.
In the same tab, deploy the code in a specific branch.

Once the code is deployed successfully, the app can be accessed the same way as before, but the URL is now based on the app’s name. For example, in my case, the full link is recommending-books.herokuapp.com.

Publishing the API on RapidAPI

At this point, the project is pretty much done, except maybe for writing some short documentation for the API. However, it might be worth sharing the API on a platform to get people to use it. One interesting platform for that is RapidAPI. It is a marketplace where you can find a bunch of APIs and get access keys to use them, for free or for a small fee. There are many APIs in the marketplace that can be worth discovering.

I don’t want to describe the process of adding an API to the platform since it is pretty simple. I just want to emphasize the advantages of doing so. With your API on a platform like RapidAPI, you have access to stats about how people use your product. There is also an extra layer of security added by the platform, with authentication and tests, to make sure nothing bad happens to your API. Finally, it is very simple to write some short documentation for your endpoints and users can make sample calls directly on the platform. A bonus advantage that can arise if your API becomes popular is that you can start to ask people to pay to make calls.

The book recommendation API is available here for people who want to take a look and try the RapidAPI platform.

That was a long post and I hope I didn’t lose too many people. My goal was to show you the main steps in a simple end-to-end machine learning project without detailing everything. Let me know if you enjoyed it and if you want more content like this. Don’t hesitate to contact me with any questions or comments :D

The Startup

Medium's largest active publication, followed by +771K people. Follow to join our community.

Yan Gobeil

Written by

I am a data scientist at Décathlon Canada working on generating intelligence from sports images. I aim to learn as much as I can about AI and programming.
