Machine Learning mini project: Build a book recommender system web app using Python and Streamlit.

Mar 3, 2024

Install Streamlit:

Open your command line and enter:

pip install streamlit

Download dataset:

Best Books (10k) Multi-Genre Data

Data from the "Books That Everyone Should Read At Least Once" list on Goodreads

www.kaggle.com

Python

Create a new Python file named ‘book_recommender.py’ in the same folder where you downloaded the dataset.

Add the following code to your Python file:

# Import necessary libraries
import streamlit as st
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset 
books = pd.read_csv("goodreads_data.csv")

# Preprocess the data. The only step applied here is handling missing values:
# fill NaN values in the 'Description' column with an empty string,
# because TfidfVectorizer cannot process NaN entries.
books['Description'] = books['Description'].fillna('')

# Create a TF-IDF Vectorizer object
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(books['Description'])

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

scikit-learn's TfidfVectorizer converts text data into numerical vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. Here's a simple explanation:

Bag-of-Words (BoW) Model: Initially, text data is converted into a matrix where each row represents a document, and each column represents a unique word in the entire corpus. The values in this matrix are typically the raw count of words in each document.
TF-IDF Transformation:
Term Frequency (TF): It calculates the frequency of a word in a document. For example, if a word appears twice in a document with 10 words, its TF value would be 2/10.
Inverse Document Frequency (IDF): It measures how rare a word is across the entire corpus. Words that appear in most documents, like ‘the’, have low IDF values, while words that appear in only a few documents have high IDF values.
TF-IDF: This is the product of TF and IDF. It gives more weight to words that are frequent in a document but rare across documents.
How TF-IDF Improves Over BoW:
BoW weights words by raw frequency alone, so very common words dominate, while TF-IDF emphasizes words that are important to a specific document but not common across all documents.
TF-IDF helps in capturing the uniqueness of words in documents and is more effective at representing text data for machine learning models.

In summary, TF-IDF assigns weights to words based on their importance in individual documents and across the entire corpus, providing a more nuanced representation of text data compared to simple word counts.
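
To make the arithmetic above concrete, here is a minimal sketch that computes TF, IDF and TF-IDF by hand for a toy corpus. The corpus, the helper names (term_frequency, inverse_document_frequency, tf_idf) and the plain textbook formula idf = log(N / df) are assumptions made for illustration. By default, scikit-learn's TfidfVectorizer uses raw counts for TF, a smoothed IDF, and L2 normalization of each row, so its numbers will differ from the ones this sketch produces.

```python
import math

# Toy corpus (hypothetical): each "document" stands in for a book description.
corpus = [
    "a wizard attends a school of magic",
    "a detective solves a murder in london",
    "a young wizard fights a dark wizard",
]
docs = [text.split() for text in corpus]


def term_frequency(term, doc):
    """Share of the document's words that are `term`."""
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, docs):
    """Textbook IDF: log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Product of TF and IDF for one term in one document."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


# "a" appears in every document, so its IDF is log(3/3) = 0.
# "wizard" appears in two of three documents, "detective" in only one,
# so "detective" gets the higher IDF.
for term in ("a", "wizard", "detective"):
    print(term, round(inverse_document_frequency(term, docs), 3))

# TF-IDF of "wizard" in the third description (2 of its 7 words).
print(round(tf_idf("wizard", docs[2], docs), 3))
```

A word that occurs everywhere contributes nothing to the vector, which is why shared filler words do not make two descriptions look similar.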

# Function to get book recommendations based on book title
def get_recommendations(book_title, cosine_sim=cosine_sim, data=books):
    # Find the row positions of every book whose title matches exactly
    matches = (data['Book'] == book_title).to_numpy().nonzero()[0]
    if len(matches) == 0:
        return "Book title not found in the dataset"

    idx = matches[0]  # Use the first match if the title appears more than once

    # Get the pairwise similarity scores with that book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the selected book itself so it is never recommended
    sim_scores = [score for score in sim_scores if score[0] != idx]

    # Sort the books from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar books
    sim_scores = sim_scores[:10]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the titles of the top 10 recommended books
    return data['Book'].iloc[book_indices]

The code above takes a book title as input, ranks the other books by cosine similarity to it, and returns the 10 most similar titles from the dataset. This is the core of a book recommender system that suggests similar books based on the book a user selects.
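
The ranking step can be illustrated without the dataset. The sketch below computes cosine similarity by hand for a few made-up vectors and returns the closest titles, mirroring what get_recommendations does with the precomputed similarity matrix. The titles, the vectors, and the helper names cosine and top_similar are hypothetical and exist only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # An all-zero vector (e.g. an empty description) is treated as
        # having similarity 0 to everything in this sketch.
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "TF-IDF" vectors for four hypothetical titles.
vectors = {
    "Book A": [1.0, 0.0, 0.0],
    "Book B": [0.9, 0.1, 0.0],
    "Book C": [0.0, 1.0, 0.0],
    "Book D": [0.5, 0.5, 0.0],
}


def top_similar(title, vectors, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    scores = [
        (other, cosine(vectors[title], vec))
        for other, vec in vectors.items()
        if other != title
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:n]]


print(top_similar("Book A", vectors))  # prints ['Book B', 'Book D']
```

Cosine similarity depends only on the direction of the vectors, not their length, so a long description and a short one can still score as close matches when they use the same distinctive words.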

# Streamlit App to host the Book Recommender System
def main():
    st.title("Book Recommender System")

    # Sidebar to input book title
    book_title = st.sidebar.text_input("Enter a Book Title")

    if st.sidebar.button("Recommend"):
        if not book_title:
            st.write("Please enter a valid book title.")
            return

        recommended_books = get_recommendations(book_title)

        # get_recommendations returns a message string when the title is not found;
        # show it as-is instead of looping over its characters
        if isinstance(recommended_books, str):
            st.write(recommended_books)
            return

        st.subheader("Recommended Books:")
        for i, book in enumerate(recommended_books):
            st.write(f"{i+1}. {book}")

# Start the app from the command line with: streamlit run book_recommender.py
if __name__ == '__main__':
    main()