Content-Based Filtering for Book Recommendation Using PySpark

Explains similarity calculations for content-based recommendation

Beepa Bose
6 min read · Dec 24, 2021

Content-based filtering recommends items to users on the basis of their prior actions or explicit feedback. It uses item features to recommend items similar to those the user already likes.

Image 1: Content-based Recommendation

Imagine a user likes item1, and item1 is similar to item2. Then item2 will be recommended to the user based on their liking of item1. Recommendations on retail websites that say "Since you liked this item, you may like this item too" are examples of content-based recommendations.

In this article, I demonstrate how to build a content-based book recommendation system. I used the book details from Goodreads¹ and their associated tags to find similar books and recommend them to hypothetical users. Similarities can be calculated mathematically based on various features of the books, like author, number of pages, genre, year of publication etc.

Data:

I downloaded the sample book ratings and descriptions provided by Goodreads¹:
ratings.csv: Contains book ids, user ids and the ratings the users have given the books.
books.csv: Contains book details like book id, book title, author, release date etc.
tags.csv: Contains tag details: all the tag ids and tag names.
book_tags.csv: Contains the tags associated with each book id, along with a count of how many times each tag has been applied to that book.

I loaded the files into Spark SQL dataframes and explored the data:

ratings = spark.read.csv('/content/drive/MyDrive/Data Science/Books/ratings.csv', header=True, inferSchema=True)
books10k_1 = spark.read.csv('/content/drive/MyDrive/Data Science/Books/books.csv', header=True, inferSchema=True)
tag = spark.read.csv('/content/drive/MyDrive/Data Science/Books/tags.csv', header=True, inferSchema=True)
book_tag = spark.read.csv('/content/drive/MyDrive/Data Science/Books/book_tags.csv', header=True, inferSchema=True)
Image 2: Book Dataframes
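
To get a feel for the data, the dataframes can be inspected with standard Spark calls (a minimal sketch of the exploration step; the exact calls in the notebook may differ):

ratings.printSchema()
ratings.show(5)
books10k_1.select('book_id', 'title', 'authors').show(5, truncate=False)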

Preprocessing

Then I carried out a number of preprocessing steps using the 4 dataframes before starting to build the recommendation system:

  1. Enc_ID

I created a field enc_id, a string-type book id made by adding 'A' in front of the numeric book id. This encoded id helped in giving recommendations at a later stage.

books10k_2 = books10k_1.selectExpr("book_id", "concat('A', book_id) as enc_id", "title", "authors", "original_publication_year")
Image 3: Generating Enc_ID

2. Book details to ratings

I joined the ratings table with the book details using a Spark SQL join and filtered out all rows where the book title was not present.

ratings_1.createOrReplaceTempView("ratings")
books10k_2.createOrReplaceTempView("desc")
query = "SELECT ratings.book_id, user_id, rating, title, enc_id, authors, original_publication_year FROM ratings left join desc on ratings.book_id=desc.book_id"
books_ratings = spark.sql(query)
books_ratings = books_ratings.filter("title is not null")
Image 4: Joining Book_Details To Ratings

3. Keep only the top 100 most popular, highest-rated books for this analysis
First I found the most read books, i.e. the book ids that had been rated at least 100 times:

book_freq = books_ratings.groupBy("book_id").count()
most_read_indices = book_freq.filter("count >= 100").select('book_id')

Then I joined books with ratings to calculate the average ratings of the books whose ids fall in my most frequently read set:

most_read_indices.createOrReplaceTempView("most_read")
books_ratings.createOrReplaceTempView("books_ratings")
query = "SELECT distinct book_id, title, avg(rating) as average from books_ratings where title is not null and book_id in (select distinct book_id from most_read limit 100) group by book_id, title order by average desc"
highestrated_pop = spark.sql(query)
Image 5: Highest Rated Most Popular Books

4. Prepare the tag details

Then I prepared the tag details, which I used as the attributes for content-based filtering.

By joining the book_tag table with the tag details, I got the tag names and the counts of how many times each book has been tagged with a particular tag in one table:

temp = book_tag.join(tag, ["tag_id"], "left")
Image 6: Getting Tags and Counts

Then I joined this table with the highest-rated books:

temp.createOrReplaceTempView("temp")
highestrated_pop.createOrReplaceTempView("highrated")
query = "select distinct goodreads_book_id, enc_id, title, tag_name,count from (select distinct goodreads_book_id, tag_name, count from temp where goodreads_book_id in (select distinct book_id from highrated)) A left join desc on A.goodreads_book_id =desc.book_id "u=spark.sql(query)
Image 7 : TagId table joined with Highest Rated Books

Then I prepared a crosstable where each item is stored in a row and each attribute in a column, indicating whether the item has that attribute or not. This is the table we are going to use for the similarity calculations to find similar books.

temp = bookdetails_tags.sort("count", ascending=False).head(10000)
temp_df = spark.createDataFrame(temp)
attributes = temp_df.crosstab('enc_id', 'tag_name')
Image 8: Creating Crosstable
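
To sanity-check the result, the crosstable can be peeked at (a sketch; Spark names the crosstab's id column by joining the two input column names):

attributes.select(attributes.columns[:6]).show(5)
print(len(attributes.columns) - 1, 'tag columns')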

Similarity Calculations:

Now similarities can be calculated between items based on this table. The similarity score lies between 0 and 1, and the higher the score, the more similar the items. Two popular similarity metrics are:
A. Jaccard similarity
B. Cosine similarity

A. Jaccard Similarity:
Image 9: Jaccard Similarity Calculation
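
The Jaccard similarity of two attribute sets A and B is the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.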

Since the crosstable consists of 0s and 1s, in this case I used bitwise logical operations to obtain the intersection and union of the two attribute vectors; otherwise the Jaccard similarity would end up being 0 most of the time.

Example: If we are comparing item 1 and item 2 i.e. row 1 and row 2 of table 2 in Image 8:
Item1: 1 1 1 1 0 0
Item2: 0 1 0 1 1 0
Bitwise AND : [False True False True False False]
Bitwise OR : [ True True True True True False]
Numerator = 2
Denominator = 5
Hence Jaccard Sim = 0.4

The below code snippet can be used for Jaccard similarity calculation:

import numpy as np

# l1, l2: binary attribute vectors, i.e. two rows of the crosstable
intersection = np.logical_and(l1, l2)
union = np.logical_or(l1, l2)
similarity = float(intersection.sum() / union.sum())
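
As a quick check, plugging in the item vectors from the worked example above reproduces the 0.4 result:

import numpy as np
l1 = np.array([1, 1, 1, 1, 0, 0])
l2 = np.array([0, 1, 0, 1, 1, 0])
print(float(np.logical_and(l1, l2).sum() / np.logical_or(l1, l2).sum()))  # 0.4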

B. Cosine Similarity:

Image 10: Cosine Similarity Calculation
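
The cosine similarity of two vectors a and b is cos(a, b) = (a · b) / (‖a‖ ‖b‖), i.e. the dot product divided by the product of the vector norms.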

For the numerator, we take the dot product of the two vectors; the denominator is the product of the norms of the two vectors.

Example: If we are comparing item 1 and item 2 i.e. row 1 and row 2 of table 2 of Image 8:
Item1: 1 1 1 1 0 0
Item2: 0 1 0 1 1 0
Numerator = 2
Denominator = 3.4641016151377544
Hence Cosine Sim = 0.5773502691896258

The below code can be used for cosine similarity calculation:

import numpy as np

# l1, l2: attribute vectors, i.e. two rows of the crosstable
num = np.dot(l1, l2)                         # dot product (numerator)
d = np.linalg.norm(l1) * np.linalg.norm(l2)  # product of the norms (denominator)
similarity = float(num / d)
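
Plugging in the same vectors from the worked example reproduces the result:

import numpy as np
l1 = np.array([1, 1, 1, 1, 0, 0])
l2 = np.array([0, 1, 0, 1, 1, 0])
print(float(np.dot(l1, l2) / (np.linalg.norm(l1) * np.linalg.norm(l2))))  # 0.5773502691896258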

Both Jaccard and cosine are useful similarity metrics. For this solution I used Jaccard similarities to compute the similarity list. From the similarity list, I calculated the distance table and added the book titles to it as well.
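
The construction of similarity_list1 is not shown in the snippet below, so here is a minimal sketch of how such a list could be built from the crosstable, assuming attributes fits in driver memory (the notebook's actual construction may differ):

import numpy as np

rows = attributes.collect()                  # pull the crosstable to the driver
key = attributes.columns[0]                  # the crosstab id column, e.g. 'enc_id_tag_name'
ids = [r[key] for r in rows]
vecs = [np.array([r[c] for c in attributes.columns[1:]]) > 0 for r in rows]

col = ['enc_id'] + ids                       # schema: id column plus one column per book
similarity_list1 = []
for i, v1 in enumerate(vecs):
    sims = [float(np.logical_and(v1, v2).sum() / np.logical_or(v1, v2).sum()) for v2 in vecs]
    similarity_list1.append([ids[i]] + sims)

With similarity_list1 and its column list col in hand, the distance table can be created: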

rdd = sc.parallelize(similarity_list1)
d = spark.createDataFrame(rdd, schema=col)  # one row per book, one similarity column per book
d.createOrReplaceTempView("d")
books10k_2.createOrReplaceTempView("desc")
q = "select d.*, desc.title from d left join desc on d.enc_id=desc.enc_id"
final = spark.sql(q)
Image 11: Distance Table

Providing top n recommendations:

Now actual recommendations can be produced with the code below, for example for the book 'Blink':

def recommendations(n, b):
    filt = "title=='" + b + "'"
    filt2 = "title!='" + b + "'"
    d1 = final.filter(filt)  # the distance-table row for book b
    # build a stack() expression to unpivot the similarity columns into rows
    r = ""
    h = len(d1.columns) - 1  # skip the last column (title)
    for i in range(1, h):
        r = r + "'" + str(d1.columns[i]) + "'," + d1.columns[i] + ","
    t = r[0:len(r) - 1]  # drop the trailing comma
    f = "stack(" + str(i) + "," + t + ") as (T,S)"
    fin = d1.selectExpr('enc_id', f)
    fin.createOrReplaceTempView("output")
    # join back to the book details to attach titles; S is the similarity score
    q = "select S as rating, title from output left join desc on output.T=desc.enc_id"
    final_out = spark.sql(q).filter(filt2)
    final_out.sort("rating", ascending=False).show(n=n)

recommendations(3, "Blink")
Image 12: Top 3 Recommended Books
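
The selectExpr call relies on Spark SQL's stack function, which unpivots (label, value) pairs into rows. A toy sketch with two similarity columns (hypothetical data, purely to illustrate the mechanics):

df = spark.createDataFrame([('A1', 0.4, 0.9)], ['enc_id', 'A2', 'A3'])
df.selectExpr("enc_id", "stack(2, 'A2', A2, 'A3', A3) as (T, S)").show()
# yields two rows: ('A1', 'A2', 0.4) and ('A1', 'A3', 0.9)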

Links:

For full code find link to my Google Colab notebook: https://colab.research.google.com/drive/1yXS0kpJujoS0P2f1A4_rRdx366D-3U0A#scrollTo=BUrqGHKQlItj

For the data:
https://github.com/beepabose/Recommendation-Systems

References:

[1] https://www.goodreads.com

[2] https://app.datacamp.com/learn/courses/building-recommendation-engines-in-python

[3] https://app.datacamp.com/learn/courses/recommendation-engines-in-pyspark

