Find most relevance text data using pyspark with tf-idf

reza andriyunanto
Apr 28, 2018 · 3 min read

TF-IDF or Term Frequency-Inverse Document Frequency is usually used for text mining purpose. Tf-idf weight used for evaluate how importance a keyword to document in to collection of document using statistic measure. We can find article that have highest relation with keyword.

Okay, first, import python package

from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

Thing that must remember is pyspark need numpy to run this program. So, you need to install package numpy too.

After that we need create configuration for spark :

conf = SparkConf().setMaster("local[*]").setAppName("SparkTFIDF")
sc = SparkContext(conf=conf)

Then, we need to load dataset, this case I use Indonesia news. You can download the dataset from here.

rawData = sc.textFile("dataset.tsv")

Before we process the data, we need to do pre-processing the data to get the partial data from dataset.

kecilRawData = x: x.lower())
fields = x: x.split("\t"))
documents = x: x[2].split(" "))
documentId = x: x[0])

That code process all data into lowercase, then split the data using regex \t because the data separated by tab (.tsv), you need adjust what like dataset that you used, like if you use .csv , you can used comma (,) to split the data. And then, we need save the document id to identity which document belong to.

We can create hashingTF using HashingTF, and set the fixed-length feature vectors with 100000, actually the value can adjust as the feature vectors that will used. And then, we can use the result of hashingTF to transform the document into tf.

hashingTF = HashingTF(100000)
tf = hashingTF.transform(documents)

After we got the tf, we can create the idf using this code

idf = IDF(minDocFreq=1).fit(tf)

We have tf and idf, after that we need to create tf-idf using this

tfidf = idf.transform(tf)

After we got the tf-idf, we can used it to find the most related article using keyword. We need add this code to find the most related article with keyword.

keywordTF = hashingTF.transform([keyword.lower()])
keywordHashValue = int(keywordTF.indices[0])

the keyword need to pre-processing into lower case as like the dataset before. Then we can find relevance keyword with the article dataset.

keywordRelevance = x: x[keywordHashValue])
zippedResults =

Then how to know where is the article that most relevance with my keyword? you can check with this code


Lets try, the code, when I set the keyword with string “MacGyver” the program will show the result.

Result of program

And here the dataset

In indices document ID 8, we can found MacGyver Article.

reza andriyunanto

Written by

Backend Engineer, Tech Enthusiasm, Love data and sometime curious about anything.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade