TF-IDF, or Term Frequency-Inverse Document Frequency, is commonly used for text mining. The tf-idf weight is a statistical measure that evaluates how important a keyword is to a document within a collection of documents. With it, we can find the article most relevant to a keyword.
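Before the Spark code, here is what tf-idf actually computes, as a minimal plain-Python sketch on a made-up toy corpus. It uses the smoothed idf formula that Spark MLlib documents, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is how many documents contain term t:

```python
import math

# Toy corpus (made-up documents) just to illustrate the formula.
docs = [
    ["macgyver", "returns", "tonight"],
    ["election", "news", "tonight"],
    ["macgyver", "season", "finale"],
]

def tfidf(term, doc, docs):
    tf = doc.count(term)                        # raw term frequency in this document
    df = sum(1 for d in docs if term in d)      # number of documents containing the term
    idf = math.log((len(docs) + 1) / (df + 1))  # smoothed idf, as in Spark MLlib
    return tf * idf

# A term that appears in fewer documents gets a higher weight:
print(tfidf("finale", docs[2], docs))
print(tfidf("macgyver", docs[0], docs))
```

Note how "finale" (in one document) scores higher than "macgyver" (in two): rare terms are treated as more informative.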
Okay, first, import the Python packages:
from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
One thing to remember is that PySpark needs NumPy to run this program, so you need to install the numpy package as well.
After that, we need to create a configuration for Spark:
conf = SparkConf().setMaster("local[*]").setAppName("SparkTFIDF")
sc = SparkContext(conf=conf)
Then we need to load the dataset; in this case I use Indonesian news. You can download the dataset from here.
rawData = sc.textFile("dataset.tsv")
Before we process the data, we need to pre-process it to pull the fields we need out of the dataset.
kecilRawData = rawData.map(lambda x: x.lower())
fields = kecilRawData.map(lambda x: x.split("\t"))
documents = fields.map(lambda x: x[1].split(" "))  # splitting a row (a list) with split() would fail; take the column holding the article text and adjust the index 1 to your dataset
documentId = fields.map(lambda x: x[0])            # the column holding the document ID; adjust the index 0 to your dataset
That code lowercases all the data, then splits each line on \t because the data is tab-separated (.tsv). You need to adjust this to whatever dataset you use; for example, with a .csv file you would split on a comma (,). We also save the document IDs so we can identify which document each result belongs to.
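To see what this pre-processing does to one line, here is a quick plain-Python sketch on a hypothetical tab-separated row (the column layout here is made up; check your own dataset):

```python
# Hypothetical .tsv line: document ID, then the article text.
line = "8\tMacGyver Returns Tonight"
fields = line.lower().split("\t")  # lowercase, then split on the tab
doc_id = fields[0]                 # the document ID column
words = fields[1].split(" ")       # the article text, split into words
print(doc_id)   # 8
print(words)    # ['macgyver', 'returns', 'tonight']
```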
We can create a HashingTF using HashingTF, setting the fixed feature-vector length to 100,000; this value can be adjusted to however many features you want to use. We can then use the hashingTF to transform the documents into term frequencies (tf).
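The idea behind HashingTF is the hashing trick: each term is hashed to a fixed index in a vector of the chosen size, and counts are accumulated at that index. The sketch below uses a toy hash function as a stand-in for the real one; the point is only the mapping term → hash → index, and that a larger vector size means fewer collisions between different terms:

```python
NUM_FEATURES = 100000

def toy_index(term, num_features=NUM_FEATURES):
    # Stand-in for HashingTF's term -> index mapping:
    # hash the term, then take it modulo the vector size.
    h = sum(ord(c) * 31 ** i for i, c in enumerate(term))
    return h % num_features

def toy_tf(doc, num_features=NUM_FEATURES):
    # Build a sparse term-frequency map: index -> count.
    vec = {}
    for term in doc:
        i = toy_index(term, num_features)
        vec[i] = vec.get(i, 0) + 1
    return vec

doc = ["macgyver", "macgyver", "tonight"]
print(toy_tf(doc))  # count 2 at macgyver's index, 1 at tonight's
```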
hashingTF = HashingTF(100000)
tf = hashingTF.transform(documents)
After we have the tf, we can create the idf using this code:
idf = IDF(minDocFreq=1).fit(tf)
Now that we have tf and idf, we create the tf-idf like this:
tfidf = idf.transform(tf)
After we have the tf-idf, we can use it to find the article most related to a keyword. We need to add this code:
keyword = "MacGyver"  # the search term we want to match
keywordTF = hashingTF.transform([keyword.lower()])
keywordHashValue = int(keywordTF.indices[0])
The keyword needs the same pre-processing (lowercasing) as the dataset. Then we can score each article's relevance to the keyword:
keywordRelevance = tfidf.map(lambda x: x[keywordHashValue])
zippedResults = keywordRelevance.zip(documentId)
Then how do we know which article is most relevant to the keyword? Since zippedResults pairs each tf-idf score with its document ID, taking the maximum of the RDD gives the best match.
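The final step can be sketched in plain Python (the scores and IDs below are made up). Tuples compare element-wise, so taking the max of (score, id) pairs picks the pair with the highest tf-idf score; on the real RDD the equivalent would likely be `print(zippedResults.max())`:

```python
# zippedResults pairs each document's tf-idf score for the keyword
# with its document ID; these values are made up for illustration.
zipped_results = [(0.0, "1"), (3.7, "8"), (1.2, "3")]

# Tuples compare element-wise, so max() returns the pair with the
# highest score.
best_score, best_doc = max(zipped_results)
print(best_doc, best_score)  # 8 3.7
```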
Let's try the code. When I set the keyword to the string “MacGyver”, the program shows the result.
And here is the dataset:
At document ID 8, we find the MacGyver article.