EXPEDIA GROUP TECHNOLOGY — DATA
Building Large-Scale Text Similarity Algorithms with Apache Spark ML Pipelines
Flavours of text matching algorithms in Scala
Solving machine learning (ML) problems at scale, with huge volumes of data, has been a constant challenge for organisations. Many open source data technologies are available, among which Spark stands out for the extensive data processing capabilities it provides. During ML model development, there is a lot of engineering effort involved in
- joining different data sources,
- picking the right attributes,
- creating features and finally making the training data available for model development.
Although creating extract-transform-load (ETL) pipelines using Apache Spark is familiar to many of us, creating ML-specific features at scale is still a challenge and an interesting problem to explore, as each business need/use case requires exploring a variety of algorithms and eventually scaling the chosen methodology.
At Expedia Group™, I needed to solve a deduplication problem using natural language processing (NLP) techniques. During feature engineering, one of the tasks was computing the similarity between different textual attributes using string-matching metrics such as cosine similarity, Jaccard similarity, and FuzzyWuzzy.
Since the data was around 4 billion records, processing it with standalone Python code would take ages. The natural approach was to compute the required features using a distributed processing framework, so I used Spark DataFrames and ML pipelines to generate them.
Feature engineering Spark ML pipeline
Computing cosine similarity between any two documents involves a series of steps:
- Cleaning the text — removing blank spaces, escape sequences, punctuation marks, etc.
- Tokenizing the text — splitting the document into words.
- Removing stop words — for the specific language of the text we are dealing with, we provide a list of stop words (for English, a default stop words list is already available).
- Defining the vector size, depending on the length of the document/volume of words it contains.
- Creating vectors (Word2Vec, term frequency, or CountVectorizer vectors).
- Computing inverse document frequency (IDF) vectors.
Cosine similarity
Unlike Levenshtein distance, which is natively available as part of the Spark DataFrame functions, cosine similarity is not. To compute it, I used the feature vectors generated from the above pipeline and wrote an algorithm to compute the cosine similarity.
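The computation itself is short. A sketch over plain arrays (the function name is mine; in the pipeline this would be applied to the TF-IDF vectors, e.g. via `Vector.toArray`):

```scala
// Cosine similarity: dot(a, b) / (||a|| * ||b||).
// Returns 0.0 when either vector is all zeros, to avoid division by zero.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  var dot = 0.0
  var normA = 0.0
  var normB = 0.0
  var i = 0
  while (i < a.length) {
    dot   += a(i) * b(i)
    normA += a(i) * a(i)
    normB += b(i) * b(i)
    i += 1
  }
  if (normA == 0.0 || normB == 0.0) 0.0
  else dot / (math.sqrt(normA) * math.sqrt(normB))
}
```

For large sparse vectors it is cheaper to iterate only the non-zero indices of the two `SparseVector`s, but the dense version above shows the idea.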
Jaccard similarity/FuzzyWuzzy
Similarly, I implemented the Jaccard index (which captures Jaccard similarity) and the FuzzyWuzzy string-matching algorithm. I imported FuzzyWuzzy as a Java dependency so that I could use its API.
<dependency>
    <groupId>me.xdrop</groupId>
    <artifactId>fuzzywuzzy</artifactId>
    <version>${fuzzywuzzy.version}</version>
</dependency>
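The Jaccard index over token sets is simple to implement directly; a sketch (the function name is mine), with the corresponding FuzzyWuzzy call from the `me.xdrop` Java port shown in a comment since it needs the dependency above on the classpath:

```scala
// Jaccard index over token sets: |A ∩ B| / |A ∪ B|.
def jaccardSimilarity(a: Set[String], b: Set[String]): Double = {
  if (a.isEmpty && b.isEmpty) 1.0  // convention: two empty sets are identical
  else (a intersect b).size.toDouble / (a union b).size.toDouble
}

// FuzzyWuzzy via the Java port (requires the me.xdrop:fuzzywuzzy jar):
// import me.xdrop.fuzzywuzzy.FuzzySearch
// val score: Int = FuzzySearch.tokenSortRatio("grand hotel spa", "the grand hotel spa")  // 0..100
```

`FuzzySearch` also exposes `ratio` and `partialRatio`; which variant fits best depends on how the attributes being deduplicated tend to differ.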
These similarity functions can be invoked in the Spark application. Here is an example of how it could be performed:
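For example, the functions can be wrapped as UDFs and applied column-wise to a DataFrame of candidate record pairs (a sketch — the column names are illustrative, and `cosineSimilarity`/`jaccardSimilarity` stand in for whatever implementations are in scope):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Wrap the similarity functions as Spark UDFs.
val cosineUdf = udf { (a: Vector, b: Vector) =>
  cosineSimilarity(a.toArray, b.toArray)
}
val jaccardUdf = udf { (a: Seq[String], b: Seq[String]) =>
  jaccardSimilarity(a.toSet, b.toSet)
}

// pairs: DataFrame holding the feature/token columns of the two records
// in each candidate pair.
val withSims = pairs
  .withColumn("name_cosine", cosineUdf(col("features_a"), col("features_b")))
  .withColumn("name_jaccard", jaccardUdf(col("tokens_a"), col("tokens_b")))
```

The resulting similarity columns then feed directly into the training DataFrame as features.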
Note: Cosine similarity and FuzzyWuzzy are not natively available as part of the Spark ML library, which is why I wrote custom implementations of them. These can be utilised by anyone who wishes to perform some textual analysis at scale.
Model development/distributed training using a Spark ML pipeline
Once all the features are available, the model can be developed using a Spark ML pipeline. Some of the steps involved:
- Load the training and test datasets (sampled) as DataFrames using the DataFrame API.
- Generate a DataFrame of (label, features) using the ml.linalg.Vectors API.
- Attach metadata to the features column of type Vector.
- Build a random forest classifier, specifying labelCol and featureSubsetStrategy.
- Grid search — search for the most suitable parameters by building a parameter grid; we can use the ParamGridBuilder utility. By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting parallelism to a value of 2 or more (a value of 1 will be serial) before running model selection with CrossValidator or TrainValidationSplit.
- Cross-validation — CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric for the 3 models produced by fitting the Estimator on the 3 different (training, test) dataset pairs. After identifying the best ParamMap, CrossValidator finally re-fits the Estimator using the best ParamMap and the entire dataset.
- Evaluate the model — a common metric used to evaluate the accuracy of a random forest model for binary classification is the area under the ROC curve (AUC). We can use the BinaryClassificationEvaluator to obtain the AUC.
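Putting those steps together, a condensed sketch of the training side (the dataset names, grid values, and parallelism level are assumptions for illustration):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// training/test: DataFrames with "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setFeatureSubsetStrategy("auto")

// Parameter grid for the search.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(50, 100))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderROC")

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)      // k = 3 folds
  .setParallelism(4)   // evaluate parameter sets in parallel

// fit() selects the best ParamMap by cross-validated AUC,
// then re-fits on the entire training dataset.
val cvModel = cv.fit(training)
val auc = evaluator.evaluate(cvModel.transform(test))
```

`cvModel.bestModel` exposes the winning random forest if its chosen parameters need to be inspected or logged.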
We can also use the MulticlassMetrics utility to obtain the confusion matrix, precision, recall, and F1 scores:
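A sketch of that evaluation using the RDD-based `MulticlassMetrics` (the `predictions` DataFrame is assumed to carry `prediction` and `label` columns of type double, as produced by the fitted model's `transform`):

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// MulticlassMetrics lives in the older RDD-based mllib API, so the
// (prediction, label) pairs are extracted from the DataFrame first.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)    // rows = actual, columns = predicted
println(metrics.weightedPrecision)
println(metrics.weightedRecall)
println(metrics.weightedFMeasure)   // class-weighted F1 score
```

Per-class figures are also available via `metrics.precision(label)` and `metrics.recall(label)` when the two classes behave very differently, as is common in deduplication.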