EXPEDIA GROUP TECHNOLOGY — DATA
Building Large-Scale Text Similarity Algorithms with Apache Spark ML Pipelines
Flavours of text matching algorithms in Scala
Solving machine learning (ML) problems at scale, with huge volumes of data, has been a constant challenge for organisations. Many open source data technologies are available, among which Spark stands out for the extensive data processing capabilities it provides. During ML model development, there is a lot of engineering effort involved in
- joining different data sources,
- picking the right attributes,
- creating features and finally making the training data available for model development.
Although creating extract-transform-load (ETL) pipelines using Apache Spark is familiar to many of us, creating ML-specific features at scale is still a challenge and an interesting problem to explore, as each business need/use case requires exploring a variety of algorithms and eventually scaling the chosen methodology.
At Expedia Group™, I needed to solve a deduplication problem using natural language processing (NLP) techniques. During feature engineering, one of the tasks was computing the similarity between different textual attributes using string-matching metrics such as cosine similarity, Jaccard similarity, and FuzzyWuzzy.
Since the data was around 4 billion records, processing it with standalone Python code would take ages. The natural approach was to compute the required features using a distributed processing framework, so I used Spark DataFrames and ML pipelines to generate them.
Feature engineering Spark ML pipeline
Computing cosine similarity between any two documents involves a series of steps:
- Cleaning the text — removing blank spaces, escape sequences, punctuation marks, etc.
- Tokenizing the text — splitting the document into words.
- Removing stop words — for the specific language of the text we are dealing with, we provide a list of stop words (for English, a default stop words list is already available).
- Defining the vector size, depending on the length of the document/volume of words it contains.
- Creating vectors (Word2Vec, term frequency, or CountVectorizer vectors).
- Computing inverse document frequency (IDF) vectors.
Cosine similarity
Unlike Levenshtein distance, which is natively available as part of the Spark DataFrame functions, cosine similarity is not. To compute it, I used the feature vectors generated from the above pipeline and wrote an algorithm to compute the cosine similarity.
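The computation itself is short. A sketch over plain arrays (the function name is mine; in the pipeline this would be applied to the TF-IDF vectors, e.g. via `Vector.toArray`):

```scala
// Cosine similarity: dot(a, b) / (||a|| * ||b||).
// Returns 0.0 when either vector is all zeros, to avoid division by zero.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  var dot = 0.0
  var normA = 0.0
  var normB = 0.0
  var i = 0
  while (i < a.length) {
    dot   += a(i) * b(i)
    normA += a(i) * a(i)
    normB += b(i) * b(i)
    i += 1
  }
  if (normA == 0.0 || normB == 0.0) 0.0
  else dot / (math.sqrt(normA) * math.sqrt(normB))
}
```

For large sparse vectors it is cheaper to iterate only the non-zero indices of the two `SparseVector`s, but the dense version above shows the idea.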
Jaccard similarity/FuzzyWuzzy
Similarly, I implemented the Jaccard index (which captures Jaccard similarity) and the FuzzyWuzzy string-matching algorithm. I imported FuzzyWuzzy as a Java dependency so that I could use its API.
<dependency>
    <groupId>me.xdrop</groupId>
    <artifactId>fuzzywuzzy</artifactId>
    <version>${fuzzywuzzy.version}</version>
</dependency>
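The Jaccard index over token sets is simple to implement directly; a sketch (the function name is mine), with the corresponding FuzzyWuzzy call from the `me.xdrop` Java port shown in a comment since it needs the dependency above on the classpath:

```scala
// Jaccard index over token sets: |A ∩ B| / |A ∪ B|.
def jaccardSimilarity(a: Set[String], b: Set[String]): Double = {
  if (a.isEmpty && b.isEmpty) 1.0  // convention: two empty sets are identical
  else (a intersect b).size.toDouble / (a union b).size.toDouble
}

// FuzzyWuzzy via the Java port (requires the me.xdrop:fuzzywuzzy jar):
// import me.xdrop.fuzzywuzzy.FuzzySearch
// val score: Int = FuzzySearch.tokenSortRatio("grand hotel spa", "the grand hotel spa")  // 0..100
```

`FuzzySearch` also exposes `ratio` and `partialRatio`; which variant fits best depends on how the attributes being deduplicated tend to differ.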
These similarity functions can be invoked in the Spark application. Here is an example of how it could be performed:
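For example, the functions can be wrapped as UDFs and applied column-wise to a DataFrame of candidate record pairs (a sketch — the column names are illustrative, and `cosineSimilarity`/`jaccardSimilarity` stand in for whatever implementations are in scope):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Wrap the similarity functions as Spark UDFs.
val cosineUdf = udf { (a: Vector, b: Vector) =>
  cosineSimilarity(a.toArray, b.toArray)
}
val jaccardUdf = udf { (a: Seq[String], b: Seq[String]) =>
  jaccardSimilarity(a.toSet, b.toSet)
}

// pairs: DataFrame holding the feature/token columns of the two records
// in each candidate pair.
val withSims = pairs
  .withColumn("name_cosine", cosineUdf(col("features_a"), col("features_b")))
  .withColumn("name_jaccard", jaccardUdf(col("tokens_a"), col("tokens_b")))
```

The resulting similarity columns then feed directly into the training DataFrame as features.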
Note: Cosine similarity and FuzzyWuzzy are not natively available as part of the Spark ML library, which is why I wrote custom implementations of them. These can be utilised by anyone who wishes to perform some textual analysis at scale.
Model development/distributed training using a Spark ML pipeline
Once all the features are available, the model can be developed using a Spark ML pipeline. Some of the steps involved:
- Load the training and test datasets (sampled) as DataFrames using the DataFrame API.
- Generate a DataFrame of (label, features) using the ml.linalg.Vectors API.
- Attach metadata to the features column of type Vector.
- Build a random forest classifier, specifying labelCol and featureSubsetStrategy.
- Grid search — search for the most suitable parameters by building a parameter grid; we can use the ParamGridBuilder utility. By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting parallelism to a value of 2 or more (a value of 1 will be serial) before running model selection with CrossValidator or TrainValidationSplit.
- Cross-validation — CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric for the 3 models produced by fitting the Estimator on the 3 different (training, test) dataset pairs. After identifying the best ParamMap, CrossValidator finally re-fits the Estimator using the best ParamMap and the entire dataset.
- Evaluate the model — a common metric used to evaluate the accuracy of a random forest model for binary classification is the area under the ROC curve (AUC). We can use the BinaryClassificationEvaluator to obtain the AUC.
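Putting those steps together, a condensed sketch of the training side (the dataset names, grid values, and parallelism level are assumptions for illustration):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// training/test: DataFrames with "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setFeatureSubsetStrategy("auto")

// Parameter grid for the search.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(50, 100))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderROC")

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)      // k = 3 folds
  .setParallelism(4)   // evaluate parameter sets in parallel

// fit() selects the best ParamMap by cross-validated AUC,
// then re-fits on the entire training dataset.
val cvModel = cv.fit(training)
val auc = evaluator.evaluate(cvModel.transform(test))
```

`cvModel.bestModel` exposes the winning random forest if its chosen parameters need to be inspected or logged.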
We can also use the MulticlassMetrics utility to obtain the confusion matrix, precision, recall, and F1 scores:
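A sketch of that evaluation using the RDD-based `MulticlassMetrics` (the `predictions` DataFrame is assumed to carry `prediction` and `label` columns of type double, as produced by the fitted model's `transform`):

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// MulticlassMetrics lives in the older RDD-based mllib API, so the
// (prediction, label) pairs are extracted from the DataFrame first.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)    // rows = actual, columns = predicted
println(metrics.weightedPrecision)
println(metrics.weightedRecall)
println(metrics.weightedFMeasure)   // class-weighted F1 score
```

Per-class figures are also available via `metrics.precision(label)` and `metrics.recall(label)` when the two classes behave very differently, as is common in deduplication.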