Igniting the future of data with Apache Spark
“Some of the best theorizing comes after collecting data because then you become aware of another reality.”
Robert J. Shiller, Nobel Prize in Economics
Let’s face it, data has become a crucial and irreplaceable part of today’s society. Data sets of hundreds or thousands of GBs are routinely used, for example, by businesses trying to optimize their processes or even by entire industries looking for marketing trends and trying to predict the next big craze.
Within the software arena, huge amounts of data help improve user experiences across the many devices we use every day: measuring which pages or parts of a website attract the most attention, modelling and predicting user intent in search engines, and so on.
With billions of people accessing the internet every day, delivering a pleasant user experience is a key factor for success, and how data is gathered, transformed and analysed plays a big role in that. Users have their own needs, habits and ways of doing things in a given context, say a website or an application. Understanding those needs is a crucial and powerful way of steering product development in the right direction, with benefits for both users and makers.
“The world’s most valuable resource is no longer oil, but data.”
The Economist
At EmpathyBroker, being search and discovery experts makes us naturally inquisitive about what users do on a website. What do they search for after their initial browse? How do they refine their searches? Which products grab their attention? Is there a specific item or search that’s booming?
In this pursuit of getting to know users and their habits better, this post looks at one of the ways data is used at EmpathyBroker and how the processes around its analysis are continuously developed and improved.
What’s next?
Imagine you visit a well-known eCommerce website that happens to use our search engine and type something into the search box, let’s say ‘jeans’. As soon as you press ‘Enter’ you are taken to a results page with the most relevant products for your search.
But what if we could go beyond that first set of results and start predicting and suggesting what you’d like… Basically, offering the user what’s next?
In order to generate these suggestions we need to track the relevant data, process it behind the scenes, and then use the results to provide the user with feedback in the form of related, or next, queries.
How does the magic happen?
As a little introduction to our processing pipeline, we’re going to make a brief stop in order to sketch a simplified version of the algorithm we use to calculate these related queries.
We start by analysing single user sessions, which for us last 30 minutes. Take for example this hypothetical list of query terms that a user could have typed in the search engine box:
[q1, q2, q3, q4, q2, q4, q2, q5, q2, q5, q2, q3]
We first pair up the searches that were performed consecutively during the session, which gives us a set of (q_origin, q_next_query) pairs together with the number of times each pair appears in the session:
[((q1, q2), 1), ((q2, q3), 2), ((q3, q4), 1), ((q4, q2), 2), ((q2, q4), 1), ((q2, q5), 2), ((q5, q2), 2)]
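To make this concrete, here is a minimal sketch in plain Scala of how the consecutive pairs and their counts could be derived from the example session above. The object and method names are ours, purely for illustration, not the actual implementation:

object SessionPairs {

  // Derive consecutive (origin, next) pairs from a session and count
  // how many times each pair occurs.
  def consecutivePairs(session: Seq[String]): Map[(String, String), Int] =
    session
      .sliding(2)
      .collect { case Seq(a, b) => (a, b) }
      .toSeq
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val session = Seq("q1", "q2", "q3", "q4", "q2", "q4", "q2", "q5", "q2", "q5", "q2", "q3")
    consecutivePairs(session).foreach { case ((origin, next), count) =>
      println(s"(($origin, $next), $count)") // e.g. ((q2, q3), 2)
    }
  }
}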
We then group each of the tuples by origin query and calculate the next query probability, dividing the number of pair appearances by the total number of pairs with that origin:
q1, [((q1, q2), 1/1)],
q2, [((q2, q3), 2/5), ((q2, q4), 1/5), ((q2, q5), 2/5)],
q3, [((q3, q4), 1/1)],
q4, [((q4, q2), 2/2)],
q5, [((q5, q2), 2/2)]
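Continuing the sketch, the grouping and probability calculation could look something like this (again purely illustrative):

object NextQueryProbabilities {

  // Group the pair counts by origin query and turn each count into a
  // probability relative to the total number of pairs with that origin.
  def probabilities(
      pairCounts: Map[(String, String), Int]
  ): Map[String, Map[(String, String), Double]] =
    pairCounts
      .groupBy { case ((origin, _), _) => origin }
      .map { case (origin, pairs) =>
        val total = pairs.values.sum.toDouble
        origin -> pairs.map { case (pair, count) => pair -> (count / total) }
      }
}

For the example session this gives 0.4, 0.2 and 0.4 for the pairs starting at q2, matching the 2/5, 1/5 and 2/5 above.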
These probabilities are further refined by re-traversing the graph and taking into account other factors, like the position of the query within the session. The final set of probabilities for a user session looks like this:
q1, [((q1, q2), x)],
q2, [((q2, q3), y’), ((q2, q4), y’’), ((q2, q5), y’’’)],
q3, [((q3, q4), z)],
q4, [((q4, q2), a)],
q5, [((q5, q2), b)]
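The exact refinement isn’t detailed in this post, but as a purely hypothetical illustration of the idea, a sketch could weight each pair by the position of its origin query within the session and then re-normalise per origin. The weighting scheme below is made up for illustration and is not the actual algorithm:

object PositionWeighting {

  // Weight each consecutive pair by the position of its origin query in the
  // session (later searches count slightly more in this toy scheme), then
  // re-normalise per origin. The 1.0 + 0.1 * idx weight is an assumption
  // made purely for illustration.
  def weightedProbabilities(session: Seq[String]): Map[String, Map[(String, String), Double]] = {
    val weightedPairs = session
      .sliding(2)
      .zipWithIndex
      .collect { case (Seq(a, b), idx) => ((a, b), 1.0 + 0.1 * idx) }
      .toSeq

    weightedPairs
      .groupBy { case ((origin, _), _) => origin }
      .map { case (origin, pairs) =>
        val total = pairs.map(_._2).sum
        origin -> pairs
          .groupBy(_._1)
          .map { case (pair, weights) => pair -> weights.map(_._2).sum / total }
      }
  }
}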
The results for each session are then combined across all users and sessions within a given time frame to build a model of the related queries for any given original query. Of course, this model has to be continuously updated to adapt to new trends and changes in user behaviour.
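As a rough sketch of that combination step, here is one way it could be done, assuming we simply sum the pair counts across sessions and re-normalise per origin query; the real pipeline also folds in the refinement weights described above, so take this as an approximation:

object ModelAggregation {

  type PairCounts = Map[(String, String), Int]

  // Merge per-session pair counts into a single related-queries model:
  // sum the counts for each pair across sessions, normalise per origin
  // query, and sort the suggestions by probability.
  def combine(sessions: Seq[PairCounts]): Map[String, Seq[(String, Double)]] =
    sessions
      .flatMap(_.toSeq)
      .groupBy { case ((origin, _), _) => origin }
      .map { case (origin, entries) =>
        val countsByNext = entries
          .groupBy { case ((_, next), _) => next }
          .map { case (next, cs) => next -> cs.map(_._2).sum }
        val total = countsByNext.values.sum.toDouble
        origin -> countsByNext
          .map { case (next, count) => next -> count / total }
          .toSeq
          .sortBy(-_._2)
      }
}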
The good old batch
Previously we calculated next queries the old-fashioned way. A batch job would run periodically, loading session data from a database and performing the next query calculations in memory.
Even though this approach was functional, it was neither practical nor elegant, and it had several problems:
- Lack of performance. Performing all the required calculations and comparisons in memory within a single batch job isn’t a great idea for large data sets.
- High coupling. The next query calculations lived as a ‘child’ of a much bigger service; we wanted to move them into their own project so they could be scaled independently of the service they were originally developed in.
- Not-so-legible code. Because it grew organically as the algorithm evolved, the code wasn’t straightforward to read and the documentation wasn’t sufficient.
All this gave us the opportunity to start afresh. After gathering information about how we could make this better, we came to the following conclusion…
Spark: the way forward
We decided to move the batch processing into Spark for several reasons, but primarily because of:
- High performance. Spark provides distributed, high-performance in-memory computation out of the box. Not having to deal with that part ourselves saves a lot of time and increases productivity.
- Clean API. Easy to understand and much more straightforward than the previous implementation of the algorithm, with operations like .map() and .reduce() making it simple to add extra functionality.
Furthermore, in order to fully exploit Spark’s capabilities, we decided to implement the new batch process in Scala, the language Spark itself is written in.
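To give a flavour of what this looks like, here is a minimal, self-contained Spark job in Scala that runs the core of the related-queries calculation end to end in the same map/reduce style. It’s a simplified sketch with illustrative names and an inlined data set, not the production code:

import org.apache.spark.sql.SparkSession

object RelatedQueriesBatch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("related-queries")
      .master("local[*]") // local master just for the sketch
      .getOrCreate()
    val sc = spark.sparkContext

    // In production the sessions would be read from storage; two toy
    // sessions are inlined here to keep the example self-contained.
    val sessions = sc.parallelize(Seq(
      Seq("q1", "q2", "q3", "q4", "q2", "q4", "q2", "q5", "q2", "q5", "q2", "q3"),
      Seq("q2", "q3", "q1", "q2")
    ))

    val model = sessions
      .flatMap(_.sliding(2).collect { case Seq(a, b) => ((a, b), 1L) }) // consecutive pairs
      .reduceByKey(_ + _)                                               // pair counts across all sessions
      .map { case ((origin, next), count) => (origin, (next, count)) }
      .groupByKey()
      .mapValues { nexts =>
        val total = nexts.map(_._2).sum.toDouble
        nexts.map { case (next, count) => next -> count / total }       // per-origin probabilities
          .toSeq
          .sortBy(-_._2)
      }

    model.collect().foreach { case (origin, suggestions) =>
      println(s"$origin -> ${suggestions.mkString(", ")}")
    }

    spark.stop()
  }
}

On a real cluster the master and the data source would, of course, come from configuration rather than being hard-coded.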
Conclusions of the journey
Since the beginning of this journey a few months ago we’ve learnt a lot along the way.
- Spark brings you simplicity. Elegant API and top-notch performance right out of the box, great!
- Speed, finally! The Spark batch processes run twice as fast as the old code in plain Java.
- New language, new powers. After building the whole pipeline in Scala we’ve seen how powerful it is and how simple it makes developing Spark code.
- Service isolation. Moving the batches out of our main project was a relief: fewer dependencies and independent scalability.
Hopefully this post has shed some light on Spark and how it can ignite your data pipelines, making them richer and faster than ever before!