Streaming architecture to the rescue

How we solved the challenges of scale and catalogue updating

The second RecSys London meetup took place at Criteo Labs’ office. As well as boasting a great roof terrace with a view to match, it made an excellent venue for an informal evening of talks, Q&A, networking, drinks and tasty food.

Our Lead Data Scientist Dr. Mahbub Gani had the opportunity to speak about the challenges of scale and catalogue updating and share the new streaming architecture being implemented at Bibblio.

Here’s a transcript of his talk, Practical Streaming:


“Hello, I’m Mahbub Gani. I’m the Lead Data Scientist at Bibblio and today I’d like to talk about what I call ‘practical streaming’. Before I dive into the challenges we were trying to solve, I want to start off with the following maxim about the importance of balance.

Miyoko Shida Rigolo at InnoTown® 2014 exhibiting balance with exquisite skill

“Improving your recommender system is all about balancing your aspirations against the time-to-market.

“Our initial recommender system aimed to deliver suggestions related to source documents in a “semantic” sense. After running a series of experiments, including subjective, blind evaluations by the Bibblio team, we discovered that TF-IDF outperformed other options such as Word2vec and Latent Dirichlet Allocation. This was due to the data available at the time and the nature of our clients’ content. We started delivering relevant content recommendations using TF-IDF vectorisation and nearest neighbours.
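
As a rough illustration of the approach described here, and not Bibblio’s actual pipeline, a minimal sketch using scikit-learn might look like the following; the documents and parameters are invented for the example.

```python
# Minimal sketch of TF-IDF + nearest-neighbour recommendations using
# scikit-learn; illustrative only, not Bibblio's production pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

documents = [
    "How to balance a recommender system",
    "Streaming architectures for content recommendations",
    "TF-IDF and nearest neighbours in practice",
    "Latent Dirichlet Allocation for topic modelling",
]

# Vectorise the corpus: each document becomes a sparse TF-IDF vector.
vectoriser = TfidfVectorizer(stop_words="english")
tfidf = vectoriser.fit_transform(documents)

# Index the vectors and look up the most similar items for each document.
knn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(tfidf)
distances, indices = knn.kneighbors(tfidf)

for doc_id, neighbours in enumerate(indices):
    related = [i for i in neighbours if i != doc_id]
    print(doc_id, "->", related)
```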

“With that in mind, we were presented with two challenges we had to solve to improve our recommender system for our client base.

“Firstly, the scale challenge. Not surprising at all. With TF-IDF vectorisation and nearest neighbours you face the quadratic complexity of performing pairwise comparisons between the embeddings of every item in your content pool. This is a canonical example of how algorithmic inefficiencies and complexities can have a big influence on the business. We have quite a tight on-boarding pipeline for our clients, and we were effectively throttling that pipeline because we had to wait a few hours for these pairwise comparisons to complete. That’s not viable for some of our clients.

“This is connected to the second challenge, which we call the content update challenge. Some of our clients, for example The Day who provide an online news service for schools, publish new articles on a daily basis. It’s in their best interest, and therefore in our interest, that those ‘fresh’ content items are reflected in the recommendations served to their users. With our pre-streaming architecture we need to wait an hour or so before the recommendations for the freshly ingested items are available. This is because you have to perform the pairwise comparisons, which have quadratic complexity.

“We also have other types of clients, for example the Canadian Electronic Library (CEL), who are not ingesting books on an hourly or even daily basis. In this case, we can afford to do a full or bulk index of the entire corpus overnight or when a reasonable number of content items have been accumulated in our system. A subliminal message, by coincidence, in the recommendations for CEL here — Balancing Act.

“This [below] is a busy slide, deliberately so. See it as an attempt to obfuscate our secret sauce. Actually no — it’s not rocket science! I’m just going to run through the principal components of our streaming architecture.

“Streaming is coming to the rescue here. And I’d rather call it quasi-streaming. So we’re not using Spark Streaming, we’re using a conventional Spark architecture to address the streaming problem. So as before, we generate these TF-IDF embeddings, these vectors, for the new items and the original items. But we do have the option of persisting the original item vectors.

“The problem that arises when you don’t recompute the vectors of all your items is twofold. First of all, the inverse document frequency part of TF-IDF is computed across all documents, so you need a way of recomputing it approximately as fresh items are ingested. Secondly, it introduces complexity downstream because you have to do some additional computations. But relative to the quadratic complexity of the distance matrix computation for the pairwise comparison, these problems are not such an issue. We decided in our architecture to just recompute all those vectors.

“Importantly, it’s possible to claw back some computational time by persisting the dictionary, which you can reload as part of the TF-IDF pipeline. And because you only have a small number of new items, you only need to recompute the distances between those items and the other items of your corpus. Therefore, if the size of the corpus is N and the number of new items is n, the complexity is of order the product of N and n. And if n is much smaller than N, that’s considerably lower than quadratic complexity. So practically, our solution dramatically reduces the computational time.
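
To make the saving concrete, here is a hedged sketch of that partial distance computation using NumPy and scikit-learn with made-up sizes: only the n new vectors are compared against the N persisted corpus vectors, giving order N·n work instead of the N² of a full all-against-all comparison.

```python
# Sketch of the partial distance computation: compare only the n new item
# vectors against the N existing ones (O(N * n)) instead of recomputing the
# full N x N distance matrix. Sizes and data are invented for illustration.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

N, n, dims = 10_000, 50, 300               # corpus size, new items, vector size
corpus_vectors = np.random.rand(N, dims)   # persisted TF-IDF vectors
new_vectors = np.random.rand(n, dims)      # vectors for freshly ingested items

# n x N similarity matrix: one row of scores per new item.
scores = cosine_similarity(new_vectors, corpus_vectors)

# Inbound recommendations: the top 10 existing items for each new item.
top_k = np.argsort(-scores, axis=1)[:, :10]
```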

Dr Mahbub Gani, Lead Data Scientist at Bibblio, presents at the second RecSys London Meetup

“The other thing I want to mention, and I’m interested to hear your thoughts: we were using the column similarity computation routine that comes out of the box with Spark. And frankly, it was… rubbish. There is always a trade-off between taking something out of the box and saving development time, or ripping it out and building it from scratch so it’s optimised for your particular circumstances. After a considerable amount of pain, we decided to just rip it out and write our own column similarity routine. Conceptually, it isn’t that difficult, and it dramatically reduced the computational time. We’re talking about going from a 10-hour computation to a 5-hour one or even lower. If you look at the routine, the main problem with the out-of-the-box Spark implementation is that it manages the RDDs injudiciously. It’s highly asymmetric, which is a bit silly if you think about it. Our own routine balances the computations between the RDDs in a fairer and more sensible manner.
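
For flavour, here is one way a hand-rolled similarity routine can spread the work evenly across a PySpark job: broadcast a small block of new-item vectors and let each partition score its own corpus items locally. This is an illustrative sketch, not Bibblio’s actual replacement for Spark’s columnSimilarities(); all names and sizes are invented.

```python
# Hedged sketch of a hand-rolled similarity routine in PySpark. The small
# block of new-item vectors is broadcast to the workers, and every corpus
# partition computes its own chunk of the distance matrix, keeping the work
# evenly spread rather than skewed onto a few tasks.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="partial-similarity-sketch")

# (item_id, unit-normalised TF-IDF vector) pairs for the existing corpus.
corpus = sc.parallelize(
    [(i, np.random.rand(300)) for i in range(10_000)]
).mapValues(lambda v: v / np.linalg.norm(v))

# Small dense block of new-item vectors, normalised and broadcast to workers.
new_block = np.random.rand(50, 300)
new_block /= np.linalg.norm(new_block, axis=1, keepdims=True)
new_bc = sc.broadcast(new_block)

# Each corpus item computes its cosine similarity to all new items locally.
similarities = corpus.map(lambda kv: (kv[0], new_bc.value.dot(kv[1])))

# e.g. similarity scores of corpus item 0 against every new item
print(similarities.lookup(0))
```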

“After that, we insert what we call the inbound recommendations for each freshly ingested item, i.e. its recommendations against all other items, into a Redis cache. This is all done offline. At presentation time the recommendations are retrieved from Redis.
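
A minimal sketch of that caching step with redis-py; the key naming and JSON layout here are assumptions for illustration, not Bibblio’s actual schema.

```python
# Sketch of caching precomputed recommendations in Redis with redis-py.
# Key names and the data layout are invented for this example.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Inbound recommendations for a freshly ingested item, computed offline.
item_id = "article-2024-001"
recommendations = [
    {"itemId": "article-1987-113", "score": 0.91},
    {"itemId": "article-2001-042", "score": 0.87},
]

# Store the recommendation set; at presentation time the API simply
# reads this key instead of computing anything.
r.set(f"recs:{item_id}", json.dumps(recommendations))

# Retrieval at presentation time.
cached = json.loads(r.get(f"recs:{item_id}"))
```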

“The outbound recommendations, i.e. reflecting those new items in the recommendation sets of the base corpus of original items, require new computation, and that’s something we do as part of a bulk index. There you have to take the hit of quadratic complexity.

“This [above] is our streaming architecture. I don’t think it will be much of a surprise now. We have fresh items coming in, which get ingested and enriched. We’ve plugged our ingestion into Watson Natural Language Understanding to produce metadata, and we also offer that as an option for clients to expose to their users. Those items are then queued in order to be indexed into Redis. We have a butler handler process that determines which streaming mode to activate, and this draws upon optimisation heuristics. The heuristics are based on the size of the corpus, the number of new items and the nature of the client, among other things. This determines whether we should perform a bulk index, which includes an all-against-all comparison and introduces quadratic complexity, or go down the streaming route.
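
The heuristics themselves are not spelled out in the talk, but a butler-style handler might look something like the sketch below, with invented thresholds and mode names.

```python
# Hedged sketch of the kind of heuristic a "butler" handler might apply to
# choose an indexing mode. The thresholds and mode names are invented for
# illustration, not Bibblio's actual rules.
def choose_indexing_mode(corpus_size: int, new_items: int,
                         bulk_client: bool = False) -> str:
    """Return 'bulk', 'full_vectorisation' or 'partial_vectorisation'."""
    if bulk_client or corpus_size == 0:
        # Clients like CEL: reindex everything overnight, or once enough
        # new items have accumulated.
        return "bulk"
    if new_items / corpus_size > 0.10:
        # Many new items relative to the corpus: recompute all vectors,
        # but still only compute the partial distance matrix.
        return "full_vectorisation"
    # A trickle of fresh items: reuse the persisted dictionary.
    return "partial_vectorisation"
```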

“We can perform a full vectorisation, meaning we recompute all the TF-IDF vectors and the dictionary. Otherwise, we have the option of performing a partial vectorisation: we basically still vectorise everything, but we load back in the previously saved dictionary. And as I mentioned before, in both streaming cases we perform a partial distance matrix computation. We finally push all the recommendations into Redis.
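
One hedged way to picture the distinction between full and partial vectorisation, with scikit-learn and joblib standing in for Bibblio’s actual pipeline: the fitted vectoriser, and with it the dictionary and IDF weights, is persisted during a bulk index and simply reloaded when fresh items arrive.

```python
# Sketch of "partial vectorisation": persist the fitted vectoriser (and hence
# its dictionary / IDF weights) and reuse it for fresh items, rather than
# refitting on the whole corpus. scikit-learn and joblib are used here purely
# for illustration; file names and documents are invented.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Full vectorisation (bulk index): fit on the whole corpus and persist.
corpus = ["existing document one", "existing document two"]
vectoriser = TfidfVectorizer().fit(corpus)
joblib.dump(vectoriser, "tfidf_dictionary.joblib")

# Later, when fresh items arrive: load the dictionary back in and vectorise
# with it. The IDF weights are slightly stale, but the expensive refit and
# the full all-against-all comparison are avoided.
vectoriser = joblib.load("tfidf_dictionary.joblib")
new_vectors = vectoriser.transform(["a freshly ingested article"])
```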

“An important side note here. One of the things we’re considering as part of the streaming architecture is spinning up ad hoc clusters, in order to perform bulk ingestion or the partial vectorisation for modest amounts of freshly ingested items. Currently, we have permanent, lightweight clusters on Amazon Web Services. For our bulk ingestion we’re thinking about using the Google Cloud Platform. With a free account you can get in yourself and try launching Spark clusters, submitting jobs, running your evaluation and then tearing the cluster down. And even better, you can do all of this programmatically, from a Jupyter notebook, say.
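
As an example of what that programmatic workflow might look like, the sketch below drives the gcloud CLI for Dataproc from Python, say from a notebook cell. The cluster name, region and job script path are placeholders, not details from the talk.

```python
# Hedged sketch of programmatic cluster creation, job submission and teardown
# on Google Cloud Dataproc, driving the gcloud CLI from Python. All names,
# the region and the script path are placeholders.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

cluster, region = "adhoc-bulk-index", "europe-west1"

# Spin up an ad hoc Spark cluster for the bulk index...
run(["gcloud", "dataproc", "clusters", "create", cluster,
     "--region", region, "--num-workers", "2"])

# ...submit the PySpark job...
run(["gcloud", "dataproc", "jobs", "submit", "pyspark",
     "gs://my-bucket/bulk_index.py", "--cluster", cluster, "--region", region])

# ...and tear the cluster down again once the job has finished.
run(["gcloud", "dataproc", "clusters", "delete", cluster,
     "--region", region, "--quiet"])
```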

“To conclude my talk, some parting wisdom on practical streaming. First of all: know your clients. Segment, not regiment. Don’t treat them all as one. One size doesn’t fit all, definitely not in the recommender space.

“Secondly, engage in what I call ‘small’ r and D and not ‘big’ R & D. I was doing a lot of the R when I was in academia, and not much of the D. Almost non-existent. What I’m doing now is a bit of research, some development and then push to production. I have to tell you, from my experience, I get a massive kick out of getting through that cycle quickly.

“Thirdly, Spark is good for certain things, but not that good. You might have had that experience; I’m sure I and lots of my data science friends have. I know many people are basically retreating back to just a single beefy machine. They get rid of the network overhead and gain control over the integration and their process.

“Finally, think about programmatic cluster creation, job submission and termination, which you can perform on demand.

“I think that’s it for today. Thank you all for listening.”


Bibblio solves the problems of audience retention and engagement by showing each user the best of your own content.


Originally published at bibblio.org