Switching Join Strategy, by Radhwane Chebaane and Wassim Almaaoui


The Apache Spark community recently released Spark 3.0, which brings many useful new features and significant performance improvements. A wide range of enterprises and developers already use Spark extensively for their data processing needs, and they will likely all face the same question: is it worth upgrading from Spark 2 to Spark 3?

The Spark community claims that “Spark 3.0 is roughly two times faster than Spark 2.4” on the TPC-DS 30TB benchmark.
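To make the join-strategy angle concrete, here is a minimal sketch (our own toy example, not code from the article) of two ways Spark 3.0 can end up switching join strategy: explicit join hints, and Adaptive Query Execution, which can turn a sort-merge join into a broadcast join at runtime.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the datasets and names below are illustrative only.
val spark = SparkSession.builder()
  .appName("join-strategy-demo")
  .master("local[*]")
  // AQE (off by default in 3.0) can replace a sort-merge join with a
  // broadcast join at runtime when the smaller side fits in memory.
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()

import spark.implicits._

val impressions = Seq((1L, "ad-1"), (2L, "ad-2")).toDF("vid", "adId")
val ads         = Seq(("ad-1", 0.5), ("ad-2", 0.7)).toDF("adId", "bid")

// Spark 3.0 also adds join hints to force a strategy explicitly:
impressions.join(ads.hint("broadcast"), "adId").explain() // BroadcastHashJoin
impressions.join(ads.hint("merge"), "adId").explain()     // SortMergeJoin
```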


3 actionable measures to lower your EMR bills


Cloud infrastructure and managed services are a game-changer. They give us the flexibility and autonomy to focus on the business. By leveraging them we can reduce time to market, implement new features quickly, and even change direction or drop components without having to carry a heavy legacy.

However, as we grow, our cloud costs can explode unnoticed. It is important for infrastructure and development teams to regularly monitor the cost of their services and analyze cost patterns. Many good practices require little effort but can lead to substantial savings.

In this article, we describe 3 measures that helped us to significantly lower our data processing costs on EMR, the managed Hadoop service from…


Calculate average on sparse arrays


In this fourth article of our Apache Spark series (see Part I, Part II and Part III), we present another real-life use case we faced at Teads and cover methods to consider when optimizing a Spark job. For those not familiar with Spark User Defined Aggregate Functions (UDAF), we walk through a relatively simple but useful example involving sparse arrays (arrays in which many elements are zero).
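As a taste of what such a UDAF can look like in Spark 3.0, here is a minimal sketch under assumed names (ArrayAverage, clusterProbs), not the article’s actual code: a typed Aggregator that computes an element-wise average over fixed-size arrays and is registered with functions.udaf.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Running element-wise sums plus a row count, so averages can be computed at the end.
case class AvgBuffer(sums: Array[Double], count: Long)

// Element-wise average over arrays of a known, fixed size.
class ArrayAverage(size: Int) extends Aggregator[Seq[Double], AvgBuffer, Seq[Double]] {
  def zero: AvgBuffer = AvgBuffer(Array.fill(size)(0.0), 0L)

  def reduce(b: AvgBuffer, row: Seq[Double]): AvgBuffer =
    AvgBuffer(b.sums.zip(row).map { case (s, x) => s + x }, b.count + 1)

  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
    AvgBuffer(b1.sums.zip(b2.sums).map { case (x, y) => x + y }, b1.count + b2.count)

  def finish(b: AvgBuffer): Seq[Double] = b.sums.map(_ / b.count).toSeq

  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Seq[Double]] = ExpressionEncoder[Seq[Double]]()
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: per-user cluster-probability vectors (3 clusters instead of 100).
val df = Seq(
  ("u1", Seq(0.0, 0.2, 0.8)),
  ("u1", Seq(0.0, 0.0, 1.0)),
  ("u2", Seq(0.5, 0.5, 0.0))
).toDF("vid", "clusterProbs")

val avgArray = udaf(new ArrayAverage(3))
df.groupBy("vid").agg(avgArray($"clusterProbs").as("avgProbs")).show(false)
```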

Business Context

At Teads, we deliver ads to 1.5 billion unique users (identified by a viewerId, or vid) every month. We use an unsupervised machine learning algorithm to create clusters of those users based on a set of features. This clustering is soft, meaning that for each user we compute the probability of belonging to each of our clusters (we have 100 clusters in this…

by Wassim Almaaoui
