The Apache Spark community recently released Spark 3.0, which brings many useful new features and significant performance improvements. A wide range of enterprises and developers already use Spark extensively for their data processing needs, and they are all probably facing the same question: is it worth upgrading from Spark 2 to Spark 3?
The Spark community claims that “Spark 3.0 is roughly two times faster than Spark 2.4” in the TPC-DS 30TB benchmark.
Cloud infrastructure and managed services are a game-changer. They give us the flexibility and autonomy to focus on the business: by leveraging them we can reduce time to market, implement new features quickly, and even change direction or drop components without having to carry a heavy legacy.
However, as we grow, our cloud costs can explode unnoticed. It’s important for infrastructure and development teams to regularly monitor the cost of their services and analyze cost patterns. Many good practices require little effort but can lead to substantial savings.
In this article, we describe three measures that helped us significantly lower our data processing costs on EMR, the managed Hadoop service from…
In this fourth article of our Apache Spark series (see Part I, Part II and Part III), we present another real-life use case that we faced at Teads and cover methods to consider when optimizing a Spark job. For those who are not familiar with Spark User-Defined Aggregate Functions (UDAFs), we walk through a relatively simple but useful example involving sparse arrays (arrays in which most elements have a value of zero).
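To make the sparse-array aggregation idea concrete before diving into Spark specifics, here is a minimal, hypothetical sketch in plain Python (the function names are ours, not from the Teads codebase). It assumes each sparse array is stored as an index→value map; a Spark UDAF would apply the same merge logic in its per-row update and partition-merge steps:

```python
from collections import defaultdict

def merge_sparse(acc, vec):
    """Merge one sparse vector (index -> value dict) into the accumulator."""
    for idx, val in vec.items():
        acc[idx] += val
    return acc

def aggregate_sparse(vectors, size):
    """Sum a collection of sparse vectors and densify the final result."""
    acc = defaultdict(float)
    for vec in vectors:
        merge_sparse(acc, vec)
    # Only the final output is dense; intermediate state stays sparse.
    return [acc.get(i, 0.0) for i in range(size)]

# Example: three sparse vectors of logical length 4
vectors = [{0: 0.5, 3: 0.5}, {0: 0.2, 1: 0.8}, {3: 1.0}]
print(aggregate_sparse(vectors, 4))
```

Keeping the accumulator sparse means memory and merge cost scale with the number of non-zero entries rather than the full array length, which is the property that makes this pattern attractive for the use case below.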
At Teads, we deliver ads to 1.5 billion unique users (identified by a viewerId, or vid) every month. We use an unsupervised machine learning algorithm to create clusters of those users based on a set of features. This clustering is soft, meaning that for each user we compute the probability of belonging to each of our clusters (we have 100 clusters in this…