Highlights from Spark + AI Summit 2019 | San Francisco
Takeaways by Math Longo, our Chief Data Scientist.
It’s been almost a month since the Spark + AI Summit 2019 in San Francisco. I had a fantastic experience: so many interesting talks and, most importantly, some exciting announcements that I’m here to share with you.
Later this year Spark 3.0 will be released.
Yes, Spark 2.4 is apparently not enough: we will soon have a new version to play with, and one of the coolest additions alongside it is Koalas.
Koalas is a library that replicates the Pandas DataFrame API, but with the ability to run on Spark clusters. This is really interesting because pipelines that you write and test against Pandas DataFrames can now be reused in Spark by changing just two lines of code:
import pandas as pd  =>  import databricks.koalas as ks
pd.read_json(…)      =>  ks.read_json(…)
Now you can write your code locally while remaining fully scalable. Amazing…
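To make the idea concrete, here is a minimal sketch (the toy data and column names are made up; the commented-out import is the only line you would swap to run the same pipeline on a Spark cluster with Koalas):

```python
import pandas as pd
# import databricks.koalas as pd  # swap this in on a Spark cluster

# a toy clickstream; on a cluster this could come from ks.read_json(...)
df = pd.DataFrame({"user": ["a", "b", "a"], "clicks": [3, 1, 2]})

# the same groupby/sum pipeline works unchanged under either import
per_user = df.groupby("user")["clicks"].sum()
print(per_user)
```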
Databricks will open source Delta Lake.
I got to know Delta Lake in detail at this summit (many talks and tutorials covered the framework thoroughly), and it really caught my attention. Delta Lake is a framework that works on top of your Parquet data lake and provides a set of amazing properties:
ACID operations: you can now read and write your Parquet files without any inconsistency, since every operation goes through a transaction log.
Time travel: if your data lake changes along the way, you can now go back in time and run your queries against the data as it was 30 or 60 days ago.
This is particularly useful for data science. Why? Because you can train your models on past data, test them on current data, and once you are confident enough, move the window forward.
Edit/delete/add: Delta gives you the ability to edit or delete records in your Parquet files without running a whole ETL job to do it. Particularly handy in times of GDPR ;)
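As a sketch of what time travel and record deletion look like in code (this assumes a Spark session configured with the Delta Lake package, and a hypothetical table path — not runnable without a cluster):

```python
from delta.tables import DeltaTable

path = "/data/lake/events"  # hypothetical Delta table location

# time travel: read the table as it was at an earlier point in time
old_df = (spark.read.format("delta")
          .option("timestampAsOf", "2019-04-01")
          .load(path))

# delete records without re-running a full ETL job (e.g. a GDPR request)
DeltaTable.forPath(spark, path).delete("user_id = 'someone'")
```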
Delta Lake has been around for almost a year now and has gained quite a bit of robustness. Until this Summit, however, it was proprietary: only Databricks customers running on its cloud had access to it. Now it has been open sourced to the world, which has two major benefits:
1 — Everyone can use it (we will, for sure).
2 — Many contributors will help make this project a lot better.
Besides the announcements, there were also many insightful talks.
I will talk about three of them which I found relevant for what we do on a daily basis on Retargetly.
The talk I liked the most was delivered by Andrew Clegg.
He talked about efficient joins in Spark. Anyone who has worked with Spark for a while knows how painful joins can be in terms of execution time and resources. These problems usually appear when you are dealing with skewed data, meaning a few keys have a ton of rows while most of the remaining ones have just a few.
A good way to pinpoint this problem is to look at the Spark UI and check the task time distributions.
Andrew proposed a way to deal with these situations fairly easily:
- Suppose you have a dataset D1 with (many) repeated keys which is skewed, and another dataset D2 with no repeated keys. We also assume that D2 is big, so we cannot broadcast it (otherwise a broadcast join would already do the trick). Although this seems restrictive, most real cases are of this type, so the solution is widely applicable.
- Now we add a new column to D1 containing random numbers ranging from 1 to R. Then we create a composed key in D1 by appending this new column to the original key; let’s call it CK (for composed key).
- The next step is to replicate every row of D2 R times, tagging each copy with a number from 1 to R, and to generate the CK column in D2 as well. Finally, we join D1 and D2 on CK. Easy, non-intrusive and very performant.
Isn’t it awesome??
However, this introduces many new rows for D2, which is somewhat undesirable. Thus, instead of doing it for the complete dataframe, we can salt only the keys that we know are skewed, perform a regular join for the rest of the keys, and then union the two results. Still easy.
I loved the approach overall, and it extends to stranger use cases. It can also be tweaked for groupBy (the other operation that causes many problems).
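The salting recipe above can be sketched with plain Pandas DataFrames (toy data and illustrative names; on Spark you would use the equivalent DataFrame operations):

```python
import numpy as np
import pandas as pd

R = 4  # number of salt values

# D1: skewed table, the key "hot" has most of the rows
d1 = pd.DataFrame({"key": ["hot"] * 6 + ["cold"], "value": range(7)})
# D2: table with unique keys, assumed too big to broadcast
d2 = pd.DataFrame({"key": ["hot", "cold"], "attr": ["H", "C"]})

# step 1: add a random number from 1 to R to D1, forming the composed key
d1["salt"] = np.random.randint(1, R + 1, size=len(d1))

# step 2: replicate every row of D2 R times, once per salt value
d2r = d2.merge(pd.DataFrame({"salt": range(1, R + 1)}), how="cross")

# step 3: join on the composed key (key, salt) — on Spark, the "hot" rows
# are now spread across R partitions instead of landing on a single one
joined = d1.merge(d2r, on=["key", "salt"])
```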
The second very interesting speech was given by Beck Cronin-Dixon, data-engineer at Eventbrite.
She explained how to build performant ETLs depending on the use case, suggesting four different ingestion approaches for near real-time analytics:
- Full overwrite: every run, the Spark job reads the whole batch of data, transforms it, and overwrites the previous results. It is very simple to implement, but latency is high and it puts a significant load on the real-time database.
- Batch incremental merge: the job pulls only new/changed rows and appends them to previous results. If duplicated rows introduce inconsistencies, a separate process must fix them. This puts a much lower load on the real-time database, but it requires reliable incremental fields, and the second process introduces high latency of its own.
- Append-only: a slight variation of the previous approach. It queries the real-time database for new/changed rows, then coalesces them and writes new part files. Finally, a compaction job runs hourly.
Good: latency in minutes; ingestion is easy to implement; simplifies the data lake.
Bad: requires a compaction process; extra logic in the queries.
- Key-value store: if your use case is key-value data, duplicates may not be a problem at all, since you can simply overwrite or append to what is already there.
Good: straightforward to implement; a good bridge between the data lake and web services.
Bad: batch writes to a key-value store are slower than writing to HDFS; not optimized for large scans.
- Hybrid: this last strategy ingests database transaction logs from Kafka, then merges new rows into base rows, storing transaction IDs in the base table. Duplicate rows can be removed on the fly if necessary.
- Good: Very fast, relatively easy to implement.
- Bad: the streaming merge is complicated, and processing is required on the read side.
base_events_df = spark.read.parquet("hdfs://user/warehouse/events")

# get new rows from mysql
new_mysql_events_df = get_events_jdbc_reader(numPartitions=40) \
    .option("partitionColumn", "changed") \
    .option("lowerBound", max_changed_in_hdfs) \
    .option("upperBound", max_changed_in_mysql) \
    .filter(F.col("changed") >= F.lit(max_changed_in_hdfs))

# merge the base table and stream, then deduplicate
    .orderBy("id", F.col("changed").desc()) \
    .createOrReplaceTempView("events_live")

spark.sql("select …….. from events_live")
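The merge-and-deduplicate idea that the snippet revolves around can be sketched in plain Pandas (toy data and column names of my own; on Spark it would be a union of the two DataFrames followed by an ordered deduplication per id):

```python
import pandas as pd

# base table already sitting in the data lake
base = pd.DataFrame({"id": [1, 2], "changed": [10, 20], "status": ["old", "old"]})
# new/changed rows pulled from the transactional db
new = pd.DataFrame({"id": [2, 3], "changed": [25, 30], "status": ["new", "new"]})

# merge base and new rows, keeping only the latest version of each id
events = (
    pd.concat([base, new])
    .sort_values(["id", "changed"], ascending=[True, False])
    .drop_duplicates("id", keep="first")
    .reset_index(drop=True)
)
```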
At Retargetly we have dealt with a diverse range of pipelines, and all of them have different requirements. The hybrid approach is something that we are going to test for sure in the short term since it is very promising.
Overall, this summit tells us we made the right choice, since Spark remains one of the most widely adopted technologies in the industry.
We started working with Spark a couple of months ago and, I should say, it has been quite tough, but its benefits are astonishing and coding in the map-reduce paradigm is a ton of fun!