Advent of Code with Apache Spark: Day 1

Sam Reghenzi
3 min read · Dec 4, 2018


Photo by Joanna Kosinska on Unsplash

This year the “Advent of Code” challenge seems to have gained more traction than in past editions… or maybe it is just that more people I know are joining it. So, to do something different rather than writing solutions in functional TypeScript or Scala, I decided to build my solutions around the Apache Spark API. Engineering-wise, it is not a good choice to bind yourself to a very specific API, built to solve very specific problems, in order to tackle a series of unknown problems. So my initial disclaimer is that this series of posts is not meant to be a Spark tutorial or a collection of best practices. Quite the opposite: I just want to show some use cases for the Spark API, and probably some anti-patterns you should try to avoid. You can find all the code in my GitHub repo. I used Spark 2.4.0 in a “local” configuration (meaning you don’t need a separate Spark server to run the code). Let’s dive into day one:

After feeling like you’ve been falling for a few minutes, you look at the device’s tiny screen. “Error: Device must be calibrated before first use. Frequency drift detected. Cannot maintain destination lock.” Below the message, the device shows a sequence of changes in frequency (your puzzle input). A value like +6 means the current frequency increases by 6; a value like -3 means the current frequency decreases by 3.
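
As a quick illustration with made-up values (not the puzzle input): starting from frequency 0, the changes +1, -2, +3, +1 lead to 1, -1, 2 and finally 3, so the resulting frequency for part A is simply the sum of all the changes.

val exampleChanges = List(1, -2, 3, 1)
val resultingFrequency = exampleChanges.sum // 0 + 1 - 2 + 3 + 1 = 3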

First, we want to load the data from the input text file. We are loading this file directly from the local filesystem. In a real-world scenario, we would load it from a Hadoop cluster or another form of distributed, highly concurrent storage.

val rawData   = sc.textFile("src/main/resources/input_01.txt")
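
For reference, sc here is a SparkContext. A minimal local-mode setup might look like the sketch below (the application name and master string are my own choices, not part of the original code):

import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" runs Spark embedded in the current JVM using all available
// cores, so no separate cluster or Spark server is needed.
val conf = new SparkConf()
  .setAppName("advent-of-code-day-1")
  .setMaster("local[*]")
val sc = new SparkContext(conf)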

Then we want to convert the raw data into an RDD of Int. We tell Spark to cache this data since we will need it for both parts A and B.

val signedData = rawData.map(s => s.toInt).cache()

def sumAcc(acc: Int, value: Int): Int = acc + value

val solutionA = signedData.aggregate(0)(sumAcc,sumAcc)

Then we just have to reduce the data to obtain the final result: the RDD’s aggregate method allows us to combine the values inside each partition and then merge the per-partition results.
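
In other words, aggregate takes a zero value plus two functions: one that folds elements into a per-partition accumulator and one that merges the accumulators of the different partitions. Since plain integer addition works for both roles, the same sumAcc is passed twice. Just to illustrate what the call computes, a roughly equivalent formulation would be:

// For a plain sum, reduce (or sum) would normally be the idiomatic choice;
// aggregate is used above mainly to show the seqOp/combOp pair.
val solutionAViaReduce = signedData.reduce(_ + _)
val solutionAViaSum = signedData.sum().toInt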

You notice that the device repeats the same frequency change list over and over. To calibrate the device, you need to find the first frequency it reaches twice. What is the first frequency your device reaches twice?

The second problem of Day 1 is one of those you don’t want to solve with Spark: it is heavily based on side effects, and it is hard to model it as a process that takes advantage of the distributed execution model that is Spark’s strong point.

I resorted to recursion to limit the side effects, but that is more a matter of style: since you have to slice the data and prevent the various nodes from each finding their own “first repeated element”, you have to collect on every recursion iteration. Even with a slightly sophisticated partitioning strategy, dropping an element from an RDD is not something you typically want to run on a Spark cluster (i.e. RDD has no drop method ;-) ).
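
The exact code is in the repo; as a simplified sketch of the idea, the version below collects the changes once to the driver and then just cycles through them recursively, which mostly illustrates why a distributed RDD buys you very little for this part (names and details here are illustrative, not the repo code):

import scala.annotation.tailrec

// Cycle over the list of changes, keeping the running frequency and the
// set of frequencies seen so far, until a frequency shows up twice.
val changes = signedData.collect().toList

@tailrec
def firstRepeated(queue: List[Int], current: Int, seen: Set[Int]): Int =
  queue match {
    case Nil => firstRepeated(changes, current, seen) // start the list over
    case head :: tail =>
      val next = current + head
      if (seen.contains(next)) next
      else firstRepeated(tail, next, seen + next)
  }

val solutionB = firstRepeated(changes, 0, Set(0))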

If you have comments or suggestions, feel free to open an issue on GitHub or drop me a line on Twitter: @sammyrulez.
