Apache Spark and Me…

This story starts a wee bit earlier than today, or even a year ago. The true start was when a colleague went on vacation and handed me a couple of Apache Spark problems to handle. It was my 4th month [October 2014] since I began working at Cloudera, and I was not shy about learning new things. But dealing with YARN, MapReduce, and their interrelations with Hive and Pig was already taking a toll on this fresh-grad, entry-level engineer. There was fear, and the hope that nothing would break.

Actually, nothing did. What it did was simply inspire me to learn. That was the origin: the beginning of a learning spree that still goes on.

As I began my journey in the Big Data ecosystem, I witnessed the movement from old-school MapReduce Version 1 to YARN (MapReduce Version 2). It was a total paradigm shift. The issues in YARN still exist in a lot of ways, but the conversations went from explaining to a customer "What is a container?" to "This operation is slow or not so performant". Stability brought with it a host of new issues that took YARN to where it is today. Knowing the underlying resource management layer well helped me dive into Spark easily. Kinda.

After those 2 issues, I started playing with Spark in a minimal way, trying to figure it out. There was one other co-worker I went to for help when I couldn't figure things out in Spark. Him being based in Melbourne, Australia, and me in Palo Alto, California, presented that nice timezone challenge that prompted 9pm calls to understand what was breaking in Spark for one issue or several. That was fun.

Fast forward a bit to January 2015. Having been part of customers' journey from MRv1 to MRv2, I now faced the move to Spark. So I sat there and watched things go from MRv1 → MRv2 → Spark. It felt like customers came back after the holiday season and suddenly decided, "Hey, let's see what this Spark thing is." I wasn't ready for it though. I still had a ton of knowledge gaps. To add to it, my co-worker from Down Under was on a 3-week leave. So, guess who solved most Spark cases for 3 weeks, partially freaking out. But that helped. I learnt more in those 3 weeks than at any other point in my life.

This egged me on to dive a wee bit deeper into it. The following months saw huge growth in adoption. A bunch of customers went head first into Spark, and challenging problems followed. The interesting thing that keeps me going is that since Spark is fairly new, the issues are pretty much new as well. So you do not know what you are dealing with until you break in.

So, dealing with the issues and pain points helped me better understand how things moved. Within those 6 months, I also jumped a lot into Apache Kafka alongside Spark, but Spark stuck with me longer. I barely do much Kafka today, but a year ago it was exciting, and I enjoyed helping customers succeed early on as the framework became part of the larger platform.

In June of 2015, I attended Spark Summit 2015 West at the Hilton in San Francisco. My first conference as a professional was nothing short of a thrill. The 1.5-hour commute across the Bay Area on public transportation (missing trains) and the urge to stay longer to network and learn were all part of this 3-day conference. I made contacts, talked to customers, and attended a lot of talks about Spark. It was good to see that the platform had grown to a point where contributions kept pouring into the ecosystem, and attendance had risen to thrice the previous year's. This encouraged me further.

I picked up Learning Spark, a very good starting book for anyone jumping into this community. The basics were clearer than ever, and I moved on to contributing in the community. JIRAs and the User Mailing List were the places I lived for a few months, grasping as much as I could. I taught myself Spark. I was happy and a wee bit proud.

Working with the issues of a framework gives you a different perspective than simply using it. Spark and YARN were buddies that troubled me. Running Spark over YARN's resource layer was comfortable and troublesome in equal measure. It still is sometimes, if you don't know what you are doing. And not just YARN: memory, scale, operations, and code were all categories of problems that surrounded our team as we engaged, helping folks and trying not to break things.

You've probably read "break" or "breaking" a bunch of times in this article. That is intentional. Things broke everywhere, and we saw repetition seep in. I sent an abstract of a presentation about understanding the above problems to Strata Hadoop World in San Jose for 2016. I had initially sent it out on my own, with a roughly, almost randomly classified problem list; an internal study within my immediate team then helped build the material for the talk. It was a good title, "Breaking Spark". Catchy. It worked. I was selected to present it in March in San Jose, then subsequently in Vancouver at Apache: Big Data 2016, and then at Strata Hadoop World 2016 in London.

3 conferences in 4 months, March to June. It was a different yet challenging experience for me: getting a presentation accepted, formulating a concrete speech around the slides, and then answering the potential questions. Each time I learnt new things. New questions. New people. New problems that they shared with me. I had never been a speaker before. Now I can't get enough of it.

These talks and conference attendances were mainly about meeting and understanding the community around Spark itself. You can read and talk and look at JIRAs all day, but getting to know the problems and the people behind them is rare, and you will surely never obtain that unless you leave your desk, travel, and meet people. My talk aimed to provide a checklist so that anyone jumping onto this platform knows what they are getting into. It was me sharing my experiences (working with customers) with the larger community, who, as I learnt, may have similar or different problems.

This is only the beginning. I want to continue doing this as I move along. You might see me in future conferences with different talks or a better version of this existing one. Spark is growing and it has an encouraging community around it that seeks more helping hands on a daily basis.

Let me end this word blob with the links:

  1. Talk — Breaking Spark (contains Slides)
  2. Video recording of the Talk (if you wish to see me ramble on for 30+ minutes)