I got the chance to attend Strata+Hadoop World earlier this year. Here’s a brief summary of my favorite talks from the two days I attended:
Large-scale Machine Learning Day
Wednesday was tutorial day and I spent my time at Dato’s machine learning session. My machine learning knowledge is pretty rudimentary, so getting a chance to try out their Python library GraphLab Create to prototype ML applications quickly was really cool. Some of the topics covered (via hands-on IPython Notebook-based demos) included feature engineering (basically an intro to working with SFrames, the primary GraphLab data structure), data deduplication using a nearest neighbors model, recommender systems, and using deep learning for image classification. Training materials, including the demos I mentioned as well as exercises and slides, are available here: http://dato.com/events/training/2015_strata_ca.html
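The deduplication demo stuck with me: the core idea is just to treat records as feature vectors and flag pairs that are unusually close to each other. Here is a toy pure-Python sketch of that idea (my own illustration, not GraphLab's API; the function names and threshold are made up):

```python
# Toy sketch of nearest-neighbors deduplication (not GraphLab's API).
# Records whose feature vectors are closer than a threshold are flagged
# as likely duplicates.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_duplicates(records, threshold=0.1):
    """Return pairs of record indices whose distance is below threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if euclidean(records[i], records[j]) < threshold:
                pairs.append((i, j))
    return pairs

# Two near-identical rows and one distinct row:
rows = [(1.0, 2.0), (1.0, 2.05), (9.0, 9.0)]
print(find_duplicates(rows))  # [(0, 1)]
```

The real model, of course, uses smarter indexing than this brute-force pairwise scan, which is quadratic in the number of records.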
Sidenote: This was my first exposure to IPython Notebook and I must say that it is quite the nifty tool.
Data Science — Where are we going?
The highlight of the Thursday keynotes was an inspiring talk by DJ Patil, newly appointed U.S. Chief Data Scientist. Special guest President Obama opened the talk (via prerecorded video) calling data scientists to action, asking them to “help us change this country and this world for the better”. DJ then went on to outline the importance of data science for the future of our nation and the specific things he will be focusing on in his new role. I thought the mission statement was a good summary of what he is setting out to do: “responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on its investment in data”. A video is available on YouTube.
Large Scale Spark
Matei Zaharia and Reynold Xin (of Databricks/AMPLab) presented “Lessons from Running Large Scale Spark Workloads”. They had some fun dispelling the notion that Spark doesn’t scale with inductions into their “Hall of Fame”. Some of the records included:
- Alibaba — longest running job (1 week)
- Tencent — largest single day intake (1 PB/day) and largest cluster (8000 nodes)
- Databricks — largest sort (1 PB in 4 hours)
Additionally, they discussed some of the upcoming features that will help Spark scale even better:
- elastic resource allocation to improve utilization
- a revamped shuffle that utilizes a new network transport and switches from a hash-based to a new sort-based method
- a new DataFrame API as a more optimized data model compared to RDDs
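As I understood the shuffle change, the difference is roughly this: a hash-based shuffle keeps a separate bucket (and open file) per destination partition, while a sort-based shuffle orders all records by destination and writes a single run. A toy sketch of the two ideas in plain Python (my own illustration, nothing like Spark's actual implementation):

```python
# Toy illustration of hash-based vs. sort-based shuffle
# (a sketch of the concept, not Spark's implementation).
from collections import defaultdict

def hash_shuffle(pairs, n_reducers):
    # Hash-based: one in-memory bucket per reducer partition.
    # Simple, but cost grows with the number of partitions.
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[hash(key) % n_reducers].append((key, value))
    return dict(buckets)

def sort_shuffle(pairs, n_reducers):
    # Sort-based: order all records by destination partition (then key)
    # and emit one sorted run that reducers read by range.
    return sorted(pairs, key=lambda kv: (hash(kv[0]) % n_reducers, kv[0]))

data = [(3, 'a'), (1, 'b'), (2, 'c')]
print(hash_shuffle(data, 2))  # {1: [(3, 'a'), (1, 'b')], 0: [(2, 'c')]}
print(sort_shuffle(data, 2))  # [(2, 'c'), (1, 'b'), (3, 'a')]
```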
It was great to see the continued progress of this exciting project.
Hive on Spark
Xuefu Zhang (Cloudera) and Chengxiang Li (Intel) presented Hive on Spark, which seems to be farther along than I expected. This effort enables Spark as an alternative execution engine alongside the traditional MapReduce one and the newer Tez. Apparently there is still quite a bit of optimization work to do, but because so much of the existing code infrastructure is reused, HiveQL language support should be feature complete. It will ship in Hive 1.1, which should be released soon.
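If it works like the existing Tez integration, switching engines should come down to a single Hive session setting:

```
set hive.execution.engine=spark;   -- other values: mr (the default), tez
```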
Google Cloud DataFlow
Eric Schmidt, a product management lead from Google, gave an overview of the (currently alpha) Google Cloud DataFlow platform. He introduced it as sort of a culmination of many of the big data technologies that had been developed at the company in the past, like MapReduce, FlumeJava, and MillWheel. From the developer’s point of view, Cloud DataFlow is an SDK that, via a functional programming interface, provides a unified model for both distributed batch and stream processing. The SDK, being based on Flume, uses PCollections as the core data model (which were amusingly described as the “free hippy cousin” of RDDs) and also introduces the idea of windows to enable the unbounded streaming use case. Of course, Google Cloud data sources are supported as input/output. A really nice monitoring UI was also demoed, which allows you to inspect the performance of logical components of your job, alleviating the need to dig into log files on physical nodes. Interestingly, as an alternative to the default runner which executes your code on Google’s Cloud platform, there is an open source Spark runner which will run Cloud DataFlow jobs on a Spark cluster. I am curious to see what the influence of this project will be on the open source distributed processing community.
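The windowing idea, slicing an unbounded stream into bounded chunks by event time so aggregations can complete, can be sketched in a few lines of plain Python (my own toy version of fixed/tumbling windows, not the Dataflow SDK):

```python
# Toy sketch of fixed (tumbling) windows over timestamped events,
# the idea Cloud DataFlow uses to make unbounded streams tractable.
# This is my own illustration, not the Dataflow SDK.
from collections import defaultdict

def fixed_windows(events, size):
    """Group (timestamp, value) events into windows of `size` time units."""
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size            # window this event falls in
        windows[(start, start + size)].append(value)
    return dict(windows)

clicks = [(0, "a"), (3, "b"), (7, "c"), (12, "d")]
print(fixed_windows(clicks, 5))
# {(0, 5): ['a', 'b'], (5, 10): ['c'], (10, 15): ['d']}
```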
Apache Drill
Jacques Nadeau, from the Apache Foundation and MapR, presented “Drill into Drill: How Providing Flexibility and Performance is Possible”. Apache Drill is a fairly new SQL query engine whose design (like Impala’s) is based on Google’s Dremel. Drill puts particular emphasis on flexibility and simplicity. Support for schema-less storage formats such as JSON and the ability to query over file paths directly are some examples of this. It also aims to be simpler to manage and deploy than a system like Hive, by requiring only a single type of daemon to run on each node and by storing metadata in or near the data itself rather than in a separate database. As for performance, there is a lot of sophisticated work being done to make the engine competitive with Impala and Hive, and the talk covered a lot of low-level details which I won’t try to recall here. I left quite impressed and intrigued by this project. It will be interesting to see what the very active and competitive open source SQL-on-Hadoop landscape looks like a couple years from now.
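Querying a raw file in place looks something like this (the file path and column names here are hypothetical; `dfs` is Drill's built-in filesystem storage plugin):

```
-- Query a JSON file directly by path; no schema or table definition needed
SELECT name, age FROM dfs.`/data/users.json` WHERE age > 30;
```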
(originally posted at http://patrick.marchwiak.com/blog/2015/02/19/strata-hadoop-world-2015/ )