Apache Zeppelin at ApacheCon BigData North America 2016
This May, the city of Vancouver hosted one of the biggest get-togethers of Apache Software Foundation communities in North America.
It is the 5th time the Apache Software Foundation (ASF) has held its annual event together with the Linux Foundation, and as usual it was laid out as two consecutive conferences, “ApacheCon” and “Apache Big Data”, both held at the Hyatt Regency Vancouver to the highest possible standards. This year I had the opportunity to join on behalf of the Apache Zeppelin project, and below is a short summary with a few takeaways from the “Apache Big Data” part.
It was my third time taking part in Apache-related events since the Zeppelin project joined the ASF back in 2014: before this, there were ApacheCon Europe 2015 in Budapest, Hungary, and the Apache booth at FOSDEM 2016 in Brussels. As this number grows, attending such events becomes especially exciting: not only is it possible to make new friends on a different continent, but one can also meet old friends from previous events again, which makes a week-long stay in a new country much more enjoyable and worth travelling for.
Vancouver BC, Canada itself deserves a separate post. Let me just note here that it is a very cosmopolitan city, on a par with NYC and Berlin, but with beautiful ocean, mountain, and other nature views very close to the city center.
Street food vendors kept conference attendees and local office workers energized with plenty of healthy food and fruit, and the water and coffee provided by the Linux Foundation organizers helped people stay hydrated and fight jet lag after 8+ hour flights.
At the Apache Big Data conference, a diverse set of topics spanning the whole industry was discussed and presented throughout the week, but one thing definitely stood out compared to last year.
Just as Apache Spark became the apparent favorite topic 2–3 years ago, this year there was a very visible community focus on all things “streaming”. One can clearly see how stream processing has rapidly become the dominant theme over the past couple of years, attracting many sessions. These days, when ETL “just works” and many things are already “in-memory”, what is left? Lowering the latency of answering analytics questions, with “real-time” as the limit, is quickly becoming a very hot topic, and that is promising for a number of Apache projects.
Apache Zeppelin had at least 4 dedicated talks this year:
- on the community side, Trevor Grant of Market6 spoke about “Everyone Plays: Collaborative Data Science with Zeppelin”
- Apache Zeppelin VP, Lee Moon Soo of NFLabs, did a session on “Apache Zeppelin and Its Pluggable Architecture for Your Data Science Environment”
- my talk was “Mining public datasets using Apache Zeppelin, Spark and Juju”, which walked through two examples of data products built with three tools from the open source data analytics stack
- and another PMC member, Felix Cheung, did a hands-on tutorial on “Interactive Data Science from Scratch with Apache Zeppelin and Apache Spark”
As usual with ApacheCon, it was a great chance to meet fellow PMC members, mentors, and community members in person, catch up, and discuss some meta topics around Apache and the industry in general. Building new bonds in real life with people who have already become close and familiar through the mailing lists and collaborative work is something special and very rewarding, and it helps the project a lot in the long term.
It was especially nice to touch base with Konstantin Boudnik, Roman Shaposhnik, and Felix Cheung, whom I had the honor of working with during the past year and who were all actively engaged in the project during its incubation period.
Apache S2Graph (incubating)
One recent Apache project with deep connections to the place I live, Seoul, South Korea, is Apache S2Graph (incubating), a high-performance distributed graph database. It was the project's debut at such a large-scale community event since it joined the Incubator this year, and they hosted two separate sessions.
It was a pleasure to meet two of its contributors, Do Yung Yoon and Hyunsung Jo. I hope we can build a relationship that will result in some Apache-related activities here in Korea, so please stay tuned for news from Seoul_Tech if you are interested!
Luke Han, whom I met in Shanghai last year, also came to Vancouver. Luke is a PMC member of Apache Kylin and the person who has led it since its inception at eBay's China office to the current awesome open source community under Apache.
A pleasant surprise was to learn that since our last meeting at the Kylin meetup in Shanghai, he has left the big company and started his own business based on Apache Kylin! I sincerely wish his team at Kyligence every success!
Apache Bigtop is a great project and a very broad initiative. It's been a while since Zeppelin became available to Bigtop users through BIGTOP-1769.
With Konstantin Boudnik, we discussed the last obstacles on the way to making the integration seamless, such as reducing the ~500 MB Zeppelin convenience binary distribution by factoring out some interpreter implementations and making them optional or separate from the main release, as well as how nicely it could play together with the recent contribution of Juju deployment charms in BIGTOP-2435. I'm looking forward to resuming this discussion on the mailing list in the near future.
There's plenty of room for improvement, so in case you ever wanted to get involved in open source, the ASF with its many projects is a great place to start!
Kevin Monroe, a member of Canonical's Big Data team who gave the session “Big Data DumbOps”, integrated Juju with Apache Zeppelin and Spark by providing the layer-apache-zeppelin charm and the apache-hadoop-spark-zeppelin bundle.
Right now it is managed independently, outside the ASF source tree, but hopefully once Zeppelin becomes a Bigtop component, it will be added to the ASF code base.
During the event we managed to meet more people on the Juju team and discuss some further improvements to the ecosystem, like adding the missing Ganglia component and “charming” more Apache projects such as OODT and Drill. I hope this helped, as it resulted in improvements to the recent BIGTOP-2435 contribution, where a new hadoop-processing bundle was introduced with Ganglia actually set up.
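For readers unfamiliar with Juju, deploying such a bundle boils down to a couple of CLI commands. This is a hedged sketch: the commands reflect the Juju CLI as of 2016, the bundle name is the one mentioned above, and the `zeppelin` application name is an assumption about how the bundle labels the Zeppelin unit.

```shell
# Sketch: stand up the Hadoop + Spark + Zeppelin stack from the charm store.
# Bundle name from the post; 'zeppelin' application name is assumed.
juju deploy apache-hadoop-spark-zeppelin   # deploy the whole bundle in one step
juju status                                # watch machines and units come up
juju expose zeppelin                       # open the Zeppelin web UI to the outside
```

The appeal here is that one `deploy` command provisions machines, installs all the components, and wires their relations together, instead of configuring each service by hand.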
It's great to see Juju adoption growing across solutions throughout the ASF Big Data platform; it is a nice tool that helps data practitioners save money and scale clusters with their tools of choice on demand.
With this deployment-automation gap covered, and with the recent addition of PredictionIO's machine learning server to the Apache Incubator, I'm glad to see the number of open source tools becoming de facto industry standards rapidly growing. This represents a growing number of opportunities for small businesses to be built and grown on the rich “soil” of data engineering, providing more fun places to work outside the walls of big corporations, which is a good thing in itself.
I'm looking forward to making Apache Zeppelin an even more instrumental part of such a future, and I hope to see you all next time in Seville, Spain this autumn for the recently announced ApacheCon EU 2016!
To recap, if you missed this event:
- listen to the audio recordings of the sessions, as well as watch the keynote videos, at http://events.linuxfoundation.org/events/apache-big-data-north-america/program/video-and-audio-recordings
- see photos from the event published in the Linux Foundation’s album on Flickr
- read more details in Tom Barber's post at http://www.meteoriteconsulting.com/apachecon-na-2016-roundup
P.S. I'd like to thank the Apache Zeppelin community for its support, my employer NFLabs Inc. (hiring in Seoul and the Bay Area!) for the chance to work on such a great open source project, and Canonical Inc. for supporting this trip abroad.