Peeking into the Big Data future: Lessons learned from the DataWorks Summit in Munich

Hans Adriaans
Published in bigdatarepublic
5 min read · May 4, 2017

For about a year I have been fully immersed in everything regarding Big Data, working with various tools and techniques and throwing a bit of data science into the mix. I realized there is a high entry barrier for organizations that want to start turning their (dormant) data into something useful. With this in mind, I wanted to look at how some of the leaders and early adopters in Big Data are tackling these barriers and whether they will become easier (or harder) to handle in the future. Luckily BigData Republic gave me and two colleagues the opportunity to visit the DataWorks Summit in Munich this April, providing some inside information.

The remainder of this blog describes how I, as a Big Data Engineer/Architect, see the changes that are coming. From a broad perspective, I identified three themes throughout the Big Data ecosystem that were also tackled at the DataWorks Summit:

  1. Privacy by design: General Data Protection Regulation (GDPR)
  2. Tooling for components: Creating faces for Hadoop
  3. Lessons learned in the field: Hadoop a few years in

General Data Protection Regulation

Within the European Union, strict regulations about personal data come into force on 25 May 2018. The guiding rule is "privacy by design", meaning that every use of personal data needs the consent of the person involved. Not complying with the rules carries hefty fines of up to 4% of worldwide turnover. The regulation is a big legal document covering issues like what counts as personally identifiable information (PII) and how you may access and store it. I'm not going to bore you with the details, but it comes with some challenges for Big Data platforms, such as:

  • Strictly control access to data (logging of data access): touching data, even viewing it for an unrelated purpose, is considered an incident and is forbidden
  • Right to be forgotten (RTBF): erasure of all data at the person's request
  • Limiting data portability: restrict the locations where data may be stored
  • Restricted processing: only use data for what has been agreed upon

A lot of data lakes are designed around the principles of save everything, append only, and unstructured data storage. These principles can create problems: you don't want to have to drain a lake with years and years of data because it is riddled with PII. At the conference, the answer presented for this problem was using Apache Atlas for tagging and Apache Ranger for access restriction. Although these applications can help you control access and restrict processing, they don't help you with the RTBF.
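
To make the tagging side a bit more concrete, below is a minimal Python sketch (my own illustration, not something shown at the summit) that attaches a PII classification to an entity through the Atlas REST API. The host, credentials, classification name and entity GUID are all assumptions; a Ranger tag-based policy on that same tag would then restrict who may read the tagged data.

```python
# Sketch: tag a data asset as PII in Apache Atlas so that a Ranger
# tag-based policy can restrict access to it.
# Assumptions: an Atlas instance at ATLAS_URL, basic-auth credentials,
# a classification type named "PII" already defined, and the GUID of
# the entity (e.g. a Hive column) to be tagged.
import requests

ATLAS_URL = "http://atlas.example.com:21000"   # hypothetical host
AUTH = ("admin", "admin")                      # placeholder credentials

def tag_entity_as_pii(entity_guid: str) -> None:
    """Attach the PII classification to an Atlas entity by GUID."""
    resp = requests.post(
        f"{ATLAS_URL}/api/atlas/v2/entity/guid/{entity_guid}/classifications",
        json=[{"typeName": "PII"}],
        auth=AUTH,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    # GUID of, say, the 'email' column of a Hive table (hypothetical value)
    tag_entity_as_pii("c7a1f2e0-0000-0000-0000-000000000000")
```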

A presentation by Balaji Ganesan from Privacera provided some nice guidelines for handling GDPR in a checklist:

  1. Coordinate with privacy and security teams
  2. Invest in user- and customer identification
  3. Data discovery and classification
  4. Centralize data around consent purpose
  5. Analyze pseudonymization vs. anonymization
  6. Constantly monitor personal data for breaches
  7. Automate data retention and recovery strategies

From my practical experience, it is important to identify the points of entry for personal data and to design storage and processing from the start around being able to lose this data with limited impact. A good move would be storing data like website click-throughs in two places: one with the personal data and the other with the remainder, linked by an anonymous key.
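
As a minimal sketch of that idea, assuming plain Python dictionaries as stand-ins for the two stores and a random UUID as the anonymous key: the personal fields go into one store and the behavioural remainder into the other, so a right-to-be-forgotten request only has to touch the PII store.

```python
# Sketch: store click-through events in two places, linked by an
# anonymous key, so that a right-to-be-forgotten request only touches
# the PII store. Field names and the in-memory stores are illustrative.
import uuid

pii_store = {}      # anonymous_key -> personal data
event_store = []    # behavioural events without personal data

def ingest_clickthrough(event: dict) -> str:
    """Split an incoming event into personal and non-personal parts."""
    key = str(uuid.uuid4())
    pii_store[key] = {
        "name": event.pop("name"),
        "email": event.pop("email"),
    }
    event_store.append({"anonymous_key": key, **event})
    return key

def forget_person(key: str) -> None:
    """RTBF: drop the personal data; events remain, but anonymously."""
    pii_store.pop(key, None)

key = ingest_clickthrough({
    "name": "Jane Doe",
    "email": "jane@example.com",
    "page": "/products/42",
    "timestamp": "2017-04-06T10:15:00Z",
})
forget_person(key)  # the click-through history stays usable
```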

Creating faces for Hadoop

A second subject, recurring throughout the summit, was the focus on making open source Big Data tooling more approachable by creating user interfaces. Installing a Hadoop cluster with, for example, Spark, Flink and Kafka sometimes requires mad Linux command line skills. Hortonworks tries to make the life of administrators easier by extending Apache Ambari for installing, configuring and monitoring these different tools.

Examples of tools Hortonworks is working on together with partners are:

  • Kafka Registry UI
  • Stream processing UI (streaming with Spark and Flink)
  • Data wrangling with Trifacta
  • Automatic deployment and scaling with Cloudbreak

Although these tools look promising, they are often still only distributed via GitHub with multiple branches and a limited number of committers, or require proprietary contracts. I think this tooling works fine for its purpose, but when customization is required it only creates more complexity. Then again, I write from my perspective as an engineer.

Hadoop a few years in

The third development in Big Data, also present at the summit, concerns early adopters like ING, BMW, Danske Bank and Centrica, who have been working with Hadoop software for some years now. These organizations never started out as software organizations but are now heavily dependent on software developments in the open source Big Data community, like HDFS, Spark and Kafka.

Take BMW for example, which has over 6 million vehicles around the world feeding over 1 terabyte of data per day into a data lake, from car sensors, digital maps, artificial intelligence and digital context models. By combining these sources, they are actively working on projects like autonomous driving, increasing customer satisfaction and improving production methods. These data-driven projects may prove crucial in the future, since BMW itself acknowledges being under pressure from new companies like Tesla.

Lessons Learned

So, on to some lessons learned from the field and the presentations at the summit:

Have one Big Data department across the entire organization

Data pioneers need to work together on one Big Data platform to create the highest added value. A lot of organizations have small data islands based on POCs and software suppliers trying to get a foot in the door. Having one department makes sure people use the same data processes and connected technologies; it also supports knowledge sharing between employees by educating by example.

Design initial pilots based on "fail fast"

Pilots in Big Data depend heavily on reaching a business benefit fast, so that limited time and effort are wasted on system complexity. An important part of this is cutting losses and keeping focus when new discoveries or explorations create too much complexity. Fail fast also applies to the design of the software and logging, which should surface errors as close to the source as possible, making debugging of previously produced pilot code a lot easier.
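
To make "errors close to the source" a bit more concrete, here is a small Python sketch with hypothetical field names: incoming records are validated at the point of ingestion and the exact failing record and field are logged, instead of letting broken data surface several processing steps later.

```python
# Sketch of fail-fast ingestion: validate each record where it enters
# the pipeline and log the exact reason, instead of discovering broken
# data several processing steps later. Field names are hypothetical.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

REQUIRED_FIELDS = ("vehicle_id", "timestamp", "sensor_value")

def validate(record: dict) -> dict:
    """Raise immediately, and loudly, on the first problem found."""
    for field in REQUIRED_FIELDS:
        if field not in record:
            log.error("record %r is missing field %s", record, field)
            raise ValueError(f"missing field: {field}")
    if not isinstance(record["sensor_value"], (int, float)):
        log.error("record %r has non-numeric sensor_value", record)
        raise ValueError("sensor_value must be numeric")
    return record

good = {"vehicle_id": "v1", "timestamp": "2017-04-06T10:15:00Z", "sensor_value": 3.2}
validate(good)  # passes silently

try:
    validate({"vehicle_id": "v2", "timestamp": "2017-04-06T10:16:00Z"})
except ValueError as err:
    log.info("rejected at the source: %s", err)
```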

Provide clear availability of data and agree on rules for usage

The administration and distribution of data need to be firmly embedded in the organization; activities like use and access require clear guidelines. The entire organization needs to be aware of guidelines regarding:

  • Data availability and accessibility
  • Collaboration across business divisions
  • Security measures for data protection
  • Approved rules regarding data privacy

People first

Often organizations start by investing in technology like servers and licenses without having the people in the organization ready and motivated. Since a Big Data project is embedded in the entire organization, you need a team with the right mix of skills. Such a team consists of people with open minds, capable of tackling any problem, be it technical, organizational or theoretical, and motivated by the need to innovate.

Prepare for pitfalls

Looking at the BMW case, they also identified several pitfalls regarding the implementation:

  1. Design for reliability
  2. Treat monitoring and automation as first-class citizens
  3. Be prepared for immaturity of software
  4. Stick to your choice
  5. Design with security in mind
  6. Design for multi-tenancy
  7. Be prepared to argue

Although the selection of lessons learned above is far from complete, it is a small look at the challenges to tackle before or during a Big Data project in any organization.

Written by Hans Adriaans, Big Data Engineer/Architect @ BigData Republic
