My journey at Strata ’19 Conf

Pietro La Torre · Published in Quantyca

16 min read · May 10, 2019

In this post I’d like to sum up what I learned at the Strata Data Conference in London, both from the keynotes and from the talks I decided to attend. The conference was organized by O’Reilly and Cloudera and had many tracks with lots of interesting talks.

Day 1

Highlights

These are the topics from the first day of the conference that I’d like to share:

In a few words

In the last year many more companies have moved to the cloud, and many cloud services have been born or have grown very fast. We expect technologies to be simple, fast, flexible and reliable, and we would like to reduce costs as much as we can.

Nowadays it’s possible to build data platforms capable of storing, processing, and analyzing data across multiple public and private cloud platforms and on-premises data centers.

Serverless solutions are spreading as companies gain a better understanding of cloud services. For many use cases there’s no need to worry about the underlying architecture and you can just focus on pipelines. Data governance and catalogs, together with lineage solutions, are getting more attention, even though their importance is still underestimated by many companies.

Data Science and Data Engineering need to co-operate to get value from data and to build both successful pipelines and architectures.

NLP is spreading and growing fast thanks to new techniques in AI, enabling lots of use cases.

How to measure engagement

The Financial Times did a great job with its customer strategy and reached its goal of 1M paying subscribers one year ahead of schedule. They looked at many factors to understand how to improve customer engagement.

Looking for the right measure

So they decided to focus on engagement, but how can engagement be quantified?

Different ways to get engagement

Finally, they found a way to build individual engagement scores based on recency, frequency and volume for each customer.

Given the engagement score, they can plan customer-specific actions to drive growth.
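The talk didn’t go into the exact formula, but the idea of a recency/frequency/volume score is easy to sketch. Below is a minimal Python illustration that combines the three signals into a single number; the caps, weights and bucketing are my own assumptions, not FT’s actual model.

```python
from dataclasses import dataclass


@dataclass
class CustomerActivity:
    days_since_last_visit: int        # recency
    visits_last_90_days: int          # frequency
    articles_read_last_90_days: int   # volume


def engagement_score(a: CustomerActivity,
                     w_recency: float = 0.4,
                     w_frequency: float = 0.3,
                     w_volume: float = 0.3) -> float:
    """Combine recency, frequency and volume into a single 0-100 score.

    The caps and weights here are purely illustrative.
    """
    # Recency: the more recent the last visit, the higher the score.
    recency = max(0.0, 1.0 - a.days_since_last_visit / 90.0)
    # Frequency and volume: capped so heavy users don't dominate the scale.
    frequency = min(a.visits_last_90_days / 30.0, 1.0)
    volume = min(a.articles_read_last_90_days / 60.0, 1.0)
    return 100.0 * (w_recency * recency + w_frequency * frequency + w_volume * volume)


print(engagement_score(CustomerActivity(3, 12, 25)))   # an engaged reader
print(engagement_score(CustomerActivity(60, 1, 1)))    # a churn-risk reader
```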

Here you can find part of this keynote.

The importance of change

James Burke, a historian with a great sense of humour, talked about innovation in human history and the importance of change.

History teaches us that what is initially seen as an outlier or an anomaly can turn routine into evolution by introducing new features.

James Burke during keynotes

He also shared his thoughts on climate change and spoke about nano factories, devices that in the near future will be able to put together different kinds of atoms: this technology is going to change our future. And this is not science fiction: Manchester University built the first atomic-scale nano factory robot 5 months ago.

Here is part of this keynote.

NLP can help you to find a job

Indeed described how they deal with 20 million job descriptions and 150 million résumés to find the best matches. In fact, a single job title can be expressed in many ways, and the same happens for résumés.

their challenge

By using Natural Language Processing with the support of Spark NLP, they built a library to do normalization, i.e. mapping terms to a standard term by finding equivalence classes. Immediate advantages: deduplication of data, the possibility to query with equivalent terms, and the reduction of a huge corpus to a small set of classes.

an example of normalization

They considered two approaches:

  • Rule-based normalization: very effective, but lacking in completeness, requiring effort, and hard to extend and scale
  • Learned normalization: fast, cheap, scalable and proactive, but it can lead to errors given the limited control and human oversight

They used the latter to build a pipeline, starting with preprocessing.

example of preprocessing

They also used preprocessing to resolve acronyms and to deal with numbers. Then they used document frequency to keep only high-frequency words as the normalized values. They combined the MinHash algorithm with the Jaccard distance, followed by the Levenshtein distance to filter terms. Finally, by computing the Euclidean distance they obtain the normalized text.
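To make the matching steps more concrete, here is a toy, pure-Python sketch of the candidate selection and filtering idea: Jaccard similarity on token sets (which MinHash approximates at scale) followed by a Levenshtein-distance ranking. The standard terms, threshold and helper names are hypothetical; this is not Indeed’s Spark NLP implementation.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def normalize(title: str, standard_terms: list) -> str:
    """Map a raw job title to the closest standard term."""
    tokens = set(title.lower().split())
    # Step 1: coarse candidate selection via Jaccard on token sets
    #         (Indeed used MinHash to approximate this at scale).
    candidates = [t for t in standard_terms
                  if jaccard(tokens, set(t.lower().split())) > 0.3]
    if not candidates:
        candidates = standard_terms
    # Step 2: fine-grained ranking via Levenshtein distance.
    return min(candidates, key=lambda t: levenshtein(title.lower(), t.lower()))


standards = ["software engineer", "data engineer", "registered nurse"]
print(normalize("Sr. Software Engineer (Java)", standards))  # -> software engineer
```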

the full pipeline with some examples

Here are their slides.

A fashionable platform

Zalando explained how they managed to build a marketing platform based on machine-learning.

They talked about their architecture and data flows: their main tools are Spark, Kafka, Kubernetes, Scalyr, TensorFlow and SageMaker.

They shared 10 insights:

customer centricity crosses system boundaries

  • users can come from a Google search, from clicking on a Facebook ad, …
  • two main points of attention: campaign management and user experience

cross functional business ownership

  • build your teams in a way so that they can adapt to business needs
  • treat every team like a separate business
  • measure performance for: Product Managers, Engineers, Data Scientists

API first to automate for scale

  • for better extensibility
  • compatible with autonomous teams
  • to quickly enable A/B testing

measure everything

  • for continuous improvement you need to measure actions & costs
  • identify KPI and tests and check them to evaluate progress
  • build value incrementally

entrust autonomy for speed and best talent

play by the rules to collaborate

  • define your own engineering principles, tech radar, guidelines, tech rules of play
  • Use microservices, APIs and event-driven architectures
  • Use change management systems

deploy yourself and frequently via CDP (continuous deployment)

explore new platforms as an early adopter

  • e.g. an ML orchestration workflow via AWS functions

discover machine learning opportunities

team up with autonomy, mastery and purpose

Champions in innovation

Accenture Labs surveyed around 1,000 global senior executives in key industrial sectors and discovered that only 22% achieved a RODI (Return on Digital Investments) that exceeded their expectations. They called these companies “Champions” and analyzed why they are successful. Champions also scale more than 50 percent of their digital pilots.

For high-performing Champions, it’s not about scaling more pilots (even though they do). It’s about earning more by scaling better.

The remaining 78% of companies struggle against common challenges.

They identified two other groups, Contenders (trailing the Champions) and Cadets, and collected data on organizational challenges across six categories. They estimated the correlation between these challenges and RODI, then analyzed how much RODI could be increased, in percentage terms, by overcoming these challenges across industries.

Potential Increase in RODI

New innovations require companies to reimagine how they work, to digitally transform their operations and to exceed their customers’ ever-evolving needs. Each of these tasks comes with a unique set of challenges.

It was a very useful talk on the challenges of scaling and the actions to take to maximize scale-up.

Here you can find many more details and here you can download their report.

Lyft

In this talk Lyft showed how their Data Platform handles two of their use cases. They built their business on connecting drivers with users by effectively computing the best price for a ride, finding the closest rider and estimating the time of arrival.

Guiding principles for the Data Platform team

Their first need was to process, in near real-time, data coming from the ODS and to compute valuable business metrics from raw data while, of course, being able to monitor the health of the system. They combined a CDC solution with Kafka to collect events, then used Flink to develop streaming flows and S3 as a data lake, with Hive on top, where they periodically persist data. With Airflow as orchestrator, they developed Spark jobs to compute additional layers on top of the raw data. They then used Google BigQuery to analyze this data, both coming from the streaming flows (thanks to its API support) and from S3.
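As a rough illustration of the orchestration part, here is a minimal Airflow DAG that chains two Spark jobs building derived layers on top of raw data in S3. The DAG id, paths, schedule and job scripts are made up for the example (and it assumes the Spark provider package is installed); it sketches the pattern, not Lyft’s actual code.

```python
# A minimal Airflow DAG sketch: raw events land in S3, Spark jobs then build
# derived layers on top. DAG id, paths and job scripts are hypothetical;
# requires the apache-airflow-providers-apache-spark package.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="raw_to_derived_layers",
    start_date=datetime(2019, 5, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # First Spark job: clean and deduplicate the raw CDC events.
    clean_raw = SparkSubmitOperator(
        task_id="clean_raw_events",
        application="jobs/clean_raw_events.py",
        application_args=["--input", "s3://lake/raw/events/",
                          "--output", "s3://lake/clean/events/"],
    )

    # Second Spark job: compute business metrics on top of the clean layer.
    build_metrics = SparkSubmitOperator(
        task_id="build_business_metrics",
        application="jobs/build_business_metrics.py",
        application_args=["--input", "s3://lake/clean/events/",
                          "--output", "s3://lake/metrics/daily/"],
    )

    clean_raw >> build_metrics  # metrics run only after the clean layer is ready
```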

Use case 1 Architecture Diagram

The goal of the second use case was to deal with real-time data coming from the devices of both riders and drivers. A strong requirement was to find a quick and simple way to clean data.

Notification in real-time

Their application must be able to determine whether there’s traffic, update the time of arrival accordingly and notify users. In such cases they added a feature called Prime Time, which increases the price so that users are aware and can decide what to do, while drivers get help handling the situation.

Traffic info in real-time

For this use case, they managed to reuse the flows developed for batch also for streaming, with minimal effort, thanks to Beam/Flink.
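Here is a minimal Apache Beam sketch (in Python) of that idea: the transform logic is written once and can be applied either to a bounded source for batch or to an unbounded source (e.g. Kafka) for streaming on the Flink runner. The file name, field layout and aggregation are illustrative assumptions, not Lyft’s pipeline.

```python
# A minimal Beam sketch: the transform logic is written once and reused for
# batch and streaming. File name, fields and aggregation are illustrative.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build_transforms(events):
    """Business logic defined once; applied to bounded or unbounded input."""
    return (
        events
        | "ParseCsv" >> beam.Map(lambda line: line.split(","))
        | "KeyByRegion" >> beam.Map(lambda f: (f[0], float(f[1])))  # (region, eta)
        | "AvgEtaPerRegion" >> beam.combiners.Mean.PerKey()
    )


if __name__ == "__main__":
    # Batch run over historical files with the default runner. For streaming,
    # the same build_transforms would be applied to an unbounded source
    # (e.g. Kafka) with windowing, running on the Flink runner.
    with beam.Pipeline(options=PipelineOptions()) as p:
        results = build_transforms(
            p | "ReadHistory" >> beam.io.ReadFromText("eta_events.csv")
        )
        results | "Write" >> beam.io.WriteToText("eta_by_region")
```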

Use Case 2 Architecture Diagram

On day 2 they explained this second use case and its challenges in more depth. Below are both use cases across the Data Platform in a single diagram.

Data Platform with both use cases

They evaluated Kinesis against Kafka, but the former had many limitations for scaling.

Here you can find their slides.

How architectures change their outfit

Stitch Fix offers a personalized styling service. At Strata they gave a really interesting talk showing how their data infrastructure changed over the months as both the business and the use cases evolved. They focused on the data scientists’ point of view.

They described how their architecture changed by defining brand-new “generations”:

  • Initially they only had data scientists individually writing their own code and saving data where possible.
  • Then they started to suffer from the lack of sharing, of knowledge and data but also of tools. This is why they built Generation 1
Generation 1

Always learn from your previous generation’s mistakes.

  • They found it difficult to build prototypes and to maintain models with Generation 1. As the business became more stable and the use cases better defined, they redesigned some parts of the platform, Generation 2, using modern tools and frameworks with better abstractions.
Generation 2

The talk then explained some details on their current implementation and how/why they worked on an improved Hive Metastore Interface.

They touched on some interesting notes about Livy/Spark for managing sessions and metadata. They also shared some useful hints on Arrow as a good language-independent format.

Our approach has been to provide self sufficiency first and iterate on the services moving forward.

Here you can find their presentation. By the way, here you can dive into their blog.

Dive into the data lake

Cloudera’s talk described how a typical data lake can be organized into 4 horizontal layers, from bottom to top:

  1. Landing Zone: ingests raw data as it is on the source systems
  2. Discovery Zone: used by data scientists to explore data; it is a mix of views and materialized data, and contains data enriched by joining reference data
  3. Shared Zone: data shared across lines of business (given security constraints). Here data coming from multiple sources is joined together, and data scientists are given incentives to move their data and use cases into this zone
  4. Optimized Zone: built to improve performance and organized by use case (not by source). Data in this layer is often denormalized for performance, uses optimized storage formats (Parquet + partitioning, Kudu, HBase), is accessed by low-latency query engines (Impala, Solr) and provides access to ML models, e.g. via REST APIs

What about real-time?

It can be achieved by adding a fast lane that crosses all layers but the last:

  • by using low-latency components, e.g. Kafka, NiFi, Spark Streaming
  • by consuming data straight from the sources and transforming/analyzing it
  • by delivering its output directly to the Optimized Zone for low-latency queries (a minimal sketch of such a fast lane follows the list)
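As a hedged example of what such a fast lane could look like, here is a minimal Spark Structured Streaming job that consumes events from Kafka and lands them in the Optimized Zone as partitioned Parquet. The topic name, schema and paths are invented for the sketch.

```python
# A minimal Spark Structured Streaming sketch of the "fast lane": consume
# events from Kafka and land them directly in the Optimized Zone as
# partitioned Parquet. Topic, schema and paths are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("fast-lane").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", to_date(col("event_time")))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://lake/optimized/orders/")
    .option("checkpointLocation", "s3a://lake/checkpoints/orders/")
    .partitionBy("event_date")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```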

Day 2

Highlights

Black & White Box Data

At OpenCorporates they collect data on companies across the globe and make it usable for public needs. They have already collected data on around 169 million companies.

In this talk they discussed how data has changed over the last few years.

In the past data was:

  • local
  • siloed
  • expensive to store

And external data was a black box.

What about the present? Everyone knows that nowadays our world is built on data. Let’s see how black box and white box data are different.

black box data

  • no clear definition
  • no provenance (no context)
  • proprietary identifiers (no connections + lock-in)
  • low quality
  • high latency
  • opaque model (limited utility)
  • limited access (poor feedback loops)

white box data

  • well documented
  • provenance (context)
  • open identifiers (possible connections + no lock-in)
  • transparent model (widespread utility)
  • public access (great feedback loops)

But how does black box data arise?

The Black Box model (simplified)

Which, in the real world, is part of a wider (and more complex) scenario:

The Black Box model (in reality)

In OpenCorporates’ context, black box data can lead to anti-money-laundering failures and represents a barrier to innovation, as well as a prosperous habitat for criminals. This is why their “mission” is to provide white box data about companies. Among those who use it: Mastercard and law enforcement/tax authorities.

Here are their slides, and here an extract of the video.

Privacy and AI

Sandra Wachter from the University of Oxford shared her thoughts in a very interesting talk. While AI models are becoming more and more effective at discovering the behaviours and preferences of customers, new opportunities for decision making arise. But these decisions can be based on sensitive information derived from the initial data and can lead to discriminatory behaviour.

For example, in China you can be denied credit if you play videogames, because some models classify players as unreliable subjects. I strongly disagree, of course (also because I’m a player).

So the problem is: inferences can create enriched data, but its usage still needs to be regulated, and in this case (compared with the source data) it is harder to say what can and cannot be done. On one hand, if individuals gave their permission on the source data, a company gets the right to use it for inferences and analysis; on the other hand, individuals still need to be protected against discrimination, even when it arises indirectly from the data they agreed to share.

Focus not only on what’s possible but also on what’s reasonable.

Can a query be tested?

Hotels.com showed how they found a way to create unit tests and measure coverage for SQL code, applying software engineering best practices to SQL-based data applications.

Their vision:

  • Any process that alters or interprets data should be tested!
  • Unit tests are the first defence
  • Hours of test development save days of data clean-up

I completely agree with it.

They used HiveRunner as a unit test framework for Hive SQL. It is based on JUnit 4 and lets you define test cases in Java + Hive SQL. Moreover, it is designed for local execution. It needs a SQL setup script (which prepares the data), the SQL script to be tested (where your queries live) and a set of tests.
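HiveRunner itself is a Java/JUnit 4 framework, so the snippet below is only a language-neutral illustration of the same pattern (set up the input data, run the SQL under test, assert on the result), written with Python’s built-in sqlite3 and unittest. The table and query are made up.

```python
# Not HiveRunner: a language-neutral sketch of the same pattern (prepare data,
# run the SQL under test, assert on the result) using Python's sqlite3.
import sqlite3
import unittest

QUERY_UNDER_TEST = """
SELECT hotel_id, COUNT(*) AS bookings
FROM bookings
WHERE status = 'CONFIRMED'
GROUP BY hotel_id
"""


class TestBookingAggregation(unittest.TestCase):
    def setUp(self):
        # Setup script: prepare the input data the query expects.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE bookings (hotel_id TEXT, status TEXT)")
        self.db.executemany(
            "INSERT INTO bookings VALUES (?, ?)",
            [("h1", "CONFIRMED"), ("h1", "CANCELLED"), ("h2", "CONFIRMED")],
        )

    def test_only_confirmed_bookings_are_counted(self):
        rows = dict(self.db.execute(QUERY_UNDER_TEST).fetchall())
        self.assertEqual(rows, {"h1": 1, "h2": 1})


if __name__ == "__main__":
    unittest.main()
```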

How HiveRunner works

HiveRunner is currently in the Trial ring of the Thoughtworks Tech Radar.

To test monolithic queries, they suggest creating views for every inner query, so that every part of the SQL code can be tested.

To measure test coverage they used mutation testing, which consists of modifying the code in small ways to generate mutants and measuring the percentage of mutants killed. New tests are then designed to kill additional mutants.
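To show the idea, here is a toy Python sketch of mutation testing applied to a SQL string: generate mutants via naive operator/value substitutions, re-run the test suite against each mutant, and report the share of mutants killed. Mutant Swarm does this properly by parsing the SQL into genes; the mutation list and helper names here are purely illustrative.

```python
# A toy sketch of mutation testing for SQL: swap operators/values to create
# mutants, re-run the tests against each mutant, report the share killed.
# Mutant Swarm parses the SQL into "genes"; here we just do naive substitutions.
MUTATIONS = [("=", "<>"), ("COUNT", "SUM"), ("'CONFIRMED'", "'CANCELLED'")]


def generate_mutants(sql: str):
    for original, replacement in MUTATIONS:
        if original in sql:
            yield sql.replace(original, replacement, 1)


def mutation_score(sql: str, test_suite) -> float:
    """Share of mutants for which at least one test fails ("killed")."""
    mutants = list(generate_mutants(sql))
    killed = sum(1 for mutant in mutants if not test_suite(mutant))
    return killed / len(mutants) if mutants else 1.0

# test_suite is any callable that runs the tests against the given query text
# and returns True when they all pass; surviving mutants point at coverage gaps.
```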

Let’s see an example with a sample table, a sample query and a set of tests.

Test Example

They used Mutant Swarm to analyze the SQL code and determine test coverage. From the SQL it identifies genes, i.e. functions, operators and values that can be used to create mutations. For every mutation it creates a new query and verifies the test reports.

Creating mutations
Verifying coverage as % mutants killed

Survivors suggest either a gap in the tests or that more tests and more discriminative data are required.

Many improvements are coming for these tools, to address performance and flexibility challenges and to increase coverage (by adding mutation types).

Very interesting talk! I definitely want to investigate these tools further.

Here are their slides.

Bulk loading, the easy way

Jason Bell talked about Embulk and gave a demo showing how it can be used to quickly and easily move data between two systems.

Embulk is an open-source bulk data loader that helps transfer data between various databases, storage systems, file formats, and cloud services.

Embulk plugins

It seems to be a very easy solution, since it is quickly installed and has some very useful features:

  • it can guess the schema
  • it can be executed in parallel (through YARN or K8s)
  • it has lots of plugins
  • it supports incremental runs (both for files and for databases)
  • it can do transaction recovery

For the moment I think I would give it a try for one-time-only migrations. Jason showed how it can be scheduled with Airflow to create chains, monitor execution and build pipelines.
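As a small sketch of that scheduling idea, the Airflow DAG below chains two Embulk runs with BashOperators (`embulk run <config>` is Embulk’s standard CLI). The config paths, DAG id and schedule are invented for the example.

```python
# A minimal sketch of chaining Embulk runs from Airflow, roughly as in the
# demo. Config paths, DAG id and schedule are made up for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="embulk_mysql_to_s3",
    start_date=datetime(2019, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_orders = BashOperator(
        task_id="embulk_load_orders",
        bash_command="embulk run /etc/embulk/orders_mysql_to_s3.yml",
    )
    extract_customers = BashOperator(
        task_id="embulk_load_customers",
        bash_command="embulk run /etc/embulk/customers_mysql_to_s3.yml",
    )
    extract_orders >> extract_customers  # simple chain, executed and monitored by Airflow
```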

Here you can find their presentation.

A fashionable platform also needs fashionable pipelines

This was Zalando’s second talk that I attended. This time the focus was on the data platform and on data pipelines. Mark explained that they started, as usual, by integrating many data sources into a single, central data warehouse. But as data grew they needed to evolve, so they added an event bus and a data lake and built an event-driven architecture with the support of microservices.

The final architecture can be split into three layers: Ingestion, Storage and Serving.

Zalando’s Data Lake

Data in the Ingestion Layer can come from the event bus, from the data warehouse or from Google Analytics. The Storage Layer is built on AWS S3 plus a metastore. The Serving Layer uses different tools to address a wide range of requirements.

The talk continued by describing two data pipelines, one for data coming from the event bus and one for data coming from DWH.

The Event Data Pipeline was built with the support of Nakadi (and the Nakadi archiver), Lambdas, SQS for notifications, and S3. For incoming events the pipeline generates files on S3 and writes to a queue. The files are then opened and written to the data lake; on Lambda failures they are saved to a dead-letter queue. Data is then read from the data lake and sent to a general queue, from which it splits into two directions:

  • one to be repartitioned/optimized and saved again into the data lake
  • one to notify other systems, using different queues for different event types

The Relational Data Pipeline, on the other hand, needs to incrementally extract data from the data warehouse, so it periodically polls through a Lambda to understand when new data is available. When there is new data, a Lambda writes to a queue, and this triggers a separate Lambda that checks the datasets and launches extraction through functions that wrap Spark jobs. When the Spark jobs finish they put a message on a new queue, which triggers another Lambda responsible for merging the historical data with the new data (another Spark job). When this job completes successfully it writes to yet another queue to let other systems know that new data is available in the data lake.
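To make one hop of that chain concrete, here is a hedged sketch of a Python Lambda handler triggered by an SQS message that submits a Spark step to an EMR cluster via boto3; in the flow described above, the Spark job itself would then write to the next queue when it finishes. The cluster id, message contents and script path are hypothetical, not Zalando’s setup.

```python
# Sketch of a Lambda handler in the middle of the chain: an SQS message about
# new data triggers it, and it submits the extraction Spark job as an EMR step.
# Cluster id, message contents and script path are hypothetical.
import json

import boto3

emr = boto3.client("emr")
CLUSTER_ID = "j-XXXXXXXXXXXX"  # placeholder EMR cluster id


def handler(event, context):
    for record in event["Records"]:              # one record per SQS message
        dataset = json.loads(record["body"])["dataset"]

        # Wrap the Spark job: submit it as a step on the running EMR cluster.
        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[{
                "Name": f"extract-{dataset}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://code/extract.py",
                             "--dataset", dataset],
                },
            }],
        )
        # In the pipeline described above, the Spark job itself writes to the
        # next SQS queue when it finishes, triggering the merge step.
```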

Mark also highlighted some interesting features of the Delta Lake format that they’re using (a small sketch follows the list):

  • it works on top of Parquet and efficiently keeps track of which files to process
  • it guarantees transactionality for readers
  • efficient file listing
  • it supports time travelling, for example to check a version of the dataset at a given time and re-run a computation
  • it also supports data deletion and thus can help you make your data GDPR compliant
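Here is a hedged PySpark sketch of two of those features, transactional writes and time travel, assuming the Delta Lake package is available on the cluster (e.g. started with --packages io.delta:delta-core). The path and data are illustrative.

```python
# A hedged PySpark sketch of Delta Lake writes and time travel. Requires the
# Delta package on the cluster; path and data are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "s3a://lake/optimized/orders_delta/"

# Version 0: initial load, written transactionally.
spark.createDataFrame([(1, "CREATED")], ["order_id", "status"]) \
     .write.format("delta").mode("overwrite").save(path)

# Version 1: new data overwrites the table; readers see one version atomically.
spark.createDataFrame([(1, "SHIPPED")], ["order_id", "status"]) \
     .write.format("delta").mode("overwrite").save(path)

# Time travel: re-read the dataset as it was at version 0 and re-run logic.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```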

Tools I’d like to bring further

  • Scalyr, a log management system
  • Apache Beam, a programming model to design both batch and streaming flows
  • Flink, a framework to develop streaming applications
  • Nakadi, a tool that provides a RESTful API on top of Kafka
  • HiveRunner and Mutant Swarm, for testing SQL and verifying test coverage
  • Embulk, to easily move data
  • Delta Lake, as an alternative storage layer for Spark applications

A couple of quotes I’d like to share

If the goal of data science is to make data useful, the goal of data engineering is to make data usable.

Or, as some bad guy said:

… to make data scientists useful

;)

Another interesting thought that was shared:

Information causes change; if it doesn’t, then it’s not information.

A good recipe to build successful teams:

hire interesting people

reward self-learners

use case studies in training

And finally one that is always topical:

So is data the new gold? No, data is fool’s gold. Insight is pure gold.

Conclusions

  • Strata conference was a great experience and there were really useful talks
  • go serverless: rely on the guarantees of cloud providers; pay only for what you use
  • unleash automation: it enables CI/CD and can give you the control and capacity to help other teams
  • some of the biggest challenges are not technical, e.g. data quality
  • there are many tools that I would like to know more about! :)

See you, and thank you for reading my post! If you liked it I would appreciate claps, follows and shares. Also, visit my company website or follow our LinkedIn page to discover more content.

me and my colleague Jacopo watching keynotes :)


Pietro La Torre
Quantyca

Data professional with 12 years of experience in consultancy across technologies and sectors, passionate photographer, serial traveller, amateur chess player