My journey at Strata ’19 Conf

Pietro La Torre · Published in Quantyca

16 min read · May 10, 2019

In this post I’d like to sum up what I learned at the Strata Data Conference in London, both from the keynotes and from the talks I decided to attend. The conference was organized by O’Reilly and Cloudera and had many tracks with lots of interesting talks.

Day 1

Highlights

These are the topics from the first day of the conference that I’d like to share:

In a few words

In the last year many more companies have moved to the cloud, and many cloud services have been born or have grown very fast. We expect technologies to be simple, fast, flexible and reliable, and we would like to reduce costs as much as we can.

Nowadays it’s possible to build data platforms capable of storing, processing, and analyzing data across multiple public and private cloud platforms and on-premises data centers.

Serverless solutions are spreading as companies gain a better understanding of cloud services. For many use cases there’s no need to worry about the underlying architecture and you can just focus on pipelines. Data governance and catalogs, together with lineage solutions, are getting more attention, even though their importance is still underestimated by many companies.

Data Science and Data Engineering need to co-operate to get value from data and to build both successful pipelines and architectures.

NLP is spreading and growing fast thanks to new techniques in AI, enabling lots of use cases.

How to measure engagement

The Financial Times did a great job with its customer strategy and reached its goal of 1M paying subscribers one year ahead of schedule. They looked at many factors to understand how to improve customer engagement.

Looking for the right measure

So they decided to focus on engagement, but how can engagement be quantified?

Different ways to get engagement

Finally, they found a way to build individual engagement scores based on recency, frequency and volume for each customer.

Given the engagement score, they can plan customer-specific actions to drive growth.
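The talk didn’t go into the exact formula, but the idea of a recency/frequency/volume score is easy to sketch. Below is a minimal Python illustration that combines the three signals into a single number; the caps, weights and bucketing are my own assumptions, not FT’s actual model.

```python
from dataclasses import dataclass


@dataclass
class CustomerActivity:
    days_since_last_visit: int        # recency
    visits_last_90_days: int          # frequency
    articles_read_last_90_days: int   # volume


def engagement_score(a: CustomerActivity,
                     w_recency: float = 0.4,
                     w_frequency: float = 0.3,
                     w_volume: float = 0.3) -> float:
    """Combine recency, frequency and volume into a single 0-100 score.

    The caps and weights here are purely illustrative.
    """
    # Recency: the more recent the last visit, the higher the score.
    recency = max(0.0, 1.0 - a.days_since_last_visit / 90.0)
    # Frequency and volume: capped so heavy users don't dominate the scale.
    frequency = min(a.visits_last_90_days / 30.0, 1.0)
    volume = min(a.articles_read_last_90_days / 60.0, 1.0)
    return 100.0 * (w_recency * recency + w_frequency * frequency + w_volume * volume)


print(engagement_score(CustomerActivity(3, 12, 25)))   # an engaged reader
print(engagement_score(CustomerActivity(60, 1, 1)))    # a churn-risk reader
```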

Here you can find part of this keynote.

The importance of change

James Burke, a historian with a great sense of humour, talked about innovation in human history and the importance of change.

History teaches us that what is initially seen as an outlier or an anomaly can turn routine into evolution by introducing new features.

James Burke during keynotes

He also shared his thoughts on climate change and spoke about nano factories, devices that in the near future will be able to put together different kinds of atoms: this technology is going to change our future. And this is not science fiction: Manchester University built the first atomic-scale nano factory robot 5 months ago.

Here is part of this keynote.

NLP can help you to find a job

Indeed described how they deal with 20 million job descriptions and 150 million résumés to find the best matches. In fact, a single job title can be expressed in many ways, and the same happens for résumés.

their challenge

By using Natural Language Processing with the support of Spark NLP, they built a library to do normalization, i.e. mapping terms to a standard term by finding equivalence classes. Immediate advantages: deduplication of data, the possibility to query with equivalent terms, and the reduction of a huge corpus to a small set of classes.

an example of normalization

They considered two approaches:

  • Rule-based normalization: very effective, but lacking in completeness, requiring effort, and hard to extend and scale
  • Learned normalization: fast, cheap, scalable and proactive, but it can lead to errors given the limited control and human oversight

They used the latter to build a pipeline, starting with preprocessing.

example of preprocessing

They also used preprocessing to resolve acronyms and to deal with numbers. Then they used document frequency to keep only high-frequency words as the normalized values. They combined the MinHash algorithm with the Jaccard distance, followed by the Levenshtein distance to filter terms. Finally, by computing the Euclidean distance they obtain the normalized text.
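To make the matching steps more concrete, here is a toy, pure-Python sketch of the candidate selection and filtering idea: Jaccard similarity on token sets (which MinHash approximates at scale) followed by a Levenshtein-distance ranking. The standard terms, threshold and helper names are hypothetical; this is not Indeed’s Spark NLP implementation.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def normalize(title: str, standard_terms: list) -> str:
    """Map a raw job title to the closest standard term."""
    tokens = set(title.lower().split())
    # Step 1: coarse candidate selection via Jaccard on token sets
    #         (Indeed used MinHash to approximate this at scale).
    candidates = [t for t in standard_terms
                  if jaccard(tokens, set(t.lower().split())) > 0.3]
    if not candidates:
        candidates = standard_terms
    # Step 2: fine-grained ranking via Levenshtein distance.
    return min(candidates, key=lambda t: levenshtein(title.lower(), t.lower()))


standards = ["software engineer", "data engineer", "registered nurse"]
print(normalize("Sr. Software Engineer (Java)", standards))  # -> software engineer
```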

the full pipeline with some examples

Here are their slides.

A fashionable platform

Zalando explained how they managed to build a marketing platform based on machine-learning.

They talked about their architecture and data flows: their main tools are Spark, Kafka, Kubernetes, Scalyr, TensorFlow and SageMaker.

They shared 10 insights:

customer centricity crosses system boundaries

  • users can come from a Google search, from clicking on a Facebook ad, …
  • two main points of attention: campaign management and user experience

cross functional business ownership

  • build your teams in a way so that they can adapt to business needs
  • treat every team like a separate business
  • measure performance for: Product Managers, Engineers, Data Scientists

API first to automate for scale

  • for better extensibility
  • compatible with autonomous teams
  • to quickly enable A/B testing

measure everything

  • for continuous improvement you need to measure actions & costs
  • identify KPI and tests and check them to evaluate progress
  • build value incrementally

entrust autonomy for speed and best talent

play by the rules to collaborate

  • define your own engineering principles, tech radar, guidelines, tech rules of play
  • Use microservices, APIs and event-driven architectures
  • Use change management systems

deploy yourself and frequently via CDP (continuous deployment)

explore new platforms as an early adopter

  • e.g. an ML orchestration workflow via AWS functions

discover machine learning opportunities

team up with autonomy, mastery and purpose

Champions in innovation

Accenture Labs surveyed around 1,000 global senior executives in key industrial sectors and discovered that only 22% achieved a RODI (Return on Digital Investments) that exceeded their expectations. They called these companies “Champions” and analyzed why they are successful. Champions also scale more than 50 percent of their digital pilots.

For high-performing Champions, it’s not about scaling more pilots (even though they do). It’s about earning more by scaling better.

The remaining 78% of companies struggle against common challenges.

They identified two other groups, Contenders (trailing the Champions) and Cadets, and collected data on organizational challenges across six categories. They estimated the correlation between these challenges and RODI, then analyzed how much RODI could be increased, in percentage terms, by overcoming these challenges across industries.

Potential Increase in RODI

New innovations require companies to reimagine how they work, to digitally transform their operations and to exceed their customers’ ever-evolving needs. Each of these tasks comes with a unique set of challenges.

It was a very useful talk on the challenges of scaling and the actions to take to maximize scale-up.

Here you can find many more details and here you can download their report.

Lyft

In this talk Lyft showed how their Data Platform handles two of their use cases. They built their business on connecting drivers with users by effectively computing the best price for a ride, finding the closest rider and estimating the time of arrival.

Guiding principles for the Data Platform team

Their first need was to process, in near real-time, data coming from the ODS and to compute valuable business metrics from raw data while, of course, being able to monitor the health of the system. They combined a CDC solution with Kafka to collect events, then used Flink to develop streaming flows and S3 as a data lake, with Hive on top, where they periodically persist data. With Airflow as orchestrator, they developed Spark jobs to compute additional layers on top of the raw data. They then used Google BigQuery to analyze this data, both coming from the streaming flows (thanks to its API support) and from S3.
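As a rough illustration of the orchestration part, here is a minimal Airflow DAG that chains two Spark jobs building derived layers on top of raw data in S3. The DAG id, paths, schedule and job scripts are made up for the example (and it assumes the Spark provider package is installed); it sketches the pattern, not Lyft’s actual code.

```python
# A minimal Airflow DAG sketch: raw events land in S3, Spark jobs then build
# derived layers on top. DAG id, paths and job scripts are hypothetical;
# requires the apache-airflow-providers-apache-spark package.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="raw_to_derived_layers",
    start_date=datetime(2019, 5, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # First Spark job: clean and deduplicate the raw CDC events.
    clean_raw = SparkSubmitOperator(
        task_id="clean_raw_events",
        application="jobs/clean_raw_events.py",
        application_args=["--input", "s3://lake/raw/events/",
                          "--output", "s3://lake/clean/events/"],
    )

    # Second Spark job: compute business metrics on top of the clean layer.
    build_metrics = SparkSubmitOperator(
        task_id="build_business_metrics",
        application="jobs/build_business_metrics.py",
        application_args=["--input", "s3://lake/clean/events/",
                          "--output", "s3://lake/metrics/daily/"],
    )

    clean_raw >> build_metrics  # metrics run only after the clean layer is ready
```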

Use case 1 Architecture Diagram

The goal of the second use case was to deal with real-time data coming from the devices of both riders and drivers. A strong requirement was to find a quick and simple way to clean data.

Notification in real-time

Their application must be able to determine whether there’s traffic, update the time of arrival accordingly and notify users. In such cases they added a feature called Prime Time, which increases the price so that users are aware and can decide what to do, while drivers get help handling the situation.

Traffic info in real-time

For this use case, they managed to reuse the flows developed for batch also for streaming, with minimal effort, thanks to Beam/Flink.
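Here is a minimal Apache Beam sketch (in Python) of that idea: the transform logic is written once and can be applied either to a bounded source for batch or to an unbounded source (e.g. Kafka) for streaming on the Flink runner. The file name, field layout and aggregation are illustrative assumptions, not Lyft’s pipeline.

```python
# A minimal Beam sketch: the transform logic is written once and reused for
# batch and streaming. File name, fields and aggregation are illustrative.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build_transforms(events):
    """Business logic defined once; applied to bounded or unbounded input."""
    return (
        events
        | "ParseCsv" >> beam.Map(lambda line: line.split(","))
        | "KeyByRegion" >> beam.Map(lambda f: (f[0], float(f[1])))  # (region, eta)
        | "AvgEtaPerRegion" >> beam.combiners.Mean.PerKey()
    )


if __name__ == "__main__":
    # Batch run over historical files with the default runner. For streaming,
    # the same build_transforms would be applied to an unbounded source
    # (e.g. Kafka) with windowing, running on the Flink runner.
    with beam.Pipeline(options=PipelineOptions()) as p:
        results = build_transforms(
            p | "ReadHistory" >> beam.io.ReadFromText("eta_events.csv")
        )
        results | "Write" >> beam.io.WriteToText("eta_by_region")
```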

Use Case 2 Architecture Diagram

On day 2 they explained this second use case and its challenges in more depth. Below are both use cases across the Data Platform in a single diagram.

Data Platform with both use cases

They evaluated Kinesis against Kafka, but the former had many limitations for scaling.

Here you can find their slides.

How architectures change their outfit

Stitch Fix offers a personalized styling service. At Strata they gave a really interesting talk showing how their data infrastructure changed over the months as both the business and the use cases evolved. They focused on the data scientists’ point of view.

They described how their architecture changed by defining brand-new “generations”:

  • Initially they only had data scientists individually writing their own code and saving data where possible.
  • Then they started to suffer from the lack of sharing, of knowledge and data but also of tools. This is why they built Generation 1
Generation 1

Always learn from your previous generation’s mistakes.

  • They found it difficult to build prototypes and to maintain models with Generation 1. As the business became more stable and the use cases better defined, they redesigned some parts of the platform, Generation 2, using modern tools and frameworks with better abstractions.
Generation 2

The talk then explained some details on their current implementation and how/why they worked on an improved Hive Metastore Interface.

They touched on some interesting notes about Livy/Spark for managing sessions and metadata. They also shared some useful hints on Arrow as a good language-independent format.

Our approach has been to provide self sufficiency first and iterate on the services moving forward.

Here you can find their presentation. By the way, here you can dive into their blog.

Dive into the data lake

Cloudera’s talk described how a typical data lake can be organized into 4 horizontal layers, from bottom to top:

  1. Landing Zone: ingests raw data as it is on the source systems
  2. Discovery Zone: used by data scientists to explore data; it is a mix of views and materialized data, and contains data enriched by joining reference data
  3. Shared Zone: data shared across lines of business (given security constraints). Here data coming from multiple sources is joined together, and data scientists are given incentives to move their data and use cases into this zone
  4. Optimized Zone: built to improve performance and organized by use case (not by source). Data in this layer is often denormalized for performance, uses optimized storage formats (Parquet + partitioning, Kudu, HBase), is accessed by low-latency query engines (Impala, Solr) and provides access to ML models, e.g. via REST APIs

What about real-time?

It can be achieved by adding a fast lane that crosses all layers but the last:

  • by using low-latency components, e.g. Kafka, NiFi, Spark Streaming
  • by consuming data straight from the sources and transforming/analyzing it
  • by delivering its output directly to the Optimized Zone for low-latency queries (a minimal sketch of such a fast lane follows the list)
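As a hedged example of what such a fast lane could look like, here is a minimal Spark Structured Streaming job that consumes events from Kafka and lands them in the Optimized Zone as partitioned Parquet. The topic name, schema and paths are invented for the sketch.

```python
# A minimal Spark Structured Streaming sketch of the "fast lane": consume
# events from Kafka and land them directly in the Optimized Zone as
# partitioned Parquet. Topic, schema and paths are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("fast-lane").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", to_date(col("event_time")))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://lake/optimized/orders/")
    .option("checkpointLocation", "s3a://lake/checkpoints/orders/")
    .partitionBy("event_date")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```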

Day 2

Highlights

Black & White Box Data

At OpenCorporates they collect data on companies across the globe and make it usable for public needs. They have already collected data on around 169 million companies.

In this talk they discussed how data has changed over the last few years.

In the past data was:

  • local
  • siloed
  • expensive to store

And external data was a black box.

What about the present? Everyone knows that nowadays our world is built on data. Let’s see how black box and white box data are different.

black box data

  • no clear definition
  • no provenance (no context)
  • proprietary identifiers (no connections + lock-in)
  • low quality
  • high latency
  • opaque model (limited utility)
  • limited access (poor feedback loops)

white box data

  • well documented
  • provenance (context)
  • open identifiers (possible connections + no lock-in)
  • transparent model (widespread utility)
  • public access (great feedback loops)

But how does black box data arise?

The Black Box model (simplified)

Which, in the real world, is part of a wider (and more complex) scenario:

The Black Box model (in reality)

In OpenCorporates’ context, black box data can lead to anti-money-laundering failures and represents a barrier to innovation, as well as a prosperous habitat for criminals. This is why their “mission” is to provide white box data about companies. Among those who use it: Mastercard and law enforcement/tax authorities.

Here are their slides, and here an extract of the video.

Privacy and AI

Sandra Wachter from the University of Oxford shared her thoughts in a very interesting talk. While AI models are becoming more and more effective at discovering the behaviours and preferences of customers, new opportunities for decision making arise. But these decisions can be based on sensitive information derived from the initial data and can lead to discriminatory behaviour.

For example, in China you can be denied credit if you play videogames, because some models classify players as unreliable subjects. I strongly disagree, of course (also because I’m a player).

So the problem is: inferences can create enriched data, but its usage still needs to be regulated, and in this case (compared with the source data) it is harder to say what can and cannot be done. On one hand, if individuals gave their permission on the source data, a company gets the right to use it for inferences and analysis; on the other hand, individuals still need to be protected against discrimination, even when it arises indirectly from the data they agreed to share.

Focus not only on what’s possible but also on what’s reasonable.

Can a query be tested?

Hotels.com showed how they found a way to create unit tests and measure coverage for SQL code, applying software engineering best practices to SQL-based data applications.

Their vision:

  • Any process that alters or interprets data should be tested!
  • Unit tests are the first defence
  • Hours of test development save days of data clean-up

I completely agree with it.

They used HiveRunner as a unit test framework for Hive SQL. It is based on JUnit 4 and lets you define test cases in Java + Hive SQL. Moreover, it is designed for local execution. It needs a SQL setup script (which prepares the data), the SQL script to be tested (where your queries live) and a set of tests.
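HiveRunner itself is a Java/JUnit 4 framework, so the snippet below is only a language-neutral illustration of the same pattern (set up the input data, run the SQL under test, assert on the result), written with Python’s built-in sqlite3 and unittest. The table and query are made up.

```python
# Not HiveRunner: a language-neutral sketch of the same pattern (prepare data,
# run the SQL under test, assert on the result) using Python's sqlite3.
import sqlite3
import unittest

QUERY_UNDER_TEST = """
SELECT hotel_id, COUNT(*) AS bookings
FROM bookings
WHERE status = 'CONFIRMED'
GROUP BY hotel_id
"""


class TestBookingAggregation(unittest.TestCase):
    def setUp(self):
        # Setup script: prepare the input data the query expects.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE bookings (hotel_id TEXT, status TEXT)")
        self.db.executemany(
            "INSERT INTO bookings VALUES (?, ?)",
            [("h1", "CONFIRMED"), ("h1", "CANCELLED"), ("h2", "CONFIRMED")],
        )

    def test_only_confirmed_bookings_are_counted(self):
        rows = dict(self.db.execute(QUERY_UNDER_TEST).fetchall())
        self.assertEqual(rows, {"h1": 1, "h2": 1})


if __name__ == "__main__":
    unittest.main()
```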

How HiveRunner works

HiveRunner is currently in the Trial ring of the Thoughtworks Tech Radar.

To test monolithic queries, they suggest creating views for every inner query, so that every part of the SQL code can be tested.

To measure test coverage they used mutation testing, which consists of modifying the code in small ways to generate mutants and measuring the percentage of mutants killed. New tests are then designed to kill additional mutants.
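To show the idea, here is a toy Python sketch of mutation testing applied to a SQL string: generate mutants via naive operator/value substitutions, re-run the test suite against each mutant, and report the share of mutants killed. Mutant Swarm does this properly by parsing the SQL into genes; the mutation list and helper names here are purely illustrative.

```python
# A toy sketch of mutation testing for SQL: swap operators/values to create
# mutants, re-run the tests against each mutant, report the share killed.
# Mutant Swarm parses the SQL into "genes"; here we just do naive substitutions.
MUTATIONS = [("=", "<>"), ("COUNT", "SUM"), ("'CONFIRMED'", "'CANCELLED'")]


def generate_mutants(sql: str):
    for original, replacement in MUTATIONS:
        if original in sql:
            yield sql.replace(original, replacement, 1)


def mutation_score(sql: str, test_suite) -> float:
    """Share of mutants for which at least one test fails ("killed")."""
    mutants = list(generate_mutants(sql))
    killed = sum(1 for mutant in mutants if not test_suite(mutant))
    return killed / len(mutants) if mutants else 1.0

# test_suite is any callable that runs the tests against the given query text
# and returns True when they all pass; surviving mutants point at coverage gaps.
```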

Let’s see an example with a sample table, a sample query and a set of tests.

Test Example

They used Mutant Swarm to analyze the SQL code and determine test coverage. From the SQL it identifies genes, i.e. functions, operators and values that can be used to create mutations. For every mutation it creates a new query and verifies the test reports.

Creating mutations
Verifying coverage as % mutants killed

Survivors suggest either a gap in the tests or that more tests and more discriminative data are required.

Many improvements are coming for these tools, to address performance and flexibility challenges and to increase coverage (by adding mutation types).

Very interesting talk! I definitely want to investigate these tools further.

Here are their slides.

Bulk loading, the easy way

Jason Bell talked about Embulk and gave a demo showing how it can be used to quickly and easily move data between two systems.

Embulk is an open-source bulk data loader that helps transfer data between various databases, storage systems, file formats, and cloud services.

Embulk plugins

It seems to be a very easy solution, since it is quickly installed and has some very useful features:

  • it can guess the schema
  • it can be executed in parallel (through YARN or K8s)
  • it has lots of plugins
  • it supports incremental runs (both for files and for databases)
  • it can do transaction recovery

For the moment I think I would give it a try for one-time-only migrations. Jason showed how it can be scheduled with Airflow to create chains, monitor execution and build pipelines.
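As a small sketch of that scheduling idea, the Airflow DAG below chains two Embulk runs with BashOperators (`embulk run <config>` is Embulk’s standard CLI). The config paths, DAG id and schedule are invented for the example.

```python
# A minimal sketch of chaining Embulk runs from Airflow, roughly as in the
# demo. Config paths, DAG id and schedule are made up for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="embulk_mysql_to_s3",
    start_date=datetime(2019, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_orders = BashOperator(
        task_id="embulk_load_orders",
        bash_command="embulk run /etc/embulk/orders_mysql_to_s3.yml",
    )
    extract_customers = BashOperator(
        task_id="embulk_load_customers",
        bash_command="embulk run /etc/embulk/customers_mysql_to_s3.yml",
    )
    extract_orders >> extract_customers  # simple chain, executed and monitored by Airflow
```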

Here you can find their presentation.

A fashionable platform also needs fashionable pipelines

This was Zalando’s second talk that I attended. This time the focus was on the data platform and on data pipelines. Mark explained that they started, as usual, by integrating many data sources into a single, central data warehouse. But as data grew they needed to evolve, so they added an event bus and a data lake and built an event-driven architecture with the support of microservices.

The final architecture can be split into three layers: Ingestion, Storage and Serving.

Zalando’s Data Lake

Data in the Ingestion Layer can come from the event bus, from the data warehouse or from Google Analytics. The Storage Layer is built on AWS S3 plus a metastore. The Serving Layer uses different tools to address a wide range of requirements.

The talk continued by describing two data pipelines, one for data coming from the event bus and one for data coming from DWH.

The Event Data Pipeline was built with the support of Nakadi (and the Nakadi archiver), Lambdas, SQS for notifications, and S3. For incoming events the pipeline generates files on S3 and writes to a queue. The files are then opened and written to the data lake; on Lambda failures they are saved to a dead-letter queue. Data is then read from the data lake and sent to a general queue, from which it splits into two directions:

  • one to be repartitioned/optimized and saved again into the data lake
  • one to notify other systems, using different queues for different event types

The Relational Data Pipeline, on the other hand, needs to incrementally extract data from the data warehouse, so it periodically polls through a Lambda to understand when new data is available. When there is new data, a Lambda writes to a queue, and this triggers a separate Lambda that checks the datasets and launches extraction through functions that wrap Spark jobs. When the Spark jobs finish they put a message on a new queue, which triggers another Lambda responsible for merging the historical data with the new data (another Spark job). When this job completes successfully it writes to yet another queue to let other systems know that new data is available in the data lake.
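To make one hop of that chain concrete, here is a hedged sketch of a Python Lambda handler triggered by an SQS message that submits a Spark step to an EMR cluster via boto3; in the flow described above, the Spark job itself would then write to the next queue when it finishes. The cluster id, message contents and script path are hypothetical, not Zalando’s setup.

```python
# Sketch of a Lambda handler in the middle of the chain: an SQS message about
# new data triggers it, and it submits the extraction Spark job as an EMR step.
# Cluster id, message contents and script path are hypothetical.
import json

import boto3

emr = boto3.client("emr")
CLUSTER_ID = "j-XXXXXXXXXXXX"  # placeholder EMR cluster id


def handler(event, context):
    for record in event["Records"]:              # one record per SQS message
        dataset = json.loads(record["body"])["dataset"]

        # Wrap the Spark job: submit it as a step on the running EMR cluster.
        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[{
                "Name": f"extract-{dataset}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://code/extract.py",
                             "--dataset", dataset],
                },
            }],
        )
        # In the pipeline described above, the Spark job itself writes to the
        # next SQS queue when it finishes, triggering the merge step.
```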

Mark also highlighted some interesting features of the Delta Lake format that they’re using (a small sketch follows the list):

  • it works on top of Parquet and efficiently keeps track of which files to process
  • it guarantees transactionality for readers
  • efficient file listing
  • it supports time travelling, for example to check a version of the dataset at a given time and re-run a computation
  • it also supports data deletion and thus can help you make your data GDPR compliant
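Here is a hedged PySpark sketch of two of those features, transactional writes and time travel, assuming the Delta Lake package is available on the cluster (e.g. started with --packages io.delta:delta-core). The path and data are illustrative.

```python
# A hedged PySpark sketch of Delta Lake writes and time travel. Requires the
# Delta package on the cluster; path and data are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "s3a://lake/optimized/orders_delta/"

# Version 0: initial load, written transactionally.
spark.createDataFrame([(1, "CREATED")], ["order_id", "status"]) \
     .write.format("delta").mode("overwrite").save(path)

# Version 1: new data overwrites the table; readers see one version atomically.
spark.createDataFrame([(1, "SHIPPED")], ["order_id", "status"]) \
     .write.format("delta").mode("overwrite").save(path)

# Time travel: re-read the dataset as it was at version 0 and re-run logic.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```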

Tools I’d like to bring further

  • Scalyr, a log management system
  • Apache Beam, a programming model to design both batch and streaming flows
  • Flink, a framework to develop streaming applications
  • Nakadi, a tool that provides a RESTful API on top of Kafka
  • HiveRunner and Mutant Swarm, for testing SQL and verifying test coverage
  • Embulk, to easily move data
  • Delta Lake, as an alternative storage layer for Spark applications

A couple of quotes I’d like to share

If the goal of data science is to make data useful, the goal of data engineering is to make data usable.

Or, as some bad guy said:

… to make data scientists useful

;)

Another interesting thought that was shared:

Information causes change; if it doesn’t, then it’s not information.

A good recipe to build successful teams:

hire interesting people

reward self-learners

use case studies in training

And finally one that is always topical:

So is data the new gold? No, data is fool’s gold. Insight is pure gold.

Conclusions

  • Strata conference was a great experience and there were really useful talks
  • go serverless: rely on the guarantees of cloud providers; pay only for what you use
  • unleash automation: it enables CI/CD and can give you the control and capacity to help other teams
  • some of the biggest challenges are not technical, e.g. data quality
  • there are many tools that I would like to know more about! :)

See you, and thank you for reading my post! If you liked it I would appreciate claps, follows and shares. Also, visit my company website or follow our LinkedIn page to discover more content.

me and my colleague Jacopo watching keynotes :)


Pietro La Torre
Quantyca

Data professional with 12 years of experience in consultancy across technologies and sectors, passionate photographer, serial traveller, amateur chess player