Earlier this year we called out 2019 as the year of the data engineer, and this week we attended the Data Engineering Conference in San Francisco. An overarching sentiment during the conference was the co-dependency of data scientists and data engineers. Data scientists often feel held back by data engineering teams, who are unfairly blamed for resource constraints, limited data access rights, and slow queries. The talks emphasized the need for tools and solutions that make data engineers’ lives easier and, in turn, empower data scientists to do more, better, and faster.
While the conference ran for two days, this post discusses the themes from day one: 1) time to insight, 2) data security, 3) data quality, and 4) data volume.
Accelerating data scientists’ time to insight. One of the first steps in a data scientist’s workflow is finding and accessing data. Lyft found that data scientists spent 25% of their time discovering data. In response, it built Amundsen, a data discovery solution built on top of a metadata engine. Amundsen identifies data across different database backends and allows users to search for particular data types. It uses both the relevance and the popularity of key terms to rank search results. It also shows users each dataset’s owners, its five most frequent users, descriptions, and tags. The solution has decreased the time it takes to discover an artifact to 5% of the pre-Amundsen baseline.
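To make the idea of ranking by both relevance and popularity concrete, here is a minimal sketch. It is not Amundsen's code or API: the `TableMetadata` fields, the term-overlap relevance measure, the log-scaled `query_count` popularity signal, and the `popularity_weight` value are all assumptions we chose for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TableMetadata:
    name: str
    description: str
    query_count: int  # hypothetical popularity signal: how often the table is queried


def relevance(query: str, table: TableMetadata) -> float:
    """Fraction of query terms that appear in the table name or description."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    text = f"{table.name} {table.description}".lower()
    return sum(1 for term in terms if term in text) / len(terms)


def score(query: str, table: TableMetadata, popularity_weight: float = 0.3) -> float:
    """Boost relevant tables by a log-scaled popularity term (weight is an assumption)."""
    rel = relevance(query, table)
    if rel == 0.0:
        return 0.0  # popularity never rescues an irrelevant table
    return rel * (1.0 + popularity_weight * math.log1p(table.query_count))


def search(query: str, catalog: list[TableMetadata]) -> list[TableMetadata]:
    """Return matching tables, highest score first."""
    scored = [(score(query, table), table) for table in catalog]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [table for s, table in scored if s > 0.0]
```

With two equally relevant tables, the one that is queried more often ranks first, which is the behavior a data scientist hunting for the "trusted" version of a dataset usually wants.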
Notebooks are another tool that speeds time to insight, and they have become a default for data scientists doing quick prototyping and exploratory analysis. Jupyter notebooks are managed JSON documents with a clean interface for executing code. The Jupyter protocol offers a messaging API for communicating with kernels, which act as computational engines. This separates where content is written from where the code is executed. A backend file format stores both the code and the results. Data scientists like notebooks because they are shareable, easy to read, able to output reports, and support multiple languages, and because results can be accessed without rerunning the code.
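As a rough illustration of why results survive without a rerun, the sketch below builds a one-cell notebook as a plain dictionary following the version 4 notebook layout (`cells`, `metadata`, `nbformat`) and reads the stored output back using only the standard library. The cell contents and the `stored_text_results` helper are our own illustrative choices.

```python
import json

# A one-cell notebook in the nbformat 4 layout: code and its output live side by side.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        }
    ],
}


def stored_text_results(nb: dict) -> list[str]:
    """Collect plain-text results saved in the file, without executing any code."""
    results = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        for output in cell["outputs"]:
            text = output.get("data", {}).get("text/plain")
            if text is not None:
                results.append("".join(text))
    return results


# Round-trip through JSON text, as if the .ipynb file were written and read back.
reloaded = json.loads(json.dumps(notebook))
saved_results = stored_text_results(reloaded)
```

No kernel is involved in reading `saved_results`; the kernel is only needed when a cell is actually executed.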
While Jupyter notebooks come with a lot of perks, Matt Seal of Netflix highlighted some drawbacks, including the lack of history, difficulty in testing, parameterization issues, and concurrent editing that is challenging because the documents are mutable. In response, Netflix built Papermill, a library for parameterizing, executing, and analyzing Jupyter notebooks. Papermill adds notebook isolation, which allows immutable inputs, immutable outputs, parameterized notebook runs, versioning, templatizing, and configurable sourcing and sinking.
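The parameterization idea can be sketched with the standard library alone. The code below is not Papermill's implementation; it only mimics the convention, as we understand it, of inserting an `injected-parameters` cell after the cell tagged `parameters`, and it leaves the input notebook untouched. The function name `inject_parameters` and the example parameter names are hypothetical.

```python
import copy


def inject_parameters(notebook: dict, parameters: dict) -> dict:
    """Return a new notebook with an override cell placed after the 'parameters' cell."""
    result = copy.deepcopy(notebook)  # the input template is never mutated
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": [f"{name} = {value!r}\n" for name, value in parameters.items()],
    }
    for index, cell in enumerate(result["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            result["cells"].insert(index + 1, injected)
            break
    else:
        # No tagged cell found: put the overrides at the top of the notebook.
        result["cells"].insert(0, injected)
    return result


def code_cell(source: list[str], tags: list[str]) -> dict:
    """Small helper for building example cells."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": tags},
        "outputs": [],
        "source": source,
    }


template = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        code_cell(["region = 'us-east'\n", "days = 30\n"], ["parameters"]),  # defaults
        code_cell(["print(region, days)\n"], []),
    ],
}

run_notebook = inject_parameters(template, {"region": "us-west", "days": 7})
```

Because each run produces a separate output document, the template acts as an immutable input and every parameterized run can be stored and versioned on its own.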
Finally, Rockset discussed how slow queries and pipelines hurt people’s efficiency. Rockset addresses this problem with its serverless search and analytics engine, which does not require users to define a schema upfront. Instead, it can ingest any dataset and apply field interning, type hoisting, and converged indexing. The resulting “smart schemas” lead to more efficient database configurations and queries.
Data security is a priority. We’ve discussed the need for improved data security solutions due to GDPR, CCPA, and other regulations. Tactics ML engineers employ to work with sensitive data include anonymization, obfuscation (differential privacy), and encryption. Praneeth Vepakomma of MIT underscored that these options are inefficient. He pointed to the deanonymization of Netflix Prize data and AOL search queries as evidence that anonymization isn’t a sound technique. Privacy engineers have told us that differential privacy doesn’t work when ML teams try to identify anomalies, because injecting noise smooths those anomalies out. While homomorphic encryption is powerful, it has issues at scale because of the computing power it needs, especially when there are numerous multiplication operations.
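To show where that noise comes from, here is a minimal sketch of the Laplace mechanism, a standard way to make a counting query differentially private. The function names, the epsilon value, and the example counts are our own assumptions, and a production system would need careful privacy budgeting that this sketch ignores.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: float, epsilon: float, rng: random.Random) -> float:
    """Release a count with noise scaled to sensitivity / epsilon (sensitivity is 1 for a count)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the example is repeatable
hourly_logins = [100, 102, 99, 101, 108]  # hypothetical data; the last hour is a small spike
released = [private_count(count, epsilon=0.5, rng=rng) for count in hourly_logins]
```

With epsilon set to 0.5 the noise has a standard deviation of roughly 2.8, so a small spike in a single bucket can be hard to distinguish from the noise itself, which is the concern the privacy engineers raised.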
Vepakomma instead proposed split learning, which allows entities to train (and infer from) ML models without sharing any raw data or detailed information about the model. With split learning, each client trains a partial deep network up to a specific layer. The outputs of that layer are sent to a server that completes the rest of the training without looking at the raw data from any client. This completes a round of forward propagation. The gradients are then backpropagated at the server from its last layer. Split learning has an advantage over federated learning because it does not reveal the gradients of previous layers and requires fewer resources on the client.
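The round trip described above can be sketched with a toy two-layer model in which each "layer" is a single weight. This is our own simplification rather than Vepakomma's code: we assume the labels are available on the server, that the server returns only the gradient at the cut layer to the client, and that the learning rate and data are arbitrary illustrative values.

```python
class Client:
    """Holds the raw inputs and the first part of the network."""

    def __init__(self, weight: float, learning_rate: float):
        self.weight = weight
        self.learning_rate = learning_rate
        self._last_input = 0.0

    def forward(self, x: float) -> float:
        self._last_input = x
        return self.weight * x  # only this cut-layer activation leaves the client

    def backward(self, grad_activation: float) -> None:
        # Chain rule: dL/dw_client = dL/dactivation * dactivation/dw_client.
        self.weight -= self.learning_rate * grad_activation * self._last_input


class Server:
    """Holds the rest of the network and never sees the raw input."""

    def __init__(self, weight: float, learning_rate: float):
        self.weight = weight
        self.learning_rate = learning_rate

    def train_step(self, activation: float, label: float) -> tuple[float, float]:
        prediction = self.weight * activation
        error = prediction - label
        grad_weight = error * activation
        grad_activation = error * self.weight  # computed before the weight update
        self.weight -= self.learning_rate * grad_weight
        return 0.5 * error ** 2, grad_activation


def train(client: Client, server: Server, data: list[tuple[float, float]], epochs: int) -> float:
    """Run split training and return the loss of the final step."""
    loss = 0.0
    for _ in range(epochs):
        for x, y in data:
            activation = client.forward(x)                 # client side
            loss, grad = server.train_step(activation, y)  # server side
            client.backward(grad)                          # client side
    return loss


client = Client(weight=0.5, learning_rate=0.02)
server = Server(weight=0.5, learning_rate=0.02)
final_loss = train(client, server, data=[(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], epochs=50)
```

The only values crossing the client and server boundary are the cut-layer activation and its gradient; the raw `x` stays on the client and the server's weights stay on the server.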
Data quality remains a challenge. We often hear that data engineers are held responsible for bad data populating executive dashboards. Barr Moses coined the term “data downtime” for a period of time when data is partial, erroneous, missing, or otherwise inaccurate. She noted that signs you are experiencing data downtime include 1) your data team spends over 10% of its time on fire drills, 2) your company lost money because data was broken, 3) critical analysis fails because missing data went unnoticed, and 4) troubleshooting involves tedious step-by-step debugging. Teams are at different stages of solving data quality: reactive, proactive, automated, and scalable. When asked which stage they were at, most of the audience answered reactive or proactive. A scalable data quality solution like Apache Griffin or Netflix RAD helps teams continuously validate data across columns, tables, and sources, flagging data that looks erroneous or doesn’t mirror historical data distributions.
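A very simple version of "flag data that doesn't mirror its history" is a z-score check on a tracked metric such as daily row count. The sketch below is our own stand-in and not how Apache Griffin or Netflix RAD work internally; the function name, the threshold of three standard deviations, and the example counts are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest value if it sits more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical values to estimate spread")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # a perfectly flat history makes any change suspicious
    return abs(latest - mu) / sigma > threshold


daily_row_counts = [1000, 1020, 980, 1010, 990]  # hypothetical history for one table
normal_day = is_anomalous(daily_row_counts, 1005)
partial_load = is_anomalous(daily_row_counts, 400)
```

A check like this can run after every pipeline load, so a partial load is caught before it reaches a dashboard instead of during a fire drill.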
ML needs more data. We recently wrote about how supervised learning models are hungry for more data. Snorkel, a data programming solution, helps data scientists quickly label data using Python functions. If teams haven’t captured the necessary data, Tonic noted that they can use synthetic data to augment the training data set.
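The idea of labeling with Python functions can be sketched without any library. The code below does not use Snorkel's API: the spam and ham task, the three heuristics, and the simple majority vote are assumptions made for illustration, and Snorkel itself can combine votes with a learned label model rather than a plain majority.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def weak_label(text: str) -> int:
    """Combine the heuristics by majority vote; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


messages = [
    "Claim your prize at http://example.com now friends",
    "see you at noon",
    "the quarterly report is attached for review",
]
labels = [weak_label(message) for message in messages]
```

Each heuristic is noisy on its own, but writing a handful of them is much faster than hand-labeling thousands of rows, and the combined labels can then be used to train a supervised model.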
Solutions that data engineers can leverage to reduce blame and empower data scientists will find success. The Data Engineering Conference highlighted that time to insight, data quality, security, and volume are key for productive teams.