Accelerating Zonda — Workday’s Data Streaming Platform

How we applied a lean startup approach to quickly iterate on ideas, change with agility, and productize an Avro/Flink-based streaming platform

Enrico
Workday Technology
7 min read · Nov 16, 2021

--

Writer’s note: the pieces of code presented in this document have been simplified to make them easier to read; however, they should still make logical sense.

At Workday, all data is managed and secured within tenants that offer both user transactions and timely reports across a wide range of applications. This is a fundamental part of the Workday architecture (see Exploring Workday’s Architecture). As we incorporated machine learning, we had to consider how to transport and transform data in an effective and secure manner. We needed to create data processing pipelines to support new product features and provide access for data scientists.

Our project, code-named Zonda after Argentina’s wind and a very fast car, was conceived with the goal of delivering a fast, scalable, and secure data streaming solution based on the Kappa architecture. It would be used to transport huge quantities of data from the services handling user transactions to various storage solutions for access by data scientists. This initial offering would also form the basis of a platform to host inference and data manipulation jobs to be applied to the stream of activities happening within all Workday tenants. We planned to deliver our brand-new platform, and run several Flink jobs (the latest technology at the time), in the shortest amount of time possible.

Our initial design looked as follows, with multiple producers, consumers, a message queue and a schema repository.

[Figure: the initial design]

Security and scalability were the two main challenges that our architecture had to address. We approached security by implementing encryption in the JVM serialization layer, a first in the market (see our presentation at Flink Forward Berlin, 2019). On the scalability side, in addition to handling large volumes of data, we also needed to support the large number of developers who would adopt the platform. This meant offering an easy way to define new data structures, and ensuring the platform could handle a large variety of data structures in a secure and scalable way.

A Startup Mindset

As with any new project, we started with a small team of developers that would grow in size as the project proved its worth. We approached the task with a startup mindset.

We found the book The Lean Startup very helpful. Here are my main takeaways from it:

  1. The goal of a startup is to figure out the right thing to build as quickly as possible. In other words, the thing customers want and will pay for.
  2. Customers do not tell you what they want. They reveal the truth through their action or inaction.
  3. Test and validate your value and growth hypotheses.
  4. Use MVPs to cycle through the Build-Measure-Learn feedback loop as fast as possible.

A typical Workday user interacts with our finance, HR and planning applications but, as a platform team within Workday, we were targeting a very different set of users. Our users would be the Workday developers that enhance Workday products with machine learning capabilities. Our first challenge would be to get our new platform adopted by a significant portion of Workday developers. With that in mind, and drawing inspiration from the Lean Startup Methodology, we focused on these initial tasks:

  1. Get early adopters onto the platform as soon as possible so that we can understand how product features will be built on top of it.
  2. Define metrics and data-points to track the platform usage so we can validate the changes needed to improve the service.
  3. Validate our value hypothesis: that minimizing the cost of adoption (for developers and operations) will increase adoption.
  4. Deliver the MVP to our early adopters quickly and frequently iterate on it to create a feedback loop.

Phase 1. Our first users

We applied the Lean Startup Methodology to our initial design and stripped out all but the bare minimum needed to allow systems to communicate: the Kafka message bus, a schema to describe data structures, and a way to serialize and deserialize data efficiently. We selected the well-known serialization library Apache Avro. The basic functioning of Avro is as follows:

[Figure: basic Avro functionality]
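
At a high level (a minimal, self-contained illustration rather than Zonda code), Avro works like this: a schema describes the record structure, a writer serializes a record against that schema into compact binary, and a reader uses a schema to turn the bytes back into a record. The schema and field names below are invented for the example.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroBasics {
    public static void main(String[] args) throws Exception {
        // A schema describes the structure of the data (illustrative example).
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Activity\",\"fields\":["
            + "{\"name\":\"tenant\",\"type\":\"string\"},"
            + "{\"name\":\"action\",\"type\":\"string\"}]}");

        // The producer builds a record that conforms to the schema...
        GenericRecord record = new GenericData.Record(schema);
        record.put("tenant", "acme");
        record.put("action", "login");

        // ...and serializes it to compact Avro binary.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        byte[] bytes = out.toByteArray();

        // The consumer deserializes the bytes using the schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("action")); // prints "login"
    }
}
```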

As part of this first implementation, we provided a very simple API to define the basic interactions.

This translated to an interface:

[Figure: AvroMessageEncoder interface]
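
A minimal sketch of what that interface could look like; the generic parameter and method names are assumptions for illustration, not the exact Zonda API.

```java
/** Hypothetical sketch of the encoder/decoder API; names are assumptions. */
public interface AvroMessageEncoder<T> {

    /** Serialize a message into the bytes that will be published to Kafka. */
    byte[] encode(T message);

    /** Deserialize the bytes read from Kafka back into a message. */
    T decode(byte[] payload);
}
```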

We provided the logic for interacting with Avro in an abstract class, so that the way schemas are defined or retrieved could evolve easily.

[Figure: abstract class to define the basics of the API]
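
A sketch of how such an abstract class might be laid out (the class and method names below are assumptions): the shared Avro plumbing lives in protected helpers, while subclasses decide where the writer and reader schemas come from.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

/**
 * Hypothetical sketch: the Avro plumbing lives here, while subclasses decide
 * how the writer and reader schemas are defined or retrieved.
 */
public abstract class AbstractAvroMessageEncoder implements AvroMessageEncoder<GenericRecord> {

    /** Schema used to serialize outgoing messages. */
    protected abstract Schema writerSchema();

    /** Local schema the consuming code expects. */
    protected abstract Schema readerSchema();

    /** Serialize a record to Avro binary using the given schema. */
    protected byte[] toAvroBytes(GenericRecord record, Schema schema) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
            encoder.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Deserialize Avro binary, applying Avro schema resolution between the
     * schema the bytes were written with and the local reader schema.
     */
    protected GenericRecord fromAvroBytes(byte[] bytes, Schema writtenWith, Schema readInto) {
        try {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            return new GenericDatumReader<GenericRecord>(writtenWith, readInto).read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```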

To have the platform up and running in the minimum amount of time, we also provided a first, very basic implementation that used a locally available schema, effectively ignoring backward/forward compatibility.

[Figure: encoder without schema — implementation]
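
A possible shape for that first implementation, reusing the helpers from the abstract-class sketch above (the class name is an assumption): both producer and consumer use the schema bundled with their own deployment artifact, so compatibility between schema versions is simply not handled.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

/**
 * Hypothetical sketch of the first, very basic implementation: the schema is
 * bundled with the deployment artifact and used for both writing and reading,
 * effectively ignoring backward/forward compatibility.
 */
public class LocalSchemaMessageEncoder extends AbstractAvroMessageEncoder {

    private final Schema localSchema;

    public LocalSchemaMessageEncoder(Schema localSchema) {
        this.localSchema = localSchema;
    }

    @Override
    protected Schema writerSchema() {
        return localSchema;
    }

    @Override
    protected Schema readerSchema() {
        return localSchema;
    }

    @Override
    public byte[] encode(GenericRecord message) {
        return toAvroBytes(message, writerSchema());
    }

    @Override
    public GenericRecord decode(byte[] payload) {
        // Assumes producer and consumer were built from the same schema version.
        return fromAvroBytes(payload, writerSchema(), readerSchema());
    }
}
```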

Workday’s commitment to protecting our customers’ data meant that encryption was an essential part of the MVP: we aimed to keep data encrypted at rest with a unique per-tenant key, for example on Kafka’s filesystem. For this we added encryption to the serialization process, and so our architecture looked like this:

[Figure: encoder without schema — flow]
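
One way to picture this (a hedged sketch, not the actual Zonda code): a decorator that encrypts the serialized bytes with a per-tenant key before they are handed to Kafka, and decrypts them on the way out. The TenantKeyProvider hook and the AES-GCM framing below are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

import org.apache.avro.generic.GenericRecord;

/**
 * Hypothetical sketch: encryption applied inside the serialization path, so
 * message bytes are encrypted with a per-tenant key before reaching Kafka.
 */
public class EncryptingMessageEncoder implements AvroMessageEncoder<GenericRecord> {

    /** Illustrative hook for per-tenant key management (assumed). */
    public interface TenantKeyProvider {
        SecretKey keyForCurrentTenant();
    }

    private static final int GCM_IV_BYTES = 12;
    private static final int GCM_TAG_BITS = 128;

    private final AvroMessageEncoder<GenericRecord> delegate;
    private final TenantKeyProvider keys;
    private final SecureRandom random = new SecureRandom();

    public EncryptingMessageEncoder(AvroMessageEncoder<GenericRecord> delegate, TenantKeyProvider keys) {
        this.delegate = delegate;
        this.keys = keys;
    }

    @Override
    public byte[] encode(GenericRecord message) {
        try {
            byte[] plaintext = delegate.encode(message);
            byte[] iv = new byte[GCM_IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, keys.keyForCurrentTenant(), new GCMParameterSpec(GCM_TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(plaintext);
            // Frame: IV followed by ciphertext.
            return ByteBuffer.allocate(iv.length + ciphertext.length).put(iv).put(ciphertext).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    @Override
    public GenericRecord decode(byte[] payload) {
        try {
            ByteBuffer buffer = ByteBuffer.wrap(payload);
            byte[] iv = new byte[GCM_IV_BYTES];
            buffer.get(iv);
            byte[] ciphertext = new byte[buffer.remaining()];
            buffer.get(ciphertext);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, keys.keyForCurrentTenant(), new GCMParameterSpec(GCM_TAG_BITS, iv));
            return delegate.decode(cipher.doFinal(ciphertext));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}
```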

We deployed this MVP to Workday’s production environments and early adopters started building new product features based on it.

Phase 2. Supporting different use cases

As adoption of the MVP increased, the number of producers and consumers increased, as did the variety of message formats. Producers and consumers can be deployed at different times, which introduced the need to support backward and forward compatibility for the communication between them. Our initial approach of including the schemas in the deployment artifacts was no longer suitable as it depended on producers and consumers being deployed together. We went back to the drawing board and redesigned our platform to include a schema in each message to allow conversion between versions. This approach solved the problem quickly and with a minimum amount of work. It also allowed our users to gain experience with Apache Avro’s Schema Resolution and the conversion mechanisms. The architecture for this interaction looked as follows:

[Figure: encoder with schema in the message — flow]
[Figure: encoder with schema in the message]
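
A sketch of what this could look like in code, again building on the abstract-class sketch above; the class name and the length-prefixed framing are illustrative assumptions. The writer schema travels with every message and is resolved against the consumer’s local schema on read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

/**
 * Hypothetical sketch of the Phase 2 approach: the writer schema is embedded
 * in each message so the consumer can apply Avro schema resolution against
 * its own (possibly older or newer) local schema.
 */
public class SchemaInMessageEncoder extends AbstractAvroMessageEncoder {

    private final Schema localSchema;

    public SchemaInMessageEncoder(Schema localSchema) {
        this.localSchema = localSchema;
    }

    @Override
    protected Schema writerSchema() {
        return localSchema;
    }

    @Override
    protected Schema readerSchema() {
        return localSchema;
    }

    @Override
    public byte[] encode(GenericRecord message) {
        byte[] schemaJson = writerSchema().toString().getBytes(StandardCharsets.UTF_8);
        byte[] avroBytes = toAvroBytes(message, writerSchema());
        // Frame: 4-byte schema length, schema JSON, then the Avro payload.
        return ByteBuffer.allocate(4 + schemaJson.length + avroBytes.length)
                .putInt(schemaJson.length)
                .put(schemaJson)
                .put(avroBytes)
                .array();
    }

    @Override
    public GenericRecord decode(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        byte[] schemaJson = new byte[buffer.getInt()];
        buffer.get(schemaJson);
        byte[] avroBytes = new byte[buffer.remaining()];
        buffer.get(avroBytes);
        // Resolve the producer's schema against the local schema for compatibility.
        Schema producerSchema = new Schema.Parser().parse(new String(schemaJson, StandardCharsets.UTF_8));
        return fromAvroBytes(avroBytes, producerSchema, readerSchema());
    }
}
```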

Phase 3. Scale for complexity

As our users became more familiar with our platform, the complexity of the schema definitions increased. The size of the schema attached to each message became a drag on performance: in some cases, business objects were defined with more than 5k lines of schema structure that, once serialized, represented 90% of the message size.

The next logical iteration was to provide some form of schema handling, to offload the definition of the producer schema from the message itself. We needed a Schema Registry such as the Confluent Schema Registry. However, introducing a new service came with a cost that seemed to outweigh the value it offered. We would have to deal with the complexity of adapting the Schema Registry to satisfy Workday’s security requirements, and manage its operation in multiple data centers and sub-environments.

Again we decided to apply a startup mindset and deliver the simplest possible solution. To avoid the need for a new service, we implemented the schema storage/retrieval functionality by extending our encoding logic to use Kafka itself as a storage solution, relying on log compaction: https://kafka.apache.org/documentation/#compaction

In this new flow you can see how the schema used by the producer is stored in Kafka, keyed by its fingerprint, and the fingerprint is encoded along with the message. On the consumer side, the deserialization logic extracts the fingerprint and retrieves the full schema from Kafka. The schema is then used to apply backward/forward compatibility conversion so that the data matches the local schema used in the code.

[Figure: encoder with schema on Kafka — flow]
[Figure: encoder with schema on Kafka — implementation]
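
A sketch of this approach, with the Kafka-backed storage hidden behind an assumed SchemaStore abstraction (in practice a producer publishing to the compacted topic and a consumer materializing it into a local cache); the class and interface names are illustrative, not the actual Zonda code.

```java
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericRecord;

/**
 * Hypothetical sketch of the Phase 3 approach: only a 64-bit schema
 * fingerprint travels with each message; the full schema lives on a
 * compacted Kafka topic keyed by that fingerprint.
 */
public class SchemaOnKafkaEncoder extends AbstractAvroMessageEncoder {

    /** Illustrative abstraction over the compacted schema topic (assumed). */
    public interface SchemaStore {
        void publish(long fingerprint, Schema schema);
        Schema lookup(long fingerprint);
    }

    private final Schema localSchema;
    private final SchemaStore schemaStore;

    public SchemaOnKafkaEncoder(Schema localSchema, SchemaStore schemaStore) {
        this.localSchema = localSchema;
        this.schemaStore = schemaStore;
    }

    @Override
    protected Schema writerSchema() {
        return localSchema;
    }

    @Override
    protected Schema readerSchema() {
        return localSchema;
    }

    @Override
    public byte[] encode(GenericRecord message) {
        // Avro's canonical-form fingerprint identifies the writer schema.
        long fingerprint = SchemaNormalization.parsingFingerprint64(writerSchema());
        schemaStore.publish(fingerprint, writerSchema());
        byte[] avroBytes = toAvroBytes(message, writerSchema());
        // Frame: 8-byte fingerprint followed by the Avro payload.
        return ByteBuffer.allocate(8 + avroBytes.length)
                .putLong(fingerprint)
                .put(avroBytes)
                .array();
    }

    @Override
    public GenericRecord decode(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        Schema producerSchema = schemaStore.lookup(buffer.getLong());
        byte[] avroBytes = new byte[buffer.remaining()];
        buffer.get(avroBytes);
        // Resolve the producer's schema against the local one, as before.
        return fromAvroBytes(avroBytes, producerSchema, readerSchema());
    }
}
```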

Using this approach we were able to offer the minimum required functionality of a Schema Registry without delivering a new service. While this was the second change to how we managed schemas, the two previous iterations allowed us to get users onto the platform and start the feedback loop. We avoided the problems associated with premature optimization.

Future Phases. Delivering for our users

Now that we have established a community of users, decisions about future enhancements are informed by their use of the platform as well as our experience managing it. We anticipate adding a separate Schema Registry service at some point in the future. We now know far more about the functionality our users require, so we can plan out the next evolution of the project without having to make too many assumptions. We will share details about the design of this Schema Registry, and how it fits into the overall machine learning data management platform, soon.

Conclusion

Our current architecture contains several key differences from the initial design we started with:

  • A promotion pipeline validates the schemas for compliance with Workday’s Privacy, Ethics, and Compliance policies and makes them available to the consumers’ logic.
  • Schemas are included in the producers’ deployment artifacts instead of being retrieved from a dedicated Schema Registry.
  • Producers send their schemas to a Kafka topic used as storage; these schemas are then used to resolve backward and forward compatibility.

All these differences surfaced naturally during our iterative process. At each step we focused on the known requirements and worked on the minimum solution to satisfy them. The four principles we extracted from the Lean Startup Methodology gave us a strong base for decision making. Whenever we found ourselves in doubt about the next steps for the project we reviewed the possibilities against our four principles.

During this experience, we learned to focus on what really matters and how important it is to clearly define and agree on an iterative process. Without the focus provided by the Lean Methodology we might have developed features that our users didn’t need and created a bigger code base that would be less adaptable to changing requirements. The initial agreement on the process made it easier for our team to resolve uncertainties and to overcome challenges.
