How to Manage Schemas and Handle Standardization

Tomer Peleg
Riskified Tech

--

In my previous posts, I mainly talked about infrastructure, design, architecture, and everything in between. This time, I want to shift the point of view a bit and talk about application-level considerations, as well as some of the solutions we built to handle communication protocols and data standardization in an event-driven architecture.

The main concept of event-driven architecture is to decouple services by using events to trigger and communicate. An event is a change in state, or an update, like an order being submitted.

“Schemas are the APIs used by event-driven services, so a publisher and subscriber need to agree on exactly how a message is formatted. This creates a logical coupling between sender and receiver based on the schema they both share.” (O’Reilly)

Schemas may evolve over time (schema evolution), so to ensure data reliability and consistency, we use schemas as a contract, allowing downstream consumers to adapt to changes and process data seamlessly.

This allows us to achieve business value and KPIs by giving both engineering and non-engineering departments, such as Data Scientists, Analysts, and BI, the ability to build continuously evolving streams that won't break when schema changes are applied.

In terms of management and standardization, we need to consider a few factors. Where should we store these schemas? Which communication protocols should we support? Which serialization format (e.g., JSON, Avro, Protobuf) should we even use?

Schema Management

In a few words, Schema Registry is a distributed storage layer for schemas that uses Kafka as its underlying storage mechanism. It stores a versioned history of all schemas based on a specified subject, provides multiple compatibility settings and allows evolution of schemas (docs).
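As a rough illustration, here's a minimal sketch of registering and fetching a schema with Confluent's CachedSchemaRegistryClient (shown with the older Avro-based client API; exact signatures vary between client versions, and the URL, subject, and schema below are hypothetical):

  import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
  import org.apache.avro.Schema

  // Client backed by the registry's REST API; the second argument is the
  // size of the client-side schema cache.
  val registry = new CachedSchemaRegistryClient("https://schema-registry.internal:8081", 100)

  // Register a new version under a subject; the registry returns a global schema id.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"OrderCreated","fields":[{"name":"order_id","type":"string"}]}"""
  )
  val schemaId: Int = registry.register("orders.order_created-value", schema)

  // Fetch the latest registered version of the subject (id, version, and schema string).
  val latest = registry.getLatestSchemaMetadata("orders.order_created-value")
  println(s"id=${latest.getId}, version=${latest.getVersion}")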

Confluent Schema Registry for storing and retrieving schemas

In our case, Riskified chose Avro as its data serialization format and manages all schemas in a centralized GitHub repository, organized by domain. We define the schemas in Avro IDL format (AVDL) for readability and for the ability to import and reuse schemas across domains, such as shared types, enums, metadata, and other common records.

The repository includes a well-defined CI/CD pipeline that validates schema changes for full compatibility (both backward and forward), converts the .idl files into full .avsc records, and registers the new schema version in the Schema Registry.
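The registry exposes its own compatibility-check endpoint, but as a local approximation of what "full compatibility" means, here's a sketch using Avro's SchemaCompatibility utility (the helper name is mine, not part of the pipeline):

  import org.apache.avro.{Schema, SchemaCompatibility}
  import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType

  // Full compatibility = the new schema can read data written with the old one
  // (backward) AND the old schema can read data written with the new one (forward).
  def isFullyCompatible(oldSchema: Schema, newSchema: Schema): Boolean = {
    val backward = SchemaCompatibility
      .checkReaderWriterCompatibility(newSchema, oldSchema) // new reads old
      .getType == SchemaCompatibilityType.COMPATIBLE
    val forward = SchemaCompatibility
      .checkReaderWriterCompatibility(oldSchema, newSchema) // old reads new
      .getType == SchemaCompatibilityType.COMPATIBLE
    backward && forward
  }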

In addition, it can detect cross-domain changes and register a new parent schema version when a "leaf" record is changed.

AVDL to AVSC conversion
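For illustration, the conversion step can be sketched with Avro's IDL parser from the avro-compiler artifact (a minimal sketch; file handling in the real pipeline may differ):

  import java.io.File
  import java.nio.charset.StandardCharsets
  import java.nio.file.{Files, Paths}
  import org.apache.avro.compiler.idl.Idl
  import scala.jdk.CollectionConverters._

  // Parse an .avdl file and emit one pretty-printed .avsc file per record it defines.
  def avdlToAvsc(avdlFile: File, outputDir: File): Unit = {
    val idl = new Idl(avdlFile)
    try {
      val protocol = idl.CompilationUnit()
      protocol.getTypes.asScala.foreach { schema =>
        val out = Paths.get(outputDir.getPath, s"${schema.getFullName}.avsc")
        Files.write(out, schema.toString(true).getBytes(StandardCharsets.UTF_8))
      }
    } finally idl.close()
  }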

This mechanism works great, but there are some management issues to consider. Keep in mind that each environment (e.g., production, staging) is isolated for security reasons, so each one needs its own Schema Registry. We need to register each schema version in all registries and validate that they are fully synced, so that services can work with one common schema version across all environments (the schema id won't necessarily be the same).

Therefore, alongside the compatibility checks, we added another check that blocks merging PRs with schema changes if a drift between registries is detected.
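A drift check along these lines can be sketched by comparing the latest registered schema for a subject across the per-environment registries (hypothetical URLs; the helper is an assumption, not the actual implementation):

  import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
  import org.apache.avro.{Schema, SchemaNormalization}

  // Returns true if the latest version of the subject resolves to the same
  // canonical schema in every environment's registry.
  def isInSync(subject: String, registryUrls: Seq[String]): Boolean = {
    val canonicalForms = registryUrls.map { url =>
      val client = new CachedSchemaRegistryClient(url, 100)
      val latest = client.getLatestSchemaMetadata(subject).getSchema
      SchemaNormalization.toParsingForm(new Schema.Parser().parse(latest))
    }
    canonicalForms.distinct.size == 1
  }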

As an alternative, depending on your use case, you can consider solutions that include some kind of sync mechanism, replicating the schema storage across all environments from a single source of truth. Of course, all solutions require continuously backing up the data to separate storage (e.g., S3), so the schema state can be restored from the backup on demand.

Schema Registry CI/CD Pipeline

Data Standardization

One of the toughest challenges in development is achieving standardization.

In a fast-growing organization, as the scale grows so does the number of products, teams, services, libraries, and of course versions. It's nearly impossible to track them all and ensure they all follow best practices, formats, and the defined standards.

Therefore, to increase development velocity and reduce the chance of client misuse, we deliver plug-n-play client libraries that integrate with the data platforms, letting development teams focus on the business while we take care of the rest.

Avro Codec

Since Avro is our main serialization format, we have a wide range of utilities to handle Avro serialization.

Focusing on Scala — as our main coding language — we created an sbt plugin to handle Schema Registry integration. An sbt plugin can define a sequence of sbt settings that are automatically added to all projects or explicitly declared for selected projects.

Schema Registry Schemas Configuration in build.sbt

The sbt-schema-registry plugin enables fetching schemas from the Schema Registry and generating code from them. At compile time, it fetches the schemas configured in build.sbt (by subject and version), generates the equivalent case classes using sbt-avrohugger, and derives implicit Avro encoders/decoders using avro4s.
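Since the plugin is internal, its exact setting keys aren't public; a build.sbt configuration along these lines (purely hypothetical keys and subjects) gives the general idea:

  // build.sbt -- hypothetical sketch; the real keys of the internal
  // sbt-schema-registry plugin may be named differently
  enablePlugins(SchemaRegistryPlugin)

  schemaRegistryUrl := "https://schema-registry.internal:8081"

  // Subjects (and pinned versions) to fetch and generate case classes for
  schemaRegistrySubjects := Seq(
    ("orders.order_created-value", 3),
    ("payments.charge_submitted-value", 1)
  )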

Scala Auto Generated Case Class from Avro Schema
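To give a feel for the output, the generated code for a hypothetical OrderCreated schema might look roughly like this, with avro4s deriving the Avro schema and a GenericRecord codec from the case class:

  import com.sksamuel.avro4s.{AvroSchema, RecordFormat}
  import org.apache.avro.Schema

  // Hypothetical example; real record names and fields depend on the Avro definition
  case class OrderCreated(orderId: String, amountCents: Long, currency: String)

  object OrderCreated {
    // Schema and GenericRecord codec derived by avro4s from the case class
    val avroSchema: Schema = AvroSchema[OrderCreated]
    val format: RecordFormat[OrderCreated] = RecordFormat[OrderCreated]
  }

  // Usage: format.to(order) yields a GenericRecord; format.from(record) yields the case class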

Alongside an implicit SchemaRegistryClient, it then fetches the schema id at runtime and encodes/decodes the message to/from a byte array stream using the generated Avro encoder/decoder.

You should also consider the wire format, in which the first byte is Confluent's magic byte, the next four bytes are the schema ID as returned by Schema Registry, and the remaining bytes are the data serialized in the specified format.
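Here's a minimal sketch of producing that wire format for a GenericRecord, assuming the schema id was already fetched from the registry:

  import java.io.ByteArrayOutputStream
  import java.nio.ByteBuffer
  import org.apache.avro.Schema
  import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
  import org.apache.avro.io.EncoderFactory

  // Confluent wire format: [magic byte 0x0][4-byte schema id, big-endian][Avro binary payload]
  def toWireFormat(schemaId: Int, record: GenericRecord, schema: Schema): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    out.write(0)                                                // magic byte
    out.write(ByteBuffer.allocate(4).putInt(schemaId).array())  // schema id
    val writer  = new GenericDatumWriter[GenericRecord](schema)
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    writer.write(record, encoder)                               // Avro binary payload
    encoder.flush()
    out.toByteArray
  }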

Of course, this mechanism can be adopted and implemented in other languages as well for code generation, such as interfaces in JavaScript, Structs in Ruby, etc.

Generic Records

Another aspect of standardization is the ability to provide support and alignment to a wide range of coding languages and frameworks used by the development teams.

To address this, we created KaaS (kafka-as-a-service), a sidecar for K8s services written in Scala. It allows services written in languages for which we don't maintain Kafka libraries to get plug-n-play Kafka integration, along with other out-of-the-box features like a fallback mechanism, metrics, event metadata, serialization, etc.

Now, the real challenge of KaaS wasn't the service itself, but the fact that it acts as a standalone unit. It needs to be adaptable and provide dynamic serialization support, since it doesn't have the pre-compiled case classes generated via the above-mentioned plugin (sbt-schema-registry).

To solve this, we used Avro GenericRecord to build the Avro records at runtime, using a GenericDatumReader and JsonDecoder to convert a JSON object into a generic datum based on the fetched schema (by subject and version).

Note that this alone won’t work if you plan to use decimals: it's a logical type that arrived later on, isn't supported out of the box by most Avro libraries, and requires an override to add the missing functionality.

Json to Avro GenericRecord Decoder
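A minimal sketch of such a decoder, given a schema already fetched from the registry by subject and version (the decimal-conversion line is one possible way to handle the caveat above, not necessarily how KaaS does it):

  import org.apache.avro.{Conversions, Schema}
  import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericRecord}
  import org.apache.avro.io.DecoderFactory

  // Decimals are a logical type; registering the conversion is one way to
  // add the missing support (an assumption, not necessarily what KaaS uses).
  GenericData.get().addLogicalTypeConversion(new Conversions.DecimalConversion())

  // Convert a JSON payload into a GenericRecord based on the fetched schema.
  def jsonToGenericRecord(json: String, schema: Schema): GenericRecord = {
    val reader  = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().jsonDecoder(schema, json)
    reader.read(null, decoder)
  }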

Wrapping up

In an environment where standards are constantly changing, and independent development teams are creating services in a variety of languages and frameworks, complexity increases rapidly. The complexity isn’t necessarily in the services, but in the communication protocols and data standardization.

With this in mind, using schemas allows us to decouple services owned by different domains/teams and evolve our data structure over time, without the need to coordinate the change, while preserving backward and forward compatibility.

This brings balance to the system with the ability to deliver scalable and adaptable solutions, and allows an increase in development velocity and efficiency.
