Our Secret to Enforcing Schema: A Customer-Loving Synergy

Prashant Piyush
Hevo Data Engineering
5 min read · Nov 10, 2021

Data is everywhere, and it powers everything we do! So naturally, a well-thought-out Database is an inescapable requirement for any modern Business Management System. It ensures the seamless availability of consistent data and enables you to perform Systematic Data Analysis. But to do so, you need a holistic understanding of how your data is truly structured. One way to gain this knowledge is to have an Entity Relationship Diagram (ERD) available at your fingertips.

An ERD provides an easy-to-understand visual perspective of the logical arrangement of the information in your Database. It illustrates the relationships among entity sets and their attributes, making it exceptionally illuminating for analysis.

But at this point, you might ask why and how Hevo uses ERDs, and what’s in it for us?

Here’s what makes Schema so valuable to Hevo

To start, having an ERD-based Schema in place allows our Data Pipelines to efficiently replicate data to a destination, typically a Data Warehouse. Let’s consider an example to gain more context here.

With data available at an unimaginable scale, each user leverages unique datasets in their desired SaaS applications. Even for two users with the same SaaS source, the data ingested by our Data Pipelines and their corresponding Schemas can be miles apart. This can lead to challenging situations where a missing field value results in no data getting produced at the destination.

To overcome such ambiguities, we at Hevo enforce a fixed Schema for each SaaS source we connect with, using pre-defined, comprehensive ERDs. This helps us ensure that the Destination Schema exactly matches the ERD we describe in our documentation.

Enforcing an ERD-based Schema further enables us to tackle unexpected data errors effortlessly.

For example, in a popular SaaS source, Google Ads, we’ve often encountered data type variations. At times, fields that usually hold integer values end up with nulls at the Source, and the API returns placeholders such as “—” in their place. Such unexpected data changes directly impact the overall Data Quality and Accuracy.

Enforcing Schemas allowed us to incorporate an intuitive solution for this, compatible with all SaaS sources. Here’s how we did this!

Instead of converting these integers into strings (at the destination), we decided to filter out these unusual data values at the Source itself. To do so, we created robust ERDs and corresponding Schemas for each SaaS source. This allowed us to identify such outliers rapidly.
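As a rough illustration, here is what filtering such placeholder values at ingestion could look like. This is a minimal sketch, not Hevo’s actual code; the PlaceholderFilter class and the set of placeholder strings are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class PlaceholderFilter {

    // Placeholder strings some APIs return in place of a missing numeric value (illustrative)
    private static final Set<String> PLACEHOLDERS = Set.of("--", "—", "N/A");

    /** Returns a copy of the event with any known placeholder values dropped. */
    public static Map<String, Object> filter(Map<String, Object> event) {
        Map<String, Object> cleaned = new HashMap<>();
        event.forEach((field, value) -> {
            // Keep the field only if its value is not a placeholder string
            if (!(value instanceof String && PLACEHOLDERS.contains(value))) {
                cleaned.put(field, value);
            }
        });
        return cleaned;
    }
}
```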

How we incorporated ERD-based Schemas in Hevo: Implementation

Once a user creates a new Data Pipeline in Hevo, the ERD acts as a Blueprint and helps generate the Source Schema for each Event type. These Source Schemas, available on the Schema Mapper page, are mapped to the desired Destination’s Schema. This defines how the data will eventually be stored in the destination.
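As a simplified illustration of what an ERD-backed Source Schema could look like for a single Event type, consider the sketch below. The “campaign” event type, its fields, and the ErdSchemas/FieldType names are assumptions for this example, not Hevo’s actual schema model.

```java
import java.util.Map;

public class ErdSchemas {

    /** Data types a field can take, as defined in the ERD (illustrative set). */
    public enum FieldType { INTEGER, LONG, DOUBLE, BOOLEAN, STRING, TIMESTAMP }

    // Source Schema for a hypothetical "campaign" Event type, defined by the ERD
    // rather than inferred from any particular user's incoming data.
    public static final Map<String, FieldType> CAMPAIGN_SCHEMA = Map.of(
            "id", FieldType.LONG,
            "name", FieldType.STRING,
            "budget_micros", FieldType.LONG,
            "clicks", FieldType.INTEGER,
            "is_active", FieldType.BOOLEAN,
            "created_at", FieldType.TIMESTAMP
    );
}
```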

Without an ERD-based Schema in place, the system would typically derive the Source Schema from the incoming event data, and the Schema thus obtained would then be mapped to the destination table. However, to ensure that Schemas remain generic and do not depend on user-specific data points, we generate the Source Schema independently of the incoming event properties.

Schema Mapping flow without ERDs in place

The processing of incoming events begins once the Data Mapping is complete. Here, the data type of each field is matched against the expected data type, as defined by the ERD store. In the event of a type mismatch, we try to “typecast” the incoming data value to the expected data type. However, if there’s no direct way to typecast/convert the value, we deem the incoming Schema “Incompatible”. In such a case, our Data Pipelines automatically sideline the erroneous events to be checked later.
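To make the flow concrete, here is a hedged sketch of such a type-matching step: each field is checked against its ERD-defined type, typecast where possible, and the whole event is sidelined when a value cannot be converted. The TypeMatcher and MatchResult names are illustrative, and the sketch reuses the hypothetical ErdSchemas.FieldType enum from the earlier example.

```java
import java.util.HashMap;
import java.util.Map;

public class TypeMatcher {

    /** Result of matching one event against its Source Schema. */
    public record MatchResult(Map<String, Object> event, boolean sidelined) {}

    /** Checks every field against the ERD-defined type and sidelines the event on failure. */
    public static MatchResult match(Map<String, Object> event,
                                    Map<String, ErdSchemas.FieldType> schema) {
        Map<String, Object> typed = new HashMap<>();
        for (Map.Entry<String, Object> field : event.entrySet()) {
            Object cast = typecast(field.getValue(), schema.get(field.getKey()));
            if (cast == null && field.getValue() != null) {
                return new MatchResult(event, true);   // no direct typecast: sideline the event
            }
            typed.put(field.getKey(), cast);
        }
        return new MatchResult(typed, false);
    }

    /** Best-effort conversion to the expected type; null means "no direct typecast exists". */
    public static Object typecast(Object value, ErdSchemas.FieldType expected) {
        if (value == null || expected == null) return value;
        try {
            return switch (expected) {
                case INTEGER         -> Integer.valueOf(value.toString());
                case LONG, TIMESTAMP -> Long.valueOf(value.toString());
                case DOUBLE          -> Double.valueOf(value.toString());
                case BOOLEAN         -> Boolean.valueOf(value.toString());
                case STRING          -> value.toString();
            };
        } catch (NumberFormatException e) {
            return null;   // incompatible value, e.g. "--" where an integer is expected
        }
    }
}
```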

Developers at Hevo then take a look at the “Sidelined Events” to decide on further action: either Drop the Event or Upgrade the Schema.

In addition to managing the events with unexpected data types, we also look into another use case: What if the user wants to transform the event and modify the Schema?

We handle this separately: we bypass the “Data Type Matching” phase for the event properties that were modified during transformations. This eliminates unnecessary validations and streamlines the process.
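Building on the earlier sketch, the bypass could look roughly like this; the transformedProps set and the TransformAwareMatcher class are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;

public class TransformAwareMatcher {

    /** Matches a single field, skipping the type check for properties changed by a transformation. */
    public static Object matchField(String name, Object value,
                                    Map<String, ErdSchemas.FieldType> schema,
                                    Set<String> transformedProps) {
        if (transformedProps.contains(name)) {
            return value;   // bypass "Data Type Matching" for transformed properties
        }
        return TypeMatcher.typecast(value, schema.get(name));
    }
}
```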

Data flow through Pipelines with ERDs

Challenges we faced while designing and enforcing Schema

One of the biggest challenges we faced was designing a one-size-fits-all Schema solution. With each customer’s data varying significantly, coming up with an ERD design that could handle all such data types was a tough nut to crack.

To address this, Product Managers at Hevo examined users’ workflows and use cases, gathered feedback, and analyzed it extensively. This allowed us to choose the objects we’d retrieve from the Source. However, doing this for every user is what makes it such an intense task.

The second difficulty we considered carefully was dealing with incompatible data. Here, we were presented with two options:

1) Upgrade Destination Schema to Handle Values

  • Optimizing the Destination Schema meant upgrading the table structure so that it could accept the incompatible values. However, this approach had a downside: modifying the Schema required users to alter their queries while carrying out analysis, resulting in a poor user experience.

2) Prevent Undesired Data from Reaching the Destination

  • Weeding out the undesired data meant that the Product Team must first decide whether the data is an outlier or not.
  • It also required the team to verify the opposite perspective: that the unexpected value was in fact valuable, quality data, indicating an error in our ERDs’ design.

Future Possibilities/What Lies Ahead

Kafka forms an essential component of Hevo’s Architecture. Pushing data through Kafka mandates Data Serialization, which involves converting objects into byte streams for seamless data transfer. When pushed through Kafka, data is accompanied by Metadata that gives insights into each value and its data type. This is because the Metadata contains the Schema information, along with other values.
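To make this concrete, here is a hedged illustration, not Hevo’s actual code, of a message whose serialized value travels together with Schema Metadata. The MessageEnvelope type and its fields are assumptions made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class EnvelopeExample {

    /** The serialized value plus the Schema Metadata that travels with it. */
    public record MessageEnvelope(Map<String, String> schema, byte[] payload) {}

    /** Converts the event into a byte stream and attaches the Schema as Metadata. */
    public static MessageEnvelope serialize(String eventJson, Map<String, String> schema) {
        return new MessageEnvelope(schema, eventJson.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        MessageEnvelope msg = serialize(
                "{\"id\": 42, \"clicks\": 17}",
                Map.of("id", "LONG", "clicks", "INTEGER"));
        System.out.println(msg.schema());   // the Metadata sent alongside every message today
    }
}
```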

As a next step to boost the efficiency of Enforcing Schemas, we aim to build a centralized place to store the Schemas. This would allow us to avoid sending Schemas to Kafka and enhance our throughput. As a result, Hevo’s Data Pipelines will be able to process more events and scale with ease.
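One common way to achieve this, shown here only as a minimal sketch and not as Hevo’s implementation, is to prefix each Kafka message with a compact schema ID and let consumers resolve the full Schema from the central store.

```java
import java.nio.ByteBuffer;

public class SchemaIdSerializer {

    /** Prefixes the serialized payload with a 4-byte schema ID instead of the full Schema. */
    public static byte[] serialize(int schemaId, byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(schemaId)
                .put(payload)
                .array();
    }

    /** Consumer side: read the ID, then fetch the full Schema once from the central store. */
    public static int readSchemaId(byte[] message) {
        return ByteBuffer.wrap(message).getInt();
    }
}
```

With only a 4-byte ID travelling with each message, the per-event overhead stays constant no matter how large the Schema grows.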

We appreciate you reading the post till the end. If you have any questions or suggestions/comments for us, please write to us at dev@hevodata.com. Building a great product takes hard work and determination. With each of us at Hevo sharing the same passion, everyone on the team goes the extra mile. If you’d like to be a part of our journey and work on some of these unique challenges, please do check Hevo’s careers page.

Thanks to Trilok Jain, Talha Khan, Khushboo Bhuwalka, Divyansh Sharma, and Divij Chawla.
