Where’s that event?

Trilok Jain
Hevo Data Engineering
5 min read · Dec 11, 2020

Every week, Hevo loads around 20 billion events into different warehouses belonging to customers from around the world. Many more events are read from a variety of their data sources; some of these are deliberately discarded, while a few others may be parked in a safe store until a decision on their future is taken.

(Image credits: xkcd)

Similar to how an e-commerce platform like Amazon is able to track the journey of each of its consignments, Hevo is capable of tracing the journey of each identifiable event that flows through its platform.

This helps ascertain whether an event reached its intended destination. When it did not, the recorded information helps engineers identify the cause of the disruption (a bug, for instance) in the journey.

This component (known as the Event Tracker) consists of two parts:

  • Data Recorder: This identifies the most important phases in the journey of an event and records them.
  • Query Engine: This makes the recorded data available for querying and provides a simple interface to trace events.

Data Recorder

As events pass through different processing phases (e.g. Ingestion, Transformation, Mapping), an event trace is generated and recorded via the File Manager. Apart from the significant processing phases, traces are also generated when events move out of the active processing workflow to be parked in storage, or when they are brought back into the active workflow from storage.

Each trace consists of the event identifier, some metadata, and location and temporal data. None of the actual event data is stored as part of the trace.
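For illustration, a trace record along these lines might look like the following minimal sketch (the field names and types here are assumptions; the actual schema is internal to Hevo):

```java
import java.time.Instant;

// Hypothetical shape of a single event trace record. Only identifiers,
// phase, and location/time metadata are stored, never the event payload.
public record EventTrace(
        String eventId,       // identifier of the event being traced
        long teamId,          // customer/team the event belongs to
        long integrationId,   // pipeline the event flows through
        String phase,         // e.g. INGESTED, TRANSFORMED, MAPPED, PARKED
        String location,      // where the event currently resides
        Instant tracedAt) {   // when this phase was recorded
}
```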

The File Manager library enables the collection of traces.

File Manager

The File Manager library is one of the core libraries used at Hevo for various data accumulation purposes. It has built-in mechanisms for fault tolerance, retries, and feedback-based rate thresholds. Event trace records are written in batches to local files that are tracked via a MySQL table; these files are later transferred to S3. The table tracks the state of each file from the moment it is created locally until it is completely processed. In the case of the Event Tracker, a file is considered “processed” as soon as it is uploaded to S3. One of the crucial roles of the MySQL table is to provide tracking information for any file whose upload may have been missed under abnormal circumstances; debuggability is another. A sample event trace file key:

team_id=755/integration_id=2787/source_object_id=129216/schema=token_requests/time_group=1606564800/pc_11606600385042-171-405.csv.gz
(Figure: File Manager library, in action)
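A minimal sketch of the bookkeeping involved, with an in-memory map standing in for the MySQL table and assumed state names, could look like this:

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the File Manager's file-state bookkeeping. The real system
// tracks state in MySQL (and later RocksDB); the map here is a stand-in.
class FileTracker {
    enum State { CREATED, CLOSED, UPLOADED }  // UPLOADED == "processed" here

    private final Map<Path, State> table = new ConcurrentHashMap<>();

    void created(Path file)  { table.put(file, State.CREATED); }
    void closed(Path file)   { table.put(file, State.CLOSED); }
    void uploaded(Path file) { table.put(file, State.UPLOADED); }

    // A file that never reached UPLOADED surfaces a missed upload,
    // which is exactly the debuggability role described above.
    boolean isStuck(Path file) { return table.get(file) != State.UPLOADED; }
}
```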

With about 100K (and growing) files tracked every hour, tracking them via a remote MySQL table was becoming more expensive by the day. For example, ~2.5% of the total query volume on the database was being spent on the bookkeeping of these files. In scenarios with a high fan-out of events, database connections were hogged by the file tracker, occasionally leading to cascading failures.

This is when an alternate version of the File Manager was created that does all of the bookkeeping locally on the application node. A couple of lightweight local data stores were considered, out of which RocksDB emerged as the clear winner.

RocksDB is a high-performance, persistent key-value store for fast storage environments.

It supports both of the operations required to track the files locally:

  • Accessing and updating the metadata of a tracked file
  • Performing range searches on files partitioned by creation time (indirectly, via ordered key prefixes; see the sketch below)

With RocksDB serving as the storage layer for the File Manager, the bottlenecks associated with the MySQL table have been removed.
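As a rough illustration of how both operations map onto RocksDB's Java API, the sketch below keys each file by a creation-time bucket so that a sorted prefix scan emulates the range search (the key layout and class names are assumptions, not Hevo's actual schema):

```java
import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

// Sketch: file metadata tracked locally in RocksDB. Keys look like
// "<timeBucket>/<fileName>", so files sharing a creation-time bucket
// are stored contiguously in RocksDB's sorted key space.
public class LocalFileTracker implements AutoCloseable {
    static { RocksDB.loadLibrary(); }

    private final RocksDB db;

    LocalFileTracker(String path) throws RocksDBException {
        db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    }

    // Operation 1: access and update the metadata of a tracked file.
    void put(long timeBucket, String file, String metadata) throws RocksDBException {
        db.put(key(timeBucket, file), metadata.getBytes(StandardCharsets.UTF_8));
    }

    String get(long timeBucket, String file) throws RocksDBException {
        byte[] value = db.get(key(timeBucket, file));
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }

    // Operation 2: an (indirect) range search. Seeking to the bucket's
    // prefix and iterating while keys still carry it visits exactly the
    // files created in that time bucket.
    void scanBucket(long timeBucket) {
        byte[] prefix = (timeBucket + "/").getBytes(StandardCharsets.UTF_8);
        try (RocksIterator it = db.newIterator()) {
            for (it.seek(prefix); it.isValid() && hasPrefix(it.key(), prefix); it.next()) {
                System.out.println(new String(it.key(), StandardCharsets.UTF_8));
            }
        }
    }

    private static byte[] key(long timeBucket, String file) {
        return (timeBucket + "/" + file).getBytes(StandardCharsets.UTF_8);
    }

    private static boolean hasPrefix(byte[] key, byte[] prefix) {
        if (key.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (key[i] != prefix[i]) return false;
        }
        return true;
    }

    @Override public void close() { db.close(); }
}
```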

(Figure: The extended version of the File Manager that uses RocksDB and is locally self-sufficient)

Query Engine

With the event tracking data available in S3 and an expectation of low query volume, AWS Athena was chosen as the backbone of the query engine. The precondition for querying an Athena table is making the table aware of its available partitions. This can be done in two ways:

  • By adding/removing partitions explicitly: This gives greater control over the data types of the partition columns and lets the table be queried with an arbitrary combination of columns. The downside is the overhead of calculating, adding, and removing (soon-to-be-defunct) partitions.
  • By defining dynamic partition projections: This completely removes the overhead of partition management, while imposing severe restrictions on the variety of queries. These restrictions can be circumvented by defining multiple tables (on the same S3 data) tailored to the desired queries.

The initial few versions of Event Tracking used the first approach; it has since been replaced by the second, which fits our use cases well and has eliminated the need to track partitions explicitly.
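For illustration, a projected table over the sample file key shown earlier might be declared roughly as follows (the data columns, the time range, and the bucket name are assumptions; the projection properties themselves are standard Athena features). The 'injected' projection type obliges every query to pin those columns to explicit values in the WHERE clause, which is the kind of restriction referred to above:

```java
// Sketch of an Athena DDL statement with dynamic partition projection,
// matching the Hive-style layout of the sample key (team_id=.../...).
// Data columns, the time_group range, and the S3 location are assumptions.
final String CREATE_TRACES_TABLE = """
    CREATE EXTERNAL TABLE IF NOT EXISTS event_traces (
      event_id  string,
      phase     string,
      traced_at bigint
    )
    PARTITIONED BY (
      team_id string, integration_id string,
      source_object_id string, `schema` string, time_group bigint
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://example-bucket/event-traces/'
    TBLPROPERTIES (
      'projection.enabled' = 'true',
      'projection.team_id.type' = 'injected',
      'projection.integration_id.type' = 'injected',
      'projection.source_object_id.type' = 'injected',
      'projection.schema.type' = 'injected',
      'projection.time_group.type' = 'integer',
      'projection.time_group.range' = '1577836800,1893456000',
      'projection.time_group.interval' = '3600'
    )
    """;
```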

Revisiting: Where’s that event?

When a customer wants to know the whereabouts of a (potentially) missing set of events, our product support specialists request a couple of samples from them: for example, the primary-key values of the rows of a MySQL table whose data failed to reach the destination in time. The product support specialists, in turn, submit their event trace queries to the Event Tracking system via a simple web interface. Each query is translated into an Athena query and submitted via Handyman. The queries run asynchronously and their progress is continuously monitored by Handyman. Once the results are ready, they are delivered back to the web interface.
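Handyman is an internal Hevo component, so its API is not shown here; stripped down to the AWS SDK for Java (v1), the asynchronous submit-and-poll pattern looks roughly like this (database name and output location are assumptions):

```java
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.*;

// Sketch of the submit-and-poll pattern used for event trace queries.
public class TraceQueryRunner {
    private final AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();

    // Submit the translated trace query; Athena executes it asynchronously.
    String submit(String sql) {
        StartQueryExecutionRequest request = new StartQueryExecutionRequest()
                .withQueryString(sql)
                .withQueryExecutionContext(
                        new QueryExecutionContext().withDatabase("event_tracker"))
                .withResultConfiguration(new ResultConfiguration()
                        .withOutputLocation("s3://example-bucket/athena-results/"));
        return athena.startQueryExecution(request).getQueryExecutionId();
    }

    // Poll until the query reaches a terminal state; in the real system,
    // Handyman performs this monitoring continuously.
    QueryExecutionState waitFor(String queryId) throws InterruptedException {
        while (true) {
            GetQueryExecutionResult result = athena.getQueryExecution(
                    new GetQueryExecutionRequest().withQueryExecutionId(queryId));
            QueryExecutionState state = QueryExecutionState.fromValue(
                    result.getQueryExecution().getStatus().getState());
            switch (state) {
                case SUCCEEDED, FAILED, CANCELLED -> { return state; }
                default -> Thread.sleep(1_000);  // QUEUED or RUNNING
            }
        }
    }
}
```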

(Figure: A trail for an event with d5601e0b-f0eb-459c-bb4d-8b6170a6fcfc as the identification key)

What’s next?

The Event Tracker has helped us trace event flow paths and reduce the search space in some lossy scenarios, and it will continue to do so. It is, and will remain, a (reactive) investigative tool.

We are now working on a set of proactive assessment tools that will be able to detect issues in the movement of events eagerly.

Thank you for reading the post till the end. Please write to us at dev@hevodata.com with your comments and suggestions. If you'd like to work on some of these problems, do check out the careers page at Hevo.
