When To Share Data Through Data

Liza Wainblat
AppsFlyer Engineering
Jul 4, 2022 · 7 min read

A faulty data sharing architecture results in non-generic, high-maintenance and fragile interfaces that hurt customer data accuracy and cause delays and availability issues.
Even when performing the most complex data transformations, whether in a data pipeline, a data mesh or any other data processing setup, there are at least two sensitive interfaces in the system whose importance is often neglected: consuming the input data and sharing the result.
While you’re not always in charge of how the consumed data is provided, how you produce data is entirely up to you.

AppsFlyer is a highly data-driven company: we receive and send over 100 billion events every day and handle petabytes of data. The iOS Measurement group at AppsFlyer runs a pipeline that touches all of the above points: consuming data in multiple formats from multiple sources, merging and processing it, and producing the results for both internal and external consumers.

In this blog, I'd like to introduce the evolution of data interfaces we went through in the iOS Measurement group at AppsFlyer while creating new interfaces, and how it has simplified our data sharing, making the data we provide stable, reliable and secure, not only within the group but for all of our data consumers.

Why NOT use APIs

REST, SOAP, gRPC: they are all familiar, established architectural solutions today, used both for sharing real-time/streaming data internally between system components (especially popular in microservices architectures) and for exposing data to external consumers.

Multiple articles have been written about the advantages and correct implementation of APIs, and I will not recite them here. However, many of them also point out that, while a solid solution in most cases, an API remains a high-maintenance architectural element.
When maintaining an API interface, you always have to be prepared to scale your solution to whatever load is thrown at you, something completely out of your control. You also have to maintain stability in the face of malicious or abusive usage that consumers might come up with, which can occur even with the best-defined schema and enforcement.
One of the consequences of APIs being so easy and available is the tendency to use them even when the architecture does not require them. I would like to share and discuss an example of such a case.

Data Constraints

As a data producer, I would like to provide the results of the processing as soon as possible. But, and it’s a considerable “but”, not all data processing requires real-time availability.
Most complex data transformations are done in batches, with freshness measured in hours, so we can work offline, and an always-on, near-zero-downtime availability SLA is not required for customers. Let's also not forget the large amounts of data being processed in today's systems, on top of which we would rather not perform analytics at query time because of the continuous operational struggle it creates.

For real-time analytics, an API is a great solution with limitations that can be dealt with. But what if you don't have to deal with those limitations at all? What if, instead of working around complex limitations, we can relax the constraints?

Alternative Data Sharing

In our implementation, the iOS Measurement group is a producer of iOS analytics data for multiple consumers. This is exactly the case described above: offline batch processing of generic data relevant to multiple consumers.

We chose to share the data through a data lake repository. Let’s take a look at the main goals of this architecture:

  • Breaking the coupling between iOS measurement data production and other products' consumption. The data structure is set in a schema, making it stable, well-defined and backward compatible
  • Easy access/integration: as a data producer, we would like to create as generic and rich a data source as possible, while preserving granularity.
    On the one hand, any new consumer can access the data repository without impacting existing consumers, performance or existing data. On the other hand, a new consumer can consume all of the data or only parts of it, without any impact on the producer and without requiring any preparation or pre-processing
  • A generic solution going forward: the data output unifies all known consumer needs and can be extended if needed. The preparation and pre-processing of data is derived from the result required by the consumers. As long as this agreement between producer and consumer is preserved, the whole product remains stable
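
To make the producer side of these goals concrete, here is a minimal sketch of what publishing a batch to the lake can look like. The post does not prescribe a specific stack, so the example assumes Spark writing Parquet to an S3-backed path; the schema, column names and path are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, LongType)

# Illustrative schema: the real columns are whatever the producer publishes as its contract.
EVENTS_SCHEMA = StructType([
    StructField("app_id", StringType(), False),
    StructField("event_name", StringType(), False),
    StructField("event_time", TimestampType(), False),
    StructField("event_count", LongType(), False),
    StructField("source", StringType(), False),  # original data source
    StructField("dt", StringType(), False),      # processing-date partition
])

spark = SparkSession.builder.appName("ios-measurement-producer").getOrCreate()


def publish_batch(df, lake_path="s3://example-data-lake/ios_measurement/events"):
    """Write one processed batch to the shared data-lake location.

    Consumers read from lake_path directly; the producer never needs to know
    who they are or how they query the data.
    """
    # Fail fast if the batch has drifted from the agreed schema.
    if df.schema != EVENTS_SCHEMA:
        raise ValueError("batch does not match the published schema")

    (df.write
       .mode("append")
       .partitionBy("dt", "source")  # partitioning preserves per-origin granularity
       .parquet(lake_path))
```

Because the contract lives in the schema and the directory layout, onboarding a new consumer is just a matter of pointing them at the path and granting read access.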

There are no perfect solutions, and this one, just like any other, has its downsides. At the very least, the investment required from consumers is considerably bigger than when working with APIs: when consuming from a shared data source, consumers have to ingest and index the data in the database of their choice and build their own access layer.
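
To illustrate that investment, here is what the consumer side can look like under the same assumed Spark/Parquet layout: read only the partitions and columns you need, then ingest them into a store of your choice. The source name, path and JDBC target below are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ios-measurement-consumer").getOrCreate()

# Read only the slice of the shared dataset this consumer cares about;
# filters on dt/source are served by partition pruning on the file layout.
events = (spark.read
          .parquet("s3://example-data-lake/ios_measurement/events")
          .where(F.col("dt") == "2022-07-04")
          .where(F.col("source") == "skadnetwork")
          .select("app_id", "event_name", "event_time", "event_count"))

# Ingest into the database of the consumer's choice (illustrative JDBC target).
(events.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://db.example.internal:5432/analytics")
       .option("dbtable", "ios_events")
       .option("user", "analytics")
       .option("password", "change-me")
       .mode("append")
       .save())
```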

Data sharing contract

In order to maintain the data sharing repository's functionality, a contract must be defined and upheld by the producer. Usually a contract implies at least two sides agreeing on a relationship, but in our case the contract is between the producer and the world.
It has to be iron-clad, since updating it will require involving more and more consumers as time passes, and those are the very consumers the producer does not want to have to be aware of.

The contract we decided to enforce on the data is as follows:

  • Schema enforcement at the data repository level, unbreakable
  • Schema backward compatibility is always maintained
  • A unified schema for multiple data sources, resulting in a single, generic data source that serves as the single source of truth. The unified data source is partitioned by the original sources to allow maximum granularity
  • If there are multiple data producers, they are completely transparent to the data consumers; the consumers receive a single source with a single schema
  • The data is kept as granular as possible
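
A contract like this is only as strong as its enforcement. One way to keep the backward-compatibility rule honest is a check in the producer's build or deployment pipeline that allows only additive changes; a minimal sketch, assuming Spark schemas as in the earlier example:

```python
from pyspark.sql.types import StructType


def is_backward_compatible(published: StructType, candidate: StructType) -> bool:
    """Check that a candidate schema only extends the published one.

    Backward compatible here means: every existing field keeps its name and
    type, and any new field is nullable so that older consumers keep working.
    """
    published_fields = {f.name: f for f in published.fields}
    candidate_fields = {f.name: f for f in candidate.fields}

    # Rule 1: nothing may be removed or change type.
    for name, field in published_fields.items():
        if name not in candidate_fields:
            return False
        if candidate_fields[name].dataType != field.dataType:
            return False

    # Rule 2: additions must be nullable.
    return all(field.nullable
               for name, field in candidate_fields.items()
               if name not in published_fields)
```

The same responsibility can also be delegated to a schema registry, or to a table format that enforces schemas on write, such as Delta Lake or Apache Iceberg.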

Metadata

An important part of the data is its metadata, by which I mean any data characteristics that are not an integral part of the raw data: for example, data configurations, data tagging, various partitioning schemes, etc.
I would like to split these into two types:

  1. Metadata that is relevant only for a portion of the data, or that is partitioned by it. In our case, part of the configuration is relevant per measurement period, as defined by the customer.

Here it makes sense to bake the metadata into the data, since the configuration is relevant for a specific piece of data processing, and the relevant configuration will be stored with the data for the relevant dates.
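
As a sketch of what baking metadata into the data can mean in practice, the per-period configuration simply becomes extra columns on the rows it applies to, so each record carries the configuration it was processed under. The column names and join keys below are hypothetical:

```python
from pyspark.sql import functions as F


def attach_period_config(events_df, config_df):
    """Join the per-measurement-period configuration onto the event rows,
    so every record carries the configuration it was processed under.

    config_df is assumed to hold one row per (app_id, measurement_period)
    with the customer-defined settings for that period.
    """
    return (events_df
            .join(config_df, on=["app_id", "measurement_period"], how="left")
            # Snapshot when the configuration was applied, so later changes to
            # the live configuration do not rewrite history.
            .withColumn("config_applied_at", F.current_timestamp()))
```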

Wherever possible, we would also like to move such metadata to a separate location and manage it independently, without including it in the producer-consumer contract. This data is hidden and access to it is not guaranteed; breaking changes are allowed when the metadata is meant for internal usage only, such as monitoring or debugging.

  2. Metadata that is a generic description of the data, for example: timestamps, data origin, processing rules, etc. This metadata is relatively small, static, and does not change frequently. Generic metadata is usually required not only for data analytics but also for data visualisation.
Our main purpose as data producers is to ease the consumers' lives. In this case, either a very lean API or a separate data store has the right to exist.
The questions to ask in this case, once again, are the change rate of this metadata and the data freshness required.
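
Since this generic metadata is small and nearly static, one option that fits the same architecture is to publish it next to the data as its own tiny artifact, or to serve the same payload from a very lean read-only API. A sketch with invented fields and file names:

```python
import json
from datetime import datetime, timezone

# Hypothetical descriptor published alongside each batch.
batch_metadata = {
    "dataset": "ios_measurement/events",
    "dt": "2022-07-04",
    "produced_at": datetime.now(timezone.utc).isoformat(),
    "origins": ["skadnetwork", "device_events"],  # data origin, for lineage
    "schema_version": 3,
    "processing_rules": "aggregation-v2",
}

# A small JSON file per partition (or a tiny read-only endpoint serving the
# same payload) keeps the main data contract untouched: consumers that need
# the metadata read it, everyone else can ignore it.
with open("_metadata_dt=2022-07-04.json", "w") as f:
    json.dump(batch_metadata, f, indent=2)
```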

Data Visualisation

Maintaining a strict data schema has another definite advantage: easy visualisation. Data visualisation helps with the most basic understanding of the data: what it represents, whether it is relevant, what it is connected to, what it is useful for, and so on. With that better understanding, visualisation tools can also help mold metadata into the data, and much more. One of the most common tools for data visualisation today is the data catalog, which can open a gateway to the next stage: data discovery.

Wrapping up

The migration from API interfaces to data interfaces is only just beginning in our case, but it is already proving beneficial for all the reasons above.

We had a unique opportunity to begin the change while creating a new product with new interfaces, and to follow by migrating existing ones. In our day-to-day routine we don't always remember to stop and rethink decisions already taken; new challenges and major changes are opportunities to stop, think and step out of our comfort zone.
If there is a lesson to be learnt here, it is this: do not automatically go for the comfortable and familiar solution.

Liza Wainblat
AppsFlyer Engineering

An Engineering Manager with over 20 years of experience, specialising in scalable, distributed, performance-oriented systems and big data processing.