Dealing with data in a micro-services world

Hugo Luchessi
Published in Loft
Apr 7, 2020 · 7 min read

A problem many companies face these days is data governance. What most people don't realize is how big its impact is. Not having a minimum amount of governance boosts local team autonomy, but overall autonomy drops drastically when teams depend on each other. For data analysis, consistency (not only in the CAP sense, but also structural consistency) is key to being productive and effective.

The root cause

Business structures are constantly changing, and those changes should be clearly communicated between teams. Communication is the most difficult part of this story: teams change their concept of a business entity and no one is informed, or the change is let pass because no one truly knows what it will impact. This is common because whole systems are too complex for people to keep track of.

Decentralized systems leave teams with full autonomy to do whatever they need, but in the real world, many systems depend on the data generated by other systems. This requires a heavy team and data management effort to keep track of all changes and make sure that things are not breaking.

Problems caused

When a team couples itself to another team's database (using it directly or copying its data) or to a third-party schema (e.g. a CRM), both lose their autonomy to grow. Since they are locked in at the schema level, neither team can make schema changes without impacting the other or having to duplicate the data.

In the scenario where a team duplicates data, there is another huge problem: consistency (this time, CAP consistency) between the databases. Managing multiple databases is hard, given that keeping them in sync would involve complex distributed transaction solutions. Those solutions, however, are often ignored.

The replication scenario becomes even harder to solve when there is a third-party tool (e.g. a CRM or marketing application) to be updated or, worse still, multiple third-party tools, because of the consistency issues and the coupling this usually causes.

This leads to companies having multiple schemas for the same data, and managing all of those schemas without breaking anything requires a lot of energy.

Avoiding this scenario

To avoid this scenario, interfaces must be well built, never allowing breaking changes, and well documented so that they are easy to use. Changes are usually slow and normally demand data migration. Every team is responsible for its application, for the data it generates, and for exposing this data in the required payloads. This is how a well-designed system with a well-defined interface and purpose should work.

But in the real world, this is way too slow for teams in a fast-growing business, where applications are meant to prove their worth before being fully implemented.

This leads to teams having a centralized database on which many applications rely, or from which they copy data. Everything a team does will automatically reflect on all applications that depend on this central source of information, without anyone having to update service interfaces or documentation. This centralized database is often not exactly a database, but some third-party application like a CRM.

It’s all about productivity

Syncing databases and managing schemas are hard problems and should be avoided when possible, because they turn coordination between teams (or applications) into a huge management issue and a waste of energy, dropping productivity drastically.

But how can this scenario be avoided in a fast-growing company that cannot spend much time on documentation and on setting boundaries between applications? How can communication and interface enforcement be kept at a balanced level?

Accountability

The first thing to do is to define who owns the data. Assigning accountability for a portion of the data makes communication between teams easier. These portions should, whenever possible, follow the business domain model.

So, for instance, all marketing funnel data is the responsibility of the team that owns the marketing application. After a lead converts, the generated data becomes the responsibility of the Sales team, which controls the whole sales flow. After the sale, the Customer Success application handles and generates more data for that customer, so that application's team is the data owner. Having these owners well known makes it easier to track problems in general because, no matter what a system does, it normally manipulates data. So if data is missing, corrupted, or wrong, whoever owns that chunk of data should investigate.
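
One lightweight way to make that ownership explicit is a shared mapping from business entities to owning teams. A minimal sketch in Python (the entity and team names are purely illustrative):

# Sketch: an explicit, shared record of which team owns which data.
# Entity and team names are hypothetical.
DATA_OWNERS = {
    "Lead":     "marketing-team",
    "Deal":     "sales-team",
    "Customer": "customer-success-team",
}

def owner_of(entity_type):
    """Who to call when this entity's data is missing, corrupted or wrong."""
    return DATA_OWNERS[entity_type]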

This will not solve the problems of system interfaces or duplicated data, but it is a good place to start.

Contract distribution

Interface enforcement is also a big problem to solve. Solving it makes team communication much easier, but it usually makes productivity drop because it demands a lot of technical and non-technical effort.

To give a simple example, suppose there is a service A which is consumed by three other services (B, C, and D), written in two different languages.

Having a well-designed and well-implemented service normally means having a well-documented interface and client libraries for the most used languages. In this case, we would have two client libraries for our service.

If there is a need to change something in the interface, the following tasks would also need to be done:

  • Effectively change the interface in the code
  • Change the documentation so other teams can be aware
  • Change all the client libraries

The task list gets even longer if the change breaks the old interface (e.g. renaming or removing a field), because services often need to be backward compatible.
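
To illustrate with a hedged sketch (the field names are invented): renaming a field breaks old consumers unless the service serves both names during a deprecation window.

# Sketch: a backward-compatible field rename. Old consumers keep reading
# "value", new consumers read "price"; the response carries both until
# every client has migrated, and only then can "value" be dropped.
def ad_response(ad):
    return {
        "title": ad["title"],
        "price": ad["price"],  # new field name
        "value": ad["price"],  # deprecated alias kept for old clients
    }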

This effort pays off in the future, but for early-stage applications it is often pure overhead in the development process. What normally happens is that only the interface gets changed, leaving documentation and client libraries outdated or non-existent.

Another approach is to centralize the definitions of the data shared between systems. All data would be defined as metadata, very close to what SQL databases do, and distributed to every application that cares about it.

This approach lets teams change the definitions of their business entities and distribute the new versions so other teams can consume them.

For instance, in a car sale system, the team responsible for selling ads could have its ad definition as:

{
  "entityType": "Ad",
  "fields": [
    {
      "name": "title",
      "type": "string",
      "validations": {
        "maxSize": 120,
        "required": true
      }
    },
    {
      "name": "value",
      "type": "decimal",
      "validations": {
        "minValue": 100.0,
        "maxValue": 10000.0,
        "required": true
      }
    },
    {
      "name": "description",
      "type": "string",
      "validations": {
        "maxSize": 2000,
        "required": false
      }
    }
  ]
}

The team responsible for the marketing product, in which ads are listed and leads are generated, can then consume ad data using the up-to-date definition of what an Ad is.
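
To make this concrete, here is a minimal sketch of how any consumer could validate a payload against the distributed definition above. It is written in Python, and the function name is illustrative rather than part of any specific product; the rules come straight from the metadata.

# Sketch: validate a payload against the distributed entity definition.
AD_DEFINITION = {
    "entityType": "Ad",
    "fields": [
        {"name": "title", "type": "string",
         "validations": {"maxSize": 120, "required": True}},
        {"name": "value", "type": "decimal",
         "validations": {"minValue": 100.0, "maxValue": 10000.0, "required": True}},
        {"name": "description", "type": "string",
         "validations": {"maxSize": 2000, "required": False}},
    ],
}

def validate_entity(definition, payload):
    """Return a list of violations; an empty list means the payload is valid."""
    errors = []
    for field in definition["fields"]:
        name, rules = field["name"], field["validations"]
        value = payload.get(name)
        if value is None:
            if rules.get("required"):
                errors.append(f"{name}: required field is missing")
            continue
        if "maxSize" in rules and len(value) > rules["maxSize"]:
            errors.append(f"{name}: longer than {rules['maxSize']} characters")
        if "minValue" in rules and value < rules["minValue"]:
            errors.append(f"{name}: below minimum {rules['minValue']}")
        if "maxValue" in rules and value > rules["maxValue"]:
            errors.append(f"{name}: above maximum {rules['maxValue']}")
    return errors

print(validate_entity(AD_DEFINITION, {"title": "Blue sedan, single owner", "value": 9500.0}))
# -> [] (valid payload)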

There are several tools, like Confluent Schema Registry, that address this schema management problem.
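
As a hedged example, assuming a Schema Registry reachable at a hypothetical internal address, a consumer could fetch the latest version of a subject's schema over the registry's REST API:

# Sketch: fetch the latest schema for a subject from Confluent Schema
# Registry. The URL and subject name are hypothetical; error handling
# is kept minimal.
import json
import requests

REGISTRY_URL = "http://schema-registry.internal:8081"

def latest_schema(subject):
    resp = requests.get(f"{REGISTRY_URL}/subjects/{subject}/versions/latest")
    resp.raise_for_status()
    body = resp.json()  # contains "subject", "version", "id" and "schema"
    return body["version"], json.loads(body["schema"])

version, schema = latest_schema("ads-value")
print(version, schema)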

The problem is only partially solved: all teams have access to the same data definitions, but what about the data itself?

Having a centralized and standardized way to define business entities solves the interface enforcement problem, which is normally the bigger issue. Storing the data, however, is still a huge headache that slows teams down. So enforcing these definitions only partially solves the productivity problem.

Centralized data storage

To solve this, it's possible to have a centralized data storage that uses the distributed schemas to persist and validate data. This way, each team would only be responsible for the definitions of its own business domain models and application rules. Problems like security, governance, replication, and performance would be completely outside those teams' scope of work.
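
A minimal sketch of that idea, reusing the validate_entity function and AD_DEFINITION from the earlier sketch (all names hypothetical): the storage service looks up the entity's current definition, validates the payload, and only then persists it.

# Sketch of a centralized data store: distributed schemas in, validated
# data out. The in-memory dicts stand in for a real schema registry and
# a real database; validate_entity and AD_DEFINITION come from the
# earlier sketch.
SCHEMAS = {}  # entityType -> definition distributed by the owning team
STORAGE = {}  # entityType -> list of accepted records

def register_schema(definition):
    SCHEMAS[definition["entityType"]] = definition

def save(entity_type, payload):
    definition = SCHEMAS.get(entity_type)
    if definition is None:
        raise ValueError(f"no schema registered for {entity_type}")
    errors = validate_entity(definition, payload)
    if errors:
        raise ValueError(f"rejected {entity_type}: {errors}")
    STORAGE.setdefault(entity_type, []).append(payload)

register_schema(AD_DEFINITION)
save("Ad", {"title": "Blue sedan, single owner", "value": 9500.0})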

Benefits

Aside from the productivity gains, other benefits can be achieved by implementing this approach:

  • Marketing could build better segmentation for its communications;
  • Business analytics teams would have less overhead when normalizing data;
  • A single source of truth for extracting business metrics and OKRs;
  • Data sharing between applications.

Data Platforms

Many applications called data platforms (like Looker or Tableau) try to address the complexity of analyzing multiple data sources. They are meant to unify access to all databases and cross-reference their data.

But consistency (regarding either schema or data) is not addressed by those tools. They map to external databases, and the analyst using the data still has to deal with the complexity of gluing all of those data sources together.

In other words, those data platforms will not solve any of the problems described above, but the idea of one place to look at data comes close to the centralized data storage concept described earlier. There is another kind of platform that tries to solve those issues: Customer Data Platforms.

Customer Data Platforms

They are defined by the CDP Institute as "packaged software that creates a persistent, unified customer database that is accessible to other systems".

A CDP is basically a standardized way to define centralized business models and storage so that every application can consume business data, which is a great way to solve the problems described above.

Many products are already built around this concept. These tools have their own definitions of customers and relations, requiring applications to abide by those rules. They are mostly built for marketing purposes: they ship with many built-in features to help marketing improve results, and they allow integrations with other applications, like CRMs and other marketing platforms.

The problem with those tools arises whenever there is a need to customize their definitions or to allow entities other than the customer. Customizing their schema would require implementing most of the schema management and persistence layers yourself, which is almost the entire solution.

Know your fight

There are many complexities involved in crossing data from multiple tenants and multiple operational workflows. Building a CDP solution from the ground up enables implementing all of those complex concepts, but using an off-the-shelf solution is normally much faster and usually solves most of the problems.

In an upcoming post, I will describe how we at Loft are solving these issues.
