An empirical approach to automatic data decomposition

Martin Dlask
12 min read · Sep 5, 2020


For every data scientist, data availability is a crucial prerequisite for their work. Most commonly, data are acquired from internal and external (paid) sources via APIs. On one hand, data scientists usually don’t care much about how the data are provided to them; they are satisfied when they can query them from a database or retrieve them from an Excel sheet. On the other hand, quick data availability is very important for building any analytics tool.

Usually, the data must go through processing done by integration engineers before they are ready to be consumed, and that can take a lot of time. Additionally, data mapping is a very tedious task; it is often said that 90% of developers do not enjoy decomposing and mapping data. Raw data from external sources are usually acquired as XML or JSON files for given parameters. Nowadays, one of the most common ways of storing such data in the cloud is inside a data lake. Once the data are stored there, an engineer develops a pipeline to propagate them into normalized tables in a relational database or, more likely, into a data warehouse.

The concept of the data flow has been heavily automated with cloud tools. Namely, in AWS one can use S3, Glue, Athena, Lake Formation, and Redshift for data management and storage. This time, we will challenge the heuristics in AWS that automatically detect the format and properties of the input data, and we will see how the grouping into tables works.

Car register example

In our hypothetical example, let’s suppose we are retrieving historical data about cars identified by their VIN. The first API provides the history of the mileage measured by car service stations, and the other API returns the history of the registration plates. Below is an example API call for a car with the VIN 1HGBH41JXMN203578.
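The payloads below are a minimal sketch of what the two responses might look like; apart from vin and mileageHistory, the attribute names (plateHistory, mileage, date, plateNumber) and all values are illustrative assumptions. The mileage API might return:

```json
{
  "vin": "1HGBH41JXMN203578",
  "mileageHistory": [
    { "mileage": 15000, "date": "2012-06-11" },
    { "mileage": 47500, "date": "2014-03-02" },
    { "mileage": 81200, "date": "2016-09-27" }
  ]
}
```

and the registration API:

```json
{
  "vin": "1HGBH41JXMN203578",
  "plateHistory": [
    { "plateNumber": "3AB1234", "date": "2012-01-15" },
    { "plateNumber": "5CD5678", "date": "2014-08-20" }
  ]
}
```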

The first API returns the VIN of the vehicle together with the attribute mileageHistory, an array of elements containing the mileage in km measured on a particular day. The registration API again returns the VIN, together with an array storing the dates when the registration number changed. Altogether, these two JSON responses contain three entities:

  1. Vehicle identification (through VIN),
  2. Mileage history (for each vehicle),
  3. Plate history (for each vehicle).

Let’s have a look at one possible way of mapping and decomposing these data into database tables. Giving each entity its own table and each record its own identifier, the schema could look like this.
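As a minimal sketch, using Python’s built-in sqlite3 module (only the vehicle table with its vin and vehicleID columns comes from the description that follows; the history table and column names are illustrative), the decomposition could be created like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, just for the sketch

# One table per entity: the vehicle itself plus its two histories.
conn.executescript("""
CREATE TABLE vehicle (
    vehicleID   INTEGER PRIMARY KEY,
    vin         TEXT NOT NULL UNIQUE
);

CREATE TABLE mileage_history (
    mileageID   INTEGER PRIMARY KEY,
    vehicleID   INTEGER NOT NULL REFERENCES vehicle(vehicleID),
    mileage     INTEGER,   -- mileage in km
    date        TEXT       -- measurement date
);

CREATE TABLE plate_history (
    plateID     INTEGER PRIMARY KEY,
    vehicleID   INTEGER NOT NULL REFERENCES vehicle(vehicleID),
    plateNumber TEXT,
    date        TEXT       -- date of the registration change
);
""")
```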

The table vehicle stores the vin, and vehicleID is used as the vehicle identifier for saving the historical data, with every element of the array stored as one row. This decomposition is at least in the 3rd normal form, and it describes the most common way the data could be represented. In general, delivering a system that will automatically populate a DB schema like this is not trivial: one has to first do a business analysis of the data and analyze their relationships, decide how to parse the data, how to structure the database, and finally how to map the data into the appropriate fields.

Challenging AWS Glue

According to the AWS docs, a crawler can automate the process of decomposing raw data. Here is its description:

  • Classifies data to determine the format, schema, and associated properties of the raw data — You can configure the results of classification by creating a custom classifier.
  • Groups data into tables or partitions, where the grouping is based on crawler heuristics.

We will use responses of our two fictitious APIs as an example to feed the heuristics and grouping in AWS. The following files were uploaded to S3, and the crawler was applied to them.

The crawler was configured to group data into tables and recognize the data types while reading data from the S3 bucket. It was not forced to create a single schema for the files, since multiple entities are present. Surprisingly, the crawler ended up creating four database tables, one for each file.

Looking at the structure of the tables, two of them had a schema with vin and mileagehistory as an array, and the other two had a schema with vin and platehistory.

Having as many tables as output files is not really the expected behavior. A data scientist trying to retrieve data from all four files would first have to get the list of all tables (whose count equals the number of files) and then decide for each of them whether it has a mileage or a plate column. That is a very complicated query, not to mention the need to unnest the arrays into database rows before processing.

Let’s change the configuration of the crawler and allow a single schema for all files. The crawler could at least recognize that mileagehistory and platehistory for one VIN number belong together.

This time, the crawler created only a single table with a single schema for all records. The table definition looks promising:

However, a surprise comes when querying the data from the data lake. After selecting all records from the resulting table, one gets:

There are four independent records; again, each file became one row. A data scientist trying to obtain these data points would have to unnest mileagehistory or platehistory for the rows that are not empty and combine the data together. Solving the task “find all cars that have at least 10 000 km mileage and whose second registration was in 2014” would take a lot of time, considering the amount of data.

The crawler failed in both cases with the data grouping; however, it correctly identified the data types. Data format identification is easy, but the grouping can be tricky, especially when dealing with JSON array fields.

Proposed algorithm for data grouping

Using AWS Glue for grouping and categorizing data in our example was not very successful. Therefore, we have come up with a new approach to data normalization. The main idea relies on the following hypothesis: during the design of an external API, the business logic has been incorporated into its response(s). To some extent, the different endpoints and the structure of the responses already contain the relations. Assuming the API is well designed, we can capture a lot of the logic from the shape of the response alone. The proposed algorithm analyzes each response as it arrives in the system and assumes the existence of two global structures:

1. Pattern table, containing a list of column identifiers and their data types for each table discovered so far. As new tables originate during the process, their columns and data types are stored in this list in order to check whether incoming data should be appended to an existing table or a new table should be created.

2. Table of assumed relations, into which discovered relations such as 1:n are placed, e.g. table A and its column 1 being related to table B and its column 2.

In the beginning, both of these structures are empty. The algorithm works in a recursive way, decomposing the JSON response from the API. The procedure distinguishes whether the structure is an array […] or an object {…}.
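A simplified sketch of this recursion is shown below; it only illustrates the idea. The way nested keys are turned into child table names, how data types are inferred, and the naming of the relation columns are assumptions, not the exact procedure.

```python
from typing import Any

pattern_table: dict[str, dict[str, str]] = {}    # table name -> {column: data type}
relations: list[tuple[str, str, str, str]] = []  # (parent table, parent column, child table, child column)

def infer_type(value: Any) -> str:
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "string"

def decompose(node: Any, table: str) -> None:
    """Recursively walk a JSON structure, registering columns and 1:n relations."""
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)):
                # A nested structure becomes its own table, related 1:n to the parent.
                child = f"{table}_{key}".lower()
                relations.append((table, f"{table}ID", child, f"{table}ID"))
                decompose(value, child)
            else:
                # A scalar becomes a column of the current table.
                pattern_table.setdefault(table, {})[key] = infer_type(value)
    elif isinstance(node, list):
        for element in node:  # every array element maps to one row of the child table
            decompose(element, table)

# Feeding one mileage response (structure as sketched earlier):
decompose(
    {"vin": "1HGBH41JXMN203578",
     "mileageHistory": [{"mileage": 15000, "date": "2012-06-11"}]},
    table="vehicle",
)
print(pattern_table)  # {'vehicle': {'vin': 'string'}, 'vehicle_mileagehistory': {'mileage': 'integer', 'date': 'string'}}
print(relations)      # [('vehicle', 'vehicleID', 'vehicle_mileagehistory', 'vehicleID')]
```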

Applying this algorithm to our sample data, this is how the pattern table would look.

The table of relations will look like this.

Using the table of relations and the pattern table, this is how the table schema looks. It is a bit different from the 3rd normal form mentioned in the introduction, but the main entities and relations remain the same.

The most powerful aspect of the procedure is its reproducibility in SQL commands. Every step that involves table creation or data row insertion can be fully automated. Therefore, one can have fairly well decomposed data already in the staging layer of a database, without human interaction. However, indexes and optimization features must be added manually later. From a data scientist’s point of view, at least for model training, it is very valuable to have data in a normalized form, despite the longer loading.
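As a rough illustration of that claim, and assuming the pattern table structure from the sketch above, the SQL statements can be derived mechanically; the generated DDL is deliberately simplified and ignores keys and relations.

```python
from typing import Any

def create_table_sql(table: str, columns: dict[str, str]) -> str:
    """Translate one pattern table entry into a CREATE TABLE statement."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols});"

def insert_row_sql(table: str, row: dict[str, Any]) -> tuple[str, tuple]:
    """Translate one flat JSON record into a parameterized INSERT statement."""
    cols = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders});", tuple(row.values())

print(create_table_sql("vehicle", {"vin": "string"}))
# CREATE TABLE IF NOT EXISTS vehicle (vin string);
print(insert_row_sql("vehicle", {"vin": "1HGBH41JXMN203578"}))
# ('INSERT INTO vehicle (vin) VALUES (?);', ('1HGBH41JXMN203578',))
```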

The previous exercise was a model example, showing the perfect, theoretical world in which the API structure doesn’t change over time. However, in real business, we usually demand that our API vendor make changes and gradually add new data points. Traditionally, every time the vendor adds a new data point, a data engineer must analyze the response and map the new field to a column in a table.

An empirical approach to the decomposition

Communication with vendors can be difficult and their deadlines hard to meet; therefore, API production is an iterative process. Let’s have a look at a typical conversation with a vendor.

Unless the API vendor provides their product strictly “as is”, there are always some ongoing changes, sometimes driven by the consumer and sometimes by the vendor. It is therefore necessary to add a layer to the algorithm that is robust to these kinds of changes.

Therefore, the second step in automating data consumption involves the automatic acquisition of a new data point, so that data scientists can immediately use it in their analyses. Automating the decision of which entity the new fields belong to, and whether they form a new entity themselves, is a very difficult task.

Example #1:

Looking at the two responses, it is clear that these data should not be added to the same table. Why is that? The first response describes state attributes that usually don’t change, such as the manufacturing year and color of the vehicle. The other response captures the current status of the car, which can change over time. Therefore, these two responses would most likely be represented as two independent tables, e.g. vehiclebasedata and vehiclestatus.

Example #2:

Here, the parameter carBody could simply be kept with the VIN in the base table from our example; therefore, only two entities would be preserved in both versions of the response: the mileage history and the vehicle base data.
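For illustration (the attribute names besides vin and carBody follow the earlier mileage sketch and are assumptions), the original mileage response

```json
{
  "vin": "1HGBH41JXMN203578",
  "mileageHistory": [
    { "mileage": 15000, "date": "2012-06-11" }
  ]
}
```

would simply be extended to

```json
{
  "vin": "1HGBH41JXMN203578",
  "carBody": "sedan",
  "mileageHistory": [
    { "mileage": 15000, "date": "2012-06-11" }
  ]
}
```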

Usually, when adding new data that don’t belong to any existing entity, API producers create a new endpoint with a different name (as in example #1). When the data could logically belong to the existing structures, they are appended to the existing response (as in example #2). Therefore, the reasoning will again be based both on the response format and on the number of new identifiers.

When comparing a JSON response A to a JSON response B, there can be a certain number of identifiers with the same name, some identifiers unique to A that are not present in B, and vice versa. The decision whether they will be part of the same entity is based on the calculation of the adjustment ratio. Denote the variables as follows:

UIA = number of identifiers in response A, but not in B (including arrays)
UIB = number of identifiers in response B, but not in A (including arrays)
SAB = number of identifiers shared between A and B.

In example #1, the adjustment ratio equals 2/3, whereas example #2 has an adjustment ratio of 1/5. When there is no difference in the response structure, the coefficient equals zero. The decision whether a new entity will be established is based on a comparison with a threshold value. If the adjustment ratio is smaller than the threshold, the existing entity will be updated; when the adjustment ratio is higher, a new entity will be produced.
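A minimal sketch of this decision rule is shown below, assuming the ratio is computed as the number of differing identifiers divided by the total number of identifiers in both structures; this formula is an assumption, chosen to be consistent with the worked example in the next section.

```python
def adjustment_ratio(identifiers_a: set[str], identifiers_b: set[str]) -> float:
    """Share of identifiers (including array names) present in only one of the two structures."""
    unique_a = identifiers_a - identifiers_b   # UIA
    unique_b = identifiers_b - identifiers_a   # UIB
    shared = identifiers_a & identifiers_b     # SAB
    return (len(unique_a) + len(unique_b)) / (len(shared) + len(unique_a) + len(unique_b))

def same_entity(identifiers_a: set[str], identifiers_b: set[str], threshold: float = 0.30) -> bool:
    """Update the existing entity if the ratio stays below the threshold; otherwise create a new one."""
    return adjustment_ratio(identifiers_a, identifiers_b) < threshold

# Example #2: {vin, mileageHistory, mileage, date} vs. the same set plus carBody -> 1/5 = 0.2
```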

Example of the adjustment ratio

Suppose that we are receiving API responses in the following format. The response gets transferred into a single database table.

After reconstruction, and after a few records have been received, the table can look like this.

Suppose we have empirically configured the adjustment threshold to be 0.30. Imagine that a data scientist requests a new field to be propagated along with the data, for example the number of doors. The API designer appends the new identifier to the response. The new response could look like this.

The adjustment ratio in this case is 1/(1+2*3) = 1/7 ≈ 0.143. Since 0.143 < 0.30, we change the table definition, populate the previous rows with NULL, and change the pattern table (the table of assumed relations remains the same). The modified table will look like this.
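Using the adjustment_ratio sketch above with the six previously known identifiers plus the new one gives the same number; the concrete attribute names here are placeholders, since only the VIN, color, and number of doors are named in the example.

```python
known = {"vin", "color", "make", "model", "year", "fuelType"}  # the 6 pattern table identifiers (placeholder names)
new_response = known | {"numberOfDoors"}                       # the vendor appended one field

print(adjustment_ratio(known, new_response))  # adjustment_ratio as defined in the earlier sketch: 1/7 ≈ 0.143, below 0.30
```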

Once we have at least one object in the pattern table, the adjustment ratio is no longer calculated with respect to a particular previous API response, but with respect to the identifiers in the pattern table itself. Suppose we request one more field, for example the engine displacement. After the API is redesigned and we ask for VIN = JM1BG2246R0816241, we receive the following API response:

Obviously, the API vendor forgot to provide the color in this case (this happens quite often if the API developer is not using REST API templates), so this field is not available, which would usually throw a violation error. Not in our case. We calculate the adjustment ratio with respect to the pattern table as 2/(2*3+1+1) = 2/8 = 0.25 < 0.30. The adjustment ratio is smaller than the empirically determined threshold; therefore, we append a row with NULL values for the identifiers not present in the response and create a new column for engine displacement, keeping it NULL for all the previous rows.

The adjustment ratio denotes the pace at which we allow new columns to be appended to existing tables, or simply the ratio describing the maximum tolerated proportion of missing data. In reality, we need two coefficients of adjustment: one for adding new columns and one for tolerating missing data.

Determination of the threshold cannot be fully automatic; however, its value should be set larger than 0 and smaller than 0.30. For relatively young API vendors and for responses with little data, it is recommended to set this coefficient higher; for standardized responses, one can use lower threshold values.

New data providers often change their API outputs in the first months of their service, while established businesses that have been providing data services for several years don’t change their responses much. One could add other aspects to the calculation of the adjustment, such as the age of the data provider’s company or a penalty for missing data.

Conclusion

This algorithm has been verified on real REST APIs with fair results, achieving significantly better grouping performance than AWS Glue, which has shown its inefficiency especially for responses containing arrays with a variable number of elements. In 80% of the test cases, the procedure ensured decomposition into the 3rd normal form or higher. When the response is small, it is still hard for the algorithm to detect whether the newly added data form a new entity or stay in the same one.

The proposed approach is still very experimental compared to what has been done in data integration automation up till now. It is worth mentioning that in the near future the mapping exercise or integration process won’t be fully automatic, but it can at least reduce overhead and help pre-process data in a polished way. Data stability is always an issue, and in many cases, understanding API responses is difficult even for humans.

Additionally, it is necessary to mention that this algorithm will not always work perfectly and will not solve every problem with mapping an API output. The performance of the queries is not guaranteed; indexing and optimization always have to be done manually. This approach is meant to deliver the tables in the staging layer of a data warehouse architecture, where they can be consumed by analysts or serve for reporting experiments.
