Data Engineering Practices

Data storage patterns, versioning and partitions

How to efficiently version, store and process data for your pipeline

Managing storage on disk

When dealing with large volumes of data, we have found it useful to separate the data that comes in from upstream providers (if any) from the insights we process and produce. This allows us to segregate access (different parts of the data carry different PII classifications) and to apply different retention policies.

Data processing pipeline between various buckets and the operations performed when data moves from one bucket to the other
Segregation of data based on the amount of processing we have done in the system

Provider buckets

The preferred layout of provider buckets
Buckets owned by providers come in various shapes. When we have some say in the structure, we give the data provider access to a dataset directory and they follow whatever structure they prefer underneath it. If we’re given a choice, we choose the structure shown in the image. More details follow when we talk about Data Partitioning below.

Landing bucket

Landing bucket data layout
We copy the data shared by the provider into our landing bucket as-is. This lets us:
  1. Ensure that we don’t accidentally share raw data with others (we are contractually obligated not to share source data)
  2. Apply different access policies to raw data when it contains any PII
  3. Preserve an untouched copy of the source if we ever have to re-process the data (providers delete data from their bucket within a month or so)

Core bucket

Core bucket data layout

Derived bucket

Your data platform probably has multiple models running on top of the core data, each producing its own insights. We write the output of each model into its own directory, as sketched below.

Derived bucket data layout
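
As a rough illustration (the bucket path and the second model name are assumptions, while the model=<name> directory convention matches the example later in this post), writing each model's output into its own directory can be as simple as parameterising the output path:

import org.apache.spark.sql.DataFrame

// Hypothetical derived-bucket root; the real path and naming are up to your platform.
val derivedBucket = "s3://example-derived-bucket"

// Write one model's output into its own directory under the derived bucket.
def writeModelOutput(output: DataFrame, modelName: String): Unit =
  output.write
    .mode("overwrite")
    .parquet(s"$derivedBucket/model=$modelName")

// e.g. writeModelOutput(superheroIdentities, "superhero-identities")
//      writeModelOutput(threatReports, "threat-reports")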

Advantages of data segregation

  1. Separating the data makes it easier to find. When terabytes or petabytes of information are spread across your organisation and multiple teams work on the same data platform, it becomes easy to lose track of what is already available, and hard to find it if it is stored in inconsistent places. Grouping datasets by whether we receive them from an upstream system, produce them ourselves or send them to a downstream system helps teams locate information easily.
  2. Different rules apply to different datasets. You might be obligated to delete purchased raw data under certain conditions (for example, when it contains PII), while derived data that contains no PII can follow different retention rules.
  3. Most storage platforms support archiving. Separating datasets makes it easier to apply different archival policies to each of them (we’ll talk about other aspects of archiving under data partitioning).

Data partitioning

Partitioning is a technique that allows your processing engine (such as Spark) to read only the data it needs, making jobs more efficient. The best way to partition data is based on how it is read, written and/or processed. Since most data is written once and read many times, optimising a dataset for reads usually makes sense.
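
As a sketch of what this looks like in Spark (the paths and sample rows are illustrative, borrowing the superhero dataset used later in this post, and a spark-shell session is assumed), writing the data partitioned by the columns reads filter on lets a query for a single day touch only that day's files:

import spark.implicits._

val identities = Seq(
  ("Tony Stark", "Iron Man", 2021, 5, 1),
  ("Steve Rogers", "Captain America", 2021, 5, 1),
  ("Bruce Banner", "Hulk", 2021, 5, 2)
).toDF("real_name", "superhero_name", "year", "month", "day")

// Partition the files on disk by the columns that reads filter on.
identities.write
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .parquet("/data/derived/model=superhero-identities")

// A read that filters on the partition columns only scans the matching
// year=/month=/day= directories (partition pruning) instead of the whole dataset.
spark.read
  .parquet("/data/derived/model=superhero-identities")
  .where($"year" === 2021 && $"month" === 5 && $"day" === 1)
  .show()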

Data versioning

We change the version of the data every time there is a breaking change. Our versioning strategy is similar to the one described in the book on Database Refactoring, with a few changes for scale. The book covers many types of refactoring; the column rename is a common and interesting use case.
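
A minimal sketch of how a breaking change such as a column rename could be handled with version partitions; the version=N directory convention and the civilian_name column are assumptions for illustration, and a spark-shell session is assumed:

val basePath = "/data/derived/model=superhero-identities"

// version=1 stays untouched on disk; existing consumers keep reading it.
val v1 = spark.read.parquet(s"$basePath/version=1")

// The breaking change (a column rename) is applied while writing version=2
// into its own directory, so no in-place migration of version=1 is needed.
v1.withColumnRenamed("real_name", "civilian_name")
  .write
  .mode("overwrite")
  .parquet(s"$basePath/version=2")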

Versioning on large data sets

When the data volume is high (think terabytes to petabytes), running migrations like this is very expensive in terms of time and resources. Either the application incurs significant downtime during the migration, or a second copy of the dataset has to be created (which makes storage more expensive).

As an example, consider a dataset written daily under year/month/day partitions, whose schema evolves over time. The data written on the first day (1 May 2021) has two columns:

+--------------+-----------------+
| real_name    | superhero_name  |
+--------------+-----------------+
| Tony Stark   | Iron Man        |
| Steve Rogers | Captain America |
+--------------+-----------------+

The data written the next day (2 May 2021) adds a home_location column:

+------------------+----------------+--------------------------+
| real_name        | superhero_name | home_location            |
+------------------+----------------+--------------------------+
| Bruce Banner     | Hulk           | Dayton, Ohio             |
| Natasha Romanoff | Black Widow    | Stalingrad, Soviet Union |
+------------------+----------------+--------------------------+

The data written on the third day (3 May 2021) drops the real_name column altogether:

+----------------+---------------------------+
| superhero_name | home_location             |
+----------------+---------------------------+
| Spider-Man     | Queens, New York          |
| Ant-Man        | San Francisco, California |
+----------------+---------------------------+

Reading the whole dataset with a plain spark.read.parquet picks up the schema from one set of files (here, the first day's), so the home_location column does not appear:

scala> spark.read.parquet("model=superhero-identities").show()
+----------------+---------------+----+-----+---+
|       real_name| superhero_name|year|month|day|
+----------------+---------------+----+-----+---+
|Natasha Romanoff|    Black Widow|2021|    5|  2|
|    Bruce Banner|           Hulk|2021|    5|  2|
|            null|        Ant-Man|2021|    5|  3|
|            null|     Spider-Man|2021|    5|  3|
|    Steve Rogers|Captain America|2021|    5|  1|
|      Tony Stark|       Iron Man|2021|    5|  1|
+----------------+---------------+----+-----+---+

Enabling the mergeSchema option reconciles the columns across all partitions, filling values that are missing in a partition with null:

scala> spark.read.option("mergeSchema", "true").parquet("model=superhero-identities").show()
+----------------+---------------+------------------+----+-----+---+
|       real_name| superhero_name|     home_location|year|month|day|
+----------------+---------------+------------------+----+-----+---+
|Natasha Romanoff|    Black Widow|Stalingrad, Sov...|2021|    5|  2|
|    Bruce Banner|           Hulk|      Dayton, Ohio|2021|    5|  2|
|            null|        Ant-Man|San Francisco, ...|2021|    5|  3|
|            null|     Spider-Man|  Queens, New York|2021|    5|  3|
|    Steve Rogers|Captain America|              null|2021|    5|  1|
|      Tony Stark|       Iron Man|              null|2021|    5|  1|
+----------------+---------------+------------------+----+-----+---+
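
For completeness, the newer partitions in the example above could be produced with a plain append of the new schema, leaving the earlier partitions untouched. A sketch, assuming the same spark-shell session and relative path as above:

import spark.implicits._

// Day three's data arrives with the evolved schema (no real_name column).
val day3 = Seq(
  ("Spider-Man", "Queens, New York", 2021, 5, 3),
  ("Ant-Man", "San Francisco, California", 2021, 5, 3)
).toDF("superhero_name", "home_location", "year", "month", "day")

// Appending only adds new year=/month=/day= directories; the partitions
// written on earlier days are left untouched.
day3.write
  .mode("append")
  .partitionBy("year", "month", "day")
  .parquet("model=superhero-identities")

Additive changes like the new home_location column can be absorbed this way with mergeSchema. For breaking changes such as a column rename or removal, we bump the version and write under a new version partition, so that every partition of a given version shares the same schema.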

Advantages of data versioning

  1. Each version partition on disk has the same schema (making reads easier)
  2. Downstream systems can choose when to migrate from one version to another (see the sketch after this list)
  3. A new version can be tested out without affecting the existing data pipeline chain
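
As a sketch of the second and third advantages (assuming the version=N layout from the earlier sketch and a spark session in scope), a downstream job can pin the version it reads and change that path only when its owners are ready to migrate:

// Bump to 2 only when this consumer is ready to migrate; version=2 can be
// produced and tested in the meantime without touching this pipeline.
val pinnedVersion = 1

val identities = spark.read
  .parquet(s"/data/derived/model=superhero-identities/version=$pinnedVersion")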

Summary

Applications, system architecture and your data always evolve. The decisions you make about how you store and access your data affect your system’s ability to evolve. Using techniques like versioning and partitioning helps your system continue to evolve with minimal overhead cost. We therefore recommend integrating these techniques into your product at its inception so the team has a strong foundation to build upon.
