Embracing Data Silos — the journey through a fragmented data world

Amelia
Published in 97 Things
2 min read, Jun 8, 2019

Bin Fan and Amelia Wong

Working in the big data and machine learning space, we frequently hear from data engineers that the biggest obstacle to extracting value from data is accessing that data efficiently. Data silos, isolated islands of data, are often viewed by data engineers as the key culprit. For years there have been many attempts to resolve the challenges caused by data silos, but those attempts have themselves resulted in even more data silos. Rather than attempting to eliminate data silos, we believe the right approach is to embrace them.

Why Data Silos Exist

There are three main reasons why data silos exist. First, within any organization there is data with varying characteristics (IoT data, behavioral data, transactional data, etc.) that is intended for different uses, and some data will be more business critical than other data. This drives the need for disparate storage systems. Second, history has shown that every five to ten years a new wave of storage technologies brings systems that are faster, cheaper, or better designed for certain types of data; organizations also want to avoid vendor lock-in, so they diversify their data storage. Finally, there are regulations that mandate the siloing of data.

Embracing Data Silos

We believe data silos in themselves are not the challenge. The fundamental challenge is how to make data accessible to data engineers without creating more complexity or duplication. Instead of eliminating silos, we propose leveraging a data orchestration system, which sits between compute frameworks and storage systems, to resolve data access challenges. We define a data orchestration system as a layer that abstracts data access across storage systems, virtualizes all the data, and presents the data to data-driven applications via standardized APIs with a global namespace.
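To make the idea of a global namespace concrete, here is a minimal sketch in Python. It is not how any particular orchestration system is implemented; the class name, the mount prefixes, and the storage URIs are all hypothetical and chosen only for illustration. The sketch shows one logical path space being resolved to whichever storage system backs each path.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GlobalNamespace:
    """Toy mount table mapping logical path prefixes to storage URIs (illustrative only)."""

    mounts: dict[str, str] = field(default_factory=dict)

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize trailing slashes so "/sales/" and "/sales" are the same mount point.
        self.mounts[logical_prefix.rstrip("/") or "/"] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        """Translate a logical path into the physical URI that backs it."""
        # Longest matching prefix wins, so a nested mount overrides its parent.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if prefix == "/":
                return self.mounts[prefix] + logical_path
            if logical_path == prefix or logical_path.startswith(prefix + "/"):
                return self.mounts[prefix] + logical_path[len(prefix):]
        raise KeyError(f"no storage mounted for {logical_path}")


# Hypothetical silos: an on-premises Hadoop cluster and an object store.
ns = GlobalNamespace()
ns.mount("/local", "hdfs://local-nn:8020/warehouse")
ns.mount("/archive", "s3://example-bucket/archive")
print(ns.resolve("/local/orders/part-0"))
print(ns.resolve("/archive/2018/events"))
```

Applications only ever see paths such as /local/orders; which silo holds the bytes is a detail of the mount table, so the silos can stay in place.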

With a data orchestration system, data engineers can easily access data stored across various storage systems. For example, a data engineer may need to join two tables stored in two different regions, one in a local Hadoop cluster and the other in a remote Hadoop cluster. In this case, the engineer can deploy Alluxio (an open source implementation of a data orchestration layer) and change the table locations in the Hive metastore to Alluxio URLs rather than the URLs of each physical Hadoop cluster.

As a result, the remote table will be cached in the Alluxio layer, which provides much better performance for follow-up or repeated table access than reading the table directly from the remote cluster. Furthermore, storage teams can make the best storage purchasing decisions without being constrained by the impact of those decisions on application teams.
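The table relocation described above might look roughly like the following Python sketch, which generates the HiveQL statements an engineer could run. The host names, the table names, and the mount points are hypothetical, and the sketch assumes each Hadoop cluster has already been mounted into the Alluxio namespace and that the Alluxio master listens on port 19998.

```python
from urllib.parse import urlparse

# Assumptions for illustration: the Alluxio master address and the mount
# point under which each Hadoop cluster appears in the Alluxio namespace.
ALLUXIO_MASTER = "alluxio-master:19998"
MOUNT_POINTS = {
    "local-nn:8020": "/local",
    "remote-nn:8020": "/remote",
}


def to_alluxio_url(hdfs_url: str) -> str:
    """Rewrite an hdfs:// table location as the equivalent alluxio:// URL."""
    parsed = urlparse(hdfs_url)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URL, got {hdfs_url}")
    mount_point = MOUNT_POINTS[parsed.netloc]  # KeyError if the cluster is not mounted
    return f"alluxio://{ALLUXIO_MASTER}{mount_point}{parsed.path}"


def relocate_statement(table: str, hdfs_url: str) -> str:
    """Build the HiveQL statement that points a table at its Alluxio location."""
    return f"ALTER TABLE {table} SET LOCATION '{to_alluxio_url(hdfs_url)}';"


# Hypothetical tables, one per cluster.
print(relocate_statement("customers", "hdfs://local-nn:8020/warehouse/customers"))
print(relocate_statement("orders", "hdfs://remote-nn:8020/warehouse/orders"))
```

After both statements are applied, a join between the two tables reads through Alluxio, and the query itself does not need to know which cluster holds which table.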
