Building a Data Science App: Part 3 — Data Storage Architecture

This is the third in a series of blogs about data architecture (click for the previous one). In this post I will discuss the principles to consider when setting up and building data storage capabilities within a data science-based application, drawing on my research and experience working as a data scientist at Empirisys.

Data storage

Photo by Joshua Hoehne on Unsplash

Principle 1: Scalability

As an application is used more often and/or by a growing number of people, the volume of data will naturally increase, and the data types may change too. If nothing is done, the efficiency of the pipeline will likely drop. This is where the pipeline needs to be scaled.

One way to do this is to use cloud platforms like AWS, which can scale automatically without manual intervention. For example, extra storage resources can be provisioned by a pre-defined rule when data throughput in a pipeline increases. Combined with the pay-as-you-go pricing offered by providers like AWS, this approach can be incredibly effective at keeping costs down.
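As a concrete illustration, here is a minimal Python sketch of such a pre-defined rule, using boto3 and DynamoDB's auto-scaling integration. The table name, capacity bounds, and utilisation target are all assumptions chosen for illustration, not a prescription:

```python
import boto3

# Application Auto Scaling manages capacity for services such as DynamoDB.
autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target
# ("ingest-table" and the capacity bounds are illustrative values).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/ingest-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

# Pre-defined rule: keep write utilisation around 70%. AWS then adds or
# removes capacity automatically as data throughput rises and falls.
autoscaling.put_scaling_policy(
    PolicyName="ingest-table-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/ingest-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```

Because scaling is driven by a target-tracking rule rather than by a person watching dashboards, capacity follows demand in both directions, which is what keeps the pay-as-you-go bill down.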

The ability to scale without human intervention is a key consideration for a data science application. This is especially true if you are working in the realms of 'Big Data'. Many data scientists describe big data in terms of the so-called "5 V's", which summarise the topic but also highlight the areas to consider when factoring in the scalability of your data science application.

Volume — how much storage space is needed?

Velocity — at what speed does the data need to be transferred? Does it need to be streamed, like Netflix movies, or sent in batches?

Variety — can the application easily adjust to processing different kinds of data, such as PDFs, CSVs, MP3 files, etc.?

Variability — data can change its meaning over time; names for things can change, for example, so assigning a stable, unique ID is key to identifying information within a database (see the sketch after this list).

Value — data must of course add value to a system. Without this value, scaling is meaningless!
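To make the Variability point concrete, here is a minimal sketch of using surrogate keys, so that a record keeps its identity even when its human-readable name changes. The "sensors" table and the sensor names are hypothetical, purely for illustration:

```python
import uuid

# A tiny in-memory "table" keyed by surrogate IDs rather than by names.
sensors = {}

def register_sensor(name: str) -> str:
    """Create a record under a stable, unique ID; the name may change later."""
    sensor_id = str(uuid.uuid4())
    sensors[sensor_id] = {"name": name}
    return sensor_id

def rename_sensor(sensor_id: str, new_name: str) -> None:
    """The display name changes, but every reference by ID stays valid."""
    sensors[sensor_id]["name"] = new_name

sensor_id = register_sensor("Boiler pressure gauge 1")
rename_sensor(sensor_id, "Boiler A inlet pressure")  # identity preserved
```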

Principle 2: No Data Copies

Photo by Markus Winkler on https://www.pexels.com/photo/people-art-pattern-texture-11404129/

It is of course very important to ensure data is backed up in some way, as indicated in the second blog post in this series. On the other hand, if data is being copied over and over again, this may be needlessly taking up extra space in your database and potentially costing you more money. Eliminating this kind of copying (as discussed here) is a very simple way to keep storage costs down and ensure efficiency within the pipeline.

Cloud service providers have this sort of functionality covered. For example, AWS has a service called Timestream, which allows continual ingestion of time series data for use cases such as analysing trends and anomalies from IoT devices. When data is ingested into Timestream, it automatically rejects any new records which are duplicates of existing ones, but can update records if needed.
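As a rough illustration of that behaviour, here is a minimal boto3 sketch; the database, table, dimension, and measure names are assumptions for illustration. A record with the same dimensions and timestamp as an existing one is rejected as a duplicate unless it carries a higher Version, in which case Timestream treats it as an update:

```python
import time
import boto3

write_client = boto3.client("timestream-write")

record = {
    "Dimensions": [{"Name": "device_id", "Value": "sensor-42"}],
    "MeasureName": "temperature",
    "MeasureValue": "21.5",
    "MeasureValueType": "DOUBLE",
    "Time": str(int(time.time() * 1000)),  # millisecond timestamp
    "Version": 1,  # bump this to update, rather than duplicate, a record
}

write_client.write_records(
    DatabaseName="iot-db",       # illustrative name
    TableName="device-metrics",  # illustrative name
    Records=[record],
)
# Re-sending the same record with Version=1 is rejected as a duplicate;
# re-sending it with Version=2 and a new MeasureValue updates it instead.
```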

Principle 3: Well defined data management roles

Photo by Matteo Vistocco on Unsplash

Being clear on who is responsible for the management and use of the data is key to assuring its privacy, quality, and maintenance. In this article, three roles are defined which cover these areas. However you define these roles, you essentially need to ensure that everyone is clear on the following (a minimal sketch of recording these assignments follows the list):

· Who has read/write/admin access to the data.

· Who is the sole data owner.

· Who manages the data quality.

· Who ensures the data is archived, recovered, maintained, and kept secure.
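One lightweight way to make these responsibilities explicit is to keep a small role registry alongside the data itself. Here is a minimal sketch; the role names, teams, and dataset are hypothetical, chosen only to illustrate the idea:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRoles:
    """Records who is accountable for each aspect of a dataset."""
    dataset: str
    owner: str            # the sole data owner
    quality_steward: str  # manages data quality
    custodian: str        # archiving, recovery, maintenance, security
    read_access: set[str] = field(default_factory=set)
    write_access: set[str] = field(default_factory=set)
    admin_access: set[str] = field(default_factory=set)

incident_reports = DatasetRoles(
    dataset="incident_reports",
    owner="safety-team-lead",
    quality_steward="data-quality-analyst",
    custodian="platform-engineering",
    read_access={"analysts", "safety-team"},
    write_access={"safety-team"},
    admin_access={"platform-engineering"},
)
```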

A good example of this principle being applied is the recently (2019) developed architectural model known as Data Mesh. This architecture is akin to the way microservices work in software development (see the first blog for an explanation of microservices). Instead of having one big monolithic data platform managed by a single team, the Data Mesh framework proposes that parts of the data are owned by the domain which understands them best. Each team then treats its dataset as a product dedicated to a particular business function.

Data Mesh effectively applies many other principles discussed in these blogs such as security and scaling, all of which can be read about here.

In the next (and last) post in this series I will discuss the principles associated with data science in an application. See you there!

To find out about principles for data pipelines, check out our earlier blogs. To find out more about what we do at Empirisys, visit our website: Empirisys.io

If you found this useful, please let us know by getting in touch, or give us a clap or a follow. You can find more about us at empirisys.io, on Twitter at @empirisys, or on LinkedIn. And you can drop us an e-mail at info@empirisys.io, or write directly to the author of this article at alex.white@empirisys.io.
