Data Flow Fundamentals

Ian Stebbins
3 min read · Mar 18, 2024


Art Generated with Stable Diffusion

Introduction

In personal projects and academic coursework in data science and machine learning, working with datasets is a near-everyday occurrence. While platforms such as Kaggle, GitHub, and TensorFlow offer a vast variety of pre-made datasets, real-world data is rarely so clean. Beyond the raw, unstructured nature of real-world data, production-level systems face the challenge of data flow. In modern ML-integrated systems, the movement of data between system components, external entities, and everything in between is an essential design constraint that is often overlooked in the academic space. One of the largest challenges for ML applications in practice is not just model design, but system and data flow architecture.

Basic (and Inefficient) Data Flow: Databases

One of the simplest forms of data flow is quite literally one party writing to a database and another reading from that same database. While this solution is simple, it poses some major issues.

For large-scale applications with lots of data, reading from and writing to databases can be slow and introduce high latency. For many machine learning systems, getting data to the right place as efficiently as possible is a hard requirement.

Another issue with passing data through a database is privacy. If two companies need to exchange data, they would both need access to the same database, which is unrealistic in most cases.
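To make the pattern concrete, here is a minimal sketch of shared-database data flow, assuming SQLite and a made-up `predictions` table: one process writes rows, and another has to poll for them. The polling loop is exactly where the latency and tight coupling show up.

```python
import sqlite3
import time

DB_PATH = "shared.db"  # hypothetical shared database file


def write_prediction(user_id: int, score: float) -> None:
    """Producer side: append a row to the shared table."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS predictions (user_id INTEGER, score REAL, ts REAL)"
        )
        conn.execute(
            "INSERT INTO predictions VALUES (?, ?, ?)", (user_id, score, time.time())
        )


def read_new_predictions(since_ts: float) -> list:
    """Consumer side: poll for rows newer than the last timestamp we saw."""
    with sqlite3.connect(DB_PATH) as conn:
        return conn.execute(
            "SELECT user_id, score, ts FROM predictions WHERE ts > ?", (since_ts,)
        ).fetchall()


if __name__ == "__main__":
    write_prediction(user_id=42, score=0.87)
    # The reader has no way to know when new data arrives; it must poll,
    # which adds latency and ties both parties to the same schema and database.
    print(read_new_predictions(since_ts=0.0))
```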

Request-Driven and Service Oriented Architecture

Rather than passing data through a shared database, it is much better practice to send it directly over a network. This is commonly done in one of two ways: REST (representational state transfer) or RPC (remote procedure call). REST is typically used for CRUD (create, read, update, delete) operations, while RPC is better suited for requests within the same organization or data center, where it can benefit from lower latency and higher throughput.
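As an illustration, here is a minimal REST-style sketch using Flask; the `/features` route, the payload fields, and the in-memory store are all made up, but the pattern, one service exposing data over HTTP for others to request, is the same.

```python
# Minimal REST-style sketch using Flask (route and payload are hypothetical).
from flask import Flask, jsonify, request

app = Flask(__name__)

# Toy in-memory "feature store" standing in for a real data source.
FEATURES = {42: {"avg_session_length": 13.2, "purchases_last_30d": 3}}


@app.get("/features/<int:user_id>")
def get_features(user_id: int):
    """Read (the R in CRUD): return features for one user."""
    return jsonify(FEATURES.get(user_id, {}))


@app.post("/features/<int:user_id>")
def update_features(user_id: int):
    """Create/update: merge the posted JSON into the store."""
    FEATURES.setdefault(user_id, {}).update(request.get_json())
    return jsonify(FEATURES[user_id])


if __name__ == "__main__":
    app.run(port=5000)  # another service (or company) calls this over the network
```

A client in another service would then fetch the data with an ordinary HTTP call, for example `requests.get("http://localhost:5000/features/42")`, with no shared database required.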

In a scenario where companies need to share data, the shared-database problem goes away: each party exposes only the endpoints it chooses, and data is simply passed through requests over the network.

Similarly, this fits well into a service-oriented architecture, where data can be passed between the different microservices within a company that need it. However, as complexity grows across multiple services and more data is passed between them, a request-driven architecture can become both very complicated and slow.

Real-Time Transport

To solve the problem of an ever-growing web of requests within a service-oriented architecture, we can look toward real-time transport. By using a single data broker, services only have to communicate with one entity rather than with a variety of other services. One of the most common implementations of real-time transport is Apache Kafka.

Apache Kafka serves as a centralized data broker within a service-oriented architecture, streamlining communication between microservices by providing a unified platform for data transport. By leveraging Kafka’s real-time transport capabilities, services can efficiently exchange data in a publish-subscribe model, reducing the complexity and latency associated with traditional request-driven architectures. This allows scalability while maintaining high throughput and low latency.
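A minimal publish-subscribe sketch with the `kafka-python` client might look like the following. The broker address and the `user-events` topic are assumptions; a real deployment would add serialization schemas, consumer groups, and error handling.

```python
# Publish-subscribe sketch using kafka-python
# (assumes a broker at localhost:9092 and a made-up "user-events" topic).
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "user-events"

# Producer: a service publishes events to the broker
# instead of calling other services directly.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "event": "click"})
producer.flush()

# Consumer: any interested service subscribes to the topic
# and reads new events at its own pace.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. {'user_id': 42, 'event': 'click'}
    break  # stop after one message for this sketch
```

The key design shift is that producers and consumers no longer know about each other; they only agree on a topic, which is what keeps the architecture simple as the number of services grows.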

Other platforms and managed offerings such as Confluent, Google Cloud Pub/Sub, RabbitMQ, and Amazon Kinesis are also popular throughout the industry.

Takeaways

At its simplest, data flow can be reduced to reading from and writing to a single database. For small and simple machine learning systems, a request-driven architecture may be enough to meet your data flow needs. However, for complex, cutting-edge machine learning systems that involve many services and large volumes of data, real-time transport offers the best chance of building low-latency, high-throughput systems.

