The State of Data Streaming
An abundance of data is generated every day. To create the most value, data needs to reach the relevant systems in real time. Discover the technology driving the real-time data ecosystem. Alexander Heckmann, Working Student for Cloud Enablement & Integration at Porsche, shares his thoughts on the state of data streaming.
The Evolution of Business Data
Before the rise of Big Data and software-driven businesses, business data mostly referred to records of financial statements used to create forecasts or produce management reports. Analysis revolved around whole entities, not the individual process steps where the real value lies.
Long gone are the days when business data consisted only of structured data for accounting or customer relationship management that fits into relational databases and has no timeliness requirements. NoSQL databases emerged, relaxing the constraints on data structure and enabling the storage of semi-structured and unstructured data. While this made systems more flexible in how they store data, they still act on stored master data; propagating business data in real time remained unsolved.
Tables & Streams
This is where streaming data differs: events can be semi-structured and are mostly represented in data exchange formats such as JSON or XML. In contrast to the aforementioned master data, events are atomic in the sense that they carry only an update to an already existing record; the context has to be gained by joining the event with data from external systems. There is no fixed schema: each event can contain a new value for a topic that, combined with an entity's key, adds up to a fully described data entity. That said, enforcing a schema through a Schema Registry is considered best practice: it ensures producer-consumer compatibility while keeping producers and consumers decoupled. Apache Kafka is the de facto standard for applying the principles of data streaming.
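The idea of partial, keyed events folding into a fully described entity can be sketched in a few lines. This is a hedged illustration, not Kafka consumer code; the event shape and field names (`key`, `value`, `car-42`) are illustrative assumptions.

```python
import json

# Each event carries only the update for one entity, keyed by its id.
events = [
    '{"key": "car-42", "value": {"color": "GT Silver"}}',
    '{"key": "car-42", "value": {"engine": "4.0L flat-six"}}',
    '{"key": "car-42", "value": {"color": "Carmine Red"}}',  # a later update overrides
]

entities: dict[str, dict] = {}
for raw in events:
    event = json.loads(raw)
    # merge the partial update into the entity's accumulated state
    entities.setdefault(event["key"], {}).update(event["value"])
```

After replaying all events, `entities["car-42"]` holds the latest value for every field, which is exactly how a compacted Kafka topic keyed by entity id behaves.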
This illustrates the relationship between data in motion and data at rest. While stored data represents the state of data objects at the time of use, events represent the changes made to them; the two can be used in a symbiotic way.
This symbiosis underpins one of the core use cases of data streaming: Change Data Capture (CDC). Events contain database changes so they can be propagated to other systems. This provides a scalable and efficient way of creating database replicas, extracting data into operational data stores, or even migrating production databases to the cloud.
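A minimal sketch of how a replica applies CDC events, assuming a simplified event shape (`op`, `key`, `row`) loosely modeled on Debezium-style change events — the field names are assumptions, not a specific tool's format:

```python
def apply_cdc(replica: dict, event: dict) -> None:
    """Apply a single change event to an in-memory replica table."""
    op = event["op"]
    if op in ("c", "u"):          # create or update: upsert the row
        replica[event["key"]] = event["row"]
    elif op == "d":               # delete: drop the row if present
        replica.pop(event["key"], None)

replica: dict = {}
apply_cdc(replica, {"op": "c", "key": 1, "row": {"vin": "WP0ZZZ", "model": "911"}})
apply_cdc(replica, {"op": "u", "key": 1, "row": {"vin": "WP0ZZZ", "model": "911 GT3"}})
# replica now mirrors the source table's latest state for key 1
```

Because events arrive in order per key, replaying the change stream from the beginning reconstructs the table at any point in time.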
Often, the raw data delivered cannot create business value on its own because it is duplicated or lacks context. An illustrative example is IoT data, where sensors deliver measurements once a second. One way to deduplicate is to add a stream processing job as an intermediate data sink that actively filters incoming events. To add context, events can be enriched with static data from data stores or by joining them with other events within a given timeframe. Using the Kafka Streams API, one could extract data from OLTP databases and enrich this relational data to denormalize it in real time so that it fits the schema of a data warehouse.
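The two steps above — filtering duplicates, then enriching with static reference data — can be sketched as plain generator functions. This is a hedged toy model, not Kafka Streams; all field names (`sensor_id`, `value`, `location`) are illustrative assumptions.

```python
def deduplicate(events, seen=None):
    """Drop events whose (sensor, value) fingerprint was already seen."""
    seen = set() if seen is None else seen
    for e in events:
        fingerprint = (e["sensor_id"], e["value"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield e

def enrich(events, reference):
    """Join each event with static master data for its sensor."""
    for e in events:
        yield {**e, **reference.get(e["sensor_id"], {})}

sensor_master = {"s1": {"location": "plant Zuffenhausen"}}  # static lookup table

readings = [
    {"sensor_id": "s1", "value": 21.5},
    {"sensor_id": "s1", "value": 21.5},  # duplicate reading, filtered out
    {"sensor_id": "s1", "value": 21.7},
]
enriched = list(enrich(deduplicate(readings), sensor_master))
```

A real stream processor would scope the `seen` set to a time window rather than keeping it forever, so legitimate repeated values far apart in time still pass through.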
Kafka works by appending new events to a partition's log. This provides a fail-safe way of storing data with guaranteed ordering within a partition, creating a centralized source of truth that makes data streaming platforms a natural, integral part of system architectures. Data streaming enables asynchronous communication between components without them needing to know their upstream or downstream clients. The request-response communication of traditional APIs may be required for some use cases, but often it is not: the underlying events the systems process are coordinated, yet asynchronous, interactions. The result is a shift from entangled systems to a decoupled architecture with a common interface, which lets teams build a scalable, reliable infrastructure for a high throughput of events. One downside: while client libraries exist for data streaming platforms such as Apache Kafka, not every language has full-fledged native Kafka connectivity. This is where a proxy providing a REST API can come into play.
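The append-only, partitioned log can be modeled in a few lines. This is a deliberately simplified sketch of the concept, not Kafka's actual implementation: per-partition ordering is guaranteed because appends only ever go to the end, and a key always hashes to the same partition.

```python
class PartitionedLog:
    """Toy model of an append-only log split into partitions."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: dict) -> tuple[int, int]:
        # the same key always hashes to the same partition,
        # so all events for one entity stay in order
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

log = PartitionedLog(num_partitions=3)
p1, o1 = log.append("car-42", {"event": "built"})
p2, o2 = log.append("car-42", {"event": "sold"})
```

Consumers then read each partition sequentially by offset, which is what makes replaying history and guaranteeing per-key ordering possible.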
A REST API opens up many use cases that data streaming could not cover otherwise. Some clients, like mobile apps, are not built with data streaming in mind: their environment does not allow them to stay connected to a broker for the whole session to use Kafka as middleware. This is where stateless integration via HTTP shines. With HTTP, it is easier to avoid data loss and keep the app functional in rural areas with limited or no bandwidth than with a WebSocket implementation. In this case, mobile apps would not connect to Kafka directly but send events to a REST API instead. A REST API is therefore a good addition for broader client integration.
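As a sketch, a mobile client could hand events to a Kafka REST proxy with a plain HTTP POST instead of speaking the Kafka protocol. The payload shape below follows the Confluent REST Proxy v2 convention (`records` wrapping keyed values); the URL and topic name are placeholder assumptions.

```python
import json

def build_proxy_request(topic: str, events: list[dict]) -> dict:
    """Build the parts of an HTTP request for a Kafka REST proxy."""
    return {
        # hypothetical proxy endpoint; topic name goes in the path
        "url": f"https://kafka-proxy.example.com/topics/{topic}",
        "headers": {"Content-Type": "application/vnd.kafka.json.v2+json"},
        "body": json.dumps({"records": [{"value": e} for e in events]}),
    }

req = build_proxy_request("app-events", [{"screen": "home", "action": "open"}])
# an HTTP client (e.g. requests.post) would then send req["body"]
# to req["url"] with req["headers"]
```

Because each POST is self-contained, the app can buffer events locally during an outage and flush them once connectivity returns — the stateless property the paragraph above describes.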
Another use case for a REST API is connecting systems with no native support for data streaming, such as legacy systems, proprietary software, or clients written in unsupported languages. These systems benefit from the HTTP approach, as it is far easier to implement.
Putting it to use
One example use case here at Porsche that benefits from this abundance of options for using Kafka as the central nervous system is the Porsche Digital Twin. This digital representation of the physical car aggregates data from a multitude of systems across the car's whole lifecycle, so that every detail is stored or can be derived, ready for further use. Globally distributed systems update the data entries whenever a real-life event concerning the car happens: master data from development or production plants, transactional data from digital after-sales services, events derived from sensor data, or workshop visits.
On an organizational level, Kafka supports domain-driven design (DDD), and the Porsche Digital Twin is a prime example of it. In the long term, Porsche plans to organize itself around data domains instead of silos. Data domains group data objects with common characteristics, loosening domain boundaries for cross-functional data usage. All product teams use the same clusters, and events generated by one team may be consumed by other teams for completely different use cases. With all these integration possibilities, Kafka is a perfect fit for domain-driven design, enabling Porsche to transform into a process-oriented enterprise and move into the digital future.
About this publication: Where innovation meets tradition. There’s more to Porsche than sports cars — we are developing new digital products and services, always with our customers in focus. On our Medium blog, we tell these stories. It’s about our #nextvisions, emerging technologies, and the people who drive our digital journey. If you want to know more, follow us on Twitter, Instagram and LinkedIn.