Journey to Event-Driven — Part 4: Four Pillars of Event Streaming Microservices | Apache Kafka


Overview

  • Four pillars of event streaming
  • Pillar 1 — Business function: Payment processing pipeline
  • Pillar 2 — Instrumentation plane: Business metrics
  • Pillar 3 — Control plane: Flow control, start, pause, bootstrap, scale and rate limit
  • Pillar 4 — Operational plane: Event logging, DLQs and automation
  • Deployment model
  • Parting thoughts and preparing for the future

Four pillars of event streaming

Recognizing architectural pillars and patterns lets us separate concerns and keep complexity in check:
  1. Business function: The payment processing pipeline
  2. Instrumentation plane: How much money is transferred? How quickly are payments processed?
  3. Control plane: Patterns to start, stop, pause, coordinate and autoscale
  4. Operational plane: Dead letter queues (DLQs), error tracking & handling, logging and NoOps
The four pillars of event streaming architecture applied to our system, details excluded

1. Business function

The business function represents the core functionality being built. It provides business value that is immediate and beneficial to the organization. Examples include payment processing, logistics, data processing (ETL) and so on. This function can be developed as a set of events and related dataflows across one or more bounded contexts.

2. Instrumentation plane

The role of the instrumentation plane is to capture metrics that prove the business function is performing as expected. Metrics have a variety of uses: they might drive alerts, assist in capacity planning or feed infrastructure automation functions such as auto-scaling.

3. Control plane

An often overlooked aspect of many systems is their ability to control the flow of events. Unlike batch systems, event streaming applications continuously process data. This means that while it might be possible to perform rolling upgrades, the system sometimes needs to halt processing, perform an action, such as an upgrade or logic change, and then resume processing. The control plane becomes essential when outages have occurred and a large-scale system must get back online in a coordinated fashion, perhaps with incremental, restricted functionality.

4. Operational plane

Developing the system for the “happy path” is easy; it is catering for the remaining paths that compromises simplicity. We rely on underlying functionality such as the container scheduler’s “restart on failure” semantics and the workload feedback cycles that drive elasticity, and we need to understand all of this and more. We also need a model for capturing operational processes, such as rolling restarts and wipe-and-update procedures, and to codify them as automation scripts that drive the infrastructure.

Building the KPay payment system

KPay is a demonstration application that shows how to apply our event streaming pillars. It is a fully functioning payment system, providing a user interface (pictured below), a web tier, a series of Kafka Streams processors and a payment generator for load testing. The payment generator and other parts of the web interface are exposed using Swagger (also pictured below). The source code comprises the following packages, which map onto the pillars:

  • payments: the business function of payment processing
  • metrics: the instrumentation plane to monitor payment throughput and latency
  • control: the control plane to stop and start payment processing
Swagger is used to document the REST interface and expose the control plane.

Pillar 1 — Business function: Payment processing pipeline

Payment processing is an interesting problem. It must be deterministic, reliable, idempotent and scalable, and, just as importantly, it can be user-driven or machine-driven. In either case, there are challenges relating to latency and scale. On the user side, latency means person X gets frustrated at the “busy spinner” and clicks the back button on a web form, or the purchase of a major sporting event ticket lands them on an error page: did it take my money or not? For synchronous machine-driven payments, latency on the client side can have a significant knock-on effect, forcing systematic retries or timeouts. If the client supports asynchronous interaction, be it machine or human, then the solution becomes simpler. The payment pipeline needs to:

  • Scale to millions of payments
  • Perform asynchronous processing
  • Track payments inflight (total value of payments being processed)
  • Remove/debit money from an account when handling a debit event
  • Add/credit money to an account when handling a credit event
  • Track payments confirmed today (total value of payments successfully processed)
Payment processing pipeline
  1. Payment requests flow from the payment.incoming topic into the payments inflight processor. It updates the total balance of payments currently being processed (similar to a read-uncommitted view), then sends the payment request on to the payment.inflight topic, keyed by the from account and with the event type set to DEBIT.
  2. An account processor receives the DEBIT request, validates that the payment can be processed and updates the user's balance (KTable). It then emits another event, keyed against the to account, and sets the event type to CREDIT.
  3. Another account processor receives the CREDIT event, updating the user's balance in the KTable and emitting the payment-complete event.
  4. The payments confirmed processor updates the total confirmed balance, which is similar to the read-committed value. (The whole flow is sketched in code below.)
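
To make the steps concrete, here is a rough sketch of the pipeline as two small Kafka Streams topologies. It is not the KPay source itself: the Payment class and Serde are simplified stand-ins, the payment.incoming topic name is taken from step 1 as an assumption, and the balance KTables, validation and metrics are omitted.

```java
import java.math.BigDecimal;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class PaymentPipelineSketch {

    // Minimal stand-in for KPay's Payment event (the real class carries more fields).
    public static class Payment {
        public enum Type { DEBIT, CREDIT, COMPLETE }
        public String id, from, to;
        public BigDecimal amount;
        public Type type;

        public Payment withType(Type newType) { this.type = newType; return this; }
    }

    // Step 1: payments inflight processor. Re-keys each request by the "from" account
    // and marks it as a DEBIT before sending it on to payment.inflight.
    public static Topology paymentsInflight(Serde<Payment> paymentSerde) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("payment.incoming", Consumed.with(Serdes.String(), paymentSerde))
               .map((id, p) -> KeyValue.pair(p.from, p.withType(Payment.Type.DEBIT)))
               .to("payment.inflight", Produced.with(Serdes.String(), paymentSerde));
        return builder.build();
    }

    // Steps 2-3: account processor. A DEBIT is re-emitted as a CREDIT keyed by the "to"
    // account; a CREDIT completes the payment. Balance KTables and validation are omitted.
    public static Topology accountProcessor(Serde<Payment> paymentSerde) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Payment> inflight =
                builder.stream("payment.inflight", Consumed.with(Serdes.String(), paymentSerde));

        inflight.filter((account, p) -> p.type == Payment.Type.DEBIT)
                .map((account, p) -> KeyValue.pair(p.to, p.withType(Payment.Type.CREDIT)))
                .to("payment.inflight", Produced.with(Serdes.String(), paymentSerde));

        inflight.filter((account, p) -> p.type == Payment.Type.CREDIT)
                .mapValues(p -> p.withType(Payment.Type.COMPLETE))
                .to("payment.complete", Produced.with(Serdes.String(), paymentSerde));
        return builder.build();
    }
}
```

Note that the account processor writes CREDIT events back to its own source topic; that re-key through the broker is what routes the credit to the instance owning the to account.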

Event flow breakdown

1. Payments inflight

Payments inflight acts as a gateway for the business flow. Its responsibility is to track business-level metrics that are important to application users, including the total payment value inflight and the number of payments currently being processed. In contrast, instrumentation metrics relate to the application support team and might include latency percentiles, payment throughput and outliers.

2–3. Account Processor

The account processor maintains the value of a user’s account by handling debit and credit payments. Depending upon the type of payment being processed, a subsequent payment event will be emitted to the payment.inflight or payment.complete topic. Interactive queries display the AccountBalance values through the user interface.

4. Balance confirmed

The final stage is confirmation. Payments emitted from this stage have been successfully processed.

Kafka Streams interactive queries scaffolding

Kafka Streams provides an endpoint-agnostic means of building your own interactive queries layer. This supports real-time discovery of endpoints as processors scale up or down.
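
As a sketch of that scaffolding (the store name, value type and forwarding endpoint here are made up, and the API names have shifted slightly across Kafka versions; this uses the classic 2.x calls), the pattern boils down to: find which instance owns the key, query the local state store if it is us, otherwise forward the request to the owner.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.HostInfo;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
import org.apache.kafka.streams.state.StreamsMetadata;

public class AccountBalanceQuery {

    private final KafkaStreams streams;
    private final HostInfo thisHost;                           // host:port this instance advertises
    private final String storeName = "account-balance-store";  // hypothetical store name

    public AccountBalanceQuery(KafkaStreams streams, HostInfo thisHost) {
        this.streams = streams;
        this.thisHost = thisHost;
    }

    // Look up an account balance, routing the query to the instance that owns the key.
    public Object balanceFor(String accountId) {
        StreamsMetadata owner =
                streams.metadataForKey(storeName, accountId, Serdes.String().serializer());

        if (thisHost.equals(owner.hostInfo())) {
            // The key is hosted locally: query the local state store directly.
            ReadOnlyKeyValueStore<String, Object> store =
                    streams.store(storeName, QueryableStoreTypes.keyValueStore());
            return store.get(accountId);
        }
        // Otherwise forward the request to the owning instance's REST endpoint.
        return forwardTo(owner.hostInfo(), accountId);
    }

    private Object forwardTo(HostInfo host, String accountId) {
        // HTTP forwarding is left to the web tier in this sketch.
        throw new UnsupportedOperationException("forward to " + host.host() + ":" + host.port());
    }
}
```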

Possible enhancements to the business function include:
  • Error handling: Implement sagas to handle unwinding of failed payments, for example when corruption or fraudulent activity is detected
  • Pre-payment validation: Add a preliminary stage to support scenarios such as detecting and handling suspicious or fraudulent activity
  • Restrict dependent payments: When it makes sense, only allow a single debit payment for a single account to be inflight at any point in time
  • Limits: Introduce limit checks against accounts with low balances

Pillar 2 — Instrumentation plane: Business metrics

There is a plethora of monitoring tools that provide valuable functionality, but most of them focus on bottom-up or infrastructure metrics. The instrumentation plane captures business-level targets such as the following (a windowed-aggregation sketch follows the list):

  • Process payments in < 200 milliseconds (latency)
  • Throughput > 1,000 payments per second
  • Report the total count and dollar amount (the demo code uses a one-minute window)
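
Here is a hedged sketch of how the count-and-dollar-amount metric could be computed with a one-minute windowed aggregation. The topic, value types and the PaymentStats serde are assumptions, and the latency-percentile side (comparing event time to processing time) is omitted.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PaymentMetricsSketch {

    // Aggregate carried per window: how many payments and how much money.
    public static class PaymentStats {
        public long count;
        public double totalDollars;

        public PaymentStats add(double amount) {
            count++;
            totalDollars += amount;
            return this;
        }
    }

    // Count and total the confirmed payments in one-minute windows. Assumes the stream has
    // already been mapped down to (account -> amount); statsSerde is a hypothetical serde.
    public static Topology metricsTopology(Serde<PaymentStats> statsSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("payment.complete", Consumed.with(Serdes.String(), Serdes.Double()))
               // A single grouping key gives one aggregate per window across all payments
               // (fine for a demo; it funnels the aggregate through one partition).
               .groupBy((account, amount) -> "all-payments",
                        Grouped.with(Serdes.String(), Serdes.Double()))
               .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
               .aggregate(PaymentStats::new,
                          (key, amount, stats) -> stats.add(amount),
                          Materialized.with(Serdes.String(), statsSerde))
               .toStream()
               // Emit each updated window; in KPay these feed the UI rather than stdout.
               .foreach((window, stats) ->
                       System.out.printf("window.start=%d count=%d total=$%.2f%n",
                               window.window().start(), stats.count, stats.totalDollars));
        return builder.build();
    }
}
```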

Pillar 3 — Control plane: Flow control, start, pause, bootstrap, scale and rate limit

The third pillar of the event streaming architecture is the control plane, which drives and coordinates the system. The demo allows you to halt and resume the processing of payments; halting is particularly valuable because it allows the pipeline to complete inflight processing. This essentially provides a flush (or drain) of the current set of inflight payments, which can be useful when there is a topology or logic change (one way to implement this is sketched after the list below).

  • Pause processing and allow existing payments to be completed (drained)
  • Resume processing from the paused state
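
One straightforward way to get pause-with-drain semantics, sketched below rather than lifted from the KPay code: stop only the intake (payments inflight) processor, and the downstream account processors keep running until everything already on payment.inflight has drained through to completion.

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

// Control-plane sketch: pause intake of new payments while letting inflight ones drain.
public class PaymentIntakeController {

    private final Topology intakeTopology; // topology of the payments-inflight processor only
    private final Properties config;
    private KafkaStreams intake;

    public PaymentIntakeController(Topology intakeTopology, Properties config) {
        this.intakeTopology = intakeTopology;
        this.config = config;
    }

    // Stop consuming new payment requests; a clean close commits offsets, and the
    // downstream account processors continue draining payment.inflight.
    public synchronized void pause() {
        if (intake != null) {
            intake.close();
            intake = null;
        }
    }

    // A closed KafkaStreams instance cannot be restarted, so resume builds a fresh one.
    public synchronized void resume() {
        if (intake == null) {
            intake = new KafkaStreams(intakeTopology, config);
            intake.start();
        }
    }
}
```

In KPay, the equivalent operations are exposed through the Swagger-documented REST interface shown earlier.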

Pillar 4 — Operational plane: Event logging, DLQs and automation

Operations aren’t a new concept; however, the new age of NoOps, configuration management, automation and the like gathers operational concerns under the banner of DevOps. I will provide a view of the operational dataflows that can be used to support this pillar. To state the obvious, the operational plane is entirely dependent on the instrumentation plane for observability and on the control plane for coordination. Operational aspects include upgrade processes, 24×7 runtime support, evolutionary support (e.g., new processors or data evolution) and, of course, the infrastructure to handle normal operations. The dataflow patterns normally introduced are summarized below:

  • Application logs: Each microservice will use a log appender to selectively send log messages to a specific topic that is relevant to the application context. The log message will also contain metadata about the runtime, perhaps including the input topic and output topic, consumer group as well as the source event information. These logs are relevant to business events and should be tied to the bounded context.
  • Error/warning logs: Similar to above, but the focus is on error/warning log events and sending to the relevant topic. It is also useful to build a log context that accumulates relevant information to enrich the log message if an error/warning does occur.
  • Audit logs: Each microservice will capture a security context (e.g., user, group, credential or reason) as well as event context (i.e., event ID, event origin, source and destination topic) and emit to a relevant, contextually determined audit log topic.
  • Lineage: Similar to above, except that the event will contain trace information about its journey: for example, each processor it has passed through, or the context it has accessed or modified along its dataflow.
  • Dead letter queues: These provide a place to store raw events that cannot be processed. Processing failure can be due to any number of reasons, from deserialization failure to invalid values or invalid references (join failure, etc.). In any case, the purpose of dead letter queues is to make you aware of the error so you can fix it. The dead letter queue is the last line of defense and might indicate a gap in the CI/CD process: a scenario has occurred that was not catered for. The dead letter queue should always be empty. In some cases, a saga pattern is needed to unwind the preceding state that caused the error. Remember, events must be processed idempotently. (A minimal handler is sketched after this list.)
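
For the deserialization-failure case specifically, Kafka Streams lets you plug in a DeserializationExceptionHandler. The sketch below forwards the raw record to a hypothetical payment.dlq topic; the topic name and the choice to CONTINUE rather than FAIL are assumptions, not the KPay implementation.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.DeserializationExceptionHandler;
import org.apache.kafka.streams.processor.ProcessorContext;

// Register with:
//   props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
//             DeadLetterQueueHandler.class);
public class DeadLetterQueueHandler implements DeserializationExceptionHandler {

    private KafkaProducer<byte[], byte[]> dlqProducer;

    @Override
    public void configure(Map<String, ?> configs) {
        // The streams configuration (including bootstrap.servers) is handed to the handler.
        Map<String, Object> producerConfig = new HashMap<>();
        producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                configs.get(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG));
        producerConfig.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        producerConfig.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        dlqProducer = new KafkaProducer<>(producerConfig);
    }

    @Override
    public DeserializationHandlerResponse handle(ProcessorContext context,
                                                 ConsumerRecord<byte[], byte[]> record,
                                                 Exception exception) {
        // Forward the raw bytes untouched so nothing is lost; source topic/partition/offset
        // and the exception message could be attached as headers for triage.
        dlqProducer.send(new ProducerRecord<>("payment.dlq", record.key(), record.value()));
        // CONTINUE keeps the pipeline running; FAIL would stop the processor and alert instead.
        return DeserializationHandlerResponse.CONTINUE;
    }
}
```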

Putting it all together

The quick and lazy way to run the demo is within your IDE (captured below). To run it in a more robust manner, you can Dockerize the relevant components. Each stream processor and the web tier contain their own main() method, which can be used to run the appropriate class within the Docker image. To see how all stream processing microservices run within a monolith, refer to KPayAllInOneImpl.

  1. git clone git@github.com:confluentinc/demo-scene.git
  2. Run IntelliJ, then open ./scalable-payment-processing/ and import the Maven pom.xml file
  3. Run/debug the RestEndpointIntegrationTest.runServerForABit()
  4. Point the browser to http://localhost:8080/ui/index.html#

Deployment model

If you choose to run the payment system within Kubernetes and Docker, each plane ends up running separately. In the deployment below, four account processors are running. The web tier queries each AccountProcessor, using the Kafka Streams metadata (stream.metadata) to interact with each instance as a key-value store, and returns the JSON data for rendering in the browser.

The demo also cuts some corners that a production system would need to address:
  • Event objects are exposed through the web tier. This is bad because they should be contained within the bounded context. A translation layer into view POJOs would let the web clients and the bounded context evolve independently (sketched after this list). In short, don’t bleed the domain model (events) outside of a bounded context.
  • There is virtually no error handling; we would need dead letter queues, error queues and a saga implementation.
  • There is no configuration or environmental management to support automation, which is required for topic configuration, broker connectivity, security and so on.
  • There is a lack of versioning.
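
On the first point, here is a sketch of the kind of translation layer that keeps the domain model inside the bounded context. The class and field names are illustrative, reusing the Payment stand-in from the pipeline sketch earlier.

```java
// View object owned by the web tier; it evolves independently of the Payment event.
public class PaymentView {

    public final String id;
    public final String status;
    public final String amount;

    public PaymentView(String id, String status, String amount) {
        this.id = id;
        this.status = status;
        this.amount = amount;
    }

    // Translate at the boundary: the domain event never leaves the bounded context,
    // only this view does.
    public static PaymentView from(PaymentPipelineSketch.Payment payment) {
        return new PaymentView(payment.id, payment.type.name(), payment.amount.toPlainString());
    }
}
```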

Parting thoughts and preparing for the future

  1. Core business flow: business functionality
  2. Instrumentation plane: metrics and observability
  3. Control plane: start, stop and scale
  4. Operational plane: application support, error handling and dead letter queues

Interested in learning more?

Head over to liquidlabs.com and see what services we offer, as well as other blog posts about technology (not just Kafka, but also serverless and open source projects).

Neil Avery

@Liquidlabs, ex Confluent, CTO-Excelian, Thoughtworks. Founder Logscape, Startups, code-monkey, songwriter, tennis-nut, snowboarding-nut, and Dad!