Stories by Dr. Yaroslav Zhbankov on Medium

Prompt-Driven Development as a Software Engineering Process

Dr. Yaroslav Zhbankov — Wed, 13 May 2026 03:11:55 GMT

Most “I used AI to build an app” stories are accounts of vibe coding, a fragile process where success depends on the model’s mood and the user’s luck. Vibe coding doesn’t scale because it lacks a feedback loop.

The core thesis: AI-driven development only becomes a reliable engineering process when automated verification gates are inserted between every generation step.

To move beyond one-shot prompts, we must treat AI interaction as an AI Compiler Pipeline. In this model, the human provides the high-level intent, and the AI acts as a multi-stage transformer moving through Specification, Verification, and Execution.

The Problem

I correct text constantly. Grammar checks, tone adjustments, translations. Every time: open browser, navigate to ChatGPT/Gemini, type the prompt, paste the text, copy the result. Dozens of times per day.

I wanted a tool that reduces this to: hotkey → paste → Enter → done. A floating overlay, like Spotlight, that lives in the menu bar and calls an AI API directly.

The AI Compiler Pipeline: Spec → Verify → Execute

The core principle is simple: Never go from idea to code in one step. Instead, treat the LLM as a series of specialized agents operating within a pipeline of Prompt Contracts.

Phase 1: Specification (Design). Produce a design document. Subject it to a verification gate for gaps and risks.
Phase 2: Verification (Prompt Generation). Generate atomic, executable instructions. Subject them to a verification gate for logic and dependencies.
Phase 3: Execution (Agentic Build). Feed validated prompts to a builder agent. Use automated checkpoints to gate progress.

Phase 1: Specification (Design)

I started a conversation with Claude about what I wanted. We discussed technology choices, UX patterns, and constraints. The first design wasn’t the one I built from — each analysis pass caught decisions that would have become bugs.

I initially wanted double-Shift as the hotkey (like IntelliJ). Analysis revealed three layers of risk: a native binding package that breaks during Electron packaging, mandatory Accessibility permissions, and false positives when typing consecutive capitals. We switched to Cmd+Shift+G — zero dependencies, zero risk. The initial design hid the overlay on blur; analysis caught that users would click away to copy text and lose the overlay. We changed to Escape-only dismissal.

By the time I moved to prompt generation, the architecture was validated against real usage patterns, not assumptions.

Phase 2: Verification (Prompt Generation)

When you ask an AI to “build the whole app”, the model’s focus degrades as the context window fills with boilerplate. By breaking the build into a pipeline of Prompt Contracts, you ensure the AI agent always has 100% focus on a 5% problem.

A bug caught in the Specification phase costs one sentence to fix. The same bug caught during Execution costs a full debugging session. We use Verification Gates to catch errors at their cheapest point.

Requirements for Prompts

Before generating any prompts, I established rules:

Each prompt produces one testable feature. Settings is one prompt. API calls are another. Never both. The agent focuses better with a single objective, and if something breaks, you know exactly which prompt caused it.

Each prompt includes a checkpoint. Automated checks (TypeScript compilation, grep for expected code patterns) and manual tests (press this key, expect that result, verify this edge case).

Each prompt is self-contained. The agent should be able to execute it without referencing other prompts. All file paths, function signatures, CSS values, API headers are explicit.

Negative constraints are explicit. “Do NOT hide on blur.” “Do NOT call app.hide() if the settings window is open”. These prevent the agent from applying common patterns that would break the specific UX.
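
One way to make these rules checkable is to represent each prompt as a small record and lint it before it reaches the agent. The sketch below is only an illustration: the `PromptContract` class, its field names, and the `lint` function are hypothetical and are not taken from the published prompts.

```python
from dataclasses import dataclass, field


@dataclass
class PromptContract:
    """Hypothetical record for one prompt; all field names are illustrative."""
    feature: str                       # the single testable feature this prompt builds
    instructions: str                  # full, self-contained instructions for the agent
    automated_checks: list[str] = field(default_factory=list)      # shell commands
    manual_tests: list[str] = field(default_factory=list)          # "do X, expect Y"
    negative_constraints: list[str] = field(default_factory=list)  # "Do NOT ..."


def lint(contract: PromptContract) -> list[str]:
    """Return the rule violations found; an empty list means the contract passes."""
    problems = []
    if not contract.automated_checks and not contract.manual_tests:
        problems.append("no checkpoint")
    if not contract.negative_constraints:
        problems.append("no negative constraints")
    # Crude self-containment heuristic: flag references to other prompts.
    if "see prompt" in contract.instructions.lower():
        problems.append("references another prompt")
    return problems
```

A lint like this cannot judge whether a prompt is good, but it cheaply rejects prompts that skip a checkpoint or lean on context the agent will not have.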

Analysing the Prompts

The first set of prompts looked reasonable. Then I asked Claude to analyse them as if it were the agent executing them — and each pass caught 3–5 issues that would have caused failures during execution.

Missing dependency installs that the agent might not catch on its own. An incorrect Accessibility permissions check for an API that doesn’t require it. A vague tray icon instruction (“use any default icon”) that would error because there is no default. A DMG maker package not installed by default with the template — meaning pnpm run make would fail at the very end of the build. A window resize instruction that didn't specify the IPC round-trip between renderer and main process.

After three analysis passes, the prompts were tight enough to execute.

The Checkpoint Design

The checkpoints aren’t an afterthought — they’re the mechanism that makes the entire approach work. Without them, the agent declares “done” when the code compiles, which is a very different thing from “working”.

A good checkpoint has three layers:

Structural checks verify the code exists and is correct:

npx tsc --noEmit                              # zero TypeScript errors
grep -n "globalShortcut.register" src/main.ts  # registration code exists
grep -n "x-api-key" src/main.ts                # API header present
! grep -rn ": any" src/                        # passes only if no lazy type annotations are found

Functional tests verify behaviour:

Cmd+Shift+G → overlay appears, input is focused
Type text → Enter → "Thinking..." → corrected result appears
Escape → overlay hides, focus returns to previous app

Edge case tests verify robustness:

Empty API key → warning placeholder, input disabled
Invalid API key → red error message with 401 status
No network → timeout error after 30 seconds
Escape during loading → does nothing (prevents closing mid-request)

The agent runs all three layers before proceeding. If any check fails, it fixes the issue and re-runs. This is what prevents the compounding errors that sink single-prompt approaches — where a small mistake in step 3 becomes an unfixable mess by step 7.
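
The check, fix, and re-run loop can be sketched as a small gate function. This is a minimal illustration under stated assumptions: `run_gate`, the `fix` callback, and the retry limit of three are hypothetical names and values, and in a real build the checks would wrap commands such as the `tsc` and `grep` calls above.

```python
from typing import Callable

Check = Callable[[], bool]  # returns True when the check passes


def run_gate(checks: dict[str, Check], fix: Callable[[list[str]], None],
             max_attempts: int = 3) -> bool:
    """Run every check; on failure call `fix` with the failed names and retry."""
    for attempt in range(1, max_attempts + 1):
        failed = [name for name, check in checks.items() if not check()]
        if not failed:
            return True              # gate passed: safe to move to the next prompt
        if attempt < max_attempts:
            fix(failed)              # e.g. hand the failures back to the agent
    return False                     # still failing: stop and escalate to a human
```

Returning `False` instead of looping forever keeps a stuck agent from piling more changes on top of a broken step.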

Phase 3: Execution (Agentic Build)

With validated prompts in hand, I fed them one at a time to Claude Code. Nine prompts built the core.

Most prompts executed cleanly on the first pass. When they didn’t, the feedback loop kicked in, and this loop is a core part of the methodology, not an exception to it. The prompts are available on GitHub.

Extending the Application

After the core was working, additional prompts extended it:
Multi-provider support added Anthropic, OpenAI, and local server (Apfel, Ollama, LM Studio) integration. This was too complex for one prompt, so I split it following the principle of data → logic → UI: first the settings schema with migration, then the API call functions for all three providers, then the settings UI with provider-switching tabs. Each prompt was independently testable.
macOS Services integration added a right-click context menu option in any app. Select text → right-click → “Check with Quick Prompt” → overlay opens with the text. This required an Automator workflow that writes selected text to a trigger file, and a file watcher in the Electron app. Two mechanisms, one prompt, extensive checkpoints.

Each extension followed the same three-phase pattern: design → generate prompt with checkpoints → execute and verify.

The Result

Fig. 1. Quick Prompt application console

Fig. 2. Quick Prompt application settings window

One evening of work produced a distributable .dmg. The application opens with Cmd+Shift+G from any app including fullscreen, auto-pastes clipboard text, sends it to Anthropic, OpenAI, or any local AI server, shows corrected text, copies it to clipboard, and works via right-click context menu through macOS Services.

Conclusion: The Philosophical Shift

The future of software engineering isn’t “talking to machines”; it’s designing pipelines that verify machines.

Prompt-Driven Development shifts the engineer’s role from a Writer to an Architect and Auditor. By implementing staged generation and strict verification gates, we turn the unpredictable “magic” of LLMs into a deterministic engineering process. The AI Compiler Pipeline doesn’t just write code faster — it writes code you can actually trust.

Streaming Essentials: Types, Architectures, and Best Practices

Dr. Yaroslav Zhbankov — Tue, 31 Mar 2026 03:37:34 GMT

Introduction

Streaming is a foundational technique in modern computing that enables continuous, real-time processing and delivery of data. Rather than waiting for an entire dataset to be available before acting on it, streaming allows systems to process data incrementally, as it arrives. From video playback and financial market feeds to real-time analytics and event-driven microservices, streaming powers a vast range of applications that demand low latency and high throughput.

As systems grow in scale and users expect instant feedback, understanding streaming — its types, architectures, trade-offs, and technologies — becomes essential for any software engineer or system designer.

This article gives an overview of the different types of streaming, along with processing models, architectures, technologies, use cases, and trade-offs.

What is Streaming?

Streaming is a method of transmitting or processing data continuously in a sequence of small, manageable chunks (often called events, records, or frames) rather than as a single, complete batch. The data is produced, transmitted, and consumed incrementally — often in real time or near real time.

There are two primary domains where streaming is applied:

Data Streaming: Continuous ingestion and processing of data records/events (e.g., sensor readings, clickstream events, log entries).
Media Streaming: Continuous delivery of audio or video content to end users (e.g., Netflix, Spotify, YouTube).

Both share the same core principle: deliver data incrementally as it becomes available, rather than waiting for the entire payload to be ready.
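
The difference between waiting for the whole payload and acting on each piece can be shown with a generator. This is a minimal sketch; `sensor`, `batch_total`, and `running_totals` are hypothetical names, and the three hard-coded readings stand in for a source that in practice may never end.

```python
from typing import Iterable, Iterator


def batch_total(readings: list[float]) -> float:
    """Batch style: needs the whole dataset before producing anything."""
    return sum(readings)


def running_totals(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result after every record."""
    total = 0.0
    for value in readings:
        total += value
        yield total


def sensor() -> Iterator[float]:
    """Stand-in for a live source; a real one might never end."""
    yield from (1.0, 2.0, 3.0)


for total in running_totals(sensor()):
    print(total)  # a result is available after each reading, not only at the end
```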

Streaming Types

Data Streaming

Data streaming involves the continuous flow of data records from producers to consumers, typically through a message broker or streaming platform.

Event Streaming: Captures events as they happen and makes them available to consumers. Events are typically immutable, ordered, and durable. Example: Apache Kafka, Amazon Kinesis, Azure Event Hubs.

Message Streaming: Delivers messages between producers and consumers through a message broker. Messages can be point-to-point or publish-subscribe. Example: RabbitMQ, Amazon SQS, Apache ActiveMQ.

Log Streaming: Aggregates and streams log data from multiple sources for monitoring and analysis. Example: Fluentd, Logstash, AWS CloudWatch Logs.

Change Data Capture (CDC) Streaming: Captures row-level changes (INSERT, UPDATE, DELETE) from a database and streams them to downstream systems. Example: Debezium, AWS DMS, Oracle GoldenGate.

IoT Data Streaming: Continuous data flow from sensors, devices, and edge nodes. Often involves high-volume, low-latency requirements. Example: AWS IoT Core, Azure IoT Hub, MQTT brokers.

Media Streaming

Media streaming involves the continuous delivery of audio, video, or other multimedia content over a network.

Video Streaming: Delivers video content progressively to viewers. Can be live (real-time broadcast) or on-demand (pre-recorded). Example: Netflix (on-demand), Twitch (live), YouTube (both).

Audio Streaming: Delivers audio content continuously to listeners. Example: Spotify, Apple Music, internet radio stations.

Live Streaming: Real-time broadcast of audio/video content with minimal delay. Example: Twitch, YouTube Live, Facebook Live.

Adaptive Bitrate Streaming (ABR): Dynamically adjusts video quality based on network conditions and device capabilities. Example: HLS (HTTP Live Streaming), DASH (Dynamic Adaptive Streaming over HTTP).

API and Application Streaming

Server-Sent Events (SSE): A server pushes updates to the client over a single HTTP connection. Unidirectional (server to client). Example: Real-time dashboards, notification feeds.

WebSocket Streaming: Full-duplex, bidirectional communication over a single TCP connection. Example: Chat applications, multiplayer games, collaborative editing.

gRPC Streaming: Supports unary, server-streaming, client-streaming, and bidirectional streaming RPCs over HTTP/2. Example: Microservice communication, real-time telemetry.

HTTP Chunked Transfer: Server sends response in chunks without knowing the total size in advance. Example: Large file downloads, streaming API responses (e.g., LLM token streaming).

HTTP/2 Server Push and Streaming: Multiplexed streams over a single connection enabling concurrent requests and server-initiated resource push. Chrome has since removed Server Push support, so multiplexed streaming is the part to rely on. Example: API streaming responses, multiplexed microservice communication.

HTTP/3 / QUIC: HTTP over QUIC (UDP-based) avoids TCP's transport-level head-of-line blocking because each stream is delivered and flow-controlled independently, and encryption is built in. Example: Low-latency web streaming, mobile streaming over unreliable networks.

Socket.IO: Library on top of WebSockets adding auto-reconnect, rooms, broadcasting, and HTTP long-polling fallback. Example: Chat applications, real-time collaboration.

Mercure: Open protocol built on SSE for real-time push with JWT authorization and topic-based subscriptions. Example: Real-time updates for REST API-based applications.

Long Polling: Client sends HTTP request; server holds it open until data is available. Simulates server push. Example: Legacy real-time applications, restricted network environments.
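
To make the HTTP Chunked Transfer entry above concrete, the sketch below frames a sequence of byte strings in the HTTP/1.1 chunked format: each chunk is its length in hexadecimal, a CRLF, the data, and another CRLF, and a zero-length chunk ends the body. The function names are hypothetical, and the decoder deliberately ignores chunk extensions and trailers.

```python
from typing import Iterable, Iterator


def encode_chunked(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each part as an HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    for part in parts:
        if part:                               # a zero-length chunk would end the body
            yield f"{len(part):x}\r\n".encode("ascii") + part + b"\r\n"
    yield b"0\r\n\r\n"                         # terminating chunk, no trailers


def decode_chunked(body: bytes) -> list[bytes]:
    """Parse a complete chunked body (no chunk extensions or trailers)."""
    parts, pos = [], 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end], 16)
        if size == 0:
            return parts
        start = line_end + 2
        parts.append(body[start:start + size])
        pos = start + size + 2                 # skip the CRLF after the data
```

Because every chunk carries its own length, the sender never needs to know the total size up front, which is what makes this format suitable for token-by-token responses.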

Reactive and Runtime-Level Streaming

Reactive Streams (Java Flow API): Standard specification for async stream processing with non-blocking backpressure. Example: Interoperability between reactive libraries.

Project Reactor: Reactive Streams implementation (`Mono`/`Flux`) powering Spring WebFlux. Example: Non-blocking Spring microservices.

RxJava / RxJS / ReactiveX: Cross-platform reactive programming with Observables and rich operators. Example: UI event composition, async data source merging.

Akka Streams / Apache Pekko Streams: Graph-based reactive stream processing on the actor model with built-in backpressure. Example: Complex JVM data transformation pipelines.

Node.js Streams: Built-in Readable/Writable/Transform streams with backpressure via `pipe()`. Example: File processing, HTTP streaming in Node.js.

Spring Cloud Stream: Framework for event-driven microservices with binder abstractions for Kafka, RabbitMQ, Kinesis. Example: Spring Boot services with broker-agnostic messaging.

Apache Camel: Integration framework with 300+ connectors implementing Enterprise Integration Patterns. Example: Enterprise system integration with streaming modes.

Stream Processing Models

Real-Time (Event-at-a-Time) Processing

Each event is processed individually as it arrives. Provides the lowest latency but requires careful state management.

Pros: Lowest latency; immediate reaction to events.
Cons: Complex state management; harder to achieve exactly-once semantics.
Use Case: Fraud detection, real-time alerting, stock trading systems.

Micro-Batch Processing

Events are collected into small batches (e.g., every 1–5 seconds) and processed together. A middle ground between true streaming and batch.

Pros: Simpler programming model; better throughput; easier fault tolerance.
Cons: Higher latency than event-at-a-time; not truly real-time.
Use Case: Near-real-time analytics, Spark Structured Streaming workloads.
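
A minimal sketch of the micro-batch idea follows. It assumes events arrive in timestamp order and closes a batch based on event timestamps, whereas real engines usually trigger on a processing-time clock. The name `micro_batches` and the `(timestamp, payload)` tuple shape are hypothetical.

```python
from typing import Iterable, Iterator

Event = tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[list[str]]:
    """Group time-ordered events into batches covering `interval` seconds each."""
    batch: list[str] = []
    batch_end = None
    for ts, payload in events:
        if batch_end is None:
            batch_end = ts + interval    # the first event opens the first batch
        while ts >= batch_end:           # close this batch and skip empty intervals
            if batch:
                yield batch
                batch = []
            batch_end += interval
        batch.append(payload)
    if batch:
        yield batch                      # flush whatever is left at end of input
```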

Batch Processing

Processes a bounded, complete dataset at once. Not streaming, but often compared against it.

Pros: Simplest model; high throughput; well-understood semantics.
Cons: High latency (minutes to hours); stale results.
Use Case: ETL pipelines, daily reporting, historical analysis.

Lambda Architecture

Combines batch and stream processing layers. A batch layer processes historical data for accuracy; a speed layer processes real-time data for low latency. Results are merged at query time.

Pros: Handles both historical and real-time data; fault-tolerant.
Cons: Dual codebase to maintain; operational complexity.
Use Case: Systems needing both real-time and retroactive analytics.

Kappa Architecture

Simplifies Lambda by using a single stream processing layer for both real-time and historical data. All data is treated as a stream; reprocessing is done by replaying the stream.

Pros: Single codebase; simpler ops; stream-native.
Cons: Requires a replayable log (e.g., Kafka); reprocessing can be expensive.
Use Case: Event-sourced systems, modern real-time analytics.

Streaming Architectures

Publish-Subscribe (Pub/Sub)

Producers publish messages to a topic; consumers subscribe to topics of interest. Decouples producers from consumers.

Key Properties: Decoupled communication; fan-out to multiple consumers; topic-based routing.
Technologies: Apache Kafka, Google Pub/Sub, Amazon SNS, Redis Pub/Sub.
Use Case: Event-driven microservices, notification systems.
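
The decoupling and fan-out described here can be illustrated with a toy in-memory broker. This is a sketch only: the `Broker` class is hypothetical and omits everything a real broker provides, such as networking, persistence, and delivery guarantees.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[str], None]


class Broker:
    """Toy in-memory topic broker; real brokers add persistence and networking."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver to every subscriber of the topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)
```

The publisher only knows the topic name, so subscribers can be added or removed without touching producer code.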

Message Queue

Producers enqueue messages; consumers dequeue and process them. Typically point-to-point with competing consumers.

Key Properties: Load balancing across consumers; guaranteed delivery (with acknowledgments); FIFO ordering within a queue, which competing consumers can weaken.
Technologies: RabbitMQ, Amazon SQS, Apache ActiveMQ, Azure Service Bus.
Use Case: Task distribution, workload balancing, asynchronous processing.

Event Log / Commit Log

An append-only, ordered, durable log of events. Consumers read from the log at their own pace using offsets.

Key Properties: Durable; replayable; ordered; supports multiple consumer groups.
Technologies: Apache Kafka, Apache Pulsar, Amazon Kinesis, Redpanda.
Use Case: Event sourcing, CDC, audit logging, data integration.
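
The offset mechanics can be sketched with a toy in-memory log. The `EventLog` class and its method names are hypothetical and only loosely mirror what platforms such as Kafka expose; the point is that each consumer group tracks its own position, so reading never removes data and any group can rewind.

```python
class EventLog:
    """Toy append-only log with one committed offset per consumer group."""

    def __init__(self) -> None:
        self._entries: list[str] = []
        self._offsets: dict[str, int] = {}     # group name -> next offset to read

    def append(self, event: str) -> int:
        self._entries.append(event)
        return len(self._entries) - 1          # offset of the new entry

    def poll(self, group: str, max_records: int = 10) -> list[str]:
        """Read from the group's current position without advancing it."""
        start = self._offsets.get(group, 0)
        return self._entries[start:start + max_records]

    def commit(self, group: str, count: int) -> None:
        """Advance the group's position after `count` records were processed."""
        self._offsets[group] = self._offsets.get(group, 0) + count

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or skip ahead) to replay from an arbitrary offset."""
        self._offsets[group] = offset
```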

Stream Processing Pipeline

A topology of processing stages (sources, operators, sinks) connected by streams. Supports transformations, aggregations, windowing, and joins.

Key Properties: Stateful processing; windowed aggregations; exactly-once semantics (in some engines).
Technologies: Apache Flink, Apache Spark Structured Streaming, Apache Storm, Kafka Streams.
Use Case: Real-time analytics, complex event processing, ETL.

Materialized View Pattern

Stream processors continuously update a queryable “materialized view” (table or index) as events arrive. Consumers query the view rather than the raw stream.

Key Properties: Low-latency reads; updated incrementally; derived from source events.
Technologies: Kafka Streams (KTable), Apache Flink (queryable state), ksqlDB.
Use Case: Real-time dashboards, serving layer for analytics, CQRS read models.
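
As a minimal sketch of the pattern, the view below folds a stream of account events into a dictionary that readers query directly. The `(kind, account, amount)` event schema and the `BalanceView` class are assumptions made for illustration.

```python
Event = tuple[str, str, int]  # (kind, account, amount); a hypothetical schema


class BalanceView:
    """Queryable view that is updated one event at a time."""

    def __init__(self) -> None:
        self._balances: dict[str, int] = {}

    def apply(self, event: Event) -> None:
        kind, account, amount = event
        delta = amount if kind == "deposit" else -amount
        self._balances[account] = self._balances.get(account, 0) + delta

    def get(self, account: str) -> int:
        return self._balances.get(account, 0)   # readers never touch the raw stream
```

Because the view is derived entirely from events, it can be thrown away and rebuilt by replaying the source stream.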

Delivery Guarantees

One of the most critical aspects of streaming systems is how they handle message delivery in the presence of failures.

At-Most-Once

Messages may be lost but are never delivered more than once. The producer sends and forgets; no retries.

Pros: Simplest; lowest latency; no duplicates.
Cons: Data loss is possible.
Use Case: Metrics collection where occasional loss is acceptable (e.g., telemetry).

At-Least-Once

Messages are guaranteed to be delivered but may be delivered more than once. The producer retries on failure.

Pros: No data loss.
Cons: Duplicate processing possible; consumers must be idempotent.
Use Case: Most streaming applications where data loss is unacceptable.
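
An idempotent consumer is what makes at-least-once delivery safe. The sketch below assumes every message carries a unique ID, which the text does not specify, and keeps the seen IDs in memory. A production system would need that set to be durable and bounded.

```python
class IdempotentConsumer:
    """Applies each message ID at most once, so redelivery is harmless."""

    def __init__(self) -> None:
        self.total = 0
        self._seen: set[str] = set()   # in production: durable and bounded

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self.total += amount
        self._seen.add(message_id)
        return True
```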

Exactly-Once

Each message is processed exactly once, even in the presence of failures. Achieved via transactions, deduplication, or idempotent writes.

Pros: Strongest guarantee; simplifies application logic.
Cons: Higher latency and overhead; complex to implement.
Use Case: Financial transactions, billing, inventory management.

Windowing Strategies

When processing unbounded streams, windowing defines how events are grouped for aggregation.

Tumbling Window

Fixed-size, non-overlapping windows. Each event belongs to exactly one window.

Example: Count events every 5 minutes → [0–5min], [5–10min], [10–15min].
Use Case: Periodic aggregations, regular reporting intervals.
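
Assigning an event to its tumbling window is a single integer division. The sketch below works in whole seconds and treats windows as half-open intervals `[start, end)`, so an event exactly on a boundary belongs to the later window. That convention and the function names are assumptions.

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """Return the [start, end) window of width `size` that contains `ts`."""
    start = (ts // size) * size
    return (start, start + size)


def count_per_window(timestamps: list[int], size: int) -> dict[tuple[int, int], int]:
    """Count events per tumbling window."""
    counts: dict[tuple[int, int], int] = {}
    for ts in timestamps:
        window = tumbling_window(ts, size)
        counts[window] = counts.get(window, 0) + 1
    return counts
```

With `size=300` this reproduces the five-minute windows from the example above.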

Sliding Window

Fixed-size windows that slide by a configurable interval. Windows overlap, so an event may belong to multiple windows.

Example: 10-minute window sliding every 1 minute.
Use Case: Moving averages, trend detection.
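
For sliding windows, one event maps to several windows. The sketch below assumes window starts are aligned to multiples of the slide interval and that windows are half-open `[start, end)`. Both are conventions chosen for illustration, not something the text prescribes.

```python
def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return every [start, end) window of width `size` that contains `ts`.

    Window starts are multiples of `slide`.
    """
    last_start = (ts // slide) * slide           # latest window starting at or before ts
    starts = range(last_start, ts - size, -slide)  # walk back while the window still covers ts
    return [(start, start + size) for start in reversed(starts)]
```

With a 10-minute window sliding every minute (`size=600`, `slide=60`), each event lands in ten windows, which is why sliding aggregations cost more state than tumbling ones.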

Session Window

Dynamic windows that group events by activity. A window closes after a configurable gap of inactivity.

Example: Group user clickstream events with a 30-minute inactivity timeout.
Use Case: User session analysis, activity tracking.
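
Session windows can be sketched by splitting on gaps. This version sorts the timestamps first, which only works on a finite input, and treats a gap exactly equal to the timeout as still inside the session. Real engines handle unbounded, out-of-order input by merging windows, and their boundary convention may differ.

```python
def session_windows(timestamps: list[int], gap: int) -> list[list[int]]:
    """Split timestamps into sessions separated by more than `gap` of inactivity."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)     # still within the inactivity timeout
        else:
            sessions.append([ts])       # gap exceeded: start a new session
    return sessions
```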

Global Window

A single window that encompasses all events. Requires a custom trigger to emit results.

Use Case: Accumulating state across the entire stream.

Backpressure and Flow Control

When consumers cannot keep up with producers, streaming systems need mechanisms to handle the imbalance.

Buffering: Queue messages in a buffer (in-memory or on-disk). Pros: Smooths out temporary spikes. Cons: Buffer overflow if sustained.

Dropping: Discard messages when the system is overwhelmed. Pros: Protects system stability. Cons: Data loss.

Rate Limiting: Throttle producers to match consumer throughput. Pros: Prevents overload. Cons: Adds latency on the producer side.

Backpressure Propagation: Signal upstream components to slow down. This is the reactive streams approach. Pros: End-to-end flow control. Cons: Requires protocol support.

Technologies: Reactive Streams (Java), Akka Streams, Apache Flink (credit-based flow control), TCP flow control.
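
Two of these strategies can be contrasted using a bounded buffer from Python's standard library. This is a minimal sketch with hypothetical helper names: the first helper drops when the buffer is full, the second makes the producer wait, which is the simplest form of backpressure.

```python
import queue


def offer_or_drop(buffer: queue.Queue, item) -> bool:
    """Dropping strategy: enqueue if there is room, otherwise discard the item."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False


def offer_or_wait(buffer: queue.Queue, item, timeout: float) -> bool:
    """Backpressure strategy: block the producer until the consumer frees a slot."""
    try:
        buffer.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False    # consumer did not catch up within the timeout
```

The buffer size sets how large a burst is absorbed; what happens once it fills is the policy choice the list above describes.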

Batch vs Stream Processing — Key Differences

Data scope — Batch operates on a bounded, finite dataset known upfront. Stream operates on an unbounded, continuous flow with no defined end.

Latency — Batch produces results after minutes to hours because it waits for the full dataset to be collected before processing begins. Stream produces results in milliseconds to seconds, processing each event as it arrives.

Throughput — Batch achieves very high throughput through optimized bulk I/O operations. Stream has high throughput but carries per-event overhead that batch avoids.

Complexity — Batch is simpler — the map/reduce model is well-understood. Stream is significantly more complex because you must handle out-of-order events, late data, distributed state, and partial failures mid-flight.

Fault tolerance — If a batch job fails, you restart and reprocess the whole batch. If a stream job fails, you resume from the last checkpoint or Kafka offset, avoiding full reprocessing.

Use cases — Batch is the right tool for ETL pipelines, overnight reporting, ML model training, and billing runs. Stream is the right tool for real-time dashboards, fraud detection, alerting, change data capture (CDC), and operational monitoring.

Freshness — Batch data is stale by definition — it reflects the world as of the last run, which may be hours ago. Stream data is near real-time, typically seconds behind the source.

State management — In batch, state is implicit: the full dataset is available in memory or on disk, so joins and aggregations are straightforward. In stream, state is explicit: you maintain managed state stores (like RocksDB in Flink or Kafka Streams) with TTLs, watermarks, and eviction policies to handle the fact that you can never see the full picture at once.

Resource usage — Batch consumption is bursty: heavy CPU and memory during the batch window, then idle. Stream consumption is steady: a flat, continuous resource profile around the clock.

Streaming Technologies Deep Dive

This section walks through streaming technologies from the most fundamental transport-level primitives to sophisticated distributed platforms. Understanding the full spectrum helps you choose the right tool for your specific latency, throughput, durability, and complexity requirements.

TCP Streaming

How it works: TCP (Transmission Control Protocol) provides a reliable, ordered, byte-stream connection between two endpoints. A server binds to a port, accepts connections, and data flows as a continuous stream of bytes in both directions. TCP handles retransmission of lost packets, flow control (sliding window), and congestion control automatically at the OS kernel level.

Key characteristics:
- Reliable, ordered delivery guaranteed by the protocol
- Connection-oriented (requires a handshake before data flows)
- Built-in flow control and congestion control
- Byte-stream abstraction (no message boundaries)

Limitations:
- Point-to-point only — no native multicast or pub/sub
- Head-of-line blocking: a single lost packet stalls the entire stream until retransmitted
- Connection overhead: the three-way handshake adds latency for short-lived connections
- No built-in message framing — applications must define their own protocol for message boundaries
- Does not scale beyond a single connection without application-level coordination

When to use: Custom low-level streaming protocols, simple producer-consumer pairs on a local network, building blocks for higher-level protocols. Use when you need reliable delivery between two known endpoints and are willing to handle framing and routing yourself.
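
Since TCP delivers bytes rather than messages, applications commonly add a length prefix. The sketch below shows one such scheme, a 4-byte big-endian length followed by the payload. The header size and the `FrameDecoder` class are illustrative choices, not a standard.

```python
import struct


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(payload)) + payload


class FrameDecoder:
    """Reassembles messages from arbitrarily split chunks of a byte stream."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, data: bytes) -> list[bytes]:
        """Add bytes as they arrive (e.g. from recv) and return completed messages."""
        self._buffer.extend(data)
        messages = []
        while len(self._buffer) >= 4:
            (length,) = struct.unpack(">I", self._buffer[:4])
            if len(self._buffer) < 4 + length:
                break                              # wait for the rest of the message
            messages.append(bytes(self._buffer[4:4 + length]))
            del self._buffer[:4 + length]
        return messages
```

The decoder keeps partial data between calls, so it works no matter how the network splits or coalesces the sender's writes.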

UDP Streaming

How it works: UDP (User Datagram Protocol) sends discrete packets (datagrams) between endpoints without establishing a connection. Each datagram is independent — there is no guarantee of delivery, ordering, or duplicate protection. The sender simply fires packets at a destination address and port.

Key characteristics:
- Connectionless — no handshake required
- Datagram-based with clear message boundaries
- Supports multicast and broadcast
- Minimal protocol overhead (8-byte header vs. 20+ for TCP)

Limitations:
- No delivery guarantee — packets can be lost, duplicated, or arrive out of order
- No built-in flow control or congestion control — the application must handle these
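- Datagrams larger than the path MTU are fragmented at the IP layer (on Ethernet with IPv4, about 1,472 bytes of payload fit in one unfragmented packet), so applications usually keep datagrams below that size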
- No built-in reliability — applications must implement their own ACK/retransmit logic if needed

When to use: Real-time media streaming (audio/video) where low latency matters more than perfect delivery, online gaming, DNS queries, IoT telemetry where occasional data loss is acceptable, and as a foundation for protocols like WebRTC, SRT, and QUIC that add selective reliability on top of UDP.

Unix Domain Sockets and Named Pipes

How it works: Unix domain sockets provide inter-process communication (IPC) on the same host using the filesystem namespace instead of network addresses. They support both stream (SOCK_STREAM, like TCP) and datagram (SOCK_DGRAM, like UDP) modes. Named pipes (FIFOs) provide a simpler unidirectional byte-stream between processes via a filesystem path.

Key characteristics:
- Extremely low latency — no network stack overhead
- Higher throughput than TCP loopback (no TCP/IP header processing, checksumming, or routing)
- File-permission-based access control
- Stream or datagram semantics available (domain sockets)

Limitations:
- Single-host only — cannot communicate across a network
- No built-in pub/sub, routing, or load balancing
- Named pipes are unidirectional (need two for bidirectional communication)
- Portability varies: Windows named pipes are a different mechanism from POSIX FIFOs, and AF_UNIX socket support arrived only in later Windows 10 builds

When to use: High-performance IPC on a single machine — e.g., a web server communicating with a local application server, container sidecar proxies, database client connections (PostgreSQL, MySQL, Redis all support Unix sockets for local connections).
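
A quick way to try local stream sockets from Python is `socket.socketpair()`, which on Unix-like systems returns two connected `AF_UNIX` stream sockets without needing a filesystem path. The sketch below runs both ends in one process for brevity, with a hypothetical function name. Real IPC would hand one end to a child process or bind to a socket file.

```python
import socket


def ping_over_socketpair(message: bytes) -> bytes:
    """Send `message` across a connected socket pair and return the upper-cased echo."""
    left, right = socket.socketpair()       # AF_UNIX stream sockets on Unix-like systems
    with left, right:
        left.sendall(message)
        received = right.recv(len(message))  # small local messages arrive in one read
        right.sendall(received.upper())
        return left.recv(len(message))


print(ping_over_socketpair(b"ping"))
```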

MQTT (Message Queuing Telemetry Transport)

How it works: MQTT is a lightweight publish-subscribe messaging protocol designed for constrained devices and unreliable networks. Clients connect to a central broker, subscribe to topic filters, and publish messages to topics. The broker routes messages from publishers to all matching subscribers. MQTT supports three Quality of Service (QoS) levels: 0 (at most once), 1 (at least once), and 2 (exactly once).

Key characteristics:
- Extremely lightweight — minimal packet overhead (as low as 2 bytes header)
- Publish-subscribe with hierarchical topic structure and wildcard subscriptions
- Persistent sessions and retained messages
- Last Will and Testament (LWT) for detecting client disconnections
- Runs over TCP (or WebSockets for browser clients)

Limitations:
- Centralized broker is a single point of failure (unless clustered)
- Not designed for high-throughput data pipelines (no native partitioning or consumer groups)
- Limited message size (protocol allows up to 256 MB, but practical limits are much lower)
- No built-in message replay or stream history
- QoS 2 (exactly once) adds significant latency overhead

When to use: IoT and edge computing scenarios with constrained devices, low-bandwidth networks, or unreliable connectivity. Home automation, industrial telemetry, fleet tracking, mobile push notifications. Use when devices are resource-constrained and the message volume per device is moderate.
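
The wildcard subscriptions mentioned above follow simple rules: `+` matches exactly one topic level, `#` matches any number of remaining levels (including none) and must come last, and filters that begin with a wildcard do not match topics starting with `$`. The sketch below implements those rules in a hypothetical `topic_matches` helper. It does not validate malformed filters the way a broker would.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Check an MQTT topic against a filter with `+` and `#` wildcards."""
    # Topics such as $SYS/... are hidden from filters that start with a wildcard.
    if topic.startswith("$") and topic_filter[:1] in ("+", "#"):
        return False
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return i == len(filter_levels) - 1     # '#' is only valid as the last level
        if i >= len(topic_levels):
            return False                           # filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)
```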

ZeroMQ (ZMQ)

How it works: ZeroMQ is a high-performance asynchronous messaging library that provides socket-like abstractions for various messaging patterns. Unlike traditional brokers, ZMQ is brokerless — it is embedded directly in applications as a library. It provides several socket types that implement specific patterns: REQ/REP (request-reply), PUB/SUB (publish-subscribe), PUSH/PULL (pipeline/fan-out), DEALER/ROUTER (async request-reply), and PAIR (exclusive pair). ZMQ handles connection management, framing, reconnection, and buffering internally.

Key characteristics:
- Brokerless architecture — no central server required (peer-to-peer)
- Multiple transport protocols: TCP, IPC (Unix sockets), inproc (in-process threads), PGM/EPGM (multicast)
- Automatic reconnection and message queuing
- Multi-part messages with atomic delivery
- Extremely low latency (microseconds for inproc, sub-millisecond for TCP)
- Language bindings for 40+ languages

Limitations:
- No message persistence or durability — messages are lost if the receiver is down and the sender’s buffer overflows
- No built-in message broker — you must design your own routing topology
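- Authentication and encryption are opt-in rather than on by default (the CURVE mechanism, specified as CurveZMQ, ships with libzmq 4.0 and later but must be configured explicitly)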
- Debugging distributed ZMQ topologies can be challenging
- No consumer groups, offsets, or replay capability
- PUB/SUB has a “slow subscriber” problem — slow consumers miss messages

When to use: Low-latency inter-process or inter-service communication where you do not need durability or replay. High-frequency trading systems, scientific computing pipelines, distributed task distribution, real-time data collection from multiple sources. Use when you want messaging patterns without the operational overhead of a broker.

Redis Pub/Sub

How it works: Redis Pub/Sub provides a simple publish-subscribe messaging system built into Redis. Publishers send messages to channels; subscribers listen on channels. Messages are delivered to all connected subscribers in real time. It is a fire-and-forget system — messages are not persisted and are only delivered to subscribers connected at the time of publication.

Key characteristics:
- Extremely simple API (PUBLISH, SUBSCRIBE, PSUBSCRIBE for pattern matching)
- Very low latency (sub-millisecond on local network)
- Pattern-based subscriptions with glob-style matching
- No message persistence — pure real-time delivery
- Part of Redis, so no additional infrastructure if Redis is already in use

Limitations:
- No message persistence — if a subscriber is disconnected, it misses all messages
- No consumer groups or load balancing across consumers
- No delivery guarantees — at-most-once only
- No message acknowledgment or retry mechanism
- A slow subscriber can cause memory buildup in the Redis output buffer, potentially crashing Redis
- Messages are not stored — no replay or history

When to use: Real-time notifications, cache invalidation broadcasts, lightweight event signaling between services that are always online. Use when you already have Redis and need simple, ephemeral pub/sub without durability requirements.

Redis Streams

How it works: Redis Streams (introduced in Redis 5.0) provide a durable, append-only log data structure within Redis. Producers append entries (field-value pairs) to a stream using XADD. Each entry gets a unique, time-based ID. Consumers can read entries sequentially (XREAD), or form consumer groups (XREADGROUP) where entries are distributed among group members with acknowledgment tracking (XACK). Unacknowledged entries can be claimed by other consumers (XCLAIM) for failure recovery.

Key characteristics:
- Durable, append-only log persisted to disk (with Redis persistence — RDB/AOF)
- Consumer groups with automatic load balancing and message acknowledgment
- Unique time-based IDs for each entry
- Supports blocking reads for efficient polling
- Pending Entry List (PEL) tracks unacknowledged messages per consumer
- Capped streams (MAXLEN/MINID) for automatic trimming
- Fan-out: multiple consumer groups can independently read the same stream

Limitations:
- Single-node throughput limited by Redis’s single-threaded event loop
- No built-in partitioning across multiple Redis instances (Redis Cluster can shard, but each stream lives on one node)
- Memory-bound — large streams consume significant RAM (even with persistence, data is in memory)
- No native exactly-once semantics — at-least-once with manual deduplication
- Less mature ecosystem compared to Kafka for complex stream processing (no windowing, joins, etc.)
- No built-in schema registry or data governance

When to use: Lightweight event streaming, task queues, and activity feeds when you already use Redis and need durability and consumer groups without deploying Kafka. Suitable for moderate throughput (tens of thousands of messages/second per stream) scenarios like order processing, notification pipelines, or microservice event buses.
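To make the consumer-group bookkeeping concrete, here is a minimal in-memory sketch of the XADD / XREADGROUP / XACK / XCLAIM cycle. `ToyStream` is a hypothetical class, the IDs are simplified counters rather than time-based IDs, and the real XCLAIM also checks a minimum idle time that this model omits.

```python
class ToyStream:
    """In-memory model of one stream with a single consumer group."""

    def __init__(self):
        self.entries = []      # append-only list of (entry_id, fields)
        self.next_index = 0    # position of the next never-delivered entry
        self.pending = {}      # entry_id -> consumer name (the PEL)

    def xadd(self, fields):
        entry_id = f"{len(self.entries) + 1}-0"   # simplified, not time-based
        self.entries.append((entry_id, fields))
        return entry_id

    def xreadgroup(self, consumer, count=1):
        """Hand out never-delivered entries and record them as pending."""
        batch = self.entries[self.next_index:self.next_index + count]
        self.next_index += len(batch)
        for entry_id, _ in batch:
            self.pending[entry_id] = consumer
        return batch

    def xack(self, entry_id):
        """Acknowledge an entry, removing it from the PEL."""
        return 1 if self.pending.pop(entry_id, None) is not None else 0

    def xclaim(self, new_consumer, entry_id):
        """Reassign a still-pending entry to another consumer."""
        if entry_id in self.pending:
            self.pending[entry_id] = new_consumer
            return True
        return False


stream = ToyStream()
first = stream.xadd({"order": "A"})
second = stream.xadd({"order": "B"})

stream.xreadgroup("worker-1", count=1)   # worker-1 receives the first entry
stream.xreadgroup("worker-2", count=1)   # worker-2 receives the second entry
stream.xack(second)                      # worker-2 finishes its entry
stream.xclaim("worker-2", first)         # worker-1 crashed; worker-2 takes over
```

After these calls the PEL holds only the first entry, now owned by worker-2, which is how at-least-once delivery survives a consumer failure.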

RabbitMQ

How it works: RabbitMQ is a traditional message broker implementing the AMQP (Advanced Message Queuing Protocol) standard. Producers publish messages to exchanges, which route messages to queues based on bindings and routing keys. Consumers subscribe to queues. RabbitMQ supports multiple exchange types: direct (exact routing key match), fanout (broadcast to all bound queues), topic (pattern-based routing), and headers (route by message headers). Messages can be persistent (written to disk) or transient.

Key characteristics:
- Flexible routing via exchanges, bindings, and routing keys
- Multiple protocols: AMQP 0-9-1, AMQP 1.0, MQTT, STOMP, and HTTP
- Message acknowledgment with manual or automatic ACK
- Dead letter exchanges for failed message handling
- Priority queues
- Plugin ecosystem (management UI, federation, shovel)
- Quorum queues for high availability and data safety
- RabbitMQ Streams plugin for log-like append-only semantics

Limitations:
- Not designed for high-throughput event streaming (optimized for smart routing, not raw throughput)
- Messages are deleted once consumed and acknowledged (not a replayable log, unless using Streams)
- Performance degrades when queues grow very large (millions of messages)
- Complex routing topologies can be hard to debug and maintain
- No native partitioning for horizontal scaling of a single queue (though consistent hash exchange helps)
- Clustering can be operationally complex; network partitions require careful handling

When to use: Task distribution and work queues, request-reply patterns, complex message routing scenarios, systems requiring multiple protocol support. Best for traditional messaging use cases where you need flexible routing, per-message acknowledgment, and moderate throughput (thousands to low tens of thousands of messages/second per queue).
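Topic exchanges are the routing mode that most often surprises people, so a small sketch may help. It assumes the standard AMQP topic rules (words separated by dots, `*` matches exactly one word, `#` matches zero or more words). The function names `topic_matches` and `route` are illustrative, not part of any RabbitMQ client.

```python
def topic_matches(binding_key, routing_key):
    """Return True if a topic-exchange binding key matches a routing key."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p, w):
        if p == len(pattern):
            return w == len(words)
        if pattern[p] == "#":
            # '#' may consume zero words, or one word and remain active
            return match(p + 1, w) or (w < len(words) and match(p, w + 1))
        if w == len(words):
            return False
        if pattern[p] == "*" or pattern[p] == words[w]:
            return match(p + 1, w + 1)
        return False

    return match(0, 0)


def route(bindings, routing_key):
    """Return the queues whose binding key matches the routing key."""
    return [queue for key, queue in bindings if topic_matches(key, routing_key)]


bindings = [
    ("orders.*.created", "billing"),
    ("orders.#", "audit"),
    ("payments.#", "ledger"),
]
queues = route(bindings, "orders.eu.created")
```

Here the message lands in both `billing` and `audit`, each of which gets its own copy.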

NATS and NATS JetStream

How it works: NATS is a lightweight, high-performance messaging system. Core NATS provides a simple pub/sub and request-reply system with at-most-once delivery: no persistence, no acknowledgment. NATS JetStream (built into the NATS server since v2.2) adds persistence, at-least-once/exactly-once delivery, durable consumers (in “push” and “pull” variants, filling the role that consumer groups play in other systems), message replay, and stream retention policies. JetStream stores messages in streams and allows consumers to read from any position.

Key characteristics:
- Core NATS: Extremely fast (millions of messages/second), simple pub/sub, subject-based addressing with wildcards
- JetStream: Durable streams, consumer acknowledgment, replay from any position, key-value and object stores
- Subject-based addressing (hierarchical subjects with `>` and `*` wildcards)
- Built-in clustering and multi-tenancy (accounts)
- Leaf nodes for edge computing and hub-spoke topologies
- Single binary, minimal configuration, easy to deploy
- Request-reply pattern built into the protocol

Limitations:
- Core NATS: No persistence or delivery guarantees (at-most-once only)
- JetStream: Younger ecosystem than Kafka; fewer integrations and connectors
- No built-in schema registry
- Stream processing capabilities are basic compared to Kafka Streams or Flink
- Large-scale JetStream deployments are less battle-tested than Kafka at extreme scale
- Limited windowing and aggregation support — primarily a messaging/streaming transport, not a processing engine

When to use: Microservice communication (especially request-reply), cloud-native applications, edge computing, IoT. Core NATS for ultra-low-latency fire-and-forget messaging. JetStream when you need persistence and replay without the operational weight of Kafka. Excellent for systems that need both traditional messaging patterns and streaming in a single lightweight platform.
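NATS subject wildcards look similar to other brokers' patterns but behave differently: `*` matches exactly one token, while `>` matches one or more trailing tokens and is only valid as the last token. A minimal sketch of that rule follows; `subject_matches` is a hypothetical helper, not a NATS client API.

```python
def subject_matches(pattern, subject):
    """Return True if a NATS-style subject pattern matches a subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the final token and needs at least one token to match
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


single = subject_matches("sensors.*.temp", "sensors.room1.temp")
tail = subject_matches("sensors.>", "sensors.room1.temp")
bare = subject_matches("sensors.>", "sensors")
```

Note that `sensors.>` does not match the bare subject `sensors`, because `>` needs at least one token.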

Apache Kafka

How it works: Kafka is a distributed, partitioned, replicated commit log. Producers write records to topics, which are divided into partitions. Each partition is an ordered, immutable, append-only log. Partitions are distributed across brokers and replicated for fault tolerance. Consumers read from partitions using offsets and form consumer groups for parallel consumption: each partition is assigned to exactly one consumer within a group. Kafka retains messages for a configurable retention period, or indefinitely if retention limits are disabled; with log compaction it instead keeps at least the latest record for each key.

Key characteristics:
- Extremely high throughput (millions of messages/second per cluster)
- Durable, replayable commit log with configurable retention
- Partitioned for horizontal scalability
- Consumer groups with automatic partition assignment and rebalancing
- Exactly-once semantics via idempotent producers and transactional writes
- Log compaction for maintaining latest-value-per-key
- Rich ecosystem: Kafka Connect (connectors), Kafka Streams (processing), ksqlDB (SQL), Schema Registry
- KRaft mode (production-ready since Kafka 3.3) removes the ZooKeeper dependency

Limitations:
- Operational complexity — managing brokers, partitions, replication, and rebalancing
- Partition count changes require careful planning (rebalancing, data redistribution)
- Not ideal for very low-latency messaging (typical latency is low milliseconds, not microseconds)
- Consumer rebalancing can cause temporary processing pauses
- JVM-based — significant memory and resource requirements
- Ordering guaranteed only within a partition, not globally
- Not designed for point-to-point or request-reply patterns (though possible with extra work)

When to use: The default choice for large-scale event streaming, data pipelines, event sourcing, CDC, log aggregation, and event-driven architectures. Use when you need a durable, high-throughput, replayable event log with a mature ecosystem. Best for scenarios with high data volumes and multiple downstream consumers.
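Two rules carry most of Kafka's ordering and scaling behavior: the same key always maps to the same partition, and each partition is owned by exactly one consumer in a group. The sketch below illustrates both under simplifying assumptions: CRC32 stands in for the murmur2 hash used by the Java client's default partitioner, and a plain round-robin stands in for Kafka's pluggable assignors. Both function names are illustrative.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition (CRC32 is a stand-in for murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Round-robin assignment: every partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


NUM_PARTITIONS = 6
p1 = partition_for("user-42", NUM_PARTITIONS)
p2 = partition_for("user-42", NUM_PARTITIONS)   # same key, same partition

group = assign(list(range(NUM_PARTITIONS)), ["c1", "c2", "c3", "c4"])

# With more consumers than partitions, the extra consumers sit idle.
crowded = assign(list(range(NUM_PARTITIONS)), [f"c{i}" for i in range(8)])
idle = [c for c, parts in crowded.items() if not parts]
```

The idle consumers in the last step show why the partition count caps the parallelism of a single consumer group.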

Apache Pulsar

How it works: Pulsar is a multi-tenant, distributed messaging and streaming platform that separates serving (brokers) from storage (Apache BookKeeper). Topics are divided into partitions, and each partition is stored as a distributed ledger in BookKeeper. This separation allows independent scaling of compute and storage. Pulsar supports both streaming (with consumer offsets and replay) and traditional queueing (shared subscription) semantics on the same topic. It includes built-in geo-replication for multi-datacenter deployments.

Key characteristics:
- Separation of compute and storage — independent scaling
- Multi-tenancy with namespace isolation
- Built-in geo-replication across datacenters
- Multiple subscription modes: exclusive, shared (competing consumers), failover, key-shared
- Tiered storage — offload old data to S3/GCS/HDFS automatically
- Pulsar Functions for lightweight serverless stream processing
- Schema registry built in
- Topic compaction (similar to Kafka log compaction)

Limitations:
- More complex architecture (brokers + BookKeeper + ZooKeeper/metadata store)
- Smaller community and ecosystem compared to Kafka
- Fewer third-party connectors and integrations
- BookKeeper adds operational overhead
- Higher learning curve due to additional concepts (tenants, namespaces, bundles)
- Some features (transactions, key-shared subscriptions) are newer and less battle-tested

When to use: Multi-tenant platforms, multi-region deployments requiring geo-replication, use cases that need both queuing and streaming on the same infrastructure, scenarios requiring independent storage and compute scaling. Consider when you need tiered storage for cost-effective long-term retention.

Amazon Kinesis Data Streams

How it works: Kinesis is AWS’s fully managed real-time data streaming service. Data is organized into streams, which are composed of shards. Each shard provides a fixed capacity (1 MB/sec input, 2 MB/sec output, 1,000 records/sec input). Producers write records with a partition key; Kinesis maps the key to a shard. Consumers read from shards using the Kinesis Client Library (KCL), which handles shard assignment, checkpointing, and failover. Kinesis retains data for 24 hours by default (up to 365 days).

Key characteristics:
- Fully managed — no servers to provision or manage
- Shard splitting and merging for scaling (handled automatically in on-demand mode)
- Integrated with AWS ecosystem (Lambda, Firehose, Analytics, S3)
- Enhanced fan-out for dedicated throughput per consumer
- Server-side encryption at rest
- On-demand capacity mode (auto-scaling)

Limitations:
- AWS-only — strong vendor lock-in
- Shard-level throughput limits can require careful capacity planning
- Higher latency than self-managed Kafka (typically 200ms+)
- More expensive at high scale compared to self-managed alternatives
- Retention is 24 hours by default; extending it (up to 365 days) costs significantly more
- Fewer processing frameworks compared to Kafka’s ecosystem
- Record size limited to 1 MB

When to use: AWS-native real-time data ingestion when you want zero operational overhead. Best for teams already invested in the AWS ecosystem that need managed streaming without Kafka expertise. Ideal for moderate-scale streaming with tight AWS service integration (Lambda triggers, S3 delivery via Firehose).
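The per-shard limits quoted above (1 MB/sec in, 1,000 records/sec in, 2 MB/sec out) turn capacity planning into simple arithmetic. The sketch below assumes provisioned mode with shared (non-enhanced) fan-out, so all consumers split each shard's read throughput. `shards_needed` is a hypothetical helper.

```python
import math


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        math.ceil(write_mb_per_s / 1.0),   # 1 MB/sec write per shard
        math.ceil(records_per_s / 1000),   # 1,000 records/sec write per shard
        math.ceil(read_mb_per_s / 2.0),    # 2 MB/sec read per shard
        1,                                 # a stream has at least one shard
    )


# 5 MB/sec of small records, read in full by three consumers (3 x 5 = 15 MB/sec).
shards = shards_needed(write_mb_per_s=5, records_per_s=12_000, read_mb_per_s=15)
```

In this example the record rate, not the byte rate, is the binding constraint: many small records can exhaust a shard long before the 1 MB/sec limit is reached.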

QUIC

How it works: QUIC (Quick UDP Internet Connections) is a transport protocol originally developed by Google and now standardized as RFC 9000. It runs on top of UDP and provides reliable, multiplexed, encrypted connections. Unlike TCP, QUIC supports multiple independent streams within a single connection — a lost packet in one stream does not block others (no head-of-line blocking). QUIC integrates TLS 1.3 encryption directly into the transport layer and supports 0-RTT connection establishment for previously visited servers. HTTP/3 is built on QUIC.

Key characteristics:
- Multiplexed streams without head-of-line blocking
- Built-in TLS 1.3 encryption (always encrypted)
- 0-RTT connection establishment for returning connections
- Connection migration — survives network changes (e.g., Wi-Fi to cellular)
- Improved congestion control and loss recovery per stream
- User-space implementation (not kernel) — faster iteration on protocol improvements

Limitations:
- Higher CPU usage than TCP (user-space processing, encryption overhead)
- UDP may be throttled or blocked by some corporate firewalls and middleboxes
- Newer protocol — less tooling for debugging and monitoring compared to TCP
- Not all CDNs, load balancers, and proxies fully support QUIC yet
- Implementation complexity is higher than TCP

When to use: Modern web streaming (HTTP/3), mobile applications where connection migration matters, any scenario where head-of-line blocking in TCP is a bottleneck (multiple concurrent streams), low-latency CDN delivery.

Google Cloud Pub/Sub

How it works: Google Cloud Pub/Sub is a fully managed, serverless messaging service. Publishers send messages to topics; the service stores them durably and delivers them to subscriptions. Each subscription receives an independent copy of every message (fan-out). Subscribers can pull messages on demand or configure push delivery to an HTTPS endpoint. Messages are retained for up to 31 days (configurable). Ordering is available via ordering keys. Supports exactly-once delivery within Cloud Dataflow pipelines.

Key characteristics:
- Fully managed, globally distributed — no capacity planning
- At-least-once delivery with message ordering via ordering keys
- Push and pull subscription modes
- Dead-letter topics for undeliverable messages
- Message filtering at the subscription level (attribute-based)
- Schema validation (Avro, Protocol Buffers)
- Seek — replay messages from a timestamp or snapshot

Limitations:
- GCP-only — strong vendor lock-in
- Message size limited to 10 MB
- Ordering guaranteed only within the same ordering key (not globally)
- No native stream processing — requires Dataflow/Beam for processing
- Higher latency than self-managed Kafka (~100ms typical)
- Cost can escalate with very high throughput

When to use: GCP-native event ingestion and microservice communication. Best for teams on GCP wanting zero-ops messaging. Ideal for event-driven architectures, data pipeline ingestion into BigQuery/Dataflow, and workloads that benefit from global distribution.

Azure Event Hubs

How it works: Azure Event Hubs is a fully managed, big data streaming platform on Azure. Producers send events to event hubs (analogous to Kafka topics), which are partitioned for parallelism. Consumer groups enable multiple independent readers. Event Hubs provides a Kafka-compatible API endpoint — existing Kafka producers and consumers can connect with only configuration changes (no code changes). Event Hubs Capture automatically delivers events to Azure Blob Storage or Azure Data Lake for long-term retention.

Key characteristics:
- Fully managed with automatic scaling (throughput units or Processing Units in Premium/Dedicated)
- Apache Kafka protocol compatibility (Kafka producer/consumer APIs work directly)
- Event Hubs Capture for automatic archival to Blob Storage/Data Lake
- Partitioned consumer model with consumer groups
- AMQP 1.0, Kafka, and HTTPS protocols
- Retention of up to 7 days in the Standard tier and up to 90 days in Premium/Dedicated
- Schema Registry built in

Limitations:
- Azure-only — vendor lock-in
- Throughput unit model can require careful capacity planning (Standard tier)
- Higher latency than self-managed Kafka for some workloads
- Limited stream processing built-in — requires Azure Stream Analytics or external engine
- Partition count cannot be changed after creation (Standard tier)
- Premium/Dedicated tiers needed for advanced features (significantly higher cost)

When to use: Azure-native event streaming, migrating Kafka workloads to Azure with minimal code changes, IoT data ingestion (via IoT Hub routing to Event Hubs), telemetry collection. Ideal for organizations already on Azure wanting managed streaming.

Redpanda

How it works: Redpanda is a Kafka-compatible streaming platform written in C++ using the Seastar framework (a high-performance async framework for I/O-intensive applications). It implements the Kafka protocol so that existing Kafka clients, tools, and ecosystems (Kafka Connect, Schema Registry, etc.) work without modification. Redpanda eliminates the JVM and ZooKeeper dependencies — it runs as a single binary with an embedded Raft-based consensus protocol for metadata management.

Key characteristics:
- Full Kafka API compatibility — drop-in replacement for Kafka
- No JVM — written in C++ with thread-per-core architecture (Seastar)
- No ZooKeeper — built-in Raft consensus
- Lower tail latency (p99) compared to Kafka due to no GC pauses
- Simpler operations — single binary, auto-tuning, fewer moving parts
- Built-in Schema Registry and HTTP Proxy (Pandaproxy)
- WebAssembly (Wasm) inline data transforms

Limitations:
- Smaller community and ecosystem compared to Kafka
- Some Kafka features may lag behind the latest Kafka release
- Less battle-tested at extreme scale (multi-PB deployments)
- Commercial features (Tiered Storage, RBAC) require enterprise license
- Fewer managed service options compared to Kafka (Confluent, MSK, etc.)

When to use: When you want Kafka compatibility with simpler operations and lower latency. Best for teams that find Kafka’s JVM tuning and ZooKeeper management burdensome. Ideal for latency-sensitive workloads, resource-constrained environments, or when operational simplicity is a priority.

Apache ActiveMQ / ActiveMQ Artemis

How it works: Apache ActiveMQ Classic is a traditional, full-featured message broker implementing JMS (Java Message Service), AMQP, STOMP, MQTT, and OpenWire protocols. ActiveMQ Artemis is its next-generation successor — a high-performance, non-blocking architecture inspired by HornetQ. Artemis supports persistent and non-persistent messaging, queues and topics, message groups, transactions, and clustering. Both support master-slave replication for high availability.

Key characteristics:
- Multi-protocol support: AMQP 1.0, STOMP, MQTT, OpenWire, HornetQ
- JMS 2.0 compliant (Java Message Service)
- Persistent messaging with journal-based storage (Artemis)
- Message groups for ordered processing per group
- Clustering with load balancing and HA (failover pairs)
- Large message support (stream large messages to disk)
- Diverts for routing messages between addresses

Limitations:
- Not designed for high-throughput event streaming (not a commit log)
- Messages deleted after consumption (no replay without DLQ/re-routing)
- Clustering complexity — network of brokers can be hard to manage
- Classic ActiveMQ has known scalability limits; Artemis addresses many but is less widely deployed
- Smaller community momentum compared to Kafka/RabbitMQ in recent years
- No native partitioning model for horizontal scaling of a single queue

When to use: JMS-based enterprise applications, Java EE environments, multi-protocol broker needs. Best for traditional enterprise messaging patterns (request-reply, point-to-point, pub/sub) where JMS compliance is required. Artemis is the preferred choice for new deployments.

Amazon EventBridge

How it works: Amazon EventBridge is a serverless event bus that makes it easy to connect applications using events. Events from AWS services (e.g., EC2 state changes, S3 uploads), SaaS integrations (e.g., Zendesk, Shopify), and custom applications are routed to targets based on rules. Rules match event patterns (JSON-based) and route matching events to targets (Lambda, SQS, SNS, Step Functions, API Gateway, etc.). EventBridge also provides Schema Registry (auto-discovers event schemas), event replay (archive and replay past events), and pipes (point-to-point integrations with filtering and enrichment).

Key characteristics:
- Serverless — no infrastructure to manage, pay-per-event
- 100+ AWS service event sources and SaaS integrations built-in
- Content-based filtering with JSON pattern matching rules
- Schema Registry with automatic schema discovery
- Event archive and replay capability
- EventBridge Pipes for point-to-point integrations with enrichment
- Cross-account and cross-region event delivery

Limitations:
- AWS-only — cannot be used outside AWS
- Higher latency than Kafka/Kinesis (~500ms typical)
- Limited throughput compared to Kafka (soft limits, need to request increases)
- Event size limited to 256 KB
- No stream processing capabilities — routing only
- Debugging complex routing rules can be challenging
- Not suitable for high-throughput, low-latency data streaming

When to use: AWS-native event-driven architectures with serverless compute. Best for routing events between AWS services and SaaS applications. Ideal when you need content-based routing, schema management, and integration with Lambda/Step Functions without managing messaging infrastructure.
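Content-based routing is easier to reason about with a small model of the matching rule. The sketch below covers only exact-value matching: a pattern is a nested dictionary whose leaves are lists of acceptable values. Real EventBridge patterns also support prefix, numeric, and other operators, as well as array-valued event fields, none of which are modeled here. `matches` is a hypothetical function.

```python
def matches(pattern, event):
    """Return True if every field named in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # nested pattern: recurse into the corresponding event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # leaf: a list of acceptable values, any one of which may match
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail": {"state": ["stopped", "terminated"]},
}
stopped = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-123", "state": "stopped"},
}
running = {"source": "aws.ec2", "detail": {"instance-id": "i-123", "state": "running"}}

routed = matches(pattern, stopped)
ignored = matches(pattern, running)
```

Fields that the pattern does not mention (such as `detail-type`) are ignored, so patterns only need to name what they care about.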

Debezium

How it works: Debezium is an open-source distributed platform for Change Data Capture built on top of Kafka Connect. It deploys database-specific connectors as Kafka Connect source connectors. Each connector reads the database’s transaction log (WAL in PostgreSQL, binlog in MySQL, oplog in MongoDB, etc.) and converts row-level changes into structured change events published to Kafka topics. Each table gets its own topic. Change events include before/after snapshots, operation type, source metadata, and transaction information. Debezium also performs an initial consistent snapshot of existing data before switching to streaming mode.

Key characteristics:
- Connectors for PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, Vitess, and Spanner
- Consistent initial snapshot + continuous CDC streaming
- Change events include before/after state, operation type, and metadata
- Built on Kafka Connect — leverages its scalability, offset management, and fault tolerance
- Embedded engine mode (use Debezium as a library without Kafka Connect)
- Single Message Transforms (SMTs) for event transformation
- Outbox pattern support for reliable event publishing from microservices

Limitations:
- Requires Kafka (or Kafka Connect compatible system) in most deployment modes
- Database-specific connector maturity varies (PostgreSQL and MySQL are most mature)
- Initial snapshot of large databases can take hours and impact database performance
- Schema changes require careful handling (some connectors handle DDL better than others)
- Logical replication slots in PostgreSQL can cause WAL accumulation if Debezium is down
- Monitoring and operational tooling is basic compared to commercial CDC solutions

When to use: Streaming database changes to Kafka for downstream consumers. Best for microservice data synchronization, populating search indexes (Elasticsearch), cache invalidation, building event-sourced systems from existing databases, and keeping data warehouses in sync with operational databases.
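A downstream consumer typically folds change events into its own copy of the table. The sketch below uses a trimmed-down version of the change event envelope (`before`, `after`, and `op`, where `c` is create, `u` is update, `d` is delete, and `r` is a snapshot read). It assumes every row carries an `id` primary-key column, which is an illustration choice; `apply_change` is a hypothetical helper.

```python
def apply_change(table, event):
    """Apply one simplified change event to an in-memory replica of a table."""
    op = event["op"]
    if op in ("c", "r", "u"):          # create, snapshot read, update
        row = event["after"]
        table[row["id"]] = row         # assumes an "id" primary-key column
    elif op == "d":                    # delete: only "before" is populated
        table.pop(event["before"]["id"], None)
    return table


events = [
    {"op": "r", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "a2@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Because each event carries the full `after` state, replaying the same event twice leaves the replica unchanged, which is what makes at-least-once delivery tolerable here.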

EventStoreDB

How it works: EventStoreDB is a purpose-built database designed for event sourcing. Events are appended to streams (ordered sequences of events identified by a stream name). Each event has a type, data payload (JSON or binary), metadata, and a monotonically increasing position. Streams support optimistic concurrency control via expected version checks. Projections (written in JavaScript) run server-side to transform and combine events from multiple streams into new derived streams. Subscriptions (catch-up and persistent) allow consumers to process events in real-time or from a specific position.

Key characteristics:
- Immutable, append-only event storage with per-stream ordering
- Optimistic concurrency control for write consistency
- Server-side projections for event transformation and aggregation
- Catch-up subscriptions (from any position) and persistent subscriptions (consumer groups)
- System projections: $by_category, $by_event_type, $streams for built-in event organization
- HTTP and gRPC client APIs
- Scavenge process for reclaiming space from deleted/truncated streams
- Cluster mode with leader election for high availability

Limitations:
- Purpose-built for event sourcing — not a general-purpose database or message broker
- Smaller ecosystem compared to Kafka or PostgreSQL
- Server-side projections (JavaScript engine) have performance limits at high volume
- Operational tooling and monitoring less mature than mainstream databases
- Not designed for high-throughput data streaming (better for domain event volumes)
- Learning curve for event sourcing patterns if team is unfamiliar

When to use: Domain-driven design with event sourcing, CQRS implementations, audit-critical systems, applications where the history of state changes is as important as the current state. Ideal when your primary data model is an event stream per aggregate.
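The expected-version check is what keeps two writers from silently overwriting each other's decisions. Here is a minimal in-memory sketch of the idea. `ToyEventStore` and `WrongExpectedVersion` are hypothetical names, and versions are modeled as zero-based positions, with -1 standing for an empty stream.

```python
class WrongExpectedVersion(Exception):
    """Raised when the stream has moved on since the writer last read it."""


class ToyEventStore:
    def __init__(self):
        self.streams = {}   # stream name -> list of events

    def append(self, stream, event, expected_version):
        events = self.streams.setdefault(stream, [])
        current_version = len(events) - 1    # -1 means the stream is empty
        if expected_version != current_version:
            raise WrongExpectedVersion(
                f"expected {expected_version}, stream is at {current_version}"
            )
        events.append(event)
        return current_version + 1           # version of the new event


store = ToyEventStore()
store.append("account-1", {"type": "Opened"}, expected_version=-1)

# Two writers both read the stream at version 0 and try to append.
store.append("account-1", {"type": "Deposited", "amount": 50}, expected_version=0)
try:
    store.append("account-1", {"type": "Withdrawn", "amount": 80}, expected_version=0)
    conflict = False
except WrongExpectedVersion:
    conflict = True   # the second writer must re-read and decide again
```

The losing writer reloads the stream, re-evaluates its business rule against the new state, and retries with the updated version.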

Technology Comparison Matrix

Transport and Protocols

Message Brokers and Event Streaming Platforms

Managed Cloud Services

Stream Processing Engines

Media Streaming Protocols

Client-Side / API Streaming

Streaming Patterns for System Design

Event Sourcing

Instead of storing the current state, store an immutable sequence of events that led to the current state. The state can be reconstructed by replaying events.

Benefits: Full audit trail; temporal queries; supports retroactive changes.
Challenges: Event schema evolution; storage growth; replay performance.
Technologies: Kafka (event log), EventStoreDB, Axon Framework.
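Reconstructing state by replay is just a fold over the event log. The sketch below uses a made-up bank-account domain: the event names and the `evolve` function are assumptions chosen for illustration.

```python
from functools import reduce


def evolve(balance, event):
    """Apply one event to the current state and return the new state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    return balance   # unknown event types leave the state unchanged


events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]

current_balance = reduce(evolve, events, 0)
# A temporal query is the same fold over a prefix of the log.
balance_after_two = reduce(evolve, events[:2], 0)
```

The replay-performance challenge listed above follows directly: the fold grows with the log, which is why long-lived streams usually add periodic snapshots.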

CQRS (Command Query Responsibility Segregation)

Separate the write model (commands) from the read model (queries). Stream processors update read-optimized materialized views from the write-side event stream.

Benefits: Independent scaling of reads and writes; optimized read models.
Challenges: Eventual consistency between write and read sides.
Technologies: Kafka + Kafka Streams, Flink + Elasticsearch.

Saga Pattern (Choreography)

Coordinate distributed transactions through a sequence of events. Each service listens for events and publishes its own. Compensating events handle rollbacks.

Benefits: Decoupled services; no central coordinator.
Challenges: Complex failure handling; difficult to debug.
Technologies: Kafka, RabbitMQ, any event broker.

Stream-Table Duality

A stream can be viewed as a changelog of a table, and a table can be viewed as a snapshot of a stream at a point in time. This duality enables powerful joins and enrichments.

Benefits: Enables stream-table joins; unifies batch and streaming semantics.
Technologies: Kafka Streams (KTable/KStream), ksqlDB, Apache Flink.

Dead Letter Queue (DLQ)

Messages that cannot be processed (after retries) are routed to a separate queue for investigation and reprocessing.

Benefits: Prevents poison messages from blocking the pipeline; enables debugging.
Technologies: Kafka DLQ topics, RabbitMQ dead letter exchanges, AWS SQS DLQ.
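The retry-then-park logic can be sketched in a few lines. This is a broker-agnostic illustration: `process_with_dlq` is a hypothetical function, retries happen immediately with no backoff, and the dead letter queue is a plain list.

```python
def process_with_dlq(messages, handler, max_attempts=3):
    """Process messages in order, parking those that keep failing."""
    processed, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # give up: keep the message and the reason for later inspection
                    dead_letters.append(
                        {"message": message, "error": str(exc), "attempts": attempt}
                    )
    return processed, dead_letters


# "oops" is a poison message: int() fails on every attempt.
processed, dead_letters = process_with_dlq(["1", "oops", "3"], handler=int)
```

The poison message ends up in `dead_letters` with its error attached, and the message after it is still processed.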

Streaming Trade-offs

Latency vs. Throughput

Lower latency (processing each event individually) reduces throughput due to per-event overhead. Micro-batching improves throughput at the cost of higher latency.

Low Latency: Process events one-at-a-time for immediate results but at lower throughput.
High Throughput: Batch multiple events together for efficiency but with added delay.
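A back-of-the-envelope model makes the trade-off visible. The numbers below (5 ms fixed overhead per call, 0.1 ms of work per event, one event arriving every millisecond) are invented for illustration, and `batch_stats` is a hypothetical helper.

```python
def batch_stats(batch_size, per_call_overhead_ms=5.0, per_event_ms=0.1, arrival_gap_ms=1.0):
    """Return (events processed per second, worst-case latency in ms) for a batch size."""
    processing_ms = per_call_overhead_ms + batch_size * per_event_ms
    throughput_per_s = batch_size / processing_ms * 1000
    # The first event in a batch waits for the rest to arrive before work starts.
    worst_case_latency_ms = (batch_size - 1) * arrival_gap_ms + processing_ms
    return throughput_per_s, worst_case_latency_ms


single_throughput, single_latency = batch_stats(1)
batched_throughput, batched_latency = batch_stats(100)
```

Under these assumptions, batching 100 events raises throughput from roughly 196 to roughly 6,667 events per second, while worst-case latency grows from about 5 ms to about 114 ms.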

Ordering vs. Scalability

Strict global ordering limits parallelism. Partitioned ordering (ordering within a partition) enables horizontal scaling while maintaining order where it matters.

Strict Ordering: Single partition or single consumer — limits throughput.
Partitioned Ordering: Order per key/partition — enables parallelism with per-entity ordering.

Consistency vs. Availability

In distributed streaming systems, strong consistency (exactly-once, synchronized state) reduces availability during network partitions. Eventual consistency improves availability.

Strong Consistency: Exactly-once semantics; synchronized state; higher latency.
Eventual Consistency: At-least-once with idempotent consumers; lower latency; temporary inconsistencies.

Durability vs. Performance

Persisting every event to disk ensures durability but adds I/O overhead. In-memory processing is faster but risks data loss.

Durable: Write-ahead logs, replication — no data loss but slower.
Ephemeral: In-memory only — fastest but data loss on failure.

Complexity vs. Freshness

Real-time streaming architectures are more complex (state management, failure handling, backpressure) than batch, but provide fresher data.

Batch: Simple, well-understood, but data is stale.
Streaming: Complex, operationally demanding, but data is fresh.

Cost vs. Real-Time Requirements

Streaming infrastructure (always-on clusters, partitions, replication) costs more than periodic batch jobs. Not every use case justifies the expense.

Streaming: Higher infrastructure cost; justified for real-time requirements.
Batch: Lower cost; sufficient when latency of minutes or hours is acceptable.

When to Use Streaming?

Real-Time Analytics and Monitoring

Data must be analyzed as it arrives — dashboards, metrics, anomaly detection.
When: You need up-to-the-second insights, not stale batch reports.

Event-Driven Architectures

Services communicate through events rather than synchronous API calls.
When: You are building loosely coupled microservices that react to state changes.

Real-Time User Experiences

Users expect instant feedback — live notifications, chat, collaborative editing, live scores.
When: User experience degrades with even seconds of delay.

Fraud Detection and Alerting

Suspicious activity must be detected and acted upon immediately.
When: Delayed detection means financial or security losses.

IoT and Sensor Data

Millions of devices emit continuous data streams that must be ingested, processed, and acted upon.
When: High-volume, high-velocity data from distributed devices.

Data Integration and CDC

Keep multiple systems in sync by streaming changes from a source of truth.
When: You need near-real-time replication across databases, search indexes, or caches.

Log Aggregation and Observability

Aggregate logs, metrics, and traces from distributed systems for real-time observability.
When: Debugging production issues requires real-time visibility.

Media Delivery

Deliver audio and video content to users without requiring full downloads.
When: Content is large, consumption is sequential, and users expect immediate playback.

When NOT to Use Streaming?

Simple CRUD applications: A traditional request-response model suffices.
Infrequent data updates: If data changes hourly or daily, batch processing is simpler and cheaper.
Small data volumes: The overhead of streaming infrastructure is not justified.
Complex ad-hoc queries: Streaming excels at predefined queries; for exploratory analysis, batch/SQL is better.
Budget constraints: Streaming infrastructure is costlier to run and maintain than periodic batch jobs.

Conclusion

Streaming has evolved from a niche technique for media delivery into a foundational paradigm for building real-time, event-driven systems at scale. Whether you are processing millions of IoT events per second, powering a live dashboard, delivering video to millions of users, or keeping microservices in sync through event-driven communication, streaming provides the architecture to handle continuous, unbounded data with low latency and high throughput.

Choosing the right streaming approach requires understanding the trade-offs: latency vs. throughput, ordering vs. scalability, consistency vs. availability, and complexity vs. freshness. The best systems match their streaming strategy to their actual requirements — not every application needs exactly-once semantics or sub-millisecond latency.

In a world where users expect instant responses and businesses demand real-time insights, streaming is no longer optional — it is an essential tool in the modern system designer’s toolkit.

Building Vision AI Without Training: A Practical Guide to Gemini Embeddings 2

Dr. Yaroslav Zhbankov — Thu, 19 Mar 2026 03:14:13 GMT

Background

Imagine pointing a camera at a supermarket shelf and asking, “Where is the rosemary?” and getting an instant answer. No fine-tuned YOLO model. No labeled dataset. No training pipeline.

Or imagine a factory safety system detecting a “person lying on the floor” without a single developer ever teaching it what a “fallen person” looks like.

This is the power of Gemini Embedding 2. Google’s latest multimodal model maps images and text into a shared 3072-dimensional vector space [1]. Because the math is the same for both media types, you can calculate the distance between a photo and an English phrase using simple cosine similarity.

In this guide, we’ll build two practical examples in under 50 lines of code each:

Visual Phrase Search: A heatmap generator to locate specific items in a busy image.
Zero-Shot Alert System: A real-time monitor for custom safety anomalies.

What Are Embeddings

Think of embeddings as coordinates in a high-dimensional space where meaning becomes distance.

“rosemary” is close to images of rosemary
“fire” is close to flames and smoke
unrelated concepts (e.g., “car” and “banana”) are far apart

Instead of training a classifier, you convert text → vector; convert image → vector; measure distance (similarity). If they are close → they match. This is why you can search images using natural language — without training anything.
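The three steps above can be sketched without any API at all. The snippet below is a minimal illustration, assuming hand-made 3-dimensional vectors in place of real embeddings; the vector values and the `best_match` helper are hypothetical and exist only to show that "closest vector wins" is the entire matching rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / mag if mag else 0.0

# Hypothetical stand-ins for real embeddings (a real model returns thousands of dimensions).
text_rosemary = [0.9, 0.1, 0.0]
image_rosemary = [0.8, 0.2, 0.1]
image_car = [0.0, 0.1, 0.9]

def best_match(query, candidates):
    """Return the name of the candidate whose vector is closest to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

candidates = {"rosemary photo": image_rosemary, "car photo": image_car}
print(best_match(text_rosemary, candidates))  # rosemary photo
```

Swapping the hand-made vectors for model output changes nothing in the matching logic, which is why no classifier needs to be trained.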

Why Multimodal Embeddings Change Everything

Traditional computer vision is rigid. To detect a new object, you must collect images, label bounding boxes, retrain, and redeploy.

Multimodal embeddings flip the script:

Zero-shot detection: Describe your target in plain English.
No training data: Generalizes well across many domains.
Dynamic updates: Change the search query in the UI, and the “model” updates instantly.

In Gemini Embedding 2, the embedding dimension is 3072. The cross-modal similarity range is compressed (typically between 0.15 and 0.55), but within that range the signal is remarkably strong.

Setup

You need a Google AI Studio API key (free tier works) and three Python packages [2]:

pip install google-genai Pillow python-dotenv

export GOOGLE_API_KEY=your_key_here

Example 1: Visual Phrase Search — “Find the Rosemary”

The idea: Take a photo, split it into a grid, embed every patch, and compare each patch to the search phrase. The result is a heatmap of semantic relevance.

Here is the complete, minimal implementation:

import base64, math, os
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
MODEL = "gemini-embedding-2-preview"

def embed_image(img: Image.Image) -> list[float]:
    """Embed a PIL image using Gemini Embedding 2."""
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=85)
    # Pass raw JPEG bytes; the SDK handles the wire encoding itself.
    resp = client.models.embed_content(
        model=MODEL,
        contents=types.Content(parts=[
            types.Part.from_bytes(
                data=buf.getvalue(), mime_type="image/jpeg"
            )
        ]),
        config=types.EmbedContentConfig(
            task_type="SEMANTIC_SIMILARITY"
        ),
    )
    return resp.embeddings[0].values

def embed_text(text: str) -> list[float]:
    """Embed a text phrase using the same model."""
    resp = client.models.embed_content(
        model=MODEL,
        contents=text,
        config=types.EmbedContentConfig(
            task_type="SEMANTIC_SIMILARITY"
        ),
    )
    return resp.embeddings[0].values

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x*x for x in a)) * math.sqrt(sum(x*x for x in b))
    return dot / mag if mag else 0.0

def search_in_image(image_path: str, phrase: str, grid: int = 10):
    """Find where in an image a phrase matches best."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    pw, ph = w // grid, h // grid
    # 1. Embed every grid patch
    patch_embeddings = []
    for r in range(grid):
        for c in range(grid):
            box = (c * pw, r * ph, (c+1) * pw, (r+1) * ph)
            patch = image.crop(box)
            patch_embeddings.append(embed_image(patch))
    # 2. Embed the search phrase
    text_emb = embed_text(phrase)
    # 3. Score each patch
    scores = [cosine(emb, text_emb) for emb in patch_embeddings]
    # 4. Find top matches
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    for idx, score in ranked[:3]:
        r, c = divmod(idx, grid)
        print(f"  row={r+1} col={c+1} score={score:.4f}")
    return scores

search_in_image("spice_shelf.jpg", "rosemary")

That is it. Under 50 lines of logic. No model training. No labeled bounding boxes.

When I ran this against a photo of a supermarket spice shelf (Fig. 1), the system correctly highlighted the rosemary section with the highest similarity score (Fig. 2). When I changed the phrase to “ground black pepper”, the heatmap shifted to the correct area (Fig. 3).

Fig. 1. Original photo of supermarket spice shelf

How It Looks

The output is a color-coded heatmap overlay: green regions have low similarity, yellow is moderate, and red is high. The top 3 matching patches get highlighted bounding boxes — red for the best match, orange for second, yellow for third.
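The listing above only prints the top matches, so the colour mapping is left implicit. The following is a minimal sketch of one way to turn a score into a heatmap colour, assuming a linear green-to-yellow-to-red ramp over the 0.15–0.55 range mentioned earlier. The `score_to_rgb` function and its `lo`/`hi` defaults are hypothetical, not the code used to produce the figures.

```python
def score_to_rgb(score, lo=0.15, hi=0.55):
    """Map a similarity score to an RGB colour: green (low) -> yellow -> red (high)."""
    t = (score - lo) / (hi - lo)
    t = max(0.0, min(1.0, t))  # clamp scores outside the expected range
    if t < 0.5:
        # First half of the ramp: green to yellow (red channel rises).
        return (int(round(510 * t)), 255, 0)
    # Second half of the ramp: yellow to red (green channel falls).
    return (255, int(round(510 * (1.0 - t))), 0)

for s in (0.15, 0.35, 0.55):
    print(s, score_to_rgb(s))
```

Each patch can then be tinted with its colour and blended over the original image, for example with Pillow's `Image.blend`.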

Fig. 2. Search for “rosemary”

Fig. 3. Search for “ground black pepper”

The beauty is that I never told the model what “rosemary” looks like. The shared embedding space already “understands” the visual–semantic relationship.

Example 2: Custom Manufacturing Alerts — Zero-Training Anomaly Detection

Now let’s flip the use case. Instead of searching within an image, we compare an entire camera frame against a set of user-defined alert phrases. If the similarity score crosses a threshold, we fire an alert.

Think of it as a completely customizable surveillance system where the operator types what to watch for in plain English:

“person lying on the floor”
“forklift moving near pedestrian”
“spill on the ground”
“smoke or fire”
“worker without hard hat”

No retraining. No new dataset. Just a new string.

import base64, math, os, time
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
MODEL = "gemini-embedding-2-preview"

def embed_image(img: Image.Image) -> list[float]:
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=85)
    # Pass raw JPEG bytes; the SDK handles the wire encoding itself.
    resp = client.models.embed_content(
        model=MODEL,
        contents=types.Content(parts=[
            types.Part.from_bytes(
                data=buf.getvalue(), mime_type="image/jpeg"
            )
        ]),
        config=types.EmbedContentConfig(
            task_type="SEMANTIC_SIMILARITY"
        ),
    )
    return resp.embeddings[0].values

def embed_text(text: str) -> list[float]:
    resp = client.models.embed_content(
        model=MODEL,
        contents=text,
        config=types.EmbedContentConfig(
            task_type="SEMANTIC_SIMILARITY"
        ),
    )
    return resp.embeddings[0].values

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x*x for x in a)) * math.sqrt(sum(x*x for x in b))
    return dot / mag if mag else 0.0

# ── User-defined alerts ─────────────────────────────────────
ALERTS = [
    "person lying on the floor",
    "forklift near a pedestrian",
    "liquid spill on the ground",
    "smoke or fire",
]
THRESHOLD = 0.35   # Tune based on your environment
# Pre-embed all alert phrases once at startup
alert_embeddings = {phrase: embed_text(phrase) for phrase in ALERTS}

def check_frame(frame: Image.Image):
    """Compare one camera frame against all alert phrases."""
    frame_emb = embed_image(frame)
    for phrase, phrase_emb in alert_embeddings.items():
        score = cosine(frame_emb, phrase_emb)
        if score >= THRESHOLD:
            print(f"  ALERT  [{score:.3f}] {phrase}")

# ── Simulated camera loop ───────────────────────────────────
def monitor(image_source: str, interval: int = 5):
    """Poll a camera (or image file) and check for alerts."""
    print(f"Monitoring: {image_source}")
    print(f"Watching for: {', '.join(ALERTS)}")
    print(f"Threshold: {THRESHOLD}\n")
    while True:
        frame = Image.open(image_source).convert("RGB")
        check_frame(frame)
        time.sleep(interval)

monitor("camera_feed.jpg")

Why This Is Powerful

In traditional manufacturing vision systems, adding a new alert type means:

Collecting hundreds of labeled examples
Retraining or fine-tuning a detection model
Redeploying the model
Hoping it generalizes

With multimodal embeddings, a floor manager can literally type “worker without safety goggles” into a text field and the system starts monitoring for it immediately. The cost of a new alert is one API call to embed the new phrase.

Score Calibration

Cross-modal similarity with Gemini Embedding 2 operates in a compressed range:

0.50 and above: very high match
0.42–0.50: strong match
0.33–0.42: moderate match
0.22–0.33: weak match
Below 0.22: little or no match

This is due to the Modality Gap [2, 3]. While Gemini maps images and text into the same space, they tend to cluster in slightly different “neighborhoods.” To the model, a photo of a rose and the word “rose” are conceptually identical, but structurally different.

For an alert system, start with a threshold around 0.35 and adjust based on your false-positive tolerance. You can also use a two-tier system: 0.35 for “warning” and 0.45 for “critical.”
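The two-tier idea can be sketched as a small classification function. This is an illustrative sketch using the 0.35 and 0.45 values suggested above; the constant names and the `classify` helper are hypothetical, and both thresholds should be calibrated on your own footage.

```python
WARNING_THRESHOLD = 0.35   # starting point from the text; tune per environment
CRITICAL_THRESHOLD = 0.45

def classify(score):
    """Return 'critical', 'warning', or None for a cross-modal similarity score."""
    if score >= CRITICAL_THRESHOLD:
        return "critical"
    if score >= WARNING_THRESHOLD:
        return "warning"
    return None

for s in (0.50, 0.40, 0.20):
    print(s, classify(s))
```

In `check_frame`, the single `if score >= THRESHOLD` test could be replaced by a call to a function like this, with the returned tier deciding how the alert is routed.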

Combining Both Approaches

The two examples are complementary. In a real deployment you might:

First pass (Example 2): Compare the full frame against alert phrases. If “person lying on floor” scores above threshold, proceed to step 2.
Second pass (Example 1): Run the grid search to locate where in the frame the match is strongest, drawing a bounding box around the detected area.

This gives you both detection and localization — all from text descriptions, all with zero training.
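The two-pass flow can be sketched as follows. To keep the example runnable without an API key, the scoring functions are passed in as parameters and stubbed with hypothetical values; in a real deployment `frame_score` would wrap the whole-frame comparison from Example 2 and `patch_scores` the grid search from Example 1. The `detect_and_localize` name and its signature are assumptions made for illustration.

```python
def detect_and_localize(phrases, frame_score, patch_scores, threshold=0.35, grid=10):
    """Run the cheap whole-frame check first; localize only phrases that trigger."""
    findings = []
    for phrase in phrases:
        score = frame_score(phrase)       # pass 1: one comparison per phrase
        if score < threshold:
            continue                      # skip the expensive grid search
        scores = patch_scores(phrase)     # pass 2: grid * grid scores, row-major
        best = max(range(len(scores)), key=lambda i: scores[i])
        row, col = divmod(best, grid)
        findings.append((phrase, score, row, col))
    return findings

# Stubbed scores standing in for real API results (hypothetical values).
FRAME = {"smoke or fire": 0.41, "liquid spill on the ground": 0.20}
PATCHES = {"smoke or fire": [0.20, 0.30, 0.50, 0.10]}
grid_calls = []

def fake_patch_scores(phrase):
    grid_calls.append(phrase)             # record which phrases reached pass 2
    return PATCHES[phrase]

result = detect_and_localize(list(FRAME), FRAME.get, fake_patch_scores, grid=2)
print(result)      # [('smoke or fire', 0.41, 1, 0)]
print(grid_calls)  # ['smoke or fire']
```

The grid search runs only for the phrase that crossed the threshold, which keeps the number of per-frame embedding calls low when nothing is happening.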

Scaling With a Vector Database

For image-to-image search (e.g., “find all frames similar to this incident”), store embeddings in a vector database like ChromaDB:

import chromadb

db = chromadb.PersistentClient(path="./alert_db")
collection = db.get_or_create_collection(
    name="camera_frames",
    metadata={"hnsw:space": "cosine"},
)
# Store a frame
collection.upsert(
    ids=["frame_001"],
    embeddings=[frame_embedding],
    metadatas=[{"timestamp": "2025-03-15T10:30:00", "camera": "floor_2"}],
)
# Find similar past incidents
results = collection.query(
    query_embeddings=[new_frame_embedding],
    n_results=5,
)

This enables historical search: “show me all frames that looked like this incident” — useful for audits, pattern analysis, and compliance.

Limitations and Practical Tips

Accuracy [4]. Multimodal embeddings are not a replacement for fine-tuned object detectors when you need pixel-precise bounding boxes or 99.9% recall. They are best for flexible, zero-shot, “good enough” detection where adaptability matters more than precision.

Latency. Each embedding API call takes ~200–500ms. For the grid search (100 patches), that is 100 sequential API calls, or roughly 20–50 seconds per image. For production, batch the calls, reduce the grid size, or issue them concurrently.
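One way to issue the calls concurrently is a thread pool from the standard library. The sketch below assumes the embedding function is safe to call from several threads and that your API quota tolerates the parallelism; `embed_all` and `fake_embed` are hypothetical names, and `fake_embed` merely simulates network latency so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def embed_all(patches, embed_fn, max_workers=8):
    """Embed all patches concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_fn, patches))

def fake_embed(patch):
    """Stand-in for embed_image: sleeps briefly to mimic a network round trip."""
    time.sleep(0.05)
    return [float(patch)]

start = time.perf_counter()
vectors = embed_all(list(range(16)), fake_embed)
print(f"{len(vectors)} embeddings in {time.perf_counter() - start:.2f}s")
```

Because `pool.map` preserves input order, the row and column of each patch can still be recovered from its index with `divmod(idx, grid)`, as in the original listing.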

Score range. Cross-modal scores are compressed (0.15–0.55). Do not compare them to text-to-text similarity. Always calibrate thresholds on your specific domain.

Patch size matters. Too small and the patch loses context. Too large and you lose localization precision. A 10x10 grid is a good starting point for most images.

Conclusion

Gemini Embedding 2 makes a category of computer vision tasks trivially easy that previously required significant ML infrastructure. The ability to compare any image with any text phrase — using a single API and basic cosine similarity — opens up use cases that were simply impractical before:

Retail: Search product shelves by name
Manufacturing: Custom safety alerts in plain English
Agriculture: Identify plant species in field photos
Security: Describe suspicious behavior in words
Healthcare: Flag visual anomalies by description

Zero training. Zero labels. Just embeddings and cosine similarity.

References

[1] Google AI for Developers. (2026). Gemini Embedding 2 Model Documentation. ai.google.dev/gemini-api/docs/models/gemini-embedding-2-preview
[2] Google Cloud Documentation. (2026). Gemini Embedding 2 on Vertex AI. docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/embedding-2
[3] Choi, M. & Duerig, T. (2026). Gemini Embedding 2: Our first natively multimodal embedding model. Google DeepMind Blog. blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/
[4] MindStudio Team. (2026). What Is Matryoshka Representation Learning in Gemini Embedding 2? mindstudio.ai/blog/matryoshka-representation-learning-gemini-embedding-2
[5] VentureBeat. (2026). Google’s Gemini Embedding 2 arrives with native multimodal support to cut costs. venturebeat.com/data/googles-gemini-embedding-2-arrives-with-native-multimodal-support-to-cut

Vector Databases: Searching by Meaning — The Essential Engine of the LLM Era

Dr. Yaroslav Zhbankov — Wed, 25 Feb 2026 04:24:36 GMT

Background

In recent years, the AI revolution has fundamentally reshaped how data is stored, retrieved, and reasoned over. Traditional relational databases, built around structured rows, columns, and exact keyword matches, were never designed to handle the kind of data that powers today’s AI applications: high-dimensional embeddings that capture the meaning of text, images, audio, and more. This gap gave rise to vector databases, a specialized class of data storage systems purpose-built for similarity search in high-dimensional spaces.

Whether you are building a chatbot grounded in proprietary documents, a semantic product search, or a recommendation engine that understands user intent, chances are a vector database sits at the core of the architecture. This article explores what vector databases are, why they have become indispensable, and how they work under the hood, then walks through a practical example using Python.

What Is a Vector Database?

A vector database is a specialized database designed to store, index, and query vector embeddings — dense numerical representations of data in high-dimensional space. Unlike traditional databases that match rows on exact values or keyword patterns, vector databases find records that are semantically similar to a query.

Consider a simple example. In a relational database, searching for “running shoes for wide feet” requires the exact phrase (or careful keyword matching). A vector database, on the other hand, understands that “broad-fit athletic footwear” is semantically close — because both phrases map to nearby points in embedding space.

Embeddings: The Foundation

An embedding is a numerical vector (an array of floating-point numbers) produced by a machine learning model. These models — such as OpenAI’s text-embedding-ada-002 or the open-source all-MiniLM-L6-v2 from Sentence Transformers — convert unstructured data into fixed-length vectors (e.g., 384, 768, or 1536 dimensions).

The key property: semantically similar inputs produce vectors that are close together in the embedding space. This transforms the problem of “understanding meaning” into a geometric problem of “finding nearby points.”

How It Differs From Traditional Databases

Why Vector Databases Became So Popular

The meteoric rise of vector databases is directly tied to the AI and LLM boom. Several converging trends created the perfect conditions.

1. The Large Language Model Revolution

The release of GPT-3 in 2020 and ChatGPT in late 2022 triggered an avalanche of AI-powered applications across every industry. LLMs are powerful, but they have critical limitations: they hallucinate, their training data has a knowledge cutoff, and they lack access to proprietary information. Vector databases solve these problems through Retrieval-Augmented Generation (RAG), which we will explore later.

2. Retrieval-Augmented Generation (RAG)

RAG has emerged as the dominant architecture for enterprise AI. Instead of fine-tuning a model (expensive, static, and slow), RAG retrieves relevant context from a vector database and injects it into the LLM prompt at inference time. This single pattern has made vector databases a requirement rather than a nice-to-have.

3. Embedding Models Became Accessible

Embedding models, both open-source (such as Sentence Transformers) and hosted APIs (such as Cohere), made it trivial to convert text, images, and code into high-quality vectors. When generating embeddings costs fractions of a cent per document, the barrier to adopting vector search collapses. At the moment, the most popular commercial options are:

OpenAI text-embedding-3-large: The industry standard for production. It supports up to 3,072 dimensions but features "Matryoshka" technology, allowing you to shorten embeddings to 256 or 512 dimensions with minimal accuracy loss to save on database storage. Priced at approximately $0.13 per 1M tokens. Its smaller sibling, text-embedding-3-small (1,536 dimensions), comes in at just $0.02 per 1M tokens, making it one of the most cost-effective commercial options.
Voyage AI voyage-3-large: Frequently tops the MTEB (Massive Text Embedding Benchmark). Known for specialized variants tuned specifically for legal, medical, and code-based datasets. Pricing sits around $0.18 per 1M tokens, with the standard voyage-3 at roughly $0.06 per 1M tokens. The specialized models (e.g., voyage-law-2, voyage-code-3) fall in a similar range, offering strong domain-specific performance for the price.
Cohere embed-english-v3.0: Highly optimized for RAG. It is trained to handle "noisy" real-world queries and provides an input_type parameter that distinguishes "search_query" from "search_document" inputs to improve retrieval accuracy. Pricing is approximately $0.10 per 1M tokens, with a generous free tier that includes around 100M tokens per month for trial and prototyping use cases.
Google text-embedding-005: Integrated deeply with the Gemini ecosystem. It offers a massive context window and is particularly strong at multilingual retrieval across 100+ languages. Priced at roughly $0.00025 per 1K characters (approximately $0.01–0.04 per 1M tokens depending on average token length), making it one of the most affordable commercial embedding APIs available.

Prices mentioned for the models are accurate as of February 2026.

Where Are Vector Databases Used?

Retrieval-Augmented Generation (RAG)

The most transformative use case. Organizations embed their internal documents — knowledge bases, policies, product catalogs, support tickets — into a vector database. When a user asks a question, the system:

Converts the query to an embedding
Searches the vector database for the most relevant document chunks
Passes these chunks as context to the LLM
The LLM generates a grounded, accurate answer

This approach reduces hallucinations, keeps the system up-to-date without retraining, and is significantly cheaper than fine-tuning.
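The four steps above can be sketched end to end in a few lines. To stay self-contained, this example assumes a toy word-count "embedding" and an in-memory list in place of a real embedding model and vector database, and it stops at building the prompt rather than calling an LLM. The sample chunks and the `embed`, `retrieve`, and `build_prompt` helpers are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lower-cased word counts. A real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    mag = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / mag if mag else 0.0

chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes 5 to 7 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]   # stands in for the vector database

def retrieve(question, k=1):
    """Steps 1 and 2: embed the query and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, k=1):
    """Step 3: inject the retrieved chunks into the prompt sent to the LLM."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?"))
```

Updating the knowledge base means re-embedding the changed chunks and replacing their index entries; the LLM itself is never retrained.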

Semantic Search

Traditional full-text search matches keywords. Semantic search understands intent. A user searching for “how to fix a leaking faucet” finds results about “plumbing repair” and “dripping tap solutions” — because the meaning is similar even though the words differ. Vector databases power semantic search across e-commerce product discovery, enterprise knowledge management, legal document retrieval, and more.

Recommendation Systems

By representing users and items as vectors in the same embedding space, recommendation engines can surface relevant content without relying on explicit ratings. For example, on an e-commerce platform, each product is embedded based on attributes like style, fabric, color, and customer reviews. Browsing one item triggers a nearest-neighbor search to find semantically similar products.

Image and Multimodal Search

With multimodal models (e.g., CLIP), images and text share the same embedding space. A user can type “sunset over mountains” and retrieve matching photographs — or upload an image to find visually similar products. This enables reverse image search, visual product discovery, and content moderation at scale.

Anomaly Detection

Normal behavior patterns are stored as vectors. When a new event — a financial transaction, network request, or sensor reading — produces an embedding that is far from any known cluster, it is flagged as anomalous. This powers fraud detection in fintech, intrusion detection in cybersecurity, and quality control in manufacturing.
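A minimal sketch of this idea is shown below, assuming 2-dimensional vectors and a hand-picked distance threshold. The sample data, the `is_anomalous` helper, and the `max_distance` value are hypothetical; a production system would use a vector database for the nearest-neighbor lookup and calibrate the threshold on historical data.

```python
import math

def is_anomalous(vector, normal_vectors, max_distance):
    """Flag a vector whose nearest 'normal' neighbor is farther than max_distance."""
    nearest = min(math.dist(vector, n) for n in normal_vectors)
    return nearest > max_distance

# Hypothetical embeddings of known-good events.
normal = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2]]

print(is_anomalous([1.05, 1.0], normal, max_distance=0.5))  # False: close to known behavior
print(is_anomalous([5.0, 5.0], normal, max_distance=0.5))   # True: far from every known point
```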

How Vector Databases Work

Understanding vector databases requires grasping three core concepts: distance metrics, indexing algorithms, and the query pipeline.

Distance Metrics

To determine how “similar” two vectors are, vector databases compute distances using one of several metrics:

Cosine Similarity: Measures the angle between two vectors, ignoring magnitude. Ideal for text embeddings where direction matters more than length. Values range from -1 (opposite) to 1 (identical).
Euclidean Distance (L2): Measures the straight-line distance between two points. Smaller values indicate higher similarity. Works well when the absolute position in space matters.
Dot Product: Sums the products of corresponding components. Sensitive to both direction and magnitude. When vectors are normalized to unit length, it produces the same ranking as cosine similarity and is cheaper to compute.
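The difference between the three metrics is easiest to see on a pair of vectors that point the same way but differ in length. The sketch below uses two hand-picked 2-dimensional vectors; the helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length

print(cosine(a, b))   # 1.0  (identical direction, magnitude ignored)
print(l2(a, b))       # 5.0  (the points are still five units apart)
print(dot(a, b))      # 50.0 (grows with magnitude)

# After normalizing to unit length, the dot product matches cosine similarity.
print(abs(dot(normalize(a), normalize(b)) - cosine(a, b)) < 1e-9)  # True
```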

The Brute-Force Problem

Given a query vector, the naive approach is to compute the distance to every vector in the database and return the closest ones. This is an exact nearest-neighbor search — and it is prohibitively slow for large datasets. Scanning 10 million 1536-dimensional vectors for each query is not practical in production.
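For reference, the naive approach looks like the sketch below: every stored vector is compared with the query, so the cost grows linearly with both the number of vectors and their dimensionality. The `exact_knn` helper and the sample data are illustrative.

```python
import heapq
import math

def exact_knn(query, vectors, k=3):
    """Exact nearest neighbors by scanning every stored vector: O(n * d) per query."""
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]]
print(exact_knn([0.0, 0.0], vectors, k=3))  # [0, 3, 1]

# Scale check for the figures in the text:
# 10,000,000 vectors * 1,536 dimensions = 15,360,000,000 multiply-adds per query.
print(10_000_000 * 1536)
```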

Approximate Nearest Neighbor (ANN) Algorithms

Vector databases solve this with Approximate Nearest Neighbor algorithms. These sacrifice a small amount of accuracy (typically 90–95% recall) for dramatic speed improvements. The two most widely used algorithms are HNSW and IVF.

HNSW (Hierarchical Navigable Small World)

HNSW is the dominant indexing algorithm in modern vector databases. It builds a multi-layered graph structure:

Layer 0 (bottom): Contains all vectors, each connected to its nearest neighbors.
Higher layers: Contain progressively fewer vectors, forming a “skip list” over the graph.

Search process:

The query enters at the top layer, which has very few nodes
The algorithm greedily navigates to the closest node at each layer
It drops to the next layer and continues searching with more granularity
At layer 0, it refines the search among all vectors

This hierarchical approach allows the algorithm to rapidly skip irrelevant regions of the vector space, giving query time that scales roughly as O(log n) in practice.
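The layer-by-layer descent can be illustrated with a deliberately simplified sketch. It assumes hand-built graphs over five 1-dimensional points and tracks a single current node per layer; real HNSW builds its graphs automatically and keeps a candidate list whose size is governed by ef_search. The `greedy_search` and `layered_search` names are hypothetical.

```python
import math

def greedy_search(graph, vectors, query, entry):
    """Move to whichever neighbor is closest to the query; stop when none improves."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = math.dist(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current
        current, current_dist = best, best_dist

def layered_search(layers, vectors, query, entry):
    """Search each layer top to bottom, reusing the result as the next entry point."""
    current = entry
    for graph in layers:
        current = greedy_search(graph, vectors, query, current)
    return current

vectors = {i: [float(i)] for i in range(5)}                 # points 0..4 on a line
upper = {0: [4], 4: [0]}                                    # sparse top layer: long jumps
lower = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}   # layer 0: every node

print(layered_search([upper, lower], vectors, [3.2], entry=0))  # 3
```

Starting from node 0, the upper layer jumps straight to node 4, and the bottom layer needs only one more hop to reach node 3. Searching the bottom layer alone would walk through nodes 1 and 2 first.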

Key parameters:

M — Maximum connections per node. Higher values improve recall but increase memory and index build time.
ef_construction — Search width during index building. Higher values produce a better-quality graph.
ef_search — Search width during query time. The primary tuning knob for the recall-vs-latency trade-off.

HNSW offers excellent query performance and supports dynamic insertions without full re-indexing, which is why it has become the default choice in databases like Qdrant, Weaviate, and pgvector.

IVF (Inverted File Index)

IVF takes a different approach: it partitions the vector space into clusters using k-means clustering.

Search process:

During indexing, vectors are assigned to their nearest cluster centroid
At query time, the algorithm identifies the closest centroids
It only searches vectors within those clusters (controlled by the nprobe parameter)

Trade-offs compared to HNSW:

IVF handles very large datasets efficiently because the cluster structure compresses the search space
However, IVF centroids are fixed when the index is trained, so the index often needs to be rebuilt when new data shifts the distribution
HNSW generally offers better latency for real-time applications, while IVF is better suited for batch-oriented workloads
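The effect of nprobe can be seen in a toy sketch. For brevity it assumes two hand-picked centroids instead of running k-means, and the data points are hypothetical; `build_ivf` and `ivf_search` are illustrative names rather than any library's API.

```python
import math

def build_ivf(vectors, centroids):
    """Assign each vector index to its nearest centroid (the inverted lists)."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [4.9, 5.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]   # hand-picked; real IVF learns these with k-means
lists = build_ivf(vectors, centroids)
print(lists)  # {0: [0, 1, 4], 1: [2, 3]}

query = [6.0, 6.0]
print(ivf_search(query, vectors, centroids, lists, nprobe=1))  # [2]
print(ivf_search(query, vectors, centroids, lists, nprobe=2))  # [4]
```

With nprobe=1 the search misses the true nearest neighbor (index 4), because that point sits near a cluster boundary and was assigned to the other cluster. Raising nprobe recovers it at the cost of scanning more vectors, which is the recall-versus-speed trade-off in miniature.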

Product Quantization (PQ)

For very large datasets where memory is a constraint, Product Quantization compresses vectors by dividing them into subvectors and replacing each with a compact code. This can reduce memory usage by 4–8x while maintaining reasonable accuracy. PQ is often combined with IVF (as IVF-PQ) for large-scale systems.
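The encoding step can be sketched on a tiny example. It assumes a 4-dimensional vector split into two subvectors, with a hand-made codebook of two centroids per subspace; real systems learn much larger codebooks (commonly 256 entries per subspace) with k-means. The `pq_encode` and `pq_decode` helpers are hypothetical.

```python
import math

def pq_encode(vector, codebooks):
    """Split the vector into equal subvectors and store the nearest centroid's index for each."""
    sub_len = len(vector) // len(codebooks)
    codes = []
    for j, codebook in enumerate(codebooks):
        sub = vector[j * sub_len:(j + 1) * sub_len]
        codes.append(min(range(len(codebook)), key=lambda c: math.dist(sub, codebook[c])))
    return codes

def pq_decode(codes, codebooks):
    """Rebuild an approximation of the original vector from its codes."""
    approx = []
    for code, codebook in zip(codes, codebooks):
        approx.extend(codebook[code])
    return approx

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # centroids for dimensions 2-3
]
vector = [0.9, 1.1, 0.1, 0.8]

codes = pq_encode(vector, codebooks)
print(codes)                        # [1, 0]
print(pq_decode(codes, codebooks))  # [1.0, 1.0, 0.0, 1.0]
```

In this toy setup, four 4-byte floats (16 bytes) shrink to two 1-byte codes (2 bytes), an 8x reduction. The decoded vector is only an approximation of the original, which is where the accuracy loss comes from.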

Overview of Popular Vector Databases

The vector database landscape is rich and rapidly evolving. Here is an overview of the most notable options, categorized by deployment model.

Cloud-Managed / Serverless

Pinecone: A fully managed, serverless vector database. Pinecone handles scaling, indexing, and infrastructure entirely, making it the easiest option to adopt. It offers excellent query performance, strong consistency, and predictable pricing. Best for teams that want zero operational overhead and are building production RAG or search applications.

Zilliz Cloud: The managed cloud version of the open-source Milvus. Combines Milvus’s scalability with enterprise features like auto-scaling, monitoring, and role-based access control. Ideal for organizations that want Milvus capabilities without managing the infrastructure.

Open-Source / Self-Hosted

Qdrant: Written in Rust, Qdrant is designed for high performance and production readiness. It stands out with powerful payload (metadata) filtering that integrates directly with the vector search rather than filtering after retrieval. Offers HNSW indexing, horizontal scaling, ACID-compliant transactions, and both gRPC and REST APIs. Available as open-source (self-hosted) or as Qdrant Cloud.

Weaviate: Combines vector search with a knowledge graph, enabling hybrid search that blends semantic similarity with keyword matching and structured filters. Its modular architecture allows plugging in different vectorization models. Particularly strong for use cases requiring both semantic understanding and graph-based relationships.

Milvus: A distributed, cloud-native vector database built for billion-scale datasets. Supports multiple ANN index types (HNSW, IVF, PQ, DiskANN), GPU-accelerated search, and multi-tenancy. Backed by the Linux Foundation (LF AI & Data). Best for organizations with very large datasets and complex scalability requirements.

Chroma: A lightweight, developer-friendly vector database designed for rapid prototyping and small-to-medium-scale AI applications. Simple API, easy to embed into Python applications, and fast to get started with. Ideal for local development and proof-of-concept projects, but may require additional infrastructure for enterprise-grade deployments.

Database Extensions

pgvector (PostgreSQL): A PostgreSQL extension that adds vector data types, distance operations, and HNSW/IVF indexing to your existing Postgres database. The major advantage: you can combine vector search with relational queries in standard SQL, with no additional infrastructure required. Realistically handles up to 10–100 million vectors before performance degrades.

Redis (Vector Search): Redis Stack includes vector search capabilities, leveraging Redis’s in-memory architecture for extremely low-latency queries. A practical choice if Redis is already part of your stack and you need simple vector search without a separate database.

Practical Example: Building Semantic Search With Python and Qdrant

Let’s build a working semantic search engine. We will use Qdrant as our vector database and Sentence Transformers for generating embeddings. The scenario: a knowledge base of technical articles that we can search using natural language queries.

Prerequisites

pip install qdrant-client sentence-transformers

We will use Qdrant in in-memory mode — no Docker or cloud setup required for this example.

Step 1: Define the Dataset

articles = [
    {
        "id": 1,
        "title": "Introduction to Microservices Architecture",
        "content": "Microservices architecture breaks down applications into small, "
                   "independently deployable services. Each service runs in its own "
                   "process and communicates via lightweight protocols like HTTP or gRPC.",
        "category": "architecture"
    },
    {
        "id": 2,
        "title": "Understanding Container Orchestration with Kubernetes",
        "content": "Kubernetes automates the deployment, scaling, and management of "
                   "containerized applications. It groups containers into logical units "
                   "for easy management and discovery.",
        "category": "devops"
    },
    {
        "id": 3,
        "title": "Serverless Computing with AWS Lambda",
        "content": "AWS Lambda lets you run code without provisioning servers. It "
                   "scales automatically and charges only for compute time consumed, "
                   "making it ideal for event-driven workloads.",
        "category": "cloud"
    },
]

Step 2: Initialize the Embedding Model and Qdrant Client

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Load a lightweight embedding model (384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize Qdrant in in-memory mode (no server required)
client = QdrantClient(":memory:")

# Create a collection
COLLECTION_NAME = "tech_articles"
VECTOR_SIZE = 384  # Matches the output of all-MiniLM-L6-v2

client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_SIZE,
        distance=Distance.COSINE,
    ),
)

Step 3: Generate Embeddings and Store in Qdrant

# Prepare texts for embedding (combine title and content for richer representation)
texts = [f"{a['title']}. {a['content']}" for a in articles]

# Generate embeddings in batch
embeddings = model.encode(texts)

# Upload to Qdrant
points = [
    PointStruct(
        id=article["id"],
        vector=embedding.tolist(),
        payload={
            "title": article["title"],
            "content": article["content"],
            "category": article["category"],
        },
    )
    for article, embedding in zip(articles, embeddings)
]

client.upsert(collection_name=COLLECTION_NAME, points=points)

print(f"Indexed {len(points)} articles into Qdrant.")

Step 4: Search

def search(query: str, limit: int = 3):
    """Search for articles semantically similar to the query."""
    query_vector = model.encode(query).tolist()

    results = client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        limit=limit,
    )

    print(f"\nQuery: \"{query}\"")
    print("-" * 60)
    for point in results.points:
        print(f"  Score: {point.score:.4f}")
        print(f"  Title: {point.payload['title']}")
        print(f"  Category: {point.payload['category']}")
        print()

Step 5: Run Queries

# Semantic search — no exact keyword matching required
search("deploying containers at scale")

Expected Output

Query: "deploying containers at scale"
------------------------------------------------------------
  Score: 0.6376
  Title: Understanding Container Orchestration with Kubernetes
  Category: devops

  Score: 0.2931
  Title: Introduction to Microservices Architecture
  Category: architecture

  Score: 0.2482
  Title: Serverless Computing with AWS Lambda
  Category: cloud

The vector database correctly identifies the Kubernetes article as the best match, because the meaning of the query aligns with the meaning of the article content. This is the fundamental advantage of vector search over keyword-based retrieval.

Choosing the Right Vector Database

There is no universal “best” vector database. The right choice depends on your constraints:

Just getting started or prototyping? Start with Chroma (zero configuration) or pgvector (if you already use PostgreSQL). You can always migrate later.
Building production RAG with minimal ops? Pinecone provides a fully managed experience with strong guarantees.
Need powerful metadata filtering? Qdrant integrates filtering directly into the search algorithm rather than applying it post-retrieval.
Require hybrid semantic + keyword search? Weaviate combines both natively.
Operating at billion-vector scale? Milvus is designed for extreme scale with GPU acceleration and multiple index types.
Want to avoid new infrastructure? pgvector or Redis add vector search to databases you likely already run.

Conclusion

Vector databases have rapidly evolved from a niche academic tool into critical infrastructure for modern AI applications. The convergence of accessible embedding models, the RAG paradigm, and the explosive growth of LLM-powered products has made vector search a fundamental building block — not a luxury.

The core idea is elegantly simple: convert data into vectors that capture meaning, then find what is similar. But the engineering behind this — HNSW graphs, quantization, distributed indexing, filtered ANN search — is what makes it work at production scale.

As the ecosystem matures, we are seeing a convergence. Traditional databases like PostgreSQL are adding vector capabilities. Dedicated vector databases are adding relational features. The line between them will continue to blur. What will not change is the underlying paradigm: searching by meaning, not by keywords.

If you are building AI-powered applications and have not yet adopted a vector database — now is the time to start. Begin with the practical example above, experiment with your own data, and evaluate which solution fits your scale and operational model.

Why AI-Driven Development Requires Strong Technical Foundations

Dr. Yaroslav Zhbankov — Sun, 18 Jan 2026 02:49:54 GMT

Everyone’s talking about AI-assisted development. I decided to go all-in: premium subscription, complex multi-service architecture, pure natural language instructions. What followed was a journey through three distinct stages — and a very different understanding of what AI can and can’t do.

Spoiler: There’s no magic here. What is powerful is how fast AI exposes weak architecture — and how cheap it becomes once you fix it.

Stage 1: The Honeymoon Phase

I started building a system with multiple services: an MCP server with a browser API, an AI agent, and an orchestration layer — all executing natural language instructions and performing browser actions.

The results were impressive. Claude Code asked clarifying questions, understood my intent, and generated mountains of working code from just a few high-level directives.

But here’s what nobody tells you: reading AI-generated code takes longer than you’d expect.

A lot of code appears very quickly. Understanding how it all fits together? That takes real time. Getting multiple services to communicate correctly required long conversations — describing protocols, responsibilities, and communication patterns in detail.

The generated code itself was usually fine. The architecture, however, needed constant correction. Claude Code can follow patterns and best practices — but only if you explicitly tell it what those are. You need to describe how the system should work internally.

The uncomfortable truth: You need both system architecture skills to direct the AI and deep technical skills to review the output. Vibe coding doesn’t replace either.

By the end of this stage, I had a working prototype. Not production-ready, but functional. Almost no manual coding — just reading, understanding, and giving precise instructions.

The feeling? Equal parts scary and exciting.

Stage 2: The $60 Wake-Up Call

Reality hit hard.

I discovered that several services in my monorepo were tightly coupled in ways that wouldn’t scale. Fair enough — I asked Claude to rework the design, and it did.

But this exposed something important: vibe coding requires constant supervision. As systems grow complex, debugging gets harder. Understanding what the AI changed gets harder. Architecture and clean abstractions aren’t optional — they’re essential.

Then came a surprising moment.

I had an AI agent executing natural-language browser commands. Using a top-tier model with multiple processing loops, results were solid and reliable. The quality was genuinely good.

The cost? I burned $60 in a couple of minutes.

A closer look showed why: massive amounts of data flowing between the agent and model to achieve that reliability. When I switched to cheaper models or reduced the data exchange, quality collapsed. Results became unpredictable and unstable.

The math became clear:

“Pure AI” approach: Expensive and not very reliable
Cheaper approach: Requires real engineering work

The initial excitement faded into something more grounded. There’s no magic. Every benefit has a cost — money, complexity, or engineering time. Pick your poison.

Stage 3: The Engineering Breakthrough

I was automating browser interactions through a Playwright MCP server. The server provided generic tools, and I was asking the AI to execute abstractly-described tasks using only those tools.

Results: poor, unpredictable, expensive. Still hitting that $60-per-abstract-task ceiling with premium models.

Then I tried something different.

I made the MCP tools more specific. I added an accessibility layer to the web app, effectively treating the AI like a screen reader user. Clear labels. Explicit structure. Defined actions.

Everything changed.

Suddenly, even cheaper models could execute reliably and repeatedly. Costs dropped from $60 to about $0.20 per run — even for complex tasks.
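To illustrate the difference, here is a rough sketch of a generic tool next to a more specific one. Every name and schema below is hypothetical and is not taken from the actual project; the point is only the shape of the contract the model sees.

```typescript
// All names below are hypothetical illustrations, not the tools from the original project.
type ToolSpec = {
    name: string;
    description: string;
    inputSchema: {
        type: "object";
        properties: Record<string, { type: string; description?: string; enum?: string[] }>;
        required: string[];
    };
};

// Generic tool: the model has to discover selectors and plan every low-level step itself.
const genericClick: ToolSpec = {
    name: "browser_click",
    description: "Click the element that matches a CSS selector",
    inputSchema: {
        type: "object",
        properties: { selector: { type: "string" } },
        required: ["selector"]
    }
};

// Specific tool: one well-defined action addressed by an accessible label,
// the way a screen reader user would find a control.
const activateControl: ToolSpec = {
    name: "activate_labeled_control",
    description: "Activate a button or link identified by its accessible label",
    inputSchema: {
        type: "object",
        properties: {
            role: { type: "string", enum: ["button", "link"] },
            label: { type: "string", description: "Exact aria-label or visible text" }
        },
        required: ["role", "label"]
    }
};

console.log([genericClick, activateControl].map((t) => t.name).join(", "));
```

The second definition narrows what the model can ask for, which is one plausible reason cheaper models become reliable: there is less to guess and less page data to send back and forth.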

The Real Lesson

AI alone cannot solve engineering problems.

To be effective, AI needs well-designed systems, clear constraints, and engineering support. The technology is powerful, but it’s a tool — not a replacement for thinking.

This applies to vibe coding too. Simply asking AI to write code rarely solves anything meaningful. Real value comes from the engineering work: defining architecture, understanding component interactions, ensuring solutions address actual business needs.

I’ve talked to friends who are engineers at companies ranging from Amazon to small startups. The pattern is consistent: those who see AI as a productivity multiplier combine it with solid engineering fundamentals. Those who expect magic are disappointed.

Three stages of discovery, condensed:

Vibe coding works — but you need architecture skills to direct it and technical skills to review output
“Pure AI” solutions are expensive; cheap solutions require engineering investment
The real leverage comes from building systems that help AI succeed — not from hoping AI will figure it out

The world is changing. But engineering isn’t going anywhere.

Building an MCP Server in TypeScript and Connecting with OpenAI

Dr. Yaroslav Zhbankov — Fri, 21 Nov 2025 02:38:00 GMT

Introduction

As LLMs evolve, the demand for a standardized way to connect them with external tools, data sources, and systems has grown. These models are already strong at reasoning and text generation, but their full potential is unlocked only when they can interact directly with APIs, databases, and local resources and perform real-world actions beyond text.

The Model Context Protocol (MCP) [1] addresses this by defining a unified, structured, and secure way for LLMs to communicate with external systems. It lets developers expose tools, resources, and prompts that models can use dynamically — bridging the gap between language understanding and real-world execution.

This idea is becoming more popular because it makes integrations easier, improves reusability, and offers a clear way to extend AI systems like ChatGPT or Ollama. By the end of this article, you will have a working MCP Server connected to ChatGPT that exposes custom tools and responds to prompts in real time — providing a foundation for smarter and more context-aware AI applications.

What Will Be Built

In this article, we’ll create a simple MCP Server using TypeScript and connect it to ChatGPT via OpenAI’s MCP integration. The server will expose several basic tools — such as file and database access, system health check, and email notifications — that ChatGPT can invoke to interact with external data and execute actions on your behalf.

The goal of this exercise is to demonstrate how developers can connect user-specific data sources (like databases, file systems, scripts, or APIs) and custom actions to ChatGPT through MCP. By the end, you’ll see how ChatGPT can use these tools to analyze data or trigger operations on a server — extending its capabilities beyond text reasoning.

We’ll focus on the core integration workflow and tool definition, keeping things simple and easy to follow. Topics such as scaling, authentication, and production deployment are not covered here: they are implementation-specific, depend on your own environment, and are widely described in other articles and books.

Solution Overview

In this article, we’ll build a simple MCP Server written in TypeScript that exposes a set of practical tools to ChatGPT via the MCP. The server will include tools for performing database queries, reading internal documentation, checking system health metrics, and even sending emails. This setup demonstrates how MCP enables seamless integration between an AI model and external systems, letting the LLM perform real-world tasks instead of only generating text.

At a high level, the solution looks like this:

The MCP Client serves as a bridge between the LLM and the MCP Server. It converts the model’s instructions into tool calls and translates the server’s responses back into natural language. Meanwhile, the MCP Server defines and exposes tools — each representing an executable capability (e.g., system_health_check or send_email)—along with optional resources (structured data or documents that can be queried) and prompts (predefined templates or guidance for specific tasks). In the MCP context, tools perform actions, resources provide information, and prompts guide the LLM’s reasoning or formatting.

MCP supports bi-directional communication and tool discovery, meaning the client can dynamically request the server’s list of available tools and their schemas before invoking any operation. The interaction flow typically follows these steps:

User → MCP Client: The user sends a natural-language request (e.g., “Generate a system health check report.”).
MCP Client → MCP Server: The client requests a list of available tools (tools/list).
MCP Client → LLM: The client sends the user’s request and the tool list to the LLM, asking which tools to use and with what parameters.
LLM → MCP Client: The LLM returns a structured plan — a list of tool calls and arguments.
MCP Client → MCP Server: The client requests the execution of selected tools (tools/call), and the server runs those tools and returns the results.
MCP Client → LLM: The client provides the tool results to the LLM for reasoning and final response generation.
LLM → MCP Client:
- If the LLM needs more data, it produces another tool call → return to step 5 (loop).
- If the LLM has enough information, it generates the final natural-language response.
MCP Client → User: The client presents the final answer to the user.
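The two server-facing steps in this flow (tools/list and tools/call) are plain JSON-RPC 2.0 requests. The sketch below shows what those messages might look like on the wire; the helper names are illustrative and the ids are arbitrary strings chosen by the client.

```typescript
// Shape of a JSON-RPC 2.0 request as used by the client in this article.
type JsonRpcRequest = {
    jsonrpc: "2.0";
    id: string;
    method: string;
    params?: Record<string, unknown>;
};

// Step 2 of the flow: ask the server which tools it exposes.
function toolsListRequest(id: string): JsonRpcRequest {
    return { jsonrpc: "2.0", id, method: "tools/list", params: {} };
}

// Step 5 of the flow: ask the server to run one tool with the arguments chosen by the LLM.
function toolsCallRequest(id: string, name: string, args: Record<string, unknown>): JsonRpcRequest {
    return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

console.log(JSON.stringify(toolsListRequest("1")));
console.log(JSON.stringify(toolsCallRequest("2", "system_health_check", {})));
```

The server answers each request with a message carrying the same id and either a result or an error field, which is how the client matches responses to requests.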

This description highlights how MCP transforms the way LLMs interact with systems — making AI not only conversational but operational.

MCP Server Implementation

This example describes how to implement an MCP server using TypeScript. The server exposes a set of simple tools, such as running database queries, reading documentation, sending emails, and performing system health checks — just enough to experiment with the MCP.

The MCP server is implemented as a WebSocket-based server that handles requests using commands such as initialize, tools/list, and tools/call.

Below is the initialization of the server using the ws package:

import {WebSocketServer} from 'ws';
import controller from './lib/controller/index.js';
import {config} from './lib/config.js';
import {parseSafe} from './lib/utils/index.js';

const wss = new WebSocketServer({ port: config.serverPort });

console.log(`MCP WebSocket server running on ws://localhost:${config.serverPort}`);


wss.on('connection', (ws) => {
    console.log('Client connected');

    ws.on('message', async (raw) => {
        const msg = parseSafe(raw.toString());
        await controller(msg, ws);
    });

    ws.on('close', () => console.log('Client disconnected'));
});
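The parseSafe helper is imported above but not shown in the article. A minimal sketch, assuming it simply returns null when the payload is not valid JSON (which is what the controller's falsy check on msg expects), could look like this:

```typescript
// Hypothetical implementation of the parseSafe helper used by the server.
// Returns the parsed value, or null if the input is not valid JSON.
function parseSafe(raw: string): any {
    try {
        return JSON.parse(raw);
    } catch {
        return null;
    }
}

console.log(parseSafe('{"jsonrpc":"2.0","id":"1","method":"tools/list"}'));
console.log(parseSafe("not json")); // null
```

Returning null instead of throwing keeps a malformed message from crashing the message handler; the controller can then reply with a JSON-RPC error.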

The following code shows the controller layer implementation. Based on the request parameters, it either returns the list of available tools or executes a tool with the provided arguments:

import {WebSocket} from 'ws';
import {tools} from '../models/Tools.js';

function sendResult(client: WebSocket, id: string, result: any) {
    client.send(JSON.stringify({
        jsonrpc: "2.0",
        id,
        result
    }));
}

function sendError(client: WebSocket, id: any, code: number, message: string) {
    client.send(JSON.stringify({
        jsonrpc: "2.0",
        id,
        error: { code, message }
    }));
}

function handleInitialize(msg: any, client: WebSocket) {
    sendResult(client, msg.id, {
        protocolVersion: "2024-11-05",
        serverInfo: {
            name: "typescript-mcp-server",
            version: "0.1.0"
        },
        capabilities: {
            tools: {
                list: true,
                call: true
            }
        }
    });

    client.send(JSON.stringify({
        jsonrpc: "2.0",
        method: "notifications/initialized",
        params: {}
    }));
}

function handleToolList(msg: any, client: WebSocket) {
    const toolList = Array.from(tools.values()).map(t => ({
        name: t.name,
        description: t.description,
        inputSchema: t.inputSchema,
        outputSchema: t.outputSchema
    }));

    sendResult(client, msg.id, {
        tools: toolList,
        nextCursor: null
    });
}

async function handleCallTool(msg: any, client: WebSocket) {
    const { name: toolName, arguments: args } = msg.params ?? {};
    const tool = tools.get(toolName);

    if (!tool) {
        sendError(client, msg.id, -32601, `Tool "${toolName}" not found`);
        return;
    }

    try {
        const output = await tool.execute(args);
        sendResult(client, msg.id, { output });
    } catch (err: any) {
        sendError(client, msg.id, -32000, err?.message ?? "Unknown tool error");
    }
}

export default async function controller(msg: any, client: WebSocket) {
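    if (!msg) {
        // The payload could not be parsed as JSON (-32700 is the JSON-RPC "Parse error" code)
        sendError(client, null, -32700, "Parse error");
        return;
    }

    if (msg.jsonrpc !== "2.0") {
        // Valid JSON, but not a JSON-RPC 2.0 request (-32600 is "Invalid Request")
        sendError(client, msg.id ?? null, -32600, "Invalid JSON-RPC request");
        return;
    }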

    const { method } = msg;

    switch (method) {
        case "initialize":
            handleInitialize(msg, client);
            break;

        case "tools/list":
            handleToolList(msg, client);
            break;

        case "tools/call":
            await handleCallTool(msg, client);
            break;

        default:
            sendError(client, msg.id, -32601, `Unknown method: ${method}`);
    }
}

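The list of tools is defined as a simple hash map. Each tool in this example includes the following fields (the MCP specification requires only the name and the input schema; the description and output schema are optional but help the model choose and call the tool):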

name — so the client can reference it
description — so the LLM model knows when to use it
input schema — so the LLM model knows how to call it
output schema — so the LLM model and the client know what to expect in the response

In this example, the following tools are defined:

add — a basic math operation
sql_query — runs SQL queries generated by the AI
system_health_check — executes server-side scripts to check system status
api_docs — returns API documentation
send_email — sends an email using SMTP

Here is the code that registers these tools:

import nodemailer from 'nodemailer';
import {queryMySQL} from '../utils/index.js';
import {runSystemHealthCheck} from '../utils/healthCheck.js';
import {readLoginDocs, readGroupsDocs} from '../utils/readFile.js';

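type ToolDefinition = {
    name: string;
    title?: string;
    description?: string;
    inputSchema?: any;
    outputSchema?: any;
    execute: (input: any) => Promise<any>;
};

// Registry of tools, keyed by tool name
export const tools = new Map<string, ToolDefinition>();

function registerTool(
    name: string,
    meta: Omit<ToolDefinition, 'name' | 'execute'>,
    handler: ToolDefinition['execute']
) {
    tools.set(name, { name, ...meta, execute: handler });
}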

registerTool(
    'add',
    {
        title: 'Addition Tool',
        description: 'Add two numbers',
        inputSchema: {
            type: 'object',
            properties: {
                a: { type: 'number' },
                b: { type: 'number' }
            },
            required: ['a', 'b']
        },
        outputSchema: {
            type: 'object',
            properties: { result: { type: 'number' } },
            required: ['result']
        }
    },
    async ({ a, b }: { a: number; b: number }) => {
        const result = { result: a + b };
        return {
            content: [{ type: 'text', text: JSON.stringify(result) }],
        };
    }
);

registerTool(
    'sql_query',
    {
        title: 'SQL Query Tool',
        description: 'Execute SQL query',
        inputSchema: {
            type: 'object',
            properties: { query: { type: 'string' } },
            required: ['query']
        },
        outputSchema: { type: 'array', items: { type: 'object' } }
    },
    async ({ query }: { query: string }) => {
        // put your database credentials here
        const result = await queryMySQL(
            { password: 'mypassword', user: 'myuser', host: 'localhost', database: 'mydatabase' },
            query
        );
        return {
            content: [{ type: 'text', text: JSON.stringify(result) }],
        };
    }
);

registerTool(
    'system_health_check',
    {
        title: 'System Health Check Tool',
        description: 'Return system health status',
        inputSchema: { type: 'object', properties: {} },
        outputSchema: { type: 'object' }
    },
    async () => {
        const result = runSystemHealthCheck();
        return {
            content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
        };
    }
);

registerTool(
    'api_docs',
    {
        title: 'API Documentation Tool',
        description: 'Return API documentation',
        inputSchema: { type: 'object', properties: {} },
        outputSchema: { type: 'object' }
    },
    async () => {
        const result = await readLoginDocs();
        return {
            content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
        };
    }
);

registerTool(
    'send_email',
    {
        title: 'Send Email Tool',
        description: 'Send email using SMTP via Nodemailer',
        inputSchema: {
            type: 'object',
            properties: {
                to: { type: 'string', format: 'email' },
                subject: { type: 'string' },
                text: { type: 'string' }
            },
            required: ['to', 'subject', 'text']
        },
        outputSchema: {
            type: 'object',
            properties: { result: { type: 'string' } },
            required: ['result']
        }
    },
    async ({ to, subject, text }: {to: string, subject: string, text: string}) => {
        try {
            // put your email service configuration here
            const smtpUrl = 'smtp://localhost:25';

            const transporter = nodemailer.createTransport(smtpUrl);

            const mailOptions = {
                from: 'no-reply@gmail.com',
                to,
                subject,
                text
            };

            await transporter.sendMail(mailOptions);

            return { result: 'Email sent successfully' };
        } catch (error: any) {
            return { result: `Error sending email: ${error.message}` };
        }
    }
);
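The handlers above trust whatever arguments the model sends. Since every tool already declares an inputSchema, the server could check arguments before calling execute. The following is a deliberately small sketch of that idea, not a full JSON Schema validator: it only checks required keys and the primitive types string, number, and boolean. The validateArgs name is an assumption introduced here.

```typescript
type PropertySchema = { type?: string };
type ObjectSchema = {
    type: "object";
    properties?: Record<string, PropertySchema>;
    required?: string[];
};

// Returns a list of human-readable problems; an empty list means the arguments look acceptable.
function validateArgs(schema: ObjectSchema, args: unknown): string[] {
    if (typeof args !== "object" || args === null || Array.isArray(args)) {
        return ["Arguments must be an object"];
    }
    const values = args as Record<string, unknown>;
    const errors: string[] = [];

    // Every key listed in `required` must be present.
    for (const key of schema.required ?? []) {
        if (values[key] === undefined) errors.push(`Missing required argument "${key}"`);
    }

    // Primitive type check for the properties that were provided.
    const properties = schema.properties ?? {};
    for (const key of Object.keys(properties)) {
        const expected = properties[key].type;
        const value = values[key];
        const checkable = expected === "string" || expected === "number" || expected === "boolean";
        if (value !== undefined && checkable && typeof value !== expected) {
            errors.push(`Argument "${key}" must be of type ${expected}`);
        }
    }
    return errors;
}

// Input schema of the `add` tool from the article.
const addInputSchema: ObjectSchema = {
    type: "object",
    properties: { a: { type: "number" }, b: { type: "number" } },
    required: ["a", "b"]
};

console.log(validateArgs(addInputSchema, { a: 1, b: 2 }));   // no problems
console.log(validateArgs(addInputSchema, { a: 1, b: "2" })); // one problem: b has the wrong type
```

In handleCallTool, a non-empty result could be turned into a JSON-RPC error (for example code -32602, "Invalid params") instead of running the tool. For anything beyond a demo, a dedicated schema validation library is the safer choice.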

MCP Client Implementation

To interact with the MCP server, we use an MCP client. In this example, the client is a CLI application that reads user input from the console and orchestrates communication between the MCP server and the AI model.

The following code shows a simple console interface that reads user input, processes it, and prints the results:

import {processQuery} from './lib/models/queryProcessor.js';

async function main() {
    console.log('🧩 MCP + OpenAI Client Ready! Type your query (or "exit").');
    process.stdin.setEncoding('utf8');

    const ask = () => {
        process.stdout.write('\nYou: ');
        process.stdin.once('data', async (input: string) => {
            const query = input.trim();
            if (query.toLowerCase() === 'exit') process.exit(0);
            try {
                const response = await processQuery(query);
                console.log('\n🤖 LLM:', response);
            } catch (err: any) {
                console.error('❌ Error:', err.message);
            }
            ask();
        });
    };
    ask();
}

main();

The processQuery function defines how the client passes messages between the MCP server and the LLM.
Each query begins by requesting the list of MCP tools. From the tool list and the user query, the client builds a system prompt that defines the LLM’s behavior, then sends everything to the LLM.

The LLM responds with a list of tool calls to execute. The client forwards these tool requests to the MCP server and feeds the results back to the LLM. The loop repeats until the LLM stops requesting tool calls, at which point the query processor returns the final answer.

import {callMcp} from './mcpClient.js';
import {callAi, Message, Tool} from './aiClient.js';

type ToolSchema = {
    name: string,
    description: string,
    inputSchema: Record<string, unknown>
};

async function getMcpTools(): Promise<Tool[]> {
    const toolsData = await callMcp("tools/list");

    return toolsData.tools.map((t: ToolSchema) => ({
        type: "function",
        function: {
            name: t.name,
            description: t.description,
            parameters: t.inputSchema || {}
        }
    }));
}

export async function processQuery(userQuery: string) {
    const tools: Tool[] = await getMcpTools();

    const systemPrompt = `
You are an intelligent assistant connected to a tool system.
When appropriate, call a tool using a JSON function call.
Available tools:
${tools.map((t: Tool) => `- ${t.function.name}: ${t.function.description}`).join("\n")}
    `;

    let messages: Message[] = [
        { role: "system", content: systemPrompt, name: "system" },
        { role: "user", content: userQuery, name: 'user' }
    ];
    let response = await callAi(messages, tools);

    while (response?.tool_calls?.length) {
        for (const call of response.tool_calls) {
            const { name, arguments: argStr } = call.function;
            const args = argStr ? JSON.parse(argStr) : {};

            const toolResult = await callMcp("tools/call", { name, arguments: args });

            messages.push({
                role: "function",
                name,
                content: JSON.stringify(toolResult)
            });
        }

        // Ask the AI again with updated messages
        response = await callAi(messages, tools);
    }

    return response.content ?? "No response.";
}
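The loop above returns tool results with the older function role. With the tools parameter, the Chat Completions API documents a slightly different exchange: the assistant message that requested the tools is appended first, and each result follows as a tool message that references the call through tool_call_id. Here is a sketch of that variant with the model and the MCP call injected as functions, so it can be read (and tested) without network access. The runToolLoop name, the maxRounds guard, and the local message types are assumptions made for this sketch, not part of the original project.

```typescript
type ToolCall = { id: string; type: "function"; function: { name: string; arguments: string } };
type ChatMessage =
    | { role: "system" | "user"; content: string }
    | { role: "assistant"; content: string | null; tool_calls?: ToolCall[] }
    | { role: "tool"; tool_call_id: string; content: string };
type AssistantReply = { content: string | null; tool_calls?: ToolCall[] };

async function runToolLoop(
    messages: ChatMessage[],
    callModel: (messages: ChatMessage[]) => Promise<AssistantReply>,
    callTool: (name: string, args: unknown) => Promise<unknown>,
    maxRounds = 5 // safety limit so a confused model cannot loop forever
): Promise<string> {
    let reply = await callModel(messages);

    for (let round = 0; round < maxRounds; round++) {
        const calls = reply.tool_calls ?? [];
        if (calls.length === 0) break;

        // The assistant message that requested the tools must precede the tool results.
        messages.push({ role: "assistant", content: reply.content, tool_calls: calls });

        for (const call of calls) {
            const args = call.function.arguments ? JSON.parse(call.function.arguments) : {};
            const result = await callTool(call.function.name, args);
            // Each result is linked to its request through tool_call_id.
            messages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(result) });
        }

        reply = await callModel(messages);
    }

    return reply.content ?? "No response.";
}
```

In the article's client, callModel would wrap callAi and callTool would wrap callMcp("tools/call", ...). The round limit is a design choice: it caps cost if the model keeps requesting tools.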

The MCP caller is a simple WebSocket client that sends JSON-RPC requests. On the first MCP call, it establishes a WebSocket connection with the server and reuses it for all subsequent communication.

import WebSocket from 'ws';
import {v4 as uuidv4} from 'uuid';
import {config} from '../config.js';

let ws: WebSocket | null = null;
let connected = false;

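// Pending requests keyed by JSON-RPC id; each entry resolves the matching send() promise
const pending = new Map<string, (msg: any) => void>();

async function ensureConnected(): Promise<void> {
    if (connected && ws) return;

    ws = new WebSocket(config.serverUrl);

    await new Promise<void>((resolve, reject) => {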
        ws!.once("open", resolve);
        ws!.once("error", reject);
    });

    ws.on("message", (raw) => {
        let msg;
        try {
            msg = JSON.parse(raw.toString());
        } catch {
            console.error("❌ Invalid JSON from MCP:", raw.toString());
            return;
        }

        if (msg.id && pending.has(msg.id)) {
            pending.get(msg.id)!(msg);
            pending.delete(msg.id);
            return;
        }
    });

    const initMsg = {
        jsonrpc: "2.0",
        id: uuidv4(),
        method: "initialize",
        params: {
            clientInfo: { name: "openai-client", version: "1.0.0" },
            protocolVersion: "2024-11-05", // same protocol version the server reports
            capabilities: {}
        }
    };

    ws.send(JSON.stringify(initMsg));

    await new Promise<void>((resolve) => {
        const handler = (raw: any) => {
            const msg = JSON.parse(raw.toString());

            if (msg.method === "notifications/initialized") {
                ws!.off("message", handler);
                connected = true;
                resolve();
            }
        };

        ws!.on("message", handler);
    });

    console.log("🔌 Connected to MCP server");
}

function send(msg: any): Promise<any> {
    const id = msg.id;

    return new Promise((resolve) => {
        pending.set(id, resolve);
        ws!.send(JSON.stringify(msg));
    });
}

export async function callMcp(action: "tools/list" | "tools/call", params: any = {}) {
    await ensureConnected();
    const id = uuidv4();

    if (action === "tools/list") {
        const res = await send({
            jsonrpc: "2.0",
            id,
            method: "tools/list",
            params: {}
        });

        if (res.error) {
            throw new Error(res.error.message);
        }

        return res.result;
    }

    if (action === "tools/call") {
        const res = await send({
            jsonrpc: "2.0",
            id,
            method: "tools/call",
            params: {
                name: params.name,
                arguments: params.arguments
            }
        });

        if (res.error) {
            throw new Error(res.error.message);
        }

        return res.result;
    }

    throw new Error("Unknown MCP action: " + action);
}
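One thing to be aware of in the caller above: send() waits forever if the server never answers, because nothing rejects the pending promise. A small, generic timeout wrapper is one way to guard against that. The withTimeout helper below is an assumption added for illustration and is not part of the original code.

```typescript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        const timer = setTimeout(
            () => reject(new Error(`${label} timed out after ${ms} ms`)),
            ms
        );
        promise.then(
            (value) => {
                clearTimeout(timer);
                resolve(value);
            },
            (err) => {
                clearTimeout(timer);
                reject(err);
            }
        );
    });
}
```

In callMcp it could be used as `const res = await withTimeout(send({ ... }), 10000, "tools/call");`, where the 10-second limit is an arbitrary example value. On timeout, the matching entry should also be removed from the pending map so that it does not grow over time.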

The callAi function uses the OpenAI Node.js library. In this example, the gpt-4.1 model is used. The next section explains how to generate an API key.

import {OpenAI} from 'openai';
import {ChatCompletionMessage} from 'openai/resources/chat/completions/completions';
import {config} from '../config.js';

export type Message = {
    role: 'system' | 'user' | 'assistant' | 'function';
    content: string;
    name: string;
};
export type Tool = {
    type: "function",
    function: {
        name: string,
        description: string,
        parameters: any
    }
};

const openai = new OpenAI({ apiKey: config.openAIApiKey });

export async function callAi(messages: Message[], tools: Tool[]): Promise<ChatCompletionMessage> {
    const response = await openai.chat.completions.create({
        model: "gpt-4.1",
        messages,
        tools,
        tool_choice: "auto"
    });

    return response.choices[0].message;
}

With this simple MCP client and server setup, we get a powerful integration with OpenAI, enabling automatic tool selection, execution, and response handling.

OpenAI Configuration

To start using the solution described above, you need to create the OpenAI API key (openAIApiKey) mentioned in the section above.
To generate the key, log in to the OpenAI Platform.

In the left sidebar, click API Keys, then in the top-right corner click Create new secret key.
Copy this key — it will be used by your client.
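The config module imported by the server and the client is not shown in the article. One possible sketch, assuming the key is supplied through environment variables rather than committed to the repository, is below. The field names serverPort, serverUrl, and openAIApiKey come from the code above; the environment variable names and the default port 8080 are assumptions.

```typescript
type AppConfig = {
    serverPort: number;
    serverUrl: string;
    openAIApiKey: string;
};

// Builds the config from an environment-like object (pass process.env in Node.js).
function loadConfig(env: Record<string, string | undefined>): AppConfig {
    const openAIApiKey = env["OPENAI_API_KEY"];
    if (!openAIApiKey) {
        throw new Error("OPENAI_API_KEY is not set");
    }

    const serverPort = Number(env["MCP_SERVER_PORT"] ?? "8080");
    if (!Number.isInteger(serverPort) || serverPort <= 0) {
        throw new Error("MCP_SERVER_PORT must be a positive integer");
    }

    return {
        serverPort,
        serverUrl: env["MCP_SERVER_URL"] ?? `ws://localhost:${serverPort}`,
        openAIApiKey
    };
}

console.log(loadConfig({ OPENAI_API_KEY: "sk-example", MCP_SERVER_PORT: "9000" }).serverUrl);
// prints "ws://localhost:9000"
```

Taking the environment as a parameter keeps the function easy to test; in the application it would be called once as `export const config = loadConfig(process.env);`.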

Using the API is not free, so you need to add funds to your account. The minimum amount at the time of writing is $5, which is more than enough to experiment with this setup. For example, 30,000 tokens cost me around $0.10.
To add credit, navigate to the Billing tab and follow the instructions on the page to top up your credit balance.
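As a back-of-the-envelope check on those numbers, the snippet below converts the observed spend into a blended rate and estimates how far the $5 minimum goes. It treats all tokens as equally priced, which is a simplification: real pricing differs between input and output tokens and between models.

```typescript
// Observed in the article: roughly $0.10 for 30,000 tokens.
const observedCostUsd = 0.10;
const observedTokens = 30000;

// Blended price per one million tokens, derived from an observed spend.
function costPerMillionTokens(costUsd: number, tokens: number): number {
    return (costUsd / tokens) * 1000000;
}

// Approximate number of tokens a budget buys at the observed rate.
function tokensForBudget(budgetUsd: number, costUsd: number, tokens: number): number {
    return Math.round((budgetUsd * tokens) / costUsd);
}

console.log(costPerMillionTokens(observedCostUsd, observedTokens).toFixed(2)); // "3.33"
console.log(tokensForBudget(5, observedCostUsd, observedTokens)); // 1500000
```

At that rate, the $5 minimum covers about 1.5 million tokens, or roughly fifty sessions the size of the one measured above.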

This is a relatively cheap way to start working with the solution.
A free alternative is Ollama, which runs locally on your own hardware. I tested this option, but because I don’t have powerful hardware, the models in Ollama performed poorly for me. After switching to OpenAI, the quality of responses improved significantly and really surprised me.

Of course, if you use Ollama instead of OpenAI, the callAi function and the previous section would need to be updated accordingly.

Demo

This section shows screenshots of user prompts and the results returned by OpenAI based on the data provided by the MCP server.

In the first example, the user requests the list of available tools, and the system returns all tools that can be used.

In the next example, the user asks the system to perform a health check and apply some custom criteria for sending an email notification. This demonstrates how the client performs multiple iterations with OpenAI to call the required tools.

Initially, OpenAI calls the system_health_check tool. Once the client sends the results back, along with the updated context, the model requests the send_email tool. The MCP server executes the email action, and the user receives the message.

Another example shows that OpenAI reads the API documentation and provides guidance on how to use it, including generating a sample curl command.

Conclusion

In this article, you built an MCP Server in TypeScript, connected it to ChatGPT, and exposed tools for real-world actions like database queries and system checks. MCP is changing how we interact with LLMs, turning them from conversational agents into operational assistants. Its growing popularity is reflected in integrations with Docker Desktop, Cursor IDE, and VS Code extensions.

As next steps for enhancing the described solution, consider implementing a layered architecture, adding authentication, exposing more complex APIs, or deploying the MCP server with Docker — opening the door to broader AI-driven workflows.

An example of the implemented solution can be found in [2].

References

Node.js at Scale: Handling 100,000 Msg/sec from 50,000 Emulated IoT Devices with Redis and ZMQ Without Horizontal Scaling

Dr. Yaroslav Zhbankov — Mon, 29 Sep 2025 01:42:51 GMT

Introduction

Service-oriented architecture (SOA) often relies on lightweight message buses like ZeroMQ (ZMQ) to enable communication between services. Node.js is a popular choice for implementing services due to its simplicity and asynchronous capabilities. However, Node.js comes with inherent limitations for CPU-bound or high-throughput tasks.

In theory, a simple Node.js service using in-memory ZMQ messages can handle up to ~1 million messages per second in an ideal push-pull setup, assuming very small messages (<200 bytes) and minimal processing. In practice, throughput is constrained by factors such as message size, CPU-bound operations, Redis interactions, network latency, and message serialization/deserialization.

This article demonstrates how to scale message throughput for a Node.js service that consumes messages, interacts asynchronously with Redis, and forwards messages to the next consumer. Using worker threads and batching, it will be shown how to reach ~100,000 messages per second for 50,000 unique keys stored in Redis — each key representing an emulated IoT device identifier sending messages at a given rate — without relying on horizontal scaling. While even higher throughput could be achieved using Kafka, C++, or clustered Redis, this article focuses on optimizing Node.js, Redis, and ZMQ for simplicity and practical relevance.

System Limitations

Node.js is single-threaded and asynchronous, meaning I/O operations (like network or Redis calls) do not block the event loop. CPU-bound tasks, however, block processing and limit throughput. Worker threads, child processes, and clustering can parallelize CPU-bound work and increase performance.
ZMQ is a lightweight messaging library that typically runs over TCP and is capable of millions of messages per second under ideal conditions [1]. Actual throughput depends on message size, socket type, network latency, and system resources.
Redis is a high-performance in-memory store [2]. Single-core operations typically achieve 100k–1M ops/sec; multi-core usage, pipelining, and batching can raise throughput to several million ops/sec. Operations like HGETALL on large hashes are particularly expensive.
Practical throughput estimates (for ~100-byte messages):
Node.js: 300k–600k msg/sec
Redis: 300k–700k ops/sec (without pipelining)
ZMQ: 800k–1.5M msg/sec (based on native benchmarks)

Note: Achieving these maxima simultaneously is difficult, as the slowest component dictates overall throughput. Message handling, waiting for responses, ordering, and blocking operations further reduce performance.
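The "slowest component dictates overall throughput" rule can be written down directly. The sketch below takes the estimates quoted above and computes the bounds for a pipeline in which every message passes through each stage once; that one-operation-per-stage assumption is a simplification, and stages that perform several operations per message will lower the result further.

```typescript
type StageEstimate = { name: string; minPerSec: number; maxPerSec: number };

// Estimates quoted in the list above (for ~100-byte messages).
const stages: StageEstimate[] = [
    { name: "Node.js", minPerSec: 300000, maxPerSec: 600000 },
    { name: "Redis", minPerSec: 300000, maxPerSec: 700000 },
    { name: "ZMQ", minPerSec: 800000, maxPerSec: 1500000 }
];

// A serial pipeline can go no faster than its slowest stage.
function pipelineBounds(all: StageEstimate[]): { min: number; max: number; bottleneck: string } {
    if (all.length === 0) throw new Error("At least one stage is required");
    let slowest = all[0];
    for (const stage of all) {
        if (stage.maxPerSec < slowest.maxPerSec) slowest = stage;
    }
    return {
        min: Math.min(...all.map((s) => s.minPerSec)),
        max: slowest.maxPerSec,
        bottleneck: slowest.name
    };
}

console.log(pipelineBounds(stages));
```

With these inputs the best case is capped at 600,000 messages per second by the Node.js stage, and the conservative estimate is 300,000.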

Next, these limitations will be examined in a practical example. You can find the Node.js setup in the GitHub repository [3].

Practical Implementation Overview

The aim is to implement a message processor service that:

Consumes messages using a ZMQ PULL socket.
Parses and extends messages with a timestamp.
Interacts with Redis: reads existing records, deletes old ones, and stores new messages.
Forwards processed messages via a ZMQ PUSH socket to the next consumer.

This example demonstrates throughput optimization in Node.js using worker threads and batch processing. Note that records are stored in Redis by unique key, and the number of unique keys equals half the number of messages sent per second (e.g., 10,000 messages per second produce 5,000 unique keys and therefore 5,000 records in Redis), emulating IoT devices that each send messages at 2 Hz.

Methodology

Message Producer: Generates and sends messages via a ZMQ PUSH socket.
Message Processor: Receives messages, performs Redis operations, and forwards messages.
Final Consumer: Reads messages using a ZMQ PULL socket and tracks throughput.

Message Producer

The Node.js producer sends messages at a configured frequency. It generates messages with a simple structure (objects containing a key, a timestamp, and some random data), stringifies them, and sends them through a ZMQ PUSH socket.

import { Push } from "zeromq";

const ADDRESS = "tcp://127.0.0.1:7001";
const BATCH_SIZE = 1;
const MSGS_PER_SECOND = 6500;
const KEY_RANGE = 3250;
const DELAY_NANOSECONDS = BigInt(Math.floor(1e9 / MSGS_PER_SECOND));

const keys = Array.from({ length: KEY_RANGE }, (_, i) =>
    String(i + 1).padStart(8, "0") + "A"
);

function logStats(counter, totalMessages) {
    console.log(`Messages sent in last 1s: ${counter.value}`);
    console.log(`Total messages sent: ${totalMessages.value}`);
    counter.value = 0;
}

function createMessage() {
    const key = keys[Math.floor(Math.random() * KEY_RANGE)];
    return JSON.stringify({
        key,
        ts: new Date().toISOString(),
        data: {
            randomInt: Math.floor(Math.random() * 100),
        },
    });
}

async function handleExit(push, totalMessages) {
    console.log(`\nProcess exiting. Total messages sent: ${totalMessages.value}`);
    try {
        await push.close();
    } catch (err) {
        console.error("Error closing Push socket:", err);
    }
    process.exit(0);
}

async function runPublisher() {
    const push = new Push();
    await push.bind(ADDRESS);
    console.log(`Push publisher bound to ${ADDRESS}`);

    const counter = { value: 0 };
    const totalMessages = { value: 0 };

    setInterval(() => logStats(counter, totalMessages), 1000);

    const exitHandler = () => handleExit(push, totalMessages);
    process.on("SIGINT", exitHandler);
    process.on("SIGTERM", exitHandler);

    let next = process.hrtime.bigint();
    while (true) {
        const now = process.hrtime.bigint();
        if (now >= next) {
            const batch = BATCH_SIZE > 1 ? Array.from({ length: BATCH_SIZE }, createMessage) : createMessage();
            await push.send(batch);
            counter.value += BATCH_SIZE;
            totalMessages.value += BATCH_SIZE;
            next += DELAY_NANOSECONDS;
        }
    }
}

runPublisher().catch(console.error);

Important Notes:

The producer binds the port and sends messages whenever a consumer is ready to pull them.
This producer will be used for multiple scenarios in this article to test throughput.
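
The pacing logic in the loop above is deadline-based: each send advances the next deadline by a fixed interval, so a late iteration is followed by a short catch-up burst rather than a permanently lower rate. The sketch below isolates that idea in a small pacer. `createPacer` and its injected clock are hypothetical names introduced here so the logic can be exercised without a socket.

```javascript
// Minimal sketch of deadline-based pacing, assuming a nanosecond clock such
// as process.hrtime.bigint(). The clock is injected to keep this testable.
function createPacer(msgsPerSecond, now) {
    const intervalNs = BigInt(Math.floor(1e9 / msgsPerSecond));
    let next = now();

    return {
        // Returns how many messages are due at the current time and advances
        // the deadline by one interval per message (like `next += DELAY`).
        takeDue() {
            const current = now();
            let due = 0;
            while (current >= next) {
                due += 1;
                next += intervalNs;
            }
            return due;
        },
    };
}

// Usage with a fake clock: at 1,000 msg/sec one message is due every 1 ms.
let fakeNow = 0n;
const pacer = createPacer(1000, () => fakeNow);
console.log(pacer.takeDue()); // 1 (the first message is due immediately)
fakeNow += 5_000_000n;        // 5 ms pass
console.log(pacer.takeDue()); // 5
```

In a real producer the loop would pass `process.hrtime.bigint` as the clock and could yield to the event loop (for example with `setImmediate`) whenever nothing is due, so that timers such as the stats logger still get a chance to fire.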

Final Consumer

The consumer reads messages after they have been processed by the message processor. It uses a ZMQ PULL socket to receive messages, parses them, and keeps track of throughput statistics.

import { Pull } from "zeromq";

const ADDRESS = "tcp://127.0.0.1:7000";

function logStats(counter, totalMessages) {
    console.log(`Messages received in last 1s: ${counter.value}`);
    console.log(`Total messages received: ${totalMessages.value}`);
    counter.value = 0;
}

async function handleExit(pull, totalMessages) {
    console.log(`\nProcess exiting. Total messages received: ${totalMessages.value}`);
    try {
        await pull.close();
    } catch (err) {
        console.error("Error closing Pull socket:", err);
    }
    process.exit(0);
}

async function runConsumer() {
    const pull = new Pull();
    await pull.bind(ADDRESS);
    console.log(`Pull consumer bound to ${ADDRESS}`);

    const counter = { value: 0 };
    const totalMessages = { value: 0 };

    setInterval(() => logStats(counter, totalMessages), 1000);

    const exitHandler = () => handleExit(pull, totalMessages);
    process.on("SIGINT", exitHandler);
    process.on("SIGTERM", exitHandler);

    for await (const msg of pull) {
        try {
            JSON.parse(msg.toString());
            counter.value += 1;
            totalMessages.value += 1;
        } catch (err) {
            console.error("Failed to parse batch:", err);
        }
    }
}

runConsumer().catch(console.error);

Important Notes:

The consumer binds a port and reads messages sent by the message processor.
It will be used in multiple scenarios in this article to measure throughput and message handling performance.
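
Once batching is introduced later in the article, a single received payload may carry several messages, so counting one message per receive would under-report throughput. The helper below is a small sketch of one way to handle both shapes. It assumes a payload is JSON text holding either a single object (one message) or an array (a batch); `countMessages` is a hypothetical name.

```javascript
// Sketch: count logical messages in one received payload. Assumes the payload
// is JSON text holding either a single object or an array of objects.
function countMessages(payload) {
    const parsed = JSON.parse(payload.toString());
    return Array.isArray(parsed) ? parsed.length : 1;
}

const single = Buffer.from(JSON.stringify({ key: "00000001A" }));
const batch = Buffer.from(JSON.stringify([{ key: "00000001A" }, { key: "00000002A" }]));
console.log(countMessages(single)); // 1
console.log(countMessages(batch));  // 2
```

Inside the receive loop, the counters would then be incremented by `countMessages(msg)` instead of by 1.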

Solution 1 — Single-Threaded Node.js Message Processor

The producer and consumer can generate and handle a massive number of messages, so the main focus is the Message Processor, which performs parsing, Redis operations, and message forwarding via ZMQ. Redis itself can become a bottleneck. As mentioned above, the number of records in Redis equals half the message rate, emulating devices that each send 2 messages per second to a server, so 10,000 messages per second correspond to 5,000 devices with unique keys.

The single-threaded message processor operates as follows:

for await (const [msg] of zmqPullClient.socket) {
  try {
    counter.value += 1;
    totalMessages.value += 1;
    await handleMessage(msg, dbConnection, zmqPushClient.socket);
  } catch (err) {
    console.error('Handler failed:', err);
  }
}

async function handleMessage(messageRaw, client, socket) {
  let parsedMsg;
  try {
    parsedMsg = typeof messageRaw === 'string' ? JSON.parse(messageRaw) : JSON.parse(messageRaw.toString());
  } catch (err) {
    console.error('Failed to parse message:', err);
    return;
  }

  const redisKey = `test:${parsedMsg.key}`;
  try {
    await client.hGetAll(redisKey);
    await client.del(redisKey);
    await client.hSet(redisKey, ['field1', JSON.stringify(parsedMsg)]);
  } catch (err) {
    console.error('Redis operation failed:', err);
    return;
  }

  try {
    const response = { ...parsedMsg, ts: new Date().toISOString() };
    await socket.send(JSON.stringify(response));
  } catch (err) {
    console.error('Failed to send message via ZMQ socket:', err);
  }
}

This single-threaded solution handles ~7,000 msg/sec and 3,500 unique keys in Redis with no delay, which is not very impressive and definitely requires improvement.
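
Part of the cost in this version is that every message waits for three Redis round trips in sequence. One possible variation, not measured in this article, is to queue the three commands and send them together. The sketch below assumes a node-redis v4-style client whose `multi()` returns a chainable object with `hGetAll`, `del`, `hSet`, and `exec()`; `replaceRecord` is a hypothetical helper name.

```javascript
// Sketch: read, delete, and rewrite a hash in a single round trip.
// Assumes a node-redis v4-style client (client.multi() ... .exec()).
async function replaceRecord(client, redisKey, message) {
    const [previous] = await client
        .multi()
        .hGetAll(redisKey)                                  // reply 0: old fields
        .del(redisKey)                                      // reply 1: keys removed
        .hSet(redisKey, "field1", JSON.stringify(message))  // reply 2: fields added
        .exec();
    return previous; // the record as it was before the rewrite
}
```

Because MULTI/EXEC runs the queued commands as one transaction, another client cannot interleave between the delete and the write for the same key.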

Solution 2 — Main Thread for Receiving and Worker for Processing

The idea here is to decouple receiving messages from processing them, which reduces idle time caused by Redis operations and ZMQ forwarding.

The main thread pulls messages and passes them to a worker thread with postMessage:

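import path from 'path';
import { fileURLToPath } from 'url';
import { Worker } from 'worker_threads';
import { ZmqPullClient } from './ZmqPullClient.mjs';

export async function main() {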
    const __filename = fileURLToPath(import.meta.url);
    const __dirname = path.dirname(__filename);
    let worker;

    const zmqPullClient = new ZmqPullClient({ address: 'tcp://127.0.0.1:7001' });
    await zmqPullClient.start();

    // Track worker readiness
    await new Promise((resolve) => {
        worker = new Worker(path.resolve(__dirname, 'worker.mjs'));
        worker.on('message', (msg) => {
            if (msg && msg.ready) {
                resolve();
            }
        });
    });

    for await (const msg of zmqPullClient.socket) {
        worker.postMessage(msg.toString());
    }
}

The worker implements the same Redis read/delete/set logic as in the first solution:

parentPort?.on('message', async (msg) => {
    try {
        await handleMessage(msg, dbConnection, { send: safeSend });
        parentPort?.postMessage({ ok: true });
    } catch (err) {
        console.error('Worker error:', err);
        parentPort?.postMessage({ ok: false, error: err?.message || err });
    }
});

async function handleMessage(messageRaw, client, socket) {
    let parsedMsg;
    try {
        parsedMsg = typeof messageRaw === 'string' ? JSON.parse(messageRaw) : JSON.parse(messageRaw.toString());
    } catch (err) {
        console.error('Failed to parse message:', err);
        return;
    }

    const redisKey = `test:${parsedMsg.key}`;
    try {
        await client.hGetAll(redisKey);
        await client.del(redisKey);
        await client.hSet(redisKey, ['field1', JSON.stringify(parsedMsg)]);
    } catch (err) {
        console.error('Redis operation failed:', err);
        return;
    }

    try {
        const response = { ...parsedMsg, ts: new Date().toISOString() };
        await socket.send(JSON.stringify(response));
    } catch (err) {
        console.error('Failed to send message via ZMQ socket:', err);
    }
}

This approach achieves ~20,000 msg/sec with 10,000 unique keys in Redis without queuing or delays, nearly three times the throughput of Solution 1, simply by moving message handling to a separate worker thread.
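
The "without queuing" observation can be checked from the main thread, because the worker posts an acknowledgement (`{ ok: true }` or `{ ok: false }`) for every message it finishes. The following is a minimal sketch, assuming exactly one acknowledgement per posted message; `createInFlightTracker` is a hypothetical helper and not part of the repository code.

```javascript
// Sketch: track how many messages were posted to the worker but not yet
// acknowledged. A steadily growing depth means the worker is falling behind.
function createInFlightTracker() {
    let posted = 0;
    let acked = 0;
    let peak = 0;
    return {
        onPost() {
            posted += 1;
            peak = Math.max(peak, posted - acked);
        },
        onAck() {
            acked += 1;
        },
        get depth() { return posted - acked; },
        get peak() { return peak; },
    };
}

// Wiring (shown as comments because it needs a live worker):
//   worker.on('message', (msg) => { if (msg && 'ok' in msg) tracker.onAck(); });
//   tracker.onPost(); worker.postMessage(payload);
const tracker = createInFlightTracker();
tracker.onPost();
tracker.onPost();
tracker.onAck();
console.log(tracker.depth, tracker.peak); // 1 2
```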

Solution 3 — Main Thread with Multiple Workers

To further increase throughput, multiple workers can process messages in parallel. Messages are distributed to workers in a round-robin manner:

import path from 'path';
import { fileURLToPath } from 'url';
import { Worker } from 'worker_threads';
import { ZmqPullClient } from './ZmqPullClient.mjs';


const WORKER_COUNT = 2;
const workers = [];
let nextWorker = 0;
let readyWorkers = 0;

function getWorker() {
    const w = workers[nextWorker];
    nextWorker = (nextWorker + 1) % workers.length;
    return w;
}

export async function main() {
    const __filename = fileURLToPath(import.meta.url);
    const __dirname = path.dirname(__filename);

    const zmqPullClient = new ZmqPullClient({ address: 'tcp://127.0.0.1:7001' });
    await zmqPullClient.start();

    // Track worker readiness
    await new Promise((resolve) => {
        for (let i = 0; i < WORKER_COUNT; i++) {
            const worker = new Worker(path.resolve(__dirname, 'worker.mjs'));
            worker.on('message', (msg) => {
                if (msg && msg.ready) {
                    readyWorkers++;
                    if (readyWorkers === WORKER_COUNT) {
                        resolve();
                    }
                }
            });
            workers.push(worker);
        }
    });

    for await (const msg of zmqPullClient.socket) {
        getWorker().postMessage(msg.toString());
    }
}

main();

Max throughput increases to ~35,000 msg/sec and 17,500 unique keys in Redis with just two workers.
Adding more workers does not improve throughput, indicating the main thread’s message dispatching is the limiting factor. In Node.js worker_threads, posting messages (worker.postMessage) itself adds overhead.
Compared to Solution 1 (~7,000 msg/sec), this represents a 5x improvement.
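
The readiness handshake above counts `ready` messages in a shared variable. An equivalent formulation, shown here as a sketch, wraps each worker in its own promise and waits for all of them with `Promise.all`. It assumes only that each worker exposes `on` and `off` for the 'message' event and eventually emits `{ ready: true }`, which holds for `worker_threads` workers since they are EventEmitters. `waitForReady` and `waitForAllReady` are hypothetical names.

```javascript
// Sketch: resolve once a single worker has announced that it is ready.
function waitForReady(worker) {
    return new Promise((resolve) => {
        const onMessage = (msg) => {
            if (msg && msg.ready) {
                worker.off("message", onMessage); // stop listening once ready
                resolve(worker);
            }
        };
        worker.on("message", onMessage);
    });
}

// Resolve once every worker in the pool is ready.
function waitForAllReady(workers) {
    return Promise.all(workers.map(waitForReady));
}

// Usage sketch inside main():
//   const workers = Array.from({ length: WORKER_COUNT }, () => new Worker(workerPath));
//   await waitForAllReady(workers);
```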

Solution 4 — Multiple Workers with Batch Messages

Throughput can be further improved by processing messages in batches. The producer sends multiple messages in a single multipart ZMQ message, the message processor hands the whole batch to one worker, and the worker forwards the processed batch to the next consumer.

for await (const msgList of zmqPullClient.socket) {
  const messages = msgList.map((msg) => msg.toString());
  getWorker().postMessage(messages);
}

parentPort?.on('message', async (msgList) => {
    try {
        const responses = [];
        for (const singleMsg of msgList) {
            const result = await handleMessage(JSON.parse(singleMsg), dbConnection);
            responses.push(result);
        }
        counter += msgList.length;
        await safeSend(JSON.stringify(responses));
        parentPort?.postMessage({ ok: true });
    } catch (err) {
        console.error('Worker error:', err);
        parentPort?.postMessage({ ok: false, error: err?.message || err });
    }
});

Using 4 workers and a batch size of 10 messages increases throughput to ~100,000 msg/sec with 50,000 unique keys in Redis.
Reducing the number of unique Redis keys improves throughput.
Further increases are limited by Redis throughput and main thread dispatching.
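
The effect of batching on the main thread can be put into numbers. The sketch below assumes that dispatch cost is dominated by the number of `postMessage` calls rather than by payload size, which is a simplification; `dispatchLoad` is a hypothetical helper.

```javascript
// Sketch: how many postMessage calls per second the main thread must make.
function dispatchLoad(msgsPerSecond, batchSize, workerCount) {
    const postsPerSecond = msgsPerSecond / batchSize;
    return {
        postsPerSecond,
        postsPerWorker: postsPerSecond / workerCount,
    };
}

console.log(dispatchLoad(35_000, 1, 2));   // Solution 3: 35,000 posts/sec in total
console.log(dispatchLoad(100_000, 10, 4)); // Solution 4: 10,000 posts/sec in total
```

Under this simplification, the batched setup moves almost three times as many messages with less than a third of the dispatch calls.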

Conclusion

The strategies described demonstrate how to dramatically increase the throughput of a Node.js service using Redis and ZMQ, achieving up to 100k messages/sec for 50k unique keys (emulating 50k IoT devices sending messages). Major improvements come from:

Parallelizing message handling across multiple worker threads.
Sending messages in batches to reduce inter-thread and network overhead.

Further optimization opportunities include:

Minimizing inter-thread communication overhead with transferable objects or shared memory.
Leveraging Node.js clustering to utilize all CPU cores for receiving and dispatching messages.
Exploring Redis sharding or cluster mode for even larger datasets.
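
As an illustration of the first point in the list above, a batch of strings can be packed into a single `ArrayBuffer` and handed to a worker through the transfer list of `postMessage`, which moves the buffer instead of copying it. The length-prefixed layout and the names `packBatch` and `unpackBatch` are assumptions made for this sketch, and this variant is not measured in the article.

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// Layout: [uint32 length][UTF-8 bytes], repeated for each message.
function packBatch(messages) {
    const encoded = messages.map((m) => encoder.encode(m));
    const total = encoded.reduce((sum, bytes) => sum + 4 + bytes.length, 0);
    const buffer = new ArrayBuffer(total);
    const view = new DataView(buffer);
    const out = new Uint8Array(buffer);
    let offset = 0;
    for (const bytes of encoded) {
        view.setUint32(offset, bytes.length); // big-endian by default
        out.set(bytes, offset + 4);
        offset += 4 + bytes.length;
    }
    return buffer;
}

function unpackBatch(buffer) {
    const view = new DataView(buffer);
    const bytes = new Uint8Array(buffer);
    const messages = [];
    let offset = 0;
    while (offset < buffer.byteLength) {
        const length = view.getUint32(offset);
        messages.push(decoder.decode(bytes.subarray(offset + 4, offset + 4 + length)));
        offset += 4 + length;
    }
    return messages;
}

// Main thread: worker.postMessage(buffer, [buffer]) transfers ownership.
// Worker:      parentPort.on('message', (buffer) => unpackBatch(buffer));
const packed = packBatch(['{"key":"00000001A"}', '{"key":"00000002A"}']);
console.log(unpackBatch(packed).length); // 2
```

After a transfer the buffer is detached on the sending side, so the main thread must not reuse it.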

With these techniques, Node.js can handle high-throughput messaging workloads efficiently without requiring complex horizontal scaling.

References

SAML SSO with Microsoft Entra ID: A Practical Guide for Node.js and React Developers

Dr. Yaroslav Zhbankov — Fri, 01 Aug 2025 02:35:43 GMT

Introduction

Single Sign-On (SSO) is commonly implemented in enterprise solutions using two widely adopted protocols: SAML and OAuth 2.0. The importance of SSO for enterprise systems — along with scenarios where it becomes critical — is discussed in [1], which also provides a complete example of an OAuth 2.0-based solution.

This article demonstrates how to configure SSO using Microsoft Entra ID (formerly Azure AD) in a server–client architecture, where multiple client applications — such as desktop and web apps — authenticate through a single backend using SAML. A SAML-based solution will be implemented using a simple Node.js server and a React-based frontend as an example.

Solution overview

As described above, the solution consists of a web application built with Next.js that provides the user interface, a Node.js (Express) web server that exposes a REST API for accessing resources, and Microsoft Entra ID (formerly Azure AD) as the centralized identity provider. The solution implements SAML authentication to enable Single Sign-On (SSO).

For this example, a few core requirements define how the solution is implemented:

The system must support multiple client applications, such as different web and desktop apps, which allow users to access shared server resources.
It must also support multiple web servers, each serving a different customer (tenant).
Client applications must be able to interact with multiple servers (customers) simultaneously, without relying on hardcoded or customer-specific configuration.

Given these constraints, client applications should not be aware of Microsoft Entra ID directly, as each customer may use a different identity provider. Therefore, the authentication logic and IdP configuration reside on the server, while the client follows a consistent authentication flow.

In this model, client applications authenticate through the server, which acts as a proxy to Microsoft Entra ID.

The server implements the following endpoints:

GET /api/auth/login – Redirects the user to the Entra ID login UI.
GET /api/auth/logout – Clears the server session and redirects to the login page.
GET /api/protected – An example of a protected resource endpoint.
POST /api/auth/callback – Handles the SAML response and issues a JWT, setting it as a cookie.

Below is a diagram illustrating the described solution (Fig. 1):

Fig. 1. Diagram of the SSO-enabled client–server solution

The steps shown in the diagram outline the authentication process:

The user visits the web application and clicks the “Login with SAML” button.
The web client redirects the browser to the server’s /api/auth/login endpoint.
The server initiates a SAML authentication request (via passport-saml) and redirects the user to Microsoft Entra ID (the Identity Provider).
The user authenticates with Entra ID using their credentials.
Upon successful authentication, Entra ID returns a SAML Response (containing a signed assertion with user identity and attributes) to the server’s /api/auth/callback endpoint.
The server validates the SAML Response using passport-saml, extracts user attributes (e.g., name, email), and issues a JWT signed with a server secret.
The JWT is set as an HTTP-only cookie, and the user is redirected back to the client application.
The client is now authenticated and can access protected endpoints.
For each request to a protected resource (e.g., /api/protected), the server validates the JWT from the cookie.
If the token is valid, the server grants access and returns user-specific content.

Authentication Sequence: Step-by-Step Flow

The sequence diagram below gives a more detailed description of the authentication flow (Fig. 2).

Fig. 2. The sequence diagram describing the authentication flow

1. User Initiates Login
- The user opens the application in their browser.
- The WebClient (frontend) displays a login interface with a “Login with SAML” button.

2. Login Request to Server
- When the user clicks the login button, the browser sends a request to the REST server: GET /api/auth/login

3. Redirect to Microsoft Entra ID
- The REST server generates a SAML authentication request.
- It sends back a 302 redirect to the browser pointing to Microsoft Entra ID’s login endpoint.
- The browser follows the redirect and sends the SAML request to Microsoft Entra ID.

4. User Authenticates with Microsoft Entra ID
- Microsoft Entra ID prompts the user to authenticate, e.g., by entering credentials or using SSO.
- Upon successful login, Entra ID responds with a SAML assertion containing user identity information.

5. SAML Response Returned to Backend
- The SAML response is sent to the browser, which then posts it to the backend: POST /api/auth/callback
- This request contains the SAML assertion.

6. SAML Assertion Validation
- The REST server uses Passport-SAML to validate the SAML response and extract user identity claims (like nameID, email, etc.).
- If validation succeeds, it creates a signed JWT token containing the user information.

7. Session Established
- The backend sets a secure, HTTP-only cookie named auth_token with the JWT.
- It then redirects the browser back to the WebClient (frontend).

8. Accessing Protected Resources
- The browser navigates to the home page or another protected route.
- The WebClient sends a request to a protected API endpoint: GET /api/protected including the auth_token cookie.
- The backend verifies the token and returns the authorized user info.
- The WebClient renders the page based on the authenticated user’s profile.

9. Logout
- When the user clicks logout, the WebClient sends a request: GET /api/auth/logout
- The REST server clears the cookie, destroys the session, and optionally redirects to a login or post-logout page.
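
Step 7 mentions a secure, HTTP-only cookie. As a sketch of what those attributes could look like in Express, the helper below builds the options object passed to `res.cookie`. The option names (`httpOnly`, `secure`, `sameSite`, `maxAge`) are standard Express cookie options; the helper name, the `isHttps` flag, and the one-hour default lifetime are assumptions made for illustration.

```javascript
// Sketch: cookie attributes for the auth_token cookie.
function authCookieOptions({ isHttps, maxAgeSeconds = 3600 }) {
    return {
        httpOnly: true,               // not readable from client-side JavaScript
        secure: isHttps,              // only sent over HTTPS when enabled
        sameSite: "lax",              // withheld on cross-site subrequests
        maxAge: maxAgeSeconds * 1000, // Express expects milliseconds
    };
}

// Usage sketch:
//   res.cookie('auth_token', token, authCookieOptions({ isHttps: true }));
console.log(authCookieOptions({ isHttps: false }));
```

If the client and the API are served from different sites, `sameSite: 'none'` together with `secure: true` would be needed for the browser to send the cookie on API calls.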

Solution Implementation

The sections below describe the implementation of each component of the solution. The complete source code is available on GitHub [2].

This article focuses on the implementation details, not the deployment process. Instructions for deploying the client to AWS S3 and the server to AWS EC2 are available in [3] and [4], respectively.

Microsoft Entra ID

Follow the steps below to configure Microsoft Entra ID (formerly Azure AD) for SAML-based SSO:

Step 1. Log in to the Microsoft Azure Portal and click on the Microsoft Entra ID icon, as shown in Fig. 3.

Fig. 3. Microsoft Azure Portal

Step 2. In the left-hand navigation menu, click on Enterprise applications, as shown in Fig. 4.

Fig. 4. Microsoft Entra ID page

Step 3. Click the New application button to create a new application for the solution (see Fig. 5).

Fig. 5. Enterprise applications screen

Step 4. On the next screen, click Create your own application to define a custom application (see Fig. 6).

Fig. 6. Adding application page

Step 5. In the panel on the right, choose the option: “Integrate any other application you don’t find in the gallery (Non-gallery)”, enter your desired application name, and click Create (see Fig. 7).

Fig. 7. Application creation page

Step 6. Since we’re setting up SSO, click on the Set up single sign on option (see Fig. 8).

Fig. 8. Application use case selection

Step 7. Select SAML as the SSO method (see Fig. 9).

Fig. 9. Sign-on method selection

Step 8. On the opened SAML configuration page, click the Edit button. In the panel that opens on the right, configure the following:

Identifier (Entity ID): Use a unique name for your application.
Example: urn:yzhbankov:nodejs
(This will be referenced later as SAML_ISSUER)
Reply URL (Assertion Consumer Service URL): Provide the full URL of the endpoint on the backend server that handles the SAML authentication response.
Example: http://localhost:3000/api/auth/callback
(This will be referenced later as SAML_CALLBACK_URL)

Click Save to apply the settings (see Fig. 10).

Fig. 10. Basic SAML configuration

Step 9. After saving the configuration, the SAML certificate and Login URL will be available. These are required for the backend server:

Click Download to get the certificate file, and store it at a known path (This will be referenced later as SAML_CERT_PATH).
Copy the Login URL, which will be used as SAML_ENTRY_POINT.

(See Fig. 11 for reference.)

Fig. 11. SAML certificate and app setup

Step 10. As a final step, navigate to the Users and groups tab (see Fig. 12) and create a new user for testing the login flow.

Fig. 12. User and group management tab

Once these steps are completed, Microsoft Entra ID is fully configured and ready to be used as the identity provider in the SAML-based SSO solution.

Web Server

This section describes the implementation of a basic REST API server that includes an authentication layer. The web server is implemented in Node.js with TypeScript. The service itself is intentionally simple, providing only the minimal functionality required for the use case described earlier.

Application Structure
The directory structure is as follows:

my-app/
├── apps
│   ├── server/
│   │   ├── .env
│   │   ├── .env.defaults
│   │   ├── app.ts
│   │   ├── config.ts
│   │   ├── middlewares.ts
│   │   ├── samlStrategy.ts
│   │   ├── package.json
│   │   └── tsconfig.json
│   └── client/
│       └── ...
└── README.md

Configuration

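Configuration is handled using the dotenv library. Default values are specified in the .env.defaults file, which lists the basic configuration values required for the solution to work. The config.ts file loads the environment variables and provides them to the application. Note that dotenv.config() reads a file named .env by default, so the defaults need to be copied into .env (or the file path passed explicitly).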

# ./apps/server/.env.defaults
SERVER_PORT=3000
SAML_ENTRY_POINT=https://login.microsoftonline.com/44b02d65...6f6e646a/saml2
SAML_ISSUER=urn:yzhbankov:nodejs
SAML_CALLBACK_URL=http://localhost:3000/api/auth/callback
SAML_CERT_PATH=/PATH_TO_CERTIFICATE_FOLDER/NodeJs SAML App.cer
REDIRECT_AFTER_LOGIN_URL=http://localhost:8080/home
LOGOUT_REDIRECT_URL=http://localhost:8080
JWT_SECRET=your_secret_jwt_here
SESSION_SECRET=your_session_secret

Descriptions:

SERVER_PORT: Port number for the server.
SAML_ENTRY_POINT: IdP login URL (Microsoft Entra ID tenant-specific).
SAML_ISSUER: Unique identifier for the application, must match the Entity ID configured in the IdP.
SAML_CALLBACK_URL: The URL the IdP will post the SAML response to after successful login.
SAML_CERT_PATH: Path to the X.509 certificate used to validate the IdP signature. The certificate was downloaded in the previous section.
REDIRECT_AFTER_LOGIN_URL: Frontend route users are redirected to after login.
LOGOUT_REDIRECT_URL: Redirect target after logging out.
JWT_SECRET: Secret for signing JSON Web Tokens.
SESSION_SECRET: Secret for session signing.

The config.ts file loads environment variables and defines constants used throughout the server:

// ./apps/server/config.ts
import fs from 'fs';
import dotenv from 'dotenv';

dotenv.config();

const SERVER_PORT = process.env.SERVER_PORT || 3000;
const SAML_ENTRY_POINT = process.env.SAML_ENTRY_POINT || '';
const SAML_ISSUER = process.env.SAML_ISSUER || '';
const SAML_CALLBACK_URL = process.env.SAML_CALLBACK_URL || '';
const SAML_CERT_PATH = process.env.SAML_CERT_PATH || '';
const SESSION_SECRET = process.env.SESSION_SECRET || 'your_session_secret';
const LOGOUT_REDIRECT_URL = process.env.LOGOUT_REDIRECT_URL || 'http://localhost:8080';
const REDIRECT_AFTER_LOGIN_URL = process.env.REDIRECT_AFTER_LOGIN_URL || '/';
const JWT_SECRET = process.env.JWT_SECRET || 'your_secret';

let SAML_CERT = '';
if (SAML_CERT_PATH) {
    try {
        SAML_CERT = fs.readFileSync(SAML_CERT_PATH, 'utf-8');
    } catch (error) {
        console.error(`Failed to read SAML cert at ${SAML_CERT_PATH}:`, error);
    }
}

export const config = {
    SERVER_PORT,
    SAML_ENTRY_POINT,
    SAML_ISSUER,
    SAML_CALLBACK_URL,
    SAML_CERT,
    SESSION_SECRET,
    LOGOUT_REDIRECT_URL,
    REDIRECT_AFTER_LOGIN_URL,
    JWT_SECRET,
};

Note: In a production environment, environment variables should be validated (e.g. using zod, joi or livr).
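
As a dependency-free sketch of such validation (a schema library would replace it in practice), the function below checks that required variables are present and that URL-valued ones parse. The variable names come from .env.defaults above; `validateEnv`, the choice of required variables, and the error format are hypothetical.

```javascript
const REQUIRED = [
    "SAML_ENTRY_POINT", "SAML_ISSUER", "SAML_CALLBACK_URL",
    "SAML_CERT_PATH", "JWT_SECRET", "SESSION_SECRET",
];
const URL_VALUED = ["SAML_ENTRY_POINT", "SAML_CALLBACK_URL"];

// Returns a list of human-readable problems; an empty list means "valid".
function validateEnv(env) {
    const errors = [];
    for (const name of REQUIRED) {
        if (!env[name]) errors.push(`${name} is missing`);
    }
    for (const name of URL_VALUED) {
        if (!env[name]) continue; // already reported as missing
        try {
            new URL(env[name]);
        } catch {
            errors.push(`${name} is not a valid URL`);
        }
    }
    return errors;
}

// Usage sketch: fail fast at startup instead of falling back to defaults.
//   const errors = validateEnv(process.env);
//   if (errors.length > 0) throw new Error(errors.join("; "));
```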

Application Logic

The app.ts file defines the web application and its endpoints. It includes logic for login, callback, logout, and accessing a protected resource.

The /api/auth/login endpoint sends a redirect response that causes the browser to navigate to the Identity Provider (IdP) login page.
After successful login, Microsoft Entra ID redirects the user back to /api/auth/callback via a POST request, including the SAML Response.
The server validates the response using passport-saml, generates a JWT, stores it in an HTTP-only cookie, and sends it back to the client.
Once the cookie is received, the client can access protected resources by including the auth_token cookie in subsequent requests.

Note: This implementation is intentionally simplified and stateful. The server maintains session state internally, which can limit scalability. To support horizontal scaling, consider storing session data (such as JWTs) in an external service like Redis, DynamoDB, or a shared database. This allows the application to be stateless and more scalable in production.

// ./apps/server/app.ts
import express from 'express';
import passport from 'passport';
import jwt from 'jsonwebtoken';
import middlewares from './middlewares';
import { config } from './config';
import './samlStrategy';

const app = express();

app.use(middlewares.cookieParser());
app.use(middlewares.session());
app.use(middlewares.passportInit());
app.use(middlewares.passportSession());
app.use(middlewares.urlencode());
app.use(middlewares.cors());

app.get('/api/auth/login', passport.authenticate('saml', { failureRedirect: '/', failureFlash: true }));

app.get('/api/auth/logout', (req, res) => {
    req.logout(() => {
        res.clearCookie('auth_token');
        req.session.destroy(() => {
            res.redirect(config.LOGOUT_REDIRECT_URL);
        });
    });
});

app.post('/api/auth/callback', passport.authenticate('saml', { failureRedirect: '/' }),
    (req, res) => {
        const token = jwt.sign(req.user, config.JWT_SECRET, { expiresIn: '1h' });
        res.cookie('auth_token', token, { httpOnly: true });
        res.redirect(config.REDIRECT_AFTER_LOGIN_URL);
    }
);

app.get('/api/protected', (req, res) => {
    const token = req.cookies.auth_token;
    if (!token) return res.status(401).json({ error: 'Unauthorized' });

    try {
        const user = jwt.verify(token, config.JWT_SECRET);
        res.json({ message: 'Protected content', user });
    } catch (err) {
        res.status(401).json({ error: 'Invalid token' });
    }
});

app.listen(config.SERVER_PORT, () => {
    console.log(`Server running on http://localhost:${config.SERVER_PORT}`);
});

Endpoints:

GET /api/auth/login: Redirects to IdP login page.
POST /api/auth/callback: Handles SAML response coming from IdP.
GET /api/auth/logout: Clears the user session and redirects to the login page.
GET /api/protected: A protected resource endpoint.
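
The token check inside /api/protected can be factored into reusable middleware once more than one protected route exists. The sketch below assumes the token lives in `req.cookies.auth_token`, as in the code above, and takes the verification function as a parameter (for example `(token) => jwt.verify(token, config.JWT_SECRET)`). `requireAuth` is a hypothetical name.

```javascript
// Sketch: Express-style middleware that rejects requests without a valid token.
function requireAuth(verifyToken) {
    return (req, res, next) => {
        const token = req.cookies && req.cookies.auth_token;
        if (!token) {
            return res.status(401).json({ error: "Unauthorized" });
        }
        let user;
        try {
            user = verifyToken(token); // expected to throw on invalid or expired tokens
        } catch (err) {
            return res.status(401).json({ error: "Invalid token" });
        }
        req.user = user;
        return next();
    };
}

// Usage sketch:
//   const verify = (token) => jwt.verify(token, config.JWT_SECRET);
//   app.get('/api/protected', requireAuth(verify), (req, res) => {
//       res.json({ message: 'Protected content', user: req.user });
//   });
```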

Middlewares

The middlewares.ts file encapsulates middleware initialization for easier reuse and separation of concerns:

// ./apps/server/middlewares.ts
import cors from 'cors';
import express from 'express';
import passport from 'passport';
import session from 'express-session';
import cookieParser from 'cookie-parser';
import {config} from './config';

export default {
    cors: () => cors({
        origin: 'http://localhost:8080', // allow your React frontend
        credentials: true,              // allow cookies (auth_token) to be sent
    }),
    urlencode: () => express.urlencoded({ extended: true }),
    passportSession: () => passport.session(),
    passportInit: () => passport.initialize(),
    session: () => session({
        secret: config.SESSION_SECRET,
        resave: false,
        saveUninitialized: true,
        cookie: { secure: false }, // true if HTTPS
    }),
    cookieParser: () => cookieParser(),
}
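
The CORS origin above is hardcoded to the local frontend. One way to keep it in sync with the rest of the configuration, shown here as a sketch, is to derive it from a URL that is already configured, such as REDIRECT_AFTER_LOGIN_URL. `originOf` is a hypothetical helper built on the standard `URL` class.

```javascript
// Sketch: reduce a full URL to its origin (scheme + host + port, no path).
function originOf(url) {
    return new URL(url).origin;
}

// Usage sketch:
//   cors({ origin: originOf(config.REDIRECT_AFTER_LOGIN_URL), credentials: true })
console.log(originOf("http://localhost:8080/home")); // http://localhost:8080
```

This requires an absolute URL; a relative value such as '/' makes `new URL` throw, so the configured value would need to be validated first.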

SAML Strategy

The samlStrategy.ts file configures the passport-saml strategy for handling SAML authentication:

// ./apps/server/samlStrategy.ts
import passport from 'passport';
import { Strategy as SamlStrategy, SamlConfig, Profile } from 'passport-saml';
import { config } from './config';

type UserProfile = {
    id: string;
    email?: string;
    name?: string;
};

passport.serializeUser((user: Express.User, done) => {
    done(null, user);
});
passport.deserializeUser((user: Express.User, done) => {
    done(null, user);
});

const samlOptions: SamlConfig = {
    entryPoint: config.SAML_ENTRY_POINT,
    issuer: config.SAML_ISSUER,
    callbackUrl: config.SAML_CALLBACK_URL,
    cert: config.SAML_CERT,
    identifierFormat: null,
};

const samlStrategy = new SamlStrategy(samlOptions, (profile: Profile, done) => {
    const user: UserProfile = {
        id: profile.nameID || '',
        email: profile.email || profile['http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress'] as string,
        name: profile['http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name'] as string,
    };
    return done(null, user);
});

passport.use(samlStrategy);
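
The attribute mapping inside the verify callback can also be written as a standalone function, which makes it easy to test against sample profiles without involving the IdP. This sketch uses the same claim URIs as the strategy above; `toUserProfile` and the `CLAIMS` constant are hypothetical names.

```javascript
const CLAIMS = {
    email: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress",
    name: "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name",
};

// Sketch: map a SAML profile object to the user shape stored in the JWT.
function toUserProfile(profile) {
    return {
        id: profile.nameID || "",
        email: profile.email || profile[CLAIMS.email],
        name: profile[CLAIMS.name],
    };
}

// Usage sketch inside the strategy callback:
//   (profile, done) => done(null, toUserProfile(profile))
```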

Web Application

This section describes the final component of the solution — the client application, which interacts with the REST API server. Like the server, the client is intentionally kept minimal and implements basic functionality to:

Authenticate a user
Retrieve a protected resource from the server
Implement logout functionality

The client application is built using Next.js, a popular React framework that supports both server-side rendering and static generation.

Application Structure
The directory structure is as follows:

my-app/
├── apps
│   ├── server/
│   │   └── ...
│   └── client/
│       ├── pages/
│       │   ├── _app.tsx
│       │   ├── home.tsx
│       │   └── index.tsx
│       ├── .env.local
│       ├── config.tsx
│       ├── package.json
│       └── tsconfig.json
└── README.md

Configuration

The application is configured using environment variables, defined in the .env.local file. These variables include the URL of the backend API. In a Next.js application, public environment variables (exposed to the browser) must be prefixed with NEXT_PUBLIC_.

Example .env.local:

NEXT_PUBLIC_API_URL=http://localhost:3000

The config.tsx file loads environment variables and exports constants used throughout the client:

interface IConfig {
    serverURL: string;
}

export default {
    serverURL: process.env.NEXT_PUBLIC_API_URL || ''
} as IConfig;

Application logic

The _app.tsx file defines the top-level React component that wraps all pages. In this example it is a simple wrapper, but in a real-world application this is where you would typically include global styles, providers, and shared layout components.

import type { AppProps } from 'next/app';

export default function App({ Component, pageProps }: AppProps) {
    return <Component {...pageProps} />;
}

The index.tsx file is the main entry point of the application, served at the root URL (/). It functions as the login page.

When the user clicks the “Login with SAML (Microsoft Entra ID)” button, they are redirected to the Identity Provider login page. After successful authentication, the user is redirected back to the client (typically /home) with a cookie containing the authentication token.

import config from '../config';

export default function LoginPage() {
    const handleLogin = () => {
        window.location.href = `${config.serverURL}/api/auth/login`;
    };

    return (
        <div>
            <h1>Login</h1>

            <p>You will be redirected to Microsoft Entra ID to sign in.</p>

            <button
                onClick={handleLogin}
                style={{
                    padding: '0.5rem 1rem',
                    fontSize: '1rem',
                    cursor: 'pointer',
                }}
            >
                Login with SAML (Microsoft Entra ID)
            </button>
        </div>
    );
}

The home.tsx page displays a protected resource. When the component mounts, it sends a request to the /api/protected endpoint using the authentication cookie. If the user is authenticated, the protected data is displayed; otherwise, the user is redirected to the login page.

Logout functionality is also provided: clicking the “Logout” button redirects the user to the /api/auth/logout endpoint, which clears the session on the backend and redirects the user back to the login page.

import { useEffect, useState } from 'react';
import config from '../config';

type User = {
    name?: string;
    email: string;
};

export default function Home() {
    const [user, setUser] = useState<User | null>(null);

    useEffect(() => {
        fetch(`${config.serverURL}/api/protected`, { credentials: 'include' })
            .then(res => {
                if (!res.ok) throw new Error('Unauthorized');
                return res.json();
            })
            .then(data => setUser(data.user))
            .catch(() => {
                window.location.href = `${config.serverURL}/api/auth/login`;
            });
    }, []);

    const handleLogout = () => {
        window.location.href = `${config.serverURL}/api/auth/logout`;
    };

    if (!user) return <p>Loading...</p>;

    return (
        <div>
            <h1>Welcome, {user.name || user.email}</h1>

            <p>You are authenticated!</p>

            <button onClick={handleLogout}>Logout</button>
        </div>
    );
}

Conclusion

This article demonstrated a practical and modular approach to integrating Microsoft Entra ID into a client–server architecture using SAML-based SSO. It showed how to:

Configure Microsoft Entra ID with appropriate SAML settings
Implement a minimal Node.js REST API to manage the authentication flow
Create a Next.js-based frontend to handle user login, logout and protected resource access

While the current implementation serves as a solid starting point, it should be enhanced with production-ready features: making the server stateless by moving user sessions out of the REST server, using HTTPS instead of HTTP, and adding error handling. With these additions, the solution can evolve into a robust, secure, and maintainable authentication framework suitable for real-world deployments.

References

Implementing OAuth 2.0 Authorization Code Flow with AWS Cognito for Enterprise SSO

Dr. Yaroslav Zhbankov — Tue, 08 Jul 2025 03:44:35 GMT

Introduction

As your product matures into an enterprise-grade solution, or as the number of user accounts grows from dozens to hundreds — or even as you begin deploying the product for multiple customers — you’ll quickly face challenges related to user account management. These include increased security risks, administrative overhead and a poor user experience, especially when a single user needs to access multiple systems within your product that each use separate user management mechanisms.

The solution to these problems is centralized user management and, at a minimum, Single Sign-On (SSO). Today, nearly every modern product supports some form of SSO. Thousands of web applications offer social login options using providers like Google, Facebook, or GitHub. Implementing SSO for web applications has become relatively easy, due to widely adopted standards like OAuth 2.0, SAML and numerous libraries that follow best practices.

However, for enterprise-focused solutions, social login is often not acceptable due to security, compliance, or control requirements. In these cases, you’ll need to use dedicated authentication services that you manage or trust, such as AWS Cognito, Azure AD (Entra ID), Keycloak, or Okta. These systems offer robust capabilities but typically require a bit more effort to configure the correct authentication flow.

This article shows how to set up SSO (excluding user management) using AWS Cognito for a client–server architecture in which multiple clients, such as desktop and web applications, authenticate through a single backend. It implements the OAuth 2.0 Authorization Code Flow with client secret exchange, using a simple Node.js server and a React-based frontend as the example. This pattern is common across many projects, and I hope it proves useful to you.

Solution overview

As described above, the solution consists of a web application built with Next.js that provides the user interface, a web server built with Node.js (Express) that exposes a REST API for accessing resources, and AWS Cognito as the centralized identity provider. The solution implements the OAuth 2.0 Authorization Code Flow, enabling SSO.

For this example, I will introduce a few requirements that define how the solution is implemented:

The system should support multiple client applications — such as different web and desktop apps — used by users to access shared server resources.
It should also support multiple web servers, each serving a different customer.
Client applications must be able to work with multiple customers (servers) simultaneously, without any hardcoded or customer-specific configuration.

Given these constraints, client applications should not be aware of AWS Cognito, since each customer may use a different Cognito User Pool. Therefore, the authentication logic that depends on the IdP configuration must reside on the server, while every client implements the same uniform flow.

Since each server is customer-specific, it will be preconfigured with the appropriate AWS Cognito User Pool. In this model, the client applications authenticate through the server, which acts as a proxy to Cognito.

The server will implement endpoints such as:

GET /api/auth/login – returns the Cognito login URL
GET /api/auth/logout – returns the Cognito logout URL
POST /api/auth/refresh – refreshes the access token
GET /api/auth/callback – exchanges the authorization code for tokens

Note: This implementation does not use PKCE. The client secret is never stored in the client application; it is kept on the server.

Below is a diagram of the described solution (Fig. 1).

Fig. 1 Diagram of the SSO-enabled client–server solution

The steps shown in the diagram illustrate the basic process of how a user is authenticated in the system.

The user visits the web application (1) and decides to log in. The application sends a request to the server (2), which responds with a login URL containing a pre-generated state value (3). The web application then redirects the user to the AWS Cognito Hosted UI (4). If Cognito is connected to an external identity provider (e.g., Azure AD), it may further redirect the user to that IdP.

The user authenticates via Cognito using their credentials (5). Cognito then redirects the user back to the client application with an authorization code and the original state value (6). The client verifies the state to prevent CSRF attacks, extracts the code, and sends it to the server via a callback request (7).

The server exchanges the authorization code for tokens by sending a request to Cognito, including the client ID, client secret, code, and redirect URI (8). If the verification is successful, Cognito returns a set of tokens (ID token, refresh token, and access token) (9). The server then sends these tokens back to the client (10).

The client securely stores the tokens (e.g., using HTTP-only cookies or session storage) and can now make authenticated requests to the server. The server validates the access token included in each request to authorize access to protected resources.

Authentication Sequence: Step-by-Step Flow

The sequence diagram below (Fig. 2) describes the authentication flow in more detail.

Fig. 2 The sequence diagram describing the authentication flow

1. User Visits the Web App
- Loads the app in the browser.
- Checks for an existing access token (e.g., in memory or in a secure HTTP-only cookie).
- If there is no valid token, shows the login screen with options (e.g., Email/Password, “Login with AWS Cognito”).
2. User Clicks “Login with AWS Cognito”
- Web App → REST API: GET /api/auth/login.
- REST API builds the Cognito Hosted UI login URL with the following query parameters: client_id, redirect_uri, response_type, scope, state.
- REST API → Web App: Returns the generated login URL.
3. App Redirects to Cognito Hosted UI
- Web App: window.location.href = loginUrl
4. User Logs In via Cognito
- Cognito Hosted UI presents the login options configured in the User Pool (e.g., Email/Password, MFA).
- User → Cognito: Enters credentials and submits the form.
5. Cognito Authenticates the User
- Validates credentials.
- Generates an OAuth2 authorization_code.
- Redirects to: https://mywebapp.io/auth/callback?code=abc123&state=xyz
6. Web App Handles Redirect
- Detects the route /auth/callback.
- Extracts code and state, and verifies that state matches the value issued at login.
- Web App → REST API: GET /api/auth/callback with the code.
7. REST API Exchanges Code for Tokens
- REST API → Cognito (Token Endpoint): grant_type=authorization_code, code=abc123, redirect_uri=https://mywebapp.io/auth/callback, client_id, client_secret
- Cognito → REST API Response: access_token, id_token, refresh_token, expires_in, token_type.
8. REST API Returns Tokens to Web App
- REST API → Frontend: access_token, refresh_token.
9. Browser App Stores Tokens and Loads UI
- Stores tokens securely (e.g., in memory or in a secure cookie).
- Begins making authenticated API requests to load user-specific data.
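
The redirect-handling step above can be sketched as a small client-side helper. This is a minimal illustration, not code from the repository: the name parseCallback and the way the expected state is passed in are assumptions, and a real client would first need to remember the state it received with the login URL.

```typescript
// Hypothetical helper: extracts code and state from the redirect URL
// and rejects the redirect if the state is not the one we expect.
interface CallbackParams {
    code: string;
    state: string;
}

function parseCallback(redirectUrl: string, expectedState: string): CallbackParams {
    const params = new URL(redirectUrl).searchParams;
    const code = params.get('code');
    const state = params.get('state');

    if (!code) throw new Error('Missing authorization code');
    // A mismatch means the redirect was not triggered by our own login request
    if (!state || state !== expectedState) throw new Error('State mismatch');

    return { code, state };
}
```

Only after this check passes would the client forward the code to the server for the token exchange.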

Solution Implementation

The complete solution can be deployed on AWS, utilizing Terraform to streamline the cloud deployment process. The sections below describe the implementation of each solution component. The solution itself can be found on GitHub [1]. This article does not focus on how to deploy the client and server to AWS, but rather on the implementation details of each component. Instructions for deploying the client to AWS S3 and the server to EC2 can be found in articles [2] and [3].

AWS Cognito

This section describes the Terraform script that deploys AWS Cognito. Instructions for configuring Terraform in your project can be found in [3]. To configure the authentication service, it is necessary to deploy the User Pool defined in the cognito_pool section, along with the Application Client (cognito_pool_client), which handles the authentication flow, OAuth scopes, redirect URLs, and token settings. The client section also enables generation of a client secret, which the REST API server uses to interact with Cognito, and enables the OAuth 2.0 Authorization Code Flow.

In this example, the OAuth scopes and token lifetimes (for ID, access, and refresh tokens) are explicitly configured. The callback and logout URLs are set to http://localhost:8080 for local development. In a production environment, these URLs should be replaced with the actual web application addresses.

The user pool configuration also specifies password requirements, ensuring secure user credential policies.

Additionally, the Terraform script provisions a default user with the email user@example.com and the password P@ssw0rd123!, making a test user available immediately after infrastructure deployment.

resource "aws_cognito_user_pool" "cognito_pool" {
  name = "cognito-example-user-pool"

  username_attributes      = ["email"]
  auto_verified_attributes = ["email"]

  password_policy {
    minimum_length    = 8
    require_uppercase = true
    require_lowercase = true
    require_numbers   = true
    require_symbols   = true
  }
}

resource "aws_cognito_user_pool_client" "cognito_pool_client" {
  name                         = "cognito-client"
  user_pool_id                 = aws_cognito_user_pool.cognito_pool.id
  generate_secret              = true
  supported_identity_providers = ["COGNITO"]

  allowed_oauth_flows_user_pool_client = true
  allowed_oauth_flows = [
    "code"
  ]
  allowed_oauth_scopes = ["email", "openid", "phone", "profile"] # "profile" is requested by the server's login URL

  explicit_auth_flows = [
    "ALLOW_USER_AUTH",
    "ALLOW_USER_SRP_AUTH",
    "ALLOW_REFRESH_TOKEN_AUTH",
    "ALLOW_ADMIN_USER_PASSWORD_AUTH",
    "ALLOW_USER_PASSWORD_AUTH"
  ]

  callback_urls = [
    "http://localhost:8080"
  ]

  logout_urls = [
    "http://localhost:8080/logout",
  ]

  token_validity_units {
    access_token  = "hours"
    id_token      = "hours"
    refresh_token = "days"
  }

  access_token_validity  = 1
  id_token_validity      = 1
  refresh_token_validity = 5

  enable_token_revocation       = true
  prevent_user_existence_errors = "ENABLED"
}

resource "aws_cognito_user_pool_domain" "cognito_domain" {
  domain       = "user-pool-domain-yz"
  user_pool_id = aws_cognito_user_pool.cognito_pool.id
}

resource "null_resource" "cognito_dependency" {
  depends_on = [
    aws_cognito_user_pool.cognito_pool,
    aws_cognito_user_pool_client.cognito_pool_client,
    aws_cognito_user_pool_domain.cognito_domain
  ]
}

resource "aws_cognito_user" "default_user" {
  user_pool_id = aws_cognito_user_pool.cognito_pool.id
  username     = "user@example.com"
  attributes = {
    email = "user@example.com"
  }

  force_alias_creation = false
  message_action       = "SUPPRESS" # Prevents sending a signup email

  depends_on = [
    aws_cognito_user_pool.cognito_pool
  ]
}

resource "null_resource" "set_default_user_password" {
  provisioner "local-exec" {
    command = <<-EOT
      aws cognito-idp admin-set-user-password \
        --region ${var.AWS_REGION} \
        --user-pool-id ${aws_cognito_user_pool.cognito_pool.id} \
        --username user@example.com \
        --password "P@ssw0rd123!" \
        --permanent
    EOT
  }

  depends_on = [
    aws_cognito_user.default_user
  ]
}

AWS Cognito is a widely used service, and comprehensive documentation and user management guides are readily available in official [5] and third-party resources [2].

Web Server

Application Structure
The directory structure is as follows:

aws-cognito/
├── .github/
│   └── workflows/
│       └── ...
├── apps/
│   ├── server/
│   │   ├── .env
│   │   ├── .env.defaults
│   │   ├── app.ts
│   │   ├── config.ts
│   │   ├── middlewares.ts
│   │   ├── package.json
│   │   └── tsconfig.json
│   └── client/
│       └── ...
├── dev-ops/
│   └── ...
└── README.md

Configuration

Configuration is handled using the dotenv library. Default values are specified in the .env.defaults file. The config.ts file loads and provides configuration values to the application.

.env.defaults
This file includes basic configuration values required for the solution to work.

SERVER_PORT=3000
CLIENT_ID=
CLIENT_SECRET=
COGNITO_DOMAIN=user-pool-domain-yz.auth.us-east-1.amazoncognito.com
COGNITO_ISSUER=https://cognito-idp.us-east-1.amazonaws.com/us-east-1_Ed6C3tdul
REDIRECT_URI=http://localhost:8080/
LOGOUT_URI=http://localhost:8080/

Descriptions:

SERVER_PORT: Port number the server listens on.
CLIENT_ID: AWS Cognito client ID (retrieved from the AWS Console).
CLIENT_SECRET: AWS Cognito client secret (retrieved from the AWS Console).
COGNITO_DOMAIN: Domain name of the Cognito user pool (retrieved from the AWS Console).
COGNITO_ISSUER: Issuer URL used for token validation (retrieved from the AWS Console).
REDIRECT_URI: URI Cognito will redirect to after successful login.
LOGOUT_URI: URI Cognito will redirect to after logout.

config.ts
This file loads environment variables and defines constants used throughout the server:

import dotenv from 'dotenv';

dotenv.config();

const SERVER_PORT = process.env.SERVER_PORT || 3000;
const CLIENT_ID = process.env.CLIENT_ID || '';
const CLIENT_SECRET = process.env.CLIENT_SECRET || '';
const COGNITO_DOMAIN = process.env.COGNITO_DOMAIN || '';
const ISSUER = process.env.COGNITO_ISSUER || '';
const REDIRECT_URI = process.env.REDIRECT_URI || '';
const LOGOUT_URI = process.env.LOGOUT_URI || '';
const STATE_SECRET = 'secure-random-state-secret'; // placeholder: the state must be randomly generated for each request
const TOKEN_ENDPOINT = `https://${COGNITO_DOMAIN}/oauth2/token`;
const JWKS_URL = `${ISSUER}/.well-known/jwks.json`;

export default {
    SERVER_PORT,
    CLIENT_ID,
    CLIENT_SECRET,
    COGNITO_DOMAIN,
    REDIRECT_URI,
    LOGOUT_URI,
    STATE_SECRET,
    TOKEN_ENDPOINT,
    ISSUER,
    JWKS_URL
}

Note: In a production environment, environment variables should be validated (e.g. using zod, joi or livr).
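
As a rough illustration of the validation mentioned in the note, here is a dependency-free sketch. The names validateEnv and REQUIRED_VARS are hypothetical and not part of the original config.ts; a schema library such as zod or joi would express the same checks more compactly.

```typescript
// Hypothetical helper, not part of the original config.ts
type Env = Record<string, string | undefined>;

const REQUIRED_VARS = [
    'CLIENT_ID',
    'CLIENT_SECRET',
    'COGNITO_DOMAIN',
    'COGNITO_ISSUER',
    'REDIRECT_URI',
    'LOGOUT_URI',
] as const;

const URL_VARS = ['COGNITO_ISSUER', 'REDIRECT_URI', 'LOGOUT_URI'] as const;

// Returns a list of human-readable problems; an empty list means the env is usable
function validateEnv(env: Env): string[] {
    const errors: string[] = [];

    for (const name of REQUIRED_VARS) {
        if (!env[name]) errors.push(`${name} is required`);
    }

    // Values that are used as URLs must at least parse as URLs
    for (const name of URL_VARS) {
        const value = env[name];
        if (!value) continue;
        try {
            new URL(value);
        } catch {
            errors.push(`${name} must be a valid URL`);
        }
    }

    // SERVER_PORT is optional and defaults to 3000, as in config.ts
    const port = Number(env.SERVER_PORT ?? 3000);
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
        errors.push('SERVER_PORT must be a port number between 1 and 65535');
    }

    return errors;
}
```

One way to use it would be to call validateEnv(process.env) right after dotenv.config() and stop the process if the returned list is not empty, so that a misconfigured server fails at startup rather than on the first login attempt.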

Application Logic

app.ts
This file defines the web application and its endpoints. It includes logic for login, token exchange, refresh, logout, and a protected resource.

The /auth/login endpoint constructs the login URL for AWS Cognito Hosted UI. The state is currently hardcoded but should be dynamically generated for security reasons in production.

After login, Cognito redirects the user to the specified REDIRECT_URI with a code and state in the query string. The frontend verifies the state and sends the code to /auth/callback to exchange for tokens.

Once tokens are received, the client can access protected resources by including the access token in the Authorization header.

import express, {Request, Response} from 'express';
import cors from 'cors';
import axios from 'axios';
import qs from 'querystring';
import config from './config';
import {checkAccess} from './middlewares';

const app = express();

app.use(express.json());
app.use(cors());

app.get('/auth/login', (req: Request, res: Response): void => {
    const state = config.STATE_SECRET;
    const loginUrl = `https://${config.COGNITO_DOMAIN}/oauth2/authorize?` + qs.stringify({
        response_type: 'code',
        client_id: config.CLIENT_ID,
        redirect_uri: config.REDIRECT_URI,
        scope: 'openid profile email',
        state,
    });
    res.json({ loginUrl });
});

app.get('/auth/callback', async (req: Request, res: Response): Promise<void> => {
    const code = req.query.code as string;
    const state = req.query.state as string;

    if (!code) {
        res.status(400).send('Invalid code');
        return; // stop here; otherwise the token exchange below would still run
    }

    try {
        const response = await axios.post(
            config.TOKEN_ENDPOINT,
            qs.stringify({
                grant_type: 'authorization_code',
                code,
                redirect_uri: config.REDIRECT_URI,
                client_id: config.CLIENT_ID,
                client_secret: config.CLIENT_SECRET,
            }),
            { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
        );

        res.json(response.data);
    } catch (error: any) {
        res.status(500).json({ error: 'Token exchange failed', details: error.response?.data });
    }
});

app.post('/auth/refresh', async (req: Request, res: Response): Promise<void> => {
    const { refresh_token } = req.body;

    try {
        const response = await axios.post(
            config.TOKEN_ENDPOINT,
            qs.stringify({
                grant_type: 'refresh_token',
                refresh_token,
                client_id: config.CLIENT_ID,
                client_secret: config.CLIENT_SECRET,
            }),
            { headers: { 'Content-Type': 'application/x-www-form-urlencoded' } }
        );

        res.json(response.data);
    } catch (error: any) {
        res.status(500).json({ error: 'Token refresh failed', details: error.response?.data });
    }
});

app.get('/auth/logout', (req: Request, res: Response): void => {
    const logoutUrl = `https://${config.COGNITO_DOMAIN}/logout?` + qs.stringify({
        client_id: config.CLIENT_ID,
        logout_uri: config.LOGOUT_URI,
    });
    res.json({ logoutUrl });
});

app.get('/api/resource', checkAccess, (req, res): void => {
    res.json({ msg: "Resource data", ts: new Date().toISOString() });
});

app.listen(config.SERVER_PORT, () => console.log(`Server running on port ${config.SERVER_PORT}`));

Endpoints:

GET /auth/login: Returns login URL.
GET /auth/callback: Handles token exchange with Cognito.
POST /auth/refresh: Refreshes the token using refresh token.
GET /auth/logout: Returns the logout URL for Cognito.
GET /api/resource: A protected resource endpoint.
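
The /auth/login handler uses a fixed state value. A per-request state could be generated and checked along the following lines. This is a sketch under assumptions: issueState, consumeState and the in-memory Map are hypothetical, and checking on the server presumes the client forwards the state to /auth/callback. With several server instances, a shared store or a signed cookie would be needed instead of process memory.

```typescript
import { randomBytes } from 'crypto';

// Hypothetical in-memory store: state value -> expiry timestamp (ms)
const STATE_TTL_MS = 5 * 60 * 1000;
const pendingStates = new Map<string, number>();

// Creates an unpredictable state value and remembers it for a limited time
function issueState(now: number = Date.now()): string {
    const state = randomBytes(16).toString('hex');
    pendingStates.set(state, now + STATE_TTL_MS);
    return state;
}

// Returns true only once for a known, unexpired state; the state is removed on first use
function consumeState(state: string, now: number = Date.now()): boolean {
    const expiresAt = pendingStates.get(state);
    if (expiresAt === undefined) return false;
    pendingStates.delete(state);
    return now <= expiresAt;
}
```

In this variant, /auth/login would call issueState() instead of reading config.STATE_SECRET, and /auth/callback would reject the request when consumeState(state) returns false.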

Authentication Middleware

middlewares.ts

This file defines the checkAccess middleware used to protect endpoints. It validates the JWT access token using the public key from AWS Cognito.

import {Request, Response, NextFunction} from 'express';
import jwkToPem, {JWK as JwkToPemJwk} from 'jwk-to-pem';
import jwt, {JwtHeader, JwtPayload} from 'jsonwebtoken';
import fetch from 'node-fetch';
import config from './config';

// A JWK as published by Cognito: a standard JWK plus the key ID (kid)
type CognitoJwk = JwkToPemJwk & {
    kid: string;
};

// Fetches the user pool's public keys and indexes them by key ID
async function getPublicKeys(): Promise<Record<string, CognitoJwk>> {
    const res = await fetch(config.JWKS_URL);
    if (!res.ok) throw new Error(`Failed to fetch JWKS: ${res.status}`);
    const json = (await res.json()) as { keys: CognitoJwk[] };

    return json.keys.reduce((acc, key) => {
        acc[key.kid] = key;
        return acc;
    }, {} as Record<string, CognitoJwk>);
}

async function verifyAwsToken(token: string): Promise<JwtPayload> {
    const decoded = jwt.decode(token, { complete: true }) as { header: JwtHeader } | null;
    if (!decoded || !decoded.header?.kid) throw new Error('Invalid token');

    const keys = await getPublicKeys();
    const jwk = keys[decoded.header.kid];
    if (!jwk) throw new Error('Invalid token');

    const pem = jwkToPem(jwk);

    const verified = jwt.verify(token, pem, {
        algorithms: ['RS256'],
        issuer: config.ISSUER,
    });

    if (typeof verified === 'string') {
        throw new Error('Unexpected token format');
    }

    return verified;
}

export const checkAccess = async (req: Request, res: Response, next: NextFunction): Promise<void> => {
    try {
        const authHeader = req.headers.authorization;

        if (!authHeader?.startsWith('Bearer ')) {
            res.status(401).json({ error: 'Missing or invalid Authorization header' });
            return;
        }

        const token = authHeader.split(' ')[1];

        await verifyAwsToken(token);

        next();
    } catch (err: any) {
        console.error('Token verification error:', err?.message || err);
        res.status(401).json({ error: 'Invalid token' });
    }
};

In a production setting the JWKS (public key) should be cached to avoid fetching it on every request.
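
One possible shape for such a cache is a small wrapper that remembers the result of an async loader for a fixed time. The helper name cached, the TTL value, and the injectable clock are assumptions made for this sketch; they are not part of the original middlewares.ts.

```typescript
// Hypothetical generic helper: caches the result of an async loader for ttlMs milliseconds
function cached<T>(
    load: () => Promise<T>,
    ttlMs: number,
    now: () => number = Date.now,
): () => Promise<T> {
    let value: Promise<T> | undefined;
    let expiresAt = 0;

    return () => {
        // Serve the remembered promise while it is still fresh
        if (value && now() < expiresAt) return value;

        const pending = load();
        value = pending;
        expiresAt = now() + ttlMs;

        // Do not keep a failed load in the cache; the next call retries
        pending.catch(() => {
            if (value === pending) value = undefined;
        });

        return pending;
    };
}
```

In the middleware this could be wired up as const getCachedPublicKeys = cached(getPublicKeys, 60 * 60 * 1000), with verifyAwsToken calling getCachedPublicKeys() instead of getPublicKeys(). Because the promise itself is cached, concurrent requests that arrive during the first fetch share a single JWKS download.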

Web Application

This section describes the missing piece — the client application — which sends requests and interacts with the REST API server. Like the server, the application is intentionally kept minimal and implements basic functionality to:

Authenticate a user
Retrieve a protected resource from the server
Implement logout functionality

The client application is built using Next.js, a popular React framework.

Application Structure
The directory structure is as follows:

aws-cognito/
├── .github/
│   └── workflows/
│       └── ...
├── apps/
│   ├── server/
│   │   └── ...
│   └── client/
│       ├── pages/
│       │   ├── _app.tsx
│       │   ├── home.tsx
│       │   └── index.tsx
│       ├── .env.local
│       ├── config.tsx
│       ├── package.json
│       └── tsconfig.json
├── dev-ops/
│   └── ...
└── README.md

Configuration

Example .env.local:

NEXT_PUBLIC_API_URL=http://localhost:3000

config.tsx

This file loads environment variables and defines constants used throughout the client:

interface IConfig {
    serverURL: string;
}

export default {
    serverURL: process.env.NEXT_PUBLIC_API_URL || ''
} as IConfig;

Application logic

_app.tsx

This file is the top-level React component that wraps all pages. In this example, it’s a simple wrapper, but in a production application, this is the place to add providers, global styles, and token/context handling logic.

import type {AppProps} from 'next/app';

export default function App({ Component, pageProps }: AppProps) {
    return <Component {...pageProps} />;
}

index.tsx

This is the main entry point of the application, served at the root URL (/). It functions as the login page.

When the user clicks the “Login” button, it redirects them to the authorization URL received from the backend. After a successful login, the Cognito service redirects the user back with an authorization code. This code is captured, exchanged for tokens via the backend, and the access token is stored in sessionStorage.

In a real-world application, it’s more secure to handle token storage via HTTP-only cookies.


import {useEffect} from 'react';
import {useRouter} from 'next/router';
import config from '../config';

export default function LoginPage() {
    const router = useRouter();

    useEffect(() => {
        const code = router.query.code as string;
        const state = router.query.state as string;

        // logic to compare state value
        // ...

        if (code && state) {
            fetch(`${config.serverURL}/auth/callback?code=${encodeURIComponent(code)}`)
                .then(res => res.json())
                .then(data => {
                    sessionStorage.setItem('access_token', data.access_token);
                    router.push('/home');
                })
                .catch(err => {
                    console.error('Error exchanging code:', err);
                });
        }
    }, [router.query]);

    const handleLogin = async () => {
        try {
            const res = await fetch(`${config.serverURL}/auth/login`);
            const data = await res.json();
            if (data.loginUrl) {
                window.location.href = data.loginUrl;
            } else {
                console.error('Login URL not found in response');
            }
        } catch (error) {
            console.error('Error fetching login URL:', error);
        }
    };

    return (
        <div>
            <h1>Login Page</h1>

            <button onClick={handleLogin}>Login</button>
        </div>
    );
}

home.tsx

This component displays a protected resource. On load, it attempts to fetch an access token from sessionStorage and use it to request a secured resource from the backend via the Authorization header.

Logout logic is also implemented: the client calls the backend to terminate the Cognito session and clears the token from sessionStorage.

// pages/home.tsx

import {useEffect, useState} from 'react';
import {useRouter} from 'next/router';
import config from '../config';

export default function HomePage() {
    const [token, setToken] = useState<string | null>(null);
    const [resourceData, setResourceData] = useState<string | null>(null);
    const router = useRouter();

    useEffect(() => {
        const accessToken = sessionStorage.getItem('access_token');
        if (!accessToken) {
            router.push('/');
        } else {
            setToken(accessToken);
            fetchResource(accessToken);
        }
    }, [router]);

    const fetchResource = async (accessToken: string) => {
        try {
            const res = await fetch(`${config.serverURL}/api/resource`, {
                headers: {
                    'Authorization': `Bearer ${accessToken}`,
                },
            });

            if (res.ok) {
                const data = await res.json();
                setResourceData(JSON.stringify(data));
            } else {
                console.error('Failed to fetch resource');
            }
        } catch (err) {
            console.error('Error fetching resource:', err);
        }
    };

    const handleLogout = async () => {
        try {
            const res = await fetch(`${config.serverURL}/auth/logout`, {
                method: 'GET',
                headers: {
                    'Authorization': `Bearer ${token}`,
                },
            });

            if (res.ok) {
                const { logoutUrl } = await res.json();
                sessionStorage.removeItem('access_token');
                window.location.href = logoutUrl; // redirect to the logout URL
            } else {
                console.error('Logout request failed');
            }
        } catch (err) {
            console.error('Logout error:', err);
        }
    };

    return (
        <div>
            <h1>Home Page</h1>

            <p>Access Token: {token}</p>

            <p>Resource Data: {resourceData}</p>

            <button onClick={handleLogout}>Logout</button>
        </div>
    );
}

Conclusion

This article demonstrated a practical and modular approach to integrating Cognito SSO into a client–server architecture using the OAuth 2.0 Authorization Code Flow. It showed how to:

Configure AWS Cognito with appropriate OAuth settings
Implement a minimal Node.js REST API to manage the authentication flow
Create a Next.js-based frontend to handle user login, token handling, and protected resource access

While the current implementation serves as a solid starting point, it should be enhanced with production-ready features such as dynamic state generation, secure HTTP-only cookies, token encryption, JWKS caching, and error handling. With these additions, the solution can evolve into a robust, secure, and maintainable authentication framework suitable for real-world deployments.

References

Caching Essentials: Types, Strategies, and Best Practices

Dr. Yaroslav Zhbankov — Wed, 02 Apr 2025 02:35:03 GMT

Introduction

Caching is a crucial technique for improving speed and performance in modern computing. By storing frequently accessed data closer to applications, caching reduces latency, eases backend load, and accelerates response times. From simple in-app caching to complex distributed systems, it plays a key role across various domains.
Many caching strategies exist, each serving a specific purpose, so it is important to understand when to choose each one.

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

This article explores different caching types, strategies, use cases, and technologies.

What is Cache?

A cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere.

Cache types

Application-Level Caching

At the application level, caching involves storing data in a fast-access layer (typically in-memory) within or close to the application itself. This contrasts with lower-level caching (e.g., CPU caches or database query caches) or higher-level caching (e.g., CDN caches). The goal is to minimize latency, reduce load on backend systems, and speed up response times for users.
In-Memory Cache: Stores frequently used data in RAM for fast access. Common technologies: Redis (key-value store with TTL, pub/sub), Memcached (lightweight key-value store), Guava Cache (Java in-memory caching library).
Object Cache: Stores frequently used objects or data structures in memory. Used in frameworks like Spring Cache (Java) or Node.js LRU cache.
Session Cache: Stores user session data to reduce database lookups. Examples: Redis (session store for web apps); Sticky Sessions (server-side session caching).
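
To make the in-memory variant concrete, here is a minimal sketch of a size-bounded cache with LRU eviction and a per-entry TTL. The class name LruCache and its parameters are illustrative assumptions; it relies on the fact that a JavaScript Map iterates keys in insertion order, and it assumes maxSize is at least 1.

```typescript
// Illustrative in-process cache: bounded size, least-recently-used eviction, per-entry TTL
class LruCache<K, V> {
    private entries = new Map<K, { value: V; expiresAt: number }>();
    private maxSize: number;
    private ttlMs: number;

    constructor(maxSize: number, ttlMs: number) {
        this.maxSize = maxSize;
        this.ttlMs = ttlMs;
    }

    get(key: K, now: number = Date.now()): V | undefined {
        const entry = this.entries.get(key);
        if (!entry) return undefined;
        if (now >= entry.expiresAt) {
            this.entries.delete(key); // expired entries are dropped lazily on read
            return undefined;
        }
        // Re-insert so the key becomes the most recently used one
        this.entries.delete(key);
        this.entries.set(key, entry);
        return entry.value;
    }

    set(key: K, value: V, now: number = Date.now()): void {
        this.entries.delete(key);
        if (this.entries.size >= this.maxSize) {
            // The first key in iteration order is the least recently used
            const oldest = this.entries.keys().next().value;
            if (oldest !== undefined) this.entries.delete(oldest);
        }
        this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    }
}
```

A cache like this lives inside a single process, so each application instance has its own copy. That is the main trade-off compared with an external store such as Redis or Memcached.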

Database Caching

Database caching involves keeping copies of database query results, rows, or objects in a cache (e.g., an in-memory store like Redis or Memcached, or even the database’s own built-in cache). When an application needs that data again, it retrieves it from the cache instead of hitting the database, reducing latency and load.
Query Cache: Stores the results of expensive database queries. Example: MySQL Query Cache (deprecated in MySQL 5.7 and removed in MySQL 8.0).
Row-Level Cache: Caches individual database rows in memory. Example: Hibernate Second-Level Cache.
Full-Page Cache: Stores entire database-generated pages for faster retrieval. Example: WordPress Full-Page Cache.
Write-Through Cache: Writes data to the cache and database at the same time. Ensures consistency but adds write latency.
Write-Back Cache: Writes data to cache first, then asynchronously updates the database. Improves write performance but risks data loss.

Distributed Caching

In a distributed cache, the cache itself is a separate system (e.g., Redis Cluster, Memcached, Hazelcast) that runs on multiple servers working together. Data is partitioned or replicated across these nodes, allowing applications to access it quickly, even under heavy load or in distributed architectures like microservices.
Content Delivery Network (CDN) Cache: Stores static assets (CSS, JavaScript, images) on edge servers close to users. Example: Cloudflare, Akamai, AWS CloudFront.
Distributed Key-Value Stores: Stores cached data across multiple servers. Examples: Redis Cluster, Amazon ElastiCache, Apache Ignite.

Web Caching

Web caching involves temporarily saving web data at various points between the user and the origin server — such as in the browser, a proxy server, or a content delivery network (CDN). The cached data is served when the same resource is requested again, bypassing the need to fetch it from the source.
Browser Cache: Stores static assets in a user’s browser. Example: Cache-Control headers in HTTP.
Proxy Cache: Stores HTTP responses between client and server. Example: Varnish Cache, Squid Proxy.
Server-Side Cache: Stores generated HTML to avoid repeated computation. Example: Nginx FastCGI Cache.

Hardware-Level Caching

Hardware caches are specialized memory units (usually SRAM — Static RAM) integrated into hardware components. They store copies of data that the system predicts will be needed soon, reducing the time it takes to fetch that data from slower main memory (DRAM) or storage (e.g., HDDs, SSDs). It’s all about minimizing latency and maximizing throughput at the lowest level of the computing stack.
CPU Cache: Stores frequently accessed instructions and data in the L1, L2, and L3 caches.
Disk Cache: Stores frequently accessed disk data in RAM. Example: Linux Page Cache, Windows SuperFetch (named SysMain in current Windows versions).

Cache strategies

Cache Population Strategies

These determine when and how data gets into the cache.
Cache-Aside (Lazy Loading): The application checks the cache first. On a miss, it fetches data from the source (e.g., database) and manually stores it in the cache. Pros: Simple, app controls what’s cached. Cons: Risk of stale data; cache misses slow down first requests. Use Case: App-level caching with Redis (e.g., “check Redis, if empty, query DB and set”).
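
The "check the cache, on a miss query the source and set" flow can be sketched in a few lines of Python. This is a minimal illustration, not a production client: a plain dict stands in for Redis, and the names CacheAside, load_from_source, and source_reads are made up for the example.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside: the application owns both the lookup and the fill."""

    def __init__(self, load_from_source: Callable[[str], Optional[str]]) -> None:
        self._cache: Dict[str, str] = {}  # stand-in for Redis/Memcached
        self._load = load_from_source     # hypothetical "query the DB" callable
        self.source_reads = 0             # counts trips to the slow source

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:            # 1. check the cache first
            return self._cache[key]
        self.source_reads += 1            # 2. miss: go to the source
        value = self._load(key)
        if value is not None:
            self._cache[key] = value      # 3. store it for the next request
        return value


db = {"user:1": "Ada"}                    # pretend database
users = CacheAside(db.get)
users.get("user:1")                       # miss: reads the database
users.get("user:1")                       # hit: served from the cache
```

Only the first request pays the cost of the database read. Note that missing keys are not cached in this sketch, so repeated lookups for an absent key reach the source every time.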

Write-Through: Data is written to both the cache and the backing store (e.g., database) as part of the same write operation. Pros: Cache stays consistent with the source; no stale data. Cons: Slower writes due to dual updates. Use Case: Hardware caches designed as write-through (some CPU L1 caches; many modern ones are write-back) or systems needing strong consistency.

Write-Back (Write-Behind): Data is written to the cache first, then asynchronously synced to the backing store later. Pros: Faster writes; reduces load on the source. Cons: Risk of data loss if cache fails before sync. Use Case: Disk controllers, distributed caches with eventual consistency.
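
The difference between the two write policies above is easiest to see side by side. The following is a simplified sketch in which dicts stand in for both the cache and the database; the class names and the explicit flush() call are assumptions for illustration (a real write-back cache would flush on a timer or on eviction).

```python
class WriteThroughCache:
    """Every write goes to the backing store before the call returns."""

    def __init__(self, store: dict) -> None:
        self.cache = {}
        self.store = store

    def put(self, key, value) -> None:
        self.store[key] = value   # synchronous write to the source
        self.cache[key] = value


class WriteBackCache:
    """Writes land in the cache; the store is updated later by flush()."""

    def __init__(self, store: dict) -> None:
        self.cache = {}
        self.store = store
        self.dirty = set()        # keys changed since the last flush

    def put(self, key, value) -> None:
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self) -> int:
        """Copy dirty entries to the store; return how many were written."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        written = len(self.dirty)
        self.dirty.clear()
        return written


db_a, db_b = {}, {}
wt = WriteThroughCache(db_a)
wt.put("k", 1)                    # db_a is updated immediately

wb = WriteBackCache(db_b)
wb.put("k", 1)
wb.put("k", 2)                    # db_b is still empty at this point
```

Until flush() runs, the write-back store has not seen either write, which is exactly the window in which a cache failure loses data. In exchange, the two writes to the same key collapse into a single store update.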

Read-Through: The cache itself fetches data from the source on a miss, transparently to the app. Pros: Simplifies app logic; cache handles loading. Cons: Tighter coupling between cache and source. Use Case: Some ORMs or caching libraries (e.g., Hibernate second-level cache).
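
In contrast to cache-aside, a read-through cache is configured with a loader and calls it on its own. A rough sketch, assuming a hypothetical load_product function in place of a real database query:

```python
from typing import Any, Callable, Dict


class ReadThroughCache:
    """The cache, not the caller, fetches from the source on a miss."""

    def __init__(self, loader: Callable[[Any], Any]) -> None:
        self._loader = loader
        self._data: Dict[Any, Any] = {}
        self.loads = 0                    # how often the loader was invoked

    def get(self, key: Any) -> Any:
        if key not in self._data:
            self.loads += 1
            self._data[key] = self._loader(key)
        return self._data[key]


def load_product(product_id: int) -> dict:
    # Hypothetical stand-in for a database query.
    return {"id": product_id, "name": f"product-{product_id}"}


products = ReadThroughCache(load_product)
first = products.get(7)                   # loader runs
second = products.get(7)                  # served from the cache
```

The calling code never mentions the data source; it only asks the cache. That is the simplification, and also the coupling, described above.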

Pre-Fetching (Proactive Loading): Predictively load data into the cache before it’s requested, based on patterns or locality. Pros: Reduces misses; speeds up access. Cons: Wastes space if predictions are wrong. Use Case: CPU caches (spatial locality), CDNs pre-loading assets.
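
A toy version of pre-fetching based on spatial locality: when block n is requested, the reader also loads the next block or blocks. The PrefetchingReader class and its lookahead parameter are illustrative assumptions, and the cache here is unbounded for brevity.

```python
from typing import Callable, Dict


class PrefetchingReader:
    """Loads the requested block plus `lookahead` following blocks."""

    def __init__(self, read_block: Callable[[int], str], lookahead: int = 1) -> None:
        self._read_block = read_block
        self._lookahead = lookahead
        self._cache: Dict[int, str] = {}
        self.demand_misses = 0            # misses seen by the caller

    def get(self, index: int) -> str:
        if index not in self._cache:
            self.demand_misses += 1
            self._cache[index] = self._read_block(index)
        # Speculatively load the neighbours that are likely to be read next.
        for nxt in range(index + 1, index + 1 + self._lookahead):
            if nxt not in self._cache:
                self._cache[nxt] = self._read_block(nxt)
        return self._cache[index]


reader = PrefetchingReader(lambda i: f"block-{i}")
data = [reader.get(i) for i in range(4)]  # sequential scan
```

For a sequential scan only the very first request misses. If access were random instead, most of the pre-fetched blocks would be wasted space, which is the downside noted above.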

Cache Eviction Strategies

These decide what data to remove when the cache is full.

Least Recently Used (LRU): Evict the least recently accessed item. Pros: Good for workloads with temporal locality. Cons: Overhead to track usage order. Use Case: Redis, CPU caches. When to Use: When you have workloads with temporal locality, meaning recently accessed items are likely to be used again in the near future.
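
A compact way to sketch LRU in Python is an OrderedDict, where the position in the dict doubles as the recency order. This is one common approach, not how Redis or CPU caches implement it (both use approximations); the LRUCache class below is purely illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()  # oldest first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)              # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)       # drop the least recently used

    def keys(self) -> list:
        return list(self._items)


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes the most recently used entry
cache.put("c", 3)     # cache is full: "b" is evicted, not "a"
```

Both get and put run in constant time here; the "overhead to track usage order" is the reordering done on every access.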

Most Recently Used (MRU): Evict the most recently accessed item. Pros: Useful when recent data is less likely to be reused. Cons: Rare in practice. Use Case: Sequential scans or loops over a data set larger than the cache, where the item just read is the one that will not be needed again for the longest time. When to Use: When recent data is less likely to be reused soon after access, making eviction of recent data more beneficial.

Least Frequently Used (LFU): Evict the least frequently accessed item. Pros: Prioritizes heavily used data. Cons: Needs frequency tracking; slow to adapt to changing patterns. Use Case: Web caches with stable access patterns. When to Use: When you want to prioritize data that is frequently accessed over time, and the access pattern is relatively stable.
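
For comparison, here is a deliberately simple LFU sketch. It keeps one access counter per key and scans for the minimum on eviction, which is O(n); production implementations use more elaborate structures and usually age the counters. The LFUCache class and its tie-breaking rule (oldest inserted key loses) are assumptions of this example.

```python
class LFUCache:
    """Evicts the entry with the lowest access count."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._values = {}
        self._counts = {}                 # key -> number of accesses

    def __contains__(self, key) -> bool:
        return key in self._values

    def get(self, key, default=None):
        if key not in self._values:
            return default
        self._counts[key] += 1
        return self._values[key]

    def put(self, key, value) -> None:
        if key not in self._values and len(self._values) >= self.capacity:
            # min() scans in insertion order, so ties evict the oldest key.
            victim = min(self._counts, key=self._counts.get)
            del self._values[victim]
            del self._counts[victim]
        self._values[key] = value
        self._counts[key] = self._counts.get(key, 0) + 1


cache = LFUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")
cache.get("a")        # "a" is now clearly the more popular key
cache.put("c", 3)     # evicts "b", the least frequently used
```

Because counts only ever grow in this sketch, a key that was popular long ago can outlive newer, currently hot keys. That is the "slow to adapt" drawback mentioned above.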

First-In, First-Out (FIFO): Evict the oldest item, regardless of usage. Pros: Simple to implement. Cons: Ignores access patterns. Use Case: Basic queues or hardware buffers. When to Use: When you don’t care about the usage patterns of the items and just need to evict the oldest data in a predictable way.

Time-to-Live (TTL): Data expires after a set time, regardless of usage. Pros: Predictable expiration; puts an upper bound on how stale an entry can be. Cons: May evict useful data too soon, and data can still be stale within the TTL window. Use Case: Web caching (HTTP Cache-Control), Redis with expiration. When to Use: When you need data to expire after a fixed period, regardless of usage, to limit how long stale or outdated data can be served.
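
TTL expiry can be sketched by storing an expiration timestamp next to each value and checking it on read. The TTLCache class below is a hypothetical example; the injectable clock parameter is there only so that the behaviour can be demonstrated without sleeping.

```python
import time
from typing import Callable


class TTLCache:
    """Entries expire `ttl_seconds` after they were written."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._items = {}                  # key -> (value, expires_at)

    def put(self, key, value) -> None:
        self._items[key] = (value, self._clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:   # expired: drop lazily on read
            del self._items[key]
            return default
        return value


now = [0.0]                               # fake clock we can move by hand
cache = TTLCache(ttl_seconds=30, clock=lambda: now[0])
cache.put("token", "abc")

now[0] = 29.0
fresh = cache.get("token")                # still within the TTL

now[0] = 30.0
expired = cache.get("token")              # TTL reached: treated as a miss
```

Expired entries are removed only when they are read, so a real implementation would also need a periodic sweep or a size limit to keep memory bounded.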

Random Eviction: Evict a random item. Pros: Very simple; low overhead. Cons: Unpredictable performance. Use Case: Lightweight systems or as a fallback. When to Use: When simplicity and low overhead are more important than eviction precision, or when eviction can be done with minimal performance concerns.

Cache Consistency Strategies

These handle how cached data stays in sync with the source.

Write-Invalidate: When data is updated in the source, invalidate (remove) the corresponding cache entry. Pros: Ensures fresh data on next read. Cons: Next read causes a miss. Use Case: Distributed caches (e.g., invalidate Redis key on DB update).

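Write-Update: Update the cache immediately when the source changes. Pros: Cache stays fresh; no misses after updates. Cons: Higher overhead to propagate updates. Use Case: Update-based cache coherence in multi-core CPUs (e.g., the Dragon protocol). Note that the widely used MESI protocol is invalidation-based, so it belongs with Write-Invalidate.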
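
Write-invalidate and write-update differ in a single line of application code. The sketch below uses dicts for both the database and the cache; the Repository class and its method names are illustrative assumptions.

```python
class Repository:
    """Shows both ways of keeping a cache in step with the source."""

    def __init__(self) -> None:
        self.db = {}
        self.cache = {}

    def read(self, key):
        if key not in self.cache:             # miss: load from the source
            self.cache[key] = self.db[key]    # raises KeyError if absent
        return self.cache[key]

    def update_invalidate(self, key, value) -> None:
        self.db[key] = value
        self.cache.pop(key, None)             # next read will miss and reload

    def update_refresh(self, key, value) -> None:
        self.db[key] = value
        self.cache[key] = value               # cache is refreshed right away


repo = Repository()
repo.db["price"] = 10
repo.read("price")                            # warms the cache

repo.update_invalidate("price", 12)
cached_after_invalidate = "price" in repo.cache
value_after_invalidate = repo.read("price")   # reloads from the db

repo.update_refresh("price", 15)
cached_after_refresh = repo.cache.get("price")
```

Invalidation costs one extra miss but never writes a value nobody reads; updating avoids the miss at the price of doing work on every write.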

Eventual Consistency: The cache and the underlying data source (like a database) may not be synchronized immediately, but they become consistent over time. Pros: Scales well; tolerates delays. Cons: Temporary staleness. Use Case: Distributed systems like CDNs.

Strong Consistency: Cache and source are always in sync (e.g., via write-through or locking). Pros: No stale data. Cons: Slower; less scalable. Use Case: Critical systems (e.g., financial apps).

Cache Placement Strategies

These determine where the cache lives.

Local Caching: Cache is on the same machine or process as the app. Pros: Fastest access; no network. Cons: Limited size; not shared. Use Case: CPU caches, app-level in-memory stores.

Distributed Caching: Cache spans multiple nodes in a cluster. Pros: Scalable; shared across apps. Cons: Network latency; complex. Use Case: Redis Cluster, Memcached in microservices.

Edge Caching: Cache is near the user (e.g., CDN edge nodes). Pros: Reduces latency for end users. Cons: Harder to invalidate globally. Use Case: Web caching with Cloudflare.

Specialized Strategies

These are tailored to specific domains or needs.

Opportunistic Caching: Caching strategy where data is cached when it is accessed or processed for another reason, even if caching wasn’t the primary intent. It’s a “take advantage of the opportunity” approach — whenever data is fetched, the system stores it in the cache in case it’s needed again soon. Use Case: Mobile apps caching data during syncs.

Adaptive Caching: A dynamic caching strategy in which the system continuously adjusts its caching policies based on real-time data access patterns, workload changes, or resource availability. Unlike static caching strategies, Adaptive Caching uses algorithms and monitoring to optimize cache size, eviction policies, or data retention dynamically. Use Case: Storage systems and databases, e.g., the Adaptive Replacement Cache (ARC) used by ZFS, which balances recency and frequency at run time.

Negative Caching: Caching strategy where the system caches failed or negative responses (e.g., “not found” or “error”) instead of successfully fetched data. The idea is to avoid repeated and unnecessary attempts to fetch the same data when it’s known to be unavailable or erroneous, thereby improving performance by reducing the load on the backend or data source. Use Case: DNS caching, web proxies.
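
A small sketch of negative caching: a sentinel object records "we already looked and found nothing", so repeated lookups for a missing key do not reach the backend again. The NegativeCache class, the sample hostnames, and the lookup callable are assumptions for illustration.

```python
_NOT_FOUND = object()   # sentinel that marks a cached "no such record"


class NegativeCache:
    """Caches successful lookups and failed ones alike."""

    def __init__(self, lookup) -> None:
        self._lookup = lookup             # returns None when nothing is found
        self._cache = {}
        self.backend_lookups = 0

    def get(self, key):
        if key in self._cache:
            cached = self._cache[key]
            return None if cached is _NOT_FOUND else cached
        self.backend_lookups += 1
        result = self._lookup(key)
        # Remember the outcome even when it is "not found".
        self._cache[key] = _NOT_FOUND if result is None else result
        return result


records = {"known.example": "192.0.2.1"}  # pretend DNS zone
resolver = NegativeCache(records.get)
resolver.get("missing.example")           # backend is asked once
resolver.get("missing.example")           # answered from the negative entry
```

In practice negative entries are normally given a short TTL, since a record that was missing a moment ago may be created later; this sketch keeps them forever for brevity.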

Query Result Caching: Caches the results of expensive or frequent database queries to avoid redundant computation and reduce response time. Use Case: SQL databases, web applications with repetitive queries.

Geospatial Caching: Caches geospatial data to speed up location-based queries, especially in systems dealing with proximity searches, maps, or location-based services. Use Case: Ride-sharing apps, geospatial search engines.

Cache Trade-offs

Speed vs. Accuracy

Caching improves data retrieval speed by storing frequently accessed data in memory. However, the cached data may become stale if not properly invalidated, leading to potential inaccuracies. Improved Speed: Cache reduces the need for repeated expensive operations (e.g., database queries, computations), leading to faster responses. Potential Inaccuracy: Data served from the cache may be outdated if not updated frequently, leading to inconsistencies.

Memory Usage vs. Cache Size

Caching data takes up memory, and the more data you store, the higher the memory usage. Deciding how much data to cache requires finding a balance between having enough data in the cache to provide fast responses and not consuming too much memory. Increased Cache Size: More data in the cache means faster access times for a broader range of queries. Higher Memory Overhead: Storing a large volume of data can increase memory consumption, which can impact other processes or even cause out-of-memory errors in constrained environments.

Cache Hit Ratio vs. Eviction Policies

Cache hit ratio (the proportion of requests served from the cache) directly impacts the effectiveness of the cache. To maintain a high hit ratio, you need to ensure that the most relevant data remains cached. Eviction policies such as LRU (Least Recently Used) or LFU (Least Frequently Used) are employed to manage cache size by evicting items when the cache becomes full. High Cache Hit Ratio: More requests are served from the cache, reducing load on the backend system, improving performance, and decreasing latency. Eviction Overhead: Frequent evictions or cache misses can degrade performance, as the system has to re-fetch or recompute data.
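
To make the link between eviction policy and hit ratio concrete, the following sketch replays one access trace against a two-entry cache under LRU and under FIFO. The hit_ratio function and the trace are made up for the example; the trace has a single "hot" key, a, interleaved with one-off keys.

```python
from collections import OrderedDict
from typing import Iterable


def hit_ratio(trace: Iterable[str], capacity: int, policy: str) -> float:
    """Replay `trace` and return hits / requests for 'lru' or 'fifo'."""
    cache: OrderedDict = OrderedDict()    # oldest entry first
    hits = 0
    requests = 0
    for key in trace:
        requests += 1
        if key in cache:
            hits += 1
            if policy == "lru":
                cache.move_to_end(key)    # LRU refreshes recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False) # evict from the front
            cache[key] = True
    return hits / requests if requests else 0.0


trace = list("abacadaea")                 # "a" is requested every other time
lru = hit_ratio(trace, capacity=2, policy="lru")    # 4 hits out of 9
fifo = hit_ratio(trace, capacity=2, policy="fifo")  # 2 hits out of 9
```

LRU keeps the hot key resident because every hit refreshes it, while FIFO pushes it out as soon as it becomes the oldest entry. On a different trace the ranking can change, which is why the policy should be chosen against real access patterns.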

Data Freshness vs. Caching Duration

Caching data for long periods can improve performance by reducing the need to fetch the data repeatedly. However, the longer the data is cached, the greater the chance it becomes stale. Longer Caching Duration: Reduces the need to perform the same operations over and over, improving performance. Stale Data: If the data changes frequently, long caching durations can lead to serving outdated information.

Write Performance vs. Read Performance

While caches can drastically improve read performance, writes to the cache can become more complex, especially in systems that involve write-through or write-behind strategies. Write-Through: Every time data is written to the cache, it is also written to the underlying storage. This can slow down write operations but ensures the cache and storage are synchronized. Write-Behind: Writes to the cache are asynchronous, improving write performance, but there is a risk of data inconsistency if the write to storage fails.

Cache Invalidation Complexity

Invalidating the cache when data changes is critical to maintaining data consistency. However, designing an effective invalidation strategy can be complex, particularly in distributed systems or scenarios with highly dynamic data. Effective Invalidation: Ensures that data is always up-to-date, but invalidation logic can be complicated to implement and could introduce additional overhead. Invalidation Overhead: Invalidation mechanisms such as time-to-live (TTL) or manual invalidation can create performance bottlenecks if not optimized correctly.

Simplicity vs. Sophistication of Caching Strategies

Simple caching strategies (like FIFO or Random Eviction) are easy to implement but may not be as effective as more sophisticated ones (like LRU or LFU) in specific use cases. Simple Strategies: Easier to implement and maintain with low computational overhead, but less efficient in handling complex data access patterns. Sophisticated Strategies: Provide better performance for specific workloads (e.g., LFU for stable access patterns), but come with higher implementation complexity and overhead.

Read Latency vs. Cache Update Latency

Caching strategies often aim to reduce latency, but the process of updating the cache itself may introduce delays. Low Latency Reads: Data retrieval from the cache is fast, improving application responsiveness. Cache Update Latency: Depending on whether updates to the cache are synchronous or asynchronous, cache updates may introduce delay, especially in write-heavy applications.

Distributed Caching vs. Local Caching

Distributed caches provide shared cache access across multiple servers, improving scalability and consistency across nodes, but they introduce network overhead and potential latency. Local caches are faster but limited to the node’s scope. Distributed Caching: Helps with horizontal scaling and consistency but may introduce network latency. Local Caching: Faster due to no network overhead, but the cache is not shared across servers, limiting scalability.

When to use a cache?

High Latency or Slow Data Access

Fetching data from the source (e.g., database, API, disk) takes too long. A cache reduces response time by serving data from faster memory. When: latency is a bottleneck, and users notice delays.

Frequent Access to the Same Data

The same data is requested repeatedly (temporal locality). A cache avoids redundant computation or retrieval. When: data has high read frequency and low change frequency.

Expensive Computations or Queries

Generating data requires significant CPU, memory, or I/O resources. A cache stores the result once and reuses it instead of recalculating. When: processing cost outweighs caching overhead.
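
In Python, the standard library already covers this case: functools.lru_cache memoizes a function's results per argument tuple. The monthly_report function below is a hypothetical stand-in for an expensive computation, and the calls counter exists only to show how often the body actually runs.

```python
from functools import lru_cache

calls = 0   # counts how often the expensive body really executes


@lru_cache(maxsize=256)
def monthly_report(year: int, month: int) -> str:
    """Hypothetical expensive computation, cached per (year, month)."""
    global calls
    calls += 1
    total = sum(i * i for i in range(100_000))   # stand-in for heavy work
    return f"{year}-{month:02d}:{total}"


first = monthly_report(2024, 5)    # computed
second = monthly_report(2024, 5)   # returned from the cache
info = monthly_report.cache_info() # hits, misses, maxsize, currsize
```

The second call returns without running the function body. cache_info() exposes the hit and miss counters, and cache_clear() can be used to invalidate everything when the underlying data changes.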

Read-Heavy Workloads

Your system has far more reads than writes. A cache offloads the source system and speeds up reads. When: read-to-write ratio is high (e.g., 10:1 or more).

Scalability Under Load

Traffic spikes overwhelm your backend (database, server, API). A cache reduces load on the source, allowing it to handle more requests. When: backend resources are maxed out or costly to scale.

Network Bandwidth Constraints

Transferring data over the network is slow or expensive. A cache stores data closer to the user or app to cut bandwidth use. When: network latency or costs impact performance.

Temporary Data Availability

Data is transient but reused within a short window. A cache keeps it in memory instead of regenerating or refetching it. When: data has a defined lifespan and frequent access.

Fault Tolerance or Offline Support

Backend systems might fail or be unavailable. A cache provides a fallback to keep the system running. When: reliability or uptime is critical.
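
One way to use a cache as a fallback is to remember the last good response and serve it when the backend call fails. The sketch below assumes failures surface as ConnectionError; the StaleOnErrorCache class, the fetch_price function, and the sample data are all hypothetical.

```python
class StaleOnErrorCache:
    """Always tries the backend first; serves the last good value on failure."""

    def __init__(self, fetch) -> None:
        self._fetch = fetch
        self._last_good = {}

    def get(self, key):
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._last_good:
                return self._last_good[key]   # stale, but better than an error
            raise                             # nothing cached: propagate
        self._last_good[key] = value
        return value


backend_up = True


def fetch_price(symbol: str) -> float:
    # Hypothetical backend call that fails while the backend is down.
    if not backend_up:
        raise ConnectionError("backend unavailable")
    return {"ACME": 101.5}[symbol]


prices = StaleOnErrorCache(fetch_price)
live = prices.get("ACME")       # fetched from the backend and remembered
backend_up = False
fallback = prices.get("ACME")   # backend is down: last good value is served
```

The fallback value may be arbitrarily old, so callers that care should be told the data is stale, for example by returning a timestamp alongside the value.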

Cache technologies

In-Memory Caching Technologies

Redis: Fast in-memory key-value store, often used for real-time data processing and caching. Supports persistence.
Memcached: Simple and lightweight key-value store designed for high-performance caching.
Hazelcast: Distributed in-memory data grid, often used for application-level caching.
Apache Ignite: In-memory data fabric for caching and data processing.

Database Caching Technologies

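MySQL Query Cache: Caches query results to reduce database load. Deprecated in MySQL 5.7.20 and removed in MySQL 8.0.
PostgreSQL: Caches table and index pages in its built-in shared buffers. Pgpool-II, a separate middleware, adds an optional in-memory query cache. PgBouncer is a connection pooler rather than a cache.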
Oracle Database Cache: Offers in-memory database caching for accelerating read queries.

Content Delivery Network (CDN) Caching

Cloudflare: CDN with caching capabilities for faster content delivery.
AWS CloudFront: Caches content across global edge locations.
Akamai: Provides caching for large-scale content delivery.
Fastly: Edge cloud platform for real-time caching and acceleration.

Application-Level Caching

Spring Cache: Abstraction in Spring Boot for caching method results.
ASP.NET Cache: Built-in caching mechanism for .NET applications.
Guava Cache: In-memory caching for Java applications.
Caffeine Cache: High-performance Java caching library with time-based and size-based eviction policies.

Distributed Caching Technologies

Amazon ElastiCache: Managed Redis or Memcached service.
Azure Cache for Redis: Managed Redis cache for Azure workloads.
Google Cloud Memorystore: Fully managed Redis and Memcached.
NCache: .NET-based distributed caching platform.

Browser and Client-Side Caching

Browser Cache: Caches web pages, images, and other assets locally.
Service Workers: Client-side caching for offline functionality.
IndexedDB: Database storage in browsers for caching data-intensive apps.

Hybrid and Specialized Caching

Apache Cassandra: NoSQL database with built-in caching mechanisms.
Aerospike: High-performance, real-time database with built-in caching.
Varnish: HTTP accelerator primarily used for caching web content.
Squid: Proxy cache for web traffic optimization.

Conclusion

Caching is an indispensable part of modern computing. By storing frequently accessed data strategically, whether at the application level, in databases, or in hardware, it boosts system efficiency and user satisfaction. From in-memory solutions like Redis and Memcached to distributed systems and CDNs, caching offers versatile techniques to reduce latency, ease network bandwidth demands, and keep systems available during traffic spikes. Effective caching hinges on balancing population, eviction, and consistency strategies to optimize both speed and accuracy. Understanding and applying the right caching approach is often what separates slow, resource-heavy systems from fast, reliable ones, which is why caching remains a cornerstone of scalable, fault-tolerant architectures.