Providing Consistent, Reliable Customer Experiences with CQRS at Scale
By: Amit Meshram, Executive Director, Principal Software Engineer, Chase
This article was originally published on VentureBeat.
As more Chase card customers embrace digital services, we’ve seen a surge in transaction-related inquiries. This increase has put pressure on our distributed backend systems to maintain seamless customer experiences, even during system outages at Systems of Record (SORs) or other layers.
To address these challenges, we embarked on a modernization journey. Our goal was to transition from siloed environments to a modern, always-on ecosystem that scales with customer demand. This transformation allows customers to manage accounts, conduct transactions, and access financial services through our online and mobile platforms without friction.
The result? A significant technological advancement in distributed systems and big data, supporting Chase’s journey to unlock and accelerate new value for our business and customers.
What are SORs?
SORs, which were predominantly mainframe-based and eventually evolved to modern technology stacks, were designed to ensure the reliability of command traffic. Data was ingested into data warehouses, the primary destination for most queries. With the emergence of real-time traffic through digital experiences, SORs began exposing their data to queries through APIs. Over time, the volume of query traffic grew significantly, often surpassing command traffic. Nowadays, it’s not uncommon for queries to constitute up to 90% of total SOR traffic.
These strategic shifts have had profound effects on the cost, scalability and reliability of SORs, often contributing to operational issues.
A Paradigm Shift in Software Architecture
In our quest to mitigate operational issues posed by SORs and provide exceptional customer experiences, Chase adopted Command Query Responsibility Segregation (CQRS), a software architectural pattern that separates the responsibilities of handling commands (write operations) and queries (read operations) into distinct parts.
Introduced by Greg Young around 2010, CQRS has gained significant traction in the software development community due to its ability to enhance scalability, performance, and maintainability in complex systems.
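The separation of commands from queries can be illustrated with a minimal sketch. This is not Chase's implementation: the class names (PostTransaction, TransactionPosted, CommandHandler, BalanceReadModel) are hypothetical, and an in-memory list stands in for the message broker and datastore. It only shows that the write side validates commands and emits events, while the read side answers queries from a view built from those events.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PostTransaction:
    """Command: a request to change state (hypothetical name)."""
    account_id: str
    amount_cents: int


@dataclass(frozen=True)
class TransactionPosted:
    """Event emitted by the write side after a command succeeds."""
    account_id: str
    amount_cents: int


class CommandHandler:
    """Write side: validates commands and appends events to a log."""

    def __init__(self) -> None:
        self.event_log: List[TransactionPosted] = []

    def handle(self, command: PostTransaction) -> TransactionPosted:
        if command.amount_cents == 0:
            raise ValueError("amount must be non-zero")
        event = TransactionPosted(command.account_id, command.amount_cents)
        self.event_log.append(event)
        return event


class BalanceReadModel:
    """Read side: a query-optimized view built only from events."""

    def __init__(self) -> None:
        self._balances: Dict[str, int] = {}

    def apply(self, event: TransactionPosted) -> None:
        self._balances[event.account_id] = (
            self._balances.get(event.account_id, 0) + event.amount_cents
        )

    def get_balance(self, account_id: str) -> int:
        # Queries never touch the write side.
        return self._balances.get(account_id, 0)


writes = CommandHandler()
reads = BalanceReadModel()
for cmd in (PostTransaction("acct-1", 2500), PostTransaction("acct-1", -700)):
    reads.apply(writes.handle(cmd))

print(reads.get_balance("acct-1"))  # 1800
```

In a production system the event would travel through a broker, so the read model would lag the write side slightly rather than being updated in the same call.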
At Chase, we built and implemented standards that let us achieve business continuity three to five times faster while improving customer experience, speed to market for new experiences, and reliability. These standards include:
- Ingestion of 20+ billion records daily
- One-second, end-to-end, real-time data consistency across all channels
- 50–100 millisecond API response times
- Support for 25K+ transactions per second (TPS) of consumption at scale
- 99.99% availability
- Consistent engineering across all products without consumer customization
- Ensuring resiliency across infrastructure and diverse global regulations
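To put two of these figures in perspective, here is a rough back-of-the-envelope calculation. It assumes exactly 20 billion records spread evenly over a day and a 365-day year; real traffic is bursty, and the stated figure is a lower bound ("20+"), so these are illustrative averages rather than measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60

# Average ingestion rate implied by 20 billion records per day.
records_per_day = 20_000_000_000
avg_records_per_second = records_per_day / SECONDS_PER_DAY

# Downtime budget implied by 99.99% availability.
availability = 0.9999
downtime_minutes_per_year = (1 - availability) * MINUTES_PER_YEAR

print(f"{avg_records_per_second:,.0f} records/second on average")    # 231,481
print(f"{downtime_minutes_per_year:.1f} minutes of downtime per year")  # 52.6
```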
Achieving Our Objectives
The read layer in the CQRS pattern played a pivotal role in building a common data product by providing a unified and optimized view of data tailored to various user needs. By decoupling the read operations from the write processes, it allowed for the creation of specialized read models that can aggregate and present data from multiple sources, enabling consistent and efficient access to data across different applications and services.
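As a hedged illustration of a specialized read model that aggregates data from multiple sources, the sketch below folds two assumed feeds (a customer-profile feed and a card-transaction feed) into one denormalized view per account. The class name, event fields, and the "keep the newest three" rule are all hypothetical choices made for the example.

```python
from typing import Any, Dict, Optional


class AccountActivityView:
    """Denormalized read model keyed by account; all names are hypothetical."""

    def __init__(self) -> None:
        self._views: Dict[str, Dict[str, Any]] = {}

    def _view(self, account_id: str) -> Dict[str, Any]:
        return self._views.setdefault(
            account_id, {"holder_name": None, "recent_transactions": []}
        )

    def on_profile_event(self, event: Dict[str, Any]) -> None:
        # Source 1: an assumed customer-profile feed.
        self._view(event["account_id"])["holder_name"] = event["holder_name"]

    def on_transaction_event(self, event: Dict[str, Any]) -> None:
        # Source 2: an assumed card-transaction feed; keep the newest 3 only.
        recent = self._view(event["account_id"])["recent_transactions"]
        recent.insert(
            0, {"merchant": event["merchant"], "amount_cents": event["amount_cents"]}
        )
        del recent[3:]

    def query(self, account_id: str) -> Optional[Dict[str, Any]]:
        # One key lookup serves the whole screen; no joins at read time.
        return self._views.get(account_id)


view = AccountActivityView()
view.on_profile_event({"account_id": "acct-1", "holder_name": "A. Customer"})
for merchant, amount in [("Grocer", 4200), ("Cafe", 550), ("Fuel", 3800), ("Books", 1999)]:
    view.on_transaction_event(
        {"account_id": "acct-1", "merchant": merchant, "amount_cents": amount}
    )

result = view.query("acct-1")
print(result["holder_name"], [t["merchant"] for t in result["recent_transactions"]])
# A. Customer ['Books', 'Fuel', 'Cafe']
```

Because the view is shaped for the query ahead of time, every consuming channel reads the same precomputed answer instead of assembling it from the underlying sources.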
Embracing the challenge of adopting CQRS was immensely gratifying, and we successfully accomplished the following objectives on our journey:
- Read Optimization: The read side is specifically optimized to handle high query loads efficiently, ensuring fast data retrieval and giving users quick, reliable, low-latency access to customer data.
- Event Sourcing: Over time, we developed an asynchronous, real-time, event-driven architecture using modern distributed message brokers to decouple components and improve responsiveness. We also established a reliable audit trail capability for data sourcing.
- Low Latency Datastore: A critical factor in achieving the desired performance was designing and modeling our datastore for low-latency reads and writes. The optimal choice was to adopt a NoSQL database for handling massive volumes of data and I/O operations.
- Data Integrity, Consistency & Assurance: Data reliability is demonstrated through statistical analysis and factual evidence, which are essential for earning the confidence and trust of consumers. Throughout our implementation process, we established various patterns to ensure reliability. To ensure eventual consistency, we implemented a data reconciliation framework leveraging the bookkeeping pattern. This framework includes window event reconciliations and a rehydration process for addressing event count mismatches. Additionally, we incorporated a store-and-forward mechanism to handle temporary glitches in the system and facilitate event replay. Furthermore, we implemented robust data quality measures to identify mismatches at a granular data element level, enabling proactive detection of potential customer issues and impacts.
- Scalability: By separating the command and query operations, we enabled scaling of each part independently. The query side is scaled horizontally to handle high read loads, while the command side focuses on maintaining data integrity and processing commands efficiently.
- Separation of Concerns: The segregation allows developers to optimize each part independently, tailoring them to specific requirements.
- Performance Optimization: Since commands and queries have distinct responsibilities, we worked on opportunities to optimize data structures, storage mechanisms, and caching strategies tailored to meet our customers’ needs. These optimizations lead to improved performance, reduced latency and faster read and write operations.
- Complex Domain Models: We designed targeted domain models for consistency across consumer channels, leading to better maintainability and adaptability to changing business needs.
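The article does not detail how the reconciliation framework works, so the following is only one plausible reading of window-based reconciliation with rehydration on event count mismatches. It assumes the source side publishes an event count per fixed time window, the read side keeps its own count per window, and any window whose counts differ is reloaded. The 60-second window, the function names, and the rehydrate callback are hypothetical.

```python
from typing import Callable, Dict, List

WINDOW_SECONDS = 60  # assumed window size


def window_of(epoch_seconds: int) -> int:
    """Return the start of the window that contains the timestamp."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)


def count_by_window(event_timestamps: List[int]) -> Dict[int, int]:
    """Bookkeeping: how many events fall into each window."""
    counts: Dict[int, int] = {}
    for ts in event_timestamps:
        window = window_of(ts)
        counts[window] = counts.get(window, 0) + 1
    return counts


def find_mismatched_windows(
    published: Dict[int, int], consumed: Dict[int, int]
) -> List[int]:
    """Windows where the source and the read side disagree on event count."""
    windows = set(published) | set(consumed)
    return sorted(w for w in windows if published.get(w, 0) != consumed.get(w, 0))


def reconcile(
    published: Dict[int, int],
    consumed: Dict[int, int],
    rehydrate: Callable[[int], int],
) -> List[int]:
    """Rehydrate each mismatched window; return windows still out of balance."""
    unresolved = []
    for window in find_mismatched_windows(published, consumed):
        consumed[window] = rehydrate(window)  # count after reloading the window
        if consumed[window] != published.get(window, 0):
            unresolved.append(window)
    return unresolved


published = count_by_window([0, 10, 59, 60, 61, 125])
consumed = count_by_window([0, 10, 59, 60, 125])  # the event at t=61 was lost

mismatched = find_mismatched_windows(published, consumed)
print(mismatched)  # [60]

# Stand-in for replaying the window from the source of truth.
unresolved = reconcile(published, consumed, rehydrate=lambda w: published[w])
print(unresolved)  # []
```

Matching counts do not prove that the contents match, which is presumably why the article also describes data quality checks at the data element level.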
Evaluating Complexity
While CQRS offers numerous benefits, it also introduces additional complexity, especially in terms of system design, implementation and operational overhead. We spent significant time carefully evaluating trade-offs, considering key factors such as eventual consistency, asynchronous communication and alignment with the principles of domain-driven design (DDD).
- Domain-Driven Design: Applying DDD principles helps create more expressive, maintainable and scalable software systems, but designing an accurate domain model that reflects the real-world problem domain is challenging given channel consistency requirements. Chase’s adoption of product architecture played a crucial role in establishing domain boundaries across different products.
- Asynchronous Communication: Though distributed message brokers improve efficiency and responsiveness, mainframes still rely heavily on batch processes, which introduce event burst and sequencing challenges. There is a constant effort at Chase to gradually migrate batch-oriented processes to modernized, real-time event streaming.
- Eventual Consistency: Due to the separation of concerns in CQRS, achieving strong consistency across the entire system becomes challenging. Instead, we have embraced eventual consistency, in which data is guaranteed to converge over time but may be temporarily inconsistent during system operations.
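One common way to cope with the sequencing and burst challenges described above is to make the read side idempotent and order-aware. The sketch below assumes, purely for illustration, that each event carries the full new state plus a per-account sequence number assigned by the source; neither assumption comes from the article. Stale and duplicate events are then ignored, so the view converges to the latest state regardless of arrival order.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BalanceUpdated:
    """State-carrying event with a per-account sequence (hypothetical)."""
    account_id: str
    sequence: int
    balance_cents: int


class SequencedReadModel:
    """Keeps only the highest-sequence event seen for each account."""

    def __init__(self) -> None:
        self._latest: Dict[str, BalanceUpdated] = {}

    def apply(self, event: BalanceUpdated) -> bool:
        """Apply the event; return False if it was stale or a duplicate."""
        current = self._latest.get(event.account_id)
        if current is not None and event.sequence <= current.sequence:
            return False
        self._latest[event.account_id] = event
        return True

    def balance(self, account_id: str) -> Optional[int]:
        event = self._latest.get(account_id)
        return None if event is None else event.balance_cents


model = SequencedReadModel()
burst = [
    BalanceUpdated("acct-1", 2, 900),   # arrives first, but is the second update
    BalanceUpdated("acct-1", 1, 1000),  # late arrival of an older update
    BalanceUpdated("acct-1", 3, 750),
    BalanceUpdated("acct-1", 3, 750),   # duplicate delivery
]
applied = [model.apply(event) for event in burst]

print(applied)                  # [True, False, True, False]
print(model.balance("acct-1"))  # 750
```

This last-write-wins approach only suits events that carry complete state. Events that carry deltas would need buffering or replay in sequence order instead.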
Adopting CQRS has allowed us to operate at scale and with autonomy. It requires careful consideration of trade-offs and may not suit every project or team. As with any architectural pattern, understanding the principles and applying them judiciously is key to realizing the benefits of CQRS in software development.
For Informational/Educational Purposes Only: The opinions expressed in this article may differ from other employees and departments of JPMorgan Chase & Co. Opinions and strategies described may not be appropriate for everyone and are not intended as specific advice/recommendation for any individual. You should carefully consider your needs and objectives before making any decisions and consult the appropriate professional(s). Outlooks and past performance are not guarantees of future results.
Any mentions of third-party trademarks, brand names, products and services are for referential purposes only and any mention thereof is not meant to imply any sponsorship, endorsement, or affiliation.