Tackling Efficiency & Scalability Head-On: Offloading OLTP Workloads to OLAP Systems

Giridhar Sundeep
Capillary Technologies
Aug 22, 2024

--

Introduction:

Capillary Technologies Loyalty+ is a sophisticated loyalty program solution designed to elevate customer engagement for businesses. Loyalty+ helps organizations develop, manage, and refine their loyalty strategies, fostering deeper connections and building enduring customer relationships.

Our loyalty systems manage essential functions such as points accumulation, validity, expiry, customer tiers, communications, and integrations. These elements interact with OLTP systems through various services, bulk data processing, and data imports. During peak times for bulk activities, particularly at month-end, when customer points expirations, tier modifications, and data updates surge, our OLTP systems have encountered resource constraints due to the contention between real-time processing and the increased demands of bulk data processing and data imports.

In the following sections, we will explore in detail the challenges faced by our OLTP systems due to this heightened workload, using one of our bulk data processing jobs, specifically the “points expiry” job, as a case study. This job is responsible for processing and managing the expiration of customer loyalty points.

Problem Statement: Challenges Faced by Our OLTP Systems:

In managing our loyalty systems, we encountered significant challenges with bulk jobs, and these challenges fell into two primary categories: resource constraints and operational problems.

1. Resource Constraints: CPU and I/O

CPU Resource Constraints: When a bulk data processing task such as points expiry runs and processes hundreds of millions of records, it puts the real-time transactional flow under considerable strain, since both work against the same OLTP databases.

For example, there were instances where the system had to handle real-time transactions, such as updating customer points after purchase, while simultaneously processing bulk operations like expiring loyalty points for millions of customers. On typical days, the system managed a reasonable number of such updates, but at month-end, when expiry dates coincided, the workload increased exponentially.

With a customer base upwards of a billion and a significant percentage of records requiring lookup and update, the expiry job had to touch hundreds of millions of records. To update these records, the system first identified which points were expiring, which involved a comprehensive database scan, then updated the relevant records while also inserting records for bookkeeping. The resulting surge in both READ and WRITE operations drove CPU usage up sharply, occasionally slowing down the system and affecting real-time processing as it struggled to manage the much larger computational load.

Impact:

CPU Bottlenecks, Resource Contention, Increased Query Latency

I/O Resource Constraints: Bulk processing scans and reads a large number of records and then inserts/updates many records, creating a spike in I/O. The system experienced bottlenecks when disk access speeds couldn’t meet the data processing requirements. This led to data retrieval and storage delays, further straining system performance.

Continuing with the earlier example, with a customer base upwards of a billion and hundreds of millions of qualifying records, the database conducted extensive disk reads to identify the records needing updates. For each qualifying record, the system performed disk reads and writes to apply the updates. These operations created I/O bottlenecks, which increased query latency.

Impact:

Slow Data Processing, Impact on Other Operations

2. Operational Problems: Deadlocks and Lock Wait Timeouts

Our system encountered significant locking issues when bulk data processing tasks intersected with real-time transactions. This was evident when two transactions occurred simultaneously, one updating customer points in real-time when a customer made a purchase and the other handling bulk data processing to expire points at the end of the month. While the exact overlap of these transactions was rare, conflicts occurred more frequently across different customers over time. The potential for such conflicts and their consequences were substantial due to MySQL’s locking mechanisms.

Let’s take a scenario that we have experienced:

Transaction 1 (Bulk Data Processing): A scheduled bulk data processing job ran to expire points for all customers whose points were due to expire that day. This job selected data for update and locked rows in the customer_points table, including the row for the customer involved in Transaction 2.

Transaction 2 (Real-Time Update): Simultaneously, a customer made a purchase, prompting the system to update their loyalty points. This transaction locked the row in the customer_points table for the specific customer to ensure no other transaction could alter the data until the update was completed.

Deadlock Scenario:

  • Lock Acquisition: Transaction 1 started a “select for update”, locking the row for the customer whose points were expiring, but had not yet completed its task.
  • Conflict Occurs: Transaction 2 began and tried to update the same row that Transaction 1 had locked. Since Transaction 1 held the lock, Transaction 2 had to wait.
  • Circular Wait Condition: Meanwhile, Transaction 1 might have needed to update other rows that Transaction 2 had locked during its operation. This created a circular wait condition where each transaction awaited the other to release the locks.
  • Deadlock: The result was a deadlock where both transactions were stuck, waiting indefinitely for the other to release the necessary locks. Neither could proceed because both were holding locks needed by the other.

In addition to deadlocks, lock wait timeouts occurred in this scenario:

  • If Transaction 2 waited for a row that Transaction 1 held, and Transaction 1 was also waiting for a row that Transaction 2 held, each transaction waited for the other to release its lock.
  • MySQL had a timeout setting for how long a transaction could wait for a lock. If this threshold was exceeded, a lock wait timeout occurred. For example, if Transaction 1 waited for Transaction 2 to release its lock beyond the configured timeout period, MySQL aborted Transaction 1 with a lock wait timeout error.
  • Lock wait timeouts led to transaction failures, rollback operations, and retries. This disrupted the real-time updates and bulk data processing, causing delays and increased latency.
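
When a lock wait timeout does abort a transaction, the usual remedy is to retry it with backoff. The sketch below is an assumption-laden illustration, not our production code: `LockWaitTimeout` stands in for MySQL error 1205, and `update_points` is a hypothetical transaction that fails a configurable number of times before succeeding.

```python
import time

class LockWaitTimeout(Exception):
    """Stand-in for MySQL error 1205 (Lock wait timeout exceeded)."""

def update_points(attempt_log, fail_times):
    # Hypothetical row update that times out on the first `fail_times` tries.
    attempt_log.append(len(attempt_log) + 1)
    if len(attempt_log) <= fail_times:
        raise LockWaitTimeout("Lock wait timeout exceeded; try restarting transaction")
    return "committed"

def with_retries(fn, retries=3, base_delay=0.01):
    # Retry a transaction on lock wait timeout, backing off exponentially.
    for attempt in range(retries + 1):
        try:
            return fn()
        except LockWaitTimeout:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # wait before the next attempt

log = []
status = with_retries(lambda: update_points(log, fail_times=2))
print(status, log)  # committed [1, 2, 3]
```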

Impact:

Increased Lock Contention, Transaction Delays

Proposed Solutions:

To address the challenges we faced with our OLTP systems, we considered two potential solutions:

  1. Scaling the OLTP System:
    One approach was to scale the OLTP system by upgrading its hardware resources, adding more CPUs, increasing RAM, and enhancing network I/O capabilities. While this would boost the system’s capacity to handle increased loads, it would also result in an over-provisioned infrastructure. Essentially, the system would be equipped with more resources than necessary during non-peak times, leading to higher operational costs.
  2. Offloading Read Operations to an OLAP System:
    The alternative solution was offloading the OLTP system’s read operations to an OLAP system. OLAP systems are designed to handle complex queries and large datasets efficiently, making them well-suited for read-intensive tasks. By offloading these operations, we could significantly reduce the load on the OLTP system, allowing it to focus on write-intensive transactions without the contention and resource bottlenecks that had previously hampered its performance.

After careful consideration, we chose to offload the OLTP system’s read operations to the OLAP system. This approach not only resolved the immediate performance issues but also provided a more scalable and cost-effective solution for the future, ensuring that our infrastructure could grow alongside our expanding customer base without the inefficiencies associated with over-provisioning.

How offloading the OLTP system solves the problem:

Offloading the OLTP system solves the problem by strategically redistributing the workload, which addresses several key challenges:

1. Reduces System Load and Resource Constraints

  • The OLTP system was experiencing heavy load, particularly during peak times, leading to CPU and I/O resource constraints.
  • Our analysis of the points expiry bulk data processing jobs showed that approximately 75% of these queries were read operations, while 25% were write operations. Given the predominantly read-intensive nature of these tasks, we decided to offload the read operations to an OLAP system, aiming to alleviate the load on our OLTP system significantly.
  • By offloading read-heavy operations to the OLAP system, the OLTP system is relieved from handling complex queries and data analysis tasks. This allows the OLTP system to focus on its primary function — processing real-time transactions efficiently — without being bogged down by additional workloads.

2. Minimizes Contention and Locking Issues

  • Concurrent real-time processing and bulk data operations on the same dataset were causing lock waits, deadlocks, and query latencies.
  • Offloading tasks such as complex read queries for updates to OLAP systems reduces the frequency and duration of locks on the OLTP system. This minimizes contention between transactions, thereby reducing the risk of deadlocks and lock wait timeouts, ensuring smoother real-time processing.

3. Improves Query Performance and Reduces Latency

  • I/O bottlenecks were causing significant delays in data retrieval and storage, increasing query latency on the OLTP system.
  • OLAP systems, optimized for large-scale data retrieval and analysis, handle these operations more efficiently. By offloading these tasks, the OLTP system experiences fewer delays, resulting in faster query performance and lower latency for real-time operations.

4. Enhances Scalability and Cost Efficiency

  • Scaling the OLTP system alone could lead to underutilization of resources and escalating costs as customer data volumes grow.
  • Offloading to an OLAP system is a more scalable solution. OLAP systems can handle growing data volumes without requiring constant hardware upgrades, thereby reducing the need for frequent scaling of the OLTP system. This approach is more cost-effective in the long term, as it avoids the inefficiencies and high costs associated with continuous hardware expansion.

5. Improves OLTP Performance

  • The combined load of real-time transactions and bulk data processing was straining the OLTP system, leading to degraded performance.

By redistributing the load, the OLTP system can operate more efficiently, maintaining high performance even during peak times. This not only solves the immediate resource and performance issues but also ensures the system remains robust and responsive as the customer base and data volumes continue to grow.

How we switched to this mode:

To fully understand the improvements we’ve made to our points expiry bulk data processing job, it’s important to review the previous architecture before diving into the changes we introduced in the processing pipelines.

The Points Expiry System is methodically designed and implemented in distinct phases to ensure efficient and accurate processing. Here’s a detailed overview:

1. Daily Cron Scheduler:

- The process is automated through a daily cron job, which schedules and initiates the points expiry operation.
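
A daily schedule like this is typically expressed as a single crontab entry. The time, script path, and log location below are assumptions purely for illustration:

```shell
# Hypothetical crontab entry: run the points expiry job once a day at 02:00.
# The schedule, script path, and log path are illustrative assumptions.
0 2 * * * /opt/loyalty/bin/run_points_expiry.sh >> /var/log/points_expiry.log 2>&1
```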

2. Data Extraction Layer:

- This phase collects the data needed from the OLTP system as of the current date. The system scans the loyalty points database, where each customer’s points are recorded along with their expiration dates, compares those expiration dates with the current date, and compiles a list of customers whose points are set to expire today. This ensures that the system has exactly the data it needs for precise processing.
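
The extraction step boils down to a date comparison over the points ledger. A minimal sketch, with a toy in-memory table standing in for the loyalty points database (the field names are assumptions):

```python
from datetime import date

# Toy records standing in for rows in a hypothetical loyalty points table.
points_ledger = [
    {"customer_id": 1, "points": 120, "expiry_date": date(2024, 8, 22)},
    {"customer_id": 2, "points": 40,  "expiry_date": date(2024, 9, 30)},
    {"customer_id": 3, "points": 75,  "expiry_date": date(2024, 8, 22)},
]

def customers_expiring_on(ledger, today):
    # Keep only customers whose points expire exactly on `today`,
    # mirroring the extraction layer's comparison against the current date.
    return [row["customer_id"] for row in ledger if row["expiry_date"] == today]

expiring = customers_expiring_on(points_ledger, date(2024, 8, 22))
print(expiring)  # [1, 3]
```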

3. Data Pre-Processing Layer:

- Extracted data is organized into manageable batches to enhance performance and resource efficiency. This step involves segmenting the data, creating and populating necessary tables, and preparing it for efficient batch processing.
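
Segmenting the extracted data into batches can be as simple as slicing the list of customer ids into fixed-size chunks. A minimal sketch, assuming a batch size chosen for illustration:

```python
def make_batches(items, batch_size):
    # Segment extracted records into fixed-size batches for downstream processing.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

customer_ids = list(range(1, 11))     # 10 extracted customers
batches = make_batches(customer_ids, 4)
print(batches)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```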

4. OLTP Updates:

- The final phase processes each batch and updates the OLTP system. These updates ensure that point expirations are accurately reflected, maintaining real-time accuracy and consistency in the transactional database. This phase is essential for keeping customer points summaries current.

To gain a deeper understanding, the architectural diagrams offer an in-depth view of the entire process:

Architectural Enhancements for Offloading OLTP

To effectively offload our OLTP system, we implemented two significant architectural enhancements to our points expiry system, which have greatly improved its performance and scalability:

Integration with Our In-House OLAP System for Handling Read-Intensive Operations:

We began by integrating our points expiry system with our in-house OLAP system to manage read-intensive tasks, effectively offloading these operations from the OLTP system. The process involved several critical steps:

  • Identifying Read-Intensive Operations: We meticulously analyzed the points expiry system to pinpoint which read operations consumed the most resources. Once identified, we updated the application logic to redirect these queries to the OLAP system.
  • OLAP Integration & Query Redirection: Our in-house OLAP system, powered by Databricks and Apache Sqoop, became the backbone for managing these operations. Apache Sqoop transfers data daily from the OLTP system to the OLAP system, ensuring the latter remains updated with the latest information. This integration allows the OLAP system to efficiently handle complex queries and data transformations that previously bogged down the OLTP system.
  • Data Preparation: The system prepares queries to retrieve expiring data, which is then temporarily stored in views for processing. This phase also assesses whether there are any outstanding delta updates between the OLAP and OLTP systems. To minimize the risk of stale data, we perform delta updates to ensure that the data in the OLAP system remains synchronized with the latest changes in the OLTP system.
  • Batching and Partitioning: Data is assigned batch numbers using a modulo operation, which ensures even distribution across batches. We further optimized query performance by partitioning data based on specific columns, allowing the system to handle large datasets more efficiently.
  • Query Execution and Data Storage: These prepared queries are executed on a Databricks cluster, where the data is processed and updated. The processed batch views are then stored as files in an AWS S3 bucket, leveraging S3’s scalability and cost-effectiveness for efficient data management.
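
The modulo-based batch numbering mentioned above can be sketched as follows. This is an illustration of the idea, not the production job: numeric ids and a batch count of three are assumptions, and the modulo spreads ids evenly across batches.

```python
from collections import Counter

def assign_batch(customer_id, num_batches):
    # Modulo-based batch numbering: spreads ids evenly across batches.
    return customer_id % num_batches

customer_ids = [101, 102, 103, 104, 105, 106]
assignments = {cid: assign_batch(cid, 3) for cid in customer_ids}
print(assignments)  # {101: 2, 102: 0, 103: 1, 104: 2, 105: 0, 106: 1}

# Each of the three batches ends up with the same number of customers.
sizes = Counter(assignments.values())
print(sizes)  # every batch holds 2 ids
```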

Incorporation of AWS S3 for On-Demand Batch File Fetching and Parallel Processing:

We incorporated AWS S3 to facilitate on-demand batch file fetching, which supports parallel processing of data batches.

Continuous Data Flow: Data stored in S3 is fetched on-demand for parallel processing. This setup leverages S3’s scalability and cost-effectiveness, enabling efficient handling of large data volumes. Batch files are streamed from S3 to create batch tasks and then published to a message queue configured with parallel consumers. This approach ensures seamless scaling with increasing data volumes.

Parallel Processing: By distributing batch tasks across multiple consumers, we optimize resource utilization and expedite the data processing pipeline, enhancing the system’s efficiency and capacity to handle varying workloads.
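
The fan-out described above can be sketched with a queue and a pool of worker threads. This is a stand-in for the real pipeline: in-memory lists play the role of batch files streamed from S3, `queue.Queue` plays the role of the message queue, and summing a batch is a placeholder for the actual processing.

```python
import queue
import threading

tasks = queue.Queue()     # stands in for the message queue of batch tasks
processed = []
lock = threading.Lock()

def consumer():
    # One parallel consumer: pull batch tasks until a sentinel arrives.
    while True:
        batch = tasks.get()
        if batch is None:          # sentinel: no more batches
            tasks.task_done()
            return
        with lock:
            processed.append(sum(batch))   # placeholder "processing" of a batch
        tasks.task_done()

workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()

for batch in [[1, 2], [3, 4], [5, 6], [7, 8]]:   # batch files fetched on demand
    tasks.put(batch)
for _ in workers:
    tasks.put(None)                # one sentinel per worker so all of them exit

tasks.join()                       # wait until every batch has been processed
print(sorted(processed))  # [3, 7, 11, 15]
```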

This method improves the performance of read-intensive operations, enhances the overall efficiency and reliability of data processing, and ensures a robust and scalable points expiry system.

Let’s take a look at the changed architecture:

Improved Performance of OLTP:

Reduced OLTP System Load:
By offloading read operations from the OLTP system, we have observed a significant reduction in database load: about a 34% reduction, with peak load dropping from 6.05 to 4.

CPU Utilization:
We saw a notable 25% reduction in CPU usage following the offloading of read operations from the OLTP system. This reduction has freed up valuable CPU resources, allowing for more efficient processing and better overall system performance.

Query Latency:
We have observed an impressive 75% reduction in query latency, particularly for select operations. Peak latency has decreased dramatically from approximately 2 ms to 0.5 ms, ensuring faster data retrieval and more responsive downstream systems that depend on the OLTP databases. This alleviated the slowdowns in dependent business processes and applications.

Network I/O:
Disk read operations have dropped by 75%, significantly lowering the strain on network I/O. This improvement enhances the overall efficiency of data handling and reduces bottlenecks in data access.

Eliminated Lock Waits and Deadlocks:
The new architecture effectively eliminates lock waits and deadlocks, which previously hindered system performance and stability for the OLTP system.

Enhanced Operational Efficiency:

Reduced Select Latency:
By offloading read operations to the OLAP system, we’ve dramatically reduced select latency, ensuring faster and more efficient data retrieval.

Unblocked Dependent Services:
With the reduced latency and operational issues, all dependent services on the OLTP system now operate without interruptions, leading to smoother and more reliable overall system performance.

Improved job processing time:

We’ve observed a remarkable improvement after reviewing our job processing performance over a month. The 95th percentile processing time for 300,000 customers has decreased dramatically, from 389.22 seconds to just 73.78 seconds, a reduction of about 81%.

The average job processing time has improved even further, falling from 144 seconds to just 16 seconds in one of the clusters, an 89% reduction that showcases a major enhancement in overall processing efficiency.

Conclusion & learnings:

Leveraging OLAP systems for queries has resulted in substantial performance gains. These improvements have not only reduced execution and processing times but have also allowed us to manage larger datasets more effectively, and maintain high performance levels even as our data volumes grow.

The observed metrics demonstrate the success of our approach, reflecting significant time savings and efficiency gains. Moving forward, we will identify other areas of the system that can benefit from moving work from OLTP to OLAP, to optimize our system performance further and maintain our commitment to delivering high-quality services.
