What We Learned from Refactoring a Monolith App

Chewy
Chewy Innovation Blog
10 min read · Jun 23, 2020

By Hemalatha Bahadur, Staff Software Engineer @ Chewy

Services Based Integration

This blog post captures the lessons learned from a recent project involving complex integrations between various in-house and third-party systems as part of a legacy upgrade. Day-to-day changes in how the business runs are prompting enterprises to invest in technology that makes them more agile and more customer-oriented, which in turn drives customer satisfaction and helps retain a competitive edge. Most of the engagements IT sees these days are legacy modernizations and business transformations, for which companies invest heavily in system integrations, cloud, and business analytics. This article covers a few of the lessons learned from one such legacy modernization engagement and focuses primarily on three key areas:

1. Integration Delivery Process

2. Production Readiness

3. Breakdown and Contingency

This article is the outcome of an exercise we carried out to improve the stability and reliability of the integration framework for a large distribution client. The core integration framework consisted of Oracle Fusion Middleware (Oracle Integration Cloud). I will describe the challenges we faced with the monolithic service-oriented architecture (SOA) application and my firsthand experience splitting it into logical SOA services: the challenges, the gains, and the technical overhead.

Introduction

I will share some lessons learned from a legacy modernization engagement focused on integrating multiple heterogeneous boundary systems with a modern Oracle ERP via the Oracle Fusion Middleware stack. The entire integration portfolio was designed on Oracle Integration Cloud with a common goal: establish a scalable platform to connect Oracle ERP SaaS with various other third-party applications using Integration Cloud Service.

During the initial phase of the project, the goal was to promote easy plug and play for any future systems. This worked well while the team and scope were limited, i.e., during the design and development phases, when the number of systems to integrate was limited to three. However, as business volume increased, managing the services became an issue. I will explain the problems we faced with the project structure and portfolio of services and document how we broke the large application into independent (logical) services to promote a stable and reliable infrastructure.

Business Case

The integration scope was expanding in terms of both new processes and change requests to the existing interfaces. For example, a data mapping change in the EDI 810 interface would affect the invoice data for Oracle ERP SaaS.

The exponential day-to-day increase in data volume was affecting daily operations: missed SLAs for data sync jobs between the boundary systems, incomplete job executions, and a great deal of manual monitoring of the services and the system.

Business Goal

The business goal was to define a scalable integration framework for current and upcoming business requirements. The product portfolio and design pattern were finalized to align with this goal.

Deviation and Evolution

To meet deadlines, we often (or most of the time) deviate from the guiding principles and shift the goal to "get this done and ready for UAT" and then go live. This engagement was no different. To meet the deadline, each team completed development of the pending interfaces by drawing on the design and development standards from its own past integration experience, enriching the knowledge base with new design patterns and approaches along the way, with the aim of delivering stable and scalable solutions.

Challenges and Issues with the Above Approach

Stuck threads and performance degradation: By design, all the services participated in a single thread and a single transaction. During peak loads, long-running requests left threads stuck (locked), which in turn degraded the overall performance of the Middleware platform.

In a true sense, this did not promote loose coupling between the end systems. For example, during a maintenance window on the target side, all the service calls would fail. There was no easy way to resubmit the transactions from the Middleware stack, so all the transactions had to be resubmitted from the source system. This affected both the business (a lot of time was spent resubmitting transactions) and IT (the backlog of transactions affected day-to-day operations).

Overhead from network calls: Separate database adapters were used for multiple tables in the same interface, such as Order Header and Order Lines, and each line was published as a separate transaction. The resulting volume of calls to the database degraded system performance and led to data inconsistency.
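
To make the contrast concrete, here is a minimal sketch (plain JDBC rather than the project's database adapters, with hypothetical table and column names) of fetching an order header together with all of its lines in a single joined query and a single transaction, instead of issuing a separate call and transaction per line:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class OrderReader {

    // Reads an order header together with all of its lines in one round trip,
    // inside a single transaction. Table and column names are illustrative only.
    public static List<String[]> readOrder(Connection conn, long orderId) throws SQLException {
        String sql =
            "SELECT h.order_id, h.customer_id, l.line_no, l.sku, l.qty " +
            "FROM order_header h JOIN order_line l ON l.order_id = h.order_id " +
            "WHERE h.order_id = ?";
        List<String[]> rows = new ArrayList<>();
        conn.setAutoCommit(false);                       // one transaction for the whole order
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, orderId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] {
                        rs.getString("order_id"), rs.getString("customer_id"),
                        rs.getString("line_no"), rs.getString("sku"), rs.getString("qty")
                    });
                }
            }
            conn.commit();                               // header and lines succeed or fail together
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
        return rows;
    }
}
```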

Error handling and compensation handling were not considered during coding and development.

A bug in one service affects all the services. In one instance, a locally tested change (updating a logical 'or' condition to a logical 'and') led to the failure of the most critical end-of-day job and an increase in operations and maintenance cost.

Business Objective

Breaking up the application and redesigning it (including auditing, exception handling, and key design changes) would enable the business to:

  • Split the existing portfolio into a logical set of governed services, enabling the business to meet changing requirements more efficiently.
  • Decrease the operations and maintenance cost, open the channel for innovation (for example, the ability to easily build and pilot microservices), and allow dollar value to be invested in innovation rather than operations and maintenance.
  • Have better visibility into business transactions through auditing and logging.

IT Objective

The services should support highly concurrent requests. For example, the system should support processing 100,000+ records during the nightly batch run without affecting the performance of other services.

Other goals of providing a scalable and reliable architecture include:

  • Have better control over the services in terms of change management and unit testing.
  • Easily isolate changes and maintain code better, giving developers the ability to work independently and in parallel on multiple services. Adding a new interface should not affect ongoing change requests, and the team should be able to work on and close multiple incidents or tickets in less time.
  • Provide a platform to innovate easily; each service can now be tweaked and be production ready in isolation.

State-of-the-Art Business Solution

The initial approach, Design Pattern 1 (point-to-point), required both the source and target to be available at the same time for day-to-day business operations. This had both business and technical flaws and left little room for innovation. To overcome the majority of the issues noted with Design Pattern 1, we evolved the design to introduce a level of decoupling between the services connecting the source application and the services connecting the target application.

Introduction of Persistence Layer

The addition of a persistence layer (queues) enabled a degree of decoupling between the services connecting the source and target systems. It removed the need to manually resubmit transactions from the source system, since failed deliveries could now be replayed from the Middleware stack, and it allowed a request to be split into separate threads and transactions, giving us better control over thread management. However, it did introduce a couple of additional challenges, such as preserving the order of execution of transactions (FIFO).
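
The project implemented this pattern with Oracle Integration Cloud adapters, but as a rough, hedged sketch of the underlying idea, the plain JMS code below publishes a source record to a queue and commits in one transacted session, while a separate consumer dequeues and delivers it to the target in its own transaction. The queue name and the deliverToTarget call are hypothetical:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class QueueDecoupling {

    // Source side: publish the record and commit, independently of the target's availability.
    public static void publish(ConnectionFactory factory, String payload) throws JMSException {
        Connection conn = factory.createConnection();
        try {
            Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("SOURCE_TO_TARGET_Q");     // hypothetical queue name
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage(payload));
            session.commit();       // source-side work is done here
        } finally {
            conn.close();
        }
    }

    // Target side: a separate transaction dequeues and delivers; on failure the
    // message stays on the queue and can be redelivered or resubmitted later.
    public static void consumeOne(ConnectionFactory factory) throws JMSException {
        Connection conn = factory.createConnection();
        try {
            conn.start();
            Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(session.createQueue("SOURCE_TO_TARGET_Q"));
            Message msg = consumer.receive(5000);        // wait up to five seconds
            if (msg instanceof TextMessage) {
                deliverToTarget(((TextMessage) msg).getText());   // hypothetical target call
            }
            session.commit();
        } finally {
            conn.close();
        }
    }

    private static void deliverToTarget(String payload) {
        // Placeholder for the call to the target system (for example, an ERP endpoint).
        System.out.println("Delivering: " + payload);
    }
}
```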

In an ideal scenario, the flow should look something like this: as soon as a new, updated, or unprocessed message (or batch of messages) is picked up from the source boundary system by a scheduled job, publish the message to the queue, mark the record with an intermediate "read" status, and continue processing the message in a separate transaction. However, due to business restrictions, an intermediate status was not possible, so the record could only be marked as processed, and it was marked as processed as soon as the message was published to the queue. This worked like a charm until the last stages of testing (UAT/load), where we saw records marked as processed even when they had failed to reach the target system. To overcome this, records were marked as processed only once they were successfully published to the target system. This worked, but it does not justify the use of the queue.

In case of a technical error while publishing data to the target system, the message was not marked as processed and would continue to fail until corrected. Since the process was designed to commit all or nothing for the batch, every subsequent batch would also continue to fail.
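
For comparison only (this is not what the project implemented), a common way to keep one bad record from poisoning every later batch is to commit record by record and divert failures to an error store; here is a minimal sketch, with the target publisher and error sink passed in as placeholders:

```java
import java.util.List;
import java.util.function.Consumer;

public class PerRecordProcessor {

    // Processes each record in its own unit of work; failures go to an error sink
    // (for example, an error table or dead-letter queue) instead of failing the batch.
    public static int process(List<String> batch,
                              Consumer<String> publishToTarget,
                              Consumer<String> errorSink) {
        int succeeded = 0;
        for (String record : batch) {
            try {
                publishToTarget.accept(record);
                succeeded++;
            } catch (RuntimeException e) {
                errorSink.accept(record);    // park the bad record for review and resubmission
            }
        }
        return succeeded;
    }
}
```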

In case of planned maintenance on the target side, the backlog in the queue would grow, and each new message would contain all the unprocessed records, causing duplicate records in the queue. As a precaution, the triggering source service was shut down to avoid duplicate transactions.

Production Readiness and Breakdown

Smoke Testing

The Middleware was ready and live to accept business transactions. Like any other system, it was bound to fail, and the integration did fail for various reasons, classified as follows:

Infrastructure level

The code base that worked perfectly well in Dev and Staging broke as a process in Production: no data was being inserted into the staging tables on the Oracle target side.

Process/Design level

Heavy database calls and costly joins in the process orchestration layer degraded the overall performance of the system. We had tested the majority of the test cases, but the first batch to the warehouse management system (WMS) for shipping failed. The cause was that the same delivery number was sent in multiple batches split across multiple routes, which turned out not to be the correct process: per the actual process, deliveries should be batched by route, not by delivery.

The order in which transactions should flow was not considered while developing the interfaces, as the interfaces were designed to run in isolation and were scheduled for each night. Adding a delay between two interfaces does not guarantee the sequence either.

The process was changed to schedule one parent process, which invokes the child processes in sequence (as a result set).
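
The orchestration was done in the middleware scheduler rather than in hand-written code, but a simple sketch of the idea looks like this: a parent job invokes each child process in a fixed order and stops the chain on the first failure, so later steps can never run out of sequence. The child process names are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class ParentProcess {

    public static void main(String[] args) {
        // Child processes registered in the order they must run; the names are illustrative.
        Map<String, Callable<Boolean>> children = new LinkedHashMap<>();
        children.put("ExtractDeliveries", () -> { /* call the extract interface */ return true; });
        children.put("BuildRouteBatches", () -> { /* call the batching interface */ return true; });
        children.put("PublishToWMS",      () -> { /* call the WMS publish interface */ return true; });

        for (Map.Entry<String, Callable<Boolean>> child : children.entrySet()) {
            try {
                System.out.println("Starting " + child.getKey());
                if (!child.getValue().call()) {
                    System.err.println(child.getKey() + " reported failure; stopping the chain");
                    break;                   // later steps must never run out of order
                }
            } catch (Exception e) {
                System.err.println(child.getKey() + " failed: " + e.getMessage());
                break;
            }
        }
    }
}
```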

There were also date/time issues across servers in different time zones. For a few interfaces, the last run date was captured and computed from the Middleware system. Because this differed from the ERP system's date, duplicate records were pushed from the source of record to the target systems.
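
One common way to avoid this class of bug (a hedged sketch, not necessarily the project's exact fix) is to capture and compare the last-run timestamp in UTC and to take the current time from the source system's clock, so a middleware server in a different time zone cannot shift the delta window and re-pull records:

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

public class LastRunTracker {

    // Builds the delta window for an incremental pull. Both ends are UTC instants,
    // so a middleware server in another time zone cannot widen or narrow the window.
    public static String[] deltaWindow(String lastRunUtc, Instant sourceSystemNowUtc) {
        Instant from = Instant.parse(lastRunUtc);    // stored at the end of the previous run, in UTC
        Instant to = sourceSystemNowUtc;             // taken from the source/ERP clock, not the local clock
        return new String[] {
            DateTimeFormatter.ISO_INSTANT.format(from),
            DateTimeFormatter.ISO_INSTANT.format(to)
        };
    }

    public static void main(String[] args) {
        String[] window = deltaWindow("2020-06-23T02:00:00Z", Instant.parse("2020-06-23T04:00:00Z"));
        System.out.println("Pull records modified between " + window[0] + " and " + window[1]);
    }
}
```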

Processing a huge backlog of records in one service (sales data) was taking up all the resources.

Monitoring and Managing the Change

The lack of a monitoring and auditing framework was a huge blow: once the system was in production and things started to fail, root cause analysis (RCA) became next to impossible.

Because logging was limited, it became difficult to identify what caused a process to fail without re-running the scenario in a lower environment.

For every issue, everyone from both the ERP and Middleware teams had to be involved to isolate the area of failure.

Time Spent on Regression Testing

High-level solution

An iterative approach was designed: work on business-critical processes first, gauge service performance, and proceed based on priority. This involved numerous proofs of concept before the actual green light. Multiple proofs of concept and design meetings covered topics such as:

  • Overall architecture: hardware, operating system, networking, database (AWS), key design patterns, when and how to use AWS, how the smaller services would fit into the overall business flow, exception handling, load balancer
  • Service identification: service naming standards, folder structure, load testing scenarios, deployment and rollback, instance retention and purging

Solution details

The vision to align the technical delivery with the business goals was based on the core SOA and microservices approach.

1. Logical breakup of the services modeled around boundary system/business domains

2. Automated build and deploy

3. Continuous integration and platform for continuous delivery

4. Asynchronous communication: Use of messaging for scalable and reliable integrations

5. Enriched logging and monitoring: Better auditing and tracking, isolate failure (or point of failure)

6. Reduced build and deploy time

7. Easier for developers to manage change requests and defects, increasing developer productivity

8. Ability to execute or abort processes in production based on the boundary systems

9. Better control over the process

10. Isolated database objects packages

11. Decentralization of the core database objects

At a high level, each service was scoped to perform one key function and plug easily into the overall business flow. For instance, a job that queried the source of record and created boundary-system-specific files was split into two or more services: one to query the source of record and another to write the boundary-system-specific files. The services were logically designed to perform just their core function and participate in the overall business flow with the help of messaging and orchestration.
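
As a hedged sketch of that decomposition (the interface names are hypothetical), the single job becomes two narrowly scoped services that meet only through the payload handed between them, which in production would travel over a queue:

```java
import java.util.List;

// Each service owns exactly one responsibility; they meet only through the payload.
interface SourceQueryService {
    List<String> fetchRecords(String boundarySystem);   // query the source of record
}

interface FileWriterService {
    void writeFile(String boundarySystem, List<String> records);   // write the boundary-system file
}

public class SplitJobExample {
    public static void main(String[] args) {
        SourceQueryService query = system -> List.of("row-1", "row-2");   // stub implementation
        FileWriterService writer = (system, rows) ->
                System.out.println("Writing " + rows.size() + " rows for " + system);

        // In production the hand-off between the two services is a queue message;
        // it is a direct call here only to keep the sketch short.
        List<String> rows = query.fetchRecords("WMS");
        writer.writeFile("WMS", rows);
    }
}
```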

Assessment

The business-critical end-of-day (EOD) process was selected as the pilot for the re-architecture.

Business benefits

This approach helped overcome business-critical issues and provided:

1. The ability to efficiently manage and replace changes

2. Ability to independently release and scale the service

3. Better control over the process

  • Asynchronous communication
  • Stability and reliability — A service change for a boundary system does not affect the business functionality of other boundary system services
  • Ability to run in parallel and on demand
  • Streamlined review, test, and deployment process

4. Scalability and performance

  • Add dedicated work managers to high-volume/time-intensive processes
  • Independently fine-tune the services

5. Fault tolerance (service level)

  • The collaborative design approach helped to identify points of failure and exception handling mechanisms (automated or manual)
  • Retries
  • Ability to shut down the service gracefully without affecting other services

However, when a monolith app is broken down into microservices, full-blown testing and a sign-off from the business are required to make sure all systems are live in production and no functionality has been missed. Other considerations include:

  • True identification of services, so that you do not simply end up with multiple huge services instead of one huge application

  • Multiple services require enriched and up-to-date documentation
  • Availability of services
  • Excellent auditing and reporting capability is necessary, since most of the communication in the overall business flow is asynchronous
  • Service versioning and retirement, with backward compatibility ensured
  • A rollback strategy in case of a bad deployment

Conclusion

This article has detailed the challenges experienced in a large-scale integration engagement, touching on the following key aspects of any engagement:

1. Integration Delivery Process

2. Production Readiness

3. Breakdown and Contingency

To overcome most of the issues, a complete re-architecture was done for business-critical services, which helped in:

Efficiently managing and replacing changes

The ability to independently release and scale services

Better control over the process

Stability and reliability: a service change for one boundary system does not affect the business functionality of other boundary systems' services

Scalability and performance

  • Add dedicated work managers to high-volume/time-intensive processes
  • Independently fine-tune the services

Fault tolerance (service level)

  • The collaborative design approach helped to identify points of failure and exception handling mechanisms (automated or manual)
  • Retries

by Hemalatha Bahadur

Staff Software Engineer @ Chewy

If you have any questions about careers at Chewy, please visit https://www.chewy.com/jobs
