Holiday Season Readiness: preparing your On-Premise IBM Sterling OMS powered Omnichannel ecosystem to scale
Around two decades ago, the need for an Order Management System to drive complex E-Commerce Business capabilities was still a topic of debate.
Today, almost all organizations that have implemented or are planning to implement a complex Omnichannel Ecosystem realize the great Business value, agility and Speed-To-Live capabilities that Order Management Systems like IBM Sterling OMS unlock for such use cases.
An Order Management System like Sterling manages multiple Lifecycles including Flow of Order, Flow of Stock Availability (by managing Supplies and Demands), Flow of Shipments (Logistics) as well as Flow of Financials (Payments, Settlements and Invoicing). This Dance-Of-The-Workflows, when completely in sync, makes millions of consumers happy, thereby achieving positive Retailer Ratings or Net Promoter Scores, resulting in a great Top Line for the organizations. When this is further enriched with Sourcing optimizations, it also reduces costs of fulfillment and brings in the element of Sustainability.
With the Peak Season just a couple of months away, the real acid tests arrive, along with a plethora of opportunities, for complex implementations powered by a packaged Order Management System like Sterling. Cyber Week is one of the most popular and most sought-after events in the Americas and Europe. Similarly, Singles' Day (11/11) brings the biggest opportunities as well as challenges for organizations running their ECom Businesses for consumers in China. Every year, various task forces in Business and IT across ECom, Supply Chain, Planning, Procurement, Forecasting, etc. prepare for these days to be able to handle the exponential increase in Order Volumes and to convert those into successful sales. KPIs are defined to measure the success of the IT Systems while they are serving millions of consumer orders during the Peak events.
The goal is twofold and simple:
a) To ensure 100% Systems uptime as high volumes of transactions pour into Sterling OMS, and
b) To ensure the required Non-Functional Requirements are met, so that the SLAs promised to customers on the channels where they purchase their favourite products are achieved.
The success of such events is critical for retailers not just on their own platforms, but also on the marketplace platforms they do business with, as a bad Fulfilment experience impacts not only the retailer's NPS, but also the Marketplace's NPS.
A strategic approach is therefore required, with a focus on scaling the systems to meet consumer order volumes. This article outlines some guidelines based on my experience.
While this is not an exhaustive list of the endless possibilities that can be explored to tune our systems, these are some of the things the teams handling the IT Systems may want to look at.
Let’s consider the Click to Release process as an example. One of the success measures may be that 95–100% of consumer orders successfully land in the chosen Store or Warehouse for fulfillment within 30 minutes of the original checkout. A typical Click to Release process in IBM Sterling would involve a series of Transactions (ORDER_CREATE.0001, SCHEDULE.0001 and RELEASE.0001), each picking up its individual messages/orders for processing from the JMS Queues upon trigger.
The Yellow dots represented below indicate the various areas where there are opportunities to tune or scale. These are also the areas where improper configurations or potential bottlenecks may lead to overall Click to Release slowness.
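To make the success measure above concrete, here is a minimal Python sketch of how the Click to Release compliance percentage could be computed from checkout and release timestamps. The function name, the input shape and the sample data are my own assumptions for illustration; they are not part of Sterling OMS.

```python
from datetime import datetime, timedelta

# Assumed SLA from the example above: release within 30 minutes of checkout.
RELEASE_SLA = timedelta(minutes=30)


def release_sla_compliance(orders, sla=RELEASE_SLA):
    """Return the percentage of orders released within `sla` of checkout.

    `orders` is an iterable of (checkout_time, release_time) pairs.
    A release_time of None means the order has not been released yet
    and is counted as an SLA miss.
    """
    orders = list(orders)
    if not orders:
        return 100.0  # no orders, so nothing has missed the SLA
    within = sum(
        1
        for checkout, released in orders
        if released is not None and released - checkout <= sla
    )
    return 100.0 * within / len(orders)


# Illustrative data only, not taken from a real system.
t0 = datetime(2024, 11, 29, 9, 0)
sample = [
    (t0, t0 + timedelta(minutes=5)),
    (t0, t0 + timedelta(minutes=29)),
    (t0, t0 + timedelta(minutes=45)),  # released, but too late
    (t0, None),                        # still waiting to be released
]
print(f"{release_sla_compliance(sample):.1f}% released within SLA")  # 50.0%
```

In practice the timestamps would come from whatever order status audit data the landscape exposes; the point is only that the KPI is a simple ratio that can be tracked continuously during the Peak.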
There are two possibilities to prepare a landscape involving Sterling OMS for Peak Season: Improvements IN the Application, and Improvements AROUND the Application (Integrations).
1. Application Improvements
As they say, there is no better teacher than experience. While Application Operations teams have strong operational workarounds to resolve ongoing issues during the Peak Season, this activity often creates a backlog of items to be addressed by the Teams who implemented the solution. There are many notable examples of common areas to watch for issues, such as:
a) Database queries running Table scans instead of Index Scans
b) Read-Only APIs resulting in DB Locks
c) Records stuck in Task Queue table resulting in a low throughput
d) Frequent/Prolonged Inventory Locks due to incorrect or missing HotSKU parameters
e) Queue Pile-ups
Often such issues are addressed on the fly by implementing workarounds. Many of them then become backlog items from the previous Peak Season and are worked on as permanent fixes for the next one. It's always advisable to review the application components for technical debt periodically and well in time, and to work on it, so that the Application is ready to perform at scale when the improvements AROUND it are made.
2. Observability / Monitoring set-up
The landscape needs the right set of tools for Logging, Tracing, Metrics and Alerting. Market-leading APM tools like Dynatrace or CA Wily Introscope are used to monitor not only the various JVMs (Agents and Integration Servers), but also the health of the Database your Sterling OMS instance is connected to.
In many cases, Sterling OMS capabilities are served through REST APIs exposed via an API Gateway, or via Messaging Queues, either directly or via a middleware platform. Observability can be implemented using technology-agnostic platforms like ELK (Elasticsearch, Logstash, Kibana), or by combining different tools for log collection (e.g. Filebeat or Fluentd) with visualization tools like Grafana or Instana. These are very effective, especially when monitoring API calls made by external systems to Sterling OMS. Additionally, an effective alerting mechanism using platforms like OpsGenie or PagerDuty results in timely and proactive alerting.
Defining the right SLOs (Service Level Objectives) and alerting on them often results in proactive resolution of issues before they turn into Severity 1 / Priority 1 incidents. Middleware like IBM ACE and Messaging Platforms like IBM MQ or Kafka can be monitored as well using the right tools.
Remember to add observability to every layer possible, along with a linked alerting mechanism.
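As an illustration of alerting on an SLO, the sketch below computes an error budget burn rate for an API and decides whether to raise an alert. The function names, the 99.9% target and the burn threshold of 2.0 are assumptions I picked for the example; the real values depend on the SLOs agreed for your landscape and on the alerting tool in use.

```python
def error_budget_burn_rate(total_requests, failed_requests, slo_target):
    """Ratio of the observed error rate to the error rate the SLO allows.

    1.0 means the error budget is being consumed exactly at the allowed
    pace; values above 1.0 mean the budget is burning faster than allowed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 0.0  # no traffic in the window, nothing burned
    allowed_error_rate = 1.0 - slo_target
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / allowed_error_rate


def should_alert(total_requests, failed_requests,
                 slo_target=0.999, burn_threshold=2.0):
    """True when the budget burns at or above `burn_threshold` times the allowed pace."""
    burn = error_budget_burn_rate(total_requests, failed_requests, slo_target)
    return burn >= burn_threshold


# Hypothetical window: 10,000 API calls with 30 failures against a 99.9% SLO.
print(should_alert(10_000, 30))  # roughly 3x the allowed error rate
```

Alerting on the burn rate rather than on individual failures is one way to get notified early without paging the team for every isolated error.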
3. Performance Benchmarking
One of the most important (and also most common) activities recommended for Peak Season readiness is Performance Benchmarking. The projected volumes are gathered for the Markets/countries, and multiple Performance runs are simulated on a Pre-Production or a dedicated Performance Test environment using tools like JMeter, NeoLoad, LoadRunner etc. As the tests run, the monitoring and observability tools are watched for potential bottlenecks. The outcome of such a performance run, if not 100% successful, can be a set of items including the following:
- Potential need of application fixes
- Scaling / Tuning recommendations for Threads and Instances for Agents/Application Servers
- Database index(es) to be created, if any
- Need of infrastructure scaling
Often multiple iterations of the run are made to test the effectiveness of the fixes. The scope of the scenarios is often defined based on the volumes projected and the critical scenarios for the market.
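Before the first run, it can help to translate the projected volumes into a rough starting point for thread counts. The sketch below uses Little's Law (busy threads = arrival rate x average processing time) plus a utilisation headroom. This is a generic queueing estimate of my own choosing, not Sterling OMS sizing guidance, and the function name and the sample numbers are hypothetical; the actual Performance runs remain the source of truth.

```python
import math


def required_threads(orders_per_hour, avg_seconds_per_order,
                     target_utilisation=0.7):
    """Rough thread estimate for one transaction (e.g. order creation).

    Uses Little's Law: threads busy on average = arrival rate (per second)
    multiplied by average processing time (seconds), then adds headroom so
    that threads run at `target_utilisation` rather than at 100%.
    """
    if not 0.0 < target_utilisation <= 1.0:
        raise ValueError("target_utilisation must be in (0, 1]")
    arrival_rate = orders_per_hour / 3600.0            # orders per second
    busy_threads = arrival_rate * avg_seconds_per_order
    return math.ceil(busy_threads / target_utilisation)


# Hypothetical projection: 360,000 orders in the peak hour, 0.2 s per order.
print(required_threads(360_000, 0.2))  # 29
```

The estimate says nothing about how to split those threads across instances, nor about Database or Queue limits, which is exactly what the iterations of the performance run are meant to reveal.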
4. Scaling / Tuning
The obvious choice of scaling to achieve the required throughput is made during Performance runs, where the JVMs (Agents and/or Integration Servers) are scaled by Threads and number of instances until we reach one of the states below (assuming the other components like Database, CPU Utilisation etc. appear healthy):
- The desired Latency and Throughput as defined by the NFRs have been met, OR
- Any further attempts to scale don't improve the Latency or Throughput, or in fact result in degradation of the overall system's performance
If it's the second case and the desired numbers have still not been met, then there are additional areas where further scaling could be investigated. For example, in the case of Messaging Queues, it's easier to detect slowness by looking at the Enqueue/Dequeue rate. If the Messaging platform (e.g. MQ) or Integration Platform (e.g. IBM ACE) is hosted on Cloud, the underlying Network Storage could be upgraded to an option that gives a better IOPS rate. Tuning at Database level is also done by analysing the Explain plans for the queries that are reported to be taking the most time. In case of inventory table locking, HotSKU properties can be tweaked to achieve the desired performance.
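The two stopping conditions above can be sketched as a simple loop. Here `measure_throughput` is a hypothetical callback standing in for one performance run at a given thread count; the doubling strategy, the 5% minimum gain and the simulated system are assumptions made purely for illustration.

```python
def scale_until_plateau(measure_throughput, target, start_threads=1,
                        max_threads=64, min_gain=0.05):
    """Double the thread count until the target is met or gains flatten out.

    Returns (threads, throughput, reason), where reason is one of
    "target met", "plateau" or "limit reached".
    """
    threads = start_threads
    best = measure_throughput(threads)
    while best < target and threads * 2 <= max_threads:
        candidate = threads * 2
        observed = measure_throughput(candidate)
        # Stop when extra threads no longer buy a meaningful improvement
        # (this also covers the case where throughput degrades).
        if observed < best * (1 + min_gain):
            return threads, best, "plateau"
        threads, best = candidate, observed
    reason = "target met" if best >= target else "limit reached"
    return threads, best, reason


def simulated_run(threads):
    """Toy system: 10 orders/sec per thread, capped at 55 by a shared bottleneck."""
    return min(threads * 10, 55)


print(scale_until_plateau(simulated_run, target=50))   # (8, 55, 'target met')
print(scale_until_plateau(simulated_run, target=100))  # (8, 55, 'plateau')
```

A "plateau" result is the signal to stop adding threads and look at the areas around the application instead.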
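As a small illustration of reading the Enqueue/Dequeue rate, the sketch below estimates how long a queue backlog would take to drain, or flags that it is growing. The function name and the numbers are hypothetical; the rates themselves would come from whatever monitoring your messaging platform provides.

```python
def backlog_drain_seconds(depth, enqueue_rate, dequeue_rate):
    """Estimate the seconds needed to drain a queue backlog.

    `depth` is the current number of messages; rates are messages per second.
    Returns None when consumers are not keeping up (the backlog will not
    drain at the current rates), which is the pile-up condition to alert on.
    """
    if depth == 0:
        return 0.0
    net_drain_rate = dequeue_rate - enqueue_rate
    if net_drain_rate <= 0:
        return None
    return depth / net_drain_rate


# Hypothetical readings for a release queue.
print(backlog_drain_seconds(6000, enqueue_rate=50, dequeue_rate=70))  # 300.0
print(backlog_drain_seconds(6000, enqueue_rate=80, dequeue_rate=70))  # None
```

Comparing the drain estimate with the Click to Release SLA gives a quick sense of whether a pile-up is still tolerable or already needs consumers to be scaled.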
To conclude, I'd like to circle back and state that achieving Operational Excellence through 100% system stability does, indeed, require a set of Task forces with proper planning and execution. However, a landscape involving Sterling OMS can only be scaled to meet the exponential volumes if the application has been engineered the right way, without introducing any undesired bottlenecks. Following the right principles like Separation of Concerns, Throwing the right Exception, Writing the SDF Components so that they are Loosely Coupled and Highly Cohesive, Using the Correct APIs for the correct purpose while optimizing your API templates, Backing any new Database Extensions with the right Indexes, Using the correct SELECT Methods etc. are some of the many best practices that engineers can incorporate to ensure that the Sterling OMS Foundation is ready to achieve the desired results with the right Tuning and Scaling options in place.
Finally, after sharing the strategic approach, I'd take pleasure in contradicting myself by stating:
Operational Excellence is not a project or an initiative, it's an attitude that begins the day the first line of code is written
The views, thoughts, and opinions expressed in the text belong solely to the author, and do not represent the opinion, strategy or goals of the author’s employer, organization, committee or any other group or individual.