Look before you leap — setting up eCom observability at scale

Hari Suman Naik
Published in adidoescode
6 min read · Dec 3, 2021

The explosion in eCommerce business has been one of the standout stories of the last few years. With our day-to-day lives so deeply entrenched in the digital ecosystem around us, we are witnessing a steady shift in attitudes and purchasing behavior as more and more consumers rely on online shopping services. Smart organizations realize that a seamless eCommerce experience is not just a revenue driver but a key differentiator in shaping the brand in the eyes of the consumer.

Operating at a global scale means that downtime is often hard to identify, and late detection of anomalies can lead to an accumulating bleed of potential revenue and reputation. Take, for example, the following scenario. A hypothetical eCommerce company uses a 3rd party API integration, in one mid-sized country only, to fill in consumer details in the order delivery form. However, a recent change caused the integration to fail, and users in that country were unable to complete their purchases. Although the Site Reliability Engineers (SREs) and operations teams monitor service level indicators (SLIs) like availability and latency to detect technical issues, there wasn't any significant rise in error logs, and the outage did not make a significant dent in any of the global indicators. The issue went undetected until much later, when the local teams reached out after being overwhelmed with a barrage of support calls. Frantic resolution soon followed, although sales and consumer perception in that country had already suffered a great deal.
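The arithmetic behind this scenario can be sketched in a few lines of Python. All numbers, market codes, and the 10% alert threshold below are hypothetical and chosen only for illustration: one market carrying 2% of traffic loses every order, yet the global conversion rate moves by only 2%.

```python
# Hypothetical hourly checkout data per market: (sessions, completed orders).
# "PT" stands in for the mid-sized country whose integration broke.
baseline = {"US": (50_000, 1_500), "DE": (30_000, 900), "JP": (18_000, 540), "PT": (2_000, 60)}
current = {"US": (50_000, 1_500), "DE": (30_000, 900), "JP": (18_000, 540), "PT": (2_000, 0)}

ALERT_THRESHOLD = 0.10  # assumed: alert when conversion drops by more than 10%


def conversion(sessions, orders):
    """Share of sessions that ended in a completed order."""
    return orders / sessions if sessions else 0.0


def relative_drop(before, after):
    """Relative decrease from `before` to `after` (0.0 = unchanged, 1.0 = fell to zero)."""
    return (before - after) / before if before else 0.0


def global_conversion(data):
    """Conversion rate computed over all markets pooled together."""
    sessions = sum(s for s, _ in data.values())
    orders = sum(o for _, o in data.values())
    return conversion(sessions, orders)


# The pooled indicator barely moves...
global_drop = relative_drop(global_conversion(baseline), global_conversion(current))

# ...while the per-market indicator collapses for the affected country.
market_drops = {
    market: relative_drop(conversion(*baseline[market]), conversion(*current[market]))
    for market in baseline
}
flagged = [market for market, drop in market_drops.items() if drop > ALERT_THRESHOLD]

print(f"global drop: {global_drop:.1%}, alert: {global_drop > ALERT_THRESHOLD}")
print(f"flagged markets: {flagged}")
```

Slicing the same indicator by market is what turns an invisible 2% dip into an unmistakable 100% drop.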

As an eCommerce business matures, scenarios like these move from being edge cases to the norm. They necessitate a concentrated drive towards achieving best-in-class observability of the end-to-end eCommerce value stream.

Uplifting observability requires technical and engineering enhancements as well as socio-cultural changes to the organization’s structure and its people. The two must go hand in hand as a proactive, cohesive effort to zero in on the right indicators to observe while building the ability to detect anomalies in them.

Look before you leap — Stock image from Pixabay

To start with, an organizational structure that provides for regular collaboration and feedback between the dedicated observability team and local markets is important to jointly define indicators, correlations, and anomalies. It is vital that SREs and other team members step into market teams’ shoes to understand business indicators and tie them to complementary technical indicators. An agile, product-led setup uses design sprints to explore the needs of markets and other stakeholders, which can then be channeled and refined into requirements for engineers. It lets the entire team develop a market-driven mindset that captures the local and business indicators that define success and failure.

As the next step, a central monitoring infrastructure to aggregate metrics, logs, traces, releases, and deployments from multiple systems is essential. Large organizations tend to use a multitude of specialized monitoring tools for various purposes. There are specialized tools for experience analytics that give frontend metrics like visits, click-through rates, and conversion rates, as well as details on various page errors. Application performance managers go into further detail, aggregating and calculating indicators from calls made in the ecosystem. A central infrastructure, with ELK for logging, Victoria Metrics for metrics, Grafana for visualization and alerting, and Opsgenie for channeling alerts, can measure the health of applications, 3rd party APIs, and underlying platforms. Aggregating data from these tools and streaming it via Kafka to this central location forms the foundation for correlating data, accurately identifying anomalies, and even inferring causality. Market and SRE inputs play a major role in defining the indicators to measure, the choice of monitoring tools, and the subsequent correlation of business and technical indicators.
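As a minimal sketch of what "aggregating to a central location" can mean in practice, the Python below normalizes records from two tools into one common shape and groups them by market and minute so that business and technical indicators sit side by side. The payload field names, the `Indicator` type, and both adapter functions are hypothetical; real tools emit their own formats, and in the setup described above the records would arrive over a stream rather than as an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Indicator:
    """Common shape for every measurement, whatever tool produced it."""
    source: str       # tool that emitted the record
    kind: str         # "business" or "technical"
    market: str
    name: str
    value: float
    minute: datetime  # timestamp truncated to the minute, in UTC


def to_minute(epoch_seconds):
    """Truncate a Unix timestamp to the start of its minute."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.replace(second=0, microsecond=0)


def from_analytics(payload):
    """Adapter for a hypothetical experience-analytics payload."""
    return Indicator("analytics", "business", payload["country"],
                     payload["metric"], float(payload["val"]), to_minute(payload["ts"]))


def from_apm(payload):
    """Adapter for a hypothetical application-performance payload."""
    return Indicator("apm", "technical", payload["region"],
                     f'{payload["service"]}.error_rate',
                     float(payload["error_rate"]), to_minute(payload["epoch"]))


def correlate(indicators):
    """Group indicators by (market, minute) so they can be compared directly."""
    buckets = defaultdict(dict)
    for ind in indicators:
        buckets[(ind.market, ind.minute)][ind.name] = ind.value
    return dict(buckets)


events = [
    from_analytics({"country": "PT", "metric": "conversion_rate", "val": 0.0, "ts": 1638526812}),
    from_apm({"region": "PT", "service": "address-lookup", "error_rate": 0.4, "epoch": 1638526830}),
]
merged = correlate(events)
for (market, minute), values in merged.items():
    print(market, minute.isoformat(), values)
```

Agreeing on the shared keys (here market and minute) is the design decision that matters; once every source carries them, correlation becomes a simple grouping step.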

The foundational data can then be aggregated into a holistic, live eCom health dashboard and also used to set up smart alerts that are routed automatically to the right experts for a speedy resolution. The observability dashboard acts as a command center with all the indicators needed to know the current health of eCom. Aggregating multiple indicators speeds up resolution because correlations are visible at a glance and can even suggest the cause of an issue. The dashboard can also be used to triage outages based on impact, and integrating it with ticketing systems can assist in an effective root cause analysis during the postmortem.
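Routing and impact-based triage can be sketched as follows. The routing table, team names, priority thresholds, and the idea of scoring impact as revenue share times deviation are all assumptions made for this example, not a description of any particular alerting product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    market: str
    indicator: str        # "<domain>.<metric>", e.g. "checkout.conversion_rate"
    revenue_share: float  # affected market's share of global revenue (0..1)
    deviation: float      # relative deviation from expected behavior (0..1)


# Hypothetical mapping from indicator domain to the team that owns it.
ROUTES = {
    "checkout": "checkout-team",
    "payment": "payments-team",
    "search": "search-team",
}
DEFAULT_ROUTE = "sre-on-call"


def route(alert):
    """Pick the owning team from the indicator's domain prefix."""
    domain = alert.indicator.split(".", 1)[0]
    return ROUTES.get(domain, DEFAULT_ROUTE)


def impact(alert):
    """Rough impact estimate: how much of global revenue is at risk."""
    return alert.revenue_share * alert.deviation


def priority(alert):
    """Map the impact estimate to a priority label (thresholds are assumed)."""
    score = impact(alert)
    if score >= 0.05:
        return "P1"
    if score >= 0.01:
        return "P2"
    return "P3"


def triage(alerts):
    """Order alerts so the highest estimated impact is handled first."""
    return sorted(alerts, key=impact, reverse=True)


alerts = [
    Alert("PT", "checkout.conversion_rate", revenue_share=0.02, deviation=1.0),
    Alert("DE", "payment.success_rate", revenue_share=0.30, deviation=0.2),
    Alert("JP", "cdn.hit_ratio", revenue_share=0.10, deviation=0.05),
]
for alert in triage(alerts):
    print(priority(alert), alert.market, alert.indicator, "->", route(alert))
```

A fallback route keeps alerts from unknown domains from being dropped silently.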

Observability dashboard: A holistic view of eCom health

Once data integration and processing have matured, they open up newer possibilities for predictive monitoring using Machine Learning or Artificial Intelligence tools. Machine learning models, such as an unsupervised random cut forest or a supervised recurrent neural network (RNN), can learn patterns in incidents and data and forecast expected behavior. An RNN model, for example, can account for periodic fluctuations in traffic and infer anomalies across multiple dimensions simultaneously, in near real time. Feeding multiple complementary business and technical time series into a single model can even predict anomalies, with increasing confidence, before they bubble up to affect consumer experience.
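The core idea of accounting for periodic fluctuations can be illustrated without any machine learning at all. The sketch below is deliberately not an RNN or a random cut forest: it is a plain seasonal baseline that compares each value against the mean and standard deviation of the same slot in earlier cycles. The period, the threshold of 3 standard deviations, and the sample series are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_seasonal_baseline(history, period):
    """Compute (mean, standard deviation) per slot of a repeating cycle.

    `history` must cover at least two full cycles, because `stdev`
    needs two or more values per slot.
    """
    slots = defaultdict(list)
    for index, value in enumerate(history):
        slots[index % period].append(value)
    return {slot: (mean(values), stdev(values)) for slot, values in slots.items()}


def is_anomaly(baseline, index, value, threshold=3.0, min_std=1e-9):
    """Flag a value that deviates too far from what is usual for its slot."""
    period = len(baseline)
    mu, sigma = baseline[index % period]
    z_score = (value - mu) / max(sigma, min_std)  # guard against zero spread
    return abs(z_score) > threshold


# Hypothetical orders per time slot over three cycles of four slots each.
history = [100, 200, 300, 200,
           104, 196, 310, 205,
           96, 204, 290, 195]
baseline = fit_seasonal_baseline(history, period=4)

# The same value is unusual in a quiet slot but normal in a busier one.
print(is_anomaly(baseline, index=12, value=200))  # slot 0, usually around 100
print(is_anomaly(baseline, index=13, value=200))  # slot 1, usually around 200
```

A learned model replaces the per-slot statistics with a forecast, but the comparison of observed against expected behavior stays the same.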

Visualizing and aggregating health from biz & tech indicators for correlation and inferring causality

The engineering skillset needed to drive cutting-edge observability is diverse, because the work involves integrating multiple systems and developing tools customized for the organization. While considerable expertise in alerting and monitoring is mandatory, the team must also possess solid domain knowledge and demonstrate data streaming and integration skills. Developing a holistic visualization across the value stream requires an application development team, while a bespoke AI-driven predictive model, which is likely to be more effective than out-of-the-box solutions, requires additional data science and engineering competence.

Approach for state-of-the-art observability

With the right organizational, infrastructural, and team foundations in place, bringing to life state-of-the-art observability that can detect nuanced outages across a large, global, and highly interconnected eCommerce business is as straightforward as these five steps:

1) Take care of the Basics: Close gaps in data collection and standardize application logging, metrics, and tracing.

2) Listen to your markets: Interface with markets to capture the right business indicators relevant to them. Ensure that they are measured and centrally correlated with technical indicators.

3) Visualize: Develop a holistic visualization of eCommerce health that brings together all operations stakeholders, possibly in near real-time.

4) Alert smart: Create smart alerts that leverage both business and technical indicators and their correlation to be more accurate.

5) Detect with AI: Build advanced alerting tools that are robust enough to withstand fluctuating business trends and accurately detect anomalies leveraging cutting-edge machine learning models.
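For step 1, standardized logging usually means that every service emits the same machine-readable fields. Here is a minimal sketch using only Python's standard `logging` module; the field names (`service`, `market`, `trace_id`), the logger name, and the sample message are assumptions chosen for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object with a fixed set of keys."""

    # Assumed shared context fields that each service attaches via `extra`.
    CONTEXT_FIELDS = ("service", "market", "trace_id")

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            # Missing context becomes null instead of raising an error.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Write to an in-memory buffer here; a real service would use stdout or a file.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.error(
    "address lookup failed",
    extra={"service": "address-lookup", "market": "PT", "trace_id": "abc123"},
)
entry = json.loads(buffer.getvalue())
print(entry["level"], entry["market"], entry["message"])
```

Carrying the market on every log line is what later allows logs to be sliced and correlated with business indicators per country.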

Striving for experience excellence in a physical store requires eyes and ears on the ground to observe the needs, emotions, and troubles of the consumers in sight. In the scaled-up digital world, cutting-edge observability is essential to detect online consumer pain points. With the approach outlined above in place, along with feedback channels to refine and pivot, digital organizations can rapidly scale up and establish observability as an indispensable value driver for success.
