Diagnosing microservice issues — A modern tragedy

Arjun Dutt
Towards Application Data Monitoring
3 min read · Dec 22, 2020

The SRE lead at a prominent company in the last-mile logistics space recently told us:

“I have all the data in the world but very little information.”

We were talking about the challenges that came with microservices adoption. I’ve heard similar refrains again and again from others modernizing their systems, but none quite as pithy and direct.

Companies are breaking up their monolithic applications into smaller services. The typical goals are to decouple and decompose functionality, unlocking the agility and scalability they need to serve their customers better and stand out from the competition. As they take on this modernization effort, they are finding that true decoupling is difficult to achieve and that most modern systems end up with many complex interdependencies.

A daily struggle has emerged around issue detection and diagnosis. As developers independently change their services, whether something as banal as a variable type or the entire structure of the requests they handle, they can inadvertently cause other services to fail.
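
To show how small such a breaking change can be, here is a minimal sketch: a hypothetical routing service changes a response field from an integer to a formatted string, and a downstream notification service that still does arithmetic on that field fails at runtime, far from where the change was made. The service names and the eta_minutes field are invented for illustration.

```python
# A minimal sketch; the services and the "eta_minutes" field are hypothetical.

# Before: the routing service returns the ETA as an integer.
old_response = {"order_id": "A123", "eta_minutes": 42}

# After: a developer changes the field to a formatted string.
new_response = {"order_id": "A123", "eta_minutes": "42 min"}

def notify_customer(payload: dict) -> str:
    # The notification service still assumes an integer and does arithmetic on it.
    padded_eta = payload["eta_minutes"] + 5
    return f"Your order arrives in about {padded_eta} minutes."

print(notify_customer(old_response))        # works as before
try:
    print(notify_customer(new_response))    # the breakage surfaces here, not in the routing service
except TypeError as exc:
    print(f"notification-service error: {exc}")
```

The service that made the change sees nothing wrong; the error shows up somewhere else entirely.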

When this happens in a modern microservices-based system, errors appear in one part of the system while the root cause lies somewhere else. With increasingly distributed, global engineering teams, uncovering the underlying problem can take hours, days, and sometimes even weeks, which is an unacceptable reality in today’s high-pressure markets.

According to Stripe Research, developers spend roughly 17.3 hours each week debugging software, more than 40 percent of a typical 40-hour work week.

It doesn’t take a big leap to recognize that reducing the time it takes an engineer to find and resolve problems would provide a massive boost to engineering productivity.

Which brings me back to my conversation with the SRE lead. Companies have adopted a variety of monitoring and logging tools to diagnose problems created by bad code and breaking changes. However, SREs and DevOps teams have to dig through many dashboards and tools on a hunt for clues. The typical pattern is to respond to an alert from one system, then check a dashboard from another tool, then search through log files, find a reference to a change that points to a potential culprit, review its dashboard, and so on until they eventually find the source of the problem.

At Layer 9, we’re harnessing machine learning along with disciplined systems engineering to create a better way. We pull every service’s health and data quality metrics, contextual information about changes from the CI/CD system, and alerts from the monitoring systems into one place.

Our AI models correlate service issues with the change history of each individual service as well as the change history of the other services it depends upon. The result is a well-curated list of events and changes that point to a likely root cause. This automated pattern detection provides a multiplier effect for SRE and DevOps teams, leading to faster issue diagnosis and less time spent hunting through logs and following hunches.
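
To make the idea concrete, here is a toy sketch of change-to-incident correlation (not our actual models): given an error spike in one service, rank recent deploys across that service and its dependencies by how close they landed to the spike. The services, changes, and timestamps are all invented for illustration.

```python
# Toy sketch of change-to-incident correlation; not Layer 9's actual models.
from datetime import datetime

# Hypothetical error spike observed in the checkout service.
error_spike = {"service": "checkout", "at": datetime(2020, 12, 21, 14, 5)}

# Hypothetical recent deploys pulled from the CI/CD system.
deploys = [
    {"service": "checkout",  "change": "bump retry budget",       "at": datetime(2020, 12, 21, 9, 0)},
    {"service": "inventory", "change": "rename quantity field",   "at": datetime(2020, 12, 21, 13, 55)},
    {"service": "pricing",   "change": "refactor discount logic", "at": datetime(2020, 12, 20, 18, 30)},
]

# Services the failing one depends on (plus itself), from a hypothetical dependency map.
dependencies = {"checkout": {"checkout", "inventory", "pricing"}}

def rank_suspects(spike, deploys, dependencies):
    """Return deploys that could explain the spike, most suspicious first."""
    candidates = [
        d for d in deploys
        if d["service"] in dependencies[spike["service"]] and d["at"] <= spike["at"]
    ]
    # Changes that landed closest in time to the spike rank highest.
    return sorted(candidates, key=lambda d: spike["at"] - d["at"])

for suspect in rank_suspects(error_spike, deploys, dependencies):
    lag = error_spike["at"] - suspect["at"]
    print(f'{suspect["service"]}: "{suspect["change"]}" landed {lag} before the spike')
```

In practice the signal comes from far more than deploy timestamps, but the ranking intuition is the same: narrow the search to the changes most likely to explain the symptom.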

For the modern enterprise, we turn reams of data into actionable information, providing a force multiplier for the engineering team’s productivity.

If you’re interested in learning more, drop us a note via layer9.ai, or follow us on Twitter @layer9ai or on LinkedIn at Layer 9 AI.

Arjun Dutt is the co-founder and CEO of Layer 9, the Application Data Monitoring company.