Reducing Downtime with Alerting and Monitoring

How ProdOps helped Twist Bioscience reduce downtime and prevent potential problems by using diverse tools such as Prometheus, Grafana, Sentry and New Relic.

Daniela Kortin
Published in ProdOpsIO
3 min read · Mar 5, 2018


Implementing automatic monitoring across all your environments lets you deploy better systems at lower risk!

A happy customer is one whose bugs are fixed before they realize there were bugs to fix!

You may not realize it, but getting error information as it happens is not magic.

These days, as many companies move to microservices, containers, and the cloud, we expose ourselves to many different systems that can break and cause downtime.

For our customers, downtime means losing money, and losing money is unacceptable to us.

Downtime can be prevented in many ways, but two essential factors are alerting and monitoring.

By implementing the right methods and tools for our customers, we reduce downtime and prevent the loss of money.

  • Alerting notifies you as soon as a problem appears in your systems.
  • Monitoring helps you predict potential problems and gives you an inside look at their root causes.
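
To make the distinction concrete, here is a minimal Python sketch. It is not the setup used in the case study. It assumes a hypothetical series of disk-usage samples and a hypothetical 90% threshold: `should_alert` reacts to the current value, while `predict_value` fits a least-squares line to recent samples to estimate where the metric is heading, similar in spirit to the `predict_linear` function in Prometheus.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def should_alert(current_value: float, threshold: float) -> bool:
    """Alerting: react as soon as the current value reaches the threshold."""
    return current_value >= threshold


def predict_value(samples: Sequence[Sample], seconds_ahead: float) -> float:
    """Monitoring: fit a least-squares line through the samples and
    extrapolate it `seconds_ahead` seconds past the most recent sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical disk usage (percent), sampled once per hour.
disk_usage = [(0.0, 70.0), (3600.0, 72.0), (7200.0, 74.0), (10800.0, 76.0)]
THRESHOLD = 90.0  # assumed threshold, not taken from the case study

alert_now = should_alert(disk_usage[-1][1], THRESHOLD)
predicted = predict_value(disk_usage, seconds_ahead=8 * 3600)
will_breach = predicted >= THRESHOLD
```

With these made-up samples nothing needs an alert right now, yet the trend says the threshold will be crossed within eight hours. That gap is the window monitoring gives you to act before an alert ever fires.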

In the case study, I’ll show and explain how we helped a client of ours, Twist Bioscience, reduce downtime and prevent potential problems by using diverse tools such as Prometheus, Grafana, Sentry and New Relic.

These tools cover every part of the environment, from cloud resources and infrastructure to dependencies and applications.

Highlights:

Challenges:

  • More than 15 microservices running on the client’s 5 different Kubernetes clusters.
  • Many AWS services that need to be monitored.
  • On-premise environment with several VPN tunnel connections between their on-premise environment and AWS accounts.
  • Problematic code versions deployed to a number of AWS Lambda functions.
  • A broken third-party library in use that caused many problematic symptoms.
  • Non-scalable in-house code that had already been deployed.

Solutions:

  • Added various alerts using the Prometheus Alertmanager.
  • Added an alert that sends a notification immediately each time a human error occurs.
  • Implemented Sentry, providing immediate notifications on integration and code issues.
  • Used Grafana graphs to understand the problems and to keep refining that understanding.
  • Grafana shows all the resources consumed by the services running in Kubernetes clusters.
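
As a rough illustration of how an alert of this kind can behave, the Python sketch below mimics the inactive, pending and firing states that Prometheus alerting rules go through when a rule has a `for` duration: the condition must hold continuously for that long before a notification goes out. The `HoldAlert` class, the 80% CPU threshold and the five-minute hold are assumptions made up for this example, not values from the Twist Bioscience setup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HoldAlert:
    """Hypothetical alert that fires only after a sustained breach."""
    threshold: float
    hold_seconds: float
    breach_started: Optional[float] = None

    def evaluate(self, timestamp: float, value: float) -> str:
        # Back under the threshold: reset and report inactive.
        if value <= self.threshold:
            self.breach_started = None
            return "inactive"
        # First sample above the threshold starts the hold timer.
        if self.breach_started is None:
            self.breach_started = timestamp
        # Fire only once the breach has lasted the full hold period.
        if timestamp - self.breach_started >= self.hold_seconds:
            return "firing"
        return "pending"


# Assumed values: CPU above 80% for 5 minutes triggers a notification.
cpu_alert = HoldAlert(threshold=80.0, hold_seconds=300.0)

# Made-up (timestamp in seconds, CPU percent) readings.
readings = [
    (0.0, 75.0), (60.0, 85.0), (180.0, 90.0), (240.0, 70.0),
    (300.0, 88.0), (480.0, 91.0), (600.0, 93.0),
]
states = [cpu_alert.evaluate(t, v) for t, v in readings]
```

The short spike between 60 and 180 seconds never fires because the value drops back at 240 seconds. Only the second breach, which lasts the full five minutes, reaches the firing state. Holding an alert back like this trades a little latency for fewer false alarms.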

Results:

  • Saved more than 25% per month on the cost of EC2 instances.
  • Errors are reported quickly, so a fix can be implemented before customers experience any issue.
  • Less time spent finding and fixing bugs and other problems.

Twist Bioscience’s team sees many long-term benefits, including improved efficiency across the entire company. This includes the streamlining of several crucial processes and the elimination of the needless overhead that was wasting human resources.

For more details on how we helped Twist Bioscience prevent downtime and save money by implementing monitoring and alerting, read the full case study.

Thank you for reading!

A special thanks to Twist Bioscience.

By Ziv Rechnitser

Monitoring and alerting are two of the many solutions we at ProdOps provide as a single-click deployment on your cloud platform account. Feel free to reach out for further details if you’d like the deployment described in this post as a ready-to-go package.
