Staying completely reliable in an always-on world is difficult, so it is important to choose an incident management solution that fits your problems. In this blog, we highlight the benefits of Squadcast and why you should adopt it.

“Being on-call sucks!”

Incident response teams often use this phrase when talking about their on-call experiences. Even with best practices for managing infrastructure, incidents still occur from time to time.

To avoid delays in responding to incidents and to keep from being overwhelmed by on-call notifications, you should find a solution that helps resolve incidents efficiently…

Reliability is a team game. The more developers and SREs collaborate, the greater the success of the product. In this blog, we list five best practices that developers can adopt to make an SRE’s life easier.

It is not easy to be a site reliability engineer. Monitoring system infrastructure and aligning it with key reliability metrics is a daunting task. A software engineer’s job, by contrast, is to deliver high-quality software.

Relationships between software engineers and site reliability engineers can sometimes be tricky. To begin with, developers are generally assigned to write code that…

Cloud Operations Sandbox serves as a simulation tool for budding SREs to learn the best practices from Google and apply them to real cloud services. In this blog, we have compiled a list of FAQs surrounding the use of Google’s Cloud Operations Sandbox.

The Google SRE sandbox provides an easy way to get started with the core skills you need to become an SRE. It simulates the behavioural complexities of a real GCP (Google Cloud Platform) environment, so that budding SREs can practice hands-on while learning SRE best practices.

The core skills you need to become a good SRE are:

CI/CD enables DevOps teams to go from development to production while tackling unexpected glitches. But choosing the right CI/CD tool is always a challenge. In this blog, we cover tips that will help you select the right CI/CD tool for your team.

A few years ago, it was nearly impossible to find a software development shop that wasn’t deploying their code using some sort of Continuous Integration and Continuous Delivery (CI/CD) tool. The benefits of CI are clear: with automated testing in place, new builds can always be tested and deployed quickly. …

DevOps and SRE are fast-growing domains with frequent innovation. In this blog, you can explore the latest trends in DevOps and SRE and stay ahead of the curve.

The past decade has seen widespread adoption of DevOps methodologies in software development. Unsurprisingly, as the needs of users change, DevOps techniques have evolved as well. In this blog we will look at the trends that are most likely to have a significant impact in the coming years.

The trends mentioned below are most likely to have a lasting impact in the field of DevOps and SRE:

  1. AIOps and self-healing platforms

This is a guest post collaboration between Squadcast & Threat Stack.

The move to the cloud has rapidly expanded the cyber threat surface of modern cloud apps. This blog, written in partnership with Threat Stack, outlines how context-rich alerting helps you stay on top of your game and resolve security incidents rapidly, along with a few best practices for faster incident response.

It’s easy for on-call engineers to become overwhelmed by alerts, especially as cloud environments continue to scale at a rapid pace. …

Labelling your alert payloads, although simple, can significantly reduce the time it takes for your team to respond to incidents. In this blog, learn how Squadcast’s auto-tagging feature can be a game changer by enabling intelligent labelling and routing of alerts to ultimately reduce your MTTR.

A frequent problem faced by on-call engineers when critical outages occur is pinpointing the exact point of failure. Even though modern monitoring tools and incident management platforms provide context around each alert, there is still room for improvement. A relatively simple solution is to add labels to your alert payloads.

As an on-call engineer…
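
The labelling idea described above can be illustrated with a minimal sketch. It assumes alerts arrive as dictionaries with a free-text `message` field. The rule table, label names, and team names below are hypothetical, and this does not represent Squadcast's auto-tagging feature or its API; it only shows how labels derived from a payload can drive routing.

```python
import re

# Hypothetical tagging rules: (label name, label value, pattern matched
# against the alert message). None of these come from a real product.
TAG_RULES = [
    ("service", "payments", re.compile(r"\bpayments?\b", re.IGNORECASE)),
    ("service", "database", re.compile(r"\b(postgres|mysql|db)\b", re.IGNORECASE)),
    ("severity", "critical", re.compile(r"\b(down|outage|unreachable)\b", re.IGNORECASE)),
]

# Hypothetical routing table keyed on the "service" label.
ROUTES = {"payments": "payments-oncall", "database": "dba-oncall"}
DEFAULT_ROUTE = "platform-oncall"


def tag_alert(alert):
    """Return a copy of the alert with labels added by the first matching rule per label name.

    Labels already present on the incoming alert are never overwritten.
    """
    labels = dict(alert.get("labels", {}))
    message = alert.get("message", "")
    for name, value, pattern in TAG_RULES:
        if name not in labels and pattern.search(message):
            labels[name] = value
    return {**alert, "labels": labels}


def route_alert(alert):
    """Pick an on-call target from the alert's "service" label, with a fallback."""
    service = alert.get("labels", {}).get("service")
    return ROUTES.get(service, DEFAULT_ROUTE)


if __name__ == "__main__":
    raw = {"message": "Postgres primary unreachable in us-east-1"}
    tagged = tag_alert(raw)
    print(tagged["labels"], "->", route_alert(tagged))
```

Keeping the rules in a plain table means the people who own a service can adjust its labels without touching the routing logic.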

Prometheus has emerged as the de facto open-source standard for monitoring Kubernetes implementations. In this tutorial, Kristijan Mitevski shows how infrastructure monitoring can be done using the kube-prometheus operator. The blog also covers how the Prometheus Alertmanager cluster can be used to route alerts to Slack using webhooks.

In this tutorial by Squadcast, you will learn how to install and configure infrastructure monitoring for your Kubernetes cluster using the kube-prometheus operator, display metrics with Grafana, and configure alerting with Alertmanager.

Infrastructure Monitoring

One of the key principles of running clusters in production is monitoring.

You must be aware of the resource allocation and…

A few minutes of unexpected downtime can have catastrophic effects! Having a great incident response plan is more than a luxury; it is a necessity for organisations of all sizes today. This blog outlines key activities that can help you formulate a better incident response plan.

Picture this scenario: your organisation has suffered a catastrophic outage, phones are ringing off the hook, and customers are ranting online. Unfortunately, you do not have a reliable plan to deal with this unexpected event. Already under significant pressure, you start throwing resources at the problem. …

With the rise of microservices-based cloud applications and their corresponding complexities, the need for observability is greater than ever. This blog looks into the what and why of distributed tracing, along with a few best practices for adopting it in a microservices architecture.

Distributed tracing for microservices architecture is an emerging concept that is gaining momentum across internet-based business organizations.

Microservices architecture introduced a new way to scale a cloud application as several independent services. It facilitates high resiliency, scalability, productivity, and efficiency when compared to monolithic architectures.

However, this comes with its own complexities like difficulty…


Incident Response — The SRE Way
