Published in GumGum Tech Blog

Enabling Sumologic through Infrastructure as Code

Someone doing a puzzle
Photo credit: Ross Sneddon, https://unsplash.com/photos/sWlDOWk0Jp8

A needle in a haystack?

In the early years of computing, monitoring systems lacked many of the features we rely on today. If you were lucky, you could find some logs rather than just a core dump when something crashed. So you might wonder how engineers tracked down the root cause of a problem back then. It is no surprise that some of the first monitoring tools appeared on Unix and Linux, since those systems became a standard in the community.

Finding issues depended on hands-on tasks such as connecting to servers via SSH and following log files. This meant you had to be an operating system expert to maintain the applications and resolve problems quickly. In Linux, the most important logs fall into four main categories: application logs, event logs, service logs, and system logs.

As the volume of monitored data grew, finding the root cause of an event efficiently became a problem. Tools of that era had to add filtering features to perform better. Even so, it takes some experience in filtering data to find specific information about services, and in particular to identify what is causing issues on a system.

Sumologic to the rescue

Monitoring data is not just about finding something, but about understanding what is going on in the platform and keeping an overview of it. For that reason, several tools were developed and became popular across very large systems, one of them being Sumologic.

Sumologic provides log management and analytics services that take advantage of all the signals generated by the machines to deliver real-time insights. We can collect data from several sources and provide meaningful value to the business.

Diagram of sumo logic in the middle, connecting to containers, servers, microservices on the left, and business, support, IT operations, security, and developers on the right. In the middle around sumo logic is predict, collect, monitor, visualize, alert, and search.
In Sumologic, all of the data has a meaning for the business areas.

GumGum leverages this tool by using alerts and searches to observe the activity of the platform. There are two main use cases:

  • Alerts that warn you about specific issues in the application, such as stopped processes or spikes in incoming traffic.
  • Scheduled searches that run at regular intervals and report specific metrics, for example, the error count.

Searches help filter data to better understand what is happening across all the services. We can also filter or sort by key patterns and parse the results according to the team’s needs. Sometimes a search or alert requires developers to take action and solve a problem; other times it is just a matter of notifying them.

When configuring alerts, the team sets limits for each service and is notified when a misbehaving service breaches a threshold.
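The threshold idea can be sketched in a few lines of Python. This is only an illustrative model, not how Sumologic evaluates alerts: the `AlertRule` class, the metric names, and the limits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule: notify when a metric goes above a limit."""
    name: str
    threshold: int


def breached(rule: AlertRule, observed: int) -> bool:
    # A value equal to the limit is still considered healthy in this sketch.
    return observed > rule.threshold


def notifications(rules, observations):
    """Return one message per rule whose observed value breached its limit."""
    messages = []
    for rule in rules:
        observed = observations.get(rule.name, 0)
        if breached(rule, observed):
            messages.append(f"{rule.name}: {observed} exceeds limit {rule.threshold}")
    return messages


# Made-up limits and observations, for illustration only.
rules = [AlertRule("error_count", 100), AlertRule("incoming_requests", 5000)]
observations = {"error_count": 130, "incoming_requests": 4200}

for message in notifications(rules, observations):
    print(message)
```

Only the error count produces a notification here, because the incoming request count stays under its limit.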

Terraform is an open-source solution that helps you capture your infrastructure as code by creating modules. Terraform uses the concept of providers to interface with different third-party services, including Sumologic. By capturing all the resources needed to configure Sumologic in a module, you can abstract away the complexity of setting up the monitoring collectors, alerts, and searches.

Terragrunt provides simple access to these modules and enables your development team to create and update monitoring alerts in a unified, version-controlled way.

Diagram of terragrunt template with a search query and other parameters (recipient, time range, folder name, cron expression). 3 developers connect to the template with an arrow.
Developers only have to focus on setting the values and the query to configure the Terraform module behavior.

By reusing modules, we can create new searches and alerts, and developers only need to focus on parameters. Terraform and Terragrunt also help maintain consistency by keeping the code in a repository: developers avoid duplicating work and share insight into their configurations with other team members when they raise a pull request to merge their code.
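To make the "parameters only" idea concrete, here is a minimal Python sketch of a template that accepts the values a developer supplies (query, recipient, time range, folder name, cron expression) and hands them to a shared module as inputs. It is a model of the idea, not the actual Terragrunt template: the `ScheduledSearch` class, the `render_inputs` helper, and every example value are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScheduledSearch:
    """Hypothetical set of values a developer fills in for one search."""
    name: str
    query: str
    recipient: str
    time_range: str
    folder_name: str
    cron_expression: str

    def validate(self) -> None:
        # Reject blank values early, before anything is planned or applied.
        for field_name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{field_name} must not be empty")


def render_inputs(search: ScheduledSearch) -> dict:
    """Return the inputs a shared module would receive for this search."""
    search.validate()
    return asdict(search)


# All values below are placeholders invented for this sketch.
example = ScheduledSearch(
    name="error-count",
    query="_sourceCategory=example/service error | count",
    recipient="team@example.com",
    time_range="-15m",
    folder_name="example-service",
    cron_expression="0 0 * * * ? *",
)

for key, value in render_inputs(example).items():
    print(f"{key} = {value}")
```

Because every search goes through the same small set of fields, a reviewer reading a pull request only has to check the values, not the wiring behind them.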

The GumGum Verity™ team also keeps Sumologic’s costs down by assigning budgets to the components in the platform. During an outage, a service may generate far more data than usual, consuming the daily quota rapidly and causing overage spending. Repeated events of this kind could make the logging platform very expensive. To save money, collectors can therefore be configured to stop collecting data from sources once a quota limit is reached.

Every service in Verity™ is assigned to a specific budget in Sumologic, which means that a service’s collector stops ingesting data whenever that limit is crossed. The budget can be reset manually if you need to re-enable log ingestion; otherwise, it resets automatically the next day.
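The behavior described above can be modeled with a small Python class. This is a simplified sketch under stated assumptions, not Sumologic's implementation: the `IngestBudget` name, the byte counts, and the exact point at which ingestion stops are chosen for illustration.

```python
class IngestBudget:
    """Simplified model of a per-service budget that stops ingestion."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0

    def ingest(self, size_bytes: int) -> bool:
        """Accept the data unless the budget has already been used up."""
        if self.used_bytes >= self.capacity_bytes:
            return False  # limit crossed: the collector stops ingesting
        self.used_bytes += size_bytes
        return True

    def reset(self) -> None:
        """Manual reset, or the automatic reset on the next day."""
        self.used_bytes = 0


budget = IngestBudget(capacity_bytes=100)
print(budget.ingest(60))  # accepted, 60 of 100 used
print(budget.ingest(60))  # accepted, this message crosses the limit
print(budget.ingest(1))   # rejected until the budget is reset
budget.reset()
print(budget.ingest(1))   # accepted again after the reset
```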

When defining collector budgets, it is important to leave some room in your daily quota so that you can afford to reset budgets without overspending. Remember, the moment things blow up is when you need your logs the most to figure out what is going on.
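The headroom reasoning is simple arithmetic, sketched here in Python. The quota, the service names, and the budget sizes are invented for illustration, and the sketch assumes that resetting a budget lets the service consume up to one more full budget.

```python
def remaining_headroom(daily_quota_gb: int, budgets_gb: dict) -> int:
    """Quota left over if every service uses its whole budget once."""
    return daily_quota_gb - sum(budgets_gb.values())


def affordable_resets(daily_quota_gb: int, budgets_gb: dict, service: str) -> int:
    """How many times one service's budget can be reset without overage."""
    headroom = remaining_headroom(daily_quota_gb, budgets_gb)
    if headroom <= 0:
        return 0
    return headroom // budgets_gb[service]


# Hypothetical numbers, in gigabytes per day.
daily_quota_gb = 100
budgets_gb = {"service-a": 30, "service-b": 20, "service-c": 10}

print(remaining_headroom(daily_quota_gb, budgets_gb))
for service in budgets_gb:
    print(service, affordable_resets(daily_quota_gb, budgets_gb, service))
```

If the budgets added up to the full daily quota, the headroom would be zero and any reset during an incident would lead straight to overage.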

Diagram with platform sources on the left with 3 services consisting of EC2 instance and docker, pointing to sumologic collectors on the right, with 3 collectors having different capacities for the budget. In the middle it says “data keeps coming as long as the threshold is not exceeded”.
Collectors with different budgets. The ingest budget configuration is also made in Terraform.

Going deeper into the configuration

The DevOps team wants to make it easier for developers to add new searches to the code. Therefore, we store the search queries as raw templates, which are then reflected in the Sumologic console.

_sourceCategory=/ecs/va/va-content-extraction — prod* AND _sourceName="te-training-pipeline" | parse regex "Diffbot\sresponse\sfor\s(?<url>.*)\swith\stld\s(?<domain>.*)\swas\s(?<response>.*)" | count as count
Query search used directly in the console.
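To show what the parse and count steps of the query extract, the following Python sketch applies the same regular expression to a few log lines. The sample lines are made up for illustration, and the sketch assumes that lines which do not match the expression are left out of the count.

```python
import re

# Python spells named groups (?P<name>...) where the console query uses (?<name>...).
DIFFBOT_PATTERN = re.compile(
    r"Diffbot\sresponse\sfor\s(?P<url>.*)\swith\stld\s(?P<domain>.*)\swas\s(?P<response>.*)"
)


def parse_line(line: str):
    """Return the url, domain, and response fields, or None without a match."""
    match = DIFFBOT_PATTERN.search(line)
    return match.groupdict() if match else None


# Hypothetical log lines, invented for this sketch.
sample_lines = [
    "Diffbot response for https://example.com/a with tld example.com was 200",
    "Unrelated log line",
    "Diffbot response for https://example.org/b with tld example.org was 404",
]

parsed = [fields for fields in map(parse_line, sample_lines) if fields is not None]
count = len(parsed)  # plays the role of "count as count" in the query

for fields in parsed:
    print(fields["url"], fields["domain"], fields["response"])
print(count)
```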

It is also worth mentioning that the infrastructure is deployed through Atlantis, which makes it even easier to deploy new changes to the platform. You can review this previous blog post to dive into the details. The big picture of how we create new searches and alerts is the following flow.

Diagram with a developer on the far left, with an arrow that says “Parameters & query search” pointing to “Terragrunt templates” which points down to “Terraform modules” and right to “Terraform plan” with the words “Pull Request” on the arrow. That points to “sumologic” with the words “atlantis apply” on the arrow.
Developer workflow to create searches and alerts in Sumologic through IaC.

Summary

When working on a team, it is important to have changes reviewed by peers. For that reason, it is much better to deploy the infrastructure through code than to apply changes manually. This practice is safe and consistent, and the changes are preserved in a repository.

Even though you can get incredible insights into your platform with Sumologic, you also need to account for the costs it can incur if you are not careful. Setting collector budgets helps keep those costs under the limit.

Enabling the team to contribute to infrastructure development decouples the developers from the DevOps team and gives them more independence. To achieve this goal, it is important to design the modules for self-service and to reduce the complexity of the configurations so that developers only provide a few custom parameters.

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | Linkedin | Instagram
