Grafana & GitLab — happier developers

Or how to achieve 90% of the results with 10% of the effort

Published in

Agoda Engineering & Design

7 min read · Dec 26, 2023

At Agoda, we utilize Grafana and the Grafana Loki plugins to streamline log access. This dynamic duo provides a flexible and straightforward way to explore and analyze logs, empowering our teams to troubleshoot and manage system data efficiently.

Our log storage follows a structured format, combining the core message with stack traces (for errors or warnings) and additional attributes. This approach provides a comprehensive view of each log entry, ensuring our logs carry rich contextual information. With messages, stack traces, and attributes seamlessly integrated, we’ve created a detailed and informative logging system that facilitates quick identification and resolution of issues.

Inspired by a convenient feature in my beloved IntelliJ IDEA that lets me click on a line in a stack trace and navigate directly to the corresponding code, I wondered why a similar experience couldn’t be integrated into Grafana for any project.

This simple yet powerful functionality can significantly enhance our debugging process, enabling users to swiftly trace the origin of issues from logs in Grafana directly to their precise locations in the codebase. Such a feature would be particularly valuable for platform and horizontal teams dealing with substantial amounts of code, often with multiple owners, streamlining their daily workflow and collaborative efforts.

Thus, we can formalize a functional requirement as follows: as a developer, I need a way to navigate directly to the source code for a specific line mentioned in a stack trace in any project, irrespective of the number of dependencies it has. We can make this happen since we operate our own Grafana instance.

JVM Stack Traces

Parsing stack traces accurately for every programming language would require a significant amount of time to implement. However, let’s explore the possibility of making compromises and limiting the functionality to a subset of languages. To achieve this, we can introduce several constraints:

We will narrow our focus exclusively to JVM stack traces, given that the vast majority of services at Agoda are written in Scala and Kotlin.
We will prioritize only the code authored by Agoda developers, as, in 99% of cases, these are the lines that people are most interested in.
We will simplify the parsing of JVM stack traces as much as possible.

Implementation in Grafana

To implement this feature in Grafana, we need to make changes to both the backend and frontend components. On the backend, the following steps are required:

Identify whether a log contains a Java stack trace.
Parse the identified stack trace.
For each relevant stack trace line, determine its location in GitLab.
Transmit the results to the front end.

On the front end, our task is to utilize the information obtained from the backend and incorporate it into the visualization by adding clickable links.

Due to its versatility, Grafana employs data frames as a fundamental entity for communication between components. A data frame comprises fields; each data source typically utilizes a constant set of fields. Each field is characterized by a name and a type, and we will leverage these attributes to identify the new column we intend to add.

For each log message, our new field will include a map associating source file names and line numbers with links in GitLab. Consequently, the visualization process becomes straightforward. If we have link information for a specific log message, we split the message into lines and add a link to every line that corresponds to an entry in the map. Here is an example of the additional field sent by the backend:

{
    "File1.scala:52": "http://gitlab/team/project/-/blob/branch/filepath/File1.scala#L52",
    "File1.scala:60": "http://gitlab/team/project/-/blob/branch/filepath/File1.scala#L60",
    "File1.scala:127": "http://gitlab/team/project/-/blob/branch/filepath/File1.scala#L127",
    "File2.scala:34": "http://gitlab/team/project/-/blob/branch/filepath/File2.scala#L34"
}
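The linking step can be sketched in a few lines. Our actual front end is part of Grafana's UI, so the Go version below only illustrates the logic; `annotatedLine` and `annotateLines` are hypothetical names, and we assume that the map key is exactly the text inside the trailing parentheses of a stack frame.

```go
package main

import "strings"

// annotatedLine pairs one line of a log message with its GitLab link.
// Link is empty when the map has no entry for that line.
type annotatedLine struct {
	Text string
	Link string
}

// annotateLines splits a log message into lines and looks up the
// "File.scala:52" part of each stack frame in the links map.
func annotateLines(message string, links map[string]string) []annotatedLine {
	lines := strings.Split(message, "\n")
	out := make([]annotatedLine, 0, len(lines))
	for _, line := range lines {
		entry := annotatedLine{Text: line}
		// Stack frames end with "(File.scala:52)", so we take the
		// text between the last "(" and the last ")".
		open := strings.LastIndex(line, "(")
		end := strings.LastIndex(line, ")")
		if open >= 0 && end > open {
			entry.Link = links[line[open+1:end]] // "" when the key is missing
		}
		out = append(out, entry)
	}
	return out
}
```

Lines without a matching key keep an empty link, so the visualization can render them as plain text.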

Understanding JVM Stack Traces

Before determining whether a string contains a JVM stack trace, it’s important to understand their structure. A typical stack trace would look like this:

Exception in thread "main" java.lang.NullPointerException
   at com.example.myproject.Book.getTitle(Book.java:16)
   at com.example.myproject.Author.getBookTitles(Author.java:25)
   at com.example.myproject.Bootstrap.main(Bootstrap.java:14)

Additionally, according to the Javadoc for StackTraceElement, stack trace lines may contain more information:

    at com.foo.loader/foo@9.0/com.foo.Main.run(Main.java:101)

The first element, “com.foo.loader”, is the name of the class loader. The second element, “foo@9.0”, is the module name and version. The third element is the method containing the execution point; “com.foo.Main” is the fully-qualified class name, and “run” is the name of the method. “Main.java” is the source file name, and “101” is the line number.

To keep only the stack trace lines that fit our constraints, we first check whether a line starts with “at” and contains “com.agoda”, the primary package name at Agoda. We then apply a simple regular expression to validate that the surviving line is indeed a Java stack trace line. Initially, the regular expression was the only filtering step, but given the significant volume of logs fetched, it became a bottleneck, so the cheap string checks now run first.
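A minimal sketch of this two-stage filter could look like the following. The regular expression is an assumption made for illustration, not the exact pattern we run in production, and `isAgodaFrame` is a hypothetical name.

```go
package main

import (
	"regexp"
	"strings"
)

// frameRe is an assumed, simplified pattern for a JVM stack frame:
// "at <anything>(<File>.<ext>)" with an optional ":<line>" suffix.
var frameRe = regexp.MustCompile(`^at\s+.+\([\w$.-]+\.\w+(:\d+)?\)$`)

// isAgodaFrame runs the cheap string checks first and falls back to
// the regular expression only for lines that pass them.
func isAgodaFrame(line string) bool {
	trimmed := strings.TrimSpace(line)
	if !strings.HasPrefix(trimmed, "at ") || !strings.Contains(trimmed, "com.agoda") {
		return false
	}
	return frameRe.MatchString(trimmed)
}
```

Most log lines are rejected by the two string checks, so the regular expression only runs on a small fraction of the input.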

Following this logic, we can parse the filtered lines into the package, the file name, and the line number where the error occurred, keeping in mind that the line number is optional. Unfortunately, languages like Scala make the parsing logic more complicated by introducing additional symbols like $ and @.

at map @ sttp.client.impl.monix.TaskMonadAsyncError$$anon$1.run(TaskMonadAsyncError.scala:26)

To manage this syntax, we split stack trace lines by spaces, focusing only on the last parts. Additionally, to mitigate concerns about anonymous classes, we can extract everything before the first “$” symbol and operate with that information.

Once the parsing is complete, we obtain a structured set of information including the file path, the file name, and the line number. With this information, we can leverage GitLab to retrieve the corresponding source code links.

type parsedFile struct {
    Path string
    FileName string
    Line uint32
}
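The parsing rules described above can be sketched as follows. This is an illustration rather than our production parser: `parseFrame` is a hypothetical name, and stripping the class loader and module prefix at the last "/" is an assumption based on the Javadoc format shown earlier.

```go
package main

import (
	"strconv"
	"strings"
)

type parsedFile struct {
	Path     string // dotted path from the frame, e.g. "com.agoda.sql.DB.query"
	FileName string // source file name, e.g. "DB.scala"
	Line     uint32 // line number, 0 when the frame has none
}

// parseFrame extracts path, file name, and line number from one stack frame.
func parseFrame(line string) (parsedFile, bool) {
	// Split by whitespace and keep only the last part, which skips
	// prefixes such as "at" and "map @".
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return parsedFile{}, false
	}
	last := fields[len(fields)-1]
	open := strings.LastIndex(last, "(")
	if open < 0 || !strings.HasSuffix(last, ")") {
		return parsedFile{}, false
	}
	qualified := last[:open]
	location := last[open+1 : len(last)-1]

	// Assumption: drop an optional "loader/module@version/" prefix.
	if slash := strings.LastIndex(qualified, "/"); slash >= 0 {
		qualified = qualified[slash+1:]
	}
	// Keep everything before the first "$" to ignore anonymous classes.
	if dollar := strings.Index(qualified, "$"); dollar >= 0 {
		qualified = qualified[:dollar]
	}

	fileName, lineText, hasLine := strings.Cut(location, ":")
	parsed := parsedFile{Path: qualified, FileName: fileName}
	if hasLine {
		n, err := strconv.ParseUint(lineText, 10, 32)
		if err != nil {
			return parsedFile{}, false
		}
		parsed.Line = uint32(n)
	}
	return parsed, true
}
```

Note that for frames without a "$" the path still ends with the method name; this sketch does not try to remove it.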

Leveraging GitLab’s API

Our source code is stored in GitLab. Thus, we will use the search API with the blobs scope to retrieve information about the code location from GitLab.

GET https://gitlab/api/v4/search?scope=blobs&search=filename:DB.scala&sort=asc

Response:
[
    {
       "path": "sql/src/main/scala/com/agoda/sql/DB.scala",
       "ref": "aa6286ba8f1d3aae30f07b5ed9fa2c59d89ff595",
       "project_id": 30932
    },
    {
       "path": "sql/src/main/scala/com/agoda/sql/DB.scala",
       "ref": "ca8329434329ccfabd0fe3589ae901df246e5b3b",
       "project_id": 24643
    },
    …
]

The response from GitLab contains a list of search results, each including the file path, the commit hash (“ref”), and the project_id. Now, we need to fetch the associated projects. Nothing fancy, just another GET call:

GET http://gitlab/api/v4/projects/12345

Response:
{
    "id": 12345,
    "description": "Simple project",
    "name": "Project",
    "path_with_namespace": "Team/project",
    "created_at": "2023-01-01T00:00:01.000Z",
    "default_branch": "master",
    "forks_count": 0,
    …
}

Using this information, we can build a link to the file in GitLab by simply concatenating the project path, search result commit, and file path.

GitlabURL + "/" + project.PathAndNamespace + "/-/blob/" + searchResult.Commit + "/" + searchResult.Path
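Putting the two responses together, a minimal sketch of the link construction might look like this. The struct and function names are hypothetical; the JSON field names come from the responses shown above, and the "#L" suffix follows the link format from the earlier example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// searchResult holds the fields we need from one blob search result.
type searchResult struct {
	Path      string `json:"path"`
	Ref       string `json:"ref"`
	ProjectID int    `json:"project_id"`
}

// project holds the fields we need from the project response.
type project struct {
	ID                int    `json:"id"`
	PathWithNamespace string `json:"path_with_namespace"`
	ForksCount        int    `json:"forks_count"`
}

// decodeSearchResults parses the body of the search response.
func decodeSearchResults(body []byte) ([]searchResult, error) {
	var results []searchResult
	if err := json.Unmarshal(body, &results); err != nil {
		return nil, err
	}
	return results, nil
}

// buildLink joins the project path, commit, and file path into a blob URL
// and appends the line anchor when the line number is known.
func buildLink(gitlabURL string, p project, r searchResult, line uint32) string {
	link := fmt.Sprintf("%s/%s/-/blob/%s/%s",
		strings.TrimRight(gitlabURL, "/"), p.PathWithNamespace, r.Ref, r.Path)
	if line > 0 {
		link += fmt.Sprintf("#L%d", line)
	}
	return link
}
```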

While our integration is already operational, it is not flawless, primarily due to the substantial number of HTTP calls made to GitLab. Additionally, GitLab returns a considerable number of results, which means we need some heuristics to identify the best match.

To reduce the GitLab workload, we can first parse all stack trace log lines and query GitLab for each unique file only once. This approach will significantly help, given that stack traces often repeat the same file multiple times with different lines. Additionally, we will add a daily cache for project information, as projects don’t change frequently, and an hourly cache for search results.
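Both ideas fit in a short sketch: deduplicate the file names before querying, and put a time-based cache in front of the GitLab calls. All names here are hypothetical, and the cache is a deliberately simple in-memory map that never evicts expired entries; it is an illustration, not our production cache.

```go
package main

import (
	"sync"
	"time"
)

type cacheEntry[V any] struct {
	value   V
	expires time.Time
}

// ttlCache is a small in-memory cache whose entries expire after ttl.
type ttlCache[V any] struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry[V]
}

func newTTLCache[V any](ttl time.Duration) *ttlCache[V] {
	return &ttlCache[V]{ttl: ttl, entries: make(map[string]cacheEntry[V])}
}

// Hourly cache for search results, daily cache for project information.
func newHourlyCache[V any]() *ttlCache[V] { return newTTLCache[V](time.Hour) }
func newDailyCache[V any]() *ttlCache[V]  { return newTTLCache[V](24 * time.Hour) }

// getOrLoad returns a fresh cached value or calls load and stores the result.
// The lock is held during load, which keeps the sketch simple but serializes
// concurrent loads.
func (c *ttlCache[V]) getOrLoad(key string, load func(string) (V, error)) (V, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expires) {
		return e.value, nil
	}
	v, err := load(key)
	if err != nil {
		var zero V
		return zero, err
	}
	c.entries[key] = cacheEntry[V]{value: v, expires: time.Now().Add(c.ttl)}
	return v, nil
}

// uniqueFiles keeps the first occurrence of every file name, so each file
// is searched for only once per stack trace.
func uniqueFiles(names []string) []string {
	seen := make(map[string]struct{}, len(names))
	out := make([]string, 0, len(names))
	for _, name := range names {
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		out = append(out, name)
	}
	return out
}
```

In this sketch, `load` would wrap the HTTP call to GitLab, so repeated stack traces within the same hour cost no extra requests.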

Now, we need to reduce the number of file search results. The objective is to strike a balance between obtaining sufficient information and having too many results.

After a series of experiments, we achieved the best balance by searching with the file path along with the filename. However, a new issue arose: the path taken from the stack trace doesn’t always end with a class name, for example, “com.agoda.Service.withResources”. If we passed such a path to the search as is, we would naturally find nothing. To address this challenge, I implemented an additional optimization. In 99% of cases, the third part of the path represents a repository or library name. Therefore, we keep only the first three parts and add “com/agoda/Service” to the search path, which significantly improves the search results and overall accuracy.
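Deriving the search path from the dotted path can be sketched as below. The function names are hypothetical, and combining the `filename:` and `path:` filters of GitLab's advanced search into a single search term is our assumption for this illustration; the exact query may need adjusting (for example, with wildcards) for a given GitLab setup.

```go
package main

import "strings"

// searchPathPrefix keeps the first three parts of a dotted path and joins
// them with slashes: "com.agoda.Service.withResources" -> "com/agoda/Service".
func searchPathPrefix(dottedPath string) string {
	parts := strings.Split(dottedPath, ".")
	if len(parts) > 3 {
		parts = parts[:3]
	}
	return strings.Join(parts, "/")
}

// searchTerm builds the value of the "search" query parameter.
// Assumption: the filename and path filters are combined in one term.
func searchTerm(dottedPath, fileName string) string {
	return "filename:" + fileName + " path:" + searchPathPrefix(dottedPath)
}
```

The resulting term still has to be URL-encoded before it is sent as the `search` parameter.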

Now, with significantly fewer results and higher relevance, there are still a few remaining corner cases:

Stale forked repositories, which people created to open PRs to the original repo. This can be mitigated by using forks_count; we are interested in the projects with the highest count.
The same file may exist in different folders; for this, we count the number of matches between the path from the search result and the stack trace line. The more matches, the better.
Last but not least, newer code is considered more relevant than stale code, so we sort results by their commit dates.
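These three heuristics can be combined into a single ordering. The sketch below is one possible arrangement: the precedence (forks first, then path matches, then commit date) is an assumption, `candidate`, `pathMatches`, and `rankCandidates` are hypothetical names, and the commit date is simplified to Unix seconds.

```go
package main

import (
	"sort"
	"strings"
)

// candidate is one search result enriched with project information.
type candidate struct {
	Path        string // file path from the search result
	ForksCount  int    // forks_count of the owning project
	CommittedAt int64  // commit date as Unix seconds (simplified)
}

// pathMatches counts the segments of resultPath that also appear in the
// dotted path from the stack trace. File extensions are ignored.
func pathMatches(resultPath, dottedPath string) int {
	wanted := make(map[string]struct{})
	for _, seg := range strings.Split(dottedPath, ".") {
		wanted[seg] = struct{}{}
	}
	matches := 0
	for _, seg := range strings.Split(resultPath, "/") {
		if dot := strings.LastIndex(seg, "."); dot > 0 {
			seg = seg[:dot]
		}
		if _, ok := wanted[seg]; ok {
			matches++
		}
	}
	return matches
}

// rankCandidates returns a sorted copy, best match first.
func rankCandidates(cands []candidate, dottedPath string) []candidate {
	ranked := append([]candidate(nil), cands...)
	sort.SliceStable(ranked, func(i, j int) bool {
		a, b := ranked[i], ranked[j]
		if a.ForksCount != b.ForksCount {
			return a.ForksCount > b.ForksCount // prefer the original over stale forks
		}
		ma, mb := pathMatches(a.Path, dottedPath), pathMatches(b.Path, dottedPath)
		if ma != mb {
			return ma > mb // prefer paths that share more segments
		}
		return a.CommittedAt > b.CommittedAt // prefer newer commits
	})
	return ranked
}
```

The first element of the ranked slice is then the one used to build the link.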

After applying all the optimizations listed above, the precision of the search results improves dramatically.

Conclusion

With relatively low effort, we now have functionality that closely mirrors what developers experience while running applications locally. Although the file search precision is remarkable, it is still not 100% accurate. For instance, if applications use some very old libraries that were last updated in main several years ago, those libraries might be excluded from the GitLab index.

This means we will get either no search results or results that are no longer relevant. Regardless, after having this change deployed in the production environment for several months, we have observed the significance of this feature. Without this integration, navigating through stack traces becomes significantly more challenging.

We would also be thrilled to contribute this change to Grafana, should the developers consent to its integration.