RipTrace: Debugging Microservice and Container Applications

Published in The Startup, Sep 6, 2020

Introduction

Debugging microservices is an inherently complex topic due to the interactions between multiple disparate applications. Yes, decoupling should always be kept at the forefront when developing microservices, but more often than not, a highly decoupled system is an unattainable utopian dream. The microservices will undoubtedly rely on each other, and when one fails, it can have a cascading effect on the others. In addition to this, isolated unit tests will often only account for happy-path scenarios, without taking into consideration the chaotic paths that orchestrating multiple microservices can bring about. What if there were a way to introduce distributed breakpoint-style debugging across multiple microservices at the same time, with the capability of tracking local variables, timing, and thread execution? Enter RipTrace.

A Focus on Go

Due to the prevalence of its use in distributed systems (Kubernetes, Docker, Moby, etcd) and ease of deployment (single compiled binary) for microservices, Go was chosen as the target for RipTrace. In addition to the above, the magnificent Delve debugger (https://github.com/go-delve/delve) is a perfect resource to tap for building a distributed breakpoint debugging system. RipTrace combines Go’s concurrency patterns, the robustness of the Delve debugger, and the NATS messaging system (https://nats.io/) into a distributed debugging system with the goal of making Go-based microservices easier to troubleshoot.

RipTrace Architecture

Below is a diagram that presents the overall proposed architecture of RipTrace.

On the left side of the diagram is the RipTrace agent. The RipTrace agent leverages Delve and its capability of attaching to processes using ptrace to debug existing Go applications. It is important to note that the container must be granted the ptrace capability (CAP_SYS_PTRACE on Linux), so this should only be used in environments where the containers can be trusted.

All commands on the agent are executed using NATS topics. This means the RipTrace agent only needs to be able to reach out to the NATS server; the server does not need access into the network the agent runs in. Theoretically speaking, the NATS server and RipTrace server can therefore exist outside of the network(s) that the agents are operating in. Below are some examples of the request/response and topic patterns that are seen within RipTrace:

<host>.profile.get: Allows the RipTrace server to request that an agent return the host’s profile, that is, all currently executing Go processes that can be attached to, as well as some general information about the host system.
<host>.debugger.attach: The server will provide the agent with a process ID (acquired during the profile.get request) and the agent will attempt to attach a debugger to the provided process. Once a debugger is attached, this will create a “TRACE_PUBLISHER” that will publish all output from the debugger as tracepoints are hit.
<host>.debugger.createTracepoint: A message on this topic will notify the agent to inject a tracepoint at the specified location, identified by filename and line just like a typical debugger. Each time a tracepoint is hit, it will emit messages on the TRACE_PUBLISHER that can then be picked up by the RipTrace server.
<host>.debugger.trace: This topic is where messages emitted by the TRACE_PUBLISHER can be found.

By using a messaging system like NATS, it is very easy to scale the system up to many agents and to coordinate and orchestrate many debugging instances at the same time. In addition to this, each trace message is tagged with a coordinated timestamp (provided by the server) so that strict ordering is guaranteed. How is this powerful? For example, say that tracepoints are injected into multiple microservices and a request is run. It is then possible to see the order in which lines were hit, the mutation of local variables along the way, and whether any error handling lines were hit. This essentially provides a unified breakpoint debugger across multiple systems that are running at the same time.
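
As a rough illustration of the ordering idea, the sketch below merges trace messages from several agents into a single timeline by sorting on the server-provided timestamp. The TraceEvent type, its fields, the Timeline function, and the sample data are all assumptions made for this example; the actual RipTrace message format may look quite different.

```go
package main

import (
	"fmt"
	"sort"
)

// TraceEvent is a hypothetical shape for one trace message.
type TraceEvent struct {
	Host     string            // agent that emitted the event
	File     string            // source file of the tracepoint
	Line     int               // line of the tracepoint
	ServerTS int64             // coordinated timestamp from the server (unit assumed)
	Locals   map[string]string // captured local variables, rendered as strings
}

// Timeline returns a copy of events ordered by ServerTS.
// The sort is stable, so events with equal timestamps keep arrival order.
func Timeline(events []TraceEvent) []TraceEvent {
	out := make([]TraceEvent, len(events))
	copy(out, events)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].ServerTS < out[j].ServerTS
	})
	return out
}

func main() {
	// Events in the order they happened to arrive at the server.
	arrived := []TraceEvent{
		{Host: "billing", File: "charge.go", Line: 88, ServerTS: 3,
			Locals: map[string]string{"amount": "1999"}},
		{Host: "gateway", File: "router.go", Line: 17, ServerTS: 1,
			Locals: map[string]string{"path": "/orders"}},
		{Host: "orders", File: "create.go", Line: 42, ServerTS: 2,
			Locals: map[string]string{"orderID": "A-1"}},
	}

	// Print one unified view across all three services.
	for _, e := range Timeline(arrived) {
		fmt.Printf("%d %s %s:%d %v\n", e.ServerTS, e.Host, e.File, e.Line, e.Locals)
	}
}
```

Sorting a copy rather than the input keeps the raw arrival order available, which could be handy when comparing network arrival against the coordinated order.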

RipTrace Server

The RipTrace server coordinates and orchestrates the “fleet” of active debugging agents by injecting tracepoint, attach, and profile request messages into the NATS topics, while also subscribing to the TRACE_PUBLISHER topics to monitor for trace messages being emitted by the debuggers. In addition to this, the RipTrace server will provide a frontend interface very similar to an IDE, where lines can be clicked on to inject breakpoints. The server will synchronize with Git repositories to reflect the code currently being executed in the microservices. This source code synchronization step is very important to ensure accurate breakpoints and filenames.

Current Status and Goals

RipTrace is at the very early stages of design and development. Leveraging Delve’s internal resources was highly experimental, but is currently in a working state. Currently, the agent will respond to the above topics and trace the attached process correctly, without affecting the runtime of the attached process.

Long term, the goals of the RipTrace project are as follows:

Ability to coordinate n microservices and agents to debug large distributed systems
A method to inject reproducible tests (in the form of configuration files) that can be run on a set of services at the same time
Provide an easy-to-use frontend that offers a familiar IDE-like experience and works seamlessly with other developer tools

Feedback and Ideas

The project is located at https://github.com/csthompson/riptrace, but keep in mind the project is in its VERY VERY early stages. I just want to start getting feedback from other developers to see if the idea is viable or worthwhile. Feel free to drop a comment below!