OpenTracing at Scale in .NET
One of the greatest strengths of OpenTracing is the community that has been built around it, across a wide variety of languages and technologies. With that in mind, I’m very excited to present a guest post on the OpenTracing blog today, written by Aaron Stannard. Aaron Stannard is the Founder and CEO of Petabridge, a startup that helps .NET companies build large-scale distributed systems. He’s also the co-founder of the Akka.NET project. You can find him on Twitter at https://twitter.com/Aaronontheweb
For the past five years I’ve served as the maintainer and one of the co-founders of the Akka.NET open source project, a C# and F# port of the immensely popular Akka project originally developed in Scala. I embarked on the project because the .NET ecosystem simply lacked the tools and frameworks for building the kind of real-time, large-scale applications I was developing at MarkedUp, the marketing automation and analytics startup I was running at the time.
After shutting down MarkedUp I went on to found Petabridge, an open source company dedicated to supporting and developing Akka.NET and other distributed systems technologies in .NET.
I’m pleased to report that these days, the .NET community has a much stronger open source ecosystem and there are many more tool choices available for building the types of large-scale applications in .NET that I was working on back in 2013–14.
The .NET ecosystem as a whole is changing significantly with the arrival of .NET Core, a new implementation of the .NET runtime that is high performance, lightweight, and 100% cross-platform. This has opened up a new realm of possibilities for .NET developers that simply weren’t available before.
Large Scale .NET with Akka.NET and the Actor Model
Akka and Akka.NET, in case you haven’t heard of either, are implementations of the actor model built on top of general purpose virtual machines (JVM and CLR, respectively.) The actor model is an old concept dating back to the early 1970s, but it’s been resurgent in recent years because it offers an understandable computational model that is easy to distribute across large datacenters or public cloud environments.
“An understandable computational model” to do what, you ask? Specifically, the actor model has found a home among developers who need to build scalable real-time systems, such as:
- Multiplayer video games;
- Marketing automation;
- Healthcare / medical IoT;
- Logistics, transportation, and shipping;
- Finance; and
- Real-time transaction processing (ACH, payment processors, etc.)
What all of these applications have in common is that, to fulfill their obligations to customers and stakeholders, they must be able to complete their work in a consistently fast (real-time) manner regardless of the total amount of traffic on the system (scalable). In order for these applications to meet both of these goals they must be stateful, meaning that the source of truth comes from application memory, not an external database. In order for stateful applications to be both fault-tolerant and highly available, they must also be decentralized: the state can’t be concentrated into a single area, otherwise the system becomes vulnerable to single-point-of-bottleneck and single-point-of-failure limitations.
This is what the actor model allows developers to do: build highly decentralized, fault-tolerant, stateful applications where each unit of work (actor) is self-contained with private state that can’t be modified directly from the outside. The only way to modify an actor’s state is through sending that actor a message, which the actor will eventually process, possibly resulting in an update to the actor’s state.
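To make the idea concrete, here is a minimal, language-neutral sketch of an actor with private state and a mailbox. It is written in Python for brevity and is not Akka.NET code: the names `CounterActor`, `tell`, `process_one`, and `drain` are hypothetical, and a real actor runtime would schedule message processing on its own threads rather than through a manual loop.

```python
from collections import deque


class CounterActor:
    """Hypothetical actor: its state is private and changes only via messages."""

    def __init__(self):
        self._total = 0          # private state, never modified from outside
        self._mailbox = deque()  # messages wait here until the actor runs

    def tell(self, message):
        """Fire-and-forget send: enqueue the message and return immediately."""
        self._mailbox.append(message)

    def process_one(self):
        """Handle the oldest queued message; return False if the mailbox is empty."""
        if not self._mailbox:
            return False
        kind, value = self._mailbox.popleft()
        if kind == "add":
            self._total += value
        elif kind == "get":
            # 'value' is a list standing in for the sender's own mailbox
            value.append(self._total)
        return True


def drain(actor):
    """Stand-in for a scheduler: run the actor until its mailbox is empty."""
    while actor.process_one():
        pass
```

Sending `("add", 2)`, `("add", 3)`, and then `("get", replies)` changes nothing until the actor is run; once `drain(actor)` has processed the mailbox in order, the reply list holds the total. The sender never touches `_total` directly, which is the property the actor model relies on.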
In .NET, Akka.NET is the dominant actor model implementation for building these types of applications — and it’s used by hundreds of companies including Dell, Bank of America, Boeing, S&P Global, Becton Dickinson, U.S. Department of Energy, Zynga, and others.
However, the actor model presents some significant challenges for software teams who try to adopt it at scale, one of the most painful of which is diagnosing and debugging programming errors and network-related problems at scale. This is where OpenTracing and distributed tracing come into the picture.
Making the Complex Understandable at Low Cost with OpenTracing
The trouble with Akka.NET and distributed actors at scale is that at any given time your system can have tens of millions of interactions per second that look not too dissimilar to this:
Each actor inside an Akka.NET ActorSystem usually has some small amount of self-contained state, some message-handling code where its actual work executes, and some references to other actors it frequently communicates with. Actors communicate with each other by passing messages back and forth. 100% of message passing inside the actor model is asynchronous by default — actors will always process messages in the order in which they were sent, but it’s possible one actor might have to process messages from many other actors.
Actors can also transparently communicate with each other across process and network boundaries — thus it’s possible that a message sent to a single actor inside one process could end up being propagated to multiple processes. And therein lies the problem: this location transparency that makes actors so good at distributing work in a scalable fashion can make them acutely frustrating to debug when things go wrong in production: knowing where and when something went wrong becomes a non-trivial problem — especially when you have millions of operations like this occurring all the time.
This is where we’ve found OpenTracing to be exceptionally useful.
Akka.NET applications do not exist as single-threaded, monolithic processes; they are highly concurrent and often distributed. Traditional tracing tools that are commonplace in .NET, such as IntelliTrace, therefore often cannot help us answer the question “what went wrong?” inside our systems.
What we need are distributed tracing tools that can gather context from multiple processes, correlate them together, and tell a complete story from the point of view of a distributed system. We need the ability to answer questions like “what did akka.tcp://ClusterSys@10.11.22.248:1100/user/actorA/child2 send to akka.tcp://ClusterSys@10.11.22.249:1100/user/processB/child1 when it received msg1?” — only a distributed tracing tool running on both processes can effectively answer this question for us, and that’s exactly how we use OpenTracing at Petabridge.
OpenTracing Implementation and Benefits
Petabridge is in the business of professionally supporting users who are adopting Akka.NET at scale, and this means that we have to provide all kinds of tooling to help make their lives easier. This is ultimately why we went on to create Phobos, a turnkey monitoring and tracing solution for Akka.NET.
We wanted to help our users solve this Akka.NET observability problem by developing some kind of distributed tracing implementation they could easily include alongside their application code — but we had one small problem: there’s zero chance our customers would accept a single-vendor solution for something as critical as application performance monitoring and they definitely wouldn’t accept something that worked only for Akka.NET and not other important .NET technologies such as ASP.NET Core and SignalR.
OpenTracing solved this problem elegantly and simply for us: by targeting the OpenTracing standard, rather than any single vendored solution such as Zipkin or Jaeger, we could leave the door open for our customers to pick any tracing solution they wanted. We also knew that we’d more than likely get into the business of creating some OpenTracing-compatible drivers for .NET users who’d like to be able to use our products and others which rely on the standard.
Thus, we built Phobos’ tracing capabilities against the excellent OpenTracing C# library and designed all of our first-party integrations for tools like Zipkin and Jaeger to work against the OpenTracing bindings themselves. This significantly reduced our development costs and increased the freedom of choice enjoyed by our users.
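The design choice of coding against a standard interface, rather than a specific backend, can be illustrated with a small sketch. This is Python for brevity, not the OpenTracing C# API: `Tracer`, `BackendA`, `BackendB`, and `on_message` are hypothetical names, and the single `start_span` method is a deliberately simplified stand-in for a real tracer interface.

```python
from typing import List, Protocol


class Tracer(Protocol):
    """Hypothetical stand-in for a vendor-neutral tracing interface."""

    def start_span(self, operation: str) -> str: ...


class BackendA:
    """One illustrative backend; records spans under its own prefix."""

    def __init__(self) -> None:
        self.spans: List[str] = []

    def start_span(self, operation: str) -> str:
        self.spans.append(operation)
        return f"backend-a:{operation}"


class BackendB:
    """A second illustrative backend with the same method shape."""

    def __init__(self) -> None:
        self.spans: List[str] = []

    def start_span(self, operation: str) -> str:
        self.spans.append(operation)
        return f"backend-b:{operation}"


def on_message(tracer: Tracer, message: str) -> str:
    """Instrumentation code depends only on the interface, never on a backend."""
    return tracer.start_span(f"receive {message}")
```

Because `on_message` only knows about `Tracer`, swapping `BackendA` for `BackendB` requires no change to the instrumentation itself, which is the freedom of choice described above.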
Each time an actor sends or receives a message we create a new Span and we propagate the tracing identifiers into each of the messages we pass between actors, including over the network. We were able to build all of this so it worked behind the scenes without much in the way of manual instrumentation. And sure enough, OpenTracing allowed us to produce understandable graphs like this one, using Jaeger:
In this case, we’re modeling a “fan out” call where one node pings out to many others over the network — something that is notably difficult to capture with traditional tools because it involves a large amount of concurrent processing on multiple nodes and asynchronous communication between each. But with OpenTracing’s standards, it was easy for us to do this with tools like Jaeger, which has a great OpenTracing-compatible driver in C#.
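The span-per-message approach described above can be sketched roughly as follows. This is a simplified Python illustration, not Phobos or the OpenTracing API: `Span`, `Envelope`, `start_span`, `send`, and `receive` are hypothetical names, and the header keys `trace_id` and `span_id` are assumptions chosen for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Span:
    operation: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


@dataclass
class Envelope:
    """A message plus the tracing identifiers that travel with it."""
    payload: object
    headers: Dict[str, str] = field(default_factory=dict)


def start_span(operation: str, parent: Optional[Dict[str, str]] = None) -> Span:
    # Reuse the parent's trace id if there is one; otherwise start a new trace.
    trace_id = parent["trace_id"] if parent else uuid.uuid4().hex
    parent_id = parent["span_id"] if parent else None
    return Span(operation, trace_id, uuid.uuid4().hex[:16], parent_id)


def send(payload: object,
         parent: Optional[Dict[str, str]] = None) -> Tuple[Span, Envelope]:
    """Create a 'send' span and copy its identifiers into the outgoing message."""
    span = start_span("send", parent)
    headers = {"trace_id": span.trace_id, "span_id": span.span_id}
    return span, Envelope(payload, headers)


def receive(envelope: Envelope) -> Span:
    """Create a 'receive' span that is a child of the sender's span."""
    return start_span("receive", envelope.headers)
```

In a fan-out, the first node would call `send` once per recipient, passing the headers of the message it is currently handling as `parent`. Every resulting `receive` span then shares the same `trace_id`, which is what lets a tracing backend stitch spans from many processes into one graph.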
Creating OpenTracing Drivers in .NET
Once Phobos fully supported OpenTracing as an integration point for our end-users, we knew that any Akka.NET user who had an in-house or third party tracing solution that didn’t natively support OpenTracing could eventually find a way to hook things together using the OpenTracing library.
However, we decided to go the extra mile and take some existing tools that are either already popular in the .NET community or are becoming more highly sought after and reduce the barrier to entry by rolling out first party OpenTracing drivers and adapters for these products.
The first one we built was Petabridge.Tracing.Zipkin, a high performance OpenTracing-compatible driver for Zipkin; we wanted to use Zipkin ourselves in-house and wanted to support transport options like Kafka natively.
The second and more interesting one we built, at the behest of many .NET users, was a Microsoft Application Insights OpenTracing adapter for use with our Akka.NET tracing products.
We wanted to be able to support Application Insights as a tracing target for our users running on top of Azure, but there wasn’t a built-in solution for plugging Application Insights into OpenTracing. Thus, we followed a standards document written by the Microsoft team that allowed us to map the Application Insights conventions on top of OpenTracing’s lexicon and were able to create an open source software package, Petabridge.Tracing.ApplicationInsights, that bridged the gap between these two technologies and made Application Insights perfectly workable inside large-scale Akka.NET applications.
We discovered later, after shipping the package, that even Microsoft itself is using OpenTracing and our Application Insights driver to instrument some of their own cloud applications in-house. This is a great thing for everyone in the .NET ecosystem as a whole: as OpenTracing continues to gain traction it will help drive its use as an industry standard practice.
As we continue to push the boundaries on the size and speed of large-scale .NET systems, organizations like ours will continue to invest in technologies like OpenTracing and its promising monitoring counterpart, OpenMetrics, to cap the operating and management costs of running these systems. So far, OpenTracing has performed marvelously for our company and for the Akka.NET project as a whole, and we look forward to seeing more of it in the future.
Co-founder, Akka.NET project