TorfluxDB: Anonymous metrics from Go

Published in InterPlanetary Social Network, Dec 10, 2018

I love metrics, you love metrics, everybody loves metrics… but why? Software became too complex to reason about in isolation: looking at the source doesn’t tell you much about its execution performance; and running the code on your local machine doesn’t tell you much about its deployed performance.

For example, a recent optimization in the go-ethereum project introduced a new caching layer, which (as expected) resulted in lower processing times and fewer disk accesses (purple is the new code).

Oh look, the optimization makes expected metrics better!

Looking at other metrics over the span of multiple days however also revealed a slight increase in memory consumption, as well as the persistent presence of database read spikes.

Erm, apparently the optimization makes certain unexpected metrics worse…

Analyzing the charts is out of scope for this article; the point I want to make is that code changes can have unforeseen performance implications. Without real-time insight into what deployed software is doing in the wild, it’s impossible to detect undesired anomalies.

Collecting metrics not only helps with anomaly and regression detection, it also helps pinpoint performance bottlenecks and predict future growth issues. Long story short, if you’re running a distributed system, or rely on one, you need metrics.

Collecting metrics

Ok, we supposedly need metrics, but how do we go about collecting them? It’s possible to dream up infinite ways to put some numbers on internal aspects of your system, but there’s a lot more to metrics than meets the eye.

Suppose you want to measure the network traffic your system generates. If you log each individual network packet, you’ll end up with gazillions of events that you’ll never be able to analyze. If you log only entire network requests, you won’t be able to meaningfully compare those due to their varying time spans. Then there’s still the challenge of making your measurements consumable by visualizers.

A correct way of gathering metrics is to push the most granular data available into a local aggregator, which flattens the contiguous stream into a handful of dynamic numbers (e.g. average bandwidth over the last 5 mins). These values can then be periodically reported to a central database for analysis.

Instead of reinventing these measurement aggregators yourself, there are well defined constructs that play nicely with the tools in this ecosystem:

Gauges are instantaneous measurements of some values (e.g. the number of network connections you have open).
Meters are measurements of the rate of some events over time (e.g. bytes per second uploaded), able to also report moving averages (e.g. egress on average over the last 15 minutes).
Histograms measure the statistical distribution of values in some stream of events (e.g. processing time your system needed per request).

And there are a couple of derived metrics too that combine multiple simple ones from the above list:

Counters are fancier gauges with support for increment and decrement operations (e.g. it’s simpler to bump a number when a new connection is established and reduce it when the connection breaks vs. having to add a mechanism to exactly count the number of live connections).
Timers are a combination of meters and histograms, where the meter part is used to measure the rate of events, and the histogram is used to measure the distribution of durations (e.g. you can track both the frequency of API requests as well as their processing time).
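To make the idea of a local aggregator a bit more tangible, here is a minimal stdlib-only sketch of a counter and a meter. The method names loosely mirror what you’d expect from a metrics library, but this is my own simplified stand-in, not the implementation of any real package: the meter only tracks a mean rate, whereas production meters also keep moving averages.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Counter is a gauge that can be bumped up and down (e.g. live connections).
type Counter struct{ value int64 }

func (c *Counter) Inc(n int64)  { atomic.AddInt64(&c.value, n) }
func (c *Counter) Dec(n int64)  { atomic.AddInt64(&c.value, -n) }
func (c *Counter) Count() int64 { return atomic.LoadInt64(&c.value) }

// Meter flattens a stream of events into a running total and a mean rate.
type Meter struct {
	total int64 // kept first so 64-bit atomic access stays aligned
	start time.Time
}

func NewMeter() *Meter { return &Meter{start: time.Now()} }

// Mark records n new events (e.g. n bytes uploaded).
func (m *Meter) Mark(n int64) { atomic.AddInt64(&m.total, n) }

// Count returns the number of events seen so far.
func (m *Meter) Count() int64 { return atomic.LoadInt64(&m.total) }

// RateMean returns the average number of events per second since creation.
func (m *Meter) RateMean() float64 {
	elapsed := time.Since(m.start).Seconds()
	if elapsed <= 0 {
		return 0
	}
	return float64(m.Count()) / elapsed
}

func main() {
	var connections Counter
	egress := NewMeter()

	connections.Inc(1) // a client connected
	egress.Mark(4096)  // one chunk sent
	egress.Mark(1024)  // another, smaller chunk
	connections.Dec(1) // the client left

	fmt.Println("connections:", connections.Count())
	fmt.Println("bytes sent:", egress.Count())
	fmt.Printf("mean egress: %.0f B/s\n", egress.RateMean())
}
```

The caller only ever streams raw events in (Inc, Dec, Mark) and reads a handful of flattened numbers out, which is exactly the division of labor described above.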

Any system property that you can measure using the above standard concepts will instantly be portable, visualizable and analyzable with almost any tool in the ecosystem. And the best part? There are off-the-shelf libraries that will do these local aggregations for you, so your only concern is streaming the data in and collecting the flattened results out.

Collecting metrics from Go

As mentioned above, collecting and aggregating metrics is non-trivial. Luckily for Go users, there’s an awesome library from Richard Crowley, go-metrics, that does all the heavy lifting for us, supporting all of the aggregators defined in the previous section.

For the purpose of this article, we’ll create a simple demo program simulating a file server, which accepts network connections and streams random data:
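A minimal sketch of what such a simulator could look like, using only the standard library. The chunk sizes, pauses and the simulate helper are assumptions of mine for illustration, and unlike a real server this version stops once all simulated clients have been served, so that it can be run and checked quickly.

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"net"
	"sync"
	"sync/atomic"
	"time"
)

// serve streams a random number of random-sized chunks to one client,
// pausing briefly between chunks, and then hangs up.
func serve(conn net.Conn) {
	defer conn.Close()

	chunk := make([]byte, 4096)
	for i := rand.Intn(10) + 1; i > 0; i-- {
		size := rand.Intn(len(chunk)) + 1
		for j := range chunk[:size] {
			chunk[j] = byte(rand.Intn(256))
		}
		if _, err := conn.Write(chunk[:size]); err != nil {
			return
		}
		time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
	}
}

// simulate starts the server on a random loopback port, connects the given
// number of clients and returns how many bytes they received in total.
func simulate(clients int) (int64, error) {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer listener.Close()

	// Accept loop: every client is served on its own goroutine.
	go func() {
		for {
			conn, err := listener.Accept()
			if err != nil {
				return // listener closed, stop accepting
			}
			go serve(conn)
		}
	}()

	var (
		total   int64
		pending sync.WaitGroup
	)
	for i := 0; i < clients; i++ {
		pending.Add(1)
		go func() {
			defer pending.Done()

			conn, err := net.Dial("tcp", listener.Addr().String())
			if err != nil {
				return
			}
			defer conn.Close()

			// Read until the server hangs up, discarding the payload.
			n, _ := io.Copy(io.Discard, conn)
			atomic.AddInt64(&total, n)
		}()
	}
	pending.Wait()
	return atomic.LoadInt64(&total), nil
}

func main() {
	total, err := simulate(5)
	if err != nil {
		fmt.Println("simulation failed:", err)
		return
	}
	fmt.Printf("streamed %d bytes to 5 clients\n", total)
}
```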

So, how does the above code behave? How many clients are connected at any given point in time? What’s the total throughput of our simulated file server? We have absolutely no clue!

We could of course add some logs to the serve method, but that will just get noisy fast without any particular value:

Ha! You try to make heads or tails of these logs!

Instead of emitting logs at various levels of granularity and scrolling through without hope, we can make small measurements and aggregate them locally via the go-metrics library:

In the above code we simply created 2 aggregators: a counter for tracking the number of clients connected and a meter for measuring the rate at which data is being streamed to the clients. We update the counter in the serve preamble and update the meter for every transferred chunk.

Making nice charts out of these metrics is still a couple of steps away, so for now let’s just add a code segment that can print them to the console every now and again (just dump this into main):

Voilà! Even though we’ve designed a chaotic system with a lot of randomness involved, we managed to export a very accurate picture of how the system is behaving at any instant in time (full code).

Exporting metrics

Ok, we can collect the metrics we supposedly need, but how do we go about analyzing them? Keeping them in memory doesn’t do anyone any good and dumping them into a log file is a sure way to ignore them afterwards.

… here be dragons, the push vs. pull war…

There aren’t that many ways to export your metrics. You can either push them yourself to a central database; or you can expose them inside your application and have the central database pull them off of you. (You could also analyze the metrics within your own app, but that’s conflating responsibilities).

The main open source champions of the two data models are Prometheus for the pull model (sources) and InfluxDB for the push one (sources). Both have been written in Go (not that it matters)!

There are heated technical debates as to which option is better, mostly boiling down to tribalism and overemphasis of oversimplified corner cases. Generally it doesn’t matter at all which model you choose, you’ll need to solve the same problems anyway. Go with the tools that help your use case the most.

One curious case where the pull model breaks is when you have a dynamic set of machines you’d like to monitor, but they are outside of your control. This is generally the case when you’re only building the software, but not running it yourself. This is also the primary case I myself am interested in!

This case is interesting not only due to its technicality, but also from a social perspective concerning data protection and privacy. Before focusing on this aspect however, let’s get our hands dirty a bit.

Exporting metrics into InfluxDB

First things first, what is InfluxDB? It’s a time series database. A what?!

A “time series” is a very fancy way of saying “timestamps with some numerical fields associated with them” (e.g. yesterday→{min-degrees: 5, max-degrees: 15}, today→{min-degrees: 4, max-degrees: 17}).

A time series database is kind of like any other database you’ve worked with, but optimized for storing (and querying) massive amounts of these timestamp-to-number mappings. Since the data model is so restricted, time series databases can also aggregate values while querying to minimize network traffic.

Since an endless stream of timestamp-to-number points is useless without any context, these databases also allow you to tag each data point with arbitrary metadata that you can query on (e.g. today{city: Prague}→{min-degrees: 4}, today{city: Barcelona}→{min-degrees: 25}).

Lastly, it’s nice and all that we can shove such a “time series” into a database, but wouldn’t it be grand if we could push multiple different series too into the same database (kind of like tables in usual database lingo)? Of course, this is supported by grouping the data points into individual measurement streams (e.g. temperatures ⇒ [today{city: Prague}→{min-degrees: 4}, today{city: Barcelona}→{min-degrees: 25}]).

Sum all of the above up and you’ve got a definition of InfluxDB: it’s a database that can store multiple streams of measurements, where each stream is a mapping of timestamps to numerical fields, with optional associated metadata tags.
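The data model is easier to digest as a type. Below is a small sketch, with names of my own choosing, that models one data point and renders it in InfluxDB’s text-based line protocol (measurement, then tags, then fields, then a nanosecond timestamp). It assumes float fields only and skips the escaping that keys or values containing spaces, commas or equals signs would need.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Point is a single entry in a measurement stream.
type Point struct {
	Measurement string             // stream name, e.g. "temperatures"
	Tags        map[string]string  // queryable metadata, e.g. city=Prague
	Fields      map[string]float64 // the measured numbers
	Timestamp   int64              // nanoseconds since the Unix epoch
}

// Line renders the point as "measurement,tags fields timestamp". Tags and
// fields are sorted so the output is deterministic. Escaping of special
// characters is omitted for brevity.
func (p Point) Line() string {
	tags := make([]string, 0, len(p.Tags))
	for key, value := range p.Tags {
		tags = append(tags, key+"="+value)
	}
	sort.Strings(tags)

	fields := make([]string, 0, len(p.Fields))
	for key, value := range p.Fields {
		fields = append(fields, key+"="+strconv.FormatFloat(value, 'f', -1, 64))
	}
	sort.Strings(fields)

	head := p.Measurement
	if len(tags) > 0 {
		head += "," + strings.Join(tags, ",")
	}
	return fmt.Sprintf("%s %s %d", head, strings.Join(fields, ","), p.Timestamp)
}

func main() {
	p := Point{
		Measurement: "temperatures",
		Tags:        map[string]string{"city": "Prague"},
		Fields:      map[string]float64{"min-degrees": 4, "max-degrees": 17},
		Timestamp:   1544400000000000000,
	}
	fmt.Println(p.Line())
	// prints: temperatures,city=Prague max-degrees=17,min-degrees=4 1544400000000000000
}
```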

Before we can start pushing data into InfluxDB though, we need an instance of it, don’t we? 😋 To keep things simple, we’ll just run one via docker:

docker run -p 8086:8086 --env=INFLUXDB_DB=mydb influxdb:alpine

The above command starts an ephemeral InfluxDB instance with its API port exposed to localhost and also pre-creates an empty database mydb. InfluxDB supports segregating multiple data-sets from each other, optionally requiring user authentication on top, but we won’t be using that here.

Exporting metrics into InfluxDB from Go

With our InfluxDB instance up and running, let’s get back to our demo server’s metrics and instead of printing them periodically to the console, push them to the time series database.

First things first, we need to be able to talk to the database from Go. Luckily InfluxDB is written in Go, so naturally there’s an official client library for it too, which we can simply import as github.com/influxdata/influxdb/client/v2.

Exporting the metrics boils down to creating an HTTP client to the database server; transforming our custom Go metrics into InfluxDB time series points; and finally pushing them.
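To show what those three steps amount to on the wire, here is a hedged sketch that deliberately skips the official client and talks to the InfluxDB 1.x write endpoint with nothing but net/http. The writeURL and push helpers and the sample values are my own illustration; the database name and address match the docker command from earlier.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// writeURL builds the write endpoint for a database, asking InfluxDB to
// interpret timestamps with nanosecond precision.
func writeURL(addr, db string) string {
	query := url.Values{"db": {db}, "precision": {"ns"}}
	return strings.TrimRight(addr, "/") + "/write?" + query.Encode()
}

// push sends a batch of line-protocol points in a single HTTP request.
func push(client *http.Client, addr, db string, lines []string) error {
	body := strings.NewReader(strings.Join(lines, "\n"))

	res, err := client.Post(writeURL(addr, db), "text/plain; charset=utf-8", body)
	if err != nil {
		return err
	}
	defer res.Body.Close()

	// A successful write is acknowledged with 204 No Content.
	if res.StatusCode != http.StatusNoContent {
		return fmt.Errorf("influxdb write failed: %s", res.Status)
	}
	return nil
}

func main() {
	// Points without a timestamp get the server's current time assigned.
	lines := []string{
		"connections count=3i",
		"bandwidth egress=52428.8",
	}
	if err := push(http.DefaultClient, "http://localhost:8086", "mydb", lines); err != nil {
		fmt.Println("push failed:", err)
	}
}
```

In a real reporter you would call push from a ticker loop every few seconds, with the values read from your local aggregators.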

Since we replaced our console logger with the above reporting, executing the demo (full code) will just sit there silently. If we however look at the InfluxDB logs, we’ll find a steady stream of write operations:

[httpd] 172.17.0.1 - - [19/Nov/2018:16:39:03 +0000] "POST /write?consistency=&db=mydb&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" 9fece4fb-ec19-11e8-8009-0242ac110002 9768
[httpd] 172.17.0.1 - - [19/Nov/2018:16:39:08 +0000] "POST /write?consistency=&db=mydb&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" a2e97c8a-ec19-11e8-800a-0242ac110002 4109
[httpd] 172.17.0.1 - - [19/Nov/2018:16:39:13 +0000] "POST /write?consistency=&db=mydb&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" a5e52ec1-ec19-11e8-800b-0242ac110002 4228

Visualizing metrics

Ok, we’ve managed to export the metrics we supposedly needed into this thing called a time series database, but that didn’t get us closer to analyzing them. If anything, we’re seemingly further away than the console logs, which we could at least take a peek at.

The final step is to create meaningful visualizations of these time data points. Although I’m sure there are many tools that could be used, one of the current leaders is an open source project called Grafana, also written in Go (sources)!

As with InfluxDB, we don’t want to install any messy dependencies on our local machine, so we’re going to launch a Grafana instance via docker too:

docker run -p 3000:3000 grafana/grafana

You can access your instance via http://localhost:3000 and sign in with the default credentials admin / admin.

After logging in, you’ll be greeted with a dashboard telling you that the next thing you need to do is add a new data source. Most options are fairly self-explanatory; perhaps the only curious one is the “Browser” access. It tells Grafana to access InfluxDB through the browser, saving us a step linking the two docker containers.

You’ll see a green “Data source is working” notification if everything is set up correctly. The last step is to create a new dashboard (side menu plus button), add a graph visualization and fill it with data. You can find the Edit button if you click the “Panel title” header bar.

Before editing the actual fields, let’s recap the metrics we have exported from our file server simulator:

We have two measurement streams exported: connections and bandwidth, the former containing the count field whilst the latter the egress. To create our first visualization, select connections for the “select measurement”; pick count for “value” inside “field(value)”; and remove “time($__interval)” from the query rule. You should end up with something like this:

You can close the chart editor with the small X on the top right corner (might be middle-right on your screen if the chart preview is also shown). Since most probably you have only a few data points fed into the time series database yet, adjust the time range from “Last 6 hours” to “Last 5 minutes” on the top right.

Repeat the same for bandwidth.egress and voilà, you have 2 beautiful charts showing how the connection count and egress bandwidth evolve within your app, along with historical data retention and infinite analysis capabilities!

Digging further into Grafana is out of scope, but I wholeheartedly recommend exploring all its capabilities, as you’ll find it an exceedingly capable tool. Next up however, anonymity!

Anonymous metrics

If you are running your own private infrastructure and want to collect metrics to track what your machines are doing, you are pretty much done. Go, have a blast with your newly found knowledge!

If, however, you are a software vendor wanting to collect telemetry from devices you don’t necessarily own, you’ll quickly run into resistance: people will freak out (when Caddy introduced metrics, Hacker News blew up)!

Honestly, is anonymity a legitimate concern? Yes, yes it absolutely is! When the select few internet giants are data mining your every movement to feed you ads and use all your personal data for perfecting their own services, it’s natural to have a huge backlash against sweeping up metadata.

If you’re Microsoft, Facebook, Google or Apple, then you have a “get out of jail free” card because you’ve already locked the entire world into your ecosystem. If you’re a small fish however, you have to improvise… people will much more readily accept telemetry collection if you can prove it’s not possible to identify them (and as long as you don’t go overboard and collect too many details).

In the case of the Caddy web-server, do I care that it uploads how many requests it served yesterday? Nope! Do I care that the telemetry server knows *I* served that many requests yesterday? Hell yeah! Whilst there are understandable reasons for collecting metrics, there is no technical reason whatsoever for collecting personal metadata along the way.

We kill people based on metadata. ~General Michael Hayden (NSA)

But how can we break the link between the collection and the identification of the measurements? The answer is to tumble the data stream through the Tor network… and we’re going to do that from Go!

Wait, we can use Tor from within Go? Yep, I have an article on it, go read it!

How would this work exactly? As highlighted in the above section, the issue people have with telemetry is that it is not only used to make better software, but also abused to identify and target people. By passing all the telemetry streams through Tor, the aggregator is left in the dark as to where the source is located, so the possibility of identification is reduced by a huge margin.

How Tor Works (Electronic Frontier Foundation)

I can already hear some of you getting upset: “If the software vendor is collecting the metrics, what good does using Tor do, since they can always issue an update to disable it? We’re still trusting the vendor!” My answer is that yes, you are still trusting the vendor, but only the current incarnation of the vendor. Companies change, pro-privacy companies might end up doing nasty things (Hi Google 👋). Collecting metrics through the Tor network ensures that if a company abandons their privacy beliefs, they are still unable to abuse data collected in the past.

Anonymous metrics from Go

We know the theory, stream the metrics through Tor… but how do we do that in practice, from Go? First things first, we need to actually connect to the Tor network. We can do that in 3 ways:

Use a pre-installed Tor from the local machine
Statically embed a pre-compiled Tor library
Statically embed a Go CGO Tor wrapper

Using a pre-installed Tor is the easiest: import github.com/cretz/bine/tor and create a gateway from within Go via gateway, err := tor.Start(nil, nil). This assumes tor is available through the PATH environment variable. Of course this is almost never the case, so we won’t be using this option.

The second approach is to pre-build Tor as a static library and have Go pick it up by importing github.com/cretz/bine/process/embedded, then creating the gateway by passing &tor.StartConf{ProcessCreator: embedded.NewCreator()} to the previous tor.Start method. The issue here is that Tor is written in C and Go’s build system doesn’t allow custom build steps. Pre-building the library entails Makefiles and copying the resulting binaries into magical GOPATH locations, so we won’t be using this option either.

The last option is to create a proper Go library out of the Tor sources via a ton of CGO black magic. I’ve already done this in github.com/ipsn/go-libtor, so you can go get -u -a -v -x github.com/ipsn/go-libtor and Go will download and automagically build it for you. A few caveats:

-a is needed because Go doesn’t detect CGO changes in includes.
-v -x keeps your sanity, because building Tor will take 3+ minutes.
go-libtor supports Android and Linux amd64/386/arm64/arm libc/musl.

Let’s integrate this! First up, the imports. We need to import the prerequisites for Tor, github.com/cretz/bine/tor and github.com/ipsn/go-libtor.

After sorting out the imports, we can replace the old InfluxDB client creation with a Tor proxy instantiation and a parametrization to use the Tor proxy as the TCP dialer:
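The essence of that change is swapping the TCP dialer underneath the HTTP client. Here is a stdlib-only sketch of just that part: newClient accepts any dial function, and recordingDialer is a hypothetical stand-in of mine that refuses every connection so the wiring can be tried without Tor. In the real program the dial function would come from the Tor gateway (with bine I’d expect that to be the DialContext method of the dialer the gateway hands out, but treat that as an assumption and check the library docs). The hostname is a placeholder too.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// DialFunc matches the DialContext hook of http.Transport.
type DialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// newClient returns an HTTP client that opens every TCP connection through
// the supplied dialer. No proxy is configured on the transport, so nothing
// is picked up from the environment behind our back.
func newClient(dial DialFunc) *http.Client {
	return &http.Client{
		Transport: &http.Transport{DialContext: dial},
		Timeout:   time.Minute,
	}
}

// recordingDialer is a stand-in for a Tor dialer: it counts the dial
// attempts and refuses all of them.
func recordingDialer(hits *int) DialFunc {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		*hits++
		return nil, errors.New("no Tor gateway in this sketch, refusing " + addr)
	}
}

func main() {
	var hits int
	client := newClient(recordingDialer(&hits))

	// Hypothetical public InfluxDB address; the request never leaves the
	// machine because the stand-in dialer refuses it.
	_, err := client.Get("http://influx.example.com:8086/ping")
	fmt.Println("dial attempts:", hits)
	fmt.Println("error:", err)
}
```

Because the HTTP client never dials on its own, every request it makes is forced through whatever dialer you hand it, which is the property we want for anonymity.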

Note, we can’t use localhost any more for the database address, as we’re passing all the network traffic through Tor. You’ll need a publicly reachable instance for now, but don’t worry, we’ll get rid of this limitation in a moment.

And… that’s it! We managed to pull off the same metrics gathering as in the previous demos, but anonymized over the Tor network, keeping the location of our user private!

A word of warning: when you go through the Tor network, your network traffic is visible at the last hop, the Tor exit node. Never use plain HTTP over Tor; always use HTTPS.

Anonymous metrics from Go, within Tor

Although the above code works and satisfies all anonymity requirements, we can still make it a bit more robust. As mentioned, passing the traffic through Tor requires a public endpoint (your database is not anonymous) and proper security requires HTTPS (you need to obtain an SSL certificate). We can use Tor to get rid of both limitations at the same time!

Tor features a concept called onion services, where not only the initiator, but the destination of a network connection can also use the Tor network. This is achieved by allowing service providers to connect to Tor and advertise some TCP service they are running, whilst using the Tor network as a rendezvous point. Since both initiator and destination would be using Tor, neither needs public IP addresses, open firewall ports or specialized encryption.

If a client wants to connect to an onion service, it needs to know its public key, which is used to ensure anonymity and traffic privacy. These keys are usually encoded as onion URLs (end in .onion).

Ok, enough theory! How do we use it? Although we could go and create a full Tor proxy for InfluxDB here, that would be way out of scope, so I’ve done the dirty work for you! Enter TorfluxDB!

TorfluxDB is a relatively small open source project (a lot of work went into its simplistic design), aiming to put a Tor proxy in front of an InfluxDB database. The proper way of course is to have a separate process (or docker container) in front of InfluxDB, but that’s a lot of configuration with little value for this article, so I’ll refer you to the project’s README for a production setup.

For a quick-and-dirty approach, the project also features a docker image that has both InfluxDB and the Tor proxy pre-configured in a single container. We can replace our previous InfluxDB instance with it (shut the previous down):

docker run -p 8086:8086 --env=INFLUXDB_DB=mydb ipsn/torfluxdb

It’s the same command, just a different image! Among all the usual InfluxDB logs however, you’ll see the Tor proxy talking to you:

For all the nitty-gritty details, please see the README. The only thing we are interested in now is the line stating that our proxy is online and the particular URL it’s accessible at (long public key with some .onion TLD to make it more easily recognizable). Let’s just take that URL and feed it into our demo server’s InfluxDB client configuration:

If you run that… woah… it “just works”. Our demo application (full source) is able to reach the TorfluxDB instance without actually having a clue where the database is, or without the database needing to have a publicly reachable IP address! Everything encrypted, everyone’s location anonymized!

Tracking metrics

There’s one last topic that we need to cover before we can call it a day. In our demo application, we only had a single instance reporting the metrics. What happens if there are multiple machines sending data? With the current code, everything gets mangled together.

We need the ability to differentiate machines from one another. Yes, we could give them names, but that makes them easy to impersonate in an adversarial setting. We need something unpredictable! The simplest solution is for each machine to pick a long, random session ID for itself to report under.

Since InfluxDB already supports adding arbitrary metadata to a data point, we can extend our demo program to include an ephemeral id tag for every measurement it reports (full source).
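Picking such an ID takes only a few lines. A minimal sketch, assuming 128 bits of randomness is long enough for your threat model; newSessionID is a name I made up for illustration:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newSessionID returns a fresh, unpredictable 128-bit identifier encoded
// as 32 hexadecimal characters.
func newSessionID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	id, err := newSessionID()
	if err != nil {
		fmt.Println("failed to generate session id:", err)
		return
	}
	// The id travels as a tag, i.e. next to the measurement name and ahead
	// of the fields in a line-protocol data point.
	fmt.Printf("connections,id=%s count=3i\n", id)
}
```

Generating the ID once at startup and keeping it only in memory means a restart yields a fresh identity, which ties in with the reset idea discussed in the questions below.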

With the id tags reported, we can tweak our Grafana settings a bit so that the plotted points are grouped by them.

From this point onward, we can start arbitrarily many instances of our demo application, as each will appear as its own set of data points!

Two questions beg some answers:

Q: Can’t the random session IDs be impersonated?
A: They can, if you know them. However, since the entire connection stream is end-to-end encrypted, only the metrics server knows them. If that is too much trust, sessions could be extended to rely on digital signatures, but that would need more processing logic in both the client as well as the proxy.

Q: Doesn’t the use of session IDs break the anonymity?
A: It mostly depends on what data you collect, but if long-term statistics are an issue for you, the session IDs could be periodically reset (e.g. on startup).

Epilogue

Phew! This was a long read. Still, I’m hopeful it was worth your time!

I’ve been developing software for quite a few years now, but honestly it was only in the last year or so that I actually understood the importance of metrics. Even so, jumping into this whole “ecosystem” can be quite intimidating with all the moving components. I hope this article can give you a head start in getting to the end of the rabbit hole.

As for the anonymity part, I acknowledge that this is not something most care much about (or should really, in trusted environments). If, however, you’d like to collect metrics from machines out of your control, the concepts I presented here could make it much more acceptable to your users, protecting them from any unintended data leaks or hacks your infrastructure suffers.

It’s time to take responsibility for the data you collect! Your users often do not know any better. With the anonymity tools given to you here, you have no more excuses. Do the right thing!

Of course, as with any public data collection, you are bound to receive false metrics from certain people who just want to watch the world burn. That is however the case with any system, not a particularity of my approach. Your only solution is to monitor and filter out anomalies reported, but that’s way out of scope of this article.

Until next time…

PS: You can find TorfluxDB at https://github.com/ipsn/go-torfluxdb