‘all Monitor

How we learned to stop worrying and love data monitoring.

Alan Cole
Chip
3 min read · Mar 4, 2019

“It is a capital mistake to theorise before one has data,” or so said Sherlock Holmes in ‘A Scandal in Bohemia’, and who are we to disagree with a man in a deerstalker?

Here at Chip we collect a ton of the stuff, from the essential, super secret, super secure account information of our customers to the less sensitive and more mundane, like how many changes we’ve made to the code that runs our site (a little over seven thousand, or about 40 a week, because I know you were wondering). This data flows into our databases and is available at the touch of a button (apart from the secret stuff) to our developers.

Data on demand was great: it allowed us to build a service that helps tens of thousands of our customers save. But there was a problem, a big one. We had to know what we were looking for in order to find it. Spotting mistakes was left to hunches or feelings, and whilst we’ve got pretty good instincts, that wasn’t good enough.

Enter monitoring.

Of course, we had some monitoring. We knew generally what was happening at any given moment, and we always knew what was happening with our customers’ money, but spotting problems for our customers needs more than that. So we switched to a new approach that can most readily be described as ‘it’s a number, so let’s make it a metric.’

Live shot of our developers tracking all the numbers.

Monitor All The Things

So what did we actually change, and how is that helping us? Firstly, we decided to give the Chip community a little more information about the state and performance of our service by launching a status website, status.getchip.tech, where our team can give feedback to customers in real time, even if unexpected events disrupt things.

Our main change was a shift to a “Data Warehouse” (we chose Google BigQuery), which allows us to aggregate all of our data from lots of sources. We can take it, secure it, anonymise it, and write queries on it for thousands of different data points:

  • How many saves were there in the last hour?
  • How many customers have active bank connections?
  • How many have failed ones?
  • How many people opened the app in the last hour?
  • How many customers withdrew money in the last 24 hours?

We collected a host of such questions from relevant people in the business, then built a query for each, allowing us to display those numbers on a dashboard for all our staff (we’ve written our own internal tool, called Dashi, to display all this data in a usable and relevant way). This gives every member of the team the opportunity to spot abnormalities.
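To give a flavour of what “a question becomes a query” means, here is a minimal sketch in Python. It is not our actual BigQuery code or schema: the `Event` record, the event names, and the `count_recent` helper are all made up for illustration, and the real thing runs as queries against the warehouse rather than over an in-memory list.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """Hypothetical event record; the real warehouse schema is not shown here."""
    kind: str              # illustrative names such as "save" or "withdrawal"
    occurred_at: datetime


def count_recent(events: Iterable[Event], kind: str,
                 window: timedelta, now: datetime) -> int:
    """Count events of one kind that happened within `window` before `now`."""
    cutoff = now - window
    return sum(
        1 for e in events
        if e.kind == kind and cutoff < e.occurred_at <= now
    )


# Made-up sample data, fixed in time so the result is repeatable.
now = datetime(2019, 3, 4, 12, 0, tzinfo=timezone.utc)
events = [
    Event("save", now - timedelta(minutes=5)),
    Event("save", now - timedelta(minutes=50)),
    Event("save", now - timedelta(hours=3)),
    Event("app_open", now - timedelta(minutes=10)),
    Event("withdrawal", now - timedelta(hours=20)),
]

# "How many saves were there in the last hour?"
saves_last_hour = count_recent(events, "save", timedelta(hours=1), now)

# "How many customers withdrew money in the last 24 hours?"
# (this counts withdrawal events; a per-customer count would need a customer id)
withdrawals_last_day = count_recent(events, "withdrawal", timedelta(hours=24), now)
```

Each question boils down to the same shape: pick an event type, pick a time window, count. That is what makes it cheap to keep adding new numbers to a dashboard.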

We recently invested in a set of shiny new TVs to be placed around our office, on which we’ll show these real-time dashboards of core metrics. If someone in our marketing team knows that the answer to the question ‘How many app downloads were there in the last 24 hours?’ is normally ‘50’ but spots that today it’s ‘30’, they can put a plan into action to find out why, without having to get all the devs involved.

The information passively gives everyone the power to ensure our customers’ experience of the app is the best it can be.

A look at our customer status website, keeping our customers informed.

What’s Next?

The new monitoring drive here at Chip is just the start. As we build out these stats and metrics, we’ll be well placed to say exactly what a ‘normal’ system looks like. This will allow us to extend our automatic monitoring and alerts, so that not only our team members but also our systems themselves can spot problems before they happen. Computers watching computers, ‘Statception’ if you will.
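One simple way a system might decide what ‘normal’ looks like is to compare the latest value of a metric with its recent history. The sketch below is only an assumption about how that could work, not a description of our alerting: the `is_abnormal` helper, the three-standard-deviation threshold, and the download figures are all invented for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_abnormal(history: Sequence[float], latest: float,
                threshold: float = 3.0) -> bool:
    """Return True when `latest` sits more than `threshold` sample
    standard deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to say what "normal" is
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


# Made-up daily app download counts hovering around 50.
daily_downloads = [48, 52, 50, 49, 51, 53, 47]

drop_flagged = is_abnormal(daily_downloads, 30)    # a day with only 30 downloads
usual_flagged = is_abnormal(daily_downloads, 51)   # an ordinary day
```

With this history, a day of 30 downloads is flagged and a day of 51 is not. A real setup would likely need to account for things like weekday patterns and growth over time, which a plain mean and standard deviation ignore.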
