Building a Robust Monitoring Scheme for Kin Ecosystem on Top of Datadog

Doody Parizada
Jan 24, 2019 · 4 min read
Image for post
Image for post

At Kin, we are building open source tools to help developers quickly onboard and build great applications on top of the Kin Blockchain.

We understand that running infrastructure and services on top of the blockchain can be a pain and, sometimes, destructive for innovation. This is why we provide our partners and Kin Developer Program participants a managed service that we run with ensured SLA.

Running a software as a service (SAAS) requires an extensive monitoring plan. We chose Datadog as our monitoring platform, and use it to collect metrics, display graphs, and alert us on different events.

As a simplified overview, our backend stack is divided into four layers:

  • Blockchain Core: Performs consensus, pulls in blocks, mints blocks
  • Blockchain API: Gives developers a simple API on top of Blockchain Core
  • Payment Service: Manages incoming and outgoing payments using the Blockchain API
  • Marketplace Service: Manages users and wallets, and powers the earn/ spend/peer-to-peer flows using the Payment Service

Each of these services requires a unique set of monitoring schemes, of which, I’ll give a few examples.

Blockchain Core

Loss of consensus: Track when a core is out of sync. When a core falls too far behind we might want to replace it.

Image for post
Image for post
Loss of consensus

Quorum size: The number of nodes participating in the consensus. We must have five out of seven.

Image for post
Image for post
Quorum size

Blockchain API

As its implemented in go, we can monitor number of goroutines.

Image for post
Image for post

Number of developer connections: Number of ecosystem apps connected to the API.

Image for post
Image for post
Client connections

Rate of 5xx errors: As a REST server, we can follow 5xx errors.

5xx errors

Payment Service

Queue size and number of idle vs busy workers: Our payment service provides an asynchronous API using a queue on its backend. We monitor the size of the queue and the number of available workers.

Image for post
Image for post
Queue size

Time to submit a transaction: Monitor the time it takes to submit a transaction to the blockchain.

Image for post
Image for post
Transaction acceptance time

Number of concurrent transactions: We can view the load we’re creating on the blockchain using this metric.

Image for post
Image for post
Concurrent transactions

We also monitor the amount of Kin left in each server wallet and the rate of Kin being paid to users for this service.

Marketplace service

Request Latency: The marketplace service is used directly by our SDK, which is used by ecosystem applications, and it is hit by real users. We must make sure we provide a fast experience.

Image for post
Image for post
Request Latency

Time of earn/spend complete flow: Our SDK provides simple earn and spend flows. We monitor the completion of the entire flow of earn and spend.

Image for post
Image for post
Earn/ Spend completion

Other than these custom metrics, we also collect dry metrics such as %CPU, memory, and disk space.

Tagged Metrics

We use metric reporting similar to Datadog, which allows us to tag metrics and filter them later on. Some of the tags we use are:

  • app_id: As we provide services to multiple partners we want to be able to view the SLA and performance per application.
  • git_version: We use a blue/green deployment methodology. Each service is “colored” with the git_version of the code. When we deploy a new version, we can make sure there are no regressions in our services by comparing the metrics of each version.
Image for post
Image for post
git_version tag filter

The following is the latency of the payment-service across deployments of different versions:

Image for post
Image for post
Latency across git versions

Failure Reporting and Debugging

We strive to provide a 100 percent failure free experience to Kin apps, but sometimes failures do occur. When we detect a failure, it is shipped as an event to Datadog where we can be alerted and find the reason for the failure.

To debug such failures and run post-mortems, we ship all our logs to Datadog’s centralized logging service. We trace all our logs with a unique request_id, letting us follow the entire flow of the request across services.

I hope I’ve given you some insight into how we run our monitoring stack at Kin Ecosystem. If you have questions or suggestions, please feel free to comment here or on our Github projects:

Kin Blog

Kin is money.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store