Building a Robust Monitoring Scheme for Kin Ecosystem on Top of Datadog
At Kin, we are building open source tools to help developers quickly onboard and build great applications on top of the Kin Blockchain.
We understand that running infrastructure and services on top of the blockchain can be a pain and, sometimes, destructive for innovation. This is why we provide our partners and Kin Developer Program participants a managed service that we run with ensured SLA.
Running a software as a service (SAAS) requires an extensive monitoring plan. We chose Datadog as our monitoring platform, and use it to collect metrics, display graphs, and alert us on different events.
As a simplified overview, our backend stack is divided into four layers:
- Blockchain Core: Performs consensus, pulls in blocks, mints blocks
- Blockchain API: Gives developers a simple API on top of Blockchain Core
- Payment Service: Manages incoming and outgoing payments using the Blockchain API
- Marketplace Service: Manages users and wallets, and powers the earn/ spend/peer-to-peer flows using the Payment Service
Each of these services requires a unique set of monitoring schemes, of which, I’ll give a few examples.
Loss of consensus: Track when a core is out of sync. When a core falls too far behind we might want to replace it.
Quorum size: The number of nodes participating in the consensus. We must have five out of seven.
As its implemented in go, we can monitor number of goroutines.
Number of developer connections: Number of ecosystem apps connected to the API.
Rate of 5xx errors: As a REST server, we can follow 5xx errors.
Queue size and number of idle vs busy workers: Our payment service provides an asynchronous API using a queue on its backend. We monitor the size of the queue and the number of available workers.
Time to submit a transaction: Monitor the time it takes to submit a transaction to the blockchain.
Number of concurrent transactions: We can view the load we’re creating on the blockchain using this metric.
We also monitor the amount of Kin left in each server wallet and the rate of Kin being paid to users for this service.
Request Latency: The marketplace service is used directly by our SDK, which is used by ecosystem applications, and it is hit by real users. We must make sure we provide a fast experience.
Time of earn/spend complete flow: Our SDK provides simple earn and spend flows. We monitor the completion of the entire flow of earn and spend.
Other than these custom metrics, we also collect dry metrics such as %CPU, memory, and disk space.
We use metric reporting similar to Datadog, which allows us to tag metrics and filter them later on. Some of the tags we use are:
- app_id: As we provide services to multiple partners we want to be able to view the SLA and performance per application.
- git_version: We use a blue/green deployment methodology. Each service is “colored” with the git_version of the code. When we deploy a new version, we can make sure there are no regressions in our services by comparing the metrics of each version.
The following is the latency of the payment-service across deployments of different versions:
Failure Reporting and Debugging
We strive to provide a 100 percent failure free experience to Kin apps, but sometimes failures do occur. When we detect a failure, it is shipped as an event to Datadog where we can be alerted and find the reason for the failure.
To debug such failures and run post-mortems, we ship all our logs to Datadog’s centralized logging service. We trace all our logs with a unique request_id, letting us follow the entire flow of the request across services.
I hope I’ve given you some insight into how we run our monitoring stack at Kin Ecosystem. If you have questions or suggestions, please feel free to comment here or on our Github projects: