DApp Infrastructure Design (Part I): Reliable Ethereum Event Tracking with Kubernetes, Docker, and Parity
Learning Solidity and writing a smart contract is relatively easy, but we’ve found that the harder technical challenge is designing a DApp backend infrastructure that is secure, scalable, and snappy. Unlike traditional apps, DApps rely on the inherently probabilistic nature of blockchain state and utilize Ethereum components that are relatively new and under active development. This is Part I of a series of articles about the architectural patterns and practices we’ve learned that might benefit other DApp developers.
Among the earliest lessons for anyone developing DApps on Ethereum is that logging events in smart contracts is an efficient way to report state changes and keep track of actions performed. Because events are recorded in transaction logs and can be replayed, but are not kept in expensive contract storage, they are a cost-effective workaround for “storing” blockchain data. Events are also emitted in real time, allowing for real-time discovery and reporting of actions.
Events sound great! Everyone should use events!
Hold on, not so fast… Another early, and potentially painful, lesson for DApp developers is that event watching is unreliable: an event is only useful if your DApp detects it when needed. If you need real-time data, but event discovery is delayed or events are missed altogether, this could result in a poor user experience or complete DApp failure.
Our solution: DApp developers who need reliable event tracking should consider adding Kubernetes/Docker to their toolkit. While these technologies do come with a bit of a learning curve, they neatly solve very specific problems in the current Ethereum technology ecosystem, most of them related to reliability.
Problems with current Ethereum infrastructure technology
At CoinAlpha, we have been running many experiments and doing research on Ethereum infrastructure and reliability to support our product. We have been using different cloud providers such as AWS, Google Cloud, and Digital Ocean to host nodes, in addition to running our own local nodes. We have set up node monitors that allow us to track node performance and reliability, as the sample screenshot shows.
Since our projects require as close to real-time event tracking as possible, we have had to come up with solutions to address the current issues with Ethereum infrastructure:
- Ethereum nodes continually drop peers: For a number of reasons, Ethereum nodes periodically fail; they either stop syncing or fall behind, failing to have the latest blockchain information. The main cause for this is a drop in peer count, a kiss of death for any node. Drops in peer count can occur at any time, usually because a node gets stuck on a “bad” block that has been reorganized or ends up on a sidechain. Ethereum nodes work like the popular kids’ table; if your node ends up on a bad block, your node gets blacklisted or tainted, and all peers will try to drop it and prevent it from reconnecting.
- Infura WebSockets are unreliable: A few months ago, Infura released WebSocket connectivity to support event tracking. This lets developers track events without having to run their own nodes. While this was a substantial improvement over not supporting events at all, it is still not a perfect solution. Infura’s WebSocket endpoints are not able to maintain a constant connection and will drop connections every few minutes. Anyone who used the first versions of the Augur desktop client, which used Infura to build and maintain a database of Augur-related events, probably noticed this effect.
- Live event watching using watch() can miss events: Nodes regularly fall behind in syncing by a few blocks. One main reason for this is hard drive read/write speed. Paying up for high-speed SSDs is a possible mitigation, but with the current Ethereum mainnet database exceeding 100GB, cloud hosting bills begin to snowball pretty quickly. In addition, this still doesn’t address the peer-dropping problem. If your node falls behind and then catches up on syncing, we have found that live events may be missed.
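The “falling behind” failure mode described above can be spotted by comparing the latest block height reported by each node. The following is a minimal sketch of that idea, not our actual node monitor: the function name, the node names, and the `max_lag` threshold are all hypothetical, and the block heights are assumed to have been collected already.

```python
from typing import Dict, List


def find_lagging_nodes(latest_blocks: Dict[str, int], max_lag: int = 3) -> List[str]:
    """Return the names of nodes more than `max_lag` blocks behind the best node.

    `latest_blocks` maps a (hypothetical) node name to the latest block
    number that node reports.
    """
    if not latest_blocks:
        return []
    best = max(latest_blocks.values())
    return sorted(
        name for name, height in latest_blocks.items() if best - height > max_lag
    )


if __name__ == "__main__":
    # Illustrative heights only; these are not measurements.
    heights = {"parity-aws": 6_000_120, "parity-gcp": 6_000_119, "geth-do": 6_000_050}
    print(find_lagging_nodes(heights))
```

Note that a relative comparison like this only catches a node that lags its siblings. If every node stalls at the same time, an external reference for the chain head would be needed.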
Solution: Redundancy and Monitoring
A fault-tolerant Ethereum event watching system is one that can detect failures, conduct triage, and continue operating as if nothing happened. Here’s how we design ours:
- Node redundancy: We run multiple Ethereum nodes, each with attached event trackers, to provide redundancy in the event of single or multiple node failures/delays.
- Event replays: Since live event tracking is unreliable, we cycle our event watchers to replay events, and stagger the replay timing between nodes. For added efficiency, we daisy-chain replays: we note the current block number at the time of each replay, so that the next replay only has to start from that block number rather than from the beginning.
- Event aggregator: Our event watchers report events to a centralized event aggregator. Our event aggregator keeps track of events it has already seen, and dispatches newly discovered and unique events to the different services we have that need to handle that event.
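The daisy-chained replays and the aggregator’s de-duplication described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our production code: the class names, the `overlap` parameter, and the choice of `(transactionHash, logIndex)` as the de-duplication key are all hypothetical. Events are assumed to be dictionaries shaped like Ethereum log objects.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

# Assumption: an event is identified by (transaction hash, log index).
EventKey = Tuple[str, int]


class ReplayCursor:
    """Tracks where the next replay should start (the "daisy chain")."""

    def __init__(self, start_block: int, overlap: int = 0) -> None:
        self.next_from = start_block
        # Hypothetical safety margin: re-scan this many recent blocks.
        self.overlap = overlap

    def next_range(self, current_block: int) -> Tuple[int, int]:
        """Return the inclusive (from_block, to_block) range for this replay."""
        from_block = self.next_from
        # The next replay starts at the block height noted during this one.
        self.next_from = max(current_block - self.overlap, 0)
        return from_block, current_block


class EventAggregator:
    """Receives events from many watchers and dispatches each one only once."""

    def __init__(self) -> None:
        self._seen: Set[EventKey] = set()
        self._handlers: List[Callable[[Dict[str, Any]], None]] = []

    def subscribe(self, handler: Callable[[Dict[str, Any]], None]) -> None:
        self._handlers.append(handler)

    def report(self, event: Dict[str, Any]) -> bool:
        """Dispatch `event` if it is new; return True if it was dispatched."""
        key = (event["transactionHash"], event["logIndex"])
        if key in self._seen:
            return False
        self._seen.add(key)
        for handler in self._handlers:
            handler(event)
        return True


if __name__ == "__main__":
    cursor = ReplayCursor(start_block=100)
    print(cursor.next_range(110))  # first replay covers blocks 100..110
    print(cursor.next_range(125))  # second replay starts where the first ended
```

Because the ranges are inclusive, consecutive replays overlap by at least one block, which is harmless as long as the aggregator drops duplicates. The sketch does not handle chain reorganizations, where the same transaction can reappear in a different block with a different log index.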
The result is we have an architecture that looks like the following, with additional discussion of components below:
Below are some explanations of our architecture choices, along with some of the things we like about Kubernetes. We understand there are many ways to address the problems we’ve raised. If you have any comments, questions, or suggestions, you can contact our team on our CryptoBaskets telegram group or send us an email.
Why Dockerize?
For our application, the benefit of using Docker is that we can add services that need access to the blockchain simply by launching new containers that link to our Parity instances. Each container runs independently, so changes, additions, and upgrades can all be completed without disrupting any of the other running containers/services.
In the example above, we only have three containers connecting to the Parity data: (1) the node monitor, (2) an event watcher for App 1, and (3) an event watcher for App 2. But we can easily add more containers, as and when our needs change.
Why Kubernetes?
Managing multiple node clusters (Ethereum clients and their connected services) can get pretty complicated and messy quickly. Not only are there multiple services that are linked and dependent on each other, updating and maintaining configurations as well as managing secrets (such as API keys) requires coordination and creates potential security vulnerabilities. In addition, monitoring servers and services for failures and restarting failed containers can get challenging in increasingly larger systems. That’s where Kubernetes comes in; it handles all of these issues.
Below are some of the features that we find particularly useful:
- Replicas / scaling: Kubernetes is great for redundancy since it has built-in scalability. You can automatically replicate node clusters (Ethereum clients and their attached services) by default, or through a single command line instruction. No need to configure each cluster separately.
- Adding on new services: You can simply add on new services or trackers that need access to your Ethereum node by adding containers to your Kubernetes deployment or stateful set.
- Persistent Volumes with Stateful Sets: Docker containers are ephemeral in nature. For syncing an Ethereum node, you don’t really want to sync a 100GB+ database every time your node fails and you have to restart. Kubernetes allows you to create Persistent Volumes, which are segregated data stores that maintain their data and state. If your Ethereum client is restarted, it will re-connect to the Persistent Volume and resume from where it last left off.
- Security: Kubernetes secrets are a neat way to store API keys and any other sensitive data. The sensitive data only has to be supplied when the secret is created; after that, Kubernetes stores it (encrypted at rest, if the cluster is configured for that). The secret can then be mounted as a volume for any container that needs access to it; the original, unencrypted data does not have to be re-shared.
- IPC connections for added security and access control: By clustering services together and providing each access to a shared volume (such as the Persistent Volume), services that need a connection to parity can access it via IPC (inter-process communication) through the cluster’s file system. This creates additional security by helping to prevent unauthorized access to the Ethereum client. By default, most developers will connect to web3 providers using an RPC (remote procedure call) http connection web3.httpProvider() over the internet. Unlike a connection via IPC, an RPC connection is potentially open to the general public and internet, creating the risk of unauthorized users discovering and connecting to your RPC web3 provider and overloading your client. Worst case, someone connecting via RPC web3 from the internet would be able to send transactions from any account that may have inadvertently been unlocked.
- Liveness and readiness probes: Kubernetes allows you to create probes that restart pods that have failed and take pods that are not yet ready out of service. For example, you can use a readiness probe to prevent an Ethereum client from accepting any incoming connections if it has not yet been fully synced.
- Live, rolling updates / no down-time: When deploying updates to live apps, Kubernetes neatly creates new pods before destroying the old, existing pods. While the new pods are being created and prepared (awaiting the readiness probe), existing pods remain in service. Only once the updated pods are ready and put into service will the old pods be destroyed.
- Nginx ingress controllers: Kubernetes allows for the setup of an nginx ingress controller to direct traffic to different services on the cluster, with requests routed according to the URL used. In your domain manager, you simply direct all subdomains to your ingress address, so you don’t have to manage each separately in case your servers’ IP addresses change. The nginx ingress controllers also manage https routing and TLS certificates. In the DApp ecosystem, this can also be applied to creating web3 providers for front-end applications. By default, Ethereum clients provide only insecure http connections.
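As an illustration of the readiness probe mentioned in the list above, the decision logic might look like the sketch below. It assumes the probe script has already fetched the results of the standard `eth_syncing` and `net_peerCount` JSON-RPC calls from the client; the function names and the `min_peers` and `max_lag` thresholds are hypothetical. A Kubernetes exec probe treats exit code 0 as success, so the second function maps readiness to an exit code.

```python
from typing import Any, Dict, Union


def is_node_ready(
    syncing: Union[bool, Dict[str, Any]],
    peer_count_hex: str,
    min_peers: int = 1,
    max_lag: int = 5,
) -> bool:
    """Decide readiness from `eth_syncing` and `net_peerCount` results.

    `syncing` is False when the client reports it is not syncing, otherwise
    an object with hex-encoded `currentBlock` and `highestBlock` fields.
    `peer_count_hex` is the hex-encoded peer count, e.g. "0x19".
    """
    # A node without peers can report "not syncing" while being stale,
    # so the peer count is checked first.
    if int(peer_count_hex, 16) < min_peers:
        return False
    if syncing is False:
        return True
    lag = int(syncing["highestBlock"], 16) - int(syncing["currentBlock"], 16)
    return lag <= max_lag


def probe_exit_code(syncing: Union[bool, Dict[str, Any]], peer_count_hex: str) -> int:
    """Map readiness to a process exit code (0 = ready, 1 = not ready)."""
    return 0 if is_node_ready(syncing, peer_count_hex) else 1
```

A probe script would call `sys.exit(probe_exit_code(...))` after querying the client, ideally over the same IPC socket the other containers use.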
Parity vs. Geth
Despite the company’s previous missteps (multi-sig hack, library hack), the one thing Parity has going for it is that it has created a reliable Ethereum client (ok…let’s also ignore the consensus vulnerability for now). While most developers start out using “Geth”, the Go implementation of Ethereum, a quick Google/Stack Exchange search will reveal a lot of frustration and problems with syncing Geth. We’ve found that in practice, Geth nodes take longer to sync and fall behind more often than Parity nodes (which you can even see in our node monitor dashboard screenshot above). By contrast, we have found that Parity nodes will sync from scratch and be available for use within a few hours to a day.
Being relatively new technologies, Ethereum clients like Parity and Geth are constantly being updated and improved, which is why we maintain both types of nodes.
Conclusion
Decentralized applications have a bright future, but given the nascency of the technology stack for them, substantial back-end work is necessary to make them feel as responsive and reliable as centralized web and mobile applications. Luckily, we’ve found that a careful, intentional approach to architectural design makes this feasible, and we’ll publish posts about other important aspects such as security in the future.