RPC Uptime & Reliability in Web3

Addressing the elephant in the server room

KagemniKarimu
Lava Network
8 min read · May 4, 2023


tl;dr — By standard tech industry measures of uptime, the availability of common web2 infra currently exceeds that of common web3 infrastructure like RPCs. Even a small difference in downtime is consequential in highly available production-grade systems. Increasing the availability of RPC will improve adoption of web3.

Rise in traffic = collapse in RPC

When Reliable RPC for Ethereum, Cosmos Hub and beyond?

As mentioned in Developer Web3 RPC Woes and The Web3 RPC Problem, unreliable RPC is a major pain point for web3 developers. Unreliable RPC endpoints lead to mission-critical unavailability incidents, poor user experience and usability, and existential-threat-level business risk. There are ample examples in web3 of RPC endpoints timing out or disappearing at the height of need.


Web3 is no stranger to these sorts of issues, and RPC is a major contributor to downtime in web3 at the moment. In the next section, we investigate uptime and the difference in scale made by even minute changes in downtime.

The Scale of Uptime: 99% vs 99.9% vs 99.99%

We define uptime as the amount of time a server or service stays operational and accessible to users, measured in units of time. Downtime is therefore any measurable time during which a service is offline and inaccessible to users for any reason. Availability is the ratio of actual uptime to total time, represented as a percentage: subtract the total downtime from the ideal uptime, then divide by that ideal uptime. Availability is usually calculated over longer periods of time, where it is more observable.

A simple pseudo-code example demonstrates this in terms of one month (or 43,830 minutes). A simple 60 minutes — a mere one hour — of monthly downtime knocks us all the way down below 99.9% availability:

total_minutes = 43,830
downtime = 60

availability = (43,830 - 60) / 43,830 x 100
availability ≈ 99.863%

Here, we treat 99.9% (AKA “three nines”) as a minimal threshold — something that we should look for as the bare minimum in any system we call ‘highly available.’ However, the real prize is at 99.999%.

The difference between 99.9% and 99.999% availability might seem minimal to our non-computational eyes, but it has a significant impact on a highly available system over a stretch of time. For instance, 99.9% availability corresponds to about 44 minutes of downtime per month, while 99.999% availability translates to only about 26 seconds of downtime per month. That is quite different, after all! The compounding effects of an occasionally unavailable system are even more apparent when projected over an entire year.

Let’s take a look at a simple chart which illustrates this more clearly below:
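As a rough sketch, the figures behind such a chart can be reproduced in a few lines of Python, assuming a 43,830-minute average month (365.25 days / 12) and a 525,960-minute year:

```python
# Downtime budget at each availability level, per month and per year.
MINUTES_PER_MONTH = 43_830    # average month: 365.25 days / 12
MINUTES_PER_YEAR = 525_960    # 365.25 days

for availability in (99.0, 99.9, 99.99, 99.999):
    down_fraction = 1 - availability / 100
    per_month = down_fraction * MINUTES_PER_MONTH
    per_year = down_fraction * MINUTES_PER_YEAR
    print(f"{availability:>7}% -> {per_month:8.2f} min/month, {per_year:9.2f} min/year")
```

At 99%, the budget is over seven hours of downtime a month; at 99.999%, it is well under a minute.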

As we can see, the effects compound quickly. Making a system more available by less than a tenth of a percent results in a significant reduction in downtime. To put this into perspective, “five nines” (99.999% availability) means an app has less than 80 seconds of downtime per quarter. With only a few seconds of downtime every few months, our “five nines” figure is widely considered the hallmark of reliability measures.

Unfortunately, most of web3, and especially web3 RPC, has not made it to this enshrined figure: major RPC service providers only offer 99.9% availability in their Service-Level Agreements (SLAs). We know that downtime can be caused by hardware failure, infrastructure disruptions, and software glitches caused by configuration errors or bugs. Ironically, and very importantly, a great deal of RPC downtime is simply caused by excessive traffic. With centralized service providers offering a 99.9% availability SLA, we are still racing to find a solution that reaches the height of RPC demand; with all this talk of decentralization, redundancy, and distributed systems, one would dream of a system experiencing zero downtime. It is a dream yet to be dreamt… Now, what about the various endpoints that are available publicly all over the place? Don’t they indicate ubiquitous blockchain uptime and availability?

Public RPC is Not Enough

Public RPC endpoints are a beautiful notion. They give an alternative to self-hosting and round-the-clock DevOps operation. They provide ‘good enough’ RPC services to otherwise unserviced developers. They even prove the altruism of the open-source web3 community. Unfortunately, they alone are generally not enough to serve the needs of expanding applications and services. While centralized RPC service providers offer a usable 99.9% SLA as standard, other public RPC endpoints offer no promise of availability or uptime. Centralization offers some benefit of surety but comes at a cost. And with regards to unpaid Public RPC — unfortunately, as the time-honored Robber-Baron-Age adage goes, “you get what you pay for!”

Public RPC can go down, be overwhelmed, or just up and disappear. In many or most cases, there is no SLA between users and operators. Once an application or service reaches scale beyond hobbyist or testing levels, RPC endpoints often fail due to insufficient resources. Paid, private RPC providers are generally prepared for this load and financially incentivized to upscale their resources, but unpaid, public RPC endpoints must do things to protect themselves — such as rate-limit users. Public RPC is great in many circumstances. However, when a project is production-grade, it requires a high level of uptime and reliability to ensure that it continuously delivers value to its users. Reaching ‘high availability’ is something that Public RPC alone cannot do.
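One practical client-side coping strategy, when a rate-limited public endpoint starts rejecting requests, is jittered exponential backoff. A minimal sketch in Python follows; the `send` callable and the retry parameters are illustrative assumptions, not a prescription from any particular provider:

```python
import random
import time

def call_with_backoff(send, max_retries=5, base_delay=0.5):
    """Retry a zero-argument callable with jittered exponential backoff.

    `send` should raise on failure (e.g. on an HTTP 429 from a
    rate-limited public RPC endpoint) and return a result on success.
    """
    for attempt in range(max_retries):
        try:
            return send()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            # Sleep base * 2^attempt, with up to 100% random jitter added,
            # so many clients do not retry in lockstep and re-overload the node.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

This does not make an unreliable endpoint reliable; it only smooths over brief overloads and rate limits, which is exactly the failure mode public RPC hits first.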

Addressing Uptime Requirements

In discussing highly reliable RPC, it’s important to note that there are things node operators and RPC providers can do to maximize uptime and increase the availability of their services. Some of the most crucial are catalogued here:

  • DDoS Protection: On today’s internet, Distributed Denial of Service (DDoS) protection is mandatory for any publicly exposed resource with a globally shared URL. Running a node without proper DDoS protection leaves it vulnerable to spam traffic from malicious users and bots — a guaranteed recipe for downtime.
  • Proactive monitoring: Regular and proactive monitoring of systems can help detect issues early on and address them before they lead to downtime. Setting well-tuned alerts with Centreon, Prometheus, Grafana, or other tooling lets operators know there is a problem before it becomes a problem for the end-user.
  • Redundancy and fail-over: A backup and redundancy plan can minimize the risk of data loss and system downtime. RPC providers should ideally build a robust system that allows smart detection of downed resources and seamless failover of service. That way, RPC users will not have to go chasing new links and new providers if and when an outage occurs.
  • Regular maintenance and updates: Regular maintenance and updates help ensure that systems are running efficiently and securely. This could include updating software and hardware, patching vulnerabilities, and upgrading systems regularly. Unfortunately in the real world, updates sometimes break things as often as they fix them — having a system whereby real-time rollbacks are possible can be crucial for keeping services online. And, of course, all of this should ideally be done with minimal interruptions of service.
  • Malware protection: Malicious software can and will make its way to any publicly exposed resource on the web. Installing workable, lightweight security software which monitors processes on a given node can help tremendously in stopping both attacks and outages. Alternatively, creating a secure-by-design operating environment can do the trick.
  • Security Best Practices: As mentioned, secure-by-design infrastructure is optimal. There are many vectors for attacking nodes — and it’s easy to forget that the humble RPC endpoint is a high-value target for a hacker looking for an easy win. Data reliability is a major function of RPC, and an unsung priority that can be ensured by security best practices.
  • Load balancing and scaling: Load balancing and scaling help distribute traffic and prevent system overload. This could include load balancing to distribute traffic or scaling up systems with more resources as traffic increases. In either case, node operators and RPC providers alike must have some sort of safeguards in place for when the hordes come for blockchain data. It is a sad irony that abundant use is a major killer of RPC endpoints.
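Two of the points above, proactive health checks and failover across redundant endpoints, can be sketched in a few lines of Python. The endpoint URLs here are hypothetical, and the `eth_blockNumber` probe is just one reasonable liveness check for an Ethereum-style endpoint:

```python
import urllib.request

# Hypothetical endpoints; substitute real providers in practice.
ENDPOINTS = [
    "https://rpc.primary.example",
    "https://rpc.backup-1.example",
    "https://rpc.backup-2.example",
]

def is_healthy(url, timeout=2.0):
    """Probe an endpoint with a minimal JSON-RPC request."""
    payload = b'{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False  # any network or HTTP error counts as unhealthy

def pick_endpoint(endpoints, healthy=is_healthy):
    """Return the first endpoint that passes the health check, else None."""
    for url in endpoints:
        if healthy(url):
            return url
    return None
```

A production setup would run the probe on a schedule (feeding Prometheus or similar) rather than per request, but the failover logic is the same: never hand a client a URL that just failed its health check.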

Why Reliable RPC is Crucial for Web3

This entire section can be summed up in a single word: adoption. Availability and uptime are crucial for adoption of web3 technology; they make dApps and services usable. In our current situation, we rely on the good graces of public RPC, maverick node runners, and the SLAs of centralized providers to ensure our uptime. Perhaps somewhat obvious to most: as more production-grade decentralized applications and services supplant hobby projects and enthusiast-driven development, the uptime of our available RPC will need to improve with time.

Let’s agree to this: downtime hurts everyone. It hurts RPC providers who offer blockchain data, sometimes at a profit. It hurts end-users who are interested in seamless front-ends and good user experiences. And, finally, it hurts dApp builders and developers who want to build web3. Any significant amount of downtime leads to some loss in revenue and some cost of restoration for everyone involved. But what cannot be accounted for is the inevitable loss of reputation that happens when someone’s app or service goes down in the moment of greatest need. Availability is crucial to consider in the scope of the various things web3 has to offer the world wide web economy.

Solutions

Fortunately, redundant, distributed, and decentralized networks such as Lava can ease the pain and awkwardness of implementing all of these by incentivizing independent RPC providers to offer service. It is our vision that the five-nines future is not far away…

About the Author🧑🏿‍💻

KagemniKarimu is currently a Developer Relations Engineer for Lava Network and formerly did Developer Relations at Skynet Labs. He’s a self-proclaimed Rubyist, new Rust learner, and friendly Web3 enthusiast who entertains all conversations about tech. Follow him on Twitter or say hi to him on Lava’s Discord, where he can be found lurking.

About Lava 🌋

Lava is a decentralized network of top-tier API providers, where developers make one subscription to access any blockchain. Providers are rewarded for their quality of service, so your users can fetch data and send transactions with maximum speed, data integrity, and uptime. Pairings are randomized, meaning your users can make queries or transact in privacy.

We help developers build web3-native apps on any chain, while giving users the best possible experience.
