Where there is a limit, there should be monitoring

Monitoring is a very important aspect in case of distributed systems. Without this, we wouldn’t know what’s happening with the service and the users. For Instance, can you answer the following questions if you don’t have monitoring.

Are there any issues with features?

What do customers like and how are they using the service?

Monitoring scenarios are different for each application/component/service. Each component has its limits to perform better.

where there is a limit, there should be monitoring.

Let me explain this catchy thought through an example — the cache system. Almost every distributed service maintains a cache layer. The cache layer is used to store HOT data to reduce the impact on the data layer and also for temporary data such as user sessions, counters, etc. There is a chance that the whole system can go down if there is any issue with the cache. You would understand how critical it is to monitor the cache layer if you read the recent outage with Facebook.

The following points describe the limits in the cache system.

  • Every request goes through Cache layer making it fully loaded. But the underlying cache web server has a limit on the no. of requests that it can serve efficiently.
  • Cache data is stored in RAM for faster access but the RAM space is limited.
  • If there is a cache miss, the cache layer gets data from the storage/database. The purpose of cache layer is to reduce the load onto the data layer because of the limitation on how many queries it can serve efficiently (Note: cache miss count is equal to number of requests to the data layer).
  • During the cache miss, updating the cache layer consumes network. If there are large no. of misses, load on the network is high, which certainly has a limit on how many bytes the network can transfer.

RAM space, incoming requests, cache miss count and the load on network are a few things that have to be monitored to make the cache system highly available. Just like this example (cache scenarios), the key point is to identify the limits on every component in the system architecture in order to prevent disasters and improve the up-time of a system.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade