Your customers are not a monitoring tool
If your customers are finding out about problems with your applications before you do, then you might have a monitoring problem. A recent example is that of a Comcast customer setting up a Twitter bot to tweet Comcast when his service is degraded. This is a much more public example than most, but the lesson is the same. Never let your customers do your monitoring for you.
There are many different levels of monitoring. You can monitor your servers, your services, the interactions between your services, the calls your services make to datastores, the queries sent to those datastores, and every detail throughout the code in every one of these components or interactions. Some will say that you can never have too much monitoring.
Ideally, you could never have too much monitoring, but monitoring has a cost: it consumes resources. This may not be a problem in every situation, but when the monitoring becomes too detailed, it can cost a lot of cycles. You don’t want your monitoring to negatively impact your customer. Choose your default monitoring level wisely, and then iterate until you find the right balance. That balance may be, and likely will be, different for each application.
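One way to keep detailed monitoring from eating too many cycles is to measure only a fraction of calls. The following is a minimal sketch of that idea, not a recommendation from any particular tool. The names `sampled_timer`, `sample_rate`, and `record` are made up for illustration, and `record` stands in for whatever you use to ship metrics.

```python
import random
import time
from functools import wraps


def sampled_timer(sample_rate, record, rng=random.random):
    """Time only a fraction of calls so instrumentation overhead stays bounded.

    sample_rate: fraction of calls to measure, between 0.0 and 1.0.
    record: hypothetical callback taking (function_name, elapsed_seconds).
    rng: source of random numbers in [0, 1); injectable for testing.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Unsampled calls skip the timing work entirely.
            if rng() >= sample_rate:
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raised.
                record(func.__name__, time.perf_counter() - start)

        return wrapper

    return decorator
```

Starting with a low `sample_rate` and raising it until the overhead becomes noticeable is one way to iterate toward the balance described above.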
Of course, you can’t start monitoring if you don’t know how. New Relic is a tool you can use to start getting data out of most applications and servers almost immediately. It is a SaaS product, so you won’t need to run your own servers. You can also use Prometheus if you’re comfortable running your own servers. Prometheus is slightly harder to set up, but it has a powerful query language that allows for a lot of flexibility. The Prometheus project also provides an alerting component called Alertmanager. This has proven to be a very powerful combination for my monitoring needs.
These two tools are not necessarily trying to solve the same problem, though. I find Prometheus best suited for internal platform monitoring, while New Relic excels at monitoring customer-facing applications. With both in place, I know the status of my platform and can react quickly to issues, and I can also understand my full application stack.
With this in place, you can set up alerting for the events you expect, such as error conditions, heavy loads, increased latencies, and platform-level failures. However, what happens when everything looks fine internally, but the application doesn’t work for the customer? How would we even know that this is happening? How would we know how to fix it?
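To make the idea of alerting on expected events concrete, here is a small sketch of threshold rules evaluated against a snapshot of metrics. It is illustrative only: the metric names and thresholds are hypothetical, and in practice a tool such as Alertmanager or New Relic would own this logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str         # label shown in the alert
    metric: str       # key to look up in the metrics snapshot
    threshold: float  # alert when the value is strictly above this


# Hypothetical rules and thresholds; tune them per application.
RULES = [
    Rule("high error rate", "error_rate", 0.05),
    Rule("high p95 latency", "p95_latency_ms", 800.0),
    Rule("heavy load", "cpu_utilization", 0.90),
]


def firing_alerts(metrics, rules=RULES):
    """Return alert messages for every rule the snapshot violates."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            # A missing metric is suspicious too: the collector may be down.
            alerts.append(f"missing metric: {rule.metric}")
        elif value > rule.threshold:
            alerts.append(rule.name)
    return alerts
```

Note that rules like these only see what the platform reports about itself, which is exactly the blind spot the questions above point at.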
New Relic has a tool for this, but you can also build it yourself. The technique is called synthetic transactions; others call them user experience tests. These transactions should be scripted to mimic end users and should run continuously. This lets you catch latency issues, workflow failures, connection failures, and regional issues. You can run them from locations all over the world to better represent how users in different regions experience your application.
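If you want to roll your own, a synthetic transaction can be as simple as a script that walks through a user workflow and checks each step. The sketch below uses only the Python standard library. The step format, the `max_seconds` latency budget, and the function names are assumptions made for illustration, and this is not how New Relic implements its feature.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    detail: str


def http_fetch(url, timeout=10.0):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_transaction(steps, fetch=http_fetch, max_seconds=2.0):
    """Run steps in order, as a user would, stopping at the first failure.

    Each step is a (name, url, expected_text) tuple.
    """
    results = []
    for name, url, expected_text in steps:
        start = time.perf_counter()
        try:
            status, body = fetch(url)
            elapsed = time.perf_counter() - start
            if status != 200:
                ok, detail = False, f"unexpected status {status}"
            elif expected_text not in body:
                ok, detail = False, "expected text not found"
            elif elapsed > max_seconds:
                ok, detail = False, "too slow"
            else:
                ok, detail = True, "ok"
        except Exception as exc:  # connection errors, timeouts, HTTP errors
            elapsed = time.perf_counter() - start
            ok, detail = False, f"request failed: {exc}"
        results.append(StepResult(name, ok, elapsed, detail))
        if not ok:
            break  # later steps depend on earlier ones
    return results
```

Run something like this on a schedule from a few machines outside your own network, and alert whenever the last result in the list is not `ok`. The `fetch` parameter is injectable so the workflow logic can be tested without touching the network.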
This is an incredibly powerful tool that isn’t used enough. If you are using Selenium, then you likely already have all the code you need to start testing today. At the very least, you need to know when your site is down from the perspective of the outside world. So if you do nothing else after reading this, go to ping.gg and set up an alert that fires when your website (or any website) goes down. You don’t even need to use curl. You can do it right in your browser: `https://ping.gg/<your_email>/github.com`. Your app may be running fine, but if no one can access it, then it might as well be down.