Monitoring is too hard, as illustrated by TLS certificates expiring
Grumpy thesis: monitoring TLS certificate expiry is too hard (evidence: good people keep having certs expire on them). Why don't web servers ship with routine cron jobs that email you when any actively used TLS certificate is N days or less from expiring, for example?
Having a TLS certificate for a public web server unexpectedly expire on you is practically a rite of passage for a system administration team. And I'm not here to throw stones, because while we have a reasonably good system for monitoring our TLS certificates, it critically relies on us remembering to add monitoring for the actual TLS website. When the TLS website is a standalone web server, that's fairly easy (because we know we want to check if the site is actually up), but when it's yet another virtual host on our central web server, it's easy for it to slip through the cracks because we know we're already monitoring the web server as a whole.
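To make the virtual host problem concrete, here is a minimal sketch of a per-site expiry check in Python. It is not our actual Prometheus-based setup; the function names, the 30-day threshold, and the host names in the usage note are all made up for illustration. The important detail is that each site name is sent as the TLS server name (SNI), so every virtual host's own certificate gets looked at instead of just the web server's default one. Because the default context verifies certificates, one that has already expired shows up as a failed check instead of a negative day count.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_left(not_after, now):
    """Days from `now` until a certificate's notAfter time string."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(name, port=443, timeout=10.0):
    """Connect to `name`, sending it as the SNI name, and return notAfter."""
    ctx = ssl.create_default_context()
    with socket.create_connection((name, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=name) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(sites, warn_days=30, fetch=fetch_not_after, now=None):
    """Return one message per site that is close to expiry or unreachable."""
    now = now or datetime.now(timezone.utc)
    problems = []
    for name in sites:
        try:
            remaining = days_left(fetch(name), now)
        except OSError as exc:  # ssl.SSLError is a subclass of OSError
            problems.append(f"{name}: TLS check failed: {exc}")
            continue
        if remaining <= warn_days:
            problems.append(
                f"{name}: certificate expires in {remaining:.0f} days"
            )
    return problems
```

Run from cron, something like `print("\n".join(expiring_soon(["www.example.org", "wiki.example.org"])))` would generate email only when there is something to report, since cron mails any output. Note that this still depends on someone remembering to add each new virtual host to the list, which is exactly the weak point.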
As a general rule, when people keep doing something wrong, they're actually right and your system is wrong. Put another way, "if your system depends on humans never making errors, you have a systems problem". If it takes extra steps and extra attention to add monitoring, people will keep forgetting to do so and then they will get burned by it. TLS certificates are an obvious case, but there are lots of other ones. How many systems ship with default monitoring that tries to let you know if the local disk space is getting alarmingly low, for example?
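The disk space case shows how small the missing piece can be. As a rough sketch (the function name, the 10% threshold, and the paths in the usage note are arbitrary choices for illustration, not something any system ships), a default low-space check needs little more than this:

```python
import shutil


def low_space(paths, min_free_fraction=0.10, usage=shutil.disk_usage):
    """Return a warning for each path whose filesystem is nearly full."""
    warnings = []
    for path in paths:
        total, _used, free = usage(path)
        # Skip pseudo-filesystems that report a total size of zero.
        if total and free / total < min_free_fraction:
            warnings.append(f"{path}: only {free / total:.0%} free")
    return warnings
```

Printing the result of, say, `low_space(["/", "/var"])` from a cron job would be enough to get email when a filesystem is nearly full. The code is the easy part; the hard part is that nothing sets it up for you by default.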
Today, you have to spend a great deal of time and effort to build out a monitoring system for your systems. Once you have built that system (as we have with our Prometheus setup), the incremental monitoring for a new system is easy and it's alarmingly easy to feel smug about your successes and other people's failures. But we're standing on a mountain, and it's a mountain that not everyone has either the time or the expertise to climb.
Of course, building systems to monitor themselves by default is not an easy job. However, we've already done some of it (and come to accept it as essentially required for a good-quality implementation); for example, Linux systems these days often default to sending email if issues show up in your software RAID arrays or disk SMART attributes. We could do more, especially since there's a lot of obvious low-hanging fruit.
It would be nice if in the future 'default monitored' was like 'default secure' is becoming today. You could change it or replace it, but at least things would start out in a good place.
Link: An opinionated list of best practices for textual websites
An opinionated list of best practices for textual websites by Rohan Kumar is what it says in the subject. I'm not sure I agree with everything in it (and I certainly don't do everything there), but I think it has useful information and it's certainly given me things to think about.
(Since this entire blog is a textual website, I have a decided interest in this area and some opinions of my own.)