2019-04-21
My view on upgrading Prometheus (and Grafana) on an ongoing basis
Our normal habit is to use the distribution versions of programs we want to run, on whatever version of whatever OS we're running (usually Ubuntu Linux, but we follow this on CentOS, OpenBSD, and so on). If we can't do that for some reason, we generally freeze the version of the program we're using for the lifetime of the system, unless there are either security issues or strongly compelling upgrades (or, sometimes, the version we're using is falling out of support and won't get security updates any more).
We started with the Ubuntu 18.04 versions of various Prometheus components, but we pretty rapidly determined that they were too old and lacked crucial bug fixes, and switched over to the upstream prometheus.io versions of the time. This would normally leave us freezing the versions of what we were running once we had reached a functional and sufficiently bug-free state. While I considered that initially, I changed my mind and now believe the right thing for us to do is to actively keep up with the upstreams for both Prometheus and Grafana.
The problem with freezing our versions of either Prometheus or Grafana is that both do regular, frequent releases; both projects put out significant new versions every few months at most, and often faster. Freezing our versions for even a year would put us multiple versions behind, and normally we'd freeze things until 2022, when we'd have to start planning an upgrade of the metrics server's Ubuntu version (it's currently running 18.04).
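(As a concrete illustration of how easy it is to notice you've drifted behind, here's a minimal sketch in Python of checking a running Prometheus against the latest upstream release. This isn't something we actually run; it assumes your Prometheus is at localhost:9090 and it uses the public GitHub releases API.)

    #!/usr/bin/python3
    # A minimal sketch: compare the version of a running Prometheus
    # against the latest upstream release. Assumes the Prometheus
    # server is reachable at localhost:9090.
    import json
    import urllib.request

    def fetch_json(url):
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # The prometheus_build_info metric exposes the running version
    # as a label, so we can query it through the regular HTTP API.
    res = fetch_json("http://localhost:9090/api/v1/query"
                     "?query=prometheus_build_info")
    running = res["data"]["result"][0]["metric"]["version"]

    # GitHub's releases API reports the latest upstream release tag,
    # eg "v2.9.1"; strip the leading "v" to match the metric's label.
    rel = fetch_json("https://api.github.com/repos/prometheus"
                     "/prometheus/releases/latest")
    upstream = rel["tag_name"].lstrip("v")

    if running == upstream:
        print("up to date (%s)" % running)
    else:
        print("running %s but upstream is %s" % (running, upstream))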
There are at least two problems with having to upgrade from many versions behind. The first is that that's a lot of changes to absorb and deal with all at once, simply because a lot of releases means a lot of changes. Not all of them are significant or matter, but some of them do, and we'd have to sort those out, read through tons of release notes, and so on. Spreading this out over time by staying up to date lowers the burden in practice and makes it less likely that something important will slip through the cracks.
The second is that it may not be easy or even possible to cleanly jump from a version that's so far behind to the current one. In the easier case, we will have missed the deprecation window for some API, configuration file setting, or practice, and we'll have to upgrade a bunch of things in synchronization so they all jump from the old way to the new way at once. If we had kept up to date, we'd have had time to migrate while both the old way and the new way were still supported. In the hard case, there's been a database or storage format migration, except that we're jumping straight from a version that only has the old format to a version that only supports the new one. Probably the only solution there would be to temporarily bring up an intermediate version purely to do the migration, then skip forward further.
(Another advantage of upgrading frequently is that you get to spot undocumented regressions and changes right away, and perhaps file bugs to get developers to fix them. Even if you can't do anything, you have a relatively confined set of changes in play at once, which makes problems easier to troubleshoot. Turning over the entire world has the drawback that, well, you've turned over the entire world.)
This is really nothing new. It's always been pretty clear that a slow stream of mostly minor changes is generally easier to deal with than a periodic big bang of a lot of them at once. It's just that I'm relatively rarely faced with this choice, especially in a situation where version to version upgrades are relatively painless.
(When version to version upgrades are not painless, there is a very strong temptation to put the pain off for as long as possible rather than keep going through it over and over. This is how we are behind on Django versions for our web app.)