Staged rollouts of things still have limitations

August 5, 2024

One of the commonly suggested remedies for deploying things that can go wrong is to do staged rollouts, where you deploy to only a subset of the things at a time and look for problems before proceeding. Staged rollouts are in general a good idea, but it's important to understand that there are limits on how much they can improve the situation, especially if the staged rollouts are going out to outside people ('customers') instead of internally, within your organization in environments that you control.

The first limitation is that staged rollouts only help to the extent that you can actually detect problems before continuing with the rollout. What problems you can detect (and how soon) is often limited by the telemetry you have available and by how much you can inspect and monitor the systems that you're rolling out to. If you're rolling out internally, your visibility into those systems can be quite high, but if you're rolling out to customers, you may have limited telemetry (partly because customers will object to your software constantly reporting things back to you, especially if you want to report lots of details) and no ability to reach out and inspect systems. A related issue is that when you build rollout telemetry and monitoring, you're probably basing it on the problems you expect. If your rollout triggers a problem that you didn't foresee, you may have no telemetry that would tell you about it.

(For a topical example, consider the telemetry you'd need to detect that your application has made your customer's machines crash and be unable to boot. Since the machines aren't booting, you can't send any telemetry from them to actually report the problem; instead you'd need some telemetry signal that your application was running fine and then monitor this signal for a rapid decrease in your staging group. Would you think to both build and monitor this telemetry signal in advance?)
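As a rough illustration of that kind of 'still alive' signal, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the HeartbeatWindow record, the heartbeat_drop_alert function, and the 20% threshold are all hypothetical, and a real system would need to decide how heartbeats are collected and what counts as a normal fluctuation.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWindow:
    """Hypothetical record: how many staged machines reported in during one interval."""
    label: str
    reporting: int


def heartbeat_drop_alert(windows, staged_total, max_drop_fraction=0.2):
    """Return the label of the first window where the share of staged machines
    still reporting fell by more than max_drop_fraction compared to the
    previous window, or None if no such drop happened.

    The threshold is an arbitrary illustrative default, not a recommendation.
    """
    if staged_total <= 0:
        raise ValueError("staged_total must be positive")
    previous_share = None
    for window in windows:
        share = window.reporting / staged_total
        if previous_share is not None and previous_share - share > max_drop_fraction:
            return window.label
        previous_share = share
    return None


# Made-up numbers: 1000 staged machines, most of which go silent in the third interval.
example = [
    HeartbeatWindow("t0", 980),
    HeartbeatWindow("t1", 975),
    HeartbeatWindow("t2", 400),
]
alert = heartbeat_drop_alert(example, staged_total=1000)
```

The point of the sketch is that the alert comes from the absence of reports, which is the only thing a machine that can't boot is able to 'send'. It only works if the heartbeat already existed and was being watched before the rollout started.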

The second limitation is that if your staged rollout detects problems, you've (still) inflicted problems on some people, just not as many of them as without a staged rollout. Again, this is more of a problem with external staged rollouts than with internal ones. When your staged rollout is internal, you're inflicting problems on yourself; when your staged rollout is external, you're inflicting problems on other people and they're going to be unhappy with you. Staged external rollouts don't eliminate problems, they merely reduce them.

(For instance, Ubuntu has a system of 'phased updates' for non-security updates of some packages, such as OpenSSH. But if an update is bad and this is detected during the phased update process, and you happen to be one of the people who got the update early, you get to sort out whatever mess it's made of your system.)

In addition, staged rollouts are in conflict with rapid updates. The slower and more carefully you do a staged rollout, the longer (on average) it takes for your update to reach people and take effect. This doesn't matter much for some updates, but we know that update speed matters for others. As an extreme example, if you're pushing out an update to deal with a security problem that's being actively exploited, most people are going to want it right now, and the slower your staged rollout runs, the more people will wind up being exploited.
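To put rough numbers on the 'on average' part, here is a small Python sketch that computes the mean delay before a machine receives an update under a staged schedule. The schedule and the mean_delay_hours function are invented for illustration; they aren't taken from any real rollout system.

```python
def mean_delay_hours(stages):
    """Average hours a machine waits for the update, given a list of
    (fraction_of_machines, hours_after_start) stages.

    The fractions are assumed to cover the whole population, so they must sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in stages)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    return sum(fraction * hours for fraction, hours in stages)


# Hypothetical schedule: 1% immediately, 9% after a day, the remaining 90% after three days.
careful = [(0.01, 0), (0.09, 24), (0.90, 72)]
# The same update pushed to everyone at once.
immediate = [(1.0, 0)]

careful_delay = mean_delay_hours(careful)      # roughly 67 hours
immediate_delay = mean_delay_hours(immediate)  # no delay at all
```

With this made-up schedule the typical machine waits almost three days for the update, because the large final stage dominates the average. For an actively exploited security problem, that wait is time during which most machines remain exposed.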

None of this excuses doing a non-staged rollout that blows up, or even a staged rollout that only blows up some people. It's your job to only roll out good changes, and as part of that to test your changes (and your systems) before throwing them into the field. Staged rollouts are an emergency backup in case an error slipped through your other precautions. This is especially true of external staged rollouts, where you can't easily fix any problems that you caused.

(The corollary is that if staged rollouts are regularly saving you, you have additional problems and should probably fix them first.)

PS: There are probably situations where it's sensible to make internal staged rollouts your main defense against bad updates. But otherwise it's my view that staged rollouts should be your emergency backup to all of the other testing and validation you're doing.

Written on 05 August 2024.

Last modified: Mon Aug 5 22:45:06 2024