How and why the new iptables -w option is such a terrible fumble

October 10, 2016

I wrote recently about the relatively new -w option for iptables and how it will make things blow up. Unfortunately for Linux sysadmins everywhere, exactly how the iptables people introduced this option is a case study in how not to make changes like this; it is essentially backwards from what you want to do. They could probably have made the situation worse than it is now, but it would take some ingenuity.

Perhaps it is not obvious why iptables -w is so terrible (I mean, clearly it wasn't obvious to the iptables developers). To start seeing where they went so wrong, let's ask a simple question: how do you write a script (or a program) that will run on both a system without this change and a system with it?

You can't just use -w on all your iptables commands, because the old version of iptables doesn't support the option; if you add it blindly, every command will fail. You also can't leave -w off on systems that support it, because without it random iptables commands that you're running will fail under some circumstances (as we've seen). In practice -w is a mandatory iptables option on systems that support it, unless you have a relatively unusual system.

So the answer is 'you must probe for whether or not -w is supported on this version of iptables'. Which cuts to the root of the problem:

Introducing -w this way created a flag day for all uses of iptables.

Before the flag day, you could not use -w. After the flag day, you must use -w. Or at least, you must use -w if you want your iptables commands to be reliable all the time under all circumstances, including odd ones.
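To make the probing concrete, here is a minimal sketch of one way to do it. It rests on an assumption that I haven't verified across versions: that 'iptables --help' can be run without special privileges and mentions '--wait' only on versions that have the option. The function names are made up for illustration.

```python
import subprocess


def help_mentions_wait(help_text):
    """Guess whether this iptables has -w by looking for '--wait' in its help.

    Assumption: versions with the option list '--wait' in their help
    output and versions without it do not.
    """
    return "--wait" in help_text


def iptables_prefix(binary="iptables"):
    """Return the argv prefix to use for every iptables command in a script."""
    try:
        result = subprocess.run(
            [binary, "--help"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError:
        # The binary could not be run at all; fall back to the plain form
        # and let the real command report the problem.
        return [binary]
    if help_mentions_wait(result.stdout):
        return [binary, "-w"]
    return [binary]
```

A script would call iptables_prefix() once at startup and then build every command as prefix + ["-A", ...]. That every script has to carry something like this around is the whole problem.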

That's the next failing: the flag day introduction of -w created a situation where most or all uses of plain iptables on modern systems are subtly buggy and dangerous. They aren't obviously broken so that they fail all or most of the time; instead they now have a race condition. Race conditions are hard to run into (or find deliberately) and hard to diagnose, making them one of the most pernicious classes of bugs. We can see that this is the case because there are still buggy uses of iptables on Fedora.

The final failing is that the iptables developers made this use a single global lock. This maximizes the chance that iptables commands will collide with each other, even if they happen to be doing two completely unrelated things that would not interfere with each other in the least. Are you setting up IPv6 blocks in parallel with querying IPv4 ones? Tough luck, iptables will save you from yourself by making things fail.

All of this is a completely unforced set of errors on the part of the iptables developers. Faced with the underlying bug that two simultaneous iptables commands could interfere with each other in some situations, they could have solved the issue by serializing all iptables commands by default (ie, the equivalent of '-w'). This would have solved the problem without breaking all current uses of plain iptables. People who wanted their commands to fail instead of wait could have had a new 'fail immediately' option.
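As an illustration of that alternative design (and not of how iptables actually implements its locking), here is a small sketch where waiting for the lock is the default and failing immediately is the opt-in. The lock file path, the function name, and the fail_immediately parameter are all hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def serialized(lock_path, fail_immediately=False):
    """Hold an exclusive lock on lock_path for the duration of the block.

    By default this waits for the lock, so existing callers keep working.
    With fail_immediately=True it raises BlockingIOError if the lock is
    already held by someone else.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        flags = fcntl.LOCK_EX
        if fail_immediately:
            flags |= fcntl.LOCK_NB
        fcntl.flock(fd, flags)
        yield
    finally:
        # Closing the descriptor also releases the lock if we got it.
        os.close(fd)
```

With this shape, old callers that pass no new arguments get serialized behind the lock without any changes, and only people who care about not waiting have to learn about a new option.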

(I've written before about the related issue of how to deprecate things. Arguably this actually is the same issue, since in practice the iptables developers have deprecated use of iptables without -w.)

Sidebar: A bonus additional issue (fortunately rare)

If you happen to be running multiple iptables commands in parallel with -w and one stream of them is sufficiently unlucky that it waits for long enough, it will print to standard error a message like this:

Another app is currently holding the xtables lock; waiting for it to exit...

(The iptables developers have varied this message repeatedly as they've fiddled with various micro-issues around the implementation of locking, so different versions of different distributions will have somewhat different messages.)

This is not quite the total failure that printing new warning messages by default is, since you have to give a new command line option to produce this behavior. Still, it's not very helpful, and of course it's not documented. Since it's generally hard to hit, you can easily write programs that don't expect this message and will blow up in various ways if it ever appears.
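A program that captures iptables' standard error can at least try to separate these notices from real errors. This is a rough sketch; matching on the substring 'xtables lock' is my assumption, picked because the exact wording varies between versions, and it may not catch every variant. The function name is made up.

```python
def split_lock_notices(stderr_text, marker="xtables lock"):
    """Split captured stderr into (lock-wait notices, everything else).

    Assumption: every variant of the lock-wait message contains the
    marker substring; this has not been checked against all versions.
    """
    notices, others = [], []
    for line in stderr_text.splitlines():
        if marker in line:
            notices.append(line)
        else:
            others.append(line)
    return notices, others
```

A caller would then treat only the second list as a sign that something actually went wrong.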


Comments on this page:

It's not quite a flag day, but I recently ran into a similar problem with cURL's command line tool. In late 2013 the command line tool got a --http1.1 flag (7.33.0), and two years later it switched to HTTP/2 by default (7.47.0). The problem is that too many HTTP/2 webservers are terribly broken, and recent versions of cURL fail to fetch from these hosts unless the protocol is manually "downgraded." As a result, right now I must use HTTP/1.1, since it's the only one that works correctly everywhere. However, I would like my program, which depends on the system-provided cURL, to work across many different systems, some with cURL older than 7.33.0. This has my program sniffing the cURL version in order to properly select the command line flags. Switching the defaults in just two years was too fast.
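The version sniffing described here might look roughly like the following sketch. It assumes that the first line of 'curl --version' starts with 'curl' followed by a three-part version number, and it takes the 7.33.0 threshold from the comment above; the function names are hypothetical.

```python
import re


def parse_curl_version(version_output):
    """Extract (major, minor, patch) from the start of 'curl --version' output."""
    match = re.match(r"curl (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def http11_flags(version):
    """Return the flags that force HTTP/1.1, or nothing on older cURL.

    Versions before 7.33.0 lack --http1.1, and an unparseable version
    is treated the same way to stay on the safe side.
    """
    if version is not None and version >= (7, 33, 0):
        return ["--http1.1"]
    return []
```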

Written on 10 October 2016.

Last modified: Mon Oct 10 23:03:07 2016