Wandering Thoughts archives

2015-03-11

The irritation of being told 'everyone who cares uses ECC RAM'

One of the hazards of hanging around ZFS circles is hearing, every so often, that everyone who cares about their data uses ECC RAM and if you don't, you clearly don't care (and should take your problem reports and go away). With Rowhammer in the news, this attitude may get a boost from other sources as well. Like other 'if you really care you'll do X' views, this attitude makes me reflexively angry because, fundamentally, it pretends that the world is a simple single-dimensional place.

The reality is that in the current world, picking ECC RAM on anything except server machines is generally a tradeoff. For this we may primarily blame Intel, who have carefully ensured that only some of their CPUs and motherboard chipsets support ECC. Although the situation is complex, ever-changing, and hard to decode, it appears that you need either server Xeon CPUs or lower-end desktop CPUs; the current and past middle-of-the-road desktop CPU lines (i5 and i7) explicitly do not support ECC. Even with a CPU that supports ECC, you also need a chipset and a motherboard that do, and it's not clear to me which those are or how common they are.

(AMD gets its share of the blame too, because apparently not all AMD CPUs, chipsets, and motherboards support it either.)

Eliding a bunch of ranting, the upshot is that deciding you must have ECC is not a trivial choice and in many cases will force you to give up other valuable things. At least in the desktop space, you'll probably sacrifice some combination of thermal efficiency, system performance, and motherboard and system features, and pay more in sheer cost, in order to get ECC.

(These complications and tradeoffs are why my current desktop machines do not have ECC, although I would love to have it if I could. In fact I have a whole list of desired desktop motherboard features that are probably all more or less mutually exclusive, because desktop choices are suffering.)

For people to say that ECC should be your most important criterion anyway is, well, arrogance; it assumes that the world turns around the single axis of having (or not having) ECC and that anything else is secondary. The real world is much more complex than that, especially given that not using ECC does not make your system aggressively dangerous in practice (even with lots of RAM). It follows that saying people who do not use ECC don't actually care about their data is abrasively arrogant. It is the kind of remark that gets people to give you the middle finger.

It is a great way to make a lot of bug reports go away, though (and a certain number of people with them).

This applies to pretty much any specific technology, of course. ECC is just the current bugbear (or at least mine).

PS: the corollary to this is that system designs that are actively dangerous or useless without ECC RAM are not broadly useful designs, because plenty of machines do not and will not have ECC RAM any time soon. A 'must have ECC' design is in practice a server-only design, and maybe not even then; I don't know if ECC RAM is now actually mandatory on most or all server hardware, such that, eg, our low-end inexpensive Dell 1Us will all have it.

(I'd like it if they all did, but I don't think we even thought about it when selecting the machines. We did specifically make sure to get ECC RAM on our new OmniOS servers, in part because ZFS people keep banging this drum.)

UseECCIrritation written at 00:52:31

2015-03-10

Why installing packages is almost always going to be slow (today)

In a comment on my entry on how package installs are what limits our machine install speed, Timmy suggested that there had to be a faster way to do package installs and updates. As it happens, I don't think package systems can do much better here because of some fundamental limits in how we want package updates to behave, especially ones that are done live.

The basic problem on systems today is that we want package installs and updates to be as close to atomic transactions as possible. If you think about it, there are a lot of things that can go wrong during a package install. For example, you can suddenly run out of disk space halfway through; the system can crash halfway through; you can be trying to start or run a program from a package that is partway through being installed or updated. We want the system to cope with as many of these as possible, and especially we want as few bad things as possible to happen if something goes wrong partway through a package update. At a minimum we want to be able to roll back a partially applied package install or update if the package system discovers that there's a problem.

(On some systems there's also the issue that you can't overwrite at least some files that are in use, such as executables that are running.)

This implies that we can't just delete all of the existing files for a package (if any), upend a tarball on the disk, and be done with it. Instead we need a much more complicated multi-step operation: writing things to disk, making sure they've been synced to disk, replacing old files with new ones as close to atomically as possible, and then updating the package management system's database. If you're updating multiple packages at once, you also face a tradeoff in how much you aggregate together. If you do each package separately you add more disk syncs and disk IO, but if you do all packages at once you may grow both the transient disk space required and the risks if something goes wrong in the middle.

(Existing package management systems tend to be cautious because people are more willing to excuse them being slow than blowing up their systems once in a while.)
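
To make the careful path concrete, here is a minimal sketch (in Python, with an invented function name; no real package manager works exactly this way) of replacing just one file safely: write a temporary copy, sync it, rename it over the old file, and sync the directory. Multiply this by every file in every package, plus the package database update at the end, and the IO and the waiting add up.

    import os

    def replace_file_carefully(path, new_contents):
        # Write the new version under a temporary name in the same directory,
        # so the final rename stays on one filesystem and can be atomic.
        tmp_path = path + ".pkgnew"
        fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, new_contents)   # new_contents is bytes
            os.fsync(fd)                 # force the data to disk before committing
        finally:
            os.close(fd)

        # On POSIX, rename() atomically replaces the old file with the new one;
        # a crash leaves either the old file or the new one, never a mix.
        os.rename(tmp_path, path)

        # Sync the containing directory so the rename itself is durable.
        dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dirfd)
        finally:
            os.close(dirfd)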

To significantly accelerate this process, we need to do less IO and to wait for less IO. If we also want this process to not be drastically more risky, we have no real choice but to also make it much more transactional so that if there are problems at any point before the final (and single) commit point, we haven't done any damage. Unfortunately I don't think there's any way to do this within conventional systems today (and it's disruptive on even somewhat unconventional ones).
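
As a hedged sketch of what a single commit point could look like (this is roughly the idea behind tree- or image-based update schemes, not something conventional package managers offer; the function name and layout are invented): stage the complete new tree out of the way, then atomically repoint one 'current' symlink at it.

    import os

    def commit_new_tree(current_link, new_tree_dir):
        # Everything under new_tree_dir has already been written and synced;
        # nothing visible changes until the final rename, so a failure
        # anywhere before that point does no damage.
        tmp_link = current_link + ".new"
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)
        os.symlink(new_tree_dir, tmp_link)
        # This rename is the single commit point: before it the system still
        # sees the old tree, after it the new one.
        os.rename(tmp_link, current_link)

The catch is that a conventional Unix filesystem layout isn't organized so that 'the new versions of these packages' form one directory you can swap in, which is part of why this stays disruptive on ordinary systems.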

By the way, this is an advantage that installing a system from scratch has. Since there's nothing there to start with and the system is not running, you can do things the fast and sloppy way; if they blow up, the official remedy is 'reformat the filesystems and start from scratch again'. This makes package installation much more like unpacking a tarball than it normally is (and it may be little more than that once the dust settles).
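
For contrast, here's a minimal sketch of the fast and sloppy from-scratch path (the tarball-per-package layout is an assumption for illustration, not how any real installer is laid out): no per-file renames or syncs, just unpack everything and sync once at the very end.

    import os
    import tarfile

    def unpack_packages(tarballs, target_root):
        # No attempt at atomicity or rollback; if anything goes wrong,
        # the remedy is to reformat the filesystems and start over.
        for tb in tarballs:
            with tarfile.open(tb) as tf:
                tf.extractall(target_root)
        os.sync()   # one sync at the very end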

(I'm ignoring package postinstall scripts here because in theory that's a tractable problem with some engineering work.)

SlowPackageInstalls written at 00:06:10

