**2018-07-11**

## You should probably write down what your math actually means

Let's start with my tweet:

I have no brain, but at least I can work out what this math in my old code actually means, something I didn't bother to do when I wrote the original code & comments years ago. (That was a mistake, but maybe I didn't actually understand the math then, it just looked good.)

For the almost authentic experience I'm going to start with the code and its comments that I was looking at and explain it afterward.

    # if we are not going to protect a lot more data
    # than has already been protected, no. Note that
    # this is not the same thing as looking at the
    # percentages, because it is effectively percentages
    # multiplied by the number of disks being resilvered.
    # Or something.
    if rdata < psum*4 or rdata < esum*2:
        return (False, "won't protect enough extra data")

ZFS re-synchronizes redundant storage in an unusual way, which has some unfortunate implications some of the time. I will just quote myself from that entry:

So: if you have disk failure in one mirror vdev, activate a spare, and then have a second disk fail in another mirror and activate another spare, work on resilvering the first spare will immediately restart from scratch. [...]

My code and comment are from a function in our ZFS spares handling system that is trying to decide if it should activate another spare even though doing so will abort an in-progress spare replacement, or if it should let the first spare replacement finish.

The problem with this comment is that while it explains the idea
of this check to a certain extent, it doesn't explain the math at
all; the math is undocumented magic. It's especially undocumented
magic if you don't know what `rdata`, `psum`, and `esum` are and
where they come from, as I didn't when I was returning to this code
for the first time in several years (because I wanted to see if
it still made sense in a different environment). Since there's no explanation of
the math, we don't know if it actually expresses the comment's idea
or if it's making some sort of mistake, perhaps camouflaged by how
various terms are calculated.

(It's not that hard to get yourself lost in a tangle of calculated
terms. See, for example, the parenthetical discussion of how `svctm`
is calculated in this entry.)

In fact when I dug into this, it turns out that my math was at least misleading for us. I'll quote some comments:

    # psum: repaired space in all resilvering vdevs in bytes
    # esum: examined space in all resilvering vdevs in bytes
    [...]
    # NOTE: psum == esum for mirrors, but not necessarily for
    # raidz vdevs.

Our ZFS fileservers only have
mirrors and none of our spares handling code has ever been tested
on raidz pools. Using both `psum` and `esum` in my code was at best
a well intentioned brain slip, but in practice it was misleading.
Since both are the same, the real condition is the larger one, ie
'`rdata < psum*4`'. `rdata` itself is an estimate of how much
currently unredundant data we're going to add redundancy for with
our new spare or spares.
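
As a quick sanity check of that claim, here is a minimal Python sketch. The function names are made up for illustration and are not from the actual spares handling code; it only assumes that all three values are non-negative byte counts and that `psum == esum`, as is the case for mirrors.

```python
def original_check(rdata, psum, esum):
    """The condition exactly as written in the old code."""
    return rdata < psum * 4 or rdata < esum * 2


def mirror_check(rdata, psum):
    """The simplified condition, assuming psum == esum (mirrors only)."""
    return rdata < psum * 4


def checks_agree(max_value=50):
    """Compare both forms over small non-negative values with esum == psum."""
    return all(
        original_check(rdata, psum, psum) == mirror_check(rdata, psum)
        for rdata in range(max_value)
        for psum in range(max_value)
    )
```

The two forms agree because anything smaller than `psum*2` is also smaller than `psum*4`, so the `esum*2` half of the test can never be the deciding one when the two sums are equal.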

To start, let's rewrite that condition to be clearer. Ignoring
various pragmatic math issues, '`rdata < psum*4`' is the same as
'`rdata/4 < psum`'. In words and expanding the variables out,
this is true if we've already repaired more data than one
quarter of the additional data we'd make redundant by adding more
spares.

Is this a sensible criterion in general, or with these specific numbers? I honestly have no idea. But at least I now understand what the math is actually doing.

In fact it took two tries to get to this understanding, because it
turns out that I misinterpreted the math the first time around,
when I made my tweets. Only when I had to break it down again to
write this entry did I really work out what it's doing. This really
shows very vividly that the moment you understand your math (or
think you do), **write that understanding of your math down**. Be
specific. It's not necessarily going to be obvious to you later.

(If you work on some code all the time, or if the math is common knowledge in the field, maybe not; then it falls into the category of obvious comments that are saying 'add 2 + 2'. Also, perhaps better variable names could have helped here, as well as avoiding the too-clever use of a multiplication instead of a division.)
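
To make that last point concrete, here is one hypothetical way the check might read with more descriptive names and the fraction spelled out. This is only a sketch: the function name, the parameter names, and the `fraction` parameter are all invented for illustration, and it covers just the mirror case where `psum` and `esum` are the same.

```python
def should_skip_new_spare(unprotected_bytes, repaired_bytes, fraction=0.25):
    """Return True if activating another spare doesn't look worth it.

    unprotected_bytes: estimate of the currently unredundant data that the
        new spare or spares would add redundancy for ('rdata' in the old code).
    repaired_bytes: data already repaired across all resilvering vdevs
        ('psum' in the old code).
    fraction: 0.25 corresponds to the old 'rdata < psum*4' test.

    The check is true when we have already repaired more data than
    `fraction` of what the new spare would protect.
    """
    return unprotected_bytes * fraction < repaired_bytes
```

For example, `should_skip_new_spare(100, 30)` is true (30 is more than a quarter of 100), while `should_skip_new_spare(100, 20)` is false.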

PS: Since I wrote 'Or something.' even in the original comment, I clearly knew at the time that I was waving my hands at least a bit. I should have paid more attention to that danger sign back then, but I was probably too taken with my own cleverness. When it comes to this sort of math and calculation work, this is an ongoing issue and concern for me.
