The increasingly surprising limits to the speed of our Amanda backups
When I started dealing with backups, the slowest part of the process was generally writing things out to tape, which is why Amanda was much happier when you gave it a 'holding disk' that it could stage all of the backups on before it had to write them out to tape. Once you had that in place, the speed limit was generally some mix of the network bandwidth to the Amanda server and how fast the machines being backed up could grind through their filesystems to create the backups. When networks moved to 1G, you (and we) usually wound up being limited by the speed of reading through the filesystems to be backed up.
(If you were backing up a lot of separate machines, you might initially be limited by the Amanda server's 1G of incoming bandwidth, but once most machines started finishing their backups you usually wound up with one or two remaining machines that had larger, slower filesystems. This slow tail wound up determining your total backup times. This was certainly our pattern, especially because only our fileservers have much disk space to back up. The same has typically been true of backing up multiple filesystems in parallel from the same machine; sooner or later we wind up stuck with a few big, slow filesystems, usually ones we're doing full dumps of.)
Then we moved our Amanda servers to 10G-T networking and, from my perspective, things started to get weird. When you have 1G networking, it is generally slower than even a single holding disk; unless something's broken, modern HDs will generally do at least 100 Mbytes/sec of streaming writes, which is enough to keep up with a full speed 1G network. However, that is only just over 1G data rates, which means that a single HD is vastly outpaced by a 10G network. As long as we had a number of machines backing up at once, the Amanda holding disk was suddenly the limiting factor. However, for a lot of the run time of backups we're only backing up our fileservers, because they're where all the data is, and for that we're currently still limited by how fast the fileservers can do disk IO.
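(To make the mismatch concrete, here's the back of the envelope arithmetic as a little Python sketch; the specific rates are illustrative assumptions, not measurements of our hardware.)

# Rough throughput comparison; all figures are illustrative assumptions.
GBIT = 1000**3 / 8             # bytes/sec per 1 Gbit/sec of network bandwidth
net_1g = 1 * GBIT              # ~125 Mbytes/sec
net_10g = 10 * GBIT            # ~1250 Mbytes/sec
hd_write = 100 * 1000**2       # ~100 Mbytes/sec of streaming writes to one HD

print(f"1G network:  {net_1g / 1e6:.0f} Mbytes/sec")
print(f"one HD:      {hd_write / 1e6:.0f} Mbytes/sec")
print(f"10G network: {net_10g / 1e6:.0f} Mbytes/sec")
print(f"HDs needed to keep up with 10G: {net_10g / hd_write:.1f}")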
(The fileservers only have 1G network connections for reasons. However, usually it's disk IO that's the limiting factor, likely because scanning through filesystems is seek-limited. Also, I'm ignoring a special case where compression performance is our limit.)
All of this is going to change in our next generation of fileservers, which will have both 10G-T networking and SSDs. Assuming that the software doesn't have its own IO rate limits (which is not always a safe assumption), both the aggregate SSDs and all the networking from the fileservers to Amanda will be capable of anywhere from several hundred Mbytes/sec up to as much 10G bandwidth as Linux can deliver. At this point the limit on how fast we can do backups will be down to the disk speeds on the Amanda backup servers themselves. These will probably be significantly slower than the rest of the system, since even striping two HDs together would only get us up to around 300 Mbytes/sec at most.
(It's not really feasible to use a SSD for the Amanda holding disk, because it would cost too much to get the capacities we need. We currently dump over a TB a day per Amanda server, and things can only be moved off the holding disk at the now-paltry HD speed of 100 to 150 Mbytes/sec.)
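(For scale, here's the arithmetic on draining a day's dumps back off the holding disk at those HD speeds, as a Python sketch using the figures above.)

# Time to move a bit over 1 TB of dumps off the holding disk at HD speeds.
daily_dumps = 1.0e12            # roughly a day's dumps per Amanda server
for rate in (100e6, 150e6):     # the 'now-paltry' HD rates, in bytes/sec
    print(f"at {rate / 1e6:.0f} Mbytes/sec: {daily_dumps / rate / 3600:.1f} hours")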
This whole shift feels more than a bit weird to me; it's upended my perception of what I expect to be slow and what I think of as 'sufficiently fast that I can ignore it'. The progress of hardware over time has made it so the one part that I thought of as fast (and that was designed to be fast) is now probably going to be the slowest.
(This sort of upset in my world view of performance happens every so often, for example with IO transfer times. Sometimes it even sticks. It sort of did this time, since I was thinking about this back in 2014. As it turned out, back then our new fileservers did not stick at 10G, so we got to sleep on this issue until now.)
A learning experience about the performance of our IMAP server
Our IMAP server has never been entirely fast, and over the years it has slowly gotten slower and more loaded down. Why this was so seemed reasonably obvious to us; handling mail over IMAP required a fair amount of network bandwidth and a bunch of IO (often random IO) to our NFS fileservers, and there was only so much of that to go around. Things were getting slowly worse over time because more people were reading and storing more mail, while the hardware wasn't changing.
We have a long-standing backwards compatibility issue with our IMAP server, where people's IMAP clients have full access to their $HOME and would periodically go searching through all of it. Recently this started causing us serious problems, like running out of inodes on the IMAP server, and it became clear that we needed to do something about it. After a number of false starts (eg), we wound up doing two important things over the past two months. First we blocked Dovecot from searching through a lot of directories, and then we started manually migrating users one by one to a setup where their IMAP sessions could only see their $HOME/IMAP instead of all of their $HOME. The two changes together significantly reduced the number of files and directories that Dovecot is scanning through (and sometimes opening to count messages).
Well, guess what. Starting immediately with our first change and increasing as we migrated more and more high-impact users, the load on our IMAP server has been dropping dramatically. This is most clearly visible in the load average itself, where it's now entirely typical for the daytime load average to be under one (a level that was previously only achieved in the dead of night). The performance of my test Thunderbird setup has clearly improved, too, rising almost up to the level that I get on a completely unloaded test IMAP server. The change has basically been night and day; it's the most dramatic performance shift I can remember us managing (larger than finding our iSCSI problem in 2012). While the IMAP server's performance is not perfect and it can still bog down at some times, it's become clear that all of the extra scanning that Dovecot was doing was behind a great deal of the performance problems we were experiencing and that getting rid of it has had a major impact.
Technically, we weren't actually wrong about the causes of our IMAP server being slow; it definitely was due to network bandwidth and IO load issues. It's just that a great deal of that IO was completely unproductive and entirely avoidable, and if we had really investigated the situation we might have been able to improve the IMAP server long ago.
(And I think it got worse over time partly because more and more people started using clients, such as the iOS client, that seem to routinely use expensive scanning operations.)
The short and pungent version of what we learned is that IMAP servers go much faster if you don't let them do stupid things, like scan all through people's home directories. The corollary to this is that we shouldn't just assume that our servers aren't doing stupid things.
(You could say that another lesson is that if you know that your servers are occasionally doing stupid things, as we did, perhaps you should try to measure the impact of those things. But that's starting to smell a lot like hindsight bias.)
Some numbers for how well various compressors do with our /var/mail backups
Recently I discussed how gzip --best wasn't very fast when compressing our Amanda (tar) backup of /var/mail, and mentioned that we were trying out zstd for this. As it happens, as part of our research on this issue I ran one particular night's backup of our /var/mail through all of the various compressors to see how large they'd come out, and I think the numbers are usefully illustrative.
The initial uncompressed tar archive is roughly 538 GB and is probably almost completely ASCII text (since we use traditional mbox format inboxes and most email is encoded to 7-bit ASCII). The compression ratios are relative to the uncompressed file, while the times are relative to the fastest compression algorithm. Byte sizes were counted with 'wc -c' instead of writing the results to disk, and I can be confident that the compression programs were the speed limit on this system, not reading the initial tar archive off SSDs.
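(I didn't keep the exact test harness, but a Python sketch of the sort of measurement involved looks like the following; the archive path is hypothetical and the compressor invocations are the obvious ones.)

#!/usr/bin/env python3
# Sketch: run the same tar archive through each compressor, count the
# output bytes (like 'wc -c') instead of writing them, and time it.
import os, subprocess, time

ARCHIVE = "/dumps/mail.tar"        # hypothetical path to the night's backup
COMPRESSORS = {
    "lz4":         ["lz4", "-c"],
    "zstd -1":     ["zstd", "-1", "-c"],
    "zstd -3":     ["zstd", "-3", "-c"],
    "gzip --fast": ["gzip", "--fast", "-c"],
    "gzip --best": ["gzip", "--best", "-c"],
}

insize = os.path.getsize(ARCHIVE)
results = {}
for name, cmd in COMPRESSORS.items():
    start = time.time()
    with open(ARCHIVE, "rb") as src:
        proc = subprocess.Popen(cmd, stdin=src, stdout=subprocess.PIPE)
        outbytes = 0
        while chunk := proc.stdout.read(1024 * 1024):
            outbytes += len(chunk)
        proc.wait()
    results[name] = (outbytes, time.time() - start)

fastest = min(t for _, t in results.values())
for name, (outbytes, secs) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name:12} ratio {insize / outbytes:5.2f}  time x{secs / fastest:4.2f}")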
[Table: compression ratio and time ratio for each compressor]
(The 'uncompressed' time is for 'cat <file> | wc -c'.)
On this very real-world test for us, zstd is clearly a winner over gzip; it achieves better compression in far less time. gzip --fast takes about 32% less time than gzip --best at only a moderate cost in compression ratio, but it's not competitive with zstd in either time or compression. Zstd is not as fast as lz4, but it's fast enough while providing clearly better compression.
We're currently using the default zstd compression level, which is 'zstd -3' (we're just invoking plain 'zstd'). The numbers suggest that we'd lose very little compression from switching to 'zstd -1' but would get a significant speed increase. At the moment we're going to leave things as they are, because our backups are now fast enough (backing up /var/mail is no longer the limiting factor on their overall speed) and we do get something for that extra time. Also, it's simpler; because of how Amanda works, we'd need to add a cover script to switch to 'zstd -1'.
(Amanda requires you to specify a program as your compressor, not a program plus arguments, so if you want to invoke the real compressor with some non-default options you need a cover script.)
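(Such a cover script can be very small. Here's a minimal sketch in Python, on the assumption that Amanda runs it as a stdin-to-stdout filter; a two-line shell script would do the same job.)

#!/usr/bin/env python3
# Minimal sketch of a cover script so Amanda can run zstd with a
# non-default level. We just replace ourselves with 'zstd -1 -c',
# passing along any arguments that Amanda supplies.
import os, sys
os.execvp("zstd", ["zstd", "-1", "-c"] + sys.argv[1:])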
Since someone is going to ask, pigz --fast got a compression ratio of 1.78 and a time ratio of 1.27. This is extremely unrepresentative of what we could achieve in production on our Amanda backup servers, since my test machine has two eight-core Xeon Silver 4108 CPUs. The parallelism speed increase for pigz is not perfect, since it was only about 9.4 times as fast as gzip --fast (which is single-core).
(Since I wanted to see the absolute best case for pigz in terms of speed, I ran it on all of the machine's cores. I'm not interested in doing more tests to establish how it scales when run with fewer cores, since we're not going to use it; zstd is better for our case.)
PS: I'm not giving absolute speeds because these speeds vary tremendously across our systems and also depend on what's being compressed, even with just ASCII text.
Today's learning experience is that gzip is not fast
For reasons beyond the scope of this entry, we have a quite large /var/mail and we take a full backup of it every night. In order to save space in our disk-based backup system, for years we've been having Amanda compress these backups on the Amanda server; since we're backing up ASCII text (even if it represents encoded and compressed binary things), they generally compress very well. We did this in the straightforward way; as part of our special Amanda dump type that forces only full backups for /var/mail, we said 'compress server best'. This worked okay for years, which enticed us into not looking at it too hard until we recently noticed that our backups of /var/mail were taking almost ten hours.
(They should not take ten hours. /var/mail is only about 540 GB and it's on SSDs.)
It turns out that Amanda's default compression uses gzip, and when you tell Amanda to use the best compression it runs 'gzip --best', aka 'gzip -9'.
Now, I was vaguely aware that gzip is not the fastest compression method in the world (if only because ZFS uses lz4 compression by default and recommends you avoid gzip), but I also had the vague impression that it was reasonably decently okay as far as speed went (and I knew that bzip2 and xz were slower, although they compress better). Unfortunately my impression turns out to be very wrong. Gzip is a depressingly slow compression system, especially if you tell it to go wild and try to get the best compression it can. Specifically, on our current Amanda server hardware 'gzip --best' appears to manage a rate of only about 16 MBytes a second. As a result, our backups of /var/mail are almost entirely constrained by how slowly gzip runs.
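(The arithmetic here is simple and grim; in Python:)

# Why ~16 Mbytes/sec of 'gzip --best' turns 540 GB into a long night.
size = 540e9                    # /var/mail, in bytes
rate = 16e6                     # observed gzip --best rate, bytes/sec
print(f"{size / rate / 3600:.1f} hours")    # about 9.4 hours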
(See lz4's handy benchmark chart for one source of speed numbers. Gzip is 'zlib deflate', and zlib at the 'compress at all costs' -9 level isn't even on the benchmark chart.)
The good news is that there are faster compression programs out there, and at least some of them are available pre-packaged for Ubuntu. We're currently trying out zstd as probably having a good balance between running fast enough for us and having a good compression ratio. Compressing with lz4 would be significantly faster, but it also appears that it would get noticeably less compression.
It's worth noting that not even lz4 can keep up with full 10G Ethernet speeds (on most machines). If you have a disk system that can run fast enough (which is not difficult with modern SSDs) and you want to saturate your 10G network during backups, you can't do compression in-stream; you're going to have to capture the backup stream to disk and then compress it later.
PS: There's also parallel gzip, but that has various limitations in practice; you might have multiple backup streams to compress, and you might need that CPU for other things too.
Using a local database to get consistent device names is a bad idea
People like consistent device names, and one of the ways that Unixes have historically tried to get them is to keep a local database of known devices and their names, based on some sort of fingerprint of the device (the MAC address is a popular fingerprint for Ethernet interfaces, for example). Over the years various Unixes have implemented this in different ways; for example, some versions of Linux auto-created udev rules for some devices, and Solaris and derivatives have /etc/path_to_inst. Unfortunately, I have to tell you that trying to get consistent device names this way turns out to be a bad idea.
The fundamental problem is that if you keep a database of local device names, your device names depend on the history of the system. This has two immediate bad results. First, if you have two systems with identical hardware running identical software they won't necessarily use the same device names, because one system could have previously had a different hardware configuration. Second, if you reinstall an existing system from scratch you won't necessarily wind up with the same device names, because your new install won't necessarily have the same history as the current system does.
(Depending on the scheme, you may also have the additional bad result that moving system disks from one machine to an identical second machine will change the device names because things like MAC addresses changed.)
Both of these problems are bad once you start dealing with multiple systems. They make your systems inconsistent, which increases the work required to manage them, and they make it potentially dangerous to reinstall systems. You wind up either having to memorize the differences from system to system or needing to assemble your own layer of indirection on top of the system's device names so you can specify things like 'the primary network interface, no matter what this system calls it'.
Now, you can have these machine-to-machine variation problems even with schemes that derive names from the hardware configuration. But with such schemes, at least you only have these problems on hardware that's different, not on hardware that's identical. If you have truly identical hardware, you know that the device names are identical. By extension you know that the device names will be identical after a reinstall (because the hardware is the same before and after).
I do understand the urge to have device names that stay consistent even if you change the hardware around a bit, and I sometimes quite like them myself. But I've come to think that such names should be added as an extra optional layer on top of a system that creates device names that are 'stateless' (ie don't care about the past history of the system). It's also best if these device aliases can be based on general properties (or set up by hand in configuration files), because often what I really want is an abstraction like 'the network interface that's on network X' or 'the device of the root filesystem'.
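(As a sketch of what I mean by such a layer, here's a hypothetical Python fragment that resolves 'the network interface with this MAC address' from stateless information in Linux's /sys; the MAC address is made up.)

#!/usr/bin/env python3
# Resolve a device alias like 'the interface with this MAC' from /sys,
# instead of trusting a history-dependent local name database.
import os

def iface_with_mac(mac):
    for iface in os.listdir("/sys/class/net"):
        with open(f"/sys/class/net/{iface}/address") as f:
            if f.read().strip().lower() == mac.lower():
                return iface
    return None

print(iface_with_mac("00:11:22:33:44:55"))   # hypothetical MAC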
Our revised Dovecot IMAP configuration migration plans (and processes)
Back at the start of January, I wrote up the goals and problems of our Dovecot IMAP migration, and in an appendix at the end I outlined what became our initial migration plans. We would build an entirely new Dovecot server that was set up with people's IMAP mail folder storage being a subdirectory of their $HOME, $HOME/mail (call this the IMAP root), and then we would get people to move to this server one by one. Migration would require them to change their clients and might require them (or us) to move files in Unix. Eventually we would tell the remaining holdouts that we were just going to turn off the old IMAP server and they had to migrate now.
Initially, the great virtue I saw in this plan was that it was entirely user driven and didn't require us to do anything. The users did everything, could go at their own speed, and were completely responsible for what happened. In an environment where we couldn't count on clients using IMAP subscriptions so we could know what people's mailboxes actually were, things had to be user-driven anyway, and we generally try to stay out of doing per-user things because it doesn't scale; we have a lot of users and not very many people looking after our central systems (including the IMAP server).
As we talked more and more about this, we realized that the central problem with this plan is that everyone had to migrate and this involved the users doing things (often at the Unix level), or getting someone to help them. As mentioned, we have a lot of users, and some of them are quite important (eg, professors) and can't just be abandoned to their fate. There was no way to make this not be disruptive to people. At the same time, most of our users were not causing any problems, which meant that we'd be forcing a lot of people to do disruptive things (on all of their devices, better not miss one) to deal with a problem created by a much smaller number of users.
If this was the only way to deal with things, we might still have gone ahead with it. But as I sort of alluded to in passing in the January entry, it's possible to do this on a per-user basis in Dovecot using a shell script (see the bottom of MailLocation). After we talked it over, we decided that this was the way we wanted to handle the migration to people's IMAP sessions being confined to a subdirectory of their $HOME; it would be done on a per-user basis and we'd directly target high-priority problem cases. The vast majority of our current users would forever stay un-migrated, while new users would be set up to be confined to a subdirectory from the start (ie, using the new IMAP root).
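(We're not reproducing our actual mechanism here, but the shape of the idea is a per-user lookup that hands Dovecot a different mail location for migrated users. A hypothetical Python sketch follows; the file name and the exact mail location strings are assumptions.)

#!/usr/bin/env python3
# Hypothetical per-user decision: migrated users get $HOME/IMAP as
# their IMAP root, everyone else keeps the old all-of-$HOME setup.
import sys

MIGRATED = "/etc/dovecot/migrated-users"    # made-up file, one login per line

def mail_location(user):
    with open(MIGRATED) as f:
        migrated = {line.strip() for line in f if line.strip()}
    if user in migrated:
        return "mbox:~/IMAP:INBOX=/var/mail/%u"
    return "mbox:~/:INBOX=/var/mail/%u"

print(mail_location(sys.argv[1]))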
As much as possible, we wanted this migration to be transparent to users (or at least important ones). That meant that the IMAP mailbox names as seen by the clients couldn't change, and that meant that no matter what, we were going to have to move files around; there's no other way for this to be transparent to clients when you change the IMAP root. Given that, it wasn't important to pick a new IMAP root that people already used for mailboxes, so we picked $HOME/IMAP for various reasons (including that calling it this made it clear what it was for).
Since this plan means that we're moving user mailboxes around at least some of the time (in order to migrate problem users), knowing what those mailboxes were became important enough to get us to hack some mailbox logging into Dovecot. Having this information has been extremely reassuring. Even when it just duplicates the information in a user's .subscriptions file, it also confirms that that information is accurate and complete.
We started out with plans for a two-stage operation for most users, where we'd first tell them to move all of their IMAP mailboxes under 'mail/' in their client (ie, $HOME/mail) before some deadline; then at the deadline we'd make $HOME/IMAP point to $HOME/mail and flip the server setting that made $HOME/IMAP their IMAP root. In practice it's turned out to be easier to do the file moving ourselves, based on both .subscriptions files and the logs, so our current approach is to just tell various users 'unless you object, at time X we'll be improving your IMAP client experience by ...' and then at time X we do everything ourselves. It's been a little bit surprising how few actual active mailboxes some of these users have, especially relative to how much of an impact they've been having on the server.
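(The mechanical part of a migration is not much code once you trust your list of mailboxes. A hypothetical Python sketch, with the paths and layout assumed:)

#!/usr/bin/env python3
# Hypothetical sketch: move each known mailbox from $HOME into
# $HOME/IMAP, preserving the relative paths that clients see.
import os, shutil

def migrate(home, mailboxes):
    imap_root = os.path.join(home, "IMAP")
    for mbox in mailboxes:                  # eg from .subscriptions + logs
        src = os.path.join(home, mbox)
        dst = os.path.join(imap_root, mbox)
        if os.path.exists(src) and not os.path.exists(dst):
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)

# eg: migrate("/h/someuser", ["Mail/sent-mail", "Mail/archive"])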
(This genuinely does improve the IMAP client experience for people, for obvious reasons. An IMAP client that is scanning all of your $HOME and maybe opening all the files there is generally not a responsive client, not if your $HOME is at all large.)
PS: Although I haven't been writing about it here on Wandering Thoughts until recently, our IMAP situation has been consuming a lot of my attention and time at work. It's turned into a real learning experience in several ways.
Our current ugly hacks to Dovecot to help mitigate our IMAP problems
Back in the comments of this entry from the end of December, I said that we weren't willing to take on the various burdens of changing our local Dovecot to add some logging of things like the mailboxes that people's clients were accessing. In yesterday's entry I mentioned that we actually had hacked up our Dovecot to do exactly that. You might wonder what happened between December and now to cause us to change our minds. The short version is that from our perspective, things on our IMAP server got worse and so we became more willing to do things to mitigate our problems (especially since our migration plans were clearly not going to give us any short term improvements).
(It's not clear to me if the problems got worse in the past few months, which is certainly possible, or if we just noticed more and more about how bad things were once we started actively looking into the state of the server.)
We wound up making two changes to help mitigate our problem; our added logging is actually the second and less alarming one. Our first and most significant change was that we hacked Dovecot so that LIST operations would ignore all names that started with a '.' or were called exactly public_html, which is the name of the symlink that we drop into people's home directories to point to their web space. We made this change because monitoring runaway Dovecot processes that were rummaging through people's $HOME showed that many of them were traversing through subdirectory hierarchies that went through subdirectories like $HOME/.cache, and so on. None of those have actual mailboxes, but all of them are great places to find a lot of files, which is not a good thing in our environment. The public_html part of this had a similar motivation; we saw a significant number of Dovecot sessions that had staged great escapes into collections of data and other files that people had published in their home pages. Making this change didn't eliminate our problems but it clearly helped; we saw less load and less inode usage for Dovecot's indexes.
(While this sounds like a big change, it was a very small code modification. However, the scary part of making it was not being entirely sure that the effects of the change were only confined to IMAP LIST operations. Yes, we tested.)
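(Expressed as a sketch in Python rather than the actual C diff, the effect of the change is a filter like this:)

# A name survives our hacked LIST only if it doesn't start with a
# dot and isn't the web-space symlink.
def visible_in_list(name):
    return not name.startswith(".") and name != "public_html"

assert not visible_in_list(".cache")
assert not visible_in_list("public_html")
assert visible_in_list("Mail")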
Once we'd broken the ice with this change, it was much less of a big deal to add some logging to capture information about what IMAP mailboxes people were using. We started out by logging just SELECTs, but seeing our logging in action made it obvious that clients used a variety of IMAP commands and we needed to add logging to all of them to be confident that we were going to see all of the mailboxes they were using. To reduce the log volume, we skip logging SELECTs of INBOX; it turns out that clients do this all the time, and it's not interesting for our uses of the information.
(I had fun hunting through the IMAP RFC for commands that take mailbox names as one of their arguments, and I'm not sure I got them all. But I'm reasonably confident that we log almost all of them; we currently log for LIST, APPEND, MOVE, COPY, and RENAME. I didn't bother with CREATE, on the grounds that clients would probably do some other operation after CREATE'ing a mailbox if it mattered.)
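(Since the log format is our own invention, here's only a hypothetical Python sketch of how such logging boils down to 'which mailboxes does each user actually touch'; the line format is made up.)

#!/usr/bin/env python3
# Hypothetical sketch: summarize mailbox logging into per-user sets.
# Assumed line format: 'user=<login> cmd=<CMD> mailbox=<name>'.
import collections, sys

mailboxes = collections.defaultdict(set)
for line in sys.stdin:
    fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
    if "user" in fields and "mailbox" in fields:
        mailboxes[fields["user"]].add(fields["mailbox"])

for user, boxes in sorted(mailboxes.items()):
    print(user, len(boxes), " ".join(sorted(boxes)))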
Once we were adding logging, I decided to throw in logging of LIST arguments so we could understand when and how it was being used. This turned out to be very valuable, partly because I was starting from a position of relative ignorance about the IMAP protocol and how real IMAP clients behave. A fair bit of what I wrote about yesterday came from that logging, especially the realization that clients could scan through all of $HOME without leaving tell-tale signs in Dovecot's indexes, which meant that our problems were worse than we'd realized. Unfortunately the one current limitation of our LIST logging is that we can't log how many entries were returned by the LIST command. For obvious reasons, it would be very handy to be able to tell the difference between a LIST command that returned ten names and one that returned 5,000.
I was quite pleasantly surprised to discover that the Dovecot source code is very nicely structured and organized, which made these changes much easier than they might otherwise have been. In particular, each IMAP command is in a separate source file, all with obvious names like 'cmd-list.c', and their main operation was pretty self-contained and obvious. Logging was really easy to add, and even the change to make LIST skip some names wasn't too difficult (partly because this part of the code was already skipping certain names, which gave me a starting point). As I noted yesterday, I hacked this directly into the main Dovecot source rather than trying to figure out the plugin API (which is undocumented as far as I can see). I believe that we could do all of the logging we're currently doing through the plugin API, and that's clearly the more generally correct approach to it.
Knowing what mailboxes people are using is a relatively important part of our current migration plans (which have completely changed from what I wrote up for various reasons), but that's going to be another entry.
Some things about Dovecot, its index files, and the IMAP LIST command
We have a backwards compatibility issue with our IMAP server, where people's IMAP roots are $HOME, their home directory, and then clients ask the IMAP server to search all through the IMAP namespace; this causes various bad things to happen, including running out of inodes. The reason we ran out of inodes is that Dovecot maintains some index files for every mailbox it looks at.
We have Dovecot store its index files on our IMAP server's local disk, in /var/local/dovecot/<user>. Dovecot puts these in a hierarchy that mirrors the actual Unix (and IMAP) hierarchy of the mailboxes; if there is a mailbox Drafts in someone's Mail subdirectory, the Dovecot index files will be in .../<user>/Mail/.imap/Drafts/. It follows that you can hunt through someone's Dovecot index files to see what mailboxes their clients have looked at, although this may tell you less than you think about what their active mailboxes are.
(One reason that Dovecot might look at a mailbox is that your client has explicitly asked it to, with an IMAP SELECT command or perhaps a MOVE operation. However, there are other reasons.)
When I began digging into our IMAP pain and working on our planned migration (which has drastically changed directions since then), I was operating under the charming idea that most clients used IMAP subscriptions and only a few of them asked the IMAP server to inventory everything in sight. One of the reasons for this is that only a few people had huge numbers of Dovecot index files, and I assumed that the two were tied together. It turns out that both sides of this are wrong.
Perhaps I had the idea that it was hard to do an IMAP operation that asked the server to recursively descend through everything under your IMAP root. It isn't; it's trivial. Here's the IMAP command to do it:
m LIST "" "*"
That's all it takes (the unrestricted * is the important bit). The sort of good news is that this operation by itself won't cause Dovecot to actually look at those mailboxes and thus to build index files for them. However, there is a close variant of this LIST command that does force Dovecot to look at each file, because it turns out that you can ask your IMAP server to not just list all your mailboxes but to tell you which ones have unseen messages. That looks like this:
m LIST "" "*" RETURN (SPECIAL-USE STATUS (UNSEEN))
Some clients use one LIST version, some use the other, and some seem to use both. Importantly, the standard iOS Mail app appears to use the 'LIST UNSEEN' version at least some of the time. iDevices are popular around the department, and it's not all that easy to find the magic setting for what iOS calls the 'IMAP path prefix'.
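(You don't even need a mail client to do this; it's one call in Python's standard imaplib, for example. The server name and credentials here are placeholders.)

#!/usr/bin/env python3
# The full-namespace LIST is trivial to issue from imaplib.
import imaplib

conn = imaplib.IMAP4_SSL("imap.example.com")     # placeholder server
conn.login("someuser", "somepassword")           # placeholder credentials

typ, data = conn.list('""', "*")                 # ie: m LIST "" "*"
print(typ, len(data), "mailboxes listed")
conn.logout()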
For us, a user with a lot of Dovecot index files was definitely someone who had a client with the 'search all through $HOME' problem (especially if the indexes were for things that just aren't plausible mailboxes). However, a user with only a few index files wasn't necessarily someone without the problem, because their client could be using the first version of the LIST command and thus not creating all those tell-tale index files. As far as I know, stock Dovecot has no way of letting you find out about these people.
(We hacked logging in to the Ubuntu version of Dovecot, which involved some annoyances. In theory Dovecot has a plugin system that we might have been able to use for this; in practice, figuring out the plugin API seemed likely to be at least as much work as hacking the Dovecot source directly.)
Sidebar: Limited LISTs
IMAP LIST commands can be limited in two ways, both of which have more or less the same effect for us:
m LIST "" "mail/*" m LIST "mail/" "*"
For information on what the arguments to the basic LIST command mean, I will refer you to the IMAP RFC. The extended form is discussed in RFC 5819 and is based on things from, I believe, RFC 5258. See also RFC 6154 for the special-use stuff.
Why Let's Encrypt's short certificate lifetimes are a great thing
I recently had a conversation on Twitter about what we care about in TLS certificate sources, and it got me to realize something. I've written before about how our attraction to Let's Encrypt has become all about the great automation, but what I hadn't really thought about back then was how important the short certificate lifetimes are. What got me to really thinking about it was a hypothetical; suppose we could get completely automatically issued and renewed free certificates but they had the typical one or more year lifetime of most TLS certificates to date. Would we be interested? I realized that we would not be, and that we would probably consider the long certificate lifetime to be a drawback, not a feature.
There is a general saying in modern programming to the effect that if you haven't tested it, it doesn't work. In system administration, we tend towards a modified version of that saying; if you haven't tested it recently, it doesn't work. Given our generally changing system environments, the recently is an important qualification; it's too easy for things to get broken by changes around them, so the longer it's been since you tried something, the less confidence you can have in it. The corollary for infrequent certificate renewal is obvious, because even in automated systems things can happen.
With Let's Encrypt, we don't just have automation; the short certificate lifetime ensures that we exercise it frequently. Our client of choice (acmetool) renews certificates when they're 30 days from expiring, so although the official Let's Encrypt lifetime is 90 days, we roll over certificates every sixty days. Having a rollover happen once every two months is great for building and maintaining our confidence in the automation, in a way that wouldn't happen if it was once every six months, once a year, or even less often. If it was that infrequent, we'd probably end up paying attention during certificate rollovers even if we let automation do all of the actual work. With the frequent rollover due to Let's Encrypt's short certificate lifetimes, they've become things we trust enough to ignore.
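(The cadence arithmetic is trivial, but it's the whole point; in Python:)

# How often the automation gets exercised: renew at 30 days left on a
# 90-day certificate versus the same policy on a one-year certificate.
for lifetime, label in ((90, "Let's Encrypt"), (365, "one-year cert")):
    rollover = lifetime - 30        # acmetool renews at 30 days remaining
    print(f"{label}: rollover every {rollover} days, "
          f"about {365 // rollover} time(s) a year")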
(Automatic certificate renewal for long duration certificates is not completely impossible here, because the university central IT has already arranged for free certificates for the university. Right now they're managed through a website and our university-wide authentication system, but in theory there could be automation for at least renewals. Our one remaining non Let's Encrypt certificate was issued through this service as a two year certificate.)
How I tend to label bad hardware
Every so often I wind up dealing with some piece of hardware that's bad, questionable, or apparently flaky. Hard disks are certainly the most common thing, but the most recent case was a 10G-T network card that didn't like coming up at 10G. For a long time I was sort of casual about how I handled these; generally I'd set them aside with at most a postit note or the like. As you might suspect, this didn't always work out so great.
These days I have mostly switched over to doing this better. We have a labelmaker (as everyone should), so any time I wind up with some piece of hardware I don't trust any more, I stick a label on it to mark it and say something about the issue. Labels that have to go on hardware can only be so big (unless I want to wrap the label all over whatever it is), so I don't try to put a full explanation; instead, my goal is to put enough information on the label so I can go find more information.
My current style of label looks broadly like this (and there's a flaw in this label):
volary 2018-02-12 no 10g problem
The three important elements are the name of the server the hardware came from (or was in when we ran into problems), the date, and some brief note about what the problem was. Given the date (and the machine) I can probably find more details in our email archives, and the remaining text hopefully jogs my memory and helps confirm that we've found the right thing in the archives.
As my co-workers gently pointed out, the specific extra text on this label is less than ideal. I knew what it meant, but my co-workers could reasonably read it as 'no problem with 10G' instead of the intended meaning of 'no 10g link', ie the card wouldn't run a port at 10G when connected to our 10G switches. My takeaway is that it's always worth re-reading a planned label and asking myself if it could be misread.
A corollary to labeling bad hardware is that I should also label good hardware that I just happen to have sitting around. That way I can know right away that it's good (and perhaps why it's sitting around). The actual work of making a label and putting it on might also cause me to recycle the hardware into our pool of stuff, instead of leaving it sitting somewhere on my desk.
(This assumes that we're not deliberately holding the disks or whatever back in case we turn out to need them in their current state. For example, sometimes we pull servers out of service but don't immediately erase their disks, since we might need to bring them back.)
Many years ago I wrote about labeling bad disks that you pull out of servers. As demonstrated here, this seems to be a lesson that I keep learning over and over again, and then backsliding on for various reasons (mostly that it's a bit of extra work to make labels and stick them on, and sometimes it irrationally feels wasteful).
PS: I did eventually re-learn the lesson to label the disks in your machines. All of the disks in my current office workstation are visibly labeled so I can tell which is which without having to pull them out to check the model and serial number.