Where the speed limits on our Amanda backups appear to be in 2023
A bit over five years ago, I wrote about the increasingly surprising speed limits on our Amanda backups. At that time we were using spinning rust hard drives on both the Amanda backup servers and our fileservers of the time, 10G networking on the Amanda servers, and 1G networking on the fileservers (for reasons). Generally this made either the Amanda 'holding disk' (where it streams backups to) or the fileserver 1G networking the limit on our backup speeds. Since then there have been two significant changes.
The first change was the move to our Linux fileservers, with 10G networking and SSD storage. With SSDs, we had both greatly increased read bandwidth and greatly increased IOs per second (which matters for scanning filesystems and figuring out what to back up for incremental backups). This unambiguously moved the limit on our backup speeds to the write speed of our Amanda 'holding disks', which were a striped set of HDDs. With some hand-waving, the striped filesystem could do around 200 Mbytes/sec (starting out higher and then dropping over time), which was well under what the fileservers could now deliver.
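As a rough illustration of what holding disk write bandwidth means in practice, here is a back-of-the-envelope sketch; the nightly backup volume used below is a made-up figure for illustration, not a number from our actual systems:

```python
def backup_hours(total_gbytes: float, write_mbytes_per_sec: float) -> float:
    """Hours to stream a given backup volume to the holding disk,
    assuming holding disk writes are the bottleneck the whole time."""
    return total_gbytes * 1000 / write_mbytes_per_sec / 3600

# Hypothetical 4 TB of nightly backups (an illustrative figure only).
nightly = 4000  # Gbytes

print(f"at 200 Mbytes/sec: {backup_hours(nightly, 200):.1f} hours")
print(f"at 500 Mbytes/sec: {backup_hours(nightly, 500):.1f} hours")
```

The real numbers vary night to night, but the shape of the arithmetic is why holding disk write speed matters so much: it multiplies directly into how long backups run.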
For a while we were okay with this, but as people kept using more space on our fileservers, we became increasingly interested in improving the situation. Moving to a striped filesystem with more HDDs wasn't a good solution for us, because a plain stripe has no redundancy; losing even a single HDD cost us Amanda's entire holding filesystem, which generally took out at least one night's worth of backups, and adding more disks would only raise the odds of that happening. So, after working out that SSD write volume limits were a problem for this workload, we took a deep breath and bought some expensive 'enterprise' grade SSDs with high write endurance to use as our Amanda holding disks. Because we really want our backups to always succeed, we use these SSDs in mirrored pairs for our Amanda holding disk filesystems.
Technically, this new setup has probably not changed where the speed limit of our Amanda backups is. Based on our metrics, the Amanda holding disks can do about 500 Mbytes/second of sustained write bandwidth, and this appears to be the current limit; we see the holding disks at 100% utilization, with both network and disk traffic at about 500 Mbytes/second. However, this is significantly higher than what we used to get, and it's noticeably shortened our backups.
Improving this would probably require going to NVMe drives, which might be able to sustain enough write bandwidth to make the 10G networking link the limiting factor. Or perhaps we'd find that fileservers couldn't scan their filesystems, run tar, and so on fast enough to saturate 10G.
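A quick sketch of the arithmetic behind this, where the 5% protocol overhead figure is an assumption (framing and TCP/IP headers eat some of the raw line rate), not something we've measured:

```python
def link_capacity_mbytes(gbits: float, overhead: float = 0.05) -> float:
    """Usable Mbytes/sec of a network link after protocol overhead.
    The overhead fraction is an assumed, approximate figure."""
    return gbits * 1000 / 8 * (1 - overhead)

ten_g = link_capacity_mbytes(10)   # usable 10G bandwidth, Mbytes/sec
holding_disk = 500                 # sustained write speed, from our metrics

print(f"10G usable: {ten_g:.0f} Mbytes/sec")
print(f"holding disk: {holding_disk} Mbytes/sec")
print(f"holding disk is the limit: {holding_disk < ten_g}")
```

On these numbers the holding disks would need to more than double their sustained write speed before the 10G link became the binding constraint, which is why NVMe is the plausible next step.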
(We'd need high write endurance NVMe drives, which would usually mean 'enterprise' style drives in the U.2 form factor. However, our Amanda backup servers don't need hot-swap drive bays, because we can shut them down more or less any time we want to, so we'd probably be fine with high write endurance NVMe M.2 drives. We could probably even put them on a PCIe card.)