Wandering Thoughts archives


SSD versus NVMe for basic servers today (in early 2021)

I was recently reading Russell Coker's Storage Trends 2021 (via). As part of the entry, Coker wrote:

Last year NVMe prices were very comparable for SSD prices, I was hoping that trend would continue and SSDs would go away. [...]

Later Coker notes, about the current situation:

It seems that NVMe is only really suitable for workstation storage and for cache etc on a server. So SATA SSDs will be around for a while.

Locally, we have an assortment of servers, mostly basic 1U ones, which have either two 3.5" disk bays, four 3.5" disk bays, or rarely a bunch of disk bays (such as our fileservers). None of these machines can natively take NVMe drives as far as I know, not even our 'current' Dell server generation (which is not entirely current any more). They will all take SSDs, though, possibly with 3.5" to 2.5" adapters of various sorts. So for us, SSDs fading away in favour of NVMe would not be a good thing, not until we turn over all our server inventory to ones using NVMe drives. Which raises the question of where those NVMe servers are and why they aren't more common.

For servers that want more than four drive bays, such as our fileservers, my impression is that one limiting factor for considering an NVMe based server has generally been PCIe lanes. If you want eight or ten or sixteen NVMe drives (or more), the numbers add up fast if you want them all to run at x4 (our 16-bay fileservers would require 64 PCIe lanes). You can get a ton of PCIe lanes but it requires going out of your way, in CPU and perhaps in CPU maker (to AMD, which server vendors seem to have been slow to embrace). You can get such servers (Let's Encrypt got some), but I think they're currently relatively specialized and expensive. With such a high cost for large NVMe, most people who don't have to have NVMe's performance would rather buy SATA or SAS based systems like our fileservers.
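The lane arithmetic behind that claim is simple enough to sketch. Here's a toy Python calculation (my own illustration, not anything from Coker's post), assuming every NVMe drive gets a full x4 link:

```python
# Toy PCIe lane-budget arithmetic for an all-NVMe server.
# Assumption: each drive runs at full speed on a PCIe x4 link.
LANES_PER_DRIVE = 4

for bays in (2, 4, 8, 16):
    print(f"{bays:2d} bays -> {bays * LANES_PER_DRIVE} PCIe lanes just for drives")
```

A 16-bay system wants 64 lanes for drives alone, before you account for network cards and anything else, which is why you end up reaching for CPUs with unusually large lane counts.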

(To really get NVMe speeds, these PCIe lanes must come directly from the CPU; otherwise they will get choked down to whatever the link speed is between the CPU and the chipset.)

Garden variety two drive and four drive NVMe systems would only need eight or sixteen PCIe lanes, which I believe is relatively widely available even if you're saving an x8 for the server's single PCIe expansion card slot. But then you have to physically get your NVMe drives into the system. People who operate servers really like drive carriers, especially hot-swappable ones. Unfortunately I don't think there's a common standard for this for NVMe drives (at one point there was U.2, but it's mostly vanished). In theory a server vendor could develop a carrier system that would let them mount M.2 drives, perhaps without being hot swappable, but so far I don't think any major vendor has done the work to develop one.


(The M.2 form factor is where the NVMe volume is due to consumer drives, so basic commodity 1U servers need to follow along. The Dell storage server model that Let's Encrypt got seems to use U.2 NVMe drives, which will presumably cost you 'enterprise' prices, along with the rest of the server.)

All of this seems to give us a situation where SATA remains the universal solvent of storage, especially for basic 1U servers. You can fit four 3.5" SATA drive bays into the front panel of a 1U server, which covers a lot of potential needs for people like us. We can go with two SSDs, four SSDs, two SSDs and two big HDs, and so on.

(NVMe drives over 2 TB seem relatively thin on the ground at the moment, although SSDs only go up one step to 4 TB if you want plenty of options. Over that, right now you're mostly looking at 3.5" spinning rust, which is another reason to keep 1U servers using 3.5" SATA drive bays.)

tech/ServerSSDVsNVMeIn2021 written at 22:03:28

Counting how many times something started or stopped failing in Prometheus

When I recently wrote about Prometheus's changes() function and its resets(), I left a larger scale issue not entirely answered. Suppose that you have a metric that is either 0 or 1, such as Blackbox's probe_success, and you want to know either how many times it's started failing or how many times it's stopped failing over a time interval.

Counting how many times a probe_success time series has started to fail over a time interval is simple. As explained at more length in the resets() entry, we can simply use it:

resets( probe_success [1d] )

We can't do this any more efficiently than with resets(), because no matter what we do Prometheus has to scan all of the time series values across our one day range. The only way this could be more efficient would be if Prometheus gained some general feature to stream through all of the time series points it has to look at over that one-day span, instead of loading them all into memory.
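As a cross-check on what resets() is counting here, this small Python sketch (my own illustration, not Prometheus code) applies the same rule to a list of 0/1 samples: count every decrease between consecutive samples, which for a boolean series is exactly the number of 1 → 0 transitions:

```python
def resets(samples):
    """Count decreases between consecutive samples, mimicking PromQL resets().
    On a 0/1 series this is the number of 1 -> 0 transitions, i.e. the
    number of times the probe started failing."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)

# A day's worth of hypothetical probe_success samples:
# up, fails, recovers, fails again.
samples = [1, 1, 0, 0, 1, 1, 0]
print(resets(samples))  # -> 2
```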

Counting how many times a probe_success time series has started to succeed (after a failure) over the time interval is potentially more complex, depending on how much you care about efficiency. The straightforward answer is to use changes() to count how many times it has changed state between success and failure and then use resets() to subtract how many times it started to fail:

changes( probe_success [1d] ) - resets( probe_success [1d] )

But unless Prometheus optimizes things, this will load one day's worth of every probe_success time series twice, first for changes() and then again for resets().
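The logic of the subtraction can be verified with the same sort of Python sketch (again, an illustration on a list of 0/1 samples, not Prometheus itself): every state change is either a start-of-failure or a start-of-success, so changes minus decreases leaves the 0 → 1 transitions:

```python
def changes(samples):
    """Count all value changes, mimicking PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

def resets(samples):
    """Count decreases (1 -> 0 transitions on a boolean series)."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)

samples = [1, 0, 0, 1, 0, 1]  # fails twice, recovers twice
print(changes(samples) - resets(samples))  # -> 2
```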

One approach to avoiding this extra load is to count changes and divide by two, but this goes wrong if the probe started out in a different state than it finished. If this happens, changes() will be odd and we will have a fractional success, which needs to be rounded down if the probe started out succeeding and rounded up if the probe started out failing. We can apparently achieve our desired rounding in a simple, brute force way as follows:

floor( ( changes( probe_success[1d] ) + probe_success )/2 )

What this does at one level is add one to changes() if the probe was succeeding at the end of the time period. This extra change doesn't matter if the probe started out succeeding, because then changes() will be even, the addition will make it odd, and then dividing by two and flooring will ignore the addition. But if the probe started out failing and ended up succeeding, changes() will be odd, the addition will make it even, and dividing by two will 'round up'.
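The parity argument is easy to convince yourself of with a Python model of the expression (my own sketch of the arithmetic, using the last sample the way the PromQL version uses the instant value of probe_success):

```python
import math

def changes(samples):
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

def recoveries(samples):
    """floor((changes + final value) / 2), mirroring
    floor((changes(probe_success[1d]) + probe_success) / 2)."""
    return math.floor((changes(samples) + samples[-1]) / 2)

started_up   = [1, 0, 1, 0, 1]  # changes=4 (even), ends at 1: floor(5/2) = 2
started_down = [0, 1, 0, 1]     # changes=3 (odd),  ends at 1: floor(4/2) = 2
print(recoveries(started_up), recoveries(started_down))  # -> 2 2
```

Both series recovered twice, and the floored expression gets the right answer in both the even and odd changes() cases.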

However, this has the drawback that it will completely ignore time series that didn't exist at the end of the time period. Because addition in Prometheus does set union and the disappeared time series aren't present in the right side set, their changes() disappears entirely. As far as I can see, there is no way out of this that avoids a second full scan of your metric over the time range. At that point you might as well use resets().

For dashboard display purposes you might accept the simple 'changes()/2' approach with no clever compensation for odd changes() values, and add a note about why the numbers could have a <N>.5 value. Not everything on your dashboards and graphs has to be completely, narrowly correct all of the time even at the cost of significant overhead.

(This is one of the entries I'm writing partly for my future self. I'd hate to have to re-derive all of this logic in the future when I already did it once.)

sysadmin/PrometheusCountOnOrOff written at 00:01:41
