Wandering Thoughts

2024-03-17

Disk write buffering and its interactions with write flushes

Pretty much every modern system defaults to having data you write to filesystems be buffered by the operating system and only written out asynchronously, or when you specifically request that it be flushed to disk, which raises general questions about how much write buffering you want. Now suppose, not hypothetically, that you're doing write IO that is pretty much always going to be specifically flushed to disk (with fsync() or the equivalent) before the programs doing it consider this write IO 'done'. You might be in this situation because you're writing and rewriting mail folders, or because the dominant write source is updating a write ahead log.
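As a concrete sketch of the write-then-flush pattern being discussed, here's what it looks like in Python (the function and its name are purely illustrative; real mailers and databases do the equivalent in their own code):

    import os

    def append_record(path, data):
        # os.write() returns once the data is in the OS's buffer cache; only
        # os.fsync() guarantees it has reached stable storage, so the caller
        # doesn't consider the write 'done' until fsync() returns.
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)
        finally:
            os.close(fd)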

In this situation, where the data being written is almost always going to be flushed to disk, I believe the tradeoffs are a bit different than in the general write case. Broadly, you can never actually write at a rate faster than the write rate of the underlying storage, since in the end you have to wait for your write data to actually get to disk before you can proceed. I think this means that you want the OS to start writing data out to disk almost immediately as your process writes it; delaying the writeout will only take more time in the long run, unless for some reason the OS can write data faster when you ask for the flush than it could before then. In theory and in isolation, you may want these writes to be asynchronous (up until the point where the process asks for the disk flush and has to synchronously wait for them), because the process may be able to generate data faster if it's not stalling waiting for individual writes to make it to disk.

(In OS tuning jargon, we'd say that you want writeback to start almost immediately.)

However, journaling filesystems and concurrency add some extra complications. Many journaling filesystems have the journal as a central synchronization point, where only one disk flush can be in progress at once; if several processes ask for disk flushes at more or less the same time, they can't proceed independently. If you have multiple processes all doing write IO that they will eventually flush and you want to minimize the latency that processes experience, you have a potential problem if different processes write different amounts of IO. A process that asynchronously writes a lot of IO and then flushes it to disk will obviously have a potentially long flush, and this flush will delay the flushes done by other processes writing less data, because everything is running through the chokepoint that is the filesystem's journal.

In this situation I think you want the process that's writing a lot of data to be forced to delay, to turn its potentially asynchronous writes into more synchronous ones that are restricted to the true disk write data rate. This avoids having a large overhang of pending writes when it finally flushes, which hopefully avoids other processes getting stuck with a big delay as they try to flush. Although it might be ideal if processes with less write volume could write asynchronously, I think it's probably okay if all of them are forced down to relatively synchronous writes with all processes getting an equal fair share of the disk write bandwidth. Even in this situation the processes with less data to write and flush will finish faster, lowering their latency.

To translate this to typical system settings, I believe that you want to aggressively trigger disk writeback and perhaps deliberately restrict the total amount of buffered writes that the system can have. Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.
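On Linux specifically, the relevant knobs are the vm.dirty_* sysctls. A minimal sketch of this sort of tuning might look like the following (the byte values are arbitrary illustrations, not recommendations; the right numbers depend on your storage and workload):

    # Start background writeback almost immediately.
    vm.dirty_background_bytes = 16777216
    # Force writers to block once this much dirty data is outstanding.
    vm.dirty_bytes = 67108864

(Setting the '_bytes' versions overrides the percentage-based vm.dirty_background_ratio and vm.dirty_ratio defaults.)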

(This is in contrast to typical operating system settings, which will often allow you to use a relatively large amount of system RAM for asynchronous writes and not aggressively start writeback. This especially would make a difference on systems with a lot of RAM.)

WriteBufferingAndSyncs written at 21:59:25

2024-03-02

Something I don't know: How server core count interacts with RAM latency

When I wrote about how the speed of improvement in servers may have slowed down, I didn't address CPU core counts, which is one area where the numbers have been going up significantly. Of course you have to keep those cores busy, but if you have a bunch of CPU-bound workloads, the increased core count is good for you. Well, it's good for you if your workload is genuinely CPU bound, which generally means it fits within per-core caches. One of the areas I don't know much about is how the increasing CPU core counts interact with RAM latency.

RAM latency (for random requests) has been relatively flat for a while (it's been flat in time, which means that it's been going up in cycles as CPUs got faster). Total memory access latency has apparently been 90 to 100 nanoseconds for several memory generations (although an individual DDR5 memory module's access latency is apparently only part of this total). Memory bandwidth has been going up steadily between the DDR generations, so per-core bandwidth has gone up nicely, but this is only nice if you have the kind of sequential workloads that benefit from it. As far as I know, the kind of random access that you get from things like pointer chasing is all dependent on latency.

(If the total latency has been basically flat, this seems to imply that bandwidth improvements don't help too much. Presumably they help for successive non-random reads, and my vague impression is that reading data from successive addresses from RAM is faster than reading random addresses (and not just because RAM typically transfers an entire cache line to the CPU at once).)
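To make the pointer-chasing pattern concrete, here's a minimal sketch in Python; it's the shape of the access pattern that matters, since an interpreted version is far too slow to measure real RAM latency (that would need a compiled language and an array much bigger than the CPU caches):

    import random

    def make_chain(n):
        # A random permutation used as a linked list: chain[i] is the next index.
        order = list(range(n))
        random.shuffle(order)
        chain = [0] * n
        for src, dst in zip(order, order[1:] + order[:1]):
            chain[src] = dst
        return chain

    def chase(chain, steps):
        # Each load's address depends on the previous load's result, so the
        # loads can't overlap and every step pays the full random access latency.
        i = 0
        for _ in range(steps):
            i = chain[i]
        return i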

So now we get to the big question: how many memory reads can you have in flight at once with modern DDR4 or DDR5 memory, especially on servers? Where the limit is presumably matters, because if you have a bunch of pointer-chasing workloads that are limited by 'memory latency' and you run them on a high core count system, it seems that at some point they'll run out of simultaneous RAM read capacity. I've tried to do some reading and gotten confused, which may be partly because modern DRAM is a pretty complex thing.

(I believe that individual processors and multi-socket systems have some number of memory channels, each of which can be in action simultaneously, and then there are memory ranks and memory banks. How many memory channels you have depends partly on the processor you're using (well, its memory controller) and partly on the motherboard design. For example, 4th generation AMD Epyc processors apparently support 12 memory channels, although not all of them may be populated in a given memory configuration. I think you need at least N (or maybe 2N) DIMMs to populate N channels. And here's a look at AMD Zen4 memory handling, which doesn't seem to say much about multi-core random access latency.)

ServerCPUDensityAndRAMLatency written at 22:54:58

2024-02-29

The speed of improvement in servers may have slowed down

One of the bits of technology news that I saw recently was that AWS was changing how long it ran servers, from five years to six years. Obviously one large motivation for this is that it will save Amazon a nice chunk of money. However, I suspect that one enabling factor for this is that old servers are more similar to new servers than they used to be, as part of what could be called the great slowdown in computer performance improvement.

New CPUs and to a lesser extent memory are somewhat better than they used to be, both on an absolute measure and on a performance per watt basis, but the changes aren't huge the way they used to be. SATA SSD performance has been more or less stagnant for years; NVMe performance has improved, but from a baseline that was already very high, perhaps higher than many workloads could take advantage of. Network speeds are potentially better but it's already hard to truly take advantage of 10G speeds, especially with ordinary workloads and software.

(I don't know if SAS SSD bandwidth and performance has improved, although raw SAS bandwidth has and is above what SATA can provide.)

For both AWS and people running physical servers (like us) there's also the question of how many people need faster CPUs and more memory, and related to that, how much they're willing to pay for them. It's long been observed that a lot of what people run on servers is not a voracious consumer of CPU and memory (and IO bandwidth). If your VPS runs at 5% or 10% CPU load most of the time, you're probably not very enthused about paying more for a VPS with a faster CPU that will run at 2.5% almost all of the time.

(Now that I've written this it strikes me that this is one possible motivation for cloud providers to push 'function as a service' computing, because it potentially allows them to use those faster CPUs more effectively. If they're renting you CPU by the second and only when you use it, faster CPUs likely mean more people can be packed on to the same number of CPUs and machines.)

We have a few uses for very fast single-core CPU performance, but other than those cases (and our compute cluster) it's hard to identify machines that could make much use of faster CPUs than they already have. It would be nice if our fileservers had U.2 NVMe drives instead of SATA SSDs but I'm not sure we'd really notice; the fileservers only rarely see high IO loads.

PS: It's possible that I've missed important improvements here because I'm not all that tuned in to this stuff. One possible area is PCIe lanes directly supported by the system's CPU(s), which enable all of those fast NVMe drives, multiple 10G or faster network connections, and so on.

ServersSpeedOfChangeDown written at 22:43:13

2024-02-25

Open source culture and the valorization of public work

A while back I wrote about how doing work that scales requires being able to scale your work, which in the open source world requires time, energy, and the willingness to engage in the public sphere of open source regardless of the other people there and your reception. Not everyone has this sort of time and energy, and not everyone gets a positive reception by open source projects even if they have it.

This view runs deep in open source culture, which valorizes public work even at the cost of stress and time. Open source culture on the one hand tacitly assumes that everyone has those available, and on the other hand assumes that if you don't do public work (for whatever reason) that you are less virtuous or not virtuous at all. To be a virtuous person in open source is to contribute publicly at the cost of your time, energy, stress, and perhaps money, and to not do so is to not be virtuous (sometimes this is phrased as 'not being dedicated enough').

(Often the most virtuous public contribution is 'code', so people who don't program are already intrinsically not entirely virtuous and lesser no matter what they do.)

Open source culture has some reason to praise and value 'doing work that scales', which is public work; if this work does not get done, nothing happens. But it also has a tendency to demand that everyone do it and to judge them harshly when they don't. This is the meta-cultural issue behind things like the cultural expectation that people will file bug reports, often no matter what the bug reporting environment is like or whether filing bug reports does any good.

I feel that this view is dangerous for various reasons, including because it blinds people to other explanations for a lack of public contributions. If you can say 'people are not contributing because they're not virtuous' (or not dedicated, or not serious), then you don't have to take a cold, hard look at what else might be getting in the way of contributions. Sometimes such a cold hard look might turn up rather uncomfortable things to think about.

(Not every project wants or can handle contributions, because they generally require work from existing project members. But not all such projects will admit up front in the open that they either don't want contributions at all or they gatekeep contributions heavily to reduce time burdens on existing project members. And part of that is probably because openly refusing contributions is in itself often seen as 'non-virtuous' in open source culture.)

OpenSourceCultureAndPublicWork written at 23:21:12

2024-02-16

Options for genuine ECC RAM on the desktop in (early) 2024

A traditional irritation with building (or specifying) desktop computers is the issue of ECC RAM, which for a long time was either not supported at all or was being used by Intel for market segmentation. First generation AMD Ryzens sort of supported ECC RAM with the right motherboard, but there are many meanings of 'supporting' ECC RAM and questions lingered about how meaningful the support was (recent information suggests the support was real). Here in early 2024 the situation is somewhat better and I'm going to summarize what I know so far.

The traditional route to getting ECC RAM support (along with a bunch of other things) was to buy a 'workstation' motherboard that was built to support Intel Xeon processors. These were available from a modest number of vendors, such as SuperMicro, and were generally not inexpensive (and then you had to buy the Xeon). If you wanted a pre-built solution, vendors like Dell would sell you desktop Xeon-based workstation systems with ECC RAM. You can still do this today.

Update: I forgot AMD Threadripper and Epyc based systems, which you can get motherboards for and build desktop systems around. I think these are generally fairly expensive motherboards, though.

Back in 2022, Intel introduced their W680 desktop chipset. One of the features of this chipset is that it officially supports ECC RAM with 12th generation and later (so far) Intel CPUs (or at least apparently the non-F versions), along with official support for memory overclocking (and CPU overclocking), which enables faster 'XMP' memory profiles than the stock ones (should your ECC RAM actually support this). There are a modest number of W680 based motherboards available from (some of) the usual x86 PC desktop motherboard makers (and SuperMicro), but they are definitely priced at the high end of things. Intel has not yet announced a 'Raptor Lake' chipset version of this, which would presumably be called the 'W780'. At this date I suspect there will be no such chipset.

(The Intel W680 chipset was brought to my attention by Brendan Shanks on the Fediverse.)

As mentioned, AMD support for ECC on early generation Ryzens was a bit lackluster, although it was sort of there. With the current Socket AM5 and Zen 4, a lot of mentions of ECC seem to have (initially) been omitted from documentation, as discussed in Rain's ECC RAM on AMD Ryzen 7000 desktop CPUs, and Ryzen 8000G series APUs don't support ECC at all. However, at least some AM5 motherboards do support ECC, provided that you have recent enough BIOS updates and enable ECC support in the BIOS (per Rain). These days it appears that a number of current AM5 motherboards list ECC memory as supported (although what 'supported' means is a question) and it will probably work, especially if you can find people who have already reported success. It seems that even some relatively inexpensive AM5 motherboards may support ECC.

(Some un-vetted resources are here and here.)

If you can navigate the challenges of finding a good motherboard, it looks like an AM5, Ryzen 7000 system will support ECC at a lower cost than an Intel W680 based system (or an Intel Xeon one). If you don't want to try to thread those rapids and can stand Intel CPUs, a W680 based system will presumably work, and a Xeon based system would be even easier to purchase as a fully built desktop with ECC.

(Whether ECC makes a meaningful difference that's worth paying for is a bit of an open question.)

DesktopECCOptions2024 written at 23:52:09

2024-01-31

Using IPv6 has quietly become reliable (for me)

I've had IPv6 at home for a long time, first in tunneled form and later in native form, and recently I brought up more or less native IPv6 for my work desktop. When I first started using IPv6 (at home) and for many years afterward, there were all sorts of complications and failures that could be attributed to IPv6 or that went away when I turned off IPv6. To be honest, when I enabled IPv6 on my work desktop I expected to run into a fun variety of problems due to this, since before then it had been IPv4 only.

To my surprise, my work desktop has experienced no problems since enabling IPv6 connectivity. I know I'm using some websites over IPv6 and I can see IPv6 traffic happening, but at the personal level, I haven't noticed anything different. When I realized that, I thought back over my experiences at home and realized that it's been quite a while since I had a problem that I could attribute to IPv6. Quietly, while I wasn't particularly noticing, the general Internet IPv6 environment seems to have reached a state where it just works, at least for me.

Since IPv6 is everyone's future, this is good news. We've been collectively doing this for long enough and IPv6 usage has climbed enough that it should be as reliable as IPv4, and hopefully people don't make common oversights any more. Otherwise, we would collectively have a real problem, because turning on IPv6 for more and more people would be degrading the Internet experience of more and more people. Fortunately that's (probably) not happening any more.

I'm sure that there are still IPv6 specific issues and problems that come up, and there will be more for a long time to come (until perhaps they're overtaken by year 2038 problems). But you can have problems that are specific to anything, including IPv4 (and people may already be having those).

(As more people add IPv6 to servers that are currently IPv4 only, we may also see a temporary increase in IPv6 specific problems as people go through 'learning experiences' of operating IPv6 environments. I suspect that my group will have some of those when we eventually start adding IPv6 to various parts of our environment.)

IPv6NowReliableForMe written at 22:26:21

2024-01-26

Histogram data is most useful when it also provides true totals

A true histogram is generated from raw data. However, in things like metrics, we generally don't have the luxury of keeping all of the raw data around; instead we need to summarize it into histogram data. This is traditionally done by having some number of buckets with either independent or cumulative values. A lot of systems stop there; for example OpenZFS provides its histogram data this way. Unfortunately by itself this information is incomplete in an annoying way.

If you're generating histogram data, you should go the extra distance to also provide a true total of all of the raw data. The reason is simple: only with a true total can one get a genuine and accurate average value, or anything derived from that average. Importantly, one thing you can potentially derive from the average value is an indication of what I'll call skew in your buckets.

The standard assumption when dealing with histograms is that the values in each bucket are randomly distributed through the range of the bucket. If they truly are, then you can do things like get a good estimate of the average value by just taking the midpoint of each bucket, and so people will say that you don't really need the true total. However, this is an assumption and it's not necessarily correct, especially if the size of the buckets is large (as it can be at the upper end of a 'powers of two' logarithmic bucket size scheme, which is pretty common because it's convenient to generate).
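As a sketch of how a true total exposes this, compare the average estimated from bucket midpoints with the real average; the buckets and total here are invented purely for illustration:

    # Power-of-two buckets as (low, high, count), plus the true sum of the raw values.
    buckets = [(0, 2, 100), (2, 4, 50), (4, 8, 20), (8, 16, 5)]
    true_total = 400.0

    count = sum(c for _, _, c in buckets)
    midpoint_avg = sum((lo + hi) / 2 * c for lo, hi, c in buckets) / count
    true_avg = true_total / count

    # If true_avg is well away from midpoint_avg, the values inside the buckets
    # are skewed toward one end instead of being evenly distributed.
    print(midpoint_avg, true_avg)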

I've certainly looked at a number of such histograms where it's clear (from various other information sources) that this assumption of even distribution wasn't correct. How incorrect it was wasn't all that clear, though, because the information necessary to have a solid idea wasn't there.

Good histogram data takes more than counts in buckets. But including a true total as an additional piece of data is at least a start, and it's probably inexpensive (both to export and to accumulate).

(Someone has probably already written a 'best practices for gathering and providing histogram data' article.)

HistogramsNeedTotalsToo written at 22:41:23

2024-01-24

The cooling advantage that CPU integrated graphics has

Once upon a time, you could readily get basic graphics cards that gave you dual output support; they were generally passively cooled, and certainly single-width even if they had to have a fan. This is, for example, more or less what I had in my 2011 era machines. These days these cards are mostly extinct, so when I put together my current office desktop I wound up with a dual width, definitely fan-equipped card that wasn't dirt cheap. For some time I've been grumpy about this, and sort of wondering where they went.

The obvious answer for where these cards went is that CPUs got integrated graphics (although not all CPUs, especially higher end ones, so you could wind up using a CPU without an IGP and needing a discrete GPU). When thinking about why integrated graphics displaced such basic cards, it recently struck me that one practical advantage integrated graphics has is cooling.

The integrated graphics circuitry is part of the CPU, or at least on the CPU die. General use CPUs have been actively cooled for well over a decade now, and for a long time they've been the focus of high performance cooling and sophisticated thermal management. The CPU is probably the best cooled thing in a typical desktop (and it needs to be). Cohabiting with this heat source constrains the IGP, but it also means that the IGP can take advantage of the CPU's cooling to cool itself, and that cooling is generally quite good.

A discrete graphics card has no such advantage. It must arrange its own cooling and its own thermal management, both of which cost money and the first of which takes up space (either for fans or for passive heatsinks). This need for its own cooling makes it less competitive against integrated graphics, probably especially so if the card is trying to be passively cooled. I wouldn't be surprised if the options were a card that didn't even compare favorably to integrated graphics or a too-expensive card for the performance you got. There's also the question of whether the discrete GPU chipsets you can get are even focused on low power usage or whether they're designed to assume full cooling to allow performance that's clearly better than integrated graphics.

(Another limit, now that I look, is the amount of power available to a PCIe card, especially one that uses fewer than 16 PCIe lanes; apparently an x4 or x8 card may be limited to 25W total (with an x16 going to 75W), per Wikipedia. However, I don't know how this compares to the amount of power an IGP is allowed to draw, especially in CPUs with more modest overall power usage.)

The more I look at this, the more uncertainties I have about the thermal and power constraints that may or may not face discrete GPU cards that are aiming for low cost while still offering, say, multi-monitor support. I imagine that the readily available and more or less free cooling that integrated graphics gets doesn't help the discrete GPUs, but I'm not sure how much of a difference it really makes.

CPUIGPCoolingAdvantage written at 22:22:06

2024-01-22

Desktop PC motherboards and the costs of extra features

My current office desktop and home desktop are now more than five years old (although they've had some storage tuneups since then), so I've been looking at PC hardware off and on. As it happens, PC desktop motherboards that have the features I'd like also not infrequently include extra features that I don't need, such as built in wifi connectivity. I'm somewhat of a hardware minimalist so in the past I've reflexively attempted to avoid these features. The obvious reason to do this is that they tend to increase the cost. But lately it's struck me that there's another reason to want a desktop PC motherboard without extra features, and that is PCIe lanes.

Processors (CPUs) and motherboard chipsets only have so many PCIe lanes in total, partly because supporting more PCIe lanes is one of those product features that both Intel and AMD use to segment the market. This matters because these days, almost everything built into a PC motherboard is actually implemented as a PCIe device, which means that it normally consumes some number of those PCIe lanes. The more built in devices your motherboard has, the more PCIe lanes they consume out of the total ones available, which can cut down on other built in devices and also on connectivity you want, such as NVMe drives and physical PCIe card slots. Physical PCIe slots can already have peculiar limitations on which ones can be used together, which has the effect of reducing the total PCIe lanes they consume, but you generally can't play very many of these games with built in hardware.

(You can play some games; on my home desktop, the motherboard's second NVMe slot shares two PCIe lanes with some of my SATA ports. If I want to run the NVMe drive with x4 PCIe lanes instead of x2, I can only have four SATA ports instead of six.)

Of course, all of this is academic if you can only find the motherboard features you want on higher end motherboards that also include these extra features. Provided that there aren't any surprise limitations that affect things you're going to use right away, you (I) just get to live with whatever limitations and constraints on PCIe lane usage you get, or you have to drop some features you want. This is where you have to read motherboard descriptions quite carefully, including all of the footnotes, and perhaps even consult their manuals.

(What features I want is another question, and there are tradeoffs I could make and may have to.)

Fortunately (given the growth of things like NVMe drives), the number of PCIe lanes available from CPUs and chipsets has been going up over time, as has their speed. However I suspect that we're always going to see Intel and AMD differentiate their server processors from their desktop processors partly by the number of PCIe lanes available, with the 'desktop' processors having the smaller number. My impression is that AMD desktop CPUs have more CPU PCIe lanes than Intel desktop CPUs and also I believe more chipset PCIe lanes, but Intel is potentially ahead on PCIe bandwidth between the chipset and the CPU (and thus between chipset devices and RAM, which has to go through the CPU). Whether you'll ever stress the CPU to chipset bandwidth that hard is another question.

MotherboardFeaturesPCIeCosts written at 23:29:05

2024-01-13

Indexed archive formats and selective restores

Recently we discovered first that the Amanda backup system has to read some tar archives all the way to the end when restoring a few files from them, and then that it can sometimes do quick restores from tar archives. What is going on is the general issue of indexed (archive) formats, and also the potential complexities involved in using them in a full system.

To simplify, tar archives are a series of entries for files and directories. Tar archives contain no inherent index of their contents (unlike some archive formats, such as ZIP archives), but you can build an external index of where each file entry starts and what it is. Given such an index and its archive file on a storage medium that supports random access, you can jump to only the directory and file entries you care about and extract only them. Because tar archives don't have much special overall formatting, you can do this either directly or you can read the data for each entry, concatenate it, and feed it to 'tar' to let tar do the extraction.

(The trick with clipping out the bits of a tar archive you cared about and feeding them to tar as a fake tar archive hadn't occurred to me until I saw what Amanda was doing.)
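As an illustration of the direct approach, here's a minimal Python sketch that builds such an external index with the tarfile module and then pulls a single file's data out of an uncompressed archive by seeking to it (it relies on TarInfo's offset_data attribute and handles only regular files):

    import tarfile

    def build_index(archive_path):
        # One pass over the archive, recording where each file's data starts.
        index = {}
        with tarfile.open(archive_path, 'r:') as tf:
            for member in tf:
                if member.isfile():
                    index[member.name] = (member.offset_data, member.size)
        return index

    def extract_one(archive_path, index, name):
        # Jump straight to the file's data instead of scanning the archive.
        offset, size = index[name]
        with open(archive_path, 'rb') as f:
            f.seek(offset)
            return f.read(size)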

If tar was a more complicated format, this would take more work and more awareness of the tar format. For example, if tar archives had an internal index, either you'd need to operate directly on the raw archive or you would have to create your own version of the index when you extracted all of the pieces from the full archive. Why would you need to extract the pieces if there was an internal index? Well, one reason is if the entire archive file was itself compressed, and your external index told you where in the compressed version you needed to start reading in order to get each file chunk.

The case of compressed archives shows that indexes need to somehow be for how the archive is eventually stored. If you have an index of the uncompressed version but you're storing the archive in compressed form, the index is not necessarily of much use. Similarly, it's necessary for the archive to be stored in such a way that you can read only selected parts of it when retrieving it. These days that's not a given, although I believe many remote object stores support HTTP Range requests at least some of the time.
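As a sketch of what reading only selected parts of a remotely stored archive looks like, here's a minimal Python example using an HTTP Range request (the URL and offsets are hypothetical, and the remote object store has to actually honor Range requests):

    import urllib.request

    def read_range(url, offset, size):
        # Ask the server for only the bytes we need, not the whole object.
        req = urllib.request.Request(url)
        req.add_header('Range', 'bytes=%d-%d' % (offset, offset + size - 1))
        with urllib.request.urlopen(req) as resp:
            # A server that honors the request replies with 206 Partial Content.
            return resp.read()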

(Another case that may be a problem for backups specifically is encrypted backups. Generally the most secure way to encrypt your backups is to encrypt the entire archive as a single object, so that you have to read it all to decrypt it and can't skip ahead in it.)

SelectiveRestoresAndIndexes written at 23:28:30
