Wandering Thoughts


Thinking about why ZFS only does IO in recordsize blocks, even random IO

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. As is perhaps less well known, a ZFS logical block is the minimum size of IO to a file, both for reads and especially for writes. Since the default recordsize is 128 Kb, this means that many files of interest have recordsize blocks and thus all IO to them is done in 128 Kb units, even if you're only reading or writing a small amount of data.

On the one hand, this seems a little bit crazy. The time it takes to transfer 128 Kb over a SATA link is not always something that you can ignore, and on SSDs larger writes can have a real impact. On the other hand, I think that this choice is more or less forced by some decisions that ZFS has made. Specifically, the ZFS checksum covers the entire logical block, and ZFS's data structure for 'where you find things on disk' is also based on logical blocks.

I wrote before about the ZFS DVA, which is ZFS's equivalent of a block number and tells you where to find data. ZFS DVAs are embedded into 'block pointers', which you can find described in spa.h. One of the fields of the block pointer is the ZFS block checksum. Since this is part of the block pointer, it is a checksum over all of the (logical) data in the block, which is up to recordsize. Once a file reaches recordsize bytes long, all blocks are the same size, the recordsize.

Since the ZFS checksum is over the entire logical block, ZFS has to fetch the entire logical block in order to verify the checksum on reads, even if you're only asking for 4 Kbytes out of it. For writes, even if ZFS allowed you to have different sized logical blocks in a file, you'd need to have the original recordsize block available in order to split it and you'd have to write all of it back out (both because ZFS never overwrites in place and because the split creates new logical blocks, which need new checksums). Since you need to add new logical blocks, you might have a ripple effect in ZFS's equivalent of indirect blocks, where they must expand and shuffle things around.

(If you're not splitting the logical block when you write to only a part of it, copy on write means that there's no good way to do this without rewriting the entire block.)
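To make the read and write amplification concrete, here is a minimal Python sketch of the arithmetic. It assumes a file larger than the recordsize (so every logical block is a full recordsize block) and it ignores compression and caching; the function names are made up for illustration and are not anything from ZFS itself.

```python
RECORDSIZE = 128 * 1024  # default ZFS recordsize, in bytes


def blocks_touched(offset, length, recordsize=RECORDSIZE):
    """Return the indexes of the logical blocks covering [offset, offset+length)."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return range(first, last + 1)


def bytes_of_io(offset, length, recordsize=RECORDSIZE):
    """Bytes that must be read (or rewritten) to satisfy the request."""
    return len(blocks_touched(offset, length, recordsize)) * recordsize


# A 4 Kbyte request that falls inside one logical block.
inside = bytes_of_io(1_000_000, 4096)
# A 4 Kbyte request that straddles the boundary between blocks 0 and 1.
straddle = bytes_of_io(RECORDSIZE - 2048, 4096)

print(inside, inside // 4096)
print(straddle, straddle // 4096)
```

Under these assumptions a 4 Kbyte request costs one whole 128 Kb block (32 times the data asked for), and an unlucky request that crosses a block boundary costs two.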

In fact, the more I think about this, the more it seems that having multiple (logical) block sizes in a single file would be the way to madness. There are so many things that get complicated if you allow variable block sizes. These issues can be tackled, but it's simpler not to. ZFS's innovation is not that it insists that files have a single block size, it is that it allows this block size to vary. Most filesystems simply set the block size to, say, 4 Kbytes, and live with how large files have huge indirect block tables and other issues.

(The one thing that might make ZFS nicer in the face of some access patterns where this matters is the ability to set the recordsize on a per-file basis instead of just a per-filesystem basis. But I'm not sure how important this would be; the kind of environments where it really matters are probably already doing things like putting database tables on their own filesystems anyway.)

PS: This feels like an obvious thing once I've written this entry all the way through, but the ZFS recordsize issue has been one of my awkward spots for years, where I didn't really understand why it all made sense and had to be the way it was.

PPS: All of this implies that if ZFS did split logical blocks when you did a partial write, the only time you'd win would be if you then overwrote what was now a single logical block a second time. For example, if you created a big file, wrote 8 Kb to a spot in it (splitting a 128 Kb block into several new logical blocks, including an 8 Kb one for the write you just did), then later wrote exactly 8 Kb again to exactly that spot (overwriting only your new 8 Kb logical block). This is probably obvious too but I wanted to write it out explicitly, if only to convince myself of the logic.

solaris/ZFSWhyIOInRecordsize written at 01:07:47


The increasingly surprising limits to the speed of our Amanda backups

When I started dealing with backups the slowest part of the process was generally writing things out to tape, which is why Amanda was much happier when you gave it a 'holding disk' that it could stage all of the backups to before it had to write them out to tape. Once you had that in place, the speed limit was generally some mix between the network bandwidth to the Amanda server and the speed of how fast the machines being backed up could grind through their filesystems to create the backups. When networks moved to 1G, you (and we) usually wound up being limited by the speed of reading through the filesystems to be backed up.

(If you were backing up a lot of separate machines, you might initially be limited by the Amanda server's 1G of incoming bandwidth, but once most machines started finishing their backups you usually wound up with one or two remaining machines that had larger, slower filesystems. This slow tail wound up determining your total backup times. This was certainly our pattern, especially because only our fileservers have much disk space to back up. The same has typically been true of backing up multiple filesystems in parallel from the same machine; sooner or later we wind up stuck with a few big, slow filesystems, usually ones we're doing full dumps of.)

Then we moved our Amanda servers to 10G-T networking and, from my perspective, things started to get weird. When you have 1G networking, it is generally slower than even a single holding disk; unless something's broken, modern HDs will generally do at least 100 Mbytes/sec of streaming writes, which is enough to keep up with a full speed 1G network. However this is only just over 1G data rates, which means that a single HD is vastly outpaced by a 10G network. As long as we had a number of machines backing up at once, the Amanda holding disk was suddenly the limiting factor. However, for a lot of the run time of backups we're only backing up our fileservers, because they're where all the data is, and for that we're currently still limited by how fast the fileservers can do disk IO.

(The fileservers only have 1G network connections for reasons. However, usually it's disk IO that's the limiting factor, likely because scanning through filesystems is seek-limited. Also, I'm ignoring a special case where compression performance is our limit.)

All of this is going to change in our next generation of fileservers, which will have both 10G-T networking and SSDs. Assuming that the software doesn't have its own IO rate limits (which is not always a safe assumption), both the aggregate SSDs and all the networking from the fileservers to Amanda will be capable of anywhere from several hundred Mbytes/sec up to as much 10G bandwidth as Linux can deliver. At this point the limit on how fast we can do backups will be down to the disk speeds on the Amanda backup servers themselves. These will probably be significantly slower than the rest of the system, since even striping two HDs together would only get us up to around 300 Mbytes/sec at most.

(It's not really feasible to use a SSD for the Amanda holding disk, because it would cost too much to get the capacities we need. We currently dump over a TB a day per Amanda server, and things can only be moved off the holding disk at the now-paltry HD speed of 100 to 150 Mbytes/sec.)
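As a rough illustration of the numbers in this entry, here is a small Python sketch of how long it takes to move a TB at various streaming rates. It assumes decimal units (1 TB as 10^12 bytes, 1 Mbyte as 10^6 bytes) and treats 10G as an idealized 1250 Mbytes/sec with no protocol overhead; the helper name is made up for illustration.

```python
def hours_to_move(total_bytes, mbytes_per_sec):
    """Hours needed to stream total_bytes at a sustained rate in Mbytes/sec."""
    seconds = total_bytes / (mbytes_per_sec * 1_000_000)
    return seconds / 3600


TB = 10**12

# Single HD (low and high), two striped HDs, and idealized 10G line rate.
for rate in (100, 150, 300, 1250):
    print(f"{rate:>5} Mbytes/sec: {hours_to_move(TB, rate):.1f} hours per TB")
```

At 100 to 150 Mbytes/sec, a single TB takes somewhere between about 1.9 and 2.8 hours to move, while an idealized 10G link could deliver the same TB in well under half an hour.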

This whole shift feels more than a bit weird to me; it's upended my perception of what I expect to be slow and what I think of as 'sufficiently fast that I can ignore it'. The progress of hardware over time has made it so the one part that I thought of as fast (and that was designed to be fast) is now probably going to be the slowest.

(This sort of upset in my world view of performance happens every so often, for example with IO transfer times. Sometimes it even sticks. It sort of did this time, since I was thinking about this back in 2014. As it turned out, back then our new fileservers did not stick at 10G, so we got to sleep on this issue until now.)

sysadmin/AmandaWhereSpeedLimits written at 23:28:38

Spam from Yahoo Groups has quietly disappeared

Over the years I have written several times about what was, at the time, an ongoing serious and long-term spam problem with email from Yahoo Groups. Not only was spam almost all of the Groups email that we got, but it was also clear that Yahoo Groups was allowing spammers to create their own mailing lists. I was coincidentally reminded of this history recently, so I wondered how things were today.

One answer is that spam from Yahoo Groups has disappeared. Oh, it's not completely and utterly gone; we rejected one probable spam last December and two at the end of July 2017, which is almost as far back as our readily accessible logs go (they stretch back to June 15th, 2017). But by pretty much any standard, much less compared to what it was before, that counts as completely vanished. It certainly counts as not having any sort of spam problem.

But this is the tip of the iceberg, because it turns out that email volume from Yahoo Groups has fallen off significantly as well. We almost always get under ten accepted messages a day from Yahoo Groups, and some days we get none. Even after removing the spam, this is nothing like four years ago in 2014, when my entry implies that we got about 22 non-spam messages a day from Yahoo Groups.

At one level I'm not surprised. Yahoo has been visibly and loudly dying for quite a while now, so I bet that a lot of people and groups have moved away from Yahoo Groups. If you had an active group that you cared about, it was clearly time to find alternate hosting quite some time ago and probably many people did (likely with Google Groups). At another level, I'm a bit surprised that it's this dramatic a shift. I would have expected plenty of people and groups to stick around until the very end, out of either inertia or ignorance. Perhaps Yahoo Groups service got so bad and so unreliable that even people who don't pay attention to computer news noticed that there was some problem.

On the other hand there's another metric, the amount of email from Yahoo Groups that was rejected due to bad destination addresses here (and how many different addresses there are). We almost always see a small number of such rejections a day, and the evidence suggests that almost all of them are for the same few addresses. There are old, obsolete addresses here that have been rejecting Yahoo Groups email since last June, and Yahoo Groups is still trying to send email to them. Apparently they don't even handle locally generated bounces, never mind bounces that they refuse to accept back. I can't say I'm too surprised.

Given all of this I can't say I regret the slow motion demise of Yahoo Groups. At this point I'm not going to wish it was happening faster, because it's no longer causing us problems (and clearly hasn't been for more than half a year), but it's also clearly still not healthy. It's just that either the spammers abandoned it too or they finally got thrown off. (Perhaps a combination of both.)

spam/YahooGroupsDisappeared written at 01:44:09


The sensible way to use Bourne shell 'here documents' in pipelines

I was recently considering a shell script where I might want to feed a Bourne shell 'here document' to a shell pipeline. This is certainly possible and years ago I wrote an entry on the rules for combining things with here documents, where I carefully wrote down how to do this and the general rule involved. This time around, I realized that I wanted to use a much simpler and more straightforward approach, one that is obviously correct and is going to be clear to everyone. Namely, putting the production of the here document in a subshell.

(
cat <<EOF
your here document goes here
with as much as you want.
EOF
) | sed | whatever

This is not as neat and nominally elegant as taking advantage of the full power of the Bourne shell's arcane rules, and it's probably not as efficient (in at least some sh implementations, you may get an extra process), but I've come around to feeling that that doesn't matter. This may be the brute force solution, but what matters is that I can look at this code and immediately follow it, and I'm going to be able to do that in six months or a year when I come back to the script.

(Here documents are already kind of confusing as it stands without adding extra strangeness.)

Of course you can put multiple things inside the (...) subshell, such as several here documents that you output only conditionally (or chunks of always present static text mixed with text you have to make more decisions about). If you want to process the entire text you produce in some way, you might well generate it all inside the subshell for convenience.

Perhaps you're wondering why you'd want to run a here document through a pipe to something. The case that frequently comes up for me is that I want to generate some text with variable substitution but I also want the text to flow naturally with natural line lengths, and the expansion will have variable length. Here, the natural way out is to use fmt:

(
cat <<EOF
My message to $NAME goes here.
It concerns $HOST, where $PROG
died unexpectedly.
EOF
) | fmt

Using fmt reflows the text regardless of how long the variables expand out to. Depending on the text I'm generating, I may be fine with reflowing all of it (which means that I can put all of the text inside the subshell), or I may have some fixed formatting that I don't want passed through fmt (so I have to have a mix of fmt'd subshells and regular text).

Having written that out, I've just come to the obvious realization that for simple cases I can just directly use fmt with a here document:

fmt <<EOF
My message to $NAME goes here.
It concerns $HOST, where $PROG
died unexpectedly.
EOF

This doesn't work well if there's some paragraphs that I want to include only some of the time, though; then I should still be using a subshell.

(For whatever reason I apparently have a little blind spot about using here documents as direct input to programs, although there's no reason for it.)

unix/SaneHereDocumentsPipelines written at 23:05:30

A CPU's TDP is a misleading headline number

The AMD Ryzen 1800X in my work machine and the Intel Core i7-8700K in my home machine are both 95 watt TDP processors. Before I started measuring things with the actual hardware, I would have confidently guessed that they would have almost the same thermal load and power draw, and that the impact of a 95W TDP CPU over a 65W TDP CPU would be clearly obvious (you can see traces of this in my earlier entry on my hardware plans). Since it's commonly said that AMD CPUs run hotter than Intel ones, I'd expect the Ryzen to be somewhat higher than the Intel, but how much difference would I really expect from two CPUs with the same TDP?

Then I actually measured the power draws of the two machines, both at idle and under various different sorts of load. The result is not even close; the Intel is clearly using less power even after accounting for the 10 watts of extra power the AMD's Radeon RX 550 graphics card draws when it's lit up. It's ahead at idle, and it's also ahead under full load when the CPU should be at maximum power draw. Two processors that I would have expected to be fundamentally the same at full CPU usage are roughly 8% different in measured power draw; at idle they're even further apart on a proportional basis.

(Another way that TDP is misleading to the innocent is that it's not actually a measure of CPU power draw, it's a measure of CPU heat generation; see this informative reddit comment. Generally I'd expect the two to be strongly correlated (that heat has to come from somewhere), but it's possible that something that I don't understand is going on.)

Intellectually, I may have known that a processor's rated TDP was merely a measure of how much heat it could generate at maximum and didn't predict either its power draw when idle or its power draw under load. But in practice I thought that TDP was roughly TDP, and every 95 watt TDP (or 65 watt TDP) processor would be about the same as every other one. My experience with these two machines has usefully smacked me in the face with how this is very much not so. In practice, TDP apparently tells you how big a heatsink you need to be safe and that's it.

(There are all sorts of odd things about the relative power draws of the Ryzen and the Intel under various different sorts of CPU load, but that's going to be for another entry. My capsule summary is that modern CPUs are clearly weird and unpredictable beasts, and AMD and Intel must be designing their power-related internals fairly differently.)

PS: TDP also doesn't necessarily predict your actual observed CPU temperature under various conditions. Some of the difference will be due to BIOS decisions about fan control; for example, my Ryzen work machine appears to be more aggressive about speeding up the CPU fan, and possibly as a result it seems to report lower CPU temperatures under high load and power draw.

(Really, modern PCs are weird beasts. I'm not sure you can do more than putting in good cooling and hoping for the best.)

tech/TDPMisleading written at 02:04:17


Link: Parsing: a timeline

Jeffrey Kegler's Parsing: a timeline (via) is what it says in the title; it's an (opinionated) timeline of various developments in computer language parsing. There are a number of fascinating parts to it and many bits of history that I hadn't known and I'm glad to have read about. Among other things, this timeline discusses all of the things that aren't actually really solved problems in parsing, which is informative all by itself.

(I've been exposed to various aspects of parsing and it's a long standing interest of mine, but I don't think I've ever seen the history of the field laid out like this. I had no idea that so many things were relatively late developments, or of all of the twists and turns involved in the path to LALR parsers.)

links/ParsingATimeline written at 00:48:04

Go and the pragmatic problems of having a Python-like with statement

In a comment on my entry on finalizers in Go, Aneurin Price asked:

So there's no deterministic way to execute some code when an object goes out of scope? Does Go at least have something like Python's "with" statement? [...]

For those who haven't seen it, the Python with statement is used like this:

with open("output.txt", "w") as fp:
    ... do things with fp ...

# fp is automatically closed by the
# time we get here.

Python's with gives you reliable and automatic cleanup of fp or whatever resource you're working with inside the with block. Your code doesn't have to know anything or do anything; all of the magic is encapsulated inside with and things that speak its protocol.

Naturally, Go has no equivalent; sure, we have the defer statement but it's not anywhere near the same thing. In my opinion this is the right call for Go, because of two issues you would have if you tried to have something like Python's with in Go.

The obvious issue is that you would need some sort of protocol to handle initialization and cleanup, which would be a first for Go. You need the protocol because a big point of Python's with is that it magically handles everything for you without you having to remember to write any extra code; it's part of the point that using with is easier and shorter than trying to roll your own version (which encourages people to use it). If you're willing to write extra code, Go has everything today in the form of the defer statement.

But beyond that there is a broader philosophical issue that's exposed by Aneurin Price's first question. In a language like Go where your local data may escape into functions you call, what does it mean for something to go out of scope? One answer is that things only go out of scope when there's no remaining reference to them. Unfortunately I believe that this is more or less impossible to implement efficiently without either going to Rust's extremes of ownership tracking in the language or forcing a reference counting garbage collector (where you know immediately when something is no longer referenced). This leaves you with the finalizer problem, where you're not actually cleaning up the resource promptly.

The other answer is that 'going out of scope' simply means 'execution reaches the end of the relevant block'. As in Python, you always invoke the cleanup actions at this point regardless of whether your resource may have escaped into things you've called and thus may still be alive somewhere. This implicit, hidden cleanup is a potentially dangerous trap for your code; if you forget and pass the resource to something that retains a reference to it, you may get explosions (much) later when that now-dead resource is used. If you're in luck, this use is deterministic so you can find it in tests. If you're unlucky, this use only happens in, say, an error path.
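This trap is easy to demonstrate in Python itself. The following is a minimal sketch, using an in-memory io.StringIO as a stand-in for a real resource and a made-up Holder class as the thing that quietly retains a reference:

```python
import io


class Holder:
    """Hypothetical object that keeps a reference to a file it was given."""

    def __init__(self):
        self.fp = None

    def remember(self, fp):
        self.fp = fp


holder = Holder()

with io.StringIO() as fp:
    fp.write("hello\n")
    holder.remember(fp)  # the resource escapes the with block here

# The with block has closed fp, but holder still has a reference to it.
try:
    holder.fp.write("more\n")
    failed = False
except ValueError:
    # Writing to a closed StringIO raises ValueError.
    failed = True

print(holder.fp.closed, failed)
```

Nothing goes wrong at the point where the reference escapes; the failure only shows up later, when something finally tries to use the now-closed object.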

Using defer instead of an implicit cleanup doesn't stop this problem from happening, but it makes explicit what's going on. When you write or see a defer fp.Close(), you're pointedly reminded that at the end of the function, the resource will be dead. There is no implicit magic, only explicit actions, and hopefully this creates enough warning and awareness. Given Go's design goals, being explicit here as part of the language design makes complete sense to me. You can still get it wrong, but at least the wrongness is more visible.

(I don't think being explicit is necessarily better in general than Python's implicit magic. Go and Python are different languages with different goals; what's appropriate for one is not necessarily appropriate for the other. Python has both language features and cultural features that make with a good thing for it.)

programming/GoVersusPythonWith written at 00:40:28


Some notes and issues from trying out urxvt as an xterm replacement

I've been using xterm for a very long time, but I'm also aware that it's not a perfect terminal emulator (especially in today's Unicode world, my hacks notwithstanding). Years ago I wrote up what I wanted added to xterm, and the recommendation I've received over the years (both on that entry and elsewhere) is for urxvt (aka rxvt-unicode). I've made off and on experiments with urxvt, but for various reasons I've recently been trying a bit more seriously to use it regularly and to evaluate it as a serious alternative to xterm for me.

One of my crucial needs in an xterm replacement is an equivalent of xterm's ziconbeep feature, which I use to see when an iconified xterm has new output. Fortunately that need was met a long time ago through a urxvt Perl extension written by Leah Neukirchen; you can get the extension itself here. In my version I took out the audible bell. Without this, urxvt wouldn't be a particularly viable option for me, so I'm glad that it exists.

Urxvt's big draw as an xterm replacement is that it will reflow lines as you widen and narrow it. However, for a long time this didn't seem to work for me, or didn't seem to work reliably. Last September I finally discovered that the issue is that urxvt only reflows lines after a resize if it's already scrolled text in the window. This is the case both for resizing wider and for resizing narrower, which can be especially annoying (since resizing wider can sometimes 'un-scroll' a window). This is something that I can sort of work around; these days I often make it a point to start out my urxvt windows in their basic 80x24 size, dump out the output that I'll want, and only then resize them to read the long lines. This mostly works but it's kind of irritating.

(I'm not sure if this is a urxvt bug or a deliberate design decision. Perhaps I should try reporting it to find out.)

Another difference is that xterm has relatively complicated behavior on double-clicks for what it considers to be separate 'words'; you can read the full details in the manpage's section on character classes. Urxvt has somewhat simpler behavior based on delimiter characters, and its default set of delimiters make it select bigger 'words' than xterm does. For instance, a standard urxvt setup will consider all of a full path to be one word, because / is not a delimiter character (neither is :, so all of your $PATH is one word as far as urxvt is concerned). I'm highly accustomed to xterm's behavior and I prefer smaller words here, because it's much easier to widen a selection than it is to narrow it. You can customize some of this behavior with urxvt's cutchars resource (see the urxvt manpage). Currently I'm using:

! requires magic quoting for reasons.
URxvt*cutchars:   "\\`\"'&()*,;<=>?@[]^{|}.#%+!/:-"

This improves the situation in urxvt but isn't perfect; in practice I see various glitches, generally when several of these delimiters happen in a row (eg given 'a...', a double-click in urxvt may select up to the entire thing). Since I'm using the default selection Perl extension, possibly I could improve things by writing some complicated regular expressions (or replace the selection extension entirely with a more controllable version where I understand exactly what it's doing). If I want to exactly duplicate xterm's behavior, a Perl extension is probably the only way to achieve it.

(I'm not entirely allergic to writing Perl extensions for urxvt, but it's been a long time since I wrote Perl and I'm not familiar with the urxvt extensions API, so at a minimum it's going to be a pain.)

Given these issues I'm not throwing myself into a complete replacement of my xterm usage with urxvt, but I am reaching for it reasonably frequently and I've taken steps to make it easier to use in my environment. This involves both making it as conveniently accessible as xterm and also teaching various bits of my window manager configuration and scripting that urxvt is a terminal window and should be treated like xterm.

This whole thing has been an interesting experience overall. It's taught me both how much I'm attuned to very specific xterm behaviors and how deeply xterm has become embedded into my overall X environment.

unix/UrxvtNotes written at 00:26:59


The unfortunate configuration choice Grub2 makes in UEFI configurations

When I talked about my new home machine, I mentioned that I wasn't even trying to use UEFI on it after my experiences with my work machine, fundamentally because Grub2's UEFI setup makes an unfortunate configuration choice. Today I'm going to talk about what that choice is and why it's an unfortunate one, at least for people like me (and for people with servers).

Put simply, it's that the UEFI Grub2 requires you to put grub.cfg in the EFI system partition. On the one hand, this is a reasonable choice, and probably one that simplifies Grub's life. In a UEFI environment, the EFI system partition exists and is easily accessible through (U)EFI services, so you have a natural place to put everything else that Grub2 needs, including both its additional modules and grub.cfg. In a traditional non-UEFI environment, grub2 needs to do some magic to be able to load your grub.cfg from whatever sort of filesystem and RAID setup and so on your /boot is in; the actual mechanisms of that are relatively impressive and more than a bit complex (cf). UEFI makes life much simpler.

On the other hand, grub.cfg is the one piece of your Grub2 configuration that changes frequently, because it gets updated every time you add a kernel, remove a kernel, or modify kernel command line arguments. This is an issue for me and for people with servers, because the EFI system partition can't be part of a RAID mirror. If you want to be able to boot from a second disk if your primary disk fails, you need a duplicate copy of your EFI system partition, and because grub.cfg changes frequently, you need to keep this up to date on a frequent basis. Otherwise, not only will you perhaps be booting from an older (but functional) version of Grub2 that you didn't update, you'll probably be trying to boot kernels that don't even exist any more (perhaps with wrong or missing kernel arguments). You'll probably be able to boot your system in the end, but it's not likely to be easy or automatic.

Life would be a lot easier and better here if you could configure Grub2 to load your real grub.cfg from your (non-EFI) /boot. You could use software RAID, LVM, and any filesystem normally supported by Grub2. People with mirrored system disks would still have all of the good stuff that you get with them in a non-EFI configuration (although every so often you'd need to remember to update the stuff in your EFI system partition on the second disk).

My guess is that the easiest way to add this to Grub2 would be to give Grub2 some way of including additional files in grub.cfg. With this, you'd still have a stub grub.cfg in the EFI system partition (which Grub2 could load with UEFI services just like today), and this stub would specify everything else. It would know the UUID of the filesystem with your /boot in it and also what additional RAID or LVM UUIDs it needed to look for and start in order to find it, just as a non-EFI Grub2 knows those details today (cf), but these wouldn't change very often so your EFI system partition grub.cfg would stay mostly unchanging.

Of course this Grub2 configuration choice isn't important unless you have mirrored system disks. If your system disk is unmirrored, an unmirrored EFI system partition creates no additional problems and the current UEFI Grub2 design is fine. Since this probably describes most systems using UEFI today, I don't expect UEFI Grub2 to change any time soon. Probably it will only start to change when servers become UEFI-only and people running them discover that their mirrored, redundant system disks actually aren't any more because of this.

linux/Grub2UEFIBigMistake written at 03:07:46


For the first time, my home PC has no expansion cards

When I started out with PCs, you needed a bunch of expansion cards to make them do anything particularly useful. In the era of my first home PC, almost all I used on the motherboard was the CPU and the memory; graphics, sound, Ethernet (if applicable to you), and even a good disk controller were add-on cards. As a result, selecting a motherboard often involved carefully counting how many slots you got and what types they were, to make sure you had enough for what you needed to add.

(Yes, in my first PC I was determined enough to use SCSI instead of IDE. It ran BSDi, and that was one of the recommendations for well supported hardware that would work nicely.)

Bit by bit, that's changed. In the early 00s, things started moving on to the motherboard, starting (I believe) with basic sound (although that didn't always work out for Linux people like me; as late as 2011 I was having to use a separate sound card to get things working). When decent SATA appeared on motherboards it stopped being worth having a separate disk controller card, and eventually the motherboard makers started including not just Ethernet but even decent Ethernet chipsets. Still, in my 2011 home machine I turned to a separate graphics card for various reasons.

With my new home machine, I've taken the final step on this path. Since I'm using the Intel onboard graphics, I no longer need even a separate graphics card and now have absolutely no cards in the machine; everything is on the motherboard. It's sometimes an odd feeling to look at the back of my case and see all of the case's slot covers still in place.

(My new work machine still needs a graphics card and that somehow feels much more normal and proper, especially as I've also added an Ethernet card to it so that I have a second Ethernet port for sysadmin reasons.)

I think one of the reasons that having no expansion cards feels odd to me is that for a long time having an all-onboard machine was a sign that you'd bought a closed box prebuilt PC from a vendor like Dell or HP (and were stuck with whatever options they'd bundled into the box). These prebuilt PCs have historically not been a great choice for people who wanted to run Linux, especially picky people like me who want unusual things, and I've had the folkloric impression that they were somewhat cheaply put together and not up to the quality standards of a (more expensive) machine you'd select yourself.

As a side note, I do wonder about the business side of how all of this came about. Integrating sound, Ethernet, and so on into motherboards isn't completely free (if nothing else, the extra physical connectors cost something), so the motherboard vendors had to have a motivation. Perhaps it was just the cut-throat competition that pushed them to offer more things on the board in order to make themselves more attractive.

(I also wonder what will be the next thing to become pervasive on motherboards. Wireless networking is one possibility, since it's already on higher end motherboards, and perhaps Bluetooth. But it also feels like we're hitting the limits of what can be pulled on to motherboards or added.)

tech/PCAllOnboard written at 22:00:40

A learning experience about the performance of our IMAP server

Our IMAP server has never been entirely fast, and over the years it has slowly gotten slower and more loaded down. Why this was so seemed reasonably obvious to us; handling mail over IMAP required a fair amount of network bandwidth and a bunch of IO (often random IO) to our NFS fileservers, and there was only so much of that to go around. Things were getting slowly worse over time because more people were reading and storing more mail, while the hardware wasn't changing.

We have a long-standing backwards compatibility issue with our IMAP server, where people's IMAP clients have full access to their $HOME and periodically go searching through all of it. Recently this started causing us serious problems, like running out of inodes on the IMAP server, and it became clear that we needed to do something about it. After a number of false starts, we wound up doing two important things over the past two months. First we blocked Dovecot from searching through a lot of directories, and then we started manually migrating users one by one to a setup where their IMAP sessions could only see their $HOME/IMAP instead of all of their $HOME. The two changes together significantly reduce the number of files and directories that Dovecot is scanning through (and sometimes opening to count messages).
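To get a rough feel for how much smaller the scanned tree becomes, here is a minimal sketch that counts the directories and files a full recursive walk would touch, first for an entire home directory and then for just its mail subdirectory. It assumes a plain recursive walk is a fair stand-in for the scanning involved, which is a simplification; the function names and the "IMAP" default are purely illustrative and this is not something taken from our actual setup.

```python
import os


def count_scan_entries(root):
    """Count the directories and files a full recursive walk of root visits.

    A nonexistent root simply yields (0, 0), because os.walk ignores
    errors by default.
    """
    dirs = 0
    files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return dirs, files


def compare_scan_cost(home, mail_subdir="IMAP"):
    """Compare walking all of home against walking only home/mail_subdir.

    Both names are hypothetical, chosen for illustration only.
    """
    return {
        "whole_home": count_scan_entries(home),
        "mail_only": count_scan_entries(os.path.join(home, mail_subdir)),
    }
```

Calling `compare_scan_cost("/path/to/some/home")` returns a pair of (directories, files) counts for each case. A large gap between the two suggests how much work a client that searches everything can cause.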

Well, guess what. Starting immediately with our first change and increasing as we migrated more and more high-impact users, the load on our IMAP server has been dropping dramatically. This is most clearly visible in the load average itself, where it's now entirely typical for the daytime load average to be under one (a level that was previously only achieved in the dead of night). The performance of my test Thunderbird setup has clearly improved, too, rising almost up to the level that I get on a completely unloaded test IMAP server. The change has basically been night and day; it's the most dramatic performance shift I can remember us managing (larger than finding our iSCSI problem in 2012). While the IMAP server's performance is not perfect and it can still bog down at some times, it's become clear that all of the extra scanning that Dovecot was doing was behind a great deal of the performance problems we were experiencing and that getting rid of it has had a major impact.

Technically, we weren't actually wrong about the causes of our IMAP server being slow; it definitely was due to network bandwidth and IO load issues. It's just that a great deal of that IO was completely unproductive and entirely avoidable, and if we had really investigated the situation we might have been able to improve the IMAP server long ago.

(And I think it got worse over time partly because more and more people started using clients, such as the iOS client, that seem to routinely use expensive scanning operations.)

The short and pungent version of what we learned is that IMAP servers go much faster if you don't let them do stupid things, like scan all through people's home directories. The corollary to this is that we shouldn't just assume that our servers aren't doing stupid things.

(You could say that another lesson is that if you know that your servers are occasionally doing stupid things, as we did, perhaps you should try to measure the impact of those things. But that's starting to smell a lot like hindsight bias.)

sysadmin/IMAPPerformanceLesson written at 02:06:21


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.