Wandering Thoughts


Why a network connection becoming writable when it succeeds makes sense

When I talked about how Go deals with canceling network connection attempts, I mentioned that it's common for the underlying operating system to signal you that a TCP connection (or more generally a network connection) has been successfully made by letting it become writable. On the surface this sounds odd, and to some degree it is, but it also falls out of what the operating system knows about a network connection before and after it's made. Also, in practice there is a certain amount of history tied up in this particular interface.

If we start out thinking about being told about events, we can ask what events you would see when a TCP connection finishes the three way handshake and becomes established. The connection is now established (one event), and you can generally now send data to the remote end, but usually there's no data from the remote end to receive so you would not get an event for that. So we would expect a 'connection is established' event and a 'you can send data' event. If we want a more compact encoding of events, it's quite tempting to merge these two together into one event and say that a new TCP connection becoming writable is a sign that its three way handshake has now completed.

(And you certainly wouldn't expect to see a 'you can send data' event before the three way handshake finishes.)

The history is that a lot of the fundamental API of asynchronous network IO comes from BSD Unix and spread from there (even to non-Unix systems, for various reasons). BSD Unix did not use a more complex 'stream of events' API to communicate information from the kernel to your program; instead it used simple and easy to implement kernel APIs (because this was the early 1980s). The BSD Unix API was select(), which passes information back and forth using bitmaps; one bitmap for sending data, one bitmap for receiving data, and one bitmap for 'exceptions' (whatever they are). In this API, the simplest way for the kernel to tell programs that the three way handshake has finished is to set the relevant bit in the 'you can send data' bitmap. The kernel's got to set that bit anyway, and if it sets that bit and also sets a bit in the 'exceptions' bitmap it needs to do more work (and so will programs; in fact some of them will just rely on the writability signal, because it's simpler for them).

Once you're doing this for TCP connections, it generally makes sense for all connections regardless of type. There are likely to be very few stream connection types where it makes sense to signal that you can now send (more) data partway through the connection being established, and that's the only case where this use of signaling writability gets in the way.
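As a minimal sketch of what this looks like from a program's side, here is a non-blocking connect that waits for writability using Python's wrappers around select(). The function name and the timeout value are my own illustrative choices. One thing the sketch assumes beyond the discussion above is that a failed connection attempt also makes the socket writable, so it checks SO_ERROR before declaring success.

```python
import errno
import os
import select
import socket


def connect_with_select(host, port, timeout=5.0):
    """Start a non-blocking connect and wait for the socket to become writable."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    # connect_ex() returns an errno value instead of raising. EINPROGRESS
    # (or EWOULDBLOCK on some platforms) means the handshake has started.
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))
    # Ask only about the 'you can send data' set.
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        sock.close()
        raise TimeoutError("connection attempt timed out")
    # Writability alone does not distinguish success from failure,
    # so fetch the pending error (0 means the connection is established).
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock
```

Calling connect_with_select("127.0.0.1", port) against a local listening socket returns an established socket once select() reports it as writable.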

ConnectingAndWritability written at 01:09:43


How I move files between iOS devices and Unix machines (using SSH)

Suppose, not hypothetically, that you're a Unix person with some number of iOS devices, such as a phone and a tablet, and you wind up with files in one environment that you would like to move to or access from the other. On the iOS devices you may have photos and videos you want to move to Unix to deal with them with familiar tools, and on Unix you may have files that you edit or read or refer to and you'd like to do that on your portable devices too. There are a variety of ways of doing this, such as email and Nextcloud, but the way I've come around to is using SSH (specifically SFTP) through the Secure Shellfish iOS app.

Secure Shellfish's fundamental pitch is nicely covered by its tagline of 'SSH file transfers on iOS' and its slightly longer description of 'SSH and SFTP support in the iOS Files app', although the Files app is not the only way you can use it. Its narrow focus makes it pleasantly minimalistic and quite straightforward, and it works just as it says it does; it uses SFTP to let you transfer files between a Unix account (or anything that supports SFTP) and your iOS devices, and also to look at and modify in place Unix files from iOS, through Files-aware programs like Textastic. As far as (SSH) authentication goes, it supports both passwords and SSH keys (these days it will generate RSA keys and supports importing RSA, ECDSA, and ed25519 keys).

If the idea of theoretically allowing Secure Shellfish full access to your Unix account makes you a bit nervous, there are several things you can do. On machines that you fully control, you can set up a dedicated login that's used only for transferring things between your Unix machine and your iOS devices, so that they don't even have access to your regular account and its full set of files. Then, if you use SSH keys, you can set your .ssh/authorized_keys to force the Secure Shellfish key to always run the SFTP server instead of allowing access to an ordinary shell. For example:

command="/usr/libexec/openssh/sftp-server",restrict ssh-rsa [...]

(sftp-server has various command line flags that may be useful here for the cautious. As I found out the hard way, different systems have different paths to sftp-server, and you don't get good diagnostics from Secure Shellfish if you get it wrong. On at least some versions of OpenSSH, you can use the special command name 'internal-sftp' to force use of the built-in SFTP server, but then I don't think you can give it any command line flags.)

To avoid accidents, you can also configure an initial starting directory in Secure Shellfish itself and thereby restrict your normal view of the Unix account. This can also be convenient if you don't want to have to navigate through a hierarchy of directories to get to what you actually want; if you know you're only going to use a particular configured server to work in some directory, you can just set that up in advance.

As I've found, there are two ways to transfer iOS things like photos to your Unix account with Secure Shellfish. In an iOS app such as Photos, you can either directly send what you want to transfer to Secure Shellfish in the strip of available apps (and then pick from there), or you can use 'Save to Files' and then pick Secure Shellfish and go from there. The advantage and drawback of directly picking Secure Shellfish from the app strip is that your file is transferred immediately and that you can't do anything more until the transfer finishes. If you 'save to files', your file is transferred somewhat asynchronously. As a result, if you want to immediately do something with your data on the Unix side and it's a large file, you probably want to use the app route; at least you can watch the upload progress and know immediately when it's done.

(Secure Shellfish has a free base version and a paid 'Pro' upgrade, but I honestly don't remember what's included in what. If it was free when I initially got it, I upgraded to the Pro version within a very short time because I wanted to support the author.)

PS: Secure Shellfish supports using jump (SSH) servers, but I haven't tested this and I suspect that it doesn't go well with restricting your Secure Shellfish SSH key to only doing SFTP.

IOSUnixFileTransfer written at 00:45:26


PCIe slot bandwidth can change dynamically (and very rapidly)

When I added some NVMe drives to my office machine and started looking into its PCIe setup, I discovered that its Radeon graphics card seemed to be operating at 2.5 GT/s (PCIe 1.0) instead of 8 GT/s (PCIe 3.0). The last time around, I thought I had fixed this just by poking into the BIOS, but in a comment, Alex suggested that this was actually a power-saving measure and not necessarily done by the BIOS. I'll quote the comment in full because it summarizes things better than I can:

Your GPU was probably running at lower speeds as a power-saving measure. Lanes consume power, and higher speeds consume more power. The GPU driver is generally responsible for telling the card what speed (and lane width) to run at, but whether that works (or works well) with the Linux drivers is another question.

It turns out that Alex is right, and what I saw after going through the BIOS didn't quite mean what I thought it did.

To start with the summary, the PCIe bandwidth being used by my graphics card can vary very rapidly from 2.5 GT/s up to 8 GT/s and then back down again based on whether or not the graphics driver needs the card to do anything (or the aggregate Linux and X software stack as a whole, since I don't know where these decisions are being made). The most dramatic and interesting difference is between two apparently very similar ways of seeing if the Radeon's bandwidth is currently downgraded, either automatically scanning through lspci's output with 'lspci -vv | fgrep downgrade' or manually looking through it with 'lspci -vv | less'. When I used less, the Radeon normally showed up downgraded to 2.5 GT/s. When I used fgrep, other things before the Radeon showed up as downgraded but the Radeon never did; it was always at 8 GT/s.

(Some of those other things have been downgraded to 'x0' lanes, which I suspect means that they've been disabled as unused.)

What I think is happening here is that when I pipe lspci to less, lspci gets the Radeon's bandwidth before any output is written to the screen (less reads it all in a big gulp and then displays it), so at the time the graphics chain is inactive. When I use the fgrep pipe, some output is written to the screen before lspci gets to the Radeon and so the graphics chain lights up the Radeon's bandwidth to display things. What this suggests is that the graphics chain can and does vary the Radeon's PCIe bandwidth quite rapidly. Another interesting case is that running the venerable glxgears doesn't bring the PCIe bandwidth up from 2.5 GT/s, but running GpuTest's 'fur' test does (it goes to 8 GT/s as you might expect).

(It turns out that nVidia's Linux drivers also do this.)

Of course all of this may make seeing whether you're getting full PCIe bandwidth a little bit interesting. It's clearly not enough to just look at your system, even when it's moderately active (I have several X programs that update once a second); you really need to put it under some approximation of full load and then check. So far I've only seen this happen with graphics cards, but who knows what's next (NVMe drives could be one candidate to drop their bandwidth to save power and thus reduce heat).
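If you want to check link state from a program instead of reading lspci output, one option is the per-device attributes that Linux exposes in sysfs. The following is a sketch that assumes a kernel providing current_link_speed, max_link_speed, current_link_width and max_link_width under /sys/bus/pci/devices; the function names are hypothetical. Like lspci, it only gives a snapshot of the moment it ran, so it has the same 'check under load' caveat.

```python
from pathlib import Path

LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_link_status(base="/sys/bus/pci/devices"):
    """Return {device name: {attribute: value}} for devices with link info."""
    status = {}
    for dev in sorted(Path(base).iterdir()):
        try:
            fields = {name: (dev / name).read_text().strip() for name in LINK_ATTRS}
        except OSError:
            # Assumed to mean the device has no PCIe link attributes.
            continue
        status[dev.name] = fields
    return status


def downgraded(status):
    """Pick out devices currently below their maximum speed or width."""
    return {
        dev: f
        for dev, f in status.items()
        if f["current_link_speed"] != f["max_link_speed"]
        or f["current_link_width"] != f["max_link_width"]
    }
```

Running downgraded(read_link_status()) once while idle and once under a GPU load is one way to see the difference described above.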

PCIeVaryingBandwidth written at 00:38:31


Some important things about how PCIe works out involve BIOS magic

I'll start with my remark on Mastodon:

I still don't know why my Radeon graphics card and the PCIe bridge it's behind dropped down from PCIe 3.0 all the way to PCIe 1.0 bandwidth, but going into the BIOS and wandering around appears to have magically fixed it, so I'll take that.

PCIe: this generation's SCSI.

When I added some NVMe drives to my office machine and ran into issues, I discovered that the Radeon graphics card on my office machine was operating at 2.5 GT/s instead of 8 GT/s, which is to say PCIe 1.0 data rates instead of PCIe 3.0 ones (which is what it should be operating at). At the end of the last installment I speculated that I had accidentally set something in the BIOS that told it to limit that PCIe slot to PCIe 1.0, because that's actually something you can do through BIOS settings (on some BIOSes). I went through the machine's BIOS today and found nothing that would explain this, and in fact it doesn't seem to have any real settings for PCIe slot bandwidth. However, when I rebooted the machine after searching through the BIOS, I discovered that my Radeon and the PCIe bridge it's behind were magically now at PCIe 3.0's 8 GT/s.

I already knew that PCIe device enumeration involved a bunch of actions and decisions by the BIOS. I believe that the BIOS is also deeply involved in deciding how many PCIe lanes are assigned to particular slots (although there are physical constraints there too). Now it's pretty clear that your BIOS also has its fingers in decisions about what PCIe transfer rate gets used. As far as I know, all of these decisions happen before your machine's operating system comes into the picture; it mostly has to accept whatever the BIOS set up, for good or bad. Modern BIOSes are large opaque black boxes of software, and like all such black boxes they can have both bugs and mysterious behavior.

(Even when their PCIe setup behavior isn't a bug and is in fact necessary, they don't explain themselves, either to you or to the operating system so that your OS can log problems and inefficiencies.)

How do you know that your system is operating in a good PCIe state instead of one where PCIe cards and onboard controllers are being limited for some reason? Well, you probably don't, not unless you go and look carefully (and understand a reasonable amount about PCIe). If you're lucky you may detect this through side effects, such as increased NVMe latency or lower than expected GPU performance (if you know what GPU performance to expect in your particular environment). Such is the nature of magic.

PCIeAndBIOSDecisions written at 01:26:03


Desktop motherboards can have fewer useful PCIe slots than they seem to

When I had to get an adapter card for my office machine's second NVMe drive, I worried (in advance) about the card itself slowing down the drive. It turns out that I should have also been concerned about a second issue, which is what PCIe slot to put it in. My current office machine uses an ASUS Prime X370-Pro motherboard, and if you look at it there are six PCIe slots; three of them are 'x1' single lane slots (which the manual calls PCIEX1_1 through 3), and three are physically x16 slots (called PCIEX16_1 through 3). In theory, this should make life simple; a good NVMe drive requires 4 PCIe lanes (written as 'x4'), and I have three slots that can do that.

Of course life is not so simple. For a start, just because a slot is physically a PCIe x16 slot doesn't mean that it supplies all 16 lanes (especially under all conditions), and basically no current motherboards actually provide three true x16 slots. Of the three x16 slots on this motherboard, the third is only PCIe x4 and the first two are only collectively x16; you can have one card at x16 or two cards at x8 (and it may be that only the first slot can do x16, the manual isn't entirely clear). The next issue is that the x16 @ x4 slot also shares PCIe lanes, this time with the second and third PCIe x1 slots. If you use a PCIe x1 card in either of the x1 slots, the x16 @ x4 slot becomes an x16 @ x2 slot. Finally, the first PCIe x1 slot is physically close enough to the first x16 slot that a dual width GPU card more or less precludes using it, which is unfortunate since that's the only PCIe x1 slot that doesn't conflict with the x16/x4 slot.

My office machine has a Radeon graphics card that happens to be dual width, an x1 Intel gigabit Ethernet card because I need a second network port, and now a PCIe NVMe adapter card that physically requires a PCIe x4 or greater slot and wants to be x4 to work best. The natural layout I used when I put the machine together initially was the Radeon graphics card in the first PCIe x16 slot and the Intel card in one of the two PCIe x1 slots where it would fit (I picked the third, putting it as far away from the Radeon as possible). When I added the NVMe card, I put it in the third PCIe x16 slot (which is physically below the third PCIe x1 slot with the Intel card); it seemed the most natural spot for it, partly because it kept room for air circulation for the fans of the Radeon card. Then I noticed that the second NVMe drive had clearly higher latency (especially write latency) than the first one, started looking, and discovered that it was running at PCIe x2 instead of x4 (because of the Intel Ethernet card).

If my graphics card could use x16 and I wanted it to, it might still be possible to make everything work at full speed, but I'd have to move the graphics card and hope that the second PCIe x16 slot can support full x16, not just x8. As it is, my card fortunately only wants x8, which means the simple resolution to my problem is moving the NVMe adapter card to the second PCIe x16 slot. If I wanted to also add a 10G-T Ethernet card, I might be out of luck, because I think those generally want at least x4.

(Our current 10G-T hardware turns out to be somewhat inconsistent on this. Our Intel dual 10G-T cards seem to want x8, but our Linux fileservers claim that their onboard 10G-T ports only want x1 with a link speed of 2.5 GT/s.)

All of this is annoying, but the more alarming bit is that it's unlikely to be particularly obvious to people if their PCIe lane count is being reduced with cards like this PCIe to NVMe adapter card. It will still work, just more slowly than you'd expect, and then perhaps people write reviews saying 'this card is inferior and doesn't deliver full performance for NVMe drives'.

(This also omits the additional issue of whether the PCIe lanes in question are directly connected to the CPU or have to flow through the chipset, which has a limited bandwidth connection to the CPU. This matters on modern machines because you have to go through the CPU to get to RAM, so you can only get so much RAM bandwidth total from all PCIe devices behind the chipset, no matter how many PCIe lanes the chipset claims to provide (even after limitations). See also my older entry on PCIe and how it interacts with modern CPUs.)

Sidebar: PCIe 3.0 versus 2.0 in slots

The other issue in slots is which PCIe version they support, with PCIe 3.0 being potentially faster than PCIe 2.0. On my motherboard, only slots driven directly by the CPU support PCIe 3.0; slots driven through the AMD X370 chipset are 2.0 only. All of the PCIe x1 slots and the PCIe x16 @ x4 slot are driven by the chipset and so are PCIe 2.0, which may have been another source of my NVMe performance difference. The two full x16 slots are PCIe 3.0 from the CPU, as is the motherboard's M.2 slot.
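To put rough numbers on how much lane count and PCIe version matter, here is a small sketch that computes raw per-direction link bandwidth from the published transfer rates and line encodings (8b/10b for PCIe 1.0 and 2.0, 128b/130b for 3.0). The function name is hypothetical, and the results are line rates before protocol overhead, so real devices will see somewhat less.

```python
# generation -> (transfer rate in GT/s, line encoding efficiency)
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}


def link_bandwidth_mb_s(generation, lanes):
    """Raw one-direction bandwidth of a PCIe link in MB/s."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    # GT/s times efficiency is usable Gbit/s per lane; divide by 8 for
    # GB/s, multiply by 1000 for MB/s, then scale by the lane count.
    return rate_gt_s * efficiency / 8 * 1000 * lanes
```

By this estimate a PCIe 2.0 x2 link (what the NVMe adapter card wound up with) is about 1000 MB/s, while a PCIe 3.0 x4 link is about 3938 MB/s.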

PCIeSlotsLimitations written at 01:04:01


The problem of multiple NVMe drives in a PC desktop today

My office workstation currently has two 250 GB Samsung 850 EVO SSDs (and some HDs). These were decent SSDs for their era, but they're now any number of years old and 250 GB isn't very large, so as part of our general stocking up on 500 GB SSDs at good sale prices, I get to replace them. To my surprise, it turns out that decent 500 GB NVMe drives can now be had at roughly the same price as 500 GB SSDs (especially during sales), so I got approval to get two NVMe drives as replacements instead of two SSDs. Then I realized I had a little issue, because my motherboard only has one M.2 NVMe slot.

In general, if you want multiple NVMe drives in a desktop system, you're going to have problems that you wouldn't have with the same number of SSDs (or HDs). PC motherboards have been giving you lots of SATA ports for a long time now, but M.2 slots are much scarcer. I think that part of this is simply an issue of physical board space, since M.2 slots need a lot more space than SATA ports do, but part of it also seems to be that M.2 drives consume a lot more PCIe lanes than SATA ports do. An M.2 slot needs at least two lanes and really you want it to have four, and even today there are only so many PCIe lanes to go around, at least on common desktops.

(I suspect that this is partly segmentation on the part of Intel and to a lesser extent AMD. They know that server people increasingly want lots of PCIe lanes, so if they restrict that to expensive CPUs and chipsets, they can sell more of them. Unusual desktop people like me get to lose out.)

I'm solving my immediate problem by getting a PCIe M.2 adapter card, because fortunately my office desktop has an unused PCIe x4 card slot right now. But this still leaves me with potential issues in the long run. I mirror my drives, so I'll be mirroring these two NVMe drives, and when I replace a drive in such a mirror I prefer to run all three drives at once for a while rather than break the mirror's redundancy to swap in a new drive. With NVMe drives, that would require two addon cards on my current office machine and I believe it would drop my GPU from x16 to x8 in the process (not that I need the GPU bandwidth, since I run basic X).

(And if I wanted to put a 10G-T Ethernet card into my desktop for testing, that too would need another x4 capable slot and I'd have my GPU drop to x8 to get the PCIe lanes. Including the GPU slot, my motherboard has only three x4 capable card slots.)

One practical issue here is that apparently PCIe M.2 adapter cards can vary somewhat in quality and the resulting NVMe IO rates you get, and it's hard to know whether or not you're going to wind up with a decent one. Based on the low prices for cards with a single M.2 slot and the wide collection of brands I'd never heard of, this is a low margin area dominated by products that I will politely call 'inexpensive'. The modern buying experience for such products is not generally a positive one (good luck locating decent reviews, for example).

(Also, apparently not all motherboards will boot from an NVMe drive on an adapter card in a PCIe slot. This isn't really an issue for me (if the NVMe drive in my motherboard M.2 slot fails, I'll move the adapter drive to the motherboard), but it might be for some people.)

Hopefully all of this will get better in the future. If there is a large movement toward M.2 (and I think there may be), there will be demand for more M.2 capacity from CPUs and chipsets and more M.2 slots on desktop motherboards, and eventually the vendors will start delivering on that (somehow). This might be M.2 slots on the motherboard, or maybe more x8 and x16 PCIe slots and then adapter cards (and BIOSes that will boot from them).

MultiNVMeMotherboardIssue written at 01:45:12


LiveJournal and the path to NoSQL

A while back on Mastodon, I said:

Having sort of distantly observed the birth of NoSQL from the outside, I have a particular jaundiced view of its origins that traces it back to LiveJournal's extensive wrangling with MySQL (although this is probably not entirely correct, since I was a distant outside observer who was not paying much attention).

Oddly, this view is more flattering to NoSQL than many takes on its origins that I've seen. LJ's problems & solutions in the mid 00s make a good case for NoSQL.

LiveJournal weren't the only people who had problems with scaling MySQL in the early and mid 2000s, but they do have the distinction of having made a bunch of public presentations about them, many of which I read with a fair degree of interest back in the day. So let me rewind time to the early 2000s and quickly retrace LiveJournal's MySQL journey. To set the stage, this was a time before clouds and before affordable SSDs; if you ran a database-backed website at scale, you had real machines somewhere and they used hard drives.

LiveJournal started out with a single MySQL database machine. When the load overwhelmed it, they added read replicas (and then later in memory caching, creating memcached). However, their read replicas were eventually overwhelmed by write volume; reads can be served from one of N replicas, but the master and all replicas see the same write load. To deal with this, LiveJournal wound up manually sharding parts of their overall MySQL database into smaller 'user clusters', having already accepted that they never wanted to do what would be the equivalent of cross-cluster database joins.

At the point where you've manually sharded your SQL database, from a global view you don't really have an SQL database any more. You can't do cross-shard joins, cross-shard foreign keys, or cross-shard transactions, and you don't even have a guaranteed schema for your database since every shard has its own copy and they might not be in synchronization (they should be, but stuff happens). You're almost certainly querying and modifying your database through a higher level application library that packages up the work of finding the right shard, handling cross-shard operations, and so on, not by opening up a connection and sending SQL down it. The SQL is an implementation detail.
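To make 'the SQL is an implementation detail' concrete, here is a toy sketch of the kind of application library that sits in front of a manually sharded database. Everything in it is a hypothetical illustration: the class and method names are made up, in-memory sqlite3 databases stand in for separate MySQL clusters, and placing users by user ID modulo the cluster count is my simplification, not LiveJournal's actual scheme.

```python
import sqlite3


class ShardedUserStore:
    """Route per-user data to one of several independent databases."""

    def __init__(self, num_clusters):
        # Each 'user cluster' is a completely separate database with
        # its own copy of the schema.
        self.clusters = [sqlite3.connect(":memory:") for _ in range(num_clusters)]
        for db in self.clusters:
            db.execute("CREATE TABLE posts (user_id INTEGER, body TEXT)")
        # Global directory of which cluster holds which user.
        self.user_cluster = {}

    def _cluster_for(self, user_id):
        idx = self.user_cluster.setdefault(user_id, user_id % len(self.clusters))
        return self.clusters[idx]

    def add_post(self, user_id, body):
        db = self._cluster_for(user_id)
        with db:  # the transaction covers this one shard only
            db.execute("INSERT INTO posts VALUES (?, ?)", (user_id, body))

    def posts_for(self, user_id):
        db = self._cluster_for(user_id)
        rows = db.execute(
            "SELECT body FROM posts WHERE user_id = ? ORDER BY rowid", (user_id,)
        )
        return [row[0] for row in rows]

    def total_posts(self):
        # Anything cross-shard is done in application code, not in SQL.
        return sum(
            db.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
            for db in self.clusters
        )
```

Callers only ever see add_post(), posts_for() and total_posts(); nothing above this layer sends SQL anywhere, which is the sense in which the global database is no longer really an SQL database.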

These problems aren't LiveJournal's alone. You need all of this sharding to operate at scale in the mid 2000s, because hard drives imposed a hard limit on how many reads and writes a second you could do. You couldn't get away from the tyranny of hard drive limitations on (seek) IO operations a second the way you can today with SSDs and now NVMe drives (and large amounts of RAM). And if you have to shard and don't have an SQL database after you shard, you might as well use a database that's designed for this environment from the start, one that handles the sharding, replication, and so on for you instead of making you build your own version. If it's really fast, so much the better.

(This idea is especially attractive to small startups, who don't have the people to build LiveJournal level software and libraries themselves. Small startups are busy enough trying to build a viable product without trying to also build a sharded database environment.)

Today, in a world of SSDs, NVMe drives, large amounts of RAM, cloud providers, managed SQL compatible database offerings, and so on, it's tempting to laugh at the idea of NoSQL. But at the time I think it looked like a sensible response to the challenges that large websites were facing.

(Whether so many people should have adopted NoSQL databases is a separate issue. But my impression is that startups are nothing if not ambitious.)

LiveJournalAndNoSQL written at 22:40:53


TCP/IP and a consequence of reliable delivery guarantees

I recently read My hardest bug to debug (via), which discusses an interesting and hard to find bug that caused an industrial digital camera used for barcode scanning to hang. The process of diagnosis (and the lessons learned from it) are interesting, so I urge you to go read the article now before reading further here, because I have to spoil the actual bug.

(Really, go read the article. This is your last chance.)

One part of the control system worked by making a TCP connection to the camera, doing some initial setup, and then leaving the connection open so that it could later send any setting changes to the camera without having to re-open a connection. It turned out that the camera had an undocumented behavior of sending scan results over this TCP connection (as well as making them available in other ways). The control system didn't expect this return traffic, so it never listened for responses on the TCP connection. The article ends with, in part:

I still don't understand how this caused the camera to lock up. We were receiving the TCP results via Telnet but we weren't reading the stream. Did it just build up in some buffer? How did this cause the camera to lock up? I still can't answer these questions.

The most likely guesses are that yes, the sent data built up in a collection of buffers on both the receiver and the sender, and this caused the hang because eventually the camera software attempted to send more data to the network and the OS on the camera put the software to sleep because there wasn't any more buffer room.

While you might blame common networking APIs a bit here, in large part this is a deep but in some sense straightforward consequence of TCP/IP promising reliable delivery of your bytes in a world of finite and often limited resources. A properly operating receiving system cannot throw away bytes after it's ACK'd them, so it must buffer them; a sensible system will have a limit on how many bytes it will buffer before it stops accepting any more. Similarly, a properly operating sending system can't generally throw away bytes after accepting them from an application, so if the receiving system isn't accepting more bytes the sending system has to (eventually) stop accepting them from the application. All of this is familiar in general as backpressure. When the backpressure propagates all the way back to the sending application, it can either stall itself or it can get a 'no more buffer space' error from the operating system.

(Where the APIs come in is that when there's no more buffer space, they generally opt to have the application's attempt to send more data just stall instead of producing an error. This is generally easier for many applications to handle, but sometimes it's not what you want.)
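You can watch this backpressure happen on a single machine. The following sketch is my own illustration, not anything from the camera's software: it uses a local socket pair as a stand-in for a TCP connection, never reads from the receiving end, and puts the sender in non-blocking mode so that running out of buffer space shows up as an error instead of a stall. The function name is hypothetical.

```python
import socket


def fill_until_backpressure(chunk_size=65536):
    """Send to a peer that never reads; return how many bytes were accepted."""
    sender, receiver = socket.socketpair()
    # Non-blocking mode turns 'no more buffer space' into an exception.
    # In the default blocking mode, send() would simply never return.
    sender.setblocking(False)
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                total += sender.send(chunk)
            except BlockingIOError:
                # Every buffer between us and the reader is now full.
                return total
    finally:
        sender.close()
        receiver.close()
```

The number that fill_until_backpressure() returns depends on the system's buffer sizes, but it is always finite, which is the point; a blocking sender in the camera's position would go to sleep at that moment.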

TCP's reliable delivery guarantee means that you can only send so much data to something that isn't dealing with it. You cannot just send data forever and have it vanish into the void because no one is dealing with it; that wouldn't be reliable delivery. After all, the receiving application might wake up and start reading that accumulated data some day, and if it does all the data had better be there for it.

TCPReliableDeliveryConsequence written at 23:54:10


Filesystem size limits and the complication of when errors are detected

One of the practical things that further complicates limiting the size of things in filesystems is the issue of when people find out about this, or rather how this interacts with another very desirable property in modern filesystems.

For practical usability, you want people (by which I mean programs) to be notified about all 'out of space' errors synchronously, when they submit their write IOs, or more exactly when they get back the results of submitting one (although this is effectively the same thing unless you have an asynchronous write API and a program is using it). Common APIs such as the Unix API theoretically allow you to signal write errors later (for example on close() in Unix), but actually doing so will cause practical problems both for straightforward programs that just write files out and are done (such as editors) and more complicated programs that do ongoing IO. Beyond carelessness, the problem is that write errors that aren't tied to a specific write IO leave programs in the dark about what exactly went wrong. If your program makes five write() calls at various offsets in the file and then gets an error later, the file could be in pretty much any state as far as it knows; it has no idea which writes succeeded, if any, and which failed. Some write errors can't be detected until the IO is actually done and have to be delivered asynchronously, but delivering as many as possible as early as possible is quite desirable. And while 'the disk exploded' write errors are uncommon and unpredictable, 'you're out of space' is both predictable (in theory) and common, so you very much want to deliver it to programs immediately if it's going to happen at all.
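The difference between the two kinds of errors shows up clearly in code. This is a sketch under my own assumptions (a Unix system where os.pwrite() is available, and a hypothetical write_pieces() helper): an error raised by a specific write tells the program exactly how far it got, while an error that only appears at fsync() time leaves it unable to say which pieces are good.

```python
import os


def write_pieces(path, pieces):
    """Write (offset, data) pieces to path.

    Returns (completed, error). 'completed' counts the pieces known to be
    fully written, or is None when an error cannot be tied to any piece.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    completed = 0
    try:
        for offset, data in pieces:
            try:
                while data:
                    n = os.pwrite(fd, data, offset)
                    data, offset = data[n:], offset + n
            except OSError as err:
                # Synchronous error: we know which piece failed and that
                # all earlier pieces were accepted.
                return completed, err
            completed += 1
        try:
            os.fsync(fd)
        except OSError as err:
            # Delayed error: any of the pieces may be affected.
            return None, err
        return completed, None
    finally:
        os.close(fd)
```

A program that gets back (3, error) can retry or report from the fourth piece onward; one that gets back (None, error) can only treat the whole file as suspect.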

By itself this is no problem, but then we get into the other issue. Modern filesystems have discovered that they very much want to delay block allocation until as late as possible, because delaying and aggregating it together across a bunch of pending writes gives you various performance improvements and good features (see eg the Wikipedia article). The natural place to detect and report being out of various sorts of space is during block allocation (because that's when you actually use space), but if this is asynchronous and significantly delayed, you don't have prompt reporting of out of space issues to programs. If you try to delay block allocation but perform size limit checking early, there's a multi-way tradeoff between basically duplicating block allocation, being completely accurate (so that there's little or no chance of a delayed failure during the later block allocation), and being fast.

In theory, the best solution to this problem is probably genuinely asynchronous write APIs that delay returning their results until your data has been written to disk. In practice asynchronous APIs leave you with state machines in your programs, and state machines are often hard to deal with (this is the event loop problem, and also).

FilesystemQuotaErrorTiming written at 22:14:06


Limiting the size of things in a filesystem is harder than it looks

Suppose, not entirely hypothetically, that you want to artificially limit the size of a filesystem or perhaps something within it, for example the space used by a particular user. These sorts of limits usually get called quotas. Although you might innocently think that enforcing quotas is fairly straightforward, it turns out that it can be surprisingly complicated and hard even in very simple filesystems. As filesystems become more complicated, it can rapidly become much more tangled than it looks.

Let's start with a simple filesystem with no holes in files and us only wanting to limit the amount of data that a user has (not filesystem metadata). If the user tries to write 128 Kb to some file, we already need to know where in the file it's going. If the 128 Kb is entirely overwriting existing data, the user uses no extra space; if it's being added to the end of the file, they use 128 Kb more space; and if it partly overlaps with the end of the file, they use less than 128 Kb. Fortunately the current size of a file that's being written to is generally very accessible to the kernel, so we can probably know right away whether the user's write can be accepted or has to be rejected because of quota issues. Well, we can easily know until we throw multiple CPUs into the situation, with different programs on different CPUs all executing writes at once. Once we have several CPUs, we have to worry about synchronizing our information on how much space the user is currently using.
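As a rough illustration of the simple case, here is a Python sketch. The names `extra_data_bytes` and `UserQuota` are hypothetical, and the lock stands in for whatever synchronization a real kernel would use across CPUs.

```python
import threading


def extra_data_bytes(file_size, offset, length):
    """Extra data bytes a write uses in a file that has no holes."""
    if offset > file_size:
        raise ValueError("write would leave a hole; not modeled here")
    return max(0, offset + length - file_size)


class UserQuota:
    """Toy per-user data quota (an assumption, not a real kernel API)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def try_charge(self, file_size, offset, length):
        extra = extra_data_bytes(file_size, offset, length)
        # Check and update together, so that two concurrent writers
        # can't both squeeze in under the limit.
        with self._lock:
            if self.used + extra > self.limit:
                return False
            self.used += extra
            return True
```

Even in this toy version, every write that grows a file has to take the lock, which is the synchronization cost that multiple CPUs introduce.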

Now, suppose that we want to account for filesystem metadata as well, and that files can have unallocated space in the middle of themselves. Now the kernel doesn't know how much space 128 Kb of file data is going to use until it's looked at the file's current indirect blocks. Writing either after the current end of the file or before it may require allocating new data blocks and perhaps new indirect blocks (in extreme cases, several levels of them). The existing indirect blocks for the file may or may not already be in memory; if they aren't, the kernel doesn't know whether it can accept the write until it reads them off disk, which may take a while. The kernel can optimistically accept the write, start allocating space for all of the necessary data and metadata, and then abort if it runs into a quota limit by the end. But if it does this, it has to have the ability to roll back all of those allocations it may already have done.
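The optimistic approach can be sketched like this in Python. This is a deliberately simplified model under my own assumptions: a fixed block size, every data block reached through exactly one single-level indirect block, and a file represented as a set of allocated block indexes. Real filesystems are considerably messier.

```python
BLOCK = 4096               # assumed block size
PTRS_PER_INDIRECT = 1024   # assumed data blocks covered per indirect block


def optimistic_write(allocated, used, limit, offset, length):
    """Allocate blocks as we go; undo everything if we hit the quota.

    `allocated` is the set of data block indexes the file already has;
    `used` and `limit` are counted in blocks. Returns the new usage, or
    None if the write was rejected (with `allocated` restored).
    """
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    indirect = {b // PTRS_PER_INDIRECT for b in allocated}
    undo = []      # data blocks allocated by this call
    charged = 0    # data + indirect blocks charged so far
    for b in range(first, last + 1):
        if b in allocated:
            continue          # overwriting an existing block: no new space
        cost = 1
        if b // PTRS_PER_INDIRECT not in indirect:
            cost += 1         # this block needs a new indirect block too
        if used + charged + cost > limit:
            for done in undo:             # roll back what we did so far
                allocated.discard(done)
            return None
        allocated.add(b)
        undo.append(b)
        indirect.add(b // PTRS_PER_INDIRECT)
        charged += cost
    return used + charged
```

Note that the function can't tell whether a write fits until it has walked the file's existing allocation, which in a real kernel may mean waiting for disk reads.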

(Similar issues come up when you're creating or renaming files and more broadly whenever you're adding entries to a directory. The directory may or may not have a free entry slot already, and adding your new or changed name may cause a cascade of allocation changes, especially in sophisticated directory storage schemes.)

Features like compression and deduplication during writes complicate this picture further, because you don't know how much raw data you're going to need to write until you've gone through processing it. You can even discover that the user will use less space after the write than before, if they replace incompressible unique data with compressible or duplicate data (an extreme case is turning writes of enough zero bytes into holes).
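Here is a small Python sketch of why you can't know the space change in advance. It uses zlib as a stand-in for whatever compression a filesystem might actually use, and the helper names are hypothetical. Deduplication is not modeled at all.

```python
import zlib


def stored_size(data):
    """Bytes of space a chunk of data takes in this toy model."""
    if not any(data):
        return 0    # all zero bytes: becomes a hole, no space used
    compressed = zlib.compress(data)
    # Assume we store the data raw if compression doesn't help.
    return min(len(data), len(compressed))


def space_delta(old_data, new_data):
    """Net change in space used when new_data replaces old_data."""
    return stored_size(new_data) - stored_size(old_data)
```

The only way to get an answer out of `stored_size()` is to actually run the compressor over the data, and `space_delta()` can come out negative when incompressible data is replaced by compressible data or zeros.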

If the filesystem is a modern 'copy on write' one such as ZFS, overwriting existing data may or may not use extra space even without compression and deduplication. Overwriting data allocates a new copy of the data (and metadata pointing to it), but it also normally frees up the old version of the data, hopefully giving you a net zero change in usage. However, if the old data is part of a snapshot or otherwise referenced, you can't free it up and so an 'overwrite' of 128 Kb may consume the same amount of space as appending it to the file as new data.
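A toy Python model of this, using reference counts on blocks, might look like the following. This is not how ZFS actually tracks things; the function name and the dictionary of reference counts are assumptions for illustration, and metadata is ignored entirely.

```python
def cow_overwrite(refcounts, old_blocks, new_blocks):
    """Overwrite old_blocks with new_blocks, copy-on-write style.

    `refcounts` maps a block id to how many things reference it (the
    live file plus any snapshots). Returns the net change in blocks used.
    """
    delta = 0
    for b in new_blocks:
        refcounts[b] = 1      # the new copy is always freshly allocated
        delta += 1
    for b in old_blocks:
        refcounts[b] -= 1     # the live file drops its reference
        if refcounts[b] == 0:
            del refcounts[b]  # nothing else holds it, so it is freed
            delta -= 1
    return delta
```

With no snapshot, overwriting one block allocates one and frees one for a net change of zero. With a snapshot holding the old block, the same overwrite costs a full block of new space.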

Filesystems with journals add more issues and questions, especially the question of whether you add operations to the journal before you know whether they'll hit quota limits or only after you've cleared them. The more you check before adding operations to the journal, the longer user processes have to wait, but the less chance you have of hitting a situation where an operation that's been written to the journal will fail or has to be annulled. You can certainly design your journal format and your journal replay code to cope with this, but it makes life more complicated.

At this point you might wonder how filesystems that support quotas ever have decent performance, if checking quota limits involves all of this complexity. One answer is that if you have lots of quota room left, you can cheat. For instance, the kernel can know or estimate the worst case space usage for your 128 Kb write, see that there is tons of room left in your quota even in the face of that, and not delay while it does further detailed checks. One way to deal with the SMP issue is to keep a very broad count of how much outstanding write IO there is (which the kernel often wants anyway) and not bother with synchronizing quota information if the total outstanding writes are significantly less than the quota limit.
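The cheat can be sketched as a quick pessimistic check, here in Python. The function name and the `worst_case_factor` parameter are hypothetical; a real kernel would derive its worst case estimate from the filesystem's actual structure.

```python
def can_skip_detailed_check(used, outstanding, limit, write_len,
                            worst_case_factor=2):
    """True if the write fits even under a pessimistic estimate.

    `used` may be a slightly stale, unsynchronized count; `outstanding`
    is a broad count of write IO that hasn't been accounted for yet.
    """
    worst_case = write_len * worst_case_factor
    return used + outstanding + worst_case <= limit
```

Only when this returns False does the kernel need to fall back to the slow path of synchronizing quota information and working out the exact space usage.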

(I didn't realize how many of these issues were lurking until I started to actually think about what's involved in checking and limiting quotas.)

FilesystemLimitingSizeProblems written at 21:31:43
