Wandering Thoughts archives

2020-05-18

Reading the POSIX standard for Unix functions is not straightforward

I recently wrote about exploring munmap() on page zero, and in the process looked at the POSIX specification for munmap(). One of my discoveries about the practical behavior of Unixes here is that OpenBSD specifically disallows using munmap() on address space that isn't currently mapped (see munmap(2)). In my entry, I said that it wasn't clear whether POSIX strictly authorized this behavior, although you could put forward an interpretation where it was okay.

In a comment, Jakob Kaivo put forward the view that POSIX permitted this and any other behavior when munmap() is applied to unmapped address space, because of a sentence at the end of the Description:

The behavior of this function is unspecified if the mapping was not established by a call to mmap().

At first reading this seems clear. But wait, it's time to get confused. Earlier in the same description of munmap()'s behavior, POSIX clearly says that it can be used if there is no mapping:

[...] If there are no mappings in the specified address range, then munmap() has no effect.

(Note that 'has no effect' is different from 'unspecified'.)

POSIX doesn't explicitly require that this not raise an error, but you can read its description of when you can get EINVAL as requiring that it doesn't (some of the time). Assuming addr is aligned and len is not zero, you get EINVAL if some of the address space you're unmapping is 'outside the valid range for the address space of a process', and perhaps implicitly not otherwise. And then there's the question of what POSIX intended by saying 'the address space of a process' instead of 'the address space of the (current) process'.

One of the things we can see here is that it's hard for non-specialists to truly read and understand the POSIX standards. Both Jakob Kaivo and I are at least reasonably competent C and Unix programmers and we've both attempted to read a reasonably straightforward POSIX specification of a single function, yet we've wound up somewhere between disagreeing and being uncertain about what it allows.

This is a useful lesson for me to remember any time I'm tempted to appeal to a POSIX standard for how something should work. POSIX standards are written in specification language, and if they're not completely clear I should be cautious about how correct I am. Probably I should be cautious even if they seem perfectly clear.

(And anyway, the actual behavior of current Unixes matters more than what POSIX says. A POSIX specification is merely a potential lower bound on behavior, especially future behavior. If a Unix does something today and that something is required by POSIX, the odds are good that it will keep doing that in the future.)

PS: My interpretation of 'unspecified' behavior versus 'has no effect' here is that POSIX is saying that it's unspecified what happens if you munmap() legitimate address space that wasn't obtained through your own mmap(). For instance, if you munmap() part of something that you got with malloc(), anything goes as far as POSIX is concerned. It might work and not produce future problems, it might have no effect, it might kill your program immediately, and it might cause your program to malfunction or blow up in the future.

unix/POSIXReadingIsHard written at 22:49:57

2020-05-17

Syndication feeds (RSS) and social media can be complementary

Every so often I read an earnest plea to increase the use of 'RSS', by which the authors mean syndication feeds of all formats (RSS, Atom, and even JSON Feed). Sometimes, as in this appeal (via), it's accompanied by a plea to move away from getting things to read through social media (like Twitter) and aggregators (like lobste.rs). I'm a long-term user and fan of syndication feeds, but while I'm all in favour of more use of them, I feel that abandoning social media and aggregators is swinging the pendulum a bit too far. In practice, I find that social media and aggregators are a complement to my feed reading.

(From now on I'm just going to talk about 'social media' and lump aggregators in with them, so I don't have to type as much.)

The first thing I get through social media is discovering new feeds that I want to subscribe to. There's no really good substitute for this, especially for things that are outside my usual areas of general reading (where I might discover new blogs through cross-links from existing ones I read or through Internet searches). For instance, this excellent blog looking at the history of battle in popular culture was a serendipitous discovery through a link shared on social media.

The second and more important thing I get through social media is surfacing the occasional piece of content that interests me from places that I don't and wouldn't read regularly. If I'm only interested in one out of ten or fifty or a hundred articles in a feed, I'm never going to add it to my feed reader; it simply has too much 'noise' (from my perspective) to even skim regularly. Instead, I rely on some combination of people I follow on normal social media and the views of people expressed through aggregator sites to surface interesting reading. I read quite a lot of articles this way, many more than I would if I stuck only to what I had delivered through feeds I was willing to follow.

(Aggregator sites don't have to involve multiple people; see Ted Unangst's Inks.)

So, for me subscribing to syndication feeds is for things that have a high enough hit rate that I want to read their content regularly, while social media is a way to find some of the hits in a sea of things that I would not read regularly. These roles are complementary. I don't want to rely on social media to tell me about things I'm always going to want to read, and I don't want to pick through a large flood of feed entries to find occasional interesting bits. I suspect that I'm not alone in this pattern.

A corollary of this is that social media is likely good for people with syndication feeds even in a (hypothetical) world with lots of syndication feed usage. Your articles appearing on Twitter and on lobste.rs both draws in new regular readers and shares especially interesting content with people who would at best only read you occasionally.

tech/SyndicationFeedsAndSocialMedia written at 21:42:56

Some views on having your system timezone set to UTC

Some people advocate for setting the system timezone on servers to UTC, for various reasons that you can read about with some Internet searches. I don't have the kind of experience that would give me strong opinions on this in general, but what I do know for sure is that for us, it would be a bad mistake to set the system timezone to UTC instead of our local time of America/Toronto. In fact I think we are a great example of a worst case for using UTC as the system timezone.

We, our servers, and most of our users are located in Toronto. Most of the usage of our systems is driven by Toronto's local time (and the Toronto work week), which means that so is when we want to schedule activities like backups, ZFS snapshots, or daily log rotations. When users report things to us they almost always use local Toronto time (eg, 'I had a connection problem at 10am'), and when they aren't in Toronto they generally don't use UTC for their reports. If we used UTC for our system timezone, almost everything we do would require us to translate between UTC and local time: looking at logs, scheduling activities, investigating problem reports, and so on. Using Toronto's local time means we almost never have to do that.

(And when something happens to our servers because of a power outage, a power surge, an Internet connectivity problem, or whatever, almost all of the reporting and time information on it will be in Toronto local time, not UTC. Almost no one reports things relevant to us in UTC.)

Given all of this, dealing with the twice-yearly DST shift is a small price to pay, and in a sense it is honest to have to deal with it. Our users experience the DST shift, after all, and their usage shifts one hour forwards or backwards relative to UTC.

If we had servers located elsewhere (such as virtual machines in a cloud), we would probably still operate them in Toronto local time. Almost all of the reasons for doing so would still apply, although there might be some problems that would now correlate with the local time of the datacenter where they were located.

The more you diverge from this, the more I suspect that it potentially makes sense to set your system timezone to UTC. The more you have people working around the world, the more your servers are scattered across the globe, and the more usage is continuous instead of driven by one location or continent, the less our issues apply to you and the more UTC is a neutral timezone. Running software in UTC also means it doesn't have to deal with time zone shifts for DST, which means that you don't necessarily have to test how it behaves in the face of DST shifts and then fix the bugs.

(Software sometimes can't avoid dealing with DST shifts at some level in the stack, because it handles or displays times as people perceive them and people definitely perceive DST shifts. But handling time for people is a hard issue in general with no simple solutions.)

sysadmin/ServerUTCTimeViews written at 00:30:59

2020-05-15

Why we use city names when configuring system timezones

One of the things we generally wind up doing when setting up systems is configuring their time zone (except for the people who leave their servers in UTC). On most Unixes, you can find the available time zone names underneath /usr/share/zoneinfo, which has become something of a cross-Unix standard. Once upon a time, it was normal to set these to the name of a time zone, such as 'Eastern', and even today some things will accept names like 'US/Eastern' or 'Canada/Pacific'. However, we normally no longer do that; instead we try to set system time zones to the name of the city that the server (or you) are in, such as 'America/Toronto'. The timezone information then maps this city to the relevant time zone name and details.

The reason for this indirection is that people have an inconvenient habit of changing the rules governing local time zones (either how they work or what time zone a particular place is in). For one example, some areas in North America have decided to not shift between daylight saving time and standard time, effectively putting themselves into a different time zone than they used to be in. It's much easier to automatically apply a rule change that applies to you if you've told the system 'whatever the rules are for this city' than if you've just said 'we're in Eastern time'. If you've done the latter, it's going to be on you to change that to 'Eastern time without ...' or 'Canadian (not USA) Eastern time' or whatever other unusual trick of time zones people have come up with lately.

(This is perhaps more likely to be an issue in North America, where there are two countries and multiple states involved in this, creating plenty of room for many people with clever ideas.)
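As a small illustration of the city-to-rules mapping, here's a sketch in Go (whose time package normally consults the same zoneinfo database on Unix systems). All we say is 'America/Toronto'; the offsets, the 'EST' and 'EDT' abbreviations, and the DST switch all come from the database:

package main

import (
  "fmt"
  "time"
)

func main() {
  // Look up the rules for the city; the zoneinfo database supplies
  // the actual offsets, abbreviations, and DST transitions.
  toronto, err := time.LoadLocation("America/Toronto")
  if err != nil {
    panic(err)
  }

  // The same city name yields different offsets at different times of
  // year, because the current rules for Toronto include DST.
  winter := time.Date(2020, time.January, 15, 12, 0, 0, 0, toronto)
  summer := time.Date(2020, time.July, 15, 12, 0, 0, 0, toronto)
  fmt.Println(winter.Format("2006-01-02 15:04 MST -0700")) // "... EST -0500"
  fmt.Println(summer.Format("2006-01-02 15:04 MST -0700")) // "... EDT -0400"
}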

In fact, I'm not sure whether modern Unixes even really let you pick the generic timezone names any more or if they more or less force you to specify the timezone as a city. There are times when specifying the timezone by city is not necessarily ideal (for example, when you're not in one of the listed cities), but on the whole forcing you to be as specific as possible is a good idea.

(This entry was sparked by reading Time on Unix.)

PS: While some people advocate strongly for always setting servers to UTC, I feel that this is definitely not a one size fits all issue (for reasons outside the scope of this entry). Take it as a given that you've decided not to set your machine to UTC, so you need some way to specify a useful timezone for it.

sysadmin/TimezonesSetByCity written at 22:40:53

2020-05-14

Exploring munmap() on page zero and on unmapped address space

Over in the Fediverse, I ran across an interesting question on munmap():

what does `munmap` on Linux do when address is set to 0? Somehow this succeeds on Linux but fails on FreeBSD. I'm assuming the semantics are different but cannot find any reference regarding to such behavior.

(There's also this additional note, and the short version of the answer is here.)

When I saw this, I was actually surprised that munmap() on Linux succeeded, because I expected it to fail on any address range that wasn't currently mapped in your process and page zero is definitely not mapped on Linux (or anywhere sane). So let's go to the SUS specification for munmap(), where we can read in part:

The munmap() function shall fail if:

[EINVAL]
Addresses in the range [addr,addr+len) are outside the valid range for the address space of a process.

(Similar wording appears in the FreeBSD munmap() manpage.)

When I first read this wording, I assumed that this meant the current address range of the process. This is incorrect in practice on Linux and FreeBSD, and I think in theory as well (since POSIX/SUS talks about 'of a process', not 'of this process'). On both of those Unixes, you can munmap() at least some unused address space, as we can demonstrate with a little test program that mmap()s something, munmap()s it, and then munmap()s it again.

The difference between Linux and FreeBSD is in what they consider to be 'outside the valid range for the address space of a process'. FreeBSD evidently considers page zero (and probably low memory in general) to always be outside this range, and thus munmap() fails. Linux does not; while it doesn't normally let you mmap() memory in that area, for good reasons, it is not intrinsically outside the address space. If I'm reading the Linux kernel code correctly, no low address range is ever considered invalid, only address ranges that cross above the top of user space.

(I took a brief look at the relevant FreeBSD code in vm_mmap.c, and I think that it rejects any munmap() that extends below or above the range of address space that the process currently has mapped. This is actually more restrictive than I expected.)
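As a quick check of the specific page zero case (the test program in the sidebar below only covers double-unmapping), here's a sketch that asks the kernel to unmap one page at address zero. It's written in Go using the raw system call, because Go's own munmap() wrapper effectively only accepts mappings it made itself; based on the behavior described here, it should report success on Linux and an EINVAL failure on FreeBSD and OpenBSD:

package main

import (
  "fmt"
  "syscall"
)

func main() {
  // Ask the kernel directly to unmap one page (4096 bytes on common
  // configurations) starting at address 0. We go straight to the system
  // call because syscall.Munmap() only accepts slices that came from
  // syscall.Mmap(), so it can't be used to probe arbitrary addresses.
  _, _, errno := syscall.Syscall(syscall.SYS_MUNMAP, 0, 4096, 0)
  if errno != 0 {
    fmt.Println("munmap(0, 4096) failed:", errno)
  } else {
    fmt.Println("munmap(0, 4096) succeeded")
  }
}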

In ultimately unsurprising news, OpenBSD takes a somewhat different interpretation, one that's more in line with how I expected munmap() to behave. The OpenBSD munmap() manpage says:

[EINVAL]
The addr and len parameters specify a region that would extend beyond the end of the address space, or some part of the region being unmapped is not part of the currently valid address space.

OpenBSD requires you to only munmap() things that are actually mapped and disallows trying to unmap random sections of your potential address space, even if they fall within the bottom and top of your address space usage (where FreeBSD would allow it). Whether this is completely POSIX compliant is an interesting but irrelevant question, since I doubt the OpenBSD people would change this (and I don't think they should).

One of the interesting things I've learned from looking into this is that Linux, FreeBSD, and OpenBSD each sort of have a different interpretation of what POSIX permits (assuming I'm understanding the FreeBSD kernel code correctly). The Linux interpretation is the most clearly permitted one, since it allows munmap() on anything that might potentially be mappable under some circumstances. OpenBSD, if it cares, would likely say that the 'valid range for the address space of a process' is what it currently has mapped and so their behavior is POSIX/SUS compliant, but this is clearly pushing the interpretation in an unusual direction from a narrow, specification-style reading of the wording (although it is the behavior I expected). FreeBSD sort of splits the difference, possibly for implementation reasons.

PS: The Linux munmap() manpage doesn't even talk about 'the valid address space of a (or the) process' as a reason for munmap() to fail; it only talks abstractly about the kernel not liking addr or len.

Sidebar: The little test program

Here's the test program I used.

#include <sys/mman.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

#define MAPLEN  (128*1024)

int main(int argc, char **argv)
{
  void *mp;

  puts("Starting mmap and double munmap test.");
  mp = mmap(0, MAPLEN, PROT_READ, MAP_ANON|MAP_SHARED, -1, 0);
  if (mp == MAP_FAILED) {
    printf("mmap error: %s\n", strerror(errno));
    return 1;
  }
  if (munmap(mp, MAPLEN) < 0) {
    printf("munmap error on first unmap: %s\n", strerror(errno));
    return 1;
  }
  if (munmap(mp, MAPLEN) < 0) {
    printf("munmap error on second unmap: %s\n", strerror(errno));
    return 1;
  }
  puts("All calls succeeded without errors, can munmap() unmapped areas.");
  return 0;
}

I think that it's theoretically possible for something like this program to fail on FreeBSD, if our mmap() established a new top or bottom of the process's address space. In practice it's likely that we will mmap() into a hole between the bottom of the address space (with the program text) and the top of the address space (probably with the stack).

unix/MunmapPageZero written at 23:46:42

2020-05-13

Getting my head around what things aren't comparable in Go

It started with Dave Cheney's Ensmallening Go binaries by prohibiting comparisons (and earlier tweets I saw about this), which talks about a new trick for making Go binaries smaller by getting the Go compiler to not emit some per-type, internally generated support functions that are used to compare compound types like structs. This is done by deliberately making your struct type incomparable, by including an incomparable field. All of this made me realize that I didn't actually know what things are incomparable in Go.

In the language specification, this is discussed in the section on comparison operators. The specification first runs down a large list of things that are comparable, and how, and then also tells us what was left out:

Slice, map, and function values are not comparable. However, as a special case, a slice, map, or function value may be compared to the predeclared identifier nil. [...]

(This is genuinely helpful. Certain sorts of minimalistic specifications would have left this out, leaving us to cross-reference the total set of types against the list of comparable types to work out what's incomparable.)

It also has an important earlier note about struct values:

  • Struct values are comparable if all their fields are comparable. Two struct values are equal if their corresponding non-blank fields are equal.

Note that this implicitly differentiates between how comparability is determined and how equality is checked. In structs, a blank field may affect whether the struct is comparable at all, but if it is comparable, the field is skipped when actually doing the equality check. This makes sense since one use of blank fields in structs is to create padding and help with alignment, as shown in Struct types.

The next important thing (which is not quite spelled out explicitly in the specification) is that comparability is an abstract idea that's based purely on field types, not on what fields actually exist in memory. Consider the following struct:

type t struct {
  _ [0][]byte
  a int64
}

A blank zero-size array at the start of a struct occupies no memory and in a sense doesn't exist in the actual concrete struct in memory (if placed elsewhere in the struct it may have effects on alignment and total size in current Go, although I haven't looked for what the specification says about that). You could imagine a world where such nonexistent fields didn't affect comparability; all that mattered was whether the actual fields present in memory were comparable. However, Go doesn't behave this way. Although the blank, zero-sized array of slices doesn't exist in any concrete terms, the fact that it's present as a non-comparable field in the struct is enough for Go to declare the entire struct incomparable.
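You don't have to provoke a compile error to see this; the reflect package will report comparability for a type as a whole. Here's a small sketch (the type names are just for illustration):

package main

import (
  "fmt"
  "reflect"
)

type plain struct {
  a int64
}

type withBlank struct {
  _ [0][]byte // zero-sized and blank, but its element type is a slice
  a int64
}

func main() {
  fmt.Println(reflect.TypeOf(plain{}).Comparable())     // true
  fmt.Println(reflect.TypeOf(withBlank{}).Comparable()) // false

  // Writing 'withBlank{} == withBlank{}' is a compile-time error, even
  // though the blank field occupies no memory at all.
}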

As a side note, since you can't take the address of functions, there's no way to manufacture a comparable value when starting from a function. If you have a function field in a struct and you want to see which one of a number of possible implementations a particular instance of the struct is using, you're out of luck. All you can do is compare your function fields against nil to see whether they've been set to some implementation or if you should use some sort of default behavior.

(Since you can compare pointers and you can take the address of slice and map variables, you can manufacture comparable values for them. But it's generally not very useful outside of very special cases.)
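Concretely, the nil check is about the only useful comparison for a function field, which limits you to patterns like this sketch (the type and field names are made up for illustration):

package main

import (
  "fmt"
  "strings"
)

// processor has an optional behavior supplied as a function field.
type processor struct {
  transform func(string) string
}

func (p processor) run(s string) string {
  // The only comparison available is against nil; there's no way to ask
  // whether transform is, say, strings.ToUpper in particular.
  if p.transform == nil {
    return s // default behavior
  }
  return p.transform(s)
}

func main() {
  fmt.Println(processor{}.run("hello"))                           // "hello"
  fmt.Println(processor{transform: strings.ToUpper}.run("hello")) // "HELLO"
}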

programming/GoUncomparableThings written at 23:29:19

The modern HTTPS world has no place for old web servers

When I ran into Firefox's interstitial warning for old TLS versions, it wasn't where I expected, and where it happened gave me some tangled feelings. I had expected to first run into this on some ancient appliance or IPMI web interface (both of which are famous for this sort of thing). Instead, it was on the website of an active person that had been mentioned in a recent comment here on Wandering Thoughts. On the one hand, this is a situation where they could have kept their web server up to date. On the other hand, this demonstrates (and brings home) that the modern HTTPS web actively requires you to keep your web server up to date in a way that the HTTP web didn't. In the era of HTTP, you could have set up a web server in 2000 and it could still be running today, working perfectly well (even if it didn't support the very latest shiny thing). This doesn't work for HTTPS, not today and not in the future.

In practice there are a lot of things that have to be maintained on a HTTPS server. First, you have to renew TLS certificates, or automate it (in practice you've probably had to change how you get TLS certificates several times). Even with automated renewals, Let's Encrypt has changed their protocol once already, deprecating old clients and thus old configurations, and will probably do that again someday. And now you have to keep reasonably up to date with web server software, TLS libraries, and TLS configurations on an ongoing basis, because I doubt that the deprecation of everything before TLS 1.2 will be the last such deprecation.

I can't help but feel that there is something lost with this. The HTTPS web probably won't be a place where you can preserve old web servers, for example, the way the HTTP web is. Today if you have operating hardware you could run a HTTP web server from an old SGI Irix workstation or even a DEC Ultrix machine, and every browser would probably be happy to speak HTTP 1.0 or the like to it, even though the server software probably hasn't been updated since the 1990s. That's not going to be possible on the HTTPS web, no matter how meticulously you maintain old environments.

Another, more relevant side of this is that it's not going to be possible for people with web servers to just let them sit. The more the HTTPS world changes and requires you to change, the more your HTTPS web server requires ongoing work. If you ignore it and skip that work, what happens to your website is the interstitial warning that I experienced, and eventually browsers will stop accepting your site at all. I expect that this is going to drive more people into the arms of large operations (like Github Pages or Cloudflare) that will look after all of that for them, and a little bit more of the indie 'anyone can do this' spirit of the old web will fade away.

(At the same time this is necessary to keep HTTPS secure, and HTTPS itself is necessary for the usual reasons. But let's not pretend that nothing is being lost in this shift.)

web/HTTPSNoOldServers written at 00:25:58

2020-05-11

Why we have several hundred NFS filesystems in our environment

It may strike some people as unusual or extreme that we have 340-odd NFS mounts on our machines, as I mentioned in my entry on how systemd can cause an unmount storm during shutdown. There are several levels of why we have that many NFS mounts on our systems, especially all of the time. First, I'll dispose of a side reason, which is that we don't like conventional automounters (ie, any system that materializes (NFS) mounts 'on demand'). That means that all of our potential NFS filesystems are always mounted, and shifts the question to why we have so many NFS filesystems.

The starting point is that we have on the order of 2600 accounts, six ZFS fileservers, 38 TB of used space, and quite a number of ZFS pools (because of how people get space). This naturally spreads out data across multiple filesystems, and on top of this, our backup system operates on whole filesystems and there's a limit to how large we want one 'backup object' to ever be. Obviously if we limit the maximum size of filesystems, we get more of them. We also encourage people to not put all of their data in their home directory, so people and groups often have separate workdir filesystems that they use as work areas, for shared data, and so on. For various security reasons, our web server also requires people to put all of the data they intend to expose to it on specially designated workdir filesystems, instead of their home directory.

(Among other practical issues, it's much easier to safely expose a portion of a completely separate filesystem to other people or the web than something under your home directory. This extends to any situation where we want or need different NFS export permissions for two things; they have to be in separate filesystems, even if that means a new filesystem for one of them.)

ZFS filesystems are also the primary way we implement space restrictions and space guarantees for people. If you want something to only use X amount of space or always have X amount of space available to it, it has to be in a separate filesystem (and then we set appropriate ZFS properties to guarantee these things). Since both of these are popular with the people who ultimately call the shots on how space is used (because they paid for it), this leads to a certain number of additional ZFS filesystems and thus NFS mounts.

This points to the larger scale reason that we have so many NFS filesystems, which is that filesystems are natural namespaces and having plenty of namespaces is useful. Since ZFS filesystems are basically free, any time people want to separate things it's natural to make a new (ZFS and NFS) filesystem to do so. There's a certain amount of overhead and so we try not to go too far with this (we're unlikely to ever support each user in their own filesystem), but it's quite a good thing to not have to squeeze everything into a limited number of filesystems. Ever since we moved to ZFS, our approach has been that if we're in doubt, we make a new filesystem. Our local tools are built to deal with this (for instance, automatically distributing new accounts across multiple potential home directory filesystems based on their current usage).

sysadmin/ManyNFSFilesystemsWhy written at 21:27:50

2020-05-10

How we guarantee there's always some free space in our ZFS pools

One of the things that we discovered fairly early on in our experience with ZFS (I think within the lifetime of the first-generation Solaris fileservers) is that ZFS gets very unhappy if you let a pool get completely full. The situation has improved since then, but back in those days we couldn't even change ZFS properties, much less remove files as root. Being unable to change properties is a serious issue for us because NFS exports are controlled by ZFS properties, so if we had a full pool we couldn't modify filesystem exports to cut off access from client machines that were constantly filling up the filesystem.

(At one point we resorted to cutting off a machine at the firewall, which is a pretty drastic step. Going this far isn't necessary for machines that we run, but we also NFS export filesystems to machines that other trusted sysadmins run.)

To stop this from happening, we use pool-wide quotas. No matter how much space people have purchased in a pool or even if this is a system pool that we operate, we insist that it always have a minimum safety margin, enforced through a 'quota=' setting on the root of the pool. When people haven't purchased enough to use all of the pool's current allocated capacity, this safety margin is implicitly the space they haven't bought. Otherwise, we have two minimum margins. The explicit minimum margin is that our scripts that manage pool quotas always insist on a 10 MByte safety margin. The implicit minimum margin is that we normally only set pool quotas in full GB, so a pool can be left with several hundred MB of space between its real maximum capacity and the nearest full GB.

All of this pushes the problem back one level, which is determining what the pool's actual capacity is so we can know where this safety margin is. This is relatively straightforward for us because all of our pools use mirrored vdevs, which means that the size reported by 'zpool list' is a true value for the total usable space (people with raidz vdevs are on their own here). However, we must reduce this raw capacity a bit, because ZFS reserves 1/32nd of the pool for its own internal use. We must reserve at least 10 MB over and above this 1/32nd of the pool in order to actually have a safety margin.

(All of this knowledge and math is embodied into a local script, so that we never have to do these calculations by hand or even remember the details.)
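As a rough sketch of that calculation (this is an illustration of the idea, not our actual script; in particular the exact units and rounding are simplified here):

package main

import "fmt"

// maxPoolQuotaGB sketches the quota calculation described above: start
// from the usable capacity of a pool of mirrored vdevs as reported by
// 'zpool list', subtract the 1/32nd that ZFS reserves for itself plus a
// further 10 MByte safety margin, and round down to a whole GB (the
// rounding is what provides the extra implicit margin mentioned above).
func maxPoolQuotaGB(capacityBytes int64) int64 {
  const (
    MB = int64(1000 * 1000)
    GB = 1000 * MB // whether GB here is decimal or binary is glossed over
  )
  usable := capacityBytes - capacityBytes/32 - 10*MB
  if usable < 0 {
    return 0
  }
  return usable / GB
}

func main() {
  // For example, a pool built from four mirrored pairs of 449 GiB each.
  capacity := 4 * 449 * int64(1<<30)
  fmt.Printf("largest quota we would set: %d GB\n", maxPoolQuotaGB(capacity))
}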

PS: These days in theory you can change ZFS properties and even remove files when your pool is what ZFS will report as 100% full. But you need to be sure that you really are freeing up space when you do this, not using more because of things like snapshots. Very bad things happen to your pool if it gets genuinely full right up to ZFS's internal redline (which is past what ZFS will normally let you reach unless you trick it); you will probably have to back it up, destroy it, and recreate it to fully recover.

(This entry was sparked by a question from a commentator on yesterday's entry on how big our fileserver environment is.)

solaris/ZFSGuaranteeFreeSpace written at 22:51:20

2020-05-09

How big our fileserver environment is (as of May 2020)

A decade ago I wrote some entries about how big our fileserver environment was back then (part 1 and part 2). We're now on our third generation fileservers and for various reasons it's a good time to revisit this and talk about how big our environment has become today.

Right now we have six active production fileservers, with a seventh waiting to go into production when we need the disk space. Each fileserver has sixteen 2 TB SSDs for user data, which are divided into four fixed size chunks and then used to form mirrored pairs for ZFS vdevs. Since we always need to be able to keep one disk's worth of chunks free to replace a dead disk, the maximum usable space on any given fileserver is 30 pairs of chunks. After converting from disk vendor decimal TB to powers of two TiB and ZFS overheads, each pair of chunks gives us about 449 GiB of usable space, which means that the total space that can be assigned and allocated on any given fileserver is a bit over 13 TiB. No fileserver currently has all of that space allocated, much less purchased by people. Fileservers range from a low of 1 remaining allocatable pair of chunks to a high of 7 such pairs (to be specific, right now it goes 1, 2, 5, 6, 6, 7 across the six active production fileservers, so we've used 153 pairs of chunks out of the total possible of 180).
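To make the chunk arithmetic concrete, here's the whole calculation in one place (a sketch that just re-derives the numbers above):

package main

import "fmt"

func main() {
  const (
    GiB = int64(1) << 30
    TiB = int64(1) << 40
  )
  // Per-fileserver numbers from above: 16 data SSDs at 4 chunks each,
  // one disk's worth of chunks (4) kept free for replacing a dead disk,
  // and the rest paired into mirrored vdevs at about 449 GiB per pair.
  chunks := 16 * 4
  spareChunks := 4
  pairs := (chunks - spareChunks) / 2 // 30 usable pairs
  perPair := 449 * GiB

  total := float64(int64(pairs)*perPair) / float64(TiB)
  fmt.Printf("%d usable pairs, about %.1f TiB assignable per fileserver\n", pairs, total)
  fmt.Printf("six fileservers: %d pairs in total, %d pairs currently used\n",
    6*pairs, 6*pairs-(1+2+5+6+6+7))
}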

We don't put all of this space into a single ZFS pool on each fileserver for all sorts of reasons, including that we sell space to people and groups, which makes it natural to split up space into different ZFS pools based on who bought it. Right now we have sold and allocated 58.8 TiB of space in ZFS pools (out of 67 TiB of allocated chunks, so this is somewhat less dense than I expected, but not terrible). In total we have 40 ZFS pools; the largest pool is 5.6 TiB of sold space and the smallest pool is 232 GB. Exactly half the pools (20 out of 40) have 1 TiB or more of allocated ZFS space.

Not all of this allocated ZFS space is actually used, thankfully (just like every other filesystem, ZFS doesn't like you when you keep your disks full). Currently people have actually used 38 TiB of space across all of the filesystems in all of those pools. The largest amount of space used in a single pool is 4 TiB, and the largest amount of space used in a single filesystem is 873 GiB. Our /var/mail is the third largest filesystem in used space, at 600 GiB.

So in summary we've allocated about 85% of our disk chunks, sold about 87% of the space from the allocated chunks, and people are using about 64% of the space that they've purchased. The end to end number is that people are using about 48% of the space that we could theoretically allocate and sell. However, we have less room for growth on existing fileservers than you might think from this raw number, because the people who tend to buy lots of space are also the people who tend to use it. One consequence of this is that the fileserver with 1 free pair of chunks left is also one with two quite active large pools.

We have 305 meaningful ZFS filesystems across all of those ZFS pools (in addition to some 'container' filesystems that just exist for ZFS reasons and aren't NFS exported). The number of ZFS filesystems in each pool is somewhat correlated with the pool's size, but not entirely; multiple ZFS filesystems get created and used for all sorts of reasons. The most populated ZFS pool has 23 filesystems, while there are three pools with only one filesystem in them (those are a somewhat complicated story).

(We have more NFS mounts than ZFS filesystems for various reasons beyond the scope of this entry.)

sysadmin/OurFileserverScale-2020-05 written at 23:41:37


