Some additional information on ZFS performance as you approach quota limits
@alanjude: re: <my entry> - Basically, when you are close to the quota limit, ZFS will rate-limit incoming writes as it has to be sure you won't go over your quota. You end up having to wait for the pending transactions to flush to find out how much room you have left.
I was turned on to the issue by @garrett_wollman who uses quotas at a large institution similar to yours. I expect you won't see the worst of it until you are within 100s of MB of the quota. So it isn't being over 95% or something, so much as being 'a few transactions' from full
@garrett_wollman: Turning off compression when the dataset gets near-full clears the backlog (obviously at a cost), as does increasing the quota if you have the free space for it.
@thatcks: Oh interesting! We have compression off on most of our datasets; does that significantly reduce the issue (although presumably not completely eliminate it)?
(Sadly we have people who (sometimes) run pools and filesystems that close to their quota limits.)
@garrett_wollman: I don't know; all I can say is that turning compression off on a wedged NFS server clears the backlog so requests for other datasets are able to be serviced.
All of this makes a bunch of sense, given the complexity of enforcing filesystem size limits, and it especially makes sense that compression might cause issues here; any sort of compression creates a very uncertain difference between the nominal size and the actual on-disk size, and ZFS quotas are applied to the physical space used, not the logical space.
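As an illustration, the difference between the two sizes is visible through standard ZFS properties (the dataset name here is hypothetical):

```shell
# 'used' is physical (post-compression) space, which is what the quota
# is checked against; 'logicalused' is the nominal, uncompressed size.
zfs get -o name,property,value used,logicalused,compressratio,quota tank/data
```

With compression on, 'logicalused' can be much larger than 'used', which is exactly the uncertain gap between nominal and on-disk size.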
(I took a quick look in the ZFS on Linux source code but I couldn't spot anything that was obviously different when there was a lot of quota room left.)
ZFS performance really does degrade as you approach quota limits
Every so often (currently monthly), there is an "OpenZFS leadership meeting". What this really means is 'lead developers from the various ZFS implementations get together to talk about things'. Announcements and meeting notes from these meetings get sent out to various mailing lists, including the ZFS on Linux ones. In the September meeting notes, I read a very interesting (to me) agenda item:
- Relax quota semantics for improved performance (Allan Jude)
- Problem: As you approach quotas, ZFS performance degrades.
- Proposal: Can we have a property like quota-policy=strict or loose, where we can optionally allow ZFS to run over the quota as long as performance is not decreased.
This is very interesting to me for two reasons. First, in the past we have definitely seen significant problems on our OmniOS machines, both when an entire pool hits a quota limit and when a single filesystem hits a refquota limit. It's nice to know that this wasn't just our imagination and that there is a real issue here. Even better, it might someday be improved (and perhaps in a way that we can use at least some of the time).
Second, any number of people here run very close to and sometimes at the quota limits of both filesystems and pools, fundamentally because people aren't willing to buy more space. We have in the past assumed that this was relatively harmless and would only make people run out of space. If this is a known issue that causes serious performance degradation, well, I don't know if there's anything we can do, but at least we're going to have to think about it and maybe push harder at people. The first step will have to be learning the details of what's going on at the ZFS level to cause the slowdown.
(It's apparently similar to what happens when the pool is almost full, but I don't know the specifics of that either.)
With that said, we don't seem to have seen clear adverse effects on our Linux fileservers, and they've definitely run into quota limits (repeatedly). One possible reason for this is that having lots of RAM and SSDs makes the effects mostly go away. Another possible reason is that we haven't been looking closely enough to see that we're experiencing global slowdowns that correlate to filesystems hitting quota limits. We've had issues before with somewhat subtle slowdowns that we didn't understand (cf), so I can't discount that we're having it happen again.
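If we do want to start looking, one low-tech sketch (my own construction, with an arbitrary threshold) is to parse 'zfs list' output and flag datasets that are within a few hundred MBytes of their quota:

```shell
# -Hp gives script-friendly, tab-separated output with exact byte
# counts; an unset quota shows up as 0 (or '-'), which awk treats as 0.
zfs list -Hp -o name,used,quota |
  awk -F'\t' '$3 > 0 && $3 - $2 < 500 * 1024 * 1024 { print $1 }'
```

The 500 MB threshold is a guess based on the discussion above, where trouble reportedly starts within hundreds of MB of the limit.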
ZFS is not a universal filesystem that is always good for all workloads
Every so often, people show up on various ZFS mailing lists with problems where ZFS is performing not just a bit worse than other filesystems or the raw disks, but a lot worse. Often, although not always, these people are using raidz on hard disks and trying to do random IO, which doesn't work very well because of various ZFS decisions. When this happens, whatever their configuration and workload, the people who are trying out ZFS are surprised, and this surprise is reasonable. Most filesystems today are generally good and also generally have relatively flat performance characteristics, where you can't make them really bad unless you have very unusual and demanding workloads.
Unfortunately, ZFS is not like this today. For all that I like it a lot, I have to accept the reality that ZFS is not a universal filesystem that works fine in all reasonable configurations and under all reasonable workloads. ZFS usually works great for many real world workloads (ours included), but there are perfectly reasonable setups where it will fall down, especially if you're using hard drives instead of SSDs. Raidz is merely an unusually catastrophic case (and an unusually common one, partly because no one expects RAID-5/6 to have that kind of drawback).
(Many of the issues that cause ZFS problems are baked into its fundamental design, but as storage gets faster and faster their effects are likely to diminish a lot for most systems. There is a difference between 10,000 IOPS and 100,000 IOPS, but it may not matter as much as the difference between 100 a second and 1,000. And not all of the issues are about performance; there is also, for example, the fact that there's no great solution to shrinking a ZFS pool. In some environments that will matter a lot.)
People sometimes agonize about this and devote a lot of effort to pushing water uphill. It's a natural reaction, especially among fans of ZFS (which includes me), but I've come to think that it's better to quickly identify situations where ZFS is not a good fit and recommend that people move to another filesystem and storage system. Sometimes we can make ZFS fit better with some tuning, but I'm not convinced that even that is a good idea; tuning is often fragile, partly because it's often relatively specific to your current workload. Sometimes the advantages of ZFS are worth going through the hassle and risk of tuning things like ZFS's recordsize, but not always.
(Having to tune has all sorts of operational impacts, especially since some things can only be tuned on a per-filesystem or even per-pool basis.)
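As a concrete illustration of what that tuning looks like in practice (the dataset name and value here are hypothetical, not a recommendation):

```shell
# recordsize is set per-filesystem and only affects newly written
# data, which is one reason this sort of tuning is fragile.
zfs set recordsize=16K tank/db
zfs get recordsize,compression tank/db
```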
PS: The obvious question is what ZFS is and isn't good for, and that I don't have nice convenient answers for. I know some pain points, such as raidz on HDs with random IO and the lack of shrinking, and others you can spot by looking for 'you should tune ZFS if you're doing <X>' advice, but that's not a complete set. And of course some of the issues today are simply problems with current implementations and will get better over time. Anything involving memory usage is probably one of them, for obvious reasons.
What happens in ZFS when you have 4K sector disks in an ashift=9 vdev
Suppose, not entirely hypothetically, that you've somehow wound up with some 4K 'advanced format' disks (disks with a 4 KByte physical sector size but 512 byte emulated (aka logical) sectors) in a ZFS pool (or vdev) that has an ashift of 9 and thus expects disks with a 512 byte sector size. If you import or otherwise bring up the pool, you get slightly different results depending on the ZFS implementation.
In ZFS on Linux, you'll get one ZFS Event Daemon (ZED) event for each disk, with a class of vdev.bad_ashift. I don't believe this event carries any extra information about the mismatch; it's up to you to use the information on the specific disk and the vdev in the event to figure out who has what ashift values. In the current Illumos source, it looks like you get a somewhat more straightforward message, although I'm not sure how it trickles out to user level. At the kernel level it says:
Disk, '<whatever>', has a block alignment that is larger than the pool's alignment.
This error is not completely correct, since it's the vdev ashift that matters here, not the pool ashift, and it also doesn't tell you what the vdev ashift or the device ashift are; you're once again left to look those up yourself.
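If you have to do that lookup yourself, something like the following works on ZFS on Linux (the pool and device names here are hypothetical):

```shell
# The ashift values ZFS is actually using, per top-level vdev:
zdb -C tank | grep ashift
# What a disk reports (PHY-SEC is the physical sector size,
# LOG-SEC the logical/emulated one):
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdb
```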
(I was going to say that the only likely case is a 4K advanced format disk in an ashift=9 vdev, but these days you might find some SSDs or NVMe drives that advertise a physical sector size larger than 4 KBytes.)
This is explicitly a warning, not an error. Both the ZFS on Linux and Illumos code have a comment to this effect (differing only in 'post an event' versus 'issue a warning'):
/*
 * Detect if the alignment requirement has increased.
 * We don't want to make the pool unavailable, just
 * post an event instead.
 */
This is a warning despite the fact that your disks can accept IO for 512-byte sectors, because what ZFS cares about (for various reasons) is the physical sector size, not the logical one. A vdev with ashift=9 really wants to be used on disks with real 512-byte physical sectors, not on disks that just emulate them.
(In a world of SSDs and NVMe drives that have relatively opaque and complex internal sizes, this is rather less of an issue than it is (or was) with spinning rust. Your SSD is probably lying to you no matter what nominal physical sector size it advertises.)
The good news is that as far as I can tell, this warning has no further direct effect on pool operation. At least in ZFS on Linux, the actual disk's ashift is only looked up in one place, when the disk is opened as part of a vdev, and the general 'open a vdev' code discards it after this warning; it doesn't get saved anywhere for later use. So I believe that ZFS IO, space allocations, and even uberblock writes will continue as before.
That ZFS continues operating after this warning doesn't mean that life is great, at least if you're using HDs. Since no ZFS behavior changes here, using disks with 4K physical sectors in an ashift=9 vdev will likely leave your disk (or disks) doing a lot of read/modify/write operations when ZFS does unaligned writes (as it can often do). This both performs relatively badly and leaves you potentially exposed to damage to unrelated data if there's a power loss part way through.
(But, as before, it's a lot better than not being able to replace old dying disks with new working ones. You just don't want to wind up in this situation if you have a choice, which is a good part of why I advocate for creating basically all pools as 'ashift=12' from the start.)
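Setting this at creation time is a one-liner (the pool and device names are hypothetical); the important part is that ashift is fixed per vdev when the vdev is created and can't be changed afterward:

```shell
# Force 4 KByte alignment regardless of what the disks advertise.
zpool create -o ashift=12 tank mirror /dev/sdb /dev/sdc
```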
PS: ZFS events are sort of documented in the zfs-events manpage, but the current description of vdev.bad_ashift is not really helpful. Also, I wish that the ZFS on Linux project itself had the current manpages online (well, apart from as manpage source in the Github repo, since most people find manpages in their raw form to be not easy to read).
Some things on the GUID checksum in ZFS pool uberblocks
When I talked about how 'zpool import' generates its view of a pool's configuration, I mentioned that an additional kernel check of the pool configuration is that ZFS uberblocks have a simple 'checksum' of all of the GUIDs of the vdev tree. When the kernel is considering a pool configuration, it rejects it if the sum of the GUIDs in the vdev tree doesn't match the GUID sum from the uberblock.
(The documentation of the disk format claims that it's only the checksum of the leaf vdevs, but as far as I can see from the code it's all vdevs.)
I was all set to write about how this interacts with the vdev
configurations that are in ZFS labels, but
as it turns out this is no longer applicable. In versions of ZFS
that have better ZFS pool recovery,
the vdev tree that's used is the one that's read from the pool's
Meta Object Set (MOS), not the pool configuration that was passed
in from user level by '
zpool import'. Any mismatch between the
uberblock GUID sum and the vdev tree GUID sum likely indicates a
serious consistency problem somewhere.
(For the user level vdev tree, the difference between having a vdev's configuration and having all of its disks available is potentially important. As we saw yesterday, the ZFS label of every device that's part of a vdev has a complete copy of that vdev's configuration, including all of the GUIDs of its elements. Given a single intact ZFS label for a vdev, you can construct a configuration with all of the GUIDs filled in and thus pass the uberblock GUID sum validation, even if you don't have enough disks to actually use the vdev.)
The ZFS uberblock update sequence guarantees that the ZFS disk labels and their embedded vdev configurations should always be up to date with the current uberblock's GUID sum. Now that I know about the embedded uberblock GUID sum, it's pretty clear why the uberblock must be synced on all vdevs when the vdev or pool configuration is considered 'dirty'. The moment that the GUID sum of the current vdev tree changes, you'd better update everything to match it.
(The GUID sum changes if any rearrangement of the vdev tree happens. This includes replacing one disk with another, since each disk has a unique GUID. In case you're curious, the ZFS disk label always has the full tree for a top level vdev, including the special 'replacing' and 'spare' sub-vdevs that show up during these operations.)
PS: My guess from a not very extensive look through the kernel code is that it's very hard to tell from user level if you have a genuine uberblock GUID sum mismatch or another problem that returns the same extended error code to user level. The good news is that I think the only other case that returns it is if you have missing log device(s).
How 'zpool import' generates its view of a pool's configuration
Full bore ZFS pool import happens in two stages: 'zpool import' puts together a vdev configuration for the pool, passes it to the kernel, and then the kernel reads the real pool configuration from ZFS objects in the pool's Meta Object Set. How 'zpool import' does this is outlined at a high level by a comment in the code; to summarize the comment, the configuration is created by assembling and merging together information from the ZFS label of each device.
There is an important limitation to this process, which is that the
ZFS label only contains information on the vdev configuration, not
on the overall pool configuration.
To show you what I mean, here are the relevant portions of a ZFS label (as dumped by 'zdb -l') for a device from one of our pools:

    txg: 5059313
    pool_guid: 756813639445667425
    top_guid: 4603657949260704837
    guid: 13307730581331167197
    vdev_children: 5
    vdev_tree:
        type: 'mirror'
        id: 3
        guid: 4603657949260704837
        is_log: 0
        children[0]:
            type: 'disk'
            id: 0
            guid: 7328257775812323847
            path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
        children[1]:
            type: 'disk'
            id: 1
            guid: 13307730581331167197
            path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'
(For much more details that are somewhat out of date, see the ZFS On-Disk Specifications [pdf].)
Based on this label, 'zpool import' knows what the GUID of this vdev is, which disk of the vdev it's dealing with and where the other disk or disks in it are supposed to be found, the pool's GUID, how many vdevs the pool has in total (it has 5), and which specific vdev this is (it's the fourth of five; vdev numbering starts from 0). But it doesn't know anything about the other vdevs, except that they exist (or should exist).
When zpool assembles the pool configuration, it will use the best
information it has for each vdev, where the 'best' is taken to be
the vdev label with the highest
txg (transaction group number).
The label with the highest txg for the entire pool is used to
determine how many vdevs the pool is supposed to have. Note that
there's no check that the best label for a particular vdev has a
txg that is anywhere near the pool's (assumed) current txg. This
means that if all of the modern devices for a particular vdev
disappear and a very old device for it reappears, it's possible for
zpool to assemble a (user-level) configuration that claims that the
old device is that vdev (or the only component available for that
vdev, which might be enough if the vdev is a mirror).
If zpool can't find any labels for a particular vdev, all it can
do in the configuration is fill in an artificial 'there is a vdev
missing' marker; it doesn't even know whether it was a raidz or a
mirrored vdev, or how much data is on it. When 'zpool import'
prints the resulting configuration, it doesn't explicitly show these
missing vdevs; if I'm reading the code right, your only clue as to
where they are is that the pool configuration will abruptly skip
from, eg, 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.
There's an additional requirement for a working pool configuration,
although it's only checked by the kernel, not zpool. The pool
uberblocks have a
ub_guid_sum field, which must match the sum
of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll
get one of those frustrating 'a device is missing somewhere' errors
on pool import. An entirely missing vdev naturally forces this to
happen, since all of its GUIDs are unknown and obviously not
contributing what they should be to this sum. I don't know how this
interacts with better ZFS pool recovery.
ZFS pool imports happen in two stages of pool configuration processing
The mechanics of how ZFS pools are imported is one of the more obscure areas of ZFS, which is a potential problem given that things can go very wrong (often with quite unhelpful errors). One special thing about ZFS pool importing is that it effectively happens in two stages, first with user-level processing and then again in the kernel, and these two stages use two potentially different pool configurations. My primary source for this is the discussion from Illumos issue #9075:
[...] One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. [...]
Here's my translation of that. In order to tell the kernel to load
a pool, '
zpool import' has to come up with a vdev configuration
for the pool and then provide it to the kernel. However, this is
not the real pool configuration; the real pool configuration is
stored in the pool itself (in regular ZFS objects that are part of
the MOS), where the kernel reads it again as the kernel imports the pool.
Although not mentioned explicitly, the pool configuration that
zpool import' comes up with and passes to the kernel is not read
from the canonical pool configuration, because reading those ZFS
objects from the MOS requires a relatively full implementation of
ZFS, which '
zpool import' does not have (the kernel obviously
does). One source of the pool configuration for 'zpool import'
is the ZFS cache file,
/etc/zfs/zpool.cache, which theoretically
contains current pool configurations for all active pools. How
'zpool import' generates a pool configuration for exported or
deleted pools is sufficiently complicated to need an entry of its own.
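You can inspect what's in the cache file with zdb, which reads it directly rather than going through the kernel:

```shell
# Dump the pool configurations from the default cache file:
zdb -C
# Or point zdb at an explicit cache file location:
zdb -U /etc/zfs/zpool.cache -C
```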
This two stage process means that there are at least two different
things that can go wrong with a ZFS pool's configuration information.
First, 'zpool import' may not be able to put together what it
thinks is a valid pool configuration, in which case I believe that
it doesn't even try to pass it to the kernel. Second, the kernel
may dislike the configuration that it's handed for its own reasons.
In older versions of ZFS (before better ZFS pool recovery landed), any mismatch between the actual pool
configuration and the claimed configuration from user level was
apparently fatal; now, only some problems are fatal.
As far as I know, '
zpool import' doesn't clearly distinguish
between these two cases in its error messages when you're actually
trying to import a pool. If you're just running it to see what pools
are available, I believe that all of what '
zpool import' reports
comes purely from its own limited and potentially imperfect
configuration assembly, with no kernel involvement.
(When a pool is fully healthy and in good shape, the configuration
'zpool import' puts together at the user level will completely
match the real configuration in the MOS. When it's not is when you
run into potential problems.)
Our last OmniOS fileserver is now out of production (and service)
On Twitter, I noted a milestone last evening:
This evening we took our last OmniOS fileserver out of production and powered it off (after a great deal of slow work; all told this took more than a year). They've had a good run, so thank you Illumos/OmniOS/OmniTI/etc for the generally quiet and reliable service.
We still haven't turned any of our iSCSI backends off (they're Linux, not OmniOS), but that will be next, probably Friday (the delay is just in case). Then we'll get around to recycling all of the hardware for some new use, whatever it will turn out to be.
When we blank out the OmniOS system disks as part of recycling the hardware, that really will be the end of the line for the whole second generation of our fileserver infrastructure and the last lingering traces of our long association with Sun will be gone, swallowed by time.
It's been pointed out to me by @oclsc that since we're still using ZFS (now ZFS on Linux), we still have a tie to Sun's lineage. It doesn't really feel the same, though; open source ZFS is sort of a lifeboat pushed out of Sun toward the end, not Sun(ish) itself.
(This is probably about as fast as I should have expected from having almost all of the OmniOS fileservers out of production at the end of May. Things always come up.)
Various people and groups at the department have been buying Sun machines and running Sun OSes (first SunOS and then Solaris) almost from the beginning of Sun. I don't know if we bought any Sun 1s, but I do know that some Sun 2s were bought, and Sun 3s and onward were for many years a big presence (eventually only as servers, although we did have some Sun Rays). With OmniOS going out of service, that is the end of our use of that lineage of Unix.
(Of course Sun itself has been gone for some time, consumed by Oracle. But our use of its lineage lived on in OmniOS, since Illumos is more or less Solaris in open source form (and improved from when it was abandoned by its corporate parent).)
I have mixed feelings about OmniOS and I don't have much sentimentality about Solaris itself (it's complicated). But I still end up feeling that there is a weight of history that has shifted here in the department, at the end of a long slow process. Sun is woven through the history of the department's computing, and now all that remains of that is our use of ZFS.
(For all that I continue to think that ZFS is your realistic choice for an advanced filesystem, I also think that we probably wouldn't have wound up using it if we hadn't started with Solaris.)
A hazard of our old version of OmniOS: sometimes powering off doesn't
Two weeks ago, I powered down all of our OmniOS fileservers that
are now out of production, which is
most of them. By that, I mean that I logged in to each of them via
SSH and ran '
poweroff'. The machines disappeared from the network
and I thought nothing more of it.
This Sunday morning we had a brief power failure. In the aftermath of the power failure, three out of four of the OmniOS fileservers reappeared on the network, which we knew mostly because they sent us some email (there were no bad effects of them coming back). When I noticed them back, I assumed that this had happened because we'd set their BIOSes to 'always power on after a power failure'. This is not too crazy a setting for a production server you want up at all costs because it's a central fileserver, but it's obviously no longer the setting you want once they go out of production.
Today, I logged in to the three that had come back, ran 'poweroff'
on them again, and then later went down to the machine room to pull
out their power cords. To my surprise, when I looked at the physical
machines, they had little green power lights that claimed they were
powered on. When I plugged in a roving display and keyboard to check
their state, I discovered that all three were still powered on and
sitting displaying an OmniOS console message to the effect that they
were powering off. Well, they might have been trying to power off,
but they weren't achieving it.
I rather suspect that this is what happened two weeks ago, and why
these machines all sprang back to life after the power failure. If
OmniOS never actually powered the machines off, even a BIOS setting
of 'resume last power state after a power failure' would have powered
the machines on again, which would have booted OmniOS back up again.
Two weeks ago, I didn't go look at the physical servers or check
their power state through their lights out management interface;
it never occurred to me that '
poweroff' on OmniOS sometimes might
not actually power the machine off, especially when the machines
did drop off the network.
(One out of the four OmniOS servers didn't spring back to life after the power failure, and was powered off when I looked at the hardware. Perhaps its BIOS was set very differently, or perhaps OmniOS managed to actually power it off. They're all the same hardware and the same OmniOS version, but the server that probably managed to power off had no active ZFS pools on our iSCSI backends; the other three did.)
At this point, this is only a curiosity. If all goes well, the last OmniOS fileserver will go out of production tomorrow evening. It's being turned off as part of that, which means that I'm going to have to check that it actually powered off (and I'd better add that to the checklist I've written up).
Almost all of our OmniOS machines are now out of production
Last Friday, my co-workers migrated the last filesystem from our HD-based OmniOS fileservers to one of our new Linux fileservers. With this, the only OmniOS fileserver left in production is serving a single filesystem, our central administrative filesystem, which is extremely involved to move because everything uses it all the time and knows where it is (and of course it's where our NFS automounter replacement lives, along with its data files). Moving that filesystem is going to take a bunch of planning and a significant downtime, and it will only happen after I come back from vacation.
(Unlike last time around, we haven't destroyed any pools or filesystems yet in the old world, since we didn't run into any need to.)
This migration has been in process in fits and starts since late last November, so it's taken about seven months to finish. This isn't because we have a lot of data to move (comparatively speaking); instead it's because we have a lot of filesystems with a lot of users. First you have to schedule a time for each filesystem that the users don't object to (and sometimes things come up so your scheduled time has to be abandoned), and then moving each filesystem takes a certain amount of time and boring work (so often people only want to do so many a day, so they aren't spending all of their day on this stuff). Also, our backup system is happier when we don't suddenly give it massive amounts of 'new' data to back up in a single day.
(I think this is roughly comparable to our last migration, which seems to have started at the end of August of 2014 and finished in mid-February of 2015. We've added significantly more filesystems and disk space since then.)
The MVP of the migration is clearly
'zfs send | zfs recv' (as it always has been). Having to do the
migrations with something like
rsync would likely have been much
more painful for various reasons; ZFS snapshots and ZFS send are
things that just work, and they come with solid and extremely
reassuring guarantees. Part of their importance was that the speed
of an incremental ZFS send meant that the user-visible portion of
a migration (where we had to take their filesystem away temporarily)
could be quite short (short enough to enable opportunistic migrations,
if we could see that no one was using some of the filesystems).
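The basic shape of one of these migrations, sketched with hypothetical pool, filesystem, and host names:

```shell
# Initial full copy, done while the filesystem stays in service:
zfs snapshot tank/fs@migrate-1
zfs send tank/fs@migrate-1 | ssh newfs zfs recv -F newtank/fs

# Later, during the short user-visible window: take the filesystem
# away from users, then send only what changed since the full copy.
zfs snapshot tank/fs@migrate-2
zfs send -i tank/fs@migrate-1 tank/fs@migrate-2 | ssh newfs zfs recv newtank/fs
```

It's the incremental send in the second step that keeps the user-visible downtime short.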
At this point we've gotten somewhere around four and a half years of lifetime out of our OmniOS fileservers. This is probably around what we wanted to get, especially since we never replaced the original hard drives and so they're starting to fall out of warranty coverage and hit what we consider their comfortable end of service life. Our first generation Solaris fileservers were stretched much longer, but they had two generations of HDs and even then we were pushing it toward the end of their service life.
(The actual server hardware for both the OmniOS fileservers and the Linux iSCSI backends seems fine, so we expect to reuse it in the future once we migrate the last filesystem and then tear down the entire old environment. We will probably even reuse the data HDs, but only for less important things.)
I think I feel less emotional about this migration away from OmniOS than I did about our earlier migration from Solaris to OmniOS. Moving away from Solaris marked the end of Sun's era here (even if Sun had been consumed by Oracle by that point), but I don't have that sort of feelings about OmniOS. OmniOS was always a tool to me, although unquestionably a useful one.
(I'll write a retrospective on our OmniOS fileservers at some point, probably once the final filesystem has migrated and everything has been shut down for good. I want to have some distance and some more experience with our Linux fileservers first.)
PS: To give praise where it belongs, my co-workers did basically all of the hard, grinding work of this migration, for various reasons. Once things got rolling, I got to mostly sit back and move filesystems when they told me one was scheduled and I should do it. I also cleverly went on vacation during the final push at the end.