Wandering Thoughts


Some consumer SSDs are moving to a 4k 'advanced format' physical block size

Earlier this month I wrote an entry about consumer SSD nominal physical block sizes, because I'd noticed that almost all of the recent SSDs we had advertised a 512 byte physical block size (the exceptions were Intel 'DC' SSDs). In that entry, I speculated that consumer SSD vendors might have settled on just advertising them as 512n devices and we'd see this on future SSDs too, since the advertised 'physical block size' on SSDs is relatively arbitrary anyways.

Every so often I write a blog entry that becomes, well, let us phrase it as 'overtaken by events'. Such is the case with that entry. Here, let me show you:

$ lsblk -o NAME,TRAN,MODEL,PHY-SEC --nodeps /dev/sdf /dev/sdg
sdf  sas    Crucial_CT2050MX     512
sdg  sas    CT2000MX500SSD1     4096

The first drive is a 512n 2 TB Crucial MX300. We bought a number of them in the fall for a project, but then Crucial took them out of production in favour of the new Crucial MX500 series. The second drive is a 2 TB Crucial MX500 from a set of them that we just started buying to fill out our drive needs for the project. Unlike the MX300s, this MX500 advertises a 4096 byte physical block size and therefore demonstrates quite vividly that the thesis of my earlier entry is very false.

(I have some 750 GB Crucial MX300s and they also advertise 512n physical block sizes, which led to a ZFS pool setup mistake. Fixing this mistake is now clearly pretty important, since if one of my MX300s dies I will probably have to replace it with an MX500.)

My thesis isn't just false because different vendors have made different decisions; this example is stronger than that. These are both drives from Crucial, and successive models at that; Crucial is replacing the MX300 series with the MX500 series in the same consumer market segment. So I already have a case where a vendor has changed the reported physical block size in what is essentially the same thing. It seems very likely that Crucial doesn't see the advertised physical block size as a big issue; I suspect that it's primarily set based on whatever the flash controller being used works best with or finds most convenient.

(By today, very little host software probably cares about 512n versus 4k drives. Advanced format drives have been around long enough that most things are probably aligning to 4k and issuing 4k IOs by default. ZFS is an unusual and somewhat unfortunate exception.)

I had been hoping that we could assume 512n SSDs were here to stay because it would make various things more convenient in a ZFS world. That is now demonstrably wrong, which means that once again forcing all ZFS pools to be compatible with 4k physical block size drives is very important if you ever expect to replace drives (and you should, as SSDs can die too).
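(As a rough illustration of what 'compatible with 4k physical block size drives' means in practice, here is a minimal Python sketch. It assumes a Linux system that exposes each disk's reported physical block size under /sys/block/<dev>/queue/physical_block_size. The function names and the sysfs_root parameter are made up for this example.)

```python
from pathlib import Path


def physical_block_size(dev, sysfs_root="/sys/block"):
    """Return the physical block size (in bytes) that the kernel reports for a disk."""
    path = Path(sysfs_root) / dev / "queue" / "physical_block_size"
    return int(path.read_text().strip())


def min_ashift(block_sizes):
    """Smallest ashift such that 2**ashift covers every given block size.

    ashift=9 means 512-byte units and ashift=12 means 4096-byte units.
    We never go below 9, the smallest value discussed here.
    """
    largest = max(block_sizes)
    return max(9, (largest - 1).bit_length())
```

With one 512n drive and one 4k drive in the mix, min_ashift([512, 4096]) comes out as 12, which is the value you would want the pool created with if either kind of drive might ever be used as a replacement.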

PS: It's possible that not all MX500s advertise a 4k physical block size; it might depend on capacity. We only have one size of MX500s right now so I can't tell.

tech/SSDsAnd4KSectorsII written at 00:34:40


Memories of MGR

I recently got into a discussion of MGR on Twitter (via), which definitely brings back memories. MGR is an early Unix windowing system, originally dating from 1987 to 1989 (depending on whether you go by the Usenix presentation, when people got to hear about it, or the comp.sources.unix posting, when people could get their hands on it). If you know the dates for Unix windowing systems you know that this overlaps with X (both X10 and then X11), which is part of what makes MGR special and nostalgic and what gave it its peculiar appeal at the time.

MGR was small and straightforward at a time when that was not what other Unix window systems were (I'd say it was slipping away with X10 and X11, but let's be honest, Sunview was not small or straightforward either). Given that it was partially inspired by the Blit and had a certain amount of resemblance to it, MGR was also about as close as most people could come to the kind of graphical environment that the Bell Labs people were building in Research Unix.

(You could in theory get a DMD 5620, but in reality most people had far more access to Unix workstations that you could run MGR on than they did to a 5620.)

On a practical level, you could use MGR without having to set up a complicated environment with a lot of moving parts (or compile a big system). This generally made it easy to experiment with (on hardware it supported) and to keep it around as an alternative for people to try out or even use seriously. My impression is that this got a lot of people to at least dabble with MGR and use it for a while.

Part of MGR being small and straightforward was that it also felt like something that was by and for ordinary mortals, not the high peaks of X. It ran well on ordinary machines (even small machines) and it was small enough that you could understand how it worked and how to do things in it. It also had an appealingly simple model of how programs interacted with it; you basically treated it like a funny terminal, where you could draw graphics and do other things by sending escape sequences. As mentioned in this MGR information page, this made it network transparent by default.

MGR was not a perfect window system and in many ways it was a quite limited one. But it worked well in the 'all the world's a terminal' world of the late 1980s and early 1990s, when almost all of what you did even with X was run xterms, and it was often much faster and more minimal than the (fancier) alternatives (like X), especially on basic hardware.

Thinking of MGR brings back nostalgic memories of a simpler time in Unix's history, when things were smaller and more primitive but also bright and shiny and new and exciting in a way that's no longer the case (now they're routine and Unix is everywhere). My nostalgic side would love a version of MGR that ran in an X window, just so I could start it up again and play around with it, but at the same time I'd never use it seriously. Its day in the sun has passed. But it did have a day in the sun, once upon a time, and I remember those days fondly (even if I'm not doing well about explaining why).

(We shouldn't get too nostalgic about the old days. The hardware and software we have today is generally much better and more appealing.)

unix/MGRMemories written at 02:00:03


Using lsblk to get extremely useful information about disks

Every so often I need to know the serial number of a disk, generally because it's the only way to identify one particular disk out of two (or more) identical ones. As one example, perhaps I need to replace a failed drive that's one of a pair. You can get this information from the disks through smartctl, but the process is somewhat annoying if you just want the serial number, especially if you want it for multiple disks.

(Sometimes you have a dead disk so you need to find it by process of elimination starting from the serial numbers of all of the live disks.)

I've used lsblk for some time to get disk UUIDs and raid UUIDs, but I never looked very deeply at its other options. Recently I discovered that lsblk can do a lot more, and in particular it can report disk serial numbers (as well as a bunch of other handy information) in an extremely convenient form. It's simplest to just show you an example:

$ lsblk -o NAME,SERIAL,HCTL,TRAN,MODEL --nodeps /dev/sd?
sda  S21NNXCGAxxxxxH 0:0:0:0    sata   Samsung SSD 850 
sdb  S21NNXCGAxxxxxE 1:0:0:0    sata   Samsung SSD 850 
sdc  Zxxxxx4E        2:0:0:0    sata   ST500DM002-1BC14
sdd  WD-WMC5K0Dxxxxx 4:0:0:0    sata   WDC WD1002F9YZ-0
sde  WD-WMC5K0Dxxxxx 5:0:0:0    sata   WDC WD1002F9YZ-0

(For obscure reasons I don't feel like publishing the full serial numbers of our disks. It might be harmless to do so, but let's not find out otherwise the hard way.)

You can get a full list of possible fields with 'lsblk --help', along with generally what they mean, although you'll find that some of them are less useful than you might guess. VENDOR is always 'ATA' for me, for example, and KNAME is the same as NAME for my systems; TRAN is usually 'sata', as here, but we have some machines where it's different. Looking for a PHY-SEC that's not 512 is a convenient way to find advanced format drives, which may be surprisingly uncommon in some environments. SIZE is another surprisingly handy field; if you know you're looking for a disk of a specific size, it lets you filter disks in and out without checking serial numbers or even the specific model, if you have multiple different sized drives from one vendor such as WD or Seagate.

(--nodeps tells lsblk to just report on the devices that you gave it and not also include their partitions, software RAID devices that use them, and so on.)
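(If you want to process this information instead of reading it, one option is lsblk's JSON output. What follows is a minimal Python sketch, not something from my actual tooling. It assumes an lsblk new enough to support -J, and it assumes that the JSON keys are the lowercased column names, such as 'phy-sec', which you should check against your version. The function names are invented for the example.)

```python
import json
import subprocess

# The lsblk columns that this entry found useful.
FIELDS = "NAME,SERIAL,HCTL,TRAN,MODEL,SIZE,PHY-SEC"


def lsblk_json(devices=()):
    """Run lsblk on the given devices and return its JSON output as text."""
    cmd = ["lsblk", "-J", "--nodeps", "-o", FIELDS, *devices]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def parse_disks(text):
    """Return the list of per-disk dictionaries from 'lsblk -J' output."""
    return json.loads(text)["blockdevices"]


def advanced_format(disks):
    """Names of disks whose reported physical sector size is not 512 bytes."""
    # int() copes with versions that emit the size as a string or as a number.
    return [d["name"] for d in disks if int(d["phy-sec"]) != 512]
```

Something like advanced_format(parse_disks(lsblk_json(["/dev/sdf", "/dev/sdg"]))) would then list the drives that advertise a non-512 physical block size, and the text from lsblk_json() is also a reasonable thing to save away per machine.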

This compact lsblk output is great for summarizing all of the disks on a machine in something that's easy to print out and use. Pretty much everything I need to know is in one spot and I can easily use this to identify specific drives. I'm quite happy to have stumbled over this additional use of lsblk, and I plan to make much more use of it in the future. Possibly I should routinely collect this output for my machines and save it away.

(This entry is partly to write down the list of lsblk fields that I find useful so I don't have to keep remembering them or sorting through lsblk --help and trying to remember the fields that are less useful than they sound.)

linux/LsblkForDiskInfo written at 01:17:07


How I tend to label bad hardware

Every so often I wind up dealing with some piece of hardware that's bad, questionable, or apparently flaky. Hard disks are certainly the most common thing, but the most recent case was a 10G-T network card that didn't like coming up at 10G. For a long time I was sort of casual about how I handled these; generally I'd set them aside with at most a postit note or the like. As you might suspect, this didn't always work out so great.

These days I have mostly switched over to doing this better. We have a labelmaker (as everyone should), so any time I wind up with some piece of hardware I don't trust any more, I stick a label on it to mark it and say something about the issue. Labels that have to go on hardware can only be so big (unless I want to wrap the label all over whatever it is), so I don't try to put a full explanation; instead, my goal is to put enough information on the label so I can go find more information.

My current style of label looks broadly like this (and there's a flaw in this label):

volary 2018-02-12
no 10g problem

The three important elements are the name of the server the hardware came from (or was in when we ran into problems), the date, and some brief note about what the problem was. Given the date (and the machine) I can probably find more details in our email archives, and the remaining text hopefully jogs my memory and helps confirm that we've found the right thing in the archives.

As my co-workers gently pointed out, the specific extra text on this label is less than ideal. I knew what it meant, but my co-workers could reasonably read it as 'no problem with 10G' instead of the intended meaning of 'no 10g link', ie the card wouldn't run a port at 10G when connected to our 10G switches. My takeaway is that it's always worth re-reading a planned label and asking myself if it could be misread.

A corollary to labeling bad hardware is that I should also label good hardware that I just happen to have sitting around. That way I can know right away that it's good (and perhaps why it's sitting around). The actual work of making a label and putting it on might also cause me to recycle the hardware into our pool of stuff, instead of leaving it sitting somewhere on my desk.

(This assumes that we're not deliberately holding the disks or whatever back in case we turn out to need them in their current state. For example, sometimes we pull servers out of service but don't immediately erase their disks, since we might need to bring them back.)

Many years ago I wrote about labeling bad disks that you pull out of servers. As demonstrated here, this seems to be a lesson that I keep learning over and over again, and then backsliding on for various reasons (mostly that it's a bit of extra work to make labels and stick them on, and sometimes it irrationally feels wasteful).

PS: I did eventually re-learn the lesson to label the disks in your machines. All of the disks in my current office workstation are visibly labeled so I can tell which is which without having to pull them out to check the model and serial number.

sysadmin/LabelingBadHardware written at 00:52:35


DTrace being GPL (and thrown into a Linux kernel) is just the start

The exciting news of the recent time interval comes from Mark J. Wielaard's dtrace for linux; Oracle does the right thing. To summarize the big news, I'll just quote from the Oracle kernel commit message:

This changeset integrates DTrace module sources into the main kernel source tree under the GPLv2 license. [...]

This is exciting news and I don't want to rain on anyone's parade, but it's pretty unlikely that we're going to see DTrace in the Linux kernel any time soon (either the kernel.org main tree or in distribution versions). DTrace being GPL compatible is just the minimum prerequisite for it to ever be in the main kernel, and Oracle putting it in their kernel only helps move things forward so much.

The first problem is simply the issue of integrating foreign code originally written for another Unix into the Linux kernel. For excellent reasons, the Linux kernel people have historically been opposed to what I've called 'code drops', where foreign code is simply parachuted into the kernel more or less intact with some sort of compatibility layer or set of shims. Getting them to accept DTrace is very likely to require modifying DTrace to be real native Linux kernel code that does things in the Linux kernel way and so on. This is a bunch of work, which means that it requires people who are interested in doing the work (and who can navigate the politics of doing so).

(I wrote more on this general issue when I talked about practical issues with getting ZFS into the main Linux kernel many years ago.)

Oracle could do this work, and it's certainly a good sign that they've at least got DTrace running in their own kernel. But since it is their own vendor kernel, Oracle may have just done a code drop instead of a real port into the kernel. Even if they've tried to do a port, similar efforts in the past (most prominently with XFS) took a fairly long time and a significant amount of work before the code passed muster with the Linux kernel community and was accepted into the main kernel.

A larger issue is whether DTrace would even be accepted in any form. At this point the Linux kernel has a number of tracing systems, so the addition of yet another one with yet another set of hooks and so on might not be viewed with particularly great enthusiasm by the Linux kernel people. Their entirely sensible answer to 'we want to use DTrace' might be 'use your time and energy to improve existing facilities and then implement the DTrace user level commands on top of them'. If Oracle followed through on this, we would effectively still get DTrace in the end (I don't care how it works inside the kernel if it works), but this also might cause Oracle to not bother trying to upstream DTrace. From Oracle's perspective, putting a relatively clean and maintainable patchset into their vendor kernel is quite possibly good enough.

(It's also possible that this is the right answer at a technical level. The Linux kernel probably doesn't need three or four different tracing systems that mostly duplicate each other's work, or even two systems that do. Reimplementing the DTrace language and tools on top of, say, kprobes and eBPF would not be as cool as porting DTrace into the kernel, but it might be better.)

Given all of the things in the way of DTrace being in the main kernel, getting it included is unlikely to be a fast process (if it does happen). Oracle is probably more familiar with how to work with the main Linux kernel community than SGI was with XFS, but I would still be amazed if getting DTrace into the Linux kernel took less than a year. Then it would take more time before that kernel started making it into Linux distributions (and before distributions started enabling DTrace and shipping DTrace tools). So even if it happens, I don't expect to be able to use DTrace on Linux for at least the next few years.

(Ironically the fastest way to be able to 'use DTrace' would be for someone to create a version of the language and tools that sat on top of existing Linux kernel tracing stuff. Shipping new user-level programs is fast, and you can always build them yourself.)

PS: To be explicit, I would love to be able to use the DTrace language or something like it to write Linux tracing stuff. I may have had my issues with D, but as far as I can tell it's still a far more casually usable environment for this stuff than anything Linux currently has (although Linux is ahead in some ways, since it's easier to do sophisticated user-level processing of kernel tracing results).

linux/DTraceKernelPessimism written at 00:41:31


Some things about ZFS block allocation and ZFS (file) record sizes

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. For a file under the recordsize, the block size turns out to be in a multiple of 512 bytes, regardless of the pool's ashift or the physical sector size of the drives the pool is using.

Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.

To talk about these sizes, I'll start with some illustrative zdb output for a file data block, as before:

 0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The first size of the three is the logical block size, before compression. This is the first size= number ('4200L' here, in hex and L for logical). This is what grows in 512-byte units up to the recordsize and so on.

The second size is the physical size after compression, if any; this is the second size= number ('4200P' here, P for physical). It's a bit weird. If the file can't be compressed, it is the same as the logical size and because the logical size goes in 512-byte units, so does this size, even on ashift=12 pools. However, if compression happens this size appears to go by the ashift, which means it doesn't necessarily go in 512-byte units. On an ashift=9 pool you'll see it go in 512-byte units (so you can have a compressed size of '400P', ie 1 KB), but the same data written in an ashift=12 pool winds up being in 4 KB units (so you wind up with a compressed size of '1000P', ie 4 KB).

The third size is the actual allocated size on disk, as recorded in the DVA's asize field (which is the third subfield in the DVA[0] portion). This is always in ashift-based units, even if the physical size is not. Thus you can wind up with a 20 KB DVA but a 16.5 KB 'physical' size, as in our example (the DVA is '5000' while the block physical size is '4200').

(I assume this happens because ZFS ensures that the physical size is never larger than the logical size, although the DVA allocated size may be.)
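(To make the hex arithmetic concrete, here is a small Python sketch that pulls the three sizes out of the zdb line shown earlier and rounds the physical size up to ashift units. It assumes the pool is ashift=12, which is consistent with the numbers, and it ignores anything raidz would add for parity or padding. The function names are just for illustration.)

```python
import re

ZDB_LINE = "0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]"


def block_sizes(line):
    """Return (logical, physical, allocated) sizes in bytes from a zdb block line.

    All three numbers are printed by zdb in hex. The allocated size is the
    third subfield of DVA[0].
    """
    dva = re.search(r"DVA\[0\]=<[0-9a-f]+:[0-9a-f]+:([0-9a-f]+)>", line)
    size = re.search(r"size=([0-9a-f]+)L/([0-9a-f]+)P", line)
    return int(size.group(1), 16), int(size.group(2), 16), int(dva.group(1), 16)


def round_up(nbytes, ashift):
    """Round nbytes up to a multiple of 2**ashift bytes."""
    unit = 1 << ashift
    return (nbytes + unit - 1) // unit * unit
```

Here block_sizes(ZDB_LINE) gives (16896, 16896, 20480), ie 16.5 KB logical, 16.5 KB physical, and 20 KB allocated, and round_up(16896, 12) is 20480, matching the DVA. With ashift=9 the same 16896 bytes would not need any rounding at all.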

For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.

On the one hand, how much this matters depends on how compressible your data is and much modern data isn't (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.
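(Here is a rough Python illustration of that 160 KB example. It uses zlib purely as a stand-in for whatever compression ZFS is actually using, so the exact compressed size means nothing; the point is only that a long run of zero bytes shrinks to almost nothing. The layout function only makes sense for files larger than the recordsize, since smaller files get a single smaller block.)

```python
import zlib

RECORDSIZE = 128 * 1024
FILE_SIZE = 160 * 1024


def last_block_layout(file_size, recordsize):
    """Return (nblocks, data_bytes, padding_bytes) for the final logical block."""
    nblocks = -(-file_size // recordsize)  # ceiling division
    data = file_size - (nblocks - 1) * recordsize
    return nblocks, data, recordsize - data


nblocks, data, padding = last_block_layout(FILE_SIZE, RECORDSIZE)

# bytes(n) is n zero bytes, standing in for the 'nothingness' after end of file.
compressed_padding = len(zlib.compress(bytes(padding)))
```

This gives two logical blocks, with 32 KB of real data and 96 KB of zeros in the second one, and the 96 KB of zeros compresses to well under 1 KB even with a general purpose compressor.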

PS: I don't know if it's possible to mix vdevs with different ashifts in the same pool. If it is, I don't know how ZFS would decide what ashift to use for the physical block size. The minimum ashift in any vdev? The maximum ashift?

(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)

solaris/ZFSLogicalVsPhysicalBlockSizes written at 00:49:29


Writing my first addon for Firefox wasn't too hard or annoying

I prefer the non-JavaScript version of Google search results, but they have an annoyance, which is that Google rewrites all the URLs to indirect through themselves (with tracking numbers, but that's a lesser annoyance for me than the loss of knowing what I've already read). The Firefox 56 version of NoScript magically fixed this up, but I've now switched to uMatrix.

(The pre-WebExtensions NoScript does a lot of magic, which has good and bad aspects. uMatrix is a lot more focused and regular.)

To cut a long story short, today I wrote a Firefox WebExtensions-based addon to fix this, which I have imaginatively called gsearch-urlfix. It's a pretty straightforward fix because Google embeds the original URL in their transformed URL as a query parameter, so you just pull it out and rewrite the link to it. Sane people would probably do this as a GreaseMonkey user script, but for various reasons I decided it was simpler and more interesting to write an addon.
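(The addon itself is JavaScript, but the core idea is small enough to sketch in a few lines of Python. This is an illustration of the logic, not the addon's actual code. The '/url' path and the 'q' query parameter are assumptions about what Google's rewritten links look like, and the function name is made up.)

```python
from urllib.parse import parse_qs, urlsplit


def original_url(href, param="q"):
    """Return the target embedded in a redirect-style link, or href unchanged.

    Assumes links of the form '/url?q=<original URL>&...'. Anything that
    doesn't look like that is passed through untouched.
    """
    parts = urlsplit(href)
    if parts.path != "/url":
        return href
    values = parse_qs(parts.query).get(param)
    return values[0] if values else href
```

The standard library does the URL splitting and percent-decoding, so the only real decision is which links to leave alone. A stricter version would also check the hostname before rewriting anything.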

The whole process was reasonably easy. Mozilla has very good documentation that will walk you through most of the mechanics of an addon, and it's easy enough to test-load your addon into a suitable Firefox version to work on it. The actual JavaScript to rewrite hrefs was up to me, which made me a bit worried about needing to play around with regular expressions and string manipulation and parsing URLs, but it turns out that modern Firefox-based JavaScript has functions and objects that do all of the hard work; all I had to do was glue them together correctly. I had to do a little bit of debugging because of things that I got wrong, but console.log() worked fine to give me my old standby of print based debugging.

(Credit also goes to various sources of online information, which pointed me to important portions of the MDN JavaScript and DOM documentation, eg 1, 2, 3, and 4. Now you can see why I say that I just had to connect things. Mozilla also provides a number of sample extensions, so I looked at their emoji substitution example to see what I needed to do to transform the web page when it had loaded, which turned out to be a pleasantly simple process.)

There are a couple of things about your addon manifest.json that the MDN site won't tell you directly. The first is that if you want to make your addon into an unsigned XPI and load it permanently into your developer or nightly Firefox, it must have an id attribute (see the example here and the discussion here). The second is that the matches globs for what websites your content scripts are loaded into cannot be used to match something like 'any website with .google. in it'; they're very limited. I assume that this restriction is there because matches feeds into the permissions dialog for your addon.

(It's possible to have Firefox further filter what sites your content scripts will load into, see here, but the design of the whole system ensures that your content scripts can only be loaded into fewer websites than the user approved permissions for, not more. If you need to do fancy matching, or even just *.google.*, you'll probably have to ask for permission for all websites.)

This limitation is part of the reason why gsearch-urlfix currently only acts on www.google.com and www.google.ca; those are the two that I need and going further is just annoying enough that I haven't bothered (partly because I want to actually limit it to Google's sites, not have it trigger on anyone who happens to have 'google' as part of their website name). Pull requests are welcome to improve this.

I initially wasn't planning on submitting this to AMO to be officially signed so it can be installed in normal Firefox versions; among other things, doing so feels scary and probably calls for a bunch of cleanup work and polish. I may change my mind about that, if only so I can load it into standard OS-supplied versions of Firefox that I wind up using. Also, I confess that it would be nice to not have my own Firefox nag at me about the addon being unsigned, and the documentation makes the process sound not too annoying.

(This is not an addon that I imagine there's much of an audience for, but perhaps I'm wrong.)

web/FirefoxMyFirstAddon written at 23:59:21


Sending emails to your inbox is a dangerous default

I tweeted:

One of the things I have to keep learning over and over again about email is that I should not let so many things bother me by showing up in my inbox. Even relatively low-volume things.

(I can filter or I can eliminate the email, depending on the situation.)

It starts innocently enough. You start getting some new sort of email (perhaps you sign up for it, maybe it's an existing service sending new email, or perhaps it's a new type of notification that you've been auto-included in). It's low volume and reasonably important or useful or at least interesting. But it's a drip. Often it ramps up over time, and in any case there are a lot of sources of such drips so collectively they add up.

In the process of planning an entry about dealing with this, I've come to the obvious realization that one important part here is that new email almost always defaults to going to your inbox. When it goes to your inbox two things happen. First, it gets mixed up with everything else and you have to disentangle it any time you look at your inbox. Second, by default it interrupts you when it comes in. Sure, I may have some tricks for avoiding significant interruptions from new email, but it still partly interrupts me (I have to look at the subject at least), and unless I'm very busy there's always the temptation to read it right now just so that I can throw it away (or file it away).

(Avoiding that interruption in the first place is not an option for two reasons. First, part of my job as a sysadmin is to be interrupted by sufficiently important issues. Second, I genuinely want to read some email right away; it's important or I'm expecting it or I'm looking forward to it.)

It's certainly possible to move email so it doesn't wind up in my inbox, but as long as the default is for email to go to my inbox, stuff is going to keep creeping in. It's inevitable because people follow the path of least resistance; when it takes more work to filter things out (and requires a sample email and some guesses as to what to match on and so on), we don't always do that extra work.

(And that's the right tradeoff, too, at least some of the time. One email a year or even a month probably is not worth the time to set up a filter for. Maybe not even one email a week, depending.)

If email defaulted to not coming to my inbox and had to be filtered in, my email life would be a very different place. There are drawbacks to this, so in practice probably the easiest way to arrange it is to have different email accounts with different inboxes that have different degrees of priority (and that you check at different times and so on).

(Of course this is where my email mistake bites me in the rear. I don't have the separate email accounts that other people often do; I would have to set up new ones and shift things over. This is something I'll have to do someday, but I keep deferring it because of the various pains involved.)

PS: There are also practical drawbacks to shifting (some) email out of your inbox, in that unless you're very diligent it increases the odds that the email won't get dealt with because you just don't get around to looking at it. This is certainly happening with some of the email that I've moved out of my inbox; I'll get to it someday, probably, but not right now.

sysadmin/InboxDangerousDefault written at 23:07:37

Access control security requires the ability to do revocation

I recently read Guidelines for future hypertext systems (via). Among other issues, I was sad but not surprised to see that it was suggesting an idea for access control that is perpetually tempting to technical people. I'll quote it:

All byte spans are available to any user with a proper address. However, they may be encrypted, and access control can be performed via the distribution of keys for decrypting the content at particular permanent addresses.

This is in practice a terrible and non-workable idea, because practical access control requires the ability to revoke access, not just to grant it. When the only obstacle preventing people from accessing a thing is a secret or two, people's access can only move in one direction; once someone learns the secret, they have perpetual access to the thing. With no ability to selectively revoke access, at best you can revoke everyone's access by destroying the thing itself.

(If the thing itself is effectively perpetual too, you have a real long term problem. Any future leak of the secret allows future people to access your thing, so to keep your thing secure you must keep your secret secure in perpetuity. We have proven to be terrible at this; at best we can totally destroy the secret, which of course removes our own access to the thing too.)
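(A toy Python sketch of the difference, with invented class names. It is a deliberately simplified model and not how any real system implements either approach. The point is that a checked access list has an operation that undoes a grant, while a distributed key has nothing equivalent.)

```python
class AccessList:
    """Checked access control: the list is consulted on every request."""

    def __init__(self):
        self._allowed = set()

    def grant(self, user):
        self._allowed.add(user)

    def revoke(self, user):
        # Access can move in both directions.
        self._allowed.discard(user)

    def can_read(self, user):
        return user in self._allowed


class KeyProtected:
    """Key-based access: anyone who has ever learned the key can read."""

    def __init__(self, key):
        self._key = key

    def can_read(self, presented_key):
        # There is no way to tell an authorized copy of the key from a leaked one,
        # and nothing we can do here to make someone forget it.
        return presented_key == self._key
```

With AccessList, calling revoke('alice') means the next can_read('alice') is False. With KeyProtected, the only 'revocation' available is destroying or re-encrypting the thing itself, which cuts off everyone at once.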

Access control through encryption keys has a mathematical simplicity that appeals to people, and sometimes they are tempted to wave away the resulting practical problems with answers like 'well, just don't lose control of the keys' (or even 'don't trust anyone you shouldn't have', which has the useful virtue of being obviously laughable). These people have forgotten that security is not math, security is people, and so a practical security system must cope with what actually happens in the real world. Sooner or later something always goes wrong, and when it does we need to be able to fix it without blowing up the world.

(In the real world we have seen various forms of access control systems without revocation fail repeatedly. Early NFS is one example.)

tech/SecurityRequiresRevocation written at 02:21:43


The interesting error codes from Linux program segfault kernel messages

When I wrote up what the Linux kernel's messages about segfaulting programs mean, I described what went into the 'error N' codes and how to work out what any particular one meant, but I didn't inventory them all. Rather than put myself through reverse engineering what any particular error code means, I'm going to list them all here, in ascending order.

The basic kernel message looks like this:

testp[9282]: segfault at 0 ip 0000000000401271 sp 00007ffd33b088d0 error 4 in testp[400000+98000]

We're interested in the 'error N' portion, and a little bit in the 'at N' portion (which is the faulting address).

For all of these, the fault happens in user mode so I'm not going to mention it specifically for each one. Also, the list of potential reasons for these segfaults is not exhaustive or fully detailed.
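(As a sketch of how you might pull these fields out of logs mechanically, here is some minimal Python. It assumes the message format shown above and the usual x86 page fault error bits: 1 for a protection fault on a mapped page, 2 for a write, 4 for user mode, and 0x10 for an instruction fetch. It also assumes that the kernel prints the error code in hex. The function names are made up for the example.)

```python
import re

SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
)


def parse_segfault(line):
    """Return the program, PID, faulting address, and error code, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "prog": m["prog"],
        "pid": int(m["pid"]),
        "addr": int(m["addr"], 16),
        "error": int(m["error"], 16),  # assumed to be printed in hex
    }


def describe_error(code):
    """Turn the error code bits into a short description."""
    mode = "user" if code & 0x4 else "kernel"
    if code & 0x10:
        access = "instruction fetch from"
    elif code & 0x2:
        access = "write to"
    else:
        access = "read from"
    if code & 0x1:
        area = "a mapped area that denies the access"
    else:
        area = "an unmapped area"
    return f"{mode}-mode {access} {area}"
```

For the example message above, parse_segfault() reports PID 9282, a faulting address of 0, and error 4, and describe_error(4) is 'user-mode read from an unmapped area'.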

  • error 4: (data) read from an unmapped area.

    This is your classic wild pointer read. On 64-bit x86, most of the address space is unmapped so even a program that uses a relatively large amount of memory is hopefully going to have most bad pointers go to memory that has no mappings at all.

    A faulting address of 0 is a NULL pointer and falls into page zero, the lowest page in memory. The kernel prevents people from mapping page zero, and in general low memory is never mapped, so reads from small faulting addresses should always be error 4s.

  • error 5: read from a memory area that's mapped but not readable.

    This is probably a read through a pointer that is so wild that it's pointing somewhere in the kernel's area of the address space. It might be a guard page, but at least some of the time mmap()'ing things with PROT_NONE appears to make Linux treat them as unmapped areas so you get error code 4 instead. You might think this could be an area mmap()'d with other permissions but without PROT_READ, but it appears that in practice other permissions imply the ability to read the memory as well.

    (I assume that the Linux kernel is optimizing PROT_NONE mappings by not even creating page table entries for the memory area, rather than carefully assembling PTEs that deny all permissions. The error bits come straight from the CPU, so if there are no PTEs the CPU says 'fault for an unmapped area' regardless of what Linux thinks and will report in, eg, /proc/PID/maps.)

  • error 6: (data) write to an unmapped area.

    This is your classic write to a wild or corrupted pointer, including to (or through) a null pointer. As with reads, writes to guard pages mmap()'d with PROT_NONE will generally show up as this, not as 'write to a mapped area that denies permissions'.

    (As with reads, all writes with small faulting addresses should be error 6s because no one sane allows low memory to be mapped.)

  • error 7: write to a mapped area that isn't writable.

    This is either a wild pointer that was unlucky enough to wind up pointing to a bit of memory that was mapped, or an attempt to change read-only data, for example the classical C mistake of trying to modify a string constant (as seen in the first entry). You might also be trying to write to a file that was mmap()'d read only, or in general a memory mapping that lacks PROT_WRITE.

    (All attempts to write to the kernel's area of address space also get this error, instead of error 6.)

  • error 14: attempt to execute code from an unmapped area.

    This is the sign of trying to call through a mangled function pointer (or a NULL one), or perhaps returning from a call when the stack is in an unexpected or corrupted state so that the return address isn't valid. One source of mangled function pointers is use-after-free issues where the (freed) object contains embedded function pointers.

    (Error 14 with a faulting address of 0 often means a function call through a NULL pointer, which in turn often means 'making an indirect call to a function without checking that it's defined'. There are various larger scale causes of this in code.)

  • error 15: attempt to execute code from a mapped memory area that isn't executable.

    This is probably still a mangled function pointer or return address; it's just that you're unlucky (or lucky) and there's mapped memory there instead of nothing.

    (Your code could have confused a function pointer with a data pointer somehow, but this is a lot rarer a mistake than confusing writable data with read-only data.)
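The list above can be reproduced mechanically from the bits of the error code. What follows is a minimal sketch, not the kernel's own logic. It assumes the standard x86 page fault error code layout (bit 0 for 'page was present', bit 1 for write, bit 2 for user mode, bit 4 for instruction fetch) and it assumes the kernel prints the code in hexadecimal, so that 'error 14' is 0x14. The constant names and `describe_error` are invented for this example.

```python
# x86 page fault error code bits. These are assumed from the architecture's
# error code layout; the names are made up for this sketch.
PF_PROT = 0x01   # set: the page was mapped and the access broke its permissions
PF_WRITE = 0x02  # set: the access was a write; clear: it was a read
PF_USER = 0x04   # set: the fault happened in user mode
PF_INSTR = 0x10  # set: the fault was an instruction fetch

def describe_error(text):
    """Describe the 'error N' value from a segfault message (N is hex text)."""
    code = int(text, 16)
    if code & PF_INSTR:
        action = "execute"
    elif code & PF_WRITE:
        action = "write"
    else:
        action = "read"
    if code & PF_PROT:
        area = "mapped area (permission denied)"
    else:
        area = "unmapped area"
    mode = "user" if code & PF_USER else "kernel"
    return f"{mode}-mode {action}, {area}"

for text in ("4", "5", "6", "7", "14", "15"):
    print(f"error {text}: {describe_error(text)}")
```

Reading the codes as hexadecimal is what makes 14 and 15 come out as execute faults; treated as decimal they would decode to something quite different.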

If you're reporting a segfault bug in someone else's program, the error code can provide useful clues as to what's wrong. Combined with the faulting address and the instruction pointer at the time, it might be enough for the developers to spot the problem even without a core dump. If you're debugging your own programs, well, hopefully you have core dumps; they'll give you a lot of additional information (starting with a stack trace).

(Now that I know how to decode them, I find these kernel messages to be interesting to read just for the little glimpses they give me into what went wrong in a program I'm using.)

On 64-bit x86 Linux, generally any faulting address over 0x7fffffffffff will be reported as having a mapping and so you'll get error codes 5, 7, or 15 respectively for read, write, and attempt to execute. These are always wild or corrupted pointers (or addresses more generally), since you never have valid user space addresses up there.

A faulting address of 0 (sometimes printed as '(null)', as covered in the first entry) is a NULL pointer itself. A faulting address that is small, for example 0x18 or 0x200, is generally an offset from a NULL pointer. You get these offsets if you have a NULL pointer to a structure and you try to look at one of the fields (in C, 'sptr = NULL; a = sptr->fld;'), or you have a NULL pointer to an array or a string and you're looking at an array element or a character some distance into it. Under some circumstances a very large address, one near 0xffffffffffffffff (the very top of memory space), can be a sign of a NULL pointer that your code then subtracted from.

(If you see a fault address of 0xffffffffffffffff itself, it's likely that your code is treating -1 as a pointer or is failing to check the return value of something that returns a pointer or '(type *)-1' on error. Sadly there are C APIs that are that perverse.)
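These rules of thumb about faulting addresses can be written down as a small classifier. This is only a sketch of the heuristics for 64-bit x86. The function name and the `SMALL_LIMIT` cutoff for what counts as a 'small' offset are arbitrary choices made for the illustration, not anything the kernel defines.

```python
USER_TOP = 0x7FFFFFFFFFFF      # highest user space address on 64-bit x86
ADDR_MAX = 0xFFFFFFFFFFFFFFFF  # what (type *)-1 looks like as an address
SMALL_LIMIT = 0x10000          # hypothetical cutoff for 'small' offsets

def classify_fault_address(addr):
    """Guess what kind of bad pointer produced a faulting address."""
    if addr == 0:
        return "NULL pointer"
    if addr < SMALL_LIMIT:
        return "probably an offset from a NULL pointer"
    if addr == ADDR_MAX:
        return "probably -1 used as a pointer"
    if addr > ADDR_MAX - SMALL_LIMIT:
        return "possibly a NULL pointer that was subtracted from"
    if addr > USER_TOP:
        return "wild pointer above user space"
    return "ordinary user space address"

for addr in (0x0, 0x18, 0x401271, 0xFFFF800000000000,
             0xFFFFFFFFFFFFFFF8, ADDR_MAX):
    print(f"{addr:#x}: {classify_fault_address(addr)}")
```

The order of the checks matters: the special cases at the very top of the address space have to be tested before the general 'above user space' case, since they would otherwise all be reported as plain wild pointers.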

linux/KernelSegfaultErrorCodes written at 00:49:09
