Wandering Thoughts


A little surprise with Prometheus scrape intervals, timeouts, and alerts

Prometheus pulls metrics from metric sources or, to put it in Prometheus terms, scrapes targets. Every scrape configuration and thus every target has a scrape interval and a scrape timeout as part of its settings; these can be specified explicitly or inherited from global values. In a perfect world where scraping targets either completes or fails in zero time, this results in simple timing; a target is scraped at time T, then T + interval, then T + interval + interval, and so on. However, the real world is not simple and scraping a target can take a non-zero amount of time, possibly quite a lot if you time out. You might sensibly wonder if the next scrape is pushed back by the non-zero scrape time.

The answer is that it is not, or at least it is sort of not. Regardless of the amount of time a scrape at time T takes, the next scrape is scheduled for T + interval and will normally happen then. Scrapes are driven by a ticker, which runs independently of how long each scrape takes and adjusts things as necessary to keep ticking exactly on time.

So far, so good. But this means that slow scrapes can have an interesting and surprising interaction with alerting rules and Alertmanager group_wait settings. The short version is that you can get a failing check and then a successful one in close succession, close enough to suppress an Alertmanager alert that you would normally expect to fire.

To make this concrete, suppose that you perform SSH blackbox checks every 90 seconds, time out at 60 seconds, trigger a Prometheus alert rule the moment an SSH check fails, and have a one minute group_wait in Alertmanager. Then if an SSH check times out instead of failing rapidly, you can have a sequence where you start the check at T, have it fail via timeout at T + 60, send a firing alert to Alertmanager shortly afterward, have the next check succeed at T + 90, and withdraw the alert from Alertmanager shortly afterward, before the one minute group_wait is up. The net result is that your 'alert immediately' SSH alert rule has not sent you an alert despite an SSH check failing.
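The arithmetic of that sequence can be written down as a toy calculation (this uses the example's numbers; none of it is Prometheus API, just timing):

```python
# Timeline of the example, in seconds from the start of the failing scrape.
interval = 90    # scrape interval
timeout = 60     # scrape timeout
group_wait = 60  # Alertmanager group_wait

alert_fires = timeout                       # T+60: the check times out
alert_resolves = interval                   # T+90: the next scrape succeeds
would_notify_at = alert_fires + group_wait  # T+120 at the earliest

# The alert resolves before group_wait runs out, so nothing is sent.
# In general this happens whenever interval - timeout < group_wait.
suppressed = alert_resolves < would_notify_at
```
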

It's natural to expect this result if your scrape interval is less than your group_wait, because then it's obvious that you can get a second scrape in before Alertmanager makes the alert active. It's not as obvious when the second scrape is possible only because the difference between the scrape interval and the scrape timeout is less than group_wait.

(If nothing else, this is going to make me take another look at our scrape timeout settings. I'm going to have to think carefully about just what all of the interactions are here, especially given all of the other alert delays. Note that a resolved alert is immediately sent to Alertmanager.)

PS: It's a pity that there's no straightforward way that I know of to get either Prometheus or Alertmanager to write a log record of pending, firing, and cleared alerts (with timestamps and details). The information is more or less captured in Prometheus metrics, but getting the times when things happened is a huge pain; being able to write optional logs of this would make some things much easier.

(I believe both report this if you set their log level to 'debug', but of course then you get a flood of other information that you probably don't want.)

Sidebar: How Prometheus picks the start time T of scrapes

If you've paid attention to your logs from things like SSH blackbox checks, you'll have noticed that Prometheus does not hit all of your scrape targets at exactly the same time, even if they have the same scrape interval. How Prometheus picks the start time for each scrape target is not based on when it learns about the scrape target, as you might expect; instead, well, let me quote the code:

base   = interval - now%interval
offset = t.hash() % interval
next   = base + offset

if next > interval {
   next -= interval
}

All of these values are in nanoseconds, and t.hash() is a 64-bit hash value, hopefully randomly distributed. The next result value is an offset to wait before starting the scrape interval ticker.

In short, Prometheus randomly smears the start time for scrape targets across the entire interval, hopefully resulting in a more or less even distribution.
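A Python rendition of the same computation may make it easier to play with (the hash values here are made-up stand-ins for t.hash()):

```python
# How Prometheus smears scrape start times, transcribed into Python.
# All values are in nanoseconds; target_hash stands in for the 64-bit
# t.hash() value of a scrape target.
def scrape_offset(now, interval, target_hash):
    base = interval - now % interval
    offset = target_hash % interval
    nxt = base + offset
    if nxt > interval:
        nxt -= interval
    return nxt  # how long to wait before starting this target's ticker

interval = 90 * 10**9  # a 90 second scrape interval
now = 12345
# Three different (made-up) target hashes land at three different
# points within the interval.
offsets = {scrape_offset(now, interval, h)
           for h in (1000, 50 * 10**9, 89 * 10**9)}
```
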

sysadmin/PrometheusScrapeIntervalBit written at 01:36:30; Add Comment


Things you can do to make your Linux servers reboot on kernel problems

One of the Linux kernel's unusual behaviors is that it often doesn't reboot after it hits an internal problem, what is normally called a kernel panic. Sometimes this is a reasonable thing and sometimes this is not what you want and you'd like to change it. Fortunately Linux lets you more or less control this through kernel sysctl settings.

(The Linux kernel differentiates between things like OOPSes and RCU stalls, which it thinks it can maybe continue on from, and kernel panics, which immediately freeze the machine.)

What you need to do is twofold. First, you need to make it so that the kernel reboots when it considers itself to have panicked. This is set through the kernel.panic sysctl, which is a number of seconds to wait after a panic before rebooting. Some sources recommend setting this to 60 seconds under various circumstances, but in limited experience we haven't found that to do anything for us except delay reboots, so we now use 10 seconds. Setting kernel.panic to 0 restores the default state, where panics simply hang the machine.

Second, you need to arrange for various kernel problems to trigger panics. The most important thing here is usually for kernel OOPS messages or BUG messages to trigger panics; the kernel considers these nominally recoverable, except that they mostly aren't and will often leave your machine effectively hung. Panicking on OOPS is turned on by setting kernel.panic_on_oops to 1.

Another likely important sign of trouble is RCU stalls; you can panic on these with kernel.panic_on_rcu_stall. Note that I'm biased about RCU stalls. The kernel documentation in sysctl/kernel.txt mentions some other ones as well, currently panic_on_io_nmi, panic_on_stackoverflow, panic_on_unrecovered_nmi, and panic_on_warn. Of these, I would definitely be wary about turning on panic_on_warn; our systems appear to see a certain number of them in reasonably routine operation.

(You can detect these warnings by searching your kernel logs for the text 'WARNING: CPU: <..> PID: <...>'. One of our WARNs was for a network device transmit queue timeout, which recovered almost immediately. Rebooting the server due to this would have been entirely the wrong reaction in practice.)
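Collected together, the settings discussed above could live in a sysctl.d fragment (the file name here is arbitrary; the values are the ones from this entry, and panic_on_rcu_stall is the optional extra):

```
# /etc/sysctl.d/90-panic-reboot.conf
# Reboot 10 seconds after a kernel panic instead of hanging forever.
kernel.panic = 10
# Turn OOPSes and BUGs into panics (and thus into reboots).
kernel.panic_on_oops = 1
# Also panic on RCU stalls.
kernel.panic_on_rcu_stall = 1
```

These can be applied immediately with 'sysctl -p' on the file, or individually with 'sysctl -w'.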

Note that you can turn on any or all of the various panic_on_* settings while still having kernel.panic set to 0. If you do this, you convert OOPSes, RCU stalls, or whatever into things that are guaranteed to hang the whole machine when they happen, instead of perhaps having it continue on in partial operating order. There are systems where this may be desirable behavior.

PS: If you want to be as sure as possible that the machine reboots after hitting problems, you probably want to enable a hardware watchdog as well if you can. The kernel panic() function tries hard to reboot the machine, but things can still go wrong. Unfortunately not all machines have hardware watchdogs available, although many Intel ones do.

Sidebar: The problem with kernel OOPSes

When a kernel oops happens, the kernel kills one or more processes. These processes were generally in kernel code at the time (that's usually what generated the oops), and they may have been holding locks or have been in the middle of modifying data structures, submitting IO operations, or doing other kernel things. However, the kernel has no idea what exactly needs to be done to safely release these locks, revert the data structure modifications, and so on; instead it just drops everything on the floor and hopes for the best.

Sometimes this works out, or at least the damage done is relatively contained (perhaps only access to one mounted filesystem starts hanging because of a lock held by the now-dead process that will never be unlocked). Often it doesn't, and more or less everything grinds to an immediate halt. If you're lucky, enough of the system survives long enough for the kernel oops message to be written to disk or sent out to your central syslog server.

linux/RebootOnPanicSettings written at 00:44:25; Add Comment


Two annoyances I have with Python's imaplib module

As I mentioned yesterday, I recently wrote some code that uses the imaplib module. In the process of doing this, I wound up experiencing some annoyances, one of them a traditional one and one a new one that I've only come to appreciate recently.

The traditional annoyance is that the imaplib module doesn't wrap errors from other modules that it uses. This leaves you with at least two problems. The first is that you get to try to catch a bunch of exception classes to handle errors:

try:
  c = ssl.create_default_context()
  m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
except (imaplib.IMAP4.error, ssl.SSLError, OSError) as e:

The second is that, well, I'm not sure I'm actually catching all of the errors that calling the imaplib module can raise. The module doesn't document them, and so this list is merely the ones that I've been able to provoke in testing. This is the fundamental flaw of not wrapping exceptions that I wrote about many years ago; by not wrapping exceptions, you make what modules you call an implicit part of your API. Then you usually don't document it.

I award the imaplib module bonus points for having its error exception class accessed via an attribute on another class. I'm sure there's a historical reason for this, but I really wish it had been cleaned up as part of the Python 3 migration. In the current Python 3 source, these exception classes are actually literally classes inside the IMAP4 class:

class IMAP4:
  class error(Exception): pass
  class abort(error): pass
  class readonly(abort): pass

The other annoyance is that the imaplib module doesn't implement any sort of timeouts, either on individual operations or on a whole sequence of them. If you aren't prepared to wait for potentially very long amounts of time (if the IMAP server has something go wrong with it), you need to add some sort of timeout yourself through means outside of imaplib, either something like signal.setitimer() with a SIGALRM handler or through manipulating the underlying socket to set timeouts on it (although I've read that this causes problems, and anyway you're normally going to be trying to work through SSL as well). For my own program I opted to go the SIGALRM route, but I have the advantage that the only thing I'm doing is IMAP. A more sophisticated program might not want to blow itself up with a SIGALRM just because the IMAP side of things was too slow.

Timeouts aren't something that I used to think about when I wrote programs that were mostly run interactively and did only one thing, where the timeout is most sensibly imposed by the user hitting Ctrl-C to kill the entire program. Automated testing programs and other, similar things care a lot about timeouts, because they don't want to hang if something goes wrong with the server. And in fact it is possible to cause imaplib to hang for a quite long time in a very simple way:

m = imaplib.IMAP4_SSL(host=host, port=443)

You don't even need something that actually responds and gets as far as establishing a TLS session; it's enough for the TCP connection to be accepted. This is reasonably dangerous, because 'accept the connection and then hang' is more or less the expected behavior for a system under sufficiently high load (accepting the connection is handled in the kernel, and then the system is too loaded for the IMAP server to run).

Overall I've wound up feeling that the imaplib module is okay for simple, straightforward uses but it's not really a solid base for anything more. Sure, you can probably use it, but you're also probably going to be patching things and working around issues. For us, using imaplib and papering over these issues is the easiest way forward, but if I wanted to do more I'd probably look for a third party module (or think about switching languages).

python/ImaplibTwoAnnoyances written at 00:33:00; Add Comment


A few notes on using SSL in Python 3 client programs

I was recently writing a Python program to check whether a test account could log into our IMAP servers and to time how long it took (as part of our new Prometheus monitoring). I used Python because it's one of our standard languages and because it includes the imaplib module, which did all of the hard work for me. As is my usual habit, I read as little of the detailed module documentation as possible and used brute force, which means that my first code looked kind of like this:

try:
  m = imaplib.IMAP4_SSL(host=host)
  m.login(user, pw)
except ....:

When I tried out this code, I discovered that it was perfectly willing to connect to our IMAP servers using the wrong host name. At one level this is sort of okay (we're verifying that the IMAP TLS certificates are good through other checks), but at another it's wrong. So I went and read the module documentation with a bit more care, where it pointed me to the ssl module's "Security considerations" section, which told me that in modern Python, you want to supply a SSL context and you should normally get that context from ssl.create_default_context().

The default SSL context is good for a client connecting to a server. It does certificate verification, including hostname verification, and has officially reasonable defaults, some of which you can see in ctx.options of a created context, and also ctx.get_ciphers() (although the latter is rather verbose). Based on the module documentation, Python 3 is not entirely relying on the defaults of the underlying TLS library. However, the underlying TLS library (and its version) affects what module features are available; you need OpenSSL 1.1.0g or later to get SSLContext.minimum_version, for example.

It's good that people who care can carefully select ciphers, TLS versions, and so on, but it's better that this seems to have good defaults (especially if we want to move away from the server dictating cipher order). I considered explicitly disabling TLSv1 in my checker, but decided that I didn't care enough to tune the settings here (and especially to keep them tuned). Note that explicitly setting a minimum version is a dangerous operation over the long term, because it means that someday you're lowering the minimum version instead of raising it.

(Today, for example, you might set the minimum version to TLS v1.2 and increase your security over the defaults. Then in five years, the default version could change to TLS v1.3 and now your unchanged code is worse than the defaults. Fortunately the TLS version constants do compare properly so far, so you can write code that uses max() to do it more or less right.)
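For example, here is a way to set a minimum version that can only ever raise the floor, never lower it (this assumes a Python and OpenSSL new enough to have SSLContext.minimum_version):

```python
import ssl

ctx = ssl.create_default_context()
# Insist on at least TLS 1.2, but if a future default is already
# stricter (say TLS 1.3), max() leaves that stricter default alone.
ctx.minimum_version = max(ctx.minimum_version, ssl.TLSVersion.TLSv1_2)
```
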

Python 2.7 also has SSL contexts and ssl.create_default_context(), starting in 2.7.9. However, use of SSL contexts is less widespread than it is in Python 3 (for instance the Python 2 imaplib doesn't seem to support them), so I think it's clear you want to use Python 3 here if you have a choice.

(It seems a little bit odd to still be thinking about Python 2 now that it's less than a year to it being officially unsupported by the Python developers, but it's not going away any time soon and there are probably people writing new code in it.)

python/Python3SSLInClients written at 01:53:36; Add Comment


A surprise potential gotcha with sharenfs in ZFS on Linux

In Solaris and Illumos, the standard and well supported way to set and update NFS sharing options for ZFS filesystems is through the sharenfs ZFS filesystem property. ZFS on Linux sort of supports sharenfs, but it attempts to be compatible with Solaris and in practice that doesn't work well, partly because there are Solaris options that cannot be easily translated to Linux. When we faced this issue for our Linux ZFS fileservers, we decided that we would build an entirely separate system to handle NFS exports that directly invokes exportfs, which has worked well. This turns out to have been lucky, because there is an additional and somewhat subtle problem with how sharenfs is currently implemented in ZFS on Linux.

On both Illumos and Linux, ZFS actually implements sharenfs by calling the existing normal command to manipulate NFS exports; on Illumos this uses share_nfs and on Linux, exportfs. By itself this is not a problem and actually makes a lot of sense (especially since there's no official public API for this on either Linux or Illumos). On Linux, the specific functions involved are found in lib/libshare/nfs.c. When you initially share a NFS filesystem, ZFS will wind up running the following command for each client:

exportfs -i -o <options> <client>:<path>

When you entirely unshare a NFS filesystem, ZFS will wind up running:

exportfs -u <client>:<path>

The potential problem comes in when you change an existing sharenfs setting, either to modify what clients the filesystem is exported to or to alter what options you're exporting it with. ZFS on Linux implements this by entirely unexporting the filesystem to all clients, then re-exporting it with whatever options and to whatever clients your new sharenfs settings call for.

(The code for this is in nfs_update_shareopts() in lib/libshare/nfs.c.)

On the one hand this is a sensible if brute force implementation, and computing the difference in sharing (for both clients and options) and how to transform one to the other is not an easy problem. On the other hand, this means that clients that are actually doing NFS traffic during the time when you change sharenfs may be unlucky enough to try a NFS operation in the window of time between when the filesystem was unshared (to them) and when it was reshared (to them). If they hit this window, they'll get various forms of NFS permission denied messages, and with some clients this may produce highly undesirable consequences, such as libvirt guests having their root filesystems go read-only.
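A sketch of what a reconciling implementation would compute instead (the helper here is hypothetical; real code would then run exportfs only for the changed entries, so unchanged clients never see a window where the export is missing):

```python
def reconcile_exports(current, wanted):
    """Given {client: options} mappings for the current and desired
    NFS exports of a filesystem, work out the minimal set of changes
    rather than unsharing everything and resharing it."""
    to_remove = [c for c in current if c not in wanted]
    to_add = [c for c in wanted if c not in current]
    to_change = [c for c in wanted
                 if c in current and current[c] != wanted[c]]
    return to_remove, to_add, to_change

cur = {"nfs-clients": "rw,no_root_squash", "oldhost": "ro"}
want = {"nfs-clients": "rw", "newhost": "ro"}
# -> remove 'oldhost', add 'newhost', re-export 'nfs-clients'
```
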

(The zfs-discuss re-query from Todd Pfaff today is what got several people to go digging and figure out this issue. I was one of them, but only because I rushed into exploring the code before reading the entire email thread.)

I would like to say that our system for ZFS NFS export permissions avoids this issue, but it has exactly the same problem. Rather than try to reconcile the current NFS export settings and the desired new ones, it just does a brute force 'exportfs -u' for all current clients and then reshares things. Fortunately we only very rarely change the NFS exports for a filesystem because we export to netgroups instead of individual clients, so adding and removing individual clients is almost entirely done by changing netgroup membership. The actual exportfs setting only has to change if we add or remove entire netgroups.

(Exportfs has a tempting '-r' option to just resynchronize everything, but our current system doesn't use it and I don't know why. I know that I poked around with exportfs when I was developing it but I don't seem to have written down notes about my exploration, so I don't know if I ran into problems with -r, didn't notice it, or had some other reason I rejected it. If I didn't overlook it, this is definitely a case where I should have documented why I wasn't doing an attractive thing.)

linux/ZFSOnLinuxSharenfsGotcha written at 00:08:23; Add Comment


Linux CPU numbers are not necessarily contiguous

In Linux, the kernel gives all CPUs a number; you can see this number in, for example, /proc/stat:

cpu0 [...]
cpu1 [...]
cpu2 [...]
cpu3 [...]

Under normal circumstances, Linux has contiguous CPU numbers that start at 0 and go up to however many CPUs the system has. However, this is not guaranteed, and it's not the case on some live configurations. It's perfectly possible to have a configuration where, for example, you have sixteen CPUs that are numbered 0 to 7 and 16 to 23, with 8 to 15 missing. In this situation, /proc/stat will match the kernel's numbering, with lines for cpu0 through cpu7 and cpu16 through cpu23. If your code sees this and decides to fill in the missing CPUs 8 through 15, it will be wrong.

You might think that no code could possibly make this mistake, but it's not quite that simple. If, for example, you make a straightforward array to hold CPU status, read in information from various sources, and then print out your accumulated data for CPUs 0 through the highest CPU you saw, you will invent those missing CPUs 8 through 15 (possibly with random unset data for them). In situations like this, you need to actively keep track of what CPUs in your array are valid and what ones aren't, or you need a more sophisticated data structure.
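One way to avoid the trap is to key your data by CPU number instead of indexing a plain array; a sketch:

```python
def parse_cpu_lines(stat_text):
    """Parse the per-CPU lines of /proc/stat into {cpu_number: fields},
    so that holes in the numbering stay holes instead of being
    silently invented by array indexing."""
    cpus = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            cpus[int(fields[0][3:])] = fields[1:]
    return cpus

# An (abbreviated) sample from a machine with non-contiguous numbering:
sample = "cpu 1 2 3\ncpu0 10 20 30\ncpu7 11 21 31\ncpu16 12 22 32\n"
present = parse_cpu_lines(sample)
# sorted(present) is [0, 7, 16]; CPU 8 is simply absent, not zero-filled.
```
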

(If you've created an API that says 'I return an array of CPU information for CPUs 0 through N', well, you have a problem. You're probably going to need an API change; if this is in a structure, at least an API addition of a new field to tell people which CPUs are valid.)

I can see why people make this mistake. It's tempting to have simple code, displays, and so on, and almost all Linux machines have contiguous CPU numbering so your code will work almost everywhere (we only wound up with non-contiguous numbering through bad luck). But, sadly, it is a mistake and sooner or later it will bite either you or someone who uses your code.

(It's unfortunate that doing this right is more complicated. Life certainly would be simpler if Linux guaranteed that CPU numbers were always contiguous, but given that CPUs can come and go, that could cause CPU numbers to not always refer to the same actual CPU over time, which is worse.)

Sidebar: How we have non-contiguous CPU numbers

We have one dual-socket machine with hyperthreading where one socket has cooling problems and we've shut it down by offlining the CPUs. Each socket has eight cores, and Linux enumerated one side of the HT pairs for both sockets before starting on the other side of the HT pairs. CPUs 0 through 7 and 16 through 23 are the two HTs for the eight cores on the first socket; CPUs 8-15 would be the first set of CPUs for the second socket, if they were online, and then CPUs 24-31 would be the other side of the HT pairs.

In general, HT pairing is unpredictable. Some machines will pair adjacent CPU numbers (so CPU 0 and CPU 1 are a HT pair) and some machines will enumerate all of one side before they enumerate all of the other. My Ryzen-based office workstation enumerates HT pairs as adjacent CPU numbers, so CPU 0 and 1 are a pair, while my Intel-based home machine enumerates all of one HT side before flipping over to enumerate all of the other, so CPU 0 and CPU 6 are a pair.

(I prefer the Ryzen ordering because it makes life simpler.)
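Rather than guessing, on Linux you can read the pairing out of /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, which is a CPU list in forms such as '0-1' or '0,6'. A sketch of parsing that format:

```python
def parse_siblings(text):
    """Parse a sysfs CPU list such as '0-1' or '0,6' (the format of
    thread_siblings_list) into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

parse_siblings("0-1")  # adjacent pairing, as on my Ryzen workstation
parse_siblings("0,6")  # split pairing, as on my Intel home machine
```
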

It's possible that we should be doing something less or other than offlining all of the CPUs for the socket with the cooling problem (perhaps the BIOS has an option to disable one socket entirely). But offlining them all seemed like the most thorough and sure option, and it certainly was simple.

linux/CPUNumbersNotContiguous written at 00:50:14; Add Comment


Why C uninitialized global variables have an initial value of zero

In C, uninitialized local variables are undefined but uninitialized global variables (whether static or not) are defined to start out as zero. This difference periodically strikes people as peculiar and you might wonder why C is this way. As it happens, there is a fairly simple answer.

One answer is certainly 'because the ANSI C standard says that global variables behave that way', and in some ways this is the right answer (but we'll get to that). Another answer is 'because C was documented to behave that way in "The C Programming Language" and so ANSI C had no choice but to adopt that behavior'. But the real answer is that C behaves this way because it was the most straightforward way for it to behave in Unix on PDP-11s, which was its original home.

In a straightforward compiled language like the early versions of C, all global variables have a storage location, which is to say that they have a fixed permanent address in memory. This memory comes from the operating system and when operating systems give you memory, they don't give it to you with random contents; for good reasons they have to set it to something and they tend to fill it with zero bytes. Early Unix was no exception, so the memory locations for uninitialized global variables were known to start out as all zero bytes. Hence early K&R C could easily and naturally declare that uninitialized global variables were zero, as they were located in memory that had been zero-filled by the operating system.

(Programs did not explicitly ask Unix for this memory. Instead, executable files simply had a field that said 'I have <X> bytes of bss', and the kernel set things up when it loaded the executable.)

The fly in the ointment for this simple situation is that there are some uncommon architectures where zero-filled memory doesn't give you zero valued variables for all types and instead the 0 value for some types has some of its bits turned on in memory. When this came up, people decided that C meant what it said; uninitialized values of these types were still zero, even though you could no longer implement this with no effort by just putting these variables in zero-filled memory. This is where 'the ANSI C standard says so' is basically the answer, although it is also really the only good answer since any other answer would make the initial value of uninitialized global variables non-portable.

(You can read more careful discussion of this on Wikipedia, and probably in many C FAQs. The comp.lang.c FAQ section 5.17 lists some architectures where null pointers are not all-bits-zero values. I suspect that there have been C compilers on architectures where floating point 0 is not all-bits-zero, although it is in IEEE 754 floating point, which pretty much everyone uses today.)

As a side note, the reason that this logic doesn't work for uninitialized local variables is that in a straightforward C implementation, they go on the stack and the stack is reused. The very first time you use a new section of stack, it's fresh memory from the operating system, so it's been zero-filled for you and your uninitialized local variables are zero, just like globals. But after that the memory has 'random' values left over from its previous use. And for various reasons you can't be sure when a section of the stack is being used for the first time.

(In a modern C environment, even completely untouched sections of the stack may not be zero. For security reasons, they may have been filled with random values or with specific 'poison' ones.)

programming/CWhyGlobalsZeroDefault written at 00:42:22; Add Comment


Perhaps you no longer want to force a server-preferred TLS cipher order on clients

To simplify a great deal, when you set up a TLS connection one of the things that happens in the TLS handshake is that the client sends the server a list of the cipher suites it supports in preference order, and then the server picks which one to use. One of the questions when configuring a TLS server is whether you will tell the server to respect the client's preference order or whether you will override it and use the server's preference order. Most TLS configuration resources, such as Mozilla's guidelines, will implicitly tell you to prefer the server's preference order instead of the client's.

(I say 'implicitly' here because the Mozilla discussion doesn't explicitly talk about it, but the Mozilla configuration generator consistently picks server options to prefer the server's order.)

In the original world where I learned 'always prefer the server's cipher order', the server was almost always more up to date and better curated than clients were. You might have all sorts of old web browsers and so on calling in, with all sorts of questionable cipher ordering choices, and you mostly didn't trust them to be doing a good job of modern TLS. Forcing everyone to use the order from your server fixed all of this, and it put the situation under your control; you could make sure that every client got the strongest cipher that it supported.

That doesn't describe today's world, which is different in at least two important ways. First, today many browsers update every six weeks or so, which is probably far more often than most people are re-checking their TLS best practices (certainly it's far more frequently than we are). As a result, it's easy for browsers to be the more up to date party on TLS best practices. Second, browsers are running on increasingly varied hardware where different ciphers may have quite different performance and power characteristics. An AES GCM cipher is probably the fastest on x86 hardware (it can make a dramatic difference), but may not be the best on, say, ARM based devices such as mobile phones and tablets (and it depends on what CPUs those have, too, since people use a wide variety of ARM cores, although by now all of them may be modern enough to have ARMv8-A AES-NI crypto instructions).

If you're going to consistently stay up to date on the latest TLS developments and always carefully curate your TLS cipher list and order, as Mozilla is, then I think it still potentially makes sense to prefer your server's cipher order. But the more I think about it, the more I'm not sure it makes sense for most people to try to do this. Given that I'm not a TLS expert and I'm not going to spend the time to constantly keep on top of this, it feels like perhaps once we let Mozilla restrict our configuration to ciphers that are all strong enough, we should let clients pick the one they think is best for them. The result is unlikely to do anything much to security and it may help clients perform better.
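As an illustration, in Python's ssl module the server-side preference is a single option bit that you can clear (assuming your Python exposes OP_CIPHER_SERVER_PREFERENCE; this is a sketch of the mechanism, not a recommendation for any particular server):

```python
import ssl

# A server-side context with the module's curated defaults.
ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
# Clear the 'server's cipher order wins' bit so that clients get to
# pick their preferred cipher from among the ones the server allows.
ctx.options &= ~ssl.OP_CIPHER_SERVER_PREFERENCE
```
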

(If you're CPU-constrained on the server, then you certainly want to pick the cheapest cipher for you and never mind what the clients would like. But again, this is probably not most people's situation.)

PS: As you might guess, the trigger for this thought was looking at a server TLS configuration that we probably haven't touched for four years, and perhaps more. In theory perhaps we should schedule periodic re-examinations and updates of our TLS configurations; in practice we're unlikely to actually do that, so I'm starting to think that the more hands-off they are, the better.

tech/TLSServerCipherPriority written at 01:11:18; Add Comment


Why your fresh new memory pages are zero-filled

When you (or your programs) obtain memory directly from the operating system, you pretty much invariably get memory that is filled with zero bytes. The same thing is true if you ask for fresh empty disk space, on systems where you can do this (Unix, for example); by specification for Unix, if you extend a file without writing data, the 'empty space' is all 0 bytes. You might wonder why this is. The answer is pretty straightforward; the operating system has to put some specific value into the new memory and disk space, and people have historically picked all 0 bytes as that value.

(I am not dedicated enough to try to research very old operating system history to see if I can find the first OSes to do this. For reasons we're about to cover, it probably started no later than the 1960s.)

There is a story in this, although it is a short one. Once upon a time, when you asked the operating system for some memory or some disk space, the operating system didn't fill it with any defined value; instead it gave it to you with whatever random values it had had before. Since you were about to write to the memory (or disk space), the operating system setting it to a specific value before you overwrote it with your data was just a waste of CPU. This worked fine for a while, and then people on multi-user systems noticed that you could allocate a bunch of RAM or disk space, not write to it, and search through it to see if the previous users had left anything interesting there. Not infrequently they had. Very soon after people started doing this, operating systems stopped giving you new memory or disk space without clearing its old contents. The simplest way to clear the old contents is to overwrite them with some constant value, and apparently the simplest constant value (or at least the one everyone settled on) is 0.
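You can see this directly from user space; an anonymous mmap gets you fresh pages straight from the operating system, and they arrive as all zero bytes:

```python
import mmap

# Ask the OS for a fresh anonymous page (not recycled allocator memory).
m = mmap.mmap(-1, 4096)
all_zero = m[:] == b"\x00" * 4096
m.close()
print(all_zero)  # True: the kernel zero-filled the page before handing it over
```
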

(Since then, hardware and software have developed all sorts of high speed ways of setting memory to 0, partly because it's become such a common operation as a result of this operating system behavior. Some operating systems even zero memory in the background when idle so they can immediately hand out memory instead of having to pause to clear it.)

This behavior of clearing (new) memory to 0 bytes has had some inobvious consequences in places you might not think of immediately, but that's another entry.

Note that this is only what happens when you get memory directly from the operating system, generally with some form of system call. Most language environments don't return memory to the operating system when your code frees it (either explicitly or, in garbage collected languages, implicitly); instead they keep holding on to the now-free memory and recycle it when your code asks for more. This reallocated memory normally has the previous contents that your own code wrote into it. Although this can be a security issue too, it's not something the operating system deals with; it's your problem (or at least a problem for the language runtime).

tech/WhyZeroMemoryPages written at 23:06:16; Add Comment


Two views of ZFS's GPL-incompatibility and the Linux kernel

As part of a thread on linux-kernel where ZFS on Linux's problem with a recent Linux kernel change in exported symbols was brought up, Greg Kroah-Hartman wrote in part in this message:

My tolerance for ZFS is pretty non-existant. Sun explicitly did not want their code to work on Linux, so why would we do extra work to get their code to work properly?

If one frames the issue this way, my answer would be that in today's world, Sun (now Oracle) is no longer involved at all in what is affected here. It stopped being 'Sun's code' years ago, when Oracle Solaris and OpenSolaris split apart, and it's now in practice the code of the people who use ZFS on Linux, with a side digression into FreeBSD and Illumos. The people affected by ZoL not working are completely disconnected from Oracle, and anything the Linux kernel does to make ZoL work will help Oracle only to a tiny degree, if at all.

In short, the reason to do extra work here is that the people affected are Linux users who are using their best option for a good modern filesystem, not giant corporations taking advantage of Linux.

(I suspect that the kernel developers are not happy that people would much rather use ZFS on Linux than Btrfs, but I assure them that it is still true. I am not at all interested in participating in a great experiment to make Btrfs sufficiently stable, reliable, and featureful, and I am especially not interested in having work participate in this for our new fileservers.)

However, there is a different way to frame this issue. If you take it as given that Sun did not want their code to be used with Linux (and Oracle has given no sign of feeling otherwise), then fundamental social respect for the original copyright holder and license means respecting their choice. If Sun didn't want ZFS to work on Linux, it's hostile to them for the kernel community to go to extra work to enable it to work on Linux. If people outside the kernel community hack it up so that it works anyway, that's one thing. But if the kernel community goes out of its way to enable these hacks, well, then the kernel community becomes involved and is violating the golden rule as applied to software licenses.

As a result, I can reluctantly and unhappily support or at least accept 'no extra work for ZFS' as a matter of principle for Linux kernel development. But if your concern is not principle but practical effects, then I think you are mistaken.

(And if Oracle actually wanted to take advantage of the Linux kernel for ZFS, they could easily do so. Whether they ever will or not is something I have no idea about, although I can speculate wildly and their relicensing of DTrace is potentially suggestive.)

linux/ZFSLicenseTwoViews written at 23:50:46; Add Comment

The risk that comes from ZFS on Linux not being GPL-compatible

A couple of years ago I wrote about the harm of ZFS not being GPL-compatible, which was that this kept ZFS from being bundled into most Linux distributions. License compatibility is both a legal and a social thing, and the social side is quite clear; most people who matter consider ZFS's CDDL license to be incompatible with the kernel. However, it turns out that there is another issue and another side of this that I didn't realize back then. This issue surfaced recently with the 5.0 kernel release candidates, as I first saw in Phoronix's ZFS On Linux Runs Into A Snag With Linux 5.0.

The Linux kernel doesn't allow kernel modules to use just any internal kernel symbols; instead they must be officially exported symbols. Some symbols (often, although not exclusively, old ones) are exported to all kernel modules regardless of the module's license, while others are exported in a way that marks them as restricted to GPL'd kernel modules. At the same time, the kernel does not have a stable API of these exported symbols; previously exported ones can be removed as code is revised. A removed symbol may have no replacement at all, or its replacement may be a GPL-only symbol where the previous one was generally available.
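As a C-like pseudocode sketch of the mechanism (the helper names here are invented, and real exports live inside the kernel tree, so this won't compile standalone), the kernel marks each exported symbol as either generally available or GPL-only:

```
/* In some kernel source file: */
int some_old_helper(void) { ... }
EXPORT_SYMBOL(some_old_helper);      /* available to all kernel modules */

int some_new_helper(void) { ... }
EXPORT_SYMBOL_GPL(some_new_helper);  /* only for modules declaring a
                                        GPL-compatible MODULE_LICENSE() */
```

A module whose declared license isn't GPL-compatible can link against the first but not the second, so if some_old_helper is later removed in favor of a GPL-only replacement, that module is stuck.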

Modules that are part of the Linux kernel source are always going to work, so the kernel always exports enough symbols for them (although possibly as GPL-only symbols, since in-tree kernel modules are all GPL'd). Out of kernel modules that do the same sort of thing as in-kernel ones are also always going to work, at least if they're GPL'd; you're always going to be able to have out of kernel modules for device drivers in general, for example. But out of kernel modules for less common things are more or less at the mercy of what symbols the kernel exports, especially if they're not GPL'd modules. If you're an out of kernel module with a GPL-compatible license, you might get the kernel developers to export some symbols you need. If your module has a license that is seen as not GPL-compatible, well, the kernel developers may not be very sympathetic.

This is what has happened with ZFS on Linux as of the 5.0 pre-release, as covered in the Phoronix story and ZoL issue #8259. This specific problem will probably be worked around, but it shows a systemic risk for ZFS on Linux (and for any unusual non-GPL'd module): whether you keep working at all is at the mercy of the Linux kernel people. If they ever decide to be hostile, they can systematically start making your life hard, and they may well make your life hard simply as a side effect of other changes.

Is it likely that ZFS on Linux will someday be unable to work at all with new kernels, because crucial symbols it needs are not available at all? I think it's unlikely, but it's certainly possible and that makes it a risk for long term usage of ZFS on Linux. If it happened (hopefully far in the future), at work our answer would be to replace our current Linux-based ZFS fileservers with FreeBSD ones. On my own machines, well, I'd have to figure out some way of migrating all of my data around and what I'd put it on, and it would definitely be a pain and make me unhappy.

(It wouldn't be BTRFS, unless things change a lot by that point.)

linux/ZFSNonGPLRisk written at 23:17:29; Add Comment

Even thinking about spam makes me angry

It isn't news to me that dealing with spam makes me irritated and angry. I resent the intrusion into my email, and then I resent the time I spend dealing with it, and in fact I resent its very existence. This is not a rational irritation and hatred; I viscerally dislike spam and people and organizations who spam me. Sensible people would resent spammers only for the time and effort they take to deal with, but I am angry all out of proportion with that.

(This anger is part of what pushes me to think about and try to design elaborate potential anti-spam measures, even when this isn't necessarily wise. It is not that I enjoy the challenge of it all or the like, it is that I want to frustrate spammers.)

What I've recently clued in to is that even thinking about spam often makes me angry, not merely dealing with it. Perhaps this shouldn't surprise me, since I know my reaction is a visceral one and just being reminded of things will set off that sort of reaction, but it kind of does. I am a happier person when I can spend as long as possible paying as little attention as possible to all things involving spam; the less I think of it at all, the better it is for me.

That sounds awfully abstract, so let me make it concrete. I have yet another case of Google being a spammer mailing list provider, and I considered writing it up for Wandering Thoughts. Then I realized that even thinking about it was making me grumpy and soaking in the situation for long enough to write an entry would be even worse, since I can't write an entry about a spam incident without having the spam incident on my mind for the entire time I write.

So, I have decided that I will probably not write that entry. I am angry about the spam and angry at Google and I would like to hold them up to the light (again), but it is not worth it. I would rather be non-angry. Since any reminder about Google's culpability will probably not help, it would also be sensible for me to entirely block email from Google to my spamtrap addresses so I'm completely unaware of any future cases.

It's possible that this will cause me to write less about spam in general on Wandering Thoughts, although I'm going to have to see about that. I lump sort-of-spam-related issues like DKIM into my spam category, and I likely still have things to talk about there.

(DMARC as a whole is not necessarily an anti-spam feature. As commonly used, it may be more of an anti-phish one, although I'm not sure that works as well as you'd like. That's another entry, though.)

spam/SpamThinkingAnger written at 02:22:21; Add Comment
