Our likely long road to working 10G-T on OmniOS
I wrote earlier about our problems with Intel 10G-T on our OmniOS fileservers and how we've had to fall back to 1G networking. Obviously we'd like to change that and go back to 10G-T. The obvious option was another sort of 10G-T chipset besides Intel's. Unfortunately, as far as we can see Intel's chipsets are the best supported option and eg Broadcom seems even less likely to work well (or at all, and we later had problems with even a Broadcom 1G chipset under OmniOS). So we've scratched that idea; at this point it's Intel or bust.
We really want to reproduce our issues outside of production. While we've set up a test environment and put load on it, we've so far been unable to make it fall over in any clearly networking related way (OmniOS did lock up once under extreme load, but that might not be related at all). We're going to have to keep trying in the new year; I don't know what we'll do if we can't reproduce things.
(We also aren't currently trying to reproduce the dual port card issue. We may switch to this at some point.)
As I said in the earlier entry, we no longer feel that we can trust the current OmniOS ixgbe driver in production. That means going back to production needs an updated driver. At the moment I don't think anyone in the Illumos community is actively working on this (which I can't blame them for), although I believe there's some interest in doing a driver update at some point.
It's possible that we could find some money to sponsor work on updating the ixgbe driver to the current upstream Intel version, and so get it done that way (assuming that this sort of work can be sponsored for what we can afford, which may be dubious). Unfortunately our constrained budget situation means that I can't argue very persuasively for exploring this until we have some confidence that the current upstream Intel driver would fix our issues. This is hard to get without at least some sort of reproduction of the problem.
(What this says to me is that I should start trying to match up driver versions and read driver changelogs. My guess is that the current Linux driver is basically what we'd get if the OmniOS driver was resynchronized, so I can also look at it for changes in the areas that I already know are problems, such as the 20msec stall while fondling the X540-AT2 ports.)
While I don't want to call it 'ideal', I would settle for a way to reproduce the dual card issue with simply artificial TCP network traffic. We could then change the server from OmniOS to an up to date Linux to see if the current Linux driver avoids the problem under the same load, then use this as evidence that commissioning an OmniOS driver update would get us something worthwhile.
None of this seems likely to be very fast. At this point, getting 10G-T back in six months seems extremely optimistic.
(The pessimistic view of when we might get our new fileserver environment back to 10G-T is obviously 'never'. That has its own long-term consequences that I don't want to think about right now.)
Sidebar: the crazy option
The crazy option is to try to learn enough about building and working on OmniOS so that I can build new ixgbe driver versions myself and so attempt either spot code modifications or my own hack testing on a larger scale driver resynchronization. While there is a part of me that finds this idea both nifty and attractive, my realistic side argues strongly that it would take far too much of my time for too little reward. Becoming a vaguely competent Illumos kernel coder doesn't seem like it's exactly going to be a small job, among other issues.
(But if there is an easy way to build new OmniOS kernel components, it'd be useful to learn at least that much. I've looked into this a bit but not very much.)
The potential end of public clients at the university?
Recently, another department asked our campus-wide sysadmin mailing list for ideas on how to deal with keyloggers, after having found one. They soon clarified that they meant physical keyloggers, because that's what they'd found. As I read the ensuing discussion I had an increasing sinking feeling that the answer was basically 'you can't' (which was pretty much the consensus answer; no one had really good ideas and several people knew things that looked attractive but didn't fully work). And that makes me pretty unhappy, because it means that I'm not sure public clients are viable any more.
Here at the university there's long been a tradition and habit of various sorts of public client machines, ranging from workstations in computer labs in various departments to terminals in libraries. All of these uses depend crucially on the machines being at least non-malicious, where we can assure users that using the machine in front of them is not going to give them massive problems like compromised passwords and everything that ensues from that.
(A machine being non-malicious is different from it being secure, although secure machines are usually non-malicious as well. A secure machine is doing only what you think it should be, while a non-malicious machine is at least not screwing its user. A machine that does what the user wants instead of what you want is insecure but not hopefully not malicious (and if it is malicious, well, the user did it to themselves, which is admittedly not a great comfort).)
Keyloggers, whether software or physical, are one way to create malicious machines. Once upon a time they were hard to get, expensive, and limited. These days, well, not so much, based on some hardware projects I've heard of; I'm pretty sure you could build a relatively transparent USB keylogger with tens of megabytes of logging capacity as an undergrad final project with inexpensive off the shelf parts. Probably you can already buy fully functional ones for cheap on EBay. What was once a pretty rare and exclusive preserve is now available to anyone who is bored and sufficiently nasty to go fishing. As this incident illustrates, some number of our users probably will do so (and it's only going to get worse as this stuff gets easier to get and use).
If we can't feasibly keep public machines from being made malicious, it's hard to see how we can keep offering and operating them at all. I'm now far from convinced that this is possible in most settings. Pessimistically, it seems like we may have reached the era where it's much safer to tell people to bring their own laptops, tablets, or phones (which they often will anyways, and will prefer using).
(I'm not even convinced it's a good idea to have university provided machines in graduate student offices, many of which are shared and in practice are often open for people who look like they belong to stroll through and fiddle briefly with a desktop.)
PS: Note that keyloggers are on the easy scale of the damage you can do with nasty USB hardware. There's much worse possible, but of course people really want to be able to plug their own USB sticks and so on into your public machines.
Sidebar: Possible versus feasible here
I'm pretty sure that you could build a kiosk style hardware enclosure that would make a desktop's actual USB ports and so on completely inaccessible, so that people couldn't unplug the keyboard and plug in their keylogger. I'm equally confident that this would be a relatively costly piece of custom design and construction that would also consume a bunch of extra physical space (and the physical space needed for public machines is often a big limiting factor on how many seats you can fit in).
Does having a separate daemon manager help system resilience?
One of the reasons usually put forward for having a separate daemon manager process (instead of having PID 1 do this work) is that doing so increases overall system resilience. As the theory goes, PID 1 can be made minimal and extremely unlikely to crash (unlike a more complex PID 1), while if the more complicated daemon manager does crash it can be restarted.
Well, maybe. The problem is the question of how well you can actually take over from a crashed daemon manager. Usually this won't be an orderly takeover and you can't necessarily trust anything in any auxiliary database that the daemon manager has left behind (since it could well have been corrupted before or during the crash). You need to have the new manager process step in and somehow figure out what was (and is) running and what isn't, then synchronize the state of the system back to what it's supposed to be, then pick up monitoring everything.
The simple case is a passive init system. Since the init system does not explicitly track daemon state, there is no state to recover on a daemon manager restart and resynchronization can be done simply by trying to start everything that should be started (based on runlevel and so on). We can blithely assume that the 'start' action for everything will do nothing if the particular service is already started. Of course this is not very realistic, as passive init systems generally don't have daemon manager processes that can crash in the first place.
For an active daemon manager, I think that at a minimum what you need is some sort of persistent and stable identifier for groups of processes that can be introspected and monitored from an arbitrary process. The daemon manager starts processes for all services under a an identifier determined from their service name; then when it crashes and you have to start a new one, the new one can introspect the identifiers for all of the groups to determine what services are (probably) running. Unfortunately there are lots of complications here, including that this doesn't capture the state of 'one-shot' services without persistent processes. This is of course not a standard Unix facility, so no fully portable daemon manager can do this.
It's certainly the case that a straightforward, simple daemon manager will not be able to take over from a crashed instance of itself. Being able to do real takeover requires both system-specific features and a relatively complex design and series of steps on startup, and still leaves you with uncertain or open issues. In short, having a separate daemon manager does not automatically make the system any more resilient under real circumstances. A crashing daemon manager is likely to force a system reboot just as much as a crashing PID 1 does.
However I think it's fair to say that under normal circumstances a separate daemon manager process crashing (instead of PID 1 crashing) will buy you more time to schedule a system outage. If the only thing that needs the daemon manager running is starting or stopping services and you already have all normal services started up, your system may be able to run for days before you need to reboot it. If your daemon manager is more involved in system operation or is routinely required to restart services, well, you're going to have (much) less time depending on the exact details.
How a Firefox update just damaged practical security
Recently, Mozilla pushed out Firefox 34 as one of their periodic regular Firefox updates. Unfortunately this shipped with a known incompatible change that broke several extensions, including the popular Flashblock extension. Mozilla had known about this problem for months before the release; in fact the bug report was essentially filed immediately after the change in question landed in the tree, and the breakage was known when the change was proposed. Mozilla people didn't care enough to do anything in particular about this beyond (I think) blacklisting the extension as non-functional in Firefox 34.
I'm sure that this made sense internally in Mozilla and was justified at the time. But in practice this was a terrible decision, one that's undoubtedly damaged pragmatic Firefox security for some time to come. Given that addons create a new browser, the practical effect of this decision is that Firefox's automatic update to Firefox 34 broke people's browsers. When your automatic update breaks people's browsers, congratulations, you have just trained them to turn your updates off. And turning automatic updates off has very serious security impacts.
The real world effect of Mozilla's decision is that Mozilla has now trained some number of users that if they let Mozilla update Firefox, things break. Since users hate having things break, they're going to stop allowing those updates to happen, which will leave them exposed to real Firefox security vulnerabilities that future updates would fix (and we can be confident that there will be such updates). Mozilla did this damage not for a security critical change but for a long term cleanup that they decided was nice to have.
(Note that Mozilla could have taken a number of methods to fix the popular extensions that were known to be broken by this change, since the actual change required to extensions is extremely minimal.)
I don't blame Mozilla for making the initial change; trying to make this change was sensible. I do blame Mozilla's release process for allowing this release to happen knowing that it broke popular extensions and doing nothing significant about it, because Mozilla's release process certainly should care about the security impact of Mozilla's decisions.
Why your 64-bit Go programs may have a huge virtual size
For various reasons, I build (and rebuild) my copy of the core Go
system from the latest development source on a regular basis, and
periodically rebuild the Go programs I use from that build. Recently
I was looking at the memory use of one of my programs with
ps and noticed that
it had an absolutely huge virtual size (Linux ps's
of around 138 GB, although it had only a moderate resident set size.
This nearly gave me a heart attack, since a huge virtual size with
a relatively tiny resident set size is one classical sign of a
(Builds with earlier versions of Go tended to have much more modest virtual set sizes on the order of 32 MB to 128 MB depending on how long it had been running.)
Fortunately this was not a memory leak. In fact, experimentation
soon demonstrated that even a basic 'hello world' program had that
huge a virtual size. Inspection of the process's
file (cf) showed that basically all of the
virtual space used was coming from two inaccessible mappings, one
roughly 8 GB long and one roughly 128 GB. These mappings had no
access permissions (they disallowed reading, writing, and executing)
so all they did was reserve address space (without ever using any
actual RAM). A lot of address space.
It turns out that this is how Go's current low-level memory management likes to work on 64-bit systems. Simplified somewhat, Go does low level allocations in 8 KB pages taken from a (theoretically) contiguous arena; what pages are free versus allocated is stored in a giant bitmap. On 64-bit machines, Go simply pre-reserves the entire memory address space for both the bitmaps and the arena itself. As the runtime and your Go code starts to actually use memory, pieces of the arena bitmap and the memory arena will be changed from simple address space reservations into memory that is actually backed by RAM and being used for something.
(Mechanically, the bitmap and arena are initially
PROT_NONE. As memory is used, it is remapped with
PROT_READ|PROT_WRITE. I'm not confident that I understand what
happens when it's freed up, so I'm not going to say anything there.)
All of this is the case for the current post Go 1.4 development version of Go. Go 1.4 and earlier behave differently with much lower virtual sizes for running 64-bit programs, although in reading the Go 1.4 source code I'm not sure I understand why.
As far as I can tell, one of the interesting consequences of this is that 64-bit Go programs can use at most 128 GB of memory for most of their allocations (perhaps all of them that go through the runtime, I'm not sure).
I have to say that this turned out to be more interesting and
educational than I initially expected, even if it means that watching
ps is no longer a good way to detect memory leaks in your Go
programs (mind you, I'm not sure it ever was). As a result, the
best way to check this sort of memory usage is probably some
runtime.ReadMemStats() (perhaps exposed through
net/http/pprof) and Linux's
smem program or the like to obtain detailed information on
meaningful memory address space usage.
PS: Unixes are generally smart enough to understand that
mappings will never use up any memory and so shouldn't count against
things like system memory overcommit limits. However they generally
will count against a per-process limit on total address space, which
likely means that you can't really use such limits and run post 1.4
Go programs. Since total address space limits are rarely used, this
is probably not likely to be an issue.
Sidebar: How this works on 32-bit systems
The full story is in the
mallocinit() comment. The short version
is that the runtime reserves a large enough arena to handle 2 GB
of memory (which 'only' takes 256 MB) but only reserves 512 MB of
address space out of the 2 GB it could theoretically use. If the
runtime later needs more memory, it asks the OS for another block
of address space and hopes that it is in the remaining 1.5 GB of
address space that the arena covers. Under many circumstances the
odds are good that the runtime will get what it needs.
How init wound up as Unix's daemon manager
If you think about it, it's at least a little bit odd that PID 1 wound up as the de facto daemon manager for Unix. While I believe that the role itself is part of the init system as a whole, this is not the same thing as having PID 1 do the job and in many ways you'd kind of expect it to be done in another process. As with many things about Unix, I think that this can be attributed to the historical evolution Unix has gone through.
As I see the evolution of this, things start in V7 Unix (or maybe
earlier) when Research Unix grew some system daemons, things like
crond. Something had to start these, so V7 had init run
on boot as the minimal approach. Adding networking to Unix in BSD
Unix increased the number of daemons to start (and was one of several
changes that complicated the whole startup process a lot). Sun added
even more daemons with NFS and YP and so on and either created or
elaborated interdependencies among them. Finally System V came along
and made everything systematic with
rcN.d and so on, which was
just in time for yet more daemons.
(Modern developments have extended this even further to actively
monitoring and restarting daemons if you ask them to. System V init
could technically do this if you wanted, but people generally didn't
inittab for this.)
At no point in this process was it obvious to anyone that Unix was going through a major sea change. It's not as if Unix went in one step from no daemons to a whole bunch of daemons; instead there was a slow but steady growth in both the number of daemons and the complexity of system startup in general, and much of this happened on relatively resource-constrained machines where extra processes were a bad idea. Had there been a single giant step, maybe people would have sat down and asked themselves if PID 1 and a pile of shell scripts were the right approach and said 'no, it should be a separate process'. But that moment never happened; instead Unix basically drifted into the current situation.
(Technically speaking you can argue that System V init actually does do daemon 'management' in another process. System V init doesn't directly start daemons; instead they're started several layers of shell scripts away from PID 1. I call it part of PID 1 because there is no separate process that really has this responsibility, unlike the situation in eg Solaris SMF.)
There are two parts to making your code work with Python 3
In my not terribly extensive experience so far, in the general case
porting your code to Python 3 is really two steps in one, not a single
process. First, you need to revise your code so that it runs on Python 3
at all; it uses
print(), it imports modules under their new names, and
so on. Some amount of this can be automated by
2to3 and similar tools,
although not all of it. As I discovered, a
great deal of this is basically synonymous with modernizing your code
to the current best practice for Python 2.7. I believe that almost
all of the necessary changes will still work on Python 2.7 without
hacks (certainly things like
print() will with the right imports from
After your code will theoretically run at all, you need to revise your code so that it handles strings in Unicode, and it means that calling this process 'porting' is not really a good label. The moment you deal with Unicode you need to consider both character encoding conversion points and what you do on errors. Dealing with Unicode is extra work and confronting it may well require at least a thorough exploration of your code and perhaps a deep rethink of your design. This is not at all like the effort to revise your code to Python 3 idioms.
(And some people will have serious problems, although future Python 3 versions are dealing with some of the problems.)
Code that has already been written to the latest Python 2.7 idioms will need relatively little revision for Python 3's basic requirements, although I think it always needs some just to cope with renamed modules. Code that was already dealing very carefully with Unicode on Python 2.7 will need little or no revision to deal with Python 3's more forced Unicode model, because it's already effectively operating in that model anyways (although possibly imperfectly in ways that were camouflaged by Python 2.7's handling of this issue).
The direct corollary is that both the amount and type of work you need to do to get your code running under Python 3 depends very much on what it does today with strings and Unicode on Python 2. 'Clean' code that already lives in a Unicode world will have one experience; 'sloppy' code will have an entirely different one. This means that the process and experience of making code work on Python 3 is not at all monolithic. Different people with different code bases will have very different experiences, depending on what their code does (and on how much they need to consider corner cases and encoding errors).
(I think that Python 3 basically just works for almost all string handling if your system locale is a UTF-8 one and you never deal with any input that isn't UTF-8 and so never are confronted with decoding errors. Since this describes a great many people's environments and assumptions, simplistic Python 3 code can get very far. If you're in such a simple environment, the second step of Python 3 porting also disappears; your code works on Python 3 the moment it runs, possibly better than it did on Python 2.)
The bad side of systemd: two recent systemd failures
In the past I've written a number of favorable entries about systemd.
In the interests of balance, among other things, I now feel that I
should rake it over the coals for today's bad experiences that I
ran into in the course of trying to do a
yum upgrade of one system
from Fedora 20 to Fedora 21, which did not go well.
The first and worst failure is that I've consistently had systemd's master process (ie, PID 1, the true init) segfault during the upgrade process on this particular machine. I can say it's a consistent thing because this is a virtual machine and I snapshotted the disk image before starting the upgrade; I've rolled it back and retried the upgrade with variations several times and it's always segfaulted. This issue is apparently Fedora bug #1167044 (and I know of at least one other person it's happened to). Needless to say this has put somewhat of a cramp in my plans to upgrade my office and home machines to Fedora 21.
(Note that this is a real segfault and not an assertion failure. In fact this looks like a fairly bad code bug somewhere, with some form of memory scrambling involved.)
The slightly good news is that PID 1 segfaulting does not reboot
the machine on the spot. I'm not sure if PID 1 is completely stopped
afterwards or if it's just badly damaged, but the bad news is that
a remarkably large number of things stop working after this happens.
Everything trying to talk to systemd fails and usually times out
after a long wait, for example attempts to do '
from postinstall scripts. Attempts to log in or to
su to root from
an existing login either fail or hang. A plain
reboot will try to
talk to systemd and thus fails, although you can force a reboot in
various ways (including '
The merely bad experience is that as a result of this I had occasion
journalctl (I normally don't). More specifically, I had
occasion to use '
journalctl -l', because of course if you're going
to make a bug report you want to give full messages. Unfortunately,
journalctl -l does not actually show you the full message.
Not if you just run it by itself. Oh, the full message is available,
all right, but
journalctl specifically and deliberately invokes
the pager in a mode where you have to scroll sideways to see long
lines. Under no circumstance is all of a long line visible on screen
at once so that you may, for example, copy it into a bug report.
This is not a useful decision. In fact it is a screamingly frustrating
decision, one that is about the complete reverse of what I think most
people would expect
-l to do. In the grand systemd tradition, there is
no option to control this; all you can do is force
journalctl to not
use a pager or work out how to change things inside the pager to not do
journalctl goes out of its way to set up this behavior. Not
by passing command line arguments to
less, because that would be too
obvious (you might spot it in a
ps listing, for example); instead
$LESS to effectively add the '
-S' option, among other
While I'm here, let me mention that
journalctl's default behavior
of 'show all messages since the beginning of time in forward
chronological order' is about the most useless default I can imagine.
Doing it is robot logic, not human logic.
Unfortunately the systemd journal is unlikely to change its course
in any significant way so I expect we'll get to live
with this for years.
(I suppose what I need to do next is find out wherever
al puts core dumps from root processes so that I can run
systemd core to poke around. Oh wait, I think it's in the
systemd journal now.
This is my unhappy face, especially since I am having to deal with
a crash in systemd itself.)
What good kernel messages should be about and be like
Linux is unfortunately a haven of terrible kernel messages and terrible kernel message handling, as I have brought up before. In a spirit of shouting at the sea, today I feel like writing down my principles of good kernel messages.
The first and most important rule of kernel messages is that any kernel message that is emitted by default should be aimed at system administrators, not kernel developers. There are very few kernel developers and they do not look at very many systems, so it's pretty much guaranteed that most kernel messages are read by sysadmins. If a kernel message is for developers, it's useless for almost everyone reading it (and potentially confusing). Ergo it should not be generated by default settings; developers who need it for debugging can turn it on in various ways (including kernel command line parameters). This core rule guides basically all of the rest of my rules.
The direct consequence of this is that all messages should be clear, without in-jokes or cleverness that is only really comprehensible to kernel developers (especially only subsystem developers). In other words, no yama-style messages. If sysadmins looking at your message have no idea what it might refer to, no lead on what kernel subsystem it came from, and no clue where to look for further information, your message is bad.
Comprehensible messages are only half of the issue, though; the other half is only emitting useful messages. To be useful, my view is that a kernel message should be one of two things: it should either be what they call actionable or it should be necessary in order to reconstruct system state (one example is hardware appearing or disappearing, another is log messages that explain why memory allocations failed). An actionable message should cause sysadmins to do something and really it should mean that sysadmins need to do something.
It follows that generally other systems should not be able to cause the kernel to log messages by throwing outside traffic at it (these days that generally means network traffic), because outsiders should not be able to harm your kernel to the degree where you need to do anything; if this is the case, they are not actionable for the sysadmin of the local machine. And yes, I bang on this particular drum a fair bit; that's because it keeps happening.
Finally, almost all messages should be strongly ratelimited. Unfortunately I've come around to the view that this is essentially impossible to do at a purely textual level (at least with acceptable impact for kernel code), so it needs to be considered everywhere kernel code can generate a message. This very definitely includes things like messages about hardware coming and going, because sooner or later someone is going to have a flaky USB adapter or SATA HD that starts disappearing and then reappearing once or twice a second.
To say this more compactly, everything in your kernel messages should be important to you. Kernel messages should not be a random swamp that you go wading in after problems happen in order to see if you can spot any clues amidst the mud; they should be something that you can watch live to see if there are problems emerging.
How to delay your fileserver replacement project by six months or so
This is not exactly an embarrassing confession, because I think we made the right decisions for the long term, but it is at least an illustration of how a project can get significantly delayed one little bit at a time. The story starts back in early January, where we had basically finalized the broad details of our new fileserver environment; we had the hardware picked out and we knew we'd run OmniOS on the fileservers and our current iSCSI target software on some distribution of Linux. But what Linux?
At first the obvious answer was CentOS 6, since that would get us a nice long support period and RHEL 5 had been trouble-free on our existing iSCSI backends. Then I really didn't like RHEL/CentOS 6 and didn't want to use it here for something we'd have to deal with for four or five years to come (especially since it was already long in the tooth). So we switched our plans to Ubuntu, since we already run it everywhere else, and in relatively short order I had a version of our iSCSI backend setup running on Ubuntu 12.04. This was probably complete some time in late February, based on circumstantial evidence.
Eliding some rationale, Ubuntu 12.04 was an awkward thing to settle on in March or so of this year because Ubuntu 14.04 was just around the corner. Given that we hadn't built and fully tested the production installation, we might actually have wound up in the position of deploying 12.04 iSCSI backends after 14.04 had actually come out. Since we didn't feel in a big rush at the time, we decided it was worthwhile to wait for 14.04 to be released and for us to spin up the 14.04 version of our local install system, which we expected to have done by not too long after the 14.04 release. As it happened it was June before I picked the new fileserver project up again and I turned out to dislike Ubuntu 14.04 too.
By the time we knew we didn't really want to use Ubuntu 14.04, RHEL 7 was out (it came out June 10th). While we couldn't use it directly for local reasons, we though that CentOS 7 was probably going to be released soon and that we could at least wait a few weeks to see. CentOS 7 was released on July 7th and I immediately got to work, finally getting us back on track to where we probably could have been at the end of January if we'd stuck with CentOS 6.
(Part of the reason that we were willing to wait for CentOS 7 was that I actually built a RHEL 7 test install and everything worked. That not only proved that CentOS 7 was viable, it meant that we had an emergency fallback if CentOS 7 was delayed too long; we could go into at least initial production with RHEL 7 instead. I believe I did builds with CentOS 7 beta spins as well.)
Each of these decisions was locally sensible and delayed things only a moderate bit, but the cumulative effects delayed us by five or six months. I don't have any great lesson to point out here, but I do think I'm going to try to remember this in the future.