Wandering Thoughts archives

2015-12-21

Some opinions on how package systems should allow you to pin versions

Good package management systems allow you to selectively pin or freeze the versions of some packages. Over time I have evolved some opinions on how this should work, usually by getting irritated by the limitations of some tool in doing this (which is what happened today). So here are the minimum things that your package management tool should support around this.

First, you should be able to pin a package at a version that is not the currently installed version. Such a pin means that the package system is allowed to upgrade the package to that version and no other. Ideally such a pinned version would anchor the update of other packages which require synchronized versions.

(Bonus points are awarded if the package system can be made to downgrade a package to that version as well.)

Second, you should be able to pin the version of a package that is not even installed. Such a pin means that only that version of the package is allowed to be installed later. As with the previous case, a 'can only install version X' pin should influence other packages through dependencies and so on.

Where both of these situations primarily matter is during system installation, when you will also be applying package updates (which is often the case). If you can pin non-current versions and even future package installs, your install system has a simple workflow; it first installs all of your pins, then does its usual package updates and installations of extra packages without worrying about specific versions. If you cannot pin non-current or non-installed packages, knowledge of pinned packages (and their versions) leaks into the whole update and install process. When you apply updates, you have to limit some packages to being updated only so far; when you install new packages, you have to install specific versions of some packages. And on top of this you have to pin (or hold, or freeze) the packages as well, so that future updates won't undo your work of picking specific package versions.

(This can also matter later on if you decide that you now want to pin some additional packages before applying more updates or installing new packages.)

Sidebar: pinning specific versions versus holding back changes

Sometimes you want a specific version of a package because it's what you've determined will work, or because you want all systems to be the same for some important package, or the like. Other times you don't particularly care about the specific version, but you just don't want a package to change for various reasons (for example, kernel updates might require reboots which have to be scheduled well in advance, or grub updates often wind up causing problems for your update process).

In theory, holding package changes is a subset of pinning a specific version (it is 'pin the current version of the package'). In practice package managers that support both often implement the two in different ways. I believe that Debian apt is an example of this.
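As a concrete sketch of the difference (assuming Debian or Ubuntu; the package name and version string here are made up purely for illustration), a specific-version pin is an apt_preferences stanza along these lines:

    Explanation: only this version of openssh-server may be installed
    Package: openssh-server
    Pin: version 1:6.6p1-2ubuntu2
    Pin-Priority: 1001

while simply freezing whatever version is installed right now goes through a separate mechanism, 'apt-mark hold openssh-server'. (A pin priority above 1000 also tells apt that downgrading to the pinned version is acceptable, which touches on the 'bonus points' case above.)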

PackageManagersPinFreely written at 21:39:41

2015-12-18

A fun tale of network troubleshooting involving VLANs and MACs

The following is not my story; it comes from my co-workers, who had the real fun of trying to figure this one out and then finding a fix.

To start with, let's set the background. We have an (OpenBSD) routing firewall machine that sits on a network segment whose egress router is not under our control. Actually, we have two of them, one active and one as a warm spare (it's on and being updated, but it is not connected to any of the production networks because otherwise it would fight the live firewall for the public IPs). A while back, as part of trying to fail over from the live machine to the warm spare, we discovered that the egress router for the network caches ARP information for a long time. Like, apparently, hours. This was obviously no good for being able to switch over (such as in the case of hardware failure). Since the egress router is not under our control, the only thing we could really do was explicitly set the warm spare to have the same Ethernet address as the active machine.

(This was tested at the time it was set up and worked, but we believe the test at the time was misleading.)

Recently my co-workers wanted to swap from the active machine to the warm spare, because the active machine had been up for literally years (we don't update OpenBSD all that often). Unfortunately, when they made the swap the (ex-)warm spare was not reachable on its public IP, so they failed back to the active machine and took the warm spare off for testing. Testing established that the warm spare was showing 'incomplete' for other machines in its ARP cache, although other machines picked it up fine for their ARP caches. Further, trying to inspect the traffic with tcpdump made the network suddenly work, but things broke again when they stopped tcpdump. Oh, and the problem was specific to using our preferred Intel Ethernet cards; if the warm spare was switched to use non-Intel network hardware, everything worked.

Now, it happens that this machine has a slightly unusual network configuration. Because it needs to talk to a number of external networks, it actually gets all of its external networks as tagged VLANs over a single physical network port. When we changed the machine to use the MAC of the active machine, we had set the Ethernet address on the VLAN for that particular network, because that was the network that mattered; we didn't change the MAC of anything else.

It turned out that this was the problem. Using Intel cards on our (old) version of OpenBSD, when the MAC of the VLAN differed from the MAC of the underlying physical interface and the interface was not in promiscuous mode, ARP (at least) didn't work because the kernel apparently never received the replies to its ARP queries. If you put the interface into promiscuous mode, such as by running tcpdump, things suddenly worked; the kernel received ARP replies and so on. We think that the whole setup worked when tested because we likely tested it with tcpdump running to watch traffic (and verify what MACs were being used).

(The obvious suspect here is hardware level receive filtering; perhaps the hardware is only being set by the driver to recognize the physical port MAC as its MAC. This is a driver and/or hardware issue, but these things happen.)

Once my co-workers figured out what the problem was, the fix was simple: explicitly set the MACs of both the physical port and all the VLANs on it to the active machine's MAC. But getting there took a whole frustrating and puzzling journey. This wasn't exactly a Heisenbug, but until my co-workers noticed the pattern that running tcpdump made it disappear it did look like one.
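To make the shape of the fix concrete, here is a minimal sketch of what it can look like in OpenBSD hostname.if files (the interface name, VLAN number, address, and MAC are all invented, and the exact vlan configuration syntax varies between OpenBSD versions):

    # /etc/hostname.em0 -- the physical port carrying the tagged VLANs
    lladdr 00:01:02:aa:bb:cc
    up

    # /etc/hostname.vlan5 -- the external network, now with the same MAC
    lladdr 00:01:02:aa:bb:cc
    vlan 5 vlandev em0
    inet 192.0.2.10 255.255.255.0

The important part is simply that the lladdr lines match on the physical interface and on every VLAN configured on top of it.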

(Using 'tcpdump -p' is the obvious thing for the future, but I don't know if it would actually have worked in this situation. Still, it's something to try to remember for the next time around. Maybe tcpdump should default to -p these days.)

VLANAndMACSurprise written at 23:56:37

2015-11-30

A new piece of my environment: xcape, an X modifier key modifier

I was turned on to xcape by evaryont in a comment on my entry on my Backspace/Delete X key mapping shift. What xcape does is, well, let's quote from its own readme:

xcape allows you to use a modifier key as another key when pressed and released on its own.

A modifier key here is things like Shift, Control, and Alt (and CapsLock if you turn it into a modifier, such as making it another Control). The common use of xcape is by vi people to make one of those keys act as Escape when it's tapped, so they don't have to make the long stretch off to the top left of the keyboard for a key that they use all the time; instead they can tap something much closer.

(At the same time they don't lose the normal use of a valuable modifier key the way they would if they completely turned one of the modifier keys into Escape with, say, xmodmap.)

It's not clear how xcape works from the manpage or the readme. Before I started reading the code (it's short), I had concerns that it actually intercepted the modifier key and did weird things with it, which might interfere with other programs. This is not how it works. Instead, xcape passively listens to all keyboard events; when it sees a press and a release of the modifier key alone fly by within its time window, it injects a synthetic key-down and key-up event for your chosen additional key. No existing events are touched, only new ones added.

(Xcape is listed as Linux specific, although it might not be; it only seems to use the X 'record' and 'XTest' extensions, and I think they're generic. The record extension is used to monitor key events, the XTest extension to inject the new ones.)

What I'm using xcape for is a bit different from usual. Dmenu is a core part of my environment, and I have my window manager set to bring it up when I hit F5. F5 was in an easily reached location on my old keyboard, but on my new keyboard it's moved just enough so it's no longer a casual, rapid tap. So I'm using xcape to make tapping the CapsLock key (which I normally use as a Control key) also generate F5 and thereby bring up dmenu. The CapsLock key is of course in an extremely convenient and easily reached spot, which is great for this.
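For illustration, the invocation involved is roughly along these lines (a sketch that assumes CapsLock has already been remapped to Control with an XKB option; the exact option name and any timeout tuning depend on the particular setup):

    setxkbmap -option ctrl:nocaps      # CapsLock acts as another Control key
    xcape -e 'Control_L=F5'            # a lone tap of that Control sends F5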

In general this works and achieves the goal of making bringing up dmenu be a fast, easy thing. The one drawback to reusing CapsLock is that I sometimes activate dmenu accidentally during normal typing; evidently I can plan to type a control character but then rapidly change my mind without thinking about it, which creates a CapsLock press and release close enough together to trigger xcape. If this turns out to be a long-term annoyance, I'll probably shift dmenu to being triggered off the much less used actual right Control key.

(This keyboard also has Windows keys, so I could go all the way to making the otherwise unused left Windows key trigger dmenu, which wouldn't need xcape at all. But on the whole I like being able to call up dmenu so easily and casually, so I'm inclined to keep things the way I have them now.)

It's possible that someday I'll add an xcape mapping for Escape, but I'm extremely used to hitting Escape in its current location now (it's basically a reflex action at this point) and I don't really find it a problem. Still, I acknowledge that I may be missing out by not doing so and devoting the time to acclimatize to a new Escape location.

(I'd probably put Escape on the left Shift.)

ToolsXcape written at 01:53:51

2015-11-27

Documentation should explain why things are security issues, at least briefly

In my discussion of Apache suexec I mentioned the apache2-suexec-custom Debian package, which allows you to change suexec's idea of its docroot and thus use suexec to run virtual host CGIs that aren't located under /var/www. If you're using suexec-custom, one of the obvious questions is what it's safe to set the suexec docroot to. If you read the manpage, you will hit this paragraph:

Do not set the [suexec] document root to a path that includes users' home directories (like /home or /var) or directories where users can mount removable media. Doing so would create local security issues. Suexec does not allow to set the document root to the root directory /.

This is all that the manpage has to say about this. In fact, this is all of the documentation you get about the security issues involved, period.

Perhaps the people who wrote this documentation felt that the security issues created here are obvious to everyone. If so, they were wrong. I at least have no idea what specifically makes including user home directories dangerous. It seems unlikely that the danger is simply that users can create new executables, because if you're doing virtual hosting and using suexec, you're presumably already giving all of those different virtual hosting UIDs write access to their subdirectory in /var/www so they can set up their own CGIs. After all, suexec explicitly requires all of those CGIs and their containing directories to be owned by the target user, not you. And after that, what is there that applies to user home directories but not /var/www?

(It can't be that suexec will run arbitrary programs under user home directories, because suexec has to be run through Apache and you should not be telling Apache 'treat anything at all under this entire general directory hierarchy as a CGI through these URLs'. If you tell Apache that your CGI-BIN directory is /usr/bin or /home or the like, you have already made a horrible mistake.)

This is a specific example of what is a general failing, namely not explaining why things are security issues. When you don't explain why things are a security problem, you leave people uncertain about what's safe and what isn't. Here, I've been left with no idea about what the important security properties of suexec's docroot actually are. The authors of the manpage have in mind some dangers, but I don't know what they are and as a result I don't know how to avoid them. It's quite possible that this will result in me accidentally configuring Apache and suexec in a subtly insecure way.

The explanation of why things are a security issue doesn't have to be deep and detailed; I don't demand, say, an example of how to exploit an issue. But it should be detailed enough that an outsider can see clearly what they need to avoid and broadly why. If you say 'avoid this general sort of setup', you need to explain what makes that setup dangerous so that people can avoid accidentally introducing a dangerous bit in another setup. Vagueness here doesn't help anyone.

(As a corollary, if you say that a general sort of setup is safe, you should probably explain why that's so. Otherwise you risk people making some small, harmless looking variant of the setup that is in fact not safe because it violates one of the assumptions.)

By the way, all of this applies to local system setup documentation too. If you know why something has to be done or not done in a particular way to preserve security, write it down in specific terms (even if it seems obvious to you now). Future readers of your documentation will thank you for being clear, and as usual this may well include your future self.

PS: It's possible that you don't know of any specific issues in your program but feel that it's probably not safe to use outside of certain narrow circumstances that you've considered in detail. If so, the documentation should just say this outright. Sysadmins and other people who care about the security properties of your program will appreciate the honesty.

ExplainSecurityIssues written at 23:47:03

2015-11-12

Don't have support registrations in the name of a specific sysadmin

Every so often at work you will buy something, sign up for a service, arrange a support contract, register for monitoring, or whatever with an outside company or organization. Not infrequently these things will ask you for an email address that will be both your organization's key to the service and the place where notifications about it get sent (sometimes you have to pick a username or login too). Here is something that we have learned the awkward way: when you do this, don't just use your email address (and name, and so on). Instead, either use an existing generic group email address or make up a new service specific email address (often these will just be mail aliases that distribute the email to all of the relevant parties). There are two reasons for this.

The first reason is that it keeps a single person from being the critical path for things to do with the service. If things like password resets or approvals for some action go only to me because I used my own email address and you need to do one of these things when I'm sick or on vacation or very busy or whatever, well, we have a problem. Or at least an annoyance. Using a generic address that multiple people see avoids that problem; no one needs to wait for the single magic person to be able to deal with whatever they need to do.

The second reason is that, well, to put it bluntly: people leave eventually. If person X leaves and there are things tied to their email address, using their customary personal login, and so on, life is at least a bit awkward. You can make it work, but take it from personal experience, it still feels weird and not entirely right to log in somewhere as ex-co-worker X because that's just how it was set up.

(I imagine you can have lots of fun if there have been several generations of turnover. 'Why do we have to log in to this site as 'jane'? Who's Jane? Oh, she was here ten years ago.')

Consistently registering everything with a generic email address, a suitable generic login or username, and so on avoids all of that. When someone leaves, nothing needs to change and there's no uncomfortable feeling or awkward 'who is Jane?' explanation a few years later.
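The mechanics of a generic address are usually trivial; often it is nothing more than a local mail alias that fans the email out to everyone relevant (a sketch with made-up names, in traditional /etc/aliases style):

    # /etc/aliases (run newaliases after editing)
    vendor-support:  alice, bob, chris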

(There are exceptions to this, of course. Sometimes a service has been built with these issues in mind, so it has groups and supports multiple accounts that you manage and so on. Sometimes a registration is genuinely personal and will only ever be used by you and it's okay for it to go away if you leave. Sometimes it's just in the nature of the service that everyone needs an individual login in order for things to really work. And so on.)

PS: The flipside of this is that if you're a service provider who has people register accounts with you, this is yet another reason that you really want to support changing logins.

RegisterGenericAddresses written at 00:15:42

2015-11-10

Why I spent a lot of time agonizing over an error message recently

I recently spent an inordinate amount of time not so much writing a local script as repeatedly writing, rewriting, and modifying its error messages (the rest of the script was mostly simple). Now, I'll admit up front that I have a general habit of obsessing over small details of program output, and maybe some of the fidgeting with the error messages came from that. But I actually maintain that I had a completely sensible reason for caring so much about the script's error messages. You see, the script isn't supposed to fail.

More exactly, it's not supposed to fail but we think that it might someday do so because every so often something weird is going on with the operation the script is doing. In fact the script exists to automate certain workarounds we were doing when we did this particular operation 'by hand' (it's actually buried inside another script). So almost all of the time the script is supposed to work, and we certainly hope it works all the time, but there's a rare possibility of failure lurking in the underbrush.

What this means for the script is that by the time we get an error, we'll probably have long since forgotten exactly what's going on. It's likely that the script will work reliably for weeks and months, during which our knowledge of the entire problem will have been displaced by other things. This means it's important for the error message we get to be clear, so we don't have to try to remember all of the surrounding context from scratch. A cryptic error message would make perfect sense for us right now, when the context is clear in our minds, but it won't in six months.

When I was revising the error message, one part of what I did was to look for things that might be mis-remembered or misinterpreted by people who'd forgotten the context. A surprisingly large amount of my initial language was at least partially ambiguous when I took a step back and tried my best to read it without context. Things that were obvious or only had one meaning inside the context suddenly took on an uncomfortable new life outside it. The resulting error messages are significantly more verbose now, but at least I can hope that they'll still make sense in six months.

(This is of course a version of the problem of context in programming.)

ContextInErrorMessages written at 01:33:02

2015-11-09

What sysadmins want out of logging means that it can't be too simple

Dave Cheney recently wrote Let's talk about logging about (Go) logging packages, where he advocates, well, I'm going to quote him directly:

I believe that there are only two things you should log:

  1. Things that developers care about when they are developing or debugging software.
  2. Things that users care about when using your software.

Obviously these are debug and info levels, respectively.

log.Info should simply write that line to the log output. There should not be an option to turn it off as the user should only be told things which are useful for them. [...]

My reaction is that this is too simple for real use. Ignoring things like (web) activity logs (which Dave Cheney agrees are a different case), there are clear divisions between what sysadmins need at different times and in different situations.

First, let's agree that programs should always be able to log their basic actions. If you're a web server, this is HTTP requests; if you're a mail server, this is email traffic; and so on. This tells sysadmins whether or not the system is doing anything, and if it's doing something what it's doing and how fast. Sysadmins will use this to do monitoring, to check if something happened (such as an email arriving or a request being processed), and so on.

Systems not infrequently encounter internal issues that are not fatal errors. They may experience timeouts, request errors, and so on. If we say that errors are fatal, these are all 'warnings' (even if some terminate the processing of the current whatever that the system is handling). They mark odd things that should not normally happen. Sysadmins like to have a record of these for obvious reasons.

Finally, when sysadmins are working to diagnose problems with services we want to be able to get detailed activity traces of exactly how the system processed requests. What did it look at? What did it find or not find as it stepped through things? Here we're looking for a description of why the system is acting as it is. This level of information is too voluminous to be logged routinely, and often it needs to be segmented up so that we can look only at certain aspects (because otherwise we'll drown in probably irrelevant information).

It's tempting to say that this level of information is the same as developer debug information, but it's my view that it's not. Developer debug information is internally focused and aimed at people who know the code and are making code changes. Sysadmin activity traces are externally focused and aimed at people who do not know the code and are not changing it. As a sysadmin, I don't care about internal state in the code; I'm going to assume there's no code bug and instead that I have either a misconfiguration or a malfunction somewhere in the overall system environment. I want to find that.

You can in theory run all of this through a simple log.Info interface. But if you do so there are two problems. First, you need to create internal standards in your program for formatting messages so that sysadmins can tell the different sorts of messages apart from each other. Second, you are spewing massive amounts of information out all the time (since you're always dumping all activity traces), which is not very friendly. My view is that a good logging package should be able to do this for you. A too-simple logging package throws both program authors and sysadmins to the wolves of ad-hoc logging and log filtering.
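As a rough sketch of the separation being argued for here (this is not anything from Dave Cheney's article; it uses only Go's standard library, with made-up message text and names):

    package main

    import (
        "io"
        "io/ioutil"
        "log"
        "os"
    )

    var (
        // routine activity: always on; what monitoring and 'did X happen?'
        // questions are answered from
        activity = log.New(os.Stdout, "", log.LstdFlags)
        // warnings: non-fatal oddities, kept distinct so they stand out
        warnings = log.New(os.Stderr, "warning: ", log.LstdFlags)
        // traces: too voluminous to leave on, so discarded by default
        trace = log.New(ioutil.Discard, "trace: ", log.LstdFlags)
    )

    // enableTracing turns on detailed activity traces when a sysadmin asks.
    func enableTracing(w io.Writer) {
        trace.SetOutput(w)
    }

    func main() {
        if os.Getenv("TRACE") != "" {
            enableTracing(os.Stderr)
        }
        activity.Printf("request from %s for %s", "192.0.2.1", "/index.html")
        warnings.Printf("backend lookup timed out, retrying")
        trace.Printf("checked cache: miss, consulting backend")
    }

The point is not this particular arrangement but that the logging layer itself, rather than each program author, should offer the distinction between always-on activity, warnings, and switchable traces.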

This is why real programs grow features to control what gets logged and to log different sorts of things in different places. Apache does not have separate request logs and error logs for arbitrary reasons, for example; real people wanted that separation because they find it quite useful.

SysadminLoggingNotSimple written at 01:50:26

2015-11-02

Status reporting commands should have script-oriented output too

There are a lot of status reporting programs out there on a typical system; they report on packages, on filesystems, on the status of ZFS pools or Linux software RAID or LVM, on boot environments, on all sorts of things. I've written before about these programs as tools or frontends, where I advocated for writing tools, but it's clear that battle is long since lost; almost no one writes programs that are tools instead of frontends. So today I have a more modest request: status reporting programs should have script oriented output as well as human oriented output.

The obvious reason is that this makes it easier for sysadmins to build scripts on top of your programs. Sysadmins do want to do this, especially these days where automation is increasingly important, and parsing your regular human-oriented output is more difficult and also more error-prone. Such script oriented output doesn't have to be very elaborate, either; it just has to be clear and easy to deal with in a script.
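An existing example of the kind of split I mean (using ZFS purely as an illustration; the exact flags are in the zpool manpage) is the difference between the human output of plain 'zpool list' and something like:

    zpool list -Hp -o name,size,alloc,free

where -H drops the headers and separates fields with tabs and -p prints exact, unhumanized numbers, so a script can split the fields apart without guessing at the human formatting.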

But there's a less obvious reason to have script oriented output; it's much easier to make script oriented output be stable (either de facto or explicitly documented as such). The thing about human oriented output is that it's quite prone to changing its format as additional information gets added and people rethink what the nicest presentation of information is. And it's hard to argue against better, more informative, more readable output (and in fact I don't think one should). But changed output is death on things that try to parse that output; scripts really want and need stable output, and will often break if they're parsing your human oriented output and you change it. When you explicitly split human oriented output from script oriented output, you can provide both the stability that scripts need and the changes that improve what people see. This is a win for both parties.

(As a side effect it may make it easier to change the human oriented output, because there shouldn't be many worries about scripts consuming it too. Assuming that you worried about that in the first place.)

(This is the elaborated version of a tweet and the resulting conversation with Dan McDonald.)

StatusReportsScriptableDesire written at 00:23:48

2015-10-31

In practice, anything involving the JVM is often a heavyweight thing

Last week I asked on Twitter if anyone had a good replacement for swish-e for indexing and searching some HTML pages. Several people suggested Apache Solr; my immediate reaction was that this sounded too heavyweight for what we wanted. It was then asserted that Solr is not that heavy if you disable enough things. I had a number of reactions to that, but my instant one was 'nothing involving the JVM is lightweight'. Today I want to talk about that.

I don't call JVM-based things 'heavyweight' because Java itself can easily eat up lots of memory (although that's certainly a potential issue). What makes the JVM heavy for us is that we don't already run any JVM-based services and that all too often, Java is not like other languages. With languages like Python, Perl, Ruby, or even PHP, as a sysadmin you can generally be pretty indifferent to the language the system is written in. You install the system (ideally through a package manager), you get some binaries and maybe some crontab jobs, and you run the binaries. You're done. With Java, my impression and to some extent my experience is that you also have to administer and manage a JVM. A Java system is not run some programs and forget; it's putting .jars in the right place, it's loading certificates into JVM trust stores, it's configuring JVM parameters, and so on and so forth. There is a whole level of extra things to learn and things to do that you take on in order to maintain the JVM environment for the system you actually want to run.

(One way to put it is that a JVM seems to often be a system inside your normal system. You get to maintain your normal system and you also get to learn how to maintain the JVM system as well.)

All of this makes any JVM-based system a heavyweight one, because adopting it means not just learning the system but also learning how to manage a probably-complex JVM environment. If we were already running JVM based things it would be a different issue, of course, because we'd probably already have this expertise (and the JVM way might even work better for us), but as it stands we don't.

(Similar issues probably hold for any Node-based system, partly because of Node itself and partly because Node has its own very popular package management system that we'd probably have to learn and wrangle in order to run any Node-based thing.)

It's probably possible to design JVM-using systems that are not 'JVM-based' in this way and that encapsulate all of the complexity inside themselves. But I suspect that something labeled on its website as 'enterprise' has not been designed to operate this way.

(I've mostly talked about the JVM instead of Java specifically because I suspect most of these issues also apply to any other JVM-based language, such as Scala, Clojure, and so on.)

JVMsAreHeavyweight written at 01:05:43

2015-10-23

Perhaps it's a good idea to reboot everything periodically

Yesterday around 6pm, the department's connection to the campus backbone had its performance basically fall off a cliff. Packet loss jumped and bandwidth dropped from that appropriate to gigabit Ethernet down to the level of a good home connection (it seems to have been running around 16 Mbits/sec inbound, although somewhat more outbound). People started noticing this morning, which resulted in us running around and talking to the university's central NOC (who run the backbone and thus the router that we connect to).

Everything looked perfectly normal on our side of things, with no errors being logged, all relevant interfaces up at 1G, and so on. But in the process of looking at things, we noticed that our bridging firewall had been up for 450 days or so. Since we have a ready hot spare and pfsync makes shifting over relatively non-disruptive, we (by which I mean my co-workers) decided to switch the active and hot spare machines (after rebooting the hot spare). Lo and behold, all of our backbone performance problems went away on the spot.

We reboot our Ubuntu machines on a relatively regular basis in order to apply kernel updates, because they're exposed to users. But many of our other machines we treat as appliances and as part of that we basically don't reboot them unless there's some compelling reason to do so. That's how we wind up with firewalls with 450 day uptimes, fileservers and backends that have mostly been up since they were installed a year or so ago, and so on.

Perhaps we should rethink that. In fact, if we're going to rethink things and agree to reboot machines every so often, we should actually make a relatively concrete schedule for it in advance. We don't have to schedule down to the day or week, but even something like deciding that all of the firewalls will be rebooted in March is likely to drastically increase the odds that it will actually happen.

('We should reboot the firewalls after they've been up for a while' is sufficiently fuzzy that it is at best a low priority entry in someone's to-do list, and thus easy to forget about or never get to. Adding 'in March' pushes things closer to the point where someone will put it on their calendar and then get it done.)

RebootPeriodically written at 01:56:00

