Wandering Thoughts


Unix's pipeline problem (okay, its problem with files too)

In a comment on yesterday's entry, Mihai Cilidariu sensibly suggested that I not add timestamp support to my tools but instead outsource this to a separate program in a pipeline. In the process I would get general support for this and complete flexibility in the timestamp format. This is clearly and definitely the right Unix way to do this.
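A minimal version of such a timestamp filter is easy to sketch in Python (the name and the output format here are my own invention, not anything Mihai proposed):

```python
#!/usr/bin/env python3
# A minimal 'timestamp' pipeline filter: prefix every line read from
# standard input with the time it arrived, as in 'monitor | timestamp'.
import sys
import time

def stamped(line, now=None):
    """Return the line prefixed with an HH:MM:SS timestamp."""
    now = now if now is not None else time.localtime()
    return time.strftime("%H:%M:%S", now) + " " + line

def main():
    for line in sys.stdin:
        sys.stdout.write(stamped(line))
        sys.stdout.flush()   # pass each line on immediately
```

(Wire main() up under the usual "if __name__ == '__main__':" guard to use it as a filter.)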

Unfortunately it's not a good way in practice, because of a fundamental pragmatic problem Unix has with pipelines. This is our old friend block buffering versus line buffering. A long time ago, Unix decided that many commands should change their behavior in the name of efficiency; if they wrote lines of output to a terminal you'd get each line as it was written, but if they wrote lines to anything else you'd only get the output in blocks.

This is a big problem here because obviously a pipeline like 'monitor | timestamp' basically requires the monitor process to produce output a line at a time in order to be useful; otherwise you'd get large blocks of lines that all had the same timestamp because they were written to the timestamp process in a block. The sudden conversion from line buffered to block buffered can also affect other sorts of pipeline usage.

It's certainly possible to create programs that don't have this problem, ones that always write a line at a time (or explicitly flush after every block of lines in a single report). But it is not the default, which means that if you write a program without thinking about it or being aware of the issue at all you wind up with a program that has this problem. In turn people like me can't assume that a random program we want to add timestamps to will do the right thing in a pipeline (or keep doing it).

(Sometimes the buffering can be an accidental property of how a program was implemented. If you first write a simple shell script that runs external commands and then rewrite it as a much better and more efficient Perl script, well, you've probably just added block buffering without realizing it.)
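In Python, for example, opting in to line-at-a-time output is only a line or two; a hedged sketch (reconfigure() needs Python 3.7 or later):

```python
import sys

# Opt in to line buffering even when stdout is a pipe rather than a
# terminal. reconfigure() needs Python 3.7+; the hasattr guard keeps
# this harmless if stdout has been replaced with something else.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(line_buffering=True)

def emit(line, out=None):
    """Write one line and flush it immediately, so a downstream process
    in the pipeline sees it now instead of when a buffer block fills."""
    out = out if out is not None else sys.stdout
    out.write(line + "\n")
    out.flush()
```

The explicit flush() is the belt-and-suspenders version that works everywhere; it's the moral equivalent of C's setvbuf(stdout, NULL, _IOLBF, 0).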

In the end, what all of this really does is quietly chip away at the Unix ideal that you can do everything with pipelines and that pipelining is the right way to do lots of stuff. Instead pipelining becomes mostly something that you do for bulk processing. If you use pipelines outside of bulk processing, sometimes it works, sometimes you need to remember odd workarounds so that it's mostly okay, and sometimes it doesn't do what you want at all. And unless you know Unix programming, why things are failing is pretty opaque (which doesn't encourage you to try doing things via pipelines).

(This is equally a potential problem with redirecting program output to files, but it usually hits most acutely with pipelines.)

unix/PipelineProblem written at 02:27:41


Monitoring tools should report timestamps (and what they're monitoring)

This is a lesson learned, not quite the hard way but close to it. What is now a fairly long time ago I wrote some simple tools to report the network bandwidth (and packets per second) for a given interface on Linux and Solaris. The output looked (and looks) like this:

 40.33 MB/s RX  56.54 MB/s TX   packets/sec: 50331 RX 64482 TX

I've used these tools for monitoring and troubleshooting ever since, partly because they're simple and brute force and thus I have a great deal of trust in the numbers they show me.

Recently we've been looking at a NFS fileserver lockup problem, and as part of that I've spent quite some time gathering output from monitoring programs that run right up to the moment the system locks up and stops responding. When I did this, I discovered two little problems with that output format up there: it tells me neither the time it was for nor the interface I'm monitoring. If I wanted to see what happened thirty seconds or a minute before the lockup, well, I'd better count back 30 or 60 lines (and that was based on the knowledge that I was getting one report a second). As far as keeping track of which interface (out of four) a particular set of output was from, well, I wound up having to rely on window titles.

So now I have a version of these tools with a somewhat different output format:

e1000g1 23:10:08  14.11 MB/s RX  77.40 MB/s TX   packets/sec: 37791 RX 66359 TX

Now this output is more or less self-identifying. I can look at a line and know almost right away what I'm seeing, and I don't have to carefully preserve a lot of context somehow. And yes, this doesn't show how many seconds this report is aggregated over (although I can generally see it given two consecutive lines).

I was lucky here in that adding a timestamp plus typical interface names still keeps output lines under 80 characters. But even in cases where adding this information would widen the output lines, well, I can widen my xterm windows and it's better to have this information than to have to reconstruct it afterwards. So in the future I think all of my monitoring tools are at least going to have an option to add a timestamp and similar information, and they might print it all the time if it fits (as it does here).
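The self-identifying format itself is trivial to produce; here's a sketch of the formatting in Python (the field widths are just what happens to fit my output, not anything canonical):

```python
import time

def report_line(iface, rx_mb, tx_mb, rx_pps, tx_pps, now=None):
    """One self-identifying monitoring line: the interface name and a
    timestamp first, then the numbers, staying under 80 columns."""
    ts = time.strftime("%H:%M:%S", now if now is not None else time.localtime())
    return ("%s %s %6.2f MB/s RX %6.2f MB/s TX"
            "   packets/sec: %d RX %d TX"
            % (iface, ts, rx_mb, tx_mb, rx_pps, tx_pps))
```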

PS: I have strong feelings that timestamps et al should usually be optional if they push the output over 80 columns wide. There are a bunch of reasons for this that I'm not going to try to condense into this entry.

PPS: This idea is not a miracle invention of mine by any means. In fact I shamelessly copied it from how useful the timestamps printed out by tools like arcstat are. When I noticed how much I was using those timestamps and how nice it was to be able to scroll back, spot something odd, and say 'ah, this happened at ...' right away, I smacked myself in the forehead and did it for all of the monitoring commands I was using. Fortunately many OmniOS commands like vmstat already have an option to add timestamps, although it's sometimes kind of low-rent (eg vmstat prints the timestamp on a separate line, which doubles how many lines of output it produces and thus halves the effective size of my scrollback buffer).

sysadmin/ReportTimeAndId written at 23:58:52

What I want to have in shell (filename) completion

I've been using basic filename completion in my shell for a while now, and doing so has given me a perspective on what advanced features of this I'd find useful and which strike me as less useful. Unfortunately for me, the features that I'd find most useful are the ones that are the hardest to implement.

Put simply, the problem with basic filename completion is that any time you want to use even basic shell features like environment variables, you lose completion. Do you refer to some directories through convenience variables? Nope, not any more, because now you must choose between completing a long full path and typing, say, '$ml/afilename' with no completion at all.

(I know, bash supports completing filenames that use environment variables. Probably zsh does as well. I'm not interested in switching to either right now.)

But environment variables are just one of the ways to shorten filenames. Two more cases are using wildcards to match unique or relatively unique subsets of a long filename and using various multi-match operators to specify several filenames in some directory or the like. Both of these would be handy to be able to do filename completion for. In fact, let's generalize that: what I'd really like is for my shell to be able to do filename completion in the face of any and all things that can appear in filenames and get expanded by the shell. Then I could combine the power of filename completion and the power of all of those handy shell operators together.
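As a toy model of what I mean (everything here is my own invention), completion through variables and wildcards basically amounts to 'expand the word the way the shell would, then match':

```python
import glob
import os

def complete(word):
    """Toy filename completion that survives $variables and wildcards:
    expand the partial word the way the shell would, then glob for
    candidate completions. A real shell would have to handle quoting,
    multi-match operators, and putting the completion back in the
    original (unexpanded) form, which is where it gets hard."""
    expanded = os.path.expandvars(word)   # handles '$ml/afile' style words
    return sorted(glob.glob(expanded + "*"))
```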

Of course this is easier said than done. I know that what I'm asking for is quite a challenging programming exercise and to some extent a design exercise once we get to the obscure features. But it sure would be handy (more handy, in my biased opinion, than a number of other completion features).

(I've never used eg bash's smart context aware autocompletion of program arguments in general, so I don't know for sure how useful I'd find it; however, my personal guess is 'not as much as full filename completion'. I'm so-so on autocompletion of program names; I suppose I do sometimes use programs with annoyingly long names, so my shell might as well support it. Again I'd rate it well below even autocompletion with environment variables, much less the full version.)

unix/MyShellCompletionDesire written at 01:16:52


Sometimes knowing causes does you no good (and sensible uses of time)

Yesterday, I covered our OmniOS fileserver problem with overload and mentioned that the core problem seems to be (kernel) memory exhaustion. Of course once we'd identified this I immediately started coming up with lots of theories about what might be eating up all the memory (and then not giving it back), along with potential ways to test these theories. This is what sysadmins do when we're confronted with problems, after all; we try to understand them. And it can be peculiarly fun and satisfying to run down the root cause of something.

(For example, one theory is 'NFS TCP socket receive buffers', which would explain why it seems to need a bunch of clients all active.)

Then I asked myself an uncomfortable question: was this going to actually help us? Specifically, was it particularly likely to get us any closer to having OmniOS NFS fileservers that did not lock up under surges of too-high load? The more I thought about that, the more gloomy I felt, because the cold hard answer is that knowing the root cause here is unlikely to do us any good.

Some issues are ultimately due to simple and easily fixed bugs, or turn out to have simple configuration changes that avoid them. It seems unlikely that either are the case here; instead it seems much more likely to be a misdesigned or badly designed part of the Illumos NFS server code. Fixing bad designs is never a simple code change and they can rarely be avoided with configuration changes. Any fix is likely to be slow to appear and require significant work on someone's part.

This leads to the really uncomfortable realization that it is probably not worth spelunking this issue to explore and test any of these theories. Sure, it'd be nice to know the answer, but knowing the answer is not likely to get us much closer to a fix to a long-standing and deep issue. And what we need is that fix, not to know what the cause is, because ultimately we need fileservers that don't lock up every so often if things go a little bit wrong (because things go a little bit wrong on a regular basis).

This doesn't make me happy, because I like diagnosing problems and finding root causes (however much I gripe about it sometimes); it's neat and gives me a feeling of real accomplishment. But my job is not about feelings of accomplishment, it's about giving our users reliable fileservice, and it behooves me to spend my finite time on things that are most likely to result in that. Right now that does not appear to involve diving into OmniOS kernel internals or coming up with clever ways to test theories.

(If we had a lot of money to throw at people, perhaps the solution would be 'root cause the problem then pay Illumos people to do the kernel development needed to fix it'. But we don't have anywhere near that kind of money.)

sysadmin/KnowingCausesIsNoCure written at 01:32:45


OmniOS as a NFS server has problems with sustained write loads

We have been hunting a serious OmniOS problem for some time. Today we finally have enough data that I feel I can say something definitive:

An OmniOS NFS server will lock up under (some) sustained write loads if the write volume is higher than its disks can sustain.

I believe that this issue is not specific to OmniOS; it's likely Illumos in general, and was probably inherited from OpenSolaris and Solaris 10. We've reproduced a similar lockup on our old fileservers, running Solaris 10 update 8.

Our current minimal reproduction is the latest OmniOS (r151014) on our standard fileserver hardware, with 1G networking added and with a test pool of a single mirrored vdev on two (local) 7200 RPM 2TB SATA disks. With both 1G networks being driven at basically full wire speed by a collection of NFS client systems writing out a collection of different files on that test pool, the system will run okay for a while and then suddenly enter a situation where system free memory nosedives abruptly and the amount of kernel memory used for things other than the ARC jumps massively. This leads immediately to a total system hang when the free memory hits rock bottom.

(This is more write traffic than the disks can sustain due to mirroring. We have 200 MBytes/sec of incoming NFS writes, which implies 200 MBytes/sec of writes to each disk. These disks appear to top out at 150 MBytes/sec at most, and that's probably only a burst figure.)

Through a series of relatively obvious tests that are too long to detail here (eg running only one network's worth of NFS clients), we're pretty confident that this system is stable under a write load that it can sustain. Overload is clearly not immediate death (within a few seconds or the like), so we assume that the system can survive sufficiently short periods of overload if the load drops afterwards. However we have various indications that it does not fully recover from such overloads for a long time (if ever).

(Death under sustained overload would explain many of the symptoms we've seen of our various fileserver problems (eg). The common element in all of the trigger causes is that they cause (or could cause) IO slowdowns; backend disks with errors, backend disks that are just slow responding, full pools, or even apparently pools hitting their quota limits, even 10G networking problems. A slowdown of IO would take a fileserver that was just surviving a current high client write volume and push it over the edge.)

The memory exhaustion appears to be related to a high and increasing level of outstanding incomplete or unprocessed NFS requests. We have some indication that increasing the number of NFS server threads helps stave off the lockup for a while, but we've had our test server lock up (in somewhat different test scenarios) with widely varying thread counts.

In theory this shouldn't happen. An NFS server that is being overloaded should push back on the clients in various ways, not enter a death spiral of accepting all of their traffic, eating all its memory, and then locking up. In practice, well, we have a serious problem in production.

PS: Yes, I'll write something for the OmniOS mailing lists at some point. In practice tweets are easier than blog entries, which are easier than useful mailing list reports.

PPS: Solaris 11 is not an option for various reasons.

solaris/OmniOSNFSOverloadProblem written at 01:11:19


I'm considering ways to mass-add URLs to Firefox's history database

I wrote yesterday about how I keep my browser history forever, because it represents the memory of what I've read. A corollary of this is that it bugs me if things I've read don't show up as visited URLs. For example, if all of the blog entries and so on here at Wandering Thoughts were to turn unvisited tomorrow, that'd make me twitch every time I read something here and saw a blue link that should instead be visited purple.

(One of the reasons for this is that links showing visited purple is a sign that they point to the right place. Under normal circumstances, if links on Wandering Thoughts suddenly go blue, something has probably broken. And when I'm drafting entries, a nominal link to an older entry that shows blue is a sign that I got the link wrong.)

Which winds up with the problem: Wandering Thoughts and indeed this entire site is in the process of moving from HTTP to HTTPS. The HTTP versions of all of the entries and so on are in my Firefox history database, but Firefox properly considers the HTTPS version to be a completely different URL and so not in the history. So, all of a sudden, all of my entries and links and so on are unvisited blue. At one level this is not a problem. After all, I know that I've read them all (I wrote them). In theory, I could leave everything here alone, then maybe re-visit links one by one as I use them in new entries or otherwise run across them. But the whole situation bugs me; by now, seeing all the links be purple is reassuring and the way things should be, while blue links here make me twitch.

Conceptually the fix is simple. All I have to do is get every HTTP URL for here out of my existing history database, mechanically turn the 'http:' into 'https:', and then add all of the new URLs to Firefox's history database. All of the last visited and so on values can be exactly copied from the HTTP version of the URL. The only problem is that as far as I know there is no tool or extension for doing this.

(There are plenty of addons for removing history entries, which is of course exactly the opposite of what I want.)

These days, Firefox's history is in a SQLite database (places.sqlite in your profile directory). There are plenty of tools and packages to manipulate SQLite databases, which leaves me with merely the problem of figuring out what actually goes into a history entry in concrete detail (and then calculating everything that isn't obvious). So all of this is achievable, but on the other hand it's clearly going to be a bunch of work.
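The reading half of this is straightforward with Python's sqlite3 module; a sketch (the moz_places column names here come from the Places documentation and may not match your Firefox version, the prefix is a placeholder, and you should always work on a copy of places.sqlite):

```python
import sqlite3

def https_versions(db_path, prefix="http://example.org/"):
    """Return the would-be https:// twins of every http:// URL under
    prefix in a Places database. This is only the easy half; actually
    inserting new history rows also means filling in guid, frecency,
    and moz_historyvisits rows, which this sketch skips entirely."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT url FROM moz_places WHERE url LIKE ?",
            (prefix + "%",))
        return ["https://" + url[len("http://"):] for (url,) in rows]
    finally:
        conn.close()
```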

(While the Places database is documented, parts of this documentation are out of date. In particular, current Firefox places.sqlite has a unique guid field in the moz_places table.)

PS: The other obvious nefarious hack is to literally rewrite the URLs in all current history entries to be 'https:' instead of 'http:', possibly by dumping and then reloading the moz_places table. Assuming that you can change the URL schema without invalidating any linkages in the database, this is simple. Unfortunately it has a brute force inelegance that makes me grumpy; it's clearly the expedient fix instead of the right one.

web/FirefoxAddHistoryDesire written at 23:36:51

Why I have a perpetual browser history

I've mentioned in passing that I keep my browser's history database basically forever, and I've also kind of mentioned that it drives me up the wall when web sites make visited links and unvisited links look the same. These two things are closely related.

Put simply, the visited versus unvisited distinction between links is a visible, visual representation of your current state of dealing with a (good) site. A visited link tells you 'yep, I've been there, no need to visit again'; an unvisited link tells you that you might want to go follow it. This representation of state is very important because otherwise we must fall back on our fallible, limited, and easily fooled human memories to try to keep track of what we've read and haven't read. This fallback is both error-prone and a cognitive load; mental effort you're spending to keep track of what you've read is mental effort you can't use on reading.

Of course this doesn't work on all sites (and doesn't work all the time even on 'good' sites). I'm sure you can come up with any number of sites and any number of ways that this breaks down, and so the visited versus unvisited state of a page is not important or useful information. But it works well enough on enough sites to be extremely useful in practice, at least for me.

And this is why I want my browser history to last forever. My browser history is the collected state representation of what I have and haven't read. It tracks things not just now, in my currently active browsing session as I work through something, but also back through time, because I don't necessarily forget things I've read long ago (but at the same time I don't necessarily remember them well enough to be absolutely confident that I've already read them). For that matter, I don't always get through big or deep sites in one go, so again the visited link history is a history of how far I've gotten in archives or reference articles or the like.

There is nothing else on the web that can give me this state recall, nothing else that serves to keep track of 'how far have I gotten' and 'have I already seen this'. The web without it is a much more spastic and hyperactive place. It's a relatively more hyperactive place if I only have a short-term state recall; I really do want mine to last basically forever.

(In fact for me anything without a read versus unread state indicator is an irritatingly spastic and hyperactive place. All sorts of things are vastly improved by having it, and lack of it causes me annoyance (and that example is on the web).)

web/BrowserHistoryForever written at 00:14:42


The 'EHLO ylmf-pc' plague of SMTP authentication guessers

If you run a mail server on the Internet and look at your logs, you may have noticed a lot of connections from machines that EHLO with the name ylmf-pc. There are many pages about this on the web, and the general consensus is that this is some sort of long standing brute force SMTP authentication guessing botnet or piece of software. Whatever it is, it's quite annoying and may also be unevenly distributed in action.

(I've mentioned them before in passing.)

I can't say with any confidence what it is, because it also seems to be pretty dumb and limited. Our new authenticated SMTP server doesn't offer authentication before you STARTTLS, but it will afterwards. This can't be an uncommon configuration, yet I see a whole plague of ylmf-pc machines connecting to it and then immediately disconnecting without trying anything more (and in particular without STARTTLS). It's as if they connect, examine the EHLO response, see no authentication advertised, and then immediately disconnect.

Of course, that's when the real annoyance comes in; these machines aren't content with doing this just once. Oh no. A ylmf-pc machine will do this same connect, EHLO, then disconnect cycle over and over and over again, very fast. Our logs typically show multiple connects and disconnects a second. We have firewall connection limiters that cut in to temporarily block these IPs, but otherwise a ylmf-pc machine will also keep doing this for quite a while. This creates quite a bunch of log spam, even with the firewall blocking IPs for us.

I was going to confidently say that the ylmf-pc plague hits some of our machines much more than other ones and speculate about why, but it turns out that I can't; our inbound MX gateway doesn't even log machines that do this connect then disconnect game, so I can't tell whether or not the ylmf-pc brigade is ignoring them. They do seem to do at least a little bit of scanning of the Internet in general, but they also seem much more concentrated on machines with MX entries and machines with suggestive DNS names (such names seem to cause spammers to show up fast, although I haven't tried a scientific test of this).

(This is apparently the signature of a botnet called 'PushDo' or 'Cutwail', per this stackoverflow question and answer (also). The oldest mention I can find in my own logs is November of 2013, but it looks like this pattern may go back to 2012 and possibly earlier.)

spam/YlmfPcPlague written at 01:37:19


There's no portable way to turn a file descriptor read only or write only

It all started when John Regehr asked a good question in a tweet:

serious but undoubtedly stupid question: why does writing to file descriptor 0 in Linux and OS X work?

My first impulse was to say 'lazy code that starts with a general read/write file descriptor and doesn't bother to make it read only when the fd becomes a new process's standard input', but I decided to check the manual pages first. Much to my surprise it turns out that in Unix there is no portable way to turn a read/write file descriptor into a read-only or write-only one.

In theory the obvious way to do this is with fcntl(fd, F_SETFL, O_RDONLY) (or O_WRONLY as applicable). In practice, this is explicitly documented as not working on both Linux and FreeBSD; on them you're not allowed to affect the file access mode, only things like O_NONBLOCK. It's not clear if this behavior is compliant with the Single Unix Specification for fcntl(), but either way it's how a very large number of real systems behave in the field today so we're stuck with it.
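You can see the documented Linux behavior directly from Python; a demonstration (this assumes a Linux or FreeBSD machine, where the manual pages say the access-mode bits are ignored):

```python
import fcntl
import os
import tempfile

# Try to make a read/write file descriptor read-only via F_SETFL.
# On Linux (and FreeBSD) the access-mode bits are silently ignored.
ACCMODE = getattr(os, "O_ACCMODE", 3)   # access-mode mask; 3 on Linux

fd, path = tempfile.mkstemp()                  # mkstemp opens read/write
fcntl.fcntl(fd, fcntl.F_SETFL, os.O_RDONLY)    # attempt the downgrade
mode = fcntl.fcntl(fd, fcntl.F_GETFL) & ACCMODE
# mode is still os.O_RDWR: our request changed nothing
os.close(fd)
os.unlink(path)
```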

This means that if you have, say, a shell, the shell cannot specifically restrict plain commands that it starts to have read-only standard input and write-only standard output and standard error. The best it can do is pass on its own stdin, stdout, and stderr, and if they were passed to the shell with full read/write permissions the shell has to pass them to your process with these permissions intact and so your process can write to fd 0. Only when the shell is making new file descriptors can it restrict them to be read only or write only, which means pipelines and file redirections.

Further, it turns out that in a fair number of cases it's natural to start out with a single read/write file descriptor (and in a few it's basically required). For one example, anything run on a pseudo-tty that was set up through openpty() will be this way, as the openpty() API only gives you a single file descriptor for the entire pty and obviously it has to be opened read/write. There are any number of other cases, so I'm not going to try to run through them all.

(At this point it may also have reached the point of backwards compatibility, due to ioctl() calls on terminals and ptys. I'm honestly not sure of the rules for what terminal ioctls need read and/or write permissions on the file descriptors, and I bet a bunch of other people aren't either. In that sort of environment, new programs that set up shells might be able to restrict fds 0, 1, and 2 to their correct modes but don't dare do so lest they break various shells and programs that have gotten away with being casual and uncertain.)

PS: If you want to see how a shell or a command's descriptors are set up, you can use lsof. The letter after the file descriptor's number tells you its mode: r for read, w for write, or u for read/write.

unix/FdPermissionsLimitation written at 00:29:11


The fading out of tcpwrappers and its idea

Once upon a time, tcpwrappers were a big thing in (Unix) host security. Plenty of programs supported the original TCP Wrapper library by Wietse Venema, and people wrote their own takes on the idea. But nowadays, tcpwrappers is clearly on the way out. It doesn't seem to be used very much any more in practice, fewer and fewer programs support it at all, and of the remaining ones that (still) do, some of them are removing support for it. This isn't exclusive to Wietse Venema's original version; the whole idea and approach just doesn't seem to be all that popular any more. So what happened?

I don't know for sure, but I think the simple answer is 'firewalls and operating system level packet filtering'. The core idea of tcpwrappers is application level IP access filtering, and it dates from an era where that was your only real choice. Very few things had support for packet filtering, so you had to do this in the applications (and in general updating applications is easier than updating operating systems). These days we have robust and well developed packet filtering in kernels and in firewalls, which takes care of much of the need for tcpwrappers stuff. In many cases, maintaining packet filtering rules may be easier than maintaining tcpwrappers rules, and kernel packet filtering has the advantage that it's centralized and so universally 'supported' by programs; in fact programs don't have any choice about it.

(Kernel packet filters can't do DNS lookups the way that tcpwrappers can, but using DNS lookups for anything except logging has fallen out of favour these days. Often people don't even want to do it for logging.)

Having written some code that used libwrap, I think that another issue is that the general sort of API that Venema's tcpwrappers has is one that's fallen out of favour. Even using the library, what you get is basically a single threaded black box. This works sort of okay if you're forking for each new connection, but it doesn't expose a lot of controls or a lot of information and it's going to completely fall down if you want to do more sophisticated things (or control the DNS lookups it does). Basically Venema's tcpwrappers works best for things that you could at least conceive of running out of inetd.

(It's not impossible to create an API that offers more control, but then you wind up with something that is more complex as well. And once you get more complex, what programs want out of connection matching becomes much more program-specific; consider sshd's 'Match' stuff as contrasted with Apache's access controls.)

Another way of putting it is that in the modern world, we've come to see IP-level access control as something that should be handled outside the program entirely or that's deeply integrated with the program (or both). Neither really fits the tcpwrappers model, which is more 'sitting lightly on top of the program'.

(Certainly part of the decline of tcpwrappers is that in many environments we've moved IP access controls completely off end hosts and on to separate firewalls, for better or worse.)

sysadmin/TcpwrappersFadeout written at 03:03:53
