Wandering Thoughts: Recent Entries


2012-05-19

A semi-brief history and overview of X fonts and font rendering technology

Because I've just been mucking around in this swamp, I feel like writing all of this down.

In the very beginning, X had some simple bitmap fonts in the server with equally simple names that were basically labels; ie, the X server attached no particular meaning to the font names, which were created by people mostly to be short and sometimes to be vaguely meaningful. Some of these font names linger on in X today; the most well known one is fixed, the name for the default monospace font (and font size), but there are others like 6x10 and 6x13bold. The X clients just asked the server to display some text in the named font, and the server handled all of the rendering and drawing.

(Today these font names are implemented as aliases for other fonts.)

As the X server evolved, it grew more bitmap fonts. Fonts from different vendors, fonts in more sizes, proportional fonts as well as monospaced ones, fonts for different resolutions (75dpi versus 100dpi), fonts for different character set encodings, and so on. It was clear that ad-hoc font names weren't going to scale because no one was going to be able to keep fonts straight or find one. So the X people invented a naming convention for their fonts, the X logical font description (XLFD). In theory an XLFD name describes most of the important attributes of a font, things like the point and pixel size, the slant, the style, whether it's proportional or monospaced, and so on. Along with XLFD names and their defined structure, the X people introduced the idea of wildcard matches so that X programs could say 'I don't really care which vendor it comes from, I just want whatever 15 pixel monospaced font you have'.

(For backwards compatibility, the original simple X font names were defined to be acceptable XLFD font names (although you can't use wildcards with them).)

Initially XLFD fonts were still all bitmapped fonts, just better named and more numerous than before. However the X server soon got additional font rendering support that let it handle several scalable font formats of the time. Scalability was implemented in XLFD font names in a straightforward way; if the font name itself had zeros in various resolution related fields, you knew that the X server could and would render it at whatever pixel size you asked for.
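
To make the XLFD naming scheme concrete, here is a minimal sketch of splitting a name into its fields, doing a wildcard match, and checking for the "zeros mean scalable" convention. The field labels, the sample font names, and the helper functions are all illustrative assumptions, not anything from the X source; the parser also assumes no field contains a hyphen, and which fields must be zero for a scalable font is my reading of the convention.

```python
import re

# The 14 fields of an XLFD name, in order. The labels are informal.
XLFD_FIELDS = (
    "foundry", "family", "weight", "slant", "setwidth", "add_style",
    "pixel_size", "point_size", "res_x", "res_y", "spacing",
    "avg_width", "registry", "encoding",
)

def parse_xlfd(name):
    """Split a full XLFD name into a dict of its fields."""
    if not name.startswith("-"):
        raise ValueError("not an XLFD name: %r" % name)
    parts = name[1:].split("-")
    if len(parts) != len(XLFD_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(XLFD_FIELDS), len(parts)))
    return dict(zip(XLFD_FIELDS, parts))

def xlfd_matches(pattern, name):
    """Case-insensitive match where '*' is any run of characters and '?' is one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, name, re.IGNORECASE) is not None

def is_scalable(name):
    """Assumed convention: zero pixel size, point size, and average width."""
    fields = parse_xlfd(name)
    return all(fields[key] == "0" for key in ("pixel_size", "point_size", "avg_width"))

# Sample names, used only for illustration.
bitmap_name = "-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1"
scalable_name = "-adobe-courier-medium-r-normal--0-0-0-0-m-0-iso8859-1"

# 'I don't care about the vendor, give me a 13 pixel character-cell font.'
any_vendor_13px = "-*-*-medium-r-*-*-13-*-*-*-c-*-iso8859-1"

print(parse_xlfd(bitmap_name)["pixel_size"])
print(xlfd_matches(any_vendor_13px, bitmap_name))
print(is_scalable(bitmap_name), is_scalable(scalable_name))
```

A real X server does the matching itself against its font path; the point here is only that the name is a fixed sequence of hyphen-separated fields, so wildcards can stand in for the fields you don't care about.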

At some point, people observed that the precious X server was spending a bunch of time and memory loading, parsing, rendering, and so on all of these fonts, even for fonts that weren't very actively used. So the X people decided to offload a lot of this work to a separate daemon, the X font server (xfs), although in the usual manner of X they left all of the direct server side font rendering code intact too. An X font server could (in theory) be isolated from the X server and if necessary restarted independently; all of the hairy and CPU consuming code for scalable font rendering could be parked in that separate process, and so on. The X font server could handle all font formats that the X server could, and thus could render all XLFD fonts for the server. Although the X server delegated actual font rendering to another process, X font handling was still done in the server; clients still just told the server 'draw text X in font Y'.

(Generally xfs could be pointed to exactly the same font directories as the X server had previously been using directly.)

The next evolutionary step in X font handling was to move it to the client side, which marked (and marks) a stark division in X font handling. This is XFT and 'XFT fonts'. XFT is to a significant degree glue; it uses FontConfig to translate from font names (and attributes) to actual concrete font data files, then FreeType to turn text into picture data and draws the picture data using various X bits.

Technically and theoretically XFT and its pieces still support old X bitmapped fonts. Practically they do not; XFT and XFT-using programs really expect fully scalable fonts, generally ones with a wide glyph selection, and have basically no patience or tolerance for bitmapped fonts that are available in only a few point sizes with only a few glyphs. With heroic work in FontConfig configuration files you can sort of get something limping along, but in practice moving to XFT fonts means no more bitmap fonts.

(Yes, I have tried this experiment. It's especially unsatisfactory for 'frankenfonts', ones where the real font is only available in a few pixel sizes and you were already filling in other pixel sizes with substitutes. The XLFD configuration system is much better for this.)

Generally, the system FontConfig configuration will look for fonts in all of the X server font directories with scalable fonts, or at least all of the directories that are considered to include 'good' fonts. This makes scalable XLFD fonts available to modern XFT-using clients, although under somewhat different names.

(TrueType fonts will generally render the same in XLFD and XFT form because the X server and the X font server long ago were set up to render them with FreeType. Any remaining differences in appearance are due to rendering decisions made differently between FontConfig and the X server environment. I'm not sure how older scalable font formats come out, and generally you don't want to use those fonts anyways.)

So today X has two separate font technologies: XLFD fonts and XFT fonts. XLFD fonts are configured through xset's font path directives (the modern coolness is catalogues) and perhaps your xfs configuration file (if xfs support is still working in your X server, which it probably isn't). XLFD fonts and font directories exist only on the X server machine (or perhaps the xfs machine) and properly set up font directories have fonts.dir, fonts.scale, and/or fonts.alias files that describe their contents. These are the only bitmap fonts still supported and XLFD bitmap fonts usually work well at small pixel sizes.

(This is especially true of monospaced bitmap fonts, many of which were extensively tuned for high readability and high density at relatively small pixel sizes.)

XFT fonts are configured through FontConfig. Magic happens, and it happens differently on different machines. On the good side, you can generally put new font files into $HOME/.fonts and they will automatically be recognized for you. FontConfig has various sorts of support for font aliases, but they hide out in magic places like the XML configuration files in /etc/fonts (if you want to see some of it, look for how many things in /etc/fonts/conf.d want to lay claim to be 'monospace', the default monospaced font). Most XFT fonts are TrueType fonts (at least the modern ones that you want to use) and do not work well at small pixel sizes.

XLFD font support is pervasive in old X programs but scattered and increasingly absent in modern ones. XFT support is the inverse; uncommon in old X programs and old Unixes (such as Solaris), but increasingly common in modern X programs (especially ones that are part of a pervasive environment like Gnome or KDE, where it is ubiquitous), on Linux, and at least partially on other Unix OSes such as FreeBSD. Some programs and frameworks make an effort to support both XLFD fonts and XFT fonts, but many are XFT-only.

(Today's expedition into this swamp was started by Tk 8.5, which can be compiled to support XLFD fonts or XFT fonts but not both at once. You can guess which option modern Linux distributions have picked.)

unix/XFontTypes written at 03:27:39

2012-05-17

My Firefox memory bloat was mostly from All-in-One Gestures

It's time for an update to my prior Firefox situation (one, two). After some experimentation it's become clear that most of my Firefox problems with constant memory growth and zombie compartments were due to my use of All-in-One Gestures (as I kind of suspected it might be). I've switched to FireGestures instead (initially as an experiment and now full time on all of my various Firefox instances on various different machines) and things have been much better; there are no zombie compartments at all and memory growth seems to have dropped significantly (although it's not clear yet if it's completely gone). And I haven't run into any problems or bugs this time around; everything has just worked the way I expected.

(A-i-O doesn't seem to have been the only problem I had; for example, it seems to be a bad idea to leave a tab or window sitting around with an embedded Youtube video. It's also not clear if Firefox Nightly behaves well for me in general because I haven't been able to leave it running for multiple days yet.)

In addition to less memory usage, FireGestures also seems to simply be more responsive and snappy than A-i-O. It certainly has more useful features, including the ability to add gestures without needing to hack the source code, a library of existing additional gestures (including the one that I wanted), and the ability to 'back up' and 'restore' your settings (which for me really means the ability to easily synchronize my gestures between multiple Firefox instances).

(See the FireGestures homepage for more information on all of this.)

So FireGestures is now one of my core extensions, replacing All-in-One Gestures in the previous list.

The one drawback of FireGestures is that it doesn't work in Firefox 3.6; my laptop is still running Fedora 14 with this Firefox release (because that's the last one with Gnome 2 instead of Gnome 3). I don't consider this a real drawback, but you may.

PS: people migrating from All-in-One Gestures to FireGestures might want to use Down-Right-Down to call up the A-i-O information display that shows all of your gestures and then save it (as an HTML page, which is what it is). You can then conveniently look at it later when you're using FireGestures.

(I am far too impatient to try to retrain years of reflexes to use the native FireGestures gestures for various actions; I just ruthlessly rewrote them to be the A-i-O gestures I'm used to.)

web/Firefox12Gestures written at 16:11:51

The Go language's problem on 32-bit machines

Recently (for my value of recently) there was somewhat of a commotion of people declaring that Go wasn't usable in production on 32-bit systems because its garbage collection was broken and it would eat all of your memory. Naturally I was interested in this and spent some time digging in to the reports and trying to understand the situation. Today I'm going to try to write down as much as I know about what's going on to get it straight in my head, which is going to involve a trip into the fun land of garbage collection.

To simplify a bit, the purpose of garbage collection is to automatically free up memory that's no longer used. The GC technique everyone starts with is reference counting but since it has various problems (including dealing with circular references) most people soon upgrade to more complex schemes based on inverting the problem: rather than noticing when something stops being used, the garbage collection system periodically finds all of the memory that's still actively used and then frees everything else. This is 'tracing garbage collection' (and garbage collectors), so called because the garbage collector 'traces' all live objects.

One deep but unsexy problem in garbage collection is how your GC system knows what fields in your objects refer to other objects and what fields are just primitive types like numbers, memory buffers, strings, or the like, and how it does this efficiently. This can be a particular issue for a system language where you probably want to have structures and objects that are as simple and dense as possible, with as little overhead from type annotations, inefficient 'boxed' representations, and so on as possible. One solution is to maintain a separate bitmap of what words in an allocated memory area are actually pointers (which the GC can then scan efficiently, and which can be set by the runtime when an object is allocated). Another solution is what gets called 'conservative garbage collection'. The fundamental idea is that in conservative GC, we are willing to over-estimate references (and thus wind up not freeing some unused memory); rather than insisting on knowing about references, the GC system simply scans through allocated memory looking for anything that might be a pointer to an allocated object. If it finds one, it conservatively declares that the object is still alive and traces things from there.
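
As a toy illustration of the conservative approach (and emphatically not Go's actual collector), here is a sketch of a mark phase that treats every word as a possible pointer. The heap layout, the addresses, the word size, and the decision to honour pointers into the middle of an object are all assumptions made up for the example.

```python
WORD = 4  # bytes per word on a hypothetical 32-bit machine

class Heap:
    """A toy heap: each object is a list of integer words at a start address."""

    def __init__(self):
        self.objects = {}  # start address -> list of words

    def alloc(self, addr, words):
        self.objects[addr] = list(words)

    def find_object(self, value):
        """Return the start of the object that `value` points into, or None."""
        for start, words in self.objects.items():
            if start <= value < start + len(words) * WORD:
                return start
        return None

def conservative_mark(heap, roots):
    """Mark everything reachable when any word that looks like a pointer counts."""
    live = set()
    pending = list(roots)
    while pending:
        value = pending.pop()
        start = heap.find_object(value)
        if start is None or start in live:
            continue
        live.add(start)
        # We have no type information, so every word is a candidate pointer.
        pending.extend(heap.objects[start])
    return live

heap = Heap()
heap.alloc(0x1000, [0x2000, 42])   # A: holds a real pointer to B
heap.alloc(0x2000, [7, 0x3004])    # B: 0x3004 is meant to be a plain integer
heap.alloc(0x3000, [0, 0])         # C: nothing really refers to it
heap.alloc(0x4000, [1, 2])         # D: garbage

live = conservative_mark(heap, roots=[0x1000])
garbage = set(heap.objects) - live
print(sorted(hex(a) for a in live))
print(sorted(hex(a) for a in garbage))
```

Object D is correctly found to be garbage, but C is kept alive purely because an integer inside B happens to fall within C's address range. A collector with a pointer bitmap would know that word is not a pointer and would free C as well.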

Go was initially designed as a system language, although it's no longer described as one. As such, one of the tradeoffs the language designers made is that Go more or less uses conservative garbage collection, as far as I understand, at least for memory areas that may contain pointers (some static data that's known to be pointer free may be skipped by the conservative GC). Although there's said to be the start of a more efficient word-bitmap implementation for Go objects, it's not currently usable by the GC (and may not be fully live).

(As far as I can tell from commentary, Go's garbage collector only scans Go's own memory areas; it doesn't make any attempt to scan memory used by outside libraries or code to find references to Go objects. Runtime code that passes a pointer to a Go object to an outside function is apparently required to keep the object alive inside Go, for example by hooking it into a global variable.)

The problem with conservative GC is that it over-estimates memory still in use because it finds false 'references', things that look like pointers to allocated objects that aren't actually that. There are a number of factors that make conservative GC worse:

  • the more of your address space is in use for language objects, the more random values can look like references to them. If half of the address space is your objects, half of all properly aligned N-bit patterns look like pointers to your objects (where N is the size of a pointer).
  • the smaller the address space is in general, the more of it you're going to fill up with your objects for the same amount of memory use. Two GB of objects is half of the 32-bit address space but a tiny fraction of the 64-bit address space.

  • the larger your individual objects are, the more memory a single 'reference' somewhere inside one will prevent from being freed.
  • similarly, the more other objects a single object refers to, the more memory will be held down by a single spurious reference to the top object.
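
The first two factors are simple arithmetic, sketched below. The function name is made up, and the model (a uniformly random, properly aligned word) is an idealization; real program data is not uniformly random.

```python
def false_pointer_fraction(heap_bytes, pointer_bits):
    """Fraction of aligned pointer-sized bit patterns that land inside the heap.

    Alignment cancels out: it shrinks the set of candidate patterns and the
    set of in-heap targets by the same factor.
    """
    return heap_bytes / 2 ** pointer_bits

GB = 2 ** 30
heap = 2 * GB

frac32 = false_pointer_fraction(heap, 32)
frac64 = false_pointer_fraction(heap, 64)

print("32-bit: %.3f" % frac32)
print("64-bit: %.3e" % frac64)
```

With two GB of objects, half of all aligned 32-bit patterns look like pointers into the heap, while on a 64-bit machine the fraction is 2^-33, roughly one in eight billion.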

Many of these factors are apparently quite bad for 32-bit Go programs that use a significant amount of memory, especially programs with large objects or with objects that the garbage collector treats conservatively. They are drastically reduced on 64-bit machines, where you would generally have to be unlucky in order for the conservative GC to accidentally hold a significant amount of memory busy. However, the problem could still happen with 64-bit Go; it's just less likely.

(The general reference for this is Go language issue 909.)

At this point I have no articulate personal reactions to all of this. As a pragmatic matter I'm not exactly writing Go programs right now for various reasons (although I keep vaguely wanting to because I like Go in the abstract), so if I'm being honest it's all kind of theoretical.

(My problem with Go in practice is partly that I have nothing to really use it on. I need to find a project that calls out for it instead of anything else.)

Sidebar: the 32-bit Windows issue

There's also an issue on Windows machines due to memory fragmentation (via Hacker News). When it starts, the Go runtime tries to allocate a contiguous 512 Mbyte region of virtual address space. Sometimes on Windows machines enough DLLs have loaded in enough places by this point that there isn't such a contiguous chunk of address space left any more, the allocation fails, and the Go runtime immediately exits with an error.

(In theory this sort of address space fragmentation could happen on any 32-bit OS, but apparently Windows is uniquely susceptible for various reasons.)
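
To see how fragmentation can defeat a large contiguous allocation even when most of the address space is free, here is a small sketch. The 2 GB user address space, the positions and sizes of the "DLLs", and the function name are all invented for illustration; only the 512 Mbyte figure comes from the description above.

```python
def largest_free_gap(space_size, reserved):
    """Size of the largest contiguous free range, given (start, size) reservations."""
    largest = 0
    cursor = 0
    for start, size in sorted(reserved):
        if start > cursor:
            largest = max(largest, start - cursor)
        cursor = max(cursor, start + size)
    if space_size > cursor:
        largest = max(largest, space_size - cursor)
    return largest

MB = 1 << 20
USER_SPACE = 2048 * MB   # assumed user address space of a 32-bit process
ARENA = 512 * MB         # the contiguous region the runtime wants

# Four hypothetical 16 MB mappings scattered through the address space ...
scattered = [(400 * MB, 16 * MB), (900 * MB, 16 * MB),
             (1400 * MB, 16 * MB), (1800 * MB, 16 * MB)]
# ... versus the same 64 MB packed together at the bottom.
packed = [(0, 64 * MB)]

for label, layout in (("scattered", scattered), ("packed", packed)):
    gap = largest_free_gap(USER_SPACE, layout)
    print(label, gap // MB, "MB free in one piece; arena fits:", gap >= ARENA)
```

Both layouts leave 1984 MB free in total, but in the scattered one the largest single hole is 484 MB, so a 512 MB contiguous reservation fails.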

programming/GoLang32BitProblem written at 03:19:19

2012-05-15

Some stuff on 'time since boot' timestamps

From today on Twitter:

@standaloneSA: Is it just me, or does it seem silly that the #NetFlow timestamp field for the flow references "ms since the router booted". Seems obtuse.

@thatcks: @standaloneSA It's probably easy to implement in the router and it creates an absolute ordering w/o worries about time going backwards.

In the Twitter way, this is a little bit cryptic so I'm going to elaborate on my guess here.

Suppose that routers were supposed to generate an absolute timestamp for their events instead of this relative one, for example UTC in milliseconds. This would create two problems.

First, routers would somehow need to know or acquire the correct UTC time (with millisecond resolution) and then maintain it. This is to some degree a solved problem but it adds complexity to the router. It also leads to the second problem, because a router is unlikely to boot with the correct UTC time (down to the millisecond).

The second problem is that the moment you have a system generating an absolute timestamp you need to deal with the certainty that the correct time, as the system sees it, will jump around. The router will boot with some idea of the UTC time but it's quite likely to be a bit off (remember that we're calling for millisecond accuracy here), then over time it will converge on the correct UTC time. As it does so, its version of UTC time may go forward abruptly, go backwards abruptly, or go forward more slowly than UTC time is really advancing. Backwards time jumps screw up event ordering completely, and all of the options screw up the true relative time between events; if you have two events timestamped UTC1 and UTC2, you actually have only a weak idea how long it is between them.

The valuable property that milliseconds since boot has is that it is a clear monotonic timestamp. It only ever goes forward and it goes forward at what should be a very constant rate, which means that it creates a clear order of events and a clear duration between any two events (well, for events from the same stream of monotonic timestamps). Monotonic timestamps are not a substitute for absolute time but neither is absolute time a substitute for monotonic timestamps; you really need both, which means that you need a map between them.

There are two possible places to build such a map: each device can do its own or it can be done in a central aggregator. I believe that the right answer is to do it in the central aggregator because this means that you have only a single version of absolute time, the aggregator's view (each device, aggregator included, may have a slightly different view of the current 'correct' absolute time for the reasons outlined above). Using only a single version of absolute time means that you have a single coherent map of all of the monotonic timestamps to (some) absolute time.

(Of course you need devices that generate monotonic timestamps to tell you when they reset their timestamps, eg when they boot.)
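
A minimal sketch of doing the mapping in the aggregator might look like the following. Everything here is hypothetical (the class, the method, the device name); it also guesses at resets by noticing the uptime going backwards, which is a stand-in for the device actually telling you that it rebooted.

```python
class UptimeMapper:
    """Map per-device 'ms since boot' stamps onto the aggregator's own clock."""

    def __init__(self):
        self._anchor = {}  # device -> (aggregator time in s, device uptime in ms)
        self._last = {}    # device -> last uptime seen, to notice resets

    def observe(self, device, uptime_ms, aggregator_now_s):
        """Record a stamp and return its estimated absolute time in seconds."""
        last = self._last.get(device)
        if last is None or uptime_ms < last:
            # First sighting, or the counter went backwards (assumed reboot):
            # pin this uptime value to the aggregator's current time.
            self._anchor[device] = (aggregator_now_s, uptime_ms)
        self._last[device] = uptime_ms
        anchor_s, anchor_ms = self._anchor[device]
        # Durations come from the device's monotonic counter, not from
        # whatever the aggregator's clock has done since the anchor.
        return anchor_s + (uptime_ms - anchor_ms) / 1000.0

mapper = UptimeMapper()
t1 = mapper.observe("router1", 5000, 1000.0)   # first event seen
t2 = mapper.observe("router1", 7500, 1003.1)   # 2.5 s later by the router
t3 = mapper.observe("router1", 200, 2000.0)    # uptime went backwards
print(t1, t2, t3)
```

The second event is placed 2.5 seconds after the first because that is what the router's counter says, even though the aggregator happened to receive it 3.1 seconds later. All devices share one version of absolute time, the aggregator's.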

My impression is that using elapsed time since boot is actually common in a number of environments. For example, Linux kernel messages are usually reported this way these days (which has its own issues if you're trying to work backwards to roughly when in absolute time something happened).
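
Working backwards is just a subtraction, sketched here with a made-up function name. On Linux the current uptime could come from the first field of /proc/uptime; the sketch takes it as a parameter instead so that it stays self-contained.

```python
def kernel_stamp_to_wall(stamp_s, now_wall_s, uptime_s):
    """Estimate the absolute time of a 'seconds since boot' stamp.

    The estimated boot time is 'now minus uptime', so the result inherits any
    error in the current wall clock and any drift between the two clocks.
    """
    boot_wall_s = now_wall_s - uptime_s
    return boot_wall_s + stamp_s

# A message stamped 120.5 s after boot, read on a machine that has been up
# for an hour, at an example wall-clock time given in Unix seconds.
when = kernel_stamp_to_wall(120.5, now_wall_s=1337000000.0, uptime_s=3600.0)
print(when)
```

This is only rough: if the wall clock was adjusted since boot, or the two clocks tick at slightly different rates, the answer is off by that much.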

tech/TimestampIssues written at 12:20:36

2012-05-14

My Firefox 12 extensions and addons

In light of yesterday's entry about my failed Firefox Nightly experiment and the potential that some of my extensions are the root cause of my Firefox problems, I'm going to run down the current set of Firefox extensions that I use in my main browser (updating previous discussions from the Firefox 7 era, which alarmingly was less than a year ago). This time around I'm going to group them by purpose:

Safe browsing:

  • NoScript to disable JavaScript for almost everything. I browse with JS blocked and only enable it selectively on sites when I have to (and almost always temporarily). I consider this more an issue of safety than of performance; I simply don't trust most JavaScript from most sites to not do things that will make me unhappy.

    (NoScript also takes care of blocking Flash, Java, and so on.)

  • CookieSafe 3.0.5, with the actual addon here. I browse through a filtering proxy and it blocks ordinary cookies, but it can't do anything about cookies I get over HTTPS or via JavaScript. I use CookieSafe to block those (there's some more explanation here).

    (For me, CookieSafe 3.1a10 has an explosive interaction with NoScript that hangs Firefox in some sort of infinite JavaScript loop, so I am still on 3.0.5 aka the 2011-12-10 version of CookieSafe.)

User interface:

  • All-in-One Gestures (specifically my tweaked version of it). I turn off A-i-O autoscroll because the native Firefox autoscroll is better (and has been for years). A-i-O hasn't been updated in ages but still seems to be the best, most reliable gesture extension in my brief experimentation.

    (FireGestures is actively developed but the last time I tried it there was an odd bug with changing font size settings; however, that was a while back. It would be my leading alternate here.)

    Update: All-in-One Gestures seems to have been a major cause of my Firefox memory bloat problems. I've now replaced it with FireGestures; see this update. I can no longer recommend it.

  • Status-4-Evar restores the old Firefox bottom status bar so that I can see the full display of link targets and have a useful page load status display.

Fixing annoying websites, especially Google's:

  • GreaseMonkey combined with the Google Link Cleanup user script to remove Google's tracking links from search results. I hate these tracking links with a burning passion for two reasons; first, I have no interest in letting Google know what search results I've followed and second, Google's tracking links screw up my history so that I can't see which search results I've already read and which are new.

  • Stylish combined with a number of mostly personally written styles to fix various website misdesigns. The most important is a version of this user style to disable the left option sidebar in Google searches (because I hate it and I use Google all the time). I also have Compact Google Reader in the Firefox instance I use with Google Reader, for similar reasons.

    (This entry and its comments have a bunch of discussion about ways to fix Google's layout issues.)

    I could probably replace my use of Stylish with more GreaseMonkey user scripts, but I started with Stylish and I prefer fixing things with CSS alterations rather than with JavaScript (even if the JavaScript just inserts CSS alterations). Certainly there seem to be plenty of 'fix Google stuff' GreaseMonkey user scripts, eg this one for Google Reader (which I have not tried).

Improving my life:

  • It's All Text! handily deals with how browsers make bad editors. The more I have it available the more I use it (and the longer comments and so on I wind up leaving, because I can actually edit them sensibly; this may not be a plus, all things considered).

Modern versions of Firefox also give you a JavaScript based PDF viewer addon for free. I have not done much with this and in fact currently have it turned off.

Of these extensions, I consider NoScript, All-in-One Gestures, GreaseMonkey, and Stylish to be completely essential. I can sort of live without the others, so as an experiment I am trying that to see if it makes a difference in Firefox memory usage and the number of zombie compartments that build up. If I am serious about this, I probably should migrate away from Stylish to GreaseMonkey for everything on the grounds that the latter is probably more actively used and maintained and so any leaks it has are more likely to get fixed promptly.

(Unfortunately I suspect that A-i-O is a likely candidate to be a leaky extension, since it hasn't been updated in ages.)

web/Firefox12Extensions written at 15:24:58

2012-05-13

My experiment with Firefox Nightly builds: a failure

Ever since my old Firefox build started crashing and I was forced to update to current versions, I've had serious memory issues with Firefox. I used to be able to leave Firefox running for weeks (or months) with basically stable memory usage. Now, Firefox will steadily bloat up from under a GB of resident memory at its initial steady state to, say, 1.5 GB in a few days at most. Although my current machine has 16 GB of RAM, Firefox progressively gets slower and slower as its resident memory grows; by the time it reaches around 1.5 to 1.6 GB resident the performance is visibly dragging and I have to restart.

Recently I stumbled across this Mozilla blog entry on Firefox memory usage, which discusses how current Firefox builds have changes that reduce memory leaks, especially a drastic reduction in zombie compartments (see this entry for more). Ever since I discovered the verbose about:memory information, I've noticed that I have zombie compartments that linger from my ordinary browsing; the longer I browse, the more zombie compartments build up. A Firefox change that actually dropped zombie compartments seemed very promising, certainly promising enough to build a current version of Firefox and see what happened.

(Thus this is not quite an experiment with the literal Nightly builds, although it should be very close; as far as I understand, they're built from the same source repository (see also) that I was using.)

Unfortunately, the experiment turned out to be mostly a failure, although a sort of interesting one; in some ways Firefox improved but in other ways it got significantly worse. I tweeted a cryptic short form version, and I feel like elaborating on it now.

What improved was Firefox's responsiveness as its resident memory grew. Firefox 12 visibly starts slowing down with as little as 1.2 or 1.3 GB of resident memory; the current Firefox code was still running almost as well as at start when it reached 2 GB or more of resident memory, and it might have kept going even as it bloated more. What did not improve was everything else. I still saw zombie compartments (probably just as many as before) and if anything Firefox memory usage grew faster than under Firefox 12, reaching 2 GB resident in a day or two. But the worst thing was that at home, Firefox would soon get into a state where it was constantly using CPU (apparently talking with the X server). In this state it would not shut down gracefully; I could quit Firefox and it would close all its windows, but the process would not exit and would continue consuming the CPU talking with the X server.

(I had to use 'kill -9' to get it to exit, and this happened more than once with builds across several days. It was also odd CPU usage; it showed clearly in top but did not affect the load average and didn't lag the X server that I could tell.)

Unclean shutdowns aren't something that I considered acceptable in this situation so I am now back to Firefox 12, memory bloat slowdown and all.

It's possible that the current Firefox codebase will improve as it marches towards release, eliminating the memory bloat and 100% CPU usage while preserving responsiveness as its memory usage grows. I could live with that and it certainly would be an improvement over the status quo. (In some ways, simply eliminating the CPU usage would be a bit of an improvement over the status quo, although I don't like Firefox consuming several GB of my RAM for no good reason.)

(Despite the result, I don't regret doing this experiment; it was worth trying and it didn't particularly explode in my face.)

Update, May 17th: It seems that most of my Nightly memory problems were probably due to a single old extension I was using. See this update.

Sidebar: dealing with this with Chrome or by disabling extensions

Chrome is not something I consider an acceptable alternative to Firefox, so switching to it is not an option.

One piece of advice the Mozilla people give about this sort of memory bloat is 'disable unnecessary addons'. Well, I don't have any of those; all of the addons I have loaded are ones that I consider either absolutely necessary (to the point where I would not browse without them) or important for how I use Firefox.

(I suppose there's one or two that I don't use very often, like It's All Text!, but it would be actively painful periodically.)

web/FirefoxNightly-2012-05-13 written at 21:36:57

A basic step in measuring and improving network performance

There is a mistake that I have seen people make over and over again when they attempt to improve, tune, or even check network performance under unusual circumstances. Although what set me off now is this well intentioned article, I've seen the same mistake in people setting off to improve their iSCSI performance, NFS performance, and probably any number of other things that I've forgotten by now.

The mistake is skipping the most important basic step of network performance testing: the first thing you have to do is make sure that your network is working right. Before you can start tuning to improve your particular case or start measuring the effects of different circumstances, you need to know that your base case is not suffering from performance issues of its own. If you skip this step, you are building all future results on a foundation of sand and none of them are terribly meaningful.

(They may be very meaningful for you in that they improve your system's performance right now, but if your baseline performance is not up to what it should be it's quite possible that you could do better by addressing that.)

In the very old days, the correct base performance level you could expect was somewhat uncertain and variable; getting networks to run fast was challenging for various reasons. Fortunately those days have long since passed. Today we have a very simple performance measure, one valid for any hardware and OS from at least the past half decade if not longer:

Any system can saturate a gigabit link with TCP traffic.

As I've written before in passing, if you have two machines with gigabit Ethernet talking directly to each other on a single subnet you should be able to get gigabit wire rates between them (approximately 110 MBytes/sec) with simple testing tools like ttcp. If you cannot get this rate between your two test machines, something is wrong somewhere and you need to fix it before there's any point in going further.
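
For a rough idea of where a figure like 110 MBytes/sec comes from, here is the framing arithmetic as a sketch. It assumes a standard 1500 byte MTU, IPv4 and TCP headers without extra options apart from TCP timestamps, and the usual Ethernet per-frame overhead; your links may differ (jumbo frames, VLAN tags, and so on).

```python
LINK_BITS_PER_SEC = 1_000_000_000  # gigabit Ethernet

MTU = 1500                 # bytes of IP packet per Ethernet frame
IP_HEADER = 20             # IPv4 without options
TCP_HEADER = 20            # TCP without options
TCP_TIMESTAMPS = 12        # assumed: timestamp option in every segment
ETH_HEADER = 14            # destination, source, ethertype
ETH_FCS = 4                # frame check sequence
PREAMBLE_AND_SFD = 8       # sent before every frame
INTERFRAME_GAP = 12        # mandatory idle time, counted in byte times

def tcp_goodput_bytes_per_sec():
    """Best-case TCP payload rate on a saturated link under the assumptions above."""
    payload = MTU - IP_HEADER - TCP_HEADER - TCP_TIMESTAMPS
    on_wire = MTU + ETH_HEADER + ETH_FCS + PREAMBLE_AND_SFD + INTERFRAME_GAP
    return LINK_BITS_PER_SEC / 8 * payload / on_wire

rate = tcp_goodput_bytes_per_sec()
print("%.1f million bytes/sec" % (rate / 1e6))
print("%.1f MiB/sec" % (rate / 2 ** 20))
```

Under these assumptions the ceiling is a bit under 118 million bytes per second, or about 112 if you count in units of 2^20 bytes, so a real test landing around 110 MBytes/sec is essentially at wire rate.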

(There are any number of places where the problem could be, but one definitely exists.)

I don't have an answer for what the expected latency should be (as measured either by ping or by some user-level testing tool), beyond that it should be negligible. Our servers range from around 150 microseconds down to 10 microseconds, but there's other traffic going on, multiple switch hops, and so on. Bulk TCP tends to smooth all of that out, which is part of why I like it for this sort of basic test.

As a side note, a properly functioning local network has basically no packet loss whatsoever. If you see any more than a trace amount, you have a problem (which may be that your network, switches, or switch uplinks are oversaturated).

The one area today where there's real uncertainty in the proper base performance is 10G networking; we have not yet mastered the art of casually saturating 10G networks and may not for a while. If you have 10G networks you are going to have to do your own tuning and measurements of basic network performance before you start with higher level issues, and you may have to deliberately tune for your specific protocol and situation in a way that makes other performance worse.

tech/NetworkPerfBasicStep written at 00:40:17

2012-05-12

The death of paging on the web

I've written about the problem of permanent headers and footers before (around a year ago), but I'm seeing more and more of them these days. What this confirms for me is that paging is dead on the modern web.

By this I don't mean long pages; I'm not one of those people who think that all of your content has to be 'above the fold', immediately visible without any scrolling (the available evidence from actual experimentation apparently says it doesn't have to be). What I mean is getting to that content by paging, advancing in nearly full page increments (usually by hitting the space bar in your browser). Given that permanent headers or footers (or both) screw this up, and given that permanent headers and footers are increasingly popular, I can only conclude that paging isn't really used any more; otherwise, header and footer based designs would be wretched experiences and would test badly (and on the modern web, people do at least do A/B tests).

Instead, I think that on the modern web everyone has scroll wheels (or some other way of scrolling, for example on tablets) and they scroll through articles and pages with them. Only an insignificant number of people still navigate with paging.

Now I'll add a personal confession here: since I started my scroll wheel mouse experiment, I've found myself increasingly scrolling web pages instead of paging them. I don't know why, but there's just something about it that feels right (and this is on pages without obnoxious headers and footers). I think that part of it is that the boundaries of things on the web page often don't align naturally with what I'd get by paging; by partially scrolling the page I can make things line up right (this is especially visible to me if the page content includes images).

(Looking back, I've had middle mouse button based scrolling in my browser for years and have used it too instead of paging. So I should have seen this one coming.)

I don't know what this means for web page design going forward, but I suspect that it means something (I also suspect that current web designers do know what it implies; I am not exactly current on the field). There have to be things you design differently if you expect almost everyone to scroll your page around so that things can catch their eye as they move past.

(I probably won't ever put a permanent header or footer on a page I design (at least not a full-width one), but that's a personal thing. Also it would have to be something awfully important to the page to deserve a permanent full-time presence in front of the viewer. My bias is that almost all headers and footers I've seen aren't that important; in fact, they're often rather presumptuous that way, which is part of the reason I dislike them.)

web/WebPagingDeath written at 02:13:02

2012-05-10

All your servers should have Linux's magic SysRq enabled

This is effectively another lesson learned from our recent building power shutdown. I will put it simply:

All of your servers should have magic SysRq enabled.

There are reasons to not do this on client machines (but not necessarily very good ones), but none on your servers (which certainly should have their hardware and consoles in a secure location).

What magic SysRq is good for on servers (above everything else) is giving you a last ditch chance to shut down or reboot the machine in something approaching an orderly way. I'm not just talking about if the system goes crazy, because it's also quite possible for ordinary system shutdowns to hang, especially if you're shutting down a group of systems that have complex NFS filesystem relationships and something went down out of order. If this happens and you don't have magic SysRq support available, you're plain out of luck; all you can do is pull the power and hope that nothing is going to explode because it hasn't been killed, had its data synced to disk, or whatever.

With magic SysRq you have at least a chance of doing something about this. You can force a kernel level sync, a kernel level read-only remount of as many filesystems as possible (magic SysRq remounts them read-only instead of actually unmounting them), and even hit processes with signals if you think it's going to do any good. And then you can reboot the machine (and afterwards, possibly pull the power to keep the machine down).
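On the console these are Alt-SysRq plus a key. If you still have a working root shell, the same commands can be sent by writing the key to /proc/sysrq-trigger, which makes the sequence easy to sketch. The function names here, and the SYSRQ_TRIGGER and SYSRQ_DELAY variables, are made up for illustration (they exist so that this can be tried against an ordinary file instead of the real thing).

```shell
#!/bin/sh
# Hypothetical helper: send magic SysRq commands through the /proc interface.
# SYSRQ_TRIGGER and SYSRQ_DELAY are made-up knobs so this can be tried safely
# against an ordinary file.
: "${SYSRQ_TRIGGER:=/proc/sysrq-trigger}"
: "${SYSRQ_DELAY:=5}"

sysrq_send() {
    printf '%s' "$1" > "$SYSRQ_TRIGGER"
}

# s = emergency sync, u = remount filesystems read-only, b = immediate reboot.
# Add 'e' (SIGTERM to everything but init) or 'i' (SIGKILL) before 's' if you
# think signalling processes will help.
sysrq_emergency_reboot() {
    for key in s u b; do
        sysrq_send "$key" || return 1
        sleep "$SYSRQ_DELAY"
    done
}
```

The pause between keys is there to give the sync and the remount a chance to finish; how long it needs to be is a guess and depends on your disks.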

PS: you should explicitly enable magic SysRq in your standard server install setup, even if your distribution normally defaults to leaving it on; distribution defaults can change over time. Also, note that if you have a serial console you generally need a getty listening on it in order to make magic SysRq work.

(You can check to see if magic SysRq is enabled by looking at the value of /proc/sys/kernel/sysrq; a 1 means that it is fully enabled and a 0 means that it isn't. Any other value is a bitmask that enables only some of the SysRq functions.)
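A minimal sketch of checking the setting from a script, and of turning it on explicitly, might look like the following. The sysrq_state function and the SYSRQ_FILE override are made-up names for illustration; the sysctl name kernel.sysrq is the real one.

```shell
#!/bin/sh
# Describe the magic SysRq setting. SYSRQ_FILE is a made-up override for testing.
: "${SYSRQ_FILE:=/proc/sys/kernel/sysrq}"

sysrq_state() {
    value=$(cat "$SYSRQ_FILE") || return 1
    case "$value" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "partially enabled (bitmask $value)" ;;
    esac
}

# To turn it on now (as root):      sysctl -w kernel.sysrq=1
# To keep it on across reboots, put this line in /etc/sysctl.conf:
#   kernel.sysrq = 1
```

Something like this is easy to drop into whatever you already use to audit your standard server setup.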

linux/ServersEnableMagicSysrq written at 16:28:49

2012-05-09

Using rsync to pull a directory tree to client machines

Suppose that you have a decent sized directory tree that you want some number of clients to mirror from a master server (with the clients pulling updates instead of the master pushing them), perhaps because you've just noticed undesired NFS dependencies. Things in the directory tree are potentially sensitive (so you want access control), it's updated at random, and it's not in a giant VCS tree or something; this is your typical medium-sized ball of local stuff. The straightforward brute force approach is to use rsync with SSH; give the clients special SSH identities, put them in the server's authorized_keys, and have them run 'rsync -a --delete' (or some close variant) to pull the directory tree over. However, this has the problem that normal rsync is symmetric; if you allow a client to pull from you, you also allow a client to push to you (assuming that the server side login has write access to the directory tree, and yes let's make that assumption for now).

(You also have to set the SSH access up so that the clients can't run arbitrary commands on the server.)

Rsync's solution to this is its daemon mode, which can be restricted to operate in read only mode. Normally rsync wants to be run this way as an actual daemon (listening on a port and so on), but that requires us to use rsync's weaker and harder to manage authentication, access control, and other things. I would rather continue to run daemon mode rsync over plain SSH and take advantage of all of the existing, proven SSH features for various things.

(The rsync manpage suggests hacks like binding the rsync daemon to only listen on localhost on the server and then using SSH port forwarding to give clients access to it. But those are hacks and require making various assumptions.)

How to do this is not obvious from the documentation, so here is the setup I have come up with for doing this on both the server and the clients. First, you need an rsyncd.conf configuration file on the server. Don't use the normal /etc/rsyncd.conf; it's much more controllable to use your own in a different place. It should look something like:

use chroot = no
[somepath]
comment = Replication module
path = /some/path
read only = true
# if necessary:
uid = 0
gid = 0

(The '[somepath]' bit is what rsync calls the module name and can be anything meaningful for you; you'll need it on the client later. The comment is optional but potentially useful. You need to explicitly specify uid and gid if the server login is UID 0 for access to the directory tree and you need to keep that; otherwise rsync will drop privileges to a default UID.)

Next, you need a script on the server that will force an incoming SSH login to run rsync in daemon mode against this configuration file and do nothing else. We will set this as the command= value in the server login's authorized_keys to restrict what the incoming SSH connection from clients can do. This looks like:

#!/bin/sh
exec /usr/bin/rsync --server --daemon --config=/your/rsyncd.conf .

Note that this completely ignores any arguments that the client attempts to supply. However, this doesn't matter; as far as I can tell, the command line that the clients send will always be 'rsync --server --daemon .', regardless of what command line options and paths you use on the clients. (Certainly this is the only command line that clients seem to send for requests that you actually want to pay attention to.)
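If you would rather not ignore what the client asked for, sshd makes the client's requested command available to a forced command in the SSH_ORIGINAL_COMMAND environment variable. A stricter variant of the script could check it and refuse anything unexpected. This is only a sketch: the function names are made up, and the exact expected string is an assumption based on the observation above, so it may well need adjusting for other rsync versions.

```shell
#!/bin/sh
# Stricter variant of the forced-command script: refuse anything but the
# command line that rsync clients were observed to send. The exact expected
# string is an assumption taken from that observation.
is_expected_request() {
    [ "$1" = "rsync --server --daemon ." ]
}

serve_rsync() {
    if is_expected_request "${SSH_ORIGINAL_COMMAND:-}"; then
        exec /usr/bin/rsync --server --daemon --config=/your/rsyncd.conf .
    fi
    echo "refusing unexpected command" >&2
    return 1
}

# In the real script the last line would be:  serve_rsync
```

The tradeoff is that the simple version can't be broken by a client that sends a slightly different command line, while this one can.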

On the server, the login that you're using for this should have a .ssh/authorized_keys file with entries for the client SSH identities. These entries should all force incoming logins to run the command above and block various other activities (especially port forwarding, which could otherwise be done without command execution at all as Dan Astoorian mentioned in a comment here):

command="/your/rsyncd-shell",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty [...]

A from="..." restriction is optional but potentially recommended. Even a broad one may limit the fallout from problems.

Finally, on the client you need to run rsync with all of the necessary arguments. You probably want to put this in a script:

#!/bin/sh
rsync -a --delete --rsh="/usr/bin/ssh -i /client/identity" LOGIN@MASTER-HOST::somepath /some/path/

Potentially useful additional arguments for rsync are -q and --timeout=<something>. In a production script you probably also want an option to mirror the directory tree to somewhere other than /some/path on the client.

If you run this from cron, remember to add some locking to prevent two copies from running at once. If the directory tree is large and you have enough clients, you may want to add some amount of randomization of the start times for the replication in order to keep load down on the master server.
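A minimal sketch of both pieces, assuming plain POSIX sh, could look like this. All of the function names and the lock path are made up for illustration. Note that instead of true randomization this derives a stable per-host delay from a checksum of the host name, which spreads clients out without needing a random number source.

```shell
#!/bin/sh
# Sketch of cron-side locking and start-time spreading. All names are made up.

# Take a lock by creating a directory; mkdir is atomic, so only one caller wins.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

release_lock() {
    rmdir "$1" 2>/dev/null
}

# Stable per-host delay in [0, max) seconds, derived from a checksum of the
# host name so that clients spread out without needing a random source.
start_delay() {
    crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    echo $(( crc % $2 ))
}

# Usage in the replication script might look like:
#   acquire_lock /var/run/replicate.lock || exit 0
#   trap 'release_lock /var/run/replicate.lock' EXIT
#   sleep "$(start_delay "$(hostname)" 300)"
#   ... run the rsync command here ...
```

A directory lock like this will be left behind if the machine crashes in the middle of a run, so it is best put somewhere that gets cleaned out on boot.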

(There may be a better way to do this with rsync; if you know of one, let me know in the comments. For various reasons we're probably not interested in doing this with any other tool, partly because we already have rsync and not the other tools. Another tool would have to be very much better than rsync to really be worth switching to.)

sysadmin/RsyncReplicationSetup written at 23:54:57

Things I will do differently in the next building power shutdown (part 2)

Back at the start of last September, we had an overnight building wide power shutdown in the building with our machine room and I wrote a lessons-learned entry in the aftermath. Well, we just had another one and apparently I didn't learn all of the lessons that I needed to learn the first time around. So here's another set of things that I've now learned.

Next time around I will:

  • explicitly save the previous time's checklist. If nothing else, the 'power up' portion makes a handy guide for what to do if you abruptly lose building power some day.

    (I sort of did this last time, not through active planning but just because I reflexively don't delete basically any of this sort of stuff. But I should do it deliberately and put it somewhere where I can easily find it, instead of just leaving it lying around.)

    Having last time's list isn't the end of the work, because things have undoubtedly changed since then. But it's a starting point and a jog to the memory.

  • start preparing the checklist well in advance, like more than a day beforehand. Things worked out in the end but doing things at the last moment was a bit nerve wracking.

    (There's always stuff to do around here and somehow it always felt like there was plenty of time right up until it was Friday and we had a Monday night shutdown.)

  • update and correct the checklist immediately afterwards to cover things that we missed. My entry from last time is kind of vague; I'm sure I knew the specifics I was thinking of at the time, but I didn't write them down so they slipped away. I was able to reconstruct a few things from notes and email in the wake of last time, but others I only realized in the aftermath of this one.

  • add explanatory notes about why things are being done in a certain order and what the dependencies are. Especially in the bustle of trying to get everything down or up as fast as possible, it's useful to have something to jog our minds about why something is the way it is and whether or not it's that important.

    (Our checklists for this sort of thing are not fixed; they're more guidelines than requirements. We deviate from them on the fly and thus it's really useful to have some indication of how flexible or rigid things are.)

  • if any machines are being brought down and then deliberately not being brought back up, explicitly mention this so that people don't get potentially confused about a 'missing' machine.

My entry from last time was very useful in several ways. I reread it when I was preparing our checklist for this time and it jogged my memory about several important issues; as a result our checklist for this time around was (I think) significantly better than for last time (and also noticeably longer and more verbose). This time I at least made new mistakes, which is progress that I can live with.

I will also probably try to put more explanation into the checklist the next time around. I'm sure it's possible to put too much of it in, but I don't think that's been our problem so far. In the heat of the moment we're going to skim anyways, so the thing to do is to break the checklist up into skimmable blocks with actions and things to check off and then chunks of additional explanation after them.

(In a sense a checklist like this serves two purposes at once. During the power down or power up it is mostly a catalog of actions and ordering, but beforehand it's a discussion and a rationale for what needs to be done and why. Without the logic behind it being written out explicitly, you can't have that discussion; once you have that logic written out, you might as well leave it in to jog people's memories on the spot.)

On a side note, a full power up is an interesting and useful way to find problematic dependencies that have quietly worked their way into your overall network, ones that are not so noticeable when your systems are in their normal steady state. For example, DHCP service for several of our networks now depends on our core fileserver, which means that it can only come up fairly late in the power up process. We're going to be fixing that.

(There is a chain of dependencies that made this make sense in a steady state environment.)

sysadmin/PowerdownLessonsLearnedII written at 00:37:34


This is part of CSpace, and is written by ChrisSiebenmann.
