Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web.
2012-05-19
A semi-brief history and overview of X fonts and font rendering technology
Because I've just been mucking around in this swamp, I feel like writing all of this down. In the very beginning, X had some simple bitmap fonts in the server
with equally simple names that were basically labels; ie, the X server
attached no particular meaning to the font names, which were created by
people mostly to be short and sometimes to be vaguely meaningful. Some
of these font names linger on in X today; the most well known one is
(Today these font names are implemented as aliases for other fonts.)

As the X server evolved, it grew more bitmap fonts: fonts from different vendors, fonts in more sizes, proportional fonts as well as monospaced ones, fonts for different resolutions (75dpi versus 100dpi), fonts for different character set encodings, and so on. It was clear that ad-hoc font names weren't going to scale because no one was going to be able to keep fonts straight or find one.

So the X people invented a naming convention for their fonts, the X logical font description (XLFD). In theory an XLFD name describes most of the important attributes of a font, things like the point and pixel size, the slant, the style, whether it's proportional or monospaced, and so on. Along with XLFD names and their defined structure, the X people introduced the idea of wildcard matches so that X programs could say 'I don't really care which vendor it comes from, I just want whatever 15 pixel monospaced font you have'.

(For backwards compatibility, the original simple X font names were defined to be acceptable XLFD font names (although you can't use wildcards with them).)

Initially XLFD fonts were still all bitmapped fonts, just better named and more numerous than before. However, the X server soon got additional font rendering support that let it handle several scalable font formats of the time. Scalability was implemented in XLFD font names in a straightforward way; if the font name itself had zeros in various resolution related fields, you knew that the X server could and would render it at whatever pixel size you asked for.

At some point, people observed that the precious X server was spending a
bunch of time and memory loading, parsing, rendering, and so on all of
these fonts, even for fonts that weren't very actively used. So the X
people decided to offload a lot of this work to a separate daemon, the
X font server ( (Generally

The next evolutionary step in X font handling was to move it to the client side, which marked (and marks) a stark division in X font handling. This is XFT and 'XFT fonts'. XFT is to a significant degree glue; it uses FontConfig to translate from font names (and attributes) to actual concrete font data files, then FreeType to turn text into picture data and draws the picture data using various X bits.

Technically and theoretically XFT and its pieces still support old X bitmapped fonts. Practically they do not; XFT and XFT-using programs really expect fully scalable fonts, generally ones with a wide glyph selection, and have basically no patience or tolerance for bitmapped fonts that are available in only a few point sizes with only a few glyphs. With heroic work in FontConfig configuration files you can sort of get something limping along, but in practice moving to XFT fonts means no more bitmap fonts.

(Yes, I have tried this experiment. It's especially unsatisfactory for 'frankenfonts', ones where the real font is only available in a few pixel sizes and you were already filling in other pixel sizes with substitutes. The XLFD configuration system is much better for this.)

Generally, the system FontConfig configuration will look for fonts in all of the X server font directories with scalable fonts, or at least all of the directories that are considered to include 'good' fonts. This makes scalable XLFD fonts available to modern XFT-using clients, although under somewhat different names.

(TrueType fonts will generally render the same in XLFD and XFT form because the X server and the X font server long ago were set up to render them with FreeType. Any remaining differences in appearance are due to rendering decisions made differently between FontConfig and the X server environment. I'm not sure how older scalable font formats come out, and generally you don't want to use those fonts anyways.)
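To make the XLFD naming and wildcard matching described earlier a bit more concrete, here is a minimal Python sketch. It assumes the standard 14-field XLFD layout and approximates the X server's wildcard matching with Python's fnmatch, which is close but not identical (for instance, fnmatch gives '[' a special meaning). The sample font names and the helper function names are mine, picked purely for illustration; the test for scalability checks the pixel size, point size, and average width fields for zeros.

```python
from fnmatch import fnmatchcase

# The 14 fields of a full XLFD name, in order.
XLFD_FIELDS = (
    "foundry", "family", "weight", "slant", "setwidth", "addstyle",
    "pixel_size", "point_size", "res_x", "res_y", "spacing",
    "average_width", "registry", "encoding",
)

# Illustrative font names only; what a real server offers will differ.
SAMPLE_FONTS = [
    "-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1",
    "-adobe-courier-medium-r-normal--14-140-75-75-m-90-iso8859-1",
    "-adobe-courier-medium-r-normal--0-0-0-0-m-0-iso8859-1",
]


def parse_xlfd(name):
    """Split a full XLFD name into a dict of its 14 fields."""
    if not name.startswith("-"):
        raise ValueError("not a full XLFD name: %r" % (name,))
    parts = name[1:].split("-")
    if len(parts) != len(XLFD_FIELDS):
        raise ValueError("expected 14 fields, got %d" % len(parts))
    return dict(zip(XLFD_FIELDS, parts))


def is_scalable(name):
    """A name with zeros in its size fields stands for a scalable font."""
    fields = parse_xlfd(name)
    return all(fields[f] == "0"
               for f in ("pixel_size", "point_size", "average_width"))


def match_fonts(pattern, names):
    """Case-insensitive wildcard match, approximating X's '*' and '?'."""
    pattern = pattern.lower()
    return [n for n in names if fnmatchcase(n.lower(), pattern)]
```

With this, match_fonts("-*-*-*-*-*-*-14-*-*-*-m-*-*-*", SAMPLE_FONTS) is the 'any vendor, 14 pixel, monospaced' sort of request; it picks out only the second sample name.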
So today X has two separate font technologies: XLFD fonts and XFT
fonts. XLFD fonts are configured through (This is especially true of monospaced bitmap fonts, many of which were extensively tuned for high readability and high density at relatively small pixel sizes.) XFT fonts are configured through FontConfig. Magic happens, and it happens differently on
different machines. On the good side,
you can generally put new font files into XLFD font support is pervasive in old X programs but scattered and increasingly absent in modern ones. XFT support is the inverse; uncommon in old X programs and old Unixes (such as Solaris), but increasingly common in modern X programs (especially ones that are part of a pervasive environment like Gnome or KDE, where it is ubiquitous), on Linux, and at least partially on other Unix OSes such as FreeBSD. Some programs and frameworks make an effort to support both XLFD fonts and XFT fonts, but many are XFT-only. (Today's expedition into this swamp was started by Tk 8.5, which can be compiled to support XLFD fonts or XFT fonts but not both at once. You can guess which option modern Linux distributions have picked.)
2012-05-17 My Firefox memory bloat was mostly from All-in-One GesturesIt's time for an update to my prior Firefox situation (one, two). After some experimentation it's become clear that most of my Firefox problems with constant memory growth and zombie compartments were due to my use of All-in-One Gestures (as I kind of suspected it might be). I've switched to FireGestures instead (initially as an experiment and now full time on all of my various Firefox instances on various different machines) and things have been much better; there are no zombie compartments at all and memory growth seems to have dropped significantly (although it's not clear yet if it's completely gone). And I haven't run into any problems or bugs this time around; everything has just worked the way I expected. (A-i-O doesn't seem to have been the only problem I had; for example, it seems to be a bad idea to leave a tab or window sitting around with an embedded Youtube video. It's also not clear if Firefox Nightly behaves well for me in general because I haven't been able to leave it running for multiple days yet.) In addition to less memory usage, FireGestures also seems to simply be more responsive and snappy than A-i-O. It certainly has more useful features, including the ability to add gestures without needing to hack the source code, a library of existing additional gestures (including the one that I wanted), and the ability to 'back up' and 'restore' your settings (which for me really means the ability to easily synchronize my gestures between multiple Firefox instances). (See FireGesture's homepage for more information on all of this.) So FireGestures is now one of my core extensions, replacing All-in-One Gestures in the previous list. The one drawback of FireGestures is that it doesn't work in Firefox 3.6; my laptop is still running Fedora 14 with this Firefox release (because that's the last one with Gnome 2 instead of Gnome 3). I don't consider this a real drawback, but you may. 
PS: people migrating from All-in-One Gestures to FireGestures might want to use Down-Right-Down to call up the A-i-O information display that shows all of your gestures and then save it (as an HTML page, which is what it is). You can then conveniently look at it later when you're using FireGestures. (I am far too impatient to try to retrain years of reflexes to use the native FireGestures gestures for various actions; I just ruthlessly rewrote them to be the A-i-O gestures I'm used to.) (One comment.)
web/Firefox12Gestures written at 16:11:51
The Go language's problem on 32-bit machinesRecently (for my value of recently) there was somewhat of a commotion of people declaring that Go wasn't usable in production on 32-bit systems because its garbage collection was broken and it would eat all of your memory. Naturally I was interested in this and spent some time digging in to the reports and trying to understand the situation. Today I'm going to try to write down as much as I know about what's going on to get it straight in my head, which is going to involve a trip into the fun land of garbage collection. To simplify a bit, the purpose of garbage collection is to automatically free up memory that's no longer used. The GC technique everyone starts with is reference counting but since it has various problems (including dealing with circular references) most people soon upgrade to more complex schemes based on inverting the problem: rather than noticing when something stops being used, the garbage collection system periodically finds all of the memory that's still actively used and then frees everything else. This is 'tracing garbage collection' (and garbage collectors), so called because the garbage collector 'traces' all live objects. One deep but unsexy problem in garbage collection is how your GC system knows what fields in your objects refer to other objects and what fields are just primitive types like numbers, memory buffers, strings, or the like, and how it does this efficiently. This can be a particular issue for a system language where you probably want to have structures and objects that are as simple and dense as possible, with as little overhead from type annotations, inefficient 'boxed' representations, and so on as possible. One solution is to maintain a separate bitmap of what words in an allocated memory area are actually pointers (which the GC can then scan efficiently, and which can be set by the runtime when an object is allocated). 
Another solution is what gets called 'conservative garbage collection'. The fundamental idea is that in conservative GC, we are willing to over-estimate references (and thus wind up not freeing some unused memory); rather than insisting on knowing about references, the GC system simply scans through allocated memory looking for anything that might be a pointer to an allocated object. If it finds one, it conservatively declares that the object is still alive and traces things from there.

Go was initially designed as a system language, although it's no longer described as one. As such, one of the tradeoffs the language designers made is that Go more or less uses conservative garbage collection, as far as I understand, at least for objects or at least memory areas that may contain pointers (some static data that's known to be pointer free may be skipped by the conservative GC). Although there's said to be the start of a more efficient word-bitmap implementation for Go objects, it's not currently usable by the GC (and may not be fully live).

(As far as I can tell from commentary, Go's garbage collector only scans Go's own memory areas; it doesn't make any attempt to scan memory used by outside libraries or code to find references to Go objects. Runtime code that passes a pointer to a Go object to an outside function is apparently required to keep the object alive inside Go, for example by hooking it into a global variable.)

The problem with conservative GC is that it over-estimates memory still in use because it finds false 'references', things that look like pointers to allocated objects that aren't actually that. There are a number of factors that make conservative GC worse:
Many of these factors are apparently quite bad for 32-bit Go programs that use a significant amount of memory, apparently especially for large objects and when they use objects that the garbage collector treats conservatively. They are drastically reduced on 64-bit machines, where you would generally have to be unlucky in order for the conservative GC to accidentally hold a significant amount of memory busy. However, the problem could still happen with 64-bit Go; it's just less likely.

(The general reference for this is Go language issue 909.)

At this point I have no articulate personal reactions to all of this. As a pragmatic matter I'm not exactly writing Go programs right now for various reasons (although I keep vaguely wanting to because I like Go in the abstract), so if I'm being honest it's all kind of theoretical.

(My problem with Go in practice is partly that I have nothing to really use it on. I need to find a project that calls out for it instead of anything else.)

Sidebar: the 32-bit Windows issue
There's also an issue on Windows machines due to memory fragmentation (via Hacker News). When it starts, the Go runtime tries to allocate a contiguous 512 Mbyte region of virtual address space. Sometimes on Windows machines enough DLLs have loaded in enough places by this point that there isn't such a contiguous chunk of address space left any more, the allocation fails, and the Go runtime immediately exits with an error.

(In theory this sort of address space fragmentation could happen on any 32-bit OS, but apparently Windows is uniquely susceptible for various reasons.)
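As a rough illustration of why address space size matters so much for conservative GC, here is a back-of-the-envelope sketch in Python. It assumes that the non-pointer words being scanned are uniformly random, which real program data certainly is not, so treat the numbers as intuition rather than a model of Go's actual collector. The function names and the example sizes are mine.

```python
import math


def false_pointer_probability(heap_bytes, address_bits):
    """Chance that one uniformly random word happens to point into the heap."""
    return heap_bytes / float(2 ** address_bits)


def chance_of_false_reference(heap_bytes, address_bits, scanned_words):
    """Chance that at least one of the scanned non-pointer words looks like
    a pointer into the heap, assuming independent random words."""
    p = false_pointer_probability(heap_bytes, address_bits)
    # 1 - (1 - p)**n, computed so that it stays accurate for very small p.
    return -math.expm1(scanned_words * math.log1p(-p))


HEAP = 512 * 2 ** 20  # a 512 Mbyte heap, purely as an example
```

Under these assumptions, false_pointer_probability(HEAP, 32) is one in eight, so scanning even a hundred random words is almost certain to produce a false reference, while with 64 address bits the same calculation for a million words stays well under one in ten thousand. Real 64-bit systems use fewer than 64 bits of address space, which makes things somewhat worse than this but nowhere near the 32-bit case.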
2012-05-15
Some stuff on 'time since boot' timestamps
From today on Twitter:
In the Twitter way, this is a little bit cryptic so I'm going to elaborate on my guess here.

Suppose that routers were supposed to generate an absolute timestamp for their events instead of this relative one, for example UTC in milliseconds. This would create two problems. First, routers would somehow need to know or acquire the correct UTC time (with millisecond resolution) and then maintain it. This is to some degree a solved problem but it adds complexity to the router. It also leads to the second problem, because a router is unlikely to boot with the correct UTC time (down to the millisecond).

The second problem is that the moment you have a system generating an absolute timestamp you need to deal with the certainty that the correct time, as the system sees it, will jump around. The router will boot with some idea of the UTC time but it's quite likely to be a bit off (remember that we're calling for millisecond accuracy here), then over time it will converge on the correct UTC time. As it does so, its version of UTC time may go forward abruptly, go backwards abruptly, or go forward more slowly than UTC time is really advancing. Backwards time jumps screw up event ordering completely, and all of the options screw up the true relative time between events; if you have two events timestamped UTC1 and UTC2, you actually have only a weak idea how long it is between them.

The valuable property that milliseconds since boot has is that it is a clear monotonic timestamp. It only ever goes forward and it goes forward at what should be a very constant rate, which means that it creates a clear order of events and a clear duration between any two events (well, for events from the same stream of monotonic timestamps). Monotonic timestamps are not a substitute for absolute time but neither is absolute time a substitute for monotonic timestamps; you really need both, which means that you need a map between them.
There are two possible places to build such a map: each device can do its own or it can be done in a central aggregator. I believe that the right answer is to do it in the central aggregator because this means that you have only a single version of absolute time, the aggregator's view (each device, aggregator included, may have a slightly different view of the current 'correct' absolute time for the reasons outlined above). Using only a single version of absolute time means that you have a single coherent map of all of the monotonic timestamps to (some) absolute time. (Of course you need devices that generate monotonic timestamps to tell you when they reset their timestamps, eg when they boot.) My impression is that using elapsed time since boot is actually common in a number of environments. For example, Linux kernel messages are usually reported this way these days (which has its own issues if you're trying to work backwards to roughly when in absolute time something happened). (2 comments.)
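Here is a minimal sketch of what the aggregator-side map could look like, in Python. Everything in it is hypothetical: the class and method names, the use of seconds for the aggregator's clock, and the idea that a device flags its own resets with a 'reset' marker are all assumptions for illustration, not a description of any real system.

```python
class TimestampMapper(object):
    """Map per-device 'milliseconds since boot' onto the aggregator's clock."""

    def __init__(self):
        # device name -> aggregator time (in seconds) at which that device's
        # monotonic counter would have read zero
        self._boot_epoch = {}

    def observe(self, device, uptime_ms, received_at, reset=False):
        """Record an event and return its absolute time, aggregator's view.

        The map is (re)anchored only on the first event from a device or
        when the device reports that its counter has been reset.
        """
        if reset or device not in self._boot_epoch:
            self._boot_epoch[device] = received_at - uptime_ms / 1000.0
        return self.to_absolute(device, uptime_ms)

    def to_absolute(self, device, uptime_ms):
        return self._boot_epoch[device] + uptime_ms / 1000.0
```

The point of anchoring only at resets is that later events are placed using the device's own elapsed time, so a message that arrives late at the aggregator still gets the right ordering and spacing relative to other events from the same device. Anchoring on the first received event does fold that one message's delivery delay into the map, which is a limitation of this sketch.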
tech/TimestampIssues written at 12:20:36
2012-05-14 My Firefox 12 extensions and addonsIn light of yesterday's entry about my failed Firefox Nightly experiment and the potential that some of my extensions are the root cause of my Firefox problems, I'm going to run down the current set of Firefox extensions that I use in my main browser (updating previous discussions from the Firefox 7 era, which alarmingly was less than a year ago). This time around I'm going to group them by purpose: Safe browsing:
User interface:
Fixing annoying websites, especially Google's:
Improving my life:
Modern versions of Firefox also give you a JavaScript based PDF viewer addon for free. I have not done much with this and in fact currently have it turned off. Of these extensions, I consider NoScript, All-in-One Gestures, GreaseMonkey, and Stylish to be completely essential. I can sort of live without the others, so as an experiment I am trying that to see if it makes a difference in Firefox memory usage and the number of zombie compartments that build up. If I am serious about this, I probably should migrate away from Stylish to GreaseMonkey for everything on the grounds that the latter is probably more actively used and maintained and so any leaks it has are more likely to get fixed promptly. (Unfortunately I suspect that A-i-O is a likely candidate to be a leaky extension, since it hasn't been updated in ages.) (5 comments.)
web/Firefox12Extensions written at 15:24:58
2012-05-13 My experiment with Firefox Nightly builds: a failureEver since my old Firefox build started crashing and I was forced to update to current versions, I've had serious memory issues with Firefox. I used to be able to leave Firefox running for weeks (or months) with basically stable memory usage. Now, Firefox will steadily bloat up from under a GB of resident memory at its initial steady state to, say, 1.5 GB in a few days at most. Although my current machine has 16 GB of RAM, Firefox progressively gets slower and slower as its resident memory grows; by the time it reaches around 1.5 to 1.6 GB resident the performance is visibly dragging and I have to restart. Recently I stumbled across this Mozilla blog entry on Firefox memory usage, which discusses how current Firefox builds have changes that reduce memory leaks, especially a drastic reduction in zombie compartments (see this entry for more). Ever since I discovered the verbose about:memory information, I've noticed that I have zombie compartments that linger from my ordinary browsing; the longer I browse, the more zombie compartments build up. A Firefox change that actually dropped zombie compartments seemed very promising, certainly promising enough to build a current version of Firefox and see what happened. (Thus this is not quite an experiment with the literal Nightly builds, although it should be very close; as far as I understand, they're built from the same source repository (see also) that I was using.) Unfortunately, the experiment turned out to be mostly a failure, although a sort of interesting one; in some ways Firefox improved but in other ways it got significantly worse. I tweeted a cryptic short form version, and I feel like elaborating on it now. What improved was Firefox's responsiveness as its resident memory grew. 
Firefox 12 visibly starts slowing down with as little as 1.2 or 1.3 GB of resident memory; the current Firefox code was still running almost as well as at start when it reached 2 GB or more of resident memory, and it might have kept going even as it bloated more. What did not improve was everything else. I still saw zombie compartments (probably just as many as before) and if anything Firefox memory usage grew faster than under Firefox 12, reaching 2 GB resident in a day or two. But the worse thing was that at home, Firefox would soon get into a state where it was constantly using CPU (apparently talking with the X server). In this state it would not shut down gracefully; I could quit Firefox and it would close all its windows, but the process would not exit and would continue consuming the CPU talking with the X server. (I had to use ' Unclean shutdowns aren't something that I considered acceptable in this situation so I am now back to Firefox 12, memory bloat slowdown and all. It's possible that the current Firefox codebase will improve as it marches towards release, eliminating the memory bloat and 100% CPU usage while preserving responsiveness as its memory usage grows. I could live with that and it certainly would be an improvement over the status quo. (In some ways, simply eliminating the CPU usage would be a bit of an improvement over the status quo, although I don't like Firefox consuming several GB of my RAM for no good reason.) (Despite the result, I don't regret doing this experiment; it was worth trying and it didn't particularly explode in my face.) Update, May 17th: It seems that most of my Nightly memory problems were probably due to a single old extension I was using. See this update. Sidebar: dealing with this with Chrome or by disabling extensionsChrome is not something I consider an acceptable alternative to Firefox, so switching to it is not an option. 
One piece of advice the Mozilla people give about this sort of memory bloat is 'disable unnecessary addons'. Well, I don't have any of those; all of the addons I have loaded are ones that I consider either absolutely necessary (to the point where I would not browse without them) or important for how I use Firefox. (I suppose there's one or two that I don't use very often, like It's All Text!, but it would be actively painful periodically.) (4 comments.)
web/FirefoxNightly-2012-05-13 written at 21:36:57
A basic step in measuring and improving network performanceThere is a mistake that I have seen people make over and over again when they attempt to improve, tune, or even check network performance under unusual circumstances. Although what set me off now is this well intentioned article, I've seen the same mistake in people setting off to improve their iSCSI performance, NFS performance, and probably any number of other things that I've forgotten by now. The mistake is skipping the most important basic step of network performance testing: the first thing you have to do is make sure that your network is working right. Before you can start tuning to improve your particular case or start measuring the effects of different circumstances, you need to know that your base case is not suffering from performance issues of its own. If you skip this step, you are building all future results on a foundation of sand and none of them are terribly meaningful. (They may be very meaningful for you in that they improve your system's performance right now, but if your baseline performance is not up to what it should be it's quite possible that you could do better by addressing that.) In the very old days, the correct base performance level you could expect was somewhat uncertain and variable; getting networks to run fast was challenging for various reasons. Fortunately those days have long since passed. Today we have a very simple performance measure, one valid for any hardware and OS from at least the past half decade if not longer:
As I've written before in passing, if you have two machines with gigabit Ethernet talking directly to each other on a single subnet you should be able to get gigabit wire rates between them (approximately 110 MBytes/sec) with simple testing tools like ttcp. If you cannot get this rate between your two test machines, something is wrong somewhere and you need to fix it before there's any point in going further. (There are any number of places where the problem could be, but one definitely exists.) I don't have an answer for what the expected latency should be (as
measured either by As a side note, a properly functioning local network has basically no packet loss whatsoever. If you see any more than a trace amount, you have a problem (which may be that your network, switches, or switch uplinks are oversaturated). The one area today where there's real uncertainty in the proper base performance is 10G networking; we have not yet mastered the art of casually saturating 10G networks and may not for a while. If you have 10G networks you are going to have to do your own tuning and measurements of basic network performance before you start with higher level issues, and you may have to deliberately tune for your specific protocol and situation in a way that makes other performance worse. (2 comments.)
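For what it's worth, the 'approximately 110 MBytes/sec' figure for gigabit Ethernet can be sanity checked with some simple arithmetic. This Python sketch assumes a standard 1500 byte MTU, plain IPv4 and TCP headers, the 12 bytes of TCP timestamp options that are commonly in use, and the usual per-frame Ethernet overhead (header, FCS, preamble, and inter-frame gap); the function name and defaults are mine.

```python
def tcp_goodput_bytes_per_sec(link_bits_per_sec=10 ** 9, mtu=1500,
                              ip_header=20, tcp_header=20, tcp_options=12):
    """Best-case TCP payload rate over Ethernet, in bytes per second."""
    payload = mtu - ip_header - tcp_header - tcp_options
    # On the wire each frame also carries a 14 byte Ethernet header, a
    # 4 byte FCS, an 8 byte preamble, and a 12 byte inter-frame gap.
    wire_bytes = mtu + 14 + 4 + 8 + 12
    return (link_bits_per_sec / 8.0) * payload / wire_bytes


GIGABIT_RATE = tcp_goodput_bytes_per_sec()
```

With these assumptions the result comes out a bit under 118 million bytes per second, or a little over 112 MBytes/sec if you count in units of 2^20 bytes, which is consistent with the rough figure above once you allow for real-world overhead.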
tech/NetworkPerfBasicStep written at 00:40:17
2012-05-12 The death of paging on the webI've written about the problem of permanent headers and footers before (around a year ago), but I'm seeing more and more of them these days. What this confirms for me is that paging is dead on the modern web. By this I don't mean long pages; I'm not one of those people who think that all of your content has to be 'above the fold', immediately visible as what people see (and the available evidence from actual experimentation apparently says otherwise). What I mean is getting to that content by paging, advancing in nearly full page increments (usually by hitting the space bar in your browser). Given that permanent headers or footers (or both) screw this up, and given that permanent headers and footers are increasingly popular, I can only conclude that paging isn't really used any more; otherwise, header and footer based designs would be wretched experiences and test badly (and on the modern web, people do at least do A/B tests). Instead, I think that on the modern web everyone has scroll wheels (or some other way of scrolling, for example on tablets) and they scroll through articles and pages with them. Only an insignificant number of people still navigate with paging. Now I'll add a personal confession here: since I started my scroll wheel mouse experiment, I've found myself increasingly scrolling web pages instead of paging them. I don't know why, but there's just something about it that feels right (and this is on pages without obnoxious headers and footers). I think that part of it is that the boundaries of things on the web page often don't align naturally with what I'd get by paging; by partially scrolling the page I can make things line up right (this is especially visible to me if the page content includes images). (Looking back, I've had middle mouse button based scrolling in my browser for years and have used it too instead of paging. So I should have seen this one coming.) 
I don't know what this means for web page design going forward, but I suspect that it means something (I also suspect that current web designers do know what it implies; I am not exactly current on the field). There have to be things you design differently if you expect almost everyone to scroll your page around so that things can catch their eye as they move past. (I probably won't ever put a permanent header or footer on a page I design (at least not a full-width one), but that's a personal thing. Also it would have to be something awfully important to the page to deserve a permanent full-time presence in front of the viewer. My bias is that almost all headers and footers I've seen aren't that important; in fact, they're often rather presumptuous that way, which is part of the reason I dislike them.)
2012-05-10
All your servers should have Linux's magic SysRq enabled
This is effectively another lesson learned from our recent building power shutdown. I will put it simply:
There are reasons to not do this on client machines (but not necessarily very good ones), but none on your servers (which certainly should have their hardware and consoles in a secure location).

What magic SysRq is good for on servers (above everything else) is giving you a last ditch chance to shut down or reboot the machine in something approaching an orderly way. I'm not just talking about if the system goes crazy, because it's also quite possible for ordinary system shutdowns to hang, especially if you're shutting down a group of systems that have complex NFS filesystem relationships and something went down out of order. If this happens and you don't have magic SysRq support available, you're plain out of luck; all you can do is pull the power and hope that nothing is going to explode because it hasn't been killed, had its data synced to disk, or whatever.

With magic SysRq you have at least a chance of doing something about this. You can force a kernel level sync, a kernel level unmount of as many filesystems as possible, and even hit processes with signals if you think it's going to do any good. And then you can reboot the machine (and afterwards, possibly pull the power to keep the machine down).

PS: you should explicitly enable magic SysRq in your standard server install setup, even if your distribution normally defaults to leaving it on; distribution defaults can change over time. Also, note that if you have a serial console you generally need a getty listening on it in order to make magic SysRq work.

(You can check to see if magic SysRq is enabled by looking at the value
of /proc/sys/kernel/sysrq; a (2 comments.)
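If you want to check this across a bunch of servers, something like the following Python sketch could do it. The interpretation of the value (0 for disabled, 1 for everything enabled, anything larger as a bitmask of allowed functions) is my understanding of the kernel's sysrq documentation; the function names are made up for illustration.

```python
# Bitmask meanings for values greater than 1, as I understand the kernel's
# sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def describe_sysrq(value):
    """Turn a /proc/sys/kernel/sysrq value into a list of descriptions."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    return [name for bit, name in sorted(SYSRQ_BITS.items()) if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting; only meaningful on a Linux machine."""
    with open(path) as f:
        return int(f.read().strip())
```

On a Linux server, describe_sysrq(read_sysrq()) then tells you whether the functions you care about in an emergency (sync, remount read-only, signalling processes, and reboot) are actually allowed.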
linux/ServersEnableMagicSysrq written at 16:28:49
2012-05-09
Using rsync to pull a directory tree to client machines
Suppose that you have a decent sized directory tree that you want some number of clients to mirror
from a master server (with the clients pulling updates instead of the
master pushing them), perhaps because you've just noticed undesired
NFS dependencies. Things in the directory
tree are potentially sensitive (so you want access control), it's updated
at random, and it's not in a giant VCS tree or something; this is your
typical medium-sized ball of local stuff. The straightforward brute
force approach is to use rsync with SSH; give the clients special SSH
identities, put them in the server's authorized_keys, and have them
run ' (You also have to set the SSH access up so that the clients can't run arbitrary commands on the server.)

Rsync's solution to this is its daemon mode, which can be restricted to operate in read only mode. Normally rsync wants to be run this way as an actual daemon (listening on a port and so on), but that requires us to use rsync's weaker and harder to manage authentication, access control, and other things. I would rather continue to run daemon mode rsync over plain SSH and take advantage of all of the existing, proven SSH features for various things.

(The rsync manpage suggests hacks like binding the rsync daemon to only listen on localhost on the server and then using SSH port forwarding to give clients access to it. But those are hacks and require making various assumptions.)

How to do this is not obvious from the documentation, so here is
the setup I have come up with for doing this on both the server and
the clients. First, you need an

    use chroot = no
    [somepath]
    comment = Replication module
    path = /some/path
    read only = true
    # if necessary:
    uid = 0
    gid = 0

(The '
Next, you need a script on the server that will force an incoming SSH
login to run rsync in daemon mode against this configuration file and do
nothing else. We will set this as the
Note that this completely ignores any arguments that the client attempts
to supply. However, this doesn't matter; as far as I can tell, the
command line that the clients send will always be ' On the server, the login that you're using for this should have
a
A Finally, on the client you need to run rsync with all of the necessary arguments. You probably want to put this in a script:
Potentially useful additional arguments for If you run this from cron, remember to add some locking to prevent two copies from running at once. If the directory tree is large and you have enough clients, you may want to add some amount of randomization of the start times for the replication in order to keep load down on the master server. (There may be a better way to do this with rsync; if you know of one, let me know in the comments. For various reasons we're probably not interested in doing this with any other tool, partly because we already have rsync and not the other tools. Another tool would have to be very much better than rsync to really be worth switching to.)
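As a sketch of the client side with the locking and start time randomization mentioned above, here is a small Python wrapper. Everything specific in it is hypothetical: the server name, login, key path, lock file, and the bare '-a' argument are placeholders, not the actual setup. The 'host::module' form combined with '-e ssh' is, as I understand the rsync manpage, how you ask for daemon mode over a remote shell connection.

```python
import fcntl
import random
import subprocess
import time


def build_rsync_command(server, module, dest, login, identity_file):
    """Build the client command for daemon-mode rsync run over SSH.

    The values are assumed to contain no whitespace, since they are
    joined into a single '-e' argument without any quoting.
    """
    ssh = "ssh -l %s -i %s" % (login, identity_file)
    return ["rsync", "-a", "-e", ssh, "%s::%s/" % (server, module), dest]


def pull_once(cmd, lock_path, max_delay=0):
    """Run cmd under an exclusive lock, after a random delay.

    Returns the command's exit status, or None if another copy already
    holds the lock (in which case nothing is run).
    """
    with open(lock_path, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return None
        # Spread clients out so they don't all hit the master at once.
        time.sleep(random.uniform(0, max_delay))
        return subprocess.call(cmd)
```

From cron you would call something like pull_once(build_rsync_command("master.example.org", "somepath", "/some/path/", "replicator", "/etc/ssh/replica_key"), "/var/run/replicate.lock", max_delay=300). The lock is released automatically when the lock file is closed at the end of the run.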
Things I will do differently in the next building power shutdown (part 2)Back at the start of last September, we had an overnight building wide power shutdown in the building with our machine room and I wrote a lessons-learned entry in the aftermath. Well, we just had another one and apparently I didn't learn all of the lessons that I needed to learn the first time around. So here's another set of things that I've now learned. Next time around I will:
My entry from last time was very useful in several ways. I reread it when I was preparing our checklist for this time and it jogged my memory about several important issues; as a result our checklist for this time around was (I think) significantly better than for last time (and also noticeably longer and more verbose). This time I at least made new mistakes, which is progress that I can live with. I will also probably try to put more explanation into the checklist the next time around. I'm sure it's possible to put too much of it in, but I don't think that's been our problem so far. In the heat of the moment we're going to skim anyways, so the thing to do is to break the checklist up into skimmable blocks with actions and things to check off and then chunks of additional explanation after them. (In a sense a checklist like this serves two purposes at once. During the power down or power up it is mostly a catalog of actions and ordering, but beforehand it's a discussion and a rationale for what needs to be done and why. Without the logic behind it being written out explicitly, you can't have that discussion; once you have that logic written out, you might as well leave it in to jog people's memories on the spot.) On a side note, a full power up is an interesting and useful way to find problematic dependencies that have quietly worked their way into your overall network, ones that are not so noticeable when your systems are in their normal steady state. For example, DHCP service for several of our networks now depends on our core fileserver, which means that it can only come up fairly late in the power up process. We're going to be fixing that. (There is a chain of dependencies that made this make sense in a steady state environment.) (One comment.)
sysadmin/PowerdownLessonsLearnedII written at 00:37:34
This is part of CSpace, and is written by ChrisSiebenmann.