2012-06-22
How the Linux kernel command line is processed (more or less)
Because I just had to research this (right down to reading kernel source), here is how the Linux kernel handles its command line arguments.
(The kernel command line is set in the bootloader in various ways. Grub lets you edit it on the fly, assuming that you can interrupt grub in time before it autoboots your kernel.)
First, the kernel goes through all kernel options and handles them. In theory all of the kernel options are documented in Documentation/kernel-parameters.txt in the kernel source, but beware: on a modern system that boots using an initial ramfs, a number of these options are really handled by the initial boot process instead of the kernel.
(For example, root=... and init=... are handled by the initial boot
process, because the initramfs /init is what actually mounts the root
filesystem and starts the real init.)
For anything that is not a kernel option and does not have a '.' in
its name (these are assumed to be unhandled module parameters), one
of two things happens. If it is of the form 'a=b' or (I believe) 'a=',
it's placed in the environment that will be passed to the initial
user-level process (generally either /init from the initial ramfs or
/sbin/init). If it doesn't look like an environment variable setting
it becomes one of the command line arguments for the initial user-level
process. Everything from the kernel command line also appears in
/proc/cmdline.
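To make this concrete, here is a rough Python sketch of the sorting rule as I understand it. This is my approximation of the logic, not the kernel's actual code, and it assumes that genuine kernel options have already been picked off:

    def split_cmdline(cmdline):
        env, args = [], []
        for word in cmdline.split():
            # words with a '.' in their name are assumed to be
            # module parameters and are handled separately
            if '.' in word.split('=')[0]:
                continue
            if '=' in word:
                env.append(word)    # becomes an environment variable
            else:
                args.append(word)   # becomes an init argument
        return env, args

    # assuming 'single' and 'FOO=bar' are not kernel options:
    print(split_cmdline("FOO=bar single nfs.debug=1"))
    # -> (['FOO=bar'], ['single'])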
Generally the initial user-level process then immediately reads and
pseudo-parses /proc/cmdline, pulling out various options that it
cares about. Depending on your distribution, it may or may not pay
any attention to its command line arguments (and for an initial ramfs
/init, may or may not pass them to /sbin/init). Generally they are
at most passed to /sbin/init; initial ramfs processing usually prefers
to read everything from /proc/cmdline (partly because everything winds
up there regardless of its specific form).
(I say 'pseudo-parses' because /init may or may not handle things
like quoted arguments.)
Some but not all distributions make use of the initial environment
variables; for example, Fedora sets $LANG right from the moment of
boot this way.
Sidebar: what some distributions do with init command line arguments
All of these are for when you have an initial ramfs with its own
/init.
- Ubuntu 10.04's and 12.04's /init passes command line arguments to Upstart's /sbin/init but otherwise ignores them. What options /init accepts is documented in the initramfs-tools manpages.
- Fedora 17's /init appears to completely ignore command line arguments and does not pass them to systemd's /sbin/init. However, it appears that you can specify arguments in an init= option along with the program to run (unlike Ubuntu, where it can only be an executable). What options /init accepts is documented in the dracut manpage. (I expect that this is true of Fedora 16 and earlier but I lack the energy to dig up a Fedora 16 initrd image, unpack it, and carefully read through its /init just to be sure.)
- Red Hat Enterprise 5 is sufficiently old that I think it uses a somewhat different scheme to transition from the initial ramdisk to your root filesystem. Its initrd /init certainly doesn't seem to do anything with command line arguments.
These days initial ramdisk images are gzip'd CPIO archives, so they can be extracted with:
mkdir /tmp/unpack
cd /tmp/unpack
zcat /boot/init<whatever>.img | cpio -id
Reading /tmp/unpack/init is the most interesting thing to do, although
it may be relatively opaque.
2012-06-17
Why our server had its page allocation failure
In the previous entry I went through the kernel messages printed when one of our Linux servers had a page allocation failure. Now it's time to explain why our server had that failure despite what looks like plenty of memory. To refresh, the kernel was trying to allocate a 64 Kbyte ('order 4') chunk of contiguous memory in the Normal zone. There was no such chunk in the server's small Normal zone, but there were several such chunks free in the DMA32 zone (and also larger chunks free in DMA32 that could have been split).
First off, we can rule out allocation from the small DMA zone entirely. As far as I can tell, general memory will almost never be allocated from the DMA zone because most of it is effectively reserved for allocations that have to have DMA memory. This is not a big loss since it's only 16 MBytes of RAM. What matters in our case is the state of the DMA32 zone, and in particular two bits of its state:
Node 0 DMA32 free:12560kB min:7076kB low:8844kB high:10612kB [...]
Node 0 DMA32: 2360*4kB 80*8kB 21*16kB 7*32kB 4*64kB 1*128kB 2*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 12560kB
'min:' is the minimum low water mark for most non-urgent memory
allocation requests, and the second line reports on how many chunks of
each allocation size (or order) are free.
On first glance it looks like everything should be fine here, because
there is a free 64 Kb chunk and the zone has more free memory than any
of the low water marks (especially min), but it turns out that the
kernel does something tricky for higher-order allocations (allocations
that aren't for a single 4 Kb page). To simplify and generalize,
the kernel has decided that when checking free memory limits, if
you're asking for a higher-order page the free memory in lower-order
pages shouldn't count towards the amount of memory it considers free
(presumably because such 'free memory' can't satisfy your request). At
the same time it has to reduce the minimum amount of free memory
required to avoid absurd results.
(One little thing is that the kernel check is made based on what the free memory would be after your allocation is made. This probably won't matter for small requests but might matter if you ask for an order 10 allocation of four megabytes.)
So for higher-order allocations only memory available at that order and higher counts, and the starting low water mark is divided by two for every order above 0, ie for an order 4 request like ours the water marks wind up divided by 16 (which conveniently is 2^order). In theory in our situation this means that the kernel would consider there to be 1920 Kb free in DMA32 (well, 1856 Kb after we take off our allocation) instead of 12560 Kb and the minimum low water mark would be 442 Kb instead of 7076 Kb. This still looks like our allocation request should pass muster.
However, the kernel doesn't actually implement the check this way in a single computation. Instead it does it iteratively, using a loop that is done for each order up to (but not including) the order of your request. In pseudo-code:
for each order starting with 0 up to (our order - 1):
free memory -= free memory for the current order
minimum memory = minimum memory / 2
if free memory <= minimum memory:
return failure
The problem is that this iterative approach causes an early failure if a significant amount of the free memory in a zone is in very low order pages, because you can lose a lot of free memory while the current minimum memory requirement only drops by a bit (well, by half). In our situation, much of the free memory in DMA32 is in order 0 pages so the first pass through the loop gives us a new free memory of 3056 Kbytes (12560 Kb minus 9440 Kb of order-0 pages and our 64 Kb request) but a minimum memory requirement of 3538 Kb (the initial 7076 Kb divided by two) and the code immediately declares that there is not enough memory in this zone.
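To see this play out, here is a small Python simulation of the iterative check using this entry's DMA32 numbers. This is a simplified sketch based on my reading of the kernel code, not the real thing; among other things I ignore the lowmem_reserve part of the check and work in Kb instead of pages:

    def watermark_ok(free_kb, min_kb, order, free_kb_by_order):
        free_kb -= 4 << order   # check against post-allocation free memory
        for o in range(order):
            # memory free at lower orders stops counting...
            free_kb -= free_kb_by_order[o]
            # ...while the watermark only drops by half per order
            min_kb //= 2
            if free_kb <= min_kb:
                return False
        return True

    # DMA32: 2360*4kB 80*8kB 21*16kB 7*32kB ...
    low_orders = [2360 * 4, 80 * 8, 21 * 16, 7 * 32]
    print(watermark_ok(12560, 7076, 4, low_orders))
    # -> False; the first pass leaves 3056 Kb free against a 3538 Kb minimum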
(People who want to read the gory details can see them in
zone_watermark_ok() in mm/page_alloc.c in the Ubuntu 10.04
kernel source, which has been renamed to __zone_watermark_ok() in
current 3.5-rcN kernels.)
I'm reluctant to declare this behavior a bug; the kernel memory people may well consider it working as designed that a zone with a disproportionate amount of its free memory in low-order pages is very reluctant to allocate higher-order chunks, even more reluctant than you might think. However I do think that the current code is at least very unclear as to whether this is intentional (or simply an accident of the current implementation) and what the actual logic is.
(I personally would prefer the direct computation logic. As it stands, you have to know and then explain (and simulate) the actual kernel code in order to understand why this allocation failed; there is no general rule that's simple to express.)
2012-06-16
Decoding the Linux kernel's page allocation failure messages
One of our machines has recently started logging cryptic but verbose kernel messages about page allocation failures. Naturally I went digging to try to find out just what the bits mean, how alarmed I should be, and what was probably causing the failures. I now know a certain amount about decoding the cryptic reports, so that's what I'm going to cover in this entry.
Here's what I know so far about decoding, in the form of a walk-through of a sample message:
imap: page allocation failure. order:4, mode:0xc0d0
The 'imap:' bit is the command that the process that had the failed allocation was running. The 'order:4' bit tells you how many pages were requested, but indirectly. The kernel code always requests pages in powers of two (so one page, two contiguous pages, four contiguous pages, and so on); the 'order' is that power. So the order 4 request here is for 2^4 pages, which is 16 pages or 64 Kb. Memory fragmentation means that it's quite possible for high-order requests to fail even if there are plenty of pages free, while an order:0 failure means that your machine is totally out of memory with not even a single page free.
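(If you don't want to do the powers of two in your head, the order to size mapping is simple arithmetic:

    # order n is 2**n contiguous 4 Kb pages
    for order in range(5):
        print(order, 4 << order, "Kb")   # 4, 8, 16, 32, 64 Kb

and so on upwards.)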
The 'mode' is the flags passed to the kernel memory allocator. If you
have kernel source handy they are covered in include/linux/gfp.h,
but you need to know a certain amount of kernel internals to
understand them. You may see this particular 0xc0d0 mode a lot
because it's the normal flags used when code calls kcalloc()
to allocate an array of structures for kernel purposes (ie, with
GFP_KERNEL); this sort of large block allocation is one of the more
likely ones to fail.
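As an illustration, here is how 0xc0d0 decodes against the flag values in a 2.6.32-era include/linux/gfp.h. These constants move around between kernel versions, so check the gfp.h for your own kernel before trusting this:

    # flag values taken from a 2.6.32-era include/linux/gfp.h
    gfp_flags = {
        0x10: "__GFP_WAIT", 0x40: "__GFP_IO", 0x80: "__GFP_FS",
        0x4000: "__GFP_COMP", 0x8000: "__GFP_ZERO",
    }
    mode = 0xc0d0
    print(" | ".join(name for bit, name in sorted(gfp_flags.items())
                     if mode & bit))
    # -> __GFP_WAIT | __GFP_IO | __GFP_FS (together these are GFP_KERNEL),
    #    plus __GFP_COMP and __GFP_ZERO (kcalloc() zeroes its memory)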
Pid: 15273, comm: imap Not tainted 2.6.32-41-server #88-Ubuntu
So this happened to PID 15273, an imap process.
Call Trace:
 [<ffffffff810fcd27>] __alloc_pages_slowpath+0x4a7/0x590
 [<ffffffff810fcf89>] __alloc_pages_nodemask+0x179/0x180
 [<ffffffff81130177>] alloc_pages_current+0x87/0xd0
 [<ffffffff810fbe7e>] __get_free_pages+0xe/0x50
 [<ffffffff81139de8>] __kmalloc+0x128/0x1a0
The early part of the kernel stack call trace is generally not useful because it's all memory allocation functions, as we see here.
 [<ffffffffa0158d9e>] xs_setup_xprt+0xae/0x1c0 [sunrpc]
 [<ffffffffa0159411>] xs_setup_tcp+0x21/0x2c0 [sunrpc]
 [<ffffffffa0153ccb>] xprt_create_transport+0x5b/0x290 [sunrpc]
 [<ffffffffa01538db>] rpc_create+0x5b/0x1d0 [sunrpc]
Here we see what kernel code actually tried to do the failing memory
allocation. xs_setup_xprt is part of the kernel NFS RPC code, and I
believe that in recent kernels it has been reformed so that it doesn't
try to make large contiguous allocations.
(I'm eliding the rest of the kernel call stack because it's not quite interesting enough.)
Mem-Info:
Now the kernel prints a whole bunch of information about the
memory state of the machine. If you want (or need) to read the
kernel code involved, the overall function is show_mem()
in lib/show_mem.c, with most of the report coming from
show_free_areas() in mm/page_alloc.c.
Because this is detailed memory usage information, zones and nodes are going to be mentioned a lot. I wrote a long description of what these are here.
Node 0 DMA per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  CPU    1: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   0
  CPU    1: hi:  186, btch:  31 usd:   0
Node 0 Normal per-cpu:
  CPU    0: hi:  186, btch:  31 usd:   0
  CPU    1: hi:  186, btch:  31 usd:   0
What I believe this shows is that each CPU can maintain a private list of
free pages in each zone, which is useful because pages can normally
be allocated from the list without taking any global locks. The most
important number here is 'usd:', how many pages are currently in each
list.
Because this machine has a DMA32 zone, we know it's running a 64-bit kernel. A 32-bit kernel would have a HighMem zone and no DMA32 zone.
active_anon:79266 inactive_anon:31437 isolated_anon:0
 active_file:260339 inactive_file:255635 isolated_file:31
 unevictable:0 dirty:38 writeback:1 unstable:1
 free:7660 slab_reclaimable:320826 slab_unreclaimable:18290
 mapped:19846 shmem:7 pagetables:11033 bounce:0
This is global memory usage information, reporting how many pages are
in each of various states. If you don't already know a lot about the
Linux virtual memory system the most useful number is free; here it's
telling us that we have a decent number of free pages.
Node 0 DMA free:15900kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15336kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
This is giving us more or less the same memory usage information but
on a per-zone basis. There are some new bits of information, though;
present: tells us how large the zone is, and the min:, low:, and
high: report on various (low) watermarks for page allocation and page
reclaims. How the watermarks affect whether or not you get your memory
is so complicated that I am going to have to write another entry on it,
but it turns out that they can matter a lot (and they are probably why
this allocation failed).
I believe that it's quite normal for the DMA zone to be basically unused, as it is here; it's only 16 MBytes and it's the last zone used for allocations.
lowmem_reserve[]: 0 3511 4016 4016
This is the 'lowmem reservation' level for the DMA zone. When the kernel is handling an allocation that started in another zone (either DMA32 or Normal on this machine), the DMA zone must have at least a certain number of pages free in order to be a fallback candidate for the allocation (sort of, it actually gets quite complicated). Here the numbers are 0 pages free for DMA allocations, 3511 pages free for DMA32 allocations, and 4016 pages free for Normal (or HighMem) allocations. Since the DMA zone has less than 4016 pages, it's perhaps not too surprising that it's mostly free right now.
(See the discussion of lowmem_reserve_ratio in the documentation
for virtual memory sysctls for more
information on this.)
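To illustrate the core of it, here is my reading of the reservation part of zone_watermark_ok() in mm/page_alloc.c, simplified to the point of caricature and using the DMA zone's numbers from this report (15900 Kb free is 3975 pages, and the min watermark of 28 Kb is 7 pages):

    def zone_usable(free_pages, watermark, lowmem_reserve, classzone_idx):
        # a zone can serve an allocation that 'belongs' to a higher zone
        # only if its watermark plus the reserve would still be left over
        return free_pages > watermark + lowmem_reserve[classzone_idx]

    dma_reserve = [0, 3511, 4016, 4016]
    print(zone_usable(3975, 7, dma_reserve, 2))  # Normal-class request: False
    print(zone_usable(3975, 7, dma_reserve, 1))  # DMA32-class request: True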
Node 0 DMA32 free:12560kB min:7076kB low:8844kB high:10612kB active_anon:267460kB inactive_anon:75996kB active_file:935288kB inactive_file:938308kB unevictable:0kB isolated(anon):0kB isolated(file):124kB present:3596192kB mlocked:0kB dirty:128kB writeback:4kB mapped:68660kB shmem:16kB slab_reclaimable:1185640kB slab_unreclaimable:51468kB kernel_stack:4808kB pagetables:30040kB unstable:4kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 505 505
Node 0 Normal free:2180kB min:1016kB low:1268kB high:1524kB active_anon:49604kB inactive_anon:49752kB active_file:106068kB inactive_file:84232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:0kB dirty:24kB writeback:0kB mapped:10724kB shmem:12kB slab_reclaimable:97664kB slab_unreclaimable:21692kB kernel_stack:2352kB pagetables:14092kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
The same report for the other two zones. Now, notice something interesting: this machine has a tiny Normal zone and a much larger DMA32 zone. This is because it has only 4 GB of memory; as mentioned before, this means that almost all of its memory goes into the DMA32 zone instead of Normal.
(You might wonder why any memory goes into the Normal zone at all on a machine with only 4 GB of memory. The answer is 'PC hardware craziness'; the hardware needs some address space below 4 GB for PCI devices, so some of your 4 GB of RAM gets remapped above the 4 GB address space boundary.)
Node 0 DMA: 3*4kB 2*8kB 4*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15900kB
Node 0 DMA32: 2360*4kB 80*8kB 21*16kB 7*32kB 4*64kB 1*128kB 2*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 12560kB
Node 0 Normal: 377*4kB 14*8kB 17*16kB 9*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2180kB
We have finally reached the really interesting bit. This is a report of how many chunks of each order are free in each zone (helpfully expressed in the actual allocation sizes instead of the order number, so you have to remember to map back and forth). From this we can see that the DMA32 and DMA zones could easily satisfy an order 4 (64 Kb) allocation, but the Normal zone is entirely out of everything larger than 32 Kb (which is not really surprising since it only has 512 MBytes of memory total).
This gives me enough information to understand why this allocation failed, but explaining the exact details is sufficiently complex (and involves deep kernel internals) that I have to put it in another entry. The short version is that when you do high-order allocations there is a very complex interaction between the low water mark for a zone and how many pages are free at various orders that can rule zones out when you might not expect it. This can especially happen when a zone has a significant amount of its free memory in order 0 pages, as is the case for DMA32 here.
530984 total pagecache pages
14979 pages in swap cache
Swap cache stats: add 800917, delete 785938, find 6804111/6850366
Free swap  = 881852kB
Total swap = 975864kB
This is information about the state of the swap system, produced by
show_swap_cache_info() in mm/swap_state.c. It's probably
mostly of interest to see if you've exhausted your swap space at the
time of the allocation failure.
1048576 pages RAM
34079 pages reserved
277863 pages shared
958709 pages non-shared
At the end of the report the kernel prints out some overall memory state information.
On a side note, you might wonder why the OOM killer didn't trigger here (it didn't). The short answer is that it's not triggered for higher order allocations, only for allocations of 32 Kb and smaller (order 3 or below). So this allocation request is just large enough to not trigger OOM processing. All things considered that's probably a good thing.
(Going through the OOM processing might have changed the memory situation enough for the allocation to succeed but it might also have started killing processes that we definitely didn't want to die, especially when the system has a decent amount of free memory.)
2012-06-15
How the Linux kernel divides up your RAM
The Linux kernel doesn't consider all of your physical RAM to be one great big undifferentiated pool of memory. Instead, it divides it up into a number of different memory regions (at least for kernel purposes), which it calls 'zones' (to simplify slightly). What memory regions there are depends on whether your machine is 32-bit or 64-bit and also how complicated it is.
The zones are:
- DMA is the low 16 MBytes of memory. At this point it exists for historical reasons; once upon what is now a long time ago, there was hardware that could only do DMA into this area of physical memory.
- DMA32 exists only in 64-bit Linux; it is the low 4 GBytes of memory, more or less. It exists because the transition to large memory 64-bit machines has created a class of hardware that can only do DMA to the low 4 GBytes of memory. (This is where people mutter about everything old being new again.)
- Normal is different on 32-bit and 64-bit machines. On 64-bit machines, it is all RAM from 4 GB or so on upwards. On 32-bit machines it is all RAM from 16 MB to 896 MB, for complex and somewhat historical reasons. Note that this implies that machines with a 64-bit kernel can have very small amounts of Normal memory unless they have significantly more than 4 GB of RAM. For example, a 2 GB machine running a 64-bit kernel will have no Normal memory at all while a 4 GB machine will have only a tiny amount of it.
- HighMem exists only on 32-bit Linux; it is all RAM above 896 MB, including RAM above 4 GB on sufficiently large machines.
Normally allocations can come from a more restrictive zone than you asked for if that's where the free memory is. For example, if you ask for Normal memory on a 64-bit machine and there isn't any but there's lots of DMA32, you'll get DMA32. It's just that the kernel prefers to preserve DMA32 for things that have asked for it specifically.
Now, this is a slight simplification; actually, zones and memory are attached to a 'node'. Ordinary machines have only a single node, node 0, but sufficiently large servers can have multiple nodes (we have one with eight). Nodes are how Linux represents NUMA architectures. To simplify, each CPU is also associated with a node and the kernel will try to allocate memory for a process running on a CPU from that node's RAM, because it is considered 'closest' to that CPU.
I believe that the special zones (DMA, DMA32, and on 32-bit machines, Normal) will only be present on one node, generally node 0. All other nodes will generally have only Normal (on 64-bit kernels) or HighMem (on 32-bit kernels) memory.
You can see a bunch of information about your system's nodes, zones, and
the state of their memory in /proc/pagetypeinfo, /proc/zoneinfo,
/proc/<pid>/numa_maps, and /proc/buddyinfo, which deserves an
explanation of its own.
The kernel's basic unit of allocatable memory is the 4 KByte page (many
stats are reported by page count, instead of memory size in Kbytes). The
kernel also keeps track of larger contiguous blocks of pages because
sometimes kernel code wants, say, a contiguous 64 kbyte block of memory.
/proc/buddyinfo shows you how many such free chunks there are for each
allocation 'order'. An order N chunk is 2^N pages, ie order 0 is a single
page, order 1 is 2 pages (8 Kb), order 2 is 4 pages (16 Kb), and so on.
So when /proc/buddyinfo reports, for example:
Node 0, zone DMA32 7 20 2 4 6 4 3 4 6 5 369
This means that in the DMA32 zone on this machine there are currently 7 free solo 4kb pages, 20 8kb two-page chunks, 2 16kb chunks, and so on, all the way up to 369 1024-page (4 Mbyte) chunks. Since the kernel will split larger chunks to get smaller ones if it needs to, the DMA32 zone on this machine is in pretty good shape despite seeming to not have many order 0 4kb pages available.
(This is also what /proc/pagetypeinfo means by 'order' in its output.)
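If you want to turn those columns into totals automatically, the arithmetic is easy to script; here's a quick Python sketch (my own throwaway code, nothing official):

    # sum up /proc/buddyinfo into total free Kb per zone
    with open("/proc/buddyinfo") as f:
        for line in f:
            fields = line.split()
            # eg ['Node', '0,', 'zone', 'DMA32', '7', '20', ...]
            node = fields[1].rstrip(",")
            zone = fields[3]
            counts = fields[4:]
            total_kb = sum(int(n) * (4 << order)
                           for order, n in enumerate(counts))
            print("node %s zone %s: %d Kb free" % (node, zone, total_kb))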
In fact, having a disproportionate number of order 0 pages free is generally a danger sign since order 0 pages exist only when the kernel can't merge them together to form higher-order free chunks. Lots of order 0 pages thus mean lots of fragmentation, where the kernel can't even find two adjacent aligned pages to merge into an 8kb order 1 chunk.
(See also the official documentation for various things in /proc.)
2012-06-07
My experience doing a Fedora 17 upgrade with yum: it worked fine
Despite the warnings on the yum upgrade web page, I just got through doing a yum upgrade from Fedora 16 to Fedora 17 on my office workstation. The short summary is that it went fine. I'd say that it went without problems, but that's not quite true; it went with no more problems than usual for my office workstation. In specific, the directions for using a dracut reboot to execute the /usr merge worked flawlessly.
(Note that I had already transitioned to having a single system
filesystem. Your mileage may be quite
different if you still have a separate /usr on an old machine.)
The one non-standard thing that I did for this upgrade is that I started
out by downloading all of the packages for Fedora 17 even before I
had done the /usr conversion (and thus committed myself to doing a
Fedora 17 upgrade), using 'yum --releasever=17 --disableplugin=presto
--downloadonly distro-sync'. While this speeds up the actual upgrade
somewhat, I had a bigger reason: checking for dependency problems.
While file conflicts aren't checked until
the actual package installs start, yum will find most of the package
dependency and compatibility problems from just downloading everything
(well, technically from the dependency solving it has to do in order
to know what to download). As usual for my yum upgrades, there were
a number of Fedora 16 packages that I wound up having to remove in
order to make yum happy with life here. By the way, note that
--skip-broken is not necessarily a magic cure to this sort of problem;
when I tried it in the first pass of resolving problems, yum confidently
told me that it was proposing to skip over 2,000 packages (more packages
than it was actually upgrading). Since that seemed unlikely to end well,
I started removing my problem packages and was ultimately able to avoid
using --skip-broken at all. And of course the ultimate advantage of
doing this before I was committed to the upgrade is that I could have
decided to hold off on the /usr conversion and the upgrade if serious
problems turned up.
(People with straightforward Fedora 16 installs will probably not have this problem, but my office workstation has a lot of packages installed and in general is the accreted product of almost six years of upgrading from Fedora version to Fedora version instead of reinstalling from scratch.)
As usual I still need to do various cleanup steps, like sweeping
my system for .rpmnew files and fixing up any of them that turn
out to be important (and also checking for now-standard RPMs that
need to be added). However I'm back running my usual environment and everything works fine (which is actually
unusual for Fedora upgrades, where I generally have to spend some time
afterwards fixing up my custom environment). In fact, some things have
improved; my office workstation now properly does automatic mounting of
removable devices.
(Apart from that Fedora 17 doesn't seem to be particularly different than Fedora 16, but that's really what I expect from running a custom environment. People who use Gnome or KDE may see more of a change.)
PS: my usual brute force approach for adding all of the RPMs that a
standard Fedora install has is to just do a standard install in a
virtual machine, copy the package list across, and install anything
that's missing on my workstation. In theory one can fiddle around with
'yum groupupdate', but in practice I find it more work than the brute
force approach.
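The comparison step itself is trivial; here's a sketch of the sort of thing I mean, assuming you've saved 'rpm -qa --qf "%{NAME}\n"' output from both machines to files:

    import sys

    def pkgnames(path):
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    # usage: compare.py vm-packages.txt workstation-packages.txt
    vm, mine = pkgnames(sys.argv[1]), pkgnames(sys.argv[2])
    for pkg in sorted(vm - mine):
        print(pkg)      # candidates to 'yum install'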
Sidebar: minor problems that I ran into
In addition to package issues:
- the machine didn't shut down and reboot cleanly after I did all of the
yum upgrades. This wasn't the kernel issue that the common bugs
page warns you
about, but something else (it actually had a kernel panic during
shutdown). I used Magic SysRQ to force a relatively clean reboot
and there were no subsequent problems.
- I had to run around turning off various services and daemons that
decided that they should be running just because they were
installed. If I had been really smart I would have checked for this
before I rebooted the machine instead of afterwards.
- I ran into the chronyd issue from the common bugs page because
I still use ntpd on my workstation (I like it better for obscure
reasons). No big deal; it's documented.
- even with my custom-built version of freetype,
font rendering seems slightly different between Fedora 16 and
Fedora 17 (in a way that I don't entirely like). I need to
investigate this more, but I consider it par for the course in
the modern world of XFT fonts.
- due to the removal of ConsoleKit in Fedora 17,
the long-standing
ck-xinit-session program has quietly disappeared; I had to take my invocation of it out of my 'start X' script.
As usual, the actual yum upgrade process took something like four to five hours. I doubt a DVD-based or a preupgrade-based upgrade would have been any faster, which makes me quite happy that I was able to keep using my workstation throughout the whole process.
2012-06-06
A feature that Linux installers should have: restoring your backups
For reasons beyond the scope of this entry, I've recently been poking around Windows 7; specifically I've been poking around the standard Windows 7 backup tool, which is actually pretty decent as these things go. It will pretty effortlessly back your data or your entire system up to either some sort of disk or to a network share, and then it has one really nice feature: you can restore your system right from the installer. This works basically painlessly; you boot the install CD on a bare metal machine, find the right option, point it at your backup on a network share, and then in a surprisingly short time your entire system is back just as it was.
(I believe that Mac OS X has a similar feature but I haven't experimented with it.)
Having experienced this with Windows, I can't help but think that Linux installers should be able to do this too. It's not technically challenging and it would be a significant help for users who wind up needing to do this sort of thing; if they had a properly prepared backup, restoring their machine after a disk failure or having to replace a laptop would be pretty much a snap.
(With the right setup, making a backup would be sufficiently easy that you might get people to actually do it. It's my guess that easy install time restores would help encourage backups since they make backups more clearly useful.)
Apart from the small matter of programming in the installer, the real issues with this idea are unfortunately political. It's completely infeasible for a Linux installer to support all of the many, many options that you have on Linux for backing up your system, so supporting install time restores would mean picking one single backup system to be the officially endorsed one (along with a handful of ways to configure and use it). Linux distributions generally hate to make choices like this unless they absolutely have to, partly because doing so generally starts huge debates and rows.
(On the other hand it's possible that I'm out of touch and some Linux distributions actually already support this. If so, I can only applaud them; sometimes making a decision is what you need for usability and being genuinely convenient.)
PS: if you're tempted to implement this, please support network filesystems as well as things like USB disks. Individual people are more likely to back up to external USB disks, but in an organization it's really useful to be able to provide easy network backup and restore for things like people's laptops. You can probably guess why we're interested in this whole field. Sadly this probably means supporting doing this over Samba/CIFS, not just NFS.
2012-06-01
Some things that strike me about Linux and UEFI secure booting
When I read Matthew Garrett's Implementing UEFI Secure Boot in Fedora, a number of things struck me about the situation (for his background on UEFI secure boot, see part 1, part 2, more, and especially this). The basic setup is that Microsoft is requiring that any hardware that wants to carry a 'Windows 8 ready' logo must support UEFI secure boot and must have it turned on.
(Actually, when I read Garrett carefully the last bit is not clear. He says that if Windows 8 is preinstalled UEFI secure boot must be enabled, but he doesn't say that a motherboard merely marked with the Windows 8 logo program must have secure boot turned on. It's possible that this is not a Microsoft requirement and that motherboard and system vendors may thus ship bare machines with UEFI secure boot turned off. We probably won't know until Windows 8 logo hardware starts shipping.)
First, something that's worth noting explicitly:
UEFI secure boot enabled machines will not boot unsigned CDs or USB sticks without you manually changing the BIOS settings.
This hits both install media and 'live CDs' (these days as likely to be a USB stick as a physical CD or DVD), and also PXE netbooting. Signed media can boot automatically, but not unsigned media. Among other things, this is a real usability hit for unsigned installers; you can't even boot the installer to an instruction screen about the need to disable secure boot in the BIOS. And as Garrett notes in his series, Microsoft has not mandated a specific UI for disabling secure boot in the BIOS so everyone is going to do it differently.
So:
- Ubuntu (well, Canonical) is going to do what Fedora is doing; they just
aren't talking about it publicly (yet).
Any mass-market focused Linux distribution faces exactly the same problem as Fedora does here. Usability requires that you not need people to fiddle with their BIOS, and that means you need to be signed. Ubuntu is if anything more focused on easy desktop usability than Fedora is, so they are going to have to get signed somehow. Outside Ubuntu contributors may not like this very much when it happens, but Canonical is going to force it through.
I expect other mass-market focused distributions to blink as well, although SUSE is the only one I can think of offhand.
(The flipside is that I will be very surprised if Debian goes with signing; it would be very hard to square it with their principles, and Debian really cares about those principles.)
- As Garrett covered here,
this means that the proprietary binary Nvidia and ATI graphic
drivers are dead for mass-market Fedora users (the ones who do
not go into their BIOS and disable UEFI secure boot), including
Nvidia's CUDA environment. Fedora is extremely unlikely to sign
binary drivers for Nvidia and ATI, and you cannot give users
the ability to load them anyways.
This is not just a Fedora issue, of course; any mass-market focused
distribution has the same problem (assuming that they get signed
for usability reasons, per above), including and especially Ubuntu.
This is going to be very unpopular, to put it mildly. My strong impression is that a lot of people use the proprietary drivers, especially with Nvidia hardware.
(It's possible that Canonical will figure out some way that they can sign the drivers; of all of the Linux distributions, I think they will be the most willing to hold their noses and compromise. I don't think that this will work in the long run, partly because I expect the binary drivers to be a fruitful source of exploitable kernel bugs once people have a motive to start looking.)
- Hardware compatibility lists are coming back, although not right away.
Based on Garrett's writeups of general BIOS issues, I have the strong impression that one of the golden rules of PC BIOSes is that if Windows doesn't need something for booting and Microsoft doesn't explicitly test it in their hardware certification tests, it doesn't work. While Microsoft requires that BIOS vendors support turning secure boot off, we don't know how well they're going to actually test this (and it seems unwise to assume that they'll do it thoroughly).
In the short term there will probably be enough pressure from people wanting to run old versions of Windows to keep the BIOS vendors honest. But in a few years, well, I'm not that optimistic. Laptops will probably be the canary in the coal mine here, since my impression is that most laptops aren't reinstalled with older versions of Windows and so most laptop buyers wouldn't notice if UEFI secure boot couldn't be turned off.
- I have no clue how this is going to interact with Linux and
Fedora support for virtualization. If Fedora leaves virtualization
alone, the problem is that at least in theory you could construct
a properly signed Fedora install that immediately booted Windows
in a full screen virtualized environment with a compromised 'UEFI
secure boot' BIOS and boot time malware.
(I'm using Fedora here as an example; you could do the same thing with any Linux distribution that gets itself signed and supports virtualization.)
Everything that I can think of to do to block this (or make it obvious that it's happening because Windows 8 runs really slowly or without the graphics bling that it should have) makes virtualization less useful or extends signing further and further into user-level components, or both. A good virtualization environment does want to offer fast graphics, good access to USB hardware, direct use of disk partitions, and so on. And all of these are highly useful for creating this sort of fake Windows virtual environment.
In the short term I'm more optimistic than Garrett is about how easy it will be to turn UEFI secure boot off. Since (as far as I know) older versions of Windows will not boot on UEFI secure boot machines, as long as a significant number of people will want to install them the BIOS vendors have a strong motivation to make this as easy as possible. The absolutely easiest way would be a boot-time popup that says 'you are trying to boot an unsigned thing; continue anyways?'
(This may wind up being disallowed by Microsoft's Windows 8 requirements, of course. Just like all other security warnings, the easier it is for users to disable secure boot the less effective it is at preventing boot time malware, because almost all users will just reflexively override the warning if they can. BIOS vendors don't care about this, but Microsoft does and they may put their foot down.)