The Linux kernel's pstore error log capturing system, and ACPI ERST
In response to my entry yesterday on enabling reboot on panic on your servers, a commenter left the succinct suggestion of 'setup pstore'. I had never heard of pstore before, so this sent me searching, and what I found is quite interesting and surprising, with direct relevance to quite a few of our servers.
Pstore itself is a kernel feature that dates to 2011. It provides
a generic interface to storage that persists across reboots and
gets used to save kernel messages during a crash, as covered in
LWN's Persistent storage for a kernel's "dying breath" and the kernel documentation. Your
kernel very likely has pstore built in and your Linux probably
mounts the pstore filesystem at /sys/fs/pstore.
(The Ubuntu 16.04 and 18.04 kernels, the CentOS 7 kernel, and the
Fedora kernel all have it built in. If in doubt, check your kernel's
configuration, which is often found in
/boot/config-*; you're looking for
CONFIG_PSTORE and associated things.)
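As a rough illustration of that check, here is a minimal Python sketch that pulls the CONFIG_PSTORE options out of kernel configuration text. The function names are mine, and it assumes your distribution ships its kernel configuration as plain text files matching /boot/config-*; it only looks at options whose names start with CONFIG_PSTORE, so other related options are not covered.

```python
import glob
import re


def pstore_options(config_text):
    """Return a dict of CONFIG_PSTORE* options that are set in the given text."""
    options = {}
    for line in config_text.splitlines():
        # Lines like '# CONFIG_PSTORE_RAM is not set' start with '#' and are skipped.
        m = re.match(r"(CONFIG_PSTORE\w*)=(.*)$", line.strip())
        if m:
            options[m.group(1)] = m.group(2)
    return options


def scan_boot_configs(pattern="/boot/config-*"):
    """Map each kernel config file matching the pattern to its pstore options."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            results[path] = pstore_options(f.read())
    return results
```

Calling scan_boot_configs() on a machine and seeing CONFIG_PSTORE set to 'y' for the running kernel suggests pstore is built in.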
By itself, pstore does nothing for you because it needs a chunk of storage that persists across reboots, and that's up to your system to provide in some way. One such source of this storage is in an optional part of ACPI called the Error Record Serialization Table (ERST). Not all machines have an ERST (it's apparently most common in servers), but if you do have one, pstore will probably automatically use it. If you have ERST at all, it will normally show up in the kernel's boot time messages about ACPI:
ACPI: ERST 0x00000000BF7D6000 000230 (v01 DELL PE_SC3 00000000 DELL 00040000)
If pstore is using ERST, you will get some additional kernel messages:
ERST: Error Record Serialization Table (ERST) support is initialized.
pstore: using zlib compression
pstore: Registered erst as persistent store backend
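If you want to check for these messages across a number of machines, something like the following Python sketch could do it. It works on kernel message text you have already collected (for example the output of dmesg), and it assumes the messages look like the ones quoted above; the function names are mine, and other kernel versions may word things differently.

```python
import re


def has_erst_table(boot_messages):
    """True if the kernel reported an ACPI ERST table in these messages."""
    return any(re.search(r"ACPI: ERST\b", line)
               for line in boot_messages.splitlines())


def pstore_backend(boot_messages):
    """Return the registered pstore backend name (such as 'erst'), or None."""
    for line in boot_messages.splitlines():
        m = re.search(r"pstore: Registered (\S+) as persistent store backend", line)
        if m:
            return m.group(1)
    return None
```

A machine where has_erst_table() is true but pstore_backend() returns None has an ERST that pstore is not using, which is probably worth a closer look.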
Some of our servers have ACPI ERST and some of them have crashed,
so out of idle curiosity I went and looked at /sys/fs/pstore on
all of them. This led to a big surprise, which is that there may
be nothing in your Linux distribution that checks /sys/fs/pstore
to see if there are captured kernel crash logs. Pstore is
persistent storage, and so it does what it says on the can; if
you don't move things out of
/sys/fs/pstore, they stay there,
possibly for a very long time (one of our servers turned out to
have pstore ERST captures from a year ago). This is especially
important because things like ERST only have so much space, so
lingering old crash logs may keep you from saving new ones, ones
that you may discover you very much would like records of.
(The year-old pstore ERST captures are especially ironic because the machine's current incarnation was reinstalled this September, so they are from its previous life as something else entirely, making them completely useless to us.)
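Moving things out of /sys/fs/pstore can be sketched in a few lines of Python. This is only a sketch under some assumptions: the archive directory is a hypothetical location of my own choosing, the function name is mine, and it relies on the documented pstore behaviour that removing a file there lets the backend reclaim the space. It also does nothing about name collisions in the archive directory, which a real version would want to handle.

```python
import os
import shutil


def archive_captures(src="/sys/fs/pstore", dest="/var/lib/pstore-archive"):
    """Copy every file in src to dest, then remove it from src.

    Returns the list of archived file paths. The default dest is a
    hypothetical location, not something any distribution sets up.
    """
    os.makedirs(dest, exist_ok=True)
    archived = []
    for entry in sorted(os.scandir(src), key=lambda e: e.name):
        if not entry.is_file():
            continue
        target = os.path.join(dest, entry.name)
        # Copy first, so the record is only removed once a copy exists.
        shutil.copy2(entry.path, target)
        os.unlink(entry.path)
        archived.append(target)
    return archived
```

Run as root at boot, this would leave the backend's space free for the next crash while keeping the old records somewhere you can read them.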
Another pstore backend that you may have on some machines is one that uses UEFI variables. Unfortunately, you need to have booted your system using UEFI in order to have access to UEFI services, including UEFI variables (as I found out the hard way once), so even on a UEFI-capable system you may not be able to use this backend because you're still using MBR booting. It's possible that using UEFI variables for pstore is disabled by some Linux distributions, since actually using UEFI variables has caused UEFI BIOS problems in the past.
(This makes it somewhat more of a pity that I failed to migrate to UEFI booting, since I would actually potentially get something out of it on my workstations. Also, although many of our servers are probably UEFI capable, they all use MBR booting today.)
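A quick way to tell whether a running system was booted through UEFI is to look for /sys/firmware/efi, which the kernel only provides on a UEFI boot. A minimal Python sketch follows; the function name and the sysfs_root parameter are mine (the parameter is there only to make the check testable), and a true result says nothing about whether your distribution actually enables the UEFI variables pstore backend.

```python
import os


def booted_with_uefi(sysfs_root="/sys"):
    """True if the running kernel exposes UEFI firmware information."""
    return os.path.isdir(os.path.join(sysfs_root, "firmware", "efi"))
```

If this returns False, the UEFI variables backend is not available to you regardless of what the hardware is capable of.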
Given that nothing in our Ubuntu 18.04 server installs seems to
look at /sys/fs/pstore and we have some machines with things in
it, we're probably going to put together some shell scripting of
our own to at least email us if something shows up.
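The idea of such a check is simple enough to sketch, here in Python rather than shell. This is not what we actually run; the function name, the default recipient, and the sender address are all assumptions for illustration. It only builds the message, and returns None when the directory is empty so that nothing gets sent on a clean machine.

```python
import os
import socket
from email.message import EmailMessage


def build_report(directory="/sys/fs/pstore", recipient="root@localhost",
                 hostname=None):
    """Return an EmailMessage listing the entries in directory, or None if empty."""
    names = sorted(os.listdir(directory))
    if not names:
        return None
    hostname = hostname or socket.gethostname()
    msg = EmailMessage()
    msg["To"] = recipient                # assumed recipient address
    msg["From"] = "root@localhost"       # assumed sender address
    msg["Subject"] = f"{hostname}: {len(names)} pstore entries in {directory}"
    msg.set_content("\n".join(names) + "\n")
    return msg
```

Assuming the machine has a local mail server listening, a non-None result could be handed to smtplib.SMTP("localhost").send_message() from a cron job or a boot-time unit.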
(Additional references: Matthew Garrett's A use for EFI, CoreOS's Collecting
which mentions the need to clear out
/sys/fs/pstore, and abrt's
pstore oops wiki page,
which includes a list of pstore backends.)
PS: The awkward, brute force way to get pstore space is with the ramoops backend, which requires fencing off some section of your RAM from your kernel (it should be RAM that your BIOS won't clear on reboot for whatever reason). This is beyond my enthusiasm level on my machines, despite some recent problems, and I have the impression that ramoops is usually used on embedded ARM hardware where you have few or no other options.
Written on 25 January 2019.