Rasdaemon is what you want on Linux if you're seeing kernel MCE messages
Suppose, not hypothetically, that you're seeing in your kernel logs messages like this:
mce: [Hardware Error]: Machine check events logged
As explained in the Arch wiki entry on "Machine check exceptions", an MCE is generated by your CPU when the CPU detects that some sort of hardware error or problem has happened.
By itself, the kernel doesn't do anything more than log these very non-specific messages. If you want to know what exact machine check exceptions happened, you need something that pulls additional information out of the kernel and the hardware. The program the Arch wiki will refer you to, and that seems to mostly work for us, is rasdaemon, which replaces mcelog. On Ubuntu, just installing the 'rasdaemon' package will do everything necessary.
On our AMD Zen based machines, all of the rasdaemon reports that we've seen create log messages that look like this:
<...>-2676499  0.682548: mce_record: 2022-08-18 12:21:31 -0400 Unified Memory Controller (bank=16), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error. Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=0, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01a000101000000, addr= 4dd42ac0, synd= 89010a400200, ipid= 9600150f00, mcgstatus=0, mcgcap= 117, apicid= 0
If you missed these messages in the logs, you can (on Ubuntu) also see them with ras-mc-ctl, where they look like this:
1 2022-08-18 12:21:31 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=16), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x00000117, status=0x9c2040000000011b, addr=0x4dd42ac0, misc=0xd01a000101000000, walltime=0x62fe670a, cpuid=0x00800f82, bank=0x00000010
There's probably a way to find out that this is a (corrected) DRAM ECC error from this message, but it's not as obvious as what rasdaemon puts in the system logs. As a result, I prefer to look at the system logs and, at the moment, I consider the ras-mc-ctl database to be just a backup. However, according to Monitoring ECC memory on Linux with rasdaemon, ras-mc-ctl can be used to see if a particular DIMM is having problems. Monitoring ECC memory on Linux with rasdaemon also discusses how to map which DIMM is which and give them nice labels.
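Since I prefer the system log messages, here is a minimal Python sketch of pulling the key=value fields out of one rasdaemon mce_record line so that they can be filtered or counted. The sample is the log line shown earlier; the function names are my own, and the regular expression only assumes that values run up to the next comma or closing parenthesis, which holds for this line but is not a documented rasdaemon format.

```python
import re

# The mce_record line from our logs, as shown earlier in this entry.
SAMPLE = (
    "<...>-2676499  0.682548: mce_record: 2022-08-18 12:21:31 -0400 "
    "Unified Memory Controller (bank=16), status= 9c2040000000011b, "
    "Corrected error, no action required., mci=CECC, "
    "mca= DRAM ECC error. Memory Error 'mem-tx: generic read, "
    "tx: generic, level: L3/generic', memory_channel=1,csrow=0, "
    "cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, "
    "misc= d01a000101000000, addr= 4dd42ac0, synd= 89010a400200, "
    "ipid= 9600150f00, mcgstatus=0, mcgcap= 117, apicid= 0"
)

# key=value pairs; a value runs up to the next comma or ')'.
FIELD_RE = re.compile(r"(\w+)=\s*([^,)]*)")


def parse_mce_record(line):
    """Return a dict of the key=value fields found in one log line."""
    return {key: value.strip() for key, value in FIELD_RE.findall(line)}


def is_corrected_dram_ecc(line):
    """True if the line reports a corrected DRAM ECC error."""
    fields = parse_mce_record(line)
    return fields.get("mci") == "CECC" and "DRAM ECC error" in fields.get("mca", "")


if __name__ == "__main__":
    fields = parse_mce_record(SAMPLE)
    print(fields["bank"], fields["memory_channel"], fields["addr"])
    print(is_corrected_dram_ecc(SAMPLE))
```

Messages of a different shape (for example from Intel CPUs, which we haven't captured) may well need a different pattern.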
On our Ubuntu servers (which are a mixture of AMD and Intel CPUs), it appears harmless to install rasdaemon on machines that aren't experiencing memory errors, on both servers and more desktop focused motherboards and systems. Unfortunately this hasn't been my experience on my office Fedora desktop, where running rasdaemon seems to produce a stream of unclear complaints from both rasdaemon and abrt-server.
So far we've only captured MCEs from AMD Zen CPUs with rasdaemon (and only for DRAM ECC errors). We had one Intel-based machine with an apparently bad DIMM that would produce complaints, but we swapped out the DIMM before we got around to installing rasdaemon.
(There doesn't seem to be a great overview of MCE errors under Linux, with the kind of information that would let me understand these rasdaemon messages and some of the configuration it would like. For what there is, see eg the mcelog glossary page and some Linux kernel documentation and another version.)
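One piece that can be decoded by hand is the status value. The sketch below is my own reading rather than anything taken from rasdaemon: the top flag bits are the architectural x86 machine check status bits, while the CECC and UECC positions are the AMD-specific bits as I understand them, so treat those two as an assumption. Applied to the status value from the messages above, it at least agrees with rasdaemon's 'mci=CECC'.

```python
# Bit positions in the 64-bit machine check status value.
STATUS_BITS = {
    "VAL": 63,    # the register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the misc value is valid
    "ADDRV": 58,  # the addr value is valid
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # AMD: corrected ECC error (assumed position)
    "UECC": 45,   # AMD: uncorrected ECC error (assumed position)
}


def decode_status(status):
    """Return the set of flag names whose bit is set in status."""
    return {name for name, bit in STATUS_BITS.items() if (status >> bit) & 1}


if __name__ == "__main__":
    # status= 9c2040000000011b from the rasdaemon message above
    print(sorted(decode_status(0x9C2040000000011B)))
```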
Some resources for looking at the current development version of Go
Go is under more or less continuous development (although the pace and nature of changes is different near releases). The Go website, Go playground, and other resources are what you want if you're interested in the latest released version of Go, as most people are, but there are also some resources if you want to look at the latest development version, what is generally called the tip.
The official source code is at go.googlesource.com. Typically you'll want to look at the tree view of the main branch. There's also the Github mirror of Go, which is where the issues are and which may be more convenient to navigate. Getting your own local copy is straightforward, as is building Go from source.
Tip.golang.org is more or less what it sounds like. Generally I'll want the Go documentation, especially the Go language specification. Tip.golang.org has a link for the latest standard library documentation, which goes to pkg.go.dev/std@master. You can also directly look at the specification from your local source tree, in doc/go_spec.html, but it probably won't have formatting that's as nice. At the moment, godoc can be used to run a local web server to view the standard library documentation for a Go source tree (or perhaps only the source tree that it was built from, in which case you'll want to build the latest Go development version yourself).
(You can also use pkg.go.dev to get access to all tagged versions of the Go standard library documentation, which includes betas and release candidates as well as actual released versions.)
Famously and usefully, Go has the online Go playground. As I write this there are two ways to get a playground with the Go tip version. First, you can pick 'Go dev branch' from the version dropdown on the normal playground. Second, you can use gotipplay.golang.org. I believe the two are functionally equivalent, but the latter specifically tells you what development version it's using and also runs 'go vet' on your code as part of submission.
(The normal playground will also let you try things with the two currently supported Go versions, which as I write this are Go 1.18 and Go 1.19.)
If you want to look at the generated assembly code for something, the Godbolt Compiler Explorer is what you want. There are two ways to get the Go version; you can select 'Go' from the language dropdown on the main page, or go straight to go.godbolt.org. To get the development version of Go you need to select eg 'amd64 gc (tip)'; 'gc' is what the Compiler Explorer calls the usual Go toolchain, as opposed to gccgo.
If you want to use, try, or test with the latest Go development version, you may be interested in the gotip command. An interesting feature of gotip that's not available by just cloning the source repository and building locally is that it can build Go with a specific CL (what Go calls its pending changes). This may be useful if a Go bug report says that a specific CL may fix an issue you're seeing; you can (in theory) use gotip to build that CL and then use it to try your code.
I believe that the Go team is in the process of moving away from golang.org in favour of go.dev, so at some point the golang.org URLs here may stop working. Hopefully there will be go.dev equivalents of them, ideally with redirections from eg tip.golang.org to the new go.dev version.
(This is the kind of thing I write down for myself so I can find it again later.)
The names of disk drive SMART attributes are kind of made up (sadly)
A well known part of SMART is its system of attributes, which provide assorted information about the state of the disk drive. When we talk about SMART attributes we usually use names such as "Hardware ECC Recovered", as I did in my entry on how SMART attributes can go backward. In an ideal world, the names and meanings of SMART attributes would be standardized. In a less than ideal world, at least each disk drive would tell you the name of each attribute, similar to how x86 CPUs tell you their name. Sadly we don't live in either such world, so in practice those nice SMART attribute names are what you could call made up.
The only actual identification of SMART attributes provided by disk drives (or obtained from them) is an ID number. Deciding what that ID should be called is left up to programs reading SMART data (as is how to interpret the raw value). Because of this flexibility in the standard, disk drive makers have different views on both the proper, official names of their SMART attributes as well as how to interpret them. Some low-numbered SMART attributes have almost standard names and interpretations, but even that is somewhat variable; SMART ID 9 is commonly used for 'power on hours', but both the units and the name can vary from maker to maker.
Disk drive makers may or may not share information on SMART ID names and interpretations with people; usually they don't, except perhaps with some favoured drive diagnostic programs. Often, information about the meaning and names of SMART attributes must be reverse engineered from various sources, especially in the open source world. Open source programs such as smartmontools often come with an extensive database of per-model attribute names and meanings; in smartmontools' case, you probably want to update its database every so often.
As a corollary of this, names for SMART attributes aren't necessarily unique; the same name may be used for different SMART IDs across different drives. Across our collection of disk drives, "Total LBAs Written" may be any of SMART ID 233 (some but not all Intel SSDs), 241 (most brands and models of our SSDs and even some HDDs), or 246 (Crucial/Micron). Meanwhile, SMART IDs 241 and 233 have five different names across our fleet, according to smartmontools.
(SMART ID 233 is especially fun; the names are "media wearout indicator", "nand gb written tlc", "sandforce internal", "total lbas written", and "total nand writes gib". The proper interpretation of values of SMART ID 233 thus varies tremendously.)
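To make the consequence concrete, here is a small Python sketch of the kind of per-drive lookup a program has to do. The drive family keys are hypothetical labels of mine, and the tables hold only names mentioned in this entry; a real database such as smartmontools' is far larger and keyed differently.

```python
# Illustrative only: hypothetical drive families, each with its own
# mapping from SMART ID to attribute name.
ATTRIBUTE_NAMES = {
    "family-a": {233: "Total LBAs Written"},
    "family-b": {241: "Total LBAs Written"},
    "family-c": {246: "Total LBAs Written"},
    "family-d": {233: "Media Wearout Indicator"},
}


def attribute_name(family, attr_id):
    """Name of a SMART ID on one drive family, or a generic fallback."""
    return ATTRIBUTE_NAMES.get(family, {}).get(attr_id, f"Unknown Attribute {attr_id}")


def ids_for_name(name):
    """Every SMART ID that some drive family calls `name`."""
    return sorted({attr_id
                   for table in ATTRIBUTE_NAMES.values()
                   for attr_id, attr_name in table.items()
                   if attr_name == name})


def names_for_id(attr_id):
    """Every name that some drive family gives to `attr_id`."""
    return sorted({table[attr_id]
                   for table in ATTRIBUTE_NAMES.values()
                   if attr_id in table})


if __name__ == "__main__":
    print(ids_for_name("Total LBAs Written"))
    print(names_for_id(233))
```

The point of the sketch is only that neither the name nor the ID is a usable key on its own; you need to know the drive as well.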
Fortunately, NVMe is more sensible about its drive health information. The NVMe equivalents of (some) SMART attributes are standardized, with fixed meanings and no particularly obvious method for expansion.
PS: Interested parties can peruse the smartmontools drivedb.h to find all sorts of other cases.
Disk drive SMART attributes can go backward and otherwise be volatile
Recently, we had a machine stall hard enough that I had to power cycle it in order to recover it. Since the stall seemed to be related to potential disk problems, I took a look at SMART data from before the problem seemed to have started and after the machine was back (this information is captured in our metrics system). To my surprise, I discovered that several SMART attributes had gone backward, such as the total number of blocks read and written (generally SMART IDs 241 and 242) and 'Hardware ECC Recovered' (here, SMART ID 195). I already knew that the SMART 'power on hours' value was unreliable, but I hadn't really thought that other attributes could be unreliable this way.
This has led me to look at SMART attribute values over time across our fleet, and there certainly do seem to be any number of attributes that see 'resets' of some sort despite being what I'd think was stable. Various total IO volume attributes and error attributes seem most affected, and it seems that the 'power on hours' attribute can be affected by power loss as well as other things.
Once I started thinking about the details of how drives need to handle SMART attributes, this stopped feeling so surprising. SMART attributes are changing all the time, but drives can't be constantly persisting the changed attributes to stable storage, whether that's some form of NVRAM or the HDD itself (for traditional HDDs with no write endurance issues). Naturally drives will be driven to hold the current SMART attributes in RAM and only persist them periodically. On an abrupt power loss they may well not persist this data, or at least only save the SMART attributes after all other outstanding IO has been done (which is the order you want, the SMART attributes are the least important thing to save). It also looks like some disks may sometimes not persist all SMART attributes even during normal system shutdowns.
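If you have a time series of some attribute, spotting these backward steps is simple. This is a minimal Python sketch, not what our metrics system actually does; the sample values are made up for illustration.

```python
def find_resets(samples):
    """Return (index, previous, current) for every sample that is lower
    than the sample immediately before it."""
    resets = []
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1]:
            resets.append((i, samples[i - 1], samples[i]))
    return resets


if __name__ == "__main__":
    # Made-up readings of a counter that ought to only ever increase.
    readings = [1000, 1040, 1075, 1060, 1100]
    print(find_resets(readings))
```

Anything this reports for a nominally monotonic attribute is a candidate for a drive that lost its most recent in-RAM values.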
This probably doesn't matter very much in practice, especially since SMART attributes are so variable in general that it's hard to use them for much unless you have a very homogenous set of disk drives. There's already no standard way to report the total amount of data read and written to drives, for example; across our modest set of different drive models we have drives that report in GiB, MiB, or LBAs (probably 512 bytes).
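Comparing such values across drive models means normalizing the units first. The following Python sketch assumes you already know which unit each drive reports in (which is the hard part) and that an LBA is 512 bytes, which as noted is only probably true.

```python
# Bytes per reported unit. The 512-byte LBA size is an assumption.
UNIT_BYTES = {
    "GiB": 1024 ** 3,
    "MiB": 1024 ** 2,
    "LBA": 512,
}


def to_bytes(raw_value, unit):
    """Convert a raw SMART 'data written' value to bytes."""
    return raw_value * UNIT_BYTES[unit]


if __name__ == "__main__":
    print(to_bytes(3, "GiB"), to_bytes(2048, "LBA"))
```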
(Someday I may write an entry on fun inconsistencies in SMART attribute names and probably meaning that we see across our disks.)
PS: I don't know how NVMe drives behave here, since NVMe drives don't have conventional SMART attributes and we're not otherwise collecting the data from our few NVMe drives that might someday let us know for sure, but for now I'd assume that the equivalent information from NVMe drives is equally volatile and may also go backward under various circumstances.
Our slow turnover of servers and server generations
We have long had a habit of upgrading machines between Ubuntu versions either every two years (for most machines that users log in to or directly use) or every four years (although the past two years are an exception). The every two year machines upgrade to every LTS version; the every four year machines upgrade every other LTS version, as their old LTS version threatens to fall out of support. The longer version of this is in How we handle Ubuntu LTS versions.
One part of this that I haven't mentioned before now is how this affects the rollout of new generations of the servers we use. Barring exceptional events, we don't change the physical hardware that a given version of a server is built with once it's in production. Instead, the server hardware only turns over when we reinstall machines from scratch (usually on a new Ubuntu version) or build completely new servers that have no existing version. This means that even important production machines can be running on what is now out of date hardware, because it was our most up to date hardware when they were built three or four years ago. Less important servers can be using even older hardware, if it was our 'previous generation' hardware when they were built three or four years ago using it.
Because we tend to buy hardware in bulk every so often, this often means that we buy a block of new server hardware at time X and then it may be a year or more before all of the new hardware is actually deployed. I think that all of our Dell R340s have now been deployed and we have no brand new in box ones sitting around, but we're certainly still working through our boxes of Dell R240s (which we bought toward the end of their availability).
(This is on my mind lately because I pulled two R240s out of their boxes last week to use for upgraded servers, along with reusing an R210 II for a third one.)
When new server generations introduce new useful capabilities, like dedicated BMC network ports, these capabilities can be slow to spread through our fleet and correspondingly slow to get used. Unless we really need or want a new capability for some server, it can take a while before we decide it's sufficiently widespread to be investigated and put to use.
(All of which is to say that we're only now starting to default to connecting dedicated BMC networking ports to the network and configuring BMC networking. Until recently, dedicated BMC networking wasn't pervasive enough that we even thought about it.)
free() API means memory allocation must save some metadata
Here's something that I hadn't really thought about until I was thinking about the effects of free() on C APIs: the API of free() specifically more or less requires a conventional C memory allocator to save some metadata about each allocation. This is because free() isn't passed an explicit size of what to free, which implies that it must get this information from elsewhere. The traditional way the C memory allocator does this is to put information about the size of the allocation in a hidden area either before or after the memory it returns (often before, because it's less likely to get accidentally overwritten there).
(That C memory allocators store the size of allocations they've handed out is clear to anyone who's read through the small malloc() implementation in K&R.)
This free() API isn't the only way it could be; a less convenient version would be to pass in an explicit size. But this would be a pain, because in practice a lot of C allocations are variable-sized ones for things like (C) strings. The C free() API is in a sense optimized for blind allocations of variable sized objects. It also allows for a more straightforward optimization of realloc(): malloc() can round up the size you requested, save that size as the metadata, and then realloc() can expand your nominal allocation into any remaining free space if possible. So there are pretty strong reasons for free() to not require a size, even if it normally requires some extra allocator overhead.
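To illustrate the hidden size metadata and the realloc() trick, here is a toy model in Python rather than C. It treats a bytearray as the heap and offsets as pointers; the class, its names, and the 16-byte rounding are all my own inventions for illustration, and it never reuses freed space, so it's a sketch of the bookkeeping and not a real allocator.

```python
import struct

HEADER = struct.Struct("<Q")  # hidden 8-byte size field before each block
GRANULE = 16                  # requests are rounded up to a multiple of this


class ToyHeap:
    """A bump allocator over a bytearray, with a size header per block."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.top = 0      # next unused offset
        self.freed = []   # (pointer, usable size) of freed blocks

    def malloc(self, nbytes):
        usable = max(GRANULE, -(-nbytes // GRANULE) * GRANULE)  # round up
        if self.top + HEADER.size + usable > len(self.mem):
            raise MemoryError("toy heap exhausted")
        HEADER.pack_into(self.mem, self.top, usable)  # save the metadata
        ptr = self.top + HEADER.size                  # what the caller sees
        self.top = ptr + usable
        return ptr

    def usable_size(self, ptr):
        # Recover the size from the hidden header just before the block.
        return HEADER.unpack_from(self.mem, ptr - HEADER.size)[0]

    def free(self, ptr):
        # No size argument: the header says how much is being released.
        self.freed.append((ptr, self.usable_size(ptr)))

    def realloc(self, ptr, nbytes):
        old = self.usable_size(ptr)
        if nbytes <= old:
            return ptr  # fits in the rounded-up space, nothing to move
        new = self.malloc(nbytes)
        self.mem[new:new + old] = self.mem[ptr:ptr + old]
        self.free(ptr)
        return new


if __name__ == "__main__":
    heap = ToyHeap(256)
    p = heap.malloc(5)
    print(heap.usable_size(p), heap.realloc(p, 12) == p)
```

Growing the 5-byte allocation to 12 bytes costs nothing here because malloc() already recorded that 16 bytes were set aside for it.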
Of course you can build C memory allocators that avoid or amortize this overhead, most obviously by having free() never do anything (some programs will be perfectly fine with this and it's very fast). A slab allocator that uses size classes doesn't need size metadata for individual allocations that fall into size classes, because the size of an individual allocation is implicit in being allocated in a particular size class's arena. More broadly you can have an allocator interface where programs can set all future memory allocations to come from a particular arena, and then promise to de-allocate the arena all at once and not care about free() otherwise (letting you make free() a no-op while there's an active arena).
(Talloc is an explicit arena setup, as opposed to the implicit one I described, but of course this is an option too.)
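The size class idea can be sketched the same way. In this toy Python model (again my own illustration, with made-up size classes and arena sizes), each size class owns one contiguous range of addresses, so free() can work out an allocation's size purely from where the pointer falls and nothing is stored per allocation.

```python
SIZE_CLASSES = (16, 32, 64, 128)
ARENA_BYTES = 1024  # address space reserved for each size class


class SizeClassHeap:
    """Arena i covers addresses [i * ARENA_BYTES, (i + 1) * ARENA_BYTES)."""

    def __init__(self):
        self.next_free = [0] * len(SIZE_CLASSES)      # bump offset per arena
        self.free_lists = [[] for _ in SIZE_CLASSES]  # reusable pointers

    def malloc(self, nbytes):
        for i, size in enumerate(SIZE_CLASSES):
            if nbytes <= size:
                if self.free_lists[i]:
                    return self.free_lists[i].pop()
                if self.next_free[i] + size > ARENA_BYTES:
                    raise MemoryError("arena full")
                ptr = i * ARENA_BYTES + self.next_free[i]
                self.next_free[i] += size
                return ptr
        raise ValueError("too large for any size class")

    def size_of(self, ptr):
        # The arena a pointer falls in implies the size of its allocation.
        return SIZE_CLASSES[ptr // ARENA_BYTES]

    def free(self, ptr):
        # No header to read; the pointer's arena picks the free list.
        self.free_lists[ptr // ARENA_BYTES].append(ptr)


if __name__ == "__main__":
    heap = SizeClassHeap()
    a = heap.malloc(10)
    b = heap.malloc(20)
    print(heap.size_of(a), heap.size_of(b))
```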
My adventure with URLs in a Grafana that's behind a reverse proxy
I was oblique in yesterday's entry, but today I'm going to talk about the concrete issue I'm seeing because it makes a good illustration of how modern web environments can be rather complicated. We run Grafana behind a reverse proxy as part of a website, with all of Grafana under the /grafana/ path. One of the things you can add to a Grafana dashboard is links, either to other dashboards or to URLs. I want all of our dashboards to have a link to the front page of our overall metrics site. The obvious way to configure this is to tell Grafana that you want a link to '/', which as a raw link in HTML is an absolute path to the root of the current web server in the current scheme.
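For reference, this is how a plain '/' link resolves under ordinary URL resolution rules, shown with Python's urllib.parse.urljoin. The host name and dashboard path are hypothetical stand-ins for ours.

```python
from urllib.parse import urljoin

# Hypothetical URL of a dashboard page served under /grafana/.
page = "https://metrics.example.org/grafana/d/abc123/some-dashboard"

# A link of '/' is the root of the whole web server ...
print(urljoin(page, "/"))

# ... while '/grafana/' is only the root of the Grafana portion.
print(urljoin(page, "/grafana/"))
```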
When I do this, the link is actually rendered (in the resulting HTML) as a link to '/grafana/', which is the root of the Grafana portion of the website. Grafana is partially configured so that it knows what this is, in that on the one hand it knows what the web server's root URL for it is, but on the other hand its own post-proxy root is '/' (in Apache terminology, we do a ProxyPass of '/grafana/' to 'localhost:3000/'). This happens in both Firefox and Chrome, and I've used Firefox's developer tools to verify that the 'href' of the link in the HTML is '/grafana/' (as opposed to the link being rewritten when you hover or click on it).
What I take from all of this is that a modern web application is a complicated thing and putting it behind a reverse proxy makes it more so, at least if it's sharing your web server with anything else. Of course, neither of these two things are exactly news. Now that I know a little bit more about how much 'rehydration' Grafana does to render dashboards, I'm a bit more amazed at how seamlessly it works behind our Apache reverse proxy.
PS: Configuring the link value in Grafana to be 'https:/' defeats whatever rewriting is going on. The HTML winds up with that literal text as the 'href' value, and then the pragmatics of how browsers interpret this take over.
My uncertainty over whether an URL format is actually legal
I was recently dealing with a program that runs in a configuration that sometimes misbehaves when you ask it to create and display a link to a relative URL like '/'. My vague memory suggested an alternative version of the URL that might make the program leave it alone, one with a scheme but no host, so I tried 'https:/' and it worked. Then I tried to find out if this is actually a proper legal URL format, as opposed to one that browsers just make work, and now I'm confused and uncertain.
The first relatively definite thing that I learned is that file: URLs don't need all of those slashes; a file: URL with just a single slash after the scheme is perfectly valid and is interpreted the way you'd expect. This is suggestive but not definite, since the "file" URL scheme is a pretty peculiar thing.
An absolute URL can leave out the scheme; '//mozilla.org/' is a valid URL that means 'the root of mozilla.org in whichever of HTTP and HTTPS you're currently using' (cf). Wikipedia's section on the syntax of URLs claims that the authority section is optional. The Whatwg specification's section on URL writing requires anything starting with 'http:' and 'https:' to be written with the host (because scheme relative special URL strings require a host). This also matches the MDN description. I think this means that my 'https:/path' trick is not technically legal, even if it works in many browsers.
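As one more data point on the pragmatic side, here is how Python's urllib.parse treats these forms. This only shows what one lenient implementation does, not what any specification says is legal; the base URL is a hypothetical current page.

```python
from urllib.parse import urljoin, urlsplit

base = "https://example.org/grafana/d/abc"  # hypothetical current page

# Scheme-relative: keeps the current scheme and names a new host.
print(urljoin(base, "//mozilla.org/"))

# A scheme but no authority parses with an empty host ...
parts = urlsplit("https:/path")
print(parts.scheme, repr(parts.netloc), parts.path)

# ... and urljoin() then fills the host in from the base URL.
print(urljoin(base, "https:/path"))
```

So in Python, as in the browsers I tested, 'https:/path' winds up meaning the path on the current host, whether or not it's technically legal.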
Pragmatically, Firefox, Chrome, Konqueror, and Lynx (all on Linux) support this, but Links doesn't (people are extremely unlikely to use Lynx or Links with this program, of course). Safari on iOS also supports this, which is the extent of my easy testing. Since Chrome on Linux works, I assume that Chrome on other platforms, including Android, will; similarly I assume desktop Safari on macOS will work, and Firefox on Windows and macOS.
(I turned to specifications because I'm not clever enough at Internet search terms to come up with a search that wasn't far, far too noisy.)
PS: When I thought that 'https:/path' might be legal, I wondered if ':/path' was also legal (with the meaning of 'the current scheme, on the current host, but definitely an absolute path'). But that's likely even less legal than 'https:/path' and probably less well supported; I haven't even tried testing it.
Sidebar: Why I care about such an odd URL
The obvious way to solve this problem would just be to put the host in the URL. However, this would get in the way of how I test new versions of the program in question, where I really do want a URL that means 'the root of the web server on whatever website this is running on'. Yes, I know, that should be '/', but see above about something mis-handling this sometimes in our configuration.
(I don't think it's Apache's ProxyPassReverse directive, because the URL is transformed in the HTML, and PPR doesn't touch that.)
Some notes (to myself) about formatting text in jq
These days I'm having to deal with a steadily increasing number of commands that either output JSON only or where JSON is their best output option, and I want to reformat some of that JSON to a more useful or more readable text-based format. The obvious tool to do this with is jq, at least for simple reformatting (I think there's some things that are too tangled for jq). However, every time I need to do this, I keep having to look up how to format text in jq. Jq has a very big manual and a lot of features, so here's some notes to my future self about this.
In the normal case I have some fixed fields that I want to present in a line, for example information about SSH login failures:
logcli ... | jq '. | (.value, .metric.rhost, .metric.ruser)'
(I can leave off the '(' and ')' and it works, to my surprise, but I'm banging rocks here.)
First, I basically always want to use 'jq -r' so that strings aren't quoted (and strings may include what are actually numbers but rendered as strings by the tool). Then I know of several ways to produce text in a useful form.
Often the simplest way is to put the values into a JSON array and run them through the '@tsv' filter (described in the "Format strings and escaping" section of the manual), which produces tab-separated output:
$ ... | jq -r '. | [.value, .metric.ruser, .metric.rhost] | @tsv'
1596 root 126.96.36.199
[...]
By itself, @tsv just puts a tab between things, which can leave me with ragged columns if the length is different enough. As various people on the Internet will tell you, the column program can be used to align the output nicely.
The next option is string interpolation:
jq -r '. | "\(.value): \(.metric.rhost) -> \(.metric.ruser) on \(.metric.host)"'
1596: 188.8.131.52 -> root on <ourhost>
[...]
String interpolation permits embedded "'s, so you can write things like '\(.metric.ruser // "<no-such-login>")' even in the interpolation string.
The third option is string concatenation, provided that all of your values are strings (or you use @text on things).
jq -r '. | (.value | @text) + " " + (.metric.ruser // "??") + "@" + .metric.host + " from " + .metric.rhost'
1596 root@<ourhost> from 184.108.40.206
(I got this use of string concatenation from here.)
If I'm doing this text formatting in jq purely for output, I think it's clear that @tsv is the easiest option and has the simplest jq expression. I suspect I'd never have a reason to use string concatenation to produce the entire output line instead of doing string interpolation. Well, maybe if I'm in some shell context that makes giving jq all of those '\(' bits too hard, since string concatenation doesn't need any backslashes.
But honestly, if I need complicated formatting I'm more likely to fix jq's output up in awk with its printf. Awk printf will do a lot that's at least quite annoying in jq.
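If awk isn't to hand, the same sort of fix-up is short in Python too. This is only a sketch under assumptions: it presumes one JSON object per input line with the .value and .metric fields used above, the addresses are documentation examples, and the function name is mine. It fills in a default for a missing login (much like jq's '//', though Python's 'or' also replaces empty strings) and pads the columns the way column would.

```python
import json


def format_rows(lines):
    """Turn JSON objects (one per line) into aligned text rows."""
    rows = []
    for line in lines:
        obj = json.loads(line)
        metric = obj.get("metric", {})
        rows.append((
            str(obj["value"]),
            metric.get("ruser") or "<no-such-login>",
            metric.get("rhost", ""),
        ))
    if not rows:
        return []
    # Pad each column to the width of its longest entry.
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    return ["  ".join(col.ljust(w) for col, w in zip(row, widths)).rstrip()
            for row in rows]


if __name__ == "__main__":
    sample = [
        '{"value": "1596", "metric": {"ruser": "root", "rhost": "192.0.2.10"}}',
        '{"value": "7", "metric": {"rhost": "198.51.100.7"}}',
    ]
    print("\n".join(format_rows(sample)))
```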
Ubuntu 22.04 with multiple disks and (U)EFI booting
One of the traditional and old problems with UEFI booting on servers is that it had a bad story if you wanted to be able to boot off multiple disks. Each disk needed its own EFI System Partition (ESP) and you either manually kept them in synchronization (perhaps via rsync in a cron job) or put them in a Linux software RAID mirror with the RAID superblock at the end and hoped hard that nothing ever went wrong.
To my surprise, the state of things seems to be rather better in
Ubuntu 22.04, although there are still traps.
Modern Linuxes don't put much in the ESP, and in particular even Fedora no longer puts frequently changing things there. In Ubuntu 22.04, what's there in the EFI/ubuntu subdirectory is a few GRUB binaries and a stub that tells GRUB where to find your real /boot/grub/grub.cfg, which normally lives in your root filesystem. All of these are installed into /boot/efi by running grub-install, or into some other location by running grub-install with a different EFI directory.
(On a 64-bit Ubuntu 22.04 EFI booted system, grub-install's own help output will usefully tell you that the default target type is the EFI one, although the manual page will claim otherwise.)
This lets you manually maintain two or more ESPs; just mount the second one somewhere (perhaps temporarily) and run grub-install against it. Ubuntu has added a script to do more or less this, /usr/lib/grub/grub-multi-install, which is normally run by EFI grub package postinstalls. This script will run through a list of configured ESPs, mount them temporarily (if they're not already mounted), and update them with grub-install. In the 22.04 server installer, if you mark additional disks as extra boot drives, it will create an ESP partition on them and add them to this list of configured ESPs.
(I believe that you can run this script by hand if you want to.)
The list of configured ESPs is stored in a debconf selection, 'grub-efi/install_devices'; there are also a number of other related grub-efi/ debconf selections. An important thing to know is that configured ESPs are listed using their disk's ID, as /dev/disk/by-id/<something> (which is perfectly sensible and perhaps the only sane way to do it). This means that if one of your boot disks fails and is replaced, the list of configured ESPs won't include the new disk (even if you made an ESP on it) and will (still) include the old one. Apparently one fix is to reconfigure a relevant GRUB package, which I got from this AskUbuntu answer.
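One way to notice the stale-entry problem ahead of time is to check whether every configured device still exists. This Python sketch assumes that the debconf value is a comma-separated list of /dev/disk/by-id/ paths, which is my assumption about its format, and the device names in the example are made up. It doesn't read debconf itself; you would feed it the value you got from elsewhere.

```python
import os


def split_devices(value):
    """Split a comma-separated device list into individual paths."""
    return [dev.strip() for dev in value.split(",") if dev.strip()]


def missing_devices(value, exists=os.path.exists):
    """Return the configured devices whose path no longer exists."""
    return [dev for dev in split_devices(value) if not exists(dev)]


if __name__ == "__main__":
    # Hypothetical value with made-up disk IDs.
    configured = ("/dev/disk/by-id/ata-EXAMPLE_DISK_A-part1, "
                  "/dev/disk/by-id/ata-EXAMPLE_DISK_B-part1")
    for dev in missing_devices(configured):
        print("configured ESP is gone:", dev)
```

The 'exists' parameter is there so the check can be tried out against a fake list of present devices instead of the real /dev.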
(In the usual Debian and Ubuntu way, one part of this setup is that a package upgrade of GRUB may someday abruptly stop to quiz you about this situation, if you've replaced a disk but not reconfigured things since. Also, I don't know if there's a better way to see this list of configured ESPs other than running the debconf selections through 'grep ...'.)
Life would be nicer if you could set Ubuntu 22.04 to just install
or update GRUB on all valid ESPs that it found, but the current
situation isn't bad (assuming that the reconfigure works; I haven't
tested it, since we just started looking into this today). The
reconfiguration trick is an extra thing to remember, but at least
we're already used to running grub-install on BIOS boot systems.
I'm also not sure I like having /boot/efi listed in /etc/fstab and mounted, since it's non-mirrored; if that particular disk fails, you could have various issues.
(In looking at this I discovered that some of our systems were mounting /boot/efi from their second disk instead of their first one. I blame the Ubuntu 22.04 server installer for reasons beyond the scope of this aside.)
PS: On a BIOS boot system, a configured install device can be a software RAID array name, which presumably means 'install boot blocks on whatever devices are currently part of this RAID array'. I assume that UEFI boot can't be supported this way because there would be more magic in going from a software RAID array to the ESP partitions (if any) on the same devices.
PPS: Someday Ubuntu may let you configure both BIOS and UEFI boot on the same system, which would have uses if you want to start off with one but preserve your options to switch to the other for various reasons. We'd probably use it on our servers.