Wandering Thoughts

2022-01-23

DNS queries to external sources do fail every so often out of the blue

It's tempting to think that DNS is a reliable environment in general practice, where if you're using a good resolving DNS server (including public ones run by eg Google and Cloudflare) and querying for major domains that have well run DNS servers, you won't see failures. After all, if you ask Google's 8.8.8.8 for the DNS A record (IP address) of amazon.com, you would expect it to always work unless something terrible has gone wrong.

For our own reasons, we use Prometheus's Blackbox exporter to make "black box" probes of various endpoints. Included among the probes we make are a variety of DNS probes to a variety of DNS servers. We started out checking to see if our own domains were resolvable, but then we extended this to querying other domains as a cross-check. And since this is part of our Prometheus and Grafana setup, we store all the results and show them on some dashboards. The result is, in some sense, depressing.

Individual DNS queries regularly fail. It doesn't happen very often, but it happens often enough that if we're looking at a one-hour dashboard, we can expect to see at least one failure. In perhaps unsurprising news, queries fail more often to external DNS servers than to internal ones (even when looking up external names), and it happens for both public resolvers and querying primary DNS servers for data they hold.

In typical use of DNS these failures are masked, because most resolvers and I believe most clients will automatically retry at least once or twice. Blackbox is an exception; although it's not documented, it makes only a single DNS query attempt, and it gives you the result. In the default configuration where you're making a UDP based DNS query, that will be a single DNS UDP query packet, so all of the usual UDP things can happen to it and the reply (on top of the DNS server you're querying just not answering you).
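
If you want to reproduce this kind of single-attempt UDP query by hand, dig can do it; this is just a sketch, and the five second timeout is an arbitrary illustrative choice:

    # one UDP attempt, no retries, five second timeout
    dig +tries=1 +time=5 @8.8.8.8 amazon.com A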

In a way this shouldn't really surprise me. I know that the general Internet is a broadly unreliable place, where packets can and do get dropped on a regular basis. But usually everything works well enough that we can ignore that and assume that even UDP and ICMP packets are just going to get through. Not always, though, as this demonstrates.

PS: In a way it's especially surprising for the big public resolvers that Google and Cloudflare run, because both have points of presence here in Toronto, so anycast routing means our network path to both of them is fairly short. Right now, both Google and Cloudflare appear to be directly connected to the Toronto Internet Exchange.

sysadmin/DNSQueriesCanFlake written at 23:29:28

2022-01-22

Modern public TLS is a quite different thing than it used to be

If you're not deeply involved with TLS, it probably seems that the state of public TLS today is much the way it used to be a decade ago, or even five years ago, including things like the fundamental problem with TLS on the web (which is that your browser trusts a ton of Certificate Authorities). This is not actually the case, for at least three reasons. Two of them are logistical changes, while the third is a dramatic change to the security of TLS in practice.

The first logistical change is that Let's Encrypt has made fully valid TLS certificates both widely available and free. In the process it's dragged other Certificate Authorities toward this model. I suspect that TLS certificates being free and widely available has made browser vendors much more willing to get strict on (other) Certificate Authorities, because limiting or deprecating other CAs no longer necessarily leaves site operators in a big pinch. The second logistical change is that Let's Encrypt's short certificate lifetimes have driven people to automate TLS certificate changes. This automation isn't perfect today, but it's a vast improvement from what it used to be.

(Let's Encrypt has also forced everyone to be honest about how much validation is actually done for ordinary TLS certificates, which is "not much". I doubt that this has changed people's perceptions about what a TLS certificate means, though.)

The dramatic change to the practical security of TLS is TLS Certificate Transparency (also), where the browser vendors require Certificate Authorities to publish information about all of their TLS certificates. A decade ago, the problem with TLS (and thus with HTTPS) was that any CA could issue a TLS certificate for any site and not get caught at it most of the time. This issuance might be because of a mistake, because the CA was compromised, or because an entity with sufficient power over the CA ordered them to do so.

In theory all of this is still possible today, because Certificate Authorities are no less vulnerable to mistakes, attackers, or state pressure. However, in practice, Certificate Transparency makes the issuance of a bad certificate a high stakes thing, especially if the website in question does things like publish restricted CAA records. A TLS certificate issued without being in the CT logs is both a smoking gun of CA misconduct and increasingly useless, since browsers increasingly only accept CT-logged TLS certificates. A TLS certificate issued and in the CT logs is exposed to public scrutiny and potential immediate alerts, and it's extremely non-deniable on the part of the CA. This is a big practical improvement in TLS security, just as blocking passive eavesdropping by switching to HTTPS is a big change.
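
As an illustration, a restrictive CAA record is just an ordinary DNS record; the domain and the CA named here are made-up examples, but a query and its answer might look something like:

    $ dig +short example.org CAA
    0 issue "letsencrypt.org"

(And separately, you can watch the public CT logs for certificates issued under your own names, for example through services like crt.sh.)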

A meta-level change is that the browsers are now in charge of TLS. In a way they always were, but now there are fewer of them than there used to be and they understand their power more and are more willing to exercise it. This is an important change in its own right because browsers don't care about how much money Certificate Authorities make or don't make.

All of this means that modern public TLS is a much safer place in practice than it used to be. Someone else getting a TLS certificate for your site is not necessarily harder than it used to be, but it's more risky and thus expensive, and thus much less likely to be done or to happen.

(If you're running a website, it's cheaper and usually much easier to get and manage TLS certificates than it used to be. We went from TLS certificates being various sorts of headaches to them being something we don't even think about.)

tech/TLSHasChangedALot written at 23:05:09

2022-01-21

Sorting out the situation with Intel desktop CPUs and hyper-threading

One of the reasons I've been thinking about the advantages of hyper-threading is that I had absorbed the information that in another one of Intel's market segmentation moves it had moved away from hyper-threading on Core desktop processors except for the top end (and expensive) Core i9s. It turns out that this is not entirely the case, although the situation is somewhat confusing.

In the Coffee Lake generation (Intel's 8th generation of Core CPUs), the i7 was the top of the desktop line and the only CPU series with hyper-threading; i5s and i3s did not have it. This is the CPU generation I have in my current home machine, which I put together in early 2018. In late 2018 Intel refreshed this to Coffee Lake Refresh, calling this the 9th generation. This refresh introduced the new i9 series CPUs and dropped hyper-threading from the 9th generation i7 CPUs, making the expensive i9s the only ones with hyper-threading. However, Intel increased the core count on i7s, moving them from 6 cores to 8 cores.

Whether an extra two real cores (with one CPU each) are better than an extra six hyper-thread CPUs is probably very dependent on your workload. However, I think that on the whole the impression it left on people was probably not positive. You can say what you like about the whole situation, but Intel (and others) have been talking about 'CPU' numbers for a fair while and the 9th generation i7 line made those numbers go down. And in a way that felt like more Intel market segmentation.

Since then, Intel has had Comet Lake (10th generation), Rocket Lake (11th generation), and just recently Alder Lake (12th generation). Starting with Comet Lake, all of the Core model lines have had hyper-threading; i9, i7, i5, even i3, although the Rocket Lake generation had no i3s. Before Alder Lake, all cores were uniform and all supported hyper-threading. In Alder Lake, the main ('performance') cores all support hyper-threading, but i9, i7, and a couple of i5 CPUs also have extra 'efficiency' cores, which don't. All of these generations kept the i7 line with 8 cores (8 performance cores in Alder Lake).
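
As an aside, on Linux you can check how a given machine's cores and hyper-threads line up; this is just a sketch and the exact output layout varies between systems:

    # CPUs that share the same SOCKET and CORE values are SMT siblings
    lscpu --extended=CPU,CORE,SOCKET
    # or ask about a specific CPU's sibling(s)
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list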

All of this makes the Intel i7 and i5 lines somewhat more competitive against the latest AMD Ryzens, although I'm not sure I'm still as inclined toward Intel as I used to be. Intel is still ahead in the widespread availability of integrated GPUs, where I believe basically every Intel Core model is available that way. With AMD Ryzen Zen3, you're more restricted for this, although there are perfectly good Zen3 Ryzens available with integrated graphics.

(I have various reasons to consider at least a stop-gap new machine right now. That Core i7s and i5s have hyper-threading opens up more CPU options than I thought I had before I started looking this up.)

tech/IntelDesktopCPUsSMT written at 23:20:34

2022-01-20

I'm using journalctl's --since option now to speed up checking logs

I've probably had an ambient awareness of journalctl's --since option to show the systemd journal since some particular time ever since I read enough of the manpage to find options like '-u' (used to see only logs for a single unit) and '-b' (used to select which system boot you want to start from). But for a long time I didn't really use it, even when I mentioned it in my entry on '-u'. Recently that's been changing and I've been finding myself using --since more and more often, generally in two different situations.

The most obvious and straightforward situation is when I know that something odd happened on a system at a particular time and I want to look at the logs around that time. I typically pick a --since time a bit before the event's time, usually only a few minutes but sometimes more. On the one hand, the earlier you pick for --since, the more potentially irrelevant log messages you have to skip through; on the other hand, you can't scroll back to look at logs before your --since (not without quitting and restarting), so I want to make sure my starting point is early enough to include any early warning messages.

The other case is when I really want to start at the most recent messages and scroll backward. I used to use the old standby of 'journalctl -b0' followed by the less 'G' command to go to the end, but that can be slow, especially if the system didn't boot all that recently. Using a somewhat recent --since generally makes this much faster at the cost of limiting how far back I can scroll (which usually isn't an issue). Here I should make more use of systemd's relative time units (see systemd.time for details), for example '--since -4h', rather than looking at the current time and then specifying something a bit in the past.
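
Both patterns look something like this (the unit name and the specific times here are made up):

    # logs around a known event time, for one unit
    journalctl -u someunit.service --since '2022-01-20 14:50'
    # the last four hours of this boot's logs
    journalctl -b0 --since -4h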

I could use 'journalctl -r' for this, which shows the journal in reverse order, but for some reason my brain is happier seeing logs in their normal forward order and paging backward. Part of this is that the systemd journal is the only form of logs that I can actually look at in reverse; for all of the file-based logs I look at, I have no choice but to jump to the end and page backward.

In both cases, how far back I go depends partly on my guess or knowledge of how busy the journal is. If this is a system with busy logs, there's not much point in going very far back from what I'm interested in because I'll never look at all that volume.

(Log volume is quite variable on our systems for various reasons. Some systems have popular services that are exposed to the entire world, for example our IMAP servers, while others have low activity and don't have anything externally accessible, not even SSH. The latter systems tend to see ongoing log activity mostly from frequent cron jobs.)

linux/JournalctlSinceOption written at 21:46:24

2022-01-19

When I might expect simultaneous multithreading to help

Let's accept for the moment the idea that simultaneous multithreading can help under the right circumstances (Intel apparently sometimes claims a potential 30% benefit from hyper-threading, for example). It then becomes interesting to ask when you might expect SMT to help out your machines and when it's probably not going to do anything, partly because if you don't expect much from SMT, you should ignore the extra CPUs it theoretically gives you when considering choices of machines.

(There are at least two stories of why and how SMT might help.)

The first thing I would expect is that SMT mostly only helps if you have as many running processes as cores. Broadly speaking, a separate core is (or should normally be) superior to the second SMT CPU for unrelated processes, and you'd hope that operating systems schedule processes accordingly. Because SMT CPUs usually share caches and some other resources, it might be useful to schedule two threads from the same process into the same core so that they could take advantage of that, with one thread getting cache hits from the other thread's activities. Of course this won't work for all threads, since sometimes threads don't have any cache sharing.

(And if both threads are trying to access the same external resources at the same time, both could stall on eg RAM latency, which sort of defeats one of the points of SMT. You could be better off scheduling uncorrelated processes on each CPU of a SMT pair.)

If a process is paused briefly and then resumed, the operating system will normally try to re-schedule it on the same CPU it was on before so that it can take advantage of hot caches there. If this CPU is busy, it might be a win to schedule the process back onto the CPU's SMT sibling if it's free rather than push the process off to a different core; from the cache point of view, the two CPUs in a SMT pair are usually basically the same. This is not a sure thing, since the process that's now on the original CPU might have thrashed the caches.

I don't know if SMT could be expected to reduce the latency of how long runnable processes wait before executing (and then how long before they produce useful work, which is what really matters). SMT does provide extra CPUs to put processes on, but this only matters if you're busy enough that you don't have any full cores still idle. And once the processes are running they probably need (slow) memory accesses to do useful work.

Given the stories about SMT, I'd expect that you would get bad results if two carefully optimized computational kernels were scheduled onto SMT siblings. Both kernels would have basically maximal use of a core's available resources, and there aren't two of all of those resources for the two SMT siblings to share, so the two sides would be fighting each other for the core's resources. You might do well if you could schedule an integer compute kernel on one side and a floating point compute kernel on the other, but that probably depends on a lot of things.
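
One rough way to see this for yourself is to pin two copies of a compute kernel onto the two SMT siblings of one core, then onto two separate cores, and compare run times. A sketch, where './kernel' stands in for your program and the CPU numbers are assumptions you'd need to check against your machine's actual topology:

    # two copies sharing one physical core (assuming CPUs 0 and 8 are SMT siblings)
    taskset -c 0 ./kernel & taskset -c 8 ./kernel & wait
    # two copies on separate physical cores (assuming CPUs 0 and 1 are different cores)
    taskset -c 0 ./kernel & taskset -c 1 ./kernel & wait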

tech/SMTWhenMightHelp written at 23:58:59

2022-01-18

Logs are invisible (at least most of the time and by default)

Suppose, hypothetically, that you have a program that does something (perhaps it's a metric collection agent), and you know that it's possible for it to encounter a problem while in operation. So you decide that if there's a problem, your program will emit a log message. Now, as they say, you have two problems, because logs are invisible. Well, more specifically, things reported (only) in logs are invisible in almost all environments.

The reality of logs is that almost all of the time, nothing is looking at them. You can't. There are too many things being logged and they're too variegated. In most environments people only look at logs as part of troubleshooting, or maybe once in a blue moon to see if anything jumps out at them. The rest of the time, logs are written and then they sit there in case they're needed later.

If you want to actually surface problems instead of just recording them, you need something else in addition to the log messages. Perhaps you need a special log that's only written to with problems (and then something to alert about the log having contents). Perhaps you can use a metric (if you expose metrics). Perhaps you need to signal something. But you need to do something.

Speaking from plenty of personal experience, it's very tempting to ignore this. Logging a message is generally quite easy, while every other reasonable way of attracting attention is much harder (and often specific to your environment, which is to say how it is today; much logging is universal). But if you just log a message on problems, it's pretty certain you're going to find out about them by some other means (hopefully not by something exploding).

(A corollary of this is that if log messages are primarily read during troubleshooting, you should make them as useful for that as possible.)

PS: One way around this is to monitor your logs for messages that you know your programs log when they hit problems, or that you've otherwise found out indicate problems. This requires extra work to set up and often extra work to maintain. Also, now you get to watch out because your messages (or parts of them) have become an API between your programs and your generalized monitoring. Worse, it's a decoupled API that's not actually checked, so one side can drift out of sync with the other without anything noticing.
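
A minimal sketch of this sort of log-watching check, where both the unit name and the message text are made up (which is exactly the informal, unchecked API I'm talking about):

    # complain if a known problem message appeared in the last hour
    journalctl -u metrics-agent.service --since -1h | grep -q 'known problem message' && echo 'metrics agent reported problems'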

(This thought is brought to you by me discovering that one of the Prometheus metrics agents we run had discovered a problem on one host, and the only sign of it was log messages that I only noticed in passing. In theory we could have spotted this problem from some side effects on the exposed metrics; in practice we didn't know what to look for until it happened and I could observe the side effects.)

sysadmin/LogsAreInvisible written at 21:09:58

2022-01-17

Pipx and a problem with changing the system Python version

I use pipx on my work laptop, among other places, which I upgraded from Fedora 34 to Fedora 35 today. Afterward, my single pipx installed program didn't work, which was basically what I expected due to the familiar pip issue with Python versions; Fedora 34 has Python 3.9, while Fedora 35 has Python 3.10. Since virtual environments for one don't work with the other, the virtual environment for my installed program couldn't find any Python packages that had been installed in it.

Since I've had success with 'pipx reinstall' before, I assumed that the way to fix this was to do a reinstall. Unfortunately this resulted in a spectacular failure, where pipx deleted my virtual environment and then failed to recreate it with an error about pip not being available. Since the initial deletion lost the pipx metadata for my installed program there was no easy recovery, and anyway a 'pipx install' also had the 'pip not available' problem. Ultimately, this appears to be because pipx has a more or less hidden virtual environment of its own in ~/.local/pipx/shared, where it puts shared things, crucially including pip itself. This virtual environment is also bound to a specific version of Python; if you change your Python, it too stops working, which means that any per-program virtual environments that point to it also stop working.

(You can find the signs of this in your venvs as a pipx_shared.pth file in each venv's lib/python3.X/site-packages/ directory, which has the absolute path to the relevant part of this shared venv. Note that this means that your pipx installed venvs will probably fail if you change your home directory or copy them to another system with a different home directory, because they have the absolute path to this shared tree.)

On my laptop, I fixed the problem by the brute force solution of removing ~/.local/pipx entirely, but I only had one program installed through pipx. I did experiment enough to determine that pipx will recreate ~/.local/pipx/shared if you delete it (or rename it), but I don't know if this will work through a complete installed Python version upgrade process. If it does, I think what you need to do is upgrade the Python you're using, delete ~/.local/pipx/shared, then do 'pipx reinstall-all'.
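
If that recovery route does work, the whole thing would presumably look something like this (I haven't tested it end to end):

    # after the system Python version has changed:
    rm -rf ~/.local/pipx/shared
    pipx reinstall-all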

This is clearly a pipx bug where it should automatically detect an out of date shared area and rebuild it, but we deal with the pipx we have now, not the pipx we would like to have.

(This elaborates on some tweets.)

PS: Although pipx doesn't expose this to you, you can get a Python shell in the virtual environment of any installed program by running ~/.local/pipx/venvs/<what>/bin/python. This may be useful if you want to do things like inspect that Python's sys.path setting.
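
For example, with <what> replaced by the name of the installed program:

    ~/.local/pipx/venvs/<what>/bin/python -c 'import sys; print(sys.path)'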

python/PipxPythonVersionIssue written at 22:12:49

2022-01-16

HTTPS is still optional, at least sort of

I was recently reading this article (via). I have a number of reactions to it, but today's reaction is to the small portion of its argument that the need for HTTPS certificate renewal (and HTTPS certificates) makes modern websites somewhat dynamic in practice in that you can't just abandon them and necessarily have everything keep on working. My counterpoint is that HTTPS is still optional for certain sorts of sites, even here in early 2022.

Certainly what you can do on a plain HTTP website is limited and getting more so; a steadily increasing variety of Javascript features and so on are getting fenced off to secure origins only (ie, HTTPS sites). However, if you're building the sort of website which could reasonably be a static site, this is not likely to be a concern (unless you really want to do client side rendering of your content with a giant bolus of Javascript, which requires browsers to accept that bunch of Javascript over HTTP). You can definitely have a HTTP website today with useful content (for example), and modern browsers are still willing to show it to people, even if some of them put warnings into the URL bar.

It's possible that browsers will stop supporting plain HTTP at some point in the future, just like they stopped supporting FTP recently. But it seems much less likely. First, there are plenty of HTTP sites currently and it seems likely that many of these will continue to be HTTP in the future. Second, browsers need to continue to support HTTP the protocol for a long time to come, since it's one of the protocols used for 'HTTPS' (which is really multiple protocols now). Dropping support for plaintext HTTP is likely to remove relatively little code from browsers, unlike the case with FTP (where dropping FTP allowed removing all of the code for a somewhat complex protocol). Third, there would be a lot more people objecting to it than there are for FTP, since there are no other good clients for plaintext HTTP other than browsers, which again is unlike the situation with FTP.

(I expect people would be very vocal about things if any browser proposed dropping support for plaintext HTTP. There are a lot of tangled issues, since requiring HTTPS makes people dependent on access to the general CA infrastructure to run websites. Let's Encrypt notwithstanding, this access is in no way guaranteed today.)

web/HTTPSStillOptional written at 21:18:10

2022-01-15

You should do lint checks on your Prometheus alert (and recording) rules

I tweeted:

I turned Cloudflare's Pint Prometheus linter loose on our alert rules with our Prometheus server configured so it could check for metrics existence, and wow it found a bunch of problems (once again, on top of basic label checks I did before).

Today's learning experience: If you leave off the 0x on a hex number that starts with a letter in a Prometheus alert rule, like 'c0a8fdff', Prometheus interprets it as a metric name (and finds nothing).

As you might guess from the threading, Pint is what found my 'c0a8fdff' mistake.

If you're extremely meticulous about writing, reviewing, testing, and double checking your Prometheus rules, you might not have Pint find anything. I thought I was pretty good about writing alert rules and testing them, but I was very clearly wrong. Configuring and using Pint has been quite valuable, as has been teaching it how to connect to our Prometheus server so that it could check for things like the existence of metrics.
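
For example, a basic run looks something like this; the file name is made up, and pointing Pint at your Prometheus server (so it can check metric existence) takes a bit of extra work in its configuration file:

    pint lint rules/alerts.yml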

No linter is perfect, Pint included; I had to turn off a number of warnings for various reasons. But linters are a lot better than nothing and they're usually fairly easy to set up, which gives them a good return for your time. If you're really energetic you can write unit tests for rules, but this won't catch everything (it doesn't necessarily check that you're putting in the labels that you should be, for example) and it's a lot more work. The odds that I'll ever write any significant number of alert rule unit tests are very low; the odds that I will run Pint over our alert rules after I make changes are now very high.

PS: As with all linters, I strongly suggest getting your rules to the state where they don't generate any lint check warnings at all. It's much easier to tell the difference between zero warnings and some (new) warnings than it is to spot some new warnings in a sea of existing ones. Our alert rules have some silenced Pint warnings that I'm not entirely happy about for this reason; it's better to make it very obvious if there's a new warning than to keep nagging me about an existing issue I'm not going to fix right now.

sysadmin/PrometheusLintYourRules written at 22:37:55

2022-01-14

Link: Histograms in Grafana (a howto)

Histogram evolution: visualize how a distribution of values changes over time (via) has the article URL slug of 'grafana histogram howto', and the slug is quite accurate. It's a step by step walkthrough of how to do this for a native Prometheus counter histogram metric, which most of them are. It includes copious screenshots, which is especially useful since you have to do all of this through Grafana's GUI and describing GUI actions in text is not necessarily ideal. I've slogged through heatmaps and histograms in Prometheus and Grafana, and this article still taught me something quite useful that I hadn't realized (the 'exclude zeros' setting; I agree with the author that this should be the Grafana default).

PS: Contrary to what the article suggests, heatmap legends aren't always useful, at least in current versions of Grafana. I tried putting a legend on some disk IO latency heatmaps that have very small latencies and the result was not all that readable or clear.

links/GrafanaHistogramsOpstrace written at 22:36:20

Understanding what a DKIM (spam) replay attack is

I recently read A breakdown of a DKIM replay attack (via), which introduced me to the idea of a DKIM (spam) replay attack. In a DKIM spam replay attack, an attacker arranges to somehow send one or more messages with spam content through your system, and then saves the full message, complete with your DKIM signature. Once they have this single copy, they can use other SMTP servers to (re)send it to all sorts of recipients, since in SMTP and in mailers in general, the recipients come from (unsigned) envelope information, not the (signed and thus unchangeable) message.

As Protonmail notes, the damage is made worse if the attacker can somehow persuade you to create a DKIM signature that doesn't cover headers like Subject:, From:, or To:, for example by omitting them from the initial message they send. If the DKIM signature doesn't cover these headers for whatever reason, the attacker can add them after the fact and the message will still pass DKIM validation, and mail clients (and mail systems) will probably not flag that the message Subject and other things being shown to people is not actually signed. The attacker can also add an additional Subject: header (or other headers) to see if the recipient's overall mail system validates the DKIM signature with one but shows the other.

DKIM signatures can be made over missing headers, which can be used to 'seal' certain headers so that additional versions of them can't be added. When I experimented with our Exim setup, which uses default Exim DKIM parameters, it did sign missing Subject: and To: headers, effectively sealing them when they're absent, but it doesn't oversign headers that are present, so it doesn't currently seal any of them against additions.
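
Both effects show up in the h= tag of the DKIM-Signature: header: a header name listed there that isn't in the message is signed as empty, and a name listed more times than it occurs is sealed against additions. An illustrative (entirely made up) fragment:

    DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel2022;
        h=from:to:subject:date:message-id:subject; bh=...; b=...

Here 'subject' is listed twice but present once in the message, so adding a second Subject: header afterward would break the signature.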

(Exim takes its default header list to sign from RFC 4871. That's been obsoleted by RFC 6376, but our Ubuntu 18.04 version of Exim is definitely using the RFC 4871 list, not the RFC 6376 list, since its signatures include headers like Sender:, Message-ID:, and the MIME headers.)

Finding out about DKIM replay attacks has made me consider what we might do about them. Right now I can't think of very much we could do (although I can think of a certain number of clever ideas for bigger, more complex places with more infrastructure). However, perhaps we should have a second set of DKIM keys pre-configured into our DNS and ready to go live, so that we can switch at the drop of a hat if we ever have to (well, with a simple configuration file change).

(I think that rotating your DKIM keys regularly might help to some degree, but my assumption is that someone who manages to get you to DKIM sign a bad message is most likely going to start their mass sending activities almost immediately. If nothing else, the longer they wait the more out of place the message's (signed) Date: header will look.)

spam/DKIMSpamReplayAttack written at 22:24:59
