Wandering Thoughts

2022-06-28

What symmetric and asymmetric IP routing are

In a recent entry I talked somewhat informally about symmetric (IP) routing. Symmetric and asymmetric IP routing are ideas that I'm familiar with from working on firewalls and networking, but they're not necessarily common knowledge in the broader community. We can approach what they are from two directions, so I'm going to start with how conventional IP routing works.

The traditional and normal way that your IP stack decides where an outgoing IP packet should be sent is based (only) on the destination IP address. If the destination IP is in a directly attached network, your system sends it out the relevant interface. If there's a specific route that applies to the destination IP, the packet is sent to the gateway the route lists. And if all else fails, the packet is sent to your default route's gateway (or dropped, if you have no default route).
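As an illustration (with made up addresses and an assumed 'eth0' interface), a conventional Linux routing table covering all three cases might look like this:

$ ip route
default via 192.168.100.254 dev eth0
10.30.0.0/16 via 192.168.100.200 dev eth0
192.168.100.0/24 dev eth0 proto kernel scope link src 192.168.100.5

Traffic to 192.168.100.0/24 goes directly out eth0, traffic to 10.30.0.0/16 goes to that route's specific gateway, and everything else falls through to the default route via 192.168.100.254.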

However, if you have a multi-homed host, a host with multiple interfaces and IP addresses, this approach to routing outgoing traffic can create a situation where outgoing and incoming packets for the same connection (or flow) use different interfaces. To have this happen you normally need at least two of your networks to be routable, which is to say that hosts not on those networks can reach them and hosts on those networks can reach other networks.

To make this concrete, say you have a host with two interfaces, each with an IP address: 10.20.0.10 on 10.20.0.0/16 and 192.168.100.1 on 192.168.100.0/24. Your default route is to 192.168.100.254 and you have no other special routes. There are two situations that will create a difference between incoming and outgoing packets. First, if any host not on 10.20.0.0/16 pings your 10.20.0.10 IP address, your replies will use your default route and go out your 192.168.100.0/24 network interface (despite coming from 10.20.0.10). Second, if a host on 10.20.0.0/16 pings your 192.168.100.1 IP address, your replies will go directly out your 10.20.0.0/16 interface despite coming from 192.168.100.1.

Both of these situations are asymmetric routing, where packets in one direction take a different path through the network than packets in the other direction. In a completely reliable network with no special features, asymmetric routing is things working as intended, with IP packets taking what your system believes is the most efficient available path to their destinations. However, in a network that may be having faults along some paths and that has firewalls, asymmetric routing can cause artificial connectivity failures (or hide them). It's especially a problem with stateful firewalls, because such a firewall will be seeing only one half of the conversation and will normally block it.

In symmetric routing, we arrange (somehow) for packets to take the same path in both directions in all of these situations. If you're pinged at 192.168.100.1, your replies always go out on 192.168.100.0/24 even if they're from a host in 10.20.0.0/16; if you're pinged at 10.20.0.10 by some random IP, your replies always go out on 10.20.0.0/16 even if your normal default route is through 192.168.100.254 (you'll need a second default route for 10.20.0.0/16 to make this work). This also extends to traffic that your host originates. If you ping a host in 10.20.0.0/16 with the source IP of 192.168.100.1, your pings should go to 192.168.100.0/24's default gateway of 192.168.100.254, not directly out your 10.20.0.0/16 interface. If your 'source IP 192.168.100.1' pings did go out your 10.20.0.0/16 interface, the ICMP replies from the innocent 10.20.0.0/16 host would take a different return path and create asymmetric routing.

There are a variety of ways to create a situation with symmetric routing. One general approach is to create separate network worlds, each with only one (routed) network interface in it, and to confine packets (and connections) to their appropriate world. Another general approach goes by the name of policy based routing, which is the broad idea of using more than just the destination IP to decide on packet routing. To do symmetric routing through policy based routing, you make routing choices depend on the source IP as well as the destination IP.
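To make the policy based routing version concrete on Linux, here is a minimal sketch using the example addresses above, with assumed interface names (eth0 and eth1), arbitrary table numbers, and an assumed 10.20.0.0/16 gateway of 10.20.0.254:

ip route add 10.20.0.0/16 dev eth1 table 20
ip route add default via 10.20.0.254 table 20
ip route add 192.168.100.0/24 dev eth0 table 100
ip route add default via 192.168.100.254 table 100
ip rule add from 10.20.0.10 table 20
ip rule add from 192.168.100.1 table 100

Each interface gets its own routing table with its own direct subnet route and default route, and the source IP selects which table is used.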

(Policy based routing is potentially much more general than mere symmetric routing, and I believe that it originates from the world of routers, not hosts. Sophisticated routing environments may have various complex rules, such as 'traffic from these networks can only use these links'. Symmetric routing itself is mostly a host issue.)

tech/SymmetricAndAsymmetricIPRouting written at 22:11:43

2022-06-27

Wishing for a simple way to set up multi-interface symmetric routing on Linux

For neither the first nor the last time, I've wound up in a situation where it would be quite useful for one of our machines to have what I will describe as simple symmetric routing across multiple interfaces. What I mean by this is a situation where each of the host's IP addresses is associated with an interface and when packets go out with a particular IP address, they use that interface (and the interface's default route). I call this "symmetric routing" because it makes the inbound and outbound paths the same for a given connection, which is not the case by default for a host with multiple interfaces today.

Setting this up with Linux's policy based routing is straightforward and almost mechanical. However, the setup has a lot of moving parts and there's no current automation for it that I know of. You can build your own, of course, but then you're stuck maintaining and operating your own automation; at that point you (we) start asking if you (we) really need symmetric routing, or if it's just a nice-to-have.

If you're directly using systemd-networkd, you can probably build something out of [Route] sections and [RoutingPolicyRule] sections, but keeping all of the sections organized for each interface and keeping the table numbers straight is up to you. Ubuntu's netplan can express similar things in its routing and routing-policy sections, but once again you're left to hand-craft everything to keep it organized (a look at the netplan examples may help get the syntax and placement of directives right). However, I'm not convinced that netplan can be made to work correctly for this because I don't see how to add direct subnet routes to tables in netplan, and direct subnet routes are required in some situations.
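As an illustration of the networkd version, the per-interface pieces look something like the following sketch; the interface name, addresses, gateway, and table number are all placeholders:

[Match]
Name=eno2

[Network]
Address=10.20.0.10/16

[Route]
# the direct subnet route has to go into the per-interface table too
Destination=10.20.0.0/16
Scope=link
Table=20

[Route]
Gateway=10.20.0.254
Table=20

[RoutingPolicyRule]
From=10.20.0.10/32
Table=20

You then repeat this pattern with a different table number for every other routed interface, which is exactly the kind of bookkeeping I'd rather not have to keep straight by hand.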

(It's also not always clear that you've considered all of the corner cases, especially if you're trying for a simple setup. As I've found, there can be quite a number of corner cases, some of which aren't obvious because things usually appear to work anyway.)

I don't expect the Linux kernel to have a simple configuration option or way to do this. The Linux kernel traditionally provides a bunch of low level networking options and it's up to you to build what you want out of them. But I would like things like systemd's networkd and Ubuntu's netplan to have a simple way of configuring something like this, one that reduces the amount of make-work and ensures that you've covered all of the corner cases.

(I would be surprised to get it, though. It's a little bit amazing that we have policy based routing support in systemd's networkd and Ubuntu netplan.)

PS: I've historically done this in two different ways, one as isolated interfaces for testing purposes and the other as my general isolated networks on my desktop. I'm not sure which approach works better, and that sort of illustrates why I'd like to have this all handled by networkd or netplan.

linux/SimpleSymmetricRoutingWish written at 22:50:31

2022-06-26

Modern disk sizes and powers of two

Recently I grumbled in an aside about how disk drive vendors use decimal ('SI') numbers instead of IEC 'binary' numbers, which I and various other people consider more natural. You might wonder what makes binary sizes more natural for disk drives, especially since vendors have been using decimal sizes for a long time. My answer is that it comes down to sector size.
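(To put a number on the difference, a disk sold as '20 TB' holds 20 × 10^12 bytes, which works out to only about 18.2 TiB once you divide by 2^40, so the binary figure always looks smaller than the decimal one on the label.)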

Almost all disks have had 512 byte sectors for decades, and disks have a user-usable capacity that is an integer number of sectors. Most systems have then used sectors (or some power of two multiple of them) as the minimum filesystem allocation unit, and correspondingly as the unit of used and free space. This makes for power of two units up and down the stack (although there's no reason for disks to have a power of two number of sectors).

Neither of these is inevitable. There used to be disks with different sector sizes, or at least disks that could be formatted to them (one source says that IBM AS/400s used 520 or 522 byte sectors, although even then they stored 512 bytes of payload data). Filesystems could allocate disk space in units other than the sector size, but doing it in sectors makes life easier, and for a long time writing a sector was assumed to be atomic and to not touch any other sectors.

(This is definitely not the case today for any sort of drive. SSDs famously have natural write sizes very different from 512 byte sectors, and even conventional HDDs may do things like rewrite entire tracks at once.)

Similar but stronger considerations apply for disk drive bandwidth numbers. Systems read and write from almost all modern disk drives in some multiple of 512 byte 'sectors' (which these days are logical sectors instead of physical ones). This leads naturally to talking about bandwidth in binary units of bytes per second, especially since other sorts of bandwidth are often also expressed in binary units.

(The underlying network speeds are in decimal bits per second but we usually talk about network bandwidth in bytes per second using binary units. Well, network engineers may be different than system administrators like me.)

PS: The Wikipedia page on Disk sectors says that the dominance of 512 byte sectors was driven by the popularity of the IBM PC.

tech/DiskSizesAndPowersOfTwo written at 22:15:10

2022-06-25

A limitation on what 'go install' can install (as of Go 1.18)

As all people dealing with Go programs know or are learning, now that Go is module-only, the way you install third party Go programs from source is 'go install <name>@latest', not the old 'go get <name>'. However, this is not always a completely smooth process that just works, because it's possible for Go programs to be in a state where they won't install this way. Here's an illustration:

$ go install github.com/monsterxx03/gospy@latest
go: downloading github.com/monsterxx03/gospy v0.5.0
go: github.com/monsterxx03/gospy@latest (in github.com/monsterxx03/gospy@v0.5.0):
    The go.mod file for the module providing named packages contains one or
    more replace directives. It must not contain directives that would cause
    it to be interpreted differently than if it were the main module.

What is happening here is that internally, gospy uses packages from its own repository (module) and one of them, github.com/monsterxx03/gospy/pkg/term, in turn uses github.com/gizak/termui/v3. However, the github.com/monsterxx03/gospy module has a replace directive for this termui module that changes it to github.com/monsterxx03/termui/v3.
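For illustration, the sort of go.mod replace directive involved looks like this (the version number here is made up):

replace github.com/gizak/termui/v3 => github.com/monsterxx03/termui/v3 v3.0.0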

If you clone the repository and run 'go install' inside it, everything works and you wind up with a gospy binary in your $HOME/go/bin. However, as we see here, 'go install ...@latest' works differently enough that the replace directive causes this error. To fix the problem (ie, to build gospy or any program like it), you must clone the repository and run 'go install' in the right place inside the repository.

(Alternately you can file bugs with the upstream to get them to fix this, for example by dropping the replace directive and directly using the replacement in their code. But if the upstream is neglected, this may not work very well.)

Unsurprisingly, there is a long standing but closed Go issue on this 'go install' behavior, cmd/go: go install cmd@version errors out when module with main package has replace directive #44840. This was closed more than a year ago in 2021 with a 'working as designed', and indeed the help for 'go install' explicitly says about this mode:

No module is considered the "main" module. If the module containing packages named on the command line has a go.mod file, it must not contain directives (replace and exclude) that would cause it to be interpreted differently than if it were the main module. The module must not require a higher version of itself.

(The apparent Go reference for why this exists is issue #40276, which I haven't tried to read through because I'm not that interested.)

Possibly this will be changed someday, especially since it seems to keep coming up over and over again; issue #44840 contains quite the laundry list of projects that have hit this issue. Amusingly, one of the gopls releases hit it too.

For now, if you're developing a Go program and you need to use replace directives in your go.mod during development, you'll have to do some extra work. One option is to strip the replace directives out for releases (and you need to make releases, because 'go install ...@master' won't work while your replace directives are present). Another option is to switch to using Go workspaces for local development and drop the go.mod replace directives entirely. If you need to actually release a version of your program that uses the replacement module, well, you're out of luck; you'll have to change your code to explicitly use the replacement module and drop the replace directive.
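As a sketch of the workspace option, if you keep a local checkout of your modified module next to your program (the directory layout here is just an assumption), a go.work file along these lines takes over the job of the replace directive for local development:

go 1.18

use (
	.
	../termui
)

Since 'go install ...@latest' ignores go.work files, this doesn't get in the way of other people installing your program.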

programming/GoInstallLimitation written at 22:25:17

2022-06-24

Even for us, SSD write volume limits can matter

Famously, one difference between HDDs and SSDs is that SSDs have limits on how much data you can write to them and HDDs mostly don't (which means that SSDs have a definite lifetime). These limits are a matter both of actual failure and of warranty coverage, with the warranty coverage limit generally being lower. We don't normally think about this, though, because we're not a write-intensive place. Sometimes there are surprises, such as high write volume on our MTAs or more write volume than I expected on my desktops, but even then the absolute numbers tend to be low and not anywhere near the write endurance ratings of our SSDs.

Recently, though, we realized that we have one place with high write volume, more than high enough to cause problems with ordinary SSDs, and that's on our Amanda backup servers. When an Amanda server takes in backups and puts them on 'tapes', it first writes each backup to a staging disk and then later copies from the staging disk to 'tape' (in our Amanda environment, these are HDDs). If you have a 10G network and fileservers with SATA SSDs, as we do, how fast an ordinary HDD can write data generally becomes your bottleneck. If your fileservers can provide data at several hundred MBytes/sec and Amanda can deliver that over the network, a single HDD staging disk or even a stripe of two of them isn't enough to keep up.

However, the nature of the work that a staging disk does means that it sees a high write volume. Every day, all of your backups sluice through the staging disk (or disks) on their way to 'tapes'. If you back up 3 TB to 4 TB a day per backup server, that's 3 TB to 4 TB of writes a day to the staging disk. It would be nice to use SSDs here to speed up backups, but no ordinary SSD has that sort of write endurance. Much as you'd have to aggregate a bunch of HDDs to get the write speed you'd need, you'd have to aggregate a bunch of ordinary SSDs to spread the writes out enough that each individual one stays within a write volume it can survive.
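(To put rough numbers on this, an ordinary consumer 2 TB SATA SSD seems to be rated for somewhere in the range of 700 to 1,200 TB written over its warranty lifetime. At 3 TB to 4 TB of backups flowing through it every day, that's used up in a year or less, which is not a comfortable lifetime for the disk at the heart of your backup system.)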

(In a way the initial backup to the staging disks is often the most important part of how fast your backups are, because that's when your other machines may be bogged down with making backups or otherwise affected by the process.)

There are special enterprise SSDs with much higher write endurance, but they also come with much higher price tags. For once, this extra cost is not just because the e word has been attached to something. The normal write endurance limits are intrinsic to how current solid state storage cells work; to increase them, either the SSD must be over-provisioned or it needs to use more expensive but more robust cell technology, or both. Neither of these is free.

sysadmin/SSDWriteLimitsCanMatter written at 23:44:44

2022-06-23

A mystery with Fedora 36, fontconfig, and xterm (and urxvt)

As of Fedora 36, Fedora changed their default fonts from DejaVu to Noto. This changes what the standard font names 'serif', 'sans', and especially 'monospace' map to. When I upgraded my desktops to Fedora 36, I had a very bad reaction to the 'monospace' change, because the result looks really bad. It turns out that part of the reason the result looks bad (although not all of it) is specific to xterm, and that is where the mystery comes in.

Here is what gnome-terminal looks like displaying 'Monospace 12' on Fedora 36 (in a 50x10 window):

Gnome-terminal on Fedora 36

Here is what xterm looks like displaying the same 'Monospace 12' on Fedora 36:

Xterm on Fedora 36 with stretched out text

Xterm and gnome-terminal are about the same vertical height (and in both, Fedora 36's Noto based monospace takes up more vertical room than DejaVu monospace does), but xterm is significantly stretched horizontally because the text has been rendered with a lot of space between each letter. With dramatic spacing between letters in the same word, the text is less dense and harder to read, and the result is fairly unpleasant to me.

The fc-match program claims that 'monospace' on Fedora 36 is Noto Sans Mono in Regular:

; fc-match monospace
NotoSansMono-VF.ttf: "Noto Sans Mono" "Regular"

I wasn't able to get very far with figuring out just what font and font options xterm is using, despite using the usual magic tricks. But I did today think to test with urxvt (and alacritty), and now I have a second puzzle, because urxvt (and alacritty) render text with more spacing between letters than gnome-terminal does but less than xterm does. Since I can, here is an image of it:

Urxvt on Fedora 36

The letter spacing in urxvt isn't what it should be (that would be how gnome-terminal renders it), but it's significantly less spread out and more readable than what xterm produces.

This difference in rendering between gnome-terminal, urxvt, and xterm is new with the Fedora 36 setup and Noto Sans Mono. In Fedora 35, with 'monospace' as DejaVu Sans Mono, all three had text that looked the same and came out the same size.

One of the anomalies here is that while a Fedora 36 gnome-terminal will allow you to select a custom font of 'Monospace' or 'DejaVu Sans Mono' (if you have the DejaVu fonts installed), it won't let you select 'Noto Sans Mono'. The only Noto font that appears in gnome-terminal's Preferences is Noto Emoji. It's possible that this is because Noto Sans Mono does not appear to have the 'spacing=mono' property, which causes it to not show up in eg 'fc-list :scalable=true:spacing=mono: family'.

As documented in fonts.conf if you read carefully, the 'mono' value for spacing is 100. Dumping font information in various ways (including using fc-scan on the font files), the Noto Sans Mono fonts do appear to lack a 'spacing' attribute, which probably means that they're being treated as proportional fonts. In fact this seems to be a known issue that's been outstanding since 2018, and it may not even be correct to assert that Noto Sans Mono is a genuine monospaced font, because it seems to contain some glyphs that are extra wide.
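One relatively direct way to check for the property is fc-scan's format output, something like this (the font file path here is a guess at where Fedora puts it):

; fc-scan --format '%{family}: spacing=%{spacing}\n' /usr/share/fonts/google-noto-vf/NotoSansMono-VF.ttf

A font that has the property will print its numeric spacing value (100 for 'mono'); for a font without it, the %{spacing} expansion simply comes out empty.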

Since modern fonts (in X11 and otherwise) are a mess, it's possible that I need to blame FreeType here instead of Fontconfig. Perhaps Fontconfig is always picking the same font but FreeType is being passed (or not passed) various rendering parameters that make it behave differently.

The whole thing is a mess. It feels like something went off the rails in the switch from DejaVu to Noto, but in some weird and subtle way.

(Trying to troubleshoot Fontconfig and perhaps FreeType issues is also a mess. For example, it seems inordinately difficult or impossible to get Fontconfig to tell me just what the program asked for and what Fontconfig gave it. Possibly I'm missing a magic decoder ring in all of this.)

PS: All three pictures here are from a stock Fedora 35 install that was upgraded to Fedora 36, running Cinnamon, in a virtual machine. My desktop environment has its own oddities, which is part of why I reproduced it in a virtual machine.

Sidebar: How to revert this for monospace and other fonts

Go to /etc/fonts/conf.d and copy the relevant dejavu related fonts to have a lower number, eg:

# cp 57-dejavu-sans-mono-fonts.conf 49-cks-dejavu-sans-mono-fonts.conf

The more I look at these Noto fonts, the less I like them even for the serif and sans families. I'm increasingly thinking of doing this for all three general font names, because my life is too short to squint at bad looking fonts. In fact I just did that on my home desktop and I think I'm happier with the result. Some things are perhaps too dense now, but the result feels more readable in all sorts of programs.
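For the record, covering all three families amounts to something like this (assuming Fedora's usual conf.d file names for the DejaVu fonts):

# cp 57-dejavu-sans-fonts.conf 49-cks-dejavu-sans-fonts.conf
# cp 57-dejavu-serif-fonts.conf 49-cks-dejavu-serif-fonts.conf
# cp 57-dejavu-sans-mono-fonts.conf 49-cks-dejavu-sans-mono-fonts.conf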

(I'm sure that the fine people in Fedora don't think that the Noto fonts look bad. But that's the result I get.)

linux/Fedora36FontconfigMystery written at 23:36:04

2022-06-22

Signing email with DKIM is becoming increasingly mandatory in practice

For our sins, we forward a certain amount of email to GMail (which is to say that it's sent to addresses here and then we send it onward to GMail). These days, GMail rejects a certain amount of that email at SMTP time with a message that some people will find very familiar:

550-5.7.26 This message does not have authentication information or fails to pass authentication checks (SPF or DKIM). [...]

(They helpfully include a link to their help section on "Make sure your messages are authenticated".)

As far as we can see from outside, there are two ways to pass this authentication requirement. First, the sending IP can be covered by actively positive SPF authorization, such as a '+a' clause. GMail actively ignores '~all', so I suspect that they also ignore '+all'. Second, you can DKIM sign your messages.
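As an illustration, an SPF record with actively positive authorization looks something like this (the domain and network here are made up):

example.org.  IN TXT  "v=spf1 a mx ip4:203.0.113.0/24 ~all"

The 'a', 'mx', and 'ip4:' mechanisms are the positive part; the trailing '~all' softfail is what GMail apparently ignores.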

There are people who don't like email forwarding, but I can assure them that it definitely happens, possibly still a lot. Unless you want your email not to be accepted by GMail when forwarded, this means you need to DKIM sign it, because forwarded email won't pass SPF (and no, the world won't implement SRS).

GMail is not the only large email provider, but they are one of the influential ones. Where GMail goes today, others are likely to follow soon enough, if they haven't already. And even if other providers (or GMail) accept the message at SMTP time, they might use something similar to these requirements as part of deciding whether or not to file the new message away as spam.

I'm not really fond of the modern mail environment and how complex it's become. But it is what it is, so we get to live with it. If your mail system is capable of DKIM signing messages but you're not doing so yet, you should probably start. If your mailer can't DKIM sign messages, you probably need to look into fixing that in one way or another.

(We're lucky in that we're DKIM signing locally generated messages, and unlucky in that we do forward messages and so we're trying to figure out what we can do to help when the message isn't DKIM signed.)

Appendix: The uncertainty of SRS and GMail

SPF's usual answer to the way it breaks forwarded email is SRS. However, it's not clear that SRS or any other scheme of rewriting just the envelope sender will pass GMail's SMTP authentication checks, because GMail's help specifically says:

For SPF and DKIM to authenticate a message, the message From: header must match the sending domain. Messages must pass either the SPF or the DKIM check to be authenticated.

SRS and similar schemes normally rewrite the envelope sender but not the message From:, and so would not pass what GMail says is their check (whether it actually is, who knows). Effectively GMail is insisting on DMARC alignment even without DMARC in the picture.

spam/DKIMSigningMostlyMandatory written at 21:01:24

2022-06-21

Some network speeds and network related speeds we see in mid 2022

We are not anywhere near a bleeding edge environment. We still mostly use 1G networking, with 10G to a minority of machines, and a mixture of mostly SATA SSDs with some HDDs for storage. A few very recent machines have NVMe disks as their system disks. So here are the speeds that we see and that I think of as 'normal' in our environment in mid 2022, with somewhat of a focus on where the limiting factors are.

On 1G connections, anything can get wire bandwidth for streaming TCP traffic (or should be able to; if it can't, you have some sort of problem). On 10G connections, a path between Linux machines without a firewall in the middle should readily run over 900 Mbytes/sec for a TCP connection without any specific tuning (and without Ethernet jumbo frames). We haven't tried to measure our OpenBSD firewalls recently but I don't think they can move traffic this fast yet. SSH connections aren't this fast; we can count on hitting 1G wire bandwidth but generally not anywhere near close to 10G TCP bandwidth with a single SSH connection.

(A machine with enough CPUs can improve the aggregate speed with multiple SSH connections in parallel.)
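(For scale, 10G Ethernet is 10 × 10^9 bits a second, which works out to a bit under 1200 Mbytes/sec before any protocol overhead, so 900+ Mbytes/sec of TCP payload is a large fraction of what the wire can theoretically carry.)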

A single HDD will do sustained reads and writes at somewhere between 100 Mbytes/sec and 240 Mbytes/sec; I generally assume 150 Mbytes/sec for most of our drives, although the very recent ones can go faster. However, there can be performance surprises in sustained HDD IO. A single HDD is generally fast enough to saturate a 1G network connection, but it takes quite a number of them operating in parallel in some way to reach the practical limit of 10G network speeds. Similarly, it would take a lot of HDDs operating in parallel to hit PCIe controller bandwidth limits.

(You're most likely to get enough HDDs in parallel to reach 10G TCP speeds in a RAID6 or RAID5 array, especially if you want them to write at those speeds.)

A single SATA SSD will do sustained reads in the general area of 500 Mbytes/sec. The SSD sustained write performance is more uncertain, but I've observed rates over 400 Mbytes/sec for over an hour on some systems. A single SSD doing streaming reads is slower than a 10G TCP network connection but probably faster than a 10G SSH connection; however, even a simple pair of mirrored SSDs will likely provide enough read bandwidth to saturate the 10G TCP connection. Our experience is that two (SATA) SSDs don't hit PCIe bandwidth limits, but that you can apparently hit them if you put enough SSDs on a single controller (our Linux fileservers seem to have an aggregate PCIe bandwidth limit for their drives on a SATA controller).

A single decent NVMe drive will definitely read fast enough to saturate a 10G TCP connection, even on a PCIe x2 link. As with SATA SSDs, I consider the sustained write performance to be more uncertain, and I don't have much data on it so far (we have no NVMe drives in situations where they would see sustained writes). However, generally the claimed sustained write performance numbers for NVMe drives are pretty good; if they hold up in real life, even a single NVMe drive should be able to write data at the full speed of a 10G TCP connection.

In a 1G network environment, the network is our limiting factor. In our 10G network environment doing transfers through SSH, SSH is probably the limiting factor. If you do direct TCP over 10G or use a high bandwidth SSH, HDD performance may be your limiting factor but SATA SSD performance is probably not the limit for reads (it might be for writes, since a mirrored pair of SATA SSDs only writes at the speed of one). It's likely that NVMe drives will make even 10G TCP performance your limiting factor for both reads and writes.

(However it will be years before we're using NVMe drives in any significant amounts, especially for things where bandwidth matters, unless a surprising amount of money rains on us out of the sky.)

sysadmin/NetworkRelatedSpeeds2022 written at 23:26:59

2022-06-20

Modern HDDs have gotten somewhat better than they used to be

I tweeted:

I didn't expect these spinning rust HDs to be able to sustain 235 Mbyte/sec read and write rates (sequential IO, it's software RAID building/resynching). I guess large, high-density drives have improved the 150 MB/sec I used to expect.

I haven't paid attention to the spinning rust hard drive industry for a while (long enough that I missed the terminology switch from 'HD' to 'HDD'). Generally at work, our use of HDDs is limited either to unimportant servers, where we're going through our old stock of 'smaller' HDDs to use them as system drives, or to bulk storage, generally using older drives, where we don't really pay close attention to read and write speeds.

Recently we got some 20 TB HDDs to become the future data storage for our Prometheus metrics system, which is outgrowing its current mirrored pair of 4 TB HDDs. For reasons somewhat beyond the scope of this entry, I'm not doing an in-place storage upgrade; instead I put the 20 TBs into the new Ubuntu 22.04 server that will take over as our Prometheus server and built a Linux software RAID mirror. When you build new Linux software RAID mirrors, they need to resync. I was curious how long this would take, so I looked at /proc/mdstat, which gave me both an estimated time (which amounted to about 24 hours) and a current data rate, which it said was around 220,000K/sec.
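For illustration, the resync portion of /proc/mdstat looks something like this (the numbers here are made up, not the real ones from that day):

md0 : active raid1 sdb1[1] sda1[0]
      19531694080 blocks super 1.2 [2/2] [UU]
      [==>..................]  resync = 12.3% (2402438400/19531694080) finish=1298.7min speed=220000K/sec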

At first I assumed that this data rate was an initial burst rate that would soon fall. But we have Prometheus metrics for this host, and several hours later they confirmed that it really was sustaining these data rates; in fact, they'd increased up to what I tweeted. One of the two 20 TBs is reading at this speed and the other is writing at it, and they're both sustaining it (in fact, Linux disk stats also claim that neither disk is 100% utilized).

When I checked the specification sheet for this drive series, this is unsurprisingly under their quoted maximum speed (claimed to be 269 decimal MB/s, because of course disk vendors quote everything in the smaller SI units). According to the specifications sheet (okay, "product brief"), these good data rates hold right down to the 4 TB model (at 255 MB/s), and even the 2TB and 1TB models claim higher sustained rates than I'm used to (of 200 MB/s and 184 MB/s).

Now that I've looked, this is also not exclusive to this model line from this vendor. It seems that all of the 20 TB HDDs we looked at quote similar data transfer rates, and probably do so even for smaller models (I haven't looked extensively). Apparently the days of assuming only 150 to 160 Mbyte/sec of sustained read and write performance on your HDDs are over (although we still have plenty of them that only perform that well).

(I have no idea if there's been any improvement in IOPS, which on HDDs is more or less seeks per second. Since they're still 7200 RPM drives, I suspect that it's basically the same as always.)

tech/HDDsNowSomewhatBetter written at 22:01:20

2022-06-19

What fast SSH bulk transfer speed (probably) looks like in mid-2022

A number of years ago I wrote about what influences SSH's bulk transfer speeds, and in 2009 I wrote what turned out to be an incomplete entry on how fast various ssh ciphers were on the hardware of the time. Today, for reasons outside the scope of this entry, I'm interested in the sort of best case performance we can get on good modern hardware, partly because we actually have some good modern hardware for once. Specifically, we have two powerful dual-socket systems, one with AMD Epyc 7453s and one with Intel Xeon Gold 6348s.

To take our actual physical network out of the picture (since this is absolute best case performance), I ran my test suite against the system itself (although over nominal TCP by ssh'ing to its own hostname, not localhost). Both servers have more than enough CPUs and memory that this is not at all a strain for them. Both servers are running Ubuntu 22.04, where the default SSH cipher is chacha20-poly1305@openssh.com with no separate MAC (it's implicit in the cipher).
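(As a sketch, one simple way to measure this sort of bulk transfer speed is something like:

$ dd if=/dev/zero bs=1M count=4096 | ssh -c aes128-gcm@openssh.com -o Compression=no somehost 'cat >/dev/null'

where dd's reported transfer rate at the end is your number. Whatever harness you use, the cipher is varied with ssh's -c option and the MAC with -m.)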

On the AMD Epyc 7453 server, the default SSH cipher choice ran at about 448 Mbytes/sec. A wide variety of AES ciphers (both -ctr and -gcm versions) and MACs pushed the speed to over 600 Mbytes/sec and sometimes over 700 Mbytes/sec, although I don't think there's any one option that stands out as a clear winner.

On the Intel Xeon Gold 6348 server, the default SSH cipher choice ran at about 250 Mbytes/sec. Using aes128-gcm could reliably push the speed over 300 Mbytes/sec (with various MACs). Using aes256-gcm seemed slightly worse.

I happen to have some not entirely comparable results from machines with another Intel CPU, the Pentium D1508, on tests that were run over an essentially dedicated 10G network segment between two such servers. Here the default performance was only about 150 Mbytes/sec, but aes128-gcm could reliably be pushed to 370 Mbytes/sec or better, and aes256-gcm did almost as well.

(These Pentium D1508 machines are currently busy running backups, so I can't run a same-host test on them for an apples to apples comparison.)

What this says to me is that SSH speed testing is not trivial and has non-obvious results that I don't (currently) understand. If we care about SSH speed in some context, we need to test it in exactly that context; we shouldn't assume that results from other servers or other network setups will generalize.

sysadmin/SshFastBulkSpeed-2022 written at 21:46:57
