Wandering Thoughts

2019-12-07

Some important things about how PCIe works out involve BIOS magic

I'll start with my remark on Mastodon:

I still don't know why my Radeon graphics card and the PCIe bridge it's behind dropped down from PCIe 3.0 all the way to PCIe 1.0 bandwidth, but going into the BIOS and wandering around appears to have magically fixed it, so I'll take that.

PCIe: this generation's SCSI.

When I added some NVMe drives to my office machine and ran into issues, I discovered that the Radeon graphics card on my office machine was operating at 2.5 GT/s instead of 8 GT/s, which is to say PCIe 1.0 data rates instead of PCIe 3.0 ones (which is what it should be operating at). At the end of the last installment I speculated that I had accidentally set something in the BIOS that told it to limit that PCIe slot to PCIe 1.0, because that's actually something you can do through BIOS settings (on some BIOSes). I went through the machine's BIOS today and found nothing that would explain this, and in fact it doesn't seem to have any real settings for PCIe slot bandwidth. However, when I rebooted the machine after searching through the BIOS, I discovered that my Radeon and the PCIe bridge it's behind were magically now at PCIe 3.0's 8 GT/s.

I already knew that PCIe device enumeration involved a bunch of actions and decisions by the BIOS. I believe that the BIOS is also deeply involved in deciding how many PCIe lanes are assigned to particular slots (although there are physical constraints there too). Now it's pretty clear that your BIOS also has its fingers in decisions about what PCIe transfer rate gets used. As far as I know, all of these decisions happen before your machine's operating system comes into the picture; it mostly has to accept whatever the BIOS set up, for good or bad. Modern BIOSes are large opaque black boxes of software, and like all such black boxes they can have both bugs and mysterious behavior.

(Even when their PCIe setup behavior isn't a bug and is in fact necessary, they don't explain themselves, either to you or to the operating system so that your OS can log problems and inefficiencies.)

How do you know that your system is operating in a good PCIe state instead of one where PCIe cards and onboard controllers are being limited for some reason? Well, you probably don't, not unless you go and look carefully (and understand a reasonable amount about PCIe). If you're lucky you may detect this through side effects, such as increased NVMe latency or lower than expected GPU performance (if you know what GPU performance to expect in your particular environment). Such is the nature of magic.
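
If you want to go and look on Linux, the kernel exposes the negotiated and maximum link parameters for PCIe devices in sysfs, so a little script can flag anything running below its maximum. Here's a minimal sketch (the attribute names are the standard modern ones, but not every device exposes them, and some GPUs deliberately drop their link speed when idle, so a 'slow' link isn't always a problem):

  #!/usr/bin/env python3
  # Flag PCIe devices whose negotiated link speed or width is below their
  # maximum, using the link attributes Linux exposes in sysfs.
  import glob, os

  def read_attr(dev, name):
      try:
          with open(os.path.join(dev, name)) as f:
              return f.read().strip()
      except OSError:
          return None

  for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
      cur_speed = read_attr(dev, "current_link_speed")
      max_speed = read_attr(dev, "max_link_speed")
      cur_width = read_attr(dev, "current_link_width")
      max_width = read_attr(dev, "max_link_width")
      if None in (cur_speed, max_speed, cur_width, max_width):
          continue        # no PCIe link attributes for this device
      if cur_speed != max_speed or cur_width != max_width:
          print("%s: %s x%s (max %s x%s)" %
                (os.path.basename(dev), cur_speed, cur_width,
                 max_speed, max_width))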

PCIeAndBIOSDecisions written at 01:26:03; Add Comment

2019-12-04

Desktop motherboards can have fewer useful PCIe slots than they seem to

When I had to get an adapter card for my office machine's second NVMe drive, I worried (in advance) about the card itself slowing down the drive. It turns out that I should have also been concerned about a second issue, which is what PCIe slot to put it in. My current office machine uses an ASUS Prime X370-Pro motherboard, and if you look at it there are six PCIe slots; three of them are 'x1' single lane slots (which the manual calls PCIEX1_1 through 3), and three are physically x16 slots (called PCIEX16_1 through 3). In theory, this should make life simple; a good NVMe drive requires 4 PCIe lanes (written as 'x4'), and I have three slots that can do that.

Of course life is not so simple. For a start, just because a slot is physically a PCIe x16 slot doesn't mean that it supplies all 16 lanes (especially under all conditions), and basically no current motherboards actually provide three true x16 slots. Of the three x16 slots on this motherboard, the third is only PCIe x4 and the first two are only collectively x16; you can have one card at x16 or two cards at x8 (and it may be that only the first slot can do x16, the manual isn't entirely clear). The next issue is that the x16 @ x4 slot also shares PCIe lanes, this time with the second and third PCIe x1 slots. If you use a PCIe x1 card in either of the x1 slots, the x16 @ x4 slot becomes an x16 @ x2 slot. Finally, the first PCIe x1 slot is physically close enough to the first x16 slot that a dual width GPU card more or less precludes using it, which is unfortunate since that's the only PCIe x1 slot that doesn't conflict with the x16/x4 slot.
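
To make the sharing rules concrete, here's a toy model of just this motherboard's slot constraints as I've described them above (my reading of the manual, so treat it as illustrative rather than authoritative):

  # Toy model of the Prime X370-Pro slot sharing rules described above.
  # 'occupied' is the set of slots with cards in them; the result is the
  # electrical width each occupied slot ends up with.
  def effective_widths(occupied):
      widths = {}
      # PCIEX16_1 and PCIEX16_2 share 16 CPU lanes: x16 alone, x8/x8 together.
      cpu16 = {"PCIEX16_1", "PCIEX16_2"} & occupied
      for slot in cpu16:
          widths[slot] = 16 if len(cpu16) == 1 else 8
      # PCIEX16_3 is x4 at best and drops to x2 if PCIEX1_2 or PCIEX1_3 is
      # in use, because they share chipset lanes with it.
      if "PCIEX16_3" in occupied:
          shared = {"PCIEX1_2", "PCIEX1_3"} & occupied
          widths["PCIEX16_3"] = 2 if shared else 4
      # The x1 slots are always x1.
      for slot in {"PCIEX1_1", "PCIEX1_2", "PCIEX1_3"} & occupied:
          widths[slot] = 1
      return widths

  # A GPU in the first slot, an x1 card in PCIEX1_3, and an x4 card in
  # PCIEX16_3: the x4 card only gets x2.
  print(effective_widths({"PCIEX16_1", "PCIEX1_3", "PCIEX16_3"}))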

My office machine has a Radeon graphics card that happens to be dual width, an x1 Intel gigabit Ethernet card because I need a second network port, and now a PCIe NVMe adapter card that physically requires a PCIe x4 or greater slot and wants to be x4 to work best. The natural layout I used when I put the machine together initially was the Radeon graphics card in the first PCIe x16 slot and the Intel card in one of the two PCIe x1 slots where it would fit (I picked the third, putting it as far away from the Radeon as possible). When I added the NVMe card, I put it in the third PCIe x16 slot (which is physically below the third PCIe x1 slot with the Intel card); it seemed the most natural spot for it, partly because it kept room for air circulation for the fans of the Radeon card. Then I noticed that the second NVMe drive had clearly higher latency (especially write latency) than the first one, started looking, and discovered that it was running at PCIe x2 instead of x4 (because of the Intel Ethernet card).

If my graphics card could use x16 and I wanted it to, it might still be possible to make everything work at full speed, but I'd have to move the graphics card and hope that the second PCIe x16 slot can support full x16, not just x8. As it is, my card fortunately only wants x8, which means the simple resolution to my problem is moving the NVMe adapter card to the second PCIe x16 slot. If I wanted to also add a 10G-T Ethernet card, I might be out of luck, because I think those generally want at least x4.

(Our current 10G-T hardware turns out to be somewhat inconsistent on this. Our Intel dual 10G-T cards seem to want x8, but our Linux fileservers claim that their onboard 10G-T ports only want x1 with a link speed of 2.5GT/s.)

All of this is annoying, but the more alarming bit is that it's unlikely to be particularly obvious to people if their PCIe lane count is being reduced with cards like this PCIe to NVMe adapter card. It will still work, just more slowly than you'd expect, and then perhaps people write reviews saying 'this card is inferior and doesn't deliver full performance for NVMe drives'.

(This also omits the additional issue of whether the PCIe lanes in question are directly connected to the CPU or have to flow through the chipset, which has a limited bandwidth connection to the CPU. This matters on modern machines because you have to go through the CPU to get to RAM, so you can only get so much RAM bandwidth total from all PCIe devices behind the chipset, no matter how many PCIe lanes the chipset claims to provide (even after limitations). See also my older entry on PCIe and how it interacts with modern CPUs.)

Sidebar: PCIe 3.0 versus 2.0 in slots

The other issue in slots is which PCIe version they support, with PCIe 3.0 being potentially faster than PCIe 2.0. On my motherboard, only slots driven directly by the CPU support PCIe 3.0; slots driven through the AMD X370 chipset are 2.0 only. All of the PCIe x1 slots and the PCIe x16 @ x4 slot are driven by the chipset and so are PCIe 2.0, which may have been another source of my NVMe performance difference. The two full x16 slots are PCIe 3.0 from the CPU, as is the motherboard's M.2 slot.
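
To put rough numbers on the difference (the transfer rates and line encodings are the standard PCIe ones; this ignores all protocol overhead beyond the encoding):

  # Approximate theoretical bandwidth per PCIe generation and lane count.
  # A 'transfer' moves one bit per lane, so GT/s * encoding efficiency / 8
  # gives GB/s per lane.
  gens = {
      "PCIe 1.0": (2.5, 8 / 10),     # 8b/10b encoding
      "PCIe 2.0": (5.0, 8 / 10),     # 8b/10b encoding
      "PCIe 3.0": (8.0, 128 / 130),  # 128b/130b encoding
  }
  for name, (gts, eff) in gens.items():
      per_lane = gts * eff / 8
      for lanes in (1, 2, 4):
          print("%s x%d: ~%.2f GB/s" % (name, lanes, per_lane * lanes))

So an x4 link tops out around 2 GB/s at PCIe 2.0 versus nearly 4 GB/s at PCIe 3.0 (and the x2 case from above is down to about 1 GB/s), which is the sort of gap that shows up in NVMe performance.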

PCIeSlotsLimitations written at 01:04:01; Add Comment

2019-11-29

The problem of multiple NVMe drives in a PC desktop today

My current office workstation currently has two 250 GB Samsung 850 EVO SSDs (and some HDs). These were decent SSDs for their era, but they're now any number of years old and 250 GB isn't very large, so as part of our general stocking up on 500 GB SSDs at good sale prices, I get to replace them. To my surprise, it turns out that decent 500 GB NVMe drives can now be had at roughly the same price as 500 GB SSDs (especially during sales), so I got approval to get two NVMe drives as replacements instead of two SSDs. Then I realized I had a little issue, because my motherboard only has one M.2 NVMe slot.

In general, if you want multiple NVMe drives in a desktop system, you're going to have problems that you wouldn't have with the same number of SSDs (or HDs). PC motherboards have been giving you lots of SATA ports for a long time now, but M.2 slots are much scarcer. I think that part of this is simply an issue of physical board space, since M.2 slots need a lot more space than SATA ports do, but part of it also seems to be that M.2 drives consume a lot more PCIe lanes than SATA ports do. An M.2 slot needs at least two lanes and really you want it to have four, and even today there are only so many PCIe lanes to go around, at least on common desktops.
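
To give a rough sense of the arithmetic, here's approximately how the CPU lanes on a Ryzen/X370-era desktop get divided up (the numbers are from memory and illustrative, not exact):

  # Illustrative CPU lane budget for a Ryzen/X370-era desktop; the point
  # is how little is left once the usual suspects are accounted for.
  cpu_lanes = 24
  budget = {
      "GPU slot(s) (x16, or x8/x8 split)": 16,
      "CPU-attached M.2 slot": 4,
      "link to the chipset": 4,
  }
  print("CPU lanes left over:", cpu_lanes - sum(budget.values()))
  # Everything else (extra M.2 or x1 slots, onboard SATA, USB, network)
  # has to share the chipset's own x4 uplink back to the CPU.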

(I suspect that this is partly segmentation on the part of Intel and to a lesser extent AMD. They know that server people increasingly want lots of PCIe lanes, so if they restrict that to expensive CPUs and chipsets, they can sell more of them. Unusual desktop people like me get to lose out.)

I'm solving my immediate problem by getting a PCIe M.2 adapter card, because fortunately my office desktop has an unused PCIe x4 card slot right now. But this still leaves me with potential issues in the long run. I mirror my drives, so I'll be mirroring these two NVMe drives, and when I replace a drive in such a mirror I prefer to run all three drives at once for a while rather than break the mirror's redundancy to swap in a new drive. With NVMe drives, that would require two addon cards on my current office machine and I believe it would drop my GPU from x16 to x8 in the process (not that I need the GPU bandwidth, since I run basic X).

(And if I wanted to put a 10G-T Ethernet card into my desktop for testing, that too would need another 4x capable slot and I'd have my GPU drop to 8x to get the PCIe lanes. Including the GPU slot, my motherboard has only three 4x capable card slots.)

One practical issue here is that apparently PCIe M.2 adapter cards can vary somewhat in quality and the resulting NVMe IO rates you get, and it's hard to know whether or not you're going to wind up with a decent one. Based on the low prices for cards with a single M.2 slot and the wide collection of brands I'd never heard of, this is a low margin area dominated by products that I will politely call 'inexpensive'. The modern buying experience for such products is not generally a positive one (good luck locating decent reviews, for example).
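
If you do wind up with one, a crude way to compare the drive behind the adapter card with the one in the motherboard's M.2 slot is to time random reads against both raw devices. A rough sketch (Linux only, needs access to the raw devices, and the device names on the command line are whatever yours happen to be):

  #!/usr/bin/env python3
  # Rough random read latency sampler for block devices, eg:
  #   ./nvmelat.py /dev/nvme0n1 /dev/nvme1n1
  # Uses O_DIRECT so the page cache doesn't hide the drive's behaviour.
  import mmap, os, random, statistics, sys, time

  def sample(device, count=1000, size=4096):
      fd = os.open(device, os.O_RDONLY | os.O_DIRECT)
      try:
          devsize = os.lseek(fd, 0, os.SEEK_END)
          buf = mmap.mmap(-1, size)      # page-aligned, as O_DIRECT wants
          lats = []
          for _ in range(count):
              offset = random.randrange(devsize // size) * size
              start = time.monotonic()
              os.preadv(fd, [buf], offset)
              lats.append((time.monotonic() - start) * 1e6)
          return lats
      finally:
          os.close(fd)

  for dev in sys.argv[1:]:
      lats = sample(dev)
      print("%s: median %.0f us, 99th %.0f us" %
            (dev, statistics.median(lats),
             statistics.quantiles(lats, n=100)[98]))

(This only measures reads; writes are where I actually noticed the difference, but casually writing to a live disk is not something I want a throwaway script doing.)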

(Also, apparently not all motherboards will boot from an NVMe drive on an adapter card in a PCIe slot. This isn't really an issue for me (if the NVMe drive in my motherboard's M.2 slot fails, I'll move the adapter drive to the motherboard), but it might be for some people.)

Hopefully all of this will get better in the future. If there is a large movement toward M.2 (and I think there may be), there will be demand for more M.2 capacity from CPUs and chipsets and more M.2 slots on desktop motherboards, and eventually the vendors will start delivering on that (somehow). This might be M.2 slots on the motherboard, or maybe more x8 and x16 PCIe slots and then adapter cards (and BIOSes that will boot from them).

MultiNVMeMotherboardIssue written at 01:45:12; Add Comment

2019-11-18

LiveJournal and the path to NoSQL

A while back on Mastodon, I said:

Having sort of distantly observed the birth of NoSQL from the outside, I have a particular jaundiced view of its origins that traces it back to LiveJournal's extensive wrangling with MySQL (although this is probably not entirely correct, since I was a distant outside observer who was not paying much attention).

Oddly, this view is more flattering to NoSQL than many takes on its origins that I've seen. LJ's problems & solutions in the mid 00s make a good case for NoSQL.

LiveJournal aren't the only people who had problems with scaling MySQL in the early and mid 2000s, but they do have the distinction of having made a bunch of public presentations about them, many of which I read with a fair degree of interest back in the day. So let me rewind time to the early 2000s and quickly retrace LiveJournal's MySQL journey. To set the stage, this was a time before clouds and before affordable SSDs; if you ran a database-backed website at scale, you had real machines somewhere and they used hard drives.

LiveJournal started out with a single MySQL database machine. When the load overwhelmed it, they added read replicas (and then later in-memory caching, creating memcached in the process). However, their read replicas were eventually overwhelmed by write volume; reads can be served from one of N replicas, but the master and all replicas see the same write load. To deal with this, LiveJournal wound up manually sharding parts of their overall MySQL database into smaller 'user clusters', having already accepted that they never wanted to do what would be the equivalent of cross-cluster database joins.

At the point where you've manually sharded your SQL database, from a global view you don't really have an SQL database any more. You can't do cross-shard joins, cross-shard foreign keys, or cross-shard transactions, and you don't even have a guaranteed schema for your database since every shard has its own copy and they might not be in sync (they should be, but stuff happens). You're almost certainly querying and modifying your database through a higher level application library that packages up the work of finding the right shard, handling cross-shard operations, and so on, not by opening up a connection and sending SQL down it. The SQL is an implementation detail.
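
The shape of that higher level library is easy enough to sketch; the real work is in everything around it (moving users between clusters, caching the directory, coping with cross-cluster operations). Something like the following, where directory and connect are stand-ins rather than any real API:

  # Illustrative sketch of application-level shard routing, roughly in the
  # spirit of what LiveJournal described: a small directory maps each user
  # to a user cluster, and every query goes through the routing layer.
  SHARD_DSNS = {
      1: "mysql://user-cluster-1/lj",
      2: "mysql://user-cluster-2/lj",
  }

  def shard_for_user(userid, directory):
      # The mapping is explicit rather than hash-based, so users can be
      # moved from one cluster to another without rehashing everyone.
      return SHARD_DSNS[directory[userid]]

  def recent_entries(userid, directory, connect):
      # Plain 'open a connection and send SQL' can't work; you have to ask
      # the routing layer which cluster to talk to first.
      db = connect(shard_for_user(userid, directory))
      return db.query("SELECT * FROM log WHERE userid = %s", (userid,))

  directory = {1001: 1, 1002: 2}
  print(shard_for_user(1002, directory))   # mysql://user-cluster-2/lj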

These problems weren't LiveJournal's alone. You needed all of this sharding to operate at scale in the mid 2000s, because hard drives imposed a hard limit on how many reads and writes a second you could do. You couldn't get away from the tyranny of hard drive limitations on (seek) IO operations a second the way you can today with SSDs and now NVMe drives (and large amounts of RAM). And if you have to shard and don't have an SQL database after you shard, you might as well use a database that's designed for this environment from the start, one that handles the sharding, replication, and so on for you instead of making you build your own version. If it's really fast, so much the better.

(This idea is especially attractive to small startups, who don't have the people to build LiveJournal level software and libraries themselves. Small startups are busy enough trying to build a viable product without trying to also build a sharded database environment.)

Today, in a world of SSDs, NVMe drives, large amounts of RAM, cloud providers, managed SQL compatible database offerings, and so on, it's tempting to laugh at the idea of NoSQL. But at the time I think it looked like a sensible response to the challenges that large websites were facing.

(Whether so many people should have adopted NoSQL databases is a separate issue. But my impression is that startups are nothing if not ambitious.)

LiveJournalAndNoSQL written at 22:40:53; Add Comment

2019-11-14

TCP/IP and a consequence of reliable delivery guarantees

I recently read My hardest bug to debug (via), which discusses an interesting and hard to find bug that caused an industrial digital camera used for barcode scanning to hang. The process of diagnosis (and the lessons learned from it) are interesting, so I urge you to go read the article now before reading further here, because I have to spoil the actual bug.

(Really, go read the article. This is your last chance.)

One part of the control system worked by making a TCP connection to the camera, doing some initial setup, and then leaving the connection open so that it could later send any setting changes to the camera without having to re-open a connection. It turned out that the camera had an undocumented behavior of sending scan results over this TCP connection (as well as making them available in other ways). The control system didn't expect this return traffic, so it never listened for responses on the TCP connection. The article ends with, in part:

I still don't understand how this caused the camera to lock up. We were receiving the TCP results via Telnet but we weren't reading the stream. Did it just build up in some buffer? How did this cause the camera to lock up? I still can't answer these questions.

The most likely guesses are that yes, the sent data built up in a collection of buffers on both the receiver and the sender, and that this caused the hang because eventually the camera software tried to send more data to the network and the OS on the camera put it to sleep, since there was no more buffer room.

While you might blame common networking APIs a bit here, in large part this is a deep but in some sense straightforward consequence of TCP/IP promising reliable delivery of your bytes in a world of finite and often limited resources. A properly operating receiving system cannot throw away bytes after it's ACK'd them, so it must buffer them; a sensible system will have a limit on how many bytes it will buffer before it stops accepting any more. Similarly, a properly operating sending system can't generally throw away bytes after accepting them from an application, so if the receiving system isn't accepting more bytes the sending system has to (eventually) stop accepting them from the application. All of this is familiar in general as backpressure. When the backpressure propagates all the way back to the sending application, it can either stall itself or it can get a 'no more buffer space' error from the operating system.

(Where the APIs come in is that when there's no more buffer space, they generally opt to have the application's attempt to send more data just stall instead of producing an error. This is generally easier for many applications to handle, but sometimes it's not what you want.)
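
You can watch this happen with a trivial experiment: connect to something that never reads and keep sending. A small sketch (localhost only; the exact byte count where it stops depends on your kernel's socket buffer sizes):

  #!/usr/bin/env python3
  # Demonstrate TCP backpressure: write to a peer that never reads until
  # the kernel won't accept any more data from us.
  import socket

  listener = socket.create_server(("127.0.0.1", 0))
  sender = socket.create_connection(listener.getsockname())
  receiver, _ = listener.accept()    # accepted, but we never read from it

  sender.setblocking(False)          # get an error instead of hanging
  total = 0
  chunk = b"x" * 65536
  try:
      while True:
          total += sender.send(chunk)
  except BlockingIOError:
      # Both the receiver's and our own socket buffers are now full.
      print("the kernel accepted %d bytes before making us stop" % total)

With a blocking socket, the final send() would simply stall instead of raising an error, which is presumably what happened to the camera's software.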

TCP's reliable delivery guarantee means that you can only send so much data to something that isn't dealing with it. You cannot just send data forever and have it vanish into the void because no one is dealing with it; that wouldn't be reliable delivery. After all, the receiving application might wake up and start reading that accumulated data some day, and if it does all the data had better be there for it.

TCPReliableDeliveryConsequence written at 23:54:10; Add Comment

2019-10-21

Filesystem size limits and the complication of when errors are detected

One of the practical things that further complicates limiting the size of things in filesystems is the issue of when people find out about this, or rather how this interacts with another very desirable property in modern filesystems.

For practical usability, you want people (by which I mean programs) to be notified about all 'out of space' errors synchronously, when they submit their write IOs, or more exactly when they get back the results of submitting one (although this is effectively the same thing unless you have an asynchronous write API and a program is using it). Common APIs such as the Unix API theoretically allow you to signal write errors later (for example on close() in Unix), but actually doing so will cause practical problems both for straightforward programs that just write files out and are done (such as editors) and more complicated programs that do ongoing IO. Beyond carelessness, the problem is that write errors that aren't tied to a specific write IO leave programs in the dark about what exactly went wrong. If your program makes five write() calls at various offsets in the file and then gets an error later, the file could be in pretty much any state as far as it knows; it has no idea which writes succeeded, if any, and which failed. Some write errors can't be detected until the IO is actually done and have to be delivered asynchronously, but delivering as many as possible as early as possible is quite desirable. And while 'the disk exploded' write errors are uncommon and unpredictable, 'you're out of space' is both predictable (in theory) and common, so you very much want to deliver it to programs immediately if it's going to happen at all.
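
For the 'just write files out and be done' case, being careful means checking for errors everywhere they can surface, including the late spots. A minimal sketch in Python, where a deferred error can show up at flush(), fsync(), or even close():

  # Careful 'write a file out' error handling: an out of space error can
  # surface at write(), at flush()/fsync(), or only at close().
  import errno, os, sys

  def write_file(path, data):
      try:
          with open(path, "wb") as f:
              f.write(data)          # may fail immediately...
              f.flush()
              os.fsync(f.fileno())   # ...or only when the data really goes out
          # Leaving the 'with' block closes the file; an error from that
          # close() propagates out of the with statement too.
      except OSError as e:
          if e.errno == errno.ENOSPC:
              sys.exit("%s: out of space" % path)
          raise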

By itself this is no problem, but then we get into the other issue. Modern filesystems have discovered that they very much want to delay block allocation until as late as possible, because delaying and aggregating it together across a bunch of pending writes gives you various performance improvements and good features (see eg the Wikipedia article). The natural place to detect and report being out of various sorts of space is during block allocation (because that's when you actually use space), but if this is asynchronous and significantly delayed, you don't have prompt reporting of out of space issues to programs. If you try to delay block allocation but perform size limit checking early, there's a multi-way tradeoff between basically duplicating block allocation, being completely accurate (so that there's little or no chance of a delayed failure during the later block allocation), and being fast.

In theory, the best solution to this problem is probably genuinely asynchronous write APIs that delay returning their results until your data has been written to disk. In practice asynchronous APIs leave you with state machines in your programs, and state machines are often hard to deal with (this is the event loop problem).

FilesystemQuotaErrorTiming written at 22:14:06; Add Comment

2019-10-09

Limiting the size of things in a filesystem is harder than it looks

Suppose, not entirely hypothetically, that you want to artificially limit the size of a filesystem or perhaps something within it, for example the space used by a particular user. These sort of limits usually get called quotas. Although you might innocently think that enforcing quotas is fairly straightforward, it turns out that it can be surprisingly complicated and hard even in very simple filesystems. As filesystems become more complicated, it can rapidly become much more tangled than it looks.

Let's start with a simple filesystem with no holes in files, where we only want to limit the amount of data that a user has (not filesystem metadata). If the user tries to write 128 Kb to some file, we already need to know where in the file it's going. If the 128 Kb entirely overwrites existing data, the user uses no extra space; if it's being added to the end of the file, they use 128 Kb more space; and if it partly overlaps with the end of the file, they use less than 128 Kb. Fortunately the current size of a file that's being written to is generally very accessible to the kernel, so we can probably know right away whether the user's write can be accepted or has to be rejected because of quota issues. Well, we can easily know until we throw multiple CPUs into the situation, with different programs on different CPUs all executing writes at once. Once we have several CPUs, we have to worry about synchronizing our information on how much space the user is currently using.
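
In this simple model, the space calculation itself is genuinely easy; it's everything that comes later that makes it hard. As a sketch:

  # Extra data space consumed by a write in a simple filesystem with no
  # holes in files and no metadata accounting.
  def extra_space(file_size, offset, length):
      return max(0, offset + length - file_size)

  # Entirely overwriting existing data: no extra space.
  assert extra_space(256 * 1024, 0, 128 * 1024) == 0
  # Appending at the end of the file: the full 128 Kb.
  assert extra_space(256 * 1024, 256 * 1024, 128 * 1024) == 128 * 1024
  # Partly overlapping the end of the file: something in between.
  assert extra_space(256 * 1024, 192 * 1024, 128 * 1024) == 64 * 1024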

Now, suppose that we want to account for filesystem metadata as well, and that files can have unallocated space in the middle of themselves. Now the kernel doesn't know how much space 128 Kb of file data is going to use until it's looked at the file's current indirect blocks. Writing either after the current end of the file or before it may require allocating new data blocks and perhaps new indirect blocks (in extreme cases, several levels of them). The existing indirect blocks for the file may or may not already be in memory; if they aren't, the kernel doesn't know whether it can accept the write until it reads them off disk, which may take a while. The kernel can optimistically accept the write, start allocating space for all of the necessary data and metadata, and then abort if it runs into a quota limit by the end. But if it does this, it has to have the ability to roll back all of those allocations it may already have done.

(Similar issues come up when you're creating or renaming files and more broadly whenever you're adding entries to a directory. The directory may or may not have a free entry slot already, and adding your new or changed name may cause a cascade of allocation changes, especially in sophisticated directory storage schemes.)

Features like compression and deduplication during writes complicate this picture further, because you don't know how much raw data you're going to need to write until you've gone through processing it. You can even discover that the user will use less space after the write than before, if they replace incompressible unique data with compressible or duplicate data (an extreme case is turning writes of enough zero bytes into holes).

If the filesystem is a modern 'copy on write' one such as ZFS, overwriting existing data may or may not use extra space even without compression and deduplication. Overwriting data allocates a new copy of the data (and metadata pointing to it), but it also normally frees up the old version of the data, hopefully giving you a net zero change in usage. However, if the old data is part of a snapshot or otherwise referenced, you can't free it up and so an 'overwrite' of 128 Kb may consume the same amount of space as appending it to the file as new data.

Filesystems with journals add more issues and questions, especially the question of whether you add operations to the journal before you know whether they'll hit quota limits or only after you've cleared them. The more you check before adding operations to the journal, the longer user processes have to wait, but the less chance you have of hitting a situation where an operation that's been written to the journal will fail or has to be annulled. You can certainly design your journal format and your journal replay code to cope with this, but it makes life more complicated.

At this point you might wonder how filesystems that support quotas ever have decent performance, if checking quota limits involves all of this complexity. One answer is that if you have lots of quota room left, you can cheat. For instance, the kernel can know or estimate the worst case space usage for your 128 Kb write, see that there is tons of room left in your quota even in the face of that, and not delay while it does further detailed checks. One way to deal with the SMP issue is to keep a very broad count of how much outstanding write IO there is (which the kernel often wants anyway) and not bother with synchronizing quota information if the total outstanding writes are significantly less than the quota limit.
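
A sketch of what such a fast path might look like (entirely hypothetical, and glossing over all the locking):

  # Hypothetical fast-path quota check: if the worst case estimate plus
  # everything already in flight fits comfortably under the limit, accept
  # the write without doing the expensive, exact accounting.
  SLACK = 0.9      # only take the cheap path well under the limit

  def can_accept_write(used, quota, outstanding, worst_case, exact_check):
      if used + outstanding + worst_case < quota * SLACK:
          return True          # tons of quota room left: cheap accept
      return exact_check()     # near the limit: do it the careful way

  # Illustration: 10 MB used of a 100 MB quota, 1 MB of writes in flight,
  # and a 128 Kb write that might need up to 160 Kb with metadata.
  print(can_accept_write(10 << 20, 100 << 20, 1 << 20, 160 << 10,
                         exact_check=lambda: False))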

(I didn't realize a lot of these lurking issues until I started to actually think about what's involved in checking and limiting quotas.)

FilesystemLimitingSizeProblems written at 21:31:43; Add Comment

2019-09-20

TLS server certificate verification has two parts (and some consequences)

One of the unusual and sometimes troublesome parts of TLS is that verifying a TLS server's certificate actually has two separate parts, each critical. The first part is verifying that you have a valid certificate, one that is signed by a certificate chain that runs up to a known CA, hasn't expired, hasn't been revoked (or is asserted as valid), perhaps appears in a CT log, and so on. The second, equally critical part is making sure that this valid certificate is actually for the server you are talking to, because there are a lot of valid certificates out there and more or less anyone can get one for some name. Failing to do this opens you up to an obvious and often trivial set of impersonation attacks.

However, there is an important consequence of this for using TLS outside of the web, which is that you must know the name of the server you're supposed to be talking to in order to verify a server's TLS certificate. On the web, the server name is burned into the URL; you cannot make a request without knowing it (even if you know it as an IP address). In other protocols that also use TLS, this may not be true or it may not be clear what name for the server you should use (if there are levels of aliases, redirections, and so on going on, possibly including DNS CNAMEs).
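
You can see the two parts as separate knobs in, for example, Python's ssl module: verify_mode covers the 'is this a valid certificate' half, check_hostname covers the 'is it for the right server' half, and server_hostname is where you supply the name you expect (example.org here is just a placeholder). Both settings are already the defaults from create_default_context(); I'm spelling them out for illustration:

  # The two halves of TLS server certificate verification, as exposed by
  # Python's ssl module.
  import socket, ssl

  ctx = ssl.create_default_context()
  ctx.verify_mode = ssl.CERT_REQUIRED   # part 1: a valid chain to a known CA
  ctx.check_hostname = True             # part 2: the cert must match a name

  # You can't do part 2 without already knowing what name you expect.
  with socket.create_connection(("example.org", 443)) as sock:
      with ctx.wrap_socket(sock, server_hostname="example.org") as tls:
          print(tls.getpeercert()["subject"])

(Turning off check_hostname is the 'we will not verify the server name' option, with everything that follows from that.)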

The corollary of this is that it's much harder to use TLS with a protocol that doesn't give you a server name to start with. If the protocol is 'I broadcast a status packet and something responds' or 'someone gives me a list of IP addresses of resources', you sort of have a problem. Sometimes you can resolve this problem by fiat, for example by saying 'we will do a DNS PTR query to resolve this IP address to a name and then use the name', and sometimes you can't even get that far.

(You can also say 'we will not verify the server name', but then you only have part of TLS.)

That's all very abstract, so let's go with two real examples. The first one is 802.1X network authentication, which I tangled with recently. When I dealt with this on my phone, I was puzzled why various instructions said to make sure that the TLS certificate was for a specific name (and I even wondered if this meant that the TLS certificate wasn't being verified at all). But the reason you have to check the name is that the 802.1X protocol doesn't have any trustworthy way of asserting what the authentication server should be called; almost by definition, you can't trust anything the 802.1X server itself claims about what it should be called, and the only other information you have is (perhaps) the free-form name of the network (as, for example, the wireless SSID). The server name and the trust have to be established out of band, and on phones that's through websites with instructions.

(On Linux you have to explicitly configure the expected server name in advance if you want security.)

The second example is wanting to use DNS over TLS or DNS over HTTPS to talk to the DNS servers you find through DHCP or have in a normal resolv.conf. In both of these cases, the protocol and the configuration file only specify the DNS servers by IP address, with no names associated with them (and the IPs may well be RFC 1918 private addresses). It's possible to turn this into a server name if you want to through DNS, but you wind up basically having to trust what the DNS server is telling you about what its TLS server name should be.

(You can augment DHCP and resolv.conf with additional information about the server names you should look for, but then you need to define the format and protocol for this information, and you need more moving parts in order to get your TLS protected DNS queries.)

PS: Sometimes the first part of TLS is sufficiently important by itself, because blocking passive eavesdropping can be a significant win. But it's at least questionable, and you need to consider your threat models carefully.

TLSCertVerifyTwoParts written at 00:12:26; Add Comment

2019-09-16

The problem of 'triangular' Network Address Translation

In my entry on our use of bidirectional NAT and split horizon DNS, I mentioned that we couldn't apply our bidirectional NAT translation to all of our internal traffic in the way that we can for external traffic for two reasons, an obvious one and a subtle one. The obvious reason is our current network topology, which I'm going to discuss in a sidebar below. The more interesting subtle reason is the general problem of what I'm going to call triangular NAT.

Normally when you NAT something in a firewall or a gateway, you're in a situation where the traffic in both directions passes through you. This allows you to do a straightforward NAT implementation where you only rewrite one of the pair of IP addresses involved; either you rewrite the destination address from you to the internal IP and then send the traffic to the internal IP, or you rewrite the source address from the internal IP to you and then send the traffic to the external IP.

However, this straightforward implementation breaks down if the return traffic will not flow through you when it has its original source IP. The obvious case of this is if a client machine is trying to contact a NAT'd server that is actually on its own network. It will send its initial packet to the public IP of the NAT'd machine and this packet will hit your firewall, get its destination address rewritten, and then passed to the server. However, when it replies to the packet, the server will see a destination IP on its local network and just send it directly to the client machine. The client machine will then go 'who are you?', because it's expecting the reply to come from the server's nominal public IP, not its internal one.

(Asymmetric routing can also create this situation, for instance if the machine you're talking to has multiple interfaces and a route to you that doesn't go out the firewall-traversing one.)

In general the only way to handle triangular NAT situations is to force the return traffic to flow through your firewall by always rewriting both IP addresses. Unfortunately this has side effects, the most obvious one being that the server no longer gets the IP address of who it's really talking to; as far as it's concerned, all of the connections are coming from your firewall. This is often less than desirable.
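
A toy walk-through may make the difference concrete (all the IP addresses here are made up; the client and the NAT'd server are on the same internal subnet):

  # Toy walk-through of the triangular NAT problem.
  CLIENT     = "10.0.0.20"
  SERVER_INT = "10.0.0.30"     # the server's real, internal address
  SERVER_PUB = "128.100.3.30"  # the server's public, NAT'd address
  FIREWALL   = "128.100.3.1"

  # One-sided NAT: the client's packet to the public IP hits the firewall,
  # which rewrites only the destination before passing it on.
  packet = {"src": CLIENT, "dst": SERVER_PUB}
  packet["dst"] = SERVER_INT

  # The server's reply goes straight back to a local IP, never touching
  # the firewall, and arrives with the internal source address.
  reply = {"src": SERVER_INT, "dst": packet["src"]}
  assert reply["src"] != SERVER_PUB    # the client expected SERVER_PUB

  # Rewriting both addresses forces the reply back through the firewall,
  # at the cost of hiding the client's real IP from the server.
  packet = {"src": FIREWALL, "dst": SERVER_INT}
  reply = {"src": SERVER_INT, "dst": packet["src"]}
  assert reply["dst"] == FIREWALL      # the firewall can now un-rewrite it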

(As an additional practical issue, not all NAT implementations are very enthusiastic about doing such two-sided rewriting.)

Sidebar: Our obvious problem is network topology

At the moment, our network topology basically has three layers; there is the outside world, our perimeter firewall, our public IP subnets with various servers and firewalls, and then our internal RFC 1918 'sandbox' subnets (behind those firewalls). Our mostly virtual BINAT subnet with the public IPs of BINAT machines basically hangs off the side of our public subnets. This creates two topology problems. The first topology problem is that there's no firewall to do NAT translation between our public subnets and the BINAT subnet. The larger topology problem is that if we just put a firewall in, we'd be creating a version of the triangular NAT problem because the firewall would have to basically be a virtual one that rewrote incoming traffic out the same interface it came in on.

To make internal BINAT work, we would have to actually add a network layer. The sandbox subnet firewalls would have to live on a separate subnet from all of our other servers, and there would have to be an additional firewall between that subnet and our other public subnets that did the NAT translation for most incoming traffic. This would impose additional network hops and bottlenecks on all internal traffic that wasn't BINAT'd (right now our firewalls deliberately live on the same subnet as our main servers).

TriangleNATProblem written at 00:31:27; Add Comment

2019-09-06

Programs that let you jump around should copy web browser navigation

As part of their user interface, many programs these days have some way to jump around (or navigate around) the information they display (or portions of it, such as a help system). Sometimes you do this by actually clicking on things (and they may even look like web links); sometimes you do this through keyboard commands of various sorts.

(The general common form of these is that you are looking at one thing, you perform an action, and you're now looking at another thing entirely. Usually you don't know what you're going to get before you go to it.)

Programs have historically come up with a wide variety of actual interfaces for this general idea. Over the years, I have come to a view on how this should work, and that is the obvious one; jumping around in any program should work just like it does in web browsers, unless the program has a very good reason to surprise people. Your program should work the same as browsers both in the abstract approach and also, ideally, in the specific key and mouse bindings that do things.

There are two reasons for this. The first reason is simply that people already spend a lot of time navigating around in browsers, so they are very familiar with it and generally pretty good at it. If you deviate from how browsers behave, people will have to learn your behavior instead of being able to take advantage of their existing knowledge. The second and more subtle reason is that browsers have spent a lot of time working on developing and refining their approach to navigation, almost certainly more time than you have. If you have something quite different than a web page environment, perhaps you can still design a better UI for it despite your much less time, but the more your setup resembles a series of web pages, the less likely that is.

At this point you might ask what the general abstract approach of web browser navigation is. Here are my opinions:

  • You can move both back and then forward again through your sequence of jumps, except for rare things which cannot be repeated. In a normal UI, non-repeatable things should probably use a completely different interface from regular 'follow this thing' jumps.

  • The sequence is universal by default in that it doesn't matter what sort of a forward jump you made and regardless of where it took you, you can always go back with a single universal action, and then forward again by another one. You can add extra sorts of back and forward traversal that only act on some sorts of jumps if you want to, but the default should be universal.

    (As far as the destination goes, notice that browsers have a universal 'back' action regardless of whether the link was to another anchor on the same web page or to an entirely different web page.)

  • By default, the sequence of jumps is not a global one but is specific to your current pane, whatever a pane is in your application (a window, a tab, a section of the window). What you do in one pane should not contaminate the forward/back sequence of another pane, because it's generally harder to keep track of a global, cross-pane history and remember where 'forward' and 'back' operations will take you in it.

    (There are clever uses of a global sequence and you can offer one, but it shouldn't be the default.)

  • It should be possible to have the destination of a jump not overwrite the current stuff you're looking at but instead open in another pane. This should generally not be the default, but that's somewhat situational.

There are probably other aspects of browser navigation that I haven't thought of simply because I'm so accustomed to them.
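
The data structure at the core of all this is small (the hard work in real browsers is everywhere else). A minimal per-pane history sketch:

  # Minimal per-pane navigation history: a list of visited locations plus
  # a cursor. Following a new jump discards anything 'forward' of the
  # cursor, which is how browsers behave as well.
  class PaneHistory:
      def __init__(self, start):
          self.entries = [start]
          self.pos = 0

      def visit(self, location):
          del self.entries[self.pos + 1:]   # drop the old forward entries
          self.entries.append(location)
          self.pos += 1

      def back(self):
          if self.pos > 0:
              self.pos -= 1
          return self.entries[self.pos]

      def forward(self):
          if self.pos < len(self.entries) - 1:
              self.pos += 1
          return self.entries[self.pos]

  h = PaneHistory("help:intro")
  h.visit("help:search")
  h.visit("help:regexps")
  print(h.back(), h.back(), h.forward())  # help:search help:intro help:search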

There are still reasons to use different interfaces here under the right circumstances, but you should be quite sure that your different interface really is a significant advantage and that a decent amount of your target audience will use your program a lot. Editors are generally a case of the latter, but I'm not convinced that most of them are a case of the former.

(At this point in time I suspect that this is a commonly held and frequently repeated position, but I feel like writing it down anyway.)

CopyBrowserNavigation written at 21:43:46; Add Comment
