Wandering Thoughts


Network switches aren't simple devices (not even basic switches)

Recently over on the Fediverse I said something about switches:

"Network switches are simple devices" oh I am so sorry. Hubs were simple devices. Switches are alarmingly smart devices even if they don't handle VLANs or support STP (and almost everyone wants them to support Spanning Tree Protocol, to stop loops). Your switch has onboard packet buffering, understands Ethernet addresses, often generates its own traffic and responds to network traffic (see STP), and is actually a (layer 2) high speed router with a fallback to being a hub.

(And I didn't even remember about multicast, plus I omitted various things. The trigger for my post was seeing a quote from Making a Linux-managed network switch, which is speaking (I believe) somewhat tongue in cheek and anyway is a fun and interesting article.)

Back in the old days, a network hub could simply repeat incoming packets out each port, with some hand waving about having to be aware of packet boundaries (see the Wikipedia page for more details). This is not the case with switches. Even a very basic switch must extract source and destination Ethernet addresses out of packets, maintain a mapping table between ports and Ethernet addresses, and route incoming packets to the appropriate port (or send them to all ports if they're to an unknown Ethernet address). All of this generally needs to happen at line speed, while handling simultaneous packets on multiple ports at once.
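To make the 'layer 2 router with a fallback to being a hub' point concrete, here is a toy Python sketch of the learn-and-forward decision a basic switch makes for every frame. All of the names here are invented for illustration; real switches do this in dedicated hardware, at line speed, on many ports at once.

```python
from collections import namedtuple

# A minimal stand-in for an Ethernet frame; only the addresses matter here.
Frame = namedtuple("Frame", ["src_mac", "dst_mac", "payload"])

class LearningSwitch:
    """Toy model of a basic switch's per-frame forwarding decision."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def handle(self, in_port, frame):
        """Return the list of ports the frame should go out of."""
        # Learn: remember which port this source address lives behind.
        self.mac_table[frame.src_mac] = in_port
        out = self.mac_table.get(frame.dst_mac)
        if out is not None:
            # Known destination: a single port, or drop the frame if the
            # destination is on the port it arrived from.
            return [] if out == in_port else [out]
        # Unknown destination: flood to every other port, hub style.
        return sorted(self.ports - {in_port})
```

The mapping-table lookup itself is trivial; the part that takes real hardware is doing it for every frame, in parallel, without slowing the wire down.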

Switches must have some degree of internal packet buffering, although how much buffering switches have can vary (and can matter). Switches need buffering to deal with both a high speed port sending to a low speed one and several ports all sending traffic to the same destination port at the same time. Buffering implies that packet reception and packet transmission can be decoupled from each other, although ideally there is no buffering delay if the receive to transmit path for a packet is clear (people like low latency in switches).

A basic switch will generally be expected to both send and receive special packets itself, not just pass through network traffic. Lots of people want switches to implement STP (Spanning Tree Protocol) to avoid network loops (which requires the switch to send, receive, and process packets itself), and probably Ethernet flow control as well. If the switch is going to send out its own packets in addition to incoming traffic, it needs the intelligence to schedule this packet transmission somehow and deal with how it interacts with regular traffic.

If the switch supports VLANs, several things get more complicated (although VLAN support generally requires a 'managed switch', since you have to be able to configure the VLAN setup). In common configurations the switch will need to modify packets passing through to add or remove VLAN tags (as packets move between tagged and untagged ports). People will also want the switch to filter incoming packets, for example to drop a VLAN-tagged packet if the VLAN in question is not configured on that port. And they will expect all of this to still run at line speed with low latency. In addition, the switch will generally want to segment its Ethernet mapping table by VLAN, because bad things can happen if it's not.
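As an illustration of the ingress side of VLAN filtering, here is a hedged Python sketch of the classification a switch might do when a frame arrives. The port configuration format is made up for the example, and real switches have more cases (for instance, a native VLAN for untagged frames on trunk ports), but the basic accept-or-drop decision looks like this:

```python
def classify_ingress(port_cfg, vlan_tag):
    """Decide which VLAN an arriving frame belongs to, or None to drop it.

    port_cfg uses a made-up format: {"mode": "access", "vlan": 10} for an
    untagged port, or {"mode": "trunk", "vlans": {10, 20}} for a tagged
    one. vlan_tag is the frame's 802.1Q tag, or None if untagged.
    """
    if port_cfg["mode"] == "access":
        # Access ports accept untagged frames only; the port's own
        # configuration supplies the VLAN.
        return port_cfg["vlan"] if vlan_tag is None else None
    # Trunk ports accept tagged frames, but only for VLANs configured
    # on that port; everything else gets filtered out.
    if vlan_tag is not None and vlan_tag in port_cfg["vlans"]:
        return vlan_tag
    return None
```

The VLAN that comes out of this decision is also what you'd use as part of the key when looking up the Ethernet mapping table, which is what segmenting the table by VLAN amounts to.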

(Port isolation, also known as "private VLANs", adds more complexity but now you're well up in managed switch territory.)

PS: Modern small network switches are 'simple' in the sense that all of this is typically implemented in a single chip or close to it; the Making a Linux-managed network switch article discusses a couple of them. But what is happening inside that IC is a marvel.

NetworkSwitchesNotSimple written at 23:13:55


Using WireGuard as a router to get around reachability issues

Suppose that you have a machine, or a set of machines, that can't be readily reached from the outside world with random traffic (for example, your home LAN setup), and you also have a roaming machine that you want to use to reach those machines (for example, your phone). If you only had one of these problems, you could set up a straightforward WireGuard tunnel, where your roaming phone talked to the WireGuard machines on your home LAN. But on the surface, having both of them sounds like you need some sort of complex inbound NAT gateway on a fixed and reachable address in the cloud (your phone talks to the gateway with WireGuard, the gateway NATs the traffic and passes it over WireGuard to the home LAN, etc). However, with some tricks you don't need this; instead, you can use WireGuard on the fixed cloud machine as a router instead of a gateway.

(As someone who deals with non-WireGuard networking regularly, my reflex is that if two machines can't talk to each other with plain IP, we're going to need some kind of NAT or port forwarding somewhere. This leads to a situation where if two potential WireGuard peers can't talk to each other, my thoughts immediately jump to 'clearly we're going to need a NAT'.)

The basic idea is that you set up the fixed public machine as a router, although only for WireGuard connections, and then you arrange to route appropriate IP addresses and IP address ranges over the various WireGuard connections. The simplest approach is to give each WireGuard client an 'inside' IP address on the WireGuard interface on some subnet, and then have each client route the entire (rest of the) subnet to the WireGuard router machine. The router machine's routing table then sends the appropriate IP address (or address range) down the appropriate WireGuard connection. More complex setups are possible if you have existing IP address ranges that need to be reached over these WireGuard-based links, but the more distinct IPs or IP ranges you want to reach over WireGuard, the more routing entries each WireGuard client needs (the router's routing table also gets more complicated, but it was already a central point of complexity).
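As a sketch of the simplest approach, here is a hypothetical wg-quick style configuration for one roaming client. All of the addresses, the hostname, and the choice of 10.9.9.0/24 as the 'inside' subnet (with 192.168.100.0/24 standing in for a home LAN range) are invented for illustration:

```ini
# Hypothetical wg-quick configuration for one roaming client.
[Interface]
# This client's 'inside' address on the WireGuard subnet.
Address = 10.9.9.2/24
PrivateKey = <client private key>

[Peer]
# The fixed, publicly reachable router machine.
PublicKey = <router public key>
Endpoint = wg-router.example.org:51820
# Route the rest of the inside subnet (and, if you want, the home
# LAN's range) through the router.
AllowedIPs = 10.9.9.0/24, 192.168.100.0/24
```

On the router machine, each peer's AllowedIPs entry would list only that peer's /32 inside address (WireGuard's cryptokey routing is doing double duty as the routing table here), and the machine needs IP forwarding turned on.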

(This isn't a new pattern; it used to appear in, for example, PPP servers. But those have been generally out of fashion for a while and not something people deal with. VPN servers also behave this way but often their VPN software handles this all for you without explicit routing table entries or you having to think about it. They may also automatically NAT traffic for you.)

Routing an existing home LAN IP address range or the like to the WireGuard machines is potentially a bit more complex. Unless you can use your existing home gateway as a WireGuard peer, you'll need to either NAT the WireGuard 'inside' IP addresses when they talk to your home LAN or establish a special route on your home LAN that sends traffic for those IPs to your WireGuard gateway. If you can set up WireGuard on your home gateway (by which I mean whatever machine is the default route for things on your LAN), life is simpler because the return traffic is already flowing through the machine; you just need to send it off to the WireGuard router instead of to the Internet. Another option is to assign unused home LAN IP addresses to your remote WireGuard machines, and then have your home LAN WireGuard gateway do 'proxy ARP' or IPv6 NDP for those IPs.

(In theory this is one of the situations where IPv6 may make your life easier, because if necessary you can create your own Unique local address space, carve it up between your home LAN and other areas, and route it around.)

WireGuardRouterPattern written at 23:33:45


Unix's fsync(), write ahead logs, and durability versus integrity

I recently read Phil Eaton's A write-ahead log is not a universal part of durability (via), which is about what it says it's about. In the process it discusses using Unix's fsync() to achieve durability, which woke up a little twitch I have about this general area, which is the difference between durability and integrity (which I'm sure Phil Eaton is fully aware of; their article was only about the durability side).

The core integrity issue of simple uses of fsync() is that while fsync() forces the filesystem to make things durable on disk, the filesystem doesn't promise to not write anything to disk until you do that fsync(). Once you write() something to the filesystem, it may write it to disk without warning at any time, and even during an fsync() the filesystem makes no promises about what order data will be written in. If you start an fsync() and the system crashes part way through, some of your data will be on disk and some won't be and you have no control over which part is which.

This means that if you overwrite data in place and use fsync(), the only time you are guaranteed that your data has both durability and integrity is in the time after fsync() completes and before you write any more data. Once you start (over)writing data again, that data could be partially written to disk even before you call fsync(), and your integrity could be gone. To retain integrity, you can't overwrite more than a tiny bit of data in place. Instead, you need to write data to a new place, fsync() it, and then overwrite one tiny piece of existing data to activate your new data (and fsync() that write too).
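The 'write to a new place, then overwrite one tiny piece of data to activate it' pattern is what careful programs do when replacing a whole file. Here is a minimal Python sketch, assuming a POSIX filesystem where rename() atomically replaces the old directory entry:

```python
import os

def replace_file_durably(path, data):
    """Replace 'path' with 'data' without ever leaving it half-written.

    The new data goes to a side file first and is fsync()'d there; only
    then do we overwrite the one small thing that activates it, the
    directory entry, via an atomic rename() (POSIX semantics assumed).
    """
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # new data durable; old file still untouched
    os.rename(tmp, path)      # atomically switch over to the new data
    # fsync the directory so the rename itself is durable too.
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

If the system crashes at any point, a reader afterward sees either the complete old contents or the complete new contents of 'path', never a mix.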

(Filesystems can use similar two-stage approaches to make and then activate changes, such as ZFS's slight variation on this. ZFS does not quite overwrite anything in place, but it does require multiple disk flushes, possibly more than two.)

The simplest version of this condenses things down to one fsync() (or its equivalent) at the cost of having an append-only data structure, which we usually call a log. Logs need their own internal integrity protection, so that they can tell whether or not a segment of the log had all of its data flushed to disk and so is fully valid. Once your single fsync() of a log append finishes, all of the data is on disk and that segment is valid; before the fsync finishes, it's not necessarily so. Only some of the data might have been written, and it might have been written out of order (so that the last block made it to disk but an earlier block did not).
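A minimal sketch of such internal integrity protection is to frame each record with its length and a checksum, so a reader scanning the log can tell a fully flushed record from a partially flushed one. This Python version uses CRC32 and a fixed little-endian header; both framing choices are arbitrary, picked only for illustration:

```python
import struct
import zlib

# Each record is framed as: 4-byte length, 4-byte CRC32, then the
# payload (little-endian header).

def encode_record(payload):
    """Frame one log record so a reader can later verify it."""
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def decode_record(buf):
    """Return (payload, bytes_consumed) for the record at the start of
    'buf', or (None, 0) if it is truncated or corrupt -- i.e. its
    append never fully made it to disk."""
    if len(buf) < 8:
        return None, 0
    length, crc = struct.unpack("<II", buf[:8])
    payload = buf[8:8 + length]
    if len(payload) < length or zlib.crc32(payload) != crc:
        return None, 0
    return payload, 8 + length
```

Note that the checksum is what saves you from out-of-order writes: even if the last block of a record reached disk while an earlier block didn't, the CRC won't match and the record is treated as never having happened.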

A write-ahead log normally increases the amount of data written to disk; you write data once to the WAL and once to the main database. However, a WAL may well reduce the number of fsync()s (and thus disk flushes) that you have to do in order to have both durability and integrity. In modern solid state storage systems, synchronous disk flushes can be the slowest operation and (asynchronous) write bandwidth relatively plentiful, so trading off more data written for fewer disk flushes can be a net performance win in practice for plenty of workloads.

(Again, I'm sure Phil Eaton knows all of this; their article was specifically about the durability side of things. I'm using it as a springboard for additional thoughts. I'm not sure I'd realized how a WAL can reduce the number of fsync()s required before now.)

FsyncDurabilityVsIntegrity written at 22:41:15


Modifying and setting alarm times: a phone UI irritation

Over on the Fediverse, I mentioned a little irritation with my iPhone:

Pretty much every time I change the time of an alarm on my phone I am irritated all over again at the fundamental laziness and robotic computer-ness of time controls. What I want to do is move the time forward or backward, not to separately change (or set) the hours and the minutes. But separate 'hour' and 'minutes' spinners or options are the easy computer way out so that's how UIs implement it.

My phone's standard alarm app has what I believe is the common phone interface for setting and modifying alarm times, where you set the hour and the minute separately. There are two problems with this.

The first problem is what I mentioned in my post. In the case when I'm modifying the time of an existing alarm, what I want to do is move it forward or backward by some amount. Where this time change moves the time of the alarm over an hour boundary, I must separately adjust the minutes and the hours, and do the relevant math in my head. I can't just say 'make it half an hour earlier', I have to move the hour backward and then the minutes forward, in two separate actions.

The second problem is that this interface is also not all that great if I have an exact time for an alarm. If I want to set an alarm for exactly, say, 10:50, this interface forces me to set the hour to '10' and then the minutes to '50', instead of just letting me type '1050' (the sensible interface infers the separation between hours and minutes, so you can use a basic number entry interface). The iPhone's standard alarm application actually supports direct entry of alarm times, but it's not exposed as an obvious feature; you have to know that you can tap on the time spinners to get a number pad for direct time entry.
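Inferring the hour/minute split from raw digit entry takes very little code. Here is a hedged Python sketch of one way it could work; it assumes 24-hour times for simplicity, unlike the AM/PM handling a real alarm app would need:

```python
def parse_alarm_digits(digits):
    """Infer (hour, minute) from raw digit entry like '1050' or '730'.

    Assumes 24-hour times for simplicity; returns None for input that
    doesn't form a valid time.
    """
    if not digits.isdigit() or not 3 <= len(digits) <= 4:
        return None
    # The last two digits are always the minutes; whatever is left in
    # front is the hour.
    hour, minute = int(digits[:-2]), int(digits[-2:])
    if hour > 23 or minute > 59:
        return None
    return hour, minute
```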

How this situation probably came about feels relatively straightforward. Spinner fields for selecting between alternatives are a broadly used UI element and are available in standard forms in the system's UI libraries. A UI to adjust times forward and backward would have to be specifically designed for this purpose and would have limited use outside a few contexts. You don't even have to assume laziness on the part of the phone UI designers; if you want to do a good job of UI control design, it needs things like user testing to make sure people can understand it, and you can only do so much of that. It's not difficult to imagine that user testing for something with narrow usage would get pushed way down the priority list in favour of things with more usage and higher importance.

(There is also the issue of UI standardization. Spinner controls may not be ideal for this purpose, but because they're commonly used, people will likely be able to immediately recognize and use them. A custom UI does not have this advantage, and you can argue that setting alarms is not important enough to make people remember a UI just for it. After all, how often do you change the time of alarms? I'm likely an outlier here.)

PhoneAlarmTimeSettingUIIssue written at 23:08:04


Security is not really part of most people's jobs

A while back I said something on the Fediverse:

In re people bypassing infosec policies at work, I feel that infosec should understand that "getting your job done" is everyone's first priority, because in this capitalistic society, not getting your job done gets you fired. You might get fired if you bypass IT security, but you definitely will if you can't do your work. Trying to persuade everyone that it's IT's fault, not yours, is a very uphill battle and not one anyone wants to bet on.

(This is sparked by <Fediverse post>)

Let's look at this from the perspective of positive motivations. By and large, people don't get hired, promoted, praised, given bonuses, and so on for doing things securely, developing secure code, and so on. People get hired for being able to program or otherwise do their job, and they get rewarded for things like delivering new features. Sure, you require people to do things securely, but you (probably) also require them to wear clothes, and people are rewarded about equally for both (which is to say they get to keep being employed and paid money). People may or may not fear losing their job if they don't perform well enough because security is getting in their way, but they definitely do get rewarded for performing the non-security aspects of their job well, especially in programming and other computing jobs.

(Perhaps their current employer doesn't really reward them, but they're probably improving their odds of being rewarded by their next employer.)

It's a trite observation that what you reward is what you get. When you hire and promote people for their ability to program and deliver features, that is what they will prioritize. People are generally not indifferent to security issues (especially today), but what you don't reward has turned it into an overhead, one that potentially gets in the way of getting a promotion, a raise, or a bonus. Will a team kill a feature because they can't make it secure enough, when the feature is on their road map and thus their job ratings for this quarter? You already know the answer to that.

Also, people are going to focus on developing their skills at what you reward (and what the industry rewards in general). When you interview and promote and so on based on people being able to write code and solve problems and ship features, that's what they get good at. When you provide no particular rewards for doing things (more) securely, people have no motivation to work on it, and also they generally have little or no feedback on whether they're doing it right and are improving their skills, instead of flailing around and wasting their time.

(My feeling is that industry practices also make it hard to get useful feedback on the long term consequences of design and programming decisions, in large part because most people don't stay around for the long term, although to be fair a bunch of programs and systems don't either.)

(Many years ago I wrote that people don't care about security and consider it an overhead. I'm not sure that this is still true, but it's probably still somewhat so, along with how security is not the most important thing to most people.)

SecurityNotPeoplesJob written at 22:23:52


Account recovery is still a hard problem in public key management

Soatok recently published their work on a part of end to end encryption for the Fediverse, Towards Federated Key Transparency. To summarize the article, it is about the need for a Fediverse public key directory and a proposal for how to build one (this is a necessary ingredient for trustworthy end to end encryption). Soatok is a cryptographer and security expert and I'm not, so I have nothing to say about the specifics of the proposed protocol and so on. But as a system administrator, one thing did catch my eye right away, and that is that Soatok's system has no method of what I will call "account recovery".

How this manifests in the protocol is that registering in the key directory is a one-way action for a given Fediverse identity. Once you (as a specific Fediverse identity) register your first key in the key directory, you cannot reset from this state and start over again. If you somehow lose all of your registered private keys, there is no natural or easy way out to register a new one under your current Fediverse identity and your only option is to start a new Fediverse identity, which can register from scratch.

(While the proposal allows you to revoke keys if you have more than one active one, it specifically doesn't allow you to revoke your last key. This has the additional effect that you can't advertise that none of your previous keys should be trusted any more and that you can't be reached at all over whatever they enable. The closest you can come is to leave a single public key registered that you've destroyed the private key for, rendering it useless in practice; however, this still leaves people able to retrieve your 'current key' and then use it in things that will never work.)

Of course, there are good security reasons to not allow this sort of re-registration and account recovery, which is undoubtedly why Soatok's proposal doesn't attempt to include them. Telling the difference between account recovery by a good person and account recovery by an attacker is ultimately a very hard problem, so if you absolutely have to prevent the latter, you can't allow account recovery at all. Even partially and reasonably solving account recovery generally requires human involvement, and that is hard and doesn't scale well (and it's hard to write into protocol specifications).

However, I think it's meaningful to note the tradeoffs being made. One of the lenses to look at security related things is through the triad of confidentiality, availability, and integrity. As with any system that doesn't have account recovery, Soatok's proposal is prioritizing confidentiality over availability. Sometimes this is the right tradeoff, and sometimes it isn't.

To me, all of this demonstrates that account recovery remains a hard and unsolved problem in this area (and in a variety of others). I pessimistically suspect that there will never be good solutions to it, but at the same time I hope that clever people will prove me wrong. Good, secure account recovery would enable a lot of good things.

AccountRecoveryHardPKIProblem written at 22:30:40


CVEs are not what I'll call security reports

Today I read Josh Bressers' Why are vulnerabilities out of control in 2024? (via), which made me realize that I, along with other people, had been unintentionally propagating a misinterpretation of what a CVE was (for example when I talked about the Linux kernel giving CVEs to all bugfixes). To put it simply, a CVE is not what I'll call a (fully baked) security report. It's more or less in the name, as 'CVE' is short for 'Common Vulnerabilities and Exposures'. A CVE is a common identifier for a vulnerability that is believed to have a security impact, and that's it.

A CVE as such is thus an identifier and a description of the vulnerability. It does not intrinsically tell you what software and versions of the software the vulnerability is present in, or how severe or exploitable the vulnerability is in any specific environment, or the like, which is to say that it doesn't come with an analysis of its security impact. All of that is out of scope for a basic CVE. We think of all of these things as being part of a 'CVE' because people have traditionally 'enriched' the basic CVE information with these additional details; sometimes this has been done by the people reporting the vulnerability and sometimes it has been done by third parties.

(One reason that early CVEs were enriched by the reporters themselves was that in the beginning, people often didn't believe that certain bugs were security vulnerabilities unless you held their hand with demonstration exploits and so on. As general exploit technology has evolved, entire classes of bugs, such as use after free, are now considered likely exploitable and so are presumed to be security vulnerabilities even without demonstration exploits.)

As Why are vulnerabilities out of control in 2024? notes, the amount of work required to do this enrichment is steadily increasing because the number of CVEs is steadily increasing (even outside the Linux kernel situation). This work won't happen for free, and I mean that in a broad sense, since collectively there is only so much free time people have for (unpaid) vulnerability discovery and reporting. Our options are to (fully) fund vulnerability enrichment (which seems increasingly unlikely), to live with basic CVE reporting, or to get fewer vulnerabilities reported by insisting that only enriched vulnerabilities can be reported in the first place.

(The current state of CVE reporting and assignment is biased toward getting vulnerabilities reported, which in my view is the correct choice.)

It's certainly convenient for system administrators and other people when we get fully baked, fully enriched security (vulnerability) reports instead of bare CVEs or bare vulnerabilities. But not only does no one owe that to us, we also can't have our cake and eat it too. If we insist on only receiving and acting on fully enriched security reports, we will leave some number of vulnerabilities active in our systems (which may or may not be known ones, depending on whether people bother to report them and make them into CVEs).

(This elaborates a bit on some Fediverse posts of mine.)

CVEsVsSecurityReports written at 21:38:38


Stand-alone downloads of program assets have a security implication

I recently read Engineering for Slow Internet (via), which is about what it talks about and also about the practical experience of trying to use the Internet in Antarctica (in 2023), which has (or had) challenging network conditions. One of the recommendations in the article was that as much as possible you allow people to do stand-alone downloads with their own tools for it, rather than forcing them to download assets through your program (which, to put it kindly, may not be well prepared for the Internet conditions of Antarctica). In general, I am all for having programs cope better with limited Internet (I used to be on a PPP dialup modem link long after most people in Canada had upgraded to DSL or cable, and it was a bit unpleasant), but as I was reading the article it occurred to me that supporting people getting assets your program will use through their own downloads can change the security picture of your application a bit, possibly requiring additional changes in how you do things.

When a modern application fetches assets of some sort over HTTPS from a URL that you fully specified (for example, a spot on your website), most of the time you can assume that the contents you fetched are trustworthy. The entire institution of modern web PKI is working (quite well) to keep bad people from easily intercepting and altering that flow of data. Only in relatively high security situations do you need to add some sort of additional end to end security verification, like digital signatures; a lot of the time you can just assume 'we got it over HTTPS from our URL so it's good'.

(Even with fetching assets over HTTPS, signing your assets provides safety against various attacks, including attackers who compromise your website but not your signing infrastructure.)

This is obviously not true any more if you accept files that were downloaded outside of your program's control. Then you're relying on the person using your software to have not been fooled about where they got the files from and to not have had the files quietly swapped out or provided by malicious other software on their machine. Since you didn't fetch these assets yourself, if you need trust in them it will have to be provided in some additional way. If you aren't already digitally signing things, you may need to start doing so (with all of the key management hassles this involves, and potential key expiry, and so on), or perhaps fetch a small list of cryptographic hashes of the assets from your website while allowing the person to provide you the asset files themselves.
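The 'small list of cryptographic hashes' approach is simple on the verification side. Here is a hedged Python sketch, assuming your program has already fetched the expected SHA-256 hex digest over HTTPS from your own site and the person has supplied the asset file itself:

```python
import hashlib

def verify_asset(asset_path, expected_sha256):
    """Check a user-supplied download against a hash obtained through a
    trusted channel (say, a small manifest fetched over HTTPS from your
    own website).

    expected_sha256 is a hex digest string; returns True on a match.
    """
    h = hashlib.sha256()
    with open(asset_path, "rb") as f:
        # Hash in chunks so large assets don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```

The appeal of this over full digital signatures is that the only thing you have to keep trustworthy is the (small) manifest fetch, with web PKI doing the heavy lifting, rather than a signing key and its management.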

(On common systems, some things you want to download may already be signed due to general system requirements, for example program updates.)

This is not just about the security of your program. This is also somewhat about the security of people using your program, in terms of what they can be tricked into doing by a malicious asset that they accidentally download from the wrong place. Attackers definitely already use various forms of fake program updates, compromised installers, and so on, with various additional tricks to direct people to those things.

FetchingVsDownloadsSecurity written at 23:42:06


Phish tests and (not) getting people to report successful phish attacks

One of the very important things for dealing with phish attacks is for people to rapidly self-report successful phish attacks, ones that obtained their password or other access token. If you don't know that an access token has been compromised, even briefly, you can't take steps to investigate any access it may have been used for, mitigate it, and so on. And the sooner you know about it, the better.

So-called "phish tests" in their current form are basically excuses to explicitly or implicitly blame people who 'fall for' the phish test. Explicit blame is generally obvious, but you might wonder about the implicit blame. If the phish test reports how many people in each unit 'fell for' the phish test message, or requires those people to take additional training, or things like that, it is implicitly blaming those people; they or their managers will be exhorted to 'do better' and maybe required to do extra work.

When you conduct phish tests and blame people who 'fall for' those tests, you're teaching people that falling for phish attacks will cause them to be blamed. You are training them that this is a failure on their part and there will be consequences for their failure. When people know that they will be blamed for something, some number of them will try to cover it up, or will delay reporting it, or decide that they didn't really fall for it and they changed their password right away or didn't approve the MFA request or whatever, or the like. This is an entirely natural and predictable human reaction to the implicit training that your phish tests have delivered. And, as covered, this reaction is very bad for your organization's ability to handle a real, successful phish attack (which is going to happen sometime).

Much like you want "blameless incident postmortems", my view is that you want "blameless reporting of successful phishes". I'm not sure how you get it, but I'm pretty sure that the current approach to "phish tests" isn't it (beyond the other issues that Google discussed in On Fire Drills and Phishing Tests). Instead, I think phish tests most likely create a counterproductive mindset in people subjected to them, one where the security team is the opposition, out to trick people and then punish those who were tricked.

(This is the counterproductive effect I mentioned in my entry on how phish tests aren't like fire drills.)

PhishTestsVsReporting written at 22:52:03


Phish tests aren't like fire drills

Google recently wrote a (blog) article, On Fire Drills and Phishing Tests, which discusses the early history of what we now call fire drills. As the article covers, the early "fire evacuation tests" focused mostly on how individual people performed, complete with telling people that things were their own fault for not doing the evacuation well enough. It then analogizes this to the current way "phish tests" are done. As I read this, I had a reaction on the Fediverse to the general thought of fire drills and phish tests:

In re comparing fire drills to phishing tests[1], if phishing tests were like fire drills, they would test the response to a successful phish. Was the person phished able to rapidly report and mitigate things? Do the organization's phish alarms work and reach people? Etc etc.

Current "phishing tests" are like testing people to see if they accidentally start fires if they're handed (dangerously) flammable materials. That's not a fire drill.

1: <fediverse link>

The purpose of fire drills is to test what happens once the fire alarm goes off and to make sure that it works. Do all of the fire alarms actually generate enough noise that people can hear? Are there visual indicators for people with bad or no hearing? Can people see (or hear) where they should go to get out of the building? And so on and so forth. In other words, fire drills test the response to the problem, not whether the problem happens in the first place.

(They also somewhat implicitly test if people respond to fire alarms, because if people don't you have another problem.)

As I mentioned in my Fediverse post, current "phish tests" aren't doing anything like this. Current "phish tests" are testing people to see if they recognize and (don't) respond to phish messages (and then blaming people if they don't handle the phish right, which is one of the things that the Google article is calling out). A "phish drill" that was like a "fire drill" would test all of the mitigation and response processes that you wanted to happen after someone fell for a phish, whatever these were. Of course, one awkward aspect of testing these processes is that you actually have to have them and they need to be made effective. But this is exactly why you should test them, just as part of the reason for fire drills is to make sure you have enough alarms, evacuation routes, and so on (and that they all work).

(I personally think that current blame the person "phish tests" are counterproductive in an additional way not covered by the Google article, but that's another entry.)

PhishTestsVsFireDrills written at 23:01:59
