Wandering Thoughts archives

2017-11-05

How collections.defaultdict is good for your memory usage

There is a classical pattern in code that uses entries in dictionaries to accumulate data. In the simplest form, it looks like this:

e = dct.get(ky, None)
if e is None:
    e = []
    dct[ky] = e

# now we work on e without
# caring if it's new or old

There is an obvious variation of this that gets rid of the whole bureaucracy involving the if:

e = dct.setdefault(ky, [])
# work on e

On the surface, this looks very much like what you get with collections.defaultdict. At this level you might reasonably think that defaultdict is just a convenience, giving you a slightly shorter and nicer way to write this code so you don't have to do either the if or use .setdefault() instead of just doing a simple dct[ky]. However, there's an important way that both defaultdict and the if-based version are better than the .setdefault() version.
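
For comparison, here is a minimal sketch of the same accumulation written with defaultdict. The sample data in `pairs` is made up purely for illustration; only `dct` and `ky` come from the snippets above.

```python
from collections import defaultdict

# Hypothetical sample data: (key, value) pairs to accumulate per key.
pairs = [("a", 1), ("b", 2), ("a", 3)]

# The default is given once, as a callable that builds a fresh entry.
dct = defaultdict(list)
for ky, val in pairs:
    # A plain dct[ky] is enough; a missing key is filled in by
    # calling list() the first time it is accessed.
    dct[ky].append(val)

result = dict(dct)
```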

To see it, let's change what the individual elements are:

e = dct.setdefault(ky, ExpensiveItem())
....

When I write things this way, the problem may jump out right away. The issue with this version is that we always create a new ExpensiveItem object regardless of whether ky is already in dct. If ky is not in dct, we use the new object and all is good, but if there already is one, we throw away the new object we created. If we're dealing with a lot of keys that already exist, this is a lot of objects being created and then immediately thrown away. Both the if-based version and defaultdict avoid this problem because they only create a new object if and when they actually need it, and a defaultdict version is just as short as the .setdefault() version.
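
One way to see the difference is to count constructions. The following is a small sketch, not a benchmark: `ExpensiveItem` here is a stand-in class that only counts how often it is instantiated, and `count_creations` and the key list are hypothetical helpers invented for the demonstration.

```python
from collections import defaultdict

class ExpensiveItem:
    """Stand-in for a costly object; counts how many get created."""
    created = 0

    def __init__(self):
        ExpensiveItem.created += 1

def count_creations(update, keys):
    """Run update(ky) for every key and return how many items were built."""
    ExpensiveItem.created = 0
    for ky in keys:
        update(ky)
    return ExpensiveItem.created

keys = ["a", "b"] * 500  # 1000 updates, but only 2 distinct keys

# .setdefault(): the argument is evaluated on every call.
d1 = {}
with_setdefault = count_creations(
    lambda ky: d1.setdefault(ky, ExpensiveItem()), keys)

# if-based version: only builds an item when the key is missing.
d2 = {}
def if_update(ky):
    e = d2.get(ky)
    if e is None:
        e = ExpensiveItem()
        d2[ky] = e
    return e
with_if = count_creations(if_update, keys)

# defaultdict: the factory is only called for missing keys.
d3 = defaultdict(ExpensiveItem)
with_defaultdict = count_creations(lambda ky: d3[ky], keys)

print(with_setdefault, with_if, with_defaultdict)  # prints "1000 2 2"
```

With mostly repeated keys, the .setdefault() version builds one object per update while the other two build one per distinct key.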

(The other subtle advantage of defaultdict is that you specify the default item only once, when you create the dictionary, instead of having to duplicate it in every section of code where you need to do this update-or-add pattern.)

On the one hand, this advantage of defaultdict feels obvious once I write it out like this. On the other hand, Python doesn't really encourage people to think about how often objects are created and other aspects of memory churn. Also, even if you know about the issue (as I generally do), it's tempting to go with the setdefault() version instead of the if version just because it's shorter and you probably aren't dealing with enough objects for this to matter. Using collections.defaultdict lets you have your cake and eat it too; you get short code and memory efficiency.

python/DefaultdictAndMemoryChurn written at 23:56:04

Some early notes on WireGuard

WireGuard is a new(ish) secure IP tunnel system, currently only for Linux. Yesterday I wrote about why I've switched over to it; today is for some early notes on things about it that I've run into, especially in ways it's different from my previous IKE IPSec plus GRE setup.

For the most part, my WireGuard configuration is basically their simple example configuration, but with a single peer. The important bit I had to get my head around is the AllowedIPs setting, which controls which traffic is allowed to flow inside the secure tunnel. My home machine may receive traffic to its 'inside' IP from anywhere, so it must have an AllowedIPs of 0.0.0.0/0. My work machine, as my WireGuard touchdown point, should only see traffic from my home machine and that traffic should only be coming from my home machine's inside IP; it has an AllowedIPs of just that IP address.

(I did specify Endpoint on my work machine, which I think means that my work machine, the 'server', can start the connection handshake itself if it has packets to send to my home machine and my home machine hasn't already got things going.)

Unlike IKE (and GRE), WireGuard itself has no way to restrict where traffic from a particular peer is allowed to originate; peers are authenticated (and restricted) purely by their public key, and this public key will be accepted from any IP address that can talk to you. In fact, WireGuard will happily update its idea of where a peer is if you send it appropriate traffic. If you want this sort of IP-based access restriction, you will have to add it yourself by putting both ends of the WireGuard tunnel on fixed UDP port numbers and then using iptables (or nftables) to restrict who can send IP packets to them.

(WireGuard packets are UDP, so an attacker who's managed to get a copy of your keys could forge the IP origin on traffic they send. However, an active connection requires an initial handshake to negotiate symmetric keys, so the attacker can't get anywhere just with the ability to send packets but not receive replies.)

Unlike IKE (again), WireGuard has no user-visible concept of a connection being 'up' (with encryption successfully negotiated with the remote end) or 'down'; a WireGuard network device is always up, although it may or may not pass traffic. This means that you don't have a chance to run scripts when the connection comes up or goes down, for example to establish or withdraw routes through the device. In the past I was tearing down my GRE tunnel on IPSec failure, which had security implications, but with WireGuard the tunnel and its routes stay up all the time and I'll have to manually tear it down at home if the other end breaks and I need things to still mostly work. This is more secure even if it's potentially less convenient.

(If I cared enough I could set up connection monitoring that automatically tore down the routes if the work end of the tunnel couldn't be pinged for long enough.)
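
A rough sketch of what such monitoring could look like follows. Everything in it is an assumption for illustration: the function names, the thresholds, and the use of the Linux `ping` command with `-c` and `-W`. The route teardown is left as a caller-supplied callback, since what it should do depends on the local setup.

```python
import subprocess
import time

def peer_reachable(addr, timeout_s=2):
    """Send one ICMP echo with the system ping; True if it was answered.

    Assumes a Linux-style ping that accepts -c (count) and -W (timeout).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def watch_peer(check, on_down, max_failures=5, interval_s=10,
               sleep=time.sleep):
    """Call on_down() once after max_failures consecutive failed checks.

    check and on_down are caller-supplied callables; any successful
    check resets the failure count.
    """
    failures = 0
    while True:
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                on_down()
                return
        sleep(interval_s)
```

In use, `check` would wrap `peer_reachable()` with the inside IP of the work end, and `on_down` would run whatever commands remove the routes through the WireGuard device.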

WireGuard lets you set the firewall mark (fwmark) for outgoing encrypted packets, which turned out to be necessary for me for solving what I'll call the recursive VPN problem, where your remote VPN touchdown point is itself on a subnet that you want to route over the VPN. In fact my case is extra-tricky, because I want non-WireGuard IP traffic to my VPN touchdown address to flow over the WireGuard tunnel. What I did was set a fwmark in WireGuard and then used policy-based routing to force traffic with that mark to bypass the tunnel:

ip route add default dev ppp0 table 11
[...]

# Force WireGuard marked packets out ppp0, no matter what.
ip rule add fwmark 0x5151 iif lo priority 4999 table 11

(The fwmark value is arbitrary.)

This is much less magic than the IPSec equivalent, and as a result I have more confidence that it won't suffer from occasional bugs.

The fwmark stuff is especially important (and useful) because the current WireGuard software is missing the ability to bind outgoing packets to a specific IP address on a multi-address host. As far as I can see, outgoing packets may sometimes be sent out from whatever IP address WireGuard finds convenient, instead of the IP alias that you've designated as the VPN touchdown. WireGuard on the other end will then happily update its idea of the peer address, even if it was initially configured with another one. I may be missing something here, and I should ask the WireGuard people about this; they might accept it as a feature request (or a bug). I'm not sure if you can fix it with policy-based routing cleverness, but you might be able to.

The best way to understand WireGuard configuration files is to think of them as interface-specific configuration files; I sort of missed this initially. Since you apply them with 'wg setconf <interface> <file>', they can only include a single interface's parameters. Somewhat inconveniently, they include secret information (your private key) and so must not be world-readable. Similarly, it's a bit inconvenient that checking connection status with 'wg show' requires root privileges, although you can work around that with sudo.

linux/WireGuardEarlyNotes written at 02:24:37


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.