2017-11-05
How collections.defaultdict is good for your memory usage
There is a classical pattern in code that uses entries in dictionaries to accumulate data. In the simplest form, it looks like this:
    e = dct.get(ky, None)
    if e is None:
        e = []
        dct[ky] = e
    # now we work on e without
    # caring if it's new or old
There is an obvious variation of this that gets rid of the whole
bureaucracy involving the if:
    e = dct.setdefault(ky, [])
    # work on e
On the surface, this looks very much like what you get with
collections.defaultdict. At this level you might reasonably think
that defaultdict is just a convenience, giving you a slightly shorter
and nicer way to write this code so you don't have to either write
the if or use .setdefault() and can instead just do a simple dct[ky].
However, there's an important way that both defaultdict and the
if-based version are better than the .setdefault() version.
To see it, let's change what the individual elements are:
    e = dct.setdefault(ky, ExpensiveItem())
    ....
When I write things this way, the problem may jump out right away.
The issue with this version is that we always create a new
ExpensiveItem object regardless of whether ky is already in dct. If
ky is not in dct, we use the new object and all is good, but if there
already is one, we throw away the new object we created. If we're
dealing with a lot of keys that already exist, this is a lot of
objects being created and then immediately thrown away. Both the
if-based version and defaultdict avoid this problem because they only
create a new object if and when they actually need it, and a
defaultdict version is just as short as the .setdefault() version.
(The other subtle advantage of defaultdict is that you specify the
default item only once, when you create the dictionary, instead of
having to duplicate it in every section of code where you need to do
this update-or-add pattern.)
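To make the difference concrete, here is a minimal sketch that counts how many objects each approach creates. The ExpensiveItem class here is a hypothetical stand-in with a creation counter, and the list of keys is made up for the illustration.

```python
from collections import defaultdict


class ExpensiveItem:
    """Hypothetical stand-in for a costly object; counts how many get created."""

    created = 0

    def __init__(self):
        ExpensiveItem.created += 1
        self.values = []


keys = ["a", "b", "a", "a", "b"]

# .setdefault(): the default argument is evaluated on every call,
# so an ExpensiveItem is built even when the key already exists.
ExpensiveItem.created = 0
dct = {}
for ky in keys:
    e = dct.setdefault(ky, ExpensiveItem())
    e.values.append(1)
setdefault_created = ExpensiveItem.created

# defaultdict: the factory is only called when a key is missing.
ExpensiveItem.created = 0
ddct = defaultdict(ExpensiveItem)
for ky in keys:
    e = ddct[ky]
    e.values.append(1)
defaultdict_created = ExpensiveItem.created

print(setdefault_created, defaultdict_created)  # prints "5 2"
```

With five updates spread over two distinct keys, the .setdefault() loop creates five objects and throws three of them away, while the defaultdict loop creates only the two it keeps.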
On the one hand, this advantage of defaultdict feels obvious once I
write it out like this. On the other hand, Python doesn't really
encourage people to think about how often objects are created and
other aspects of memory churn. Also, even if you know about the issue
(as I generally do), it's tempting to go with the .setdefault()
version instead of the if version just because it's shorter and you
probably aren't dealing with enough objects for this to matter. Using
collections.defaultdict lets you have your cake and eat it too; you
get short code and memory efficiency.
Some early notes on WireGuard
WireGuard is a new(ish) secure IP tunnel system, currently only for Linux. Yesterday I wrote about why I've switched over to it; today is for some early notes on things about it that I've run into, especially in ways it's different from my previous IKE IPSec plus GRE setup.
For the most part, my WireGuard configuration is basically their
simple example configuration, but with a single peer. The important
bit I had to get my head around is the AllowedIPs setting, which
controls which traffic is allowed to flow inside the secure tunnel.
My home machine may receive traffic to its 'inside' IP from anywhere,
so it must have an AllowedIPs of 0.0.0.0/0. My work machine, as my
WireGuard touchdown point, should only see traffic from my home
machine and that traffic should only be coming from my home machine's
inside IP; it has an AllowedIPs of just that IP address.
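As a rough illustration of the receiving side of this, here is a simplified model of the AllowedIPs check written with Python's ipaddress module. It is only a sketch of the idea, not WireGuard's actual code; the function name and the addresses (taken from the documentation ranges) are hypothetical.

```python
import ipaddress


def source_allowed(src_ip, allowed_ips):
    """Return True if src_ip falls inside any of the peer's AllowedIPs networks."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_ips)


# Hypothetical inside IP for the home machine.
HOME_INSIDE_IP = "192.0.2.10"

# On the home machine: traffic arriving through the tunnel may come from anywhere.
home_view_of_work_peer = ["0.0.0.0/0"]
# On the work machine: the home peer may only send from its inside IP.
work_view_of_home_peer = [HOME_INSIDE_IP + "/32"]

print(source_allowed("198.51.100.7", home_view_of_work_peer))  # True
print(source_allowed(HOME_INSIDE_IP, work_view_of_home_peer))  # True
print(source_allowed("198.51.100.7", work_view_of_home_peer))  # False
```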
(I did specify Endpoint on my work machine, which I think means that
my work machine, the 'server', can start the connection handshake
itself when it has packets to send to my home machine and my home
machine hasn't already got things going.)
Unlike IKE (and GRE), WireGuard itself has no way to restrict where traffic from a particular peer is allowed to originate; peers are authenticated (and restricted) purely by their public key, and this public key will be accepted from any IP address that can talk to you. In fact, WireGuard will happily update its idea of where a peer is if you send it appropriate traffic. If you want this sort of IP-based access restriction, you will have to add it yourself by putting both ends of the WireGuard tunnel on fixed UDP port numbers and then using iptables (or nftables) to restrict who can send IP packets to them.
(WireGuard packets are UDP, so an attacker who's managed to get a copy of your keys could forge the IP origin on traffic they send. However, an active connection requires an initial handshake to negotiate symmetric keys, so the attacker can't get anywhere just with the ability to send packets but not receive replies.)
Unlike IKE (again), WireGuard has no user-visible concept of a connection being 'up' (with encryption successfully negotiated with the remote end) or 'down'; a WireGuard network device is always up, although it may or may not pass traffic. This means that you don't have a chance to run scripts when the connection comes up or goes down, for example to establish or withdraw routes through the device. In the past I was tearing down my GRE tunnel on IPSec failure, which had security implications, but with WireGuard the tunnel and its routes stay up all the time and I'll have to manually tear it down at home if the other end breaks and I need things to still mostly work. This is more secure even if it's potentially less convenient.
(If I cared enough I could set up connection monitoring that automatically tore down the routes if the work end of the tunnel couldn't be pinged for long enough.)
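A minimal sketch of what such monitoring could look like follows. Everything in it is an assumption for illustration: the names ping_once and TunnelWatchdog are hypothetical, the probe shells out to the Linux ping command, and the actual route teardown is left to a callback that you would supply.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the system ping; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class TunnelWatchdog:
    """Call on_down once after max_failures consecutive failed probes."""

    def __init__(self, probe, on_down, max_failures=5):
        self.probe = probe            # callable returning True if the far end answered
        self.on_down = on_down        # callable that withdraws the routes
        self.max_failures = max_failures
        self.failures = 0
        self.down = False

    def check(self):
        if self.probe():
            # Any success resets the failure count and re-arms the watchdog.
            self.failures = 0
            self.down = False
            return
        self.failures += 1
        if self.failures >= self.max_failures and not self.down:
            self.down = True
            self.on_down()
```

In use, probe would wrap ping_once() with the work end's inside IP, on_down would run whatever commands remove your routes, and check() would be called from a loop with a sleep between probes. Requiring several consecutive failures is what implements the 'for long enough' part, so a single lost ping doesn't tear things down.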
WireGuard lets you set the firewall mark (fwmark) for outgoing encrypted packets, which turned out to be necessary for me for solving what I'll call the recursive VPN problem, where your remote VPN touchdown point is itself on a subnet that you want to route over the VPN. In fact my case is extra-tricky, because I want non-WireGuard IP traffic to my VPN touchdown address to flow over the WireGuard tunnel. What I did was set a fwmark in WireGuard and then used policy-based routing to force traffic with that mark to bypass the tunnel:
    ip route add default dev ppp0 table 11
    [...]
    # Force Wireguard marked packets out ppp0, no matter what.
    ip rule add fwmark 0x5151 iif lo priority 4999 table 11
(The fwmark value is arbitrary.)
This is much less magic than the IPSec equivalent, and as a result I have more confidence that it won't suffer from occasional bugs.
The fwmark stuff is especially important (and useful) because the current WireGuard software is missing the ability to bind outgoing packets to a specific IP address on a multi-address host. As far as I can see, outgoing packets may sometimes be sent out from whatever IP address WireGuard finds convenient, instead of the IP alias that you've designated as the VPN touchdown. WireGuard on the other end will then implicitly update its idea of the peer address, even if it was initially configured with another one. I may be missing something here, and I should ask the WireGuard people about this; they might accept it as a feature request (or a bug). I'm not sure if you can fix it with policy-based routing cleverness, but you might be able to.
The best way to understand WireGuard configuration files is to think
of them as interface-specific configuration files; I sort of missed
this initially. Since you apply them with 'wg setconf <interface>
<file>', they can only include a single interface's parameters.
Somewhat inconveniently, they include secret information (your
private key) and so must not be readable by other users. Similarly,
it's a bit inconvenient that checking connection status with wg show
requires root privileges, although you can work around that with
sudo.