Wandering Thoughts archives

2010-06-29

Converting between IPv4 addresses and 32-bit integers in Python

When you write code that deals with IPv4 addresses, every so often you wind up in situations where you want to convert addresses in conventional text form to unsigned 32-bit integers and manipulate them in various ways (and then convert the results back to normal notation). This is something I've been dealing with recently, so I've been working on this today.

Python doesn't have any standard library functions that directly do these conversions. You can write your own code for this (but it's just long enough to be annoying), or alternately you can glue together the socket and struct modules to get what you want.

`socket.inet_aton` doesn't directly convert an IP address to a 32-bit integer; instead it convert it to a 32-bit packed binary string. To get an actual integer, we need to unpack the string with `struct.unpack`. The one gotcha is that `inet_aton` returns results in network byte order, so if we want our IP addresses as natural numbers (in host byte order) we need to tell `struct` to do this conversion. To convert the other way we can use `socket.inet_ntoa`, which also has the same requirement of working in network byte order.

So we wind up with the following small and almost symmetric routines:

```def ip_to_uint32(ip):
t = socket.inet_aton(ip)
return struct.unpack("!I", t)[0]

def uint32_to_ip(ipn):
t = struct.pack("!I", ipn)
return socket.inet_ntoa(t)
```

If you want your unsigned 32-bit integer IPv4 addresses in network byte order, leave out the `!` in the `unpack()` and `pack()` format strings. Probably you don't, though, unless you are doing peculiar things with them.

(Given what this does, there is no point in using `socket.inet_pton` instead of `inet_aton`; we intrinsically deal only with IPv4 addresses anyways, so extra generality is actually counterproductive.)

PS: if you slap together code like this, it's a good idea to test it with simple IPv4 addresses that you can easily work out the numeric values for by hand. I almost skipped doing this and thus almost missed that `inet_aton` returns results in network byte order instead of host byte order. Such a mistake is entertaining, but only from a suitable distance.

There is such a thing as too much SQL (a reminder to myself)

This is a very small war story.

I've spent part of today working on generating some reports from our IP traffic accounting system. We have multiple internal networks that get NAT'd out different gateways, and what we want is a report (or a set of reports) of the top individual traffic sources for each gateway that did more than a certain amount of aggregate traffic in the day.

Our traffic accounting data is held in a PostgreSQL database. Since a typical day has around six to eight million records (and this is after a record aggregation process), making efficient SQL queries for this reporting is relatively important; even if we're only going to run it once a day, it would be nice if it didn't take ages and bog the database down.

Because of how we collect the raw traffic data our traffic monitoring system only has the traffic's internal IP, without explicit information on which NAT gateway the traffic used. The internal IP statically determines the outgoing gateway, but in a complex way; there are subnets of some networks that go out different gateways than their containing network (eg most of a /16 goes out one way, but a /24 in the middle uses another gateway IP), and there are some machines that are assigned individual public IPs. This complexity means that attributing traffic to the right gateway is one of the hard parts of generating this report.

It's possible to do this in SQL and I spent today working out how to do it (with reasonable efficiency); I needed some mapping tables, one with CIDR netblocks and one with the exceptional machines, and some joining and `NOT IN` work. Then I ran into the subnet challenge. As I was working out how to split the remaining IP address ranges of the overall network with the subnet into multiple CIDR netblocks, I came to my senses.

A quick ad-hoc query showed me that we have less than a thousand different internal IP addresses generating traffic on any given day. Yes, I could do the gateway attribution in SQL, but doing it in SQL makes no sense; with so few internal IP addresses to process it's overkill and almost insanity. The right approach to my problem is to only use SQL for the heavy work of data reduction by generating a 'volume by internal IP' report and then to do the complex work of attributing each internal IP to a particular gateway in another program, in an environment where I have the full power of a procedural programming language (and where I am not potentially running the attribution process seven million times but instead only a few hundred times).

(As a bonus I'm going to do one less volume-aggregation query for reasons that don't fit in this entry's margin. And I no longer need those auxiliary mapping tables, which means that we don't have to maintain and update them.)

As I've noted to myself before, SQL is not the answer to everything. Today was a sharp reminder of this, although fortunately it was not all that painful a one. (Still, I'm going to mourn my nice SQL query a bit.)

PS: if there is a fast, efficient way to get PostgreSQL to tell you the most specific CIDR match for an IP address out of a table of CIDRs (in the context of an overall SELECT involving lots of different IP addresses), please don't tell me what it is. I don't need the temptation.

(Okay, I'd be interested to know what it is, but I'm always curious about things like that.)