There is such a thing as too much SQL (a reminder to myself)

June 29, 2010

This is a very small war story.

I've spent part of today working on generating some reports from our IP traffic accounting system. We have multiple internal networks that get NAT'd out different gateways, and what we want is a report (or a set of reports) of the top individual traffic sources for each gateway that did more than a certain amount of aggregate traffic in the day.

Our traffic accounting data is held in a PostgreSQL database. Since a typical day has around six to eight million records (and this is after a record aggregation process), making efficient SQL queries for this reporting is relatively important; even if we're only going to run it once a day, it would be nice if it didn't take ages and bog the database down.

Because of how we collect the raw traffic data our traffic monitoring system only has the traffic's internal IP, without explicit information on which NAT gateway the traffic used. The internal IP statically determines the outgoing gateway, but in a complex way; there are subnets of some networks that go out different gateways than their containing network (eg most of a /16 goes out one way, but a /24 in the middle uses another gateway IP), and there are some machines that are assigned individual public IPs. This complexity means that attributing traffic to the right gateway is one of the hard parts of generating this report.

It's possible to do this in SQL and I spent today working out how to do it (with reasonable efficiency); I needed some mapping tables, one with CIDR netblocks and one with the exceptional machines, and some joining and NOT IN work. Then I ran into the subnet challenge. As I was working out how to split the remaining IP address ranges of the overall network with the subnet into multiple CIDR netblocks, I came to my senses.

A quick ad-hoc query showed me that we have less than a thousand different internal IP addresses generating traffic on any given day. Yes, I could do the gateway attribution in SQL, but doing it in SQL makes no sense; with so few internal IP addresses to process it's overkill and almost insanity. The right approach to my problem is to only use SQL for the heavy work of data reduction by generating a 'volume by internal IP' report and then to do the complex work of attributing each internal IP to a particular gateway in another program, in an environment where I have the full power of a procedural programming language (and where I am not potentially running the attribution process seven million times but instead only a few hundred times).

(As a bonus I'm going to do one less volume-aggregation query for reasons that don't fit in this entry's margin. And I no longer need those auxiliary mapping tables, which means that we don't have to maintain and update them.)

As I've noted to myself before, SQL is not the answer to everything. Today was a sharp reminder of this, although fortunately it was not all that painful a one. (Still, I'm going to mourn my nice SQL query a bit.)

PS: if there is a fast, efficient way to get PostgreSQL to tell you the most specific CIDR match for an IP address out of a table of CIDRs (in the context of an overall SELECT involving lots of different IP addresses), please don't tell me what it is. I don't need the temptation.

(Okay, I'd be interested to know what it is, but I'm always curious about things like that.)


Comments on this page:

From 213.155.151.239 at 2010-06-29 04:18:54:

ip4r adds lots of speed compared to the normal PostgreSQL network data types and functions

From 198.102.62.250 at 2010-06-29 10:29:54:

In MySQL I have done this by having a table with IP addresses (in numeric form) and another table with IP ranges or subnets (with a start and end IP address -- also in numeric form).

I then do a subquery on the IP range table and sort by the size of the range. Outside of my subquery, I GROUP BY the IP address so it chooses the IP range with the smallest (most specific) delta.

Probably an ugly, MySQL-only hack -- I've never tried it elsewhere.

Here's the SQL:

    SELECT a.id, INET_NTOA(a.address) AS address, h.department, ranges.name,
           ranges.net_type, i.state
      FROM addresspool a
 LEFT JOIN (
         SELECT range_id, range_end - range_start AS range_diff, 
          range_start, range_end
     FROM range_members
       ORDER BY range_diff ASC
       ) AS range_members ON a.address BETWEEN range_start AND range_end
 LEFT JOIN ranges ON ranges.id = range_members.range_id
 LEFT JOIN hostdata h ON h.poolid = a.id
 LEFT JOIN ipstate_ip ii ON ii.address = a.address
 LEFT JOIN ipstate i ON ii.id = i.ip_id
     WHERE a.purpose IN ('IP', 'NAT')
       AND a.scanned = 'Y'
       AND ranges.name IS NOT NULL
       AND (i.state = 'UP'
        OR i.state IS NULL)
  GROUP BY a.address
  ORDER BY h.department ASC, a.address ASC

Ray

Written on 29 June 2010.
« The great irritation of hidden access controls
Converting between IPv4 addresses and 32-bit integers in Python »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Jun 29 00:46:01 2010
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.