2005-06-18
SMTP IP firewall stats at June 18th, 2005
We maintain a filter list of bad hosts and network areas that can't
talk to our SMTP port at all; their SMTP packets are silently
discarded. The filter list is reinitialized each time the server
reboots, currently once a week. During the week we add various spam
sources and high volume sources of other rejections to the filters on
a dynamic basis.
As the server does its weekly reboot at 6 AM Sunday morning, right now
is a great time to pull a top-N summary from the kernel's firewall
statistics. So, here are the top 20 sources of rejected packets to
this server over the past nearly 7 days:
Host/Mask         Packets   Bytes
213.4.129.48         7768    356K  [a] [njabl]
192.35.251.3         4539    218K  [a] [bad-helo]
61.128.0.0/10        4356    215K
216.7.201.43         4169    200K  [a] [bad-helo]
220.160.0.0/11       3313    161K
195.46.148.28        2955    177K  [a] [baddns]
65.194.220.21        2696    129K  [a] [cbl]
24.156.64.52         2683    129K  [a] [dialup] [cbl]
218.0.0.0/11         2577    126K
213.29.7.174         2492    150K  [a] [njabl]
219.128.0.0/12       2435    123K
65.214.61.100        2425    116K
66.18.69.6           2359    142K  [a] [spews]
24.222.77.233        2088    125K  [a] [flushot]
62.219.46.43         1949   93552  [a] [dialup] [cbl]
193.70.192.0/24      1893   85360
212.47.15.29         1824    109K  [a] [flushot]
12.31.56.73          1719   82512  [a] [bad-helo]
212.216.176.0/24     1654   86576
221.216.0.0/13       1584   78068
The key:
[a]: entry was added during the week as a high-count rejection
source.
[baddns]: IP lacks a good PTR record.
[bad-helo]: tried to say hi with a bad SMTP HELO name.
[cbl]: IP in cbl.abuseat.org.
[dialup]: IP seems to be in a dynamic/dialup address range.
[flushot]: IP address sent email to our spamtraps.
[njabl]: IP in dnsbl.njabl.org.
[spews]: IP in the SPEWS DNSbl.
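A top-N summary like this takes very little code. The following is a minimal sketch, not the script actually used for the table above. It assumes the counters have already been dumped as whitespace-separated 'host packets bytes' lines (a hypothetical format), and that a trailing K means a multiplier of 1000 (also an assumption).

```python
def parse_count(text):
    """Turn a counter such as '356K' or '93552' into an integer.

    Treating K/M/G as powers of 1000 is an assumption for this sketch.
    """
    suffixes = {"K": 1000, "M": 1000 ** 2, "G": 1000 ** 3}
    if text[-1] in suffixes:
        return int(text[:-1]) * suffixes[text[-1]]
    return int(text)


def top_sources(lines, n=20):
    """Return the n busiest (host, packets, bytes) entries, busiest first."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            packets = parse_count(fields[1])
            nbytes = parse_count(fields[2])
        except ValueError:
            # Header lines and other non-numeric rows are skipped.
            continue
        rows.append((packets, nbytes, fields[0]))
    rows.sort(reverse=True)
    return [(host, packets, nbytes) for packets, nbytes, host in rows[:n]]
```

Any trailing annotations on a line (such as the bracketed tags in the table) are simply ignored by this sketch.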
This isn't a particularly active server for mail in general; we
usually get about 1,000 to 2,000 incoming real mail messages a day
(mostly from mailing lists).
I believe that 213.4.129.48 (smtpout.terra.es), 213.29.7.174
(mail1002.centrum.cz), and 66.18.69.6 (mailout06.infosat.net) are all
involved in providing free email. And apparently doing a bad job of
stopping spammers from using it. Both 213.29.7.174 and 66.18.69.6
would have been rejected by later blocks as well, blocks we set up
due to them sending us spam.
Due to a long-term spam problem, we have a number of Chinese netblocks
that we aren't interested in accepting email from. In this listing,
that's 61.128.0.0/10, 220.160.0.0/11, 218.0.0.0/11, 219.128.0.0/12,
and 221.216.0.0/13.
212.216.176.0/24 is tin.it, an Italian ISP that had yet to get
HELO greetings correct by the time I gave up and firewalled them.
193.70.192.0/24 is liberato.it, another Italian ISP with a
significant spam problem that we've just stopped talking to. (On a
quick spot check it seems to also be iol.it; they may have merged,
been bought out, or renamed since I put them in our filter list.)
65.214.61.100 kept trying to send us email from the blocked origin
address of 'info@salesrecruits.imakenews.net', week after week after
week. At some point I just put them in our core filter list instead of
adding them every week. I don't consider their continued attempts to
send us email despite everything bouncing for months to be a good
sign.
Note: because we drop incoming packets from these IP addresses on
the floor and don't reply to them in any way, this is not an accurate
count of even SMTP connection attempts. (One SMTP connection attempt
will produce a number of packets to our SMTP port, depending on how
much their OS retries TCP connection attempts.)
Disclaimer
By the time you read this, some of these IP addresses may no longer be
in the DNSbls listed. Because this is IP level firewalling, we can't
say anything definite about whether what these places are trying to
send us is spam; we've just decided that we don't want to talk to them
at all.
(Some of the SMTP connection attempts are probably for bounce
backscatter from spammers forging our domain as the MAIL FROM of
their spam runs.)
spam/IPReject-2005-06-18 written at 22:27:45
2005-06-17
The problem with CPAN (and other similar systems)
At one level, CPAN is a great thing: people really like having a simple
way of installing Perl packages (and a big archive of them). It's such
a good idea that it's being copied over and over: Python's distutils,
Ruby's RubyGems, the R statistics package's CRAN, and no doubt others.
But CPAN and things like it have a problem: they're a package management
system. Or, to be more detailed, they're another package management
system, on top of the one that our Unix systems already have.
Multiple package systems on a single computer means that no single
package system has a full picture of the system. This causes various
problems:
- to get a complete picture of what's on the system, I have to
remember to use multiple tools (and remember how to use all of them).
- two different tools can both think they own or exclusively manage certain
files (for example, index files of all the packages installed). The
extreme case is installing the same thing through the OS package manager
and a program's own package manager.
- relationships across packaging systems get missed; for example, things
installed through CPAN likely depend on the version of Perl installed
by the OS's package manager. Does the OS package manager know enough
to tell me that upgrading Perl because of a security fix is going to
orphan all of those CPAN packages I need?
- satisfying dependencies: when I try to install a core OS package
BazOrp, which requires Python package FooBar (version 1.6.1 to 1.7.8)
how does the core OS package management system know that I installed
FooBar 1.7.7 through Python's distutils and it's OK to go ahead?
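To make the dependency question concrete, here is a minimal sketch of the check the OS package manager would need to run if it could see the other system's inventory. The function names are made up for illustration, and it only handles plain dotted-integer versions; real packaging systems have far more elaborate version rules.

```python
def parse_version(text):
    """Split '1.7.7' into (1, 7, 7) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, lowest, highest):
    """True if lowest <= installed <= highest, compared part by part."""
    return parse_version(lowest) <= parse_version(installed) <= parse_version(highest)


# FooBar 1.7.7 against the BazOrp requirement of 1.6.1 to 1.7.8
ok = satisfies("1.7.7", "1.6.1", "1.7.8")
```

Comparing the raw strings instead would get cases like '1.10.0' versus '1.9.0' wrong, which is one reason version inventories need to be kept in a structured form.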
And this simplifies the problems, because most of these CPAN-like
things are not actually package management systems, they are package
installation systems. All they do is install things; they don't keep a
package inventory (especially with version numbers), they usually don't
have much of an idea of package dependencies, and often they can't even
remove what they just installed.
The situation is worse when I work in large-scale environments, with
tens to hundreds of systems. Environments that large can't be managed
by hand; the machines have to be managed through automated systems.
In that sort of environment, every program with its own package system
means that I would have to obtain or build an automated system to manage
that package system. Since the package system itself is unlikely to
provide the basic management tools (inventory, dependencies, etc), I
would have to build those, too.
You may have guessed the punchline: as a result of all of this, we don't
and can't use CPAN, distutils, RubyGems, CRAN, and so on. Of course this
is sometimes difficult to explain to users, who are known to approach us
to ask 'there is this CPAN module I need, can you please install it on
the machines?' and then don't understand why I break down and twitch.
Solution: build real OS packages
I already have to deal with the OS's package management system, so the
best way to make my life easier is to make your package installation
system build OS packages for me, instead of directly installing files.
This shouldn't be too difficult, as your installation system already
has most of the information necessary, such as what files are going
to be installed and a package description. Don't worry too much about
dependencies, as a decent OS packaging system will be capable of working
them out for you.
On Linux systems, supporting building Debian .debs and RPMs will get
you most of the way to making people entirely happy. (You don't have to
decide which distributions to support; with generic building support,
you support everyone using that packaging format.)
Existing support for this
Debian's dh-make-perl builds CPAN packages into Debian .debs.
The CPAN RPM::Specfile package and its cpanflute2 program will build
RPMs from CPAN packages. (Getting it in RPM form to bootstrap this
properly may be a pleasantly recursive exercise.) There's also the
cpan-to-rpm.pl program from
here,
to do everything in one go. (I believe cpanflute2 has had some
problems for us in the past, but I have blotted them out from my
mind.)
Python distutils has a bdist_rpm command for building RPMs,
but this doesn't work reliably for somewhat complicated packages
in the versions I've tried. (Yes, I should file bug reports and
produce patches to fix things. Someday, when I have enough time
to fully investigate the situation.)
sysadmin/CPANProblem written at 22:05:47
2005-06-16
AJAX vs Dialups
AJAX is short for 'Asynchronous Javascript And XML', the common term
for the technology behind highly interactive web sites like Google
Maps and Google Mail. Given that the features AJAX enables (from the
large to the small) are very appealing to designers, we're pretty much
guaranteed to see more and more use of it on web sites.
But please don't reach for AJAX too fast, because there is such a
thing as being too interactive.
AJAX's interactivity comes through communication, and communication
takes bandwidth. While it'd be nice if everyone coming to your web
site had lots of bandwidth, it's not true (unless you want to make it
true by driving away everyone else).
Let's take an example: using AJAX to implement incremental searches.
The search box on your web pages uses AJAX to notice when I start
typing and does a callback to your web server so it can show me
matching results; once I've typed enough to pull up what I want, I
can just go there.
So I start typing, entering 'p'. Lightning-fast, your highly
interactive AJAX wakes up and sends the request back to your web
server. Of course there are a lot of pages that match such a broad
criterion, so the reply is not short (the RD light of my modem goes on
solid). As I add a 'y' and a 't' the whole process repeats, possibly
colliding with the data transfer for the initial 'p' in the process.
This hypothetical web site's great interactivity hasn't helped me,
it's frustrated me. Search has turned into a laggy experience where I
have to wait for the application to catch up to my typing. The slower
a typist I am, the worse it may be; if I type fast I have at least a
chance of outracing the AJAX over-interactivity.
So: don't be too interactive. If your AJAX needs results from
your web server, you probably can't keep up with the user's
interactions in real time. Don't try; wait a bit, let the user get a
bit of a head start, give some feedback every so often, and reserve
your big efforts for when the user has paused. (Pauses in user input
are your big hint that the user is waiting for you now.)
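The 'wait for a pause' rule can be sketched without any browser code at all. The following is an illustration only: it assumes the keystrokes have been recorded as (time in seconds, text so far) pairs, and the half-second threshold is an arbitrary choice, not a recommendation.

```python
def queries_to_send(keystrokes, pause=0.5):
    """Pick which partial inputs are worth sending to the server.

    keystrokes is a time-ordered list of (seconds, text_so_far) pairs.
    A query is sent only for text that was followed by at least `pause`
    seconds of silence, plus whatever the user typed last.
    """
    sent = []
    for index, (when, text) in enumerate(keystrokes):
        is_last = index + 1 == len(keystrokes)
        if is_last or keystrokes[index + 1][0] - when >= pause:
            sent.append(text)
    return sent


typed = [(0.0, "p"), (0.2, "py"), (0.4, "pyt"), (1.5, "pyth")]
requests = queries_to_send(typed)
```

With this input, nothing is sent for 'p' or 'py'; the first request goes out for 'pyt', after the user stops for more than half a second.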
Google Suggest shows
another solution to this: don't return interactive results until
they're small enough to be useful. (In a search interface I do ask
that you put up some feedback to the effect of 'searching for "py":
too many results to show in the sidebar', so that I can tell the
difference between lots of results and no results.)
Whichever you choose, people on dialups (like me at home on my poky
28.8K PPP link) will thank you for considering them. And you may
discover that there are more of us than you thought, along with the
people using your web site from halfway around the world, the
unfortunates stuck behind choked up corporate Internet links, and so
on.
Unsurprisingly I'm not the only person (or the first person) writing
about this general issue; for example,
Markus Baker's discussion is
here.
You can read about other AJAX design issues
here
and here. (And this
entirely neglects the collection of practical issues one faces when
implementing AJAX in the presence of network delays.) Note to self:
AJAX is complicated in practice.
You can read more about AJAX in the
Wikipedia article.
web/AJAXvsDialups written at 23:03:47
Iterator & Generator Gotchas
Python iterators are objects (or functions, using some magic) that
repeatedly produce values, one at a time, until they get
exhausted. Python introduced this general feature to efficiently
support things like:
    for line in fp:
        ... do something with each line ...
Without iterators, this loop would have to call .readlines(), which
reads the entire file into memory, splits it up into lines, and returns
a huge list; with the file acting as an iterator, this code only has one
line in memory at any given time, even if the file is tens or hundreds
of megabytes.
Generators are functions that magically create iterators instead of
just returning values (ignoring some technicalities). Generators are
the most common gateway to iterators, and are thus the more commonly
used term for the whole area.
When iterators were introduced, a number of standard things that had
previously returned lists started returning iterators, and using a
generator instead of just returning a list became part of the common
Python programming idioms.
In many cases it can be tempting, and temptingly easy, to replace
things that return lists with generators; it looks like it should
just work, and it mostly does. It can be similarly tempting to just
ignore the difference in the standard Python modules.
But there are some gotchas when you write code like this, and I have
the stubbed toes to prove it. At one point or another, I've made all
of these iterator-confusion mistakes in my code.
Iterators are always true
    t = generate_list(some, inputs)
    if not t:
        return
    print("Header Line:")
    for item in t:
        .....
If generate_list returns an iterator instead of a list, this code
doesn't work right. Unless someone got quite fancy, iterator objects
are always true, unlike lists, which are only true if they contain
something.
There's really no way to see if an iterator contains anything except
to try to get a value from it. And there's no 'push value back onto
iterator' operation.
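One workaround, sketched here under the assumption that pulling the first value early is acceptable, is to take that value and then glue it back on the front with itertools.chain. The peek name is made up for this example.

```python
import itertools


def peek(iterable):
    """Return None if iterable is empty, else an equivalent iterator.

    The first value is consumed to test for emptiness and then chained
    back in front of the remaining values.
    """
    iterator = iter(iterable)
    try:
        first = next(iterator)
    except StopIteration:
        return None
    return itertools.chain([first], iterator)


empty = peek(iter([]))
full = peek(iter(["a", "b", "c"]))
```

The caller then tests 'if t is None:' instead of 'if not t:'.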
Iterators can't be saved
    def cached_lookup(what):
        if what not in cache:
            cache[what] = real_lookup(what)
        return cache[what]
If real_lookup returns iterators, this code doesn't work.
When an iterator's exhausted, it's exhausted; if you try to use it
again (such as if cached_lookup found it as a cached result), it
generates nothing.
(Technically I believe there are semi-magical ways to copy iterators.
I suspect one is best off avoiding them unless you really have to
save an iterator copy.)
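The simple fix is to expand the iterator into a list before caching it. A minimal sketch, with a hypothetical real_lookup generator standing in for the real thing:

```python
cache = {}


def real_lookup(what):
    """Hypothetical lookup that yields its results one at a time."""
    for number in range(3):
        yield (what, number)


def cached_lookup(what):
    if what not in cache:
        # Store the expanded results, not the one-shot iterator.
        cache[what] = list(real_lookup(what))
    return cache[what]


first = cached_lookup("spam")
second = cached_lookup("spam")
```

This gives up the memory savings of the iterator, which is the price of being able to reuse the results.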
I can't use list methods on iterators
    t = generate_list(some, inputs)
    t.sort()
    t = t[:firstN]
    # ... admire the pretty explosions
Of course, iterators don't have list methods like .sort(), and they
don't support len(), slicing, and so on. If you want to use those,
you have to write:
    t = list(generate_list(some, inputs))
    t.sort(); t = t[:firstN]
Fortunately, list() will expand the iterator for you and is
harmless to apply to real lists, so you can use it without having to
care if the generate_list routine changes what it returns.
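Two standard library alternatives do the expansion for you: sorted() accepts any iterable and returns a new list, and heapq.nsmallest() returns just the first few items in sorted order. A small sketch, using a hypothetical generate_list that yields unsorted numbers:

```python
import heapq


def generate_list():
    """Hypothetical stand-in that yields unsorted values."""
    for value in [5, 3, 9, 1, 7]:
        yield value


firstN = 3

# sorted() consumes the iterator and hands back a real list to slice.
t = sorted(generate_list())[:firstN]

# heapq.nsmallest() produces the same leading items directly.
smallest = heapq.nsmallest(firstN, generate_list())
```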
Writing recursive generators
Sometimes the most natural structure for a generator is a recursive
one. This works, but you have to bear in mind a twist: you cannot
simply return the results of the recursive calls. This is because the
recursive results are themselves iterators, and if you return them
straight your callers get iterators that produce a stream of
iterators that produce a stream of iterators that someday, at some
level, produce actual results. (But by that time the caller has
given up in despair.)
Instead each time you recurse, you have to expand the resulting
iterator and return each result, like so:
    def treewalk(node):
        if not node:
            return
        yield node.value
        for val in treewalk(node.left):
            yield val
        for val in treewalk(node.right):
            yield val
This implies that significantly recursive generators can be quite
inefficient, as they will spend a great deal of time trickling results
up through all the levels involved.
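In Python 3.3 and later, the re-yielding loops can be written with 'yield from'. The sketch below uses a hypothetical Node class for illustration; the change is one of notation, since results are still handed up through every level of the recursion.

```python
class Node:
    """Hypothetical binary tree node, just enough for the example."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def treewalk(node):
    if node is None:
        return
    yield node.value
    # Delegate to the sub-generators instead of looping over them by hand.
    yield from treewalk(node.left)
    yield from treewalk(node.right)


tree = Node(1, Node(2), Node(3, Node(4)))
values = list(treewalk(tree))
```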
python/GeneratorGotchas written at 02:26:39
2005-06-14
Putting a pleasant Python surprise to use
Although I've been programming in Python for a few years now, it keeps
surprising me with little bits and pieces. Here's a neat Python language
feature that I recently used for the first time (discovered originally
through Bram Cohen's LiveJournal).
A common programming pattern is 'search for something to work on, but
stop if you don't find anything'. In Python one might write it something
like this (taken more or less from DWiki's source):
    found = False
    for dir in utils.walk_to_root(curdir):
        page = dir.child("__readme")
        if page.exists():
            found = True
            break
    if not found:
        return ''
    # Go on to use the __readme file we found in some directory.
Python allows you to put 'else' clauses on loops (both for and
while loops); the else clause is executed if the loop ran to completion
instead of being break'd from. This lets us simplify this pattern down
to:
    for dir in utils.walk_to_root(curdir):
        page = dir.child("__readme")
        if page.exists():
            break
    else:
        return ''
If there's no __readme file to be found from the current directory
up to the root, we just return nothing; otherwise, we'll process it.
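For anyone who wants to try the pattern outside of DWiki, here is a self-contained sketch using plain filesystem paths. The walk_to_root and find_readme functions below are stand-ins written for this example, not DWiki's own code.

```python
import os


def walk_to_root(path):
    """Yield path, then each parent directory, ending at the root."""
    path = os.path.abspath(path)
    while True:
        yield path
        parent = os.path.dirname(path)
        if parent == path:
            return
        path = parent


def find_readme(curdir, name="__readme"):
    """Return the nearest file called name at or above curdir, or ''."""
    for directory in walk_to_root(curdir):
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            break
    else:
        # The loop ran off the end without a break: nothing was found.
        return ''
    return candidate
```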
This DWiki code is the first occasion I've had to use this feature since
I discovered it, and I'm pleased to finally have been able to.
(As you can now see, not all the entries in this blog are going to
be long and meandering.)
python/LoopElse written at 17:16:09
Pitfalls in generating Last-Modified:
Every HTTP reply from a web server can include a Last-Modified:
header, which theoretically tells interested parties when the web page
was last modified. This is really something that works best when the
web server is just sending out static files; when it is generating
dynamic content, like DWiki does, things get interesting.
The major use of Last-Modified: is to decide when a browser already has
a current copy of the web page and doesn't need to fetch it again. Thus,
with dynamic pages built from many pieces the Last-Modified: time needs
to be the most recent modification time for all of the pieces. Then
when any of the pieces that make a page are updated, changing the page's
appearance, the page's Last-Modified: time will change and the browser
will fetch a new copy.
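The 'most recent of all the pieces' rule might be sketched as follows. This is an illustration rather than DWiki's actual code; it assumes every piece is a file on disk, and the last_modified name is made up.

```python
import os
from email.utils import formatdate


def last_modified(paths):
    """Return an HTTP date string for the newest of the given files."""
    newest = max(os.path.getmtime(path) for path in paths)
    # usegmt=True produces the 'GMT' form that HTTP dates use.
    return formatdate(newest, usegmt=True)
```

Changing any one file in the list, template or content, moves the result forward.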
This means DWiki can't just use the page's modification time (which
is what gets shown in the 'Last Modified:' line at the bottom of most
CSpace pages). DWiki pages are built from a cascade of templates and
pieces, so as it builds a web page DWiki keeps track of the most recent
modification time of all the files involved; change one template, and
the updated time is automatically propagated through the system.
Or it would if there weren't some complications.
Authentication Soup
Being logged in to a DWiki, and who you're logged in as, not just can
but will change the appearance of pages. It's not just big things, like
being able to see a page's contents; it's everything from DWiki saying
'Welcome, <whoever>' in the top right corner down to whether you get a
login form or a logout button. So if you log in or out and then refresh
pages in your browser, the pages had better change to look right for your
current status; otherwise users start wondering if their login or logout
actually worked.
In order to support Last-Modified: with authentication, DWiki would have
to somehow arrange to track the last time you logged in or out of the
DWiki. While this is theoretically possible, it would be a bunch of work
and would involve trying to send a cookie to every visiting browser (and
I refuse to do the latter).
Instead DWiki just mostly punts when authentication is enabled; regular
DWiki pages get served without any Last-Modified: header. Fortunately
modern browsers have another, better header called ETag: that they
can use instead of Last-Modified: to see if they need to refresh a page.
Page List Soup
The other complication is easy to state: what's the modification
time of a list of files?
Lists of files come up in several places in DWiki, most importantly
when generating Atom syndication feeds. Atom feeds also complicate
life because of two factors:
- the Atom feed format requires some kind of 'most recently updated'
timestamp.
- the ETag: header's value is some identifying hash of the HTTP
response's contents, so if the contents keep changing (because
one generates a 'right now' timestamp as the most recently updated
time in an Atom feed), the ETag: header will keep changing and
everything will keep re-fetching Atom feeds and pages even when
nothing has changed.
(Also, the RSS/Atom feed reader I use doesn't use ETag:, only
Last-Modified:, so I have been trying to support Last-Modified:
in my Atom feeds.)
The simple approach is to make the Last-Modified: value be the
modification time of the most recently modified file in the list.
Unfortunately this doesn't change when files are added or removed from
the middle of the list, which makes it useless for most of DWiki's
purposes.
At the moment DWiki folds in the modification times of all the
directories it scans when looking at files during Atom feed generation
(thereby missing directories that currently have no files in
them at all). At other times it just punts.
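Folding in directory times can be sketched like this. It relies on the usual Unix behaviour that adding or removing a file updates the containing directory's modification time; the list_timestamp name and the flat list of paths are assumptions for the example, not DWiki's real interface.

```python
import os


def list_timestamp(paths):
    """Newest modification time among the files and their directories."""
    times = [os.path.getmtime(path) for path in paths]
    directories = {os.path.dirname(path) or "." for path in paths}
    times.extend(os.path.getmtime(directory) for directory in directories)
    return max(times)
```

As the text notes, a directory that contributes no files to the list is never looked at, so changes there go unnoticed.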
Summary For Client Authors
If you're thinking of writing a feed reader client or a web browser, I
have this to say: please just use the ETag: header. Since it's some
hash value of the HTTP response's data, it's easy to generate and always
accurate about whether or not the response is the same. Last-Modified:
is essentially an approximation in everything except relatively simple
situations or programs that go to obsessive amounts of work.
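For server authors, the hash-based approach is about this much work. The sketch below is deliberately simplified: the choice of SHA-1 is arbitrary, the function names are made up, and the If-None-Match handling only covers a single exact value (no lists of tags, no weak validators, no '*').

```python
import hashlib


def make_etag(body):
    """Return a quoted entity tag derived from the response bytes."""
    return '"%s"' % hashlib.sha1(body).hexdigest()


def respond(body, if_none_match=None):
    """Return (status, headers, body), sending 304 if the tag matches."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


status, headers, _ = respond(b"<feed>...</feed>")
again, _, empty_body = respond(b"<feed>...</feed>", headers["ETag"])
```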
web/LastModifiedPitfalls written at 00:50:46
2005-06-12
Making a Python mountain out of a molehill
DWiki is the software that runs CSpace, including this blog. It's
wound up much bigger than I expected and wanted it to be. This is sort
of the story of how (or why) that happened.
[More available: DWikiGrowth]
python/DWikiGrowth written at 03:22:32
Why a Blog?
Honesty compels me to admit that part of why I am trying blogging is
that I coded a bunch of blog features into DWiki, my wiki-thing, and I
would feel kind of stupid if they never got used. (The blogdir code
has a clear usage case, but the blog code does not quite so much.)
But that's not all of it. Web pages cry out for a linking structure,
an organization, a taxonomy, just like we are supposed to sensibly
organize directories on our computers. And HTML pages themselves are
standalone objects, so it feels wrong to make one too small. (I have
some HTML pages that are only a few lines long; I suspect that they
irritate people who think 'I clicked on a link for this little?'.)
A blog is a structure for casualness, and its way of organizing things
liberates you from all of those worries. Freed of the neurotic worry to
organize, I can just write, jot, scribble, rant, or whatever. And
since blog posts are all run together, I don't have to worry that one
is too small for an HTML page (or just too much work once I add the
necessary framing).
Blog writing doesn't have to be scribbles, and a lot of the blogs that
I most enjoy reading are very carefully put together. But I'm not sure
I have the time and energy for that; a good part of my purpose in
trying out blogging is the theory that writing something is better
than writing nothing.
(That's also the theory behind CSpace and DWiki as a whole.)
The question of why I'm not using a different piece of blog software
for this is something for another time and another entry.
BlogGenesis written at 00:02:19
2005-06-11
Writing DWiki has been a very educational process. Mostly it has been
educational about all sorts of irritations that I was previously happily
ignorant of.
Take HTTP redirects, for example. (Please.)
To be fully specification-compliant, an HTTP redirect must be to
a different URL than the current one, and must be to an absolute
URL: it must redirect to http://host/some/where, not just /some/where.
(Perhaps common browsers all accept relative redirects, but at least
lynx complains about them.)
issue #1: when absolute URLs can't be
This presents a small problem for a program like DWiki: just what
is the absolute URL of a DWiki page? The host is relatively
easy, since modern HTTP requests include the host name (it's how
name-based virtual hosts work).
But ... what about the port? Not every web server lives on port
80, especially a DWiki running in standalone test mode.
In theory the absolute URL should include the port (unless it's the
default). In practice, every program I've tried gleefully adds the port
itself if it is a non-standard port and you're referring to the same
hostname. If you naively generate redirects of http://host:port/...,
what most programs try to get is http://host:port:port/..., which
doesn't work too well.
Presumably people who want to run two web servers on the same host
on different ports just lose.
Maybe this is even documented somewhere. (I jest; I looked, and failed
to find anything obvious in the RFCs.)
Update, much later: I was completely mistaken here; see HostMistake.
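One way a doubled port can arise: in HTTP/1.1 the Host: request header already carries the port when it isn't the default, so code that appends the server's port to the Host: value adds it twice. A minimal sketch of the two approaches, with function names made up for illustration:

```python
def absolute_url(host_header, path, scheme="http"):
    """Build an absolute URL, using the Host: header value as-is."""
    return "%s://%s%s" % (scheme, host_header, path)


def naive_absolute_url(host_header, port, path, scheme="http"):
    """The mistake: append the port even if host_header already has one."""
    return "%s://%s:%d%s" % (scheme, host_header, port, path)


good = absolute_url("localhost:8000", "/some/where")
bad = naive_absolute_url("localhost:8000", 8000, "/some/where")
```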
issue #2: did you say different URL?
Why, yes. Different URL. Why is this irritating? Let's take logging
in to this wiki as an example.
Login forms need to be POST forms, not GETs, because one does not want
the password sitting in plaintext in URLs. The natural way to do it
is to let the login form POST to the current page, which then just
redisplays itself. Unfortunately if you then ask your browser to reload
the resulting page (perhaps to see an updated edit of it), your browser
warns you that you're about to resubmit a POST form and are you sure?
So: what we want to happen is to POST to the current URL, which
instead of redisplaying itself in a POST context immediately
redirects back to the GET version of itself.
Which is where it becomes very irritating that HTTP redirects
have to go to a different URL.
DWiki 'solves' this issue by making up synthetic page names for
processing logins (and logouts, which have the same problem).
Fortunately it can guarantee that certain page names in its URL space
will never be valid real DWiki pages, so it just uses some of them.
To get back to the page you were just reading, the login and logout
forms add a hidden field to the form to say what the old page was.
(Which means that the form has to be generated dynamically, because it's
different on each page.)
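The login flow described above might look roughly like this in outline. This is a sketch under several assumptions, not DWiki's code: the hidden field is called 'return-to' here, the redirect uses status 303, and the form is a plain dictionary. The check on the target keeps the redirect on the same site.

```python
def handle_login_post(form, host_header):
    """Process a login POST, then redirect to the page the user was on.

    form is a dict of submitted fields; 'return-to' is the hidden field
    naming the old page (a made-up field name for this sketch).
    """
    target = form.get("return-to", "/")
    # Only accept local paths, so the hidden field can't send users elsewhere.
    if not target.startswith("/") or target.startswith("//"):
        target = "/"
    # ... check the password and set up the login here ...
    location = "http://%s%s" % (host_header, target)
    return 303, {"Location": location}


status, headers = handle_login_post(
    {"return-to": "/blog/web/HTTPRedirects"}, "example.org:8000")
```

Because the browser ends up on an ordinary GET of the old page, reloading it no longer triggers the 'resubmit this form?' warning.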
web/HTTPRedirects written at 19:44:27
I find myself quite irritated with CSS lately, because I have been
trying to be a good modern web-boy and style this place with CSS.
The problem with CSS is what it leaves out. Take a very simple example:
this blog. Like many blogs, I want a two-column layout: blog
entries on the left, a small sidebar about the blog on the right.
No problem in CSS: use two <div>s, set a width: or a min-width:, and
then set one <div> as float: left; or so. And this works, as long as
neither column overflows. Unfortunately, a) I am guaranteed to someday
write a blog entry with a long unbreakable line, since I am going to
quote code periodically and b) the user can make their browser window
pretty darn narrow.
What I want to happen in this situation is for CSS to shrug and
enlarge the whole thing so that the user has to scroll sideways
to see the entire page. What I can get is one of:
- the over-long line is truncated and is undisplayable.
- the thing containing the over-long line is truncated but grows a
scrollbar, so at least I can theoretically read it.
- the over-long line scribbles itself all over the sidebar.
- my sidebar stops being a sidebar and suddenly becomes a top-bar
or a bottom-bar.
No, blech, ptui, and 'ha, you jest' respectively.
Those evil bad table things that we aren't supposed to use for
layout? Surprise. They get this right.
So if you look at the HTML source for the blog, you will
see a giant table.
(Now, perhaps there is some magic way to do this in CSS that
actually works right and that no one mentions or uses. If so,
please tell me about it. I would like to use CSS if I can.)
web/CSSIrritation written at 19:15:44
These are my WanderingThoughts
This is part of CSpace, and is written by ChrisSiebenmann.