Googlebot and Feedfetcher are still aggressively grabbing syndication feeds
Somewhat more than a year ago I wrote about how I'd detected Googlebot aggressively crawling my syndication feeds, despite them being marked as 'stay away'. At the time I was contacted by someone from Google about this and forwarded various information about it.
Well, you can probably guess what happened next: nothing. It is now more than a year later and Googlebot is still determinedly pounding away at fetching my syndication feed. It made 25 requests for it yesterday, all of which got 403s as a result of me blocking it back then. In fact Googlebot is still trying on the order of 25 times a day despite getting 403s on all of its requests for this URL for literally more than a year.
(At least it seems to be down to only trying to fetch one feed URL.)
Also, because I was looking: what is now more than a year and a half ago I discovered that Google Feedfetcher was still fetching feeds, and as a result I blocked it. Well, that's still happening too. Based on the last 30 days or so, Google Feedfetcher is making anywhere between four and ten attempts a day. And yes, that's despite getting 403s for more than a year and a half. Apparently those don't really discourage Google's crawling activities if Google really wants your data.
I'd like to say that I'm surprised, but I'm not in the least bit. Google long ago stopped caring about being a good Internet citizen, regardless of what its propaganda may say. These days the only reason to tolerate it and its behavior is because you have no choice.
(As far as I can tell it remains the 800 pound gorilla of search traffic, although various things make it much harder for me to tell these days.)
Sidebar: The grumpy crazy idea of useless random content
If I was a real crazy person, it would be awfully tempting to divert Google's feed requests to something that fed them an endless or at least very large reply. It would probably want to be machine generated valid Atom feed entries full of more or less random content. There are of course all sorts of tricks that could be played here, like embedding honeypot URLs on a special web server and seeing if Google shows up to crawl them.
I don't care enough to do this, though. I have other fish to fry in my life, even if this stuff makes me very grumpy when I wind up looking at it.
Wandering Thoughts is now ten years old
Because I am often terrible at scheduling, Wandering Thoughts' ten year anniversary was actually almost a month ago, on June 12th (for odd reasons). And as I noted four years ago, I'm not really one for anniversaries. Still, ten years is something that feels significant, enough so to produce some words.
I'm a different person than I was ten years ago and four years ago, but then we almost all are. Some of the changes are welcome ones, some less welcome, and some just are. Wandering Thoughts too has undoubtedly changed over the ten years I've been writing at least one entry a day here, but those changes are usually less obvious to me. Hopefully they are overall for the better.
(When I go back to read old entries, especially very old entries, I feel somewhat ambivalent about the changes in my writing style that I think I see. I suspect that everyone does.)
If you'd told me at the start that I would still be writing Wandering Thoughts ten years later, well, honestly I might have believed you; I'm the sort of person who gets into habits and then sticks to them unless something big comes along to jar me out. Am I happy to have done this and to still be doing this? Yes, of course, or I wouldn't be doing it any more. Writing Wandering Thoughts has enriched my life in any number of ways, both in the writing itself and in the contacts and associations I've made through the blog, and I'd be a quite different person without WT.
(Sometimes I wonder a bit about what that other me would be like. It's kind of fun but also hard; WT's effects on me feel quite pervasive.)
I don't expect to stop writing here and I probably won't change how I do it; my one entry a day habit is quite well set by now (although I sometimes think about the potential merits of taking longer to develop and write entries; writing them in an hour or two has its limitations and drawbacks).
(The next vaguely significant waypoint will be 4,000 main entries. Don't expect a marker for it, though.)
(And yes, if I think about it, ten years of an entry a day is kind of a scary thing to contemplate. I don't even try to add up the total time and effort I've put into Wandering Thoughts in the past ten years; it's far too intimidating.)
Some notes on my 'commit local changes and rebase' Git workflow
A month or so ago I wrote about how I don't commit changes in my working repos and in reaction to it several people argued that I ought to change my ways. Well, never let it be said that I can't eventually be persuaded to change my ways, so since then I've been cautiously moving to committing my changes and rebasing on pulls in a couple of Git repos. I think I like it, so I'm probably going to make it my standard way of working with Git in the future.
The Git configuration settings I'm using are:
git config pull.rebase true
git config rebase.stat true
The first just makes 'git pull' be 'git pull --rebase'. If I wind up working with multiple branches in repos, I may need to set this on a per-branch basis or something; so far I just track origin/master so it works for me. The second preserves the normal 'git pull' behavior of showing a summary of updates, which I find useful for keeping an eye on things.
One drawback of doing things this way is that 'git pull' will now abort if there are also uncommitted changes in the repo, such as I might have for a very temporary hack or test. I need to remember to either commit such changes or do 'git stash' before I pull.
(The other lesson here is that I need to learn how to manipulate rebase commits so I can alter, amend, or drop some of them.)
Since I've already done this once: if I have committed changes in a repo without this set, and use 'git pull' instead of 'git pull --rebase', one way to abort the resulting unwanted merge is 'git reset --hard HEAD'. Some sources suggest 'git reset --merge' or 'git merge --abort' instead. But really I should set pull rebasing to on the moment I commit my own changes to a repo.
(There are a few repos around here that now need this change.)
I haven't had to do a bisection on a commit-and-rebase repo yet, but I suspect that bisection won't go well if I actually need my changes in all versions of the repo that I build and test. If I wind up in this situation I will probably temporarily switch to uncommitted changes and use of 'git stash', probably in a scratch clone of the upstream master repo.
(In general I like cloning repos to keep various bits of fiddling around in them completely separate. Sure, I probably could mix various activities in one repo without having things get messed up, but a different directory hierarchy that I delete afterwards is the ultimate isolation and it's generally cheap.)
Some thoughts on Go compiler directives being in source comments
Recently, I've been reading some commotion about how Go compiler directives being in source code comments is, well, not the 'elegant design' that Go's creators may feel it is. As it happens I have some sympathies for Go here, so let's talk about what I see as the issues involved.
First, let's differentiate between what I'll arbitrarily call 'broad' and 'narrow' compiler directives. In a nutshell, what I'm calling a broad compiler directive is something that changes the meaning of the source code such that every compiler implementation must handle it. In C, #include and #define are broad directives. Broad directives are effectively part of the language and as such I feel that they deserve first class support as an explicit element in language syntax.
(Broad directives don't have to use a new language syntax element. Python's 'from __future__ import ...' is such a broad directive, but it uses a standard language element.)
By contrast, narrow directives only apply to a specific compiler or tool. Since they're only for a specific program they should be namespaced, ie you need some way of saying 'this uninterpreted blob of text is only for <X>' so that other compilers can ignore it. This requires either a specific element of language syntax to say 'this following text is only for <X>' or hijacking a portion of some existing syntax where you can add arbitrary namespaced text. The easiest existing syntax to hijack is comments.
Since narrow directives do not change the language itself (at least in theory), it seems at least a bit odd to give them an explicit syntax element. In effect you're creating another escape hatch for language-meaningless text that sits alongside comments; one is sort of for people (although it may be interpreted by tools, for example for documentation) and one is a slightly structured one for tools.
(If a narrow directive changes the semantics of the code being compiled, it's actually changing the language the compiler is dealing with from 'language <X>' to 'something similar to <X> but not quite it'. Problems often ensue here in the long run.)
As far as I know, all of the existing Go compiler directives are narrow directives. They're either used by specific non-compiler tools or they're internal directives for one specific Go compiler (admittedly the main 'go' compiler). As far as I'm concerned this makes them pretty much fair game to be implemented without a specific element of language syntax. Other people may disagree and feel that even narrow directives should have some sort of specific language syntax.
PS: There may well be standard terminology in the programming language community for what I'm calling broad versus narrow directives here.
Sidebar: The problem with non-namespaced narrow directives
If you don't namespace your narrow directives you wind up with the #pragma problem, which is 'what do you do when you encounter a #pragma that you don't recognize?'. If you do error out, you cause problems for people who are using you to compile source code with #pragmas for some other compiler. If you don't error out, you cause problems for people who've accidentally misspelled one of your #pragmas and are now having it be more or less silently ignored.
(You can try to know about the #pragmas of all other compilers, but in practice you're never going to know absolutely all of them.)
My early impressions of Fedora 22, especially of DNF
I recently updated first my office laptop (which runs a relatively stock Cinnamon environment) and then my office workstation (which runs my custom setup) to Fedora 22, both via my usual means of a yum-based upgrade instead of the officially supported FedUp mechanism. I feel kind of ambivalent about the results of this.
On the one hand, the upgrade was smooth in both cases and everything in both of my environments basically worked from the start. This is not always the case, especially in my custom setup; I'm used to having to fiddle things around after Fedora version upgrades in order to get audio or automatic removable media mounting or whatever working again. Instead everything pretty much just went, and nothing changed in my Cinnamon environment.
On the other hand, how can I say this gently: I have not been really impressed with Fedora 22's quality control. A number of things happened to me in and after the upgrade:
- Fedora 22 appears to have shuffled around the /dev/disk/by-id names for disks in a way that broke automatic boot time importing of my ZFS pools until I imported them once by hand. I'm not entirely happy with this, but I am running an unsupported configuration.
- systemd's networkd exhibited systemd's usual regard for its users by changing which of several IP addresses would be an interface's primary IP address. Apparently my hope was naive.
(This is where some systemd person smugly observes that the order is not documented and so I deserve whatever I get here, including random decisions from boot to boot. See 'usual regard for its users', above.)
- Fedora 22 has a broken rsyslog(d) that will casually and trivially dump core. Fortunately the code flaw is trivially fixable; it's a good thing that I know how to patch RPMs by hand.
An update is coming out sometime, but apparently Fedora does not consider 'the syslog daemon dumps core' to be a high priority issue.
(This feels like a facet of the great systemd thing about syslog.)
- Fedora 22 now spews audit system messages all over your kernel logs by default, which is especially fun if you just got rsyslog working and would like to watch your kernel logs for important anomalies. I have so far disabled most of this with 'auditctl -e 0'; my long term fix is going to be adding 'audit=0' to the kernel command line. I wish I knew what changed in Fedora 22 to cause these messages to start showing up in my configuration, but, well, who knows.
(I also modified my rsyslog configuration to divert those messages to another file, using a very brute force method because I was angry and in a hurry.)
- I ran into a gcc 5.1.1 bug.
And then there's DNF, the Fedora 22 replacement for yum. Oh, DNF, what can I say about you.
I believe the Fedora and DNF people when they say that the internals of DNF are better than the internals of Yum. But it's equally clear to me that DNF is nowhere near as usable and polished as Yum and so it has a ton of irritations in day to day usage. My experience with DNF has it slow and balky and erratic as compared to the smooth and working Yum I'm used to, and I've been neither impressed nor enthused about the forced switch. From a user perspective, this is not an improvement, it's a whole bunch of regressions.
On top of that, it's pretty clear that no one has ever seriously used or tested the dnf 'local' plugin, which lets you keep a copy of all packages you install through DNF. I've used the equivalent Yum plugin for years so that I could roll back to older versions of packages if I needed to (ie, when a package 'improvement' has broken the new current version for me). The DNF version has a truly impressive collection of 'this thing doesn't work' bugs. I managed to get it sort of working by dint of being both fairly familiar with how this stuff works under the hood and willing to edit the DNF Python source, and even then it sometimes explodes.
(Many people may not care about this but I actually use yum quite frequently, so a balky, stalling, uninformative, and frustrating version of it is really irritating. Everything I do with DNF seems to take twice as long and be twice as irritating as it was with Yum.)
At this point some people will reasonably ask if upgrading to Fedora 22 was worth it. My current answer is 'yes, sort of, and it's not as if I have a choice here'. To run Fedora is to be on an upgrade treadmill, like it or not, and Fedora 22 does improve and modernize some things. All of this annoyance is just the price I periodically pay for running Fedora instead of any of the alternatives.
(And yes, I still prefer Fedora to Debian, Ubuntu, or FreeBSD.)
The probable and prosaic explanation for a socket() API choice
It started on Twitter:
@mjdominus: Annoyed today that the BSD people had socket(2) return a single FD instead of a pair the way pipe(2) does. That necessitated shutdown(2).
@thatcks: I suspect they might have felt forced to single-FD returns by per-process and total kernel-wide FD limits back then.
I came up with this idea off the cuff and it felt convincing at the moment that I tweeted it; after all, if you have a socket server or the like, such as inetd, moving to a two-FD model for sockets means that you've just more or less doubled the number of file descriptors your process needs. Today we're used to systems that let processes have a lot of open file descriptors at once, but historically Unix had much lower limits and it's not hard to imagine inetd running into them.
It's a wonderful theory but it immediately runs aground on the practical reality that socket() and accept() were introduced no later than 4.1c BSD, while inetd only came in with 4.3 BSD (which was years later). Thus it seems very unlikely that the BSD developers were thinking ahead to processes that would open a lot of sockets at the time that the socket() API was designed. Instead I think that there are much simpler and more likely explanations for why the API isn't the way Mark Jason Dominus would like.
The first is that it seems clear that the BSD people were not particularly concerned about minimizing new system calls; instead BSD was already adding a ton of new system features and system calls. Between 4.0 BSD and 4.1c BSD, they went from 64 syscall table entries (not all of them real syscalls) to 149 entries. In this atmosphere, avoiding adding one more system call is not likely to have been a big motivator or in fact even very much on people's minds. Nor was networking the only source of additions; 4.1c BSD added rmdir(), for example.
The second is that C makes multi-return APIs more awkward than single-return APIs. Contrast the pipe() API, where you must construct a memory area for the two file descriptors and pass a pointer to it, with the socket() API, where you simply assign the return value. Given a choice, I think a lot of people are going to design a socket()-style API rather than a pipe()-style one.
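To make the contrast concrete, here is a minimal sketch in modern C of the two API shapes. The wrapper names (make_pipe, make_socket, demo_api_shapes) are made up for illustration, and the AF_UNIX stream socket is used only because it needs no network setup; none of this is taken from the BSD sources.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* pipe() style: two results cannot come back as one C return value, so the
 * caller provides memory and the descriptors are written through a pointer. */
int make_pipe(int fds[2])
{
    return pipe(fds);                         /* 0 on success, -1 on error */
}

/* socket() style: the single descriptor is the return value itself. */
int make_socket(void)
{
    return socket(AF_UNIX, SOCK_STREAM, 0);   /* descriptor, or -1 on error */
}

/* Hypothetical usage: exercise both shapes and clean up afterwards.
 * Returns 0 if both calls succeeded, -1 otherwise. */
int demo_api_shapes(void)
{
    int fds[2];
    if (make_pipe(fds) == -1)
        return -1;
    assert(fds[0] != fds[1]);     /* read end and write end are distinct */

    int s = make_socket();
    int ok = (s != -1);

    close(fds[0]);
    close(fds[1]);
    if (ok)
        close(s);
    return ok ? 0 : -1;
}
```

The pipe() caller has to declare an array first and only then make the call, while the socket() caller can write 'int s = make_socket();' in one step.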
There's also the related issue that one reason the pipe() API works well returning two file descriptors is because the file descriptors involved almost immediately go in different 'directions' (often one goes to a sub-process); there aren't very many situations where you want to pass both file descriptors around to functions in your program. This is very much not the case in network related programs, especially programs that use select(). If socket() et al returned two file descriptors, one for read and one for write, I think that you'd find they were often passed around together. Often you'd prefer them to be one descriptor that you could use either for reading or writing depending on what you were doing at the time. Many classical network programs (and protocols) alternate reading and writing from the network, after all.
(Without processes that open multiple sockets, you might wonder what select() is there for. The answer is programs like rlogin (and their servers), which talk to both the network and the tty at the same time. These were already present in 4.1c BSD, at the dawn of the socket() API.)
pipe() user API versus the kernel API
Before I actually looked at the 4.1c BSD kernel source code, I was also going to say that the kernel to user API makes returning more than one value awkward because your kernel code has to explicitly fish through the pointer that userland has supplied it in things like the pipe() system call. It turns out that this is false.
Instead, as far back as V7 and probably further, the kernel to user API could return multiple values; specifically, it could return two values. pipe() used this to return both file descriptors without having to fish around in your user process memory, and it was up to the C library to write these two return values to the memory area you supplied.
I really should have expected this; in a kernel, no one wants to have to look at user process memory if they can help it. Returning two values instead of one just needs an extra register in the general assembly level syscall API and there you are.
BSD Unix developed over more time than I usually think
Left to myself, I tend to sloppily think of 4.2 BSD as where all of the major development of BSD Unix took place and the point in time where what we think of as 'BSD Unix' formed. Sure, there were BSDs before and after 4.2 BSD, but I think of the before releases as just the preliminaries and the releases after 4.2 BSD as just polishing and refining things a bit. As I was reminded today, this view is in fact wrong.
If you'd asked me what 4.x BSD release inetd first appeared in, I would have confidently told you that it had to have appeared in 4.2 BSD along with all of the other networking stuff. Inetd is such a pivotal bit of the BSD networking (along with the services that it enables, such as finger) that of course it would be there from the start in 4.2, right?
Wrong. It turns out that inetd only seems to have appeared in 4.3 BSD. In fact a number of related bits of 4.2 BSD are surprisingly under-developed and different from what I think of as 'the BSD way'. finger in 4.2 BSD is not network enabled, but a more fundamental thing is that 4.2 BSD limits processes to only 20 open file descriptors at once (by default, and comments in the source suggest that this cannot be raised above 30 no matter what).
Instead it is 4.3 BSD that introduced not just inetd but a higher limit on the number of open file descriptors (normally 64). With that higher limit came the modern FD_* set of macros used to set, check, and clear bits in the select() file descriptor bitmaps; 4.2 BSD didn't need these since the file descriptor masks fit into a single 32-bit word.
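As a rough illustration of what the FD_* macros buy you, here is a sketch in modern C that asks select() whether a single descriptor is readable. The function names are hypothetical, and word_mask() only illustrates the single-word idea (one bit per descriptor); it is not the actual 4.2 BSD code.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* The single-word idea: with at most 32 descriptors, a whole set fits in
 * one word and descriptor fd is simply bit (1 << fd). Illustration only. */
unsigned long word_mask(int fd)
{
    assert(fd >= 0 && fd < 32);
    return 1UL << fd;
}

/* The FD_* way: an opaque fd_set that can hold more descriptors than fit
 * in one word. Returns 1 if fd is readable right now, 0 if it is not,
 * and -1 on error. */
int fd_is_readable(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };    /* zero timeout: poll, don't block */

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                   /* an fd_set cannot represent this fd */
    FD_ZERO(&rfds);                  /* clear every bit */
    FD_SET(fd, &rfds);               /* set the bit for fd */
    if (select(fd + 1, &rfds, NULL, NULL, &tv) == -1)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;   /* check the bit afterwards */
}
```

Calling fd_is_readable() on the read end of an empty pipe should report 0, and 1 once something has been written to the other end.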
dup2() and BSD's low file descriptor limit
Given the existence of the dup2() system call, which in theory lets you create a file descriptor with any FD number, you might wonder how 4.2 BSD got away with a 32-bit word for the select() bitmask. The answer turns out to be that 4.2 BSD simply forbade you from dup2()'ing to a file descriptor number bigger than 19 (or in general the compiled-in limit on open file descriptors).
(You can see the code for this in the 4.2 BSD kernel source. In general a lot of the early Unix kernel source code is quite simple and readable, which is handy at times like this.)
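Modern systems still have a version of this restriction, just tied to the per-process descriptor limit instead of a compiled-in constant. The following sketch uses modern POSIX calls and hypothetical helper names (nothing here is taken from 4.2 BSD). It lowers the soft RLIMIT_NOFILE limit to 20 and then checks that dup2() to descriptor 19 works while dup2() to descriptor 20 is refused. It assumes the hard limit is at least 20 and that nothing important is already open on descriptor 19, since dup2() silently replaces whatever is there.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Try dup2(src, target); return 0 on success or the errno value on failure.
 * On success the new descriptor is closed again (unless it is src itself). */
int dup2_errno(int src, int target)
{
    assert(src >= 0 && target >= 0);
    if (dup2(src, target) == -1)
        return errno;
    if (target != src)
        close(target);
    return 0;
}

/* Hypothetical demo: with a soft limit of 20 descriptors, 19 is the highest
 * usable descriptor number. Returns 0 if dup2() to 19 worked and dup2() to
 * 20 was refused, -1 otherwise. */
int demo_dup2_limit(void)
{
    struct rlimit old, lowered;
    int fds[2];

    if (getrlimit(RLIMIT_NOFILE, &old) == -1 || pipe(fds) == -1)
        return -1;

    lowered = old;
    lowered.rlim_cur = 20;           /* assumes the hard limit is >= 20 */
    if (setrlimit(RLIMIT_NOFILE, &lowered) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    int in_range = dup2_errno(fds[0], 19);      /* below the limit */
    int out_of_range = dup2_errno(fds[0], 20);  /* at the limit; Linux gives EBADF */

    setrlimit(RLIMIT_NOFILE, &old);  /* restore the original limit */
    close(fds[0]);
    close(fds[1]);
    return (in_range == 0 && out_of_range != 0) ? 0 : -1;
}
```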
Faster SSDs matter to companies because they sell things
The computer hardware industry has a problem: systems mostly aren't getting (much) better any more, especially desktop PCs. The most famous example is that CPU performance has been changing only incrementally for years, especially for single threaded performance. This is a problem because a lot of hardware sales are upgrades, and when there's no particular performance improvement you can trumpet, people don't bother to upgrade. They do replace old machines eventually, but that's slower and less lucrative (and runs the risk of people leaking out to, eg, tablets and smartphones).
This is where a faster SSD interconnect matters to companies; it's a clear performance improvement they can point to. Whether or not it makes a difference in practice for most people, companies can trumpet 'much faster disk read and write speeds' as well as 'take full advantage of SSDs' and thereby move (more) hardware. No matter what it does in practice, it sounds good.
My general impression is that Intel and the motherboard companies are pretty desperate for things that will move new hardware, and really I can't blame them. So I wouldn't be surprised to see U.2 NVMe support appear in motherboards and systems quite fast, and I honestly hope it works to prop up their fortunes.
(As someone who is well out on the small tail end in terms of my PC hardware, I have a vested interest in a vibrant motherboard market that caters to even relatively weird interests like mine.)
Sidebar: the 'pushing technology' view
On a longer and larger scale view, drastically increased 'disk' access speeds that are essentially specific to SSDs also increase the chances that people will start building filesystems and other things that are specifically designed and tuned for SSDs, or just generally for things that look more like memory than rotating magnetic medium. It's been very useful to be able to pretend SSDs are hard drives, but they aren't really and we may find that systems are quite different and better when we can stop pretending.
(This too is likely to sell new hardware over the long term.)
The next (or coming) way to connect SSDs to your system
Modern SSDs have a problem: flash chips are so fast that they outpace even high speed SATA and SAS links. In the enterprise market the workaround for this is SSD 'drives' that are PCIe cards, but this has all sorts of drawbacks as a general solution. Since the companies involved here are not stupid, they've known this for some time and have come up with a new interconnection system, NVMe aka NVM Express.
The Wikipedia page is a bit confusing to outsiders, but as far as I can tell NVMe is essentially a standard for how PCIe SSDs should present themselves to the host system. NVMe devices advertise that they have a specific PCI device class and promise to have a common set of registers, control operations, and so on; as a result, any NVMe device can be driven by a single common driver instead of each company's devices needing their own driver.
(Most PCI and PCIe devices need specific drivers because there's no standard for how they're controlled; each different device has its own unique collection of registers, operations, and so on. This gives us, eg, a zillion different PCI(e) Ethernet device drivers.)
If this was all that NVMe was, it would be kind of boring because it would be restricted to actual PCIe card SSDs and those are never going to be really popular. But NVMe also has a physical standard called U.2 that lets you pull PCIe out over a cable to a conventional-ish SSD drive. This means that you can have a 2.5" form factor SSD mounted somewhere and cabled up that is an NVMe drive and thus is actually a PCIe device on one of your PCIe busses. Assuming everything works and U.2 ports appear in sufficient quantity on motherboards, this seems likely to compete with SATA for connecting SSDs in general, not just in expensive enterprise setups.
(U.2 used to be called SFF-8639 until this month. As you can tell, the ink is barely dry on much of this stuff.)
If I'm reading the tea leaves right, U.2 is somewhat less convenient than ordinary SATA because it requires cables and connectors that are a bit more than twice as big. This is going to impact port density and wiring density, but there are plenty of ordinary machines which have enough motherboard real estate and enough space for cables that this probably isn't a big concern. On the other hand I do expect a bunch of small motherboards and high density servers to deliberately stay with SATA or SAS for the higher achievable port density.
(PCIe and thus NVMe can also be connected up with a less popular connector standard called M.2. This is apparently intended for plugging bare-board SSDs directly into your motherboard instead of cabling things to mounts elsewhere, although I've read some things suggesting it can be coerced into working with cables.)
Does this all matter to ordinary people balancing the SSD inflection point? Maybe. My view is that it does matter in the long term for computer hardware companies, but that's going to take another entry to explain.
The status of our problems with overloaded OmniOS NFS servers
Back at the start of May, we narrowed down our production OmniOS problems to the fact that OmniOS NFS servers have problems with sustained 'too fast' write loads. Since then there have been two pieces of progress and today I feel like writing about them.
The first is that this was identified as a definite Illumos issue. It turns out that Nexenta stumbled over this and fixed it in their own tree in this commit. The commit has since been upstreamed to the Illumos master here (issue) and has made it into the repo for OmniOS r151014 (although I believe it's not yet in a released update). OmniTI's Dan McDonald did the digging to find the Nexenta change after I emailed the OmniOS mailing list and built us a kernel with it patched in that we were able to run in our test environment, where it passed with flying colors. This is clearly our long term solution to the problem.
(In case it's not obvious, Dan McDonald was super helpful to us here, which we're quite grateful for. Practically the moment I sent in my initial email, our problem was on the way to getting solved.)
In the short term we found out that taking a fileserver from 64 GB of RAM to 128 GB of RAM made us no longer able to reproduce the problem in either our test environment or on the production fileserver that was having problems. In addition it appears to make our test fileserver significantly more responsive under heavy load. Currently the production fileserver is running without problems with 128 GB of RAM and 4096 NFS server threads (and an increase in kernel rpcmod parameters to go with it). It's definitely survived getting into memory use situations that we'd have expected to lock it up based on prior experience.
(At the moment we've only upgraded the one problem fileserver to 128 GB and left the others at 64 GB. The others get much less load due to some decisions we made during the migration from the old fileservers to our current ones.)
We still have some other issues with our OmniOS fileservers, but for now the important thing is that we have what seems to be a stable production fileserver environment. After all our problems getting here, that is a very big relief. We can live with 1G Ethernet instead of 10G; we can't live with fileservers that lock up under load.