It seems routine to see a bunch of browser User-Agents from the same IP

June 19, 2024

The sensible thing to do about the plague of nasty web spiders and other abusers is to ignore it unless it's actually affecting your site. There is always some new SEO optimizer or marketer or (these days) LLM dataset collector crawling your site, and trying to block all of them is a Sisyphean labour. However, I am not necessarily a sensible person, and sometimes I have potentially clever ideas. One of my recent clever ideas was to look for IP addresses that requested content here using several different HTTP 'User-Agent' values, because this is something I see some of the bad web spiders do. Unfortunately, it turns out that this idea doesn't work, at least for me and with my traffic.
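As a rough illustration of the sort of per-IP tally involved, here is a minimal sketch that assumes an Apache/nginx 'combined' format access log; the file name and the threshold of five User-Agents are arbitrary placeholders for the sketch, not anything I actually use.

    #!/usr/bin/env python3
    # Count distinct User-Agent values per client IP in a combined-format
    # access log, and report the IPs using the most of them.
    import sys
    from collections import defaultdict

    agents_by_ip = defaultdict(set)

    with open(sys.argv[1] if len(sys.argv) > 1 else "access.log") as logf:
        for line in logf:
            fields = line.split('"')
            if len(fields) < 6:
                continue              # not a combined-format line
            ip = line.split()[0]      # client IP is the first field
            agents_by_ip[ip].add(fields[5])   # User-Agent is the third quoted field

    for ip, agents in sorted(agents_by_ip.items(), key=lambda kv: -len(kv[1])):
        if len(agents) >= 5:          # arbitrary threshold for this sketch
            print(f"{ip}: {len(agents)} distinct User-Agents")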

Some of the IP addresses that are using multiple User-Agent values are clearly up to no good; for example, right now there appears to be an AWS-based stealth spider crawling Wandering Thoughts with hundreds of different User-Agents from its most prolific AWS IPs. Some of its User-Agent values are sufficiently odd that I suspect it may be randomly assembling some of them from parts (eg, pick a random platform, pick a random 'AppleWebKit/' value, pick a random 'Chrome/' value, pick a random 'Safari/' value, and put them all together regardless of whether any real browser ever used the combination). This crawler also sometimes requests robots.txt, for an extra something. Would it go away if you could figure out the right name to block in robots.txt? Would it go away if you blocked all robots? I am not going to bet on either.
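For illustration only, a random assembly along those lines might look something like the following sketch; all of the platform and version strings here are made up for the example rather than observed values.

    import random

    # Hypothetical parts pools; real browsers only ever shipped particular
    # combinations of these, which is what makes random mixes look odd.
    platforms = ["(Windows NT 10.0; Win64; x64)",
                 "(Macintosh; Intel Mac OS X 10_15_7)",
                 "(X11; Linux x86_64)"]
    webkits = ["537.36", "605.1.15"]
    chromes = ["91.0.4472.114", "103.0.0.0", "117.0.5938.92"]
    safaris = ["537.36", "604.1"]

    def random_agent():
        # Glue independently chosen parts together, regardless of whether
        # any real browser ever used the combination.
        return ("Mozilla/5.0 %s AppleWebKit/%s (KHTML, like Gecko) "
                "Chrome/%s Safari/%s"
                % (random.choice(platforms), random.choice(webkits),
                   random.choice(chromes), random.choice(safaris)))

    print(random_agent())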

Some of the sources of multiple User-Agent values are legitimate robots that are either presenting variants of their normal User-Agent, all of which clearly identify them, or are multiple different robots operated by the same organization (Google has several different robots, it turns out). A few sources appear to have variant versions of their normal User-Agent string; one has its regular (and clear) robot identifier and then a version that puts a ' Bot' on the end. These are small-scale sources, probably running small-scale software.

Syndication feed fetchers (ie, RSS or Atom feed fetchers) are another interesting category. There are a number of feed aggregators pulling various of my syndication feeds on behalf of different people, and they generally put some sort of per-person identifier (or a subscriber count) in their User-Agent, along with their general identification. At the small scale, some people seem to be using more than one feed reader (or other feed-handling program) on their machines, with each program fetching things independently and using its own User-Agent. Some of this could also be several different people behind the same IP, all pulling my feed with different programs.

This is in fact a general thing. If you have multiple different devices at home, all of them behind a single IPv4 address, and you visit Wandering Thoughts from more than one, I will see more than one User-Agent from the same IP. The same obviously happens with larger NAT environments.

An interesting and relatively new category is the Fediverse. When a Fediverse message is posted that includes a URL, many Fediverse servers will fetch the URL in order to generate a link preview for their users. To my surprise, a significant number of these fetches seem to be routed through common front-end IPs. Each Fediverse server uses a User-Agent that identifies it specifically (as well as its Fediverse software), so I see multiple User-Agents from such a front-end. Today, the most active front-end IP seems to have been used by 39 different Mastodon servers. Meanwhile, some of the larger Fediverse servers use multiple IPs for this link preview generation.
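A rough sketch of how you can count distinct Fediverse servers per IP, assuming (as Mastodon's fetcher does) that the server's URL appears in the User-Agent after a '+'; the log parsing is the same combined-format splitting as in the earlier sketch.

    import re
    import sys
    from collections import defaultdict

    # Matches the '+https://example.social/' style server URL that
    # Mastodon and similar software embed in their User-Agent.
    server_re = re.compile(r'\+(https?://[^);\s]+)')
    servers_by_ip = defaultdict(set)

    for line in sys.stdin:
        fields = line.split('"')
        if len(fields) < 6:
            continue
        m = server_re.search(fields[5])
        if m:
            servers_by_ip[line.split()[0]].add(m.group(1))

    for ip, servers in sorted(servers_by_ip.items(), key=lambda kv: -len(kv[1])):
        print(f"{ip}: {len(servers)} Fediverse servers")

Run over a day's log, something like this produces the sort of 'N different servers behind one front-end IP' count mentioned above.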

The upshot of all of this is that looking at IPs that use a lot of different User-Agents is too noisy to be useful for me to identify new spider IPs. Something that shows up with a lot of different User-Agents might be yet another bot IP, or it might be legitimate, and it's too much work to try to tell them apart. Also, at least right now there are a lot of such bot IPs (due to the AWS-hosted crawler).

Oh well, not all clever ideas work out (and sometimes I feel like writing up even negative results, even if they were sort of predictable in hindsight).


Comments on this page:

By Anonymous at 2024-06-20 09:48:21:

The second I read the title of the post, I immediately thought of the legitimate 'NAT environments' scenario (which made me wonder why you apparently didn't, and made me read the post to figure out why that would be so).

By cks at 2024-06-20 10:26:33:

I left it out of the entry (partly by accident), but when I started I expected to see a number of IPs that were using two or three User-Agent values due to consumer NAT. I was hoping that there wouldn't be too many of them and that most or all of the ones using more than a couple of User-Agents would be bad, but that's clearly not the case (even once I exclude syndication feed fetches).

The other legitimate case that I didn't mention in the entry is people using Tor, since that aggregates traffic through Tor exit nodes.

By Jonathan at 2024-06-22 07:25:13:

Instead of using just the IP address to identify the source, what if you use the IP address plus the source's outbound port number?

That would let you more narrowly identify the source process, even behind the source's router, firewall, or NAT device.

Typically the source's device assigns a port number per connection, and a client process tends to reuse its connection (and thus its port) across multiple http(s) requests. Different client processes get different outbound port numbers. The device sends you the outbound port number in the TCP header of each packet.

So, for example, two different browsers or spiders from the same source have the same IP address but different outbound port numbers.

Using the IP address + outbound port number would let you detect the suspicious case where a single client process is sending multiple http(s) requests with different User Agent strings. If you detect that, the client is highly likely to be a spider.

I can imagine scenarios where you receive requests from a spider using different User Agents from the same IP address with different outbound port numbers. In that case, you wouldn't be able to use this technique to detect that it's a spider.

But if you receive different User Agents from the same IP address and the same outbound port number, then you can be pretty confident that you've detected a spider.
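For illustration, here is a minimal sketch of the grouping Jonathan describes. It assumes the web server has been configured to log the client's source port as the second field, which is not the default (in nginx, for example, that would mean adding $remote_port to the log_format); everything else is the same combined-format parsing as above.

    import sys
    from collections import defaultdict

    agents_by_src = defaultdict(set)

    for line in sys.stdin:
        parts = line.split()
        fields = line.split('"')
        if len(parts) < 2 or len(fields) < 6:
            continue
        # parts[0] is the client IP, parts[1] the client source port,
        # assuming a '$remote_addr $remote_port ...' style log format.
        agents_by_src[(parts[0], parts[1])].add(fields[5])

    for (ip, port), agents in agents_by_src.items():
        if len(agents) > 1:
            print(f"{ip}:{port} sent {len(agents)} different User-Agents")

In practice this mostly catches clients that switch User-Agents while reusing a single keep-alive connection, since a new TCP connection normally gets a new source port.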

By Nick at 2024-06-22 09:02:24:

A few of those requests will be from me! My "User-Agent Switcher" extension sends a randomly chosen user agent string with each request. Just an attempt to make life more difficult for spyware, which might or might not make any difference.

By Jonathan at 2024-06-30 19:39:46:

Thanks, @Nick. That's an interesting approach -- although (as you said) I'm not sure how much difference it'd make.

I do know that some sysadmins work doggedly to kill bot requests to their sites. Perhaps the technique of excluding multiple User Agents from the same IP address+port number (at least for a period of time) would help them.
