It seems routine to see a bunch of browser User-Agents from the same IP

June 19, 2024

The sensible thing to do about the plague of nasty web spiders and other abusers is to ignore it unless it's actually affecting your site. There is always some new SEO optimizer or marketer or (these days) LLM dataset collector crawling your site, and trying to block all of them is a Sisyphean labour. However, I am not necessarily a sensible person, and sometimes I have potentially clever ideas. One of my recent clever ideas was to look for IP addresses that requested content here using several different HTTP 'User-Agent' values, because this is something I see some of the bad web spiders do. Unfortunately it turns out that this idea does not work out, at least for me and based on my traffic.
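A minimal sketch of this sort of check, assuming the web server writes Apache-style "combined" format access logs (the post does not say what log format is actually involved). The names `LOG_RE`, `user_agents_by_ip` and `multi_ua_ips` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: Apache/nginx "combined" log format, where the User-Agent is
# the last quoted field. Lines with escaped quotes inside a field will not
# parse cleanly with this simple pattern and are skipped.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def user_agents_by_ip(lines):
    """Map each client IP to the set of distinct User-Agent strings it sent."""
    agents = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m:  # silently skip lines that do not match the assumed format
            agents[m.group("ip")].add(m.group("ua"))
    return agents

def multi_ua_ips(lines, min_agents=2):
    """Return (ip, count) pairs for IPs with at least min_agents distinct
    User-Agents, most prolific first."""
    agents = user_agents_by_ip(lines)
    hits = [(ip, len(uas)) for ip, uas in agents.items() if len(uas) >= min_agents]
    return sorted(hits, key=lambda t: (-t[1], t[0]))
```

Feeding this an open log file (`multi_ua_ips(open(path))`, for whatever `path` your logs live at) gives a list of candidate IPs to look at by hand. It only counts; it makes no attempt to judge whether a given IP is good or bad.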

Some of the IP addresses that are using multiple User-Agents are clearly up to no good; for example, right now there appears to be an AWS-based stealth spider crawling Wandering Thoughts with hundreds of different User-Agents from its most prolific AWS IPs. Some of its User-Agent values are sufficiently odd that I suspect it may be randomly assembling some of them from parts (eg, pick a random platform, pick a random 'AppleWebKit/' value, pick a random 'Chrome/' value, pick a random 'Safari/' value, and put them all together regardless of whether any real browser ever used the combination). This crawler also sometimes requests robots.txt, for extra something. Would it go away if you got the right name for it? Would it go away if you blocked all robots? I am not going to bet on either.

Some of the sources of multiple User-Agent values are legitimate robots that are either presenting variants of their normal User-Agent, all of which clearly identify them, or are multiple different robots operated by the same organization (Google has several different robots, it turns out). A few sources appear to have variant versions of their normal User-Agent string; one has its regular (and clear) robot identifier and then a version that puts a ' Bot' on the end. These are small scale sources and probably small scale software.

Syndication feed fetchers (ie, RSS or Atom feed fetchers) are another interesting category. There are a number of feed aggregators pulling various of my syndication feeds for different people, and they put some sort of identifier for this (or a count of subscribers) in their User-Agent, along with their general identification. At the small scale, some people seem to be using more than one feed reader (or feed-involved thing) on their machines, with each program fetching things independently and using its own User-Agent. Some of this could also be several different people behind the same IP, all pulling my feed with different programs.

This is in fact a general thing. If you have multiple different devices at home, all of them behind a single IPv4 address, and you visit Wandering Thoughts from more than one, I will see more than one User-Agent from the same IP. The same obviously happens with larger NAT environments.

An interesting and relatively new category is the Fediverse. When a Fediverse message is posted that includes a URL, many Fediverse servers will fetch the URL in order to generate a link preview for their users. To my surprise, a large number of these fetches seem to be routed through common front-end IPs. Each Fediverse server is using a User-Agent that identifies it specifically (as well as its Fediverse software), so I see multiple User-Agents from each such front end. Today, the most active front-end IP seems to have been used by 39 different Mastodon servers. Meanwhile, some of the larger Fediverse servers use multiple IPs for this link preview generation.
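One way to count how many distinct Fediverse servers sit behind a single IP is to pull the server's hostname out of each User-Agent. This is only a sketch, and it assumes the User-Agent embeds the server's URL as '+https://host/' (which is how Mastodon's User-Agents generally look; other Fediverse software may differ). The function name `fediverse_servers` and the example hostnames are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumption: the identifying URL appears as "+https://host/..." somewhere
# in the User-Agent, for example
#   http.rb/5.1.1 (Mastodon/4.2.9; +https://social.example/)
URL_RE = re.compile(r'\+(https?://[^\s;)]+)')

def fediverse_servers(user_agents):
    """Return the set of hostnames named in the given User-Agent strings."""
    hosts = set()
    for ua in user_agents:
        m = URL_RE.search(ua)
        if m:
            host = urlsplit(m.group(1)).hostname
            if host:
                hosts.add(host)
    return hosts
```

For example, `len(fediverse_servers(uas))`, where `uas` is the set of User-Agents seen from one front-end IP, gives the number of distinct servers that identified themselves this way. User-Agents without such a URL are simply ignored.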

The upshot of all of this is that looking at IPs that use a lot of different User-Agents is too noisy to be useful for me to identify new spider IPs. Something that shows up with a lot of different User-Agents might be yet another bot IP, or it might be legitimate, and it's too much work to try to tell them apart. Also, at least right now there are a lot of such bot IPs (due to the AWS-hosted crawler).

Oh well, not all clever ideas work out (and sometimes I feel like writing up even negative results, even if they were sort of predictable in hindsight).


Last modified: Wed Jun 19 22:54:36 2024