2018-04-30
You probably need to think about how to handle core dumps on modern Linux servers
Once upon a time, life was simple. If and when your programs hit fatal problems, they generally dumped core in their current directory under the name core (sometimes you could make them be core.<PID>). You might or might not ever notice these core files, and some of the time they might not get written at all because of various permissions issues (see the core(5) manpage).
Then complications ensued in the form of things like Apport, ABRT, and systemd-coredump, as an increasing number of Linux distributions decided to take advantage of the full power of the kernel.core_pattern sysctl to capture core dumps themselves.
(The Ubuntu Apport documentation claims that it's disabled by default on 'stable' releases. This does not appear to be true any more.)
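As a quick check of what a given machine is currently doing with core dumps, the active pattern is visible in /proc. Here's a minimal sketch in Python that reports it; per core(5), a pattern that starts with '|' means the kernel pipes core dumps to a helper program (Apport, ABRT, systemd-coredump, and so on) instead of writing a core file.

```python
#!/usr/bin/env python3
# Report how this machine currently handles core dumps, based on the
# kernel.core_pattern sysctl (see core(5) for the full details).

def describe_core_pattern(path="/proc/sys/kernel/core_pattern"):
    with open(path) as f:
        pattern = f.read().strip()
    if pattern.startswith("|"):
        # A leading '|' means the kernel pipes core dumps to a helper
        # program (Apport, ABRT, systemd-coredump, ...), not to a file.
        return "core dumps are piped to: " + pattern[1:]
    return "core dumps are written to files named: " + pattern

if __name__ == "__main__":
    print(describe_core_pattern())
```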
In a perfect world, systems like Apport would capture core dumps
from system programs for themselves and arrange that everything
else was handled in the traditional way, by writing a core
file.
Unfortunately this is not a perfect world. In this world, systems
like Apport almost always either discard your core files entirely
or hide them away where you need special expertise to find them.
In many situations this may not be what you want, in which case you need to think about what you do want and the best way to get it.
I think that your options break down like this:
- If you're only running distribution-provided programs, you can
opt to leave Apport and its kin intact. Intercepting and magically
handling core dumps from standard programs is their bread and butter,
and the result will probably give you the smoothest way to file bug
reports with your distribution. Since you're not running your own
programs, you don't care about how Apport (doesn't) handle core dumps
for non-system programs.
- Disable any such system and set kernel.core_pattern to something useful; I like 'core.%u.%p'. If the system only runs your services, with no users having access to it, you might want to have all core dumps written to some central directory that you monitor; otherwise, you probably want to set it so that core dumps go in the process's current directory. The drawback of this straightforward approach is that you'll fail to capture core dumps from some processes (for example, daemons whose current directory is / or somewhere else they can't write to).
- Set up your own program to capture core dumps and save them
somewhere. The advantage of such a program is that you can capture
core dumps under more circumstances and also that you can immediately
trigger alerting and other things if particular programs or
processes die. You could even identify when you have a core dump
for a system program and pass the core dump on to Apport,
systemd-coredump, or whatever the distribution's native system is.
One drawback of this is that if you're not careful, your core dump handler can hang your system. (There's a minimal sketch of such a handler right after this list.)
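To make that last option a bit more concrete, here's a minimal sketch in Python of what such a capture program could look like. This is purely an illustration: the /var/cores spool directory and the /usr/local/sbin/core-catcher path are made-up names, and a real handler would want error handling, rate limiting, and free disk space checks. You would wire it up by setting kernel.core_pattern to something like '|/usr/local/sbin/core-catcher %p %u %e', at which point the kernel runs it with the core dump on standard input (again, see core(5)).

```python
#!/usr/bin/env python3
# A sketch of a core dump catcher run via kernel.core_pattern as
# '|/usr/local/sbin/core-catcher %p %u %e'. The kernel feeds us the
# core dump on standard input and passes the PID, UID, and executable
# name as arguments; /var/cores is a made-up spool directory.
import sys
import time

SPOOL = "/var/cores"

def main():
    pid, uid, exe = sys.argv[1], sys.argv[2], sys.argv[3]
    # Name the saved core after what died, who was running it, and when.
    dest = "%s/core.%s.%s.%s.%d" % (SPOOL, exe, uid, pid, int(time.time()))
    with open(dest, "wb") as out:
        # Copy the core from the kernel in chunks instead of trying to
        # hold a potentially very large core dump in memory.
        while True:
            chunk = sys.stdin.buffer.read(1024 * 1024)
            if not chunk:
                break
            out.write(chunk)
    # This is the natural place to trigger alerts when particular
    # programs die, or to hand system programs' cores off to Apport,
    # systemd-coredump, or whatever the distribution's native system is.

if __name__ == "__main__":
    main()
```

Because the kernel runs this for every crashing process, a handler that wedges or runs slowly is exactly how you get the 'hang your system' problem mentioned above.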
If you have general people running things on your servers and those things may run into segfaults and otherwise dump core, it's my view that you probably want to do the middle option of just having them write traditional core files to the current directory. People doing development tend to like having core files for debugging, and this option is likely to be a lot easier than trying to educate everyone on how to extract core dumps from the depths of the system (if this is even possible; it's theoretically possible with systemd at least).
Up until now we've just passively accepted Apport's default behavior on our Ubuntu 16.04 systems. But now that we're considering what we want to change for Ubuntu 18.04, and now that I've been reminded of this whole issue by Julia Evans' How to get a core dump for a segfault on Linux (where she ran into the Apport issue), I think we want to change to the traditional 'write a core file' setup (which is how things were in Ubuntu 14.04).
(Also, Apport has had its share of security issues over the years, eg 1, 2.)
PS: Since systemd now wants to handle core dumps, I suspect that this is going to be an issue in more and more Linux distributions. Or maybe everyone is going to make sure that that part of systemd doesn't get turned on.
Microsoft's Bingbot crawler is on a relative rampage here
For some time, people in various places have been reporting that Microsoft Bing's web crawler is hammering them; for example, Discourse has throttled Bingbot (via). It turns out that Wandering Thoughts is no exception, so I thought I'd generate some numbers on what I'm seeing.
Over the past 11 days (including today), Bingbot has made 40998 requests, amounting to 18% of all requests. In that time it's asked for only 14958 different URLs. Obviously many pages have been requested multiple times, including pages with no changes; the most popular unchanging page was requested almost 600 times. Quite a lot of unchanging pages have been requested several times over this interval (which isn't surprising, since most pages here change only very rarely).
Over this time, Bingbot has been the single largest source by user-agent (second place is claimed by a bot that is completely banned; after that come some syndication feed fetchers). For scale, Googlebot has made only 2,800 requests over the past 11 days.
Traffic fluctuates from day to day, but there's clearly a steady ongoing volume. Going backward from today, the last 11 days saw 5154 requests, then 2394, 2664, 3855, 1540, 2021, 3265, 7575, 2516, 3592, and finally 6432 requests.
As far as bytes transferred go, Bingbot came in at 119.8 Mbytes over those 11 days. Per day volume is 14.9 Mbytes, then 6.9, 7.3, 11.5, 4.6, 5.8, 8.8, 22.9, 6.7, 10.8, and finally 19.4 Mbytes. On the one hand, the total Bingbot volume by bytes is only 1.5% of my total traffic. On the other hand, syndication feed fetches are about 94% of my volume and if you ignore them and look only at the volume from regular web pages, Bingbot jumps up to 26.9% of the total bytes.
I think that all of this crawling is excessive. It's one thing to want current information; it's another thing to be hammering unchanging pages over and over again. Google has worked out how to get current information with far fewer repeat visits to fewer pages (in part by pulling my syndication feed, presumably using it to drive further crawling). The difference between Google and Bing is especially striking considering that far more people seem to come to Wandering Thoughts from Google searches than come from Bing ones.
(Of course, people coming from Bing could be hiding their Referers far more than people coming from Google do, but I'm not sure I consider that very likely.)
I'm not going to ban Bing(bot), but I certainly do wish I had a useful way to answer their requests very, very slowly, both to discourage them from visiting so much and to push them to be smarter about what they do visit.
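(If I ever did experiment with that, one crude approach would be to delay responses at the application level. Here's a minimal sketch of that idea as Python WSGI middleware; the one-second default delay and the simple User-Agent match are arbitrary illustrative choices, not something I actually run.)

```python
import time

# Minimal sketch of WSGI middleware that slows down responses to
# Bingbot. The default one-second delay and the plain User-Agent
# substring match are illustrative choices, not a tested setup.
class SlowBingbot:
    def __init__(self, app, delay=1.0):
        self.app = app
        self.delay = delay

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if "bingbot" in ua.lower():
            time.sleep(self.delay)
        return self.app(environ, start_response)
```

You'd wrap your WSGI application in it, for example 'application = SlowBingbot(application, delay=2.0)'.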