2024-07-21
Our giant login server: solving resource problems with brute force
One of the moderately peculiar aspects of our environment is that we still have general Unix multiuser systems that people with accounts can log in to and do stuff on. As part of this we have some general purpose login servers, and in particular we have one that's always been the most popular, partly because it was what you got when you did 'ssh cs.toronto.edu'. For years and years we had a succession of load and usage issues on this server, where someone would log in and start doing something that was CPU or memory intensive, which bogged down the machine for everyone on it (generally a lot of people, so this could be pretty visible). We spent a non-trivial amount of time keeping an eye on the machine's load, sending email to people, terminating people's heavy-duty processes, and in a few cases blocking logins from specific people until they paid attention to their email.
Then a few years ago we had a chunk of spare money and decided to spend it on getting rid of the problem once and for all. We did this by buying a ridiculously overpowered server to become the new version of our primary login server, with 512 GB of RAM and 112 CPUs (AMD Epyc 7453s). In fact we bought two at once and put the other one into our SLURM cluster, where back in 2022 it was one of the most powerful compute machines we had.
By itself this wouldn't be sufficient to protect us from having to care about what people were doing on the machine, because (some) modern software can eat any number of CPUs and any amount of RAM that's available (due to things like auto-sizing how many things it does in parallel based on the available CPU count). So we set up per-user CPU and memory resource limits for all users. Because this server is so big, we can actually give people quite large limits; our current settings are 30 GBytes of RAM and 8 CPUs, which is effectively a reasonable desktop machine (we figure people can't really complain at that point).
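To illustrate why auto-sizing software makes the per-user limits necessary, here is a minimal Python sketch. It assumes (this is not stated above) that the CPU limit works as a quota on CPU time, so a program still sees all 112 CPUs when it asks the system how many there are. The function names are hypothetical and exist only for this illustration.

```python
import os


def default_workers(detected_cpus: int) -> int:
    """A common auto-sizing rule: start one worker per CPU the OS reports."""
    return max(1, detected_cpus)


def cpu_share_per_worker(workers: int, cpu_limit: float) -> float:
    """Fraction of one CPU each busy worker gets under a quota-style limit.

    Assumption: the limit caps total CPU time (here, 8 CPUs' worth) rather
    than hiding CPUs, so all workers compete for that capped amount.
    """
    return min(1.0, cpu_limit / workers)


if __name__ == "__main__":
    # On the login server described above the OS would report 112 CPUs;
    # os.cpu_count() shows what your own machine reports (it can return None).
    print("CPUs reported here:", os.cpu_count())

    workers = default_workers(112)
    share = cpu_share_per_worker(workers, cpu_limit=8)
    print(f"{workers} workers, each getting {share:.1%} of a CPU")
```

Without a limit, such a program would happily occupy all 112 CPUs. With the 8 CPU limit it still starts 112 workers under this assumption, but each one gets only about 7% of a CPU, and everyone else on the machine is unaffected.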
(In completely unsurprising news, people do manage to run into the memory limit from time to time and have their giant processes killed.)
These limits don't guarantee that we'll avoid problems, since enough different people doing enough at once could still overload the machine. But this hasn't happened yet, so in practice we've been able to basically stop caring about what people run on our primary login server, and with it we've stopped watching things like its load average and free memory. For people using our primary login server, the benefit is that they can do a lot more than they could before without problems, and they aren't affected by what other people are doing.
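As a rough back-of-the-envelope check on how many people it would take, here is a small Python sketch using the figures above (512 GB of RAM, 112 CPUs, and limits of 30 GBytes and 8 CPUs per user). It deliberately ignores memory used by the system itself and treats GB and GBytes as the same unit, so the results are only approximate. The helper name is hypothetical.

```python
TOTAL_RAM_GB = 512
TOTAL_CPUS = 112
USER_RAM_GB = 30
USER_CPUS = 8


def users_to_exceed(total: int, per_user: int) -> int:
    """Smallest number of users, each at their full limit, whose combined
    demand is strictly greater than the machine's total."""
    return total // per_user + 1


if __name__ == "__main__":
    print("users at full RAM limit to exceed memory:",
          users_to_exceed(TOTAL_RAM_GB, USER_RAM_GB))
    print("users at full CPU limit to exceed CPUs:",
          users_to_exceed(TOTAL_CPUS, USER_CPUS))
```

By this estimate it takes 15 people all using their full 8 CPUs at once, or 18 people all using their full 30 GBytes at once, before demand exceeds what the machine has.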