We need to start getting some experience with using Ubuntu 20.04

October 18, 2020

Under normal circumstances, we would have a decent number of machines running Ubuntu 20.04 by now, probably including our login servers. But the situation is not normal, because ongoing world and local events still have us working from home, making it not so simple to install and deploy a new physical server with a new version of Ubuntu. However, it really looks like this is the new normal, so we should start dealing with it.

It may or may not make sense to spend limited in-office time upgrading perfectly good 18.04 machines to 20.04 (I speculated about this back here in early August), although I suspect we're going to wind up doing some of it. I think it does make sense to install completely new machines with Ubuntu 20.04 so that they have a longer lifetime before needing an upgrade, and we're certainly going to have some of those. We have what I believe is a working 20.04 install system, but what we don't currently have are any continuously running 20.04 machines, especially ones that normal people can use, explore, and see what's broken or odd. In the past, actually operating new versions of Ubuntu has frequently turned up surprises, so the sooner we start doing that the better.

The obvious thing to do is to build a few 20.04 test servers. We're likely going to run Apache on some 20.04 machines, so one test server should have an Apache install. Another one should be a general login server, which would let us look into how various programs that people use behave on 20.04. We should also build a third server that's completely expendable, so we can experiment with rebooting and other things that may blow up. All of these have to be built on physical hardware, since we don't currently have any virtualization environment (and anyway we'd be running most 20.04 machines on physical hardware).

(Running on actual physical hardware has periodically turned up practical problems. Since it's now eight years after that issue, perhaps we should experiment with no longer supplying that kernel parameter in 20.04.)

PS: An expendable test server is where it would be very nice to have some way to roll back the root filesystem to an initial state. This can apparently be done through LVM, which Ubuntu does support for the root filesystem, and I may experiment with it following, e.g., the Arch wiki.
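As a rough sketch of what that might involve (a hypothetical setup, assuming the installer put the root filesystem on an LVM logical volume and left some free space in the volume group; the volume group and LV names here are made up), the starting point would be a snapshot of the freshly installed root filesystem:

    # snapshot the just-installed root LV; the snapshot needs enough
    # space to hold every block that changes before we roll back
    lvcreate --snapshot --name root-pristine --size 20G ubuntu-vg/root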

(This is one of the entries that I write partly to motivate myself to start working on something. We've been mostly ignoring Ubuntu 20.04 so far and it would be very easy to keep on doing so.)


Comments on this page:

By vcarceler at 2020-10-18 12:23:44:

I'm pretty sure it's not your use case for testing servers with Ubuntu 20.04, but at our school we have hundreds of desktop (and laptop) computers with Ubuntu 20.04 booting from OpenZFS.

As you know, you can revert to a previous snapshot from the GRUB menu, and after this experience I have to admit that it works very well.

It's just a crazy idea.

By cks at 2020-10-18 22:57:14:

Since we want to do these reversions to a snapshot while working remotely, it's ideal for us if they don't require things like having to boot a non-default grub menu entry. Apart from other reasons, this makes OpenZFS and ZFS for the root filesystem less attractive here. LVM snapshots can apparently be set up to revert on the next reboot while the system is running, which is definitely appealing.
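As a hedged illustration of that (using the same made-up volume group and LV names as the sketch above), LVM defers merging a snapshot back into an origin that is in use until the origin is next activated, so the actual reversion happens during the following boot:

    # schedule the rollback while the system is up; because
    # ubuntu-vg/root is the mounted root filesystem, the merge is
    # deferred until the LV is next activated, i.e. the next reboot
    lvconvert --merge ubuntu-vg/root-pristine
    reboot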

By Miksa at 2020-10-22 12:44:36:

This was once again a post where I started thinking that your organisation should start investing heavily in virtualization. In the specific case of Ubuntu 20.04 deployment at our university, it went exactly like the previous versions. The deployment server runs as a KVM virtual server, installation is usually done by PXE boot, and the test servers run in VMware ESXi. My coworker did the operation over SSH and the VMware vCenter HTML console to choose the correct PXE menu entry. The only difference from doing it on the office network was an extra remote access jump. I'm not even sure whether the coworker tested it on a physical server; probably only if a blade server was conveniently available. With our standard Dell and HPE servers I would assume Ubuntu just works if we get it working in VMware. We have one Supermicro running Ubuntu, and it had been problematic until newer kernels got it working properly. On RHEL the same SM model worked without issues, but I guess it wasn't officially supported on Ubuntu.

The last time I had this thought was with your post about the "power event". My view is that if your organization doesn't have the resources to properly equip your servers with dual PSUs, with one leg connected to a UPS and the other to the power grid, and with remote KVM or IPMI consoles, then you should minimize the number of physical servers you have and use virtual servers as much as possible. You already have the storage infrastructure you've built, which I think would work as the backend for VMware, so the biggest obstacle is already taken care of. Just three or four well-equipped servers and you can start building virtual servers to your heart's content. I assume you have several extra servers lying around. Configure a couple of NFS shares, and with the free ESXi trial and vCenter you could have a VMware cluster running in a day's work. I don't have experience with shared storage under KVM or Xen, but I assume it would work fine.

Virtual servers are so easy and convenient. At best I can have a new virtual server running in half an hour without leaving my living room. And the HTML5 console in VMware is superior to any of the consoles from Dell, HPE or Supermicro, or the HPE KVMs at our data center. Virtual servers also consume so few resources compared to physical ones that they are practically free. We do have a price catalog with prices for extra cores, RAM, storage space and backup, but the expense is so minor that we don't actually bother to bill any of our clients. We have well over a thousand servers running in our VMware clusters and I can't name a single virtual server that someone has had to pay for. A good example was when some students had created an online game about animal diseases. They had it running in one student's free Microsoft Azure account and asked to have it transferred to the university's Azure account. I think we even had Azure at that point, but it was too much effort to figure out who to contact about it and how to transfer the Azure server. So we just asked their teacher for approval that it was a worthy cause, built a new Ubuntu 14 virtual server, and gave the students sudo.

I think you mentioned in some of your older posts about virtualization that you are worried about putting "all the eggs in one basket", but I would argue that your "power event" post indicates that you don't worry about that at all. Single-PSU servers without a UPS make the power grid exactly such a basket. And the concern is valid. Every now and then an ESXi host encounters a hiccup and takes a hundred virtual servers down with it, but so what, it happens. The virtual servers will be back up and running in five minutes on another host. In general I feel that a large number of users experiencing a short outage is less of an issue than a smaller number of users suffering for hours or days while we try to deal with malfunctioning physical hardware.

In August one of our datacenters suffered a cooling malfunction and the temperature reached around 60°C. This was a problem for the virtual servers pretty much only because our other datacenter didn't have enough VMware capacity to run them. If it had happened just a couple of weeks later, when we got a major VMware expansion online, the virtual servers could have kept running happily without being aware of any issues. After we got the datacenter cooled down and could turn on more ESXi hosts, all the virtual servers started up automatically. For the physical servers in that datacenter, a vMotion migration was not an option.

We experienced our biggest VMware emergency a couple of years ago when we had major problems with our SAN infrastructure. The virtual servers stayed running, but the VMware clusters were so jammed that we weren't able to administer them. The biggest catastrophe happened when one of our most powerful hosts found faulty RAM during this and an ECC error rebooted it. That took down 150-200 virtual servers, including critical authentication servers that we were unable to start for over a day because of the administration problems. But even this wasn't a big enough issue to make us consider expanding the use of physical servers. It mostly showed the need for virtualization infrastructure independent of our SAN, for running cluster members of the most critical virtual servers. Maybe KVM hosts with local drives, or VMware hosts with vSAN.
