Wandering Thoughts


Go's proposed try() will be used and that will change how code is written

One of the recent commotions in Go is over the Go team's proposed try() built-in error check function, which is currently planned to be part of Go 1.14 (cf). To simplify, 'a, [...] := try(f(...))' can be used to replace what you would today have to write as:

a, [...], err := f(...)
if err != nil {
   return [...], err
}

Using try() lets you drop that standard if block and makes your function clearer; much more of the code that remains is relevant and important.

Try() is attractive and will definitely be used in Go code, probably widely, and especially by people who are new to Go and writing more casual code. However, this widespread use of try() is going to change how Go code is written.

One of my firm beliefs is that most programmers are strongly driven to do what their languages make easy, and I don't think try() is any exception (I had similar thoughts about the original error handling proposal). What try() does is make returning an unchanged error the easiest thing to do. You can wrap the error from f() with more context if you work harder, but the easiest path is to not wrap it at all. This is a significant change from the current state of Go, where wrapping an error is very easy and needs almost no change to the boilerplate code:

a, [...], err := f(...)
if err != nil {
   return [...], fmt.Errorf("....", ..., err)
}

In a try() world, adding that error wrapping means adding those three lines of near boilerplate back in. As a result, I think that once try() is introduced, Go code will see a significantly increased use of errors being returned unchanged and unwrapped from deep in the call stack. Sure, it's not perfect, but programmers are very good at convincing themselves that it's good enough. I'm sure that I'll do it myself.

This change isn't necessarily bad by itself, but it does directly push against the Go team's efforts to put more context into error values, an effort that actually landed changes in the forthcoming Go 1.13 (see also the draft design). It's possible to combine try() and better errors in clever ways, as shown by How to use 'try', but it's not the obvious, easy path, and I don't think it's going to be a common way to use try().

I am neither for nor against try() at the moment, because I think that being for or against it in isolation is asking the wrong question. The important question is how Go wants errors to work, and right now the Go team does not seem to have made up its mind. If the Go team decides that errors should frequently be wrapped on their way up the call stack, I believe that try() in its current form is a bad idea.

(If the Go team thinks that they can have both try() in its current form and people routinely wrapping errors, I think that they are fooling themselves. try() will be used in the easiest way to use it, because that's what people do.)

PS: While Go culture is currently relatively in favour of wrapping errors with additional information, I don't think that this culture will survive the temptation of try(). You can't persuade people to regularly do things the hard way for very long.

programming/GoTryWillBeUsedSimply written at 23:36:22; Add Comment


ZFS on Linux still has annoying issues with ARC size

I'll start with my tweet:

One of the frustrating things about operating ZFS on Linux is that the ARC size is critical but ZFS's auto-tuning of it is opaque and apparently prone to malfunctions, where your ARC will mysteriously shrink drastically and then stick there.

Linux's regular filesystem disk cache is very predictable; if you do disk IO, the cache will relentlessly grow to use all of your free memory. This sometimes disconcerts people when free reports that there's very little memory actually free, but at least you're getting value from your RAM. This is so reliable and regular that we generally don't think about 'is my system going to use all of my RAM as a disk cache', because the answer is always 'yes'.

(The general filesystem cache is also called the page cache.)

This is unfortunately not the case with the ZFS ARC in ZFS on Linux (and it wasn't necessarily the case even on Solaris). ZFS has both a current size and a 'target size' for the ARC (called 'c' in ZFS statistics). When your system boots this target size starts out as the maximum allowed size for the ARC, but various events afterward can cause it to be reduced (which obviously limits the size of your ARC, since that's its purpose). In practice, this reduction in the target size is both pretty sticky and rather mysterious (as ZFS on Linux doesn't currently expose enough statistics to tell why your ARC target size shrunk in any particular case).

The net effect is that the ZFS ARC is not infrequently quite shy and hesitant about using memory, in stark contrast to Linux's normal filesystem cache. The default maximum ARC size starts out as only half of your RAM (unlike the regular filesystem cache, which will use all of it), and then it shrinks from there, sometimes very significantly, and once shrunk it only recovers slowly (if at all).

This sounds theoretical, so let me make it practical. We have six production ZFS on Linux NFS fileservers, all with 196 GB of RAM and a manually set ARC maximum size of 155 GB. At the moment their ARC sizes range from 117 GB to 145 GB; specifically, 117 GB, 127 GB, three at 132 GB, and 145 GB. On top of this, the fileserver at 117 GB of ARC is a very active one with some very popular and big filesystems (such as our mail spool, which is perennially the most active filesystem we have). Even if we're still getting a good ARC hit rate during active periods, I'm pretty sure that we could get some use out of caching more ZFS data in RAM than it currently does.

(We don't currently have ongoing ARC stats for our fileservers, so I don't know what the ARC hit rate is or why ARC misses happen (cf).)

Part of the problem here is not just that the ARC target size shrinks; it's that you can't tell why, and there aren't really any straightforward and reliable ways to tell ZFS to reset it. And since you can't tell why the ARC target size shrunk, you can't tell if ZFS actually had a good reason for shrinking the ARC. The auto-sizing is great when it works but very opaque when it doesn't, and you can't tell the difference.
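On ZFS on Linux, both the current size and the target size are exposed in /proc/spl/kstat/zfs/arcstats, so watching how far 'c' has fallen below 'c_max' is at least easy to script. Here is a minimal Go sketch of extracting values from that kstat text format; the sample data is made up to match the format, not taken from a real fileserver:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// An abbreviated, made-up sample of the kstat text that ZFS on Linux
// exposes in /proc/spl/kstat/zfs/arcstats; the real file has many
// more lines.
var sample = `13 1 0x01 96 4608 3400963977 1151824014764
name                            type data
size                            4    125690684928
c                               4    133143986176
c_max                           4    166430982144
`

// parseArcstats pulls the named values out of kstat text, which is
// "name type data" per line after a two-line header.
func parseArcstats(text string, names ...string) map[string]uint64 {
	want := make(map[string]bool, len(names))
	for _, n := range names {
		want[n] = true
	}
	out := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) == 3 && want[f[0]] {
			if v, err := strconv.ParseUint(f[2], 10, 64); err == nil {
				out[f[0]] = v
			}
		}
	}
	return out
}

func main() {
	// In real use you would read /proc/spl/kstat/zfs/arcstats.
	v := parseArcstats(sample, "size", "c", "c_max")
	fmt.Printf("ARC size %d, target (c) %d, maximum %d\n", v["size"], v["c"], v["c_max"])
}
```

Tracking 'c' over time tells you when a shrink happened, although, as said above, not why.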

PS: Several years ago, I saw memory competition between the ARC and the page cache on my workstation, but then the issue went away. I don't think our fileserver ARC issues are due to page cache contention, partly because the entire ext4 root filesystem on them is only around 20 GB. Even if all of it is completely cached in RAM, there's a bunch of ARC shrinkage that's left unaccounted for. Similarly, the sum of smem's PSS for all user processes is only a gigabyte or two. There just isn't very much happening on these machines.

PPS: This is with an older version of ZFS on Linux, but my office workstation with a bleeding edge ZoL doesn't do any better (in fact it does worse, with periodic drastic ARC collapses).

linux/ZFSOnLinuxARCShrinkage written at 22:49:23; Add Comment


We're going to be separating our redundant resolving DNS servers

We have a number of OpenBSD machines in various roles; they're our firewalls, our resolving DNS servers as well as our public authoritative DNS server, and so on. For pretty much all of these, we actually have two identical servers per role in a hot spare setup, so that we can rapidly recover from various sorts of failures. For our firewalls, switching from one to another takes manual action (we have to change which one is plugged into the live network, although their firewall state is synchronized with pfsync so that a switch is low impact). For our DNS resolvers, we have both on the network and list both addresses in our /etc/resolv.conf, because this works perfectly fine with DNS servers.

(All of our machines list the same resolver first, which we consider a feature for reasons beyond the scope of this entry. Our routing firewalls don't use CARP for various reasons, some of them historical, but in practice it doesn't matter, as we haven't had a firewall hardware failure. When we have switched firewalls, it's been for software reasons.)
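Concretely, this means every client machine has a resolv.conf along these lines (the addresses and search domain here are made-up placeholders, not our real ones):

```
# /etc/resolv.conf: both resolvers listed, same one first everywhere.
search cs.example.edu
nameserver 10.0.0.53
nameserver 10.0.1.53
```

The resolver library tries the first nameserver and only falls back to the second on timeout, which is why losing the primary is survivable but not painless.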

All of this sounds great, except for the bit where I haven't mentioned that these redundant resolving DNS servers are racked next to each other (one on top of the other), plugged into the same rack PDU, and connected to the same leaf switch. We have great protection against server failure, which is what we designed for, but after we discovered that switches can wind up in weird states after power failures it no longer feels quite so sufficient, since working DNS is a crucial component of our environment (as we found out in an earlier power failure).

(Most of our paired redundant servers are racked up this way because it's the most convenient option. They're installed at the same time, generally worked on at the same time, and they need the same network connections. For firewalls, in fact, you need to switch their network cables back and forth to change which is the live one.)

So, as the title of this entry says, we're now going to be separating our resolving DNS servers, both physically and for their network connection, so that the failure of a single rack PDU or leaf switch can't take both of them offline. Unfortunately we can't put one DNS server directly on the same switch as our fileservers; the fileserver switch is a 10G-T switch with a very limited supply of ports.

(Now that I write this entry the obvious question is whether all of our fileservers should be on the same 10G-T switch. Probably it's harmless, because our entire environment will grind to a halt if even a single fileserver drops off the network.)

PS: I suspect that our resolving DNS servers are the only redundant pair that are important to separate this way, but it's clearly something we should think about. We could at least add some extra redundancy for our VPN servers by separating the pairs, and that might be important during a serious problem.

sysadmin/SeparatingOurDNSResolvers written at 22:21:22; Add Comment


Our switches can wind up in weird states after a power failure

We've had two power failures so far this year, which is two more than we usually have. Each has been a learning experience, because both times around our overall environment failed to come back up afterward. The first time around the problem was DNS, due to a circular dependency that we still don't fully understand. The second time around, what failed was much more interesting.

Three things failed to come back up after the second power failure. The more understandable and less fatal problem was that our OpenBSD external bridging firewall needed some manual attention to deal with a fsck issue. By itself this just cut us off from the external world. Much worse, two of our core switches didn't fully boot up; instead, they stopped in their bootloader and waited for someone to tell them to continue. Since the switches didn't boot and apply their configuration, they didn't light up their ports and none of our leaf switches could pass traffic around. The net effect was to create little isolated pools of machines, one pool per leaf switch.

(Then naturally most of these pools didn't have access to our DNS servers, so we also had DNS problems. It's always DNS. But no one would have gotten very far even with DNS, because all of our fileservers were isolated on their own little pool on a 10G-T switch.)

We've never seen this happen before (and certainly it didn't happen in prior power outages and scheduled shutdowns), so we've naturally theorized that the power failure wasn't a clean one (either during the loss of power or when it came back) and this did something unusual to the switches. It's more comforting to think that something exceptional happened than that this is a possibility that's always lurking there even in clean power loss and power return situations.

(While we shut down all of our Unix servers in advance for scheduled power shutdowns, we've traditionally left all of our switches powered on and just assumed that they'd come back cleanly afterward. We probably won't change that for the next scheduled power shutdown, but we may start explicitly checking that the core switches are working right before we start bringing servers up the next day.)

That we'd never seen this switch behavior before also complicated our recovery efforts, because we initially didn't recognize what had gone wrong with the switches or even what the problem with our network was. Even once my co-worker recognized that something was anomalous about the switches, it took a bit of time to figure out what the right step to resolve it was (in this case, to tell the switch bootloader to go ahead and boot the main OS).

(The good news is that the next time around we'll be better prepared. We have a console server that we access the switch consoles through, and it supports informational banners when you connect to a particular serial console. The consoles for the switches now have a little banner to the effect of 'if you see this prompt from the switch it's stuck in the bootloader, do the following'.)

PS: What's likely booting here is the switch's management processor. But the actual switching hardware has to be configured by the management processor before it lights up the ports and does anything, so we might as well talk about 'the switch booting up'.

sysadmin/SwitchesAndPowerGlitch written at 23:58:51; Add Comment


Browsers can't feasibly stop web pages from talking to private (local) IP addresses

I recently read Jeff Johnson's A problem worse than Zoom (via), in which Johnson says:

[...] The major browsers I've tested — Safari, Chrome, Firefox — all allow web pages to send requests not only to localhost but also to any IP address on your Local Area Network! Can you believe that? I'm both astonished and horrified.

(Johnson mostly means things with private IP addresses, which is the only sense of 'on your local and private network' that can be usefully determined.)

This is a tempting and natural viewpoint, but unfortunately this can't be done in practice without breaking things. To understand this, I'll outline a series of approaches and then explain why they fail or cause problems.

To start with, a browser can't refuse to connect to private IP addresses unless the URL was typed in the URL bar because there are plenty of organizations that use private IP addresses for their internal web sites. Their websites may link to each other, load resources from each other, put each other in iframes, and in general do anything you don't want an outside website to do to your local network, and it is far too late to tell everyone that they can't do this all of a sudden.

It's not sufficient for a browser to just block access to URLs that use explicit private IP addresses. If you control a domain name, you can make hosts in it resolve to arbitrary IP addresses, including private IP addresses. Some DNS resolvers will screen these out except for 'internal' domains where you've pre-approved them, but a browser can't assume that it's always going to be behind such a DNS resolver.

(Nor can the browser implement such a resolver itself, because it doesn't know what the valid internal domains even are.)
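The check itself is the easy part, which is part of what makes this viewpoint so tempting; Go's standard library even provides it directly via net.IP.IsPrivate (since Go 1.17). Here is a sketch, with isInternal being my name for the sort of naive test a browser might make:

```go
package main

import (
	"fmt"
	"net"
)

// isInternal is the naive check: is this resolved address private
// (RFC 1918 / ULA), loopback, or link-local? The entry's point is
// that this test alone can't distinguish an attack from a
// legitimate internal website.
func isInternal(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	return ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast()
}

func main() {
	for _, a := range []string{"192.168.1.10", "127.0.0.1", "8.8.8.8"} {
		fmt.Printf("%s internal: %v\n", a, isInternal(a))
	}
}
```

The problem is not writing this check; it's that, as the rest of this entry walks through, every blocking policy you can build on top of it breaks something legitimate.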

To avoid this sort of DNS injection, let's say that the browser will only accept private IP addresses if they're the result of looking up hosts in top level domains that don't actually exist. If the browser looks up 'nasty.evil.com' and gets a private IP address, it's discarded; the browser only accepts it if it comes from 'good.nosuchtld'. Unfortunately for this idea, various organizations like to put their internal web sites into private subdomains under their normal domain name, like '<host>.corp.us.com' or '<host>.internal.whoever.net'. Among other reasons to do this, this avoids problems when your private top level domain turns into a real top level domain.

So let's use a security zone model. The browser will divide websites and URLs into 'inside' and 'outside' zones, based on what IP address the URL is loaded from (something that the browser necessarily knows at the time it fetches the contents). An 'inside' page or resource may refer to outside things and include outside links, but an outside page or resource cannot do this with inside resources; going outside is a one-way gate. This looks like it will keep internal organizational websites on private IP addresses working, no matter what DNS names they use. (Let's generously assume that the browser manages to get all of this right and there are no tricky cases that slip by.)

Unfortunately this isn't sufficient to keep places like us working. We have a 'split horizon' DNS setup, where the same DNS name resolves to different IP addresses depending on whether you're inside or outside our network perimeter, and we also have a number of public websites that actually live in private IP address space but that are NAT'd to public IPs by our external firewall. These websites are publicly accessible, get linked to by outside things, and may even have their resources loaded by outside public websites, but if you're inside our network perimeter and you look up their name, you get a private IP address and you have to use this IP address to talk to them. This is exactly an 'outside' host referring to an 'inside' resource, which would be blocked by the security zone model.

If browsers were starting from scratch today, there would probably be a lot of things done differently (hopefully more securely). But they aren't, and so we're pretty much stuck with this situation.

web/BrowsersAndLocalIPs written at 21:49:48; Add Comment

Reflections on almost entirely stopping using my (work) Yubikey

Several years ago (back in 2016), work got Yubikeys for a number of us for reasons beyond the scope of this entry. I got designated as the person to figure out how to work with them, and in my usual way with new shiny things, I started using my Yubikey's SSH key for lots of additional things over and above their initial purpose (and I added things to my environment to make that work well). For a long time since then, I've had a routine of plugging my Yubikey in when I got in to work, before I unlocked my screen the first time. The last time I did that was almost exactly a week ago. At first, I just forgot to plug in the Yubikey when I got in and didn't notice all day. But after I noticed that had happened, I decided that I was more or less done with the whole thing. I'm not throwing the Yubikey away (I still need it for some things), but the days when I defaulted to authenticating SSH with the Yubikey SSH key are over. In fact, I should probably go through and take that key out of various authorized_keys files.

The direct trigger for not needing the Yubikey as much any more and walking away from it is that I used it to authenticate to our OmniOS fileservers, and we took the last one out of service a few weeks ago. But my dissatisfaction has been building for some time for an assortment of reasons. Certainly one part of it is that the big Yubikey security issue significantly dented my trust in the whole security magic of a hardware key, since using a Yubikey actually made me more vulnerable instead of less (well, theoretically more vulnerable).

Another part of it is that for whatever reason, every so often the Fedora SSH agent and the Yubikey would stop talking to each other. When this happened various things would start failing and I would have to manually reset everything, which obviously made relying on Yubikey based SSH authentication far from the transparent experience of things just working that I wanted. At some points, I adopted a ritual of locking and then un-locking my screen before I did anything that I knew required the Yubikey.

Another surprising factor is that I had to change where I plugged in my Yubikey, and the new location made it less convenient. When I first started using my Yubikey I could plug it directly into my keyboard at the time, in a position that made it very easy to see it blinking when it was asking for me to touch it to authenticate something. However I wound up having to replace that keyboard (cf) and my new keyboard has no USB ports, so now I have to plug the Yubikey into the USB port at the edge of one of my Dell monitors. This is more awkward to do, harder to reach and touch the Yubikey's touchpad, and harder to even see it blinking. The shift in where I had to plug it in made everything about dealing with the Yubikey just a bit more annoying, and some bits much more annoying.

(I have a few places where I currently use a touch authenticated SSH key, and these days they almost always require two attempts, with a Yubikey reset in the middle because one of the reliable ways to have the SSH agent stop talking to the Yubikey is not to complete the touch authentication stuff in time. You can imagine how enthused I am about this.)

On the whole, the most important factor has been that using the Yubikey for anything has increasingly felt like a series of hassles. I think Yubikeys are still reasonably secure (although I'm less confident and trusting of them than I used to be), but I'm no longer interested in dealing with the problems of using one unless I absolutely have to. Nifty shiny things are nice when they work transparently; they are not so nice when they don't, and it has surprised me how little it took to tip me over that particular edge.

(It's also surprised me how much happier I feel after having made the decision and carrying it out. There's all sorts of things I don't have to do and deal with and worry about any more, at least until the next occasion when I really need the Yubikey for something.)

sysadmin/YubikeyMostlyDropped written at 01:27:37; Add Comment


I brought our Django app up using Python 3 and it mostly just worked

I have been worrying for some time about the need to eventually get our Django web application running under Python 3; most recently I wrote about being realistic about our future plans, which mostly amounted to not doing anything until we had to. Well, guess what happened since then.

For reasons beyond the scope of this entry, last Friday I ended up working on moving our app from Django 1.10.7 to 1.11.x, which was enlivened by the usual problem. After I had it working under 1.11.22, I decided to try running it (in development mode, not in production) using Python 3 instead of Python 2, since Django 1.11.22 is itself fully compatible with Python 3. To my surprise, it took only a little bit of cleanup and additional changes beyond basic modernization to get it running, and the result is so far fully compatible with Python 2 as well (I committed the changes as part of the 1.11 move, and since Monday they're running in production).

I don't think this is particularly due to anything I've done in our app's code; instead, I think it's mostly due to the work that Django has done to make everything work more or less transparently. As the intermediate layer between your app and the web (and the database), Django is already the place that has to worry about character set conversion issues, so it can spare you from most of those. And generally that's the big difference between Python 2 and Python 3.

(The other difference is the print statement versus 'print()', but you can make Python 2.7 work in the same way as Python 3 with 'from __future__ import print_function', which is what I did.)

I haven't thoroughly tested our web app under Python 3, of course, but I did test a number of the basics and everything looks good. I'm fairly confident that there are no major issues left, only relatively small corner cases (and then the lurking issue of how well the Python 3 version of mod_wsgi works and if there are any traps there). I'm still planning to keep us on Python 2 and Django 1.11 through at least the end of this year, but if we needed to I could probably switch over to a current Django and Python 3 with not very much additional work (and most of the work would be updating to a new version of Django).

There was one interesting and amusing change I had to make, which is that I had to add a bunch of __str__ methods to various Django models that previously only had __unicode__ methods. When building HTML for things like form <select> fields, Django string-izes the names of model instances to determine what to put there, but in Python 2 it actually generates the Unicode version and so ends up invoking __unicode__, while in Python 3 str is Unicode already and so Django was using __str__, which didn't exist. This is an interesting little incompatibility.
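The effect is easy to demonstrate without Django itself; this is an illustrative sketch, not actual Django model code:

```python
# A model-like class that defines only __unicode__, as Python 2
# era Django models often did. Illustrative only; no Django needed.
class Login(object):
    def __init__(self, name):
        self.name = name

    def __unicode__(self):
        return self.name

# Under Python 2, unicode(obj) finds __unicode__ and Django's form
# rendering gets the name. Under Python 3, str() ignores the now
# meaningless __unicode__ method and falls back to object.__repr__.
obj = Login("cks")
print(str(obj))   # something like '<__main__.Login object at 0x...>'

# The fix from this entry: add a __str__ (here simply aliasing the
# existing __unicode__ method).
Login.__str__ = Login.__unicode__
print(str(obj))
```

Aliasing __str__ to __unicode__ like this keeps the class working identically under both Python 2 and Python 3.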

Sidebar: The specific changes I needed to make

I'm going to write these down partly because I want a coherent record, and partly because some of them are interesting.

  • When generating a random key to embed in a URL, read from /dev/urandom using binary mode instead of text mode and switch from an ad-hoc implementation of base64.urlsafe_b64encode to using the real thing. I don't know why I didn't use the base64 module in the first place; perhaps I just didn't look for it, since I already knew about Python 2's special purpose encodings.

  • Add __str__ methods to various Django model classes that previously only had __unicode__ ones.

  • Switch from print statements to print() as a function in some administrative tools the app has. The main app code doesn't use print, but some of the administrative commands report diagnostics and so on.

  • Fix mismatched tabs versus spaces indentation, which snuck in because my usual editor for Python used to use all-tabs and now uses all-spaces. At some point I should mass-convert all of the existing code files to use all-spaces, perhaps with four-space indentation.

  • Change a bunch of old style exception syntax, 'except Thing, e:', to 'except Thing as e:'. I wound up finding all of these with grep.

  • Fix one instance of sorting a dictionary's .keys(), since Python 3 now returns an iterator here instead of a sortable object.

Many of these changes were good ideas in general, and none of them are ones that I find objectionable. Certainly switching to just using base64.urlsafe_b64encode makes the code better (and it makes me feel silly for not using it to start with).

python/DjangoAppPython3Surprise written at 21:46:22; Add Comment


Systemd services that always restart should probably set a restart delay too

Ubuntu 18.04's package of the Prometheus host agent comes with a systemd .service unit that is set with 'Restart=always' (something that comes from the Debian package, cf). This is a perfectly sensible setting for the host agent for a metrics and monitoring system, because if you have it set to run at all, you almost always want it to be running all the time if at all possible. When we set up a local version of the host agent, I started with the Ubuntu .service file and kept this setting.

In practice, pretty much the only reason the Prometheus host agent aborts and exits on our machines is that the machine has run out of memory and everything is failing. When this happens with 'Restart=always' and the default systemd settings, systemd will wait its default of 100 milliseconds (the normal DefaultRestartSec value) and then try to restart the host agent again. Since the out of memory condition has probably not gone away in 100 ms, this restart is almost certain to fail. Systemd will repeat this until the restart has failed five times in ten seconds, and then, well, let me quote the documentation:

[...] Note that units which are configured for Restart= and which reach the start limit are not attempted to be restarted anymore; [...]

With the default restart interval, this takes approximately half a second. Our systems do not clear up out of memory situations in half a second, and so the net result was that when machines ran out of memory sufficiently badly that the host agent died, it was dead until we restarted it manually.

(I can't blame systemd for this, because it's doing exactly what we told it to do. It is just that what we told it to do isn't the right thing under the circumstances.)

The ideal thing to do would be to try restarting once or twice very rapidly, just in case the host agent died due to an internal error, and then to back off to much slower restarts, say once every 30 to 60 seconds, as we wait out the out of memory situation that is the most likely cause of problems. Unfortunately systemd only offers a single restart delay, so the necessary setting is the slower one; in the unlikely event that we trigger an internal error, we'll accept that the host agent has a delay before it comes back. As a result we've now revised our .service file to have 'RestartSec=50s' as well as 'Restart=always'.

(We don't need to disable StartLimitBurst's rate limiting, because systemd will never try to restart the host agent more than once in any ten second period.)
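The result is a .service restart stanza that looks something like this (the ExecStart path here is a placeholder, not our real one):

```
[Service]
ExecStart=/usr/local/bin/node_exporter
Restart=always
# Wait out transient conditions such as out of memory situations
# instead of burning through the start limit in half a second.
RestartSec=50s
```

With a 50 second delay, even the default StartLimitBurst of five restarts in ten seconds can never be hit.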

There are probably situations where the dominant reason for a service failing and needing to be restarted is an internal error, in which case an almost immediate restart minimizes downtime and is the right thing to do. But if that's not the case, then you definitely want to have enough of a delay to let the overall situation change. Otherwise, you might as well not set a 'Restart=' at all, because it's probably not going to work and will just run you into the (re)start limit.

My personal feeling is that most of the time, your services are not going to be falling over because of their own bugs, and as a result you should almost always set a RestartSec delay and consider what sort of (extended) restart limit you want to set, if any.

Sidebar: The other hazard of always restarting with a low delay

The other big reason for a service to fail to start is if you have an error in a configuration file or the command line (eg a bad argument or option). In this case, restarting in general does you no good (since the situation will only be cleared up with manual attention and changes), and immediately restarting will flood the system with futile restart attempts until systemd hits the rate limits and shuts things off.

It would be handy to be able to tell systemd that it should not restart the service if it immediately fails during a 'systemctl start', or at least to tell it that the failure of an ExecStartPre program should not trigger the restarting, only a failure of the main ExecStart program (since ExecStartPre is sometimes used to check configuration files and so on). Possibly systemd already behaves this way, but if so it's not documented.

linux/SystemdRestartUseDelay written at 23:45:39; Add Comment


SMART drive self-tests seem potentially useful, but not too much

I've historically ignored all aspects of hard drive SMART apart, perhaps, from how smartd would occasionally email us to complain about things, and sometimes those things would even be useful. There is good reason to be a SMART sceptic, seeing as many of the SMART attributes are underdocumented, SMART itself is peculiar and obscure, hard drive vendors have periodically had their drives outright lie about SMART things, and SMART attributes are not necessarily good predictors of drive failures (plenty of drives die abruptly with no SMART warnings, which can be unnerving). Certain sorts of SMART warnings are usually indicators of problems (but not always), but the absence of SMART warnings is no safety (see eg, and also Backblaze from 2016). Also, the smartctl manpage is very long.

But, in the wake of our flaky SMART errors and some other events with Crucial SSDs here, I wound up digging deeper into the smartctl manpage and experimenting with SMART self-tests, where the hard drive tries to test itself, and SMART logs, where the hard drive may record various useful things like read errors or other problems, and may even include the sector number involved (which can be useful for various things). Like much of the rest of SMART, what SMART self-tests do is not precisely specified or documented by drive vendors, but generally it seems that the 'long' self-test will read or scan much of the drive.

By itself, this probably isn't much different from what you could do with dd or a software RAID scan. From my perspective, what's convenient about SMART self-tests is that you can kick them off in the background regardless of what the drive is being used for (if anything), they probably won't get too much in the way of your regular IO, and after they're done they automatically leave a record in the SMART log, which will probably persist for a fair while (depending on how frequently you run self-tests and so on).

On the flipside, SMART self-tests have the disadvantage that you don't really know what they're doing. If they report a problem, it's real, but if they don't report a problem you may or may not have one. A SMART self-test is better than nothing for things like testing your spare disks, but it's not the same as actually using them for real.

On the whole, my experimentation with SMART self-tests leaves me feeling that they're useful enough that I should run them more often. If I'm wondering about a disk and it's not being used in a way where all of it gets scanned routinely, I might as well throw a self-test at it to see what happens.

(They probably aren't useful and trustworthy enough to be worth scripting something so that we routinely run self-tests on drives that aren't already in software RAID arrays.)

PS: Much but not all of my experimentation so far has been on hard drives, not SSDs. I don't know if the 'long' SMART self-test on a SSD tests more thoroughly and reaches more bits of the drive internals than you can with just an external read test like dd, or conversely if it's less thorough than a full read scan.

tech/SMARTSelfTestsMaybe written at 21:07:18


Straightforward web applications are now very likely to be stable in browsers

In response to my entry on how our goals for our web application are to not have to touch it, Ross Hartshorn left a comment noting:

Hi! Nice post, and I sympathize. However, I can't help thinking that, for web apps in particular, it is risky to have the idea of software you don't have to touch anymore (except for security updates). The browsers which are used to access it also change. [...]

I don't think these are one-off changes, I think it's part of a general trend. If it's software that runs on your computer, you can just leave it be. If it's a web app, a big part of it is running on someone else's computer, using their web browser (a piece of software you don't control). You will need to update it from time to time. [...]

This is definitely true in a general, abstract sense, and in the past it has been true in a concrete sense, in that some old web applications could break over time due to the evolution of browsers. However, this hasn't really been an issue for simple web applications (ones just based around straight HTML forms), and these days I think that even straightforward web applications are going to be stable over browser evolution.

The reality of the web is that there is a huge overhang of old straightforward HTML, and there has been for some time; in fact, for a long time now, at any given point in time most of the HTML in existence is 'old' to some degree. Browsers go to great effort to not break this HTML, for the obvious reason, and so any web application built around basic HTML, basic forms, and the like has been stable (in browsers) for a long time now. The same is true for basic CSS, which has long since stopped being in flux and full of quirks. If you stick to HTML and CSS that is at least, say, five years old, everything just works. And you can do a great deal with that level of HTML and CSS.

(One exhibit for this browser stability is DWiki, the very software behind this blog, which has HTML and CSS that mostly fossilized more than a decade ago. This includes the HTML form for leaving comments.)

Augmenting your HTML and CSS with Javascript has historically been a bit more uncertain and unstable, but my belief is that even that instability has now stopped. Just as with HTML and CSS, there is a vast amount of old(er) Javascript on the web and no one wants to break it by introducing incompatible language changes in browsers. Complex Javascript that lives on the bleeding edge of browsers is still something that needs active maintenance, but if you just use some simple Javascript to do straightforward progressive augmentation, I think that you've been perfectly safe for some time and are going to be safe well into the future.
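As an illustration of the sort of simple progressive augmentation I mean (this sketch is mine, not taken from our actual application), here is a basic form that works in any browser, with a small Javascript addition that uses only long-stable DOM APIs and changes nothing if it fails to run:

```html
<!-- A plain HTML form that works with or without Javascript. -->
<form method="post" action="/comments">
  <label>Name: <input name="who" required></label>
  <textarea name="comment" rows="10" cols="60"></textarea>
  <input type="submit" value="Post comment">
</form>

<script>
// Augmentation: ask for confirmation before submitting an
// empty-looking comment. Every API here (querySelector,
// addEventListener, trim) has been stable for over a decade.
// If this script doesn't run, the form submits as it always did.
document.querySelector('form').addEventListener('submit', function (e) {
  var text = this.elements['comment'].value;
  if (text.trim() === '' && !confirm('Post an empty comment?')) {
    e.preventDefault();
  }
});
</script>
```

The point of structuring it this way is that the baseline behaviour never depends on the script; the Javascript only intercepts the submit event when it is present and working.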

(This is certainly our experience with our web application.)

Another way to put this is that the web has always had some stable core, and this stable core has steadily expanded over time. For some time now, that stable core has been big enough to build straightforward web applications. It's extremely unlikely that future browsers will roll back very much of this stable core, if anything; it would be very disruptive and unpopular.

(You don't have to build straightforward web applications using the stable core; you can make your life as complicated as you want to. But you're probably not going to do that if you want an app that you can stop paying much attention to.)

web/WebAppsAndBrowserStability written at 23:23:22
