Thinking about timeouts and exceptions in Python
As part of our overall monitoring and metrics system, I have a Python program that logs in to our IMAP server to make sure that we can at least get that far (because we have broken that sort of thing in the past). The program emits various Prometheus metrics, including how long the login took. For reasons beyond the scope of this entry, I would also like to have some very basic information on how the IMAP server is performing, such as how long it takes to do an IMAP SELECT operation on the test account's IMAP inbox. Trying to implement this cleanly has run me into issues with handling timeouts.
Like any sensible program that checks a system that may have
problems, my program has an overall timeout on how long it will
talk to the IMAP server. In Python, the straightforward way
to implement an overall timeout is with
signal.setitimer and a
SIGALRM signal handler that raises an exception. When you're only
doing one (conceptual) thing, this is easy to implement by wrapping
your operation in a try/except block:

    try:
        signal.setitimer(signal.ITIMER_REAL, timeout)
        metrics = login_check(host, user, pw)
        signal.setitimer(signal.ITIMER_REAL, 0)
        report_success(host, metrics)
    except Timeout:
        report_failure(host)

Either we finish within the timeout interval, in which case we report the timing and other metrics we generated, or we fail and we report timeout failure metrics.
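The fragment above leaves out the signal handler and the Timeout exception. For illustration, here is a minimal sketch of those pieces, assuming a Unix system and that everything runs in the main thread (Python only allows signal handlers to be installed there). The names Timeout, run_with_timeout and slow_operation are made up for this example and are not from the real program.

```python
import signal
import time


class Timeout(Exception):
    """Raised from the SIGALRM handler when the overall time limit expires."""


def _on_alarm(signum, frame):
    # Python runs this handler in the main thread, so the exception
    # surfaces inside whatever code the main thread is executing.
    raise Timeout()


def run_with_timeout(operation, timeout):
    """Run operation() under a wall-clock limit.

    Returns (True, result) if it finished in time, (False, None) otherwise.
    """
    signal.signal(signal.SIGALRM, _on_alarm)
    try:
        signal.setitimer(signal.ITIMER_REAL, timeout)
        try:
            result = operation()
        finally:
            # Disarm the timer on every exit path so it cannot fire later.
            signal.setitimer(signal.ITIMER_REAL, 0)
        return True, result
    except Timeout:
        return False, None


def slow_operation():
    # Hypothetical stand-in for login_check(); it just blocks.
    time.sleep(5)
    return "done"
```

Calling run_with_timeout(slow_operation, 0.1) gives up after roughly a tenth of a second instead of waiting out the full sleep. Disarming the timer in a finally clause also covers the case where the operation fails with some other exception, which the shorter fragment does not handle.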
This simple approach breaks down once I want to report separate metrics and success statuses for two different operations. Timing out while trying to log in to the IMAP server is quite different from (and more severe than) successfully logging in to the IMAP server but timing out during the IMAP SELECT operation. Since I realized this (while I was writing the new code), I've been trying to work out the right structure to make the code natural and clean.
The theoretical clean abstraction that I think I want is that once the timeout is hit, this is recorded and all further network IO (or more generally, IMAP protocol operations) fail immediately. If this were how it worked, both the IMAP login attempt and the IMAP SELECT would report success or failure depending on where things were when the timeout happened, and I could report a 'there was a timeout' metric at the end. This would also extend very naturally to doing a series of IMAP operations (for example, SELECT'ing several different mailboxes and collecting timings on each). The code could just generate metrics in a straight-line fashion and everything would work out. Unfortunately Python's network code and imaplib don't provide a straightforward way to do this, so I would have to build a layer on top of imaplib to do it for me.
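To make the shape of that abstraction concrete, here is a rough sketch of what such a layer might look like. It is not tied to imaplib at all; Deadline, DeadlineExceeded and timed_step are hypothetical names, and each operation is assumed to be a callable that takes the remaining time in seconds and uses it as its own network timeout.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a step is attempted after the overall limit has passed."""


class Deadline:
    """Tracks one overall time limit shared by a series of steps."""

    def __init__(self, seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + seconds

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def expired(self):
        return self.clock() >= self.expires_at


def timed_step(deadline, operation):
    """Run operation(remaining_seconds); return (ok, seconds_taken).

    Once the deadline has passed, the step fails immediately without
    calling the operation at all.
    """
    start = deadline.clock()
    ok = False
    try:
        if deadline.expired():
            raise DeadlineExceeded()
        operation(deadline.remaining())
        ok = True
    except (DeadlineExceeded, OSError):
        # Socket timeouts are OSError subclasses. Real code would probably
        # also want to catch the IMAP library's own error exceptions.
        pass
    return ok, deadline.clock() - start
```

With this, the checking code could call timed_step() once per operation in a straight line and then ask deadline.expired() at the end to decide whether to report a timeout. Note that this sketch lumps real network errors together with the time limit running out.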
(This approach is inspired by Go's network package, which supports something like this. But even then it's not quite as clean as it looks, because ideally you want every check to be aware of the possibility of timeouts, so that it distinguishes a real network error from 'we hit the time limit and my network operations started failing'.)
My current approach is to keep more or less explicit track of how far I got by what metrics have been generated, and then fill in any missing metrics with failure markers:

    login_metrics = None
    select_metrics = None
    try:
        signal.setitimer(signal.ITIMER_REAL, timeout)
        login_metrics, conn = login_check(host, user, pw)
        # logging in may have failed
        if conn:
            select_metrics = select_check(conn, host, "INBOX")
        # Let's ignore logging out for now
        signal.setitimer(signal.ITIMER_REAL, 0)
        did_timeout = False
    except Timeout:
        did_timeout = True
    if not login_metrics:
        login_metrics = failed_login(host, user)
    if not select_metrics:
        select_metrics = failed_select(host, "INBOX")
    report_metrics(login_metrics, select_metrics)
    report_maybe_timeout(host, did_timeout)

This approach works, but it has a scaling problem; if I add more IMAP operations, I have to add code in several places and I'd better not miss one. This isn't very generic and it feels like there should be a better way. On the other hand, this code is at least explicit and free of magic; it may be brute force, but it's straightforward to follow. A high-magic approach is probably not the right one for a program that I touch at most once every six months.
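One way to stay explicit while cutting down the number of places to touch might be to describe each operation once, as a name plus a 'run' function and a 'failed' function, and loop over that list. The following is only a sketch under that assumption; run_checks, the shape of the step tuples, and the demo steps are all invented for illustration, and arming the timer is left to the caller.

```python
class Timeout(Exception):
    """Stand-in for the exception raised by the SIGALRM handler."""


def run_checks(steps):
    """Run steps in order and fill in failure metrics for any that are missing.

    steps is a list of (name, run, failed) tuples. run(state) returns the
    step's metrics, or None if it could not produce any; failed() returns
    failure metrics. state is a dict that steps use to pass things along,
    such as an open connection.
    Returns (results_by_name, did_timeout).
    """
    results = {}
    state = {}
    did_timeout = False
    try:
        for name, run, _failed in steps:
            metrics = run(state)
            if metrics is not None:
                results[name] = metrics
    except Timeout:
        did_timeout = True
    # Anything that never produced metrics gets its failure marker.
    for name, _run, failed in steps:
        if name not in results:
            results[name] = failed()
    return results, did_timeout


# Hypothetical demo steps: the login works, the SELECT hits the timeout.
def demo_login(state):
    state["conn"] = "fake connection"
    return {"login_ok": 1}


def demo_select(state):
    raise Timeout()


DEMO_STEPS = [
    ("login", demo_login, lambda: {"login_ok": 0}),
    ("select", demo_select, lambda: {"select_ok": 0}),
]
```

Adding another IMAP operation then means adding one entry to the list. The cost is a little more indirection than the brute force version.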
(Some of my problems may be because of how I generate almost all login metrics in one function, which was a clever idea in the original code but perhaps isn't any more. I'm not sure what would be a better approach, though.)
This approach also puts all timeout handling in one place, at the top level, instead of forcing all of the individual operations to be aware of the possibility that they will hit a timeout (or that they are being invoked after the timeout has triggered, which is why all of their IMAP operations are failing). It may be that this is the best option for code structure, especially in Python where exceptions are how we deal with many global concerns.
(This is related to phase tracking for better error reporting. In a sense, what my current code is doing is tracking 'phase' in a collection of variables.)