sysadmin/SSLChasingCertChains written at 22:02:04; Add Comment
Chasing SSL certificate chains to build a chain file
Supposes that you have some shiny new SSL certificates for some reason. These new certificates need a chain of intermediate certificates in order to work with everything, but for some reason you don't have the right set. In ideal circumstances you'll be able to easily find the right intermediate certificates on your SSL CA's website and won't need the rest of this entry.
Okay, let's assume that your SSL CA's website is an unhelpful swamp pit. Fortunately all is not lost, because these days at least some SSL certificates come with the information needed to find the intermediate certificates. First we need to dump out our certificate, following my OpenSSL basics:
openssl x509 -text -noout -in WHAT.crt
This will print out a bunch of information. If you're in luck (or possibly always), down at the bottom there will be a 'Authority Information Access' section with a 'CA Issuers - URI' bit. That is the URL of the next certificate up the chain, so we fetch it:
(In case it's not obvious: for this purpose you don't have to worry if this URL is being fetched over HTTP instead of HTTPS. Either your certificate is signed by this public key or it isn't.)
Generally or perhaps always this will not be a plain text file like your certificate is, but instead a binary blob. The plain text format is called PEM; your fetched binary blob of a certificate is probably in the binary DER encoding. To convert from DER to PEM we do:
openssl x509 -inform DER -in <WGOT-FILE>.crt -outform PEM -out intermediate-01.crt
Now you can inspect intermediate-01.crt in the same to see if it needs a further intermediate certificate; if it does, iterate this process. When you have a suitable collection of PEM format intermediate certificates, simply concatenate them together in order (from the first you fetched to the last, per here) to create your chain file.
PS: The Qualys SSL Server Test is a good way to see how correct your certificate chain is. If it reports that it had to download any certificates, your chain of intermediate certificates is not complete. Similarly it may report that some entries in your chain are not necessary, although in practice this rarely hurts.
Sidebar: Browsers and certificate chains
As you might guess, some but not all browsers appear to use this embedded intermediate certificate URL to automatically fetch any necessary intermediate certificates during certificate validation (as mentioned eg here). Relatedly, browsers will probably not tell you about unnecessary intermediate certificates they received from your website. The upshot of this can be a HTTPS website that works in some browsers but fails in others, and in the failing browser it may appear that you sent no additional certificates as part of a certificate chain. Always test with a tool that will tell you the low-level details.
(Doing otherwise can cause a great deal of head scratching and frustration. Don't ask how I came to know this.)
python/WarningsModuleReactions written at 01:19:06; Add Comment
My reactions to Python's warnings module
A commentator on my entry on the warnings problem pointed out the existence of the warnings module as a possible solution to my issue. I've now played around with it and I don't think it fits my needs here, for two somewhat related reasons.
The first reason is that it simply makes me nervous to use or even take over the same infrastructure that Python itself uses for things like deprecation warnings. Warnings produced about Python code and warnings that my code produces are completely separate things and I don't like mingling them together, partly because they have significantly different needs.
The second reason is that the default formatting that the warnings module uses is completely wrong for the 'warnings produced from my program' case. I want my program warnings to produce standard Unix format (warning) messages and to, for example, not include the Python code snippet that generated them. Based on playing around with the warnings module briefly it's fairly clear that I would have to significantly reformat standard warnings to do what I want. At that point I'm not getting much out of the warnings module itself.
All of this is a sign of a fundamental decision in the warnings module: the warnings module is only designed to produce warnings about Python code. This core design purpose is reflected in many ways throughout the module, such as in the various sorts of filtering it offers and how you can't actually change the output format as far as I can see. I think that this makes it a bad fit for anything except that core purpose.
In short, if I want to log warnings I'm better off using general logging and general log filtering to control what warnings get printed. What features I want there are another entry.
python/WarningHandlingProblem written at 02:14:16; Add Comment
A problem: handling warnings generated at low levels in your code
Python has a well honed approach for handling errors that happen at a low level in your code; you raise a specific exception and let it bubble up through your program. There's even a pattern for adding more context as you go up through the call stack, where you catch the exception, add more context to it (through one of various ways), and then propagate the exception onwards.
All of this is great when it's an error. But what about warnings? I recently ran into a case where I wanted to 'raise' (in the abstract) a warning at a very low level in my code, and that left me completely stymied about what the best way to do it was. The disconnect between errors and warnings is that in most cases errors immediately stop further processing while warnings don't, so you can't deal with warnings by raising an exception; you need to somehow both 'raise' the warning and continue further processing.
I can think of several ways of handling this, all of which I've sort of used in code in the past:
The second and third approaches have the problem that it's hard for intermediate layers to add context to warning messages; they'll wind up wanting or needing to pass the context down to the low level routines that generate the warnings. The third approach can have the general options problem when it comes to controlling what warnings are and aren't produced, or you can try to control this by having the high level code configure the logging system to discard some messages.
I don't have any answers here, but I can't help thinking that I'm missing a way of doing this that would make it all easy. Probably logging is the best general approach for this and I should just give in, learn a Python logging system, and use it for everything in the future.
(In the incident that sparked this entry, I wound up punting and just
printing out a message with
tech/SSHAndSSLAndHeartbleed written at 23:43:41; Add Comment
The relationship between SSH, SSL, and the Heartbleed bug
I will lead with the summary: since the Heartbleed bug is a bug in OpenSSL's implementation of a part of the TLS protocol, no version or implementation of SSH is affected by Heartbleed because the SSH protocol is not built on top of TLS.
So, there's four things involved here:
Low level flaws in OpenSSL such as Debian breaking its randomness can affect OpenSSH when OpenSSH uses something that's affected by the low level flaw. In the case of the Debian issue, OpenSSH gets its random numbers from OpenSSL and so was affected in a number of ways.
High level flaws in OpenSSL's implementation of TLS itself will never affect OpenSSH because OpenSSH simply doesn't use those bits of OpenSSL. For instance, if OpenSSL turns out to have an SSL certificate verification bug (which happened recently with other SSL implementations) it won't affect OpenSSH's SSH user and host key verification.
As a corollary, OpenSSH (and all SSH implementations) aren't directly affected by TLS protocol attacks such as BEAST or Lucky Thirteen, although people may be able to develop similar attacks against SSH using the same general principles.
linux/DracutNeededArguments written at 02:11:02; Add Comment
What sort of kernel command line arguments Fedora 20's dracut seems to want
Recently I upgraded the kernel on my Fedora 20 office workstation, rebooted the machine, and had it hang in early boot (the first two are routine, the last is not). Forcing a reboot back to the earlier kernel brought things back to life. After a bunch of investigation I discovered that this was not actually due to the new kernel, it was due to an earlier dracut update. So this is the first thing to learn: if a dracut update breaks something in the boot process, you'll probably only discover this the next time you upgrade the kernel and the (new) dracut builds a (new and not working) initramfs for it.
The second thing I discovered in the process of this is the Fedora boot process will wait for a really long time for your root filesystem to appear before giving up, printing messages about it, and giving you an emergency shell, where by a really long time I mean 'many minutes' (I think at least five). It turned out that my boot process had not locked up but instead it was sitting around waiting my root filesystem to appear. Of course this wait was silent, with no warnings or status notes reported on the console, so I thought that things had hung. The reason the boot process couldn't find my root filesystem was that my root filesystem is on software RAID and the new dracut has stopped assembling such things for a bunch of people.
(Fedora apparently considers this new dracut state to be 'working as designed', based on bug reports I've skimmed.)
I don't know exactly what changed between the old dracut and the
new dracut, but what I do know is that the new dracut really wants
you to explicitly tell it what software RAID devices, LVM devices,
or other things to bring up on boot through arguments added to the
kernel command line.
For me on my machine, this prints out (and booting wants):
The simplest approach to fixing up your machine in a situation like
this is probably to just update
(I won't say that Dracut is magic, because I'm sure it could all be read up on and decoded if I wanted to. I just think that doing so is not worth bothering with for most people. Modern Linux booting is functionally a black box, partly because it's so complex and partly because it almost always just works.)
sysadmin/MetricsSystemChoice written at 01:01:20; Add Comment
My current choice of a performance metrics system and why I picked it
In response to my previous entries on gathering OS level performance metrics, people have left a number of comments recommending various systems for doing this. So now it's time to explain my current decision about this.
This decision is not made from a full and careful comparison of all of the available tools with respect to what we need, partly because I don't know enough to make that comparison. Instead it's made in large part based on what seems to be popular among relatively prominent and leading edge organizations today. Put bluntly, graphite appears to be the current DevOps hotness as far as metrics goes.
That it's the popular and apparent default choice means two good things. First, given that it's used by much bigger environments than we are I can probably make it work for us, and given that the world is not full of angry muttering about how annoying and/or terrible it is it's probably not going to be particularly bad. Second, it's much more likely that such a popular tool will have a good ecology around it, that there will be people writing howtos and 'how I did this' articles for it and add on tools and so on. And indeed this seems to be the case based on my trawling of the Internet so far; I've tripped over far more stuff about graphite than about anything else and there seem to be any number of ways of collecting stats and feeding it data.
(That graphite's the popular choice also means that it's likely to be kept up to date, developed further, possibly packaged for me, and so on.)
A side benefit of this reading is that it's shown me that people are pushing metrics into a graphite-based system at relatively high rates. This is exactly what I want to do given that averages lie and the shorter period you take them over the better for avoiding some of those lies.
(I'm aware that we may run into things like disk IO limits. I'll have to see, but gathering metrics say every five or ten seconds is certainly my goal.)
Many of the alternatives are probably perfectly good and would do decently well for us. They're just somewhat more risky choices than the current big popular thing and as a result they leave me with various concerns and qualms.
web/SSLPragmaticKeyCompReactions written at 00:59:49; Add Comment
Pragmatic reactions to a possible SSL private key compromise
In light of the fact that the OpenSSL 'heartbleed' issue may have resulted in someone getting a copy of your private keys, there are least three possible reactions that people and organizations can take:
These are listed in declining order of theoretical goodness and also possibly declining order of cost.
Obviously the completely cautious approach is to assume that your private keys have been compromised and also that you should explicitly revoke them so that people might be protected from an attacker trying man in the middle attacks with your old certificates and private keys (if revocation actually works this time). The pragmatic issue is that this course of action probably costs the most money (if it doesn't, well, then there's no problem). If your organization has a lot riding on the security of your SSL certificates (in terms of money or other things) then this extra expense is easy to justify, and in many places the actual cost is small or trivial compared to other budget items.
But, as they say. There are places where this is not so true, where the extra cost of certificate revocations will to some degree hurt or require a fight to get. Given that certificate revocation may not actually do much in practice, there is a real question of whether you're actually getting anything worthwhile for your money (especially since you're probably doing this as merely a precaution against potential key compromise). If certificate revocation is an almost certainly pointless expense that's going to hurt, the pragmatics push people away from paying for it and towards one of the other two alternatives.
Getting new certificates is the intermediate caution option (especially if you believe that certificate revocation is ineffective in practice), since it closes off future risks that you can actually do something about yourself. But it still probably costs you some money (how much money depends on how many certificates you have or need).
Doing nothing with your SSL keys is the cheapest and easiest approach and is therefor very attractive for people on a budget, and there are a number of arguments towards a low risk assessment (or at least away from a high one). People will say that this position is obviously stupid, which is itself obviously stupid; all security is a question of risk versus cost and thus requires an assessment of both risk and cost. If people feel that the pragmatic risk is low (and at this point we do not have evidence that it isn't for a random SSL site) or cannot convince decision makers that it is not low and the cost is perceived as high, well, there you go. Regardless of what you think, the resulting decision is rational.
(Note that there is at least one Certificate Authority that offers SSL certificates for free but normally charges a not insignificant cost for revoking and reissuing certificates, which can swing the various costs involved. When certificates are free it's easy to wind up with a lot of them to either revoke or replace.)
In fact, as a late-breaking update as I write this, Neel Mehta
(the person who found the bug) has said that private key exposure
although of course unlikely is nowhere near the same thing as
'impossible'. See also Thomas Ptacek's followup comment.
Update April 12: It's now clear from the results of the CloudFlare challenge and other testing by people that SSL private keys can definitely be extracted from servers that are vulnerable to Heartbleed.
My prediction is that pragmatics are going to push quite a lot of people towards at least the second option and probably the third. Sure, if revoking and reissuing certificates is free a lot of people will take advantage of it (assuming that the message reaches them, which I would not count on), but if it costs money there will be a lot of pragmatic pressure towards cheap options.
Sidebar: Paths to high cost perceptions
Some people are busy saying that the cost of new SSL certificates is low (or sometimes free), so why not get new ones? There are at least three answers:
These answers can combine with each other.
sysadmin/StatsGatheringGoals written at 00:45:11; Add Comment
My goals for gathering performance metrics and statistics
I've written before that one of my projects is putting together something to gather OS level performance metrics. Today I want to write down what my goals for this are. First off I should mention that this is purely for monitoring, not for alerting; we have a completely separate system for that.
The most important thing is to get visibility into what's going on with our fileservers and their iSCSI backends, because this is the center of our environment. We want at least IO performance numbers on the backends, network utilization and error counts on the backends and the fileservers, perceived IO performance for the iSCSI disks on the fileservers, ZFS level stats on the fileservers, CPU utilization information everywhere, and as many NFS level stats as we can conveniently get (in a first iteration this may amount to 'none'). I'd like like to have both a very long history (half a year or more would be great) and relatively fine-grained measurements, but in practice we're unlikely to need fine-grained measurements very far into the past. To put it one way, we're unlikely to try to troubleshoot in detail a performance issue that's more than a week or so old. At the same time it's important to be able to look back and say 'were things as bad as this N months ago or did they quietly get worse on us?', because we have totally had that happen. Long term stats are also a good way to notice a disk that starts to quietly decay.
(In general I expect us to look more at history than at live data.
In a live incident we'll probably go directly to
Next most important is OS performance information for a few crucial Ubuntu NFS clients such as our IMAP servers and our Samba servers (things like local IO, NFS IO, network performance, and oh sure CPU and memory stats too). These are very 'hot' machines, used by a lot of people, so if they have performance problems we want to know about it and have a good shot at tracking things down. Also, this sort of information is probably going to help for capacity planning, which means that we probably also want to track some application level stats if possible (eg the number of active IMAP connections). As with fileservers a long history is useful here.
Beyond that it would be nice to get the same performance stats from basically all of our Ubuntu NFS clients. If nothing else this could be used to answer questions like 'do people ever use our compute servers for IO intensive jobs' and to notice any servers with surprisingly high network IO that might be priorities for moving from 1G to 10G networking. Our general Ubuntu machines can presumably reuse much or all of the code and configuration from the crucial Ubuntu machines, so this should be relatively easy.
In terms of displaying the results, I think that the most important thing will be an easy way of doing ad-hoc graphs and queries. We're unlikely to wind up with any particular fixed dashboard that we look at to check for problems; as mentioned, alerting is another system entirely. I expect us to use this metrics system more to answer questions like 'what sort of peak and sustained IO rates do we typically see during nightly backups' or 'is any backend disk running visibly slower than the others'.
I understand that some systems can ingest various sorts of logs, such as syslog and Apache logs. This isn't something that we'd do initially (just getting a performance metrics system off the ground will be a big enough project by itself). The most useful thing to have for problem correlation purposes would be markers for when client kernels report NFS problems, and setting up an entire log ingestion system for that seems a bit overkill.
(There are a lot of neat things we could do with smart log processing if we had enough time and energy, but my guess is that a lot of them aren't really related to gathering and looking at performance metrics.)
Note that all of this is relatively backwards from how you would do it in many environments, where you'd start from application level metrics and drill downwards from there because what's ultimately important is how the application performs. Because we're basically just a provider of vague general computing services to the department, we work from the bottom up and have relatively little 'application' level metrics we can monitor.
(With that said, it certainly would be nice to have some sort of metrics on how responsive and fast the IMAP and Samba servers were for users and so on. I just don't know if we can do very much about that, especially in an initial project.)
PS: There are of course a lot of other things we could gather metrics for and then throw into the system. I'm focusing here on what I want to do first and for the likely biggest payoff. Hopefully this will help me get over the scariness of uncertainty and actually get somewhere on this.
web/MyIfModifiedSinceHack written at 01:06:39; Add Comment
Giving in: pragmatic
* * *