The scariness of uncertainty

April 3, 2014

One of the issues that I'm facing right now (and have been for a while) is that being uncertain can be a daunting thing. As sysadmins we deal with uncertainty all of the time, of course, and if we were paralyzed by it in general we'd never get anywhere. It's usually easy enough to overcome uncertainty and move forward in small situations or important situations (for various reasons). Where uncertainty can dig in is in dauntingly big and complex projects that are not essential. If you don't strictly need the result and building any version of it is clearly a lot of work for an uncertain reward, it's very easy to keep deferring action in favour of various stalling measures (or other work).

All of this sounds rather hand-wavy, so let me tell you about my project of gathering OS level performance statistics. Or rather my non-project.

If you look around, there are a lot of options for gathering, aggregating, and graphing OS performance stats (in tools, full systems, and ecologies of tools). Beyond a certain basic level it's unclear which ones of them are going to work best for us and which ones will be crawling failures, but at the same time it's also clear that any of them that look good are going to take a significant amount of work and time to set up and try out (and I'm going to have to try them in production).

As a result I have been circling around this project for literally years now. Every so often I poke and prod at the issue; I read more about some tool or another, I look at pretty pictures, I hear about something new, and so on and so forth. But I've never sat down to really do something. I've always found higher priority things to do or other excuses.

(Here in the academy this behavior in graduate students is well known and gets called 'thesis avoidance'.)

The scariness of uncertainty is not the only reason for this, of course, but it's a significant contributing factor. In a way it raises the stakes for making a choice.

(The uncertainty comes from two directions. One is simply trying to select which system to use; the other is whether or not the whole idea is going to be worthwhile. The latter is a bit stupid since we're probably not going to be left with a white elephant of a system that we ignore and then quietly abandon, but the possibility gnaws at me and feeds other uncertainties and doubts.)

I don't have any answers, but maybe writing this entry has made it more likely that I do something here. And maybe I should embrace the possibility of failure as a sign that I am finally taking enough risk.

(I feel divided about that idea but I need to think about it more and then write another entry on it.)

Comments on this page:

By dozzie at 2014-04-03 07:13:11:

If you look around, there are a lot of options for gathering, aggregating, and graphing OS performance stats (in tools, full systems, and ecologies of tools).

So you pretty much want a monitoring system (for a certain meaning of the term "monitoring"). I can tell you in advance: all of the bigger ones (Nagios/Icinga, Zabbix, Zenoss) suck. They do a lot of things and you need to configure them even for things you don't need.

You would actually want to start with something way, way smaller and dedicated to just a single thing (graphing), like Graphite or collectd. Or even better: with something that collects whatever data you tell it to and passes it on to one or more systems of your choosing.

Sounds simple (and it is simple), but there aren't too many tools that can do it. Currently I'm using Fluentd (there's also logstash, which can do the same). There is one instance for every host under monitoring (these instances spool data in case of network problems) and one instance on the aggregator. The aggregator instance then passes the data to collectd. I could plug in Graphite if I decided to, and I wouldn't need to touch anything except the aggregator.

An advantage of this setup is that a cron script that collects data and submits it to Fluentd on localhost is just enough. Another advantage is I only collect performance data once, no matter how many graphing systems I have deployed. I can try them all.
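As a rough illustration of how small such a cron probe can be, here is a minimal Python sketch that reads the load average and posts it to a Fluentd instance on localhost. It assumes the local Fluentd has its HTTP input (in_http) enabled on port 9880; the comment above doesn't say how the script actually talks to Fluentd, so that transport, the tag name perf.loadavg, and the field names are all made up for the example.

```python
import json
import os
import socket
import time
import urllib.parse
import urllib.request

# Assumption: the local Fluentd has the in_http input listening here.
FLUENTD_URL = "http://127.0.0.1:9880"
# Hypothetical tag; pick whatever the aggregator is set up to route.
TAG = "perf.loadavg"


def build_record(loadavg=None, host=None, now=None):
    """Build one measurement as a plain dict (field names are illustrative)."""
    if loadavg is None:
        loadavg = os.getloadavg()  # 1, 5 and 15 minute load averages (Unix only)
    load1, load5, load15 = loadavg
    return {
        "host": host if host is not None else socket.gethostname(),
        "collected_at": int(now if now is not None else time.time()),
        "load1": load1,
        "load5": load5,
        "load15": load15,
    }


def build_request(record, tag=TAG, base_url=FLUENTD_URL):
    """Wrap the record in a form-encoded POST with a 'json' field."""
    body = urllib.parse.urlencode({"json": json.dumps(record)}).encode("ascii")
    return urllib.request.Request(f"{base_url}/{tag}", data=body, method="POST")


def submit(record, timeout=5):
    """Send the record to the local Fluentd and return the HTTP status."""
    with urllib.request.urlopen(build_request(record), timeout=timeout) as resp:
        return resp.status


# A cron job would simply call: submit(build_record())
```

The probe knows nothing about the aggregator or about any graphing system; it only talks to localhost, which is what lets the rest of the pipeline change without touching it.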

And another advantage: I can gradually expand the whole monitoring system. I can start with collecting anything (like load average) and just submitting it to the local Fluentd. Then I can add some centralization, so that Fluentd forwards data to the Fluentd on the aggregator (I don't touch the probes when doing this). Then the aggregator starts to feed a graphing system. Several small steps, each one separated from the rest.

The drawback of this setup is that I need to write a Fluentd plugin for most data sinks, as Fluentd is a daemon just for passing messages around, not a dedicated monitoring tool.

Written on 03 April 2014.


Last modified: Thu Apr 3 00:34:47 2014