## Histogram data is most useful when it also provides true totals

January 26, 2024

A true histogram is generated from raw data. However, in things like metrics, we generally don't have the luxury of keeping all of the raw data around; instead we need to summarize it into histogram data. This is traditionally done by having some number of buckets, each with a count of the values that fell into it, reported as either independent (per-bucket) or cumulative counts. A lot of systems stop there; for example, OpenZFS provides its histogram data this way. Unfortunately, by itself this information is incomplete in an annoying way.
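As a small illustration of the two bucket styles, here is a minimal Python sketch that converts between cumulative and independent counts. The function names are hypothetical, and it assumes the buckets are listed in increasing order of their upper bounds.

```python
from itertools import accumulate


def cumulative_to_independent(cumulative):
    """Turn cumulative bucket counts into per-bucket counts.

    Assumes buckets are ordered by increasing upper bound, so each
    cumulative count includes everything in the buckets before it.
    """
    independent = []
    previous = 0
    for count in cumulative:
        independent.append(count - previous)
        previous = count
    return independent


def independent_to_cumulative(independent):
    """Turn per-bucket counts into cumulative counts (a running sum)."""
    return list(accumulate(independent))
```

Either form carries the same information, and neither tells you where inside a bucket the values actually fell.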

If you're generating histogram data, you should go the extra distance to also provide a true total of all of the raw data. The reason is simple: only with a true total can one get a genuine and accurate average value, or anything derived from that average. Importantly, one thing you can potentially derive from the average value is an indication of what I'll call skew in your buckets, meaning how far the values in a bucket sit from where you would expect them to be.

The standard assumption when dealing with histograms is that the values in each bucket are evenly distributed through the range of the bucket. If they truly are, then you can do things like get a good estimate of the average value by just taking the midpoint of each bucket (weighted by that bucket's count), and so people will say that you don't really need the true total. However, this is an assumption and it's not necessarily correct, especially if the size of the buckets is large (as it can be at the upper end of a 'powers of two' logarithmic bucket size scheme, which is pretty common because it's convenient to generate).

I've certainly looked at a number of such histograms where it's clear (from various other information sources) that this assumption of even distribution wasn't correct. How incorrect it was wasn't all that clear, though, because the information necessary to have a solid idea wasn't there.
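With a true total in hand, checking the even-distribution assumption takes only a few lines. The following is a rough Python sketch, not a standard method: the function names, the `(low, high, count)` bucket representation, and the idea of reporting a simple ratio as the skew indicator are all assumptions made for illustration.

```python
def midpoint_average(buckets):
    """Estimate the average assuming values are evenly spread in each bucket.

    `buckets` is a list of (low, high, count) tuples with per-bucket
    (not cumulative) counts. Returns None for an empty histogram.
    """
    total_count = sum(count for _, _, count in buckets)
    if total_count == 0:
        return None
    estimated_total = sum((low + high) / 2 * count
                          for low, high, count in buckets)
    return estimated_total / total_count


def skew_ratio(buckets, true_total):
    """Ratio of the true average to the midpoint estimate.

    A value near 1.0 is consistent with evenly spread values; a value
    well below 1.0 suggests values cluster toward the low end of their
    buckets, and well above 1.0 toward the high end.
    """
    total_count = sum(count for _, _, count in buckets)
    estimate = midpoint_average(buckets)
    if not estimate:
        return None
    return (true_total / total_count) / estimate
```

For example, if 100 values land in a bucket covering 512 to 1024 but they all sit near 520, the midpoint estimate of the average is 768 while the true average is about 520, and the ratio comes out around 0.68. Without the true total there is no way to see this from the bucket counts alone.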

Good histogram data takes more than counts in buckets. But including a true total as an additional piece of data is at least a start, and it's probably inexpensive (both to export and to accumulate).
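To show how little extra work the true total is on the accumulating side, here is a minimal Python sketch of a 'powers of two' histogram that keeps a running sum alongside its bucket counts. The `Histogram` class and its layout are hypothetical, and it assumes non-negative integer values (for example, latencies in microseconds).

```python
class Histogram:
    """Power-of-two bucket counts plus a true total of all observed values."""

    def __init__(self, num_buckets=32):
        # Bucket 0 holds the value 0; bucket i holds [2**(i-1), 2**i).
        # The last bucket also absorbs anything too large to fit.
        self.counts = [0] * num_buckets
        self.total = 0   # true sum of every observed value
        self.count = 0   # number of observations

    def observe(self, value):
        if value < 0:
            raise ValueError("this sketch only handles non-negative integers")
        index = min(value.bit_length(), len(self.counts) - 1)
        self.counts[index] += 1
        self.total += value   # the only extra work for a true total
        self.count += 1

    def average(self):
        """Exact average, available only because we kept the true total."""
        return self.total / self.count if self.count else None
```

The additional cost is one addition per observation and one more number to export.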

(Someone has probably already written a 'best practices for gathering and providing histogram data' article.)
