Chris's Wiki :: blog/solaris/ZFSChecksumErrorMaybeSignal Comments
https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSChecksumErrorMaybeSignal?atomcomments
Recent comments in Chris's Wiki :: blog/solaris/ZFSChecksumErrorMaybeSignal.
2016-11-24T15:56:31Z
By Chris Siebenmann on /blog/solaris/ZFSChecksumErrorMaybeSignal
<div class="wikitext"><p>Miksa: Backblaze has written a couple of blog entries on
what SMART stats they find have predictive value, <a href="https://www.backblaze.com/blog/hard-drive-smart-stats/">one in 2014</a> and <a href="https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/">one in 2016</a>.
I have a vague memory that Google put at least some information about this
in a paper or two, with some surprising results (eg I think Google found
that high drive temperatures didn't predict failure in their population),
but I don't have references handy.</p>
</div>
2016-11-24T15:56:31Z
By Aneurin Price on /blog/solaris/ZFSChecksumErrorMaybeSignal
<div class="wikitext"><p>My experience has been pretty much the same as you tweeted: checksums have been very useful for reassuring me after something has gone wrong that it's been fixed correctly, but they've never alerted me to something that wasn't already very loudly obvious.</p>
<p>I'm not using ZFS at high volume like some people - I'd estimate that I've probably seen no more than a few dozen TBW across all the systems that use it - but <em>so far</em> I've seen no sign of silent corruption happening. That's not to say I'd be happy to disable checksums, but fear of silent data corruption is not the reason.</p>
</div>
2016-11-24T14:24:46Z
By Miksa on /blog/solaris/ZFSChecksumErrorMaybeSignal
<div class="wikitext"><p>That's a good point. Google and Backblaze have released statistical studies about hard drive failures, but it would be useful if they also released more in-depth data about possible signs of failure.</p>
</div>
2016-11-24T13:01:56Z
From 54.240.193.1 on /blog/solaris/ZFSChecksumErrorMaybeSignal
<div class="wikitext"><p>Sadly, we would all like a magic flag to indicate when our HDDs/SSDs are going to drop off a cliff.</p>
<p>When you only have a few hundred drives, it is very hard to separate correlation from causation. One SSD, one checksum failure; I am sorry, but it could be anything.</p>
<p>Realistically, we need to see a trend in 10,000 drives to draw a conclusion. Yes, this would be very useful...</p>
<p>I take it you're monitoring SMART stats through Nagios or similar.</p>
</div>2016-11-23T11:41:57Z