Have you ever worried about Silent Errors?

Do you ever worry that data you just saved to disk might be corrupt? No? You’re not alone. Still, this risk is always there, even in the most advanced storage systems. Your hard drive won’t always tell you when it fails to write data, and by the time you read it back, it’s too late. And what about the cache memory in your operating system, and possibly in your RAID controller(s)? Ever thought about what happens to its contents if you have a power failure, or if some of the memory fails?

Researchers at CERN in Switzerland studied this topic in 2007. Their findings were rather shocking, but didn’t attract much attention at the time (Robin Harris was one of the few people to write about this, if I remember correctly). Across all components involved in their study, the total byte error rate was around 3 * 10^-7. That might sound like a tiny number, but it works out to about 3 errors per TB of data. Jeff Bonwick wrote the following in 2005:

“Arbitrarily expensive storage arrays can’t solve the problem. The I/O path remains just as vulnerable, but becomes even longer: after leaving the platter, the data has to survive whatever hardware and firmware bugs the array has to offer. And if you’re on a SAN, you’re using a network designed by disk firmware writers. God help you.”

This is a bold statement, obviously meant to draw attention to the fact that Sun’s ZFS filesystem is doing its part to provide extra protection against silent data corruption. If you read the rest of his blog entry, you’ll notice that several storage vendors do their part to protect your data; but having an extra layer of protection on the filesystem level is nice to have. BTRFS, the most likely candidate for the next default filesystem for millions of Linux users, also includes checksumming at various levels, and I expect more filesystems to follow.

But I digress; the reason I started writing this post was this entry on the Cleversafe blog. In it they describe how their dispersed storage system deals with checksumming and data corruption; it provides some interesting peeks behind the scenes. Because data in a dsNet is spread across many servers, it’s easier to detect and route around errors. Here’s a quick summary of some of the techniques they use to deal with what would otherwise be silent errors:

  • Transactional write semantics (atomic commit/rollback)
  • Write thresholds
  • Revisions
  • Up to a 512-bit data source integrity check value
  • Use of CRC-32 checksums on storage servers
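
To make a few of these ideas concrete, here is a minimal Python sketch of how a per-slice CRC-32, a 512-bit integrity check value, and a write threshold could work in principle. This is not Cleversafe’s implementation: the function names (`store_slice`, `load_slice`, `write_succeeded`) are hypothetical, the on-disk layout (payload followed by a 4-byte CRC) is an assumption, and SHA-512 is only one plausible choice for a 512-bit check value.

```python
import hashlib
import zlib


def store_slice(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to a slice before writing it (assumed layout)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def load_slice(stored: bytes) -> bytes:
    """Return the payload, or raise if the stored CRC-32 no longer matches."""
    payload, trailer = stored[:-4], stored[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("slice failed CRC-32 check")
    return payload


def write_succeeded(acks: int, write_threshold: int) -> bool:
    """Treat a write as committed only if enough servers acknowledged it."""
    return acks >= write_threshold


data = b"important data"

# 512-bit check value over the whole object (SHA-512 is an assumption here).
digest = hashlib.sha512(data).digest()

stored = store_slice(data)
roundtrip = load_slice(stored)

# Flip a single bit to simulate silent corruption on one storage server.
corrupted = bytes([stored[0] ^ 0x01]) + stored[1:]
try:
    load_slice(corrupted)
    detected = False
except ValueError:
    detected = True
```

The point of the sketch is that the corruption is no longer silent: the read fails loudly, and a system that spreads data across many servers can then fetch the slice from elsewhere instead of returning bad data.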

Be sure to read their entire post for more details!
