bit rot detection and correction with mdadm

I’m about to reorganise all the HDDs in my home Linux NAS and would like to use mdadm RAID for data protection and for its flexibility in reshaping arrays. Before I do, I’d like to know how it handles bit rot, specifically the kinds of bit rot that do not cause the HDD to report an unrecoverable read error.

Given that I’ll likely be using at least 21TB across 8 disks in the NAS, and given the various quoted probabilities of HDD failure, I think that during a rebuild after a single disk failure I’m reasonably likely to encounter some form of bit rot on the remaining disks. If it is an unrecoverable read error that the drive actually reports as an error, I believe RAID6 should handle it (is that right?). However, if the data read from the disk is bad but the disk does not report it as such, I can’t see how this can be corrected automatically, even with RAID6. Is this something we need to be concerned about? Given the article It is 2010 and RAID5 still works, and my own successful experiences at home and at work, things are not necessarily as doom and gloom as the buzzwords and marketing would have us believe, but I hate having to restore from backups just because a HDD failed.

Given that the usage pattern will be write at most a few times and read occasionally, I’ll need to perform data scrubbing. The Arch Linux wiki gives the following mdadm command for scrubbing an array:

echo check > /sys/block/md0/md/sync_action

then to monitor the progress

cat /proc/mdstat

As I understand it, this reads all sectors of all disks and checks that the data matches the parity and vice versa. However, the docs heavily emphasise that there are significant circumstances in which the “check” operation can only detect a problem, not correct it automatically, leaving the fix to the user.
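To make the "detect only" behaviour concrete, here is a minimal sketch that reads the two sysfs files md exposes for a scrub, sync_action and mismatch_cnt, and summarises them. The function name scrub_report and its argument are my own invention; on a real system the argument would be something like /sys/block/md0/md.

```shell
# scrub_report MD_DIR
# Hypothetical helper: summarises the state of an md scrub.
# MD_DIR is assumed to be the array's sysfs directory, e.g. /sys/block/md0/md.
scrub_report() {
    md_dir=$1
    state=$(cat "$md_dir/sync_action")
    count=$(cat "$md_dir/mismatch_cnt")
    if [ "$state" != "idle" ]; then
        # A check, repair, resync or recovery is still in progress.
        echo "scrub still running ($state)"
    elif [ "$count" -eq 0 ]; then
        echo "no mismatches found"
    else
        # md only counts the inconsistent sectors here; it does not say
        # which member disk holds the bad data.
        echo "$count mismatched sectors found"
    fi
}
```

Calling scrub_report /sys/block/md0/md after a check has finished gives a one-line summary. A non-zero count tells you something is inconsistent, not which copy is right.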

What mdadm RAID level(s) should I choose to maximise my protection from bit rot and what maintenance and other protective steps should I be doing? And what will this not protect me from?

Edit: I’m not looking to start a RAID vs ZFS or any other technology QA. I want to know specifically about mdadm raid. That is also why I’m asking on Unix & Linux and not on SuperUser.

Edit: is the answer that mdadm can correct only UREs that the disks report during a data scrub, and that it can detect silent bit rot during a scrub but cannot or will not fix it?

Answers:

Method 1

Frankly, I find it rather surprising that you’d reject RAIDZ2 ZFS. It seems to suit your needs almost perfectly, except for the fact that it isn’t Linux MD. I’m not on a crusade to bring ZFS to the masses, but the simple fact is that yours is one of the kinds of problems that ZFS was designed from the ground up to solve. Relying on RAID (any “regular” RAID) to provide error detection and correction possibly in a reduced- or no-redundancy situation seems risky. Even in situations where ZFS cannot correct a data error properly, it can at least detect the error and let you know that there is a problem, allowing you to take corrective action.

You don’t have to do regular full scrubs with ZFS, although it is recommended practice. ZFS will verify that the data read from disk matches what was written as the data is being read, and in the case of a mismatch either (a) use redundancy to reconstruct the original data, or (b) report an I/O error to the application. Also, scrubbing is a low-priority, online operation, quite different from a file system check in most file systems which can be both high-priority and offline. If you’re running a scrub and something other than the scrub wants to do I/O, the scrub will take the back seat for the duration. A ZFS scrub takes the place of both a RAID scrub and a file system metadata and data integrity check, so is a lot more thorough than just scrubbing the RAID array to detect any bit rot (which doesn’t tell you if the data makes any sense whatsoever, only that it’s been written correctly by the RAID controller).

ZFS redundancy (RAIDZ, mirroring, …) has the advantage that unused disk locations don’t need to be checked for consistency during scrubs; only actual data is checked during scrubs, as the tools walk the allocation block chain. This is the same as with a non-redundant pool. For “regular” RAID, all data (including any unused locations on disk) must be checked because the RAID controller (whether hardware or software) has no idea what data is actually relevant.

By using RAIDZ2 vdevs, any two constituent drives can fail before you are at risk of actual data loss from another drive failure, as you have two drives’ worth of redundancy. This is essentially the same as RAID6.

In ZFS all data, both user data and metadata, is checksummed (except if you choose not to, but that is recommended against), and these checksums are used to confirm that the data hasn’t changed for any reason. Again, if a checksum does not match the expected value, the data will either be transparently reconstructed or an I/O error will be reported. If an I/O error is reported, or a scrub identifies a file with corruption, you will know for a fact that the data in that file is potentially corrupted and can restore that specific file from backup; no need for a full array restore.

Plain, even double-parity, RAID doesn’t protect you against situations like for example when one drive fails and one more reads the data incorrectly off the disk. Suppose one drive has failed and there’s a single bit flip anywhere from any one of the other drives: suddenly, you’ve got undetected corruption, and unless you’re happy with that you’ll need a way to at least detect it. The way to mitigate that risk is to checksum each block on disk and make sure the checksum cannot be corrupted along with the data (protecting against errors like high-fly writes, orphan writes, writes to incorrect locations on disk, etc.), which is exactly what ZFS does as long as checksumming is enabled.
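If you stay on plain RAID, a rough file-level analogue of this idea is to keep a checksum manifest yourself and verify it periodically. The sketch below is only that, an analogue: it is not how ZFS stores or verifies its checksums, and the function names make_manifest and verify_manifest are hypothetical. It assumes GNU coreutils' sha256sum, an absolute path for the manifest, and a manifest stored outside the directory being checked.

```shell
# make_manifest DIR MANIFEST
# Record a SHA-256 checksum for every regular file under DIR.
# MANIFEST is assumed to be an absolute path outside DIR.
make_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + ) > "$2"
}

# verify_manifest DIR MANIFEST
# Exit status 0 only if every file listed in MANIFEST still matches.
verify_manifest() {
    ( cd "$1" && sha256sum -c "$2" ) > /dev/null 2>&1
}
```

This detects silent corruption of a file, but it cannot repair it, and it will also flag legitimate edits, so it suits the write-rarely, read-occasionally usage described in the question. The manifest itself should be backed up, since it is as vulnerable to bit rot as the data.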

The only real downside is that you cannot easily grow a RAIDZ vdev by adding devices to it. There are workarounds for that, usually involving things like sparse files as devices in a vdev, and very often termed “I wouldn’t do this if it was my data”. Hence, if you go a RAIDZ route (regardless of whether you go with RAIDZ, RAIDZ2 or RAIDZ3), you need to decide up front how many drives you want in each vdev. Although the number of drives in a vdev is fixed, you can grow a vdev by gradually (making sure to stay within the redundancy threshold of the vdev) replacing the drives with larger-capacity ones and allowing a complete resilver.

Method 2

This answer is the product of reasoning from the various bits of evidence I’ve found. I don’t know how the Linux kernel implementation works, as I am not a kernel dev, and there seems to be a fair amount of nonsensical misinformation out there. I presume that the Linux kernel makes sane choices. My answer should apply unless I am mistaken.

Many drives use ECCs (error-correcting codes) to detect read errors. If data is corrupt, the kernel should receive a URE (unrecoverable read error) for that block from an ECC supporting drive. Under these circumstances (and there is an exception below), copying corrupt, or empty, data over good data would amount to insanity. In this situation the kernel should know which is good data and which is bad data. According to the It is 2010 and RAID5 still works … article:

Consider this alternative, that I know to be used by at least a couple of array vendors. When a drive in a RAID volume reports a URE, the array controller increments a count and satisfies the I/O by rebuilding the block from parity. It then performs a rewrite on the disk that reported the URE (potentially with verify) and if the sector is bad, the microcode will remap and all will be well.

However, now for the exception: if a drive does not support ECC, a drive lies about data corruption, or the firmware is particularly dysfunctional, then a URE may not be reported, and corrupted data would be given to the kernel. In the case of mismatching data, it seems that if you are using a two-disk RAID1 or a RAID5, the kernel can’t know which data is correct, even in a non-degraded state, because there is only one redundant copy or parity block and no URE was reported. In a three-disk RAID1 or a RAID6, a single corrupted block that was not flagged with a URE would disagree with the remaining redundancy (the other copies, or the second parity in combination with the other associated blocks), so proper automatic recovery should be possible.
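The single-parity case can be illustrated with a toy stripe of three data bytes and one XOR parity byte. This is only arithmetic on made-up values, not md's actual code path; xor_parity is a hypothetical helper.

```shell
# xor_parity B1 B2 ...
# XOR of the given byte values (0-255), as single parity would be computed.
xor_parity() {
    p=0
    for b in "$@"; do
        p=$((p ^ b))
    done
    echo "$p"
}

# Toy stripe: three data bytes plus parity (values chosen arbitrarily).
d1=165 d2=60 d3=15
parity=$(xor_parity "$d1" "$d2" "$d3")               # 150

# Reported failure (URE) of d2: the position is known, so it can be rebuilt
# from the surviving blocks and the parity.
rebuilt=$(xor_parity "$d1" "$d3" "$parity")          # 60

# Silent corruption of d2 (one bit flipped): the stripe no longer checks out.
bad_d2=61
check=$(xor_parity "$d1" "$bad_d2" "$d3" "$parity")  # non-zero means mismatch

# The same non-zero result would appear if d1, d3 or the parity had lost
# that bit instead, so single parity alone cannot say which block is wrong.
```

The rebuild works only because the drive said which block was bad. Without that hint, the mismatch is visible but the culprit is not.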

The moral of the story is: use drives with ECC. Unfortunately not all drives that support ECC advertise this feature. On the other hand, be careful: I know someone who used cheap SSDs in a 2 disk RAID1 (or a 2 copy RAID10). One of the drives returned random corrupted data on each read of a particular sector. The corrupted data was automatically copied over the correct data. If the SSD used ECCs, and was properly functioning, then the kernel should have taken proper corrective action.

Method 3

I don’t have enough rep to comment, but I want to point out that the mdadm system in Linux DOES NOT correct silent errors of this kind. If you tell it to “fix” errors during a scrub of, say, RAID6 and it finds an inconsistency, it will “fix” it by assuming the data portions are correct and recalculating the parity.
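A toy example of what "trust the data, recompute the parity" means in practice, using XOR single parity on made-up byte values for simplicity. It is a sketch of the behaviour described in this answer, not md's implementation, and xor_parity is a hypothetical helper.

```shell
# xor_parity B1 B2 ...
# XOR of the given byte values (0-255).
xor_parity() {
    p=0
    for b in "$@"; do
        p=$((p ^ b))
    done
    echo "$p"
}

# Healthy stripe: three data bytes and their parity.
d1=165 d2=60 d3=15
parity=$(xor_parity "$d1" "$d2" "$d3")

# Silent corruption: d2 changes on disk without any read error.
d2=61

# "Repair" as described above: assume the data is right, rewrite the parity.
parity=$(xor_parity "$d1" "$d2" "$d3")

# The stripe is now self-consistent again (XOR of everything is 0) ...
check=$(xor_parity "$d1" "$d2" "$d3" "$parity")
# ... but d2 still holds the corrupted value 61, and the evidence is gone.
```

After such a repair a later check would report no mismatch, so the corrupted block can only be caught by something outside the array, such as a backup comparison or file checksums.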

Method 4

For the protection you want, I’d go with RAID6 + the normal offsite backup in 2 locations.

I personally scrub once a week anyway, and back up nightly, weekly and monthly depending on the importance of the data and how quickly it changes.
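A weekly scrub can be scripted along these lines. This is a minimal sketch, not my exact setup: start_scrubs is a hypothetical name, the optional argument exists mainly so the function can be tried against a scratch directory, and it must run as root to write to sysfs.

```shell
# start_scrubs [SYS_BLOCK]
# Start a "check" on every md array that is currently idle.
# SYS_BLOCK defaults to /sys/block.
start_scrubs() {
    sys_block=${1:-/sys/block}
    for action in "$sys_block"/md*/md/sync_action; do
        [ -e "$action" ] || continue      # the glob matched nothing
        if [ "$(cat "$action")" = "idle" ]; then
            echo check > "$action"
            echo "started check: $action"
        fi
        # Arrays that are already resyncing or recovering are left alone.
    done
}
```

Calling start_scrubs from a weekly root cron job or timer would do the scheduling. Some distributions already ship a periodic md check job, so it is worth looking for one before adding your own.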

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
