The tl;dr: how would I go about fixing a bad block on 1 disk in a RAID1 array?
But please read this whole thing for what I’ve tried already and possible errors in my methods. I’ve tried to be as detailed as possible, and I’m really hoping for some feedback.
This is my situation: I have two 2TB disks (same model) set up in a RAID1 array managed by mdadm. About 6 months ago I noticed the first bad block when SMART reported it. Today I noticed more, and am now trying to fix it.
This HOWTO page seems to be the one article everyone links to for fixing bad blocks that SMART is reporting. It’s a great page, full of info; however, it is fairly outdated and doesn’t address my particular setup. Here is how my config is different:
- Instead of one disk, I’m using two disks in a RAID1 array. One disk is reporting errors while the other is fine. The HOWTO is written with only one disk in mind, which brings up various questions such as ‘do I use this command on the disk device or the RAID device’?
- I’m using GPT, which fdisk does not support. I’ve been using gdisk instead, and I’m hoping that it is giving me the same info that I need.
So, let’s get down to it. This is what I have done; however, it doesn’t seem to be working. Please feel free to double check my calculations and method for errors. The disk reporting errors is /dev/sda:
# smartctl -l selftest /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.4.4-2-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure         90%            12169  3212761936
With this, we gather that the error resides on LBA 3212761936. Following the HOWTO, I use gdisk to find the start sector to be used later in determining the block number (as I cannot use fdisk since it does not support GPT):
# gdisk -l /dev/sda
GPT fdisk (gdisk) version 0.8.5
Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present
Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 3907029168 sectors, 1.8 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): CFB87C67-1993-4517-8301-76E16BBEA901
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      3907029134   1.8 TiB     FD00  Linux RAID
Using tune2fs I find the block size to be 4096. Using this info and the calculation from the HOWTO, I conclude that the block in question is ((3212761936 - 2048) * 512) / 4096 = 401594986.
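The arithmetic above can be double-checked in the shell. This is only a minimal sketch: the function name lba_to_fs_block is made up for illustration, and the optional fifth argument is an assumption that goes beyond the HOWTO formula. It covers setups where the filesystem does not begin at the first sector of the partition, and it defaults to 0 so the result matches the calculation above.

```bash
# Convert a disk LBA (as reported by SMART) into a filesystem block number.
# usage: lba_to_fs_block LBA PART_START SECTOR_SIZE FS_BLOCK_SIZE [EXTRA_OFFSET_SECTORS]
# EXTRA_OFFSET_SECTORS is a hypothetical extra parameter (default 0) for any
# additional offset between the partition start and the filesystem start.
lba_to_fs_block() {
    local lba=$1 part_start=$2 sector_size=$3 fs_block_size=$4 extra=${5:-0}
    echo $(( (lba - part_start - extra) * sector_size / fs_block_size ))
}

# Values taken from the smartctl and gdisk output above.
lba_to_fs_block 3212761936 2048 512 4096
```

With the numbers from this question the function prints 401594986, the same block number as the manual calculation.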
The HOWTO then directs me to debugfs to see if the block is in use. I use the RAID device, as debugfs needs an EXT filesystem; this was one of the commands that confused me, as I did not at first know whether I should use /dev/sda or /dev/md0:
# debugfs
debugfs 1.42.4 (12-June-2012)
debugfs: open /dev/md0
debugfs: testb 401594986
Block 401594986 not in use
So block 401594986 is empty space, I should be able to write over it without problems. Before writing to it, though, I try to make sure that it, indeed, cannot be read:
# dd if=/dev/sda1 of=/dev/null bs=4096 count=1 seek=401594986
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000198887 s, 20.6 MB/s
If the block could not be read, I wouldn’t expect this to work. However, it does. I repeat using /dev/sda, /dev/sda1, /dev/sdb, /dev/sdb1, and /dev/md0, and with ±5 on the block number to search around the bad block. It all works. I shrug my shoulders and go ahead and commit the write and sync (I use /dev/md0 because I figured modifying one disk and not the other might cause issues; this way both disks overwrite the bad block):
# dd if=/dev/zero of=/dev/md0 bs=4096 count=1 seek=401594986
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000142366 s, 28.8 MB/s
# sync
I would expect that writing to the bad block would have the disks reassign the block to a good one, however running another SMART test shows differently:
# 1 Short offline Completed: read failure 90% 12170 3212761936
Back to square 1. So basically, how would I fix a bad block on 1 disk in a RAID1 array? I’m sure I’ve not done something correctly…
Thanks for your time and patience.
EDIT 1:
I’ve tried to run a long SMART test, with the same LBA returning as bad (the only difference is it reports 30% remaining rather than 90%):
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure         30%            12180  3212761936
# 2  Short offline       Completed: read failure         90%            12170  3212761936
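When comparing several test runs like these, it can help to pull just the failing LBAs out of the log. The following is a rough sketch, not part of the HOWTO: the function name first_error_lbas is made up, and it assumes the log layout shown above, where each entry starts with "#" and the last column is either an LBA or "-".

```bash
# Print the LBA_of_first_error of every failed entry in a SMART self-test log.
# Reads the output of `smartctl -l selftest` on stdin.
first_error_lbas() {
    # Entry lines start with "#" and a test number; the last field is the LBA,
    # or "-" when the test completed without error (those are skipped).
    awk '/^# *[0-9]+ / && $NF ~ /^[0-9]+$/ { print $NF }'
}

# Example (needs root and a real disk; /dev/sda is the disk from this question):
#   smartctl -l selftest /dev/sda | first_error_lbas | sort -un
```

If every failed run reports the same number, as here, sort -un collapses them to a single line.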
I’ve also used badblocks, with the following output. The output is strange and seems to be misformatted. I tried to test the numbers it printed as blocks, but debugfs gives an error:
# badblocks -sv /dev/sda
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): 1606380968ne, 3:57:08 elapsed. (0/0/0 errors)
1606380969ne, 3:57:39 elapsed. (1/0/0 errors)
1606380970ne, 3:58:11 elapsed. (2/0/0 errors)
1606380971ne, 3:58:43 elapsed. (3/0/0 errors)
done
Pass completed, 4 bad blocks found. (4/0/0 errors)
# debugfs
debugfs 1.42.4 (12-June-2012)
debugfs: open /dev/md0
debugfs: testb 1606380968
Illegal block number passed to ext2fs_test_block_bitmap #1606380968 for block bitmap for /dev/md0
Block 1606380968 not in use
Not sure where to go from here. badblocks definitely found something, but I’m not sure what to do with the information presented…
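One way to make sense of these numbers: badblocks counts in blocks of 1024 bytes unless told otherwise with -b, while SMART reports 512-byte sectors and the filesystem here uses 4096-byte blocks, so the badblocks figures cannot be passed to debugfs as they are. A small sketch of the unit conversion follows; the function name badblocks_to_lba is made up for illustration, and the defaults assume badblocks was run without -b, as in the command above.

```bash
# Convert a badblocks block number into the first 512-byte sector it covers.
# usage: badblocks_to_lba BLOCK [BADBLOCKS_BLOCK_SIZE] [SECTOR_SIZE]
badblocks_to_lba() {
    local block=$1 bb_size=${2:-1024} sector_size=${3:-512}
    echo $(( block * bb_size / sector_size ))
}

# The four blocks reported by badblocks above; each 1024-byte block spans two sectors.
for b in 1606380968 1606380969 1606380970 1606380971; do
    first=$(badblocks_to_lba "$b")
    echo "badblocks block $b = sectors $first-$(( first + 1 ))"
done
```

The first block converts to sector 3212761936, the same LBA that the SMART self-test reports, and the four blocks together cover eight consecutive sectors, or 4096 bytes.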
EDIT 2
More commands and info.
I feel like an idiot for forgetting to include this originally. These are the SMART values for /dev/sda. I have 1 Current_Pending_Sector and 0 Offline_Uncorrectable.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 166
2 Throughput_Performance 0x0026 055 055 000 Old_age Always - 18345
3 Spin_Up_Time 0x0023 084 068 025 Pre-fail Always - 5078
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 75
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 12224
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 75
181 Program_Fail_Cnt_Total 0x0022 100 100 000 Old_age Always - 1646911
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 12
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 064 059 000 Old_age Always - 36 (Min/Max 22/41)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 252 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 30
223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 77
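To keep an eye on a single attribute such as Current_Pending_Sector between tests, the raw value can be filtered out of the table. This is a sketch that assumes the ten-column layout shown above, with the raw value in the tenth column; the function name smart_raw_value is made up.

```bash
# Print the RAW_VALUE of one SMART attribute.
# Reads the attribute table (output of `smartctl -A`) on stdin.
# usage: smartctl -A /dev/sda | smart_raw_value Current_Pending_Sector
smart_raw_value() {
    # Column 2 is ATTRIBUTE_NAME, column 10 is the start of RAW_VALUE.
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For the table above this would print 1 for Current_Pending_Sector and 0 for Reallocated_Sector_Ct.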
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu May 5 06:30:21 2011
Raid Level : raid1
Array Size : 1953512383 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953512383 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Tue Jul 3 22:15:51 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : server:0 (local to host server)
UUID : e7ebaefd:e05c9d6e:3b558391:9b131afb
Events : 67889
Number Major Minor RaidDevice State
2 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
As per one of the answers: it would seem I did switch seek and skip for dd. I was using seek as that’s what is used with the HOWTO. Using this command causes dd to hang:
# dd if=/dev/sda1 of=/dev/null bs=4096 count=1 skip=401594986
Using blocks around that one (..84, ..85, ..87, ..88) seems to work just fine, and using /dev/sdb1 with block 401594986 reads just fine as well (as expected as that disk passed SMART testing). Now, the question that I have is: When writing over this area to reassign the blocks, do I use /dev/sda1 or /dev/md0? I don’t want to cause any issues with the RAID array by writing directly to one disk and not having the other disk update.
EDIT 3
Writing to the block directly produced filesystem errors. I’ve chosen an answer that solved the problem quickly:
# 1 Short offline Completed without error 00% 14211 -
# 2 Extended offline Completed: read failure 30% 12244 3212761936
Thanks to everyone who helped. =)
Answers:
Method 1
All these “poke the sector” answers are, quite frankly, insane. They risk (possibly hidden) filesystem corruption. If the data were already gone, because that disk stored the only copy, it’d be reasonable. But there is a perfectly good copy on the mirror.
You just need to have mdraid scrub the mirror. It’ll notice the bad sector, and rewrite it automatically.
# echo 'check' > /sys/block/mdX/md/sync_action # use 'repair' instead for older kernels
You need to put the right device in there (e.g., md0 instead of mdX). This will take a while, as it does the entire array by default. On a new enough kernel, you can write sector numbers to sync_min/sync_max first, to limit it to only a portion of the array.
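For the sync_min/sync_max variant, a sketch might look like the following. The function names and the idea of passing the md sysfs directory as an argument are made up for illustration (the argument makes a dry run against a scratch directory possible). It assumes the kernel exposes sync_min and sync_max next to sync_action, that both take sector numbers relative to the start of the array, and that sync_max accepts the word max. The kernel md documentation also notes that the values must be multiples of the chunk size and that the scrub pauses at sync_max instead of finishing, hence the reset helper.

```bash
# Limit an md scrub to a window of sectors, then start a check.
# usage: scrub_window MD_SYSFS_DIR FIRST_SECTOR LAST_SECTOR
#   MD_SYSFS_DIR is e.g. /sys/block/md0/md (writing there needs root).
scrub_window() {
    local md_dir=$1 first=$2 last=$3
    if [ "$(cat "$md_dir/sync_action")" != "idle" ]; then
        echo "array is busy, not touching it" >&2
        return 1
    fi
    echo "$first" > "$md_dir/sync_min"
    echo "$last"  > "$md_dir/sync_max"
    echo check    > "$md_dir/sync_action"
}

# Stop the paused scrub and restore the defaults so later scrubs cover the whole array.
# usage: scrub_reset MD_SYSFS_DIR
scrub_reset() {
    local md_dir=$1
    echo idle > "$md_dir/sync_action"
    echo 0    > "$md_dir/sync_min"
    echo max  > "$md_dir/sync_max"
}
```

Progress can be followed in sync_completed or /proc/mdstat; once the window has been covered, run scrub_reset on the same directory.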
This is a safe operation. You can do it on all of your mdraid devices. In fact, you should do it on all your mdraid devices, regularly. Your distro likely ships with a cron job to handle this; you may need to do something to enable it.
Script for all RAID devices on the system
A while back, I wrote this script to “repair” all RAID devices on the system. This was written for older kernel versions where only ‘repair’ would fix the bad sector; now just doing check is sufficient (repair still works fine on newer kernels, but it also re-copies/rebuilds parity, which isn’t always what you want, especially on flash drives).
#!/bin/bash
save="$(tput sc)";
clear="$(tput rc)$(tput el)";
for sync in /sys/block/md*/md/sync_action; do
    md="$(echo "$sync" | cut -d/ -f4)"
    cmpl="/sys/block/$md/md/sync_completed"
    # check current state and get it repairing.
    read current < "$sync"
    case "$current" in
        idle)
            echo 'repair' > "$sync"
            true
            ;;
        repair)
            echo "WARNING: $md already repairing"
            ;;
        check)
            echo "WARNING: $md checking, aborting check and starting repair"
            echo 'idle' > "$sync"
            echo 'repair' > "$sync"
            ;;
        *)
            echo "ERROR: $md in unknown state $current. ABORT."
            exit 1
            ;;
    esac
    echo -n "Repair $md...$save" >&2
    read current < "$sync"
    while [ "$current" != "idle" ]; do
        read stat < "$cmpl"
        echo -n "$clear $stat" >&2
        sleep 1
        read current < "$sync"
    done
    echo "$clear done." >&2;
done
for dev in /dev/sd?; do
    echo "Starting offline data collection for $dev."
    smartctl -t offline "$dev"
done
If you want to do check instead of repair, then this (untested) first block should work:
    case "$current" in
        idle)
            echo 'check' > "$sync"
            true
            ;;
        repair|check)
            echo "NOTE: $md $current already in progress."
            ;;
        *)
            echo "ERROR: $md in unknown state $current. ABORT."
            exit 1
            ;;
    esac
Method 2
I’ve just had pretty much the same problem with a RAID1 array. The bad sector was right at the beginning of one of the partitions: sector 16 of /dev/sdb2. I followed the instructions above: after verifying that logical block 2 was not in use by the file system, and being careful to get dd seek and skip the right way around, I zeroed out 1 file system block:
# dd if=/dev/zero of=/dev/md0 bs=4096 count=1 seek=2
What did this do? It did not fix the bad sector. This, I now know, is because /dev/md0 does not map directly on to /dev/sdb2, you have to take account of the RAID DATA OFFSET!
More on this below. What it did do was a small but potentially devastating turd on my file system. It turns out that logical block 2 of /dev/md0 contained useful filesystem metadata and was fine on both disks, until I crapped on both copies by writing to /dev/md0. Luckily, e2fsck -y /dev/md0 fixed the problem (after spewing an alarming amount of output) with no apparent data loss. Lesson learned: if debugfs icheck says ‘block not found’, it doesn’t necessarily mean the corresponding sectors are not used.
Back to the data offset: use mdadm to find the offset like this:
# mdadm --examine /dev/sdb2
/dev/sdb2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : ef7934b9:24696df9:b89ff03e:b4e5a05b
Name : XXXXXXXX
Creation Time : Sat Sep 1 01:20:22 2012
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 1953241856 (931.38 GiB 1000.06 GB)
Array Size : 976620736 (931.38 GiB 1000.06 GB)
Used Dev Size : 1953241472 (931.38 GiB 1000.06 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : f3b5d515:446d4225:c2191fa0:9a9847b8
Update Time : Thu Sep 6 12:11:24 2012
Checksum : abb47d8b - correct
Events : 54
Device Role : Active device 0
Array State : AA ('A' == active, '.' == missing)
In this case, the data offset is 262144 sectors of 512 bytes. If you dd from /dev/md0 and compare it with data from the raw partition at an offset of 131072K, you’ll find they match. So in my case, logical block 2 (sectors 16–23) of /dev/sdb2 is not even in the file system; it lies in the area before the data offset that is reserved for the RAID superblock, which you can read about here:
https://raid.wiki.kernel.org/index.php/RAID_superblock_formats
For version 1.2, the superblock consists of 256 bytes + 2 bytes per device in the array, all starting 4096 bytes in, so in my case the bad sector was not used. The corresponding sectors of /dev/sdc2 (the other half of the RAID1 array) are zero, so I figured it would be safe to do this:
# dd if=/dev/zero of=/dev/sdb2 bs=4096 count=1 seek=2
It worked!
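The offset arithmetic described above can be sketched in the shell. The function names are made up for illustration, and the mapping shown assumes RAID1, where each member holds a straight copy of the array data starting at the data offset. The 4096-byte block size and 512-byte sector size are the values used in this answer.

```bash
# Map a filesystem block on the md device to the first sector on a member partition.
# usage: md_block_to_member_sector FS_BLOCK DATA_OFFSET_SECTORS [FS_BLOCK_SIZE] [SECTOR_SIZE]
md_block_to_member_sector() {
    local block=$1 data_offset=$2 fs_block_size=${3:-4096} sector_size=${4:-512}
    echo $(( block * fs_block_size / sector_size + data_offset ))
}

# Map a sector on a member partition back to a sector of the md device.
# usage: member_sector_to_md_sector MEMBER_SECTOR DATA_OFFSET_SECTORS
member_sector_to_md_sector() {
    local sector=$1 data_offset=$2
    if [ "$sector" -lt "$data_offset" ]; then
        echo "before data offset (RAID metadata area, not array data)"
    else
        echo $(( sector - data_offset ))
    fi
}

md_block_to_member_sector 2 262144     # block 2 of /dev/md0 starts at this sector of /dev/sdb2
member_sector_to_md_sector 16 262144   # the bad sector 16 of /dev/sdb2
echo "$(( 262144 * 512 / 1024 ))K"     # the data offset expressed in KiB
```

This prints 262160, the "before data offset" note, and 131072K: block 2 of /dev/md0 lives 262144 sectors further into /dev/sdb2 than block 2 of the partition itself, which is why the write to /dev/md0 hit filesystem metadata rather than the bad sector.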
Method 3
If you are running Debian, you most likely have a job in /etc/cron.d/mdadm. It runs /usr/share/mdadm/checkarray --cron --all --idle --quiet on the first Sunday of every month. Run that manually when you get uncorrectable hardware errors to expedite the rewrite.
Method 4
You mixed up your dd arguments. seek causes it to seek to the specified offset in the output. You wanted to skip blocks on the input.
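The difference is easy to see on a scratch file, without touching any disk. This is a small demonstration with made-up file contents: four 4-byte "blocks" stand in for filesystem blocks.

```bash
src=$(mktemp)
dst=$(mktemp)
printf 'AAAABBBBCCCCDDDD' > "$src"   # block 0 = AAAA, 1 = BBBB, 2 = CCCC, 3 = DDDD

# skip=2 skips two blocks of the INPUT, so this reads block 2.
with_skip=$(dd if="$src" bs=4 count=1 skip=2 2>/dev/null)

# seek=2 skips two blocks of the OUTPUT; the input is still read from block 0.
dd if="$src" of="$dst" bs=4 count=1 seek=2 2>/dev/null
with_seek=$(tail -c 4 "$dst")

echo "skip=2 read: $with_skip"
echo "seek=2 read: $with_seek"
rm -f "$src" "$dst"
```

The first line shows CCCC and the second AAAA. In other words, a read test with of=/dev/null and seek=N always reads block 0 of the input, which is why the original read test succeeded on every device.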
Method 5
If you have a software RAID1 and you write data to one of the members directly, you will immediately have a corrupted RAID. DO NOT write data to sdaX or sdbX if they are part of an mdX. If you write to mdX, the data is copied to both drives; if you read from mdX, the data is read from one of the drives.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.