unrecoverable I/O read error for block 976557696

TLDR; (SHORT VERSION)

I'm going to walk you through this one. It happened during a RAID recovery: sda is in the system and I'm trying to add a new sdb. I hate posting logs, but it looks like this:

Dec 24 15:38:10 server kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Dec 24 15:38:10 server kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Dec 24 15:38:10 server kernel: sda: Current [descriptor]: sense key: Medium Error
Dec 24 15:38:10 server kernel:     Add. Sense: Unrecovered read error - auto reallocate failed
Dec 24 15:38:10 server kernel:
Dec 24 15:38:10 server kernel: Descriptor sense data with sense descriptors (in hex):
Dec 24 15:38:10 server kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 24 15:38:10 server kernel:         3a 38 46 23
Dec 24 15:38:10 server kernel: raid1: sda: unrecoverable I/O read error for block 976557696
Dec 24 15:38:10 server kernel: ata1: EH complete

It can't resync the RAID because there's a read error on the only remaining disk, sda. I wish it would just skip over the bad block: a working RAID is worth more than one bit of data that probably doesn't matter anyway. I think it's a bug in the mdadm package that's fixed in current versions, but I'm stuck with what I have.

The fix, in short: clone the disk and move on with life. Here's how:

================================================
Use ddrescue to clone the disk. It's a tool on the SystemRescueCd boot CD, found here:
http://www.sysresccd.org

So download the disk image and burn it.
Boot to system rescue cd with defaults.
The raid starts to rebuild automatically, so we will have to stop it. First, fail and remove sdb1:

mdadm -f /dev/md1 /dev/sdb1
mdadm -r /dev/md1 /dev/sdb1

Now, fail and remove sdb2:

mdadm -f /dev/md2 /dev/sdb2
mdadm -r /dev/md2 /dev/sdb2

Now let's clone the disk. The first pass (-n) grabs everything it can without retrying or scraping the bad areas; the second pass uses direct disk access (-d) and retries the bad areas up to three times (-r3), both passes sharing the same log file:

ddrescue -f -n /dev/sda /dev/sdb rescue.log
ddrescue -d -f -r3 /dev/sda /dev/sdb rescue.log

Now let's shut down.

Remove sda. Now reboot.

It boots fine! It recognizes the new cloned drive as sda.

Shutdown:

signal-event halt

Reposition drive sda to the first slot for sanity's sake.
Place the new hard drive into the second slot.
Reboot.

console
manage disk array redundancy.
add sdb into array.
wait for resync.

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[1] sda2[0]
488279552 blocks [2/2] [UU]

resync's fine!!!

LONG VERSION

Here's the long version with all of my notes. It's the perfect example of a sysadmin's downward spiral into the 5th circle of hell.  It may or may not make sense.

The error return code is below. I had to look it up. The host byte says there's no error; the driver byte says sense data is available, and the status byte is a check condition (the Medium Error that follows).
==============================================
0x08000002
DRIVER | HOST | MESSAGE | CONDITION
08 | 00 | 00 | 02
cat /usr/src/kernels/2.6.18-308.16.1.el5PAE-i686/include/scsi/scsi.h
DRIVER SENSE | NO ERROR (DID_OK) | COMMAND COMPLETE | SAM_STAT_CHECK_CONDITION
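
If you ever want to decode one of these yourself, the four bytes can be peeled apart with plain shell arithmetic; a quick sketch (nothing here is system-specific):

# split a SCSI return code into its driver/host/message/status bytes
code=0x08000002
printf 'driver=%02x host=%02x message=%02x status=%02x\n' \
  $(( (code >> 24) & 0xff )) $(( (code >> 16) & 0xff )) \
  $(( (code >>  8) & 0xff )) $((  code        & 0xff ))
# prints: driver=08 host=00 message=00 status=02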

Here's the real issue. It's a read error on the live disk.
==============================================
raid1: sda: unrecoverable I/O read error for block 976557696

This link explains it better than I do:
http://www.sjvs.nl/?p=12

In theory, this would work. First check the read of the sector:
==============================================
$ hdparm --read-sector 976766499 /dev/sda: Input/Output error

Then write to the sector:
==============================================
hdparm --write-sector 976766499 /dev/sda
hdparm --write-sector 976766499 --yes-i-know-what-i-am-doing /dev/sda

But I don't have those options in hdparm. I'm at hdparm v6.6 and these options are in hdparm v8.1. Hmmm... Let's look at the status:
==============================================
cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[2](S) sda2[0]
488279552 blocks [2/1] [U_]

Let's look at the details of md2:
==============================================
mdadm --detail /dev/md2

So md2 has:
1 active-sync member
1 removed member
1 spare member

We can't add the new drive into md2 because it's already added:
==============================================
mdadm -a /dev/md2 /dev/sdb2
mdadm: Cannot open /dev/sdb2: Device or resource busy

So we have to fail it and remove it:
==============================================
mdadm -f /dev/md2 /dev/sdb2
mdadm -r /dev/md2 /dev/sdb2

Now let's try to re-add it with the re-add option:
==============================================
mdadm /dev/md2 --remove detached
mdadm --re-add /dev/md2 /dev/sdb2

Same result. Humph...

==============================================
Let's remove the device, shutdown and add the original sdb back in.

Same result. Humph...

==============================================
We're back to the bad block. Exact same message, exact same block:
raid1: sda: unrecoverable I/O read error for block 976557696

Modern hard disk drives are equipped with a small number of spare sectors to reallocate damaged sectors. However, a sector only gets reallocated when a write operation fails; a failing read operation will, in most cases, only throw an I/O error. In the unlikely event that a second read succeeds, some disks perform an auto-reallocation and the data is preserved. In my case, the second read failed miserably (“Unrecovered read error – auto reallocate failed“).

If there were another correctly working disk in the raid, I would just replace the bad disk. But this is the only disk left.
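
You can get a feel for how much spare-sector activity a disk has already had from its SMART attributes; a quick check (these are the usual attribute names, your drive may label them slightly differently):

smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'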

==============================================
Let's see the logical volume:
lvdisplay --maps |egrep 'Physical|LV Name|Type'

OK, it's: /dev/main/root

I'm going to try to run badblocks on the lv:
badblocks -n -s /dev/main/root

I get:
/dev/main/root is mounted; it's not safe to run badblocks!

===============================================
Boot from Install CD.
Type: sme rescue (at prompt)

Select your language. Select keyboard language.
Select READ/WRITE

Unmount the LVM volume:
umount -l /dev/main/root

Check for bad blocks on the lv:
badblocks -n -s /dev/main/root

That took too long. Stopped at 1% complete.

================================================
Let's check the remaining disk:
smartctl -a /dev/sda

Gives the error log:
40 51 08 22 46 38 e0  Error: UNC 8 sectors at LBA = 0x00384622 = 3687970

And also gives the proper LBA error:
LBA_of_first_error
976766499

This is the decimal number. The hexadecimal number is: 0x3a384623.
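
The conversion is easy to double-check from a shell; printf happily goes in both directions:

printf '0x%x\n' 976766499      # prints 0x3a384623
printf '%d\n' 0x3a384623       # prints 976766499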

The system has a RAID. LVM is on the RAID. The file system is on LVM. Trying to map the bad physical block to the LVM logical block is nearly impossible because it's all manual calculation. But I try anyway.

Finding the sda layout:
sfdisk -luS /dev/sda

Gives:
Disk /dev/sda: 60801 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

Device Boot    Start       End   #sectors  Id  System
/dev/sda1   *         1    208769     208769  fd  Linux raid autodetect
/dev/sda2        208770 976768063  976559294  fd  Linux raid autodetect

So the bad LBA is in sda2.

The LBA minus the start of the partition:
976766499 - 208770 = 976557729

Let's get the size of the PHYSICAL EXTENT of the PHYSICAL VOLUME:
pvdisplay
--- Physical volume ---
PV Name               /dev/md2
VG Name               main
PV Size               465.66 GB / not usable 3.31 MB
Allocatable           yes (but full)
PE Size (KByte)       32768
Total PE              14901
Free PE               0
Allocated PE          14901
PV UUID               M1IIOi-nln7-encf-DgZC-DtjF-0ZGs-4OEFgK

So the PE size is 32768 KB.

In 512-byte LBA sectors, that's 32768 * 2 = 65536 sectors per PE.

Let's find the offset of the beginning of the first PE:
pvs -o+pe_start $part

Gives:
192.00K

Let's calculate the physical partition's bad block number / sizeof(PE) =
976557729 / 65536 = 14901.0883942

Find the LV that holds PE 14901:
lvdisplay --maps |egrep 'Physical|LV Name|Type'

Gives:
LV Name                /dev/main/root
Type                linear
Physical volume     /dev/md2
Physical extents    0 to 4702
Type                linear
Physical volume     /dev/md2
Physical extents    4765 to 14900

So we know it's in /dev/main/root

We need to know the block size of the LV:
dumpe2fs /dev/main/root | grep 'Block size'

Gives:
Block size:               4096

The logical partition begins on PE 4765

So let's find the bad block:
(# of PEs to the start of the partition * sizeof(PE)) + partition offset [pe_start] =
(4096 * 65536) + 192 = 268435648

Let's try reading that block from the FS:
dd if=/dev/main/root of=block268435648 bs=4096 count=1 skip=268435648

Nothing.

debugfs
debugfs 1.32 (09-Nov-2002)
debugfs:  open /dev/main/root
debugfs:  testb 268435648


I got:
Illegal block number passed to ext2fs_test_block_bitmap #268435648 for block bitmap for /dev/main/root
Block 268435648 not in use

This means my calculations are wrong.

Crap...

====================================
Back to the drawing board. Here's what I know:
debugfs must be used on the filesystem. The filesystem is on /dev/main/root
The disk has sectors of 512 bytes.
The logical block size of the FS is 4096 (Or the block size is 8 disk sectors)
The error message is: raid1: sda: unrecoverable I/O read error for block 976557696
(I don't know if this is the block on sda, sda2, md2 or /dev/main/root).
Doing an e2fsck -cvy /dev/main/root or a badblocks -n -s /dev/main/root takes too long.
The LBA_of_first_error is 976766499
The starting sector of sda2 is 208770.
The last sector of sda2 is 976768063.
The total sectors in sda2 is 976559294.
The problem is at 976557729 inside partition sda2.

====================================
If no LVM it would be:
((976766499 - 208770) * 512) / 4096 = 122069716.125

http://smartmontools.sourceforge.net/badblockhowto.html#bb

So I check to see if it's in the repos
yum --enablerepo=* search sg3

I get some packages back.

So I look at more info about it:
yum info sg3_utils

I get back the package is in the centos base repo.

Feeling safe, I install the package:
yum install sg3_utils

Now I can verify the LBA block without feeling nauseated:
sg_verify --lba=976766499 /dev/sda

I get:
verify (10):  Descriptor format, current;  Sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
Descriptor type: Information
0x000000003a384623
medium or hardware error, reported lba=0x3a384623

I poke around other LBA from the logs:
sg_verify --lba=976557696 /dev/sda

I get nothing.

I'm all in and going to reassign the block:
sg_reassign --address=976766499 /dev/sda

I get:
REASSIGN BLOCKS not supported

Crap... The good news is that I've verified the LBA block or physical block.
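
Another commonly used way to force the reallocation (and roughly what hdparm --write-sector ends up doing below) is to write zeros directly over the bad sector with dd. A sketch only; this destroys whatever was stored in that sector, so be sure of the LBA first:

# overwrite one 512-byte sector at LBA 976766499 on /dev/sda, bypassing the page cache
dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=976766499 oflag=direct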

================================================
Physical block (LBA): 976766499
File system block: 3687970

b = (int)((L-S)*512/B)

where b is the file system block number, L is the bad LBA, S is the starting sector of the partition, and B is the file system block size in bytes (this is the formula from the smartmontools badblock howto linked above).
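
Plugging in the numbers from above (L = 976766499, S = 208770, B = 4096), a quick shell check, which matches the integer part of the "if no LVM" figure computed earlier:

echo $(( (976766499 - 208770) * 512 / 4096 ))
# prints 122069716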

================================================
I'm down to:
using a live CD to run a newer hdparm,
or
SpinRite,
or
calculating something so I can use dd on the LVM filesystem.

================================================
I downloaded a GParted live CD & burned it to a CD.
Booted from the GParted live CD and entered command-line mode.

hdparm --read-sector 976766499 /dev/sda: Input/Output error

Then write to the sector:
hdparm --write-sector 976766499 /dev/sda
hdparm --write-sector 976766499 --yes-i-know-what-i-am-doing /dev/sda

Read sector again:
hdparm --read-sector 976766499 /dev/sda
(a lot of zeroes should follow)
================================================

New error:
raid1: sda: unrecoverable I/O read error for block 976558336

================================================
# 1  Short offline       Completed: read failure       90%     44420         976766501

Booted from gparted live cd and entered command line mode.

This time when I read the sector, I get a bunch of zeroes:
hdparm --read-sector 976766501 /dev/sda

================================================
Reboot... won't boot.

Boot into file system recovery mode (no live cd or rescue).
Look at: cat /proc/mdstat
Now md1 has been automatically renamed to md127, thanks to the GParted live CD.

Go into single user mode:
telinit 1

Take a look at the array details:
mdadm --detail /dev/md127

If it says "Preferred Minor : 127", then this is your problem. You need to update the preferred minor.

mdadm --stop /dev/md127
mdadm --assemble --update=super-minor /dev/md1 /dev/sda1

Then you'll have your md127 assembled as md1. And it should stay this way in the future too.

================================================
Since I didn't do:
mdadm --assemble --update=super-minor /dev/md1 /dev/sda1 /dev/sdb1

I now have:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda1[0]
104320 blocks [2/1] [U_]

md127 : active raid1 sdb1[1]
104320 blocks [2/1] [_U]

md2 : active raid1 sda2[0]
488279552 blocks [2/1] [U_]

unused devices: <none>

Wonderful. Let's stop md127. But first, unmount it:
# umount /dev/md127

I get:
umount: /dev/md127: not mounted

OK. Since it's unmounted, let's stop it:
# mdadm -S /dev/md127

I get:
mdadm: stopped /dev/md127

Great. Back to start. Let's see what we have:
# cat /proc/mdstat

I get:
Personalities : [raid1]
md1 : active raid1 sda1[0]
104320 blocks [2/1] [U_]

md2 : active raid1 sda2[0]
488279552 blocks [2/1] [U_]

unused devices: <none>

Let's zero out the superblocks on sdb so that md127 doesn't come back via some type of magic out there.

First, I check the partition table of sdb with the -lu option so it shows sectors rather than cylinders.
# fdisk -lu /dev/sdb
or
# sfdisk -luS /dev/sdb

I get:
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes

Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1      208769      104384+  fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2          208770   976768063   488279647   fd  Linux raid autodetect

Let's zero out the superblock of sdb.
#mdadm --zero-superblock /dev/sdb1
#mdadm --zero-superblock /dev/sdb2
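
To double-check that the superblocks really are gone before moving on (just a sanity check), examine the partitions; each should report that no md superblock is detected:

mdadm --examine /dev/sdb1
mdadm --examine /dev/sdb2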

Now let's add sdb back in.
#console > Manage Disk Array > Add sdb

I get:
kernel: raid1: sda: unrecoverable I/O read error for block 976558336

================================================

hdparm --read-sector 976766501 /dev/sda: Input/Output error

Then write to the sector:
hdparm --write-sector 976766501 /dev/sda
hdparm --write-sector 976766501 --yes-i-know-what-i-am-doing /dev/sda

Read sector again:
hdparm --read-sector 976766501 /dev/sda
(a lot of zeroes should follow)

hdparm --read-sector 976767187 /dev/sda: Input/Output error

Then write to the sector:
hdparm --write-sector 976767187 /dev/sda
hdparm --write-sector 976767187 --yes-i-know-what-i-am-doing /dev/sda

Read sector again:
hdparm --read-sector 976767187 /dev/sda
(a lot of zeroes should follow)

================================================
kernel panic

crap!!!

================================================
Boot from livecd.

type: sme rescue
press ENTER for ENGLISH
press ENTER for US
press right arrow to select NO, Don't start network.
press ENTER to CONTINUE, Start filesystem.
press ENTER for OK

Take a look at:
cat /proc/mdstat

md1
md127

mdadm --detail /dev/md127

If it says "Preferred Minor : 127", then this is your problem. You need to update the preferred minor.

vgchange -an
can't deactivate volume group main with 2 open logical volumes

Can't stop the volume group because there's a logical volume open.

lvchange -an main/root
LV main/root in use: not deactivating

Can't stop the logical volume. Don't know why.

Let's reboot without mounting the file system:
Boot from livecd.

type: sme rescue
press ENTER for ENGLISH
press ENTER for US
press right arrow to select NO, Don't start network.
SKIP the filesystem

mdadm --assemble --update=super-minor --uuid ce917023:5da6a14f:2a9c304f:a380120a /dev/md2

================================================
Use ddrescue to clone disk.
Boot to system rescue cd with defaults.
The raid starts to rebuild automatically, so we will have to stop it.

mdadm -f /dev/md1 /dev/sdb1
mdadm -r /dev/md1 /dev/sdb1

mdadm -f /dev/md2 /dev/sdb2
mdadm -r /dev/md2 /dev/sdb2

Now let's clone the disk:

ddrescue -f -n /dev/sda /dev/sdb rescue.log
ddrescue -d -f -r3 /dev/sda /dev/sdb rescue.log

Now let's shut down:
halt

Remove sda. Now reboot.

Boot fine! It recognizes the new cloned drive as sda.

Shutdown:
signal-event halt

Reposition the drive sda to the first slot for sanity's sake.
Place new hd into the second slot.
reboot.
console
manage disk array redundancy.
add sdb into array.
wait for resync.

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[1] sda2[0]
488279552 blocks [2/2] [UU]

resync's fine!!!

================================================
Now reboot with only sdb. I get:
black screen.

I have to reinstall grub onto sdb. It should have happened during "manage disk array redundancy". Asking for help.

In the end, the following worked fine:
grub-install sdb
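
If grub-install doesn't cooperate, the same thing can be done from the legacy grub shell; a sketch, assuming sdb is the second BIOS disk (hd1) and /boot lives on its first partition:

grub
grub> root (hd1,0)
grub> setup (hd1)
grub> quit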
