This brief document describes the steps required to replace a disk in a Linux software RAID system.
Please note that these steps assume hot-swap disks.
Without hot-swap support, you will have to shut the machine down to remove/replace the disk and boot it again afterwards.
# cat /proc/mdstat
The failed disk is marked with (F). Example: sda2[0](F).
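The (F) check can also be scripted. The following is only a minimal sketch: it assumes the usual /proc/mdstat layout, where an array line looks like "md1 : active raid1 sdb2[1] sda2[0](F)", and the function name failed_members is made up for this example.

```shell
#!/bin/sh
# failed_members (hypothetical helper): print "ARRAY MEMBER" for every
# member flagged (F). Reads /proc/mdstat unless another file is given.
failed_members() {
    awk '$1 ~ /^md/ && $2 == ":" {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\(F\)/) {
                dev = $i
                sub(/\[.*/, "", dev)    # drop the role/flag suffix
                print $1, dev
            }
        }
    }' "${1:-/proc/mdstat}"
}
```

Called without arguments it reads the live /proc/mdstat; an empty result means no member is currently flagged as failed.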
# smartctl -a /dev/DISK | grep Serial
or
# hdparm -I /dev/DISK
Either command shows the serial number. You can also force activity on the disk to make it obvious which one it is (its activity LED will stay on or blink the most):
# dd if=/dev/DISK of=/dev/null
Stop with Ctrl+C once the disk is identified.
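When several disks have to be compared against the label on the drive, it can help to print only the serial number. This is a small sketch, assuming smartctl prints a line starting with "Serial Number:" (ATA) or "Serial number:" (SCSI/SAS); the helper name serial_from_smartctl is hypothetical.

```shell
#!/bin/sh
# serial_from_smartctl (hypothetical helper): read smartctl -i / -a
# output on stdin and print only the serial number.
serial_from_smartctl() {
    awk -F': *' '/^Serial [Nn]umber:/ { print $2; exit }'
}
```

Typical use would be: smartctl -i /dev/DISK | serial_from_smartctl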
# mdadm --manage /dev/mdX --fail /dev/DISK_PARTITION
For example: # mdadm --manage /dev/md0 --fail /dev/sda1; mdadm --manage /dev/md1 --fail /dev/sda2
Repeat the same commands for all partitions/disks, this time with the --remove modifier:
# mdadm --manage /dev/mdX --remove /dev/DISK_PARTITION
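On a disk with many partitions it is easy to miss one. The sketch below only prints the --fail and --remove commands for every array member that lives on a given disk, so they can be reviewed before being run. It assumes member names of the form DISK plus a partition number (sda1, nvme0n1p2) as shown in /proc/mdstat; print_fail_remove is a made-up name.

```shell
#!/bin/sh
# print_fail_remove (hypothetical helper): $1 is the disk name without
# /dev/ (e.g. sda), $2 an optional mdstat-format file.
# Nothing is executed, the mdadm commands are only printed.
print_fail_remove() {
    awk -v disk="$1" '$1 ~ /^md/ && $2 == ":" {
        for (i = 3; i <= NF; i++) {
            dev = $i
            sub(/\[.*/, "", dev)    # sda2[0](F) becomes sda2
            if (dev ~ ("^" disk "p?[0-9]*$")) {
                print "mdadm --manage /dev/" $1 " --fail /dev/" dev
                print "mdadm --manage /dev/" $1 " --remove /dev/" dev
            }
        }
    }' "${2:-/proc/mdstat}"
}
```

For example, print_fail_remove sda lists the commands for all sda members; once checked, the output could be piped to sh.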
Check that no mount or swap area is still using the disk:
# grep DISK /proc/mounts; grep DISK /proc/swaps
If any entry is displayed, unmount it (umount) or disable the swap area (swapoff).
Check that no RAID array is using the disk anymore:
# grep DISK /proc/mdstat
If any entry is displayed, mark it as failed and remove it with the --fail and --remove commands from step 2.
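The three grep checks can be combined into a single test. This is a sketch only; disk_still_used is a hypothetical name, and like the plain grep commands it matches substrings, so "sda" would also match a disk called "sdaa".

```shell
#!/bin/sh
# disk_still_used (hypothetical helper): $1 is the disk name (e.g. sda);
# any further arguments are the files to search. Defaults to the three
# /proc files used in the manual checks.
# Exit status 0 means the disk is still referenced somewhere.
disk_still_used() {
    [ "$#" -ge 1 ] || return 2
    disk=$1
    shift
    [ "$#" -gt 0 ] || set -- /proc/mounts /proc/swaps /proc/mdstat
    grep -H -- "$disk" "$@"
}
```

A possible use: disk_still_used sda && echo "do not pull the disk yet"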
Replace the disk (preferably with a healthy one :-) ).
Check for any related messages in /var/log/messages or with # dmesg
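If the new disk is not detected, rescan the SCSI bus:
# echo "- - -" > /sys/class/scsi_host/host*/scan
Replace host* with a specific host ("host0", etc.); the redirection fails if the pattern matches more than one host.
Sometimes this doesn't work. In that case, try with one of:
# echo "- - -" > /sys/devices/pci*/*:*/host?/scsi_host/host?/scan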
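Since a shell redirection cannot target several files at once, rescanning every host takes a loop. The following is a minimal sketch; rescan_scsi_hosts is a made-up name, and the optional directory argument exists only so the loop can be tried against a scratch directory instead of the real sysfs tree.

```shell
#!/bin/sh
# rescan_scsi_hosts (hypothetical helper): write "- - -" to the scan
# file of every hostN entry. $1 is the directory holding the hostN
# entries and defaults to /sys/class/scsi_host (needs root there).
rescan_scsi_hosts() {
    base=${1:-/sys/class/scsi_host}
    for scan in "$base"/host*/scan; do
        [ -e "$scan" ] || continue    # pattern matched nothing
        echo "- - -" > "$scan"
    done
}
```

Afterwards, dmesg should show whether the kernel picked up the new disk.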
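Copy the partition table from a healthy RAID disk to the new disk (recent sfdisk versions accept --Linux but ignore it):
# sfdisk -d /dev/CURRENT_RAID_DISK | sfdisk --Linux /dev/NEW_DISK
or copy only the first sector, which holds the MBR partition table and boot code. Without bs/count, dd would clone the entire disk, RAID superblocks included:
# dd if=/dev/CURRENT_RAID_DISK of=/dev/NEW_DISK bs=512 count=1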
Force the kernel to reload the partition table:
# blockdev --rereadpt /dev/NEW_DISK
or use parted's utility:
# partprobe
# cat /proc/partitions
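Besides eyeballing /proc/partitions, the two partition layouts can be compared mechanically. The sketch below assumes two dump files produced beforehand with sfdisk -d, whose partition lines look like "/dev/sda1 : start= 2048, size= 204800, ..."; the names layout_of and same_layout are hypothetical.

```shell
#!/bin/sh
# layout_of (hypothetical helper): keep only the partition lines of an
# "sfdisk -d" dump, with the device name and all spaces removed.
layout_of() {
    sed -n 's/^[^:]*: *start=/start=/p' "$1" | tr -d ' '
}

# same_layout (hypothetical helper): succeed only if both dumps describe
# at least one partition and the start/size/type values are identical.
same_layout() {
    a=$(layout_of "$1")
    [ -n "$a" ] && [ "$a" = "$(layout_of "$2")" ]
}
```

For example: sfdisk -d /dev/CURRENT_RAID_DISK > cur.dump; sfdisk -d /dev/NEW_DISK > new.dump; same_layout cur.dump new.dump && echo "layouts match"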
# grub-install /dev/NEW_DISK
Add the partition(s) back the same way they were removed (replace the --remove modifier with --add):
# mdadm /dev/mdX --add /dev/DISK_PARTITION
# watch cat /proc/mdstat
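For unattended use, the rebuild progress can be read from /proc/mdstat instead of watching it. This is a rough sketch that assumes progress lines of the usual form "[==>....]  recovery = 12.6% (...) finish=..."; rebuild_status is a made-up name.

```shell
#!/bin/sh
# rebuild_status (hypothetical helper): print "ARRAY ACTION PERCENT" for
# every array that is currently rebuilding. Reads /proc/mdstat unless
# another file is given; prints nothing when all arrays are idle.
rebuild_status() {
    awk '$1 ~ /^md/ && $2 == ":" { array = $1 }
         $3 == "=" && $2 ~ /^(recovery|resync|reshape|check)$/ {
             print array, $2, $4
         }' "${1:-/proc/mdstat}"
}
```

A simple wait loop could then be: while [ -n "$(rebuild_status)" ]; do sleep 60; done. mdadm also provides --wait (mdadm --wait /dev/mdX), which blocks until resync or recovery of the given array has finished.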
One situation is not covered by the steps above. It can arise when the RAID disks are used in another machine or environment with a different system configuration, utility versions or kernel version (e.g. in a chroot environment or from a Live CD): trying to recover a RAID there can permanently change its device name.
In those cases, if an ordinary /dev/md0 or /dev/md1 device is recognized as /dev/md125, or some other number much larger than the original, it will not switch back to /dev/mdX automatically.
To resolve this situation, try the following:
# mdadm -S /dev/md125 # or similar name
# mdadm -A /dev/md0 --update=super-minor # md0, or md1, etc.; applies to arrays with 0.90 metadata