Replacing a RAID device/disk

This brief document describes the steps required to replace a disk in a Linux software RAID system.

Please note that these steps assume hot-swap disks.

If your disks are not hot-swappable, you will have to shut the machine down before physically removing/replacing the disk.

  1. Identify the broken disk: # cat /proc/mdstat. The broken disk will be marked with (F). Example: sda2[2](F).
    1. If the disk partially works: # smartctl -a /dev/DISK | grep Serial or # hdparm -I /dev/DISK
      Either will show the serial number. You can also force activity on the disk to make it obvious which one it is (its activity LED stays on or blinks the most): 
       # dd if=/dev/DISK of=/dev/null
      Stop with Ctrl-C once identified.
    2. If the disk doesn't work at all, repeat the previous step on the remaining disks and identify the one that doesn't blink.
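A minimal sketch of pulling the failure marker out of /proc/mdstat. The printf below is hardcoded sample output so the pipeline is visible end to end; on a live system replace it with `cat /proc/mdstat`:

```shell
# Sample /proc/mdstat content (stand-in for: cat /proc/mdstat).
# grep -o prints only the member marked (F), e.g. sda2[2](F).
printf '%s\n' \
  'md0 : active raid1 sdb1[1] sda1[0]' \
  'md1 : active raid1 sdb2[1] sda2[2](F)' |
  grep -o '[a-z]*[0-9]*\[[0-9]*\](F)'
```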
  2. Remove disk/partition(s) from RAID(s):
    1. Mark all of the disk's partitions that belong to a RAID as failed: # mdadm --manage /dev/mdX --fail /dev/DISK_PARTITION
      For example: # mdadm --manage /dev/md0 --fail /dev/sda1; mdadm --manage /dev/md1 --fail /dev/sda2
    2. Repeat the same commands for all partitions/disks, with the --remove modifier: 
      # mdadm --manage /dev/mdX --remove /dev/DISK_PARTITION 
    3. Make sure the disk is not in use anymore:
      1. Check that no mount or swap area is using the disk:
         # grep DISK /proc/mounts; grep DISK /proc/swaps
        If either shows any entry, unmount it or disable it with swapoff.
      2. Check that no RAID is still using the disk: 
         # grep DISK /proc/mdstat
        If it shows any entry, mark it as failed and remove it using the commands from step 2.
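The fail/remove pair from steps 2.1 and 2.2 can be sketched as a loop. DISK and the md:partition pairs below are illustrative assumptions, not read from your system, and the leading echo makes this a dry run; remove it (and run as root) to execute for real:

```shell
# Dry run: prints the mdadm commands that would be issued for each
# array:partition pair of the failing disk (example mappings only).
DISK=sda
for pair in md0:${DISK}1 md1:${DISK}2; do
  md=${pair%:*}
  part=${pair#*:}
  echo mdadm --manage /dev/$md --fail /dev/$part
  echo mdadm --manage /dev/$md --remove /dev/$part
done
```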
  3. Remove disk safely.
  4. Replace the disk (preferably with a healthy one :-) ).
  5. Check for kernel messages in /var/log/messages or with # dmesg.
  6. If the disk is not detected and doesn't appear in /proc/partitions (which can happen):
    1. Force a re-scan: 
       # echo "- - -" > /sys/class/scsi_host/hostX/scan
      replacing hostX with "host0", "host1", etc. Sometimes this doesn't work; in that case try one of:
       # echo "- - -" > /sys/devices/pci*/*:*/host?/scsi_host/host?/scan
    2. Check again for new partitions (the new disk typically gets the same name as the replaced one). Trial and error is the best method to discover the new disk.
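The re-scan can be written as a loop over every SCSI host. This is a dry run: it only prints the write it would perform for each host (pipe the output to sh, as root, to execute). SCSI_HOSTS is a hypothetical override so the loop can be exercised against a fake directory tree:

```shell
# Dry run: print one rescan write per SCSI host found under SCSI_HOSTS
# (defaults to the real sysfs path; override only for testing).
SCSI_HOSTS=${SCSI_HOSTS:-/sys/class/scsi_host}
for scan in "$SCSI_HOSTS"/host*/scan; do
  [ -e "$scan" ] || continue   # glob matched nothing
  echo "echo '- - -' > $scan"
done
```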
  7. Recreate the partition table, using any other disk from the RAID as the source. There are two methods; use either:
    • Copy the layout with sfdisk:
       # sfdisk -d /dev/CURRENT_RAID_DISK | sfdisk /dev/NEW_DISK
    • Or, on MBR disks, copy only the first sector with dd:
       # dd if=/dev/CURRENT_RAID_DISK of=/dev/NEW_DISK bs=512 count=1
      Then force the kernel to reload the partition table: 
       # blockdev --rereadpt /dev/NEW_DISK
      or use parted's utility: # partprobe
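A file-based sketch of the dd method: on an MBR disk only the first 512 bytes (boot code plus partition table) need copying, not the whole disk. SRC and DST here are ordinary temp files standing in for the real devices; on a live system the equivalent would be `dd if=/dev/CURRENT_RAID_DISK of=/dev/NEW_DISK bs=512 count=1`:

```shell
# Temp files stand in for the source and new disks.
SRC=$(mktemp) DST=$(mktemp)
# Build a fake 512-byte first sector (conv=sync pads the block with zeros).
printf 'fake MBR + partition table' | dd of="$SRC" bs=512 count=1 conv=sync 2>/dev/null
# Copy just that first sector to the "new disk".
dd if="$SRC" of="$DST" bs=512 count=1 2>/dev/null
cmp -s "$SRC" "$DST" && echo 'first sector copied'
```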
  8. Check that the new partitions are recognized:

# dmesg
# cat /proc/partitions

  9. Reinstall GRUB: # grub-install /dev/NEW_DISK
  10. Add the partition(s) back the same way they were removed (replace the --remove modifier with --add): 
    # mdadm /dev/mdX --add /dev/DISK_PARTITION
  11. Check progress with: # watch cat /proc/mdstat
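While the array rebuilds, /proc/mdstat shows a recovery line; you can pull just the progress figure out of it. The printf below is hardcoded sample output so the pipeline can be shown end to end; on a live system use `grep -o 'recovery = *[0-9.]*%' /proc/mdstat`:

```shell
# Sample /proc/mdstat content during a rebuild (stand-in for the real file);
# grep -o keeps only the "recovery = N%" fragment.
printf '%s\n' \
  'md1 : active raid1 sda2[2] sdb2[1]' \
  '      [=>...................]  recovery =  9.6% (9542400/98932928) finish=13.1min' |
  grep -o 'recovery = *[0-9.]*%'
```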

Problems with md0 being renamed to md125 or similar

This situation is not covered by the steps above, but it can come up when the RAID disks are used on another machine with a different system configuration, utility versions, or kernel version (e.g. in a chroot environment or from a Live CD): trying to recover a RAID there can cause problems with device names, changing them permanently. 
In those cases, if a device that is normally /dev/md0 or /dev/md1 is recognized as /dev/md125 (or a much larger number than the original), it will not automatically switch back to /dev/mdX on its own. 
To resolve this situation, try the following:

  • Copy the original /etc/mdadm.conf (or your distribution's equivalent, e.g. /etc/mdadm/mdadm.conf) to the current system.
  • Unmount and deactivate any active md device.
  • Issue the following commands:
    • # mdadm -S /dev/md125 # or whatever the stray name is
      # mdadm -A /dev/md0 --update=super-minor # md0, md1, etc.; --update=super-minor applies to 0.90-metadata arrays
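The stop/reassemble pair can be sketched as a loop over every stray device. This is a dry run: it echoes the commands instead of running them, and the md125->md0 / md126->md1 pairs are example mappings, not read from your system; use the stray names /proc/mdstat actually shows:

```shell
# Dry run: print the stop + reassemble commands for each stray:original
# device-name pair (example mappings only).
for pair in md125:md0 md126:md1; do
  stray=${pair%:*}
  orig=${pair#*:}
  echo mdadm -S /dev/$stray
  echo mdadm -A /dev/$orig --update=super-minor
done
```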

Good luck.