How to replace a failed hard drive in a software RAID 1 array?
Is it possible to replace a faulty drive in a RAID 1 array? What are the steps?
Here I explain the detailed steps for replacing a bad drive in a software RAID 1 array. As you know, RAID 1 means mirroring. In this example I have two hard drives, /dev/sda and /dev/sdd, with partitions /dev/sda1, /dev/sda2, /dev/sda3, /dev/sda5, /dev/sda6, /dev/sda7 and /dev/sda8 as well as /dev/sdd1, /dev/sdd2, /dev/sdd3, /dev/sdd5, /dev/sdd6, /dev/sdd7 and /dev/sdd8.
This is how the RAID arrays are built:
/dev/sda1 and /dev/sdd1 make the /dev/md0 RAID 1 array
/dev/sda2 and /dev/sdd2 make the /dev/md3 RAID 1 array
/dev/sda3 and /dev/sdd3 make the /dev/md5 RAID 1 array
/dev/sda5 and /dev/sdd5 make the /dev/md4 RAID 1 array
/dev/sda6 and /dev/sdd6 make the /dev/md2 RAID 1 array
/dev/sda7 and /dev/sdd7 make the /dev/md1 RAID 1 array
/dev/sda8 and /dev/sdd8 make the /dev/md6 RAID 1 array
This can be identified from the following command:
# cat /proc/mdstat
Here the failing disk is /dev/sdd and we need to replace it. The output of cat /proc/mdstat also shows which arrays are degraded. Here’s an example:
root@server100 [~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdd1 sda1
      305088 blocks [2/2] [UU]
md3 : active raid1 sdd2(F) sda2
      57673280 blocks [2/1] [U_]
md4 : active raid1 sdd5 sda5
      26217984 blocks [2/2] [UU]
md2 : active raid1 sdd6 sda6
      8385792 blocks [2/2] [UU]
md1 : active raid1 sdd7 sda7
      4192832 blocks [2/2] [UU]
md6 : active raid1 sdd8(F) sda8
      1830518272 blocks [2/1] [U_]
md5 : active raid1 sdd3 sda3
      26217984 blocks [2/2] [UU]
unused devices:
If you see ‘_’ (underscore) in place of a ‘U’, that array is degraded: one of its two mirrors is no longer active. Rather than relying on the position of the ‘_’, look for the member marked with (F), which means faulty. In this example sdd2 and sdd8 carry the (F) flag, so we can confirm that /dev/sdd is the failing drive. You can also run smartctl against /dev/sdd (for example, smartctl -a /dev/sdd) to confirm it. Check for ATA errors in the smartctl output.
Here /dev/sdd2 and /dev/sdd8 have failed. Before the drive can be pulled, we need to mark its partitions as failed in the other arrays as well and then remove all of them from the RAID arrays.
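If you would rather not read the mdstat output by eye, the (F) flags can be picked out with a small filter. This is only a sketch: failed_members is a made-up helper name, and it assumes the flag always appears as a trailing (F) on the member name, optionally preceded by an index such as [1].

```shell
# Print every array member flagged (F) in mdstat-style text read from stdin.
# failed_members is a hypothetical helper, not part of mdadm.
failed_members() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\(F\)$/) {
                # Drop the optional [n] index and the (F) flag, keep the name.
                sub(/(\[[0-9]+\])?\(F\)$/, "", $i)
                print $i
            }
        }
    }'
}

# On the live server you would feed it the real file:
#   failed_members < /proc/mdstat
```

With the example output shown earlier, this would list sdd2 and sdd8, one per line.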
Marking the hard drive as failed and removing it
Here’s the command to mark the drive as failed:
# mdadm --manage /dev/md0 --fail /dev/sdd1
Similarly, do it for the /dev/sdd partition in each of the other arrays.
Removing the drive
To remove a failed partition from its RAID array, use the following command:
# mdadm --manage /dev/md0 --remove /dev/sdd1
Repeat it for the other /dev/sdd partitions.
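Typing seven fail commands and seven remove commands by hand is error-prone, so here is a rough sketch of a loop over the array layout listed at the top of this post. PAIRS and mdadm_plan are names invented for this example, and the function only prints the commands so you can review them before running anything as root.

```shell
# Array-to-partition mapping for /dev/sdd, copied from the layout in this post.
PAIRS="md0:sdd1 md3:sdd2 md5:sdd3 md4:sdd5 md2:sdd6 md1:sdd7 md6:sdd8"

# Print (do not run) one mdadm command per array.
# $1 is the mdadm action without the leading dashes: fail, remove or add.
mdadm_plan() {
    for pair in $PAIRS; do
        # ${pair%%:*} is the array name, ${pair##*:} is the partition name.
        echo "mdadm --manage /dev/${pair%%:*} --$1 /dev/${pair##*:}"
    done
}

mdadm_plan fail      # first mark every sdd partition as failed
mdadm_plan remove    # then remove them from their arrays
```

Once the printed commands look right, they can be pasted into a root shell. The same helper called as mdadm_plan add would print the commands for re-adding the partitions after the drive has been swapped.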
Once the bad drive is removed from the RAID arrays, each array lists only one hard drive. You can see this with cat /proc/mdstat.
Now it’s time to power off the server and contact your data center (DC) for a drive replacement. To power off:
# shutdown -h now
Replace the defective /dev/sdd with a new one 🙂 The new drive should be the same size as the old one, and in any case no smaller. (That is, if the old drive is 1TB, then the new one should also be at least 1TB.)
Once the defective drive is replaced, boot up the server. Because this is RAID 1, the new drive needs an exact replica of the partition layout on the other drive, /dev/sda. For that we can use the command sfdisk.
# sfdisk -d /dev/sda | sfdisk /dev/sdd
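Here, the entire partition table of /dev/sda is copied over to the new drive, /dev/sdd. Double-check the direction before you press Enter: the first device is the source and the second one is overwritten.
Now you can execute the following command to check whether both hard drives have the same partitions: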
# fdisk -l
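Comparing two fdisk listings by eye works, but the check can also be scripted. The sketch below assumes the sfdisk -d dump format where each partition line starts with its device path (for example /dev/sda1 : start=...); layout_of is a hypothetical helper that strips the disk name so dumps from two disks can be compared as plain text.

```shell
# Reduce an `sfdisk -d` dump (stdin) to its partition lines, with the
# leading /dev/<disk> prefix removed. $1 is the disk name, e.g. sda.
# layout_of is a made-up helper for this post, not an sfdisk feature.
layout_of() {
    grep "^/dev/$1" | sed "s|^/dev/$1||"
}

# On the live server (as root) the comparison would look like this:
#   if [ "$(sfdisk -d /dev/sda | layout_of sda)" = "$(sfdisk -d /dev/sdd | layout_of sdd)" ]; then
#       echo "partition layouts match"
#   fi
```

If the two layouts differ, fix the partition table on /dev/sdd before adding anything to the arrays.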
Add drives to the RAID array
Next, we need to add the new partitions to the RAID arrays. For that we use the following command:
# mdadm --manage /dev/md0 --add /dev/sdd1
Repeat it for the other RAID arrays as well.
Once you have finished adding the partitions to the RAID arrays, they’ll start synchronising automatically. You can follow the progress with cat /proc/mdstat.
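While the mirrors rebuild, /proc/mdstat shows a progress line containing "recovery =" or "resync =" for the array being worked on, and may show "resync=DELAYED" for arrays waiting their turn. A minimal sketch for waiting until the rebuild is over, with resync_in_progress as an invented helper name:

```shell
# Exit status 0 while mdstat-style text on stdin still shows a rebuild
# that is running or queued; non-zero once no such line is left.
# resync_in_progress is a hypothetical helper, not an mdadm command.
resync_in_progress() {
    grep -Eq '(recovery|resync) *='
}

# On the live server you could poll once a minute until the rebuild ends:
#   while resync_in_progress < /proc/mdstat; do sleep 60; done
#   echo "all arrays are in sync"
```

When the loop finishes, every array in /proc/mdstat should be back to [2/2] [UU].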
That’s it, now you’ve replaced /dev/sdd!