How to replace a failed hard drive in a software RAID 1 array?

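Is it possible to replace a faulty drive in a RAID 1 array? What are the steps?

Here are the detailed steps for replacing a bad drive in a software RAID 1 array. RAID 1 means mirroring: each array keeps an identical copy of its data on both drives, so it keeps working on the surviving drive while the other one is replaced.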

Here I have two hard drives, /dev/sda and /dev/sdd, with partitions /dev/sda1, /dev/sda2, /dev/sda3, /dev/sda5, /dev/sda6, /dev/sda7 and /dev/sda8 as well as /dev/sdd1, /dev/sdd2, /dev/sdd3, /dev/sdd5, /dev/sdd6, /dev/sdd7 and /dev/sdd8.

This is how the RAID arrays are built:

/dev/sda1 and /dev/sdd1 make up the /dev/md0 RAID 1 array
/dev/sda2 and /dev/sdd2 make up the /dev/md3 RAID 1 array
/dev/sda3 and /dev/sdd3 make up the /dev/md5 RAID 1 array
/dev/sda5 and /dev/sdd5 make up the /dev/md4 RAID 1 array
/dev/sda6 and /dev/sdd6 make up the /dev/md2 RAID 1 array
/dev/sda7 and /dev/sdd7 make up the /dev/md1 RAID 1 array
/dev/sda8 and /dev/sdd8 make up the /dev/md6 RAID 1 array

This can be identified from the following command:

# cat /proc/mdstat

Here the failing disk is /dev/sdd and we need to replace it. The output of cat /proc/mdstat also shows which arrays are degraded. Here’s an example:

root@server100 [~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdd1[1] sda1[0]
      305088 blocks [2/2] [UU]
      
md3 : active raid1 sdd2[2](F) sda2[0]
      57673280 blocks [2/1] [U_]
      
md4 : active raid1 sdd5[1] sda5[0]
      26217984 blocks [2/2] [UU]
      
md2 : active raid1 sdd6[1] sda6[0]
      8385792 blocks [2/2] [UU]
      
md1 : active raid1 sdd7[1] sda7[0]
      4192832 blocks [2/2] [UU]
      
md6 : active raid1 sdd8[2](F) sda8[0]
      1830518272 blocks [2/1] [U_]
      
md5 : active raid1 sdd3[1] sda3[0]
      26217984 blocks [2/2] [UU]
      
unused devices: <none>

If you see ‘_’ (underscore) in place of a ‘U’, that array is degraded. In this example the ‘_’ only shows that the second slot of the mirror is missing, but the ‘(F)’ beside sdd2 and sdd8 confirms that /dev/sdd is failing. You can also run smartctl on /dev/sdd to confirm it; check for ATA errors in the smartctl output.

Here /dev/sdd2 and /dev/sdd8 have failed. Since the whole drive is being replaced, we need to mark its partitions as failed in the other arrays as well, and then remove them from the RAID arrays.
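With many arrays it is easy to miss an ‘(F)’ by eye. The following is a minimal sketch, not part of the original procedure, that lists every member flagged as failed. The function name list_failed_members is made up for this example, and the sample is a shortened copy of the output above so the snippet can be tried without touching a real array.

```shell
#!/bin/sh
# Print "array member" for every RAID member flagged (F) in mdstat-format
# input. Reads stdin, so it works on saved output as well as /proc/mdstat.
# list_failed_members is a hypothetical helper name.
list_failed_members() {
    awk '/^md[0-9]+ :/ {
        for (i = 5; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                member = $i
                sub(/\[.*/, "", member)   # strip the "[2](F)" suffix
                print $1, member
            }
    }'
}

# Shortened sample of the mdstat output shown in this article.
sample='md0 : active raid1 sdd1[1] sda1[0]
md3 : active raid1 sdd2[2](F) sda2[0]
md6 : active raid1 sdd8[2](F) sda8[0]'

printf '%s\n' "$sample" | list_failed_members
```

On a live server the same function would be fed from the kernel directly: list_failed_members < /proc/mdstat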

Marking the hard-drive as failed and removing it

Here’s the command to mark the drive as failed:

# mdadm --manage /dev/md0 --fail /dev/sdd1

Similarly, do it for the other /dev/sdd partitions, each against its own array (for example, /dev/sdd2 against /dev/md3). Afterwards, cat /proc/mdstat shows ‘(F)’ beside every sdd partition.
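Typing the same command seven times invites a mismatched array and partition. As a hedged sketch, the helper below only prints the commands for the layout described in this article, so they can be reviewed before anything is run. The names LAYOUT and mdadm_commands are assumptions made for this example, and the layout has to be adjusted for any other server.

```shell
#!/bin/sh
# Array-to-partition-number pairs as listed in this article
# (md0 uses partition 1, md3 uses partition 2, and so on).
# LAYOUT and mdadm_commands are hypothetical names for this sketch.
LAYOUT="md0:1 md3:2 md5:3 md4:5 md2:6 md1:7 md6:8"

# Print (do not run) one mdadm command per array.
#   $1: an mdadm manage action such as --fail, --remove or --add
#   $2: the disk name without /dev/, e.g. sdd
mdadm_commands() {
    action=$1
    disk=$2
    for pair in $LAYOUT; do
        array=${pair%%:*}   # text before the colon, e.g. md0
        part=${pair##*:}    # text after the colon, e.g. 1
        echo "mdadm --manage /dev/$array $action /dev/$disk$part"
    done
}

# Dry run: show the commands that would mark every sdd partition as failed.
mdadm_commands --fail sdd
```

Once the printed lines look right, they can be run as root by piping them to sh (mdadm_commands --fail sdd | sh). The same helper prints the --remove and --add commands used in the later steps.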

Removing the drive

To remove the failed drives from the RAID array, please use the following command:

# mdadm --manage /dev/md0 --remove /dev/sdd1

Repeat it for the other /dev/sdd partitions, each against its own array.

Once the bad drive is removed, each RAID array lists only one hard drive. You can see this in the output of cat /proc/mdstat.

Now it’s time to power off the server and contact your DC (data centre) for a drive replacement. To power off:

# shutdown -h now

Replace the defective /dev/sdd with a new one. It should be the same size as the old one, or larger. (That is, if the old drive is 1TB, the new one should be at least 1TB.)

Once the defective drive is replaced, boot up the server and check that the new drive has come up as /dev/sdd. Because this is RAID 1, we now need to create partitions on the new drive that exactly replicate those on the other drive, /dev/sda. For that we can use the sfdisk command.

# sfdisk -d /dev/sda | sfdisk /dev/sdd

Here, the entire partition table of /dev/sda is copied over to the new drive, /dev/sdd.

Now you can execute the following command to check whether both hard drives have the same partitions:

# fdisk -l
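Comparing two long fdisk listings by eye is error-prone, so the check can also be scripted. The sketch below reduces an sfdisk -d dump to partition number, start, size and type so that two disks can be compared directly. It assumes an MBR (dos) partition table, which the numbering in this article suggests. The function name layout_of and the sample numbers are made up for illustration and are not taken from the server.

```shell
#!/bin/sh
# Reduce an "sfdisk -d" dump (stdin) to its partition lines with the device
# name and all whitespace stripped, so dumps of two disks are comparable.
#   $1: disk name without /dev/, e.g. sda
# layout_of is a hypothetical helper name.
layout_of() {
    grep "^/dev/$1" | sed -e "s|^/dev/$1||" -e 's/[[:space:]]//g'
}

# Illustrative dumps with invented values. On a live server use instead:
#   sfdisk -d /dev/sda | layout_of sda
#   sfdisk -d /dev/sdd | layout_of sdd
dump_a='/dev/sda1 : start=       63, size=   610176, Id=fd, bootable
/dev/sda2 : start=   610239, size=115346560, Id=fd'
dump_b='/dev/sdd1 : start=       63, size=   610176, Id=fd, bootable
/dev/sdd2 : start=   610239, size=115346560, Id=fd'

if [ "$(printf '%s\n' "$dump_a" | layout_of sda)" = \
     "$(printf '%s\n' "$dump_b" | layout_of sdd)" ]; then
    echo "partition layouts match"
else
    echo "partition layouts differ"
fi
```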

Add drives to the RAID array

Next, we need to add the new partitions to the RAID arrays. For that we use the following command:

# mdadm --manage /dev/md0 --add /dev/sdd1

Repeat it for the other RAID arrays as well, adding each /dev/sdd partition to its own array.

Once you have finished adding the partitions to the RAID arrays, they start synchronising automatically.
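While the arrays rebuild, /proc/mdstat shows a progress line for the one currently syncing. The sketch below pulls the percentage out of that output. The function name resync_progress is made up for this example, and the sample progress line uses illustrative numbers rather than output captured from this server.

```shell
#!/bin/sh
# Print "array percent" for every array that mdstat-format input (stdin)
# reports as rebuilding. resync_progress is a hypothetical helper name.
resync_progress() {
    awk '/^md[0-9]+ :/ { array = $1 }
         /(recovery|resync) *=/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /%$/) print array, $i
         }'
}

# Illustrative sample of an array in the middle of a rebuild.
sample='md3 : active raid1 sdd2[2] sda2[0]
      57673280 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (7266880/57673280) finish=14.2min speed=59012K/sec'

printf '%s\n' "$sample" | resync_progress
```

On a live server, run resync_progress < /proc/mdstat, or simply watch the raw output refresh with watch -n 5 cat /proc/mdstat. The rebuild is complete when every array shows [UU] again.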

That’s it, now you’ve replaced /dev/sdd!

Any questions? Post a comment!


Heba Habeeb

Working as a Linux Server Admin, Infopark, Cochin, Kerala.

