We assume we have 3 drives: sda, sdb and hot spare drive sdc. The first drive sda is failing due to temperature problems and we want to take it offline and put sdc online.
Check the configuration and drives
mdadm -D /dev/md0
mdadm -D /dev/md1
Check the condition of drives and be sure that you have the right one to blame
smartctl --all /dev/sda
smartctl --all /dev/sdb
smartctl --all /dev/sdc
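Since the suspect here is temperature, it can help to pull just that number out of the SMART report. The following is a minimal sketch, assuming an ATA drive that reports attribute 194 (Temperature_Celsius) in the usual attribute table, with the raw value in the 10th column. The function name smart_temperature is my own, not part of smartmontools.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 194 (Temperature_Celsius) from
# smartctl output read on stdin. Assumes the usual ATA attribute table,
# where the raw value is the 10th whitespace-separated field.
# Prints nothing if the drive does not report attribute 194.
smart_temperature() {
    awk '$1 == "194" { print $10; exit }'
}

# Usage (hypothetical loop over the three drives from this article):
# for d in sda sdb sdc; do
#     echo "$d: $(smartctl -A "/dev/$d" | smart_temperature)"
# done
```

A drive that runs much hotter than its neighbours in the same chassis is a good candidate for the one to blame, but read the full report as well before deciding.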
If the broken drive is still alive but adding latency and CPU load, you might want to see it for yourself before making any harsh decisions. So start an I/O monitor first
iostat -x 1
and run a stress test with something like this (remember to write the test file to a filesystem that lives on the array you want to test)
dd if=/dev/zero of=/tmp/koe bs=1024k count=2000
Look at the monitor and see which drive is taking the penalty, for example clearly higher wait times or %util than its mirror partner. That is the one we will get rid of.
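If you would rather not eyeball the scrolling output, the comparison can be sketched as a small filter. This assumes the extended iostat report has a header line starting with "Device" and a "%util" column, which is what sysstat prints, and it only looks at the last report in the input, because the first one shows averages since boot. The function name busiest_device is hypothetical.

```shell
#!/bin/sh
# Read `iostat -x` output on stdin and print the device with the highest
# %util in the LAST report. The %util column is located by name from the
# header line, so the exact column layout does not matter.
busiest_device() {
    awk '
        /^Device/ {                       # header line: a new report starts
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            max = -1; dev = ""            # forget the previous report
            next
        }
        col && NF >= col && $col + 0 > max { max = $col + 0; dev = $1 }
        NF == 0 { col = 0 }               # blank line ends the device table
        END { if (dev != "") print dev }
    '
}

# Usage while the dd test is running (two reports, one second apart):
# LC_ALL=C iostat -x 1 2 | busiest_device
```

High utilisation alone does not prove that a drive is dying, so treat the result as a hint to compare against the SMART data, not as a verdict.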
And remember to make sure the other drives are bootable before you pull the failing one
Remove the faulty drive from array
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2
This would be the part where you do stuff like this
mdadm /dev/md0 --add /dev/sdc1
mdadm /dev/md1 --add /dev/sdc2
But since my spare /dev/sdc was already there waiting, the md subsystem automatically started using it without any need for intervention.
Look at /proc/mdstat and wait until /dev/sdc is in sync.
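Waiting for the sync can be scripted instead of watched. Here is a minimal sketch, assuming the kernel marks an ongoing rebuild in /proc/mdstat with a "recovery = ..." or "resync = ..." progress line. The function name md_syncing is my own.

```shell
#!/bin/sh
# Return success (0) while any md array is still rebuilding or resyncing.
# Takes an optional file argument so it can be tried on a saved copy;
# defaults to /proc/mdstat. Also matches "resync=DELAYED" / "resync=PENDING".
md_syncing() {
    grep -Eq '(recovery|resync) *=' "${1:-/proc/mdstat}"
}

# Usage: poll until the rebuild onto the spare has finished.
# while md_syncing; do sleep 10; done; echo "all arrays in sync"
```

mdadm also has a --wait option (mdadm --wait /dev/md0) that blocks until resync or recovery on the given array is finished, which may be all you need.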
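Just to be on the safe side, once the failed disk has been swapped for a new one (assumed here to show up as /dev/sda again), copy the partition table from a healthy drive onto it and add its partitions back to the arrays:
sfdisk -d /dev/sdb > sdb.txt
sfdisk --force /dev/sda < sdb.txt
mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md1 --add /dev/sda2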
Now look at the arrays with mdadm -D /dev/mdX and confirm that the new drive's partitions are listed as spares. They will be activated as soon as one of the active devices fails or is marked as failed.
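The spare check can be sketched as a filter over the mdadm -D output. This assumes the device table at the end of the report has the columns Number, Major, Minor, RaidDevice, State and the device path, with an idle spare showing the single word "spare" as its state. Older or newer mdadm versions may format this differently, so verify against your own output. The function name md_spares is hypothetical.

```shell
#!/bin/sh
# Read `mdadm -D /dev/mdX` output on stdin and print the device path of
# every idle spare. A spare row is assumed to have exactly six fields:
#   Number Major Minor RaidDevice(-) spare /dev/xxx
# Rows in state "spare rebuilding" have more fields and are skipped.
md_spares() {
    awk '$5 == "spare" && NF == 6 { print $6 }'
}

# Usage:
# mdadm -D /dev/md0 | md_spares
# mdadm -D /dev/md1 | md_spares
```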
Or, if you have a two-drive RAID, /proc/mdstat shows that it is syncing to /dev/sda, and once that completes you are done.
Oh, and remember again to make the new drive bootable
Voilà!