URGENT - RAID Failure

Associate
Joined
18 Nov 2003
Posts
1,311
Location
Newcastle
Hi all,

Came in this morning to find one of the drives in our RAID 5 array throwing errors. We're replacing the drive, but the server is failing to boot and reports a RAID config mismatch. Is it safe for us to create a new config but KEEP all the data on the drives?

Any help is much appreciated.

Cheers
 
Associate
Joined
13 Oct 2009
Posts
238
Location
Cumbria
Are they hot-pluggable SCSI/SAS disks?
I'd probably remove the spare disk and boot the server without it. Then, on a spare machine, I'd initialise the spare disk to wipe it (via the RAID BIOS menu) and put it back in to rebuild the RAID 5 array.
It sounds like the RAID card has found existing array configs on both your faulty RAID 5 array and your spare disk. I think it's good practice to keep your spare disks completely blank.
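If you want to sanity-check whether a spare really is blank before it goes in, here's a rough read-only sketch in Python. It assumes a Linux box that can see the disk as a raw device (the /dev/sdb path is just a placeholder) and only looks for non-zero bytes near the start and end of the disk, where controllers commonly keep their metadata; it doesn't write anything.

#!/usr/bin/env python3
"""Read-only check for leftover data on a spare disk.

Rough sketch: assumes a Linux host and a raw device path such as
/dev/sdb (placeholder). RAID metadata usually lives near the start
or end of the disk, so non-zero bytes in those regions suggest the
spare isn't blank. Needs root to read the raw device; never writes.
"""
import sys

CHUNK = 1024 * 1024  # inspect 1 MiB at each end of the disk


def region_is_blank(f, offset, length):
    """Return True if every byte in the region is zero."""
    f.seek(offset)
    return not any(f.read(length))


def main(device):
    with open(device, "rb") as f:
        f.seek(0, 2)                      # seek to the end to get the disk size
        size = f.tell()
        head_blank = region_is_blank(f, 0, CHUNK)
        tail_blank = region_is_blank(f, max(0, size - CHUNK), CHUNK)
    if head_blank and tail_blank:
        print(f"{device}: first/last 1 MiB are zeroed - looks blank")
    else:
        print(f"{device}: non-zero data found - wipe it before using it as a spare")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb")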
 
Associate
OP
Joined
18 Nov 2003
Posts
1,311
Location
Newcastle
Thanks for the suggestions, Toughnoodle. This is a real odd one here.

Basically, we have three drives in a RAID 5 config and one is dying. The dying drive shows a solid light but no audible alarm. If we leave the server to boot normally, it absolutely crawls through Windows startup and then finally blue screens. We know this drive is wrecked and have a spare ready to go in, but we get 'Unresolved configuration mismatch between disks and NVRAM on the adapter' when we try to boot with the new drive installed.

We're just not sure whether creating a new config will erase all the data from the drives. We don't want that, as we know the data on the drives is good. Any ideas?

Controller: LSI 320-1
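As I understand it, the error means the controller's NVRAM copy of the array layout and the metadata it reads back from the disks no longer agree. A toy Python sketch of that idea (definitely not the LSI firmware, just an illustration of the comparison, with made-up config contents):

# Toy model only - not how the LSI firmware actually stores things.
# The controller keeps the array layout in NVRAM; each member disk also
# carries a copy. A replacement disk with stale or missing metadata makes
# the two views disagree, hence the "configuration mismatch" message.

nvram_config = {
    "array": "RAID5",
    "members": ["disk0", "disk1", "disk2"],
}

# What the controller reads back from the disks after the swap (made up):
disk_configs = {
    "disk0": {"array": "RAID5", "members": ["disk0", "disk1", "disk2"]},
    "disk1": {"array": "RAID5", "members": ["disk0", "disk1", "disk2"]},
    "new_disk": None,  # blank replacement, or worse, someone else's old config
}


def check_config(nvram, disks):
    """Compare the NVRAM view against what each present disk claims."""
    for name, cfg in disks.items():
        if cfg != nvram:
            return f"Unresolved configuration mismatch: {name} disagrees with NVRAM"
    return "Configuration consistent"


print(check_config(nvram_config, disk_configs))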
 
Associate
OP
Joined
18 Nov 2003
Posts
1,311
Location
Newcastle
Yes, it's a custom-built server. I've tried ringing LSI, but they aren't open yet (I'm guessing they're in America).

Does anyone know if creating that new configuration will erase the data?
 
Associate
Joined
18 Jan 2004
Posts
1,950
Location
Somewhere
Yes, it's a custom-built server. I've tried ringing LSI, but they aren't open yet (I'm guessing they're in America).

Does anyone know if creating that new configuration will erase the data?


Honestly, don't touch it until you've spoken to LSI tech support; you could end up nuking it. If they're in the States, they're likely to be open in an hour or so.

Let's hope they support you/it...

(This thread is the reason people don't build their own servers...)
 
Soldato
Joined
18 Oct 2002
Posts
4,034
Location
Somewhere on the Rainbow
They're supposed to have tech support in the UK, 01344 413441; is that the number you're using?

If all else fails, according to the manual here you could use the RAID BIOS to configure the new drive as a hot spare, and the card should then rebuild it into the array. Section 3.8 details the procedure. I take it you have backups?
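To illustrate why a rebuild onto a hot spare regenerates the lost data rather than erasing anything: every RAID 5 stripe stores a parity block that is the XOR of its data blocks, so any single missing block can be recomputed from the survivors. A small Python sketch of just that arithmetic, with made-up block contents (not the card's actual on-disk layout):

# Toy demonstration of RAID 5 parity, with made-up 8-byte blocks.
# Not the controller's real format - just the XOR arithmetic that lets
# a rebuild regenerate a replaced disk from the surviving members.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# One stripe of a three-disk array: two data blocks plus their parity.
data_a = b"INVOICE1"
data_b = b"PAYROLL2"
parity = xor_blocks([data_a, data_b])   # written when the stripe was created

# The disk holding data_b dies and is swapped for a blank drive; the
# rebuild recomputes its block from the two surviving members.
rebuilt = xor_blocks([data_a, parity])
assert rebuilt == data_b
print("rebuilt block matches the lost one:", rebuilt)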
 
Associate
OP
Joined
18 Nov 2003
Posts
1,311
Location
Newcastle
Well we are up and running again. What a day!!

LSI Support didn't pick up the phone at all this afternoon. So, great support from them I must say, NOT!

Anyway, the way I managed to fix this was to force the damaged drive into failure mode. This caused the two other drives to take over and enabled me to boot into Windows 2003 at full speed. I pulled out the dead disk and replaced it with a new drive, which is currently rebuilding itself within Windows.

I still can't understand, though, why the RAID alarm hadn't gone off to start with, and why the controller lost its config on reboot. Luckily we managed to pull the config back off the damaged array and into the NVRAM. The disk we removed had degraded the server to an almost-standstill.

Time to bring in the SANs and crack open the beers.
 
Soldato
Joined
10 Oct 2005
Posts
8,706
Location
Nottingham
I still can't understand, though, why the RAID alarm hadn't gone off to start with, and why the controller lost its config on reboot. Luckily we managed to pull the config back off the damaged array and into the NVRAM. The disk we removed had degraded the server to an almost-standstill.

Time to bring in the SANs and crack open the beers.

Good to know you've got it fixed.

As to why it didn't alert and degraded things so much... well, the problem is probably related to the disk not completely failing. I've seen it quite a few times on commercial Unix servers where a disk is faulty but hasn't actually failed completely. This floods the system with error messages and (in my cases) SCSI bus reset errors until the disk can be manually forced out of the configuration, at which point the system returns to normal running with degraded disk resiliency.

In my experience disk failures are rarely a clean case of 'it works' -> 'it doesn't work'; there's normally a middle ground that screws things up.
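If you want to catch that sort of half-dead disk before it drags everything down, one cheap option is to poll SMART health yourself rather than relying solely on the controller's alarm. A rough Python sketch, assuming a Linux host with smartmontools installed and disks that smartctl can see directly (the device names are placeholders; behind some RAID cards you'd need extra smartctl device options):

#!/usr/bin/env python3
"""Rough sketch of a periodic SMART health check for spotting half-dead disks.

Assumes a Linux host with smartmontools installed and disks smartctl can
query directly; the device names below are placeholders. Behind some RAID
controllers you would need extra smartctl device options.
"""
import subprocess

DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # placeholder device names


def smart_health(device):
    """Run 'smartctl -H' and report whether the drive says it's healthy."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    output = result.stdout + result.stderr
    return ("PASSED" in output or "OK" in output), output


for disk in DISKS:
    healthy, report = smart_health(disk)
    print(f"{disk}: {'healthy' if healthy else 'CHECK THIS DISK'}")
    if not healthy:
        print(report)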
 
Soldato
Joined
14 Dec 2005
Posts
12,488
Location
Bath
Good to know you've got it fixed.

As to why it didn't alert and degraded things so much... well, the problem is probably related to the disk not completely failing. I've seen it quite a few times on commercial Unix servers where a disk is faulty but hasn't actually failed completely. This floods the system with error messages and (in my cases) SCSI bus reset errors until the disk can be manually forced out of the configuration, at which point the system returns to normal running with degraded disk resiliency.

In my experience disk failures are rarely a clean case of 'it works' -> 'it doesn't work'; there's normally a middle ground that screws things up.

I've seen this too.

The disk has problems, but not enough for the controller to mark it as failed and ignore it. Instead it just keeps trying to use the half-dead disk, causing all manner of problems. Get the half-dead drive out, or convince the controller that it's ****ed and shouldn't be bothered with, and hey presto, everything's working fine again (obviously with reduced redundancy until you replace the failed disk).
 