What causes RAID failures?

A RAID storage server can fail for many reasons.

The commonest reason is hard drive failure. With a RAID 0 configuration, a single drive failure immediately causes complete loss of access to your data, because the data is striped across all the drives with no redundancy. All other RAID configurations have an element of redundancy, so a single drive failure does not necessarily mean a catastrophe.
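As a rough illustration of the redundancy point, the Python sketch below encodes the number of simultaneous drive failures that the common RAID levels are guaranteed to survive. The function names `tolerated_failures` and `array_survives` are hypothetical, and the figures are the textbook guarantees only; a real array's behaviour also depends on the controller and on which drives fail.

```python
def tolerated_failures(level: int, drives: int) -> int:
    """Guaranteed number of simultaneous drive failures an array survives.

    Textbook figures only (an assumption of this sketch, not a vendor spec).
    """
    if level == 0:
        return 0            # striping only, no redundancy
    if level == 1:
        return drives - 1   # every drive holds a full copy
    if level == 5:
        return 1            # single parity
    if level == 6:
        return 2            # double parity
    if level == 10:
        return 1            # more only if failures land in different mirrors
    raise ValueError(f"RAID level {level} is not covered by this sketch")


def array_survives(level: int, drives: int, failed: int) -> bool:
    """True if the array is still guaranteed to be readable."""
    return failed <= tolerated_failures(level, drives)


print(array_survives(0, 4, 1))  # RAID 0 with one failed drive
print(array_survives(5, 4, 1))  # RAID 5 with one failed drive
```

The first call reports that the RAID 0 array is lost, while the second reports that the RAID 5 array keeps running, which is the degraded state discussed under rebuild failure.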

RAID controller failure. This can take the form of total hardware failure, or of any issue that corrupts the RAID parameters or even the data itself. It is not as common as it was twenty years ago, when it was the norm to have multiple RAID controllers for added redundancy.

RAID rebuild failure. When, for whatever reason, one or more hard drives within the RAID array have been “dropped” (marked as failed or bad by the controller), a system that is so configured will automatically “pick up” a hot spare and start to rebuild the RAID array, running in the interim in what is called a “degraded state”. If a further drive fails, or unreadable sectors turn up on one of the remaining drives before the rebuild completes, the rebuild itself fails.
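To make the rebuild step concrete, here is a minimal Python sketch of how a single-parity layout reconstructs the block that lived on a dropped drive. It assumes a RAID 5 style XOR parity scheme, which the text above does not name, and the helpers `xor_blocks` and `rebuild_missing` are hypothetical; real controllers work on whole stripes with rotating parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the one missing block (marked None) in a single-parity stripe."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        # With single parity, a second lost block cannot be recovered.
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Three data blocks plus one parity block, one block per drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

# Drive 1 is dropped: the array is degraded but still recoverable.
degraded = [stripe[0], None, stripe[2], stripe[3]]
rebuilt = rebuild_missing(degraded)
print(rebuilt == data[1])
```

If a second block in the same stripe is lost before the rebuild finishes, `rebuild_missing` raises an error, which mirrors why a further failure during the degraded state is so damaging.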

Firmware update failure. This is not so common with dedicated RAID controllers built into servers, but it is quite frequent when the RAID array takes the form of a NAS (Network Attached Storage) device.

Power surges. A good quality UPS (uninterruptible power supply) with surge protection should keep power surges from affecting your RAID storage. However, cheaper UPSes or surge protection devices (or no protection at all) will not always cope properly with a power surge.

Incorrect drive substitution for RAID rebuild. This is a fairly common occurrence. Sometimes the IT administrator, on being made aware of a failed or failing drive within a RAID array, will mistakenly remove a healthy drive, leaving the bad drive in the array, and insert a new drive in the healthy drive’s place. With RAID 1 this can mean terminal loss of data.

Reinitialisation of the array. This is usually the result of operator error, but it can also be caused by faulty firmware or a bad firmware upgrade.

RAID administrator or operator error. This is by far the commonest cause of total, unrecoverable data loss that we encounter. It is essential that, on discovering you have a failed RAID array, you let professionals attend to it, especially if the data is critical. A “quick fix” usually results in a data recovery bill far higher than would normally be charged for a “virgin” (untouched) RAID failure.