RAID levels, do you care about your data?

I thought this was an interesting exchange on Spiceworks: http://community.spiceworks.com/topic/95946

Basically, it questions the use of RAID 5 (or RAID 50) in a production environment. I would think this applies whether you have an EqualLogic or any other SAN.

Basically, RAID 5 has a flaw called the RAID-5 write hole. Whenever you update the data in a RAID stripe you must also update the parity, so that all disks XOR to zero; it’s that equation that allows you to reconstruct data when a disk fails. The problem is that there’s no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage.

http://blogs.sun.com/bonwick/entry/raid_z
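To make the XOR equation concrete, here is a minimal Python sketch. The stripe layout (three data blocks plus one parity block) and the block contents are made up for illustration and are not taken from any particular RAID implementation. It shows a normal rebuild after a disk loss, and then what happens when a data block is updated but the matching parity write never lands.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: three data blocks and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The invariant: data and parity together XOR to zero.
assert xor_blocks(data + [parity]) == bytes(4)

# Disk 1 is lost: rebuild it from the surviving data blocks and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Write hole: disk 0 receives new data, but the crash hits before the
# parity block is rewritten, so the parity on disk is stale.
data[0] = b"ZZZZ"
stale_parity = parity

# If disk 1 fails now, the rebuild runs without error but returns wrong bytes.
bad_rebuild = xor_blocks([data[0], data[2], stale_parity])
assert bad_rebuild != b"BBBB"
```

The point of the last step is that nothing flags the problem: the rebuild succeeds and simply hands back incorrect data.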

I guess the question is: is it appropriate to run a production SQL or Exchange server on a RAID 5 (or RAID 50) volume? Mr. Scott Alan Miller would, I think, say no. He likes RAID 10 for that sort of thing.

Any thoughts?

5 comments

  1. lol, reminds me of BAARF (http://www.miracleas.com/BAARF/BAARF2.html)

    IMO, RAID 5 on an enterprise SAN like an EQL is not as “dangerous” as stated.
    Even RAID 50 is not as “not done” as they say.

    As you posted on Spiceworks, EQL (and other vendors) uses preemptive drive replacement (since firmware 4.1.x, I think).
    If you use RAID 50 on an EQL you must use 2 hot spares (which are “protected” by the preemptive drive replacement).

    Anyway, if you want to be safe, use RAID 10, that’s correct. But to give you an example:
    14 disks in an array (not counting the 2 hot spares)
    Let’s say 450GB per disk:
    RAID 10 gives you 3TB
    RAID 50 gives you about 5.3TB

    Looking at the cost of an EQL array, that’s a very expensive 2TB or so to lose on the array, IMO.

    So it comes down (as usual) to the budget you have 🙂

    But RAID discussions are always good food for thought.
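The capacity figures in the comment above can be roughly reproduced with a little arithmetic. This is only a sketch under assumptions the commenter did not state: RAID 50 is laid out as two RAID 5 sets (so two disks' worth of space goes to parity), and a terabyte is counted as 1024 GB. The function names are made up for this example.

```python
def raid10_usable_gb(disks, disk_gb):
    """RAID 10 mirrors every disk, so half the raw capacity is usable."""
    return (disks // 2) * disk_gb

def raid50_usable_gb(disks, disk_gb, raid5_sets=2):
    """RAID 50 gives up one disk of capacity per RAID 5 set (assumed: 2 sets)."""
    return (disks - raid5_sets) * disk_gb

r10 = raid10_usable_gb(14, 450)   # 3150 GB
r50 = raid50_usable_gb(14, 450)   # 5400 GB

print(f"RAID 10: {r10 / 1024:.2f} TB")   # prints "RAID 10: 3.08 TB"
print(f"RAID 50: {r50 / 1024:.2f} TB")   # prints "RAID 50: 5.27 TB"
```

Under those assumptions the gap is 2250 GB, which lines up with the "2TB" mentioned in the comment.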

  2. The RAID5 write hole does not occur in enterprise RAID arrays like EqualLogic.

    The RAID5 write hole exists only in host software-based or very simple adapter-based solutions. The write hole occurs because a power failure during an update of a stripe can leave the stripe inconsistent, and if that is not fixed before a drive failure, the stripe can be rebuilt with incorrect data/parity.

    Modern highly available RAID arrays for the enterprise, including EqualLogic, have a protected write cache which is mirrored. The protected cache prevents an inconsistent stripe from occurring after a power failure.

    Since these controllers operate in write-back, data for a host write is preserved in cache until all disk-side IOs corresponding to that write complete. If power fails, the cache contents are preserved. When power is restored, the IOs are simply replayed from cache. So if there were any incomplete stripe updates at the point of power failure, the stripes are made consistent by replaying the IOs from cache. This works even if a drive in the RAID5 fails as the array recovers from a power failure.

    This works even if you have a controller failure because the mirror copy of the cache is available on the peer controller.

      • Clarification: The write hole is absent regardless of whether the controller operates in write-back or write-through for user data; the key is that the data sits in protected memory until all disk-side operations complete, which allows stripe updates to be completed when power is restored.
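The replay idea described in the second comment can be sketched as a toy model. This is not how EqualLogic or any other real controller is implemented; the class `ToyArray`, its method names, and the single-stripe layout are all invented for illustration. The only thing it tries to show is that a write kept in protected memory until both the data and parity updates finish can be redone after a crash.

```python
class ToyArray:
    """Toy model: one stripe with two data blocks and one parity block (ints)."""

    def __init__(self):
        self.disks = {"d0": 0, "d1": 0, "parity": 0}
        self.protected_cache = []  # assumed to survive power loss

    def _apply(self, name, value, skip_parity=False):
        self.disks[name] = value
        if not skip_parity:
            self.disks["parity"] = self.disks["d0"] ^ self.disks["d1"]

    def write(self, name, value, crash_before_parity=False):
        # Record the write in protected cache before touching any disk.
        entry = (name, value)
        self.protected_cache.append(entry)
        self._apply(name, value, skip_parity=crash_before_parity)
        if crash_before_parity:
            return  # simulated power loss: the entry stays in the cache
        self.protected_cache.remove(entry)

    def consistent(self):
        d = self.disks
        return d["d0"] ^ d["d1"] ^ d["parity"] == 0

    def replay(self):
        # On power-up, redo every write still held in cache, then release it.
        while self.protected_cache:
            name, value = self.protected_cache.pop(0)
            self._apply(name, value)


array = ToyArray()
array.write("d0", 0b1010)
array.write("d1", 0b0110, crash_before_parity=True)
torn = not array.consistent()  # the stripe is inconsistent after the crash
array.replay()
assert array.consistent()
```

Without the cached entry, nothing in the array would know that the stripe needs repair, which is the situation the write hole describes.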

  3. The reason for avoiding RAID 5 is not the write hole. The write hole, while bad, is extremely unlikely. RAID 5 is avoided because of the performance cost and the high cost of making a RAID 5 array resilient enough to not be in terrible danger of a URE (unrecoverable read error) failure on resilver. The write hole, when talking about RAID 5 risk, is normally excused because some implementations, notably RAIDZ from Oracle, fix the write hole. The dangers of RAID 5 exist even if you assume that the write hole doesn’t exist. If you use a non-RAIDZ implementation of RAID 5, it just gets that much worse.

    I definitely recommend RAID 10 for SQL data for both performance and safety.

    Here are some references directly related:

    http://www.smbitjournal.com/2012/07/hot-spare-or-a-hot-mess/
    http://www.smbitjournal.com/2012/05/when-no-redundancy-is-more-reliable/
    http://community.spiceworks.com/topic/262196-one-big-raid-10-the-new-standard-in-server-storage

    Thanks for mentioning me! I totally stumbled on this post while looking for some RAID data and suddenly realized that you were talking about ME!
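The URE-on-resilver risk raised in the third comment can be put into rough numbers. The sketch below rests on assumptions that appear nowhere in the discussion: an error rate of one unrecoverable read error per 10^14 bits (a figure often quoted on drive spec sheets), errors that are independent of each other, and a rebuild that has to read every surviving disk in a 7-disk RAID 5 set of 450GB drives in full. Real drives and arrays may behave quite differently, so treat the result as an order-of-magnitude estimate only.

```python
import math

def p_ure_during_rebuild(surviving_disks, disk_gb, ure_per_bit=1e-14):
    """Chance of at least one URE while reading all surviving disks in full.

    Assumes independent bit errors at a fixed rate (hypothetical default).
    """
    bits_read = surviving_disks * disk_gb * 1e9 * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# One disk of a 7-disk RAID 5 set has failed; the other 6 must be read.
p = p_ure_during_rebuild(6, 450)
print(f"Estimated chance of a URE during rebuild: {p:.0%}")
```

With these inputs the estimate comes out at roughly one in five. Passing `ure_per_bit=1e-15` instead brings it down to a couple of percent, which shows how sensitive the result is to the assumed error rate.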
