The Cloud Bacteria: Can two disks fail in a short interval?

Yes, of course. That answer is obvious. The real question is how these failures are handled in HCI architectures: what happens to data availability while they occur?

Two questions arise:

Does the HCI architecture start protecting data as quickly as possible?
How smart is the architecture at rebuilding the lost data, and how long will the rebuild take?

If the rebuild takes a long time, that window is a risky period. The administrator must hope that no other failure occurs during the rebuild, so that the FTT and RF requirements are still met.
In the absence of a smart architecture, vendors may offer the brute-force alternative of increasing RF / FTT, which increases cost.
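
To make the "risky period" concrete, here is a rough sketch of how the chance of a further disk failure grows with the length of the rebuild window. It assumes independent disks with a constant failure rate derived from an annualized failure rate (AFR). The function name, the 2% AFR, and the count of 23 surviving disks are illustrative assumptions, not figures from Nutanix or from the test in this post.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_second_failure(surviving_disks: int, afr: float, rebuild_hours: float) -> float:
    """Probability that at least one surviving disk fails during the rebuild window.

    Assumes independent disks with a constant (exponential) failure rate.
    """
    # Convert the annualized failure rate into a per-hour failure rate.
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    # Probability that at least one of the surviving disks fails within the window.
    return 1.0 - math.exp(-surviving_disks * rate_per_hour * rebuild_hours)


if __name__ == "__main__":
    # Hypothetical cluster: 23 surviving disks, 2% AFR (both assumed values).
    for hours in (0.5, 6, 24):
        p = p_second_failure(surviving_disks=23, afr=0.02, rebuild_hours=hours)
        print(f"{hours:>5} h rebuild window -> P(another failure) = {p:.6f}")
```

Under this simple model the risk is close to proportional to the rebuild time, which is why shortening the rebuild matters more than anything else short of adding extra copies.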

In this video, I simulate two disk failures with soft disk pulls on Nutanix AOS. The short video shows how quickly AOS acts to protect the data from subsequent failures, and how quickly it completes the rebuild. Its truly distributed architecture employs all available resources in the cluster, with minimal impact on the performance of the existing workload.

Nutanix AOS is truly distributed and is very quick in handling failures:

AOS gives utmost priority to data availability and acts immediately on a failure to protect data from any subsequent failure.
The second copy of one disk's data is distributed almost evenly across all disks on the other nodes. This shortens the time needed to re-replicate the data onto other disks if and when a disk fails.
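
A back-of-the-envelope sketch of why spreading the second copies shortens the rebuild: if many disks each contribute a share of the re-replication work, the total time shrinks roughly in proportion to the number of participating disks. The function and all numbers below (2 TB of data on the failed disk, 50 MB/s of rebuild throughput per disk, 20 participating disks) are assumptions for illustration only. They are not AOS internals or measured results, and real rebuilds are also limited by network and CPU.

```python
def rebuild_hours(data_tb: float, per_disk_mb_s: float, participating_disks: int) -> float:
    """Estimate rebuild time when the work is split evenly across disks.

    Idealized model: ignores network, CPU, and workload contention.
    """
    total_mb = data_tb * 1_000_000  # decimal TB -> MB
    aggregate_mb_s = per_disk_mb_s * participating_disks
    return total_mb / aggregate_mb_s / 3600


if __name__ == "__main__":
    # Hypothetical figures: 2 TB to re-protect, 50 MB/s rebuild budget per disk.
    single = rebuild_hours(data_tb=2, per_disk_mb_s=50, participating_disks=1)
    spread = rebuild_hours(data_tb=2, per_disk_mb_s=50, participating_disks=20)
    print(f"Rebuild onto one disk:    {single:.2f} h")
    print(f"Rebuild across 20 disks:  {spread:.2f} h")
```

In this idealized model, the single-disk case is the situation a rebuild onto one spare or replacement disk would face, while the distributed case is the behaviour described above.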

True. Although I used X-Ray to run an OLTP workload, I had to simulate the disk pulls manually. How about fully automating this in X-Ray? Is it possible?

The X-Ray 3.1 release is built with Ansible playbook support. I am working on an X-Ray scenario and will post it here as soon as I am done testing. Stay tuned!

Tuesday, September 25, 2018