Wednesday, April 24, 2019

Do you want to do "Disk Pull"?

(1) Would you like to know how to fully automate disk pull/failure
testing on a Nutanix cluster?


(2) Would you like to know why pulling disks from HW is less
realistic than failing via software?


(3) Would you like to know _exactly_ how Nutanix handles
disk failures?


Then read on….

X-Ray lets users create custom scenarios of their own. Here is a simple OLTP workload with a disk failure event, shown in the X-Ray UI. AOS detects the disk pull/failure event and starts the rebuild process immediately. Because data is spread across many disks on many nodes, the rebuild employs all of those disks and completes in just minutes.





As shown, the running workload is unaffected, and the latency increase lasts only a couple of minutes.
AOS makes the data safe again as soon as possible and does not leave a long window in which
subsequent failures could put data at risk.






I have the above scenario built in X-Ray, fully automated using X-Ray's Ansible playbook
support. This means one does not have to be physically present in the lab to run it. I will link
the X-Ray package shortly.
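For the curious, the heart of the automated disk pull is small. Below is a minimal sketch of what the playbook effectively does against a CVM; the CVM address, SSH access as the nutanix user, and the device name are placeholders and assumptions you would replace for your own environment.

# Placeholders: substitute your own CVM address and the data drive chosen in step (2) of the manual procedure below.
CVM_IP=10.0.0.11
DEVICE=sdj

# Remove the device from the SCSI subsystem on the CVM to simulate the pull.
ssh nutanix@"${CVM_IP}" "echo 1 | sudo tee /sys/block/${DEVICE}/device/delete"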

Why do you want to do a “Disk Pull” test in POCs?

It gives administrators confidence that the architecture they are betting on will handle failure
scenarios seamlessly and keep their data safe. That confidence is warranted only to an extent,
because in some architectures the rebuild process triggered by a component failure is expensive.
Due to that expense, such architectures try to classify the failure using a few assumptions so they
can skip or defer the rebuild, which can lead to undesirable results and data loss.
If you want to test the "Disk Pull" scenario, you may first want to learn how the architecture
is equipped to handle it. Look for the following during this scenario:
  1. What happens to the data residing on the drive that is pulled/failed? Does the architecture spring into action immediately to re-protect the data that just lost a copy?
  2. Does the architecture make assumptions (some call it an intelligent design choice :) ) about the nature of the drive event and wait before taking any action to protect the data?
  3. What happens to new writes? During the drive pull event, the application continues accessing and writing data. Is the new data written the same way as before the event? Is data resiliency ensured if another component fails during this “disk pull” event?
The core design of Nutanix AOS is built on data resiliency and data integrity. Because AOS distributes its writes evenly across drives, the rebuild process is simple and inexpensive. Since the core architecture of AOS is a distributed system:
  1. Nutanix AOS starts the rebuild process as soon as it detects that the drive has been pulled.
  2. Because Nutanix AOS bets its design on data resiliency, it makes no assumptions about the nature of the event. It does not hope for the best and wait for the same disk to reappear; data integrity is at the core of AOS, so the rebuild starts immediately. Unlike other architectures, Nutanix AOS employs many disks in the rebuild, since data is evenly distributed, which keeps the rebuild inexpensive.
  3. During the disk failure and rebuild, Nutanix AOS does not compromise on data resiliency: new writes go to new locations (different disks in the cluster) and continue to be maintained at the required RF (Replication Factor) level.
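If you want to verify these points from the command line rather than the Prism UI, one way is to check the cluster's fault tolerance status from a CVM. This is only a sketch; the exact ncli syntax below is an assumption from memory and may vary across AOS versions.

# Show how many node failures the cluster can currently tolerate.
# Expect this to dip right after the pull and return to the configured level once the rebuild completes.
ncli cluster get-domain-fault-tolerance-status type=node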
How do you do “Disk Pull” in Nutanix POCs without physically removing the drives?

I have the custom scenario built and fully automated (results shown above). In case one wants to do this semi-manually, here are a few quick steps to accomplish the same.

(1) Bring up an X-Ray instance and start the “OLTP Simulator (Med)” test in the X-Ray UI against the
target cluster.

(2) Decide which drive to pull by logging into a CVM and listing the devices with “df -h”.

# df -h
/dev/sdj1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM817501UZ1P9DGN
/dev/sdh1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/BTYM732500771P9DGN
/dev/sdi1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM817501V51P9DGN
/dev/sdg1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM727600RU1P9DGN
...


If you want to pull a specific kind of drive, such as one that hosts metadata versus one that does not, you can use the “zeus_config_printer” command to find out whether a drive is hosting metadata, as sketched below.
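For example, a quick way to correlate disk serial numbers with metadata ownership is to grep the Zeus config. The field names below (disk_serial_id, contains_metadata) are assumptions based on typical output and may differ slightly by AOS version.

# List each disk's serial number along with whether it hosts metadata.
zeus_config_printer | egrep 'disk_serial_id|contains_metadata'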




(3) After the OLTP workload has been running for a few minutes, log into one of the CVMs and
issue the following command to delete the device.
echo 1 > /sys/block/sdX/device/delete
OR
echo 1 | sudo tee /sys/block/sdX/device/delete


Here sdX is the device name (for example, sdj from the listing above). The command removes device sdX from the SCSI subsystem, which simulates a physical “disk pull” event.

(4) You can continue monitoring the X-Ray UI and the Prism UI for any changes you want to observe.
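You can also confirm the effect from the CVM itself; a minimal check, assuming the pulled device was sdj:

# The deleted block device should no longer appear on the CVM.
lsblk | grep sdj || echo "sdj has been removed from the SCSI subsystem"

# Optionally compare against the df -h output from step (2); the failed disk's mount
# should drop out once AOS marks the disk offline.
df -h | grep stargate-storage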

(5) Finally, you can add the disk back to the storage pool by clicking the “Repartition and add Disk” option
in the Hardware > Disk menu in the Prism UI.
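Note that because the drive was removed in software rather than physically reseated, the CVM first has to rediscover the device before Prism can repartition it. A minimal sketch, assuming a standard Linux SCSI rescan from the CVM (SCSI host numbering varies by platform):

# Rescan all SCSI hosts so the deleted device (e.g. sdj) reappears.
for h in /sys/class/scsi_host/host*; do
  echo "- - -" | sudo tee "$h/scan"
done

# Verify the device is visible again before using the repartition option in Prism.
lsblk | grep sdj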

Summary:

"Disk Pull" is a decent test for POCs and can be automated using X-Ray. This test can give you an
indication on how architecture is designed to protect your data.