Wednesday, April 24, 2019

Do you want to do "Disk Pull"?

(1) Would you like to know how to fully automate disk pull/failure
testing on a Nutanix cluster?


(2) Would you like to know why pulling disks from HW is less
realistic than failing via software?


(3) Would you like to know _exactly_ how Nutanix handles
disk failures?


Then read on….

X-Ray lets users create their own custom scenarios. Here is a simple OLTP workload shown with a disk failure event in the X-Ray UI. AOS detects the disk pull/failure event and starts the rebuild process immediately. Because data is spread across many disks on many nodes, the rebuild employs all of those disks in parallel and takes only minutes to re-protect the data lost from the failed/pulled disk.





As shown, the running workload is unaffected and latency rises for only a couple of minutes,
as AOS makes the data safe as quickly as possible instead of leaving a longer window in which
subsequent failures could put data at risk.






I have built the above scenario in X-Ray, fully automated using X-Ray’s Ansible playbook
support. This means no one has to be physically present in the lab to run it. I will link the
X-Ray package shortly.
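In the meantime, here is a rough idea of what the automation boils down to. This is a minimal ad-hoc sketch rather than the actual playbook; the CVM address, user, and device name are placeholders that the real playbook parameterizes:

# illustrative only: use Ansible's shell module to delete /dev/sdj on a CVM at 10.0.0.5
ansible all -i "10.0.0.5," -u nutanix -b -m shell \
  -a "echo 1 > /sys/block/sdj/device/delete"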

Why would you want to do a “Disk Pull” test in POCs?

This test gives administrators confidence that the architecture they are betting on will handle
failure scenarios seamlessly and keep their data safe. The concern is justified: in some
architectures the rebuild process after a component failure is expensive. Because of that
expense, some architectures try to classify the component failure using a few assumptions so
they can skip or defer the rebuild, which may lead to undesirable results and data loss.
If you want to test the "Disk Pull" scenario, you may first want to learn how the architecture
is equipped to handle it. Look for the following in this scenario:
  1. What happens to the data residing on the drive that is pulled/failed? Does the architecture act immediately to re-protect the copy of the data that was lost?
  2. Does the architecture make assumptions (some call it an intelligent design choice :) ) about the drive event and wait before taking any action to protect the data?
  3. What happens to new writes? During the drive pull event, the application continues accessing and writing data. Is new data written the same way as before the event? Is data resiliency still ensured if another component fails during this “disk pull” event?
The core design of Nutanix AOS is built on data resiliency and data integrity. Because AOS distributes its writes evenly across drives, the rebuild process is simple and inexpensive. Since the core architecture of AOS is a distributed system:
  1. Nutanix AOS starts the rebuild process as soon as it detects that the drive has been pulled.
  2. Because Nutanix AOS bets its design on data resiliency, it makes no assumptions about the nature of the event. It does not hope for the best and wait for the same disk to reappear; data integrity is at the core of AOS, so it starts the rebuild immediately. Unlike other architectures, AOS employs many disks in the rebuild because data is evenly distributed, which keeps the rebuild inexpensive.
  3. During the disk failure and rebuild, Nutanix AOS does not compromise on data resiliency: new writes go to new locations (different disks in the cluster) and continue to be maintained at the required RF (replication factor).
How do you do “Disk Pull” in Nutanix POCs without physically removing the drives?

I have the custom scenario built and fully automated (results shown above). If you want to do this semi-manually instead, here are a few quick steps to accomplish the same.

(1) Bring up X-Ray instance and start “OLTP Simulator (Med)” test in X-Ray UI against the
target cluster

(2) Decide which drive to pull by logging into a CVM and listing the devices with “df -h”:


# df -h
/dev/sdj1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM817501UZ1P9DGN
/dev/sdh1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/BTYM732500771P9DGN
/dev/sdi1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM817501V51P9DGN
/dev/sdg1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM727600RU1P9DGN
...


If you want to pull a specific kind of drive, such as one that hosts metadata versus one that does not, you can use the “zeus_config_printer” command to find out whether a drive hosts metadata, as sketched below.
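For example, something along these lines (a rough sketch; the exact field names can vary across AOS versions) shows the configuration entry for the disk with a given serial number:

# show the zeus config entry for the disk whose serial was listed by df -h
zeus_config_printer | grep -B 5 -A 15 PHYM817501UZ1P9DGN
# in the matching disk entry, a metadata flag (for example, contains_metadata: true)
# indicates the drive hosts metadata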




(3) After the OLTP workload has been running for a few minutes, log into the same CVM and
issue the following command to delete the device.
echo 1 > /sys/block/sdX/device/delete
OR
echo 1 | sudo tee /sys/block/sdX/device/delete


Here sdX is the device name. The command makes device sdX disappear from the SCSI subsystem, which simulates a physical “disk pull” event.
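To confirm the device is gone before moving on, a quick check like the following should come back empty (sdj is just the example device from step 2):

# the deleted device should no longer appear in the block device list or the mounts
ls /sys/block | grep sdj
df -h | grep sdj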

(4) You can continue monitoring the X-Ray UI and the Prism UI for any changes you want to observe.

(5) Finally, you can add the disk back to the pool by clicking the “Repartition and add Disk” option
under the Hardware > Disk menu in the Prism UI.
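Because the disk was never physically removed, the CVM may need to rediscover the device before Prism can repartition it. A standard Linux SCSI rescan, sketched below, usually brings it back; the host adapter number varies, so the simple option is to rescan all of them (rebooting the CVM also rediscovers the device):

# ask every SCSI host adapter on the CVM to rescan for devices
for host in /sys/class/scsi_host/host*; do
  echo "- - -" | sudo tee $host/scan
done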

Summary:

"Disk Pull" is a decent test for POCs and can be automated using X-Ray. This test can give you an
indication on how architecture is designed to protect your data.

Tuesday, February 26, 2019

Cost Estimate for VM migration from AWS to Nutanix AHV

Starting with Nutanix Move (formerly Xtract) 2.0.2, you can migrate VMs from AWS to
Nutanix AHV. This feature is available as a tech preview. This post helps you estimate
the cost involved in migrating VMs from AWS.

Preview of the Nutanix Move for AWS Architecture
Nutanix Move is a VM usually deployed in the target cluster to which VMs are migrated.
When the source is AWS, Nutanix Move brings up a move agent as a t2.micro VM instance
in the region where the source VM is hosted. One of the software components in Move is
the source agent, which takes snapshots of the VM to be migrated and mounts those
snapshots in the move agent for the disk reader to read. The disk reader is the software
component in Nutanix Move that reads data continuously from the source disks and sends it
over the WAN to the disk writer component in Move. The following figure diagrams this process.


Estimated Cost of Migrating from AWS
We estimate the cost involved in migrating VMs from AWS, keeping the above architectural
information in mind.
  1. AWS: Internet data transfer out cost
    AWS charges $0.09 per GB for transferring data out. If you move a 1 TB VM, the cost is approximately $92.16 for data transfer alone (see the quick check after this list).
  2. AWS: Regional data transfer cost
    Nutanix Move brings up a move agent as a t2.micro VM instance in each region. Data transfer from AWS to AHV happens while Move migrates the VMs, and, if the source VMs are hosted in a different availability zone than the t2.micro instance, AWS’s regional data transfer cost applies. AWS charges $0.01 per GB, so for a 1 TB VM the regional data transfer cost is $10.24.
  3. AWS: EC2 instance cost for the move-agent VM
    Nutanix Move launches the move-agent VM (t2.micro) automatically when a migration plan involves AWS, and the instance remains active at least until the migration is complete. Move leaves the instance running for future migrations and terminates it only when all migration plans involving AWS are removed, so we have to factor in the cost of running this EC2 instance: about $2 per day.
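As a quick sanity check on the per-GB numbers above, the arithmetic for a 1 TB (1,024 GB) VM works out as follows (the rates are the public list prices quoted above and may change):

# 1 TB = 1024 GB
echo "1024 * 0.09" | bc   # internet data transfer out  => 92.16
echo "1024 * 0.01" | bc   # regional (cross-AZ) transfer => 10.24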

Cost Estimate for Large VM Migrations from AWS

The data transfer rate from AWS to AHV can vary greatly because of factors such as
WAN delay. In the example of transferring a 1 TB VM from AWS to AHV, if the WAN speed
is 50 Mbps, the migration would cost about $106.40, as shown in the following table.

Details                                                      Cost (USD)
AWS: Internet data transfer out cost ($0.09/GB)              $92.16
AWS: Regional data transfer cost ($0.01/GB)                  $10.24
  (This is zero if the t2.micro and the source VM are in the
  same availability zone, which tends to be the case. We
  estimate it here to show the maximum potential cost.)
AWS: EC2 cost for move-agent VM (~$2 per day)                $4.00
  (At an estimated average throughput of 50 Mbps, the
  migration takes ~48 hours, or two days, to finish.)
Total                                                        $106.40
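The time estimate behind the two-day EC2 charge, and the total, follow from the same assumptions (1 TB payload, 50 Mbps average throughput):

# 1 TB = 8,388,608 megabits; at 50 Mbps that is roughly 46.6 hours, i.e. about two days
echo "scale=1; 1024 * 1024 * 8 / 50 / 3600" | bc   # => ~46.6 hours
echo "92.16 + 10.24 + 2 * 2.00" | bc               # => 106.40 (USD total)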

Disclaimer: This cost estimate is based on information available in the public domain.