Wednesday, April 24, 2019

Do you want to do "Disk Pull"?

(1) Would you like to know how to fully automate disk pull/failure
testing on a Nutanix cluster?


(2) Would you like to know why pulling disks from HW is less
realistic than failing via software?


(3) Would you like to know _exactly_ how Nutanix handles
disk failures?


Then read on….

X-Ray lets users create their own custom scenarios. Here is a simple OLTP workload with a disk failure event, shown in the X-Ray UI. AOS detects the disk pull/failure event and starts the rebuild process immediately. Because data is spread across many disks on many nodes, the rebuild employs many disks across those nodes and completes in only minutes.





As shown, the running workload is unaffected and the increase in latency lasts only a couple of minutes.
AOS makes the data safe as soon as possible and does not leave a long window in which
subsequent failures could put data at risk.






I have the above scenario built in X-Ray, fully automated using X-Ray's Ansible playbook
support. This means no one has to be physically present in the lab to run it. I will link
the X-Ray package shortly.

Why would you want to do a "Disk Pull" test in POCs?

The test gives administrators confidence that the architecture they are betting on will handle
failure scenarios seamlessly and keep their data safe. That confidence is warranted only to some
extent, because in some architectures the rebuild process after a component failure is expensive.
Due to that expense, some architectures try to classify component failures using a few assumptions
so they can skip or defer the rebuild. This can lead to undesirable results and data loss.
If you want to test the "Disk Pull" scenario, you may first want to learn how the architecture
is equipped to handle it. Look for the following in this scenario:
  1. What happens to the data residing on the drive that is pulled/failed? Does the architecture get into action immediately to re-protect the data whose copy was lost?
  2. Does the architecture make assumptions (some call it an intelligent design choice :) ) about the drive event and wait before taking any action to protect the data?
  3. What happens to new writes? During the drive pull event, the application continues accessing and writing data. Is new data written in the same way as before the event? Is data resiliency ensured if another component fails during this "disk pull" event?
The core design of Nutanix AOS is built on data resiliency and data integrity. Because AOS distributes writes evenly across drives, the rebuild process is simple and inexpensive. As the core architecture of AOS is built for distributed systems:
  1. Nutanix AOS starts the rebuild process as soon as it detects that the drive was pulled.
  2. Because Nutanix AOS bets its design on data resiliency, it makes no assumptions about the nature of the event. It does not hope for the best and wait for the same disk to reappear; data integrity is core to Nutanix AOS, so it starts the rebuild immediately. Unlike other architectures, Nutanix AOS employs many disks in the rebuild because data is evenly distributed, which keeps the rebuild inexpensive.
  3. During the disk failure and rebuild, Nutanix AOS does not compromise on data resiliency: new writes go to new locations (different disks in the cluster), and new data continues to be maintained at the required RF (replication factor).
How do you do “Disk Pull” in Nutanix POCs without physically removing the drives?

I have the custom scenario built and fully automated (results shown above). If you want to do this semi-manually, here are a few quick steps to accomplish the same.

(1) Bring up an X-Ray instance and start the "OLTP Simulator (Med)" test in the X-Ray UI against the target cluster.

(2) Decide which drive to pull by logging into a CVM and listing the devices with "df -h".


# df -h
/dev/sdj1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM817501UZ1P9DGN
/dev/sdh1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/BTYM732500771P9DGN
/dev/sdi1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM817501V51P9DGN
/dev/sdg1       1.8T 16G 1.7T   1% /home/nutanix/data/stargate-storage/disks/PHYM727600RU1P9DGN
...


If you want to pull a specific type of drive, for example one hosting metadata versus one that is not, you can use the "zeus_config_printer" command to find out whether a drive is hosting metadata.
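For example, a quick check is to grep the Zeus configuration for the disk entries and match the serial against the mount paths shown by "df -h". This is only a sketch; the field names (disk_serial_id, contains_metadata) are my assumptions and may differ between AOS versions.

# Show each disk's serial and whether it currently hosts metadata.
# Field names are assumptions and may vary by AOS version.
zeus_config_printer | grep -E 'disk_serial_id|contains_metadata'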




(3) After the OLTP workload has run for a few minutes, log into one of the CVMs and issue the
following command to delete the device.
echo 1 > /sys/block/sdX/device/delete
OR
echo 1 | sudo tee /sys/block/sdX/device/delete


Here sdX is the device name. The command above makes device sdX disappear from the SCSI subsystem, which simulates a physical "disk pull" event.
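To confirm the soft pull took effect, you can check the OS view of the block devices with standard Linux tools (a quick sanity check, not part of the X-Ray scenario):

# After the delete, sdX should no longer appear in the device list,
# and the kernel log should show the SCSI device being removed.
lsblk
dmesg | tail -n 20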

(4) Continue monitoring the X-Ray and Prism UIs for any changes you want to observe.

(5) Finally, add the disk back to the pool by clicking the "Repartition and Add Disk" option
under Hardware > Disk in the Prism UI.
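Note that a soft-deleted device will typically not reappear on its own; the SCSI bus has to be rescanned from the CVM so the disk shows up again before it can be re-added from Prism. A minimal sketch (it rescans every SCSI host, since the host number varies):

# Ask each SCSI host adapter to rescan its bus so the soft-deleted
# device is rediscovered, then use "Repartition and Add Disk" in Prism.
for host in /sys/class/scsi_host/host*; do
  echo "- - -" | sudo tee "$host/scan"
done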

Summary:

"Disk Pull" is a decent test for POCs and can be automated using X-Ray. This test can give you an
indication on how architecture is designed to protect your data.

Tuesday, February 26, 2019

Cost Estimate for VM migration from AWS to Nutanix AHV

Starting with Nutanix Move (formerly Xtract) 2.0.2, you can migrate VMs from AWS to
Nutanix AHV. This feature is available as a tech preview. This post helps estimate
the cost involved in migrating VMs from AWS.

Preview of the Nutanix Move for AWS Architecture
Nutanix Move is a VM usually deployed in the target cluster that VMs are migrated to.
When the source is AWS, Nutanix Move brings up the move-agent as a t2.micro VM instance
in the region where the source VM is hosted. One of the software components in Move is
the source agent, which takes snapshots of the VM to be migrated and mounts those
snapshots in the move agent for the disk reader to read. The disk reader is the software
component in Nutanix Move that reads data continuously from the source disks and sends it
over the WAN to the disk writer component in Move. The following figure diagrams this process.


Estimated Cost of Migrating from AWS
We estimate the cost involved in migrating VMs from AWS, keeping the above architectural
information in mind.
  1. AWS: Internet data transfer out cost
    AWS charges $0.09 per GB for transferring data out. If you move a 1TB VM, the cost is approximately $92.16 for data transfer alone.
  2. AWS: Regional data transfer cost

    Nutanix Move brings up a move agent as a t2.micro VM instance for each region. Data transfer from AWS to AHV happens while Move migrates the VMs, and, if the source VMs are hosted in a different availability zone than the t2.micro instance, AWS's regional data transfer cost applies. AWS charges $0.01 per GB, so for a 1 TB VM the regional data transfer cost is $10.24.
  3. AWS: EC2 instance cost for move-agent VM
    The move-agent VM (t2.micro) is launched automatically by Nutanix Move when the migration plan involves AWS and remains active until the migration is complete. Move leaves this instance running for future migrations but terminates it when all migration plans involving AWS are removed. We have to factor in the cost of running this EC2 instance, which is about $2 per day.

Cost Estimate for Large VM Migrations from AWS

The data transfer rate from AWS to AHV can vary greatly because of factors such as
WAN delay. In the example of transferring a 1 TB VM from AWS to AHV, if the WAN speed
is 50 Mbps, the migration costs about $106.40, as shown in the following table.

Details                                                    Cost (USD)
AWS: Internet data transfer out cost ($0.09/GB)            $92.16
AWS: Regional data transfer cost ($0.01/GB)                $10.24
  (This is zero if the t2.micro and the source VM are in the same
  availability zone, which tends to be the case. We estimated it here
  to give the maximum potential cost involved.)
AWS: EC2 cost for move-agent VM (~$2 per day)              $4.00
  (If we estimate 50 Mbps as the average throughput, it would take
  ~48 hours, or two days, for the migration to finish.)
Total                                                      $106.40

Disclaimer: The cost estimate is based on information available in the public domain.
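For reference, here is a quick back-of-the-envelope check of the numbers in the table; it is only a sketch, using the prices and the 50 Mbps average throughput assumed above.

# Sanity-check the cost estimate for migrating a 1 TB VM.
awk 'BEGIN {
  gb = 1024                                        # 1 TB expressed in GB
  out = gb * 0.09                                  # Internet data transfer out
  regional = gb * 0.01                             # Regional data transfer
  hours = (gb * 1024^3 * 8) / (50 * 10^6) / 3600   # 1 TB at 50 Mbps
  ec2 = 2 * 2                                      # ~2 days of t2.micro at ~$2/day
  printf "Transfer out: $%.2f\n", out              # $92.16
  printf "Regional:     $%.2f\n", regional         # $10.24
  printf "Duration:     ~%.0f hours\n", hours      # roughly two days
  printf "EC2:          $%.2f\n", ec2              # $4.00
  printf "Total:        $%.2f\n", out + regional + ec2   # $106.40
}'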

Monday, November 5, 2018

What if Data Locality was not a design choice in Nutanix AOS?



  • What if data locality were not part of the Nutanix AOS architecture, as in a few other HCI architectures on the market?
  • Would AOS suffer like those architectures do on high-throughput workloads such as DSS?
  • When would network bandwidth become the bottleneck for such workloads (e.g., on a 10Gb network)?
Note: Please see my earlier post on whether data locality really makes a difference.

X-Ray, as usual, comes to the rescue: it answers these questions within a couple of hours and lets us see the results visually. Keeping the other design choices of the AOS architecture the same, we wanted to compare the results with data locality turned on and off.

I took an AOS 5.9 build with ESXi 6.5, tuned the following parameters in AOS 5.9, and executed X-Ray's built-in scenario "Database Colocation: High Intensity".

  • Oplog's data locality (DL) is turned off  
  • Extent Store's data locality (DL) is turned off
  • Range Cache (RC), the DRAM data cache, is also turned off
    • This removes cache effects from our experiments

The following charts were created from X-Ray runs. For readability, I have broken the single screenshot into multiple images below.

  • Blue: Data Locality On
  • Green: Data Locality Off


OLTP IOPS stay steady and there is no impact to this workload.

The absence of data locality leads to a slight increase in latency. Even during high-throughput (DSS) workloads with large reads, there is no significant impact on latency.

DSS workloads are also unaffected even when data locality is turned off. I know this is not the case with other architectures.

The effect of data locality is visible in the network traffic charts below.

Network traffic jumped to about 1.5 GB/s. This all-flash cluster has a single 10 Gbit network, which can handle up to about 1.2 GB/s, the target throughput expected from the two DSS workloads on two different nodes.

Unlike other architectures, Nutanix AOS places the two copies of RF2 data evenly across all nodes. As a result, about 50% of the data happens to be local, so total network usage is only about half of the throughput the workload generates.

The summary:
  • Nutanix AOS is still better off than other known HCI architectures even when data locality is turned off.
  • Network traffic is relatively low compared to other HCI architectures as data is evenly distributed across all disks and all nodes in Nutanix AOS.
Refer to my earlier post on whether data locality really makes a difference.

Tuesday, September 25, 2018

Can two disks fail in a short interval?


YES, of course. The answer is obvious. The real question is how these failures are handled in HCI architectures. What happens to data availability during these failures?

The questions that arise:
  1. Does the HCI architecture start protecting data as quickly as possible?
  2. How smart is the architecture at rebuilding the lost data, and how long does it take?
    • If the rebuild takes a long time, that window is a risky period: the administrator must hope that no other failure occurs during the long rebuild window, or FTT and RF requirements cannot be met.
    • Vendors may offer the brute-force method of increasing RF/FTT, which increases cost (in the absence of a smart architecture).
In this video, two disk failures are simulated with soft disk pulls on Nutanix AOS. This short video shows how quickly AOS gets into action to protect data from subsequent failures and how quickly it completes the rebuild, thanks to a truly distributed architecture that employs all available resources in the cluster with minimal impact on the performance of the existing workload.
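For the record, the two failures in the video were simulated with the same soft pull shown in the disk-pull walkthrough above, issued against two different devices. A rough sketch, where the device names and the wait are placeholders chosen for the scenario being demonstrated:

# Soft-pull the first disk, wait for a chosen interval, then soft-pull a second one.
# sdX/sdY and the sleep are placeholders; pick them based on whether the second
# failure should land during or after the first rebuild.
echo 1 | sudo tee /sys/block/sdX/device/delete
sleep 300
echo 1 | sudo tee /sys/block/sdY/device/delete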




Nutanix AOS is truly distributed and is very quick in handling failures:
  1. AOS gives utmost priority to data availability and immediately gets into action on a failure to protect data from any subsequent failures.
  2. The second copy of a disk's data is distributed almost evenly across all disks on the other nodes. This shortens the window needed to repopulate the data onto other disks if and when a disk fails.

True, I used X-Ray to run the OLTP workload, but I had to simulate the disk pulls manually. How about fully automating this in X-Ray? Is it possible?

The X-Ray 3.1 release ships with Ansible playbook support. I am working on posting the X-Ray scenario here as soon as I am done testing. Stay tuned!

Wednesday, September 19, 2018

X-Ray Custom Scenarios Explained: 1. Configuration definitions


The X-Ray team made their built-in scenarios open source a few months ago. I hope this blog series helps you write your own custom scenarios.

The scenario definition file, called test.yml, is placed in a unique custom scenario directory. It consists of several sections, as described in the picture below. The scenario file is in YAML format and is parsed with the help of the Jinja2 parser and an Ansible playbook.

The following picture does not include the documentation and UI-related information that typically exists in the scenario file.




In this part 1, let's look at the syntax for configuration definitions.

VARS:

In the following example, three variables are defined: "vms_per_node", "num_nodes", and "runtime_secs". Each variable definition can have a default, a min, and a max.

vars:
  vms_per_node:
    default: 1
    min: 1
    max: 4
  num_nodes:
    default: 4
    min: 1
  runtime_secs:
    default: 7200
    min: 1
    max: 14400

VM Groups (or) VMS:

A VM Groups definition essentially has two parts. The first part defines the VM config itself: vcpus, RAM, disks, and so on. The second part defines the number of VMs and their placement; typically the number of VMs is scaled per cluster or per node.

X-Ray comes with a built-in VM template called "ubuntu1604". This template has FIO pre-installed, which is the primary I/O load generator.

Ok. Let's look at the simple one first.

vms:
  - VDI_1VM_Per_Cluster:
      template: ubuntu1604
      vcpus: 2
      ram_mb: 2048
      data_disks:
        count: 1
        size: 16
      nodes: all
      count_per_cluster: 1

The VM group called "VDI_1VM_Per_Cluster" is defined with a VM config of 2 vcpus, 2 GB of memory, and 1 data disk of 16 GB. As defined by "count_per_cluster", it brings up just a single VM per cluster, on the first available node in the cluster.

  - VDI_2VM_per_Node:
      template: ubuntu1604
      vcpus: 2
      ram_mb: 2048
      data_disks:
        count: 1
        size: 16
      nodes: all
      count_per_node: 2

The VM group called "VDI_2VM_per_Node" creates 2 VMs per node on all nodes available in the cluster. If the cluster size is 4, then 4*2=8 VMs are created in the cluster, 2 VMs per node.

  - VDI_4VMs:
      template: ubuntu1604
      vcpus: 2
      ram_mb: 2048
      data_disks:
        count: 1
        size: 16
      nodes: 0,1
      count_per_node: 2

The VM group called "VDI_4VMs" creates 2 VMs per node on the first (0) and second (1) nodes in the cluster, for a total of 4 VMs.

OK. What if you want to create one or more VM groups per node and scale them dynamically? This is where Jinja2 syntax comes in handy. Let's define the number of nodes in a variable and define one VM group per node; with num_nodes set to 4, the loop below expands into four VM groups ("VM Group for Node 0" through "VM Group for Node 3"), each creating 2 VMs on its own node.
vars:
  num_nodes:
    default: 4
    min: 1

vms:
{% for node_index in range(num_nodes) %}
  - VM Group for Node {{ node_index }}:
      template: ubuntu1604
      vcpus: 4
      ram_mb: 4096
      data_disks:
        count: 6
        size: 64
      nodes: {{ node_index }}
      count_per_node: 2
{% endfor %}



Wednesday, September 12, 2018

Can data locality help scale throughput?


X-Ray's throughput-scalability-random-reads and throughput-scalability-sequential-reads tests help answer the following:
  1. Can the IOPS or throughput rate be sustained over time?
  2. As the number of nodes in the cluster grows, can throughput or IOPS scale with it?
The random-reads test measures the IOPS rate as it starts 8k random reads on the available nodes at 30-minute intervals. As workloads start on additional nodes, the X-Ray chart helps confirm that the IOPS rate is sustained without much change.






The sequential-reads test measures I/O bandwidth as it starts 1m sequential reads on the available nodes at 30-minute intervals. As workloads start on additional nodes, the X-Ray chart helps confirm that the bandwidth is sustained without much change.




With data locality, AOS does not rely on the network for reads and is able to scale reads 1x for each node added, without affecting other nodes.

X-Ray Four Corners, a simple and quick measure for burst performance

I/O bursts are common in database workloads. How does an HCI architecture handle short workload bursts?

This simple X-Ray test, the four corners microbenchmark, quickly simulates workloads of four different I/O types (8k random reads, 1m sequential reads, 8k random writes, and 1m sequential writes) in sequence, each for a 1-minute duration. It demonstrates how well an HCI architecture handles short 1-minute bursts in IOPS and throughput.
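To get a rough standalone feel for the four-corners pattern outside X-Ray, you can run the same four I/O profiles back to back with fio, the load generator X-Ray's worker VMs use. This is only a sketch: the job names, file path, size, and queue depth are my assumptions, not the actual X-Ray test parameters.

# Run 8k random reads, 1m sequential reads, 8k random writes and
# 1m sequential writes for one minute each against a scratch file.
# /mnt/testvol/fio_burst is a placeholder path; point it at the datastore under test.
for corner in "randread 8k" "read 1m" "randwrite 8k" "write 1m"; do
  read -r rw bs <<< "$corner"
  fio --name="corner_${rw}_${bs}" --rw="$rw" --bs="$bs" \
      --ioengine=libaio --direct=1 --iodepth=32 \
      --filename=/mnt/testvol/fio_burst --size=8G \
      --time_based --runtime=60
done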