Monday, November 5, 2018

What if Data Locality was not a design choice in Nutanix AOS?



  • What if data locality had not been considered in the Nutanix AOS architecture, as is the case with a few other HCI architectures on the market?
  • Would AOS suffer the same way other architectures do under high-throughput workloads such as DSS?
  • When would network bandwidth become the bottleneck for such workloads (e.g., on a 10Gb network)?
Note: Please see my earlier post on whether data locality really makes a difference.

X-Ray, as usual, comes to the rescue: it answers these questions within a couple of hours and lets us see the results visually. Keeping the other design choices of the AOS architecture the same, we wanted to compare the results with data locality turned on and off.

I took an AOS 5.9 build with ESXi 6.5, tuned the following parameters in AOS 5.9 and executed X-Ray's built-in scenario "Database Colocation: High Intensity".

  • Oplog's data locality (DL) is turned off
  • Extent Store's data locality (DL) is turned off
  • Range Cache (RC), the DRAM data cache, is also turned off
    • This removes any caching effect from the experiments

The following charts were created from the X-Ray runs. For readability, I have broken the single screenshot into multiple images below.

  • Blue: Data Locality On
  • Green: Data Locality Off


OLTP IOPS stay steady and there is no impact on this workload.

The absence of data locality leads to a slight increase in latency. Even during high-throughput (DSS) workloads with large reads, there is no significant impact on latency.

DSS workloads are also unaffected even when data locality is turned off. I know this is not the case with other architectures.

The effect of data locality is visible in the network traffic charts below.

Network traffic jumped to about 1.5 GB/s. This all-flash cluster has a single 10Gb network, which can handle up to about 1.2 GB/s of traffic, the target throughput expected from the two DSS workloads on two different nodes.

Unlike other architectures, Nutanix AOS places the two RF2 copies of data evenly across all nodes. Even with data locality turned off, about 50% of the data therefore happens to have a copy on the local node, so total network usage is only about half of the throughput the workload generates.

In summary:
  • Nutanix AOS still fares better than other known HCI architectures even when data locality is turned off.
  • Network traffic stays relatively low compared to other HCI architectures because data is evenly distributed across all disks and all nodes in Nutanix AOS.
Refer to my earlier post on whether data locality really makes a difference.

Tuesday, September 25, 2018

Can two disks fail in a short interval?


YES, of course. The answer is obvious. The real question is how these failures are handled in HCI architectures. What happens to data availability during these failures?

The questions that arise:
  1. Does the HCI architecture start protecting data as quickly as possible?
  2. How smart is the architecture at rebuilding the lost data, and how long will it take?
    • If the rebuild takes long, that time window is a risk period: the administrator must hope that no other failure occurs during this long rebuild window in order to maintain FTT and RF requirements.
    • Vendors may offer the brute-force alternative of increasing RF/FTT, which increases cost (in the absence of a smart architecture).
In this video, I simulate two disk failures with soft disk pulls on Nutanix AOS. The short video shows how quickly AOS gets into action to protect the data from subsequent failures, and how quickly it completes the rebuild with its truly distributed architecture, employing all available resources in the cluster with minimal impact on the performance of the existing workload.




Nutanix AOS is truly distributed and very quick at handling failures:
  1. AOS gives utmost priority to data availability and immediately gets into action on a failure to protect the data from any subsequent failures.
  2. The second copy of each disk's data is distributed almost evenly across all disks on the other nodes. This shortens the time window needed to rebuild the data onto other disks if and when a disk fails.

True, I used X-Ray to run an OLTP workload, but I had to simulate the disk pulls manually. How about fully automating this in X-Ray? Is it possible?

The X-Ray 3.1 release is built with Ansible playbook support. I am working on posting the X-Ray scenario here as soon as I am done testing. Stay tuned!

Wednesday, September 19, 2018

X-Ray Custom Scenarios Explained: 1. Configuration definitions


The X-Ray team made their built-in scenarios open source a few months ago. I hope this blog series helps you write your own custom scenarios.

The scenario definition file, called test.yml, is placed in a unique custom scenario directory. It consists of several sections, as described in the picture below. The scenario file is in YAML format and is parsed with the help of the Jinja2 template engine and an Ansible playbook.

The following picture does not include the documentation- and UI-related information that typically exists in the scenario file.




In this part 1, let's look at the syntax for configuration definitions.

VARS:

In the following example, three variables are defined: "vms_per_node", "num_nodes" and "runtime_secs". Each variable definition can have a default, a min and a max.

vars:
  vms_per_node:
    default: 1
    min: 1
    max: 4
  num_nodes:
    default: 4
    min: 1
  runtime_secs:
    default: 7200
    min: 1
    max: 14400
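
These variables can then be referenced anywhere else in the scenario file using Jinja2 substitution, just like the {{ node_index }} loop shown later in this post. As a rough sketch (the VM Group fields are explained in the next section, and the group name "OLTP_VMs" is only illustrative), "vms_per_node" could drive how many VMs get created per node:

vms:
  - OLTP_VMs:
      template: ubuntu1604
      vcpus: 2
      ram_mb: 2048
      data_disks:
        count: 1
        size: 16
      nodes: all
      count_per_node: {{ vms_per_node }}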

VM Groups (or) VMS:

A VM Group definition essentially has two parts. The first part defines the VM configuration itself, such as vcpus, ram, disks, etc. The second part defines the number of VMs and their placement. Typically, the number of VMs is scaled per cluster or per node.

X-Ray comes with a built-in VM template called "ubuntu1604". This template has FIO pre-installed, which is the primary I/O load generator.

OK, let's look at the simplest one first.

vms:
  - VDI_1VM_Per_Cluster:
      template: ubuntu1604
      vcpus: 2
      ram_mb: 2048
      data_disks:
        count: 1
        size: 16
      nodes: all
      count_per_cluster: 1

VM Group called "VDI_1VM_Per_Cluster" is defined with VM config of 2 vcpus, 2GB memory and 1 data disk of 16GB in size. As defined in "count_per_cluster", it just brings up a single VM per cluster on the first node available in the cluster.

  - VDI_2VM_per_Node:
      template: ubuntu1604
      vcpus: 2
      ram_mb: 2048
      data_disks:
        count: 1
        size: 16
      nodes: all
      count_per_node: 2

VM Group called "VDI_2VM_Per_Node" is going to create 2 VMs per node on all nodes available in the cluster. If cluster size is 4, then 4*2=8 VMs are created in the cluster as 2 VM per node.

  - VDI_4VMs:
      template: ubuntu1604
      vcpus: 2
      ram_mb: 2048
      data_disks:
        count: 1
        size: 16
      nodes: 0,1
      count_per_node: 2

VM Group called "VDI_4VMs" is going to create 2 VMs per node on first (0) and second (1) nodes available in the cluster as total of 4.

OK. What if one wants to create one or more VM Groups per node and scale them dynamically? This is where Jinja2 syntax comes in handy. Let's define the number of nodes in a variable and generate the VM Groups in a loop (the rendered result is shown after the template below):

vars:
  num_nodes:
    default: 4
    min: 1

vms:
{% for node_index in range(num_nodes) %}
  - VM Group for Node {{ node_index }}:
      template: ubuntu1604
      vcpus: 4
      ram_mb: 4096
      data_disks:
        count: 6
        size: 64
      nodes: {{ node_index }}
      count_per_node: 2
{% endfor %}
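
To make the expansion concrete, here is roughly what Jinja2 renders the loop into when num_nodes is left at its default of 4 (only the first two of the four generated groups are shown):

vms:
  - VM Group for Node 0:
      template: ubuntu1604
      vcpus: 4
      ram_mb: 4096
      data_disks:
        count: 6
        size: 64
      nodes: 0
      count_per_node: 2
  - VM Group for Node 1:
      template: ubuntu1604
      vcpus: 4
      ram_mb: 4096
      data_disks:
        count: 6
        size: 64
      nodes: 1
      count_per_node: 2

The groups for nodes 2 and 3 follow the same pattern, so a 4-node cluster ends up with 4 VM Groups and, with count_per_node: 2, a total of 8 VMs.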



Wednesday, September 12, 2018

Can data locality help scale throughput?


X-Ray's throughput-scalability-random-reads and throughput-scalability-sequential-reads tests help answer the following:
  1. Can the IOPS or throughput rate be sustained over time?
  2. As the number of nodes in the cluster grows, can throughput or IOPS scale with it?
The random-reads test measures the IOPS rate as it starts 8k random reads on all available nodes at 30-minute intervals. As workloads start on each subsequent node, the X-Ray chart helps confirm that the IOPS rate is sustained without much change.






The sequential-reads test measures I/O bandwidth as it starts 1m sequential reads on all available nodes at 30-minute intervals. As workloads start on each subsequent node, the X-Ray chart helps confirm that the bandwidth is sustained without much change.




Thanks to data locality, AOS does not rely on the network for reads and is able to scale reads linearly, adding 1x for each node, without affecting other nodes.

X-Ray Four Corners, a simple and quick measure for burst performance

I/O bursts are common in database workloads. How does an HCI architecture handle short workload bursts?

This simple X-Ray test, the four corners microbenchmark, quickly simulates workloads of different I/O types (8k random reads, 1m sequential reads, 8k random writes and 1m sequential writes) in sequence, with a 1-minute duration each. It demonstrates how well an HCI architecture handles short 1-minute bursts in IOPS and throughput, as the FIO sketch below illustrates.
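
For a feel of what such a burst sequence looks like at the FIO level (FIO being the load generator pre-installed in the X-Ray VM template), here is a minimal sketch of an equivalent job file. This is not the workload definition X-Ray actually ships; the target device and queue depth are assumptions.

# four-corners.fio: run each corner for 1 minute, one after another
[global]
ioengine=libaio
direct=1
time_based
runtime=60
# assumed target: the VM's data disk
filename=/dev/sdb
iodepth=32

[8k-random-read]
rw=randread
bs=8k

[1m-sequential-read]
# stonewall makes this job wait for the previous one to finish
stonewall
rw=read
bs=1m

[8k-random-write]
stonewall
rw=randwrite
bs=8k

[1m-sequential-write]
stonewall
rw=write
bs=1m

Running "fio four-corners.fio" inside the guest gives a quick burst profile of the underlying storage along these four corners.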




Tuesday, September 11, 2018

Does data locality really matter?

Does data locality matter in HCI Architecture?

Data locality is not just about storing new data on the local node. It's also about keeping relevant, useful data local for low-latency access when needed.

There have been debates about whether data locality makes a difference in HCI architecture. Some vendors argue that with today's low-latency, high-bandwidth networks, data locality hardly makes any difference and may even add a bottleneck as hosting nodes change. The elegance of an architecture shows when it implements data locality fully, considering various workflows such as vSphere HA events and vMotion.

With X-Ray, one can see how these workflows are handled by different HCI architectures without much effort. These workflows are available as built-in, ready-to-use tests that users can simply kick off.

(i) Database Colocation

What happens when a DSS-type high-bandwidth workload is deployed in the cluster, but on a different node? Does it affect the existing OLTP workload? Here is a sample X-Ray result from a run on an NX-3060-G4. The chart explains it all: there is a slight increase in latency for OLTP, but both workloads are handled seamlessly.



(ii) vMotion

This test migrates a VM from the 1st node to the 2nd in 20 minutes, followed by another migration from the 2nd node to the 3rd. Both moves are seamless: IOPS are maintained, with slight fluctuations in latency that stay well under 2ms. The increased network bandwidth shows that the AOS architecture gets into action to move data over to the new hosting node. The entire process is transparent, and data locality is taken care of by the architecture as needed.




(iii) VM High Availability

What happens during a vSphere HA event? This test brings down a node so that vSphere HA kicks in and restarts the VM on another node. The chart makes it evident that AOS gets into action as soon as the VM is powered on on the other node, well before vCenter reports the task as complete. AOS migrates data from the other nodes to the new hosting node, utilizing the idle network first, and then prioritizes the VM's workload once it detects that the workload has started.