This post was authored by Subhasish Bhattacharya, Program Manager, Windows Server.
In the past, in a world of reliable but expensive SANs, an aggressive high-availability strategy designed to fail fast was optimal. System health was closely monitored so that issues could be detected and acted on quickly, which minimized downtime when catastrophic failures occurred.
In today’s cloud-scale environments, commonly built on commodity hardware, transient failures have become more common than hard failures. These transient compute and storage failures are triggered by routine events such as switch resets, packet loss, latency, and spanning tree convergence. In this new world, reacting aggressively to transient failures can cause more downtime than it prevents.
The storage and compute stack in Windows Server 2016 has been designed to optimize both high availability and resiliency. In a Software Defined Datacenter, we must assume infrastructure will break and it is imperative that software is resilient. At the same time, it is not acceptable to have degraded Virtual Machine (VM) availability.
Resilient private clouds: Compute and storage virtual machine resiliency
Windows Server 2016 introduces increased VM resiliency features to address both:
- Compute failures: Due to east-west transient network failures.
- Storage failures: Due to north-south transient storage failures.
Transient network failures impede intra-cluster communication for your private cloud. This results in cluster nodes being removed from active membership in a cluster. In Windows Server 2016, your cluster is resilient to intra-cluster communication failures. This resiliency is achieved by the following:
- A VM continues to run on a node even when that node falls out of cluster membership. The node is then considered “isolated” and the VM “unmonitored”, i.e., its health is no longer actively monitored by the cluster service.
- If the network connectivity of the “isolated” node fails to recover within a certain duration, the VM is failed over to another node in the cluster.
- Additionally, “flapping” nodes, which repeatedly drop in and out of cluster membership, are temporarily kept out of the cluster and placed in a “quarantined” state.
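The membership behavior described above can be pictured as a small state machine. The following Python sketch is a toy model only, not the cluster service's implementation: the class and method names are hypothetical, and every numeric threshold is an illustrative placeholder rather than a value stated in this post.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NodeState(Enum):
    UP = "Up"
    ISOLATED = "Isolated"
    DOWN = "Down"
    QUARANTINED = "Quarantined"


@dataclass
class NodeModel:
    """Toy model of one cluster node. All thresholds are placeholders."""
    resiliency_period: float = 240.0     # seconds a node may stay isolated
    quarantine_threshold: int = 3        # membership losses tolerated per window
    quarantine_window: float = 3600.0    # seconds over which losses are counted
    quarantine_duration: float = 7200.0  # seconds a flapping node is kept out
    state: NodeState = NodeState.UP
    vms_monitored: bool = True
    vms_moved: bool = False
    isolated_since: Optional[float] = None
    quarantined_until: Optional[float] = None
    loss_times: List[float] = field(default_factory=list)

    def lose_membership(self, now: float) -> None:
        """The node drops out of active cluster membership."""
        if self.state is not NodeState.UP:
            return
        # Count only the losses that fall inside the sliding window.
        self.loss_times = [t for t in self.loss_times
                           if now - t < self.quarantine_window]
        self.loss_times.append(now)
        self.vms_monitored = False
        if len(self.loss_times) >= self.quarantine_threshold:
            # Flapping node: keep it out of the cluster for a while.
            self.state = NodeState.QUARANTINED
            self.quarantined_until = now + self.quarantine_duration
        else:
            # VMs keep running on the node, but unmonitored.
            self.state = NodeState.ISOLATED
            self.isolated_since = now

    def tick(self, now: float) -> None:
        """Advance time; give up on an isolated node that has not recovered."""
        if (self.state is NodeState.ISOLATED
                and now - self.isolated_since >= self.resiliency_period):
            self.state = NodeState.DOWN
            self.vms_moved = True  # VMs are now placed on another node

    def rejoin(self, now: float) -> bool:
        """Try to rejoin; refused while the quarantine is still in force."""
        if self.state is NodeState.QUARANTINED and now < self.quarantined_until:
            return False
        self.state = NodeState.UP
        self.vms_monitored = True
        self.isolated_since = None
        self.quarantined_until = None
        return True


if __name__ == "__main__":
    node = NodeModel()
    node.lose_membership(now=0)
    node.tick(now=30)
    print(node.state, node.vms_monitored)  # short blip: VMs stay put
    node.tick(now=300)
    print(node.state, node.vms_moved)      # recovery window exhausted
```

The point of the design is that a brief loss of membership changes only the monitoring status of the VMs. They are moved only after the recovery window is exhausted, and a node that loses membership too often is refused re-entry until its quarantine expires.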
A transient storage failure leaves a VM unable to access its underlying VHDX file because read or write requests to disk fail. In Windows Server 2016, a VM detects such transient failures and rides them out as follows:
- On detecting a transient storage failure, the tenant VM session state is preserved.
- A failure in the underlying block- or file-based storage infrastructure is detected and handled quickly by the VM stack.
- The VM is moved to a “PausedCritical” state as it waits for the storage to recover.
- On recovery from the transient failure, the session state is restored.
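The pause-and-resume sequence in the list above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `VmModel`, `do_io`, and `storage_recovered` are hypothetical names, and a dictionary stands in for the tenant session state. It does not reflect how the Hyper-V stack is actually implemented.

```python
from enum import Enum


class VmState(Enum):
    RUNNING = "Running"
    PAUSED_CRITICAL = "PausedCritical"


class VmModel:
    """Toy VM whose guest work is a counter advanced by successful I/O."""

    def __init__(self) -> None:
        self.state = VmState.RUNNING
        # Stands in for the in-memory tenant session state.
        self.session = {"requests_served": 0}

    def do_io(self, storage_available: bool) -> bool:
        """Attempt one read or write; pause instead of failing the guest."""
        if self.state is VmState.PAUSED_CRITICAL:
            return False  # the guest is frozen, so nothing runs
        if not storage_available:
            # Session state is left untouched while the VM waits.
            self.state = VmState.PAUSED_CRITICAL
            return False
        self.session["requests_served"] += 1
        return True

    def storage_recovered(self) -> None:
        """Resume from exactly where the guest stopped."""
        if self.state is VmState.PAUSED_CRITICAL:
            self.state = VmState.RUNNING


if __name__ == "__main__":
    vm = VmModel()
    vm.do_io(storage_available=True)
    vm.do_io(storage_available=False)  # transient storage failure
    print(vm.state, vm.session)
    vm.storage_recovered()
    vm.do_io(storage_available=True)
    print(vm.state, vm.session)
```

In this model the failed request never reaches the guest as an error. The VM stops advancing until storage returns, then continues with the same session state it had before the failure.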
Check out the series: