Infrastructure Adventures

07/15/2011

Adventures with Two Node RHEL HA Clusters – Behavior

Filed under: Compute — Joe Keegan @ 10:57 AM

Several failure scenarios were tested to determine how the cluster would respond under those conditions. The cluster's behavior during those tests is described below.

Failure of Network Connection

In the event of a failure of the network connection used for an IP resource (VIP) (e.g. switch loss, cable disconnect, etc.), the cluster will detect the loss and fail over the cluster services to the standby node. The detection and failover process takes less than 30 seconds to complete.
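The detection on the VIP side comes from the IP resource's link monitoring. As a rough sketch of what that looks like in cluster.conf (the address and service name below are made up for illustration, not taken from the actual clusters), the relevant piece is something like:

    <rm>
      <resources>
        <!-- monitor_link="on" makes rgmanager watch the link carrying the VIP -->
        <ip address="192.168.10.50" monitor_link="on"/>
      </resources>
      <service autostart="1" name="example_svc" recovery="relocate">
        <ip ref="192.168.10.50"/>
      </service>
    </rm>

With recovery="relocate" the service is moved to the other node when the resource fails, which is the failover behavior described above.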

Failure of the network connection on the passive node has no impact on the cluster, other than that any failover attempt to that node will fail.

Restoration of the network connections has no impact on the cluster and the cluster services will stay on the current active node (i.e. there is no “Fail-back”).

Failure of iSCSI Connection

In the event of a failure of the network connections used for iSCSI (e.g. switch loss, cable disconnect, etc.), the cluster will detect the loss and fail over the cluster services to the standby node. This requires the multipathing daemon to fail all valid iSCSI paths and the SCSI stack to fail the disk. This failure detection and failover process can take two to three minutes.

While reducing this time is technically possible, Red Hat support advised using the defaults unless a shorter failover time is required. The issue with shortening the timers is that a brief network interruption, or the loss of one or more (but not all) iSCSI paths, could cause a premature failover of the cluster.
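For reference, most of the timers in question live in /etc/multipath.conf. We left them at the defaults, but a minimal sketch of the kind of settings involved (the values below are purely illustrative, not recommendations) looks like:

    # /etc/multipath.conf - illustrative values only
    defaults {
        polling_interval    5           # seconds between path health checks
        no_path_retry       12          # checker retries before I/O is failed when all paths are down
        failback            immediate   # return to the preferred path group as soon as it recovers
    }

Lowering polling_interval and no_path_retry shortens the two to three minute window described above, at the cost of the premature-failover risk Red Hat warned about.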

Failure of the iSCSI network connections on the passive node has no impact on the cluster, other than that any failover attempt to that node will fail.

Restoration of the iSCSI network connections has no impact on the cluster and the cluster services will stay on the current active node (i.e. there is no “fail-back”).

Failure of Heartbeat Connection

In the event that a node loses its connection to the heartbeat network, its partner node will fence it. For example, if the active node loses its heartbeat connection, the passive node will fence the currently active node; likewise, if the passive node loses its heartbeat connection, the active node will fence the passive node. When the active node is fenced, the passive node will then take over the cluster services. The detection, fencing and failover process takes less than 30 seconds.

The node that was fenced will reboot. If the node’s connection to the heartbeat network is restored by the reboot (or by the time the cluster services are started on the node) then the node will rejoin the cluster as the passive member. If the heartbeat connection is not restored then the node will not rejoin the cluster.

If the node with the failed heartbeat connection is up and the cluster service (cman) is running when the heartbeat connection is restored, both nodes try to rebuild the cluster. This triggers a race condition where each node tries to fence the other. It can cause a brief outage, less than 30 seconds, if the passive node manages to fence the active node and starts the cluster services.
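When testing these heartbeat scenarios it is handy to watch membership from whichever node is up, using the stock RHEL 6 cluster tools:

    clustat             # member status and which node currently owns each service
    cman_tool status    # quorum state, expected votes and node counts
    cman_tool nodes     # per-node membership state as seen by this node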

Complete Failure of a Node

A complete failure of the active node (i.e. the node loses power) will require manual intervention to fail over the cluster services. This is because the passive node cannot successfully fence the failed node (at least via iLO, since the failed node's iLO is unreachable once the server has lost power).

In this case the remaining passive node will detect that its partner has failed, but will not start the cluster services. This is done since the cluster does not want to risk any data corruption that could occur in a split-brain scenario.

To fail over the cluster services, an administrator will need to connect to the passive node and run the fence_ack_manual command. This command tells the passive node that you are manually acknowledging the fence, i.e. that the system should consider the fence to have completed successfully. The system will then start the cluster services.

This condition can also be encountered if a node loses both its iLO and heartbeat connections. The remaining node would not be able to fence the failed node since it cannot reach the failed node's iLO. In this case the failed node could still have a working iSCSI connection and have the storage mounted. Because of this, it is imperative that the administrator verify that the failed node does not have the storage mounted before issuing fence_ack_manual.
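Putting that together, the manual recovery from the surviving node looks roughly like the following. The node name is made up, and the exact fence_ack_manual syntax varies slightly between releases, so check the man page on your systems:

    # 1. Confirm the failed node is really down and does not have the shared
    #    storage mounted (console, iLO if reachable, etc.) before acknowledging.
    # 2. On the surviving node, manually acknowledge the fence:
    fence_ack_manual node1.example.com    # older releases use: fence_ack_manual -n <nodename>
    # 3. Watch the cluster services start on the surviving node:
    clustat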

06/29/2011

Adventures with Two Node RHEL HA Clusters – Concepts

Filed under: Compute — Joe Keegan @ 10:20 AM

Recently I had to configure several HA clusters using RHEL 6. I thought I would share my notes in the hope that others find them useful. I'll split my notes into several posts covering concepts, configuration, operation and behavior.

WARNING: RHEL 6.0 was very buggy and I had nothing but problems. Upgrading to RHEL 6.1 solved a vast majority of my issues and is highly recommended!

Heartbeat

Each node in the cluster sends out a multicast heartbeat that tells the other member of the cluster that it is alive and healthy. By default a cluster node will consider another node dead if it misses heartbeats from that node for 10 seconds.

The interface used for heartbeats is configured in the cluster.conf file (see the configuration post for more details). When discussing cluster configuration with Red Hat support, they strongly recommended not using a cross-connect between the two nodes, and instead using an interface connected to a switch on a private VLAN for heartbeats. They also recommended that this be the same interface used to initiate fencing (see below).
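To tie that to the configuration, the heartbeat interface is chosen indirectly: each node name listed in cluster.conf should resolve to that node's IP on the heartbeat VLAN. A rough sketch (the host names are made up; the 10000 ms token value matches the 10 second default mentioned above):

    <cluster name="example_cluster" config_version="1">
      <!-- each node name should resolve to that node's IP on the heartbeat VLAN -->
      <clusternodes>
        <clusternode name="node1-hb.example.com" nodeid="1"/>
        <clusternode name="node2-hb.example.com" nodeid="2"/>
      </clusternodes>
      <!-- totem token timeout in milliseconds -->
      <totem token="10000"/>
    </cluster>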

Quorum

One of the most dangerous situations that can happen in clusters is that both nodes become active at the same time. This is especially true for clusters that share storage resources. In this case both cluster nodes could be writing to the data on shared storage which will quickly cause data corruption.

When both nodes become active at the same time it is called “split brain”, and it can happen when a cluster node stops receiving heartbeats from its partner node. Since the two nodes are no longer communicating, neither knows whether the problem is with the other node or with itself.

For example, say the passive node stops receiving heartbeats from the active node due to a failure of the heartbeat network. If the passive node then starts the cluster services, you have a split-brain situation.

Most clusters use a Quorum Disk to prevent this from happening. The Quorum Disk is a small shared disk that both nodes can access at the same time. Whichever node is currently the active node writes to the disk periodically (usually every couple of seconds) and the passive node checks the disk to make sure the active node is keeping it up to date.

When a node stops receiving heartbeats from its partner node it looks at the Quorum Disk to see if it has been updated. If the other node is still updating the Quorum Disk then the passive node knows that the active node is still alive and does not start the cluster services.
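For reference, a quorum disk would be defined in cluster.conf roughly as follows (the label, heuristic and address are illustrative, and as the next paragraph explains we did not end up using one):

    <!-- qdisk is written every "interval" seconds; a node is declared dead after "tko" missed updates -->
    <quorumd interval="1" tko="10" votes="1" label="example_qdisk">
      <heuristic program="ping -c1 -w1 192.168.10.1" interval="2" score="1"/>
    </quorumd>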

Red Hat clusters support Quorum Disks, but Red Hat support recommended not using one since they are difficult to configure and can become problematic. Instead they recommend relying on fencing to prevent split brain.

Fencing

One of the strategies used by Redhat clusters to prevent split brain is a concept called fencing.

While there are several different types of fencing, fencing via the HP iLO devices (or similar) built into the servers is the recommended method. With this type of fencing, when the passive node stops receiving heartbeats from the active node, it will connect to the iLO of the active node and reboot it. Once the passive node has rebooted (i.e. fenced) the active node, it will then start the cluster services.

By rebooting the active node the passive node can be sure that the active node is no longer running the cluster services and it is safe to start them.
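In cluster.conf, pairing each node with its partner's iLO looks roughly like the snippet below. The names, addresses and credentials are made up, and the right agent depends on the iLO generation (fence_ilo is shown here; fence_ipmilan is another common choice):

    <clusternodes>
      <clusternode name="node1-hb.example.com" nodeid="1">
        <fence>
          <method name="ilo">
            <device name="node1-ilo"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2-hb.example.com" nodeid="2">
        <fence>
          <method name="ilo">
            <device name="node2-ilo"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_ilo" name="node1-ilo" ipaddr="192.168.20.11" login="fenceuser" passwd="secret"/>
      <fencedevice agent="fence_ilo" name="node2-ilo" ipaddr="192.168.20.12" login="fenceuser" passwd="secret"/>
    </fencedevices>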

From a design point of view the NIC used to connect to the iLO of a node’s partner server is the NIC that should also be used for heartbeat. This ensures that the node that lost its connection to the heartbeat network cannot fence its partner server.
