Infrastructure Adventures

07/15/2011

Adventures with Two Node RHEL HA Clusters – Behavior

Filed under: Compute | Joe Keegan @ 10:57 AM

Several failure scenarios were tested to determine how the cluster responds under each condition. The behavior of the cluster during those tests is described below.

Failure of Network Connection

If a network connection used for an IP Resource (VIP) fails (e.g. switch loss, cable disconnected, etc.), the cluster will detect the loss and fail over the cluster services to the standby node. The detection and failover process takes less than 30 seconds to complete.

Failure of the network connection on the passive node has no impact on the cluster, other than that any failover attempt to that node will fail.

Restoration of the network connections has no impact on the cluster and the cluster services will stay on the current active node (i.e. there is no “Fail-back”).
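The behavior described in this section can be summarized as a small state model. The sketch below is only an illustration of the observed behavior, not how the cluster software is implemented; the node names node1 and node2 and the class TwoNodeCluster are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TwoNodeCluster:
    """Toy model of the observed VIP-network behavior (names are hypothetical)."""
    active: str = "node1"
    link_up: dict = field(default_factory=lambda: {"node1": True, "node2": True})

    def passive(self) -> str:
        return "node2" if self.active == "node1" else "node1"

    def link_failed(self, node: str) -> str:
        """Record a VIP network failure on `node` and report the outcome."""
        self.link_up[node] = False
        if node != self.active:
            return "no impact"
        if self.link_up[self.passive()]:
            # Services move to the standby node.
            self.active = self.passive()
            return "failover"
        # The standby has no working connection either.
        return "failover attempt fails"

    def link_restored(self, node: str) -> str:
        """Restoring a link never moves services back (no fail-back)."""
        self.link_up[node] = True
        return "no impact"


if __name__ == "__main__":
    cluster = TwoNodeCluster()
    print(cluster.link_failed("node1"), "-> active is", cluster.active)
    print(cluster.link_restored("node1"), "-> active is", cluster.active)
```

Running the demo shows services moving to node2 when node1 loses its link and staying there after the link comes back.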

Failure of iSCSI Connection

If the network connections used for iSCSI fail (e.g. switch loss, cable disconnected, etc.), the cluster will detect the loss and fail over the cluster services to the standby node. This requires the multipathing daemon to fail all valid iSCSI paths and the SCSI stack to fail the disk. This failure detection and failover process can take between two and three minutes.

While reducing this time is technically possible, Red Hat support advised using the defaults unless a shorter failover time was required. The issue with changing the timers is that a brief network interruption, or the loss of one or more (but not all) iSCSI paths, could cause a premature failover of the cluster.

Failure of the iSCSI network connections on the passive node has no impact on the cluster, other than that any failover attempt to that node will fail.

Restoration of the iSCSI network connections has no impact on the cluster and the cluster services will stay on the current active node (i.e. there is no “Fail-back”).
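The trade-off between failover time and tolerance of brief interruptions can be illustrated with some simple arithmetic. This is a rough sketch under stated assumptions: the function names and the timer values are hypothetical placeholders, not the settings of the tested cluster, and the model collapses the real iSCSI and multipath timers into a single path timeout plus a polling interval.

```python
def detection_seconds(path_timeout: float, poll_interval: float) -> float:
    """Rough worst case: the path must time out, then the next poll notices it."""
    return path_timeout + poll_interval


def outage_triggers_failover(outage: float, path_timeout: float) -> bool:
    """An interruption shorter than the path timeout is ridden out."""
    return outage >= path_timeout


# Hypothetical timer sets, in seconds (assumptions for illustration only).
CONSERVATIVE = {"path_timeout": 120.0, "poll_interval": 5.0}
AGGRESSIVE = {"path_timeout": 10.0, "poll_interval": 5.0}

if __name__ == "__main__":
    brief_outage = 15.0  # e.g. a short network interruption
    for name, timers in (("conservative", CONSERVATIVE), ("aggressive", AGGRESSIVE)):
        detect = detection_seconds(**timers)
        trips = outage_triggers_failover(brief_outage, timers["path_timeout"])
        print(f"{name}: detection ~{detect:.0f}s, "
              f"{brief_outage:.0f}s outage causes failover: {trips}")
```

With the longer timeout the 15 second interruption is tolerated; with the shorter one the same interruption would already be enough to fail the paths, which is the premature failover risk described above.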

Failure of Heartbeat Connection

If a node loses its connection to the heartbeat network, its partner node will fence it. For example, if the active node loses its heartbeat connection, the passive node will fence the active node; likewise, if the passive node loses its heartbeat connection, the active node will fence the passive node. If the active node is fenced, the passive node then takes over the cluster services. The detection, fencing, and failover process takes less than 30 seconds.

The node that was fenced will reboot. If the node’s connection to the heartbeat network is restored by the reboot (or by the time the cluster services are started on the node) then the node will rejoin the cluster as the passive member. If the heartbeat connection is not restored then the node will not rejoin the cluster.

If the node with the failed heartbeat connection is up and the cluster service (cman) is running when the heartbeat connection is restored, both nodes try to rebuild the cluster. This triggers a race condition in which each node tries to fence the other. This can cause a brief outage of less than 30 seconds if the passive node manages to fence the active node and then starts the cluster services.
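The heartbeat scenarios above boil down to two questions: which node gets fenced, and whether the cluster services move. The sketch below encodes the observed outcomes as plain functions. The function names and the "active"/"passive" role strings are hypothetical, and the code models the results only, not the cluster software itself.

```python
def heartbeat_loss(failed_role: str) -> dict:
    """Outcome when the node in `failed_role` loses its heartbeat connection."""
    if failed_role not in ("active", "passive"):
        raise ValueError("failed_role must be 'active' or 'passive'")
    return {
        "fenced": failed_role,  # the partner fences the node that lost heartbeat
        "services_move": failed_role == "active",  # only if the active node is fenced
    }


def fence_race(winner_role: str) -> dict:
    """Outcome when heartbeat returns while cman is still running on both nodes.

    Each node tries to fence the other; `winner_role` is whichever gets there first.
    """
    if winner_role not in ("active", "passive"):
        raise ValueError("winner_role must be 'active' or 'passive'")
    loser = "passive" if winner_role == "active" else "active"
    return {
        "fenced": loser,
        # Services only move (brief outage) if the passive node wins the race.
        "brief_outage": winner_role == "passive",
    }


if __name__ == "__main__":
    print(heartbeat_loss("active"))
    print(heartbeat_loss("passive"))
    print(fence_race("passive"))
```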

Complete Failure of a Node

A complete failure of the active node (e.g. the node loses power) requires manual intervention to fail over the cluster services. This is because the passive node cannot successfully fence the failed node (at least not via iLO, which is unreachable when the node has no power).

In this case the remaining passive node will detect that its partner has failed, but will not start the cluster services. The cluster does this to avoid the data corruption that could occur in a split-brain scenario.

To fail over the cluster services, an administrator must connect to the passive node and run the fence_ack_manual command. This command tells the passive node that you are manually acknowledging the fence, i.e. that the system should consider the fence to have completed successfully. The system will then start the cluster services.

The same condition can also occur if a node loses both its iLO and heartbeat connections. The remaining node would not be able to fence the failed node since it is unable to reach the iLO of the failed node. In this case the failed node could still have a working iSCSI connection and have the storage mounted. Because of this, it is imperative that the administrator verify that the failed node does not have the storage mounted before running fence_ack_manual.
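One way to make the "is the storage still mounted?" check less error-prone is to parse the mount table rather than eyeball it. The sketch below is a minimal example, assuming the administrator can still obtain the contents of /proc/mounts from the failed node by some out-of-band route such as a local console. The device /dev/mapper/mpath0 and the mount point /data are hypothetical placeholders, and the code deliberately does not run fence_ack_manual itself.

```python
def mounted_devices(proc_mounts_text: str) -> dict:
    """Map mount point -> device from /proc/mounts-style text."""
    mounts = {}
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            device, mount_point = fields[0], fields[1]
            mounts[mount_point] = device
    return mounts


def storage_is_mounted(proc_mounts_text: str, shared_mount_point: str) -> bool:
    """True if the shared storage mount point appears in the mount table."""
    return shared_mount_point in mounted_devices(proc_mounts_text)


# Hypothetical sample captured from the failed node (assumption for illustration).
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
proc /proc proc rw 0 0
/dev/mapper/mpath0 /data ext3 rw 0 0
"""

if __name__ == "__main__":
    if storage_is_mounted(SAMPLE_MOUNTS, "/data"):
        print("Shared storage is still mounted: do NOT acknowledge the fence yet.")
    else:
        print("Shared storage is not mounted on the failed node.")
```

On a live system the text would come from reading /proc/mounts on the node in question rather than from a hard-coded sample.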
