SLES HAE Cluster Tests with NetWeaver on AWS

This is an example of the tests to be performed with a SLES HAE cluster running SAP NetWeaver and SAP HANA.

These tests should be executed before the cluster goes into production.

Each numbered test describes the action performed and the expected behavior; record the measured times in the protocol.

1.0 Set a node to standby / offline
    Action: Set a node to standby by means of the Pacemaker cluster tools ("crm node standby"); see the command sketch below.
    Expected behavior: The cluster stops all managed resources on the standby node (master resources are migrated; slave resources are simply stopped).
1.1 Set <nodenameA> to standby.
    Time until all managed resources were stopped or migrated to the other node: XX sec
1.2 Set <nodenameB> to standby.
    Time until all managed resources were stopped or migrated to the other node: XX sec
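
A minimal command sketch, assuming the crmsh shell that ships with SLES HAE; the node name is a placeholder:

    crm node standby <nodenameA>   # cluster stops / migrates all managed resources
    crm_mon -1                     # one-shot status view to watch the migration
    crm node online <nodenameA>    # bring the node back after the test
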
2.0 Switch off cluster node A
    Action: Power off the EC2 instance (hard, instant stop of the VM); see the sketch below.
    Expected behavior: The cluster notices that a member node is down. The remaining node makes a STONITH attempt to ensure that the lost member is really offline. Once the STONITH action is confirmed, the remaining node takes over all resources.
2.1 Failover time of ASCS / HANA primary: XXX sec
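
One way to trigger the hard stop is the AWS CLI, assuming it is installed and configured; the instance ID is a placeholder:

    aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --force
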
3.0 Switch off cluster node B
    Action: Power off the EC2 instance (hard, instant stop of the VM), as in test 2.0.
    Expected behavior: The cluster notices that a member node is down. The remaining node makes a STONITH attempt to ensure that the lost member is really offline. Once the STONITH action is confirmed, the remaining node takes over all resources.
3.1 Failover time of ASCS / HANA primary: XXX sec
4.0 Unplug the network connection (split brain)
    Action: Take down the cluster communication over the network; see the sketch below for a software-only alternative.
    Expected behavior: Both nodes detect the split-brain scenario and try to fence each other (using the AWS STONITH agent). One node shuts down; the other takes over all resources.
    Failover time: XXX sec
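
If physically unplugging the network is not practical, here is a sketch of one way to simulate the split brain, assuming iptables and root access; the peer IP is a placeholder:

    # Drop all traffic to / from the peer node to break the cluster communication
    iptables -A INPUT  -s <peer-node-ip> -j DROP
    iptables -A OUTPUT -d <peer-node-ip> -j DROP
    # If this node survives the fencing race, remove the rules afterwards
    iptables -D INPUT  -s <peer-node-ip> -j DROP
    iptables -D OUTPUT -d <peer-node-ip> -j DROP
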

5.0 Failure (crash) of the ASCS instance
    Action: The processes of the SAP instance are killed via OS command:

    ps -ef | grep '[A]SCS' | awk '{print $2}' | xargs -r kill -9

    (The '[A]SCS' pattern keeps grep from matching its own process; 'xargs -r' skips the kill if no process matched.)
    Expected behavior: The cluster notices the problem and promotes the ERS instance to ASCS while keeping all locks from the ENQ replication table.
    ASCS failover time: XXX sec
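
To document the state of the enqueue server before and after the test, sapcontrol can be queried as the <sid>adm user; the instance number is a placeholder, and EnqGetStatistic assumes a reasonably recent SAP kernel:

    sapcontrol -nr <nr> -function GetProcessList    # process status of the instance
    sapcontrol -nr <nr> -function EnqGetStatistic   # lock table / replication statistics
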

6.0 Failure of the ERS instance
    Action: The processes of the SAP instance are killed via OS command:

    ps -ef | grep '[E]RS' | awk '{print $2}' | xargs -r kill -9

    Expected behavior: The cluster notices the problem and restarts the ERS instance.
    Time until the ERS instance was restarted on the same node: XX sec
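
A quick check that the restart happened on the same node, assuming crmsh; the resource and node names are placeholders:

    crm_mon -1 | grep -i ERS                                        # should show Started on the original node
    crm resource failcount rsc_sap_<SID>_ERS<nr> show <nodenameA>   # the restart increments the failcount
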
 

7.0 Failure of the HANA primary
    Action: The HANA processes on the primary node are killed (hard crash of the database); see the sketch below.
    Expected behavior: The cluster detects the failure and triggers a takeover; the HANA secondary is promoted to primary.
    Time until the HANA DB is available again: XXX sec
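
A sketch of how the crash can be triggered and observed, assuming the SAPHanaSR package is installed; <sid> is a placeholder:

    su - <sid>adm -c "HDB kill-9"   # on the primary node: hard-kill all HANA processes
    SAPHanaSR-showAttr              # on either node: watch the takeover progress
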
8.0 Failure of corosync
    Action: Kill the corosync cluster daemon with "kill -9" on one node; see the sketch below.
    Expected behavior: The node without corosync is fenced by the remaining node (since it appears to be down). The remaining node makes a STONITH attempt to ensure that the lost member is really offline. Once the STONITH action is confirmed, the remaining node takes over all resources.
    Failover of all managed resources: XXX sec
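
A sketch of the kill and of a way to confirm the fencing afterwards, assuming the Pacemaker command line tools; the node name is a placeholder:

    pkill -9 corosync                     # on the node to be fenced
    stonith_admin --history <nodenameA>   # on the survivor: list fencing actions against that node
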

Keep logfiles of all relevant resources to prove functionality. For instance, after an ASCS failover keep a copy of /usr/sap/<SID>/ASCS<nr>/work/dev_enqserver. This logfile should show that an ENQ replication table was found in memory and that all locks were copied into the new ENQ table. Customers may request to acquire ENQ locks before the failover test and then check the status of those locks after a successful failover (please document this with screenshots of SM12 on both nodes before and after the failover).
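
A sketch of how the trace can be preserved and checked; the exact log wording depends on the SAP kernel release, so the grep pattern is only illustrative:

    # On the node now running the ASCS instance, after the failover:
    cp /usr/sap/<SID>/ASCS<nr>/work/dev_enqserver /var/tmp/dev_enqserver.$(date +%F-%H%M%S)
    grep -i repl /var/tmp/dev_enqserver.*   # look for replication table / lock takeover messages
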

Keep the corosync / cluster logs of all actions taken during the failover tests.
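
crmsh can bundle the logs of both nodes into a single archive; the time window and destination below are placeholders, and the accepted time format may vary by crmsh version:

    crm report -f "2024-05-01 08:00" -t "2024-05-01 18:00" /var/tmp/failover-tests
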

Ask the customer for additional failover tests, requirements, or scenarios they would like to cover.

Have the customer sign the protocol (!) acknowledging that all tested failover scenarios worked as expected.

Remind the customer to re-test all failover scenarios whenever the SAP, OS, or cluster configuration is changed or patches are applied.