Primary HANA server becomes unavailable

Simulated Failures

Instance failure. The primary HANA instance has crashed or is no longer reachable through the network.
Availability zone failure.

Components getting tested

EC2 STONITH agent
HANA agent
Overlay IP agent
Optional: Route 53 agent if it is configured

Approach

Have a correctly working HANA DB cluster
Shut down eth0 on the instance to isolate it
The cluster will shut down the node
The cluster will fail over the HANA database
The cluster will not restart the failed node

Initial Configuration

Check whether the overlay IP address gets hosted on the interface eth0 on the first node:

hana01:/var/log # ip address list eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 02:ca:c9:ca:a6:52 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.115/24 brd 10.0.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.10.21/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ca:c9ff:feca:a652/64 scope link 
       valid_lft forever preferred_lft forever
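
For scripted test runs, this check can be automated. The following is a minimal sketch and not part of the original procedure: the helper name has_address is hypothetical, and the overlay IP address 192.168.10.21 is taken from the example output above.

```shell
# has_address IP: read `ip address list <dev>` output on stdin and succeed
# only if IP is assigned as an IPv4 address.
# Hypothetical helper, not part of any cluster tool.
has_address() {
    # -F matches the string literally, so the dots are not wildcards;
    # the trailing slash keeps 192.168.10.2 from matching 192.168.10.21.
    grep -qF "inet $1/"
}

# Usage on a cluster node (overlay IP from the example above):
#   ip address list eth0 | has_address 192.168.10.21 && echo "overlay IP is local"
```

The helper reads from standard input so it can be tried against saved output before it is used on a live node.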

Check the cluster status as super user with the command crm status:

hana01:/var/log # crm status
Stack: corosync
Current DC: hana02 (version 1.1.15-21.1-e174ec8) - partition with quorum
Last updated: Tue Sep 11 12:37:53 2018
Last change: Tue Sep 11 12:37:53 2018 by root via crm_attribute on hana01
2 nodes configured
6 resources configured
Online: [ hana01 hana02 ]
Full list of resources:
res_AWS_STONITH	(stonith:external/ec2):	Started hana01
 res_AWS_IP	(ocf::heartbeat:aws-vpc-move-ip):	Started hana01
 Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
     Started: [ hana01 hana02 ]
 Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
     Masters: [ hana01 ]
     Slaves: [ hana02 ]
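
To record which node is the master before the test, the relevant line can be pulled out of the status output. This is a sketch under the assumption that the output has the "Masters: [ ... ]" layout shown above; current_master is a hypothetical helper name.

```shell
# current_master: read `crm status` output on stdin and print the node(s)
# listed on the "Masters: [ ... ]" line. Hypothetical helper.
current_master() {
    awk '$1 == "Masters:" {
        # Fields look like: Masters: [ hana01 ]
        # so the node names sit between field 2 ("[") and the last field ("]").
        for (i = 3; i < NF; i++) print $i
    }'
}

# Usage on a cluster node:
#   crm status | current_master
```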

The AWS console shows that both nodes are running:

Screenshot two running nodes

Damage the Instance

There are two ways to "damage" an instance:

Corrupt Kernel

Become super user on the master HANA node.

Issue the following command, which makes the kernel reboot immediately without syncing or unmounting the disks:

echo 'b' > /proc/sysrq-trigger

Isolate Instance

Become super user on the master HANA node.

Issue the command:

$ ifdown eth0

The current session will now hang. The system can no longer communicate over the network.

SUSE recommends performing the isolation with firewalls and iptables.
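
The exact rules SUSE recommends are not given here. As an assumption, a firewall-based isolation that mirrors taking down eth0 could drop all traffic entering and leaving that interface. The sketch below only prints the iptables commands so they can be reviewed first; isolation_rules is a hypothetical helper name.

```shell
# isolation_rules [DEV]: print iptables commands that drop all traffic
# entering and leaving interface DEV (default eth0).
# Hypothetical helper; it prints the commands and changes nothing itself.
isolation_rules() {
    dev="${1:-eth0}"
    printf 'iptables -I INPUT -i %s -j DROP\n' "$dev"
    printf 'iptables -I OUTPUT -o %s -j DROP\n' "$dev"
}

# Usage as super user on the master HANA node, after reviewing the output:
#   isolation_rules eth0 | sh
```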

Monitor Fail Over

Expect the following in a correctly working cluster:

The second node will fence the first node. This means it will force a shutdown through AWS CLI commands.
The first node will be stopped.
The second node will take over the overlay IP address and host the HANA database.

The cluster will now switch the master node and the slave node.

Monitor progress from the healthy node!

The first node is reported as offline:

hana02:/home/ec2-user # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Wed Sep 19 13:18:21 2018

Hosts  clone_state lpa_hdb_lpt node_state op_mode   remoteHost roles                            score site srmode sync_state version                vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------
hana01             1537362888  offline    logreplay hana02                                            WDF  sync                                     hana01
hana02 PROMOTED    1537363101  online     logreplay hana01     4:S:master1:master:worker:master 100   ROT  sync   SOK        2.00.030.00.1522209842 hana02

hana02:/home/ec2-user # crm_mon -1rfn

Stack: corosync
Current DC: hana02 (version 1.1.15-21.1-e174ec8) - partition with quorum
Last updated: Wed Sep 19 13:18:52 2018
Last change: Wed Sep 19 13:18:21 2018 by root via crm_attribute on hana02

2 nodes configured
6 resources configured

Node hana01: OFFLINE
Node hana02: online
rsc_SAPHana_HDB_HDB00 (ocf::suse:SAPHana): Slave
rsc_SAPHanaTopology_HDB_HDB00 (ocf::suse:SAPHanaTopology): Started
res_AWS_IP (ocf::heartbeat:aws-vpc-move-ip): Started

Inactive resources:

res_AWS_STONITH (stonith:external/ec2): Stopped
Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
Started: [ hana02 ]
Stopped: [ hana01 ]
Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
Slaves: [ hana02 ]
Stopped: [ hana01 ]

Migration Summary:
* Node hana02:
res_AWS_STONITH: migration-threshold=5000 fail-count=1 last-failure='Wed Sep 19 13:18:00 2018'

Failed Actions:
* res_AWS_STONITH_monitor_120000 on hana02 'unknown error' (1): call=-1, status=Timed Out, exitreason='none',
last-rc-change='Wed Sep 19 13:18:00 2018', queued=0ms, exec=0ms

The AWS console now shows that the second node has fenced the first node, which is being shut down:

Screenshot node gets shut down

The second node waits until the first node is shut down. The AWS console will look like this:

First node being shut down
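
The same state can be followed from a shell instead of the console. The sketch below assumes a configured AWS CLI with permission to describe instances. The helper names instance_state and wait_for_state are hypothetical, and the instance ID in the usage comment is a placeholder.

```shell
# instance_state ID: print the EC2 state name (running, stopping, stopped, ...)
# of instance ID. Requires a configured AWS CLI.
instance_state() {
    aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[].Instances[].State.Name' --output text
}

# wait_for_state WANTED ID [TRIES] [DELAY]: poll until the instance reports
# WANTED; give up after TRIES polls (default 30) spaced DELAY seconds apart
# (default 10). Hypothetical helper.
wait_for_state() {
    wanted="$1"; id="$2"; tries="${3:-30}"; delay="${4:-10}"
    while [ "$tries" -gt 0 ]; do
        if [ "$(instance_state "$id")" = "$wanted" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage (the instance ID is a placeholder):
#   wait_for_state stopped i-0123456789abcdef0 && echo "first node is stopped"
```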

The cluster will now promote the instance on the second node to be the primary instance:

hana02:/home/ec2-user # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Wed Sep 19 13:19:14 2018

Hosts  clone_state lpa_hdb_lpt node_state op_mode   remoteHost roles                            score site srmode sync_state version                vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------
hana01             1537362888  offline    logreplay hana02                                            WDF  sync                                     hana01
hana02 PROMOTED    1537363154  online     logreplay hana01     4:P:master1:master:worker:master 100   ROT  sync   PRIM       2.00.030.00.1522209842 hana02
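
The promotion shows up in the roles column, which changes from 4:S:... to 4:P:... for hana02. A minimal sketch for reading that marker from the output follows. The helper name hana_role is hypothetical, and it assumes the second component of the roles value is P or S, as in the output shown in this document.

```shell
# hana_role HOST: read SAPHanaSR-showAttr output on stdin and print the
# second component of the roles value (P or S) for HOST.
# Prints nothing when the row has no roles value, as for an offline node.
hana_role() {
    awk -v host="$1" '$1 == host {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-9]+:[PS]:/) {
                split($i, part, ":")
                print part[2]
            }
    }'
}

# Usage on the healthy node:
#   SAPHanaSR-showAttr | hana_role hana02
```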

The cluster status will be the following:

hana02:/home/ec2-user #  crm_mon -1rfn
Stack: corosync
Current DC: hana02 (version 1.1.15-21.1-e174ec8) - partition with quorum
Last updated: Wed Sep 19 13:19:16 2018
Last change: Wed Sep 19 13:19:14 2018 by root via crm_attribute on hana02
2 nodes configured
6 resources configured
Node hana01: OFFLINE
Node hana02: online
rsc_SAPHana_HDB_HDB00	(ocf::suse:SAPHana):	Master
res_AWS_STONITH	(stonith:external/ec2):	Started
rsc_SAPHanaTopology_HDB_HDB00	(ocf::suse:SAPHanaTopology):	Started
res_AWS_IP	(ocf::heartbeat:aws-vpc-move-ip):	Started
Inactive resources:
Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
     Started: [ hana02 ]
     Stopped: [ hana01 ]
 Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
     Masters: [ hana02 ]
     Stopped: [ hana01 ]
Migration Summary:
* Node hana02:
   res_AWS_STONITH: migration-threshold=5000 fail-count=1 last-failure='Wed Sep 19 13:18:00 2018'
Failed Actions:
* res_AWS_STONITH_monitor_120000 on hana02 'unknown error' (1): call=-1, status=Timed Out, exitreason='none',
    last-rc-change='Wed Sep 19 13:18:00 2018', queued=0ms, exec=0ms
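
The fail count reported in the migration summary can also be read by a script, for example to decide whether a cleanup is needed. This is a sketch that assumes the "fail-count=N" layout shown above; fail_count is a hypothetical helper name.

```shell
# fail_count RESOURCE: read `crm_mon -1rfn` output on stdin and print the
# fail-count of RESOURCE from the migration summary.
# Prints nothing when the resource has no recorded failure.
fail_count() {
    # Summary lines start with "RESOURCE:", status lines with "RESOURCE"
    # followed by the agent in parentheses, so the colon tells them apart.
    awk -v res="$1:" '$1 == res {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^fail-count=/) {
                sub(/^fail-count=/, "", $i)
                print $i
            }
    }'
}

# Usage on the healthy node:
#   crm_mon -1rfn | fail_count res_AWS_STONITH
```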

Check whether the overlay IP address gets hosted on the eth0 interface of the second node. Example:

hana02:/tmp # ip address list eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 06:4f:41:53:ff:76 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.129/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.10.21/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::44f:41ff:fe53:ff76/64 scope link 
       valid_lft forever preferred_lft forever

Last step: Clean up the failed action message on the second node:

hana02:/home/ec2-user # crm resource cleanup res_AWS_STONITH hana02
Cleaning up res_AWS_STONITH on hana02, removing fail-count-res_AWS_STONITH
Waiting for 1 replies from the CRMd. OK
hana02:/home/ec2-user # crm status
Stack: corosync
Current DC: hana02 (version 1.1.15-21.1-e174ec8) - partition with quorum
Last updated: Wed Sep 19 13:20:44 2018
Last change: Wed Sep 19 13:20:34 2018 by hacluster via crmd on hana02
2 nodes configured
6 resources configured
Online: [ hana02 ]
OFFLINE: [ hana01 ]
Full list of resources:
res_AWS_STONITH	(stonith:external/ec2):	Started hana02
 res_AWS_IP	(ocf::heartbeat:aws-vpc-move-ip):	Started hana02
 Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
     Started: [ hana02 ]
     Stopped: [ hana01 ]
 Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
     Masters: [ hana02 ]
     Stopped: [ hana01 ]

Recovering the Cluster

Restart your stopped node. See:

Starting first node

Check whether the cluster services get started.

Check whether the first node becomes a replicating server. Example:

hana02:/home/ec2-user # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Wed Sep 19 13:57:41 2018

Hosts  clone_state lpa_hdb_lpt node_state op_mode   remoteHost roles                            score site srmode sync_state version                vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------
hana01 DEMOTED     30          online     logreplay hana02     4:S:master1:master:worker:master 100   WDF  sync   SOK        2.00.030.00.1522209842 hana01
hana02 PROMOTED    1537365461  online     logreplay hana01     4:P:master1:master:worker:master 150   ROT  sync   PRIM       2.00.030.00.1522209842 hana02
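
In the output above, the recovered node hana01 reports the sync state SOK and the new primary hana02 reports PRIM. A minimal sketch for checking this from a script follows. The helper name sync_state is hypothetical. It matches the tokens SOK and PRIM seen in this document, plus SFAIL as an assumption about a failed replication state.

```shell
# sync_state HOST: read SAPHanaSR-showAttr output on stdin and print the
# sync state token (SOK, SFAIL or PRIM) found in the row of HOST.
# Matching by token rather than column position tolerates empty columns.
sync_state() {
    awk -v host="$1" '$1 == host {
        for (i = 2; i <= NF; i++)
            if ($i == "SOK" || $i == "SFAIL" || $i == "PRIM") print $i
    }'
}

# Usage on the healthy node:
#   [ "$(SAPHanaSR-showAttr | sync_state hana01)" = "SOK" ] && echo "replication in sync"
```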

Stefan Schneider Tue, 09/11/2018 - 14:50
