Take Over a HANA DB by Killing the Database
Simulated Failures
- Database failure: the database is not working as expected.
Components being tested
- HANA agent
- Overlay IP agent
- Optional: Route 53 agent if it is configured
Approach
- Have a correctly working HANA DB cluster
- Kill the database on the primary node
- The cluster will fail over the database without fencing the node; the takeover can be followed live as sketched below
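While the test runs, the cluster reaction can be followed live from a second terminal, for example with a simple watch loop. This is only a minimal sketch; the interval and the option set are illustrative:
hana01:~ # watch -n 5 'crm_mon -1r; SAPHanaSR-showAttr'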
Initial Configuration
Check whether the overlay IP address is hosted on interface eth0 of the first node:
hana01:/var/log # ip address list eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
link/ether 02:ca:c9:ca:a6:52 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.115/24 brd 10.0.1.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.10.21/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::ca:c9ff:feca:a652/64 scope link
valid_lft forever preferred_lft forever
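Before the test, the same check on the second node should show eth0 without the overlay IP address 192.168.10.21. A hedged sketch; the grep is expected to return nothing:
hana02:/tmp # ip address list eth0 | grep 192.168.10.21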
Check the cluster status as superuser with the command crm status:
hana01:/var/log # crm status
Stack: corosync
Current DC: hana02 (version 1.1.15-21.1-e174ec8) - partition with quorum
Last updated: Tue Sep 11 12:37:53 2018
Last change: Tue Sep 11 12:37:53 2018 by root via crm_attribute on hana01

2 nodes configured
6 resources configured

Online: [ hana01 hana02 ]
Full list of resources:
res_AWS_STONITH (stonith:external/ec2): Started hana01
res_AWS_IP (ocf::heartbeat:aws-vpc-move-ip): Started hana01
Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
Started: [ hana01 hana02 ]
Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
Masters: [ hana01 ]
Slaves: [ hana02 ]
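Optionally, confirm that fencing is enabled in the cluster properties before injecting the failure. This is a minimal sketch and assumes that stonith-enabled is set explicitly in the configuration:
hana01:/var/log # crm configure show | grep stonith-enabled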
Kill Database
hana01 is the node with the leading HANA database.
The failover will only work if the re-sync of the slave node has completed. Check this with the command SAPHanaSR-showAttr. Example:
hana02:/tmp # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Tue Sep 11 09:11:16 2018
Hosts clone_state lpa_hdb_lpt node_state op_mode remoteHost roles score site srmode sync_state version vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------
hana01 PROMOTED 1536657075 online logreplay hana02 4:P:master1:master:worker:master 150 WDF sync PRIM 2.00.030.00.1522209842 hana01
hana02 DEMOTED 30 online logreplay hana01 4:S:master1:master:worker:master 100 ROT sync SOK 2.00.030.00.1522209842 hana02
The synchronisation state (column sync_state) of the slave node has to be SOK.
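If this precondition is checked from a script rather than by eye, a minimal sketch is a polling loop that waits for the SOK token in the output (the token match and the 10-second interval are illustrative):
hana02:/tmp # while ! SAPHanaSR-showAttr | grep -q ' SOK '; do sleep 10; done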
Become the HANA database administration user and execute the following command:
hdbadm@hana01:/usr/sap/HDB/HDB00> HDB kill
killing HDB processes:
kill -9 462 /usr/sap/HDB/HDB00/hana01/trace/hdb.sapHDB_HDB00 -d -nw -f /usr/sap/HDB/HDB00/hana01/daemon.ini pf=/usr/sap/HDB/SYS/profile/HDB_HDB00_hana01
kill -9 599 hdbnameserver
kill -9 826 hdbcompileserver
kill -9 828 hdbpreprocessor
kill -9 1036 hdbindexserver -port 30003
kill -9 1038 hdbxsengine -port 30007
kill -9 1372 hdbwebdispatcher
kill orphan HDB processes:
kill -9 599 [hdbnameserver] <defunct>
kill -9 1036 [hdbindexserver] <defunct>
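Optionally, double-check that the instance processes are really gone before the cluster reacts, for example with HDB info (a hedged sketch; output omitted):
hdbadm@hana01:/usr/sap/HDB/HDB00> HDB info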
Monitoring the Failover
The cluster will now switch the master and the slave roles. The failover is complete when the HANA database on the first node has been synchronized as well. Example:
hana02:/tmp # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Tue Sep 11 09:20:38 2018
Hosts clone_state lpa_hdb_lpt node_state op_mode remoteHost roles score site srmode sync_state version vhost
---------------------------------------------------------------------------------------------------------------------------------------------------------------
hana01 DEMOTED 30 online logreplay hana02 4:S:master1:master:worker:master -INFINITY WDF sync SOK 2.00.030.00.1522209842 hana01
hana02 PROMOTED 1536657638 online logreplay hana01 4:P:master1:master:worker:master 150 ROT sync PRIM 2.00.030.00.1522209842 hana02
Check the cluster status as superuser with the command crm status. Example:
hana02:/tmp # crm status
Stack: corosync
Current DC: hana02 (version 1.1.15-21.1-e174ec8) - partition with quorum
Last updated: Tue Sep 11 09:28:10 2018
Last change: Tue Sep 11 09:28:06 2018 by root via crm_attribute on hana02

2 nodes configured
6 resources configured

Online: [ hana01 hana02 ]
Full list of resources:
res_AWS_STONITH (stonith:external/ec2): Started hana01
res_AWS_IP (ocf::heartbeat:aws-vpc-move-ip): Started hana02
Clone Set: cln_SAPHanaTopology_HDB_HDB00 [rsc_SAPHanaTopology_HDB_HDB00]
Started: [ hana01 hana02 ]
Master/Slave Set: msl_SAPHana_HDB_HDB00 [rsc_SAPHana_HDB_HDB00]
Masters: [ hana02 ]
Slaves: [ hana01 ]

Failed Actions:
* rsc_SAPHana_HDB_HDB00_monitor_61000 on hana01 'not running' (7): call=273, status=complete, exitreason='none',
last-rc-change='Tue Sep 11 09:18:47 2018', queued=0ms, exec=1867ms
* res_AWS_IP_monitor_60000 on hana01 'not running' (7): call=264, status=complete, exitreason='none',
last-rc-change='Tue Sep 11 08:57:15 2018', queued=0ms, exec=0ms
All resources are started. The overlay IP address is now hosted on the second node. Delete the failed actions with the following commands:
hana02:/tmp # crm resource cleanup rsc_SAPHana_HDB_HDB00
Cleaning up rsc_SAPHana_HDB_HDB00:0 on hana01, removing fail-count-rsc_SAPHana_HDB_HDB00
Cleaning up rsc_SAPHana_HDB_HDB00:0 on hana02, removing fail-count-rsc_SAPHana_HDB_HDB00
Waiting for 2 replies from the CRMd.. OK
hana02:/tmp # crm resource cleanup res_AWS_IP
Cleaning up res_AWS_IP on hana01, removing fail-count-res_AWS_IP
Cleaning up res_AWS_IP on hana02, removing fail-count-res_AWS_IP
Waiting for 2 replies from the CRMd.. OK
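Depending on the pacemaker and crmsh versions, all failure records can also be cleared in one step (a hedged sketch):
hana02:/tmp # crm_resource --cleanup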
The crm status command will no longer show the failures.
Check whether the overlay IP address is now hosted on the eth0 interface of the second node. Example:
hana02:/tmp # ip address list eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
link/ether 06:4f:41:53:ff:76 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.129/24 brd 10.0.2.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.10.21/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::44f:41ff:fe53:ff76/64 scope link
valid_lft forever preferred_lft forever
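The aws-vpc-move-ip agent implements the overlay IP by updating a VPC route table entry, so the takeover can also be verified on the AWS side. This is a hedged sketch: the route table ID and region are placeholders and the AWS CLI must be configured on the node:
hana02:/tmp # aws ec2 describe-route-tables --route-table-ids rtb-xxxxxxxx --region eu-central-1 --query 'RouteTables[].Routes[?DestinationCidrBlock==`192.168.10.21/32`]'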