Wednesday, June 22, 2011

Troubleshooting RAC Public Network Failure

Here are some steps I used to troubleshoot the failure of the public network used for SCAN in a two-node RAC cluster.
Note: crsstat is an alias I use for the command: crsctl stat res -t
First, check the status of the Clusterware resources. Several resources are offline in the output below; the ones we are interested in are ora.LISTENER.lsnr and ora.tibora30.vip. The local listener on tibora30 is offline, and its VIP is in INTERMEDIATE state because it has failed over to tibora31. Notice also that all of the SCAN listeners have failed over to the surviving node.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >crsstat
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DG_DATA.dg
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
ora.DG_FLASH.dg
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
ora.LISTENER.lsnr
               ONLINE  OFFLINE      tibora30
               ONLINE  ONLINE       tibora31
ora.asm
               ONLINE  ONLINE       tibora30                 Started
               ONLINE  ONLINE       tibora31                 Started
ora.gsd
               OFFLINE OFFLINE      tibora30
               OFFLINE OFFLINE      tibora31
ora.net1.network
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
ora.ons
               ONLINE  ONLINE       tibora30
               ONLINE  OFFLINE      tibora31
ora.registry.acfs
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       tibora31
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       tibora31
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       tibora31
ora.cvu
      1        ONLINE  OFFLINE
ora.oc4j
      1        ONLINE  ONLINE       tibora31
ora.scan1.vip
      1        ONLINE  ONLINE       tibora30
ora.scan2.vip
      1        ONLINE  ONLINE       tibora31
ora.scan3.vip
      1        ONLINE  ONLINE       tibora31
ora.tibora30.vip
      1        ONLINE  INTERMEDIATE tibora31                 FAILED OVER
ora.tibora31.vip
      1        ONLINE  ONLINE       tibora31
ora.tibprd.db
      1        ONLINE  ONLINE       tibora30                 Open
      2        ONLINE  ONLINE       tibora31                 Open
ora.tibprd.tibprd_applog.svc
      1        ONLINE  ONLINE       tibora31
ora.tibprd.tibprd_basic.svc
      1        ONLINE  ONLINE       tibora31
ora.tibprd.tibprd_smap.svc
      1        ONLINE  ONLINE       tibora31
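With this many resources in the output, a quick filter helps surface only the unhealthy ones. A minimal sketch, run here against a few lines excerpted from the output above; on a live node you would pipe the real command straight into the filter:

```shell
# Flag resources whose state is OFFLINE or INTERMEDIATE.
# On a live cluster: crsctl stat res -t | grep -E 'OFFLINE|INTERMEDIATE'
# Here we filter a captured excerpt of the crsstat output above.
sample='ora.LISTENER.lsnr
               ONLINE  OFFLINE      tibora30
               ONLINE  ONLINE       tibora31
ora.tibora30.vip
      1        ONLINE  INTERMEDIATE tibora31                 FAILED OVER'
printf '%s\n' "$sample" | grep -E 'OFFLINE|INTERMEDIATE'
```

Note that this also matches resources whose TARGET is OFFLINE (such as ora.gsd), which is usually worth seeing anyway.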
Next, look at the Clusterware alert log under $GI_HOME/log/<hostname>/alert<hostname>.log for entries similar to the ones below:
2011-06-21 09:43:57.844
[/u01/11.2.0/grid/bin/orarootagent.bin(21168162)]CRS-5818:Aborted command 'check for resource: ora.net1.network tibora30 1' for resource 'ora.net1.network'. Details at (:CRSAGF00113:) {0:9:2} in /u01/11.2.0/grid/log/tibora30/agent/crsd/orarootagent_root/orarootagent_root.log.
2011-06-21 09:44:00.459
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-06-21 09:44:01.112
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-06-21 09:44:01.180
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
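The relevant errors can be pulled out of the alert log together with their timestamps. A sketch over a captured, abbreviated excerpt of the entries above; on the node itself you would grep the real alert log path instead:

```shell
# Extract CRS agent check failures (CRS-5016/CRS-5818) plus the
# timestamp line preceding each one (-B1). The log text below is an
# abbreviated excerpt of the alert log entries shown above.
log='2011-06-21 09:43:57.844
[orarootagent.bin(21168162)]CRS-5818:Aborted command for resource ora.net1.network
2011-06-21 09:44:00.459
[oraagent.bin(22413372)]CRS-5016:Process onsctli for action check failed'
printf '%s\n' "$log" | grep -B1 -E 'CRS-5016|CRS-5818'
```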
Check the status of the VIP on the affected node:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status vip -n tibora30
VIP tibora30-vip is enabled
VIP tibora30-vip is not running
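The combination to act on is "enabled" plus "not running": Clusterware wants the resource up, but it is down. A sketch that classifies the two status lines above:

```shell
# A VIP that is enabled but not running needs intervention; an enabled
# and running VIP, or a disabled one, would be left alone. Input is
# copied from the srvctl status vip output above.
status='VIP tibora30-vip is enabled
VIP tibora30-vip is not running'
if printf '%s\n' "$status" | grep -q 'is enabled' &&
   printf '%s\n' "$status" | grep -q 'is not running'; then
  echo 'VIP needs attention'
else
  echo 'VIP is fine'
fi
```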
Also check the status of the SCAN resources.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node tibora30
SCAN VIP scan2 is enabled
SCAN VIP scan2 is running on node tibora31
SCAN VIP scan3 is enabled
SCAN VIP scan3 is running on node tibora31
In this particular case the SCAN VIPs were running on both nodes. On another cluster that experienced a similar network failure, the SCAN VIPs were all running on one node.
First we need to start the local listener:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start listener
Now check the status of the resources.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >crsstat
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DG_DATA.dg
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
ora.DG_FLASH.dg
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
ora.LISTENER.lsnr
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
ora.asm
               ONLINE  ONLINE       tibora30                 Started
               ONLINE  ONLINE       tibora31                 Started
ora.gsd
               OFFLINE OFFLINE      tibora30
               OFFLINE OFFLINE      tibora31
ora.net1.network
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
ora.ons
               ONLINE  ONLINE       tibora30
               ONLINE  OFFLINE      tibora31
ora.registry.acfs
               ONLINE  ONLINE       tibora30
               ONLINE  ONLINE       tibora31
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       tibora31
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       tibora31
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       tibora31
ora.cvu
      1        ONLINE  OFFLINE
ora.oc4j
      1        ONLINE  ONLINE       tibora31
ora.scan1.vip
      1        ONLINE  ONLINE       tibora30
ora.scan2.vip
      1        ONLINE  ONLINE       tibora31
ora.scan3.vip
      1        ONLINE  ONLINE       tibora31
ora.tibora30.vip
      1        ONLINE  ONLINE       tibora30
ora.tibora31.vip
      1        ONLINE  ONLINE       tibora31
ora.tibprd.db
      1        ONLINE  ONLINE       tibora30                 Open
      2        ONLINE  ONLINE       tibora31                 Open
ora.tibprd.tibprd_applog.svc
      1        ONLINE  ONLINE       tibora31
ora.tibprd.tibprd_basic.svc
      1        ONLINE  ONLINE       tibora31
ora.tibprd.tibprd_smap.svc
      1        ONLINE  ONLINE       tibora31

Starting the local listener also caused the VIP to fail back to its home node.
In another situation I had to relocate the VIP back to the original node manually.
Next, check the nodeapps, including ONS.


grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status nodeapps
VIP tibora30-vip is enabled
VIP tibora30-vip is running on node: tibora30
VIP tibora31-vip is enabled
VIP tibora31-vip is running on node: tibora31
Network is enabled
Network is running on node: tibora30
Network is running on node: tibora31
GSD is disabled
GSD is not running on node: tibora30
GSD is not running on node: tibora31
ONS is enabled
ONS daemon is running on node: tibora30
ONS daemon is not running on node: tibora31
Here you can see that the ONS daemon is not running on tibora31.
To start it, issue the following command:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start nodeapps -n tibora31
PRKO-2421 : Network resource is already started on node(s): tibora31
PRKO-2420 : VIP is already started on node(s): tibora31
The PRKO messages are informational: the network and VIP were already running on tibora31, so only ONS was actually started. Check the status of the nodeapps again:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status nodeapps
VIP tibora30-vip is enabled
VIP tibora30-vip is running on node: tibora30
VIP tibora31-vip is enabled
VIP tibora31-vip is running on node: tibora31
Network is enabled
Network is running on node: tibora30
Network is running on node: tibora31
GSD is disabled
GSD is not running on node: tibora30
GSD is not running on node: tibora31
ONS is enabled
ONS daemon is running on node: tibora30
ONS daemon is running on node: tibora31
The CVU resource can be started as follows:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start cvu -n tibora30
You can now verify connectivity to the database and its services. I prefer to use SQL Developer for this, with one connection defined for each service name.

All of the SCAN listeners were still running on a single node (tibora31), while the scan1 VIP was on tibora30. At least one SCAN listener needed to be relocated so that connection requests arriving on that SCAN VIP could be serviced.
grid@tibora31[+ASM2]-/home/grid >srvctl relocate scan_listener -i 1 -n tibora30
grid@tibora31[+ASM2]-/home/grid >srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node tibora30
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node tibora31
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node tibora31
Once the SCAN was relocated the application connected successfully.
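The condition that prompted the relocation, a SCAN VIP on one node while its matching SCAN listener runs on another, can be spotted by comparing the two srvctl outputs. A sketch over the captured output from earlier in the post, as it stood before the relocation:

```shell
# Compare, per SCAN number, the node hosting the VIP against the node
# hosting the corresponding listener. Lines are copied from the
# srvctl status scan / srvctl status scan_listener output above,
# before the relocation.
vips='SCAN VIP scan1 is running on node tibora30
SCAN VIP scan2 is running on node tibora31
SCAN VIP scan3 is running on node tibora31'
lsnrs='SCAN listener LISTENER_SCAN1 is running on node tibora31
SCAN listener LISTENER_SCAN2 is running on node tibora31
SCAN listener LISTENER_SCAN3 is running on node tibora31'
for i in 1 2 3; do
  vnode=$(printf '%s\n' "$vips"  | awk -v i="$i" '$3 == "scan" i {print $NF}')
  lnode=$(printf '%s\n' "$lsnrs" | awk -v i="$i" '$3 == "LISTENER_SCAN" i {print $NF}')
  if [ "$vnode" != "$lnode" ]; then
    echo "scan$i: VIP on $vnode but listener on $lnode"
  fi
done
```

Here only scan1 is reported mismatched, which is exactly the listener that was relocated.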
