Grid Infrastructure 11.2.0.3 for AIX 64 includes Cluster Health Monitor

Cluster Health Monitor, or CHM (formerly IPD/OS), is a utility that gathers OS metrics such as physical memory, swap and CPU usage. It's useful for troubleshooting node reboots or evictions, instance evictions and poor performance. CHM gathers data in real time with little resource overhead and stores the metrics so they can be retrieved later for root-cause analysis. This matters because other utilities often cannot provide such detailed metrics when a node is short of resources.
CHM was bundled with Grid Infrastructure for Linux and Solaris (x86-64) platforms in 11.2.0.2. With Grid Infrastructure 11.2.0.3.0 the feature is now also available for AIX. After upgrading from 11.2.0.2 to 11.2.0.3, a new resource, ora.crf, can be seen in the CRS stack.

grid@oradba10t[+ASM1]-/home/grid >crsctl stat res -t -init

--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       oradba10t                Started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       oradba10t
ora.crf
      1        ONLINE  ONLINE       oradba10t
ora.crsd
      1        ONLINE  ONLINE       oradba10t
ora.cssd
      1        ONLINE  ONLINE       oradba10t
ora.cssdmonitor
      1        ONLINE  ONLINE       oradba10t
ora.ctssd
      1        ONLINE  ONLINE       oradba10t                OBSERVER
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        ONLINE  ONLINE       oradba10t
ora.evmd
      1        ONLINE  ONLINE       oradba10t
ora.gipcd
      1        ONLINE  ONLINE       oradba10t
ora.gpnpd
      1        ONLINE  ONLINE       oradba10t
ora.mdnsd
      1        ONLINE  ONLINE       oradba10t
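If you want to check for ora.crf from a script rather than by eye, the listing can be parsed. The following is only a minimal sketch: it assumes the output has already been captured as text, and that each resource name line is followed by one row of the form "instance TARGET STATE ...", as in the listing above. The function name parse_init_resources is made up for this example.

```python
def parse_init_resources(text):
    """Map resource name -> (TARGET, STATE) from `crsctl stat res -t -init` text."""
    resources = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("ora."):
            # A resource name line, e.g. "ora.crf"
            current = line
        elif current and line and line[0].isdigit():
            # The row under the name: instance number, TARGET, STATE, ...
            fields = line.split()
            resources[current] = (fields[1], fields[2])
            current = None
    return resources


# Two resources copied from the listing above
sample = """\
ora.crf
1 ONLINE ONLINE oradba10t
ora.diskmon
1 OFFLINE OFFLINE
"""

status = parse_init_resources(sample)
print(status["ora.crf"])
```

A state other than ONLINE for ora.crf would suggest that CHM is not collecting on that node.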


There are actually two new processes used for gathering CHM data: ologgerd and osysmond. Look for osysmond.bin and ologgerd in the process listing below.


grid@oradba10t[+ASM1]-/home/grid >ps -ef|grep d.bin|grep -v grep
root 6881454 1 0 10:05:09 - 2:37 /u01/11.2.0.3/grid/bin/orarootagent.bin
root 7274618 1 0 10:05:13 - 14:34 /u01/11.2.0.3/grid/bin/osysmond.bin
grid 7471320 33882340 0 10:06:44 - 0:00 /u01/11.2.0.3/grid/bin/evmlogger.bin -o /u01/11.2.0.3/grid/evm/log/evmlogger.info -l /u01/11.2.0.3/grid/evm/log/evmlogger.log
oracle 8781958 1 0 10:07:18 - 2:33 /u01/11.2.0.3/grid/bin/oraagent.bin
root 9109720 1 0 10:04:37 - 0:00 /bin/sh /u01/11.2.0.3/grid/bin/ocssd
grid 13041882 1 0 10:03:52 - 2:32 /u01/11.2.0.3/grid/bin/oraagent.bin
root 13697042 1 2 10:02:47 - 7:17 /u01/11.2.0.3/grid/bin/ohasd.bin reboot
grid 15138940 1 0 10:07:10 - 0:00 /u01/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit
root 16515292 1 0 10:06:32 - 9:22 /u01/11.2.0.3/grid/bin/crsd.bin reboot
root 19202134 1 0 10:04:14 - 0:14 /u01/11.2.0.3/grid/bin/cssdagent
grid 21102764 1 1 10:04:04 - 6:57 /u01/11.2.0.3/grid/bin/gipcd.bin
root 21233726 1 0 10:05:10 - 7:48 /u01/11.2.0.3/grid/bin/octssd.bin
root 22544568 1 0 10:04:04 - 0:14 /u01/11.2.0.3/grid/bin/cssdmonitor
grid 24182816 1 0 10:24:55 - 0:00 /u01/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN2 -inherit
root 24838342 1 1 10:06:48 - 9:49 /u01/11.2.0.3/grid/bin/orarootagent.bin
grid 25624680 1 0 10:04:00 - 0:00 /u01/11.2.0.3/grid/bin/mdnsd.bin
root 26280110 1 0 10:06:17 - 1:35 /u01/11.2.0.3/grid/bin/ologgerd -M -d /u01/11.2.0.3/grid/crf/db/oradba10t
grid 29687946 1 0 10:04:01 - 0:10 /u01/11.2.0.3/grid/bin/gpnpd.bin
grid 30605512 1 0 10:24:54 - 0:04 /u01/11.2.0.3/grid/bin/scriptagent.bin
grid 31457480 1 0 10:24:55 - 0:00 /u01/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
grid 32178268 1 0 10:06:48 - 1:25 /u01/11.2.0.3/grid/bin/oraagent.bin
grid 32309280 9109720 2 10:04:37 - 9:10 /u01/11.2.0.3/grid/bin/ocssd.bin
grid 33882340 1 0 10:06:38 - 8:03 /u01/11.2.0.3/grid/bin/evmd.bin
grid 34144466 1 0 10:34:16 - 0:00 /u01/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN1 -inheri

There is one master logger daemon, with replicas running on the other nodes. On the other node you can see another ologgerd process:

root  9437368        1   0 10:30:47      -  0:18 /u01/11.2.0.3/grid/bin/ologgerd -m oradba10t -r -d /u01/11.2.0.3/grid/crf/db/oradba11t

Note that the process on the first node is running with the -M option (master) while on the second node it's running with the -r option (replica).
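To tell the two roles apart in a script, you can look for those flags in the ps output. This is a rough sketch that relies only on the -M and -r options seen in the two ologgerd lines in this post. The function name ologgerd_role is invented for illustration.

```python
def ologgerd_role(ps_line):
    """Return 'master', 'replica', 'unknown', or None if the line is not ologgerd."""
    args = ps_line.split()
    if not any(a.endswith("/ologgerd") for a in args):
        return None
    if "-M" in args:       # upper-case M: master logger
        return "master"
    if "-r" in args:       # replica logger (also carries -m <master node>)
        return "replica"
    return "unknown"


node1 = ("root 26280110 1 0 10:06:17 - 1:35 /u01/11.2.0.3/grid/bin/ologgerd "
         "-M -d /u01/11.2.0.3/grid/crf/db/oradba10t")
node2 = ("root  9437368        1   0 10:30:47      -  0:18 /u01/11.2.0.3/grid/bin/ologgerd "
         "-m oradba10t -r -d /u01/11.2.0.3/grid/crf/db/oradba11t")

print(ologgerd_role(node1), ologgerd_role(node2))
```

The comparison is case-sensitive on purpose, because the replica line contains a lower-case -m followed by the master's node name.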

The ologgerd process saves its data to a Berkeley DB. Each replica process sends its data to the master process. osysmond gathers the metrics locally on each node and sends them to the master ologgerd.

CHM reporting can be done in either interactive or command-line mode using the oclumon command.
To see the list of available verbs, run oclumon help:


grid@oradba10t[+ASM1]-/home/grid >oclumon help
For help from command line : oclumon <verb> -h
For help in interactive mode : <verb> -h
Currently supported verbs are :
showobjects, dumpnodeview, manage, version, debug, quit, exit, and help

The size of the repository depends on how long metrics are retained. The maximum retention period is 3 days and the default size is 1GB. To get the current size of the repository, expressed in seconds of retention, run the following command:
grid@oradba10t[+ASM1]-/home/grid >oclumon manage -get repsize
CHM Repository Size = 61646
Done
The repository can be resized to retain 24 hours' worth of data (86400 seconds) using the following command:


grid@oradba10t[+ASM1]-/home/grid >oclumon manage -repos resize 86400
oradba10t --> retention check successful
oradba11t --> retention check successful
New retention is 86400 and will use 1504880640 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
Done
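Because oclumon works in seconds, a little arithmetic helps when reading or choosing these values. The sketch below only converts between hours and seconds, using the figures from the output in this post. The bytes-per-second figure comes from this one cluster's resize message and should not be taken as a general sizing rule.

```python
def hours_to_seconds(hours):
    """Retention in hours -> the value to pass to 'oclumon manage -repos resize'."""
    return int(hours * 3600)


def seconds_to_hours(seconds):
    """Retention reported by 'oclumon manage -get repsize' -> hours."""
    return seconds / 3600


print(hours_to_seconds(24))                  # 86400
print(round(seconds_to_hours(61646), 1))     # 17.1

# From the resize message above: 1504880640 bytes for 86400 seconds of data
bytes_per_second = 1504880640 / 86400
print(bytes_per_second)
```

So the repository in this example held roughly 17 hours of data before the resize.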


There are two commands used to collect CHM data - diagcollection.pl and oclumon.
For example:

diagcollection.pl --collect --crshome /u01/11.2.0.3/grid --chmos --incidenttime 03/08/201215:00:00 --incidentduration 00:15
The above command collects data from the specified date/time for a 15-minute duration.
Example output:


[root@oradba10t chm]$ diagcollection.pl --collect --crshome /u01/11.2.0.3/grid --chmos --incidenttime 03/08/201215:00:00 --incidentduration 00:15
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
Collecting Cluster Health Monitor (OS) data
Collecting OS logs
[root@oradba10t chm]$ ls -ltr
total 2208
-rw-r--r-- 1 root system 887008 Mar 8 15:26 chmosData_oradba10t_20120308_1526.tar.gz
-rw-r--r-- 1 root system 231485 Mar 8 15:26 osData_oradba10t_20120308_1526.tar.gz
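The --incidenttime value is easy to get wrong because the date and time run together with no space. Here is a small sketch that builds the argument list from a Python datetime. It assumes the MM/DD/YYYYHH:MM:SS layout and the HH:MM duration layout implied by the example in this post. The helper name chmos_collect_args is hypothetical.

```python
from datetime import datetime


def chmos_collect_args(crs_home, incident_time, duration_minutes):
    """Build the diagcollection.pl argument list for a CHM (--chmos) collection."""
    hours, minutes = divmod(duration_minutes, 60)
    return [
        "diagcollection.pl", "--collect",
        "--crshome", crs_home,
        "--chmos",
        # Date and time are joined with no separator, as in the example above
        "--incidenttime", incident_time.strftime("%m/%d/%Y%H:%M:%S"),
        "--incidentduration", f"{hours:02d}:{minutes:02d}",
    ]


args = chmos_collect_args("/u01/11.2.0.3/grid", datetime(2012, 3, 8, 15, 0, 0), 15)
print(" ".join(args))
```

The printed line should match the diagcollection.pl command used in this post, and the list form can be handed to subprocess.run if you prefer to launch it from Python.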


oclumon can also be used to dump data from the repository:

grid@oradba10t[+ASM1]-/u02/chm >oclumon dumpnodeview -allnodes -v -last "00:00:10" > chmos_data.out
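The -last argument is a duration written as "HH:MM:SS". As a quick sketch, this is one way to turn a number of seconds into that string and assemble the same dumpnodeview call. The function names last_window and dumpnodeview_args are made up for this example, and only the options already used above appear in it.

```python
def last_window(seconds):
    """Format a number of seconds as the HH:MM:SS string used by -last."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def dumpnodeview_args(seconds):
    """Argument list for dumping the last <seconds> of data from all nodes."""
    return ["oclumon", "dumpnodeview", "-allnodes", "-v", "-last", last_window(seconds)]


print(last_window(10))
print(" ".join(dumpnodeview_args(15 * 60)))
```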

Here's the complete output.

This is just an introduction to what is available with CHM. 


