
11gR1 CRS start failing with ORA-29702

This post began as a comment from Sasi on the previous article, 10.2 CRS startup issue. I am converting it to a post so that we can get some feedback on this issue from other users. I suspect it is caused by a RHEL5 bug (fixed in RHEL5u6) in which a NIC goes down when multiple interface cards are in use.

We had a similar error, but the cause was different, so I thought I would share it here.

We recently installed a two-node 11gR1 RAC and all was fine until last week, when we suddenly saw the same error: “ORA-29702: error occurred in Cluster Group Service operation”. CRS would not start. Some of the CRS processes were still running, and CRS refused to stop.

root@node1> /u01/app/crs/bin/crsctl stop crs
Stopping resources.
This could take several minutes.
Error while stopping resources. Possible cause: CRSD is down.
Stopping Cluster Synchronization Services.
Unable to communicate with the Cluster Synchronization Services daemon.

The ASM alert log and the database alert log had the following to say:

ASM Alert Log:

Errors in file /u02/app/asm/diag/asm/+asm/+ASM1/trace/+ASM1_lmon_2185.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON (ospid: 2185): terminating the instance due to error 29702
Mon Nov 22 20:02:16 2011
ORA-1092 : opitsk aborting process

Oracle database Alert Log:

ERROR: LMON (ospid: 3721) detects hung instances during IMR reconfiguration
Tue Nov 22 22:10:37 2011
Error: KGXGN polling error (16)
Errors in file /u03/app/oracle/diag/rdbms/ccbdrpd/ccbdrpd1/trace/ccbdrpd1_lmon_3721.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON (ospid: 3721): terminating the instance due to error 29702

There was not much information in the trace files.

I looked at the Metalink note Diagnosing ‘ORA-29702: error occurred in Cluster Group Service operation’ [ID 848622.1], but the problems described in it did not apply to our site.

I also looked at the CRS alert log, the CRSD logs and the CSSD logs. There was heaps of information, but nothing useful for nailing down the issue, and I could not see any error messages.

Also, looked at

RAC ASM instances crash with ORA-29702 when multiple ASM instances start(ed) [ID 733262.1]

That note says that when multiple NICs are used for the cluster interconnect and they are not bonded properly, it can cause issues, and that this can be seen in the alert logs.

In our case NIC bonding was done properly. We had configured and bonded the interfaces as below:
• eth0 and eth1 bonded as bond0 – for public and
• eth2 and eth3 bonded as bond1 – for cluster interconnect

and the alert log showed they were configured fine.

Interface type 1 bond1 192.xxx.x.x configured from OCR for use as a cluster interconnect
Interface type 1 bond0 xx.x.x.x configured from OCR for use as a public interface

If NIC bonding is not done properly, you would see multiple entries for the cluster interconnect in the alert log.
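As a rough way to check this, the sketch below lists the distinct interface names that an alert log reports as cluster interconnects. The function name list_interconnects is made up for illustration, and it assumes the log lines have the same layout as the two entries shown above (interface name in the fourth field). More than one name in the output would be the "multiple entries" case.

```shell
# Hypothetical helper: print each distinct interface that the given
# alert log reports as a cluster interconnect.
# Assumes lines look like:
#   Interface type 1 bond1 <ip> configured from OCR for use as a cluster interconnect
list_interconnects() {
    # $1 = path to an ASM or database alert log (supply your own path)
    grep 'configured from OCR for use as a cluster interconnect' "$1" \
        | awk '{print $4}' \
        | sort -u
}

# Example call (the path is a placeholder, not taken from this post):
#   list_interconnects /path/to/alert_+ASM1.log
```

Because the log gets a new entry at every instance startup, the function de-duplicates by interface name instead of simply counting matching lines.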

Although this was not the issue in our case, it gave me a lead towards the root cause. Since the note talked about bonding, I wanted to check both the channel bonding interface files (ifcfg-bond0 and ifcfg-bond1) and the Ethernet interface configurations (ifcfg-eth0, ifcfg-eth1, ifcfg-eth2 and ifcfg-eth3).

All the configuration files were good except for ifcfg-bond1, whose entries were as below:

root@node1>cat ifcfg-bond1

DEVICE=bond1
IPADDR=xxx.xxx.xx.x
NETMASK=255.xxx.x.x
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
TYPE=ethernet

At first look the entries seem fine, but when compared with ifcfg-bond0 the problem was obvious. The ifcfg-bond0 entries were as below:

root@node1> cat ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.xxx.x.x
IPADDR=xx.x.x.x
GATEWAY=xx.x.x.x
USERCTL=no
TYPE=BOND

If you look at the TYPE entry, it is “TYPE=ethernet” in ifcfg-bond1 and “TYPE=BOND” in ifcfg-bond0.
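A mismatch like this is easy to miss when reading the files one at a time. The sketch below prints the TYPE value of every ifcfg-bond* file side by side. The function name bond_types is made up for illustration, and the default directory /etc/sysconfig/network-scripts is an assumption (the usual RHEL location; the post does not say which directory the files were read from).

```shell
# Hypothetical helper: show the TYPE= value of each bonding config file.
bond_types() {
    # $1 = directory holding the ifcfg-* files
    # (default is an assumption: the standard RHEL network-scripts directory)
    dir="${1:-/etc/sysconfig/network-scripts}"
    for f in "$dir"/ifcfg-bond*; do
        [ -f "$f" ] || continue                 # skip if the glob matched nothing
        t=$(sed -n 's/^TYPE=//p' "$f")          # value after TYPE=, empty if absent
        printf '%s %s\n' "$(basename "$f")" "${t:-<unset>}"
    done
}

# Example call:
#   bond_types
```

With the two files shown in this post, the lines for ifcfg-bond0 and ifcfg-bond1 would show different values, which is the discrepancy described above.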

Bingo! I changed the configuration file and rebooted the server, and all components came up fine. CRS, ASM and the database started and are working fine.
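One extra check after a reboot like this, which was not part of the original troubleshooting, is to confirm that the bond actually picked up its slave interfaces. The Linux bonding driver exposes its state under /proc/net/bonding/, and the sketch below lists the slaves it reports. The function name bond_slaves is made up for illustration, and using bond1 as the default is an assumption based on the interconnect bond in this post.

```shell
# Hypothetical helper: list the slave interfaces the bonding driver reports.
bond_slaves() {
    # $1 = bonding status file (default assumes the interconnect bond, bond1)
    status_file="${1:-/proc/net/bonding/bond1}"
    # Each slave appears on a line of the form "Slave Interface: ethN"
    sed -n 's/^Slave Interface: *//p' "$status_file"
}

# Example call:
#   bond_slaves /proc/net/bonding/bond1
```

For the setup described here, one would expect eth2 and eth3 to be listed for bond1.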

I am still trying to find out why it worked fine during the installation and then suddenly stopped working.