Today I faced a strange issue with CRS after a host reboot. CRS was not coming up and we could see the following messages in $ORA_CRS_HOME/log/<hostname>/client/clsc*.log:
cat clsc26.log
Oracle Database 10g CRS Release 10.2.0.4.0 Production
Copyright 1996, 2008 Oracle. All rights reserved.
2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x6945e0) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x695020) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))
It looked like an issue with the socket files, so I removed the /var/tmp/.oracle files (this is a RHEL4 box). I tried starting CRS with ‘crsctl start crs’, but still no socket files were written. /tmp/crsctl*log files were getting generated but they were empty. I spent close to an hour rebooting the host and trying various things (the cleanup sequence is sketched below). Then I decided to run the daemons mentioned in /etc/inittab manually, i.e.
/etc/init.d/init.evmd run
/etc/init.d/init.cssd fatal
/etc/init.d/init.crsd run
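For reference, the cleanup-and-restart sequence I had tried earlier boiled down to something like the following (a sketch, run as root; /var/tmp/.oracle is where the socket files live on this RHEL4 box, and CRS should be fully down before the files are removed):

$ORA_CRS_HOME/bin/crsctl stop crs
rm -f /var/tmp/.oracle/*
$ORA_CRS_HOME/bin/crsctl start crs
ls -l /var/tmp/.oracle     # check whether new socket files appear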
When I ran init.evmd, I got the following errors:
# /etc/init.d/init.evmd run
Startup will be queued to init within 30 seconds.
/home/oracle/.bash_profile: line 6: ulimit: open files: cannot modify limit: Operation not permitted
*** glibc detected *** double free or corruption (fasttop): 0x0000000000688960 ***
-bash: line 1: 17389 Aborted                 /apps/oracle/product/102crs/bin/crsctl check boot >/tmp/crsctl.17085
This pointed to an issue with .bash_profile, so I renamed it to .old and retried the operation. This time it succeeded and CRS also came up fine.
There was an entry for ‘ulimit -n 2048’ in .bash_profile which was causing it. I am not sure yet why ulimit causes this issue; I will try to find out and post the details.
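For what it's worth, the offending entry was the plain limit bump shown below. One possible workaround (my assumption, not the fix actually used here, which was renaming .bash_profile) is to guard the call so that non-interactive shells spawned by the CRS init scripts neither print the ‘cannot modify limit’ error nor abort on it; whether that alone would avoid the glibc double-free in crsctl is something I have not verified:

# original line in ~/.bash_profile
ulimit -n 2048

# guarded variant (sketch): only attempt the bump in interactive shells and ignore failures
if [ -n "$PS1" ]; then
    ulimit -n 2048 2>/dev/null
fi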
Amit, I am really proud of you as my lead in our team. You RoCk.
ulimit is supposed to be set to unlimited. In 11gR1, there was a bug which forced us to write it explicitly in the ohasd file, as it used to ignore the values in /etc/security/limits.conf.
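For reference, the usual way to set this via PAM is with per-user entries in /etc/security/limits.conf along the lines shown below (the 65536 values are only illustrative, not taken from any specific install guide):

# /etc/security/limits.conf
oracle   soft   nofile   65536
oracle   hard   nofile   65536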
Hi Amit,
We had a similar error, but the problem turned out to be different, and I thought of sharing it here.
We recently installed an 11gR1 two-node RAC and all was fine until last week, when we suddenly saw the same error “ORA-29702: error occurred in Cluster Group Service operation”. CRS was not starting; some of the CRS processes were still running and refusing to stop.
root@node1> /u01/app/crs/bin/crsctl stop crs
Stopping resources.
This could take several minutes.
Error while stopping resources. Possible cause: CRSD is down.
Stopping Cluster Synchronization Services.
Unable to communicate with the Cluster Synchronization Services daemon.
The ASM alert log and the database alert log had the following to say:
ASM Alert Log:
Errors in file /u02/app/asm/diag/asm/+asm/+ASM1/trace/+ASM1_lmon_2185.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON (ospid: 2185): terminating the instance due to error 29702
Mon Nov 22 20:02:16 2011
ORA-1092 : opitsk aborting process
Oracle database Alert Log:
ERROR: LMON (ospid: 3721) detects hung instances during IMR reconfiguration
Tue Nov 22 22:10:37 2011
Error: KGXGN polling error (16)
Errors in file /u03/app/oracle/diag/rdbms/ccbdrpd/ccbdrpd1/trace/ccbdrpd1_lmon_3721.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON (ospid: 3721): terminating the instance due to error 29702
Not much info in the trace files.
Looked at the MetaLink note: Diagnosing ‘ORA-29702: error occurred in Cluster Group Service operation’ [ID 848622.1].
But the problems mentioned in it were not applicable to our site.
Looked at the CRS alert log, CRSD logs and CSSD logs; there was heaps of information but nothing useful enough to nail down the issue, and I could not see any error messages.
Also, looked at
RAC ASM instances crash with ORA-29702 when multiple ASM instances start(ed) [ID 733262.1]
It mentioned that when multiple NICs are used for the cluster interconnect and they are not bonded properly, this could cause issues, and that it could be seen in the alert logs.
In our case NIC bonding was done properly. We had configured and bonded the interfaces as below:
• eth0 and eth1 bonded as bond0 – for the public network, and
• eth2 and eth3 bonded as bond1 – for the cluster interconnect
and the alert log showed they were configured fine:
Interface type 1 bond1 192.xxx.x.x configured from OCR for use as a cluster interconnect
Interface type 1 bond0 xx.x.x.x configured from OCR for use as a public interface
If NIC bonding is not done properly, you would see multiple entries for the cluster interconnect in the alert log.
Though this was not the issue in our case, it gave me a lead to the root cause of the problem. Since bonding was mentioned, I wanted to check both the channel bonding interface configurations (ifcfg-bond0 & ifcfg-bond1) and the Ethernet interface configurations (ifcfg-eth0, ifcfg-eth1, ifcfg-eth2 & ifcfg-eth3), as sketched below.
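The comparison itself is just a matter of looking at the files side by side and at the live bonding state, roughly as below (standard RHEL paths, plus Oracle's oifcfg to confirm which interface the clusterware is actually using):

root@node1> diff /etc/sysconfig/network-scripts/ifcfg-bond0 /etc/sysconfig/network-scripts/ifcfg-bond1
root@node1> cat /proc/net/bonding/bond1
root@node1> /u01/app/crs/bin/oifcfg getif

cat /proc/net/bonding/bond1 shows the bonding mode and the state of the eth2/eth3 slaves, and oifcfg getif lists which interface is registered as public and which as cluster_interconnect.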
All configuration files were good except for the ifcfg-bond1 file, whose entries were as below:
root@node1>cat ifcfg-bond1
DEVICE=bond1
IPADDR=xxx.xxx.xx.x
NETMASK=255.xxx.x.x
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
TYPE=ethernet
At first look they seem fine, but when compared to ifcfg-bond0 the problem was obvious. The ifcfg-bond0 entries were as below:
root@node1> cat ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.xxx.x.x
IPADDR=xx.x.x.x
GATEWAY=xx.x.x.x
USERCTL=no
TYPE=BOND
If you look at the TYPE entry, it is “TYPE=ethernet” in ifcfg-bond1 but “TYPE=BOND” in ifcfg-bond0.
Bingo… I changed the configuration file and rebooted the server, and all components came up fine. CRS, ASM and the DB started and are working fine.
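For completeness, the corrected ifcfg-bond1 presumably ended up like the below, with only the TYPE line changed to match ifcfg-bond0 (addresses masked as in the listing above):

root@node1> cat ifcfg-bond1
DEVICE=bond1
IPADDR=xxx.xxx.xx.x
NETMASK=255.xxx.x.x
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
TYPE=BOND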
But I am still trying to find out why it worked fine during the installation and then suddenly stopped working.
Thanks,
Sasi
Amit.. what I had put in earlier looks jumbled.. I guess it's not very difficult to read it, though..
Regards,
Sasi