10.2 CRS startup issue

Today I faced a strange issue with CRS after a host reboot. CRS was not coming up, and we could see the following messages in $ORA_CRS_HOME/log/<hostname>/client/clsc*.log:

cat clsc26.log
Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle.  All rights reserved.
2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x6945e0) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))

2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x695020) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

It looked like an issue with the socket files, so I removed the files under /var/tmp/.oracle (this is a RHEL4 box). I tried starting CRS with ‘crsctl start crs’, but still no socket files were written. /tmp/crsctl*log files were being generated, but they were empty. I spent close to an hour rebooting the host and trying various things. Then I decided to run the daemons listed in /etc/inittab manually, i.e.:

/etc/init.d/init.evmd run
/etc/init.d/init.cssd fatal
/etc/init.d/init.crsd run
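
Before running the scripts by hand, it can help to confirm which CRS entries init is actually configured to respawn. The following is only a sketch: the function name `list_crs_inittab_entries` is made up for illustration, and it assumes the inittab lines reference the init.evmd, init.cssd and init.crsd scripts as shown above. The file path is a parameter so the function can be pointed at a copy of the file.

```shell
# Hypothetical helper: print the inittab lines that launch the CRS daemons.
# $1 = path to an inittab file (defaults to /etc/inittab).
list_crs_inittab_entries() {
    inittab=${1:-/etc/inittab}
    # Match init.evmd, init.cssd or init.crsd, then drop commented-out lines.
    grep -E 'init\.(evmd|cssd|crsd)' "$inittab" | grep -v '^[[:space:]]*#'
}

# Example call (commented out so the snippet has no side effects):
# list_crs_inittab_entries /etc/inittab
```

If an entry is missing or commented out, init will not start that daemon, whatever ‘crsctl start crs’ reports.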

When I ran init.evmd, I got the following errors:

# /etc/init.d/init.evmd run
Startup will be queued to init within 30 seconds.
/home/oracle/.bash_profile: line 6: ulimit: open files: cannot modify limit: Operation not permitted
*** glibc detected *** double free or corruption (fasttop): 0x0000000000688960 ***
-bash: line 1: 17389 Aborted                 /apps/oracle/product/102crs/bin/crsctl check boot >/tmp/crsctl.17085

The output pointed to an issue with the oracle user's .bash_profile, so I renamed it to .old and retried the operation. This time it succeeded, and CRS also came up fine.

There was a ‘ulimit -n 2048’ entry in .bash_profile which was causing it. I do not yet know why ulimit causes this problem; I will try to find out and post the details.
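
One thing that is known about the first message: bash prints "cannot modify limit: Operation not permitted" when an unprivileged shell asks for a limit above its hard limit, and ‘ulimit -n 2048’ without -S or -H tries to set both the soft and the hard limit. This does not explain the glibc abort that followed. As a sketch of a more defensive profile entry, the hypothetical function below (`set_nofile_limit` is an illustrative name, not anything Oracle ships) only touches the soft limit and never asks for more than the hard limit allows:

```shell
# Hypothetical helper: raise the soft open-files limit without exceeding the hard limit.
# $1 = desired soft limit for open files.
set_nofile_limit() {
    wanted=$1
    hard=$(ulimit -Hn)
    soft=$(ulimit -Sn)
    # Nothing to do if the soft limit is already high enough.
    if [ "$soft" = "unlimited" ]; then return 0; fi
    if [ "$soft" -ge "$wanted" ] 2>/dev/null; then return 0; fi
    # A process may always raise its soft limit up to its hard limit.
    if [ "$hard" = "unlimited" ] || [ "$hard" -ge "$wanted" ]; then
        ulimit -Sn "$wanted" 2>/dev/null
    else
        ulimit -Sn "$hard" 2>/dev/null
    fi
}

# Profile usage, replacing a bare 'ulimit -n 2048':
set_nofile_limit 2048
```

Whether this would have avoided the crash on this host is untested; it only removes the "Operation not permitted" path from the profile.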

5 thoughts on “10.2 CRS startup issue”

  • ulimit is supposed to be set to unlimited. In 11gR1, there was a bug that forced us to set it explicitly in the ohasd file, as it used to ignore the values in /etc/security/limits.conf

  • Hi Amit,

    We had a similar error, but the cause was different, so I thought of sharing it here.

    We recently installed an 11gR1 two-node RAC. All was fine until last week, when we suddenly saw the error “ORA-29702: error occurred in Cluster Group Service operation”. CRS was not starting. Some of the CRS processes were running, and CRS was refusing to stop.

    root@node1> /u01/app/crs/bin/crsctl stop crs
    Stopping resources.
    This could take several minutes.
    Error while stopping resources. Possible cause: CRSD is down.
    Stopping Cluster Synchronization Services.
    Unable to communicate with the Cluster Synchronization Services daemon.

    The ASM alert log and the database alert log had the following to say:

    ASM Alert Log:
    Errors in file /u02/app/asm/diag/asm/+asm/+ASM1/trace/+ASM1_lmon_2185.trc:
    ORA-29702: error occurred in Cluster Group Service operation
    LMON (ospid: 2185): terminating the instance due to error 29702
    Mon Nov 22 20:02:16 2011
    ORA-1092 : opitsk aborting process

    Oracle database Alert Log:
    ERROR: LMON (ospid: 3721) detects hung instances during IMR reconfiguration
    Tue Nov 22 22:10:37 2011
    Error: KGXGN polling error (16)
    Errors in file /u03/app/oracle/diag/rdbms/ccbdrpd/ccbdrpd1/trace/ccbdrpd1_lmon_3721.trc:
    ORA-29702: error occurred in Cluster Group Service operation
    LMON (ospid: 3721): terminating the instance due to error 29702

    There was not much info in the trace files.

    I looked at the Metalink note: Diagnosing ‘ORA-29702: error occurred in Cluster Group Service operation’ [ID 848622.1]

    But the problems mentioned in it were not applicable to our site.

    I looked at the CRS alert log, CRSD logs and CSSD logs. There were heaps of information, but nothing useful enough to nail down the issue, and I could not see any error messages.

    I also looked at

    RAC ASM instances crash with ORA-29702 when multiple ASM instances start(ed) [ID 733262.1]

    It mentions that when multiple NICs are used for the cluster interconnect and they are not bonded properly, it can cause issues, and that this can be seen in the alert logs.

    In our case, NIC bonding was done properly. We had configured and bonded the interfaces as below:
    • eth0 and eth1 bonded as bond0 – for public and
    • eth2 and eth3 bonded as bond1 – for cluster interconnect

    and the alert log showed they were configured fine:

    Interface type 1 bond1 192.xxx.x.x configured from OCR for use as a cluster interconnect
    Interface type 1 bond0 xx.x.x.x configured from OCR for use as a public interface

    If NIC bonding is not done properly, you would see multiple entries for the cluster interconnect in the alert log.
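
A quick way to apply that check is to list the distinct interface names the alert log reports as cluster interconnects. This is a sketch that assumes the log lines have exactly the format quoted above, with the interface name in the fourth field; `list_interconnect_interfaces` is a hypothetical name. Because these lines are written again at each instance startup, the function reports distinct names rather than a raw line count.

```shell
# Hypothetical helper: list distinct interfaces logged as cluster interconnects.
# $1 = path to an ASM or database alert log.
list_interconnect_interfaces() {
    # Assumed line format:
    #   Interface type 1 <name> <ip> configured from OCR for use as a cluster interconnect
    grep 'configured from OCR for use as a cluster interconnect' "$1" \
        | awk '{ print $4 }' \
        | sort -u
}

# Example call (commented out; the path is a placeholder):
# list_interconnect_interfaces /path/to/alert_+ASM1.log
```

A single name such as bond1 matches the properly bonded case described here; more than one name would correspond to the multiple-entry symptom.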

    Well, though this was not the issue in our case, it gave me a lead to the root cause of the problem. Since bonding was mentioned, I wanted to check both the channel bonding interface configurations (ifcfg-bond0 & ifcfg-bond1) and the Ethernet interface configurations (ifcfg-eth0, ifcfg-eth1, ifcfg-eth2 & ifcfg-eth3).

    All the configuration files were good except for ifcfg-bond1, whose entries were as below:
    root@node1>cat ifcfg-bond1

    DEVICE=bond1
    IPADDR=xxx.xxx.xx.x
    NETMASK=255.xxx.x.x
    USERCTL=no
    BOOTPROTO=none
    ONBOOT=yes
    TYPE=ethernet

    At first look they seem fine, but when compared to ifcfg-bond0 the problem was obvious. The ifcfg-bond0 entries were as below:

    root@node1> cat ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=none
    ONBOOT=yes
    NETMASK=255.xxx.x.x
    IPADDR=xx.x.x.x
    GATEWAY=xx.x.x.x
    USERCTL=no
    TYPE=BOND

    If you look at the TYPE entry, it is “TYPE=ethernet” in ifcfg-bond1 and “TYPE=BOND” in ifcfg-bond0.

    Bingo… I changed the configuration file and rebooted the server, and all components came up fine. CRS, ASM and the DB started and are working fine.

    But I am still trying to find out why it worked fine during the installation and then suddenly stopped working.
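
The side-by-side comparison described above can be done in one step by printing the TYPE value of every bond configuration file. This is a minimal sketch: `show_bond_types` is a hypothetical name, and the default directory is the usual RHEL location for ifcfg files, which is an assumption about this host.

```shell
# Hypothetical helper: print the TYPE value of every ifcfg-bond* file in a directory.
# $1 = directory holding the ifcfg files (defaults to the usual RHEL location).
show_bond_types() {
    dir=${1:-/etc/sysconfig/network-scripts}
    for f in "$dir"/ifcfg-bond*; do
        # Skip the unexpanded pattern when no file matches.
        [ -f "$f" ] || continue
        # Use the last TYPE= line, since a later assignment wins when the file is sourced.
        value=$(sed -n 's/^TYPE=//p' "$f" | tail -n 1)
        printf '%s TYPE=%s\n' "$(basename "$f")" "${value:-<unset>}"
    done
}

# Example call (commented out so the snippet has no side effects):
# show_bond_types /etc/sysconfig/network-scripts
```

With the two files shown in this comment, the output would make the mismatch between bond0 and bond1 visible at a glance. It only reports the values; it does not decide which one is correct.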

    Thanks,
    Sasi
