CRS

10gR2 Silent Install with 11gr2 CRS fails

I was trying to perform a 10.2 silent install with 11gR2 CRS. While doing pre-checks installer failed with following error

Check complete: Failed <<<<
Problem: The 'active' version of Oracle Clusterware is not 10g Release 2 (10.2).
Recommendation: You must upgrade all nodes of the cluster to Oracle Clusterware 10g Release 2.  If you have upgraded some but not all of the nodes to use the 10g Release 2 version of Oracle Clusterware, then the 'active' version is still 10g Release 1 (10.1)  You must upgrade all nodes in the cluster to Oracle Clusterware 10g Release 2 before installing Oracle 10g Release 2 Real Application Clusters.

I tried “ignoreSysPrereqs” option with runInstaller but it also did not succeed. I checked My Oracle Support (formerly metalink..anyways I still refer to as metalink) and also searched for any known issues, but couldn’t find any document. I could find some issues on OTN but there was no solution. Finally I searched for the file reporting this error in Oracle software staging location.

$% grep -r "version of Oracle Clusterware is not 10g Release 2" *
stage/prereq/db/db_prereq.xml:

This was part of following code( I have removed Angle brackets with Square brackets as wordpress confuses it with html tags)

[PREREQUISITE NAME="Detect10.2CRS"
                EXTERNALNAME="Checking Oracle Clusterware version ..."
                EXTERNALNAMEID="[email protected]"
                SEVERITY="Error"]
        [DESCRIPTION TEXT="This is a prerequisite condition to test if all nodes in the cluster have had the Clusterware upgraded to 10g Release 2 (10.2)."
                TEXTID="S_CHECK_10.2_CRS_DESCRIPTION@oracle.install.prereqs.resources.PrereqRes"/]
        [RULESETREF NAME="CRS102Checks" RULE="CheckFor102CRS" FILE="db/refhost.xml"
                RESULTS_FILE="install_rule_results.xml"/]
        [PROBLEM TEXT="The 'active' version of Oracle Clusterware is not 10g Release 2 (10.2)."
                TEXTID="S_CHECK_10.2_CRS_ERROR@oracle.install.prereqs.resources.PrereqRes"]
        [/PROBLEM]

Checking “Detect10.2CRS” in My Oracle Support, got exact hit

Silent Install 10.2.0.1 Database Fails When Cluster Is 11.1.0.6 [ID 755345.1]

As per note, we need to change the following lines in (software location)\stage\prereq\db\db_prereq.xml file ( I have removed Angle brackets with Square brackets as wordpress confuses it with html tags)

[PREREQUISITESET NAME="clusterTests"]
[PREREQUISITEREF NAME="Detect10.2CRS" SEVERITY="Error"/]
[/PREREQUISITESET]

to :

[PREREQUISITESET NAME="clusterTests"]
[/PREREQUISITESET]

You would be required to do same change for similar file to any 10g patchset on top of it. In case of 10.2.0.4 patch I found it under (software_location)/stage/prereq/patch_prereqs.xml
Searching on the error messages in My Oracle Support did not return above document. Anyways documenting it so that Search engines can report it faster. Note that to use 10g DB software with 11gR2 CRS, you will have to pin the nodes

$GRID_HOME/bin/crsctl pin css -n node1 node2

olsnodes -t will report current status of the nodes i.e whether pinned or not

10.2 CRS startup issue

Today I faced a strange issue with CRS  post host reboot. CRS was not coming up and we could see following message in $ORA_CRS_HOME/log/<hostname>/client/clsc*.log

cat clsc26.log
Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle.  All rights reserved.
2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x6945e0) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))

2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x695020) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

It looked like like issue with socket files, so I removed /var/tmp/.oracle files (This is RHEL4 box). Tried starting crs with ‘crsctl start crs’ and still no socket files were written. /tmp/crsctl*log files were getting generated but they were empty. I spent close to 1 hour rebooting host and trying various stuff. Then I decided to run the daemons mentioned in /etc/inittab  manually i.e

/etc/init.d/init.evmd run
/etc/init.d/init.cssd fatal
/etc/init.d/init.crsd run

When I ran init.evmd I got following errors

# /etc/init.d/init.evmd run
Startup will be queued to init within 30 seconds.
/home/oracle/.bash_profile: line 6: ulimit: open files: cannot modify limit: Operation not permitted
*** glibc detected *** double free or corruption (fasttop): 0x0000000000688960 ***
-bash: line 1: 17389 Aborted                 /apps/oracle/product/102crs/bin/crsctl check boot >/tmp/crsctl.17085

It pointed to issue with .bash_profile so I renamed it to .old and retried the operation. This time it succeeded and crs also came up fine.

There was entry for ulimit -n 2048 in .bash_profile which was causing it. I am not aware why ulimit is causing issue, will try to find it and post details

SRVCTL fails to start RAC resources:CRS-0215

After upgrading RAC database to 10204 and applying CRS bundle patch-1 for 10204 crs home,
srvctl command fails to startup resources on rac nodes. While starting up RAC resources using SRVCTL
following error occurs in CRSD.log file:

$ srvctl start instance -d rac -i rac2

2009-04-09 13:45:22.091: [  CRSRES][2611477408][ALERT]0`ora...inst` on member `` has experienced an unrecoverable failure.
2009-04-09 13:45:22.091: [  CRSRES][2611477408]0Human intervention required to resume its availability.
2009-04-09 13:46:25.162: [  CRSRES][2611477408]0StopResource: setting CLI values
2009-04-09 13:46:25.174: [  CRSRES][2611477408]0Attempting to stop `ora...inst` on member ``
2009-04-09 13:46:25.206: [  CRSAPP][2611477408]0StopResource error for ora...inst error code = 1

To debug SRVCTL SRVM_TRACE is set to true and a Strace is taken at OS level:

$script /tmp/srvm.log
$export SRVM_TRACE=TRUE
$srvctl start instance -d  -i
$exit

It will genertae a trace file at /tmp/srvm.log.

$ strace -aef -o /tmp/strace.log srvctl start instance -d -i

It will generate a trace file at /tmp/strace.log

— srvm.log shows follwoing error:

[Thread-2] [11:57:59:774] [StreamReader.run:65]  OUTPUT>Attempting to start `ora.rac.rac2.inst` on member `node11`
[Thread-2] [11:58:0:862] [StreamReader.run:65]  OUTPUT>`ora.rac.rac2.inst` on member `node11` has experienced an unrecoverable failure.
[Thread-2] [11:58:0:862] [StreamReader.run:65]  OUTPUT>Human intervention required to resume its availability.
[Thread-2] [11:58:0:863] [StreamReader.run:65]  OUTPUT>nloz11:ora.rac.rac2.inst:/oac/app/oracle/product/10.2.0/db_1/bin/racgwrap: line 62: fg: no job control
[Thread-3] [11:58:0:865] [StreamReader.run:65]  ERROR>CRS-0215: Resource ora.rac.rac2.inst cannot be started.
[Thread-3] [11:58:0:865] [StreamReader.run:65]  ERROR>
[Worker 0] [11:58:0:865] [RuntimeExec.runCommand:133]  runCommand: process returns 115

— strace.log file shows the following:

rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8 ) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x8075d8b, [], SA_RESTORER, 0xb7ee5908}, {SIG_IGN}, 8 ) = 0
waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 2}], 0) = 18699
rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(-1, 0xbfffe9bc, WNOHANG) = -1 ECHILD (No child processes)
sigreturn() = ? (mask now [])
rt_sigaction(SIGINT, {SIG_IGN}, {0x8075d8b, [], SA_RESTORER, 0xb7ee5908}, 8 ) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8 ) = 0
read(255, "exit $?\n", 6261) = 8
rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
exit_group(2) = ?

The SRVM trace showed that there is a problem with racgwrap script at line 62 which indicates the following:

$ORACLE_HOME/bin/racgmain “$@”

Could not found much with this line, but from the begning i.e line 1 the entry for ORACLE_HOME was missing.

ORACLE_HOME=<%ORACLE_HOME%>
export ORACLE_HOME
— Added the correct oracle_home location at this place.

Also, after checking the srvctl file for the db_home the “OHOME” and “CHOME” entries were missing:
— Added the correct entries for OHOME and CHOME ( copied the entries from the node where srvctl was working fine)

After making these two changes SRVCTL worked fine.

Cheers!!!!
Saurabh Sood

CRS Fails to Start – 10.2.0.1 RAC Install on AIX

I was installing 10.2.0.1 on IBM AIX 5L and while running root.sh from first node (as part of Clusterware installation) got following messages

Now formatting voting device: /dev/voting_disk01
Now formatting voting device: /dev/voting_disk02
Now formatting voting device: /dev/voting_disk03
Format of 3 voting devices complete.
Startup will be queued to init within 30 seconds

I waited for quite some time and found that it was stuck. To check what was status of CSS, I did a grep for CSS and found that it was running /etc/init.cssd startcheck css script. This indicated that Oracle was stuck trying to start CSS. Following errors were recorded in /tmp/crsct.7459

Failure in CSS initialization opening OCR.

Metalink notes suggested checking OCR Disk permission , though in my case they had correct permissions i.e ownership as oracle:dba and permission set to 660. To diagnose further, I checked $ORA_CRS_HOME/log to check for errors. All the logfiles related to CRS,CSS and EVMD are stored in $ORA_CRS_HOME/log/<hostname>.

/oracle/crs_base/app/product/crs10gR2/log>ls -ltr
total 0
drwxrwx---    2 oracle   dba             256 Jan 28 18:46 crs
drwx------    3 root     system          256 Jan 28 18:53 chd0196
drwxr-xr-t    8 root     dba             256 Jan 28 18:53 rac01

Hostname for the server was rac01 and not chd0196. This was a new server and also directories could not be present earlier as it was a fresh installation.  Oracle was picking two hostname which was quite strange. I checked for HACMP filesets  and found that they were present

# lslpp -l |grep -i hacmp
  rsct.basic.hacmp           2.4.9.0  COMMITTED  RSCT Basic Function (HACMP/ES
  rsct.compat.basic.hacmp    2.4.9.0  COMMITTED  RSCT Event Management Basic
                                                 Function (HACMP/ES Support)
  rsct.compat.clients.hacmp  2.4.9.0  COMMITTED  RSCT Event Management Client
                                                 Function (HACMP/ES Support)

10g RAC does not require Vendor clusterware as Oracle provides it own clusterware called “Oracle Clusterware”.We got these packages un-installed and got both server rebooted. After cleaning up RAC installation, we restarted installation . You can use  Metalink Note 239998.1 – 10g RAC: How to Clean Up After a Failed CRS Install for cleanup.  On re-running root.sh installation, installation went fine.

Creating Oracle Extended RAC on Oracle VM

Yesterday, I found one very useful article at OTN “Creating Oracle Extended RAC” on completely virtual environment using Oracle VM. As Virtualization is becoming popular day by day and is very cost effective, one must know how to use this to simulate actual environments. Click  here for details on Oracle Extended RAC on Oracle VM.

Verification of CRS Integrity Was Unsuccessful

While going through the routine checks from Grid Control, I found a critical alert stating “clusterware integrity check failed” and by clicking on this message it says that there is problem with some metric collections on RAC environment.

To check the node reachability status following query was run:

$ $CRS_HOME/bin/cluvfy comp nodecon -n all

This will check the internode connectivity for all nodes in the cluster. It came out with following message:

$ $CRS_HOME/bin/cluvfy comp nodecon -n all
Verifying node connectivity
Verification of node connectivity was unsuccessful on all the nodes.

Even the CRS component check was unsuccessful:

$ $CRS_HOME/bin/cluvfy comp crs -n all

It came out with the following message:

$ $CRS_HOME/bin/cluvfy comp crs -n all
Verifying CRS integrity
Verification of CRS integrity was unsuccessful on all the nodes.

After this it was quite obvious to check the CRS status:

$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
$crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.orcl.db    application    ONLINE    ONLINE    rac1
ora....11.inst application    ONLINE    ONLINE    rac1
ora....12.inst application    ONLINE    ONLINE    rac2
ora....vice.cs application    ONLINE    ONLINE    rac2
ora....l1.srv application    ONLINE    ONLINE    rac1
ora....l1.srv application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....DC.lsnr application    ONLINE    ONLINE    rac1
ora....idc.gsd application    ONLINE    ONLINE    rac1
ora....idc.ons application    ONLINE    ONLINE    rac1
ora....idc.vip application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora....dc2.gsd application    ONLINE    ONLINE    rac2
ora....dc2.ons application    ONLINE    ONLINE    rac2
ora....dc2.vip application    ONLINE    ONLINE    rac2
$$CRS_HOME/bin/olsnodes
rac1
rac2

This confirmed that the CRS install is valid, but the question now is why the cluster verification utility (CVU) was failing?

To find the reason I enabled the tracing of CVU as:

$export SRVM_TRACE=true

It will set the environment variable SRVM_TRACE to true and tracing of CVU will generate a trace file under $CRS_HOME/cv/log with name like “cvutrace.log.X”

After setting this and again running $CRS_HOME/bin/cluvfy comp crs -n all trace file with name cvutrace.log.0 was generated.

And a message in cvutrace.log like

<strong>"ksh: CVU_10.2.0.2_dba/exectask.sh: cannot execute"</strong>

Now its is clear that oracle is not able to execute exectask.sh and cheking the permission and ownership of exectask.sh:

$CRS_HOME/cv/remenv
ls -ltr
-rw-r--r--  1 oracle dba    184 Jan  9  2008 exectask.sh
-rw-r--r--  1 oracle dba 268386 Jan  9  2008 exectask

The permission of these two files was changed. After changing the permission back to 755 CUV was showing correct results.

$chmod 755 exectask*

It is still not discovered how the permission of these files got changed.