RAC

Cluvfy Reports CRS not Installed on Nodes

Before I start my post, I would like to wish all our readers a “Happy New Year”.

While reviewing a 10gR2 RAC configuration, I hit the following error when invoking the cluvfy utility:

$ ./cluvfy stage -post crsinst -n prod01,prod02 -verbose

Performing post-checks for cluster services setup 

Checking node reachability...

Check: Node reachability from node "prod01-fe"
  Destination Node                      Reachable?
  ------------------------------------  ------------------------
  prod01                              yes
  prod02                              yes
Result: Node reachability check passed from node "prod01-fe".

Checking user equivalence...

Check: User equivalence for user "oracle"
  Node Name                             Comment
  ------------------------------------  ------------------------
  prod02                              passed
  prod01                              passed
Result: User equivalence check passed for user "oracle".

ERROR:
CRS is not installed on any of the nodes.
Verification cannot proceed.

Post-check for cluster services setup was unsuccessful on all the nodes.

Cluvfy reported that Clusterware was not installed on the servers. This was strange and unexpected, as a RAC database was already running on these nodes and crs_stat reported the status of all CRS resources. ocrcheck did not report any problems with the OCR either. To dig further, I checked the cluvfy logs under $ORA_CRS_HOME/cv/log, which had recorded the following errors:

[main] [15:52:41:800] [OUIData.readInventoryData:393]  ==== CRS home added: Oracle home properties:
Name     : OraCRS
Type     : CRS-HOME
Location : /app/oracle/product/10.2/crs
Node list: [prod01-fe, prod02-fe]
; Thu Dec 24 15:52:41 GMT+08:00 2009
[main] [15:52:41:800] [OUIData.readInventoryData:401]  ==== ORACLE home added: Oracle home properties:
Name     : OraHome
Type     : ORACLE-HOME
Location : /app/oracle/product/10.2/db_1
Node list: [prod01, prod02]
; Thu Dec 24 15:52:41 GMT+08:00 2009
[main] [15:52:41:800] [VerificationUtil.isCRSInstalled:1262]  CRS wasn't found installed  on node: prod02; Thu Dec 24 15:52:41 GMT+08:00 2009
[main] [15:52:41:800] [VerificationUtil.isCRSInstalled:1262]  CRS wasn't found installed  on node: prod01; Thu Dec 24 15:52:41 GMT+08:00 2009

Notice that the node list recorded for the CRS home is [prod01-fe, prod02-fe], while for the database home it is [prod01, prod02]. The actual hostnames of the nodes are prod01 and prod02; prod01-fe and prod02-fe are aliases for the two nodes.

Cluvfy uses the central Oracle inventory to obtain Oracle home information. Checking $ORACLE_BASE/oraInventory/ContentsXML/inventory.xml confirmed that the nodes for the CRS home were stored as prod01-fe and prod02-fe. (I have changed the angle brackets to round brackets because HTML treats them as tags. Don’t worry, your inventory file is not corrupted 🙂)

(INVENTORY)
(VERSION_INFO)
   (SAVED_WITH)10.2.0.1.0(/SAVED_WITH)
   (MINIMUM_VER)2.1.0.6.0(/MINIMUM_VER)
(/VERSION_INFO)
(HOME_LIST)
(HOME NAME="OraCRS" LOC="/app/oracle/product/10.2/crs" TYPE="O" IDX="1" CRS="true")
   (NODE_LIST)
      (NODE NAME="prod01-fe"/)
      (NODE NAME="prod02-fe"/)
   (/NODE_LIST)
(/HOME)
(HOME NAME="OraHome" LOC="/app/oracle/product/10.2/db_1" TYPE="O" IDX="2")
   (NODE_LIST)
      (NODE NAME="prod01"/)
      (NODE NAME="prod02"/)
   (/NODE_LIST)
(/HOME)
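As a rough illustration, the node list recorded for a given home can be pulled out of inventory.xml with a small shell function. This is only a sketch under a few assumptions: the function name inventory_nodes is made up for this example, it expects the real file (which uses angle brackets, not the round brackets shown above), and it assumes each HOME and NODE tag sits on its own line as in the listing.

```shell
# Print the node names recorded for one home in an OUI inventory.xml.
# Hypothetical helper; assumes one HOME / NODE tag per line.
inventory_nodes() {
    # $1 = path to inventory.xml, $2 = home name (for example OraCRS)
    awk -v home="$2" '
        /<HOME /   { in_home = (index($0, "NAME=\"" home "\"") > 0) }
        /<\/HOME>/ { in_home = 0 }
        in_home && /<NODE NAME=/ {
            split($0, parts, "\"")   # parts[2] is the text inside the first pair of quotes
            print parts[2]
        }
    ' "$1"
}

# Example:
#   inventory_nodes "$ORACLE_BASE/oraInventory/ContentsXML/inventory.xml" OraCRS
```

Comparing this output for the CRS home with the output of olsnodes should expose the kind of mismatch described above.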

There are a few reported issues on Metalink, but they relate to CRS="true" not being present in the CRS home entry. There is one more reported issue, listed here by Surachart, which was caused by incorrect permissions on the /etc/oraInst.loc file.

To correct this problem, we had to run the runInstaller -updateNodeList command to record the correct nodes in the inventory file. Even though some Metalink notes recommend changing this file manually, I would recommend using runInstaller to update the inventory.
Before executing the command, I ran olsnodes to confirm that the OCR stored prod01 and prod02 as the node names.

runInstaller -updateNodeList -silent "CLUSTER_NODES={prod01,prod02}" ORACLE_HOME="/app/oracle/product/10.2/crs" ORACLE_HOME_NAME="OraCRS" LOCAL_NODE="prod01" CRS=true
Starting Oracle Universal Installer...

No pre-requisite checks found in oraparam.ini, no system pre-requisite checks will be executed.
The inventory pointer is located at /var/opt/oracle/oraInst.loc
The inventory is located at /app/oracle/oraInventory
'UpdateNodeList' was successful.

After this, cluvfy worked fine. I do not know why the incorrect node names were recorded in the Oracle inventory, but the above solution should take care of the problem.
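For reference, the CLUSTER_NODES argument passed to runInstaller above can be built from a list of node names, one per line. The sketch below assumes that this is the format olsnodes prints when run without options; cluster_nodes_arg is a hypothetical helper name, not part of any Oracle tool.

```shell
# Read node names on stdin (one per line) and print the CLUSTER_NODES
# argument in the form used by runInstaller -updateNodeList.
cluster_nodes_arg() {
    # paste -s joins all input lines into one; -d, uses a comma as the separator.
    printf 'CLUSTER_NODES={%s}\n' "$(paste -sd, -)"
}

# Example (on a cluster node):
#   olsnodes | cluster_nodes_arg
```

Building the list from olsnodes rather than typing it keeps the inventory aligned with what the OCR actually holds.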

Link: 11gR2 RAC installation steps on OEL4

If you are looking for 11gR2 RAC installation steps, you can refer to this article by Rajeev Ramdas at Dbastreet.com. The article lists the steps for installing 11gR2 RAC on 64-bit Oracle Enterprise Linux 4 (OEL4) using ASM for storage. As raw devices are no longer supported, the OCR and voting disks are also stored on ASM. Yes, this is one more cool new feature available in 11gR2.

OCFS2 Configuration Issue

I was setting up OCFS2 for OCR and voting disk storage with the following command:
# ocfs2console

After clicking Cluster ==> Configure Nodes, I got a pop-up saying:

"Could not start cluster stack. This must be resolved before any OCFS2 filesystem can be mounted."

I soon realized that something which normally takes a few minutes to install was going to give me a tough time.

/var/log/messages showed the following details:

Aug 17 14:53:40 rac1 modprobe: FATAL: Module configfs not found.
Aug 17 14:55:23 rac1 modprobe: FATAL: Module configfs not found.
Aug 17 14:56:56 rac1 modprobe: FATAL: Module configfs not found.

This prevents the OCFS2 cluster stack from being configured, and the OCFS2 cluster stack, O2CB, must be running before we can do anything with an OCFS2 filesystem.

The stack includes the following services:

    * NM: Node Manager that keeps track of all the nodes in the cluster.conf
    * HB: Heartbeat service that issues up/down notifications when nodes join or leave the cluster
    * TCP: Handles communication between the nodes
    * DLM: Distributed lock manager that keeps track of all locks, their owners and status
    * CONFIGFS: User space driven configuration file system mounted at /config
    * DLMFS: User space interface to the kernel space DLM

The error "modprobe: FATAL: Module configfs not found" can occur for the following reasons:

1. SELINUX is enabled.
2. Mismatch between the Kernel and OCFS2 module.

1. To check for selinux:
# sestatus
Or
# vi /etc/sysconfig/selinux

Make sure that selinux is DISABLED here.

2. To check for Mismatch:
# uname -a (It will give the exact kernel version of the OS)
2.6.9-42.ELsmp
# rpm -qa |grep ocfs2 (It will tell us the ocfs2 package currently installed)
ocfs2-2.6.9-89.EL

Here the installed ocfs2 module package is built for kernel 2.6.9-89.EL, not for the running kernel 2.6.9-42.ELsmp.
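The comparison above can be scripted. The following is a minimal sketch, assuming the ocfs2 module package name embeds the kernel release right after the ocfs2- prefix, as in the rpm output above. The function name ocfs2_pkg_matches_kernel is hypothetical.

```shell
# Return 0 if the ocfs2 module package name was built for the given
# kernel release, 1 otherwise.
ocfs2_pkg_matches_kernel() {
    # $1 = package name as printed by rpm -qa, $2 = kernel release (uname -r)
    case "$1" in
        # Exact match, or the kernel release followed by "-<ocfs2 version>".
        "ocfs2-$2"|"ocfs2-$2"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   ocfs2_pkg_matches_kernel "ocfs2-2.6.9-89.EL" "$(uname -r)" \
#       || echo "ocfs2 module package does not match the running kernel"
```

Requiring a "-" (or the end of the name) after the kernel release keeps a package built for an smp kernel from being accepted for the non-smp kernel of the same version.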

So I downloaded the correct OCFS2 kernel modules from:
http://oss.oracle.com/projects/ocfs2/files/
and the tools from
http://oss.oracle.com/projects/ocfs2-tools/files/

After installing the correct modules and disabling SELinux, I got the cluster stack running.


Link: Best practices for AIX RAC Database on OTN

A new whitepaper has been published on OTN. It primarily discusses best practices for running RAC databases on AIX 5.2, 5.3 and 6.1 to avoid node evictions caused by oprocd. It also lists recommended AIX VMO parameters and recommended patches for Oracle RAC.

You can find the whitepaper at the location below:

http://www.oracle.com/technology/products/database/clusterware/pdf/rac_aix_system_stability.pdf

SRVCTL fails to start RAC resources: CRS-0215

After upgrading a RAC database to 10.2.0.4 and applying CRS bundle patch 1 to the 10.2.0.4 CRS home, srvctl failed to start resources on the RAC nodes. When starting RAC resources with srvctl, the following errors appeared in the crsd.log file:

$ srvctl start instance -d rac -i rac2

2009-04-09 13:45:22.091: [  CRSRES][2611477408][ALERT]0`ora...inst` on member `` has experienced an unrecoverable failure.
2009-04-09 13:45:22.091: [  CRSRES][2611477408]0Human intervention required to resume its availability.
2009-04-09 13:46:25.162: [  CRSRES][2611477408]0StopResource: setting CLI values
2009-04-09 13:46:25.174: [  CRSRES][2611477408]0Attempting to stop `ora...inst` on member ``
2009-04-09 13:46:25.206: [  CRSAPP][2611477408]0StopResource error for ora...inst error code = 1

To debug srvctl, I set SRVM_TRACE to TRUE and also took an strace at the OS level:

$ script /tmp/srvm.log
$ export SRVM_TRACE=TRUE
$ srvctl start instance -d  -i
$ exit

It will generate a trace file at /tmp/srvm.log.

$ strace -aef -o /tmp/strace.log srvctl start instance -d -i

It will generate a trace file at /tmp/strace.log

srvm.log shows the following error:

[Thread-2] [11:57:59:774] [StreamReader.run:65]  OUTPUT>Attempting to start `ora.rac.rac2.inst` on member `node11`
[Thread-2] [11:58:0:862] [StreamReader.run:65]  OUTPUT>`ora.rac.rac2.inst` on member `node11` has experienced an unrecoverable failure.
[Thread-2] [11:58:0:862] [StreamReader.run:65]  OUTPUT>Human intervention required to resume its availability.
[Thread-2] [11:58:0:863] [StreamReader.run:65]  OUTPUT>nloz11:ora.rac.rac2.inst:/oac/app/oracle/product/10.2.0/db_1/bin/racgwrap: line 62: fg: no job control
[Thread-3] [11:58:0:865] [StreamReader.run:65]  ERROR>CRS-0215: Resource ora.rac.rac2.inst cannot be started.
[Thread-3] [11:58:0:865] [StreamReader.run:65]  ERROR>
[Worker 0] [11:58:0:865] [RuntimeExec.runCommand:133]  runCommand: process returns 115

strace.log shows the following:

rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8 ) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x8075d8b, [], SA_RESTORER, 0xb7ee5908}, {SIG_IGN}, 8 ) = 0
waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 2}], 0) = 18699
rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(-1, 0xbfffe9bc, WNOHANG) = -1 ECHILD (No child processes)
sigreturn() = ? (mask now [])
rt_sigaction(SIGINT, {SIG_IGN}, {0x8075d8b, [], SA_RESTORER, 0xb7ee5908}, 8 ) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8 ) = 0
read(255, "exit $?\n", 6261) = 8
rt_sigprocmask(SIG_SETMASK, [], NULL, 8 ) = 0
exit_group(2) = ?

The SRVM trace pointed to a problem at line 62 of the racgwrap script, which reads:

$ORACLE_HOME/bin/racgmain "$@"

I could not find anything wrong with this line itself, but at the beginning of the script the value for ORACLE_HOME was missing:

ORACLE_HOME=<%ORACLE_HOME%>
export ORACLE_HOME
I added the correct ORACLE_HOME location here.

Also, on checking the srvctl script in the database home, I found that the OHOME and CHOME entries were missing. I added the correct entries for OHOME and CHOME (copied from the node where srvctl was working fine).

After making these two changes, srvctl worked fine.
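A check along these lines could flag the problem before srvctl is run. This is a minimal sketch, assuming the entries appear in the scripts as plain VAR=/absolute/path assignments (optionally prefixed with export). The function name missing_home_vars is hypothetical.

```shell
# Print the name of each variable that has no assignment to an absolute
# path in the given script.
missing_home_vars() {
    script=$1
    shift
    for var in "$@"; do
        # Matches lines such as  VAR=/path  or  export VAR="/path"
        grep -Eq "^[[:space:]]*(export[[:space:]]+)?$var=[\"']?/" "$script" \
            || echo "$var"
    done
}

# Examples:
#   missing_home_vars "$ORACLE_HOME/bin/racgwrap" ORACLE_HOME
#   missing_home_vars "$ORACLE_HOME/bin/srvctl" OHOME CHOME
```

Running it against the same scripts on a node where srvctl works gives a baseline to compare with.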

Cheers!!!!
Saurabh Sood

OUI-67124 – Copy failed from ‘location 1’ to ‘location 2’

Just a short note on a problem I faced while applying the January CPU patch to Clusterware on AIX 5L. I was getting the following errors:

UtilSession failed: ApplySession failed in system modification phase... 'ApplySession::apply failed: Copy failed from '/archive/oracle/soft/Patch/6980307/6756433/files/lib/libhasgen10.so' to '/oracle/crs_base/app/product/crs10gR2/lib/libhasgen10.so'...
Copy failed from '/archive/oracle/soft/Patch/6980307/6756433/files/lib/libocr10.so' to '/oracle/crs_base/app/product/crs10gR2/lib/libocr10.so'...
Copy failed from '/archive/oracle/soft/Patch/6980307/6756433/files/lib/libocrb10.so' to '/oracle/crs_base/app/product/crs10gR2/lib/libocrb10.so'...
Copy failed from '/archive/oracle/soft/Patch/6980307/6756433/files/lib/libocrutl10.so' to '/oracle/crs_base/app/product/crs10gR2/lib/libocrutl10.so'..

I had followed all the prerequisites for this patch installation, i.e.:

1) Stopped the database instance and ASM instance on the node

2) Stopped the nodeapps services

3) Stopped the clusterware

4) Executed /usr/sbin/slibclean as root

I searched Metalink and found a note recommending renaming the files and retrying the patching process. Another suggestion was to copy the files manually. I wanted to debug the issue (and to have a clean installation), so I checked for processes being run by the ‘oracle’ user and found that a listener was running:

oracle 1982506       1   0 00:30:13      -  0:00 /oracle/ora_base/app/product/db10gR2/bin/tnslsnr LISTENER_TAF_PRODDB1 -inherit

This listener had been created manually (not using netca) and was not registered in the OCR. As a result, it did not stop when we stopped the nodeapps services. I stopped the listener, executed /usr/sbin/slibclean (as root) and re-initiated the patching process. This time it went fine.

An easier way would have been to use the ‘fuser’ command to identify the PIDs of the processes accessing the files.

In the end I realized that, before applying a patch, it is better to check whether any instance, listener or other process (including RMAN, sqlplus or SQL*Loader) is running from the Oracle home being patched, even if you have followed all the steps mentioned in the patch readme.
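Such a pre-patch check might look like the sketch below. It simply filters ps -ef style output for command lines that mention a path under the home being patched. The function name procs_under_home is hypothetical. It also only catches processes whose full path is visible in the process list, so a tool started through PATH could be missed.

```shell
# Read `ps -ef` style output on stdin and print only the lines that
# mention a path under the given Oracle home.
procs_under_home() {
    home=${1%/}            # drop a trailing slash, if any
    grep -F -- "$home/"    # -F: treat the path as a fixed string, not a regex
}

# Example (the filter's own grep may appear in the list as well):
#   ps -ef | procs_under_home /oracle/ora_base/app/product/db10gR2
```

An empty result is a good sign but not proof that the home is idle, for the reason given above.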

Cheers

Amit