How To Recover From Corrupted OCR Disk

It is very common where a DBA is left with corrupted OCR disk without having any good backup.
I ran into this situation a few days ago. One node of the RAC database showed the following:

NODE1:

<span style="font-family: arial,helvetica,sans-serif;"><strong>$ORA_CRS_HOME/bin/crs_stat -t
</strong>Name           Type           Target    State     Host
------------------------------------------------------------
ora.orcl.db    application    ONLINE    ONLINE    rac1
ora....11.inst application    ONLINE    ONLINE    rac1
ora....12.inst application    ONLINE    OFFLINE
ora....vice.cs application    OFFLINE   OFFLINE
ora....l11.srv application    ONLINE    OFFLINE
ora....l12.srv application    ONLINE    OFFLINE
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....DC.lsnr application    ONLINE    ONLINE    rac1
ora....abc.gsd application    ONLINE    ONLINE    rac1
ora....abc.ons application    ONLINE    ONLINE    rac1
ora....abc.vip application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora....bc2.gsd application    ONLINE    ONLINE    rac2
ora....bc2.ons application    ONLINE    ONLINE    rac2
ora....bc2.vip application    ONLINE    ONLINE    rac2</span>
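
A quick way to spot problem resources in output like the above is to filter for rows whose Target and State columns differ. The sketch below assumes the column layout shown here (Name, Type, Target, State, optional Host); the function name `mismatched_resources` is made up for this example.

```shell
# List resources whose target and state differ in `crs_stat -t` output.
# Assumes the column layout shown above: Name Type Target State [Host].
mismatched_resources() {
    awk '$1 != "Name" && NF >= 4 && $3 != $4 {
        print $1, "target=" $3, "state=" $4
    }'
}

# Usage (on a cluster node):
#   $ORA_CRS_HOME/bin/crs_stat -t | mismatched_resources
```

The header row is skipped by name and the dashed separator is skipped because it has only one field, so only resource rows are compared.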

The other node shows the following:
NODE2:

<span style="font-family: arial,helvetica,sans-serif;"><strong>/crs_stat -t</strong>
HA Resource                                   Target     State
-----------                                   ------     -----
ora.orcl.db                                   OFFLINE    OFFLINE
ora.orcl.orcl11.inst                          OFFLINE    OFFLINE
ora.orcl.orcl12.inst                          OFFLINE    OFFLINE
ora.orcl.test_service.cs                      ONLINE     OFFLINE
ora.orcl.test_service.orcl11.srv              OFFLINE    OFFLINE
ora.orcl.test_service.orcl12.srv              OFFLINE    OFFLINE
ora.rac1.ASM1.asm                             OFFLINE    OFFLINE
ora.rac1.LISTENER_RAC1.lsnr                   OFFLINE    OFFLINE
ora.rac1.gsd                                  OFFLINE    OFFLINE
ora.rac1.ons                                  OFFLINE    OFFLINE
ora.rac1.vip                                  OFFLINE    OFFLINE
ora.rac2.ASM2.asm                             OFFLINE    OFFLINE
ora.rac2.LISTENER_RAC2.lsnr                   ONLINE     OFFLINE
ora.rac2.gsd                                  ONLINE     OFFLINE
ora.rac2.ons                                  ONLINE     OFFLINE
ora.rac2.vip                                  ONLINE     OFFLINE</span>

The two nodes of the RAC report inconsistent resource states, and every srvctl and crsctl command hung on node 2.
The usual fix is to restore the OCR from a backup. If no OCR backup is available, the following procedure can be used to recover from the corrupted OCR disk.
(These operations require complete downtime.)
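
Before taking the destructive route, it is worth confirming that no automatic backup survives. Clusterware keeps its automatic OCR backups under `$ORA_CRS_HOME/cdata/<cluster_name>/`, and `ocrconfig -showbackup` lists them. The sketch below only looks for the newest `*.ocr` file in a directory and leaves the restore decision to you; the function name `latest_ocr_backup` and the `cdata/crs` path in the usage note are assumptions for illustration.

```shell
# Print the most recently modified *.ocr file in the given directory,
# or return 1 if there is none. The directory is an assumption; adjust
# it to wherever your cluster keeps its automatic OCR backups.
latest_ocr_backup() {
    dir=$1
    newest=$(ls -t "$dir"/*.ocr 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# Usage (as root, with the CRS stack down on all nodes):
#   if backup=$(latest_ocr_backup "$ORA_CRS_HOME/cdata/crs"); then
#       echo "Try first: ocrconfig -restore $backup"
#   else
#       echo "No automatic backup found; rebuilding is the remaining option"
#   fi
```

If a backup is found, `ocrconfig -restore` is far less invasive than the rebuild described below.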

1. Check the status of CRS from node 1:

# ps -eaf |grep d.bin
root 12873 1 0 Aug11 ? 00:11:07 /u01/app/crs/bin/crsd.bin reboot
oracle 13105 12846 0 Aug11 ? 00:00:45 /u01/app/crs/bin/evmd.bin
oracle 13226 13200 0 Aug11 ? 00:13:13 /u01/app/crs/bin/ocssd.bin
root 21458 19986 0 20:34 pts/4 00:00:00 grep d.bin

2. Shut down Oracle Clusterware on all nodes:

<span style="font-family: arial,helvetica,sans-serif;">[root@rac1  bin]# ./crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.</span>

Check the status again:

[root@rac1 bin]# ps -eaf |grep d.bin
root 21927 19986 0 20:34 pts/4 00:00:00 grep d.bin

It shows that the cluster is stopped.
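
The same check can be scripted so that the later steps refuse to run while any daemon is still alive. This sketch filters `ps` output for the three daemon binaries seen above (`crsd.bin`, `evmd.bin`, `ocssd.bin`); the function names are illustrative, not part of Clusterware.

```shell
# Read `ps -eaf` output on stdin and print the Clusterware daemons found.
# Matching the full binary names avoids counting the grep process itself.
running_crs_daemons() {
    grep -oE '(crsd|evmd|ocssd)\.bin' | sort -u
}

# Succeed only when no Clusterware daemon appears in the process list.
crs_is_down() {
    [ -z "$(ps -eaf | running_crs_daemons)" ]
}

# Usage:
#   crs_is_down && echo "CRS stack is stopped" || echo "CRS still running"
```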

3. Execute rootdelete.sh from all nodes.

The script is located at $ORA_CRS_HOME/install/rootdelete.sh

NODE1:

<span style="font-family: arial,helvetica,sans-serif;">[root@rac1  install]# <strong>./rootdelete.sh</strong>
Shutting down Oracle Cluster Ready Services (CRS):
Stopping resources.
Error while stopping resources. Possible cause: CRSD is down.
Stopping CSSD.
Unable to communicate with the CSS daemon.
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'</span>

NODE 2:

./rootdelete.sh
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'

The “OCR initialization failed accessing OCR device” error can occur for the following reasons:
1. ocrconfig_loc does not point to the correct OCR device.
2. The ownership or permissions on the OCR devices are wrong.
3. Oracle Cluster Synchronization Services is misconfigured.

Because the SCR entries are cleaned up anyway, there is no need to worry about the PROC-26 error here.
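
For the first two causes, a quick look at the OCR location file and the device it names usually settles the question. On Linux the location is recorded as `ocrconfig_loc=...` in `/etc/oracle/ocr.loc`; the sketch below reads that key so the path can be inspected. The function name `ocr_device_from_loc` is hypothetical, and the default path is an assumption that differs on other platforms.

```shell
# Print the value of ocrconfig_loc from an ocr.loc-style file.
# The default path is the Linux location; other platforms differ.
ocr_device_from_loc() {
    loc_file=${1:-/etc/oracle/ocr.loc}
    sed -n 's/^ocrconfig_loc=//p' "$loc_file" | head -n 1
}

# Usage (as root):
#   dev=$(ocr_device_from_loc)
#   if [ -e "$dev" ]; then
#       ls -l "$dev"      # inspect owner, group and mode
#   else
#       echo "OCR device '$dev' does not exist (compare with PROC-26)"
#   fi
```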

If the RAC has more than two nodes, run rootdelete.sh on all the other nodes as well.

4. Run rootdeinstall.sh on the node where the RAC installation was performed (usually node 1).
It clears the contents of the OCR disk.

./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out

5. Run root.sh from the same node:

./root.sh
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory '/u01' is not owned by root
assigning default hostname rac1 for node 1.
assigning default hostname rac2 for node 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :
node 1: rac1 rac1-priv rac1
node 2: rac2 rac2-priv rac2
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Now formatting voting device: /dev/raw/raw1
Format of 1 voting devices complete.
Startup will be queued to init within 90 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
rac1
CSS is inactive on these nodes.
rac2
Local node checking complete.
Run root.sh on remaining nodes to start CRS daemons.

After its completion run root.sh on all remaining nodes.

./root.sh
Checking to see if Oracle CRS stack is already configured
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
clscfg: EXISTING configuration version 3 detected.
clscfg: version 3 is 10G Release 2.
assigning default hostname rac1 for node 1.
assigning default hostname rac2 for node 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :
node 1: rac1 rac1-priv rac1
node 2: rac2 rac2-priv rac2
clscfg: Arguments check out successfully.
NO KEYS WERE WRITTEN. Supply -force parameter to override.
-force is destructive and will destroy any previous cluster configuration.
Oracle Cluster Registry for cluster has already been initialized
Startup will be queued to init within 90 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
rac1
rac2
CSS is active on all nodes.
Oracle CRS stack installed and running under init(1M)
Running vipca(silent) for configuring nodeapps
The given interface(s), "eth0" is not public. Public interfaces should be used to configure virtual IPs.

The silent-mode VIPCA configuration fails because of bug 4437727 in 10.2.0.1. To work around it, run
VIPCA manually as root on the last node, where this error occurred, and follow the instructions.
# $ORA_CRS_HOME/bin/vipca

6. The final step is to add the resources back to the OCR with the srvctl command.

Adding DATABASE to OCR:

srvctl add database -d db_unique_name -o oracle_home
[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add database -d orcl -o /u01/app/oracle/product/10.2.0/db_1

Adding INSTANCE to OCR:

srvctl add instance -d db_unique_name -i inst_name -n node_name
[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i orcl11 -n rac1
[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i orcl12 -n rac2

Adding SERVICES to OCR:

srvctl add service -d db_unique_name -s service_name -r preferred_list
[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add service -d orcl -s test_service -r orcl11,orcl12
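
When several databases or instances have to be re-registered, it can help to generate the commands first and review them before running anything. The sketch below only prints `srvctl add` commands in the syntax shown above; the function name and its argument order are made up for this example, and instances are passed as `instance:node` pairs.

```shell
# Print (not run) the srvctl commands that re-register one database,
# its instances and one service. Arguments:
#   $1 db_unique_name, $2 oracle_home, $3 service_name,
#   $4... instance:node pairs, e.g. orcl11:rac1
print_srvctl_adds() {
    db=$1; home=$2; svc=$3; shift 3
    echo "srvctl add database -d $db -o $home"
    preferred=""
    for pair in "$@"; do
        inst=${pair%%:*}   # text before the first colon
        node=${pair#*:}    # text after the first colon
        echo "srvctl add instance -d $db -i $inst -n $node"
        preferred="${preferred:+$preferred,}$inst"
    done
    echo "srvctl add service -d $db -s $svc -r $preferred"
}

# Usage: review the output before running it as the oracle user.
#   print_srvctl_adds orcl /u01/app/oracle/product/10.2.0/db_1 \
#       test_service orcl11:rac1 orcl12:rac2
```

Printing instead of executing keeps the helper harmless on a cluster that is already in a fragile state.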

Adding NODEAPPS to OCR:

srvctl add nodeapps -n node_name -o oracle_home -A addr_str
where addr_str is the node-level VIP address.
This command must be run as root; otherwise you get the following error:

[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add nodeapps -n rac1 -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0
PRKO-2117 : This command should be executed as the system privilege user.
[oracle@rac1 ~]$
[oracle@rac1 ~]$ su -
Password:
[root@rac1 ~]# cd /u01/app/crs/bin
[root@rac1 bin]# ./srvctl add nodeapps -n rac1 -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.87/255.255.255.0
[root@rac1 bin]# ./srvctl add nodeapps -n rac2 -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0

This completes the OCR re-creation. You can now verify the cluster status with cluvfy.

Tags: 10g, OCR, oracle, RAC


19 thoughts on “How To Recover From Corrupted OCR Disk”

Dan Norris

12 September, 2008 at 1:56 am

“It is very common where a DBA is left with corrupted OCR disk without having any good backup.”

I think that is a false statement. First, it isn’t common that OCR becomes corrupted. I’ve installed lots of clusters (even a significant number on virtual machines) and I’ve never had an OCR become corrupted unless I was intentionally trying to corrupt it. Secondly, backups are almost always available because Clusterware makes them automatically in $ORA_CRS_HOME/cdata/ and keeps several recent backups while rotating off the oldest ones. This process is described in the 10g R2 Clusterware and RAC deployment and admin guide, chapter 3.

Your solution is much more drastic than required. The “ocrconfig -restore” command restores the OCR from a backup. I’d think that should be attempted first before blowing away the entire OCR contents and having to rebuild everything from scratch–a much more risky operation than simply restoring it.

Saurabh Sood

12 September, 2008 at 3:45 pm

Dan,

I have seen many situations where DBAs faced this issue. You are lucky that your installations have never been through corruption. I did not do anything special to introduce OCR corruption in my system, but still found the cluster in this state.

I completely agree that we have automatic backups and mirroring of the OCR disk, but certain conditions still leave us with nothing usable: for example, a node was added recently and the OCR became corrupted before the next backup ran (within 4 hours), or a shell script wrongly deleted the backups.

I have listed down steps to recover when we do not have any backup available and we need to rebuild from scratch, which “as mentioned” requires downtime. In case we have backups available then definitely “ocrconfig -restore” is the way to go!!

Cheers!!!
Saurabh Sood

Emit

14 November, 2008 at 6:45 pm

Friend.

I am facing a corrupt OCR. I followed the recommendations above, but I am stuck on the next step: running root.sh on the other nodes.

It looks like the second node does not have the image mount point of node 1.

Do you have any document that goes into more depth?

Regards

Saurabh Sood

18 November, 2008 at 4:15 am

Hi Emit,

Please let us know what is the error that you got while running root.sh on other node.

Regards
Saurabh

Emit

18 November, 2008 at 1:31 pm

[root@sjum1blnx42 crs]# ./root.sh
/bin/chmod: cannot access `/informat/u01/app/oracle/product/crs/evm/init’: No such file or directory
/bin/chmod: cannot access `/informat/u01/app/oracle/product/crs/css/init’: No such file or directory
/bin/chmod: cannot access `/informat/u01/app/oracle/product/crs/css/log’: No such file or directory
/bin/chmod: cannot access `/informat/u01/app/oracle/product/crs/css/auth’: No such file or directory
WARNING: directory ‘/informat/u01/app/oracle/product’ is not owned by root
WARNING: directory ‘/informat/u01/app/oracle’ is not owned by root
WARNING: directory ‘/informat/u01/app’ is not owned by root
WARNING: directory ‘/informat/u01’ is not owned by root
WARNING: directory ‘/informat’ is not owned by root

Saurabh Sood

19 November, 2008 at 5:53 am

Emit,

The warning messages are ok, but could you please check whether the “crs” directory is present or not.
$ ls -ltr /informat/u01/app/oracle/product/

Make sure that root.sh has already been run on the node from which the RAC installation was started before trying it on the other nodes.

Regards,
Saurabh Sood

suchi

31 December, 2008 at 7:43 am

Hi,

Good content…
I am new to RAC, so I don’t have much of an idea: does the OCR recovery you describe here mean accepting a loss of our data, or is there no loss? Please advise.

Thanks,
Suchi

Saurabh Sood

31 December, 2008 at 8:05 am

Hi Suchi,

I don’t think there will be any data loss because we are shutting down all of the databases and the clusterware before performing these steps.
Let me know if you have any queries.

Cheers!!!!
Saurabh Sood

Ming

27 March, 2009 at 2:45 pm

Thanks for sharing this note. My question is whether these approaches are certified by Oracle.
I was searching for a “back out” option before upgrading CRS from 10g to 11g and was directed to your site. Going one step further than what you’ve pointed out, as a last resort, I guess we can completely re-install Clusterware without losing the existing ASM and database. Would you agree?
Ming

Amit

29 March, 2009 at 3:21 am

Yes, you can also clean up the RAC installation and re-install the software. Then you can configure the ASM instance and open the database. In fact, I used the same approach for a test cluster environment (in my case I messed up a CRS bundle patch and the installation got corrupted).

Nitin

29 July, 2009 at 7:27 am

Hi Emit,

Thanks for a well documented note. I think we are missing the netca step. netca must be used to register the listeners in the CRS. This has to be done after registering the database and instances using SRVCTL. Your thoughts on this please…

Nitin R

Amit

30 July, 2009 at 11:39 am

Hi Nitin,

This note is actually written by Saurabh and not by Emit 🙂
Listener is registered with nodeapps so you need not register it again. Moreover there is no command like “srvctl add listener” . Refer http://download.oracle.com/docs/cd/B19306_01/rac.102/b14197/srvctladmin.htm#i1008403

If required you can change the default port for the listener using netca.

Regards
Amit

Saurabh Sood

30 July, 2009 at 3:59 pm

Thanks Nitin for your comments.

Amit: Thanks for your comments also.

From 10g onwards the listener is added as part of nodeapps. There is one srvctl command to configure the listener, which can be seen in the link provided by Amit.
There is no need to run netca again.

With Regards,
Saurabh Sood

Bassem

5 August, 2009 at 6:30 am

Great effort Saurabh, thanks
Bassem

Saurabh Sood

5 August, 2009 at 9:28 am

🙂

bhallar

13 April, 2010 at 8:43 pm

Jeffery Hunter published the same with some additional info.
http://www.idevelopment.info/data/Oracle/DBA_tip/Oracle10gRAC/CLUSTER_70.shtml

Kamal

18 August, 2010 at 6:06 pm

Hi Saurabh,

Can we shorten this whole exercise by following these steps?

1) Stop RDBMS/ASM/Clusterware.
2) mv /etc/oracle to /etc/oracle.bak
3) dd out the voting disk and OCR.
4) Run root.sh on all nodes.
5) Run vipca if required.

Also, I don't believe that the listener gets added when you run vipca. Nodeapps will only register the VIP, GSD and ONS.
You have to add the listener manually, as it has to run from the RDBMS/ASM home and not from the CRS home.

Saurabh Sood

22 October, 2010 at 11:39 am

Hi Kamal,

I think the above steps should also work, but I am not sure about their correctness. Have you ever tried them?

You are correct about the listener configuration: the listener is added through netca and is a part of nodeapps.

This also clarifies Nitin’s comment.

Regards,
Saurabh Sood

Damodhar Reddy

29 June, 2016 at 10:21 am

Hi Saurabh,

What happens to my cluster and databases when all the OCR files are corrupted or lost?
