Saurabh Sood

How To Recover From Corrupted OCR Disk

It is very common where a DBA is left with corrupted OCR disk without having any good backup.
The same situation was experienced by me few days back. One node of RAC database shows the following:

NODE1:

<span style="font-family: arial,helvetica,sans-serif;"><strong>$ORA_CRS_HOME/bin/crs_stat -t
</strong>Name           Type           Target    State     Host
------------------------------------------------------------
ora.orcl.db    application    ONLINE    ONLINE    rac1
ora....11.inst application    ONLINE    ONLINE    rac1
ora....12.inst application    ONLINE    OFFLINE
ora....vice.cs application    OFFLINE   OFFLINE
ora....l11.srv application    ONLINE    OFFLINE
ora....l12.srv application    ONLINE    OFFLINE
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....DC.lsnr application    ONLINE    ONLINE    rac1
ora....abc.gsd application    ONLINE    ONLINE    rac1
ora....abc.ons application    ONLINE    ONLINE    rac1
ora....abc.vip application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora....bc2.gsd application    ONLINE    ONLINE    rac2
ora....bc2.ons application    ONLINE    ONLINE    rac2
ora....bc2.vip application    ONLINE    ONLINE    rac2</span>

The other node shows the following:
NODE2:

<span style="font-family: arial,helvetica,sans-serif;"><strong>/crs_stat -t</strong>
HA Resource                                   Target     State
-----------                                   ------     -----
ora.orcl.db                                   OFFLINE    OFFLINE
ora.orcl.orcl11.inst                          OFFLINE    OFFLINE
ora.orcl.orcl12.inst                          OFFLINE    OFFLINE
ora.orcl.test_service.cs                      ONLINE     OFFLINE
ora.orcl.test_service.orcl11.srv              OFFLINE    OFFLINE
ora.orcl.test_service.orcl12.srv              OFFLINE    OFFLINE
ora.rac1 .ASM1.asm                         OFFLINE    OFFLINE
ora.rac1 .LISTENER_RAC1 .lsnr           OFFLINE    OFFLINE
ora.rac1 .gsd                              OFFLINE    OFFLINE
ora.rac1 .ons                              OFFLINE    OFFLINE
ora.rac1 .vip                              OFFLINE    OFFLINE
ora.rac2.ASM2.asm                        OFFLINE    OFFLINE
ora.rac2.LISTENER_RAC2 2.lsnr         ONLINE     OFFLINE
ora.rac2.gsd                             ONLINE     OFFLINE
ora.rac2.ons                             ONLINE     OFFLINE
ora.rac2.vip                             ONLINE     OFFLINE</span>

We can see the inconsistent data across two node RAC. Every command for srvctl, crsctl was hanging on NODE 2.
Now the option is to restore the OCR backup, but if there is no backup available for OCR then we can use the following procedure to recover from corrupted OCR disk
(There will be complete downtime needed to perform these operations)


1. Check the status of CRS from node 1:

# ps -eaf |grep d.bin
root 12873 1 0 Aug11 ? 00:11:07 /u01/app/crs/bin/crsd.bin reboot
oracle 13105 12846 0 Aug11 ? 00:00:45 /u01/app/crs/bin/evmd.bin
oracle 13226 13200 0 Aug11 ? 00:13:13 /u01/app/crs/bin/ocssd.bin
root 21458 19986 0 20:34 pts/4 00:00:00 grep d.bin

2. Shutdown Oracle ClusterWare on all nodes:

<span style="font-family: arial,helvetica,sans-serif;">[root@rac1  bin]# ./crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.</span>

Check the status again:

[root@rac1 bin]# ps -eaf |grep d.bin
root 21927 19986 0 20:34 pts/4 00:00:00 grep d.bin

It shows that the cluster is stopped.

3. Execute rootdelete.sh from all nodes.

It is under directory $ORA_CRS_HOME/install/rootdelete.sh

NODE1:

<span style="font-family: arial,helvetica,sans-serif;">[root@rac1  install]# <strong>./rootdelete.sh</strong>
Shutting down Oracle Cluster Ready Services (CRS):
Stopping resources.
Error while stopping resources. Possible cause: CRSD is down.
Stopping CSSD.
Unable to communicate with the CSS daemon.
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'</span>

NODE 2:

./rootdelete.sh</strong>
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'</span>

“OCR initialization failed accessing OCR device”, this error can occur due to folloing reasons:
1. ocrconfig_loc is not pointing to the correct ocr.
2. Problem of rights and owners on the ocr devices
3. Configuration problem on Oracle Cluster Synchronization Services

As the SCR entries are cleaned up so there is no need to worry about PROC-26 error.

If you have more than 2 nodes in a rac you need to run rootdelete.sh on all the other nodes also.

4. Run rootdeinstall.sh from the node where the RAC installation was done (usually it is the node1).
It will clear up the OCR disk contents.

<span style="font-family: arial,helvetica,sans-serif;">./rootdeinstall.sh</span>

<span style="font-family: arial,helvetica,sans-serif;">Removing contents from OCR device
2560+0 records in
2560+0 records out</span>

5. Run root.sh from the same node:

<span style="font-family: arial,helvetica,sans-serif;">./root.sh
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured</span>

<span style="font-family: arial,helvetica,sans-serif;">Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory '/u01' is not owned by root
assigning default hostname rac1  for node 1.
assigning default hostname rac2 2 for node 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :</span>

node 1: rac1  rac1-priv rac1
node 2: rac2  rac2-priv rac2
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Now formatting voting device: /dev/raw/raw1
Format of 1 voting devices complete.
Startup will be queued to init within 90 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
rac1
CSS is inactive on these nodes.
rac2 2
Local node checking complete.
Run root.sh on remaining nodes to start CRS daemons.

After its completion run root.sh on all remaining nodes.

<span style="font-family: arial,helvetica,sans-serif;"> ./root.sh
Checking to see if Oracle CRS stack is already configured</span>

<span style="font-family: arial,helvetica,sans-serif;">Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
clscfg: EXISTING configuration version 3 detected.
clscfg: version 3 is 10G Release 2.
assigning default hostname rac1  for node 1.
assigning default hostname rac2  for node 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :</span>

node 1: rac1  rac1-priv rac1
node 2: rac2  rac2-priv rac2
clscfg: Arguments check out successfully.

<span style="font-family: arial,helvetica,sans-serif;">NO KEYS WERE WRITTEN. Supply -force parameter to override.
-force is destructive and will destroy any previous cluster
configuration.
Oracle Cluster Registry for cluster has already been initialized
Startup will be queued to init within 90 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
rac1
rac2
CSS is active on all nodes.
Oracle CRS stack installed and running under init(1M)
Running vipca(silent) for configuring nodeapps
The given interface(s), "eth0" is not public. Public interfaces should be used to configure virtual IPs.</span>

The silent mode VIPCA configuration will fail because of BUG 4437727 in 10.2.0.1. To solve this run the
VIPCA manually from root user from last node where this error has occured and follow the instructions.
# $ORA_CRS_HOME/bin/vipca

6. Now final step is to add the resources back to OCR with srvctl command.

Adding DATABASE to OCR:

$srvctl add database -d db_unique_name -o oracle_home
[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add database -d orcl -o /u01/app/oracle/product/10.2.0/db_1</span>

Adding INSTANCE to OCR:

srvctl add instance -d db_unique_name -i inst_name -n node_name
[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i orcl11 -n rac1
[oracle@rac1 ~]$ $ORA_CRS_HOME/bin/srvctl add instance -d orcl -i orcl12 -n rac2 2</span>

Adding SERVICES to OCR:

$srvctl add service -d db_unique_name -s service_name -r preferred_list
[oracle@rac1  ~]$ $ORA_CRS_HOME/bin/srvctl add service -d orcl -s test_service -r orcl11,orcl12</span>

Adding NODEAPPS to OCR:

srvctl add nodeapps -n node_name -o oracle_home -A addr_str
Where addr_str= The node level VIP address
This command needs to be run from ROOT user otherwise you will get following error:

<span style="font-family: arial,helvetica,sans-serif;">[oracle@rac1  ~]$  $ORA_CRS_HOME/bin/srvctl add nodeapps -n rac1  -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0
PRKO-2117 : This command should be executed as the system privilege user.
[oracle@rac1  ~]$
[oracle@rac1  ~]$ su -
Password:
[root@rac1  ~]# cd /u01/app/crs/bin
[root@rac1  bin]# ./srvctl add nodeapps -n rac1  -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.87/255.255.255.0
[root@rac1  bin]#./srvctl add nodeapps -n rac2 2  -o /u01/app/oracle/product/10.2.0/db_1 -A 10.167.21.89/255.255.255.0</span>

This will complete the OCR recreation, now you can test the status with cluvfy.

UNKNOWN State Of RAC Resources

While Checking the status of database resources, ASM was shown as UNKNOWN on one node of a two node RAC.

$ crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.orcl.db    application    ONLINE    ONLINE    rac1
ora....11.inst application    ONLINE    ONLINE    rac1
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....DC.lsnr application    ONLINE    ONLINE    rac1
ora....idc.gsd application    ONLINE    ONLINE    rac1
ora....idc.ons application    ONLINE    ONLINE    rac1
ora....idc.vip application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    UNKNOWN    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora....dc2.gsd application    ONLINE    ONLINE    rac2
ora....dc2.ons application    ONLINE    ONLINE    rac2
ora....dc2.vip application    ONLINE    ONLINE    rac2

Following error was coming while trying to start the +ASM2 instance with SRVCTL:

$srvctl start asm -n rac2

PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2",
[CRS-0223: Resource 'ora.rac2.ASM2.asm' has placement error.]

While trying to start the same with crs_start :

$ crs_start -f ora.rac2.ASM2.asm

CRS-1028: Dependency analysis failed because of:
'Resource in UNKNOWN state: ora.rac2.ASM2.asm'
CRS-0223: Resource 'ora.rac2.ASM2.asm' has placement error

There are two ways to come out of this UNKNOWN state of resources:
1. Start the resource from sqlplus
2. Use crs_stop -f to clear the state of database resources.

$ export ORACLE_HOME=+ASM2
$ sqlplus "/ as sysdba"
SQL>startup 
Diskgroup mounted

It will go fine and the +ASM2 instnace will be started.

$ crs_stop -f ora.rac2.ASM2.asm

This will clear the UNKNOWN state and will make the resource as OFFLINE.

Now start the resource as:

$ srvctl start asm -n rac2

After using this check the status :

$ crs_stat -t

In case of listener resource, if starting listener using srvctl results in following error

CRS-0215: Could not start resource 'ora.dev-101.LISTENER_DEV-101.lsnr'.

This can be resolved by removing listener resource and adding it back. Perform following action using root user

#crs_unregister ora.dev-101.LISTENER_DEV-101.lsnr
#crs_unregister ora.dev-102.LISTENER_DEV-102.lsnr

Then recreate the listener using silent mode as oracle user

$netca /silent /responsefile $ORACLE_HOME/network/install/netca_typ.rsp /nodeinfo dev-101,dev-102

Above command can result in error like below

Exception in thread "main" java.lang.UnsatisfiedLinkError: /home/oracle/product/10.2/jdk/jre/lib/i386/libawt.so: libXp.so.6: cannot open shared object file: No such file or directory
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1586)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1503)
	at java.lang.Runtime.loadLibrary0(Runtime.java:788)
	at java.lang.System.loadLibrary(System.java:834)
	at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:50)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.awt.NativeLibLoader.loadLibraries(NativeLibLoader.java:38)
	at sun.awt.DebugHelper.(DebugHelper.java:29)
	at java.awt.Component.(Component.java:506)

This can be resolved by installing xorg-x11-deprecated-libs rpm. (yum install xorg-x11-deprecated-libs)

Verification of CRS Integrity Was Unsuccessful

While going through the routine checks from Grid Control, I found a critical alert stating “clusterware integrity check failed” and by clicking on this message it says that there is problem with some metric collections on RAC environment.

To check the node reachability status following query was run:

$ $CRS_HOME/bin/cluvfy comp nodecon -n all

This will check the internode connectivity for all nodes in the cluster. It came out with following message:

$ $CRS_HOME/bin/cluvfy comp nodecon -n all
Verifying node connectivity
Verification of node connectivity was unsuccessful on all the nodes.

Even the CRS component check was unsuccessful:

$ $CRS_HOME/bin/cluvfy comp crs -n all

It came out with the following message:

$ $CRS_HOME/bin/cluvfy comp crs -n all
Verifying CRS integrity
Verification of CRS integrity was unsuccessful on all the nodes.

After this it was quite obvious to check the CRS status:

$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
$crs_stat -t

Name           Type           Target    State     Host
------------------------------------------------------------
ora.orcl.db    application    ONLINE    ONLINE    rac1
ora....11.inst application    ONLINE    ONLINE    rac1
ora....12.inst application    ONLINE    ONLINE    rac2
ora....vice.cs application    ONLINE    ONLINE    rac2
ora....l1.srv application    ONLINE    ONLINE    rac1
ora....l1.srv application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....DC.lsnr application    ONLINE    ONLINE    rac1
ora....idc.gsd application    ONLINE    ONLINE    rac1
ora....idc.ons application    ONLINE    ONLINE    rac1
ora....idc.vip application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora....dc2.gsd application    ONLINE    ONLINE    rac2
ora....dc2.ons application    ONLINE    ONLINE    rac2
ora....dc2.vip application    ONLINE    ONLINE    rac2
$$CRS_HOME/bin/olsnodes
rac1
rac2

This confirmed that the CRS install is valid, but the question now is why the cluster verification utility (CVU) was failing?

To find the reason I enabled the tracing of CVU as:

$export SRVM_TRACE=true

It will set the environment variable SRVM_TRACE to true and tracing of CVU will generate a trace file under $CRS_HOME/cv/log with name like “cvutrace.log.X”

After setting this and again running $CRS_HOME/bin/cluvfy comp crs -n all trace file with name cvutrace.log.0 was generated.

And a message in cvutrace.log like

<strong>"ksh: CVU_10.2.0.2_dba/exectask.sh: cannot execute"</strong>

Now its is clear that oracle is not able to execute exectask.sh and cheking the permission and ownership of exectask.sh:

$CRS_HOME/cv/remenv
ls -ltr
-rw-r--r--  1 oracle dba    184 Jan  9  2008 exectask.sh
-rw-r--r--  1 oracle dba 268386 Jan  9  2008 exectask

The permission of these two files was changed. After changing the permission back to 755 CUV was showing correct results.

$chmod 755 exectask*

It is still not discovered how the permission of these files got changed.

11g New Feature: Database Replay

Database Replay:

In critical production systems, before making changes to the environment( Hardware or software related), lots of testing is required in test systems to know the effect of those changes and this usually takes lots of time to test for the stability and effects of the change.

In Oracle 11g this thing is simplified and a new feature “database replay” is added which can be used to simulate the production workload on a test system.

Following are the three stages of DATABASE REPLAY:
1. Workload Capture
2. Workload Preprocessing
3. Workload Replay

1. Workload Capture:

Before starting the capture process you need to take care of following things:

Login as sys user

It is always recommened (not mandatory) to shutdown the database and then start the database in restricted mode. By doing so, you will be making sure that no transactions are running on the database before capture process is actually started. It will guarantee the complete workload capture and you will be running full transactions during database replay on test system.
If you do not follow this step then you have to be prepared for some divergance of results during database replay on test system.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down
SQL> startup restrict
ORACLE instance started.

Total System Global Area  313860096 bytes
Fixed Size                  1299624 bytes
Variable Size             184552280 bytes
Database Buffers          121634816 bytes
Redo Buffers                6373376 bytes
Database mounted.
Database opened.
SQL > select LOGINS from v$instance;
LOGINS
----------
RESTRICTED

Important thing to note here is that, after starting the capture process the database will automatically be switched into UNRESTRICTED mode and the users will be able to connect normally.

After capturing the initial database information, the capture process will internally fire a statement to disable restricted sessions, this can be verified by setting 10046 trace before starting the capture process as:

SQL> alter session set events '10046 trace name context forever, level 4';

Now start the capture process.

SQL> alter session set events '10046 trace name context off';

You will see following information:

PARSING IN CURSOR #21 len=39 dep=1 uid=0 oct=49 lid=0 tim=121695902847 hv=14656808 ad='0' sqlid='7shphjbg23h8'
alter system disable restricted session
END OF STMT
PARSE #21:c=0,e=6,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=0,tim=121695384843

Now check the information from database:

SQL > select LOGINS from v$instance;
LOGINS
----------
ALLOWED

– Create a directory to hold the workload information.

SQL > create directory DIR as '/tmp/ss';

Check the permission of directory ss at OS level and make sure that oracle is able to write to this directory.
Also make sure that the directory is empty, before starting the capture process.

– Define Workload Filters
By adding filters we can restrict the Workload Capture process to be specific to certain user sessions. By default all the sessions will be captured.

There are two types of filters i.e INCLUDE and EXCLUDE.

If INCLUDE is used then the filter will be active and workload for the objects mentioned in the filter will not be captured rather all other user’s workload Will be captured i.e the objects defined in the filter will be FILTERED OUT from capture process.

If EXCLUDE is used, only the workload done by objects defined in the filter will be captured, no other client request to database will be captured.

EXAMPLE:
========

SQL> exec dbms_workload_capture.ADD_FILTER( fname  IN VARCHAR2, fattribute  IN VARCHAR2,fvalue IN VARCHAR2);

fname=Name of the filter.
fattribute=Attribute on which the filter will be applied i.e USER, PROGRAM, INSTANCE_NUMBER etc.
fvalue= value for the corresponding attribute.

SQL > exec dbms_workload_capture.ADD_FILTER( fname =>'FILTER_SCOTT',fattribute => 'USER',fvalue => 'SCOTT');

– Starting workload capture:

To start the workload capture START_CAPTURE procedure is used as:

SQL > exec dbms_workload_capture.start_capture(NAME => 'PEAK_LOAD_TEST',DIR => 'DIR', DURATION => 600);

Name,Dir are mandatory parameters while duration (in seconds) is optional, here the capture will run for 600 seconds, if the duration is not specified then you have to finish the capture manually as:

SQL > dbms_workload_capture.finish_capture();

– While the capture process is runnig it will generate two files : wcr_scapture.wmd and wcr_cap_000xx.start

– After using the finish_capture procedure you will get folloing files also: wcr_cr.html and wcr_cr.text

wcr_cr.html is similar to the AWR report generated by awrrpt.sql

During capture process database initiate a AWR report for the duration of capture period.

SQL>  select id,AWR_BEGIN_SNAP,AWR_END_SNAP from dba_workload_captures;

<span style="font-family: arial,helvetica,sans-serif; font-size: small;">ID AWR_BEGIN_SNAP AWR_END_SNAP
---------- -------------- ------------
1             20           21
2             29           30

– Exporting AWR data for the workload capture:

It will enable detailed analysis of the workload:

SQL> exec DBMS_WORKLOAD_CAPTURE.EXPORT_AWR (capture_id => 2);

After this you can see two more files generated under the specified directory:

wcr_ca.dmp and wcr_ca.log

2. Workload Preprocessing:

Once the workload is captured it can be replayed on the test system. For this you need to copy the capture files to the test system. Preprocessing a workload will convert the captured files into REPLAY files and create the necessary metadata for this. Preprocessing will put load on the system and is resource intensive and should be performed only on test systems and on the same version of oracle database.

– Create a new directory for the capture files placed on the test system.
Let us suppose that the capture files are placed under /tmp/ss on test system, now create a directory as:

SQL > create directory REPLAY as '/tmp/ss';

SQL>EXEC DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(capture_dir => 'REPLAY');

After preprocessing followings files will be generated:

wcr_process.wmd, wcr_login.pp, wcr_seq_data.extb, wcr_scn_order.extb , wcr_conn_data.extb

3. Workload Replay:

You have to use an instance which is logically identical to the instance whose workload is captured and need to setup Replay Client (WRC).

– wrc executable has to be run in calibrate mode to estimate the number of replay clients.

$  wrc mode=calibrate replaydir=/tmp/sood

Workload Replay Client: Release 11.1.0.6.0 - Production on Sat Jul 26 18:49:29 2008

Copyright (c) 1982, 2007, Oracle.  All rights reserved

Report for Workload in: /tmp/sood
-----------------------

Recommendation:
Consider using at least 1 clients divided among 1 CPU(s).

Workload Characteristics:
- max concurrency: 1 sessions
- total number of sessions: 5

Assumptions:
- 1 client process per 50 concurrent sessions
- 4 client process per CPU
- think time scale = 100
- connect time scale = 100
- synchronization = TRUE

– Initialize Replay Data:


Initializing replay loads the metadata into tables required for database replay.

SQL>; exec DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY (replay_name => 'REPLAY_1', REPLAY_DIR=> 'REPLAY');

– Put the database in “PREPARE REPLAY” mode as:

SQL > exec DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY()

To check the status of replay use:

SQL > select name,status from  dba_workload_replays;
NAME                 STATUS
-------------------- ----------------------------------------
REPLAY_1             PREPARE

— Start the replay client:

Workload Replay Client: Release 11.1.0.6.0 - Production on Sat Jul 26 22:12:35 2008

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

Wait for the replay to start (22:12:35)

– Start the replay as:

SQL> exec DBMS_WORKLOAD_REPLAY.START_REPLAY ();

Check the status of the replay as:

SQL> select name,status from  dba_workload_replays;

Initially it will show :
NAME                 STATUS
-------------------- ----------------------------------------
REPLAY_1             IN PROGRESS

And after that it will be shown as completed.

– Once the status is completed finish the replay as:

SQL> exec DBMS_WORKLOAD_REPLAY.CANCEL_REPLAY ();

To generate the report of the replay in HTML format you need to run :

DECLARE
My_Report  CLOB;
BEGIN
My_Report := DBMS_WORKLOAD_REPLAY.REPORT(replay_id => 1, format => DBMS_WORKLOAD_REPLAY.TYPE_HTML);
END;

Effect of OS Terminal Setting “STTY” on Oracle Database

Few days back, while trying to create a synonym over dblink I experienced following error:

SQL> create synonym synonymn_name for dual@DBLINK9i;
SP2-0042: unknown command "DBLINK9i" - rest of line ignored

Strange error !!!

Also while trying to perform a simple select statement over another dblink:

SQL > select * from dual@dblink10g;
SP2-0042: unknown command "DBLINK10g" - rest of line ignored.

It also failed with a similar error, though the select over this dblink was working fine earlier.The tnsnames.ora entry was correct and double checked.So what happened to this dblink now?

From other database I was able to perform a simple select over same dblink.

While I was researching this, I came to know that sql queries also started failing :

SQL> UPDATE HRRECORD_TEST SET EMAIL_ADDRESS = '[email protected]' WHERE NAME = 'XYZ';
SP2-0734: unknown command beginning "123.com'..." - rest of line ignored.
SQL>

One Notable thing was that all the queries which failed contained “@” symbol and I also got this error  while create a synonym over dblink (having “@” symbol).

These two issues were related, Then I tried to create a synonym locally and it went fine. This led to conclusion  that sql prompt was not recognising “@” symbol.

So it was clear that the problem was with terminal settings at OS level for oracle user.

Searcing google for terminal setting at OS, I found a command which is used to Sets options for your terminal i.e “stty”. By using stty we can change the terminal settings.

Now checking the current terminal settings using:

# stty -a

speed 9600 baud; line = 0;
rows = 36; columns = 80
min = 4; time = 0;
intr = DEL; quit = ^\; erase = DEL; kill = @
eof = ^D; eol = ^@; eol2 <undef>; swtch <undef>
stop = ^S; start = ^Q; susp <undef>; dsusp <undef>
werase <undef>; lnext <undef>
parenb -parodd cs7 -cstopb hupcl -cread -clocal -loblk -crts

From this output I found that there were two settings causing problem:

1) kill = @

2) eol = ^@

Then, to overcome this I added the following two lines in the oracle user .profile

stty kill ^U

stty eol ^E

Above change will force oracle user to use ^U to kill any session and ^E to end a line instead of using @ for these purpose.

After making these changes in oracle user’s .profile everything went fine. In the end it turned out OS issue whose settings was modified by sysadmin. 😉

Adding new ASM disk to RAC database fails

Many times i came across a common problem in RAC databases where trying to add an asm disk is not possible due to errors like

ORA-15075 “disk(s) are not visible cluster-wide”

ORA-15020 “discovered duplicate ASM disk “DISK1” and

ORA-15054 “disk “ORCL:DISK1” does not exist in diskgroup “DG1”.

Rebalancing the diskgroup and trying to add the disk with “FORCE” option also does not help in this case.

I will be discussing how to come out of a situation like this i.e When you are trying to add an asm disk in cluster environment and it says that disk is already added and when trying to drop the same disk it says that disk is not present in the diskgroup.

Lets start from the very begining:
I have decided to add an asm disk in RAC environment to an already existing diskgroup DATA1.

Login to asm instance “/ as sysdba”
SQL > ALTER DISKGROUP DATA1 ADD DISK ‘/dev/rdsk/c1t2d3s4’;

But it failed with following error:

ALTER DISKGROUP DATA1 ADD DISK '/dev/rdsk/c1t2d3s4';*
ERROR at line 1:
<strong>ORA-15032: not all alterations performed
ORA-15075: disk(s) are not visible cluster-wide</strong>

This is due to the fact that the physical disk partition is not visible from all RAC nodes. Then i contacted the sysadmins to make sure that the disk is visible from all RAC nodes and accessible by ORACLE. They have fixed the problem and now the disk /dev/rdsk/c1t2d3s4 can be seen from all RAC nodes. Then i tried to add the disk again using force option as:

SQL > ALTER DISKGROUP DATA1 ADD DISK ‘/dev/rdsk/c1t2d3s4’ force;
But it failed with following error:

ORA-15020: discovered duplicate ASM disk “/dev/rdsk/c1t2d3s4”

It shows that disk with same name is already present in the diskgroup.

As it shows that the disk is already present in the diskgroup, while trying to drop the disk i got following error:

SQL&gt; alter diskgroup DATA1 drop disk '/dev/rdsk/c1t2d3s4';
alter diskgroup DATA1 drop disk '/dev/rdsk/c1t2d3s4'
*
ERROR at line 1:
<strong>ORA-15032 : not all alterations performed
ORA-15054 : disk "/dev/rdsk/c1t2d3s4" does not exist in diskgroup "DATA1"</strong>

Now I cannot move further as adding and dropping the disk is not possible here. Then I decided to check the status of the disk from v$asm_disk from all RAC nodes, to do this issue following query:

SQL > col name format a15
SQL > col path format a20
SQL > select GROUP_NUMBER,DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,NAME,PATH from v$asm_disk;

We Obain following results from all the nodes :

G# D# HEADER_STATU MOUNT_S STATE NAME PATH
—- —- ———— ——- ——– ———— ————————-
0 0 MEMBER IGNORED NORMAL /dev/rdsk/c1t2d3s4

Header_status=MEMBER means that the disk is a valid asm disk on all RAC nodes.
Mount_status=IGNORED means that Disk is present in the system, but is ignored by ASM.

Group_number=0 This is the number used when a disk is not mounted by a diskgroup.

Now by checking the dd output of the disk as :

$dd if=/dev/rdsk/c1t2d3s4 of=/tmp/disk.out bs=4096 count=1096

$ vi /tmp/disk.out

I found that the diskgroup name and disk number allocated to this disk, which confirms that the disk is now a part of diskgroup DATA1.

But from the results of the header_status,mount_status and group_number it is clear that the disk is partially added to RAC asm instances. To correct this we will have to clear the disk header to add it again:

# dd if=/dev/null of=/dev/rdsk/c1t2d3s4 bs=4096 count=5000

This command cleared the disk header and after that disk was added successfully.

Note: – Please note that using dd will clear the ASM header and should be used only after confirming the disk. Using it on a wrong disk can cause Diskgroup to dismount and lead to Data Loss.