Saurabh Sood

Resolving Shutdown Immediate Hang Situations

Many times Oracle DBA’s are faced with a situation where shutting down the instance with shutdown immediate seems to take infinite time and gives a impression that the session is hung. In most of the cases this is not a hang. Hang can be defined as a scenario when few sessions/processes are waiting on some action and in turn blocking other session/processes to complete their tasks. Now the original sessions could be waiting because of resource issues or Oracle Bugs.

Shutdown immediate can take long time to complete (appear to be hung) because of three reasons:

1. Uncommitted transactions are being rolled back.

2. SMON is cleaning temp segments or performing delayed block cleanouts.

3. Processes still continue to be connected to the database and do not terminate.

1. Uncommitted transactions are being rolled back:

This is the case when the message ‘Waiting for smon to disable tx recovery’ is posted in the alert log after we issue shutdown immediate.

There are two reasons for this:
– A large query was running at the time of shutdown immediate.
-A large transaction was running at the time of shutdown immediate.

For large queries:

SQL > select count(*) from v$session_longops where time_remaining>0;

If it returns a value > 0 then we can do a shutdown abort and then startup restrict and then again shutdown immediate.

For large transactions:

SQL > select sum(used_ublk) from v$transaction;

If it returns a large value then we have to wait for a long time for shutdowm to get completed.
If the large transaction is aborted and then shutdown is issued then we have to query v$fast_start_transactions and v$fast_start_server, we will not see anything in v$transaction at this time.

At this particular moment transaction recovery is going on and the count(*) will keep on decreasing:

SQL > select count(*) from v$fast_start_transaction;

Decreasing count will show that recovery is going on and when the recovery is completed the database will be shutdown.

But it is not desirable under some circumstances such as, when we have very short maintance window and we need to perform a shutdown immediate to do some work, in those cases we can use the following event and set in the init.ora file TEMPERORARLY To disable transaction recovery:

event=”10513 trace name context forever, level 2″

and bounce the instance and issue shutdown immediate to get complete without transaction recovery.SMON will not do a transaction recovery untill this event is set in the init.ora file so it is necessary to remove this event whenever you get a chance to shutdown the database again, this time shutdown immediate can even take 3-5 hours(Just remove this event from pfile).

2. SMON is cleaning temp segments or performing delayed block cleanouts:

During a SHUTDOWN IMMEDIATE and SHUTDOWN NORMAL, SMON cleans up extents which are no longer needed and marking them as freed. It means that count from uet$ will decrease and count in fet$ will increase.

To verify that the temporary segments are decreasing have an active session available in SQL during the SHUTDOWN IMMEDIATE. Run query to ensure the database is not hanging, but is actually perform extent cleanup:

SQL> select count(block#) from fet$;
COUNT(BLOCK)
----------
115

SQL> select count(block#) from uet$;
COUNT(BLOCK)
----------
713

After some time, issue the query again and check the results:

SQL> select count(block#) from fet$;
COUNT(BLOCK)
----------
210

SQL > select count(block#) from uet$;
COUNT(BLOCK)
----------
512

If you do not have sufficient time to wait for this cleanup then you can set the following event and bounce the database and reissue shutdown immediate to skip this cleanup:

event=”10061 trace name context forever, level 10″

It allows you to prevent SMON from cleaning up temporary segments. Again it is not recommended to set this event event forever. Whenever you have large downtime remove this event and allow SMON to do its work.

3. Processes still continue to be connected to the database and do not terminate:

After issuing shutdown immediate, If we see entries in alert log file as:

Tue Jan  8 12:00:27 2008
Active call for process 10071 user 'oracle' program '[email protected] (J001)'
SHUTDOWN: waiting for active calls to complete.
Tue Jan  8 12:00:57 2008

SHUTDOWN: Active sessions prevent database close operation

It shows that there are some active calls at program ‘[email protected] (J001)’ which pmon is not able to clear up.This message is due to the fact that database is waiting for pmon to clean up processes, but pmon is unable to clean them. The client connections to the server are causing the shutdown immediate or normal to hang. Do the following in this case:

1. Before shutdown immediate, shutdown the listener:

$ lsnrctl stop

2. Now check if there are any connection present at the database as:

$ ps -eaf | grep LOCAL

It will give you the OSPIDs of the client connected to database.

3 Manually kill them as:

# Kill -9 <OSPID>

4. Issue shutdown immediate now.

Do not forget to bring up the listener after startup 🙂

In addition to this you can set 10046 event in the session used to shutdown the instance. This will help to tell the event on which session is waiting

SQL>alter session set events '10046 trace name context forever, level 12'

SQL>Shutdown immediate;

Look for the trace file in user_dump_dest location. Also look at the alert.log for any other messages. They might be helpful in case the shutdown is experiencing hang situation.

Solving ORA-1031 while connecting as “/ as sysdba” :

Many times we see an issue like this:

SQL> conn / as sysdba
ERROR:
ORA-01031: insufficient privileges

This is a very common and frequent error that can occur after the new oracle software install
or due to some permissions changes at OS level.

I will dicuss the approach to solve ORA-1031 error on UNIX environment.

1. Check that oracle_sid and oracle_home are set correctly as:

$ echo $ORACLE_SID
$ echo $ORACLE_HOME

Find the values returned by above command and match these values under /etc/oratab file, these
have to be listed there.

EXAMPLE:
========

$ echo $ORACLE_SID
BSNL

$ echo $ORACLE_HOME
/u01/app/oracle/product/10.2.0/db_2

$ cat /etc/oratab

BSNL:/u01/app/oracle/product/10.2.0/db_2:N
VSNL:/u01/app/oracle/product/10.2.0/db_2:N

The values above are matching with /etc/oratab entries

If the oracle_sid and oracle_home are not set properly then set it as:

$ export ORACLE_SID=BSNL
$ export ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_2

And try to connect as “/ as sysdba” It should work now.

If these are correct but still the error is coming then move to step 2.

2. Ensure TWO_TASK is not set

$ echo $TWO_TASK

If it return any lines as:

TWO_TASK=
OR
TWO_TASK=<some_db_name>

Then unset the environment variable as:

$ unset TWO_TASK

Now try to connect as “/ as sysdba”

If these are correct but still the error is coming then move to step 3.

3.Check the permissions on the oracle executable file:

$ cd $ORACLE_HOME/bin
$ ls -la oracle

It should show the following permissions:

-rwsr-s–x 1 oracle oinstall 96725724 Apr 2 13:43 oracle

If its not the same then issue the following command to set the correct permissions:

$ chmod 6751 oracle

If these are correct but still the error is coming then move to step 4.

4. Check for the dba group at OS level. We need to make sure that Operating System users issuing / as sysdba belongs to dba group at OS level.

There is one file we need to check for this i.e $ORACLE_HOME/rdbms/lib/config.s OR $ORACLE_HOME/rdbms/lib/config.c (File name vary from OS to OS on some OS it is config.c and on some OS it is config.s). The value in these file is typically set to “dba” as:

.ascii “dba”

Login as oracle user:
# su – oracle
$ id
uid=111(oracle) gid=123(usdba)

Look for the gid value here.(usdba)
The gid value is usdba so we need to modify the config.c or config.s so that it should look like:
.ascii “usdba”

After making changes to config file relink oracle binaries as:

– Make sure that no oracle processes running
– Login as oracle
– Make sure LD_LIBRARY_PATH and ORACLE_HOME are set properly
$ORACLE_HOME/bin/relink all

If these are correct but still the error is coming then move to step 5.

5. Make sure that dba group at OS level only exists once in /etc/group file and that the users belonging to the dba group are properly comma separated.

Example:

usdba::123:oracle,oracle1

ii) Check that the oracle user uid and gid are same in /etc/group and /etc/passwd

If all these 5 settings are correct and still ora-1031 is coming then the only option is to take truss output and check while opening which file the error is coming.
e.g.

$ truss -aefo /tmp/truss.out sqlplus “/ as sysdba”

6. Ensure you are invoking sqlplus from correct ORACLE_HOME

This actually came as comment on this post and I would agree. Ensure that you are using sqlplus from correct ORACLE_HOME. To do this set

export PATH=$PATH:$ORACLE_HOME/bin

You can confirm the home using which sqlplus command

Recovering from ORA-1578 ORA-8103 (Logical Corruption)

1. For ORA-1578:

There are two ways in which we can extract the data from a corrupted table:

a) Using DBMS_REPAIR.SKIP_CORRUPT_BLOCKS
b) Using Event 10231

a) Connect as sysdba user:

execute DBMS_REPAIR.SKIP_CORRUPT_BLOCKS(‘<schema_name>’,'<table_name>’);

Put the schema name and the table name of the corrupted table.

Then issue the following command to craete a new table from corrupted table:

create table new_table as select * from corrupted_table_name;

Now Drop the corrupted table, rename the newly created table to its original name and
clear the attribute for the new table as:

execute DBMS_REPAIR.SKIP_CORRUPT_BLOCKS

(‘<schema_name>’,'<table_name>’,flags=>dbms_repair.noskip_flag);

b) Set the event 10231 at session level to skip corrupted rows:

ALTER SESSION SET EVENTS ‘10231 TRACE NAME CONTEXT FOREVER, LEVEL 10’;

Then issue the following command to craete a new table from corrupted table:

create table new_table as select * from corrupted_table_name;

Now Drop the corrupted table, rename the newly created table to its original name.

2. For ORA-8103

This is a case of LOGICAL CORRUPTION and cannot be taken care by skipping the corrupted rows.

Follow Metalink Note 422547.1 to salvage data in case of ORA-8103

Physical Corruption: ORA-1578 PART-2

Recovering from physical corruption:

1. Using BMR when RMAN backups are available
When small set of blocks
are corrupted, Block Media Recovery (BMR) can be used as a recovery technique. The advantage of BMR is that it can be performed without taking the datafiles offline but the block undergoing BMR is not accessible to users. RMAN BLOCKRECOVER command is used for BMR.


1)How to identify and recover the corrupted blocks needing media recovery:
To find the corrupted blocks we can use v$database_block_corruption OR v$backup_corruption view
We can use RMAN to populate the he v$backup_corruption view, which will tell us the corruptedblocks are the datafile number on which they resides:

RMAN > backup validate database;

This command will check the datafiles for physical and logical corruption. If the backup validation discovers corrupt blocks,then RMAN updates the V$DATABASE_BLOCK_CORRUPTION and v$backup_corruption views.
Now Run :

SQL> select * from V$backup_corruption;

SQL> select * from V$database_block_corruption;

to find out the corrupted blocks.
It will give you the output like :

FILE# BLOCK BLOCKS CORRUPTION_CHANGE# MAR

15 215 1 0 YES

Now we have the list of corrupted blocks file#=15 and block#=215
We will use the following command to recover corrupted blocks:

RMAN > BLOCKRECOVER DATAFILE 15 BLOCK 215;
RMAN > BLOCKRECOVER CORRUPTION LIST;


The corrupted block will be recovered from last RMAN backup and RMAN will take care of applying archives to make the block current.


2. Using BMR when RMAN backups are not taken ?
If we have a OS level backup of corrupted datafile, it can be cataloged with RMAN to perform BMR.
Find the location of Backup of datafiles which have corruption in it.
Let us suppose the backup of corrupted datafile is at /u01/app/oracle/backup/data01.dbf

Now use the following command to catalog the backup as:
RMAN> catalog datafilecopy ‘/u01/app/oracle/backup/data01.dbf’;


Find the archive log:


RMAN will use the archive logs which are present in V$ARCHIVED_LOG, If the archivelogs are moved to some other location or they are aged out of controlfilethen they need to be cataloged as well:


RMAN> catalog archivelog ‘/u02/app/oracle/backup/archoracle_1_40.dbf’;

Once the datafiles and the required archive logs are cataloged, we can use the BMR as:


RMAN > BLOCKRECOVER datafile 15 block 215;

Restrictions on BLOCKRECOVER

1. The database needs to be in mounted or open stage.

2. Only complete media recovery of block is possible.

3. Only full backups of datafiles containing corrupted blocks can be used to perform block media recovery.

4. If the archivelogs are missing or not accessible, BMR will fail.

5. Header blocks cannot be recovered.

6. Database needs to be ARCHIVELOG mode for performing a BMR.


I will be discussing more on Recovery and data salvaging techniques under Corruption Part-3.

Physical Corruption: ORA-1578 -Part 1

ORA-1578 is indication of physical block corruption. Oracle computes an internal checksum before fetching the block from disk, when it is about to fetch the block it reads the block header and compare the checksum stored in it. If the checksum mismatches we get ora-1578 error.
It indicates that a block is corrupted at
hardware level i.e at OS.

This error will always come whenever the corrupted block is read by Oracle.

Reasons for ora-1578:

1. Bad sectors in the Storage Disk Drive.
2. Formatting of blocks At OS level.( Zeroed out disk blocks).
3. some corruption in underlying hardware which is making block unreadable.

How to collect information for ora-1578:

Error Text:

ORA-01578: ORACLE data block corrupted (file # 14, block # 38976)

The block # 38976 shows the corrupted block number and the file # 14 indicates
the reletive file number 14.

1. Get the absolute file number:

To get the absolute file # we will use the following query:

SQL > select file# from v$datafile where rfile#=14;

FILE#
——-
14

here the absolute file# is also 14.

2. Get the segment name and segment type in which this corrupted block resides:

Use the following query:

SQL > SELECT tablespace_name, segment_type, owner, segment_name FROM
dba_extents WHERE file_id = 14 and 38976 between block_id AND block_id + blocks – 1;

— It will give us the segment type (Table, Index, LOB etc.), name of the segment (Table name
Index name etc.), Owner of the segment and the tablespace name of the segment in which corrupted block resides.

SQL > select FILE#,TS#,STATUS,BLOCKS,NAME from v$datafile where FILE#=14;

— From this we will get the tablespace number and the fine name of the datafile in which the corrupted block resides.

SQL> select TS#,NAME from v$tablespace where TS#=;

— It will give us the Name of the tablespace.

Now we have all the information required to perform an action to overcome ora-1578.

I will be providing more on this about solving ora-1578 soon.