Physical Corruption: ORA-1578 PART-2

Recovering from physical corruption:

1. Using BMR when RMAN backups are available
When small set of blocks
are corrupted, Block Media Recovery (BMR) can be used as a recovery technique. The advantage of BMR is that it can be performed without taking the datafiles offline but the block undergoing BMR is not accessible to users. RMAN BLOCKRECOVER command is used for BMR.


1)How to identify and recover the corrupted blocks needing media recovery:
To find the corrupted blocks we can use v$database_block_corruption OR v$backup_corruption view
We can use RMAN to populate the he v$backup_corruption view, which will tell us the corruptedblocks are the datafile number on which they resides:

RMAN > backup validate database;

This command will check the datafiles for physical and logical corruption. If the backup validation discovers corrupt blocks,then RMAN updates the V$DATABASE_BLOCK_CORRUPTION and v$backup_corruption views.
Now Run :

SQL> select * from V$backup_corruption;

SQL> select * from V$database_block_corruption;

to find out the corrupted blocks.
It will give you the output like :

FILE# BLOCK BLOCKS CORRUPTION_CHANGE# MAR

15 215 1 0 YES

Now we have the list of corrupted blocks file#=15 and block#=215
We will use the following command to recover corrupted blocks:

RMAN > BLOCKRECOVER DATAFILE 15 BLOCK 215;
RMAN > BLOCKRECOVER CORRUPTION LIST;


The corrupted block will be recovered from last RMAN backup and RMAN will take care of applying archives to make the block current.


2. Using BMR when RMAN backups are not taken ?
If we have a OS level backup of corrupted datafile, it can be cataloged with RMAN to perform BMR.
Find the location of Backup of datafiles which have corruption in it.
Let us suppose the backup of corrupted datafile is at /u01/app/oracle/backup/data01.dbf

Now use the following command to catalog the backup as:
RMAN> catalog datafilecopy ‘/u01/app/oracle/backup/data01.dbf’;


Find the archive log:


RMAN will use the archive logs which are present in V$ARCHIVED_LOG, If the archivelogs are moved to some other location or they are aged out of controlfilethen they need to be cataloged as well:


RMAN> catalog archivelog ‘/u02/app/oracle/backup/archoracle_1_40.dbf’;

Once the datafiles and the required archive logs are cataloged, we can use the BMR as:


RMAN > BLOCKRECOVER datafile 15 block 215;

Restrictions on BLOCKRECOVER

1. The database needs to be in mounted or open stage.

2. Only complete media recovery of block is possible.

3. Only full backups of datafiles containing corrupted blocks can be used to perform block media recovery.

4. If the archivelogs are missing or not accessible, BMR will fail.

5. Header blocks cannot be recovered.

6. Database needs to be ARCHIVELOG mode for performing a BMR.


I will be discussing more on Recovery and data salvaging techniques under Corruption Part-3.

Application Design and ORA – 4031

Much has been written on designing good application (which uses bind variables) so as to have better performance and avoid shared pool memory issues. Tom Kyte explains this on one of his post consequences of not using bind variables

If you do not use bind variables and you flood the server with
hundreds/thousands of unique queries you will
-run dog slow
-consume a
ton of RAM (and maybe run out)
-not scale beyond a handful of users, if
that
among other really bad side effects.

This is very true and have observed it myself many times. But today we will see one more kind of bad application design which I came across working on a ORA – 4031 issue.

Environment was already using CURSOR_SHARING=SIMILAR (application developers were not ready to change code, so it was Kind of quick fix) to force bind variables but we were still observing ORA – 4031 errors. Query on V$SGASTAT gave us following results

http://askoracledba.wordpress.com/2008/04/27/application-design-and-ora-4031/sgastat/

Out of 4 gb allocated to shared pool, around 2.4 G is being used by sql area and library cache indicating that sql are consuming lot of memory in shared pool.

Whenever I observe high memory usage for Sqlarea, I run following sql statements.

1) select sa.sql_text,sa.version_count ,ss.*from v$sqlarea sa,v$sql_shared_cursor ss where sa.address=ss.KGLHDPAR and sa.version_count > 50 order by sa.version_count ;

– This one is to find if child cursors are being generated.

2) Use script from Asktom website to find if application is using Bind variables. You can find ,more information by clicking here

create table t1 as select sql_text from v$sqlarea;
alter table t1 add sql_text_wo_constants varchar2(1000);
create or replace function
remove_constants( p_query in varchar2 ) return varchar2
as
    l_query long;
    l_char  varchar2(1);
    l_in_quotes boolean default FALSE;
begin
    for i in 1 .. length( p_query )
    loop
        l_char := substr(p_query,i,1);
        if ( l_char = '''' and l_in_quotes )
        then
            l_in_quotes := FALSE;
        elsif ( l_char = '''' and NOT l_in_quotes )
        then
            l_in_quotes := TRUE;
            l_query := l_query  '''#';
        end if;
        if ( NOT l_in_quotes ) then
            l_query := l_query  l_char;
        end if;
    end loop;
l_query := translate( l_query, '0123456789', '@@@@@@@@@@' );
    for i in 0 .. 8 loop
        l_query := replace( l_query, lpad('@',10-i,'@'), '@' );
        l_query := replace( l_query, lpad(' ',10-i,' '), ' ' );
    end loop;
    return upper(l_query);
end;
/
update t1 set sql_text_wo_constants = remove_constants(sql_text);

select sql_text_wo_constants, count(*)
  from t1
 group by sql_text_wo_constants
having count(*) > 100
 order by 2
/

On executing first statement , there were no rows returned. But on executing second query , got following results

SELECT NVL(SUM(FEE),:"SYS_B_@") FROM (SELECT NVL(SUM(FREECALLFEE)+SUM(SALARY)+
SUM(SALARY@)+SUM(FREEINFOFEE),:"SYS_B_@") FEE
FROM SCOTT.A_BL_C@_@_F_@ WHERE MSISDN=:MSISDN UNION ALL
SELECT NVL(SUM(COMM),:"SYS_B_@") FEE FROM SCOTT.T_BL_SM_F_@ WHERE MSISDN=:MSISDN )
150

SELECT NVL(SUM(FEE),:"SYS_B_@") FROM (SELECT NVL(SUM(FREECALLFEE)+
SUM(SALARY)+SUM(SALARY@)+SUM(FREEINFOFEE),:"SYS_B_@") FEE FROM
SCOTT.A_BL_C@_@_G_@ WHERE MSISDN=:MSISDN UNION ALL SELECT NVL(SUM(COMM),:"SYS_B_@") FEE
FROM SCOTT.T_BL_SM_G_@ WHERE MSISDN=:MSISDN )
150

SELECT NVL(SUM(FEE),:"SYS_B_@") FROM (SELECT NVL(SUM(FREECALLFEE)+SUM(SALARY)+
SUM(SALARY@)+SUM(FREEINFOFEE),:"SYS_B_@") FEE FROM
SCOTT.A_BL_C@_@_H_@ WHERE MSISDN=:MSISDN UNION ALL SELECT
NVL(SUM(COMM),:"SYS_B_@") FEE FROM SCOTT.T_BL_SM_H_@ WHERE MSISDN=:MSISDN )

150

TRUNCATE TABLE SCOTT.A_BL_C@_@_L_@
310

TRUNCATE TABLE SCOTT.A_BL_C@_@_M_@
310

TRUNCATE TABLE SCOTT.A_BL_C@_@_P_@
310

Ideally this should not have returned any query as we are forcing bind variables.
But looking closely we find that there around 620tables with name of format SCOTT.A_BL_C@_@_M_@ and SCOTT.A_BL_C@_@_P_@(basically @ will be actually a Numeric or a literal) which is cause for so many versions for sql. Application is running set of 20 queries on these tables , so in total there are 20 X620 queries being run , leading to high usage of shared pool memory.

Apart from that we see that all these tables are being truncated .This actually invalidates the cursors and causes a hard parse along with causing shared pool fragmentation

Based on our findings, we gave following recommendation which could be performed to alleviate (if not resolve) the issue.

– Have the shared pool set to 4 Gb itself
– Turn these 620tables into two paritioned tables. This will avoid multiple sql’s in shared pool. But this can have impact on the time taken for query to run and had to be tested
– Instead of truncate tables, use delete operation so that we do not invalidate the cursor. But again catch here is that we will have more redo generation and if data being loaded uses append operation, size for these tables will grow.

As you see this is basically Application design issue and made a application which is not scalable and also leading to downtime.

Physical Corruption: ORA-1578 -Part 1

ORA-1578 is indication of physical block corruption. Oracle computes an internal checksum before fetching the block from disk, when it is about to fetch the block it reads the block header and compare the checksum stored in it. If the checksum mismatches we get ora-1578 error.
It indicates that a block is corrupted at
hardware level i.e at OS.

This error will always come whenever the corrupted block is read by Oracle.

Reasons for ora-1578:

1. Bad sectors in the Storage Disk Drive.
2. Formatting of blocks At OS level.( Zeroed out disk blocks).
3. some corruption in underlying hardware which is making block unreadable.

How to collect information for ora-1578:

Error Text:

ORA-01578: ORACLE data block corrupted (file # 14, block # 38976)

The block # 38976 shows the corrupted block number and the file # 14 indicates
the reletive file number 14.

1. Get the absolute file number:

To get the absolute file # we will use the following query:

SQL > select file# from v$datafile where rfile#=14;

FILE#
——-
14

here the absolute file# is also 14.

2. Get the segment name and segment type in which this corrupted block resides:

Use the following query:

SQL > SELECT tablespace_name, segment_type, owner, segment_name FROM
dba_extents WHERE file_id = 14 and 38976 between block_id AND block_id + blocks – 1;

— It will give us the segment type (Table, Index, LOB etc.), name of the segment (Table name
Index name etc.), Owner of the segment and the tablespace name of the segment in which corrupted block resides.

SQL > select FILE#,TS#,STATUS,BLOCKS,NAME from v$datafile where FILE#=14;

— From this we will get the tablespace number and the fine name of the datafile in which the corrupted block resides.

SQL> select TS#,NAME from v$tablespace where TS#=;

— It will give us the Name of the tablespace.

Now we have all the information required to perform an action to overcome ora-1578.

I will be providing more on this about solving ora-1578 soon.

Deciding “PROCESSES” parameter value

Many times we observe following error which corresponds to low setting of processes parameter.

$ oerr ora 20

00020, 00000, “maximum number of processes (%s) exceeded”

// *Cause: All process state objects are in use.

// *Action: Increase the value of the PROCESSES initialization parameter.




This is mostly seen in new database installation as the processes parameter is not taken into account.

According to Oracle® Database Reference 10g Release 2 (10.2) Part Number B14237-03

PROCESSES specifies the maximum number of operating system user processes that can simultaneously connect to Oracle. Its value should allow for all background processes such as locks, job queue processes, and parallel execution processes.

The default values of the SESSIONS and TRANSACTIONS parameters are derived from this parameter. Therefore, if you change the value of PROCESSES, you should evaluate whether to adjust the values of those derived parameters.




This nrings us to a question that “Can we set up a large value for this parameter so that user’s are not affected and we do not have to restart the system (PROCESSES is a static parameter) ?”

Answer is NO.



You need to calculate ideal value for the processes parameter by taking care of Default Background processes (Also looking at number of Archiver processes, parallel_max_servers) and total number of user connections.

This is required as otherwise you will be required to change the value of SEMMNS parameter (for UNIX only) i.e Max number of semaphores in system and along with this , it will increase the “Overhead” Memory in SGA.

Find below explanation for the above points.

1)SEMMNS parameter

Each Oracle instance needs to have a set amount of semaphores. The total amount of semaphores required is derived from the ‘processes’ parameter specified inOracle instance init.ora file. As more instances and/or databases are added the OS kernel parameter SEMMNS will need to be adjusted accordingly.

According to Metalink Note:153961.1



In Oracle7 , number of semaphores required by an instance is equal to ‘processes’ parameter in the init.ora for the instance.For Oracle8, Oracle8i, Oracle9i and Oracle10g, number of semaphores required by an instance is equal to 2 times the setting of the ‘processes’ parameter in the init.ora for the instance.

However, Oracle only momentarily grabs 2 X ‘processes’ and then releases half at instance startup.




2)Overhead Memory



Parameters like db_files, open_cursors and processes contribute to the variable part of SGA. This can be seen by issuing “Show SGA” command. You will see that “Variable Size” will be more then sum of “Shared Pool + Large Pool + Java Pool”. This is attibuted to the value of these parameters.

Please note that in case you specify a low value for SGA_MAX_SIZE, you will see Oracle bumping the value to higher value so as to accomodate high value of Overhead memory.

Staring from 10g, Overhead memory is accomodated in shared_pool_size.

e.g If you specify SHARED_POOL_SIZE as 200 MB and your internal overhead is 100 Mb, then your actual shared pool value available to instance is only 100Mb.

You can read Note:351018.1 – Minimum for SHARED_POOL_SIZE Parameter in 10.2 Version for more information.

ERRORSTACK- Oracle Debugging Event

One more post arising out of my frequent visits to Oracle forum 😉 Basically there was a question on “How to set events for ORA – 1652 errors”

I had replied to the post but Aman asked me if I could provide any notes regarding Errorstack.I didn’t had one so I informed him that I will be posting on my blog ( Cheap Publicity of my blog 🙂 )

Ok, this was the background to it, now coming to the main thing. Errorstack is used as debugging event for following reasons

  • To extract Failing SQL statement
  • To get Oracle Function call (to be used by Oracle) for interpreting cause for the error (e.g ORA 4030, ORA- 4031). It is also used in case of database hang scenarios. Refer to my earlier article on database hang
  • It also contains execution plan in some cases which can be used to diagnose ORA – 1652 errors (Excessive temp usage)

Errorstack can be set at System level and also at Session level.

e.g To set errorstack for ORA – 1652 ,Syntax will be

alter system set events ‘1652 trace name errorstack level 3’;

If suppose, you want to set it at session level, then

alter session set events ‘1652 trace name errorstack level 3’;

You can also use oradebug to take errorstack for particular process (say ospid 1234) as below

sqlplus “/ as sysdba”

oradebug setospid 1234

oradebug unlimit

oradebug dump errorstack 3

There are different levels for the errorstack with level 3 being the highest.

Note: There is no such level as level 10 or 12 which you might
sometimes find on web.

e.g event=’1401 trace name errorstack, level 12′

Extracted from http://www.orafaq.com/faqdbain.htm .
Specifying this will default to level 3.

Let’s see one trace file for error ORA – 942 i.e “table or view does not exist”

SYS>alter session set events ‘942 trace name errorstack level 3’;
Session altered.
SYS>select * from err;

select * from err

*ERROR at line 1:ORA-00942: table or view does not exist

Trace file is generated in user_dump_dest as a user process has failed.

Trace file Header
/u01/app/oracle/admin/sql10g/udump/sql10g_ora_23773.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /u01/app/oracle/oracle/product/10.2.0
System name: SunOS
Node name: test
Release: 5.9
Version: Generic_117171-12
Machine: sun4u
Instance name: sql10g
Redo thread mounted by this instance: 1
Oracle process number: 28
Unix process pid: 23773, image:
oracle@test(TNS V1-V3)

Here we find information about Database version and Process id along with Operating System.

Failing Sql
ksedmp: internal or fatal error
ORA-00942: table or view does not exist
Current SQL statement for this session:
select * from err

Here we can see that faling sql statement is “select * from err”

Call Stack

—– Call Stack Trace —–
calling call entry argument values in hex
location type point (? means dubious value)
——————– ——– ——————– —————————-
ksedmp()+744 CALL ksedst() 000000840 ?
FFFFFFFF7FFF06FC ?
000000000 ?
FFFFFFFF7FFED1F0 ?
FFFFFFFF7FFEBF58 ?
FFFFFFFF7FFEC958 ?
ksddoa()+1148 PTR_CALL 0000000000000000 000106000 ? 106323304 ?
106323000 ? 000106323 ?
000106000 ? 106323304 ?
ksdpcg()+276 CALL ksddoa() 00000000F ? 10631DCD0 ?
105502DA8 ? 10631D990 ?
000000180 ? 10631E564 ?
ksdpec()+200 CALL ksdpcg() 000000000 ? 000000000 ?
000000000 ? 000000000 ?
0000003AE ? 00010558F ?
ksfpec()+160 CALL ksdpec() 0000003AE ? 10631EDB0 ?

Immediately after Failing sql, we will find call stack which are actually Oracle functions, which can be used by Oracle Support/ Developement to find the reason for error. You will find sometimes some metalink notes containing reference to functions so as to match Bugs.

PROCESS STATE DUMP
PROCESS STATE
————-
Process global information:
process: 38a1df2f8, call: 38a2b3218, xact: 0, curses: 38a264b38, usrses: 38a264b38
—————————————-
SO: 38a1df2f8, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=28, calls cur/top: 38a2b3218/3877b9820, flag: (0) –
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 38a2039c8
O/S info: user: oracle, term: pts/19, ospid: 23773
OSD pid info: Unix process pid: 23773, image: oracle@test (TNS V1-V3)

SO: 38a264b38, type: 4, owner: 38a1df2f8, flag: INIT/-/-/0x00
(session) sid: 71 trans: 0, creator: 38a1df2f8, flag: (41) USR/- BSY/-/-/-/-/-
SO: 38a264b38, type: 4, owner: 38a1df2f8, flag: INIT/-/-/0x00
(session) sid: 71 trans: 0, creator: 38a1df2f8, flag: (41) USR/- BSY/-/-/-/-/-
DID: 0001-001C-0000026A, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 3, prv: 0, sql: 385b129b0, psql: 383567fa0, user: 0/SYS
O/S info: user: oracle, term: pts/19, ospid: 23772, machine: test
program: sqlplus@test (TNS V1-V3)
application name: sqlplus@test (TNS V1-V3), hash value=0
last wait for ‘SQL*Net message from client’ blocking sess=0x0 seq=30 wait_time=24569781 seconds since wait started=23
driver id=62657100, #bytes=1, =0

SO: 387f804d0, type: 53, owner: 38a264b38, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=387f804d0 handle=385b129b0 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=387f80550[387f2d2e8,3877d6288] htb=387f2d2e8 ssga=387f2cce0
user=38a264b38 session=38a264b38 count=1 flags=[0000] savepoint=0x40
LIBRARY OBJECT HANDLE: handle=385b129b0 mutex=385b12ae0(0)
name=select * from err
hash=737b6fa6ce797a460e47cca17ef84a50 timestamp=04-17-2008 13:56:40

Now this gives the process details like Oracle Process id (pid),OS Process Id (ospid) along with wait event which it was waiting on it. It will also show the memory objects (State Objects SO ) associated with this process.

Systemstate dump which is used to diagnose the database hang contains process state dumps for all the processes present in Database.

Execution Plan

As the execution has failed because of ORA- 942 during semantic check, it will not contain the execution plan.

Changing Database Characterset With Export/Import Utilities

Recently there was a thread on Oracle Forums regarding change of Database characterset from WE8ISO8859P1 to AL32UTF8. User wanted to upgrade his database from 9i to 10g and also change the database characterset. He was looking forward to use either Export/Import utilities or Database Upgrade Assistant for the same. There were few responses to this question with responses from few people (NLS being a dry topic for most of DBA’s 🙂 )

To make this issue simpler, we will keep this problem to only change of Database characterset. Export/Import is considered to be a Straightforward approach but we should understand that this should not be the first step.

As per Globalization Support Guide, there are two steps to change the characterset

1)Data Scanning

2)Data Conversion

Export/Import comes under second step, i.e Data Conversion . Data Scanning is a very important step for any Characterset Change and should not be compromised. Failure to do so can result in data loss or data corruption.

In case you are planning to use Export/Import ,Approach should be to first run csscan.

Csscan utility helps to identify if data is Changeless (Data to be stored in same way in new Characterset), Convertible (Data have different code points in new Characterset and needs to be converted), Truncation (Data will be truncated in new Characterset , so column needs to be modified) and Lossy (Data is not understood by new characterset and will be lost on Conversion).

In case only Changeless and Convertible data is there, you can go ahead and use export/import.

If only changeless data is there, then you need to simply use “Alter database ” statement or csalter (in 10g) to change the characterset.

In case you have lossy data, then you would be required to identify what the data is, correct it and then change the characterset.

You should read the following Metalink Notes and Documentation before attempting to change Database Characterset

Oracle® Database Globalization Support Guide10g Release 2 (10.2)Part Number B14225-02

Note:227338.1 – Character Set Scanner – Frequently Asked Questions

Note 257736.1 Changing the Database Character Set – an extended overview

Note 260192.1 Changing WE8ISO8859P1/WE8ISO8859P15 or WE8MSWIN1252 to (AL32)UTF8