Advanced Corruption Resolution with RMAN


When Oracle Database 11g came with automatic recovering procedures, I was somewhat suspicious, but since all the exercises in the Oracle University labs went well in my classes, my perception of  these procedures was somewhat neutral.

The case I present to you today will show that these procedures are not so automatic, and above all, not so flexible that a DBA can use them in any panic situation.

We will start by asking RMAN if we have any problems:

RMAN> advise failure all;

no failures found that match specification

The database we will use for this is the standard one, with the sample schemas tablespace called EXAMPLE, with one datafile:

$ ll
(...)
-rw-r-----  1 oracle oinstall 104865792 example01.dbf
(...)

The datafile has aprox 100 MBytes, but we will mess with the “example01.dbf” datafile where the HR schema has its objects:

$ dd if=/dev/zero of=example01.dbf bs=8192 count=1024
1024+0 records in
1024+0 records out

So now the file has 8 MBytes instead of 100 Mbytes…

$ ll example01.dbf
-rw-r-----  1 oracle oinstall 8388608 example01.dbf

Will the HR schema be able to fetch its data? Take a look:

SQL> select count(*) from hr.employees;
select count(*) from hr.employees
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 5, block # 211)
ORA-01110: data file 5: '/u01/app/oracle/oradata/orcl/example01.dbf'

This is so awsome! Having corruption errors hit your face and not sh*ting you pants is such a relief.

Going back to the RMAN console, let’s try to get advice again:

RMAN> advise failure all;
using target database control file instead of recovery catalog
List of Database Failures
=========================
Failure ID Priority Status    Time Detected Summary
---------- -------- --------- ------------- -------
82         HIGH     OPEN      08-OCT-09     One or more non-system datafiles are corrupt
analyzing automatic repair options; this may take some time
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=134 device type=DISK
analyzing automatic repair options complete
Mandatory Manual Actions
========================
no manual actions available
Optional Manual Actions
=======================
no manual actions available
Automated Repair Options
========================
Option Repair Description
------ ------------------
1      Restore and recover datafile 5
Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/orcl/orcl/hm/reco_105909009.hm

This is the whole output of the “ADVISE” RMAN command, which tells us that there is a corrupt datafile (doesn’t tell you which).

There are no Manual Actions you need to take, only automated. But the only automated task can be manually triggered by executing a script. So let’s do it:

RMAN> @/u01/app/oracle/diag/rdbms/orcl/orcl/hm/reco_105909009.hm
RMAN>    # restore and recover datafile
2>    sql 'alter database datafile 5 offline';
sql statement: alter database datafile 5 offline
RMAN>    restore datafile 5;
Starting restore at 08-OCT-09
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 10/08/2009 11:58:32
ORA-01135: file 5 accessed for DML/query is offline
ORA-01110: data file 5: '/u01/app/oracle/oradata/orcl/example01.dbf'
RMAN-06010: error while looking up datafile: 5
RMAN>    recover datafile 5;
Starting recover at 08-OCT-09
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 10/08/2009 11:58:32
ORA-01135: file 5 accessed for DML/query is offline
ORA-01110: data file 5: '/u01/app/oracle/oradata/orcl/example01.dbf'
RMAN-06010: error while looking up datafile: 5
RMAN>    sql 'alter database datafile 5 online';
sql statement: alter database datafile 5 online
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of sql command on default channel at 10/08/2009 11:58:32
RMAN-11003: failure during parse/execution of SQL statement: alter database datafile 5 online
ORA-01122: database file 5 failed verification check
ORA-01110: data file 5: '/u01/app/oracle/oradata/orcl/example01.dbf'
ORA-01210: data file header is media corrupt
RMAN> **end-of-file**

Looks like the recovery script didn’t do no good… is there any other “automatic” way of doing it? Yes, if you use the REPAIR command:

RMAN> repair failure;
Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/orcl/orcl/hm/reco_105909009.hm
contents of repair script:
# restore and recover datafile
sql 'alter database datafile 5 offline';
restore datafile 5;
recover datafile 5;
sql 'alter database datafile 5 online';
Do you really want to execute the above repair (enter YES or NO)? yes
executing repair script
sql statement: alter database datafile 5 offline
Starting restore at 08-OCT-09
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of repair command at 10/08/2009 13:27:41
RMAN-03015: error occurred in stored script Repair Script
ORA-01135: file 5 accessed for DML/query is offline
ORA-01110: data file 5: '/u01/app/oracle/oradata/orcl/example01.dbf'
RMAN-06010: error while looking up datafile: 5

The REPAIR FAILURE command does exactly the same. Let’s try to do it manually:

RMAN> sql 'alter database datafile 5 offline';
sql statement: alter database datafile 5 offline

The first command went ok. Now the restore:

RMAN> restore datafile 5;
Starting restore at 08-OCT-09
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 10/08/2009 13:30:01
ORA-01135: file 5 accessed for DML/query is offline
ORA-01110: data file 5: '/u01/app/oracle/oradata/orcl/example01.dbf'
RMAN-06010: error while looking up datafile: 5

So the error is in the restore…

Just a wild guess: what if I rename the existing, yet corrupted datafile?

$ mv example01.dbf example01.dbf.old

And try again the restore…

RMAN> restore datafile 5;
Starting restore at 08-OCT-09
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00005 to /u01/app/oracle/oradata/orcl/example01.dbf
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fra/ORCL/backupset/2009_10_08/o1_mf_nnndf_TAG20091008T102933_5dvd7xrb_.bkp
channel ORA_DISK_1: piece handle=/u01/app/oracle/fra/ORCL/backupset/2009_10_08/o1_mf_nnndf_TAG20091008T102933_5dvd7xrb_.bkp
tag=TAG20091008T102933
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
Finished restore at 08-OCT-09

And now do the rest of the process manually:

RMAN> recover datafile 5;
Starting recover at 08-OCT-09
using channel ORA_DISK_1
starting media recovery
media recovery complete, elapsed time: 00:00:00
Finished recover at 08-OCT-09
RMAN> sql 'alter database datafile 5 online';
sql statement: alter database datafile 5 online

And now we can check if the HR schema can select from its own tables:

SQL> select count(*) from hr.employees;
COUNT(*)
----------
107

Everything now is tip-top! Problem solved.

So why did the automatic process failed?

Because what we’ve had wasn’t either a missing file nor corruption. We had a “fake” file, which led us to a case of “fake” corruption, though the message pointed to that direction, it wasn’t corruption. The automatic process didn’t even built a script to handle corruption, otherwise it would use RMAN commands like “RECOVER CORRUPTION LIST;” or “RECOVER DATAFILE 5 BLOCK 211;”.

If we had selected from the Data Dictionary we would see no corruption at all:

SQL> select * from V$DATABASE_BLOCK_CORRUPTION;
no rows selected

So in these cases, the file you have is no good and you should rename or even delete it, and THEN try the automatic process of restoring and recovering it.

Bottom-line: You should take this in consideration when using automatic recovering processes, because when the DBA has the knive on the neck, he will go for anything automatic.

LMC.

211
About these ads

5 thoughts on “Advanced Corruption Resolution with RMAN

  1. É pá, confesso que fiquei surpreendido/baralhado com isto. :-D

    Resolvi fazer alguns testes e ainda fiquei mais surpreendido ao tentar outro cenario,. Em vez de usar o dd criei um ficheiro de texto como mesmo nome do datafile e nessa situação ele consegue fazer a recuperação…
    Resolvi criar o mesmo ficheiro de raiz com o dd, isto é apagando primeiro o ficheiro original e nesta situação ele nem sequer sugere o restore do datafile, sugere usar um antigo export.

    Mas que raio de verificação é que o rman faz na altura do restore!?!?!?

  2. Rui,
    I know you meant well by writting it in portuguese, since we’re both are portuguese, and even more than that we know each other and sometimes we even work together. But I’ll try to make a quick translation of your comment in order to post my answer and everybody else knows what are we are talking about.

    You say that when you remove the datafile and place it with a text file, RMAN can handle it, but when you remove the datafile and replace it by a brand new file build with the “dd” command RMAN “talks” about export. I can’t really understand what the word “export” means in that context, since I’ve tried the exact same thing and the behaviour was the same as explained in my post.~

    My questions are:
    1) What “export” means in your comment?
    2) Which “input file” are you using in your “dd” command? Is it /dev/zero?
    3) Are you using 11gR2?

    Thanks.

    LMC.

  3. Hi Luis,

    I agree it makes sense to continue in english, so everybody can follow this.

    I’m using 11gR2 version
    I have made again the same scenario and I was not able to reproduce it again :(

    yesterday after some tests, I have received an error from the advisor telling me if I have a previous made export file I should use it instead of making a recovery. This happened when I have created a new file with dd, using the same command that you have used on you demo. Has i said, I was not able to reproduce it again, so I probably have made something else…

    But apart from this strange error, I have continued the tests and advisor is able to recover in the following situations:
    1) Creating a file with vi using the same name of datafile, it can recover with no problem using the advisor.
    2) If I copy another datafile and rename it to example01.dbf (I know this is a strange test, but this is a strange situation :-D ), it is able to repair it too.
    ex: cp users01.dbf example01.dbf
    it detects corruption on blocks and is able to repair the problem.

    There are something that dd makes to file structure that doesn’t let rman make the restore but I don’t know what…

    thanks,
    Rui
    So, I’m still confused about the difficulty to repair with the dd scenario

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s