Automatic RAC Recovery (Database Cloning)

For the past couple of months I've been working on a project to automatically clone a three node RAC database on AIX to a single instance database from storage snapshots. The clone was to be done daily after our ETL process had completed and was used to offload reporting as well as other purposes.
Initially I thought this would be quite straightforward having done clones using RMAN before. How hard could it be?
The procedure to clone the database is summarized as follows:

1. Generate controlfile  creation script
2. Identify cloned LUNS
3. Rename cloned disk groups e.g. DG_DATA -> DG_DATA_CLONE
4. Mount cloned disk groups
5. Modify/Create pfile or use existing pfile for clone
6. Setup environment variables for clone
7. Create audit directories for clone
8. Startup clone instance in nomount mode and re-create controlfile from script
9. Recover clone database (Apply archivelogs from source DB)
10. Open database with resetlogs
11. Rename database using NID
12. Create new temp tablespace
13. Create entry in /etc/oratab

Step 9 is where things started to get a bit tricky. I could recover the database manually but automatic didn't work. The Oracle Database Backup & Recovery Guide basically describes this as follows:

If you use an Oracle Real Application Clusters configuration, and if you are performing incomplete recovery or using a backup control file, then the database can only compute the name of the first archived redo log file from the first redo thread. You may have to manually apply the first log file from the other redo threads. After the first log file in a given thread has been supplied, the database can suggest the names of the subsequent logs in this thread.

What this means is that issuing a RECOVER DATABASE statement may only apply logs from the first redo thread correctly, and you may see the following errors.

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 178995131 generated at 06/07/2014 09:32:39 needed for thread
1
ORA-00289: suggestion :
+DG_ARCH_NEMOCLN/nemo/archivelog/2014_06_07/thread_1_seq_3056.1280.849605577
ORA-00280: change 178995131 for thread 1 is in sequence #3056


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 178995131 generated at  needed for thread 2
ORA-00289: suggestion : +DG_ARCH_NEMOCLN


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00308: cannot open archived log '+DG_ARCH_NEMOCLN'
ORA-17503: ksfdopn:2 Failed to open file +DG_ARCH_NEMOCLN
ORA-15045: ASM file name '+DG_ARCH_NEMOCLN' is not in reference form


ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01195: online backup of file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG_DATA_NEMOCLN/nemo/datafile/system.260.792524565'
The remaining log will have to be computed and applied manually. To compute the logs needed to recover the database you can execute the following query against the source database.
select thread#, name from v$archived_log where &change_number between first_change# and next_change#;
After finding the correct logs you can then manually apply them to the database. So how can we automate this step?
Since I was working with NetApp Snap Creator Framework to do the storage snapshots I reached out to my friend Neto who happens to work at NetApp and he provided a simple but elegant solution using named pipes.
I hacked up a script to perform steps 9-12. It assumes that the cloned database is already mounted and accepts the name of the target database as a parameter. Feel free to modify or improve as needed.

Many thanks Nascimento (https://twitter.com/@rodrigontap) and neto from Brazil (https://twitter.com/@netofrombrazil) from NetApp for their help.

You can find the code on github.

Comments

Popular posts from this blog

Viewing ASM trace files and alert logs in Unix/Linux

ORA-00020: maximum number of processes (%s) exceeded

Troubleshooting RAC Public Network Failure