Database – Page 3 – LUXOUG – LUXEMBOURG ORACLE USERS GROUP

ASM, Mount restricted force for recovery

Category: Database. Author: Fernando Simon (Board Member). Date: 5 years ago.

Surviving disk failures is crucial to avoid data corruption, but sometimes, even with redundancy at ASM, multiple failures can happen. This post shows how to use the undocumented feature “mount restricted force for recovery” to resurrect a diskgroup and lose less data when multiple failures occur.

Diskgroup redundancy is a key factor for ASM resilience: it lets you survive disk failures and still continue to run databases. I will not go into detail about ASM disk redundancy here, but usually you can configure your diskgroup without redundancy (EXTERNAL), with double redundancy (NORMAL), with triple redundancy (HIGH), and even with extended redundancy (EXTENDED, for stretch clusters).

If you want to understand more about redundancy, there are a lot of articles at MOS and on the internet that provide useful information. One good one is this. The idea is simple: spread multiple copies across different disks. It gets even better if you group disks into failgroups, so your data has multiple copies in separate places.

As an example, this is key for Exadata, where every storage cell is an independent failgroup and you can survive one entire cell failure (or two, depending on the redundancy of your diskgroup) without data loss. The same idea can be applied to a “normal” environment, where you can create one failgroup for the disks attached to controller A and another for the disks attached to controller B (so the failure of one storage controller does not affect all failgroups). In ASM, if you do not create failgroups, each disk is its own failgroup in diskgroups that have redundancy enabled.

This representation comes from Exadata, but it applies in general: your data will be in at least two different failgroups.

Environment

In the example that I use here, I have one diskgroup called DATA, which has 7 (seven) disks, each one in its own failgroup. The redundancy for this diskgroup is NORMAL, which means that each block is copied to two failgroups. If two failures occur, I will probably have data loss/corruption. Look:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL                           PATH

------------------------------ ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                        CELLI01                        CELLI01                         ORCL:CELLI01

CELLI02                        CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                        CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                        CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                        CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                        CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                        CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                         RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

My GI version is 19.6.0.0, but this can be used with 12.1.0.2 and newer versions (and it works with 11.2.0.4 in some versions). On this server, I have three databases running: DBA19, DBB19, and DBC19.

So, with everything running correctly, each piece of data from my databases will be stored in two failgroups (this is just an illustration, not the real location of the blocks from my databases).

Remember that NORMAL redundancy needs just two copies. So some blocks from datafile 1 of DBA19, as an example, can be stored at CELLI01 and CELLI04. And if your database is small (and your failgroups are big), and you are lucky too, the entire database can be stored in just these two places. In case of a failure that involves only the CELLI02 and CELLI03 failgroups, your data (from DBA19) can be intact.
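To make the idea concrete, here is a minimal Python sketch of this reasoning. It is only a toy model: the placement table, the extent names, and the function `lost_extents` are invented for illustration, and real ASM decides extent placement internally. The sketch just checks which extents lose both mirror copies for a given set of failed failgroups.

```python
# Toy model of NORMAL redundancy: every extent has two copies, each in a
# different failgroup. The placement below is hypothetical; it is NOT read
# from ASM and only illustrates the reasoning in the text.
from typing import Dict, Set, Tuple

Placement = Dict[str, Tuple[str, str]]


def lost_extents(placement: Placement, failed: Set[str]) -> Set[str]:
    """Return the extents whose two mirror copies are both in failed failgroups."""
    return {
        extent
        for extent, copies in placement.items()
        if all(failgroup in failed for failgroup in copies)
    }


# Hypothetical placement of one extent per database.
placement = {
    "DBA19/datafile1/ext0": ("CELLI01", "CELLI04"),
    "DBB19/datafile1/ext0": ("CELLI02", "CELLI03"),
    "DBC19/datafile1/ext0": ("CELLI03", "CELLI07"),
}

# Only the extent mirrored on exactly the two failed failgroups is lost.
print(lost_extents(placement, {"CELLI02", "CELLI03"}))
```

With this invented placement, losing CELLI02 and CELLI03 affects only the extent that had both copies there, while the DBA19 extent on CELLI01 and CELLI04 survives.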

Understanding the failure

Unfortunately, failures (will) happen, and several can happen at the same time. In the diskgroup DATA above, after the second failure, your diskgroup will be dismounted instantly. Usually when this occurs, if you can’t recover from the hardware error, you need to recreate the diskgroup and then restore and recover your databases from backup.

If you are lucky and the failures occur at the same time, you can (most of the time) return the failed disks and try to mount the diskgroup, because there is no difference between the failed disks/failgroups. But the problem occurs if you have one failure (like the CELLI03 failgroup disappearing) and after some time another failgroup fails (like CELLI07). The detail is that between the failures the databases continued to run and change data on disk. So when your first failgroup returns, there are differences.

Another point that is very important to understand is the time to recover from the failure. If you lose one disk/failgroup in ASM, the attributes disk_repair_time and failgroup_repair_time define the time that you have to repair the failure before the rebalance of data takes place. The first (disk_repair_time) is the time that you have to repair a failed disk; if your failgroup has more than one disk, just the failed one is rebalanced. The second (failgroup_repair_time) is the time that you have to repair a failgroup that failed completely.

The interesting part here is that from the moment of the failure until the end of this clock you are susceptible to another failure. If it occurs (more failures than your mirror protection tolerates), you will lose the diskgroup. Another fact is that between the failures your databases continue to run, so if you return the first failed disk/failgroup, you need to sync it.

These “repair times” give you time to fix/recover the failure and avoid the rebalance. Think about the architecture: diskgroups with redundancy are usually big and protect big environments (think of an Exadata, as an example, where each disk can have 14TB and one cell can have up to 12 of them), and rebalancing this amount of data takes a lot of time. To avoid this, if your failed disk is replaced before this time expires, only a sync of the changed blocks is needed.

A “default configuration” has these values:

SQL> select dg.name,a.value,a.name

  2  from v$asm_diskgroup dg, v$asm_attribute a

  3  where dg.group_number=a.group_number

  4  and a.name like '%time'

  5  /




NAME                                     VALUE           NAME

---------------------------------------- --------------- ----------------------------------------

DATA                                     12.0h           disk_repair_time

DATA                                     24.0h           failgroup_repair_time

RECO                                     24.0h           failgroup_repair_time

RECO                                     12.0h           disk_repair_time

SYSTEMDG                                 24.0h           failgroup_repair_time

SYSTEMDG                                 12.0h           disk_repair_time




6 rows selected.




SQL>
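As a small illustration of how these attribute values translate into an actual clock, the Python sketch below converts the strings shown above into seconds. The helper `repair_time_seconds` is hypothetical, and it assumes the value always ends in `h` (hours) or `m` (minutes); any other format raises an error instead of being guessed.

```python
def repair_time_seconds(value: str) -> int:
    """Convert a repair-time string such as '12.0h' or '30m' to seconds.

    Assumption: the value always ends with 'h' (hours) or 'm' (minutes).
    """
    text = value.strip().lower()
    unit_seconds = {"h": 3600, "m": 60}
    unit = text[-1:]
    if unit not in unit_seconds:
        raise ValueError(f"unexpected repair time format: {value!r}")
    return round(float(text[:-1]) * unit_seconds[unit])


# Values copied from the query output for the DATA diskgroup.
attributes = {
    "disk_repair_time": "12.0h",
    "failgroup_repair_time": "24.0h",
}

for name, value in attributes.items():
    print(name, repair_time_seconds(value))
```

For reference, 12 hours is 43200 seconds, which appears to be the same number ASM prints later in its alertlog in the message “Started Drop Disk Timeout for Disk 1 (CELLI01) in group 1 with a value 43200”.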

But think of a scenario where more than one failure occurs, the first in CELLI01 at 08:00 am and the second in CELLI06 at 10:00 am. Now, for two hours, newer versions of blocks were written. If you fix the failure (for CELLI01), there is no guarantee that it has the latest version of everything, and the normal mount will not work.

And this is where mount restricted force for recovery comes in. It allows you to resurrect the diskgroup and helps you restore fewer things. Think of the previous example: if the failures occur at CELLI01 and CELLI06, but your datafiles are in CELLI02 and CELLI07, you lose nothing. Or you restore just some tablespaces and not the whole database. So, there is more to gain than to lose.

Mount restricted force for recovery

Here, I will simulate multiple disk failures (more than one) and show how you can use mount restricted force for recovery. Please be careful and follow all the steps correctly to avoid mistakes and to understand what to do and what is happening.

So, here I have the DATA diskgroup, with normal redundancy and 7 (seven) failgroups, and the DBA19, DBB19, and DBC19 databases running.

So, as the first step, I will simulate a complete failure of the CELLI03 failgroup. In my environment, to allow more control, I have one iSCSI target for each failgroup (this allows me to disconnect them one by one if needed). CELLI03 died:

[root@asmrec ~]# iscsiadm -m session

tcp: [11] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.d65b214fca9a (non-flash) CELLI04

tcp: [14] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.637b3bbfa86d (non-flash) CELLI07

tcp: [17] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.2f4cdb93107c (non-flash) CELLI05

tcp: [2] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.bb66b92348a7 (non-flash)  CELLI03

tcp: [20] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.57c0a000e316 (non-flash) (SYS)

tcp: [23] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.89ef4420ea4d (non-flash) CELLI06

tcp: [5] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.eff4683320e8 (non-flash)  CELLI01

tcp: [8] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.7d8f4c8f5012 (non-flash)  CELLI02

[root@asmrec ~]#

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -u

Logging out of session [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260]

Logout of [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

And in the ASM alertlog we can see:

2020-03-22T17:14:11.589115+01:00

NOTE: process _user8100_+asm1 (8100) initiating offline of disk 9.4042310133 (CELLI03) with mask 0x7e in group 1 (DATA) with client assisting

NOTE: checking PST: grp = 1

2020-03-22T17:14:11.589394+01:00

GMON checking disk modes for group 1 at 127 for pid 40, osid 8100

2020-03-22T17:14:11.589584+01:00

NOTE: checking PST for grp 1 done.

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0xf0f0c1f5, mask = 0x6a, op = clear mandatory

2020-03-22T17:14:11.589746+01:00

GMON updating disk modes for group 1 at 128 for pid 40, osid 8100

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1

WARNING: Write Failed. group:1 disk:9 AU:1 offset:4190208 size:4096

path:ORCL:CELLI03

         incarnation:0xf0f0c1f5 synchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f9182f72210 bufp:0x7f9182f78000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: found another non-responsive disk 9.4042310133 (CELLI03) that will be offlined

So, the failure occurred at 17:14. The full output can be found here at ASM-ALERTLOG-Output-Failure-CELLI03.txt

And we can see that the disk disappeared (but was not deleted or dropped) from ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01                        CELLI01                         ORCL:CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

At this point, ASM starts counting the 12-hour clock (as defined in my repair attributes). The failgroup was not dropped and no rebalance is running because ASM is optimistic that you will fix the issue within this period.

But after some time I had a second failure in the diskgroup:

Now in the ASM alertlog you can see that the diskgroup was dismounted (along with several other messages). Below is a cropped excerpt from the alertlog. The full output (which I think deserves a look) is here at ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01.txt

2020-03-22T17:18:39.699555+01:00

WARNING: Write Failed. group:1 disk:1 AU:1 offset:4190208 size:4096

path:ORCL:CELLI01

         incarnation:0xf0f0c1f3 asynchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f9182f833d0 bufp:0x7f91836ef000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: Hbeat write to PST disk 1.4042310131 in group 1 failed. [2]

2020-03-22T17:18:39.704035+01:00

...

...

2020-03-22T17:18:39.746945+01:00

NOTE: cache closing disk 9 of grp 1: (not open) CELLI03

ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.

2020-03-22T17:18:39.747462+01:00

ERROR: too many offline disks in PST (grp 1)

2020-03-22T17:18:39.759171+01:00

NOTE: cache dismounting (not clean) group 1/0xB48031B9 (DATA)

NOTE: messaging CKPT to quiesce pins Unix process pid: 12050, image: oracle@asmrec.oralocal (B001)

2020-03-22T17:18:39.761807+01:00

NOTE: halting all I/Os to diskgroup 1 (DATA)

2020-03-22T17:18:39.766289+01:00

NOTE: LGWR doing non-clean dismount of group 1 (DATA) thread 1

NOTE: LGWR sync ABA=23.3751 last written ABA 23.3751

...

...

2020-03-22T17:18:40.207406+01:00

SQL> alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */

...

...

2020-03-22T17:18:40.841979+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:

ORA-15130: diskgroup "DATA" is being dismounted

2020-03-22T17:18:40.853738+01:00

...

...

ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.

2020-03-22T17:18:40.861939+01:00

ERROR: too many offline disks in PST (grp 1)

...

...

2020-03-22T17:18:43.214368+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:

ORA-15130: diskgroup "DATA" is being dismounted

2020-03-22T17:18:43.214885+01:00

NOTE: client DBC19:DBC19:asmrec no longer has group 1 (DATA) mounted

2020-03-22T17:18:43.215492+01:00

NOTE: client DBB19:DBB19:asmrec no longer has group 1 (DATA) mounted

NOTE: cache deleting context for group DATA 1/0xb48031b9

...

...

2020-03-22T17:18:43.298551+01:00

SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */

SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA

2020-03-22T17:18:43.352003+01:00

SQL> ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {0:1:9} */

2020-03-22T17:18:43.372816+01:00

NOTE: cache registered group DATA 1/0xB44031BF

NOTE: cache began mount (first) of group DATA 1/0xB44031BF

NOTE: Assigning number (1,8) to disk (ORCL:CELLI02)

NOTE: Assigning number (1,0) to disk (ORCL:CELLI04)

NOTE: Assigning number (1,11) to disk (ORCL:CELLI05)

NOTE: Assigning number (1,3) to disk (ORCL:CELLI06)

NOTE: Assigning number (1,2) to disk (ORCL:CELLI07)

2020-03-22T17:18:43.514642+01:00

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1

2020-03-22T17:18:46.089517+01:00

NOTE: detected and added orphaned client id 0x10010

NOTE: detected and added orphaned client id 0x1000e

So, the second failure occurred at 17:18 and led to a forced dismount of the diskgroup. You can see messages like “NOTE: cache dismounting (not clean)”, “ERROR: too many offline disks in PST (grp 1)”, and even “ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline”.

So, probably some data was lost. And if you consider that data was changed in the databases during these 4 minutes, the mess is big. If you want to see the alertlog from the databases, check here at ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure.txt
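The size of that window can be computed directly from the two alertlog timestamps quoted earlier: the offline of CELLI03 and the first write failure on CELLI01. The short Python sketch below does only that arithmetic; the variable names are mine, and it assumes Python 3.7 or newer for `datetime.fromisoformat`.

```python
from datetime import datetime

# Timestamps copied from the ASM alertlog excerpts above.
first_failure = datetime.fromisoformat("2020-03-22T17:14:11.589115+01:00")   # CELLI03 offlined
second_failure = datetime.fromisoformat("2020-03-22T17:18:39.699555+01:00")  # write failed on CELLI01

# Period in which the databases kept writing without CELLI03.
stale_window = second_failure - first_failure
print(stale_window)
```

The difference is a little under four and a half minutes, which is the period in which the databases kept changing blocks that the first failed failgroup never received.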

And now we have this at ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01

                                                                        CELLI02                         ORCL:CELLI02

                                                                        CELLI04                         ORCL:CELLI04

                                                                        CELLI05                         ORCL:CELLI05

                                                                        CELLI06                         ORCL:CELLI06

                                                                        CELLI07                         ORCL:CELLI07




7 rows selected.




SQL>

And if we try to mount, we receive an error due to the offline disks:

SQL> alter diskgroup data mount;

alter diskgroup data mount

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15040: diskgroup is incomplete

ORA-15042: ASM disk "9" is missing from group number "1"

ORA-15042: ASM disk "1" is missing from group number "1"


SQL>
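If you capture this kind of output in a script, the missing disk numbers can be pulled out of the ORA-15042 lines and matched with the names seen in the alertlog. The Python sketch below is one way to do it; the function `missing_disks` and the `disk_names` mapping are illustrative only, with the mapping filled in by hand from the alertlog messages above (disk 9 is CELLI03, disk 1 is CELLI01).

```python
import re
from typing import List, Tuple

# Matches lines such as: ORA-15042: ASM disk "9" is missing from group number "1"
MISSING_DISK = re.compile(
    r'ORA-15042: ASM disk "(\d+)" is missing from group number "(\d+)"'
)


def missing_disks(output: str) -> List[Tuple[int, int]]:
    """Return (group_number, disk_number) pairs found in ORA-15042 messages."""
    return [
        (int(group), int(disk))
        for disk, group in MISSING_DISK.findall(output)
    ]


# Error stack copied from the failed mount attempt.
mount_error = """
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "9" is missing from group number "1"
ORA-15042: ASM disk "1" is missing from group number "1"
"""

# Hand-written mapping taken from the alertlog messages, not queried from ASM.
disk_names = {(1, 9): "CELLI03", (1, 1): "CELLI01"}

for key in missing_disks(mount_error):
    print(key, disk_names.get(key, "unknown"))
```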

Now comes the key decision. If you have important data that is worth the effort of trying to recover, you can continue. It is your decision, based on several details. Since the diskgroup is dismounted, the repair time is not counting, and you have days to work on the recovery. Sometimes one day stopped is better than several days spent recovering all databases from the last backup.

Imagine that you can bring back online the first failed failgroup (CELLI03), which is 4 minutes behind on data:

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -l

Logging in to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] (multiple)

Login to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

And if you try to mount it normally you will receive an error (the alertlog output for this attempt can be seen here at ASM-ALERTLOG-Output-Mout-With-One-Disk-Online):

SQL> alter diskgroup data mount;

alter diskgroup data mount

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15017: diskgroup "DATA" cannot be mounted

ORA-15066: offlining disk "1" in group "DATA" may result in a data loss

SQL>

So, now we can try the mount restricted force for recovery:

SQL> alter diskgroup data mount restricted force for recovery;




Diskgroup altered.




SQL>

The ASM alertlog (which you can see in full here at ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery.txt) reports messages related to the cache of the diskgroup and to the disks that need to be checked. And now we are like this:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01

                                                                        CELLI03                         ORCL:CELLI03




10 rows selected.




SQL>

The next step is to bring online the failgroup that came back:

SQL> alter diskgroup data online disks in failgroup CELLI03;




Diskgroup altered.




SQL>

Doing this, ASM will resync this failgroup (using its blocks as the latest version) and bring the cache of this disk online. In the ASM alertlog you can see messages like these (full output here at ASM-ALERTLOG-Output-Online-Restored-Failgroup):

2020-03-22T17:27:47.729003+01:00

SQL> alter diskgroup data online disks in failgroup CELLI03

2020-03-22T17:27:47.729551+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

2020-03-22T17:27:47.729640+01:00

NOTE: cache closing disk 9 of grp 1: (not open) CELLI03

2020-03-22T17:27:47.730398+01:00

NOTE: GroupBlock outside rolling migration privileged region

NOTE: initiating resync of disk group 1 disks

CELLI03 (9)




NOTE: process _user6891_+asm1 (6891) initiating offline of disk 9.4042310248 (CELLI03) with mask 0x7e in group 1 (DATA) without client assisting

2020-03-22T17:27:47.737580+01:00

...

...

2020-03-22T17:27:47.796524+01:00

NOTE: disk validation pending for 1 disk in group 1/0x1d7031d4 (DATA)

NOTE: Found ORCL:CELLI03 for disk CELLI03

NOTE: completed disk validation for 1/0x1d7031d4 (DATA)

2020-03-22T17:27:47.935467+01:00

...

...

2020-03-22T17:27:48.116572+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

NOTE: cache opening disk 9 of grp 1: CELLI03 label:CELLI03

2020-03-22T17:27:48.117158+01:00

SUCCESS: refreshed membership for 1/0x1d7031d4 (DATA)

2020-03-22T17:27:48.123545+01:00

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x5d, op = assign mandatory

...

...

2020-03-22T17:27:48.142068+01:00

NOTE: PST update grp = 1 completed successfully

2020-03-22T17:27:48.143197+01:00

SUCCESS: alter diskgroup data online disks in failgroup CELLI03

2020-03-22T17:27:48.577277+01:00

NOTE: Attempting voting file refresh on diskgroup DATA

NOTE: Refresh completed on diskgroup DATA. No voting file found.

...

...

2020-03-22T17:27:48.643277+01:00

NOTE: Starting resync using Staleness Registry and ATE scan for group 1

2020-03-22T17:27:48.696075+01:00

NOTE: Starting resync using Staleness Registry and ATE scan for group 1

NOTE: header on disk 9 advanced to format #2 using fcn 0.0

2020-03-22T17:27:49.725837+01:00

WARNING: Started Drop Disk Timeout for Disk 1 (CELLI01) in group 1 with a value 43200

2020-03-22T17:27:57.301042+01:00

...

2020-03-22T17:27:59.687480+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

NOTE: reset timers for disk: 9

NOTE: completed online of disk group 1 disks

CELLI03 (9)




2020-03-22T17:27:59.714674+01:00

ERROR: ORA-15421 thrown in ARBA for group number 1

2020-03-22T17:27:59.714805+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arba_8786.trc:

ORA-15421: Rebalance is not supported when the disk group is mounted for recovery.

2020-03-22T17:27:59.715047+01:00

NOTE: stopping process ARB0

NOTE: stopping process ARBA

2020-03-22T17:28:00.652115+01:00

NOTE: rebalance interrupted for group 1/0x1d7031d4 (DATA)

And now we have this at ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

And the rebalance does not continue because it is not allowed while the diskgroup is mounted in restricted mode for recovery:

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1                                                                                                               0

         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0

         1            1 REBAL REBUILD   WAIT          1                                                                                                               0

         1            1 REBAL RESYNC    WAIT          1                                                                                                               0




SQL>

But since the failgroup came online “in a forced way”, the old cache (from CELLI01) needs to be cleaned. And since it is not the latest version, maybe some files were corrupted. To check this, you can look at the ARB process trace files in the ASM trace directory:

[root@asmrec trace]# ls -lFhtr *arb*

...

...

-rw-r----- 1 grid oinstall 6.4K Mar 22 17:10 +ASM1_arb0_3210.trm

-rw-r----- 1 grid oinstall  44K Mar 22 17:10 +ASM1_arb0_3210.trc

-rw-r----- 1 grid oinstall  984 Mar 22 17:27 +ASM1_arb0_8788.trm

-rw-r----- 1 grid oinstall 2.1K Mar 22 17:27 +ASM1_arb0_8788.trc

-rw-r----- 1 grid oinstall  882 Mar 22 17:27 +ASM1_arba_8786.trm

-rw-r----- 1 grid oinstall 1.2K Mar 22 17:27 +ASM1_arba_8786.trc

[root@asmrec trace]#

And looking at one of the latest, we can see that some extents (which do not exist in the recovered failgroup, or whose cached copy is not the latest one) were filled with dummy (BADFDA7A) data:

[root@asmrec trace]# cat +ASM1_arb0_8788.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8788.trc

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.6.0.0.0

Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417

ORACLE_HOME:    /u01/app/19.0.0.0/grid

System name:    Linux

Node name:      asmrec.oralocal

Release:        4.14.35-1902.10.8.el7uek.x86_64

Version:        #2 SMP Thu Feb 6 11:02:28 PST 2020

Machine:        x86_64

Instance name: +ASM1

Redo thread mounted by this instance: 0 <none>

Oracle process number: 40

Unix process pid: 8788, image: oracle@asmrec.oralocal (ARB0)







*** 2020-03-22T17:27:59.044949+01:00

*** SESSION ID:(402.55837) 2020-03-22T17:27:59.044969+01:00

*** CLIENT ID:() 2020-03-22T17:27:59.044975+01:00

*** SERVICE NAME:() 2020-03-22T17:27:59.044980+01:00

*** MODULE NAME:() 2020-03-22T17:27:59.044985+01:00

*** ACTION NAME:() 2020-03-22T17:27:59.044989+01:00

*** CLIENT DRIVER:() 2020-03-22T17:27:59.044994+01:00




 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery




*** 2020-03-22T17:27:59.680119+01:00

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x7f, op = assign mandatory

kfdp_updateDsk(): callcnt 195 grp 1

PST verChk -0: req, id=266369333, grp=1, requested=91 at 03/22/2020 17:27:59

NOTE: PST update grp = 1 completed successfully

NOTE: kfdsFilter_freeDskSrSlice for Filter 0x7fbaf6238d38

NOTE: kfdsFilter_clearDskSlice for Filter 0x7fbaf6238d38 (all:TRUE)

NOTE: completed online of disk group 1 disks

CELLI03 (9)

[root@asmrec trace]#
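Since the same warning is repeated several times per extent, it helps to reduce the trace to a distinct list of affected ASM file and extent numbers. The Python sketch below is one possible way to do that; the function name `filled_extents` is hypothetical, and the regular expression only covers the message format shown in this trace.

```python
import re
from typing import Iterable, List, Tuple

# Matches the warning format seen in the ARB0 trace file above.
FILL_PATTERN = re.compile(
    r"WARNING: group (\d+), file (\d+), extent (\d+): "
    r"filling extent with BADFDA7A during recovery"
)


def filled_extents(lines: Iterable[str]) -> List[Tuple[int, ...]]:
    """Return distinct (group, file, extent) triples in first-seen order."""
    seen = {}
    for line in lines:
        match = FILL_PATTERN.search(line)
        if match:
            key = tuple(int(part) for part in match.groups())
            seen.setdefault(key, None)  # dict keeps insertion order
    return list(seen)


# A few lines copied from the trace, plus one unrelated line.
sample = [
    " WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery",
    " WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery",
    " WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery",
    "NOTE: PST update grp = 1 completed successfully",
]

print(filled_extents(sample))  # [(1, 266, 22), (1, 258, 7)]
```

To run it against a real trace, pass an open file handle instead of the sample list, for example `filled_extents(open("+ASM1_arb0_8788.trc"))`. In this trace the result is just two extents, in ASM files 266 and 258 of group 1.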

And as you can imagine, this will lead to files that need to be restored from backup. But note that it is just some data, not everything. Remember from the beginning of the post that this depends on how your data is distributed across the ASM failgroups. If you are lucky, only a little data is impacted. This depends on a lot of factors, such as the time that the failgroup was offline, the size of the failgroup, the activity of your databases, and many others. But the gains can be good and make it worth the effort.

After that, we can normally dismount the diskgroup:

SQL> alter diskgroup data dismount;




Diskgroup altered.




SQL>

And mount it again:

SQL> alter diskgroup data mount;




Diskgroup altered.




SQL>

Since the diskgroup is now mounted in a clean way, you can continue with the rebalance:

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1                                                                                                               0

         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0

         1            1 REBAL REBUILD   WAIT          1                                                                                                               0

         1            1 REBAL RESYNC    WAIT          1                                                                                                               0




SQL> alter diskgroup DATA rebalance;




Diskgroup altered.




SQL>
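If you capture the gv$asm_operation output to a text file, a small script can point out which rebalance passes ended in error before you restart the rebalance. The following is only a minimal sketch: the function name and the sample text are illustrative, and it assumes the whitespace-separated layout shown in the output above (INST_ID, GROUP_NUMBER, OPERATION, PASS, STATE first).

```python
def failed_passes(report):
    """Return (pass, error_code) pairs for rows whose state is ERRS."""
    failed = []
    for line in report.splitlines():
        tokens = line.split()
        # Data rows start with two integers (INST_ID, GROUP_NUMBER),
        # followed by OPERATION, PASS and STATE. Headers are skipped.
        if len(tokens) < 5 or not (tokens[0].isdigit() and tokens[1].isdigit()):
            continue
        pass_name, state = tokens[3], tokens[4]
        if state == "ERRS":
            # The error code, when present, is the token starting with ORA-
            error = next((t for t in tokens[5:] if t.startswith("ORA-")), "")
            failed.append((pass_name, error))
    return failed


# Shortened copy of the rows shown above (assumed layout).
SAMPLE = """\
   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER ERROR_CODE CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ------
         1            1 REBAL COMPACT   WAIT          1                 0
         1            1 REBAL REBALANCE ERRS          1 ORA-15421       0
         1            1 REBAL REBUILD   WAIT          1                 0
         1            1 REBAL RESYNC    WAIT          1                 0
"""

print(failed_passes(SAMPLE))
```

Splitting on whitespace is enough here because the first five columns are never empty; the remaining numeric columns can be blank, so they are not read by position.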

The state on the ASM side is:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

As you can see, CELLI01 was not removed yet (I will come back to it later). But the work can continue and the databases can be checked.

Database side

On the database side we need to check what we lost and what needs to be recovered. Since I am using a cluster, GI tried to start the databases (and as you can see, two came up):

[oracle@asmrec ~]$ ps -ef |grep smon

root      8254     1  2 13:53 ?        00:04:40 /u01/app/19.0.0.0/grid/bin/osysmond.bin

grid      8750     1  0 13:54 ?        00:00:00 asm_smon_+ASM1

oracle   11589     1  0 17:31 ?        00:00:00 ora_smon_DBB19

oracle   11751     1  0 17:31 ?        00:00:00 ora_smon_DBA19

oracle   18817 29146  0 17:44 pts/9    00:00:00 grep --color=auto smon

[oracle@asmrec ~]$

DBA19

The first one that I checked was DBA19. I used RMAN to run VALIDATE DATABASE:

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:45:21 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: DBA19 (DBID=828667324)




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=260 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBA19/DATAFILE/system.256.1035153873

input datafile file number=00004 name=+DATA/DBA19/DATAFILE/undotbs1.258.1035153973

input datafile file number=00003 name=+DATA/DBA19/DATAFILE/sysaux.257.1035153927

input datafile file number=00007 name=+DATA/DBA19/DATAFILE/users.259.1035153975

channel ORA_DISK_1: validation complete, elapsed time: 00:03:45

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              17722        117766          5042446

  File Name: +DATA/DBA19/DATAFILE/system.256.1035153873

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              79105

  Index      0              13210

  Other      0              7723




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19445        67862           5042695

  File Name: +DATA/DBA19/DATAFILE/sysaux.257.1035153927

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7988

  Index      0              5531

  Other      0              34876




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    FAILED 1              49           83247           5042695

  File Name: +DATA/DBA19/DATAFILE/undotbs1.258.1035153973

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      511            83151




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              93           641             4941613

  File Name: +DATA/DBA19/DATAFILE/users.259.1035153975

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              65

  Index      0              15

  Other      0              467




validate found one or more corrupt blocks

See trace file /u01/app/oracle/diag/rdbms/dba19/DBA19/trace/DBA19_ora_19219.trc for details

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> shutdown abort;




Oracle instance shut down




RMAN> startup mount;




connected to target database (not started)

Oracle instance started

database mounted




Total System Global Area    1610610776 bytes




Fixed Size                     8910936 bytes

Variable Size                859832320 bytes

Database Buffers             734003200 bytes

Redo Buffers                   7864320 bytes




RMAN> run{

2> restore datafile 4;

3> recover datafile 4;

4> }




Starting restore at 22-MAR-20

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=249 device type=DISK




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00004 to +DATA/DBA19/DATAFILE/undotbs1.258.1035153973

channel ORA_DISK_1: reading from backup piece /tmp/9puro5qr_1_1

channel ORA_DISK_1: piece handle=/tmp/9puro5qr_1_1 tag=BKP-DB-INC0

channel ORA_DISK_1: restored backup piece 1

channel ORA_DISK_1: restore complete, elapsed time: 00:00:45

Finished restore at 22-MAR-20




Starting recover at 22-MAR-20

using channel ORA_DISK_1




starting media recovery

media recovery complete, elapsed time: 00:00:02




Finished recover at 22-MAR-20




RMAN> alter database open;




Statement processed




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

[oracle@asmrec ~]$

As you can see, datafile 4 FAILED and needed to be restored and recovered. Luckily, the redo was not affected, so the open was OK. Since the failed file was the UNDO, I used shutdown abort (a shutdown immediate could take an eternity and, with the undo unavailable, nothing was happening inside the database anyway).

But as you saw, just one datafile was corrupted. Of course, with big databases and big failgroups, more files will be corrupted. But it is a shot that can be worth it.
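With many datafiles, reading the VALIDATE report by eye gets tedious. As a minimal sketch (the function names and the sample text are illustrative, not part of RMAN), the summary rows of the report can be scanned for the FAILED status and turned into a restore/recover block like the one used above. It assumes the report layout shown in this post and that the database is mounted, as in the transcript.

```python
import re

# Matches the summary row printed under the
# "File Status Marked Corrupt ..." header, e.g. "4    FAILED 1 ..."
ROW = re.compile(r"^(\d+)\s+(OK|FAILED)\s+\d+")


def failed_datafiles(validate_output):
    """Return the datafile numbers that the report marks as FAILED."""
    failed = []
    for line in validate_output.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(2) == "FAILED":
            failed.append(int(match.group(1)))
    return failed


def restore_script(file_numbers):
    """Build an RMAN run block that restores and recovers each file."""
    body = []
    for number in file_numbers:
        body.append(f"  restore datafile {number};")
        body.append(f"  recover datafile {number};")
    return "run {\n" + "\n".join(body) + "\n}"


# Shortened copy of the DBA19 report shown above.
SAMPLE = """\
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
3    OK     0              19445        67862           5042695
  File Name: +DATA/DBA19/DATAFILE/sysaux.257.1035153927
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
4    FAILED 1              49           83247           5042695
  File Name: +DATA/DBA19/DATAFILE/undotbs1.258.1035153973
"""

files = failed_datafiles(SAMPLE)
print(files)
print(restore_script(files))
```

The generated block is meant to be reviewed before it is pasted into RMAN; it does not check that a usable backup exists for each file.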

DBB19

The second was DBB19 and I used the same approach, VALIDATE DATABASE:

[oracle@asmrec ~]$ export ORACLE_SID=DBB19

[oracle@asmrec ~]$

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:55:20 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




PL/SQL package SYS.DBMS_BACKUP_RESTORE version 19.03.00.00 in TARGET database is not current

PL/SQL package SYS.DBMS_RCVMAN version 19.03.00.00 in TARGET database is not current

connected to target database: DBB19 (DBID=1336872427)




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=374 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051

input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177

input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235

input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241

channel ORA_DISK_1: validation complete, elapsed time: 00:00:35

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              16763        116487          3861452

  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              78871

  Index      0              13010

  Other      0              7836




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19307        62758           3861452

  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7459

  Index      0              5158

  Other      0              30796




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            35847           3652497

  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              35839




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              85           641             3759202

  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              70

  Index      0              15

  Other      0              470




channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> VALIDATE CHECK LOGICAL DATABASE;




Starting validate at 22-MAR-20

using channel ORA_DISK_1

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051

input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177

input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235

input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241

channel ORA_DISK_1: validation complete, elapsed time: 00:00:35

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              16763        116487          3861452

  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              78871

  Index      0              13010

  Other      0              7836




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19307        62758           3861452

  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7459

  Index      0              5158

  Other      0              30796




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            35847           3652497

  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              35839




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              85           641             3759202

  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              70

  Index      0              15

  Other      0              470




channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

[oracle@asmrec ~]$

[oracle@asmrec ~]$

As you saw, there were no failures for DBB19. Since the plain validate returned no failed files, I also ran VALIDATE CHECK LOGICAL DATABASE to check the blocks logically.

DBC19

The same for the last database, but this time datafile 3 failed:

[oracle@asmrec ~]$ export ORACLE_SID=DBC19

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 18:01:33 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database (not started)




RMAN> startup mount;




Oracle instance started

database mounted




Total System Global Area    1610610776 bytes




Fixed Size                     8910936 bytes

Variable Size                864026624 bytes

Database Buffers             729808896 bytes

Redo Buffers                   7864320 bytes




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=134 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBC19/DATAFILE/system.262.1035154053

input datafile file number=00004 name=+DATA/DBC19/DATAFILE/undotbs1.270.1035154249

input datafile file number=00003 name=+DATA/DBC19/DATAFILE/sysaux.266.1035154181

input datafile file number=00007 name=+DATA/DBC19/DATAFILE/users.271.1035154253

channel ORA_DISK_1: validation complete, elapsed time: 00:03:15

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              17777        117764          4188744

  File Name: +DATA/DBC19/DATAFILE/system.262.1035154053

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              79161

  Index      0              13182

  Other      0              7640




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    FAILED 1              19272        66585           4289434

  File Name: +DATA/DBC19/DATAFILE/sysaux.266.1035154181

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7311

  Index      0              4878

  Other      511            35099




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            84522           4188748

  File Name: +DATA/DBC19/DATAFILE/undotbs1.270.1035154249

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              84479




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              93           641             3717377

  File Name: +DATA/DBC19/DATAFILE/users.271.1035154253

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              65

  Index      0              15

  Other      0              467




validate found one or more corrupt blocks

See trace file /u01/app/oracle/diag/rdbms/dbc19/DBC19/trace/DBC19_ora_22091.trc for details

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> run{

2> restore datafile 3;

3> recover datafile 3;

4> }




Starting restore at 22-MAR-20

using channel ORA_DISK_1




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00003 to +DATA/DBC19/DATAFILE/sysaux.266.1035154181

channel ORA_DISK_1: reading from backup piece /tmp/0buro5rh_1_1

channel ORA_DISK_1: piece handle=/tmp/0buro5rh_1_1 tag=BKP-DB-INC0

channel ORA_DISK_1: restored backup piece 1

channel ORA_DISK_1: restore complete, elapsed time: 00:00:45

Finished restore at 22-MAR-20




Starting recover at 22-MAR-20

using channel ORA_DISK_1




starting media recovery




archived log for thread 1 with sequence 25 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103

archived log for thread 1 with sequence 26 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_26.329.1035739907

archived log for thread 1 with sequence 27 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_27.332.1035741283

archived log file name=+RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103 thread=1 sequence=25

media recovery complete, elapsed time: 00:00:03

Finished recover at 22-MAR-20




RMAN> alter database open;




Statement processed




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

Dropping failgroup

If fixing the remaining failgroup takes too long, it will be dropped automatically. But we can also drop it manually with FORCE (note that without FORCE it fails):

SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01;

ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15084: ASM disk "CELLI01" is offline and cannot be dropped.







SQL>

SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01 FORCE;




Diskgroup altered.




SQL>

And after the rebalance finishes, the disk is removed:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

_DROPPED_0001_DATA                       CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1          1          0          0          0           0                                                       0

         1            1 REBAL REBALANCE WAIT          1          1          0          0          0           0                                                       0

         1            1 REBAL REBUILD   RUN           1          1        292        642        666           0                                                       0

         1            1 REBAL RESYNC    DONE          1          1          0          0          0           0                                                       0




SQL> select * from gv$asm_operation;




no rows selected




SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




8 rows selected.




SQL>

The steps for MOUNT RESTRICTED FORCE FOR RECOVERY

To summarize, the steps needed are (in order):

Bring the failed disk/failgroup back so that ASM can see it again
Execute alter diskgroup <DG> mount restricted force for recovery
Bring the failgroup online with alter diskgroup <DG> online disks in failgroup <FG>
Dismount the diskgroup cleanly with alter diskgroup <DG> dismount
Mount the diskgroup cleanly with alter diskgroup <DG> mount
Check the databases for failures and recover them
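The ASM statements in this list can be generated in the right order for a given diskgroup and failgroup. This is only a sketch: the helper name and the name check are my own additions, and it only prints text, so the statements still have to be run by hand in the ASM instance. The first step (hardware) and the last one (database validation) stay manual.

```python
import re

# Conservative check for simple, unquoted names (an assumption for this
# sketch, not the full Oracle naming rules).
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def recovery_statements(diskgroup, failgroup):
    """Return the ASM statements from the summary, in execution order."""
    for name in (diskgroup, failgroup):
        if not _NAME.fullmatch(name):
            raise ValueError(f"unexpected name: {name!r}")
    return [
        f"alter diskgroup {diskgroup} mount restricted force for recovery;",
        f"alter diskgroup {diskgroup} online disks in failgroup {failgroup};",
        f"alter diskgroup {diskgroup} dismount;",
        f"alter diskgroup {diskgroup} mount;",
    ]


# Names taken from the example in this post: diskgroup DATA, failgroup CELLI03.
for statement in recovery_statements("DATA", "CELLI03"):
    print(statement)
```

Keeping the statements in one ordered list makes it harder to skip the clean dismount and mount after the restricted mount.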

Undocumented feature

So, the question is: why is it undocumented? I don’t have the answer, but I can guess at some reasons. For me, the most important one is that it is not a full, clean return. You still need to restore and recover from backup, and you may lose a lot of data.

Of course, this example is a controlled scenario: I have just a few databases and my failgroup has just one disk inside. In real life, the problem will be worse. More diskgroups can be affected, such as RECO/REDO/FRA. You will probably lose some redo logs and archivelogs too, so you cannot do a clean recovery. You may even need to recover the OCR and the voting disk of the cluster.

This is where correct architecture design matters: if you need more protection on the ASM side, you can use HIGH redundancy to survive two failures without interruption. This is why SYSTEMDG (or the OCR/voting disk diskgroup) uses high redundancy on Exadata.

Outages and failures can occur in different layers of your environment. But storage/disk failures are catastrophic for databases because they can lead to data corruption, and you need to use backups to recover from it. They can occur in any environment, from ordinary storage to Exadata. I had one on an old Exadata V2 in 2016, used just for DEV databases, where two storage cells crashed (one hour apart), and I needed to use this procedure to save some files and reduce the downtime by avoiding a restore of everything (more than 10TB).

So, it is good to know this kind of procedure because it can save time. But it is your decision whether to use it; check whether it is worth it for your case.


Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

Cleaning up jobs with errors in OEM 13c

Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 5 years ago Comments: 0


Today, a simple but useful article: OEM 13c has a very useful job scheduling system.

However, when we have several errors, it is difficult to clean up the jobs using the web interface or EMCLI.

The EMCLI verb “get_jobs” reports all runs of the jobs, and “delete_job” can delete a job and all of its runs, but it is not possible to delete a specific run of a job. For example, if you want to delete hundreds of “failed” (or “skipped”) runs of a job, EMCLI won’t help you, and doing it through the web interface will be very time-consuming.

We can easily solve this with a PL/SQL script to clean up jobs.

Connect to sqlplus with the sysman account:




[oracle@dbadutra:/home/oracle] sqlplus sysman@OEM13c




SQL*Plus: Release 19.0.0.0.0 - Production on Fri 28 08:15:03 2020

Version 19.3.0.0.0




Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.3.0.0.0




SQL>

-- Delete only the runs that ended in Error or Failed for jobs owned by SYSMAN.
-- The two statuses are grouped with IN so that the owner filter applies to both
-- (an unparenthesized "AND ... OR ..." would also match Failed runs of any owner).
BEGIN
    FOR C IN
    (SELECT SCHEDULED_TIME, JOB_ID
    FROM MGMT$JOB_EXECUTION_HISTORY
    WHERE JOB_OWNER = 'SYSMAN'
    AND STATUS IN ('Error', 'Failed'))
    LOOP
          EM_JOB_OPS.DELETE_JOB_RUN(C.JOB_ID,C.SCHEDULED_TIME);
    END LOOP;
  COMMIT;
END;
/
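One detail worth checking in a filter like this is operator precedence: in SQL, AND binds more tightly than OR, so "owner = X AND status = A OR status = B" is evaluated as "(owner = X AND status = A) OR status = B". The sketch below shows the effect with Python's built-in sqlite3 module, used here only because it is self-contained. The runs table and its rows are hypothetical stand-ins, not the real MGMT$JOB_EXECUTION_HISTORY view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (job_owner TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [
        ("SYSMAN", "Error"),
        ("SYSMAN", "Failed"),
        ("SYSMAN", "Succeeded"),
        ("OTHER", "Failed"),  # a run owned by someone else
    ],
)

# Without parentheses the owner filter only applies to the 'Error' branch.
unparenthesized = conn.execute(
    "SELECT job_owner, status FROM runs "
    "WHERE job_owner = 'SYSMAN' AND status = 'Error' OR status = 'Failed' "
    "ORDER BY job_owner, status"
).fetchall()

# Grouping the statuses with IN keeps the owner filter on both of them.
grouped = conn.execute(
    "SELECT job_owner, status FROM runs "
    "WHERE job_owner = 'SYSMAN' AND status IN ('Error', 'Failed') "
    "ORDER BY job_owner, status"
).fetchall()

print(unparenthesized)
print(grouped)
```

The first query also returns the row owned by OTHER, which is exactly the kind of run a cleanup script should not touch.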

Now, back in the console, the jobs are clean.

I hope this tip helps.

André Ontalba

TFA error after GI upgrade to 19c

Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0


Recently I did an Exadata stack upgrade/update to the latest 19.2 version (19.2.7.0.0.191012) and upgraded the GI from 18c to 19c (the latest 19c version, 19.5.0.0.191015). After that, TFA stopped working.

Since I did not want to do a complete TFA cleanup and reinstallation, I tried to find the error and a solution. Here I want to share with you the workaround (since there is no solution yet) that I discovered and used to fix the error.

The environment

The current environment is:

Old Grid Infrastructure: Version 18.6.0.0.190416
New Grid Infrastructure: Version 19.5.0.0.191015
Exadata domU: Version 19.2.7.0.0.191012 running kernel 4.1.12-124.30.1.el7uek.x86_64

TFA error

After upgrading the GI from 18c to 19c, TFA does not work. If you try to start it or collect logs with it, you receive errors. In the environment described here, TFA was running fine with the 18c version, and the rootupgrade script from 18c to 19c did not report any error.

To be more precise, the TFA upgrade from 18c to 19c called by rootupgrade was OK (according to the log, which I will show later). But even so, the error occurs.

The solution provided by MOS support is the usual one: download the latest TFA and reinstall over the current one. Unfortunately, I do not like this approach because it can lead to errors during GI upgrades to future releases (like 20) and updates (19.6, as an example).

So, when I tried to run a TFA collection:

[root@exsite1c1 ~]# /u01/app/19.0.0.0/grid/tfa/bin/tfactl diagcollect

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.

TFA-00002 Oracle Trace File Analyzer (TFA) is not running

Please start TFA before running collections

[root@exsite1c1 ~]#

So, to check whether TFA was running, I ran ps -ef and did not see the process running:

[root@exsite1c1 ~]# ps -ef |grep tfa

root      10665      1  0 Nov21 ?        00:00:06 /bin/sh /etc/init.d/init.tfa run >/dev/null 2>&1 </dev/null

root      40285  37137  0 11:05 pts/0    00:00:00 grep --color=auto tfa

[root@exsite1c1 ~]#

And if I try to start TFA (as root), nothing is reported (neither an error nor an OK):

[root@exsite1c1 ~]# /etc/init.d/init.tfa start

Starting TFA..

Waiting up to 100 seconds for TFA to be started..

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

[root@exsite1c1 ~]#

[root@exsite1c1 ~]# ps -ef |grep tfa

root      10665      1  0 Nov21 ?        00:00:06 /bin/sh /etc/init.d/init.tfa run >/dev/null 2>&1 </dev/null

root      46031  37137  0 11:07 pts/0    00:00:00 grep --color=auto tfa

[root@exsite1c1 ~]#

Checking MOS, I saw related problems caused by a bad Perl version. This TFA release needs at least version 5.10. But that was not the case here:

[root@exsite1c1 ~]# perl -v




This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

(with 39 registered patches, see perl -V for more detail)




Copyright 1987-2012, Larry Wall




Perl may be copied only under the terms of either the Artistic License or the

GNU General Public License, which may be found in the Perl 5 source kit.




Complete documentation for Perl, including FAQ lists, should be found on

this system using "man perl" or "perldoc perl".  If you have access to the

Internet, point your browser at http://www.perl.org/, the Perl Home Page.




[root@exsite1c1 ~]#
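The same version check can be scripted instead of read by eye. This is only a sketch: the 5.10 threshold is the one quoted above, and the helper name version_at_least is hypothetical.

```shell
# Succeeds when dotted version $1 (for example "5.16.3") is at least $2.$3.
version_at_least() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Only query Perl when it is installed; "%vd" prints $^V in dotted form.
if command -v perl >/dev/null 2>&1; then
    perl_version=$(perl -e 'printf "%vd", $^V')
    if version_at_least "$perl_version" 5 10; then
        echo "perl $perl_version meets the 5.10 minimum"
    else
        echo "perl $perl_version is older than 5.10"
    fi
fi
```

Note that this checks the perl found in PATH, which is not necessarily the one TFA uses (TFA reads its own PERL entry from tfa_setup.txt, as shown further down).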

Searching for the problem

Digging for the source of the problem, I checked the rootupgrade log, but the report was good. The TFA upgrade completed successfully:

[root@exsite1c1 ~]# vi /u01/app/grid/crsdata/exsite1c2/crsconfig/rootcrs_exsite1c2_2019-11-15_12-12-21AM.log

...

...

2019-11-14 14:18:40: Executing the [UpgradeTFA] step with checkpoint [null] ...

2019-11-14 14:18:40: Executing cmd: /u01/app/19.0.0.0/grid/bin/clsecho -p has -f clsrsc -m 595 '1' '18' 'UpgradeTFA'

2019-11-14 14:18:40: Executing cmd: /u01/app/19.0.0.0/grid/bin/clsecho -p has -f clsrsc -m 595 '1' '18' 'UpgradeTFA'

2019-11-14 14:18:40: Command output:

>  CLSRSC-595: Executing upgrade step 1 of 18: 'UpgradeTFA'.

>End Command output

2019-11-14 14:18:40: CLSRSC-595: Executing upgrade step 1 of 18: 'UpgradeTFA'.

2019-11-14 14:18:40: Executing cmd: /u01/app/19.0.0.0/grid/bin/clsecho -p has -f clsrsc -m 4015

2019-11-14 14:18:40: Executing cmd: /u01/app/19.0.0.0/grid/bin/clsecho -p has -f clsrsc -m 4015

2019-11-14 14:18:40: Command output:

>  CLSRSC-4015: Performing install or upgrade action for Oracle Trace File Analyzer (TFA) Collector.

>End Command output

2019-11-14 14:18:40: CLSRSC-4015: Performing install or upgrade action for Oracle Trace File Analyzer (TFA) Collector.

2019-11-14 14:18:40: Executing the [ValidateEnv] step with checkpoint [null] ...

2019-11-14 14:18:40: Executing cmd: /u01/app/19.0.0.0/grid/crs/install/tfa_setup -silent -crshome /u01/app/19.0.0.0/grid

2019-11-14 14:18:40: Executing cmd: /u01/app/19.0.0.0/grid/bin/clsecho -p has -f clsrsc -m 595 '2' '18' 'ValidateEnv'

2019-11-14 14:18:40: Executing cmd: /u01/app/19.0.0.0/grid/bin/clsecho -p has -f clsrsc -m 595 '2' '18' 'ValidateEnv'

2019-11-14 14:18:40: Command output:

>  CLSRSC-595: Executing upgrade step 2 of 18: 'ValidateEnv'.

...

...

2019-11-14 14:23:45: Command output:

> 

>  TFA Installation Log will be written to File : /tmp/tfa_install_293046_2019_11_14-14_18_40.log

...

...

2019-11-14 14:23:45: Command output:

>  CLSRSC-4003: Successfully patched Oracle Trace File Analyzer (TFA) Collector.

>End Command output

And other related logs reported complete success:

[root@exsite1c1 ~]# cat /tmp/tfa_install_293046_2019_11_14-14_18_40.log

[2019-11-14 14:18:40] Log File written to : /tmp/tfa_install_293046_2019_11_14-14_18_40.log

[2019-11-14 14:18:40]

[2019-11-14 14:18:40] Starting TFA installation

[2019-11-14 14:18:40]

[2019-11-14 14:18:40] TFA Version: 192000 Build Date: 201904260414

[2019-11-14 14:18:40]

[2019-11-14 14:18:40] About to check previous TFA installations ...

[2019-11-14 14:18:40] TFA HOME : /u01/app/18.0.0/grid/tfa/exsite1c1/tfa_home

[2019-11-14 14:18:40]

[2019-11-14 14:18:40] Installed Build Version: 184100 Build Date: 201902260236

[2019-11-14 14:18:40]

[2019-11-14 14:18:40] INSTALL_TYPE GI

[2019-11-14 14:18:40] Shutting down TFA for Migration...

[2019-11-14 14:20:24]

[2019-11-14 14:20:24] Removing /etc/init.d/init.tfa...

[2019-11-14 14:20:24]

[2019-11-14 14:20:24] Migrating TFA to /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home...

[2019-11-14 14:20:50]

[2019-11-14 14:20:50] Starting TFA on exsite1c1...

[2019-11-14 14:20:50]

[2019-11-14 14:21:05]

[2019-11-14 14:21:05] TFA_INSTALLER /u01/app/19.0.0.0/grid/crs/install/tfa_setup

[2019-11-14 14:21:05] TFA is already installed. Upgrading TFA

[2019-11-14 14:21:05]

[2019-11-14 14:21:05] TFA patching CRS or DB from zipfile extracted to /tmp/.293046.tfa

[2019-11-14 14:21:06] TFA Upgrade Log : /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfapatch.log

[2019-11-14 14:23:31] Patch Status : 0

[2019-11-14 14:23:31] Patching OK : Running install_ext

[2019-11-14 14:23:32] Installing oratop extension..

[2019-11-14 14:23:32]

.-----------------------------------------------------------------.

| Host      | TFA Version | TFA Build ID         | Upgrade Status |

+-----------+-------------+----------------------+----------------+

| exsite1c1 |  19.2.0.0.0 | 19200020190426041420 | UPGRADED       |

| exsite1c2 |  18.4.1.0.0 | 18410020190226023629 | NOT UPGRADED   |

'-----------+-------------+----------------------+----------------'




[2019-11-14 14:23:44] Removing Old TFA /u01/app/18.0.0/grid/tfa/exsite1c1/tfa_home...

[2019-11-14 14:23:45] Cleanup serializable files

[2019-11-14 14:23:45]

[root@exsite1c1 ~]#

[root@exsite1c1 ~]# cat /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfapatch.log




TFA will be upgraded on Node exsite1c1:







Upgrading TFA on exsite1c1 :




Stopping TFA Support Tools...




Shutting down TFA for Patching...




Shutting down TFA

. . . . .

. . .

Successfully shutdown TFA..




No Berkeley DB upgrade required




Copying TFA Certificates...







Starting TFA in exsite1c1...




Starting TFA..

Waiting up to 100 seconds for TFA to be started..

. . . . .

Successfully started TFA Process..

. . . . .

TFA Started and listening for commands




Enabling Access for Non-root Users on exsite1c1...




[root@exsite1c1 ~]#

One known problem occurs when (for some reason) the nodes of the cluster lose the TFA sync. I tried to run the sync, and this gave one clue:

[root@exsite1c1 ~]# /u01/app/19.0.0.0/grid/tfa/bin/tfactl syncnodes

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.




/u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home/bin/synctfanodes.sh: line 237: /u01/app/18.0.0/grid/perl/bin/perl: No such file or directory

TFA-00002 Oracle Trace File Analyzer (TFA) is not running




Current Node List in TFA :

1.




Unable to determine Node List to be synced. Please update manually.




Do you want to update this node list? [Y|N] [N]: ^C[root@exsite1c1 ~]#

[root@exsite1c1 ~]#

As you can see, synctfanodes.sh references the old 18c GI home. Inside the sync script, you can see that line 237 (my mark below) calls PERL, and that this value comes from the file tfa_setup.txt.

[root@exsite1c1 ~]# vi /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home/bin/synctfanodes.sh

...

...

        if [ `$GREP -c '^PERL=' $tfa_home/tfa_setup.txt` -ge 1 ]    <== TFA CHECK

        then

                PERL=`$GREP '^PERL=' $tfa_home/tfa_setup.txt | $AWK -F"=" '{print $2}'`;

        fi




        if [ `$GREP -c '^CRS_HOME=' $tfa_home/tfa_setup.txt` -ge 1 ]

        then

                CRS_HOME=`$GREP '^CRS_HOME=' $tfa_home/tfa_setup.txt | $AWK -F"=" '{print $2}'`;

        fi




        if [ `$GREP -c '^RUN_MODE=' $tfa_home/tfa_setup.txt` -ge 1 ]

        then

                RUN_MODE=`$GREP '^RUN_MODE=' $tfa_home/tfa_setup.txt | $AWK -F"=" '{print $2}'`;

        fi

fi




RUSER=`$RUID | $AWK '{print $1}' | $AWK -F\( '{print $2}' | $AWK -F\) '{print $1}'`;




if [ $RUSER != $DAEMON_OWNER ]

then

        $ECHO "User '$RUSER' does not have permissions to run this script.";

        exit 1;

fi




SSH_USER="$DAEMON_OWNER";




HOSTNAME=`hostname | $CUT -d. -f1 | $PERL -ne 'print lc'`;    <===== LINE 237

...

...

Checking tfa_setup.txt

Checking the file we can see the error:

[root@exsite1c1 ~]# cat /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home/tfa_setup.txt

CRS_HOME=/u01/app/18.0.0/grid

exsite1c1%CRS_INSTALLED=1

NODE_NAMES=exsite1c1

ORACLE_BASE=/u01/app/grid

JAVA_HOME=/u01/app/18.0.0/grid/jdk/jre

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/OPatch/crs/log

exsite1c1%CFGTOOLS%DIAGDEST=/u01/app/12.1.0.2/grid/cfgtoollogs

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/crf/db/exsite1c1

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/crs/log

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/cv/log

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/evm/admin/log

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/evm/admin/logger

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/evm/log

exsite1c1%INSTALL%DIAGDEST=/u01/app/12.1.0.2/grid/install

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/log

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/network/log

exsite1c1%DBWLM%DIAGDEST=/u01/app/12.1.0.2/grid/oc4j/j2ee/home/log

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/opmn/logs

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/racg/log

exsite1c1%ASM%DIAGDEST=/u01/app/12.1.0.2/grid/rdbms/log

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/scheduler/log

exsite1c1%CRS%DIAGDEST=/u01/app/12.1.0.2/grid/srvm/log

exsite1c1%ACFS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/acfs

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/core

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/crsconfig

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/crsdiag

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/cvu

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/evm

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/output

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/trace

exsite1c1%INSTALL%DIAGDEST=/u01/app/oraInventory/ContentsXML

exsite1c1%INSTALL%DIAGDEST=/u01/app/oraInventory/logs

TRACE_LEVEL=1

INSTALL_TYPE=GI

PERL=/u01/app/18.0.0/grid/perl/bin/perl

RDBMS_ORACLE_HOME=/u01/app/oracle/product/12.1.0.2/dbhome_1||

RDBMS_ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/dbhome_1||

RDBMS_ORACLE_HOME=/u01/app/12.2.0.1/grid||

TZ=Europe/Luxembourg

RDBMS_ORACLE_HOME=/u01/app/18.0.0/grid||

localnode%ADRBASE=/u01/app/grid

RDBMS_ORACLE_HOME=/u01/app/oracle/product/18.0.0/dbhome_1||

localnode%ADRBASE=/u01/app/oracle

RDBMS_ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/financ||

localnode%ADRBASE=/u01/app/oracle

RDBMS_ORACLE_HOME=/u01/app/oracle/product/12.1.0.2/financ||

localnode%ADRBASE=/u01/app/oracle

DAEMON_OWNER=root

RDBMS_ORACLE_HOME=/u01/app/oracle/agent/13.2.0/agent_13.2.0.0.0||

RDBMS_ORACLE_HOME=/u01/app/12.1.0.2/grid||

RDBMS_ORACLE_HOME=/u01/app/19.0.0.0/grid||

localnode%ADRBASE=/u01/app/grid

CRS_ACTIVE_VERSION=

[root@exsite1c1 ~]#
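Spotting the stale entries by eye is easy to get wrong in a file this long. A rough sketch of an automated check, assuming that only the CRS_HOME, JAVA_HOME and PERL keys matter for starting TFA (the function name find_stale_tfa_paths is hypothetical):

```shell
# Prints every CRS_HOME, JAVA_HOME or PERL line of tfa_setup.txt ($1) whose
# path does not exist on disk; returns 1 if at least one was found.
find_stale_tfa_paths() {
    stale=0
    while IFS='=' read -r key value; do
        case "$key" in
            CRS_HOME|JAVA_HOME|PERL)
                if [ ! -e "$value" ]; then
                    echo "$key=$value"
                    stale=1
                fi
                ;;
        esac
    done < "$1"
    return "$stale"
}
```

Running `find_stale_tfa_paths /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home/tfa_setup.txt` on a node where the old GI home was already removed would list the entries that still point to it.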

As you can see above, the CRS_HOME, JAVA_HOME, PERL, and ORACLE_HOME parameters are pointing to old GI folders. As a workaround, I edited tfa_setup.txt on both nodes and changed the GI folder from 18.0 to 19.0:

[root@exsite1c1 ~]# vi /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home/tfa_setup.txt

[root@exsite1c1 ~]#

[root@exsite1c1 ~]#

[root@exsite1c1 ~]#

[root@exsite1c1 ~]# cat /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home/tfa_setup.txt

CRS_HOME=/u01/app/19.0.0.0/grid

exsite1c1%CRS_INSTALLED=1

NODE_NAMES=exsite1c1

ORACLE_BASE=/u01/app/grid

JAVA_HOME=/u01/app/19.0.0.0/grid/jdk/jre

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/OPatch/crs/log

exsite1c1%CFGTOOLS%DIAGDEST=/u01/app/19.0.0.0/grid/cfgtoollogs

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/crf/db/exsite1c1

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/crs/log

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/cv/log

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/evm/admin/log

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/evm/admin/logger

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/evm/log

exsite1c1%INSTALL%DIAGDEST=/u01/app/19.0.0.0/grid/install

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/log

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/network/log

exsite1c1%DBWLM%DIAGDEST=/u01/app/19.0.0.0/grid/oc4j/j2ee/home/log

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/opmn/logs

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/racg/log

exsite1c1%ASM%DIAGDEST=/u01/app/19.0.0.0/grid/rdbms/log

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/scheduler/log

exsite1c1%CRS%DIAGDEST=/u01/app/19.0.0.0/grid/srvm/log

exsite1c1%ACFS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/acfs

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/core

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/crsconfig

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/crsdiag

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/cvu

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/evm

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/output

exsite1c1%CRS%DIAGDEST=/u01/app/grid/crsdata/exsite1c1/trace

exsite1c1%INSTALL%DIAGDEST=/u01/app/oraInventory/ContentsXML

exsite1c1%INSTALL%DIAGDEST=/u01/app/oraInventory/logs

TRACE_LEVEL=1

INSTALL_TYPE=GI

PERL=/u01/app/19.0.0.0/grid/perl/bin/perl

RDBMS_ORACLE_HOME=/u01/app/oracle/product/12.1.0.2/dbhome_1||

RDBMS_ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/dbhome_1||

TZ=Europe/Luxembourg

RDBMS_ORACLE_HOME=/u01/app/oracle/product/18.0.0/dbhome_1||

localnode%ADRBASE=/u01/app/oracle

RDBMS_ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/financ||

localnode%ADRBASE=/u01/app/oracle

RDBMS_ORACLE_HOME=/u01/app/oracle/product/12.1.0.2/financ||

localnode%ADRBASE=/u01/app/oracle

DAEMON_OWNER=root

RDBMS_ORACLE_HOME=/u01/app/oracle/agent/13.2.0/agent_13.2.0.0.0||

RDBMS_ORACLE_HOME=/u01/app/19.0.0.0/grid||

localnode%ADRBASE=/u01/app/grid

CRS_ACTIVE_VERSION=19.0.0.0

[root@exsite1c1 ~]#
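The edit above was done by hand in vi. Part of it could be scripted, as in the sketch below, which rewrites only the CRS_HOME, JAVA_HOME and PERL lines. It does not reproduce the other manual changes visible in the listing (the DIAGDEST paths, the removed RDBMS_ORACLE_HOME entries for old grid homes, and CRS_ACTIVE_VERSION). The function name rewrite_gi_home is hypothetical.

```shell
# Writes a corrected copy of tfa_setup.txt to stdout.
# $1 = file, $2 = old GI home, $3 = new GI home.
# Only CRS_HOME, JAVA_HOME and PERL are rewritten; other lines pass through.
# Caveat: $2 is used as a regular expression and must not contain '#'.
rewrite_gi_home() {
    sed -E "/^(CRS_HOME|JAVA_HOME|PERL)=/ s#=$2#=$3#" "$1"
}

# Example (not executed here): keep a backup, then regenerate the file.
#   cp -p tfa_setup.txt tfa_setup.txt.bak
#   rewrite_gi_home tfa_setup.txt.bak /u01/app/18.0.0/grid /u01/app/19.0.0.0/grid > tfa_setup.txt
```

Writing to stdout from a backup copy, instead of editing in place, keeps the original file available if TFA still does not start.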

And after the edit it was possible to start TFA correctly:

[root@exsite1c1 ~]# /etc/init.d/init.tfa start

Starting TFA..

Waiting up to 100 seconds for TFA to be started..

. . . . .

Successfully started TFA Process..

. . . . .

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.

TFA Started and listening for commands

[root@exsite1c1 ~]#

[root@exsite1c1 ~]#

[root@exsite1c1 ~]# ps -ef |grep tfa

root     113905      1  0 11:31 ?        00:00:00 /bin/sh /etc/init.d/init.tfa run >/dev/null 2>&1 </dev/null

root     115917      1 99 11:31 ?        00:00:24 /u01/app/19.0.0.0/grid/jdk/jre/bin/java -server -Xms256m -Xmx512m -Djava.awt.headless=true -Ddisable.checkForUpdate=true -XX:ParallelGCThreads=5 oracle.rat.tfa.TFAMain /u01/app/19.0.0.0/grid/tfa/exsite1c1/tfa_home

root     117853  37137  0 11:31 pts/0    00:00:00 grep --color=auto tfa

[root@exsite1c1 ~]#

And to execute diagcollect:

[root@exsite1c1 ~]# /u01/app/19.0.0.0/grid/tfa/bin/tfactl diagcollect

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.

 

By default TFA will collect diagnostics for the last 12 hours. This can result in large collections

For more targeted collections enter the time of the incident, otherwise hit <RETURN> to collect for the last 12 hours

[YYYY-MM-DD HH24:MI:SS,<RETURN>=Collect for last 12 hours] :

 

Collecting data for the last 12 hours for all components...

Collecting data for all nodes

 

Collection Id : 20191122124148exsite1c1

 

Detailed Logging at : /u01/app/grid/tfa/repository/collection_Fri_Nov_22_12_41_49_CET_2019_node_all/diagcollect_20191122124148_exsite1c1.log

2019/11/22 12:41:53 CET : NOTE : Any file or directory name containing the string .com will be renamed to replace .com with dotcom

2019/11/22 12:41:53 CET : Collection Name : tfa_Fri_Nov_22_12_41_49_CET_2019.zip

2019/11/22 12:41:54 CET : Collecting diagnostics from hosts : [exsite1c1, exsite1c2]

2019/11/22 12:41:54 CET : Scanning of files for Collection in progress...

2019/11/22 12:41:54 CET : Collecting additional diagnostic information...

2019/11/22 12:44:13 CET : Completed collection of additional diagnostic information...

2019/11/22 13:15:39 CET : Getting list of files satisfying time range [11/22/2019 00:41:53 CET, 11/22/2019 13:15:39 CET]

2019/11/22 13:40:42 CET : Collecting ADR incident files...

2019/11/22 13:40:48 CET : Completed Local Collection

2019/11/22 13:40:48 CET : Remote Collection in Progress...

.---------------------------------------.

|           Collection Summary          |

+-----------+-----------+-------+-------+

| Host      | Status    | Size  | Time  |

+-----------+-----------+-------+-------+

| exsite1c2 | Completed | 412MB |  318s |

| exsite1c1 | Completed | 284MB | 3534s |

'-----------+-----------+-------+-------'

 

Logs are being collected to: /u01/app/grid/tfa/repository/collection_Fri_Nov_22_12_41_49_CET_2019_node_all

/u01/app/grid/tfa/repository/collection_Fri_Nov_22_12_41_49_CET_2019_node_all/exsite1c1.tfa_Fri_Nov_22_12_41_49_CET_2019.zip

/u01/app/grid/tfa/repository/collection_Fri_Nov_22_12_41_49_CET_2019_node_all/exsite1c2.tfa_Fri_Nov_22_12_41_49_CET_2019.zip

[root@exsite1c1 ~]#

[root@exsite1c1 ~]# /u01/app/19.0.0.0/grid/tfa/bin/tfactl diagcollect -since 1h

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.

Collecting data for all nodes

 

Collection Id : 20191122134319exsite1c1

 

Detailed Logging at : /u01/app/grid/tfa/repository/collection_Fri_Nov_22_13_43_20_CET_2019_node_all/diagcollect_20191122134319_exsite1c1.log

2019/11/22 13:43:24 CET : NOTE : Any file or directory name containing the string .com will be renamed to replace .com with dotcom

2019/11/22 13:43:24 CET : Collection Name : tfa_Fri_Nov_22_13_43_20_CET_2019.zip

2019/11/22 13:43:24 CET : Collecting diagnostics from hosts : [exsite1c1, exsite1c2]

2019/11/22 13:43:24 CET : Scanning of files for Collection in progress...

2019/11/22 13:43:24 CET : Collecting additional diagnostic information...

2019/11/22 13:44:49 CET : Getting list of files satisfying time range [11/22/2019 12:43:24 CET, 11/22/2019 13:44:49 CET]

2019/11/22 13:45:50 CET : Completed collection of additional diagnostic information...

2019/11/22 13:59:19 CET : Collecting ADR incident files...

2019/11/22 13:59:19 CET : Completed Local Collection

2019/11/22 13:59:19 CET : Remote Collection in Progress...

.--------------------------------------.

|          Collection Summary          |

+-----------+-----------+-------+------+

| Host      | Status    | Size  | Time |

+-----------+-----------+-------+------+

| exsite1c2 | Completed | 230MB | 295s |

| exsite1c1 | Completed | 105MB | 955s |

'-----------+-----------+-------+------'

 

Logs are being collected to: /u01/app/grid/tfa/repository/collection_Fri_Nov_22_13_43_20_CET_2019_node_all

/u01/app/grid/tfa/repository/collection_Fri_Nov_22_13_43_20_CET_2019_node_all/exsite1c2.tfa_Fri_Nov_22_13_43_20_CET_2019.zip

/u01/app/grid/tfa/repository/collection_Fri_Nov_22_13_43_20_CET_2019_node_all/exsite1c1.tfa_Fri_Nov_22_13_43_20_CET_2019.zip

[root@exsite1c1 ~]#

TFA error #2

Another error, which I got in another cluster that went through the same update/upgrade process, was related to the *.ser files in the TFA home. If I try to use TFA (with diagcollect, for example), I receive this error:

[root@exsite2c1 ~]# /u01/app/19.0.0.0/grid/tfa/exsite2c1/tfa_home/bin/tfactl diagcollect

WARNING - TFA Software is older than 180 days. Please consider upgrading TFA to the latest version.

Storable binary image v2.10 contains data of type 101. This Storable is v2.9 and can only handle data types up to 30 at /usr/lib64/perl5/vendor_perl/Storable.pm line 381, at /u01/app/19.0.0.0/grid/tfa/exsite2c1/tfa_home/bin/common/tfactlshare.pm line 25611.

[root@exsite2c1 ~]#

If you look in MOS, the notes point to the Perl version. But that is not the case here: Perl is newer than 5.10 on this version of Exadata. **The solution was to move the *.ser files to another folder (removing them from the TFA home), or to delete them**. After that, there is no more “Storable binary image” error (but the error related to tfa_setup.txt continues):

[root@exsite2c1 ~]# mv /u01/app/19.0.0.0/grid/tfa/exsite2c1/tfa_home/internal/*ser /tmp

[root@exsite2c1 ~]# ls -l /u01/app/19.0.0.0/grid/tfa/exsite2c1/tfa_home/internal/*ser

ls: cannot access /u01/app/19.0.0.0/grid/tfa/exsite2c1/tfa_home/internal/*ser: No such file or directory

[root@exsite2c1 ~]#
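Moving the files into /tmp works, but they can get mixed with other files there. A small sketch that moves them into a dedicated backup directory instead; the function name stash_ser_files and the backup location are assumptions, not part of TFA:

```shell
# Moves every *.ser file from directory $1 (TFA's "internal" folder) into
# backup directory $2, creating it if needed. Prints how many were moved.
stash_ser_files() {
    mkdir -p "$2" || return 1
    count=0
    for f in "$1"/*.ser; do
        [ -e "$f" ] || continue   # the glob matched nothing
        mv "$f" "$2"/ || return 1
        count=$((count + 1))
    done
    echo "$count"
}
```

For example, `stash_ser_files /u01/app/19.0.0.0/grid/tfa/exsite2c1/tfa_home/internal /root/tfa_ser_backup` keeps the files together in case they need to be put back.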

Problem and Solution

The source of the problem is not clear in this case. As you saw above, the logs of the GI upgrade/update from 18c to 19c reported success, even for TFA. But it is clear that tfa_setup.txt was left with wrong parameters inside. And if you look closely, you can see that a reference to the new GI home already exists in the first version of the file.

But unfortunately, the needed parameters were left with the wrong values. The workaround was just to edit tfa_setup.txt and fix the wrong folders in the parameters. I did not test running $GI_HOME/crs/install/tfa_setup -silent -crshome $GI_HOME to fix the file, but you can try. The idea was to identify the issue instead of just removing TFA and reinstalling it.

Again, this is a workaround that I tested in my environment, and it worked. You need to verify the logs and other files to see if you hit the same issues. If so, you can at least try it.

Using the feature recover standby database from service Dataguard on 19c

Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 5 years ago Comments: 0


Oracle Dataguard, part of MAA (Maximum Availability Architecture), is a product widely used around the globe by companies that want to keep their data safe, preventing data loss in case of human errors, hardware failures, and other kinds of disasters. The main function of this product is to provide a copy of your primary database, synchronized by archivelogs and usually maintained in a different geographical location. The introduction of Active Dataguard in Oracle 11g, a licensed option, took this product to another level, in my humble opinion. This feature allows users to run reports on the standby database while it is open read-only with apply, which can offload a lot of heavy workload from the primary database and free many resources on it.

With Oracle Database 19c, Oracle introduced a feature that allows some DML executed on the standby database to be redirected to the primary database and then shipped through archivelogs back to the standby. Note that this is not recommended for intense workloads because of the performance impact: it uses more CPU, network, and so on. I mention this feature only so you can follow the improvements made to the product in each version. Today we will talk about one more enhancement that makes the DBA's life easier from time to time.

Sometimes, because of hardware failures, human errors, bugs, or even some reconfiguration, we need to recreate the standby database. From Oracle 18c onwards, Oracle provides a single RMAN command to perform the whole resynchronization of a standby database over the network. I don't need to tell you that restoring a multi-terabyte database over the network is not suitable most of the time (unless you have a dedicated network to play with).

Benefits

Simplicity to create or recreate the standby database
Better control of the process

Scenario

Primary database: orclcdb
Standby database: orclstb

Both the primary database and the standby are managed by Dataguard broker (DGMGRL). I like to use this command-line utility to manage Dataguard environments. It helps to perform all Dataguard administration, such as switchover, configuration, and many other tasks. If you want more information about this incredible tool, please refer to the official documentation at https://docs.oracle.com/en/database/oracle/oracle-database/19/dgbkr/index.html

Verification

As I mentioned before, using the command-line interface of Dataguard Broker, we can use the command show configuration to display the configuration we created previously. In our case, this Dataguard configuration has the primary database orclcdb and the standby database orclstb; we can also have multiple standby databases and even a ZDLRA registered in the broker. In this article we will cause some damage on the standby database by removing a datafile, let's say system01.dbf, just for fun, to show how to fix it.

As we can see in the image below, there are no problems such as a gap or any other kind of issue. In this state, if we lose the primary database, the standby will be able to become the primary database with just a manual failover. It is also possible to automate this task by configuring Fast-Start Failover.

Causing some damage on standby

In the following image, we can see all the datafiles that belong to our standby database, where we will cause some damage and then recover it using only recover standby database from service. The list below shows all datafiles of our standby (orclstb):

Now, we will remove the main datafile on the standby, /u01/app/oracle/oradata/ORCLSTB/system01.dbf

As expected, after removing one of the datafiles on purpose, our environment crashed, look:

Disabling the apply on standby database

Before starting the recovery process on this standby database, we need to stop the archivelog synchronization (redo apply). To do that, we can use the commands below to edit the state of the database within the broker. If you don't do that, you will face this error:

starting media recovery

media recovery failed

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-03002: failure of recover command at 01/19/2020 11:16:26

RMAN-03015: error occurred in stored script Memory Script

RMAN-11003: failure during parse/execution of SQL statement: alter database recover

 if needed standby start

ORA-01153: an incompatible media recovery is active

As we can see, the command to stop apply was issued on DGMGRL, and now the error will no longer happen.
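For reference, the broker statement that stops redo apply on a physical standby is EDIT DATABASE ... SET STATE='APPLY-OFF'. The sketch below only builds that statement so it can be piped to dgmgrl; the helper name dg_set_state_stmt and the connect string in the example are assumptions for illustration.

```shell
# Prints the DGMGRL statement that sets the state of database $1 to $2.
# APPLY-OFF stops redo apply on a physical standby, APPLY-ON restarts it.
dg_set_state_stmt() {
    case "$2" in
        APPLY-ON|APPLY-OFF)
            printf "EDIT DATABASE %s SET STATE='%s';\n" "$1" "$2"
            ;;
        *)
            echo "unsupported state: $2" >&2
            return 1
            ;;
    esac
}

# Example (not executed here; the credentials are placeholders):
#   dg_set_state_stmt orclstb APPLY-OFF | dgmgrl -silent sys/oracle@orclcdb
```

The same helper with APPLY-ON produces the statement that re-enables apply once the recovery has finished.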

After stopping the recovery, we just need to log in to RMAN and issue the command recover standby database from service, as shown in the following lines of this article:

[oracle@ora19c ~]$ rman target=sys/oracle@orclstb




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Jan 19 11:27:59 2020

Version 19.3.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: ORCLCDB (DBID=2780785463, not open)




RMAN> recover standby database from service orclcdb;




Starting recover at 19-JAN-20

using target database control file instead of recovery catalog

Oracle instance started




Total System Global Area    1895823376 bytes




Fixed Size                     9136144 bytes

Variable Size                436207616 bytes

Database Buffers            1442840576 bytes

Redo Buffers                   7639040 bytes




contents of Memory Script:

{

   restore standby controlfile from service  'orclcdb';

   alter database mount standby database;

}

executing Memory Script




Starting restore at 19-JAN-20

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=39 device type=DISK




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: using network backup set from service orclcdb

channel ORA_DISK_1: restoring control file

channel ORA_DISK_1: restore complete, elapsed time: 00:00:02

output file name=/u01/app/oracle/oradata/ORCLSTB/control01.ctl

output file name=/u01/app/oracle/fast_recovery_area/ORCLSTB/control02.ctl

Finished restore at 19-JAN-20




released channel: ORA_DISK_1

Statement processed




contents of Memory Script:

{

set newname for datafile  1 to

 "/u01/app/oracle/oradata/ORCLSTB/system01.dbf";

   restore from service  'orclcdb' datafile

    1;

   catalog datafilecopy  "/u01/app/oracle/oradata/ORCLSTB/system01.dbf";

   switch datafile all;

}

executing Memory Script




executing command: SET NEWNAME




Starting restore at 19-JAN-20

Starting implicit crosscheck backup at 19-JAN-20

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=49 device type=DISK

Crosschecked 3 objects

Finished implicit crosscheck backup at 19-JAN-20




Starting implicit crosscheck copy at 19-JAN-20

using channel ORA_DISK_1

Crosschecked 2 objects

Finished implicit crosscheck copy at 19-JAN-20




searching for all files in the recovery area

cataloging files...

cataloging done




List of Cataloged Files

=======================

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_25_gyq4g4hz_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_26_gyq5km3v_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_27_gyq64bom_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_28_gyq64n7f_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_29_gyq64zy8_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_30_gyq6vxg5_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_32_gyq6vzrv_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_31_gyq6vzs6_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_33_gyq84b59_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_34_gyq88dll_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_35_gyq9rvx9_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_36_gyqcf807_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_37_gyqcfsk1_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_38_gyqcg4l3_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_39_gyqckh4o_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_41_gyqckkfk_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_40_gyqckkg6_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_42_gyqcpsko_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_43_gyqdchj7_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_08/o1_mf_1_44_gyso1jwz_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_08/o1_mf_1_45_gytopnjo_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_08/o1_mf_1_46_gytowb7h_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_47_h1gstntb_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_48_h1gtcgd0_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_49_h1gv3wqn_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_50_h1gv8rbs_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_51_h1gvo2x8_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_16/o1_mf_1_52_h21lqfd2_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_16/o1_mf_1_53_h21m7fwt_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_19/o1_mf_1_54_h28w8y78_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/autobackup/2019_12_07/o1_mf_s_1026372241_gyq6h263_.bkp

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/autobackup/2019_12_07/o1_mf_s_1026373077_gyq7r09k_.bkp

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/autobackup/2019_12_07/o1_mf_s_1026378166_gyqd9fxy_.bkp




using channel ORA_DISK_1




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: using network backup set from service orclcdb

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00001 to /u01/app/oracle/oradata/ORCLSTB/system01.dbf

channel ORA_DISK_1: restore complete, elapsed time: 00:00:07

Finished restore at 19-JAN-20




cataloged datafile copy

datafile copy file name=/u01/app/oracle/oradata/ORCLSTB/system01.dbf RECID=5 STAMP=1030102136




datafile 1 switched to datafile copy

input datafile copy RECID=5 STAMP=1030102136 file name=/u01/app/oracle/oradata/ORCLSTB/system01.dbf




contents of Memory Script:

{

  recover database from service  'orclcdb';

}

executing Memory Script




Starting recover at 19-JAN-20

using channel ORA_DISK_1

skipping datafile 1; already restored to SCN 4595092

skipping datafile 3; already restored to SCN 4594570

skipping datafile 5; already restored to SCN 2163739

skipping datafile 6; already restored to SCN 2163739

skipping datafile 7; already restored to SCN 4594577

skipping datafile 8; already restored to SCN 2163739

skipping datafile 9; already restored to SCN 4594580

skipping datafile 10; already restored to SCN 4594582

skipping datafile 12; already restored to SCN 4594588

skipping datafile 13; already restored to SCN 4594593

skipping datafile 14; already restored to SCN 4594596

skipping datafile 15; already restored to SCN 4594598

skipping datafile 19; already restored to SCN 4594600

skipping datafile 20; already restored to SCN 4594604

skipping datafile 21; already restored to SCN 4594611




starting media recovery




media recovery complete, elapsed time: 00:00:00

Finished recover at 19-JAN-20

Finished recover at 19-JAN-20




RMAN>

As we can see, with just one simple command line in RMAN it is easy to recover/rebuild an entire Data Guard standby database and be ready for the next disaster recovery. We hope not to face one, but problems happen. In the end, our standby database was restored and can be fully synchronized with the production database again.
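When the log is long, it can help to summarize the recover phase instead of reading it line by line. The sketch below is only an illustration and not part of the original procedure: it assumes the RMAN output was saved as text, and the function names `skipped_datafiles` and `oldest_restore_scn` are hypothetical helpers. It collects the "skipping datafile" lines shown above and reports the datafile restored to the lowest SCN.

```python
import re

# Matches RMAN lines such as:
#   skipping datafile 1; already restored to SCN 4595092
SKIP_LINE = re.compile(r"skipping datafile (\d+); already restored to SCN (\d+)")


def skipped_datafiles(log_text):
    """Return {datafile_number: scn} for every 'skipping datafile' line."""
    return {int(m.group(1)): int(m.group(2)) for m in SKIP_LINE.finditer(log_text)}


def oldest_restore_scn(log_text):
    """Return (datafile_number, scn) for the lowest SCN, or None if no match."""
    skipped = skipped_datafiles(log_text)
    if not skipped:
        return None
    return min(skipped.items(), key=lambda item: item[1])


# Three lines copied from the log above.
sample_log = """
skipping datafile 1; already restored to SCN 4595092
skipping datafile 3; already restored to SCN 4594570
skipping datafile 5; already restored to SCN 2163739
"""

print(skipped_datafiles(sample_log))
print(oldest_restore_scn(sample_log))
```

A datafile with a much lower SCN than the others is not necessarily a problem (it may simply be a file that does not change), but it is a quick way to spot what deserves a second look.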

So, via DGMGRL interface, we can enable the synchronization again as we can see on the image:

For validation purposes, we will issue a switchover to change the roles of our databases, so the primary database becomes the standby and the standby becomes the primary:

Now, our database orclstb, originally standby database is the primary, and the database orclcdb was changed to physical standby role:

All databases are open and ready to use, and the instance with db_unique_name = orclstb now has the PRIMARY role.

Conclusion: Oracle is always innovating and making our lives easier. With every new version, new features are added. We can agree that this functionality saves a lot of work, but be aware that the restore runs entirely over the network: if your database is huge, you may kill the performance of your entire network.

André Ontalba / Rodrigo Mufalani / Y V RaviKumar

Converting Dictionary Managed Tablespaces (DMT) to Local Managed Tablespaces (LMT)

Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 5 years ago Comments: 0

Converting Dictionary Managed Tablespaces (DMT) to Local Managed Tablespaces (LMT)

Today we found a really old database using a really old-fashioned way of allocating tablespace extents. In another life, when I began with Oracle databases and we used to call MOSC Metalink (@ludodba), extents were managed through DICTIONARY MANAGED TABLESPACES (DMT). Mostly because of performance issues, Oracle decided to move extent management out of the data dictionary (SYSTEM tablespace) and store it in bitmaps inside each tablespace, which is why they are called LOCAL MANAGED TABLESPACES (LMT). This feature first appeared in Oracle 8i and became the default in Oracle 9i in 2001, almost 20 years ago, and we still see old stuff running around.

The steps that we will perform are very fast and simple, but as we are changing some core things in the database, please make sure to have a full and reliable backup first.

Here is an important note that is a little bit confusing, so please read it before proceeding. In short: after converting SYSTEM, you cannot put a DMT in read write mode anymore.

** IMPORTANT NOTE **

Note: After the SYSTEM tablespace is migrated to locally managed, any dictionary-managed tablespaces in the database cannot be made read/write.
If you want to use the dictionary-managed tablespaces in read/write mode, then Oracle recommends
that you first migrate these tablespaces to locally managed before migrating the SYSTEM tablespace.

***********************************

1) Put DATABASE in restricted session

SQL> ALTER SYSTEM ENABLE RESTRICTED SESSION;

System altered.

SELECT logins FROM v$instance;

LOGINS
---------
RESTRICTED

2) Define a default temporary tablespace on the database

SQL> col PROPERTY_NAME form a30
SQL> col PROPERTY_VALUE form a30
SQL> col DESCRIPTION form a50
SQL> select * from database_properties where PROPERTY_NAME = 'DEFAULT_TEMP_TABLESPACE';

PROPERTY_NAME PROPERTY_VALUE DESCRIPTION
------------------------------ ------------------------------ --------------------------------------------------
DEFAULT_TEMP_TABLESPACE SYSTEM ID of default temporary tablespace

Obs.: Really bad. Using the SYSTEM tablespace to store temporary segments means some DBA here didn’t do his homework… In fact, his work.

SQL> ALTER DATABASE DEFAULT TEMPORARY TABLESPACE TEMPTBS01;

Database altered.

Obs.: Check whether the temporary tablespace is dictionary managed; if so, recreate it before proceeding with the others.

SQL> select * from database_properties where PROPERTY_NAME = 'DEFAULT_TEMP_TABLESPACE';

PROPERTY_NAME PROPERTY_VALUE DESCRIPTION
------------------------------ ------------------------------ --------------------------------------------------
DEFAULT_TEMP_TABLESPACE TEMPTBS01 ID of default temporary tablespace

3) List all other dictionary managed tablespaces; they must be migrated before SYSTEM

select tablespace_name, status, extent_management 
from dba_tablespaces 
where extent_management = 'DICTIONARY';

4) Execute the migration

select 'execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('''||tablespace_name||''');' cmd 
from dba_tablespaces where extent_management = 'DICTIONARY'
and tablespace_name <> 'SYSTEM';

CMD
----------------------------------------------------------------------------------------
execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('INDX');
execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('TOOLS');
execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('TESTTBS1');
execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('TESTTBS2');
execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('USERS');
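The same command list can also be built outside the database, for example when the tablespace listing was exported to a file. The following is a minimal sketch under that assumption: `migrate_commands` is a hypothetical helper, and the `rows` list is sample data shaped like the `tablespace_name` and `extent_management` columns of `dba_tablespaces`.

```python
def migrate_commands(tablespaces):
    """Build the SQL*Plus commands of step 4.

    `tablespaces` is a list of (tablespace_name, extent_management) pairs.
    SYSTEM is deliberately left out because it must be migrated last.
    """
    return [
        "execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('%s');" % name
        for name, extent_management in tablespaces
        if extent_management == "DICTIONARY" and name != "SYSTEM"
    ]


# Sample rows (assumption): two DMT tablespaces, one LMT, plus SYSTEM.
rows = [
    ("SYSTEM", "DICTIONARY"),
    ("INDX", "DICTIONARY"),
    ("UNDOTBS01", "LOCAL"),
    ("USERS", "DICTIONARY"),
]

for command in migrate_commands(rows):
    print(command)
```

Keeping SYSTEM out of the generated list mirrors the `tablespace_name <> 'SYSTEM'` filter of the query, so the order required by the important note is respected.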

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('INDX');

PL/SQL procedure successfully completed.

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('TOOLS');

PL/SQL procedure successfully completed.

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('TESTTBS1');

PL/SQL procedure successfully completed.

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('TESTTBS2');

PL/SQL procedure successfully completed.

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('USERS');

PL/SQL procedure successfully completed.

5) Make sure that all tablespaces other than SYSTEM were migrated to LMT. After the migration of SYSTEM to LMT, any dictionary managed tablespace (DMT) left behind cannot be put in read write mode anymore.

select tablespace_name, status, extent_management 
from dba_tablespaces 
where extent_management = 'DICTIONARY'
and tablespace_name <> 'SYSTEM';

no rows selected

6) Put all tablespaces other than SYSTEM, SYSAUX, UNDO and TEMP in read only mode.

select 'ALTER TABLESPACE '||tablespace_name||' READ ONLY;' CMD
from dba_tablespaces 
where tablespace_name NOT in ('SYSTEM','SYSAUX') 
and contents not in ('UNDO','TEMPORARY');

CMD
----------------------------------------------------------
ALTER TABLESPACE INDX READ ONLY;
ALTER TABLESPACE TOOLS READ ONLY;
ALTER TABLESPACE TESTTBS1 READ ONLY;
ALTER TABLESPACE TESTTBS2 READ ONLY;
ALTER TABLESPACE USERS READ ONLY;

7) Check the status of the tablespaces

SQL> select tablespace_name, status from dba_tablespaces;

TABLESPACE_NAME STATUS
------------------------------ ---------
SYSTEM ONLINE
INDX READ ONLY
TOOLS READ ONLY
TESTTBS1 READ ONLY
TESTTBS2 READ ONLY
USERS READ ONLY
UNDOTBS01 ONLINE
TEMPTBS01 ONLINE
SYSAUX ONLINE

8) Put SYSAUX tablespace offline

SQL> alter tablespace SYSAUX offline;

Tablespace altered.

9) Check the status of the tablespaces again; only UNDO, TEMP and SYSTEM must be online.

SQL> select tablespace_name, status from dba_tablespaces;

TABLESPACE_NAME STATUS
------------------------------ ---------
SYSTEM ONLINE
INDX READ ONLY
TOOLS READ ONLY
TESTTBS1 READ ONLY
TESTTBS2 READ ONLY
USERS READ ONLY
UNDOTBS01 ONLINE
TEMPTBS01 ONLINE
SYSAUX OFFLINE

10) Now we can migrate the SYSTEM tablespace

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('SYSTEM');

PL/SQL procedure successfully completed.

11) Put the tablespaces back in read write mode

select 'ALTER TABLESPACE '||tablespace_name||' READ WRITE;' CMD
from dba_tablespaces 
where tablespace_name NOT in ('SYSTEM','SYSAUX') 
and contents not in ('UNDO','TEMPORARY');

CMD
-----------------------------------------------------------
ALTER TABLESPACE INDX READ WRITE;
ALTER TABLESPACE TOOLS READ WRITE;
ALTER TABLESPACE TESTTBS1 READ WRITE;
ALTER TABLESPACE TESTTBS2 READ WRITE;
ALTER TABLESPACE USERS READ WRITE;

SQL> ALTER TABLESPACE INDX READ WRITE;

Tablespace altered.

SQL> ALTER TABLESPACE TOOLS READ WRITE;

Tablespace altered.

SQL> ALTER TABLESPACE TESTTBS1 READ WRITE;

Tablespace altered.

SQL> ALTER TABLESPACE TESTTBS2 READ WRITE;

Tablespace altered.

SQL> ALTER TABLESPACE USERS READ WRITE;

Tablespace altered.

12) Put SYSAUX online again

SQL> ALTER TABLESPACE SYSAUX ONLINE;

Tablespace altered.

13) Check the status of the tablespaces one more time

SQL> select tablespace_name, status, extent_management from dba_tablespaces;

TABLESPACE_NAME STATUS EXTENT_MAN
------------------------------ --------- ----------
SYSTEM ONLINE LOCAL
INDX ONLINE LOCAL
TOOLS ONLINE LOCAL
TESTTBS1 ONLINE LOCAL
TESTTBS2 ONLINE LOCAL
USERS ONLINE LOCAL
UNDOTBS01 ONLINE LOCAL
TEMPTBS01 ONLINE LOCAL
SYSAUX ONLINE LOCAL

14) Disable restricted session on the database

SQL> ALTER SYSTEM DISABLE RESTRICTED SESSION;

System altered.

SELECT logins FROM v$instance;

LOGINS
---------
ALLOWED

If you don’t follow all the steps carefully, you can run into issues like the ones below:

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('SYSTEM');

ERROR at line 1:
ORA-10644: SYSTEM tablespace cannot be default temporary tablespace
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 227
ORA-06512: at line 1

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('SYSTEM');

ERROR at line 1:
ORA-10647: Tablespace other than SYSTEM, UNDO, TEMP not
found in read only mode
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 227
ORA-06512: at line 1

SQL> select tablespace_name from dba_tablespaces;

TABLESPACE_NAME
------------------------------
SYSTEM
TEMP
USERS
IDX
SYSAUX
UNDO

SQL> alter tablespace USERS read only;
SQL> alter tablespace IDX read only;

SQL> execute DBMS_SPACE_ADMIN.TABLESPACE_MIGRATE_TO_LOCAL ('SYSTEM');

ERROR at line 1:
ORA-10648: Tablespace SYSAUX is not offline
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 227
ORA-06512: at line 1
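The three errors above correspond to three prerequisites, so they can be checked before calling the migration of SYSTEM. The sketch below is an illustration only: `system_migration_blockers` is a hypothetical helper, its inputs are assumed to come from `database_properties` and `dba_tablespaces`, and the `contents` values in the example are assumptions because the listing above shows only the tablespace names.

```python
def system_migration_blockers(default_temp, statuses, contents):
    """List the reasons why migrating SYSTEM would be refused.

    default_temp: value of DEFAULT_TEMP_TABLESPACE from database_properties
    statuses:     {tablespace_name: status} from dba_tablespaces
    contents:     {tablespace_name: contents} from dba_tablespaces
    """
    blockers = []
    if default_temp == "SYSTEM":
        blockers.append("ORA-10644: choose another default temporary tablespace")
    for name, status in sorted(statuses.items()):
        if name in ("SYSTEM", "SYSAUX"):
            continue  # SYSTEM stays online, SYSAUX is checked separately
        if contents.get(name) in ("UNDO", "TEMPORARY"):
            continue  # undo and temporary tablespaces stay online
        if status != "READ ONLY":
            blockers.append("ORA-10647: put %s in read only mode" % name)
    if "SYSAUX" in statuses and statuses["SYSAUX"] != "OFFLINE":
        blockers.append("ORA-10648: take SYSAUX offline")
    return blockers


# Last scenario above: USERS and IDX are read only, SYSAUX is still online.
statuses = {"SYSTEM": "ONLINE", "TEMP": "ONLINE", "USERS": "READ ONLY",
            "IDX": "READ ONLY", "SYSAUX": "ONLINE", "UNDO": "ONLINE"}
contents = {"SYSTEM": "PERMANENT", "TEMP": "TEMPORARY", "USERS": "PERMANENT",
            "IDX": "PERMANENT", "SYSAUX": "PERMANENT", "UNDO": "UNDO"}

print(system_migration_blockers("TEMP", statuses, contents))
```

An empty list means that none of the three situations shown above applies; it does not replace the backup recommended at the beginning.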

I hope this helps you!!!

Rodrigo Mufalani and Andre Luiz Dutra Ontalba

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer’s positions, strategies or opinions. The information here was edited to be useful for general purposes; specific data and identifications were removed to reach a generic audience and to be useful.”

Webinar – MAA and ZDLRA protecting everything

Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 5 years ago Comments: 1

Hi guys !!

On January 16th at 19:00 (CET) we will have our Webinar – MAA and ZDLRA protecting everything.

Speaker: Fernando Simon

You can follow on Youtube – Click Here.

Click here for download presentation.

André Ontalba – Board Member

LUXOUG at OOW19. Have you missed the Oracle Open World 2019?

Category: Database Author: BRUNO REIS Date: 5 years ago Comments: 0

LUXOUG at OOW19 - Have you missed the Oracle Open World 2019?

Fernando Simon’s article about the new Exadata: http://www.fernandosimon.com/blog/exadata-x8m/

Download presentations: https://events.rainfocus.com/widget/oracle/oow19/catalogow19?

Bruno Reis


ASM, Mount restricted force for recovery

This representation is for Exadata, but it is a safe representation of the general case. Basically, your data will be in at least two different failgroups:

Environment

In the example that I use here, I have one diskgroup called DATA, which has 7 (seven) disks, each one in its own failgroup. The redundancy for this diskgroup is NORMAL, which means that each block is copied to two failgroups. If two failures occur, I will probably have data loss/corruption. Look:

The version of my GI is 19.6.0.0, but this can be used with 12.1.0.2 and newer versions (it works for 11.2.0.4 in some versions). On this server, I have three databases running: DBA19, DBB19, and DBC19.

So, with everything running correctly, the data from my databases is spread over two failgroups (this is just an illustration, not an exact picture of where the blocks of my databases are):

Understanding the failure

A “default configuration” has these values:

Mount restricted force for recovery

Here, I will simulate multiple disk failures (more than one) and show how you can use mount restricted force for recovery. Please be careful and follow all the steps correctly, both to avoid mistakes and to understand what to do and what is happening.

So, here I have the DATA diskgroup, with normal redundancy and 7 (seven) failgroups, and the DBA19, DBB19, and DBC19 databases running.

So, as the first step, I will simulate a complete failure of the CELLI03 failgroup. In my environment, to allow more control, I have one iSCSI target for each failgroup (this allows me to disconnect them one by one if needed). CELLI03 died:

And at ASM alertlog we can see:

So, the failure occurred at 17:14. The full output can be found here at ASM-ALERTLOG-Output-Failure-CELLI03.txt

And we can see that the failgroup disappeared (but was not deleted or dropped) from ASM:

At this point, ASM starts the 12-hour clock (as defined in my repair attributes). The failgroup was not dropped and no rebalance was started because ASM is optimistic that you will fix the issue within this period.
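To make the timer concrete, the arithmetic can be sketched as below. This is only an illustration: `repair_deadline` is a hypothetical helper, the calendar date is made up (only the 17:14 failure time comes from this example), and the sketch assumes the repair time is written with an `h` (hours) or `m` (minutes) suffix.

```python
from datetime import datetime, timedelta


def repair_deadline(offline_at, repair_time):
    """Return when a repair timer started at `offline_at` runs out.

    `repair_time` is a number followed by a unit suffix:
    'h' for hours or 'm' for minutes (for example '12h' or '720m').
    """
    value, unit = float(repair_time[:-1]), repair_time[-1].lower()
    if unit == "h":
        delta = timedelta(hours=value)
    elif unit == "m":
        delta = timedelta(minutes=value)
    else:
        raise ValueError("unsupported unit in %r" % repair_time)
    return offline_at + delta


# Hypothetical date; the failure time 17:14 is the one reported above.
failure = datetime(2020, 1, 1, 17, 14)
print(repair_deadline(failure, "12h"))
```

With a 12-hour window, a failgroup that went offline at 17:14 has until 05:14 of the next day to come back before ASM gives up on it.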

But after some time I had a second failure in the diskgroup:

Now in the ASM alertlog you can see that the diskgroup was dismounted (and several other messages). Below is a cropped part of the alertlog. The full output (and I think it deserves a look) is here at ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01.txt

So, probably some data was lost. And if you consider that data was changed in the databases during these 4 minutes between the failures, the mess is big. If you want to see the alertlogs from the databases, check here at ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure.txt

And now we have this at ASM:

And if we try to mount we receive an error due to disk offline:

Imagine that you can bring back online the first failed failgroup (CELL03), whose data is 4 minutes behind:

And if you try to mount it normally, you will receive an error (the alertlog output for this attempt can be seen here at ASM-ALERTLOG-Output-Mout-With-One-Disk-Online):

So, now we can try the mount restricted force for recovery:

The ASM alertlog (which you can see in full here at ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery.txt) reports messages related to the diskgroup cache and to the disks that need to be checked. And now we are like this:

The next step is to bring online the failgroup that came back:

Doing this, ASM will resync this failgroup (using its blocks as the last version) and bring the cache of this disk online. In the ASM alertlog you can see messages like these (full output here at ASM-ALERTLOG-Output-Online-Restored-Failgroup):

And now we have at ASM:

And the rebalance does not continue because it is not allowed while the diskgroup is in restricted mode:

But since the failgroup came online “in a forced way”, the old cache (from CELL01) needs to be cleaned. And since it is not the last version, maybe some files were corrupted. To check this, you can look at the *arb* process trace files in the ASM trace directory:

And looking at one of the latest, we can see that some extents (those that do not exist in the recovered failgroup, or for which the cache is not the last version) were filled with dummy (BADFDA7A) data:

After that, we can normally dismount the diskgroup:

And mount it again:

Since now the diskgroup is mounted in a clean way, you can continue with the rebalance:

The state on the ASM side is:

As you can see, the CELL01 was not removed yet (I will talk about it later). But the activities can continue, databases can be checked.

Database side

On the database side, we need to check what we lost and what needs to be recovered. Since I am using a cluster, the GI tried to start the databases (and as you can see, two came up):

DBA19

The first one that I checked was DBA19. I used RMAN to VALIDATE DATABASE:

As you can see, datafile 4 FAILED and needs to be recovered. Luckily, the redo was not affected and the open was OK. Since the failed datafile belonged to UNDO, I shut down with abort (because immediate can take an eternity and, since undo was down, nothing was happening inside the database anyway).

But as you saw, just one datafile was corrupted. Of course, with big databases and big failgroups, more files will be corrupted. But it is a shot that can be worth it.

DBB19

The second was DBB19 and I used the same approach, VALIDATE DATABASE:

As you saw, there were no failures for DBB19. Since validate returned no failed files, I also ran VALIDATE CHECK LOGICAL DATABASE to check the blocks logically.

DBC19

Same for the last database, but now, datafile 3 failed:

Dropping failgroup

If the fix for the remaining failgroup takes too long, it will be dropped automatically. But we can do this manually with force (note that without force it fails):

And after the rebalance finishes, all its disks will be removed:

The steps for MOUNT RESTRICTED FORCE FOR RECOVERY

To summarize, the steps needed are (in order):

Put online the failed disk/failgroup

Execute alter diskgroup <DG> mount restricted force for recovery

Bring online the failgroup with alter diskgroup <DG> online disks in failgroup <FG>

Clean dismount of the diskgroup: alter diskgroup <DG> dismount

Clean mount of the diskgroup: alter diskgroup <DG> mount

Check the databases for failures and recover them
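The ASM part of the steps above can be sketched as a small helper that fills in the diskgroup and failgroup names, so the statements are always produced in the same order. This is an illustration only: `recovery_statements` is a hypothetical function, it only prints text and does not connect to ASM, and the names DATA and CELLI03 are the ones used in this example.

```python
def recovery_statements(diskgroup, failgroup):
    """Return the ASM statements of the summary, in execution order.

    Bringing the failed disks back at the storage level (first item of the
    list) and checking the databases (last item) happen outside ASM and
    are not covered here.
    """
    return [
        "alter diskgroup %s mount restricted force for recovery;" % diskgroup,
        "alter diskgroup %s online disks in failgroup %s;" % (diskgroup, failgroup),
        "alter diskgroup %s dismount;" % diskgroup,
        "alter diskgroup %s mount;" % diskgroup,
    ]


for statement in recovery_statements("DATA", "CELLI03"):
    print(statement)
```

Printing the statements instead of executing them is intentional: each one should be run by hand and followed in the ASM alertlog before moving to the next.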

Undocumented feature

Recover from diskgroup failure using the 12.1.0.2 “mount restricted force for recovery” feature – An Example (Doc ID 1968642.1)

How to change the DISK_REPAIR_TIME timer after disk goes offline from failgroup (Doc ID 1404123.1)

The ASM Priority Rebalance feature – An Example (Doc ID 1968607.1)

Cleaning up JOB's with OEM 13c errors

Today, a simple but useful article. In OEM 13c we have the very useful job scheduling system.

However, as we can see in the image below when we have several errors, it is difficult to clean the jobs using the web interface or EMCLI.

We can easily solve this with a PL/SQL script to clean up jobs.

Now let’s check, and the jobs are clean

I hope I helped with this tip

André Ontalba

TFA error after GI upgrade to 19c

Recently I made an Exadata stack upgrade/update to the latest 19.2 version (19.2.7.0.0.191012) and I upgraded the GI from 18c to 19c (latest 19c version – 19.5.0.0.191015), and after that, TFA does not work.

Since I didn’t want to execute a complete TFA clean and reinstallation, I tried to find the error and the solution. Here I want to share with you the workaround (since there is no solution yet) that I discovered and used to fix the error.

The environment

The actual environment is:

Old Grid Infrastructure: Version 18.6.0.0.190416

New Grid Infrastructure: Version 19.5.0.0.191015

Exadata domU: Version 19.2.7.0.0.191012 running kernel 4.1.12-124.30.1.el7uek.x86_64

TFA error

After upgrading the GI from 18c to 19c, TFA does not work. If you try to start it or collect logs using it, you can receive errors. In the environment described here, TFA was running fine with the 18c version, and the rootupgrade script from 18c to 19c did not report any error.

And to be more precise, the TFA upgrade from 18c to 19c called by rootupgrade was ok (according to the log – I will show later). But even after that, the error occurs.
