Exadata, Missing Metric

Understanding the metrics of the Exadata Storage Server is important to see how all the software features are being used, and in what detail. Here I will discuss one case where the FC_IO_BY_R_SEC metric can show imprecise values, and one missing metric that would have saved a lot of investigation.
If you have doubts about metrics, you can check my introductory post about them, which covers some aspects of how to read and use them. You can also check my other post, where I show how to use the metric DB_FC_IO_BY_SEC to identify database problems that stay hidden when checking only from the database side.

 

Metrics collection

For this post, all metrics were collected from one storage server, but the behavior occurs in every cell of the Exadata, and in all versions. To be accurate, they were all collected for the same timestamp, so they represent a single minute; but again, the behavior that I show here occurs every time. When reading the metrics, pay attention to the metric name and the collection time.
The idea for this post came when I was investigating the storage usage of one database, to see if we were hitting the limits of the Exadata Storage. I needed to check disk, flash cache, and other details. For the behavior I show below, I have already engaged Oracle (SR, bug, enhancement request).

 

DB_FC_IO_BY_SEC

I already discussed this metric in a previous post; according to the official docs, it reports “The number of megabytes of I/O per second for this database to flash cache”. Look at the usage for the database DBPR1_EXA:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name = 'DB_FC_IO_BY_SEC' and metricObjectName like 'DBPR1_EXA'

         DB_FC_IO_BY_SEC         DBPR1_EXA       850 MB/sec      2019-12-13T15:42:03+01:00




CellCLI>

 

As you can see, at 15:42:03 on 2019-12-13 the storage server reported that during the last minute this database read around 850 MB/s from flash cache.
And this was by far the highest database consumption (the listing below shows every database that used more than 0.5 MB/s):

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'DB_.*BY.*SEC' and metricvalue > 0.5

         DB_FC_IO_BY_SEC         ORADB01P                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADB02P                7 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADB01V                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADB03P                5 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORAD01P                 6 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADBPR_EXA             2 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         DBPR1_EXA               850 MB/sec      2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         DBP01                   104 MB/sec      2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB01P                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB02P                7 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB01V                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB03P                5 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORAD01P                 6 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADBPR_EXA             2 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         DBPR1_EXA               850 MB/sec      2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         DBP01                   105 MB/sec      2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         _OTHER_DATABASE_        1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FL_IO_BY_SEC         DBP01                   1.150 MB/sec    2019-12-13T15:42:03+01:00

         DB_IO_BY_SEC            ORAD01P                 2 MB/sec        2019-12-13T15:42:03+01:00

         DB_IO_BY_SEC            ORADB1P_EXA             1 MB/sec        2019-12-13T15:42:03+01:00

         DB_IO_BY_SEC            DBP01                   2 MB/sec        2019-12-13T15:42:03+01:00




CellCLI>

 

If I sum the flash cache usage (DB_FC_IO_BY_SEC) of all the databases above, it comes to around 976 MB/s.
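As a side note, this sum is easy to automate. Here is a minimal sketch, run directly on the cell (the timestamps are the ones used above; field $3 of the cellcli output is the metric value):

# Sum DB_FC_IO_BY_SEC over all databases for one minute.
cellcli -e "list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' \
            and collectionTime < '2019-12-13T15:43:00+01:00' \
            and name = 'DB_FC_IO_BY_SEC'" \
| awk '{gsub(",", "", $3); sum += $3} END {printf "%.0f MB/sec total\n", sum}'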

 

FC_IO_BY_R_SEC

 

Since DB_FC_IO_BY_SEC is reported per database, I went to check the metric for the whole flash cache: FC_IO_BY_R_SEC. This metric reports (according to the docs) “The number of megabytes read per second from flash cache”. And since we are reading the flash cache metrics directly, it reports the number for all databases together.
But look at the numbers:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name = 'FC_IO_BY_R_SEC';

         FC_IO_BY_R_SEC  FLASHCACHE      19.343 MB/sec   2019-12-13T15:42:03+01:00




CellCLI>
 
So, at the very same moment that one database metric reported around 850 MB/s read from flash cache, the flash cache metric reported around 19 MB/s. Something is not correct.
And even if I check the minute before and after (maybe it was a glitch in the collection time), the numbers are not much different:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_R_SEC';

         FC_IO_BY_R_SEC  FLASHCACHE      27.519 MB/sec   2019-12-13T15:41:03+01:00

         FC_IO_BY_R_SEC  FLASHCACHE      19.343 MB/sec   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SEC  FLASHCACHE      39.600 MB/sec   2019-12-13T15:43:03+01:00




CellCLI>

 

Other flash metrics

So, since the numbers appeared to be completely wrong, I searched for other metrics that report flash usage (flash disks and flash cache, for example).

 

CD_IO_BY_*_*_SEC

 

I started the search at the lowest level possible, going directly to the celldisk metrics. For that I used CD_IO_BY_R_LG_SEC and CD_IO_BY_R_SM_SEC (for reads), and CD_IO_BY_W_LG_SEC and CD_IO_BY_W_SM_SEC (for writes). The first one, for example, reports “The rate which is the number of megabytes read in large blocks per second from a cell disk”.
Filtering just the flash disks, I had these numbers:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'CD_IO_BY.*SEC.*' and metricobjectname like 'FD_.*' and metricvalue > 0

         CD_IO_BY_R_LG_SEC       FD_00_exastradm01       179 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_LG_SEC       FD_01_exastradm01       296 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_LG_SEC       FD_02_exastradm01       200 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_LG_SEC       FD_03_exastradm01       250 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_00_exastradm01       3.161 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_01_exastradm01       3.152 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_02_exastradm01       2.990 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_03_exastradm01       3.741 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_00_exastradm01       0.859 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_01_exastradm01       1.125 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_02_exastradm01       1.028 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_03_exastradm01       0.801 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_00_exastradm01       0.982 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_01_exastradm01       0.998 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_02_exastradm01       1.006 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_03_exastradm01       0.937 MB/sec    2019-12-13T15:42:03+01:00




CellCLI>

 

As you can see, just summing up CD_IO_BY_R_LG_SEC I got around 925 MB/s, a number close to the one I got from the database metric for flash reads. Another hint here is that the reads are requesting more than 128 KB per access, since they were counted as large (LG).
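One hedged way to confirm the request size (assuming the workload is steady over the minute) is to compare the byte rate with the request rate for the same celldisks, since CD_IO_BY_R_LG_SEC divided by CD_IO_RQ_R_LG_SEC gives the average megabytes per large read request:

CellCLI> list metriccurrent where name = 'CD_IO_BY_R_LG_SEC' and metricObjectName like 'FD_.*'

CellCLI> list metriccurrent where name = 'CD_IO_RQ_R_LG_SEC' and metricObjectName like 'FD_.*'

If, as a hypothetical example, one flash disk shows 296 MB/sec and 1,000 IO/sec, the average request is about 0.3 MB, above the 128 KB threshold, and therefore counted as LG.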

 

FL_IO_DB_BY_W_SEC and FL_IO_FL_BY_W_SEC

 

Since I was checking for a bad flash report, I also checked the flash log metrics, but they reported values close to the celldisk write metrics:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'FL_IO.*_SEC'

         FL_IO_DB_BY_W_SEC       FLASHLOG        1.974 MB/sec    2019-12-13T15:42:03+01:00

         FL_IO_FL_BY_W_SEC       FLASHLOG        2.673 MB/sec    2019-12-13T15:42:03+01:00




CellCLI>

 

So, the metrics for disk usage were OK, reporting correct values.

 

SIO_IO_RD_FC_SEC

 

As you can see, the metrics for database usage of flash cache report the same values that we can check at the level closest to the hardware (celldisk). So, maybe the software was misreading in-memory values.
Because of that, I checked other metrics that report on features that interact with flash cache. So, I checked Smart I/O:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'SIO_.*SEC.*'

         SIO_IO_EL_OF_SEC        SMARTIO         1,223 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_OF_RE_SEC        SMARTIO         34.688 MB/sec   2019-12-13T15:42:03+01:00

         SIO_IO_PA_TH_SEC        SMARTIO         0.000 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_FC_HD_SEC     SMARTIO         0.174 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_FC_SEC        SMARTIO         843 MB/sec      2019-12-13T15:42:03+01:00

         SIO_IO_RD_HD_SEC        SMARTIO         0.101 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_RQ_FC_HD_SEC  SMARTIO         0.183 IO/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_RQ_FC_SEC     SMARTIO         850 IO/sec      2019-12-13T15:42:03+01:00

         SIO_IO_RD_RQ_HD_SEC     SMARTIO         0.000 IO/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RV_OF_SEC        SMARTIO         3.392 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_SI_SV_SEC        SMARTIO         362 MB/sec      2019-12-13T15:42:03+01:00

         SIO_IO_WR_FC_SEC        SMARTIO         0.008 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_WR_HD_SEC        SMARTIO         0.000 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_WR_RQ_FC_SEC     SMARTIO         0.017 IO/sec    2019-12-13T15:42:03+01:00

         SIO_IO_WR_RQ_HD_SEC     SMARTIO         0.000 IO/sec    2019-12-13T15:42:03+01:00




CellCLI>

 

And as you can see, SIO_IO_RD_FC_SEC (which means “The number of megabytes per second read from flash cache by smart I/O”) reported almost the same value (843 MB/s) as the database and celldisk metrics.

 

The Missing Metric

 

As you can see above, the metrics closest to the hardware (celldisk) reported almost the same value as the database metric, and the feature metrics (Smart I/O) reported almost the same as well. Only the flash cache metric FC_IO_BY_R_SEC disagrees.

 

 

So, it is not a misread from memory, since Smart I/O reports correctly. Returning to the review of the flash cache, all its metrics are:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'FC_IO_BY.*' and metricvalue > 0;

         FC_IO_BYKEEP_W                          FLASHCACHE      519 MB                          2019-12-13T15:42:03+01:00

         FC_IO_BY_DISK_WRITE                     FLASHCACHE      111,626,264 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_DISK_WRITE_SEC                 FLASHCACHE      2.052 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R                              FLASHCACHE      1,475,538,050 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_R_ACTIVE_SECONDARY             FLASHCACHE      1,132,686 MB                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_ACTIVE_SECONDARY_MISS        FLASHCACHE      237,442 MB                      2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DISK_WRITER                  FLASHCACHE      114,172,096 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DISK_WRITER_SEC              FLASHCACHE      2.057 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DW                           FLASHCACHE      13,071,021,100 MB               2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS                         FLASHCACHE      42,381,380 MB                   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS_DW                      FLASHCACHE      133,613,408 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS_SEC                     FLASHCACHE      0.438 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SEC                          FLASHCACHE      19.343 MB/sec                   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SKIP                         FLASHCACHE      97,002,568 MB                   2019-12-13T15:42:03+01:00

         FC_IO_BY_W                              FLASHCACHE      1,761,639,940 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_W_DISK_WRITER                  FLASHCACHE      111,615,088 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_DISK_WRITER_SEC              FLASHCACHE      2.052 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_FIRST                        FLASHCACHE      183,171,872 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_FIRST_SEC                    FLASHCACHE      2.452 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_OVERWRITE                    FLASHCACHE      1,475,454,720 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_W_OVERWRITE_SEC                FLASHCACHE      1.346 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_POPULATE                     FLASHCACHE      103,121,912 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_POPULATE_SEC                 FLASHCACHE      0.381 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SEC                          FLASHCACHE      4.179 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP                         FLASHCACHE      1,039,399,810 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP_LG                      FLASHCACHE      605,535,040 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP_LG_SEC                  FLASHCACHE      2.695 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP_SEC                     FLASHCACHE      6.589 MB/sec                    2019-12-13T15:42:03+01:00




CellCLI>

 

As you can see above, the reads did not land in FC_IO_BY_R_SKIP (“The number of megabytes read from disks for I/O requests that bypass flash cache”) or FC_IO_BY_R_MISS_SEC (“The number of megabytes read from disks per second because not all requested data was in flash cache”). And even if we consider some minutes before and after:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_R_SKIP';

         FC_IO_BY_R_SKIP         FLASHCACHE      97,002,568 MB   2019-12-13T15:41:03+01:00

         FC_IO_BY_R_SKIP         FLASHCACHE      97,002,568 MB   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SKIP         FLASHCACHE      97,002,568 MB   2019-12-13T15:43:03+01:00




CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_W_SKIP';

         FC_IO_BY_W_SKIP         FLASHCACHE      1,039,399,420 MB        2019-12-13T15:41:03+01:00

         FC_IO_BY_W_SKIP         FLASHCACHE      1,039,399,810 MB        2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP         FLASHCACHE      1,039,401,150 MB        2019-12-13T15:43:03+01:00




CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name like 'FC_IO_BY_.*_MISS.*SEC' and  metricvalue > 0

         FC_IO_BY_R_MISS_SEC     FLASHCACHE      0.702 MB/sec    2019-12-13T15:41:03+01:00

         FC_IO_BY_R_MISS_SEC     FLASHCACHE      0.438 MB/sec    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS_SEC     FLASHCACHE      0.521 MB/sec    2019-12-13T15:43:03+01:00




CellCLI>

 

But if we look closely at the flash cache metrics above, we can see the high value of FC_IO_BY_R_DW (“The number of megabytes of DW data read from flash cache”). And if we check it over a few minutes, we can see the usage:

 

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_R_DW';

         FC_IO_BY_R_DW   FLASHCACHE      13,070,966,800 MB       2019-12-13T15:41:03+01:00

         FC_IO_BY_R_DW   FLASHCACHE      13,071,021,100 MB       2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DW   FLASHCACHE      13,071,055,900 MB       2019-12-13T15:43:03+01:00




CellCLI>

 

From that, we can see that at 15:41 the cumulative value was 13,070,966,800 MB and at 15:42 it was 13,071,021,100 MB. The difference over that minute is 54,300 MB, and dividing by the seconds gives 54,300/60 = 905 MB/s.
So, the missing values were found: the flash cache read usage was counted as DW. But if you followed everything, these were legitimate reads from flash cache that were not counted in FC_IO_BY_R_SEC. So, the missing metric, in this case, is FC_IO_BY_R_DW_SEC.
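Since FC_IO_BY_R_DW_SEC does not exist, you can derive the rate yourself from the cumulative metric. A minimal sketch, assuming two consecutive one-minute samples as above:

# Derive a per-second rate for FC_IO_BY_R_DW from two cumulative samples
# taken one minute apart: the delta divided by 60 seconds.
cellcli -e "list metrichistory where collectionTime > '2019-12-13T15:40:30+01:00' \
            and collectionTime < '2019-12-13T15:42:30+01:00' \
            and name = 'FC_IO_BY_R_DW'" \
| awk '{gsub(",", "", $3); v[NR] = $3} END {printf "%.0f MB/sec\n", (v[2] - v[1]) / 60}'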

 

 

Reasons

The reason why the reads were counted in FC_IO_BY_R_DW and not in the generic FC_IO_BY_R_SEC is not clear. Maybe it is because the tablespace block size of the database is different from 8K, or maybe because the requests were large (more than 128 KB, as we saw in the celldisk metrics). Or maybe FC_IO_BY_R_SEC just reports part of what was really read from flash cache by the database. The documentation is not clear, and the details of why this behavior occurs still need to be clarified.
But we can see that FC_IO_BY_R_SEC does not reflect the real usage in some cases: if you want to verify flash cache usage, this metric can sometimes report wrong values. Whether the other flash cache metrics (like SKIP or MISS) are correct was impossible to verify at this point. One option is to check the celldisk metrics of the flash disks to verify whether you have reached the limits of flash usage, or to check directly from the database metrics.
 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

 


Exadata, Workaround for oracka.ko error

Recently I updated an Exadata stack to the latest 19.2 version (19.2.7.0.0.191012), released in October of 2019, and updated the GI to the latest 19c version (19.5.0.0.191015). After that, I had some issues creating 11G databases.
When I tried to create an 11G RAC database, the error “File -oracka.ko- was not found” appeared and the creation failed. Here I want to share the workaround (since there is no solution yet) that I discovered and used to bypass the error.

 

The environment

 

The environment is:
  • Grid Infrastructure: Version 19.5.0.0.191015
  • Exadata domU: Version 19.2.7.0.0.191012 running kernel 4.1.12-124.30.1.el7uek.x86_64
  • 11G Database: Version 11.2.0.4.180717
  • ACFS: Used to store some files

oracka.ko

So, calling dbca:

 

[DEV-oracle@exsite1c1-]$ /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbca -silent -createDatabase -templateName General_Purpose.dbc -gdbName D11TST19 -adminManaged -sid D11TST19 -sysPassword oracle11 -systemPassword oracle11 -characterSet WE8ISO8859P15 -emConfiguration NONE -storageType ASM -diskGroupName DATAC8 -recoveryGroupName RECOC8 -nodelist exsite1c1,exsite1c2 -sampleSchema false

Copying database files

100% complete

Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/D11TST19/D11TST19.log" for further details.

[DEV-oracle@exsite1c1-]$

 

And at the log:

 

[DEV-oracle@exsite1c1-]$ cat /u01/app/oracle/cfgtoollogs/dbca/D11TST19/D11TST19.log

Creating and starting Oracle instance

PRCR-1006 : Failed to add resource ora.d11tst19.db for d11tst19

PRCD-1094 : Failed to create start dependency for database d11tst19 on the ASM cluster filesystem where ACFS path /u01/app/oracle/product/11.2.0.4/dbhome_1 resides

PRCT-1129 : Execution of ACFS driver state check failed on local node with result Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

ACFS-9200: Supported

DBCA_PROGRESS : 100%

[DEV-oracle@exsite1c1-]$

 

If you check the trace of the database creation:

 

[DEV-oracle@exsite1c1-]$ vi /u01/app/oracle/cfgtoollogs/dbca/D11TST19/trace.log

[Thread-166] [ 2019-11-07 08:30:37.860 CET ] [ASMFactoryImpl.isACFSSupported:954]  Entry: oraHome=/u01/app/oracle/product/11.2.0.4/dbhome_1, m_crsHome=/u01/app/19.0.0.0/grid

[Thread-166] [ 2019-11-07 08:30:37.860 CET ] [ASMFactoryImpl.isACFSSupported:958]  Checking if ACFS now...

[Thread-166] [ 2019-11-07 08:30:37.861 CET ] [USMDriverUtil.<init>:117]  Checking file exists for: /u01/app/19.0.0.0/grid/bin/acfsdriverstate

[Thread-166] [ 2019-11-07 08:30:37.862 CET ] [NativeSystem.isCmdScv:502]  isCmdScv: cmd=[]

[Thread-182] [ 2019-11-07 08:30:37.866 CET ] [StreamReader.run:61]  In StreamReader.run

[Thread-181] [ 2019-11-07 08:30:37.868 CET ] [StreamReader.run:61]  In StreamReader.run

[Thread-182] [ 2019-11-07 08:30:38.230 CET ] [StreamReader.run:65]  ERROR>Argument "2/8" isn't numeric in numeric ne (!=) at /u01/app/19.0.0.0/grid/lib/osds_acfsroot.pm line 2093.

[Thread-182] [ 2019-11-07 08:30:38.230 CET ] [StreamReader.run:65]  ERROR>Argument "2/8" isn't numeric in numeric ne (!=) at /u01/app/19.0.0.0/grid/lib/osds_acfsroot.pm line 2093.

[Thread-182] [ 2019-11-07 08:30:38.235 CET ] [StreamReader.run:65]  ERROR>error:  cannot delete old /u01/app/19.0.0.0/grid/usm/install/Oracle/EL7UEK/x86_64/4.1.12-112.16.4/4.1.12-112.16.4-x86_64/bin/oracka.ko

[Thread-182] [ 2019-11-07 08:30:38.235 CET ] [StreamReader.run:65]  ERROR>        Permission denied

[Thread-181] [ 2019-11-07 08:30:38.310 CET ] [StreamReader.run:65]  OUTPUT>Errors uncompressing file: oracka.zip.

[Thread-181] [ 2019-11-07 08:30:38.405 CET ] [StreamReader.run:65]  OUTPUT>File -oracka.ko- was not found.

[Thread-181] [ 2019-11-07 08:30:38.485 CET ] [StreamReader.run:65]  OUTPUT>Error # 50 - . Output: .

[Thread-182] [ 2019-11-07 08:30:38.504 CET ] [StreamReader.run:65]  ERROR>error:  cannot delete old /u01/app/19.0.0.0/grid/usm/install/Oracle/EL7UEK/x86_64/4.1.12-112.16.4/4.1.12-112.16.4-x86_64/bin/oracka.ko

[Thread-182] [ 2019-11-07 08:30:38.504 CET ] [StreamReader.run:65]  ERROR>        Permission denied

[Thread-181] [ 2019-11-07 08:30:38.587 CET ] [StreamReader.run:65]  OUTPUT>Errors uncompressing file: oracka.zip.

[Thread-181] [ 2019-11-07 08:30:38.667 CET ] [StreamReader.run:65]  OUTPUT>File -oracka.ko- was not found.

[Thread-181] [ 2019-11-07 08:30:38.753 CET ] [StreamReader.run:65]  OUTPUT>Error # 50 - . Output: .

[Thread-181] [ 2019-11-07 08:30:40.877 CET ] [StreamReader.run:65]  OUTPUT>ACFS-9200: Supported

[Thread-166] [ 2019-11-07 08:30:40.881 CET ] [UnixSystem.dorunRemoteExecCmd:3232]  retval = 0

[Thread-166] [ 2019-11-07 08:30:40.881 CET ] [UnixSystem.dorunRemoteExecCmd:3256]  exitValue = 0

[Thread-166] [ 2019-11-07 08:30:40.882 CET ] [USMDriverUtil.checkACFSState:267]  Printing ACFS output

[Thread-166] [ 2019-11-07 08:30:40.882 CET ] [USMDriverUtil.checkACFSState:268]  Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

ACFS-9200: Supported

[Thread-166] [ 2019-11-07 08:30:40.882 CET ] [InstanceStepOPS.executeImpl:1014]  PRCR-1006 : Failed to add resource ora.d11tst19.db for d11tst19

PRCD-1094 : Failed to create start dependency for database d11tst19 on the ASM cluster filesystem where ACFS path /u01/app/oracle/product/11.2.0.4/dbhome_1 resides

PRCT-1129 : Execution of ACFS driver state check failed on local node with result Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

ACFS-9200: Supported

[Thread-166] [ 2019-11-07 08:30:40.883 CET ] [BasicStep.configureSettings:304]  messageHandler being set=oracle.sysman.assistants.util.SilentMessageHandler@5a943dc4

[Thread-166] [ 2019-11-07 08:30:40.883 CET ] [BasicStep.configureSettings:304]  messageHandler being set=oracle.sysman.assistants.util.SilentMessageHandler@5a943dc4

oracle.sysman.assistants.util.step.StepExecutionException: PRCR-1006 : Failed to add resource ora.d11tst19.db for d11tst19

PRCD-1094 : Failed to create start dependency for database d11tst19 on the ASM cluster filesystem where ACFS path /u01/app/oracle/product/11.2.0.4/dbhome_1 resides

PRCT-1129 : Execution of ACFS driver state check failed on local node with result Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

Errors uncompressing file: oracka.zip.

File -oracka.ko- was not found.

Error # 50 - . Output: .

ACFS-9200: Supported

        at oracle.sysman.assistants.dbca.backend.InstanceStepOPS.executeImpl(InstanceStepOPS.java:1015)

        at oracle.sysman.assistants.util.step.BasicStep.execute(BasicStep.java:210)

        at oracle.sysman.assistants.util.step.BasicStep.callStep(BasicStep.java:251)

        at oracle.sysman.assistants.dbca.backend.DBEntryStep.executeImpl(DBEntryStep.java:229)

        at oracle.sysman.assistants.util.step.BasicStep.execute(BasicStep.java:210)

        at oracle.sysman.assistants.util.step.Step.execute(Step.java:140)

        at oracle.sysman.assistants.util.step.StepContext$ModeRunner.run(StepContext.java:2711)

        at java.lang.Thread.run(Thread.java:637)

[Thread-166] [ 2019-11-07 08:30:40.883 CET ] [SQLEngine.done:2189]  Done called

 

So, as you can see in the trace, when the database creation process detects that ACFS is in place, it tries to unzip some files (.ko, kernel objects) related to ACFS access, but it receives errors: it cannot delete the old oracka.ko, and uncompressing the file oracka.zip fails.

 

Source of the problem

Checking the problem, we can see that the permission denied occurs in the folder $GI_HOME/usm/install/Oracle/EL7UEK/x86_64/4.1.12-112.16.4/4.1.12-112.16.4-x86_64/bin/ inside the GI home, which, as you can see, is the same family as our domU kernel version. In MOS there is no reference to this error, and after I opened an SR, Oracle sent it to the dev team without an ETA (as usual).
Because of that, I continued to investigate and checked the folder:

 

[root@exsite1c1 ~]# cd /u01/app/19.0.0.0/grid/usm/install/Oracle/EL7UEK/x86_64/4.1.12-112.16.4/4.1.12-112.16.4-x86_64/bin/

[root@exsite1c1 bin]#

[root@exsite1c1 bin]# ls -l

total 151244

-rw-r--r-- 1 root oinstall  3085340 Sep  1 12:40 oracka.ko

-rw-r--r-- 1 root oinstall  2304100 Sep  1 12:44 oracka_mod_kga.ko

-rw-r--r-- 1 grid oinstall   623207 Oct  9 17:13 oracka_mod_kga.zip

-rw-r--r-- 1 grid oinstall   889060 Oct  9 17:14 oracka.zip

-rw-r--r-- 1 root oinstall 64000932 Sep  1 12:04 oracleacfs.ko

-rw-r--r-- 1 grid oinstall 19246647 Oct  9 17:13 oracleacfs.zip

-rw-r--r-- 1 root oinstall 27882452 Sep  1 11:25 oracleadvm.ko

-rw-r--r-- 1 grid oinstall  8777000 Oct  9 17:13 oracleadvm.zip

-rw-r--r-- 1 root oinstall  8776676 Sep  1 12:55 oracleafd.ko

-rw-r--r-- 1 grid oinstall  2879744 Oct  9 17:14 oracleafd.zip

-rw-r--r-- 1 root oinstall 10696156 Sep  1 11:04 oracleoks.ko

-rw-r--r-- 1 grid oinstall  3346594 Oct  9 17:14 oracleoks.zip

-rw-r--r-- 1 root oinstall  1757116 Sep  1 13:02 oracleolfs.ko

-rw-r--r-- 1 grid oinstall   567865 Oct  9 17:14 oracleolfs.zip

[root@exsite1c1 bin]#

 

As you can see, the GI home came with some .ko files already uncompressed, and the group permission is read-only. So, the delete and uncompress done by dbca will not work properly. As a workaround, I changed the permissions of the bin folder and of the .ko files to allow changes by the oinstall group – done on all nodes of the cluster:

 

[root@exsite1c1 4.1.12-112.16.4-x86_64]# pwd

/u01/app/19.0.0.0/grid/usm/install/Oracle/EL7UEK/x86_64/4.1.12-112.16.4/4.1.12-112.16.4-x86_64

[root@exsite1c1 4.1.12-112.16.4-x86_64]#

[root@exsite1c1 4.1.12-112.16.4-x86_64]#

[root@exsite1c1 4.1.12-112.16.4-x86_64]# chmod g+w bin

[root@exsite1c1 4.1.12-112.16.4-x86_64]#

[root@exsite1c1 4.1.12-112.16.4-x86_64]#

[root@exsite1c1 4.1.12-112.16.4-x86_64]# ls -l

total 4

drwxrwxr-x 2 grid oinstall 4096 Nov  8 15:08 bin

[root@exsite1c1 4.1.12-112.16.4-x86_64]# cd bin

[root@exsite1c1 bin]# chown grid:oinstall oracka.ko oracka_mod_kga.ko oracleacfs.ko oracleadvm.ko oracleafd.ko oracleoks.ko oracleolfs.ko

[root@exsite1c1 bin]#
After executing this on both nodes, dbca finishes:

 

[DEV-oracle@exsite1c1-]$ /u01/app/oracle/product/11.2.0.4/dbhome_1/bin/dbca -silent -createDatabase -templateName General_Purpose.dbc -gdbName D11TST19 -adminManaged -sid D11TST19 -sysPassword oracle11 -systemPassword orcle11 -characterSet WE8ISO8859P15 -emConfiguration NONE -storageType ASM -diskGroupName DATAC8 -recoveryGroupName RECOC8 -nodelist exsite1c1,exsite1c2 -sampleSchema false

Creating and starting Oracle instance

1% complete



96% complete

100% complete

Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/D11TST19/D11TST19.log" for further details.

[DEV-oracle@exsite1c1-]$

[DEV-oracle@exsite1c1-]$ $ORACLE_HOME/bin/srvctl status database -d D11TST19

Instance D11TST191 is running on node exsite1c1

Instance D11TST192 is running on node exsite1c2

[DEV-oracle@exsite1c1-]$

 

Just to confirm: the .ko files and the bin folder already come with the wrong permissions in the original 19c package:

 

[grid@exsite1c1 +ASM1]$ cd /u01/patches/grid

[grid@exsite1c1 +ASM1]$ unzip -q V982068-01.zip

[grid@exsite1c1 +ASM1]$ cd usm/install/Oracle/EL7UEK/x86_64/4.1.12-112.16.4/4.1.12-112.16.4-x86_64/bin

[grid@exsite1c1 +ASM1]$ ls -l

total 149116

-rw-r--r-- 1 grid oinstall  3085340 Feb 23  2019 oracka.ko

-rw-r--r-- 1 grid oinstall  2304100 Feb 23  2019 oracka_mod_kga.ko

-rw-r--r-- 1 grid oinstall   623214 Feb 23  2019 oracka_mod_kga.zip

-rw-r--r-- 1 grid oinstall   889106 Feb 23  2019 oracka.zip

-rw-r--r-- 1 grid oinstall 62740980 Feb 23  2019 oracleacfs.ko

-rw-r--r-- 1 grid oinstall 18881690 Feb 23  2019 oracleacfs.zip

-rw-r--r-- 1 grid oinstall 27485580 Feb 23  2019 oracleadvm.ko

-rw-r--r-- 1 grid oinstall  8643808 Feb 23  2019 oracleadvm.zip

-rw-r--r-- 1 grid oinstall  8773444 Feb 23  2019 oracleafd.ko

-rw-r--r-- 1 grid oinstall  2878816 Feb 23  2019 oracleafd.zip

-rw-r--r-- 1 grid oinstall 10655524 Feb 23  2019 oracleoks.ko

-rw-r--r-- 1 grid oinstall  3335912 Feb 23  2019 oracleoks.zip

-rw-r--r-- 1 grid oinstall  1757348 Feb 23  2019 oracleolfs.ko

-rw-r--r-- 1 grid oinstall   567876 Feb 23  2019 oracleolfs.zip

[grid@exsite1c1 +ASM1]$ cd ..

[grid@exsite1c1 +ASM1]$ ls -l

total 4

drwxr-xr-x 2 grid oinstall 4096 Apr 18  2019 bin

[grid@exsite1c1 +ASM1]$

 

The sum of all problems

This error with oracka.zip and oracka.ko occurs because of a sum of little details. The GI 19c release already came with bad folder permissions, and the release update 19.5.0.0.191015 does not solve the issue either (nor does the installation or the rootupgrade script). This bin folder is checked by dbca because the kernel image for Exadata 19.2.7.0.0 (19.2.7.0.0.191012) comes with kernel 4.1.12-124.30.1.el7uek.x86_64, and the ACFS modules for this kernel family live there. And to finish the sum of all problems, ACFS was in use, and for that reason dbca tries to use those kernel modules to check access.
So, since there is no solution in MOS (as of the day I am publishing this post), and there is no official workaround, I needed to find my own solution. The workaround described here is just a temporary one, to allow 11g RAC database creation for this specific sum of Exadata version + GI home 19c + ACFS in use.
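After applying the permission fix, you can re-run by hand the same check that dbca executes (the acfsdriverstate utility that appears in the trace above; the path is the one from this environment). With the fixed permissions it should print only the supported message, without the oracka errors:

[grid@exsite1c1 ~]$ /u01/app/19.0.0.0/grid/bin/acfsdriverstate supported

ACFS-9200: Supported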
 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”


Exadata, Understanding Metrics

Metrics for Exadata give you a way to deeply see, and understand, what is happening in the Exadata Storage Server and the Exadata software. Understanding them is fundamental to identify and solve problems that can be hidden (or even invisible) from the database side. In this post, I will explain details about these metrics and what you can do with them.
My last article about Exadata Storage Server metrics showed one example of how to use them to identify problems that do not appear on the database side. In that post, I showed how I used the metric DB_FC_IO_BY_SEC to identify bad queries.
The point about Exadata (which I made in that article) is that most of the time it is so powerful that bad statements are handled without a problem because of the features that exist (flash cache, Smart I/O, and others). Another point is that Exadata is usually a highly consolidated environment, where you “consolidate” a lot of databases, and it is normal that some of them have different workloads and needs. Using metrics can help you fine-tune your environment and, besides that, gives you one way to check and control everything that is happening.
In this post, I will not explain each metric one by one, but guide you to understand metrics and some interesting and important details about them.

 

Understanding metrics

 

Metrics for Exadata are values extracted directly from the hardware or from the storage server software. Values like “IOPS from each disk” or “MB/s read by Smart I/O” are examples of what you can discover. Directly from the docs:
“Metrics are recorded observations of important run-time properties or internal instrumentation values of the storage cell and its components, such as cell disks or grid disks. Metrics are a series of measurements that are computed and retained in memory for an interval of time, and stored on a disk for a more permanent history.”
To check the definitions of the Exadata metrics, and all the metrics available, the best place is the official Exadata User Guide, chapter 6. There you can see the definition of every metric and other important information. I really recommend that you read it to be aware of what you can extract from the metrics.
When reading metrics, you can read the current values (from the last minute) or the history. In the history, values are kept for each minute of the last 7 days. So, with metrics, you have 24×7 coverage, minute by minute, of the last 7 days; a good source of information to help you. And most important, they are individual and read from each storage server.
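The 7-day retention is controlled by the cell attribute metricHistoryDays. A minimal sketch to check it and, if you need a longer history, extend it (the value 14 here is just an example):

CellCLI> list cell attributes name, metricHistoryDays

CellCLI> alter cell metricHistoryDays=14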

 

Reading metrics

 

To read the metrics you can connect directly to the storage server and, with cellcli, use the “list metriccurrent” or “list metrichistory” commands:

 

[root@exacellsrvx-01 ~]# cellcli

CellCLI: Release 18.1.9.0.0 - Production on Sun Dec 08 15:01:42 BRST 2019




Copyright (c) 2007, 2016, Oracle and/or its affiliates. All rights reserved.




CellCLI> list metriccurrent

         CD_BY_FC_DIRTY                          CD_00_exacellsrvx-01                            0.000 MB





         SIO_IO_WR_RQ_FC_SEC                     SMARTIO                                         0.000 IO/sec

         SIO_IO_WR_RQ_HD                         SMARTIO                                         2,768,097 IO requests

         SIO_IO_WR_RQ_HD_SEC                     SMARTIO                                         0.000 IO/sec

 

Since it is based on the list command, you can show the details, restrict with a where clause, or change the attributes to display:

 

CellCLI> list metriccurrent where name = 'FL_IO_DB_BY_W_SEC' detail

         name:                   FL_IO_DB_BY_W_SEC

         alertState:             normal

         collectionTime:         2019-12-08T15:10:14-02:00

         metricObjectName:       FLASHLOG

         metricType:             Instantaneous

         metricValue:            0.189 MB/sec

         objectType:             FLASHLOG




CellCLI> list metriccurrent where name = 'FL_IO_DB_BY_W_SEC' attributes name, metricvalue, collectionTime

         FL_IO_DB_BY_W_SEC       0.133 MB/sec    2019-12-08T15:11:14-02:00




CellCLI>

 

You can query the metrics by any of the attributes, like all metrics for IORM, or all metrics that have FC in the name. If you want to query values from the past, you need to use list metrichistory:

 

CellCLI> list metrichistory where name = 'FL_IO_DB_BY_W_SEC' and collectionTime = '2019-12-08T15:21:14-02:00' attributes name, metricvalue, collectionTime

         FL_IO_DB_BY_W_SEC       0.196 MB/sec    2019-12-08T15:21:14-02:00




CellCLI>

 

Metric types

There are three types of metrics: Instantaneous (the value reflects the moment when it was read), Rate (values computed over a period of time), and Cumulative (values accumulated since the storage server was last started).
Rate metrics are usually expressed per second. This means that Exadata counted the values over the last minute and divided them by the seconds. So, *_SEC means the average over the last minute.
One important detail is that some metrics have “small” and “large” variants. If a request from the database needs more than 128 KB, the values are counted as large (*LG*); otherwise, as small (*SM*).
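You can check the type (and the unit) of any metric directly on the cell with list metricdefinition. For CD_IO_BY_R_LG_SEC, for example, the detail output shows its description, metricType, objectType, and unit:

CellCLI> list metricdefinition where name = 'CD_IO_BY_R_LG_SEC' detail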

 

Using metrics

 

To understand the metrics it is important to know the limits of your hardware, and for Exadata the good (and quick) way is the datasheet. Using the X8M datasheet as an example, we can see that the max disk throughput for each storage server is around 1.78 GB/s (25 GB/s for a full rack divided by 14 storage servers, for example). Understanding these numbers is important; I recommend that you read the datasheet and understand them.
Since every storage server computes its metrics separately, you need to query each one to have the big picture. This does not mean every metric is per server, though: some metrics need to be analyzed globally instead of per server. I usually divide the Exadata metrics in two groups: isolated and per database.

 

Isolated Metrics

 

I consider isolated the metrics that are important to check per server. They express values that make sense when checked in isolation, on each storage server. Some metrics that I like to check this way:
  • CL_CPUT: The cell CPU utilization.
  • CL_MEMUT: The percentage of total physical memory used.
  • N_HCA_MB_RCV_SEC: The number of megabytes received by the InfiniBand interfaces per second
  • N_HCA_MB_TRANS_SEC: The number of megabytes transmitted by the InfiniBand interfaces per second.
  • CD_IO_BY_R_LG_SEC: The rate which is the number of megabytes read in large blocks per second from a cell disk.
  • CD_IO_BY_R_SM_SEC: The rate which is the number of megabytes read in small blocks per second from a cell disk.
  • CD_IO_BY_W_LG_SEC: The rate which is the number of megabytes written in large blocks per second on a cell disk.
  • CD_IO_BY_W_SM_SEC: The rate which is the number of megabytes written in small blocks per second on a cell disk.
  • CD_IO_RQ_R_LG_SEC: The rate which is the number of requests to read large blocks per second from a cell disk.
  • CD_IO_RQ_R_SM_SEC: The rate which is the number of requests to read small blocks per second from a cell disk.
  • CD_IO_RQ_W_LG_SEC: The rate which is the number of requests to write large blocks per second to a cell disk.
  • CD_IO_RQ_W_SM_SEC: The rate which is the number of requests to write small blocks per second to a cell disk.
  • CD_IO_TM_R_LG_RQ: The rate which is the average latency of reading large blocks per request to a cell disk.
  • CD_IO_TM_R_SM_RQ: The rate which is the average latency of reading small blocks per request from a cell disk.
  • CD_IO_TM_W_LG_RQ: The rate which is the average latency of writing large blocks per request to a cell disk.
  • CD_IO_TM_W_SM_RQ: The rate which is the average latency of writing small blocks per request to a cell disk.
  • CD_IO_UTIL: The percentage of device utilization for the cell disk.
  • FC_BY_ALLOCATED: The number of megabytes allocated in flash cache.
  • FC_BY_USED: The number of megabytes used in flash cache.
  • FC_BY_DIRTY: The number of megabytes in flash cache that are not synchronized to the grid disks.
  • FC_IO_BY_R_SEC: The number of megabytes read per second from flash cache.
  • FC_IO_BY_R_SKIP_SEC: The number of megabytes read from disks per second for I/O requests that bypass flash cache. Read I/O requests that bypass flash cache go directly to disks. These requests do not populate flash cache after reading the requested data.
  • FC_IO_BY_R_MISS_SEC: The number of megabytes read from disks per second because not all requested data was in flash cache.
  • FC_IO_BY_W_SEC: The number of megabytes per second written to flash cache.
  • FC_IO_BY_W_SKIP_SEC: The number of megabytes written to disk per second for I/O requests that bypass flash cache.
  • FC_IO_RQ_R_SEC: The number of read I/O requests satisfied per second from flash cache.
  • FC_IO_RQ_W_SEC: The number of I/O requests per second which resulted in flash cache being populated with data.
  • FC_IO_RQ_R_SKIP_SEC: The number of read I/O requests per second that bypass flash cache. Read I/O requests that bypass flash cache go directly to disks
  • FC_IO_RQ_W_SKIP_SEC: The number of write I/O requests per second that bypass flash cache
  • FL_IO_DB_BY_W_SEC: The number of megabytes per second written to hard disk by Exadata Smart Flash Log.
  • FL_IO_FL_BY_W_SEC: The number of megabytes per second written to flash by Exadata Smart Flash Log.
  • FL_IO_TM_W_RQ: Average redo log write latency. It includes write I/O latency only.
  • FL_RQ_TM_W_RQ: Average redo log write request latency.
  • FL_IO_W_SKIP_BUSY: The number of redo writes during the last minute that could not be serviced by Exadata Smart Flash Log.
  • N_MB_RECEIVED_SEC: The rate which is the number of megabytes received per second from a particular host.
  • N_MB_SENT_SEC: The rate which is the number of megabytes transmitted per second to a particular host.
  • SIO_IO_EL_OF_SEC: The number of megabytes per second eligible for offload by smart I/O.
  • SIO_IO_OF_RE_SEC: The number of interconnect megabytes per second returned by smart I/O.
  • SIO_IO_RD_FC_SEC: The number of megabytes per second read from flash cache by smart I/O.
  • SIO_IO_RD_HD_SEC: The number of megabytes per second read from hard disk by smart I/O.
  • SIO_IO_RD_RQ_FC_SEC: The number of read I/O requests per second from flash cache by smart I/O.
  • SIO_IO_RD_FC_HD_SEC: The number of megabytes per second read from both flash cache and hard disk by smart I/O.
  • SIO_IO_WR_FC_SEC: The number of megabytes per second of flash cache population writes by smart I/O.
  • SIO_IO_SI_SV_SEC: The number of megabytes per second saved by the storage index.
With these metrics you can discover, for example, on each server:
  • Percentage of CPU and memory utilization.
  • GB/s sent by Smart I/O to the database servers. If you compare it with what was eligible for offload, you can see the percentage saved by Smart I/O.
  • The number of flash cache reads that were redirected to disk because the data was not there. If you see this value trending up, maybe you have data pinned (or simply a lot of data) in flash cache and your queries are demanding other tables.
  • For celldisk (CD_*) metrics it is important to split by the metricObjectName attribute, to separate the reads from hard disks and from flash. There are no dedicated celldisk metrics for flash disks (FD_*); they are present (at the storage level) as celldisks, but they have different throughput values. This is true for EF (Extreme Flash) too, which has only flash.
  • For flash cache directly, you can check the allocated values (usually 100%), but also the “dirty” usage, when data is not yet synchronized between flash and disks. This can mean a lot of writes from your database, bad query design, or high pressure between consolidated databases (maybe you can disable flash cache for one category/database through IORM).
  • From Smart Scan you can check the MB/s read from flash cache or disk to offload your query, or even the MB/s saved by the storage index.
So, as you can see, there is a lot of information that you can extract from the storage servers. I prefer to read these metrics separately (per storage server) because if I consider them globally (Smart I/O or flash cache, for example), I do not have a correct representation of what is happening under the hood. A good value from one storage server can hide a bad one from another when I calculate averages.
The idea of these metrics is to provide an overview of the Exadata storage software: the amount of data being read from the hardware (CD_* metrics) and how it is used by the features. You can see how much was saved by the storage index or Smart Scan, for example, or whether the flash cache is correctly populated (and not dirty). They can even help to identify queries that may be bypassing the flash cache or not using Smart I/O.
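To collect an isolated metric from every cell at once, you can use dcli from a database node. A minimal sketch, assuming a cell_group file that lists your cells and key-based SSH access for celladmin (both common on Exadata):

# Read CPU and memory utilization from every storage server at once.
dcli -g ~/cell_group -l celladmin "cellcli -e list metriccurrent CL_CPUT"
dcli -g ~/cell_group -l celladmin "cellcli -e list metriccurrent CL_MEMUT"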
 

Database Metrics

 

The concept of global metrics does not exist directly in Exadata; you still need to read them separately from each storage server. But I recommend checking them globally, summing the values from all the storage servers for the analysis. One example is the IOPS (or MB/s) per database: you usually want to know the value for the entire Exadata and not for each server.
In the list below I will show just the database metrics, but you have the same for PDBs, consumer groups, and categories. Remember that for IORM the hierarchy is categories first and databases after, when creating the plans.
  • DB_FC_BY_ALLOCATED: The number of megabytes allocated in flash cache for this database.
  • DB_FC_IO_BY_SEC: The number of megabytes of I/O per second for this database to flash cache.
  • DB_FC_IO_RQ_LG_SEC: The number of large I/O requests issued by a database to flash cache per second.
  • DB_FC_IO_RQ_SM_SEC: The number of small I/O requests issued by a database to flash cache per second.
  • DB_FL_IO_BY_SEC: The number of megabytes written per second to Exadata Smart Flash Log.
  • DB_FL_IO_RQ_SEC: The number of I/O requests per second issued to Exadata Smart Flash Log.
  • DB_IO_BY_SEC: The number of megabytes of I/O per second for this database to hard disks.
  • DB_IO_LOAD: The average I/O load from this database for hard disks. For a description of I/O load, see CD_IO_LOAD.
  • DB_IO_RQ_LG_SEC: The rate of large I/O requests issued to hard disks by a database per second over the past minute.
  • DB_IO_RQ_SM_SEC: The rate of small I/O requests issued to hard disks by a database per second over the past minute.
  • DB_IO_TM_LG_RQ: The rate which is the average latency of reading or writing large blocks per request by a database from hard disks.
  • DB_IO_TM_SM_RQ: The rate which is the average latency of reading or writing small blocks per request by a database from hard disks.
  • DB_IO_UTIL_LG: The percentage of hard disk resources utilized by large requests from this database.
  • DB_IO_UTIL_SM: The percentage of hard disk resources utilized by small requests from this database.
  • DB_IO_WT_LG_RQ: The average IORM wait time per request for large I/O requests issued to hard disks by a database.
  • DB_IO_WT_SM_RQ: The average IORM wait time per request for small I/O requests issued to hard disks by a database.
With these metrics you can see how each database/PDB/consumer group/category is using the Exadata. As an example, you can check the MB/s (or IOPS) read per second from flash cache, and if you compare it with the CD_* metrics, you can discover which database is using it the most. The same can be done for flash log.
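As a sketch of this “global” reading (the database name DBPR1_EXA and the cell_group file are reused from the examples in this blog), you can sum one database metric across all the cells like this:

# Sum DB_FC_IO_BY_SEC for one database across all storage servers.
# dcli prefixes each line with the cell name, so the value is field $4.
dcli -g ~/cell_group -l celladmin \
  "cellcli -e \"list metriccurrent where name = 'DB_FC_IO_BY_SEC' and metricObjectName = 'DBPR1_EXA'\"" \
| awk '{gsub(",", "", $4); sum += $4} END {printf "%.0f MB/sec across all cells\n", sum}'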

 

What you can discover

 

With metrics you can discover a lot of things that stay hidden when you just look from the database side (AWR, for example). This is even more critical when you have a consolidated environment in your Exadata. You can compare values from different databases and get insights about something that is not correct (as in my example from the previous post).
Here I will show you another example. Again, I collected all the values using my script, which you can check in my post at OTN. This script connects to each storage server and retrieves the values from the last 10 minutes. After storing them, I computed the average over the minutes reported. This means that every point listed below is the average value computed over 10 minutes.

 

 

Look at the graph above, which represents the values of the DB_IO_BY_SEC metric. As you can see, the database reached around 1.6 to 1.7 GB/s of disk usage. And if you remember the Exadata datasheet, this means that this database alone used almost all the disk throughput available from the Exadata storage. This can be confirmed by the I/O utilization of this database:
As you can see, the average was around 85 to 90 percent of I/O utilization for this database. And if I look at the values for each minute, I can see peaks of usage around 95%.
And as you can imagine, other databases started to see wait time at the same moments:
If you compare the three graphs over the same window, you can see that when one database started to use a lot of the storage servers (around the max throughput of the hardware), another database started to see more wait time for each request. The wait values are expressed in ms/r (milliseconds per request).
This is one example of what you can discover using metrics. In this specific case, the analysis started because one application DBA team reported slow queries during the morning. But the application had changed nothing in the previous 4 months, and AWR and tuning their queries did not help. Using the metrics, we discovered that another database had deployed a new version with bad SQL, even though, looking from that database alone, everything appeared normal.
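For reference, the collection behind graphs like these boils down to a metrichistory query bounded by timestamps, executed against each cell. A minimal sketch of the idea (the cell name and the celladmin SSH access are assumptions; my OTN script adds the formatting and reads all the metrics):

#!/bin/bash
# Retrieve the last 10 minutes of one metric from one cell.
CELL=exastradm01                                   # assumed cell hostname
FROM=$(date -d '-10 min' '+%Y-%m-%dT%H:%M:%S%:z')  # cellcli expects ISO 8601 timestamps
ssh celladmin@${CELL} "cellcli -e \"list metrichistory where collectionTime > '${FROM}' \
  and name = 'DB_IO_BY_SEC' attributes name, metricObjectName, metricValue, collectionTime\""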

 

How to read

 

If you want to read the metrics, the best way is through cellcli, but sometimes you do not need it, or you cannot use it. You can use Enterprise Manager/Cloud Control to check some information; not all the metrics are available there, but you can use the Exadata plugin to check IOPS and MB/s.
Still in EM/CC, you can check the KPIs for the storage servers and write some queries directly against the EM/CC repository database. Oracle has a good reference for this procedure in this PDF: https://www.oracle.com/technetwork/database/availability/exadata-storage-server-kpis-2855362.pdf
You can also use Oracle Management Cloud for Exadata to help you identify bottlenecks in your environment; it uses some AI and Oracle expertise to identify trends more easily and lets you be more proactive than reactive: https://cloudcustomerconnect.oracle.com/files/32e7361eed/OMC_For_EXADATA_whitepaper_v28.pdf, or this blog post from Oracle.
Another way is writing/using a script integrated with your monitoring tool. You can use my script as a guide and adapt it to your needs. The script can be checked in my post at Oracle OTN (Portuguese version, Spanish version; unfortunately, there is no English version – it was sent to be published, but never was, I don’t know why). The published version reads all the metrics and outputs them in one line; you can use it to insert them into a database or save them to a file, as you desire.

 

Insights

 

The idea of metric analysis is to check the metrics and get insights about what can be wrong. Sometimes it is easy to identify (like the example before), but other times you need to dig into a lot of metrics to find the problem (Smart I/O, flash cache, dirty cache, etc.).
But unfortunately, to understand the metrics you need to truly understand Exadata. It takes more than simple knowledge of the hardware or of how to patch the machine. You need to understand the limits of your machine (the datasheet can help), database tuning (to recognize a bad query plan), and where to search (which metric to use). You need to use all of this together to get the insights, and unfortunately, this comes only with day-to-day usage and practice.
The idea of this post was to provide you a guide, or even a shortcut, to understand the Exadata metrics and do more with them. I did not cover every metric, or every detail of each one, but I tried to show you how to read, analyze, and use them to identify problems.
 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

 


Exadata, Using metrics to help you

It is well known that Exadata delivers a lot of power for databases and, besides that, has a lot of features that can be combined to reach the desired goals. But you need to understand how to use Exadata; it is not just knowing the internal hardware pieces, putting in some SQL hints, or using Smart Scan that makes a better DBA (or DMA).
Think about the “traditional” environment (DB + storage) and how you check for performance problems there. Basically, you just have/receive the number of IOPS from the LUNs, the throughput in MB/s, and the latency from the storage side. But Exadata provides a lot of metrics that go beyond that and can be used to really understand what is happening between the database and the access to the data blocks.
For me, one of the most underrated (and not well explained on the web) features of Exadata is the metrics, because they can help you really understand Exadata deeply. As an example, from the metrics you can check the MB/s read from flash cache, from disks (per type), the flash log writes, the reads that bypassed flash cache and went to disk, or the Smart I/O per database, PDB, or consumer group. It is not the goal of this post to explain all the metrics (that will be another one), but you can read more in Chapter 6 of the Exadata User Guide.
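Just as a taste of what is available, one query like the sketch below (the regular expression is an assumption; adjust it to the metrics you want) already shows, side by side, the flash cache read rate and how much of the read load missed or skipped the cache:

CellCLI> list metriccurrent where name like 'FC_IO_BY_R.*SEC' attributes name, metricValue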
In this post, I will show you one example of how to use the metrics to identify and solve database problems. Sometimes it can be a hide-and-seek game, but I will try to show you how to use the metrics and how they can help you on a daily basis.

 

DB_FC_IO_BY_SEC

 

Some weeks ago I was extending a script that I created 5 years ago to read metrics from all the Exadata storage cells, and I saw an interesting behavior in one database that makes a good example. The script can be checked in my post at Oracle OTN (Portuguese version, Spanish version; unfortunately, there is no English version – it was sent to be published, but never was, I don’t know why). The published version reads all the metrics and outputs them in one line; you can use it to insert them into a database or save them to a file, as you desire.
The behavior that I saw was related to the values reported by DB_FC_IO_BY_SEC, which represents “The number of megabytes of I/O per second for this database to flash cache”, reported per database. The image below shows the GB/s read from DB_FC_IO_BY_SEC for just one database over two consecutive days; each point in the graph represents the average value of the last 10 minutes (read minute by minute).

 

As you can see, there are peaks of values, but in the end, as you can imagine, it does not appear to be correct. This does not appear to be/represent a normal trend/pattern and intuitively this got me attention. And ss you can see below, the activity report from EM/CC (for one/same day than above) for this database does not report the same behavior or even a hint:
Understanding, or knowing, Exadata is more than putting a database running there and forgetting about it. Even a normal/expected report from the default monitoring tools (EM/CC) on the database side can hide bad things. This can be worse on Exadata because you have a lot of power available. Using metrics, you can find these gaps and provide a better result for database tuning, or discover and solve problems.
Remember the metric values above for DB_FC_IO_BY_SEC: they have a max peak (in the image) of 25GB/s. And if you check the Exadata X8 Data Sheet, the max value for flash read per storage server (not EF) is around 21GB/s. Since the values that I showed before are the sum for all cells (a Half Rack here), they are really below the maximum that Exadata can reach. The database was not struggling on Exadata (I imagine it would be in a traditional environment), but the power available on Exadata (and the flash cache feature) was hiding the problem.
Returning to the example, since the report hinted at some bad usage, I researched the running SQLs and hit two notes:
  • Slow Performance Of DBA_AUDIT_SESSION Query From “Failed logins” Metric in Enterprise Manager (Doc ID 1178736.1).
  • EM 12c, 11g: High Cpu Utilization from Enterprise Manager Agent perl Process Running failedLogin.pl for audit_failed_logins metric (Doc ID 1265699.1).
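Both notes point to the same root cause: the EM agent metric periodically queries the audit trail (DBA_AUDIT_SESSION, built on top of AUD$) looking for failed logins, and with a large audit trail this becomes an expensive scan. Just to illustrate the shape of the statement involved (a sketch for illustration only, not the exact query that EM ships):

-- Illustrative sketch only: the EM "Failed Logins" metric runs something
-- of this shape; on a big AUD$, repeatedly scanning it is expensive.
SELECT username, os_username, userhost, terminal, timestamp
  FROM dba_audit_session
 WHERE returncode <> 0
   AND timestamp > SYSDATE - 1;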
Basically, the EM/CC collection for failed login attempts was running and consuming a lot of resources. And since this EM/CC audit was not needed, it could be disabled, and after that, DB_FC_IO_BY_SEC changed to:

 

As you can see, a much more normal behavior for the database usage.

 

The Point

The point of this article is clear: Exadata can be simple to use, you just drop the database there and the available features will help. But unfortunately, the extra power that you have there can hide bad things (as shown above). If you compare with a traditional environment, the only details from the storage side that you can discover are IOPS and throughput. But with Exadata, you can read a lot of other points to tune the database or to surface problems.
Understanding Exadata is more than just learning about the network, the hardware, or putting in some hints to get better performance. Remember that the hardware changes every release, and the database optimizer changes every version too. But Exadata metrics have been there since the beginning, and they just keep expanding to provide a complete view of what happens between the database and the hardware itself. Knowing how to read and interpret them is the key to being a good DMA.
I really recommend reading Chapter 6 of the Exadata User Guide. But it is important too to have a good base about what the Exadata HW can deliver and how the features work. The Exadata X8 Data Sheet, the Technical White Paper, and the MAA OOW presentations provide a good source of information.
Remember, it is more than learning about hardware or using some hints for SQL. It is about understanding how hardware and software are integrated and what you can extract from there. And with Exadata metrics, you can really discover what is happening.
 
 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”


ZDLRA, Multi-site protection – ZERO RPO for Primary and Standby
Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ZDLRA, Multi-site protection - ZERO RPO for Primary and Standby

 
ZDLRA can be used from a small single-database environment to big environments where you need protection in more than one site at the same time. At every level, you can use different features of ZDLRA to provide the desirable protection. Here I will show how to reach zero RPO for both primary and standby databases. All the steps, docs, and tech parts are covered.
You can check the examples and the reference for every scenario in these two papers from the Oracle MAA team: MAA Overview On-Premises and Oracle MAA Reference Architectures. They provide good information on how to prepare to reduce RPO and improve RTO. In summary, the focus is the same: reduce the downtime and the data loss in case of a catastrophe (zero RPO and minimal RTO).

Multi-site protection

If you looked at both papers before, you saw that to provide good protection it is desirable to have an additional site to, at least, send the backups to. And if you go higher, for GOLD and PLATINUM environments, you start to have multiple sites synced with Data Guard. These critical/mission-critical environments need to be protected against every kind of catastrophic failure, from a disk up to a complete site outage (some need to follow specific law requirements, banks as an example).
And the focus of this post is these big environments. I will show you how to use ZDLRA to protect both sites, reaching zero RPO even for standby databases. Doing that, you can survive a catastrophic outage (like an entire datacenter failure) and still have zero RPO. Going further, when using real-time redo for ZDLRA you can have zero RPO even if you lose one site completely, and this is not written in the docs, by the way.

 

I already wrote about some of these features in previous posts: the ZDLRA features themselves, how to enroll a database, and how to reach zero RPO for database protection. All of these I will use here; to protect the bigger environments, the incremental forever strategy is used for backup, and real-time redo protects the primary and standby databases to reach zero RPO.
The MAA team already wrote about how to do that in Deploying the Zero Data Loss Recovery Appliance in a Data Guard Configuration but, frankly, that doc is hard to understand. Here in this post, I will try to provide a better example of how to reach zero RPO for both sites.

 

Multiple ZDLRAs

The way that we will use ZDLRA, in this case, is different from the replication feature that exists for ZDLRA. Here, we have two ZDLRAs, one for each site. The environment in this case is:
  • ORCL19: RAC primary database.
  • ORCL19S: RAC standby database.
  • ZDLRAS1: ZDLRA that protects the primary site.
  • ZDLRAS2: ZDLRA that protects the standby site.
And the target will be:
As you can see above, each ZDLRA protects its own site, and the replication between sites is done by DG. The DG configuration is not part of this post, but the output (and order) of all the commands I used to create the RAC + RAC DG can be checked here – Steps-RAC-RAC-DG-Creation. If you have some doubts about it, I can explain if needed.

ZDLRA Configuration – Protecting and Enrolling Databases

VPC

If you already have a VPC user created for your ZDLRA and want to use an existing one, you can skip this step. Just remember that the same user needs to exist in both ZDLRAs. This is more critical if you intend to use real-time redo and reach zero RPO.
So, if needed, create the same VPC user in both ZDLRA’s:
At ZDLRAS1:
[root@zdlras1n1 ~]# /opt/oracle.RecoveryAppliance/bin/racli add vpc_user --user_name=vpczdlra

[vpczdlra] New Password:

Sat Nov  2 19:43:59 2019: Start: Add vpc user vpczdlra.

Sat Nov  2 19:44:00 2019:        Add vpc user vpczdlra successfully.

Sat Nov  2 19:44:00 2019: End:   Add vpc user vpczdlra.

[root@zdlras1n1 ~]#
At ZDLRAS2:
[root@zdlras2n1 ~]# /opt/oracle.RecoveryAppliance/bin/racli add vpc_user --user_name=vpczdlra

[vpczdlra] New Password:

Sat Nov  2 19:43:41 2019: Start: Add vpc user vpczdlra.

Sat Nov  2 19:43:42 2019:        Add vpc user vpczdlra successfully.

Sat Nov  2 19:43:42 2019: End:   Add vpc user vpczdlra.

[root@zdlras2n1 ~]#

Backup Policy

It is not necessary to have the same policy in each ZDLRA, and it is possible to enroll the database using different policies, since they (the ZDLRAs) work separately. Here I used the same policy on both sites (but with different recovery window goals).
 
At ZDLRAS1:
[oracle@zdlras1n1 ~]$ sqlplus rasys/change^Me2

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 2 22:21:18 2019

Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Last Successful login time: Sat Nov 02 2019 11:20:06 +01:00

Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.3.0.0.0

SQL> BEGIN

  2  DBMS_RA.CREATE_PROTECTION_POLICY(

  3      protection_policy_name => 'ZDLRA'

  4      , description => 'Policy ZDLRA S1'

  5      , storage_location_name => 'DELTA'

  6      , recovery_window_goal => INTERVAL '5' DAY

  7      , max_retention_window => INTERVAL '10' DAY

  8      , guaranteed_copy => 'NO'

  9      , allow_backup_deletion => 'YES'

 10  );

 11  END;

 12  /

PL/SQL procedure successfully completed.

SQL>
At ZDLRAS2:

 

[oracle@zdlras2n1 ~]$ sqlplus rasys/change^Me2

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 2 22:22:13 2019

Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Last Successful login time: Sat Nov 02 2019 11:21:04 +01:00

Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.3.0.0.0

SQL> BEGIN

  2  DBMS_RA.CREATE_PROTECTION_POLICY(

  3      protection_policy_name => 'ZDLRA'

  4      , description => 'Policy ZDLRA S2'

  5      , storage_location_name => 'DELTA'

  6      , recovery_window_goal => INTERVAL '7' DAY

  7      , max_retention_window => INTERVAL '14' DAY

  8      , guaranteed_copy => 'NO'

  9      , allow_backup_deletion => 'YES'

 10  );

 11  END;

 12  /

PL/SQL procedure successfully completed.

SQL>

Grant DB at ZDLRA

Here we have the first key point, and it is related to the way that the database is registered inside ZDLRA. It follows the traditional way, using ADD_DB, GRANT_DB_ACCESS, and using DB_UNIQUE_NAME as identification.
The point here is that for Data Guard the DB_UNIQUE_NAME is different for each database, but one detail is that ZDLRA internally links the DBID of the database with the unique name. So it is needed, and required, that for the ZDLRA that will protect the standby site, the add and registration use the DB_UNIQUE_NAME from the primary. If you do not do that, the ZDLRA will understand that the standby database is a completely different database (and will report RMAN-03009: failure of resync command on default channel, ORA-00001: unique constraint (RASYS.ODB_P) violated, and ORA-04088: error during execution of trigger ‘RASYS.NODE_INSERT_UPDATE’).
Another common error, if you try to register the standby database (instead of the primary database), is RMAN-01005: Mounted control file type must be CURRENT to register the database. This occurs because some registrations and checks done by RMAN against the database controlfile need to be done on the current controlfile.
At ZDLRAS1:
 
[oracle@zdlras1n1 ~]$ sqlplus rasys/change^Me2

SQL*Plus: Release 19.0.0.0.0 - Production on Sun Nov 3 14:11:37 2019

Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Last Successful login time: Sun Nov 03 2019 11:16:12 +01:00

Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.3.0.0.0

SQL> BEGIN

  2  DBMS_RA.ADD_DB(

  3        db_unique_name => 'ORCL19'

  4        , protection_policy_name => 'ZDLRA'

  5        , reserved_space => '5G'

  6  );

  7  END;

  8  /

PL/SQL procedure successfully completed.

SQL>

SQL> BEGIN

  2  DBMS_RA.GRANT_DB_ACCESS (

  3        db_unique_name => 'ORCL19'

  4        , username => 'VPCZDLRA'

  5  );

  6  END;

  7  /

PL/SQL procedure successfully completed.

SQL>
At ZDLRAS2:
[oracle@zdlras2n1 ~]$ sqlplus rasys/change^Me2

SQL*Plus: Release 19.0.0.0.0 - Production on Sun Nov 3 16:14:06 2019

Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.

Last Successful login time: Sun Nov 03 2019 11:16:55 +01:00

Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.3.0.0.0

SQL> BEGIN

  2  DBMS_RA.ADD_DB(

  3        db_unique_name => 'ORCL19'

  4        , protection_policy_name => 'ZDLRA'

  5        , reserved_space => '5G'

  6  );

  7  END;

  8  /

PL/SQL procedure successfully completed.

SQL> BEGIN

  2  DBMS_RA.GRANT_DB_ACCESS (

  3        db_unique_name => 'ORCL19'

  4        , username => 'VPCZDLRA'

  5  );

  6  END;

  7  /

PL/SQL procedure successfully completed.

SQL>
Note above that ADD_DB and GRANT_DB_ACCESS at ZDLRAS2 used ORCL19 as the DB_UNIQUE_NAME (which is the name of the primary).

Registering the database

At Primary – ORCL19

At this point, it is possible to register and back up the primary database with ZDLRAS1. The steps are the same ones that I described in my post on how to enroll a database at ZDLRA. The steps are, in order:
  1. Install ZDLRA library
  2. Create Wallet
  3. Configure/Test tns entry to ZDLRA
  4. Adjust the sqlnet.ora file (this is needed because of the real-time redo)
[oracle@orcl19p ~]$ cd /tmp/

[oracle@orcl19p tmp]$ unzip ra_linux64.zip

Archive:  ra_linux64.zip

  inflating: libra.so

  inflating: metadata.xml

[oracle@orcl19p tmp]$

[oracle@orcl19p tmp]$ cp ./libra.so /u01/app/oracle/product/19.0.0.0/dbhome_1/lib/libra.so

[oracle@orcl19p tmp]$

[oracle@orcl19p tmp]$ cd $ORACLE_HOME/dbs

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -createALO

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -createCredential zdlras1-scan:1521/zdlras1:VPCZDLRA vpczdlra oracle

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -listCredential

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.

List credential (index: connect_string username)

1: zdlras1-scan:1521/zdlras1:VPCZDLRA vpczdlra

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ vi $ORACLE_HOME/dbs/raORCL191.ora

[oracle@orcl19p dbs]$ cat $ORACLE_HOME/dbs/raORCL191.ora

RA_WALLET='LOCATION=file:/u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet CREDENTIAL_ALIAS=zdlras1-scan:1521/zdlras1:VPCZDLRA'

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ tnsping zdlras1

TNS Ping Utility for Linux: Version 19.0.0.0.0 - Production on 03-NOV-2019 15:34:09

Copyright (c) 1997, 2019, Oracle.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias

Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = zdlras1-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = zdlras1)))

OK (20 msec)

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ cat $ORACLE_HOME/network/admin/sqlnet.ora

cat: /u01/app/oracle/product/19.0.0.0/dbhome_1/network/admin/sqlnet.ora: No such file or directory

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ vi $ORACLE_HOME/network/admin/sqlnet.ora

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ cat $ORACLE_HOME/network/admin/sqlnet.ora

NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

SQLNET.WALLET_OVERRIDE = true

WALLET_LOCATION =

(

   SOURCE =

      (METHOD = FILE)

      (METHOD_DATA =

         (DIRECTORY = /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet)

      )

)

[oracle@orcl19p dbs]$

At Standby – ORCL19S

The same procedure as above is done here:
  1. Install ZDLRA library
  2. Create Wallet
  3. Configure/Test tns entry to ZDLRA
  4. Adjust the sqlnet.ora file (this is needed because of the real-time redo)
[oracle@orcl19s ~]$ cd /tmp/

[oracle@orcl19s tmp]$ unzip ra_linux64.zip

Archive:  ra_linux64.zip

  inflating: libra.so

  inflating: metadata.xml

[oracle@orcl19s tmp]$

[oracle@orcl19s tmp]$ cp ./libra.so /u01/app/oracle/product/19.0.0.0/dbhome_1/lib/libra.so

[oracle@orcl19s tmp]$

[oracle@orcl19s tmp]$ cd $ORACLE_HOME/dbs

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -createALO

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -createCredential zdlras2-scan:1521/zdlras2:VPCZDLRA vpczdlra oracle

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -listCredential

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.

List credential (index: connect_string username)

1: zdlras2-scan:1521/zdlras2:VPCZDLRA vpczdlra

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ vi $ORACLE_HOME/dbs/raORCL19S1.ora

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ cat $ORACLE_HOME/dbs/raORCL19S1.ora

RA_WALLET='LOCATION=file:/u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet CREDENTIAL_ALIAS=zdlras2-scan:1521/zdlras2:VPCZDLRA'

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ tnsping zdlras2

TNS Ping Utility for Linux: Version 19.0.0.0.0 - Production on 03-NOV-2019 16:32:57

Copyright (c) 1997, 2019, Oracle.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias

Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = zdlras2-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = zdlras2)))

OK (0 msec)

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ cat $ORACLE_HOME/network/admin/sqlnet.ora

cat: /u01/app/oracle/product/19.0.0.0/dbhome_1/network/admin/sqlnet.ora: No such file or directory

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ vi $ORACLE_HOME/network/admin/sqlnet.ora

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ cat $ORACLE_HOME/network/admin/sqlnet.ora

NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

SQLNET.WALLET_OVERRIDE = true

WALLET_LOCATION =

(

   SOURCE =

      (METHOD = FILE)

      (METHOD_DATA =

         (DIRECTORY = /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet)

      )

)

[oracle@orcl19s dbs]$

Registering the Primary Database – Both ZDLRAs
After that, we can register ORCL19 at ZDLRAS1. It is critical to register the primary database first at the ZDLRA that will protect the primary site.
[oracle@orcl19p dbs]$ rman target=/ catalog=vpczdlra/oracle@zdlras1

Recovery Manager: Release 19.0.0.0.0 - Production on Sun Nov 3 15:40:37 2019

Version 19.5.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ORCL19 (DBID=324042554)

connected to recovery catalog database

PL/SQL package VPCZDLRA.DBMS_RCVCAT version 19.03.00.00. in RCVCAT database is not current

PL/SQL package VPCZDLRA.DBMS_RCVMAN version 19.03.00.00 in RCVCAT database is not current

RMAN> register database;

database registered in recovery catalog

starting full resync of recovery catalog

full resync complete

RMAN>
If you want, you can take the backup of the database (it is optional). I did not show it here, but it is a simple backup database level 0 filesperset 1, and you can see the output here – Output-Backup-Primary-ORCL19-at-ZDLRAS1.
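For reference, a minimal sketch of that backup command (the SBT channel pointing to libra.so must already be configured, as in the show all output later in this post; adapt tags and channels to your environment):

RMAN> backup device type sbt incremental level 0 filesperset 1 database;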
After registering the primary database ORCL19 at the ZDLRA that protects the primary site (ZDLRAS1), we can register the primary ORCL19 at the ZDLRA that protects the standby site (ZDLRAS2).
Note that I needed to add the TNS entry for ZDLRAS2 (at the primary server) too:
[oracle@orcl19p dbs]$ tnsping zdlras2

TNS Ping Utility for Linux: Version 19.0.0.0.0 - Production on 03-NOV-2019 16:37:38

Copyright (c) 1997, 2019, Oracle.  All rights reserved.

Used parameter files:

/u01/app/oracle/product/19.0.0.0/dbhome_1/network/admin/sqlnet.ora

Used TNSNAMES adapter to resolve the alias

Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = zdlras2-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = zdlras2)))

OK (0 msec)

[oracle@orcl19p dbs]$

[oracle@orcl19p dbs]$ rman target=/ catalog=vpczdlra/oracle@zdlras2

Recovery Manager: Release 19.0.0.0.0 - Production on Sun Nov 3 17:30:06 2019

Version 19.5.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ORCL19 (DBID=324042554)

connected to recovery catalog database

PL/SQL package VPCZDLRA.DBMS_RCVCAT version 19.03.00.00. in RCVCAT database is not current

PL/SQL package VPCZDLRA.DBMS_RCVMAN version 19.03.00.00 in RCVCAT database is not current

RMAN> register database;

database registered in recovery catalog

starting full resync of recovery catalog

full resync complete

RMAN> exit

Recovery Manager complete.

[oracle@orcl19p dbs]$
Since the registration was done at the primary, the controlfile of the standby database needs to be synchronized with the RMAN catalog, and ZDLRAS2 (which protects the standby site) needs to be informed which one is the standby database. This is done with a simple resync catalog from the standby, connected to ZDLRAS2.
[oracle@orcl19s dbs]$ rman target=/ catalog=vpczdlra/oracle@zdlras2

Recovery Manager: Release 19.0.0.0.0 - Production on Sun Nov 3 20:12:32 2019

Version 19.5.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ORCL19 (DBID=324042554, not open)

connected to recovery catalog database

PL/SQL package VPCZDLRA.DBMS_RCVCAT version 19.03.00.00. in RCVCAT database is not current

PL/SQL package VPCZDLRA.DBMS_RCVMAN version 19.03.00.00 in RCVCAT database is not current

RMAN> show all;

RMAN configuration parameters for database with db_unique_name ORCL19S are:

CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default

CONFIGURE BACKUP OPTIMIZATION OFF; # default

CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default

CONFIGURE CONTROLFILE AUTOBACKUP ON; # default

CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default

CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE SBT_TAPE TO '%F'; # default

CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default

CONFIGURE DEVICE TYPE SBT_TAPE PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default

CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default

CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE SBT_TAPE TO 1; # default

CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default

CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE SBT_TAPE TO 1; # default

CONFIGURE CHANNEL 1 DEVICE TYPE 'SBT_TAPE' FORMAT   '%d_%U' PARMS  "SBT_LIBRARY=/u01/app/oracle/product/19.0.0.0/dbhome_1/lib/libra.so, ENV=(RA_WALLET='location=file:/u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet credential_alias=zdlras1-scan:1521/zdlras1:VPCZDLRA')" CONNECT '*';

CONFIGURE MAXSETSIZE TO UNLIMITED; # default

CONFIGURE ENCRYPTION FOR DATABASE OFF; # default

CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default

CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default

CONFIGURE RMAN OUTPUT TO KEEP FOR 7 DAYS; # default

CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default

CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/snapcf_orcl19s1.f'; # default

RMAN> resync catalog;

starting resync of recovery catalog

resync complete

RMAN> list db_unique_name of database orcl19;

List of Databases

DB Key  DB Name  DB ID            Database Role    Db_unique_name

------- ------- ----------------- ---------------  ------------------

835     ORCL19   324042554        PRIMARY          ORCL19

835     ORCL19   324042554        STANDBY          ORCL19S

RMAN>
Look at some points above:
  • show all reads the information already registered in the RMAN catalog (look at the configured channel)
  • After the resync catalog, list db_unique_name reported both databases: primary (ORCL19) and standby (ORCL19S)
After that, you can configure the correct channel for ORCL19S to point to ZDLRAS2 and take the backup. The output can be checked here – Output-Backup-Primary-ORCL19S-at-ZDLRAS2.
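A sketch of that channel change, mirroring the zdlras1 channel visible in the show all output above but using the ZDLRAS2 wallet entry created earlier:

RMAN> CONFIGURE CHANNEL 1 DEVICE TYPE 'SBT_TAPE' FORMAT '%d_%U' PARMS "SBT_LIBRARY=/u01/app/oracle/product/19.0.0.0/dbhome_1/lib/libra.so, ENV=(RA_WALLET='location=file:/u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet credential_alias=zdlras2-scan:1521/zdlras2:VPCZDLRA')" CONNECT '*';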

Environment protected

At this point, you have both sites protected, each one with a dedicated local backup at its ZDLRA. The primary database ORCL19 backs up to ZDLRAS1, and the standby database ORCL19S backs up to ZDLRAS2:
But this does not protect you in case of a failure of one site. In that case, the RPO will not be zero because we have nowhere to send the archivelogs.
Imagine that you lose your primary site and, after the switchover to the standby, that machine fails too. What will be your RPO? It will be the last backup of the standby, because your standby (the new primary) is not sending archivelogs to the old primary (which is out due to the outage), and you don't have real-time redo configured for ZDLRA. So, if you then have an outage at the standby, you lose everything since your last backup. This is not clear in the docs from MAA Overview On-Premises and Oracle MAA Reference Architectures.

 

ZERO RPO

To reach zero RPO we just need to enable real-time redo for both sites. We can do it manually or using the broker. If you want to see in detail how to configure it, you can check my two posts: ZDLRA, Real-Time Redo and ZDLRA, Real-Time Redo and Zero RPO; here I will post the summarized steps.

 

Be careful with the order of the steps below; they are important because we are changing the user (redo_transport_user) that will be configured in both databases. If you change it first at the standby, you will face issues, because the user is created at the primary first and the “creation” is sent automatically by redo. By the way, this user needs to be the same VPC user that you used to configure/add/grant database access at ZDLRA. And it needs to be the same for every ZDLRA because both databases use the same user. This is the second key point of this config.

Manual Configuration

The manual way differs because here we set the log_archive_dest parameters manually. The other steps need to be done regardless of the way that we choose. And the manual config is the only way for 11g databases.

 

Basic config

For real-time redo we need to configure some parameters; basically, I made these configs in this order:
  1. log_archive_config: Check that for the primary it includes ZDLRAS1, and for the standby ZDLRAS2
  2. log_archive_dest_3: Again, each site points to its own ZDLRA. And look at the type defined in VALID_FOR
  3. Create the user for redo_transport_user inside the database.
  4. redo_transport_user: The same for both databases.
At Primary ORCL19:

 

[oracle@orcl19p ~]$ sqlplus / as sysdba




SQL*Plus: Release 19.0.0.0.0 - Production on Sun Nov 3 22:34:28 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle.  All rights reserved.







Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0




SQL> show parameter log_archive_config;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

log_archive_config                   string      DG_CONFIG=(orcl19,orcl19s)

SQL> ALTER SYSTEM SET log_archive_config='DG_CONFIG=(orcl19,orcl19s,zdlras1)' SCOPE=SPFILE SID='*';




System altered.




SQL> ALTER SYSTEM SET log_archive_dest_3='SERVICE="zdlras1-scan:1521/zdlras1:VPCZDLRA" ASYNC NOAFFIRM DB_UNIQUE_NAME=zdlras1 VALID_FOR=(ONLINE_LOGFILE,ALL_ROLES)' scope = spfile sid = '*';




System altered.




SQL>

SQL> alter system set log_archive_dest_state_3=DEFER scope = spfile sid = '*';




System altered.




SQL>

SQL> col username format a20

SQL> select USERNAME,SYSDBA,SYSOPER,SYSASM,SYSBACKUP,SYSDG,SYSKM from v$pwfile_users;




USERNAME             SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM

-------------------- ----- ----- ----- ----- ----- -----

SYS                  TRUE  TRUE  FALSE FALSE FALSE FALSE




SQL> show parameter redo_transport_user;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

redo_transport_user                  string

SQL> alter system set redo_transport_user='VPCZDLRA' scope = spfile SID='*';




System altered.




SQL>

SQL> alter session set "_ORACLE_SCRIPT"=true;




Session altered.




SQL> create user VPCZDLRA identified by oracle;




User created.




SQL> grant sysoper to VPCZDLRA;




Grant succeeded.




SQL> alter session set "_ORACLE_SCRIPT"=false;




Session altered.




SQL> select USERNAME,SYSDBA,SYSOPER,SYSASM,SYSBACKUP,SYSDG,SYSKM from v$pwfile_users;




USERNAME             SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM

-------------------- ----- ----- ----- ----- ----- -----

SYS                  TRUE  TRUE  FALSE FALSE FALSE FALSE

VPCZDLRA             FALSE TRUE  FALSE FALSE FALSE FALSE




SQL> exit

Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0

[oracle@orcl19p ~]$

 

At Standby ORCL19S:

 

[oracle@orcl19s dbs]$ sqlplus / as sysdba




SQL*Plus: Release 19.0.0.0.0 - Production on Sun Nov 3 22:53:05 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle.  All rights reserved.







Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0




SQL> show parameter log_archive_config;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

log_archive_config                   string      DG_CONFIG=(orcl19,orcl19s)

SQL> ALTER SYSTEM SET log_archive_config='DG_CONFIG=(orcl19,orcl19s,zdlras2)' SCOPE=SPFILE SID='*';




System altered.




SQL> ALTER SYSTEM SET log_archive_dest_3='SERVICE="zdlras2-scan:1521/zdlras2:VPCZDLRA" ASYNC NOAFFIRM DB_UNIQUE_NAME=zdlras2 VALID_FOR=(ALL_LOGFILES,ALL_ROLES)' scope = spfile sid = '*';




System altered.




SQL> alter system set redo_transport_user='VPCZDLRA' scope = spfile SID='*';




System altered.




SQL>

SQL> exit

Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0

[oracle@orcl19s dbs]$

[oracle@orcl19s dbs]$ srvctl stop database -d orcl19s -o immediate

[oracle@orcl19s dbs]$

 

As you can see here, the user is not created at the standby because the creation will come through the redo. And, as you can see at the end, I stopped the standby database.

Password file

Since the user was created on just one site, we need to copy the password file from the primary to the standby.
At Primary ORCL19:

 

[root@orcl19p ~]# su - grid

Last login: Sun Nov  3 23:03:09 CET 2019

[grid@orcl19p ~]$ asmcmd

ASMCMD> cd +DATA/ORCL19/

ASMCMD> ls -l

Type             Redund  Striped  Time             Sys  Name

                                                   Y    95D7C2B2568F0A70E0533205A00AC96B/

                                                   Y    95D86C40430C3E85E0533205A00A0EBA/

                                                   Y    CONTROLFILE/

                                                   Y    DATAFILE/

                                                   Y    DATAGUARDCONFIG/

                                                   Y    ONLINELOG/

                                                   Y    PARAMETERFILE/

                                                   Y    PASSWORD/

                                                   Y    TEMPFILE/

DATAGUARDCONFIG  UNPROT  COARSE   NOV 03 23:00:00  N    dr1orcl19.dat => +DATA/ORCL19/DATAGUARDCONFIG/ORCL19.275.1023297961

ASMCMD> ls -l PASSWORD/

Type      Redund  Striped  Time             Sys  Name

PASSWORD  UNPROT  COARSE   OCT 26 23:00:00  Y    pwdorcl19.256.1022714087

ASMCMD> ls -l

Type             Redund  Striped  Time             Sys  Name

                                                   Y    95D7C2B2568F0A70E0533205A00AC96B/

                                                   Y    95D86C40430C3E85E0533205A00A0EBA/

                                                   Y    CONTROLFILE/

                                                   Y    DATAFILE/

                                                   Y    DATAGUARDCONFIG/

                                                   Y    ONLINELOG/

                                                   Y    PARAMETERFILE/

                                                   Y    PASSWORD/

                                                   Y    TEMPFILE/

DATAGUARDCONFIG  UNPROT  COARSE   NOV 03 23:00:00  N    dr1orcl19.dat => +DATA/ORCL19/DATAGUARDCONFIG/ORCL19.275.1023297961

ASMCMD>

ASMCMD> cp PASSWORD/pwdorcl19.256.1022714087 /tmp

copying +DATA/ORCL19/PASSWORD/pwdorcl19.256.1022714087 -> /tmp/pwdorcl19.256.1022714087

ASMCMD> exit

[grid@orcl19p ~]$

 

As you saw, I saved the password file in the /tmp folder.
At Standby ORCL19S:

 

[root@orcl19s ~]# su - grid

Last login: Sun Nov  3 23:03:00 CET 2019

[grid@orcl19s ~]$

[grid@orcl19s ~]$

[grid@orcl19s ~]$ scp orcl19p:/tmp/pwdorcl19.256.1022714087 /tmp/pwdorcl19.256.1022714087

grid@orcl19p's password:

pwdorcl19.256.1022714087                                                                                                                                                                                   100% 2560     4.3MB/s   00:00

[grid@orcl19s ~]$

[grid@orcl19s ~]$

[grid@orcl19s ~]$

[grid@orcl19s ~]$ asmcmd

ASMCMD> ls -l +DATA/ORCL19S/

Type             Redund  Striped  Time             Sys  Name

                                                   Y    95D7C2B2568F0A70E0533205A00AC96B/

                                                   Y    95D86C40430C3E85E0533205A00A0EBA/

                                                   Y    CONTROLFILE/

                                                   Y    DATAFILE/

                                                   Y    DATAGUARDCONFIG/

                                                   Y    ONLINELOG/

                                                   Y    PARAMETERFILE/

                                                   Y    PASSWORD/

DATAGUARDCONFIG  UNPROT  COARSE   NOV 03 19:00:00  N    dr1orcl19s.dat => +DATA/ORCL19S/DATAGUARDCONFIG/orcl19s.275.1023298021

PASSWORD         UNPROT  COARSE   NOV 02 15:00:00  N    orapworcl19s => +DATA/ORCL19S/PASSWORD/pwdorcl19s.268.1023290373

PARAMETERFILE    UNPROT  COARSE   NOV 03 22:00:00  N    spfile.ora => +DATA/ORCL19S/PARAMETERFILE/spfile.272.1023290247

ASMCMD>

ASMCMD> pwcopy --dbuniquename orcl19s '/tmp/pwdorcl19.256.1022714087' '+DATA/ORCL19S/orapworcl19s' -f

ASMCMD-9453: failed to register password file as a CRS resource

copying /tmp/pwdorcl19.256.1022714087 -> +DATA/ORCL19S/orapworcl19s

ASMCMD-9453: failed to register password file as a CRS resource

ASMCMD>

ASMCMD>

ASMCMD> ls -l +DATA/ORCL19S/

Type             Redund  Striped  Time             Sys  Name

                                                   Y    95D7C2B2568F0A70E0533205A00AC96B/

                                                   Y    95D86C40430C3E85E0533205A00A0EBA/

                                                   Y    CONTROLFILE/

                                                   Y    DATAFILE/

                                                   Y    DATAGUARDCONFIG/

                                                   Y    ONLINELOG/

                                                   Y    PARAMETERFILE/

                                                   Y    PASSWORD/

DATAGUARDCONFIG  UNPROT  COARSE   NOV 03 19:00:00  N    dr1orcl19s.dat => +DATA/ORCL19S/DATAGUARDCONFIG/orcl19s.275.1023298021

PASSWORD         UNPROT  COARSE   NOV 03 23:00:00  N    orapworcl19s => +DATA/ORCL19S/PASSWORD/pwdorcl19s.268.1023405041

PARAMETERFILE    UNPROT  COARSE   NOV 03 22:00:00  N    spfile.ora => +DATA/ORCL19S/PARAMETERFILE/spfile.272.1023290247

ASMCMD>

ASMCMD> exit

[grid@orcl19s ~]$

 

As you saw above, the file from the primary was copied over the old one.

Bounce everything

After doing this, it is time to bounce the primary database to reload the parameters, and to start the standby database so it reads the new password file and parameters.
At Primary ORCL19:

 

[oracle@orcl19p ~]$ srvctl stop database -d orcl19 -o immediate

[oracle@orcl19p ~]$ srvctl start database -d orcl19

[oracle@orcl19p ~]$
 
At Standby ORCL19S
 
 


[oracle@orcl19s ~]$ srvctl start database -d orcl19s

[oracle@orcl19s ~]$

[oracle@orcl19s ~]$ sqlplus / as sysdba




SQL*Plus: Release 19.0.0.0.0 - Production on Sun Nov 3 23:13:22 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle.  All rights reserved.







Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0




SQL> col username format a20

SQL> select USERNAME,SYSDBA,SYSOPER,SYSASM,SYSBACKUP,SYSDG,SYSKM from v$pwfile_users;




USERNAME             SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM

-------------------- ----- ----- ----- ----- ----- -----

SYS                  TRUE  TRUE  FALSE FALSE FALSE FALSE

VPCZDLRA             FALSE TRUE  FALSE FALSE FALSE FALSE


SQL>

 

Checking ZERO RPO

To check if the config is OK, you can simply force an archivelog switch at the primary and verify that everything arrived OK at both sites.
At Primary ORCL19:

 

 

[oracle@orcl19p ~]$ sqlplus / as sysdba




SQL*Plus: Release 19.0.0.0.0 - Production on Mon Nov 4 00:46:46 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle.  All rights reserved.







Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0




SQL> archive log list;

Database log mode              Archive Mode

Automatic archival             Enabled

Archive destination            USE_DB_RECOVERY_FILE_DEST

Oldest online log sequence     117

Next log sequence to archive   119

Current log sequence           119

SQL> alter system archive log current;




System altered.




SQL> exit

Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0

[oracle@orcl19p ~]$ rman target=/ catalog=vpczdlra/oracle@zdlras1




Recovery Manager: Release 19.0.0.0.0 - Production on Mon Nov 4 00:47:04 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: ORCL19 (DBID=324042554)

connected to recovery catalog database

PL/SQL package VPCZDLRA.DBMS_RCVCAT version 19.03.00.00. in RCVCAT database is not current

PL/SQL package VPCZDLRA.DBMS_RCVMAN version 19.03.00.00 in RCVCAT database is not current




RMAN> list backup of archivelog sequence 118;







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

1965    8.00K      SBT_TAPE    00:00:01     04/11/2019 00:45:49

        BP Key: 1966   Status: AVAILABLE  Compressed: YES  Tag: TAG20191104T004548

        Handle: $RSCN_1_RTIM_1022714106_THRD_1_SEQ_118_CTKEY_1946_BACKUP   Media:




  List of Archived Logs in backup set 1965

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    118     2137116    04/11/2019 00:44:35 2137128    04/11/2019 00:44:39




RMAN> exit







Recovery Manager complete.

[oracle@orcl19p ~]$

At Standby ORCL19S

[oracle@orcl19s ~]$ rman target=/ catalog=vpczdlra/oracle@zdlras2




Recovery Manager: Release 19.0.0.0.0 - Production on Mon Nov 4 00:48:39 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: ORCL19 (DBID=324042554, not open)

connected to recovery catalog database

PL/SQL package VPCZDLRA.DBMS_RCVCAT version 19.03.00.00. in RCVCAT database is not current

PL/SQL package VPCZDLRA.DBMS_RCVMAN version 19.03.00.00 in RCVCAT database is not current




RMAN> list backup of archivelog sequence 118;







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

2240    8.00K      SBT_TAPE    00:00:00     04/11/2019 00:45:35

        BP Key: 2241   Status: AVAILABLE  Compressed: YES  Tag: TAG20191104T004535

        Handle: $RSCN_1_RTIM_1022714106_THRD_1_SEQ_118_CTKEY_2228_BACKUP   Media:




  List of Archived Logs in backup set 2240

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    118     2137116    04/11/2019 00:44:35 2137128    04/11/2019 00:44:39




RMAN> exit







Recovery Manager complete.

[oracle@orcl19s ~]$

 

So, as you can see above, ZDLRAS1 received the online log (sequence 118) from ORCL19, and ZDLRAS2 received a copy of it (through the standby redo log) from ORCL19S. This means that you now have zero RPO for both primary and standby.
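If you want to double-check this from the ZDLRA side, the recovery appliance catalog exposes per-database protection data. As a sketch (I am assuming the RA_DATABASE catalog view and its UNPROTECTED_WINDOW column here; check the ZDLRA documentation for your version), connected as rasys:

-- Assumption: RA_DATABASE/UNPROTECTED_WINDOW as documented for ZDLRA.
-- UNPROTECTED_WINDOW indicates how much data would be lost right now.
SELECT db_unique_name, unprotected_window
  FROM ra_database;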

 

Broker config

From 12.1 onwards, you can have the recovery_appliance configured directly from the broker when using DG. The broker executes the configuration of the log_archive_dest parameters automatically when you add the ZDLRA.

 

Basic config

Pretty much the same config that was made in the manual configuration before.
At Primary ORCL19:

 

[oracle@orcl19p ~]$ sqlplus / as sysdba




SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 9 16:24:27 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle.  All rights reserved.







Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0




SQL>

SQL> col username format a20

SQL> select USERNAME,SYSDBA,SYSOPER,SYSASM,SYSBACKUP,SYSDG,SYSKM from v$pwfile_users;




USERNAME             SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM

-------------------- ----- ----- ----- ----- ----- -----

SYS                  TRUE  TRUE  FALSE FALSE FALSE FALSE




SQL> show parameter redo_transport_user;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

redo_transport_user                  string

SQL> alter system set redo_transport_user='VPCZDLRA' scope = spfile SID='*';




System altered.




SQL>

SQL> alter session set "_ORACLE_SCRIPT"=true;




Session altered.




SQL> create user VPCZDLRA identified by oracle;




User created.




SQL> grant sysoper to VPCZDLRA;




Grant succeeded.




SQL> alter session set "_ORACLE_SCRIPT"=false;




Session altered.




SQL> select USERNAME,SYSDBA,SYSOPER,SYSASM,SYSBACKUP,SYSDG,SYSKM from v$pwfile_users;




USERNAME             SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM

-------------------- ----- ----- ----- ----- ----- -----

SYS                  TRUE  TRUE  FALSE FALSE FALSE FALSE

VPCZDLRA             FALSE TRUE  FALSE FALSE FALSE FALSE




SQL> exit

Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0

[oracle@orcl19p ~]$
 
At Standby ORCL19S

 

[oracle@orcl19s ~]$ sqlplus / as sysdba




SQL*Plus: Release 19.0.0.0.0 - Production on Sat Nov 9 16:25:09 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle.  All rights reserved.







Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0




SQL>

SQL> alter system set redo_transport_user='VPCZDLRA' scope = spfile SID='*';




System altered.




SQL>

SQL> exit

Disconnected from Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.5.0.0.0

[oracle@orcl19s dbs]$

 

Password file

At Primary ORCL19:

 

[grid@orcl19p ~]$ asmcmd

ASMCMD> cd +DATA/ORCL19/

ASMCMD> ls -l

Type             Redund  Striped  Time             Sys  Name

                                                   Y    95D7C2B2568F0A70E0533205A00AC96B/

                                                   Y    95D86C40430C3E85E0533205A00A0EBA/

                                                   Y    CONTROLFILE/

                                                   Y    DATAFILE/

                                                   Y    DATAGUARDCONFIG/

                                                   Y    ONLINELOG/

                                                   Y    PARAMETERFILE/

                                                   Y    PASSWORD/

                                                   Y    TEMPFILE/

DATAGUARDCONFIG  UNPROT  COARSE   NOV 03 23:00:00  N    dr1orcl19.dat => +DATA/ORCL19/DATAGUARDCONFIG/ORCL19.275.1023297961

ASMCMD> ls -l PASSWORD/

Type      Redund  Striped  Time             Sys  Name

PASSWORD  UNPROT  COARSE   OCT 26 23:00:00  Y    pwdorcl19.256.1022714087

ASMCMD> ls -l

Type             Redund  Striped  Time             Sys  Name

                                                   Y    95D7C2B2568F0A70E0533205A00AC96B/

                                                   Y    95D86C40430C3E85E0533205A00A0EBA/

                                                   Y    CONTROLFILE/

                                                   Y    DATAFILE/

                                                   Y    DATAGUARDCONFIG/

                                                   Y    ONLINELOG/

                                                   Y    PARAMETERFILE/

                                                   Y    PASSWORD/

                                                   Y    TEMPFILE/

DATAGUARDCONFIG  UNPROT  COARSE   NOV 03 23:00:00  N    dr1orcl19.dat => +DATA/ORCL19/DATAGUARDCONFIG/ORCL19.275.1023297961

ASMCMD>

ASMCMD> cp PASSWORD/pwdorcl19.256.1022714087 /tmp

copying +DATA/ORCL19/PASSWORD/pwdorcl19.256.1022714087 -> /tmp/pwdorcl19.256.1022714087

ASMCMD> exit

[grid@orcl19p ~]$

At Standby ORCL19S

The procedure is the same as in the manual configuration above: scp the password file saved in /tmp from the primary host to the standby host, and use asmcmd pwcopy to place it over the standby password file (+DATA/ORCL19S/orapworcl19s).

 

Wallet Configuration

 

When using the broker configuration, it is recommended to add both credential entries (for the ZDLRA of the primary and of the standby) to the wallet of both databases.
 
At Primary ORCL19:

 

[oracle@orcl19p ~]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -createCredential zdlras2-scan:1521/zdlras2:VPCZDLRA vpczdlra oracle

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.




[oracle@orcl19p ~]$

[oracle@orcl19p ~]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -listCredential

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.




List credential (index: connect_string username)

2: zdlras2-scan:1521/zdlras2:VPCZDLRA vpczdlra

1: zdlras1-scan:1521/zdlras1:VPCZDLRA vpczdlra

[oracle@orcl19p ~]$

 

At Standby ORCL19S:

 

[oracle@orcl19s ~]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -createCredential zdlras1-scan:1521/zdlras1:VPCZDLRA vpczdlra oracle

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.




[oracle@orcl19s ~]$ mkstore -wrl /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/ra_wallet -listCredential

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.4.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.




List credential (index: connect_string username)

2: zdlras1-scan:1521/zdlras1:VPCZDLRA vpczdlra

1: zdlras2-scan:1521/zdlras2:VPCZDLRA vpczdlra

[oracle@orcl19s ~]$
 

Bounce everything

At Standby ORCL19S:

 

[oracle@orcl19s ~]$ srvctl stop database -d orcl19s -o immediate

[oracle@orcl19s ~]$

At Primary ORCL19

[oracle@orcl19p ~]$ srvctl stop database -d orcl19 -o immediate

[oracle@orcl19p ~]$ srvctl start database -d orcl19

[oracle@orcl19p ~]$

At Standby ORCL19S

[oracle@orcl19s ~]$ srvctl start database -d orcl19s

[oracle@orcl19s ~]$

 

Broker config

Now I can add the ZDLRAs to the config. First, look at the current config:

 

 

[oracle@orcl19p ~]$ dgmgrl sys/oracle@orcl19

DGMGRL for Linux: Release 19.0.0.0.0 - Production on Sat Nov 9 16:26:26 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




Welcome to DGMGRL, type "help" for information.

Connected to "ORCL19"

Connected as SYSDBA.

DGMGRL>

DGMGRL> show configuration verbose;




Configuration - orcl19




  Protection Mode: MaxAvailability

  Members:

  orcl19  - Primary database

    orcl19s - Physical standby database




  Properties:

    FastStartFailoverThreshold      = '30'

    OperationTimeout                = '30'

    TraceLevel                      = 'USER'

    FastStartFailoverLagLimit       = '0'

    CommunicationTimeout            = '180'

    ObserverReconnect               = '0'

    FastStartFailoverAutoReinstate  = 'TRUE'

    FastStartFailoverPmyShutdown    = 'TRUE'

    BystandersFollowRoleChange      = 'ALL'

    ObserverOverride                = 'FALSE'

    ExternalDestination1            = ''

    ExternalDestination2            = ''

    PrimaryLostWriteAction          = 'CONTINUE'

    ConfigurationWideServiceName    = 'orcl19_CFG'




Fast-Start Failover:  Disabled




Configuration Status:

SUCCESS




DGMGRL>



Now, add both ZDLRA entries. Note that the connect identifier is the same as the credential name in the wallet. This is a key point too.

 

DGMGRL> add recovery_appliance zdlras1 as connect identifier is 'zdlras1-scan:1521/zdlras1:VPCZDLRA';

Recovery Appliance "zdlras1" added

DGMGRL> add recovery_appliance zdlras2 as connect identifier is 'zdlras2-scan:1521/zdlras2:VPCZDLRA';

Recovery Appliance "zdlras2" added

DGMGRL> show configuration verbose;




Configuration - orcl19




  Protection Mode: MaxAvailability

  Members:

  orcl19  - Primary database

    orcl19s - Physical standby database

      zdlras2 - Recovery appliance (disabled)

        ORA-16905: The member was not enabled yet.




    zdlras1 - Recovery appliance (disabled)

      ORA-16905: The member was not enabled yet.




  Properties:

    FastStartFailoverThreshold      = '30'

    OperationTimeout                = '30'

    TraceLevel                      = 'USER'

    FastStartFailoverLagLimit       = '0'

    CommunicationTimeout            = '180'

    ObserverReconnect               = '0'

    FastStartFailoverAutoReinstate  = 'TRUE'

    FastStartFailoverPmyShutdown    = 'TRUE'

    BystandersFollowRoleChange      = 'ALL'

    ObserverOverride                = 'FALSE'

    ExternalDestination1            = ''

    ExternalDestination2            = ''

    PrimaryLostWriteAction          = 'CONTINUE'

    ConfigurationWideServiceName    = 'orcl19_CFG'




Fast-Start Failover:  Disabled




Configuration Status:

SUCCESS




DGMGRL>


Note that now I have both ZDLRAs in the config, but they are not enabled yet. Before I enable them, I need to configure the RedoRoutes. If I do not do that, both ZDLRAs will be enabled as log_archive_dest for the primary site. This is a key point too.

 

DGMGRL> edit database orcl19 set property RedoRoutes = '(orcl19 : orcl19s sync, zdlras1 async)( orcl19s : zdlras2 async )';

Property "redoroutes" updated

DGMGRL> edit database orcl19s set property RedoRoutes = '(orcl19s : orcl19 sync, zdlras2 async)( orcl19 : zdlras1 async )';

Property "redoroutes" updated

DGMGRL>

 

Just to explain what this means:
  • RedoRoutes for ORCL19: When ORCL19 is the primary database, it will send the redo to ORCL19S in sync mode, and to ZDLRAS1 in async mode. At the same time, ORCL19S will send its redo (standby redo, since it is the standby) to ZDLRAS2 in async mode.
  • RedoRoutes for ORCL19S: When ORCL19S is the primary database, it will send the redo to ORCL19 in sync mode, and to ZDLRAS2 in async mode. At the same time, ORCL19 will send its redo (standby redo, since it is the standby) to ZDLRAS1 in async mode. (You can verify the resulting property with dgmgrl, as shown below.)
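A quick sketch of that verification (show database accepts a property name in dgmgrl):

DGMGRL> show database orcl19 RedoRoutes;

DGMGRL> show database orcl19s RedoRoutes;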
Now, I need to enable the ZDLRAs. I will do it one by one:

 

DGMGRL> show configuration;




Configuration - orcl19




  Protection Mode: MaxAvailability

  Members:

  orcl19  - Primary database

    orcl19s - Physical standby database

      zdlras2 - Recovery appliance (disabled)

        ORA-16905: The member was not enabled yet.




    zdlras1 - Recovery appliance (disabled)

      ORA-16905: The member was not enabled yet.




Fast-Start Failover:  Disabled




Configuration Status:

SUCCESS   (status updated 35 seconds ago)




DGMGRL>

DGMGRL> enable recovery_appliance zdlras1;

Enabled.

DGMGRL>


And the primary (ORCL19) alert log will show (look at the parameters being set):

 

2019-11-09T21:36:24.189975+01:00

Data Guard Broker executes SQL [alter system set log_archive_config='dg_config=(ORCL19,orcl19s,zdlras1)']

2019-11-09T21:36:24.205453+01:00

ALTER SYSTEM SET log_archive_config='dg_config=(ORCL19,orcl19s,zdlras1)' SCOPE=BOTH;

2019-11-09T21:36:36.625928+01:00

ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;

2019-11-09T21:36:36.638226+01:00

RSM0 (PID:9918): Using STANDBY_ARCHIVE_DEST parameter default value as USE_DB_RECOVERY_FILE_DEST [krsd.c:18157]

2019-11-09T21:36:36.650443+01:00

ALTER SYSTEM SET log_archive_dest_3='service="zdlras1-scan:1521/zdlras1:VPCZDLRA"','ASYNC NOAFFIRM delay=0 optional compression=disable max_failure=0 reopen=300 db_unique_name="zdlras1" net_timeout=30','valid_for=(online_logfile,all_roles)' SCOPE=BOTH;

2019-11-09T21:36:36.670412+01:00

ALTER SYSTEM SET log_archive_dest_state_3='ENABLE' SCOPE=BOTH;

2019-11-09T21:36:36.670547+01:00

ALTER SYSTEM ARCHIVE LOG

2019-11-09T21:36:36.683698+01:00

Thread 1 cannot allocate new log, sequence 242

Checkpoint not complete

  Current log# 3 seq# 241 mem# 0: +DATA/ORCL19/ONLINELOG/group_3.260.1022714131

  Current log# 3 seq# 241 mem# 1: +RECO/ORCL19/ONLINELOG/group_3.259.1022714135

2019-11-09T21:36:38.922616+01:00

LGWR (PID:7900): SRL selected to archive T-1.S-242

LGWR (PID:7900): SRL selected for T-1.S-242 for LAD:2

2019-11-09T21:36:38.934895+01:00

Thread 1 advanced to log sequence 242 (LGWR switch)

  Current log# 1 seq# 242 mem# 0: +DATA/ORCL19/ONLINELOG/group_1.258.1022714111

  Current log# 1 seq# 242 mem# 1: +RECO/ORCL19/ONLINELOG/group_1.257.1022714115

2019-11-09T21:36:38.974260+01:00

RSM0 (PID:9918): Archived Log entry 588 added for T-1.S-241 ID 0x135c2429 LAD:1




And in the alert log from the standby (ORCL19S):

2019-11-09T21:36:23.036628+01:00

Data Guard Broker executes SQL [alter system set log_archive_config='dg_config=(orcl19s,orcl19,zdlras1)']

2019-11-09T21:36:23.051110+01:00

ALTER SYSTEM SET log_archive_config='dg_config=(orcl19s,orcl19,zdlras1)' SCOPE=BOTH;

2019-11-09T21:36:38.755320+01:00

 rfs (PID:22533): Standby controlfile consistent with primary

2019-11-09T21:36:38.770320+01:00

 rfs (PID:22533): Selected LNO:4 for T-1.S-242 dbid 324042554 branch 1022714106

2019-11-09T21:36:38.773001+01:00

MRP0 (PID:11871): Media Recovery Waiting for T-1.S-242 (in transit)

2019-11-09T21:36:38.776175+01:00

ARC0 (PID:9009): Archived Log entry 304 added for T-1.S-241 ID 0x135c2429 LAD:1

2019-11-09T21:36:39.836191+01:00

Recovery of Online Redo Log: Thread 1 Group 4 Seq 242 Reading mem 0

  Mem# 0: +DATA/ORCL19S/ONLINELOG/group_4.279.1023289067

  Mem# 1: +RECO/ORCL19S/ONLINELOG/group_4.269.1023289069


And now, the second ZDLRA:

 

DGMGRL> enable recovery_appliance zdlras2;

Enabled.

DGMGRL>

Again, the alertlog at the primary (ORCL19):

2019-11-09T21:37:32.341087+01:00

Data Guard Broker executes SQL [alter system set log_archive_config='dg_config=(ORCL19,orcl19s,zdlras1,zdlras2)']

2019-11-09T21:37:32.353276+01:00

ALTER SYSTEM SET log_archive_config='dg_config=(ORCL19,orcl19s,zdlras1,zdlras2)' SCOPE=BOTH;

2019-11-09T21:37:40.827841+01:00

ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=MEMORY SID='*';

2019-11-09T21:37:41.943026+01:00

LGWR (PID:7900): SRL selected to archive T-1.S-243

LGWR (PID:7900): SRL selected for T-1.S-243 for LAD:2

2019-11-09T21:37:41.949207+01:00

Thread 1 advanced to log sequence 243 (LGWR switch)

  Current log# 2 seq# 243 mem# 0: +DATA/ORCL19/ONLINELOG/group_2.259.1022714121

  Current log# 2 seq# 243 mem# 1: +RECO/ORCL19/ONLINELOG/group_2.258.1022714125

2019-11-09T21:37:41.981177+01:00

ARC1 (PID:10098): Archived Log entry 591 added for T-1.S-242 ID 0x135c2429 LAD:1


And in the alertlog from the standby (ORCL19S):

 

2019-11-09T21:37:31.175857+01:00

Data Guard Broker executes SQL [alter system set log_archive_config='dg_config=(orcl19s,orcl19,zdlras1,zdlras2)']

2019-11-09T21:37:31.193280+01:00

ALTER SYSTEM SET log_archive_config='dg_config=(orcl19s,orcl19,zdlras1,zdlras2)' SCOPE=BOTH;

2019-11-09T21:37:37.424268+01:00

RSM0 (PID:10393): Using STANDBY_ARCHIVE_DEST parameter default value as USE_DB_RECOVERY_FILE_DEST [krsd.c:18157]

2019-11-09T21:37:37.438735+01:00

ALTER SYSTEM SET log_archive_dest_2='service="zdlras2-scan:1521/zdlras2:VPCZDLRA"','ASYNC NOAFFIRM delay=0 optional compression=disable max_failure=0 reopen=300 db_unique_name="zdlras2" net_timeout=30','valid_for=(standby_logfile,all_roles)' SCOPE=BOTH;

2019-11-09T21:37:37.476545+01:00

ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;

2019-11-09T21:37:41.778391+01:00

MRP0 (PID:11871): Media Recovery Waiting for T-1.S-243 (in transit)

2019-11-09T21:37:41.791532+01:00

ARC0 (PID:9009): Archived Log entry 312 added for T-1.S-242 ID 0x135c2429 LAD:1

2019-11-09T21:37:41.793625+01:00

 rfs (PID:22533): Standby controlfile consistent with primary

2019-11-09T21:37:41.800377+01:00

 rfs (PID:22533): Selected LNO:5 for T-1.S-243 dbid 324042554 branch 1022714106

2019-11-09T21:37:42.826630+01:00

Recovery of Online Redo Log: Thread 1 Group 5 Seq 243 Reading mem 0

  Mem# 0: +DATA/ORCL19S/ONLINELOG/group_5.278.1023289071

  Mem# 1: +RECO/ORCL19S/ONLINELOG/group_5.268.1023289073


And now the configuration reports:

 

DGMGRL> show configuration;




Configuration - orcl19




  Protection Mode: MaxAvailability

  Members:

  orcl19  - Primary database

    orcl19s - Physical standby database

      zdlras2 - Recovery appliance (receiving current redo)

    zdlras1 - Recovery appliance




Fast-Start Failover:  Disabled




Configuration Status:

SUCCESS   (status updated 54 seconds ago)




DGMGRL>

 
As you can see, the primary is cascading the redo to ORCL19S and to ZDLRAS1. And ORCL19S is cascading its redo (standby redo) to ZDLRAS2.

 

Checking ZERO RPO

 

To check that everything is fine and that we have zero RPO at both sites, I generated an archivelog at the primary site and checked the cascade of events.
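Before going to RMAN, a quick sanity check is to look at the lag that Data Guard itself reports on the standby. A minimal sketch (the view and columns are standard; the expected values are what a healthy SYNC transport would show):

SQL> select name, value from v$dataguard_stats where name in ('transport lag', 'apply lag');

With SYNC transport healthy, the transport lag should report +00 00:00:00.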
At Primary ORCL19:

 

[oracle@orcl19p ~]$ rman target=/ catalog=vpczdlra/oracle@zdlras1




Recovery Manager: Release 19.0.0.0.0 - Production on Sat Nov 9 21:43:25 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: ORCL19 (DBID=324042554)

connected to recovery catalog database

PL/SQL package VPCZDLRA.DBMS_RCVCAT version 19.03.00.00. in RCVCAT database is not current

PL/SQL package VPCZDLRA.DBMS_RCVMAN version 19.03.00.00 in RCVCAT database is not current




RMAN> alter system archive log current;




Statement processed




RMAN> list copy of archivelog all completed after "sysdate - 5/1440";




List of Archived Log Copies for database with db_unique_name ORCL19

=====================================================================




Key     Thrd Seq     S Low Time

------- ---- ------- - -------------------

3296    1    243     A 09/11/2019 21:37:41

        Name: +RECO/ORCL19/ARCHIVELOG/2019_11_09/thread_1_seq_243.389.1023918423







RMAN> list backup of archivelog all completed after "sysdate - 5/1440";







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

3304    322.00K    SBT_TAPE    00:00:02     09/11/2019 21:48:02

        BP Key: 3305   Status: AVAILABLE  Compressed: YES  Tag: TAG20191109T214800

        Handle: $RSCN_1_RTIM_1022714106_THRD_1_SEQ_243_CTKEY_3206_BACKUP   Media:




  List of Archived Logs in backup set 3304

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    243     3082409    09/11/2019 21:37:41 3083705    09/11/2019 21:47:03




RMAN>

At the Standby ORCL19S:

 

[oracle@orcl19s ~]$ rman target=/ catalog=vpczdlra/oracle@zdlras2




Recovery Manager: Release 19.0.0.0.0 - Production on Sat Nov 9 21:44:58 2019

Version 19.5.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: ORCL19 (DBID=324042554, not open)

connected to recovery catalog database

PL/SQL package VPCZDLRA.DBMS_RCVCAT version 19.03.00.00. in RCVCAT database is not current

PL/SQL package VPCZDLRA.DBMS_RCVMAN version 19.03.00.00 in RCVCAT database is not current




RMAN> list copy of archivelog all completed after "sysdate - 5/1440";




List of Archived Log Copies for database with db_unique_name ORCL19S

=====================================================================




Key     Thrd Seq     S Low Time

------- ---- ------- - -------------------

3428    1    243     A 09/11/2019 21:37:41

        Name: +RECO/ORCL19S/ARCHIVELOG/2019_11_09/thread_1_seq_243.371.1023918423







RMAN> list backup of archivelog all completed after "sysdate - 5/1440";







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

3419    322.00K    SBT_TAPE    00:00:00     09/11/2019 21:47:10

        BP Key: 3420   Status: AVAILABLE  Compressed: YES  Tag: TAG20191109T214710

        Handle: $RSCN_1_RTIM_1022714106_THRD_1_SEQ_243_CTKEY_3306_BACKUP   Media:




  List of Archived Logs in backup set 3419

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    243     3082409    09/11/2019 21:37:41 3083705    09/11/2019 21:47:03




RMAN>
As you can see, sequence 243 was created at the primary and sent by DG to ORCL19S. ZDLRAS1 created the archivelog backup automatically, and ZDLRAS2 did the same.

 

Multiple levels of protection

The idea of this post was to show how to protect a big environment with ZDLRA, reaching ZERO RPO for both sites and going beyond the protection defined by the Oracle MAA docs. If you look closely, the environment that I showed here has zero RPO and zero RTO. You will lose data only if you have a catastrophic failure at both sites (and you can protect even more if you use tape backup offloads from ZDLRA).
As said before, these critical/extremely critical environments need to follow strict rules for data protection. Adding ZDLRA provides an additional level of protection and allows you to use features like the incremental backup strategy.

 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

 


ZDLRA, Real-Time REDO and RPO ZERO
Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ZDLRA, Real-Time REDO and RPO ZERO

The idea of Real-Time Redo is to reach zero RPO for every kind of database, and this includes databases with and without DG. As you can see in my last post, where I showed how to configure Real-Time Redo for one database, a few small steps need to be executed, and they are pretty similar to configuring a remote archivelog destination for DG.
But if you noticed, the remote destination was defined as ASYNC, as hinted in the ZDLRA docs (“Protection of Ongoing Transactions” or “How Real-Time Redo Transport Works”). In the same post, I flagged this as “controversial” because ASYNC does not guarantee zero RPO.
You can see more in the Data Guard docs (Oracle Data Guard Protection Modes and Oracle Data Guard Concepts and Administration), but the summary is:
  • ASYNC: The primary database does not wait for the response from the remote destination.
  • SYNC/NOAFFIRM: The primary database holds the commit until the remote destination reports that it received the redo data. It does not wait until the remote site reports that the data was written to disk.
  • SYNC/AFFIRM: The primary database holds the commit until the remote destination reports that it received the redo data and wrote it to disk.
You can read about the difference in more detail here: Best Practices for Synchronous Redo Transport and Best Practices for Asynchronous Redo Transport.
The idea is simple: if you use ASYNC, there is no guarantee of zero data loss between the primary database and the remote destination.
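You can confirm which mode a destination is actually using directly from the database. A minimal sketch against v$archive_dest (dest_id 2 here is an assumption, matching the destination shown later in this post):

SQL> select dest_id, transmit_mode, affirm from v$archive_dest where dest_id = 2;

TRANSMIT_MODE reports SYNCHRONOUS or ASYNCHRONOUS, and AFFIRM reports YES or NO.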
ZDLRA Protection
If needed, the configuration to reach zero RPO when using ZDLRA is simple: you just need to put the archive destination in SYNC mode. Simple as that. But before putting the archivelog destination in SYNC, I will do an example using ASYNC to show some features of real-time redo.
Saving crashes
One interesting point is that ZDLRA generates a new archivelog for you in case of a primary database crash. As the documentation says:
If the redo stream terminates unexpectedly, then the Recovery Appliance can close the incoming redo stream and create a partial archived redo log file backup, thereby protecting transactions up to the last change that the appliance received.
So, even without using DG or SYNC, you are protected up to the last transaction. In real life it looks like this (I will check the current archivelog, simulate a catastrophic failure by killing the instance processes, and at the end check how far the protection went):

 

Current archivelog:

 

SQL> archive log list;

Database log mode              Archive Mode

Automatic archival             Enabled

Archive destination            USE_DB_RECOVERY_FILE_DEST

Oldest online log sequence     29

Next log sequence to archive   31

Current log sequence           31

SQL>

SQL> ALTER SYSTEM ARCHIVE LOG current;




System altered.




SQL>

SQL> show parameter dest_2;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

db_create_online_log_dest_2          string

log_archive_dest_2                   string      SERVICE="zeros1ig-scan.simo.fz

                                                 net.org:1521/zeros1:dedicated"

                                                  ASYNC NOAFFIRM delay=0 option

                                                 al compression=disable max_fai

                                                 lure=1 max_connections=1 reope

                                                 n=10 DB_UNIQUE_NAME=zeros1 net

                                                 _timeout=8 group=1 priority=1

                                                 VALID_FOR=(ONLINE_LOGFILE,ALL_

                                                 ROLES)

log_archive_dest_20                  string



 

Backups of archivelog (note the “Next Time”):

 

RMAN> list backup of archivelog sequence 31;







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958158 1.12M      SBT_TAPE    00:00:00     2019-10-18_19-28-21

        BP Key: 50958159   Status: AVAILABLE  Compressed: YES  Tag: TAG20191018T222821

        Handle: $RSCN_1_RTIM_1022003674_THRD_1_SEQ_31_CTKEY_50958151_BACKUP   Media:




  List of Archived Logs in backup set 50958158

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    31      1025023    2019-10-18_19-27-40 1025761    2019-10-18_19-28-20




RMAN> list backup of archivelog sequence 32;




specification does not match any backup in the repository




RMAN>

 

Killing the instance:

 

[oracle@exac1vm01-ORAD18]$ ps -ef |grep pmon_ORAD18

oracle    31646      1  0 22:27 ?        00:00:00 ora_pmon_ORAD18

oracle    37116 396809  0 22:29 pts/2    00:00:00 grep pmon_ORAD18

[oracle@exac1vm01-ORAD18]$ ps -ef |grep smon_ORAD18

oracle    31731      1  0 22:27 ?        00:00:00 ora_smon_ORAD18

oracle    37213 396809  0 22:29 pts/2    00:00:00 grep smon_ORAD18

[oracle@exac1vm01-ORAD18]$ ps -ef |grep lgwr_ORAD18

oracle    31724      1  0 22:27 ?        00:00:00 ora_lgwr_ORAD18

oracle    37245 396809  0 22:29 pts/2    00:00:00 grep lgwr_ORAD18

[oracle@exac1vm01-ORAD18]$ ps -ef |grep dbw0_ORAD18

oracle    31722      1  0 22:27 ?        00:00:00 ora_dbw0_ORAD18

oracle    40104 396809  0 22:29 pts/2    00:00:00 grep dbw0_ORAD18

[oracle@exac1vm01-ORAD18]$ ps -ef |grep arc* |grep ORAD18

oracle    31805      1  0 22:27 ?        00:00:00 ora_mark_ORAD18

oracle    32021      1  0 22:27 ?        00:00:00 ora_arc0_ORAD18

oracle    32042      1  0 22:27 ?        00:00:00 ora_arc1_ORAD18

oracle    32050      1  0 22:27 ?        00:00:00 ora_arc2_ORAD18

oracle    32057      1  0 22:27 ?        00:00:00 ora_arc3_ORAD18

[oracle@exac1vm01-ORAD18]$

[oracle@exac1vm01-ORAD18]$

[oracle@exac1vm01-ORAD18]$

[oracle@exac1vm01-ORAD18]$ date

Fri Oct 18 22:30:32 CEST 2019

[oracle@exac1vm01-ORAD18]$ kill -9 31646 31731 31724 31722 32021 32042 32050 32057

[oracle@exac1vm01-ORAD18]$

#####

At alertlog:

    2019-10-18T22:30:33.160912+02:00

    RMS0 (ospid: 31718): terminating the instance due to ORA error 472

    Cause - 'Instance is being terminated due to fatal process PMON being dead.'

    2019-10-18T22:30:33.290189+02:00

    System state dump requested by (instance=1, osid=31718 (RMS0)), summary=[abnormal instance termination]. error - 'Instance is terminating.

    '

    System State dumped to trace file /u01/app/oracle/diag/rdbms/orad18/ORAD18/trace/ORAD18_diag_31692_20191018223033.trc

    2019-10-18T22:30:34.378356+02:00

    Dumping diagnostic data in directory=[cdmp_20191018223033], requested by (instance=1, osid=31718 (RMS0)), summary=[abnormal instance termination].

    2019-10-18T22:30:35.642670+02:00

    Instance terminated by RMS0, pid = 31718

Note above the time when I killed the instance.

 

How Far It Went

 

[oracle@exac1vm01-ORAD18]$ export ORACLE_SID=SIMON

[oracle@exac1vm01-SIMON]$ rman target / catalog vpcusr/welcome1@zeros1ig-scan:1521/zeros1:dedicated




Recovery Manager: Release 18.0.0.0.0 - Production on Fri Oct 18 22:31:39 2019

Version 18.3.0.0.0




Copyright (c) 1982, 2018, Oracle and/or its affiliates.  All rights reserved.




connected to target database (not started)

connected to recovery catalog database




RMAN> startup force nomount;




startup failed: ORA-01078: failure in processing system parameters

LRM-00109: could not open parameter file '/u01/app/oracle/product/18.0.0/dbhome_1/dbs/initSIMON.ora'




starting Oracle instance without parameter file for retrieval of spfile

Oracle instance started




Total System Global Area    1073740616 bytes




Fixed Size                     8665928 bytes

Variable Size                281018368 bytes

Database Buffers             775946240 bytes

Redo Buffers                   8110080 bytes




RMAN> set dbid 2464352672;




executing command: SET DBID

database name is "ORAD18" and DBID is 2464352672




RMAN> list backup of archivelog sequence 32;







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958220 72.50K     SBT_TAPE    00:00:00     2019-10-18_22-30-42

        BP Key: 50958221   Status: AVAILABLE  Compressed: YES  Tag: TAG20191018T223042

        Handle: $RSCN_1_RTIM_1022003674_THRD_1_SEQ_32_CTKEY_50958156_BACKUP   Media:




  List of Archived Logs in backup set 50958220

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    32      1025761    2019-10-18_22-28-20 1026104    2019-10-18_22-30-32




RMAN>

 

Above you can see that I now have in the catalog one backup of archivelog sequence 32. And its “Next Time” is 22:30:32, the exact moment when I killed the instance. So ZDLRA protected the database up to the last point before the crash (even in ASYNC mode).
One little explanation: as you saw, I opened a new instance (called SIMON) to check the catalog. I wanted to avoid opening the same instance (ORAD18), because the partial archivelog generated by ZDLRA would be superseded once it believes the database has returned. This is explained in the doc:
If the protected database crashes, redo data received from the current redo log group until the time of the crash is backed up at the Recovery Appliance as a “partial” archived redo log. If the protected database is reopened, crash recovery of the protected database will complete the current redo log group at the time of the crash, and the completed redo log will be re-shipped to the Recovery Appliance through the automatic Data Guard Gap fetching feature. The “complete” archived redo log will be used in any future restore/recover operations instead of the previously backed up “partial” archived redo log.
 

Saving crashes, now with load

 

Using the example above, you could argue that ZDLRA reached zero RPO just because there was no load on the database. And, worse, that without SYNC you can’t guarantee zero data loss. Yes, I agree with you on both points (and I already exposed this in my last post and in the introduction of this post). So now I will show you how to use ZDLRA with SYNC for the archivelog destination.

 

Sync mode

Putting the destination in SYNC mode is just a matter of changing it to SYNC NOAFFIRM:

 

 

[oracle@exac1vm01-ORAD18]$ sqlplus / as sysdba




SQL*Plus: Release 18.0.0.0.0 - Production on Fri Oct 18 23:04:25 2019

Version 18.3.0.0.0




Copyright (c) 1982, 2018, Oracle.  All rights reserved.







Connected to:

Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.3.0.0.0




SQL> ALTER SYSTEM SET log_archive_dest_2='SERVICE="zeros1ig-scan.simo.fznet.org:1521/zeros1:dedicated" SYNC NOAFFIRM delay=0 optional compression=disable max_failure=1 max_connections=1 reopen=10 DB_UNIQUE_NAME=zeros1 net_timeout=8 group=1 priority=1 VALID_FOR=(ONLINE_LOGFILE,ALL_ROLES)' scope = spfile sid = '*';




System altered.




SQL>

 

Current archivelog:
After bouncing the instance, I check the current archivelog and create a new one:

 

SQL> alter system archive log current;




System altered.




SQL> archive log list;

Database log mode              Archive Mode

Automatic archival             Enabled

Archive destination            USE_DB_RECOVERY_FILE_DEST

Oldest online log sequence     3

Next log sequence to archive   5

Current log sequence           5

SQL>

 

Backups of archivelog (note the “Next Time”)
Check that no backup exists yet for the archivelog of the current redo log:

 

[oracle@exac1vm01-ORAD18]$ rman target / catalog vpcusr/welcome1@zeros1ig-scan:1521/zeros1:dedicated




Recovery Manager: Release 18.0.0.0.0 - Production on Fri Oct 18 23:05:35 2019

Version 18.3.0.0.0




Copyright (c) 1982, 2018, Oracle and/or its affiliates.  All rights reserved.




connected to target database: ORAD18 (DBID=2464352672)

connected to recovery catalog database




RMAN> list backup of archivelog sequence 4;







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958759 41.00K     SBT_TAPE    00:00:00     2019-10-18_23-05-23

        BP Key: 50958760   Status: AVAILABLE  Compressed: YES  Tag: TAG20191018T230523

        Handle: $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_4_CTKEY_50958752_BACKUP   Media:




  List of Archived Logs in backup set 50958759

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    4       1131626    2019-10-18_23-04-53 1131667    2019-10-18_23-05-09




RMAN> list backup of archivelog sequence 5;




specification does not match any backup in the repository




RMAN>

 

To simulate load, I created a table with 3 columns and started two sessions inserting rows and generating a huge number of commits. After every insert, I commit the data and print the current time.
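The table creation itself is not in the log; a minimal sketch of it, with the datatypes being my assumption based on the values inserted below, would be:

SQL> create table testIns (c1 number, c2 date, c3 varchar2(20));

Table created.

Now, look at the sessions: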

 

[oracle@exac1vm01-ORAD18]$ for i in {1..100000}

> do

> echo "Insert Data $i - `date +%d-%m-%Y-%H%M%S`"

> sqlplus -s / as sysdba<<EOF

> set heading on feedback on;

> insert into testIns(c1, c2, c3) values ($i, sysdate, 'Loop');

> commit;

> EOF

> done

Insert Data 1 - 18-10-2019-230723




1 row created.







Commit complete.




Insert Data 2 - 18-10-2019-230723




1 row created.







Commit complete.




Insert Data 3 - 18-10-2019-230723




1 row created.







Commit complete.




Insert Data 4 - 18-10-2019-230724




1 row created.







Commit complete.

....

....







[oracle@exac1vm01-ORAD18]$

[oracle@exac1vm01-ORAD18]$ for i in {1..100000}

> do

> echo "Insert Data $i - `date +%d-%m-%Y-%H%M%S`"

> sqlplus -s / as sysdba<<EOF

> set heading on feedback on;

> insert into testIns(c1, c2, c3) values ($i, sysdate, 'Loop2');

> commit;

> EOF

> done

Insert Data 1 - 18-10-2019-230816




1 row created.







Commit complete.




Insert Data 2 - 18-10-2019-230816




1 row created.







Commit complete.




Insert Data 3 - 18-10-2019-230816




1 row created.







Commit complete.

...

...

 

Killing the instance

 

[oracle@exac1vm01-]$

[oracle@exac1vm01-]$

[oracle@exac1vm01-]$ ps -ef |grep pmon_ORAD18

oracle    93700      1  0 22:51 ?        00:00:00 ora_pmon_ORAD18

oracle   140193 357155  0 23:09 pts/1    00:00:00 grep pmon_ORAD18

[oracle@exac1vm01-]$  ps -ef |grep smon_ORAD18

oracle    93786      1  0 22:51 ?        00:00:00 ora_smon_ORAD18

oracle   140394 357155  0 23:09 pts/1    00:00:00 grep smon_ORAD18

[oracle@exac1vm01-]$ ps -ef |grep lgwr_ORAD18

oracle    93780      1  0 22:51 ?        00:00:00 ora_lgwr_ORAD18

oracle   140561 357155  0 23:09 pts/1    00:00:00 grep lgwr_ORAD18

[oracle@exac1vm01-]$ ps -ef |grep dbw0_ORAD18

oracle    93776      1  0 22:51 ?        00:00:00 ora_dbw0_ORAD18

oracle   140738 357155  0 23:09 pts/1    00:00:00 grep dbw0_ORAD18

[oracle@exac1vm01-]$ ps -ef |grep arc* |grep ORAD18

oracle    93835      1  0 22:51 ?        00:00:00 ora_mark_ORAD18

oracle   103814      1  0 22:55 ?        00:00:00 ora_arc0_ORAD18

oracle   103816      1  0 22:55 ?        00:00:00 ora_arc1_ORAD18

oracle   103818      1  0 22:55 ?        00:00:00 ora_arc2_ORAD18

oracle   103820      1  0 22:55 ?        00:00:00 ora_arc3_ORAD18

[oracle@exac1vm01-]$

[oracle@exac1vm01-]$ date

Fri Oct 18 23:09:44 CEST 2019

[oracle@exac1vm01-]$ kill -9  93700 93786 93780 93776 103814 103816 103818 103820

[oracle@exac1vm01-]$

 

Failed load
 
As expected, the two sessions failed:

 

...

...

Insert Data 1016 - 18-10-2019-230944




1 row created.







Commit complete.




Insert Data 1017 - 18-10-2019-230944




1 row created.




commit

     *

ERROR at line 1:

ORA-03113: end-of-file on communication channel

Process ID: 142277

Session ID: 53 Serial number: 30197













...

...

Insert Data 646 - 18-10-2019-230944




1 row created.







Commit complete.




Insert Data 647 - 18-10-2019-230944




1 row created.




commit

     *

ERROR at line 1:

ORA-03113: end-of-file on communication channel

Process ID: 142274

Session ID: 41 Serial number: 3186

 

Losing everything
So, the session loading the string “Loop” committed successfully up to value 1016. And the second session, loading the string “Loop2”, created row 647 but failed at the commit, so its last committed value is 646. These values are the marks we want to reach.
But to make it even more precise and catastrophic, I also removed the data from the filesystem/ASM:

 

ASMCMD> cd +DATAC1/ORAD18/

ASMCMD> rm -rf CONTROLFILE/

ASMCMD> rm -rf DATAFILE/

ASMCMD> rm -rf ONLINELOG/

ASMCMD> rm -rf TEMPFILE/

ASMCMD> cd +RECOC1/ORAD18/

ASMCMD> rm -rf ARCHIVELOG/

ASMCMD> rm -rf AUTOBACKUP/

ASMCMD> rm -rf CONTROLFILE/

ASMCMD> rm -rf ONLINELOG/

ASMCMD>

 

Now there is no escape: I need to completely recover the database to retrieve it up to the last transaction. Let’s see if ZDLRA works 😉
Recovering Everything – ZERO RPO
Below I restore the database up to the last transaction. The major steps are the same as in every full disaster recovery:
  • Restore the controlfile
  • Discover the last backup of archivelog
  • Restore and recover the database until the found SCN
Here is how it goes:

 

[oracle@exac1vm01-ORAD18]$ rman target / catalog vpcusr/welcome1@zeros1ig-scan:1521/zeros1:dedicated




Recovery Manager: Release 18.0.0.0.0 - Production on Fri Oct 18 23:12:19 2019

Version 18.3.0.0.0




Copyright (c) 1982, 2018, Oracle and/or its affiliates.  All rights reserved.




connected to target database (not started)

connected to recovery catalog database




RMAN> startup nomount;




Oracle instance started




Total System Global Area    2147480376 bytes




Fixed Size                     8659768 bytes

Variable Size                671088640 bytes

Database Buffers            1459617792 bytes

Redo Buffers                   8114176 bytes




RMAN> list backup of controlfile completed after "sysdate - 10/1440";







List of Backup Sets

===================







BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958685 Full    50.25M     SBT_TAPE    00:00:01     2019-10-18_23-03-15

        BP Key: 50958686   Status: AVAILABLE  Compressed: NO  Tag: TAG20191018T230314

        Handle: c-2464352672-20191018-0b   Media: Recovery Appliance (ZDLRAK)

  Control File Included: Ckp SCN: 1131375      Ckp time: 2019-10-18_23-03-14




RMAN> restore controlfile from tag = TAG20191018T230314;




Starting restore at 2019-10-18_23-13-24

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=30 device type=DISK

allocated channel: ORA_SBT_TAPE_1

channel ORA_SBT_TAPE_1: SID=414 device type=SBT_TAPE

channel ORA_SBT_TAPE_1: RA Library (ZDLRAK) SID=95344A4C604F5404E0538C43B20A52E2




channel ORA_SBT_TAPE_1: starting datafile backup set restore

channel ORA_SBT_TAPE_1: restoring control file

channel ORA_SBT_TAPE_1: reading from backup piece c-2464352672-20191018-0b

channel ORA_SBT_TAPE_1: piece handle=c-2464352672-20191018-0b tag=TAG20191018T230314

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:02

output file name=+DATAC1/ORAD18/CONTROLFILE/current.741.1022012007

output file name=+RECOC1/ORAD18/CONTROLFILE/current.968.1022012009

Finished restore at 2019-10-18_23-13-28




RMAN> list backup of database completed after "sysdate - 15/1440";







List of Backup Sets

===================







BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958689 Incr 1  40.00K     SBT_TAPE    00:00:01     2019-10-18_23-03-11

        BP Key: 50958690   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958684I   Media:

  List of Datafiles in backup set 50958689

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    1  Incr 1131350    2019-10-18_23-03-10              NO    +DATAC1/ORAD18/DATAFILE/system.894.1022011045




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958693 Incr 0  249.44M    SBT_TAPE    00:00:01     2019-10-18_23-03-11

        BP Key: 50958694   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958684_1   Media:

  List of Datafiles in backup set 50958693

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 1131350    2019-10-18_23-03-10              NO    +DATAC1/ORAD18/DATAFILE/system.894.1022011045




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958710 Incr 1  40.00K     SBT_TAPE    00:00:01     2019-10-18_23-03-10

        BP Key: 50958711   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958707I   Media:

  List of Datafiles in backup set 50958710

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  3    1  Incr 1131348    2019-10-18_23-03-09              NO    +DATAC1/ORAD18/DATAFILE/undotbs1.496.1022011015




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958723 Incr 0  10.68M     SBT_TAPE    00:00:01     2019-10-18_23-03-10

        BP Key: 50958724   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958707_3   Media:

  List of Datafiles in backup set 50958723

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  3    0  Incr 1131348    2019-10-18_23-03-09              NO    +DATAC1/ORAD18/DATAFILE/undotbs1.496.1022011015




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958727 Incr 1  40.00K     SBT_TAPE    00:00:01     2019-10-18_23-03-13

        BP Key: 50958728   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958714I   Media:

  List of Datafiles in backup set 50958727

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  4    1  Incr 1131357    2019-10-18_23-03-12              NO    +DATAC1/ORAD18/DATAFILE/users.353.1022011029




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958731 Incr 0  40.00K     SBT_TAPE    00:00:01     2019-10-18_23-03-13

        BP Key: 50958732   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958714_4   Media:

  List of Datafiles in backup set 50958731

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  4    0  Incr 1131357    2019-10-18_23-03-12              NO    +DATAC1/ORAD18/DATAFILE/users.353.1022011029




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958736 Incr 1  40.00K     SBT_TAPE    00:00:01     2019-10-18_23-03-12

        BP Key: 50958737   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958735I   Media:

  List of Datafiles in backup set 50958736

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  2    1  Incr 1131353    2019-10-18_23-03-11              NO    +DATAC1/ORAD18/DATAFILE/sysaux.921.1022010999




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

50958740 Incr 0  93.84M     SBT_TAPE    00:00:01     2019-10-18_23-03-12

        BP Key: 50958741   Status: AVAILABLE  Compressed: YES  Tag: BKP-LEVEL1

        Handle: VB$_1965521110_50958735_2   Media:

  List of Datafiles in backup set 50958740

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  2    0  Incr 1131353    2019-10-18_23-03-11              NO    +DATAC1/ORAD18/DATAFILE/sysaux.921.1022010999




RMAN> list backup of archivelog all completed after "sysdate - 15/1440";







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958604 2.00M      SBT_TAPE    00:00:00     2019-10-18_23-02-52

        BP Key: 50958605   Status: AVAILABLE  Compressed: NO  Tag: BKP-ARCH

        Handle: 14uel9vc_1_1   Media: Recovery Appliance (ZDLRAK)




  List of Archived Logs in backup set 50958604

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    2       1129806    2019-10-18_22-58-29 1131219    2019-10-18_23-02-50




BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958618 1.15M      SBT_TAPE    00:00:00     2019-10-18_23-02-54

        BP Key: 50958619   Status: AVAILABLE  Compressed: YES  Tag: TAG20191018T230254

        Handle: $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_2_CTKEY_50958445_BACKUP   Media:




  List of Archived Logs in backup set 50958618

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    2       1129806    2019-10-18_22-58-29 1131219    2019-10-18_23-02-50




BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958754 67.50K     SBT_TAPE    00:00:00     2019-10-18_23-05-00

        BP Key: 50958755   Status: AVAILABLE  Compressed: YES  Tag: TAG20191018T230500

        Handle: $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_3_CTKEY_50958595_BACKUP   Media:




  List of Archived Logs in backup set 50958754

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    3       1131219    2019-10-18_23-02-50 1131626    2019-10-18_23-04-53




BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958759 41.00K     SBT_TAPE    00:00:00     2019-10-18_23-05-23

        BP Key: 50958760   Status: AVAILABLE  Compressed: YES  Tag: TAG20191018T230523

        Handle: $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_4_CTKEY_50958752_BACKUP   Media:




  List of Archived Logs in backup set 50958759

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    4       1131626    2019-10-18_23-04-53 1131667    2019-10-18_23-05-09




BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

50958809 2.05M      SBT_TAPE    00:00:00     2019-10-18_23-10-12

        BP Key: 50958810   Status: AVAILABLE  Compressed: YES  Tag: TAG20191018T231012

        Handle: $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_5_CTKEY_50958757_BACKUP   Media:




  List of Archived Logs in backup set 50958809

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    5       1131667    2019-10-18_23-05-09 1135762    2019-10-18_23-09-44




RMAN>

       

    

RMAN> alter database mount;




released channel: ORA_DISK_1

released channel: ORA_SBT_TAPE_1

Statement processed




RMAN>

  

  

  

RMAN> run{

2> set until scn 1135762;

3> restore database;

4> recover database;

5> }




executing command: SET until clause




Starting restore at 2019-10-18_23-16-04

Starting implicit crosscheck backup at 2019-10-18_23-16-04

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=35 device type=DISK

Crosschecked 1 objects

Finished implicit crosscheck backup at 2019-10-18_23-16-05




Starting implicit crosscheck copy at 2019-10-18_23-16-05

using channel ORA_DISK_1

Finished implicit crosscheck copy at 2019-10-18_23-16-05




searching for all files in the recovery area

cataloging files...

no files cataloged




using channel ORA_DISK_1

allocated channel: ORA_SBT_TAPE_1

channel ORA_SBT_TAPE_1: SID=419 device type=SBT_TAPE

channel ORA_SBT_TAPE_1: RA Library (ZDLRAK) SID=953453F453376CA6E0538C43B20A62A0




channel ORA_SBT_TAPE_1: starting datafile backup set restore

channel ORA_SBT_TAPE_1: specifying datafile(s) to restore from backup set

channel ORA_SBT_TAPE_1: restoring datafile 00001 to +DATAC1/ORAD18/DATAFILE/system.894.1022011045

channel ORA_SBT_TAPE_1: reading from backup piece VB$_1965521110_50958684_1

channel ORA_SBT_TAPE_1: piece handle=VB$_1965521110_50958684_1 tag=BKP-LEVEL1

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:15

channel ORA_SBT_TAPE_1: starting datafile backup set restore

channel ORA_SBT_TAPE_1: specifying datafile(s) to restore from backup set

channel ORA_SBT_TAPE_1: restoring datafile 00003 to +DATAC1/ORAD18/DATAFILE/undotbs1.496.1022011015

channel ORA_SBT_TAPE_1: reading from backup piece VB$_1965521110_50958707_3

channel ORA_SBT_TAPE_1: piece handle=VB$_1965521110_50958707_3 tag=BKP-LEVEL1

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:15

channel ORA_SBT_TAPE_1: starting datafile backup set restore

channel ORA_SBT_TAPE_1: specifying datafile(s) to restore from backup set

channel ORA_SBT_TAPE_1: restoring datafile 00004 to +DATAC1/ORAD18/DATAFILE/users.353.1022011029

channel ORA_SBT_TAPE_1: reading from backup piece VB$_1965521110_50958714_4

channel ORA_SBT_TAPE_1: piece handle=VB$_1965521110_50958714_4 tag=BKP-LEVEL1

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:15

channel ORA_SBT_TAPE_1: starting datafile backup set restore

channel ORA_SBT_TAPE_1: specifying datafile(s) to restore from backup set

channel ORA_SBT_TAPE_1: restoring datafile 00002 to +DATAC1/ORAD18/DATAFILE/sysaux.921.1022010999

channel ORA_SBT_TAPE_1: reading from backup piece VB$_1965521110_50958735_2

channel ORA_SBT_TAPE_1: piece handle=VB$_1965521110_50958735_2 tag=BKP-LEVEL1

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:15

Finished restore at 2019-10-18_23-17-11




Starting recover at 2019-10-18_23-17-12

using channel ORA_DISK_1

using channel ORA_SBT_TAPE_1




starting media recovery




channel ORA_SBT_TAPE_1: starting archived log restore to default destination

channel ORA_SBT_TAPE_1: restoring archived log

archived log thread=1 sequence=3

channel ORA_SBT_TAPE_1: reading from backup piece $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_3_CTKEY_50958595_BACKUP

channel ORA_SBT_TAPE_1: piece handle=$RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_3_CTKEY_50958595_BACKUP tag=TAG20191018T230500

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:01

archived log file name=+RECOC1/ORAD18/ARCHIVELOG/2019_10_18/thread_1_seq_3.967.1022012235 thread=1 sequence=3

channel default: deleting archived log(s)

archived log file name=+RECOC1/ORAD18/ARCHIVELOG/2019_10_18/thread_1_seq_3.967.1022012235 RECID=33 STAMP=1022012236

channel ORA_SBT_TAPE_1: starting archived log restore to default destination

channel ORA_SBT_TAPE_1: restoring archived log

archived log thread=1 sequence=4

channel ORA_SBT_TAPE_1: reading from backup piece $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_4_CTKEY_50958752_BACKUP

channel ORA_SBT_TAPE_1: piece handle=$RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_4_CTKEY_50958752_BACKUP tag=TAG20191018T230523

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:02

archived log file name=+RECOC1/ORAD18/ARCHIVELOG/2019_10_18/thread_1_seq_4.967.1022012237 thread=1 sequence=4

channel default: deleting archived log(s)

archived log file name=+RECOC1/ORAD18/ARCHIVELOG/2019_10_18/thread_1_seq_4.967.1022012237 RECID=34 STAMP=1022012237

channel ORA_SBT_TAPE_1: starting archived log restore to default destination

channel ORA_SBT_TAPE_1: restoring archived log

archived log thread=1 sequence=5

channel ORA_SBT_TAPE_1: reading from backup piece $RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_5_CTKEY_50958757_BACKUP

channel ORA_SBT_TAPE_1: piece handle=$RSCN_1129803_RTIM_1022011106_THRD_1_SEQ_5_CTKEY_50958757_BACKUP tag=TAG20191018T231012

channel ORA_SBT_TAPE_1: restored backup piece 1

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:01

archived log file name=+RECOC1/ORAD18/ARCHIVELOG/2019_10_18/thread_1_seq_5.967.1022012239 thread=1 sequence=5

channel default: deleting archived log(s)

archived log file name=+RECOC1/ORAD18/ARCHIVELOG/2019_10_18/thread_1_seq_5.967.1022012239 RECID=35 STAMP=1022012240

media recovery complete, elapsed time: 00:00:01

Finished recover at 2019-10-18_23-17-21




RMAN> alter database open resetlogs;




Statement processed

new incarnation of database registered in recovery catalog

starting full resync of recovery catalog

full resync complete




RMAN> exit







Recovery Manager complete.

[oracle@exac1vm01-ORAD18]$

 

Some details in this log:
  • The backup of archivelog sequence 5 was generated by ZDLRA. Compare with the previous output: it did not exist when I listed the archivelogs before the crash.
  • The archivelog “Next Time” was “23:09:44”, the same moment the instance crashed.
  • I used the SCN to pick up exactly the last protected transaction (see the catalog sketch below).
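The SCN came straight from the “Next SCN” of the last archivelog backup listed before the restore. As a sketch, you could also fetch it from the recovery catalog (assuming the standard RC_BACKUP_REDOLOG catalog view; exact filter columns may vary by catalog version):

SQL> select max(next_change#) from rc_backup_redolog where db_name = 'ORAD18';

In this case it would return 1135762, the value used in the SET UNTIL SCN above.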
Checking data

 

[oracle@exac1vm01-ORAD18]$ sqlplus / as sysdba




SQL*Plus: Release 18.0.0.0.0 - Production on Fri Oct 18 23:18:37 2019

Version 18.3.0.0.0




Copyright (c) 1982, 2018, Oracle.  All rights reserved.







Connected to:

Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.3.0.0.0




SQL>

SQL>

SQL> select count(*) from testIns group by c3;




  COUNT(*)

----------

       646

      1016




SQL>

 

Look above: I have exactly the 646 and 1016 rows that I was inserting. Everything was protected and I had zero RPO.

 

Even in ASYNC

Besides the SYNC mode, I also ran the same test in ASYNC mode, and you can see that even in ASYNC mode we reached zero RPO (with load running during the test). You can see in the txt file that all the transactions were protected.

 

ZDLRA and ZERO RPO

As you saw in the examples above, you can reach zero RPO with ZDLRA. As already said in my previous post (and in the Oracle docs), real-time redo is based on the DG remote redo transport.
But with ZDLRA you can have zero RPO even for databases without DG configured. You saw that ZDLRA protected the database even in a catastrophic failure (losing all data and redo). Everything was protected.
Think about how this failure would play out in a traditional backup environment. Protected until the last archivelog backup? Until the last sync of the storage that holds the archivelogs/redo? How much data can your company afford to lose?
Even though the docs hint at using ASYNC to avoid overloading the primary database, SYNC with ZDLRA is fine to use. I understand that they don’t want to overload the primary database and make it wait for ZDLRA disk writes. The usage of SYNC with NOAFFIRM was proven to work. I don’t think SYNC/AFFIRM can improve the protection much further, because ZDLRA is not a full DG remote destination (there are no standby redo logs), so the disk write can’t be synchronous anyway. The impact of SYNC/NOAFFIRM is almost zero since it involves only a memory-to-memory copy (redo log buffers from the source to the remote destination).
Besides that, think about the kind of environment where you usually have ZDLRA: critical databases, right? And even the name of the product says ZERO DATA LOSS. So SYNC is the only way to truly guarantee it.
Now it is easy to understand why real-time redo is important. You can protect your databases (from the small ones to the critical ones) and reach zero RPO. And if you go further and follow the MAA best practices, you can reach zero RPO and zero RTO.

 

 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

 

 


ZDLRA, Real-Time Redo
Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ZDLRA, Real-Time Redo

Real-time redo transport is the feature that allows you to reduce the RPO (Recovery Point Objective) of your database to zero. Check below how to configure real-time redo: the steps, parameters, and other details that need to be modified to enable it.
The idea behind real-time redo transport is simple: basically, ZDLRA is a remote destination for the redo log buffers/archivelogs of your database. It is really, really similar to what occurs in Data Guard configurations (but here you don’t need to copy all the datafiles, for example). It is not exactly the same either, because ZDLRA can detect if the database stops/crashes and will generate the archivelog (on the ZDLRA side) with all the received redo, and this can be used to restore with zero/sub-second data loss.
Using real-time redo is the only way to reach zero RPO. With other ZDLRA features, such as incremental backups, you can get a better backup window (but just that). Only with real-time redo do you reach zero RPO, and this directly impacts how you configure for MAA compliance. There are a lot of options and levels of protection for MAA that you can check at “Maximum Availability Architecture (MAA) – On-Premises HA Reference Architectures 2019”, “Maximum Availability Architecture Best Practices for Oracle Cloud”, “Oracle MAA Reference Architectures”, and “Maximum Availability Architecture – Best Practices for Oracle Database 19c”.
This post starts from an environment where the database is already enrolled at ZDLRA. I already wrote about how to do that; you can check my previous post. This is the first post about real-time redo; here you will see how to configure it and verify that it is working.

 

 

Pre-Check

The way I configure real-time redo for a database is a little different from what is in the official docs. I added some pre-checks and changed the order of some steps, but the result is the same. If you want to check the official steps, see the Zero Data Loss Recovery Appliance Protected Database Configuration Guide or the Zero Data Loss Recovery Appliance Administrator’s Guide.

 

Just to give some context about the environment:
  • ZDLRA database name: zdlras1
  • Database name: orcl19
  • VPC user: cat_zdlra

 

Check Configuration

 
The first step I do is verify the current configuration, mainly the preexistence of the wallet:

 

[oracle@orcloel7 ~]$ mkstore -wrl /u01/app/oracle/product/19.3.0.0/dbhome_1/dbs/ra_wallet -listCredential

Oracle Secret Store Tool Release 19.0.0.0.0 - Production

Version 19.3.0.0.0

Copyright (c) 2004, 2019, Oracle and/or its affiliates. All rights reserved.




List credential (index: connect_string username)

1: zdlras1-scan:1521/zdlras1:CAT_ZDLRA cat_zdlra

[oracle@orcloel7 ~]$

[oracle@orcloel7 ~]$ cat $ORACLE_HOME/dbs/raORCL19.ora

RA_WALLET='LOCATION=file:/u01/app/oracle/product/19.3.0.0/dbhome_1/dbs/ra_wallet CREDENTIAL_ALIAS=zdlras1-scan:1521/zdlras1:CAT_ZDLRA'

[oracle@orcloel7 ~]$

 

The most important thing here is the credential name (zdlras1-scan:1521/zdlras1:CAT_ZDLRA in this case), because it needs to match the CREDENTIAL_ALIAS in the database file that points to RA_WALLET. The config file exists per database/instance and is named ra<DBNAME>.ora (it was already configured in the previous post mentioned before).
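For reference, the credential itself would have been created with something like the sketch below (the password argument is a placeholder):

[oracle@orcloel7 ~]$ mkstore -wrl /u01/app/oracle/product/19.3.0.0/dbhome_1/dbs/ra_wallet -createCredential zdlras1-scan:1521/zdlras1:CAT_ZDLRA cat_zdlra <password>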
After that, I check that sqlnet.ora and tnsnames.ora have the wallet information and the TNS entry, respectively:

 

[oracle@orcloel7 ~]$ cat $ORACLE_HOME/network/admin/sqlnet.ora

NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)




SQLNET.WALLET_OVERRIDE = true




WALLET_LOCATION =

(

   SOURCE =

      (METHOD = FILE)

      (METHOD_DATA =

         (DIRECTORY = /u01/app/oracle/product/19.3.0.0/dbhome_1/dbs/ra_wallet)

      )

)

[oracle@orcloel7 ~]$

[oracle@orcloel7 ~]$ tnsping zdlras1




TNS Ping Utility for Linux: Version 19.0.0.0.0 - Production on 01-OCT-2019 23:30:36




Copyright (c) 1997, 2019, Oracle.  All rights reserved.




Used parameter files:

/u01/app/oracle/product/19.3.0.0/dbhome_1/network/admin/sqlnet.ora







Used TNSNAMES adapter to resolve the alias

Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = zdlras1-scan)(PORT = 1521)) (CONNECT_DATA = (SERVICE_NAME = zdlras1)))

OK (0 msec)

[oracle@orcloel7 ~]$

 

Check Database Configuration

 

After checking the configuration outside of the database, it is important to check some database requirements. The most important one is verifying that REMOTE_LOGIN_PASSWORDFILE is set to “exclusive” or “shared”:

 

[oracle@orcloel7 ~]$ sqlplus / as sysdba




SQL*Plus: Release 19.0.0.0.0 - Production on Tue Oct 1 23:32:09 2019

Version 19.3.0.0.0




Copyright (c) 1982, 2019, Oracle.  All rights reserved.







Connected to:

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.3.0.0.0




SQL>

SQL> show parameter REMOTE_LOGIN_PASSWORDFILE;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

remote_login_passwordfile            string      EXCLUSIVE

SQL>

SQL>

 

Other parameters to verify are log_archive_config and db_unique_name. They are important for checking the current configuration, for example in databases that already use DG.

Configuring Real-Time Redo

After these pre-checks, we can start the configuration. Here I set everything in the spfile so that, after configuring everything, I can bounce the database and come up with everything already in place.

REDO_TRANSPORT_USER

This is the parameter that defines the user that the database uses to send redo to another database. This user needs to exist in both databases (for DG), and because of that it needs to exist in ZDLRA. When using ZDLRA and real-time redo, it needs to be the same as the VPC user. So the same VPC user needs to exist on the database side and be set in this parameter:

 

SQL> show parameter redo_transport_user;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

redo_transport_user                  string

SQL>

SQL> alter system set redo_transport_user='CAT_ZDLRA' scope = spfile SID='*';




System altered.




SQL>

SQL> create user cat_zdlra identified by s3nhazdlra;




User created.




SQL> grant sysoper to cat_zdlra;




Grant succeeded.




SQL>

 

As you can see, the user CAT_ZDLRA was created inside the database. The user needs at least the SYSOPER grant to work properly.

 

LOG_ARCHIVE_CONFIG

This parameter works the same as in DG environments, and since ZDLRA real-time redo is a remote archive destination, we do the same. Here we set it specifying the database unique name of the ZDLRA (zdlras1 in my current environment):

 

SQL> show parameter log_archive_config;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

log_archive_config                   string

SQL>

SQL> show parameter db_unique_name;




NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

db_unique_name                       string      ORCL19

SQL>

SQL> ALTER SYSTEM SET log_archive_config='DG_CONFIG=(orcl19,zdlras1)' SCOPE=SPFILE SID='*';




System altered.




SQL>

 

If you already have DG configured, you just need to add the entry for the ZDLRA database to this parameter, something like the sketch below.
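For example, if the database already had a standby called orcl19s (an assumption just for illustration), the change would be only the extra entry in the list:

SQL> ALTER SYSTEM SET log_archive_config='DG_CONFIG=(orcl19,orcl19s,zdlras1)' SCOPE=SPFILE SID='*';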

 

LOG_ARCHIVE_DEST_n

 

SQL> ALTER SYSTEM SET log_archive_dest_2='SERVICE="zdlras1-scan:1521/zdlras1:CAT_ZDLRA" ASYNC NOAFFIRM delay=0 optional compression=disable max_failure=1 max_connections=1 reopen=10 DB_UNIQUE_NAME=zdlras1 net_timeout=8 group=1 priority=1 VALID_FOR=(ONLINE_LOGFILE,ALL_ROLES)' scope = spfile sid = '*';




System altered.




SQL>

SQL> alter system set log_archive_dest_state_2=DEFER scope = spfile sid = '*';




System altered.




SQL>

 

The key attributes are:
  • SERVICE: the connection to ZDLRA. It can be a TNS service name or an ezconnect string. If you noticed, my credential name is the same as the ezconnect string. I recommend doing it this way, so you know what the credential is for. If you use a TNS entry for SERVICE, I recommend creating the credential with that same name.
  • ASYNC NOAFFIRM: this is controversial (I will explain later), but it defines the mode in which the redo is shipped to the remote destination.
  • DB_UNIQUE_NAME: the database unique name of the recovery appliance database.
All the others are normal DG configuration attributes, and you can check them at LOG_ARCHIVE_DEST_n Parameter Attributes.
As I said before, ASYNC NOAFFIRM is controversial because it was defined as ASYNC, and it is the same in the ZDLRA docs. You can read it at “Protection of Ongoing Transactions” (or at “How Real-Time Redo Transport Works”):
“Redo data from the protected database is written asynchronously to the Recovery Appliance as it is generated. Load on production database servers is minimized because redo data is shipped directly from the memory to the Recovery Appliance without the involvement of disk I/O”
“To avoid degrading the performance of the protected database, protected databases transfer redo asynchronously to the Recovery Appliance. If a protected database is lost, zero to subsecond data loss is expected in most cases.”
The controversy here is that the appliance is ZERO DATA LOSS, and ASYNC does not guarantee zero data loss. So, is it a scam, and do we need to rename it the Non-Zero Data Loss Recovery Appliance? No, they are just reminding you that, to minimize the impact on the database, it was configured in ASYNC mode. You can configure SYNC mode, it is allowed, but remember that this can/will impact the database (a sketch is shown below).
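As a sketch, switching the destination defined above to SYNC changes only the transport attributes; everything else stays the same:

SQL> ALTER SYSTEM SET log_archive_dest_2='SERVICE="zdlras1-scan:1521/zdlras1:CAT_ZDLRA" SYNC NOAFFIRM delay=0 optional compression=disable max_failure=1 max_connections=1 reopen=10 DB_UNIQUE_NAME=zdlras1 net_timeout=8 group=1 priority=1 VALID_FOR=(ONLINE_LOGFILE,ALL_ROLES)' scope = spfile sid = '*';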
If you want to understand the impact that SYNC/ASYNC/AFFIRM/NOAFFIRM have on the database, read the “Best Practices for Synchronous Redo Transport Data Guard and Active Data Guard” doc. It is fundamental.
I will cover this in the next post about real-time redo.

 

Using Real-Time Redo

 

After the pre-checks and configuring all the parameters above, we can start to use it. If some parts are already configured (like the redo transport user and the wallet), you can configure it without restarting the database. But here I restarted the database and enabled the log_archive_dest_state_n, as sketched below.
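Enabling the destination after the restart is just flipping the state parameter that was left as DEFER before. A minimal sketch:

SQL> alter system set log_archive_dest_state_2='ENABLE' scope = both sid = '*';

System altered.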
The best way to show/check real-time redo working is through RMAN, where I can list the archivelogs and the backups that exist for each one:

 

[oracle@orcloel7 ~]$ rman target=/ catalog=cat_zdlra/s3nhazdlra@zdlras1

Recovery Manager: Release 19.0.0.0.0 - Production on Tue Oct 1 23:57:42 2019
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ORCL19 (DBID=310627084)
connected to recovery catalog database

RMAN> list copy of archivelog all;

specification does not match any archived log in the repository

RMAN> alter system archive log current;

Statement processed

RMAN> list copy of archivelog all;

List of Archived Log Copies for database with db_unique_name ORCL19
=====================================================================

Key     Thrd Seq     S Low Time
------- ---- ------- - -------------------
4329    1    47      A 02-10-2019_00:09:32
        Name: /u01/app/oracle/oradata/ORCL19/archivelog/2019_10_02/o1_mf_1_47_gs7mzgkn_.arc

RMAN> list backup of archivelog all completed after "sysdate - 5/1440";

List of Backup Sets
===================

BS Key  Size       Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ -------------------
4365    595.50K    SBT_TAPE    00:00:00     02-10-2019_00:15:34
        BP Key: 4366   Status: AVAILABLE  Compressed: YES  Tag: TAG20191002T001534
        Handle: $RSCN_1920977_RTIM_1009298641_THRD_1_SEQ_47_CTKEY_4222_BACKUP   Media:

  List of Archived Logs in backup set 4365
  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time
  ---- ------- ---------- ------------------- ---------- ---------
  1    47      2370920    02-10-2019_00:09:32 2372594    02-10-2019_00:15:10

RMAN>

 

Look at the example above: I started with no archivelog copies in the catalog. Then I switched the log and sequence 47 was generated. Immediately after, a list backup of archivelog shows that a backup of that same sequence already exists.
You can identify that this backup came from real-time redo, and not from a normal backup, by the handle: it has a specific format starting with $RSCN.
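If you want to check it from the catalog side too, you can filter on this handle prefix in the recovery catalog views (a hypothetical query; rc_backup_piece is a standard recovery catalog view, queried here as the catalog owner):

SQL> SELECT bs_key, handle
  2  FROM rc_backup_piece
  3  WHERE handle LIKE '$RSCN%';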
If I do it one more time, the same behavior occurs:

 

RMAN> list copy of archivelog all;

List of Archived Log Copies for database with db_unique_name ORCL19
=====================================================================

Key     Thrd Seq     S Low Time
------- ---- ------- - -------------------
4329    1    47      A 02-10-2019_00:09:32
        Name: /u01/app/oracle/oradata/ORCL19/archivelog/2019_10_02/o1_mf_1_47_gs7mzgkn_.arc

RMAN> alter system archive log current;

Statement processed

RMAN> list copy of archivelog all;

List of Archived Log Copies for database with db_unique_name ORCL19
=====================================================================

Key     Thrd Seq     S Low Time
------- ---- ------- - -------------------
4329    1    47      A 02-10-2019_00:09:32
        Name: /u01/app/oracle/oradata/ORCL19/archivelog/2019_10_02/o1_mf_1_47_gs7mzgkn_.arc

4406    1    48      A 02-10-2019_00:15:10
        Name: /u01/app/oracle/oradata/ORCL19/archivelog/2019_10_02/o1_mf_1_48_gs7n1ck8_.arc

RMAN> list backup of archivelog all completed after "sysdate - 5/1440";

List of Backup Sets
===================

BS Key  Size       Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ -------------------
4365    595.50K    SBT_TAPE    00:00:00     02-10-2019_00:15:34
        BP Key: 4366   Status: AVAILABLE  Compressed: YES  Tag: TAG20191002T001534
        Handle: $RSCN_1920977_RTIM_1009298641_THRD_1_SEQ_47_CTKEY_4222_BACKUP   Media:

  List of Archived Logs in backup set 4365
  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time
  ---- ------- ---------- ------------------- ---------- ---------
  1    47      2370920    02-10-2019_00:09:32 2372594    02-10-2019_00:15:10

BS Key  Size       Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ -------------------
4445    41.00K     SBT_TAPE    00:00:00     02-10-2019_00:16:32
        BP Key: 4446   Status: AVAILABLE  Compressed: YES  Tag: TAG20191002T001632
        Handle: $RSCN_1920977_RTIM_1009298641_THRD_1_SEQ_48_CTKEY_4294_BACKUP   Media:

  List of Archived Logs in backup set 4445
  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time
  ---- ------- ---------- ------------------- ---------- ---------
  1    48      2372594    02-10-2019_00:15:10 2372793    02-10-2019_00:16:11

RMAN>

 

Another point: if you try to back up these archivelogs, you will see that they were already backed up:

 

RMAN> BACKUP DEVICE TYPE SBT FILESPERSET 100 FORMAT '%U' ARCHIVELOG ALL NOT BACKED UP TAG 'BKP-ARCH';

Starting backup at 02-10-2019_00:18:30
current log archived
released channel: ORA_DISK_1
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=86 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: RA Library (ZDLRAS1) SID=93E18A9FC482277FE053010310AC3886
skipping archived logs of thread 1 from sequence 47 to 48; already backed up
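And if you want to confirm one specific sequence after the skip message, you can list it directly in the same session (sequence 48 from the example above; the handle again identifies it as real-time redo):

RMAN> list backup of archivelog sequence 48;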

 

Real-Time Redo… in real

The real-time redo transport walkthrough that I showed here covered just the configuration part. You can use it as a guide to understand what you need to configure and check, but this post does not cover every detail about it.
In the next post about real-time redo, I will explore the SYNC/ASYNC transport controversy in more detail. I will test and simulate other scenarios, like a complete crash of the environment, to verify that everything stays safe and we reach zero RPO.
But in the end, we have the full “Delta Push” for our database: the incremental-forever strategy plus real-time redo enabled.

 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

