Observer, Quorum
Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

Observer, Quorum

This article closes the series about DG and Fast-Start Failover, where I covered in more detail how a case of isolation can lead to the shutdown of your healthy, running primary database: the "ORA-16830: primary isolated from fast-start failover partners".
In the first article, I wrote about how one simple detail dramatically impacts the reliability of your MAA environment. Where you put your Observer in a DG environment (when Fast-Start Failover is in use) plays a core role in case of outages, and you can face primary isolation and shutdown. Besides that, there is no clear documentation about the pros and cons to help you define the correct place for the Observer. You can read more in my article here.
In the second article, I wrote about one new feature that helps you stay more protected and cover more scenarios for Fast-Start Failover/DG. Using multiple observers, you can remove the single point of failure and put one observer on each side of your environment (primary, standby, and a third one). You can read more in my article here.
In this last article I discuss how, even using all these features, there is no perfect solution. Another point discussed here is how (maybe) Oracle could improve that. Below I show in detail how, even with multiple observers, a healthy primary database still shuts down. Unfortunately, there is a lot of tech info and a long log thread output, but you can jump directly to the end to see the discussion about how this could be improved.

 

Fast-Start Failover and Multiple Observers

Because of the design of Fast-Start Failover, Broker, and DG, even when using multiple observers the decision (to fail over the database or not) is still based on the report of just one observer. The others are just backups of the master, and what they see does not count when the failover scenario hits the environment. Even if the primary database can receive connections from the other two observers, if it does not receive the connection from the master (and from the standby), it decides to shut down because it is "isolated".
Look at the example below, where I have three observers (one in each site and a third one in the "cloud"):

 

 

The image above translates into this Broker configuration, where the master observer is "dbobss" (which resides in the standby datacenter):

 

DGMGRL> show fast_start failover;




Fast-Start Failover: ENABLED




  Threshold:          240 seconds

  Target:             orcls

  Observers:      (*) dbobss

                      dbobsp

                      dbobst

  Lag Limit:          30 seconds (not in use)

  Shutdown Primary:   TRUE

  Auto-reinstate:     TRUE

  Observer Reconnect: 10 seconds

  Observer Override:  FALSE




Configurable Failover Conditions

  Health Conditions:

    Corrupted Controlfile          YES

    Corrupted Dictionary           YES

    Inaccessible Logfile            NO

    Stuck Archiver                  NO

    Datafile Write Errors          YES




  Oracle Error Conditions:

    (none)




DGMGRL>

 

Note that in this case I set the fast-start failover threshold to 240 seconds just to have more time to show the logs. But be aware that this parameter defines how long your system waits/freezes before proceeding with the failover in case of system isolation or loss of the primary.
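For reference, this threshold is a normal Broker configuration property; a minimal sketch of how it can be set through DGMGRL (the value below is the one from my lab, adjust it to your own tolerance):

DGMGRL> edit configuration set property FastStartFailoverThreshold = 240;
Property "faststartfailoverthreshold" updated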
Here, to simulate site isolation and the same behavior as in the first article, I shut down the network from the standby (the one that talks with the primary database), from the master observer (just the network that talks with the primary database), and, for the other observers, the network for standby communication. The image below reflects this:

 

 

After that, the Broker log in the primary starts to report a lot of information that I will discuss below. You can click here to check the full text. From this log you can see that the primary detected the loss of communication with the standby at 20:14:06.504 and that in 4 minutes (240 seconds) it would trigger the failover. The most important part is marked below:

 

2019-05-05 20:14:36.506                      LGWR: still awaiting FSFO ack after 30 seconds

2019-05-05 20:14:41.968                      DMON: Creating process FSFP

2019-05-05 20:14:44.976                      FSFP: Process started

2019-05-05 20:14:44.976                      FSFP: The ping from current master observer (No. 1) to primary database timed out.

2019-05-05 20:14:44.976                      FSFP: Preparing to switch master observer from observer 1 to observer 2

2019-05-05 20:14:44.976                      FSFP: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

2019-05-05 20:14:44.976                      DMON: FSFP successfully started

2019-05-05 20:14:44.977                      FSFP: persisting FSFO state flags=0x10040001, version=128, enver=123,target=2, lad=, obid=0x32cc2adc (852241116), threshold=240, laglim=30, obslim=30

2019-05-05 20:14:44.979                      FSFP: Data Guard Broker initiated a master observer switch since the current master observer cannot reach the primary database

2019-05-05 20:14:44.980                      FSFP: NET Alias translated: (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = dbstb_dg)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = orcls) (UR=A)))

2019-05-05 20:14:44.980                      FSFP: Net link using RFIUPI_DB_CDESC, site=2, inst=0

2019-05-05 20:14:44.981                      FSFP: Connect to member orcls using (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=dbstb_dg)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(UR=A)(SERVICE_NAME=orcls_DGB)))

2019-05-05 20:14:44.981                      FSFP: 0 x 10 seconds network retry READY

2019-05-05 20:14:51.506                      LGWR: still awaiting FSFO ack after 45 seconds

2019-05-05 20:14:55.120                      FSFP: Failed to connect to remote database orcls. Error is ORA-12543

2019-05-05 20:14:55.120                      FSFP: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

2019-05-05 20:14:55.120                      FSFP: persisting FSFO state flags=0x40001, version=128, enver=123,target=2, lad=, obid=0x32cc2adc (852241116), threshold=240, laglim=30, obslim=30

2019-05-05 20:15:06.507                      LGWR: still awaiting FSFO ack after 60 seconds

2019-05-05 20:15:10.124                      FSFP: The ping from current master observer (No. 1) to primary database timed out.

2019-05-05 20:15:10.124                      FSFP: Preparing to switch master observer from observer 1 to observer 2

2019-05-05 20:15:10.124                      FSFP: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

2019-05-05 20:15:10.124                      FSFP: persisting FSFO state flags=0x10040001, version=128, enver=123,target=2, lad=, obid=0x32cc2adc (852241116), threshold=240, laglim=30, obslim=30

2019-05-05 20:15:10.128                      FSFP: Data Guard Broker initiated a master observer switch since the current master observer cannot reach the primary database

2019-05-05 20:15:10.128                      FSFP: NET Alias translated: (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = dbstb_dg)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = orcls) (UR=A)))

2019-05-05 20:15:10.128                      FSFP: Net link using RFIUPI_DB_CDESC, site=2, inst=0

2019-05-05 20:15:10.128                      FSFP: Connect to member orcls using (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=dbstb_dg)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(UR=A)(SERVICE_NAME=orcls_DGB)))

2019-05-05 20:15:10.128                      FSFP: 0 x 10 seconds network retry READY

2019-05-05 20:15:14.979                      DMON: A Fast-Start Failover target switch is necessary because the primary cannot reach the Fast-Start Failover target standby database

2019-05-05 20:15:14.980                      DMON: A target switch was not attempted because the observer has not pinging primary recently.

2019-05-05 20:15:21.246                      FSFP: Failed to connect to remote database orcls. Error is ORA-12543

2019-05-05 20:15:21.247                      FSFP: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

2019-05-05 20:15:21.247                      FSFP: persisting FSFO state flags=0x40001, version=128, enver=123,target=2, lad=, obid=0x32cc2adc (852241116), threshold=240, laglim=30, obslim=30

 

Above you can see that the Broker in the primary detected that the master observer is down and tried to switch to another observer. Unfortunately, since it was not possible to connect to the standby database, the change could not be made. Look how the event "SET SWOB INPRG" was triggered but the switch was abandoned ("CLR SWOB INPRG") because of "Failed to connect to remote database orcls. Error is ORA-12543". If you compare this with the log from the previous article (when I changed the master observer manually), here we don't see the "SET OBID" or even "OBSERVED" events.
So, basically, because the standby database was unreachable, the primary database couldn't swap the master observer (even while receiving connections from the other observers). This behavior does not change even if you set big values for the "FastStartFailoverThreshold", "OperationTimeout", and "CommunicationTimeout" properties.

Quorum

In a DG scenario with Fast-Start Failover enabled, you can hit a shutdown of your healthy primary database because it thinks it is isolated when it loses communication with the standby and the observer. Even when you add multiple observers, this behavior does not change.
By the current design of DG, this is 100% correct: it is impossible for either side to know whether it is isolated or not. The primary, when it loses the connections to the master observer and the standby, shuts down because it can't guarantee the transactions. And the standby (if alive and connected to the master observer) fails over to become the next primary. The problem is that even using multiple observers, where you could have an odd number of votes, you still face isolation if the minimal part (standby + master observer) votes by itself (even if it is the isolated side). As discussed in the first article, where you put your observer is very important, and you need to check the pros and cons of your decision.
Going deeper: when you use Fast-Start Failover, your DG starts to operate in sync (even in Max Availability), and at the first sign of communication failure the primary database freezes and doesn't accept more transactions. No records are stored, and this is why, even with multiple observers, the primary can't switch to another one. The database itself is blocked from storing this change.
One option that could close this gap is for the Broker to start using a quorum (or voting system) to decide whether it is OK to proceed with the failover or not. In the scenario I showed before, where the primary still has connections from the other two observers, the shutdown of the healthy primary would not occur, because it has more votes/connections from observers (2) than the standby has (just one). Unfortunately, there is no perfect solution: if an outage occurs on the primary side and you lose the connections to most of the observers, the standby can't fail over because it doesn't know whether it has the votes to do so.
This could be a future, additional option for Fast-Start Failover environments. But the key and fundamental part is still there: even a quorum will not work correctly if you put all the observers on just one side. And this becomes even more critical now that you can have hybrid clouds, with one cloud for databases and another for applications.
The idea behind writing these articles was to show the problem and try to fill the gap about where to place your observer. There is no clear doc to consult for the pros and cons of each option. Unfortunately, even with the new features, a gap still exists that needs to be covered to improve reliability in some scenarios. There is no perfect solution, but it is important to know what you can do to reach MAA at the highest level.
 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for community.”

 


OBSERVERS, MORE THAN ONE
Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

OBSERVERS, MORE THAN ONE

Recently I made a post about a little issue that I hit with Oracle Data Guard. In that scenario, because of an outage in the standby datacenter, a healthy primary database shut down with the error "ORA-16830: primary isolated…". Just to remember: the database was running with Maximum Availability, Fast-Start Failover enabled, and (the most important detail) the Observer was running in the standby datacenter too.
The point of that post (which you can read here) was to show that no doc exists that provides full details about the pros and cons of where to put your observer. Whatever the place, in the primary datacenter or in the standby, there are little details to check. Even the best (ideal) scenario, with a third datacenter, can be tough to sustain.
Here I will try to show one option that can help you improve the reliability of your environment. At least, you will have more options to decide how to protect your database. Below I show some details about how to configure and use multiple observers, but if you want to see a little concern about this approach, you can go directly to the end of the post.

 

 

More than one

Basically, you can add more than one observer to protect your DG environment. It is simple to configure, it is available since 12.2, and you can have up to three of them. To configure it, you just need to (in the simplest way):
  1. Install the default Oracle Client infrastructure.
  2. Add TNS entries for both sides, primary and standby (see the sketch after this list).
  3. Open DGMGRL.
  4. Call the "START OBSERVER" command.
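For step 2, a minimal tnsnames.ora sketch on the observer host. The standby host name dbstb_dg comes from the logs in this series; dbprm_dg is a hypothetical name for the primary host:

orcl =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbprm_dg)(PORT = 1521))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = orcl)))

orcls =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbstb_dg)(PORT = 1521))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = orcls)))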
Check how easy it is:

 

[oracle@dbobss ~]$ dgmgrl sys/oracle@orcls

DGMGRL for Linux: Release 12.2.0.1.0 - Production on Sun May 5 16:30:58 2019




Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.




Welcome to DGMGRL, type "help" for information.

Connected to "orcls"

Connected as SYSDBA.

DGMGRL> start observer

[W000 05/05 16:31:40.34] FSFO target standby is orcls

[W000 05/05 16:31:42.53] Observer trace level is set to USER

[W000 05/05 16:31:42.53] Try to connect to the primary.

[W000 05/05 16:31:42.53] Try to connect to the primary orcl.

[W000 05/05 16:31:42.54] The standby orcls is ready to be a FSFO target

[W000 05/05 16:31:42.54] Reconnect interval expired, create new connection to primary database.

[W000 05/05 16:31:42.54] Try to connect to the primary.

[W000 05/05 16:31:43.68] Connection to the primary restored!

[W000 05/05 16:31:44.68] Disconnecting from database orcl.

 

When using multiple observers, you can have up to 3 observers at the same time. Only one master observer exists, and it is the one responsible for fast-start failover and for protecting the system. If you lose the master observer, the Broker/primary/standby decide which one will be the next master observer. Up to the 19c version, they do not work in quorum (or anything similar that uses a voting system to decide the role switch) to protect the DG.
The interesting part about multiple observers is that they provide another way to customize your environment. Remember that in my first post I reported the complexity (based on pros and cons) of choosing the better place to put the observer. Now, with multiple observers, you can put one in each datacenter and switch between them when you want to protect one side or the other.
Now, my example environment is two databases, three observers:

 

Note that I have one observer in each datacenter and one in an external place. And inside the Broker you can see:

 

DGMGRL> show observer




Configuration - dgconfig




  Primary:            orcl

  Target:             orcls




Observer "dbobss" - Master




  Host Name:                    dbobss

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          1 second ago




Observer "dbobsp" - Backup




  Host Name:                    dbobsp

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          0 seconds ago




Observer "dbobst" - Backup




  Host Name:                    dbobst

  Last Ping to Primary:         2 seconds ago

  Last Ping to Target:          2 seconds ago


 

In case of failure, the Broker/primary/standby decide which one will be the next master observer. This decision occurs after 30 seconds and needs to be coordinated/communicated and agreed by both primary and standby. Unfortunately, there is no way to reduce this 30-second check.
In my environment, I shut down the machine running the master observer (dbobss), and the Broker log (in the primary) shows:

 

05/05/2019 17:15:34

FSFP: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

Data Guard Broker initiated a master observer switch since the current master observer cannot reach the primary database

FSFP: FSFO SetState(st=12 "SET OBID", fl=0x0 "", ob=0x32cc2ad6, tgt=0, v=0)

Succeeded in switching master observer from observer 'dbobss' to 'dbobsp'

FSFP: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

FSFP: FSFO SetState(st=16 "UNOBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

Master observer begins pinging this instance

Fore: FSFO SetState(st=15 "OBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

 

And in the broker log for standby:

 

05/05/2019 17:15:34

drcx: FSFO SetState(st=16 "UNOBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

drcx: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

05/05/2019 17:15:37

drcx: FSFO SetState(st=15 "OBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

drcx: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

drcx: FSFO SetState(st=12 "SET OBID", fl=0x0 "", ob=0x32cc2ad6, tgt=0, v=0)

05/05/2019 17:15:39

Master observer begins pinging this instance

 

Look in the logs how both (primary and standby) agreed with the change. After the failure you see the events SET SWOB INPRG (switch observer in progress), SET OBID (set observer ID), and CLR SWOB INPRG (clear switch observer in progress), confirming that the UNOBSERVED state was detected. You can see here the output when you use the Broker trace level set to support. It is interesting to note that, inside the Broker, the faulty observer does not disappear after the failure:

 

DGMGRL> show observer




Configuration - dgconfig




  Primary:            orcl

  Target:             orcls




Observer "dbobsp" - Master




  Host Name:                    dbobsp

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          3 seconds ago




Observer "dbobss" - Backup




  Host Name:                    dbobss

  Last Ping to Primary:         256 seconds ago

  Last Ping to Target:          221 seconds ago




Observer "dbobst" - Backup




  Host Name:                    dbobst

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          5 seconds ago




DGMGRL>

 

After you reinstate your observer and it comes back, you can simply set the master observer to the desired one:

 

DGMGRL> set masterobserver to dbobss;

Sent the proposed master observer to the data guard broker configuration. Please run SHOW OBSERVER to see if master observer switch actually happens.

DGMGRL>

 

Hierarchy

When you use multiple observers, you have more control over how to protect your DG: you can have one observer in each site and choose the side that you want to protect. You can even write a script that checks the database role from the observer side and switches the master observer to protect the desired role, as in the sketch below.
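A minimal sketch of such a script, reusing the names from this lab (databases orcl/orcls, observer dbobsp in the primary datacenter); the credentials are placeholders, and this is an illustration, not a hardened implementation:

#!/bin/bash
# Hypothetical sketch: if orcl currently holds the PRIMARY role, propose the
# observer that lives in the primary datacenter (dbobsp) as master observer.
# sys/oracle is a placeholder credential from this lab.
ROLE=$(dgmgrl -silent sys/oracle@orcl "show database orcl" | awk '/Role:/ {print $2}')
if [ "$ROLE" = "PRIMARY" ]; then
  dgmgrl -silent sys/oracle@orcl "set masterobserver to dbobsp"
fi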
Remember my previous post: if you choose to protect the primary (with the observer in the same datacenter) and your entire datacenter fails, FSFO does not occur, because the standby does not decide alone. If you choose to protect the standby (with the observer in the same datacenter), a datacenter/network failure on the standby side can lead to a complete shutdown of a healthy primary database, because it becomes "isolated".
Since multiple observers still follow a hierarchical decision, the decision remains with only one observer. Even if you have multiple observers, 3 for example with one in each site, if you put the master observer in the same site as the standby and they become isolated, they still decide alone, and because of FSFO the primary shuts down, since it thinks that it is isolated. Even if it continues to receive connections from the other two observers.
Because of the current design, even if you set "FastStartFailoverThreshold" to 240, the automatic switch of the master observer does not occur, because the standby side cannot be reached to confirm the change. Maybe in the next versions (20, 21…) we will see a change in this design, with a voting/quorum method used to decide the role change for FSFO when you use multiple observers. Of course, even a quorum approach can lead to problems if you put two observers in the same datacenter, but it can mitigate problems in some cases.
In my next post I will dig more into this, with some examples and log/trace analyses. You will see some details of what happens when the standby is isolated and you use multiple observers.

 

 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”


Overview of Oracle Sharding in Oracle Database 12.2
Category: Database Author: Skant Gupta Date: 5 years ago Comments: 0

Overview of Oracle Sharding in Oracle Database 12.2

What is Sharding?

In today's growing demand for highly scalable solutions, Oracle has come up with an all-new feature called Sharding, also known as shared nothing (SN). Sharding is a type of database partitioning that separates larger databases into smaller, faster, and more easily manageable subsets known as data shards.
Oracle Sharding can be used for custom-designed OLTP applications, enabling distribution and replication of data across a pool of Oracle databases that do not have any hardware or software in common (share nothing). The first release, Oracle Database 12.2.0.1, supports scaling up to 1000 shards.
Sharding enables you to split your information between different servers, thereby allowing better scalability and high availability. In sharding, every node is referred to as a shard, each holding a specific portion of the data. Different shards can hold the same data to provide better availability and scalability options. In order to access the required data, you need to know the exact shards holding the relevant information.
Data is split amongst the shards in a round-robin fashion. Say, for instance, you are saving 20 items in shards: if you have 2 shards, each shard will comprise 10 items; similarly, if you have 5 shards, each shard will comprise 4 items, and so on.
Oracle Sharding provides superior runtime performance and is easier to manage when compared to home-grown deployments that use a similar approach.

 

Components of the Oracle Sharding Architecture

  • Shards – independent physical Oracle databases hosting a subset of the sharded database
  • Sharded database (SDB) – a single logical Oracle Database that is horizontally partitioned across a pool of physical Oracle Databases (shards) that do not share any hardware or software
  • Shard directors – network listeners that enable high performance connection routing, based on a sharding key
  • Shard catalog – An Oracle Database that supports automated shard deployment, centralized management of a sharded database, and multi-shard queries
  • Global service – database services that provide access to data in an SDB
  • Connection pools – connection pools act as shard directors by routing database requests across pooled connections at runtime
  • Management interfaces – GDSCTL (command-line utility) and Oracle Enterprise Manager (GUI)
Note: The sharding key affects the entire physical architecture, hence you need to be very careful while choosing the sharding architecture and key.

 

 

Figure 1: General architecture of Oracle Sharding with two shards and one shard catalog database.
Figure 2: Oracle Sharding architecture.

 

Sharding Methods

Sharding can be done using three different methods, as discussed below (a DDL sketch follows the list):
  • System-Managed Sharding
With system-managed sharding, data is automatically distributed across multiple shards using a partitioning algorithm (consistent hash), which is responsible for evenly distributing data across the shards. Here, the user need not specify the mapping of data to shards.
  • Composite Sharding
The composite sharding method allows you to create multiple shardspaces for different subsets of data in a table partitioned by consistent hash. A shardspace is a set of shards that store data corresponding to a range or list of key values.
  • Using Subpartitions with Sharding
Since Oracle Sharding is based on table partitioning, all of the subpartitioning methods provided by Oracle Database are also supported for sharding.
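As an illustration of system-managed sharding, a minimal sketch of the DDL for a sharded table; the table name, columns, and tablespace set (ts1) are hypothetical:

CREATE SHARDED TABLE customers
( cust_id   NUMBER NOT NULL,
  name      VARCHAR2(50),
  CONSTRAINT cust_pk PRIMARY KEY (cust_id)
)
PARTITION BY CONSISTENT HASH (cust_id)
PARTITIONS AUTO
TABLESPACE SET ts1;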
 
 

Benefits of Sharding

  • Fault Containment. The SN infrastructure eliminates single points of failure and hence provides strong fault isolation. Failure or slow-down of one shard does not affect the performance or availability of the other shards.
  • Linear Scalability. Shards are highly scalable, making it convenient to add new data into the shards without impacting the previously available ones. Sharding eliminates performance bottlenecks, which makes it possible to linearly scale performance and capacity by adding shards.
  • Rolling Upgrades. Applying configuration changes on one shard at a time does not affect the other shards, and allows administrators to first test the changes on a small subset of data.
  • Geographical Distribution of Data. Sharding makes it possible to store particular data close to its consumers and to satisfy regulatory requirements when data must be located in a particular region.
  • Simplicity of Cloud Deployment. Sharding is well suited to deployment in the cloud. Shards may be sized as required to accommodate whatever cloud infrastructure is available and still achieve the required service levels. Oracle Sharding supports on-premises, cloud, and hybrid deployment models.
  • Reduced Downtime. Maintenance can be performed on specific shards, allowing only a local downtime.
 

Oracle Sharding supports the following features

  • Relational schemas
  • SQL and other programmatic interfaces
  • Online schema changes
  • Compression
  • Multi-core scalability
  • High Availability
  • Advanced security
  • Read consistency and ACID properties (very rich feature when compared to other databases)
  • Database partitioning
  • Enterprise-scale backup and recovery
  • Developer agility with JSON
 

Observer, Where?
Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

Observer, Where?

Some months ago I got one error with Oracle Data Guard, and now I have had time to review it again and write this article. Just to be clear from the beginning: the discussion here is not about the error itself, but about the circumstances that generated it.
The environment described here follows the most common best practices for DG by Oracle: one dedicated server for each role (primary database, physical standby database, and Observer); primary and standby residing in different datacenters in different cities; a dedicated network for the interconnect between sites; protection mode Maximum Availability; and fast-start failover enabled (with a 30-second threshold). The version here is 12.2, but it would be the same for 19c. So, nothing so bad in the environment: a basic DG configuration trying to follow the best practices, enabled roughly as sketched below.
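A minimal sketch of how such a configuration is typically enabled through DGMGRL (the threshold value follows the environment described above):

DGMGRL> edit configuration set property FastStartFailoverThreshold = 30;
DGMGRL> enable fast_start failover;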
But, one day: application servers running, primary Linux DB server running, but the database itself down. Looking for the cause, I found this in the broker log:

 

 

03/10/2019 12:48:05

LGWR: FSFO SetState(st=2 "UNSYNC", fl=0x2 "WAIT", ob=0x0, tgt=0, v=0)

LGWR: FSFO SetState("UNSYNC", 0x2) operation requires an ack

        Primary database will shutdown within 30 seconds if permission

         is not granted from Observer or FSFO target standby to proceed

LGWR:   current in-memory FSFO state flags=0x40001, version=46

03/10/2019 12:48:19

A Fast-Start Failover target switch is necessary because the primary cannot reach the Fast-Start Failover target standby database

A target switch was not attempted because the observer has not pinging primary recently.

FSFP network call timeout. Killing process FSFP.

03/10/2019 12:48:36

Notifying Oracle Clusterware to disable services and monitoring because primary will be shutdown

Primary has heard from neither observer nor target standby

        within FastStartFailoverThreshold seconds. It is

        likely an automatic failover has already occurred.

        The primary is shutting down.

LGWR: FSFO SetState(st=8 "FO PENDING", fl=0x0 "", ob=0x0, tgt=0, v=0)

LGWR: Shutdown primary instance 1 now because the primary has been isolated




And in the alertlog:

2019-03-10T12:48:35.860142+01:00

Starting background process FSFP

2019-03-10T12:48:35.870802+01:00

FSFP started with pid=30, OS id=6055

2019-03-10T12:48:36.875828+01:00

Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.

It is likely an automatic failover has already occurred. Primary is shutting down.

2019-03-10T12:48:36.876434+01:00

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_lgwr_2472.trc:

ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down

LGWR (ospid: 2472): terminating the instance due to error 16830

 

But this is (and was) not a DG problem; DG did what it was designed to do. The primary lost communication with the standby and the Observer and, after the fast-start failover threshold, FSFP killed the primary because it didn't know whether it had been evicted and wanted to avoid a split brain (or something similar). It worked as designed!
As I wrote before, the main question here is not the ORA-16830 error, but the circumstances. In this case, by design definition (and probably based on the docs), it was chosen to put the observer in the same site as the standby. But, because of one failure in the standby datacenter (just in the enclosure that runs the blades for the Oracle stuff; the applications continued to run), the entire database became unavailable. One outage in the standby datacenter shut down the primary database, even with DG running in Maximum Availability mode.

 

 

As you can imagine, because of the design decision of "where to put the observer", everything was down. If the observer had been running in the primary datacenter, nothing would have happened. But here is the point of this post: "Where do you put the Observer?", "Primary site? Standby site?", "Why?", "How do you base this decision?". It appears to be a simple question to answer, but there are a lot of pros and cons, and there is not much information about that.
If you search the Oracle docs, they recommend putting the observer in a third datacenter isolated from the others, or in the same network as the application, or in the standby datacenter. Here: https://www.oracle.com/technetwork/database/availability/maa-roletransitionbp-2621582.pdf (page 9). But where are the pros and cons for this decision? And how many clients have a third datacenter? And if you search on Google about where to put the observer, 99% spread the same information (which goes in the opposite direction of the docs): "primary site". But again, "Why?".

 

 

Observer in the primary site:
  • Pros: Protects against most failures (primary DB crash, DB logical crash, as examples); low impact of network issues on the observer (usually the same LAN as the primary).
  • Cons: Does not protect (does not switch) in case of a whole primary datacenter failure.
  • When to use: When you want to avoid "false positive" switches in case of network problems against the standby datacenter, or you do "not trust" the standby datacenter. Or because all transactions are important (I will explain later).

Observer in the standby site:

  • Pros: Protects against a whole primary datacenter failure.
  • Cons: Heavy dependency on the network between sites (you even need multiple paths); can suffer "false positive" switches, since the standby site decides even if the primary is running correctly (similar to what is related here). When the database is more important than the transactions, maybe Maximum Availability is not suitable (explained later).
  • When to use: When you want to protect the database, and when your system and network infrastructure between datacenters are reliable.
Observer in a third datacenter:
  • Pros: Covers all failure scenarios.
  • Cons: Very heavily dependent on a good network to avoid "false positives", or you will suffer from fast-start failover becoming disabled. It will be more expensive.
  • When to use: When you want to be protected in most of the possible scenarios.
Below is just one idea of a good design for high availability for databases and applications. It comes directly from the Oracle docs about "Recovering from Unscheduled Outages": https://docs.oracle.com/database/121/HABPT/E40019-02.pdf

 

 

 
Above I talked about transactions and Maximum Availability, which are deeply related here. Remember that the primary database shut down only after the fast-start failover threshold? This means that the primary database received transactions during 30 seconds before shutting down. If you put the observer in the standby site, you can suffer data loss (of course, failover, by design, means that) because the standby side decides everything.
 
If you do that, maybe you need to operate in Maximum Protection to have zero data loss. This is clearer, or more critical, if your database receives connections from applications that are not in the same datacenter and can connect to both at the same time. In Maximum Protection you avoid the primary committing data in moments of possible failure, but you put some overhead on every transaction by operating in sync mode (https://www.oracle.com/technetwork/database/availability/sync-2437177.pdf). So, you need to decide whether having the database running is more important than the transactions.
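For reference, the protection mode is also changed through the Broker; a minimal sketch (12.2 syntax):

DGMGRL> edit configuration set protection mode as MaxProtection;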
 
Of course, every strategy has pros and cons. The observer in the primary is easier, and may allow you to use a less strict protection mode and still have high transaction protection, but it does not handle a full datacenter failure. The observer in the standby can protect against a datacenter failure, but you will possibly need to handle more "false positive" switches of the primary DB, or even a complete primary shutdown (as related here) in the case of a standby datacenter failure. Add to this the fact that maybe your application needs to allow some data loss or, if not, you may need to operate in Maximum Protection. The observer in a third site can be the best option for data protection, but it will be more expensive and heavily network- and operationally dependent.
 
As you can see, there is no easy design. Even a simple choice, like the observer location, will leave plenty of decisions to be made. With Oracle 12.2 and beyond (including 19c) you have better options to handle it; I will go over this in the next post about multiple observers, but there is no (yet) 100% solution to cover all possible scenarios.
Martin Bracher, Dr. Martin Wunderli, and Torsten Rosenwald covered this well in the past. You can check the presentations here:
 
https://www.doag.org/formes/pubfiles/303232/2008-K-IT-Bracher-Dataguard_Observer_ohne_Rechenzentrum.pdf
https://www.trivadis.com/sites/default/files/downloads/fsfo_understood_decus07.pdf
https://www.doag.org/formes/pubfiles/218046/FSFO.pdf
I recommend reading the docs too:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/dgbkr/data-guard-broker.pdf
https://www.oracle.com/technetwork/database/availability/maa-roletransitionbp-2621582.pdf
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sbydb/data-guard-concepts-and-administration.pdf
https://docs.oracle.com/database/121/HABPT/E40019-02.pdf
 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”


Webinar – Oracle Golden Gate Microservices Overview (With Demo)
Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 5 years ago Comments: 0

Webinar - Oracle Golden Gate Microservices Overview (With Demo)

Hi guys !!

Today we had our Webinar and we are posting the webinar here for you.

 

 

Follow the link in the PDF presentation: Click here

 

See you next time.

André Ontalba – Board Member


Webinar – Flashback in PDB
Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 5 years ago Comments: 0

Webinar - Flashback PDB in Oracle Database

Hi guys !!

Today we had our Webinar and we are posting the webinar here for you.

 
 

Follow the link in the PDF presentation: Click here

 

See you next time.

André Ontalba – Board Member


Configuring Oracle Transparent Data Encryption (TDE) on Oracle 12c multitenant architecture
Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 5 years ago Comments: 0

Configuring Oracle Transparent Data Encryption (TDE)
on Oracle 12c multitenant architecture

To configure TDE on Oracle 12c multitenant architecture, we need to execute some steps in order to be able to create encrypted tablespaces on Oracle, for example.

 

1) Adjust the sqlnet.ora file to refer to your wallet path

 

[oracle@db1 admin]$ cat sqlnet.ora 
ENCRYPTION_WALLET_LOCATION=
(SOURCE=
(METHOD=FILE)
(METHOD_DATA=
(DIRECTORY=/u01/app/oracle/CDB1)
)
)
2) Create the folder on the OS side

 

[oracle@db1 admin]$ mkdir -p /u01/app/oracle/CDB1
3) Log into the database using the SYSKM role, then create the keystore and the auto-login keystore

 

[oracle@db1 admin]$ sqlplus / as syskm




SQL*Plus: Release 12.1.0.2.0 Production on Tue Jul 30 13:42:13 2019

Copyright (c) 1982, 2014, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> administer key management create keystore '/u01/app/oracle/CDB1' identified by oracle;

keystore altered.

SQL>
SQL>
SQL> ADMINISTER KEY MANAGEMENT CREATE AUTO_LOGIN KEYSTORE FROM KEYSTORE '/u01/app/oracle/CDB1' IDENTIFIED BY oracle;

keystore altered.

SQL> select * from v$encryption_wallet;

WRL_TYPE WRL_PARAMETER STATUS WALLET_TYPE WALLET_OR FULLY_BAC CON_ID
------------ --------------------------------- ---------------------- -------------- ----------------------- -----------
FILE /u01/app/oracle/CDB1/ OPEN_NO_MASTER_KEY AUTOLOGIN SINGLE UNDEFINED 0

4) Bounce the database to see if the AUTO_LOGIN keystore is working correctly

 

SQL> connect / as sysdba
Connected.
SQL> shut immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.




Total System Global Area 1660944384 bytes
Fixed Size 2925072 bytes
Variable Size 1040190960 bytes
Database Buffers 603979776 bytes
Redo Buffers 13848576 bytes
Database mounted.
Database opened.

SQL> select * from v$encryption_wallet;

WRL_TYPE WRL_PARAMETER STATUS WALLET_TYPE WALLET_OR FULLY_BAC CON_ID
------------ --------------------------------- ---------------------- -------------- ----------------------- -----------
FILE /u01/app/oracle/CDB1/ OPEN_NO_MASTER_KEY AUTOLOGIN SINGLE UNDEFINED 0

Obs.: The first time, you need to set the master key to avoid this status and errors with the key not opening properly.

 

SQL> alter pluggable database all open;

Pluggable database altered.

SQL> show pdbs

CON_ID CON_NAME OPEN MODE RESTRICTED
---------- ------------------------------ ---------- ----------
2 PDB$SEED READ ONLY NO 
3 PDB2 READ WRITE NO 
4 PD3 READ WRITE NO


5) Let's set up the master key on the keystore

 

SQL> conn / as sysdba
Connected.
SQL> grant dba,syskm to C##dba identified by oracle container=ALL;




Grant succeeded.

SQL> administer key management set keystore close container=all;

keystore altered.

SQL> administer key management set keystore open identified by oracle container=all;

keystore altered.

 

6) On the pluggable database, connected as the common user C##DBA, let's set the key for all PDBs.
 


SQL> conn c##dba/oracle@db1:1521/PD3 as syskm;
Connected.
SQL>
SQL> conn / as sysdba
Connected.
SQL> administer key management set key identified by oracle with backup using 'OMK' container=ALL;

keystore altered.

SQL> select * from v$encryption_wallet;

WRL_TYPE WRL_PARAMETER STATUS WALLET_TYPE WALLET_OR FULLY_BAC CON_ID
------------ --------------------------------- ---------------------- -------------- ----------------------- -----------
FILE /u01/app/oracle/CDB1/ OPEN PASSWORD SINGLE NO 0

As you can see above, now the status is OPEN (by password).
7) Connected on PD3 as the common user C##DBA, now we can create our encrypted tablespace

 

SQL> conn c##dba/oracle@db1:1521/PD3; 
Connected.




SQL> create tablespace OCM_CRYP
2 datafile '/u01/app/oracle/oradata/CDB1/8DE664DD797442E9E055000000000001/datafile/OCM.DBF'
3 size 10m
4 encryption using 'AES256'
5 default storage(ENCRYPT);

Tablespace created.

 

So, this is the new way in 12c to manage keys for encryption in the database.
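As a quick verification, a minimal sketch querying DBA_TABLESPACES (which exposes an encryption flag) while connected to the PDB; the output shape assumes default SQL*Plus formatting:

SQL> select tablespace_name, encrypted from dba_tablespaces where tablespace_name = 'OCM_CRYP';

TABLESPACE_NAME                ENC
------------------------------ ---
OCM_CRYP                       YES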
 
I hope this helps you!!!
 
This article was written by Andre Ontalba and Rodrigo Mufalani
 
Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

 

