Non-working logon trigger on Data Guard standby database

Today I did something very dumb when working with a vendor to trace and issue and rendered our Data Guard physical standby database unusable. I added an AFTER LOGON ON DATABASE trigger which collected information about a session and wrote it to a table. You can see the problem already: standby databases are not writable and so this is not going to work.

Very soon users connecting to the standby were complaining:

ORA-00604: error occurred at recursive SQL level 1
ORA-16000: database open for read-only access
ORA-06512: at line 2
ORA-02063: preceding 3 lines from SERVER_RO
00604. 00000 -"error occurred at recursive SQL level %s"
*Cause:An error occurred while processing a recursive SQL statement
(a statement applying to internal dictionary tables).
*Action: If the situation described in the next error on the stack
can be corrected, do so; otherwise contact Oracle Support.

Realising my mistake I dropped the trigger but this didn’t make the problem go away on the standby.

A quick look in the alert log on the primary and there was a logon failure related to Data Guard:

Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
------------------------------------------------------------
PING[ARCl]: Heartbeat failed to connect to standby 'SERVER2'. Error is 16191.

The secondary’s alert log didn’t have anything interesting in it.

I checked the SYS passwords on both nodes in case an underlying problem had suddenly occurred but all was ok.

MOS to the rescue! I checked REDO_TRANSPORT_USER (we don’t use it) and SEC_CASE_SENSITIVE_LOGON (we do use it but it was working ok before so I didn’t want to change it).

In this situation tracing is your friend and MOS note 1368170.1, Troubleshooting ORA-16191 and ORA-1017/ORA-1031 in Data Guard Log Transport Services or Data Guard Broker, recommends tracing the error on the primary.

alter system set events '16191 trace name errorstack level 10';

This generated a trace file, /ora/diag/rdbms/server1/SERVER1/trace/SERVER1_arc1_13924.trc, containing:

*** 2015-08-26 12:26:56.914 939 krsu.c
Error 16191 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'SERVER2'
Error 16191 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'SERVER2'
*** 2015-08-26 12:26:56.914 2917 krsi.c
krsi_dst_fail: dest:2 err:16191 force:0 blast:1
kcrrwkx: unknown error:16191

*** 2015-08-26 12:27:30.677
OCISessionBegin failed -1
.. Detailed OCI error val is 604 and errmsg is 'ORA-00604: error occurred at recursive SQL level 1
ORA-16000: database open for read-only access
ORA-06512: at line 2
'
OCISessionBegin failed. Error -1
.. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
'

*** 2015-08-26 12:27:31.737
OCISessionBegin failed. Error -1
.. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
'
OCISessionBegin failed. Error -1
.. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
'
*** 2015-08-26 12:27:31.767 4338 krsh.c
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
------------------------------------------------------------

Enough tracing so trace off:

alter system set events '16191 trace name context off';

Looking at the output we have a big clue here, as it looks like the error the users were getting:

ORA-16000: database open for read-only access

Why are we still getting that when I’ve dropped the trigger on the primary database? It is immediately apparent that I have broken Data Guard and my drop is in a log file awaiting shipping to the standby. This is a catch 22: I can’t apply the log because the trigger won’t let Data Guard authenticate and Data Guard can’t authenticate because of the presence of the trigger.

Two questions sprang to my mind:

  1. Can I disable the trigger?
  2. Why is the trigger affecting the SYS user?

I can’t workaround the trigger in any documented way because any attempt to do so on the physical standby is prevented by its read-only status, for example:

SQL> grant ADMINISTER DATABASE TRIGGER to appuser;
grant ADMINISTER DATABASE TRIGGER to appuser
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-16000: database open for read-only access

Here user SYS is connecting with a password not “as sysdba” and so we are hitting the logon trigger.

Luckily help is at hand with a system parameter I can change:

alter system set "_system_trig_enabled"=FALSE;

My fellow York resident, Pete Finnigan, has more detail on this here.

I can then restart the recovery with:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

And of course not forgetting to re-enable system triggers:

alter system set "_system_trig_enabled"=TRUE;

It should be pointed out that this method is nice because it doesn’t involve restarting the standby and therefore interrupting any running reports but if you don’t like undocumented parameters or you have other system triggers you depend on you can instead:

  • Shut down the standby.
  • Restart the standby in MOUNT mode. This means that any triggers will not fire.
  • Start the recovery.
  • If the trigger has been dropped on the primary it should replicate to the secondary.
  • Open the database.

Monitoring real-time apply progress

In many cases monitoring Data Guard, at least in Maximum Performance mode, is down to checking transport and apply lag to make sure standby database is not “too far” behind primary, and quite often it is sufficient in a day to day operation. By checking v$archived_log we can easily get information about the last archived and applied log sequence number:

SQL> select sequence#,ARCHIVED,APPLIED from v$archived_log order by sequence#;
SEQUENCE# ARCHIVED  APPLIED
---------- --------- ---------
… output truncated …
       208 YES       YES
       209 YES       YES
       210 YES       YES
       211 YES       IN-MEMORY

If we use real-time apply changes are applied to the standby database as soon as they are written to the standby redo logs, which should keep standby database as up to date as possibly without going to Maximum Availability or Maximum Protection.

However, if our primary database fails, how can we check what is the last change transmitted from the primary, or that all changes transmitted were actually applied before we activate standby?
Of course we can trust that Oracle will apply all the transmitted redo, when standby is activated, but without knowing what was the last transmitted change we have no way of verifying it. It is also possible that the failover decision depends on how far behind the primary the standby was at the time of failure or how much data can potentially be lost.

One way of finding the information we need is by checking v$standby_log and v$recovery_progress views.
v$standby_log displays information about standby redo logs. Columns LAST_CHANGE# and LAST_TIME can be used to find last changes transmitted from primary.

SQL> select GROUP#,THREAD#,SEQUENCE#,STATUS,LAST_CHANGE#,LAST_TIME from v$standby_log;

    GROUP#    THREAD#  SEQUENCE# STATUS           LAST_CHANGE# LAST_TIME
---------- ---------- ---------- ---------- ------------------ -------------------
         4          1        213 ACTIVE                1699715 2013-04-18 20:26:15
         5          1          0 UNASSIGNED
         6          1          0 UNASSIGNED
         7          1          0 UNASSIGNED

SQL>

v$recovery_progress can be used to monitoring database recovery operations, and gives us access to information like: Last Applied Redo, Active Apply Rate, Average Apply Rate, Apply Time per Log and a few more. The item we are interested in is “Last Applied Redo” and the value of the “TIMESTAMP” column. The value should be very close to or match the value of LAST_TIME column from v$standby_log view.

SQL> select START_TIME,TYPE, ITEM,UNITS,SOFAR,TIMESTAMP 
  2  from v$recovery_progress where ITEM='Last Applied Redo';
START_TIME          TYPE             ITEM                UNITS     TIMESTAMP
------------------- ---------------- ------------------- --------- -------------------
2013-04-17 22:46:26 Media Recovery   Last Applied Redo   SCN+Time  2013-04-18 20:26:16

SQL>