Protect single instance with GI (without RAC-One-Node)

This is my first post since I meet Martin Nash in London 2 years ago on a Storage seminar with James Morle. He invited me on ORAganism blog and since then I wanted to start blogging but didn’t take the time to do so.
I spent one year working as a Consultant for a company called Digora where I have had great time and learnt a lot with all the different consultants, Laurent Leturget made me felt guilty not to blog ( laurent-leturgez.com ) !

Anyway, I discovered lately that it is now really simple to protect a single instance with “Cold failover” technique  through the Grid Infrastructure. Before discovering this technique, it was not this easy. There was an article on Oracle blogs talking about an equivalent technique :
https://blogs.oracle.com/xpsoluxdb/entry/clusterware_11gr2_setting_up_an_activepassive_failover_configuration

Compared to this post, the new technique can’t be much easier, there are 2 commands :

- Check the current properties of your database - dbtest here
 crsctl status resource ora.dbtest.db -p
- Change the placement to FAVORED for your database
 crsctl modify resource ora.dbtest.db -attr "PLACEMENT=favored"
- Change the list of HOSTs that can receive the instance of your database (here host1 or host2)
 crsctl modify resource ora.dbtest.db -attr "HOSTING_MEMBERS=host1 host2"
- Check back the new properties of your database - dbtest here
 crsctl status resource ora.dbtest.db -p

If you want your client connection to be handle the failover automatically and the SCAN listener to be aware of the host change, you need to modify your remote_listener parameter so that it points to the SCAN address of our GI setup

sqlplus / as sysdba 
alter system set remote_listener="host-scan:1521" scope=both;

That’s it !!

You now have a single-instance database that is protected again hardware failure. So next question is why using this technique when you can use the RAC-One Node and what are the differences ?

Reason – Cost
This is no additional cost and no extra licence compared to the RAC-One Node solution and moreover, you can use this technique with a Standard Edition database.

– Difference 1 – No RAC feature

Compared to the RAC-One Node solution, the database cannot failover smoothly to the other host (online failover). When you use RAC-One Node, if you choose in advance that you want to failover to another host (host2), you can use the following command :

srvctl relocate database -d dbtest -n host2 -v

When you launch this command, while they are some active transactions on the first host, the instance will be up and one second instance will start on the future host, so you are in the case of RAC features since both instances are up at the time, that’s why we have 2 redolog threads and 2 undo on RAC-One Node configuration.

If you want to learn more on RAC One Node, Marcin Przepiorowski has a serie on the subject (http://oracleprof.blogspot.fr/2009/12/oracle-rac-one-node-part-1.html)

If you try to use the relocate command with “poor’s man” failover solution, you’ll get this error :
   PRCD-1027 : Failed to retrieve database orcltest
   PRCD-1146 : Database orcltest is not a RAC One Node database

– Difference 2 – No native support in Cloud Control 12c

If you want to add this database to your beloved Cloud Control, you will need to add a wee extra script in case of failover so that CC12c can know it has to connect to the new host, here is an example below. If needed I wrote a script which handle this automatically (ask for it).

– Step 1 : create blackout (better but not compulsary)

emcli create_blackout -name="relocating active passive targets" -add_targets=${DB_NAME}_${INSTANCE_NAME}:oracle_database -schedule="frequency:once;duration:0:30" -reason="DB relocate"

– Step 2 – relocate the target from host1 to host2

emcli relocate_targets -src_agent=${SERVER_SOURCE}:3872 -dest_agent=${SERVER_DEST}:3872 -target_name=${DB_NAME}_${INSTANCE_NAME} -target_type=oracle_database -copy_from_src -force=yes -changed_param=MachineName:${VIP_NAME}

– Step 3 : stop blackout

emcli stop_blackout -name="relocating active passive targets"

– Difference 3 – No “Omotion like” relocation

At the beginning, RAC-One Node was named “Omotion“, it could be used almost as “Vmotion” from Vsphere. In the configuration using the Grid Infrastructure, you can only failover if you shutdown the host or shutdown the instance down on host1 and start it again on the second host :

-- From Host 1
srvctl stop database -d dbtest
-- From Host 2
srvctl start database -d dbtest -n host2

You are now ready to test this out !! This is amazing how Oracle handle this configuration, even if you delete the spfile from host1, the GI will recreate it automatically.

Hope this helps. I did not find any blog article talking about this type of configuration so this is my first contribution to the Oracle community from which I learned so much.

 

 

 

 

Deterministic Service Failover

Until some testing about 6 months ago I was under the impression that if a database service was defined as being “preferred” on 1 instance of a RAC database and “available” on N other instances (where N is greater than 1); and the preferred instance is killed then the instance that the service moves to is not deterministic, i.e., Oracle picks one of the available instances to start the service on in a unpredictable way. I cannot work out what gave me this impression, so will just have to put it down to an assumption I made at some point in the past and never validated. From conversations with other DBAs I was not alone.

Anyway, I now believe that the failover it deterministic as demonstrated below.

Creating some test services

$ srvctl add service -d DB_TST -s DB_TST_SVC1 -r "DB_TST1" -a "DB_TST2,DB_TST3,DB_TST4"
$ srvctl add service -d DB_TST -s DB_TST_SVC2 -r "DB_TST1" -a "DB_TST3,DB_TST4,DB_TST2"
$ srvctl add service -d DB_TST -s DB_TST_SVC3 -r "DB_TST1" -a "DB_TST4,DB_TST2,DB_TST3"

Verifying the state of the services

$ srvctl status service -d DB_TST
Service DB_TST_SVC1 is running on instance(s) DB_TST1
Service DB_TST_SVC2 is running on instance(s) DB_TST1
Service DB_TST_SVC3 is running on instance(s) DB_TST1

Killing instance 1

At this point I killed instance 1 using “shutdown abort” in SQL*Plus.

Checking the status of the services

$ srvctl status service -d DB_TST
Service DB_TST_SVC1 is running on instance(s) DB_TST2
Service DB_TST_SVC2 is running on instance(s) DB_TST3
Service DB_TST_SVC3 is running on instance(s) DB_TST4

As you can see the service failed over to the instance based on the order the instances are defined in the “available” list. I did a few more tests to convince myself that this wasn’t just a one off, but won’t fill this post with the output.

If you’re reading this and thinking, “Yes, that was obvious.” Sorry!

During the testing I ran into something else that appears very trivial, but worth being aware of… When is shutdown abort not an immediate abort.

My Fight with FCF

I’m at the end of a what feels like a long battle to get Fast Connection Failover (FCF) working with Universal Connection Pool (UCP). It should have been easy, or so I thought…

The most notable reference I’m aware of for setting up FCF with UCP outside of the Oracle documentation is this blog post by Martin Bach and its take 2. I have to thank Martin for publishing his work as without it I wouldn’t have been in a position to carry out my own testing. However, as I’ve discussed directly with him, I don’t think Martin’s implementation is complete for a couple of reasons that I’ll cover below.

Getting Started

I followed Martin’s blog posts…

When I’d got as far as the end of the first post I had a web page looking like this:

Hitting reload would show the instance name in “Hello World from …” changing every so often as the connection returned from the pool happened to switch instance.

I then moved on to following the second of Martin’s posts and ended up with this:

So far, so good.

Breaking It

Next I tinkered with the configuration a bit and changed the following from the configuration provided by Martin:

  • minPoolSize
  • maxPoolSize
  • inactiveConnectionTimeout

I’d been listening when members of the Oracle Real World Performance Group said that setting minPoolSize and maxPoolSize to different values was not a great idea and so I set them both to the same number in order to keep the pool size as steady as possible. I set inactiveConnectionTimeout to 0 (zero) as closing inactive connections and then creating new connections would be an unnecessary overhead on the database.

After making the changes I started playing around with stopping and starting services on individual instances…

Nothing happened! My connections did not failover! Fast Connection Failover was broken!

I changed the servlet to include some more details about the pool as shown below:

The most important thing to note in the screenshot is that “ONS Configuration” is not populated with a value. This could be as a result of my code not doing what I wanted or simply because getONSConfiguration was returning null. At this point you’ll have to trust me that it’s the latter, but this will be demonstrated shortly.

Digging Into The Detail

The key line in the log output is:

oracle.ucp.jdbc.oracle.OracleJDBCConnectionPool.setONSConfiguration onsConfigStr:

But, if you don’t know that then you might not be searching for it; and when logging is set to ALL you get a lot to look at.

I’ll add a couple of other extracts from the log files I generated during my testing (when things didn’t work) that may help others get here:

1) Exception in thread “main” oracle.ons.SubscriptionException: Subscription request timed out after 30000 millseconds. Possible causes: OPMN may not be running, you may have an OPMN running in an alternate ORACLE_HOME using duplicate port values, or OPMN may be misconfigured.

To verify that this is not actually the problem you can check that ONS is up via “srvctl status nodeapps” or you can use onsctl (as the owner of the GI installation) in the form of “onsctl ping” or “onsctl debug” depending on how deep you need/want to dig. I found being able to telnet to the remote ONS port on each of the database servers the most convincing way to reassure myself that ONS really was up and that there was nothing blocking communication between my client on the database servers.

2) oracle.ons.ONSException: Unable to open config file

There’s not much to say about this other than when you’re using “remote ONS configuration” there is no ons.config file and if it is being looked for then it implies the remote configuration has not been detected. I also observed “ERROR: config file (ons.config) could not be found.” in the Tomcat console window.

[Note] – I saw the messages above in different environments: the first was when I had a full client install with $ORACLE_HOME/opmn/ons.config present on the client, but not modified from the installation; and the second was when I’d installed the instant client and $ORACLE_HOME/opmn did not exist.

False Hope

During my investigation I found myself reading this thread on forum.springsource.org and changed my contents.xml to have onsConfigStr in place of ONSConfiguration. I could see the logic given the log output I’d seen (and shown above), but unfortunately this didn’t get things working.

Bingo

It was only when I found my way to Constant Field Values in Oracle Universal Connection Pool for JDBC Java API Reference that I could see what was going wrong.

onsConfiguration

I updated my context.xml changing ONSConfiguration to onsConfiguration as well as adding the missing “nodes=” to the start of the string before compiling and deploying the application yet again.

Lo and behold, I found the following in my trace file:

2012-08-05T02:31:46.494+0100 UCP FINEST seq-227,thread-53 oracle.ucp.jdbc.oracle.OracleJDBCConnectionPool.setONSConfiguration onsConfigStr: nodes=rac-01:6200,rac-02:6200
2012-08-05T02:31:46.494+0100 UCP FINEST seq-228,thread-53 oracle.ucp.jdbc.oracle.RACManagerImpl.setONSConfiguration onsConfigStr: nodes=rac-01:6200,rac-02:6200

The output in the servlet also provided the proof that my ONS Configuration was finally be picked up…

At this point I got stuck into testing the closing of inactive/returned connections when a service is stopped. It worked and it worked very quickly. However, during my testing I discovered, or rather uncovered, a few things worth noting:

Case of Service Name

This is actually covered in MOS note Fast Connection Failover Does Not Work With Upper Case Database Service Name [ID 1457806.1], but what caught me out was something I was already aware of, but initially overlooked: When you remove a service using “srvctl remove service …” it does not clean up the service in the database. Therefore, when you recreate the service with a name in a difference case it will not update the case of the service name in the database unless you manually remove the service from the database using DBMS_SERVICE.DELETE_SERVICE.

EZConnect

In my experience do not use EZConnect for the URL otherwise your connections may not move as expected due to messages like this in the trace file:

2012-08-05T02:44:53.444+0100 UCP FINEST seq-1328,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processUpEvent eventType: database/event/service, status = up
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1329,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processServiceUpEvent initialSize: 0, maxSize: 10, instanceName: orcl2, dbUniqueName: orcl
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1330,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfo. DbUniqNameKey: orcl, InstNameKey: orcl2, HostNameKey: rac-02
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1331,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.setNamedInstanceUrl connect data keyword not found
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1332,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.setNamedInstanceUrl instance=orcl2, namedInstanceUrl=jdbc:oracle:thin:@rac-scan.orasavon.com/oltp_fcf
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1333,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.getConnectionToNamedInstance URL invalid for connecting to named instance
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1334,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.markUpInstanceForUpEvent namedInstanceConn is null
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1335,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processServiceUpEvent first up instance, no new connections to obtain
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1336,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processUpEvent2ndPhase Fast Connection Failover succeeded
2012-08-05T02:44:53.444+0100 UCP FINE seq-1337,thread-59 oracle.ucp.jdbc.oracle.ONSDatabaseEventHandlerTask.run event triggered: Service name: oltp_fcf, Instance name: orcl2, Unique name: orcl, Host name: rac-02, Status: up, Cardinality: 2, Reason: user, Event type: database/event/service

Initially I just moved away from using EZConnect, but after doing some more detailed testing I established that this only happens if there has never been any connections to the instance that has just come up, i.e., the problem does not occur if an instance is shutdown and then brought back during the life of the connection pool, but if the pool is started without a particular instance offering a service and then the service is started on that instance you’ll get the error.

initialPoolSize

MOS note Universal Connection Pool For Jdbc Doesn’T Work Properly Using Jndi [ID 1307692.1] covers the problem with this setting not taking effect, but I wasn’t aware of that until I had spent a fair amount of time trying to work out why my testing with JNDI was not working in the same way as when I used the code provided by Oracle in MOS note How to Verify Universal Connection Pool (UCP) / Fast Connection Failover (FCF) Setup [ID 1064652.1], which is an excellent reference when trying to get FCF with UCP working, but does not cover JNDI whereas Martin Bach’s blog posts do.

You may have noticed that the “Available Connections” displayed in all the screenshots is 1. This is because of this bug. I only got the available connections to increase above 1 during my testing by either using inactiveConnectionTimeout, as this appears to cause the connection pool be reestablished based on minPoolSize, or by running multiple simulated users hitting the webpage at the same time to increase the demand for connections.

MOS ID 1307692.1 offers both a patch and a workaround for the bug.

Summary

If you want to be sure that your FCF configuration is working correctly then I recommend not setting inactiveConnectionTimeout as I think this is a very easy way to create the illusion that the pool is being redistributed as a result of FCF when the connections are actually being redistributed due to connect-time load balancing and the fact that an instance that does not offer a service will not receive any connection requests for that service.

Tracing/logging is your friend. Refer to Martin’s “take 2” blog post and MOS note How to Verify Universal Connection Pool (UCP) / Fast Connection Failover (FCF) Setup [ID 1064652.1] for details.

Finally, another MOS note that was useful along the way is How to Log or Trace ONS in RAC 11g Release 2 [ID 1270841.1].

If you want the Java source code for my final Hello.java (which borrows a lot from both the Tomcat sample and Martin Bach’s blog posts) then it’s available for download here.