I’m at the end of a what feels like a long battle to get Fast Connection Failover (FCF) working with Universal Connection Pool (UCP). It should have been easy, or so I thought…
The most notable reference I’m aware of for setting up FCF with UCP outside of the Oracle documentation is this blog post by Martin Bach and its take 2. I have to thank Martin for publishing his work as without it I wouldn’t have been in a position to carry out my own testing. However, as I’ve discussed directly with him, I don’t think Martin’s implementation is complete for a couple of reasons that I’ll cover below.
I followed Martin’s blog posts…
When I’d got as far as the end of the first post I had a web page looking like this:
Hitting reload would show the instance name in “Hello World from …” changing every so often as the connection returned from the pool happened to switch instance.
I then moved on to following the second of Martin’s posts and ended up with this:
So far, so good.
Next I tinkered with the configuration a bit and changed the following from the configuration provided by Martin:
I’d been listening when members of the Oracle Real World Performance Group said that setting minPoolSize and maxPoolSize to different values was not a great idea and so I set them both to the same number in order to keep the pool size as steady as possible. I set inactiveConnectionTimeout to 0 (zero) as closing inactive connections and then creating new connections would be an unnecessary overhead on the database.
After making the changes I started playing around with stopping and starting services on individual instances…
Nothing happened! My connections did not failover! Fast Connection Failover was broken!
I changed the servlet to include some more details about the pool as shown below:
The most important thing to note in the screenshot is that “ONS Configuration” is not populated with a value. This could be as a result of my code not doing what I wanted or simply because getONSConfiguration was returning null. At this point you’ll have to trust me that it’s the latter, but this will be demonstrated shortly.
Digging Into The Detail
The key line in the log output is:
But, if you don’t know that then you might not be searching for it; and when logging is set to ALL you get a lot to look at.
I’ll add a couple of other extracts from the log files I generated during my testing (when things didn’t work) that may help others get here:
1) Exception in thread “main” oracle.ons.SubscriptionException: Subscription request timed out after 30000 millseconds. Possible causes: OPMN may not be running, you may have an OPMN running in an alternate ORACLE_HOME using duplicate port values, or OPMN may be misconfigured.
To verify that this is not actually the problem you can check that ONS is up via “srvctl status nodeapps” or you can use onsctl (as the owner of the GI installation) in the form of “onsctl ping” or “onsctl debug” depending on how deep you need/want to dig. I found being able to telnet to the remote ONS port on each of the database servers the most convincing way to reassure myself that ONS really was up and that there was nothing blocking communication between my client on the database servers.
2) oracle.ons.ONSException: Unable to open config file
There’s not much to say about this other than when you’re using “remote ONS configuration” there is no ons.config file and if it is being looked for then it implies the remote configuration has not been detected. I also observed “ERROR: config file (ons.config) could not be found.” in the Tomcat console window.
[Note] – I saw the messages above in different environments: the first was when I had a full client install with $ORACLE_HOME/opmn/ons.config present on the client, but not modified from the installation; and the second was when I’d installed the instant client and $ORACLE_HOME/opmn did not exist.
During my investigation I found myself reading this thread on forum.springsource.org and changed my contents.xml to have onsConfigStr in place of ONSConfiguration. I could see the logic given the log output I’d seen (and shown above), but unfortunately this didn’t get things working.
It was only when I found my way to Constant Field Values in Oracle Universal Connection Pool for JDBC Java API Reference that I could see what was going wrong.
I updated my context.xml changing ONSConfiguration to onsConfiguration as well as adding the missing “nodes=” to the start of the string before compiling and deploying the application yet again.
Lo and behold, I found the following in my trace file:
2012-08-05T02:31:46.494+0100 UCP FINEST seq-227,thread-53 oracle.ucp.jdbc.oracle.OracleJDBCConnectionPool.setONSConfiguration onsConfigStr: nodes=rac-01:6200,rac-02:6200
2012-08-05T02:31:46.494+0100 UCP FINEST seq-228,thread-53 oracle.ucp.jdbc.oracle.RACManagerImpl.setONSConfiguration onsConfigStr: nodes=rac-01:6200,rac-02:6200
The output in the servlet also provided the proof that my ONS Configuration was finally be picked up…
At this point I got stuck into testing the closing of inactive/returned connections when a service is stopped. It worked and it worked very quickly. However, during my testing I discovered, or rather uncovered, a few things worth noting:
Case of Service Name
This is actually covered in MOS note Fast Connection Failover Does Not Work With Upper Case Database Service Name [ID 1457806.1], but what caught me out was something I was already aware of, but initially overlooked: When you remove a service using “srvctl remove service …” it does not clean up the service in the database. Therefore, when you recreate the service with a name in a difference case it will not update the case of the service name in the database unless you manually remove the service from the database using DBMS_SERVICE.DELETE_SERVICE.
In my experience do not use EZConnect for the URL otherwise your connections may not move as expected due to messages like this in the trace file:
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1328,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processUpEvent eventType: database/event/service, status = up
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1329,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processServiceUpEvent initialSize: 0, maxSize: 10, instanceName: orcl2, dbUniqueName: orcl
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1330,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfo. DbUniqNameKey: orcl, InstNameKey: orcl2, HostNameKey: rac-02
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1331,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.setNamedInstanceUrl connect data keyword not found
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1332,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.setNamedInstanceUrl instance=orcl2, namedInstanceUrl=jdbc:oracle:thin:@rac-scan.orasavon.com/oltp_fcf
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1333,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.getConnectionToNamedInstance URL invalid for connecting to named instance
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1334,thread-59 oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.markUpInstanceForUpEvent namedInstanceConn is null
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1335,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processServiceUpEvent first up instance, no new connections to obtain
2012-08-05T02:44:53.444+0100 UCP FINEST seq-1336,thread-59 oracle.ucp.jdbc.oracle.RACManagerImpl.processUpEvent2ndPhase Fast Connection Failover succeeded
2012-08-05T02:44:53.444+0100 UCP FINE seq-1337,thread-59 oracle.ucp.jdbc.oracle.ONSDatabaseEventHandlerTask.run event triggered: Service name: oltp_fcf, Instance name: orcl2, Unique name: orcl, Host name: rac-02, Status: up, Cardinality: 2, Reason: user, Event type: database/event/service
Initially I just moved away from using EZConnect, but after doing some more detailed testing I established that this only happens if there has never been any connections to the instance that has just come up, i.e., the problem does not occur if an instance is shutdown and then brought back during the life of the connection pool, but if the pool is started without a particular instance offering a service and then the service is started on that instance you’ll get the error.
MOS note Universal Connection Pool For Jdbc Doesn’T Work Properly Using Jndi [ID 1307692.1] covers the problem with this setting not taking effect, but I wasn’t aware of that until I had spent a fair amount of time trying to work out why my testing with JNDI was not working in the same way as when I used the code provided by Oracle in MOS note How to Verify Universal Connection Pool (UCP) / Fast Connection Failover (FCF) Setup [ID 1064652.1], which is an excellent reference when trying to get FCF with UCP working, but does not cover JNDI whereas Martin Bach’s blog posts do.
You may have noticed that the “Available Connections” displayed in all the screenshots is 1. This is because of this bug. I only got the available connections to increase above 1 during my testing by either using inactiveConnectionTimeout, as this appears to cause the connection pool be reestablished based on minPoolSize, or by running multiple simulated users hitting the webpage at the same time to increase the demand for connections.
MOS ID 1307692.1 offers both a patch and a workaround for the bug.
If you want to be sure that your FCF configuration is working correctly then I recommend not setting inactiveConnectionTimeout as I think this is a very easy way to create the illusion that the pool is being redistributed as a result of FCF when the connections are actually being redistributed due to connect-time load balancing and the fact that an instance that does not offer a service will not receive any connection requests for that service.
Tracing/logging is your friend. Refer to Martin’s “take 2” blog post and MOS note How to Verify Universal Connection Pool (UCP) / Fast Connection Failover (FCF) Setup [ID 1064652.1] for details.
Finally, another MOS note that was useful along the way is How to Log or Trace ONS in RAC 11g Release 2 [ID 1270841.1].
If you want the Java source code for my final Hello.java (which borrows a lot from both the Tomcat sample and Martin Bach’s blog posts) then it’s available for download here.