The RAC and HA roundtable gives users a chance to share experience and gotchas. Often the discussion points that generate most interest are when something unusual or unexplained has happened, with the participants attempting to drill down to where the problem was by firing questions at the DBA telling the tale. The discussion this time there was such a discussion around a RAC stretch cluster where the storage connection had been deliberately broken during a test (but not the RAC interconnect between the two sites). The end result of this was a failure of the cluster but after a restart, the database had been rolled back to a point in time several minutes before the failure. This is discussed here: http://davidburnham.wordpress.com/category/oracle-rac-stretched-cluster-testing/
Some participants suggested that Active Data Guard was a better option than a stretch RAC cluster with one saying that Active Data Guard should be the default position with RAC only deployed if Active DataGuard didn’t fit the bill. Unfortunately as a presentation on DataGuard was taking place at the same time, most people using it were presumably elsewhere.
Another discussion point was whether RAC actually reduces the volume of outages, ensuring high availability. Participants generally agreed that it is great for installing rolling updates and so on without disruption – in other words, planned maintenance work – and dealing with some other modes of failure. But sometimes the added complexity could cause unplanned outages that may never have happened on a single instance system. One participant argued that some of the outages cited should not happen, since they were down to human error.
Talk moved on to whether an Oracle Exadata box represents good value for money. One participant mentioned that he could easily build a similar set up from commodity hardware for a fraction of the cost, but it would not include a high-speed interconnect or the Exadata software. He cited some data-warehouse type queries that were sped up by an order of magnitude when implemented using Exadata. This had transformed the attitude to these queries in the company concerned: because they were quick to run, users were more likely to use them. So the answer is that it depends.
A somewhat inconclusive discussion took place on how best to deploy ASM when there are multiple databases: should all databases share the same ASM instance or diskgroup? One participant suggested that ASM was effectively a single point of failure which could cause all databases to fail in the event on a software bug. Another responded that he had been using ASM for four years without any such disruption. I guess ASM should just be viewed as another filesystem for this point of view and what are the chances of GPFS or even raw devices failing?
The number of RAC nodes in a cluster was another talking point. With a two node cluster, in the event of a failure you are left with just a single node which ought to be able to manage the full system load. Many participants seemed to prefer a three or four node cluster as 4 to 3 or 3 to 2 isn’t as dramatic a change as 2 to 1.
So, it was an interesting discussion and definitely a refreshing change from sitting passively through presentations.