RAC/Exadata Shell Aliases
I’ve been working on Oracle RAC systems for a while now and the subject of how best to setup shell environment variables and aliases seems to be something that there is a fair range of opinion on.
- Do you have entries in oratab for both the database and the instance on that server?
- Do you use oraenv or set the variables required for each environment via another method?
- If you use oraenv, do you customise it?
- How do you switch ORACLE_SID between the db_name and instance_name, if you do?
Please feel free to answer these questions via a comment below, however, the main point of this post is to share a few aliases that I’m finding very useful at the moment, and to encourage you to share aliases you use in your RAC environments. Actually, the non-RAC specific, Oracle related, aliases you create would be interesting to me too, but more on that later…
As I mentioned in a much earlier post I feel that in the world of RAC databases there is scope for an environment variable of ORACLE_DB or ORA_DB_NAME… But, I also feel that there’s a lot to be said for not introducing more variables then is necessary. To this end I have taken to creating the following aliases in RAC environments that I’m working in[1].
alias inst='ORACLE_SID=${ORACLE_SID}$($(grep "^+ASM[1-8]" /etc/oratab | cut -d":" -f2)/bin/olsnodes -l -n | cut -f2)'
alias db='ORACLE_SID=${ORACLE_SID%$($(grep "^+ASM[1-8]" /etc/oratab | cut -d":" -f2)/bin/olsnodes -l -n | cut -f2)}'
These aliases allow me to quickly switch ORACLE_SID from being set to the relevant value for the database to it being set to the relevant value for the instance on the given host. Obviously it doesn’t stop you modifying the value of ORACLE_SID to something that is not valid, but at least it is easy to undo a mistake you might have made.
He’s a quick example:
[ORCL@rac-02 ~]$ inst [ORCL2@rac-02 ~]$ inst [ORCL22@rac-02 ~]$ inst [ORCL222@rac-02 ~]$ inst [ORCL2222@rac-02 ~]$ db [ORCL222@rac-02 ~]$ db [ORCL22@rac-02 ~]$ db [ORCL2@rac-02 ~]$ db [ORCL@rac-02 ~]$
And, for a little more alias sharing… In my time working on Exadata environments I’ve found myself using dcli a lot and created the aliases below to save me typing on the basis that I, almost always, either want to use dcli to communicate with all “compute nodes” or all “storage cells” so why not simplify the process and shorten the commands:
alias ddcli='dcli -g ~/dbs_group -l oracle' alias cdcli='dcli -g ~/cell_group -l cellmonitor'
I’d be keen to get a page going for useful aliases so we can all benefit from the ideas of each other… If you’ve got something to share please comment and if the collection builds up we’ll start a page for alias sharing.
__________
[1] – If I’m working in an environment where a shared “oracle” account is in use then I define my aliases in a separate file and source this when I login so that my preferences are not inflicted on others
cell smart file creation
One of the offload capabilities you get with Exadata that I’ve been appreciating recently is “cell smart file creation”. In a nutshell, the fact that formatting of blocks is offloaded to the cells rather than being performed by a process on the database server, or “compute node” if you want to use Exadata speak, makes the creation of datafiles faster.
As as quick demonstration look at the elapsed time for the creation of a 100G datafile. First with _cell_fast_file_create at the default value of TRUE and then set to FALSE in order to disable the offloading of the datafile creation:
SQL> create bigfile tablespace martin01 datafile '+DATA01' size 100G; Tablespace created. Elapsed: 00:01:40.39 SQL> alter session set "_cell_fast_file_create" = false; Session altered. Elapsed: 00:00:00.00 SQL> create bigfile tablespace martin02 datafile '+DATA01' size 100G; Tablespace created. Elapsed: 00:04:15.17
101 secs with offload vs. 255 secs without represents a saving of 60%. Examining the wait events associated with the 2 different datafile creations we see “cell smart file creation” for the offloaded file creation and “Data file init write” for the non-offloaded file creation.
Great, creating datafiles is faster on Exadata… But what about redo logs?
OK, I take the point that redo log files aren’t generally created with the same regularity as datafiles, but one of the things I’ve been working on recently has involved a “roll your own” DUPLICATE FROM ACTIVE DATABASE which itself involves creating a reasonable number of redo logs (4G log files, 4 redo groups, 8 threads = waiting). I don’t like waiting, especially when I’ve been asked to make something go as fast as I can, so I thought I’d have a little look and confirm that creation of redo logs is not offloaded…
Tracing the creation of 2 x 4G redo logs revealed the following:
SQL ID: 3r832sm0yj1gv Plan Hash: 0
alter database add logfile thread 1 group 99 ('+DATA01','+RECO') size 4096m
blocksize 512
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 3.81 37.96 0 0 2 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 3.81 37.96 0 0 2 0
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 36
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
enq: FS - contention 2 0.00 0.00
control file sequential read 58 0.05 0.13
KSV master wait 20 1.71 3.27
ASM file metadata operation 6 0.00 0.00
kfk: async disk IO 4 0.00 0.00
Disk file operations I/O 4 0.00 0.00
Log file init write 4381 0.13 31.35
enq: CF - contention 1 0.00 0.00
control file parallel write 14 0.01 0.03
log file single write 2 0.00 0.00
ges inquiry response 1 0.00 0.00
log file sync 1 0.00 0.00
SQL*Net message to client 1 0.00 0.00
SQL*Net message from client 1 0.00 0.00
********************************************************************************
38 seconds might not be long, but when you consider the number of redo log groups and number of redo threads it soon adds up. And when you have 7 databases to do this for, the opportunity to save 60% of the time in this phase starts to look attractive. Add to this that ALTER DATABASE OPEN RESETLOGS spends the vast majority of its time waiting on “Log file init write” and I can’t help feeling that ”cell smart log creation” would be a nicer event to be waiting on
I realise that it’s probably way down on the list of enhancements, but I for one would appreciate it.
Fun with _SERIAL_DIRECT_READ
Serial Direct Reads are a subject that has been getting a reasonable amount of attention since 11g was released. Doug Burns, Dion Cho (and here), Kerry Osborne and Miladin Modrakovic have all written on the subject. Tanel Põder also covered it in his presentation at the UKOUG Exadata Special Event.
Kerry and Tanel both cover the importance of direct reads in the context Smart Scan when running Exadata. I won’t reproduce their material here, you can read it direct from them, but to summarise: seeing “storage” an execution plan simply means that Smart Scan is possible, but does not mean that you’ll get it. You must have direct reads.
There are a number of ways to ensure your statement executes using direct reads and the one I’ll cover here is the (hidden) parameter _serial_direct_read.
In the process of using _serial_direct_read to ensure a serial delete used Smart Scan I ran into something that I find both interesting and slightly concerning. Let me demonstrate.
Create an environment to test in:
SQL> create tablespace sdr_test datafile '+DATA01' size 1G autoextend on maxsize 5G 2 / Tablespace created. Elapsed: 00:00:01.41 SQL> create user sdr_test identified by martin 2 default tablespace sdr_test 3 quota unlimited on sdr_test 4 / User created. Elapsed: 00:00:00.05 SQL> grant create session 2 , create table 3 , alter session 4 , alter system 5 to sdr_test 6 / Grant succeeded. Elapsed: 00:00:00.01
Connect as the test user and create a test table:
SQL> connect sdr_test/martin Connected. SDR_TEST - SESSION 1> create table a as 2 ( 3 select rownum id 4 , rpad(rownum,200,'X') c1 5 from dual 6 connect by rownum <= 10000000 7 ) 8 / Table created. Elapsed: 00:00:25.43
I’m sure it’s obvious, but I’ll state it for clarity: with no index on the table all access will be via full table scan.
Flush the buffer cache, set a trace file identifier, start tracing, run DELETE statement and rollback:
SDR_TEST - SESSION 1> alter system flush buffer_cache 2 / System altered. Elapsed: 00:00:02.65 SDR_TEST - SESSION 1> alter session set tracefile_identifier='sdr_1a' 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 1> exec dbms_session.session_trace_enable(true,true,'ALL_EXECUTIONS') PL/SQL procedure successfully completed. Elapsed: 00:00:00.02 SDR_TEST - SESSION 1> var id number SDR_TEST - SESSION 1> exec :id := 1 PL/SQL procedure successfully completed. Elapsed: 00:00:00.00 SDR_TEST - SESSION 1> delete from a where id = :id 2 / 1 row deleted. Elapsed: 00:00:19.93 SDR_TEST - SESSION 1> rollback 2 / Rollback complete. Elapsed: 00:00:00.01
Check the trace file to see what we get:
WAIT #47706004018352: nam='gc cr multi block request' ela= 324 file#=5382 block#=2584 class#=1 obj#=761740 tim=1308902615410001 WAIT #47706004018352: nam='gc cr multi block request' ela= 806 file#=5382 block#=2680 class#=1 obj#=761740 tim=1308902615411418 WAIT #47706004018352: nam='cell multiblock physical read' ela= 12689 cellhash#=764630511 diskhash#=3671767124 bytes=1032192 obj#=761740 tim=1308902615424291 WAIT #47706004018352: nam='gc cr multi block request' ela= 408 file#=5382 block#=2744 class#=1 obj#=761740 tim=1308902615426480 WAIT #47706004018352: nam='gc cr multi block request' ela= 1037 file#=5382 block#=2808 class#=1 obj#=761740 tim=1308902615428066 WAIT #47706004018352: nam='cell multiblock physical read' ela= 5475 cellhash#=764630511 diskhash#=3671767124 bytes=1032192 obj#=761740 tim=1308902615433706 WAIT #47706004018352: nam='gc cr multi block request' ela= 696 file#=5382 block#=2881 class#=1 obj#=761740 tim=1308902615436139 WAIT #47706004018352: nam='gc cr multi block request' ela= 895 file#=5382 block#=2943 class#=1 obj#=761740 tim=1308902615437558 WAIT #47706004018352: nam='cell multiblock physical read' ela= 5300 cellhash#=764630511 diskhash#=3671767124 bytes=1032192 obj#=761740 tim=1308902615443082 WAIT #47706004018352: nam='gc cr multi block request' ela= 916 file#=5382 block#=3009 class#=1 obj#=761740 tim=1308902615445733
… That’s not a Smart Scan.
Set _serial_direct_read to TRUE, set new trace file identifier, flush buffer cache, run DELETE and rollback:
SDR_TEST - SESSION 1> alter session set "_serial_direct_read"=true 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 1> alter session set tracefile_identifier='sdr_1b' 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 1> alter system flush buffer_cache 2 / System altered. Elapsed: 00:00:00.45 SDR_TEST - SESSION 1> delete from a where id = :id 2 / 1 row deleted. Elapsed: 00:00:14.54 SDR_TEST - SESSION 1> rollback 2 / Rollback complete. Elapsed: 00:00:00.00
Check the trace file to see what we get:
WAIT #47706540316360: nam='cell multiblock physical read' ela= 8168 cellhash#=764630511 diskhash#=1422992472 bytes=1048576 obj#=761740 tim=1308902638029103 WAIT #47706540316360: nam='cell multiblock physical read' ela= 5468 cellhash#=764630511 diskhash#=1422992472 bytes=1048576 obj#=761740 tim=1308902638035744 WAIT #47706540316360: nam='cell multiblock physical read' ela= 5190 cellhash#=764630511 diskhash#=1422992472 bytes=1015808 obj#=761740 tim=1308902638042035 WAIT #47706540316360: nam='cell multiblock physical read' ela= 13723 cellhash#=2714602255 diskhash#=266393524 bytes=1048576 obj#=761740 tim=1308902638056851 WAIT #47706540316360: nam='cell multiblock physical read' ela= 7911 cellhash#=2714602255 diskhash#=266393524 bytes=1048576 obj#=761740 tim=1308902638066297 WAIT #47706540316360: nam='cell multiblock physical read' ela= 5505 cellhash#=2714602255 diskhash#=266393524 bytes=1048576 obj#=761740 tim=1308902638072991 WAIT #47706540316360: nam='cell multiblock physical read' ela= 5218 cellhash#=2714602255 diskhash#=266393524 bytes=32768 obj#=761740 tim=1308902638079404 WAIT #47706540316360: nam='cell multiblock physical read' ela= 1424 cellhash#=1730663586 diskhash#=755155853 bytes=1048576 obj#=761740 tim=1308902638082024 WAIT #47706540316360: nam='cell multiblock physical read' ela= 1493 cellhash#=1730663586 diskhash#=755155853 bytes=1048576 obj#=761740 tim=1308902638084697 WAIT #47706540316360: nam='cell multiblock physical read' ela= 1314 cellhash#=1730663586 diskhash#=755155853 bytes=1048576 obj#=761740 tim=1308902638087102
… That’s still not a Smart Scan
Set new trace file identifier, flush buffer cache, run DELETE (with an embedded comment to force hard parse) and rollback:
SDR_TEST - SESSION 1> alter session set tracefile_identifier='sdr_1c' 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 1> alter system flush buffer_cache 2 / System altered. Elapsed: 00:00:00.23 SDR_TEST - SESSION 1> delete /* SESSION 1 */ from a where id = :id 2 / 1 row deleted. Elapsed: 00:00:00.29 SDR_TEST - SESSION 1> rollback 2 / Rollback complete. Elapsed: 00:00:00.01
Check the trace file to see what we get:
WAIT #47706540290640: nam='cell smart table scan' ela= 26 cellhash#=3018218633 p2=0 p3=0 obj#=761740 tim=1308902650538344 WAIT #47706540290640: nam='cell smart table scan' ela= 8 cellhash#=3018218633 p2=0 p3=0 obj#=761740 tim=1308902650538400 WAIT #47706540290640: nam='cell smart table scan' ela= 310 cellhash#=3018218633 p2=0 p3=0 obj#=761740 tim=1308902650538741 WAIT #47706540290640: nam='cell smart table scan' ela= 9 cellhash#=3018218633 p2=0 p3=0 obj#=761740 tim=1308902650538821 WAIT #47706540290640: nam='cell smart table scan' ela= 26 cellhash#=3018218633 p2=0 p3=0 obj#=761740 tim=1308902650538887 WAIT #47706540290640: nam='cell smart table scan' ela= 25 cellhash#=3018218633 p2=0 p3=0 obj#=761740 tim=1308902650538961 WAIT #47706540290640: nam='cell smart table scan' ela= 25 cellhash#=1078241440 p2=0 p3=0 obj#=761740 tim=1308902650539712 WAIT #47706540290640: nam='cell smart table scan' ela= 35 cellhash#=1078241440 p2=0 p3=0 obj#=761740 tim=1308902650539801 WAIT #47706540290640: nam='cell smart table scan' ela= 7 cellhash#=1078241440 p2=0 p3=0 obj#=761740 tim=1308902650539856 WAIT #47706540290640: nam='cell smart table scan' ela= 564 cellhash#=1078241440 p2=0 p3=0 obj#=761740 tim=1308902650540456
… Smart Scan, Smart Scan!
Start a new session, set variable, set new trace file identifier, start tracing, run DELETE and rollback:
SQL> connect sdr_test/martin Connected. SDR_TEST - SESSION 2> var id number SDR_TEST - SESSION 2> exec :id := 1 PL/SQL procedure successfully completed. Elapsed: 00:00:00.00 SDR_TEST - SESSION 2> alter session set tracefile_identifier='sdr_2a' 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 2> exec dbms_session.session_trace_enable(true,true,'ALL_EXECUTIONS') PL/SQL procedure successfully completed. Elapsed: 00:00:00.00 SDR_TEST - SESSION 2> alter system flush buffer_cache 2 / System altered. Elapsed: 00:00:00.12 SDR_TEST - SESSION 2> delete /* SESSION 2 */ from a where id = :id 2 / 1 row deleted. Elapsed: 00:00:14.07 SDR_TEST - SESSION 2> SDR_TEST - SESSION 2> rollback 2 / Rollback complete. Elapsed: 00:00:00.01
Check the trace file to see what we get:
WAIT #47830984391856: nam='cell multiblock physical read' ela= 5305 cellhash#=2714602255 diskhash#=1846921898 bytes=1032192 obj#=761740 tim=1308902651164153 WAIT #47830984391856: nam='cell multiblock physical read' ela= 1356 cellhash#=2714602255 diskhash#=1846921898 bytes=1032192 obj#=761740 tim=1308902651168319 WAIT #47830984391856: nam='cell multiblock physical read' ela= 1411 cellhash#=1857444352 diskhash#=3010233083 bytes=1032192 obj#=761740 tim=1308902651174085 WAIT #47830984391856: nam='cell multiblock physical read' ela= 1346 cellhash#=1857444352 diskhash#=3010233083 bytes=1032192 obj#=761740 tim=1308902651178202 WAIT #47830984391856: nam='cell multiblock physical read' ela= 1286 cellhash#=1857444352 diskhash#=3010233083 bytes=1032192 obj#=761740 tim=1308902651182195 WAIT #47830984391856: nam='cell multiblock physical read' ela= 1259 cellhash#=1857444352 diskhash#=3010233083 bytes=1032192 obj#=761740 tim=1308902651186208 WAIT #47830984391856: nam='cell multiblock physical read' ela= 15622 cellhash#=281363246 diskhash#=3630261201 bytes=1032192 obj#=761740 tim=1308902651206140 WAIT #47830984391856: nam='cell multiblock physical read' ela= 5431 cellhash#=281363246 diskhash#=3630261201 bytes=1032192 obj#=761740 tim=1308902651214612 WAIT #47830984391856: nam='cell multiblock physical read' ela= 5315 cellhash#=281363246 diskhash#=3630261201 bytes=1032192 obj#=761740 tim=1308902651222641 WAIT #47830984391856: nam='cell multiblock physical read' ela= 5303 cellhash#=281363246 diskhash#=3630261201 bytes=1032192 obj#=761740 tim=1308902651230702
… Not a Smart Scan, but that’s expected.
Set a new trace file identifier, flush buffer cache and run DELETE statement from session 1 in session 2:
SDR_TEST - SESSION 2> alter session set tracefile_identifier='sdr_2b' 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 2> alter system flush buffer_cache 2 / System altered. Elapsed: 00:00:00.21 SDR_TEST - SESSION 2> delete /* SESSION 1 */ from a where id = :id 2 / 1 row deleted. Elapsed: 00:00:00.25 SDR_TEST - SESSION 2> rollback 2 / Rollback complete. Elapsed: 00:00:00.01
Check the trace file to see what we get:
WAIT #47830986253416: nam='cell smart table scan' ela= 48 cellhash#=281363246 p2=0 p3=0 obj#=761740 tim=1308902665561263 WAIT #47830986253416: nam='cell smart table scan' ela= 8 cellhash#=281363246 p2=0 p3=0 obj#=761740 tim=1308902665561338 WAIT #47830986253416: nam='cell smart table scan' ela= 3944 cellhash#=281363246 p2=0 p3=0 obj#=761740 tim=1308902665565316 WAIT #47830986253416: nam='cell smart table scan' ela= 10 cellhash#=764630511 p2=0 p3=0 obj#=761740 tim=1308902665565394 WAIT #47830986253416: nam='cell smart table scan' ela= 3869 cellhash#=764630511 p2=0 p3=0 obj#=761740 tim=1308902665569302 WAIT #47830986253416: nam='cell smart table scan' ela= 14 cellhash#=764630511 p2=0 p3=0 obj#=761740 tim=1308902665569389 WAIT #47830986253416: nam='cell smart table scan' ela= 6876 cellhash#=764630511 p2=0 p3=0 obj#=761740 tim=1308902665576298 WAIT #47830986253416: nam='cell smart table scan' ela= 26 cellhash#=3970173858 p2=0 p3=0 obj#=761740 tim=1308902665576745 WAIT #47830986253416: nam='cell smart table scan' ela= 25 cellhash#=3970173858 p2=0 p3=0 obj#=761740 tim=1308902665576816 WAIT #47830986253416: nam='cell smart table scan' ela= 25 cellhash#=281363246 p2=0 p3=0 obj#=761740 tim=1308902665576988
… Smart Scan! But, hold on a minute, that means it’s using direct reads.
Set _serial_direct_read to NEVER, set new trace file identifier, flush buffer cache, run delete from session 1 in session 2 and rollback:
SDR_TEST - SESSION 2> alter session set "_serial_direct_read"=never 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 2> alter session set tracefile_identifier='sdr_2c' 2 / Session altered. Elapsed: 00:00:00.00 SDR_TEST - SESSION 2> alter system flush buffer_cache 2 / System altered. Elapsed: 00:00:00.13 SDR_TEST - SESSION 2> delete /* SESSION 1 */ from a where id = :id 2 / 1 row deleted. Elapsed: 00:00:00.20 SDR_TEST - SESSION 2> rollback 2 / Rollback complete. Elapsed: 00:00:00.00
Check the trace file to see what we get:
WAIT #47830984369624: nam='cell smart table scan' ela= 201 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665858678 WAIT #47830984369624: nam='cell smart table scan' ela= 8 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665858742 WAIT #47830984369624: nam='cell smart table scan' ela= 201 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665858973 WAIT #47830984369624: nam='cell smart table scan' ela= 8 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665859048 WAIT #47830984369624: nam='cell smart table scan' ela= 288 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665859367 WAIT #47830984369624: nam='cell smart table scan' ela= 7 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665859415 WAIT #47830984369624: nam='cell smart table scan' ela= 23 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665859471 WAIT #47830984369624: nam='cell smart table scan' ela= 7 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665859521 WAIT #47830984369624: nam='cell smart table scan' ela= 868 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665860420 WAIT #47830984369624: nam='cell smart table scan' ela= 7 cellhash#=3108293528 p2=0 p3=0 obj#=761740 tim=1308902665860485
… Still doing Smart Scan!
During his UKOUG presentation Tanel emphasised that the CBO knows nothing about Smart Scan, which is confirmed by execution plans and CBO traces, but it seems that there must be information about how the execution plan will actually be executed stored in memory, and associated with the execution plan… I don’t know where that is yet, but I’m guessing Tanel does.
Anyway, the key point is that the value of _serial_direct_read is not considered by the CBO to be part of the “optimizer environment”, but it appears (to me) that it should be. After all, wouldn’t you expect ALTER SESSION to do just that, rather than having the ability to affect other sessions? I appreciate that this is a hidden parameter, which shouldn’t be used without agreement from Oracle Support, but it seems like something has been overlooked. In Kerry’s PDF Tuning Exadata he mentions that _serial_direct_read is not compatible with the OPT_PARAM hint, which is a more succinct way of saying all the above, but the implications weren’t clear to me when I first read that.
Maybe at some point the parameter will lose its underscore and become part of the the optimizer environment… Time will tell.
UK OUG conference: RAC & HA special interest group round table
The RAC and HA roundtable gives users a chance to share experience and gotchas. Often the discussion points that generate most interest are when something unusual or unexplained has happened, with the participants attempting to drill down to where the problem was by firing questions at the DBA telling the tale. The discussion this time there was such a discussion around a RAC stretch cluster where the storage connection had been deliberately broken during a test (but not the RAC interconnect between the two sites). The end result of this was a failure of the cluster but after a restart, the database had been rolled back to a point in time several minutes before the failure. This is discussed here: http://davidburnham.wordpress.com/category/oracle-rac-stretched-cluster-testing/
Some participants suggested that Active Data Guard was a better option than a stretch RAC cluster with one saying that Active Data Guard should be the default position with RAC only deployed if Active DataGuard didn’t fit the bill. Unfortunately as a presentation on DataGuard was taking place at the same time, most people using it were presumably elsewhere.
Another discussion point was whether RAC actually reduces the volume of outages, ensuring high availability. Participants generally agreed that it is great for installing rolling updates and so on without disruption – in other words, planned maintenance work – and dealing with some other modes of failure. But sometimes the added complexity could cause unplanned outages that may never have happened on a single instance system. One participant argued that some of the outages cited should not happen, since they were down to human error.
Talk moved on to whether an Oracle Exadata box represents good value for money. One participant mentioned that he could easily build a similar set up from commodity hardware for a fraction of the cost, but it would not include a high-speed interconnect or the Exadata software. He cited some data-warehouse type queries that were sped up by an order of magnitude when implemented using Exadata. This had transformed the attitude to these queries in the company concerned: because they were quick to run, users were more likely to use them. So the answer is that it depends.
A somewhat inconclusive discussion took place on how best to deploy ASM when there are multiple databases: should all databases share the same ASM instance or diskgroup? One participant suggested that ASM was effectively a single point of failure which could cause all databases to fail in the event on a software bug. Another responded that he had been using ASM for four years without any such disruption. I guess ASM should just be viewed as another filesystem for this point of view and what are the chances of GPFS or even raw devices failing?
The number of RAC nodes in a cluster was another talking point. With a two node cluster, in the event of a failure you are left with just a single node which ought to be able to manage the full system load. Many participants seemed to prefer a three or four node cluster as 4 to 3 or 3 to 2 isn’t as dramatic a change as 2 to 1.
So, it was an interesting discussion and definitely a refreshing change from sitting passively through presentations.
1 comment