Ulimits and ORA-04030
Oracle database processes are to many systems administrators just another service that runs on the machines they look after. Understandably sys admins like to be able to put some controls in to prevent these processes from consuming too many CPU resources, filling up the disks or hogging all the memory on the system. This can be done on UNIX and Linux systems with ulimits which control the resources available to your shell and the programs it starts.
You can check your ulimits that currently apply to you by running the command ulimit -a. On Linux it looks like this and it’s similar with AIX:
[ben@linux ~]$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) 32768 scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 23551 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 32768 cpu time (seconds, -t) unlimited max user processes (-u) 23551 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited
You’ll note that I ran the above commands as my own user. Oracle makes some recommendations for the settings for the oracle user, generally found in the release notes or prerequisites for your platform. Because Oracle makes no recommendations about shell limits on a DBA account, the limits applying to you personally will often be much lower than those applying to the Oracle user.
So when might this be a problem? Let’s take how you start up your Oracle service. Here are two ways it’s possible to do it:
- Using srvctl, e.g.
srvctl start database -d dbname - Using SQL*Plus, e.g.
sqlplus / as sysdbaand then issue commandstartup.
The problem comes with the second method. When invoking Oracle using SQL*Plus, the shell limits that apply to all the Oracle processes are inherited from your shell. Therefore you need to make sure you’re logged in as the Oracle user. There are some subtleties around this: what happens if you use sudo, e.g. sudo -s -u oracle to invoke an Oracle user’s shell? On some systems you might find that your shell limits don’t change and you’re still stuck with the lower limits of your own shell: you can check by running ulimit -a. You may be setting yourself up for ORA-04030 errors; limiting the PGA memory area allowed for sorting data; and imposing other operating system limits.
This problem doesn’t exist if you invoke sudo su - oracle to change to the oracle user.
What happens if Oracle is already running and you’re not sure if it was started correctly? With Linux it is quite easy to find out, provided you have the right privileges. First get the process id of your smon process, in this case 4057:
[ben@linux ~]$ ps -ef | grep smon root 3790 1 1 Feb05 ? 08:04:14 /u01/app/grid/11203/home_1/bin/osysmond.bin grid 4057 1 0 Feb05 ? 00:02:02 asm_smon_+ASM1 oracle 4495 1 0 Feb05 ? 00:04:50 ora_smon_beverley1 ben 7468 7438 0 11:17 pts/0 00:00:00 grep smon
And then take a look in /proc/4057/limits:
[ben@linux 4057]$ sudo cat limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited seconds Max file size unlimited unlimited bytes Max data size unlimited unlimited bytes Max stack size 33554432 unlimited bytes Max core file size unlimited unlimited bytes Max resident set unlimited unlimited bytes Max processes 23551 23551 processes Max open files 65536 65536 files Max locked memory unlimited unlimited bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 23551 23551 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0 Max realtime timeout unlimited unlimited us
You can compare the above with the ulimits set for the oracle user.
With AIX it’s harder since there is no /proc filesystem. You have to run a debugger against the process. There’s a nice blog post on one method here, although it is not suitable for a production system because it interrupts the Oracle process and kills it afterwards.
So is that everything? Unfortunately not, ulimits don’t just affect the server processes like smon; they also affect server processes for connections using SQL*Plus. Below are two server processes running, each serving a connection from SQL*Plus, one for my ben user and another for the oracle user:
SQL> select spid, username, program from v$process where spid in (9351, 9798); SPID USERNAME PROGRAM ------------------------ --------------- ------------------------------------------------ 9798 oracle oracle@o112-c2n1.ora.pwk (TNS V1-V3) 9351 ben oracle@o112-c2n1.ora.pwk (TNS V1-V3)
The below shows these processes:
SQL> !ps -ef | egrep '(9350|9351|9797|9798)' ben 9350 7438 0 13:08 pts/0 00:00:00 sqlplus oracle 9351 9350 0 13:09 ? 00:00:00 oraclebeverley1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) oracle 9797 9762 0 13:34 pts/0 00:00:00 sqlplus oracle 9798 9797 0 13:34 ? 00:00:00 oraclebeverley1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
As I am running Linux, I can easily compare the ulimits applying to processes 9351 and 9798:
[ben@linux 9351]$ sudo cat limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited seconds Max file size unlimited unlimited bytes Max data size 131072 131072 bytes Max stack size 33554432 unlimited bytes Max core file size 0 unlimited bytes Max resident set unlimited unlimited bytes Max processes 23551 23551 processes Max open files 1024 1024 files Max locked memory 65536 65536 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 23551 23551 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0 Max realtime timeout unlimited unlimited us [ben@linux 9798]$ sudo cat limits Limit Soft Limit Hard Limit Units Max cpu time unlimited unlimited seconds Max file size unlimited unlimited bytes Max data size 131072 131072 bytes Max stack size 33554432 unlimited bytes Max core file size 0 unlimited bytes Max resident set unlimited unlimited bytes Max processes 131072 131072 processes Max open files 131072 131072 files Max locked memory 51200000000 51200000000 bytes Max address space unlimited unlimited bytes Max file locks unlimited unlimited locks Max pending signals 23551 23551 signals Max msgqueue size 819200 819200 bytes Max nice priority 0 0 Max realtime priority 0 0 Max realtime timeout unlimited unlimited us
The differences are highlighted.
All this brings me onto the ORA-04030 problem, which can be related to the value of max data size. This parameter limits the amount of PGA memory that your session can use. Checking the ulimits is quick and easy, at least on Linux, so if you encounter such a problem this is one place you can easily check. It may save you from having to experiment with undocumented Oracle parameters like _pga_max_size which may not be the cause.
UK OUG conference: RAC & HA special interest group round table
The RAC and HA roundtable gives users a chance to share experience and gotchas. Often the discussion points that generate most interest are when something unusual or unexplained has happened, with the participants attempting to drill down to where the problem was by firing questions at the DBA telling the tale. The discussion this time there was such a discussion around a RAC stretch cluster where the storage connection had been deliberately broken during a test (but not the RAC interconnect between the two sites). The end result of this was a failure of the cluster but after a restart, the database had been rolled back to a point in time several minutes before the failure. This is discussed here: http://davidburnham.wordpress.com/category/oracle-rac-stretched-cluster-testing/
Some participants suggested that Active Data Guard was a better option than a stretch RAC cluster with one saying that Active Data Guard should be the default position with RAC only deployed if Active DataGuard didn’t fit the bill. Unfortunately as a presentation on DataGuard was taking place at the same time, most people using it were presumably elsewhere.
Another discussion point was whether RAC actually reduces the volume of outages, ensuring high availability. Participants generally agreed that it is great for installing rolling updates and so on without disruption – in other words, planned maintenance work – and dealing with some other modes of failure. But sometimes the added complexity could cause unplanned outages that may never have happened on a single instance system. One participant argued that some of the outages cited should not happen, since they were down to human error.
Talk moved on to whether an Oracle Exadata box represents good value for money. One participant mentioned that he could easily build a similar set up from commodity hardware for a fraction of the cost, but it would not include a high-speed interconnect or the Exadata software. He cited some data-warehouse type queries that were sped up by an order of magnitude when implemented using Exadata. This had transformed the attitude to these queries in the company concerned: because they were quick to run, users were more likely to use them. So the answer is that it depends.
A somewhat inconclusive discussion took place on how best to deploy ASM when there are multiple databases: should all databases share the same ASM instance or diskgroup? One participant suggested that ASM was effectively a single point of failure which could cause all databases to fail in the event on a software bug. Another responded that he had been using ASM for four years without any such disruption. I guess ASM should just be viewed as another filesystem for this point of view and what are the chances of GPFS or even raw devices failing?
The number of RAC nodes in a cluster was another talking point. With a two node cluster, in the event of a failure you are left with just a single node which ought to be able to manage the full system load. Many participants seemed to prefer a three or four node cluster as 4 to 3 or 3 to 2 isn’t as dramatic a change as 2 to 1.
So, it was an interesting discussion and definitely a refreshing change from sitting passively through presentations.
1 comment