Parallel Enabled Pipelined PL/SQL Functions – an Old Feature That’s New to Me

This post describes an Oracle feature that is several versions old yet new to me (I can feel a series coming on “Old New Features For The Dog-eared DBA”). Apologies if I’m the only DBA unaware of this.

I came across an unusual request recently regarding a batch job that was running for longer than desired. For each record in a list, the job retrieves sensitive data from an external source, processes it and sends it onwards to another external source. Each retrieval takes only a short amount of time, say 0.05 seconds, but with 1,000,000 look-ups to be done the time soon starts to mount up.

It is common for in-house developed batch processes to have custom parallel execution designed in: for example, several processes may run concurrently, each collecting its portion of data into a table. This data is then collated by a watcher process and sent onwards as appropriate. Not one for the purist, perhaps, but perfectly acceptable. As this structure was already in place, the obvious advice would be to utilise it for this process.

Now here comes the unusual bit: the data to be handled was very sensitive, so sensitive that it could not be stored anywhere, including in database tables and archive logs. The data had to be retrieved, processed and sent on without ever being persisted. So the question was:

How can we process sets of records in PL/SQL in a number of parallel streams, not store the resulting data set anywhere and generate a single onward data stream?

I figured pipelined PL/SQL functions must be usable somewhere here and, after a bit of Googling and AskTom searching, I came up with the following mechanism, which I’m recording here as I find it quite exciting and you may too :) . Some links I found very useful:

http://www.oracle-developer.net/display.php?id=207
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:4447489221109

First we create a list of id records to be processed and some types to support the pipelined function.

DROP TABLE id_list;
CREATE TABLE id_list AS
SELECT ROWNUM id FROM all_objects WHERE ROWNUM <= 5000;

CREATE OR REPLACE TYPE private_rec IS OBJECT (
	id 		NUMBER
,	data_1 		NUMBER(16)
,	data_2		NUMBER(20)
,	date_1		DATE
,	date_2		DATE
);
/
CREATE OR REPLACE TYPE private_tab AS TABLE OF private_rec;
/

A word of warning here: don’t be tempted to cut the row count above down too much. I wasted a couple of hours thinking I was missing something, when all I was missing was enough rows for parallelism to kick in.
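On a related note, it can be worth confirming that parallel query servers are actually available before testing, since the function will otherwise silently run serially. The check below is my own suggestion rather than part of the original test:

```sql
-- Hedged sanity check: confirm parallel query servers are configured.
-- parallel_max_servers caps how many slaves the instance can allocate.
SHOW PARAMETER parallel_max_servers

-- v$px_process lists the parallel execution servers currently allocated
SELECT server_name, status FROM v$px_process;
```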

Next we create a pipelined function. It accepts a weakly typed REF CURSOR and, for each row, simulates a look-up to an external data source by sleeping for 0.05 seconds. I won’t go into the details of how the pipelining and partitioning work, as I learned most of what I know about this within the last few days from the rather excellent oracle-developer.net site (plus I don’t fully understand it all myself yet).

CREATE OR REPLACE FUNCTION retrieve_private_data(I_cursor IN SYS_REFCURSOR)
	RETURN private_tab PARALLEL_ENABLE(PARTITION I_cursor BY ANY) PIPELINED 
IS
	TYPE id_tab IS TABLE OF id_list%ROWTYPE;
	L_id_tab id_tab;
	L_num NUMBER := 0;
	L_start_time DATE := SYSDATE;
	L_end_time DATE;
BEGIN
	LOOP
		FETCH I_cursor BULK COLLECT INTO L_id_tab LIMIT 10; 

		L_num := L_num + L_id_tab.COUNT;
		FOR i IN 1 .. L_id_tab.COUNT LOOP
			--each retrieval of sensitive data takes 0.05 seconds
			--(note: DBMS_LOCK.SLEEP requires EXECUTE privilege on DBMS_LOCK)
			dbms_lock.sleep(0.05);

			--in reality this is where the sensitive data would be piped back.
			--PIPE ROW(private_rec(L_id_tab(i).id,dbms_random.value(1,10),dbms_random.value(1,10),NULL,NULL));
		END LOOP;

		--exit only after processing the final (possibly partial) batch
		EXIT WHEN I_cursor%NOTFOUND;
	END LOOP;
	L_end_time := SYSDATE;

	-- pipe a single summary row showing the number of records processed
	-- by this call plus its start and end times
	PIPE ROW(private_rec(L_num,NULL,NULL,L_start_time,L_end_time));

	CLOSE I_cursor;
	RETURN;	-- a pipelined function ends with an empty RETURN
END retrieve_private_data;
/
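To satisfy the original requirement of a single onward data stream with no intermediate storage, the caller can simply fetch from the function and forward each row as it arrives. The sketch below is my own illustration rather than the actual batch job: send_onwards is a hypothetical procedure standing in for whatever transmits the data to the external target.

```sql
-- Minimal sketch of the consuming side, assuming a hypothetical
-- send_onwards procedure that pushes one record to the external target.
BEGIN
	FOR r IN (
		SELECT *
		FROM   table(retrieve_private_data(
		           CURSOR(SELECT /*+ PARALLEL(i 4) */ * FROM id_list i)))
	) LOOP
		-- each row is forwarded immediately; nothing is written to a table
		send_onwards(r.id, r.data_1, r.data_2);	-- hypothetical procedure
	END LOOP;
END;
/
```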

So now we can test a serial call, processing the 5000 rows one after the other. The key information below is the elapsed time of just over 4 minutes, and that the function returned a single row showing that the call processed all 5000 rows.

alter session set nls_date_format ='DD-MON-YYYY HH24:MI:SS';
set timing on
set autotrace on
SELECT * FROM table(retrieve_private_data(CURSOR(SELECT * FROM id_list i)));
set autotrace off

        ID     DATA_1     DATA_2 DATE_1               DATE_2
---------- ---------- ---------- -------------------- --------------------
      5000                       18-MAY-2010 23:02:28 18-MAY-2010 23:06:44

Elapsed: 00:04:16.49

------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                       |  8168 |  2073K|    24   (0)| 00:00:01 |
|   1 |  VIEW                              |                       |  8168 |  2073K|    24   (0)| 00:00:01 |
|   2 |   COLLECTION ITERATOR PICKLER FETCH| RETRIEVE_PRIVATE_DATA |       |       |            |          |
|   3 |    TABLE ACCESS FULL               | ID_LIST               |  5000 | 65000 |     4   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------

And now the bit that I still find quite amazing. We’ll process the data in 4 parallel streams using a PARALLEL hint. Notice again the elapsed time, and the number of records processed as returned by the function: this time there are four rows, one per parallel stream.

alter session set nls_date_format ='DD-MON-YYYY HH24:MI:SS';
set timing on
set autotrace on
SELECT * FROM table(retrieve_private_data(CURSOR(SELECT  /*+ PARALLEL(i 4)*/ * FROM id_list i)));
set autotrace off


        ID     DATA_1     DATA_2 DATE_1               DATE_2
---------- ---------- ---------- -------------------- --------------------
      1058                       18-MAY-2010 23:37:18 18-MAY-2010 23:38:12
      1314                       18-MAY-2010 23:37:18 18-MAY-2010 23:38:25
      1314                       18-MAY-2010 23:37:18 18-MAY-2010 23:38:25
      1314                       18-MAY-2010 23:37:18 18-MAY-2010 23:38:25

Elapsed: 00:01:07.82

---------------------------------------------------------------------------------------------------------
|Id| Operation                            | Name                  | Rows | Bytes | Cost(%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
| 0| SELECT STATEMENT                     |                       | 8168 |  2073K|   24   (0)| 00:00:01 |
| 1|  PX COORDINATOR                      |                       |      |       |           |          |
| 2|   PX SEND QC (RANDOM)                | :TQ10000              | 8168 |  2073K|   24   (0)| 00:00:01 |
| 3|    VIEW                              |                       | 8168 |  2073K|   24   (0)| 00:00:01 |
| 4|     COLLECTION ITERATOR PICKLER FETCH| RETRIEVE_PRIVATE_DATA |      |       |           |          |
| 5|      PX BLOCK ITERATOR               |                       | 5000 | 65000 |    2   (0)| 00:00:01 |
| 6|       TABLE ACCESS FULL              | ID_LIST               | 5000 | 65000 |    2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------
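One way to double-check that the query really did run in parallel (again, just a suggestion on my part rather than something from the original test) is to look at the session’s parallel query statistics after the run:

```sql
-- Hedged check: v$pq_sesstat reports parallel query activity for the
-- current session, e.g. how many queries were parallelised.
SELECT statistic, last_query, session_total
FROM   v$pq_sesstat
WHERE  statistic = 'Queries Parallelized';
```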

The mind boggles at the possibilities :)


2 thoughts on “Parallel Enabled Pipelined PL/SQL Functions – an Old Feature That’s New to Me”

  1. I want to ask whether this parallel pipelined table function feature is also present in Postgres. I am sure pipelined table functions are present, but I don’t know whether they exhibit parallelism. Can somebody help me out with this? Also, I want to ask HOW parallel pipelined table functions parallelise the query. Can somebody explain this step by step, preferably with the help of an example?

  2. Hi Abhishek,
    I’m afraid I don’t know anything about Postgres. My understanding of the Oracle example above is that each parallel slave feeds its data into its own copy of the function and the results are then all brought together. However, the scenario above was me sharing a feature that I’d only just discovered, so I am not the best person to describe the inner workings. I suggest “Ask Tom” is the best place to start searching.
    Thanks for reading the post
    Neil
