Wednesday, 18 March 2009

Oracle JDBC Update Batching may insert bad data without warning.

Recently I was working on a piece of java code that parsed a very large (3Gb) XML and loaded it into an Oracle database in a relational structure. This allowed an existing legacy job that used that structure to continue to function without change.

A SAX processing class that feeds a database writing class is the most efficient way of doing this, and after some coding I had a job that would run and take about an hour to parse the XML and insert the several million records required.

As my code was running many millions of inserts using the same Prepared Statements, I thought I would try and speed up the code by using some of the Oracle extensions to JDBC around update batching. These performance extensions batch up prepared statement updates, inserts and deletes and send them to the database in batches instead of individually over the line, thus saving execution time.

I was impressed when this cut the run time of my job down to 12 minutes, a 500% speedup.

Upon further examination of some of the data though I found that all was not as it appeared and while most of my statements worked flawlessly and often the correct number of records and the correct data were present in the database, in the odd table data had changed and was incorrect. It seems that the Oracle performance extensions were not as transparent as they imply. What was wrong? - the documentation states that this is good for anything where you repeat the same statements with the same bind variables.

To cut a long story short, the following sort of prepared statement is ideal for the Oracle Update Batching performance extensions:

INSERT INTO emp (a,b,c) values ?, ?, ?

If I batch up 20 execute updates using the above statement then I will get 20 correct inserts, and it will run in about a quarter of the time - excellent!

But unfortunately a gotcha. This is not an ideal statement:

INSERT INTO emp (a,b,c) values ?,?, (SELECT deptno FROM dept WHERE deptname = ?)

Say you batch up 20 executes of the above prepared statement and send them to the database. Whilst the first two bind variables in the statement above will be populated correctly for every record inserted, the select part of the statement will only be executed ONCE for all 20 statements, even though the result of the select should clearly be different for every statement. The other 19 inserts will get the same value from the select as the first lookup, and thus be incorrect.

The solution here is to disable update batching explicitly for the above type of prepared statements when you see them, otherwise you will get the wrong data in the database without warning and no exceptions will be raised.

This is a bit of an omission in the Oracle documentation and a real gotcha. Oracle does detect some situations when batching will not work and internally switches it off, and I think it should detect this one too.

No comments: