On a client site today I came across a large piece of code that had been running for four years, written by senior people, and it looked very correct. However, it was failing to carry out all the work it was designed to do. My job was to fix it, and a fair bit of investigation and head scratching followed. Eventually the failing code was found. It was a DELETE statement of the form:
DELETE emp WHERE (employee_id, department) IN
(SELECT employee_id, department FROM emp WHERE name = 'SMITH'
UNION
SELECT employee_id, department FROM emp WHERE name = 'JONES');
only a lot more complicated.
I have seen code like this a fair bit. The above example should delete all people from the emp table whose name is either 'SMITH' or 'JONES'. However, that is not what it does. Witness the following from SQL*Plus:
SQL> SELECT 1 FROM dual WHERE (1,2) IN (SELECT 1,2 FROM dual);
1
----------
1
All good so far. But this surprised me:
SQL> SELECT 1 FROM dual WHERE (1,null) IN (SELECT 1,null FROM dual);

no rows selected
Yup, it didn't find a result. When comparing the row values here, Oracle compares the NULLs using the equality operator, not IS NULL, and NULL = NULL evaluates to unknown rather than true, so the row never matches.
Looking back at the first example above, if the employee_id or department is NULL in any row in the emp table then that record is NOT going to get deleted by this statement even if the name is 'SMITH' or 'JONES'. A silent fail.
The solution in my client's case was to use NVL() to substitute a value for employee_id and department in both the DELETE's WHERE clause and both of the SELECTs, taking care to choose a substitute value that cannot occur in the real data. The correct rows were then deleted.
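Here is a small runnable sketch of both the silent fail and the workaround. Python's sqlite3 is used purely for illustration (SQLite follows the same SQL semantics for NULLs in row-value IN comparisons, and COALESCE stands in for Oracle's NVL); the table contents and the '~none~' sentinel are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (employee_id INTEGER, department TEXT, name TEXT)")
cur.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [(1, "HR", "SMITH"), (2, None, "JONES"), (3, "IT", "BROWN")])

# Naive delete: JONES survives because (2, NULL) = (2, NULL) evaluates to
# unknown, so the IN never matches that row.
cur.execute("""DELETE FROM emp WHERE (employee_id, department) IN
               (SELECT employee_id, department FROM emp
                WHERE name IN ('SMITH', 'JONES'))""")
survivors = [r[0] for r in cur.execute("SELECT name FROM emp ORDER BY name")]
print(survivors)  # ['BROWN', 'JONES'] -- JONES silently survived

# Workaround: map NULL to a sentinel on BOTH sides of the comparison,
# using a value that cannot occur in the real data.
cur.execute("""DELETE FROM emp
               WHERE (employee_id, COALESCE(department, '~none~')) IN
               (SELECT employee_id, COALESCE(department, '~none~') FROM emp
                WHERE name IN ('SMITH', 'JONES'))""")
survivors = [r[0] for r in cur.execute("SELECT name FROM emp ORDER BY name")]
print(survivors)  # ['BROWN']
```

The key point is that the sentinel substitution must appear identically in the outer WHERE and inside the subquery, or the two sides will still fail to match.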
Friday, 1 May 2009
Tuesday, 24 March 2009
Performance, integrity and the use of the JDBC Savepoint vs. a JDBC Commit.
I'm processing a very large record set in a standalone Java program I have written. For performance reasons, I commit changes to the Oracle server only once every 1000 records.
While delaying commits is a common way of increasing performance in a large load, it nearly always comes at the cost of the integrity of the application. What are you sacrificing? The ability to define what a transaction means to that application. Even if all you are doing is loading heaps of identical objects into a database, it could be said that each of those objects defines a transaction, and by not committing after each one you are taking the risk that a failure in a later, completely unrelated record will undo this one.
As an example, say you successfully insert a record into the database. This record comprises a logical unit of data; for instance, say it contains the details of a transaction selling a particular green car. Because we aren't committing, this data is not permanently saved to the database yet. Now let's say that several hundred successful inserts later, a record about a completely separate car sale (a red car) contains a bad value in a field and fails to insert. As a result, a rollback is triggered, and even though our original green car sale data was 100% correct, through no fault of its own it gets pulled out of the database. Bummer.
For this reason, I decided to try out JDBC savepoints during this load. To cut a long story short, a savepoint is the commit you have when you don't have a commit. It defines a point where you can say a logical transaction begins. In the above situation, I would issue a savepoint after successfully inserting each record: one after the green car sale, and one after each of the other several hundred successful inserts after that. When the red car sale record goes in and fails, we simply roll back to the most recent savepoint. This means the green car record is not affected, and when we eventually commit, we have a better set of data than we would have without the savepoints.
So, given that we are trying for performance here, how well does a savepoint perform relative to a commit? Much better. Without having measured it precisely, having savepoints in the tight insert loop certainly slowed the code down, but nowhere near as dramatically as having commits in there. It is definitely worth considering if you worry about data integrity like I do and you don't want the huge overhead of commits to maintain that integrity.
Unfortunately, my customer needed superb performance during the raw load, so no commits or savepoints for me.
See java.sql.Savepoint and Connection.rollback(Savepoint) in the JDK documentation for more info.
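The per-record savepoint pattern described above can be sketched as follows. In JDBC this would be conn.setSavepoint() and conn.rollback(savepoint); here Python's sqlite3 with SQL SAVEPOINT statements is used purely to keep the example self-contained and runnable, and the table and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
cur = conn.cursor()
cur.execute("CREATE TABLE sales (car TEXT NOT NULL)")

cur.execute("BEGIN")  # one big transaction, committed once at the end
for car in ["green", "blue", None, "red"]:  # None will violate NOT NULL
    cur.execute("SAVEPOINT last_good")  # per-record recovery point
    try:
        cur.execute("INSERT INTO sales VALUES (?)", (car,))
        cur.execute("RELEASE SAVEPOINT last_good")
    except sqlite3.IntegrityError:
        # undo only the failed insert; every earlier record is preserved
        cur.execute("ROLLBACK TO SAVEPOINT last_good")
cur.execute("COMMIT")

loaded = [r[0] for r in cur.execute("SELECT car FROM sales")]
print(loaded)  # ['green', 'blue', 'red'] -- only the bad record was dropped
```

The bad red-car-style record rolls back alone, while the green car and every other good record survive to the single final commit.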
Wednesday, 18 March 2009
Oracle JDBC Update Batching may insert bad data without warning.
Recently I was working on a piece of Java code that parsed a very large (3 GB) XML file and loaded it into an Oracle database in a relational structure. This allowed an existing legacy job that used that structure to continue to function without change.
A SAX processing class that feeds a database writing class is the most efficient way of doing this, and after some coding I had a job that would run and take about an hour to parse the XML and insert the several million records required.
As my code was running many millions of inserts using the same Prepared Statements, I thought I would try and speed up the code by using some of the Oracle extensions to JDBC around update batching. These performance extensions batch up prepared statement updates, inserts and deletes and send them to the database in batches instead of individually over the line, thus saving execution time.
I was impressed when this cut the run time of my job from about an hour down to 12 minutes, a fivefold speedup.
Upon further examination of some of the data, though, I found that all was not as it appeared. While most of my statements worked flawlessly, and the correct number of records and the correct data were usually present in the database, in the odd table the data was incorrect. It seems the Oracle performance extensions are not as transparent as they imply. What was wrong? The documentation states that batching is good for any statement you repeatedly execute with bind variables.
To cut a long story short, the following sort of prepared statement is ideal for the Oracle Update Batching performance extensions:
INSERT INTO emp (a, b, c) VALUES (?, ?, ?)
If I batch up 20 execute updates using the above statement then I will get 20 correct inserts, and it will run in about a quarter of the time - excellent!
But unfortunately there's a gotcha. This is not an ideal statement:
INSERT INTO emp (a, b, c) VALUES (?, ?, (SELECT deptno FROM dept WHERE deptname = ?))
Say you batch up 20 executions of the above prepared statement and send them to the database. While the first two bind variables will be populated correctly for every record inserted, the SELECT part of the statement will be executed only ONCE for all 20 statements, even though its result should clearly differ per statement. The other 19 inserts will get the same value from the SELECT as the first lookup, and will thus be incorrect.
The solution here is to disable update batching explicitly for the above type of prepared statements when you see them, otherwise you will get the wrong data in the database without warning and no exceptions will be raised.
This is a bit of an omission in the Oracle documentation and a real gotcha. Oracle does detect some situations when batching will not work and internally switches it off, and I think it should detect this one too.
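The Oracle-specific mis-sharing of the embedded SELECT can't be reproduced outside Oracle, but the defensive pattern is portable: resolve the lookup per record first, then batch a statement containing only plain bind variables. A sketch using Python's sqlite3 and invented table names, standing in for the JDBC executeBatch-style flow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dept (deptno INTEGER, deptname TEXT)")
cur.execute("CREATE TABLE emp (a TEXT, b TEXT, c INTEGER)")
cur.executemany("INSERT INTO dept VALUES (?, ?)", [(10, "SALES"), (20, "IT")])

rows = [("x1", "y1", "SALES"), ("x2", "y2", "IT")]

# Resolve the lookup per record BEFORE batching, so the batched statement
# contains only plain binds and no embedded SELECT to be mis-shared.
def deptno_for(name):
    return cur.execute("SELECT deptno FROM dept WHERE deptname = ?",
                       (name,)).fetchone()[0]

resolved = [(a, b, deptno_for(name)) for a, b, name in rows]
cur.executemany("INSERT INTO emp (a, b, c) VALUES (?, ?, ?)", resolved)
conn.commit()

print([r[0] for r in cur.execute("SELECT c FROM emp ORDER BY a")])  # [10, 20]
```

The per-record lookups cost extra round trips, but each batched insert carries its own correct deptno, which is exactly the property the embedded-SELECT form loses under Oracle update batching.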