Suggestion on batch processing DB records

I am developing a webapp (Visual JSF, EJB3, Hibernate on GlassFish/MySQL) that works alongside an existing legacy system.

I have an 'employee' table (with fields such as name (String), location (String), etc.) that is separate from the employee table on the legacy DB2 side. I do all of the webapp processing against my employee table. However, every week I need to schedule a task that goes through all the employees in my table and compares them against the employees in the legacy DB2 table. If an employee's location has changed in the legacy table, I need to update my employee table to reflect the new location.

What would you suggest as the best way to go about doing this?

Currently I read all the employees into an ArrayList and then loop through each employee entity in the list, fetching the corresponding legacy employee record, comparing locations, and updating my employee entity if a location change is detected.

Since I have close to 50,000 records in my employee table, the initial build of the ArrayList alone takes around 5 minutes, and the employee count will only keep increasing.
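Roughly, the current pass looks like this (a simplified sketch; the Employee class and the in-memory legacy lookup are stand-ins for my real entity and the DB2 query, not actual code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveSync {
    // Stand-in for the JPA entity; field names are assumptions.
    static class Employee {
        long id;
        String name;
        String location;
        Employee(long id, String name, String location) {
            this.id = id; this.name = name; this.location = location;
        }
    }

    // In-memory stand-in for the per-record legacy DB2 lookup (id -> location).
    static Map<Long, String> legacyLocations = new HashMap<>();

    // Loop over every local employee, compare against legacy, update on change.
    static int sync(List<Employee> employees) {
        int updated = 0;
        for (Employee e : employees) {
            String legacyLoc = legacyLocations.get(; // one lookup per record
            if (legacyLoc != null && !legacyLoc.equals(e.location)) {
                e.location = legacyLoc;  // in real code: em.merge(e) or managed update
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        List<Employee> emps = new ArrayList<>();
        emps.add(new Employee(1, "Ann", "NYC"));
        emps.add(new Employee(2, "Bob", "LA"));
        legacyLocations.put(1L, "NYC");  // unchanged
        legacyLocations.put(2L, "SF");   // moved
        System.out.println(sync(emps) + " " + emps.get(1).location); // prints: 1 SF
    }
}
```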

13.10.2009 20:20:00
unfortunately i cannot modify the db2 side in any way, or else i could have set up some sort of triggers.
SibzTer 14.10.2009 01:02:09

Is there a reason why it must be synced only once a week? If not, you might want to spread the operation over the week: do one seventh of the work each day. You could also add a table on your side to keep track of which record was synced when.
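A sketch of that daily slicing, assuming numeric primary keys (the `id % 7` bucket formula is just illustrative; any stable hash of the key works):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class WeeklySlice {
    // Which 1/7th of the table to process today: bucket 0..6.
    static int todaysBucket(LocalDate date) {
        return date.getDayOfWeek().getValue() % 7;  // MON=1..SUN=7 -> 1..6,0
    }

    // Select the ids that fall in today's bucket.
    // In SQL this would be roughly: ... where mod(id, 7) = ?
    static List<Long> idsForBucket(List<Long> allIds, int bucket) {
        List<Long> slice = new ArrayList<>();
        for (long id : allIds) {
            if (id % 7 == bucket) slice.add(id);
        }
        return slice;
    }

    public static void main(String[] args) {
        System.out.println(todaysBucket(LocalDate.of(2009, 10, 14))); // prints 3 (a Wednesday)
    }
}
```

Over seven consecutive days the buckets 0..6 each come up exactly once, so every record gets visited once per week.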

13.10.2009 22:30:36
that is a thought. not sure how i would go about doing that as of now. if nothing else works out, i guess i could do that. and i am using a location_history kind of table to keep track of the location changes. thanks.
SibzTer 14.10.2009 01:06:30
poor man's batching (sqlite dialect)... select id from employee where id % 7 = cast(strftime('%w', 'now') as integer);
Trenton 14.10.2009 02:38:54

I would create a dblink (dblinks do exist on DB2, right?) and do something like:

 select, a.location
 from empl a, empl@link b
 where =
     and a.location <> b.location

Then iterate over the result set, which will contain all the employees whose location has changed.
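If a federated link isn't available, the same changed set can be computed client-side by pulling (id, location) pairs from both sides and diffing them; a minimal sketch with in-memory maps standing in for the two result sets:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocationDiff {
    // Given (id -> location) from both databases, return ids whose location differs.
    static List<Long> changedIds(Map<Long, String> local, Map<Long, String> legacy) {
        List<Long> changed = new ArrayList<>();
        for (Map.Entry<Long, String> e : local.entrySet()) {
            String legacyLoc = legacy.get(e.getKey());
            if (legacyLoc != null && !legacyLoc.equals(e.getValue())) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<Long, String> local = new HashMap<>();
        local.put(1L, "NYC"); local.put(2L, "LA");
        Map<Long, String> legacy = new HashMap<>();
        legacy.put(1L, "NYC"); legacy.put(2L, "SF");
        System.out.println(changedIds(local, legacy)); // prints [2]
    }
}
```

The trade-off: two full scans plus memory for both maps, but only one round trip per side instead of one legacy query per employee.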

13.10.2009 22:36:08

If you have the ability to alter the legacy table in any way, you could add a needs_sync column to it. Then, using a trigger or modifying the code that updates the location, set needs_sync = 1 when you do the update. (Add an index on that column, too.)

Then, to find the records to update:

select id, location
from legacy.employee
where needs_sync = 1

When you've successfully done the sync:

update legacy.employee
set needs_sync = 0
where needs_sync = 1

Do it all in a transaction to avoid a race condition.

This solution has the advantage of only examining records which have been changed, so it will be efficient at runtime. It does require a change in the legacy schema, which might be painful or impossible to do.

13.10.2009 22:51:31
unfortunately, i cannot modify the legacy schema in any way. but thanks for the reply though
SibzTer 14.10.2009 01:03:06

I'm thinking of using the JPA query methods setMaxResults() and setFirstResult() to retrieve employee data in chunks. These methods are used for paginating display data in the UI, so I don't see any reason why I can't do the same here. This way I can process one chunk at a time. I could also throw in a queue and an MDB to process the chunks in parallel, since I can't create threads within the EJB container.
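A rough sketch of the chunked loop. The fetch here is a stand-in for `em.createQuery(...).setFirstResult(first).setMaxResults(size).getResultList()`, and the chunk size is a guess to be tuned; a stable ORDER BY is needed in the real query so pages don't shift between fetches:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedSync {
    static final int CHUNK_SIZE = 500;  // assumption; tune against heap and latency

    // Stand-in for em.createQuery("select e from Employee e order by")
    //               .setFirstResult(first).setMaxResults(max).getResultList()
    static List<Integer> fetchChunk(List<Integer> table, int first, int max) {
        if (first >= table.size()) return new ArrayList<>();
        int to = Math.min(first + max, table.size());
        return new ArrayList<>(table.subList(first, to));
    }

    // Page through the whole table one chunk at a time.
    static int processAll(List<Integer> table) {
        int processed = 0;
        int first = 0;
        while (true) {
            List<Integer> chunk = fetchChunk(table, first, CHUNK_SIZE);
            if (chunk.isEmpty()) break;
            processed += chunk.size();  // compare/update each row here
            first += CHUNK_SIZE;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<>();
        for (int i = 0; i < 1234; i++) table.add(i);
        System.out.println(processAll(table)); // prints 1234
    }
}
```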

14.10.2009 14:48:29
not sure if this is going to work or not. someone pointed out to me that the setMaxResults() and setFirstResult() methods might not be supported properly by db2. need to check that out.
SibzTer 14.10.2009 16:25:41

I am now thinking of using JMS messages, queues and MDBs to try to solve this problem instead. I would send each employee record as a separate message to a queue, and the corresponding MDB would do all the processing and updating for that record. I am hoping to get more simultaneous processing done that way.
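Since the container forbids raw threads, the parallelism would come from the MDB pool. Outside the container, the fan-out idea looks roughly like this, with an ExecutorService standing in for the queue plus MDB pool (all names are illustrative, not real code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FanOutSync {
    // Stands in for the updates each MDB instance would commit.
    static final ConcurrentHashMap<Long, String> updates = new ConcurrentHashMap<>();

    // One "message" per employee id; the task body plays the role of onMessage().
    static void processAll(List<Long> ids) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // MDB pool size stand-in
        for (long id : ids) {
            pool.submit(() -> updates.put(id, "synced"));       // onMessage body
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 10; i++) ids.add(i);
        processAll(ids);
        System.out.println(updates.size()); // prints 10
    }
}
```

One caveat: 50,000 separate messages means 50,000 transactions and deliveries, so batching several ids per message may be worth considering.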

17.11.2009 16:54:20