Content Upload

Posted by ankitshukla on 14-May-2015 10:53

Which approach is more optimized for loading data from a text file into the database?
Note: I have more than 10 lacs of records in the text file.

1) Text file --> temp-table --> database (committing records in batches of 100 per transaction)

Input stream <instr> from.....
Repeat:
    create tt-upload-data.
    Import stream <instr> delimiter "|" tt-upload-data.
End.
Input stream <instr> close.

find first tt-upload-data no-error.
Do while avail tt-upload-data:
    <counter initialization>.
    do transaction:
        do while avail tt-upload-data and <counter> < 100:

            <database update : committing 100 records in one transaction>
            <counter increment>.

            find next tt-upload-data no-error.
        End.
    end.
end.

2) Text-file --> database (direct commit)

Input stream <instr> from.....
Repeat:
    Import stream <instr> unformatted <str>.
    if <str> <> "" then do:
        <database update>
    end.
end.
Input stream <instr> close.

All Replies

Posted by TheMadDBA on 14-May-2015 12:02

2 will be faster than 1 because it will not create a huge temp-table.

You can combine the two approaches and make it easier on the database and your code. Create/Import the temp-table records... increment a counter while you do it. If you get to 100 records then run through the entire temp-table and create your DB records in one transaction. Then empty the temp-table either as you go along or after the transaction. Just make sure to run the same temp-table to DB update after the input stream close to catch any remaining records.

Best to just put the temp-table DB update logic in a procedure/function and call it from inside/outside the loop.
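Something like this rough sketch, for example (the temp-table fields, the dbTable table/fields, the file name and the flush-batch procedure name are just placeholders I'm assuming; adjust them to your schema):

/* Rough sketch only: batch 100 temp-table records per DB transaction,  */
/* then empty the temp-table and carry on reading the file.             */
DEFINE TEMP-TABLE tt-upload-data NO-UNDO
    FIELD cField1 AS CHARACTER
    FIELD cField2 AS CHARACTER.

DEFINE STREAM instr.
DEFINE VARIABLE iCount AS INTEGER NO-UNDO.

INPUT STREAM instr FROM VALUE("upload.txt").

REPEAT:
    CREATE tt-upload-data.
    IMPORT STREAM instr DELIMITER "|" tt-upload-data.
    iCount = iCount + 1.

    IF iCount >= 100 THEN DO:
        RUN flush-batch.
        iCount = 0.
    END.
END.

INPUT STREAM instr CLOSE.

/* IMPORT at end-of-file leaves one empty temp-table record behind: drop it. */
FOR EACH tt-upload-data WHERE tt-upload-data.cField1 = "":
    DELETE tt-upload-data.
END.

/* Catch whatever is left over after the last full batch of 100. */
RUN flush-batch.

PROCEDURE flush-batch:
    /* One DB transaction for the whole batch of temp-table records. */
    DO TRANSACTION:
        FOR EACH tt-upload-data:
            CREATE dbTable.                       /* placeholder table */
            ASSIGN dbTable.Field1 = tt-upload-data.cField1
                   dbTable.Field2 = tt-upload-data.cField2.
        END.
    END.
    EMPTY TEMP-TABLE tt-upload-data.
END PROCEDURE.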

Posted by OctavioOlguin on 15-May-2015 10:39

Sorry, pardon my Spanish, but what is a "lac of records"?

Posted by OctavioOlguin on 15-May-2015 10:41

wikipedia: Lac is a unit in the South Asian numbering system equal to one hundred thousand (100,000) ... ?

Posted by TheMadDBA on 15-May-2015 10:50

Correct. Basically they are loading at least a million records. That is one pretty big temp-table :)

Posted by OctavioOlguin on 15-May-2015 11:31

DEFINE VAR v1 AS...
DEFINE VAR v2 AS...
DEFINE VAR v3 AS...
DEFINE VAR v4 AS...

INPUT FROM <yourFile>.

REPEAT:
   IMPORT v1 v2 v3 v4.
   /* Skip records that were already loaded in a previous run. */
   FIND dbTable NO-LOCK
      WHERE dbTable.PKeyField1  = v1
      AND   dbTable.PKeyField2  = v2 ...
      NO-ERROR.
   IF NOT AVAILABLE dbTable THEN DO TRANSACTION:
         CREATE dbTable.
         ASSIGN dbTable.Field1 = v1
                dbTable.Field2 = v2
                dbTable.Field.......
         RELEASE dbTable.
   END.   /* trx */
END.

INPUT CLOSE.


That way you accomplish the objective of having small transactions, and you can resume broken uploads.

Once I did the same kind of load, and one problem I couldn't overcome easily was truncating the source file (4 GB in size, or more, I don't remember) when something failed halfway through the load, so I decided to let the DB discriminate the previously loaded records.

And then, some days later, I got the records loaded.

Posted by TheMadDBA on 15-May-2015 11:35

Usually you don't want to have small transactions when loading records in bulk. How many per transaction depends on a lot of factors (hardware, record size, version, etc.), but 100 per transaction is usually a good start and works well in most cases.

You can still do error handling and lookups to handle partial loads obviously.

Posted by OctavioOlguin on 15-May-2015 11:38

Yes, a FIND NO-LOCK is two trips to the database, and CAN-FIND() is one trip to the index... (or so)
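For example, with the placeholder table/fields from my earlier post, the existence check could be written as:

/* CAN-FIND() probes the index without fetching the record into a buffer. */
IF NOT CAN-FIND(FIRST dbTable
                WHERE dbTable.PKeyField1 = v1
                  AND dbTable.PKeyField2 = v2) THEN
DO TRANSACTION:
    CREATE dbTable.
    ASSIGN dbTable.Field1 = v1
           dbTable.Field2 = v2.
END.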

Posted by ankitshukla on 15-May-2015 13:35

Yes, I am using CAN-FIND() wherever it is possible.

One more thing: when will a delete operation be faster?

A) if we delete all records one per transaction, as in

"for each <table> exclusive-lock:

  delete <table>.

end."

B) deleting 100 records per transaction
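For B) I mean something like this rough sketch (the table name is just a placeholder):

/* Rough sketch of option B: delete in batches of 100 per transaction. */
DEFINE VARIABLE iDel AS INTEGER NO-UNDO.

REPEAT:
    iDel = 0.
    DO TRANSACTION:
        FOR EACH myTable EXCLUSIVE-LOCK:     /* placeholder table name */
            DELETE myTable.
            iDel = iDel + 1.
            IF iDel >= 100 THEN LEAVE.       /* end this batch */
        END.
    END.                                     /* the batch commits here */
    IF iDel < 100 THEN LEAVE.                /* nothing (or only a partial batch) left */
END.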

Posted by TheMadDBA on 15-May-2015 13:55

In the ancient days it was usually faster to do them in batches of 100.  In current versions with proper -spin settings it is usually just as fast or faster to delete one per transaction.

I would not advise having transient tables that get loaded/deleted frequently in your main DB. Things that you would do on Oracle (no logging for a table, no caching for a table, etc) don't really apply to Progress/OpenEdge.

If this is something you do on a regular basis, investigate putting them into a different DB and connecting to both when needed.
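For example (the connection parameters and procedure name here are hypothetical), the scratch database only needs to be connected while the job runs:

/* Connect the scratch/staging DB only for the duration of the bulk job. */
CONNECT VALUE("-db scratchdb -H dbhost -S 20001") NO-ERROR.
IF ERROR-STATUS:ERROR THEN
    RETURN ERROR "Could not connect to scratchdb".

/* Keep the load/purge logic in its own .p compiled against scratchdb,  */
/* so the main application never needs that schema at compile time.     */
RUN bulk-load-scratch.p.

DISCONNECT scratchdb.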

Posted by George Potemkin on 16-May-2015 03:34

> In the ancient days it was usually faster to do them in batches of 100. In current versions with proper -spin settings it is usually just as fast or faster to delete one per transaction.

I'm forced to disagree.

Increasing the number of records processed per transaction (RPT) decreases the amount of recovery notes (AI/BI notes) the transaction generates per record, but only by about two times (in the best case) between 1 and 100 RPT, and the change becomes negligible for RPT higher than 100.
The simplest transaction consists of the following notes:
RL_TBGN + RL_RMCR + RL_CXINS + RL_TEND
The transaction creates one record (RL_RMCR) and one index key (RL_CXINS); RL_TBGN/RL_TEND bracket the transaction.


The minimal size of any recovery note is 32 bytes.
The size of RL_TBGN/RL_TEND is 60 bytes (plus length(USERID) for the RL_TBGN note, but let's assume we run the transaction as a blank user).
The size of RL_RMCR includes the size of the record. The minimal record size is 16 bytes (but in typical cases the record size is 100-300 bytes), so let's estimate the minimal size of RL_RMCR as 48 bytes (which is, of course, an underestimate).
I did not check the size of RL_CXINS notes, but let's estimate it at the minimum: 32 bytes.
So the total size of the recovery notes generated by our simplest transaction is 200 bytes (= 60 + 48 + 32 + 60).
If we create n records per transaction, then the size of the recovery notes per record will be
(60 + n * (48 + 32) + 60) / n = 80 + 120/n

Recs  RL_Size (bytes/record)
   1  200
  10   92
 100   81
1000   80

The larger the records, the less benefit there is from a high RPT, because the fixed 120 bytes of RL_TBGN/RL_TEND overhead that batching amortizes becomes small relative to the per-record notes.


During a transaction Progress takes a lot of latch locks. With large transactions we commit fewer transactions per second, which does slightly decrease latch usage, but as everybody knows latches are very short-term resource locks: the lock duration is much, much shorter than any operation on a record. So the total number of latch locks depends only slightly on the number of records per transaction. Also, the -spin parameter helps to resolve latch /conflicts/; it does not care how large the transactions are. And if other sessions mainly use/lock different latches, then the spin will not help either, because there is no competition for shared memory resources.

Posted by George Potemkin on 18-May-2015 01:19

By the way, we can use a similar idea for sequences. Set a sequence increment to, let's say, 100. Then when a Progress session asks for the next sequence value, it can use that range of values either to create 100 new records in one transaction or just to keep them as a pool of values for future use. Such an approach will significantly reduce the number of requests to the sequence block and will increase performance in cases where sequences are actively used to create records in different tables. Unfortunately, it will not help much if concurrent sessions are creating records in the same table, because access to the data block at the head of the RM chain will remain a bottleneck (of the same "insalubrity" as access to the sequence block was).
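A rough sketch of the first variant (the sequence and table/field names are made up, and the sequence's increment is assumed to have been set to 100 in the Data Dictionary):

/* One NEXT-VALUE call reserves a block of 100 key values because the   */
/* sequence increment is 100; the session hands them out locally.       */
DEFINE VARIABLE iBase AS INTEGER NO-UNDO.
DEFINE VARIABLE i     AS INTEGER NO-UNDO.

iBase = NEXT-VALUE(next-id-seq).         /* one hit on the sequence block */

DO TRANSACTION:
    DO i = 0 TO 99:
        CREATE myTable.                  /* placeholder table */
        ASSIGN myTable.iKey = iBase + i.
    END.
END.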

Posted by ankitshukla on 18-May-2015 03:28

Thanks George and TheMadDBA for such important information.

Can you please let me know about any documentation for performance tuning in Progress 4GL, and also any documentation for DB concepts?

Posted by George Potemkin on 18-May-2015 05:54

> Can you please let me know about any documentation for performance tuning in Progress 4GL

Check Dan Foreman's books:

www.bravepoint.com/products-publications.shtml

> and also any documentation for DB concepts?

The Engine Crew Monographs

www.fast4gl.com/.../monographs

Posted by TheMadDBA on 18-May-2015 08:01

Interesting stuff George.

I have never noticed any measurable difference. But then again I normally don't do mass deletes on my production databases with users connected and AI running. Nothing over a few million records at a time anyways :)
