Performance degradation after dump&load

Posted by Piotr Ryszkiewicz on 07-Oct-2015 07:07

Hello,

I would like to share with you my strange experience with a dump&load of our database. I have been struggling with it for quite a long time now and in fact already have the solution, but I do not understand what really happened, so maybe someone more experienced will be able to explain. I apologise in advance for the rather long post.

The situation is as follows:

- we have a database with just one huge table (ca. 200 GB). Data and indexes are separated into their own areas. The data area has an RPB of 32 and a cluster size of 512. The block size is 8 KB. DBANALYS shows that the average record size is 395 B, with a minimum of 210 B and a maximum of 1668 B. All extents are fixed size. OE is 11.3.3 on 64-bit AIX.

- data in this table is created in the order of its primary index (current date + ID from the sequence). Records are never updated after they are created.

- the database was dumped and loaded (I tried both binary and bulk load) with no changes to its configuration. The dump and load were also done in the order of the primary index.

- after the dump&load of this database we observed massive performance degradation. I was able to narrow the problem down to a simple sequential read of data based on the primary index. Reads through data created by the dump&load were approximately 15 times faster than reads through data created afterwards, during normal production.

- to eliminate the influence of other processes running on the database, I started experimenting on a copy of the database with no other users connected. I created a test program which read the data sequentially, day by day, and after each day looked at the changes in all relevant VSTs. I found that the difference between reading data created by the dump&load and data created in production showed up in _ActIOFile only. Data created by the dump&load was laid out truly sequentially (almost all reads came from the same or adjacent extents), but data created in production was scattered all over the database, so the read was not really sequential, which explains why it was slow. I mean reads from the data area here - reads from the index area were low and sequential in both cases. It looks like the dump&load process left holes in the database which were later filled during production.

- after several days it went back to normal - sequential reads were as fast as before, and the new data was placed sequentially at the end of the database. It looks like all the holes were filled during the initial days of production.

- at this point I contacted technical support, as I was sure it was a bug. It took quite a long time and I did not get a satisfactory answer. They basically said it is a feature and suggested decreasing RPB to 16. I did not like that idea, but I had another one - to play with the CreateLimit of the table.

- at the beginning I tried to simulate the whole process - dump&load and production - without any changes, just to have a baseline to compare against. Unfortunately my first tries were not successful - that is, everything worked well: data created by the simulated production was not scattered at all, but was created nicely at the end of the database. It was clear that the problem was in the production simulation, as the dump&load was done in exactly the same way as before. After some investigation I found out where my simulation programs differed from real production. In the real production process the records in that table are not created with a single ASSIGN statement; the assignment is split into several blocks - but still within a single DB transaction. At this point I remembered one of the sessions at the PUG last year - the record is not allocated at the end of the transaction, but in some cases somewhere in the middle. When I changed my simulation program that way, I got the same result as in production.

- when I found that out, I thought I knew the reason and had the solution. My theory was the following:

When creating a new record the database tries to keep it within a single DB block. During the load process it always knows the size of the current record and the free space in the current block, so if the record fits it is created in the current block, otherwise it moves on to the next block. That leaves free space in blocks, and the average size of this free space should be lower than the average record size. Since the minimum record size is higher than the default CreateLimit, a record should never be pushed to the next block because of hitting the CreateLimit, but always because of its size. Increasing the CreateLimit after the load to a value close to the average record size should mean that during production this empty space is not filled (in most cases at least).

- my first test was to increase the CreateLimit to 350 after the load. Then I started the production simulation and got the expected result - data was created sequentially at the end of the database. I could have stopped at this point and been satisfied, but (un)fortunately I didn't.

- my next test was to perform the load with the CreateLimit set to the minimum value (32). If my theory was correct, it should have had no impact on the load process, as the minimum record size is higher than the default CreateLimit. But that was not true. The load packed the data better than before (the HWM after the load went down from 28124159 to 27628031). As I understand it, that is only possible if records can be split across more (2?) blocks during the load. But that would mean the holes in the blocks should not be bigger than the current CreateLimit. And if so, why were these holes filled afterwards in production, if the CreateLimit was still the same?

- after the load I set the CreateLimit back to the default value (150) and simulated the production. It was still slowed down, but not as dramatically as in the previous case (smaller holes? fewer of them?). On the other hand, the sequential read through the loaded data was slower than before, and my test program showed some scattering. It looks like with a CreateLimit of 32 the database fills some of the holes created during the load within the same load.

That's all I have found out so far. As you can see, the solution to my problem is quite simple - keep the default CreateLimit during the load and increase it afterwards (see the command sketch below). But I would also like to understand what really happened. Are records split during the load or not? Why does the database leave holes during the load which are apparently bigger than the CreateLimit? Is there any way to avoid that?
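For reference, a minimal sketch of how the limit can be raised after the load with proutil (the database and table names are placeholders, and the exact settablecreatelimit argument form may differ by release, so check the documentation for your version):

proutil mydb -C settablecreatelimit old_trans 350

Run it once the load has completed; 350 is just the value from my first test above.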

I would appreciate any suggestions :)

Best Regards,

Piotr

All Replies

Posted by Abri Venter on 07-Oct-2015 07:48

Piotr,

Do you grow the empty DB to size before the load?

Do you set the toss limit (it must be higher than the create limit)? You did mention the create limit.

Have you compacted the indexes before or after the D&L?

proutil dbname -C idxcompact table.index target%

From Progress 10.1 onwards, Progress also "defrags" automatically.

Thank you

A Venter

Posted by Richard Banville on 07-Oct-2015 07:57

I have seen the phenomenon you describe in the past as well. It can be worked around by increasing the create limit as you describe. However, in the case you describe, the parameter to change is the toss limit, not the create limit. In your case, the toss limit should be around 400 for a table with an average record size of 396.

There have been many discussions explaining the ins and outs of create and toss limits so I will not go into that here.

After your initial load, the RM chain for this table was probably really long, with little space left in each block - you could see this with proutil chanalys output. When production ran, it filled in many of the "holes", or more appropriately, used the free space scattered among the blocks on the free chain. Most likely one record per block.
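A concrete sketch of that check (the database name is a placeholder; on an area this size the scan will take a while):

proutil mydb -C chanalys > chanalys_after_load.out

The RM CHAIN ANALYSIS section of the output lists the number of blocks on the RM free chain per object.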

The other assumption is that your primary index is the index most frequently used by the queries you most need to perform well, since that is the one you dumped and loaded by. This is not the case for all applications. For instance, if your customer table has cust-num as the primary index, you may look up customers frequently by cust-num, but that is a random lookup. If you run your reports or complicated queries "by some other field", then that may be the index you want for the D&L.

Posted by Piotr Ryszkiewicz on 07-Oct-2015 08:00

Hello Abri,

Yes, the database was grown to a sufficient size before the load. If it had not been, the load would have failed, as all extents were created with a fixed size.

During the initial load I did not change either the create or the toss limit. For the tests I described in my first post, I set the toss limit in the following way:

- when the create limit was set to 32 or to the default value (150), the toss limit was left at the default (300)

- when the create limit was set to 350, the toss limit was also set to 350.

I did not try idxcompact. How does this "defragmentation" work?

Regards,

Piotr

Posted by Richard Banville on 07-Oct-2015 08:09

Here's how the auto defrag works:

When a record is created, it is inserted using the minimum number of fragments given the record size. As records are updated and require more space, additional fragments can be created if there is no space available in the blocks of the current fragments. The create limit can help with this.

The auto defrag happens at runtime as each record is updated (or deleted and rolled back). All the pieces (fragments) of the record are gathered up into one record buffer. The record buffer is then re-inserted starting with its current record location (since the rowid cannot change). If the record fits in the block with the first fragment, then the record has been completely defragged. If not, a block with enough space to hold the remaining part of the record is searched for. Once the operation is complete, the maximum number of fragments the record will have is the minimum number of fragments given the record size, plus one (maybe).

However, I do not believe this to be your problem. A dbanalys report after the load and after the subsequent production record insert operations can confirm this.
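A sketch of the two reports to compare (names are placeholders; tabanalys is the table-level subset and runs faster than a full dbanalys):

proutil mydb -C dbanalys > dbanalys_after_load.out
proutil mydb -C tabanalys > tabanalys_after_production.out

Comparing the fragmentation figures reported for the table in the two outputs shows whether fragmentation appeared during production.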

Posted by Piotr Ryszkiewicz on 07-Oct-2015 08:15

Hello Rich,

Thanks for the suggestion, I will increase the toss limit to 400 together with the create limit.

Regarding the most frequently used index - you are right, it is not the primary index. But changing the dump&load order would help performance only for a while, as new records would be created in primary index order anyway, and queries are most often run on the newest data, not the old data. I am also going to create a separate area for each of the frequently used indexes.

But that still does not answer my question: why are these holes there after the load?

Regards,

Piotr

Posted by Piotr Ryszkiewicz on 07-Oct-2015 08:22

Rich,

I think you are right - auto defrag would not help here. If I understand correctly, it would only help if some record(s) in the block(s) where a fragmented record resides were deleted, or updated in a way that decreased their size. But that never happens here.

Regards,

Piotr

Posted by Richard Banville on 07-Oct-2015 08:29

I thought I had answered that, but I'll give it another go without the complete RM chain/create limit/toss limit dissertation.

The RM chain maintains the free space used for record creation. When you do a load and the toss limit is not set appropriately, there will be many blocks on this RM free chain with just under 396 bytes free (in your case). The blocks are scattered sequentially throughout the area. This can be proven with a chanalys run just after the load. These blocks will not be used for the average record you are inserting (396 bytes), only for the "outliers" between 300 and 396. BTW, this will make your load time much longer too.

After the load, this RM free chain contains a "list of holes", as you put it, but it is really just a list of blocks (many in your case, I am guessing) that will be candidates for use when inserting new records into this table during production. There is a list of reasons why the application can use these "holes" even though its average record size is 396, mostly to do with when the actual record is physically created (due to release, validate, transaction scope, etc.) and when the record reaches its final average size.

As these blocks with free space for just one record are used, you get the scattering effect you describe.

So why are these holes maintained? It is just a list of free space that is to be used for new record creation. This list is governed by the create and toss limits. The storage engine does not maintain a running list of average record sizes for each table, so you must tell it how best to maintain the RM free list for record insertion (it will work if you don't, it just won't be the most efficient). In your case, you would do this by setting the toss limit for this table to 400.
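A minimal sketch of making that change with proutil (database and table names are placeholders; verify the settabletosslimit argument form for your release):

proutil mydb -C settabletosslimit old_trans 400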

Posted by gus on 07-Oct-2015 08:35

another possibility is that when you rebuilt the indexes after the load, there was no free space in the indexes. that is good for reading the loaded data, but for some period after the load, adding new records will cause index block splits since there is no room for new index entries. this causes a temporary performance degradation since every index insert requires allocating a new index block and updating at least three blocks.

after a while, when there is sufficient free space, things work better.

Posted by Richard Banville on 07-Oct-2015 08:37

One more important thing: the reason the load (or any record creation, for that matter) does not just use a block from the RM free chain containing less than the space needed to create the record in one fragment is to avoid the performance impact of multiple I/Os for a single record read in the future. (Obvious, but I thought I'd mention it for completeness.)

Posted by Piotr Ryszkiewicz on 07-Oct-2015 09:52

Thank you, now it is much clearer to me. It seems my initial theory was basically correct.

If I understand correctly, you suggest setting the create and toss limits to reasonable values (maybe 350 and 400) already during the load process, right? What I didn't mention until now - old records are on average smaller than new ones (as some fields did not exist at the beginning of the application and are therefore not filled).

But what I still do not understand is why the load behaves differently with the create limit set to 32 compared to the default of 150, when the toss limit was set to the same value of 300 in both cases and even the smallest records are bigger than 150 bytes. I assume that during the load there is no record expansion - records are created in one step with their final size, right? I know it makes no sense to set it to 32, I am just curious.

Regards,

Piotr

Posted by Piotr Ryszkiewicz on 07-Oct-2015 10:11

Just one more question, if I may...

When I set the toss limit to some value, when exactly is the RM free chain updated to reflect it?

At this moment the toss limit is set to 350 and I have just run chanalys.

As you supposed, there are a lot of blocks on the list, but some of them show free space below 350 (though not below 300, which was the setting during the load when most of the records were created).

Piotr

Posted by Richard Banville on 07-Oct-2015 10:12

Yes, the create and toss limits should be set correctly during the load process.

They are created in one step with their final size, right? - Yes.

Why does the load behave differently with the create limit set to 32 compared to the default 150? -

The create limit is the amount of space that must remain available in the block after a record is created. It is typically reserved for future record expansion. By decreasing this value, you were able to insert an additional record into some of these blocks during the load, giving you higher packing, but with the drawback that if any record is updated in the future and needs more than 32 bytes of additional space, it will fragment. (This is 32 bytes of reserved space for updates of records in the block - not per record per block.)

Assuming that these records will never grow, setting it lower as you did is a good idea, but after the load I would set it higher once again. However, any growth of these existing records in the future would yield very bad performance, so you really need to be careful about solving the issue this way. That is why I mentioned that the toss limit is the way to go.

Posted by Piotr Ryszkiewicz on 07-Oct-2015 10:24

Now I understand. The word "after" is the key :)

I suggest updating the KB entry describing that:

http://knowledgebase.progress.com/articles/Article/P97487

It is missing the word "after" when describing the create limit. That's why I didn't understand...


Many thanks !


Piotr

Posted by Richard Banville on 07-Oct-2015 10:24

The new toss limit will be honored (acted on) as additional space is requested for new records, or for updates of existing records requiring more space, in this table. As blocks are encountered on the RM free chain that have less free space than either the create limit or the toss limit, they will be removed from the RM free chain. This action is amortized, with up to 100 blocks removed from the chain per new space request (at which time a new cluster of free space will be requested).

There are "tech support" maintenance ways to rebuild the RM free chain immediately through dbrpr, but that is not something generally suggested as a normal maintenance operation.

Posted by George Potemkin on 07-Oct-2015 11:27

Piotr, can you post a tabanalys for the table area? How long is the RM chain?

Posted by George Potemkin on 07-Oct-2015 11:32

> There are "tech support" maintenance ways to rebuild the rm free chain immediately through dbrpr but not something generally suggested as a normal maintenance operation.

BTW, why does dbrpr use two phases to rebuild the chains (a scan of the whole area and then a rebuild phase where dbrpr seems to scan each table's partition)?

Posted by George Potemkin on 07-Oct-2015 11:45

> Reads through data created by the dump&load were approximately 15 times faster than reads through data created afterwards, during normal production.

Most likely it is caused by record fragmentation. Just recently I wrote a program that can analyse record fragmentation in a specified part of an area, as well as do online defragmentation of the records (like idxcompact for indexes). Let me know if you would like to give it a try.

Regards,

George

Posted by Richard Banville on 07-Oct-2015 13:16

> BTW, why does dbrpr use two phases to rebuild the chains (a scan of the whole area and then a rebuild phase where dbrpr seems to scan each table's partition)?

Where do you see it scanning in two phases?

For Type II it should only scan by cluster; if it does not, it is a bug.

Posted by George Potemkin on 07-Oct-2015 13:45

> Where do you see it scanning in two phases?

Example, for sports db:

Scanning .\sports_8.d1
From dbkey 32 to dbkey 2528

Scan of Customer/Order Area extent 1   10% complete
Scan of Customer/Order Area extent 1   20% complete
Scan of Customer/Order Area extent 1   30% complete
Scan of Customer/Order Area extent 1   41% complete
Scan of Customer/Order Area extent 1   50% complete
Scan of Customer/Order Area extent 1   60% complete
Scan of Customer/Order Area extent 1   70% complete
Scan of Customer/Order Area extent 1   80% complete
Scan of Customer/Order Area extent 1   91% complete
Scan of Customer/Order Area extent 1  100% complete
Rebuilt RM chain with 0 blocks on object 0/8.
Rebuilt RM chain with 0 blocks on object 2/1.
Rebuilt RM chain with 0 blocks on object 16/2.
Rebuilt RM chain with 0 blocks on object 4/1.
Rebuilt RM chain with 0 blocks on object 21/2.
Rebuilt RM chain with 0 blocks on object 5/1.

Also, I did timing on a large db with partitioning enabled. The rebuild of RM chains with 1000 blocks (the "Rebuilt RM chain" phase) may take a few minutes.
If we answer Yes to the question "Scan Backward (Yes/No)?" then the scan is very slow (5-6 hours for a 20 GB area) and the result is funny: blocks on the RM chains are sorted in descending order by cluster but in ascending order by dbkey inside each cluster.
If the answer is No then the scan runs much (many times) faster and the chains are rebuilt in the right order.

Posted by Richard Banville on 07-Oct-2015 14:08

I don't see how that tells you directly that it is scanning twice. However, the result of truss or trace does indeed tell us that.

The reason (I am speculating here, as this is not my code) is that the code was written to handle the multiple options available on the "scan" screen, some of which are area-block based (options 1-6) and some of which are cluster based (options 7-9 & 11), and Type II was a bit of an ugly add-on to the existing mechanism. I will file a bug for the double scanning when only a cluster scan is necessary.

As far as the scanning backwards goes, the performance difference seems ridiculous - I'm wondering if the area is scanned backwards for each object or something. The dbkey order would seem incorrect as well, but it sounds like it was done purposefully. I will ask around.

Posted by George Potemkin on 07-Oct-2015 14:22

Here is an example of a chain rebuild for a large db, with timing (and dbanalys timing for comparison):

OpenEdge Release 11.5.1 as of Wed May  6 19:00:27 EDT 2015

08:38:38
Scanning /test/testdb3/largedb3_7.d1
From dbkey 250281920 to dbkey 64

08:38:55 Scan of Table Area extent 1   10% complete   17 secs = 00:00:17
08:39:12 Scan of Table Area extent 1   20% complete   17 secs = 00:00:17
08:39:25 Scan of Table Area extent 1   30% complete   13 secs = 00:00:13
09:03:34 Scan of Table Area extent 1   40% complete 1449 secs = 00:24:09
09:23:19 Scan of Table Area extent 1   50% complete 1185 secs = 00:19:45
09:56:32 Scan of Table Area extent 1   60% complete 1993 secs = 00:33:13
10:28:41 Scan of Table Area extent 1   70% complete 1929 secs = 00:32:09
11:07:19 Scan of Table Area extent 1   80% complete 2318 secs = 00:38:38
11:45:58 Scan of Table Area extent 1   90% complete 2319 secs = 00:38:39
12:24:24 Scan of Table Area extent 1  100% complete 2306 secs = 00:38:26
13546 secs = 03:45:46
Rebuilt RM chain with 0 blocks on master object
         Rebuilt free chain with 505 blocks on master object
         Rebuilt lock chain with 0 blocks on master object
12:51:52 ./dbrpr: warning: forked pid 24371 appears in running job 0
13:27:50 Rebuilt RM chain with 2045948 blocks on table 1, partition 0. 3806 secs = 01:03:26
         Rebuilt free chain with 0 blocks on table 1, partition 0.
         Rebuilt index delete chain with 0 blocks on index 1, partition 0.
         Rebuilt RM chain with 0 blocks on table 13, partition 0.
         Rebuilt free chain with 507 blocks on index 13, partition 0.
         Rebuilt index delete chain with 0 blocks on index 13, partition 0.
13:34:43 Rebuilt RM chain with 1025 blocks on table 1, partition 1. 413 secs = 00:06:53
         Rebuilt free chain with 0 blocks on table 1, partition 1.
         Rebuilt index delete chain with 0 blocks on index 1, partition 1.
13:51:49 Rebuilt RM chain with 1130 blocks on table 1, partition 2. 1026 secs = 00:17:06
         Rebuilt free chain with 0 blocks on table 1, partition 2.
         Rebuilt index delete chain with 0 blocks on index 1, partition 2.
14:09:55 Rebuilt RM chain with 343 blocks on table 1, partition 3. 1086 secs = 00:18:06
         Rebuilt free chain with 0 blocks on table 1, partition 3.
         Rebuilt index delete chain with 0 blocks on index 1, partition 3.
14:28:10 Rebuilt RM chain with 1018 blocks on table 1, partition 4. 1095 secs = 00:18:15
         Rebuilt free chain with 0 blocks on table 1, partition 4.
         Rebuilt index delete chain with 0 blocks on index 1, partition 4.
14:37:53 Rebuilt RM chain with 1079 blocks on table 1, partition 5. 583 secs = 00:09:43
         Rebuilt free chain with 0 blocks on table 1, partition 5.
         Rebuilt index delete chain with 0 blocks on index 1, partition 5.
14:38:51 Rebuilt RM chain with 1082 blocks on table 1, partition 6. 58 secs = 00:00:58
         Rebuilt free chain with 0 blocks on table 1, partition 6.
         Rebuilt index delete chain with 0 blocks on index 1, partition 6.
14:39:28 Rebuilt RM chain with 1095 blocks on table 1, partition 8. 37 secs = 00:00:37
         Rebuilt free chain with 0 blocks on table 1, partition 8.
         Rebuilt index delete chain with 0 blocks on index 1, partition 8.
14:40:42 Rebuilt RM chain with 1076 blocks on table 1, partition 9. 74 secs = 00:01:14
         Rebuilt free chain with 0 blocks on table 1, partition 9.
         Rebuilt index delete chain with 0 blocks on index 1, partition 9.
14:42:11 Rebuilt RM chain with 1149 blocks on table 1, partition 10. 89 secs = 00:01:29
         Rebuilt free chain with 0 blocks on table 1, partition 10.
         Rebuilt index delete chain with 0 blocks on index 1, partition 10.
14:44:43 Rebuilt RM chain with 982 blocks on table 1, partition 11. 152 secs = 00:02:32
         Rebuilt free chain with 0 blocks on table 1, partition 11.
         Rebuilt index delete chain with 0 blocks on index 1, partition 11.
8419 secs = 02:20:19

[2015/09/09@08:38:39.011+0300] P-24372      T-1     I DBUTIL     (451)   Database Repair session begin for root on batch.
[2015/09/09@08:38:39.041+0300] P-24372      T-1     I DBUTIL     (61)    Backward scan
[2015/09/09@14:44:44.761+0300] P-24372      T-1     I DBUTIL     (334)   Database Repair session end.
21965 secs = 06:06:05

                Wed Sep  9 14:57:57 2015
[2015/09/09@14:57:57.030+0300] P-7808       T-1     I DBANALYS : (451)   Dbanalys session begin for root on /dev/pts/2.
[2015/09/09@14:57:57.052+0300] P-7808       T-1     I DBANALYS : (7129)  Usr 0 set name to root.
[2015/09/09@15:09:50.661+0300] P-7808       T-1     I DBANALYS : (334)   Dbanalys session end.
713 secs = 00:11:53

Posted by Piotr Ryszkiewicz on 08-Oct-2015 02:48

> Most likely it is caused by record fragmentation. Just recently I wrote a program that can analyse record fragmentation in a specified part of an area, as well as do online defragmentation of the records (like idxcompact for indexes). Let me know if you would like to give it a try.

HI George,

Sounds interesting, please share :)

Regards,

Piotr

Posted by George Potemkin on 08-Oct-2015 04:21

Hi Piotr,

The program was created only recently. Dmitri Levin and I will present it in Copenhagen. The program successfully passed all my tests, but I don't want to share it publicly before it has been tested on real databases, so I will leave just its description here. Let me know if there are volunteers who are ready to participate in the tests. So far the results are very promising: record defragmentation may speed up reads by a factor of a few (and even more than ten times).

The syntax:

&SCOPED-DEFINE Use-RecCopy-DB FALSE
&SCOPED-DEFINE Allow-Defrag {&Use-RecCopy-DB}

RUN ScanArea({&Allow-Defrag}, AreaNumber, FromOrToBlock, NumOfBlocks).

where

RecCopy is an auxiliary database that you may or may not use during the real defragmentation (see below).

If Allow-Defrag is FALSE then the procedure only creates a few reports:
1) A report similar to tabanalys;
2) The size distribution of the scanned records and, separately, the size distribution of the fragmented records found;
3) The distribution of real RPB and the distribution of slots used by fragmented records. I guess in your case (with a small create limit) all fragmented records will reside in the largest slots of the data blocks.

AreaNumber - which area to scan. Update: the program can be used only for Type II storage areas (SAT2);

FromOrToBlock - either the first or the last block to scan;

NumOfBlocks - if the value is greater than 0 then FromOrToBlock specifies the first block to begin scanning at. Otherwise FromOrToBlock is the last block.

Examples:
RUN ScanArea(FALSE, 7, 1, ?) - scan the whole area # 7
RUN ScanArea(FALSE, 7, 1, 10000) - scan the first 10000 blocks.
RUN ScanArea(FALSE, 7, ?, -10000) - scan the last 10000 blocks (below HWM).

If you allow defragmentation then the fragmented records are copied to the RecCopy table (a real table or a temp-table). When the program has collected enough records (100 records by default) it deletes them all in one transaction and re-creates them all in another transaction. You can use an external database with the RecCopy table to protect your data against loss during a power failure or database crash between those two transactions. The chances of such events are, of course, negligibly low; nevertheless you can use the protection. Use-RecCopy-DB FALSE means the program will use the RecCopy temp-table.
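Just to illustrate the two-transaction batching idea (this is only a stripped-down sketch, not the real AreaDefrag.p; old_trans and the temp-table layout are placeholders, and filling the temp-table with the RAW images of the fragmented records is assumed to have happened already):

/* Toy sketch of the batch step: delete the collected records in one        */
/* transaction, then re-create them from their raw images in a second one.  */
DEFINE TEMP-TABLE ttRecCopy NO-UNDO
    FIELD OldRecid AS RECID
    FIELD RecData  AS RAW.     /* raw image of the fragmented record */

DEFINE BUFFER bTrans FOR old_trans.

DO TRANSACTION:   /* 1st transaction: delete the whole batch */
    FOR EACH ttRecCopy:
        FIND bTrans WHERE RECID(bTrans) = ttRecCopy.OldRecid EXCLUSIVE-LOCK NO-ERROR.
        IF AVAILABLE bTrans THEN DELETE bTrans.
    END.
END.

DO TRANSACTION:   /* 2nd transaction: re-create the batch; the records land in new, unfragmented locations */
    FOR EACH ttRecCopy:
        CREATE bTrans.
        RAW-TRANSFER FIELD ttRecCopy.RecData TO BUFFER bTrans.
    END.
END.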

In my tests the program processed approximately 300 blocks per second. That is, of course, not fast compared with tabanalys, and the speed may also depend on the area's RPB. On the other hand, we can process large areas in a few steps by specifying ranges of blocks to scan (simultaneously or sequentially).

Best regards,
George

P.S. How do I send a private message on this forum?

Posted by Piotr Ryszkiewicz on 08-Oct-2015 04:41

Hi George,

Sounds interesting. I don't think I am brave enough to run it on the real database, but I can give it a try on the test database I used for my testing, if you want. Based on your description I believe it should be able to defragment the database, and setting the toss limit to a proper value should ensure the fragmentation does not appear again.

Best regards,

Piotr

Posted by George Potemkin on 08-Oct-2015 04:48

Piotr,

Send me a message at g.potemkin@progress-tech.ru;

Best regards,

George

Posted by Piotr Ryszkiewicz on 09-Oct-2015 08:15

> Piotr, can you post a tabanalys for the table area? How long is the RM chain?

George, here are the files:

community.progress.com/.../dbanal.zip

I don't know how sharing works on this forum, please let me know if you are not able to download it.

There are 3 files:

dbanal.baseline - after the load with default create and toss limits

dbanal.crlim32 - after the load with create limit 32 and default toss limit

dbanal.tosslim - after the load with create limit 100 and toss limit 400

Posted by George Potemkin on 09-Oct-2015 09:19

Your table:
                    -Record Size (B)-
  Records    Size   Min   Max  Mean  
552001843  203.1G   210  1667   395  
It uses 29,622,159 blocks
Or 18.6 records per block.
Cluster size is 512.

RM chain:
 4,354,789 blocks dbanal.crlim32
 2,641,801 blocks dbanal.tosslim
14,587,876 blocks dbanal.baseline

The length of the RM chain is huge! BTW, the "normal" length of an RM chain, IMHO, should be less than two cluster sizes.

Don't ever try to run the offline dbanalys! It will take a few days (the online dbanalys completed in one and a half hours).

Set the toss limit to 500 bytes and the create limit to 300 bytes or so (but I would also check the size of the template record - AreaDefrag reports it).
I guess there is ONE character field in your table that contributes the most to the record's size. Do you know this field?

Best regards,
George

Posted by Piotr Ryszkiewicz on 09-Oct-2015 09:46

George,

You are quite close... there is one field which takes about 1/4 of the record size - if I compare the RECORD-LENGTH of the record with the LENGTH of that field. Why is that important? And how did you know?

In production I was going to set the create limit to 300 and the toss limit to 400. The create limit was 100 only during the load - because during the load records are created in one step and will never be updated.

I am currently running a test with this setup, but it takes some time.

If I increase the toss limit to 500 as you suggested I will have a shorter RM chain, but also worse space utilization... it would require more testing to say which is better.

The RECORD-LENGTH of the template record is 236. That was a very good question... as it made me think about it. It means that during normal production, when a newly created record is allocated, it takes at least 236 bytes, even if it is not filled yet, right? If the average record size is around 400, that means it will afterwards grow by around 170 bytes (roughly 400 - 236). So the create limit should be at least 170, am I right?

Regards,

Piotr

Posted by George Potemkin on 09-Oct-2015 10:49

> RECORD-LENGTH of template record is 236

Are you sure? Dbanalys says that the minimum record size is 210 bytes. And it was a database right after a load - hence schema versioning does not apply here. The size reported by dbanalys is 2 bytes higher than the value returned by the RECORD-LENGTH() function. That means your table has at least one record with a RECORD-LENGTH of 208 bytes.

> there is one field which takes about 1/4 of record size

I expected a bigger contribution. How large were the record and the field? The maximum record size is 1667 bytes. I'd check the records with a size of 1 KB or larger.

> Why it's important ? And how did you know that ?

There are two types of tables ;-) - the ones where (mean - min) is close to (max - mean) and the ones where log(mean/min) is close to log(max/mean). Your table seems to belong to the second type. The difference is the distribution of record sizes.

> That means that during normal production, when a newly created record is allocated, it takes at least 236 bytes, even if it is not filled yet, right? If the average record size is around 400, that means it will afterwards grow by around 170 bytes. So the create limit should be at least 170, am I right?

Most likely you're right. But I would also check _TableStat, namely the ratio of _Table-create to _Table-update. If it's 1:1 with high precision then you're 100% right. Or check the slots used by the fragmented records in your production database (they are reported by AreaDefrag). I guess you will see slots 15, 16 or so and no slots with low numbers.
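A rough sketch of that check, assuming the session was started with -tablerangesize / -basetable values that cover this table's number and using old_trans as the table name (the full _TableStat field names are spelled out here):

/* Compare creates vs. updates for one table via the _TableStat VST. */
DEFINE VARIABLE dRatio AS DECIMAL NO-UNDO.

FIND FIRST _File WHERE _File._File-Name = "old_trans" NO-LOCK NO-ERROR.
IF AVAILABLE _File THEN
    FIND FIRST _TableStat
         WHERE _TableStat._TableStat-id = _File._File-Number NO-LOCK NO-ERROR.
IF AVAILABLE _TableStat THEN DO:
    dRatio = _TableStat._TableStat-update / MAXIMUM(_TableStat._TableStat-create, 1).
    DISPLAY _TableStat._TableStat-create LABEL "Creates"
            _TableStat._TableStat-update LABEL "Updates"
            dRatio                       LABEL "Upd/Create" FORMAT ">>>9.99".
END.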

Posted by Piotr Ryszkiewicz on 09-Oct-2015 11:22

>> RECORD-LENGTH of template record is 236

>Are you sure? Dbanalys says that the minimum record size is 210 bytes. And it was a database right after a load - hence schema versioning does not apply here. The size reported by dbanalys is 2 bytes higher than the value returned by the RECORD-LENGTH() function. That means your table has at least one record with a RECORD-LENGTH of 208 bytes.

I have not run your program yet, I just checked it with this code:

find first _file where _file-name = "old_trans".

find first old_trans where recid(old_trans) = _file._template no-lock.

disp record-length(old_trans).

I suppose that is correct?

>> there is one field which takes about 1/4 of record size

>I expected a bigger contribution. How large were the record and the field? The maximum record size is 1667 bytes. I'd check the records with a size of 1 KB or larger.

That was for records of average size; for some bigger ones I found (over 1000 bytes) it is more - 1/3 to 1/2.

> That means that during normal production, when a newly created record is allocated, it takes at least 236 bytes, even if it is not filled yet, right? If the average record size is around 400, that means it will afterwards grow by around 170 bytes. So the create limit should be at least 170, am I right?

> Most likely you're right. But I would also check _TableStat, namely the ratio of _Table-create to _Table-update. If it's 1:1 with high precision then you're 100% right. Or check the slots used by the fragmented records in your production database (they are reported by AreaDefrag). I guess you will see slots 15, 16 or so and no slots with low numbers.

I will try AreaDefrag next week. I will have to prepare the fragmented database again. Right now I am experimenting with proper toss and create limits, so the database should not be fragmented as much. But I can run the report on that one as well.

Unfortunately I don't have enough space to keep more than one copy.

Regards,

Piotr

Posted by George Potemkin on 09-Oct-2015 11:43

Piotr,

I seem to have missed the obvious thing: most likely some fields (character fields) in your table have non-empty default values, but these fields can be empty in the real records. That is why their sizes can be smaller than the size of the template record.

Regards,

George

Posted by gus on 09-Oct-2015 15:24

that is because there is a 2 byte length prefix which is included by one and not the other.

gus

> On Oct 9, 2015, at 11:50 AM, George Potemkin wrote:

> The size reported by dbanalys is 2 bytes higher than the value returned by RECORD-LENGTH() function.

Posted by George Potemkin on 10-Oct-2015 03:35

> that is because there is a 2 byte length prefix which is included by one and not the other.

Probably dbanalys should add 4 bytes to the value of RECORD-LENGTH() - the 2-byte length prefix and the 2-byte record offset in the block header. Then the value would accurately represent the space used by a record in a data block. The value reported by dbanalys is neither the gross nor the net record size.

I thought about this choice when I wrote the AreaDefrag program, which is a kind of 4GL analogue of tabanalys: should it report the real space usage, should it mimic tabanalys (the de facto standard), or should it use the RECORD-LENGTH() value (the native choice for 4GL)? Not a big issue though.

Best regards,

George

Posted by Piotr Ryszkiewicz on 13-Oct-2015 11:06

Hello,

We continued the discussion with George in private regarding his AreaDefrag.p. At George's request we are now returning to the public forum.

Here are George's comments on the results I sent him, together with my answers:

***********************************************************************************************

> Piotr,

> I added the timestamps to the names of dbanalys files:

> 1. dbanal.2015-09-19T20.58.02.baseline

default values of create limit and toss limit (just load)

> 2. dbanal.2015-10-02T16.23.09.crlim32 (just load)

Create limit 32, Toss limit 300

> 3. dbanal.2015-10-09T11.34.16.tosslim

Create limit 100, Toss limit 400 (just load)

> 4. dbanal.2015-10-12T13.25.46.tosslim.poload tosslimit 400,createlimit 300

Same as 3, but after creating a few million records with the create limit changed to 300

> Can you specify the toss/create limits for the first 3 dbanalys files?

>> If I understand the results correctly, the toss and create limits during the load were set correctly (2651 fragmented records out of 551 million looks OK),

> I disagree.

> First of all, the load can't prove if the create limit is correct or not.

I was not precise enough here. I mainly meant the toss limit.

> Secondly, as a minimum, a correct toss limit during the data load should result in a short RM chain (shorter than 2 data clusters). But you got:

> RM CHAIN ANALYSIS

> ---------------------------

> Number of          Object    Object

> Blocks             Type

> ------------------------------------------------------------------------------------------------------

> 2716479            Table     PUB.old_trans:1    

> 29622159 block(s) found in the area.

> 9% of all blocks are RM chain.

> You can test the toss limit during the load but you don’t need to load 200 GB of records.

> The area cluster size is 512 blocks. 2 clusters = 1024 blocks. Load 100,000 blocks.

> You have 20 records per block. Hence it's enough to load only 2 million records. That would be less than 1 GB.

I think that having 2 clusters (1024 blocks) on the RM chain is achievable when you have maybe 100,000 blocks, but not with almost 30 million blocks. Moreover, the record length is not homogeneous over time. As the system evolved we added more and more fields, so old records are smaller than new ones. To get down to 1024 blocks on the RM chain I would have to set a very high toss limit, but then I would get a much bigger database with a lot of empty space (mainly when loading the old, small records). Maybe I could calculate an ideal toss limit for each time period and then increase it during the load, but I am not sure it is worth the effort.

> You run AreaDefrag 5 times:

> 1. post_tr.area_7.2015-10-12T13.28.42 Blocks from 513 to 28147199, 315320 frags found, Defrag: not enabled,  Elapsed time: 78429.372

> 2. post_tr.area_7.2015-10-13T12.40.53 Blocks from 27867649 to 28147199, 313002 frags found, Defrag: enabled, Elapsed time: 2668.86

> 3. post_tr.area_7.2015-10-13T13.36.33 Blocks from 27867649 to 28169727, 1722 frags found, Defrag: not enabled, Elapsed time: 867.796

> 4. post_tr.area_7.2015-10-13T13.58.15 Blocks from 27867649 to 28169727, 1747 frags found, Defrag: enabled, Elapsed time: 888.936

> 5. post_tr.area_7.2015-10-13T14.20.15 Blocks from 27867649 to 28169727, 8 frags found, Defrag: not enabled, Elapsed time: 890.01

> Between 12.40.53 and 13.36.33 you seemed to create new records.

> I guess a few new records were also created between 13.58.15 and 14.20.15.

No. I am absolutely sure I did not create any records.

>> 4th run again with Allow-Defrag - it defragmented almost all of the records.

>The program should defragment 100% of the records. Otherwise it will report the reasons why fragmented records were skipped. But the program's log says nothing.

At first sight at your code I agree - it should defragment all records. But apparently it doesn't. Maybe our assumption that the database always keeps a record in one piece when it is created in one block of code is not always true? I will try to analyze your code tomorrow, maybe something will come to my mind.

>Would you mind continuing the discussion on the Community forum?

> I can be wrong in my statements. And I would be glad if other people corrected me.

> You can copy the current discussion to Community as well.

Sure. Here it is.

Best Regards,

Piotr

Posted by George Potemkin on 13-Oct-2015 11:42

Piotr,

> Moreover, record length is not homogenic over time. As system lived we added more and more fields so old records are smaller than new ones. To achieve 1024 blocks in RM chain I would have to set very high toss limit, but then I will get much bigger database with a lot of empty space (mainly when loading old, small records).

You also wrote:

> and then ca. 4.6 million records were created with the production simulation

Does that mean you can create records that imitate production?

> Maybe if I calculate ideal toss limit for each time period and then increase it  during load, but I am not sure if it's worth such effort

You can run chanalys. It will give detailed information about all 2,716,479 blocks on the RM chain. I have a program that loads the "LIST OF RM CHAIN BLOCKS" section from chanalys and sorts the blocks by free space, so we can see how many blocks would stay on the RM chain if the toss limit were increased to any given "free space" value. Based on the chanalys outputs that I have got from our customers, the rule of thumb is: toss limit = mean record size + 20%.

Another approach is the virtual "simulation":

The AreaDefrag.p program creates statistics of record sizes. You don't need to scan the whole area - 1000 or 10,000 records would be enough. We know the rules that Progress uses for the RM chain. We can virtually create "records" with random sizes (simply generate a random "size" value according to the found distribution), add them to virtual "blocks" and count the blocks that would be added to the RM chain based on the Progress rules. I planned to write such a program before the conference, but unfortunately time is limited. ;-(
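A toy version of that idea (only a sketch: the uniform size distribution, the usable bytes per block and the single RM-chain rule used here are deliberate simplifications, and the real create limit and RPB cap are ignored):

/* Virtual "load": fill imaginary 8K blocks with randomly sized records and    */
/* count how many blocks would be left on the RM chain for a given toss limit. */
DEFINE VARIABLE iUsable    AS INTEGER NO-UNDO INITIAL 8000. /* rough usable bytes per 8K block */
DEFINE VARIABLE iTossLimit AS INTEGER NO-UNDO INITIAL 400.
DEFINE VARIABLE iRecSize   AS INTEGER NO-UNDO.
DEFINE VARIABLE iFree      AS INTEGER NO-UNDO.
DEFINE VARIABLE iBlocks    AS INTEGER NO-UNDO.
DEFINE VARIABLE iOnChain   AS INTEGER NO-UNDO.
DEFINE VARIABLE dPct       AS DECIMAL NO-UNDO.
DEFINE VARIABLE i          AS INTEGER NO-UNDO.

ASSIGN iFree = iUsable iBlocks = 1.
DO i = 1 TO 1000000:
    iRecSize = RANDOM(210, 580).     /* crude stand-in: uniform sizes with a mean near 395 bytes */
    IF iRecSize > iFree THEN DO:     /* record does not fit - move on to a new block */
        IF iFree >= iTossLimit THEN
            iOnChain = iOnChain + 1. /* leftover space >= toss limit: block stays on the RM chain */
        ASSIGN iBlocks = iBlocks + 1 iFree = iUsable.
    END.
    iFree = iFree - iRecSize.
END.
dPct = iOnChain / iBlocks * 100.
DISPLAY iBlocks LABEL "Blocks" iOnChain LABEL "On RM chain" dPct LABEL "Pct" FORMAT ">>9.99".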

> At first sight at your code I agree - it should defragment all records. But apparently it doesn't. Maybe our assumption that the database always keeps a record in one piece when it is created in one block of code is not always true?

I did not see such results during my tests.

Were you the only one connected to the database during the tests? What does the db log say?

At the end of an AreaDefrag.p run you can dump RecCopy.OldRecid and RecCopy.NewRecid. Or you can use the persistent RecCopy table (instead of the temp-table). That way you can check the last records that were defragmented by AreaDefrag.p.

Another way: enable after-imaging, and an aimage scan will show who changed the data in your database, and when.

Best regards,

George

Posted by George Potemkin on 13-Oct-2015 12:15

> Were you the only one connected to the database during the tests? What does the db log say?

I'd like to withdraw that question. The data you sent me proves that no new records were added.

Another possible explanation: the AreaDefrag.p program does not check all RPB slots in each data block - otherwise it would run much slower. You scanned a part of the area just below the HWM. The records re-created during defragmentation are moved exactly there. The next run after defragmentation reads the same blocks but with new contents, and so it can see records missed by its predecessor.

Anyway, the goal of AreaDefrag.p is just to /reduce/ the number of fragmented records.

Posted by Piotr Ryszkiewicz on 14-Oct-2015 03:07

George,

> Does it mean that you can create the records that imitate the production?

It is not 100% exact, but similar enough, I think. I dump 5 days of real data. Then I run a few parallel processes which read the dumped data into a temp-table, change the values of the fields that are part of the primary index, and copy them to the table - not in one piece, but with several assigns. Meanwhile another process does random reads, just to be closer to real life.

> You can run chanalys. It will give detailed information about all 2,716,479 blocks on the RM chain. I have a program that loads the "LIST OF RM CHAIN BLOCKS" section from chanalys and sorts the blocks by free space, so we can see how many blocks would stay on the RM chain if the toss limit were increased to any given "free space" value. Based on the chanalys outputs that I have got from our customers, the rule of thumb is: toss limit = mean record size + 20%.

I have just started chanalys; it will take some time, I suppose. Would you be so kind as to share this program?

Regards,

Piotr

Posted by George Potemkin on 14-Oct-2015 04:41

Piotr,

I sent you the program with my comments offline. In short: the program loads the chanalys output from Progress V10 but not from V11. I hope that can be fixed easily.

This thread is closed