Investigating BI growth incident

Posted by Rob Fitzpatrick on 20-Nov-2017 13:27

RHEL 5.9
OE 64-bit 11.6.3.017

I'm investigating a recent issue at a client site.  Their BI grew until it hit the threshold, which at the time was 2 GB.  I am analyzing aimage scan verbose output using George's code (thank you George!) and am uncertain how to interpret what I see.

I have identified three very long-running transactions (4328, 7665, and 2179 seconds) so I know why the BI grew.  The next step is to understand what certain other users were doing in that time frame.  I don't know whether it is normal activity or not.  I'll be analyzing more AI files from the week prior to try to figure that out.  Several of them were not doing the typical record/index CUD activity I would expect; they just had a ton of save points:

Trid      Userid   AiNum   BegTime                    EndTime                     Len          Notes     SubTrans  Notes/Sub  NoteList
32560737  alice    73134   Fri Nov 17 09:25:26 2017   Fri Nov 17 10:37:34 2017   4328              6            2          3  (RL_TMSAVE,RL_RMCHG:7)*2
34745864  alice    73145   Fri Nov 17 12:07:12 2017   Fri Nov 17 12:12:38 2017    326        1728970      1728968          1  (RL_TMSAVE)*1728968
34756650  bob      73145   Fri Nov 17 12:11:07 2017   Fri Nov 17 12:15:18 2017    251        1172022      1172020          1  (RL_TMSAVE)*1172020
34766381  bob      73146   Fri Nov 17 12:32:41 2017   Fri Nov 17 12:36:11 2017    210        1086047      1086044          1  (RL_TMSAVE)*1086044,RL_LSTMOD:6
34769079  bob      73147   Fri Nov 17 12:38:17 2017   Fri Nov 17 14:46:02 2017   7665       20010227     20010222          1  (RL_TMSAVE)*20010222,RL_LSTMOD:6*3
34780858  carol    73148   Fri Nov 17 12:57:06 2017   Fri Nov 17 13:33:25 2017   2179        9239104      9239100          1  (RL_TMSAVE)*9239100,RL_LSTMOD:6*2
34838161  carol    73156   Fri Nov 17 16:17:33 2017   ?                           743        4448476      4448475          1  (RL_TMSAVE)*4448475

Can someone help me to understand what these remote ABL users might have been doing to cause all these TMSAVE notes?  Thanks.

All Replies

Posted by George Potemkin on 20-Nov-2017 13:34

The following program will generate only TMSAVE notes:

DO /* LARGE */ TRANSACTION:
  DO vLoop = 1 TO vRepeats ON ERROR UNDO, LEAVE:
    CREATE Table.
    DELETE Table.
  END.
END.

Posted by Rob Fitzpatrick on 20-Nov-2017 13:44

Interesting.  Thanks George.

Posted by George Potemkin on 20-Nov-2017 13:44

BTW, _TableStat/_UserTableStat will show the creates/deletes but _IndexStat/_UserIndexStat will not. In other words, you can use these statistics to identify a table name used in the create/delete statements => xref => the programs run by the clients.

Posted by Rob Fitzpatrick on 20-Nov-2017 13:50

> BTW, _TableStat/_UserTableStat will show the creates/deletes but _IndexStat/_UserIndexStat will not. In other words, you can use these statistics to identify a table name => the programs that wre run by clients.

Thanks, I'll file that away for next time.  Unfortunately I was investigating after a shutdown and restart, so no VST data to look at.

Posted by dbeavon on 20-Nov-2017 14:22

Wow, based on your screenshots it looks like the problem developed over the course of 3.5 hours.  Does that sound right?  I sure wish OE had some sort of reasonable transaction timeout which could be enforced by the DBMS itself.

This problem happens to us regularly.  There are a few factors that are typically involved.  It normally stems from the interactive CHUI users who have open transactions that use pessimistic record locking.  The CHUI user may go to lunch without remembering to close out of their OE database transaction.  The OE dba then steps away for a few hours as well, and nobody gets alerted about BI growth, nor long-running transactions.  Then the problem slowly escalates into a system-wide issue as other similar CHUI users, or batch processes start to become backlogged on the same locked records.

This problem is something we were hoping to be able to troubleshoot with OE Explorer's  "oldest transaction age" property.  I think the transaction age is even more important than the BI usage, since it is the initial indicator that something is going wrong, and you can start sending out alerts long before BI grows to 2 GB. (Assuming transactions shouldn't legitimately last for hours at a time.)

Unfortunately the "oldest transaction age" doesn't seem to be reported properly in OEE.  I just asked about this earlier today: community.progress.com/.../36140

Posted by Rob Fitzpatrick on 20-Nov-2017 15:30

> Wow, based on your screenshots it looks like the problem developed over the course of 3.5 hours.

The anomalous activity started around noon and the DB stalled at 13:48.  I haven't pinpointed yet exactly when the growth started, which wasn't necessarily when the first long-running transaction started.

>  I sure wish OE had some sort of reasonable transaction timeout which could be enforced by the DBMS itself.

The problem with trying to design a generic platform solution is that everyone has their own idea of what is "reasonable".

In this case, the client was not monitoring BI growth or transaction age.  I am having that conversation with them now.

Both are possible programmatically, even if OEE isn't working for you as you expect.  You can look at _trans._trans-duration and _dbstatus._dbstatus-bisize.

Posted by Rob Fitzpatrick on 20-Nov-2017 15:56

Speaking of enhancements, it would be nice if there was the option of a DB log message each time the BI is extended (logically or physically).

Posted by cjbrandt on 20-Nov-2017 16:00

Used to manage a system that had an issue with generating too many Sequences, we noticed it because of BI / AI activity and no matching table / index activity.

This thread is closed