DB Crash: Out of free shared memory. Use -Mxs to increase

Posted by Steven Peeters on 08-Mar-2018 06:00

Hi,

HPUX 11.31 / OE 11.6.3

We are performing some tests of large transactions on our SAII test database.
After deleting a large number of records in one transaction (BI growth > 2 GB) and then canceling the still-running procedure (CTRL-C in an editor session connected directly to the DB), our database did a forced shutdown while trying to undo the deletes:

[2018/03/08@12:28:32.519+0100] P-23126 T-1 F ABL 22: (6495) Out of free shared memory. Use -Mxs to increase.
[2018/03/08@12:28:32.527+0100] P-17493 T-1 I BIW 11: (453) Logout by root on batch.
[2018/03/08@12:28:32.531+0100] P-23126 T-1 I ABL 22: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 17
[2018/03/08@12:28:32.531+0100] P-23126 T-1 I ABL 22: (5029) SYSTEM ERROR: Releasing multiplexed latch. latchId: 2865857
[2018/03/08@12:28:32.531+0100] P-23126 T-1 I ABL 22: (-----) SYSTEM ERROR: dbenvout(a) released latch.
[2018/03/08@12:28:32.531+0100] P-23126 T-1 F ABL 22: (5026) User 22 died holding 2 shared memory locks.
[2018/03/08@12:28:32.552+0100] P-23126 T-1 I ABL 22: (439) ** Save file named core for analysis by Progress Software Corporation.
[2018/03/08@12:28:32.563+0100] P-17496 T-1 I APW 12: (453) Logout by root on batch.
[2018/03/08@12:28:32.564+0100] P-17502 T-1 I APW 14: (453) Logout by root on batch.
[2018/03/08@12:28:32.565+0100] P-17508 T-1 I APW 16: (453) Logout by root on batch.
[2018/03/08@12:28:32.566+0100] P-17505 T-1 I APW 15: (453) Logout by root on batch.
[2018/03/08@12:28:32.577+0100] P-17489 T-1 I BROKER 0: (15192) The database will complete shutdown within approximately 60 seconds.
[2018/03/08@12:28:32.586+0100] P-17489 T-1 I BROKER 0: (2249) Begin ABNORMAL shutdown code 2
[2018/03/08@12:28:42.685+0100] P-17489 T-1 I BROKER 0: (2527) Disconnecting dead user 22.
[2018/03/08@12:29:35.206+0100] P-17489 T-1 I BROKER 0: (15194) Database activity did not finish before the shutdown timeout expired so the database is performing an immediate shutdown.
[2018/03/08@12:30:03.355+0100] P-17489 T-1 I BROKER : (334) Multi-user session end.
[2018/03/08@12:30:03.363+0100] P-17489 T-1 I BROKER : (16869) Removed shared memory with segment_id: 1114118

The database can be restarted, after a (long) crash recovery.

I'm guessing this is caused by error 6495?

It's unclear to me how the -Mxs parameter should be set.
Could this have been prevented with a higher -Mxs? (At the moment we don't set the parameter at all.)

Regards,

Steven

All Replies

Posted by George Potemkin on 08-Mar-2018 06:06

latchId 17 is the LKF latch.

latchId 2865857 is an LHT latch (family of type 1).

Did you get a lot of "Lock table overflow" errors?

Posted by George Potemkin on 08-Mar-2018 06:10

> After deleting a large number of records in one transaction (BI growth > 2 GB) and then canceling the still-running procedure (CTRL-C in an editor session connected directly to the DB), our database did a forced shutdown while trying to undo the deletes

I missed this part. Sorry. It is indeed a "hidden" Lock table overflow. Undoing each record's delete creates one lock.

Posted by Steven Peeters on 08-Mar-2018 06:20

Since the DB was started this morning we did get a couple of -l increase (5408) messages in the log, but nothing out of the ordinary.

The procedure that caused this was simple:

do transaction:
  for each [table] exclusive-lock:
    delete [table].
  end.
end.

Posted by George Potemkin on 08-Mar-2018 06:36

You can reproduce the situation against the sports db: start it with -L 128, delete the order-line records in one transaction, and undo the transaction.
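
A minimal sketch of that reproduction, assuming a scratch copy of the sports database served with -L 128 (don't run it against a database you care about):

big-txn:
do transaction:
  for each order-line exclusive-lock:
    delete order-line.
  end.
  /* force a rollback: undoing each delete re-acquires a record lock,
     so the undo needs as many lock table entries as records deleted */
  undo big-txn, leave big-txn.
end.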

Posted by Steven Peeters on 08-Mar-2018 06:52

So the undo transaction functionality is limited by the size of the -L parameter?

Is this expected behaviour or a bug?

Should I just increase -L to a very large number?

Posted by George Potemkin on 08-Mar-2018 07:01

Yes, it's expected behaviour. Maybe the absence of the "Lock table overflow" errors should be treated as a bug.

> Should I just increase -L to a very large number?

Yes. Or use smaller transactions.

The point of no return is reached when the number of records deleted in one transaction equals the number of free entries in the lock table. Past that point you have only two choices: commit the transaction or crash the database. (For example, with the default -L 8192, deleting more than 8,192 records in one transaction already puts you past it.)

Posted by George Potemkin on 08-Mar-2018 07:19

> Should I just increase -L to a very large number?

To dot the i's: after "for each table: delete table. end" the high water mark (HWM) in the lock table will be equal to one. Undoing this transaction will increase the HWM to the number of records deleted in the transaction.
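
If you want to watch the lock table from another session during such a test, here is a sketch that counts the entries in use via the _Lock VST (used slots come first, hence the WHILE idiom; scanning _Lock can itself be slow with a large -L):

/* count the lock table entries currently in use;
   _Lock-Usr is the unknown value (?) for unused slots */
define variable n as integer no-undo.
for each _Lock no-lock while _Lock._Lock-Usr <> ?:
  n = n + 1.
end.
display n label "Locks in use".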

Posted by ChUIMonster on 08-Mar-2018 09:57

Using ONE BIG DB TRANSACTION for bulk deletes is asking for trouble.

Bulk delete operations are almost always very easy to restart and make excellent candidates for breaking into smaller transactions.
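
For example, a sketch of the usual batching pattern (the batch size of 10,000 and the order-line table are just placeholders):

/* delete in batches, committing after each one, so no single
   transaction ever needs more than ~10,000 lock table entries */
define variable cnt as integer no-undo.
repeat transaction:
  cnt = 0.
  for each order-line exclusive-lock:
    delete order-line.
    cnt = cnt + 1.
    if cnt >= 10000 then leave.  /* end this batch and commit */
  end.
  if cnt = 0 then leave.  /* nothing left to delete */
end.

Because each batch commits on its own, an interrupted run can simply be restarted.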

You might also want to test the performance of deleting a large number of records in a single transaction vs. deleting one record at a time.

Posted by gus bjorklund on 08-Mar-2018 18:44

> On Mar 8, 2018, at 7:53 AM, Steven Peeters wrote:
>
> So the undo transaction functionality is limited by the size of the -L parameter?

not just undo; creating and updating can also consume all the lock table entries. i wouldn't characterize this as a bug since it was a deliberate design decision long ago, though it is something that could be improved upon.

the reason you get the shared memory exhausted error is that each time there is a lock table overflow during rollback, we allocate one more lock table from the unused shared memory in hopes that the error won't occur again and the rollback will succeed. sometimes it does not.
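
for what it's worth, both knobs are broker startup parameters. a sketch of a parameter (.pf) file, with placeholder values rather than recommendations:

# saii.pf - broker startup parameters (values are illustrative only)
-db /db/saii
-L 500000      # lock table entries (default is 8192)
-Mxs 64        # shared-memory overflow area, in KB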

Posted by Steven Peeters on 09-Mar-2018 01:50

> You might also want to test the performance of deleting a large number of records in a single transaction vs. deleting one record at a time.

We followed the instructions from KB P36834 (splitting up transactions), but this was still causing BI growth.
I then tried to investigate, on a test db, the potential effects of large BI growth and of backing out of the transaction, using deliberately unrealistic one-big-transaction code, which resulted in the crash that is the subject of this thread.

I'm still puzzled as to why the BI was growing when I was batching the deletes in 10,000 per transaction.
In this regard I found the following in KB 15124:
> Sometimes there is no way to avoid the .bi file being extended; there just isn't enough time to commit transactions and make the cluster available before it is needed again by other transactions.

I guess we'll just split up our deletes manually by adjusting the query. (We don't archive records that often.)

Posted by George Potemkin on 09-Mar-2018 02:07

> I'm still puzzled as to why the BI was growing when I was batching the deletes in 10,000 per transaction.

Did you gather any statistics during the tests?

For example, promon/R&D/1/4/3. Status: Active Transactions?
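
(Roughly the same information is available from ABL through the _Trans VST; the field names below are from memory of the VST schema, so verify them against your version:)

/* rough ABL equivalent of promon/R&D/1/4/3 */
for each _Trans no-lock where _Trans._Trans-State <> ?:
  display _Trans._Trans-Usrnum
          _Trans._Trans-Num
          _Trans._Trans-State
          _Trans._Trans-Duration.
end.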

> In this regard I found the following in KB 15124:

> Sometimes there is no way to avoid the .bi file being extended; there just isn't enough time to commit transactions and make the cluster available before it is needed again by other transactions.

This old article seems to refer to the Before Image Cluster Age parameter (-G), which has defaulted to zero since 9.1E02 and 10.0B02.
