Lock Table Overflow

Posted by Paul Koufalis on 09-Sep-2014 14:32

Is my vague recollection true?

Every time there is a lock table overflow the RDBMS increases -L by 1.

Or something like that...

Paul

All Replies

Posted by TheMadDBA on 09-Sep-2014 14:38

Wow... when did that start?

Tested with a small -L setting on 10.2B on AIX, and it does indeed increase -L by one each time.

Posted by ChUIMonster on 09-Sep-2014 14:41

Yes.

Posted by ChUIMonster on 09-Sep-2014 14:47

It has been going on for a long time. At least since v8, probably
longer. Eventually it eats up all of -Mxs and the db goes BOOM!

Posted by TheMadDBA on 09-Sep-2014 15:07

I guess I never noticed because I always either get the code fixed or adjust the settings (even if only temporarily).

I was going to ask what happened when it ran out of segments/memory, but I see Tom has already covered that :)

Posted by Paul Koufalis on 09-Sep-2014 15:08

Yep...exactly: bye-bye -Mxs and bye-bye DB.

Now the $1M CAD question: what can I extract from the db.lg and AISCAN verbose output to maybe help me figure out who the guilty party was?

AISCAN verbose...not for the faint of heart.  Even with George Potemkin's AI Scan programs.

Does anyone have any words of wisdom to help alleviate my burden?  You know, a quick shortcut magic method that says HERE!! LOOK HERE!!!

Paul

Posted by ChUIMonster on 09-Sep-2014 15:17

Guilt is in the eye of the beholder.

It might be that some particular user grabs a whole mess of locks *and*
blows up -L.

Or some user grabs a lot, but not all, of the locks, and in the normal
workload some poor unlucky user just happens to ask for a modest helping
at just the wrong moment.

Or maybe -L is simply too small and a bunch of users with perfectly
reasonable requirements all make requests at about the same time...

Having said that... I've been known to go tripping through _UserLock
like so:

for each _UserLock no-lock where _UserLock-Usr <> ?
    and _UserLock-Recid[512] <> ?:  /* slot 512 filled means at least 512 locks held */
    display _UserLock-Usr.
end.

to find people that seem to have more locks than might be generally
recognized as "reasonable"... I find the approach above to be *much*
safer than fiddling with _lock.

Posted by TheMadDBA on 09-Sep-2014 15:20

The database log is only going to give you the PID and possibly the userid (depends on how they are connecting).

Sometimes the process named in the log will just be the victim of another program that pushed the lock table near its limit; sometimes it will be the actual offender.

If it was actually doing DB updates and not just taking locks there should be some info in AISCAN, but like you said it can be a pain to look through, especially if you have a lot of activity.

Posted by Paul Koufalis on 09-Sep-2014 15:22

The problem is that I don't get a call until AFTER the DB has crashed.  I am doing a post-mortem now.

Posted by TheMadDBA on 09-Sep-2014 15:28

What is -L set to now? How does that compare to the number of users and transactions?

I know Tom is probably going to blow a gasket on this suggestion, but ...

Try cranking -L up to a much larger value and put in more monitoring/alerts/logging to let you have time to find out where the suspects are. Then you can decrease it if it makes sense after the code is changed.
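For the monitoring piece, something along these lines, run every few minutes from cron or a batch agent, would give you that early warning. It's only a rough sketch against the _UserLock and _Connect VSTs; the 1000-lock threshold and the MESSAGE output are placeholders for whatever alerting you already use, and keep in mind _UserLock only exposes a user's first 512 locks:

define variable i      as integer no-undo.
define variable iLocks as integer no-undo.
define variable iAlert as integer no-undo initial 1000.  /* illustrative threshold - pick your own */

for each _UserLock no-lock where _UserLock-Usr <> ?:
    iLocks = 0.
    do i = 1 to 512:  /* _UserLock-Recid only holds the first 512 locks per user */
        if _UserLock-Recid[i] <> ? then iLocks = iLocks + 1.
    end.
    if iLocks >= minimum(iAlert, 512) then
    do:
        find first _Connect no-lock where _Connect-Usr = _UserLock-Usr no-error.
        if available _Connect then
            message now _Connect-Name _Connect-Pid "is holding at least" iLocks "locks".
    end.
end.

Logging that output (or pushing it into your alerting) each time it fires should at least leave a trail pointing at the suspects before the next overflow.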

Posted by ChUIMonster on 09-Sep-2014 15:40

> I know Tom is probably going to blow a gasket on this suggestion, but ...

Bang!

> Try cranking -L up to a much larger value and put in more monitoring/alerts/logging to let you have time to find out where the suspects are. Then you can decrease it if it makes sense after the code is changed.


It will never get decreased ;)

But, yes, if this is something that is happening regularly then sometimes you have to make -L bigger so that you can get by while the code is being fixed.

On the other hand...

Sometimes you cannot make it big enough. This is especially true if the issue is that some coder has decided to use a db transaction to ensure that his year-end purge is "all or nothing". And things of that ilk.

I find that that sort of thinking is behind an awful lot of these -L problems.

The solution in those cases is to distinguish between a "business transaction" and a database transaction. 99.44% of the business transactions like this can be trivially fixed by making them restartable and perhaps adding a flag somewhere to indicate if they completely finished. The other 0.56% are fixable with a little extra thought and effort. Exactly 0% of them *have* to be coded as database transactions.
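To make that concrete, here is a minimal sketch of the restartable style. The Customer table, the LastOrderDate field, and the cutoff are completely made up; the only point is that each delete gets its own tiny transaction instead of the whole purge being one giant one:

define variable dCutoff as date no-undo.
dCutoff = date(1, 1, year(today) - 7).  /* hypothetical cutoff: anything older than ~7 years */

for each Customer exclusive-lock
    where Customer.LastOrderDate < dCutoff
    transaction:  /* one record lock per transaction - nowhere near -L */
    delete Customer.
end.

If the run dies part way through you just run it again: the rows that were already purged are gone, so it picks up where it left off. Writing a "purge finished" flag to a control table at the end covers the rest.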

-- 
Tom Bascom
603 396 4886
tom@greenfieldtech.com

Posted by Paul Koufalis on 09-Sep-2014 15:40

-L 100K

About 1500 concurrent users

We suspect the offending job was a batch process that will refire, so I did exactly what you suggested in the hope of catching it the next time around.

Then again, when one of the programmers heard our theory his eye twitched and he went back to his desk.

UPDATE: The twitch was something he saw in the batch scheduler. Someone scheduled a job with no start/end date.

Bad code once again...and the programmers wonder why we always say it's their fault.

Paul

Posted by TheMadDBA on 09-Sep-2014 15:49

Tom: I agree that some DBAs will just let it go forever... In fact I have jacked up a few -L settings to absurd values because the developers just will not stop writing crappy code.

I got tired of explaining to the CIO and business execs why parts of the business stop every few weeks or so because of "some database problem". The 10MB of memory just wasn't worth the trouble compared to explaining it to them.

It also gave me time to track down the bad code and crack some knuckles instead of talking to managers about it. Because it was always my problem, even when it wasn't my problem :)

Paul: Sounds like a decent plan to me. -L of 100K seems a tad low for that many users, but of course it depends on the application. You know it better than I do, obviously.
