BROKER stuck on wait1=7931

Posted by Paul Koufalis on 12-Aug-2016 16:01

Does anyone have any insight?

OE 11.6 on RHEL 5.9

No one can login to the DB. _progres $DB hangs. We are going to bounce the DB but I'm looking for insight into what happened.

promon -NL works and I see the following in "All processes":

Usr:Ten   Name      Domain     Type      Wait            Wait Info

  0       root           0      BROK       --            7931:0

and a bunch of other users logged in but stuck. The other users don't have anything under Wait and Wait Info.

There was some ugliness last night: a hung process that was kill -SIGFPE at around midnight as it was not responding to signals and was reading 1.6M logRd/sec, sucking up an entire CPU. The process died without bringing down the DB but I guess there were some side-effects.

Promon - latch counts doesn't show any current latch owners and blocked users doesn't show anything either.

I ran a couple of George's favourite restricted options. Too long to post here but I can provide specific info if necessary.

All Replies

Posted by mfurgal on 12-Aug-2016 16:26

The _connect structure (called usrctl internally) is protected by a semaphore and we use that semaphore the way they were designed to be used, meaning a locking mechanism.  A process was likely killed while that semaphore was locked.  This means no one can login or logout of the database.  Shutdown is the only solution.  If you send the broker a kill –15 that will initiate a normal shutdown.  If you remove the LK file, that will get noticed and initiate an abnormal shutdown.

Mike
-- 
Mike Furgal
Director – Database and Pro2 Services
PROGRESS Bravepoint
678-225-6331 (office)
617-803-2870 (cell)


Posted by Paul Koufalis on 12-Aug-2016 16:52

The next logical question is: where do I see that? Is there a promon hidden option?

Posted by George Potemkin on 12-Aug-2016 17:13

Promon uses a login semaphore unless you start it with the -F option.

Hence a login semaphore is not locked.

Create a protrace file for a process while it hangs on login.

Session is waiting for login semaphore:
dsmUserConnect -> semLockLog

Logout:
dbDelUsrctl -> SemLockLog

Session is waiting for USR latch:
dsmUserConnect -> dbGetUsrctl ->latlatch -> latXlock -> latWait -> latWaitOnSem

Posted by George Potemkin on 12-Aug-2016 17:18

> I ran a couple of George's favourite restricted options. Too long to post here but I can provide specific info if necessary.

I would check them. ;-)

Posted by Paul Koufalis on 12-Aug-2016 17:22

Too late - DB bounced.  Next time.

Posted by George Potemkin on 12-Aug-2016 17:30

BTW, dbmon script is trained to gather information during such cases.

Posted by George Potemkin on 13-Aug-2016 03:21

Maybe this information will be helpfull for someone else:

During login the promon/proshut utilities read less data than the normal sessions.
If these utilities started without the -F option then at login they:
1. Lock the login semaphore (also known as the Ready/Login semaphore) stored in its own semaphore set;
2. Lock the USR latch. Proshut creates one USR latch lock, promon - 2 locks before V11 and 3 locks since V11 (I guess it's related to the sql-only-user feature though promon ignores the contents of the _User table and the -U/-P parameters);
3. Both utilities read the main record (where _Db-local eq true) in _Db table also known as mb_dbrecid. This record used to be fragmented. Hence it creates 2 db requests which of them creates LRU + BHT + 2*BUF latch locks (provided that the -lruskips is 0);
4. Since V11 promon also reads the root block of the _User._User-sql-only-user index (why?);
5. When you first time enter into any activity screen in promon or R&D level or update the activity counters then promon creates one TXQ latch lock (unless promon was started with the -NL option) and reads the ACO (Area Control Object) blocks - for all data areas including Control Area and Schema Area by default or only one ACO block if you set the working area in Adjust Monitor screen or does not read ACO blocks if the working area is Primary Recovery Area.

If promon/proshut successfully connect a database then none of the resources above are locked.

If promon/proshut was started with the -F option then they skip the points 1 and 2. Warning: do not use the -F option unless you absolutely sure that the normal connection does not work.

With/without the -F option the utilities during logout use the login semaphore and the USR latch - 2 USR latch locks without the -F option and one USR latch lock with the -F.

4GL session reads much more during login:
It creates many SCH latch locks and the schema locks.
It reads a lot of the blocks with metaschema data (even if a session is started with the -cache parameter).
It uses the lock table while reading metaschema: the shared record locks, record get locks, table locks.

In other words 4GL sessions have more chances to hang on a locked resource. But if the login semaphore and the USR latch are not locked then a session should leave a login message in db log.

Posted by cjbrandt on 13-Aug-2016 17:45

Do you know if the process that was killed the night before had an active transaction ?  Was there something in the db log about backing out a transaction after the user was killed ?

Did this db have AI enabled and if so, was the database able to continue switching AI logs ?  When we have a db that is hung waiting on latch or semaphore, it seems the db isn't able to switch AI files.  So that is what we monitor for to detect a db that is going to require a restart.

Posted by George Potemkin on 14-Aug-2016 01:16

Carey,

> was the database able to continue switching AI logs ?

Do you mean 'rfutil -C aimage new'?

It locks the same db resources as 4GL sessions plus MTX and AIB latches.

Posted by George Potemkin on 14-Aug-2016 04:58

Regarding login semaphore... Broker locks it at db startup and holds during the crash recovery phase. The "mode" part of db lock file (.lk) will contain the value 5 that means "crash recovery". At this time proutil -C busy will return code 64 that according documentation means "Database is in process of starting up" though proutil will incorrectly issue the message #276 ("The database is in use in multi-user mode"). Broker unlocks the login semaphore and update the mode in .lk file to the value 2 ("multi-user mode") after the message #10471 ("Database connections have been enabled").  My tests shown that as the first step 'proutil -C busy' checks the contents of the .lk file. Nothing says it will check the lock of login semaphore but I would try it anyways. At least it would be useful to check the login semaphore lock using proutil.

Notes:
Proutil -C holder does not distinguish "Database is in process of starting up" and "Database is in use in multi-user mode" and in both cases it returns code 16.
Return code 64 was added to 'proutil -C busy' somewhere in V8.

Posted by Paul Koufalis on 14-Aug-2016 10:01

CJ,

There was no active transaction. The process was doing 1.6M logRd/sec but no records were locked. The protrace pointed to a CAN-FIND() in a form definition.

AI is not enabled (yet) on this database. But before someone says anything the client has authorized me to enable AI on all their databases so it will be on very soon.

Posted by George Potemkin on 14-Aug-2016 11:25

> The protrace pointed to a CAN-FIND() in a form definition.

Restricted options without a doubt say that the index used by can-find consists of only one block - the root block. Do you know what index was used by can-find? Do all sessions read this index at startup?

> No one can login to the DB. _progres $DB hangs.

The main question: did the sessions write the login messages to db log? It looks like they did. Then by parsing db log it's rather easy to find more or less accurately when db hung (provided there were enough attempts to login). I have a program that does that kind of log analysis.

> a hung process that was kill -SIGFPE at around midnight

Posted by Paul Koufalis on 15-Aug-2016 09:05

I do not know what index: the DB did not have -tablerangesize -indexrangesize (but it does now!)

Luckily for the customer, this was a very quiet, low activity DB so there were only 5 or so login attempts after the SIGFPE kill. We can see pretty clearly the timeline by visually scanning the db.lg.

Posted by gus bjorklund on 16-Aug-2016 08:26

>

> Usr:Ten Name Domain Type Wait Wait Info

>

> 0 root 0 BROK -- 7931:0

>

>

> and a bunch of other users logged in but stuck. The other users don't have anything under Wait and Wait Info.

>

>

paul, fyi:

when the Wait column shows two dashes (dang helpful software will not let me type them - it always elides them into a long dash) that means that the process is NOT waiting on a resource. the wait info column contains a value that (usually) identifies the specific resource involved when the connection is in a wait state. if the connection is not in a wait state, this value doesn't not means much. it is just the last one but we don’t know that last resource type.

the login semaphore is a special case and uses a completely different set of code than all the other waits. the reason for that is that until you log in to the database successfully, you don’t have a connection table entry. the login semaphore is used to allocated one for self-serving connections.

This thread is closed