Long running transaction immune to proshut

Posted by Rohan on 02-Aug-2017 09:33

Red Hat Enterprise Linux 6.7, OpenEdge 11.3.2.  (24x7)

I have a long running transaction (over 24 hours) which doesn't respond to the usual proshut command. This is a fairly busy production db. The user no longer has an active Linux login, but does have a _progres process active.

They also have various exclusive and shared locks.

In an attempt to disconnect the user, I tried (as the db owner account) "proshut dbname -C disconnect 269".

Output is normal: "User 269 disconnect initiated. (6796)". There is no entry in the db.lg to confirm this.

I also tried Dmitri Levin's disconn-long-trans.p (thanks Dmitri), but without effect.

The BI file is usually up to around 1GB for this db, but currently over 8GB and growing.

-bithold is 19GB. BI dedicated filesystem size is 40GB. -bistall = yes.

Interestingly, in promon the user appears in "1.User Control" list, but not in the list of users to disconnect in menu option "8.Shut Down Database">"1.Disconnect a User".

Any suggestions on how to end this transaction, while remaining up and running?

All Replies

Posted by ChUIMonster on 02-Aug-2017 10:09

If you disconnected or killed the user then transaction backout should be underway.  There should be mention of this in the .lg file.  You should also see some evidence in PROMON.  The _connect VST should show that the disconnect flag is set.

If that session was actually doing something that changed 24 hours worth of data you can expect the transaction backout to take a very long time.  Killing it repeatedly is going to make things worse.

Setting -bithold to 19GB is irresponsible.  It leads to problems like this.  You should have had it set to 500MB or less -- then this would never have happened.

You should strive to be able to set -bithold as small as possible.

Fix the code.  Don't keep increasing -bithold.

I hope that you have after-imaging enabled and that you have a good backup that you can restore and roll-forward against because I have a bad feeling that you are about to need it.

Posted by George Potemkin on 02-Aug-2017 10:40

> Interestingly, in promon the user appears in "1.User Control" list, but not in the list of users to disconnect in menu option "8.Shut Down Database">"1.Disconnect a User".

It's normal. "Shut Down" menu does not show the sessions whose disconnect is already initiated.

> Any suggestions on how to end this transaction, while remaining up and running?

The only way is to wait until the transaction rollback is completed.

> I have a long running transaction (over 24hours)

Rollback will be finished approximately 6-8 hours after the disconnect of the session was initiated.

Posted by Rohan on 02-Aug-2017 11:08

Hi Tom

Yeah. We are working on getting the vendor to change the code to reduce transaction scopes et al, and user training.

I know 19GB is wrong - a bad work around at best.

Unfortunately proshut has been run more than once, the first time before the transaction was 3 hours old.

Login 11:41:35

Transaction start 11:42:16

I can't see proshut, but can see the user had "KILL signal received" 11:42:33 in the db.lg file.

AI is running.  Backups and archived AI files are available (but not verified)

I have CHUI ProTop running. What code would I need to see the disconnect flags?

Posted by George Potemkin on 02-Aug-2017 11:13

> What code would I need to see the disconnect flags?

for each dictdb._Connect no-lock where _Connect-Disconnect gt 0:
  display _Connect-Usr.
end.

Posted by Rohan on 02-Aug-2017 11:17

I ran your code George. It does show user 269

Posted by Rohan on 02-Aug-2017 11:21

The first proshut to disconnect was more than 24 hours ago.

Posted by George Potemkin on 02-Aug-2017 11:25

promon/R&D/3/2. I/O Operations by Process

Check the changes of the activity counters for user 269 over time.

Posted by ChUIMonster on 02-Aug-2017 11:57

ProTop shows a "d" in the "Flags" column for sessions that have the disconnect flag set

Posted by Rohan on 02-Aug-2017 12:10

ProTop does show a "d" for user 269.

(Apologies if the format is poor)

From Promon

                                 -------- Database ------     ---- BI -----     ---- AI -----
Usr:Ten   Name      Domain       Access     Read    Write     Read    Write     Read    Write
269       XXXXXX         0       136455        1        7        0        0        0        0

Posted by Rohan on 02-Aug-2017 12:11

AI write = 0

Posted by George Potemkin on 02-Aug-2017 12:18

It would be useful to get "I/O Operations by Process" twice - to see the change.

But in your case it looks like user just hung. AI write = 0 is not a problem. AI writes is a job for AIW. BI read = 0 means that process is not trying to roll back its transaction.
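
The "take it twice and compare" check can be scripted. The following is only a sketch: it assumes each sample has been reduced to the single row for the user, in the column order shown above (Usr, Name, Domain, then the seven counters), and that the name contains no spaces. The function name `io_delta` is made up for illustration.

```shell
# Compare two "I/O Operations by Process" rows for the same user.
# Assumed row layout (as posted above):
#   usr name domain db_access db_read db_write bi_read bi_write ai_read ai_write
io_delta() {
    # $1 = earlier row, $2 = later row
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 4; i <= NF; i++) first[i] = $i; n = NF; next }
        NR == 2 {
            changed = 0
            # Fields 4..n are the counters; any difference means activity.
            for (i = 4; i <= n; i++) if ($i != first[i]) changed = 1
            if (changed) print "active: counters changed"
            else         print "idle: no counter changed between samples"
        }'
}

# Usage (rows copied from promon a few minutes apart):
#   io_delta "$row_then" "$row_now"
```

If the BI read counter stays flat between samples, that matches the reading above: the process is not rolling anything back.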

Posted by George Potemkin on 02-Aug-2017 12:19

Try to generate protrace file: kill -USR1 <PID>
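
A rough sketch of that step for Linux follows. The assumptions are mine: the file is named protrace.<pid>, and the process's current working directory (read from /proc) is still the directory it was started in. The function names `protrace_path` and `request_protrace` are hypothetical.

```shell
# Print where a protrace file for the given pid would be expected.
# Assumes Linux /proc and that cwd is still the process start directory.
protrace_path() {
    local dir
    dir=$(readlink "/proc/$1/cwd") || return 1
    printf '%s/protrace.%s\n' "$dir" "$1"
}

# Send SIGUSR1 to the process and report whether the file appeared.
request_protrace() {
    local pid=$1 path
    # kill -0 only tests that we may signal the process; it sends nothing.
    kill -0 "$pid" 2>/dev/null || {
        echo "no such process, or no permission to signal it: $pid" >&2
        return 1
    }
    path=$(protrace_path "$pid") || return 1
    kill -USR1 "$pid"
    sleep 2   # arbitrary wait; a hung process may never write the file
    if [ -s "$path" ]; then
        echo "written: $path"
    else
        echo "not found (yet): $path"
    fi
}

# Usage: request_protrace <PID>
```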

Posted by Rohan on 02-Aug-2017 12:22

Same numbers again:

269       XXXXXX        0       136455        1        7        0        0        0        0

Posted by Dmitri Levin on 02-Aug-2017 12:29

> for each dictdb._Connect no-lock where _Connect-Disconnect gt 0:
>   display _Connect-Usr _Connect-Name _Connect-Pid
>           _Connect-Batch _Connect-NumTrans.
> end.

Since this code should produce no records in normal conditions, we can use that to send alerts when we have users set to disconnect, but still active. Just a thought.
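
One possible shape for such an alert, as a sketch only: the function reads user numbers, one per line, on stdin and returns non-zero when there are any, so it can sit in a cron job. How the list is produced (for example a batch run of the query above) is left open, and the name `alert_on_stuck_disconnects` is made up for illustration.

```shell
# Alert when any session is flagged for disconnect but still connected.
# stdin: one user number per line; anything else is ignored.
alert_on_stuck_disconnects() {
    local users
    users=$(grep -E '^[[:space:]]*[0-9]+[[:space:]]*$' \
            | tr -d '[:blank:]' | sort -n | tr '\n' ' ')
    users=${users% }          # drop the trailing space left by tr
    if [ -n "$users" ]; then
        echo "ALERT: disconnect pending for user(s): $users"
        return 1
    fi
    echo "OK: no sessions pending disconnect"
}

# Usage (the file name is hypothetical):
#   alert_on_stuck_disconnects < pending_users.txt || mail -s "stuck disconnect" dba < /dev/null
```

To avoid false alarms for sessions that are simply in the middle of a normal rollback, it may be worth alerting only when the same user number shows up in two consecutive runs.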

>AI is running.  Backups and archived AI files are available (but not verified)

Rohan, did you consider either OpenEdge Replication or After-Image (free, log-based) replication?

That is, unless you can afford the downtime to restore from a backup and roll forward. It is a business decision of course: loss of business vs. the price of replication.

Posted by Rohan on 02-Aug-2017 13:07

The BI file size reported by ProTop has dropped to 0

Posted by Rohan on 02-Aug-2017 13:07

but transaction is still there

Posted by George Potemkin on 02-Aug-2017 13:33

Most likely it will be safe to use kill -9 in your case but it can't be 100% guaranteed.

Posted by Rohan on 02-Aug-2017 13:43

You mean it will either solve the problem and get rid of the transaction, or crash the db?

Posted by George Potemkin on 02-Aug-2017 13:46

Kill -9 will get rid of the transaction and it will not crash db (at least I hope so).

Posted by Old Jeremiah Johnson on 03-Aug-2017 02:52

I would use kill -8 instead.

Posted by George Potemkin on 03-Aug-2017 04:41

Agree. The process will report SIGFPE signal in db log and it's a good thing. Other differences between the standard fatal and really fatal signals, IMHO, are not important.

BTW, does anybody have a to-do list for when a process hangs (like in the current case)? I don't mean the checks on the Progress level, because that's too big a topic, only the OS level.

I'd try to generate a protrace file. There is a chance that the process will ignore the SIGUSR1 signal or will only partially create a protrace file (only the section with startup parameters). Can we use Unix debuggers to get more information?

I'd get 'ps -ef' to check if the process has launched any sub-processes.

I'd get 'lsof -p PID' to check the ports opened by the process.

What else can/should be done?
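
The OS-level checks listed above could be collected in one pass, roughly like this. It is a sketch for Linux with procps `ps`; `lsof` is used when installed, with a listing of /proc/PID/fd as a fallback. The function name `hung_process_snapshot` is hypothetical.

```shell
# Gather basic OS-level facts about a (possibly hung) process.
hung_process_snapshot() {
    local pid=$1
    echo "== process =="
    # stat and wchan hint at whether the process is sleeping and on what.
    ps -o pid,ppid,stat,etime,wchan,args -p "$pid" || echo "(ps failed)"
    echo "== children =="
    ps -o pid,stat,etime,args --ppid "$pid" || echo "(none found)"
    echo "== open files and sockets =="
    if command -v lsof >/dev/null 2>&1; then
        lsof -p "$pid" || echo "(lsof returned nothing)"
    else
        ls -l "/proc/$pid/fd" || echo "(not readable)"
    fi
}

# Usage: hung_process_snapshot <PID> > snapshot.$(date +%H%M%S).txt
```

Taking the snapshot twice, a few minutes apart, shows whether anything about the process is changing at all.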

Posted by Rohan on 03-Aug-2017 09:16

As expected, plain old ‘kill PID’ did not work on the _progres process.

As it was now overnight / non-busy hours for us, we chose not to try 'kill -9 PID' on the _progres process to end the transaction, and instead got "most" users to log out, then shut down the db with 'Unconditional Shutdown'. The shutdown took a few minutes longer than normal. The process disappeared with the shutdown. The server itself was not shut down, although that may have been better for a proper clean-up.

I won’t post the full db.lg entries unless someone is interested, but we received various Progress ‘error’ codes and messages in the db log file:

(-----) Sending signal 12 to user 269

(-----) Sending signal 14 to user 269

(-----) Sending signal 15 to user 269

(15194) Database activity did not finish before…

(2251) Destroyed user 269 pid 29319

(334)   Multi-user session end.

(16869) Removed shared memory with segment_id: 425997

Next truncated the BI file.

Ran bigrow

Started db with no obvious problems.
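
For reference, the two BI steps as a sketch, run as in this case with the database shut down. The function prints the commands by default and only executes them when DRY_RUN=0 is set. The function name `bi_reset` and the cluster count in the usage line are placeholders, not values from this system.

```shell
# Truncate the BI file, then pre-grow it by a number of clusters.
# $1 = database name, $2 = number of BI clusters for bigrow
# DRY_RUN defaults to 1: commands are printed, not executed.
bi_reset() {
    local db=$1 clusters=$2 run=
    [ "${DRY_RUN:-1}" = 1 ] && run=echo
    $run proutil "$db" -C truncate bi &&
    $run proutil "$db" -C bigrow "$clusters"
}

# Usage:
#   bi_reset dbname 8             # show what would be run
#   DRY_RUN=0 bi_reset dbname 8   # actually run it
```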

On the subject of a list to do when (or before) this happens...

To produce a core file and protrace, a pre-requisite is to set "core file size" to a non-zero value, for the db owner. The Linux ulimit -c command can be used to set the core file size to unlimited or some value in kilobytes. Size of 0 will not allow core files etc to be created. This was the case for us.
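
A quick way to check the core file limit before it is needed, as a sketch. It only interprets the value that `ulimit -c` reports for the current shell, in whatever units that shell uses. It concerns core files only, and the function name `core_limit_status` is made up.

```shell
# Interpret the value reported by `ulimit -c`.
core_limit_status() {
    case "$1" in
        0)         echo "core files disabled" ;;
        unlimited) echo "core files enabled (no size limit)" ;;
        *)         echo "core files enabled (limit: $1)" ;;
    esac
}

# Usage, run as the db owner account:
#   core_limit_status "$(ulimit -c)"
```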

These KB's were useful:

Guidelines on the use of UNIX kill command to stop a process

(This suggests an order of kill commands.)

https://knowledgebase.progress.com/articles/Article/P14679

How to produce a stack trace for a running OpenEdge process without killing it

http://knowledgebase.progress.com/articles/Article/P112486

How to produce a readable stack trace using PROSTACK?

http://knowledgebase.progress.com/articles/Article/P2262

Anyway, we may have got away with it this time.

Thanks all for assistance.

Posted by George Potemkin on 03-Aug-2017 09:39

> To produce a core file and protrace, a pre-requisite is to set "core file size" to a non-zero value, for the db owner.

No, to produce a protrace file you need only the permission to send a signal (SIGUSR1) to a process.

> Guidelines on the use of UNIX kill command to stop a process

The PROSIGTRACE option (unfortunately not widely known) has existed since V7, I think. It lets us ask any Progress process what it is going to do after receiving each signal.

Posted by cverbiest on 03-Aug-2017 10:01

> to produce a protrace file you need only the permission to send a signal (SIGUSR1) to a process.

And the owner of the process needs write permission in the start directory. protrace files are created in the process start directory, not in temp directory (-T)

Posted by George Potemkin on 03-Aug-2017 10:23

> And the owner of the process needs write permission in the start directory.
Otherwise the process will write to stdout:

Failed to open file protrace.13618 errno 13 (1263)

PROGRESS stack trace as of Thu Aug 03 19:18:39 2017
<snip>

Posted by George Potemkin on 03-Aug-2017 10:37

Moreover, under some special conditions it's possible to write protrace files to a directory where the user has no write permission. Maybe that should be treated as a vulnerability.

This thread is closed