Bizarre DB Freeze Behaviour

Posted by Paul Koufalis on 06-Dec-2013 09:18

Would anyone care to venture a theory?  Does the AIMGT grab a latch or something?

HPUX 11.31

OE 11.2.1

ChUI/telnet application

 

Problem:  DB seems to have frozen (writes only) for 6 minutes from 20:06 - 20:12.

Pertinent information:

 

- The AIMGT writes AI files to an NFS-mounted directory every 15 minutes

- The SAN where that NFS mount resides had some issues during the exact 6 minute period

- That SAN is supposedly NOT in any way connected to the SAN in prod (NOT on the same Brocade network)

  - Remains to be confirmed

 

db.lg:

- Users report that they can log in and see the application menu but cannot access any screens

- I see a couple of hundred logins and logouts during that 6 minute interval

- At 20:12:05, precisely when everything comes back to normal, I see 172 users log out within one second

  - Begin transaction backout

  - Transaction backout complete

  - Logout user x

- AIMGT successfully switches AI files but does not complete the copy until 20:12 (prod.a5 is only 8 MB)

 

[2013/12/05@20:06:11.838-0500] P-21185      T-1     I AIMGT   9: (3777)  Switched to ai extent prod.a1.

[2013/12/05@20:06:11.838-0500] P-21185      T-1     I AIMGT   9: (3778)  This is after-image file number 1766 since the last AIMAGE BEGIN

[2013/12/05@20:12:05.503-0500] P-21185      T-1     I AIMGT   9: (13199) After-image extent prod.a5 has been copied to <nfs mount>

 

AI scan verbose: no AI notes generated between 20:06:16 and 20:12:05

 

Trid: 154141637 Thu Dec  5 20:06:16 2013. (2598)

Trid: 154141638 Thu Dec  5 20:06:16 2013. (2598)

Trid: 154141638 Thu Dec  5 20:06:16 2013. (2598)

Trid: 154141636 Thu Dec  5 20:12:05 2013. (2598)

Trid: 154141642 Thu Dec  5 20:12:05 2013. (2598)
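The gap in the AI notes can be located mechanically rather than by eye. Below is a minimal Python sketch, assuming the scan output has been saved to a text file and that every note timestamp appears on a "Trid:" line in the format shown above; the function names are illustrative only and are not part of any Progress utility.

```python
import re
from datetime import datetime

# Matches lines such as: "Trid: 154141637 Thu Dec  5 20:06:16 2013. (2598)"
TRID_RE = re.compile(r"^Trid:\s+(\d+)\s+(.+?)\.\s+\(2598\)\s*$")


def note_times(lines):
    """Return the timestamps of all Trid lines, in the order they appear."""
    times = []
    for line in lines:
        match = TRID_RE.match(line.strip())
        if match:
            # Collapse the padded day ("Dec  5") before parsing.
            stamp = " ".join(match.group(2).split())
            times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times


def largest_gap(lines):
    """Return (start, end, seconds) for the longest pause between notes."""
    times = note_times(lines)
    if len(times) < 2:
        return None
    start, end = max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])
    return start, end, (end - start).total_seconds()
```

Passing the five lines above to largest_gap() reports a pause that starts at 20:06:16 and ends at 20:12:05.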

 

 

syslog:

- I see a couple of hundred occurrences of this message during that 6 minute period:

   telnetd[9409]: Error checking child termination status: error 4: Interrupted system call

- Outside of the 6 minutes, this message appears only a handful of times

 

- Also:

Dec  5 20:08:11 vmunix: NFS server x not responding still trying

Dec  5 20:08:21 vmunix: NFS server x ok

Dec  5 20:09:16 vmunix: NFS server x not responding still trying

Dec  5 20:12:05 vmunix: NFS server x ok
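To compare the NFS messages against the freeze window, the "not responding" and "ok" lines can be paired into outage intervals. This is a rough Python sketch, assuming the syslog lines look exactly like the four above; the function name is made up for illustration, and the year is passed in because syslog does not record it.

```python
from datetime import datetime


def nfs_outages(lines, year=2013):
    """Pair 'not responding' / 'ok' messages into (start, end, seconds)."""
    outages = []
    down_since = None
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or "NFS server" not in line:
            continue
        # syslog omits the year, so it is supplied by the caller.
        stamp = "{} {}".format(year, " ".join(fields[:3]))
        when = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        if "not responding" in line and down_since is None:
            down_since = when
        elif line.rstrip().endswith(" ok") and down_since is not None:
            seconds = (when - down_since).total_seconds()
            outages.append((down_since, when, seconds))
            down_since = None
    return outages
```

The resulting intervals can then be laid side by side with the db.lg and AI scan timestamps.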

 

 

 

Paul

 

******************************************

* Paul Koufalis

* Progresswiz Consulting

*

* Email: pk@progresswiz.com

* Phone: 514 247 2023

* Fax  : 514 439 6782

*

* Progress, MFG/PRO, UNIX, Windows, EDI

******************************************

 

All Replies

Posted by gus on 06-Dec-2013 10:16

Hypothesis, /not/ verified by code inspection:

0) ai management daemon was blocked on a write system call to the nfs mounted output device.

1) ai management daemon was holding a lock on some data structure in shared memory

2) no operations that required generating or writing a bi note or ai note could take place until the locks held by the ai management daemon were released. any process that attempted such an operation became blocked waiting for a lock.

3) eventually the write operation to the nfs mounted output device completed

4) ai management daemon released whatever locks it was holding

5) processes blocked on step 2 could continue

note that other kinds of write operations (e.g. online backup, data block writes, etc) can also block and cause similar results. 4GL code can also block on writes, causing other users to block on record locks held by the writer.
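The hypothesized sequence in steps 0 to 5 can be illustrated with a toy model. The Python sketch below is not Progress code and says nothing about how the database is actually implemented: a plain threading.Lock stands in for the shared-memory structure, a sleep stands in for the stalled NFS write, and all names are invented for the illustration.

```python
import threading
import time

shared_lock = threading.Lock()   # stands in for the shared-memory structure
lock_taken = threading.Event()


def archiver(write_seconds):
    """Hold the lock for the whole duration of a slow 'write'."""
    with shared_lock:
        lock_taken.set()
        time.sleep(write_seconds)   # stands in for the stalled NFS write


def writer(waits):
    """Try to generate a 'note'; record how long the lock was unavailable."""
    lock_taken.wait()
    started = time.monotonic()
    with shared_lock:
        waits.append(time.monotonic() - started)


def simulate(write_seconds=0.5, writers=3):
    """Run one archiver and several writers; return each writer's wait time."""
    lock_taken.clear()
    waits = []
    threads = [threading.Thread(target=archiver, args=(write_seconds,))]
    threads += [threading.Thread(target=writer, args=(waits,))
                for _ in range(writers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return waits
```

In this model every writer stalls for roughly as long as the slow write lasts and then proceeds the moment the lock is released, which is the pattern the hypothesis predicts.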

Posted by Paul Koufalis on 06-Dec-2013 11:00

Ouch if that's true.  I will open a TS ticket to see if we can find out for certain.

This would mean that my "best practice" of setting the primary AI archive directory to an NFS mount is NOT a best practice.  Instead I will need to archive locally and write a script to push the files to the NFS mount.  That would suck.
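A push script of that kind could look roughly like the Python sketch below. It assumes AIMGT archives to a local directory and that this runs separately (from cron, for example); the function name and both directory arguments are hypothetical. If the hypothesis above is correct, a hung NFS server would then stall only this script and not the database.

```python
import os
import shutil


def push_archives(local_dir, remote_dir):
    """Copy archived AI files missing from remote_dir; return their names."""
    pushed = []
    for name in sorted(os.listdir(local_dir)):
        source = os.path.join(local_dir, name)
        target = os.path.join(remote_dir, name)
        if not os.path.isfile(source):
            continue
        if (os.path.exists(target)
                and os.path.getsize(target) == os.path.getsize(source)):
            continue  # already pushed on an earlier run
        partial = target + ".partial"
        shutil.copy2(source, partial)   # may stall if the NFS server hangs
        os.replace(partial, target)     # rename only once the copy is complete
        pushed.append(name)
    return pushed
```

The sketch never deletes anything from the local directory, so cleanup of old archives would still need to be handled separately.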

Paul


This thread is closed