OE Replication Target Crash, source continues

Posted by James Palmer on 12-Feb-2018 04:18

OE 10.2B08. 

We see this from time to time on various sites and just haven't got to the bottom of it. This time it's landed us in a difficult position with a customer because it's the third time Replication has failed in a week (the other two were from other causes).

[2018/02/08@12:41:16.083+0000] P-6244       T-5912  I RPLA   41: (9407)  Connection failure for host 10.100.1.40 port 64037 transport TCP. 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0000:  0000 0000 0000 0000 f056 e401 9113 0000 8813 0000 ec13 0000 0200 0000 2400 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0020:  cc01 0000 a411 0000 0000 0000 6a45 7c5a 0000 0000 3c41 0000 0000 0000 2d00 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0040:  0000 0000 58f0 ffff 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 00c0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 00e0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0100:  0000 0000 0000 0000 0000 0000 3130 2e31 3030 2e31 2e34 3000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0120:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) 0140:  0000 0000 0000 0000 0000 0000 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (10492) A communications error -157 occurred in function rpNLA_PollListener while receiving a message. 
[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (11699) A TCP/IP failure has occurred.  The Agent's will enter PRE-TRANSITION, waiting for connection from the Replication Server. 
[2018/02/11@16:08:34.568+0000] P-6244       T-7392  I RPLA   41: (9438)  CTRL_SHUTDOWN_EVENT console event received. 

See the log file above. As you can see, the target quite clearly crashes because of a communication error. The thing is, we're replicating 10 databases and this is the only one with the issue. It occurred within an hour of reseeding Replication, in case that's pertinent.
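The diagnostic dump in the log is mostly zeros, but any printable text embedded in it can be pulled out with a few lines of Python. This is only a sketch for reading the dump: the helper name `ascii_runs` is made up for illustration, and nothing here assumes any knowledge of the real layout of RPCommInfo_t.

```python
import re

def ascii_runs(hex_text, min_len=4):
    """Return printable ASCII strings of at least min_len bytes found in a hex dump row.

    hex_text is the hex portion of a dump line only (no timestamp, no offset),
    e.g. "0000 0000 3130 2e31 ...". Hypothetical helper, not an OpenEdge tool.
    """
    data = bytes.fromhex(hex_text)  # whitespace between hex groups is skipped
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]

# Hex portion of the row at offset 0100 in the dump above
row_0100 = ("0000 0000 0000 0000 0000 0000 3130 2e31 3030 2e31 "
            "2e34 3000 0000 0000 0000 0000")
print(ascii_runs(row_0100))
```

For the row at offset 0100 this yields the same host address that appears in the (9407) connection failure message.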

The source database just carried on happily with nothing in the logs; it just started locking AI files. The first we knew about it was when the system crashed, because a bug in our monitoring script meant we got no alert.

Has anyone seen this before? Any ideas as to the cause? Any ideas how to fix? 

I've fixed the monitoring script so we should get alerts well in advance of a DB crash, so it's not so much of an issue, but it's not pretty.
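For anyone wanting a similar alert, here is a rough sketch of scanning the target's log for the agent messages seen in the excerpt above. It is not the actual monitoring script: the regular expression is inferred only from the log lines in this post, the function name is hypothetical, and the set of message numbers to alert on, (9407), (10492) and (11699), is simply those that appeared here.

```python
import re

# Line layout inferred from the excerpt in this post (an assumption, not a spec):
# [timestamp] P-<pid> T-<tid> <severity> <process> <n>: (<msgnum>) <text>
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-(?P<tid>\d+)\s+"
    r"(?P<sev>\w)\s+(?P<proc>\w+)\s+(?P<usr>\d+):\s+"
    r"\((?P<msg>\d+|-+)\)\s+(?P<text>.*)$"
)

# Message numbers taken from the excerpt above; extend as needed.
ALERT_MESSAGES = {"9407", "10492", "11699"}

def find_agent_failures(lines):
    """Return (timestamp, message number, text) for each RPLA line worth alerting on."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("proc") == "RPLA" and m.group("msg") in ALERT_MESSAGES:
            hits.append((m.group("ts"), m.group("msg"), m.group("text")))
    return hits

sample = [
    "[2018/02/08@12:41:16.083+0000] P-6244       T-5912  I RPLA   41: (9407)  Connection failure for host 10.100.1.40 port 64037 transport TCP. ",
    "[2018/02/08@12:41:16.085+0000] P-6244       T-5912  I RPLA   41: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2",
    "[2018/02/11@16:08:34.568+0000] P-6244       T-7392  I RPLA   41: (9438)  CTRL_SHUTDOWN_EVENT console event received. ",
]
for ts, msg, text in find_agent_failures(sample):
    print(ts, msg, text)
```

In practice the lines would come from the database log on the target, and the hits would feed whatever alerting is already in place.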

All Replies

Posted by e.schutten on 12-Feb-2018 05:53

Perhaps you can set the transition-timeout and/or connect-timeout higher?

Posted by James Palmer on 12-Feb-2018 05:55

Nope, the .properties files are identical for all DBs with the exception of the DB names and ports.

Posted by Paul Koufalis on 12-Feb-2018 07:05

I do not see the "target quite clearly crashes" in the lg file you posted. I see the rpagent lose comms with the rpserver at 12:41:16 on 02/08 and enter pre-transition mode, then a CTRL_SHUTDOWN_EVENT on the console at 16:08:34 on 02/11, three days and about 3.5h later. Did you reboot the server at 16:08:34?

What I would like to know is if the rpserver process stopped at 12:41:16. It certainly looks that way.

BTW: you should be monitoring both 6021 on the source AND 3049 on the target as there is a bug in 10.2B08 and some earlier 11s whereby the source reports 6021/Normal Processing but it isn't.
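As a minimal sketch of that two-sided check: the function below treats replication as healthy only when both sides report the expected state. How the two status numbers are obtained is left out on purpose, the function name is hypothetical, and reading 3049 as the healthy target state is an assumption based on the advice above.

```python
SOURCE_NORMAL = 6021  # source reports "Normal Processing"
TARGET_NORMAL = 3049  # assumed healthy state on the target, per the advice above

def replication_healthy(source_status, target_status):
    """True only if BOTH sides report their normal state.

    Checking the source alone is not enough, because of the bug where
    the source reports 6021 even though replication is not running.
    """
    return source_status == SOURCE_NORMAL and target_status == TARGET_NORMAL

# Source looks fine but the target does not: still an alert condition.
# The target value 0 is a placeholder, not a real status code.
print(replication_healthy(6021, 0))
print(replication_healthy(6021, 3049))
```

The design point is simply that the two checks are combined with AND, so a source stuck on 6021 cannot mask a target that has dropped out.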

Posted by James Palmer on 12-Feb-2018 09:10

Fair point, Paul - the database didn't crash, but the agent stopped. Yes, the customer restarted both source and target at 16:08. Replication didn't sort itself out at that point. I'm not sure why.

Thanks for the pointer on monitoring source and target - I'm aware of that bug. Very aware! :)
