Communication error - Replication

Posted by James Palmer on 15-May-2017 11:25

Still having trouble with OE Replication at a client site. I'll try to explain. OpenEdge 10.2B08 on Windows.

This has been an ongoing issue for a while. Struggling to get anywhere with Tech Support as well. Had another instance of problems just now. 

The client rebooted the target. It comes back up perfectly well and the agents are listening, but the repl servers stay in "Performing Failure Recovery". If I run Terminate Server, all I get is a line in the log file indicating that the administrator account logged in and out; no action is taken and the repl server does NOT terminate. If I repeat the action, the process terminating the server hangs, and if I kill that process the DB shuts down because a latch is being held. I can't disconnect it using proshut.

At the point the target server was rebooted we get this in the source logs:

[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (9407)  Connection failure for host 192.168.16.238 port 4400 transport TCP. 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (11713) A communications error -4008 in rpCOM_RecvMsg. 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Receive Error
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0000:  d8a0 0102 0000 0000 0000 0000 3011 0000 8abb 0000 8abb 0000 0200 0000 4400 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0020:  c3e1 0300 fb6c 0000 0000 0000 efce 1959 0000 0000 3c41 0000 0100 0000 1900 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0040:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 00c0:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 00e0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0100:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0120:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0140:  0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.180+0100] P-6108       T-7148  I RPLS   26: (10492) A communications error -157 occurred in function rpNLS_PollListener while receiving a message. 
[2017/05/15@16:54:16.180+0100] P-6108       T-7148  I RPLS   26: (10661) The Fathom Replication Server is beginning recovery for agent l_idx_audit. 
[2017/05/15@16:54:16.180+0100] P-6108       T-7148  I RPLS   26: (10842) Connecting to Fathom Replication Agent l_idx_audit.
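For anyone trying to read the RPCommInfo_t dump above, here is a minimal Python sketch that turns the hex rows back into bytes and pulls out any printable text. The helper names (`parse_dump`, `ascii_runs`) are made up for this illustration, and the layout of RPCommInfo_t is not documented anywhere in this thread, so treat what it finds as observations rather than field definitions.

```python
import re

# Two rows copied verbatim from the diagnostic dump in the post.
DUMP_LINES = [
    "[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) "
    "0000:  d8a0 0102 0000 0000 0000 0000 3011 0000 8abb 0000 8abb 0000 0200 0000 4400 0000 ",
    "[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) "
    "00c0:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000 ",
]

# Matches "(-----) <4-hex-digit offset>:  <groups of 4 hex digits>".
ROW = re.compile(r"\(-----\)\s+([0-9a-f]{4}):\s+((?:[0-9a-f]{4}\s*)+)$")


def parse_dump(lines):
    """Rebuild the dumped bytes; rows that are not supplied stay zero-filled."""
    buf = bytearray()
    for line in lines:
        m = ROW.search(line.rstrip())
        if not m:
            continue  # not a hex row, e.g. the "Diagnostic Dump of ..." header
        offset = int(m.group(1), 16)
        data = bytes.fromhex(m.group(2))
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(bytes(end - len(buf)))
        buf[offset:end] = data
    return bytes(buf)


def ascii_runs(buf, min_len=4):
    """Return runs of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(buf)]


if __name__ == "__main__":
    buf = parse_dump(DUMP_LINES)
    print(ascii_runs(buf))
    # Assumption: the four bytes at offset 0x0c are a little-endian integer.
    print(int.from_bytes(buf[0x0C:0x10], "little"))
```

With the two rows shown, the printable run is the target host 192.168.16.238, and the bytes at offset 0x0c read as 4400 when interpreted little-endian, which matches the host and port in message 9407. So the dump appears to describe the connection that dropped rather than add new information.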

We've advised the client to reboot the source, but it's not an ideal solution as they are 24/7. 

 

All Replies

Posted by James Palmer on 15-May-2017 11:27

%% Properties File
%% version 1.1
%% 6 janv. 06 13:56:16

[server]
    control-agents=l_idx_cs
    database=chemsource
    defer-agent-startup=1000
    transition=manual
    transition-timeout=600
    agent-shutdown-action=recovery
    
[control-agent.l_idx_cs]
   name=l_idx_cs
   database=chemsource
   host=192.168.16.238
   port=48090
   connect-timeout=1000
   replication-method=async
   critical=0

[agent]
    name=l_idx_cs
    database=chemsource
    connect-timeout=1000
    listener-maxport=4408
    listener-minport=4406

[transition]
    database-role=reverse
    restart-after-transition=0
    auto-begin-ai=1
    transition-to-agents=l_idx_cs
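
As a quick sanity check on a properties file like the one above, here is a rough Python sketch that reads the sections and lists the host and port for each agent named in control-agents. The parser and the `summarize` helper are hypothetical, written only for this illustration, and the rule that every name in control-agents has a matching [control-agent.<name>] section is my assumption based on the file as posted. Only a subset of the keys is reproduced.

```python
# Subset of the properties file from the post.
PROPERTIES = """\
%% Properties File
[server]
    control-agents=l_idx_cs
    database=chemsource

[control-agent.l_idx_cs]
   name=l_idx_cs
   host=192.168.16.238
   port=48090

[agent]
    name=l_idx_cs
    listener-maxport=4408
    listener-minport=4406
"""

AGENT_IN_LOG = "l_idx_audit"  # agent named in message 10661 of the source log


def parse_properties(text):
    """Parse [section] / key=value lines, skipping blanks and %% comments."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%%"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return sections


def summarize(sections):
    """Map each control agent name to its (host, port), or None if missing."""
    names = sections.get("server", {}).get("control-agents", "")
    report = {}
    for name in (n.strip() for n in names.split(",")):
        if not name:
            continue
        agent = sections.get(f"control-agent.{name}", {})
        port = int(agent["port"]) if "port" in agent else None
        report[name] = (agent.get("host"), port)
    return report


if __name__ == "__main__":
    report = summarize(parse_properties(PROPERTIES))
    print(report)
    print("agent from log configured here:", AGENT_IN_LOG in report)
```

Run against this file it reports l_idx_cs at 192.168.16.238 port 48090, whereas the log excerpt names agent l_idx_audit on port 4400, so the file shown presumably belongs to a different replicated database on the same site.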

Posted by Paul Koufalis on 15-May-2017 12:09

Known issue. There are a few of these in 10.2B08. You're toast. Restarting the source is the only solution I'm aware of.

Posted by James Palmer on 15-May-2017 14:18

Thanks Paul. Shouldn't you be relaxing on a beach somewhere?! :)

Is there a way of preventing this from happening? Would, say, stopping the databases before rebooting reduce the chances?

I ask because we don't experience these sorts of issues on other sites with very similar OERepl configurations.

Posted by Libor Laubacher on 15-May-2017 15:06

proshut DB -C disconnect RPLS (= its userid)

and then a kill from Task Manager has always worked for me so far. That said, I don't use -C terminate; I run -C restart server first, and when that tells me 'server already running', I do the proshut -C disconnect and then kill the process from Task Manager.

Posted by James Palmer on 15-May-2017 15:09

I'm very, very reluctant to kill from Task Manager: even after the disconnect, I've had killing the process crash the DB.

Posted by ezequielmontoya on 19-May-2017 10:16

Hello James, how did you solve the problem?

Posted by James Palmer on 19-May-2017 10:29

No solution yet. Working with Progress Support on it. We've enabled additional logging to see if we can capture info, but we need a system reboot for it to become active.

This thread is closed