We're still having trouble with Replication at a client site. I'll try to explain. OE 10.2B08 on Windows.
This has been an ongoing issue for a while, and we're struggling to get anywhere with Tech Support as well. We had another instance of the problem just now.
The client rebooted the target. It came back up perfectly well and the agents are listening, but the repl servers are stuck at "Performing Failure Recovery". If I run Terminate Server, all I get is a line in the log file indicating that the administrator account logged in and out; no action is taken and the repl server does NOT terminate. If I repeat the action, the process terminating the server hangs, and if I kill it the DB shuts down due to a latch being held. I can't disconnect it using proshut either.
At the point the target server was rebooted we get this in the source logs:
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (9407) Connection failure for host 192.168.16.238 port 4400 transport TCP.
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (11713) A communications error -4008 in rpCOM_RecvMsg.
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Receive Error
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0000: d8a0 0102 0000 0000 0000 0000 3011 0000 8abb 0000 8abb 0000 0200 0000 4400 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0020: c3e1 0300 fb6c 0000 0000 0000 efce 1959 0000 0000 3c41 0000 0100 0000 1900 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0040: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0060: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0080: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 00a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 00c0: 0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 00e0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0100: 0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0120: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (-----) 0140: 0000 0000 0000 0000 0000 0000
[2017/05/15@16:54:16.180+0100] P-6108 T-7148 I RPLS 26: (10492) A communications error -157 occurred in function rpNLS_PollListener while receiving a message.
[2017/05/15@16:54:16.180+0100] P-6108 T-7148 I RPLS 26: (10661) The Fathom Replication Server is beginning recovery for agent l_idx_audit.
[2017/05/15@16:54:16.180+0100] P-6108 T-7148 I RPLS 26: (10842) Connecting to Fathom Replication Agent l_idx_audit.
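For anyone wanting to spot this state without reading the log by eye, here is a rough Python sketch that pulls entries out of a database log and picks out the connection failure (9407) and the start of recovery (10661). The field layout is inferred only from the excerpt above, and the names `ENTRY`, `parse_entries` and `replication_events` are made up for illustration; treat it as a starting point, not a supported tool.

```python
import re

# Layout inferred from the excerpt: [timestamp] P-pid T-tid level type usernum: (msgnum) text
# Works whether entries are one per line or run together on one line.
ENTRY = re.compile(
    r"\[(?P<ts>\d{4}/\d{2}/\d{2}@[\d:.+-]+)\]\s+"
    r"P-(?P<pid>\d+)\s+T-(?P<tid>\d+)\s+"
    r"(?P<level>\w)\s+(?P<who>\w+)\s+(?P<usr>\d+):\s+"
    r"\((?P<msg>\d+|-+)\)\s+"
    r"(?P<text>.*?)(?=\s*\[\d{4}/\d{2}/\d{2}@|\s*\Z)",
    re.DOTALL,
)


def parse_entries(text):
    """Return one dict per log entry found in text."""
    return [m.groupdict() for m in ENTRY.finditer(text)]


def replication_events(text, numbers=("9407", "10661")):
    """Keep only entries whose message number is in numbers.

    The defaults are the two message numbers seen in this thread.
    """
    return [e for e in parse_entries(text) if e["msg"] in numbers]


if __name__ == "__main__":
    sample = (
        "[2017/05/15@16:54:16.178+0100] P-6108 T-7148 I RPLS 26: (9407) "
        "Connection failure for host 192.168.16.238 port 4400 transport TCP. "
        "[2017/05/15@16:54:16.180+0100] P-6108 T-7148 I RPLS 26: (10661) "
        "The Fathom Replication Server is beginning recovery for agent l_idx_audit."
    )
    for event in replication_events(sample):
        print(event["ts"], event["who"], event["usr"], event["msg"], event["text"])
```

The number after RPLS (26 here) looks like the user number of the replication server in that log, which is presumably what a disconnect would be aimed at; check that against your own log before relying on it.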
We've advised the client to reboot the source, but it's not an ideal solution as they are 24/7.
%% Properties File
%% version 1.1
%% 6 janv. 06 13:56:16

[server]
control-agents=l_idx_cs
database=chemsource
defer-agent-startup=1000
transition=manual
transition-timeout=600
agent-shutdown-action=recovery

[control-agent.l_idx_cs]
name=l_idx_cs
database=chemsource
host=192.168.16.238
port=48090
connect-timeout=1000
replication-method=async
critical=0

[agent]
name=l_idx_cs
database=chemsource
connect-timeout=1000
listener-maxport=4408
listener-minport=4406

[transition]
database-role=reverse
restart-after-transition=0
auto-begin-ai=1
transition-to-agents=l_idx_cs
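Since we're comparing this site against others with very similar configurations, a small Python sketch for reading a properties file like the one above and summarising each control agent might help when diffing sites. It assumes the file is INI-like with `%%` comment lines and that `control-agents` is a comma-separated list; both are guesses from this one file, and the function names are hypothetical.

```python
import configparser

# Sample copied from the properties file posted above.
SAMPLE = """\
%% Properties File
%% version 1.1
[server]
control-agents=l_idx_cs
database=chemsource
agent-shutdown-action=recovery
[control-agent.l_idx_cs]
name=l_idx_cs
database=chemsource
host=192.168.16.238
port=48090
connect-timeout=1000
replication-method=async
critical=0
"""


def load_repl_properties(text):
    """Parse properties text; lines starting with %% are treated as comments (assumption)."""
    cp = configparser.ConfigParser(
        comment_prefixes=("%%", "#", ";"), interpolation=None
    )
    cp.read_string(text)
    return cp


def control_agent_summary(cp):
    """Map each agent named in [server] control-agents to its key settings.

    An agent with no matching [control-agent.NAME] section maps to None.
    """
    raw = cp["server"]["control-agents"]
    names = [n.strip() for n in raw.split(",") if n.strip()]
    summary = {}
    for name in names:
        section = f"control-agent.{name}"
        if not cp.has_section(section):
            summary[name] = None
            continue
        sec = cp[section]
        summary[name] = {
            "host": sec.get("host"),
            "port": sec.getint("port"),
            "method": sec.get("replication-method"),
            "critical": sec.get("critical"),
        }
    return summary


if __name__ == "__main__":
    print(control_agent_summary(load_repl_properties(SAMPLE)))
```

Running the same summary against the properties files from a working site and from this one would at least show whether the control-agent settings really do match.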
Known issue. There are a few of these in 10.2B08. You're toast. Restarting the source is the only solution I'm aware of.
Thanks Paul. Shouldn't you be relaxing on a beach somewhere?! :)
Is there a way of preventing this from happening? Would, say, stopping the databases before rebooting reduce the chances?
I ask because we don't experience these sorts of issues on other sites with very similar OERepl configurations.
proshut DB -C disconnect RPLS (= its userid)
followed by a kill from Task Manager has always worked for me so far. But then I don't use -C terminate; I use -C restart server first, and when that tells me 'server already running', I run proshut -C disconnect and then kill the process from Task Manager.
I'm very very reluctant to kill from TaskManager as even after the disconnect I've had the process kill crash the DB.
No solution yet. Working with Progress Support on it. We've enabled additional logging to see if we can capture info, but we need a system reboot for it to become active.