Earlier today we had (planned) downtime on the link between our replication sources and targets. Once the link was back up, after 30 minutes or so, replication was down.
Checking the target databases (dsrutil db -C monitor) showed that the replication agents/clients were up and running. Trying the same on the sources/servers revealed nothing:
proenv>dsrutil <source> -C monitor
Cannot connect to replication shared memory. Status = -1
Looking in the log file reveals the downtime:
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Send:isConnected Error
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0000: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 7969 0000 2a0a 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0020: 2a0a 0000 0000 0000 4200 0000 cf43 0300 0d08 0100 0000 0000 a6ad 2752 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0040: 4021 0000 0000 0000 0000 0000 0000 0000 64ff ffff 0000 0000 0000 0000 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0060: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0080: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 00a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 00c0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 3139 322e 3136 382e
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 00e0: 312e 3130 3700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0100: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 3139 322e 3136 382e
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0120: 312e 3130 3700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (-----) 0140: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (10491) A communications error -155 occurred in function rpNLS_PingAgent while sending PING.
[2013/09/05@00:06:11.006+0200] P-5449 T-140592286672640 I RPLS 9: (10661) The Fathom Replication Server is beginning recovery for agent agent1.
[2013/09/05@00:06:11.007+0200] P-5449 T-140592286672640 I RPLS 9: (10842) Connecting to Fathom Replication Agent agent1.
[2013/09/05@00:06:11.416+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:06:22.420+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:06:33.424+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:06:43.480+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:06:54.484+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:07:03.768+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:07:14.772+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:07:25.776+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:07:35.832+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:07:46.836+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:07:56.120+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:08:07.124+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:08:17.796+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:08:28.184+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:08:39.188+0200] P-5449 T-140592286672640 I RPLS 9: (9407) Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:08:47.188+0200] P-5449 T-140592286672640 I RPLS 9: (10396) The Fathom Replication Server cannot connect to the database broker on <someip> at the port <someport>.
[2013/09/05@00:08:47.188+0200] P-5449 T-140592286672640 I RPLS 9: (10397) The connection attempt to the Fathom Replication Agent agent1 failed.
[2013/09/05@00:08:47.188+0200] P-5449 T-140592286672640 I RPLS 9: (10697) The Fathom Replication Server was unable to reconnect to agent agent1. Recovery for this agent will not be performed.
[2013/09/05@00:08:47.188+0200] P-5449 T-140592286672640 I RPLS 9: (10698) The Fathom Replication Server will shutdown but the source database will remain active.
[2013/09/05@00:08:49.188+0200] P-5449 T-140592286672640 I RPLS 9: (10505) The Fathom Replication Server is ending.
Thanks!
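A log excerpt like the one above can be summarized with a small script. The following is only a minimal sketch in Python, assuming every line follows the timestamp and message-number layout shown in the excerpt. The function names `parse_line` and `summarize` are hypothetical helpers, not part of any OpenEdge tool; the message numbers 9407 and 10698 are taken from the log itself.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt above, e.g.
# [2013/09/05@00:06:11.416+0200] P-5449 T-... I RPLS 9: (9407) Connection failure ...
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2}@\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})\]"
    r".*?\((?P<num>\d+|-----)\)\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, message number, text), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y/%m/%d@%H:%M:%S.%f%z")
    return ts, m.group("num"), m.group("msg")


def summarize(lines):
    """Count connection failures (9407) and find the server shutdown notice (10698)."""
    failures = []
    server_stopped_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, num, _ = parsed
        if num == "9407":
            failures.append(ts)
        elif num == "10698":
            server_stopped_at = ts
    # Time between the first and the last failed connection attempt.
    retry_seconds = (failures[-1] - failures[0]).total_seconds() if failures else 0.0
    return {
        "failures": len(failures),
        "retry_seconds": retry_seconds,
        "server_stopped_at": server_stopped_at,
    }
```

Passing the lines of the database log file to `summarize` shows how long the server kept retrying before it gave up, which can then be compared against the configured connect-timeout.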
Hello Jens,
For the error below:
proenv>dsrutil -C monitor
Cannot connect to replication shared memory. Status = -1
use the command below:
dsrutil source restart server
This command restarts the OER server on the source database side.
------------------------------
1) Are there any ways of restarting replication without restarting the database?
dsrutil source restart server is the answer.
----------------------------------
connect-timeout=10080 means 1 week. 120 is too low for this setting; set it to 10080.
A sample from my repl.properties file is below:
database-role=normal
-------------
On the source side, you need to run the restart server command.
On the target side, you must restart the database after the source has started.
But it seems your AI extents are all locked. LOCKED means there is data in the AI extent that has not yet been transferred to the target, so all extents are full.
You can add additional extents or solve the connection problem between source and target.
Check the extents with the command below:
rfutil source -C aimage extent list
Regards,
Kaan
Jens,
If your production is down, you need to add new extents and serve the database. Then you can focus on solving the connection problem.
Step by step
1. Prepare an additional structure file, e.g. called "ai-extents.st".
If you have 5 extents, add 5 more and number the new extents starting from extent 6,
like below:
a .\sourcedbname.a10 f 512000
512000 is the extent size in KB, i.e. about 512 MB.
2. prostrct add source ai-extents.st, or prostrct addonline source ai-extents.st if the db is online. But I guess it is offline now.
3. prostrct reorder ai sourcedbname (this is needed, because we do not know which extent is the last one)
4. Serve the source db.
Regards,
Kaan
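The structure file from step 1 above can be generated rather than typed by hand. This is a minimal sketch in Python, assuming the fixed-size line format shown in the example (`a <path> f <size>`); the function name `ai_extent_lines` and its parameters are hypothetical and only for illustration.

```python
def ai_extent_lines(db_name, existing, to_add, size_kb=512000, path_prefix=".\\"):
    """Build 'a <path> f <size>' lines for AI extents existing+1 .. existing+to_add.

    db_name     : database name used in the extent file names
    existing    : number of AI extents the database already has
    to_add      : number of new extents to describe
    size_kb     : fixed extent size, as in the example line above
    path_prefix : directory prefix; the default mirrors the '.\\' used in the example
    """
    if existing < 0 or to_add < 1:
        raise ValueError("existing must be >= 0 and to_add must be >= 1")
    first = existing + 1
    return [
        f"a {path_prefix}{db_name}.a{n} f {size_kb}"
        for n in range(first, first + to_add)
    ]


if __name__ == "__main__":
    # Five existing extents, five new ones: describes .a6 through .a10.
    with open("ai-extents.st", "w") as st_file:
        st_file.write("\n".join(ai_extent_lines("sourcedbname", 5, 5)) + "\n")
```

The resulting ai-extents.st is then passed to prostrct add (or prostrct addonline) as in step 2.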
Actually, I cannot see any sign of all the After Image extents being locked. Rather, they are all empty, except for one that is busy. A couple of hours after replication was turned off, the automatic backup ran.
All systems are up and seem to be in good shape. If this happens again I will try the restart server option before restarting the database.
Hello Jens,
That is nice to hear. OER is a very stable product. Good luck.
Regards,
Kaan