Restarting replication

Posted by Jens Dahlin on 05-Sep-2013 01:17

Earlier today we had (planned) downtime on the link between the replication sources and targets. Once the link was up again, after 30 minutes or so, replication was down.

Looking at the target databases (dsrutil db -C monitor) showed that the replication agents/clients were up and running. Trying the same on the sources/servers revealed nothing:

proenv>dsrutil <source> -C monitor

Cannot connect to replication shared memory.  Status = -1

Looking in the log file reveals the downtime:

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Send:isConnected Error

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0000:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 7969 0000 2a0a 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0020:  2a0a 0000 0000 0000 4200 0000 cf43 0300 0d08 0100 0000 0000 a6ad 2752 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0040:  4021 0000 0000 0000 0000 0000 0000 0000 64ff ffff 0000 0000 0000 0000 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 00c0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 3139 322e 3136 382e

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 00e0:  312e 3130 3700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0100:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 3139 322e 3136 382e

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0120:  312e 3130 3700 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (-----) 0140:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (10491) A communications error -155 occurred in function rpNLS_PingAgent while sending PING.

[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (10661) The Fathom Replication Server is beginning recovery for agent agent1.

[2013/09/05@00:06:11.007+0200] P-5449       T-140592286672640 I RPLS    9: (10842) Connecting to Fathom Replication Agent agent1.

[2013/09/05@00:06:11.416+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:06:22.420+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:06:33.424+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:06:43.480+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:06:54.484+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:07:03.768+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:07:14.772+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:07:25.776+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:07:35.832+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:07:46.836+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:07:56.120+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:08:07.124+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:08:17.796+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:08:28.184+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:08:39.188+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.

[2013/09/05@00:08:47.188+0200] P-5449       T-140592286672640 I RPLS    9: (10396) The Fathom Replication Server cannot connect to the database broker on <someip> at the port <someport>.

[2013/09/05@00:08:47.188+0200] P-5449       T-140592286672640 I RPLS    9: (10397) The connection attempt to the Fathom Replication Agent agent1 failed.

[2013/09/05@00:08:47.188+0200] P-5449       T-140592286672640 I RPLS    9: (10697) The Fathom Replication Server was unable to reconnect to agent agent1.  Recovery for this agent will not be performed.

[2013/09/05@00:08:47.188+0200] P-5449       T-140592286672640 I RPLS    9: (10698) The Fathom Replication Server will shutdown but the source database will remain active.

[2013/09/05@00:08:49.188+0200] P-5449       T-140592286672640 I RPLS    9: (10505) The Fathom Replication Server is ending.
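
To see how long the server kept retrying before it gave up, the log can be summarised with a small script. This is only a sketch: it assumes the timestamp layout and the message numbers (10661, 9407, 10697) look exactly as in the excerpt above, and the names `parse_events` and `retry_window` are made up for the illustration.

```python
import re
from datetime import datetime

# Captures the timestamp in [...] and the first numeric message id in (...).
# Hex-dump lines carry "(-----)" instead of a number and are skipped.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\].*?\((?P<num>\d+)\)")

RECOVERY_BEGINS = "10661"  # "... is beginning recovery for agent ..."
CONNECT_FAILURE = "9407"   # "Connection failure for host ..."
GAVE_UP = "10697"          # "... was unable to reconnect to agent ..."

# A few lines copied from the log above (not the complete log).
SAMPLE_LOG = """\
[2013/09/05@00:06:11.006+0200] P-5449       T-140592286672640 I RPLS    9: (10661) The Fathom Replication Server is beginning recovery for agent agent1.
[2013/09/05@00:06:11.416+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:06:22.420+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:08:39.188+0200] P-5449       T-140592286672640 I RPLS    9: (9407)  Connection failure for host <someip> port <someport> transport TCP.
[2013/09/05@00:08:47.188+0200] P-5449       T-140592286672640 I RPLS    9: (10697) The Fathom Replication Server was unable to reconnect to agent agent1.  Recovery for this agent will not be performed.
"""


def parse_events(lines):
    """Yield (timestamp, message_number) for every numbered log line."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y/%m/%d@%H:%M:%S.%f%z")
            yield ts, match.group("num")


def retry_window(lines):
    """Return (connection failures, seconds from recovery start to giving up).

    Raises StopIteration if either marker message is missing from the input.
    """
    events = list(parse_events(lines))
    started = next(ts for ts, num in events if num == RECOVERY_BEGINS)
    gave_up = next(ts for ts, num in events if num == GAVE_UP)
    failures = sum(1 for _, num in events if num == CONNECT_FAILURE)
    return failures, (gave_up - started).total_seconds()


failures, seconds = retry_window(SAMPLE_LOG.splitlines())
print(f"{failures} connection failures, gave up after {seconds:.0f} seconds")
```

Run against the real log file (for example `retry_window(open(path))`, with `path` pointing at the source database log), this gives a quick comparison between the observed retry window and the configured connect-timeout.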

Shutting down and starting the source again (with the -DBService replserv option) restarts replication.

1) Is there any way to restart replication without restarting the database?
2) Would increasing connect-timeout in the repl.properties file to a larger value (say 3600; it is currently 120) let replication come back up by itself once the link is up? Are there any negative side effects of a larger value?

Thanks!

All Replies

Posted by kaan_verdioglu on 05-Sep-2013 03:24

Hello Jens,

For the error below:

proenv>dsrutil <source> -C monitor

Cannot connect to replication shared memory.  Status = -1

use this command:

dsrutil source -C restart server

It restarts the OpenEdge Replication (OER) server on the source database side.

------------------------------

1) Are there any ways of restarting replication without restarting the database?

dsrutil source -C restart server is the answer.

----------------------------------

connect-timeout=10080 means 1 week. 120 is too low for this setting; set it to 10080.

A sample repl.properties file of mine is below:

[server]
    control-agents=targetagent1, targetagent2
    database=kaynak
    transition=manual
    transition-timeout=10080
    defer-agent-startup=720
[control-agent.targetagent1]
   name=targetagent1
   database=hedef1
   host=10.0.0.202
   port=5010
   connect-timeout=10080
   replication-method=async
   critical=0
[control-agent.targetagent2]
   name=targetagent2
   database=hedef2
   host=10.0.0.203
   port=5020
   connect-timeout=10080
   replication-method=async
   critical=0
[transition]

   database-role=normal
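
A quick way to check what each agent is actually configured with is to read the file with Python's standard configparser. This is a sketch under assumptions: it treats repl.properties as a plain INI-style file, which fits the sample above but is not guaranteed for every file, and `load_repl_properties` and `agent_timeouts` are hypothetical helper names. Whether connect-timeout is counted in seconds or minutes is worth confirming in the OpenEdge Replication documentation, so the last lines simply show what 10080 would mean under each reading.

```python
import configparser

# Trimmed copy of the sample above, used here only as test input.
SAMPLE = """\
[server]
    control-agents=targetagent1, targetagent2
    database=kaynak
[control-agent.targetagent1]
   name=targetagent1
   host=10.0.0.202
   connect-timeout=10080
[control-agent.targetagent2]
   name=targetagent2
   host=10.0.0.203
   connect-timeout=10080
[transition]

   database-role=normal
"""


def load_repl_properties(text):
    """Parse repl.properties-style text into a ConfigParser."""
    # Strip indentation so indented keys are never read as continuation lines.
    cleaned = "\n".join(line.strip() for line in text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cleaned)
    return parser


def agent_timeouts(parser):
    """Map each control agent's name to its connect-timeout value."""
    return {
        parser[section]["name"]: parser[section].getint("connect-timeout")
        for section in parser.sections()
        if section.startswith("control-agent.")
    }


timeouts = agent_timeouts(load_repl_properties(SAMPLE))
for agent, value in timeouts.items():
    # Unit is an assumption either way; check the product documentation.
    as_days_if_minutes = value / (60 * 24)
    as_hours_if_seconds = value / 3600
    print(f"{agent}: {value} -> {as_days_if_minutes:g} days if minutes, "
          f"{as_hours_if_seconds:g} hours if seconds")
```
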

-------------

On the source side, you need to run the restart server command.

On the target side, you must restart the database after the source has started.

But it seems your AI extents are all locked. LOCKED means the extent holds data that has not yet been transferred to the target, so all the extents are full.

You can add more extents or solve the connection problem between source and target.

Check the extents with the command below:

rfutil source -C aimage list

Regards,

Kaan

Posted by kaan_verdioglu on 05-Sep-2013 03:30

Jens,

If your production is down, you need to add new extents and serve the database first. Then you can focus on solving the connection problem.

Step by step:

1. Prepare an additional structure file, e.g. called "ai-extents.st".

If you have 5 extents, add 5 more and number the new ones starting from extent 6, like below:

a .\sourcedbname.a6 f 512000
a .\sourcedbname.a7 f 512000
a .\sourcedbname.a8 f 512000
a .\sourcedbname.a9 f 512000
a .\sourcedbname.a10 f 512000

512000 is the extent size in KB, i.e. about 512 MB.

2. Run prostrct add source ai-extents.st, or prostrct addonline source ai-extents.st if the db is online. But I guess it is offline now.

3. Run prostrct reorder ai sourcedbname (this is needed because we do not know which extent is the last one).

4. Serve the source db.
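
The structure file from step 1 can also be generated instead of typed by hand. A minimal sketch, assuming the same naming pattern, fixed-length extents and Windows-style relative path as in the lines above; `ai_extent_lines` is a hypothetical helper, not part of any Progress tool.

```python
def ai_extent_lines(db_name, first, count, size_kb=512000, directory=".\\"):
    """Build structure-file lines for fixed-length AI extents.

    first is the number of the first new extent (existing extents + 1),
    count is how many extents to add, size_kb is the size of each extent.
    """
    return [
        f"a {directory}{db_name}.a{number} f {size_kb}"
        for number in range(first, first + count)
    ]


# Five existing extents, so the new ones are numbered 6 to 10.
st_text = "\n".join(ai_extent_lines("sourcedbname", first=6, count=5)) + "\n"
print(st_text, end="")

# To write the file used in step 2:
# with open("ai-extents.st", "w") as handle:
#     handle.write(st_text)
```

The resulting text is what goes into ai-extents.st before running prostrct in step 2.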

Regards,

Kaan

Posted by Jens Dahlin on 05-Sep-2013 03:35

Actually, I cannot see any sign of all the After Image extents being locked. Rather, they are all empty, except for one that is busy. A couple of hours after replication went down, the automatic backup ran.

All systems are up and seem to be in good shape. If this happens again, I will try the restart server option before restarting the database.

Posted by kaan_verdioglu on 05-Sep-2013 03:49

Hello Jens,

That is good to hear. OER is a very stable product. Good luck.

Regards,

Kaan
