Quick replication question

Posted by James Palmer on 30-Mar-2015 06:52

Our replication server is complaining when trying to monitor:

F:\DATABASE\LIVE>dsrutil icmasliv -C monitor
Cannot connect to replication shared memory.  Status = -1


The target is in normal processing, but hasn't received any data since the server stopped responding. 

I've run Restart Server and it is now Connecting to Agents but seems to be unsuccessful in doing so. Is there anything else I can attempt before restarting the DB? 

Also, is there anywhere I can look to see if I can work out why replication failed? Nothing obvious in the logs I've looked in so far. 

All Replies

Posted by James Palmer on 30-Mar-2015 07:11

Had to restart the DB and it's still complaining so I'm restarting the server. :/

Posted by Jimmer on 30-Mar-2015 07:14

Did you check the connectivity between both servers/databases, i.e. telnet on the target database port from the source db server, do you get a response?
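
If telnet is not installed on the source server, the same check can be scripted. This is a minimal sketch, not an OpenEdge tool: it only tries to open a TCP connection, and the function name `can_connect` plus the host and port in the usage comment are placeholders for your own values.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical usage, run from the source db server:
#   can_connect("target-host", 4500)
```

A True result only shows that something is listening on that port; it does not prove that the replication agent itself is healthy.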

Posted by Libor Laubacher on 30-Mar-2015 07:15

Patience and being a little more specific would be good :)
 
The agent should be in pre-transition, since your replication server does not seem to be running.
So you can connect to the target with dsrutil and it says “normal processing”?
 
Unless you post both database log files, it’s literally impossible to make a guess.
 
Which database did you restart?
 

Posted by James Palmer on 30-Mar-2015 07:22

Unfortunately I am restricted by when the business says it's convenient to do things. The next "convenient" window being Wednesday when our AI extents would be full...

I restarted the source as that is the one which was giving the error.

Posted by James Palmer on 30-Mar-2015 07:54


Found some tell-tale problems in the Target log file, although the timings don't match completely. Target DB was up to date until 12:19 when things went south.

[2015/03/30@11:35:35.088+0100] P-4528       T-4520  I RPLA  162: (9407)  Connection failure for host 192.168.125.1 port 4859 transport TCP. 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0000:  0000 0000 0000 0000 6080 4050 2811 0000 2311 0000 9411 0000 0200 0000 2400 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0020:  4a92 1100 3730 af00 0000 0000 9e26 1955 0000 0000 4021 0000 0000 0000 2c01 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0040:  0000 0000 58f0 ffff 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 00c0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 00e0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0100:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3132 352e 3100 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0120:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0140:  0000 0000 0000 0000 0000 0000 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (10492) A communications error -157 occurred in function rpNLA_PollListener while receiving a message. 
[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (11699) A TCP/IP failure has occurred.  The Agent's will enter PRE-TRANSITION, waiting for connection from the Replication Server. 
[2015/03/30@11:35:38.027+0100] P-4528       T-4520  I RPLA  162: (10392) Database f:\database\live\icmasliv is being replicated from database f:\database\live\icmasliv on host 192.168.125.1. 
[2015/03/30@11:35:39.030+0100] P-4528       T-4520  I RPLA  162: (10671) The OpenEdge Replication Agent agent1 is beginning Recovery Synchronization at block 11913. 
[2015/03/30@11:35:39.399+0100] P-4528       T-4520  I RPLA  162: (6806)  Retry transaction point located at dbkey 0 note type 10 updctr 0. 
[2015/03/30@11:35:39.399+0100] P-4528       T-4520  I RPLA  162: (10705) Retry point located at logical op 1 note type 70 trid 908325446. 
[2015/03/30@11:35:39.720+0100] P-4528       T-4520  I RPLA  162: (10670) The Source and Target databases are synchronized.  Normal processing is resuming.
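
One thing that can be read out of the diagnostic dump without knowing the layout of RPCommInfo_t: the non-zero bytes on the 0100 line are plain ASCII. The sketch below decodes that line; the helper name `ascii_strings` is made up for illustration, and treating the field as a NUL-terminated host string is an assumption based only on the decoded text.

```python
def ascii_strings(words: str) -> list[str]:
    """Decode space-separated hex groups and return the NUL-separated ASCII runs."""
    raw = bytes.fromhex(words.replace(" ", ""))
    # Split on NUL bytes and keep only the non-empty pieces.
    return [piece.decode("ascii") for piece in raw.split(b"\x00") if piece]


# Payload of the dump line at offset 0100, copied from the log above.
line_0100 = (
    "0000 0000 0000 0000 0000 0000 3139 322e 3136 382e "
    "3132 352e 3100 0000 0000 0000"
)

print(ascii_strings(line_0100))  # ['192.168.125.1']
```

That matches the host named in message 9407, so the dump is at least describing the same connection that failed.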

Posted by Libor Laubacher on 30-Mar-2015 08:02

There is nothing wrong per se with this.
 
Target detected a failure, agent went into pre-transition and then the connection got back up.
 

I would need the log entries from around “Target DB was up to date until 12:19 when things went south”.

 

Posted by Jimmer on 30-Mar-2015 08:13

Is the target database "aware" that you restarted the source db/server? Did it say anything about it, or did it keep happily ignoring the issue on the source?

Posted by James Palmer on 30-Mar-2015 08:25

No, the target carried merrily on its way when the source was rebooted.

When I attempted to restart the target db I got a message saying shared memory was already in use.

Posted by Jimmer on 30-Mar-2015 08:35

So the target db stopped, but gave you an error upon starting. I've seen it sometimes take a couple of minutes to shut down completely. Is the dbname.lk still there? Can you promon the database, or does it say that there is no server for it? Is the rpagent.exe process still alive (assuming Windows platform)?

Posted by James Palmer on 30-Mar-2015 08:58

I left it 30 minutes before trying again and still no joy. The admin service log file had an error saying shared memory was already in use. DB log file showed shutdown was complete.

Posted by Libor Laubacher on 30-Mar-2015 09:03

Assuming this is Windows – Process Monitor will/should tell what is holding the shared memory
 

Posted by Jimmer on 30-Mar-2015 09:04

If (.lk isn't there AND promon says "no server..." AND rpagent.exe isn't there) then maybe the db's port is in a hanging state. You should be able to kill/force disconnect the port (with cports or TCP View, etc...)

Then start the db again.
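
The checklist above can be written down as a small sketch. Only the .lk test is automated here; whether promon reports a server and whether rpagent.exe is running still have to be checked by hand and passed in. The function names are illustrative, not part of any Progress tooling.

```python
from pathlib import Path


def lock_file_exists(db_path: str) -> bool:
    """True if the .lk file for the database (db_path without extension) exists."""
    return Path(db_path + ".lk").exists()


def port_may_be_hung(lk_present: bool, promon_sees_server: bool,
                     rpagent_running: bool) -> bool:
    """Apply the rule from the post: all three signs of a live db must be absent."""
    return not lk_present and not promon_sees_server and not rpagent_running


# Hypothetical usage with the path from this thread:
#   lk = lock_file_exists(r"F:\DATABASE\LIVE\icmasliv")
#   port_may_be_hung(lk, promon_sees_server=False, rpagent_running=False)
```

If any of the three is still present, the database has not fully shut down and the port is probably not the problem.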

Posted by James Palmer on 30-Mar-2015 16:03

Hmmmm Back in the scenario again. I'll try and give more info this time.

Source:

Win 2003 server 32 bit, running Progress 11.2.1 32 bit (yes I know. We are migrating to 11.5 64 bit in May).

Target:

Win 2008 R2 64 bit running Progress 11.2.1 32 bit.

Log File:

[2015/03/30@21:01:27.491+0100] P-3632       T-3628  I RPLA  162: (9407)  Connection failure for host 192.168.125.1 port 2633 transport TCP. 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0000:  0000 0000 0000 0000 28a9 6700 2811 0000 2311 0000 9411 0000 0200 0000 2400 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0020:  8d6d 0000 a044 0400 0000 0000 508f 1955 0000 0000 4021 0000 0000 0000 0500 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0040:  0000 0000 58f0 ffff 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 00c0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 00e0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0100:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3132 352e 3100 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0120:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0140:  0000 0000 0000 0000 0000 0000 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (10492) A communications error -157 occurred in function rpNLA_PollListener while receiving a message. 
[2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (11699) A TCP/IP failure has occurred.  The Agent's will enter PRE-TRANSITION, waiting for connection from the Replication Server. 

We are in Pre Transition. 

I've restarted the server and we've gone to Performing Startup Synchronisation. 

Fingers crossed it'll come back but some ideas where to look for why this is happening would be good as I don't appreciate getting alerted during the night :D 
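
One way to narrow down the "why" is to pull the timestamps of every connection failure out of the target's log and compare them with whatever was happening on the network at those times. The sketch below assumes the line layout seen in the excerpts in this thread; the regular expression and the function name `find_events` are my own, and 9407 and 11699 are simply the message numbers that appear in the excerpts.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2015/03/30@21:01:27.491+0100] P-3632  T-3628  I RPLA  162: (9407)  text
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2}@\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})\]\s+"
    r"P-\d+\s+T-\d+\s+\w\s+\w+\s+\d+:\s+"
    r"\((?P<msg>\d+|-+)\)\s+(?P<text>.*)$"
)


def find_events(lines, wanted=("9407", "11699")):
    """Return (timestamp, message number, text) for each wanted log message."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip())
        if match is None or match.group("msg") not in wanted:
            continue  # skip dump lines, other messages and non-log text
        when = datetime.strptime(match.group("ts"), "%Y/%m/%d@%H:%M:%S.%f%z")
        events.append((when, match.group("msg"), match.group("text")))
    return events


# Hypothetical usage against the target database log:
#   with open("icmasliv.lg") as log:
#       for when, msg, text in find_events(log):
#           print(when.isoformat(), msg, text)
```

The log file name in the usage comment is a guess; point it at whichever .lg file the excerpts above came from.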

Posted by James Palmer on 30-Mar-2015 16:04

When I say restarted the server I mean I've restarted the replication server, not the whole server!

Posted by Jimmer on 31-Mar-2015 00:54

A little bit off topic, I had a support case once, and was told that replication across nonidentical OS (Windows 2003 32 bit vs Windows 2003 64 bit in my case) wasn't supported.

Posted by James Palmer on 31-Mar-2015 02:22

Good job we're moving to homogeneous systems in May then!

James Palmer | Application Developer
Tel: 01253 785103


Posted by James Palmer on 31-Mar-2015 03:08

Just as an extra note: we have 6 databases that are replicating between the same servers. Is it at all pertinent that only one of them is failing like this?

Posted by James Palmer on 31-Mar-2015 03:12

And now the systems guys tell me they were messing around with switches last night. Nice of them to warn me.

Posted by Brian Bowman on 31-Mar-2015 07:24

Hi All –
   I have checked on this, and as long as the OE version and bit-ness are the same, the OS and its bit-ness should not make a difference.  Just remember it has to be Windows to Windows etc.  This should be a supported environment.  There are many use cases where I can foresee the two sides not being at the same OS level.
 
I will work with support to get this cleaned up.
 
If you could give me the Tech Support case (offline) I can ensure we are successful moving forward.
 
Thanks
 
Brian
 
Brian L. Bowman
 
Sr. Principal Product Manager
Progress Software Corporation
14 Oak Park, Bedford, MA, USA 01730
 
Phone: +1 (603) 801-8259
Email: bowman@progress.com
 
 

Posted by George Potemkin on 31-Mar-2015 07:40

The Progress service-pack level (as a part of the OE version) should be (or is recommended to be) the same on the source and target boxes, due to possible differences in the structure of recovery notes. I guess that can't be guaranteed if the Progress bit-ness is different. That is why it's not supported.

Regards,

George

Posted by James Palmer on 14-Apr-2015 11:55

So how do I work out what's using the shared memory on Windows? Got ProcMon installed. Not sure how to reconcile what I see with a particular DB.

Posted by Libor Laubacher on 14-Apr-2015 12:17

Use Find Handle and search for the dbname.


Posted by Libor Laubacher on 14-Apr-2015 12:25

Not ProcMon though, but procexp, i.e. Process Explorer.


Posted by James Palmer on 14-Apr-2015 16:44

Got a really weird one at the moment. I'll keep it in this thread as it is related.

Source DB says that replication is in normal processing. Target DB is listening. AI notes are not being processed, as the AI files are all marked as locked. I've restarted the target DB. I don't want to restart the source unless I have to. The source hasn't responded to a Terminate Server request, and Restart Server says the server is already running. Any ideas?

This thread is closed