Intermittent Error 778 - struggling to track down

Posted by James Palmer on 20-Aug-2015 03:41

We are intermittently getting Error 778:

Error reading socket, ret=10054, errno=0. (778)

It only seems to happen overnight. It usually affects processes running on the primary server that query information (-RO) from the replication target (don't ask!), or batch jobs on remote machines. Very occasionally it seems to hit the Progress backups as well.

[2015/08/19@21:22:01.624+0100] P-53412      T-18760 I BACKUP272: (5057)  Backup failed due to EOF during next output device request. 
[2015/08/19@21:22:01.627+0100] P-53412      T-18760 I BACKUP272: (5462)  End backup of Data file(s). 
[2015/08/19@21:22:01.672+0100] P-53412      T-18760 I BACKUP272: (1617)  Backup terminated due to errors. 


Which is more of a worry. 

Progress 11.5, Win Server 2012 R2. VMWare environment.

I've been through all the logs I can think of on the server and found nothing. My Systems guy is reluctant to look into it even though I've managed to tie it down to a 1 minute window. 
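For anyone wanting to narrow down a window like this from the logs, here is a minimal Python sketch that groups failure lines by minute. It assumes the lines carry the same timestamp prefix as the backup messages quoted above; the function names and the list of message numbers to look for are my own choices, not anything Progress ships.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp prefix used in the log lines quoted above,
# e.g. [2015/08/19@21:22:01.624+0100]
STAMP = re.compile(r"^\[(\d{4}/\d{2}/\d{2}@\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})\]")


def parse_stamp(line):
    """Return the line's timestamp as an aware datetime, or None if absent."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y/%m/%d@%H:%M:%S.%f%z")


def failures_per_minute(lines, needles=("(778)", "(5057)", "(1617)")):
    """Count lines containing any of the message numbers, grouped by minute."""
    counts = Counter()
    for line in lines:
        if not any(needle in line for needle in needles):
            continue
        stamp = parse_stamp(line)
        if stamp is not None:
            counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


sample = [
    "[2015/08/19@21:22:01.624+0100] P-53412      T-18760 I BACKUP272: (5057)  Backup failed due to EOF during next output device request.",
    "[2015/08/19@21:22:01.627+0100] P-53412      T-18760 I BACKUP272: (5462)  End backup of Data file(s).",
    "[2015/08/19@21:22:01.672+0100] P-53412      T-18760 I BACKUP272: (1617)  Backup terminated due to errors.",
]
print(failures_per_minute(sample))
```

Running the same function over several nights of logs makes it easier to show whether the failures cluster around a scheduled job at the same minute each night.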

Does anyone have any ideas where we are best looking to find out what might be happening at this time? I understand it's most likely to do with something using the sockets that Progress is trying to use. (My understanding in this area is very slim). 

All Replies

Posted by TheMadDBA on 20-Aug-2015 09:57

The probackup issue may or may not be related, depending on where you are writing the files (local or remote). The 778 errors you are getting are network related (see the KBs below to decode the return codes and for info about your issue).

knowledgebase.progress.com/.../P42852

knowledgebase.progress.com/.../P52956

In this exact case you are getting "Connection reset by peer", meaning there is some firewall or network related issue causing the disconnect. Event Viewer "should" have some information.
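As a rough illustration of the decoding step, the Python sketch below pulls the ret value out of a 778 message and looks it up in a small table of Winsock codes. The table is deliberately partial and the function name is hypothetical; the KBs remain the reference for the full list.

```python
import re

# A few common Winsock error codes; this is not an exhaustive list.
WINSOCK_CODES = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

# Matches the tail of a message such as:
#   Error reading socket, ret=10054, errno=0. (778)
ERROR_778 = re.compile(r"ret=(\d+), errno=(\d+)\. \(778\)")


def decode_778(message):
    """Return a description of the ret code in a 778 message, or None."""
    match = ERROR_778.search(message)
    if match is None:
        return None
    ret = int(match.group(1))
    return WINSOCK_CODES.get(ret, f"unrecognised code {ret}")


print(decode_778("Error reading socket, ret=10054, errno=0. (778)"))
```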

Ask about firewall/network changes that roughly correspond to the time frames where you have the issues. Network guys always think their changes can be done online without impact, but that isn't always true for databases with persistent connections.

VMWare adds another twist so I would force him to look into it (with management twisting his arm if needed). Make sure there aren't any vmware snapshots/backups or other maintenance happening during that time.

Posted by James Palmer on 20-Aug-2015 10:03

The probackup is going over the network, sadly. If they continue to fail like this, it adds weight to the case for getting enough space allocated on the Compellent.

Posted by James Palmer on 27-Aug-2015 03:23

As a slight tangent, is it possible to catch and handle this error more gracefully somehow? Obviously not for the backups, but for the batch processes. Was hoping to send email alerts.
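One possible workaround, sketched in Python: this does not catch the error inside the ABL session itself, it scans a batch job's log output afterwards and builds an alert email if any 778 lines turn up. The job name, addresses and SMTP relay host are all placeholders, and the assumption is that the batch job writes the 778 message to a log file you can read.

```python
import smtplib
from email.message import EmailMessage


def find_778(lines):
    """Return the log lines that report error 778."""
    return [line.rstrip("\n") for line in lines if "(778)" in line]


def build_alert(hits, job_name, sender, recipient):
    """Build (but do not send) an alert email for the given log lines."""
    msg = EmailMessage()
    msg["Subject"] = f"{job_name}: {len(hits)} socket error(s) (778)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(hits))
    return msg


def send_alert(msg, smtp_host):
    """Send the message through a plain SMTP relay (host is an assumption)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)


# Example input; in practice these lines would be read from the job's log file.
log_lines = [
    "Starting nightly extract\n",
    "Error reading socket, ret=10054, errno=0. (778)\n",
]
hits = find_778(log_lines)
if hits:
    alert = build_alert(hits, "nightly-batch", "batch@example.com", "dba@example.com")
    print(alert["Subject"])
    # send_alert(alert, "smtp.example.com")  # enable once the relay host is known
```

Keeping the message building separate from the sending means the detection logic can be checked without a mail server to hand.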

Posted by gus on 27-Aug-2015 06:32

One thing comes to mind. The backup program does not use the network at all. But is your backup writing to a target device that is accessed over the network (like NFS or CIFS)? If so, it could be generating so much traffic that it interferes with other network activity.

Posted by James Palmer on 27-Aug-2015 07:36

Thanks gus, yes it's writing over the network, so it's a possibility. Although the backups fail less frequently than we get the error.

I'll look at getting a backup target location on the SAN though - it's long overdue.

Posted by gus on 28-Aug-2015 08:48

Where is the backup getting its list of volume names to write to? My guess is that there is some sort of interference between the backup and other activity, and it is chance that determines who loses.

Posted by James Palmer on 28-Aug-2015 08:55

The volume names are defined in a file in the target location.

We tried writing locally last night on a new volume and it seems Veeam kicked in, and the server shut itself down. Not ideal. It has brought to light though that Veeam is snapshotting the database server when I specifically asked for it not to do so. So it's been excluded from Veeam and we'll see how we go tonight.

Posted by TheMadDBA on 28-Aug-2015 10:14

Probably not a bad time to review the other things they weren't supposed to do. Trust but verify and all that :-)

Posted by James Palmer on 05-Sep-2015 16:28

So the server has been excluded from Veeam snapshots since this incident, and as far as I know (been on leave so not 100% sure, but no reported errors) this issue has gone away.

Is this particular error a known issue with Veeam/VMWare? If not, should I raise it with support for a KB article?

Posted by TheMadDBA on 06-Sep-2015 12:24

This KB ( knowledgebase.progress.com/.../000045806 ) should probably be updated to include errors like the ones you received.

It is a known issue that a vmware snapshot pauses all activity on the host while the memory is being copied.  If the host is 100% self contained that is normally not a big issue (except for databases). If you have hosts communicating with each other it can cause obvious issues with communication between the hosts.

Posted by James Palmer on 07-Sep-2015 02:18

I've logged a Case to get this documented.

Posted by Libor Laubacher on 07-Sep-2015 03:04

By logging a case I hope you meant with VMware as they claim their snapshots are 'non-application-disruptive'.

Had the Veeam backup been configured to put the db into a quiet point?

Posted by James Palmer on 07-Sep-2015 03:15

No, the Veeam backup was not set to quiesce the databases. It shouldn't even have been happening.

I meant logging a case with PSC to update the KB, but yeah might be an idea to shout at VMWare ;)
