Hi,
We're running low on ideas. Since July 13 our databases have been crashing with semaphore errors.
History
Things we tried
I don't know if it's related, but I noticed in the latest crash that the timestamp on one of the entries is in the wrong timezone:
[2016/07/25@06:23:11.613+0200] P-32381 T-140189001459520 I ABL 86: (453) Logout by batchclnt on batch.
[2016/07/25@06:25:04.592+0200] P-28093 T-140142107518784 I SHUT 88: (453) Logout by proadmin on /dev/pts/0.
Mon Jul 25 04:25:04 2016
[2016/07/25@04:25:04.000+0000] P-28093 T-140142107518784 I SHUT (562) HANGUP signal received.
Mon Jul 25 06:28:41 2016
[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (2520) Stopped.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (2520) Stopped.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (453) Logout by proadmin on batch.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:48.303+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:28:48.303+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.
[2016/07/25@06:29:28.457+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:28.457+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.
[2016/07/25@06:29:42.516+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:42.516+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.
Attached: log file errors.
> I don't know if it's related, but I noticed in the latest crash that the timestamp on one of the entries is in the wrong timezone
It's just a feature:
A partial fix is included in OpenEdge 11.2.1 and 11.3.x
As a side effect of the fix, messages from signal handlers will always show as UTC time instead of the database's timezone.
> (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
Errno 22 is EINVAL, "invalid argument". In this case the semid is invalid. I guess the other small semid values are wrong as well.
Carl, can you create a list of the semids allocated by the broker at startup?
promon / R&D / debghb / 16. Semaphores (but first set 5. Adjust Monitor Options / 1. Enter page length: 9999).
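For repeated captures, promon can be driven non-interactively by feeding the menu choices to its stdin (tools like dbmon script it this way). A rough sketch of a one-shot capture; the input sequence is guessed from the menu path above, so treat the exact keystrokes as an assumption that may differ between OpenEdge releases:

#!/bin/sh
# semdump.sh - hypothetical helper: print the promon R&D "Semaphores"
# screen once to stdout; input follows promon / R&D / debghb /
# 5. Adjust Monitor Options / 1. page length 9999 / T (top) / 16. Semaphores
DB=/data/cce/applprod/data/LISA
promon $DB <<'EOF'
R&D
debghb
5
1
9999
T
16
Q
Q
EOF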
[2016/07/13@22:01:49.085+0200] P-60282 T-140225063741248 I SRV 5: (794) Usernum 187 terminated abnormally.
[2016/07/13@22:01:49.085+0200] P-60282 T-140225063741248 I SRV 5: (-----) TCP/IP write error occurred with errno 32
[2016/07/13@22:01:49.092+0200] P-60282 T-140225063741248 I SRV 5: (1132) Invalid semaphore id
[2016/07/13@22:01:49.092+0200] P-60282 T-140225063741248 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.
I saw these errors (TCP/IP write error occurred with errno 32) when all remote client servers were napping while trying to get the MTX latch.
Do you monitor latch activity? You can do that using ProTop or my dbmon script.
The list of errors from your log, with the unique semids:
(10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22.
(10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = -32763, errno = 22.
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32762, errno = 22.
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22.
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = 6, errno = 22.
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.
Note that -32763, -32762 and -32761 = 32768 minus 5, 6 and 7.
(6526) Number of Semaphore Sets (-semsets): 3
Your database used 3 semsets, and all their semids appear to have been truncated to one byte.
Hi George,
thanks for your reply. We're not yet monitoring latch activity; I will look that up.
# ipcs -s
------ Semaphore Arrays --------
key semid owner perms nsems
0x00000000 294912 proadmin 666 7
0x00000000 327681 proadmin 666 97
0x00000000 360450 proadmin 666 97
0x00000000 393219 proadmin 666 7
0x00000000 425988 proadmin 666 97
0x00000000 458757 proadmin 666 97
For the LISA database:
07/25/16 Semaphores
09:07:11
User Id Num
0 327681 0
1 360450 0
2 327681 1
3 360450 1
4 327681 2
5 360450 2
6 327681 3
7 360450 3
8 327681 4
9 360450 4
10 327681 5
11 360450 5
12 327681 6
13 360450 6
14 327681 7
15 360450 7
16 327681 8
17 360450 8
18 327681 9
19 360450 9
20 327681 10
21 360450 10
22 327681 11
23 360450 11
24 327681 12
25 360450 12
26 327681 13
27 360450 13
28 327681 14
29 360450 14
30 327681 15
31 360450 15
32 327681 16
33 360450 16
34 327681 17
35 360450 17
36 327681 18
37 360450 18
38 327681 19
39 360450 19
40 327681 20
41 360450 20
42 327681 21
43 360450 21
44 327681 22
45 360450 22
46 327681 23
47 360450 23
48 327681 24
49 360450 24
50 327681 25
51 360450 25
52 327681 26
53 360450 26
54 327681 27
55 360450 27
56 327681 28
57 360450 28
58 327681 29
59 360450 29
60 327681 30
61 360450 30
62 327681 31
63 360450 31
64 327681 32
65 360450 32
66 327681 33
67 360450 33
68 327681 34
69 360450 34
70 327681 35
71 360450 35
72 327681 36
73 360450 36
74 327681 37
75 360450 37
76 327681 38
77 360450 38
78 327681 39
79 360450 39
80 327681 40
81 360450 40
82 327681 41
83 360450 41
84 327681 42
85 360450 42
86 327681 43
87 360450 43
88 327681 44
89 360450 44
90 327681 45
91 360450 45
92 327681 46
93 360450 46
94 327681 47
95 360450 47
96 327681 48
97 360450 48
98 327681 49
99 360450 49
100 327681 50
101 360450 50
102 327681 51
103 360450 51
104 327681 52
105 360450 52
106 327681 53
107 360450 53
108 327681 54
109 360450 54
110 327681 55
111 360450 55
112 327681 56
113 360450 56
114 327681 57
115 360450 57
116 327681 58
117 360450 58
118 327681 59
119 360450 59
120 327681 60
121 360450 60
122 327681 61
123 360450 61
124 327681 62
125 360450 62
126 327681 63
127 360450 63
128 327681 64
129 360450 64
130 327681 65
131 360450 65
132 327681 66
133 360450 66
134 327681 67
135 360450 67
136 327681 68
137 360450 68
138 327681 69
139 360450 69
140 327681 70
141 360450 70
142 327681 71
143 360450 71
144 327681 72
145 360450 72
146 327681 73
147 360450 73
148 327681 74
149 360450 74
150 327681 75
151 360450 75
152 327681 76
153 360450 76
154 327681 77
155 360450 77
156 327681 78
157 360450 78
158 327681 79
159 360450 79
160 327681 80
161 360450 80
162 327681 81
163 360450 81
164 327681 82
165 360450 82
166 327681 83
167 360450 83
168 327681 84
169 360450 84
170 327681 85
171 360450 85
172 327681 86
173 360450 86
174 327681 87
175 360450 87
176 327681 88
177 360450 88
178 327681 89
179 360450 89
180 327681 90
181 360450 90
182 327681 91
183 360450 91
184 327681 92
185 360450 92
186 327681 93
187 360450 93
188 327681 94
189 360450 94
190 327681 95
191 360450 95
192 327681 96
For the lisalrap database (we also had the crashes when the second db was not running):
07/25/16 Semaphores
09:10:34
User Id Num
0 425988 0
1 458757 0
2 425988 1
3 458757 1
4 425988 2
5 458757 2
6 425988 3
7 458757 3
8 425988 4
9 458757 4
10 425988 5
11 458757 5
12 425988 6
13 458757 6
14 425988 7
15 458757 7
16 425988 8
17 458757 8
18 425988 9
19 458757 9
20 425988 10
21 458757 10
22 425988 11
23 458757 11
24 425988 12
25 458757 12
26 425988 13
27 458757 13
28 425988 14
29 458757 14
30 425988 15
31 458757 15
32 425988 16
33 458757 16
34 425988 17
35 458757 17
36 425988 18
37 458757 18
38 425988 19
39 458757 19
40 425988 20
41 458757 20
42 425988 21
43 458757 21
44 425988 22
45 458757 22
46 425988 23
47 458757 23
48 425988 24
49 458757 24
50 425988 25
51 458757 25
52 425988 26
53 458757 26
54 425988 27
55 458757 27
56 425988 28
57 458757 28
58 425988 29
59 458757 29
60 425988 30
61 458757 30
62 425988 31
63 458757 31
64 425988 32
65 458757 32
66 425988 33
67 458757 33
68 425988 34
69 458757 34
70 425988 35
71 458757 35
72 425988 36
73 458757 36
74 425988 37
75 458757 37
76 425988 38
77 458757 38
78 425988 39
79 458757 39
80 425988 40
81 458757 40
82 425988 41
83 458757 41
84 425988 42
85 458757 42
86 425988 43
87 458757 43
88 425988 44
89 458757 44
90 425988 45
91 458757 45
92 425988 46
93 458757 46
94 425988 47
95 458757 47
96 425988 48
97 458757 48
98 425988 49
99 458757 49
100 425988 50
101 458757 50
102 425988 51
103 458757 51
104 425988 52
105 458757 52
106 425988 53
107 458757 53
108 425988 54
109 458757 54
110 425988 55
111 458757 55
112 425988 56
113 458757 56
114 425988 57
115 458757 57
116 425988 58
117 458757 58
118 425988 59
119 458757 59
120 425988 60
121 458757 60
122 425988 61
123 458757 61
124 425988 62
125 458757 62
126 425988 63
127 458757 63
128 425988 64
129 458757 64
130 425988 65
131 458757 65
132 425988 66
133 458757 66
134 425988 67
135 458757 67
136 425988 68
137 458757 68
138 425988 69
139 458757 69
140 425988 70
141 458757 70
142 425988 71
143 458757 71
144 425988 72
145 458757 72
146 425988 73
147 458757 73
148 425988 74
149 458757 74
150 425988 75
151 458757 75
152 425988 76
153 458757 76
154 425988 77
155 458757 77
156 425988 78
157 458757 78
158 425988 79
159 458757 79
160 425988 80
161 458757 80
162 425988 81
163 458757 81
164 425988 82
165 458757 82
166 425988 83
167 458757 83
168 425988 84
169 458757 84
170 425988 85
171 458757 85
172 425988 86
173 458757 86
174 425988 87
175 458757 87
176 425988 88
177 458757 88
178 425988 89
179 458757 89
180 425988 90
181 458757 90
182 425988 91
183 458757 91
184 425988 92
185 458757 92
186 425988 93
187 458757 93
188 425988 94
189 458757 94
190 425988 95
191 458757 95
192 425988 96
294912 = 0x48000
327681 = 0x50001
360450 = 0x58002
393219 = 0x60003
425988 = 0x68004
458757 = 0x70005
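A quick shell check of the truncation arithmetic against the ids above (the one-byte/16-bit reading is an interpretation of the logs, not something they state directly):

# the low byte of semid 458757 (0x70005) is 5 - matching the "semid = 5" errors
printf '%d\n' $(( 458757 & 0xFF ))              # prints 5
# a 16-bit pattern with the sign bit set, 0x8005, read as signed is -32763,
# i.e. 5 - 32768 - matching the "semid = -32763" errors
printf '%d\n' $(( (0x8005 & 0xFFFF) - 65536 )) # prints -32763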
I guess the remote client servers get the semids from db shared memory. When one of the servers corrupts this information in shared memory, the other ones begin to report the 10839's. The "mad" server "updates" the semids for its clients; if it served more than one client, it may have corrupted more than one semid. Can you use -Ma 1 as a workaround?
BTW, the first semset (the one with 7 semaphores) is used only during db startup and during logins/logouts of self-service clients. I guess you will not see the last byte of this semid in the 10839's.
I ran dbmon.sh, output uploaded here
I found a (seemingly) similar issue reported by SteveSaltDanwood. I asked what the solution was, and this is his reply:
Carl,
We had been in discussions with Tech Support regarding this. The question refers to KB 61278, and our sysadmin has increased SEMMSL, SEMMNS and SEMOPM slightly. The one thing that I did was to add -semsets 6 to the Primary Login Broker .pf file, and our issue disappeared - not sure whether this actually fixed it, but since the change on 10th Feb we have had no further instances.
Regards,
Steve
Carl, do you always pre-start 60 remote client servers immediately after db startup? It looks like sometimes only a few servers were started successfully, and the semaphore errors were issued by the self-service clients:
[2016/07/13@23:48:13.123+0200] P-3986 T-140605298083648 I APW 82: (2518) Started.
[2016/07/13@23:48:27.788+0200] P-4350 T-140402236524352 I SRV 1: (5646) Started on port 11002 using TCP IPV4 address 0.0.0.0, pid 4350.
[2016/07/13@23:49:14.562+0200] P-4843 T-140184589739840 I SRV 2: (5646) Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 4843.
[2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (1132) Invalid semaphore id
[2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.
[2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (2252) Begin transaction backout.
[2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (2253) Transaction backout completed.
[2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (1132) Invalid semaphore id
[2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (1132) Invalid semaphore id
[2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22.
[2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (-----) semLockLog_2: semValue val = -1
[2016/07/14@00:26:19.903+0200] P-3966 T-140549975136064 I BROKER 0: (2527) Disconnecting dead user 87.
[2016/07/14@00:55:14.379+0200] P-2288 T-139877706889024 I APW 82: (2518) Started.
[2016/07/14@00:57:04.277+0200] P-3010 T-140641843181376 I SRV 2: (5646) Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 3010.
...
[2016/07/14@01:02:05.506+0200] P-4201 T-140565902128960 I SRV 5: (5646) Started on port 11006 using TCP IPV4 address 0.0.0.0, pid 4201.
[2016/07/14@01:25:33.654+0200] P-2626 T-139742344922944 I APPSRV 91: (1132) Invalid semaphore id
[2016/07/14@01:25:33.654+0200] P-2626 T-139742344922944 I APPSRV 91: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22.
Latch activity collected by dbmon is normal. And I guess db activity was next to zero right after db startup at midnight. I suspect that starting many remote client servers at once can be a root cause of the errors. Multiple semsets will decrease the competition for the semsets during their launches.
No servers are prestarted. Would that be better?
Just had a call with Progress tech support, they suspect something is leaking file handles.
> No servers are prestarted
Db log shows that /sometimes/ a lot of servers were started but there were no remote users connected.
> Just had a call with Progress tech support, they suspect something is leaking file handles.
I'd run promon in the background to check, every second, the first few lines of the 16. "Semaphores" screen. If they change, then the information in shared memory has been corrupted.
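If the one-shot capture sketched earlier (the hypothetical semdump.sh) is in place, a one-second watch loop could look like this; the awk filter strips promon's date/time header so that only changes in the table itself are logged:

# a rough sketch: capture the Semaphores screen once per second and keep
# a timestamped copy only when the table changes (semdump.sh is the
# hypothetical helper sketched earlier in this thread)
: > /tmp/sem.prev
while sleep 1; do
    ./semdump.sh | awk 'NF == 3 && $1 ~ /^[0-9]+$/' > /tmp/sem.now
    if ! cmp -s /tmp/sem.now /tmp/sem.prev; then
        { date '+%Y-%m-%d %H:%M:%S'; cat /tmp/sem.now; } >> /tmp/semwatch.log
    fi
    mv /tmp/sem.now /tmp/sem.prev
done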
> I'd run promon in the background to ..
When you are able to catch the changes of the semids, the next step would be to generate protrace files for all Progress sessions. And maybe it's not too late to do the same immediately after the first semaphore error in the db log.
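A sketch of how those protrace files could be collected in one go; on Unix, an OpenEdge process writes a protrace.<pid> file into its working directory when it receives SIGUSR1:

# ask every Progress client and server process for a stack trace
for pid in $(pgrep '_progres|_mprosrv'); do
    kill -USR1 "$pid"
done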
Latest crash: 25 minutes after db startup, following a reboot of the server. Increasing -semsets to 6 did not help.
Captured promon semaphore output 11 seconds before the crash; it didn't look different from the one after db start.
Was collecting semaphore output every minute.
Crash frequency increased (2 crashes in the last 2.5 hours), so I will stop collecting this data. It may be unrelated, but I won't take the risk.
> Was collecting semaphore output every minute
The time interval was too long. One second would have a better chance of catching the corruption of the semids, but even one second can be too long.
What values did these semids have for the last multi-user session?
Only 2 of the 5 semids were corrupted this time.
The error was reported by the most recently started session (a batch process), starting 23 seconds after its login. Carl, do you know what the session did during this time?
I'll try to find out
Just a crazy idea, but it helped one of our customers in the past:
Create and start an empty database. Force all self-service clients to connect to the empty db first and only then connect to the LISA db. I hope your application (especially its authorization) does not rely on the order of the connected databases.
Only the self-service clients can corrupt data in shared memory. Semids are stored at the beginning of shared memory, i.e. at the beginning of the client's address space, so the shared memory of the empty db will take the risk. Either the errors will disappear, or you will get them for the empty database.
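A sketch of how the decoy could be set up (DECOY is a placeholder name; prodb builds a new database from the empty template shipped with OpenEdge):

# create and serve a small sacrificial database
prodb /data/cce/applprod/data/DECOY empty
proserve /data/cce/applprod/data/DECOY
# self-service clients then connect the decoy first, e.g. in the client .pf:
#   -db /data/cce/applprod/data/DECOY
#   -db /data/cce/applprod/data/LISA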
Hi George,
sorry, but I'm kind of drowning in log files. Can you give some more info on the process that you refer to?
Below is the most recent protrace file, but that's probably not the process that interests you.
PROGRESS stack trace as of Mon Jul 25 12:46:48 2016
Progress OpenEdge Release 11.6 build 1346 SP02 TF04 on Linux franky-db 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
Command line arguments are
/usr/dlc/bin/_mprosrv -classpath /usr/oe116/jdk/lib/tools.jar:/usr/oe116/java/progress.jar:/usr/oe116/java/messages.jar:/usr/oe116/java/prosp.jar /data/cce/applprod/data/LISA -L 150000 -n 112 -spin 50000 -bibufs 20 -Mi 1 -Ma 1 -Mn 79 -Mpb 78 -Mm 10240 -lruskips 10 -semsets 6 -S 11000 -B 2000000 -tablerangesize 1000 -indexrangesize 3500 -minport 11001 -maxport 12999 -ServerType 4GL
Startup parameters:
-pf /usr/dlc/startup.pf,-cpinternal UTF-8,-cpstream UTF-8,-cpcoll Basic,-cpcase Basic,-d dmy,-numsep 46,-numdec 44,-Mm 10240,(end .pf),-classpath /usr/oe116/jdk/lib/tools.jar:/usr/oe116/java/progress.jar:/usr/oe116/java/messages.jar:/usr/oe116/java/prosp.jar,-db /data/cce/applprod/data/LISA,-L 150000,-n 112,-spin 50000,-bibufs 20,-Mi 1,-Ma 1,-Mn 79,-Mpb 78,-Mm 10240,-lruskips 10,-semsets 6,-S 11000,-B 2000000,-tablerangesize 1000,-indexrangesize 3500,-minport 11001,-maxport 12999,-ServerType 4GL
#1 [0x5e81bb] uttraceback+0xfb from /usr/dlc/bin/_mprosrv
#2 [0x5ead4e] uttrace+0x13e from /usr/dlc/bin/_mprosrv
#3 [0x5eaec4] utcore+0x74 from /usr/dlc/bin/_mprosrv
#4 [0x424952] drexit+0x212 from /usr/dlc/bin/_mprosrv
#5 [0x4d6045] dbExit+0x7c from /usr/dlc/bin/_mprosrv
#6 [0x4f8d27] dbKillAll+0x32a from /usr/dlc/bin/_mprosrv
#7 [0x4f91de] dbKillUsers+0x4a8 from /usr/dlc/bin/_mprosrv
#8 [0x47893b] mdend+0x13b from /usr/dlc/bin/_mprosrv
#9 [0x424911] drexit+0x1d1 from /usr/dlc/bin/_mprosrv
#10 [0x4d6045] dbExit+0x7c from /usr/dlc/bin/_mprosrv
#11 [0x581e07] semLockLog+0x200 from /usr/dlc/bin/_mprosrv
#12 [0x4f5e2c] dbDelUsrctl+0xd2 from /usr/dlc/bin/_mprosrv
#13 [0x4f7b23] brclrdead+0x44d from /usr/dlc/bin/_mprosrv
#14 [0x4fa06a] dbWatchdog+0x764 from /usr/dlc/bin/_mprosrv
#15 [0x51b214] dsmWatchdog+0xa7 from /usr/dlc/bin/_mprosrv
#16 [0x5d4355] nsaloop+0x518 from /usr/dlc/bin/_mprosrv
#17 [0x4784b3] doserve+0xbd8 from /usr/dlc/bin/_mprosrv
#18 [0x478771] main+0x103 from /usr/dlc/bin/_mprosrv
#19 [0x49eadb15] __libc_start_main+0xf5 from /lib64/libc.so.6
We have 2 databases, and sometimes this issue appeared first on the second db.
I'll try to replace all db connections with client/server connections to see if that helps.
> Can you give some more info on the process that you refer to?
[2016/07/25@12:54:48.591+0200] P-9540 T-140149562840896 I ABL 86: (452) Login by proadmin on batch.
[2016/07/25@12:54:48.608+0200] P-9540 T-140149562840896 I ABL 86: (7129) Usr 86 set name to batchclnt.
[2016/07/25@12:55:11.829+0200] P-9540 T-140149562840896 I ABL 86: (1132) Invalid semaphore id
[2016/07/25@12:55:11.829+0200] P-9540 T-140149562840896 I ABL 86: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@12:55:11.830+0200] P-9540 T-140149562840896 I ABL 86: (453) Logout by batchclnt on batch.
[2016/07/25@12:55:11.830+0200] P-9540 T-140149562840896 I ABL (1132) Invalid semaphore id
[2016/07/25@12:55:11.830+0200] P-9540 T-140149562840896 I ABL (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@12:55:11.830+0200] P-9540 T-140149562840896 I ABL (-----) semLockLog_2: semValue val = -1
[2016/07/25@12:55:14.327+0200] P-4295 T-140025558579008 I BROKER 0: (2527) Disconnecting dead user 86.
> Captured promon semaphore output 11 seconds before the crash; it didn't look different from the one after db start
Can you post the first 5 lines of the promon semaphore output?
> We have 2 databases, and sometimes this issue appeared first on the second db.
Only one session was connected to the second database at 14:29:29.
I just noticed our batch clients connect with rather low (old) values for -s, -mmax and -D. Could this cause these kinds of issues? I'm going to increase the values anyway.
-d dmy
-yy 1960
-E
-h 7
-D 80
-s 63
-B 1000
-Mr 8192
-TB 31
-TM 32
-mmax 5000
-Bt 200
#-cpinternal ISO8859-1
-cpinternal utf-8
-cpstream IBM850
-rereadnolock
-p ccetools/start-b
-T /tmp
Carl, when your database doesn't have the semaphore errors, it has the TCP errors:
[2016/07/25@07:25:12.467+0200] P-37327 T-139985194260288 I SRV 16: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:42:00.265+0200] P-38269 T-140325921544000 I SRV 19: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:56:39.636+0200] P-37336 T-139744649774912 I SRV 18: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:22:32.523+0200] P-46233 T-139696198965056 I SRV 34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:33:28.461+0200] P-46233 T-139696198965056 I SRV 34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:53:58.327+0200] P-51006 T-139629681858368 I SRV 38: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:56:58.560+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:58:07.372+0200] P-53007 T-139680708523840 I SRV 41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:19:55.615+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:21:45.428+0200] P-40014 T-140399388444480 I SRV 21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:25:09.992+0200] P-56699 T-140641693988672 I SRV 43: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:47:20.925+0200] P-33969 T-140121481381696 I SRV 2: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:48:22.646+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:33:19.665+0200] P-40014 T-140399388444480 I SRV 21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:34:55.088+0200] P-53007 T-139680708523840 I SRV 41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:48:41.338+0200] P-33974 T-139697772349248 I SRV 4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:50:27.258+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:12:19.914+0200] P-10713 T-140585542993728 I SRV 46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:14:42.414+0200] P-10713 T-140585542993728 I SRV 46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:44:40.016+0200] P-51424 T-139858028005184 I SRV 39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:46:17.233+0200] P-11690 T-139931360241472 I SRV 48: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:55:47.234+0200] P-33974 T-139697772349248 I SRV 4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:59:46.517+0200] P-10225 T-139944178841408 I SRV 45: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@12:01:44.940+0200] P-51424 T-139858028005184 I SRV 39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:41:28.143+0200] P-4635 T-140640063956800 I SRV 29: (-----) TCP/IP write error occurred with errno 32
Most likely their root cause is hardware. (Errno 32 is EPIPE, broken pipe.)
> Could this cause these kinds of issues?
Everything is possible. The good method is to keep changing things (of course, one change at a time).
> The good method is to keep changing things
I know, but it's not easy to follow that rule right now.
Thank you for all your advice and interest.
In reply to the 14:13 question about what the process was doing: I found the following 4GL trace; unfortunately the trace level is too low to be really useful.
The second log file shows the lsof output that is produced by the batch job prior to the quit statement.
Code fragment:
output close. /* batcherror logfile */
if opsys = "unix" then
    os-command silent value(subst("sudo /usr/sbin/lsof -p &1 >> &2", WZ-PID, WZ-LOGFILE)).
run WriteMemo.
quit.
First log file:
# cat /tmp/batch_9539.log
[16/07/25@12:54:48.609+0200] P-009540 T-485005120 1 4GL -- Logging level set to = 1
[16/07/25@12:55:11.829+0200] P-009540 T-485005120 1 4GL -- Invalid semaphore id (1132)
[16/07/25@12:55:11.829+0200] P-009540 T-485005120 1 4GL -- SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22. (10839)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- Invalid semaphore id (1132)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22. (10839)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- semLockLog_2: semValue val = -1
# cat be9540.log
print.sh -f acrobat blancke01 1 /tmp/Kat46489_9540.pdf 2>&1
Acrobat exitcode: 0
print.sh -f acrobat blancke01 1 /tmp/Kat46493_9540.pdf 2>&1
Acrobat exitcode: 0
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
_progres 9540 proadmin cwd DIR 8,2 12288 2228226 /usr2/cce/applprod
_progres 9540 proadmin rtd DIR 8,1 4096 2 /
_progres 9540 proadmin txt REG 8,1 12589577 400553 /usr/oe116/bin/_progres
_progres 9540 proadmin mem REG 8,1 110808 394029 /usr/lib64/libresolv-2.17.so
_progres 9540 proadmin mem REG 8,1 27512 394017 /usr/lib64/libnss_dns-2.17.so
_progres 9540 proadmin mem REG 8,1 156770 1054635 /usr/oe116/lib/libproxml.so
_progres 9540 proadmin mem REG 8,1 113328 394011 /usr/lib64/libnsl-2.17.so
_progres 9540 proadmin mem REG 8,1 5102967 1054636 /usr/oe116/lib/libxerces_psc.so
_progres 9540 proadmin mem REG 8,1 2546137 1054640 /usr/oe116/lib/libicui18n48_psc.so
_progres 9540 proadmin mem REG 8,1 1751412 1054642 /usr/oe116/lib/libicuuc48_psc.so
_progres 9540 proadmin mem REG 8,1 18187923 1054638 /usr/oe116/lib/libicudata48_psc.so
_progres 9540 proadmin DEL REG 0,4 98305 /SYSV00000000
_progres 9540 proadmin mem REG 8,1 37441 1054490 /usr/oe116/lib/libenhlog.so
_progres 9540 proadmin mem REG 8,1 61928 394019 /usr/lib64/libnss_files-2.17.so
_progres 9540 proadmin mem REG 8,1 11376 393960 /usr/lib64/libfreebl3.so
_progres 9540 proadmin mem REG 8,1 2112384 394001 /usr/lib64/libc-2.17.so
_progres 9540 proadmin mem REG 8,1 88720 393235 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
_progres 9540 proadmin mem REG 8,1 995840 394050 /usr/lib64/libstdc++.so.6.0.19
_progres 9540 proadmin mem REG 8,1 1141560 394009 /usr/lib64/libm-2.17.so
_progres 9540 proadmin mem REG 8,1 40816 394005 /usr/lib64/libcrypt-2.17.so
_progres 9540 proadmin mem REG 8,1 44096 394031 /usr/lib64/librt-2.17.so
_progres 9540 proadmin mem REG 8,1 19520 394007 /usr/lib64/libdl-2.17.so
_progres 9540 proadmin mem REG 8,1 142304 394027 /usr/lib64/libpthread-2.17.so
_progres 9540 proadmin mem REG 8,1 164440 402388 /usr/lib64/ld-2.17.so
_progres 9540 proadmin 0r CHR 1,3 0t0 1028 /dev/null
_progres 9540 proadmin 1w REG 8,6 14110 25 /tmp/batcherr32
_progres 9540 proadmin 2w REG 8,6 14110 25 /tmp/batcherr32
_progres 9540 proadmin 3r REG 8,1 1749843 400498 /usr/oe116/promsgs
_progres 9540 proadmin 4r REG 8,1 944417 400495 /usr/oe116/convmap.cp
_progres 9540 proadmin 5r REG 8,1 944417 400495 /usr/oe116/convmap.cp
_progres 9540 proadmin 6u REG 8,6 74955 7235 /tmp/lbiN5u92T (deleted)
_progres 9540 proadmin 7w REG 8,7 3082181 13528468 /data/cce/applprod/data/LISA.lg
_progres 9540 proadmin 8u REG 8,7 655360 13528466 /data/cce/applprod/data/LISA.db
_progres 9540 proadmin 9u REG 8,7 655360 13528466 /data/cce/applprod/data/LISA.db
_progres 9540 proadmin 10u REG 8,7 26869760 13500420 /data/cce/applprod/data/LISA.b1
_progres 9540 proadmin 11u REG 8,7 15204352 13500421 /data/cce/applprod/data/LISA.d1
_progres 9540 proadmin 12u REG 8,7 15204352 13500421 /data/cce/applprod/data/LISA.d1
_progres 9540 proadmin 13u REG 8,7 524288000 13500422 /data/cce/applprod/data/LISA_10.d1
_progres 9540 proadmin 14u REG 8,7 524288000 13500423 /data/cce/applprod/data/LISA_10.d2
_progres 9540 proadmin 15u REG 8,7 524288000 13500424 /data/cce/applprod/data/LISA_10.d3
_progres 9540 proadmin 16u REG 8,7 131072 13500425 /data/cce/applprod/data/LISA_10.d4
_progres 9540 proadmin 17u REG 8,7 131072 13500425 /data/cce/applprod/data/LISA_10.d4
_progres 9540 proadmin 18u REG 8,7 262144000 13500426 /data/cce/applprod/data/LISA_11.d1
_progres 9540 proadmin 19u REG 8,7 262144000 13500427 /data/cce/applprod/data/LISA_11.d2
_progres 9540 proadmin 20u REG 8,7 262144000 13500428 /data/cce/applprod/data/LISA_11.d3
_progres 9540 proadmin 21u REG 8,7 131072 13500429 /data/cce/applprod/data/LISA_11.d4
_progres 9540 proadmin 22u REG 8,7 131072 13500429 /data/cce/applprod/data/LISA_11.d4
_progres 9540 proadmin 23u REG 8,7 131072 13500430 /data/cce/applprod/data/LISA_12.d1
_progres 9540 proadmin 24u REG 8,7 131072 13500430 /data/cce/applprod/data/LISA_12.d1
_progres 9540 proadmin 25u REG 8,7 2147352576 13500431 /data/cce/applprod/data/LISA_20.d1
_progres 9540 proadmin 26u REG 8,7 2147352576 13500432 /data/cce/applprod/data/LISA_20.d2
_progres 9540 proadmin 27u REG 8,7 2147352576 13500433 /data/cce/applprod/data/LISA_20.d3
_progres 9540 proadmin 28u REG 8,7 2147352576 13500434 /data/cce/applprod/data/LISA_20.d4
_progres 9540 proadmin 29u REG 8,7 2147352576 13500435 /data/cce/applprod/data/LISA_20.d5
_progres 9540 proadmin 30u REG 8,7 2147352576 13500436 /data/cce/applprod/data/LISA_20.d6
_progres 9540 proadmin 31u REG 8,7 2147352576 13500437 /data/cce/applprod/data/LISA_20.d7
_progres 9540 proadmin 32u REG 8,7 2147352576 13500438 /data/cce/applprod/data/LISA_20.d8
_progres 9540 proadmin 33u REG 8,7 2147352576 13500439 /data/cce/applprod/data/LISA_20.d9
_progres 9540 proadmin 34u REG 8,7 2147352576 13500440 /data/cce/applprod/data/LISA_20.d10
_progres 9540 proadmin 35u REG 8,7 2147352576 13500441 /data/cce/applprod/data/LISA_20.d11
_progres 9540 proadmin 36u REG 8,7 2147352576 13500442 /data/cce/applprod/data/LISA_20.d12
_progres 9540 proadmin 37u REG 8,7 2147352576 13500443 /data/cce/applprod/data/LISA_20.d13
_progres 9540 proadmin 38u REG 8,7 2147352576 13500444 /data/cce/applprod/data/LISA_20.d14
_progres 9540 proadmin 39u REG 8,7 2147352576 13500445 /data/cce/applprod/data/LISA_20.d15
_progres 9540 proadmin 40u REG 8,7 2147352576 13500446 /data/cce/applprod/data/LISA_20.d16
_progres 9540 proadmin 41u REG 8,7 2147352576 13500447 /data/cce/applprod/data/LISA_20.d17
_progres 9540 proadmin 42u REG 8,7 2147352576 13500448 /data/cce/applprod/data/LISA_20.d18
_progres 9540 proadmin 43u REG 8,7 2147352576 13500449 /data/cce/applprod/data/LISA_20.d19
_progres 9540 proadmin 44u REG 8,7 2147352576 13500450 /data/cce/applprod/data/LISA_20.d20
_progres 9540 proadmin 45u REG 8,7 2147352576 13500451 /data/cce/applprod/data/LISA_20.d21
_progres 9540 proadmin 46u REG 8,7 2147352576 13500452 /data/cce/applprod/data/LISA_20.d22
_progres 9540 proadmin 47u REG 8,7 524288 13500453 /data/cce/applprod/data/LISA_20.d23
_progres 9540 proadmin 48u REG 8,7 524288 13500453 /data/cce/applprod/data/LISA_20.d23
_progres 9540 proadmin 49u REG 8,7 2147352576 13500454 /data/cce/applprod/data/LISA_21.d1
_progres 9540 proadmin 50u REG 8,7 2147352576 13500455 /data/cce/applprod/data/LISA_21.d2
_progres 9540 proadmin 51u REG 8,7 2147352576 13500456 /data/cce/applprod/data/LISA_21.d3
_progres 9540 proadmin 52u REG 8,7 2147352576 13500457 /data/cce/applprod/data/LISA_21.d4
_progres 9540 proadmin 53u REG 8,7 2147352576 13500458 /data/cce/applprod/data/LISA_21.d5
_progres 9540 proadmin 54u REG 8,7 2147352576 13500459 /data/cce/applprod/data/LISA_21.d6
_progres 9540 proadmin 55u REG 8,7 2147352576 13500460 /data/cce/applprod/data/LISA_21.d7
_progres 9540 proadmin 56u REG 8,7 2147352576 13500461 /data/cce/applprod/data/LISA_21.d8
_progres 9540 proadmin 57u REG 8,7 2147352576 13500462 /data/cce/applprod/data/LISA_21.d9
_progres 9540 proadmin 58u REG 8,7 2147352576 13500463 /data/cce/applprod/data/LISA_21.d10
_progres 9540 proadmin 59u REG 8,7 2147352576 13500464 /data/cce/applprod/data/LISA_21.d11
_progres 9540 proadmin 60u REG 8,7 2147352576 13500465 /data/cce/applprod/data/LISA_21.d12
_progres 9540 proadmin 61u REG 8,7 524288 13500466 /data/cce/applprod/data/LISA_21.d13
_progres 9540 proadmin 62u REG 8,7 524288 13500466 /data/cce/applprod/data/LISA_21.d13
_progres 9540 proadmin 63u REG 8,7 1572864 13500467 /data/cce/applprod/data/LISA_22.d1
_progres 9540 proadmin 64u REG 8,7 1572864 13500467 /data/cce/applprod/data/LISA_22.d1
_progres 9540 proadmin 65u REG 8,7 131072000 13500468 /data/cce/applprod/data/LISA_30.d1
_progres 9540 proadmin 66u REG 8,7 131072000 13500469 /data/cce/applprod/data/LISA_30.d2
_progres 9540 proadmin 67u REG 8,7 131072000 13500470 /data/cce/applprod/data/LISA_30.d3
_progres 9540 proadmin 68u REG 8,7 524288000 13500471 /data/cce/applprod/data/LISA_30.d4
_progres 9540 proadmin 69u REG 8,7 524288000 13500472 /data/cce/applprod/data/LISA_30.d5
_progres 9540 proadmin 70u REG 8,7 524288000 13500473 /data/cce/applprod/data/LISA_30.d6
_progres 9540 proadmin 71u REG 8,7 524288 13500474 /data/cce/applprod/data/LISA_30.d7
_progres 9540 proadmin 72u REG 8,7 524288 13500474 /data/cce/applprod/data/LISA_30.d7
_progres 9540 proadmin 73u REG 8,7 131072000 13500475 /data/cce/applprod/data/LISA_31.d1
_progres 9540 proadmin 74u REG 8,7 131072000 13500476 /data/cce/applprod/data/LISA_31.d2
_progres 9540 proadmin 75u REG 8,7 131072000 13500477 /data/cce/applprod/data/LISA_31.d3
_progres 9540 proadmin 76u REG 8,7 131072000 13500478 /data/cce/applprod/data/LISA_31.d4
_progres 9540 proadmin 77u REG 8,7 524288 13500479 /data/cce/applprod/data/LISA_31.d5
_progres 9540 proadmin 78u REG 8,7 524288 13500479 /data/cce/applprod/data/LISA_31.d5
_progres 9540 proadmin 79u REG 8,7 524288 13500480 /data/cce/applprod/data/LISA_32.d1
_progres 9540 proadmin 80u REG 8,7 524288 13500480 /data/cce/applprod/data/LISA_32.d1
_progres 9540 proadmin 81ur REG 8,6 84 7248 /tmp/batch_9539.log
_progres 9540 proadmin 82u REG 8,6 0 7250 /tmp/srtVk4EVZ (deleted)
_progres 9540 proadmin 83u REG 8,6 46351608 7251 /tmp/rcdhNobO5 (deleted)
_progres 9540 proadmin 84u REG 8,6 819200 7256 /tmp/DBI485005120kkz8Gb (deleted)
_progres 9540 proadmin 88r REG 8,1 4949823 1054570 /usr/oe116/tty/adecomm.pl
_progres 9540 proadmin 89r REG 8,1 26771 1054866 /usr/oe116/tty/OpenEdge.ServerAdmin.pl
_progres 9540 proadmin 90r REG 8,1 586073 1054581 /usr/oe116/tty/adeedit.pl
_progres 9540 proadmin 91r REG 8,1 20580737 1054623 /usr/oe116/tty/prodict.pl
_progres 9540 proadmin 92r REG 8,1 56999 1054864 /usr/oe116/tty/OpenEdge.BusinessLogic.pl
_progres 9540 proadmin 93r REG 8,1 1499066 1054865 /usr/oe116/tty/OpenEdge.Core.pl
_progres 9540 proadmin 94r REG 8,1 2631412 1054577 /usr/oe116/tty/adeshar.pl
_progres 9540 proadmin 95r REG 8,1 240309 1054584 /usr/oe116/tty/adecomp.pl
_progres 9540 proadmin 96r REG 8,1 14002200 1054621 /usr/oe116/tty/dataadmin.pl
Carl, can you save 'ipcs -s' output after each db startup? The footprint that the corruption leaves in shared memory is the most painless way to find the root cause. The semaphore errors show what the processes found in shared memory, but we don't know what the same bytes contained right after db startup.
> Just had a call with Progress tech support, they suspect something is leaking file handles
The size of the corrupted spot in shared memory and the byte alignment of the corruption's borders would confirm or deny that assumption.
> server changed, from real redhat 5 to HyperV hosted virtual centos 7
I suspect this is the root cause.
Hello cverbiest,
You may want to take a look at this article for tuning semaphores on Linux:
knowledgebase.progress.com/.../Database-stops-abnormally-after-showing-the-error-10839
The errors you have listed suggest an insufficient configuration for the system and the users of the system.
I agree that to find what the value should be, you have to use ipcs -s to find out how many semaphores are commonly being used.
Then you can adjust the semaphores accordingly.
Thank you
> The errors you have listed suggest an insufficient configuration for the system and the users of the system.
I saw this article but I can't understand it. "Insufficient semaphore identifiers"? For -semsets 3 (the default) and/or 6 (the recent value)? On a system running only 2 databases? But the broker successfully creates the semaphore identifiers at database startup. What processes failed to create more identifiers? The ones that reported the semaphore errors? But the client processes use only one semaphore in the semaphore identifiers created by the broker, and they get it at connection time:
What are you waiting for? A reason for waiting around.
Richard Banville, June 7, 2011
pugchallenge.org/.../Waiting_AmericaPUG.pptx
Semaphore usage
Resource wait semaphore – one per connection
I'll add ipcs -s to the startup script.
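Something as small as this at the end of the startup script should do (the target path is just an example):

# keep a timestamped snapshot of the semaphore arrays right after startup,
# so later 'ipcs -s' output can be diffed against a known-good baseline
ipcs -s > "/var/tmp/ipcs_after_startup_$(date +%Y%m%d_%H%M%S).log"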
Yesterday we stopped using shared-memory connections and it got us through the night without crashes
Did the tcp errors disappear as well?
I haven't seen any TCP/IP errors in the db log, but I do see a lot of dropped packets:
ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet x.x.x.x netmask 255.255.0.0 broadcast 10.0.255.255
inet6 yyyyyyyyyyyyyyyyyy prefixlen 64 scopeid 0x20<link>
ether 00:15:5d:00:01:12 txqueuelen 1000 (Ethernet)
RX packets 72734765 bytes 25463358707 (23.7 GiB)
RX errors 0 dropped 1023562 overruns 0 frame 0
TX packets 61839003 bytes 83457857300 (77.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 483858734 bytes 164988488628 (153.6 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 483858734 bytes 164988488628 (153.6 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
TCP/IP write error occurred with errno 32
>I haven't seen any TCP/IP errors in the db log,
Do you mean that you don't see them in the current log?
Yesterday these errors were reported in the log once per 5-10 minutes throughout the whole day.
I missed a few during my last search. There were none between 17:00 and 06:50; there have been a few errors since then, but the db stays up & running.
[2016/07/25@07:25:12.467+0200] P-37327 T-139985194260288 I SRV 16: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:42:00.265+0200] P-38269 T-140325921544000 I SRV 19: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:56:39.636+0200] P-37336 T-139744649774912 I SRV 18: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:22:32.523+0200] P-46233 T-139696198965056 I SRV 34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:33:28.461+0200] P-46233 T-139696198965056 I SRV 34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:53:58.327+0200] P-51006 T-139629681858368 I SRV 38: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:56:58.560+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:58:07.372+0200] P-53007 T-139680708523840 I SRV 41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:19:55.615+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:21:45.428+0200] P-40014 T-140399388444480 I SRV 21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:25:09.992+0200] P-56699 T-140641693988672 I SRV 43: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:47:20.925+0200] P-33969 T-140121481381696 I SRV 2: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:48:22.646+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:33:19.665+0200] P-40014 T-140399388444480 I SRV 21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:34:55.088+0200] P-53007 T-139680708523840 I SRV 41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:48:41.338+0200] P-33974 T-139697772349248 I SRV 4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:50:27.258+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:12:19.914+0200] P-10713 T-140585542993728 I SRV 46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:14:42.414+0200] P-10713 T-140585542993728 I SRV 46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:44:40.016+0200] P-51424 T-139858028005184 I SRV 39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:46:17.233+0200] P-11690 T-139931360241472 I SRV 48: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:55:47.234+0200] P-33974 T-139697772349248 I SRV 4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:59:46.517+0200] P-10225 T-139944178841408 I SRV 45: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@12:01:44.940+0200] P-51424 T-139858028005184 I SRV 39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:41:28.143+0200] P-4635 T-140640063956800 I SRV 29: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:50:06.256+0200] P-3171 T-140347927385920 I SRV 12: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:56:17.786+0200] P-3171 T-140347927385920 I SRV 12: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:59:28.007+0200] P-4133 T-139904439777088 I SRV 25: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:06:39.272+0200] P-4133 T-139904439777088 I SRV 25: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:16:58.821+0200] P-5630 T-140431553333056 I SRV 37: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:24:17.698+0200] P-5099 T-139919449716544 I SRV 33: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@16:58:01.036+0200] P-19422 T-140208870037312 I SRV 17: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@06:53:27.189+0200] P-23437 T-140599803811648 I SRV 29: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@07:39:35.723+0200] P-27754 T-140670744590144 I SRV 36: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@08:49:59.790+0200] P-60434 T-139664345315136 I SRV 48: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:08:27.365+0200] P-20130 T-139837751461696 I SRV 20: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:14:49.987+0200] P-20130 T-139837751461696 I SRV 20: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:46:00.119+0200] P-57586 T-140041057462080 I SRV 46: (-----) TCP/IP write error occurred with errno 32
If I understand correctly, only the batch processes (20% of all sessions) were connected to the db locally. Today all of them are using a remote connection.
A TCP/IP write error results in 794: Usernum <n> terminated abnormally. Is there anything common to all the sessions that died with the TCP error? The same username? The same remote host? I guess the batch processes are now connecting through localhost. Will they get the 794's today?
I don't have a "theory", but I suspect that the TCP errors and the semaphore errors have the same source.
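One way to look for something in common, using the message numbers seen in this thread ((794) abnormal termination, (452) login); the log path is this site's, adjust as needed:

# count abnormal terminations per usernum (msg 794)...
grep '(794)' /data/cce/applprod/data/LISA.lg |
sed 's/.*Usernum \([0-9]*\).*/\1/' | sort -n | uniq -c
# ...then eyeball the matching login lines (msg 452), which show the
# user name and the device or host the session came from
grep '(452)' /data/cce/applprod/data/LISA.lg | tail -n 50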
Just to be clear: not only were connections changed to client/server (-Ma 1), but this Linux guest is now on its OWN host (as opposed to shared with other guests), where the Hyper-V node is now running only the CentOS Linux. All other VMs have been moved to the second node.
I believe that this is the significant part of the change that has somehow stabilised this awful semaphore-down situation.
Found the trigger for the issue and a solution.
The database is started using a script that basically does:
sudo -u proadmin proserve LISA
sudo -u proadmin probiw LISA
sudo -u proadmin proapw LISA
sudo -u proadmin proapw LISA
If someone logs on to the system as user proadmin using ssh and executes
proshut LISA -C list
the semaphore errors occur when that user logs off. As long as the proadmin user does not log off, the db stays up & running.
On our test server the issue went away with a yum update.
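For what it's worth: SysV semaphores vanishing exactly when the last session of a non-system user logs out is also the signature of systemd-logind's RemoveIPC=yes default on RHEL/CentOS 7.x, which removes a user's semaphores and shared memory when that user's last session ends. If that is what's happening here (an assumption - the yum update may simply have pulled in a systemd build that changed this behaviour), the usual workaround is:

# stop systemd-logind from removing SysV IPC (semaphores, shared memory)
# owned by a user when that user's last session ends
sudo sed -i 's/^#\?RemoveIPC=.*/RemoveIPC=no/' /etc/systemd/logind.conf
sudo systemctl restart systemd-logind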
That is why the log files of the incidents should not be trimmed. ;-)