What causes values to show in pend. in servers status of promon?
The number of users in pending add up the number of open user connections that I expect to have. At that point users are getting error 748
The server or the system has no more resources. Please contact Progress Technical Support. (748)
Maybe related, maybe not... In the same time frame users are reporting error 9407 where the host name is the name of the database server and the port is 2000 TCP.
Connection failure for host <host_name> port <port> transport <transport_name>. (9407)
What does this make you think of as the root cause? What should I look into next?
Mark
OS: Windows server 2012
Progress OE11.5.1
Are you specifying -minport/-maxport on your connection brokers? You should have unique, non-overlapping port ranges for each one. It makes firewall configuration and troubleshooting of issues like this one much more straightforward.
If a client connects to the broker (-S port) but can't connect to the assigned server port (between -minport and -maxport inclusive) then you will have a pending connection. Check that the firewall(s) between clients and server allow connections to the server on all required broker and server ports.
Also check promon to ensure that all servers that should be able to spawn are able to. Check the DB log for errors related to spawning servers. Note that the broker won't spawn a 4GL server on a port if it is defined in the OS services file, even if it isn't currently in use. Check that there aren't defined services with port numbers within your minport/maxport ranges.
Thank you Rob for the input. In promon I find servers for the number of servers I expect. All of the servers have Cur. users values that account for the connected users.
It looks like the number of pending users adds up to what I expect are open slots. So new connections get the no more resources.
I need to look into the min/max port, service file, and run a netstat to see what is in use.
Also check that -n is high enough given your anticipated connection count.
> What causes values to show in pend. in servers status of promon?
Login broker sents a port of the started server to a new client. Then it checks if the number of the clients served by the server is increased. In other words the login broker checks if the client successfully connected to the server. Otherwise it's counted as a pending client connection.
So it's always a port inside the -minport/-maxport that was used to start a remote client server. Most likely there is a problem with the port on OS level.
Can you post the output from promon?
Example:
Sv Pend. Cur. Max. Port No Pid Type Protocol Logins Users Users Users Num 361 50988512 Auto TCP 339 0 2 30 10375 362 15599228 Auto TCP 460 0 2 30 10376 363 66258402 Auto TCP 0 2 0 30 10405 364 16449868 Auto TCP 306 0 2 30 10378 365 42468244 Auto TCP 362 0 2 30 10379
Srv 363 did not have the successful logins. It was a server newly started on port 10405 and the clients were unable to connect to this port.
Connections can be made on each of the ports and then connections seem to get stuck in pending.
11/21/17 Status: Servers 07:37:58
Sv Pend. Cur. Max. Port
No Pid Type Protocol Logins Users Users Users Num
0 5484 Login TCP 0 0 0 2 2021
1 16956 Auto TCP 1 0 1 2 4002
2 7256 Login TCP 20312 0 0 6 2020
3 8132 Auto TCP 2088 4 2 6 4100
4 6276 Auto TCP 3762 4 2 6 4101
5 6116 Auto TCP 2127 4 2 6 4102
6 18496 Auto TCP 790 0 1 2 4006
7 4848 Auto TCP 6070 2 4 6 4103
8 8424 Auto TCP 2061 3 3 6 4104
9 7772 Auto TCP 2067 5 1 6 4105
10 7472 Auto TCP 2112 3 3 6 4106
11 13372 Auto TCP 105 0 1 2 4003
12 14000 Auto TCP 600 0 1 2 4004
13 17196 Auto TCP 461 0 0 2 4005
Too busy servers may result in the pending connections.
What about db performance?
R&D/Activity/Summary/Commits per sec and Active trans?
Activity/Performance Indicators/Latch timeouts per sec?
The parameter PenndConnTime is set?
R&D/Activity/Summary/Commits per sec and Active trans?
11/21/17 Activity: Summary
10:05:50 08/08/17 19:41 to 11/21/17 10:05 (2511 hrs 23 min)
Event Total Per Sec |Event Total Per Sec
Commits 98954K 11.2 |DB Reads 1414600K 160.2
Undos 6138 0.0 |DB Writes 18149892 2.0
Record Reads 346135M 40144.7 |BI Reads 701766 0.1
Record Updates 104761K 11.9 |BI Writes 5625586 0.6
Record Creates 5450812 0.6 |AI Writes 5345318 0.6
Record Deletes 2844443 0.3 |Checkpoints 76869 0.0
Record Locks 1929763K 218.6 |Flushed at chkpt 16441621 1.8
Record Waits 125 0.0 |Active trans 0
Rec Lock Waits 0 % BI Buf Waits 1 % AI Buf Waits 1 %
Writes by APW 0 % Writes by BIW 0 % Writes by AIW 0 %
DB Size: 14 GB BI Size: 1263 MB AI Size: 376 K
Empty blocks: 91305 Free blocks: 41426 RM chain: 2814
Buffer Hits 99 % Primary Hits 99 % Alternate Hits 0 %
8 Servers, 24 Users (5 Local, 19 Remote, 9 Batch), 0 Apws
11/21/17 Activity: Performance Indicators
10:15:28 08/08/17 19:41 to 11/21/17 10:15 (2511 hrs 33 min)
Total Per Min Per Sec
Per Tx
Commits 98955K 672 11.21
1.00
Undos 6138 0 0.00
0.00
Index operations 271070M 1886198 31436.64
2805.08
Record operations 346274M 2409489 40158.15
3583.30
Total o/s i/o 1444471K 9816 163.59
14.60
Total o/s reads 1416032K 9622 160.37
14.31
Total o/s writes 29121304 193 3.22
0.29
Background o/s writes 2121117 14 0.23
0.02
Partial log writes 856362 6 0.09
0.01
Database extends 118336 1 0.01
0.00
Total waits 23008 0 0.00
0.00
Lock waits 125 0 0.00
0.00
Resource waits 22883 0 0.00
0.00
Latch timeouts 262567K 1784 29.74
2.65
Buffer pool hit rate: 99 % Primary pool hit rate: 99 % Alternate pool
hit rate: 0 %
> 10:05:50 08/08/17 19:41 to 11/21/17 10:05 (2511 hrs 23 min)
Interval is longer than 100 days. It can't help to identify the current bottleneck.
Is there a way to refresh the counters without a DB stop and start?
Enter <return>, A, L, R, S, U, Z, P, T, or X (? for help): ? Entry Action taken A Activate auto repeat mode. The current display then restarts after the number of seconds specified by the current display pause time. Stop after the number of times specified by the auto repeat count or when you press control-c. S Sample activity counters for the number of seconds specified by the sampling interval. The current display then restarts. The data shown include only activity which occurred during the sample interval. U Update activity counters. The current display then restarts. The data shown include changes which occurred since the initial set of data was collected. Z Zero the activity counters so updates show changes from now.
I prefer: Z, wait a few seconds, U
♀11/21/17 Activity: Performance Indicators
10:40:55 11/21/17 10:31 to 11/21/17 10:31 (9 sec)
Total Per Min Per Sec
Per Tx
Commits 0 0 0.00
0.00
Undos 0 0 0.00
0.00
Index operations 953316 6355440 105924.00
0.00
Record operations 958636 6390907 106515.11
0.00
Total o/s i/o 497 3313 55.22
0.00
Total o/s reads 497 3313 55.22
0.00
Total o/s writes 0 0 0.00
0.00
Background o/s writes 0 0 0.00
0.00
Partial log writes 0 0 0.00
0.00
Database extends 0 0 0.00
0.00
Total waits 0 0 0.00
0.00
Lock waits 0 0 0.00
0.00
Resource waits 0 0 0.00
0.00
Latch timeouts 800 5333 88.89
0.00
Buffer pool hit rate: 99 % Primary pool hit rate: 99 % Alternate pool
hit rate: 0 %
Latch timeouts is 88.89 per sec. I suspect it's mainly MTX naps.
Check promon/R&D/debghb/6/11. Latch Counts
Update: no commits/undoes per 9 seconds! It's unlikely that MTX latch was used at all. What latches have non-zero naps?
♀11/21/17 Activity: Latch Counts
10:52:34 08/08/17 19:41 to 11/21/17 10:51 (2512 hrs 9 min)
----- Locks ----- ------ Busy ------ Naps -
--------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /
Sec /Lock /Busy Total HWM
MTX -- 536071432 59 0 0.0 0
0 0 0 0 0
USR -- 43428375 4 0 0.0 0
0 0 0 0 0
OM -- 310790521764 34365 0 0.0 4
0 0 0 0 0
BIB -- 443499238 49 0 0.0 0
0 0 0 0 0
SCH -- 1466550 0 0 0.0 0
0 0 0 0 0
LKP -- 11168141 1 0 0.0 0
0 0 0 0 0
GST -- 108999 0 0 0.0 0
0 0 0 0 0
TXT -- 413323790 45 0 0.0 0
0 0 0 0 0
LKT -- 39722330474 4392 0 0.0 0
0 0 0 0 0
LKT -- 39711588891 4391 0 0.0 0
0 0 0 0 0
LKT -- 39719111353 4391 0 0.0 0
0 0 0 0 0
LKT -- 39696930662 4389 0 0.0 0
0 0 0 0 0
SEQ -- 4319118 0 0 0.0 0
0 0 0 0 0
AIB -- 442376741 48 0 0.0 0
0 0 0 0 0
TXQ -- 586187429 64 0 0.0 0
0 0 0 0 0
♀11/21/17 Activity: Latch Counts
10:56:29 08/08/17 19:41 to 11/21/17 10:51 (2512 hrs 9 min)
----- Locks ----- ------ Busy ------ Naps -
--------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /
Sec /Lock /Busy Total HWM
MTX -- 536071432 59 0 0.0 0
0 0 0 0 0
USR -- 43428375 4 0 0.0 0
0 0 0 0 0
OM -- 310790521764 34365 0 0.0 4
0 0 0 0 0
BIB -- 443499238 49 0 0.0 0
0 0 0 0 0
SCH -- 1466550 0 0 0.0 0
0 0 0 0 0
LKP -- 11168141 1 0 0.0 0
0 0 0 0 0
GST -- 108999 0 0 0.0 0
0 0 0 0 0
TXT -- 413323790 45 0 0.0 0
0 0 0 0 0
LKT -- 39722330474 4392 0 0.0 0
0 0 0 0 0
LKT -- 39711588891 4391 0 0.0 0
0 0 0 0 0
LKT -- 39719111353 4391 0 0.0 0
0 0 0 0 0
LKT -- 39696930662 4389 0 0.0 0
0 0 0 0 0
SEQ -- 4319118 0 0 0.0 0
0 0 0 0 0
AIB -- 442376741 48 0 0.0 0
0 0 0 0 0
TXQ -- 586187429 64 0 0.0 0
0 0 0 0 0
Enter <return> for more, A, L, R, S, U, Z, P, T, or X (? for help):
♀11/21/17 Activity: Latch Counts
10:58:44 08/08/17 19:41 to 11/21/17 10:51 (2512 hrs 9 min)
----- Locks ----- ------ Busy ------ Naps -
--------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /
Sec /Lock /Busy Total HWM
EC -- 0 0 0 0.0 0
0 0 0 0 0
LKF -- 158527987901 17529 0 0.0 0
0 0 0 0 0
BFP -- 758422154 83 0 0.0 0
0 0 0 0 0
BHT -- 802039844902 88684 0 0.0 0
0 0 0 0 0
PWQ -- 0 0 0 0.0 0
0 0 0 0 0
CPQ -- 34543367 3 0 0.0 0
0 0 0 0 0
LRU -- 799182187673 88368 0 0.0 24
0 0 0 0 0
LRU -- 0 0 0 0.0 0
0 0 0 0 0
BUF -- 397460116541 43948 0 0.0 0
0 0 0 0 0
BUF -- 395255999673 43704 0 0.0 0
0 0 0 0 0
BUF -- 414575552839 45841 0 0.0 0
0 0 0 0 0
BUF -- 398311657901 44042 0 0.0 0
0 0 0 0 0
INC -- 0 0 0 0.0 0
0 0 0 0 0
L29 -- 0 0 0 0.0 0
0 0 0 0 0
L30 -- 0 0 0 0.0 0
0 0 0 0 0
L31 -- 0 0 0 0.0 0
0 0 0 0 0
You can create a short sampling interval using any "Activity" screen and all "Activity" screens will report statistics exactly for the same interval. It's a good idea to post the different statistics for the same sampling interval.
LRU and OM are non-zero naps.
I am feeling like a promon newbie. I am not sure what you saying with this comment "You can create a short sampling interval using any "Activity" screen and all "Activity" screens will report statistics exactly for the same interval."
Perhaps add -lruskips 25 ? Do you expect that would have an impact on getting the connections showing as pending?
On any Activity screen do: Z, wait a few second, U
Then copy the screen and go to other screens to copy statistics. Do not update interval on these screens.
It would suggest to post data for:
t/2/1. Activity: Summary
t/3/1. Activity: Performance Indicators
t/debghb/6/11. Activity: Latch Counts
t/1/3. Status: Servers
♀11/21/17 Activity: Summary
11:41:12 11/21/17 11:41 to 11/21/17 11:41 (10 sec)
Event Total Per Sec |Event Total Per Sec
Commits 17 1.7 |DB Reads 220 22.0
Undos 0 0.0 |DB Writes 0 0.0
Record Reads 398316 39831.6 |BI Reads 0 0.0
Record Updates 17 1.7 |BI Writes 3 0.3
Record Creates 0 0.0 |AI Writes 3 0.3
Record Deletes 0 0.0 |Checkpoints 0 0.0
Record Locks 51 5.1 |Flushed at chkpt 0 0.0
Record Waits 0 0.0 |Active trans 0
Rec Lock Waits 0 % BI Buf Waits 0 % AI Buf Waits 0 %
Writes by APW 0 % Writes by BIW 0 % Writes by AIW 0 %
DB Size: 14 GB BI Size: 1263 MB AI Size: 1456 K
Empty blocks: 91385 Free blocks: 41426 RM chain: 2832
Buffer Hits 99 % Primary Hits 99 % Alternate Hits 0 %
8 Servers, 22 Users (4 Local, 18 Remote, 8 Batch), 0 Apws
♀11/21/17 Activity: Performance Indicators
11:42:35 11/21/17 11:41 to 11/21/17 11:41 (10 sec)
Total Per Min Per Sec
Per Tx
Commits 17 102 1.70
1.00
Undos 0 0 0.00
0.00
Index operations 389515 2337090 38951.50
22912.65
Record operations 398333 2389998 39833.30
23431.35
Total o/s i/o 226 1356 22.60
13.29
Total o/s reads 220 1320 22.00
12.94
Total o/s writes 6 36 0.60
0.35
Background o/s writes 6 36 0.60
0.35
Partial log writes 6 36 0.60
0.35
Database extends 0 0 0.00
0.00
Total waits 0 0 0.00
0.00
Lock waits 0 0 0.00
0.00
Resource waits 0 0 0.00
0.00
Latch timeouts 1888 11328 188.80
111.06
Buffer pool hit rate: 99 % Primary pool hit rate: 99 % Alternate pool
hit rate: 0 %
♀11/21/17 Activity: Latch Counts
11:43:29 11/21/17 11:41 to 11/21/17 11:41 (10 sec)
----- Locks ----- ------ Busy ------ Naps -
--------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /
Sec /Lock /Busy Total HWM
MTX -- 71 7 0 0.0 0
0 0 0 0 0
USR -- 45 4 0 0.0 0
0 0 0 0 0
OM -- 466061 46606 0 0.0 36
0 0 0 0 0
BIB -- 60 6 0 0.0 0
0 0 0 0 0
SCH -- 0 0 0 0.0 0
0 0 0 0 0
LKP -- 0 0 0 0.0 0
0 0 0 0 0
GST -- 0 0 0 0.0 0
0 0 0 0 0
TXT -- 68 6 0 0.0 0
0 0 0 0 0
LKT -- 72 7 0 0.0 0
0 0 0 0 0
LKT -- 43 4 0 0.0 0
0 0 0 0 0
LKT -- 74 7 0 0.0 0
0 0 0 0 0
LKT -- 32 3 0 0.0 0
0 0 0 0 0
SEQ -- 0 0 0 0.0 0
0 0 0 0 0
AIB -- 60 6 0 0.0 0
0 0 0 0 0
TXQ -- 70 7 0 0.0 0
0 0 0 0 0
Enter <return> for more, A, L, R, S, U, Z, P, T, or X (? for help):
♀11/21/17 Activity: Latch Counts
11:43:31 11/21/17 11:41 to 11/21/17 11:41 (10 sec)
----- Locks ----- ------ Busy ------ Naps -
--------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /
Sec /Lock /Busy Total HWM
EC -- 0 0 0 0.0 0
0 0 0 0 0
LKF -- 68 6 0 0.0 0
0 0 0 0 0
BFP -- 0 0 0 0.0 0
0 0 0 0 0
BHT -- 906599 90659 0 0.0 0
0 0 0 0 0
PWQ -- 0 0 0 0.0 0
0 0 0 0 0
CPQ -- 0 0 0 0.0 0
0 0 0 0 0
LRU -- 906403 90640 0 0.0 148
0 0 0 0 0
LRU -- 0 0 0 0.0 0
0 0 0 0 0
BUF -- 447032 44703 0 0.0 2
0 0 0 0 0
BUF -- 451183 45118 0 0.0 0
0 0 0 0 0
BUF -- 401535 40153 0 0.0 0
0 0 0 0 0
BUF -- 514389 51438 0 0.0 0
0 0 0 0 0
INC -- 0 0 0 0.0 0
0 0 0 0 0
L29 -- 0 0 0 0.0 0
0 0 0 0 0
L30 -- 0 0 0 0.0 0
0 0 0 0 0
L31 -- 0 0 0 0.0 0
0 0 0 0 0
♀11/21/17 Status: Servers
11:44:19
Sv Pend. Cur. Max. Port
No Pid Type Protocol Logins Users Users Users Num
0 5484 Login TCP 0 0 0 2 2021
1 16956 Auto TCP 1 0 1 2 4002
2 7256 Login TCP 20343 0 0 6 2020
3 8132 Auto TCP 2091 4 2 6 4100
4 6276 Auto TCP 3764 4 2 6 4101
5 6116 Auto TCP 2128 4 2 6 4102
6 18496 Auto TCP 790 0 1 2 4006
7 4848 Auto TCP 6079 2 4 6 4103
8 8424 Auto TCP 2067 3 2 6 4104
9 7772 Auto TCP 2069 5 0 6 4105
10 7472 Auto TCP 2120 3 2 6 4106
11 13372 Auto TCP 105 0 1 2 4003
12 14000 Auto TCP 600 0 1 2 4004
13 17196 Auto TCP 461 0 0 2 4005
To eliminate the latch timeouts set the -omsize equal to the number of the objects in your database (or just count the _StorageObject records) and the -lruskip 100.
But in your case the pending connections are not caused by those latch timeouts. Their root case is probably outside Progress.
10:05:50 08/08/17 19:41 to 11/21/17 10:05 (2511 hrs 23 min) Event Total Per Sec DB Writes 18149892 2.0 BI Writes 5625586 0.6 AI Writes 5345318 0.6 Writes by APW 0 % Writes by BIW 0 % Writes by AIW 0 %
You should run APW, BIW and AIW.
Thank you George for all your feedback and input.
Mark, in your case the thousands clients successfully used the server's ports (4003-4006, 4100-4106) during 100+ days. I guess you started getting the pending users only recently - you can check the errors in db log:
Server <n> has <n> unresolved pending connections(s). Please check port <port>. (12454)
Clearing pending connections from server <n>. (12455)
Also you have the pending users only for the servers spawn by your secondary login broker (ports 4100-4106)
As a workaround/troubleshooting in future I would suggest the following:
1. Start primary database broker with -Mn twice higher than the current value.
2. Use the separate login brokers. You seem to start the primary database broker as a login broker and one secondary login broker. My suggestion: no -S parameter for the primary database broker and two secondary login brokers (with their own -minport/-maxport as you do now).
3. When you will face with the pending users again stop an affected secondary login broker without stopping the database. Then start it again on the same port but with new port range (new -minport/-maxport). The existent clients will be still served by old servers. New clients will be connected to the new servers on new ports.
4. Stop old servers when all their clients will be disconnected. So you will free the slots in the -Mn table and you will be able to use them to restart a login broker again in the future when/if needed.
Thank you very much.