My script is deleting a temp file that was created by the script itself:
rm $TmpFile 2>/dev/null
But the rm command takes 3-13 seconds. Yes, it takes a few /seconds/! Why?
I don't have direct access to this server to run tests myself. I'm just trying to reconstruct a picture of what is going on from this example. ;-)
More details (if they matter):
The temp file is small (just a few lines).
It's not used by any processes.
I can write to other file(s) on the same disk instantly - no delays.
The script launches a few background processes /right before/ it tries to delete the file. The deletion delay increases with the number of processes: 3 sec if the script launches only 4 processes, 13 sec if 36 processes were started. These processes do write to the same disks, but only a few kilobytes.
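The symptom above can be reproduced in isolation. Here is a minimal repro sketch (all file names are illustrative, not from the original script): launch N small background writers, then time the deletion of a tiny temp file, to see whether the rm delay really scales with the process count.

```shell
# Hypothetical repro sketch: does rm slow down as the number of
# concurrently started background writers grows?
for N in 4 12 36
do
    i=0
    while [ $i -lt $N ]
    do
        ( sleep 1; echo data > bg.$N.$i.tmp ) &   # tiny background write
        i=`expr $i + 1`
    done
    : > probe.tmp              # the small temp file to be deleted
    time rm probe.tmp          # does this take seconds with more writers?
    wait                       # let the background writers finish
    rm -f bg.$N.*.tmp          # clean up
done
```

If the rm timing grows with N on the affected box but not on a healthy one, the problem is tied to process startup rather than to the file itself.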
HP-UX B.11.31 ia64
Memory: 63.9 GB
CPU info:
Intel(R) Itanium(R) Processor 9350 (1.73 GHz, 24 MB)
4 cores, 8 logical processors per socket
4.79 GT/s QPI, CPU version E0
Active processor count:
2 sockets
8 cores (4 per socket)
8 logical processors (4 per socket)
LCPU attribute is disabled
No info about disk system.
GSC (guess/speculation/conjecture):
Try running rm with the full path (/bin/rm, or wherever rm is supposed to live on HP-UX). If this is not a server you maintain, it's anyone's guess whether there are "helpful" aliases or scripts placed "in front of" /bin/rm.
You can also give "unlink" a try.
Good idea, Dmitri. But if it were an alias, wouldn't it always take the same amount of time?
I wonder if this machine is virtualized or zoned and you really have two CPUs even though it reports more. Or it has less memory than it claims and is paging.
Or the filesystem is on some pokey NFS server.
It would depend on what "additional work" the aliased "rm" is made to do (e.g. copying and/or compressing the file before deleting it from the original location - still strange, since the files are small, and why would the time vary with the number of processes?). My feeling is that the "rm" being executed might not be the out-of-the-box "rm".
DimitriG4,
> My feeling is that the "rm" being executed might not be the "out of the box" "rm".
I have never seen such things, but your guess may be right:
rm is a tracked alias for /usr/bin/rm
> still strange since the files are small, and why a different number of processes every time?
Actually the script is dbmon.sh:
community.progress.com/.../25700
The script was used for all (10) databases running on the system.
By default dbmon starts 3 promon sessions and one 4G session for each database. In other words, it started 40 processes.
All processes were started within a 3 sec interval - from 14:05:08 to 14:05:11 (though the script tries to start them simultaneously).
Then dbmon deleted the list of databases to monitor, and the result was:
time rm $DbList2
real 0m12.56s
user 0m0.57s
sys 0m4.65s
Then dbmon.sh started vmstat and iostat. Both commands started instantly.
vmstat 4 6
2016/07/05 14:05:24 procs           memory                 page                        faults              cpu
2016/07/05 14:05:24  r  b  w    avm      free   re at pi   po  fr de sr    in     sy     cs  us sy id
2016/07/05 14:05:24 19  4  0 11644791  736268    3  0 14 1559   0  0 10 33642  86662  11038  18  7 76
2016/07/05 14:05:28 19  4  0 11644791  734051    0  0 21  311   0  0  0 13606 601601  22498  38 54  8
2016/07/05 14:05:32 19  3  0 11562973  758078    0  0 12  328   0  0  0 14214 594769  23221  37 55  8
2016/07/05 14:05:36 19  3  0 11562973  782987    0  0  6  319   0  0  0 12863 542690  22198  33 51 16
2016/07/05 14:05:40 19  3  0 11562973  794595    0  0 10  266   0  0  0  9009 497981  11478  35 20 45
2016/07/05 14:05:44  3  4  0 11521635  792468    0  0  3  247   0  0  0  6715 220381   6276  26  6 67
The Progress processes started by dbmon caused a rather large increase in system time (time spent running kernel code). IMHO, that is not typical for Progress processes. Due to the delay caused by the 'rm' command, vmstat started and finished about 13 seconds after the Progress processes did. And we can see that the system time returned to a normal value when the Progress processes terminated.
The same customer, an identical Itanium box, the same dbmon script - but the result is different:
2016/06/23 10:23:26 procs           memory                 page                        faults              cpu
2016/06/23 10:23:26  r  b  w    avm      free   re at pi   po  fr de sr    in     sy     cs  us sy id
2016/06/23 10:23:26  1  0  0  1487739 3634458    0  0  2   14   0  0  0  1257 924006    843  17  1 82
2016/06/23 10:23:30  1  0  0  1487739 3631967    0  0 65    0   0  0  0  1505  85419   1473   8  4 89
2016/06/23 10:23:34  1  0  0  1487739 3631985    0  0 36    0   0  0  0  1311  47381   1111   7  1 92
2016/06/23 10:23:38  1  0  0  1497610 3631646    0  0 24    0   0  0  0  1231  31851    982   6  2 92
2016/06/23 10:23:42  1  0  0  1497610 3631646    0  0 16    0   0  0  0  1218  22970    938   7  1 91
2016/06/23 10:23:46  2  0  0  1501899 3631564    0  0 16    0   0  0  0  1239  21148    982   7  2 92
iostat 4 6 on the first server:
2016/07/05 14:05:24 device    bps   sps  msps
2016/07/05 14:05:24
2016/07/05 14:05:28 disk64     43   4.0   1.0
2016/07/05 14:05:28 disk33      0   0.0   1.0
2016/07/05 14:05:28 disk42    677   5.8   1.0
2016/07/05 14:05:28 disk72    319  44.5   1.0
2016/07/05 14:05:28 disk75      5   1.2   1.0
2016/07/05 14:05:28 disk78      0   0.0   1.0
2016/07/05 14:05:28 disk81      0   0.0   1.0
2016/07/05 14:05:28 disk84      2   0.2   1.0
2016/07/05 14:05:28 disk87    753  25.9   1.0
2016/07/05 14:05:28 disk90      0   0.0   1.0
2016/07/05 14:05:28 disk93      0   0.0   1.0
2016/07/05 14:05:28 disk94      0   0.0   1.0
2016/07/05 14:05:28 disk95      0   0.0   1.0
2016/07/05 14:05:28 disk96    633 135.0   1.0
2016/07/05 14:05:28 disk97      0   0.0   1.0
2016/07/05 14:05:28 disk98   4556 290.7   1.0
2016/07/05 14:05:28 disk99      0   0.0   1.0
2016/07/05 14:05:28 disk102     0   0.0   1.0
2016/07/05 14:05:28 disk107  1097  78.3   1.0
2016/07/05 14:05:28 disk108 17800 391.8   1.0
2016/07/05 14:05:28 disk113   103  11.2   1.0
2016/07/05 14:05:28 disk114   254  31.7   1.0
One database on this server reads a lot from disk:
07/05/16          Activity: Summary
14:05:12          07/05/16 14:05 to 07/05/16 14:05 (4 sec)
Event                  Total  Per Sec |Event                  Total  Per Sec
Commits                   33      8.2 |DB Reads               17263   4315.8
Undos                      0      0.0 |DB Writes                 14      3.5
Record Reads          307602  76900.5 |BI Reads                   0      0.0
Record Updates             8      2.0 |BI Writes                  3      0.8
Record Creates             8      2.0 |AI Writes                  0      0.0
Record Deletes             0      0.0 |Checkpoints                0      0.0
Record Locks              89     22.2 |Flushed at chkpt           0      0.0
Record Waits               0      0.0 |Active trans               6
Rec Lock Waits    0 %    BI Buf Waits    0 %    AI Buf Waits    0 %
Writes by APW   100 %    Writes by BIW  33 %    Writes by AIW   0 %
DB Size:      1771 GB    BI Size:       20 GB   AI Size:        0 K
Empty blocks:     455    Free blocks:   1860    RM chain:       83
Buffer Hits      97 %    Primary Hits   97 %    Alternate Hits  0 %
4 Servers, 175 Users (126 Local, 49 Remote, 72 Batch), 2 Apws
Summary:
Disks are busy with read operations. Starting 40 Progress processes that are not resource-consuming (like promon) caused a significant increase in CPU system time and a huge delay in the rm command, but read/write operations are not noticeably affected. I still do not understand what is going wrong with this server.
One thing to be aware of:
unless you have the noatime mount option, a read is really a data read + a
metadata write (inode, etc.). This has unpleasant side effects on some
operating systems. I don't recall when, but there were versions of HP-UX
that had a bottleneck on that.
That could have something to do with your issue, or not.
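To check this theory, a first step is to see which filesystem holds the temp file and whether noatime is among its mount options. This is a generic sketch; output formats differ by OS (on HP-UX the options live in /etc/mnttab and appear in "mount -v" output, as the later posts in this thread show):

```shell
# Which filesystem is the current directory (or the temp file's
# directory) on?
df .
# Is any filesystem mounted with noatime?  Falls back to a message so
# the check does not fail when none is.
mount 2>/dev/null | grep -i noatime \
    || echo "no noatime mounts visible"
```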
Also: fork()'ing new processes is a relatively expensive system call that
hammers the OS virtual memory system.
Still, all of the above is just idle speculation too.
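The fork() cost is easy to feel directly. In the sketch below, every `( : )` subshell is a fork() (and each backquoted `expr` is another), so the loop forks well over 120 times, roughly the number this thread later attributes to dbmon's startup burst:

```shell
# Rough sketch of fork() cost: each iteration forks at least twice
# (the subshell and the expr), so this loop performs 240+ fork()s.
i=0
while [ $i -lt 120 ]
do
    ( : )                  # fork a subshell that exits immediately
    i=`expr $i + 1`        # another fork for the command substitution
done
echo "finished $i iterations of fork-heavy work"
```

Running this under `time` on a healthy box versus the affected one would show whether raw fork() throughput is the thing that differs.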
regards,
gus
3 logicians walk into a bar.
The bartender asks if they'd all like a beer.
The first logician says "I don't know."
The second says "I don't know."
Then the third exclaims "Yes!"
PATH length is 796 chars, 48 directories.
Hi George,
does the directory of your $TmpFile have a huge number of files? We had a similar problem with a directory that had over 5000 files. Maybe you can change your $TmpFile location temporarily to check, and if this is the case, reorganize the directory.
[]s
Marco
> does the directory of your $TmpFile have a huge number of files?
No. It was a directory with two files of dbmon itself and a few archives with the logs of the script's previous runs. The script used for the last run puts the temp files in the /tmp directory, but the delay on the rm command did not change. And, as Gus mentioned above, that would not explain why the delay on rm is observed only while the monitoring processes are starting and running, but not before and after that.
BTW, I said before that the script starts 40 processes. Actually it's 40 pipelines, each of which consists of 3-4 processes. For example, one process is responsible for the input to promon, the second is promon itself, and one or two processes parse promon's output. Hence the fork() call was used 120+ times within 1-2 seconds. Though that should not be a problem for a million-dollar server reserved for a production environment running 4000 users.
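The multiplication is easy to see: every stage of a shell pipeline is its own forked process. A hypothetical 4-stage monitoring pipeline like the one below (the commands are illustrative, not dbmon's actual stages) costs four fork()s by itself, so 40 such pipelines mean 120-160 processes almost at once:

```shell
# Four pipeline stages = four child processes for this one line.
printf 'sample promon output\n' | cat | tr 'a-z' 'A-Z' | wc -l
```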
I'm still waiting from the customer the result of the 'mount' command.
Finally I got the mount output from the customer:
FS type: vxfs
Mount options: ioerror=mwdisable,largefiles,delaylog
An additional mount option on production server: nodatainlog
So they do not use the 'noatime' option.
They ran a simple test while dbmon was not running:
Count=10
TmpFile=test.$$.tmp
time while [ $Count -gt 0 ]
do
    Count=`expr $Count - 1`
    touch $TmpFile
    rm $TmpFile
done
real 0m16.063s
user 0m5.180s
sys 0m10.860s
The result on the similar test server where the disks are idle:
real 0m0.115s
user 0m0.010s
sys 0m0.080s
Unfortunately the customer's location is 2900 miles away from me, and they don't want to continue the investigation because "it's just a test server".
Thanks to everyone who shared the ideas.