Unix question: slow rm command

Posted by George Potemkin on 04-Jul-2016 09:15

My script is deleting a temp file that was created by the script itself:
rm $TmpFile 2>/dev/null

But the rm command takes 3-13 seconds. Yes, it takes a few /seconds/! Why?
I don't have direct access to this server to run tests. I'm just trying to recover a picture of the world crushed by this example. ;-)

More details (if they matter):
The temp file is small (just a few lines).
It's not used by any processes.
I can instantly write to other files on the same disk - no delays.
The script launches a few background processes /right before/ it tries to delete the files. The deletion delay increases with the number of processes: 3 sec if the script launches only 4 processes, 13 sec if 36 processes were started. These processes do write to the same disks, but it's just a few kilobytes.

HP-UX B.11.31 ia64
Memory: 63.9 GB

CPU info:
Intel(R) Itanium(R) Processor 9350 (1.73 GHz, 24 MB)
4 cores, 8 logical processors per socket
4.79 GT/s QPI, CPU version E0
Active processor count:
2 sockets
8 cores (4 per socket)
8 logical processors (4 per socket)
LCPU attribute is disabled

No info about disk system.
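
If I ever get access to the server, the first test would be something like this (an untested sketch; tusc, the HP-UX syscall tracer, is a separate download, so its presence is an assumption):

# Recreate the pattern: a small file, then a timed delete
TmpFile=/tmp/slowrm.$$
echo test > $TmpFile
time rm $TmpFile

# If tusc is installed, trace where rm spends its time
echo test > $TmpFile
tusc -o /tmp/rm.trace rm $TmpFile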

All Replies

Posted by DimitriG4 on 04-Jul-2016 14:33

GSC (guess/speculation/conjecture):

Try running rm with the full path (/bin/rm) or wherever rm is supposed to live on HP-UX. If this is not a server you maintain, it's anyone's guess whether there are "helpful" aliases or scripts placed "in front of" /bin/rm.

You can also give "unlink" a try.
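
A quick way to check what "rm" actually resolves to in the script's shell (assuming the POSIX shell, /usr/bin/sh):

type rm            # reports an alias, function, tracked alias, or a path
whence -v rm       # ksh/POSIX shell equivalent of type
ls -l /usr/bin/rm  # and make sure the binary itself was not replaced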

Posted by gus on 04-Jul-2016 17:17

good idea dmitri. but if it were an alias, would it not always take the same amount of time?

i wonder if this machine is virtualized or zoned and you really have two cpus even though it says more. or it has less memory than it says and is paging.

or filesystem is on some pokey nfs server.

Posted by DimitriG4 on 04-Jul-2016 18:43

It would depend on what "additional work" the aliased "rm" is made to do (i.e. copying and/or compressing the file before deleting it from the original location; still strange, since the files are small, and why would the delay vary with the number of processes?). My feeling is that the "rm" being executed might not be the "out of the box" "rm".

Posted by George Potemkin on 05-Jul-2016 08:54

DimitriG4,

> My feeling is that the "rm" being executed might not be the "out of the box" "rm".

I have never seen such a thing, but your guess may be right:
rm is a tracked alias for /usr/bin/rm

> still strange since the files are small, and why a different number of processes every time?

Actually the script is dbmon.sh:
community.progress.com/.../25700

The script was used for all (10) databases running on the system.
By default dbmon starts 3 promon sessions and one 4GL session for each database. In other words, it started 40 processes.
All processes were started within a 3-second interval - from 14:05:08 to 14:05:11 (though the script tries to start them simultaneously).
Then dbmon deleted the list of databases to monitor, and the result was:

time rm $DbList2

real 0m12.56s
user 0m0.57s
sys 0m4.65s

Then dbmon.sh started vmstat and iostat. Both commands started instantly.

vmstat 4 6

2016/07/05 14:05:24          procs           memory                   page                              faults       cpu
2016/07/05 14:05:24     r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
2016/07/05 14:05:24    19     4     0  11644791  736268    3    0    14 1559     0    0    10  33642  86662 11038  18  7 76
2016/07/05 14:05:28    19     4     0  11644791  734051    0    0    21  311     0    0     0  13606 601601 22498  38 54  8
2016/07/05 14:05:32    19     3     0  11562973  758078    0    0    12  328     0    0     0  14214 594769 23221  37 55  8
2016/07/05 14:05:36    19     3     0  11562973  782987    0    0     6  319     0    0     0  12863 542690 22198  33 51 16
2016/07/05 14:05:40    19     3     0  11562973  794595    0    0    10  266     0    0     0   9009 497981 11478  35 20 45
2016/07/05 14:05:44     3     4     0  11521635  792468    0    0     3  247     0    0     0   6715 220381  6276  26  6 67

The Progress processes started by dbmon caused a rather large increase in system time (time spent running kernel code). IMHO, that's not typical for Progress processes. Due to the delay caused by the 'rm' command, vmstat started and finished about 13 seconds after the Progress processes. And we can see that the system time returned to a normal value when the Progress processes terminated.
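
To see which processes were burning the system time during the spike, something like this could run alongside vmstat (XPG4 ps syntax from memory; glance would give a better view if it is installed):

# UNIX95 enables XPG4 behavior so HP-UX ps accepts -o
UNIX95= ps -e -o pid,pcpu,time,comm | sort -rn -k 2 | head -20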

The same customer, an identical Itanium box, the same dbmon script, but the result is different:

2016/06/23 10:23:26          procs           memory                   page                              faults       cpu
2016/06/23 10:23:26     r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
2016/06/23 10:23:26     1     0     0  1487739  3634458    0    0     2   14     0    0     0   1257 924006   843  17  1 82
2016/06/23 10:23:30     1     0     0  1487739  3631967    0    0    65    0     0    0     0   1505  85419  1473   8  4 89
2016/06/23 10:23:34     1     0     0  1487739  3631985    0    0    36    0     0    0     0   1311  47381  1111   7  1 92
2016/06/23 10:23:38     1     0     0  1497610  3631646    0    0    24    0     0    0     0   1231  31851   982   6  2 92
2016/06/23 10:23:42     1     0     0  1497610  3631646    0    0    16    0     0    0     0   1218  22970   938   7  1 91
2016/06/23 10:23:46     2     0     0  1501899  3631564    0    0    16    0     0    0     0   1239  21148   982   7  2 92

iostat 4 6 on the first server:

2016/07/05 14:05:24   device    bps    sps   msps
2016/07/05 14:05:24 
2016/07/05 14:05:28   disk64     43    4.0    1.0
2016/07/05 14:05:28   disk33      0    0.0    1.0
2016/07/05 14:05:28   disk42    677    5.8    1.0
2016/07/05 14:05:28   disk72    319   44.5    1.0
2016/07/05 14:05:28   disk75      5    1.2    1.0
2016/07/05 14:05:28   disk78      0    0.0    1.0
2016/07/05 14:05:28   disk81      0    0.0    1.0
2016/07/05 14:05:28   disk84      2    0.2    1.0
2016/07/05 14:05:28   disk87    753   25.9    1.0
2016/07/05 14:05:28   disk90      0    0.0    1.0
2016/07/05 14:05:28   disk93      0    0.0    1.0
2016/07/05 14:05:28   disk94      0    0.0    1.0
2016/07/05 14:05:28   disk95      0    0.0    1.0
2016/07/05 14:05:28   disk96    633  135.0    1.0
2016/07/05 14:05:28   disk97      0    0.0    1.0
2016/07/05 14:05:28   disk98   4556  290.7    1.0
2016/07/05 14:05:28   disk99      0    0.0    1.0
2016/07/05 14:05:28  disk102      0    0.0    1.0
2016/07/05 14:05:28  disk107   1097   78.3    1.0
2016/07/05 14:05:28  disk108  17800  391.8    1.0
2016/07/05 14:05:28  disk113    103   11.2    1.0
2016/07/05 14:05:28  disk114    254   31.7    1.0

One database on this server reads a lot from disk:

07/05/16        Activity: Summary
14:05:12        07/05/16 14:05 to 07/05/16 14:05 (4 sec)
Event                  Total  Per Sec |Event                  Total  Per Sec
Commits                  33       8.2 |DB Reads              17263    4315.8
Undos                     0       0.0 |DB Writes                14       3.5
Record Reads         307602   76900.5 |BI Reads                  0       0.0
Record Updates            8       2.0 |BI Writes                 3       0.8
Record Creates            8       2.0 |AI Writes                 0       0.0
Record Deletes            0       0.0 |Checkpoints               0       0.0
Record Locks             89      22.2 |Flushed at chkpt          0       0.0
Record Waits              0       0.0 |Active trans              6
Rec Lock Waits     0 %    BI Buf Waits      0 %    AI Buf Waits      0 %
Writes by APW    100 %    Writes by BIW    33 %    Writes by AIW     0 %
DB Size:        1771 GB   BI Size:         20 GB   AI Size:          0 K
Empty blocks:    455      Free blocks:   1860      RM chain:        83
Buffer Hits       97 %    Primary Hits     97 %    Alternate Hits    0 %
4 Servers, 175 Users (126 Local, 49 Remote, 72 Batch), 2 Apws

Summary:

The disks are busy with read operations. Starting 40 Progress processes that are not resource consuming (like promon) caused a significant increase in CPU system time and a huge delay in the rm command, but read/write operations were not noticeably affected. I still do not understand what is going wrong with this server.

Posted by gus on 05-Jul-2016 09:30

one thing to be aware of:

unless you have the noatime mount option, a read is really a data read + a metadata write (inode, etc). this has unpleasant side effects on some operating systems. don't recall when, but there were versions of HP-UX that had a bottleneck on that.

that could have something to do with your issue, or not.

also: fork()'ing new processes is a relatively expensive system call that hammers the o/s virtual memory system.

still, all of the above is just idle speculating too.
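
to check what the filesystem is mounted with, and to try noatime if atime updates turn out to be the culprit (remount syntax from memory; the device and mount point below are placeholders):

# list mounted filesystems with their options
mount -v

# vxfs can be remounted with noatime; adjust device and mount point
mount -F vxfs -o remount,noatime /dev/vg00/lvol4 /tmp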

regards,

gus

3 Logicians walk into a bar.

The bartender asks if they'd all like a beer.

The first logician says "I don't know".

The second says "I don't know".

Then the third exclaims "Yes!"


Posted by ChUIMonster on 05-Jul-2016 16:08

How long is the PATH?

--
Tom Bascom

Posted by George Potemkin on 05-Jul-2016 17:56

PATH length is 796 chars, 48 directories.
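
Counted with (POSIX shell):

echo ${#PATH}                      # character count
echo $PATH | tr ':' '\n' | wc -l   # directory count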

Posted by ChUIMonster on 05-Jul-2016 18:00

That's kind of crazy...

--
Tom Bascom

Posted by Marco Aurelio on 06-Jul-2016 06:23

Hi George,

   does the directory of your $TmpFile have a huge number of files? We had a similar problem with a directory that had over 5000 files. Maybe you can change the location of your $TmpFile temporarily to check, and if this is the case, reorganize the directory.
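
Something like this shows the entry count (the path is a placeholder):

ls -a /path/to/tmpdir | wc -l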

[]s

Marco

Posted by George Potemkin on 06-Jul-2016 08:04

> does the directory of your $TmpFile have a huge number of files?

No. It was a directory with just the two files of dbmon itself and a few archives with the logs of the script's previous runs. The script used for the last run put the temp files in the /tmp directory, but the delay on the rm command did not change. And as Gus mentioned above, it would not explain why the delay on rm is observed only while the monitoring processes are starting and running, but not before or after.

BTW, I said before that the script starts 40 processes. Really it's 40 pipelines, each of which consists of 3-4 processes. For example, one process is responsible for the input to promon, the second is promon itself, and one or two processes parse promon's output. Hence the fork() call was used 120+ times within 1-2 seconds. Though that should not be a problem for a million $$$ server reserved for a production environment running 4000 users.
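
To separate raw process-creation cost from filesystem cost, one could time pure fork+exec pairs on the loaded and on the idle server (an untested sketch; expr itself forks, so each iteration creates two short-lived processes):

i=0
time while [ $i -lt 100 ]
do
  i=`expr $i + 1`   # one fork+exec for expr
  /usr/bin/true     # and one for true
done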

I'm still waiting for the customer to send me the output of the 'mount' command.

Posted by George Potemkin on 07-Jul-2016 06:53

Finally I got the mount output from the customer:
FS type: vxfs
Mount options: ioerror=mwdisable,largefiles,delaylog
An additional mount option on production server: nodatainlog
So they do not use the 'noatime' option.
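
A quick way to see these atime updates in action (ls -u lists access time instead of modification time):

ls -lu $TmpFile            # access time before
sleep 61                   # wait past ls's one-minute display granularity
cat $TmpFile > /dev/null   # read the file
ls -lu $TmpFile            # atime is newer: the read wrote metadata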

They ran a simple test while dbmon was not running:

Count=10
TmpFile=test.$$.tmp
time while [ $Count -gt 0 ]
do
  Count=`expr $Count - 1`   # expr forks one process per iteration
  touch $TmpFile            # create the file...
  rm $TmpFile               # ...and delete it again
done


real 0m16.063s
user 0m5.180s
sys 0m10.860s

The result on a similar test server where the disks are idle:

real 0m0.115s
user 0m0.010s
sys 0m0.080s

Unfortunately, the customer's site is 2900 miles away from me, and they don't want to continue the investigation because "it's just a test server".

Thanks to everyone who shared their ideas.

This thread is closed