Slow Memory-Mapped Procedure Library

Posted by Paul Koufalis on 24-Jul-2018 14:32

OE 11.7.1 on AIX 7.1

Is anyone aware of a bug or other issue where searching a memory-mapped PL is slow as the number of users passes some threshold?

We have about 1000 processes hitting a mapped PL, and searching past it to reach the next element in the PROPATH is taking tens of seconds: 40 to 50 seconds. In the morning, before the mass of users logs in, it's slowish (1-2 seconds), but by 10:00 we're seeing 40+ seconds in our tests.

Copying the same PL to another directory on the same file system and having just one process run past it to find a program in a directory after the PL in the PROPATH is fast. 

All Replies

Posted by Paul Koufalis on 24-Jul-2018 14:39

More info: even running prolib -list on the file takes 54 seconds, whereas running prolib -list on a copy of the file in the same file system takes 4 seconds.

Posted by Rob Fitzpatrick on 24-Jul-2018 14:54

Is this a local file system or a network share?

Posted by Paul Koufalis on 24-Jul-2018 15:02

IBM FC SAN with flash disks.  

Posted by Paul Koufalis on 24-Jul-2018 15:04

I also checked vmstat -v and lvmo for any bottlenecks and found nothing.

VIOS is purring along. No bottlenecks there.

Posted by Rob Fitzpatrick on 24-Jul-2018 15:05

Is it possible the two directories are on different underlying storage?

Posted by Tim Kuehn on 24-Jul-2018 15:08

do a

cat < file.pl > /dev/null

on both files - that'll tell you if this is an OS or Progress issue.
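
The cat test above reads each file sequentially through the OS and discards the data, so Progress is not involved at all. A minimal Python sketch of the same comparison, with timing added, might look like the following. The function names and the idea of passing both paths on the command line are illustrative assumptions, not anything from the thread.

```python
import sys
import time


def time_sequential_read(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks, discard the data, return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_reads(paths):
    """Time a plain sequential read of each path; return {path: (bytes_read, seconds)}."""
    return {p: time_sequential_read(p) for p in paths}


if __name__ == "__main__":
    # Hypothetical usage: pass the slow PL and its copy as arguments.
    for path, (nbytes, secs) in compare_reads(sys.argv[1:]).items():
        print(f"{path}: {nbytes} bytes in {secs:.3f}s")
```

A second run against the same file may be served from the file cache, so the first timing for each file is the more telling one.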

Posted by Rob Fitzpatrick on 24-Jul-2018 15:10

Is this something that previously performed well and recently changed for the worse?  Or do you suspect this was a problem all along that was previously undetected?

Posted by Marcelo Pacheco on 24-Jul-2018 15:38

The beauty of memory-mapped PLs is that we use OS features to handle concurrency.

We're simply avoiding reading pieces of the .pl into private buffers. We just map pieces of the .pl file corresponding to r-code segments and point directly to those mapped buffers.

It should follow that if performance is good with few users and bad with lots of users, the OS / OS tuning is at fault.

I suggest taking a deep hard look at the mmap(2) system call page.

There are no latches/locks or any type of contention visible from inside the OE client. It simply doesn't know if it's alone using the .pl or if a million people are using it at the same time.

Marcelo Pacheco
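
As a rough illustration of the mapping approach described above, the sketch below maps a file read-only and reads a slice of it straight from the mapping, timing the map step and the first access separately. It is a generic use of Python's mmap module, not the OE client's actual implementation; the function names, offsets, and lengths are hypothetical.

```python
import mmap
import time


def read_segment_mapped(path, offset, length):
    """Map the file read-only and return `length` bytes starting at `offset`.

    The pages come from the OS file mapping rather than from an explicit
    read() into a private buffer; the slice copies them out only so the
    result outlives the mapping.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def time_mapped_access(path, offset=0, length=4096):
    """Return (bytes_read, map_seconds, first_touch_seconds) for one mapped read."""
    start = time.perf_counter()
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        mapped_at = time.perf_counter()
        data = mapped[offset:offset + length]  # first touch faults the pages in
        touched_at = time.perf_counter()
    return len(data), mapped_at - start, touched_at - mapped_at
```

Running time_mapped_access against the slow PL and against its copy, under load, would show whether the time is going into setting up the mapping or into touching the pages. Nothing in this code takes a lock, which matches the point that any contention would have to come from below the application.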

Posted by Paul Koufalis on 24-Jul-2018 16:01

Rob: no, same VG.

Tim: copying the file to another directory was fast.

Rob: this is new. The customer upgraded the application and the new version uses a mmapped PL.

Marcelo: any theories for the simple prolib -list test?  Why slower on the mapped file? I will try again late tonight when the load is lighter.

Posted by Rob Fitzpatrick on 24-Jul-2018 16:05

Is there a separate test/UAT box?  Does the problem also show up there?

It makes me wonder what's special about the "fast" directory.  Could you change the propath so the PLs are stored there?  It's not a fix but it would buy you time to find one.

Posted by Marcelo Pacheco on 24-Jul-2018 16:06

I don't have a theory; this is more likely a question for an AIX expert than for an OE expert.

Try running the slow process under strace/truss/tusc and find out which system calls are super slow (under load). Then open a support ticket with IBM asking them to analyze the performance issue.

Back in 98/99 when I worked for PSC consulting, I came up with the idea for shared memory procedure libraries, and actually wrote the prototype for it. I don't know how much it may have changed over the years, but I don't see a reason to change this basic simplicity rule that OE doesn't do any locking whatsoever when mmap .pl files are used.

The very fact that prolib itself is slow confirms it's not Progress's fault, but rather we're suffering from some OS performance issue instead.

Marcelo Pacheco
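
Once a trace has been captured, the slow system calls still have to be picked out of it. The sketch below totals time per system call, assuming the trace is in the format Linux strace produces with its -T option, where each line ends with the elapsed time in angle brackets. That format is an assumption here: truss on AIX formats its output differently, so the pattern would need adjusting for it. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape (strace -T): name(args...) = result <seconds>
# An optional leading pid, as written by strace -f, is tolerated.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?P<name>\w+)\(.*\)\s+=\s+.*<(?P<secs>\d+\.\d+)>\s*$"
)


def slowest_syscalls(lines, top=5):
    """Return up to `top` tuples of (syscall, call_count, total_seconds), slowest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip signals, exit markers, and unfinished calls
        totals[m["name"]] += float(m["secs"])
        counts[m["name"]] += 1
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, counts[name], total) for name, total in ranked[:top]]
```

Comparing the ranking from a run against the slow PL with one against its copy should narrow down which call to raise with the OS vendor.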

Posted by Paul Koufalis on 24-Jul-2018 16:12

Rob: the problem didn't even manifest itself during pre-go-live stress tests. The customer brought in hundreds of employees on 3 different Sundays to simulate prod.

Marcelo: I have three super AIX experts on the case and will share your comments with them. Tracing a running job is on the list but first we're going to try tomorrow with a standard PL.

Posted by Paul Koufalis on 25-Jul-2018 10:06

Wednesday 11:00 AM, over 1000 users, no performance issues. I think we can conclude that the combination of OE 11.7.1 + AIX 7.1 + memory mapped PLs + 600 users is a problem. I will see if I can make time to create a reproducible case that I can run kernel traces against.

Posted by ChUIMonster on 25-Jul-2018 10:48

Or the sysadmin secretly changed something that you don't know about.

This thread is closed