We just ran into an issue with a customer that just migrated from 11.4 Classic Appserver to 11.7.5 PASOE. The system was just running slow......
We narrowed it down to backend logic.
We then narrowed it down to our logging mechanism.
We then narrowed it down to NEW ClassName ().
It appears that NEWING a class in PASOE is 22x slower than doing the same in a GUI client (WebClient)
I just threw together a quick repro. The WriteMessage METHOD has no code in it so nothing is being done.
On GUI Client;
The following takes 1312ms when just passing a CHAR
DO iCnt = 1 TO 1000000:
LogManager:WriteMessage("Test1",
"ORS_APP" // we only want this to be logged if ORS_APP is enabled which is old logging
).
END.
The following takes 5268ms so about 4x slower
DO iCnt = 1 TO 1000000 ON ERROR UNDO, THROW:
LogManager:WriteMessage("Test1",
NEW MessageSubSystem ("ORS_APP") // we only want this to be logged if ORS_APP is enabled which is old logging
).
END.
On PASOE server;
The following takes 1531ms when just passing a CHAR.
DO iCnt = 1 TO 1000000:
LogManager:WriteMessage("Test1",
"ORS_APP" // we only want this to be logged if ORS_APP is enabled which is old logging
).
END.
The following takes 120324ms so 22x slower than the same code on the GUI client.
DO iCnt = 1 TO 1000000 ON ERROR UNDO, THROW:
LogManager:WriteMessage("Test1",
NEW MessageSubSystem ("ORS_APP") // we only want this to be logged if ORS_APP is enabled which is old logging
).
END.
Is this expected? I would hope not.
|
We are using ;
-D 100000
-reusableObjects 100000
Can you test the timing *without* the "LogManager" operation? Ie. just test the NEW'ing.
Nevermind, I didn't see this part:
...The WriteMessage METHOD has no code in it so nothing is being done....
Yep, no code at all in that method. I just ran the code with just the NEW and had the same results.
No, this is not expected.
We've seen some poor numbers comparing PASOE AVM to the single-threaded Client AVM when running non-db code). We expect some slight degradation, but nothing like this.
Did you run your test with pre-compiled code and running with -q?
Could you email me (tim@progress.com) your exact repro?
>> Did you run your test with pre-compiled code and running with -q?
Exactly what I was thinking. It sounds like Roger had already spent lots of time on this, but it is worth going back to this question just in case.
The fact that the repro has NOT eliminated the method call (LogManager:WriteMessage) yet is especially suspicious to me ... given that we are pointing the finger at the NEW'ing of a class. We should also be looking at the performance numbers without that method call.
At this many iterations (1000000) the performance issues in the AVM can really add up, especially if you don't have precompiled code and aren't caching it in memory with -q.
Without -q you are at the mercy of the file system resource where the code is hosted. If it is a spinning disk or a NAS file system then you will see a large performance degradation because the files are constantly polled for changes.
I'm stumped. It seems like you've narrowed down the scope of the problem to something that is entirely a CPU bottleneck.
If you watch your system resources, can you verify that this is running with 100% CPU utilization (or at least it uses the single-threaded equivalent of 100% CPU)? If not, then there are other unknown factors and resources that may be participating in your repro.
Also is there only *one* PASOE session that is active in the agent while you are running your repro? If there are any other sessions that are active within the agent process then there might be concurrency-related bottlenecks that are slowing things down, (eg I'm thinking of things like the artificial bottleneck that prevents more than one thread in a process from using XML data at a given moment of time ... see
https://community.progress.com/community_groups/openedge_general/f/26/t/57795 )
Or perhaps your new PASOE server has a virtualization problem that might be short-changing you on CPU. Maybe you should run both versions of your tests on the same hardware - this will take the hardware and virtualization out of the equation.
It would seem odd to me if PASOE was the only thing that was responsible for the slowdown. I'd guess the upgrade in OpenEdge versions would be more likely to cause this performance regression. The hosting of the code into a PASOE msagent process should not normally make any massive difference.
Do you see the same difference if you explicitly tracking and delete the object instances you NEW() ?
Would be something to test, as there's also garbage collection and how many objects stick in memory that can play a factor.
NEW()ing a new instance probably gets slower if you have larger numbers of objects in memory already. (It used to in the past, haven't benchmarked recently).
If the garbage collector behaves differently, you may end up with objects living longer, influencing that.
Or maybe the GC has some other issue why it's running slower in the PAS agent.
>> If you watch your system resources, can you verify that this is running with 100% CPU utilization (or at least it uses the single-threaded equivalent of 100% CPU)? If not, then there are other unknown factors and resources that may be participating in your repro.
It only uses 25% of CPU.
>> Also is there only *one* PASOE session that is active in the agent while you are running your repro?
Our tests were when NO other sessions were active. On the live site we did this at 2:00am when the system was not being used. We ran the tests after a reboot.
In our office using the customers DB we only had one session active.
>> Or perhaps your new PASOE server has a virtualization problem that might be short-changing you on CPU. Maybe you should run both versions of your tests on the same hardware - this will take the hardware and virtualization out of the equation.
I ran the same test not using a VM with same results...22x slower.
>> It only uses 25% of CPU.
You can compare this to the number of cores. If you have four cores, then you are essentially running a thread with exactly 100% usage of a single core. It is not really possible for your ABL session's code to use more CPU than that.
You didn't say how much time it took to run your test after removing the method call. Did you at least save about 1 second (ie ~1531ms ?)
Since this is ABL we're talking about, I know that 1,000,000 iterations of a method invocation will not be free. ;)
The good news is that you already have something that you can package up and send to Progress tech support. I would be very eager to get that performance hotfix too, if one becomes available. Please keep us in the loop.
I will keep you in the loop. We are still working on this as new information was brought to our intention early this morning.
Thanks for all the input.
To give an update, [mention:011caad85a5545659f1a1aaf22be3887:e9ed411860ed4f2ba0265705b8793d05] sent me his reproducible. I'm seeing a degradation, but nothing like the 4x or 22x severity, so something else must be going on. Will look forward to an update from Roger if he gets to the bottom of it.
The degradation I was able to measure is one that we've recently become aware of for CPU intensive operations (e.g., DO i = 1 to 1000000: userdeffunc(). END.).
We've identified the root cause and are working to close the performance gap between the single threaded AVM client and the multi-threaded PASOE AVM.
Well, it turns out we have a few issues. One issue was definitely the -q. Once I was able to get into the customers production server I quickly found out it was enabled on the client but not the server. Thanks Tim and David. I am still not sure why we could not see the same when toggling the -q in our office but that's another story.
The other issue we are experiencing is the MSAS agent is actually crashing with no error. We are working with tech support on this issue.
Thanks.
>> MSAS agent is actually crashing with no error
If you are running on Windows, then I'd recommend getting on a recent SP if possible (I've been fairly happy with OE 11.7.4 and 11.7.5).
Also on windows - remember to create anti-virus (defender) exclusions, and disable "data execution prevention" (DEP). In theory Progress OE is supposed to be DEP-compliant but I had a case where PASOE was crashing in a fairly recent version of OE 11 and it was because DEP was still enabled.
Checking DEP was one of the first things that tech support had asked to do. Once we disabled DEP the crashing stopped.
Small note about the DEP - if you're using 3rd-party components, make sure those are DEP compliant as well. If they're not, those components are likely to originate the crashes.
Anything that uses any form of native code (Com/ActiveX, direct DLL calls) in particular should be considered suspect until verified.
>> If you are running on Windows, then I'd recommend getting on a recent SP if possible (I've been fairly happy with OE 11.7.4 and 11.7.5).
We are running 11.7.5,
>> Also on windows - remember to create anti-virus (defender) exclusions, and disable "data execution prevention" (DEP).
We just did that yesterday and will see what happens.
>> Small note about the DEP - if you're using 3rd-party components, make sure those are DEP compliant as well. If they're not, those components are likely to originate the crashes.
The crashing is upon session startup but not all the time. It is typically happening as users are attempting to login and another session is attempting to start.
If it is crashing then you may have event logs (application log in windows) to back that up. Windows should detect that the _mproapsv has crashed and write a message to application log. At least that will keep track of how frequently this happens. But despite the crash, PASOE should recover and start a new msagent (_mproapsv) to replace the one that crashed. Hopefully your customer can attempt to repeat the PASOE request if they see an unusual error message.
And you should also have protrace files that Progress can examine. Sometimes those files can be sufficient to identify the problem, if you give them to tech support (... the ABL stack and/or the native stack trace can be helpful clues).
The _mproapsv shouldn't crash ... we still have some issues with leaks, and sometimes with concurrency restrictions in the process, but never any unusual crashes (or at least none that would happen in _mproapsv and not also in _progres). We've been running PASOE for about a year now. The only times I recall any surprising crashes were when my r-code was stale (just delete all the old stuff and replace it with freshly compiled code), and when DEP was enabled.
hypothesis, entirely without evidence:
perhaps some activex or other component is being loaded into the progress session concurrently by multiple users but is not actually
multi-user enabled.