Recently we had a strange issue at a customer site, and I'd like to consult you about it.
Here is the situation:
- we have Java clients connecting to OE AppServer through Open Client
- all communication between Java clients and AppServer goes through datasets
- Java clients are calling non-persistent procedures only
- the first step at the beginning of each such procedure is to write the complete input dataset content to a LONGCHAR variable via the dataset's WRITE-XML method; immediately afterwards the LONGCHAR is written into a CLOB field in the database
- at the end of processing, the output dataset is written the same way
- this setup has worked without issues for a few years already
- minSrvInstance for the AppServer is set to 18, maxSrvInstance to 30 (I will explain later why I think this is important)
- AppServer is running in Stateless mode
- OE 11.3.3 64-bit on AIX 7.1
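For reference, the logging pattern described above boils down to something like the sketch below. All names (dataset dsInput, table ttActivityLog, field cInputXml) are hypothetical, not the customer's actual code:

```abl
/* Sketch of the logging step at the top of each non-persistent procedure.
   DATASET dsInput, table ttActivityLog and its CLOB field cInputXml are
   made-up names used only for illustration. */
DEFINE VARIABLE lcInput AS LONGCHAR NO-UNDO.

/* Serialize the whole input dataset to a LONGCHAR as formatted XML. */
DATASET dsInput:WRITE-XML("LONGCHAR", lcInput, TRUE).

/* Immediately persist the XML into a CLOB field in the database. */
CREATE ttActivityLog.
COPY-LOB FROM lcInput TO ttActivityLog.cInputXml.
```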
Now, the problem:
- some of the CLOBs written into the database during a specific six-hour window on one day in August do not contain the expected XML, but weird binary data (it looks like a memory dump?)
- around 3.3% of the records created during that period are corrupted this way
- it never happened before or since
- in all these cases both CLOBs (input and output) are corrupted
- the datasets themselves were transferred correctly
- when we called the same procedures again with the same input data, everything worked OK
- there is nothing special in the AppServer or database logs, except one thing: shortly after the problem disappeared, two of the AppServer processes stopped and disconnected from the database (without any error message)
I found this KB entry:
They mention a memory leak in the WRITE-XML method of a dataset with a NESTED data-relation, which is exactly our case. Maybe something bad happened to one of the AppServer processes because of this memory leak, and from then on it corrupted all its WRITE-XML outputs? The 3.3% corruption rate suggests it could be 1 of the 30 processes (1/30 ≈ 3.3%). And when that process stopped, everything went back to normal? Unfortunately, at this point I cannot say whether all the corrupted records were written by the same AppServer process or not.
I opened a case with Technical Support, but they haven't found a solution yet, and they don't think this problem is connected to the bug mentioned in the KB. The bug itself is not fixed yet. And of course I cannot reproduce the problem :(
Any thoughts?
What makes you believe the corruption is caused by a memory leak?
And what evidence do you have that you're actually leaking memory? Did you monitor the _proapsv processes and see their private memory usage increase over time?
Without knowing all the details, I would tend to agree with my colleagues in Tech Support. This sounds more like memory corruption. A memory leak alone would probably not corrupt the data; it would just consume the available memory, but (typically) the content of that leaked memory would be intact. In the bug attached to the KB you mentioned, if you walk the FIRST-QUERY chain, the error message indicates the leaked queries know they were auto-generated for a dataset. Of course, this bug has not been investigated, so what might look like a leak might actually be corruption (the "right" bit just happens to be turned on to give this message). It seems unlikely, though, if the sanity of the chain is sufficient to be able to walk it.
If this is memory corruption, I think there is little that can be done about it now, other than cleaning up. Memory corruption is hard to track down at the best of times, and requires a reproducible case and/or a very dedicated user willing to work with TS/Development.
I have no evidence, of course. I noticed the problem after it was gone, when we tried to read a stored CLOB, which does not happen very often (it's a kind of activity log). And as I wrote, it happened just once in a few years and millions of records written. The KB I mentioned was simply the only suspicious one I found that mentions the WRITE-XML method.
OK, but in the original post you say you:
"write complete input dataset content through dataset:write-xml method to longchar variable, which is immediately after that written into CLOB field in the database."
So we don't know if it's the WRITE-XML, or the subsequent copy to the database where things go corrupt.
If this issue recurs, I'd suggest checking whether writing the same LONGCHARs to a file on disk shows the same corruption as writing them to the database.
That should at least confirm which of those two steps is the point of failure.
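A minimal sketch of that check, assuming the LONGCHAR variable lcXml already holds the WRITE-XML output (the dump directory and the ttActivityLog.cInputXml field are made-up names):

```abl
/* Write the same LONGCHAR to both targets so the two copies can be
   compared afterwards. Path and field names are hypothetical. */
COPY-LOB FROM lcXml TO FILE "/tmp/dsdump/input-" + STRING(MTIME) + ".xml".
COPY-LOB FROM lcXml TO ttActivityLog.cInputXml.
```

If the disk file is clean while the CLOB is corrupt, the fault is in the copy to the database; if both are corrupt, it happened in (or before) WRITE-XML.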
Also, monitor the (private) memory use of the AppServer agents. If it levels off after a while, there is no memory leak.
We have a similar issue. We also WRITE-XML input ProDatasets to the database (a BLOB field).
I have seen one customer database with corrupt records. Reading the BLOB crashes the client.
I have had multiple crashes in our internal test runs (with the MS SQL DataServer) on the COPY-LOB statement that fills the BLOB. That COPY-LOB statement comes immediately after the WRITE-XML (to a LONGCHAR) on the dataset handle.
That's an interesting point... I somehow assumed WRITE-XML was to blame, but it's true that it could also be the copy from the LONGCHAR to the CLOB.
I will add some logging and checks to each step, but the question is whether and when the problem will appear again.
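One cheap per-step check along those lines: verify the LONGCHAR actually looks like XML before copying it to the CLOB. This is only a sketch; hDS and ttActivityLog.cXml are hypothetical names:

```abl
/* Sanity-check the WRITE-XML output before it reaches the database.
   hDS and ttActivityLog.cXml are made-up names for illustration. */
DEFINE VARIABLE lcXml AS LONGCHAR NO-UNDO.

hDS:WRITE-XML("LONGCHAR", lcXml).

/* A well-formed serialization starts with '<'; log anything else. */
IF SUBSTRING(lcXml, 1, 1) = "<" THEN
    COPY-LOB FROM lcXml TO ttActivityLog.cXml.
ELSE
    MESSAGE "WRITE-XML produced non-XML output, length:" LENGTH(lcXml).
```

This won't catch corruption introduced by COPY-LOB itself, but it at least separates the two steps when the problem next occurs.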
[mention:f74159d0d7ac493497e441f53f353b10:e9ed411860ed4f2ba0265705b8793d05]: did you report it to TS? Is your case reproducible?
Both cases were reported to TS:
00354526 Accessing blob field results in 'failure in large object operation, error = -8'
I only received a customer database with the issue; the case was not reproducible. Even identifying the corrupt records was too much work (the only way to identify a record was to read it, which crashes the AVM).
00323822 Intermittent _progres crash when calling COPY-LOB with DataServer for MS SQL Server
This case has also been closed for now.
Meanwhile, I've been digging deeper into the issue described in KB article 000031266.
On closer inspection, it's not really a memory leak. The dynamic queries that WRITE-XML() creates are cleaned up eventually, just later than people might expect: when the dataset goes out of scope.
Full details are in the updated article, now found at: