Duplicate index error on TT from unrelated procedure

Posted by danielb on 11-Feb-2018 18:11

We had a really, really strange error the other day, which managed to bring down one of our production servers.

The root cause was a duplicate index error (132) on a temp-table. While the error message was raised in the procedure that cause the issue, (daFoo.r), we have log file messages that are showing the error raised in the caller, and then completely unrelated r-code that doesn't reference the temp-table, or have any access to the tempt-able at all (dynamically or otherwise). One of the procedures that referenced it contained a loop that resulted in an infinite loop (we were using the RETRY() function inside it so that stop Progress' infinite loop protection firing).

My question isn't really about the error itself, but what would cause the error to be produced in unrelated R-code. It is completely bizarre, but I have a limited knowledge of what happens in the AVM under the hood.

The relevant points from the log file that we can find:
- the procedure raising the error originally is run persistently (and kept there), and the temp-table is static, so kept in memory alongside the persistent procedure.
- we have numerous "-mmax exceeded" warnings, with the HWM on that PID prior to the error of 4724 (we run with the default 
- our rcdaxxxxx files are generally hit about 60M per process

Has anyone encountered such an issue before? We are trying to replicate on an isolated environment - we can try and adjust various settings as required.

Posted by Laura Stern on 12-Feb-2018 09:42

See the kbase that I referenced.  Basically it says:

As part of ending a transaction (not a sub-transaction), the AVM goes through any buffer that is still in memory and tries to validate it if it hasn't been written out to the db. The temp-table record that caused the initial 132 error is in a persistent procedure which is still in memory, so every time some code that ends a transaction is run, it will cause the AVM to try to validate that record, and it will fail because of the index conflict. Even if it's a transaction that has nothing to do with that temp-table, the AVM will still handle it the same way.

Does that seem to match your circumstance?

All Replies

Posted by rayherring on 11-Feb-2018 18:27

Not sure about your particular use-case, but we use Classic Appserver with Webspeed brokers/agents running in 'Stateless' mode, one of our temp-tables on very rare occasions ends up with a record in it that breaks the uniqueness.

When that temp-table goes haywire, it starts causing intermittent issues across all webspeed agent requests that have nothing to do with that temp-table at all. A restart of the affected broker (it doesn't cross brokers thankfully, of which we have 3) cures the problem.

I've never really worked out why it affects everything that isn't even related, the same thing occurs when a query isn't query-prepared properly in a service that the agents all fire up, once 1 agent causes the issue, all agents for that broker get affected by it until the broker is restarted.

Posted by Peter van Dam on 11-Feb-2018 20:08

I have seen this before, once the error occurs somewhere in the session it seems to propagate/infect the entire session.

The only solution is to find and cure the cause.

Posted by danielb on 11-Feb-2018 22:11

Thanks, [mention:b57756e2e8f346b6be10460de66c675f:e9ed411860ed4f2ba0265705b8793d05]. We're running Classic AppServer as well, on State-free (rather than Stateless). Our scenarios have only impacted a single agent - other agents seem to work fine (I say this as the error messages are only printed from the same PID, but it's hard to verify as the box is at 100%).

Posted by Stefan Drissen on 12-Feb-2018 00:45

Yes, seen it often enough in both WebSpeed and (state-free) classic AppServer. Once an AVM get's in this state (no-undo temp-table with violated index), it cannot recover and any hit on the AVM will trigger the error again and again.

Not very advanced handling imho - but never thought to raise this as a bug with PTS.

Posted by jbijker on 12-Feb-2018 02:04

Well you have 2 options:

1. Remove the "NO-UNDO" from the TT definition - i.e. in the event of an error / rollback it will remove the offending record for you. However it does have a performance implication because OE now needs to store all changes to the TT when you're inside a transaction block.

2. Change the Primary Index to a non-unique index. Personally I see no need of having a primary index on a TT.

This has bitten me a couple of times. I'm using #2 above. If you really need to enforce uniqueness you can always do so by doing a FIND FIRST to check if a record is already there.

Posted by Stefan Drissen on 12-Feb-2018 02:12

1. undo vs no-undo is more importantly a design choice. If you are for example using temp-table records to log activity inside transactions then the temp-table needs to be no-undo to get information back for failed transactions

2. primary and non-unique have little in common. I am all for things that describe what they are and what their uniqueness is. Using FIND FIRST to detect if something was unique is IMHO like using character variables to store decimals and dates.

Posted by jbijker on 12-Feb-2018 02:50

[mention:f74159d0d7ac493497e441f53f353b10:e9ed411860ed4f2ba0265705b8793d05] all well and good, but the only way to recover from a unique index violation on a NO-UNDO TT is to catch the index violation error & delete the TT record if it happens. That's a lot more coding to do compared to a FIND statement up front. I prefer to keep things as simple as possible, but not simpler!

Posted by Stefan Drissen on 12-Feb-2018 04:29

[mention:0146fc81c2f24313878c844154703962:e9ed411860ed4f2ba0265705b8793d05] If you need to catch specific violations, yes. If you don't - because they are gremlins - a finally block with an empty temp-table is sufficient.

Posted by Laura Stern on 12-Feb-2018 08:29

There is a kbase article on a similar scenario for more explanation:

knowledgebase.progress.com/.../P185553

Posted by ske on 12-Feb-2018 09:09

> My question isn't really about the error itself, but what would cause the error to be produced in unrelated R-code.

What is the scope of the buffer where the duplicate record is created?

(I'm thinking that anything that might cause the block or procedure or session where it is contained to exit, or any attempt to re-use the same buffer, or possibly even any transaction finalization, would cause an implicit release of the record, which would trigger a new attempt to validate the record, and complaining about the duplicate key again.)

Posted by Laura Stern on 12-Feb-2018 09:42

See the kbase that I referenced.  Basically it says:

As part of ending a transaction (not a sub-transaction), the AVM goes through any buffer that is still in memory and tries to validate it if it hasn't been written out to the db. The temp-table record that caused the initial 132 error is in a persistent procedure which is still in memory, so every time some code that ends a transaction is run, it will cause the AVM to try to validate that record, and it will fail because of the index conflict. Even if it's a transaction that has nothing to do with that temp-table, the AVM will still handle it the same way.

Does that seem to match your circumstance?

Posted by danielb on 12-Feb-2018 17:05

Thanks, [mention:04fbfb2e92784123a464ff2aade602b1:e9ed411860ed4f2ba0265705b8793d05] , that matches our scenario exactly. The code that is looping is a DO TRANSACTION with UNDO, RETRY inside of it (it checks if a mutex is locked). We'll examine that code to add manual infinite loop protection to it. Thanks again!

Posted by rayherring on 12-Feb-2018 20:08

The temp-table in the scenario I have isn't set as 'NO-UNDO', it is in a transaction scope (I assume it is, not entirely sure, we use PS:escript, so I just make a call to the 'updateDataset' procedure for the dataset in question).

One day I'll get time to fix the issue, it's only 1 broker out of 3 that ever has the problem (we have our webspeed users segregated by type: Online Customers, Admin, Stores. The issue only ever occurs on 'Admin'), and it is a rare problem, probably once every few months.

This thread is closed