Problem With Parallel Running Queries - Cartesian Join?

Posted by bvy on 27-May-2015 10:29

Here's an interesting problem that I'll try my best to explain at a high level. We have data warehouse software installed on a Windows server -- the software queries and pulls data from QAD source tables. The Progress OpenEdge driver installed is 11.3 (7.10.02.48 -- PGOE27.DLL). We load new records from a large QAD source table (Table A) with a lot of history, though a typical daily load consistently pulls back only a few thousand records. The problem is, under certain circumstances, if a second query is running against that same table (say, from our test server) or another large table (on the same server), the load seems to lose its way and start pulling back records in a manner that looks like a Cartesian join -- millions of rows going back ten or more years, with lots of duplicates. I was able to reproduce this issue (though not consistently) by opening two query windows -- one running our Table A query, the other a query against another large table. Sometimes, the Table A query will return the expected number of rows, other times it will keep pulling back records until the second query ends or is stopped. I'm not a Progress developer, but hopefully I'm in the right place, as I think this is a Progress driver issue. Also, I'm describing the problem at a high level. I can't share too many details about our software or business. More than anything, I'm curious to know if this issue sounds familiar to anyone and if there's a simple remedy. Thanks.

All Replies

Posted by TheMadDBA on 27-May-2015 11:12

Doesn't sound like any issue I have run into before. I know of several environments where around 20 SQL queries will be running at any given time (in addition to the hundreds of normal ABL/4GL calls).

Maybe you can post the queries and the explain plans here? QAD data structure is tricky at best.

Posted by bvy on 09-Jun-2015 07:40

Thank you. Sorry for the late reply. I'm not at liberty to post very much about our environment, but the problem query is dead simple; it joins a header and detail table (ih_hist to idh_hist) and has a restriction on invoice date (ih_hist.ih_inv_date > sysdate() - 10). The query does use, what I'll call, an "old style" join -- meaning it lists the tables and the join criteria are in the where clause (from a, b where a.col1 = b.col1 and a.col2 = b.col2). Could this be the culprit? When the query goes off as described, a.col1 <> b.col1 in the returned recordset indicating that the join criteria were, in effect, disregarded.

Posted by steve pittman on 09-Jun-2015 09:26

Both "old style" joins (in WHERE) and "standards style" joins (ON clause) are fine, and produce the same query plan if the queries are  logically equivalent.

You describe the queries as being "parallel".  Are the 2 queries from 2 threads in one process, or from 2 separate processes?

You can get more insight into what is happening by turning on ODBC tracing, or by turning on OE sql server logging (SET PRO_SERVER LOG ON). For server logging, you would need access to the server machine to see the resulting logs.

As mentioned in an earlier reply, more details (queries, maybe query plans) are needed to begin to understand what is happening.

Don't forget that OE TechSupport is a  resource to solve problems also!

hope this helps,      .....steve pittman

Posted by TheMadDBA on 09-Jun-2015 09:31

This is going to be very hard to diagnose without the actual query. If for some reason you can't share the actual query here (no reason I can think of) then you would at least need to share the query with support so they can try and help.

Posted by bvy on 09-Jun-2015 09:58

Thanks. The query text is generated, though I've reproduced it with the following in a SQL window: select * from PUB.idh_hist, , PUB.IH_HIST ih_hist WHERE ih_hist.ih_domain = idh_hist.idh_domain AND ih_hist.ih_inv_nbr = idh_hist.idh_inv_nbr AND ih_hist.ih_nbr = idh_hist.idh_nbr AND (substr('DAILY',1,4) != 'DAIL' OR (substr('DAILY',1,4) = 'DAIL' AND ih_hist.ih_inv_date > sysdate()-10 ) ) This is a basic query against QAD. If Server A is running the query above and Server B (Test) is running the same query or a different long running query (like select * from PUB.abs_mstr), then the query on A starts returning records by the millions instead of the couple hundred or thousand we might expect. Servers A and B run the same jobs in parallel and occasionally clash like this. I've also reproduced this with select count(*)... and instead of the expected count have gotten a results 20 or 50 times (roughly) the expected count. See attachment.

Posted by bvy on 09-Jun-2015 09:59

Can't imagine why the formatting strips the paragraphs away...

Posted by TheMadDBA on 09-Jun-2015 10:23

select * from PUB.idh_hist, , PUB.IH_HIST ih_hist WHERE
ih_hist.ih_domain = idh_hist.idh_domain AND
ih_hist.ih_inv_nbr = idh_hist.idh_inv_nbr AND
ih_hist.ih_nbr = idh_hist.idh_nbr AND
(substr('DAILY',1,4) != 'DAIL' OR
(substr('DAILY',1,4) = 'DAIL' AND
ih_hist.ih_inv_date > sysdate()-10 )
)

What happens when you remove all of the DAILY substr options and use an actual date instead of sysdate? Other than that it seems like a pretty straightforward query (not 100% sure you need ih_nbr and idh_nbr but it has been a while since I used QAD).

Edit to add: There are some bugs with using minus and intersect but nothing that I can see for a query this simple. I would just try making the query even simpler and contact support.

Posted by bvy on 09-Jun-2015 10:33

Thanks. The string comparisons all evaluate to true, though I think I've stripped those out before and gotten the same behavior. Haven't tried hard-coding the date...

Posted by TheMadDBA on 09-Jun-2015 10:48

All of those should work just fine (apart from some index selections). Sometimes removing the complexity will show exactly where the bug is hitting.

Is it only with this set of tables or can you reproduce this on other tables like tr_hist?

At any rate I think support is your best bet now. If you get a solution or root cause let us know.

This thread is closed