Question about the performance of ABL queries resolved by multiple indexes

Posted by dbeavon on 18-Mar-2020 01:07

I've been looking for KB articles to explain what is going on with queries that are resolved by multiple indexes.  The articles I've found don't get that deep into the implementation details.

We have a table that has, among many other indexes, three indexes that are composed of a single column each: fy, period, and div.  Here is an ABL query that interacts with these three indexes:

FOR EACH fin_gl
    WHERE fin_gl.fy = 20190
      AND fin_gl.period = 1
      AND (fin_gl.div = "E212")
   /* (fin_gl.div + "X" = "E212X") */
    EXCLUSIVE-LOCK:
    EXPORT STREAM S1 fin_gl.
END.

According to xrefs, and according to table/index stats, this query is using all three indexes.  What is confusing to me is how quickly it actually executes.

I thought that using separate indexes to this degree might actually *decrease* performance.  I thought it would resolve the query on all three indexes separately and then do a hash-match of the resulting ROWID's.  I can query the three columns individually in SQL like so:

SELECT count(*) FROM pub.fin_gl WHERE fy = 20190
SELECT count(*) FROM pub.fin_gl WHERE period = 1
SELECT count(*) FROM pub.fin_gl WHERE div = 'E212'

These take about ten seconds each.  And the selectivity is bad for all of them.  The first returns ~3 million rows and the next two return ~1 million rows apiece.  The final result, when combining all three criteria, is ~1000 rows.

So based on an approach where you would run these three SQL statements individually and combine the ROWIDs of the results, I would have thought things would take a worst case of thirty seconds.  At the very minimum it should take ten seconds.  However, the ABL query runs in just around 100 ms!

If anyone understands why this takes less than ten seconds on three independent indexes, I'd love to hear it!  Below are the highlighted stats that are generated after running my query.  Notice that there were 1000 base rows that were returned.  Also notice that none of the index reads get anywhere near 1 million.

fin_gl4 is for the fy column, fin_gl5 is for the period column, and fin_gl6 is for the div column.  I would have thought that at least one of the indexes would have a million reads, prior to joining against the others.  However the largest number of reads for any of the indexes is 2782.

The only thing I can think of is that there is additional information encoded/included in the indexes (like a SQL Server index does with "INCLUDE" options).

Or perhaps the ROWID itself has some composite details encoded within it, like page information, and that could be hash-matched at a higher level, prior to hash-matching the ROWID's themselves.  If one of these things is taking place, then I'm wondering if/how it would be counted in the index stats.

PS.  Any tips would be appreciated.  Someone went hog-wild about ten years ago and added a whole bunch of single-column indexes on this table.  I was skeptical about their value, and hoped they would NOT show up in our XREF's at all.  Once I discovered that they DO show up, I hoped the related programs would be super-SLOW so that I could rationalize the removal of these indexes in favor of conventional ones.  However they are NOT actually as slow as I hoped, and I'm reconsidering whether they should be removed at all.  They do appear to be used in very isolated circumstances and I still think that many of the other indexes would be suitable here as well... but I'm not eager to take risks since this is a fairly critical table.  What a mess.

All Replies

Posted by Patrick Tingen on 18-Mar-2020 09:19

Have you tried the query with a single three-field index? In smaller tests I found that a multi-field index outperformed three single-field indexes, but my test set was small.

Also, here is some info on queries that you might find interesting: knowledgebase.progress.com/.../000012195

Posted by dbeavon on 18-Mar-2020 13:27

>> Have you tried the query with a single three-field index?

Yes, that will work fine too (probably faster, as you say).  There is already an index with these three fields at the top, which would satisfy the equality conditions.

I'm worried that the single-field indexes were selected by the compiler in PREFERENCE to the existing one with the three fields at the top.  There are a few dozen programs where these single-field indexes are showing up in the xref's.  And removing the indexes is a nail-biter because it will force the compiler to do something different in all those cases (probably each would need to be reviewed and tested separately).

I was hoping the decision would be made easier, based on the fact that these programs would be extremely slow when joining three indexes together.  But they are not and I'm wondering why.  

I think there is some subtle magic going on related to the JOIN operation on the ROWID's.  I would like to understand the magic and *quantify* it if possible, so I know how to compare performance against the case where I'm using a conventional index.  The problem with quantifying the magic is that the index reads seem to be under-stated for these types of queries.  (I think there is some short-circuiting going on where a read isn't counted at all unless the index is traversed all the way down to the ROWID.  I don't think it counts the cases where the JOIN operation is evaluated using data that is above the level of the ROWID.)

Posted by Thomas Mercer-Hursh on 18-Mar-2020 14:20

I think the mistake you are making in your expectations is that the SQL queries are reading a million *records* whereas the ABL query can identify the result set by reading the *indices*.  Read the first index and get a list of rowids, compare that to the second index to get a smaller set of rowids, compare to the third index and get a list of rowids in the result set and then read only the records in the result set.
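As a rough user-level analogue of that process (this is NOT what the engine literally does internally; it works on the index entries directly without fetching the intermediate records, and the temp-table and USE-INDEX mechanics here are purely illustrative):

DEFINE TEMP-TABLE ttRid NO-UNDO
    FIELD rid  AS ROWID
    FIELD hits AS INTEGER
    INDEX ix-rid IS PRIMARY UNIQUE rid.

/* Pass 1: bracket on the fy index and remember each ROWID. */
FOR EACH fin_gl FIELDS (fy) WHERE fin_gl.fy = 20190 USE-INDEX fin_gl4 NO-LOCK:
    CREATE ttRid.
    ASSIGN ttRid.rid  = ROWID(fin_gl)
           ttRid.hits = 1.
END.

/* Pass 2: bracket on the period index; count only ROWIDs already seen. */
FOR EACH fin_gl FIELDS (period) WHERE fin_gl.period = 1 USE-INDEX fin_gl5 NO-LOCK:
    FIND ttRid WHERE ttRid.rid = ROWID(fin_gl) NO-ERROR.
    IF AVAILABLE ttRid THEN ttRid.hits = ttRid.hits + 1.
END.

/* Pass 3: same for the div index. */
FOR EACH fin_gl FIELDS (div) WHERE fin_gl.div = "E212" USE-INDEX fin_gl6 NO-LOCK:
    FIND ttRid WHERE ttRid.rid = ROWID(fin_gl) NO-ERROR.
    IF AVAILABLE ttRid THEN ttRid.hits = ttRid.hits + 1.
END.

/* Only ROWIDs that survived all three brackets are fetched in full. */
FOR EACH ttRid WHERE ttRid.hits = 3:
    FIND fin_gl WHERE ROWID(fin_gl) = ttRid.rid NO-LOCK.
    /* this is the ~1000-row result set */
END.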

Posted by Brian K. Maher on 18-Mar-2020 14:44

You should also do the update statistics command against the tables on the SQL side.  The SQL side uses the statistics data to determine the optimal query plan.
 

Posted by steve pittman on 18-Mar-2020 15:21

An ABL multi-index query reads only index pages to get its rowid sets for the relevant indexes, and in effect does a very clever intersection of the rowid sets.

Note that index pages can be highly compressed, depending on the index key value characteristics (and the rowid values).

If I remember correctly, the index read numbers you see are not counting number of index entries accessed in the index pages.

This is not a literal description of what happens but is accurate, to the best of my recollection.

The description by Thomas Mercer-Hursh is very good.

For various reasons, sql does a lot more work for these queries.

hope this helps, ....steve pittman

Posted by dbeavon on 18-Mar-2020 16:28

>> If I remember correctly, the index read numbers you see are not counting number of index entries accessed in the index pages.

This would make sense, or else the numbers should be around a million for the number of matching ROWID's *within* the index pages of each of those three indexes.

So it must be that the index stats are omitting the "short-circuited" reads on the *pages* of an index.  Or it may be the case that it is the *pages* that are always being counted, rather than the number of ROWID's within the index pages.  The "Reads" column header for the OEE indexes is ambiguous and I suppose it doesn't necessarily correspond to individual ROWID's. (Unlike the creates/deletes/updates at the table level which are almost always understood to refer to individual rows).

I really appreciate the tips.  The performance makes more sense if the query is not accessing the table, but only the index pages.  It is unfortunate that the SQL count(*) operation is not doing a similar thing.  Otherwise that should be blazing fast too.

What still doesn't make sense is why the compiler would prefer the three single-column indexes over a single multi-column index.  There are actually a couple of them that start with the three fields in question.  Here is one example:

fin_gl3:

  • fy
  • period
  • div
  • acct

This would resolve the query faster, if the compiler used it.  And it would have a simpler query plan.

Posted by ChUIMonster on 18-Mar-2020 21:25

As I recall, the compiler prefers indexes where *all* of the components are used over indexes where only _some_ of the components are used.

It would be interesting (in a test environment of course) to remove the single component indexes and see how the performance of the composite index compares.

Posted by Peter Judge on 18-Mar-2020 22:30

>> As I recall, the compiler prefers indexes where *all* of the components are used over indexes where only _some_ of the components are used.

 
I suspect you are correct, but that's not something I've had confirmed (or that is in "the rules"). The paper at documentation.progress.com/.../database-index-usage.html is good, but it doesn't say that if you have 2 potential index matches, the one with all fields satisfied will be chosen. Anecdotally it might do that. But you can disprove the theory easily enough too … so validation via COMPILE XREF or other means is really important.
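For instance (program and file names hypothetical):

COMPILE myprog.p XREF "myprog.xref".

Each SEARCH line in the XREF output names the table and the index used; a statement resolved by multiple indexes produces one SEARCH line per index, which is exactly the pattern dbeavon is seeing.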
 
 
 

Posted by Simon L. Prinsloo on 19-Mar-2020 08:31

Tom is correct.

I documented it in http://pugchallenge.org/downloads2015/176_OfCourseItWillTakeTheRightIndex.pdf.

I state the rule on slide 13 and prove it in slides 41 to 45.

But that is when all criteria are joined with AND, so you get one bracket.

The moment you use OR, you create alternative brackets: there is one more bracket than the number of OR operators. Different indexes (and presumably sets of indexes) can be selected per bracket, as can be seen on slides 47-48, as long as the statement can use multiple indexes. But statements that can use only one index will not benefit, as can be seen on slides 49-51, where the result turns out to be quite unexpected.
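For instance, a minimal sketch (borrowing the fin_gl fields from this thread) with one OR operator and therefore two brackets:

FOR EACH fin_gl NO-LOCK
    WHERE (fin_gl.fy = 20190 AND fin_gl.period = 1)
       OR (fin_gl.fy = 20190 AND fin_gl.period = 2):
    DISPLAY fin_gl.fy fin_gl.period.
END.

Each parenthesized group is its own bracket and can, in principle, be resolved by its own index or combination of indexes.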

The source code used to make the presentation can also be found at http://pugchallenge.org/downloads2015.html.

Also note that while the theory that equality is better than a range sounds correct, in practice it might not be.  This happens when there is a low distribution of values in the equality part and a high distribution of values in the range.  The compiler cannot know what the data distribution will look like, but the people designing the database and the software should be able to predict it.

As an example, I encountered a system where invoices and credit notes were in the same table. They were differentiated with a doc-type field, which could be "INV" or "CRN". Rought 98% of the data was "INV".

There were single-field indexes for the doc-type field and the inv-date field.

Doing a report on sales in a particular month worked well, because the index on inv-date was selected. But then the customer requested that credit notes must be filtered out and suddenly the report became very slow. The reason? After 5 years, a range match on a specific month's data returned 1/60 (1.67%) of the data. Roughly 2% of that result set was irrelevant. But an equality match on doc-type for "INV" returned 98% of the data, and in that result set only 1.67% of the data was relevant. 

Thus with the indexes available:

  • you would by default read 98% of the data and use 1.67% of the result set.
  • you could force a read of 1.67% of the data and use 98% of the result set.

With a compound index on "doc-type, inv-date" you would read 1.63% of the data and the entire result set would be relevant.
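To make that concrete, the slow report boiled down to something like this (table and variable names are illustrative, not the actual system's):

DEFINE VARIABLE dMonthStart AS DATE NO-UNDO.
DEFINE VARIABLE dMonthEnd   AS DATE NO-UNDO.

ASSIGN
    dMonthStart = DATE(3, 1, 2020)
    dMonthEnd   = DATE(3, 31, 2020).

/* The equality match on doc-type qualifies the doc-type index, but
   the range match on inv-date disqualifies the inv-date index from
   being combined with it.  So ~98% of the table is bracketed and
   each record's inv-date is only tested after the record is read. */
FOR EACH invoice NO-LOCK
    WHERE invoice.doc-type = "INV"
      AND invoice.inv-date >= dMonthStart
      AND invoice.inv-date <= dMonthEnd:
    /* accumulate sales ... */
END.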

Posted by dbeavon on 19-Mar-2020 13:55

Thanks for the help.  I had to look up the definition of Rought. ;)

The slow example that you described was exactly what I was expecting to see when using three independent indexes with the year (fy), week (period) and company (div).  By using all three indexes to resolve the query, I would have expected it to be slow, since each of the criteria individually returns a million rows.  Perhaps a million is not all that much when it comes to reading index pages and hash-joining ROWID's.  (Do you happen to know the approximate number of rows in your own example?)

Also just for clarification, you talk about "reading 98% of the data"... but you mean it is scanning the ROWID's in the index *pages*.  I am assuming that even in your example the query doesn't gather data from the database table itself until after it has identified the subset of data that is applicable to *both* criteria.

Another thing that I'd mention is our memory situation.  Even if the approach for resolving the query is terrible, it might be obscured by the tuning of our buffers and by the nature of when/how the queries are executed.  For example, we use alternate buffers for this table and I think the related index pages are almost always in RAM.  Whereas in your example it is possible that you were overflowing your memory buffers and swapping to disk every time the query ran.  It's just a guess.  Our query may be running in a way that doesn't make any sense, but the availability of memory is compensating for that.

Either way, I think the only "safe" way forward is to review the xrefs and individually force each of these instances to work differently.  I suspect I will move them to another index via the USE-INDEX hint.  Once all the code is weaned off those three silly indexes, then I can drop them.  (The only one of them that actually makes any sense to keep is the single-column index on year... but we already have a handful of other indexes with year at the top anyway!)
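For example, the original query forced onto the compound index would look something like this (an untested sketch):

DEFINE STREAM S1.
OUTPUT STREAM S1 TO "fin_gl.d".

/* USE-INDEX overrides the compiler's choice of the three single-field
   indexes and brackets on fin_gl3 (fy, period, div, acct) instead. */
FOR EACH fin_gl
    WHERE fin_gl.fy = 20190
      AND fin_gl.period = 1
      AND fin_gl.div = "E212"
    USE-INDEX fin_gl3
    EXCLUSIVE-LOCK:
    EXPORT STREAM S1 fin_gl.
END.

OUTPUT STREAM S1 CLOSE.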

Posted by Thomas Mercer-Hursh on 19-Mar-2020 14:54

>> Once all the code is weaned off those three silly indexes, then I can drop them.

Why do you think they are silly?  Indeed, many applications made big improvements once multi-index resolution was added, precisely by creating single-field indexes like this.  The alternative is creating huge numbers of indices with not only all possible fields, but also all possible sort orders.  The single-field indices give you great flexibility in both selection and sort without having to have a zillion indices.

Posted by ChUIMonster on 19-Mar-2020 15:03

Many applications also went (way) overboard creating hordes of single field indexes that do not actually provide any tangible benefit.

If testing shows that multi-component indexes give the needed performance and that the single component indexes are spurious then getting rid of them would be perfectly sensible.

Posted by Thomas Mercer-Hursh on 19-Mar-2020 15:30

Yes, but the review has to be of all queries of the table.

Posted by dbeavon on 19-Mar-2020 19:42

The problem for me is that the selectivity of each of these separate indexes is too high (a million rows for any criteria I might specify).  And practically speaking, nobody is going to have a good reason to query on the week-of-year in isolation (without specifying the fiscal-year).  Nor would anyone query the financial books for a given company without also picking the time period.

Every feature like this comes at a price.  You are compromising one thing to get another.  Do I really want to allocate tons of RAM and CPU in order to maintain large buffers of index pages, and perform these massive hash-match joins on ROWID's?  It does not seem like a conventional way to use the resources of the database server (OLTP).  I think many customers would be licensed by CPU core, and the capacity that can be handled by a database server's CPU's may depend on the type of data joins as well as the number of indexes that are used to resolve a single query.  

>> Why do you think they are silly

I was a bit prejudiced in the first place, but I mainly came to this opinion after reviewing the places where they show up in xrefs.  In all cases the query had to be written in a special way, in all cases I would have actually preferred it to use another single index, and in all cases the query ran slower than it would with a similar multi-field index.  I would probably feel better about these types of indexes if Progress could pick them based on a dynamic, cost-based analysis (statistics-based).  However, I suspect that the costs would normally be higher, which would be all the more reason not to use these over normal indexes.

Posted by Thomas Mercer-Hursh on 19-Mar-2020 20:03

But the point is not that you need the single-field indices to be used alone; it is that you can use them in combination with other indices to create selections or sorts.  Your main selections and sorts of course deserve their own indices, but big tables which are analyzed a lot tend to be sliced and diced in a wide array of ways.  At the very least, the first thing you should do is build up a database of every query, which indexes are currently used, which fields are used, and what order they are used in, i.e., the sort order.  If you then find indices that are not used and which are not needed as constraints (e.g., uniqueness), then by all means get rid of them.  Likewise, if you find that every use of these three fields involves the same fields in the same order, then you have a good argument for substituting an appropriate index, existing or new.  But my bet is that you will find the queries all over the map.  Many will be the same, but there will be different ones that are nevertheless important and which would perform much worse without those single-field indices.
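If it helps, here is a rough sketch of harvesting that information from COMPILE ... XREF output into a temp-table.  The column layout of the XREF lines assumed here (compile unit, source file, line number, reference type, then the object info) is from memory, so verify it against your own XREF files:

DEFINE TEMP-TABLE ttSearch NO-UNDO
    FIELD sourceFile AS CHARACTER
    FIELD lineNum    AS INTEGER
    FIELD tableName  AS CHARACTER
    FIELD indexName  AS CHARACTER
    INDEX ix-use tableName indexName.

DEFINE VARIABLE cLine AS CHARACTER NO-UNDO.

INPUT FROM VALUE("myprog.xref").  /* hypothetical file name */
REPEAT:
    IMPORT UNFORMATTED cLine.
    /* Keep only the SEARCH entries: table plus chosen index. */
    IF NUM-ENTRIES(cLine, " ") >= 6
       AND ENTRY(4, cLine, " ") = "SEARCH" THEN DO:
        CREATE ttSearch.
        ASSIGN ttSearch.sourceFile = ENTRY(2, cLine, " ")
               ttSearch.lineNum    = INTEGER(ENTRY(3, cLine, " "))
               ttSearch.tableName  = ENTRY(5, cLine, " ")
               ttSearch.indexName  = ENTRY(6, cLine, " ").
    END.
END.
INPUT CLOSE.

FOR EACH ttSearch BY ttSearch.tableName BY ttSearch.indexName:
    DISPLAY ttSearch.tableName ttSearch.indexName
            ttSearch.sourceFile ttSearch.lineNum.
END.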

Posted by dbeavon on 19-Mar-2020 20:46

If the selectivity is bad, then I can't think how it would ever make sense to use the index.  The problem is that Progress ABL doesn't have a way to KNOW the selectivity is bad, and so it will use the index anyway.  This feature might have been a good one to pair with some sort of run-time optimization hint ("FOR EACH ... USE COST-OPTIMIZER").

I suspect that you need even more experienced DBA's to manage these indexes (as compared to conventional indexes).  Otherwise these types of indexes may cause as many problems as they solve.

Just to be clear, I'm not opposed to the concept of using multiple indexes to resolve a query as long as they all have fairly good selectivity.  But it is these single-field indexes with inconsistent selectivity that I have a problem with.

Posted by Thomas Mercer-Hursh on 19-Mar-2020 21:01

The selectivity of *one* index used alone might be bad, but the selectivity of *multiple* of these indices used in combination may be far from bad.  That is the point.  They are not meant to be used in isolation, but rather to be combined with others.  One can, of course, achieve the same functionality and perhaps slightly better performance by defining a multi-field index, but if you need ABC and ACB and ABD and DAB etc. then you have a lot of overhead to cover all needed use cases and have to define a new index if you come up with a new use case.  Frankly, when this was introduced, I thought it was brilliant since customers were constantly coming up with new reporting requirements.
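For example (borrowing the fin_gl fields from this thread; a sketch only), the selection below can combine the single-field indexes on fy and div, while the BY clause supplies whatever order this particular report wants, without a dedicated fy+div+period index existing:

FOR EACH fin_gl NO-LOCK
    WHERE fin_gl.fy = 20190
      AND fin_gl.div = "E212"
    BY fin_gl.period:
    /* ... report ... */
END.

Tomorrow's report can sort the same selection BY fin_gl.acct instead, still without adding a new index.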

Posted by Brian K. Maher on 19-Mar-2020 21:02

Dave,
 
I believe there is an enhancement request to add a cost-based optimizer to the AVM.  You need to vote for it.
 
Brian
 

Posted by Simon L. Prinsloo on 20-Mar-2020 13:38

Hi dbeavon,

Sorry, I do not know where "Rought" came from. It should have been "roughly".

In any case, in my example the equality match on doc-type selected the index on doc-type, but the match on inv-date was a range match, i.e. >= the first day of the fiscal month and <= the last day of the fiscal month, so the compiler ignored that index. The important thing to note here is that only indexes where all fields are used in equality matches are combined.

This means that it only loaded the index pages for doc-type. Those have no information about inv-date, so in order to resolve the second part of the filter, the db engine had to read each record satisfying the equality match in the doc-type index and compare its inv-date to the selected range to determine whether it needed to be returned to the client. So not only did 98% of the index have to be read into memory, 98% of the data had to be read as well, even though only a small subset was returned to the client. I cannot remember the actual number of records or the specs of the hardware, as this was over 15 years ago. But yes, the volume was sufficient to flush all records from the buffers and consequently slow down all other processes as well.

The point is that indexing a single field that would normally be accessed with a range match, like a date, will be pointless in cases where there will always be some other indexed fields involved in equality matches as well, as the ranged index will be ignored.

So given a single index on company, a single index on fiscal-year and a single index on week-of-year, when reporting on a quarter only the first two will be combined, forcing the db engine to read a full year's worth of data for the given company. It will then discard 75% of the records, based on the fact that they are from other quarters, and return only the 25% that match.
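As a sketch (field names as in the description above, not from a real schema):

/* fiscal-year and company-code are equality matches, so those two
   single-field indexes can be combined.  The range on week-of-year
   disqualifies the third index from the combination, so a full year
   of the company's data is bracketed and 75% of it is then discarded
   record by record. */
FOR EACH trans NO-LOCK
    WHERE trans.company-code = "A"
      AND trans.fiscal-year = 2019
      AND trans.week-of-year >= 1
      AND trans.week-of-year <= 13:
    /* ... */
END.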

In short:

When the equality matches bracket a relatively low number of records in each index, combining indexes is great.

On the other hand, when an equality match brackets a significantly large amount of data, the index will mostly work against you.

Consider this scenario:

  • a client transaction table contains the following key fields, each with its own index: company-code, client-code, sku
  • there is also an index on fiscal-year, fiscal-month
  • There are four companies: A, B, C, D
  • They generate data in the table at a ratio of 5:2:2:1
  • Company A generates 1,000,000 transactions in a month.
  • There are 200,000 customers generating an average of 10 transactions each in a month.
  • There are 10,000 SKUs generating roughly 200 transactions each in a month.
  • There are 2,000,000 transactions in a month.
  • The database contains five years worth of data.

Working with the indexes where the data distribution is high:

  • An equality match on customer will match 600 index entries.
  • An equality match on sku will match 12,000 index entries.
  • Finding out when a given customer bought a given item will thus result in a hash join over two result sets, of 600 and 12,000 entries.
    • This is reasonably efficient, given that there are 120,000,000 records.
  • If we generate a statement for the customer for a given quarter, the range match on the period will exclude that index. The db engine will bracket over 600 records. Each one will be read, the fiscal-year and fiscal-month on it will be inspected, and 30 will be returned while 570 will be discarded.
    • This is quite acceptable if you produce the statement for one customer.
    • This will ultimately read all the data, but return only 6,000,000 records (5%), if you produce a quarterly statement for all customers.

 Working with the indexes where the data distribution is low:

  • An equality match on company D will match 12,000,000 index entries.
  • An equality match on company A will match 60,000,000 index entries.
  • An equality match on fiscal-year and fiscal-month will match 2,000,000 index entries.
  • Reading the transactions for company A for a given month will result in a hash join between a result set of 60,000,000 entries and one of 2,000,000.
    • That is a lot of work.
  • Generating a quarterly report for company A will cause a range match on the period, so the index will be ignored. Therefore 60,000,000 records will be retrieved and inspected, 3,000,000 will be returned and 57,000,000 will be ignored.

That is where humans come in. Just looking at the X-REF for the above cases, the compiler's choices might look good, but if you know something about the data distribution in your database you will spot many of the problems even without an X-REF. Most of the time you do not even need to investigate the actual data, as your gut will give you a good feel for the distribution. For example, given that you only have a few users in the cash office, the data distribution on the user-id in the receipt table will be low. Or given the fact that most sales come in through the web, the sales rep on most orders will be "WWW".

As you pointed out, in your case there is most likely no use case for reading data for a fiscal month without the context of a fiscal year, nor is there one for reading a company's data without at least some sort of period associated. I also foresee very little use for reading a period's data without the context of a company, and even if you need that, there are normally quite efficient ways to do it, as you will see below.

For instance, if your index is for company, fiscal-year and week-of-year and you want to produce a summary group report, the following will be much more efficient than joining two different indexes, given that it is much simpler to read a very small table than to hash join two or three very large result sets:

FOR EACH company NO-LOCK,
    EACH trans NO-LOCK
    WHERE trans.company-code EQ company.company-code
      AND trans.fiscal-year EQ x
      AND trans.week-of-year EQ y:

Better yet, reading data for a quarter can still use the full index, because the range bracket is on the last component.

This thread is closed