UPDATE STATISTICS COLUMN DISTRIBUTION

Posted by Admin on 14-Jul-2006 15:45

Hello Guys,

Following a recent discussion @PEG.

Ten entries for the column distribution isn't a lot to go by. I'm especially referring to estimating index brackets.

Let's say we take a table with ten million records; with only ten distribution entries, the estimate resolution would be in the millions.

For example, we wouldn't be able to tell an index bracket of 5 records apart from one of 500,000 records.

How about using something like a table structure with the same index and index components?

Now that it's an indexed structure, we could even have 1,000 entries if we'd like.

It would hold the index key after every n rows, plus the first and last entries in the index, and of course the key's sequence number within the index.
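A rough sketch of what such a stats table could look like in the 4GL, assuming a character key; the names here (ttKeyStats, statSeq, keyVal) are purely illustrative, not anything Progress actually ships:

/* hypothetical per-index stats table, one entry per n rows of the index */
DEFINE TEMP-TABLE ttKeyStats NO-UNDO
    FIELD statSeq AS INTEGER    /* sequence number of the sampled key within the index */
    FIELD keyVal  AS CHARACTER  /* key value sampled after every n rows, plus the first and last keys */
    INDEX ixKeyVal IS PRIMARY keyVal
    INDEX ixSeq statSeq.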

To get an estimate of an index bracket:

We can take the bracket's lower value and find the FIRST entry in that stats table that is GREATER THAN OR EQUAL to that value.

Now we can say that the index bracket's lower value lies between that entry and the one before it.

Similarly, take the bracket's higher value and find the LAST entry in the stats table that is LESS THAN OR EQUAL to that value.

Now we can say that the index bracket's higher value lies between that entry and the one after it.

The distance between them, inclusive, gives the estimate.
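As a rough sketch, assuming the ttKeyStats table above and a hypothetical sampling interval of iSampleSize rows per entry, the estimate could be computed along these lines:

DEFINE VARIABLE cLowVal     AS CHARACTER NO-UNDO. /* bracket's lower value  */
DEFINE VARIABLE cHighVal    AS CHARACTER NO-UNDO. /* bracket's higher value */
DEFINE VARIABLE iLowSeq     AS INTEGER   NO-UNDO.
DEFINE VARIABLE iHighSeq    AS INTEGER   NO-UNDO.
DEFINE VARIABLE iSampleSize AS INTEGER   NO-UNDO INITIAL 10000. /* n rows per stats entry */
DEFINE VARIABLE iEstimate   AS INTEGER   NO-UNDO.

/* first stats entry greater than or equal to the bracket's lower value */
FIND FIRST ttKeyStats WHERE ttKeyStats.keyVal >= cLowVal USE-INDEX ixKeyVal NO-ERROR.
IF AVAILABLE ttKeyStats THEN iLowSeq = ttKeyStats.statSeq.

/* last stats entry less than or equal to the bracket's higher value */
FIND LAST ttKeyStats WHERE ttKeyStats.keyVal <= cHighVal USE-INDEX ixKeyVal NO-ERROR.
IF AVAILABLE ttKeyStats THEN iHighSeq = ttKeyStats.statSeq.

/* each entry stands for roughly iSampleSize rows, so the number of entries
   between the two, inclusive, gives the estimate, give or take one sampling
   interval at either end; brackets that fall entirely outside the sampled
   range are not handled here */
iEstimate = (iHighSeq - iLowSeq + 1) * iSampleSize.

With 1,000 entries over a ten million row table, that bounds the error to roughly 10,000 rows at each end instead of a million, so a 5-record bracket and a 500,000-record bracket would no longer look the same.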

Another idea that was brought up at PEG is using the index B-tree structure itself to estimate a bracket's size.

That, in theory, wouldn't require collecting stats and has the potential to be very precise.

Why wouldn't it work?

All Replies

Posted by gus on 20-Jul-2006 13:30

These are not unreasonable ideas and could be considered for inclusion in future releases.

The query optimizer uses the statistics in a variety of ways, such as determining which indexes to use and what order to do joins in. The amount of detail required depends on the data type as well as the number of rows in a table. For example, if a column has a non-unique index and only two or three distinct values, then you don't need much information; an entry for every 1,000 rows is overkill for large tables and not enough if you have 100 rows.

Using the index b-tree to estimate key distributions is a reasonable thing to consider.

Posted by Admin on 20-Jul-2006 14:53

You're a bigger man than I am.

Yes, I spoke too soon on that.

I think there's enough right now to write a cost-based query optimizer in the 4GL.

... I bet you a beer I can write one.

Posted by Admin on 21-Jul-2006 09:10

Is the 4GL not really intended for, or focused on, intensive, high-end read operations?

It's not just the optimizer, though that seems to me like a very central feature.

I don't know of any read-focused benchmarks PROGRESS has ever published. Is there even a procedure in place like ATM?

Enhancement requests haven't gotten any attention, etc. It feels like this is not something PROGRESS is all that interested in.

I can appreciate the challenge of processing high volumes of transactions. But I have to admit it's never really been a problem for me since I started using PROGRESS in version 6.

Posted by Admin on 26-Feb-2007 10:09

Hi, are you the davrozen who appeared on eBay on 24/25th Feb 2007?

Posted by ChUIMonster on 27-Feb-2007 16:47

Good points.

The general idea of putting more emphasis on read throughput is one that I think is very important. In the field it is much more common for read bottlenecks to be troublesome than it is to have transaction throughput problems.

Perhaps Sal has some news about the status of the long-awaited NO-INDEX enhancement?

This thread is closed