Can the different worlds hear each other?

Posted by George Potemkin on 07-Dec-2014 13:08

I'm trying to analyze my disappointment after the Info Exchange at the EMEA PUG Challenge 2014. We voted on the enhancement requests. Programmers voted on DBA stuff, DBAs on new language features. If you did not vote for a feature because it was outside your area of competence, that was treated as a vote against the feature. There was no opportunity to advocate for the suggested features. As a result, DBAs from small shops did not support the features required by big clients. In other words, they voted against their own future. By "big clients" I mean sites with terabyte databases, thousands of connections /and/ strong business requirements.

How long does it take to restart a database in a small shop? A few minutes. For big clients it might take an hour or more. The larger the database, the higher the chance that there are large transactions that need to be undone. The higher the number of connections, the higher the chance that some sessions will hang because they were interrupted in an unexpected place. And the cost of one minute of downtime is much higher for big clients. I have seen cases where DBAs in small shops restarted a database in the middle of working hours just because I suggested they change the values of some startup parameters. And there were cases where a 'big client' decided not to restart a database even in a painful situation (due to some issue the users could not connect to or disconnect from the database) just to avoid downtime in the middle of the day. So no matter how good our recovery plan is, we can't guarantee that downtime will be short for big clients. We can't even predict the time of a db restart in concrete cases - we just can't see what is going on inside a database during its shutdown or during its startup.

Big clients don't have any options to repair the indexes. Not a single one. I'm not kidding. Let's take, for example, 'idxfix / 1. Scan records for missing index entries'. The time to run idxfix grows as (db size)^2. The more records in the database, the longer idxfix will run. But there is another factor: the bigger the indexes, the longer the disk IO waits while jumping from the blocks in the table's area to the blocks of one index, then to the blocks of another index, and so on.

For example, 'idxfix/scan records' running against a 400 MB database on my laptop:
Record operations: 507,533.39 (Per Sec)
Index operations: 253,766.70 (Per Sec)
Total o/s reads: 2,149.00 (Per Sec)

The same option of idxfix running against a 2 TB database on a million-dollar box with Hitachi USP storage arrays:
Record operations: 1,480.00 (Per Sec)
Index operations: 2,959.50 (Per Sec)
Total o/s reads: 163.00 (Per Sec)

Is my laptop 340 times faster than the customer's "monster"? Of course not! But it turns out that idxfix will take more than 2 months (sic!) to scan all records in the 2 TB database. Fortunately, not a year. But wait! That's just one option of idxfix. We also need to run option 2: Scan indexes for invalid index entries. Option 2 is slower than option 1 because for each index idxfix will re-read all records from the table's area to validate them. Also, the data above was taken from a backup server where there were no other users running against the same database. But we, of course, would like to run idxfix in the production environment, where users lock records. Hence idxfix will wait for locked records. Practice shows that the default value of -lkwtmo (1800 seconds) is not enough for idxfix. Hence idxfix might process just 0.0005 locked records per second. It's crazy, but I would not be surprised if these two options of idxfix together needed a year to complete in the production environment for one database (not the biggest one). Provided the database is not restarted during that year - otherwise we would need to start from the beginning. Who would wait a year to fix index corruptions? Even a month is not an option.
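As a rough sanity check of that figure - taking the 8,644,278,754 records that dbanalys reports for the big table and the 1,480 record operations per second measured above - the arithmetic is simply records divided by rate:

echo "8644278754 1480" | awk '{printf "%.1f days\n", $1 / $2 / 86400}'
# => 67.6 days, i.e. a bit over two months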

We can build a multi-threaded record scan ourselves on top of idxfix (why doesn't it exist out of the box from Progress?), but not a multi-threaded index scan. Only the developers at PSC can provide a solution for the second option of idxfix.

It's much faster to rebuild the indexes than to run idxfix. It would take "just" a couple of days. But two days of db downtime is not an option for big clients - it's too costly, including for their reputation.

We could add a new index, activate it online and then drop the old (corrupted) index. But the new locking protocol introduced in V10.2B06 makes it impossible to use idxactivate unless you use an existing unique index to build the new one. What if a table does not have a unique index? What if we need to rebuild a unique index itself?

Big clients also have an issue with data replication. The time to roll forward an AI file on the target db used to be about 10% of the time it took to fill that AI file on the source db, even for the most transaction-active databases. But for terabyte databases those 10-20% can turn into 200-300% due to the scattered updates on the source db. Note that if roll forward is persistently slower than 100%, we are in fact losing the warm spare database - the gap between the databases will just keep growing over time. The same issue exists for OE Replication, but there we don't have any metrics to measure how fast a replication agent on the target database is. The delay in replication between the databases mainly happens due to network problems or because a replication agent was not running on the target db for some reason. If the delay is zero, it does not mean that the replication agent is idle. In other words, big clients don't have a reliable mechanism to maintain a spare database.

Small shops don't have these issues - why would they vote for enhancement requests meant to solve them? The big clients are few. They have only one vote, no matter how much they paid for their licenses. Does PSC have a future if it doesn't hear the big customers?

All Replies

Posted by Thomas Mercer-Hursh on 07-Dec-2014 13:20

Of course, George, the votes taken in an Info Exchange aren't the deciding factor, by any stretch of the imagination. I think the product managers do a pretty good job of listening, and the better the use case one provides for them, the more likely they are to understand the impact and significance. Yes, the kinds of changes you want may only apply to a small number of sites, but they are important sites and I think that is well understood.

Posted by George Potemkin on 07-Dec-2014 13:55

I have been trying to draw PSC's attention to the problems of "big customers" for the last year. Unfortunately I don't see any changes in Progress that would help to solve them. And I'm forced to deal with such issues at our big customer almost every day. The worst-case scenario has not happened yet, but I feel powerless in trying to prevent it.

Posted by Thomas Mercer-Hursh on 07-Dec-2014 14:21

And based on your posting history, I know you have lots of data to back up your position. I hope that you write it up and get it to the relevant people. I am sure that they will listen ... I just don't know how easy any of it is to do. Their plate for the DB has been pretty full lately with encryption, multi-tenancy, and horizontal partitioning (which seems to need some finishing), so maybe things will pick up for high availability issues after this. They do need to listen to all of us! I have my list too!

Posted by George Potemkin on 08-Dec-2014 03:06

> I just don't know how easy any of it is to do.

I would say it's easy, because code similar to what we need already exists in other Progress functionality:

1) Promon could be allowed to run when a shutdown has already been initiated (just as we can start an emergency shutdown after a normal one). Then we could at least see what's going on during a shutdown;

2) The issue with slow replication could be solved by asynchronous page readers on the target db (they would be similar to the asynchronous page writers);

3) 'Idxfix/scan records' could be multi-threaded. First of all, we could split the recid range into chunks and run a separate thread for each chunk. Secondly (and more usefully, IMHO), idxfix could be multi-threaded by the indexes to check. A parent process would read only the records, lock them and put them into a special buffer. Child processes would validate the index keys for the records stored in the buffer - one child process per index. When all index keys of a record have been validated, the record would be unlocked and released from the buffer;

4) 'Idxfix/scan indexes' could support selection by keys, similar to the "dumpspecified" option of binary dump. Then we would not need to scan the whole index if we already know the key range to check. It could also be multi-threaded, similar to the multi-threaded binary dump (though I dislike how this option was implemented for binary dump);

5) Idxfix could skip locked records and check them later - similar to an online backup, which does not write the blocks in exactly the same order they are stored in the database.

The only enhancement request that would be hard to implement is enabling db connections before crash recovery has completed. Logically it looks simple: the redo/undo operations during crash recovery are very similar to normal transaction activity (at least judging by the recovery notes they generate). Hence, in theory, crash recovery could coexist with the activity of normal users. But as we know, the protocols used for the BI file are very complicated, and changes in this area are very risky.

And all of the above is not about options that "would be nice to have". These are options that big clients, IMHO, desperately need.

Posted by Rob Straight on 08-Dec-2014 09:55

To second what Thomas stated, the Info Exchanges are one of many ways that Progress gathers user feedback on enhancements. The main focus is to capture the feedback - the "voting" is simply an attempt to get a reading from the set of people who are participating in that particular discussion. We do not use the voting alone to make decisions on OpenEdge feature enhancements.

As mentioned as part of our 2-year roadmap, High Availability is high on our list of areas for improvement, across the product suite and of course specifically as it relates to the DB.

I appreciate your detailed suggestions, as does our DB Engineering team. Let's stay connected as it is essential that we leverage the experience that you and others in the OE community have.

Posted by Thomas Mercer-Hursh on 08-Dec-2014 10:15

George is great at coming up with detailed numbers and analysis, so I would think he would be a key source in understanding these issues.

Posted by TheMadDBA on 08-Dec-2014 11:45

Agree with all of the issues that George identified, as well as the frustration of trying to run a real-world multi-TB Progress database. Uptime, backups, AI applies and online maintenance are all a huge pain compared to Oracle and DB2.

The new idxbuild is a major improvement, and we really need those kinds of improvements for most of the database utilities.

As far as your idxfix issue goes, George... that utility (along with several others) is notorious for making tons of small IO operations. If you have enough filesystem cache, you can drastically reduce the amount of time required to run idxfix/idxcheck.

When I started idxfix on a copy of production, the initial estimate was several days. I was able to use dd (over and over again as the utility progressed through the index extents) to read the contents of the index area into the AIX buffer cache. We were able to run idxfix in a few hours using this approach... although it required much more of my time to keep the buffer cache filled.
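In rough outline the loop looked something like the sketch below - the extent paths and the re-read interval are placeholders, not the exact commands I used:

# keep re-reading the index extents so they stay in the filesystem cache
while true
do
    for f in /db/prod/idx/proddb_8.d*      # index area extents (example path)
    do
        dd if=$f of=/dev/null bs=1M
    done
    sleep 300                              # repeat while idxfix is running
done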

We shouldn't have to do tricks like this in a modern DB, but hopefully that trick can help you out until the utilities are fixed.

Posted by Rob Straight on 08-Dec-2014 14:09

Another shorter term possibility is to consider Table Partitioning to subset the data. DB utilities can then be run on smaller subsets vs. the entire dataset.

Posted by TheMadDBA on 08-Dec-2014 14:17

Partitioning can help IF

1) you are lucky enough to be dealing with a local index and not a global index

2) you know which partition is the issue

3) that index partition ends up being small enough to allow you to suffer through the utilities

Otherwise you are right back to waiting on the utilities to run at their slow pace.

Posted by Brian Bowman on 08-Dec-2014 15:42

George et al. -

Thank you for your posts. I would reiterate everything that Thomas and Rob have said already. There are many different sources we draw our information from to help us figure out what goes into the next release and what direction we need to move in. When it comes to the database, I take into account all of these sources as well as direct feedback from the many customers that product management and the field have talked to. This includes our Partners as well as all of our enterprise customers. Everyone has their own list of priorities and there is no shortage of what we "should" do (in everyone's own mind).

Oftentimes, as you have pointed out, not all customer requirements are the same. Different customers have different requirements based on where they are in their business. We have to balance this with our product direction and the work we need to do to be able to move forward in 2 or 3 years.

Please continue to provide feedback to us.

Regards

Brian Bowman

OpenEdge Product Manager - Best Database on this planet...

Posted by TheMadDBA on 08-Dec-2014 16:04

Brian: Progress has made more database improvements over the last few years than I had realistically hoped for. TDE was an absolute blessing for my last company. The idxbuild improvements also made a major difference for maintenance/release windows.

I do think that a little more weight should be given to the suggestions/pains of larger DB users compared to the smaller shops. Like George said... things that are minor to the smaller shops can and do cause million dollar outages for the larger shops.

Repeatedly having to explain to IT executives why certain things are big issues for Progress databases when they are non-issues on other databases just leads to extra drama. Especially when you are trying to influence them on new projects/products.

Posted by George Potemkin on 10-Dec-2014 02:53

Rob: The customer is planning to use Table Partitioning but, IMHO, it will only slightly help with repairing the indexes.

Brian: I agree that not all customer requirements are the same. But the large customers are a vanguard. They are facing the issues that other customers will face in the future, including the issues that come from large db sizes. I remember the days when 2 GB databases were treated as very large. I would not be surprised if, in the foreseeable future, 2 TB databases become a common case.

Probably in the messages above I should have avoided the words "enhancement requests". I'm just looking for a solution to the customer's issue. The customer occasionally gets corrupted indexes in their databases. It may happen to any customer. Fortunately for us, the previous corruptions happened in more or less small tables. Nevertheless, in some cases we were forced to use tricks to fix the indexes. For example, for bug PSC00262763, where a client transaction hangs in an infinite loop reading the same index blocks. Idxfix hung as well. Progress development did a fantastic job identifying the root cause of the corruption and fixing it. But I would like to get more - namely, an official tool that could quickly fix corrupted index blocks in future cases. According to Progress development, the trick that we used was unreliable, but we did not have any other working option to fix the corruption in an index block. Recently the customer got index corruption in a very large table (2 TB) and now we again see that the standard tools do not work well in such a case. Fortunately it happened in a database that is used only for tests. But I would like to use this incident to create a standard instruction: what the customer should do in the case of index corruption in the production environment. I don't want to improvise with every new case. The business would not wait two months while idxfix scans the indexes of a large table. What can I do right now to be ready for similar incidents in the production environment?

TheMadDBA: thanks for the idea of "warming up" the filesystem cache. Though I'm going to try a slightly different approach. Our particular database has only one table, so we can assign all its indexes to the alternate buffer pool. The total size of the indexes is 335 GB (20% of the db size). That may be too big for the buffer pool, but we can use less memory by scanning one index at a time. There are a few ways to warm up the buffer pool. We can run ixanalys for the index area linked to the alternate buffer pool (see the sketch after the list below). A full dbanalys for this database took almost 24 hours; I think ixanalys will complete in 5 hours (20% of the dbanalys time). But we could also load into memory the blocks of a selected index only - for example, using the undocumented 'idxblockreport' option of proutil, or by running 'idxfix/scan indexes' /without/ record validation. We did some tests, and the surprise: idxfix was working faster (by 50%!) than idxblockreport, although the latter is supposed to be block-oriented while idxfix checks all index keys in every block. Another interesting result of the tests: the low disk activity while these utilities were walking through an index tree (~250 8K blocks per sec, or 2 MB/sec, or 2 days to read all index blocks). The 'vmstat' command shows high disk IO waits. The same 'idxfix/scan indexes' option but with record validation reads 10 times more blocks from disk per second - of course, it mostly reads the blocks with records. This means, IMHO, that we need to take into account the scatter factor of the index blocks (not to be confused with the logical scatter factor of the indexes). Unfortunately the Progress tools do not report such a metric. I guess idxbuild will significantly improve the scatter factor of the index blocks. That's another thing to test. And some new ideas for enhancement requests (it's not me - it's the customer's environment that is fruitful for new ideas ;-):

1) an option for 'idxfix/scan indexes' to read the index blocks sequentially rather than from the root block down to the leaf blocks (in our case it would speed up a single-threaded 'idxfix/scan indexes' by 10 times);

2) online de-scattering of the index blocks without rebuilding the indexes.
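A sketch of the warm-up plan mentioned above - the database name, area name and -B2 size are placeholders, not the customer's real values:

# assign the index area to the alternate buffer pool (one-time step)
proutil proddb -C enableB2 "Index Area"

# start the database with an alternate buffer pool sized for the index being scanned (-B2 is in blocks)
proserve proddb -B 900000 -B2 2500000

# warm the alternate buffer pool by reading the index area, e.g. with ixanalys
proutil proddb -C ixanalys > ixanalys.out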

Posted by TheMadDBA on 10-Dec-2014 09:43

Sounds like you need a storage area per index :)

Good luck with the -B2 approach, that may work fine for your database. Ours had a collection of large tables (200-500GB) but not one single table that large.

Also... those times for ixanalys and dbanalys seem pretty bad even for a database that size. Using dd I was able to get read rates over 400 MB a second using bs=1M. Might be interesting to see how fast your disks can go, at least.
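Something along these lines, with an extent path and size of your choosing, gives a quick read-throughput number:

# sequential read test against one extent (path and count are examples)
time dd if=/db/prod/data/proddb_7.d1 of=/dev/null bs=1M count=10240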

Posted by George Potemkin on 14-Dec-2014 12:55

We ran the tests with idxfix and the results exceeded the worst expectations. An online idxfix for just one of the indexes would take from a day up to 2 years (TWO YEARS)! Or 6 days for an offline index rebuild.

That's too long even for a test, so we got only estimates of the scan times. The test scenario: we run a command, wait a few minutes to warm up the buffer pool, collect promon statistics during one minute (using 30 2-sec intervals), stop the process and then run the next command for the same index. We tested the following commands: idxblockreport, 'idxfix/scan indexes' with/without record validation, and 'idxfix/scan records'. The database was not restarted between the tests. The scan time was estimated as the number of records in the table (reported by dbanalys) divided by the record operations per sec (reported by promon/Activity: Performance Indicators) - multiplying by two seems to give a more accurate estimate, but then the numbers would be even scarier - and/or as the number of index blocks divided by the index block reads per sec (reported by Activity: I/O Operations by Type). These estimates can be verified against the logical reads (a.k.a. db requests) reported by Activity: Buffer Cache. The exception is idxblockreport, which almost does not update the logical reads. For comparison we also have the execution times of ixanalys for each index separately.
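For example, for Index1 with record validation (see the numbers further down), the two estimates are simply:

# estimate 1: records in the table / record operations per sec
echo "8644278754 158" | awk '{printf "%.1f years\n", $1 / $2 / 86400 / 365}'   # ~1.7 years

# estimate 2: index blocks / index block reads per sec
echo "52376577 0.73" | awk '{printf "%.1f years\n", $1 / $2 / 86400 / 365}'    # ~2.3 years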

Sidenote #1: a one-minute monitoring interval is too short to provide an accurate estimate if a process is going to run for days, months or years, especially for read operations when a large db buffer pool and the filesystem cache are used. But, IMHO, it's enough at least for a quick draft estimate.


Sidenote #2: idxfix creates a transaction per index key. But it's not a problem because the status of these transactions is just "allocated" and they require only a few db resources (2*MTX + 2*LKF locks per Tx). Idxblockreport created one transaction approximately every 2 seconds, but that's just a "feature" of the undocumented command.

The estimated run duration of idxfix against a copy of the production database (8,644,278,754 records, record size: 1.3T, total size of the indexes: 334.6G - from 16.9G to 241.2G individually):
The best case for 'idxfix/scan indexes' with record validation is approximately 20 hours; the worst case - 2 years.
The best case for 'idxfix/scan records' is 16 hours; the worst case - 289 days.

Ixanalys is incomparably faster (by 170 times) than idxblockreport, although both commands do block-level operations. Ixanalys reads the index blocks sequentially, while idxblockreport reads them in index tree order. The difference between the times stayed huge even after idxbuild. 'Idxfix/scan indexes' without record validation is approximately 4 times slower than idxblockreport (tested on a small database). Both commands read the same blocks in the same order, but idxfix additionally parses all keys in an index block - like 'for each ... use-index' but without real reads of the records. As expected, 'idxfix/scan indexes' with record validation is significantly slower than without validation, especially on the indexes with a bad logical scatter factor.

Idxbuild (done with pfactor 80%) remarkably improved the scan times for the "fast" indexes and almost did not affect the scan times for the "slow" ones. Obviously, the slowness of an index when running idxfix is mainly the result of its logical scatter factor and its size, and depends less on the scatter factor of the index blocks.

Logical reads varied between the tests from 300 requests per sec to 400,000. The last number seems to be the best result on the customer's hardware for a single-threaded run of idxfix. To check how the results depend on OS/hardware, we ran a "calibration" test. We created a database with a table that has one field and one index, and records with field values from 1 to 10 million. Db size is 400 MB. We tested only 'idxfix/scan records'. I ran the same tests on my laptop. The results:

Customer's test server:
Logical reads: 401,025 blk/sec
Record operations: 160,284 rec/sec
Estimated scan time: 62 sec
Index operations: 80,142 key/sec

My laptop (first run after reboot):
Logical reads: 578,132 blk/sec
Record operations: 230,981 rec/sec
Estimated scan time: 48 sec
Index operations: 115,491 key/sec
Real scan time: 159 sec

My laptop (second run):
Logical reads: 1,280,805 blk/sec
Record operations: 511,720 rec/sec
Estimated scan time: 21 sec
Index operations: 255,860 key/sec
Real scan time: 44 sec

Obviously my laptop shows the better results. Maybe because its CPU frequency is higher (2.40 GHz). Maybe because the customer is running on an LPAR. But this is not a factor that would help us scan the indexes of a large database in a reasonable period of time (instead of two years).

The timings (there are a lot of numbers below, I'm sorry):

dbanalys took 23:56:11.
ixanalys for all indexes in one area (before idxbuild): 00:39:42

The first tests were run before idxbuild; then each index was moved to its own area and the indexes were rebuilt, and the tests were re-run. The results after idxbuild are given in brackets below.

Idxbuild ran for a bit more than 6 days:
[2014/10/16@18:49:35.620+0300] P-9371946 T-1 I : (451) Idxbuild session begin for root on /dev/pts/4.
[2014/10/22@20:06:20.194+0300] P-9371946 T-1 I : (334) Idxbuild session end.

During the tests the database was running with -B 900000 (6.9 GB).


Index1           Fields  Levels    Blocks    Size  % Util  Factor
Before idxbuild       6       5  52376577  241.2G    60.6     1.8
After idxbuild        6       5  39988575  240.7G    79.2     1.4

Ixanalys exec time: 00:24:57

Idxblockreport
Buffer Hits: 66%
Logical reads: 5.7 blk/sec (2.9 blk/sec)
Index block reads: 159 blk/sec (165 blk/sec)
Estimated scan time: 3.8 days (2.8 days)

Idxfix / Scan indexes / Not validate recids for index entries
Buffer Hits: 99%
Logical reads: 51,938 blk/sec (380,254 blk/sec)
Index block reads: 166 blk/sec (1,143 blk/sec)
Estimated scan time: 3.7 days (09:43:11)
Index operations: 50,943 key/sec = 308 key/blk (373,432 key/sec = 2,255 key/blk)
Estimated scan time: 2 days (06:25:48)

Idxfix / Scan indexes / Validate recids for index entries
Buffer Hits: 50%-52%
Logical reads: 320 blk/sec (280 blk/sec)
Data block reads: 156 blk/sec (138 blk/sec)
Record operations: 158 rec/sec = 1.01 rec/blk (139 rec/sec = 1.01 rec/blk)
Estimated scan time: 1.7 years (2 years)
Index block reads: 0.73 blk/sec (0.35 blk/sec)
Estimated scan time: 2.3 years (2 years)
Index operations: 158 key/sec = 216 key/blk (139 key/sec = 190 key/blk)

Idxfix / Scan records
Buffer Hits: 99%
Logical reads: 32,422 blk/sec
Data block reads: 0.017 blk/sec
Record operations: 9,259 rec/sec
Estimated scan time: 11 days
Index block reads: 160 blk/sec
Index operations: 4,639 key/sec = 29 key/blk


Index2           Fields  Levels    Blocks    Size  % Util  Factor
Before idxbuild       2       4  13891121   56.6G    53.6     1.9
After idxbuild        2       4   9366784   56.5G    79.3     1.4

Ixanalys exec time: 00:09:05

Idxblockreport
Buffer Hits: 66%
Logical reads: 1.5 blk/sec (1.5 blk/sec)
Index block reads: 140 blk/sec (170 blk/sec)
Estimated scan time: 1 day (15:20:35)

Idxfix / Scan indexes / Not validate recids for index entries
Buffer Hits: 99%
Logical reads: 42,151 blk/sec (365,137 blk/sec)
Index block reads: 229 blk/sec (776 blk/sec)
Estimated scan time: 16:49:01 (03:21:12)
Index operations: 41,004 key/sec = 179 key/blk (361,292 key/sec = 466 key/blk)
Estimated scan time: 2.4 days (06:38:46)

Idxfix / Scan indexes / Validate recids for index entries
Buffer Hits: 63%-94%
Logical reads: 10,253 blk/sec (14,304 blk/sec)
Data block reads: 1,530 blk/sec (1,908 blk/sec)
Record operations: 5,079 rec/sec = 3.31 rec/blk (7,127.90 rec/sec = 3.74 rec/blk)
Estimated scan time: 20 days (14 days)
Index block reads: 20 blk/sec (10 blk/sec)
Estimated scan time: 8 days (11 days)
Index operations: 5,079 key/sec = 254 key/blk (7,128 key/sec = 704 key/blk)

Idxfix / Scan records
Buffer Hits: 82%-90%
Logical reads: 1,042 blk/sec
Data block reads: 0 blk/sec
Record operations: 346 rec/sec
Estimated scan time: 289 days
Index block reads: 137 blk/sec
Index operations: 173 key/sec = 1.3 key/blk



Index3           Fields  Levels    Blocks    Size  % Util  Factor
Before idxbuild       3       4   8335147   19.9G    31.4     2.4
After idxbuild        3       4   3276506   19.7G    79.3     1.4

Ixanalys exec time: 00:04:08

Idxblockreport
Buffer Hits: 66%
Logical reads: 5.4 blk/sec
Index block reads: 728 blk/sec (584 blk/sec)
Estimated scan time: 03:10:47 (01:33:27)

Idxfix / Scan indexes / Not validate recids for index entries
Buffer Hits: 99%
Logical reads: 231,856 blk/sec (282,018 blk/sec)
Index block reads: 365 blk/sec (264 blk/sec)
Estimated scan time: 06:20:07 (03:26:36)
Index operations: 230,063 key/sec = 629 key/blk (280,720 key/sec = 1062 key/blk)
Estimated scan time: 10:26:14 (08:33:13)

Idxfix / Scan indexes / Validate recids for index entries
Buffer Hits: 99%
Logical reads: 225,061 blk/sec
Data block reads: 1,864 blk/sec (1,900 blk/sec)
Record operations: 112,203 rec/sec = 60 rec/blk (112,667 rec/sec = 59 rec/blk)
Estimated scan time: 21:24:01 (21:18:44)
Index block reads: 136 blk/sec (88 blk/sec)
Estimated scan time: 17:01:50 (10:21:45)
Index operations: 112,203 key/sec = 825 key/blk (112,667 key/sec = 1,283 key/blk)

Idxfix / Scan records
Buffer Hits: 99%
Logical reads: 383,050 blk/sec
Data block reads: 995 blk/sec
Record operations: 127,363 rec/sec = 128 rec/blk
Estimated scan time: 18:51:11
Index block reads: 123 blk/sec
Index operations: 63,682 key/sec = 519 key/blk



Index4           Fields  Levels    Blocks    Size  % Util  Factor
Before idxbuild       3       4   3982408   16.9G    55.7     1.9
After idxbuild        3       4   2792427   16.8G    79.3     1.4

Ixanalys exec time: 00:03:31

Idxblockreport
Buffer Hits: 66%
Logical reads: 1.8 blk/sec (124 blk/sec)
Index block reads: 139 blk/sec
Estimated scan time: 07:57:35

Idxfix / Scan indexes / Not validate recids for index entries
Buffer Hits: 99%
Logical reads: 372,976 blk/sec (402,505 blk/sec)
Index block reads: 208 blk/sec (128 blk/sec)
Estimated scan time: 05:19:03 (06:02:40)
Index operations: 371,969 key/sec = 1,788 key/blk (401,909 key/sec = 3,132 key/blk)
Estimated scan time: 06:27:19 (05:58:28)

Idxfix / Scan indexes / Validate recids for index entries
Buffer Hits: 50%-58%
Logical reads: 697 blk/sec (637 blk/sec)
Data block reads: 320 blk/sec (299 blk/sec)
Record operations: 347 rec/sec = 1.08 rec/blk (318 rec/sec = 1.06 rec/blk)
Estimated scan time: 288.5 days (315 days)
Index block reads: 0.75 blk/sec (0.40 blk/sec)
Estimated scan time: 62 days (81 days)
Index operations: 347 key/sec = 462 key/blk (318 key/sec = 794 key/blk)

Idxfix / Scan records
Buffer Hits: 99%
Logical reads: 444,789 blk/sec
Data block reads: 0 blk/sec
Record operations: 147,891 rec/sec
Estimated scan time: 16:14:10
Index block reads: 10 blk/sec
Index operations: 73,945 key/sec = 7447 key/blk


Posted by TheMadDBA on 15-Dec-2014 11:43

George... what version of OE was this? If 10.2B08 or better, what options did you use for all of the new idxbuild parameters?

Those are some pretty shocking numbers but I am not entirely amazed by the idxfix numbers. Those have always been bad.

Posted by George Potemkin on 16-Dec-2014 03:03

Hi TheMadDBA,

> what version of OE was this?

10.2B08, AIX, IBM P570, POWER6, 4 CPUs, 4704 MHz, 32 GB RAM

> what options did you use for all of the new idxbuild parameters?

I only have the information from the db log:

(13942) Idxbuild is running with threads, the maximum number of threads is 9.

(11443) Temporary sort file at: /test/tmp/ will use the available disk space.

(-----)  Process memory 72698912 of 32883343360 in use

(-----)  Max merge pool entries estimated at: 250325 w/memory usage: 16405299200

(-----) TMB value is 64, TM value is 32, SG value is 64, packing factor is 80.

A year or so ago the customer tested the new idxbuild parameters and SSDs for the temp files. In the current case I don't care about optimizing the idxbuild run. Maybe it could complete in a day instead of 6 days. But even one day out of business is not an option for the customer just to repair 1-2 records with corrupted indexes. Idxbuild can be used during planned downtime, but I would not expect that to happen more often than one day per year.

> but I am not entirely amazed by the idxfix numbers.

Now I would say we could have predicted such bad results. For "good" indexes (the ones with a low logical scatter factor) the scan time is directly proportional to the number of records. For "bad" indexes the scan time grows as the number of records to the power of 2: idxfix is forced to jump from data blocks to index blocks and vice versa. Due to the high logical scatter factor it's unlikely that the next blocks idxfix reads will be the ones it recently used. This causes waits on disk IO. The more records in the table, the larger the indexes, the bigger the jumps between disk IOs and the longer the waits. If one minute is enough to scan 10 million records, then the time to scan almost 10 billion records (as in our case) would be a thousand times longer for "good" indexes and almost a million times longer for a "bad" one. A thousand minutes is 17 hours, and a million minutes is 694 days, or two years - almost exactly the values we got in the tests.

A multi-threaded idxfix could decrease the scan times tenfold. It's not a perfect solution for "bad" large indexes, but it's a solution that Progress could implement rather easily: just take the multi-threaded binary dump (and dumpspecified) and replace the final operation - instead of dumping a record, check its indexes. It's as easy as changing the attachment on a power tool.

But if Progress could entirely change the algorithm of idxfix, then we would get a solution even for "bad" indexes. A possible solution: idxfix could read the index blocks as idxblockreport does and put them into the buffer pool. Simultaneously, it should build a temporary index of the recids found in the index keys. We know this index as the "default" index. It's much smaller than any normal index of the same table. So idxfix could use additional memory (separate from -B) to build the "default" index in memory. When the buffer pool is entirely filled with index blocks, idxfix should stop reading index blocks. Now it should read the records using the "default" index and check their index keys. One block in memory would be enough for reading the data blocks. In this phase idxfix reads the data blocks sequentially, without the jumps during disk IO - the corresponding index blocks are already in memory. When the phase is completed, idxfix should return to the first phase to fill the buffer pool with a new chunk of index blocks. Then it will again read the records using a new instance of the "default" index. A new phase of record reads may re-read from disk the data blocks that were used in previous phases. That is, of course, an overhead. But the current version of idxfix does the same for each record in a data block when it scans a "bad" index, and it does this after a jump from an index block.

Let's take, for example, the worst index from our test. Its size is 200 GB. The box has 32 GB of RAM. We can start the db with -B 20 GB, so the whole index will be divided into 10 chunks. We will re-read the area with the records 10 times, using the different instances of the "default" index. In the case of a "good" index, each pass would re-read only one-tenth of the area. That is not true for "bad" indexes, but on the other hand it's unlikely that every pass would re-read the whole area. Let's say we re-read each block in the data area 5 times. Then the time to check our index will be roughly the time of idxblockreport plus the time of tabanalys multiplied by 5. No doubt that would be much less than 2 years. We have a couple of indexes that are smaller than 20 GB; they can be processed in just one iteration.
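As a rough illustration, using the idxblockreport estimate for Index1 above (~2.8 days) and the tabanalys time for the table's area mentioned later in this thread (1 hour 44 min):

echo "2.8 104" | awk '{printf "%.1f days\n", $1 + 5 * $2 / 60 / 24}'
# => ~3.2 days instead of ~2 years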

The suggested algorithm could be multi-threaded as well. But I am afraid that if such enhancements of idxfix are only implemented in 2 or 3 years, then by that time we will no longer have the large customers who need them.

Posted by George Potemkin on 02-Jan-2015 14:47

> those times for ixanalys and dbanalys seem pretty bad even for a database that size

Tabanalys for the table's area takes 1 hour 44 min.

Ixanalys for the index's area takes 40 min.

In both cases the proutil is processing approximately 32,000 blocks per sec.

Offline dbanalys takes ... 21 hours 20 min.

Online dbanalys completed in 2 hours 24 min (as expected, it's the sum of the tabanalys and ixanalys times). The -B value on db startup did not matter.

The outputs of the two dbanalys commands are exactly the same except for warning # 2486:

[Warning] database in use - reported statistics are approximations. (2486)

To find the place where offline dbanalys spends most of the time, we added timestamps to each line of its output:

proutil db -C dbanalys | \
while IFS="" read Line
do
  echo `date '+%H:%M:%S'` "$Line"
done > dbanalys.log

The result of the offline dbanalys:

17:07:20 RM CHAIN ANALYSIS
17:07:20 -----------------
17:07:20
13:57:04 22675431 block(s) found in the RM chain of Table object 1

Online dbanalys:

18:51:03 RM CHAIN ANALYSIS
18:51:03 -----------------
18:51:03
18:51:03 22675431 block(s) found in the RM chain of Table object 1

In other words, the offline dbanalys spent 20 hours 50 min analyzing the very long RM chain. It processed approximately 302 blocks/sec - a hundred times slower than tabanalys and ixanalys. But it did not report a single word about the results of the chanalys.

Online dbanalys just reads this data instantly from the Object Block.

The workaround is to always run dbanalys online.

Posted by Richard Banville on 05-Jan-2015 16:49

It does seem a bit strange, but I believe it has to do with the validation check for a circular record block free chain (RMCHN).
 

1. Dbanalys does indeed perform chanalys when run offline.

2. It does NOT report the statistics, but it does the associated chain validation, ensuring that:

a. It's an RM block.

b. The block is marked as being on the RM chain.

c. The block on the chain belongs to this object (TII only).

d. The block is below the HWM.

e. The chain's next pointer is below the HWM.

f. The chain is not a circular list.

3. With the exception of "f", these are all very quick checks once the block is located. If, as you say, the average cost of locating blocks is 32,000 per second, then that should only take about 12 minutes to complete for your 22+ million blocks on the RM chain.

4. Therefore, I believe it has to do with validation check "f":

a. The circular list validation uses a very fast bitmap technique up until some specified # of blocks (not exactly sure what the size is, but I'll check the code).

b. It will use a much more expensive temp table index technique to store the dbkeys when encountering a very large storage area.

c. In this case, with 22+ million entries on the RM chain, it is performing 22+ million temp table index insertions (no temp table record insertions - but it is still very expensive).

So from this thread, I think there are 2 enhancement requests which should be added:

1. The ability to disable chain validation when performing a dbanalys operation off-line (maybe levels of validation where circular chain validation alone could be omitted).

2. Enhance the performance of the circular chain validation mechanism for large databases.

 
I will discuss this with Product Management, as well as make sure there is not something else weird going on in this area.

With all that said, I would suggest you compare the average record size of this table with the current toss limit. With such a long RM chain, it would seem to me that adjusting the toss limit would help your performance, not only for dbanalys but also for the performance of insert operations.

Posted by George Potemkin on 06-Jan-2015 06:24

Hello Richard,

Thanks for the detailed information.

> Enhance the performance of the circular chain validation mechanism for large databases.

I don't think our current situation is normal. Very long RM chains are like a tumor. They contradict the purpose of the RM chain - the free space in the blocks on the RM chain will not be re-used within a reasonable period of time due to the length of the chain. I'm not even sure whether a chain constantly consisting of thousands of blocks should be considered normal. Of course, the chains can grow significantly after clearing procedures run by the application, but then they should shrink again. An enhancement is not needed when the issue itself can be fixed (as in the current case).

> With such a long RM chain, it would seem to me that adjusting the toss limit would help your performance, not only for dbanalys but also for the performance of insert operations.

I would add that long RM chains affect even read operations. 12% of the data blocks being on the RM chain most likely means that these blocks are scattered through the whole area and the order of their dbkeys is not guaranteed. The indexes that should in theory have a good logical scatter factor (like, for example, indexes by SeqNum or by date/time of record creation) will in fact not be so good, because the new records will be scattered through the area in the same way as the RM chain.

> I would suggest you compare the average record size of this table with the current toss limit.

Done a year ago. The toss limit was increased from 300 to 700 bytes. It did help (see the data below). But the customer simply forgot to apply the change to the database in one particular branch. ;-)

> The ability to disable chain validation when performing a dbanalys operation off-line

In some cases we definitely don't need chanalys as part of dbanalys. For example, the dbanalys runs done before and after a D&L. Who would care about possible issues with the chains in the old db copy? And we have no reason to check the chains in its fresh copy. We run dbanalys for different purposes. Likewise, ixanalys does not do the same things as idxcheck/idxfix. For a chain validation we could run chanalys (as a separate option) or dbrpr.

PSTS also reminded me about the -scan option of dbanalys, introduced in V10.1B:

What is the dbanalys -scan parameter used for?

knowledgebase.progress.com/.../P125963

In other words, the chain validation creates some trouble for online dbanalys as well.

BTW, is the -scan option still valid in V10.2B and higher? Online dbanalys does not seem to scan the chains now, does it?

Another example of chanalys being a "troublemaker": since some old version, Progress stopped reporting warning # 3896, AFAIK because it used to cause a "panic" reaction among DBAs. ;-)

DBKEY <number> w/free space less than <CREATELIM> found on RM chain. (3896)

Disabling chain validation as the default behavior of dbanalys would be the simplest solution.

> there is not something else weird going on in this area.

In our case the chanalys is taking a long time, IMHO, due to two factors:

1) the table has an unusually long RM chain;

2) the data blocks on the RM chain are spread over the area in an almost random order.

Both factors are the result of a clearing procedure that the customer runs every day. The procedure chooses which data can be moved to an archive and then deletes the corresponding records. As far as I know, the selection algorithm is rather complicated - at least it takes into account more than just the date of record creation. That is why the locations of the selected records and the dbkeys of the blocks on the RM chain are random. Tabanalys and ixanalys read the blocks sequentially; that is why they can read 100 times more blocks per sec than chanalys. The low toss limit does not allow the RM chain to shrink.

It's Christmas time now in Russia, so I will also dream about some cosmetic enhancements to dbanalys. Could it report the toss/create limits as well as RPB and cluster size per area? I know how to get this information, but it would be nice to have it in the dbanalys output as well, because it's hard to remind the customers each time to collect all the information that might be needed to analyze an issue.

Some data to demonstrate the effect of the toss limit:

A year ago in branch #1:

Date: Tue Jul  9 12:34:36 2013
Blocksize: 4096
29635202 block(s) found in the RM chain of Table object 1
157965823 block(s) found in the area.
                               -Record Size (B)-   ---Fragments--- Scatter
Table        Records    Size   Min   Max  Mean        Count Factor  Factor
PUB.table 3241317246  541.3G    47 24189   179   3241384153    1.0     1.0

12/17/14        Activity: Space Allocation
04:00:32        12/14/14 03:47 to 12/17/14 04:00 (72 hrs 13 min)
                        Per Sec
Database extends           7.09
Take free block            2.05
Return free block          0.43
Alloc rm space           145.31
Alloc from rm            145.31
Alloc from free            0.00
Bytes allocated        26028.50
rm blocks examined       148.10
Remove from rm            25.64
Add to rm, front           0.06
Add to rm, back           20.35
Move rm front to back      0.37
Remove locked rm entry    49.51

On Aug 11, 2013 the toss limit was changed from 300 to 700 bytes.

The same db after the change:

Blocksize: 4096
355 block(s) found in the RM chain of Table object 1
400918526 block(s) found in the area.
                               -Record Size (B)-   ---Fragments--- Scatter
Table        Records    Size   Min   Max  Mean        Count Factor  Factor
PUB.table 8057647115    1.3T    47 25481   173   8057759806    2.1     1.0

12/10/14        Activity: Space Allocation
04:00:33        11/27/14 02:21 to 12/10/14 04:00 (313 hrs 39 min)
                        Per Sec
Database extends           8.38
Take free block            2.41
Return free block          0.37
Alloc rm space           173.83
Alloc from rm            173.83
Alloc from free            0.00
Bytes allocated        28794.81
rm blocks examined       178.38
Remove from rm            22.32
Add to rm, front           0.01
Add to rm, back           15.96
Move rm front to back      0.12
Remove locked rm entry    44.29


A similar db in branch #2 (where dbanalys was long-running):

Date: Wed Oct 15 13:28:00 2014
Blocksize: 8192
22675158 block(s) found in the RM chain of Table object 1
194006014 block(s) found in the area.

                               -Record Size (B)-   ---Fragments--- Scatter
Table        Records    Size   Min   Max  Mean        Count Factor  Factor
PUB.table 8644278754    1.3T    44 31966   170   8644502338  159.0     1.0

11/05/13        Activity: Space Allocation
04:00:30        10/27/13 03:49 to 11/05/13 04:00 (216 hrs 11 min)
                        Per Sec
Database extends           4.16
Take free block            1.07
Return free block          0.13
Alloc rm space           167.59
Alloc from rm            167.59
Alloc from free            0.00
Bytes allocated        27126.44
rm blocks examined       265.52
Remove from rm             9.13
Add to rm, front           0.01
Add to rm, back           11.69
Move rm front to back     18.93
Remove locked rm entry    23.44

Taking into account the number of "rm blocks examined" per sec, the chain should shrink in approximately 20 hours after the change of the toss limit.
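For reference, the arithmetic behind that rough estimate, using the branch #2 numbers above (blocks on the chain divided by the "rm blocks examined" rate):

echo "22675158 265.52" | awk '{printf "%.1f hours\n", $1 / $2 / 3600}'
# => roughly a day at the observed rate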

Posted by Richard Banville on 06-Jan-2015 09:09

As for your suggestions on the enhancements, I agree. However, we usually do not change the default behavior of existing functionality in maintenance releases (whether that is the correct thing to do or not).

> What is the dbanalys -scan parameter used for?

I ran into the undocumented -scan option of dbanalys yesterday when looking over the code. There has been a bug filed to have the parameter formally added to the OpenEdge documentation.
 
The kbase is a bit incorrect.

1. The parameter is still valid.

2. However, it only applies to specifically running "chanalys" online. It does not apply when running "dbanalys" online.

3. What it does is avoid upgrading IS table locks to SHARE table locks during the "chanalys" for each area.

 
> Could it report the toss/create limits as well as RPB and cluster size per area?
 
I will have Product Management add an entry for this. It is certainly a reasonable request and not difficult to add. It will affect the -csoutput though.
 
> It's Christmas time now in Russia.
 
Enjoy!
 
 

This thread is closed