OE10.2B08 idxbuild all - fastest options

Posted by James Palmer on 18-Jul-2017 04:41

Age-old question! I need to rebuild the indexes on a DB as fast as possible.

Got 10GB memory

500GB free on DB volume 

250GB free on another volume

What are the best options to use? Need any more info?

Windows by the way. 

All Replies

Posted by James Palmer on 18-Jul-2017 04:43

4 cores on the processor by the way!

Posted by James Palmer on 18-Jul-2017 04:43

165GB DB.

Posted by ManfredBerthold on 18-Jul-2017 04:48

proutil <DB> -C idxbuild ALL -TB 64 -TM 32 -TF 90 -SG 64 -T K:\Temp

More Memory - faster idxbuild
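
For anyone finding this later, roughly what those switches do (my reading of the proutil docs; double-check the limits for your release):

-TB 64: sort block size in KB

-TM 32: merge number, i.e. how many sort blocks are merged per pass

-TF 90: let proutil use up to 90% of memory for sort/merge (10.2B06+)

-SG 64: number of sort groups

-T K:\Temp: directory for the temporary sort files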

Posted by George Potemkin on 18-Jul-2017 05:06

Truncate all index areas before running idxbuild.
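
For example, with the database offline and the indexes in their own areas, something like this per index area ("Index_Area" is just an illustrative name):

proutil mydb -C truncate area "Index_Area"

That way idxbuild starts from empty index areas instead of wading through the old index blocks.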

Posted by James Palmer on 18-Jul-2017 05:10

Unfortunately we've got data and indexes mixed in the same areas. Does that make a difference to this recommendation, George Potemkin?

Posted by George Potemkin on 18-Jul-2017 05:24

Of course, truncating can't be used for mixed areas. It's sad that we can't move the inactive indexes between the areas. Otherwise you could easily separate the data from the indexes before running idxbuild.

Posted by Paul Koufalis on 18-Jul-2017 06:52

Assuming that you're on 10.2B06 or later, look at -mergethreads and -datascanthreads. You can also set -threadnum <x>, but be careful, as -mergethreads applies per -threadnum thread. IIRC -threadnum defaults to the number of CPUs/cores, but I always set it explicitly.

There is also a newer -TMB parameter, which I typically set to 256. And a large -B doesn't help you: leave it at 256 or 512. Same for -i: no measurable benefit per my testing.

-TF was the biggest addition to idxbuild, as it can move all the I/O-intensive sorting and merging into memory. What we used to fake with RAM drives is now built in.
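
To put numbers on that threads warning: with -threadnum 4 and -mergethreads 2 you can end up with as many as 4 x 2 = 8 merge threads in the sort/merge phase, so the totals climb quickly.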

There was some weird issue with -TF on HP-UX where it couldn't or wouldn't allocate more than one shared memory segment, so even if you had 128 GB RAM on the server, proutil only used 4 GB (or whatever shmmax was). I reported it as a bug, so it might be fixed in later versions.

There was also an issue when you ran multiple concurrent idxbuilds, all with -TF 90: they would bump into each other, suck up all the RAM, and eventually one or more would crash. But again, that was way back in 10.2B0x, so it might be resolved in later 11.x versions.

Posted by James Palmer on 18-Jul-2017 07:00

Thanks Paul. Appreciated as always. This is 10.2B08. Currently running a test with the command Manfred posted above. -threadnum has definitely defaulted to the number of cores.

It's currently about 10% of the way through phase 1 - 17.5m blocks. Could be a long old night ahead. :(

Posted by Paul Koufalis on 18-Jul-2017 07:03

Kill it and add -datascanthreads 4 -mergethreads 2 -TMB 256 -B 512.

Also add "-z -rusage" for some extra geeky stats.
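
Assembled into one line, that's something like this (DB name and temp path are placeholders, building on Manfred's switches above):

proutil <DB> -C idxbuild ALL -TB 64 -TM 32 -TF 90 -SG 64 -T K:\Temp -datascanthreads 4 -mergethreads 2 -TMB 256 -B 512 -z -rusage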

Posted by Rob Fitzpatrick on 18-Jul-2017 10:16

The -rusage param isn't just geeky; it's important for determining whether you're tuning correctly. It will cause (11480) and (11483) messages to be written to stdout and the db log file for each area processed, telling you how much I/O was done to temp files (-T or -SS). Example:

DBUTIL : (11480) Temporary sort file at /u/db/11.5/pac used up 0K of disk space.

DBUTIL : (11483) A total of 0K of temporary sort disk space was used for area 6.

If these numbers are larger than 0 KB then you are doing extra disk I/O: write I/O during the data scan/key build phase, read/write during the sort/merge phase, and read I/O during the b-tree insertion phase.

That's a very small amount of RAM, especially for a DB this size.  Is this a VM?  Can you bump it up temporarily?

Posted by James Palmer on 18-Jul-2017 10:23

Sorry, the 10GB I mentioned was free memory, not total. They might be able to beef it up for the live run. We're testing on the replica target at the moment to get an idea of the timings involved.

Posted by Rob Fitzpatrick on 18-Jul-2017 13:16

Sorry Paul and James, I was wrong about -rusage :(.  You get the 11480 and 11483 messages by default, at least back to 10.2B08.  The -rusage param adds lines like this:

[2017/07/18@14:10:14.546-0400] P-75487      T-139980717442816 I DBUTIL   : (-----) Resource usage: CPU user 0.002000, system 0.001000

Resource usage: DISK reads: 80 KB at 39 KB/sec, writes: 0 KB at 0 KB/sec

[2017/07/18@14:10:14.546-0400] P-75487      T-139980717442816 I DBUTIL   : (-----) Resource usage: CPU user 0.003000, system 0.002999

Resource usage: DISK reads: 1640 KB at 818 KB/sec, writes: 456 KB at 227 KB/sec

Not sure where I got the idea that it was tied to 11480/11483, but (not surprisingly) Paul is correct: -rusage provides info that is interesting but not critical.

That said, you should also pay attention to your 11480/11483 messages; ideally you want to see 0K of temp file I/O per area.
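
On Windows you can pull those out of the log with something like this (the log file name is a placeholder):

findstr /C:"(11480)" /C:"(11483)" mydb.lg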

Posted by James Palmer on 19-Jul-2017 01:48

[2017/07/19@03:49:18.139+0100] P-4460       T-4928  I          : (11480) Temporary sort file at f:\sort\ used up 259868928K of disk space.

[2017/07/19@03:49:18.141+0100] P-4460       T-4928  I          : (11483) A total of 259868928K of temporary sort disk space was used for area 6.

That's a lot of disk I/O. So will adding more RAM help with that? Not that we'll be able to add 260GB more! I presume putting the sort files on as fast a disk as possible will help?

Full idxbuild took 14.5 hours! :(

Posted by James Palmer on 19-Jul-2017 02:42

Bit of a side question, but we've run idxfix option 3 on some of the tables we've got index corruption on, and it makes the problem go away, but after a while it comes back again. It's always the same tables, though with different RECIDs. Is it possible that the customer's ODBC connections are causing this? Is there anything else I can do to debug where this is coming from? We haven't changed our application code in a few weeks, so it's not the code. Is it possible that idxfix isn't solving a problem that idxbuild will?

Posted by e.schutten on 19-Jul-2017 03:31

Hi James. We had the same problem over here. It was a code page issue: a UTF-8 session was adding Chinese characters to a database with a Polish code page.

Posted by James Palmer on 19-Jul-2017 03:34

The customer is receiving files from a German customer, but as far as I can see all connections are Codepage 1252. And the records we get the messages for do not have special characters in them.

Posted by George Potemkin on 19-Jul-2017 03:54

> we've got index corruption on and it makes the problem go away, but after a while it comes back again.

Does the customer use the latest hot fix for V10.2B08? If I recall correctly, it fixed some bugs related to index corruption.

> Is there anything else I can do to debug where this is coming from.

There is no easy answer. More details are required. What are the errors reported by idxfix? Which indexes were corrupted? Unique or non-unique? Were the records with the corrupted indexes created recently? What are the typical transactions that create/update these tables? One or many records per transaction?

Posted by James Palmer on 19-Jul-2017 03:59

No, the customer is not running any hot fixes. I will investigate.

The indexes are a mix of unique and non-unique. Yes, they are recent creations, because we very rarely access older records. The transactions will be multi-record ones. Our transaction scoping is really bad.

Posted by George Potemkin on 19-Jul-2017 04:02

It smells like a known bug. ;-)

Posted by James Palmer on 19-Jul-2017 04:22

Crap code? ;)

Posted by George Potemkin on 19-Jul-2017 05:25

Progress OpenEdge Release Notes

Release: 10.2B0836

Date: April 2015

2.  List of issues fixed in previous 10.2B08x releases

-----------------------------------------------------

Issue ID: PSC00328928

---------------------

Under certain cases involving multiple database updates, you may end up with logical index corruption if you undo certain sub-transactions.

The problem may occur on undo of certain sub-transactions when indexed fields of tables from multiple databases are updated in different nested sub-transactions.

This issue can be avoided by running with the -nosavepoints option.

The issue was fixed in 10.2B0832

Posted by James Palmer on 19-Jul-2017 05:37

Thanks George - that sounds very likely to be a cause.

Posted by Rob Fitzpatrick on 19-Jul-2017 07:11

> [2017/07/19@03:49:18.139+0100] P-4460       T-4928  I          : (11480) Temporary sort file at f:\sort\ used up 259868928K of disk space.

> [2017/07/19@03:49:18.141+0100] P-4460       T-4928  I          : (11483) A total of 259868928K of temporary sort disk space was used for area 6.

> That's a lot of disk io. So adding more RAM will help that? Not that we'll be able to add 260GB more! I presume putting the sort files on as fast a disk as possible will help?

> Full idxbuild took 14.5 hours! :(

Having large tables (or very many smaller ones) together in a single area will increase the required RAM to prevent temp file disk I/O.  Having a well-designed structure will reduce the required RAM.

Are all of the application tables in the schema area?
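
For comparison, a well-designed structure keeps data and indexes in separate Type II areas in the .st file, so the index areas can be truncated before an idxbuild. A sketch (area names, numbers, records-per-block and cluster sizes are illustrative only):

d "Data_Area":10,64;512 f:\db\mydb_10.d1 f 1024000

d "Data_Area":10,64;512 f:\db\mydb_10.d2

d "Index_Area":11,32;8 f:\db\mydb_11.d1 f 512000

d "Index_Area":11,32;8 f:\db\mydb_11.d2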

Posted by James Palmer on 19-Jul-2017 07:52

Yup, although I've just been asked to do a full D&L to Type II areas as part of the solution.

Posted by Valeriy Bashkatov on 14-Nov-2018 12:44

Hi Paul,

> Kill it and add -datascanthreads 4 -mergethreads 2

How do you calculate the values of these two parameters? What do they depend on?

Regards,
Valeriy

Posted by Paul Koufalis on 14-Nov-2018 13:17

Valeriy: trial and error. -datascanthreads depends on how much I/O you can push through your disks. Maybe that's a good topic for next year's PUG Challenge: Benchmarking Index Rebuild under Various Hardware Environments.

Posted by James Palmer on 14-Nov-2018 13:30

Paul, I'd certainly be in support of such a venture!

Posted by Valeriy Bashkatov on 14-Nov-2018 13:38

Thank you, Paul!

I don't remember which presentation I watched, but it seemed to indicate that they were calculated as -datascanthreads = 1.5 x #CPUs and -mergethreads = 1.5 x #CPUs.

It turns out that was the wrong calculation, right?

Posted by Richard Banville on 14-Nov-2018 14:19

For data scan threads it is really trial and error as Paul already mentioned.  Balancing CPU usage and I/O throughput is the key to getting this optimal.  The idea of 1.5 threads * CPUs is to eliminate any wasted CPU during this part of the index rebuild.  With today's machines having so many CPUs, this number can become ridiculous since the file system will become the bottleneck.  Having too many threads that are not improving I/O rates will introduce contention/concurrency issues and may actually decrease performance.

So you need enough datascan threads to use all the available CPU resources up to the point where I/O rates no longer improve.    

I know everyone wants a simple formula to just apply, but there really isn't one that fits every deployment. The experience of others who have run this successfully in deployments similar to yours is the best resource for specific tuning suggestions.
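
If anyone wants to start on that benchmarking, a crude trial-and-error pass on a throwaway copy might look like this in PowerShell (DB name is a placeholder; flag values are just the ones mentioned earlier in this thread):

Measure-Command { proutil testdb -C idxbuild all -TF 90 -datascanthreads 2 -mergethreads 2 -TMB 256 }

Restore the database, double -datascanthreads, re-run, and keep going until the elapsed time stops improving.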

Posted by Valeriy Bashkatov on 14-Nov-2018 14:32

Thank you Richard!

Posted by Paul Koufalis on 14-Nov-2018 15:22

I smell a PUG Challenge submission brewing...

Posted by James Palmer on 14-Nov-2018 15:36

Fancy some sort of collaboration? I could do before and you could do after the Linux patch... ;)

Posted by Mike Fechner on 14-Nov-2018 15:40

Linux is never a patch.

Posted by Paul Koufalis on 14-Nov-2018 15:43

Mike, you are correct, it is not a patch. It's a cure.

:-)

Posted by Peter Judge on 14-Nov-2018 15:50

You can even get Linux on the desktop now …
 
