Latest Blog: Dump and Load Strategies Part I: Type II Storage

Posted by Paul Koufalis on 30-Mar-2015 11:09

Hi everyone.  Please enjoy my latest DBA blog post:

 http://wss.com/en/resources/blog

All Replies

Posted by Tom Kincaid on 30-Mar-2015 14:30

Hi Paul,

Thanks for sharing this great blog. I am a bit new to this forum. I intend to read some of your older posts as well.

Posted by George Potemkin on 30-Mar-2015 15:55

Paul,

> What about BPC? Also easy: use 512 or 8.

"Cluster Size 64 for Index Areas"

community.progress.com/.../14487.aspx

Any opinions on whether all indexes of a table should be stored in one index area, or each index in its own area? And what about the scattering of index blocks within an area?

Posted by Rob Fitzpatrick on 30-Mar-2015 16:55

Hi Paul,

On the subject of "controversial" approaches to DB & storage area design, how do you feel about:

  • always use 8 KB DB blocks, even on an OS with a 4 KB page size?
  • always use 128 RPB for data areas?  Or even 256?

Posted by Paul Koufalis on 31-Mar-2015 08:46

@George: This is covered by 512; 8 is for the tiny tables with almost no activity. This leads to a lot of empty space in the DB but again - it keeps life simple and uncomplicated for most people.

> Any opinions on whether all indexes of a table should be stored in one index area, or each index in its own area?

Since the blog post is intended for a general audience, I would answer: each data area has a corresponding index area. I.e. all tables in area "X" have indexes in area "Y". Some people (like you!) will support environments where further segregating indexes is a smart idea.

> What about the scattering of index blocks within an area?

It does not stress me in 99% of environments. For example, I may have a 500 MB extent that contains a single 200 MB index scattered across one hundred 2 MB clusters. But imagine the stripe size of the RAID 10 is 1 MB: there is uncontrolled physical scatter on the disks anyway. And if my environment is tuned properly, hopefully all the blocks I need are already in -B.

Posted by Paul Koufalis on 31-Mar-2015 08:55

@Rob: 8 KB DB block size: honestly, I don't worry much about the mythical torn page. On AIX we deploy 8 KB even if the FS uses 4 KB blocks. I have not done a recent benchmark on Linux or Windows, so until someone proves otherwise, I'll stick to 4 KB on Linux/Windows and 8 KB on UNIX.

Regarding always using 128 or 256 RPB: in a 4 KB block you're saying that my average record size is about 15 bytes (4096 minus the block header, divided by 256). That's a pretty small record, so I really don't see the usefulness of 256. But if I went to a site where someone had set all the RPBs to 128, I wouldn't freak out and make them change it. It might just look a little silly if an area where the average record size is 70 bytes (~55 records per 4 KB block) "wastes" about 75 ROWIDs.

There is also the case of traversing an area by ROWID. It would be a lot slower if half the ROWIDs pointed nowhere.
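The arithmetic behind those numbers is easy to sketch. A rough illustration; the block header size varies by Progress version and block type, so the 250-byte overhead below is just an assumption:

```python
# Rough records-per-block arithmetic for Progress storage areas.
# BLOCK_OVERHEAD is an assumed placeholder; the real header size
# varies by version and block type.
BLOCK_OVERHEAD = 250  # bytes, assumed

def implied_avg_record_size(block_size, rpb):
    """Largest average record size at which every ROWID slot can be used."""
    return (block_size - BLOCK_OVERHEAD) / rpb

def usable_rowids(block_size, rpb, avg_record_size):
    """ROWIDs actually usable in a block for records of a given average size."""
    fit = (block_size - BLOCK_OVERHEAD) // avg_record_size
    return min(rpb, int(fit))

print(implied_avg_record_size(4096, 256))  # ~15 bytes
print(usable_rowids(4096, 128, 70))        # → 54, so ~74 of the 128 ROWIDs go unused
```

The second call is the 70-byte example above: only ~55 records fit, so the remaining ROWIDs in the block are never used.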

Posted by ChUIMonster on 31-Mar-2015 09:33

On 3/31/15 9:56 AM, Paul Koufalis wrote:
Regarding always using 128 or 256 RPB: In a 4 Kb block you're saying that my average record size is about 15 bytes  (4096 minus block header divided by 256). That's a pretty small record so really I don't see the usefulness of 256.

I see tables like that in the real world.  Although they are much more likely when the block size is 8k (which would be a record size of 25 bytes or so...).    Usually it is some sort of "link" or "xref" table that only has a couple of fields in it.  Some of those can be quite large.  Others are very active even though they are quite small.

There is also the case of traversing an area by ROWID. It would be a lot slower if half the ROWIDs pointed nowhere.


This is why setting rows per block to something in the neighborhood of the size suggested by the average record size is very helpful.

Of course traversing a table by RECID isn't something that people do all that often.  But when you have the urge it is quite helpful to not be wasting time with wasted RECIDs.

On the other hand if you are going to go with "one size fits all" I see no reason to pick 128 over 256.


Posted by Rob Fitzpatrick on 31-Mar-2015 10:10

With 64-bit rowids, I'm not concerned about "wasting" them.  I'm confident in saying I'll never have storage areas large enough that conserving them will be a concern like it may have been when rowids were 32 bits long.  And I don't (currently) have any use case for traversing an area by rowid.

Much of the talk I hear about 256 RPB being unusable focuses on the worst case, e.g.: if you have a full row directory and 256 records of x average size, how small must x be?  I'm not concerned with whether I can fill a block to its theoretical limit.  What is more of a real-world concern is: given my table's average record size, could I fit at least 129 records in an 8 KB block?  Or 130?  Or 135?  If so then I'm artificially limiting my caching efficiency by using RPB 128, am I not?

Why distinguish by OS when choosing DB block size if you aren't concerned about matching it to the OS or file system page size?  The benefit of choosing 8 KB over 4 KB is that the block overhead is mostly fixed-size and so is proportionately less in an 8 KB block than in 4 KB.  So, given a large number of small records, a given amount of logical I/O can be satisfied with fewer physical I/Os in an 8 KB database than in a 4 KB database, all other things being equal.  Right?  I assume that's your justification for favouring 8 KB blocks on AIX.  Is there some aspect of a Linux or Windows machine that would change that reasoning?  I can't think of one.  The block structure remains the same on those OSes, so the math that underpins your decision remains the same.
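Rob's overhead argument can be made concrete with a little arithmetic. A rough sketch; the fixed overhead value is an assumption, not the real header size:

```python
# Fixed block overhead is proportionately smaller in a larger block,
# so more record payload fits per physical I/O.
# OVERHEAD is an assumed placeholder value.
OVERHEAD = 250  # bytes, assumed

def records_per_block(block_size, avg_record_size, rpb_limit):
    """Records that fit in one block, capped by the area's RPB setting."""
    return min(rpb_limit, (block_size - OVERHEAD) // avg_record_size)

r4 = records_per_block(4096, 60, 256)  # → 64
r8 = records_per_block(8192, 60, 256)  # → 132
print(r4, r8)
```

For these hypothetical 60-byte records, the 8 KB block holds 132 records - more than 128 - so an area set to RPB 128 would indeed be artificially capping its caching efficiency, which is exactly the 129-records-per-block question above.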

Posted by George Potemkin on 31-Mar-2015 10:57

> Some people (like you!) will support environments where further segregating indexes is a smart idea.

I don't have an answer to my question. To be honest, I started thinking about it only after I had asked the question.

When we use 'FOR EACH' query Progress reads the index blocks in the following order:
root -> intermediate1 -> intermediate2 -> leaf1 -> leaf2 -> leaf3 etc

If the index was created using the idxbuild utility then we would read the leaf blocks in sequential order, at least by their dbkeys. But if the index keys were created during normal transaction activity, then existing keys can be moved at any time to new index blocks in the middle of nowhere. Free chains make the new location of index keys even more random. Hence our query will do random reads of index blocks across the whole index area. My first thought was that this is bad for performance. On the other hand, an index block can store thousands of index keys. Based on these keys our query will read many record blocks. I guess even in the best case, when we read the record blocks in sequential order, these reads will still consume time that is much higher than, or at least comparable with, the time needed to jump to the next index block.

Progress tools don't provide information on whether the index keys are in sync with the dbkeys of the leaf blocks (except maybe the undocumented idxblockreport). I would call it a scatter factor of index blocks. Most likely the Progress developers are right and such information would just be useless. Unfortunately I don't have any data that would confirm or dispel the doubts.
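For what it's worth, the hypothetical "scatter factor" could be computed from a leaf-level walk if the dbkeys were available. A sketch, assuming we could somehow extract the leaf dbkeys in key order (no Progress tool actually exposes this):

```python
def scatter_factor(leaf_dbkeys):
    """Fraction of adjacent leaf blocks (visited in key order) whose dbkeys
    are not ascending, i.e. would force a backward/random seek.
    0.0 = perfectly sequential (freshly idxbuilt); values near 1.0 = scattered.
    The input is a hypothetical list of leaf dbkeys in key order."""
    if len(leaf_dbkeys) < 2:
        return 0.0
    jumps = sum(1 for a, b in zip(leaf_dbkeys, leaf_dbkeys[1:]) if b < a)
    return jumps / (len(leaf_dbkeys) - 1)

print(scatter_factor([64, 128, 192, 256]))  # → 0.0 (sequential)
print(scatter_factor([64, 512, 128, 896]))  # one out-of-order pair in three
```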

Regards,
George

Posted by George Potemkin on 31-Mar-2015 11:29

Regarding using 128 or 256 RPB: the wrong choice of RPB can have the same negative effects as the wrong choice of the toss/create limits. I have not seen DBAs change these values, or even ask what the proper values are. The default values can be wrong and may have a strong negative impact, especially on /read/ performance.

Often the RPB value is set too high even for the template records. Let's take an 8K db blocksize and RPB 128. The average record size would need to be less than 64 bytes (= 8K / 128). Example: a real production database has 897 tables, and the sizes of their template records vary from 16 to 425 bytes. 190 tables have template records larger than 64 bytes; 70 tables have them larger than 100 bytes. A rough rule: the mean size of real records is approximately twice the size of the template record. So an empty database is enough to get an estimate of record sizes, and we can use that estimate to choose the right value for RPB.
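That rule of thumb can be turned into a quick estimator. A sketch, assuming the 2x heuristic above and rounding down to the power-of-two RPB values Progress allows:

```python
# Estimate a sensible RPB from a table's template record size,
# using the heuristic that real mean record size ≈ 2x the template size.
ALLOWED_RPB = [1, 2, 4, 8, 16, 32, 64, 128, 256]

def suggest_rpb(block_size, template_record_size):
    est_record_size = 2 * template_record_size        # the 2x heuristic
    fit = max(1, block_size // est_record_size)       # ignores block overhead
    # pick the largest allowed RPB that does not exceed the fit
    return max(r for r in ALLOWED_RPB if r <= fit)

print(suggest_rpb(8192, 64))  # → 64  (estimated 128-byte records)
print(suggest_rpb(8192, 20))  # → 128 (estimated 40-byte records)
```

With a 64-byte template record the estimator suggests RPB 64, not the default-ish 128 - matching George's point that 128 is often too high.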

Posted by Paul Koufalis on 31-Mar-2015 11:32

> What is more of a real-world concern is: given my table's average record size, could I fit at least 129 records in an 8 KB block?  Or 130?  Or 135?  If so then I'm artificially limiting my caching efficiency by using RPB 128, am I not?

Correct. Which is why we always try to pick an RPB that will not waste space. There is at least one case where you don't want RPB too high: when the records start small and then grow over time. If you arbitrarily set RPB to 256 then you might squeeze in 150 not-fully-grown records into a block. As they grow, they will get fragmented. You may be inclined to play with the CREATE and TOSS limits OR you can just make sure that the RPB matches the eventual size of the records.

> Why distinguish by OS when choosing DB block size if you aren't concerned about matching it to the OS or file system page size? ...Is there some aspect of a Linux or Windows machine that would change that reasoning?

My answer is "history". I have not personally done, nor seen, any benchmarks or proof that tell me I should use 8Kb across the board. You *probably* are not wrong, but until someone takes the time to quantify it, I prefer to err on the side of caution. Also, for a very large percentage of the people reading this it probably does not make a noticeable difference either way.

Posted by Paul Koufalis on 31-Mar-2015 11:38

> Most likely Progress developers are right and such information would be just useless.

Are you calling developers useless? :-)

In a typical ERP environment, the working set of data and indexes should be mostly all in memory so I think the point becomes almost moot. And as you mention, you never know where the next order/invoice/shipper index key is going to be created so it becomes an exercise in futility. 

Posted by Paul Koufalis on 31-Mar-2015 11:47

> A rough rule: the mean size of real records is approximately twice the size of the template record.

Interesting. I never even thought to check this. Most likely because I do not get such intimate contact with my customers' environments. When we come in for a D&L, we see the effect of fragmentation on the mature records and adjust accordingly.

However, unless the customer is doing a VERY large amount of I/O on their database, I still think that the value of a simple database structure more than offsets some microscopic, imperceptible loss of performance. I am a huge proponent of simple solutions and I only add complexity if there is a measurable ROI to the customer's business.

Posted by George Potemkin on 31-Mar-2015 11:47

The (un-)importance of an index block scatter factor can be tested easily: we just need to compare the time of a FOR EACH query after idxcompact vs. after idxbuild. Both utilities should use the same pfactor, so the only difference between the two databases will be the physical location of the index blocks.

Posted by George Potemkin on 31-Mar-2015 11:53

> I only add complexity if there is a measurable ROI to the customer's business.

Visit Dmitri Levin's "DBAnalys To The Rescue!" presentation at PUG Challenge Americas ;-)

Posted by Paul Koufalis on 31-Mar-2015 11:54

Remember Tom Bascom's benchmark on SAN vs. FS vs. DB caching. I am trying to imagine how (un-)important your test would be depending on where the index blocks were coming from.

George: you always make us think. And here I have skipped lunch and not done my real work because this discussion is too interesting.

Posted by S33 on 31-Mar-2015 12:45

Thanks for the very informative blogs.

Posted by George Potemkin on 31-Mar-2015 13:41

> Remember Tom Bascom's benchmark on SAN vs. FS vs. DB caching.

I agree, caching is a great magician/pretender, but sometimes it can't help. For example, for the updates on the target db during AI/OE replication. We do not care about the delays of data replication provided they are not growing with time. But if we have scattered updates in a very large database then we could have a problem with replication. Scattered index blocks = scattered updates, even if the records are created only near the HWM of the table's area.

> Since the blog post is intended to a general audience

I know that I'm guilty of looking for exceptions, but "life is like a box of chocolates. You never know what you're gonna get."

;-)

Posted by Paul Koufalis on 31-Mar-2015 18:50

The lurkers can chime in with their opinion, but I think these esoteric, unusual exceptions are what make some of these discussions so entertaining and educational. We state the base case (98%) and then nit-pick over those last 2%. We sometimes find that the 2% teaches us something about the 98% and we adjust accordingly.

I still point to Rich Banville's discussion about variable length AI extents versus buffers flushed at checkpoint and the effect on forward processing. I have been a proponent of variable length AI extents for years, having benchmarked them multiple times on different hardware and not seen any significant *human*-detectable effect (YMMV hugely based on your load and your hardware). And yet we've all been taught for years and years that buffers flushed at ckpt are the spawn of the devil. The two positions are not reconcilable since OE10.

Posted by gus on 03-Jun-2015 16:17

you may get scattering of index blocks by putting multiple indexes in the same area. but if you put them in different areas, you may still get the same scattering, only in the filesystem.

This thread is closed