Migrating existing data into multiple range partitions

Posted by shaske1 on 25-Mar-2014 10:37

Hey guys...

I've been tinkering on and off with partitioning on the Linux environment in the [very little] spare time I've had over the past month or so, and I must admit I have been rather frustrated.

The sort of tables we would be likely to partition are existing tables with hundreds of millions of records (>10G), so what we really need to do is migrate the data in place. And in at least some cases we will want to use unique integer key value ranges.

In my particular attempt therefore, I had generated an extra one million customer records, then attempted to create three partitions of range 500,000. [The sports.customer table is not precisely analogous to our data but is a useful simulation.]

With the given examples being almost entirely driven by metaschema, I would characterize the effort as “struggling through virtually every step”. At the furthest point I reached, I received this error:

“Cannot add range value that splits an existing range, partition [partition-name]”

This, along with the context of the given examples, suggests to me that moving existing data may not be supported – that the current design assumes a necessary dump and load of the data. [If so, the functionality will likely be practically unusable for us, since we are primarily dealing with legacy applications & data.]

Is this the case – that existing range data can’t be subdivided – or am I misinterpreting my current results?

Admittedly, the lack of any discernible Unix-based data dictionary tool makes it difficult to know if I’m using the metaschema correctly… but since those are the examples provided, I assume that’s the only method available short of trying the Windows environment.

At the moment I feel like the Linux examples aren’t really in a state that an end user can reasonably evaluate.

 

 


All Replies

Posted by Richard Banville on 25-Mar-2014 10:59

To try to help you out with such a migration, I’d like to know a little more about your partitioning scheme.
 
The integer range values you mention for the partitioning – are they associated with some time-based criteria, or something closer to a customer number?
 
Are these values increasing, or at least for the most part increasing?
 
A composite partition (which is where all your current non-partitioned data exists) can only have one all-inclusive range associated with it.
Remember that range partitions are defined so that all values <= the partition’s defined value fall into that range partition, down to the value of the next lower range partition.
The intent here is that data in a “new” range can go to a new partition.
Data in the existing composite partition can be split out to a new partition over time.
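As a quick illustration of that “<=” rule, here is a standalone ABL sketch using the three 500,000 ranges from your experiment (this is just arithmetic to show which range a key value would land in, not anything the engine itself runs):

/* Illustration only: which range partition a key value falls into,
   given upper bounds of 500,000 / 1,000,000 / 1,500,000. */
DEFINE VARIABLE cBounds AS CHARACTER NO-UNDO INITIAL "500000,1000000,1500000".

FUNCTION partitionFor RETURNS INTEGER (INPUT pVal AS INTEGER):
    DEFINE VARIABLE i AS INTEGER NO-UNDO.
    DO i = 1 TO NUM-ENTRIES(cBounds):
        /* a value belongs to the lowest partition whose bound is >= the value */
        IF pVal <= INTEGER(ENTRY(i, cBounds)) THEN RETURN i.
    END.
    RETURN ?.   /* above the highest defined range - no partition covers it */
END FUNCTION.

DISPLAY partitionFor(1)          /* 1 */
        partitionFor(500000)     /* 1 - the boundary value itself is included */
        partitionFor(500001)     /* 2 */
        partitionFor(1200000)    /* 3 */.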
 
We did try to include a range-type migration in the ESAP, but I guess it does not match your particular partitioning scheme.
 
 

Posted by Fernando Souza on 25-Mar-2014 10:59

Hi
 
First, thanks for the feedback.
 
We will support migrating the data in place, so what you are trying to do will be supported and allowed. But there are rules for how and when partitions can be created.
 
During the migration itself, when you are first making the table partitioned, you will need to make sure to define all partitions in the same transaction that makes the table partitioned, and the partitions need to cover all data that already exists in the table prior to the migration. Then after the migration, you can create additional partitions to split the data out, but they have to be set as split targets. The error you are getting indicates that the partition is being created after the migration took place and the partition was not set as a split target. One of the scripts that talks about splitting data does have the code for setting the partition as a split target.
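To make that first rule a little more concrete, here is a very rough ABL sketch of the shape of it. The _Partition-Policy / _Partition-Policy-Detail table and field names below are placeholders rather than the exact metaschema names – please take the real names and values from the ESAP scripts, not from this sketch:

/* PLACEHOLDER NAMES: the metaschema tables/fields below are illustrative only. */
DO TRANSACTION:
    /* The policy that makes the table partitioned ... */
    CREATE _Partition-Policy.
    ASSIGN _Partition-Policy._PolicyName = "CustNumRange"
           _Partition-Policy._TableName  = "Customer".

    /* ... and, in the SAME transaction, enough range partitions to cover every
       value already in the table (e.g. cap the existing data at 400,000). */
    CREATE _Partition-Policy-Detail.
    ASSIGN _Partition-Policy-Detail._PolicyName    = "CustNumRange"
           _Partition-Policy-Detail._PartitionName = "p_initial"
           _Partition-Policy-Detail._RangeValue    = "400000"
           _Partition-Policy-Detail._AreaName      = "CustData_1".
END.   /* committed together: the table is partitioned and all existing data is covered */

Partitions created after that transaction can only take on rows that already exist in the table if they are flagged as split targets, which is what the splitting script sets before running the split.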
 
Note that the examples/scripts that we provided in the current ESAP are meant to show the functionality with a very specific database/setup. I agree that it is not in a state where you could easily venture into trying different things with it without bumps in the road. We are working on releasing another drop of the ESAP for table partitioning, with more information, more tooling support, and in a better state for you to try the steps described. Stay tuned.



Posted by Martha Lynch on 25-Mar-2014 12:11

As Fernando said, the scripts in the first ESAP were intended to be followed precisely, and not intended to be tried on your own database.  We limited it in this way in order to get early functionality into your hands but 'protect' you : ) from areas that were not complete.  Were you able to run through the process in the script with the canned DBs?  Hopefully your experience with the next ESAP will give you a better feel for working with partitions on your own.

Posted by shaske1 on 25-Mar-2014 12:39

Hey Rich...

> The integer range values you mention for the partitioning, are they associated with some time based criteria or something closer to a customer number?

> Are the values of these increasing or at least for the most part increasing?

We are not entirely certain of all the possible scenarios we would entertain the solution for at this point, but there are two specific cases we have in mind that are integer-based.

In case one, the integers are sequence numbers -- as you say, constantly increasing over time.  So if record number 100000 happens on 01/01/2010 and 200000 happens on 01/01/2011, we might partition by year.  Theoretically this could be especially useful for a partial tablemove on a 40GB table -- where both time and BI growth would otherwise be prohibitive in a production environment.
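For example, a rough sketch of how we would derive each year's range boundary, using sports-style order-num / order-date names purely as stand-ins for our real fields:

/* Stand-in names: the largest sequence value created on or before the year end
   becomes that year's partition upper bound. */
DEFINE VARIABLE iBoundary AS INTEGER NO-UNDO.

FOR EACH order NO-LOCK WHERE order.order-date <= 12/31/2010:
    IF order.order-num > iBoundary THEN iBoundary = order.order-num.
END.

DISPLAY "Range upper bound for the 2010 partition:" iBoundary.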

In case two, it is a more complicated legacy issue which is effectively case one + a "range" where record number = 0.  Here, we have a table with a primary composite key which was reused inappropriately, resulting in different types of data.  So much of the new data has increasing integers but is intermixed with zero-keyed records.

We understand that it might not be optimal to code the general database engine to account for the second case because it is a poor design to begin with...  but of course those of us who are dealing with these legacy issues (which I imagine is a fair percentage of Progress administrators) would appreciate maximal flexibility where it does not cut into performance in the optimal cases.

Posted by shaske1 on 25-Mar-2014 12:58

Fernando / mlynch:

Thanks for the follow up.  I was hoping that the current examples were very preliminary, so that confirms what I was presuming was true.

Hopefully the next iteration will let us offer more useful feedback; I'll take a closer look at the split target logic in the meantime.

Posted by Richard Banville on 25-Mar-2014 14:32

The reason this grouping of ranges is important to managing your deployment is that it allows you to know what data you are performing a maintenance operation on.  If the data were simply hashed across partition range values and you later decided to index rebuild partition #1, you’d be hard pressed to actually know which data is affected from the application side of things, since there is no logical grouping.  Also, for archival purposes, if you know the data in a partition is over 10 years old and is OK to archive, it can be archived at the partition level as opposed to the record level.  These are just some of the advantages of implementing partitioning where the grouping of data is discernible.
 
Migrating your existing data to a partitioned environment can be achieved with your partitioning scheme.  I would suggest however that you think about further partitioning the existing data using sub-partitions to get an even more manageable/maintainable partitioning model.
 
As was said in other posts, not all the utility capabilities were available in the first ESAP, but we made it available anyway to start socializing the partitioning concepts, so that actually implementing different partitioning schemes will be easier moving forward.
 
So how do you migrate your existing data?
Your example is a cross between range partition migration examples #1 and #2.  I will explain this at a conceptual level.

1.  You will identify your existing table as being a partitioned table and create a partition definition with the one range partition field, using some logical grouping value at least >= the largest existing value in that field – say, for example, 400,000.  This will “cap” the existing partition where it is today, with all records whose partitioned field is <= 400,000 residing in the current location/partition.  (New records can still be added to this initial partition as long as the field value is in the range <= 400,000.)

2.  To add new data to this table that is > 400,000 for your partitioned field, create a new partition with a range value inclusive of your next logical range grouping – say 500,000, for example.  All new data (that is, data in the range of 400,001 through 500,000) will reside in this new partition when those records are added.  This partition can be created in the same Type II storage area as the original data or in a different one.

3.  So now you’ve got one 40GB partition (your original data) and one partition waiting to receive new data.  This is helpful for managing new data but is really not partitioning as you want it, since you still need to deal with that 40GB partition.

4.  If you want to further partition your existing 40GB of original data, you will use the “split” option of proutil to split logical groupings of data out of the big partition into new partitions.  This is accomplished by creating one or more new partitions with the “split target” attribute set on them and running the proutil split command as described in Migration example #2 of ESAP drop #1 (a rough sketch of this step follows below).  Yes, this part does indeed require data movement, but it can be done online without crazy growth of the bi file and without affecting the parts of the data that are not being split, so your application can run unaffected while this operation is executing (provided the application does not try to update the data being split, since it is locked until completion).
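Here is a rough sketch of the setup for step 4.  As elsewhere in this thread, the metaschema field names are placeholders – Migration example #2 in the ESAP has the real ones, and it also shows the proutil split invocation itself, which I won’t reproduce from memory here:

/* PLACEHOLDER NAMES: create the new partition flagged as a split target so the
   utility is allowed to move existing rows (<= 200,000 in this example) into it. */
DO TRANSACTION:
    CREATE _Partition-Policy-Detail.
    ASSIGN _Partition-Policy-Detail._PolicyName    = "CustNumRange"
           _Partition-Policy-Detail._PartitionName = "p_split_200k"
           _Partition-Policy-Detail._RangeValue    = "200000"
           _Partition-Policy-Detail._AreaName      = "CustData_2"
           _Partition-Policy-Detail._IsSplitTarget = TRUE.   /* placeholder attribute */
END.

Once that exists, the proutil split command from Migration example #2 moves the qualifying records out of the original partition online.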

 
BTW, with the next ESAP version you will not need to deal with writing any ABL code to create partition definitions.
 
Now of course this whole process could be done through a dump and load, but you indicated that was not the direction you wanted to go, so the above is our migration strategy for such an environment.
 

Posted by shaske1 on 25-Mar-2014 16:58

Cool -- this is primarily what we're expecting.  It's just that we're doing it [metaschema] wrong in our initial experiments.

One follow up...

> Yes, this part does indeed require data movement but can be done online without crazy growth to the bi file and without affecting the parts of the data that are not being split so your application can run unaffected while this operation is executing (provided the application does not try to update the data being split since it is locked until completion).

When you say "locked until completion", do you refer to all the data in the range of records being moved (e.g. tablemove, all or nothing), or will it be along the lines of an idxcompact where the reorganization scope is limited and safely interruptible?  

Clearly the latter seems preferable -- we are presuming/hoping that if the BI does not require crazy growth, the partition split can be partially completed over multiple maintenance windows.

Then we just have to deal with [slightly less crazy] AI growth/pica. (And we've heard replication improvements are coming as well...)

Posted by Alex Hart on 26-Mar-2014 06:27

I think we are all eagerly awaiting a drop that we can use more extensively with our own data to see how easy it is to migrate.  A dump and load would not necessarily be a bad idea, as you may find that once you partition data based on certain fields and then create global and partition indexes, it could be beneficial to do a dump and load of the data anyway.  But my understanding so far is that it should not be a requirement for using/activating partitioning.

Posted by Peter Mellow on 26-Mar-2014 07:27

You should be seeing something in the next couple of weeks! We are really looking forward to the feedback.
 
 
 

Posted by Richard Banville on 27-Mar-2014 09:12

[quote user="shaske1"]

When you say "locked until completion", do you refer to all the data in the range of records being moved (e.g. tablemove, all or nothing), or will it be along the lines of an idxcompact where the reorganization scope is limited and safely interruptible?  

Clearly the latter seems preferable -- we are presuming/hoping that if the BI does not require crazy growth, the partition split can be partially completed over multiple maintenance windows.

Then we just have to deal with [slightly less crazy] AI growth/pica. (And we've heard replication improvements are coming as well...)

[/quote]

Only the data in the range to be split out will be locked down.  Other data in that same partition which is not being split out will remain accessible when it is explicitly targeted by the restriction (WHERE clause) in the query.

The split command operates by moving only a user-specified number of records per transaction.  This way it remains "recoverable", bi space does not grow unbounded, and it works in replication environments.

When I say "recoverable" I mean that if interrupted, the command will re-start where it left off.  It does not mean that once the split command is cancelled the entire split operation can be undone.  It must be run to completion once started.

If the split command is interrupted for some reason, the data will remain locked (even if you restart the database!).  The split command when re-executed will pick up where it left off.
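If it helps to picture it, the behaviour is analogous to the usual ABL batching idiom sketched below.  This is a conceptual illustration only, not the utility's code – the "moved" marker is just a stand-in for a record having been physically relocated to the split-target partition:

/* Conceptual sketch only: bounded transactions keep bi growth limited, and
   because finished rows no longer match the WHERE clause, an interrupted run
   simply resumes where it left off. */
DEFINE VARIABLE iPerTxn AS INTEGER NO-UNDO INITIAL 100.
DEFINE VARIABLE iCount  AS INTEGER NO-UNDO.
DEFINE BUFFER bCust FOR customer.

REPEAT:
    iCount = 0.
    DO TRANSACTION:
        FOR EACH bCust EXCLUSIVE-LOCK
            WHERE bCust.cust-num <= 200000           /* the range being split out        */
              AND bCust.comments <> "moved":         /* stand-in "already relocated" flag */
            ASSIGN bCust.comments = "moved".         /* stand-in for the physical move    */
            iCount = iCount + 1.
            IF iCount >= iPerTxn THEN LEAVE.
        END.
    END.   /* one batch of at most iPerTxn records committed */
    IF iCount < iPerTxn THEN LEAVE.   /* nothing left in the range - done */
END.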

This thread is closed