I'm upgrading from Openedge 10 to 11 and doing a dump and load to a new version 11 db as I need to convert to Type II storage area and add additional tables/fields/indexes in the process. I also need to convert the DB from iso8859 to utf-8 so I assume that I can't do a binary dump?
I started doing the dump in data administration with code map to UTF-8 and then loading it straight into the empty UTF-8 db with the new data structures, it is very slow which is expected (the database is about 8GB) but I'm also getting the error SYSTEM ERROR: Input data too large, try to increase -s and/or -stsh. (63).
I tried increasing -s to 32000 (from 800) and -stsh to 31 and I'm still getting that error. Is there anything else that I'm missing? Would I be better off just doing a binary dump/load and converting to UTF-8 afterwards? There will be over 100 database of varying sizes from 1GB to 30GB to upgrade using the same process so this will need to be scripted eventually which I don't think can be done for code mapping dump via data administration?
Thanks in advance for your suggestions!
I don't have answers for everything you are seeing, but here are some thoughts.
From your description and the errors you are seeing, I am guessing that
a) the tables contain non-ASCII characters (so they are doubling in size in the new database)
b) the tables have word indexes - on quite large fields
In the new database, in case you haven't done so, I would deactivate all the indexes except the primary indexes in each table before loading the data. Rebuild all the indexes once the data is loaded.
Other considerations before you start loading the data:
1. Apply an ICU collation to the new databases, e.g., ICU-UCA. If you are upgrading to 11.6, go for the more recent version (e.g. ICU_48-UCA).
2. Ensure that the UTF-8 word-break table is applied to the new databases. If they are created from DLC/prolang/utf/empty, then that will be the case.
Thanks for your suggestions. We do use word indexes so I will try turning off the indexes before loading the data.
Dumb question, but what is ICU collation? I'm upgrading to 11.5, where can I check what the version is for that?
Aidan has already made good suggestions. If you deactivate indexes, it would probably solve problem with -stsh value.
Following link might be helpful in calculating appropriate stack size:
knowledgebase.progress.com/.../P79111
Between the two approaches mentioned, I think the better approach is to do a binary dump/load and converting to UTF-8 afterwards.
Collation defines the ordering relationships of the characters in a character set. A collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. It is basically a set of rules to compare characters in a character set. You can find lot of information about collations on internet. Some links:
en.wikipedia.org/.../Collation
programmers.stackexchange.com/.../what-is-the-difference-between-collation-and-character-set
I tried loading again with all the indexes except for the primary ones deactivated and got the same error. Although it progressed a bit further than previously.
The new db is created from DLC/prolang/utf/empty8 so it does have the word-break table applied. Although given the slowness of loading via data admin and the inability to script this (correct me if I'm wrong), would it be worthwhile to do a binary dump/load and then convert to UTF-8 afterwards? Although I think there has been problems in the past where certain characters are not properly converted to UTF-8...
> Although given the slowness of loading via data admin
How you start and connect your database?
> and the inability to script this (correct me if I'm wrong)
You can: run prodict/load_d.p(file-name, dot-d-dir)
> got the same error. Although it progressed a bit further than previously.
The same table but the different number of the loaded records?
ICU - International Components for Unicode.
Try: http://site.icu-project.org
The default collation for the utf/empty databases is Basic, which resorts to binary sorting of characters above the ASCII range. This is not ideal for queries where non-ASCII characters are involved, because they have to be resolved on the client.
A set of ICU collations is provided with OpenEdge under DLC/prolang/utf.
If you don't have a sorting requirement for a particular locale, ICU-UCA is the best choice.
Use Data Admin to load the collation definition (e.g. ICU-UCA.df). You have to rebuild the database indexes immediately after doing this.
> I think there has been problems in the past where certain characters are not properly converted to UTF-8.
This would be a concern. Is it possible that the database(s) contain characters that are not correctly encoded in iso8859-1? If the system has Windows clients, could some 1252 characters have got in there? It would be good to run a proutil convchar charscan, then deal with any problems that are reported.
As for the speed of loading the data, have you created adequately sized fixed extents for the larger tables before loading?
> How you start and connect your database?
I have tried connecting to the DB in single user mode as well as in multi-user mode with just specifying the -db -N -H -S -s -stsh parameters, both gave errors
> You can: run prodict/load_d.p(file-name, dot-d-dir)
I did find this but I don't think I can specify a code mapping UTF-8?
> The same table but the different number of the loaded records?
Next table, then on 2nd try, it was back to the previous table, different number of records. The original table it keeps on crashing on has 86639 records but is only 7MB in size.
Thanks for the ICU collation information, I will try again with it loaded. And Yes I think we have some 1252 characters for some databases, which is why I thought that doing a dump to ASCII would be able to resolve these issues rather than doing the conversion.
I have created fixed extents for all of the storage areas that are larger than their current sizes.
I'll also try a binary dump/load and see if that's any better.
In regards to the ICU collation, does that only apply for UTF-8 character sets? Because I would have to do a dump/reload into an Iso8859-1 DB if I'm to go with the binary dump. Or should I do the conversion to UTF-8 in version 10 before I do the dump/reload?
> I have tried connecting to the DB in single user mode as well as in multi-user mode with just specifying the -db -N -H -S -s -stsh parameters
This explains why the load was slow.
Start database in multi-user mode with no-integrity (-i) and connect it as a self-service client.
>> You can: run prodict/load_d.p(file-name, dot-d-dir)
> I did find this but I don't think I can specify a code mapping UTF-8?
You already used it behind of Data Dictionary's interface.
> Next table, then on 2nd try, it was back to the previous table, different number of records.
Hence the issue is not related to the particular records with the long character fields. And the issue is not related to the codepage of new database.
KB-P38226: SYSTEM ERROR: Input data too large, try to increase -s or/an
<quote>
Using the 4GL IMPORT statement to load data from a large file (12MB) generated by a third party software package.
CAUSE:
Data was corrupted. It contained double quote characters that are not properly escaped.
FIX:
Ensure that the data is properly formatted for the IMPORT statement and that it there are no embedded null characters CHR(0) or special characters that are not properly escaped.
</quote>
Data Dictionary dump (the EXPORT statement) creates the file with special characters that are not properly escaped.
The ICU collations can only be applied to databases configured with the utf-8 code page.
> And Yes I think we have some 1252 characters for some databases
I think this could be the root of some of the problems you are running into.
Think of iso8859-1 as a subset of 1252. 1252 has characters defined in the range 128-159 (decimal) that are not defined in iso8859-1. Since the database is defined as iso8859-1, any characters in the 128-159 range will fail to be converted correctly to UTF-8 when you dump the data. After that, of course, they won't load correctly into a utf-8 database.
To get an idea of the differences, start a Procedure Editor session with -cpinternal utf-8, and run this code:
DEFINE VARIABLE decval AS INTEGER.
DEFINE VARIABLE ch8859-1 AS CHARACTER.
DEFINE VARIABLE ch1252 AS CHARACTER.
DEFINE VARIABLE utval8859-1 AS INTEGER.
DEFINE VARIABLE utval1252 AS INTEGER.
MESSAGE "CPINTERNAL =" SESSION:CPINTERNAL
VIEW-AS ALERT-BOX.
REPEAT decval = 122 TO 256:
ch8859-1 = CHR(decval, SESSION:CPINTERNAL, "iso8859-1").
ch1252 = CHR(decval, SESSION:CPINTERNAL, "1252").
utval8859-1 = ASC(ch8859-1).
utval1252 = ASC(ch1252).
DISPLAY decval ch8859-1 utval8859-1 ch1252 utval1252
WITH FONT 2.
END.
The important thing here is for the conversion process to be told the correct code page that it is converting the data from, wherever the conversion is done. Another consideration is that the conversion can be done when the data is loaded into the new utf-8 database.
Here is the process that I suggest...
I assume your clients normally start with -cpinternal iso8859-1 -cpstream iso8859-1. Please correct me if I am wrong.
1. When dumping the data from the original databases, start your session with -cpinternal iso8859-1 -cpstream 1252.
When dumping the data, do not use the MAP option - it will default to 1252. No character conversions will occur, but the data will be (correctly) flagged as 1252 data at the bottom of the *.d files.
2. Set up the new databases with code page utf-8 and utf-8 word-break table.
3. Load collation definitions ICU-UCA.df. Rebuild the indexes. Deactivate all except primary index.
When loading the data into the new utf-8 databases, start your session with -cpinternal 1252 -cpstream 1252.
4. After loading the data, rebuild all the indexes.
ISO 8859-1 is /not/ a 7 bit character set. It is an 8 bit character set and it contains characters in the range 32 to 126 and in the range 160 to 255. these characters are exactly the same as the corresponding code points in Unicode.
Windows 1252 is mostly the same as ISO 8859-1 but with additional characters in the range 128 to 159.
Thanks Aiden for your suggestion. I tried your process but unfortunately still got the same error. However, I have managed to narrow it down to a particular record that's causing this. Unfortunately I can't see from the ASCII dump what the exact issue with it is, i.e. I can't see any special characters nor unterminated double-quotes.
Anyhow, this is bound to happen in other databases and if there is no way we can import the data in regardless of corrupt data/encoding, then we may as well do the binary dump and then convert UTF-8. I just tested this and it works fine with all of the data converted properly.
Thanks for all your help!
Glad to help. I hope the rest of the database migrations go well!