Converting Codepages

Posted by andrew.thornton@redprairie.com on 07-Jul-2011 07:05

Hi, I was hoping someone could give me a bit of advice on the following issue we are having with codepage conversion.

Our database and application run using ISO8859-15 - in hindsight that maybe wasn't the best choice when it was made 10 years ago!

We have incoming text interface files which are encoded as ISO8859-1 - they are coming from a third party so we have no control over those. We then have to send out web-service calls to a second third party software provider where the codepage in the xml has to be UTF-8. Again we have no control over what that codepage is - though UTF-8 tends to be a standard for xml anyway.

The problem we have is that our incoming interface files have European addresses in them which will potentially include special characters such as accents or umlauts. These map across fine when we effectively convert them from ISO8859-1 to ISO8859-15 in the incoming interface files. The problem is in the outgoing messages: as I understand it, there is no direct mapping for some of these characters from ISO8859-15 to UTF-8. What we actually then find is that the xml we produce becomes invalid and the third party can't read the file (but then neither can something like Internet Explorer).

My question is how do I best get around this?

One option would be to convert the database and application (including compiled r-code?) to a different codepage - probably ISO8859-1. My worries there are that the database is about 100GB in size - so what is involved in converting a database's codepage? Plus, although this currently impacts one customer, we have 50+ customers who are all distributed the same pl's for a release, and if r-code holds codepage information (does it?) will it mean we either have to convert all our customers' codepages, or have two separate releases?

Alternatively, our setup for this customer is that they are running app-server, with database and app-server on a Unix server. Then they have Windows clients connecting to the app-server. Both incoming interfaces and outgoing web-services run as batch processes on the Unix server. Could we run the process that generates the web-service call under a specific codepage that will ultimately allow us to convert the ISO8859-15 data in the database into the UTF-8 format required in the xml?

Thanks, Andrew

All Replies

Posted by gus on 07-Jul-2011 10:15

Changing the database code page may not fix your problem. I'm not sure what the problem is since you have not provided details. How do you know the xml is invalid? What errors occur? If you have a character set mapping problem, it may be easy to correct.

For the database code page, ISO 8859-15 is a reasonable choice for western Europe and the Americas. It is almost the same as 8859-1 (8859-15 adds the euro symbol and changes seven other rarely used characters), which is the default character set for OpenEdge and for much other software. 8859-15 ought to convert fine to UTF-8 (but I'm not an expert in these matters). Perhaps you are getting data that is not in the code page you think it is, or you don't have the conversion tables you need set up.
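
To make the size of that difference concrete, here is a minimal Python sketch. Python is used only for illustration; it is not OpenEdge code and says nothing about what convmap.dat contains. It relies on Python's built-in latin-1 and iso8859-15 codecs, and the function name differing_bytes is made up for this example. It lists the byte values that the two code pages decode differently.

```python
# Compare ISO 8859-1 and ISO 8859-15 byte by byte using Python's built-in codecs.
# differing_bytes is a hypothetical helper name, not part of any library.
def differing_bytes():
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")      # ISO 8859-1
        latin9 = raw.decode("iso8859-15")   # ISO 8859-15
        if latin1 != latin9:
            diffs.append((value, latin1, latin9))
    return diffs


if __name__ == "__main__":
    for value, latin1, latin9 in differing_bytes():
        print(f"0x{value:02X}: 8859-1 {latin1!r} -> 8859-15 {latin9!r}")
```

Only eight byte values differ (the euro sign at 0xA4 plus seven others). All other byte values, including the accented letters common in European addresses, decode identically in both code pages.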

In OpenEdge, you do not have just one code page. In addition to the database code page for stored data, there are configuration parameters for a variety of things the client does, and the 4GL runtime converts automatically among them using the mappings specified by convmap.dat.

cpinternal sets the code page for the 4GL runtime. Most data gets converted to the internal code page for manipulation by 4GL code.

cpstream sets the code page for stream file reading and writing.

There are a number of others for printer, rcode, terminal, etc. The manual OpenEdge Development: Internationalizing Applications has lots more information.

Posted by andrew.thornton@redprairie.com on 07-Jul-2011 11:06

Hi, thanks for your response Gus. As I understand it - and I have to admit to being led here by one of the developers in my team - Progress doesn't supply a convmap to convert from ISO8859-15 to UTF-8. So I think that means that any character outside of the standard character set (the lower character range of UTF-8) won't be converted and you get a codepage error. An example of a character we have problems with is ø (ASC 0248).
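
For reference, this is what the conversion of that character should produce, shown as a small Python sketch rather than 4GL (the helper names to_utf8 and is_valid_utf8 are made up for illustration). In ISO8859-15, ø is the single byte 0xF8 (decimal 248); in UTF-8 it is the two bytes 0xC3 0xB8. A lone 0xF8 byte is not valid UTF-8, which would be consistent with a parser rejecting xml that declares UTF-8 but still carries the unconverted byte.

```python
# Hypothetical helpers for illustration only; this is not how OpenEdge does it.
def to_utf8(raw: bytes, source_codepage: str = "iso8859-15") -> bytes:
    """Decode bytes from the source code page and re-encode them as UTF-8."""
    return raw.decode(source_codepage).encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    stored = bytes([248])                   # ø as stored in an ISO8859-15 database
    print(to_utf8(stored))                  # the UTF-8 form of ø
    print(is_valid_utf8(stored))            # raw byte copied into UTF-8 xml
    print(is_valid_utf8(to_utf8(stored)))   # properly converted byte sequence
```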

In theory, if our database is ISO8859-15 but we ran the 4GL as ISO8859-1 (or any codepage where there is a convmap between that and both ISO8859-15 and UTF-8), would Progress essentially convert the character into ISO8859-1 (or whatever we run as) and then into UTF-8 - so an implicit two-stage conversion? To do that, is it as simple as setting cpinternal in the startup parameters for the batch process? Or does the codepage of the r-code come into play as well?
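
As a rough illustration of what a two-stage route would mean for the data, here is a Python sketch (not 4GL, and not a statement about how Progress chains its conversion tables; the function names direct and via_latin1 are hypothetical). Going through ISO8859-1 keeps characters that exist in both code pages, such as ø, but a character that exists only in ISO8859-15, such as the euro sign, has no ISO8859-1 equivalent and is replaced with '?' in this sketch.

```python
# Illustration only: compares a direct conversion with one routed through ISO8859-1.
def direct(raw: bytes) -> bytes:
    """ISO8859-15 bytes straight to UTF-8."""
    return raw.decode("iso8859-15").encode("utf-8")


def via_latin1(raw: bytes) -> bytes:
    """ISO8859-15 bytes to ISO8859-1 first, then on to UTF-8."""
    text = raw.decode("iso8859-15")
    # Characters missing from ISO8859-1 are replaced with '?' at this stage.
    intermediate = text.encode("latin-1", errors="replace")
    return intermediate.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    for raw in (b"\xf8", b"\xa4"):   # ø and the euro sign in ISO8859-15
        print(raw, direct(raw), via_latin1(raw))
```

For address data the two routes would mostly agree, since only eight byte values differ between the two code pages, but the direct conversion is the only one that cannot lose characters.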

Or is there a convmap somewhere for ISO8859-15 to UTF-8 and for some reason we just don't have it?

Thanks, Andrew

Posted by gus on 07-Jul-2011 13:19

There are conversion tables to go from 8859-1 to utf-8, from 8859-15 to utf-8, and from 8859-1 to 8859-15.

See $DLC/prolang/convmap/utf-8.dat and 8859-1.dat and 8859-15.dat

Because no one uses all the code pages or converts among all of them, not all the tables are loaded. You will probably have to make a few changes to convmap.dat and then generate a new convmap.cp. See $DLC/prolang/README.

HTH

Posted by asgt1974 on 08-Jul-2011 09:16

Thanks Gus. None of us knew about the conversion tables in $DLC/prolang/convmap, so that's very useful - thank you. On my laptop install (which probably mirrors all our customer installs) I do have files for both UTF-8 and 8859-15. The UTF-8 file includes a section for 8859-15 - and I'm guessing it should be this one that we need to convert our ISO8859-15 database into a UTF-8 xml file(??). The 8859-15.dat file doesn't have a section for UTF-8 - could this be the cause of our issue? I've passed the README file on to a couple of my developers to digest - hopefully that will make the issue clearer for them.

Thanks for your help - hopefully this will solve our issue!

Andrew

Posted by gus on 11-Jul-2011 09:22

0) Yes, the mapping table in the utf-8 file is what you need. It will allow you to convert from the database codepage of 8859-15 to utf-8 when you generate the xml.

1) Right, the 8859-15 file has no mapping table from 8859-15 to utf-8 and that is why you got errors.
