Issue while reading Chinese characters from XML file

Posted by gaursaab on 25-Feb-2019 14:26

Hi All,

We are facing an issue while reading an XML file that has supplier names in Chinese: the program reads the Chinese characters as "...".

The XML file we are loading has encoding UTF-8.

The session we are loading it in has code page "ISO8859-1".

We load the XML using the LOAD method:

- tried passing the XML file directly to the LOAD method

    hXML:LOAD(cDocType, cFileName, NO) NO-ERROR.

- tried reading the content of the file into a MEMPTR variable, but no luck

    INPUT FROM VALUE(cFileName) BINARY NO-MAP CONVERT SOURCE "UTF-8".
    IMPORT UNFORMATTED mPointer.
    INPUT CLOSE.

    hXML:LOAD("memptr", mPointer, FALSE) NO-ERROR.

- while reading the value from the node, tried converting it with the CODEPAGE-CONVERT function, but no luck

Can anyone suggest what else we could try?

Regards,

Sachin 

 

All Replies

Posted by Matt Baker on 25-Feb-2019 18:09

Hi gaursaab,

ISO8859-1 is a single-byte character set: en.wikipedia.org/.../IEC_8859-1

UTF-8 is a variable-byte character set: en.wikipedia.org/.../UTF-8

There is no way to represent Chinese multi-byte characters in a single-byte code page.  

You need to change the internal code page for the session to use a multi-byte code page in order to work with the characters in that file properly.

Also, be careful not to accidentally store any of those strings in a database that isn't configured with a code page capable of handling them.

In an international world, you cannot get away with using ISO 8859-1.  It is too limited.  Your default selection should be UTF-8.
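To make the single-byte limitation concrete, here is a small sketch in Python rather than ABL, chosen only because the point is independent of the language. The sample string is an invented value, not taken from the poster's file, and the snippet says nothing about how the AVM itself reports the failure.

```python
# Illustration only: "供应商" ("supplier") is a made-up sample value.
sample = "供应商"

# UTF-8 is variable-width: each of these three characters needs 3 bytes.
utf8_bytes = sample.encode("utf-8")
print(len(sample), "characters ->", len(utf8_bytes), "UTF-8 bytes")

# ISO 8859-1 uses exactly one byte per character and has only 256 code
# points, none of them Chinese, so a strict conversion fails.
try:
    sample.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print("representable in ISO 8859-1:", representable)

# A lossy conversion can only substitute a placeholder per character.
lossy = sample.encode("iso-8859-1", errors="replace")
print(lossy)
```

Whatever conversion routine is used, the target code page has no slot for these characters, so the original text cannot survive the conversion.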

For a fuller explanation of why, read on:

www.w3.org/.../qa-choosing-encodings

The full manifesto:

http://utf8everywhere.org/

Posted by Peter Judge on 25-Feb-2019 18:40

Have you looked into

    COPY-LOB FILE VALUE(cFileName) TO mPointer.

There are conversion options on that statement too.

Posted by gaursaab on 26-Feb-2019 11:44

Hi Matt,

Thanks for your valuable input.

I tried setting cpstream for the session to "UTF-8" but had no success.

This application has been running for a very long time, and I am a little afraid to change cpinternal because I am not sure about the other impacts.


Posted by gaursaab on 26-Feb-2019 11:44

Hi Peter,


Thanks for your reply. I just tried this:

    COPY-LOB FILE filename TO mPointer CONVERT TARGET CODEPAGE "UTF-8".

Posted by Dileep Dasa on 26-Feb-2019 12:29

The default character encoding used to encode the contents of an XML document for the X-document object handle is "UTF-8". So, the LOAD() method should read the multibyte characters correctly. Could you share the XML file that you are loading? And how are you reading the values from the node after loading the XML file?

Posted by frank.meulblok on 26-Feb-2019 13:02

Like it or not, you need to set the session:CPINTERNAL to UTF-8. Otherwise your session will not be able to handle those Chinese characters correctly.

You'll also need to move your databases to UTF-8 if you plan on storing the data from the XML there.

(You could also use another code page that supports Chinese, such as GB2312, but then you'll run into the same wall over and over if/when you need to support languages that use Cyrillic script, Arabic, other Asian languages, emoji, ... Just going for a Unicode encoding means you only have to do this once to cover all of those.)

Posted by Matt Baker on 26-Feb-2019 13:11

Changing cpstream won't help.  cpstream tells the import to assume incoming bytes are UTF-8.  XML has a header that indicates the character encoding for the document, so cpstream isn't in play here.  Again, you simply cannot convert UTF-8 bytes coming from that XML document into a form that is usable with ISO 8859-1.  ISO 8859-1 has absolutely no way to represent those characters.
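As a rough, language-neutral illustration of that point (Python again, not ABL, and not the OpenEdge parser), the sketch below hands raw bytes to an XML parser, which decodes them according to the encoding named in the XML declaration. The element name and value are hypothetical. The decoded text is correct, and the loss only happens when the text is squeezed into ISO 8859-1 afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <supplier> element and its value are invented.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><supplier>供应商</supplier>'
xml_bytes = xml_text.encode("utf-8")

# The parser is given bytes and uses the declaration to decode them;
# no external "stream code page" setting is consulted.
root = ET.fromstring(xml_bytes)
name = root.text
print(len(name), "characters decoded from", len(xml_bytes), "bytes")

# Narrowing the decoded value to ISO 8859-1 is where the data is lost.
narrowed = name.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
print(narrowed)
```

In this sketch the parse step succeeds and only the final narrowing step destroys the characters, which matches the explanation above: the stream setting is not the problem, the single-byte internal representation is.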

Posted by gus bjorklund on 26-Feb-2019 14:52

In addition, to display or print, you will need one or more typefaces that contain glyphs for the Chinese characters.

This thread is closed