Codepage problems UTF8

Posted by goo on 18-Nov-2016 15:28


def var lc as longchar no-undo.

fix-codepage(lc)='UTF-8'.
find first Table where.......
lc = buffer Table::clobField.

gives an error like this:

Invalid character code found in data for codepage UTF-8 (12008)

(How can I find which character is wrong?)

If I instead do this:

copy-lob object buffer table::clobField to file './test.xml' no-convert.

copy-lob file './test.xml' to object lc.

this works swell.... 

Could someone explain that? 

I believe I should be able to do this:

If I have a UTF-8 file, I should be able to load it into a longchar whose code page is fixed to UTF-8 using

copy-lob file 'xxx.xml' to object lc. 

or I should be able to do it like this:

copy-lob file 'xxx.xml' to object lc convert source codepage 'iso8859-1' target codepage 'UTF-8'.  

//Geir Otto

All Replies

Posted by jquerijero on 18-Nov-2016 16:07

If my memory serves me right, it has something to do with the code page your Progress run-time is using.

Posted by Paul Clare on 21-Nov-2016 07:27

Hi Geir Otto, this could be because there is invalid character data (e.g. control codes) in the CLOB.  COPY-LOB validates the data as real 'character' data.  From the COPY-LOB ABL reference:

"However, if the target is a LONGCHAR or a CLOB, the AVM validates the character data based on the target object's code page. For a CLOB, this is the code page of the CLOB. For a LONGCHAR, this is -cpinternal unless the LONGCHAR's code page was set using the FIX-CODEPAGE statement. If the validation fails, the AVM raises the ERROR condition."

If there are control characters in it then it isn't valid and they need to be removed.  See articles 000072678 and 000049137.  This might explain why it works when the target is an XML file as the same validation isn't performed?  
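To find out which character is being rejected, one option is to scan the raw bytes yourself. The following is only a minimal sketch, assuming the CLOB has already been dumped to './test.xml' with NO-CONVERT as in the first post. The variable names and the ScanLoop label are made up for illustration, and the check is simplified: it flags control codes (other than TAB, LF and CR), bytes that can never start a UTF-8 sequence, and missing continuation bytes, but it does not catch every malformed case such as overlong encodings.

```abl
/* Sketch: report the offset of the first byte that breaks basic UTF-8 rules. */
DEFINE VARIABLE mData  AS MEMPTR  NO-UNDO.
DEFINE VARIABLE iSize  AS INTEGER NO-UNDO.
DEFINE VARIABLE iPos   AS INTEGER NO-UNDO INITIAL 1.
DEFINE VARIABLE iByte  AS INTEGER NO-UNDO.
DEFINE VARIABLE iTrail AS INTEGER NO-UNDO. /* continuation bytes expected */
DEFINE VARIABLE iIdx   AS INTEGER NO-UNDO.
DEFINE VARIABLE iNext  AS INTEGER NO-UNDO.
DEFINE VARIABLE iBad   AS INTEGER NO-UNDO. /* 0 = nothing found */

/* Load the file byte for byte; a MEMPTR target is not validated as text. */
COPY-LOB FROM FILE "./test.xml" TO OBJECT mData NO-CONVERT.
iSize = GET-SIZE(mData).

ScanLoop:
DO WHILE iPos <= iSize:
    iByte = GET-BYTE(mData, iPos).

    /* Control codes other than TAB (9), LF (10) and CR (13). */
    IF iByte < 32 AND iByte <> 9 AND iByte <> 10 AND iByte <> 13 THEN DO:
        iBad = iPos.
        LEAVE ScanLoop.
    END.

    /* Work out how many continuation bytes the lead byte announces. */
    IF      iByte < 128                   THEN iTrail = 0. /* ASCII  */
    ELSE IF iByte >= 194 AND iByte <= 223 THEN iTrail = 1. /* C2..DF */
    ELSE IF iByte >= 224 AND iByte <= 239 THEN iTrail = 2. /* E0..EF */
    ELSE IF iByte >= 240 AND iByte <= 244 THEN iTrail = 3. /* F0..F4 */
    ELSE DO:
        /* 80..C1 and F5..FF can never start a UTF-8 sequence. */
        iBad = iPos.
        LEAVE ScanLoop.
    END.

    /* A sequence cut off by the end of the data is also invalid. */
    IF iPos + iTrail > iSize THEN DO:
        iBad = iPos.
        LEAVE ScanLoop.
    END.

    /* Every continuation byte must lie in 80..BF (128..191). */
    DO iIdx = 1 TO iTrail:
        iNext = GET-BYTE(mData, iPos + iIdx).
        IF iNext < 128 OR iNext > 191 THEN DO:
            iBad = iPos.
            LEAVE ScanLoop.
        END.
    END.

    iPos = iPos + iTrail + 1.
END.

IF iBad > 0 THEN
    MESSAGE "Suspect byte value" GET-BYTE(mData, iBad) "at offset" iBad
        VIEW-AS ALERT-BOX INFO BUTTONS OK.
ELSE
    MESSAGE "No invalid sequence found by this simplified check."
        VIEW-AS ALERT-BOX INFO BUTTONS OK.

SET-SIZE(mData) = 0. /* release the memory held by the MEMPTR */
```

The offset is 1-based and counts bytes, not characters, so a hex editor is probably the easiest way to look at what sits at that position in the file.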

Paul.

Posted by goo on 22-Nov-2016 05:16

Thanks Paul, I will do some reading here.

Posted by goo on 22-Nov-2016 05:36

Paul, the WebClient has ISO8859-1 as its internal code page, and since we don't set a code page on the lc variable, it is ISO8859-1 as well. Will Progress then convert the text returned by oStreamReader:ReadToEnd() to ISO8859-1? I am pretty sure it is UTF-8 when it comes from the stream. Is the lc that is returned now ISO8859-1, or do you think it is UTF-8?

*******************************************************************************************************************

       def var oStreamReader as class System.IO.StreamReader.

       def var oEncoding as System.Text.Encoding.

       oEncoding = System.Text.Encoding:UTF8.

       QueueHasRecord = TRUE.

       oMsg = NEW System.Messaging.Message().

       oMsg = oMsgQ:Receive(NEW System.TimeSpan(0)).

       IF MsgFormat = 'Binary'  THEN oMsg:Formatter = oMsgBinaryFormatter.

       ELSE IF MsgFormat = 'ActiveX' THEN oMsg:Formatter = oMsgActiveXFormatter.

       ELSE oMsg:Formatter = oMsgXmlFormatter.

       oStreamReader = new System.IO.StreamReader(oMsg:BodyStream,oEncoding,TRUE).

       lc = oStreamReader:ReadToEnd().

       lc = replace(lc,'',''). /* Removes any UTF-8 BOM */

       RETURN lc.

*******************************************************************************************************************

Posted by Paul Clare on 22-Nov-2016 09:55

Hi Geir Otto, I would expect the data to be converted to iso8859-1 in your case. According to the rules for assignment, when character data is assigned to a longchar it should be converted to -cpinternal or to the fixed code page. There is a bug where this isn't happening for database CLOBs (article 000074626), but I made a quick test, and this seems to work as I would expect:

DEFINE VARIABLE oStreamReader AS CLASS System.IO.StreamReader NO-UNDO.
DEFINE VARIABLE oEncoding     AS CLASS System.Text.Encoding   NO-UNDO.
DEFINE VARIABLE cPath         AS CHARACTER                    NO-UNDO.
DEFINE VARIABLE lc            AS LONGCHAR                     NO-UNDO.
DEFINE VARIABLE cOutString    AS CHARACTER                    NO-UNDO.

/* Write UTF-8 value '££' (C2,A3 C2,A3) to text file.  Total length 8 bytes */
OUTPUT TO "C:\temp\MyTest.txt" CONVERT SOURCE SESSION:CPINTERNAL TARGET "UTF-8".
DISPLAY CHR(163,SESSION:CPINTERNAL,"iso8859-1") + CHR(163,SESSION:CPINTERNAL,"iso8859-1") WITH NO-LABELS.
OUTPUT CLOSE.

/* Read the file back through a .NET StreamReader that decodes it as UTF-8 */
oEncoding = System.Text.Encoding:UTF8.
cPath = "C:\temp\MyTest.txt".
oStreamReader = NEW System.IO.StreamReader(cPath,oEncoding,TRUE).

lc = oStreamReader:ReadToEnd().
cOutString = lc.

MESSAGE "SESSION:CPINTERNAL : " SESSION:CPINTERNAL SKIP(2)
   "StreamReader to LONGCHAR" SKIP
   "---------------------------------------------------" SKIP
   "Longchar data length in Bytes (4 bytes for CR LF) : " LENGTH(lc,"RAW") SKIP
   "Longchar code page : " GET-CODEPAGE(lc) SKIP (2)
   "LONGCHAR copied to string" SKIP
   "---------------------------------------------------" SKIP
   "Variable data : " TRIM(cOutString) SKIP
   "Variable data length in Bytes (4 bytes for CR LF) : " LENGTH(cOutString,"RAW") SKIP
   SKIP(2)
   VIEW-AS ALERT-BOX INFO BUTTONS OK.

Start the session with -cpinternal iso8859-1 -cpstream iso8859-1 and the byte length should be 6: two bytes for the data, which has been converted to two single-byte pound signs (£), and 4 bytes for the CR LF. Then do the same with -cpinternal UTF-8 -cpstream UTF-8 to see the difference when no conversion takes place.
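If you want the LONGCHAR to stay UTF-8 whatever -cpinternal is, fixing its code page before the assignment should do it. Here is a small sketch of that idea, independent of the file test above. The variable names are hypothetical, and I am assuming the session code page can represent the pound sign (iso8859-1 and UTF-8 both can).

```abl
/* Sketch: the same two pound signs held in LONGCHARs with different fixed code pages. */
DEFINE VARIABLE cPounds AS CHARACTER NO-UNDO.
DEFINE VARIABLE lcIso   AS LONGCHAR  NO-UNDO.
DEFINE VARIABLE lcUtf8  AS LONGCHAR  NO-UNDO.

/* Fix the code pages before any data is assigned. */
FIX-CODEPAGE(lcIso)  = "iso8859-1".
FIX-CODEPAGE(lcUtf8) = "UTF-8".

/* Code 163 in iso8859-1 is the pound sign; CHR converts it to -cpinternal. */
cPounds = CHR(163, SESSION:CPINTERNAL, "iso8859-1")
        + CHR(163, SESSION:CPINTERNAL, "iso8859-1").

/* Each assignment converts from -cpinternal to the target's fixed code page. */
lcIso  = cPounds.
lcUtf8 = cPounds.

MESSAGE
    "iso8859-1 length in bytes : " LENGTH(lcIso, "RAW")  SKIP
    "UTF-8 length in bytes     : " LENGTH(lcUtf8, "RAW") SKIP
    "Code pages                : " GET-CODEPAGE(lcIso) GET-CODEPAGE(lcUtf8)
    VIEW-AS ALERT-BOX INFO BUTTONS OK.
```

The two byte lengths should differ, because each pound sign is one byte in iso8859-1 and two bytes (C2,A3) in UTF-8, and the result should not depend on which of the two startup settings the session uses.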

Paul.

Posted by goo on 22-Nov-2016 10:46

Thanks Paul !!

This thread is closed