CODEPAGE question

Posted by goo on 25-Sep-2019 07:16

11.7 / 12.0

Could anyone explain to me why this gives me two different answers?

def var myISO as longchar no-undo.
def var myUTF as longchar no-undo.

fix-codepage(myISO) = session:cpinternal.
fix-codepage(myUTF) = 'UTF-8'.

myISO = 'Ä'.

/*Changing these will give me two different results for utf.txt .... why?*/

//myUTF = codepage-convert(myISO,'UTF-8',session:cpinternal).
myUTF = codepage-convert('Ä','UTF-8',session:cpinternal).

copy-lob myISO to file 'e:\temp\iso.txt' no-convert.
copy-lob myUTF to file 'e:\temp\utf.txt' no-convert.

All Replies

Posted by frank.meulblok on 25-Sep-2019 08:09

The "codepage-convert('Ä','UTF-8',session:cpinternal)"  gives mojibake instead of the expected  'Ä', and has more bytes than expected.

So in that case there's a double conversion: the character gets converted from the single-byte ISO codepage to a multibyte UTF-8 sequence, then the individual bytes of the UTF-8 sequence get interpreted as single-byte characters and converted again.
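A byte-level sketch of the double conversion described above. It is written in Python rather than ABL, purely to make the bytes visible, and it assumes the session codepage is ISO8859-1 (Python's latin-1 codec stands in for it). It does not claim to reproduce what the AVM does internally, only the byte pattern such a double conversion would leave in utf.txt.

```python
# Illustration only: Python stands in for the AVM, and latin-1 stands in
# for an assumed ISO8859-1 session:cpinternal.
text = "Ä"  # U+00C4, Latin capital letter A with diaeresis

iso_bytes = text.encode("latin-1")   # one byte in ISO8859-1
utf8_once = text.encode("utf-8")     # correct UTF-8: two bytes

# Second conversion: the two UTF-8 bytes are misread as two ISO8859-1
# characters, and each of those is converted to UTF-8 again.
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")

print(iso_bytes.hex())    # c4
print(utf8_once.hex())    # c384
print(utf8_twice.hex())   # c383c284
```

Comparing a hex dump of utf.txt against these three patterns is a quick way to tell whether a file was converted once, twice, or not at all.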

The double conversion is probably because when you fix-codepage your longchars, there is an automatic codepage conversion that takes effect. (Rules are buried in the docs on the ASSIGN statement) -> that automatic conversion happens on top of the explicit conversion you have in the code.

Posted by slacroixak on 25-Sep-2019 09:33

I am a little bit confused by the chosen sample 'A', as it is actually encoded with one single byte in UTF-8, like all characters in the ASCII set (below 128).  The 8 of UTF-8 means it can go down to 8 bits.

In UTF-8, extended characters (those above 127 in single byte encodings) are encoded with 2, 3 or 4 bytes (so 16 to 32 bits)


Said differently, strings made with only ASCII chars (below 128) should be encoded the same in all single byte codepages as well as in UTF-8

There should be differences only if extended characters are involved (like letter with accents, etc...)
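The point about byte lengths can be checked with a short sketch. Python is used here only as a convenient way to inspect bytes, and latin-1 is assumed as a representative single-byte codepage; the sample characters are arbitrary picks, not taken from the thread.

```python
# UTF-8 length per character: ASCII takes 1 byte, everything else 2 to 4.
samples = ["A", "Ä", "€", "😀"]
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s) in UTF-8")

# A pure-ASCII string has identical bytes in latin-1 and UTF-8 ...
ascii_same = "plain text".encode("latin-1") == "plain text".encode("utf-8")

# ... while a string with an accented letter does not.
accent_same = "Ärger".encode("latin-1") == "Ärger".encode("utf-8")

print(ascii_same, accent_same)
```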

Posted by frank.meulblok on 25-Sep-2019 09:50

The sample isn't 'A' (Latin capital letter A, codepoint U+0041), it's 'Ä' (Latin capital letter A with diaeresis, codepoint U+00C4 (assuming composed form)).
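The "assuming composed form" caveat matters because the same visible letter can also be stored as 'A' followed by a combining diaeresis (U+0308). A small Python sketch of the difference, using the standard unicodedata module; it is an illustration of Unicode normalization in general, not of anything ABL-specific.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "Ä")     # single codepoint U+00C4
decomposed = unicodedata.normalize("NFD", "Ä")   # 'A' + U+0308

composed_utf8 = composed.encode("utf-8")
decomposed_utf8 = decomposed.encode("utf-8")

print([f"U+{ord(c):04X}" for c in composed], composed_utf8.hex())
print([f"U+{ord(c):04X}" for c in decomposed], decomposed_utf8.hex())

# The decomposed form cannot be written in latin-1 at all, because the
# combining diaeresis has no single-byte representation there.
try:
    decomposed.encode("latin-1")
    decomposed_fits_latin1 = True
except UnicodeEncodeError:
    decomposed_fits_latin1 = False
```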

Posted by slacroixak on 25-Sep-2019 09:54

Oops, I missed the double dots on my screen

Posted by goo on 25-Sep-2019 11:01

Ok, so the correct way of doing an ISO -> UTF -> ISO conversion would be something like this?

myUTF = 'Ä'.

myISO = myUTF.

I would believe that

def var myTekst as char no-undo. /*by default session:cpinternal*/

myTekst = 'Ä'.

would be the same as 'Ä'

since myTekst = 'Ä' is true.

But CODEPAGE-CONVERT converts differently depending on whether I pass 'Ä' or myTekst.

Is that the way it should be?

Posted by slacroixak on 26-Sep-2019 06:50

From what I understand, a LONGCHAR variable is "code-page aware". Once you have set its code page with the FIX-CODEPAGE statement, you should no longer play with the CODEPAGE-CONVERT() function. The LONGCHAR should handle the conversions for you when you assign something to it, by taking into account the source and target code pages. At least, this is what I would expect.

Said differently, when you do:

myTekst =  "Some constant".


myTekst =  aSimpleCharVar.

-> the ABL is aware of the code page of "Some constant" or aSimpleCharVar, aka SESSION:CPINTERNAL

So the implicit ASSIGN statement should convert it automatically to the codepage of myTekst.

Similarly, if you do myTekst = myUTF.

 -> the ABL should convert the content of myUTF from its encoding to the encoding of myTekst.

If you do something like myUTF = codepage-convert('Ä','UTF-8',session:cpinternal).

then you may end up wrongly applying the conversion twice and not obtain what you want

BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE() statement.

I'd be pleased to be corrected by a PSC person if I'm wrong

Hope it Helps.

Posted by Peter Judge on 26-Sep-2019 13:42

From the Help, it looks like your assertion is correct.
Default character conversions with the ASSIGN statement
When the target field is a . . .
And the source expression results in a . . .
The AVM converts the result of the source expression to . . .
-cpinternal code page
-cpinternal or the fixed code page

> BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE() statement.

Sebastien, is there something that's not working properly? Or not in an expected way?

The OpenEdge.Core.String object
- holds all its values in a (private) UTF-8-encoded longchar (regardless of the session's CPINTERNAL value)
- has an Encoding property that  defaults to CPINTERNAL
- Has a public Value property that's a longchar and which performs a CODEPAGE-CONVERT() when needed (when the GET runs).
    /** Contains the actual string value. Marked as NON-SERIALIZABLE since the actual value is derived,
        and stored in the private mUTF8Value variable */
    define public non-serializable property Value as longchar no-undo
        get():
            // no need for changes if we're using UTF-8 as CPINTERNAL
            if this-object:Encoding eq 'UTF-8':u then
                return mUTF8Value.

            return codepage-convert(mUTF8Value, this-object:Encoding).
        end get.
You should be able to see this easily enough.
def var objString as OpenEdge.Core.String.
def var lcValue as longchar.
def var lcIn as longchar.
fix-codepage(lcIN) = 'utf-8'.
lcIN = '  Ä '.
objString = new OpenEdge.Core.String(lcin).
objString:Encoding = 'ISO8859-1'.
lcValue = objString:Value.
message
    'session:cpinternal = ' session:cpinternal skip // is UTF-8 in my case
    'lcValue:cpinternal = ' get-codepage(lcValue)   // should be ISO8859-1
    view-as alert-box.

Posted by goo on 26-Sep-2019 14:14

I was concerned by the result of
Def var myISOvalue as char no-undo. /*session:cpinternal = 'iso8859-1'*/
myISOvalue = 'Ä'.
Fix-codepage(UTF8value1) = 'UTF-8'.
Fix-codepage(UTF8value2) = 'UTF-8'.
UTF8value1 = codepage-convert(myIsoValue,'UTF-8',session:cpinternal).
UTF8value2 = codepage-convert('Ä','UTF-8',session:cpinternal).
Gives different value when I copy-lob to file with no-convert.
So I just wondered why that happened. myISO is a char, not a longchar.
But at the end, I do not have to use codepage-convert into a fixed longchar…

Posted by slacroixak on 27-Sep-2019 09:14

Hi Goo, I agree with you this is weird.  As I was trying to say, this :

   UTF8value1 = codepage-convert(myIsoValue,'UTF-8',session:cpinternal).

may result in applying twice the conversion transformation from your cpinternal to UTF-8

If it does so, then I'd consider it as a bug and would open a Tech Support Ticket

IMHO, when the assign statement assigns the value of the codepage-convert function to a longchar, it should not convert the result a second time, especially if the target code page param of codepage-convert matches the codepage of the longchar.  The whole problem is what to do when these two codepages do not match... raise a runtime error perhaps?

