CODEPAGE question - OpenEdge General - Forum

All Replies

Posted by frank.meulblok on 25-Sep-2019 08:09

The "codepage-convert('Ä','UTF-8',session:cpinternal)" gives mojibake instead of the expected 'Ä', and has more bytes than expected.

So in that case there's a double conversion: the character gets converted from a single-byte ISO codepage to a multibyte UTF-8 sequence, then the individual bytes of the UTF-8 sequence get interpreted as single-byte characters and converted again.

The double conversion is probably because when you fix-codepage your longchars, there is an automatic codepage conversion that takes effect. (Rules are buried in the docs on the ASSIGN statement) -> that automatic conversion happens on top of the explicit conversion you have in the code.
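
To make the double conversion concrete at the byte level, here is a minimal sketch in Python (not ABL), assuming the session code page is ISO8859-1 (Python's `latin-1`). It only mimics the two steps described above and does not call any OpenEdge API; the helper name `double_convert` is made up for illustration.

```python
def double_convert(text: str, single_byte_cp: str = "latin-1") -> bytes:
    """Mimic converting to UTF-8 twice (hypothetical helper, not an OpenEdge API)."""
    # First conversion: the character becomes a multibyte UTF-8 sequence.
    utf8_once = text.encode("utf-8")
    # The UTF-8 bytes are then misread as single-byte characters ...
    misread = utf8_once.decode(single_byte_cp)
    # ... and converted to UTF-8 a second time.
    return misread.encode("utf-8")


once = "Ä".encode("utf-8")
twice = double_convert("Ä")
print(once.hex())   # c384
print(twice.hex())  # c383c284
```

The single conversion yields the two bytes C3 84. The doubled one yields four bytes, which matches the "more bytes than expected" symptom.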

Posted by slacroixak on 25-Sep-2019 09:33

I am a little bit confused by the chosen sample 'A', as it is actually encoded as one single byte in UTF-8, like all characters in the ASCII set (below 128). The 8 in UTF-8 means it can go down to 8 bits.

In UTF-8, extended characters (those above 127 in single-byte encodings) are encoded with 2, 3 or 4 bytes (so 16 to 32 bits).

=> en.wikipedia.org/.../UTF-8

Said differently, strings made with only ASCII chars (below 128) should be encoded the same in all single byte codepages as well as in UTF-8

There should be differences only if extended characters are involved (like letter with accents, etc...)
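
A quick way to check this claim is to compare encoded byte lengths. The sketch below uses Python rather than ABL, and the set of code pages compared (`latin-1`, `cp1252`, `utf-8`) is an illustrative choice, not a complete list.

```python
# Code pages compared here are an illustrative choice, not a complete list.
CODEPAGES = ["latin-1", "cp1252", "utf-8"]


def encoded_forms(text: str) -> dict:
    """Return the encoded bytes of `text` for each code page in CODEPAGES."""
    return {cp: text.encode(cp) for cp in CODEPAGES}


ascii_forms = encoded_forms("A")     # same single byte everywhere
extended_forms = encoded_forms("Ä")  # one byte in ISO/Windows, two in UTF-8

# UTF-8 lengths grow with the code point: 1, 2, 3 and 4 bytes.
utf8_lengths = [len(ch.encode("utf-8")) for ch in ["A", "Ä", "€", "😀"]]

print(ascii_forms)
print(extended_forms)
print(utf8_lengths)
```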

Posted by frank.meulblok on 25-Sep-2019 09:50

The sample isn't 'A' (Latin capital letter A, codepoint U+0041), it's 'Ä' (Latin capital letter A with diaeresis, codepoint U+00C4, assuming composed form).
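
The "assuming composed form" caveat matters because the same visible character can also be written as 'A' followed by a combining diaeresis. A small Python sketch (again not ABL, using only the standard `unicodedata` module) shows the difference:

```python
import unicodedata

composed = "\u00c4"                                  # single code point U+00C4
decomposed = unicodedata.normalize("NFD", composed)  # 'A' + combining diaeresis

composed_points = [f"U+{ord(c):04X}" for c in composed]
decomposed_points = [f"U+{ord(c):04X}" for c in decomposed]

print(composed_points, composed.encode("utf-8").hex())      # ['U+00C4'] c384
print(decomposed_points, decomposed.encode("utf-8").hex())  # ['U+0041', 'U+0308'] 41cc88
```

The decomposed form has no single-byte ISO8859-1 equivalent for its second code point, so which form the source text uses can change the result of a conversion.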

Posted by slacroixak on 25-Sep-2019 09:54

Oops, I missed the double dots on my screen.

Posted by goo on 25-Sep-2019 11:01

Ok, so the correct way of doing an ISO -> UTF -> ISO would be something like this?

myUTF = 'Ä'.

myISO = myUTF.

I would believe that

def var myTekst as char no-undo. /*by default session:cpinternal*/

myTekst = 'Ä'.

would be the same as 'Ä'

since myTekst = 'Ä' is true.

But CODEPAGE-CONVERT converts differently depending on whether it is given 'Ä' or myTekst.

Is that the way it should be?

Posted by slacroixak on 26-Sep-2019 06:50

As far as I understand, a LONGCHAR variable is "code-page aware". Once you have set its code page with the FIX-CODEPAGE statement, you should no longer play with the CODEPAGE-CONVERT() function. It should handle the conversions for you when you assign it to something, taking into account the source and target code pages. At least, this is what I would expect.

Said differently, when you do:

myTekst = "Some constant".

myTekst = aSimpleCharVar.

-> the ABL is aware of the code page of "Some constant" or aSimpleCharVar, aka SESSION:CPINTERNAL

So the implicit ASSIGN statement should convert it automatically to the codepage of myTekst.

Similarly, if you do myTekst = myUTF.

-> the ABL should convert the content of myUTF from its encoding to the encoding of myTekst.

If you do something like myUTF = codepage-convert('Ä','UTF-8',session:cpinternal).

then you may end up applying the conversion twice and not obtain what you want.

BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE statement.

I'd be pleased to be corrected by a PSC person if I'm wrong

Hope it Helps.

Posted by Peter Judge on 26-Sep-2019 13:42

From the Help, it looks like your assertion is correct.

Default character conversions with the ASSIGN statement
When the target field is a ... | And the source expression results in a ... | The AVM converts the result of the source expression to ...
CHARACTER | LONGCHAR | -cpinternal code page
LONGCHAR | CHARACTER | -cpinternal or the fixed code page

> BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE statement.

Sebastien, is there something that's not working properly? Or not in an expected way?

The OpenEdge.Core.String object

- holds all its values in a (private) UTF-8-encoded longchar (regardless of the session's CPINTERNAL value)

- has an Encoding property that defaults to CPINTERNAL

- has a public Value property that's a longchar and which performs a CODEPAGE-CONVERT() when needed (when the GET runs).

/** Contains the actual string value. Marked as NON-SERIALIZABLE since the actual value is derived,
    and stored in the private mUTF8Value variable */
define public non-serializable property Value as longchar no-undo
    get():
        // no need for changes if we're using UTF-8 as CPINTERNAL
        if this-object:Encoding eq 'UTF-8':u then
            return mUTF8Value.
        else
            return codepage-convert(mUTF8Value, this-object:Encoding).
    end get.

You should be able to see this easily enough.

def var objString as OpenEdge.Core.String.
def var lcValue as longchar.
def var lcIn as longchar.

fix-codepage(lcIN) = 'utf-8'.
lcIN = ' Ä '.

objString = new OpenEdge.Core.String(lcin).
objString:Encoding = 'ISO8859-1'.
lcValue = objString:Value.

message
    'session:cpinternal = ' session:cpinternal skip // is UTF-8 in my case
    'lcValue:cpinternal = ' get-codepage(lcValue) // should be ISO8859-1
    string(lcValue)
    view-as alert-box.
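
For readers without an OpenEdge session at hand, the "store UTF-8 once, convert only on read" idea behind the Value getter quoted above can be sketched in Python. This is a loose analogue under stated assumptions, not the OpenEdge implementation: the class name `Utf8BackedString` and its attributes are hypothetical, and the value is returned as raw bytes so the effect of the encoding is visible.

```python
class Utf8BackedString:
    """Hypothetical analogue: keep one UTF-8 copy, convert only when read."""

    def __init__(self, text: str, encoding: str = "utf-8") -> None:
        self._utf8_value = text.encode("utf-8")  # canonical internal form
        self.encoding = encoding                 # code page exposed to callers

    @property
    def value(self) -> bytes:
        # No conversion needed when the requested encoding is already UTF-8.
        if self.encoding.lower() == "utf-8":
            return self._utf8_value
        # Otherwise convert exactly once, from UTF-8 to the requested code page.
        return self._utf8_value.decode("utf-8").encode(self.encoding)


s = Utf8BackedString(" Ä ")
utf8_bytes = s.value          # b' \xc3\x84 '
s.encoding = "iso8859-1"
iso_bytes = s.value           # b' \xc4 '
print(utf8_bytes, iso_bytes)
```

Because the stored form never changes and the conversion runs only in the getter, reading the value repeatedly cannot stack conversions on top of each other.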

Posted by goo on 26-Sep-2019 14:14

I was concerned by the result of

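def var myISOvalue as char no-undo. /* session:cpinternal = 'iso8859-1' */
def var UTF8value1 as longchar no-undo. /* assumed to be longchars, since they are fix-codepaged */
def var UTF8value2 as longchar no-undo.

myISOvalue = 'Ä'.
fix-codepage(UTF8value1) = 'UTF-8'.
fix-codepage(UTF8value2) = 'UTF-8'.
UTF8value1 = codepage-convert(myISOvalue, 'UTF-8', session:cpinternal).
UTF8value2 = codepage-convert('Ä', 'UTF-8', session:cpinternal).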

These give different values when I copy-lob them to a file with no-convert.

So I just wondered why that happened. myISO is a char, not a longchar.

But at the end, I do not have to use codepage-convert into a fixed longchar…

Posted by slacroixak on 27-Sep-2019 09:14

Hi Goo, I agree with you this is weird. As I was trying to say, this:

UTF8value1 = codepage-convert(myIsoValue, 'UTF-8', session:cpinternal).

may result in applying the conversion from your cpinternal to UTF-8 twice.

If it does so, then I'd consider it as a bug and would open a Tech Support Ticket

IMHO, when the ASSIGN statement assigns the result of the codepage-convert function to a longchar, it should not convert the result a second time, especially if the target code page parameter of codepage-convert matches the code page of the longchar. The whole problem is what to do when these two code pages do not match... raise a runtime error perhaps?

HTH
