LONGCHAR conversion error for German characters

Posted by Andrei Popa on 28-Jun-2017 10:25

I get the following error when I try to add a CHARACTER to a LONGCHAR:
Could not convert LONGCHAR to cpinternal. (11669)

The line where this error occurs is:
cResult = cResult + cValue.
where cResult is LONGCHAR and cValue is CHARACTER

I added the following line before it:
MESSAGE SESSION:CPINTERNAL GET-CODEPAGE(cResult).
and get "UTF-8 UTF-8"

All Replies

Posted by Garry Hall on 28-Jun-2017 10:49

Hmmm... the error message could be more descriptive. There is a problem converting the CHAR variable to a LONGCHAR internally for the concatenation. There is an internal error code that, if displayed, might help pinpoint the cause.

The CHAR variable cValue is in -cpinternal, which is UTF-8. Where did you get this value from? I have a hunch it is malformed UTF-8.

Posted by Andrei Popa on 29-Jun-2017 02:14

cValue's content is taken from a database, whose encoding is also "UTF-8".

In the current version, the value passes through several temp-tables before reaching cValue.

I've been able to write a short example where I can replicate the issue:

DEFINE VARIABLE lcValue AS LONGCHAR NO-UNDO.

MESSAGE SESSION:CPINTERNAL.

FOR FIRST Table1 NO-LOCK:
    lcValue = Table1.Description.
END.

MESSAGE "After for each".
MESSAGE STRING(lcValue).

An error is thrown at:

lcValue = Table1.Description.

and the error message is:

** Unable to update Field. (142)

Posted by Garry Hall on 29-Jun-2017 08:30

In your repro, can you just display the value of Table1.Description? If so, does it look correct when you display it with -cpinternal UTF-8? I suspect that some of the "UTF-8" data in your db is malformed.
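A rough sketch (untested; table and field names are taken from your repro) of one way to narrow down which records trip the error is to loop over the table with NO-ERROR on the assignment and report the RECID of anything that fails:

DEFINE VARIABLE lcValue AS LONGCHAR NO-UNDO.

FOR EACH Table1 NO-LOCK:
    ASSIGN lcValue = Table1.Description NO-ERROR.
    IF ERROR-STATUS:ERROR THEN
        MESSAGE "Suspect record, RECID =" RECID(Table1)
                ERROR-STATUS:GET-MESSAGE(1).
END.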

Posted by Andrei Popa on 29-Jun-2017 08:34

If I try to display it using the DISPLAY/MESSAGE statement, the German characters are not displayed, but if I output it to a file using the EXPORT TO statement, they appear correctly.

Posted by Garry Hall on 29-Jun-2017 08:48

There might be some confusion when they "appear correctly" in the output file. Whatever you use to view the file makes an assumption of the encoding of the file. If you EXPORT with -cpinternal UTF-8 -cpstream UTF-8, the German characters should be 2-byte characters. If it is malformed, and actually in 1252, then when you view it with a text editor that assumes 1252, it will appear to be correct. You'd probably need a hex viewer to view the raw bytes in the output file to make sure. For example, take ß (U+00DF). Its UTF-8 encoding is the two bytes 0xC3 0x9F. This is what you would expect to see in a file EXPORTed with -cpstream UTF-8. However, if the character was instead represented as 0xDF, then the data is actually encoded as 1252.
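A quick in-session check along the same lines (just a sketch; it assumes the .p source file is itself saved in the session codepage) is to compare byte length and character length, since a correctly encoded German character in a UTF-8 session occupies two bytes but counts as one character:

DEFINE VARIABLE cEszett AS CHARACTER NO-UNDO INITIAL "ß".

/* In a -cpinternal UTF-8 session this should show 2 (bytes) and 1 (character). */
MESSAGE LENGTH(cEszett, "RAW") LENGTH(cEszett, "CHARACTER") VIEW-AS ALERT-BOX.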

Posted by Andrei Popa on 29-Jun-2017 10:44

Yes, it seems that it is indeed an issue with the data being in a different encoding.

I am using notepad++ to view the file. Using the HexEditor, the character Ä appears as 0xC4.

Can the data in the database be in a different encoding than the database encoding itself?

Posted by Garry Hall on 29-Jun-2017 10:57

CHARACTER field data in the database is always intended to be in the database's codepage. This allows the AVM to provide automatic codepage conversion between the database and the client session's -cpinternal. It also ensures indexing works correctly.

However, it is possible to insert data that is not in the database codepage, either by bypassing the automatic conversions the AVM provides or by not appreciating the encoding of the input file. For example, if you have a file encoded as 1252, and an AVM running -cpinternal UTF-8 -cpstream UTF-8, and you INPUT FROM the file without specifying CODEPAGE SOURCE "1252", then you are telling the AVM that the data is in UTF-8. As -cpinternal == -cpstream, no codepage conversion is done when reading this data. For performance reasons, the data is not validated to ensure it is correctly encoded as UTF-8 (I know of a customer situation where there was a request to add this validation, but I can't remember the exact details or the outcome).
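A minimal sketch of the kind of conversion described above (placeholder file name; check the exact CONVERT phrase against the INPUT FROM documentation for your release):

DEFINE VARIABLE cLine AS CHARACTER NO-UNDO.

/* Tell the AVM the file really is 1252; it then converts to -cpinternal (UTF-8). */
INPUT FROM "descriptions.txt" CONVERT SOURCE "1252".
REPEAT:
    IMPORT UNFORMATTED cLine.
    /* cLine now holds correctly encoded UTF-8 in the session. */
END.
INPUT CLOSE.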

Posted by Garry Hall on 29-Jun-2017 11:04

To complete this thought: the data would now be in the UTF-8 database, but is not encoded correctly, so any use of it that will try to consume the bytes as characters (counting their length, indexing them) will result either in errors or (worse) unexpected behaviour.
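For completeness, a sketch (not from this thread, and only after taking a backup) of how such records could be repaired once the bad bytes are already sitting in a UTF-8 database, assuming the stored bytes really are 1252:

FOR EACH Table1 EXCLUSIVE-LOCK:
    /* Only run this on records known to hold 1252 bytes; converting valid
       UTF-8 data with a 1252 source codepage would corrupt it. */
    Table1.Description = CODEPAGE-CONVERT(Table1.Description, "UTF-8", "1252").
END.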

Posted by Andrei Popa on 29-Jun-2017 17:23

Thank you very much for the help!

The data is indeed read from a file, but the file's encoding is UTF-8.

I'll look into what is causing this issue.

This thread is closed