Code page validation

Posted by Mike Fechner on 21-Feb-2014 04:30

OpenEdge 11.1

Given the following scenario: OpenEdge GUI for .NET Client and AppServer, CPINTERNAL utf-8, Database code page also utf-8.

In parallel TTY clients on UNIX running single byte code pages.

In areas of the application that are relevant for compatibility with the legacy TTY application, we need to ensure (validate) that character data entered on some of the GUI screens does not cause code page issues in the legacy application, even if the issue is just that characters are lost in terminals or other output.

So for some tables/fields we need to prevent users of the GUI for .NET application from entering characters that do not fit into the TTY application's code page. As an example, I may need to ensure that certain text fields contain no Turkish special characters, because the TTY application running iso8859-1 could not handle them.

What's the best way to validate this? I'm not keen on parsing strings myself and validating the ASC value of each character in question.

One solution would be to try to assign the CHARACTER in question to a LONGCHAR fixed to iso8859-1 and see if that throws an error.

ROUTINE-LEVEL ON ERROR UNDO, THROW.

/* ***************************  Main Block  *************************** */


DEFINE VARIABLE cTest AS CHARACTER NO-UNDO.
DEFINE VARIABLE lcTest AS LONGCHAR NO-UNDO.

/* Turkish s with cedilla
 * U+015F => UTF-8 hex 0xc59f UTF-8 dec 50591 */
cTest = CHR (50591, "utf-8") .

/*cTest = "ä" .*/

MESSAGE cTest SKIP ASC (cTest) SKIP
    VIEW-AS ALERT-BOX.

FIX-CODEPAGE (lcTest) = "iso8859-1" .
/*FIX-CODEPAGE (lcTest) = "1254" .*/

DO ON ERROR UNDO, THROW:

    /* Attempt to assign cTest to LONGCHAR fixed to iso8859-1 */
    lcTest = cTest . 

    CATCH err AS Progress.Lang.Error:
        IF err:GetMessageNum (1) = 142 THEN 
            MESSAGE "Codepage problem...."
                VIEW-AS ALERT-BOX.
        ELSE 
            UNDO, THROW err . 
    END CATCH.
END.

MESSAGE STRING (lcTest) 
        SKIP 
        cTest ASC (cTest)
    VIEW-AS ALERT-BOX.


Are there other solutions? Solutions that work for a complete temp-table record or ProDataset at once? 

All Replies

Posted by sgarg on 21-Feb-2014 04:48

Hi Mike,

I am guessing you can compare the character length vs. the byte length of the string that you read in your character client, using the ABL LENGTH function. If the two lengths are not equal, that tells you the string contains characters that are not compatible; otherwise the lengths will be equal.
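For example (untested sketch; LENGTH with the "RAW" option returns the length in bytes rather than in characters):

DEFINE VARIABLE cValue AS CHARACTER NO-UNDO.

/* In a UTF-8 session, any non-ASCII character occupies more
   than one byte, so the byte length exceeds the character length */
cValue = "Straße".

IF LENGTH (cValue, "CHARACTER") <> LENGTH (cValue, "RAW") THEN
    MESSAGE "String contains multi-byte (non-ASCII) characters."
        VIEW-AS ALERT-BOX.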

Thanks,

Sachin

Posted by Mike Fechner on 21-Feb-2014 05:02

Hi Sachin,

I am not sure this will work. The validation needs to happen on the AppServer.

And we need to allow the characters that do fit into the TTY client's code page. German umlauts, for instance, are also double-byte in UTF-8 but are fine for iso8859-1 clients.

Posted by tbergman on 21-Feb-2014 05:55

Hi Mike,

Here's a method we use to make sure the characters can be converted to Windows 1252.

METHOD PUBLIC STATIC LOGICAL is1252( pChar AS CHARACTER ):

   /* No need to check if our session is 1252. This is done so this
      code will never fail during our transition to the UTF-8 client */
   IF SESSION:CPINTERNAL EQ "1252" THEN RETURN TRUE.

   /* The logic here uses the fact that CODEPAGE-CONVERT will turn
      characters that can't be converted into question marks.
      If the number of question marks changes, then something was
      not convertible */
   IF NUM-ENTRIES(pChar, "?") NE
      NUM-ENTRIES(CODEPAGE-CONVERT(pChar, "1252"), "?") THEN RETURN FALSE.
   ELSE
      RETURN TRUE.

END METHOD.
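Hypothetical usage (the class name CharsetHelper is just for illustration):

DEFINE VARIABLE cInput AS CHARACTER NO-UNDO.

/* Turkish s with cedilla (U+015F) is not part of Windows-1252 */
cInput = CHR (50591, "utf-8").

IF NOT CharsetHelper:is1252 (cInput) THEN
    MESSAGE "Value contains characters outside Windows-1252."
        VIEW-AS ALERT-BOX.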

Posted by Mike Fechner on 21-Feb-2014 06:02

Hi Thomas,

Also a valid approach. Similar to mine: it requires checking string by string.

I will try both to see which one runs faster.

Does nobody know a solution that works on full records or ProDatasets without iterating over all fields/records/tables?

Posted by Garry Hall on 21-Feb-2014 08:12

For a dataset or temp-table, maybe try converting to XML or JSON. Some quick test code (run with -cpinternal UTF-8):

define temp-table tt1 no-undo
   field f1 as char
   index ix1 f1.

define temp-table tt2 no-undo
   field f1 as char
   field f2 as char
   index ix2 f1 f2.

define dataset ds1 for tt1, tt2
   data-relation dr1 for tt1, tt2
   relation-fields(f1,f1).

DEFINE VARIABLE lcds AS LONGCHAR NO-UNDO.

DO transaction:
   create tt1.
   assign tt1.f1 = "A".
   create tt2.
   assign
       tt2.f1 = tt1.f1
       /* Turkish lowercase dotless i
        * U+0131 => UTF-8 hex 0xc4b1 UTF-8 dec 50353 */
       tt2.f2 = CHR(50353).
END.

/* write to XML encoded with 1252 */
fix-codepage(lcds) = "1252".
dataset ds1:handle:write-xml(
   "LONGCHAR",
   lcds,
   false /* formatted */,
   "1252" /* encoding */).

This gives me an error message:

   Invalid encoding for WRITE-XML. (13515)

Posted by Garry Hall on 21-Feb-2014 08:18

Hmmm… scratch that. The error is about the codepage I specified, not a failure in WRITE-XML. I will look to see if I can refine this to give a message during conversion.
 
 

Posted by Garry Hall on 21-Feb-2014 08:20

The code should read "windows-1252" for the WRITE-XML encoding.

But sadly, when the code is correct, there is no error. Instead, the character is written escaped:

<?xml version="1.0" encoding="windows-1252"?>
<ds1 xmlns:xsi="www.w3.org/.../XMLSchema-instance">
 <tt1>
   <f1>A</f1>
 </tt1>
 <tt2>
   <f1>A</f1>
   <f2>&#x131;</f2>
 </tt2>
</ds1>

Not sure if there is much you can do with that.
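I suppose you could scan the resulting LONGCHAR for numeric character references, though that feels fragile. An untested sketch (it assumes "&#" only appears in the output for escaped characters):

DEFINE VARIABLE iPos AS INTEGER NO-UNDO.

/* WRITE-XML escapes characters the target encoding cannot
   represent as numeric character references, e.g. &#x131; */
iPos = INDEX (lcds, "&#").

IF iPos > 0 THEN
    MESSAGE "Escaped character found at offset" iPos
        VIEW-AS ALERT-BOX.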

Posted by Garry Hall on 21-Feb-2014 08:38

My last attempt, I promise. Instead of WRITE-XML() to a LONGCHAR of the target codepage, write to a UTF-8 LONGCHAR, then COPY-LOB it to a LONGCHAR of the target codepage. e.g.

define temp-table tt1 no-undo
   field f1 as char
   index ix1 f1.

define temp-table tt2 no-undo
   field f1 as char
   field f2 as char
   index ix2 f1 f2.

define dataset ds1 for tt1, tt2
   data-relation dr1 for tt1, tt2
   relation-fields(f1,f1).

DEFINE VARIABLE lcds AS LONGCHAR NO-UNDO.
DEFINE VARIABLE lcds2 AS LONGCHAR NO-UNDO.

DO transaction:
   create tt1.
   assign tt1.f1 = "A".
   create tt2.
   assign
       tt2.f1 = tt1.f1
       /* Turkish lowercase dotless i
        * U+0131 => UTF-8 hex 0xc4b1 UTF-8 dec 50353 */
       tt2.f2 = CHR(50353).
END.

fix-codepage(lcds) = "UTF-8".
dataset ds1:handle:write-xml(
   "LONGCHAR",
   lcds,
   true /* formatted */,
   "UTF-8" /* encoding */).

fix-codepage(lcds2) = "1252".
copy-lob lcds to lcds2.

This gives me the following error:

 Large object assign or copy failed. (11395)

It is a vague error message; it doesn't explain exactly what the problem is, but it might at least flag that further investigation is warranted. I believe it will be faster than a char-by-char comparison written in ABL. Depending on the size of your dataset, though, the memory consumption of the LONGCHARs could be significant.

Posted by Mike Fechner on 23-Feb-2014 03:28

Hi Garry, thanks for everything you tried! :-)

The last one looks a lot like the one I used for a single string: assign to a LONGCHAR fixed to the target code page and see if it errors out. I guess I'll go with that one for the ProDatasets.
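Something like this untested sketch is what I have in mind, combining your WRITE-XML / COPY-LOB approach with the CATCH block from my first post (assuming error 11395 is the one raised when the conversion fails):

FUNCTION isDatasetConvertible RETURNS LOGICAL
    (phDataset AS HANDLE, pcTargetCodepage AS CHARACTER):

    DEFINE VARIABLE lcUtf8   AS LONGCHAR NO-UNDO.
    DEFINE VARIABLE lcTarget AS LONGCHAR NO-UNDO.

    FIX-CODEPAGE (lcUtf8)   = "utf-8".
    FIX-CODEPAGE (lcTarget) = pcTargetCodepage.

    /* Serialize the whole ProDataset into a UTF-8 LONGCHAR first */
    phDataset:WRITE-XML ("LONGCHAR", lcUtf8, FALSE, "utf-8").

    /* COPY-LOB performs the code page conversion and raises an
       error when a character does not fit the target code page */
    COPY-LOB lcUtf8 TO lcTarget.

    RETURN TRUE.

    CATCH err AS Progress.Lang.Error:
        IF err:GetMessageNum (1) = 11395 THEN
            RETURN FALSE.
        UNDO, THROW err.
    END CATCH.
END FUNCTION.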

Cheers,

Mike

This thread is closed