Is there a plain ASCII code page?

Posted by cverbiest on 08-Apr-2016 07:44

We'd like to convert text containing accented characters such as é and è to plain ASCII (characters below chr(128)).

élève => eleve

for search purposes.

Is there something like a plain ASCII code page (only characters with an ASCII value below 128) so that we could use CODEPAGE-CONVERT?

If not, are there other methods we could use for this?

All Replies

Posted by Garry Hall on 08-Apr-2016 08:10

I don't believe we have something like this in OE.

One thought would be your own case-mapping table, but that would mean your entire session has to run like that, which might have unintended consequences elsewhere.

If you are using a single-byte codepage, you are probably better off providing your own function to scan through the string char by char, and provide the mapping yourself. You might be able to use collations to speed this up, as a lot of the collations map the accented chars to the base char. I can expand on this if you are interested.
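The character-by-character mapping could be sketched roughly as follows. This is an untested sketch that assumes -cpinternal ISO8859-1, so that ASC returns the Latin-1 code of each character. The function name stripAccentsLatin1 and the code ranges are illustrative assumptions, not something OE provides, and only lowercase letters are covered.

```abl
/* Hypothetical helper: maps lowercase accented Latin-1 letters to their
   base letter. Assumes -cpinternal ISO8859-1 (ASC returns Latin-1 codes). */
FUNCTION stripAccentsLatin1 RETURNS CHARACTER (INPUT pcText AS CHARACTER):
    DEFINE VARIABLE cResult AS CHARACTER NO-UNDO.
    DEFINE VARIABLE cChar   AS CHARACTER NO-UNDO.
    DEFINE VARIABLE iCode   AS INTEGER   NO-UNDO.
    DEFINE VARIABLE i       AS INTEGER   NO-UNDO.

    DO i = 1 TO LENGTH(pcText):
        cChar = SUBSTRING(pcText, i, 1).
        iCode = ASC(cChar).

        /* Latin-1 code ranges for the accented lowercase letters */
        IF      iCode >= 224 AND iCode <= 229 THEN cChar = "a". /* à á â ã ä å */
        ELSE IF iCode  = 231                  THEN cChar = "c". /* ç */
        ELSE IF iCode >= 232 AND iCode <= 235 THEN cChar = "e". /* è é ê ë */
        ELSE IF iCode >= 236 AND iCode <= 239 THEN cChar = "i". /* ì í î ï */
        ELSE IF iCode  = 241                  THEN cChar = "n". /* ñ */
        ELSE IF iCode >= 242 AND iCode <= 246 THEN cChar = "o". /* ò ó ô õ ö */
        ELSE IF iCode >= 249 AND iCode <= 252 THEN cChar = "u". /* ù ú û ü */
        ELSE IF iCode  = 253 OR  iCode  = 255 THEN cChar = "y". /* ý ÿ */

        cResult = cResult + cChar.
    END.

    RETURN cResult.
END FUNCTION.
```

Comparing integer codes rather than the characters themselves sidesteps the collation, which would otherwise treat "é" and "e" as equal in an ordinary = comparison.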

If you are looking for a more Unicode-oriented approach (i.e. you are using -cpinternal UTF-8), another thought would be to NORMALIZE your string to NFD or NFKD, then scan through it character by character and skip any character that is not ASCII. Theoretically, at least; I have not tried this in practice. It would also drop any characters that are not accented forms of ASCII letters (e.g. Cyrillic, Thai, Japanese), but maybe for your purpose this is acceptable. A more robust approach would be to call the ICU libraries directly (Google suggests ways to do this, using a Transliterator).
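A minimal sketch of the NORMALIZE idea, untested like the suggestion itself. It assumes -cpinternal UTF-8 and an OE release that has the NORMALIZE function; the function name stripAccents is hypothetical.

```abl
/* Hypothetical helper: decompose to NFD, then keep only ASCII characters.
   Assumes -cpinternal UTF-8 and availability of the NORMALIZE function. */
FUNCTION stripAccents RETURNS CHARACTER (INPUT pcText AS CHARACTER):
    DEFINE VARIABLE cDecomposed AS CHARACTER NO-UNDO.
    DEFINE VARIABLE cResult     AS CHARACTER NO-UNDO.
    DEFINE VARIABLE cChar       AS CHARACTER NO-UNDO.
    DEFINE VARIABLE iCode       AS INTEGER   NO-UNDO.
    DEFINE VARIABLE i           AS INTEGER   NO-UNDO.

    /* NFD splits "é" into "e" followed by a combining accent */
    cDecomposed = NORMALIZE(pcText, "NFD").

    DO i = 1 TO LENGTH(cDecomposed):
        cChar = SUBSTRING(cDecomposed, i, 1).
        iCode = ASC(cChar).

        /* Keep plain ASCII only; combining marks and all other
           non-ASCII characters are dropped */
        IF iCode >= 0 AND iCode < 128 THEN
            cResult = cResult + cChar.
    END.

    RETURN cResult.
END FUNCTION.

MESSAGE stripAccents("élève") VIEW-AS ALERT-BOX.
```

If NORMALIZE behaves as described, the intent is that "élève" comes back as "eleve". Characters with no ASCII base letter, such as Cyrillic, simply disappear from the result.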

Posted by Garry Hall on 08-Apr-2016 08:14

Or, as your post first suggested, for single-byte you could add your own codepage conversion table from your required codepage to plain ASCII, and use CODEPAGE-CONVERT. I don't believe we provide such mappings in OE today. This would be much faster to execute than my first suggestions :-)

Posted by jbijker on 08-Apr-2016 08:36

OE has built-in support for special characters mapped to normal equivalents.

Try this:

MESSAGE "eleve" = "élève" VIEW-AS ALERT-BOX.

It comes back with

---------------------------
Message (Press HELP to view stack trace)
---------------------------
yes
---------------------------
OK   Help
---------------------------

It also applies to queries from DB, e.g.

DEFINE TEMP-TABLE ttTest
  FIELD cField AS CHARACTER.

CREATE ttTest. ASSIGN ttTest.cField = "eleve".
CREATE ttTest. ASSIGN ttTest.cField = "élève".

FOR EACH ttTest
  WHERE ttTest.cField = "eleve":
  DISPLAY ttTest.
END.

This will return both records for you.

I don't know if it internally converts it to basic text and if you can call that API. But my question is: do you still need to do this if OE already does it for you?

Posted by gus on 08-Apr-2016 08:42

> On Apr 8, 2016, at 9:37 AM, jbijker wrote:

>

> MESSAGE "eleve" = "élève" VIEW-AS ALERT-BOX.

fyi, the collation table assigns those characters the same sort weights and that is what makes these comparisons come out equal.

Posted by Garry Hall on 08-Apr-2016 08:53

As Gus pointed out, those tests are exercising the collation. The collation doesn't give you a string with the accents removed. However, the original question indicated the stripped string was for search purposes, so maybe (depending on what the application is) use of the collation is sufficient, and there is no need to strip the accents.
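To see that it is the collation making the strings compare equal, rather than any stripping of accents, the COMPARE function with the "RAW" strength can be set against the ordinary = operator. A small sketch; the variable names are illustrative, and the result of = depends on the collation table your session uses.

```abl
DEFINE VARIABLE cPlain    AS CHARACTER NO-UNDO INITIAL "eleve".
DEFINE VARIABLE cAccented AS CHARACTER NO-UNDO INITIAL "élève".
DEFINE VARIABLE lCollated AS LOGICAL   NO-UNDO.
DEFINE VARIABLE lRaw      AS LOGICAL   NO-UNDO.

/* Ordinary comparison: goes through the session collation table */
lCollated = (cPlain = cAccented).

/* RAW strength: compares the underlying byte values, no collation */
lRaw = COMPARE(cPlain, "=", cAccented, "RAW").

MESSAGE "Collation-based:" lCollated SKIP
        "RAW:" lRaw
    VIEW-AS ALERT-BOX.
```

With a collation like the one in the test above, the first comparison should report yes while the RAW comparison should report no, since the stored strings still differ.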
