Byte-Order-Marker in Progress 11 UTF8 string

Posted by jquerijero on 17-Jan-2013 14:25

We have a custom .NET dll that uses Encoding.UTF8.GetString(byte[]) to generate a string, which is eventually passed to our ABL code. After upgrading our application to Progress 11, the generated string has a BOM indicator (shown as ?) at its beginning. In our version 10.2B application it doesn't. The custom dll hasn't been changed or recompiled.

Is this a new Progress 11 string parameter passing behavior between .NET assembly and ABL code?

All Replies

Posted by gus on 24-Jan-2013 14:51

Since UTF-8 is a bytewise encoding by design, UTF-8 strings are not supposed to have byte-order markers. There is only one possible byte order, so byte-order markers have no meaning in UTF-8.

I don't know where the byte order markers you see are coming from. However, some Microsoft tools like Notepad insert the byte order markers in error and many other programs have subsequently been updated to ignore the byte order markers.

Posted by jquerijero on 31-Jan-2013 17:40

Something is definitely different between version 11 and 10.2B.

Here is the piece of C# code that has the problem:

        private static string ByteArrayToString(byte[] characters)
        {
            string constructedString = Encoding.UTF8.GetString(characters);
            constructedString = constructedString.Trim();
            MessageBox.Show(((int)constructedString[0]).ToString() + " '" + constructedString.Substring(0, 3) + "'");
            return (constructedString);
        }

In Progress 10.2B, the MessageBox displays: 60 '

In Progress 11.1, the MessageBox displays: 65279 'ZERO WIDTH NO-BREAK SPACE)

The byte[] does include the UTF-8 BOM bytes (239, 187, 191, i.e. 0xEF 0xBB 0xBF). In 10.2B, the BOM chars appear to be removed when Encoding.UTF8.GetString() is called. In 11, it looks like the BOM chars are instead replaced by the deprecated 'ZERO WIDTH NO-BREAK SPACE' char (U+FEFF, decimal 65279).

Why is there a difference in behavior?

NOTE: The same dll (same file) is used in both 10.2B and 11.1.

Posted by Garry Hall on 31-Jan-2013 21:22

If I understand this information correctly, the bytes passed to ByteArrayToString contain the BOM encoded in UTF-8. Within this method, you call Encoding.UTF8.GetString() to convert these from UTF-8 bytes to a String. You then display the characters within the string within this method. As far as I can tell, this is all happening within .NET, there is no call into ABL anywhere in this.

The only difference I can think of that would be in play here is the .NET version: 11.X uses .NET 4.0, 10.2B uses .NET 2.0 -> 3.5. If you run your method in a pure .NET environment (no ABL involved), do you see the same difference in behaviour between .NET 2.0 and .NET 4.0?
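
A pure .NET check of this kind could look like the sketch below. It is only an illustration, not code from the dll in question: the class name BomRepro, the helper FirstCodePoints, and the sample bytes are all made up for the test. Building it once against .NET 3.5 and once against .NET 4.0 and comparing the printed code points shows whether the difference comes from the framework rather than from ABL.

```csharp
using System;
using System.Text;

// Hypothetical console repro; none of these names come from the original dll.
static class BomRepro
{
    // Returns the first code unit after decoding and after decoding + Trim().
    public static int[] FirstCodePoints(byte[] bytes)
    {
        string decoded = Encoding.UTF8.GetString(bytes);
        string trimmed = decoded.Trim();
        return new int[] { (int)decoded[0], (int)trimmed[0] };
    }

    static void Main()
    {
        // UTF-8 BOM (0xEF 0xBB 0xBF) followed by the ASCII bytes for "<a>".
        byte[] bytes = { 0xEF, 0xBB, 0xBF, 0x3C, 0x61, 0x3E };
        int[] result = FirstCodePoints(bytes);

        Console.WriteLine("Runtime version:         " + Environment.Version);
        Console.WriteLine("First char after decode: " + result[0]);
        Console.WriteLine("First char after Trim(): " + result[1]);
    }
}
```

Printing the value after each step separately makes it possible to tell which call, GetString() or Trim(), behaves differently between the two runtimes.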

Posted by jquerijero on 01-Feb-2013 10:07

It looks like this is related to .NET Framework 4. A VS2010 project shows the same character at the beginning of the string, but the special character is dropped when the string is written to a file. That's why I didn't catch it immediately when I was tinkering with Visual Studio: I was looking at the output file.

Posted by jquerijero on 04-Feb-2013 15:48

BY THE WAY:

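It is System.String.Trim(), not UTF8.GetString(), that changed between .NET 3.5 and 4.0. In 3.5 and earlier, Trim() removed U+FEFF (ZERO WIDTH NO-BREAK SPACE) as if it were white space; from 4.0 on it no longer does, so the decoded BOM stays at the start of the string.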
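
Given that, one way to get the same result on both framework versions is to skip the BOM explicitly instead of relying on Trim(). The following is a minimal sketch, assuming the input is always UTF-8; BomHelper and DecodeUtf8WithoutBom are hypothetical names, not part of the original dll.

```csharp
using System;
using System.Text;

// Hypothetical helper; a sketch, not the original dll's code.
static class BomHelper
{
    public static string DecodeUtf8WithoutBom(byte[] bytes)
    {
        // For Encoding.UTF8 the preamble is the 3-byte BOM: 0xEF 0xBB 0xBF.
        byte[] preamble = Encoding.UTF8.GetPreamble();
        int offset = 0;

        if (bytes.Length >= preamble.Length)
        {
            bool hasBom = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (bytes[i] != preamble[i]) { hasBom = false; break; }
            }
            if (hasBom) offset = preamble.Length;
        }

        // Decode only the bytes after the BOM, then trim ordinary white space.
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset).Trim();
    }
}
```

In ByteArrayToString, the GetString() and Trim() calls could then be replaced by a single call to BomHelper.DecodeUtf8WithoutBom(characters). A shorter alternative, if a BOM in the middle of the data is not a concern, is constructedString.TrimStart('\uFEFF') before the existing Trim().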
