Invalid Unicode characters in XML 1.0 document

Posted by martinz on 26-Sep-2019 07:26

Hi,

In our ERP system we export some data to xml-files. We do this by buffer-copying rows from tables to temp-tables, and then exporting these using WRITE-XML. This results in a file starting with <?xml version="1.0" encoding="UTF-8"?>. So it's an xml 1.0 file.

Some of our customers have data in the tables with unicode characters, for instance U+1A. In our application, that's perfectly valid data. However, it is an invalid character in xml 1.0. It's valid in xml 1.1, but only when it's written as &#x1A;

So we have two issues here:

1. As far as I can tell, WRITE-XML always writes xml 1.0 files. Is it possible to write xml 1.1 files? If so, how? I can't find anything in the docs. We could just replace the header of the resulting document, but that's a hack.

2. We need to rewrite almost all invalid characters from U+01 until U+1F to the &#x__; form. Is there a smart way to do this? As we are always interested in complete records, buffer-copy is the most efficient way for us to copy the records to the temp-table. When we need to sanitize the data, we could do a replace for all these characters on all character fields, but that's a performance issue waiting to happen.

Any thoughts? Experiences?

TIA

Martin Zantingh

All Replies

Posted by Stefan Drissen on 26-Sep-2019 08:16

1. no - the number of xml 1.1 consumers is also very disappointing

2a. when exporting end user xml we strip out all invalid asc values - it is not smart

2b. for all other use we try to use write-json which handles it all a lot better, but does miss a 'write-schema' option, but you can combine write-xmlschema with write-json

This thread is closed