UTF-8 byte-order mark problem

2011-07-12: SAT appear to have updated both their CFD and CFDi validator sites with stricter requirements for UTF-8 files. If you do not have a UTF-8 byte-order mark (BOM) at the beginning of your XML file, you will get this error:

A byte-order mark for UTF-8 is the sequence of three bytes (0xEF,0xBB,0xBF) at the beginning of a file.
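As a rough illustration of what those three bytes mean in practice, here is a minimal Python sketch that checks for, adds and removes a UTF-8 BOM on raw file contents. The function names are our own invention for this example and are not part of FirmaSAT or of any SAT tool.

```python
import codecs

# The standard library already defines the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte-order mark."""
    return data.startswith(UTF8_BOM)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the BOM unless it is already there (hypothetical helper)."""
    return data if has_utf8_bom(data) else UTF8_BOM + data


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading BOM if present (hypothetical helper)."""
    return data[len(UTF8_BOM):] if has_utf8_bom(data) else data
```

These helpers only look at the first three bytes. They do not check that the rest of the data really is UTF-8.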

Comment

This is a ridiculous requirement and demonstrates yet again that the technical skills inside the SAT are sadly lacking.

There is no need for this. The XML standard allows a BOM in a UTF-8 document but does not require one. In the absence of any other indication, an XML file is assumed to be UTF-8 anyway. The encoding can be (and should be) stated in the XML declaration at the beginning of the document: <?xml version="1.0" encoding="UTF-8"?>. There is a simple algorithm to detect whether or not a file contains valid UTF-8 characters without the need for a BOM. It is totally unnecessary.
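To illustrate the point that a BOM is allowed but not needed, the sketch below feeds the same small document to Python's standard XML parser with and without the BOM. The element name "factura" and its attribute are made up for this example and are not taken from the CFD or CFDi schemas.

```python
import xml.etree.ElementTree as ET

# A tiny UTF-8 document with non-ASCII characters (hypothetical element name).
DOC = '<?xml version="1.0" encoding="UTF-8"?><factura emisor="Núñez"/>'.encode("utf-8")


def parse_emisor(data: bytes) -> str:
    """Parse the bytes as XML and return the 'emisor' attribute of the root."""
    return ET.fromstring(data).get("emisor")


# Same document, once as-is and once with the three BOM bytes in front.
without_bom = parse_emisor(DOC)
with_bom = parse_emisor(b"\xef\xbb\xbf" + DOC)
```

Both calls should give the same attribute value. This shows how one conforming parser behaves. It says nothing about what the SAT validator or any particular PAC will accept.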

Worse, this will cause lots of existing invoicing software and XML utilities to break for no reason. The XML functions in PHP, for instance, cannot handle it, so we are told. Delphi programmers are having difficulties, it seems. Is this because bl**dy Microsoft adds it unnecessarily to every UTF-8 file it creates? Or is this just a cynical way to put rival software providers out of business?

Fix available with FirmaSAT 4.1 and above

Anyway, this issue has been addressed in FirmaSAT 4.1.0, released 14 July 2011. XML files created by the SignXML functions will now include this BOM. There are also new functions and methods provided to add the BOM to an existing file (any file, not just an XML document, provided it actually is UTF-8). See the New functions and methods below and Fix UTF-8 BOM in the manual.

New functions and methods

Anti-fix available for FirmaSAT 5.2/5.3

Update December 2013: It now seems that some PACs will actually reject an XML document that has a UTF-8 byte-order mark. What can we say? If this is a problem for you, then FirmaSAT 5.3 has the option not to include a UTF-8 byte order mark in the output when doing a Sign XML operation:

Why use a BOM?

A byte order mark (BOM) was designed for UTF-16 files ("Unicode" in old Microsoft-speak, as opposed to ANSI). In general, UTF-16 files store their characters in two bytes (there are exceptions: characters outside the Basic Multilingual Plane, above U+FFFF, take four bytes as a surrogate pair, but those are uncommon in everyday text, so for most practical purposes you can consider it a double-byte character set).

The catch is that there are two types of platforms: some systems store these pairs with the most significant byte first (big-endian) and some store the pairs the other way round (little-endian). All Windows systems are little-endian; Motorola systems are big-endian. When one system writes UTF-16 characters to a file, it could store the byte pairs in one particular order. Another system reading the file could read it in the wrong order. We need some kind of flag to let an interpreting program know which way round they have been written.

The idea of a byte order mark (note the word "order" here) is that you could insert the valid Unicode code point U+FEFF (zero width no-break space) at the beginning of the file. In UTF-16 this is represented by the two bytes FE and FF. If you found FE followed by FF then you knew you had a big-endian order and you could read the rest of the file accordingly. If you found FF and FE, then you knew you had a little-endian order. This character should not affect the display of the actual text, being zero-width and no-break.
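The check described above can be sketched in a few lines of Python. This assumes you already know the data is UTF-16 and only want to know which way round the byte pairs are. The function name is ours, chosen for illustration.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown"  # no BOM, so the reader has to find out some other way
```

Without the mark, the bytes 00 41 and 41 00 could each be the letter "A" or something else entirely, depending on which order the writer used.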

The UTF-8 representation of the code point U+FEFF (zero width no-break space) is the three-byte sequence EF BB BF. You can put it there if you want. All systems will write it the same way. A program reading the file can find it and know it's meant to be dealing with a file of characters encoded in UTF-8, but in no way does it add to the program's knowledge of the order of the bytes it might find later on. UTF-8 files don't have a byte order that can change.
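You can see the difference by encoding the one code point U+FEFF in each form. A small Python sketch (the helper name is ours):

```python
BOM_CHAR = "\ufeff"  # the code point U+FEFF, zero width no-break space


def bom_bytes(encoding: str) -> bytes:
    """Return the bytes that represent U+FEFF in the given encoding."""
    return BOM_CHAR.encode(encoding)


for name in ("utf-16-be", "utf-16-le", "utf-8"):
    # bytes.hex() with a separator needs Python 3.8 or later
    print(name, bom_bytes(name).hex(" "))
```

The two UTF-16 forms differ from each other because the byte order differs. The UTF-8 form has only one possible spelling, which is why it tells you nothing about byte order.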

There is a simple algorithm that can detect if characters not encoded in UTF-8 are present. A simple ASCII text file is a valid UTF-8 file, which is a deliberate part of its design. It doesn't need a BOM to announce its nature. At best a BOM is a weak signal that you are meant to have UTF-8 encoded characters following (it doesn't guarantee that you do, of course). At worst it breaks your program.
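As a sketch of such a check, the Python function below leans on the language's built-in strict UTF-8 decoder rather than spelling out the byte-by-byte rules. The decoder rejects any byte sequence that is not well-formed UTF-8. The function name is ours.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Plain ASCII passes, as does properly encoded accented text. The same accented text saved as Latin-1 fails, with or without a BOM in front of it. Note that a pass only means the bytes are well-formed UTF-8, not that the author intended them as such.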

But making it a compulsory requirement for accepting or rejecting a file? Absolutely ridiculous!