XMLParser in Pharo claims U+00A0 is "Invalid UTF-8"

Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?> 
<sms body=". what" />

where the character after the period in the body attribute of the sms tag is U+00A0,

I get the error:

XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of this character is 0xC2 0xA0, per Wikipedia. Indeed, bytes 72 and 73 of the input are 194 and 160, respectively.
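The byte values quoted above can be checked independently of Pharo. A minimal sketch in Python (chosen only as a neutral way to inspect bytes; the variable names are illustrative):

```python
# U+00A0 is NO-BREAK SPACE; encode it as UTF-8 and inspect the bytes.
nbsp = "\u00a0"
encoded = nbsp.encode("utf-8")

print([hex(b) for b in encoded])  # ['0xc2', '0xa0']
print(list(encoded))              # [194, 160]
```

So the two bytes in the file are a well-formed UTF-8 sequence, which suggests the problem lies in how the input reaches the parser rather than in the file itself.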

This seems like a bug in XMLParser, or am I missing something?


2016-07-28 Sean DeNigris

Can't reproduce: 'XMLDOMParser parse: ' ''

Thanks to Monty for coming to the rescue on the Pharo User's list:

You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.

Longer explanation:

The class-side #on:/#parse: messages take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then.

File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

The input starts with a BOM, or the encoding can be inferred from null bytes before or after the first non-null byte.

There is an encoding declaration with a non-UTF-8 encoding.

There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
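The double decoding described above can be reproduced outside Pharo. The following Python sketch is only a model of the mechanism, not XMLParser's actual code: it assumes the second decoder treats each already-decoded character's code point as if it were a raw byte, which is emulated here with a latin-1 round trip.

```python
# The second line of the input, as UTF-8 bytes on disk.
raw = '<sms body=".\u00a0what" />'.encode("utf-8")

# First decoding: what a decoding file stream hands to the parser.
once = raw.decode("utf-8")

# Emulate a second decoder reading those characters as bytes
# (latin-1 maps code points 0-255 to the same byte values).
as_bytes = once.encode("latin-1")

try:
    as_bytes.decode("utf-8")
    double_decode_failed = False
except UnicodeDecodeError as err:
    double_decode_failed = True
    bad_offset = err.start  # zero-based offset of the offending byte

print(double_decode_failed)  # True
print(bad_offset)            # 12
```

After the first pass the character is the single value 0xA0 (160). On its own, 0xA0 is a UTF-8 continuation byte with no lead byte, so the second pass rejects it. The zero-based offset 12 corresponds to column 13, the position reported in the error message.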


2016-08-08 12:45:20
