Commit 6b2add5e authored by Simon McVittie

Accept non-characters when validating Unicode

Unicode Corrigendum #9 clarifies that the non-characters U+nFFFE
(for n in the range 0 to 0x10), U+nFFFF (for n in the same range),
and U+FDD0..U+FDEF are valid for interchange, and their presence
does not make a string ill-formed.

GLib 2.36 made the corresponding change in its definition of UTF-8
as used by g_utf8_validate() and similar functions.
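
As a rough illustration of the rule described above, the sketch below places the old and new checks side by side. The function names unicode_valid_old and unicode_valid_new are hypothetical, written only for this example; they mirror the conditions of the UNICODE_VALID macro before and after this commit and are not part of libdbus.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the pre-change rule. It rejects out-of-range
 * values, surrogates, and also the non-characters U+FDD0..U+FDEF,
 * U+nFFFE and U+nFFFF. */
static bool unicode_valid_old(uint32_t c)
{
    return c < 0x110000 &&
           (c & 0xFFFFF800) != 0xD800 &&
           (c < 0xFDD0 || c > 0xFDEF) &&
           (c & 0xFFFE) != 0xFFFE;
}

/* Hypothetical helper: the post-change rule. Only out-of-range values
 * and surrogates (U+D800..U+DFFF) are rejected. */
static bool unicode_valid_new(uint32_t c)
{
    return c < 0x110000 &&
           (c & 0xFFFFF800) != 0xD800;
}
```

Under these assumptions, unicode_valid_new(0xFFFE) and unicode_valid_new(0xFDD0) hold while the old check rejects both, and the two checks still agree on rejecting 0xD800 and 0x110000.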

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=63072
Signed-off-by: Simon McVittie <simon.mcvittie@collabora.co.uk>
parent 540e5692
@@ -1577,19 +1577,11 @@ _dbus_string_split_on_byte (DBusString *source,
  *
  * The second check covers surrogate pairs (category Cs).
  *
- * The last two checks cover "Noncharacter": defined as:
- * "A code point that is permanently reserved for
- * internal use, and that should never be interchanged. In
- * Unicode 3.1, these consist of the values U+nFFFE and U+nFFFF
- * (where n is from 0 to 10_16) and the values U+FDD0..U+FDEF."
- *
  * @param Char the character
  */
 #define UNICODE_VALID(Char) \
     ((Char) < 0x110000 && \
-     (((Char) & 0xFFFFF800) != 0xD800) && \
-     ((Char) < 0xFDD0 || (Char) > 0xFDEF) && \
-     ((Char) & 0xFFFE) != 0xFFFE)
+     (((Char) & 0xFFFFF800) != 0xD800))
 /**
  * Finds the given substring in the string,
@@ -178,12 +178,14 @@ const char * const invalid_single_signatures[] = {
 const char * const valid_strings[] = {
   "",
-  "\xc2\xa9",
+  "\xc2\xa9",     /* UTF-8 (c) symbol */
+  "\xef\xbf\xbe", /* U+FFFE is reserved but Corrigendum 9 says it's OK */
   NULL
 };
 const char * const invalid_strings[] = {
-  "\xa9",
+  "\xa9",         /* Latin-1 (c) symbol */
+  "\xed\xa0\x80", /* UTF-16 surrogates are not valid in UTF-8 */
   NULL
 };
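
To see why the test vectors above land where they do, it can help to decode them by hand. The following is a minimal sketch, not the decoder libdbus uses: decode_one is a hypothetical helper that decodes a single UTF-8 sequence from a NUL-terminated buffer and returns -1 for a malformed lead or continuation byte. It deliberately does not reject overlong forms or surrogates, so that the resulting code point can be inspected separately.

```c
#include <stdint.h>

/* Hypothetical helper: decode the first UTF-8 sequence in a
 * NUL-terminated buffer. Returns the code point, or -1 if the lead
 * byte or a continuation byte is malformed. Overlong forms and
 * surrogates are NOT rejected here. */
static int32_t decode_one(const unsigned char *s)
{
    int32_t cp;
    int extra;

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else return -1; /* stray continuation byte or invalid lead byte */

    for (int i = 1; i <= extra; i++) {
        /* A NUL terminator fails this test, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80)
            return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```

With this helper, "\xef\xbf\xbe" decodes to 0xFFFE (a non-character, now accepted), "\xed\xa0\x80" decodes to 0xD800 (a surrogate, still rejected by the surrogate check in UNICODE_VALID), and the lone byte "\xa9" fails to decode at all because it is a continuation byte with no lead byte.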