
Edit history

Fix by gh0stwizard (current version):

What do you mean, "no Unicode support"?

It means exactly that: UCS-2 != Unicode. Let me quote:

As of Unicode 8.0 there are 120,520 graphic characters.

https://en.wikipedia.org/wiki/Unicode
And here's how things stood before 2000 (for the record, it's 2016 now):

The UCS has over 1.1 million code points available for use, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000.

https://en.wikipedia.org/wiki/Universal_Coded_Character_Set

Those two facts are exactly why char16_t and char32_t were invented.
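
A quick self-contained illustration (my own sketch, C11 with <uchar.h>; U+1F600 is just an arbitrary code point outside the BMP): a non-BMP character fits in a single char32_t unit but needs two char16_t units, the surrogate pair that UCS-2 simply cannot express.

#include <stdio.h>
#include <uchar.h> /* C11: char16_t, char32_t */

int main(void)
{
    /* U+1F600 lies outside the BMP (> 0xFFFF) */
    char32_t c32[] = U"\U0001F600"; /* one code unit + terminator */
    char16_t c16[] = u"\U0001F600"; /* surrogate pair + terminator */

    printf("UTF-32 units: %zu\n", sizeof c32 / sizeof c32[0] - 1); /* 1 */
    printf("UTF-16 units: %zu\n", sizeof c16 / sizeof c16[0] - 1); /* 2 */
    printf("surrogate pair: 0x%04X 0x%04X\n",
           (unsigned)c16[0], (unsigned)c16[1]); /* 0xD83D 0xDE00 */
    return 0;
}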

Here's what unicode/umachine.h (libicu) has to say:

/* UChar and UChar32 definitions -------------------------------------------- */

/** Number of bytes in a UChar. @stable ICU 2.0 */
#define U_SIZEOF_UCHAR 2

/**
 * \var UChar
 * Define UChar to be UCHAR_TYPE, if that is #defined (for example, to char16_t),
 * or wchar_t if that is 16 bits wide; always assumed to be unsigned.
 * If neither is available, then define UChar to be uint16_t.
 *
 * This makes the definition of UChar platform-dependent
 * but allows direct string type compatibility with platforms with
 * 16-bit wchar_t types.
 *
 * @stable ICU 4.4
 */
#if defined(UCHAR_TYPE)
    typedef UCHAR_TYPE UChar;
/* Not #elif U_HAVE_CHAR16_T -- because that is type-incompatible with pre-C++11 callers
    typedef char16_t UChar;  */
#elif U_SIZEOF_WCHAR_T==2
    typedef wchar_t UChar;
#elif defined(__CHAR16_TYPE__)
    typedef __CHAR16_TYPE__ UChar;
#else
    typedef uint16_t UChar;
#endif

/**
 * Define UChar32 as a type for single Unicode code points.
 * UChar32 is a signed 32-bit integer (same as int32_t).
 *
 * The Unicode code point range is 0..0x10ffff.
 * All other values (negative or >=0x110000) are illegal as Unicode code points.
 * They may be used as sentinel values to indicate "done", "error"
 * or similar non-code point conditions.
 *
 * Before ICU 2.4 (Jitterbug 2146), UChar32 was defined
 * to be wchar_t if that is 32 bits wide (wchar_t may be signed or unsigned)
 * or else to be uint32_t.
 * That is, the definition of UChar32 was platform-dependent.
 *
 * @see U_SENTINEL
 * @stable ICU 2.4
 */
typedef int32_t UChar32;

The unsigned/signed remark is especially interesting. Here's what gcc-5.3.0 + musl-1.1.12 has:

/usr/include/uchar.h:7:typedef unsigned short char16_t;
/usr/include/uchar.h:8:typedef unsigned char32_t;

On wheezy (gcc-4.7.2 + eglibc-2.13), char16_t and char32_t aren't defined for C at all.
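
The practical consequence of that remark (a contrived sketch of my own, assuming an everyday two's-complement target): if the 16-bit code unit type happens to be signed, every unit >= 0x8000, which includes all surrogates, sign-extends to a negative int under the usual arithmetic conversions, and range checks silently give the wrong answer.

#include <stdio.h>

int main(void)
{
    short          s = (short)0xD83D; /* a signed 16-bit unit: negative */
    unsigned short u = 0xD83D;        /* what ICU assumes: stays positive */

    /* the standard "is this a high surrogate?" check */
    printf("%d\n", s >= 0xD800 && s <= 0xDBFF); /* 0 - wrong */
    printf("%d\n", u >= 0xD800 && u <= 0xDBFF); /* 1 - right */
    return 0;
}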

Maybe I'm not getting something, but that makes using char16_t and char32_t contraindicated too. Properly speaking, you need to write your own library or use a ready-made one. The types alone won't get you anywhere.
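
For the "ready-made one" route, here's roughly what that looks like with the very library quoted above (a minimal sketch; U16_NEXT is ICU's macro from unicode/utf16.h for walking UTF-16 one code point at a time, surrogate pairs included):

#include <stdio.h>
#include <unicode/umachine.h>
#include <unicode/utf16.h>

int main(void)
{
    /* "A" followed by U+1F600 encoded as a surrogate pair */
    static const UChar s[] = { 0x0041, 0xD83D, 0xDE00 };
    int32_t i = 0, len = 3;

    while (i < len) {
        UChar32 c;
        U16_NEXT(s, i, len, c);          /* advances i by 1 or 2 units */
        printf("U+%04X\n", (unsigned)c); /* U+0041, then U+1F600 */
    }
    return 0; /* build with: cc demo.c -licuuc */
}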
