Why do I get a number instead of a Unicode character?

Keywords: Unicode, character, number, why. Last updated: 2023-10-16

I wrote this code:

#include <iostream>

int main()
{
    std::wcout << '\u00E1' << std::endl;
}

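But when compiled with GCC 4.8.1, it prints 50081.

I am probably doing something wrong, but I certainly would not expect a number to be printed. What is going on?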

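I believe this is a bug in g++. The type of '\u00E1' is char, but g++ treats it as int. clang++ gets it right.

Consider this related program (it uses an overloaded type_of function to detect the type of a literal):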

#include <iostream>

const char *type_of(char) { return "char"; }
const char *type_of(int)  { return "int";  }

int main()
{
    std::cout << "type_of('x')  = " << type_of('x') << "\n";
    std::cout << "type_of('xy') = " << type_of('xy') << "\n";           // line 9
    std::cout << "type_of('\u00E1')  = " << type_of('\u00E1') << "\n";  // line 10
    std::cout << "type_of('\u0100')  = " << type_of('\u0100') << "\n";  // line 11
}

When I compile this with g++ 4.7.2, I get the following warnings:

c.cpp:9:47: warning: multi-character character constant [-Wmultichar]
c.cpp:10:52: warning: multi-character character constant [-Wmultichar]
c.cpp:11:52: warning: multi-character character constant [-Wmultichar]

And this output:

type_of('x')  = char
type_of('xy') = int
type_of('á')  = int
type_of('Ā')  = int

With clang++ 3.0, I get only two warnings:

c.cpp:9:47: warning: multi-character character constant [-Wmultichar]
std::cout << "type_of('xy') = " << type_of('xy') << "\n";
^
c.cpp:11:52: warning: character unicode escape sequence too long for its type
std::cout << "type_of('\u0100')  = " << type_of('\u0100') << "\n";

And this output:

type_of('x')  = char
type_of('xy') = int
type_of('á')  = char
type_of('Ā')  = char

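The character literal '\u00E1' has a c-char-sequence consisting of a single c-char, which happens to be a universal-character-name. It is therefore of type char, but g++ incorrectly treats it as a multi-character constant of type int. clang++ correctly treats it as an ordinary character literal of type char.

The value of such a character literal, when it falls outside the range of char, is implementation-defined, but the literal is still of type char.

Since you are writing to std::wcout, you probably want a wide character literal: L'\u00E1', which is of type wchar_t, rather than '\u00E1', which (if the compiler handled it correctly) would be of type char.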

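This appears to be a compiler bug.

According to the standard (2.14.3/1), '\u00E1' is an ordinary character literal (it has no u, U, or L prefix) that contains a single c-char (which is a universal-character-name), so it has type char.

So std::wcout << '\u00E1' should use operator<<(char) and print a single character.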

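Instead, g++ takes the universal-character-name, converts it to a UTF-8 encoded sequence, and gets the multi-character literal '\xC3\xA1', which is an int with the value 50081:

'\u00E1' -> '\xC3\xA1' -> 50081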