Characters supported in C++

Keywords: character support, C++. Updated: 2023-10-16

I seem to have a problem when I write words with foreign characters (French...).

For example, if I ask for input into a std::string or a char[] like this:

std::string s;
std::cin>>s;  //if we input the string "café"
std::cout<<s<<std::endl;  //outputs "café"

Everything is fine.

However, if the string is hard-coded:

std::string s="café";
std::cout<<s<<std::endl; //outputs "cafÚ"

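What is happening? Which characters does C++ support? How can I make it work correctly? Does it have to do with my operating system (Windows 10)? My IDE (VS 15)? Or with C++ itself?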

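In short, if you want to pass Unicode text to, or receive it from, the console on Windows 10 (in fact, on any version of Windows), you need to use wide strings, i.e. std::wstring. Windows itself does not support UTF-8 encoding. This is a fundamental limitation of the operating system.

The entire Win32 API, on which things like console and file system access are based, only works with Unicode characters in UTF-16 encoding, and the C/C++ runtime shipped with Visual Studio does not provide any kind of translation layer to make that API UTF-8 compatible. This does not mean you can't use UTF-8 encoding internally; it just means that whenever you call the Win32 API, or a C/C++ runtime feature that uses it, you need to convert between UTF-8 and UTF-16. It sucks, but that is where we are right now.

Some people might direct you to a series of tricks that claim to make the console work with UTF-8. Don't go down that road; you will run into a lot of problems. Only wide strings are properly supported for Unicode console access.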

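Edit: Because UTF-8/UTF-16 string conversion is not trivial, and C++ does not offer much help with it either, here are some conversion functions I prepared earlier: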

#include <string>

///////////////////////////////////////////////////////////////////////////////////////////////////
// Note: both functions assume a 16-bit wchar_t holding UTF-16 code units, as on Windows.
std::wstring UTF8ToUTF16(const std::string& stringUTF8)
{
    // Convert the encoding of the supplied string
    std::wstring stringUTF16;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF8.size();
    stringUTF16.reserve(sourceStringSize);
    while (sourceStringPos < sourceStringSize)
    {
        // Determine the number of code units required for the next character
        static const unsigned int codeUnitCountLookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4 };
        unsigned int codeUnitCount = codeUnitCountLookup[(unsigned char)stringUTF8[sourceStringPos] >> 4];

        // Ensure that the requested number of code units are left in the source string
        if ((sourceStringPos + codeUnitCount) > sourceStringSize)
        {
            break;
        }

        // Convert the encoding of this character
        switch (codeUnitCount)
        {
        case 1:
        {
            stringUTF16.push_back((wchar_t)stringUTF8[sourceStringPos]);
            break;
        }
        case 2:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x1F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 3:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x0F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 4:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x07) << 18) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 3] & 0x3F);
            // Code points above U+FFFF are split into a surrogate pair
            wchar_t convertedCodeUnit1 = 0xD800 | (((unicodeCodePoint - 0x10000) >> 10) & 0x03FF);
            wchar_t convertedCodeUnit2 = 0xDC00 | ((unicodeCodePoint - 0x10000) & 0x03FF);
            stringUTF16.push_back(convertedCodeUnit1);
            stringUTF16.push_back(convertedCodeUnit2);
            break;
        }
        }

        // Advance past the converted code units
        sourceStringPos += codeUnitCount;
    }

    // Return the converted string to the caller
    return stringUTF16;
}
///////////////////////////////////////////////////////////////////////////////////////////////////
std::string UTF16ToUTF8(const std::wstring& stringUTF16)
{
    // Convert the encoding of the supplied string
    std::string stringUTF8;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF16.size();
    stringUTF8.reserve(sourceStringSize * 2);
    while (sourceStringPos < sourceStringSize)
    {
        // Check if a surrogate pair is used for this character
        bool usesSurrogatePair = (((unsigned int)stringUTF16[sourceStringPos] & 0xF800) == 0xD800);

        // Ensure that the requested number of code units are left in the source string
        if (usesSurrogatePair && ((sourceStringPos + 2) > sourceStringSize))
        {
            break;
        }

        // Decode the character from UTF-16 encoding
        unsigned int unicodeCodePoint;
        if (usesSurrogatePair)
        {
            unicodeCodePoint = 0x10000 + ((((unsigned int)stringUTF16[sourceStringPos] & 0x03FF) << 10) |
                                          ((unsigned int)stringUTF16[sourceStringPos + 1] & 0x03FF));
        }
        else
        {
            unicodeCodePoint = (unsigned int)stringUTF16[sourceStringPos];
        }

        // Encode the character into UTF-8 encoding
        if (unicodeCodePoint <= 0x7F)
        {
            stringUTF8.push_back((char)unicodeCodePoint);
        }
        else if (unicodeCodePoint <= 0x07FF)
        {
            char convertedCodeUnit1 = (char)(0xC0 | (unicodeCodePoint >> 6));
            char convertedCodeUnit2 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
        }
        else if (unicodeCodePoint <= 0xFFFF)
        {
            char convertedCodeUnit1 = (char)(0xE0 | (unicodeCodePoint >> 12));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
        }
        else
        {
            char convertedCodeUnit1 = (char)(0xF0 | (unicodeCodePoint >> 18));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 12) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit4 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
            stringUTF8.push_back(convertedCodeUnit4);
        }

        // Advance past the converted code units
        sourceStringPos += (usesSurrogatePair) ? 2 : 1;
    }

    // Return the converted string to the caller
    return stringUTF8;
}

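I was given the unenviable task of converting a 6-million-line legacy Windows application to support Unicode. It had been written to support ASCII only (in fact, its development predates Unicode), and it used std::string and char[] internally to store strings. Since changing all the internal string storage buffers was simply not possible, we needed to adopt UTF-8 internally and convert between UTF-8 and UTF-16 whenever we called the Win32 API. These are the conversion functions we used.

I would strongly recommend sticking with what is supported for new Windows development, which means wide strings. That said, there is no reason you can't base the core of your program on UTF-8 strings, but it will make things trickier when interacting with Windows and with various aspects of the C/C++ runtime.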
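To see the bit layout that the conversion functions implement in isolation, here is a minimal sketch that encodes a single code point. The names encodeUtf8, encodeUtf16 and checkEncodings are made up for illustration and are not part of the functions above; char16_t is used instead of wchar_t so the sketch does not depend on the Windows wchar_t width. It assumes the input is a valid code point and does not reject surrogate values.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 (hypothetical helper).
std::string encodeUtf8(char32_t cp)
{
    assert(cp <= 0x10FFFF); // beyond the Unicode range
    std::string out;
    if (cp <= 0x7F) {
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Encode one Unicode code point as UTF-16 (hypothetical helper).
std::u16string encodeUtf16(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::u16string out;
    if (cp <= 0xFFFF) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Split the 20 bits above 0x10000 into a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 | (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 | (v & 0x3FF)));
    }
    return out;
}

// Spot checks; call from main().
void checkEncodings()
{
    assert(encodeUtf8(0xE9) == "\xC3\xA9");   // é takes two bytes in UTF-8
    assert(encodeUtf16(0xE9).size() == 1);    // but a single UTF-16 code unit
    assert(encodeUtf16(0x1F600).size() == 2); // U+1F600 needs a surrogate pair
}
```

The point of the sketch is only that é (U+00E9) is one code unit in UTF-16 but two bytes in UTF-8, which is why every crossing between the two encodings needs a conversion step.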

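Edit 2: I have just re-read the original question, and I can see I didn't answer it very well. Let me give some more information that specifically answers your question.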

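What's happening? When developing with C++ on Windows, console I/O through std::string and std::cin/std::cout is done using MBCS encoding. This is a deprecated mode in which characters are encoded using the code page currently selected on the machine. Values encoded under these code pages are not Unicode, and they cannot be shared with another system that has a different code page selected, or even with the same system if its code page is changed. It works perfectly in your test because you capture the input under the current code page and display it back under the same code page. If you capture that input and save it to a file, inspection will show that it is not Unicode. Load it back with a different code page selected in your OS, and the text will appear corrupted.

You can only interpret text if you know which code page it was encoded in. Since these legacy code pages are regional, and none of them can represent all characters, it is effectively impossible to share text universally between different machines. MBCS predates the development of Unicode, and it was precisely because of these problems that Unicode was invented. Unicode is basically "one code page to rule them all".

You might be wondering why UTF-8 is not a selectable "legacy" code page on Windows. A lot of us are wondering the same thing. Suffice to say, it isn't. As such, you should not rely on MBCS encoding, because you cannot get Unicode support when using it. Your only option for Unicode support on Windows is to use std::wstring and call the UTF-16 Win32 APIs.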
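The claim that text captured under a legacy code page "is not Unicode" can be checked mechanically. Below is a rough sketch with a hypothetical helper, looksLikeUtf8, that only checks the structure of a byte sequence; it is not a full validator. The test strings are assumptions chosen for illustration: "caf\xE9" is "café" as Windows-1252 bytes, "caf\x82" is the same word as code page 850 bytes, and "caf\xC3\xA9" is the UTF-8 form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified structural check: each lead byte must be followed by the right
// number of continuation bytes (10xxxxxx). It rejects the lead bytes 0xC0/0xC1
// and anything above 0xF4, but does not catch every overlong form or encoded
// surrogate, so treat it as a sketch rather than a validator.
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;
        if (lead <= 0x7F) {
            extra = 0;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if (bytes.size() - i <= extra) {
            return false; // sequence is cut off at the end of the input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) {
                return false; // expected a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}

// Spot checks; call from main().
void checkCafeEncodings()
{
    assert(looksLikeUtf8("caf\xC3\xA9")); // UTF-8
    assert(!looksLikeUtf8("caf\xE9"));    // Windows-1252 bytes
    assert(!looksLikeUtf8("caf\x82"));    // code page 850 bytes
}
```

Pure ASCII text passes the check under any of these encodings, which is why the problem only shows up once a character such as é is involved.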

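Regarding your example of the hard-coded string: first of all, understand that putting non-ASCII text into your source file puts you into the realm of compiler-specific behavior. In Visual Studio you can actually specify the encoding of the source file (under File -> Advanced Save Options). In your case, the text comes out differently from what you expect because it is (most likely) encoded in UTF-8, but, as mentioned, console output is done using MBCS encoding on your currently selected code page, which is not UTF-8. Historically, the advice was to avoid any non-ASCII characters in source files and to escape them using the \x notation. Today there are C++11 string prefixes and suffixes that guarantee various encoding forms. You could try those if you need this ability. I have no practical experience with them, so I can't say whether that approach has any issues.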
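For reference, a minimal sketch of the C++11 prefixes mentioned above: u8 for UTF-8, u for UTF-16 and U for UTF-32. The variable names and checkLiteralSizes are invented for illustration. The é is written as the universal character name \u00e9 so that the result does not depend on how the source file is saved; this only fixes the encoding inside the program and says nothing about what the console does with the bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "café" with é spelled as \u00e9, independent of the source file encoding.
const std::u16string cafe16 = u"caf\u00e9"; // UTF-16 code units
const std::u32string cafe32 = U"caf\u00e9"; // UTF-32 code units

// sizeof includes the terminating NUL, hence the "- 1". The element type of a
// u8 literal is char before C++20 and char8_t from C++20; both are one byte.
const std::size_t cafe8Bytes = sizeof(u8"caf\u00e9") - 1;

// Spot checks; call from main().
void checkLiteralSizes()
{
    assert(cafe8Bytes == 5);      // c, a, f, then 0xC3 0xA9
    assert(cafe16.size() == 4);   // é fits in one UTF-16 code unit
    assert(cafe16[3] == 0x00E9);
    assert(cafe32.size() == 4);
}
```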

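The problem originates with Windows itself. It uses one character encoding (UTF-16) for most internal operations, another (Windows-1252) as the default file encoding, and yet another (code page 850 in your case) for console I/O. Your source file is encoded in Windows-1252, where é is the single byte '\xe9'. When you display that same byte in code page 850, it becomes Ú. Using u8"é" produces the two-byte sequence "\xc3\xa9", which prints on the console as ├®.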
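The byte values above can be put side by side in a small lookup. This is only a sketch: decodeByte, CodePage and checkCafeBytes are hypothetical names, and the table is deliberately limited to the four bytes discussed in this thread (0x82, 0xA9, 0xC3, 0xE9) rather than being a full mapping of either code page.

```cpp
#include <cassert>
#include <string>

enum class CodePage { Windows1252, Cp850 };

// Return the Unicode code point that `byte` stands for under `page`.
// Partial table: bytes not listed here yield U+FFFD (replacement character).
char32_t decodeByte(unsigned char byte, CodePage page)
{
    if (byte < 0x80) {
        return byte; // ASCII is identical in both code pages
    }
    const bool cp850 = (page == CodePage::Cp850);
    switch (byte) {
    case 0x82: return cp850 ? 0x00E9 : 0x201A; // é vs. low single quote
    case 0xA9: return cp850 ? 0x00AE : 0x00A9; // ® vs. ©
    case 0xC3: return cp850 ? 0x251C : 0x00C3; // ├ vs. Ã
    case 0xE9: return cp850 ? 0x00DA : 0x00E9; // Ú vs. é
    default:   return 0xFFFD;                  // not covered by this sketch
    }
}

// Spot checks; call from main().
void checkCafeBytes()
{
    // The byte saved by the editor (Windows-1252 é) shown on a CP850 console:
    assert(decodeByte(0xE9, CodePage::Windows1252) == 0x00E9); // é
    assert(decodeByte(0xE9, CodePage::Cp850) == 0x00DA);       // Ú
    // The UTF-8 bytes of é shown on a CP850 console:
    assert(decodeByte(0xC3, CodePage::Cp850) == 0x251C);       // ├
    assert(decodeByte(0xA9, CodePage::Cp850) == 0x00AE);       // ®
}
```

Read the other way round, the table also shows that é is the byte 0x82 in code page 850, which is the value a \x escape would have to use for this particular console.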

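Probably the easiest solution is to avoid non-ASCII literals in your code altogether and use the hex code of the character you need. This won't be a pretty or portable solution, though.

std::string s="caf\x82";

A better solution would be to use u16 strings and encode them using WideCharToMultiByte.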

Which characters does C++ support?

The C++ standard does not specify which characters are supported. It is implementation-specific.

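Does it have to do with...

... C++?

No.

... my IDE?

No, although an IDE may have an option to edit source files in a particular encoding.

... my operating system?

That may have an influence.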

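This is influenced by several things:

- What the encoding of the source file is.
- What encoding the compiler uses to interpret the source file.
  - Is it the same as the encoding of the file, or different? (It should be the same, or it might not work correctly.)
  - The native encoding of your operating system probably influences which character encoding your compiler expects by default.
- What encoding the terminal that runs the program supports.
  - Is it the same as the encoding of the file, or different? (It should be the same, or it might not work correctly without conversion.)
- Whether the character encoding used is wide. By wide, I mean that the width of a code unit is greater than CHAR_BIT. A wide source file or compiler will cause a conversion into another, narrow encoding, since you use a narrow string literal and a narrow stream operator. In that case, you need to figure out both the native narrow and the native wide character encoding expected by the compiler. The compiler will convert the input string into the narrow encoding. If the narrow encoding has no representation for a character in the input encoding, it may not work correctly.

An example:

The source file is encoded in UTF-8. The compiler expects UTF-8. The terminal expects UTF-8. In this case, what you see is what you get.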
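The definition of "wide" used in the list above (a code unit wider than CHAR_BIT bits) can be written down directly. A small sketch, with the hypothetical helpers codeUnitBits, isWide and checkWidths; the exact width of wchar_t is implementation-defined (commonly 16 bits on Windows and 32 bits on Linux), so the sketch does not assert a specific value for it.

```cpp
#include <cassert>
#include <climits>

// Number of bits in one code unit of the character type CharT.
template <typename CharT>
constexpr int codeUnitBits()
{
    return static_cast<int>(sizeof(CharT) * CHAR_BIT);
}

// "Wide" in the sense used above: a code unit is larger than CHAR_BIT bits.
template <typename CharT>
constexpr bool isWide()
{
    return codeUnitBits<CharT>() > CHAR_BIT;
}

// sizeof(char) is 1 by definition, so char is always narrow.
static_assert(!isWide<char>(), "char is exactly CHAR_BIT bits wide");

// Spot checks; call from main().
void checkWidths()
{
    assert(codeUnitBits<char>() == CHAR_BIT);
    assert(codeUnitBits<char16_t>() >= 16); // guaranteed minimum
    assert(codeUnitBits<char32_t>() >= 32); // guaranteed minimum
}
```

On a typical platform with 8-bit bytes, isWide<wchar_t>() and isWide<char16_t>() are both true, while std::string and std::cout work with narrow code units.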

The trick here is setlocale:

#include <clocale>
#include <string>
#include <iostream>

int main() {
    std::setlocale(LC_ALL, "");  // adopt the user's default locale from the environment
    std::string const s("café");
    std::cout << s << '\n';
}

Even without changing the terminal code page, the output in the Windows 10 command prompt is correct for me.