std::string character encoding

Keywords: encoding, character, string, std. Last updated: 2023-10-16

std::string arrWords[10];
std::vector<std::string> hElemanlar;

......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

......

What I'm doing: each element of arrWords is a std::string. I take the n-th element of arrWords and push its characters, one at a time, into hElemanlar.

Suppose arrWords[0] is "test"; then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");

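My problem is that, although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly in hElemanlar. How can I fix this?

If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the strings into complete Unicode characters rather than individual bytes.

As an aside, rather than saying: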
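To illustrate why byte-by-byte splitting breaks such text: in UTF-8, every non-ASCII character occupies two to four bytes, so a single char holds only a fragment of it. The following is a minimal sketch, assuming the input is valid UTF-8; the name countCodePoints is hypothetical and not part of the original code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in s, assuming s is valid UTF-8.
// Each code point contains exactly one byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t countCodePoints(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        if ((byte & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

For example, the Turkish letter U+015F is encoded as the two bytes "\xC5\x9F": size() reports 2 for that string, while countCodePoints reports 1. Pushing each byte into the vector separately therefore yields two strings that are not valid UTF-8 on their own.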

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

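(which constructs a temporary std::string, obtains its C-string representation, constructs another temporary string from that, and pushes it onto the vector), say:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

Anyway, this will need to become something like: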

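std::string str(1, this->arrWords[sayKelime][j]);
if (static_cast<unsigned char>(str[0]) >= 0xC0)
{
    // str[0] is the lead byte of a multi-byte sequence. Append the
    // continuation bytes (bit pattern 10xxxxxx) that follow it, and stop
    // at the next lead byte, ASCII byte, or the nul terminator.
    while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80)
    {
        str.push_back(this->arrWords[sayKelime][++j]);
    }
}
this->hElemanlar.push_back(str);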

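Note that the loop above is safe because, if j is the index of the last character in the string, [j+1] returns the nul terminator (guaranteed for std::string since C++11), which ends the loop. You do need to consider how incrementing j interacts with the rest of your code, though.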
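The same idea can be packaged as a standalone function, which keeps the extra increments of j out of the surrounding code. This is a sketch under the assumption that the input is valid UTF-8; the name splitUtf8 is hypothetical, and the explicit bounds check replaces the reliance on the nul terminator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits word into one std::string per code point,
// assuming word is valid UTF-8.
std::vector<std::string> splitUtf8(const std::string& word)
{
    std::vector<std::string> parts;
    for (std::size_t j = 0; j < word.size(); ++j)
    {
        std::string str(1, word[j]);
        // Append the continuation bytes (10xxxxxx) that follow the current byte.
        while (j + 1 < word.size() &&
               (static_cast<unsigned char>(word[j + 1]) & 0xC0) == 0x80)
        {
            str.push_back(word[++j]);
        }
        parts.push_back(str);
    }
    return parts;
}
```

With this, the question's loop could become a single call such as hElemanlar = splitUtf8(arrWords[sayKelime]), or an append of the returned elements if the vector already holds data.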

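Then you need to decide whether you want each element of hElemanlar to represent a single Unicode code point (which is what the code above does), or a character plus all the combining characters that follow it. In the latter case, you would have to extend the code above to:

Parse the next code point.
Determine whether it is a combining character.
If so, append its UTF-8 sequence to the string.
Repeat (a character can carry more than one combining character).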
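The steps above can be sketched as follows. This is a simplified illustration, not a full implementation: it assumes valid UTF-8 input, and it treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, whereas real text needs the Unicode character database, typically through a Unicode library. The names decodeAt, isCombiningMark and splitWithCombining are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point that starts at s[i] and stores
// its byte length in len. Assumes s is valid UTF-8.
char32_t decodeAt(const std::string& s, std::size_t i, std::size_t& len)
{
    const unsigned char b = static_cast<unsigned char>(s[i]);
    len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte (5, 4 or 3 of them).
    char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
    for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
    {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    }
    return cp;
}

// Simplifying assumption: only U+0300..U+036F counts as combining here.
bool isCombiningMark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: one element per base character, with any combining
// marks that follow it appended to the same element.
std::vector<std::string> splitWithCombining(const std::string& word)
{
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < word.size())
    {
        std::size_t len = 0;
        const char32_t cp = decodeAt(word, i, len);
        if (isCombiningMark(cp) && !parts.empty())
        {
            parts.back().append(word, i, len);   // attach to the previous character
        }
        else
        {
            parts.push_back(word.substr(i, len));
        }
        i += len;
    }
    return parts;
}
```

For instance, "e" followed by U+0301 (bytes "\xCC\x81") ends up as a single three-byte element, because the mark is attached to the preceding "e" rather than stored on its own.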