How to measure the correct size of non-ASCII characters?

Keywords: characters, ASCII, measurement. Updated: 2023-10-16

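In the following program, I am trying to measure the length of a string that contains non-ASCII characters.

However, I am not sure why size() does not print the correct length when the string contains non-ASCII characters.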

#include <iostream>
#include <string>
int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}

Output:

Size of Hello is 5
Size of इंडिया is 18

Live demo on Wandbox.

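> std::string::size returns the length in bytes, not in characters. The second string is stored in a Unicode encoding (UTF-8 here, which is why its 6 code points occupy 18 bytes), so each character may take several bytes. Note that the same applies to std::wstring::size, since the result depends on the encoding: it returns the number of wide characters (code units), not actual characters. With UTF-32 the two match for every code point; with UTF-16 they match only for characters in the Basic Multilingual Plane, because all other characters take two code units (more in this answer).

To measure the actual length (in number of symbols), you need to know the encoding so that the characters can be separated, and therefore counted, correctly. For example, this answer may help for UTF-8 (although the method it uses has been deprecated since C++17).

Another option for UTF-8 is to count the lead bytes, that is, every byte that is not a continuation byte (credit to another answer):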
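
To make the difference between code units and characters concrete, here is a minimal sketch that stores the six code points of the string from the question in three encodings and reports what size() returns for each. The function names are hypothetical, and escape sequences are used so that the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>
#include <string>

// The string from the question: U+0907 U+0902 U+0921 U+093F U+092F U+093E.

// UTF-8 in a std::string: size() counts bytes (3 per code point here).
std::size_t utf8_units() {
    return std::string("\xE0\xA4\x87\xE0\xA4\x82\xE0\xA4\xA1"
                       "\xE0\xA4\xBF\xE0\xA4\xAF\xE0\xA4\xBE").size();
}

// UTF-16 in a std::u16string: size() counts 16-bit code units.
std::size_t utf16_units() {
    return std::u16string(u"\u0907\u0902\u0921\u093F\u092F\u093E").size();
}

// UTF-32 in a std::u32string: size() counts 32-bit code units.
std::size_t utf32_units() {
    return std::u32string(U"\u0907\u0902\u0921\u093F\u092F\u093E").size();
}

// A code point outside the Basic Multilingual Plane (U+1F600) needs a
// surrogate pair in UTF-16 but still a single code unit in UTF-32.
std::size_t utf16_units_non_bmp() { return std::u16string(u"\U0001F600").size(); }
std::size_t utf32_units_non_bmp() { return std::u32string(U"\U0001F600").size(); }
```

For this text, the UTF-8 version reports 18, while the UTF-16 and UTF-32 versions both report 6. The last two functions show where UTF-16 stops matching: a single code point outside the Basic Multilingual Plane yields 2 in UTF-16 and 1 in UTF-32.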
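
As an illustration of what "knowing the encoding" means for UTF-8, the following is a minimal sketch of a hand-written decoder that avoids the deprecated facilities. The name decode_utf8 is hypothetical, and the code assumes mostly well-formed input: it does not reject overlong encodings or surrogate values, and it does not check that continuation bytes really have the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Assumption: the input is well-formed UTF-8; validation is intentionally minimal.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = 0xFFFD;  // replacement character for an invalid lead byte

        // The high bits of the lead byte give the sequence length.
        if (lead < 0x80)              { len = 1; cp = lead; }         // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }  // 11110xxx

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Under these assumptions, decode_utf8(s).size() is the number of code points in s. Note that a code point is still not always a user-perceived character: in the string from the question, इं is written with two code points (U+0907 followed by the combining sign U+0902).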

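#include <cstddef>
#include <string>

// Counts UTF-8 code points: every byte that is not a continuation
// byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t len = 0;
    for (unsigned char c : s)
        len += (c & 0xC0) != 0x80;
    return len;
}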
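
As a usage sketch, the same lead-byte test can be applied to the two strings from the question. The names count_code_points, hello and india are hypothetical, the counting is written with std::count_if purely as an equivalent formulation, and the Devanagari string is spelled out as UTF-8 byte escapes so the example does not depend on the source-file encoding.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical name: counts the bytes that are not UTF-8 continuation bytes.
// Assumption: the input is valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), [](unsigned char c) {
            return (c & 0xC0) != 0x80;  // true unless the byte is 10xxxxxx
        }));
}

// The two strings from the question, given as explicit UTF-8 bytes.
const std::string hello = "Hello";
const std::string india = "\xE0\xA4\x87\xE0\xA4\x82\xE0\xA4\xA1"
                          "\xE0\xA4\xBF\xE0\xA4\xAF\xE0\xA4\xBE";
```

Here hello.size() and count_code_points(hello) are both 5, whereas india.size() is 18 and count_code_points(india) is 6.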

I used the std::wstring_convert class and got the correct string length.

#include <string>
#include <iostream>
#include <codecvt>
int main()
{
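    std::string s2 = "इंडिया"; // non-ASCII string
    // Convert the UTF-8 bytes to UTF-32, where each element is one code point.
    // Note: std::wstring_convert and <codecvt> are deprecated since C++17.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cn;
    auto sz = cn.from_bytes(s2).size();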
    std::cout << "Size of " << s2 << " is " << sz << std::endl;
}

Live demo on Wandbox.

An important reference link with more information about std::wstring_convert is here.