Extract (first) UTF-8 character from a std::string

Keywords: first UTF-8 character, extraction, std::string. Updated: 2023-10-16

I need to use a C++ implementation of PHP's mb_strtoupper function to imitate Wikipedia's behavior.

My problem is that I only want to feed the function a single UTF-8 character, namely the first character of a std::string.

std::string s("äbcdefg");
mb_strtoupper(s[0]); // this obviously can't work with multi-byte characters
mb_strtoupper('ä'); // works

Is there an efficient way to detect and return only the first UTF-8 character of a string?

In UTF-8, the high bits of the first byte tell you how many of the following bytes belong to the same code point.

0b0xxxxxxx: this byte is the entire code point
0b10xxxxxx: this byte is a continuation byte - this shouldn't occur at the start of a string
0b110xxxxx: this byte plus the next (which must be a continuation byte) form the code point
0b1110xxxx: this byte plus the next two form the code point
0b11110xxx: this byte plus the next three form the code point

You could imagine the pattern continuing, but valid UTF-8 never uses more than four bytes to represent a single code point.

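If you write a function that counts the number of leading bits set to 1, you can use it to work out where to split the byte sequence and isolate the first logical code point (assuming the input is valid UTF-8). If you want to harden this against invalid UTF-8, you will have to write a bit more code.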
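The leading-ones idea could be sketched roughly as follows. The names CountLeadingOnes and FirstCodePointLength are made up for illustration, and the validation is only structural: it checks the lead byte, truncation, and the continuation-byte pattern, but it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Number of leading 1 bits in `byte` (0 through 8).
inline int CountLeadingOnes(unsigned char byte) {
  int count = 0;
  for (unsigned char mask = 0x80; mask != 0 && (byte & mask) != 0; mask >>= 1) {
    ++count;
  }
  return count;
}

// Byte length of the first code point in `text`, or 0 if `text` is empty
// or does not start with a structurally well-formed UTF-8 sequence.
// Hypothetical helper, not part of any library.
inline std::size_t FirstCodePointLength(const std::string &text) {
  if (text.empty()) return 0;
  const int ones = CountLeadingOnes(static_cast<unsigned char>(text[0]));
  if (ones == 0) return 1;              // 0b0xxxxxxx: single-byte code point
  if (ones == 1 || ones > 4) return 0;  // stray continuation byte or invalid lead byte
  const std::size_t length = static_cast<std::size_t>(ones);
  if (text.size() < length) return 0;   // sequence is truncated
  for (std::size_t i = 1; i < length; ++i) {
    // Every following byte must be a continuation byte (0b10xxxxxx).
    if (CountLeadingOnes(static_cast<unsigned char>(text[i])) != 1) return 0;
  }
  return length;
}
```

With this, text.substr(0, FirstCodePointLength(text)) yields the bytes of the first code point, or an empty string when the input is empty or malformed.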

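Another approach is to exploit the fact that continuation bytes always match the pattern 0b10xxxxxx: take the first byte, then keep taking bytes for as long as the next byte matches that pattern.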

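#include <cstddef>
#include <string>

// Returns the length in bytes of the first code point in `text`
// (0 for an empty string). Assumes `text` is valid UTF-8.
std::size_t GetFirst(const std::string &text) {
  if (text.empty()) return 0;
  std::size_t length = 1;
  // Continuation bytes always look like 0b10xxxxxx.
  while (length < text.size() &&
         (static_cast<unsigned char>(text[length]) & 0b11000000) == 0b10000000) {
    ++length;
  }
  return length;
}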

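For many languages, a single code point usually maps to a single character. But what people think of as a single character may be closer to what Unicode calls a grapheme cluster: one or more code points that combine into a single glyph.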

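In your example, ä can be represented in different ways: it can be the single code point U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or the combination of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS. Fortunately, picking just the first code point should be enough for your goal of capitalizing the first letter.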
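To see the two representations side by side, you could decode the first code point of each byte sequence. This is a minimal sketch that assumes non-empty, valid UTF-8 input and performs no validation; DecodeFirstCodePoint is a name invented for this example.

```cpp
#include <cstddef>
#include <string>

// Decodes the first code point of `text`.
// Assumes `text` is non-empty, valid UTF-8; no validation is performed.
inline char32_t DecodeFirstCodePoint(const std::string &text) {
  const unsigned char lead = static_cast<unsigned char>(text[0]);
  if (lead < 0x80) return lead;  // 0b0xxxxxxx: the byte is the code point

  std::size_t extra = 0;  // number of continuation bytes that follow
  char32_t value = 0;     // payload bits collected so far
  if ((lead & 0xE0) == 0xC0) {         // 0b110xxxxx
    extra = 1;
    value = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {  // 0b1110xxxx
    extra = 2;
    value = lead & 0x0F;
  } else {                             // 0b11110xxx
    extra = 3;
    value = lead & 0x07;
  }
  for (std::size_t i = 1; i <= extra; ++i) {
    // Each continuation byte contributes its low six bits.
    value = (value << 6) | (static_cast<unsigned char>(text[i]) & 0x3F);
  }
  return value;
}
```

For the precomposed form, the bytes "\xC3\xA4" decode to U+00E4. For the decomposed form, the bytes "a\xCC\x88" start with U+0061, and the combining diaeresis U+0308 follows as a separate code point.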

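If you really need the first grapheme cluster, you have to look beyond the first code point and check whether the following code points combine with it. For many languages, it is enough to know which code points are "non-spacing" or "combining" marks or variation selectors. For some complex scripts (Hangul, for example?), you may need to consult this Unicode Consortium technical report.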
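As a rough illustration of looking past the first code point, the sketch below extends the first code point with any directly following marks from the Combining Diacritical Marks block (U+0300 to U+036F) or variation selectors (U+FE00 to U+FE0F). That range check is a deliberate simplification and not the full Unicode grapheme cluster rules, the input is assumed to be valid UTF-8, and all function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Byte length of the code point starting at `pos`, found by skipping
// continuation bytes (0b10xxxxxx). Assumes valid UTF-8 and pos < text.size().
inline std::size_t CodePointLengthAt(const std::string &text, std::size_t pos) {
  std::size_t length = 1;
  while (pos + length < text.size() &&
         (static_cast<unsigned char>(text[pos + length]) & 0xC0) == 0x80) {
    ++length;
  }
  return length;
}

// Decodes the code point stored in the `length` bytes starting at `pos`.
inline char32_t DecodeAt(const std::string &text, std::size_t pos,
                         std::size_t length) {
  const unsigned char lead = static_cast<unsigned char>(text[pos]);
  if (length == 1) return lead;
  // A lead byte of an N-byte sequence carries 7 - N payload bits.
  char32_t value = lead & (0xFF >> (length + 1));
  for (std::size_t i = 1; i < length; ++i) {
    value = (value << 6) | (static_cast<unsigned char>(text[pos + i]) & 0x3F);
  }
  return value;
}

// Simplified test: only two ranges are treated as "combining".
inline bool IsCombiningOrVariationSelector(char32_t cp) {
  return (cp >= 0x0300 && cp <= 0x036F) || (cp >= 0xFE00 && cp <= 0xFE0F);
}

// Byte length of the first code point plus any marks that follow it.
inline std::size_t FirstClusterLength(const std::string &text) {
  if (text.empty()) return 0;
  std::size_t end = CodePointLengthAt(text, 0);
  while (end < text.size()) {
    const std::size_t length = CodePointLengthAt(text, end);
    if (!IsCombiningOrVariationSelector(DecodeAt(text, end, length))) break;
    end += length;
  }
  return end;
}
```

With the decomposed ä ("a\xCC\x88" followed by more text), FirstClusterLength returns 3, so the base letter and its diaeresis stay together. With the precomposed form it returns 2.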

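Using the str.h library:

#include <iostream>
#include <string>
#include "str.h"

int main() {
    std::string text = "äbcdefg";
    // str::substr comes from str.h; per this answer it indexes by character, not by byte.
    std::string str = str::substr(text, 0, 1);  // returns "ä"
    std::cout << str << std::endl;
}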