Find longest UTF-8 sequence without breaking multi-byte sequences

Keywords: multi-byte sequences, UTF-8. Updated: 2023-10-16

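I need to truncate a UTF-8 encoded string so that it is no larger than a predefined size in bytes. The particular protocol also requires that the truncated string still forms a valid UTF-8 encoding, i.e. no multi-byte sequence may be split.

Given the structure of the UTF-8 encoding, I could move forward through the string, adding up the encoded size of each code point, until I hit the maximum byte count. O(n) isn't very attractive, though. Is there an algorithm that completes faster, ideally in (amortized) O(1) time?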
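
For comparison, the linear forward scan described above might look like the following sketch. The names utf8_sequence_length and find_max_utf8_length_linear are made up for illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: size in bytes of the sequence introduced by `lead`,
// judged from the lead byte alone. Assumes `lead` starts a valid sequence.
constexpr std::size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0b10000000) == 0b00000000) return 1; // 0xxxxxxxb: ASCII
    if ((lead & 0b11100000) == 0b11000000) return 2; // 110xxxxxb
    if ((lead & 0b11110000) == 0b11100000) return 3; // 1110xxxxb
    return 4;                                        // 11110xxxb
}

// Hypothetical name: walks the string code point by code point, O(n).
constexpr std::size_t find_max_utf8_length_linear(std::string_view sv,
                                                  std::size_t max_byte_count)
{
    std::size_t length{0};
    while (length < sv.size())
    {
        const auto step{utf8_sequence_length(static_cast<unsigned char>(sv[length]))};
        // Stop if the next code point would exceed the budget, or if the
        // input ends in the middle of a sequence.
        if (step > max_byte_count - length || step > sv.size() - length)
        {
            break;
        }
        length += step;
    }
    return length;
}
```

The work grows with the number of code points that fit into the budget, which is what the question would like to avoid.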

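Update 2019-06-24: After a night's sleep, the problem appears much easier than my first attempt made it seem. I have left the previous answer below for historical reasons.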

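The UTF-8 encoding is self-synchronizing. This makes it possible to determine whether an arbitrarily chosen code unit in a symbol stream is the start of a code sequence. A UTF-8 sequence can be split immediately before the start of any code sequence.

The start of a code sequence is either an ASCII character (0xxxxxxxb) or the leading byte (11xxxxxxb) of a multi-byte sequence. Trailing bytes follow the pattern 10xxxxxxb. The start of a code sequence therefore satisfies the condition (code_unit & 0b11000000) != 0b10000000, in other words: it is not a trailing byte.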
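
As a small illustration, the bit test can be wrapped in a helper and checked against a few bytes. The name is_utf8_sequence_start is hypothetical; only the mask comparison comes from the text above.

```cpp
// Hypothetical helper name; true for ASCII and leading bytes, false for
// trailing bytes (10xxxxxxb).
constexpr bool is_utf8_sequence_start(unsigned char code_unit)
{
    return (code_unit & 0b11000000) != 0b10000000;
}

static_assert(is_utf8_sequence_start(0x74));  // 't' = 01110100b, ASCII
static_assert(is_utf8_sequence_start(0xE2));  // 11100010b, leading byte of "€"
static_assert(!is_utf8_sequence_start(0x82)); // 10000010b, trailing byte of "€"
static_assert(!is_utf8_sequence_start(0xAC)); // 10101100b, trailing byte of "€"
```

Taking the argument as unsigned char keeps the comparison independent of whether plain char is signed on the target platform.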

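The longest UTF-8 sequence that does not exceed a requested byte count can be determined in constant time (O(1)) by applying the following algorithm:

1. If the input is no longer than the requested byte count, return the actual byte count.
2. Otherwise, loop towards the beginning (starting at the code unit one past the requested byte count) until we find the start of a sequence. Return the number of bytes to the left of that start.

Put into code: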

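#include <cstddef>
#include <string_view>
std::size_t find_max_utf8_length(std::string_view sv, std::size_t max_byte_count)
{
    // 1. Input no longer than max byte count
    if (sv.size() <= max_byte_count)
    {
        return sv.size();
    }
    // 2. Input longer than max byte count: step back while the code unit at
    //    the cut position is a trailing byte (10xxxxxxb). The lower bound only
    //    matters for malformed input that starts with a trailing byte.
    while (max_byte_count > 0
           && (static_cast<unsigned char>(sv[max_byte_count]) & 0b11000000) == 0b10000000)
    {
        --max_byte_count;
    }
    return max_byte_count;
}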

This test code

#include <cstddef>
#include <iostream>
#include <iomanip>
#include <string_view>
// Needs the definition of find_max_utf8_length from above. Written for C++17:
// since C++20, u8 literals have type char8_t and no longer convert to
// std::string_view.
int main()
{
    using namespace std::literals::string_view_literals;
    std::cout << "max size output\n=== ==== ======" << std::endl;
    auto test{u8"€«test»"sv};
    for (std::size_t count{0}; count <= test.size(); ++count)
    {
        auto byte_count{find_max_utf8_length(test, count)};
        std::cout << std::setw(3) << std::setfill(' ') << count
                  << std::setw(5) << std::setfill(' ') << byte_count
                  << " " << test.substr(0, byte_count) << std::endl;
    }
}

produces the following output:

max size output
=== ==== ======
  0    0 
  1    0 
  2    0 
  3    3 €
  4    3 €
  5    5 €«
  6    6 €«t
  7    7 €«te
  8    8 €«tes
  9    9 €«test
 10    9 €«test
 11   11 €«test»

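This algorithm operates on the UTF-8 encoding only. It does not attempt to process Unicode in any way. Given valid input it will always produce a valid UTF-8 encoded sequence, but the encoded code points may not form a meaningful Unicode grapheme.

The algorithm completes in constant time. Irrespective of the input size, the final loop will spin at most 3 times, given the current limit of at most 4 bytes per UTF-8 encoded code point. The algorithm would continue to work and complete in constant time even if the UTF-8 encoding were changed to allow up to 5 or 6 bytes per encoded code point.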
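
To illustrate the grapheme caveat, here is a minimal sketch using a decomposed "é". The wrapper truncate_utf8 is a hypothetical name that restates the algorithm above so the example is self-contained; it assumes valid UTF-8 input.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical convenience wrapper around the back-off loop shown earlier.
constexpr std::string_view truncate_utf8(std::string_view sv, std::size_t max_byte_count)
{
    if (sv.size() <= max_byte_count)
    {
        return sv;
    }
    while (max_byte_count > 0
           && (static_cast<unsigned char>(sv[max_byte_count]) & 0b11000000) == 0b10000000)
    {
        --max_byte_count;
    }
    return sv.substr(0, max_byte_count);
}

// 'e' (1 byte) followed by U+0301 COMBINING ACUTE ACCENT (0xCC 0x81):
// two code points, one user-perceived character.
constexpr std::string_view decomposed{"e\xCC\x81"};

// A 2-byte budget backs off to the code point boundary after 'e'.
// The result is valid UTF-8, but the accent is gone.
static_assert(truncate_utf8(decomposed, 2) == "e");
static_assert(truncate_utf8(decomposed, 3) == decomposed);
```

Keeping user-perceived characters intact would need grapheme cluster segmentation on top of this, which is outside the scope of the byte-level algorithm.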
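
The bound on the loop can be checked with a small counting sketch. The helper backoff_steps is hypothetical; it mirrors the loop of the algorithm above and assumes valid UTF-8 and a cut position inside the string.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of times the back-off loop runs for a given
// cut position. Requires cut < sv.size().
constexpr std::size_t backoff_steps(std::string_view sv, std::size_t cut)
{
    std::size_t steps{0};
    while (cut > 0
           && (static_cast<unsigned char>(sv[cut]) & 0b11000000) == 0b10000000)
    {
        --cut;
        ++steps;
    }
    return steps;
}

// U+1F600 is encoded in 4 bytes (0xF0 0x9F 0x98 0x80), followed by ASCII '!'.
constexpr std::string_view sample{"\xF0\x9F\x98\x80!"};

static_assert(backoff_steps(sample, 1) == 1);
static_assert(backoff_steps(sample, 2) == 2);
static_assert(backoff_steps(sample, 3) == 3); // worst case for a 4-byte sequence
static_assert(backoff_steps(sample, 4) == 0); // '!' already starts a sequence
```

The step count depends only on where the cut falls inside a single encoded code point, not on the length of the surrounding string.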

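Previous answer

This can be done in O(1) by breaking the problem down into the following cases:

1. The input is no longer than the requested byte count. In this case, simply return the input.
2. The input is longer than the requested byte count. Find out the relative position within the encoding at index max_byte_count - 1:
   1. If this is an ASCII character (highest bit not set, 0xxxxxxxb), we are at a natural boundary and can cut the string right after it.
   2. Otherwise, we are at the start, in the middle, or at the tail of a multi-byte sequence. To find out which, consider the following character. If it is an ASCII character (0xxxxxxxb) or the start of a multi-byte sequence (11xxxxxxb), we are at the tail of a multi-byte sequence, which is a natural boundary.
   3. Otherwise, we are at the start or in the middle of a multi-byte sequence. Iterate towards the beginning of the string until we find the start of a multi-byte encoding (11xxxxxxb). Cut the string before that character.

Given a maximum byte count, the following code calculates the length of the truncated string. The input needs to form a valid UTF-8 encoding.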

#include <cstddef>
#include <string_view>
std::size_t find_max_utf8_length(std::string_view sv, std::size_t max_byte_count)
{
    // 1. No longer than max byte count
    if (sv.size() <= max_byte_count)
    {
        return sv.size();
    }
    // 2. Longer than byte count
    if (max_byte_count == 0)
    {
        // Nothing fits; also avoids reading sv[max_byte_count - 1] out of bounds
        return 0;
    }
    auto c0{static_cast<unsigned char>(sv[max_byte_count - 1])};
    if ((c0 & 0b10000000) == 0)
    {
        // 2.1 ASCII
        return max_byte_count;
    }
    auto c1{static_cast<unsigned char>(sv[max_byte_count])};
    if (((c1 & 0b10000000) == 0) || ((c1 & 0b11000000) == 0b11000000))
    {
        // 2.2. At end of multi-byte sequence
        return max_byte_count;
    }
    // 2.3. At start or middle of multi-byte sequence
    unsigned char c{};
    do
    {
        --max_byte_count;
        c = static_cast<unsigned char>(sv[max_byte_count]);
    } while ((c & 0b11000000) != 0b11000000);
    return max_byte_count;
}

The following test code

#include <cstddef>
#include <iostream>
#include <iomanip>
#include <string_view>
// Needs the definition of find_max_utf8_length from above. Written for C++17:
// since C++20, u8 literals have type char8_t and no longer convert to
// std::string_view.
int main()
{
    using namespace std::literals::string_view_literals;
    std::cout << "max size output\n=== ==== ======" << std::endl;
    auto test{u8"€«test»"sv};
    for (std::size_t count{0}; count <= test.size(); ++count)
    {
        auto byte_count{find_max_utf8_length(test, count)};
        std::cout << std::setw(3) << std::setfill(' ') << count
                  << std::setw(5) << std::setfill(' ') << byte_count
                  << " " << test.substr(0, byte_count) << std::endl;
    }
}

produces this output:

max size output
=== ==== ======
  0    0 
  1    0 
  2    0 
  3    3 €
  4    3 €
  5    5 €«
  6    6 €«t
  7    7 €«te
  8    8 €«tes
  9    9 €«test
 10    9 €«test
 11   11 €«test»