Is it possible to get Boost.Locale boundary analysis to split on apostrophes?

Keywords: split, boost, locale, boundary analysis. Last updated: 2023-10-16

For example, consider the following code:

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    using namespace boost::locale::boundary;
    boost::locale::generator gen;
    std::string text = "L'homme qu'on aimait trop.";
    ssegment_index map(word, text.begin(), text.end(), gen("fr_FR.UTF-8"));
    // print every segment (words, spaces and punctuation) in double quotes
    for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
        std::cout << "\"" << *it << "\", ";
    std::cout << std::endl;
}

This outputs:

"L'homme", " ", "qu'on", " ", "aimait", " ", "trop", ".",

Is it possible to customize the boundary analysis so that it outputs the following instead?

"L", "'", "homme", " ", "qu", "'", "on", " ", "aimait", " ", "trop", ".",

I have read http://www.boost.org/doc/libs/1_56_0/libs/locale/doc/html/boundary_analysys.html and searched Stack Overflow and Google, but so far I have not found anything.

I have not found a way to do this with boost::locale::boundary, but it can be done directly with ICU by creating a custom RuleBasedBreakIterator instead of using the one returned by createWordInstance.

#include <unicode/locid.h>
#include <unicode/rbbi.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

using namespace icu;

int main() {
    Locale locale("fr_FR");
    UErrorCode statusError = U_ZERO_ERROR;
    UParseError parseError = { 0 };
    // get rules from a default rbbi (these should be in a word.txt file somewhere)
    RuleBasedBreakIterator *default_rbbi = dynamic_cast<RuleBasedBreakIterator *>(BreakIterator::createWordInstance(locale, statusError));
    if (U_FAILURE(statusError) || default_rbbi == nullptr) return 1;
    UnicodeString rules = default_rbbi->getRules();
    delete default_rbbi;
    // create custom rbbi with updated rules: remove the apostrophe characters from
    // the MidNumLet set (this relies on the exact text of the default rules)
    // note the doubled backslashes: ICU, not the C++ compiler, must see \p and \u
    rules.findAndReplace("[\\p{Word_Break = MidNumLet}]", "[[\\p{Word_Break = MidNumLet}] - [\\u0027 \\u2018 \\u2019 \\uff07]]");
    RuleBasedBreakIterator custom_rbbi(rules, parseError, statusError);
    if (U_FAILURE(statusError)) return 1;
    // tokenize text
    UnicodeString text = "L'homme qu'on aimait trop.";
    custom_rbbi.setText(text);
    int32_t e, p = custom_rbbi.first();
    while ((e = custom_rbbi.next()) != BreakIterator::DONE) {
        std::string substring;
        text.tempSubStringBetween(p, e).toUTF8String(substring);
        std::cout << "\"" << substring << "\", ";
        p = e;
    }
    std::cout << std::endl;
    return 0;
}
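
If rewriting the break rules is not an option, another possibility is to keep the default word segmentation and post-process each segment afterwards. This is a different technique from the custom RuleBasedBreakIterator above, and it is only a minimal sketch. The helper name split_on_apostrophes is hypothetical (it is not part of Boost.Locale or ICU), the input is assumed to be UTF-8, and only U+0027 and U+2019 are treated as apostrophes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost.Locale or ICU): splits one UTF-8
// segment around apostrophes, keeping each apostrophe as its own piece.
std::vector<std::string> split_on_apostrophes(const std::string& segment) {
    // Apostrophe forms handled here (an assumption of this sketch):
    // U+0027 and U+2019, written as UTF-8 byte sequences.
    static const std::string marks[] = {"'", "\xE2\x80\x99"};

    std::vector<std::string> pieces;
    std::string current;
    std::size_t i = 0;
    while (i < segment.size()) {
        // Check whether an apostrophe starts at byte offset i.
        std::size_t matched = 0;
        for (const std::string& mark : marks) {
            if (segment.compare(i, mark.size(), mark) == 0) {
                matched = mark.size();
                break;
            }
        }
        if (matched == 0) {
            current += segment[i++];  // ordinary byte: extend the current piece
            continue;
        }
        if (!current.empty()) {
            pieces.push_back(current);
            current.clear();
        }
        pieces.push_back(segment.substr(i, matched));  // the apostrophe itself
        i += matched;
    }
    if (!current.empty()) pieces.push_back(current);

    // Sanity check: the pieces must cover the input exactly.
    std::size_t total = 0;
    for (const std::string& piece : pieces) total += piece.size();
    assert(total == segment.size());
    return pieces;
}
```

In the Boost.Locale loop from the question, each segment would be converted to a std::string and passed through this helper before printing, so that "L'homme" becomes "L", "'", "homme". Unlike a rule-based change, this splits at every apostrophe, including ones inside words such as "aujourd'hui", so it may need an exception list for real text.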