Using RegEx to split up a text into single words in Embarcadero's C++ Builder

Keywords: text splitting, single words, regular expressions, Embarcadero C++. Updated: 2023-10-16

I am working on a spell checker using Embarcadero's C++ Builder. I use RegEx to split the text up into single words. The code below works fine in RAD Studio XE, but behaves differently in RAD Studio Seattle.

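The problem occurs when a word contains non-Latin characters such as German umlauts (Ä, Ö, Ü) or accented characters (é, ê). "\w" is interpreted as [a-zA-Z_0-9], so non-Latin characters are ignored.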

First of all, what is a word in my context? Possible words include:

"\r\n"
"word-word-word-word..."
"word." or "word-"
words with apostrophes: "'word", "word'" or "wo'rd"
"word"
There are two different kinds of apostrophes: ' and ’

Here is the code:

String text (L"Österreich l'année");
// Backslashes are doubled so the regex engine, not the C++ compiler, sees \w, \- and \.
const String sRegex (L"\r\n|(\\w+\\-)+\\w+|\\w+(\\.|\\-)|('|’)?\\w+('|’)?\\w*");
TRegEx regex(sRegex, TRegExOptions());
TMatchCollection regexMatches = regex.Matches(text);
// Each match is one candidate word for the spell checker
for (int i=0; i<regexMatches.Count; ++i)
{
    TMatch regexMatch = regexMatches.Item[i];
    String word (regexMatch.Value);
    //do stuff with word
}

The expected values for String word are "Österreich" and "l'année". However, the RegEx matches "sterreich", "l'ann" and "e".

My question is: how do I specify all non-Latin characters?

\p{L} matches Unicode letters. Try using it instead of \w.

See regex101.

If you want digits as well (as with \w), add \d to the group.
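To make the substitution concrete, here is a minimal sketch in standard C++ that rebuilds the question's pattern from a configurable "word character" class, so \w can be swapped for [\p{L}\d] in one place. The helper name makeWordPattern is hypothetical and not from the original post, and the sketch assumes the PCRE build behind TRegEx accepts Unicode properties such as \p{L}.

```cpp
#include <string>

// Hypothetical helper (not part of the original post): assembles the
// question's alternation from a caller-supplied word-character class,
// e.g. L"\\w" for the original behaviour or L"[\\p{L}\\d]" for
// Unicode letters plus digits.
std::wstring makeWordPattern(const std::wstring& wc)
{
    // Optional straight (') or typographic (U+2019) apostrophe
    const std::wstring apos = L"('|\u2019)?";

    std::wstring p;
    p += L"\r\n";                                    // a line break is a token
    p += L"|(" + wc + L"+\\-)+" + wc + L"+";         // word-word-word...
    p += L"|" + wc + L"+(\\.|\\-)";                  // "word." or "word-"
    p += L"|" + apos + wc + L"+" + apos + wc + L"*"; // words with apostrophes
    return p;
}

// Assumed usage in C++ Builder (untested here):
//   const std::wstring pattern = makeWordPattern(L"[\\p{L}\\d]");
//   TRegEx regex(String(pattern.c_str()), TRegExOptions());
```

Passing L"\\w" reproduces the pattern from the question unchanged, which makes it easy to compare both variants against the same input text.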