Bug in std::regex?

Keywords: bug, regex, std. Last updated: 2023-10-16

Here is the code:

#include <string>
#include <regex>
#include <iostream>
int main()
{
    std::string pattern("[^c]ei");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
    std::regex r(pattern); 
    std::smatch results;   
    std::string test_str = "cei";
    if (std::regex_search(test_str, results, r)) 
        std::cout << results.str() << std::endl;      
    return 0;
}

Output:

cei

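The compiler is GCC 4.9.1.

I am a newbie learning regular expressions. I expected nothing to be printed, because "cei" does not match the pattern here. Am I doing this right? What is going wrong?

Update:

This has been reported and confirmed as a bug. For details, see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63497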

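This is a bug in the implementation. Not only do several other tools I tried agree that your pattern does not match your input, but I also tried this: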

#include <string>
#include <regex>
#include <iostream>
int main()
{
  std::string pattern("([a-z]*)([a-z])(e)(i)([a-z]*)");
  std::regex r(pattern);
  std::smatch results;
  std::string test_str = "cei";
  if (std::regex_search(test_str, results, r))
  {
    std::cout << results.str() << std::endl;
    for (size_t i = 0; i < results.size(); ++i) {
      std::ssub_match sub_match = results[i];
      std::string sub_match_str = sub_match.str();
      std::cout << i << ": " << sub_match_str << '\n';
    }
  }
}

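This is essentially the same as yours, but I replaced [[:alpha:]] with [a-z] for simplicity, and I also temporarily replaced [^c] with [a-z], because that seems to make it work correctly. Here is what it prints (GCC 4.9.0 on Linux x86-64):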

cei
0: cei
1:
2: c
3: e
4: i
5:

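If I take the [a-z] that I substituted for your [^c] and simply put f there instead, it correctly reports that the pattern does not match. But if I use [^c] as you did: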

std::string pattern("([a-z]*)([^c])(e)(i)([a-z]*)");

Then I get this output:

cei
0: cei
1: cei
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Aborted (core dumped)

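So it claims a successful match, and results[0] is "cei" as expected. Then results[1] is also "cei", which I suppose might be acceptable. But then results[2] crashes, because it tries to construct a std::string of length 18446744073709551614 with begin=nullptr. That huge number is exactly 2^64 - 2, that is, std::string::npos - 1 (on my system).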

So I think there is a bug somewhere, and its impact may go beyond a spurious regex match: it can crash at runtime.

The regular expression is correct and should not match the string "cei".

Regular expressions are best tested and explained in Perl:

 my $regex = qr{                 # start regular expression
                 [[:alpha:]]*    # 0 or any number of alpha chars
                 [^c]            # followed by NOT-c character
                 ei              # followed by e and i characters
                 [[:alpha:]]*    # followed by 0 or any number of alpha chars    
               }x;               # end + declare 'x' mode (ignore whitespace)
 print "xei" =~ /$regex/ ? "match\n" : "no match\n";
 print "cei" =~ /$regex/ ? "match\n" : "no match\n";

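The regex first consumes all characters up to the end of the string ([[:alpha:]]*), then backtracks to find the non-c character [^c], and proceeds with the e and i matches (through further backtracking).

Result: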

 "xei"  -->  match
 "cei"  -->  no match

The reason is obvious. Any deviation from this in the various C++ libraries and testing tools is a problem in the implementation, imho.