Iterate through captures with boost::regex

Keywords: iterate, regex, boost, use. Last updated: 2023-10-16

I have a regular expression, used with boost::regex, that captures three fields from an HTML tag:

"\/\/(.{1,3}?)\.wikipedia\.[a-z]+\/wiki\/(.*?)\s*>(.*?)<"

So, from

<a href="//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch</a>

I get

Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de"

But I want {de, Porky%E2%80%99s, Deutsch}.

How can I make my regex stop matching the second field as soon as it reaches the first whitespace?

I tried

"\/\/(.{1,3}?)\.wikipedia\.[a-z]+\/wiki\/(\S*?)*>(.*?)<"

so that the second field matches everything except whitespace, but I get this crash report:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
  what():  Ran out of stack space trying to match the regular expression.

This might work:

"//(.{1,3}?)\.wikipedia\.[a-z]+/wiki/([^\s>"]*).*?>(.*?)<"

I would use this instead:

"//(.{1,3}?)\.wikipedia\.[a-z]+/wiki/([^\s>"]*)[^>]*>(.*?)<"

Formatted (expanded, one token per line):

 //
 ( .{1,3}? )                   # (1)
 \.
 wikipedia
 \. 
 [a-z]+ 
 /wiki/
 ( [^\s>"]* )                  # (2)
 [^>]* 
 >
 ( .*? )                       # (3)
 <
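
A minimal sketch of running the pattern above from C++ on a single tag. It uses `std::regex` from the standard library as a stand-in so that it compiles without Boost; `boost::regex`, `boost::smatch`, and `boost::regex_search` from `<boost/regex.hpp>` are called the same way. The `WikiLink` struct and the `first_wiki_link` function are hypothetical names introduced only for illustration. The raw string literal avoids having to escape the backslashes and the `"` inside the character class.

```cpp
#include <regex>
#include <string>

// Hypothetical container for the three captured fields.
struct WikiLink {
    std::string lang;   // group 1, e.g. "de"
    std::string page;   // group 2, e.g. "Porky%E2%80%99s"
    std::string label;  // group 3, e.g. "Deutsch"
};

// Hypothetical helper: fills `out` from the first matching link in `html`.
// Returns false when no link matches.
bool first_wiki_link(const std::string& html, WikiLink& out) {
    // Group 2 stops at the first whitespace, '>' or '"';
    // [^>]* then skips the remaining attributes up to the closing '>'.
    static const std::regex re(
        R"re(//(.{1,3}?)\.wikipedia\.[a-z]+/wiki/([^\s>"]*)[^>]*>(.*?)<)re");

    std::smatch m;
    if (!std::regex_search(html, m, re)) {
        return false;
    }
    out = WikiLink{m[1].str(), m[2].str(), m[3].str()};
    return true;
}
```

Calling `first_wiki_link` on the sample `<a>` tag from the question should fill the struct with the three fields the question asks for.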

Output (match groups for the sample tag):

 **  Grp 0 -  ( pos 9 , len 98 ) 
//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch<  
 **  Grp 1 -  ( pos 11 , len 2 ) 
de  
 **  Grp 2 -  ( pos 33 , len 15 ) 
Porky%E2%80%99s  
 **  Grp 3 -  ( pos 99 , len 7 ) 
Deutsch
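
To walk through every matching link in a larger piece of HTML, as the title asks, a regex iterator can visit the matches one at a time. The sketch below again assumes `std::regex` as a stand-in so it builds without Boost; Boost offers `boost::sregex_iterator` with the same interface. `Fields` and `all_wiki_links` are hypothetical names, not part of either library.

```cpp
#include <array>
#include <regex>
#include <string>
#include <vector>

// Hypothetical alias: {language, page, label} for one link.
using Fields = std::array<std::string, 3>;

// Hypothetical helper: collects the three capture groups of every match.
std::vector<Fields> all_wiki_links(const std::string& html) {
    static const std::regex re(
        R"re(//(.{1,3}?)\.wikipedia\.[a-z]+/wiki/([^\s>"]*)[^>]*>(.*?)<)re");

    std::vector<Fields> links;
    // A default-constructed iterator marks the end of the match sequence.
    for (std::sregex_iterator it(html.begin(), html.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        links.push_back(Fields{{m[1].str(), m[2].str(), m[3].str()}});
    }
    return links;
}
```

Each element of the returned vector holds the capture groups of one match, in document order.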