Removing BBcode from a string

Keywords: BBcode, remove, string      Updated: 2023-10-16

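So this question seems to have been asked for just about every language... except C++. I have an XML document that stores some BBcode in its text nodes. I'm looking for the best way to remove it, and I thought I would check here to see whether anyone knows of a pre-built library or an efficient way of doing it myself. I was thinking of removing any characters that fall between the '[' and ']' characters, but that gets crazy with the XML document I was given, because many instances of the BBcode have the form '[[blahblahblah]]' and some have the form '[blahblahblah]'.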

Here is the XML document. All of the data between the <text> tags gets added to a string. Any suggestions?

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
<page>
<title>Human Anatomy/Osteology/Axialskeleton</title>
<ns>0</ns>
<id>181313</id>
<revision>
<id>1481605</id>
<parentid>1379871</parentid>
<timestamp>2009-04-26T02:03:12Z</timestamp>
<contributor>
<username>Adrignola</username>
<id>169232</id>
</contributor>
<minor />
<comment>+Category</comment>
<sha1>hvxozde19haz4yhwj73ez82tf2bocbz</sha1>
<text xml:space="preserve"> [[Image:Axial_skeleton_diagram.svg|thumb|240px|right|Diagram of the axial skeleton]]
The Axial Skeleton is a division of the human skeleton and is named because it makes up the longitudinal ''axis'' of the body. It consists of the skull, hyoid bone, vertebral column, sternum and ribs. It is widely accepted to be made up of 80 bones, although this number varies from individual to individual.
[[Category:{{FULLBOOKNAME}}|{{FULLCHAPTERNAME}}]]</text>
</revision>
</page>
<page>
<title>Horn/General/Fingering Chart</title>
<ns>0</ns>
<id>23346</id>
<revision>
<id>1942387</id>
<parentid>1734837</parentid>
<timestamp>2010-10-02T20:21:09Z</timestamp>
<contributor>
<username>Nat682</username>
<id>144010</id>
</contributor>
<comment>added important note</comment>
<sha1>lana7m8m9r23oor0nh24ky45v71sai9</sha1>
<text xml:space="preserve">{{HornNavGeneral}}
The horn spans four plus octaves depending on the player and uses both the treble and bass clefs. In this chart it is assumed the player is using a double-horn with F and Bb sides. The number 1 indicates that the index-finger valve should be depressed, the number 2 indicates that the middle-finger valve should be depressed and the number 3 indicates that the ring-finger valve should be depressed. There are eight possible valve combinations among the first, second and third valves: 0, 1, 2, 3, 1-2, 1-3, 2-3, and 1-2-3. However, there are effectively seven combinations, because 1-2 will produce the same notes, perhaps slightly out of tune, as 3 alone. One depresses the thumb key to use the Bb side of the horn.
[[Image:Fingering chart.png]]
[[Category:Horn]]</text>
</revision>
</page>
</mediawiki>

So if you look at the bottom of each <page> tag, you will see things like [[Category:{{FULLBOOKNAME}}|{{FULLCHAPTERNAME}}]], and that is what I want to remove.

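I'm assuming the data is given to you in the form of an iterator that you can read from. If you are given it as a std::string instead, getting an iterator you can read from is pretty easy.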

In that case, what you want is a Boost filter_iterator: http://www.boost.org/doc/libs/1_39_0/libs/iterator/doc/filter_iterator.html

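The filter function you want is quite simple. Keep a count that goes up by one for every [ you see and down by one for every ] you see (stopping at 0). While the count is positive, you filter the character out. The ] that brings the count back to 0 is filtered out as well.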

If you cannot use Boost, but you are getting the data from a std::string, it is a bit trickier. But only a bit. std::copy_if works.

If you are using C++11, a lambda makes this really easy. If not, you will have to write your own functor that counts the [s.
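A minimal sketch of the C++11 route, assuming the input is already in a std::string. The function name StripBracketed is made up for illustration and is not part of the original answer. Note that std::copy_if keeps the characters for which the predicate returns true, so the lambda answers "keep?" rather than "skip?". The depth counter is captured by reference so that it lives outside the lambda and is unaffected if the algorithm copies its predicate.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper (name is an assumption): returns a copy of `input`
// with every [...] span removed, including nested ones.
std::string StripBracketed(const std::string& input) {
    std::string output;
    output.reserve(input.size());
    std::size_t depth = 0;  // number of '[' currently open

    std::copy_if(input.begin(), input.end(), std::back_inserter(output),
                 [&depth](char c) -> bool {
                     if (c == '[') {
                         ++depth;
                         return false;  // drop the opening bracket
                     }
                     if (depth > 0) {
                         if (c == ']') --depth;
                         return false;  // drop everything inside, and the closer
                     }
                     return true;  // outside any brackets: keep
                 });
    return output;
}
```

With this sketch, a stray ] that has no matching [ is left in the output, and an unclosed [ removes everything after it.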

As a concrete example for a simple case: you are given a std::string and want to produce a std::string without any []-delimited content.

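#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Predicate: returns true for every character inside [...], brackets included.
struct SquareBracketStripper
{
    enum { open_bracket = '[', close_bracket = ']' };
    std::size_t count;  // current nesting depth
    SquareBracketStripper() : count(0) {}
    bool operator()(char c)
    {
        bool skip = (count > 0) || c == open_bracket;
        if (c == open_bracket) {
            ++count;
        } else if (c == close_bracket && count > 0) {
            --count;
        }
        return skip;
    }
};

std::string FilterBBCode(std::string input) {
    SquareBracketStripper stripper;
    // std::ref keeps the count in a single object even if std::remove_if
    // copies its predicate internally (common implementations do).
    // erase takes (first, last): the new logical end, then input.end().
    input.erase(std::remove_if(input.begin(), input.end(), std::ref(stripper)),
                input.end());
    return input;
}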

This handles nested [s of arbitrary depth.

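The benefit of filter_iterator is that you never have to load the whole string into memory, which is useful if you don't know how malformed the input might be. There is no need to load several terabytes of data from disk into memory just to filter out the [] when you can stream the data and filter it on the fly. But your use case may not really care.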