Recognizing the end of a sentence
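I am trying to read a text file and feed it, one string at a time, into a vector of strings. I need it to stop at the end of each sentence so that I can then pick out keywords in that sentence. I know how to find the keywords, but not how to make it stop reading strings at the end of a sentence. I am using a while loop to check each line, and I was thinking of using a series of if statements, such as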
if(std::vector<string>::iterator i == ".") i == " "
The code I have so far for filling the vector is:
std::vector<std::string> a;
std::string c;
std::ifstream infile;
infile.open("example.txt");
while (infile >> c) {
    a.push_back(c);
}
OK, so I worked out a way to load each word of the text file into tokens, treating " " as the delimiter, along with a list of special-case words:
const int MAX_PER_LINE = 512;
const int MAX_TOK = 20;
const char* const DELIMETER = " -";
const char* const SPECIAL ="!?.";
const char* const ignore[] = {"Mr.", "Ms.","Mrs.","sr.", "Ave.", "Rd."};
Then:
if (!file.good()) {
    return 1;
}
// parsing algorithm paraphrased from cs.dvc.edu/HowTo_Parse.html
char line[MAX_PER_LINE];
// Loop on getline itself so the loop ends at EOF or on a read error.
while (file.getline(line, MAX_PER_LINE)) {
    int n = 0;
    const char* token[MAX_TOK] = {};
    token[0] = strtok(line, DELIMETER);
    if (token[0]) {
        for (n = 1; n < MAX_TOK; ++n) {
            token[n] = strtok(0, DELIMETER);
            if (!token[n]) break;
        }
    }
    for (int i = 0; i < n; ++i) {
        cout << "Token[" << i << "] =" << token[i] << endl;
        cout << endl;
    }
}
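Now I am looking for what to put in an if statement so that it checks each token for a special case and, if the token follows one with a special case, loads it into a new set of tokens. I mostly know the pseudocode, but I don't know how to write this: if token[i] contains a special case, or nothing precedes token[i] (for the first token), or it is capitalized and follows a token with a special case, then load it into a new token.
Any help would be greatly appreciated.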
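Writing your own sentence splitter is fine for small projects or for projects without internationalization. For an advanced solution based on text boundaries, I recommend ICU's BreakIterator. It follows the unicode.org standards and provides character, word, line-break, and sentence boundaries. ICU has open-source libraries for C++ (and, I believe, Java). See the ICU site, which links to the library download page.
This avoids reinventing the wheel and avoids potential problems later. Most leading publishing software products, such as QuarkXPress, use this library.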
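Edit: I tried to find a quick tutorial on using ICU's BreakIterator for sentence boundaries, but I only found a word-boundary example. Sentence-boundary computation should be very similar; you would probably need to replace createWordInstance below with createSentenceInstance.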
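#include <cstdio>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>
using namespace icu;
void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    if (U_FAILURE(status)) {
        return;  // the iterator could not be created
    }
    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}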
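Finding words that end with a period is simple: just check whether word.back() == '.'. You also need to check word.empty() first, because calling back() on an empty string is undefined behavior. If your compiler does not support C++11, you can use the longer form word[word.size() - 1] == '.' instead.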
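As a small sketch of that check in C++11, with the empty test done before calling back(). The helper name ends_with_period is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: true if `word` is non-empty and its last character is '.'.
// The empty() test must come first, because back() on an empty string is
// undefined behavior.
bool ends_with_period(const std::string& word) {
    return !word.empty() && word.back() == '.';
}
```

Thanks to short-circuit evaluation of &&, back() is never reached for an empty string.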
Here is a basic example that naively splits sentences at any word ending with ".":
#include <iostream>
#include <string>
#include <vector>
int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }
    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* use word.back() == '.' for C++11 */
        if (!word.empty() && word[word.size() - 1] == '.') {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    /* Keep any trailing words that were not closed by a period. */
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }
    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
Run it like this:
$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test.
And a second sentence.
So we meet again Mr.
Bond.
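Note how it treats "Mr." as the end of a sentence.
I'm not sure of a clever way to handle this, but one (fragile) option is to keep a list of words that do not end a sentence and check whether each word is in that list, like so: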
#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>
const std::string tmp[] = {
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> ABBREVIATIONS(tmp, tmp + sizeof(tmp) / sizeof(tmp[0]));
bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}
bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    /* Check if the word is an abbreviation. */
    return ABBREVIATIONS.find(word) != ABBREVIATIONS.end();
}
int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }
    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* A period ends the sentence unless the word is a known abbreviation. */
        if (has_period(word) && !is_abbreviation(word)) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }
    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
In C++11, you could use unordered_set for efficiency, and simplify the code by using std::string::back and a simpler initialization (std::set<std::string> ABBREVIATIONS = { "dr.", "mr.", "mrs." /*etc.*/ }).
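A minimal sketch of those C++11 simplifications, assuming the same abbreviation list as the program above. Only the helpers are shown, and the lambda around std::tolower is an addition here, used to avoid passing a negative char value to it.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Same abbreviations as above, brace-initialized (C++11).
const std::unordered_set<std::string> ABBREVIATIONS = {
    "dr.", "mr.", "mrs.", "ms.", "rd.", "st."
};

bool has_period(const std::string& word) {
    // back() is only reached when the string is non-empty.
    return !word.empty() && word.back() == '.';
}

bool is_abbreviation(std::string word) {
    // Lowercase the copy so "Mr." and "MR." both match "mr.".
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ABBREVIATIONS.count(word) != 0;
}
```

The main function from the program above can use these two helpers unchanged.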
Running this version:
$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test.
And a second sentence.
So we meet again Mr. Bond.
But, of course, it still doesn't catch any case we haven't explicitly programmed:
$ ./a.out Example Ave. is just north of here.
Example Ave.
is just north of here.
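Even if we add it, it is hard to detect cases such as "I live on Example Ave.", where the sentence ends with an abbreviation. I hope this is helpful as a start.
Edit: I just read the Wikipedia article on sentence breaking that was linked in the comments, and incorporating its rule is relatively easy:
(c) If the next token is capitalized, then it ends a sentence.
Like this: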
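Since the question reads words with infile >> c, the same splitting loop can take its words from a stream instead of argv. The following is only a sketch: the function name split_sentences is hypothetical, it uses the naive "ends with a period" rule for brevity, and unlike the programs above it does not leave a trailing space on each sentence.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated words from `in` and groups
// them into sentences, ending a sentence at any word whose last character
// is '.'. This is the naive rule, so "Mr." still ends a sentence.
std::vector<std::string> split_sentences(std::istream& in) {
    std::vector<std::string> sentences;
    std::string current;
    std::string word;
    while (in >> word) {
        if (!current.empty()) {
            current.push_back(' ');
        }
        current.append(word);
        // A word extracted with >> is never empty, so indexing is safe.
        if (word[word.size() - 1] == '.') {
            sentences.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        sentences.push_back(current);  // trailing words without a final period
    }
    return sentences;
}
```

With the question's file, this could be called as split_sentences(infile) after opening std::ifstream infile("example.txt"). The abbreviation check would slot into the same if condition.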
#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>
const std::string tmp[] = {
    "ave.",
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> PERIOD_WORDS(tmp, tmp + sizeof(tmp) / sizeof(tmp[0]));
bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}
bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    /* Check if the word is a word that ends with a period. */
    return PERIOD_WORDS.find(word) != PERIOD_WORDS.end();
}
bool is_capitalized(const std::string& word) {
    /* Cast to unsigned char: isupper is undefined for negative char values. */
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}
int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }
    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        std::string next_word(i + 1 < argc ? argv[i + 1] : "");
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* End the sentence at the last word, or at a period that is either
         * not an abbreviation or is followed by a capitalized word. */
        if (next_word.empty()
            || (has_period(word)
                && (!is_abbreviation(word) || is_capitalized(next_word)))) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
Then even cases like this work:
$ ./a.out Example Ave. is just north of here. I live on Example Ave. Test test test.
Example Ave. is just north of here.
I live on Example Ave.
Test test test.
But it still fails to handle some cases:
$ ./a.out Mr. Adams lives on Example Ave. Example Ave. is just north of here. I live on Example Ave. Test test test.
Mr.
Adams lives on Example Ave.
Example Ave. is just north of here.
I live on Example Ave.
Test test test.
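The "Mr. Adams" split happens because every abbreviation is treated the same way, and a title like "Mr." is normally followed by a capitalized name. One possible refinement, sketched here under the assumption that titles never end a sentence, is to keep titles in a separate list. The names TITLES, OTHER_ABBREVIATIONS, to_lower and ends_sentence are illustrative and not part of the answer above, and the result is still a heuristic ("St." can mean both "Saint" and "Street", for example).

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Assumption: a title is always followed by a name, so it never ends a sentence.
const std::set<std::string> TITLES = {"dr.", "mr.", "mrs.", "ms."};
// Abbreviations that may or may not end a sentence.
const std::set<std::string> OTHER_ABBREVIATIONS = {"ave.", "rd.", "st."};

std::string to_lower(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

bool is_capitalized(const std::string& word) {
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}

// Decides whether `word` ends a sentence, given the word that follows it
// (pass an empty string when `word` is the last one).
bool ends_sentence(const std::string& word, const std::string& next_word) {
    if (next_word.empty()) return true;                    // end of input
    if (word.empty() || word.back() != '.') return false;  // no period
    const std::string lower = to_lower(word);
    if (TITLES.count(lower)) return false;                 // "Mr. Adams"
    if (OTHER_ABBREVIATIONS.count(lower)) return is_capitalized(next_word);
    return true;                                           // ordinary word + '.'
}
```

In the loop above, the if condition would become ends_sentence(word, next_word).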