C++,转换字符串,使连续下划线序列变为单个下划线

C++, Transform a string such that sequences of consecutive underscores become a single underscore

本文关键字:下划线 单个 连续 转换 字符串 C++      更新时间:2023-10-16

在C++中,编写一个函数来转换字符串,使连续下划线序列成为单个下划线。 例如 ( ‘_hello___world__’ => ‘_hello_world_’ )。

与以下问题相关:在 c++ 中将多个字符合并为一个字符。 ABCBC -> ABC

使用 erase/unique 和 C++11 lambda。

#include <algorithm>
#include <iostream>
#include <string>
int main()
{
    std::string text("_hello___world__");
    text.erase(
        std::unique(
            text.begin(),
            text.end(),
            [](char a, char b){ return (a == b) && (a == '_'); }
        ),
        text.end()
    );
    std::cout << text << 'n';
    return 0;
}

如果您不想使用 lambda,您可以定义一个函子,例如:

class both_equal_to
{
    char value;
public:
    both_equal_to(char ch) : value(ch) {}
    bool operator()(char first, char second) const
    {
        return (first == second) && (first == value);
    }
};

然后将 lambda 替换为 both_equal_to('_') .

如果您只使用 char*并且不想支付构造std::string的成本,则以下代码更接近 RolandXu 的代码。

char *convert_underscores(char *str)
{
    if (str)
    {
        size_t length = strlen(str);
        char *end = std::unique(str, str + length, both_equal_to('_'));
        *end = '';
    }
    return str;
}

没有库:

#include <stdio.h>
char *RemoveUnderScore(char *src)
{
    char* readIndex;
    char* writeIndex;
    if (src==NULL) return NULL;
    readIndex = writeIndex = src;
    while (*readIndex != '')
    {
        while(*readIndex !='_' && *readIndex != '')
            *writeIndex++ = *readIndex++;
        if (*readIndex != '')
            *writeIndex++ = *readIndex++;
        while (*readIndex == '_')
            readIndex++;
    }
    *writeIndex = '';
    return src;
}
int main(int argc,char** argv){
    char str[] = "_hello___worl__d___";
    printf(RemoveUnderScore(str));
    return 0;
}

这篇文章比较了此页面上提交的方法的速度。 我在 40 个字符的字符串上运行了该函数一百万次,并计时了每种算法花费的时间。

所需时间 |使用的算法

0.2 秒 |RolandXu的版本使用char*,杂耍指向char的指针。

0.4 秒 |高炉的版本第二部分,函子,无弦。

2.7 秒 |高炉的版本第一部分,带有函子和字符串。

8.7 秒 |Eric L 的版本在字符串上循环,确实找到 __ 替换为 _。

11.0 秒 |Eric L 的版本循环遍历每个字符并组装一个字符串。

11.8 秒 |杰瑞·科芬的版本与remove_copy_if。

C++代码来证明上述基准和计时:

#include <iostream>
#include <cstdio>
#include <ctime>
#include <cstring>
#include <algorithm>
using namespace std;
string convert_underscores_by_EricL_using_string_replace(string str){
    //Cons:
        //This is the very slowest algorithm because the complexity of this operation
        //is O(n^2) and possibly higher with the overhead of string conversions.
    //Pros:
        //This is the function is the most concise, needing only 3 lines of code.
    while(str.find("__") != string::npos){
        str = str.replace(str.find("__"), 2, "_");
    }
    return str;
}

string convert_underscores_EricL_loop_over_a_string_and_remove_repeats(string str){
    //CONS:
        //Using a string really slows things down. Algorithm is too slow.
        //Not the most concise solution, 8 lines.
        //Has lots of ugly conditionals, x, and x-1, confusing to look at.
    //PROS:
        //Fastest function of those tested.  
    int len = str.length();
    string result = "";
    if (len < 2) return str;
    result += str[0];
    for(int x = 1; x < len; x++){
        if (str[x] != str[x-1] || str[x] != '_')
            result += str[x];
    }
    return result;
}


class repeated_by_jerry_coffin {
char prev;
char val;
public:
    repeated_by_jerry_coffin(char ch) : val(ch), prev(0) {}
    bool operator()(char ch) {
        bool ret = prev == val && ch == val;
        prev = ch;
        return ret;
    }
};
string convert_underscores_jerry_coffins_with_remove_copy_if(string str){
    //CONS:
        //Algorithm is the 2nd slowest algorithm.
    //PROS:
        //Concise, intuitive, needing only 4 lines.
        //Offloads the heavy lifting to std builtin methods: std::remove_copy_if and std::back_inserter
    string result = "";
    std::remove_copy_if(str.begin(), str.end(),
        std::back_inserter(result),
        repeated_by_jerry_coffin('_'));
    return result;
}

char* convert_underscores_by_RolandXu_using_char_stars_and_pointers(char *src){
    //CONS:
        //You have to get your hands dirty with C++ pointers.
    //PROS:
        //Fastest function of those tested.  
    char* readIndex;
    char* writeIndex;
    if (src==NULL) return NULL;
    readIndex = writeIndex = src;
    while (*readIndex != '')
    {
        while(*readIndex !='_' && *readIndex != '')
            *writeIndex++ = *readIndex++;
        if (*readIndex != '')
            *writeIndex++ = *readIndex++;
        while (*readIndex == '_')
            readIndex++;
    }
    *writeIndex = '';
    return src;
}

class both_equal_to__blastfurnace_version1{
char value;
public:
    both_equal_to__blastfurnace_version1(char ch) : value(ch) {}
    bool operator()(char first, char second) const{
        return (first == second) && (first == value);
    }
};
string convert_underscores_blastfurnace_version1_with_functor(string str){
    //CONS:
        //You have to create an entirely new class that overloads an operator.  
        //The speed is harmed by the usage of string.
    //PROS:
        //Don't need to roll your own loops with pointers.
    str.erase(
        std::unique(
            str.begin(),
            str.end(),
            both_equal_to__blastfurnace_version1('_')
        ),
        str.end()
    );
    return str;
}


class both_equal_to_blastfurnace_version2{
char value;
public:
    both_equal_to_blastfurnace_version2(char ch) : value(ch) {}
    bool operator()(char first, char second) const{
        return (first == second) && (first == value);
    }
};
char *convert_underscores_blastfurnace_version2_std_unique_and_no_string(char *str){
    //CONS:
        //No problems.
    //PROS:
        //More concise/intuitive than the fastest version and nearly as fast.  Winner!
    if (str){
        size_t length = strlen(str);
        char *end = std::unique(str, str + length, both_equal_to_blastfurnace_version2('_'));
        *end = '';
    }
    return str;
}
void assertCharStarEquals(char* a, char* b, string msg){
    if (strcmp(a, b) == 0) cout<<"passed" << endl;
    else cout << "Failed" << msg << " should be: '" << a << "'  it returned: '" << b << "'" << endl;
}
void assertStringEquals(string a, string b, string msg){
    if (a == b) cout<<"passed" << endl;
    else cout << "Failed" << msg << " should be: '" << a << "'  it returned: '" << b << "'" << endl;
}
void test01_convert_underscores_by_RolandXu_using_char_stars_and_pointers(int numtests, string str){
    char mystr[str.length()];
    strcpy(mystr, str.c_str());
    clock_t start = clock();
    int x = 0;
    while(x < numtests){
        char* s = convert_underscores_by_RolandXu_using_char_stars_and_pointers(mystr);
        x++;
    }
    double diff = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout <<  diff << "   RolandXu's version using char*. " << 'n';
}
void test02_convert_underscores_blastfurnace_version2_std_unique_and_no_string(int numtests, string str){
    char mystr[str.length()];
    strcpy(mystr, str.c_str());
    clock_t start = clock();
    int x = 0;
    while(x < numtests){
        char* val = convert_underscores_blastfurnace_version2_std_unique_and_no_string(mystr);
        x++;
    }
    double diff = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout << diff << "   Blastfurnace's version part two, functor, without string. " <<endl;
}
void test03_convert_underscores_blastfurnace_version1_with_functor(int numtests, string str){
    clock_t start = clock();
    int x = 0;
    while(x < numtests){
        string s = convert_underscores_blastfurnace_version1_with_functor(str);
        x++;
    }
    double diff = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout << diff << "   Blastfurnace's version part one with the functor and string. " <<endl;
}
void test04_convert_underscores_by_EricL_using_string_replace(int numtests, string str){
    char mystr[str.length()];
    strcpy(mystr, str.c_str());
    clock_t start = clock();
    int x = 0;
    while(x < numtests){
        string s = convert_underscores_by_EricL_using_string_replace(mystr);
        x++;
    }
    double diff = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout<<  diff << "   Eric L's version looping over the string doing find double underscore replace with single underscore. " <<endl;
}
void test05_convert_underscores_EricL_loop_over_a_string_and_remove_repeats(int numtests, string str){
    char mystr[str.length()];
    strcpy(mystr, str.c_str());
    clock_t start = clock();
    int x = 0;
    while(x < numtests){
        string s = convert_underscores_EricL_loop_over_a_string_and_remove_repeats(mystr);
        x++;
    }
    double diff = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout <<  diff << "   Eric L's version looping over each char and assembling a string. "<< endl;
}

void test06_convert_underscores_jerry_coffins_with_remove_copy_if(int numtests, string str){
    clock_t start = clock();
    int x = 0;
    while(x < numtests){
        string s = convert_underscores_jerry_coffins_with_remove_copy_if(str);
        x++;
    }
    double diff = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout<<  diff << "   Jerry Coffin's version with remove_copy_if. " <<endl;
}
int main(){
    cout << "GO!n";
    int numtests = 1000000;
    string test_string = "__aa_a_aaa_--__&___aa_234______3)_!___<_";
    cout << "Time | Algorithm Used.n";
    test01_convert_underscores_by_RolandXu_using_char_stars_and_pointers(numtests, test_string);
    test02_convert_underscores_blastfurnace_version2_std_unique_and_no_string(numtests, test_string);
    test03_convert_underscores_blastfurnace_version1_with_functor(numtests, test_string);
    test04_convert_underscores_by_EricL_using_string_replace(numtests, test_string);
    test05_convert_underscores_EricL_loop_over_a_string_and_remove_repeats(numtests, test_string);
    test06_convert_underscores_jerry_coffins_with_remove_copy_if(numtests, test_string);

    //Produces the following output:

    //Extra assertion testing to make sure everyone's algorithm is correct:
    char in[30];
    char out[30];
    strcpy(in, "a__");
    strcpy(out, "a_");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "01");
    strcpy(in, "a_");
    strcpy(out, "a_");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "02");
    strcpy(in, "_______");
    strcpy(out, "_");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "03");
    strcpy(in, "__a");
    strcpy(out, "_a");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "04");
    strcpy(in, "_hello___world__");
    strcpy(out, "_hello_world_");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "05");
    strcpy(in, "");
    strcpy(out, "");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "06");
    strcpy(in, "  __  ");
    strcpy(out, "  _  ");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "07");
    strcpy(in, "U+221E");
    strcpy(out, "U+221E");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "08");
    strcpy(in, "__u00b2__");
    strcpy(out, "_u00b2_");
    assertCharStarEquals(out, convert_underscores_by_RolandXu_using_char_stars_and_pointers(in), "09");

    cout<< "OKn";

    strcpy(in, "a__");
    strcpy(out, "a_");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "01");
    strcpy(in, "a_");
    strcpy(out, "a_");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "02");
    strcpy(in, "_______");
    strcpy(out, "_");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "03");
    strcpy(in, "__a");
    strcpy(out, "_a");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "04");
    strcpy(in, "_hello___world__");
    strcpy(out, "_hello_world_");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "05");
    strcpy(in, "");
    strcpy(out, "");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "06");
    strcpy(in, "  __  ");
    strcpy(out, "  _  ");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "07");
    strcpy(in, "U+221E");
    strcpy(out, "U+221E");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "08");
    strcpy(in, "__u00b2__");
    strcpy(out, "_u00b2_");
    assertCharStarEquals(out, convert_underscores_blastfurnace_version2_std_unique_and_no_string(in), "09");

    cout<< "OKn";

    string in_s = "a__";
    string out_s = "a_";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "01");
    in_s = "a_";
    out_s = "a_";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "02");
    in_s = "_______";
    out_s = "_";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "03");
    in_s = "__a";
    out_s = "_a";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "04");
    in_s = "_hello___world__";
    out_s = "_hello_world_";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "05");
    in_s = "";
    out_s = "";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "06");
    in_s = "  __  ";
    out_s = "  _  ";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "07");
    in_s = "U+221E";
    out_s = "U+221E";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "08");
    in_s = "__u00b2__";
    out_s = "_u00b2_";
    assertStringEquals(out_s, convert_underscores_blastfurnace_version1_with_functor(in_s), "09");

    cout<< "OKn";

    in_s = "a__";
    out_s = "a_";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "01");
    in_s = "a_";
    out_s = "a_";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "02");
    in_s = "_______";
    out_s = "_";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "03");
    in_s = "__a";
    out_s = "_a";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "04");
    in_s = "_hello___world__";
    out_s = "_hello_world_";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "05");
    in_s = "";
    out_s = "";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "06");
    in_s = "  __  ";
    out_s = "  _  ";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "07");
    in_s = "U+221E";
    out_s = "U+221E";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "08");
    in_s = "__u00b2__";
    out_s = "_u00b2_";
    assertStringEquals(out_s, convert_underscores_by_EricL_using_string_replace(in_s), "09");

    cout<< "OKn";

    in_s = "a__";
    out_s = "a_";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "01");
    in_s = "a_";
    out_s = "a_";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "02");
    in_s = "_______";
    out_s = "_";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "03");
    in_s = "__a";
    out_s = "_a";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "04");
    in_s = "_hello___world__";
    out_s = "_hello_world_";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "05");
    in_s = "";
    out_s = "";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "06");
    in_s = "  __  ";
    out_s = "  _  ";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "07");
    in_s = "U+221E";
    out_s = "U+221E";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "08");
    in_s = "__u00b2__";
    out_s = "_u00b2_";
    assertStringEquals(out_s, convert_underscores_EricL_loop_over_a_string_and_remove_repeats(in_s), "09");


    cout<< "OKn";

    in_s = "a__";
    out_s = "a_";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "01");
    in_s = "a_";
    out_s = "a_";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "02");
    in_s = "_______";
    out_s = "_";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "03");
    in_s = "__a";
    out_s = "_a";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "04");
    in_s = "_hello___world__";
    out_s = "_hello_world_";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "05");
    in_s = "";
    out_s = "";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "06");
    in_s = "  __  ";
    out_s = "  _  ";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "07");
    in_s = "U+221E";
    out_s = "U+221E";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "08");
    in_s = "__u00b2__";
    out_s = "_u00b2_";
    assertStringEquals(out_s, convert_underscores_jerry_coffins_with_remove_copy_if(in_s), "09");


    return 0;
}

我们学到了什么?

  1. 对于 c++ 字符串,'str.length()' 真的很重,因为编译器尽职尽责地逐个遍历字符串的内存,在内存中找到字符串的末尾,边走边算。 不要使用字符串,也不要使用 str.length();

  2. 使用 str.end() 也会调用 O(n) 性能命中,与项目 1 相同。 不要使用 str.end();

  3. 在字符数组上使用 std::unique 是经过优化且快如闪电的。 比带有字符串连接的 for 循环快一个数量级。

  4. 在C++中,执行 mystring[x] 会导致在内存中查找该槽的内存,这需要很长时间,并且比使用指针并将 1 添加到指针要慢得多。 不要把 mystring[x] 放在迭代 x 的循环中。

  5. 如果必须使用字符串,请不要输入 mystring += 另一个字符串[x]; 每次运行此行时,字符串都必须逐个遍历整个字符串。

  6. 不要连接字符串,获取内存块,定义指针,在指针中放置字符,然后递增指针。 循环和与字符串的串联调用 O(n) 复杂性。

这一天学到了很多东西,好歌会唱出来。

我想

我会使用 std::remove_copy_if ,像这样:

char prev;
std::remove_copy_if(input.begin(), input.end(), 
    std::back_inserter(result), 
    [&prev] (char ch) ->bool { 
        bool ret=prev=='_' && ch == '_'; 
        prev=ch; 
        return ret;
});

或者,如果您坚持使用 C++03,您可以使用显式函子而不是 lambda 执行相同的操作:

class repeated { 
    char prev;
    char val;
public:
    repeated(char ch) : val(ch), prev(0) {}
    bool operator()(char ch) { 
        bool ret = prev == val && ch == val; 
        prev = ch;
        return ret;
    }
};
std::remove_copy_if(input.begin(), input.end(), 
                    std::back_inserter(result), 
                    repeated('_'));

无论哪种方式,它只将每个字符从输入复制到输出(最多)一次,其中使用 std::string::replace 复制每个字符的频率与其左侧重复下划线的数量一样多。

编辑:虽然看着@blastfurnace的答案,但我不确定我真的会使用它 - std::unique可能更适合这份工作。

在C++中,我给出的内容非常低效:

string convert_underscores(string str){
    int len = str.length();   
    //bad, str.length() incurs O(n) complexity.
    string result = "";
    //bad, don't use string.  string makes everything slow.
    if (len < 2) return str;  
    result += str[0];
    //bad bad bad, string concatenation incurs O(n) complexity.
    for(int x = 1; x < len; x++){  
        //This for loop incurs unnecessary management overhead.
        if (str[x] != str[x-1] || str[x] != '_'){
            result += str[x];  
            //concatenation extremely evil here: costs O(n)
            //A lookup on str[x] costs time.  
        }
    }
    return result;  
    //returning a string bad, the entire memory is moved 
    //instead of just a pointer to the first slot.
}

甚至效率更低 使用str.replace。

string convert_underscores(string str){
    while(str.find("__") != string::npos){ 
        str = str.replace(str.find("__"), 2, "_");
    }
    return str;
}
//find incurs unnecessary O(n) complexity. string copy incurs 
//a full memory movement on contents of string.  
//n passes are required over the string.