Finding if two strings are anagrams in O(n) - solution using XOR

Updated: 2023-10-16

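I am working on a problem from HackerEarth.

The goal is to find out, in O(n) time, whether the input strings are anagrams.

Input format:

The first line contains an integer 'T', the number of test cases.
Each test case consists of a single line containing two space-separated strings S1 and S2 of equal length.

My code: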

#include <iostream>
#include <string>

int main()
{
    int T;
    std::cin >> T;
    std::cin.ignore();
    for (int i = 0; i < T; ++i)
    {
        std::string testString;
        std::getline(std::cin, testString);
        char test = ' ';
        for (auto& token : testString)
        {
            if (token != ' ')
                test ^= token;
        }
        if (test == ' ')
            std::cout << "YES\n";
        else
            std::cout << "NO\n";
    }
}

The code above fails 5 of the 6 HackerEarth tests. Where is my mistake? Is this a good approach to the problem?

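Note: Your question title says that the second word must be an anagram of the first. However, the linked problem on HackerEarth uses the term "rearranged", which is more restrictive than "anagram", and it also says:

Two strings S1 and S2 are said to be identical if any permutation of string S1 is equal to string S2.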
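The XOR test in the question cannot decide this on its own: XOR only records, bit by bit, whether a bit was seen an odd or even number of times, so different letters can cancel out. A minimal sketch of a counterexample follows. The helper name `xorLooksEqual` is hypothetical and simply mirrors the loop from the question.

```cpp
#include <string>

// Mirrors the question's check: XOR every non-space character of the line
// into an accumulator and test whether the accumulator is unchanged.
// xorLooksEqual is a hypothetical name used only for this illustration.
bool xorLooksEqual(const std::string& line)
{
    char test = ' ';
    for (char token : line)
    {
        if (token != ' ')
            test ^= token;
    }
    return test == ' ';
}
```

With this helper, `xorLooksEqual("aa bb")` is true even though aa is not a rearrangement of bb, because each letter cancels against its own duplicate.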

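One algorithm is to maintain a histogram of the incoming characters.

This is done with two loops, one for the first word and one for the second.

For the first word, go character by character and increment the histogram cell for each one. Compute the length of the first word by keeping a running count.

When the space is reached, run a second loop that decrements the histogram. Keep a count of the number of histogram cells that reach zero. At the end, this count must match the length of the first word (i.e. success).

In the second loop, if a histogram cell goes negative, that is a mismatch: either the second word contains a character that is not in the first word, or it contains more instances of a character than the first word does.

Note: I apologize for this being a C-like solution, but it can easily be adapted to use more STL components.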
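As a rough idea of what such an STL-flavoured adaptation might look like, here is a minimal sketch. It assumes both words contain only the lowercase letters a to z, it takes the two words as already-split strings rather than reading them from a stream, and it counts len as the number of distinct letters (the corrected definition given in Update #3). The function name `isRearrangement` is hypothetical.

```cpp
#include <array>
#include <string>

// Hypothetical STL-flavoured adaptation of the histogram algorithm.
// Assumes both words contain only the lowercase letters 'a'..'z'.
bool isRearrangement(const std::string& s1, const std::string& s2)
{
    std::array<int, 26> histo{};   // one counter per letter, all zero
    int len = 0;                   // non-zero cells after the first word
    int match = 0;                 // cells brought back to zero

    // first word: increment the cell for each character
    for (char c : s1)
    {
        if (++histo[c - 'a'] == 1)
            ++len;
    }

    // second word: decrement the cell for each character
    for (char c : s2)
    {
        int cnt = --histo[c - 'a'];
        if (cnt < 0)
            return false;          // more of this letter than in the first word
        if (cnt == 0)
            ++match;
    }

    // every letter used by the first word must be back at zero
    return match == len;
}
```

For example, `isRearrangement("listen", "silent")` is true, while `isRearrangement("aa", "bb")` is false.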

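Also, char-at-a-time input may be faster than reading the whole line into a string buffer.

Edit: I have added annotations/comments to the code example to make it clearer.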

#include <stdio.h>
#include <stdlib.h>

char buf[(200 * 1024) + 100];

void
dotest(FILE *xf)
{
    int histo[26] = { 0 };
    int len = 0;
    int chr;
    int match = 0;
    int fail = 0;
    int cnt;

    // scan first word
    while (1) {
        chr = fgetc(xf);

        // stop on delimiter between first and second words (or on EOF)
        if (chr == ' ' || chr == EOF)
            break;

        // convert char to histogram index
        chr -= 'a';

        // increment the histogram cell
        cnt = ++histo[chr];

        // calculate number of non-zero histogram cells
        if (cnt == 1)
            ++len;
    }

    // scan second word
    while (1) {
        chr = fgetc(xf);

        // stop on end-of-line or EOF
        if (chr == '\n')
            break;
        if (chr == EOF)
            break;

        // convert char to histogram index
        chr -= 'a';

        // decrement the histogram cell
        cnt = --histo[chr];

        // if the cell reaches zero, we [seemingly] have a match (i.e. the
        // number of instances of this char in the second word match the
        // number of instances in the first word)
        if (cnt == 0)
            match += 1;

        // however, if we go negative, the second word has too many instances
        // of this char to match the first word
        if (cnt < 0)
            fail = 1;
    }

    do {
        // too many letters in second word that are _not_ in the first word
        if (fail)
            break;

        // the number of histogram cells that returned to zero in the second
        // loop must match the number of non-zero cells after the first loop
        // (i.e. all scrambled chars in the second word had a place in the
        // first word)
        fail = (match != len);
    } while (0);

    if (fail)
        printf("NO\n");
    else
        printf("YES\n");
}

// main -- main program
int
main(int argc,char **argv)
{
    char *file;
    FILE *xf;

    --argc;
    ++argv;

    file = *argv;
    if (file != NULL)
        xf = fopen(file,"r");
    else
        xf = stdin;

    fgets(buf,sizeof(buf),xf);
    int tstcnt = atoi(buf);

    for (int tstno = 1;  tstno <= tstcnt;  ++tstno)
        dotest(xf);

    if (file != NULL)
        fclose(xf);

    return 0;
}

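Update:

I only glanced at the code, but it seems len goes up for every character found (the string length). match only goes up when a unique char (histogram element) is exhausted, so isn't checking match == len wrong?

len is incremented only in the first loop (i.e. it is just the length of the first word, as stated in the algorithm description above).

In the first loop, the character is checked for being a space [which the problem's input definition guarantees will delimit the end of the first word], and the loop is exited at that point [before len is incremented], so len is correct.

Using len, match, and fail speeds things up. Otherwise, at the end, we would have to scan the entire histogram and make sure all elements are zero to determine success/failure (i.e. any non-zero element means mismatch/failure).

Note: When doing timed coding challenges like this before, I noticed that they can be very strict about elapsed time and space. It is best to optimize as much as possible because, even if the algorithm is technically correct, it can fail the tests for using too much memory or taking too much time.

That is why I suggested not using a string buffer, because the problem definition allows a maximum size of 100000 bytes. Also, doing an [unnecessary] scan of the histogram at the end adds time.

Update #2:

It may be faster to read a whole line at a time and then traverse the buffer with a char pointer. Here is a version that does that. Which method is faster would need to be tried/benchmarked to be sure.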

#include <stdio.h>
#include <stdlib.h>

char buf[(200 * 1024) + 100];

void
dotest(FILE *xf)
{
    char *cp;
    int histo[26] = { 0 };
    int len = 0;
    int chr;
    int match = 0;
    int fail = 0;
    int cnt;

    cp = buf;
    fgets(cp,sizeof(buf),xf);

    // scan first word
    for (chr = *cp++;  chr != 0;  chr = *cp++) {
        // stop on delimiter between first and second words
        if (chr == ' ')
            break;

        // convert char to histogram index
        chr -= 'a';

        // increment the histogram cell
        cnt = ++histo[chr];

        // calculate number of non-zero histogram cells
        if (cnt == 1)
            ++len;
    }

    // scan second word
    for (chr = *cp++;  chr != 0;  chr = *cp++) {
        // stop on end-of-line
        if (chr == '\n')
            break;

        // convert char to histogram index
        chr -= 'a';

        // decrement the histogram cell
        cnt = --histo[chr];

        // if the cell reaches zero, we [seemingly] have a match (i.e. the
        // number of instances of this char in the second word match the
        // number of instances in the first word)
        if (cnt == 0)
            match += 1;

        // however, if we go negative, the second word has too many instances
        // of this char to match the first word
        if (cnt < 0) {
            fail = 1;
            break;
        }
    }

    do {
        // too many letters in second word that are _not_ in the first word
        if (fail)
            break;

        // the number of histogram cells that returned to zero in the second
        // loop must match the number of non-zero cells after the first loop
        // (i.e. all scrambled chars in the second word had a place in the
        // first word)
        fail = (match != len);
    } while (0);

    if (fail)
        printf("NO\n");
    else
        printf("YES\n");
}

// main -- main program
int
main(int argc,char **argv)
{
    char *file;
    FILE *xf;

    --argc;
    ++argv;

    file = *argv;
    if (file != NULL)
        xf = fopen(file,"r");
    else
        xf = stdin;

    fgets(buf,sizeof(buf),xf);
    int tstcnt = atoi(buf);

    for (int tstno = 1;  tstno <= tstcnt;  ++tstno)
        dotest(xf);

    if (file != NULL)
        fclose(xf);

    return 0;
}

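Update #3:

The two examples above had a minor bug. They would report a false negative on an input line of (e.g.) aaa aaa.

len was always incremented in the first loop. That was incorrect. I have edited the two examples above to increment len conditionally (i.e. only if the histogram cell was zero before the increment). Now, len is the "number of non-zero histogram cells in the first string". This accounts for duplicates in the string (e.g. aa).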
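To make the false negative concrete, here is a small sketch that runs the same match/len comparison with both ways of computing len. The function `histogramMatch` and its `countEveryChar` switch are hypothetical, added only for this illustration, and the input is assumed to be lowercase a to z.

```cpp
#include <array>
#include <string>

// Illustration only: countEveryChar is a hypothetical switch that selects the
// original (buggy) way of computing len or the corrected one.
// Assumes lowercase 'a'..'z' input.
bool histogramMatch(const std::string& s1, const std::string& s2,
                    bool countEveryChar)
{
    std::array<int, 26> histo{};
    int len = 0;
    int match = 0;

    for (char c : s1)
    {
        int cnt = ++histo[c - 'a'];
        // buggy: len grows for every character
        // fixed: len grows only when a cell leaves zero
        if (countEveryChar || cnt == 1)
            ++len;
    }

    for (char c : s2)
    {
        int cnt = --histo[c - 'a'];
        if (cnt < 0)
            return false;
        if (cnt == 0)
            ++match;
    }

    return match == len;
}
```

For aaa aaa, the buggy variant ends with len equal to 3 but match equal to 1 (the single cell for a reaches zero once), so it reports a mismatch. The corrected variant ends with len equal to 1 and reports a match.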

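As I mentioned, len, match, and fail are used to avoid having to scan all histogram cells at the end looking for a non-zero cell, which would mean mismatch/failure.

This may run faster for short input lines, where the post-scan of the histogram takes longer than the loops over the input line.

However, given that an input line can be 200k long, [almost] all histogram cells will be incremented/decremented. Also, the post-scan of the histogram (e.g. checking 26 integer array values for non-zero) is now a negligible part of the total time.

So, the simple implementation [below], which eliminates the len/match calculations in the first two loops, may be the fastest/best choice. This is because the two loops are slightly faster.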

#include <stdio.h>
#include <stdlib.h>

char buf[(200 * 1024) + 100];

void
dotest(FILE *xf)
{
    char *cp;
    int histo[26] = { 0 };
    int chr;
    int fail = 0;

    cp = buf;
    fgets(cp,sizeof(buf),xf);

    // scan first word
    for (chr = *cp++;  chr != 0;  chr = *cp++) {
        // stop on delimiter between first and second words
        if (chr == ' ')
            break;

        // convert char to histogram index
        chr -= 'a';

        // increment the histogram cell
        ++histo[chr];
    }

    // scan second word
    for (chr = *cp++;  chr != 0;  chr = *cp++) {
        // stop on end-of-line
        if (chr == '\n')
            break;

        // convert char to histogram index
        chr -= 'a';

        // decrement the histogram cell
        --histo[chr];
    }

    // scan histogram: any non-zero cell means the words do not match
    for (int idx = 0;  idx < 26;  ++idx) {
        if (histo[idx]) {
            fail = 1;
            break;
        }
    }

    if (fail)
        printf("NO\n");
    else
        printf("YES\n");
}

// main -- main program
int
main(int argc,char **argv)
{
    char *file;
    FILE *xf;

    --argc;
    ++argv;

    file = *argv;
    if (file != NULL)
        xf = fopen(file,"r");
    else
        xf = stdin;

    fgets(buf,sizeof(buf),xf);
    int tstcnt = atoi(buf);

    for (int tstno = 1;  tstno <= tstcnt;  ++tstno)
        dotest(xf);

    if (file != NULL)
        fclose(xf);

    return 0;
}

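The downside is that there is no "early escape" from the second loop. We must finish scanning the second string even when we could tell early on that it cannot match, e.g.:

aaaaaaaaaa baaaaaaaaa
baaaaaaaaa bbaaaaaaaa

In the simple version, we cannot terminate the second loop early, even though we know the second string can never match once we see the b that drives a histogram cell negative, at which point we could skip scanning the many a's in the second word.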
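The cost of the missing early escape can be seen by counting how many characters of the second word are examined. The sketch below does that for either behaviour. The helper `secondWordCharsScanned` and its `earlyExit` flag are hypothetical names, and the input is assumed to be lowercase a to z.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: returns how many characters of the second word are
// examined before the loop stops. Assumes lowercase 'a'..'z' input.
std::size_t secondWordCharsScanned(const std::string& s1,
                                   const std::string& s2,
                                   bool earlyExit)
{
    std::array<int, 26> histo{};
    for (char c : s1)
        ++histo[c - 'a'];

    std::size_t scanned = 0;
    for (char c : s2)
    {
        ++scanned;
        // a negative cell proves a mismatch; stop here if allowed to
        if (--histo[c - 'a'] < 0 && earlyExit)
            break;
    }
    return scanned;
}
```

For the first sample line, the early-exit variant stops after the leading b of the second word, while the simple variant walks the whole second word. On inputs near the 100000-character limit, that difference is the entire second word.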

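So, here is a version that has the simple first loop as above. In the second loop, it adds an on-the-fly check for a cell that goes negative.

Again, which version [of the four I have presented] is best would need some experimentation/benchmarking.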

#include <stdio.h>
#include <stdlib.h>

char buf[(200 * 1024) + 100];

void
dotest(FILE *xf)
{
    char *cp;
    int histo[26] = { 0 };
    int chr;
    int fail = 0;
    int cnt;

    cp = buf;
    fgets(cp,sizeof(buf),xf);

    // scan first word
    for (chr = *cp++;  chr != 0;  chr = *cp++) {
        // stop on delimiter between first and second words
        if (chr == ' ')
            break;

        // convert char to histogram index
        chr -= 'a';

        // increment the histogram cell
        ++histo[chr];
    }

    // scan second word
    for (chr = *cp++;  chr != 0;  chr = *cp++) {
        // stop on end-of-line
        if (chr == '\n')
            break;

        // convert char to histogram index
        chr -= 'a';

        // decrement the histogram cell
        cnt = --histo[chr];

        // however, if we go negative, the second word has too many instances
        // of this char to match the first word
        if (cnt < 0) {
            fail = 1;
            break;
        }
    }

    do {
        // too many letters in second word that are _not_ in the first word
        if (fail)
            break;

        // scan histogram
        for (int idx = 0;  idx < 26;  ++idx) {
            if (histo[idx]) {
                fail = 1;
                break;
            }
        }
    } while (0);

    if (fail)
        printf("NO\n");
    else
        printf("YES\n");
}

// main -- main program
int
main(int argc,char **argv)
{
    char *file;
    FILE *xf;
    char buf[100];

    --argc;
    ++argv;

    file = *argv;
    if (file != NULL)
        xf = fopen(file,"r");
    else
        xf = stdin;

    fgets(buf,sizeof(buf),xf);
    int tstcnt = atoi(buf);

    for (int tstno = 1;  tstno <= tstcnt;  ++tstno)
        dotest(xf);

    if (file != NULL)
        fclose(xf);

    return 0;
}
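For the experimentation mentioned above, a small timing harness is one possible starting point. The sketch below is only an assumption about how such a comparison might be set up: it times an in-memory check rather than the full read-and-print programs, and the names `simpleCheck` and `secondsFor` are hypothetical. Real measurements would also need to include input parsing and realistic test data.

```cpp
#include <array>
#include <chrono>
#include <string>

// Simple histogram check used here as the code under test.
// Assumes lowercase 'a'..'z' input.
bool simpleCheck(const std::string& s1, const std::string& s2)
{
    std::array<int, 26> histo{};
    for (char c : s1)
        ++histo[c - 'a'];
    for (char c : s2)
        --histo[c - 'a'];
    for (int cell : histo)
    {
        if (cell != 0)
            return false;
    }
    return true;
}

// Calls check(s1, s2) reps times and returns the elapsed time in seconds.
// yesCount receives the number of calls that returned true; consuming the
// result makes it harder for the optimizer to discard the calls.
template <typename Check>
double secondsFor(Check check, const std::string& s1, const std::string& s2,
                  int reps, int& yesCount)
{
    using Clock = std::chrono::steady_clock;
    yesCount = 0;
    const auto start = Clock::now();
    for (int i = 0; i < reps; ++i)
    {
        if (check(s1, s2))
            ++yesCount;
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count();
}
```

Each candidate check can be passed as the first argument with the same pair of strings, and the returned times compared over several runs.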

public class AnagramCheck {
    // ASCII code of 'a', used to map a lowercase letter to an index 0..25
    public static final int ASC = 97;

    static boolean isAnagram(String a, String b) {
        int len = a.length();
        if (len != b.length()) {
            return false;
        }
        a = a.toLowerCase();
        b = b.toLowerCase();
        // per-letter counts; assumes both strings contain only the letters a-z
        int[] a_ascii = new int[26];
        for (int i = 0; i < 2 * len; i++) {
            if (i < len) {
                // first pass: count each letter of a
                a_ascii[a.charAt(i) - ASC]++;
            } else {
                // second pass: consume each letter of b
                int aval = b.charAt(i - len) - ASC;
                if (a_ascii[aval] == 0) {
                    // b has a letter that a lacks, or has it more often than a
                    return false;
                }
                a_ascii[aval]--;
            }
        }
        // lengths are equal and no count went negative, so every count is zero
        return true;
    }
}