C++: How to count the number of collisions while using a hash function?

Keywords: count, collisions, function, hash, C++. Updated: 2023-10-16

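I was assigned this lab, in which I need to create a hash function and count the number of collisions that occur when hashing a file of up to 30000 elements. Here is my code so far: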

#include <iostream>
#include <fstream>
#include <string>
using namespace std;
long hashcode(string s){
  long seed = 31; 
  long hash = 0;
  for(int i = 0; i < s.length(); i++){
    hash = (hash * seed) + s[i];
  }
  return hash % 10007;
};
int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];
  //File stream
  while(!input.eof()){
    input>>x;
    array[count] = hashcode(x);
    count++;
    for(int i = 0; i<count; i++){
        if(array[i]==hashcode(x)){
            collisions++;
        }
    }
  }
  cout<<"Total Input is " <<count-1<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}

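I am just not sure how to count the collisions. I tried storing every hash value in an array and then searching that array, but it reported about 12000 collisions when there were only 10000 elements. Any advice on how to count the collisions, or on whether my hash function could be improved, would be appreciated. Thanks.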

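The problem is that you are recounting collisions. (Suppose your list contains 4 identical elements and nothing else, then step through your algorithm and see how many collisions it counts.)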
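To make the over-counting concrete, here is a minimal sketch that replays the question's counting loop on hash values already held in memory. The helper name `count_like_question` and the use of `std::vector` are assumptions made for illustration; only the store-then-scan logic is taken from the posted code.

```cpp
#include <cstddef>
#include <vector>

// Replays the question's counting logic on precomputed hash values.
// Hypothetical helper for illustration; not part of the original program.
int count_like_question(const std::vector<long>& hashes) {
    std::vector<long> seen;
    int collisions = 0;
    for (long h : hashes) {
        seen.push_back(h);  // store first, as the question does
        for (std::size_t i = 0; i < seen.size(); ++i) {
            if (seen[i] == h) {  // matches the new entry itself and every earlier duplicate
                ++collisions;
            }
        }
    }
    return collisions;
}
```

For four identical values this returns 10 (1 + 2 + 3 + 4), although only three of the four values repeat an earlier one.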

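Instead, create a set of hash codes.

Each time you compute a hash code, check whether it is already in the set. If it is, increment the collision total. If it is not, add it to the set.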
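The set-based counting described above could look roughly like this. It is a sketch under assumptions: the helper name `count_with_set` is made up, and it takes precomputed hash values rather than reading the file, so that it stays self-contained.

```cpp
#include <set>
#include <vector>

// Counts values whose hash was already produced by an earlier value.
// Hypothetical helper; the caller is assumed to have computed the hashes.
int count_with_set(const std::vector<long>& hashes) {
    std::set<long> seen;
    int collisions = 0;
    for (long h : hashes) {
        // insert() reports through .second whether the value was newly added,
        // so the membership check and the insertion happen in one call.
        if (!seen.insert(h).second) {
            ++collisions;
        }
    }
    return collisions;
}
```

With this definition, four identical values yield 3 collisions and a list of distinct values yields 0.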

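Edit:

To quickly patch your algorithm, I did the following: increment count after the loop, and break out of the for loop once a collision is found. This is still not very efficient, because we loop over all previous results (using a set data structure would be faster), but it should at least be correct.

I also tweaked it so that we do not compute hashcode(x) over and over: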

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];
  //File stream
  // Loop on the extraction itself so a failed read at end of file is not
  // processed, and stop before the array is full.
  while(count < 30000 && input>>x){
    array[count] = hashcode(x);
    for(int i = 0; i<count; i++){
        if(array[i]==array[count]){
            collisions++;
            // Once we've found one collision, we don't want to count all of them.
            break;
        }
    }
    // We don't want to check our hashcode against the value we just added
    // so we should only increment count here.
    count++;
  }
  cout<<"Total Input is " <<count<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}
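A side note on the read loop: the question's `while(!input.eof())` pattern tests for end of file before attempting the read, so a final failed extraction can still be processed once, which is presumably why the output subtracts one from `count`. The sketch below contrasts the two loop shapes on an in-memory stream; the function names are hypothetical and `std::istringstream` stands in for the input file.

```cpp
#include <sstream>
#include <string>

// Counts loop iterations the way an eof()-tested loop does.
int count_with_eof_test(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    int count = 0;
    while (!in.eof()) {
        in >> word;  // can fail on trailing whitespace; the iteration is still counted
        ++count;
    }
    return count;
}

// Counts only successful extractions.
int count_with_extraction_test(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    int count = 0;
    while (in >> word) {
        ++count;
    }
    return count;
}
```

For the input `"a b c\n"` the first function reports 4 and the second reports 3. Without the trailing newline both report 3, so a fixed `count-1` correction is only right for some files.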

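Answer added for educational purposes. This will probably be your professor's next lesson.

Almost certainly the most efficient way to detect hash collisions is to use a hash set (a.k.a. unordered_set):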

#include <iostream>
#include <unordered_set>
#include <fstream>
#include <string>
// your hash algorithm
long hashcode(std::string const &s) {
    long seed = 31;
    long hash = 0;
    for (int i = 0; i < s.length(); i++) {
        hash = (hash * seed) + s[i];
    }
    return hash % 10007;
};
int main(int argc, char **argv) {
    std::ifstream is{argv[1]};
    std::unordered_set<long> seen_before;
    seen_before.reserve(10007);
    std::string buffer;
    int collisions = 0, count = 0;
    while (is >> buffer) {
        ++count;
        auto hash = hashcode(buffer);
        auto i = seen_before.find(hash);
        if (i == seen_before.end()) {
            seen_before.emplace_hint(i, hash);
        }
        else {
            ++collisions;
        }
    }
    std::cout << "Total Input is " << count << std::endl;
    std::cout << "Collision # is " << collisions << std::endl;
}
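As a rough sanity check on whatever number the program prints: if a collision means "this hash value was already seen", the count can never exceed the number of inputs minus one, and once there are more inputs than the 10007 possible hash values it must be at least the difference. The sketch below adds an estimate under the idealized assumption that hash values are spread uniformly at random, which is not guaranteed for `hashcode`; the function name is made up for illustration.

```cpp
#include <cmath>

// Expected number of inputs whose hash repeats an earlier one, assuming
// each of n inputs lands uniformly at random in one of m buckets.
// The uniformity assumption is an idealization, not a property of hashcode().
double expected_collisions(double n, double m) {
    // Expected number of distinct buckets hit after n uniform draws.
    double expected_distinct = m * (1.0 - std::pow(1.0 - 1.0 / m, n));
    return n - expected_distinct;
}
```

With n = 10000 and m = 10007 this gives a little under 3,700, so a reported figure of about 12000 for 10000 elements points to a counting bug rather than a bad hash. For 30000 elements, at least 19993 collisions are unavoidable with this table size.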

For an explanation of hash tables, see "How does a hash table work?"

#include <iostream>
#include <fstream>
#include <string>
using namespace std;
// Generate a hash code that is in the range of our hash table.
// The range we are using is zero to 10,006 (10,007 slots) so that our table is
// large enough and the prime number size reduces the probability
// of collisions from different strings hashing to the same value.
unsigned long hashcode(string s){
    unsigned long seed = 31;
    unsigned long hash = 0;
    for (int i = 0; i < s.length(); i++){
        hash = (hash * seed) + s[i];
    }
    // we want to generate a hash code that is the size of our table.
    // so we mod the calculated hash to ensure that it is in the proper range
    // of our hash table entries. 10007 is a prime number which provides
    // better characteristics than a non-prime number table size.
    return hash % 10007; 
};
int main(int argc, char * argv[]){
    int count = 0;
    int collisions = 0;
    fstream input(argv[1]);
    string x;
    int array[30000] = { 0 };
    //File stream
    // get the next string to hash; looping on the extraction itself means a
    // failed read at end of file is not counted or hashed a second time.
    while (input >> x){
        count++;        // count the number of strings hashed.
        // hash the string and use the hash as an index into our hash table.
        // the hash table is only used to keep a count of how many times a particular
        // hash has been generated. So the table entries are ints that start with zero.
        // If the value is greater than zero then we have a collision.
        // So we use postfix increment to check the existing value while incrementing
        // the hash table entry.
        if ((array[hashcode(x)]++) > 0)
            collisions++;
    }
    cout << "Total Input is " << count << endl;
    cout << "Collision # is " << collisions << endl;
    return 0;
}
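Because the table keeps one counter per hash value, the same data can also show how evenly the hash spreads its inputs, which bears on the question of whether the hash function could be improved. The following is a sketch under assumptions: the helper names and the `std::vector` input are hypothetical, and `hashcode` is repeated only so that the block compiles on its own.

```cpp
#include <algorithm>
#include <string>
#include <vector>

const unsigned long kTableSize = 10007;  // same modulus as the answer's hashcode

// Same computation as the answer's hashcode, repeated for self-containment.
unsigned long hashcode(const std::string& s) {
    unsigned long hash = 0;
    for (char c : s) {
        hash = hash * 31 + c;
    }
    return hash % kTableSize;
}

// Per-bucket tallies: counts[h] is how many words hashed to h.
std::vector<int> bucket_counts(const std::vector<std::string>& words) {
    std::vector<int> counts(kTableSize, 0);
    for (const std::string& w : words) {
        ++counts[hashcode(w)];
    }
    return counts;
}

// Every word after the first in a bucket is one collision.
int total_collisions(const std::vector<int>& counts) {
    int collisions = 0;
    for (int c : counts) {
        if (c > 1) collisions += c - 1;
    }
    return collisions;
}

// Size of the fullest bucket; a large value suggests uneven spreading.
int fullest_bucket(const std::vector<int>& counts) {
    return *std::max_element(counts.begin(), counts.end());
}
```

`total_collisions` yields the same number as the postfix-increment test in the loop above, while `fullest_bucket` gives a quick indication of clustering.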