Can hashing to calculate frequencies be improved?

Keywords: calculate, frequencies, hash. Updated: 2023-10-16

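I'm currently building a hash table to count word frequencies, and I'm choosing the data structure based on its running time: O(1) insertion, O(n) worst-case lookup, and so on.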

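I asked a few people about the difference between std::map and a hash table, and got this answer:

"std::map adds elements as a binary tree, which gives O(log n), whereas implementing it with a hash table would be O(n)."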

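So I decided to implement a hash table as an array of linked lists (separate chaining). In the code below, each node holds two values: the key (the word) and the value (its frequency). It works like this: when the first node is added and the slot at its index is empty, it is inserted directly as the first element of the linked list, with frequency 0. If the word is already in the list (unfortunately the search takes O(n) time), its frequency is incremented by 1. If it is not found, it is simply added to the beginning of the list.

I know the implementation has many flaws, so I'd like to ask the experienced people here: how can this implementation be improved so that it counts frequencies efficiently?

The code I've written so far: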

#include <iostream>
#include <string>
#include <cstdio>
using namespace std;
struct Node {
    string word;
    int frequency;
    Node *next;
};
class linkedList
{
private:
    friend class hashTable;
    Node *firstPtr;
    Node *lastPtr;
    int size;
public:
    linkedList()
    {
        firstPtr=lastPtr=NULL;
        size=0;
    }
    void insert(string word,int frequency)
    {
        Node* newNode=new Node;
        newNode->word=word;
        newNode->frequency=frequency;
        newNode->next=firstPtr; //NULL when the list is empty, so the list is always terminated
        if(firstPtr==NULL)
            lastPtr=newNode;
        firstPtr=newNode;
        size++;
    }
    int sizeOfList()
    {
        return size;
    }
    void print()
    {
        if(firstPtr!=NULL)
        {
            Node *temp=firstPtr;
            while(temp!=NULL)
            {
                cout<<temp->word<<" "<<temp->frequency<<endl;
                temp=temp->next;
            }
        }
        else
            printf("%s","List is empty");
    }
};
class hashTable
{
private:
    linkedList* arr;
    int index,sizeOfTable;
public:
    hashTable(int size) //Forced initializer
    {
        sizeOfTable=size;
        arr=new linkedList[sizeOfTable];
    }
    int hash(const string& key)
    {
        //Unsigned arithmetic: wraparound is well defined, signed overflow is not.
        unsigned int hashVal=0;
        for(size_t i=0;i<key.length();i++)
            hashVal=37*hashVal+static_cast<unsigned char>(key[i]);
        return static_cast<int>(hashVal%sizeOfTable);
    }
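    void insert(string key)
    {
        index=hash(key);
        //Search for the key throughout the linked list at this index.
        for(Node *cur=arr[index].firstPtr;cur!=NULL;cur=cur->next)
        {
            if(cur->word==key)
            {
                cur->frequency++; //Found: increment its value by 1
                return;
            }
        }
        //Not found (or empty list): add the node to the beginning.
        //The count starts at 0 as described above, so the stored value is
        //occurrences-1; pass 1 instead to store the true count.
        arr[index].insert(key, 0);
    }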

};

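Do you care about the worst case? If not, use std::unordered_map (it handles collisions, and you don't need a multimap) or a trie/crit-bit tree (depending on the keys, it may be more compact than a hash table, which can lead to better cache behavior). If you do, use std::set (or std::map, if you need to store a count alongside each key) or a trie.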
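As a minimal sketch of the std::unordered_map suggestion, assuming the input is whitespace-separated text, the whole counter fits in a few lines. The function name `countWords` is hypothetical and not part of the question's code.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Count how often each whitespace-separated word occurs in `text`.
// `countWords` is a hypothetical helper name used only for illustration.
std::unordered_map<std::string, int> countWords(const std::string& text)
{
    std::unordered_map<std::string, int> freq;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        ++freq[word];   // operator[] value-initializes a missing count to 0
    return freq;
}
```

Calling `countWords("a b a")` yields a map in which "a" maps to 2 and "b" maps to 1. Collision handling and resizing are left to the standard library.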

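For example, if you want online top-k statistics, keep a priority queue in addition to the dictionary. Each dictionary value holds the number of occurrences and a flag saying whether the word is currently in the queue. The queue duplicates the top-k frequency/word pairs, but is keyed by frequency. Whenever you scan another word, check whether it is (1) not already in the queue and (2) more frequent than the smallest element in the queue. If so, extract the smallest queue element and insert the word you just scanned.

You can implement your own data structure if you like, but the programmers who work on STL implementations tend to be pretty sharp. I would first make sure that this is actually where the bottleneck is.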
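The top-k scheme can be sketched roughly as follows. This is one possible reading, not the answerer's own code: the class name `TopK` and its members are hypothetical, and the sketch uses a std::set of (count, word) pairs as the priority queue so that a word already in the queue can be re-keyed when its count grows, a detail the description leaves open.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical class: tracks the k most frequent words seen so far.
class TopK
{
    struct Entry { int count = 0; bool inQueue = false; };
    std::unordered_map<std::string, Entry> dict;   // word -> count and queue flag
    std::set<std::pair<int, std::string>> queue;   // (count, word), smallest first
    std::size_t limit;                             // k
public:
    explicit TopK(std::size_t k) : limit(k) {}

    void add(const std::string& word)
    {
        Entry& e = dict[word];
        if (e.inQueue) {
            // Already tracked: re-key its queue entry under the new count.
            queue.erase(std::make_pair(e.count, word));
            ++e.count;
            queue.emplace(e.count, word);
            return;
        }
        ++e.count;
        if (queue.size() < limit) {
            // Queue not full yet: every new word gets in.
            queue.emplace(e.count, word);
            e.inQueue = true;
        } else if (!queue.empty() && e.count > queue.begin()->first) {
            // More frequent than the current minimum: replace the minimum.
            dict[queue.begin()->second].inQueue = false;
            queue.erase(queue.begin());
            queue.emplace(e.count, word);
            e.inQueue = true;
        }
    }

    // Current top-k as (word, count) pairs, most frequent first.
    std::vector<std::pair<std::string, int>> top() const
    {
        std::vector<std::pair<std::string, int>> out;
        for (auto it = queue.rbegin(); it != queue.rend(); ++it)
            out.emplace_back(it->second, it->first);
        return out;
    }
};
```

Each `add` costs one hash lookup plus O(log k) work on the queue, so the top-k list stays current without re-sorting the whole dictionary.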

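1- Searching in std::map and std::set takes O(log(n)) time. std::unordered_map and std::unordered_set take O(1) time on average, with an O(n) worst case. However, the constant cost of hashing can be quite large, and for small n it may exceed log(n). I always keep this in mind.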

2- If you want to use std::unordered_map, you need to make sure std::hash is defined for your key type. Otherwise, you have to define it yourself.
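std::string already has a std::hash specialization, so plain word keys need nothing extra. For a user-defined key, a sketch of what defining it might look like is shown here; the `Bigram` type is a hypothetical example, and the way the two hashes are mixed is just one common choice, not a requirement of the standard.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type: a pair of adjacent words.
struct Bigram
{
    std::string first;
    std::string second;
    // unordered_map also needs equality to resolve collisions.
    bool operator==(const Bigram& other) const
    {
        return first == other.first && second == other.second;
    }
};

namespace std {
// Specializing std::hash for a user-defined type is permitted.
template <>
struct hash<Bigram>
{
    size_t operator()(const Bigram& b) const
    {
        size_t h1 = hash<string>()(b.first);
        size_t h2 = hash<string>()(b.second);
        // Mix the two hashes so that (x, y) and (y, x) usually differ.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};
}

// With the specialization in place the type works as a key directly.
using BigramCounts = std::unordered_map<Bigram, int>;
```

An alternative is to pass a custom hasher as the third template argument of std::unordered_map instead of specializing std::hash.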