C++: ~1M look-ups in unordered_map with string keys run much slower than .NET code

Updated: 2023-10-16

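I have .NET and C++ implementations of a perf test function that performs 854,750 lookups in a dictionary, using string keys drawn from a pool of 6838 keys. I wrote these functions to investigate a performance bottleneck in a real application.

The .NET implementation is written in F#, uses Dictionary, and is compiled for .NET 4.0.

The C++ implementation uses std::unordered_map and is built with VS2010 in Release mode.

On my machine the .NET code runs in 240 ms on average and the C++ code runs in 630 ms. Could you help me understand what causes such a large difference in speed?

If I shorten the keys in the C++ implementation and use the prefix "key_" instead of "key_prefix_", it runs in 140 ms.

Another trick I tried was to replace std::string with a custom immutable string implementation that holds a const char* pointer to the source and a hash that is computed once. Using this string brought the run time of the C++ implementation down to 190 ms.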
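The custom string type is not shown in the question, so the following is only a rough sketch of what such a key could look like. The names CachedKey and CachedKeyHash are made up for illustration, the key does not own its characters (the source buffer must outlive every key that points at it), and FNV-1a is an assumed choice of hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unordered_map>

// Hypothetical immutable key: a non-owning view of a C string plus a hash
// that is computed exactly once, at construction.
class CachedKey {
public:
    explicit CachedKey(const char* text)
        : text_(text), length_(std::strlen(text)), hash_(fnv1a(text, length_)) {}

    const char* c_str() const { return text_; }
    std::size_t size() const { return length_; }
    std::size_t hash() const { return hash_; }

    // Compare the cheap fields first; fall back to the bytes only if they match.
    bool operator==(const CachedKey& other) const {
        return hash_ == other.hash_ && length_ == other.length_ &&
               std::memcmp(text_, other.text_, length_) == 0;
    }

private:
    // FNV-1a style loop over every byte (32-bit constants, illustrative only).
    static std::size_t fnv1a(const char* data, std::size_t length) {
        std::size_t value = 2166136261u;
        for (std::size_t i = 0; i < length; ++i) {
            value ^= static_cast<unsigned char>(data[i]);
            value *= 16777619u;
        }
        return value;
    }

    const char* text_;     // not owned
    std::size_t length_;
    std::size_t hash_;     // cached, never recomputed
};

// Hasher for unordered_map that simply returns the cached value.
struct CachedKeyHash {
    std::size_t operator()(const CachedKey& key) const { return key.hash(); }
};

typedef std::unordered_map<CachedKey, float, CachedKeyHash> CachedKeyDictionary;
```

With this in place, an expression such as dictionary[CachedKey("key_prefix_1")] hashes the characters once when the key is built, and any rehash of the table reuses the stored value.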

C++ code:

struct SomeData
{
public:
    float Value;
};
typedef std::string KeyString;
typedef std::unordered_map<KeyString, SomeData> DictionaryT;
const int MaxNumberOfRuns = 125;
const int MaxNumberOfKeys = 6838;
DictionaryT dictionary;
dictionary.rehash(MaxNumberOfKeys);
auto timer = Stopwatch::StartNew();
int lookupCount = 0;
char keyBuffer[100] = "key_prefix_";
size_t keyPrefixLen = std::strlen(keyBuffer);
/// run MaxNumberOfRuns * MaxNumberOfKeys iterations
for(int runId = 0; runId < MaxNumberOfRuns; runId++)
{
    for(int keyId = 0; keyId < MaxNumberOfKeys; keyId++)
    {
        /// get a new key from the pool of MaxNumberOfKeys keys           
        int randomKeySuffix = (std::rand() % MaxNumberOfKeys);
        ::itoa(randomKeySuffix, keyBuffer + keyPrefixLen, 10);
        KeyString key = keyBuffer;
        /// lookup key in the dictionary         
        auto dataIter = dictionary.find(key);
        SomeData* data;
        if(dataIter != dictionary.end())
        {
            /// get existing value           
            data = &dataIter->second;
        }
        else
        {
            /// add a new value
            data = &dictionary.insert(dataIter, DictionaryT::value_type(key, SomeData()))->second;
        }
        /// update corresponding value in the dictionary
        data->Value += keyId * runId;
        lookupCount++;
    }
}
timer.Stop();
std::cout << "Time: " << timer.GetElapsedMilleseconds() << " ms" << std::endl;
std::cout << "Lookup count: " << lookupCount << std::endl;

Prints:

Time: 636 ms
Lookup count: 854750

F# code:

open System
open System.Diagnostics
open System.Collections.Generic
type SomeData =
    struct
        val mutable Value : float
    end
let dictionary = new Dictionary<string, SomeData>()
let randomGen = new Random()
let MaxNumberOfRuns = 125
let MaxNumberOfKeys = 6838
let timer = Stopwatch.StartNew()
let mutable lookupCount = 0
/// run MaxNumberOfRuns * MaxNumberOfKeys iterations
for runId in 1 .. MaxNumberOfRuns do
    for keyId in 1 .. MaxNumberOfKeys do
        /// get a new key from the pool of MaxNumberOfKeys keys
        let randomKeySuffix = randomGen.Next(0, MaxNumberOfKeys).ToString()
        let key = "key_prefix_" + randomKeySuffix
        /// lookup key in the dictionary
        let mutable found, someData = dictionary.TryGetValue (key)
        if not(found) then
            /// add a new value
            someData <- new SomeData()
            dictionary.[key] <- someData
        /// update corresponding value in the dictionary
        someData.Value <- someData.Value + float(keyId) * float(runId)
        lookupCount <- lookupCount + 1
timer.Stop()
printfn "Time: %d ms" timer.ElapsedMilliseconds
printfn "Lookup count: %d" lookupCount

Prints:

Time: 245 ms
Lookup count: 854750

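Visual Studio 2010 uses a fast hash function for std::string rather than a thorough one. Basically, once the key string has 10 or more characters, the hash function stops using every character and walks the string with a stride greater than 1.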

size_t operator()(const _Kty& _Keyval) const
    {   // hash _Keyval to size_t value by pseudorandomizing transform
    size_t _Val = 2166136261U;
    size_t _First = 0;
    size_t _Last = _Keyval.size();
    size_t _Stride = 1 + _Last / 10;
    for(; _First < _Last; _First += _Stride)
        _Val = 16777619U * _Val ^ (size_t)_Keyval[_First];
    return (_Val);
    }

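size() >= 10 - stride 2: every second character, starting from the first
size() >= 20 - stride 3: every third character, starting from the first
…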

Because of this, collisions happen more often, which of course slows the code down. Try a custom hash function in the C++ version.
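To see how much the stride matters for this particular key pool, the sketch below re-creates the quoted function and compares it with a hasher that reads every character. For the keys key_prefix_0 to key_prefix_6837 (12 to 15 characters) the stride is 2, and the 11-character prefix occupies indices 0 to 10, so the only digits that get sampled are those at indices 12 and 14. That leaves at most 1 + 10 + 100 = 111 distinct hash values for 6838 keys. The names FullStringHash, distinct_hashes and FullHashDictionary are invented for this illustration, FNV-1a is only one reasonable choice, and std::to_string(int) assumes a compiler with fuller C++11 library support than VS2010.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>

// Re-creation of the strided hash quoted above, with plain identifiers.
std::size_t strided_hash(const std::string& key) {
    std::size_t value = 2166136261u;
    const std::size_t stride = 1 + key.size() / 10;
    for (std::size_t i = 0; i < key.size(); i += stride)
        value = 16777619u * value ^ static_cast<std::size_t>(key[i]);
    return value;
}

// Hypothetical replacement: an FNV-1a style loop over every character.
struct FullStringHash {
    std::size_t operator()(const std::string& key) const {
        std::size_t value = 2166136261u;
        for (std::size_t i = 0; i < key.size(); ++i) {
            value ^= static_cast<unsigned char>(key[i]);
            value *= 16777619u;
        }
        return value;
    }
};

// Counts how many distinct hash values the given hasher produces over the
// key pool "key_prefix_0" .. "key_prefix_<key_count - 1>".
template <typename Hash>
std::size_t distinct_hashes(Hash hash, int key_count) {
    std::set<std::size_t> seen;
    for (int i = 0; i < key_count; ++i)
        seen.insert(hash("key_prefix_" + std::to_string(i)));
    return seen.size();
}

// Passing the hasher as the third template argument swaps it into the map.
typedef std::unordered_map<std::string, float, FullStringHash> FullHashDictionary;
```

Comparing distinct_hashes(strided_hash, 6838) with distinct_hashes(FullStringHash(), 6838) shows how many keys are forced to share a hash value under each scheme.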

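We can only speculate about why one version is faster than the other. A profiler would definitely tell you where the hot spot is, so don't take any of this as a definitive answer.

Your note that the C++ version is faster with a shorter key length is illuminating, because it could point to a couple of things, for instance:

Maybe the hash function for std::string is really optimized for small strings rather than long ones.

Off the cuff, here are some observations based on my experience with unordered_map (though I am more familiar with the Boost implementation than with Microsoft's).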

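In this example there is no reason to use std::string as the key type; just use the integer value. That would probably make both the C++ and the F# versions faster.
When you insert values into the map, doing a find followed by an insert is probably not faster, because both have to hash the key string again. Just use the [] operator, which performs a find-or-insert on its own. I suppose this depends on how often you get a hit in the map versus adding a new value.
If allocation is the bottleneck and you must use a string key type, you might get better performance by storing a shared pointer to the string instead of copying the string when you insert it into the map.
Try providing your own hash function for the key type that ignores the "key_prefix_" part of the string.
Try Boost's implementation.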
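A few of the suggestions above can be sketched together. This is an illustration under assumptions rather than a tuned implementation: SuffixHash, SuffixDictionary, IntDictionary and accumulate are hypothetical names, the prefix length of 11 is hard-coded for "key_prefix_", and the FNV-1a style loop is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

struct SomeData {
    float Value = 0.0f;
};

// Hypothetical hasher that skips the constant "key_prefix_" part and hashes
// only the characters that follow it.
struct SuffixHash {
    std::size_t operator()(const std::string& key) const {
        static const std::size_t prefix_length = 11;  // strlen("key_prefix_")
        std::size_t value = 2166136261u;
        for (std::size_t i = prefix_length; i < key.size(); ++i) {
            value ^= static_cast<unsigned char>(key[i]);
            value *= 16777619u;
        }
        return value;
    }
};

typedef std::unordered_map<std::string, SomeData, SuffixHash> SuffixDictionary;

// Find-or-insert in one call: operator[] creates a value-initialized entry
// when the key is missing and returns a reference to the value either way.
inline void accumulate(SuffixDictionary& dictionary, const std::string& key, float delta) {
    dictionary[key].Value += delta;
}

// Variant with integer keys, for the case where the string adds no information.
typedef std::unordered_map<int, SomeData> IntDictionary;

inline void accumulate(IntDictionary& dictionary, int key, float delta) {
    dictionary[key].Value += delta;
}
```

In the question's loop, the find, the end() check and the hinted insert would collapse into a single accumulate(dictionary, key, keyId * runId) call. Whether that is faster depends on the hit rate, as noted above.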

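Again, a profiler run will quickly tell you where to look for this kind of problem. Specifically, it will tell you whether the bottleneck is in hashing or in allocation.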
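When no profiler is at hand, a crude first approximation is to time the phases separately. The sketch below is an assumption-laden stand-in for real profiling: elapsed_ms, PhaseTimes and time_phases are made-up names, keys are built up front instead of inside the loop as in the question, and std::chrono needs a newer toolchain than VS2010.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double elapsed_ms(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct PhaseTimes {
    double build_keys_ms;  // string construction and allocation only
    double lookups_ms;     // hashing, comparison and insertion
    std::size_t entries;   // distinct keys that ended up in the map
};

// Times key construction and map access as two separate phases.
PhaseTimes time_phases(int key_count, int runs) {
    PhaseTimes result = PhaseTimes();

    std::vector<std::string> keys;
    result.build_keys_ms = elapsed_ms([&] {
        keys.reserve(static_cast<std::size_t>(key_count));
        for (int i = 0; i < key_count; ++i)
            keys.push_back("key_prefix_" + std::to_string(i));
    });

    std::unordered_map<std::string, float> dictionary;
    result.lookups_ms = elapsed_ms([&] {
        for (int run = 0; run < runs; ++run)
            for (int i = 0; i < key_count; ++i)
                dictionary[keys[static_cast<std::size_t>(i)]] += static_cast<float>(i) * run;
    });

    result.entries = dictionary.size();
    return result;
}
```

Calling time_phases(6838, 125) mirrors the question's workload size. If lookups_ms dominates, hashing and collisions are the likely suspects; if build_keys_ms is comparable, string construction deserves a closer look.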

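When you are dealing with pure data structure code, a speed ratio of 2.6 is not that surprising. Take a look at the slides for this project and you will see what I mean.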