How to optimize directory listing from list of paths?

Keywords: list, optimization, path. Updated: 2023-10-16

While writing a FUSE filesystem, I use an unordered_map<std::string, struct stat> as a cache. It is pre-filled at startup with all files and directories to reduce reads from the hard drive.

To satisfy the readdir() callback, I wrote the following loop:

const int sp = path == "/" ? 0 : path.size();
for (auto it = stat_cache.cbegin(); it != stat_cache.cend(); it++)
{
    if (it->first.size() > sp)
    {
        int ls = it->first.find_last_of('/');
        if (it->first.find(path, 0) == 0 && ls == sp)
            filler(buf, it->first.substr(ls + 1).c_str(), const_cast<struct stat*>(&it->second), 0, FUSE_FILL_DIR_PLUS);
    }
}

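The idea is that an object whose path starts with the directory path, and whose last slash sits right at the end of the directory path, is a member of that directory. I have tested this thoroughly and it works.
Illustration: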

Reading directory: /foo/bar
Candidate file:    /bazboo/oof - not in dir (wrong prefix)
Candidate file:    /foo/bar/baz/boo - not in dir (wrong lastslash location)
Candidate file:    /foo/bar/baz - in dir!

Now, this is surprisingly slow (especially on filesystems with more than half a million objects in the cache). Valgrind/Callgrind mostly blames the std::string::find_last_of() and std::string::find() calls.

I have already added if (it->first.size() > sp) to speed up the loop, but the performance gain was minimal at best.

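I also tried to speed up the loop by parallelizing it in four chunks, but that ended in a segfault during unordered_map::cbegin().
I no longer have the actual code, but I believe it looked something like this: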

const int sp = path == "/" ? 0 : path.size();
ThreadPool<4> tpool;
ulong cq = stat_cache.size()/4;
for (int i = 0; i < 4; i++)
{
    tpool.addTask([&] () {
        auto it = stat_cache.cbegin();
        std::next(it, i * cq);
        for (int j = 0; j < cq && it != stat_cache.cend(); j++, it++)
        {
            if (it->first.size() > sp)
            {
                int ls = it->first.find_last_of('/');
                if (it->first.find(path, 0) == 0 && ls == sp)
                    filler(buf, it->first.substr(ls + 1).c_str(), const_cast<struct stat*>(&it->second), 0, FUSE_FILL_DIR_PLUS);
            }
        }
    });
}
tpool.joinAll();

I also tried splitting the work by map buckets, for which unordered_map::cbegin(int) provides a convenient overload, but that still segfaulted.

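Again, I am currently using the first (non-parallel) code and would like help with that version, since the parallel one does not work. I only included my parallelization attempt for completeness, noob-bashing, and proof of effort.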

Are there any other options to optimize this loop?

The trivial thing to do here is to change the if from this:

if (it->first.find(path, 0) == 0 && ls == sp)

to simply this:

if (ls == sp && it->first.find(path, 0) == 0)

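Obviously, comparing two integers is much faster than searching for a substring.
I can't guarantee it will turn the performance around, but it may help skip a lot of unnecessary std::string::find calls. Maybe the compiler already does this; I would look into the disassembly.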
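
As a minimal sketch of the reordering above, the membership test could be pulled into a small helper. The name is_direct_child is hypothetical and not from the original post. The sketch also swaps find() for compare(), which is my own assumption about what helps: find(path, 0) keeps scanning the rest of the string when the prefix does not match at position 0, while compare() looks at no more than sp characters. Unlike the loop in the question, it also rejects entries with an empty name (such as "/" itself when listing the root).

```cpp
#include <string>

// Hypothetical helper (not from the original post): returns true when
// `entry` is a direct child of the directory `dir`.
bool is_direct_child(const std::string& entry, const std::string& dir)
{
    // Same convention as the question: the root "/" has prefix length 0.
    const std::string::size_type sp = (dir == "/") ? 0 : dir.size();

    // Need at least the separating '/' plus one character of name.
    if (entry.size() <= sp + 1)
        return false;

    // Cheap integer comparison first: the last slash must sit at index sp.
    if (entry.find_last_of('/') != sp)
        return false;

    // Prefix check that inspects at most sp characters.
    return entry.compare(0, sp, dir, 0, sp) == 0;
}
```

With the candidates from the illustration, only /foo/bar/baz passes for the directory /foo/bar; the other two are rejected by the integer comparison before any prefix check runs.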

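Also, since file paths are unique anyway, I would use a std::vector<std::pair<...>> instead: better cache locality, fewer memory allocations, and so on. Remember to reserve the size first.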
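
One way the vector idea could be taken further is to keep the vector sorted by path, so that a directory's entries form one contiguous range that a binary search can find. This goes beyond what the answer states and is only a sketch under that assumption; Entry, sort_cache, and list_children are hypothetical names. Note that the scan still walks the whole subtree below the directory, not just its direct children.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>
#include <sys/stat.h>

using Entry = std::pair<std::string, struct stat>;

// Hypothetical helper: sort once after the cache has been filled
// (call cache.reserve(n) before filling to avoid reallocations).
void sort_cache(std::vector<Entry>& cache)
{
    std::sort(cache.begin(), cache.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
}

// Hypothetical helper: names of the direct children of `dir`.
// Assumes `cache` is sorted by path (see sort_cache).
std::vector<std::string> list_children(const std::vector<Entry>& cache,
                                       const std::string& dir)
{
    const std::string prefix = (dir == "/") ? "/" : dir + "/";
    std::vector<std::string> names;

    // Binary search for the first path that is >= prefix.
    auto it = std::lower_bound(
        cache.begin(), cache.end(), prefix,
        [](const Entry& e, const std::string& key) { return e.first < key; });

    // All paths sharing the prefix are adjacent in sorted order.
    for (; it != cache.end() &&
           it->first.compare(0, prefix.size(), prefix) == 0; ++it)
    {
        // Direct child: non-empty name with no further '/' after the prefix.
        if (it->first.size() > prefix.size() &&
            it->first.find('/', prefix.size()) == std::string::npos)
            names.push_back(it->first.substr(prefix.size()));
    }
    return names;
}
```

In a readdir() callback, the loop body would call filler with the name and a pointer to it->second instead of collecting names; the vector return here just keeps the sketch testable.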

The real problem is this loop:

for (auto it = stat_cache.cbegin(); it != stat_cache.cend(); it++)

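It effectively throws away unordered_map's biggest strength and exposes one of its weaknesses. Not only do you lose the O(1) lookup, you may also have to hunt through the map to find each entry, which makes the iteration O(N) with a very large constant factor k (if not an extra factor of N, i.e. O(N^2)).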

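The fastest solution would be a direct lookup of the directory: O(1) in a lucky unordered map, O(strlen(target)) in a per-character structure such as a trie, or O(lg N) in a binary tree. Then keep a list of children alongside the struct stat, so that listing a directory costs O(#Children).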
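
A rough sketch of the "list of children alongside the struct stat" idea, assuming the cache stays an unordered_map keyed by full path. CacheNode, StatCache, cache_insert, and cache_children are hypothetical names, and the sketch assumes every path is inserted exactly once (a repeated insert would duplicate the child name).

```cpp
#include <string>
#include <unordered_map>
#include <vector>
#include <sys/stat.h>

// Hypothetical cache node: the stat data plus the names of direct children.
struct CacheNode
{
    struct stat st{};
    std::vector<std::string> children;
};

using StatCache = std::unordered_map<std::string, CacheNode>;

// Hypothetical helper: store `path` and register its name with the parent.
// A missing parent node is created with a zeroed struct stat.
void cache_insert(StatCache& cache, const std::string& path,
                  const struct stat& st)
{
    cache[path].st = st;
    if (path == "/")
        return;                              // the root has no parent

    const std::string::size_type ls = path.find_last_of('/');
    if (ls == std::string::npos)
        return;                              // not an absolute path

    const std::string parent = (ls == 0) ? "/" : path.substr(0, ls);
    cache[parent].children.push_back(path.substr(ls + 1));
}

// Hypothetical helper: one hash lookup, then the caller walks the children.
// Returns nullptr when `dir` is not in the cache.
const std::vector<std::string>* cache_children(const StatCache& cache,
                                               const std::string& dir)
{
    const auto it = cache.find(dir);
    return it == cache.end() ? nullptr : &it->second.children;
}
```

The readdir() callback would then do a single lookup for the directory and call filler once per child name, looking up each child's struct stat by its full path, instead of scanning all half a million entries.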