Why does my lock-free message queue segfault?

Updated: 2023-10-16

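As a purely mental exercise, I'm trying to get this to work without locks or mutexes. The idea is that while the consumer thread is reading/executing messages, it atomically swaps which std::vector the producer thread uses for writes. Is this possible? I've tried playing around with thread fences, to no avail. There is a race condition somewhere, because it occasionally segfaults. I imagine it's somewhere in the enqueue function. Any ideas?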

#include <atomic>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

// should execute functions on the original thread
class message_queue {
public:
    using fn = std::function<void()>;
    using queue = std::vector<fn>;

    message_queue() : write_index(0) {
    }

    // should only be called from consumer thread
    void run () {
        // atomically gets the current pending queue and switches it with the other one
        // for example if we're writing to queues[0], we grab a reference to queues[0]
        // and tell the producer to write to queues[1]
        queue& active = queues[write_index.fetch_xor(1)];
        // skip if we don't have any messages
        if (active.size() == 0) return;
        // run all messages/callbacks
        for (auto fn : active) {
            fn();
        }
        // clear the active queue so it can be re-used
        active.clear();
        // swap active and pending queues
        write_index.fetch_xor(1);
    }

    void enqueue (fn value) {
        // loads the current pending queue and appends some work
        queues[write_index.load()].push_back(value);
    }

private:
    queue queues[2];
    std::atomic<bool> is_empty; // unused for now
    std::atomic<int> write_index;
};

int main(int argc, const char * argv[])
{
    message_queue queue{};
    // flag to stop the message loop
    // doesn't actually need to be atomic because it's only read/written on the main thread
    std::atomic<bool> done(false);
    std::thread worker([&queue, &done] {
        int count = 100;
        // send 99 messages (the pre-decrement stops the loop when count hits 0)
        while (--count) {
            queue.enqueue([count] {
                // should be executed in the main thread
                std::cout << count << "\n";
            });
        }
        // finally tell the main thread we're done
        queue.enqueue([&] {
            std::cout << "done!\n";
            done = true;
        });
    });
    // run messages until the done flag is set
    while(!done) queue.run();
    worker.join();
}

If I understand your code correctly, there is a data race, for example:

// producer
int r0 = write_index.load(); // r0 == 0
// consumer
int r1 = write_index.fetch_xor(1); // r1 == 0
queue& active = queues[r1];
active.size();
// producer
queues[r0].push_back(...);

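Now both threads are accessing the same queue at the same time: the consumer reads it (size(), iteration, clear()) while the producer calls push_back on it, which may reallocate the vector's storage. That is a data race, which means undefined behavior.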
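To make the interleaving concrete, it can be replayed on a single thread, which removes the timing dependence. This is only a sketch: `replay_result` and `replay_interleaving` are hypothetical names introduced for illustration, and the snippet models just the index handling, not the vectors themselves.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper type: the queue index each side ends up using.
struct replay_result {
    int producer_index;
    int consumer_index;
};

// Replays the interleaving step by step on one thread.
inline replay_result replay_interleaving() {
    std::atomic<int> write_index(0);

    // producer: loads the index, but has not called push_back yet
    int r0 = write_index.load();

    // consumer: flips the index; fetch_xor returns the value held BEFORE the flip
    int r1 = write_index.fetch_xor(1);

    // the producer still holds r0, so both sides now refer to the same queue
    return {r0, r1};
}
```

Calling `replay_interleaving()` yields the same index for both sides, which is exactly the window in which `push_back` and the consumer's loop can touch one vector concurrently.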

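Your lock-free queue fails to work because you did not start with at least a semi-formal proof of correctness and then turn that proof into an algorithm, with the proof as the primary text and comments connecting the proof to the code.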

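Unless you are copy/pasting the implementation of someone who did exactly that, any attempt to write a lock-free algorithm will fail. If you are copy-pasting someone else's implementation, please provide it.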

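Lock-free algorithms are not robust unless you have such a proof that they are correct, because the kinds of errors that make them fail are subtle, and extreme care must be taken. Simply "rolling your own" lock-free algorithm, even if it shows no apparent problems during testing, is a recipe for unreliable code.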

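One way to get around writing a formal proof in this kind of situation is to track down someone who has already written proven-correct pseudocode or the like. Sketch out the pseudocode, together with the proof of correctness, in comments. Then fill in the code in the holes.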
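As an illustration of that workflow, here is a minimal sketch of a widely used design, a bounded single-producer/single-consumer ring buffer, with the invariants written down first and each line tied back to them. The names `spsc_ring`, `try_push`, and `try_pop` are hypothetical; the sketch assumes C++17, a default-constructible and move-assignable `T`, and exactly one producer thread and one consumer thread. The comments are informal invariants, not a proof, and the buffer is bounded, so it is not a drop-in replacement for the queue in the question.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Invariants (informal):
//  (I1) only the producer stores to tail_; only the consumer stores to head_.
//  (I2) slots in [head_, tail_) (mod N) belong to the consumer;
//       every other slot belongs to the producer.
//  (I3) a release-store to an index publishes the slot access that preceded it;
//       the other side observes it through an acquire-load of that index.
template <typename T, std::size_t N>
class spsc_ring {
    static_assert(N >= 2, "one slot stays empty to tell full from empty");
public:
    // producer thread only
    bool try_push(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed); // I1: we own tail_
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false; // full
        slots_[tail] = std::move(value);               // I2: slot is producer-owned
        tail_.store(next, std::memory_order_release);  // I3: hand the slot to the consumer
        return true;
    }

    // consumer thread only
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed); // I1: we own head_
        if (head == tail_.load(std::memory_order_acquire)) return std::nullopt; // empty
        std::optional<T> out(std::move(slots_[head])); // I2: slot is consumer-owned
        head_.store((head + 1) % N, std::memory_order_release); // I3: hand the slot back
        return out;
    }

private:
    T slots_[N]{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

With `spsc_ring<int, 4>`, three pushes succeed and the fourth is rejected because one slot is always kept empty; pops then return the values in insertion order. Unlike the two-vector scheme, the producer and consumer never hold a reference to the same slot at the same time, which is what (I2) is there to state.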

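In general, proving that an "almost correct" lock-free algorithm is flawed is harder than writing a solid proof that a lock-free algorithm is correct when implemented in a particular way, and then implementing it. And if your algorithm is so flawed that the flaws are easy to find, then you are not showing a basic understanding of the problem domain.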

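In short, by posting "why is my algorithm wrong", you are approaching the writing of lock-free algorithms incorrectly. "Where is the flaw in my proof?" and "I proved this pseudocode correct here, and then I implemented it; why do my tests show deadlocks?" are good lock-free questions. "Here is a bunch of code with comments that merely describe what the next line of code does, and no comments describing why I do the next line of code, or how that line of code maintains my lock-free invariants" is not a good lock-free question.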

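Step back. Find some proven-correct algorithms. Learn how the proofs work. Implement some proven-correct algorithms via monkey-see, monkey-do. Look at the footnotes to notice the issues their proofs overlooked (such as the ABA problem). After you have a bunch under your belt, try a variant: do the proof, check the proof, do the implementation, and check the implementation.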