Why is this code getting faster when I'm using way more threads than my CPU has cores?
I have some timing code that I use to measure how long a given snippet of code takes to run:
#include <chrono>
#include <iomanip>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

struct time_data {
    std::chrono::steady_clock::time_point start, end;

    auto get_duration() const {
        return end - start;
    }

    void print_data(std::ostream & out) const {
        out << str();
    }

    std::string str() const {
        static const std::locale loc{ "" };
        std::stringstream ss;
        ss.imbue(loc);
        ss << "Start: " << std::setw(24) << std::chrono::nanoseconds{ start.time_since_epoch() }.count() << "ns\n";
        ss << "End: " << std::setw(24) << std::chrono::nanoseconds{ end.time_since_epoch() }.count() << "ns\n";
        ss << "Duration: " << std::setw(24) << std::chrono::nanoseconds{ get_duration() }.count() << "ns\n";
        return ss.str();
    }

    friend std::ostream & operator<<(std::ostream & out, const time_data & data) {
        return out << data.str();
    }
};

template<typename T>
time_data time_function(T && func) {
    time_data data;
    data.start = std::chrono::steady_clock::now();
    func();
    data.end = std::chrono::steady_clock::now();
    return data;
}
Then, to make use of it, I devised the following code to perform an accumulation over a set of data, except that instead of simply adding the numbers together, it first finds the total stopping time from applying the Collatz conjecture to each number and accumulates that.
#include <array>
#include <cstdint>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

template<typename T>
T accumulation_function(T a, T b) {
    T count = 0;
    while (b > 1) {
        if (b % 2 == 0) b /= 2;
        else b = b * 3 + 1;
        ++count;
    }
    return a + count;
}

template<typename IT>
auto std_sum(IT begin, IT end) {
    auto sum = (*begin - *begin);
    sum = std::accumulate(begin, end, sum, accumulation_function<decltype(sum)>);
    return sum;
}

template<typename IT>
auto single_thread_sum(IT begin, IT end) {
    auto sum = (*begin - *begin);
    IT current = begin;
    while (current != end) {
        sum = accumulation_function(sum, *current);
        ++current;
    }
    return sum;
}
template<typename IT, uint64_t N>
auto N_thread_smart_sum(IT begin, IT end) {
    auto sum = (*begin - *begin);
    std::vector<std::thread> threads;
    std::array<decltype(sum), N> sums;
    auto dist = std::distance(begin, end);
    for (uint64_t i = 0; i < N; i++) {
        threads.emplace_back([=, &sums] {
            IT first = begin;
            IT last = begin;
            auto & tsum = sums[i];
            tsum = 0;
            std::advance(first, i * dist / N);
            std::advance(last, (i + 1) * dist / N);
            while (first != last) {
                tsum = accumulation_function(tsum, *first);
                ++first;
            }
        });
    }
    for (std::thread & thread : threads)
        thread.join();
    for (const auto & s : sums) {
        sum += s;
    }
    return sum;
}
template<typename IT>
auto two_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 2>(begin, end);
}

template<typename IT>
auto four_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 4>(begin, end);
}

template<typename IT>
auto eight_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 8>(begin, end);
}

template<typename IT>
auto sixteen_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 16>(begin, end);
}

template<typename IT>
auto thirty_two_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 32>(begin, end);
}

template<typename IT>
auto sixty_four_thread_smart_sum(IT begin, IT end) {
    return N_thread_smart_sum<IT, 64>(begin, end);
}
For reference, the boilerplate main code that runs all of this:
int main() {
    std::vector<int64_t> raw_data;
    auto fill_data = time_function([&raw_data] {
        constexpr uint64_t SIZE = 1'000'000'000ull;
        raw_data.resize(SIZE);
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; i++) {
            threads.emplace_back([i, SIZE, &raw_data] {
                uint64_t begin = i * SIZE / 8;
                uint64_t end = (i + 1) * SIZE / 8;
                for (uint64_t index = begin; index < end; index++) {
                    raw_data[index] = begin % (20 + i);
                }
            });
        }
        for (std::thread & t : threads)
            t.join();
    });
    int64_t sum;
    std::cout << std::setw(25) << "Fill Data" << std::endl;
    std::cout << fill_data << std::endl;
    auto std_data = time_function([&] {
        sum = std_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "STD Sum: " << sum << std::endl;
    std::cout << std_data << std::endl;
    auto single_data = time_function([&] {
        sum = single_thread_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Single Sum: " << sum << std::endl;
    std::cout << single_data << std::endl;
    auto smart_2_data = time_function([&] {
        sum = two_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Two-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_2_data << std::endl;
    auto smart_4_data = time_function([&] {
        sum = four_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Four-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_4_data << std::endl;
    auto smart_8_data = time_function([&] {
        sum = eight_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Eight-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_8_data << std::endl;
    auto smart_16_data = time_function([&] {
        sum = sixteen_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Sixteen-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_16_data << std::endl;
    auto smart_32_data = time_function([&] {
        sum = thirty_two_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Thirty-Two-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_32_data << std::endl;
    auto smart_64_data = time_function([&] {
        sum = sixty_four_thread_smart_sum(raw_data.begin(), raw_data.end());
    });
    std::cout << std::setw(25) << "Sixty-Four-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_64_data << std::endl;
    return 0;
}
And here is the program's output:
Fill Data
Start: 16,295,979,890,252ns
End: 16,300,523,805,484ns
Duration: 4,543,915,232ns
STD Sum: 7750000000
Start: 16,300,550,212,791ns
End: 16,313,216,607,890ns
Duration: 12,666,395,099ns
Single Sum: 7750000000
Start: 16,313,216,694,522ns
End: 16,325,774,379,684ns
Duration: 12,557,685,162ns
Two-Thread-Smart Sum: 7750000000
Start: 16,325,774,466,014ns
End: 16,334,441,032,868ns
Duration: 8,666,566,854ns
Four-Thread-Smart Sum: 7750000000
Start: 16,334,441,137,913ns
End: 16,342,188,642,721ns
Duration: 7,747,504,808ns
Eight-Thread-Smart Sum: 7750000000
Start: 16,342,188,770,706ns
End: 16,347,850,908,471ns
Duration: 5,662,137,765ns
Sixteen-Thread-Smart Sum: 7750000000
Start: 16,347,850,961,597ns
End: 16,352,187,838,584ns
Duration: 4,336,876,987ns
Thirty-Two-Thread-Smart Sum: 7750000000
Start: 16,352,187,891,710ns
End: 16,356,111,411,220ns
Duration: 3,923,519,510ns
Sixty-Four-Thread-Smart Sum: 7750000000
Start: 16,356,111,471,288ns
End: 16,359,988,028,812ns
Duration: 3,876,557,524ns
The first few results are unsurprising: my hand-written accumulation code works about the same as std::accumulate (I have seen either come out as "fastest" across consecutive runs, which suggests their implementations are probably similar). The code gets faster when I move to two threads and to four threads. That makes sense, since I am running on an Intel 4-core processor.
But the results beyond that point are baffling. My CPU only has 4 cores (8 if you count hyperthreading), yet even if I excuse hyperthreading providing a marginal performance boost at 8 threads, raising the count to 16, 32, and 64 threads all yield additional performance gains. Why is that? How are the extra threads producing additional performance gains when I long ago exceeded the number of threads my CPU can physically run concurrently?
Note: this is not the same as this linked question, because I am dealing with a specific use case and code, whereas the linked question deals with generalities.
Edit: I tweaked the code so that when the threads are created, their priority is set to the highest priority Windows allows. By and large, the results appear to be the same.
Fill Data
Start: 18,950,798,175,057ns
End: 18,955,085,850,007ns
Duration: 4,287,674,950ns
STD Sum: 7750000000
Start: 18,955,086,975,013ns
End: 18,967,581,061,562ns
Duration: 12,494,086,549ns
Single Sum: 7750000000
Start: 18,967,581,136,724ns
End: 18,980,127,355,698ns
Duration: 12,546,218,974ns
Two-Thread-Smart Sum: 7750000000
Start: 18,980,127,426,332ns
End: 18,988,619,042,214ns
Duration: 8,491,615,882ns
Four-Thread-Smart Sum: 7750000000
Start: 18,988,619,135,487ns
End: 18,996,215,684,824ns
Duration: 7,596,549,337ns
Eight-Thread-Smart Sum: 7750000000
Start: 18,996,215,763,004ns
End: 19,002,055,977,730ns
Duration: 5,840,214,726ns
Sixteen-Thread-Smart Sum: 7750000000
Start: 19,002,056,055,608ns
End: 19,006,282,772,254ns
Duration: 4,226,716,646ns
Thirty-Two-Thread-Smart Sum: 7750000000
Start: 19,006,282,840,774ns
End: 19,010,539,676,914ns
Duration: 4,256,836,140ns
Sixty-Four-Thread-Smart Sum: 7750000000
Start: 19,010,539,758,113ns
End: 19,014,450,311,829ns
Duration: 3,910,553,716ns
DOUBLE EDIT COMBO:
I edited the program to make sure the tsum variable is a local inside the lambda, rather than a reference to something outside it. The multithreaded code got considerably faster, but it still exhibits the same behavior.
Fill Data
Start: 21,740,931,802,656ns
End: 21,745,429,501,480ns
Duration: 4,497,698,824ns
STD Sum: 7750000000
Start: 21,745,430,637,655ns
End: 21,758,206,612,632ns
Duration: 12,775,974,977ns
Single Sum: 7750000000
Start: 21,758,206,681,153ns
End: 21,771,260,233,797ns
Duration: 13,053,552,644ns
Two-Thread-Smart Sum: 7750000000
Start: 21,771,260,287,828ns
End: 21,777,845,764,595ns
Duration: 6,585,476,767ns
Four-Thread-Smart Sum: 7750000000
Start: 21,777,845,831,002ns
End: 21,784,011,543,159ns
Duration: 6,165,712,157ns
Eight-Thread-Smart Sum: 7750000000
Start: 21,784,011,628,584ns
End: 21,788,846,061,014ns
Duration: 4,834,432,430ns
Sixteen-Thread-Smart Sum: 7750000000
Start: 21,788,846,127,422ns
End: 21,791,921,652,246ns
Duration: 3,075,524,824ns
Thirty-Two-Thread-Smart Sum: 7750000000
Start: 21,791,921,747,330ns
End: 21,794,980,832,033ns
Duration: 3,059,084,703ns
Sixty-Four-Thread-Smart Sum: 7750000000
Start: 21,794,980,901,761ns
End: 21,797,937,401,426ns
Duration: 2,956,499,665ns
TRIPLE EDIT WOMBO-COMBO:
I ran the same tests again, but in the reverse order. The results are very similar:
Fill Data
Start: 22,845,472,001,475ns
End: 22,850,746,219,215ns
Duration: 5,274,217,740ns
Sixty-Four-Thread-Smart Sum: 7750000000
Start: 22,850,747,328,223ns
End: 22,853,951,903,952ns
Duration: 3,204,575,729ns
Thirty-Two-Thread-Smart Sum: 7750000000
Start: 22,853,951,981,830ns
End: 22,857,239,237,507ns
Duration: 3,287,255,677ns
Sixteen-Thread-Smart Sum: 7750000000
Start: 22,857,239,307,839ns
End: 22,860,389,285,305ns
Duration: 3,149,977,466ns
Eight-Thread-Smart Sum: 7750000000
Start: 22,860,389,353,524ns
End: 22,864,828,560,484ns
Duration: 4,439,206,960ns
Four-Thread-Smart Sum: 7750000000
Start: 22,864,828,633,834ns
End: 22,870,640,982,096ns
Duration: 5,812,348,262ns
Two-Thread-Smart Sum: 7750000000
Start: 22,870,641,056,956ns
End: 22,877,032,284,384ns
Duration: 6,391,227,428ns
Single Sum: 7750000000
Start: 22,877,032,367,997ns
End: 22,889,501,695,960ns
Duration: 12,469,327,963ns
STD Sum: 7750000000
Start: 22,889,501,770,820ns
End: 22,902,014,633,229ns
Duration: 12,512,862,409ns
All of the tsum writes go to the same array, which occupies contiguous addresses in memory. This causes cache contention (false sharing) across all of those writes. With more threads spreading those writes over more distinct cache lines, the CPU cores spend less time invalidating and reloading cache lines.
Try accumulating into a local tsum variable (not a reference into sums), and then writing that to sums[i] once the loop has finished.
I suspect you are seeing the speed increase because your application's threads account for a larger percentage of the total active threads running on the system, giving you a larger share of the time slices. You may want to check whether those numbers get better or worse while other programs are running.