Performance analysis of tiled matrix multiplication in C++ with Valgrind
I am trying to figure out how to implement loop tiling correctly. My code is based on http://people.freebsd.org/~lstewart/articles/cpumemory.pdf . In theory I should get a performance boost from tiled matrix multiplication, but I don't necessarily see one. I will also present Valgrind's cachegrind results, which I believe are interesting.
I have commented out the different approaches.
// C++ program: matrix multiplication
// prints the elapsed time of the multiplication loops measured by omp_get_wtime()
#include <iostream>
#include <algorithm> // std::min
#include <omp.h>

int main(int argc, char *argv[])
{
    // matrix dimensions
    const int row = 1000;
    const int col = 1000;
    // matrices stored as flat arrays of size 1000*1000
    // temp will hold b transposed, a recommendation from the article mentioned above
    // res is of double precision; I ran into errors displaying the data when using a different data type
    int *a = new int[row*col];
    int *b = new int[row*col];
    int *temp = new int[row*col];
    double *res = new double[row*col]();  // value-initialized to 0.0

    // initialization
    for (int i = 0; i < row; ++i) {
        for (int j = 0; j < col; ++j) {
            a[i*col+j] = i*col+j;
            b[i*col+j] = i*col+j;
        }
    }
    // transposition of b
    for (int i = 0; i < row; ++i) {
        for (int j = 0; j < col; ++j) {
            temp[i*col+j] = b[j*col+i];
        }
    }

    int i, j, k, x, y, z;
    // "naive" matrix multiplication
    // double start = omp_get_wtime();
    // for (i = 0; i < row; ++i) {
    //     for (j = 0; j < col; ++j) {
    //         for (k = 0; k < row; ++k) {
    //             res[ i * col + j ] += a[ i * col + k ] * b[ k * col + j ];
    //         }
    //     }
    // }
    // double end = omp_get_wtime();
    // std::cout << end - start << std::endl;

    // "transposed" matrix multiplication
    // for (i = 0; i < row; ++i) {
    //     for (j = 0; j < col; ++j) {
    //         for (k = 0; k < row; ++k) {
    //             res[ i * col + j ] += a[ i * col + k ] * temp[ k + j * col ];
    //         }
    //     }
    // }

    // tiled matrix multiplication
    // from /sys/devices/system/cpu/cpu0/cache/index0,
    // cat coherency_line_size returns 64;
    // thus I will use 64 as the blocking size
    const int incr = 64;
    double start = omp_get_wtime();
    for (i = 0; i < row; i += incr) {
        for (j = 0; j < col; j += incr) {
            for (k = 0; k < row; k += incr) {
                for (x = i; x < std::min( i + incr, row ); x++) {
                    for (y = j; y < std::min( j + incr, col ); y++) {
                        for (z = k; z < std::min( k + incr, row ); z++) {
                            res[ x * col + y ] += a[ x * col + z ] * b[ z * col + y ];
                        }
                    }
                }
            }
        }
    }
    double end = omp_get_wtime();
    std::cout << end - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] temp;
    delete[] res;
    return 0;
}
Results:
I will now present the results of compiling the three approaches on a Linux machine with an Intel dual core CPU and 4 GB of DRAM. First come the results of compiling without optimizations, followed by the results with optimizations. For each run, the corresponding Valgrind cachegrind output is attached. For those unfamiliar with the tool, from http://www.valgrind.org/docs/manual/cg-manual.html :
"First the I cache accesses are summarized, giving the number of fetches (this is the number of instructions executed, which can be useful to know in its own right), the number of I1 misses, and the number of LL instruction (LLi) misses."
The "naive" approach:
$ g++ -fopenmp parallel.cpp -o parallel
$ ./parallel
16.5305
$ valgrind --tool=cachegrind ./parallel
==12558== I refs: 39,054,659,801
==12558== I1 misses: 1,758
==12558== LLi misses: 1,738
==12558== I1 miss rate: 0.00%
==12558== LLi miss rate: 0.00%
==12558==
==12558== D refs: 20,028,690,508 (18,024,512,540 rd + 2,004,177,968 wr)
==12558== D1 misses: 1,064,759,236 ( 1,064,571,085 rd + 188,151 wr)
==12558== LLd misses: 62,877,799 ( 62,689,774 rd + 188,025 wr)
==12558== D1 miss rate: 5.3% ( 5.9% + 0.0% )
==12558== LLd miss rate: 0.3% ( 0.3% + 0.0% )
==12558==
==12558== LL refs: 1,064,760,994 ( 1,064,572,843 rd + 188,151 wr)
==12558== LL misses: 62,879,537 ( 62,691,512 rd + 188,025 wr)
==12558== LL miss rate: 0.1% ( 0.1% + 0.0% )
The "transposed" approach:
$ g++ -fopenmp parallel.cpp -o parallel
$ ./parallel
9.40104
$ valgrind --tool=cachegrind ./parallel
==13319== I refs: 39,054,659,804
==13319== I1 misses: 1,759
==13319== LLi misses: 1,739
==13319== I1 miss rate: 0.00%
==13319== LLi miss rate: 0.00%
==13319==
==13319== D refs: 20,028,690,508 (18,024,512,539 rd + 2,004,177,969 wr)
==13319== D1 misses: 63,823,736 ( 63,635,585 rd + 188,151 wr)
==13319== LLd misses: 62,877,799 ( 62,689,774 rd + 188,025 wr)
==13319== D1 miss rate: 0.3% ( 0.3% + 0.0% )
==13319== LLd miss rate: 0.3% ( 0.3% + 0.0% )
==13319==
==13319== LL refs: 63,825,495 ( 63,637,344 rd + 188,151 wr)
==13319== LL misses: 62,879,538 ( 62,691,513 rd + 188,025 wr)
==13319== LL miss rate: 0.1% ( 0.1% + 0.0% )
The "tiled" approach:
$ g++ -fopenmp parallel.cpp -o parallel
$ ./parallel
13.4941
$ valgrind --tool=cachegrind ./parallel
==13872== I refs: 62,967,276,691
==13872== I1 misses: 1,768
==13872== LLi misses: 1,747
==13872== I1 miss rate: 0.00%
==13872== LLi miss rate: 0.00%
==13872==
==13872== D refs: 35,593,733,973 (28,411,716,118 rd + 7,182,017,855 wr)
==13872== D1 misses: 6,724,892 ( 6,536,740 rd + 188,152 wr)
==13872== LLd misses: 1,377,799 ( 1,189,774 rd + 188,025 wr)
==13872== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==13872== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==13872==
==13872== LL refs: 6,726,660 ( 6,538,508 rd + 188,152 wr)
==13872== LL misses: 1,379,546 ( 1,191,521 rd + 188,025 wr)
==13872== LL miss rate: 0.0% ( 0.0% + 0.0% )
Note that the number of refs has increased considerably.
Compiled with optimizations:
The "naive" approach:
$ g++ -fopenmp -O3 parallel.cpp -o parallel
$ ./parallel
4.87246
$ valgrind --tool=cachegrind ./parallel
==11227== I refs: 9,021,661,364
==11227== I1 misses: 1,756
==11227== LLi misses: 1,734
==11227== I1 miss rate: 0.00%
==11227== LLi miss rate: 0.00%
==11227==
==11227== D refs: 4,008,681,781 (3,004,505,045 rd + 1,004,176,736 wr)
==11227== D1 misses: 1,065,760,232 (1,064,572,078 rd + 1,188,154 wr)
==11227== LLd misses: 62,877,794 ( 62,689,768 rd + 188,026 wr)
==11227== D1 miss rate: 26.5% ( 35.4% + 0.1% )
==11227== LLd miss rate: 1.5% ( 2.0% + 0.0% )
==11227==
==11227== LL refs: 1,065,761,988 (1,064,573,834 rd + 1,188,154 wr)
==11227== LL misses: 62,879,528 ( 62,691,502 rd + 188,026 wr)
==11227== LL miss rate: 0.4% ( 0.5% + 0.0% )
The "transposed" approach:
$ g++ -fopenmp -O3 parallel.cpp -o parallel
$ ./parallel
2.02121
$ valgrind --tool=cachegrind ./parallel
==12076== I refs: 8,020,662,317
==12076== I1 misses: 1,753
==12076== LLi misses: 1,731
==12076== I1 miss rate: 0.00%
==12076== LLi miss rate: 0.00%
==12076==
==12076== D refs: 4,006,682,757 (3,002,508,030 rd + 1,004,174,727 wr)
==12076== D1 misses: 63,823,733 ( 63,635,579 rd + 188,154 wr)
==12076== LLd misses: 62,877,795 ( 62,689,769 rd + 188,026 wr)
==12076== D1 miss rate: 1.5% ( 2.1% + 0.0% )
==12076== LLd miss rate: 1.5% ( 2.0% + 0.0% )
==12076==
==12076== LL refs: 63,825,486 ( 63,637,332 rd + 188,154 wr)
==12076== LL misses: 62,879,526 ( 62,691,500 rd + 188,026 wr)
==12076== LL miss rate: 0.5% ( 0.5% + 0.0% )
The "tiled" approach:
$ g++ -fopenmp -O3 parallel.cpp -o parallel
$ ./parallel
1.78285
$ valgrind --tool=cachegrind ./parallel
==14365== I refs: 8,192,794,606
==14365== I1 misses: 1,753
==14365== LLi misses: 1,732
==14365== I1 miss rate: 0.00%
==14365== LLi miss rate: 0.00%
==14365==
==14365== D refs: 4,102,512,450 (3,083,324,326 rd + 1,019,188,124 wr)
==14365== D1 misses: 6,597,429 ( 6,409,277 rd + 188,152 wr)
==14365== LLd misses: 1,377,797 ( 1,189,770 rd + 188,027 wr)
==14365== D1 miss rate: 0.1% ( 0.2% + 0.0% )
==14365== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==14365==
==14365== LL refs: 6,599,182 ( 6,411,030 rd + 188,152 wr)
==14365== LL misses: 1,379,529 ( 1,191,502 rd + 188,027 wr)
==14365== LL miss rate: 0.0% ( 0.0% + 0.0% )
My questions are: why does the unoptimized "tiled" approach perform comparatively worse than the others? Is there a problem with my implementation of the tiling algorithm?
I mean, with optimizations it is clear enough: while the cache misses of the two builds are roughly the same, the refs (number of fetches) have dropped from 60+ billion to about 8 billion, so it is no surprise that it is faster now. What is not obvious to me is where those extra 20+ billion instructions in the unoptimized tiled version come from. It should be the fastest of the three unoptimized implementations, right?
Well, thanks a lot.
Best wishes,
Vincent
The tiled approach is more complex code-wise and therefore incurs extra overhead. With optimized code this is of course not a big deal, since the matrices are large enough that proper cache usage yields a much bigger benefit.
Now look at the unoptimized code:

for (z = k; z < std::min( k + incr, row ); z++) {
                -------------------------

These computations are executed in the innermost loop. That is a perfect performance killer. Moving them to an outer scope (e.g. as soon as k becomes available) makes a big difference. Of course the optimizer can do this for you, but only if you ask it to optimize. That is why measuring unoptimized code is usually worthless.
0m16.186s "tiled" approach
0m11.543s "tiled" approach with the hand optimization
0m10.919s "transposed" approach
These are the timings I measured on my machine. Looks good enough to me.