C++: Performance analysis of tiled matrix multiplication with valgrind

Keywords: performance, valgrind, tiling, C++. Updated: 2023-10-16

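I am trying to figure out how to implement loop tiling correctly. My code is based on http://people.freebsd.org/~lstewart/articles/cpumemory.pdf. In theory, I should get a performance gain from tiled matrix multiplication, but I don't always see one. I will also present the output of valgrind's cachegrind tool, which I think is interesting.

The code below contains all three approaches; the ones not being measured are commented out.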

// cpp program, matrix multiplication
// returns the elapsed time of the loop iterations measured by omp_get_wtime()
#include <iostream>
#include <algorithm>            // std::min
#include <omp.h>
int main(int argc, char *argv[])
{
    // matrix dimensions
    const int row = 1000;
    const int col = 1000;
    // matrix stored as an array of size 1000*1000
    // temp will be b transposed, recommendation from the article mentioned above
    // res is of double precision, I ran into errors displaying the data when using a different data type
    int *a = new int[row*col];
    int *b = new int[row*col];
    int *temp = new int[row*col];
    double *res = new double[row*col];
    // initialization
    for(int i = 0; i < row; ++i){
        for (int j = 0; j < col; ++j) {
            a[i*col+j] = i*col+j;
            b[i*col+j] = i*col+j;
        }
    }
    // transposition of b
    for(int i = 0; i < row; ++i){
        for (int j = 0; j < col; ++j) {
            temp[i*col+j] = b[j*col+i];
        }
    }

    int i,j,k,x,y,z;
// "naive" matrix multiplication
    // double start = omp_get_wtime();
    // for (i = 0; i < row; ++i) {
    //     for (j = 0; j < col; ++j) {
    //         for (k = 0; k < row; ++k) {
    //             res[ i * col + j ] +=  a[ i * col + k ] * b[ k * col + j ];      
    //         }
    //     } 
    // }
    // double end = omp_get_wtime();
    // std::cout << end-start << std::endl;

// "transposed" matrix multiplication
    // for (i = 0; i < row; ++i) {
    //     for (j = 0; j < col; ++j) {
    //         for (k = 0; k < row; ++k) {
    //             res[ i * col + j ] +=  a[ i * col + k ] * temp[ k  + j * col  ];
    //         }
    //     }
    // }
// tiled (parallel) matrix multiplication
// from /sys/devices/system/cpu/cpu0/cache/index0
// cat coherency_line_size returns 64;
// thus I will use 64 as the blocking size;
    int incr = 64;
    for (i = 0; i < row; i += incr) {
         for (j = 0; j < col; j += incr) {
             res[i*col+j] = 0.0;
             for (k = 0; k < row; k += incr) {
                 for (x = i; x < std::min( i + incr, row ); x++) {
                     for (y = j; y < std::min( j + incr, col ); y++) {
                         for (z = k; z < std::min( k + incr, row ); z++) {
                             res[ x * col + y ] +=  a[ x * col + z ] * b[ z * col  + y  ];
                         }
                     } 
                 }
             }
         }
     }
     return 0;
}

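Results:

Here are the results of compiling the three approaches on a Linux machine with a dual-core Intel CPU and 4 GB of DRAM. First come the results without compiler optimization, then the results with optimization. Each result is followed by the corresponding valgrind cachegrind output. For those unfamiliar with the tool, from http://www.valgrind.org/docs/manual/cg-manual.html:

"Cache accesses for instruction fetches are summarised first, giving the number of fetches made (this is the number of instructions executed, which can be useful to know in its own right), the number of I1 misses, and the number of LL instruction (LLi) misses."

The "naive" approach:

$ g++ -fopenmp parallel.cpp -o parallel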
$ ./parallel
16.5305    
$ valgrind --tool=cachegrind ./parallel
==12558== I   refs:      39,054,659,801
==12558== I1  misses:             1,758
==12558== LLi misses:             1,738
==12558== I1  miss rate:           0.00%
==12558== LLi miss rate:           0.00%
==12558== 
==12558== D   refs:      20,028,690,508  (18,024,512,540 rd   + 2,004,177,968 wr)
==12558== D1  misses:     1,064,759,236  ( 1,064,571,085 rd   +       188,151 wr)
==12558== LLd misses:        62,877,799  (    62,689,774 rd   +       188,025 wr)
==12558== D1  miss rate:            5.3% (           5.9%     +           0.0%  )
==12558== LLd miss rate:            0.3% (           0.3%     +           0.0%  )
==12558== 
==12558== LL refs:        1,064,760,994  ( 1,064,572,843 rd   +       188,151 wr)
==12558== LL misses:         62,879,537  (    62,691,512 rd   +       188,025 wr)
==12558== LL miss rate:             0.1% (           0.1%     +           0.0%  )

The "transposed" approach:

$ g++ -fopenmp parallel.cpp -o parallel
$ ./parallel
9.40104 
$ valgrind --tool=cachegrind ./parallel
==13319== I   refs:      39,054,659,804
==13319== I1  misses:             1,759
==13319== LLi misses:             1,739
==13319== I1  miss rate:           0.00%
==13319== LLi miss rate:           0.00%
==13319== 
==13319== D   refs:      20,028,690,508  (18,024,512,539 rd   + 2,004,177,969 wr)
==13319== D1  misses:        63,823,736  (    63,635,585 rd   +       188,151 wr)
==13319== LLd misses:        62,877,799  (    62,689,774 rd   +       188,025 wr)
==13319== D1  miss rate:            0.3% (           0.3%     +           0.0%  )
==13319== LLd miss rate:            0.3% (           0.3%     +           0.0%  )
==13319== 
==13319== LL refs:           63,825,495  (    63,637,344 rd   +       188,151 wr)
==13319== LL misses:         62,879,538  (    62,691,513 rd   +       188,025 wr)
==13319== LL miss rate:             0.1% (           0.1%     +           0.0%  )

The "tiled" approach:

$ g++ -fopenmp parallel.cpp -o parallel
$ ./parallel
13.4941
$ valgrind --tool=cachegrind ./parallel
==13872== I   refs:      62,967,276,691
==13872== I1  misses:             1,768
==13872== LLi misses:             1,747
==13872== I1  miss rate:           0.00%
==13872== LLi miss rate:           0.00%
==13872== 
==13872== D   refs:      35,593,733,973  (28,411,716,118 rd   + 7,182,017,855 wr)
==13872== D1  misses:         6,724,892  (     6,536,740 rd   +       188,152 wr)
==13872== LLd misses:         1,377,799  (     1,189,774 rd   +       188,025 wr)
==13872== D1  miss rate:            0.0% (           0.0%     +           0.0%  )
==13872== LLd miss rate:            0.0% (           0.0%     +           0.0%  )
==13872== 
==13872== LL refs:            6,726,660  (     6,538,508 rd   +       188,152 wr)
==13872== LL misses:          1,379,546  (     1,191,521 rd   +       188,025 wr)
==13872== LL miss rate:             0.0% (           0.0%     +           0.0%  )

Note that the number of refs has increased significantly.

Compiled with optimization:

The "naive" approach:

$ g++ -fopenmp -O3 parallel.cpp -o parallel
$ ./parallel
4.87246
$ valgrind --tool=cachegrind ./parallel
==11227== I   refs:      9,021,661,364
==11227== I1  misses:            1,756
==11227== LLi misses:            1,734
==11227== I1  miss rate:          0.00%
==11227== LLi miss rate:          0.00%
==11227== 
==11227== D   refs:      4,008,681,781  (3,004,505,045 rd   + 1,004,176,736 wr)
==11227== D1  misses:    1,065,760,232  (1,064,572,078 rd   +     1,188,154 wr)
==11227== LLd misses:       62,877,794  (   62,689,768 rd   +       188,026 wr)
==11227== D1  miss rate:          26.5% (         35.4%     +           0.1%  )
==11227== LLd miss rate:           1.5% (          2.0%     +           0.0%  )
==11227== 
==11227== LL refs:       1,065,761,988  (1,064,573,834 rd   +     1,188,154 wr)
==11227== LL misses:        62,879,528  (   62,691,502 rd   +       188,026 wr)
==11227== LL miss rate:            0.4% (          0.5%     +           0.0%  )

The "transposed" approach:

$ g++ -fopenmp -O3 parallel.cpp -o parallel
$ ./parallel 
2.02121 
$ valgrind --tool=cachegrind ./parallel
==12076== I   refs:      8,020,662,317
==12076== I1  misses:            1,753
==12076== LLi misses:            1,731
==12076== I1  miss rate:          0.00%
==12076== LLi miss rate:          0.00%
==12076== 
==12076== D   refs:      4,006,682,757  (3,002,508,030 rd   + 1,004,174,727 wr)
==12076== D1  misses:       63,823,733  (   63,635,579 rd   +       188,154 wr)
==12076== LLd misses:       62,877,795  (   62,689,769 rd   +       188,026 wr)
==12076== D1  miss rate:           1.5% (          2.1%     +           0.0%  )
==12076== LLd miss rate:           1.5% (          2.0%     +           0.0%  )
==12076== 
==12076== LL refs:          63,825,486  (   63,637,332 rd   +       188,154 wr)
==12076== LL misses:        62,879,526  (   62,691,500 rd   +       188,026 wr)
==12076== LL miss rate:            0.5% (          0.5%     +           0.0%  )

The "tiled" approach:

$ g++ -fopenmp -O3 parallel.cpp -o parallel
$ ./parallel 
1.78285   
$ valgrind --tool=cachegrind ./parallel
==14365== I   refs:      8,192,794,606
==14365== I1  misses:            1,753
==14365== LLi misses:            1,732
==14365== I1  miss rate:          0.00%
==14365== LLi miss rate:          0.00%
==14365== 
==14365== D   refs:      4,102,512,450  (3,083,324,326 rd   + 1,019,188,124 wr)
==14365== D1  misses:        6,597,429  (    6,409,277 rd   +       188,152 wr)
==14365== LLd misses:        1,377,797  (    1,189,770 rd   +       188,027 wr)
==14365== D1  miss rate:           0.1% (          0.2%     +           0.0%  )
==14365== LLd miss rate:           0.0% (          0.0%     +           0.0%  )
==14365== 
==14365== LL refs:           6,599,182  (    6,411,030 rd   +       188,152 wr)
==14365== LL misses:         1,379,529  (    1,191,502 rd   +       188,027 wr)
==14365== LL miss rate:            0.0% (          0.0%     +           0.0%  )

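My question is: why does the unoptimized "tiled" approach perform relatively worse than the optimized one? Is there something wrong with my tiling implementation?

I mean, it is obvious that while the cache misses are roughly the same for both, the I refs (number of instruction fetches) dropped from 60+ billion to 8 billion, so it is no surprise that the optimized build is faster. What is not obvious to me is where the additional 20+ billion instructions in the unoptimized tiled version come from. It should be the fastest of the three unoptimized implementations, right?

Thank you very much.

BW

Vincent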

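The tiled approach is more complex in terms of code, so it incurs extra overhead. With optimized code this is not a big deal, because the matrices are large enough that better cache usage more than pays for it.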
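To put a rough number on that overhead, the I refs totals reported by cachegrind in the question can be divided by the number of innermost-loop iterations (1000 * 1000 * 1000). The helper below is only a sketch of that arithmetic; the name `instructions_per_iteration` is made up for illustration, and the totals also include the initialization and transposition loops, which are small by comparison.

```cpp
#include <cstdint>

// Hypothetical helper: average number of instructions executed per
// innermost-loop iteration, given a cachegrind "I refs" total and the
// matrix dimension n (the multiplication performs n*n*n multiply-adds).
double instructions_per_iteration(std::int64_t i_refs, std::int64_t n) {
    return static_cast<double>(i_refs) / static_cast<double>(n * n * n);
}
```

With the unoptimized figures from the question, `instructions_per_iteration(39054659801, 1000)` gives about 39 instructions per iteration for the naive version and `instructions_per_iteration(62967276691, 1000)` gives about 63 for the tiled version, so the tiled loop nest spends roughly 24 extra instructions on every inner iteration.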

Now look at the unoptimized code:

                     for (z = k; z < std::min( k + incr, row ); z++) {
                                     -------------------------

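These calculations are executed inside the tight loop, on every evaluation of the loop condition. That is a perfect performance killer.

Moving them to an outer scope (for example, computing the bound as soon as k is available) makes a big difference. The optimizer can do this for you, of course, but only if you ask it to optimize. This is why measuring unoptimized code is usually worthless.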
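The hoisting described above might look something like the following. This is a minimal sketch rather than the exact code that was measured: the function name `tiled_multiply`, the use of `std::vector`, the restriction to square n x n matrices, and the cast to double before multiplying (to avoid int overflow for large entries) are all assumptions made for the example.

```cpp
#include <algorithm>  // std::min
#include <cstddef>    // std::size_t
#include <vector>

// Tiled multiplication of two n x n row-major matrices: res = a * b.
// Each tile bound is computed once, as soon as its tile index is known,
// instead of being recomputed in every loop condition.
std::vector<double> tiled_multiply(const std::vector<int>& a,
                                   const std::vector<int>& b,
                                   int n, int incr) {
    // Every element of the result starts at zero.
    std::vector<double> res(static_cast<std::size_t>(n) * n, 0.0);
    for (int i = 0; i < n; i += incr) {
        const int x_end = std::min(i + incr, n);  // hoisted row bound
        for (int j = 0; j < n; j += incr) {
            const int y_end = std::min(j + incr, n);  // hoisted column bound
            for (int k = 0; k < n; k += incr) {
                const int z_end = std::min(k + incr, n);  // hoisted inner bound
                for (int x = i; x < x_end; ++x) {
                    for (int y = j; y < y_end; ++y) {
                        for (int z = k; z < z_end; ++z) {
                            // Assumption: multiply in double to avoid int overflow.
                            res[x * n + y] +=
                                static_cast<double>(a[x * n + z]) * b[z * n + y];
                        }
                    }
                }
            }
        }
    }
    return res;
}
```

Calling `tiled_multiply(a, b, 1000, 64)` would correspond to the dimensions and blocking size used in the question. In an unoptimized build the three `std::min` calls now run once per tile instead of once per loop-condition check.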

0m16.186s  "tiled" approach
0m11.543s  "tiled" approach with the hand optimization
0m10.919s  "transposed" approach

This is what I measured on my machine. That looks good enough to me.