如何解释缓存未命中的cachegrind输出
How do you interpret cachegrind output for caching misses?
出于好奇,我对矩阵乘法的几个不同版本进行了编码,并对其进行了cachegrind。在下面的结果中,我想知道L1、L2、L3的哪些部分是未命中和引用,这一切到底意味着什么?下面是我的矩阵乘法代码,以防有人需要。
#define SLOWEST
==6933== Cachegrind, a cache and branch-prediction profiler
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==6933== Command: ./a.out 500
==6933==
--6933-- warning: L3 cache found, using its data for the LL simulation.
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 60.7487 seconds.
==6933==
==6933== I refs: 6,039,791,314
==6933== I1 misses: 1,611
==6933== LLi misses: 1,519
==6933== I1 miss rate: 0.00%
==6933== LLi miss rate: 0.00%
==6933==
==6933== D refs: 2,892,704,678 (2,763,005,485 rd + 129,699,193 wr)
==6933== D1 misses: 136,223,560 ( 136,174,705 rd + 48,855 wr)
==6933== LLd misses: 53,675 ( 5,247 rd + 48,428 wr)
==6933== D1 miss rate: 4.7% ( 4.9% + 0.0% )
==6933== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==6933==
==6933== LL refs: 136,225,171 ( 136,176,316 rd + 48,855 wr)
==6933== LL misses: 55,194 ( 6,766 rd + 48,428 wr)
==6933== LL miss rate: 0.0% ( 0.0% + 0.0% )
#define SLOWER
==8463== Cachegrind, a cache and branch-prediction profiler
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8463== Command: ./a.out 500
==8463==
--8463-- warning: L3 cache found, using its data for the LL simulation.
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 49.7397 seconds.
==8463==
==8463== I refs: 4,537,213,120
==8463== I1 misses: 1,571
==8463== LLi misses: 1,487
==8463== I1 miss rate: 0.00%
==8463== LLi miss rate: 0.00%
==8463==
==8463== D refs: 2,891,485,608 (2,761,862,312 rd + 129,623,296 wr)
==8463== D1 misses: 59,961,522 ( 59,913,256 rd + 48,266 wr)
==8463== LLd misses: 53,113 ( 5,246 rd + 47,867 wr)
==8463== D1 miss rate: 2.0% ( 2.1% + 0.0% )
==8463== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==8463==
==8463== LL refs: 59,963,093 ( 59,914,827 rd + 48,266 wr)
==8463== LL misses: 54,600 ( 6,733 rd + 47,867 wr)
==8463== LL miss rate: 0.0% ( 0.0% + 0.0% )
#define SLOW
==9174== Cachegrind, a cache and branch-prediction profiler
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9174== Command: ./a.out 500
==9174==
--9174-- warning: L3 cache found, using its data for the LL simulation.
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 35.8901 seconds.
==9174==
==9174== I refs: 3,039,713,059
==9174== I1 misses: 1,570
==9174== LLi misses: 1,486
==9174== I1 miss rate: 0.00%
==9174== LLi miss rate: 0.00%
==9174==
==9174== D refs: 1,893,235,586 (1,763,112,301 rd + 130,123,285 wr)
==9174== D1 misses: 63,285,950 ( 62,987,684 rd + 298,266 wr)
==9174== LLd misses: 53,113 ( 5,246 rd + 47,867 wr)
==9174== D1 miss rate: 3.3% ( 3.5% + 0.2% )
==9174== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==9174==
==9174== LL refs: 63,287,520 ( 62,989,254 rd + 298,266 wr)
==9174== LL misses: 54,599 ( 6,732 rd + 47,867 wr)
==9174== LL miss rate: 0.0% ( 0.0% + 0.0% )
#define MEDIUM
==7838== Cachegrind, a cache and branch-prediction profiler
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==7838== Command: ./a.out 500
==7838==
--7838-- warning: L3 cache found, using its data for the LL simulation.
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 23.4097 seconds.
==7838==
==7838== I refs: 2,548,967,151
==7838== I1 misses: 1,610
==7838== LLi misses: 1,522
==7838== I1 miss rate: 0.00%
==7838== LLi miss rate: 0.00%
==7838==
==7838== D refs: 1,399,237,303 (1,267,363,440 rd + 131,873,863 wr)
==7838== D1 misses: 592,807 ( 293,091 rd + 299,716 wr)
==7838== LLd misses: 53,147 ( 5,248 rd + 47,899 wr)
==7838== D1 miss rate: 0.0% ( 0.0% + 0.2% )
==7838== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==7838==
==7838== LL refs: 594,417 ( 294,701 rd + 299,716 wr)
==7838== LL misses: 54,669 ( 6,770 rd + 47,899 wr)
==7838== LL miss rate: 0.0% ( 0.0% + 0.0% )
#define MEDIUMISH
==8438== Cachegrind, a cache and branch-prediction profiler
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8438== Command: ./a.out 500
==8438==
--8438-- warning: L3 cache found, using its data for the LL simulation.
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 24.0327 seconds.
==8438==
==8438== I refs: 2,550,211,553
==8438== I1 misses: 1,576
==8438== LLi misses: 1,488
==8438== I1 miss rate: 0.00%
==8438== LLi miss rate: 0.00%
==8438==
==8438== D refs: 1,400,107,343 (1,267,610,303 rd + 132,497,040 wr)
==8438== D1 misses: 339,977 ( 42,583 rd + 297,394 wr)
==8438== LLd misses: 53,114 ( 5,248 rd + 47,866 wr)
==8438== D1 miss rate: 0.0% ( 0.0% + 0.2% )
==8438== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==8438==
==8438== LL refs: 341,553 ( 44,159 rd + 297,394 wr)
==8438== LL misses: 54,602 ( 6,736 rd + 47,866 wr)
==8438== LL miss rate: 0.0% ( 0.0% + 0.0% )
矩阵乘法代码。
#if defined(SLOWEST)
void multiply (float **A, float **B, float **out, int size) {
for (int row=0;row<size;row++)
for (int col=0;col<size;col++)
for (int in=0;in<size;in++)
out[row][col] += A[row][in] * B[in][col];
}
// Takes in 1-D arrays, same as before.
#elif defined(SLOWER)
void multiply (float *A, float *B, float *out, int size) {
for (int row=0;row<size;row++)
for (int col=0;col<size;col++)
for (int in=0;in<size;in++)
out[row * size + col] += A[row * size + in] * B[in * size + col];
}
// Flips first and second loops
#elif defined(SLOW)
void multiply (float *A, float *B, float *out, int size) {
for (int col=0;col<size;col++)
for (int row=0;row<size;row++) {
float curr = 0; // prevents from calculating position each time through
for (int in=0;in<size;in++)
curr += A[row * size + in] * B[in *size + col];
out[row * size + col] = curr;
}
}
#elif defined(MEDIUM)
// Keeps it organized for future codes.
float dotProduct(float *A, float *B, int size) {
float curr = 0;
for (int i=0;i<size;i++)
curr += A[i] * B[i];
return curr;
}
void multiply (float *A, float *B, float *out, int size) {
float *temp = new float[size];
for (int col=0;col<size;col++) {
for (int i=0;i<size;i++) // stores column into sequential array
temp[i] = B[i * size + col];
for (int row=0;row<size;row++)
out[row * size + col] = dotProduct(&A[row], temp, size); // uses function above for dot product.
}
delete[] temp;
}
#elif defined(MEDIUMISH)
float dotProduct(float *A, float *B, int size) {
float curr = 0;
for (int i=0;i<size;i++)
curr += A[i] * B[i];
return curr;
}
void multiply (float *A, float *B, float *out, int size) {
for (int i=0;i<size-1;i++)
for (int j=i+1;j<size;j++)
std::swap(B[i * size + j], B[j * size + i]);
for (int col=0;col<size;col++)
for (int row=0;row<size;row++)
out[row * size + col] = dotProduct(&A[row], &B[row], size); // uses function above for dot product.
}
#elif defined(FAST)
#elif defined(FASTER)
#endif
根据文档cachegrind仅模拟第一级和最后一级缓存:
Cachegrind模拟程序如何与机器的缓存交互层次结构和(可选的)分支预测器。它模拟一台机器具有独立的第一级指令和数据高速缓存(I1和D1),由统一的二级高速缓存(L2)支持。这与许多现代机器的配置。
然而,一些现代机器有三到四级缓存。对于这些机器(在Cachegrind可以自动检测缓存配置)Cachegrind模拟第一级最后一级缓存。之所以选择此选项,是因为最后一个级别缓存对运行时的影响最大,因为它屏蔽了对main的访问记忆力此外,L1缓存通常具有较低的关联性,因此模拟它们可以检测代码与该缓存(例如,按列遍历具有行长度的矩阵是2的幂)。
这意味着,在您的情况下,您无法获得L2信息,只能获得L1和L3信息。
cachegrind输出的第一部分报告有关L1指令缓存的信息。在您的所有示例中,一级指令缓存未命中的数量是微不足道的,未命中率始终为0%。这意味着您的所有程序都可以放入L1指令缓存中。
输出的第二部分报告关于L1和LL(最后一级缓存,在您的情况下是L3)数据缓存的信息。使用D1未命中率:信息,您应该了解矩阵乘法算法的哪个版本是"最高效的"
cachegrind输出的最后一部分汇总了指令和数据的LL(最后一级缓存,在您的情况下为L3)信息。因此,它给出了内存访问的次数和缓存所服务的内存请求的百分比。
- 递归函数计算序列中的平方和(并输出过程)
- 如何使用 < 和 > 命令获取 c++ 中的输入和输出?
- 请解释"函数1(p1,p2,p3);"的输出
- C++:将控制台输出存储在宏中更好吗
- 创建一个函数以在输入为负数或零时输出字符串.第一次执行用户定义的函数
- 如何在OMNET++中指定与命令行参数组合的输出文件名
- 为什么我的代码在输出中增加了93天
- 如何从void函数输出字符串
- 输入到文件并输出到另一个文件,并将流文件传递给函数
- AES加密到解密未正确输出
- 如何将c++程序的一些输出传递给shell,以便在shell中使用
- 使用C++程序合并排序没有得到正确的输出
- 为什么我不能在不创建字符串变量的情况下使用函数的字符串输出
- C++格式化输出问题
- 将值从二维数组输出到文本文件
- 集合上的输出迭代器:assign和increment迭代器
- 在while循环中输入带有std::cin的字符串后,控制台会输出大量胡言乱语
- 为什么rk4.do_step不输出C++中的初始条件
- 输出没有重复元素的动态数组(收缩数组)C++
- 如何解释缓存未命中的cachegrind输出