确定一个码块需要多少个时钟周期
Determining how many clock cycles a codeblock need
是否有一种工具或方法可以告诉我代码块使用了多少时钟周期?手工调试和计数对于更庞大的代码块来说是一件痛苦的事情。
在x86上,英特尔的IACA(英特尔体系结构代码分析器)是我所知道的唯一一个静态分析器。它假设零缓存未命中和各种其他简化,但有些有用。
我认为它还假设除了最后一个分支之外的所有分支都没有被获取,所以它可能对具有已获取分支的循环体没有用处。
IACA的数据中也有一些错误,例如它认为shld
在Sandybridge上运行缓慢。它确实知道一些不明显的事情,比如SnB系列CPU不能微熔丝2寄存器寻址模式。
自从Haswell更新后,它基本上被放弃了。Skylake可以在比Haswell更多的执行端口上运行一些指令(请参阅Agner Fog的指令表),但管道足够相似,结果应该相当有用。另请参阅x86标记wiki上的其他链接,包括英特尔的优化手册,以帮助您理解输出。
我喜欢使用这个iaca.sh
包装脚本使-64
成为默认值(我可以用-32
覆盖它)。我忘记了我写了多少(可能只是结尾的if (($# >= 1))
位),也忘记了LD_LIBRARY_PATH部分是从哪里来的。
iaca.sh
:
#!/bin/bash
myname=$(realpath "$0")
mypath=$(dirname "$myname")
ld_lib="$LD_LIBRARY_PATH"
app_loc="../lib"
if [ "$LD_LIBRARY_PATH" = "" ]
then
export LD_LIBRARY_PATH="$mypath/$app_loc"
else
export LD_LIBRARY_PATH="$mypath/$app_loc:$LD_LIBRARY_PATH"
fi
if (($# >= 1));then
exec "$mypath/iaca" -64 "$@"
else
exec "$mypath/iaca" # there is no -help, just run with no args for help output
fi
示例:就地前缀和,来自英特尔cpu上的SIMD前缀和:
#include <immintrin.h>
#ifdef IACA_MARKS_OFF
#define IACA_START
#define IACA_END
#else
#include <iacaMarks.h>
#endif
// In-place rewrite an array of values into an array of prefix sums.
// This makes the code simpler, and minimizes cache effects.
int prefix_sum_sse(int data[], int n)
{
// const int elemsz = sizeof(data[0]);
#define elemsz sizeof(data[0]) // clang-3.5 doesn't allow const int foo = ... as an imm8 arg to intrinsics
__m128i *datavec = (__m128i*)data;
const int vec_elems = sizeof(*datavec)/elemsz;
// to use this for int8/16_t, you still need to change the add_epi32, and the shuffle
const __m128i *endp = (__m128i*) (data + n - 2*vec_elems); // pointer to last full vector we can load
__m128i carry = _mm_setzero_si128();
for(; datavec <= endp ; datavec += 2) {
IACA_START
__m128i x0 = _mm_load_si128(datavec + 0);
__m128i x1 = _mm_load_si128(datavec + 1); // unroll / pipeline by 1
// __m128i x2 = _mm_load_si128(datavec + 2);
// __m128i x3;
x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, elemsz));
x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, elemsz));
x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, 2*elemsz));
x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, 2*elemsz));
// more shifting if vec_elems is larger
x0 = _mm_add_epi32(x0, carry); // this has to go after the byte-shifts, to avoid double-counting the carry.
_mm_store_si128(datavec +0, x0); // store first to allow destructive shuffle (e.g. non-avx shufps for FP or pshufb for narrow integers)
x1 = _mm_add_epi32(_mm_shuffle_epi32(x0, _MM_SHUFFLE(3,3,3,3)), x1);
_mm_store_si128(datavec +1, x1);
carry = _mm_shuffle_epi32(x1, _MM_SHUFFLE(3,3,3,3)); // broadcast the high element for next vector
}
// FIXME: scalar loop to handle the last few elements
IACA_END
return data[n-1];
#undef elemsz
}
$ gcc -I/opt/iaca-2.1/include -Wall -O3 -c prefix-sum.c -march=nehalem -mtune=haswell
$ iaca.sh prefix-sum.o
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - prefix-sum.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 6.40 Cycles Throughput Bottleneck: Port5
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 5.7 | 1.4 1.0 | 1.4 1.0 | 2.0 | 6.3 | 1.0 | 1.3 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | movdqa xmm3, xmmword ptr [rax]
| 1 | 1.0 | | | | | | | | | add rax, 0x20
| 1 | | | | 1.0 1.0 | | | | | | movdqa xmm0, xmmword ptr [rax-0x10]
| 0* | | | | | | | | | | movdqa xmm1, xmm3
| 1 | | | | | | 1.0 | | | CP | pslldq xmm1, 0x4
| 1 | | 1.0 | | | | | | | | paddd xmm1, xmm3
| 0* | | | | | | | | | | movdqa xmm3, xmm0
| 1 | | | | | | 1.0 | | | CP | pslldq xmm3, 0x4
| 0* | | | | | | | | | | movdqa xmm4, xmm1
| 1 | | 1.0 | | | | | | | | paddd xmm3, xmm0
| 1 | | | | | | 1.0 | | | CP | pslldq xmm4, 0x8
| 0* | | | | | | | | | | movdqa xmm0, xmm3
| 1 | | 1.0 | | | | | | | | paddd xmm1, xmm4
| 1 | | | | | | 1.0 | | | CP | pslldq xmm0, 0x8
| 1 | | 1.0 | | | | | | | | paddd xmm1, xmm2
| 1 | | 0.8 | | | | 0.2 | | | CP | paddd xmm0, xmm3
| 2^ | | | | | 1.0 | | | 1.0 | | movaps xmmword ptr [rax-0x20], xmm1
| 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm1, 0xff
| 1 | | 0.9 | | | | 0.1 | | | CP | paddd xmm0, xmm1
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | | movaps xmmword ptr [rax-0x10], xmm0
| 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0xff
| 0* | | | | | | | | | | movdqa xmm2, xmm1
| 1 | | | | | | | 1.0 | | | cmp rdx, rax
| 0F | | | | | | | | | | jnb 0xffffffffffffff94
Total Num Of Uops: 20
请注意,总uop计数是而不是融合域uop,这对前端、ROB和4面发布/引退宽度很重要。它计算未融合的域uop,这对执行单元(和调度器)很重要。不过,这有点傻,因为在未融合的领域,最重要的是uop需要哪个端口,而不是有多少端口。
这不是最好的例子,因为它在Haswell的shuffle端口上被严重堵塞。不过,它确实展示了IACA如何显示mov消除、微融合存储以及宏融合比较和分支。
当有选择时,uop在端口之间的分布是相当任意的。不要指望它能与真正的硬件相匹配。我认为IACA根本没有为ROB/调度器建模,真的。在之前的SO问题中已经讨论过这一限制和其他限制。请尝试在IACA上搜索,因为它是一个相当独特的字符串。
- 在C++/Linux中设置单调时钟的一些技巧
- 复制列表初始化的隐式转换的等级是多少
- while循环中while循环的时间复杂度是多少
- 从文本文件中读取时钟时间和事件时间并进行处理
- 如何检查一个c++字符串中有多少相同的字符/数字
- C++有多少类型的循环
- 标准::计时::时钟、硬件时钟和周期计数
- time_t的时钟周期和获取时间问题
- 传递给 std::runtime_error 的 ctor 的字符串对象的生命周期是多少?
- 如何获得进程在windows内核模式下使用的CPU时钟周期
- 如何将刻度转换为时钟周期
- 确定一个码块需要多少个时钟周期
- 正在读取一个布尔原子以及它需要多少个周期
- 在linux中,与动态调用相关的时钟周期中程序的意外执行时间
- 6510 / C64模拟器中的c++,如何实现周期/时钟
- c++ lambda表达式的生命周期是多少?
- 如何获得时钟周期的值在滴答使用linux
- 在现代x86_64 CPU上进行AVX/SSE幂运算需要多少时钟周期?
- const引用右值的类数据成员的生命周期是多少?
- c++数据结构对象的生命周期是多少?