access thread_local variable in inline assembly

I'm working with some C++ code that has an optimized version written with inline assembly. The optimized version is exhibiting non-thread-safe behaviour, which traces back to 3 global variables that are accessed extensively from inside the assembly.

__attribute__ ((aligned (16))) unsigned int SHAVITE_MESS[16];
__attribute__ ((aligned (16))) thread_local unsigned char SHAVITE_PTXT[8*4];
__attribute__ ((aligned (16))) unsigned int SHAVITE_CNTS[4] = {0,0,0,0};

asm ("movaps xmm0, SHAVITE_PTXT[rip]");
asm ("movaps xmm1, SHAVITE_PTXT[rip+16]");
asm ("movaps xmm3, SHAVITE_CNTS[rip]");
asm ("movaps xmm4, SHAVITE256_XOR2[rip]");
asm ("pxor   xmm2,  xmm2");

I naively assumed the easiest way to fix this would be to make the variables thread_local, but that leads to segfaults in the assembly - it seems the assembly doesn't know the variables are thread-local?

I've dug through the assembly of a small thread_local test case to see how gcc handles it (mov eax, DWORD PTR fs:num1@tpoff) and tried modifying the code to do the same:

asm ("movaps xmm0, fs:SHAVITE_PTXT@tpoff");
asm ("movaps xmm1, fs:SHAVITE_PTXT@tpoff+16");
asm ("movaps xmm3, fs:SHAVITE_CNTS@tpoff");
asm ("movaps xmm4, fs:SHAVITE256_XOR2@tpoff");
asm ("pxor   xmm2,  xmm2");

This works if all the variables are also thread_local, and it matches the (non-assembly) reference implementation, so it appears to work successfully. However it seems very CPU-specific: if I look at the output when compiling with -m32, I instead get mov eax, DWORD PTR gs:num1@ntpoff.
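For reference, the small test case I looked at was along these lines (just a sketch; num1 is a placeholder name):

// Minimal thread_local test case (sketch). Compile with: g++ -O2 -S -masm=intel
thread_local int num1;

int get_num1()
{
    return num1;   // 64-bit: mov eax, DWORD PTR fs:num1@tpoff
                   // -m32:   mov eax, DWORD PTR gs:num1@ntpoff
}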

Since the code is "x86"-specific anyway (it uses AES-NI), I suppose I could just disassemble and implement every possible variant of this.

However, I don't really like that as a solution; it feels like programming by guesswork. Moreover, doing that doesn't really teach me anything for future cases of this kind, which may be less specific to one architecture.

Is there a more generic/correct way to solve this? How do I tell the assembly that a variable is thread_local in a more generic way? Or is there a way to pass the variable in so that the assembly doesn't need to know and it works regardless?

If your current code uses a separate "basic" asm statement for each instruction, then it's written badly and lies to the compiler by clobbering XMM registers without telling it. That's not how you use GNU C inline asm.

You should rewrite it with AES-NI and SIMD intrinsics such as _mm_aesdec_si128 so that the compiler emits proper addressing modes for everything. https://gcc.gnu.org/wiki/DontUseInlineAsm


Or, if you really want to keep using GNU C inline asm, use Extended asm with input/output "+m" operands, which can be local variables or any C variable you want, including static or thread-local. See also https://stackoverflow.com/tags/inline-assembly/info for links to inline-asm guides.

But hopefully you can make them automatic storage inside a function, or have the caller allocate and pass a pointer to a context, instead of using static or thread-local storage at all. Thread-locals are slightly slower to access because the non-zero segment base slows down address calculation in the load execution units. Probably not a big deal when the address is ready well ahead of time, I think, but make sure you actually need TLS rather than just scratch space on the stack or memory provided by the caller. It also hurts code size.
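For example, a minimal sketch of the caller-provided-context idea (the struct and function names here are hypothetical, not from the original code):

// Hypothetical sketch: the caller owns all the state, so no globals or TLS are needed.
struct alignas(16) ShaviteCtx {
    unsigned int  mess[16];
    unsigned char ptxt[8*4];
    unsigned int  cnts[4];
};

void E256(ShaviteCtx *ctx);          // implemented with intrinsics or extended asm

void worker()
{
    ShaviteCtx ctx = {};             // automatic storage: private to this call, and thus to this thread
    // ... fill ctx.mess / ctx.ptxt / ctx.cnts ...
    E256(&ctx);
}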

When GCC fills in the %0 or %[named] operand for an "m" operand constraint in the template, it uses the appropriate addressing mode. It works whether that's fs:SHAVITE_PTXT@tpoff+16, XMMWORD PTR [rsp-24], or XMMWORD PTR _ZZ3foovE15SHAVITE256_XOR2[rip] (for a function-local static variable). (As long as you don't run into operand-size mismatches with Intel syntax, where the compiler fills in the size on the memory operand instead of leaving it to a mnemonic suffix the way AT&T syntax mode does.)

Like this, using a global, a TLS global, a local automatic, and a local static variable just to demonstrate that they all work the same way:

// compile with -masm=intel
//#include <stdalign.h>  // for C11
alignas(16) unsigned int SHAVITE_MESS[16];                 // global (static storage)
alignas(16) thread_local unsigned char SHAVITE_PTXT[8*4];  // TLS global
void foo() {
alignas(16) unsigned int SHAVITE_CNTS[4] = {0,0,0,0};   // automatic storage (initialized)
alignas(16) static unsigned int SHAVITE256_XOR2[4];     // local static
asm (
"movaps xmm0, xmmword ptr %[PTXT]     nt"
"movaps xmm1, xmmword ptr %[PTXT]+16  nt"   // x86 addressing modes are always offsetable
"pxor   xmm2,  xmm2       nt"          // mix shorter insns with longer insns to help decode and uop-cache packing
"movaps xmm3, xmmword ptr %[CNTS]+0     nt"
"movaps xmm4, xmmword ptr %[XOR2_256]"
: [CNTS] "+m" (SHAVITE_CNTS),    // outputs and read/write operands
[PTXT] "+m" (SHAVITE_PTXT),
[XOR2_256] "+m" (SHAVITE256_XOR2)
: [MESS] "m" (SHAVITE_MESS)      // read-only inputs
: "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"  // clobbers: list all you use
);
}

You can make this portable between 32-bit and 64-bit mode if you avoid xmm8..15, or guard those uses with #ifdef __x86_64__.

Note that [PTXT] "+m" (SHAVITE_PTXT) as an operand means the whole array is an input/output, because SHAVITE_PTXT is a true array, not a char*.

Of course it expands to an addressing mode for the start of the object, but you can offset that with a constant like +16. The assembler accepts [rsp-24]+16 as equivalent to [rsp-8], so it just works for base-register or static addresses.

Telling the compiler that the whole array is an input and/or output means it can safely optimize around the asm statement even after inlining. E.g. the compiler knows that writes to higher array elements are also relevant to the asm's input/output, not just the first byte. It can't keep later elements in registers across the asm, and it can't reorder loads/stores to those arrays.


If you had used SHAVITE_PTXT[0] (which would also work with a pointer), the compiler would fill in the operand as byte ptr foobar in Intel syntax. But fortunately, with xmmword ptr byte ptr, the first one takes priority and matches the operand size of movaps xmm0, xmmword ptr %[foo]. (AT&T syntax doesn't have this problem: the mnemonic carries the operand size via a suffix where necessary, and the compiler doesn't fill anything in.)

Some of your arrays happen to be 16 bytes in size, so the compiler already fills in xmmword ptr, but the redundancy is fine too.

If you only have pointers instead of arrays, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for the "m" (*(unsigned (*)[16]) SHAVITE_MESS) syntax. You can use that as your actual input operand, or as a "dummy" input alongside the pointer in a "+r" operand.
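A hedged sketch of that pointer case (touch and ptr are hypothetical names; this assumes -masm=intel as in the example above, and ptr must be 16-byte aligned for movaps):

// Sketch: pass the pointer in a register, and add a "dummy" memory operand so the
// compiler knows the 16 unsigned ints behind it are actually read by the asm.
void touch(const unsigned int *ptr)
{
    asm ("movaps xmm0, xmmword ptr [%[p]]"
         :                                             // no outputs (implicitly volatile)
         : [p] "r" (ptr),
           "m" (*(const unsigned int (*)[16]) ptr)     // dummy input: the pointed-to array
         : "xmm0");
}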

Or better, ask for a SIMD register input, output, or read/write operand, like [PTXT16] "+x" (*(__m128i*)&array[16]). It can pick any XMM register you haven't declared as a clobber. Use #include <immintrin.h> to define __m128i, or do it yourself with GNU C native vector syntax. __m128i is defined with __attribute__((may_alias)), so the pointer cast doesn't create strict-aliasing UB.

That's especially good if the compiler can inline this and keep a local variable in an XMM register across asm statements, instead of your hand-written asm doing stores/reloads to keep things in memory.
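A short sketch of the "+x" form (zero_high_block and array are hypothetical names; array must be 16-byte aligned because the compiler may use movaps for the __m128i access):

#include <immintrin.h>

alignas(16) unsigned char array[32];

void zero_high_block()
{
    asm ("pxor %[blk], %[blk]"                     // %[blk] is whichever XMM register gcc picked
         : [blk] "+x" (*(__m128i*)&array[16]));    // __m128i is may_alias, so the cast is fine
}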


Compiler output from the source above

From the Godbolt compiler explorer with gcc 9.2. This is just the compiler's asm text output after it fills in the %[stuff] in the template.

# g++ -O3 -masm=intel
foo():
pxor    xmm0, xmm0
movaps  XMMWORD PTR [rsp-24], xmm0      # compiler-generated zero-init array
movaps xmm0, xmmword ptr fs:SHAVITE_PTXT@tpoff     
movaps xmm1, xmmword ptr fs:SHAVITE_PTXT@tpoff+16  
pxor   xmm2,  xmm2       
movaps xmm3, xmmword ptr XMMWORD PTR [rsp-24]+0     
movaps xmm4, xmmword ptr XMMWORD PTR foo()::SHAVITE256_XOR2[rip]
ret

And this is disassembly of the assembled binary output:

foo():
pxor   xmm0,xmm0
movaps XMMWORD PTR [rsp-0x18],xmm0   # compiler-generated
movaps xmm0,XMMWORD PTR fs:0xffffffffffffffe0
movaps xmm1,XMMWORD PTR fs:0xfffffffffffffff0    # note the +16 worked
pxor   xmm2,xmm2
movaps xmm3,XMMWORD PTR [rsp-0x18]               # note the +0 assembled without syntax error
movaps xmm4,XMMWORD PTR [rip+0x200ae5]        # 601080 <foo()::SHAVITE256_XOR2>
ret

Also note that the non-TLS global uses a RIP-relative addressing mode, but the TLS global doesn't: it uses a sign-extended [disp32] absolute addressing mode.

(In position-dependent code you could in theory use a RIP-relative addressing mode to generate a small absolute address like that relative to the TLS base. I don't think GCC does that, though.)

As the other answer says, the inline asm is a mess and is being misused. Rewriting with intrinsics should be fine, and lets you compile with or without -mavx (or -march=haswell, or -march=znver1, or whatever) to let the compiler save a bunch of register-copy instructions.

It also lets the compiler optimize (vector) register allocation and decide when to load/store, which is something compilers are very good at.


Well, OK, I wasn't able to use the test data you provided. It uses several other routines not given here, and I'm too lazy to go hunting for them.

That said, I was able to piece together some test data. My E256() returns the same values as yours. That doesn't mean I've got it 100% right (you'll need to do your own testing), but with everything getting xor'd/aesenc'd against everything else over and over, if something were wrong I'd expect it to show.

Converting to intrinsics wasn't particularly difficult. Mostly you just need to find the equivalent _mm_ function for a given asm instruction. That, and track down all the places where you typed x12 when you meant x13 (grrr).
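Roughly, the instruction-to-intrinsic correspondences that come up in this code are (my own summary, not part of the original asm):

// movaps xmmA, xmmB        ->  xA = xB;                     (plain assignment)
// movaps xmmA, [mem]       ->  xA = _mm_loadu_si128(...);   (_mm_load_si128 if 16-byte aligned)
// movaps [mem], xmmA       ->  _mm_storeu_si128(..., xA);
// pxor   xmmA, xmmB        ->  xA = _mm_xor_si128(xA, xB);
// psrldq / pslldq xmmA, n  ->  xA = _mm_srli_si128(xA, n) / _mm_slli_si128(xA, n);
// pshufb xmmA, xmmB        ->  xA = _mm_shuffle_epi8(xA, xB);
// pshufd xmmA, xmmB, imm   ->  xA = _mm_shuffle_epi32(xB, imm);
// aesenc xmmA, xmmB        ->  xA = _mm_aesenc_si128(xA, xB);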

Note that while this code uses variables named x0-x15, that's only because it made the translation easier. There is no correlation between these C variable names and the registers gcc will use when compiling the code. Also, gcc uses a lot of its knowledge about SSE to reorder instructions, so the output (especially at -O3) looks very different from the original asm. If you think you can compare them to check correctness (like I did), expect to be frustrated.

This code contains both the original routines (prefixed with "old") and the new ones, and calls both from main() to see if they produce the same output. I've made no effort to tweak the intrinsics to try to optimize them. Once it worked, I stopped. I'll leave any further improvements to you, now that it's all C code.

That said, gcc is able to optimize intrinsics (something it can't do with asm). That means if you recompile this code with -mavx2, the generated code is quite different.

Some statistics:

  • The original (fully expanded) code for E256() takes 287 instructions.
  • Building with intrinsics without -mavx2 takes 251.
  • Building with intrinsics with -mavx2 takes 196.

I haven't done any timing, but I'd like to believe that dropping ~100 lines of asm will help. OTOH, sometimes gcc does a lousy job optimizing SSE, so don't assume anything.

Hope this helps.

// Compile with -O3 -msse4.2 -maes
//           or -O3 -msse4.2 -maes -mavx2
#include <wmmintrin.h>
#include <x86intrin.h>
#include <stdio.h>
///////////////////////////
#define tos(a) #a
#define tostr(a) tos(a)
#define rev_reg_0321(j){ asm ("pshufb xmm" tostr(j)", [oldSHAVITE_REVERSE]"); }
#define replace_aes(i, j){ asm ("aesenc xmm" tostr(i)", xmm" tostr(j)""); }
__attribute__ ((aligned (16))) unsigned int oldSHAVITE_MESS[16];
__attribute__ ((aligned (16))) unsigned char oldSHAVITE_PTXT[8*4];
__attribute__ ((aligned (16))) unsigned int oldSHAVITE_CNTS[4] = {0,0,0,0};
__attribute__ ((aligned (16))) unsigned int oldSHAVITE_REVERSE[4] = {0x07060504, 0x0b0a0908, 0x0f0e0d0c, 0x03020100 };
__attribute__ ((aligned (16))) unsigned int oldSHAVITE256_XOR2[4] = {0x0, 0xFFFFFFFF, 0x0, 0x0};
__attribute__ ((aligned (16))) unsigned int oldSHAVITE256_XOR3[4] = {0x0, 0x0, 0xFFFFFFFF, 0x0};
__attribute__ ((aligned (16))) unsigned int oldSHAVITE256_XOR4[4] = {0x0, 0x0, 0x0, 0xFFFFFFFF};
#define oldmixing() do { \
asm("movaps xmm11, xmm15"); \
asm("movaps xmm10, xmm14"); \
asm("movaps xmm9, xmm13"); \
asm("movaps xmm8, xmm12"); \
\
asm("movaps xmm6, xmm11"); \
asm("psrldq xmm6, 4"); \
asm("pxor xmm8, xmm6"); \
asm("movaps xmm6, xmm8"); \
asm("pslldq xmm6, 12"); \
asm("pxor xmm8, xmm6"); \
\
asm("movaps xmm7, xmm8"); \
asm("psrldq xmm7, 4"); \
asm("pxor xmm9, xmm7"); \
asm("movaps xmm7, xmm9"); \
asm("pslldq xmm7, 12"); \
asm("pxor xmm9, xmm7"); \
\
asm("movaps xmm6, xmm9"); \
asm("psrldq xmm6, 4"); \
asm("pxor xmm10, xmm6"); \
asm("movaps xmm6, xmm10"); \
asm("pslldq xmm6, 12"); \
asm("pxor xmm10, xmm6"); \
\
asm("movaps xmm7, xmm10"); \
asm("psrldq xmm7, 4"); \
asm("pxor xmm11, xmm7"); \
asm("movaps xmm7, xmm11"); \
asm("pslldq xmm7, 12"); \
asm("pxor xmm11, xmm7"); \
} while(0);
void oldE256()
{
asm (".intel_syntax noprefix");
/* (L,R) = (xmm0,xmm1) */
asm ("movaps xmm0, [oldSHAVITE_PTXT]");
asm ("movaps xmm1, [oldSHAVITE_PTXT+16]");
asm ("movaps xmm3, [oldSHAVITE_CNTS]");
asm ("movaps xmm4, [oldSHAVITE256_XOR2]");
asm ("pxor xmm2, xmm2");
/* init key schedule */
asm ("movaps xmm8, [oldSHAVITE_MESS]");
asm ("movaps xmm9, [oldSHAVITE_MESS+16]");
asm ("movaps xmm10, [oldSHAVITE_MESS+32]");
asm ("movaps xmm11, [oldSHAVITE_MESS+48]");
/* xmm8..xmm11 = rk[0..15] */
/* start key schedule */
asm ("movaps xmm12, xmm8");
asm ("movaps xmm13, xmm9");
asm ("movaps xmm14, xmm10");
asm ("movaps xmm15, xmm11");
rev_reg_0321(12);
rev_reg_0321(13);
rev_reg_0321(14);
rev_reg_0321(15);
replace_aes(12, 2);
replace_aes(13, 2);
replace_aes(14, 2);
replace_aes(15, 2);
asm ("pxor xmm12, xmm3");
asm ("pxor xmm12, xmm4");
asm ("movaps xmm4, [oldSHAVITE256_XOR3]");
asm ("pxor xmm12, xmm11");
asm ("pxor xmm13, xmm12");
asm ("pxor xmm14, xmm13");
asm ("pxor xmm15, xmm14");
/* xmm12..xmm15 = rk[16..31] */
/* F3 - first round */
asm ("movaps xmm6, xmm8");
asm ("pxor xmm8, xmm1");
replace_aes(8, 9);
replace_aes(8, 10);
replace_aes(8, 2);
asm ("pxor xmm0, xmm8");
asm ("movaps xmm8, xmm6");
/* F3 - second round */
asm ("movaps xmm6, xmm11");
asm ("pxor xmm11, xmm0");
replace_aes(11, 12);
replace_aes(11, 13);
replace_aes(11, 2);
asm ("pxor xmm1, xmm11");
asm ("movaps xmm11, xmm6");
/* key schedule */
oldmixing();
/* xmm8..xmm11 - rk[32..47] */
/* F3 - third round */
asm ("movaps xmm6, xmm14");
asm ("pxor xmm14, xmm1");
replace_aes(14, 15);
replace_aes(14, 8);
replace_aes(14, 2);
asm ("pxor xmm0, xmm14");
asm ("movaps xmm14, xmm6");
/* key schedule */
asm ("pshufd xmm3, xmm3,135");
asm ("movaps xmm12, xmm8");
asm ("movaps xmm13, xmm9");
asm ("movaps xmm14, xmm10");
asm ("movaps xmm15, xmm11");
rev_reg_0321(12);
rev_reg_0321(13);
rev_reg_0321(14);
rev_reg_0321(15);
replace_aes(12, 2);
replace_aes(13, 2);
replace_aes(14, 2);
replace_aes(15, 2);
asm ("pxor xmm12, xmm11");
asm ("pxor xmm14, xmm3");
asm ("pxor xmm14, xmm4");
asm ("movaps xmm4, [oldSHAVITE256_XOR4]");
asm ("pxor xmm13, xmm12");
asm ("pxor xmm14, xmm13");
asm ("pxor xmm15, xmm14");
/* xmm12..xmm15 - rk[48..63] */
/* F3 - fourth round */
asm ("movaps xmm6, xmm9");
asm ("pxor xmm9, xmm0");
replace_aes(9, 10);
replace_aes(9, 11);
replace_aes(9, 2);
asm ("pxor xmm1, xmm9");
asm ("movaps xmm9, xmm6");
/* key schedule */
oldmixing();
/* xmm8..xmm11 = rk[64..79] */
/* F3 - fifth round */
asm ("movaps xmm6, xmm12");
asm ("pxor xmm12, xmm1");
replace_aes(12, 13);
replace_aes(12, 14);
replace_aes(12, 2);
asm ("pxor xmm0, xmm12");
asm ("movaps xmm12, xmm6");
/* F3 - sixth round */
asm ("movaps xmm6, xmm15");
asm ("pxor xmm15, xmm0");
replace_aes(15, 8);
replace_aes(15, 9);
replace_aes(15, 2);
asm ("pxor xmm1, xmm15");
asm ("movaps xmm15, xmm6");
/* key schedule */
asm ("pshufd xmm3, xmm3, 147");
asm ("movaps xmm12, xmm8");
asm ("movaps xmm13, xmm9");
asm ("movaps xmm14, xmm10");
asm ("movaps xmm15, xmm11");
rev_reg_0321(12);
rev_reg_0321(13);
rev_reg_0321(14);
rev_reg_0321(15);
replace_aes(12, 2);
replace_aes(13, 2);
replace_aes(14, 2);
replace_aes(15, 2);
asm ("pxor xmm12, xmm11");
asm ("pxor xmm13, xmm3");
asm ("pxor xmm13, xmm4");
asm ("pxor xmm13, xmm12");
asm ("pxor xmm14, xmm13");
asm ("pxor xmm15, xmm14");
/* xmm12..xmm15 = rk[80..95] */
/* F3 - seventh round */
asm ("movaps xmm6, xmm10");
asm ("pxor xmm10, xmm1");
replace_aes(10, 11);
replace_aes(10, 12);
replace_aes(10, 2);
asm ("pxor xmm0, xmm10");
asm ("movaps xmm10, xmm6");
/* key schedule */
oldmixing();
/* xmm8..xmm11 = rk[96..111] */
/* F3 - eigth round */
asm ("movaps xmm6, xmm13");
asm ("pxor xmm13, xmm0");
replace_aes(13, 14);
replace_aes(13, 15);
replace_aes(13, 2);
asm ("pxor xmm1, xmm13");
asm ("movaps xmm13, xmm6");
/* key schedule */
asm ("pshufd xmm3, xmm3, 135");
asm ("movaps xmm12, xmm8");
asm ("movaps xmm13, xmm9");
asm ("movaps xmm14, xmm10");
asm ("movaps xmm15, xmm11");
rev_reg_0321(12);
rev_reg_0321(13);
rev_reg_0321(14);
rev_reg_0321(15);
replace_aes(12, 2);
replace_aes(13, 2);
replace_aes(14, 2);
replace_aes(15, 2);
asm ("pxor xmm12, xmm11");
asm ("pxor xmm15, xmm3");
asm ("pxor xmm15, xmm4");
asm ("pxor xmm13, xmm12");
asm ("pxor xmm14, xmm13");
asm ("pxor xmm15, xmm14");
/* xmm12..xmm15 = rk[112..127] */
/* F3 - ninth round */
asm ("movaps xmm6, xmm8");
asm ("pxor xmm8, xmm1");
replace_aes(8, 9);
replace_aes(8, 10);
replace_aes(8, 2);
asm ("pxor xmm0, xmm8");
asm ("movaps xmm8, xmm6");
/* F3 - tenth round */
asm ("movaps xmm6, xmm11");
asm ("pxor xmm11, xmm0");
replace_aes(11, 12);
replace_aes(11, 13);
replace_aes(11, 2);
asm ("pxor xmm1, xmm11");
asm ("movaps xmm11, xmm6");
/* key schedule */
oldmixing();
/* xmm8..xmm11 = rk[128..143] */
/* F3 - eleventh round */
asm ("movaps xmm6, xmm14");
asm ("pxor xmm14, xmm1");
replace_aes(14, 15);
replace_aes(14, 8);
replace_aes(14, 2);
asm ("pxor xmm0, xmm14");
asm ("movaps xmm14, xmm6");
/* F3 - twelfth round */
asm ("movaps xmm6, xmm9");
asm ("pxor xmm9, xmm0");
replace_aes(9, 10);
replace_aes(9, 11);
replace_aes(9, 2);
asm ("pxor xmm1, xmm9");
asm ("movaps xmm9, xmm6");
/* feedforward */
asm ("pxor xmm0, [oldSHAVITE_PTXT]");
asm ("pxor xmm1, [oldSHAVITE_PTXT+16]");
asm ("movaps [oldSHAVITE_PTXT], xmm0");
asm ("movaps [oldSHAVITE_PTXT+16], xmm1");
asm (".att_syntax noprefix");
return;
}
void oldCompress256(const unsigned char *message_block, unsigned char *chaining_value, unsigned long long counter,
const unsigned char salt[32])
{
int i, j;
for (i=0;i<8*4;i++)
oldSHAVITE_PTXT[i]=chaining_value[i];
for (i=0;i<16;i++)
oldSHAVITE_MESS[i] = *((unsigned int*)(message_block+4*i));
oldSHAVITE_CNTS[0] = (unsigned int)(counter & 0xFFFFFFFFULL);
oldSHAVITE_CNTS[1] = (unsigned int)(counter>>32);
/* encryption + Davies-Meyer transform */
oldE256();
for (i=0; i<4*8; i++)
chaining_value[i]=oldSHAVITE_PTXT[i];
return;
}
////////////////////////////////
__attribute__ ((aligned (16))) unsigned int SHAVITE_MESS[16];
__attribute__ ((aligned (16))) unsigned char SHAVITE_PTXT[8*4];
__attribute__ ((aligned (16))) unsigned int SHAVITE_CNTS[4] = {0,0,0,0};
__attribute__ ((aligned (16))) unsigned int SHAVITE_REVERSE[4] = {0x07060504, 0x0b0a0908, 0x0f0e0d0c, 0x03020100 };
__attribute__ ((aligned (16))) unsigned int SHAVITE256_XOR2[4] = {0x0, 0xFFFFFFFF, 0x0, 0x0};
__attribute__ ((aligned (16))) unsigned int SHAVITE256_XOR3[4] = {0x0, 0x0, 0xFFFFFFFF, 0x0};
__attribute__ ((aligned (16))) unsigned int SHAVITE256_XOR4[4] = {0x0, 0x0, 0x0, 0xFFFFFFFF};
#define mixing() do { \
x11 = x15; \
x10 = x14; \
x9 = x13; \
x8 = x12; \
\
x6 = x11; \
x6 = _mm_srli_si128(x6, 4); \
x8 = _mm_xor_si128(x8, x6); \
x6 = x8; \
x6 = _mm_slli_si128(x6, 12); \
x8 = _mm_xor_si128(x8, x6); \
\
x7 = x8; \
x7 = _mm_srli_si128(x7, 4); \
x9 = _mm_xor_si128(x9, x7); \
x7 = x9; \
x7 = _mm_slli_si128(x7, 12); \
x9 = _mm_xor_si128(x9, x7); \
\
x6 = x9; \
x6 = _mm_srli_si128(x6, 4); \
x10 = _mm_xor_si128(x10, x6); \
x6 = x10; \
x6 = _mm_slli_si128(x6, 12); \
x10 = _mm_xor_si128(x10, x6); \
\
x7 = x10; \
x7 = _mm_srli_si128(x7, 4); \
x11 = _mm_xor_si128(x11, x7); \
x7 = x11; \
x7 = _mm_slli_si128(x7, 12); \
x11 = _mm_xor_si128(x11, x7); \
} while(0);
void E256()
{
__m128i x0;
__m128i x1;
__m128i x2;
__m128i x3;
__m128i x4;
__m128i x5;
__m128i x6;
__m128i x7;
__m128i x8;
__m128i x9;
__m128i x10;
__m128i x11;
__m128i x12;
__m128i x13;
__m128i x14;
__m128i x15;
/* (L,R) = (xmm0,xmm1) */
const __m128i ptxt1 = _mm_loadu_si128((const __m128i*)SHAVITE_PTXT);
const __m128i ptxt2 = _mm_loadu_si128((const __m128i*)(SHAVITE_PTXT+16));
x0 = ptxt1;
x1 = ptxt2;
x3 = _mm_loadu_si128((__m128i*)SHAVITE_CNTS);
x4 = _mm_loadu_si128((__m128i*)SHAVITE256_XOR2);
x2 = _mm_setzero_si128();
/* init key schedule */
x8 = _mm_loadu_si128((__m128i*)SHAVITE_MESS);
x9 = _mm_loadu_si128((__m128i*)(SHAVITE_MESS+4));
x10 = _mm_loadu_si128((__m128i*)(SHAVITE_MESS+8));
x11 = _mm_loadu_si128((__m128i*)(SHAVITE_MESS+12));
/* xmm8..xmm11 = rk[0..15] */
/* start key schedule */
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
const __m128i xtemp = _mm_loadu_si128((__m128i*)SHAVITE_REVERSE);
x12 = _mm_shuffle_epi8(x12, xtemp);
x13 = _mm_shuffle_epi8(x13, xtemp);
x14 = _mm_shuffle_epi8(x14, xtemp);
x15 = _mm_shuffle_epi8(x15, xtemp);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x3);
x12 = _mm_xor_si128(x12, x4);
x4 = _mm_loadu_si128((__m128i*)SHAVITE256_XOR3);
x12 = _mm_xor_si128(x12, x11);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 = rk[16..31] */
/* F3 - first round */
x6 = x8;
x8 = _mm_xor_si128(x8, x1);
x8 = _mm_aesenc_si128(x8, x9);
x8 = _mm_aesenc_si128(x8, x10);
x8 = _mm_aesenc_si128(x8, x2);
x0 = _mm_xor_si128(x0, x8);
x8 = x6;
/* F3 - second round */
x6 = x11;
x11 = _mm_xor_si128(x11, x0);
x11 = _mm_aesenc_si128(x11, x12);
x11 = _mm_aesenc_si128(x11, x13);
x11 = _mm_aesenc_si128(x11, x2);
x1 = _mm_xor_si128(x1, x11);
x11 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 - rk[32..47] */
/* F3 - third round */
x6 = x14;
x14 = _mm_xor_si128(x14, x1);
x14 = _mm_aesenc_si128(x14, x15);
x14 = _mm_aesenc_si128(x14, x8);
x14 = _mm_aesenc_si128(x14, x2);
x0 = _mm_xor_si128(x0, x14);
x14 = x6;
/* key schedule */
x3 = _mm_shuffle_epi32(x3, 135);
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
x12 = _mm_shuffle_epi8(x12, xtemp);
x13 = _mm_shuffle_epi8(x13, xtemp);
x14 = _mm_shuffle_epi8(x14, xtemp);
x15 = _mm_shuffle_epi8(x15, xtemp);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x11);
x14 = _mm_xor_si128(x14, x3);
x14 = _mm_xor_si128(x14, x4);
x4 = _mm_loadu_si128((__m128i*)SHAVITE256_XOR4);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 - rk[48..63] */
/* F3 - fourth round */
x6 = x9;
x9 = _mm_xor_si128(x9, x0);
x9 = _mm_aesenc_si128(x9, x10);
x9 = _mm_aesenc_si128(x9, x11);
x9 = _mm_aesenc_si128(x9, x2);
x1 = _mm_xor_si128(x1, x9);
x9 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 = rk[64..79] */
/* F3 - fifth round */
x6 = x12;
x12 = _mm_xor_si128(x12, x1);
x12 = _mm_aesenc_si128(x12, x13);
x12 = _mm_aesenc_si128(x12, x14);
x12 = _mm_aesenc_si128(x12, x2);
x0 = _mm_xor_si128(x0, x12);
x12 = x6;
/* F3 - sixth round */
x6 = x15;
x15 = _mm_xor_si128(x15, x0);
x15 = _mm_aesenc_si128(x15, x8);
x15 = _mm_aesenc_si128(x15, x9);
x15 = _mm_aesenc_si128(x15, x2);
x1 = _mm_xor_si128(x1, x15);
x15 = x6;
/* key schedule */
x3 = _mm_shuffle_epi32(x3, 147);
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
x12 = _mm_shuffle_epi8(x12, xtemp);
x13 = _mm_shuffle_epi8(x13, xtemp);
x14 = _mm_shuffle_epi8(x14, xtemp);
x15 = _mm_shuffle_epi8(x15, xtemp);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x11);
x13 = _mm_xor_si128(x13, x3);
x13 = _mm_xor_si128(x13, x4);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 = rk[80..95] */
/* F3 - seventh round */
x6 = x10;
x10 = _mm_xor_si128(x10, x1);
x10 = _mm_aesenc_si128(x10, x11);
x10 = _mm_aesenc_si128(x10, x12);
x10 = _mm_aesenc_si128(x10, x2);
x0 = _mm_xor_si128(x0, x10);
x10 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 = rk[96..111] */
/* F3 - eigth round */
x6 = x13;
x13 = _mm_xor_si128(x13, x0);
x13 = _mm_aesenc_si128(x13, x14);
x13 = _mm_aesenc_si128(x13, x15);
x13 = _mm_aesenc_si128(x13, x2);
x1 = _mm_xor_si128(x1, x13);
x13 = x6;
/* key schedule */
x3 = _mm_shuffle_epi32(x3, 135);
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
x12 = _mm_shuffle_epi8(x12, xtemp);
x13 = _mm_shuffle_epi8(x13, xtemp);
x14 = _mm_shuffle_epi8(x14, xtemp);
x15 = _mm_shuffle_epi8(x15, xtemp);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x11);
x15 = _mm_xor_si128(x15, x3);
x15 = _mm_xor_si128(x15, x4);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 = rk[112..127] */
/* F3 - ninth round */
x6 = x8;
x8 = _mm_xor_si128(x8, x1);
x8 = _mm_aesenc_si128(x8, x9);
x8 = _mm_aesenc_si128(x8, x10);
x8 = _mm_aesenc_si128(x8, x2);
x0 = _mm_xor_si128(x0, x8);
x8 = x6;
/* F3 - tenth round */
x6 = x11;
x11 = _mm_xor_si128(x11, x0);
x11 = _mm_aesenc_si128(x11, x12);
x11 = _mm_aesenc_si128(x11, x13);
x11 = _mm_aesenc_si128(x11, x2);
x1 = _mm_xor_si128(x1, x11);
x11 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 = rk[128..143] */
/* F3 - eleventh round */
x6 = x14;
x14 = _mm_xor_si128(x14, x1);
x14 = _mm_aesenc_si128(x14, x15);
x14 = _mm_aesenc_si128(x14, x8);
x14 = _mm_aesenc_si128(x14, x2);
x0 = _mm_xor_si128(x0, x14);
x14 = x6;
/* F3 - twelfth round */
x6 = x9;
x9 = _mm_xor_si128(x9, x0);
x9 = _mm_aesenc_si128(x9, x10);
x9 = _mm_aesenc_si128(x9, x11);
x9 = _mm_aesenc_si128(x9, x2);
x1 = _mm_xor_si128(x1, x9);
x9 = x6;
/* feedforward */
x0 = _mm_xor_si128(x0, ptxt1);
x1 = _mm_xor_si128(x1, ptxt2);
_mm_storeu_si128((__m128i *)SHAVITE_PTXT, x0);
_mm_storeu_si128((__m128i *)(SHAVITE_PTXT + 16), x1);
return;
}
void Compress256(const unsigned char *message_block, unsigned char *chaining_value, unsigned long long counter,
const unsigned char salt[32])
{
int i, j;
for (i=0;i<8*4;i++)
SHAVITE_PTXT[i]=chaining_value[i];
for (i=0;i<16;i++)
SHAVITE_MESS[i] = *((unsigned int*)(message_block+4*i));
SHAVITE_CNTS[0] = (unsigned int)(counter & 0xFFFFFFFFULL);
SHAVITE_CNTS[1] = (unsigned int)(counter>>32);
/* encryption + Davies-Meyer transform */
E256();
for (i=0; i<4*8; i++)
chaining_value[i]=SHAVITE_PTXT[i];
return;
}
int main(int argc, char *argv[])
{
const int cvlen = 32;
unsigned char *cv = (unsigned char *)malloc(cvlen);
for (int x=0; x < cvlen; x++)
cv[x] = x + argc;
const int mblen = 64;
unsigned char *mb = (unsigned char *)malloc(mblen);
for (int x=0; x < mblen; x++)
mb[x] = x + argc;
unsigned long long counter = 0x1234567812345678ull;
unsigned char s[32] = {0};
oldCompress256(mb, cv, counter, s);
printf("old: ");
for (int x=0; x < cvlen; x++)
printf("%2x ", cv[x]);
printf("n");
for (int x=0; x < cvlen; x++)
cv[x] = x + argc;
Compress256(mb, cv, counter, s);
printf("new: ");
for (int x=0; x < cvlen; x++)
printf("%2x ", cv[x]);
printf("n");
}

Edit:

The global variables are only being used to pass values between the C code and the asm. Perhaps the asm writer didn't know how to access parameters? IAC, they are unnecessary (and the source of the thread-safety problems). Here's the code without them (along with some cosmetic changes):

#include <x86intrin.h>
#include <stdio.h>
#include <time.h>
#define mixing() \
x11 = x15; \
x10 = x14; \
x9 = x13; \
x8 = x12; \
\
x6 = x11; \
x6 = _mm_srli_si128(x6, 4); \
x8 = _mm_xor_si128(x8, x6); \
x6 = x8; \
x6 = _mm_slli_si128(x6, 12); \
x8 = _mm_xor_si128(x8, x6); \
\
x7 = x8; \
x7 = _mm_srli_si128(x7, 4); \
x9 = _mm_xor_si128(x9, x7); \
x7 = x9; \
x7 = _mm_slli_si128(x7, 12); \
x9 = _mm_xor_si128(x9, x7); \
\
x6 = x9; \
x6 = _mm_srli_si128(x6, 4); \
x10 = _mm_xor_si128(x10, x6); \
x6 = x10; \
x6 = _mm_slli_si128(x6, 12); \
x10 = _mm_xor_si128(x10, x6); \
\
x7 = x10; \
x7 = _mm_srli_si128(x7, 4); \
x11 = _mm_xor_si128(x11, x7); \
x7 = x11; \
x7 = _mm_slli_si128(x7, 12); \
x11 = _mm_xor_si128(x11, x7);
// If mess & chain won't be 16byte aligned, change _mm_load to _mm_loadu and
// _mm_store to _mm_storeu
void Compress256(const __m128i *mess, __m128i *chain, unsigned long long counter, const unsigned char salt[32])
{
// note: _mm_set_epi32 uses (int e3, int e2, int e1, int e0)
const __m128i SHAVITE_REVERSE = _mm_set_epi32(0x03020100, 0x0f0e0d0c, 0x0b0a0908, 0x07060504);
const __m128i SHAVITE256_XOR2 = _mm_set_epi32(0x0, 0x0, 0xFFFFFFFF, 0x0);
const __m128i SHAVITE256_XOR3 = _mm_set_epi32(0x0, 0xFFFFFFFF, 0x0, 0x0);
const __m128i SHAVITE256_XOR4 = _mm_set_epi32(0xFFFFFFFF, 0x0, 0x0, 0x0);
const __m128i SHAVITE_CNTS =
_mm_set_epi32(0, 0, (unsigned int)(counter>>32), (unsigned int)(counter & 0xFFFFFFFFULL));
__m128i x0, x1, x2, x3, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15;
/* (L,R) = (xmm0,xmm1) */
const __m128i ptxt1 = _mm_load_si128(chain);
const __m128i ptxt2 = _mm_load_si128(chain+1);
x0 = ptxt1;
x1 = ptxt2;
x3 = SHAVITE_CNTS;
x2 = _mm_setzero_si128();
/* init key schedule */
x8 = _mm_load_si128(mess);
x9 = _mm_load_si128(mess+1);
x10 = _mm_load_si128(mess+2);
x11 = _mm_load_si128(mess+3);
/* xmm8..xmm11 = rk[0..15] */
/* start key schedule */
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
x12 = _mm_shuffle_epi8(x12, SHAVITE_REVERSE);
x13 = _mm_shuffle_epi8(x13, SHAVITE_REVERSE);
x14 = _mm_shuffle_epi8(x14, SHAVITE_REVERSE);
x15 = _mm_shuffle_epi8(x15, SHAVITE_REVERSE);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x3);
x12 = _mm_xor_si128(x12, SHAVITE256_XOR2);
x12 = _mm_xor_si128(x12, x11);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 = rk[16..31] */
/* F3 - first round */
x6 = x8;
x8 = _mm_xor_si128(x8, x1);
x8 = _mm_aesenc_si128(x8, x9);
x8 = _mm_aesenc_si128(x8, x10);
x8 = _mm_aesenc_si128(x8, x2);
x0 = _mm_xor_si128(x0, x8);
x8 = x6;
/* F3 - second round */
x6 = x11;
x11 = _mm_xor_si128(x11, x0);
x11 = _mm_aesenc_si128(x11, x12);
x11 = _mm_aesenc_si128(x11, x13);
x11 = _mm_aesenc_si128(x11, x2);
x1 = _mm_xor_si128(x1, x11);
x11 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 - rk[32..47] */
/* F3 - third round */
x6 = x14;
x14 = _mm_xor_si128(x14, x1);
x14 = _mm_aesenc_si128(x14, x15);
x14 = _mm_aesenc_si128(x14, x8);
x14 = _mm_aesenc_si128(x14, x2);
x0 = _mm_xor_si128(x0, x14);
x14 = x6;
/* key schedule */
x3 = _mm_shuffle_epi32(x3, 135);
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
x12 = _mm_shuffle_epi8(x12, SHAVITE_REVERSE);
x13 = _mm_shuffle_epi8(x13, SHAVITE_REVERSE);
x14 = _mm_shuffle_epi8(x14, SHAVITE_REVERSE);
x15 = _mm_shuffle_epi8(x15, SHAVITE_REVERSE);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x11);
x14 = _mm_xor_si128(x14, x3);
x14 = _mm_xor_si128(x14, SHAVITE256_XOR3);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 - rk[48..63] */
/* F3 - fourth round */
x6 = x9;
x9 = _mm_xor_si128(x9, x0);
x9 = _mm_aesenc_si128(x9, x10);
x9 = _mm_aesenc_si128(x9, x11);
x9 = _mm_aesenc_si128(x9, x2);
x1 = _mm_xor_si128(x1, x9);
x9 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 = rk[64..79] */
/* F3 - fifth round */
x6 = x12;
x12 = _mm_xor_si128(x12, x1);
x12 = _mm_aesenc_si128(x12, x13);
x12 = _mm_aesenc_si128(x12, x14);
x12 = _mm_aesenc_si128(x12, x2);
x0 = _mm_xor_si128(x0, x12);
x12 = x6;
/* F3 - sixth round */
x6 = x15;
x15 = _mm_xor_si128(x15, x0);
x15 = _mm_aesenc_si128(x15, x8);
x15 = _mm_aesenc_si128(x15, x9);
x15 = _mm_aesenc_si128(x15, x2);
x1 = _mm_xor_si128(x1, x15);
x15 = x6;
/* key schedule */
x3 = _mm_shuffle_epi32(x3, 147);
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
x12 = _mm_shuffle_epi8(x12, SHAVITE_REVERSE);
x13 = _mm_shuffle_epi8(x13, SHAVITE_REVERSE);
x14 = _mm_shuffle_epi8(x14, SHAVITE_REVERSE);
x15 = _mm_shuffle_epi8(x15, SHAVITE_REVERSE);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x11);
x13 = _mm_xor_si128(x13, x3);
x13 = _mm_xor_si128(x13, SHAVITE256_XOR4);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 = rk[80..95] */
/* F3 - seventh round */
x6 = x10;
x10 = _mm_xor_si128(x10, x1);
x10 = _mm_aesenc_si128(x10, x11);
x10 = _mm_aesenc_si128(x10, x12);
x10 = _mm_aesenc_si128(x10, x2);
x0 = _mm_xor_si128(x0, x10);
x10 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 = rk[96..111] */
/* F3 - eigth round */
x6 = x13;
x13 = _mm_xor_si128(x13, x0);
x13 = _mm_aesenc_si128(x13, x14);
x13 = _mm_aesenc_si128(x13, x15);
x13 = _mm_aesenc_si128(x13, x2);
x1 = _mm_xor_si128(x1, x13);
x13 = x6;
/* key schedule */
x3 = _mm_shuffle_epi32(x3, 135);
x12 = x8;
x13 = x9;
x14 = x10;
x15 = x11;
x12 = _mm_shuffle_epi8(x12, SHAVITE_REVERSE);
x13 = _mm_shuffle_epi8(x13, SHAVITE_REVERSE);
x14 = _mm_shuffle_epi8(x14, SHAVITE_REVERSE);
x15 = _mm_shuffle_epi8(x15, SHAVITE_REVERSE);
x12 = _mm_aesenc_si128(x12, x2);
x13 = _mm_aesenc_si128(x13, x2);
x14 = _mm_aesenc_si128(x14, x2);
x15 = _mm_aesenc_si128(x15, x2);
x12 = _mm_xor_si128(x12, x11);
x15 = _mm_xor_si128(x15, x3);
x15 = _mm_xor_si128(x15, SHAVITE256_XOR4);
x13 = _mm_xor_si128(x13, x12);
x14 = _mm_xor_si128(x14, x13);
x15 = _mm_xor_si128(x15, x14);
/* xmm12..xmm15 = rk[112..127] */
/* F3 - ninth round */
x6 = x8;
x8 = _mm_xor_si128(x8, x1);
x8 = _mm_aesenc_si128(x8, x9);
x8 = _mm_aesenc_si128(x8, x10);
x8 = _mm_aesenc_si128(x8, x2);
x0 = _mm_xor_si128(x0, x8);
x8 = x6;
/* F3 - tenth round */
x6 = x11;
x11 = _mm_xor_si128(x11, x0);
x11 = _mm_aesenc_si128(x11, x12);
x11 = _mm_aesenc_si128(x11, x13);
x11 = _mm_aesenc_si128(x11, x2);
x1 = _mm_xor_si128(x1, x11);
x11 = x6;
/* key schedule */
mixing();
/* xmm8..xmm11 = rk[128..143] */
/* F3 - eleventh round */
x6 = x14;
x14 = _mm_xor_si128(x14, x1);
x14 = _mm_aesenc_si128(x14, x15);
x14 = _mm_aesenc_si128(x14, x8);
x14 = _mm_aesenc_si128(x14, x2);
x0 = _mm_xor_si128(x0, x14);
x14 = x6;
/* F3 - twelfth round */
x6 = x9;
x9 = _mm_xor_si128(x9, x0);
x9 = _mm_aesenc_si128(x9, x10);
x9 = _mm_aesenc_si128(x9, x11);
x9 = _mm_aesenc_si128(x9, x2);
x1 = _mm_xor_si128(x1, x9);
x9 = x6;
/* feedforward */
x0 = _mm_xor_si128(x0, ptxt1);
x1 = _mm_xor_si128(x1, ptxt2);
_mm_store_si128(chain, x0);
_mm_store_si128(chain + 1, x1);
}
int main(int argc, char *argv[])
{
__m128i chain[2], mess[4];
unsigned char *p;
// argc prevents compiler from precalculating results
p = (unsigned char *)mess;
for (int x=0; x < 64; x++)
p[x] = x + argc;
p = (unsigned char *)chain;
for (int x=0; x < 32; x++)
p[x] = x + argc;
unsigned long long counter = 0x1234567812345678ull + argc;
// Unused, but prototype requires it.
unsigned char s[32] = {0};
Compress256(mess, chain, counter, s);
for (int x=0; x < 32; x++)
printf("%02x ", p[x]);
printf("n");
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
unsigned char res = 0;
for (int x=0; x < 400000; x++)
{
Compress256(mess, chain, counter, s);
// Ensure optimizer doesn't omit the calc
res ^= *p;
}
clock_gettime(CLOCK_MONOTONIC, &end);
unsigned long long delta_us = (end.tv_sec - start.tv_sec) * 1000000ull + (end.tv_nsec - start.tv_nsec) / 1000ull;
printf("%x: %llun", res, delta_us);
}