Seg Fault in AVX Vectorized Code with GCC __attribute__ aligned at 32 bytes

Keywords: GCC attribute, alignment, 32 bytes, seg fault, vectorized code, AVX. Updated: 2023-10-16

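I get a seg fault in a loop only when that loop is fully vectorized on an AVX machine (Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz).

Compiled with gcc -c -march=native MyClass.cpp -O3 -ftree-vectorizer-verbose=6

I am trying to align the array to avoid these messages from -ftree-vectorizer-verbose=6: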

MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: unaligned supported by hardware.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: Alignment of access forced using peeling.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: cost model: prologue peel iters set to vf/2.
MyClass.cpp:352: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .

What I want to see (and do see) is:

MyClass.cpp:352: note: dependence distance modulo vf == 0 between this_7(D)->x[i_101] and this_7(D)->x[i_101]
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
MyClass.cpp:352: note: vect_model_load_cost: aligned.
MyClass.cpp:352: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
MyClass.cpp:352: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
MyClass.cpp:352: note: vect_model_store_cost: aligned.
MyClass.cpp:352: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .

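Now, I am no C/C++/assembler guru by any means, but when I get a seg fault I assume I have some pointer/array/other bug in my code and that the fully vectorized loop is merely exposing it. After two days of studying the assembler, though, I can't find it. So here I am.

The code looks like this (hopefully I have included everything relevant; I can't share the actual .cpp in full here):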

#include <cstdio>

class MyClass {
private:
    static const long maxElems = 1024;
    static const double otherVar = 0.9;
    double x[maxElems] __attribute__ ((aligned (32)));  // <-- gcc reports fully vectorized
    //double x[maxElems];   <-- leads to unaligned peeling
public:
    void myFunc() {
        // Always works
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);
        // Seg fault if fully vectorized (no peeling)
        for (int i=0; i<maxElems; ++i) {
            x[i] = x[i] - 42;
        }
        // Works if no seg fault earlier
        for (int i=0; i<maxElems; ++i) printf("Test: %d %.4e\n", i, x[i]);
    }
};

When it is fully vectorized, I see this (using the -Wa,-alh flags to view the assembler):

 989      00
 990 0b56 488B4424      movq    40(%rsp), %rax
 990      28
 991 0b5b C5FD280D      vmovapd .LC8(%rip), %ymm1
 991      00000000 
 992                    .p2align 4,,10
 993 0b63 0F1F4400      .p2align 3
 993      00
 994                .L153:
 995 0b68 C5FD2800      vmovapd (%rax), %ymm0
 996 0b6c C5FD5CC1      vsubpd  %ymm1, %ymm0, %ymm0
 997 0b70 C5FD2900      vmovapd %ymm0, (%rax)
 998 0b74 4883C020      addq    $32, %rax
 999 0b78 4C39E0        cmpq    %r12, %rax
 1000 0b7b 75EB             jne .L153

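Again, the usual caveat about "not knowing assembler" applies, but I did spend quite a bit of time printing pointers and examining the assembler to convince myself that this loop starts and ends at the start and end of the array. However, whenever I get the seg fault, the starting address of x is not divisible by 32. I assume that is what causes the trouble.

Yes, I know I could allocate x on the heap and choose where it ends up so that it is aligned. But my experiment here is to keep MyClass a fixed size with all of its data inside (think cache efficiency), so I allocate instances of MyClass on the heap, keep pointers to them in a collection, and x lives inside MyClass.

Isn't the aligned attribute supposed to put x on a 32-byte boundary? The compiler assumes it is, and then vmovapd blows up because it isn't, right?

GCC alignment documentation: https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

Do I have to align MyClass on the heap somehow? How do I do that? And how do I tell GCC that I have done it, so that it vectorizes the way I want?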

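EDIT: I have solved this (thanks in part to the comments and answers below). When objects are created on the heap, their alignment can be guaranteed by overriding the default new operator for the class. With that in place I get no seg faults, and my code still vectorizes exactly as I wanted. Here is how I did it: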

static void* operator new(size_t size) throw (std::bad_alloc) {
    void *alignedPointer;
    int alignError = 0;
    // Try to allocate the required amount of memory (using POSIX standard aligned allocation)
    alignError = posix_memalign(&alignedPointer, VECTOR_ALIGN_BYTES, size);
    // Throw/Report error if any
    if (alignError) {
        throw std::bad_alloc();
    }
    // Return a pointer to this aligned memory location
    return alignedPointer;
}
static void operator delete(void* alignedPointer) {
    // POSIX aligned memory allocation can be freed normally with free()
    free(alignedPointer);
}

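C++ calls the constructor/destructor for you after/before calling these operators, so the alignment is controlled by the class itself. There are other aligned memory allocators if you have a different preference. I used POSIX.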
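For readers who want to try the approach, here is a self-contained sketch of the same idea. It is an illustration under stated assumptions, not the poster's actual class: AlignedBlock and exerciseHeapInstances are hypothetical names, VECTOR_ALIGN_BYTES is assumed to be 32 (the post does not show its definition), and the old throw (std::bad_alloc) specification is left out because it was removed in C++17.

```cpp
#include <stdlib.h>   // posix_memalign, free (POSIX, not ISO C++)
#include <cstddef>
#include <cstdint>
#include <new>        // std::bad_alloc

// Assumed value; the original post does not show this definition.
static const std::size_t VECTOR_ALIGN_BYTES = 32;

// Hypothetical stand-in for MyClass.
class AlignedBlock {
public:
    static const long maxElems = 1024;
    double x[maxElems] __attribute__((aligned(32)));

    // Class-specific allocation: every `new AlignedBlock` goes through here.
    static void* operator new(std::size_t size) {
        void* p = nullptr;
        if (posix_memalign(&p, VECTOR_ALIGN_BYTES, size) != 0) {
            throw std::bad_alloc();
        }
        return p;
    }
    // Memory from posix_memalign is released with plain free().
    static void operator delete(void* p) { free(p); }

    // The loop from the question, with the constant passed in.
    void subtract(double value) {
        for (long i = 0; i < maxElems; ++i) x[i] -= value;
    }
};

// Creates `count` heap instances and returns true when each one started on a
// VECTOR_ALIGN_BYTES boundary and the loop produced the expected value.
bool exerciseHeapInstances(int count) {
    bool ok = true;
    for (int i = 0; i < count; ++i) {
        AlignedBlock* block = new AlignedBlock();   // value-initialized: x is all zeros
        const void* raw = block;
        if (reinterpret_cast<std::uintptr_t>(raw) % VECTOR_ALIGN_BYTES != 0) ok = false;
        block->subtract(42.0);
        if (block->x[0] != -42.0) ok = false;
        delete block;                               // uses AlignedBlock::operator delete
    }
    return ok;
}
```

Calling exerciseHeapInstances(8) from a test program should return true on a POSIX system. Only objects created with a plain new expression are covered; arrays (new[]) and objects embedded in other classes do not pass through this operator.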

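Two caveats: if someone calls placement new with an arbitrary address, you can still end up misaligned. And if someone declares your class as a member of their own class, and their class is allocated on the heap, you may be misaligned as well. I put a check in my constructor that throws an error if it detects this.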
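The post does not show that check. The sketch below is a swapped-in variant rather than the poster's constructor check: it tests the raw storage address before the object is constructed, on the assumption that a test written against this inside an over-aligned class is something the compiler may be entitled to fold away. Payload, isAlignedTo, constructAt and the two helper functions are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>     // std::uintptr_t
#include <new>         // placement new
#include <stdexcept>   // std::invalid_argument

// Hypothetical stand-in for the class holding the aligned array.
struct Payload {
    double x[1024] __attribute__((aligned(32)));
};

// Returns true when `address` is a multiple of `alignment`.
bool isAlignedTo(const void* address, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(address) % alignment == 0;
}

// Constructs a Payload in caller-supplied storage, refusing storage that
// does not start on a 32-byte boundary.
Payload* constructAt(void* storage) {
    if (!isAlignedTo(storage, 32)) {
        throw std::invalid_argument("storage is not 32-byte aligned");
    }
    return new (storage) Payload();
}

// Offers storage that is deliberately 8 bytes past a 32-byte boundary.
// Returns true when constructAt rejects it.
bool rejectsMisaligned() {
    alignas(32) static unsigned char buffer[sizeof(Payload) + 32];
    try {
        constructAt(buffer + 8);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}

// Offers properly aligned storage. Returns true when construction succeeds.
bool acceptsAligned() {
    alignas(32) static unsigned char buffer[sizeof(Payload)];
    Payload* p = constructAt(buffer);
    return p != nullptr && p->x[0] == 0.0;
}
```

Checking before construction also avoids ever creating an object at a misaligned address, which would itself be undefined behavior.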

__attribute__((aligned(32))) may not do what we think it does (bug? feature?).

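It basically tells the compiler that it may assume the thing is aligned, even though it may not be. If the object lives on the heap, you need to allocate it with posix_memalign or something similar.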
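A small sketch, not from the original answer, may help separate what the attribute settles from what it leaves open. Sample and reportLayout are hypothetical names. The attribute fixes the alignment requirement of the type and the offset of x inside the object; whether the base address of a given object honors that requirement depends on whoever provides the storage, and malloc or a plain pre-C++17 new is generally only expected to satisfy alignof(std::max_align_t).

```cpp
#include <cstddef>   // offsetof, std::size_t, std::max_align_t
#include <cstdio>

// Hypothetical stand-in for MyClass: one scalar followed by the aligned array.
struct Sample {
    double otherVar;
    double x[1024] __attribute__((aligned(32)));
};

// The attribute raises the alignment requirement of the whole type...
static_assert(alignof(Sample) == 32, "type requires 32-byte alignment");
// ...and pads the layout so that x sits at a multiple of 32 from the start.
static_assert(offsetof(Sample, x) % 32 == 0, "x offset is a multiple of 32");

// Prints the layout facts and returns the offset of x inside Sample.
// x is 32-byte aligned exactly when the object's base address is.
std::size_t reportLayout() {
    std::printf("alignof(Sample)           = %zu\n", alignof(Sample));
    std::printf("offsetof(Sample, x)       = %zu\n", offsetof(Sample, x));
    // What a general-purpose allocator is expected to guarantee at minimum.
    std::printf("alignof(std::max_align_t) = %zu\n", alignof(std::max_align_t));
    return offsetof(Sample, x);
}
```

If the last printed value is smaller than 32, an allocator that only meets that minimum can hand back a base address that breaks the assumption the vectorizer relies on.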

If __attribute__((aligned(...))) is set but the allocation is not actually aligned, GCC will get pointer arithmetic wrong:

s2->aligned_var = 0x199c030
&s2->aligned_var % 0x40  = 0x0

https://gcc.gnu.org/ml/gcc/2014-06/msg00308.html
