How is a vector's data aligned?

Keywords: alignment, data. Last updated: 2023-10-16

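If I want to process the data in a std::vector with SSE, I need 16-byte alignment. How can I achieve that? Do I need to write my own allocator? Or does the default allocator already align to 16-byte boundaries?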

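The C++ standard requires the allocation functions (malloc() and operator new()) to return memory suitably aligned for any standard type. Since these functions do not receive the alignment requirement as an argument, in practice every allocation gets the same alignment: that of the standard type with the largest alignment requirement, which is often long double and/or long long (see Boost's max_align union).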

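Vector instructions such as SSE and AVX have stronger alignment requirements than the standard C++ allocation functions provide (16-byte alignment for 128-bit access, 32-byte alignment for 256-bit access). posix_memalign() or memalign() can be used to satisfy allocations with such stronger alignment requirements.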
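As a rough illustration of the posix_memalign() route, here is a minimal sketch. It assumes a POSIX system; the helper names is_aligned and alloc_aligned_floats are made up for this example and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX only, an assumption of this sketch)

// Hypothetical helper: true if p is a multiple of `alignment` (a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: allocates `count` floats on an `alignment`-byte boundary.
// posix_memalign needs a power-of-two alignment that is a multiple of sizeof(void*).
// Returns nullptr on failure; release the memory with free().
inline float* alloc_aligned_floats(std::size_t count, std::size_t alignment) {
    void* p = nullptr;
    if (posix_memalign(&p, alignment, count * sizeof(float)) != 0) {
        return nullptr;
    }
    return static_cast<float*>(p);
}
```

A buffer obtained from alloc_aligned_floats(8, 32) should be usable for 256-bit aligned loads and must be released with free(), not delete.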

In C++17, the allocation functions accept an additional argument of type std::align_val_t.

You can use it like this:

#include <immintrin.h>
#include <memory>
#include <new>
int main() {
    std::unique_ptr<__m256i[]> arr{new(std::align_val_t{alignof(__m256i)}) __m256i[32]};
}

Moreover, in C++17 the standard allocators were updated to respect the type's alignment, so you can simply write:

#include <immintrin.h>
#include <vector>
int main() {
    std::vector<__m256i> arr2(32);
}
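One way to check this behaviour at run time, without depending on SIMD headers, is sketched below. It assumes a C++17 compiler and standard library; Lane8 is a hypothetical stand-in for an over-aligned SIMD type, and vector_data_is_aligned is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical over-aligned element type, standing in for __m256.
struct alignas(32) Lane8 {
    float v[8];
};

// Builds a std::vector with the default allocator and reports whether its
// storage starts on an alignof(Lane8) boundary. In C++17 this is expected
// to hold because std::allocator honours over-aligned types.
inline bool vector_data_is_aligned(std::size_t count) {
    assert(count > 0);
    std::vector<Lane8> blocks(count);
    return reinterpret_cast<std::uintptr_t>(blocks.data()) % alignof(Lane8) == 0;
}
```

With a pre-C++17 standard library the same check may fail, which is the situation the custom allocators in the other answers were written for.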

Or (no heap allocation involved, and supported since C++11):

#include <immintrin.h>
#include <array>
int main() {
    std::array<__m256i, 32> arr3;
}

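You should use a custom allocator with std:: containers such as vector. I can't remember who wrote the following one, but I have used it for some time and it seems to work (you may need to change _aligned_malloc to _mm_malloc, depending on your compiler/platform):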

#ifndef ALIGNMENT_ALLOCATOR_H
#define ALIGNMENT_ALLOCATOR_H
#include <cstddef>   // std::size_t, std::ptrdiff_t
#include <stdlib.h>
#include <malloc.h>  // _aligned_malloc, _aligned_free (MSVC)
template <typename T, std::size_t N = 16>
class AlignmentAllocator {
public:
  typedef T value_type;
  typedef std::size_t size_type;
  typedef std::ptrdiff_t difference_type;
  typedef T * pointer;
  typedef const T * const_pointer;
  typedef T & reference;
  typedef const T & const_reference;
  public:
  inline AlignmentAllocator () throw () { }
  template <typename T2>
  inline AlignmentAllocator (const AlignmentAllocator<T2, N> &) throw () { }
  inline ~AlignmentAllocator () throw () { }
  // Note: the standard allocator interface spells this member "address".
  inline pointer address (reference r) {
    return &r;
  }
  inline const_pointer address (const_reference r) const {
    return &r;
  }
  inline pointer allocate (size_type n) {
     return (pointer)_aligned_malloc(n*sizeof(value_type), N);
  }
  inline void deallocate (pointer p, size_type) {
    _aligned_free (p);
  }
  inline void construct (pointer p, const value_type & wert) {
     new (p) value_type (wert);
  }
  inline void destroy (pointer p) {
    p->~value_type ();
  }
  inline size_type max_size () const throw () {
    return size_type (-1) / sizeof (value_type);
  }
  template <typename T2>
  struct rebind {
    typedef AlignmentAllocator<T2, N> other;
  };
  bool operator!=(const AlignmentAllocator<T,N>& other) const  {
    return !(*this == other);
  }
  // Returns true if and only if storage allocated from *this
  // can be deallocated from other, and vice versa.
  // Always returns true for stateless allocators.
  bool operator==(const AlignmentAllocator<T,N>& other) const {
    return true;
  }
};
#endif

Use it like this (change 16 to another alignment if needed):

std::vector<T, AlignmentAllocator<T, 16> > bla;
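For comparison, the same idea can be written much more compactly if C++17 is available, because std::allocator_traits supplies most of the members and the aligned form of operator new does the allocation. The following is a minimal sketch under that assumption, not a drop-in replacement for the class above; AlignedAllocator and aligned_vector_demo are names invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical minimal allocator: every block starts on an N-byte boundary.
template <typename T, std::size_t N = 16>
struct AlignedAllocator {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
    using value_type = T;

    // rebind must be spelled out because N is a non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, N>; };

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, N>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) {
            throw std::bad_array_new_length();
        }
        // C++17 aligned allocation; throws std::bad_alloc on failure.
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{N}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{N});
    }
};

// Stateless allocators always compare equal.
template <typename T, typename U, std::size_t N>
bool operator==(const AlignedAllocator<T, N>&, const AlignedAllocator<U, N>&) noexcept { return true; }
template <typename T, typename U, std::size_t N>
bool operator!=(const AlignedAllocator<T, N>&, const AlignedAllocator<U, N>&) noexcept { return false; }

// Usage sketch: the vector's storage should start on a 32-byte boundary.
inline void aligned_vector_demo() {
    std::vector<float, AlignedAllocator<float, 32>> v(100, 1.0f);
    assert(reinterpret_cast<std::uintptr_t>(v.data()) % 32 == 0);
}
```

Unlike the _aligned_malloc version, this sketch reports allocation failure by throwing, which is what standard containers expect from an allocator.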

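However, this only ensures that the memory block std::vector uses is 16-byte aligned. If sizeof(T) is not a multiple of 16, some of your elements will not be aligned. Depending on your data type, this may not be a problem. If T is int (4 bytes), only perform aligned loads at elements whose index is a multiple of 4. If it is double (8 bytes), only at multiples of 2, and so on.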
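The index rule above can be checked with a couple of small helpers. This is only a sketch; element_is_aligned and elements_per_chunk are hypothetical names introduced for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: is element `index` of the array at `base` located on
// an `alignment`-byte boundary? `alignment` must be a power of two.
template <typename T>
bool element_is_aligned(const T* base, std::size_t index, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return reinterpret_cast<std::uintptr_t>(base + index) % alignment == 0;
}

// Hypothetical helper: how many elements fit in one aligned chunk,
// e.g. 4 for a 4-byte int or 2 for an 8-byte double with 16-byte chunks.
template <typename T>
constexpr std::size_t elements_per_chunk(std::size_t alignment) {
    return alignment / sizeof(T);
}
```

For a 16-byte aligned array of 4-byte integers, elements 0, 4, 8 and so on sit on 16-byte boundaries, while elements 1, 2 and 3 do not.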

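The real issue arises if you use a class as T, in which case you have to specify the alignment requirement in the class itself (again, this may differ depending on the compiler; the example is for GCC):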

class __attribute__ ((aligned (16))) Foo {
    __attribute__ ((aligned (16))) double u[2];
};
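Since C++11 the same requirement can also be expressed with the standard alignas specifier instead of a compiler-specific attribute. The sketch below uses a hypothetical AlignedFoo type and a check function invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portable counterpart of the GCC-specific class above.
struct alignas(16) AlignedFoo {
    double u[2];
};

// The compiler pads sizeof up to a multiple of the alignment, so every
// element of an array of AlignedFoo stays 16-byte aligned.
static_assert(alignof(AlignedFoo) == 16, "type must be 16-byte aligned");
static_assert(sizeof(AlignedFoo) % 16 == 0, "size must be a multiple of 16");

// Usage sketch: each element of a local array sits on a 16-byte boundary.
inline void check_foo_alignment() {
    AlignedFoo items[3] = {};
    for (const AlignedFoo& item : items) {
        assert(reinterpret_cast<std::uintptr_t>(&item) % 16 == 0);
    }
}
```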

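We are almost done! If you use Visual C++ (at least version 2010), you will not be able to use a std::vector with classes whose alignment you specified, because of std::vector::resize.

If you get the following error when compiling: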

C:\Program Files\Microsoft Visual Studio 10.0\VC\include\vector(870):
error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned

You will have to hack your stl::vector header file:

1. Locate the vector header file [C:\Program Files\Microsoft Visual Studio 10.0\VC\include\vector]
2. Locate the void resize( _Ty _Val ) method [line 870 on VC2010]
3. Change it to void resize( const _Ty& _Val ).

Instead of writing your own allocator, as suggested before, you can use boost::alignment::aligned_allocator with std::vector like this:

#include <vector>
#include <boost/align/aligned_allocator.hpp>
template <typename T>
using aligned_vector = std::vector<T, boost::alignment::aligned_allocator<T, 16>>;

Write your own allocator. allocate and deallocate are the important ones. Here is an example:

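// Member functions of an allocator class template with value type T.
// Requires <cstddef>, <cstdint> and <cstdlib>.
pointer allocate( size_type size, const void * pBuff = 0 )
{
    // size is an element count; guard the byte computation against overflow.
    if( size > ( SIZE_MAX - 16 ) / sizeof(T) )
        return NULL;
    // Over-allocate by 16 bytes so that an aligned address always fits.
    char * p = (char*)malloc( size * sizeof(T) + 16 );
    if( !p )
        return NULL;
    // Distance (1..16) from p to the next 16-byte boundary.
    // uintptr_t is used because casting a pointer to int truncates on 64-bit targets.
    std::size_t difference = 16 - ( (std::uintptr_t)p & 15 );
    p += difference;
    // Store the offset in the byte just before the aligned block.
    p[ -1 ] = (char)difference;
    return (T*)p;
}
void deallocate( pointer p, size_type num )
{
    // Read the stored offset back to recover the pointer malloc returned.
    char * pBuffer = (char*)p;
    free( (void*)( pBuffer - pBuffer[ -1 ] ) );
}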
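The pointer arithmetic above can also be delegated to std::align from <memory>, which is available since C++11. The following is a sketch of that alternative rather than the answer's own method; AlignedBlock and allocate_aligned are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical pair of pointers: `raw` is what malloc returned and must be
// passed to std::free; `aligned` is the usable, aligned address inside it.
struct AlignedBlock {
    void* raw;
    void* aligned;
};

// Over-allocates by `alignment` bytes and lets std::align find the first
// suitably aligned address. `alignment` must be a power of two.
inline AlignedBlock allocate_aligned(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    std::size_t space = bytes + alignment;
    void* raw = std::malloc(space);
    if (raw == nullptr) {
        return AlignedBlock{nullptr, nullptr};
    }
    void* cursor = raw;
    // std::align adjusts `cursor` and `space`; it returns nullptr only if the
    // request does not fit, which the over-allocation rules out here.
    void* aligned = std::align(alignment, bytes, cursor, space);
    return AlignedBlock{raw, aligned};
}
```

Keeping the raw pointer next to the aligned one avoids hiding the offset in the byte before the block, at the cost of carrying two pointers around.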

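Short answer:

Yes, if sizeof(T)*vector.size() > 16.
(Assuming your vector uses the normal allocator.)

Caveat: this holds only as long as alignof(std::max_align_t) >= 16, since that is the maximum alignment.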

Long answer:

Updated 25/Aug/2017 for the new standard draft n4659

If memory is aligned for anything greater than 16, it is also correctly aligned for 16.

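6.11 Alignment (paragraphs 4/5)

Alignments are represented as values of the type std::size_t. Valid alignments include only those values returned by an alignof expression for the fundamental types plus an additional implementation-defined set of values, which may be empty. Every alignment value shall be a non-negative integral power of two.
Alignments have an order from weaker to stronger or stricter alignments. Stricter alignments have larger alignment values. An address that satisfies an alignment requirement also satisfies any weaker valid alignment requirement.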

new and new[] return values that are aligned so that objects are correctly aligned for their size:

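8.3.4 New (paragraph 17)

[ Note: when the allocation function returns a value other than null, it must be a pointer to a block of storage in which space for the object has been reserved. The block of storage is assumed to be appropriately aligned and of the requested size. The address of the created object will not necessarily be the same as that of the block if the object is an array. (end note) ]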

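Note that most systems have a maximum alignment. Dynamically allocated memory does not need to be aligned to a value greater than this.

6.11 Alignment (paragraph 2)

A fundamental alignment is represented by an alignment less than or equal to the greatest alignment supported by the implementation in all contexts, which is equal to alignof(std::max_align_t) (21.2). The alignment required for a type might be different when it is used as the type of a complete object and when it is used as the type of a subobject.

Thus, as long as the memory allocated for your vector is greater than 16 bytes, it will be correctly aligned on 16-byte boundaries.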
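Whether the caveat about alignof(std::max_align_t) holds on a given platform can be checked directly. The sketch below is only a probe of the local implementation, not a proof of the general claim; default_alignment_covers_sse and plain_new_is_max_aligned are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// True when the platform's fundamental alignment already covers 16-byte SSE access.
constexpr bool default_alignment_covers_sse = alignof(std::max_align_t) >= 16;

// Allocates `bytes` with plain operator new and reports whether the block
// starts on an alignof(std::max_align_t) boundary.
inline bool plain_new_is_max_aligned(std::size_t bytes) {
    assert(bytes > 0);
    void* p = ::operator new(bytes);
    const bool ok =
        reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
    ::operator delete(p);
    return ok;
}
```

If default_alignment_covers_sse is false on the target, as it can be on some 32-bit toolchains, the default allocator should not be relied on for SSE data.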

A contemporary answer to an old (but important) question.

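As others have said, writing your own Allocator class [template] immediately comes to mind. From C++11 up to C++17, implementations were mostly restricted (by the standard) to using alignas and placement new. C++17 adopts C11's aligned_alloc, which is convenient. In addition, C++17's std::pmr namespace (header <memory_resource>) introduces the polymorphic_allocator class template and the memory_resource abstract interface for polymorphic allocation, heavily inspired by Boost. Apart from allowing truly generic, dynamic code, these have been shown to give speed improvements in some cases; in that case, your SIMD code would perform even better.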
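To make the std::pmr route a little more concrete, here is a minimal sketch of a memory resource that raises every request to a minimum alignment, so that a std::pmr::vector<float> ends up with SIMD-friendly storage. It assumes a C++17 standard library that ships <memory_resource>; min_align_resource and pmr_vector_is_aligned are hypothetical names, not part of the standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

// Hypothetical resource: forwards to an upstream resource, but never asks
// for less than `min_align` bytes of alignment.
class min_align_resource : public std::pmr::memory_resource {
public:
    explicit min_align_resource(
        std::size_t min_align,
        std::pmr::memory_resource* upstream = std::pmr::new_delete_resource())
        : min_align_(min_align), upstream_(upstream) {
        assert(min_align != 0 && (min_align & (min_align - 1)) == 0);
    }

private:
    std::size_t raised(std::size_t alignment) const {
        return alignment < min_align_ ? min_align_ : alignment;
    }
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        return upstream_->allocate(bytes, raised(alignment));
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        // Must pass the same raised alignment that was used to allocate.
        upstream_->deallocate(p, bytes, raised(alignment));
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }

    std::size_t min_align_;
    std::pmr::memory_resource* upstream_;
};

// Usage sketch: a pmr vector of float whose storage is 32-byte aligned.
inline bool pmr_vector_is_aligned() {
    min_align_resource resource(32);
    std::pmr::polymorphic_allocator<float> alloc(&resource);
    std::pmr::vector<float> values(100, 0.0f, alloc);
    return reinterpret_cast<std::uintptr_t>(values.data()) % 32 == 0;
}
```

The resource has to outlive every container that uses it, which is why it is declared before the vector in the usage function.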

Use __declspec(align(x,y)) as explained in Intel's vectorization tutorial: http://d3f8ykwhia686p.cloudfront.net/1live/intel/CompilerAutovectorizationGuide.pdf

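The standard requires new and new[] to return memory aligned for any data type, which should include SSE types. Whether MSVC actually follows that rule is another question.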