Eventual ARM Linux Memory Fragmentation with NEON Copy but not memcpy

Keywords: Copy, memcpy, NEON, ARM, Linux, memory fragmentation. Updated: 2023-10-16

I'm running Linux 4.4 on a BeagleBone X-15 (ARM Cortex-A15) board. My application mmaps the output of the SGX GPU and needs to copy the DRM backing store.

Both memcpy and my custom NEON copy code work, but the NEON code is much faster (~11 ms vs. ~35 ms).

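I've noticed that when I use the NEON version of the copy, Linux kills the application for Out of Memory (OOM) after 12500 seconds. When I run the application with that one line changed from the NEON copy to the standard memcpy, it runs indefinitely (12 hours so far), but the copy is slower.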

I've pasted the mmap, copy, and NEON copy code below. Is there really something wrong with my NEON copy?

NEON copy:

/**
* CompOpenGL neonCopyRGBAtoRGBA()
* Purpose: neonCopyRGBAtoRGBA - Software NEON copy
*
* @param src - Source buffer
* @param dst - Destination buffer
* @param numPix - Number of pixels to convert
*/
__attribute__((noinline)) void CompOpenGL::neonCopyRGBAtoRGBA(unsigned char* src, unsigned char* dst, int numPix)
{
    (void)src;
    (void)dst;
    (void)numPix;
    // This case takes RGBA -> BGRA
    __asm__ volatile(
                "mov r3, r3, lsr #3\n"           /* Divide number of pixels by 8 because we process them 8 at a time */
                "loopRGBACopy:\n"
                "vld4.8 {d0-d3}, [r1]!\n"        /* Load 8 pixels into d0 through d3. d0 = R[0-7], d1 = G[0-7], d2 = B[0-7], d3 = A[0-7] */
                "subs r3, r3, #1\n"              /* Decrement the loop counter */
                "vst4.8 {d0-d3}, [r2]!\n"        /* Store the RGBA into destination 8 pixels at a time */
                "bgt loopRGBACopy\n"
                "bx lr\n"
                );
}

The mmap and copy code:

union gbm_bo_handle handleUnion = gbm_bo_get_handle(m_Fb->bo);
struct drm_omap_gem_info gemInfo;
char *gpuMmapFrame = NULL;
gemInfo.handle = handleUnion.s32;
int ret = drmCommandWriteRead(m_DRMController->m_Fd, DRM_OMAP_GEM_INFO,&gemInfo, sizeof(gemInfo));
if (ret) {
    qDebug() << "Cannot set write/read";
}
else {
    // Mmap the frame
    gpuMmapFrame = (char *)mmap(0, gemInfo.size, PROT_READ | PROT_WRITE, MAP_SHARED,m_DRMController->m_Fd, gemInfo.offset);
    if ( gpuMmapFrame != MAP_FAILED ) {
        QElapsedTimer timer;
        timer.restart();
        
        //m_OGLController->neonCopyRGBAtoRGBA((uchar*)gpuMmapFrame,  (uchar*)m_cpyFrame,dmaBuf.width * dmaBuf.height);
        memcpy(m_cpyFrame,gpuMmapFrame,dmaBuf.height * dmaBuf.width * 4);
        
        qDebug() << "Copy Performance: " << timer.elapsed();

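The good news is that your function will run faster if you replace vld4/vst4 with vld1/vst1.

The bad news is that you must declare the registers you use and modify, including the CPSR and memory, and you should not return from inside inline assembly (the bx lr).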

__asm__ volatile(
                "mov r3, r3, lsr #3\n"           /* Divide number of pixels by 8 because we process them 8 at a time */
                "loopRGBACopy:\n"
                "vld1.8 {d0-d3}, [r1]!\n"        /* Load 32 bytes (8 RGBA pixels) into d0-d3; vld1 does not de-interleave the channels */
                "subs r3, r3, #1\n"              /* Decrement the loop counter */
                "vst1.8 {d0-d3}, [r2]!\n"        /* Store the same 32 bytes to the destination, 8 pixels at a time */
                "bgt loopRGBACopy\n"
                ::: "r1", "r2", "r3", "d0", "d1", "d2", "d3", "cc", "memory"
                );
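
The snippet above still assumes that src, dst and numPix happen to live in r1, r2 and r3 when the asm block starts, which the compiler does not guarantee. A minimal sketch of one way to avoid that assumption is to pass the values as named read-write operands and let the compiler choose the registers. The function name copyRGBA, the free-function form and the memcpy tail handling are illustrative assumptions, not part of the original code; the NEON path is only compiled on 32-bit ARM with NEON enabled, and other targets fall back to memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original post): copies numPix RGBA pixels
// (4 bytes each) from src to dst. The buffers must not overlap.
void copyRGBA(const std::uint8_t* src, std::uint8_t* dst, std::size_t numPix)
{
    assert(src != nullptr && dst != nullptr);
#if defined(__arm__) && defined(__ARM_NEON)
    std::size_t blocks = numPix >> 3;   // 8 pixels (32 bytes) per iteration
    if (blocks > 0) {
        __asm__ volatile(
            "1:\n"
            "vld1.8 {d0-d3}, [%[src]]!\n"   // load 32 bytes, post-increment src
            "subs %[n], %[n], #1\n"         // decrement the block counter, set flags
            "vst1.8 {d0-d3}, [%[dst]]!\n"   // store 32 bytes, post-increment dst
            "bgt 1b\n"                      // loop while blocks remain
            : [src] "+r"(src), [dst] "+r"(dst), [n] "+r"(blocks)  // read-write operands
            :
            : "d0", "d1", "d2", "d3", "cc", "memory");            // clobbers
    }
    // src and dst were advanced by the asm; copy the 0..7 leftover pixels.
    std::memcpy(dst, src, (numPix & 7) * 4);
#else
    // Fallback for targets without 32-bit ARM NEON.
    std::memcpy(dst, src, numPix * 4);
#endif
}
```

With this shape there is no bx lr inside the asm block: control falls out of the block and the compiler-generated epilogue returns normally. A call would look like copyRGBA(srcPtr, dstPtr, width * height), where the argument names are placeholders.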

http://www.ethernut.de/en/documents/arm-inline-asm.html
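
As an alternative to hand-written inline assembly, the same vld1/vst1 style copy can be sketched with NEON intrinsics from arm_neon.h, which leaves register allocation and the function return entirely to the compiler. This is only an illustration under stated assumptions: copyBytesNeon is a hypothetical name, the 32-byte step mirrors the d0-d3 loop above, and no claim is made that it matches the timings reported in the question. On targets without NEON the loop is compiled out and memcpy does all the work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the original post): copies numBytes bytes
// from src to dst. The buffers must not overlap.
void copyBytesNeon(const std::uint8_t* src, std::uint8_t* dst, std::size_t numBytes)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Move 32 bytes (8 RGBA pixels) per iteration using two 128-bit registers.
    for (; i + 32 <= numBytes; i += 32) {
        uint8x16_t lo = vld1q_u8(src + i);
        uint8x16_t hi = vld1q_u8(src + i + 16);
        vst1q_u8(dst + i, lo);
        vst1q_u8(dst + i + 16, hi);
    }
#endif
    // Copy whatever is left (everything, if NEON is unavailable).
    std::memcpy(dst + i, src + i, numBytes - i);
}
```

For a full frame, the byte count would be width * height * 4, as in the memcpy call in the question.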