NEON Optimization for Splitting YUYV into Planes
I want to learn how to use NEON to split YUYV into Y, U and V planes so that I can later hand the data to the GPU as OpenGL textures.
Currently, I do this in C++:
/**
* TopOpenGL splitYuvPlanes()
* Purpose: splitYuvPlanes - Split YUYV into 3 arrays - one for each component
*
* @param data - input data
* @param size - input data size
* @param y - output Y plane (size / 2 bytes)
* @param u - output U plane, each sample written twice (size / 2 bytes)
* @param v - output V plane, each sample written twice (size / 2 bytes)
*/
void TopOpenGL::splitYuvPlanes(unsigned char *data, int size, unsigned char *y, unsigned char *u, unsigned char *v)
{
    // This case takes RGBA -> BGRA
    // __asm__ volatile(
    // "mov r3, r3, lsr #3\n" /* Divide number of pixels by 8 because we process them 8 at a time */
    // "loopRGBA:\n"
    // "vld4.8 {d0-d3}, [r1]!\n" /* Load 8 pixels into d0 through d3. d0 = R[0-7], d1 = G[0-7], d2 = B[0-7], d3 = A[0-7] */
    // "subs r3, r3, #1\n" /* Decrement the loop counter */
    // "vswp d0, d2\n" /* Swap R and B channels */
    // "vst4.8 {d0-d3}, [r2]!\n" /* Store the RGBA into destination 8 pixels at a time */
    // "bgt loopRGBA\n"
    // "bx lr\n"
    // );
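    // Each 4-byte group Y0 U0 Y1 V0 covers two pixels; U0 and V0 are written
    // twice so that all three planes end up with one byte per pixel.
    // The bound is c + 4 <= size so that the final group is not skipped.
    for ( int c = 0 ; c + 4 <= size ; c += 4 ) {
        *y++ = *data++; // Y0
        *u++ = *data;   // U0
        *u++ = *data++; // U0 again
        *y++ = *data++; // Y1
        *v++ = *data;   // V0
        *v++ = *data++; // V0 again
    }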
}
How can I split this into char *y, char *u and char *v using NEON? Thanks a lot.
I found this blog post, but it is not quite what I am looking for: http://blog.lumberlabs.com/2011/04/efficiently-splitting-cbcr-plane-with.html
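The following code achieves the goal of splitting a YUYV frame into Y, U and V planes. Note that it targets 32-bit ARM (the structure offsets 0 through 16 rely on 4-byte pointers), expects the input size to be a non-zero multiple of 32 bytes, and relies on the arguments still being in r0 and r1 when the assembly starts.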
/// This structure is passed to ARM Assembly code
/// to split the YUV frame into separate planes for
/// OpenGL Consumption
typedef struct {
    uchar *input_data;
    uint32_t input_size;
    uchar *y_plane;
    uchar *u_plane;
    uchar *v_plane;
} yuvSplitStruct;
void TopOpenGL::splitYuvPlanes(yuvSplitStruct *yuvStruct)
{
    __asm__ volatile(
        "PUSH {r4}\n" /* Save callee-save registers R4 and R5 on the stack */
        "PUSH {r5}\n" /* r1 is the pointer to the input structure ( r0 is 'this' because c++ ) */
        "ldr r0 , [r1]\n" /* reuse r0 scratch register for the address of our frame input */
        "ldr r2 , [r1, #4]\n" /* use r2 scratch register to store the size in bytes of the YUYV frame */
        "ldr r3 , [r1, #8]\n" /* use r3 scratch register to store the destination Y plane address */
        "ldr r4 , [r1, #12]\n" /* use r4 register to store the destination U plane address */
        "ldr r5 , [r1, #16]\n" /* use r5 register to store the destination V plane address */
        "mov r2, r2, lsr #5\n" /* Divide number of bytes by 32 because we process 16 pixels at a time */
        "loopYUYV:\n"
        "vld4.8 {d0-d3}, [r0]!\n" /* Load 8 YUYV elements from our frame into d0-d3, increment frame pointer */
        "vst2.8 {d0,d2}, [r3]!\n" /* Store both Y elements into destination y plane, increment plane pointer */
        "vmov.F64 d0, d1\n" /* Duplicate U value */
        "vst2.8 {d0,d1}, [r4]!\n" /* Store both U elements into destination u plane, increment plane pointer */
        "vmov.F64 d1, d3\n" /* Duplicate V value */
        "vst2.8 {d1,d3}, [r5]!\n" /* Store both V elements into destination v plane, increment plane pointer */
        "subs r2, r2, #1\n" /* Decrement the loop counter */
        "bgt loopYUYV\n" /* Loop until entire frame is processed */
        "POP {r5}\n" /* Restore callee-save registers */
        "POP {r4}\n"
    );
}
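For comparison, the same load/store pattern can probably be written with the intrinsics from arm_neon.h instead of inline assembly, which leaves register allocation to the compiler. The following is only a sketch under a few assumptions: the free function name splitYuyvPlanes and its signature are made up for illustration and are not part of the code above, each output buffer is assumed to hold size / 2 bytes, and a scalar loop is added to handle any tail that is not a multiple of 32 bytes (and to let the code build on targets without NEON).

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (name and signature are assumptions, not from the post).
// Splits packed YUYV into Y, U and V planes, writing every U and V sample
// twice so that each plane holds one byte per pixel.
// size is the input length in bytes; y, u and v must each hold size / 2 bytes.
void splitYuyvPlanes(const uint8_t *data, size_t size,
                     uint8_t *y, uint8_t *u, uint8_t *v)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // 32 input bytes = 8 YUYV groups = 16 pixels per iteration.
    for (; i + 32 <= size; i += 32) {
        // De-interleaving load: val[0] = Y0s, val[1] = Us, val[2] = Y1s, val[3] = Vs.
        uint8x8x4_t in = vld4_u8(data + i);

        uint8x8x2_t yy, uu, vv;
        yy.val[0] = in.val[0]; yy.val[1] = in.val[2]; // Y0, Y1
        uu.val[0] = in.val[1]; uu.val[1] = in.val[1]; // U, U
        vv.val[0] = in.val[3]; vv.val[1] = in.val[3]; // V, V

        // vst2 interleaves the two registers again: 16 output bytes per plane.
        vst2_u8(y + i / 2, yy);
        vst2_u8(u + i / 2, uu);
        vst2_u8(v + i / 2, vv);
    }
#endif
    // Scalar path: handles the tail, or everything when NEON is unavailable.
    for (; i + 4 <= size; i += 4) {
        const size_t o = i / 2;
        y[o]     = data[i];      // Y0
        y[o + 1] = data[i + 2];  // Y1
        u[o]     = data[i + 1];  // U0
        u[o + 1] = data[i + 1];  // U0 again
        v[o]     = data[i + 3];  // V0
        v[o + 1] = data[i + 3];  // V0 again
    }
}
```

Calling it with the fields of yuvSplitStruct would look like splitYuyvPlanes(s->input_data, s->input_size, s->y_plane, s->u_plane, s->v_plane). Whether this is as fast as the hand-written assembly depends on the compiler, so it is worth measuring both.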