矩阵rowSums(）与colSums((）在R与Rcpp与Armadillo中的效率

Efficiency of matrix rowSums() vs. colSums() in R vs Rcpp vs Armadillo

本文关键字：Armadillo 效率 Rcpp rowSums colSums 矩阵更新时间：2023-10-16

背景

来自R编程，我正在使用Rcpp扩展到C/C++形式的编译代码。作为一个关于循环交换效果的实践练习(通常只是C/C++)，我为具有Rcpp的矩阵实现了R的rowSums()和colSums()函数的等价物(我知道这些函数以Rcpp糖和Armadillo的形式存在——这只是一个练习)。

问题

我在这个matsums.cpp文件中有我的rowSums()和colSums()的C++实现，以及Rcpp-sugar和arma::sum()版本。我的只是像这样的简单循环：

NumericVector Cpp_colSums(const NumericMatrix& x) {
int nr = x.nrow(), nc = x.ncol();
NumericVector ans(nc);
for (int j = 0; j < nc; j++) {
double sum = 0.0;
for (int i = 0; i < nr; i++) {
sum += x(i, j);
}
ans[j] = sum;
}
return ans;
}
NumericVector Cpp_rowSums(const NumericMatrix& x) {
int nr = x.nrow(), nc = x.ncol();
NumericVector ans(nr);
for (int j = 0; j < nc; j++) {
for (int i = 0; i < nr; i++) {
ans[i] += x(i, j);
}
}
return ans;
}

(R矩阵是以列为主存储的，所以外循环中的列应该是更有效的方法。这就是我最初测试的。)

在运行这些基准测试时，我遇到了一些我没有预料到的事情：行和列和之间有明显的性能差异(见下面的基准测试)：

使用内置的R函数，colSums()的速度大约是rowSums()的两倍
对于我自己的Rcpp和糖版本，情况正好相反：rowSums()的速度大约是colSums()的两倍
最后，加上Armadillo实现，运算大致相等(colsum可能会快一点，正如我所期望的那样，它们也在R中)

所以我的主要问题是：为什么Cpp_rowSums()明显快于Cpp_colSums()

作为次要的兴趣，我也很好奇为什么在R实现中会出现相同的差异。(我浏览了C源代码，但无法真正找出显著的差异。)(第三，为什么Armadillo的性能相同？)

基准

我在10,000 x 10,000对称矩阵上测试了这两个函数的所有4个实现

Rcpp::sourceCpp("matsums.cpp")
set.seed(92136)
n <- 1e4 # build n x n test matrix
x <- matrix(rnorm(n), 1, n)
x <- crossprod(x, x) # symmetric
bench::mark(
rowSums(x),
colSums(x),
Cpp_rowSums(x),
Cpp_colSums(x),
Sugar_rowSums(x),
Sugar_colSums(x),
Arma_rowSums(x),
Arma_colSums(x)
)[, 1:7]
#> # A tibble: 8 x 7
#>   expression            min     mean   median      max `itr/sec` mem_alloc
#>   <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 rowSums(x)        192.2ms  207.9ms  194.6ms  236.9ms      4.81    78.2KB
#> 2 colSums(x)         93.4ms   97.2ms   96.5ms  101.3ms     10.3     78.2KB
#> 3 Cpp_rowSums(x)     73.5ms   76.3ms     76ms   80.4ms     13.1     80.7KB
#> 4 Cpp_colSums(x)    126.5ms  127.6ms  126.8ms  130.3ms      7.84    80.7KB
#> 5 Sugar_rowSums(x)   73.9ms   75.6ms   74.3ms   79.4ms     13.2     80.7KB
#> 6 Sugar_colSums(x)  124.2ms  125.8ms  125.6ms  127.9ms      7.95    80.7KB
#> 7 Arma_rowSums(x)    73.2ms   74.7ms   73.9ms   79.3ms     13.4     80.7KB
#> 8 Arma_colSums(x)    62.8ms   64.4ms   63.7ms   69.6ms     15.5     80.7KB

(同样，您可以在此处找到C++源文件matsums.cpp。)

平台：

> sessioninfo::platform_info()
setting  value                       
version  R version 3.5.1 (2018-07-02)
os       Windows >= 8 x64            
system   x86_64, mingw32             
ui       RStudio                     
language (EN)                        
collate  English_United States.1252  
tz       Europe/Helsinki             
date     2018-08-09

更新

进一步研究，我还使用R的传统C接口编写了相同的函数：源代码在这里。我用R CMD SHLIB编译了这些函数，并再次测试：行和比列和快的相同现象仍然存在(基准测试)。然后，我还研究了objdump的反汇编，但在我看来(由于我对asm的理解非常有限)，编译器也没有真正优化C代码的主循环体(行、列)？

首先，让我在笔记本电脑上显示时间统计数据。我使用了一个5000 x 5000的矩阵，这足以进行基准测试，microbenchmark包用于100次评估。

Unit: milliseconds
expr       min        lq      mean    median        uq       max
colSums(x)  71.40671  71.64510  71.80394  71.72543  71.80773  75.07696
Cpp_colSums(x)  71.29413  71.42409  71.65525  71.48933  71.56241  77.53056
Sugar_colSums(x)  73.05281  73.19658  73.38979  73.25619  73.31406  76.93369
Arma_colSums(x)  39.08791  39.34789  39.57979  39.43080  39.60657  41.70158
rowSums(x) 177.33477 187.37805 187.57976 187.49469 187.73155 194.32120
Cpp_rowSums(x)  54.00498  54.37984  54.70358  54.49165  54.73224  64.16104
Sugar_rowSums(x)  54.17001  54.38420  54.73654  54.56275  54.75695  61.80466
Arma_rowSums(x)  49.54407  49.77677  50.13739  49.90375  50.06791  58.29755

R核心中的C代码并不总是比我们自己写的更好。CCD_ 18比CCD_。我觉得自己没有能力解释为什么R的版本比它应该的慢。我将专注于：我们如何进一步优化我们自己的colSums和rowSums来击败Armadillo。请注意，我编写C，使用R的旧C接口，并使用R CMD SHLIB进行编译。

`colSums`和`rowSums`之间有实质性差异吗

如果我们有一个比CPU缓存容量大得多的n x n矩阵，colSums将n x n数据从RAM加载到缓存，但rowSums加载的数据是它的两倍，即2 x n x n。

想一想保存总和的结果向量：这个长度n向量从RAM加载到缓存中的次数？对于colSums，它只加载一次，但对于rowSums，它加载n次。每次向其中添加矩阵列时，它都会加载到缓存中，但由于太大而被逐出。

对于大型n:

colSums导致n x n + n数据从RAM加载到缓存
CCD_ 37导致CCD_ 38数据从RAM加载到高速缓存

换句话说，理论上rowSums的记忆效率较低，而且可能较慢。

如何提高`colSums`的性能

由于RAM和高速缓存之间的数据流很容易优化，因此唯一的改进是循环展开。将内部环路(求和环路)展开深度2就足够了，我们将看到2倍的提升。

循环展开的工作原理是启用CPU的指令管道。如果我们每次迭代只做一次加法，就不可能进行流水线操作；通过两次添加，这种指令级并行性开始发挥作用。我们也可以将循环展开深度为4，但我的经验是，深度为2的展开足以从循环展开中获得大部分好处。

如何提高`rowSums`的性能

优化数据流是第一步。我们需要首先进行缓存阻塞，以减少从2 x n x n到n x n的数据传输。

将这个n x n矩阵分割成多个行块：每个行块都是2040 x n(最后一个块可能更小)，然后逐块应用普通的rowSums。对于每个块，累加器向量的长度为2040，大约是32KB CPU缓存所能容纳的长度的一半。对于添加到此累加器向量的矩阵列，另一半被反转。通过这种方式，累加器矢量可以被保持在高速缓存中，直到该块中的所有矩阵列都被处理为止。因此，累加器矢量只被加载到缓存中一次，因此整体内存性能与colSums一样好。

现在，我们可以进一步对每个块中的rowSums应用循环展开。将外环和内环都展开2深，我们将看到提升。一旦展开外循环，块大小应该减少到1360，因为现在我们需要缓存中的空间来在每次外循环迭代中容纳三个长度为1360的向量。

C代码：让我们打败Armadillo

使用循环展开编写代码可能是一项令人讨厌的工作，因为我们现在需要为一个函数编写几个不同的版本。

对于colSums，我们需要两个版本：

colSums_1x1：内外循环都以深度1展开，即这是一个没有循环展开的版本
colSums_2x1：不展开外环，展开深度为2的内环

对于rowSums，我们最多可以有四个版本，rowSums_sxt，其中s = 1 or 2是内环的展开深度，t = 1 or 2是外环的展开深度。

如果我们一个接一个地编写每个版本，代码编写可能会非常乏味。在经历了多年或对此感到沮丧之后；"自动代码/版本生成"；使用内联模板函数和宏的技巧。

#include <stdlib.h>
#include <Rinternals.h>
static inline void colSums_template_sx1 (size_t s,
double *A, size_t LDA,
size_t nr, size_t nc,
double *sum) {
size_t nrc = nr % s, i;
double *A_end = A + LDA * nc, a0, a1;
for (; A < A_end; A += LDA) {
a0 = 0.0; a1 = 0.0;  // accumulator register variables
if (nrc > 0) a0 = A[0];  // is there a "fractional loop"?
for (i = nrc; i < nr; i += s) {  // main loop of depth-s
a0 += A[i];  // 1st iteration
if (s > 1) a1 += A[i + 1];  // 2nd iteration
}
if (s > 1) a0 += a1;  // combine two accumulators
*sum++ = a0;  // write-back
}
}
#define macro_define_colSums(s, colSums_sx1) 
SEXP colSums_sx1 (SEXP matA) { 
double *A = REAL(matA); 
size_t nrow_A = (size_t)nrows(matA); 
size_t ncol_A = (size_t)ncols(matA); 
SEXP result = PROTECT(allocVector(REALSXP, ncols(matA))); 
double *sum = REAL(result); 
colSums_template_sx1(s, A, nrow_A, nrow_A, ncol_A, sum); 
UNPROTECT(1); 
return result; 
}
macro_define_colSums(1, colSums_1x1)
macro_define_colSums(2, colSums_2x1)

模板函数为具有LDA(a的前导维度)行的矩阵A计算(在R-语法中)sum <- colSums(A[1:nr, 1:nc])。参数s是对内部循环展开的版本控制。模板函数乍一看很可怕，因为它包含许多if。但是，它被声明为static inline。如果它是通过将已知常数1或2传递给s来调用的，则优化编译器能够在编译时评估这些if，消除不可访问的代码并丢弃"0"；设置但未使用"；变量(寄存器已初始化、修改但未写回RAM的变量)。

该宏用于函数声明。它接受一个常量s和一个函数名，生成一个具有所需循环展开版本的函数。

以下内容适用于rowSums。

static inline void rowSums_template_sxt (size_t s, size_t t,
double *A, size_t LDA,
size_t nr, size_t nc,
double *sum) {
size_t ncr = nc % t, nrr = nr % s, i;
double *A_end = A + LDA * nc, *B;
double a0, a1;
for (i = 0; i < nr; i++) sum[i] = 0.0;  // necessary initialization
if (ncr > 0) {  // is there a "fractional loop" for the outer loop?
if (nrr > 0) sum[0] += A[0];  // is there a "fractional loop" for the inner loop?
for (i = nrr; i < nr; i += s) {  // main inner loop with depth-s
sum[i] += A[i];
if (s > 1) sum[i + 1] += A[i + 1];
}
A += LDA;
}
for (; A < A_end; A += t * LDA) {  // main outer loop with depth-t
if (t > 1) B = A + LDA;
if (nrr > 0) {  // is there a "fractional loop" for the inner loop?
a0 = A[0]; if (t > 1) a0 += A[LDA];
sum[0] += a0;
}
for(i = nrr; i < nr; i += s) {  // main inner loop with depth-s
a0 = A[i]; if (t > 1) a0 += B[i];
sum[i] += a0;
if (s > 1) {
a1 = A[i + 1]; if (t > 1) a1 += B[i + 1];
sum[i + 1] += a1;
}
}
}
}
#define macro_define_rowSums(s, t, rowSums_sxt) 
SEXP rowSums_sxt (SEXP matA, SEXP chunk_size) { 
double *A = REAL(matA); 
size_t nrow_A = (size_t)nrows(matA); 
size_t ncol_A = (size_t)ncols(matA); 
SEXP result = PROTECT(allocVector(REALSXP, nrows(matA))); 
double *sum = REAL(result); 
size_t block_size = (size_t)asInteger(chunk_size); 
size_t i, block_size_i; 
if (block_size > nrow_A) block_size = nrow_A; 
for (i = 0; i < nrow_A; i += block_size_i) { 
block_size_i = nrow_A - i; if (block_size_i > block_size) block_size_i = block_size; 
rowSums_template_sxt(s, t, A, nrow_A, block_size_i, ncol_A, sum); 
A += block_size_i; sum += block_size_i; 
} 
UNPROTECT(1); 
return result; 
}
macro_define_rowSums(1, 1, rowSums_1x1)
macro_define_rowSums(1, 2, rowSums_1x2)
macro_define_rowSums(2, 1, rowSums_2x1)
macro_define_rowSums(2, 2, rowSums_2x2)

注意，模板函数现在接受s和t，并且要由宏定义的函数已经应用了行分块。

尽管我在代码中留下了一些注释，但代码可能仍然不容易理解，但我无法花更多的时间来更详细地解释。

要使用它们，请将它们复制并粘贴到名为"的C文件中；matSums.c"；并用CCD_ 68进行编译。

对于R侧，在"R"中定义以下函数；matSums.R〃；。

colSums_zheyuan <- function (A, s) {
dyn.load("matSums.so")
if (s == 1) result <- .Call("colSums_1x1", A)
if (s == 2) result <- .Call("colSums_2x1", A)
dyn.unload("matSums.so")
result
}
rowSums_zheyuan <- function (A, chunk.size, s, t) {
dyn.load("matSums.so")
if (s == 1 && t == 1) result <- .Call("rowSums_1x1", A, as.integer(chunk.size))
if (s == 2 && t == 1) result <- .Call("rowSums_2x1", A, as.integer(chunk.size))
if (s == 1 && t == 2) result <- .Call("rowSums_1x2", A, as.integer(chunk.size))
if (s == 2 && t == 2) result <- .Call("rowSums_2x2", A, as.integer(chunk.size))
dyn.unload("matSums.so")
result
}

现在让我们有一个基准，再次使用5000 x 5000矩阵。

A <- matrix(0, 5000, 5000)
library(microbenchmark)
source("matSums.R")
microbenchmark("col0" = colSums(A),
"col1" = colSums_zheyuan(A, 1),
"col2" = colSums_zheyuan(A, 2),
"row0" = rowSums(A),
"row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
"row2" = rowSums_zheyuan(A, 2040, 1, 1),
"row3" = rowSums_zheyuan(A, 1360, 1, 2),
"row4" = rowSums_zheyuan(A, 1360, 2, 2))

在我的笔记本电脑上我得到：

Unit: milliseconds
expr       min        lq      mean    median        uq       max neval
col0  65.33908  71.67229  71.87273  71.80829  71.89444 111.84177   100
col1  67.16655  71.84840  72.01871  71.94065  72.05975  77.84291   100
col2  35.05374  38.98260  39.33618  39.09121  39.17615  53.52847   100
row0 159.48096 187.44225 185.53748 187.53091 187.67592 202.84827   100
row1  49.65853  54.78769  54.78313  54.92278  55.08600  60.27789   100
row2  49.42403  54.56469  55.00518  54.74746  55.06866  60.31065   100
row3  37.43314  41.57365  41.58784  41.68814  41.81774  47.12690   100
row4  34.73295  37.20092  38.51019  37.30809  37.44097  99.28327   100

注意循环展开是如何加快colSums和rowSums的速度的。通过充分优化("col2"answers"row4")，我们击败了Armadillo(见本答案开头的时间表)。

在这种情况下，行分块策略显然没有带来好处。让我们尝试一个包含数百万行的矩阵。

A <- matrix(0, 1e+7, 20)
microbenchmark("row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
"row2" = rowSums_zheyuan(A, 2040, 1, 1),
"row3" = rowSums_zheyuan(A, 1360, 1, 2),
"row4" = rowSums_zheyuan(A, 1360, 2, 2))

我得到

Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
row1 604.7202 607.0256 617.1687 607.8580 609.1728 720.1790   100
row2 514.7488 515.9874 528.9795 516.5193 521.4870 636.0051   100
row3 412.1884 413.8688 421.0790 414.8640 419.0537 525.7852   100
row4 377.7918 379.1052 390.4230 379.9344 386.4379 476.9614   100

在这种情况下，我们观察到缓存阻塞带来的好处。

最后的想法

基本上，这个答案已经解决了所有问题，除了以下问题：

为什么R的rowSums比它应该的效率低
为什么在没有任何优化的情况下rowSums("row1")比colSums("col1")快

同样，我无法解释第一个，实际上我不在乎，因为我们可以轻松地编写一个比R的内置版本更快的版本。

第二个绝对值得追求。我在我们的讨论室里把我的评论抄下来作为记录。

这个问题归结为："为什么一个向量加起来比两个向量元素加起来慢">
我不时看到类似的现象。我第一次遇到这种奇怪的行为是几年前，我编写了自己的矩阵矩阵乘法。我发现DAXPY比DDOT更快。
DAXPY这样做：y += a * x，其中x和y是矢量，a是标量；DDOT这样做：a += x * y。
鉴于DDOT是一种还原操作，我预计它比DAXPY更快。但不，DAXPY更快。
然而，一旦我在矩阵乘法的三重循环嵌套中展开循环，DDOT就比DAXPY快得多。
你的问题也发生了类似的事情。缩减操作：a = x[1] + x[2] + ... + x[n]比按元素添加：y[i] += x[i]慢。但一旦完成循环展开，后者的优势就会丧失。
我不确定以下解释是否属实，因为我没有证据。
归约运算具有依赖链，因此计算是严格串行的；另一方面，元素操作没有依赖链，因此CPU可以用它做得更好。
一旦我们展开循环，每次迭代都有更多的算法要做，CPU在这两种情况下都可以进行流水线操作。然后可以观察到还原操作的真正优点。

回复Jaap关于使用`matrixStats`中的`rowSums2`和`colSums2`

仍然使用上面的5000 x 5000示例。

A <- matrix(0, 5000, 5000)
library(microbenchmark)
source("matSums.R")
library(matrixStats)  ## NEW
microbenchmark("col0" = base::colSums(A),
"col*" = matrixStats::colSums2(A),  ## NEW
"col1" = colSums_zheyuan(A, 1),
"col2" = colSums_zheyuan(A, 2),
"row0" = base::rowSums(A),
"row*" = matrixStats::rowSums2(A),  ## NEW
"row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
"row2" = rowSums_zheyuan(A, 2040, 1, 1),
"row3" = rowSums_zheyuan(A, 1360, 1, 2),
"row4" = rowSums_zheyuan(A, 1360, 2, 2))
Unit: milliseconds
expr       min        lq      mean    median        uq       max neval
col0  71.53841  71.72628  72.13527  71.81793  71.90575  78.39645   100
col*  75.60527  75.87255  76.30752  75.98990  76.18090  87.07599   100
col1  71.67098  71.86180  72.06846  71.93872  72.03739  77.87816   100
col2  38.88565  39.03980  39.57232  39.08045  39.16790  51.39561   100
row0 187.44744 187.58121 188.98930 187.67168 187.86314 206.37662   100
row* 158.08639 158.26528 159.01561 158.34864 158.62187 174.05457   100
row1  54.62389  54.81724  54.97211  54.92394  55.04690  56.33462   100
row2  54.15409  54.44208  54.78769  54.59162  54.76073  60.92176   100
row3  41.43393  41.63886  42.57511  41.73538  41.81844 111.94846   100
row4  37.07175  37.25258  37.45033  37.34456  37.47387  43.14157   100

我看不出rowSums2和colSums2的性能优势。

"为什么Cpp_rowSums()比Cpp_colSums(。

当你访问"column-major"时，预取器很难预测你接下来需要什么，所以它不会像以前那样高效地(如果有的话)提前将东西填充到缓存中——这会减慢你的速度。

CPU喜欢对数据的线性访问。如果你不做他们喜欢的事情，你就要为缓存未命中和主内存访问延迟付出代价。

有一个非常好的R包，名为Rfast(此处)，它提供了行/列和和均值等的C++实现。刚刚试用过，它比base包中的各个默认函数快得多。

矩阵rowSums(）与colSums((）在R与Rcpp与Armadillo中的效率

Efficiency of matrix rowSums() vs. colSums() in R vs Rcpp vs Armadillo

背景

问题

基准

更新

colSums和rowSums之间有实质性差异吗

如何提高colSums的性能

如何提高rowSums的性能