使C++特征LU更快(我的测试显示比GSL慢2倍)
Making C++ Eigen LU faster (my tests show 2x slower than GSL)
我将特征的LU分解/求解与GSL进行比较,发现特征在g++/OSX上与-O3优化相比慢了2倍。我隔离了分解和求解的时间,但发现两者都比它们的GSL同行慢得多。我是在做一些愚蠢的事情,还是Eigen在这个用例中表现不佳(例如,非常小的系统?)使用Eigen 3.2.8和旧的GSL 1.15构建。测试用例非常做作,但反映了我正在编写的一些非线性拟合软件中的结果——对于总的LU/求解操作,Eigen的速度在1.5x-2x+之间。
#define NDEBUG
#include "sys/time.h"
#include "gsl/gsl_linalg.h"
#include <Eigen/LU>
// Ax=b is a 3x3 system for which soln is x=[8,2,3]
//
double avals_col[9] = { 4, 2, 2, 7, 5, 5, 7, 5, 9 };
// col major
double avals_row[9] = { 4, 7, 7, 2, 5, 5, 2, 5, 9 };
// row major
double bvals[9] = { 67, 41, 53 };
//----------- helpers
void print_solution( double *x, int dim, char *which ) {
printf( "%s solve for x:n", which );
for( int j=0; j<3; j++ ) {
printf( "%g ", x[j] );
}
printf( "n" );
}
struct timeval tv;
struct timezone tz;
double timeNow() {
gettimeofday( &tv, &tz );
int _mils = tv.tv_usec/1000;
int _secs = tv.tv_sec;
return (double)_secs + ((double)_mils/1000.0);
}
//-----------
void run_gsl( double *A, double *b, double *x, int dim, int reps ) {
gsl_matrix_view gslA;
gsl_vector_view gslB;
gsl_vector_view gslX;
gsl_permutation *gslP;
int sign;
gslA = gsl_matrix_view_array( A, dim, dim );
gslP = gsl_permutation_alloc( dim );
gslB = gsl_vector_view_array( b, dim );
gslX = gsl_vector_view_array( x, dim );
int err;
double t, elapsed;
t = timeNow();
for( int i=0; i<reps; i++ ) {
// gsl overwrites A during decompose, so we must copy the initial A each time.
memcpy( A, avals_row, sizeof(avals_row) );
err = gsl_linalg_LU_decomp( &gslA.matrix, gslP, &sign );
}
elapsed = timeNow() - t;
printf( "GSL decompose (%dx) time = %gn", reps, elapsed );
t = timeNow();
for( int i=0; i<reps; i++ ) {
err = gsl_linalg_LU_solve( &gslA.matrix, gslP, &gslB.vector, &gslX.vector );
}
elapsed = timeNow() - t;
printf( "GSL solve (%dx) time = %gn", reps, elapsed );
gsl_permutation_free( gslP );
}
void run_eigen( double *A, double *b, double *x, int dim, int reps ) {
Eigen::PartialPivLU<Eigen::MatrixXd> eigenA_lu;
Eigen::Map< Eigen::Matrix < double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor > > ma( A, dim, dim );
Eigen::Map<Eigen::MatrixXd> mb( b, dim, 1 );
int err;
double t, elapsed;
t = timeNow();
for( int i=0; i<reps; i++ ) {
// This memcpy is not necessary for Eigen, which does not overwrite A in the
// decompose, but do it so that the time is directly comparable to GSL.
memcpy( A, avals_col, sizeof(avals_col) );
eigenA_lu.compute( ma );
}
elapsed = timeNow() - t;
printf( "Eigen decompose (%dx) time = %gn", reps, elapsed );
t = timeNow();
Eigen::VectorXd _x;
for( int i=0; i<reps; i++ ) {
_x = eigenA_lu.solve( mb );
}
elapsed = timeNow() - t;
printf( "Eigen solve (%dx) time = %gn", reps, elapsed );
// copy soln to passed x
for( int i=0; i<dim; i++ ) {
x[i] = _x(i);
}
}
int main() {
// solve a 3x3 system many times
double A[9], b[3], x[3];
int dim = 3;
int reps = 1000000;
memcpy( b, bvals, sizeof(bvals) );
// init b vector, A is copied multiple times in run_gsl/run_eigen
run_eigen( A, b, x, dim, reps );
print_solution( x, dim, "Eigen" );
run_gsl( A, b, x, dim, reps );
print_solution( x, dim, "GSL" );
return 0;
}
这会产生,例如:
Eigen decompose (1000000x) time = 0.242
Eigen solve (1000000x) time = 0.108
Eigen solve for x:
8 2 3
GSL decompose (1000000x) time = 0.049
GSL solve (1000000x) time = 0.075
GSL solve for x:
8 2 3
您的基准测试并不公平,因为您在Eigen版本中复制了两次输入矩阵:一次是通过memcpy
手动复制,另一次是在PartialPivLU
内复制。您还通过将mb声明为Map<Eigen::Vectord>
,让Eigen知道mb是一个向量。然后我得到(GCC5,-O3,特征3.3):
Eigen decompose (1000000x) time = 0.087
Eigen solve (1000000x) time = 0.036
Eigen solve for x:
8 2 3
GSL decompose (1000000x) time = 0.032
GSL solve (1000000x) time = 0.062
GSL solve for x:
8 2 3
此外,Eigen的PartialPivLU并不是真正为这种极小的矩阵设计的(见下文)。对于3x3矩阵,最好显式计算逆(对于4x4以下的矩阵,通常是这样,但对于较大的矩阵则不然!)。在这种情况下,您必须在编译时修复大小:
Eigen::PartialPivLU<Eigen::Matrix3d> eigenA_lu;
Eigen::Map<Eigen::Matrix3d> ma(avals_col);
Eigen::Map<Eigen::Vector3d> mb(b);
Eigen::Matrix3d inv;
Eigen::Vector3d _x;
double t, elapsed;
t = timeNow();
for( int i=0; i<reps; i++ ) {
inv = ma.inverse();
}
elapsed = timeNow() - t;
printf( "Eigen decompose (%dx) time = %gn", reps, elapsed );
t = timeNow();
for( int i=0; i<reps; i++ ) {
_x.noalias() = inv * mb;
}
elapsed = timeNow() - t;
printf( "Eigen solve (%dx) time = %gn", reps, elapsed );
这给了我:
Eigen inverse and solve (1000000x) time = 0.0209999
Eigen solve (1000000x) time = 0.000999928
如此之快。
现在,如果我们尝试一个更大的问题,比如3000 x 3000,我们会得到不止一个数量级的差异,有利于特征:
Eigen decompose (1x) time = 0.411
GSL decompose (1x) time = 6.073
这通常是为大型问题提供这种性能的优化,同时也为非常小的矩阵引入了一些开销。
相关文章:
- 比较并显示使用最小值(a,b)和最大值(a、b)升序排列的4个数字
- C++,OpenCV,尝试显示图像时"OpenCV(4.3.0) Error: Assertion failed (size.width>0 && size.height>0)"此错误
- 字符串-C++后显示的随机字符
- 继承期间显示未知行为的子类
- 仅使用绝对值对数组进行排序,并在C++中显示实际值
- 程序崩溃并显示"std::out_of_range"错误
- 如何在C++中用std::cout正确显示带十六进制的字符串文本
- 为什么在C的循环中使用printf的Rust代码不显示输出,而在C++的循环中显示std::cout
- 从数据库实时显示QT c++中的数据
- 如何处理来自核心指南检查器的关于gsl::at的静态分析警告
- 当使用比格式支持的精度更高的精度来显示数字时,会写出什么数据
- 显示错误输出的简单数组排序程序
- Qt自定义QPush按钮未显示在布局上
- C++射线示踪剂ppm表示没有足够的数据来显示图像
- 显示基于用户输入的整数的字符
- 使用QTreeView,如何通过调用函数只突出显示特定的行/列
- 应用程序崩溃并显示"symbol _ZdlPvm, version Qt_5 not defined in file libQt5Core.so.5 with link time reference"
- 将gsl c++程序与"英特尔MKL"链接
- 示例外壳应用程序显示的 V8 "segmentation fault (core dumped)"错误
- 使C++特征LU更快(我的测试显示比GSL慢2倍)