A C++ version of the %in% operator in R

Keywords: C++, version, operator, %in%. Updated: 2023-10-16

Is there any function in C++ equivalent to the %in% operator in R? Consider the following R command:

which(y %in% x)

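I tried to find something equivalent in C++ (specifically in Armadillo), but I could not find anything. I then wrote my own function, which is very slow compared to the R command above.

Here is what I wrote: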

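#include <RcppArmadillo.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
arma::uvec myInOperator(arma::vec myBigVec, arma::vec mySmallVec) {
  // Positions in myBigVec equal to the first element of mySmallVec.
  // Note: arma::find returns 0-based indices, unlike which() in R.
  arma::uvec rslt = arma::find(myBigVec == mySmallVec[0]);
  // Repeat for each remaining element and merge the results.
  for (arma::uword i = 1; i < mySmallVec.size(); i++) {
    arma::uvec rslt_tmp = arma::find(myBigVec == mySmallVec[i]);
    rslt = arma::unique(arma::join_cols(rslt, rslt_tmp));
  }
  return rslt;
}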

Now, after sourcing the code above, we have:

library(rbenchmark)  # provides benchmark()
x <- 1:4
y <- 1:10
res <- benchmark(myInOperator(y, x), which(y %in% x),
                 columns = c("test", "replications", "elapsed",
                             "relative", "user.self", "sys.self"),
                 order = "relative")

Here are the results:

                test replications elapsed relative user.self sys.self
2    which(y %in% x)          100   0.001        1     0.001        0
1 myInOperator(y, x)          100   0.002        2     0.001        0

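Can anyone point me to C++ code corresponding to which(y %in% x), or help me make my code more efficient? The elapsed time is already very small for both functions. I suppose what I mean by efficiency is more from a programming perspective: whether the way I approached the problem and the commands I used are efficient.

I appreciate your help.

Edit: Thanks to @MatthewLundberg and @Yakk for catching my silly mistake.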

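If all you really want is faster matching, you should take a look at Simon Urbanek's fastmatch package. However, Rcpp does in fact have a sugar function, in, that can be used here. in uses some of the ideas from the fastmatch package and incorporates them into Rcpp. I also compare against @hadley's solution here.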

// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
#include <algorithm>      // std::partial_sort_copy, std::lower_bound
#include <unordered_set>
#include <vector>
using namespace Rcpp;

// Rcpp sugar: in() returns a logical vector; collect the 1-based TRUE positions.
// [[Rcpp::export]]
std::vector<int> sugar_in(IntegerVector x, IntegerVector y) {
  LogicalVector ind = in(x, y);
  int n = ind.size();
  std::vector<int> output;
  output.reserve(n);
  for (int i = 0; i < n; ++i) {
    if (ind[i]) output.push_back(i + 1);
  }
  return output;
}

// Hash-set lookup: build a set from y once, then test each element of x.
// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i = 0; i < nx; ++i) {
    if (z.find(x[i]) != z.end()) {
      output.push_back(i + 1);
    }
  }
  return output;
}

// Binary search in a sorted copy of y.
// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy(y.begin(), y.end(), y_sort.begin(), y_sort.end());
  int nx = x.size();
  std::vector<int> out;
  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      std::lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    // lower_bound returns the first element >= x[i], so also test for equality.
    if (found != y_sort.end() && *found == x[i]) {
      out.push_back(i + 1);
    }
  }
  return out;
}
/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical(sugar_in(y, x), which(y %in% x))
identical(which_in(y, x), which(y %in% x))
identical(which_in2(y, x), which(y %in% x))
microbenchmark(
  sugar_in(y, x),
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/

Calling sourceCpp on this, the benchmark gives me:

Unit: microseconds
            expr    min      lq  median      uq    max neval
  sugar_in(y, x)  7.590 10.0795 11.4825 14.3630 32.753   100
  which_in(y, x) 40.757 42.4460 43.4400 46.8240 63.690   100
 which_in2(y, x) 14.325 15.2365 16.7005 17.2620 30.580   100
 which(y %in% x) 17.070 21.6145 23.7070 29.0105 78.009   100
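
For readers who want to try the hash-set idea outside of R, here is a minimal standalone sketch. It assumes plain std::vector<int> inputs instead of Rcpp's IntegerVector, and the name which_in_std is hypothetical (it is not part of Rcpp or any package mentioned here).

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper: 1-based positions of the elements of x that also
// occur in table, mirroring which(x %in% table) in R.
std::vector<int> which_in_std(const std::vector<int>& x,
                              const std::vector<int>& table) {
    // Build the hash set once from the lookup table.
    const std::unordered_set<int> seen(table.begin(), table.end());

    std::vector<int> out;
    out.reserve(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        // count() is an average constant-time membership test.
        if (seen.count(x[i]) > 0) {
            out.push_back(static_cast<int>(i) + 1);  // R indices start at 1
        }
    }
    return out;
}
```

The set is built once and x is scanned once, so the work grows with the sum of the two lengths rather than their product. As the timings above suggest, the constant cost of hashing can still dominate on small inputs.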

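For this set of inputs, we can get better performance by using an approach that technically has higher algorithmic complexity (O(log n) versus O(1) per lookup) but a lower constant factor: binary search.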

// [[Rcpp::plugins("cpp11")]]
#include <Rcpp.h>
#include <algorithm>      // std::partial_sort_copy, std::lower_bound
#include <unordered_set>
#include <vector>
using namespace Rcpp;

// Hash-set lookup: build a set from y once, then test each element of x.
// [[Rcpp::export]]
std::vector<int> which_in(IntegerVector x, IntegerVector y) {
  int nx = x.size();
  std::unordered_set<int> z(y.begin(), y.end());
  std::vector<int> output;
  output.reserve(nx);
  for (int i = 0; i < nx; ++i) {
    if (z.find(x[i]) != z.end()) {
      output.push_back(i + 1);
    }
  }
  return output;
}

// Binary search in a sorted copy of y.
// [[Rcpp::export]]
std::vector<int> which_in2(IntegerVector x, IntegerVector y) {
  std::vector<int> y_sort(y.size());
  std::partial_sort_copy(y.begin(), y.end(), y_sort.begin(), y_sort.end());
  int nx = x.size();
  std::vector<int> out;
  for (int i = 0; i < nx; ++i) {
    std::vector<int>::iterator found =
      std::lower_bound(y_sort.begin(), y_sort.end(), x[i]);
    // lower_bound returns the first element >= x[i], so also test for equality.
    if (found != y_sort.end() && *found == x[i]) {
      out.push_back(i + 1);
    }
  }
  return out;
}
/*** R
set.seed(123)
library(microbenchmark)
x <- sample(1:100)
y <- sample(1:10000, 1000)
identical(which_in(y, x), which(y %in% x))
identical(which_in2(y, x), which(y %in% x))
microbenchmark(
  which_in(y, x),
  which_in2(y, x),
  which(y %in% x)
)
*/

On my computer, this yields:

Unit: microseconds
            expr  min   lq median   uq  max neval
  which_in(y, x) 39.3 41.0   42.7 44.0 81.5   100
 which_in2(y, x) 12.8 13.6   14.4 15.0 23.8   100
 which(y %in% x) 16.8 20.2   21.0 21.9 31.1   100

So about 30% better than base R.
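
The binary-search idea can also be sketched as standalone C++, which makes one detail easier to see: std::lower_bound returns the first element that is not less than the key, so a membership test has to compare that element with the key as well as check for end(). This is only a sketch under the assumption of plain std::vector<int> inputs; the name which_in_sorted is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: 1-based positions of the elements of x that occur in
// table, found by binary search in a sorted copy of table.
std::vector<int> which_in_sorted(const std::vector<int>& x,
                                 std::vector<int> table) {  // copied on purpose
    std::sort(table.begin(), table.end());

    std::vector<int> out;
    for (std::size_t i = 0; i < x.size(); ++i) {
        // First element >= x[i], or end() if every element is smaller.
        auto pos = std::lower_bound(table.begin(), table.end(), x[i]);
        // A hit requires an exact match, not merely "some element >= x[i]".
        if (pos != table.end() && *pos == x[i]) {
            out.push_back(static_cast<int>(i) + 1);  // R indices start at 1
        }
    }
    return out;
}
```

For example, with x = {1, 50, 100, 101} and table = {100, 2, 50}, this returns positions 2 and 3. Testing only pos != table.end() would also report position 1, because 2 is the first element not less than 1. The benchmark inputs used in this thread do not expose the difference, since the lookup table there contains every integer from 1 to 100.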