Using C++ to Speedup Row-wise Operations

Date:
Description: The R package Rcpp is a great tool for improving your code’s performance.
Tags:

Using C++ to Speedup Row-wise Operations


Summary

Data analysis code often falls into the ad hoc class: a one-time use which is destined for the graveyard of forgotten files that are slowly decomposing your computer’s available storage space. But some code falls into the reusable class; a category which requires thoughtful considerations of robustness, maintenance, and performance. Recently I encountered a performance bottleneck while productionalizing a particular workflow. The original code was written in R and contained several time consuming operations. Specifically, data preprocessing involved ingesting a large but narrow data set (rows: 60+ million, columns: around 15), and then flagging any row which contained either all zeros or all NAs. Loop-ing and apply-ing over rows can be a very time consuming task within R. Fortunately, 1) C++ is very efficient at these operations, and 2) the Rcpp package provides a simple way to integrate your C++ code within your R workflow. For a more detailed Rcpp tutorial see here.

Results

Below I compare the performance of three different row-wise paradigms: an R for loop, an R apply, and a C++ for loop integrated into R using Rcpp. The row-wise task is to find all row index values which fail the zero/NA check (described above), where our narrow data set has 15 columns and N rows. Here N is varied by factors of 10 from 6K up to 60 million. The number of rows with all zeros or all NAs are both set to 25% of the number of rows N. Note: Other approaches involving column-wise looping, recoding, and summarizing using row sums could also be explored, but the intent of this post is to provide a template for performing row-wise operations and comparing performance of row-wise alternatives.
The below table contains the run times (in seconds) from one experiment per setting. Of course we ought to run the experiment multiple times to account for system variability but “ain’t nobody got time for that”. The “DNR” label stands for did not run.

Rows Cpp Time Apply Time R Loop Time
6,000 0.003 0.054 4.285
60,000 0.010 0.445 41.096
600,000 0.094 3.498 1752.259
6,000,000 1.136 35.517 DNR
60,000,000 10.321 397.695 DNR

The below table contains the ratio of run time relative to the Cpp run time. Interestingly, Cpp is many times more efficient than the other two methods but is logically constructed in the same way as the R for loop.

Rows Cpp Ratio Apply Ratio R Loop Ratio
6,000 1.0 18.1 1438.7
60,000 1.0 44.5 4106.8
600,000 1.0 37.2 18629.9
6,000,000 1.0 31.3 DNR
60,000,000 1.0 38.5 DNR

Code

Step 1. Setup R session and compile C++ function for use within R:

### You may need more digits for test times for small test sizes:
options(digits.secs = 6)

### ---------------------------------------------- ###
### load packages:
### ---------------------------------------------- ###

library(Rcpp)

### ---------------------------------------------- ###
### Compile C++ function
### ---------------------------------------------- ###

# Here we compile cpp code inside current session, but 
# could be built into a package for production purposes:

### Using in script compiling of c++ function:
cppFunction('std::vector<int> allNa_or_allZero(Rcpp::NumericMatrix x) {
            // find row count of numeric matrix
            int nRows = x.nrow();
            // using std for integer vector so I can call .reserve()
            std::vector<int> myvector;
            // preallocating the maximum memory for the object
            myvector.reserve(nRows);
            // loop over all rows of numeric matrix
            for (int i = 0; i < nRows; i++) {
              if (Rcpp::is_true(Rcpp::all(Rcpp::is_na( x(i, _) ) ) )) {
                // if all were NA then add to back of current vector
                myvector.push_back(i + 1);
              } else if ( Rcpp::is_true(Rcpp::all( x(i, _)  == 0) ) ) {
                // if all were 0 then add to back of current vector
                myvector.push_back(i + 1);
              }
            }
            return myvector;
            }')

### If compiling c++ function from a file (e.g. allNa_or_allZero.cpp) use:
# sourceCpp(".../allNa_or_allZero.cpp")

### Example file content:
#
# #include <Rcpp.h>
# using namespace Rcpp;
# 
# // [[Rcpp::export]]
# std::vector<int> allNa_or_allZero(Rcpp::NumericMatrix x) {
#   // find row count of numeric matrix
#   int nRows = x.nrow();
#   // using std for integer vector so I can call .reserve()
#   std::vector<int> myvector;
#   // preallocating the maximum memory for the object
#   myvector.reserve(nRows);
#   // loop over all rows of numeric matrix
#   for (int i = 0; i < nRows; i++) {
#     if (Rcpp::is_true(Rcpp::all(Rcpp::is_na( x(i, _) ) ) )) {
#       // if all were NA then add to back of current vector
#       myvector.push_back(i + 1);
#     } else if ( Rcpp::is_true(Rcpp::all( x(i, _)  == 0) ) ) {
#       // if all were 0 then add to back of current vector
#       myvector.push_back(i + 1);
#     }
#   }
#   return myvector;
# }

Step 2. Create some fake data:

### ---------------------------------------------- ###
### create some fake data:
### ---------------------------------------------- ###

# Data set size:
N <- 6*10^3 # 6*10^3 6*10^4, 6*10^5, 6*10^6, 6*10^7  
df <- as.data.frame(matrix(1, ncol = 15, nrow = N))
# Create 25% all 0:
df[1:(N/4),] <- 0
#  Create 25% all NA:
df[(N/4 + 1):(N/2),] <- NA

Step 3. Run the three competing approaches:

### ---------------------------------------------- ###
### use apply:
### ---------------------------------------------- ###

# custom function:
na_or_0_row_ind = function(r){all(is.na(r)) | all(r == 0)}
stm <- Sys.time()
apply_rows <- which(apply(df, 1, na_or_0_row_ind))
apply_time <- as.numeric(difftime(Sys.time(),stm, units = "secs"))

### ---------------------------------------------- ###
### use for loop:
### ---------------------------------------------- ###

if(N < 600001){
  stm <- Sys.time()
  counter <- 0
  forloop_rows <- rep(NA, N)
  for(rr in 1:N){
    test <- (all(is.na(df[rr,])) | all(df[rr,] == 0))
    if(test){
      counter <- counter + 1
      forloop_rows[counter] <- rr
    }
  }
  forloop_rows <- forloop_rows[1:counter]
  forloop_time <- as.numeric(difftime(Sys.time(),stm, units = "secs"))
}

### ---------------------------------------------- ###
### use Rcpp:
### ---------------------------------------------- ###

stm <- Sys.time()
cpp_time <- cpp_rows <- allNa_or_allZero(as.matrix(df))
cpp_time <- as.numeric(difftime(Sys.time(),stm, units = "secs"))

Step 4. Compare results:

### ---------------------------------------------- ###
### Compare Results
### ---------------------------------------------- ###

### compare rows found:
all.equal(forloop_rows, cpp_rows)
all.equal(apply_rows, cpp_rows)

### compare elapsed times:
apply_time
forloop_time
cpp_time

### compare elapsed time ratios
apply_time / cpp_time
forloop_time / cpp_time