Profiling, Vectorization, and Rcpp in R

1. Profiling

Before optimizing, identify where time and memory are actually spent. Guessing leads to wasted effort on code paths that are not the bottleneck.

1.1 Rprof and summaryRprof

Base R provides a sampling profiler via Rprof(). It records the call stack at fixed intervals and writes results to a file.

Rprof("profile.out", interval = 0.01, memory.profiling = TRUE)
result <- slow_function(data)
Rprof(NULL)

summaryRprof("profile.out")

summaryRprof() returns two tables: by.self (time spent in a function excluding callees) and by.total (including callees). by.self is the more actionable view for locating hotspots.

Note Sampling interval default is 0.02s. For functions running under a second, reduce to 0.001–0.005 to get enough samples for a meaningful breakdown.

1.2 profvis

The profvis package wraps Rprof with an interactive flame graph and source-line annotation.

library(profvis)

profvis({
  result <- vapply(1:1e5, function(i) sqrt(i) + log(i), numeric(1))
})

The flame graph's x-axis is sample count (proxy for time), y-axis is call depth. Wide blocks at the top of the stack are self-time hotspots. The memory column (in MB) shows allocations and GC events per line — useful for spotting unnecessary copies.

1.3 bench and microbenchmark

For comparing alternative implementations of a small code unit, use bench::mark() or microbenchmark::microbenchmark(). Both run expressions many times and report distributional statistics rather than a single timing.

library(bench)

mark(
  loop    = { x <- numeric(n); for(i in 1:n) x[i] <- i^2; x },
  vec     = (1:n)^2,
  sapply  = sapply(1:n, function(i) i^2),
  iterations = 100,
  check = FALSE
)

bench::mark() additionally reports memory allocation (mem_alloc) and garbage collections per iteration (n_gc), both relevant for allocation-heavy code.

Warn bench::mark() checks that all expressions return equal results by default (check = TRUE). Set to FALSE when comparing implementations with different output types (e.g. matrix vs vector), or results that differ by floating-point tolerance.

1.4 Memory profiling

lobstr::obj_size() reports object size including shared references. tracemem() identifies when an object is copied — critical for diagnosing unexpected duplication from R's copy-on-modify semantics.

x <- 1:1e6
tracemem(x)
# [1] "<0x55d2a1b4c9e0>"

y <- x       # no copy yet (shared reference)
y[1] <- 0L    # tracemem[0x55d2a1b4c9e0 -> 0x55d2a3f1b220]: triggers copy

1.5 Common hotspot patterns

PatternSymptomFix
Growing a vector/list in a loop with c() or append()O(n²) time, high n_gcPreallocate with vector(mode, length)
rbind() inside a loopQuadratic copy costCollect rows in a list, do.call(rbind, list) once, or use data.table::rbindlist
Repeated data.frame column access in a loopHigh self-time in [.data.frameExtract columns to vectors before looping, or use matrices/data.table
Nested apply family with closures capturing large environmentsHigh memory, slow GCAvoid capturing large objects; pass as arguments
ifelse() over large vectorsEvaluates both branches fully for all elementsUse direct logical indexing or dplyr::case_when (still vectorized but clearer)

2. Vectorization

R's core arithmetic, comparison, and many mathematical operations are implemented as vectorized C loops. Replacing element-wise R-level iteration with calls to these primitives removes interpreter overhead per element.

2.1 Why loops are slow in R

Each iteration of a for loop over R-level code involves: bytecode dispatch, S3/S4 method lookup for generic operators, argument matching, and (frequently) object copying due to copy-on-modify. A vectorized call to +, sqrt, or %*% performs this dispatch once and then runs a tight C loop over the full buffer.

# Scalar loop: ~50ms for n = 1e6
result <- numeric(n)
for (i in 1:n) {
  result[i] <- sqrt(x[i]) * 2 + 1
}

# Vectorized: ~5ms for n = 1e6
result <- sqrt(x) * 2 + 1

2.2 Replacing apply with vectorized primitives

The apply family (sapply, vapply, lapply, mapply) does not avoid R-level iteration — it iterates in C but still calls an R closure per element. They are convenience tools for code clarity, not performance, unless the underlying operation cannot be vectorized.

IdiomVectorized alternative
sapply(x, function(v) v^2)x^2
sapply(seq_along(x), function(i) x[i] + y[i])x + y
sapply(x, function(v) if (v > 0) "pos" else "neg")ifelse(x > 0, "pos", "neg")
for (i in seq_len(n)) total <- total + x[i]sum(x)
Nested loop for outer-product-like computationouter(x, y, FUN) or matrix algebra
Row-wise apply(m, 1, sum)rowSums(m) / colSums(m) / rowMeans(m)

2.3 Vectorized conditionals and recycling

Vector recycling lets shorter vectors be reused element-wise against longer ones without explicit replication — but silent recycling of mismatched lengths is a frequent source of bugs.

x <- 1:6
x * c(1, 0)   # recycles c(1,0) to c(1,0,1,0,1,0): zeroes out even-indexed elements

# Conditional assignment via logical indexing (faster than ifelse for assignment)
x[x %% 2 == 0] <- 0
Warn Recycling a vector of length not a multiple of the longer vector's length produces a warning (in recent R versions) but still executes. This silently corrupts results if the length mismatch is unintentional — always check length() when vectors should align.

2.4 matrix and linear algebra

Operations expressible as matrix algebra benefit from BLAS-backed routines (%*%, solve(), crossprod(), tcrossprod()), which are orders of magnitude faster than equivalent loop-based implementations and benefit further from an optimized BLAS (OpenBLAS, MKL) linked at the R installation level.

# Sum of squares per row, loop vs vectorized vs matrix algebra
rowSums(m^2)              # vectorized
diag(m %*% t(m))         # matrix algebra (avoid for large m: computes full n x n matrix)

2.5 data.table and vectorized joins/aggregation

For tabular workloads, data.table performs grouped aggregation and joins using compiled C routines with radix sorting, avoiding both R-level loops and the row-by-row overhead of data.frame indexing.

library(data.table)

dt[, .(total = sum(value), n = .N), by = group]
dt[other_dt, on = "id"]   # vectorized merge join
Tip Profile before assuming a loop is the bottleneck. Some loops (e.g. ones with sequential dependencies between iterations, like a recursive filter or a simulation with state) cannot be vectorized at the R level at all — these are prime candidates for Rcpp rather than R-side restructuring.

3. Rcpp

Rcpp provides a C++ API that maps R types (vectors, lists, data frames, environments) to C++ classes with near-zero-cost interop, and handles the .Call boilerplate, memory protection, and type conversion required for a standard R/C extension.

3.1 When Rcpp is appropriate

ScenarioRecommendation
Computation expressible as vector/matrix ops on existing dataVectorize in R first; Rcpp unlikely to help
Sequential algorithm with per-iteration state dependency (e.g. recursive filters, MCMC, simulations)Rcpp — cannot be vectorized in R
Tight loop over millions of elements with simple per-element logicRcpp, or check if a vectorized package primitive already exists
String processing / regex over large character vectorsRcpp with std::string, or stringi (already C-backed)
Calling an existing C/C++ libraryRcpp (wraps the call) or .Call with manual SEXP handling

3.2 Inline compilation with cppFunction / sourceCpp

library(Rcpp)

cppFunction('
NumericVector cumsum_custom(NumericVector x) {
  int n = x.size();
  NumericVector out(n);
  double acc = 0;
  for (int i = 0; i < n; i++) {
    acc += x[i];
    out[i] = acc;
  }
  return out;
}
')

cumsum_custom(c(1, 2, 3, 4))   # 1 3 6 10

For larger functions, use a .cpp file with sourceCpp("file.cpp"). The // [[Rcpp::export]] attribute marks functions to be made available in R.

// [[Rcpp::depends(Rcpp)]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector running_mean(NumericVector x, int window) {
  int n = x.size();
  NumericVector out(n);
  double sum = 0;
  for (int i = 0; i < n; i++) {
    sum += x[i];
    if (i >= window) sum -= x[i - window];
    out[i] = (i >= window - 1) ? sum / window : NA_REAL;
  }
  return out;
}

3.3 Type mappings

R typeRcpp typeNotes
numeric vectorNumericVectordouble precision; backed by R's SEXP
integer vectorIntegerVectorNA_INTEGER for missing
character vectorCharacterVectorelements are String / SEXP CHARSXP
logical vectorLogicalVectorthree-valued: TRUE, FALSE, NA_LOGICAL
listListheterogeneous; access via as<T>()
data.frameDataFramecolumn access via df["colname"]
matrixNumericMatrix / IntegerMatrix(i,j) indexing, column-major like R
environmentEnvironmentfor accessing/modifying R environments
functionFunctioncallback into R — has call overhead, avoid in hot loops

3.4 Indexing and NA handling

Danger Rcpp vectors are 0-indexed (C++ semantics), unlike R's 1-indexed vectors. Off-by-one errors are the most common bug when porting R loops to Rcpp. NumericVector x(5) has valid indices 0 through 4.
// Checking for NA in a NumericVector element
if (NumericVector::is_na(x[i])) {
  continue;
}

// Integer NA differs from double NA — use NA_INTEGER, not NA_REAL, for IntegerVector

3.5 Returning multiple values and lists

// [[Rcpp::export]]
List summary_stats(NumericVector x) {
  return List::create(
    Named("mean") = mean(x),
    Named("sd")   = sd(x),
    Named("n")    = x.size()
  );
}

3.6 Rcpp sugar

"Sugar" expressions provide R-like vectorized syntax that compiles down to efficient loops, avoiding manual element-wise iteration while retaining C++ performance.

NumericVector x = ...;
NumericVector y = Rcpp::sqrt(x) + 1;     // vectorized sugar
LogicalVector mask = (x > 0);            // element-wise comparison
double total = Rcpp::sum(x[mask]);     // logical subsetting + sum
Note Sugar expressions can be lazily evaluated and chained; for performance-critical code, profile against a hand-written loop — sugar is generally fast but not always optimal for complex chained expressions due to intermediate temporaries.

3.7 RcppArmadillo and RcppEigen

For linear algebra-heavy code, RcppArmadillo exposes the Armadillo C++ library (MATLAB-like syntax: mat, vec, solve(), inv()) and RcppEigen exposes Eigen. Both integrate with Rcpp's type conversions and link against the same BLAS/LAPACK as R.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
vec ols_coef(mat X, vec y) {
  return solve(X, y);   // equivalent to lm.fit, no formula overhead
}

3.8 Build and packaging considerations

ContextMechanism
Interactive / scriptsourceCpp() — recompiles on file change, caches output
Packageusethis::use_rcpp(), source in src/, LinkingTo: Rcpp + Imports: Rcpp in DESCRIPTION, run compileAttributes() to regenerate exports
Compiler flagssrc/Makevars (Linux/macOS) and src/Makevars.win for PKG_CXXFLAGS, -O2, OpenMP flags
OpenMP parallelism// [[Rcpp::plugins(openmp)]], #pragma omp parallel for — must not call R API functions inside parallel regions
Danger R API calls (allocating SEXPs, calling back into R via Function, anything touching R's memory manager) are not thread-safe. Inside OpenMP or other parallel regions, operate only on raw C++ containers (e.g. copy to std::vector first) and convert back to Rcpp types after the parallel region completes.

4. Decision Framework

StepAction
1Profile with profvis or Rprof to confirm the actual hotspot — do not optimize unmeasured code
2Check for an existing vectorized primitive or package function (rowSums, data.table, matrixStats, collapse) that replaces the loop entirely
3If no vectorized equivalent exists and the loop has sequential state dependencies, write the inner loop in Rcpp
4Benchmark the Rcpp version against the R version with bench::mark() on representative data sizes — compilation and call overhead can make Rcpp slower for small n
5Re-profile the full pipeline; the new bottleneck is often elsewhere (I/O, data.frame coercion, plotting)
Tip A vectorized R rewrite is almost always lower effort and lower maintenance burden than an Rcpp rewrite, and is portable without a compiler toolchain. Reserve Rcpp for cases where profiling shows the bottleneck is in code that genuinely cannot be vectorized.