Profiling, Vectorization, and Rcpp in R

1. Profiling

Before optimizing, identify where time and memory are actually spent. Guessing leads to wasted effort on code paths that are not the bottleneck.

1.1 Rprof and summaryRprof

Base R provides a sampling profiler via Rprof(). It records the call stack at fixed intervals and writes results to a file.

Rprof("profile.out", interval = 0.01, memory.profiling = TRUE)
result <- slow_function(data)
Rprof(NULL)

summaryRprof("profile.out")

summaryRprof() returns two tables: by.self (time spent in a function excluding callees) and by.total (including callees). by.self is the more actionable view for locating hotspots.

Note Sampling interval default is 0.02s. For functions running under a second, reduce to 0.001–0.005 to get enough samples for a meaningful breakdown.

1.2 profvis

The profvis package wraps Rprof with an interactive flame graph and source-line annotation.

library(profvis)

profvis({
  result <- vapply(1:1e5, function(i) sqrt(i) + log(i), numeric(1))
})

The flame graph's x-axis is sample count (proxy for time), y-axis is call depth. Wide blocks at the top of the stack are self-time hotspots. The memory column (in MB) shows allocations and GC events per line — useful for spotting unnecessary copies.

1.3 bench and microbenchmark

For comparing alternative implementations of a small code unit, use bench::mark() or microbenchmark::microbenchmark(). Both run expressions many times and report distributional statistics rather than a single timing.

library(bench)

mark(
  loop    = { x <- numeric(n); for(i in 1:n) x[i] <- i^2; x },
  vec     = (1:n)^2,
  sapply  = sapply(1:n, function(i) i^2),
  iterations = 100,
  check = FALSE
)

bench::mark() additionally reports memory allocation (mem_alloc) and garbage collections per iteration (n_gc), both relevant for allocation-heavy code.

Warn bench::mark() checks that all expressions return equal results by default (check = TRUE). Set to FALSE when comparing implementations with different output types (e.g. matrix vs vector), or results that differ by floating-point tolerance.

1.4 Memory profiling

lobstr::obj_size() reports object size including shared references. tracemem() identifies when an object is copied — critical for diagnosing unexpected duplication from R's copy-on-modify semantics.

x <- 1:1e6
tracemem(x)
# [1] "<0x55d2a1b4c9e0>"

y <- x       # no copy yet (shared reference)
y[1] <- 0L    # tracemem[0x55d2a1b4c9e0 -> 0x55d2a3f1b220]: triggers copy

1.5 Common hotspot patterns

Pattern	Symptom	Fix
Growing a vector/list in a loop with `c()` or `append()`	O(n²) time, high `n_gc`	Preallocate with `vector(mode, length)`
`rbind()` inside a loop	Quadratic copy cost	Collect rows in a list, `do.call(rbind, list)` once, or use `data.table::rbindlist`
Repeated `data.frame` column access in a loop	High self-time in `[.data.frame`	Extract columns to vectors before looping, or use matrices/data.table
Nested `apply` family with closures capturing large environments	High memory, slow GC	Avoid capturing large objects; pass as arguments
`ifelse()` over large vectors	Evaluates both branches fully for all elements	Use direct logical indexing or `dplyr::case_when` (still vectorized but clearer)

2. Vectorization

R's core arithmetic, comparison, and many mathematical operations are implemented as vectorized C loops. Replacing element-wise R-level iteration with calls to these primitives removes interpreter overhead per element.

2.1 Why loops are slow in R

Each iteration of a for loop over R-level code involves: bytecode dispatch, S3/S4 method lookup for generic operators, argument matching, and (frequently) object copying due to copy-on-modify. A vectorized call to +, sqrt, or %*% performs this dispatch once and then runs a tight C loop over the full buffer.

# Scalar loop: ~50ms for n = 1e6
result <- numeric(n)
for (i in 1:n) {
  result[i] <- sqrt(x[i]) * 2 + 1
}

# Vectorized: ~5ms for n = 1e6
result <- sqrt(x) * 2 + 1

2.2 Replacing apply with vectorized primitives

The apply family (sapply, vapply, lapply, mapply) does not avoid R-level iteration — it iterates in C but still calls an R closure per element. They are convenience tools for code clarity, not performance, unless the underlying operation cannot be vectorized.

Idiom	Vectorized alternative
`sapply(x, function(v) v^2)`	`x^2`
`sapply(seq_along(x), function(i) x[i] + y[i])`	`x + y`
`sapply(x, function(v) if (v > 0) "pos" else "neg")`	`ifelse(x > 0, "pos", "neg")`
`for (i in seq_len(n)) total <- total + x[i]`	`sum(x)`
Nested loop for outer-product-like computation	`outer(x, y, FUN)` or matrix algebra
Row-wise `apply(m, 1, sum)`	`rowSums(m)` / `colSums(m)` / `rowMeans(m)`

2.3 Vectorized conditionals and recycling

Vector recycling lets shorter vectors be reused element-wise against longer ones without explicit replication — but silent recycling of mismatched lengths is a frequent source of bugs.

x <- 1:6
x * c(1, 0)   # recycles c(1,0) to c(1,0,1,0,1,0): zeroes out even-indexed elements

# Conditional assignment via logical indexing (faster than ifelse for assignment)
x[x %% 2 == 0] <- 0

Warn Recycling a vector of length not a multiple of the longer vector's length produces a warning (in recent R versions) but still executes. This silently corrupts results if the length mismatch is unintentional — always check length() when vectors should align.

2.4 matrix and linear algebra

Operations expressible as matrix algebra benefit from BLAS-backed routines (%*%, solve(), crossprod(), tcrossprod()), which are orders of magnitude faster than equivalent loop-based implementations and benefit further from an optimized BLAS (OpenBLAS, MKL) linked at the R installation level.

# Sum of squares per row, loop vs vectorized vs matrix algebra
rowSums(m^2)              # vectorized
diag(m %*% t(m))         # matrix algebra (avoid for large m: computes full n x n matrix)

2.5 data.table and vectorized joins/aggregation

For tabular workloads, data.table performs grouped aggregation and joins using compiled C routines with radix sorting, avoiding both R-level loops and the row-by-row overhead of data.frame indexing.

library(data.table)

dt[, .(total = sum(value), n = .N), by = group]
dt[other_dt, on = "id"]   # vectorized merge join

Tip Profile before assuming a loop is the bottleneck. Some loops (e.g. ones with sequential dependencies between iterations, like a recursive filter or a simulation with state) cannot be vectorized at the R level at all — these are prime candidates for Rcpp rather than R-side restructuring.

3. Rcpp

Rcpp provides a C++ API that maps R types (vectors, lists, data frames, environments) to C++ classes with near-zero-cost interop, and handles the .Call boilerplate, memory protection, and type conversion required for a standard R/C extension.

3.1 When Rcpp is appropriate

Scenario	Recommendation
Computation expressible as vector/matrix ops on existing data	Vectorize in R first; Rcpp unlikely to help
Sequential algorithm with per-iteration state dependency (e.g. recursive filters, MCMC, simulations)	Rcpp — cannot be vectorized in R
Tight loop over millions of elements with simple per-element logic	Rcpp, or check if a vectorized package primitive already exists
String processing / regex over large character vectors	Rcpp with `std::string`, or `stringi` (already C-backed)
Calling an existing C/C++ library	Rcpp (wraps the call) or `.Call` with manual `SEXP` handling

3.2 Inline compilation with cppFunction / sourceCpp

library(Rcpp)

cppFunction('
NumericVector cumsum_custom(NumericVector x) {
  int n = x.size();
  NumericVector out(n);
  double acc = 0;
  for (int i = 0; i < n; i++) {
    acc += x[i];
    out[i] = acc;
  }
  return out;
}
')

cumsum_custom(c(1, 2, 3, 4))   # 1 3 6 10

For larger functions, use a .cpp file with sourceCpp("file.cpp"). The // [[Rcpp::export]] attribute marks functions to be made available in R.

// [[Rcpp::depends(Rcpp)]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector running_mean(NumericVector x, int window) {
  int n = x.size();
  NumericVector out(n);
  double sum = 0;
  for (int i = 0; i < n; i++) {
    sum += x[i];
    if (i >= window) sum -= x[i - window];
    out[i] = (i >= window - 1) ? sum / window : NA_REAL;
  }
  return out;
}

3.3 Type mappings

R type	Rcpp type	Notes
numeric vector	`NumericVector`	double precision; backed by R's `SEXP`
integer vector	`IntegerVector`	`NA_INTEGER` for missing
character vector	`CharacterVector`	elements are `String` / `SEXP CHARSXP`
logical vector	`LogicalVector`	three-valued: `TRUE`, `FALSE`, `NA_LOGICAL`
list	`List`	heterogeneous; access via `as<T>()`
data.frame	`DataFrame`	column access via `df["colname"]`
matrix	`NumericMatrix` / `IntegerMatrix`	`(i,j)` indexing, column-major like R
environment	`Environment`	for accessing/modifying R environments
function	`Function`	callback into R — has call overhead, avoid in hot loops

3.4 Indexing and NA handling

Danger Rcpp vectors are 0-indexed (C++ semantics), unlike R's 1-indexed vectors. Off-by-one errors are the most common bug when porting R loops to Rcpp. NumericVector x(5) has valid indices 0 through 4.

// Checking for NA in a NumericVector element
if (NumericVector::is_na(x[i])) {
  continue;
}

// Integer NA differs from double NA — use NA_INTEGER, not NA_REAL, for IntegerVector

3.5 Returning multiple values and lists

// [[Rcpp::export]]
List summary_stats(NumericVector x) {
  return List::create(
    Named("mean") = mean(x),
    Named("sd")   = sd(x),
    Named("n")    = x.size()
  );
}

3.6 Rcpp sugar

"Sugar" expressions provide R-like vectorized syntax that compiles down to efficient loops, avoiding manual element-wise iteration while retaining C++ performance.

NumericVector x = ...;
NumericVector y = Rcpp::sqrt(x) + 1;     // vectorized sugar
LogicalVector mask = (x > 0);            // element-wise comparison
double total = Rcpp::sum(x[mask]);     // logical subsetting + sum

Note Sugar expressions can be lazily evaluated and chained; for performance-critical code, profile against a hand-written loop — sugar is generally fast but not always optimal for complex chained expressions due to intermediate temporaries.

3.7 RcppArmadillo and RcppEigen

For linear algebra-heavy code, RcppArmadillo exposes the Armadillo C++ library (MATLAB-like syntax: mat, vec, solve(), inv()) and RcppEigen exposes Eigen. Both integrate with Rcpp's type conversions and link against the same BLAS/LAPACK as R.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
vec ols_coef(mat X, vec y) {
  return solve(X, y);   // equivalent to lm.fit, no formula overhead
}

3.8 Build and packaging considerations

Context	Mechanism
Interactive / script	`sourceCpp()` — recompiles on file change, caches output
Package	`usethis::use_rcpp()`, source in `src/`, `LinkingTo: Rcpp` + `Imports: Rcpp` in `DESCRIPTION`, run `compileAttributes()` to regenerate exports
Compiler flags	`src/Makevars` (Linux/macOS) and `src/Makevars.win` for `PKG_CXXFLAGS`, `-O2`, OpenMP flags
OpenMP parallelism	`// [[Rcpp::plugins(openmp)]]`, `#pragma omp parallel for` — must not call R API functions inside parallel regions

Danger R API calls (allocating SEXPs, calling back into R via Function, anything touching R's memory manager) are not thread-safe. Inside OpenMP or other parallel regions, operate only on raw C++ containers (e.g. copy to std::vector first) and convert back to Rcpp types after the parallel region completes.

4. Decision Framework

Step	Action
1	Profile with `profvis` or `Rprof` to confirm the actual hotspot — do not optimize unmeasured code
2	Check for an existing vectorized primitive or package function (`rowSums`, `data.table`, `matrixStats`, `collapse`) that replaces the loop entirely
3	If no vectorized equivalent exists and the loop has sequential state dependencies, write the inner loop in Rcpp
4	Benchmark the Rcpp version against the R version with `bench::mark()` on representative data sizes — compilation and call overhead can make Rcpp slower for small n
5	Re-profile the full pipeline; the new bottleneck is often elsewhere (I/O, data.frame coercion, plotting)

Tip A vectorized R rewrite is almost always lower effort and lower maintenance burden than an Rcpp rewrite, and is portable without a compiler toolchain. Reserve Rcpp for cases where profiling shows the bottleneck is in code that genuinely cannot be vectorized.