Performance Issue: Unnecessary Matrix Copy in MMCS Sampling
📋 Summary
The MMCS (Multiphase Monte Carlo Sampling) implementation performs an expensive matrix transpose operation that creates an unnecessary full matrix copy. This impacts performance, especially for high-dimensional sampling scenarios with large sample sizes.
🔍 Problem Description
Current Implementation
In include/sampling/mmcs.hpp (line 234) and related files, the code performs:
MT Samples = TotalRandPoints.transpose(); //do not copy TODO!
for (int i = 0; i < total_samples; i++)
{
Samples.col(i) = T * Samples.col(i) + T_shift;
}
S.conservativeResize(P.dimension(), total_number_of_samples_in_P0 + total_samples);
S.block(0, total_number_of_samples_in_P0, P.dimension(), total_samples) =
Samples.block(0, 0, P.dimension(), total_samples);
The Problem
Current Flow (Inefficient):
Memory Flow:
- Read
TotalRandPoints (n×d elements)
- Allocate new matrix
Samples (d×n elements) ← WASTED MEMORY
- Copy and transpose all elements ← EXPENSIVE OPERATION
- Transform each column:
T * Samples.col(i) + T_shift
- Copy transformed data to
S
Issues:
-
Memory Overhead: Creates a full copy of the transposed matrix
- For typical case: 50 dimensions × 5000 samples = 250,000 elements
- Memory: ~2MB (double precision) wasted per iteration
-
Computation Overhead: O(n×d) transpose operation
- Unnecessary for the actual computation needed
-
Memory Bandwidth: Doubles memory traffic
- Read from
TotalRandPoints → Write to Samples → Read from Samples → Write to S
Matrix Dimensions
TotalRandPoints: (num_samples × dimension)
- Each row represents one sample point
- Shape: [total_samples, P.dimension()]
Samples (transposed copy): (dimension × num_samples)
- Each column represents one sample point (transposed)
- Shape: [P.dimension(), total_samples]
S (final result): (dimension × total_samples)
- Each column is a transformed sample
- Shape: [P.dimension(), total_number_of_samples_in_P0 + total_samples]
💡 Proposed Solution
Optimized Implementation
Instead of creating a transposed copy, work directly with rows of TotalRandPoints:
// Optimized: avoid transpose copy by working directly with rows
S.conservativeResize(P.dimension(), total_number_of_samples_in_P0 + total_samples);
for (int i = 0; i < total_samples; i++)
{
// Transform each sample: T * sample + T_shift, then store as column in S
S.col(total_number_of_samples_in_P0 + i).noalias() =
T * TotalRandPoints.row(i).transpose() + T_shift;
}
Optimized Flow
Optimized Flow (Efficient):
Memory Flow:
- Read
TotalRandPoints.row(i) (single row, d elements)
- Transform:
T * row(i).transpose() + T_shift (in-place computation)
- Assign directly to
S.col(i) with noalias() ← NO COPY
- Repeat for all rows
Key Improvement:
- Before: Read → Copy → Transpose → Transform → Copy = 5 operations
- After: Read → Transform → Assign = 3 operations
Benefits:
- No Memory Copy: Eliminates intermediate
Samples matrix
- Direct Computation: Transform and assign in one step
- Better Cache Usage: Single pass through data
- Uses
noalias(): Prevents Eigen from creating temporaries
References
Performance Issue: Unnecessary Matrix Copy in MMCS Sampling
📋 Summary
The MMCS (Multiphase Monte Carlo Sampling) implementation performs an expensive matrix transpose operation that creates an unnecessary full matrix copy. This impacts performance, especially for high-dimensional sampling scenarios with large sample sizes.
🔍 Problem Description
Current Implementation
In
include/sampling/mmcs.hpp(line 234) and related files, the code performs:The Problem
Current Flow (Inefficient):
Memory Flow:
TotalRandPoints(n×d elements)Samples(d×n elements) ← WASTED MEMORYT * Samples.col(i) + T_shiftSIssues:
Memory Overhead: Creates a full copy of the transposed matrix
Computation Overhead: O(n×d) transpose operation
Memory Bandwidth: Doubles memory traffic
TotalRandPoints→ Write toSamples→ Read fromSamples→ Write toSMatrix Dimensions
💡 Proposed Solution
Optimized Implementation
Instead of creating a transposed copy, work directly with rows of
TotalRandPoints:Optimized Flow
Optimized Flow (Efficient):
Memory Flow:
TotalRandPoints.row(i)(single row, d elements)T * row(i).transpose() + T_shift(in-place computation)S.col(i)withnoalias()← NO COPYKey Improvement:
Benefits:
Samplesmatrixnoalias(): Prevents Eigen from creating temporariesReferences