pyFFTMatvec
FFTMatvec provides Python bindings via the pyFFTMatvec package, built with pybind11. The package exposes the core C++ classes (Comm, Vector, Matrix) to Python and includes zero-copy PyTorch GPU integration.
Building the Python Package
Prerequisites
In addition to the standard build dependencies, you need:
- Python 3.8+
- pybind11
- scikit-build-core
- mpi4py
- PyTorch (for GPU tensor interop)
- h5py (optional, for creating input data)
Build with pip
From the repository root:
This uses scikit-build-core to invoke CMake with the appropriate flags (see pyproject.toml). You can customize the build by editing the [tool.scikit-build] section.
Build with CMake directly
Alternatively, build the extension module alongside the C++ code:
cmake -B build -DBUILD_PYTHON_BINDINGS=ON -DCMAKE_BUILD_TYPE=Release [other flags...]
cmake --build build --parallel
This produces the _pyFFTMatvec shared library in the build directory.
Import Order
!!! warning "Critical: Import pyFFTMatvec before mpi4py"
On HPC systems, `pyFFTMatvec` must be imported **before** `mpi4py` so that global MPI C++ symbols are exposed to Python before `mpi4py` initializes. Importing in the wrong order will raise an `ImportError`.
# ✅ Correct
import pyFFTMatvec
from mpi4py import MPI
# ❌ Wrong — will crash
from mpi4py import MPI
import pyFFTMatvec # raises ImportError
Core Classes
Comm — MPI + GPU Communication
The Comm object manages MPI communicators, NCCL communicators, CUDA streams, and cuBLAS handles. It must be created first and passed to all Matrix and Vector constructors.
import pyFFTMatvec
from mpi4py import MPI
# Create a communicator with a 2×2 processor grid (4 GPUs total)
comm = pyFFTMatvec.Comm(MPI.COMM_WORLD, proc_rows=2, proc_cols=2)
# Query properties
print(f"Rank {comm.get_world_rank()} of {comm.get_world_size()}")
print(f"Grid position: row_color={comm.get_row_color()}, col_color={comm.get_col_color()}")
print(f"GPU device: {comm.get_device()}")
Matrix — Block-Triangular Toeplitz Matrix
A Matrix can be constructed either from dimensions (for testing) or from a directory path (for real data):
# From dimensions (for testing — initialize with ones or random values)
F = pyFFTMatvec.Matrix(comm, cols=20, rows=10, block_size=50, global_sizes=True)
F.init_mat_ones() # fill with ones
# or
F.init_mat_doubles() # fill with deterministic random doubles
# From a directory path (reads meta_adj and HDF5 files — see I/O Format)
F = pyFFTMatvec.Matrix(comm, path="/path/to/my_matrix")
# With an auxiliary matrix
F = pyFFTMatvec.Matrix(comm, path="/path/to/my_matrix", aux_path="/path/to/aux_matrix")
# With mixed precision
p_config = pyFFTMatvec.MatvecPrecisionConfig()
p_config.fft = pyFFTMatvec.Precision.SINGLE
p_config.sbgemv = pyFFTMatvec.Precision.SINGLE
F = pyFFTMatvec.Matrix(comm, path="/path/to/my_matrix", p_config=p_config)
Vector — Distributed Block Vector
Vectors are distributed across MPI ranks. A "column" vector lives in the parameter space, and a "row" vector lives in the observation space.
# Create vectors from a matrix (recommended — dimensions match automatically)
x = F.get_vec("input") # column vector (Nm * Nt)
y = F.get_vec("output") # row vector (Nd * Nt)
# Or create directly with explicit dimensions
x = pyFFTMatvec.Vector(comm, blocks=20, block_size=50, row_or_col="col", global_sizes=True)
# Initialize
x.init_vec_ones() # fill with 1.0
x.init_vec_zeros() # fill with 0.0
x.init_vec_doubles() # fill with deterministic random doubles
x.init_vec_consecutive() # fill with 0, 1, 2, ...
x.init_vec() # mark as initialized (use existing GPU memory)
# Query properties
print(f"Local blocks: {x.get_num_blocks()}, Global blocks: {x.get_glob_num_blocks()}")
print(f"Block size: {x.get_block_size()}, On grid: {x.on_grid()}")
MatvecPrecisionConfig — Mixed Precision
Controls the precision of each stage of the FFT-based matvec algorithm:
p_config = pyFFTMatvec.MatvecPrecisionConfig()
p_config.broadcast_and_pad = pyFFTMatvec.Precision.DOUBLE # default
p_config.fft = pyFFTMatvec.Precision.SINGLE # use single-precision FFT
p_config.sbgemv = pyFFTMatvec.Precision.SINGLE # use single-precision GEMV
p_config.ifft = pyFFTMatvec.Precision.SINGLE # use single-precision IFFT
p_config.unpad_and_reduce = pyFFTMatvec.Precision.DOUBLE # default
Matrix-Vector Operations
# Forward matvec: y = F @ x
F.matvec(x, y)
# Transpose matvec: y = F^T @ x
F.transpose_matvec(x, y)
# Using auxiliary matrix G: y = G @ x
F.matvec(x, y, use_aux_mat=True)
# Full matvec (F^T F): y = F^T F @ x
F.matvec(x, y, full=True)
Vector Arithmetic
Vector supports standard Python arithmetic operators. All operations are executed on the GPU.
Scalar Operations
# Arithmetic with scalars
z = x * 2.0 # scale
z = x + 1.0 # add scalar
z = x - 0.5 # subtract scalar
z = x / 3.0 # divide by scalar
z = 1.0 / x # element-wise reciprocal
z = x ** 2.0 # element-wise power
# In-place variants
x *= 2.0
x += 1.0
x -= 0.5
x /= 3.0
x **= 2.0
Vector-Vector Operations
# Element-wise operations
z = x + y
z = x - y
z = x * y # element-wise multiply
z = x / y # element-wise divide
# In-place variants
x += y
x -= y
x *= y
x /= y
# Dot product and norms
dot_val = x.dot(y)
l2_norm = x.norm(2) # L2 norm (default)
l1_norm = x.norm(1) # L1 norm
linf = x.norm(0) # L-infinity norm
# BLAS-style operations
x.scale(2.0) # x = 2.0 * x (in-place)
x.axpy(alpha, y) # x = x + alpha * y (in-place)
z = x.waxpy(alpha, y) # z = x + alpha * y (returns new)
x.axpby(alpha, beta, y) # x = alpha * x + beta * y (in-place)
z = x.waxpby(alpha, beta, y) # z = alpha * x + beta * y (returns new)
Additional Operations
# Element-wise operations (explicit method calls)
z = x.elementwise_multiply(y)
x.elementwise_multiply_inplace(y)
z = x.elementwise_divide(y)
x.elementwise_divide_inplace(y)
z = x.elementwise_inverse() # z_i = 1 / x_i
x.elementwise_inverse_inplace()
z = x.pow(0.5) # z_i = x_i^0.5
x.pow_inplace(0.5)
z = x.add_scalar(1.0)
x.add_scalar_inplace(1.0)
# Fused multiply-add: z_i = x_i * y_i + z_i
z = x.elementwise_multiply_add(y, z)
x.elementwise_multiply_add_inplace(y, z)
# Copy
x.copy_to(y) # y ← x
# Resize
z = x.extend(new_block_size) # extend block size (zero-padded)
z = x.shrink(new_block_size) # shrink block size (truncated)
z = x.resize(new_block_size) # extend or shrink as needed
File I/O
See the I/O and Data Formats page for details on the file format.
Reading Vectors
# Read a column vector from an HDF5 file
x = pyFFTMatvec.Vector(comm, blocks=20, block_size=50, row_or_col="col", global_sizes=True)
x.init_vec_from_file("my_vector.h5")
# With checksum verification
x.init_vec_from_file("my_vector.h5", checksum=42)
# For QoI vectors
x.init_vec_from_file("qoi_vector.h5", QoI=True)
Writing Vectors
# Save to HDF5
y.save("output.h5")
# Save as QoI vector (adds qoi=1 attribute)
y.save("qoi_output.h5", QoI=True)
PyTorch Integration
pyFFTMatvec provides zero-copy interop between Vector objects and PyTorch tensors via the CUDA Array Interface. No data is copied between CPU and GPU — both the Vector and the PyTorch tensor share the same GPU memory.
Vector → PyTorch Tensor
import torch
# Get a zero-copy PyTorch view of the GPU data
t = x.to_torch() # default: float64
t = x.to_torch(dtype=torch.float32) # cast to float32
# t is a regular PyTorch CUDA tensor — use it in any PyTorch computation
result = torch.nn.functional.relu(t)
loss = (t ** 2).sum()
!!! note
to_torch() returns None if the calling MPI rank does not own data for this vector (i.e., on_grid() returns False).
PyTorch Tensor → Vector
# Copy data from a PyTorch tensor into the Vector's GPU memory
tensor = torch.randn(x.get_num_blocks() * x.get_block_size(), device="cuda")
x.from_torch(tensor)
The tensor must:
- Be on the GPU (
device="cuda") - Have the same total number of elements as the vector
- Multi-dimensional tensors are automatically flattened if the total element count matches
Complete Example
import pyFFTMatvec
from mpi4py import MPI
import torch
comm = pyFFTMatvec.Comm(MPI.COMM_WORLD, proc_rows=1, proc_cols=2)
# Load matrix and create vectors
F = pyFFTMatvec.Matrix(comm, path="/path/to/matrix")
x = F.get_vec("input")
y = F.get_vec("output")
# Initialize x from a PyTorch tensor
if x.on_grid():
size = x.get_num_blocks() * x.get_block_size()
t = torch.randn(size, device="cuda", dtype=torch.float64)
x.from_torch(t)
x.init_vec()
# Compute matvec
F.matvec(x, y)
# Use the result in PyTorch
y_torch = y.to_torch()
if y_torch is not None:
loss = (y_torch ** 2).sum()
loss.backward() # gradients flow through the shared memory
Testing Utilities
The Tester submodule provides verification functions:
# Check that a ones-matrix matvec produces the expected result
pyFFTMatvec.Tester.check_ones_matvec(comm, F, out_vec, conj=False, full=False)
# conj=True for transpose matvec, full=True for full (F^T F) matvec
This checks that \(F \cdot \mathbf{1} = \text{expected}\) when \(F\) is initialized with init_mat_ones().