Analysis¶
The mdpp.analysis subpackage provides trajectory analysis functions. All compute functions return frozen dataclass results and follow consistent patterns.
Float Dtype Control¶
All compute_* functions default to float32, matching the precision of
MD trajectory coordinates stored by mdtraj. You can override the dtype
globally or per-function call:
import numpy as np
import mdpp
# Check the current default
print(mdpp.get_default_dtype()) # float32
# Override globally
mdpp.set_default_dtype(np.float64)
# Override per-function (takes precedence over the global default)
result = compute_rmsd(traj, dtype=np.float64)
# Reset to float32
mdpp.set_default_dtype(np.float32)
Float32 is sufficient for all analysis operations in this package. MD trajectories are stored in float32 by mdtraj, so there is no extra precision to recover from float64. Empirical tests confirm negligible differences: RMSF error ~1e-5 nm, DCCM correlation error ~4e-6, FES error ~2e-6 kJ/mol -- all well below any physically meaningful threshold.
RMSD¶
from mdpp.core import load_trajectory
from mdpp.analysis.metrics import compute_rmsd
traj = load_trajectory("md.xtc", topology_path="topol.gro")
result = compute_rmsd(traj, atom_selection="backbone")
print(result.rmsd_angstrom.mean()) # average RMSD in Å
print(result.time_ns[-1]) # simulation length in ns
RMSF¶
from mdpp.analysis.metrics import compute_rmsf
result = compute_rmsf(traj, atom_selection="name CA")
# result.rmsf_angstrom: per-residue RMSF
# result.residue_ids: corresponding residue IDs
Delta-RMSF¶
Compare flexibility between two systems (e.g. apo vs holo). The RMSF for
each system is averaged across replicas in MSF space (sqrt(mean(RMSF^2)))
and the SEM is propagated through the sqrt transform before being combined
in quadrature for the delta:
from mdpp.analysis.metrics import compute_rmsf, compute_delta_rmsf
rmsf_apo = [compute_rmsf(traj) for traj in apo_replicas]
rmsf_holo = [compute_rmsf(traj) for traj in holo_replicas]
# Element-wise comparison (requires identical residue counts)
delta = compute_delta_rmsf(rmsf_apo, rmsf_holo)
# delta.delta_rmsf_angstrom: per-residue flexibility change (B - A)
# delta.sem_angstrom: SEM, or None if either system has fewer than 2 replicas
For systems with different sequences, pass aligned index arrays so each position matches structurally (e.g. derived from a multiple sequence alignment):
delta = compute_delta_rmsf(
rmsf_apo,
rmsf_holo,
indices_a=apo_aligned_indices,
indices_b=holo_aligned_indices,
)
Dynamic Cross-Correlation Matrix (DCCM)¶
from mdpp.analysis.metrics import compute_dccm
result = compute_dccm(traj, atom_selection="name CA")
# result.correlation: (n_atoms, n_atoms) correlation matrix
DCCM backend selection¶
compute_dccm dispatches the covariance computation through a pluggable
backend registry. Five backends are available:
| Backend | Device | Notes |
|---|---|---|
"numpy" |
CPU (BLAS GEMM) | default; multi-threaded via BLAS |
"numba" |
CPU (all cores) | explicit Numba parallel kernel |
"cupy" |
NVIDIA GPU | requires [gpu] extra |
"torch" |
CUDA GPU / CPU | requires [gpu] extra |
"jax" |
GPU / TPU / CPU | requires [gpu] extra |
The default "numpy" backend uses BLAS GEMM via reshape + matmul, which
is multi-threaded out of the box and already saturates available cores
without optional dependencies. Note that the DCCM default differs from
distances and RMSD matrix because mdtraj has no native DCCM kernel.
# Default -- numpy + BLAS
result = compute_dccm(traj, atom_selection="name CA")
# Numba: explicit CPU parallelism
result = compute_dccm(traj, atom_selection="name CA", backend="numba")
# GPU backend for very large selections
result = compute_dccm(traj, atom_selection="name CA", backend="torch")
Solvent-Accessible Surface Area (SASA)¶
from mdpp.analysis.metrics import compute_sasa
result = compute_sasa(traj, atom_selection="protein", mode="residue")
print(result.total_nm2.mean()) # average total SASA
Radius of Gyration¶
from mdpp.analysis.metrics import compute_radius_of_gyration
result = compute_radius_of_gyration(traj)
# result.radius_gyration_angstrom: Rg per frame in Angstrom
Hydrogen Bonds¶
from mdpp.analysis.hbond import compute_hbonds, format_hbond_triplets
result = compute_hbonds(traj) # default method: baker_hubbard
labels = format_hbond_triplets(traj.topology, result.triplets)
# result.occupancy: fraction of frames each bond is present
Contacts¶
Inter-residue contacts¶
from mdpp.analysis.contacts import compute_contacts
result = compute_contacts(traj, scheme="closest-heavy")
# result.distances_nm: (n_frames, n_pairs) distance matrix
Contact frequency matrix¶
from mdpp.analysis.contacts import compute_contact_frequency
frequency, pairs = compute_contact_frequency(traj, cutoff_nm=0.45, scheme="closest-heavy")
Native contacts (Q value)¶
from mdpp.analysis.contacts import compute_native_contacts
result = compute_native_contacts(traj, reference_frame=0, cutoff_nm=0.45, scheme="closest-heavy")
# result.fraction: Q(t) per frame
Pairwise Distances¶
from mdpp.analysis.distance import compute_distances
import numpy as np
pairs = np.array([[0, 100], [50, 200]])
result = compute_distances(traj, atom_pairs=pairs)
# result.distances_angstrom: (n_frames, 2)
Minimum distance between groups¶
from mdpp.analysis.distance import compute_minimum_distance
result = compute_minimum_distance(traj, group1="resid 10", group2="resid 50")
Compute backend selection¶
compute_distances, compute_minimum_distance, and featurize_ca_distances
accept a backend= argument. Five backends are available:
| Backend | Device | PBC | Install |
|---|---|---|---|
"mdtraj" |
CPU (1 thread) | yes | built-in (default) |
"numba" |
CPU (all cores) | no | built-in (numba) |
"cupy" |
NVIDIA GPU | no | pip install -e ".[gpu]" |
"torch" |
CUDA GPU / CPU | no | pip install -e ".[gpu]" |
"jax" |
GPU / TPU / CPU | no | pip install -e ".[gpu]" |
Default is "mdtraj" across every analysis function for API
consistency and correctness (only mdtraj supports periodic boundary
conditions). Opt in to faster backends explicitly when performance
matters:
# Default -- mdtraj (supports periodic boundary conditions)
result = compute_distances(traj, atom_pairs=pairs, periodic=True)
# Numba: 5-10x faster than mdtraj on multi-core CPU (no PBC)
result = compute_distances(traj, atom_pairs=pairs, backend="numba", periodic=False)
# GPU backends for very large trajectories
result = compute_distances(traj, atom_pairs=pairs, backend="cupy", periodic=False)
Note: Only the
mdtrajbackend supports periodic boundary conditions. For the other four backends theperiodicflag is silently ignored.
Cached GPU memory is released automatically¶
PyTorch, CuPy, and JAX all use caching memory allocators that hold
GPU blocks in a pool so subsequent calls can reuse them without
expensive CUDA malloc/free round-trips. mdpp decorates every
GPU-backed kernel with a framework-specific cleanup decorator, so
pooled memory is returned to the CUDA driver in a finally block
as soon as the kernel returns (even on exceptions):
# Applied internally in mdpp.analysis._backends:
@clean_torch_cache
def rmsd_torch(...): ...
@clean_cupy_cache
def distances_cupy(...): ...
In practice, nvidia-smi reflects the release automatically after
every compute_rmsd_matrix / compute_distances /
featurize_ca_distances call that uses a GPU backend -- no manual
cleanup is needed. If you interleave mdpp with raw
torch/cupy/jax code and need to force a release across
frameworks, use the native APIs:
import torch
import cupy as cp
torch.cuda.empty_cache()
cp.get_default_memory_pool().free_all_blocks()
Secondary Structure (DSSP)¶
from mdpp.analysis.dssp import compute_dssp
result = compute_dssp(traj, simplified=True)
# result.assignments: (n_frames, n_residues) with "H", "E", "C"
# result.frequency: (n_residues, 3) fraction in each state
PCA / TICA¶
Backbone torsion featurization¶
from mdpp.analysis.decomposition import featurize_backbone_torsions, compute_pca
features = featurize_backbone_torsions(traj, sincos_embedding=True)
pca_result = compute_pca(features.values, n_components=2)
Project new data onto fitted PCA¶
from mdpp.analysis.decomposition import project_pca
new_features = featurize_backbone_torsions(new_traj, sincos_embedding=True)
projected = project_pca(new_features.values, fitted=pca_result)
TICA (time-lagged independent component analysis)¶
from mdpp.analysis.decomposition import compute_tica
tica_result = compute_tica(features.values, lagtime=10, n_components=2)
Free Energy Surface¶
from mdpp.analysis.fes import compute_fes_from_projection
fes = compute_fes_from_projection(pca_result.projections)
# fes.free_energy_kj_mol: 2D free energy in kJ/mol (default T=298.15 K)
Conformational Clustering¶
Each clustering method is a frozen dataclass configured at construction time and called on the data matrix:
from mdpp.analysis.clustering import (
Gromos, Hierarchical, DBSCAN, HDBSCAN,
KMeans, MiniBatchKMeans, RegularSpace,
compute_rmsd_matrix,
)
# --- Distance-matrix methods (take an RMSD matrix) ---
rmsd_mat = compute_rmsd_matrix(traj, atom_selection="backbone")
# GROMOS (greedy largest-cluster-first, Numba JIT, O(n) aux memory)
result = Gromos(cutoff_nm=0.15)(rmsd_mat.rmsd_matrix_nm)
# Hierarchical agglomerative (scipy)
result = Hierarchical(linkage_method="average", distance_threshold=0.2)(rmsd_mat.rmsd_matrix_nm)
result = Hierarchical(linkage_method="average", n_clusters=10)(rmsd_mat.rmsd_matrix_nm)
# DBSCAN (Numba JIT default, sklearn backend available)
result = DBSCAN(eps=0.15, min_samples=5)(rmsd_mat.rmsd_matrix_nm)
result = DBSCAN(eps=0.15, min_samples=5, backend="sklearn")(rmsd_mat.rmsd_matrix_nm)
# HDBSCAN (sklearn)
result = HDBSCAN(min_cluster_size=50, min_samples=5)(rmsd_mat.rmsd_matrix_nm)
print(f"Found {result.n_clusters} clusters")
print(f"Medoid frames: {result.medoid_frames}")
# --- Feature-vector methods (take PCA/TICA projections) ---
from mdpp.analysis.decomposition import featurize_backbone_torsions, compute_pca
torsions = featurize_backbone_torsions(traj, atom_selection="protein")
pca = compute_pca(torsions.values, n_components=10)
result = KMeans(n_clusters=10)(pca.projections)
result = MiniBatchKMeans(n_clusters=10, batch_size=1024)(pca.projections)
result = RegularSpace(dmin=0.5)(pca.projections)
# Seed is configurable -- defaults to 42 for reproducible runs. Pass
# random_state=None to let scikit-learn pick a non-deterministic seed.
result = KMeans(n_clusters=10, random_state=2024)(pca.projections)
# result.labels, result.n_clusters, result.cluster_centers,
# result.medoid_frames, result.inertia
Clustering backend selection¶
DBSCAN provides two backends:
| Backend | Method | Aux memory |
|---|---|---|
"numba" |
Custom Numba-JIT BFS (default) | O(n) |
"sklearn" |
scikit-learn DBSCAN(metric="precomputed") |
O(n^2) |
The "numba" backend handles 120k+ frames without OOM by reading the
RMSD matrix in place and using only O(n) auxiliary buffers.
RMSD matrix backend selection¶
compute_rmsd_matrix defaults to the mdtraj backend for API
consistency with other analysis functions. Five backends are
available:
| Backend | Method | Device |
|---|---|---|
"mdtraj" |
Precentered md.rmsd loop |
CPU (1 thread) |
"numba" |
QCP + Newton-Raphson | CPU (all cores) |
"torch" |
Vectorised einsum + batched SVD | CUDA / CPU |
"jax" |
Vectorised einsum + batched SVD | GPU / TPU / CPU |
"cupy" |
Vectorised einsum + batched SVD | NVIDIA GPU |
# Default -- mdtraj
rmsd_mat = compute_rmsd_matrix(traj, atom_selection="backbone")
# Numba QCP: 50-200x faster than mdtraj on multi-core CPU
rmsd_mat = compute_rmsd_matrix(traj, atom_selection="backbone", backend="numba")
# GPU backend for very large trajectories (>1000 frames)
rmsd_mat = compute_rmsd_matrix(traj, atom_selection="backbone", backend="torch")
All backends agree to within ~5e-5 nm and produce symmetric matrices
with a zero diagonal. GPU backends (torch, jax, cupy) require
the optional [gpu] extra.