Cheminformatics API Reference

Descriptors

mdpp.chem.descriptors

Molecular descriptor calculation and filtering utilities.

calc_descs(mol, *, desc_names=COMMON_DESC_NAMES)

Calculate molecular descriptors.

Default descriptors include Lipinski rule-of-five properties (MolWt, MolLogP, NumHAcceptors, NumHDonors) and other common descriptors (FractionCSP3, NumRotatableBonds, RingCount, TPSA, qed).

Parameters:

Name Type Description Default
mol Mol

An RDKit molecule.

required
desc_names Sequence[str]

Descriptor names to calculate. Must be a subset of BUILTIN_DESC_NAMES.

COMMON_DESC_NAMES

Returns:

Type Description
float | tuple[float, ...]

A single float when one descriptor is requested, otherwise a tuple of floats in the same order as desc_names.

Raises:

Type Description
KeyError

If any name in desc_names is not a valid RDKit descriptor.

filt_descs(mol, *, filt)

Filter a molecule based on descriptor value ranges.

Parameters:

Name Type Description Default
mol Mol

An RDKit molecule.

required
filt dict[str, tuple[float, float]]

Mapping of descriptor names to (lower, upper) bounds. An empty dict lets every molecule pass.

required

Returns:

Type Description
bool

True if all descriptors fall within their specified ranges.
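
As a rough illustration of the range check described above, the sketch below applies (lower, upper) bounds to a plain dict of precomputed descriptor values rather than to an RDKit molecule. The helper name passes_ranges is hypothetical, the descriptor values are made up for the example, and treating both bounds as inclusive is an assumption: this reference does not state whether filt_descs uses inclusive or exclusive bounds.

```python
from collections.abc import Mapping


def passes_ranges(
    values: Mapping[str, float],
    filt: Mapping[str, tuple[float, float]],
) -> bool:
    """Return True if every filtered descriptor lies within its bounds.

    Hypothetical helper; bounds are treated as inclusive (an assumption).
    """
    # all() over an empty mapping is True, so an empty filter passes everything.
    return all(
        lower <= values[name] <= upper
        for name, (lower, upper) in filt.items()
    )


# Illustrative descriptor values, not computed from a real molecule.
values = {"MolWt": 180.2, "MolLogP": 1.3, "NumHDonors": 1}
ro5_like = {"MolWt": (0.0, 500.0), "MolLogP": (-5.0, 5.0), "NumHDonors": (0, 5)}

print(passes_ranges(values, ro5_like))                   # True
print(passes_ranges(values, {"MolWt": (200.0, 500.0)}))  # False
print(passes_ranges(values, {}))                         # True
```

The empty-filter case mirrors the documented behaviour that an empty dict lets every molecule pass.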

Filters

mdpp.chem.filters

Molecular scaffold extraction and structural filters.

get_framework(mol, *, generic=False)

Get the Murcko scaffold of a molecule.

Parameters:

Name Type Description Default
mol Mol | str

An RDKit molecule or a SMILES string.

required
generic bool

If True, return the generic (all-carbon, all-single-bond) scaffold.

False

Returns:

Type Description
Mol | str

The scaffold, returned as the same type as the input (Mol or SMILES string).

is_pains(mol)

Check whether a molecule matches any PAINS filter.

PAINS (pan-assay interference compounds) are frequent hitters in high-throughput screens that act through non-specific mechanisms.

Parameters:

Name Type Description Default
mol Mol

An RDKit molecule.

required

Returns:

Type Description
bool

True if the molecule matches at least one PAINS pattern.

Fingerprints

mdpp.chem.fingerprints

Molecular fingerprint generation and clustering utilities.

FingerprintClusteringResult(clusters, n_clusters) dataclass

Fingerprint-based Butina clustering output.

Attributes:

Name Type Description
clusters tuple[tuple[int, ...], ...]

Cluster memberships sorted by size (largest first). Each tuple contains molecule indices belonging to that cluster.

n_clusters int

Total number of clusters.

gen_fp(mol, *, fp_type='morgan')

Generate a molecular fingerprint.

Parameters:

Name Type Description Default
mol Mol

An RDKit molecule.

required
fp_type str

Fingerprint type. One of 'morgan', 'ecfp2', 'ecfp4', 'ecfp6', 'maccs', 'rdkit', 'atom_pair', 'topological_torsion'.

'morgan'

Returns:

Type Description
FingerPrint

A fingerprint bit vector.

Raises:

Type Description
ValueError

If fp_type is not recognised.

cluster_fps(fps, *, cutoff=0.6, similarity_metric='tanimoto')

Cluster fingerprints using RDKit bulk similarity and the Butina algorithm.

Distances are computed as 1 - similarity. Only metrics where sim(A, A) == 1 (i.e. self-distance is zero) are valid for clustering. 'russel' (Russell-Rao) is excluded because its self-similarity equals popcount / n_bits, which is generally less than 1.

Parameters:

Name Type Description Default
fps Sequence[FingerPrint]

Fingerprint bit vectors.

required
cutoff float

Similarity cutoff for clustering.

0.6
similarity_metric str

Similarity metric name. Must be one of the metrics listed in CLUSTERING_SIM_METRICS.

'tanimoto'

Returns:

Type Description
FingerprintClusteringResult

Clustering result with sorted cluster tuples and cluster count.

Raises:

Type Description
ValueError

If similarity_metric is not in CLUSTERING_SIM_METRICS.
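
For orientation, here is a simplified pure-Python sketch of Butina-style clustering on a condensed list of distances (distance = 1 - similarity). It is not the RDKit implementation and may differ from what cluster_fps does internally. The names butina_sketch and dist_threshold are hypothetical, the pair order (0,1), (0,2), ... is an assumption, and the threshold here is explicitly a distance, whereas the cutoff parameter above is documented as a similarity cutoff.

```python
from itertools import combinations


def butina_sketch(dists, n, dist_threshold):
    """Cluster n items from condensed distances in (0,1), (0,2), ... order."""
    # Neighbour sets: every item is its own neighbour, plus all items
    # within the distance threshold.
    neighbours = [{i} for i in range(n)]
    for d, (i, j) in zip(dists, combinations(range(n), 2)):
        if d <= dist_threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # The unassigned item with the most unassigned neighbours becomes
        # the next centroid (ties go to the lowest index).
        centroid = max(
            sorted(unassigned),
            key=lambda i: len(neighbours[i] & unassigned),
        )
        members = neighbours[centroid] & unassigned
        # Centroid first, then the remaining members in index order.
        clusters.append((centroid, *sorted(members - {centroid})))
        unassigned -= members

    # Largest clusters first, as described for FingerprintClusteringResult.
    clusters.sort(key=len, reverse=True)
    return tuple(clusters)


# Made-up distances for 5 items: 0, 1, 2 are close; 3 and 4 are close.
dists = [0.1, 0.2, 0.9, 0.8, 0.15, 0.85, 0.9, 0.7, 0.95, 0.3]
clusters = butina_sketch(dists, n=5, dist_threshold=0.4)
print(clusters)       # ((0, 1, 2), (3, 4))
print(len(clusters))  # 2
```

This variant recounts neighbours among the still-unassigned items on every round, which keeps the code short at the cost of extra work on large inputs.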

cluster_fps_parallel(fps, *, cutoff=0.6, similarity_metric='tanimoto')

Cluster fingerprints using Numba-parallel similarity and the Butina algorithm.

Distances are computed as 1 - similarity. Only metrics where sim(A, A) == 1 (i.e. self-distance is zero) are valid for clustering. 'russel' (Russell-Rao) is excluded because its self-similarity equals popcount / n_bits, which is generally less than 1.

Parameters:

Name Type Description Default
fps ndarray

2D numpy array of shape (n_mols, n_bits) with binary fingerprints.

required
cutoff float

Similarity cutoff for clustering.

0.6
similarity_metric str

Similarity metric name. Must be one of the metrics listed in CLUSTERING_SIM_METRICS.

'tanimoto'

Returns:

Type Description
FingerprintClusteringResult

Clustering result with sorted cluster tuples and cluster count.

Raises:

Type Description
ValueError

If similarity_metric is not in CLUSTERING_SIM_METRICS or fps is not a 2D array.

Similarity

mdpp.chem.similarity

Similarity metrics, kernels, and pairwise computation utilities.

CLUSTERING_SIM_METRICS = (frozenset(PARALLEL_SIM_KERNELS) | frozenset(BULK_SIM_FUNCS)) - _METRICS_UNSUITABLE_FOR_CLUSTERING module-attribute

Similarity metrics whose 1 - sim transform yields a valid distance for clustering.

Excluded metrics:

  • 'russel' (Russell-Rao): self-similarity equals popcount / n_bits rather than 1, so 1 - sim(A, A) > 0 for most fingerprints, breaking distance-based clustering.
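
The exclusion can be checked by hand. The sketch below compares the self-similarity of a small fingerprint under Tanimoto and Russell-Rao, using kernels shaped like the (c, a, b, n_bits) signature mentioned in this module. The reading of c as the number of bits set in both fingerprints and a, b as the two popcounts is an assumption, and the function names are hypothetical plain-Python stand-ins, not the library's kernels.

```python
def tanimoto(c: int, a: int, b: int, n_bits: int) -> float:
    # Assumes at least one fingerprint has a bit set (a + b - c > 0).
    return c / (a + b - c)


def russell_rao(c: int, a: int, b: int, n_bits: int) -> float:
    return c / n_bits


fp = [1, 0, 1, 1, 0, 0, 0, 0]  # 3 of 8 bits set
pop = sum(fp)
n_bits = len(fp)

# Comparing a fingerprint with itself: c == a == b == popcount.
self_tanimoto = tanimoto(pop, pop, pop, n_bits)
self_russell = russell_rao(pop, pop, pop, n_bits)

print(1 - self_tanimoto)  # 0.0   -> valid self-distance
print(1 - self_russell)   # 0.625 -> non-zero self-distance
```

Only a fingerprint with every bit set would reach a Russell-Rao self-similarity of 1, which is why the 1 - sim transform is unsuitable for that metric.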

calc_similarities(fps, sim_kernel)

Compute condensed pairwise similarity array using a Numba-parallel kernel.

Parameters:

Name Type Description Default
fps ndarray

2D numpy array of shape (n_mols, n_bits) with binary fingerprints.

required
sim_kernel Callable[[int, int, int, int], float]

A @njit function (c, a, b, n_bits) -> float returning similarity in [0, 1] (or [-1, 1] for McConnaughey).

required

Returns:

Type Description
ndarray

1D condensed similarity array of length n*(n-1)/2, dtype float32.
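
To make the condensed layout concrete, the following pure-Python sketch computes pairwise Tanimoto similarities for a few toy fingerprints and checks the n*(n-1)/2 length. It uses no Numba and returns a plain list rather than a float32 array. The function name condensed_tanimoto is hypothetical, the pair order (0,1), (0,2), ..., (1,2), ... is an assumption (the reference does not state the order used by calc_similarities), and returning 0.0 for two empty fingerprints is an arbitrary choice made for the example.

```python
from itertools import combinations


def condensed_tanimoto(fps):
    """Return Tanimoto similarities for all pairs i < j as a flat list."""
    out = []
    for x, y in combinations(fps, 2):
        c = sum(p & q for p, q in zip(x, y))  # bits set in both
        a, b = sum(x), sum(y)                 # popcounts
        union = a + b - c
        out.append(c / union if union else 0.0)
    return out


fps = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
sims = condensed_tanimoto(fps)
n = len(fps)

print(len(sims) == n * (n - 1) // 2)  # True: 3 pairs for 3 fingerprints
print(sims[1], sims[2])               # 0.0 0.25
```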

calc_sim(fp1, fp2, *, similarity_metric='tanimoto')

Calculate similarity between two fingerprints.

Parameters:

Name Type Description Default
fp1 FingerPrint

First fingerprint.

required
fp2 FingerPrint

Second fingerprint.

required
similarity_metric str

Similarity metric name (case-insensitive).

'tanimoto'

Returns:

Type Description
float

Similarity score.

Raises:

Type Description
ValueError

If similarity_metric is not recognised.

calc_bulk_sim(fp, fps, *, similarity_metric='tanimoto')

Calculate similarity between one fingerprint and a list of fingerprints.

Parameters:

Name Type Description Default
fp FingerPrint

Query fingerprint.

required
fps Sequence[FingerPrint]

Target fingerprints.

required
similarity_metric str

Similarity metric name (case-insensitive).

'tanimoto'

Returns:

Type Description
list[float]

List of similarity scores, one per target fingerprint.

Raises:

Type Description
ValueError

If similarity_metric is not recognised.

Suppliers

mdpp.chem.suppliers

Molecule file readers wrapping RDKit supplier classes.

MolSupplier(file, *, multithreaded=False, **kwargs)

Iterate over molecules from a chemical structure file.

Wraps RDKit's MolSupplier classes with optional multithreading and automatic skipping of empty (unparseable) molecules.

Recommended for large files to avoid memory issues; use rdkit.Chem.PandasTools for small files and CSV/XLSX formats.

Examples:

>>> from rdkit import Chem
>>> from mdpp.chem.suppliers import MolSupplier
>>> supplier = MolSupplier("molecules.sdf")
>>> for mol in supplier:
...     print(Chem.MolToSmiles(mol))

Note

Molecule ordering is not guaranteed when multithreaded is True.

Initialise the supplier for the given file.

Parameters:

Name Type Description Default
file StrPath

Path to the input file (.sdf, .sdfgz, .mae, .maegz, .smi, or .smr).

required
multithreaded bool

Use multithreaded reading where supported. Not available for .mae / .maegz files.

False
**kwargs Any

Forwarded to the underlying RDKit supplier.

{}

Raises:

Type Description
TypeError

If the file format is unsupported or multithreading is not available for the format.

__iter__()

Return the iterator.

__next__()

Return the next molecule, skipping empty (unparseable) entries.
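
The skipping behaviour can be pictured with a small stand-alone iterator. RDKit suppliers yield None for records they cannot parse; the sketch below wraps any iterable and drops those None entries. The class name SkipEmpty is hypothetical and the strings stand in for molecules; this is an illustration of the pattern, not the MolSupplier source.

```python
class SkipEmpty:
    """Hypothetical wrapper that drops None entries from an iterable."""

    def __init__(self, source):
        self._it = iter(source)

    def __iter__(self):
        return self

    def __next__(self):
        # Keep pulling until a non-None entry appears. StopIteration from
        # the underlying iterator propagates and ends iteration.
        while True:
            item = next(self._it)
            if item is not None:
                return item


# Strings stand in for parsed molecules; None marks an unparseable record.
records = ["CCO", None, "c1ccccc1", None]
print(list(SkipEmpty(records)))  # ['CCO', 'c1ccccc1']
```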