Cheminformatics API Reference¶

Descriptors¶

`mdpp.chem.descriptors` ¶

Molecular descriptor calculation and filtering utilities.

`calc_descs(mol, *, desc_names=COMMON_DESC_NAMES)` ¶

Calculate molecular descriptors.

Default descriptors include Lipinski rule-of-five properties (MolWt, MolLogP, NumHAcceptors, NumHDonors) and other common descriptors (FractionCSP3, NumRotatableBonds, RingCount, TPSA, qed).

Parameters:

Name	Type	Description	Default
`mol`	`Mol`	An RDKit molecule.	required
`desc_names`	`Sequence[str]`	Descriptor names to calculate. Must be a subset of `BUILTIN_DESC_NAMES`.	`COMMON_DESC_NAMES`

Returns:

Type	Description
`float \| tuple[float, ...]`	A single float when one descriptor is requested, otherwise a tuple of floats in the same order as desc_names.

Raises:

Type	Description
`KeyError`	If any name in desc_names is not a valid RDKit descriptor.

`filt_descs(mol, *, filt)` ¶

Filter a molecule based on descriptor value ranges.

Parameters:

Name	Type	Description	Default
`mol`	`Mol`	An RDKit molecule.	required
`filt`	`dict[str, tuple[float, float]]`	Mapping of descriptor names to `(lower, upper)` bounds. An empty dict lets every molecule pass.	required

Returns:

Type	Description
`bool`	True if all descriptors fall within their specified ranges.
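The range check described above can be sketched without RDKit by working on a plain mapping of already-computed descriptor values. This is only an illustration: `passes_ranges` is a hypothetical helper, the descriptor values are made up, and treating the bounds as inclusive is an assumption, since the reference does not say whether `filt_descs` includes the endpoints.

```python
from collections.abc import Mapping
from typing import Tuple


def passes_ranges(
    values: Mapping[str, float],
    filt: Mapping[str, Tuple[float, float]],
) -> bool:
    """Return True if every filtered descriptor lies in its (lower, upper) range.

    Hypothetical helper; bounds are assumed to be inclusive.
    """
    # all() over an empty mapping is True, so an empty filter passes everything.
    return all(lower <= values[name] <= upper for name, (lower, upper) in filt.items())


# Illustrative descriptor values, not computed from a real molecule.
example_values = {"MolWt": 180.2, "MolLogP": 1.3, "NumHDonors": 1.0}
ro5_subset = {"MolWt": (0.0, 500.0), "MolLogP": (-5.0, 5.0), "NumHDonors": (0.0, 5.0)}

print(passes_ranges(example_values, ro5_subset))                # True
print(passes_ranges(example_values, {"MolWt": (200.0, 500.0)}))  # False
print(passes_ranges(example_values, {}))                        # True
```

The last call mirrors the documented behaviour that an empty dict lets every molecule pass.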

Filters¶

`mdpp.chem.filters` ¶

Molecular scaffold extraction and structural filters.

`get_framework(mol, *, generic=False)` ¶

Get the Murcko scaffold of a molecule.

Parameters:

Name	Type	Description	Default
`mol`	`Mol \| str`	An RDKit molecule or a SMILES string.	required
`generic`	`bool`	If True, return the generic (all-carbon, all-single-bond) scaffold.	`False`

Returns:

Type	Description
`Mol \| str`	The scaffold in the same type as the input (SMILES string or Mol).

`is_pains(mol)` ¶

Check whether a molecule matches any PAINS filter.

PAINS (Pan Assay Interference Compounds) are frequent hitters in high-throughput screens that act through non-specific mechanisms.

Parameters:

Name	Type	Description	Default
`mol`	`Mol`	An RDKit molecule.	required

Returns:

Type	Description
`bool`	True if the molecule matches at least one PAINS pattern.

Fingerprints¶

`mdpp.chem.fingerprints` ¶

Molecular fingerprint generation and clustering utilities.

`FingerprintClusteringResult(clusters, n_clusters)` `dataclass` ¶

Fingerprint-based Butina clustering output.

Attributes:

Name	Type	Description
`clusters`	`tuple[tuple[int, ...], ...]`	Cluster memberships sorted by size (largest first). Each tuple contains molecule indices belonging to that cluster.
`n_clusters`	`int`	Total number of clusters.

`gen_fp(mol, *, fp_type='morgan')` ¶

Generate a molecular fingerprint.

Parameters:

Name	Type	Description	Default
`mol`	`Mol`	An RDKit molecule.	required
`fp_type`	`str`	Fingerprint type. One of 'morgan', 'ecfp2', 'ecfp4', 'ecfp6', 'maccs', 'rdkit', 'atom_pair', 'topological_torsion'.	`'morgan'`

Returns:

Type	Description
`FingerPrint`	A fingerprint bit vector.

Raises:

Type	Description
`ValueError`	If fp_type is not recognised.

`cluster_fps(fps, *, cutoff=0.6, similarity_metric='tanimoto')` ¶

Cluster fingerprints using RDKit bulk similarity and the Butina algorithm.

Distances are computed as 1 - similarity. Only metrics where sim(A, A) == 1 (i.e. self-distance is zero) are valid for clustering. 'russel' (Russell-Rao) is excluded because its self-similarity equals popcount / n_bits, which is generally less than 1.

Parameters:

Name	Type	Description	Default
`fps`	`Sequence[FingerPrint]`	Fingerprint bit vectors.	required
`cutoff`	`float`	Similarity cutoff for clustering.	`0.6`
`similarity_metric`	`str`	Similarity metric name. Must be one of the metrics listed in `CLUSTERING_SIM_METRICS`.	`'tanimoto'`

Returns:

Type	Description
`FingerprintClusteringResult`	Clustering result with sorted cluster tuples and cluster count.

Raises:

Type	Description
`ValueError`	If similarity_metric is not in `CLUSTERING_SIM_METRICS`.
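For readers unfamiliar with Butina clustering, the following is a minimal pure-Python sketch of the general algorithm on a small, hand-written similarity matrix. It is not the implementation used by `cluster_fps`: the function name `butina_clusters` is hypothetical, and treating two molecules as neighbours when `similarity >= cutoff` (equivalently, distance `1 - similarity <= 1 - cutoff`) is an assumption about how the cutoff is applied.

```python
from typing import List, Sequence, Tuple


def butina_clusters(
    sim_matrix: Sequence[Sequence[float]], cutoff: float
) -> Tuple[Tuple[int, ...], ...]:
    """Sketch of Butina clustering on a full symmetric similarity matrix."""
    n = len(sim_matrix)
    # Neighbour lists: every other item at least `cutoff` similar (assumed rule).
    neighbors: List[List[int]] = [
        [j for j in range(n) if j != i and sim_matrix[i][j] >= cutoff]
        for i in range(n)
    ]
    # Visit candidate centroids from most to fewest neighbours (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters: List[Tuple[int, ...]] = []
    for centroid in order:
        if assigned[centroid]:
            continue
        # A cluster is the centroid plus its still-unassigned neighbours.
        members = [centroid] + [j for j in neighbors[centroid] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    # Largest clusters first, matching the documented result ordering.
    clusters.sort(key=len, reverse=True)
    return tuple(clusters)


# Made-up similarities: items 0-2 resemble each other, as do items 3-4.
sims = [
    [1.0, 0.8, 0.7, 0.1, 0.2],
    [0.8, 1.0, 0.5, 0.1, 0.1],
    [0.7, 0.5, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]
clusters = butina_clusters(sims, cutoff=0.6)
print(clusters)       # ((0, 1, 2), (3, 4))
print(len(clusters))  # 2
```

Note that items 1 and 2 end up in the same cluster even though their mutual similarity is below the cutoff, because both are neighbours of the centroid 0.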

`cluster_fps_parallel(fps, *, cutoff=0.6, similarity_metric='tanimoto')` ¶

Cluster fingerprints using Numba-parallel similarity and the Butina algorithm.

Distances are computed as 1 - similarity. Only metrics where sim(A, A) == 1 (i.e. self-distance is zero) are valid for clustering. 'russel' (Russell-Rao) is excluded because its self-similarity equals popcount / n_bits, which is generally less than 1.

Parameters:

Name	Type	Description	Default
`fps`	`ndarray`	2D numpy array of shape `(n_mols, n_bits)` with binary fingerprints.	required
`cutoff`	`float`	Similarity cutoff for clustering.	`0.6`
`similarity_metric`	`str`	Similarity metric name. Must be one of the metrics listed in `CLUSTERING_SIM_METRICS`.	`'tanimoto'`

Returns:

Type	Description
`FingerprintClusteringResult`	Clustering result with sorted cluster tuples and cluster count.

Raises:

Type	Description
`ValueError`	If similarity_metric is not in `CLUSTERING_SIM_METRICS` or fps is not a 2D array.

Similarity¶

`mdpp.chem.similarity` ¶

Similarity metrics, kernels, and pairwise computation utilities.

`CLUSTERING_SIM_METRICS = (frozenset(PARALLEL_SIM_KERNELS) | frozenset(BULK_SIM_FUNCS)) - _METRICS_UNSUITABLE_FOR_CLUSTERING` `module-attribute` ¶

Similarity metrics whose 1 - sim transform yields a valid distance for clustering.

Excluded metrics:

'russel' (Russell-Rao): self-similarity equals popcount / n_bits rather than 1, so 1 - sim(A, A) > 0 for most fingerprints, breaking distance-based clustering.
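The exclusion can be checked with a few lines of plain Python. The sketch below represents a fingerprint as the set of its on-bit indices and uses the textbook definitions of the two metrics; the helper names are illustrative and are not part of `mdpp`.

```python
from typing import Set


def tanimoto(on_a: Set[int], on_b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits over the union of on-bits."""
    common = len(on_a & on_b)
    union = len(on_a) + len(on_b) - common
    # Returning 0.0 for two empty fingerprints is an arbitrary choice for this sketch.
    return common / union if union else 0.0


def russell_rao(on_a: Set[int], on_b: Set[int], n_bits: int) -> float:
    """Russell-Rao similarity: shared on-bits over the total fingerprint length."""
    return len(on_a & on_b) / n_bits


fp = {1, 5, 9, 200}  # four on-bits in a 2048-bit fingerprint
n_bits = 2048

print(tanimoto(fp, fp))                 # 1.0
print(russell_rao(fp, fp, n_bits))      # 0.001953125
print(1 - russell_rao(fp, fp, n_bits))  # 0.998046875
```

With Russell-Rao, a fingerprint is at "distance" 0.998 from itself, so a threshold-based neighbour search would not even group identical molecules.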

`calc_similarities(fps, sim_kernel)` ¶

Compute condensed pairwise similarity array using a Numba-parallel kernel.

Parameters:

Name	Type	Description	Default
`fps`	`ndarray`	2D numpy array of shape `(n_mols, n_bits)` with binary fingerprints.	required
`sim_kernel`	`Callable[[int, int, int, int], float]`	A `@njit` function `(c, a, b, n_bits) -> float` returning similarity in [0, 1] (or [-1, 1] for McConnaughey).	required

Returns:

Type	Description
`ndarray`	1D condensed similarity array of length `n*(n-1)/2`, dtype float32.
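As a rough illustration of the condensed layout, here is a pure-Python sketch with no Numba or NumPy. Several details are assumptions rather than documented behaviour: that the kernel arguments mean `c` = on-bits shared by both fingerprints and `a`, `b` = on-bit counts of each fingerprint, and that pairs are stored in row-major upper-triangle order (the convention of SciPy's `pdist`). The function names are hypothetical.

```python
from typing import Callable, List, Sequence


def tanimoto_kernel(c: int, a: int, b: int, n_bits: int) -> float:
    """Kernel with the documented (c, a, b, n_bits) signature; n_bits is unused here."""
    union = a + b - c
    return c / union if union else 0.0


def condensed_similarities(
    fps: Sequence[Sequence[int]],
    kernel: Callable[[int, int, int, int], float],
) -> List[float]:
    """Evaluate the kernel for every pair i < j, in row-major order (assumed)."""
    n = len(fps)
    n_bits = len(fps[0]) if n else 0
    counts = [sum(row) for row in fps]  # on-bit count per fingerprint
    out: List[float] = []
    for i in range(n):
        for j in range(i + 1, n):
            common = sum(x & y for x, y in zip(fps[i], fps[j]))
            out.append(kernel(common, counts[i], counts[j], n_bits))
    return out


def condensed_index(i: int, j: int, n: int) -> int:
    """Position of pair (i, j), with i < j, in the row-major condensed array."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)


fps = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
sims = condensed_similarities(fps, tanimoto_kernel)
print(len(sims))                       # 3, i.e. n*(n-1)/2 for n = 3
print(sims[condensed_index(0, 2, 3)])  # 1.0, rows 0 and 2 are identical
```

Storing only the `n*(n-1)/2` unique pairs halves the memory of a full matrix, which matters once `n` reaches tens of thousands of molecules.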

`calc_sim(fp1, fp2, *, similarity_metric='tanimoto')` ¶

Calculate similarity between two fingerprints.

Parameters:

Name	Type	Description	Default
`fp1`	`FingerPrint`	First fingerprint.	required
`fp2`	`FingerPrint`	Second fingerprint.	required
`similarity_metric`	`str`	Similarity metric name (case-insensitive).	`'tanimoto'`

Returns:

Type	Description
`float`	Similarity score.

Raises:

Type	Description
`ValueError`	If similarity_metric is not recognised.
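The case-insensitive lookup and the `ValueError` for unknown names can be pictured as a small registry of metric functions. The sketch below is an assumption about the general pattern, not the library's code: `METRICS`, `similarity`, and the set-based fingerprints are all hypothetical stand-ins.

```python
from typing import Callable, Dict, FrozenSet

Fingerprint = FrozenSet[int]  # on-bit indices; a stand-in for an RDKit bit vector


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def dice(a: Fingerprint, b: Fingerprint) -> float:
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical registry keyed by lower-case metric name.
METRICS: Dict[str, Callable[[Fingerprint, Fingerprint], float]] = {
    "tanimoto": tanimoto,
    "dice": dice,
}


def similarity(
    fp1: Fingerprint, fp2: Fingerprint, *, similarity_metric: str = "tanimoto"
) -> float:
    """Look up the metric case-insensitively and apply it to the two fingerprints."""
    try:
        func = METRICS[similarity_metric.lower()]
    except KeyError:
        raise ValueError(f"Unknown similarity metric: {similarity_metric!r}") from None
    return func(fp1, fp2)


a = frozenset({1, 2, 3})
b = frozenset({2, 3, 4})
print(similarity(a, b))                            # 0.5
print(similarity(a, b, similarity_metric="DICE"))  # same result as "dice"
```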

`calc_bulk_sim(fp, fps, *, similarity_metric='tanimoto')` ¶

Calculate similarity between one fingerprint and a list of fingerprints.

Parameters:

Name	Type	Description	Default
`fp`	`FingerPrint`	Query fingerprint.	required
`fps`	`Sequence[FingerPrint]`	Target fingerprints.	required
`similarity_metric`	`str`	Similarity metric name (case-insensitive).	`'tanimoto'`

Returns:

Type	Description
`list[float]`	List of similarity scores, one per target fingerprint.

Raises:

Type	Description
`ValueError`	If similarity_metric is not recognised.

Suppliers¶

`mdpp.chem.suppliers` ¶

Molecule file readers wrapping RDKit supplier classes.

`MolSupplier(file, *, multithreaded=False, **kwargs)` ¶

Iterate over molecules from a chemical structure file.

Wraps RDKit's MolSupplier classes with optional multithreading and automatic skipping of empty (unparseable) molecules.

Recommended for large files to avoid memory issues; use rdkit.Chem.PandasTools for small files and CSV/XLSX formats.

Examples:

>>> from rdkit import Chem
>>> from mdpp.chem.suppliers import MolSupplier
>>> supplier = MolSupplier("molecules.sdf")
>>> for mol in supplier:
...     print(Chem.MolToSmiles(mol))

Note

Molecule ordering is not guaranteed when multithreaded is True.

Initialise the supplier for the given file.

Parameters:

Name	Type	Description	Default
`file`	`StrPath`	Path to the input file (`.sdf`, `.sdfgz`, `.mae`, `.maegz`, `.smi`, or `.smr`).	required
`multithreaded`	`bool`	Use multithreaded reading where supported. Not available for `.mae` / `.maegz` files.	`False`
`**kwargs`	`Any`	Forwarded to the underlying RDKit supplier.	`{}`

Raises:

Type	Description
`TypeError`	If the file format is unsupported or multithreading is not available for the format.

`__iter__()` ¶

Return the iterator.

`__next__()` ¶

Return the next molecule, skipping empty (unparseable) entries.
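RDKit suppliers yield `None` for records they cannot parse, so skipping them amounts to looping inside `__next__` until a non-`None` item appears. The class below is a minimal sketch of that iterator pattern on an ordinary Python iterable; `SkippingIterator` is a hypothetical name and is not the actual `MolSupplier` implementation.

```python
from typing import Generic, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


class SkippingIterator(Generic[T]):
    """Wrap an iterable and silently drop None entries."""

    def __init__(self, source: Iterable[Optional[T]]) -> None:
        self._it: Iterator[Optional[T]] = iter(source)

    def __iter__(self) -> "SkippingIterator[T]":
        return self

    def __next__(self) -> T:
        while True:
            # StopIteration from the underlying iterator ends the loop for the caller.
            item = next(self._it)
            if item is not None:
                return item


# Strings stand in for molecules; None marks an unparseable record.
records = ["CCO", None, "c1ccccc1", None]
print(list(SkippingIterator(records)))  # ['CCO', 'c1ccccc1']
```

One consequence of this design is that the number of molecules yielded can be smaller than the number of records in the file.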
