Cheminformatics API Reference
Descriptors
mdpp.chem.descriptors
Molecular descriptor calculation and filtering utilities.
calc_descs(mol, *, desc_names=COMMON_DESC_NAMES)
Calculate molecular descriptors.
Default descriptors include Lipinski rule-of-five properties
(MolWt, MolLogP, NumHAcceptors, NumHDonors) and
other common descriptors (FractionCSP3, NumRotatableBonds,
RingCount, TPSA, qed).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mol | Mol | An RDKit molecule. | required |
| desc_names | Sequence[str] | Descriptor names to calculate. Must be a subset of | COMMON_DESC_NAMES |
Returns:
| Type | Description |
|---|---|
| float \| tuple[float, ...] | A single float when one descriptor is requested, otherwise a tuple of floats in the same order as desc_names. |
Raises:
| Type | Description |
|---|---|
| KeyError | If any name in desc_names is not a valid RDKit descriptor. |
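Because the return type switches between a bare float and a tuple, code that pairs names with values has to normalise the result first. The following is a minimal sketch of that step only; `to_named` is a hypothetical helper and not part of mdpp, and the numbers are hand-written placeholders rather than values computed from a molecule.

```python
from __future__ import annotations

from collections.abc import Sequence


def to_named(
    desc_names: Sequence[str],
    result: float | tuple[float, ...],
) -> dict[str, float]:
    """Pair descriptor names with a calc_descs-style result.

    Hypothetical helper (not part of mdpp): it only normalises the
    documented ``float | tuple[float, ...]`` return shape.
    """
    # A single requested descriptor comes back as a bare float.
    values = result if isinstance(result, tuple) else (result,)
    if len(values) != len(desc_names):
        raise ValueError("number of values does not match number of names")
    return dict(zip(desc_names, values))


# Placeholder numbers for illustration; not computed from a real molecule.
single = to_named(["TPSA"], 40.5)
several = to_named(["MolWt", "TPSA"], (180.2, 40.5))
```

Requesting exactly one name and requesting several then lead to the same dictionary shape downstream.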
filt_descs(mol, *, filt)
Filter a molecule based on descriptor value ranges.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mol | Mol | An RDKit molecule. | required |
| filt | dict[str, tuple[float, float]] | Mapping of descriptor names to | required |
Returns:
| Type | Description |
|---|---|
| bool | True if all descriptors fall within their specified ranges. |
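The range check itself can be illustrated without RDKit. This sketch assumes each tuple is `(lower, upper)` and that both bounds are inclusive; neither detail is stated in the reference above. `in_ranges` is a hypothetical stand-in that works on precomputed values and is not the mdpp implementation.

```python
from __future__ import annotations

from collections.abc import Mapping


def in_ranges(
    values: Mapping[str, float],
    filt: Mapping[str, tuple[float, float]],
) -> bool:
    """Return True if every filtered descriptor lies within its range.

    Assumptions of this sketch: tuples are (lower, upper) and both
    bounds are inclusive.
    """
    return all(low <= values[name] <= high for name, (low, high) in filt.items())


# Hand-written descriptor values standing in for computed ones.
values = {"MolWt": 320.4, "MolLogP": 2.1, "NumHDonors": 2.0}
rule = {"MolWt": (0.0, 500.0), "MolLogP": (-5.0, 5.0), "NumHDonors": (0.0, 5.0)}

passes = in_ranges(values, rule)                       # every value in range
fails = in_ranges({**values, "MolWt": 612.7}, rule)    # MolWt above upper bound
```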
Filters
mdpp.chem.filters
Molecular scaffold extraction and structural filters.
get_framework(mol, *, generic=False)
Get the Murcko scaffold of a molecule.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mol | Mol \| str | An RDKit molecule or a SMILES string. | required |
| generic | bool | If True, return the generic (all-carbon, all-single-bond) scaffold. | False |
Returns:
| Type | Description |
|---|---|
| Mol \| str | The scaffold in the same type as the input (SMILES string or Mol). |
is_pains(mol)
Check whether a molecule matches any PAINS filter.
PAINS (pan-assay interference compounds) are frequent hitters in high-throughput screens that act through non-specific mechanisms.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mol | Mol | An RDKit molecule. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the molecule matches at least one PAINS pattern. |
Fingerprints
mdpp.chem.fingerprints
Molecular fingerprint generation and clustering utilities.
FingerprintClusteringResult(clusters, n_clusters)
dataclass
Fingerprint-based Butina clustering output.
Attributes:
| Name | Type | Description |
|---|---|---|
| clusters | tuple[tuple[int, ...], ...] | Cluster memberships sorted by size (largest first). Each tuple contains molecule indices belonging to that cluster. |
| n_clusters | int | Total number of clusters. |
gen_fp(mol, *, fp_type='morgan')
Generate a molecular fingerprint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mol | Mol | An RDKit molecule. | required |
| fp_type | str | Fingerprint type. One of 'morgan', 'ecfp2', 'ecfp4', 'ecfp6', 'maccs', 'rdkit', 'atom_pair', 'topological_torsion'. | 'morgan' |
Returns:
| Type | Description |
|---|---|
| FingerPrint | A fingerprint bit vector. |
Raises:
| Type | Description |
|---|---|
| ValueError | If fp_type is not recognised. |
cluster_fps(fps, *, cutoff=0.6, similarity_metric='tanimoto')
Cluster fingerprints using RDKit bulk similarity and the Butina algorithm.
Distances are computed as 1 - similarity. Only metrics where
sim(A, A) == 1 (i.e. self-distance is zero) are valid for clustering.
'russel' (Russell-Rao) is excluded because its self-similarity equals
popcount / n_bits, which is generally less than 1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fps | Sequence[FingerPrint] | Fingerprint bit vectors. | required |
| cutoff | float | Similarity cutoff for clustering. | 0.6 |
| similarity_metric | str | Similarity metric name. Must be one of the metrics listed in | 'tanimoto' |
Returns:
| Type | Description |
|---|---|
| FingerprintClusteringResult | Clustering result with sorted cluster tuples and cluster count. |
Raises:
| Type | Description |
|---|---|
| ValueError | If similarity_metric is not in |
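The Butina procedure named above can be sketched in pure Python. This is an illustrative reimplementation on sets of on-bit indices, not mdpp's code: `butina` and `dist_threshold` are names chosen for the sketch, and the threshold is expressed as a distance (1 - similarity). How the documented `cutoff` maps onto that threshold is not stated in this reference.

```python
from __future__ import annotations


def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Treating two empty fingerprints as identical is a choice of this sketch.
    return len(a & b) / union if union else 1.0


def butina(
    fps: list[frozenset[int]], dist_threshold: float
) -> tuple[tuple[int, ...], ...]:
    """Cluster fingerprints with a simple Butina-style procedure."""
    n = len(fps)
    # Neighbour lists: every other fingerprint within the distance threshold.
    neighbours: list[list[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fps[i], fps[j]) <= dist_threshold:
                neighbours[i].append(j)
                neighbours[j].append(i)
    # Visit candidate centroids from most to fewest neighbours.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = [False] * n
    clusters: list[tuple[int, ...]] = []
    for centroid in order:
        if assigned[centroid]:
            continue
        # A cluster is the centroid plus its still-unassigned neighbours.
        members = [centroid] + [k for k in neighbours[centroid] if not assigned[k]]
        for k in members:
            assigned[k] = True
        clusters.append(tuple(members))
    # Largest clusters first, matching the documented ordering of `clusters`.
    clusters.sort(key=len, reverse=True)
    return tuple(clusters)


fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20}),
]
clusters = butina(fps, dist_threshold=0.45)
n_clusters = len(clusters)
```

With this toy input, the first three fingerprints are mutually within the threshold and form one cluster, while the remaining three each end up as singletons.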
cluster_fps_parallel(fps, *, cutoff=0.6, similarity_metric='tanimoto')
Cluster fingerprints using Numba-parallel similarity and the Butina algorithm.
Distances are computed as 1 - similarity. Only metrics where
sim(A, A) == 1 (i.e. self-distance is zero) are valid for clustering.
'russel' (Russell-Rao) is excluded because its self-similarity equals
popcount / n_bits, which is generally less than 1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fps | ndarray | 2D numpy array of shape | required |
| cutoff | float | Similarity cutoff for clustering. | 0.6 |
| similarity_metric | str | Similarity metric name. Must be one of the metrics listed in | 'tanimoto' |
Returns:
| Type | Description |
|---|---|
| FingerprintClusteringResult | Clustering result with sorted cluster tuples and cluster count. |
Raises:
| Type | Description |
|---|---|
| ValueError | If similarity_metric is not in |
Similarity
mdpp.chem.similarity
Similarity metrics, kernels, and pairwise computation utilities.
CLUSTERING_SIM_METRICS = (frozenset(PARALLEL_SIM_KERNELS) | frozenset(BULK_SIM_FUNCS)) - _METRICS_UNSUITABLE_FOR_CLUSTERING
module-attribute
Similarity metrics whose 1 - sim transform yields a valid distance for clustering.
Excluded metrics:
'russel' (Russell-Rao): self-similarity equals popcount / n_bits rather than 1, so 1 - sim(A, A) > 0 for most fingerprints, breaking distance-based clustering.
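A small numeric check makes the exclusion concrete. The sketch below represents a fingerprint as a set of on-bit indices (a simplification, not mdpp's data type) and compares self-similarity under Tanimoto and Russell-Rao; the function names and the 2048-bit length are chosen for illustration.

```python
from __future__ import annotations


def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Shared on-bits divided by the on-bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def russell_rao(a: frozenset[int], b: frozenset[int], n_bits: int) -> float:
    """Shared on-bits divided by the total fingerprint length."""
    return len(a & b) / n_bits


fp = frozenset({3, 17, 42, 100})  # four on-bits
n_bits = 2048                     # illustrative fingerprint length

tanimoto_self = tanimoto(fp, fp)                 # 1.0, so self-distance is 0
russell_rao_self = russell_rao(fp, fp, n_bits)   # 4 / 2048 = 0.001953125
russell_rao_self_dist = 1.0 - russell_rao_self   # 0.998046875, far from 0
```

Under Russell-Rao, a fingerprint is almost maximally distant from itself unless nearly every bit is set, which is why the `1 - sim` transform is unusable for clustering with that metric.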
calc_similarities(fps, sim_kernel)
Compute condensed pairwise similarity array using a Numba-parallel kernel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fps | ndarray | 2D numpy array of shape | required |
| sim_kernel | Callable[[int, int, int, int], float] | A | required |
Returns:
| Type | Description |
|---|---|
| ndarray | 1D condensed similarity array of length |
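For readers unfamiliar with condensed pairwise arrays, here is a pure-Python sketch. It assumes the layout follows SciPy's `pdist` convention (pairs `(i, j)` with `i < j`, in row-major order, giving `n * (n - 1) / 2` entries), which this reference does not spell out. The two-argument set-based kernel is a simplification and is not the four-integer kernel signature documented above; all names are hypothetical.

```python
from __future__ import annotations

from collections.abc import Callable

Bits = frozenset  # fingerprint as a set of on-bit indices (sketch only)


def tanimoto(a: Bits, b: Bits) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def condensed_similarities(
    fps: list[Bits], kernel: Callable[[Bits, Bits], float]
) -> list[float]:
    """Similarities for all pairs (i, j) with i < j, in row-major order."""
    n = len(fps)
    return [kernel(fps[i], fps[j]) for i in range(n) for j in range(i + 1, n)]


def condensed_index(i: int, j: int, n: int) -> int:
    """Position of pair (i, j), with i < j, in the condensed array."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)


fps = [
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
    frozenset({1, 2, 3, 4}),
    frozenset({9}),
]
sims = condensed_similarities(fps, tanimoto)
pair_1_3 = sims[condensed_index(1, 3, len(fps))]  # similarity of fps[1] and fps[3]
```

Storing only the upper triangle roughly halves memory compared with a full square matrix, at the cost of the index arithmetic shown in `condensed_index`.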
calc_sim(fp1, fp2, *, similarity_metric='tanimoto')
Calculate similarity between two fingerprints.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fp1 | FingerPrint | First fingerprint. | required |
| fp2 | FingerPrint | Second fingerprint. | required |
| similarity_metric | str | Similarity metric name (case-insensitive). | 'tanimoto' |
Returns:
| Type | Description |
|---|---|
| float | Similarity score. |
Raises:
| Type | Description |
|---|---|
| ValueError | If similarity_metric is not recognised. |
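The documented behaviour (case-insensitive metric names, `ValueError` for unknown ones) can be sketched with a small registry. Everything here is illustrative: the `_METRICS` table, the `similarity` function, and the set-based fingerprints are assumptions of the sketch and do not reflect how mdpp stores or dispatches its metrics.

```python
from __future__ import annotations

from collections.abc import Callable

Bits = frozenset  # fingerprint as a set of on-bit indices (sketch only)


def tanimoto(a: Bits, b: Bits) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a: Bits, b: Bits) -> float:
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


# Hypothetical registry keyed by lower-case metric name.
_METRICS: dict[str, Callable[[Bits, Bits], float]] = {
    "tanimoto": tanimoto,
    "dice": dice,
}


def similarity(fp1: Bits, fp2: Bits, *, similarity_metric: str = "tanimoto") -> float:
    """Look the metric up case-insensitively and apply it."""
    try:
        func = _METRICS[similarity_metric.lower()]
    except KeyError:
        raise ValueError(f"unknown similarity metric: {similarity_metric!r}") from None
    return func(fp1, fp2)


a = frozenset({1, 2, 3})
b = frozenset({2, 3, 4})
score = similarity(a, b, similarity_metric="Tanimoto")  # 2 shared / 4 total = 0.5
```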
calc_bulk_sim(fp, fps, *, similarity_metric='tanimoto')
Calculate similarity between one fingerprint and a list of fingerprints.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fp | FingerPrint | Query fingerprint. | required |
| fps | Sequence[FingerPrint] | Target fingerprints. | required |
| similarity_metric | str | Similarity metric name (case-insensitive). | 'tanimoto' |
Returns:
| Type | Description |
|---|---|
| list[float] | List of similarity scores, one per target fingerprint. |
Raises:
| Type | Description |
|---|---|
| ValueError | If similarity_metric is not recognised. |
Suppliers
mdpp.chem.suppliers
Molecule file readers wrapping RDKit supplier classes.
MolSupplier(file, *, multithreaded=False, **kwargs)
Iterate over molecules from a chemical structure file.
Wraps RDKit's MolSupplier classes with optional multithreading and
automatic skipping of empty (unparseable) molecules.
Recommended for large files to avoid memory issues; use
rdkit.Chem.PandasTools for small files and CSV/XLSX formats.
Examples:
>>> from rdkit import Chem
>>> from mdpp.chem.suppliers import MolSupplier
>>> supplier = MolSupplier("molecules.sdf")
>>> for mol in supplier:
...     print(Chem.MolToSmiles(mol))
Note
Molecule ordering is not guaranteed when multithreaded is True.
Initialise the supplier for the given file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | StrPath | Path to the input file ( | required |
| multithreaded | bool | Use multithreaded reading where supported. Not available for | False |
| **kwargs | Any | Forwarded to the underlying RDKit supplier. | {} |
Raises:
| Type | Description |
|---|---|
| TypeError | If the file format is unsupported or multithreading is not available for the format. |
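The "automatic skipping of empty molecules" mentioned above corresponds to a common pattern: RDKit suppliers yield `None` for records they cannot parse, and a wrapper filters those out. A minimal sketch of that pattern, using plain strings in place of molecules, follows; `skip_empty` is a hypothetical name and this is not mdpp's implementation.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def skip_empty(records: Iterable[T | None]) -> Iterator[T]:
    """Yield only the records that parsed, dropping None placeholders lazily."""
    for record in records:
        if record is not None:
            yield record


# Strings stand in for parsed molecules; None marks an unparseable record.
parsed = ["CCO", None, "c1ccccc1", None]
valid = list(skip_empty(parsed))
```

Because the filter is a generator, records are still consumed one at a time, which preserves the low memory footprint that motivates using a supplier for large files.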