Enrichment Predictions

This tutorial assumes the PIDGINv4 repository is located at $PV4 and is concerned with the script predict_enriched.py

This script calculates target prediction enrichment (using Fisher's exact test) between two input SMILES/SDF files, as in [1]. Target predictions are extended with NCBI BioSystems pathways and DisGeNET diseases, and pathway or disease-gene association enrichment (using the chi-square test) is then calculated for the two input SMILES/SDF files.
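To make the target-level test concrete, the following is a minimal, standard-library sketch of a two-sided Fisher's exact test on a 2x2 table of active/inactive prediction counts for the two files. The function name and the table layout are assumptions made for illustration; this is not the code used inside predict_enriched.py.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout (illustrative only):
      a = f1 compounds predicted active,  b = f1 compounds predicted inactive
      c = f2 compounds predicted active,  d = f2 compounds predicted inactive
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x actives in f1 with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

A small p-value here suggests that the proportion of predicted actives differs between the two compound sets for that target.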

The approach is used to annotate which targets/pathways/diseases are statistically associated between two compound sets given their input SMILES/SDF files. This analysis is important since a (predicted) target is not necessarily responsible for eliciting an observed mechanism-of-action. Some target prediction models also behave promiscuously, due to biases in training data (chemical space) and the nature of the target.

The analysis must use a cut-off for the probability of activity from the random forest for each target. Predictions are generated for the models using the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto from: doi.org/10.1186/s13321-016-0182-y
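The interplay between the probability cut-off and the AD filter can be sketched as below. This is a hypothetical illustration: the function name, the `within_ad` flag and the use of `>=` at the threshold are assumptions, not details confirmed by the PIDGIN source.

```python
import math


def classify(probability, within_ad, proba_threshold=0.45):
    """Return 1.0 (active), 0.0 (inactive) or NaN (outside the AD).

    Assumption: a compound outside the Applicability Domain receives no
    active/inactive call at all, regardless of its probability.
    """
    if not within_ad:
        return math.nan
    return 1.0 if probability >= proba_threshold else 0.0


# Toy (probability, within_ad) pairs, invented for illustration.
predictions = [classify(p, ad) for p, ad in [(0.9, True), (0.3, True), (0.7, False)]]
```

Only the active and inactive calls would contribute to the enrichment counts; NaN predictions are reported separately.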

biosystems.txt contains pathway data from the NCBI biosystems used to annotate target predictions. Pathway results can be filtered by source (e.g. KEGG/Reactome/GO) afterward.

DisGeNET_diseases.txt contains disease data used to annotate target predictions. The DisGeNET gene-disease score takes into account the number and type of sources (level of curation, organisms) and the number of publications supporting the association. The score ranges from 0 to 1, with higher values indicating greater confidence in the annotation. A DisGeNET_threshold can be supplied at runtime when annotating predictions with diseases (a 0.06 threshold is applied by default, which includes associations from curated sources/animal models supporting the association or reported in 20-200 papers). More info on the score here: http://disgenet.org/web/DisGeNET/menu/dbinfo#score
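The effect of the score threshold can be illustrated with a small sketch. The tuple layout (gene, disease, score), the placeholder names and the inclusive comparison are all assumptions for illustration; the actual layout of DisGeNET_diseases.txt is not described here.

```python
def filter_associations(associations, threshold=0.06):
    """Keep (gene, disease, score) tuples whose score meets the threshold."""
    return [row for row in associations if row[2] >= threshold]


# Toy records with placeholder names and made-up scores.
toy = [
    ("GENE_A", "disease_x", 0.02),
    ("GENE_A", "disease_y", 0.06),
    ("GENE_B", "disease_x", 0.40),
]
kept = filter_associations(toy)
```

Raising the threshold keeps only better-supported gene-disease links, at the cost of annotating fewer predictions.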

List of available arguments

To see all available options, run

$ python $PV4/predict_enriched.py -h
Usage: predict_enriched.py [options]

  -h, --help            show this help message and exit
  --f1=FILE             First input smiles or sdf file (required)
  --f2=FILE             Second input smiles or sdf file (required)
  -d DELIM, --smiles_delim=DELIM
                                                Input file (smiles) delimiter char (default: white
                                                space ' ')
                                                Input file (smiles) delimiter column (default: 0)
                                                Input file (smiles) ID column (default: 1)
  -o FILE               Optional output prediction file name
  -n NCORES, --ncores=NCORES
                                                No. cores (default: 1)
                                                Bioactivity Um threshold (required). Use either
                                                100/10/1/0.1 (default:10)
  -p PROBA, --proba=PROBA
                                                RF probability threshold (default: None)
  --ad=AD               Applicability Domain (AD) filter using percentile of
                                                weights [float]. Default: 90 (integer for percentile)
  --known_flag          Set known activities (annotate duplicates between
                                                input to train with correct label)
  --orthologues         Set to use orthologue bioactivity data in model
  --organism=ORGANISM   Organism filter (multiple can be specified using
                                                commas ',')
                                                Target classification filter
  --min_size=MINSIZE    Minimum number of actives used in model generation
                                                (default: 10)
                                                Comma-separated performance filtering using following
                                                nomenclature: validation_set[tsscv,l50so,l50po],metric
                                                E.g 'tsscv,bedroc,0.5'
  --se_filter           Optional setting to restrict to models which do not
                                                require Sphere Exclusion (SE)
  --training_log        Optional setting to add training_details to the
                                                prediction file (large increase in output file size)
  --ntrees=NTREES       Specify the minimum number of trees for warm-start
                                                random forest models (N.B Potential large
                                                latency/memory cost)
  --preprocess_off      Turn off preprocessing using the flatkinson (eTox)
                                                standardizer (github.com/flatkinson/standardiser),
                                                size filter (100 <= Mw <= 1000) and organic mol check
                                                (C count >= 1)
  --dgn=DGN_THRESHOLD   DisGeNET score threshold (default: 0.06)

Generating enrichment predictions

In this example, we will work with two SMILES input files: cytotoxic compounds in the file named cytotox_library.smi and (putative) non-toxic compounds in the file named nontoxic_background.smi. Both are located in the examples directory.

The corresponding top 5 SMILES strings are:




The following command will generate Bos taurus (cow) target prediction enrichment at 1μM (with a lenient AD filter at the 30th percentile and a probability of activity cut-off of 0.45), along with enriched pathways and diseases (default 0.06 DisGeNET score threshold), for the cytotoxic compounds when compared to the non-toxic compounds.

$ python $PV4/predict_enriched.py --f1 cytotox_library.smi --f2 nontoxic_background.smi --organism "Bos taurus" -b 1 -p 0.45 --ad 30 -n 4

Three files are output for the target, pathway and disease enrichment calculations, with the naming convention:


The rows in each file correspond to the ranked list of enriched targets/pathways/diseases that are more statistically associated with the compounds in the first SMILES/SDF file (--f1, e.g. cytotoxic). A higher Odds Ratio (column Odds_Ratio) or Risk Ratio (column Risk_Ratio) indicates a larger degree of enrichment for a given target/pathway/disease compared to the second input (--f2, e.g. non-toxic) compound set.
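As a rough guide to reading these two columns, the textbook definitions of the odds ratio and risk ratio can be sketched as follows. These are the standard formulas oriented so that larger values favour the --f1 set; the exact implementation in PIDGIN (for example, how zero counts are handled) is not confirmed here, and the function names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds of an active call in f1 divided by the odds in f2.

    a, b = f1 actives, f1 inactives; c, d = f2 actives, f2 inactives.
    """
    numerator, denominator = a * d, b * c
    if denominator == 0:
        # Undefined when both products are zero, unbounded otherwise.
        return float("inf") if numerator > 0 else float("nan")
    return numerator / denominator


def risk_ratio(a, b, c, d):
    """Fraction of f1 predicted active divided by the fraction of f2.

    Assumes both compound sets have at least one active/inactive call.
    """
    f1_rate = a / (a + b)
    f2_rate = c / (c + d)
    if f2_rate == 0:
        return float("inf") if f1_rate > 0 else float("nan")
    return f1_rate / f2_rate


# Toy counts: 30 of 100 f1 compounds active versus 10 of 100 f2 compounds.
example_or = odds_ratio(30, 70, 10, 90)
example_rr = risk_ratio(30, 70, 10, 90)
```

With these toy counts both ratios exceed 1, which under this orientation would indicate enrichment in the --f1 set.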

The output has columns for the number of compound predictions (column [f1/f2]_[In]Actives_[probability_activity]) and the associated percentage of compounds with that prediction (column Percent_[f1/f2]_[In]Actives_[probability_activity]).

The Fisher's test or chi-squared p-values are provided ([Fishers_Test/Chisquared]_P_Value), along with the Benjamini & Hochberg corrected values in the column named [Fishers_Test/Chisquared]_P_Value_Corrected. The output should then be filtered according to your own significance and effect-size preferences.
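For readers unfamiliar with the correction, a minimal standard-library sketch of the Benjamini & Hochberg step-up procedure is given below. It illustrates the general method only; the function name is hypothetical and the script's own implementation may differ in detail.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


corrected = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Rows could then be kept when the corrected value falls below a chosen level such as 0.05.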

The percentage NaN predictions (compounds outside the Applicability Domain (AD) filter that were not given an active/inactive target prediction) are also provided in the column entitled [f1/f2]_Proportion_Nan_Predictions_[ad].
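The idea behind this column can be sketched as the share of compounds that received no active/inactive call. The function name is hypothetical, and whether the script reports a fraction or a percentage is an assumption; this sketch returns a fraction between 0 and 1.

```python
import math


def proportion_nan(predictions):
    """Fraction of predictions that are NaN (compound outside the AD)."""
    if not predictions:
        return math.nan
    return sum(1 for p in predictions if math.isnan(p)) / len(predictions)


# Toy prediction vector for one target: two of four compounds fall outside the AD.
share = proportion_nan([1.0, math.nan, 0.0, math.nan])
```

A high value suggests that few compounds were inside the AD for that target, so the enrichment estimate rests on a small number of predictions.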


Please note that the Odds and Risk ratios are implemented differently from the previous version of PIDGIN. In this version, larger numbers indicate larger enrichments.

In this example, there are six targets with a corrected p-value less than 0.05 and an Odds or Risk ratio greater than 1.0. All of these targets have known links to cytotoxicity; for example, three are related to tubulin, which has known mechanisms of cytotoxicity (via cytoskeletal machinery).

More complicated example

Target/pathway/disease enrichment analysis can be combined with all model filters outlined in the previous section “Getting started”.

For example, the following code:

$ python $PV4/predict_enriched.py --f1 cytotox_library.smi --f2 nontoxic_background.smi --organism Drosophila -b 100 --known_flag --ad 0 -n 4 -p 0.8 --min_size 50 --se_filter --performance_filter l50po,bedroc,0.8

would filter for Drosophila models that did not require Sphere Exclusion (SE) (i.e. a sufficient number of inactives was available) and that had at least 50 actives in the training set. Models must also reach a minimum BEDROC of 0.8 in the L50PO validation (leaving out 50% of ChEMBL publications from the training data, over 4-fold cross-validation). Enrichment predictions are then produced at a probability cut-off of 0.8 and a bioactivity threshold of 100μM, with the Applicability Domain (AD) filter silenced and with known activities (in ChEMBL or PubChem) set.
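The three-part performance filter string used above can be read as validation set, metric and minimum value. The sketch below shows one way to split and validate such a string; it is a hypothetical helper written for illustration, not the parser used by predict_enriched.py.

```python
# Validation set names taken from the help text for the performance filter.
VALIDATION_SETS = {"tsscv", "l50so", "l50po"}


def parse_performance_filter(spec):
    """Split 'validation_set,metric,threshold' into typed components."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError("expected validation_set,metric,threshold")
    validation_set, metric, threshold = (part.strip() for part in parts)
    if validation_set not in VALIDATION_SETS:
        raise ValueError("unknown validation set: " + validation_set)
    return validation_set, metric, float(threshold)


parsed = parse_performance_filter("l50po,bedroc,0.8")
```

Here the result reads as: keep models whose BEDROC on the L50PO validation is at least 0.8.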


[1] Mervin, L. H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016) mervin2016_doi