Getting started

This tutorial assumes the PIDGINv4 repository is located at $PV4.

Generating predictions for human targets

In this example, we will work with the input file named test.smi in the examples directory, which contains two molecules whose SMILES strings are defined as:

COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)cc5c2cc1OC CompoundID1
COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)c(OC)c5c2cc1OC CompoundID2

The following command generates the Random Forest (RF) probabilities at 1μM for all human targets for the input file:

$ python $PV4/ -f test.smi --organism "Homo sapiens" -b 1

This script writes the output of each Random Forest classifier, for every target and every compound, to a probability matrix in which the columns are compounds and the rows are targets.
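As a rough illustration of that layout (rows are targets, columns are compounds), the sketch below reads a small invented matrix and pulls out the profile of one compound. The tab-separated format, the "Target" column name, the target names and the probability values are all assumptions made for this example. The real output file may carry additional annotation columns.

```python
import csv
import io

# Invented stand-in for an output matrix. Tab separation, the "Target"
# header and all values are assumptions, not PIDGIN's documented format.
example_output = (
    "Target\tCompoundID1\tCompoundID2\n"
    "TARGET_A\t0.91\t0.40\n"
    "TARGET_B\t0.15\t0.66\n"
)

reader = csv.DictReader(io.StringIO(example_output), delimiter="\t")
rows = list(reader)

# One row per target, one column per compound: collect the column for
# CompoundID1 as a {target: probability} mapping.
compound_profile = {row["Target"]: float(row["CompoundID1"]) for row in rows}
print(compound_profile)
```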

If --organism is used, its value must match an organism as specified in uniprot_information.txt and, if it contains spaces, must be enclosed in quotes (as in the example above). The organism filter uses fuzzy matching, so --organism homo would produce a similar filtered list.

Generating binary predictions

The following command generates binary predictions at 0.1 and 1μM for all human targets, at a threshold of 0.5 (i.e. the compound was predicted active more often than inactive):

$ python $PV4/ -f test.smi --organism "Homo sapiens" -b 0.1,1 -p 0.5

Raising the threshold increases the confidence in the predictions that remain positive.


These probabilities are different from PIDGIN version 2 in that they have not been Platt-scaled, since this increased the number of false positives.
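The thresholding described above can be sketched in a few lines. This is a minimal illustration, not PIDGIN's own code: the function name binarize and the probability values are made up, and treating a probability exactly equal to the threshold as active is an assumption.

```python
def binarize(probabilities, threshold=0.5):
    """Map each RF probability to 1 (active) or 0 (inactive).

    Using >= at the boundary is an assumption for this sketch.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities for one compound across four targets.
example_probabilities = [0.12, 0.55, 0.73, 0.91]

default_calls = binarize(example_probabilities, threshold=0.5)
strict_calls = binarize(example_probabilities, threshold=0.8)

print(default_calls)
print(strict_calls)
```

With the stricter threshold, fewer targets are called active, and each remaining call is backed by a larger share of the forest.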

Decreasing applicability domain (AD) filter

To reduce the stringency of the AD filter, the --ad parameter (default: 90) can be reduced, as in the following snippet:

$ python $PV4/ -f test.smi --organism "Homo sapiens" -b 1 -p 0.5 --ad 60

In this case, the threshold for the applicability domain weights calculated across the targets has been reduced from 90% to 60%, and thus compounds that are further from the AD are now accepted.

Outputting the AD results

The following snippet calculates the weights for each of the input compounds and outputs their corresponding percentile values, so that a user can view the matrix of percentiles for each compound and accept or reject predictions at a percentile threshold without needing to re-run the predictions:

$ python $PV4/ -f test.smi --organism "Homo sapiens" -b 1 --percentile
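Once the percentiles are available, accepting or rejecting predictions at a chosen threshold is a simple comparison. The sketch below assumes that a prediction is accepted when its percentile is at or above the threshold, which matches the behaviour described for --ad. The function name and the percentile values are hypothetical.

```python
def accept_predictions(percentiles, ad_threshold=90):
    """Return, per compound, whether its AD percentile meets the threshold.

    Accepting when percentile >= threshold is an assumption for this sketch.
    """
    return {
        compound: percentile >= ad_threshold
        for compound, percentile in percentiles.items()
    }


# Hypothetical percentile values for a single target.
example_percentiles = {"CompoundID1": 95.0, "CompoundID2": 72.5}

accepted_at_90 = accept_predictions(example_percentiles, ad_threshold=90)
accepted_at_60 = accept_predictions(example_percentiles, ad_threshold=60)

print(accepted_at_90)
print(accepted_at_60)
```

Lowering the threshold from 90 to 60 admits the second compound, which sits further from the applicability domain.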

Silencing the AD filter

Setting --ad to 0 turns off the AD filter, since all predictions are then accepted, as in the following snippet:

$ python $PV4/ -f test.smi --organism "Homo sapiens" -b 1 --ad 0

Combining model filters

If the user is interested in a given target class (for example “Lipase”) then the following can be used:

$ python $PV4/ -f test.smi --organism "Homo sapiens" --target_class Lipase

Filters can be combined, for instance:

$ python $PV4/ -f test.smi --organism "Homo sapiens" --target_class GPCR --min_size 25 --performance_filter tsscv,prauc,0.7

would filter human models for GPCRs with a minimum number of 25 actives in the training set and with a minimum precision-recall AUC (PR-AUC) performance of 0.7 during time-series split cross validation (TSSCV).
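Conceptually, the combined filters act as a logical AND: a model is kept only if it passes every criterion. The sketch below illustrates this on invented model metadata. The record fields (name, organism, target_class, n_actives, tsscv_prauc), the values and the substring-based organism match are all assumptions for illustration and do not reflect PIDGIN's internal files or matching logic.

```python
# Hypothetical model metadata; field names and values are invented.
models = [
    {"name": "model_A", "organism": "Homo sapiens", "target_class": "GPCR",
     "n_actives": 120, "tsscv_prauc": 0.82},
    {"name": "model_B", "organism": "Homo sapiens", "target_class": "GPCR",
     "n_actives": 18, "tsscv_prauc": 0.91},
    {"name": "model_C", "organism": "Homo sapiens", "target_class": "Lipase",
     "n_actives": 60, "tsscv_prauc": 0.75},
    {"name": "model_D", "organism": "Rattus norvegicus", "target_class": "GPCR",
     "n_actives": 200, "tsscv_prauc": 0.88},
    {"name": "model_E", "organism": "Homo sapiens", "target_class": "GPCR",
     "n_actives": 40, "tsscv_prauc": 0.65},
]


def select_models(records, organism, target_class, min_size, min_prauc):
    """Keep the names of records that pass every filter (logical AND)."""
    return [
        record["name"]
        for record in records
        # Case-insensitive substring match stands in for fuzzy matching.
        if organism.lower() in record["organism"].lower()
        and record["target_class"] == target_class
        and record["n_actives"] >= min_size
        and record["tsscv_prauc"] >= min_prauc
    ]


selected = select_models(models, "Homo sapiens", "GPCR", 25, 0.7)
print(selected)
```

Each rejected record fails exactly one criterion here (training set size, target class, organism or PR-AUC), which shows how adding filters narrows the set of models.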

Additional criteria can be added, for instance:

$ python $PV4/ -f test.smi --organism "Rattus" -b 0.1,1 -p 0.5 --min_size 50 --se_filter --performance_filter l50po,bedroc,0.8

would filter for rat models that did not require Sphere Exclusion (SE) (i.e. a sufficient number of inactives was available), that have a minimum of 50 actives in the training set, and that reached a minimum BEDROC performance of 0.8 when leaving 50% of ChEMBL publications out of the training data over 4-fold cross validation (L50PO). The command produces a binary matrix of predictions at a probability cut-off of 0.5, for models trained with bioactivity data at thresholds of 0.1 and 1μM.