Extended functionality

This tutorial assumes the PIDGINv4 repository is located at $PV4.

The input file test.smi is used for these examples:

COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)cc5c2cc1OC CompoundID1
COc1cc2c3CN4CCC[C@H]4[C@@H](O)c3c5ccc(O)c(OC)c5c2cc1OC CompoundID2

Generating transposed predictions

The following code will output the RF probabilities at 10μM for all human targets to a transposed file:

$ python $PV4/predict.py -f test.smi --organism "Homo sapiens" -b 10 --transpose

The script writes the output of each Random Forest classifier for all compounds into a probability matrix, where the rows are compounds and the columns are targets.
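To make the orientation of the transposed matrix concrete, here is a minimal sketch of the reshaping involved. It does not reproduce PIDGIN's own code or file format: the dictionary layout, the target names and the probability values are placeholders invented for illustration, not real predictions.

```python
# Hypothetical in-memory layout: one entry per target, one value per compound.
# Target names and probabilities are placeholders, not PIDGIN output.
targets_by_compounds = {
    "TargetA": [0.12, 0.87],
    "TargetB": [0.55, 0.03],
}
compound_ids = ["CompoundID1", "CompoundID2"]


def transpose(matrix, column_ids):
    """Turn {target: [probabilities]} into {compound: {target: probability}}."""
    return {
        compound_id: {target: probs[i] for target, probs in matrix.items()}
        for i, compound_id in enumerate(column_ids)
    }


# Rows are now compounds and columns are targets.
compounds_by_targets = transpose(targets_by_compounds, compound_ids)
```

After transposition, all predictions for a compound can be read from a single row, for example `compounds_by_targets["CompoundID1"]`.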

Increasing trees and getting the standard deviation for input compounds

The following snippet increases the minimum number of RF trees to 250 for all 0.1μM ligase targets and then calculates the standard deviation of the predictions across the 250 trees of each forest for the filtered targets:

$ python $PV4/predict.py -f test.smi --ntrees 250 --target_class Ligase --std_dev
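As a rough illustration of what a standard deviation across trees means, the sketch below summarises the per-tree outputs of one forest for one compound. The function name and the toy per-tree values are hypothetical, and using the population standard deviation is an assumption; this page does not state which estimator PIDGIN applies.

```python
import statistics


def forest_mean_and_std(per_tree_probabilities):
    """Summarise one compound's per-tree outputs for a single target.

    Assumption: population standard deviation; PIDGIN may use a
    different estimator.
    """
    mean = statistics.mean(per_tree_probabilities)
    std = statistics.pstdev(per_tree_probabilities)
    return mean, std


# Toy example: four trees that disagree completely (placeholder values).
mean, std = forest_mean_and_std([0.0, 1.0, 1.0, 0.0])
```

A large standard deviation relative to the mean indicates that the individual trees disagree, so the averaged probability should be treated with more caution.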


The maximum number of trees when generating the models was set to 250. The optimal number of trees was searched for as follows:
1. Start at 90 trees and calculate the out-of-bag (OOB) error for the forest.
2. Increase the number of trees by 10 and calculate the difference in OOB score.
3. Repeat until 1 minute of training time is reached, there has been no performance gain on two tree-increment occasions (test for convergence), or the maximum of 250 trees is reached.
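The search procedure described above can be sketched as follows. This is not the code used to build the PIDGIN models: `oob_error` is a hypothetical callback that trains a forest of the given size and returns its OOB error, the time limit is measured as wall-clock time of the whole search, and treating the "two occasions" as consecutive is an assumption.

```python
import time


def search_n_trees(oob_error, start=90, step=10, max_trees=250,
                   time_limit_s=60.0, patience=2):
    """Grow the forest in steps until one of three stopping rules fires.

    oob_error: hypothetical callable, n_trees -> OOB error (lower is better).
    Returns the forest size at which the search stopped and the best error seen.
    """
    started = time.monotonic()
    n_trees = start
    best_error = oob_error(n_trees)
    stalls = 0  # consecutive increments without improvement (assumption)
    while n_trees + step <= max_trees:
        if time.monotonic() - started >= time_limit_s:
            break  # time budget exhausted
        n_trees += step
        error = oob_error(n_trees)
        if error < best_error:
            best_error = error
            stalls = 0
        else:
            stalls += 1
            if stalls >= patience:
                break  # converged: no gain on two increments
    return n_trees, best_error


def toy_oob_error(n_trees):
    """Placeholder error curve that flattens out from 120 trees onwards."""
    return {90: 0.30, 100: 0.25, 110: 0.22}.get(n_trees, 0.20)


stopped_at, best = search_n_trees(toy_oob_error)
```

With the placeholder error curve, the search stops once two further increments beyond 120 trees bring no improvement; an error curve that keeps improving would run to the cap of 250 trees.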

Annotating predictions with known activity

The output probabilities are clipped between 0.001 and 0.999, so that the RFs never return a perfect score of 0.0 or 1.0. This makes it possible to annotate explicitly any bioactivity data that is duplicated between the input compounds and the training set, by giving known inactives a score of 0.0 and known actives a score of 1.0. To activate this functionality use the following snippet:

$ python $PV4/predict.py -f test.smi --organism Drosophila -b 100 --known_flag

This command provides predictions for all Drosophila targets with a 100μM cut-off. It also calculates the overlap between the input compounds and the training set, and annotates overlapping compounds with their known activity instead of a prediction.


This setting increases latency, since every input compound has to be compared against every training compound to check for a perfect Tanimoto coefficient (Tc) similarity of 1.0.
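The interplay between clipping and known-activity annotation can be sketched in a few lines. The function names and the `known_label` argument are hypothetical and are not part of PIDGIN; only the clipping bounds and the 0.0 and 1.0 annotation scores come from the description above.

```python
def clip_probability(p, low=0.001, high=0.999):
    """Keep a model output strictly inside the open interval (0, 1)."""
    return min(max(p, low), high)


def annotate(probability, known_label=None):
    """Return the score to report for one compound-target pair.

    known_label is a hypothetical argument: True for a known active,
    False for a known inactive, None when the compound is not found
    in the training set.
    """
    if known_label is True:
        return 1.0
    if known_label is False:
        return 0.0
    return clip_probability(probability)
```

Because predictions can never reach 0.0 or 1.0 after clipping, any exact 0.0 or 1.0 in the output unambiguously marks a training-set annotation rather than a model prediction.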

Similarity of input compounds to the active compounds in the training set

The sim_to_train.py script conducts a Tanimoto coefficient (Tc) similarity analysis between the input compounds in test.smi and the active compounds in the PIDGIN training data. This can support the interpretation of predictions by indicating which training compounds are driving them. Two files are produced. The first is a matrix similar to that of the predict_raw script above, but containing compound vs. target similarities instead of raw predictions. The second is a detailed breakdown of the nearest neighbour compounds in the training set (i.e. their affinity, their confidence and the organism they were extracted from, since orthologue bioactivity data is also used). Example of how to run the code:

$ python $PV4/sim_to_train.py -f test.smi --organism 'Mus musculus' -n 4 -b 100 --ortho

This command provides Tc similarity results for the compounds in test.smi for the organism ‘Mus musculus’ with a 100μM cut-off, including orthologue data and using 4 cores for the calculation.

Options available for this calculation:
- selection of orthologue data
- type of organism
- bioactivity threshold (0.1, 1, 10 or 100 μM)
- number of cores

For more information about specifying command line arguments, see the 'Command Line Arguments' section.
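
As a rough illustration of the Tc calculation that underlies the similarity analysis, the sketch below computes the Tanimoto coefficient on fingerprints represented as sets of on-bit indices and picks the nearest training active. It is not the sim_to_train.py implementation: the set representation, the function names and the toy fingerprints are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_active(query_bits, training_actives):
    """Return (name, Tc) of the most similar training active.

    training_actives: hypothetical mapping of compound name -> on-bit set.
    """
    return max(
        ((name, tanimoto(query_bits, bits)) for name, bits in training_actives.items()),
        key=lambda pair: pair[1],
    )


# Toy fingerprints (placeholders, not derived from real molecules).
actives = {"ActiveA": {1, 2, 3, 4}, "ActiveB": {7, 8}}
name, tc = nearest_active({1, 2, 3}, actives)
```

A Tc of 1.0 means the two fingerprints are identical, which is the condition checked when input compounds are matched against the training set.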