Ricci-Lopez, J., Aguila, S. A., Gilson, M. K. & Brizuela, C. A. Improving Structure-Based Virtual Screening with Ensemble Docking and Machine Learning. J. Chem. Inf. Model. acs.jcim.1c00511 (2021) doi:10.1021/ACS.JCIM.1C00511.
One of the main challenges of structure-based virtual screening (SBVS) is the incorporation of the receptor’s flexibility, as its explicit representation in every docking run implies a high computational cost. Therefore, a common alternative to include the receptor’s flexibility is the approach known as ensemble docking. Ensemble docking consists of using a set of receptor conformations and performing the docking assays over each of them. However, there is still no agreement on how to combine the ensemble docking results to obtain the final ligand ranking. A common choice is to use consensus strategies to aggregate the ensemble docking scores, but these strategies exhibit slight improvement regarding the single-structure approach. Here, we claim that using machine learning (ML) methodologies over the ensemble docking results could improve the predictive power of SBVS. To test this hypothesis, four proteins were selected as study cases: CDK2, FXa, EGFR, and HSP90. Protein conformational ensembles were built from crystallographic structures, whereas the evaluated compound library comprised up to three benchmarking data sets (DUD, DEKOIS 2.0, and CSAR-2012) and cocrystallized molecules. Ensemble docking results were processed through 30 repetitions of 4-fold cross-validation to train and validate two ML classifiers: logistic regression and gradient boosting trees. Our results indicate that the ML classifiers significantly outperform traditional consensus strategies and even the best performance case achieved with single-structure docking. We provide statistical evidence that supports the effectiveness of ML to improve the ensemble docking performance.

- Jupyter notebooks with the study's workflow. They are required to reproduce the results and figures of the study.
- Python and R scripts containing helper functions.
- Main datasets and complementary files.
We evaluated target-specific ML models for structure-based virtual screening.
The following four proteins were considered as case studies:
| # | Protein name | Directory | UniProtKB | 
|---|---|---|---|
| 1. | CDK2 | cdk2 | P24941 | 
| 2. | FXa | fxa | P00742 | 
| 3. | EGFR | egfr | P00533 | 
| 4. | HSP90 | hsp90 | P07900 | 
Each protein directory (cdk2, fxa, egfr, hsp90) has the following structure:
- 📂 1_Download_and_prepare_protein_ensembles: Download and prepare protein crystallographic structures from PDB
- 📂 2_Molecular_libraries: Download and prepare ligand molecules from benchmarking sets
- 📂 3_Protein_Ensembles_Analysis: Create and analyze the protein ensembles
- 📂 4_Ensemble_docking_results: Prepare and gather ensemble docking results
- 📂 5_Machine_Learning: Evaluate consensus strategies and ML classifiers through 30 repetitions of 4-fold cross-validation (30x4cv)
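
The evaluation idea behind the last step can be illustrated with a minimal sketch. It is not the code from the notebooks: the docking matrix is synthetic, the helper names (`make_toy_docking_scores`, `consensus_auc`, `repeated_cv_auc`) are hypothetical, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for whichever gradient boosting implementation the study relies on. For brevity, the two consensus strategies are scored on the whole toy set, whereas a fair comparison would score them on the same test folds as the classifiers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_toy_docking_scores(n_ligands=300, n_conformations=8, n_actives=30, seed=0):
    """Synthetic stand-in for an ensemble docking matrix (ligands x conformations)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n_ligands, dtype=int)
    y[:n_actives] = 1  # 1 = active, 0 = decoy
    # Docking scores: more negative means better predicted binding.
    X = rng.normal(loc=-7.0, scale=1.0, size=(n_ligands, n_conformations))
    # Assumption: actives score better, but only on half of the conformations.
    X[:, : n_conformations // 2] -= 0.8 * y[:, None]
    return X, y


def consensus_auc(X, y):
    """ROC-AUC of two simple consensus rankings over the conformations."""
    # Scores are negated so that larger values mean "more likely active".
    return {
        "mean_score": roc_auc_score(y, -X.mean(axis=1)),
        "best_score": roc_auc_score(y, -X.min(axis=1)),
    }


def repeated_cv_auc(model, X, y, n_splits=4, n_repeats=30, seed=0):
    """Test-fold ROC-AUC values from repeated stratified k-fold CV (30x4 by default)."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc")


X, y = make_toy_docking_scores()
consensus = consensus_auc(X, y)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}
cv_results = {name: repeated_cv_auc(model, X, y) for name, model in models.items()}

for name, auc in consensus.items():
    print(f"consensus/{name}: AUC = {auc:.3f}")
for name, scores in cv_results.items():
    print(f"ML/{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

Each classifier sees one feature per receptor conformation, so it can learn to weight informative conformations more heavily, which a fixed aggregation such as the mean or the best score cannot do.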
 
conda env create -f conda_environment.yml

- The above command installs all the Python libraries used in our study.
- Some of the analyses and plots were produced using R (version 4.0.3).
- The R libraries used are listed at the top of each R script inside the R_scripts directory.
- Joel Ricci-López: CICESE Research Center, Ensenada, México
- Sergio A. Aguila: CNyN, UNAM, Ensenada, México
- Michael K. Gilson: Skaggs School of Pharmacy and Pharmaceutical Sciences,
 UCSD, La Jolla, California, USA.
- Carlos A. Brizuela: CICESE Research Center, Ensenada, México
- LANCAD-UNAM-DGTIC-286 and PAPIIT-DGAPA-UNAM-IG200320 grants
- CAB and JRL acknowledge the support of CONACyT under grant A1-S-20638
- JRL was supported by the Programa de Doctorado en Nanociencias at CICESE and by CONACyT.
- The authors also thank the anonymous reviewers for their comments and thoughtful suggestions, which substantially helped to improve the manuscript.