Establishing the foundations for a data-centric AI approach for virtual drug screening through a systematic assessment of the properties of chemical data

  1. Nanyang Technological University-Imperial College London, Lee Kong Chian School of Medicine, Singapore, Singapore
  2. School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
  3. School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
  4. Center for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
  5. Center for AI in Medicine, Nanyang Technological University, Singapore, Singapore
  6. Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom

Peer review process

Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.

Editors

  • Reviewing Editor
    Alan Talevi
    National University of La Plata, La Plata, Argentina
  • Senior Editor
    Aleksandra Walczak
    École Normale Supérieure - PSL, Paris, France

Reviewer #1 (Public review):

Summary:

The work provides further evidence of the importance of data quality and representation for ligand-based virtual screening approaches. The authors applied different machine learning (ML) algorithms and data representations to a new dataset of BRAF ligands. First, the authors evaluate the ML algorithms and demonstrate that, independently of the ML algorithm, predictive and robust models can be obtained on this BRAF dataset. Second, the authors investigate how molecular representations affect the predictions of the ML algorithms. They found that, in this highly curated dataset, the different molecular representations are all adequate for the ML algorithms, since almost all of them achieve high accuracy, with E-State fingerprints producing the worst-performing predictive models and ECFP6 fingerprints producing the best classification models. Third, the authors evaluate the performance of the models on subsets of the BRAF dataset of varying composition and size. They found that, given a finite number of active compounds, increasing the number of inactive compounds worsens recall and accuracy. Finally, the authors analyze whether the use of "less active" molecules affects the models' predictive performance, using "less active" molecules taken from the ChEMBL database or decoys from DUD-E. They found that the accuracy of the models falls as the number of "less active" examples in the training dataset increases, whereas including decoys in the training set yields results as good as, or in some cases better than, the original models. However, the use of decoys in the training set worsens predictive power on test sets that contain both active and inactive molecules.
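To make the workflow described above concrete, here is a minimal, hypothetical sketch of a fingerprint-plus-classifier pipeline of this kind, assuming RDKit and scikit-learn are available; the SMILES strings and activity labels are placeholders for illustration, not the authors' curated BRAF data (ECFP6 corresponds to a Morgan fingerprint of radius 3):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

def ecfp6(smiles, n_bits=2048):
    """ECFP6 = Morgan fingerprint with radius 3, folded to n_bits."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits))

# Placeholder data: a real experiment would load curated actives/inactives.
smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",       # aspirin
    "CC(=O)Nc1ccc(O)cc1",          # paracetamol
    "CC(C)Cc1ccc(C(C)C(=O)O)cc1",  # ibuprofen
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
    "Oc1ccc2ccccc2c1",             # 2-naphthol
    "CCN(CC)CC",                   # triethylamine
    "Oc1ccccc1",                   # phenol
    "CCO",                         # ethanol
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = "active", 0 = "inactive" (arbitrary here)

X = np.array([ecfp6(s) for s in smiles])
y = np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred), "recall:", recall_score(y_te, pred))
```

The same skeleton accommodates the other experiments the reviewers describe: swapping the fingerprint function changes the representation, and varying the active/inactive counts in the training split probes the composition effects.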

Strengths:

This is a highly relevant topic in medicinal chemistry and drug discovery. The manuscript is well-written, with a clear structure that facilitates easy reading, and it includes up-to-date references. The hypotheses are clearly presented and appropriately explored. The study provides valuable insights into the importance of deriving models from high-quality data, demonstrating that, when this condition is met, complex computational methods are not always necessary to achieve predictive models. Furthermore, the generated BRAF dataset offers a valuable resource for medicinal chemists working in ligand-based virtual screening.

Weaknesses:

While the work highlights the importance of using high-quality datasets to achieve better and more generalizable results, it does not present significant novelty, as the analysis of training data has been extensively studied in chemoinformatics and medicinal chemistry. Additionally, the inclusion of "AI" in the context of data-centric AI is somewhat unclear, given that the dataset curation is conducted manually, selecting active compounds based on IC50 values from ChEMBL and inactive compounds according to the authors' criteria.

Moreover, the conclusions are based on the analysis of only two high-quality datasets. To generalize these findings, it would be beneficial to extend the analysis to additional high-quality datasets (at least 10 datasets for a robust benchmarking exercise).

A key aspect that could be improved is the definition of an "inactive" compound, which remains unclear. In the manuscript, it is stated:

• "The inactives were carefully selected based on the fact that they have no known pharmacological activity against BRAF."
Does the lack of BRAF activity data necessarily imply that these compounds are inactive?
• "We define a compound as 'inactive' if there are no known pharmacological assays for the said compound on our target, BRAF."
However, in the authors' response, they mention:
• "We selected certain compounds that we felt could not possibly be active against BRAF, such as ligands for neurotransmitter receptors, as inactives."

Given that the definition of "inactive" is one of the most critical concepts in the study, I believe it should be clearly and consistently explained.

Lastly, while statistical comparison is not routinely performed in machine learning studies, it would greatly enhance the value of this work, especially when comparing models with small differences in accuracy.
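One standard option for such comparisons is McNemar's test on the paired predictions of two models over the same test set. A hedged sketch, assuming statsmodels is available and using hypothetical prediction vectors:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical labels and predictions from two models on a shared test set.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
pred_a = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])
pred_b = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0])

a_ok = pred_a == y_true
b_ok = pred_b == y_true

# 2x2 contingency table: rows = A correct/incorrect, cols = B correct/incorrect.
# Only the discordant cells (A right/B wrong and vice versa) drive the test.
table = [[int(np.sum(a_ok & b_ok)),  int(np.sum(a_ok & ~b_ok))],
         [int(np.sum(~a_ok & b_ok)), int(np.sum(~a_ok & ~b_ok))]]

result = mcnemar(table, exact=True)  # exact binomial test, safe for small counts
print(f"statistic={result.statistic}, p-value={result.pvalue:.3f}")
```

Because the test operates on per-example agreement rather than aggregate accuracy, it can distinguish a genuine difference between models from noise even when the accuracy gap is small.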

Reviewer #2 (Public review):

Summary:

The authors explored the importance of data quality and representation for ligand-based virtual screening approaches. I believe the results could benefit the drug discovery community, especially scientists working in the field of machine learning applied to drug research. The in silico design is comprehensive and adequate for the proposed comparisons.

This manuscript by Chong A. et al. argues that it is not necessary to resort to sophisticated deep learning algorithms for virtual screening, since, based on their results, conventional ML may perform exceptionally well when fed the right data and molecular representations.

The article is interesting and well-written. The overview of the field and the warning about dataset composition are very well thought out and should be of interest to a broad segment of the AI in drug discovery readership. This article further highlights some of the considerations that need to be taken into account when implementing data-centric AI for computer-aided drug design methods.

Strengths:

This study contributes significantly to the field of machine learning and data curation in drug discovery. The paper is, in general, well written and structured. However, I have some suggestions regarding certain aspects of the data analyses.

Weaknesses:

The conclusions drawn in the study are based on the analysis of only two datasets. The authors chose BRAF as an example in this study and expanded the analysis with a BACE-1 dataset; however, a benchmark spanning several targets would be needed to evaluate the reproducibility and transferability of the method. One concern is the applicability of the method to other targets.

Reviewer #3 (Public review):

Summary:

The authors presented a data-centric ML approach for virtual ligand screening. They used BRAF as an example to demonstrate the predictive power of their approach.

Strengths:

The performance of the predictive models in this study is superior (nearly perfect) with respect to existing methods.

Comments on revisions:

In the revised manuscript, the presented approach has been robustly tested and can be very useful for ligand prediction.

Author response:

The following is the authors’ response to the original reviews.

We thank the Editors and reviewers for their candid evaluation of our work. It was suggested that we demonstrate the validity of our approach on perhaps 10 different datasets, but we felt that this would place an undue burden on our resources. It generally takes us about 4 to 6 months to build a dataset, and this does not include the time taken to train and test our AI models. Providing 10 different datasets would therefore take another 3 to 5 years of work to complete this research project. Publishing research based on one dataset is certainly not unheard of: for example, Subramanian et al. (2016) published their widely cited benchmark dataset for just BACE1 inhibitors. We hope, however, that the additional work, in which we showed that we could improve the benchmark dataset for BACE1 inhibitors and achieve the same high level of predictive performance on this dataset, will convince readers (and reviewers) of the reproducibility of our approach. Furthermore, we also showed that our approach is robust and does not rely on a large volume of data to achieve this near-perfect accuracy. As can be seen in the Supplemental section, even our AI models trained on ONLY 250 BRAF actives and 250 inactives achieved 96.3% accuracy! Logically, if the model is robust then we would expect it to be reproducible. As such, we do not feel it is necessary to test our approach on 10 different datasets.
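A minimal sketch of this kind of subsampling experiment, for readers who wish to reproduce it on their own data: synthetic stand-in fingerprints are used here (so the numbers printed are illustrative only), and scikit-learn is assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 1024))  # stand-in for real fingerprint bits
y = np.repeat([1, 0], 1000)                # 1000 "actives", 1000 "inactives"

# Hold out a fixed test set, then retrain on ever-smaller balanced subsets.
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
for n_per_class in (50, 100, 250, 500):
    idx = np.concatenate([
        rng.choice(np.where(y_pool == c)[0], size=n_per_class, replace=False)
        for c in (0, 1)
    ])
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X_pool[idx], y_pool[idx])
    print(f"{n_per_class} per class -> test accuracy: {clf.score(X_te, y_te):.3f}")
```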

It was also suggested that we expand this study to other types of molecular representations to give a better idea of generalizability. We would like to point out that we tested, in total, 55 single fingerprints and paired combinations. Our goal was to create an approach that gives superior performance for virtual screening, and we believe we have achieved this. Based on the results of our study, we are of the opinion that molecular representations do not, in general, have an outsized effect on AI virtual screening. Certain molecular representations may give SLIGHTLY better performance, but with the exception of the 79-bit E-State fingerprint (which could still achieve an impressive 85% accuracy for the SVM model), nearly all molecular fingerprints and paired combinations that we used achieved an accuracy above 97%. Therefore, we do not share the reviewers' concern that our approach may not be useful when applied with other types of molecular representations.
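For readers unfamiliar with fingerprint pairing, one common construction is simply to concatenate two bit vectors into a single feature vector. The sketch below pairs ECFP6 with the 167-bit MACCS keys via RDKit; this particular pairing is chosen for illustration and is an assumption, not necessarily one of the 55 combinations tested in the paper.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

def paired_fp(smiles):
    """Concatenate ECFP6 (2048 bits) with MACCS keys (167 bits)."""
    mol = Chem.MolFromSmiles(smiles)
    ecfp6 = np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=2048))
    maccs = np.array(MACCSkeys.GenMACCSKeys(mol))
    return np.concatenate([ecfp6, maccs])  # 2215-bit combined representation

print(paired_fp("CC(=O)Oc1ccccc1C(=O)O").shape)  # aspirin -> (2215,)
```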

It is true that our work involved manual curation of the datasets, but the goal of this paper is to lay down some ground rules for the future development of a data-centric AI approach. Although manual curation is routine practice in AI/ML, it should be recognised that there is good manual curation and bad manual curation, and rules need to be established to ensure good manual curation. Without these rules, we would not be able to establish and train a data-centric AI. All manual curation involves a level of subjectivity, but that subjectivity comes from one's experience and domain knowledge of the field in which the AI is being applied. In the case of this study, for example, we relied on our knowledge and understanding of pharmacology to determine whether a compound is pharmacologically active or inactive. This may seem somewhat arbitrary to the uninitiated, but it is anything but arbitrary: it is through careful thought and assessment of the chemical compounds that we chose them for training the AI. Unfortunately, this sort of subjective assessment cannot be easily or completely explained, but we do show where current practices have failed when building a dataset for training an AI for virtual screening.
