Three descriptor model can predict 55% of the CSAR-NRC HiQ benchmark dataset
Kramer, Christian and Gedeck, Peter (2011) Three descriptor model can predict 55% of the CSAR-NRC HiQ benchmark dataset. Journal of Chemical Information and Modeling, 51 (9). pp. 2139-2145. ISSN 1549-9596
Abstract
Here we report the results we obtained with a proteochemometric approach for predicting ligand binding free energies of the CSAR-NRC HiQ benchmark data set. Using distance-dependent atom-type pair descriptors in a bagged stepwise multiple-linear regression (MLR) model with subsequent complexity reduction, we were able to identify three descriptors that can be used to build a very robust regression model for the CSAR-NRC HiQ data set. The model has an R²(cv) of 0.55, a MUE(cv) of 1.19, and an RMSE(cv) of 1.49 on the out-of-bag test set. The descriptors selected are the count of protein atoms in a shell between 4.5 Å and 6 Å around each heavy ligand atom excluding oxygen and phosphorus, the count of sulfur atoms in the vicinity of tryptophan, and the count of aliphatic ligand hydroxy hydrogens. The first two descriptors have a positive sign, indicating that they contribute favorably to the binding energy, whereas the count of hydroxy hydrogens contributes unfavorably to the binding free energy observed. The fact that such a simple model can be so effective raises a couple of questions that are addressed in the article.
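As a rough illustration of the first descriptor named in the abstract, the sketch below counts ligand-protein atom pairs whose distance falls in the 4.5 Å to 6 Å shell, skipping ligand oxygen and phosphorus atoms (and hydrogens, since only heavy atoms are considered). This is not the authors' code: the function name `shell_pair_count`, the input layout, the half-open interval at the shell boundaries, and the choice to sum pair counts over all selected ligand atoms are assumptions made for illustration only.

```python
from math import dist

def shell_pair_count(ligand_atoms, protein_xyz, r_min=4.5, r_max=6.0,
                     excluded=frozenset({"H", "O", "P"})):
    """Count (ligand atom, protein atom) pairs with r_min <= distance < r_max.

    ligand_atoms: iterable of (element_symbol, (x, y, z)) tuples (hypothetical layout).
    protein_xyz:  iterable of (x, y, z) coordinates of protein atoms, in angstroms.
    excluded:     ligand elements to skip; hydrogens are skipped because the
                  descriptor refers to heavy ligand atoms only.
    """
    count = 0
    for element, xyz in ligand_atoms:
        if element in excluded:
            continue  # ligand O, P, and H atoms do not contribute
        for p in protein_xyz:
            if r_min <= dist(xyz, p) < r_max:  # boundary handling is an assumption
                count += 1
    return count

# Toy coordinates, invented purely to exercise the function.
ligand = [("C", (0.0, 0.0, 0.0)), ("O", (0.0, 0.0, 1.4))]
protein = [(5.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 5.5, 0.0), (0.0, 0.0, 7.0)]
print(shell_pair_count(ligand, protein))  # prints 2
```

In the toy input, only the carbon atom is counted, and two protein atoms lie at 5.0 Å and 5.5 Å from it; the oxygen atom is skipped entirely. A real implementation would read coordinates from a structure file and would need to match the atom typing used in the article.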
Item Type: | Article |
---|---|
Additional Information: | archiving not formally supported by this publisher |
Date Deposited: | 13 Oct 2015 13:15 |
Last Modified: | 13 Oct 2015 13:15 |
URI: | https://oak.novartis.com/id/eprint/4271 |