All-Assay-Max2 pQSAR: Activity predictions as accurate as 4-concentration IC50s for nearly 9,000 Novartis assays
Martin, Eric, Polyakov, Valery, Zhu, Xiangwei, Tian, Li, Liu, Win and Mukherjee, Prasenjit (2019) All-Assay-Max2 pQSAR: Activity predictions as accurate as 4-concentration IC50s for nearly 9,000 Novartis assays. Journal of Chemical Information and Modeling, 59 (10). pp. 4450-4459. ISSN 1549-9596, 1549-960X
Abstract
Profile-QSAR (pQSAR) is a massively multi-task, 2-step machine learning method with unprecedented scope, accuracy and applicability domain. In step 1, a “profile” of conventional single-assay random forest regression (RFR) models is trained on a very large number of biochemical and cellular pIC50 assays using Morgan 2 sub-structural fingerprints as compound descriptors. In step 2, a panel of partial least squares (PLS) models is built using the profile of pIC50 predictions from those RFR models as compound descriptors, hence the name. The method was previously described for a panel of 728 biochemical and cellular kinase assays; we have now built an enormous pQSAR from 11,805 diverse Novartis IC50 and EC50 assays. This large number of assays, and hence of compound descriptors for PLS, dictated reducing the profile to include only the RFR models whose predictions correlate with the assay being modeled. We evaluate both the RFR and pQSAR models with our “realistically novel” held-out test set, for which the median across the 11,805 assays of the average similarity to the nearest training set member was only 0.34, thus testing a realistically large applicability domain. For the 11,805 single-assay RFR models, the median correlation of prediction with experiment was only R²ext=0.05, virtually random, and only 8% of the models achieved our standard success threshold of R²ext=0.30. For pQSAR, the median correlation was R²ext=0.53, comparable to 4-concentration experimental IC50s, and 72% of the models met our R²ext>0.30 standard, totaling 8,558 successful models. The successful models included assays from all 51 annotated target sub-classes, as well as 4,196 phenotypic assays, indicating that pQSAR can be applied to virtually any disease area. Every month, all models are updated to include new measurements, and predictions are made for 5.5 million Novartis compounds, totaling 50 billion predictions. Common uses have included virtual screening, selectivity design, toxicity and promiscuity prediction, mechanism-of-action prediction, and others.
Item Type: | Article |
---|---|
Keywords: | pQSAR, full profile, reduced profile, ChEMBL, transfer learning, chance correlation, multi-task model, applicability domain |
Date Deposited: | 03 Dec 2019 00:45 |
Last Modified: | 03 Dec 2019 00:45 |
URI: | https://oak.novartis.com/id/eprint/39425 |