Deep learning models that leverage large datasets are often the state of the art for modelling molecular properties. When the datasets are smaller (< 2000 molecules), it is not clear that deep learning approaches are the right modelling tool. In this work we perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets. Using different molecular representations and models, we analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets. We also introduce two simulated experiments that evaluate their performance: (1) Bayesian optimization guided molecular design, (2) inference on out-of-distribution data via ablated cluster splits. We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments. We have packaged our analysis into the DIONYSUS repository, which is open sourced to aid in reproducibility and extension to new datasets.
translated by 谷歌翻译
贝叶斯优化(BO)是用于全局优化昂贵的黑盒功能的流行范式,但是在许多域中,该函数并不完全是黑色框。数据可能具有一些已知的结构(例如对称性)和/或数据生成过程可能是一个复合过程,除优化目标的值外,还可以产生有用的中间或辅助信息。但是,传统上使用的代孕模型,例如高斯工艺(GPS),随数据集大小的规模较差,并且不容易适应已知的结构。取而代之的是,我们使用贝叶斯神经网络,这是具有感应偏见的一类可扩展和灵活的替代模型,将BO扩展到具有高维度的复杂,结构化问题。我们证明了BO在物理和化学方面的许多现实问题,包括使用卷积神经网络对光子晶体材料进行拓扑优化,以及使用图神经网络对分子进行化学性质优化。在这些复杂的任务上,我们表明,就抽样效率和计算成本而言,神经网络通常优于GP作为BO的替代模型。
translated by 谷歌翻译
可拍照的分子显示了可以使用光访问的两个或多个异构体形式。将这些异构体的电子吸收带分开是选择性解决特定异构体并达到高光稳态状态的关键,同时总体红色转移带来的吸收带可以限制因紫外线暴露而限制材料损害,并增加了光疗法应用中的渗透深度。但是,通过合成设计将这些属性工程为系统仍然是一个挑战。在这里,我们提出了一条数据驱动的发现管道,用于由数据集策划和使用高斯过程的多任务学习支撑的分子照片开关。在对电子过渡波长的预测中,我们证明了使用来自四个Photoswitch转变波长的标签训练的多输出高斯过程(MOGP)产生相对于单任务模型的最强预测性能,并且在操作上超过了时间依赖时间依赖性的密度理论(TD) -dft)就预测的墙壁锁定时间而言。我们通过筛选可商购的可拍摄分子库来实验验证我们提出的方法。通过此屏幕,我们确定了几个图案,这些基序显示了它们的异构体的分离电子吸收带,表现出红移的吸收,并且适用于信息传输和光电学应用。我们的策划数据集,代码以及所有型号均可在https://github.com/ryan-rhys/the-photoswitch-dataset上提供
translated by 谷歌翻译
基于机器学习的数据驱动方法具有加速原子结构的计算分析。在这种情况下,可靠的不确定性估计对于评估对预测和实现决策的信心很重要。然而,机器学习模型可以产生严重校准的不确定性估计,因此仔细检测和处理不确定性至关重要。在这项工作中,我们扩展了一种消息,该消息通过神经网络,专门用于预测分子和材料的性质,具有校准的概率预测分布。本文提出的方法与先前的工作不同,通过考虑统一框架中的炼体和认知的不确定性,并通过重新校准未经证明数据的预测分布。通过计算机实验,我们表明我们的方法导致准确的模型,用于预测两种公共分子基准数据集,QM9和PC9的训练数据分布良好的分子形成能量。该方法提供了一种用于训练和评估神经网络集合模型的一般框架,该模型能够产生具有良好校准的不确定性估计的分子性质的准确预测。
translated by 谷歌翻译
在三维分子结构上运行的计算方法有可能解决生物学和化学的重要问题。特别地,深度神经网络的重视,但它们在生物分子结构域中的广泛采用受到缺乏系统性能基准或统一工具包的限制,用于与分子数据相互作用。为了解决这个问题,我们呈现Atom3D,这是一个新颖的和现有的基准数据集的集合,跨越几个密钥的生物分子。我们为这些任务中的每一个实施多种三维分子学习方法,并表明它们始终如一地提高了基于单维和二维表示的方法的性能。结构的具体选择对于性能至关重要,具有涉及复杂几何形状的任务的三维卷积网络,在需要详细位置信息的系统中表现出良好的图形网络,以及最近开发的设备越多的网络显示出显着承诺。我们的结果表明,许多分子问题符合三维分子学习的增益,并且有可能改善许多仍然过分曝光的任务。为了降低进入并促进现场进一步发展的障碍,我们还提供了一套全面的DataSet处理,模型培训和在我们的开源ATOM3D Python包中的评估工具套件。所有数据集都可以从https://www.atom3d.ai下载。
translated by 谷歌翻译
Uncertainty quantification (UQ) has increasing importance in building robust high-performance and generalizable materials property prediction models. It can also be used in active learning to train better models by focusing on getting new training data from uncertain regions. There are several categories of UQ methods each considering different types of uncertainty sources. Here we conduct a comprehensive evaluation on the UQ methods for graph neural network based materials property prediction and evaluate how they truly reflect the uncertainty that we want in error bound estimation or active learning. Our experimental results over four crystal materials datasets (including formation energy, adsorption energy, total energy, and band gap properties) show that the popular ensemble methods for uncertainty estimation is NOT the best choice for UQ in materials property prediction. For the convenience of the community, all the source code and data sets can be accessed freely at \url{https://github.com/usccolumbia/materialsUQ}.
translated by 谷歌翻译
定量探索了量子化学参考数据的训练神经网络(NNS)预测的不确定性量化的价值。为此,适当地修改了Physnet NN的体系结构,并使用不同的指标评估所得模型,以量化校准,预测质量以及预测误差和预测的不确定性是否可以相关。 QM9数据库培训的结果以及分布内外的测试集的数据表明,错误和不确定性与线性无关。结果阐明了噪声和冗余使分子的性质预测复杂化,即使在发生变化的情况下,例如在两个原本相同的分子中的双键迁移 - 很小。然后将模型应用于互变异反应的真实数据库。分析特征空间中的成员之间的距离与其他参数结合在一起表明,训练数据集中的冗余信息会导致较大的差异和小错误,而存在相似但非特定的信息的存在会返回大错误,但差异很小。例如,这是对含硝基的脂肪族链的观察到的,尽管训练集包含了与芳香族分子结合的硝基组的几个示例,但这些预测很困难。这强调了训练数据组成的重要性,并提供了化学洞察力,以了解这如何影响ML模型的预测能力。最后,提出的方法可用于通过主动学习优化基于信息的化学数据库改进目标应用程序。
translated by 谷歌翻译
人工智能(AI)已被广泛应用于药物发现中,其主要任务是分子财产预测。尽管分子表示学习中AI技术的繁荣,但尚未仔细检查分子性质预测的一些关键方面。在这项研究中,我们对三个代表性模型,即随机森林,莫尔伯特和格罗弗进行了系统比较,该模型分别利用了三个主要的分子表示,扩展连接的指纹,微笑的字符串和分子图。值得注意的是,莫尔伯特(Molbert)和格罗弗(Grover)以自我监督的方式在大规模的无标记分子库中进行了预定。除了常用的分子基准数据集外,我们还组装了一套与阿片类药物相关的数据集进行下游预测评估。我们首先对标签分布和结构分析进行了数据集分析;我们还检查了阿片类药物相关数据集中的活动悬崖问题。然后,我们培训了4,320个预测模型,并评估了学习表示的有用性。此外,我们通过研究统计测试,评估指标和任务设置的效果来探索模型评估。最后,我们将化学空间的概括分解为施加间和支柱内的概括,并测量了预测性能,以评估两种设置下模型的普遍性。通过采取这种喘息,我们反映了分子财产预测的基本关键方面,希望在该领域带来更好的AI技术的意识。
translated by 谷歌翻译
蛋白质 - 配体相互作用(PLIS)是生化研究的基础,其鉴定对于估计合理治疗设计的生物物理和生化特性至关重要。目前,这些特性的实验表征是最准确的方法,然而,这是非常耗时和劳动密集型的。在这种情况下已经开发了许多计算方法,但大多数现有PLI预测大量取决于2D蛋白质序列数据。在这里,我们提出了一种新颖的并行图形神经网络(GNN),以集成PLI预测的知识表示和推理,以便通过专家知识引导的深度学习,并通过3D结构数据通知。我们开发了两个不同的GNN架构,GNNF是采用不同特种的基础实现,以增强域名认识,而GNNP是一种新颖的实现,可以预测未经分子间相互作用的先验知识。综合评价证明,GNN可以成功地捕获配体和蛋白质3D结构之间的二元相互作用,对于GNNF的测试精度和0.958,用于预测蛋白质 - 配体络合物的活性。这些模型进一步适用于回归任务以预测实验结合亲和力,PIC50对于药物效力和功效至关重要。我们在实验亲和力上达到0.66和0.65的Pearson相关系数,分别在PIC50和GNNP上进行0.50和0.51,优于基于2D序列的模型。我们的方法可以作为可解释和解释的人工智能(AI)工具,用于预测活动,效力和铅候选的生物物理性质。为此,我们通过筛选大型复合库并将我们的预测与实验测量数据进行比较来展示GNNP对SARS-COV-2蛋白靶标的实用性。
translated by 谷歌翻译
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule.However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has
translated by 谷歌翻译
Molecular machine learning has been maturing rapidly over the last few years.Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem
translated by 谷歌翻译
大多数机器学习算法由一个或多个超参数配置,必须仔细选择并且通常会影响性能。为避免耗时和不可递销的手动试验和错误过程来查找性能良好的超参数配置,可以采用各种自动超参数优化(HPO)方法,例如,基于监督机器学习的重新采样误差估计。本文介绍了HPO后,本文审查了重要的HPO方法,如网格或随机搜索,进化算法,贝叶斯优化,超带和赛车。它给出了关于进行HPO的重要选择的实用建议,包括HPO算法本身,性能评估,如何将HPO与ML管道,运行时改进和并行化结合起来。这项工作伴随着附录,其中包含关于R和Python的特定软件包的信息,以及用于特定学习算法的信息和推荐的超参数搜索空间。我们还提供笔记本电脑,这些笔记本展示了这项工作的概念作为补充文件。
translated by 谷歌翻译
Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.
translated by 谷歌翻译
对于大型小分子的大型库,在考虑一系列疾病模型,测定条件和剂量范围时,详尽的组合化学筛选变得不可行。深度学习模型已实现了硅的最终技术,以预测协同得分。但是,药物组合的数据库对协同剂有偏见,这些结果不一定会概括分布不足。我们采用了使用深度学习模型的顺序模型优化搜索来快速发现与癌细胞系相比的协同药物组合,而与详尽的评估相比,筛查要少得多。在仅3轮ML引导的体外实验(包括校准圆圈)之后,我们发现,对高度协同组合进行了查询的一组药物对。进行了另外两轮ML引导实验,以确保趋势的可重复性。值得注意的是,我们重新发现药物组合后来证实将在临床试验中研究。此外,我们发现仅使用结构信息生成的药物嵌入开始反映作用机理。
translated by 谷歌翻译
The accurate prediction of physicochemical properties of chemical compounds in mixtures (such as the activity coefficient at infinite dilution $\gamma_{ij}^\infty$) is essential for developing novel and more sustainable chemical processes. In this work, we analyze the performance of previously-proposed GNN-based models for the prediction of $\gamma_{ij}^\infty$, and compare them with several mechanistic models in a series of 9 isothermal studies. Moreover, we develop the Gibbs-Helmholtz Graph Neural Network (GH-GNN) model for predicting $\ln \gamma_{ij}^\infty$ of molecular systems at different temperatures. Our method combines the simplicity of a Gibbs-Helmholtz-derived expression with a series of graph neural networks that incorporate explicit molecular and intermolecular descriptors for capturing dispersion and hydrogen bonding effects. We have trained this model using experimentally determined $\ln \gamma_{ij}^\infty$ data of 40,219 binary-systems involving 1032 solutes and 866 solvents, overall showing superior performance compared to the popular UNIFAC-Dortmund model. We analyze the performance of GH-GNN for continuous and discrete inter/extrapolation and give indications for the model's applicability domain and expected accuracy. In general, GH-GNN is able to produce accurate predictions for extrapolated binary-systems if at least 25 systems with the same combination of solute-solvent chemical classes are contained in the training set and a similarity indicator above 0.35 is also present. This model and its applicability domain recommendations have been made open-source at https://github.com/edgarsmdn/GH-GNN.
translated by 谷歌翻译
超参数优化构成了典型的现代机器学习工作流程的很大一部分。这是由于这样一个事实,即机器学习方法和相应的预处理步骤通常只有在正确调整超参数时就会产生最佳性能。但是在许多应用中,我们不仅有兴趣仅仅为了预测精度而优化ML管道;确定最佳配置时,必须考虑其他指标或约束,从而导致多目标优化问题。由于缺乏知识和用于多目标超参数优化的知识和容易获得的软件实现,因此通常在实践中被忽略。在这项工作中,我们向读者介绍了多个客观超参数优化的基础知识,并激励其在应用ML中的实用性。此外,我们从进化算法和贝叶斯优化的领域提供了现有优化策略的广泛调查。我们说明了MOO在几个特定ML应用中的实用性,考虑了诸如操作条件,预测时间,稀疏,公平,可解释性和鲁棒性之类的目标。
translated by 谷歌翻译
人工智能(AI)在过去十年中一直在改变药物发现的实践。各种AI技术已在广泛的应用中使用,例如虚拟筛选和药物设计。在本调查中,我们首先概述了药物发现,并讨论了相关的应用,可以减少到两个主要任务,即分子性质预测和分子产生。然后,我们讨论常见的数据资源,分子表示和基准平台。此外,为了总结AI在药物发现中的进展情况,我们介绍了在调查的论文中包括模型架构和学习范式的相关AI技术。我们预计本调查将作为有兴趣在人工智能和药物发现界面工作的研究人员的指南。我们还提供了GitHub存储库(HTTPS:///github.com/dengjianyuan/survey_survey_au_drug_discovery),其中包含文件和代码,如适用,作为定期更新的学习资源。
translated by 谷歌翻译
The process of screening molecules for desirable properties is a key step in several applications, ranging from drug discovery to material design. During the process of drug discovery specifically, protein-ligand docking, or chemical docking, is a standard in-silico scoring technique that estimates the binding affinity of molecules with a specific protein target. Recently, however, as the number of virtual molecules available to test has rapidly grown, these classical docking algorithms have created a significant computational bottleneck. We address this problem by introducing Deep Surrogate Docking (DSD), a framework that applies deep learning-based surrogate modeling to accelerate the docking process substantially. DSD can be interpreted as a formalism of several earlier surrogate prefiltering techniques, adding novel metrics and practical training practices. Specifically, we show that graph neural networks (GNNs) can serve as fast and accurate estimators of classical docking algorithms. Additionally, we introduce FiLMv2, a novel GNN architecture which we show outperforms existing state-of-the-art GNN architectures, attaining more accurate and stable performance by allowing the model to filter out irrelevant information from data more efficiently. Through extensive experimentation and analysis, we show that the DSD workflow combined with the FiLMv2 architecture provides a 9.496x speedup in molecule screening with a <3% recall error rate on an example docking task. Our open-source code is available at https://github.com/ryienh/graph-dock.
translated by 谷歌翻译
Graph classification is an important area in both modern research and industry. Multiple applications, especially in chemistry and novel drug discovery, encourage rapid development of machine learning models in this area. To keep up with the pace of new research, proper experimental design, fair evaluation, and independent benchmarks are essential. Design of strong baselines is an indispensable element of such works. In this thesis, we explore multiple approaches to graph classification. We focus on Graph Neural Networks (GNNs), which emerged as a de facto standard deep learning technique for graph representation learning. Classical approaches, such as graph descriptors and molecular fingerprints, are also addressed. We design fair evaluation experimental protocol and choose proper datasets collection. This allows us to perform numerous experiments and rigorously analyze modern approaches. We arrive to many conclusions, which shed new light on performance and quality of novel algorithms. We investigate application of Jumping Knowledge GNN architecture to graph classification, which proves to be an efficient tool for improving base graph neural network architectures. Multiple improvements to baseline models are also proposed and experimentally verified, which constitutes an important contribution to the field of fair model comparison.
translated by 谷歌翻译
虽然最近在许多科学领域都变得无处不在,但对其评估的关注较少。对于分子生成模型,最先进的是孤立或与其输入有关的输出。但是,它们的生物学和功能特性(例如配体 - 靶标相互作用)尚未得到解决。在这项研究中,提出了一种新型的生物学启发的基准,用于评估分子生成模型。具体而言,设计了三个不同的参考数据集,并引入了与药物发现过程直接相关的一组指标。特别是我们提出了一个娱乐指标,将药物目标亲和力预测和分子对接应用作为评估生成产量的互补技术。虽然所有三个指标均在测试的生成模型中均表现出一致的结果,但对药物目标亲和力结合和分子对接分数进行了更详细的比较,表明单峰预测器可能会导致关于目标结合在分子水平和多模式方法的错误结论,而多模式的方法是错误的结论。因此优选。该框架的关键优点是,它通过明确关注配体 - 靶标相互作用,将先前的物理化学域知识纳入基准测试过程,从而创建了一种高效的工具,不仅用于评估分子生成型输出,而且还用于丰富富含分子生成的输出。一般而言,药物发现过程。
translated by 谷歌翻译