We introduce an end-to-end computational framework that enables hyperparameter optimization with the DeepHyper library, accelerated training, and interpretable AI inference with a suite of state-of-the-art AI models, including CGCNN, PhysNet, SchNet, MPNN, MPNN-transformer, and TorchMD-Net. We use these AI models and the benchmark QM9, hMOF, and MD17 datasets to showcase the prediction of user-specified materials properties in modern computing environments, and to demonstrate translational applications for the modeling of small molecules, crystals and metal organic frameworks with a unified, stand-alone framework. We deployed and tested this framework in the ThetaGPU supercomputer at the Argonne Leadership Computing Facility, and the Delta supercomputer at the National Center for Supercomputing Applications to provide researchers with modern tools to conduct accelerated AI-driven discovery in leadership class computing environments.
translated by 谷歌翻译
分子动力学(MD)仿真是一种强大的工具,用于了解物质的动态和结构。由于MD的分辨率是原子尺度,因此实现了使用飞秒集成的长时间模拟非常昂贵。在每个MD步骤中,执行许多可以学习和避免的冗余计算。这些冗余计算可以由像图形神经网络(GNN)的深度学习模型代替和建模。在这项工作中,我们开发了一个GNN加速分子动力学(GAMD)模型,实现了快速准确的力预测,并产生与经典MD模拟一致的轨迹。我们的研究结果表明,Gamd可以准确地预测两个典型的分子系统,Lennard-Jones(LJ)颗粒和水(LJ +静电)的动态。 GAMD的学习和推理是不可知论的,它可以在测试时间缩放到更大的系统。我们还进行了一项全面的基准测试,将GAMD的实施与生产级MD软件进行了比较,我们展示了GAMD在大规模模拟上对它们具有竞争力。
translated by 谷歌翻译
这项工作介绍了神经性等因素的外部潜力(NEQUIP),E(3) - 用于学习分子动力学模拟的AB-INITIO计算的用于学习网状体电位的e(3)的神经网络方法。虽然大多数当代对称的模型使用不变的卷曲,但仅在标量上采取行动,Nequip采用E(3) - 几何张量的相互作用,举起Quivariant卷曲,导致了更多的信息丰富和忠实的原子环境代表。该方法在挑战和多样化的分子和材料集中实现了最先进的准确性,同时表现出显着的数据效率。 Nequip优先于现有型号,最多三个数量级的培训数据,挑战深度神经网络需要大量培训套装。该方法的高数据效率允许使用高阶量子化学水平的理论作为参考的精确潜力构建,并且在长时间尺度上实现高保真分子动力学模拟。
translated by 谷歌翻译
SchNetPack is a versatile neural networks toolbox that addresses both the requirements of method development and application of atomistic machine learning. Version 2.0 comes with an improved data pipeline, modules for equivariant neural networks as well as a PyTorch implementation of molecular dynamics. An optional integration with PyTorch Lightning and the Hydra configuration framework powers a flexible command-line interface. This makes SchNetPack 2.0 easily extendable with custom code and ready for complex training task such as generation of 3d molecular structures.
translated by 谷歌翻译
The accurate prediction of physicochemical properties of chemical compounds in mixtures (such as the activity coefficient at infinite dilution $\gamma_{ij}^\infty$) is essential for developing novel and more sustainable chemical processes. In this work, we analyze the performance of previously-proposed GNN-based models for the prediction of $\gamma_{ij}^\infty$, and compare them with several mechanistic models in a series of 9 isothermal studies. Moreover, we develop the Gibbs-Helmholtz Graph Neural Network (GH-GNN) model for predicting $\ln \gamma_{ij}^\infty$ of molecular systems at different temperatures. Our method combines the simplicity of a Gibbs-Helmholtz-derived expression with a series of graph neural networks that incorporate explicit molecular and intermolecular descriptors for capturing dispersion and hydrogen bonding effects. We have trained this model using experimentally determined $\ln \gamma_{ij}^\infty$ data of 40,219 binary-systems involving 1032 solutes and 866 solvents, overall showing superior performance compared to the popular UNIFAC-Dortmund model. We analyze the performance of GH-GNN for continuous and discrete inter/extrapolation and give indications for the model's applicability domain and expected accuracy. In general, GH-GNN is able to produce accurate predictions for extrapolated binary-systems if at least 25 systems with the same combination of solute-solvent chemical classes are contained in the training set and a similarity indicator above 0.35 is also present. This model and its applicability domain recommendations have been made open-source at https://github.com/edgarsmdn/GH-GNN.
translated by 谷歌翻译
图形卷积神经网络(GCNN)是材料科学中流行的深度学习模型(DL)模型,可从分子结构的图表中预测材料特性。训练针对分子设计的准确而全面的GCNN替代物需要大规模的图形数据集,并且通常是一个耗时的过程。 GPU和分布计算的最新进展为有效降低GCNN培训的计算成本开辟了道路。但是,高性能计算(HPC)资源进行培训的有效利用需要同时优化大型数据管理和可扩展的随机批处理优化技术。在这项工作中,我们专注于在HPC系统上构建GCNN模型,以预测数百万分子的材料特性。我们使用Hydragnn,我们的内部库进行大规模GCNN培训,利用Pytorch中的分布数据并行性。我们使用Adios(高性能数据管理框架)来有效存储和读取大分子图数据。我们在两个开源大规模图数据集上进行并行训练,以构建一个称为Homo-Lumo Gap的重要量子属性的GCNN预测指标。我们衡量在两个DOE超级计算机上的方法的可伸缩性,准确性和收敛性:橡树岭领导力计算设施(OLCF)的峰会超级计算机和国家能源研究科学计算中心(NERSC)的Perlmutter系统。我们通过HydragnN表示我们的实验结果,显示I)与常规方法相比,将数据加载时间降低了4.2倍,而II)线性缩放性能在峰会和Perlmutter上均可训练高达1,024 GPU。
translated by 谷歌翻译
Molecular machine learning has been maturing rapidly over the last few years.Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem
translated by 谷歌翻译
在三维分子结构上运行的计算方法有可能解决生物学和化学的重要问题。特别地,深度神经网络的重视,但它们在生物分子结构域中的广泛采用受到缺乏系统性能基准或统一工具包的限制,用于与分子数据相互作用。为了解决这个问题,我们呈现Atom3D,这是一个新颖的和现有的基准数据集的集合,跨越几个密钥的生物分子。我们为这些任务中的每一个实施多种三维分子学习方法,并表明它们始终如一地提高了基于单维和二维表示的方法的性能。结构的具体选择对于性能至关重要,具有涉及复杂几何形状的任务的三维卷积网络,在需要详细位置信息的系统中表现出良好的图形网络,以及最近开发的设备越多的网络显示出显着承诺。我们的结果表明,许多分子问题符合三维分子学习的增益,并且有可能改善许多仍然过分曝光的任务。为了降低进入并促进现场进一步发展的障碍,我们还提供了一套全面的DataSet处理,模型培训和在我们的开源ATOM3D Python包中的评估工具套件。所有数据集都可以从https://www.atom3d.ai下载。
translated by 谷歌翻译
电子密度$ \ rho(\ vec {r})$是用密度泛函理论(dft)计算地面能量的基本变量。除了总能量之外,$ \ rho(\ vec {r})$分布和$ \ rho(\ vec {r})$的功能通常用于捕获电子规模以功能材料和分子中的关键物理化学现象。方法提供对$ \ rho(\ vec {r})的可紊乱系统,其具有少量计算成本的复杂无序系统可以是对材料相位空间的加快探索朝向具有更好功能的新材料的逆设计的游戏更换者。我们为预测$ \ rho(\ vec {r})$。该模型基于成本图形神经网络,并且在作为消息传递图的一部分的特殊查询点顶点上预测了电子密度,但仅接收消息。该模型在多个数据组中进行测试,分子(QM9),液体乙烯碳酸酯电解质(EC)和Lixniymnzco(1-Y-Z)O 2锂离子电池阴极(NMC)。对于QM9分子,所提出的模型的准确性超过了从DFT获得的$ \ Rho(\ vec {r})$中的典型变异性,以不同的交换相关功能,并显示超出最先进的准确性。混合氧化物(NMC)和电解质(EC)数据集更好的精度甚至更好。线性缩放模型同时探测成千上万点的能力允许计算$ \ Rho(\ vec {r})$的大型复杂系统,比DFT快于允许筛选无序的功能材料。
translated by 谷歌翻译
定量探索了量子化学参考数据的训练神经网络(NNS)预测的不确定性量化的价值。为此,适当地修改了Physnet NN的体系结构,并使用不同的指标评估所得模型,以量化校准,预测质量以及预测误差和预测的不确定性是否可以相关。 QM9数据库培训的结果以及分布内外的测试集的数据表明,错误和不确定性与线性无关。结果阐明了噪声和冗余使分子的性质预测复杂化,即使在发生变化的情况下,例如在两个原本相同的分子中的双键迁移 - 很小。然后将模型应用于互变异反应的真实数据库。分析特征空间中的成员之间的距离与其他参数结合在一起表明,训练数据集中的冗余信息会导致较大的差异和小错误,而存在相似但非特定的信息的存在会返回大错误,但差异很小。例如,这是对含硝基的脂肪族链的观察到的,尽管训练集包含了与芳香族分子结合的硝基组的几个示例,但这些预测很困难。这强调了训练数据组成的重要性,并提供了化学洞察力,以了解这如何影响ML模型的预测能力。最后,提出的方法可用于通过主动学习优化基于信息的化学数据库改进目标应用程序。
translated by 谷歌翻译
计算催化和机器学习社区在开发用于催化剂发现和设计的机器学习模型方面取得了长足的进步。然而,跨越催化的化学空间的一般机器学习潜力仍然无法触及。一个重大障碍是在广泛的材料中获得访问培训数据的访问。缺乏数据的一类重要材料是氧化物,它抑制模型无法更广泛地研究氧气进化反应和氧化物电催化。为了解决这个问题,我们开发了开放的催化剂2022(OC22)数据集,包括62,521个密度功能理论(DFT)放松(〜9,884,504个单点计算),遍及一系列氧化物材料,覆盖范围,覆盖率和吸附物( *H, *o, *o, *o, *o, *o, * n, *c, *ooh, *oh, *oh2, *o2, *co)。我们定义广义任务,以预测催化过程中适用的总系统能量,发展几个图神经网络的基线性能(Schnet,Dimenet ++,Forcenet,Spinconv,Painn,Painn,Gemnet-DT,Gemnet-DT,Gemnet-OC),并提供预先定义的数据集分割以建立明确的基准,以实现未来的努力。对于所有任务,我们研究组合数据集是否会带来更好的结果,即使它们包含不同的材料或吸附物。具体而言,我们在Open Catalyst 2020(OC20)数据集和OC22上共同训练模型,或OC22上的微调OC20型号。在最一般的任务中,Gemnet-OC看到通过微调来提高了约32%的能量预测,通过联合训练的力预测提高了约9%。令人惊讶的是,OC20和较小的OC22数据集的联合培训也将OC20的总能量预测提高了约19%。数据集和基线模型是开源的,公众排行榜将遵循,以鼓励社区的持续发展,以了解总能源任务和数据。
translated by 谷歌翻译
分子特性预测是与关键现实影响的深度学习的增长最快的应用之一。包括3D分子结构作为学习模型的输入可以提高它们对许多分子任务的性能。但是,此信息是不可行的,可以以几个现实世界应用程序所需的规模计算。我们建议预先训练模型,以推理仅给予其仅为2D分子图的分子的几何形状。使用来自自我监督学习的方法,我们最大化3D汇总向量和图形神经网络(GNN)的表示之间的相互信息,使得它们包含潜在的3D信息。在具有未知几何形状的分子上进行微调期间,GNN仍然产生隐式3D信息,并可以使用它来改善下游任务。我们表明3D预训练为广泛的性质提供了显着的改进,例如八个量子力学性能的22%的平均MAE。此外,可以在不同分子空间中的数据集之间有效地传送所学习的表示。
translated by 谷歌翻译
Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark; these results are strong enough that we believe future work should focus on datasets with larger molecules or more accurate ground truth labels.Recently, large scale quantum chemistry calculation and molecular dynamics simulations coupled with advances in high throughput experiments have begun to generate data at an unprecedented rate. Most classical techniques do not make effective use of the larger amounts of data that are now available. The time is ripe to apply more powerful and flexible machine learning methods to these problems, assuming we can find models with suitable inductive biases. The symmetries of atomic systems suggest neural networks that operate on graph structured data and are invariant to graph isomorphism might also be appropriate for molecules. Sufficiently successful models could someday help automate challenging chemical search problems in drug discovery or materials science.In this paper, our goal is to demonstrate effective machine learning models for chemical prediction problems
translated by 谷歌翻译
我们向高吞吐量基准介绍了用于材料和分子数据集的化学系统的多种表示的高吞吐量基准的机器学习(ML)框架。基准测试方法的指导原理是通过将模型复杂性限制在简单的回归方案的同时,在执行最佳ML实践的同时将模型复杂性限制为简单的回归方案,允许通过沿着同步的列车测试分裂的系列进行学习曲线来评估学习进度来评估原始描述符性能。结果模型旨在为未来方法开发提供通知的基线,旁边指示可以学习给定的数据集多么容易。通过对各种物理化学,拓扑和几何表示的培训结果的比较分析,我们介绍了这些陈述的相对优点以及它们的相互关联。
translated by 谷歌翻译
Deep learning models that leverage large datasets are often the state of the art for modelling molecular properties. When the datasets are smaller (< 2000 molecules), it is not clear that deep learning approaches are the right modelling tool. In this work we perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets. Using different molecular representations and models, we analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets. We also introduce two simulated experiments that evaluate their performance: (1) Bayesian optimization guided molecular design, (2) inference on out-of-distribution data via ablated cluster splits. We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments. We have packaged our analysis into the DIONYSUS repository, which is open sourced to aid in reproducibility and extension to new datasets.
translated by 谷歌翻译
Machine-learning models are increasingly used to predict properties of atoms in chemical systems. There have been major advances in developing descriptors and regression frameworks for this task, typically starting from (relatively) small sets of quantum-mechanical reference data. Larger datasets of this kind are becoming available, but remain expensive to generate. Here we demonstrate the use of a large dataset that we have "synthetically" labelled with per-atom energies from an existing ML potential model. The cheapness of this process, compared to the quantum-mechanical ground truth, allows us to generate millions of datapoints, in turn enabling rapid experimentation with atomistic ML models from the small- to the large-data regime. This approach allows us here to compare regression frameworks in depth, and to explore visualisation based on learned representations. We also show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets. In the future, we expect that our open-sourced dataset, and similar ones, will be useful in rapidly exploring deep-learning models in the limit of abundant chemical data.
translated by 谷歌翻译
预测分子系统的结构和能量特性是分子模拟的基本任务之一,并且具有化学,生物学和医学的用例。在过去的十年中,机器学习算法的出现影响了各种任务的分子模拟,包括原子系统的财产预测。在本文中,我们提出了一种新的方法,用于将从简单分子系统获得的知识转移到更复杂的知识中,并具有明显的原子和自由度。特别是,我们专注于高自由能状态的分类。我们的方法依赖于(i)分子的新型超图表,编码所有相关信息来表征构象的势能,以及(ii)新的消息传递和汇总层来处理和对此类超图结构数据进行预测。尽管问题的复杂性,但我们的结果表明,从三丙氨酸转移到DECA-丙氨酸系统的转移学习中,AUC的AUC为0.92。此外,我们表明,相同的转移学习方法可以用无监督的方式分组,在具有相似的自由能值的簇中,deca-丙氨酸的各种二级结构。我们的研究代表了一个概念证明,即可以设计用于分子系统的可靠传输学习模型,为预测生物学相关系统的结构和能量性能的未开发途径铺平道路。
translated by 谷歌翻译
Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
translated by 谷歌翻译
偶极矩是一个物理量,指示分子的极性,并通过反映成分原子的电性能和分子的几何特性来确定。大多数用于表示传统图神经网络方法中图表表示的嵌入方式将分子视为拓扑图,从而为识别几何信息的目标造成了重大障碍。与现有的嵌入涉及均值的嵌入不同,该嵌入适当地处理分子的3D结构不同,我们的拟议嵌入直接表达了偶极矩局部贡献的物理意义。我们表明,即使对于具有扩展几何形状的分子并捕获更多的原子相互作用信息,开发的模型甚至可以合理地工作,从而显着改善了预测结果,准确性与AB-Initio计算相当。
translated by 谷歌翻译
人工智能(AI)在过去十年中一直在改变药物发现的实践。各种AI技术已在广泛的应用中使用,例如虚拟筛选和药物设计。在本调查中,我们首先概述了药物发现,并讨论了相关的应用,可以减少到两个主要任务,即分子性质预测和分子产生。然后,我们讨论常见的数据资源,分子表示和基准平台。此外,为了总结AI在药物发现中的进展情况,我们介绍了在调查的论文中包括模型架构和学习范式的相关AI技术。我们预计本调查将作为有兴趣在人工智能和药物发现界面工作的研究人员的指南。我们还提供了GitHub存储库(HTTPS:///github.com/dengjianyuan/survey_survey_au_drug_discovery),其中包含文件和代码,如适用,作为定期更新的学习资源。
translated by 谷歌翻译