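Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural product discovery, and many other applications. The primary window into the composition of small-molecule mixtures is tandem mass spectrometry (MS2), which produces data with high sensitivity and part-per-million resolution. We adopt multi-scale sinusoidal embeddings of the mass data in MS2, designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state-of-the-art model for spectral library search, the standard task for initial evaluation of MS2 data. We also introduce a new task, chemical property prediction from MS2 data, which has natural applications in high-throughput MS2 experiments, and show that an average $R^2$ of 80% can be achieved for novel compounds across 10 chemical properties prioritized by medicinal chemists. We use dimensionality reduction techniques and experiments with different floating-point resolutions to show the essential role that multi-scale sinusoidal embeddings play in learning from MS2 data.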
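To make the idea of a multi-scale sinusoidal embedding concrete, the following is a minimal sketch rather than the authors' implementation. It maps a single m/z value to sine/cosine pairs at log-spaced wavelengths, so that both coarse (hundreds of Da) and fine (milli-Da) differences change the embedding. The function name, the embedding dimension, and the wavelength range are all assumptions chosen for illustration.

```python
import math


def sinusoidal_mz_embedding(mz, dim=8, min_wavelength=1e-3, max_wavelength=1e3):
    """Embed one m/z value with sin/cos pairs at log-spaced wavelengths.

    All default values are illustrative assumptions, not settings from the paper.
    """
    if dim % 2 != 0 or dim < 4:
        raise ValueError("dim must be an even number >= 4")
    n_freq = dim // 2
    # Wavelengths form a geometric progression from min_wavelength to max_wavelength.
    ratio = (max_wavelength / min_wavelength) ** (1.0 / (n_freq - 1))
    embedding = []
    for k in range(n_freq):
        wavelength = min_wavelength * ratio ** k
        angle = 2.0 * math.pi * mz / wavelength
        embedding.append(math.sin(angle))
        embedding.append(math.cos(angle))
    return embedding


# Two peaks that differ by 0.4 mDa still receive distinct embeddings,
# because the shortest wavelengths resolve sub-milli-Dalton differences.
peak_a = sinusoidal_mz_embedding(255.1234)
peak_b = sinusoidal_mz_embedding(255.1238)
print(peak_a != peak_b)
```

The short wavelengths carry the fine mass differences, while the long wavelengths keep nearby masses close in the embedding space.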
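Mass spectrometry is a key tool in the study of small molecules, playing an important role in metabolomics, drug discovery, and environmental chemistry. Tandem mass spectra capture fragmentation patterns that provide key structural information about a molecule and help with its identification. Practitioners often rely on spectral library searches to match unknown spectra with known compounds. However, such search-based methods are limited by the availability of reference experimental data. In this work, we show that graph transformers can be used to accurately predict tandem mass spectra. Our model, MassFormer, outperforms competing deep learning approaches for spectrum prediction and includes an interpretable attention mechanism to help explain predictions. We demonstrate that our model can be used to improve reference library coverage on a synthetic molecule identification task. Through quantitative analysis and visual inspection, we verify that our model recovers prior knowledge about the effect of collision energy on the generated spectrum. We evaluate our model on different types of mass spectra from two independent MS datasets and show that its performance generalizes. Code is available at github.com/roestlab/massformer.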
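The rational design of novel molecules with desired bioactivity is a critical but challenging task in drug discovery, especially when treating a novel target family or understudied targets. Here, we propose PGMG, a pharmacophore-guided deep learning approach for bioactive molecule generation. Through the guidance of a pharmacophore, PGMG provides a flexible strategy for generating bioactive molecules with structural diversity in various scenarios using a trained variational autoencoder. We show that PGMG can generate molecules that match a given pharmacophore model while maintaining a high level of validity, uniqueness, and novelty. In case studies, we demonstrate the application of PGMG to generating bioactive molecules in ligand-based and structure-based de novo drug design, as well as in lead optimization scenarios. Overall, the flexibility and effectiveness of PGMG make it a useful tool for accelerating the drug discovery process.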
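Validity, uniqueness, and novelty are usually reported as simple ratios over a batch of generated SMILES strings. The sketch below uses one common convention (uniqueness over valid molecules, novelty over unique valid molecules); the exact definitions used by PGMG may differ. The `toy_is_valid` check is a hypothetical placeholder for a real chemistry parser, and the strings are compared without canonicalization, which a real evaluation would need.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return (validity, uniqueness, novelty) for a list of generated SMILES.

    validity   = valid / generated
    uniqueness = unique valid / valid
    novelty    = unique valid not in the training set / unique valid
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


def toy_is_valid(smiles):
    """Placeholder validity check (assumption): non-empty, balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


generated = ["CCO", "CCO", "CC(C)O", "CC(C", "c1ccccc1"]
training = ["CCO"]
print(generation_metrics(generated, training, toy_is_valid))
```

In practice the validity callable would wrap a cheminformatics toolkit, and every string would be canonicalized before the set operations.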
Models that accurately predict properties based on chemical structure are valuable tools in drug discovery. However, for many properties, public and private training sets are typically small, and it is difficult for the models to generalize well outside of the training data. Recently, large language models have addressed this problem by using self-supervised pretraining on large unlabeled datasets, followed by fine-tuning on smaller, labeled datasets. In this paper, we report MolE, a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs together with a two-step pretraining strategy. The first step of pretraining is a self-supervised approach focused on learning chemical structures, and the second step is a massive multi-task approach to learn biological information. We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutic Data Commons.
Despite significant progress of generative models in the natural sciences, their controllability remains challenging. One fundamentally missing aspect of molecular or protein generative models is an inductive bias that can reflect continuous properties of interest. To that end, we propose the Regression Transformer (RT), a novel method that abstracts regression as a conditional sequence modeling problem. This introduces a new paradigm of multitask language models which seamlessly bridge sequence regression and conditional sequence generation. We thoroughly demonstrate that, despite using a nominal-scale training objective, the RT matches or surpasses the performance of conventional regression models in property prediction tasks of small molecules, proteins and chemical reactions. Critically, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a substructure-constrained, property-driven molecule generation benchmark. Our dichotomous approach is facilitated by a novel, alternating training scheme that enables the model to decorate seed sequences by desired properties, e.g., to optimize reaction yield. In sum, the RT is the first report of a multitask model that concurrently excels at predictive and generative tasks in biochemistry. This finds particular application in property-driven, local exploration of the chemical or protein space and could pave the road toward foundation models in material design. The code to reproduce all experiments of the paper is available at: https://github.com/IBM/regression-transformer
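Computational methods that operate on three-dimensional molecular structure have the potential to solve important questions in biology and chemistry. In particular, deep neural networks have gained significant attention, but their widespread adoption in the biomolecular domain has been limited by a lack of either systematic performance benchmarks or a unified toolkit for interacting with molecular data. To address this, we present ATOM3D, a collection of both novel and existing benchmark datasets spanning several key classes of biomolecules. We implement several classes of three-dimensional molecular learning methods for each of these tasks and show that they consistently improve performance relative to methods based on one- and two-dimensional representations. The specific choice of architecture proves to be critical for performance, with three-dimensional convolutional networks excelling at tasks involving complex geometries, graph networks performing well on systems requiring detailed positional information, and the more recently developed equivariant networks showing significant promise. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning, and that there is potential for improvement on many tasks that remain underexplored. To lower the barrier to entry and facilitate further developments in the field, we also provide a comprehensive suite of tools for dataset processing, model training, and evaluation in our open-source ATOM3D Python package. All datasets are available for download from https://www.atom3d.ai.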
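We address the task of controlled generation of small molecules, which entails finding novel molecules with desired properties under certain constraints (e.g., similarity to a reference molecule). Here we introduce MolMIM, a probabilistic auto-encoder for small-molecule drug discovery that learns an informative and clustered latent space. MolMIM is trained with Mutual Information Machine (MIM) learning and provides a fixed-length representation of variable-length SMILES strings. Since encoder-decoder models can learn representations with "holes" of invalid samples, we propose a novel extension to the training procedure that promotes a dense latent space and allows the model to sample valid molecules from random perturbations of latent codes. We provide a thorough comparison of MolMIM to several variable-size and fixed-size encoder-decoder models, demonstrating MolMIM's superior generation as measured by validity, uniqueness, and novelty. We then apply CMA-ES, a naive black-box and gradient-free search algorithm, over MolMIM's latent space for the task of property-guided molecule optimization. We achieve state-of-the-art results on single-property optimization tasks as well as on the challenging task of multi-objective optimization, improving on the previous state-of-the-art success rate by more than 5%. We attribute the strong results to MolMIM's latent representation, which clusters similar molecules in the latent space, whereas CMA-ES is often used only as a baseline optimization method. We also show that MolMIM is favourable in a compute-limited regime, making it an attractive model for such cases.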
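Molecular property prediction is one of the fastest-growing applications of deep learning with critical real-world impact. Including 3D molecular structure as input to learned models improves their performance on many molecular tasks. However, this information is infeasible to compute at the scale required by several real-world applications. We propose pre-training a model to reason about the geometry of molecules given only their 2D molecular graphs. Using methods from self-supervised learning, we maximize the mutual information between 3D summary vectors and the representations of a graph neural network (GNN) so that they contain latent 3D information. During fine-tuning on molecules with unknown geometry, the GNN still produces implicit 3D information and can use it to improve downstream tasks. We show that 3D pre-training provides significant improvements for a wide range of properties, such as a 22% average MAE reduction on eight quantum mechanical properties. Moreover, the learned representations can be effectively transferred between datasets in different molecular spaces.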
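One widely used way to maximize a lower bound on the mutual information between two views is a contrastive InfoNCE-style loss. The following is a sketch of that general technique and not necessarily the exact objective used in the paper. Each 2D graph embedding is pulled toward the 3D summary vector of the same molecule and pushed away from those of other molecules in the batch. The function names and the temperature value are assumptions, and all embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(graph_embs, geom_embs, temperature=0.1):
    """Mean InfoNCE loss; row i of each list describes the same molecule."""
    n = len(graph_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(graph_embs[i], geom_embs[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching 3D vector.
        total += log_denominator - logits[i]
    return total / n


graph_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_3d = [[1.0, 0.0], [0.0, 1.0]]
shuffled_3d = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce_loss(graph_embs, aligned_3d) < info_nce_loss(graph_embs, shuffled_3d))
```

Minimizing this loss over the GNN parameters makes the 2D representation predictive of the molecule's own 3D summary, which is how latent geometric information enters the graph embedding.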
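Artificial intelligence (AI) has been transforming the practice of drug discovery over the past decade. Various AI techniques have been used in a wide range of applications, such as virtual screening and drug design. In this survey, we first give an overview of drug discovery and discuss related applications, which can be reduced to two major tasks: molecular property prediction and molecule generation. We then discuss common data resources, molecular representations, and benchmark platforms. Furthermore, to summarize the progress of AI in drug discovery, we present the relevant AI techniques, including model architectures and learning paradigms, used in the surveyed papers. We expect this survey to serve as a guide for researchers interested in working at the interface of artificial intelligence and drug discovery. We also provide a GitHub repository (https://github.com/dengjianyuan/survey_survey_au_drug_discovery) containing the collection of papers and, where applicable, code, as a regularly updated learning resource.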
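Self-supervised neural language models have recently found wide application in the generative design of organic molecules and protein sequences, as well as in representation learning for downstream structure classification and function prediction. However, most existing deep learning models for molecule design require a large dataset and have a black-box architecture, which makes their design logic difficult to interpret. Here we propose the Generative Molecular Transformer (GMTransformer), a probabilistic neural network model for the generative design of molecules. Our model is built on the blank-filling language model originally developed for text processing, which has demonstrated unique advantages in learning "molecular grammars" with high-quality generation, interpretability, and data efficiency. Benchmarked on the MOSES dataset, our model achieves high novelty and Scaf scores compared to other baselines. The probabilistic generation steps have potential for tinkering-style molecule design, because they can recommend how to modify existing molecules, with explanations, guided by the learned implicit molecular chemistry. The source code and datasets can be accessed freely at https://github.com/usccolumbia/gmtransformer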
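Recent advances in deep-learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been explored. Here, we propose MoLeR, a graph-based model that naturally supports scaffolds as the initial seed of the generative procedure, which is possible because it is not conditioned on the generation history. Our experiments show that MoLeR performs comparably to state-of-the-art methods on unconstrained molecular optimization tasks and outperforms them on scaffold-based tasks, while being an order of magnitude faster to train and sample from than existing approaches. Furthermore, we show the influence of a number of seemingly minor design choices on overall performance.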
We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds.
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has
Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
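Rotary positional embeddings encode a token's position by rotating consecutive pairs of feature dimensions through position-dependent angles. The sketch below follows the commonly used formulation of that technique and is not taken from the MoLFormer code base. The function name and the `base` constant of 10000 are assumptions. The useful property it demonstrates is that the dot product of a rotated query and key depends only on their relative offset.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of features by an angle proportional to position."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        # Lower-index pairs rotate quickly, higher-index pairs rotate slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.25]
# Both pairs of positions are two tokens apart, so the attention scores agree.
score_near_start = dot(rotary_embed(query, 3), rotary_embed(key, 1))
score_further_on = dot(rotary_embed(query, 7), rotary_embed(key, 5))
print(abs(score_near_start - score_further_on) < 1e-9)
```

Because only relative offsets survive the dot product, attention scores between two SMILES tokens do not change when the whole string is shifted.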
Molecular shape and geometry dictate key biophysical recognition processes, yet many graph neural networks disregard 3D information for molecular property prediction. Here, we propose a new contrastive-learning procedure for graph neural networks, Molecular Contrastive Learning from Shape Similarity (MolCLaSS), that implicitly learns a three-dimensional representation. Rather than directly encoding or targeting three-dimensional poses, MolCLaSS matches a similarity objective based on Gaussian overlays to learn a meaningful representation of molecular shape. We demonstrate how this framework naturally captures key aspects of three-dimensionality that two-dimensional representations cannot and provides an inductive framework for scaffold hopping.
In this work, we propose MEDICO, a Multi-viEw Deep generative model for molecule generation, structural optimization, and the SARS-CoV-2 Inhibitor disCOvery. To the best of our knowledge, MEDICO is the first-of-its-kind graph generative model that can generate molecular graphs similar to the structure of targeted molecules, with a multi-view representation learning framework to sufficiently and adaptively learn comprehensive structural semantics from targeted molecular topology and geometry. We show that MEDICO significantly outperforms state-of-the-art methods in generating valid, unique, and novel molecules under benchmarking comparisons. In particular, we show that the multi-view deep learning model enables us to generate not only molecules structurally similar to the targeted molecules but also molecules with desired chemical properties, demonstrating the strong capability of our model to explore chemical space deeply. Moreover, case study results on targeted molecule generation for the SARS-CoV-2 main protease (Mpro) show that, by integrating molecular docking into our model as a chemical prior, we successfully generate new small molecules with desired drug-like properties for Mpro, potentially accelerating the de novo design of Covid-19 drugs. Further, we apply MEDICO to the structural optimization of three well-known Mpro inhibitors (N3, 11a, and GC376) and achieve ~88% improvement in their binding affinity to Mpro, demonstrating the application value of our model for the development of therapeutics for SARS-CoV-2 infection.
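While generative models have recently become ubiquitous in many scientific areas, less attention has been paid to their evaluation. For molecular generative models, state-of-the-art evaluation examines their output in isolation or in relation to their input. However, the biological and functional properties of the output, such as ligand-target interaction, are not addressed. In this study, a novel biologically inspired benchmark for the evaluation of molecular generative models is proposed. Specifically, three diverse reference datasets are designed, and a set of metrics directly relevant to the drug discovery process is introduced. In particular, we propose a recreation metric and apply drug-target affinity prediction and molecular docking as complementary techniques for evaluating generative outputs. While all three metrics show consistent results across the tested generative models, a more detailed comparison of drug-target affinity binding and molecular docking scores revealed that unimodal predictors can lead to erroneous conclusions about target binding at the molecular level, and a multi-modal approach is therefore preferable. The key advantage of this framework is that it incorporates prior physico-chemical domain knowledge into the benchmarking process by focusing explicitly on ligand-target interactions, thus creating a highly efficient tool not only for evaluating molecular generative outputs in particular, but also for enriching the drug discovery process in general.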
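Generation of drug-like molecules with high binding affinity to target proteins remains a difficult and resource-intensive task in drug discovery. Existing approaches primarily employ reinforcement learning, Markov sampling, or deep generative models guided by Gaussian processes, which can be prohibitively slow when generating molecules with high binding affinity calculated by computationally expensive physics-based methods. We present Latent Inceptionism on Molecules (LIMO), which significantly accelerates molecule generation with an inceptionism-like technique. LIMO employs a variational autoencoder-generated latent space and property prediction by two neural networks in sequence to enable faster gradient-based reverse optimization of molecular properties. Comprehensive experiments show that LIMO performs competitively on benchmark tasks and markedly outperforms state-of-the-art techniques on the novel task of generating drug-like compounds with high binding affinity, reaching the nanomolar range against two protein targets. We corroborate these docking-based results with more accurate molecular-dynamics-based calculations of absolute binding free energy and show that one of our generated drug-like compounds has a predicted $K_D$ (a measure of binding affinity) of $6 \cdot 10^{-14}$ M against the human estrogen receptor, well beyond the affinities of typical early-stage drug candidates and most FDA-approved drugs. Code is available at https://github.com/rose-stl-lab/limo.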
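The core of gradient-based reverse optimization is that the property predictor stays frozen while the latent input is updated by gradient ascent. The toy sketch below illustrates only that idea and is not LIMO's implementation. The quadratic `toy_predictor` stands in for a trained network, the central-difference gradient stands in for backpropagation, and every name and hyperparameter is an assumption.

```python
def numerical_gradient(f, z, eps=1e-5):
    """Central-difference gradient of f at z (a stand-in for backpropagation)."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def optimize_latent(predictor, z0, lr=0.1, steps=200):
    """Gradient ascent on the latent vector; the predictor itself is never changed."""
    z = list(z0)
    for _ in range(steps):
        grad = numerical_gradient(predictor, z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


def toy_predictor(z):
    """Hypothetical frozen property model whose maximum lies at z = (1, -2)."""
    target = [1.0, -2.0]
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))


z_start = [0.0, 0.0]
z_best = optimize_latent(toy_predictor, z_start)
print(toy_predictor(z_best) > toy_predictor(z_start))
```

In a real pipeline the optimized latent vector would then be passed through the decoder to obtain a candidate molecule.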
DNA-Encoded Library (DEL) technology has enabled significant advances in hit identification by enabling efficient testing of combinatorially generated molecular libraries. DEL screens measure protein binding affinity through sequencing reads of molecules tagged with unique DNA barcodes that survive a series of selection experiments. Computational models have been deployed to learn the latent binding affinities that are correlated with the sequenced count data; however, this correlation is often obfuscated by various sources of noise introduced in the complicated data-generation process. In order to denoise DEL count data and screen for molecules with good binding affinity, computational models require the correct assumptions in their modeling structure to capture the correct signals underlying the data. Recent advances in DEL models have focused on probabilistic formulations of count data, but existing approaches have thus far been limited to 2-D molecule-level representations. We introduce a new paradigm, DEL-Dock, that combines ligand-based descriptors with 3-D spatial information from docked protein-ligand complexes. 3-D spatial information allows our model to learn over the actual binding modality rather than using only structure-based information of the ligand. We show that our model is capable of effectively denoising DEL count data to predict molecule enrichment scores that are better correlated with experimental binding affinity measurements than those of prior works. Moreover, by learning over a collection of docked poses, we demonstrate that our model, trained only on DEL data, implicitly learns to perform good docking pose selection without requiring external supervision from expensive-to-source protein crystal structures.
Machine learning methods have been used to accelerate the molecule optimization process. However, efficient search for optimized molecules that satisfy several properties with scarce labeled data remains a challenge for machine learning molecule optimization. In this study, we propose MOMO, a multi-objective molecule optimization framework that addresses this challenge by combining the learning of chemical knowledge with Pareto-based multi-objective evolutionary search. To learn chemistry, it employs a self-supervised codec to construct an implicit chemical space and acquire a continuous representation of molecules. To explore the established chemical space, MOMO uses multi-objective evolution to comprehensively and efficiently search for similar molecules with multiple desirable properties. We demonstrate the high performance of MOMO on four multi-objective property and similarity optimization tasks and illustrate its search capability through case studies. Remarkably, our approach significantly outperforms previous approaches when optimizing three objectives simultaneously. The results show the optimization capability of MOMO and suggest that it can improve the success rate of lead molecule optimization.
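Pareto-based search keeps candidates that no other candidate beats on every objective at once. The following minimal sketch shows only that non-dominated filtering step under a maximization convention; it is not MOMO's evolutionary algorithm. The molecule names and the two objective scores are made-up illustrative values.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates:
        if not any(dominates(other, scores) for _, other in candidates):
            front.append(name)
    return front


# Hypothetical (property score, similarity score) pairs; higher is better for both.
candidates = [
    ("mol_a", (0.9, 0.2)),
    ("mol_b", (0.5, 0.5)),
    ("mol_c", (0.4, 0.4)),  # dominated by mol_b on both objectives
    ("mol_d", (0.1, 0.9)),
]
print(pareto_front(candidates))
```

An evolutionary search would typically apply a filter of this kind each generation to decide which candidates survive and seed the next round of variation.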