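Absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are important in drug discovery because they define efficacy and safety. In this work, we apply an ensemble of features, including fingerprints and descriptors, together with a tree-based machine learning model, extreme gradient boosting, for accurate ADMET prediction. Our model performs well on the Therapeutics Data Commons ADMET benchmark group: across 22 tasks, it ranks first in 18 and in the top 3 in 21. The trained machine learning models are integrated into AdmetBoost, a web server that is publicly available at https://ai-druglab.smu.edu/admet.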
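Artificial intelligence (AI) has been widely applied in drug discovery, where a major task is molecular property prediction. Despite the boom of AI techniques in molecular representation learning, some key aspects underlying molecular property prediction have not yet been carefully examined. In this study, we conduct a systematic comparison of three representative models, random forest, MolBERT, and GROVER, which use three major molecular representations: extended-connectivity fingerprints, SMILES strings, and molecular graphs, respectively. Notably, MolBERT and GROVER are pretrained on large-scale unlabeled molecule corpora in a self-supervised manner. In addition to the commonly used molecular benchmark datasets, we assemble a suite of opioid-related datasets for downstream prediction evaluation. We first profile the datasets through label distribution and structural analyses; we also examine the activity-cliff issue in the opioid-related datasets. We then train 4,320 predictive models and evaluate the usefulness of the learned representations. Furthermore, we explore model evaluation by studying the effect of statistical tests, evaluation metrics, and task settings. Finally, we dissect chemical-space generalization into inter-scaffold and intra-scaffold generalization and measure prediction performance to evaluate model generalizability under both settings. By taking this pause, we reflect on the key aspects underlying molecular property prediction, awareness of which will hopefully bring better AI techniques to this field.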
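Drug-mediated blockade of the voltage-gated potassium channel (hERG) and the voltage-gated sodium channel (Nav1.5) can lead to severe cardiovascular complications. This rising concern is reflected in the drug development arena: the frequent emergence of cardiotoxicity from many approved drugs has led either to discontinuing their use or, in some cases, to their withdrawal from the market. Predicting potential hERG and Nav1.5 blockers at the outset of the drug discovery process can address this problem and can therefore reduce the time and cost of developing safe drugs. One fast and cost-effective approach is to use in silico predictive methods to weed out potential hERG and Nav1.5 blockers at the early stages of drug development. Here, we introduce two robust 2D descriptor-based QSAR predictive models for hERG and Nav1.5 liability prediction. The machine learning models are trained both for regression, predicting the potency value of a drug, and for multiclass classification at three different potency cut-offs (1 µM, 10 µM, and 30 µM). The ToxTree-hERG classifier, a pipeline of random forest models, is trained on a large curated dataset of 8380 unique molecular compounds. The ToxTree-Nav1.5 classifier, a pipeline of kernelized SVM models, is trained on a large, manually curated set of 1550 unique compounds retrieved from the publicly available ChEMBL and PubChem bioactivity databases. The proposed hERG model outperforms the state-of-the-art published model and other existing tools on most metrics. In addition, we introduce the first Nav1.5 liability predictive model, which achieves Q4 = 74.9% and a binary-classification Q2 = 86.7%, with an accompanying score of 71.2%, evaluated on an external test set of 173 unique compounds. The curated datasets used in this project are publicly available to the research community.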
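The multiclass setup described above, with potency cut-offs at 1, 10 and 30 µM, can be illustrated with a minimal sketch. The function names, the 0 to 3 class encoding, and the convention that a value exactly on a cut-off falls into the weaker bin are assumptions made here for illustration, as is the reading of Q4 as plain four-class accuracy; none of this is taken from the authors' code.

```python
from bisect import bisect_right

# Potency cut-offs in micromolar, as listed in the abstract (1, 10 and 30 uM).
CUTOFFS_UM = (1.0, 10.0, 30.0)


def potency_class(ic50_um, cutoffs=CUTOFFS_UM):
    """Map a potency value (uM) to a class index.

    0 is the most potent bin (below the first cut-off) and len(cutoffs)
    is the weakest bin. A value exactly on a cut-off goes to the weaker
    bin; that boundary convention is an assumption of this sketch.
    """
    if ic50_um <= 0:
        raise ValueError("potency must be a positive concentration")
    return bisect_right(cutoffs, ic50_um)


def class_accuracy(y_true, y_pred):
    """Fraction of compounds whose predicted class matches the true class.

    Assumed here to correspond to Q4 for four classes and Q2 for two.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


if __name__ == "__main__":
    measured = [0.4, 5.0, 12.0, 80.0]          # hypothetical potencies in uM
    labels = [potency_class(v) for v in measured]
    print(labels)
    print(class_accuracy(labels, [0, 1, 2, 2]))
```

Binning the regression target this way keeps the classifier and the regressor consistent with each other, since both are derived from the same measured potency.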
Models that accurately predict properties based on chemical structure are valuable tools in drug discovery. However, for many properties, public and private training sets are typically small, and it is difficult for the models to generalize well outside of the training data. Recently, large language models have addressed this problem by using self-supervised pretraining on large unlabeled datasets, followed by fine-tuning on smaller, labeled datasets. In this paper, we report MolE, a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs together with a two-step pretraining strategy. The first step of pretraining is a self-supervised approach focused on learning chemical structures, and the second step is a massive multi-task approach to learn biological information. We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutics Data Commons.
Deep learning models that leverage large datasets are often the state of the art for modelling molecular properties. When the datasets are smaller (< 2000 molecules), it is not clear that deep learning approaches are the right modelling tool. In this work we perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets. Using different molecular representations and models, we analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets. We also introduce two simulated experiments that evaluate their performance: (1) Bayesian optimization guided molecular design, (2) inference on out-of-distribution data via ablated cluster splits. We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments. We have packaged our analysis into the DIONYSUS repository, which is open sourced to aid in reproducibility and extension to new datasets.
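The introduction of computational techniques for analyzing chemical data has given rise to the analytical study of biological systems known as "bioinformatics". One facet of bioinformatics is the use of machine learning (ML) techniques to detect multivariable trends in a variety of settings. One of the most pressing is predicting blood-brain barrier (BBB) permeability. The development of new drugs to treat central nervous system disorders presents unique challenges because of poor penetration efficacy across the blood-brain barrier. In this research, we aim to mitigate this problem with an ML model that analyzes chemical features. To do so: (i) an overview of the relevant biological systems and processes, as well as the use case, is given; (ii) an in-depth literature review of existing computational techniques for detecting BBB permeability is undertaken, from which a gap across current techniques is identified and a solution is proposed; (iii) a two-part model that quantifies the permeability of drugs with defined features across the BBB through passive diffusion is developed, tested, and reflected on. Testing and validation with the dataset determined the mean squared error of the predictive logBB model to be about 0.112 units and that of the neuroinflammation model to be about 0.3 units, outperforming all relevant studies.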
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem
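Machine learning (ML) has shown promise for accurate property prediction for molecules and crystalline materials. Developing highly accurate ML models for predicting the properties of chemical structures requires datasets with sufficient samples. However, obtaining clean and sufficient chemical property data can be prohibitively expensive, which greatly limits the performance of ML models. Inspired by the success of data augmentation in computer vision and natural language processing, we developed AugLiChem, a data augmentation library for chemical structures. We introduce augmentation methods for both crystalline systems and molecules, which can be used with fingerprint-based ML models and graph neural networks (GNNs). We show that our augmentation strategies significantly improve the performance of ML models, especially when using GNNs. In addition, the augmentations we developed can be used as a direct plug-in module during training, and they have proven effective when implemented with different GNN models through the AugLiChem library. The Python-based package implementing AugLiChem is publicly available at https://github.com/baratilab/auglichem.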
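SARS-CoV-2 is a positive-sense, single-stranded RNA-based macromolecule that had caused more than 6.3 million deaths as of June 2022. Moreover, by disrupting global supply chains through lockdowns, the virus has caused devastating damage to the global economy. Designing and developing drugs for this virus and its various variants is vital. In this paper, we use an in-house research framework to repurpose existing therapeutic agents and find drug-like bioactive molecules that could cure COVID-19. We applied Lipinski's rule to molecules retrieved from the ChEMBL database and found 133 drug-like bioactive molecules against the SARS coronavirus 3CL protease. On the basis of the standard IC50, the dataset was divided into three classes: active, inactive, and intermediate. Our comparative analysis shows that the proposed Extra Trees Regressor (ETR) ensemble model improves on other state-of-the-art machine learning models in accurately predicting the bioactivity of chemical compounds. Using ADMET analysis, we identified 13 novel bioactive molecules with ChEMBL IDs 187460, 190743, 222234, 222628, 222735, 222769, 222840, 222893, 225515, 358279, 363535, 365134, and 422688 for the SARS-CoV-2 3CL protease. These candidate molecules were further investigated for binding affinity. To this end, we performed molecular docking and shortlisted six bioactive molecules with ChEMBL IDs 187460, 222769, 225515, 358279, 363535, and 365134. These molecules may be candidates against SARS-CoV-2. It is anticipated that the pharmacologist community may use these promising compounds for further in vitro analysis.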
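The two filtering steps mentioned above, a Lipinski drug-likeness check followed by IC50-based class labels, can be sketched as follows. The Lipinski limits (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) are the standard rule-of-five values. The abstract does not state its IC50 thresholds, so the 1000 nM and 10000 nM defaults below are placeholder assumptions, as are the class and function names and the choice to tolerate zero violations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum((
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ))


def is_drug_like(d, max_violations=0):
    """Strict filter by default; pass max_violations=1 for the lenient variant."""
    return lipinski_violations(d) <= max_violations


def activity_class(ic50_nm, active_below=1000.0, inactive_above=10000.0):
    """Label a compound from its IC50 in nM.

    Both thresholds are assumptions of this sketch, not values from the paper.
    """
    if ic50_nm <= active_below:
        return "active"
    if ic50_nm >= inactive_above:
        return "inactive"
    return "intermediate"


if __name__ == "__main__":
    small = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
    print(is_drug_like(small), activity_class(500.0))
```

In practice the descriptor values would come from a cheminformatics toolkit; they are passed in as plain numbers here so that the sketch has no third-party dependencies.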
Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular graph convolutions, a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph (atoms, bonds, distances, etc.), which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.
Ionic Liquids (ILs) provide a promising solution for CO$_2$ capture and storage to mitigate global warming. However, identifying and designing high-capacity ILs from the giant chemical space requires expensive and exhaustive simulations and experiments. Machine learning (ML) can accelerate the search for desirable ionic molecules through accurate and efficient property predictions in a data-driven manner. However, existing descriptors and ML models for ionic molecules adapt inefficiently to molecular graph structure. In addition, few works have investigated the explainability of ML models to help understand the learned features that can guide the design of efficient ionic molecules. In this work, we develop both fingerprint-based ML models and Graph Neural Networks (GNNs) to predict CO$_2$ absorption in ILs. Fingerprints work on graph structure at the feature extraction stage, while GNNs directly handle molecular structure in both the feature extraction and model prediction stages. We show that our method outperforms previous ML models by reaching a high accuracy (MAE of 0.0137, $R^2$ of 0.9884). Furthermore, we take advantage of the GNN feature representation and develop a substructure-based explanation method that provides insight into how each chemical fragment within IL molecules contributes to the CO$_2$ absorption prediction of ML models. We also show that our explanation result agrees with some ground truth from the theoretical reaction mechanism of CO$_2$ absorption in ILs, which can advise on the design of novel and efficient functional ILs in the future.
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has
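While generative models have recently become ubiquitous in many scientific areas, less attention has been paid to their evaluation. For molecular generative models, the state of the art examines their output in isolation or in relation to its input. However, their biological and functional properties, such as ligand-target interaction, have not been addressed. In this study, a novel biologically inspired benchmark for evaluating molecular generative models is proposed. Specifically, three diverse reference datasets are designed, and a set of metrics directly relevant to the drug discovery process is introduced. In particular, we propose a recreation metric and apply drug-target affinity prediction and molecular docking as complementary techniques for evaluating generative outputs. While all three metrics show consistent results across the tested generative models, a more detailed comparison of drug-target affinity binding and molecular docking scores reveals that unimodal predictors can lead to erroneous conclusions about target binding at the molecular level, so a multimodal approach is preferable. The key advantage of this framework is that it incorporates prior physicochemical domain knowledge into the benchmarking process by focusing explicitly on ligand-target interactions, creating a highly efficient tool not only for evaluating molecular generative outputs in particular but also for enriching the drug discovery process in general.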
Graph classification is an important area in both modern research and industry. Multiple applications, especially in chemistry and novel drug discovery, encourage rapid development of machine learning models in this area. To keep up with the pace of new research, proper experimental design, fair evaluation, and independent benchmarks are essential. Design of strong baselines is an indispensable element of such works. In this thesis, we explore multiple approaches to graph classification. We focus on Graph Neural Networks (GNNs), which have emerged as a de facto standard deep learning technique for graph representation learning. Classical approaches, such as graph descriptors and molecular fingerprints, are also addressed. We design a fair experimental evaluation protocol and choose a proper collection of datasets. This allows us to perform numerous experiments and rigorously analyze modern approaches. We arrive at many conclusions, which shed new light on the performance and quality of novel algorithms. We investigate the application of the Jumping Knowledge GNN architecture to graph classification, which proves to be an efficient tool for improving base graph neural network architectures. Multiple improvements to baseline models are also proposed and experimentally verified, which constitutes an important contribution to the field of fair model comparison.
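Computational methods that operate on three-dimensional molecular structure have the potential to solve important questions in biology and chemistry. Deep neural networks in particular have gained significant attention, but their widespread adoption in the biomolecular domain has been limited by the lack of either systematic performance benchmarks or a unified toolkit for interacting with molecular data. To address this, we present ATOM3D, a collection of both novel and existing benchmark datasets spanning several key classes of biomolecules. We implement several classes of three-dimensional molecular learning methods for each of these tasks and show that they consistently improve performance relative to methods based on one- and two-dimensional representations. The specific choice of architecture proves to be critical for performance: three-dimensional convolutional networks excel at tasks involving complex geometries, graph networks perform well on systems requiring detailed positional information, and the more recently developed equivariant networks show significant promise. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning and that there is potential for improvement on many tasks that remain underexplored. To lower the barrier to entry and facilitate further developments in the field, we also provide a comprehensive suite of tools for dataset processing, model training, and evaluation in our open-source ATOM3D Python package. All datasets are available for download from https://www.atom3d.ai.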
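Collecting labeled data for many important tasks in cheminformatics is time-consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules from large-scale unlabeled molecular datasets and to transfer that knowledge to more challenging tasks with limited data. Variational autoencoders are one of the tools that have been proposed to perform this transfer for both chemical property prediction and molecular generation tasks. In this work, we propose a simple method to improve the chemical property prediction performance of machine learning models by incorporating additional information about correlated molecular descriptors into the representations learned by variational autoencoders. We verify the method on three property prediction tasks. We explore the impact of the number of incorporated descriptors, the correlation between the descriptors and the target properties, the sizes of the datasets, and so on. Finally, we show the relation between the performance of property prediction models and the distance, in the representation space, between the property prediction dataset and the larger unlabeled dataset.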
Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
We discover a robust self-supervised strategy tailored towards molecular representations for generative masked language models through a series of in-depth ablations. Using this pre-training strategy, we train BARTSmiles, a BART-like model with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state of the art on 11 tasks. We then quantitatively show that when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting seven neurons from a frozen BARTSmiles, we can obtain a model having performance within two percentage points of the full fine-tuned model on the Clintox task. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight certain substructures that chemists use to explain specific properties of molecules. The code and the pretrained model are publicly available.
Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.
Extended-connectivity fingerprints (ECFPs) are a novel class of topological fingerprints for molecular characterization. Historically, topological fingerprints were developed for substructure and similarity searching. ECFPs were developed specifically for structure-activity modeling. ECFPs are circular fingerprints with a number of useful qualities: they can be very rapidly calculated; they are not predefined and can represent an essentially infinite number of different molecular features (including stereochemical information); their features represent the presence of particular substructures, allowing easier interpretation of analysis results; and the ECFP algorithm can be tailored to generate different types of circular fingerprints, optimized for different uses. While the use of ECFPs has been widely adopted and validated, a description of their implementation has not previously been presented in the literature.
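The idea of a circular fingerprint described above can be illustrated with a simplified sketch: each atom starts with an identifier derived from local properties, and in each iteration that identifier is rehashed together with the identifiers of its bonded neighbours, so it comes to represent a larger circular substructure. This is not the published ECFP algorithm. It uses only element symbol and degree as atom invariants, omits stereochemistry and the removal of structurally duplicate features, and the function names, the SHA-256 hash and the 1024-bit folding are all assumptions chosen for illustration.

```python
import hashlib


def _stable_hash(*parts):
    """Deterministic 32-bit hash of a tuple of ints, strings and tuples."""
    digest = hashlib.sha256(repr(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Return the set of 'on' bit positions for a toy molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples over atom indices
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Iteration 0: identifiers from simple atom invariants (symbol, degree).
    ids = [_stable_hash(atoms[i], len(neighbors[i])) for i in range(len(atoms))]
    features = set(ids)

    for layer in range(1, radius + 1):
        updated = []
        for i in range(len(atoms)):
            # Sort by (bond order, neighbour identifier) so that the result
            # does not depend on how the atoms happen to be numbered.
            environment = tuple(sorted((order, ids[j]) for order, j in neighbors[i]))
            updated.append(_stable_hash(layer, ids[i], environment))
        ids = updated
        features.update(ids)

    # Fold the identifiers into a fixed-length bit space.
    return {feature % n_bits for feature in features}


if __name__ == "__main__":
    ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
    print(sorted(ethanol))
```

Because neighbour identifiers are sorted before hashing, renumbering the atoms of the same molecule yields the same set of bits, which is the property that makes such features usable as model inputs.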