Smart Paper Notes

A Ligand-and-structure Dual-driven Deep Learning Method for the Discovery of Highly Potent GnRH1R Antagonist to treat Uterine Diseases

Song Li, Song Ke, Chenxing Yang, Jun Chen, Yi Xiong, Lirong Zheng, Hao Liu, Liang Hong

Categories: Artificial Intelligence | Machine Learning

2022-07-23

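Gonadotropin-releasing hormone receptor (GnRH1R) is a promising therapeutic target for the treatment of uterine diseases. To date, several GnRH1R antagonists are available in clinical studies, but they do not satisfy multiple property constraints. To fill this gap, we aim to develop a deep learning-based framework that facilitates the effective and efficient discovery of new orally active small-molecule drugs targeting GnRH1R with desirable properties. In the present work, we first propose a ligand-and-structure combined model for molecular generation, LS-MolGen, which fully exploits the information in known active compounds and in the structure of the target protein; it outperforms ligand-based and structure-based methods used separately. We then carry out an in silico screening comprising activity prediction, ADMET evaluation, molecular docking, and FEP calculation, in which about 30,000 generated novel molecules are narrowed down to 8 for experimental synthesis and validation. In vitro and in vivo experiments show that three of them exhibit potent inhibitory activity against GnRH1R (compound 5, IC50 = 0.856 nM; compound 6, IC50 = 0.901 nM; compound 7, IC50 = 2.54 nM), and that compound 5 performs well in fundamental PK properties such as half-life, oral bioavailability, and PPB. We believe that the proposed ligand-and-structure combined molecular generative model and the whole computer-aided workflow can potentially be extended to similar tasks in de novo drug design or lead optimization.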

translated by Google Translate

A biologically-inspired evaluation of molecular generative machine learning

Elizaveta Vinogradova, Abay Artykbayev, Alisher Amanatay, Mukhamejan Karatayev, Maxim Mametkulov, Albina Li, Anuar Suleimenov, Abylay Salimzhanov, Karina Pats, Rustam Zhumagambetov

Categories: Machine Learning | Artificial Intelligence

2022-08-20

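While generative models have recently become ubiquitous in many scientific fields, less attention has been paid to their evaluation. For molecular generative models, the state of the art examines their output in isolation or in relation to their input. Their biological and functional properties, such as ligand-target interactions, have not been addressed. In this study, a novel biologically inspired benchmark for evaluating molecular generative models is proposed. Specifically, three diverse reference datasets are designed, and a set of metrics directly relevant to the drug discovery process is introduced. In particular, we propose a recreation metric and apply drug-target affinity prediction and molecular docking as complementary techniques for evaluating generative outputs. While all three metrics show consistent results across the tested generative models, a more detailed comparison of drug-target affinity predictions and molecular docking scores reveals that unimodal predictors can lead to erroneous conclusions about target binding at the molecular level, so a multimodal approach is preferable. The key advantage of this framework is that it incorporates prior physicochemical domain knowledge into the benchmarking process by focusing explicitly on ligand-target interactions, creating an efficient tool not only for evaluating molecular generative outputs in particular but also for enriching the drug discovery process in general.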

LIMO: Latent Inceptionism for Targeted Molecule Generation

Peter Eckmann, Kunyang Sun, Bo Zhao, Mudong Feng, Michael K. Gilson, Rose Yu

Categories: Machine Learning

2022-06-17

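Generating drug-like molecules with high binding affinity to target proteins remains a difficult and resource-intensive task in drug discovery. Existing approaches primarily employ reinforcement learning, Markov sampling, or deep generative models guided by Gaussian processes, which can be prohibitively slow when the binding affinity of the generated molecules is calculated with computationally expensive physics-based methods. We present Latent Inceptionism on Molecules (LIMO), which significantly accelerates molecule generation with an inceptionism-like technique. LIMO employs a latent space generated by a variational autoencoder and property prediction by two neural networks in sequence, enabling faster gradient-based reverse optimization of molecular properties. Comprehensive experiments show that LIMO performs competitively on benchmark tasks and markedly outperforms state-of-the-art techniques on the novel task of generating drug-like compounds with high binding affinity, reaching the nanomolar range against two protein targets. We corroborate these docking-based results with more accurate molecular dynamics-based calculations of absolute binding free energy and show that one of our generated drug-like compounds has a predicted $K_D$ (a measure of binding affinity) of $6 \cdot 10^{-14}$ M against the human estrogen receptor, well beyond the affinities of typical early-stage drug candidates and most FDA-approved drugs. Code is available at https://github.com/rose-stl-lab/limo.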
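
To put the affinity figure above on an energy scale, a dissociation constant can be converted to a standard binding free energy with ΔG° = RT ln(K_D / 1 M). The sketch below is only an illustration and is not taken from the LIMO code base; the function name, the 298.15 K default temperature, and the 1 M standard state are assumptions.

```python
import math

# Gas constant in kcal/(mol*K).
GAS_CONSTANT = 1.987204e-3


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy in kcal/mol from a dissociation constant.

    Assumes a 1 M standard state, so kd_molar is used as a dimensionless ratio.
    """
    if kd_molar <= 0:
        raise ValueError("K_D must be positive")
    return GAS_CONSTANT * temperature_k * math.log(kd_molar)


if __name__ == "__main__":
    for kd in (1e-9, 6e-14):
        print(f"K_D = {kd:.1e} M -> dG = {binding_free_energy(kd):.2f} kcal/mol")
```

Under these assumptions, a K_D of 6e-14 M corresponds to roughly -18 kcal/mol, compared with roughly -12 kcal/mol for a 1 nM binder.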

PGMG: A Pharmacophore-Guided Deep Learning Approach for Bioactive Molecular Generation

Huimin Zhu, Renyi Zhou, Jing Tang, Min Li

Categories: Machine Learning

2022-07-02

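The rational design of novel molecules with desired bioactivity is a critical but challenging task in drug discovery, especially when dealing with novel target families or understudied targets. Here, we propose PGMG, a pharmacophore-guided deep learning approach for bioactive molecule generation. Guided by a pharmacophore, PGMG provides a flexible strategy for generating structurally diverse bioactive molecules in various scenarios using a trained variational autoencoder. We show that PGMG can generate molecules that match a given pharmacophore model while maintaining a high level of validity, uniqueness, and novelty. In case studies, we demonstrate the application of PGMG to generating bioactive molecules in ligand-based and structure-based de novo drug design, as well as in lead optimization scenarios. Overall, the flexibility and effectiveness of PGMG make it a useful tool for accelerating the drug discovery process.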
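
Validity, uniqueness, and novelty are usually reported as simple ratios over the generated set. The sketch below follows one common convention (uniqueness among valid molecules, novelty among unique valid molecules); the abstract does not give PGMG's exact formulas, so the definitions, the function names, and the toy validity check are all assumptions. Comparing raw strings also assumes the SMILES have been canonicalized beforehand.

```python
from typing import Callable, Dict, Iterable, List


def balanced_parentheses(smiles: str) -> bool:
    """Toy stand-in for a real validity check (hypothetical helper).

    In practice a cheminformatics parser would be used instead, e.g. treating
    a SMILES string as invalid when RDKit's Chem.MolFromSmiles returns None.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0


def generation_metrics(
    generated: List[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool] = balanced_parentheses,
) -> Dict[str, float]:
    """Return validity, uniqueness and novelty as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    samples = ["CCO", "CCO", "c1ccccc1", "C(("]
    print(generation_metrics(samples, training_set={"CCO"}))
```

Keeping the validity check as a parameter makes it easy to swap the toy predicate for a real parser without touching the ratio logic.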

Predicting the protein-ligand affinity from molecular dynamics trajectories

Yaosen Min, Ye Wei, Peizhuo Wang, Nian Wu, Stefan Bauer, Shuxin Zheng, Yu Shi, Yingheng Wang, Dan Zhao, Ji Wu

Categories: Machine Learning

2022-08-19

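Accurate prediction of protein-ligand binding affinity is essential in drug design and many other molecular recognition problems. Despite many advances in affinity prediction based on machine learning, these methods remain limited because protein-ligand binding is determined by the dynamics of atoms and molecules. To this end, we curated an MD dataset containing 3,218 dynamic protein-ligand complexes and further developed Dynaformer, a graph-based deep learning framework. Dynaformer can fully capture the dynamic binding rules by considering various geometric characteristics of the interaction. Our method outperforms the methods reported to date. Moreover, we performed virtual screening on heat shock protein 90 (HSP90) by integrating our model with structure-based docking. Benchmarking against other baselines shows that our method can identify the molecule with the highest experimental potency. We anticipate that large-scale MD datasets and machine learning models will form a new synergy, providing a new route toward accelerated drug discovery and optimization.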

Multi-view deep learning based molecule design and structural optimization accelerates the SARS-CoV-2 inhibitor discovery

Chao Pang, Yu Wang, Yi Jiang, Ruheng Wang, Ran Su, Leyi Wei

Categories: Machine Learning

2022-12-03

In this work, we propose MEDICO, a Multi-viEw Deep generative model for molecule generation, structural optimization, and the SARS-CoV-2 Inhibitor disCOvery. To the best of our knowledge, MEDICO is the first graph generative model of its kind that can generate molecular graphs similar to the structure of targeted molecules, using a multi-view representation learning framework to learn comprehensive structural semantics adaptively from targeted molecular topology and geometry. We show that MEDICO significantly outperforms state-of-the-art methods in generating valid, unique, and novel molecules in benchmark comparisons. In particular, we show that the multi-view deep learning model lets us generate not only molecules structurally similar to the targeted molecules but also molecules with desired chemical properties, demonstrating the model's strong capability to explore chemical space in depth. Moreover, case study results on targeted molecule generation for the SARS-CoV-2 main protease (Mpro) show that, by integrating molecular docking into our model as a chemical prior, we successfully generate new small molecules with desired drug-like properties for Mpro, potentially accelerating the de novo design of Covid-19 drugs. Further, we apply MEDICO to the structural optimization of three well-known Mpro inhibitors (N3, 11a, and GC376) and achieve ~88% improvement in their binding affinity to Mpro, demonstrating the application value of our model for the development of therapeutics for SARS-CoV-2 infection.

De novo PROTAC design using graph-based deep generative models

Divya Nori, Connor W. Coley, Rocío Mercado

Categories: Artificial Intelligence | Machine Learning

2022-11-04

PROteolysis TArgeting Chimeras (PROTACs) are an emerging therapeutic modality for degrading a protein of interest (POI) by marking it for degradation by the proteasome. Recent developments in artificial intelligence (AI) suggest that deep generative models can assist with the de novo design of molecules with desired properties, yet their application to PROTAC design remains largely unexplored. We show that a graph-based generative model can be used to propose novel PROTAC-like structures from empty graphs. Our model can be guided towards the generation of large molecules (30-140 heavy atoms) predicted to degrade a POI through policy-gradient reinforcement learning (RL). Rewards during RL are applied using a boosted tree surrogate model that predicts a molecule's degradation potential for each POI. Using this approach, we steer the generative model towards compounds with higher likelihoods of predicted degradation activity. Despite being trained on sparse public data, the generative model proposes molecules with substructures found in known degraders. After fine-tuning, predicted activity against a challenging POI increases from 50% to >80% with near-perfect chemical validity for sampled compounds, suggesting this is a promising approach for the optimization of large, PROTAC-like molecules for targeted protein degradation.

Hybrid Approach to Identify Druglikeness Leading Compounds against COVID-19 3CL Protease

Imra Aqeel Abdul Majid

Categories: Machine Learning

2022-08-03

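SARS-CoV-2 is a positive-sense, single-stranded RNA-based macromolecule that had caused more than 6.3 million deaths as of June 2022. Moreover, by disrupting global supply chains through lockdowns, the virus has caused devastating damage to the global economy. It is vital to design and develop drugs for this virus and its variants. In this paper, we use an in silico study framework to repurpose existing therapeutic agents and find drug-like bioactive molecules that could cure COVID-19. We applied the Lipinski rules to molecules retrieved from the ChEMBL database and found 133 drug-like bioactive molecules against the SARS coronavirus 3CL protease. Based on standard IC50 values, the dataset was divided into three classes: active, inactive, and intermediate. Our comparative analysis shows that the proposed Extra Tree Regressor (ETR) ensemble model predicts the bioactivity of chemical compounds more accurately than other state-of-the-art machine learning models. Using ADMET analysis, we identified 13 novel bioactive molecules with ChEMBL IDs 187460, 190743, 222234, 222628, 222735, 222769, 222840, 222893, 225515, 358279, 363535, 365134, and 422688 as candidates against the SARS-CoV-2 3CL protease. These candidate molecules were further examined for binding affinity. To this end, we performed molecular docking and shortlisted six bioactive molecules with ChEMBL IDs 187460, 222769, 225515, 358279, 363535, and 36513. These molecules are potential candidates against SARS-CoV-2. The pharmacology community may use these promising compounds for further in vitro analysis.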
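
The two filtering steps described above, a Lipinski check followed by IC50-based class labels, can be sketched as follows. This is an illustration rather than the authors' pipeline: the `Descriptors` container is hypothetical, the descriptor values are assumed to be computed elsewhere, the number of tolerated violations is a parameter, and the 1,000 nM and 10,000 nM class boundaries are assumed thresholds that the abstract does not state.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count violations of Lipinski's rule of five."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def is_drug_like(d: Descriptors, max_violations: int = 0) -> bool:
    """Treat a molecule as drug-like if it has at most max_violations violations."""
    return lipinski_violations(d) <= max_violations


def activity_class(ic50_nm: float,
                   active_below: float = 1_000.0,
                   inactive_above: float = 10_000.0) -> str:
    """Label a compound from its IC50 in nM (thresholds are assumptions)."""
    if ic50_nm <= active_below:
        return "active"
    if ic50_nm >= inactive_above:
        return "inactive"
    return "intermediate"


if __name__ == "__main__":
    candidate = Descriptors(mol_weight=350.0, log_p=2.1, h_donors=2, h_acceptors=5)
    print(is_drug_like(candidate), activity_class(500.0))
```

Some workflows tolerate a single Lipinski violation; passing `max_violations=1` covers that variant.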

Recent Developments in Structure-Based Virtual Screening Approaches

Christoph Gorgulla

Categories: Machine Learning

2022-11-06

Drug development is a wide scientific field that faces many challenges these days. Among them are extremely high development costs, long development times, and a low number of new drugs approved each year. To solve these problems, new and innovative technologies are needed that make the discovery of small-molecule drugs more time- and cost-efficient and that make it possible to target previously undruggable target classes such as protein-protein interactions. Structure-based virtual screenings have become a leading contender in this context. In this review, we give an introduction to the foundations of structure-based virtual screenings and survey their progress in the past few years. We outline key principles, recent success stories, new methods, available software, and promising future research directions. Virtual screenings have enormous potential for the development of new small-molecule drugs and are already starting to transform early-stage drug discovery.

Bridging the gap between target-based and cell-based drug discovery with a graph generative multi-task model

Fan Hu, Dongqi Wang, Huazhen Huang, Yishen Hu, Peng Yin

Categories: Machine Learning

2022-08-09

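Drug discovery is vitally important for protecting people from disease. Target-based screening has been one of the most popular methods for developing new drugs over the past several decades. This method efficiently screens candidate drugs that inhibit a target protein in vitro, but it often fails because the selected drugs are insufficiently active in vivo. Accurate computational methods are needed to bridge this gap. Here, we propose a novel graph multi-task deep learning model to identify compounds with both target-inhibitory and cell-active (MATIC) properties. On a carefully curated SARS-CoV-2 dataset, the proposed MATIC model shows advantages over traditional methods in screening compounds that are effective in vivo. Next, we explored the interpretability of the model and found that the features learned for the target inhibition (in vitro) and cell activity (in vivo) tasks differ in their molecular property correlations and atom functional attentions. Based on these findings, we used a Monte Carlo-based reinforcement learning generative model to generate novel multi-property compounds with both in vitro and in vivo efficacy, thereby bridging the gap between target-based and cell-based drug discovery.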

Molecule Optimization via Fragment-based Generative Models

Ziqi Chen, Martin Renqiang Min, Srinivasan Parthasarathy, Xia Ning

Categories: Machine Learning | Neural and Evolutionary Computing | Machine Learning (Statistics)

2020-12-08

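In drug discovery, molecule optimization is an important step that modifies drug candidates into better ones in terms of desired drug properties. With recent advances in artificial intelligence, this traditionally in vitro process has been increasingly facilitated by in silico approaches. We present an innovative in silico approach to optimizing molecules computationally and formulate the problem as generating optimized molecular graphs with deep generative models. Our generative models follow the key idea of fragment-based drug design and optimize molecules by modifying their small fragments. The models learn how to identify the fragments to be optimized and how to modify them by learning from the differences between molecules with good and bad properties. When optimizing a new molecule, our models apply the learned signals to decode optimized fragments at the predicted fragment locations. We also assemble multiple such models into a pipeline in which each model optimizes one fragment, so that the whole pipeline can modify multiple fragments of a molecule when needed. We compare our models with other state-of-the-art methods on benchmark datasets and show that our methods achieve more than 80% property improvement under moderate molecular similarity constraints and more than 80% property improvement under high molecular similarity constraints.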

Reinforced Genetic Algorithm for Structure-based Drug Design

Tianfan Fu, Wenhao Gao, Connor W. Coley, Jimeng Sun

Categories: Machine Learning

2022-11-28

Structure-based drug design (SBDD) aims to discover drug candidates by finding molecules (ligands) that bind tightly to a disease-related protein (targets), which is the primary approach to computer-aided drug discovery. Recently, applying deep generative models for three-dimensional (3D) molecular design conditioned on protein pockets to solve SBDD has attracted much attention, but their formulation as probabilistic modeling often leads to unsatisfactory optimization performance. On the other hand, traditional combinatorial optimization methods such as genetic algorithms (GA) have demonstrated state-of-the-art performance in various molecular optimization tasks. However, they do not utilize protein target structure to inform design steps but rely on a random-walk-like exploration, which leads to unstable performance and no knowledge transfer between different tasks despite the similar binding physics. To achieve a more stable and efficient SBDD, we propose Reinforced Genetic Algorithm (RGA) that uses neural models to prioritize the profitable design steps and suppress random-walk behavior. The neural models take the 3D structure of the targets and ligands as inputs and are pre-trained using native complex structures to utilize the knowledge of the shared binding physics from different targets and then fine-tuned during optimization. We conduct thorough empirical studies on optimizing binding affinity to various disease targets and show that RGA outperforms the baselines in terms of docking scores and is more robust to random initializations. The ablation study also indicates that the training on different targets helps improve performance by leveraging the shared underlying physics of the binding processes. The code is available at https://github.com/futianfan/reinforced-genetic-algorithm.

Artificial Intelligence in Drug Discovery: Applications and Techniques

Jianyuan Deng, Zhibo Yang, Iwao Ojima, Dimitris Samaras, Fusheng Wang

Categories: Machine Learning | Artificial Intelligence

2021-06-09

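Artificial intelligence (AI) has been transforming the practice of drug discovery over the past decade. Various AI techniques have been used in a wide range of applications, such as virtual screening and drug design. In this survey, we first give an overview of drug discovery and discuss related applications, which can be reduced to two major tasks: molecular property prediction and molecule generation. We then discuss common data resources, molecule representations, and benchmark platforms. Furthermore, to summarize the progress of AI in drug discovery, we present the relevant AI techniques in the surveyed papers, including model architectures and learning paradigms. We expect this survey to serve as a guide for researchers interested in working at the interface of artificial intelligence and drug discovery. We also provide a GitHub repository (https://github.com/dengjianyuan/survey_survey_au_drug_discovery) with a collection of papers and code, where applicable, as a regularly updated learning resource.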

Fragment-based molecular generative model with high generalization ability and synthetic accessibility

Seonghwan Seo, Jaechang Lim, Woo Youn Kim

Categories: Machine Learning

2021-11-25

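Deep generative models are attracting great attention for the design of molecules with desired properties. Most existing models generate molecules by adding atoms sequentially. This often yields molecules that correlate poorly with the target properties and have low synthetic accessibility. Molecular fragments such as functional groups are more closely related to molecular properties and synthetic accessibility than atoms are. Here, we propose a fragment-based molecular generative model that designs new molecules with target properties by sequentially adding molecular fragments to any given starting molecule. A key feature of our model is its high generalization ability in terms of property control and fragment types. The former is achieved by learning the contribution of individual fragments to the target properties in an autoregressive manner. For the latter, we use a deep neural network that predicts the bonding probability of two molecules from their embedding vectors. The high synthetic accessibility of the generated molecules is considered implicitly by preparing the fragment library with the BRICS decomposition method. We show that the model can generate molecules while controlling multiple target properties simultaneously with a high success rate. It also works equally well with unseen fragments, even in property ranges where training data are rare, which verifies its high generalization ability. As a practical application, we show that the model can generate potential inhibitors with high binding affinity, in terms of docking score, against the 3CL protease of SARS-CoV-2.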

Exploiting Pretrained Biochemical Language Models for Targeted Drug Design

Gökçe Uludoğan, Elif Ozkirimli, Kutlu O. Ulgen, Nilgün Karalı, Arzucan Özgür

Categories: Machine Learning | Natural Language Processing | Machine Learning (Statistics)

2022-09-02

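Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabeled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e., warm start) targeted molecule generation models. We investigate two warm-start strategies: (i) a one-stage strategy in which the initialized model is trained on targeted molecule generation, and (ii) a two-stage strategy that pre-finetunes on molecule generation and then trains on the specific target. We also compare two decoding strategies for generating compounds: beam search and sampling. Results: The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results on widely used benchmark metrics. However, docking evaluation of the compounds generated for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both the docking evaluation and the benchmark metrics for assessing compound quality. Availability and implementation: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145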

PIGNet: A physics-informed deep learning model toward generalized drug-target interaction predictions

Seokhyun Moon, Wonho Zhung, Soojung Yang, Jaechang Lim, Woo Youn Kim

Categories: Machine Learning

2020-08-22

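Recently, drug-target interaction (DTI) models based on deep neural networks (DNNs) have been highlighted for their high accuracy at affordable computational cost. However, the generalization of these models remains a challenging problem in the practice of in silico drug discovery. We propose two key strategies to improve the generalization of DTI models. The first is to predict atom-atom pairwise interactions with physics-informed equations parameterized by neural networks and to give the total binding affinity of a protein-ligand complex as their sum. We further improve model generalization by augmenting the training data with a broader range of binding poses and ligands. We validated our model, PIGNet, on the Comparative Assessment of Scoring Functions (CASF) 2016 benchmark, where it shows better docking and screening power than previous methods. Our physics-informed strategy also makes the predicted affinities interpretable by visualizing the contributions of ligand substructures, providing insights for further ligand optimization.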
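
The idea of building a total affinity from atom-atom pair terms, and reading off per-atom contributions for interpretation, can be illustrated with a toy example. The sketch below does not reproduce PIGNet's equations or its neural parameterization: it substitutes a fixed 12-6 Lennard-Jones form, and the function names, `r_min`, `epsilon`, and the distance cutoff are all assumed values chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pair_energy(distance: float, r_min: float = 3.8, epsilon: float = 0.2) -> float:
    """12-6 Lennard-Jones energy with its minimum (-epsilon) at r_min."""
    ratio = r_min / distance
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)


def ligand_atom_contributions(
    ligand_xyz: Sequence[Point],
    protein_xyz: Sequence[Point],
    cutoff: float = 8.0,
) -> List[float]:
    """Sum pair energies over protein atoms within the cutoff, per ligand atom."""
    contributions = []
    for lig in ligand_xyz:
        energy = 0.0
        for prot in protein_xyz:
            d = math.dist(lig, prot)
            if 0.0 < d <= cutoff:
                energy += pair_energy(d)
        contributions.append(energy)
    return contributions


def total_score(ligand_xyz: Sequence[Point], protein_xyz: Sequence[Point]) -> float:
    """Total score as the sum of all pairwise terms."""
    return sum(ligand_atom_contributions(ligand_xyz, protein_xyz))


if __name__ == "__main__":
    ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    protein = [(3.8, 0.0, 0.0), (0.0, 20.0, 0.0)]
    print(ligand_atom_contributions(ligand, protein), total_score(ligand, protein))
```

Because the total is a plain sum, each ligand atom's share can be inspected directly, which is the property that makes substructure-level visualization possible in additive scoring schemes.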

Decoding the Protein-ligand Interactions Using Parallel Graph Neural Networks

Carter Knutson, Mridula Bontha, Jenna A. Bilbrey, Neeraj Kumar

Categories: Machine Learning (Statistics) | Machine Learning

2021-11-30

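Protein-ligand interactions (PLIs) are fundamental to biochemical research, and their identification is crucial for estimating the biophysical and biochemical properties needed for rational therapeutic design. Currently, experimental characterization of these properties is the most accurate method, but it is very time-consuming and labor-intensive. A number of computational methods have been developed in this context, but most existing PLI predictors depend heavily on 2D protein sequence data. Here, we present a novel parallel graph neural network (GNN) that integrates knowledge representation and reasoning for PLI prediction, performing deep learning guided by expert knowledge and informed by 3D structural data. We develop two distinct GNN architectures: GNNF is the base implementation, which employs distinct featurization to enhance domain awareness, while GNNP is a novel implementation that can predict without prior knowledge of the intermolecular interactions. A comprehensive evaluation shows that the GNNs successfully capture the binary interactions between ligands and 3D protein structures, reaching a test accuracy of 0.958 in predicting the activity of a protein-ligand complex (the text gives this figure next to GNNF's accuracy, whose value is not preserved). These models are further adapted to regression tasks to predict experimental binding affinities and pIC50, which is crucial for drug potency and efficacy. We achieve Pearson correlation coefficients of 0.66 and 0.65 on experimental affinity and 0.50 and 0.51 on pIC50 with GNNF and GNNP, respectively, outperforming similar 2D sequence-based models. Our method can serve as an interpretable and explainable artificial intelligence (AI) tool for predicting the activity, potency, and biophysical properties of lead candidates. To this end, we demonstrate the utility of GNNP on SARS-CoV-2 protein targets by screening a large compound library and comparing our predictions with experimentally measured data.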
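
For readers less familiar with the two quantities reported above: pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, and the Pearson coefficient measures the linear correlation between predicted and measured values. A minimal sketch follows; the helper names and the sample numbers are made up for illustration and do not come from the paper.

```python
import math
from typing import Sequence


def pic50_from_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); with IC50 in nM this is 9 - log10(IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    measured = [pic50_from_nm(v) for v in (1.0, 50.0, 1000.0, 20000.0)]
    predicted = [8.6, 7.1, 6.4, 5.0]  # made-up model outputs
    print(measured, round(pearson(measured, predicted), 3))
```

Working on the pIC50 scale rather than raw IC50 keeps potencies that span several orders of magnitude on a comparable footing, which is why regression models are typically trained and scored on it.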

Retrieval-based Controllable Molecule Generation

Zichao Wang, Weili Nie, Zhuoran Qiao, Chaowei Xiao, Richard Baraniuk, Anima Anandkumar

Categories: Machine Learning

2022-08-23

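Generating new molecules with specified chemical and biological properties via generative models has emerged as a promising direction for drug discovery. However, existing methods require extensive training or fine-tuning on large datasets, which are often unavailable in real-world generation tasks. In this work, we propose a new retrieval-based framework for controllable molecule generation. We use a small set of exemplar molecules, i.e., molecules that (partially) satisfy the design criteria, to steer a pre-trained generative model toward synthesizing molecules that satisfy the given design criteria. We design a retrieval mechanism that retrieves the exemplar molecules and fuses them with the input molecule; it is trained with a new self-supervised objective that predicts the nearest neighbor of the input molecule. We also propose an iterative refinement process that dynamically updates the generated molecules and the retrieval database for better generalization. Our approach is agnostic to the choice of generative model and requires no task-specific fine-tuning. On various tasks, ranging from simple design criteria to a challenging real-world scenario of designing lead compounds that bind to the SARS-CoV-2 main protease, we show that our approach extrapolates well beyond the retrieval database and achieves better performance and wider applicability than previous methods.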

Structure-based drug discovery with deep learning

Rıza Özçelik, Derek van Tilborg, José Jiménez-Luna, Francesca Grisoni

Categories: Machine Learning

2022-12-26

Artificial intelligence (AI) in the form of deep learning bears promise for drug discovery and chemical biology, e.g., to predict protein structure and molecular bioactivity, plan organic synthesis, and design molecules de novo. While most of the deep learning efforts in drug discovery have focused on ligand-based approaches, structure-based drug discovery has the potential to tackle unsolved challenges, such as affinity prediction for unexplored protein targets, binding-mechanism elucidation, and the rationalization of related chemical kinetic properties. Advances in deep learning methodologies and the availability of accurate predictions for protein tertiary structure advocate for a renaissance in structure-based approaches for drug discovery guided by AI. This review summarizes the most prominent algorithmic concepts in structure-based deep learning for drug discovery, and forecasts opportunities, applications, and challenges ahead.

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, Anima Anandkumar

Categories: Machine Learning | Natural Language Processing | Machine Learning (Statistics)

2022-12-21

There is increasing adoption of artificial intelligence in drug discovery. However, existing works mainly apply machine learning to the chemical structures of molecules and ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions, and predict complex biological activities. We present a multi-modal molecule structure-text model, MoleculeSTM, which jointly learns molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct the largest multi-modal dataset to date, namely PubChemSTM, with over 280K chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions: structure-text retrieval and molecule editing. MoleculeSTM possesses two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM achieves state-of-the-art generalization to novel biochemical concepts across various benchmarks.
