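Data-driven predictive methods that can efficiently and accurately transform protein sequences into biologically active structures are highly valuable for scientific research and therapeutic development. Determining an accurate folding landscape using co-evolutionary information is fundamental to the success of modern protein structure prediction methods. As the state of the art, AlphaFold2 has dramatically raised accuracy without performing explicit co-evolutionary analysis. Nevertheless, its performance still shows a strong dependence on available sequence homologs. We investigate the cause of this dependence and present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 on targets with poor MSAs. EvoGen allows us to manipulate the folding landscape either by denoising the searched MSA or by generating a virtual MSA, and helps AlphaFold2 fold accurately in the low-data regime, even achieving encouraging performance with single-sequence predictions. Being able to make accurate predictions with few-shot MSAs not only generalizes AlphaFold2 better to orphan sequences, but also democratizes its use in high-throughput applications. In addition, EvoGen combined with AlphaFold2 yields a probabilistic structure generation method that can explore alternative conformations of protein sequences, and the task-aware differentiable algorithm for sequence generation will benefit other related tasks, including protein design.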
translated by 谷歌翻译
The prediction of protein structures from sequences is an important task for function prediction, drug design, and understanding related biological processes. Recent advances have proved the power of language models (LMs) in processing protein sequence databases; these models inherit the advantages of attention networks and capture useful information when learning representations for proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including evolution-based and single-sequence-based PSP. Instead of energy-based models and sampling procedures, protein language model (pLM)-based pipelines appear to have emerged as the mainstream paradigm in PSP. Despite this progress, the PSP community needs a systematic and up-to-date survey to help bridge the gap between LMs in the natural language processing (NLP) and PSP domains and to introduce their methodologies, advancements, and practical applications. To this end, in this paper, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. Then, we systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly used protein databases. Next, different types of methods for PSP are discussed, particularly how pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions along with the advances of pLMs. This survey aims to be a hands-on guide for researchers to understand PSP methods, develop pLMs, and tackle challenging problems in this field for practical purposes.
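Proteins are essential components of human life, and their structures are important for function and mechanism analysis. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, existing open-source datasets are far from sufficient to meet the needs of modern protein-sequence-related research. To address this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. The dataset consists of 570k true-structure sequences (10 TB) and 745k complementary distillation sequences (15 TB). In addition, we provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of the dataset for training by participating in the CAMEO contest, in which our model won first place. We hope that the PSP dataset, together with the training benchmark, will enable a broader community of AI and biology researchers to pursue AI-driven protein-related research.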
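Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant to artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/salesforce/progen.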
Protein structure prediction aims to determine the three-dimensional shape of a protein from its amino acid sequence 1. This problem is of fundamental importance to biology as the structure of a protein largely determines its function 2 but can be hard to determine experimentally. In recent years, considerable progress has been made by leveraging genetic information: analysing the co-variation of homologous sequences can allow one to infer which amino acid residues are in contact, which in turn can aid structure prediction 3. In this work, we show that we can train a neural network to accurately predict the distances between pairs of residues in a protein, which convey more about structure than contact predictions. With this information we construct a potential of mean force 4 that can accurately describe the shape of a protein. We find that the resulting potential can be optimised by a simple gradient descent algorithm, to realise structures without the need for complex sampling procedures. The resulting system, named AlphaFold, has been shown to achieve high accuracy, even for sequences with relatively few homologous sequences. In the most recent Critical Assessment of Protein Structure Prediction 5 (CASP13), a blind assessment of the state of the field of protein structure prediction, AlphaFold created high-accuracy structures (with TM-scores † of 0.7 or higher) for 24 out of 43 free modelling domains whereas the next best method, using sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a significant advance in protein structure prediction. We expect the increased accuracy of structure predictions for proteins to enable insights in understanding the function and malfunction of these proteins, especially in cases where no homologous proteins have been experimentally determined 7. Proteins are at the core of most biological processes.
Since the function of a protein is dependent on its structure, understanding protein structure has been a grand challenge in biology for decades. While several experimental structure determination techniques have been developed
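RNA structure determination and prediction can promote RNA-targeted drug development and the design of engineerable synthetic elements. However, due to the intrinsic structural flexibility of RNAs, all three mainstream structure determination methods (X-ray crystallography, NMR, and Cryo-EM) encounter challenges when resolving RNA structures, which leads to the scarcity of resolved RNA structures. Computational prediction approaches have emerged as a complement to experimental techniques. However, none of the de novo approaches is based on deep learning, since too few structures are available. Instead, most of them apply time-consuming sampling-based strategies, and their performance seems to have hit a plateau. In this work, we develop the first end-to-end deep learning approach, E2Efold-3D, to accurately perform de novo RNA structure prediction. Several novel components are proposed to overcome the data scarcity, such as a fully differentiable end-to-end pipeline, secondary-structure-assisted self-distillation, and a parameter-efficient backbone formulation. These designs are validated on the independent, non-overlapping RNA-Puzzles test dataset and reach an average sub-4 Å root-mean-square deviation, demonstrating superior performance compared to state-of-the-art approaches. Interestingly, the method also achieves promising results when predicting RNA complex structures, a feat that none of the previous systems could accomplish. When E2Efold-3D is coupled with experimental techniques, the RNA structure prediction field can be greatly advanced.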
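The root-mean-square deviation quoted in the abstract above can be illustrated with a minimal Python sketch. It assumes the predicted and reference structures are already superimposed and have matching atom order; the abstract does not state the evaluation protocol, so the function name `rmsd` and the toy coordinates are purely illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumption: the two structures are already superimposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy example (hypothetical coordinates in angstroms): every atom is offset by 3 along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(rmsd(predicted, reference))
```

In practice the two structures are usually superimposed first (for example with the Kabsch algorithm), so this sketch gives an upper bound on the aligned value.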
Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data to train an accurate model for predicting the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information and exploiting an attention mechanism. Our model integrates the local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantics from the universal protein sequence space, and structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy that leverages data from unsupervised models to pre-train our model. After that, our model can achieve strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants (> 4 mutation sites), when fine-tuned using only a small number of experimental mutation data points (< 50). The proposed strategy is of great practical value because the required experimental effort, i.e., producing a few tens of experimental mutation data points for a given protein, is generally affordable for an ordinary biochemical group, and the strategy can be applied to almost any protein.
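Large-scale protein language models (PLMs) have improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a ground-breaking AI system, could potentially reshape structural biology. However, the utility of Evoformer, the PLM module in AlphaFold, beyond structure prediction has not been explored. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment), and Evoformer (structural), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does Evoformer, trained as part of AlphaFold, produce representations that are predictive of protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein data, and are they complementary to each other in this respect? We compare these models through an empirical study, along with new insights and conclusions. Finally, we release code and datasets for reproducibility.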
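AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These advanced pipelines mainly rely on multiple sequence alignments (MSAs) and templates as inputs to learn co-evolution information from homologous sequences. However, searching MSAs and templates from protein databases is time-consuming, usually taking dozens of minutes. Consequently, we attempt to explore the limits of fast protein structure prediction by using only the primary sequences of proteins. HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2. The method first pre-trains a large-scale protein language model (PLM) on thousands of millions of primary sequences using the self-supervised learning paradigm; this PLM serves as an alternative to MSAs and templates for learning co-evolution information. Then, by combining the pre-trained PLM with the essential components of AlphaFold2, we obtain an end-to-end differentiable model that predicts the 3D coordinates of atoms from the primary sequence alone. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving accuracy competitive with MSA-based methods on targets with large homologous families. Furthermore, HelixFold-Single takes much less time than mainstream protein structure prediction pipelines, demonstrating its potential in tasks requiring many predictions. The code of HelixFold-Single is available at https://github.com/paddlepaddle/paddlehelix/tree/dev/dev/pprotein_folding/helixfold-single, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein-single/prevast.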
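Self-supervised neural language models have recently been applied to biological sequence data, advancing structure, function, and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's Evoformer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple and universal combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer than when using inferred Potts models.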
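The Hamming distances between MSA sequences mentioned in the abstract above can be computed as in the following minimal Python sketch. It assumes that gaps count as ordinary characters and that distances are normalized by alignment length; the paper's exact convention is not given here, and the toy alignment and function names are hypothetical.

```python
from itertools import combinations

def hamming_distance(a: str, b: str) -> int:
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_hamming(msa):
    """Symmetric matrix of length-normalized Hamming distances for an MSA."""
    n = len(msa)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(msa[i], msa[j]) / len(msa[i])
        dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical toy alignment; '-' is a gap and is treated like any other symbol.
toy_msa = ["MKV-LA", "MKVQLA", "MRV-IA"]
distances = pairwise_hamming(toy_msa)
for row in distances:
    print(row)
```

A matrix of this kind is what one would correlate against a model-derived quantity, such as the column-attention combinations described in the abstract.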
Proteins are fundamental biological entities that play a key role in life activities. The amino acid sequences of proteins can be folded into stable 3D structures in the real physicochemical world, forming a special kind of sequence-structure data. With the development of Artificial Intelligence (AI) techniques, Protein Representation Learning (PRL) has recently emerged as a promising research topic for extracting informative knowledge from massive protein sequences or structures. To pave the way for AI researchers with little bioinformatics background, we present a timely and comprehensive review of PRL formulations and existing PRL methods from the perspective of model architectures, pretext tasks, and downstream applications. We first briefly introduce the motivations for protein representation learning and formulate it in a general and unified framework. Next, we divide existing PRL methods into three main categories: sequence-based, structure-based, and sequence-structure co-modeling. Finally, we discuss some technical challenges and potential directions for improving protein representation learning. The latest advances in PRL methods are summarized in a GitHub repository https://github.com/LirongWu/awesome-protein-representation-learning.
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.
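The product-of-experts construction named in the abstract above can be illustrated with a minimal Python sketch. It only shows how per-expert log-probabilities combine and renormalize over a small, enumerable set of candidates; it does not reproduce the gradient-based MCMC sampler, and the two toy "experts" and their numbers are hypothetical.

```python
import math

def product_of_experts(expert_log_probs):
    """Combine experts by summing log-probabilities, then renormalize.

    expert_log_probs: list of dicts mapping candidate -> log-probability.
    Assumption: every expert scores the same finite set of candidates.
    """
    candidates = list(expert_log_probs[0])
    combined = {c: sum(e[c] for e in expert_log_probs) for c in candidates}
    # Log-sum-exp normalization for numerical stability.
    peak = max(combined.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in combined.values()))
    return {c: math.exp(v - log_z) for c, v in combined.items()}

# Hypothetical experts scoring two candidate residues at one position.
language_model = {"A": math.log(0.5), "G": math.log(0.5)}
fitness_model = {"A": math.log(0.8), "G": math.log(0.2)}
posterior = product_of_experts([language_model, fitness_model])
print(posterior)
```

Enumeration like this is only feasible for tiny candidate sets; for full sequences the normalizer is intractable, which is why a sampler is needed.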
In the field of antibody engineering, an essential task is to design a novel antibody whose paratopes bind to a specific antigen with correct epitopes. Understanding antibody structure and its paratope can facilitate a mechanistic understanding of its function. Therefore, predicting antibody structure from sequence alone has always been a highly valuable problem for de novo antibody design. AlphaFold2, a breakthrough in the field of structural biology, provides a solution for predicting protein structure based on protein sequences and computationally expensive coevolutionary multiple sequence alignments (MSAs). However, its limited computational efficiency and unsatisfactory prediction accuracy on antibodies, especially on the complementarity-determining regions (CDRs), restrict its application in industrial high-throughput drug design. To learn an informative representation of antibodies, we trained a deep antibody language model (ALM), based on a transformer, on curated sequences from the Observed Antibody Space database. We also developed a novel model named xTrimoABFold to predict antibody structure from antibody sequence based on the pretrained ALM as well as efficient evoformers and structural modules. The model was trained end-to-end on the antibody structures in PDB by minimizing an ensemble loss combining a domain-specific focal loss on CDRs and the frame-aligned point loss. xTrimoABFold outperforms AlphaFold2 and other protein-language-model-based SOTAs, e.g., OmegaFold, HelixFold-Single, and IgFold, by a large margin (30+% improvement on RMSD) while running 151 times faster than AlphaFold2. To the best of our knowledge, xTrimoABFold achieves state-of-the-art antibody structure prediction. Its improvement in both accuracy and efficiency makes it a valuable tool for de novo antibody design and could enable further improvements in immuno-theory.
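Computational protein design, i.e., inferring novel and diverse protein sequences consistent with a given structure, remains a major unsolved challenge. Recently, deep generative models that learn from sequences alone or from sequences and structures jointly have shown impressive performance on this task. However, those models appear limited in modeling structural constraints, in capturing sufficient sequence diversity, or both. Here we consider three recently proposed deep generative frameworks for protein design: (AR) a sequence-based autoregressive generative model, (GVP) a precise structure-based graph neural network, and Fold2Seq, which leverages a fuzzy and scale-free representation of a three-dimensional fold while enforcing structure-to-sequence (and vice versa) consistency. We benchmark these models on the task of computational design of antibody sequences, which demands designing sequences with high diversity for functional implication. The Fold2Seq framework outperforms the two other baselines in terms of the diversity of the designed sequences, while maintaining the typical fold.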
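This is an introductory machine-learning course developed specifically with STEM students in mind. Our goal is to provide the interested reader with the basics needed to employ machine learning in their own projects and to familiarize them with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods that do not use neural networks, such as principal component analysis, t-SNE, clustering, as well as linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural-network structures such as dense feed-forward and conventional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability are discussed for latent-space representations and illustrated with the examples of dreaming and adversarial attacks. The final section is dedicated to reinforcement learning, where we introduce the basic notions of value functions and policy learning.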
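Modern deep learning methods constitute incredibly powerful tools for tackling a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use, and evaluate Bayesian neural networks, i.e., stochastic artificial neural networks trained using Bayesian methods.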
Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained by the limited quantity of structural data. Meanwhile, protein language models trained on substantial 1D sequences have shown capabilities that grow with scale across a broad range of applications. Nevertheless, no preceding studies consider combining these different protein modalities to promote the representation power of geometric neural networks. To address this gap, we take the first step toward integrating the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks. Experiments are conducted on a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction, leading to an overall improvement of 20% over baselines and new state-of-the-art performance. Strong evidence indicates that incorporating the knowledge of protein language models enhances the capacity of geometric networks by a significant margin and generalizes to complex tasks.
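Identifying novel drug-target interactions (DTI) is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We first unveil the mechanisms responsible for this shortcoming, demonstrating how models rely on shortcuts that leverage the topology of the protein-ligand bipartite network rather than learning the node features. Then, we introduce AI-Bind, a pipeline that combines network-based sampling strategies with unsupervised pre-training, allowing us to limit the annotation imbalance and improve binding predictions for novel proteins and ligands. We illustrate the value of AI-Bind by predicting drugs and natural compounds with binding affinity to SARS-CoV-2 viral proteins and the associated human proteins. We also validate these predictions via auto-docking simulations and comparison with recent experimental evidence. Overall, AI-Bind offers a powerful high-throughput approach to identifying drug-target combinations, with the potential to become a powerful tool in drug discovery.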
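Computational methods that operate on three-dimensional molecular structure have the potential to solve important questions in biology and chemistry. In particular, deep neural networks have gained significant attention, but their widespread adoption in the biomolecular domain has been limited by a lack of either systematic performance benchmarks or a unified toolkit for interacting with molecular data. To address this, we present ATOM3D, a collection of both novel and existing benchmark datasets spanning several key classes of biomolecules. We implement several classes of three-dimensional molecular learning methods for each of these tasks and show that they consistently improve performance relative to methods based on one- and two-dimensional representations. The specific choice of architecture proves to be critical for performance, with three-dimensional convolutional networks excelling at tasks involving complex geometries, graph networks performing well on systems requiring detailed positional information, and the more recently developed equivariant networks showing significant promise. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning, and that there is potential for improvement on many tasks that remain underexplored. To lower the barrier to entry and facilitate further developments in the field, we also provide a comprehensive suite of tools for dataset processing, model training, and evaluation in our open-source atom3d Python package. All datasets are available for download from https://www.atom3d.ai.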
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.