Attention-based models trained on protein sequences have achieved incredible success in classification and generation tasks relevant to AI-driven protein design. However, we lack a sufficient understanding of the role that very large-scale models and data play in effective protein model development. We introduce a suite of protein language models, named ProGen2, scaled up to 6.4B parameters and trained on diverse sequence datasets drawn from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning. As large model sizes and raw numbers of protein sequences continue to become more widely available, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/salesforce/progen.
In this work, we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models on next-amino-acid prediction, zero-shot fitness prediction, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly for the benefit of the research community.
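Zero-shot fitness prediction with an autoregressive protein model typically reduces to comparing sequence log-likelihoods. Below is a minimal sketch of that idea, with a uniform toy distribution standing in for a trained network; `uniform_model` and `zero_shot_score` are illustrative names, not RITA's actual API:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_log_likelihood(seq, cond_prob):
    """Sum of log p(x_i | x_<i) under an autoregressive model."""
    return sum(np.log(cond_prob(seq[:i], seq[i])) for i in range(len(seq)))

def uniform_model(prefix, next_aa):
    """Toy stand-in for a trained model: uniform over the 20 amino acids."""
    return 1.0 / len(AMINO_ACIDS)

def zero_shot_score(wild_type, variant, cond_prob):
    """Fitness proxy: log-likelihood ratio of variant vs. wild type."""
    return (sequence_log_likelihood(variant, cond_prob)
            - sequence_log_likelihood(wild_type, cond_prob))
```

With a real model, `cond_prob` would be replaced by the network's next-token distribution; no finetuning is involved, which is what makes the evaluation "zero-shot".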
The prediction of protein structures from sequences is an important task for function prediction, drug design, and understanding of related biological processes. Recent advances have demonstrated the power of language models (LMs) in processing protein sequence databases; these models inherit the advantages of attention networks and capture useful information for learning protein representations. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including evolution-based and single-sequence-based PSP. Instead of energy-based models and sampling procedures, protein language model (pLM)-based pipelines have emerged as the mainstream paradigm in PSP. Despite the fruitful progress, the PSP community needs a systematic and up-to-date survey to help bridge the gap between LMs in the natural language processing (NLP) and PSP domains and to introduce their methodologies, advancements and practical applications. To this end, in this paper, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. Then, we systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly-used protein databases. Next, different types of methods for PSP are discussed, particularly how pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions along with the advances of pLMs. This survey aims to be a hands-on guide for researchers to understand PSP methods, develop pLMs and tackle challenging problems in this field for practical purposes.
Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained by the limited quantity of structural data. Meanwhile, protein language models trained on substantial 1D sequences have shown burgeoning capabilities with scale in a broad range of applications. Nevertheless, no preceding studies consider combining these different protein modalities to promote the representation power of geometric neural networks. To address this gap, we take the first step toward integrating the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks. We evaluate on a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction, achieving an overall improvement of 20% over baselines and new state-of-the-art performance. Strong evidence indicates that the incorporation of protein language models' knowledge enhances geometric networks' capacity by a significant margin and can be generalized to complex tasks.
Large-scale protein language models (PLMs) have improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a groundbreaking AI system, could potentially reshape structural biology. However, the utility of AlphaFold's PLM module, Evoformer, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA Transformer (multiple sequence alignment) and Evoformer (structure), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does Evoformer, trained as part of AlphaFold, produce representations amenable to predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA Transformer? (iii) How much do these PLMs rely on evolution-related protein data? In this regard, are they complementary to each other? We compare these models through empirical study and offer new insights and conclusions. Finally, we release code and datasets for reproducibility.
Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data for training an accurate model to predict the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information and exploiting an attention mechanism. Our model integrates local evolutionary context from homologous sequences, global evolutionary context encoding rich semantics from the universal protein sequence space, and structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models in predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy that leverages data from unsupervised models to pre-train our model. After that, our model achieves strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants (> 4 mutation sites), when fine-tuned using only a small number of experimental mutation data (< 50). The proposed strategy is of great practical value because the required experimental effort, i.e., producing a few tens of experimental mutation data on a given protein, is generally affordable for an ordinary biochemical group and can be applied to almost any protein.
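One way to picture the kind of feature fusion SESNet describes is to concatenate per-residue feature blocks and mix them across residues with self-attention. The sketch below is illustrative only (numpy, identity projections, invented function names), not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_residue_features(local_feats, global_feats, struct_feats):
    """Concatenate per-residue feature blocks (local evolutionary context,
    global pLM embedding, structural microenvironment) and mix them across
    residues with one self-attention layer using identity projections."""
    x = np.concatenate([local_feats, global_feats, struct_feats], axis=-1)  # (L, D)
    attn = softmax(x @ x.T / np.sqrt(x.shape[-1]), axis=-1)                 # (L, L)
    return attn @ x                                                          # (L, D)
```

A real model would learn query/key/value projections and stack several such layers before the fitness head; this sketch only shows how the three information sources meet at the residue level.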
Learning effective protein representations is critical for a variety of tasks in biology, such as predicting protein function or structure. Existing approaches usually pretrain protein language models on large numbers of unlabeled amino acid sequences and then finetune the models with some labeled data on downstream tasks. Despite the effectiveness of sequence-based approaches, pretraining on known protein structures has not been explored for protein property prediction, even though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with state-of-the-art sequence-based methods, while using much less data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.
Data-driven predictive methods that can efficiently and accurately transform protein sequences into biologically active structures are highly valuable for scientific research and therapeutic development. Determining accurate folding landscapes using co-evolutionary information is fundamental to the success of modern protein structure prediction methods. As the current state of the art, AlphaFold2 has dramatically raised accuracy without performing explicit co-evolutionary analysis. Nevertheless, its performance still shows a strong dependence on available sequence homologs. We investigated the cause of this dependence and present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 on targets with poor MSAs. EvoGen allows us to manipulate the folding landscape either by denoising the searched MSAs or by generating virtual MSAs, and helps AlphaFold2 fold accurately in the low-data regime, even achieving encouraging performance with single-sequence prediction. Being able to make accurate predictions with few MSAs not only generalizes AlphaFold2 better to orphan sequences, but also democratizes its use in high-throughput applications. Furthermore, EvoGen combined with AlphaFold2 yields a probabilistic structure generation method that can explore alternative conformations of protein sequences, and the task-aware differentiable algorithm for sequence generation will benefit other related tasks, including protein design.
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.
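The product-of-experts construction above amounts to summing weighted expert log-scores and running Metropolis-Hastings over single-site mutations. A minimal sketch under those assumptions, with toy expert functions and plain random proposals where the paper's sampler uses gradient-informed ones:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def product_of_experts_logp(seq, experts, weights):
    """Unnormalized log-probability: a weighted sum of expert scores.
    Experts can be unsupervised likelihoods or supervised fitness models."""
    return sum(w * e(seq) for e, w in zip(experts, weights))

def mcmc_step(seq, experts, weights, rng):
    """One Metropolis-Hastings step over single-site substitutions: propose
    a random mutation and accept with probability min(1, exp(delta))."""
    pos = int(rng.integers(len(seq)))
    proposal = seq[:pos] + str(rng.choice(list(AMINO_ACIDS))) + seq[pos + 1:]
    delta = (product_of_experts_logp(proposal, experts, weights)
             - product_of_experts_logp(seq, experts, weights))
    return proposal if np.log(rng.random()) < delta else seq
```

Because the experts only ever appear inside a log-probability sum, models can be mixed and matched without retraining, which is the point of the framework; the gradient-based proposal distribution is what makes the paper's sampler fast relative to this random-walk version.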
Computational protein design, i.e., inferring novel and diverse protein sequences consistent with a given structure, remains a major unsolved challenge. Recently, deep generative models that learn from sequences alone, or from sequences and structures jointly, have shown impressive performance on this task. However, these models appear limited in modeling structural constraints, in capturing sufficient sequence diversity, or both. Here we consider three recently proposed deep generative frameworks for protein design: (AR) a sequence-based autoregressive generative model; (GVP) a graph neural network based on precise structure; and Fold2Seq, which leverages a fuzzy, scale-free representation of the three-dimensional fold while enforcing structure-to-sequence (and vice versa) consistency. We benchmark these models on the task of computational design of antibody sequences, which requires designing sequences with high diversity for functional implication. The Fold2Seq framework outperforms the two other baselines in terms of diversity of the designed sequences, while maintaining the typical fold.
In response to pathogens, the adaptive immune system generates specific antibodies that bind and neutralize foreign antigens. Understanding the composition of an individual's immune repertoire can provide insights into this process and reveal potential therapeutic antibodies. In this work, we explore the application of antibody-specific language models to aid understanding of immune repertoires. We introduce AntiBERTy, a language model trained on 558M natural antibody sequences. We find that within repertoires, our model clusters antibodies into trajectories resembling affinity maturation. Importantly, we show that models trained to predict highly redundant sequences under a multiple-instance-learning framework identify key binding residues in the process. With further development, the methods presented here will provide new insights into antigen binding from repertoire sequences alone.
Motivation: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabeled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e., warm start) targeted molecule generation models. We investigate two warm-start strategies: (i) a one-stage strategy, in which the initialized model is trained directly on targeted molecule generation, and (ii) a two-stage strategy, containing a pre-finetuning on molecule generation followed by target-specific training. We also compare two decoding strategies for generating compounds: beam search and sampling. Results: The results show that warm-started models outperform a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results to each other with respect to widely-used benchmark metrics. However, docking evaluation of the compounds generated for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality. Availability and implementation: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and materials are archived on Zenodo at https://doi.org/10.5281/zenodo.6832145
Self-supervised neural language models have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's Evoformer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple and universal combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer rather than inferred Potts models.
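The reported correlation can be probed directly: compute pairwise Hamming distances over an MSA and correlate their off-diagonal entries with any attention-derived pairwise matrix. A sketch of that comparison (the function names are illustrative, and the attention matrix would come from a real model such as MSA Transformer):

```python
import numpy as np

def hamming_matrix(msa):
    """Pairwise Hamming distances between aligned sequences of an MSA."""
    arr = np.array([list(s) for s in msa])
    return (arr[:, None, :] != arr[None, :, :]).sum(axis=-1)

def offdiag_correlation(dist, attn_matrix):
    """Pearson correlation between the off-diagonal entries of two square
    matrices, e.g. Hamming distances vs. combined column attentions."""
    mask = ~np.eye(dist.shape[0], dtype=bool)
    return np.corrcoef(dist[mask], attn_matrix[mask])[0, 1]
```

A strong correlation between the two matrices is what licenses the paper's claim that column attentions encode phylogeny.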
We are now witnessing significant progress by deep learning methods on a variety of protein tasks and datasets. However, there is no standard benchmark for evaluating the performance of different methods, which hinders progress in deep learning for this field. In this paper, we propose such a benchmark, called PEER, a comprehensive and multi-task benchmark for protein sequence understanding. PEER provides a set of diverse protein understanding tasks, including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction. For each task we evaluate different types of sequence-based methods, including traditional feature engineering approaches, different sequence encoding methods, and large-scale pre-trained protein language models. In addition, we investigate the performance of these methods under the multi-task learning setting. Experimental results show that large-scale pre-trained protein language models achieve the best performance on most individual tasks, and that jointly training multiple tasks further boosts performance. The datasets and source code of this benchmark are available at https://github.com/DeepGraphLearning/PEER_Benchmark
Computational methods that operate on three-dimensional molecular structure have the potential to solve important questions in biology and chemistry. In particular, deep neural networks have gained significant attention, but their widespread adoption in the biomolecular domain has been limited by the lack of systematic performance benchmarks or a unified toolkit for interacting with molecular data. To address this, we present ATOM3D, a collection of both novel and existing benchmark datasets spanning several key classes of biomolecules. We implement several classes of three-dimensional molecular learning methods for each of these tasks and show that they consistently improve performance relative to methods based on one- and two-dimensional representations. The specific choice of architecture proves critical for performance: three-dimensional convolutional networks excel at tasks involving complex geometries, graph networks perform well on systems requiring detailed positional information, and the more recently developed equivariant networks show significant promise. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning, and that there is potential for improvement on many tasks that remain underexplored. To lower the barrier to entry and facilitate further development of the field, we also provide a comprehensive suite of tools for dataset processing, model training, and evaluation in our open-source ATOM3D Python package. All datasets can be downloaded from https://www.atom3d.ai
Despite significant progress of generative models in the natural sciences, their controllability remains challenging. One fundamentally missing aspect of molecular or protein generative models is an inductive bias that can reflect continuous properties of interest. To that end, we propose the Regression Transformer (RT), a novel method that abstracts regression as a conditional sequence modeling problem. This introduces a new paradigm of multitask language models which seamlessly bridge sequence regression and conditional sequence generation. We thoroughly demonstrate that, despite using a nominal-scale training objective, the RT matches or surpasses the performance of conventional regression models in property prediction tasks of small molecules, proteins and chemical reactions. Critically, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a substructure-constrained, property-driven molecule generation benchmark. Our dichotomous approach is facilitated by a novel, alternating training scheme that enables the model to decorate seed sequences by desired properties, e.g., to optimize reaction yield. In sum, the RT is the first report of a multitask model that concurrently excels at predictive and generative tasks in biochemistry. This finds particular application in property-driven, local exploration of the chemical or protein space and could pave the road toward foundation models in material design. The code to reproduce all experiments of the paper is available at: https://github.com/IBM/regression-transformer
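The RT's bridge between regression and generation rests on tokenizing continuous properties so a language model can both read and emit them. Below is a sketch of one such digit-with-decimal-place encoding; the RT's actual tokenizer and numerical encodings differ in detail, and `tokenize_property` is an illustrative name:

```python
def tokenize_property(name, value, precision=3):
    """Encode a continuous property as tokens: a property tag followed by
    one token per digit, each tagged with its decimal place, so 0.85
    becomes ['<name>', '_0_0', '_8_-1', '_5_-2']."""
    text = f"{value:.{precision}f}"
    tokens = [f"<{name}>"]
    place = text.index(".") - 1  # decimal place of the leading digit
    for ch in text:
        if ch == ".":
            continue
        tokens.append(f"_{ch}_{place}")
        place -= 1
    return tokens
```

Prepending such tokens to a seed sequence is how a single model can be "primed" with a target property for conditional generation, or asked to fill them in for regression.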
AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These advanced pipelines mainly rely on multiple sequence alignments (MSAs) and templates as inputs to learn co-evolution information from homologous sequences. Nonetheless, searching for MSAs and templates in protein databases is time-consuming, usually taking dozens of minutes. Consequently, we attempt to explore the limits of fast protein structure prediction using only primary sequences. HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2. Our proposed method first pre-trains a large-scale protein language model (PLM) on thousands of millions of primary sequences using a self-supervised learning paradigm; this PLM serves as an alternative to MSAs and templates for learning co-evolution information. Then, by combining the pre-trained PLM with the essential components of AlphaFold2, we obtain an end-to-end differentiable model that predicts the 3D coordinates of atoms from the primary sequence alone. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving accuracy competitive with MSA-based methods on targets with large homologous families. Furthermore, HelixFold-Single consumes far less time than mainstream protein structure prediction pipelines, demonstrating its potential for tasks requiring many predictions. The code of HelixFold-Single is available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein-single/forecast.
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
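CLIP's pre-training task reduces to a symmetric cross-entropy over the image-text similarity matrix, with matching pairs on the diagonal. A minimal numpy sketch of that objective (a standard InfoNCE formulation, not OpenAI's exact implementation, which also learns the temperature):

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive objective: L2-normalize both embedding sets,
    score all pairs by scaled cosine similarity, and take cross-entropy
    against the diagonal (matching pairs) in both directions."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]

    def xent(l):  # cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return (xent(logits) + xent(logits.T)) / 2
```

Zero-shot classification then falls out for free: embed the class names as text, and the image-text similarity row becomes a classifier.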
In the field of antibody engineering, an essential task is to design a novel antibody whose paratopes bind to a specific antigen with correct epitopes. Understanding antibody structure and its paratope can facilitate a mechanistic understanding of its function. Therefore, antibody structure prediction from sequence alone has always been a highly valuable problem for de novo antibody design. AlphaFold2, a breakthrough in the field of structural biology, provides a solution to predict protein structure based on protein sequences and computationally expensive coevolutionary multiple sequence alignments (MSAs). However, its limited computational efficiency and unsatisfactory prediction accuracy on antibodies, especially on the complementarity-determining regions (CDRs), restrict its application in industrial high-throughput drug design. To learn an informative representation of antibodies, we employed a deep antibody language model (ALM) on curated sequences from the Observed Antibody Space database via a transformer model. We also developed a novel model named xTrimoABFold to predict antibody structure from antibody sequence based on the pretrained ALM as well as efficient evoformers and structural modules. The model was trained end-to-end on the antibody structures in PDB by minimizing the ensemble loss of domain-specific focal loss on CDRs and the frame-aligned point loss. xTrimoABFold outperforms AlphaFold2 and other protein-language-model-based SOTAs, e.g., OmegaFold, HelixFold-Single, and IgFold, by a significant margin (over 30% improvement in RMSD) while running 151 times faster than AlphaFold2. To the best of our knowledge, xTrimoABFold achieves state-of-the-art antibody structure prediction. Its improvement in both accuracy and efficiency makes it a valuable tool for de novo antibody design and could bring further improvements to immuno-theory.
Proteins play a central role in biology from immune recognition to brain activity. While major advances in machine learning have improved our ability to predict protein structure from sequence, determining protein function from structure remains a major challenge. Here, we introduce Holographic Convolutional Neural Network (H-CNN) for proteins, which is a physically motivated machine learning approach to model amino acid preferences in protein structures. H-CNN reflects physical interactions in a protein structure and recapitulates the functional information stored in evolutionary data. H-CNN accurately predicts the impact of mutations on protein function, including stability and binding of protein complexes. Our interpretable computational model for protein structure-function maps could guide design of novel proteins with desired function.