Differentiable Search Indices (DSIs) encode a corpus of documents in the parameters of a model and use the same model to map queries directly to relevant document identifiers. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12\%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting by a significant margin. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
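To make the generative-memory idea concrete, below is a minimal sketch of pseudo-query replay during continual indexing. It assumes a query-generation model exposing a `sample` method and dictionaries mapping document identifiers to text; these names and interfaces are illustrative only, not the paper's actual implementation.

```python
# Minimal sketch of generative-memory replay for continual indexing
# (illustrative names, not the paper's API). A query-generation model samples
# pseudo-queries for already-indexed documents, and the resulting
# (pseudo-query -> docid) pairs are mixed into batches used to index new docs.
import random

def build_memory(query_generator, indexed_docs, samples_per_doc=3):
    """Sample pseudo-queries for every previously indexed document."""
    memory = []
    for docid, text in indexed_docs.items():
        for query in query_generator.sample(text, n=samples_per_doc):
            memory.append((query, docid))  # retrieval-style (query -> docid) pair
    return memory

def continual_index_batches(new_docs, memory, batch_size=32, replay_fraction=0.5):
    """Yield batches mixing new indexing examples with replayed pseudo-queries."""
    new_examples = [(text, docid) for docid, text in new_docs.items()]
    random.shuffle(new_examples)
    n_replay = int(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    for start in range(0, len(new_examples), n_new):
        batch = new_examples[start:start + n_new]
        batch += random.sample(memory, min(n_replay, len(memory)))
        random.shuffle(batch)
        yield batch  # each item: (input text or pseudo-query, target docid)
```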
Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
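As a rough illustration of the upcycling step (not the paper's actual code), the sketch below converts a dense checkpoint into a Mixture-of-Experts model by copying attention weights one-to-one, duplicating each converted MLP into every expert, and initializing only the router from scratch; the checkpoint layout and hyperparameters are assumptions.

```python
# Sketch of "upcycling" a dense checkpoint into a Mixture-of-Experts model
# (illustrative structure, not a real library API).
import copy
import numpy as np

def upcycle(dense_ckpt, num_experts=8, moe_every=2, d_model=512):
    """dense_ckpt: {layer_idx: {'attention': ..., 'mlp': ...}} -> MoE checkpoint."""
    sparse_ckpt = {}
    for layer_idx, params in dense_ckpt.items():
        layer = {'attention': copy.deepcopy(params['attention'])}
        if layer_idx % moe_every == 1:  # replace every other MLP with an MoE layer
            # Every expert starts as an exact copy of the dense MLP.
            layer['experts'] = [copy.deepcopy(params['mlp'])
                                for _ in range(num_experts)]
            # The router has no dense counterpart, so it is freshly initialized.
            layer['router'] = np.random.normal(0.0, 0.02, (d_model, num_experts))
        else:
            layer['mlp'] = copy.deepcopy(params['mlp'])
        sparse_ckpt[layer_idx] = layer
    return sparse_ckpt
```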
Although scaling language models improves performance on a range of tasks, there are apparently some scenarios where scaling hurts performance. For instance, the Inverse Scaling Prize Round 1 identified four "inverse scaling" tasks, for which performance gets worse for larger models. These tasks were evaluated on models of up to 280B parameters, trained on up to 500 zettaFLOPs of compute. This paper takes a closer look at these four tasks. We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize. With this increased range of model sizes and training compute, three out of the four tasks exhibit what we call "U-shaped scaling" -- performance decreases up to a certain model size, and then increases again up to the largest model evaluated. One hypothesis is that U-shaped scaling occurs when a task comprises a "true task" and a "distractor task". Medium-size models can do the distractor task, which hurts performance, while only large-enough models can ignore the distractor task and do the true task. The existence of U-shaped scaling implies that inverse scaling may not hold for larger models. Second, we evaluate the inverse scaling tasks using chain-of-thought (CoT) prompting, in addition to basic prompting without CoT. With CoT prompting, all four tasks show either U-shaped scaling or positive scaling, achieving perfect solve rates on two tasks and several sub-tasks. This suggests that the term "inverse scaling task" is under-specified -- a given task may be inverse scaling for one prompt but positive or U-shaped scaling for a different prompt.
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
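For readers unfamiliar with instruction finetuning, the toy sketch below shows one way a supervised example could be rendered as an instruction-style training pair, with an optional chain-of-thought rationale. The template wording and field names are made up for illustration and do not reflect the actual FLAN templates.

```python
# Toy illustration of phrasing a dataset example as an instruction, with an
# optional chain-of-thought rationale (hypothetical template and fields).
def format_example(instruction, inputs, target, rationale=None):
    """Render one supervised example as an instruction-style (prompt, answer) pair."""
    prompt = f"{instruction}\n\n{inputs}\n\nAnswer:"
    if rationale is not None:  # chain-of-thought variant: include the reasoning
        answer = f"{rationale} So the answer is {target}."
    else:
        answer = target
    return prompt, answer

# Hypothetical usage:
prompt, answer = format_example(
    instruction="Answer the following yes/no question.",
    inputs="Can a crocodile run a marathon?",
    target="no",
    rationale="Crocodiles cannot sustain running over long distances.",
)
```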
The scaling properties of Transformer models have attracted considerable interest. However, not much has been done to investigate how different inductive biases and model architectures affect scaling behavior. Do model architectures scale differently? If so, how does inductive bias affect scaling behavior? And how does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behavior of ten diverse model architectures, such as Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and the recently proposed MLP-Mixers. Through extensive experiments, we show that (1) architecture is indeed an important consideration when scaling, and (2) the best-performing model can fluctuate across scales. We believe the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.
Recent advances in Transformer-based large language models (LLMs) have led to performance improvements on many tasks. These gains come with a drastic increase in model size, potentially leading to slow and costly use at inference time. In practice, however, the generations produced by LLMs vary in difficulty: while some predictions truly benefit from the model's full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and per generation timestep. Early-exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local, per-token exit decisions; and (3) attending back to hidden representations that are missing due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the effectiveness of the framework in reducing compute (a potential speedup of up to $\times 3$) while maintaining high performance.
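A minimal sketch of the early-exit idea follows: after each decoder layer, a per-token confidence score is compared against a threshold, and the token is emitted at the first layer that clears it. The top-1 probability confidence and the fixed threshold are placeholder choices; CALM's calibrated, sequence-level procedure is omitted.

```python
# Sketch of confidence-based early-exit decoding for a single token
# (placeholder confidence measure and threshold, not CALM's calibrated scheme).
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def decode_token_with_early_exit(hidden, layers, output_head, threshold=0.9):
    """Generate one token, exiting at the first layer whose confidence clears the threshold.

    hidden:       current hidden state for the token being generated
    layers:       list of callables mapping hidden -> hidden (decoder layers)
    output_head:  callable mapping hidden -> vocabulary logits
    """
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(output_head(hidden))
        confidence = probs.max()  # placeholder: top-1 probability as the confidence
        if confidence >= threshold:
            return int(probs.argmax()), depth  # early exit at this depth
    return int(probs.argmax()), len(layers)    # fell through to the full model
```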
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Emergent abilities therefore cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system that enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the prior state of the art on a suite of multi-step reasoning tasks and exceeding average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks show discontinuous improvements with model scale, meaning that performance increases steeply as we scale to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training-data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and potential mitigation strategies.
Prompt-tuning has become a new paradigm for tuning pretrained language models in a parameter-efficient way. Here we explore the use of HyperNetworks to generate hyper-prompts: we propose HyperPrompt, a novel architecture for prompt-based task conditioning of self-attention in Transformers. The hyper-prompts are learned end-to-end via generation by a HyperNetwork. HyperPrompt allows the network to learn task-specific feature maps in which the hyper-prompts serve as task-global memories for the queries to attend to, while enabling flexible information sharing among tasks. We show that HyperPrompt is competitive against strong multi-task learning baselines with as few as $0.14\%$ additional task-conditioning parameters, achieving great parameter and computational efficiency. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performance over strong T5 multi-task learning baselines and parameter-efficient adapter variants, including Prompt-Tuning and HyperFormer++, on the GLUE and SuperGLUE natural language understanding benchmarks across many model sizes.
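The sketch below illustrates the general shape of this idea, assuming a tiny HyperNetwork that maps a task embedding to per-layer key/value prompts which are prepended in self-attention. The shapes, random projections, and function names are assumptions for illustration, not the paper's architecture.

```python
# Rough sketch of hypernetwork-generated prompts for task-conditioned
# self-attention (shapes and names are assumptions, not the paper's design).
import numpy as np

rng = np.random.default_rng(0)

def hyper_prompts(task_embedding, d_model=64, prompt_len=4):
    """Map a task embedding to key/value prompt matrices for one attention layer."""
    d_task = task_embedding.shape[0]
    # Learned projections in a real model; random here purely for illustration.
    w_k = rng.normal(0.0, 0.02, (d_task, prompt_len * d_model))
    w_v = rng.normal(0.0, 0.02, (d_task, prompt_len * d_model))
    prompt_k = (task_embedding @ w_k).reshape(prompt_len, d_model)
    prompt_v = (task_embedding @ w_v).reshape(prompt_len, d_model)
    return prompt_k, prompt_v

def attention_with_hyper_prompt(q, k, v, prompt_k, prompt_v):
    """Standard single-head attention with hyper-prompts prepended to keys/values."""
    k = np.concatenate([prompt_k, k], axis=0)
    v = np.concatenate([prompt_v, v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```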
Can we train a single transformer model capable of processing multiple modalities and datasets while sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio, and video that answers this question. By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task and achieve state-of-the-art results on 5 standard video and audio classification datasets. Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient and learns representations that generalize across multiple domains. Moreover, we show that co-training is simple and practical to implement, as we do not need to tune hyperparameters for each combination of datasets but can simply adapt those from standard single-task training.
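A minimal sketch of the co-training loop follows, under the assumption of one shared encoder, per-task heads, and a uniform task-sampling schedule; all interfaces here (encoder, heads, update function, loaders) are abstract placeholders rather than the actual PolyViT training code.

```python
# Sketch of co-training one shared encoder across tasks/modalities by sampling
# a task per step (abstract placeholder interfaces, simplified schedule).
import random

def cotrain(shared_encoder, task_heads, task_loaders, update_fn, steps=1000):
    """Co-train one shared encoder across tasks by sampling a task each step.

    shared_encoder: callable mapping a batch to features (shared parameters)
    task_heads:     dict of task name -> object with a .loss(features, labels) method
    task_loaders:   dict of task name -> iterator yielding (batch, labels)
    update_fn:      callable applying a gradient update given a loss value
    """
    tasks = list(task_loaders)
    for _ in range(steps):
        task = random.choice(tasks)          # could also be weighted by dataset size
        batch, labels = next(task_loaders[task])
        features = shared_encoder(batch)     # shared trunk
        loss = task_heads[task].loss(features, labels)  # task-specific head
        update_fn(loss)                      # updates only this task's head + trunk
```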