Retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LM) and retrieval models (RM). Existing work has combined these in simple "retrieve-then-read" pipelines in which the RM retrieves passages that are inserted into the LM prompt. To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate-Search-Predict (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably. We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37-200%, 8-40%, and 80-290% relative gains against vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.
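To make the contrast with retrieve-then-read concrete, here is a minimal sketch of how a DSP-style program might compose a frozen LM and RM. The `lm` and `rm` callables and the prompt templates are hypothetical stand-ins for illustration, not the framework's actual API.

```python
# Minimal sketch (not the authors' API): a DSP-style multi-hop program that
# lets the LM write intermediate search queries and the RM ground each hop.
from typing import Callable, List

def retrieve_then_read(question: str,
                       lm: Callable[[str], str],
                       rm: Callable[[str, int], List[str]]) -> str:
    """Baseline pipeline: one retrieval step, one generation step."""
    passages = rm(question, 3)
    prompt = "\n".join(passages) + f"\nQuestion: {question}\nAnswer:"
    return lm(prompt)

def dsp_multihop(question: str,
                 lm: Callable[[str], str],
                 rm: Callable[[str, int], List[str]],
                 hops: int = 2) -> str:
    """DSP-style decomposition: the LM proposes the next search query at each
    hop, and the final prediction conditions on all retrieved context."""
    context: List[str] = []
    query = question
    for _ in range(hops):
        context.extend(rm(query, 2))
        # Ask the LM for the next search query given what we know so far.
        query = lm("\n".join(context) +
                   f"\nQuestion: {question}\nNext search query:")
    return lm("\n".join(context) + f"\nQuestion: {question}\nAnswer:")

if __name__ == "__main__":
    # Toy stubs so the sketch runs end to end.
    lm = lambda prompt: "stub completion"
    rm = lambda query, k: [f"passage about '{query}' ({i})" for i in range(k)]
    print(dsp_multihop("Who wrote the book that inspired Blade Runner?", lm, rm))
```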
How do we design measures of social bias that we trust? While prior work has introduced several measures, no measure has gained widespread trust: instead, mounting evidence argues we should distrust these measures. In this work, we design bias measures that warrant trust based on the cross-disciplinary theory of measurement modeling. To combat the frequently fuzzy treatment of social bias in NLP, we explicitly define social bias, grounded in principles drawn from social science research. We operationalize our definition by proposing a general bias measurement framework DivDist, which we use to instantiate 5 concrete bias measures. To validate our measures, we propose a rigorous testing protocol with 8 testing criteria (e.g. predictive validity: do measures predict biases in US employment?). Through our testing, we demonstrate considerable evidence to trust our measures, showing they overcome conceptual, technical, and empirical deficiencies present in prior measures.
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
Despite the central role that melody plays in music perception, it remains an open challenge in music information retrieval to reliably detect the notes of the melody present in an arbitrary music recording. A key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles - existing strategies work well for some melody instruments or styles but not all. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio, thereby improving performance on melody transcription by 20% relative to conventional spectrogram features. Another obstacle in melody transcription is a lack of training data - we derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music. The combination of generative pre-training and a new dataset for this task results in 77% stronger performance on melody transcription relative to the strongest available baseline. By pairing our new melody transcription approach with solutions for beat detection, key estimation, and chord recognition, we build Sheet Sage, a system capable of transcribing human-readable lead sheets directly from music audio. Audio examples can be found at https://chrisdonahue.com/sheetsage and code at https://github.com/chrisdonahue/sheetsage.
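Schematically, the recipe pairs frozen features from a pretrained generative audio model with a small trained transcription head. The sketch below is an illustrative reconstruction, not the Sheet Sage implementation; the feature dimension, head design, and the idea of a separate feature-extraction step are assumptions.

```python
# Illustrative sketch: framewise melody prediction on top of frozen features
# pulled from a pretrained generative audio model (e.g., Jukebox activations).
import torch
import torch.nn as nn

N_PITCHES = 129  # 128 MIDI pitches + 1 "no melody note" class (assumed labeling)

class MelodyHead(nn.Module):
    """Lightweight transcription head trained over frozen generative features."""
    def __init__(self, feature_dim: int = 4800):  # feature width is an assumption
        super().__init__()
        self.proj = nn.Linear(feature_dim, N_PITCHES)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (frames, feature_dim) -> per-frame pitch logits
        return self.proj(feats)

# Training idea: cross-entropy between per-frame logits and the crowdsourced
# melody annotations, with the feature extractor kept frozen throughout.
head = MelodyHead()
logits = head(torch.randn(100, 4800))  # 100 audio frames of stand-in features
```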
Despite the empirical success of self-supervised learning (SSL) methods, it remains unclear which characteristics of their representations lead to high downstream accuracy. In this work, we characterize the properties that SSL representations should satisfy. Specifically, we prove necessary and sufficient conditions such that, for any task invariant to the given data augmentations, desired probes (e.g., linear or MLP) trained on the representation attain perfect accuracy. These requirements lead to a unifying conceptual framework for improving existing SSL methods and deriving new ones. For contrastive learning, our framework prescribes simple but significant improvements to previous methods, such as using asymmetric projection heads. For non-contrastive learning, we use the framework to derive a simple and novel objective. Our resulting SSL algorithms outperform baselines on standard benchmarks, including SwAV+multicrop on ImageNet linear probing.
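As an illustration of the asymmetric-projection-head prescription for contrastive learning, here is a toy PyTorch sketch; the encoder, dimensions, temperature, and loss details are our assumptions, not the paper's implementation.

```python
# Toy sketch of contrastive learning with asymmetric projection heads: the two
# augmented views are projected by heads of different capacity before InfoNCE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricContrastive(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 128, temp: float = 0.1):
        super().__init__()
        self.encoder = encoder
        self.temp = temp
        # Asymmetry: a deep MLP head on one branch, a linear head on the other.
        self.head_a = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                                    nn.Linear(512, dim))
        self.head_b = nn.Linear(512, dim)

    def forward(self, view1: torch.Tensor, view2: torch.Tensor) -> torch.Tensor:
        za = F.normalize(self.head_a(self.encoder(view1)), dim=-1)
        zb = F.normalize(self.head_b(self.encoder(view2)), dim=-1)
        logits = za @ zb.t() / self.temp       # (N, N) similarity matrix
        labels = torch.arange(za.size(0))      # positives lie on the diagonal
        return F.cross_entropy(logits, labels)

# Example: a toy encoder mapping 32x32 RGB images to 512-d features.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
model = AsymmetricContrastive(encoder)
loss = model(torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32))
```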
In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and to generate the corresponding output. Crucially, in-context learning happens only at inference time, without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between the tasks on which it succeeds and what is present in the training data. To make progress toward understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions: the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least-squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the model's training data and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that Transformers can be trained to in-context learn more complex function classes, namely sparse linear functions, two-layer neural networks, and decision trees, with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at https://github.com/dtsip/in-context-learning.
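The training setup lends itself to a compact sketch: sample prompts from random linear functions and compare against the least-squares estimator the paper uses as a reference. The dimensions and sampling choices below are illustrative, not the paper's exact configuration.

```python
# Sketch of the data and baseline: each prompt is (x1, f(x1), ..., xk, f(xk),
# x_query) with f a random linear function; least squares is the reference.
import numpy as np

def sample_prompt(d: int = 20, k: int = 40, seed: int = 0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)            # random linear function f(x) = w . x
    xs = rng.standard_normal((k + 1, d))  # k in-context examples + 1 query
    ys = xs @ w
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def least_squares_predict(xs, ys, x_query):
    """Optimal baseline: fit w by least squares on the in-context examples."""
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return x_query @ w_hat

xs, ys, xq, yq = sample_prompt()
print(abs(least_squares_predict(xs, ys, xq) - yq))  # ~0 once k >= d (noiseless)
```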
We often see undesirable tradeoffs in robust machine learning, where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy: a robust classifier obtained via specialized techniques such as removing spurious features often has better OOD accuracy but worse ID accuracy than a standard classifier trained via ERM. In this paper, we find that ID-calibrated ensembles, which simply ensemble the standard and robust models after calibrating on ID data only, outperform prior approaches on both ID and OOD accuracy. On eleven natural distribution shift datasets, ID-calibrated ensembles obtain the best of both worlds: strong ID accuracy and strong OOD accuracy. We analyze this method in stylized settings and identify two important conditions for ensembles to perform well both ID and OOD: (1) the standard and robust models must be calibrated (on ID data, since OOD data is unavailable), and (2) OOD has no anticorrelated spurious features.
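The method is simple enough to sketch directly: temperature-scale each model's logits on held-out ID data, then average the calibrated probabilities. The scaffolding below assumes precomputed logits and is a sketch of the described procedure, not the authors' code.

```python
# Sketch of ID-calibrated ensembling: calibrate each model on ID validation
# data via temperature scaling, then average the calibrated probabilities.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Choose T minimizing negative log-likelihood on ID validation data."""
    def nll(t):
        probs = softmax(logits / t, axis=1)
        return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def id_calibrated_ensemble(std_logits, robust_logits,
                           val_std, val_robust, val_labels):
    t_std = fit_temperature(val_std, val_labels)     # calibrate on ID only
    t_rob = fit_temperature(val_robust, val_labels)  # (no OOD data available)
    probs = (softmax(std_logits / t_std, axis=1) +
             softmax(robust_logits / t_rob, axis=1)) / 2
    return probs.argmax(axis=1)

# Toy demo with random logits standing in for real model outputs.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=100)
std, rob = rng.normal(size=(100, 3)), rng.normal(size=(100, 3))
preds = id_calibrated_ensemble(std, rob, std, rob, labels)
```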
The development of CLIP [Radford et al., 2021] has sparked a debate about whether language supervision can yield vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of the two approaches in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training dataset meets certain criteria (it is sufficiently large and contains descriptive captions with low variability), image-only methods do not match CLIP's transfer performance, even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, where added supervision through captions is actually harmful. Motivated by our findings, we devise simple prescriptions that enable CLIP to better leverage the language information present in existing pre-training datasets.
Pre-training produces representations that are effective for a wide range of downstream tasks, but it remains unclear which properties of pre-training are necessary for these gains. Notably, recent work shows that even pre-training on synthetic tasks can achieve significant gains on downstream tasks. In this work, we perform three experiments that iteratively simplify pre-training and show that the simplifications still retain much of its benefit. First, building on prior work, we conduct a systematic evaluation of three existing synthetic pre-training methods on six downstream tasks. We find that the best synthetic pre-training method, LIME, attains an average of 67% of the benefit of natural pre-training. Second, to our surprise, we find that pre-training on a simple and generic synthetic task defined by the Set function achieves 65% of the benefit, almost matching LIME. Third, we find that 39% of the benefit can be attained by using only the parameter statistics of synthetic pre-training. We release the source code at https://github.com/felixzli/synthetic_pretraining.
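As a rough illustration of what a generic "Set" synthetic task could look like (the exact specification is the paper's; this is a guess for intuition), one can generate unlimited training pairs mapping a token sequence to its distinct elements, with no natural language involved:

```python
# Illustrative guess at a Set-style synthetic task: input is a random token
# sequence, target is its deduplicated contents; data is cheap and unlimited.
import random

def make_set_example(vocab_size: int = 100, length: int = 10):
    seq = [random.randrange(vocab_size) for _ in range(length)]
    target = sorted(set(seq))  # distinct tokens of the input (order assumed)
    return seq, target

src, tgt = make_set_example()
print(src, "->", tgt)  # e.g. [7, 7, 42, 3, ...] -> [3, 7, 42, ...]
```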
Scaling up language models has been shown to predictably improve performance and sample efficiency across a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.