Task-oriented dialogue (TOD) systems have been applied in a range of domains to support human users to achieve specific goals. Systems are typically constructed for a single domain or language and do not generalise well beyond this. Their extension to other languages in particular is restricted by the lack of available training data for many of the world's languages. To support work on Natural Language Understanding (NLU) in TOD across multiple languages and domains simultaneously, we constructed MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset. MULTI3NLU++ extends the English-only NLU++ dataset to include manual translations into a range of high, medium and low resource languages (Spanish, Marathi, Turkish and Amharic), in two domains (banking and hotels). MULTI3NLU++ inherits the multi-intent property of NLU++, where an utterance may be labelled with multiple intents, providing a more realistic representation of a user's goals and aligning with the more complex tasks that commercial systems aim to model. We use MULTI3NLU++ to benchmark state-of-the-art multilingual language models as well as Machine Translation and Question Answering systems for the NLU task of intent detection for TOD systems in the multilingual setting. The results demonstrate the challenging nature of the dataset, particularly in the low-resource language setting.
translated by 谷歌翻译
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
translated by 谷歌翻译
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
translated by 谷歌翻译
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
translated by 谷歌翻译
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment following meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
translated by 谷歌翻译
“感应头”是注意力头,它实现了一种简单的算法来完成令牌序列,例如[a] [b] ... [a] - > [b]。在这项工作中,我们提供了一个假设的初步和间接证据,即诱导头可能构成大型大型变压器模型中所有“文本学习”中大多数的机制(即减少在增加代币指数时损失的损失)。我们发现,诱导头在与秘密学习能力突然急剧上的急剧上升的位置完全相同,这是训练损失的颠簸。我们提出了六种互补的证据,认为诱导头可能是任何大小的变压器模型中一般性内部学习的机理来源。对于仅关注的小型模型,我们提供了有力的因果证据。对于具有MLP的较大模型,我们提供相关证据。
translated by 谷歌翻译
我们研究语言模型是否可以评估自己主张的有效性,并预测他们能够正确回答的问题。我们首先表明,当以正确的格式提供时,较大的模型在多样化的多项选择和True/False问题上进行了很好的校准。因此,我们可以通过要求模型首先提出答案,然后评估其答案正确的概率“ p(true)”来对开放式采样任务进行自我评估。我们发现在各种任务中,P(true)的表现,校准和缩放令人鼓舞。当我们允许模型考虑自己的许多样本之前,在预测一种特定可能性的有效性之前,自我评估的性能进一步改善。接下来,我们研究是否可以培训模型来预测“ P(ik)”,即“我知道”问题的概率,而无需参考任何特定提出的答案。模型在预测P(IK)方面表现良好,并且在跨任务中部分概括,尽管它们在新任务上的P(IK)校准方面遇到了困难。预测的p(IK)概率在存在相关的原始材料的情况下以及对数学单词问题解决方案的提示也适当增加。我们希望这些观察结果为培训更诚实的模型提供了基础,并研究了诚实对模型模仿人类写作以外的其他目标培训的案例的普遍性。
translated by 谷歌翻译
本文介绍了用于持久图计算的有效算法,给定一个输入分段线性标量字段f在D上定义的d二维简单复杂k,并带有$ d \ leq 3 $。我们的方法通过引入三个主要加速度来扩展开创性的“ Paircells”算法。首先,我们在离散摩尔斯理论的设置中表达了该算法,该算法大大减少了要考虑的输入简单数量。其次,我们介绍了问题的分层方法,我们称之为“夹心”。具体而言,minima-saddle持久性对($ d_0(f)$)和鞍 - 最大持久对($ d_ {d-1}(f)$)是通过与Union-Find-Find-Find-Find-Find-Find-Find-Find-find-find-find-find-find-find-find-find-find-find-find-find-find of nourstable组的1个有效计算的。 - addles和(D-1)addles的稳定集。尺寸为0和(D-1)的快速处理进一步减少,并且大幅度降低了$ d_1(f)$,即三明治的中间层的计算$ d_1(f)$的关键简单数量。第三,我们通过共享记忆并行性记录了几个绩效改进。我们为可重复性目的提供了算法的开源实施。我们还贡献了一个可重复的基准软件包,该基准软件包利用了公共存储库中的三维数据,并将我们的算法与各种公开可用的实现进行了比较。广泛的实验表明,我们的算法提高了两个数量级,即它扩展的开创性“ Paircells”算法的时间性能。此外,它还改善了14种竞争方法的选择,改善了记忆足迹和时间性能,比最快的可用方法具有可观的增长,同时产生了严格的输出。我们通过应用于表面,音量数据和高维点云的持续性一维发电机的快速和稳健提取的应用来说明我们的贡献实用性。
translated by 谷歌翻译
随着深度神经网络(DNN)的发展以解决日益复杂的问题,它们正受到现有数字处理器的延迟和功耗的限制。为了提高速度和能源效率,已经提出了专门的模拟光学和电子硬件,但是可扩展性有限(输入矢量长度$ k $的数百个元素)。在这里,我们提出了一个可扩展的,单层模拟光学处理器,该光学处理器使用自由空间光学器件可重新配置输入向量和集成的光电,用于静态,可更新的加权和非线性 - 具有$ k \ \ 1,000 $和大约1,000美元和超过。我们通过实验测试MNIST手写数字数据集的分类精度,在没有数据预处理或在硬件上进行数据重新处理的情况下达到94.7%(地面真相96.3%)。我们还确定吞吐量($ \ sim $ 0.9 examac/s)的基本上限,由最大光带宽设置,然后大大增加误差。我们在兼容CMOS兼容系统中宽光谱和空间带宽的组合可以实现下一代DNN的高效计算。
translated by 谷歌翻译
文献中的最新结果表明,经过分类训练的神经网络的倒数第二层(倒数第二层)表示,展示了一种称为神经崩溃的聚类特性(NC)。我们研究训练深神经网络时,随机梯度下降(SGD)的隐式偏见,有利于低深度溶液。我们表征了有效深度的概念,该概念测量了使用最近级中心分类器可分离样品嵌入的第一层。此外,我们假设和经验表明,SGD隐含地选择了小有效深度的神经网络。其次,尽管即使不可能进行概括,但神经崩溃也会出现 - 我们认为,中间层中的\ emph {可分离性}与概括有关。我们得出了一个基于将网络的有效深度与与部分损坏的标签相同的数据集进行比较最小深度的限制。值得注意的是,这种结合提供了对测试性能的非平凡估计。最后,我们从经验上表明,在增加数据中随机标签的数量时,受过训练的神经网络的有效深度会单调增加。
translated by 谷歌翻译