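Sequence labeling is a fundamental NLP task that forms the backbone of many applications. Supervised learning of Seq2Seq models (such as T5) has been very successful on these problems. However, there remains a significant disconnect between the training objectives of these models and the metrics and desiderata we care about in practical applications. For example, a practical sequence labeling application may need to optimize for a certain precision-recall trade-off (of the top-$k$ predictions), which is quite different from the standard objective of maximizing the likelihood of the gold labeled sequence. To bridge this gap, we propose GROOT, a simple yet effective framework for generative reward optimization of text sequences. GROOT works by training a generative sequence labeling model to match the decoder output distribution with that of the (black-box) reward function. Using an iterative training regime, we first generate prediction candidates, then correct errors in them, and finally contrast those candidates (based on their reward values). As demonstrated by extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics. Furthermore, GROOT also improves the overall decoder distribution, as evidenced by the quality gains of the top-$k$ candidates.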
Text-to-text generation models have increasingly become the go-to solution for a wide variety of sequence labeling tasks (e.g., entity extraction and dialog slot filling). While most research has focused on the labeling accuracy, a key aspect -- of vital practical importance -- has slipped through the cracks: understanding model confidence. More specifically, we lack a principled understanding of how to reliably gauge the confidence of a model in its predictions for each labeled span. This paper aims to provide some empirical insights on estimating model confidence for generative sequence labeling. Most notably, we find that simply using the decoder's output probabilities is not the best in realizing well-calibrated confidence estimates. As verified over six public datasets of different tasks, we show that our proposed approach -- which leverages statistics from top-$k$ predictions by a beam search -- significantly reduces calibration errors of the predictions of a generative sequence labeling model.
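The abstract above does not say which top-$k$ statistics are used, so the following is only a minimal sketch of one plausible choice: the confidence of a labeled span is taken as the share of normalized beam probability assigned to hypotheses that contain that span. The function name `span_confidences` and the input layout (a list of `(spans, log_prob)` pairs) are assumptions made for illustration, not the paper's method or API.

```python
import math
from collections import defaultdict


def span_confidences(beams):
    """Estimate per-span confidence from top-k beam hypotheses.

    beams: list of (spans, log_prob) pairs, where spans is an iterable of
    hashable (text, label) tuples predicted by one hypothesis and log_prob
    is that hypothesis's sequence log-probability.
    This aggregation rule is an illustrative assumption.
    """
    # Renormalize hypothesis probabilities over the k beams (stable softmax).
    max_lp = max(lp for _, lp in beams)
    weights = [math.exp(lp - max_lp) for _, lp in beams]
    total = sum(weights)

    conf = defaultdict(float)
    for (spans, _), weight in zip(beams, weights):
        # Count each span at most once per hypothesis.
        for span in set(spans):
            conf[span] += weight / total
    return dict(conf)


# Hypothetical beam output for one input sentence.
example_beams = [
    ([("Paris", "LOC")], math.log(0.5)),
    ([("Paris", "LOC"), ("Bob", "PER")], math.log(0.3)),
    ([("Paris", "ORG")], math.log(0.2)),
]
example_conf = span_confidences(example_beams)
```

A span that appears in most high-probability hypotheses receives a confidence near 1, while a span found only in a low-ranked hypothesis receives a low score, even if the decoder was locally confident when it emitted it.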
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed, with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
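Annotated data is an essential ingredient in natural language processing for training and evaluating machine learning models. It is therefore highly desirable for annotations to be of high quality. Recent work, however, has shown that several popular datasets contain a surprising number of annotation errors or inconsistencies. To alleviate this issue, many methods for annotation error detection have been devised over the years. While researchers show that their approaches work well on their newly introduced datasets, they rarely compare their methods to previous work or on the same datasets. This raises strong concerns about the methods' general performance and makes it difficult to assess their strengths and weaknesses. We therefore reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets for text classification as well as token and span labeling. In addition, we define a uniform evaluation setup, including a new formalization of the annotation error detection task, an evaluation protocol, and general best practices. To facilitate future research and reproducibility, we release our datasets and implementations in an easy-to-use, open-source software package.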
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Recently, contrastive learning has attracted increasing interest in neural text generation as a new solution to alleviate the exposure bias problem. It introduces a sequence-level training signal, which is crucial to generation tasks that always rely on auto-regressive decoding. However, previous methods using contrastive learning in neural text generation usually lead to inferior performance. In this paper, we analyse the underlying reasons and propose a new Contrastive Neural Text generation framework, CoNT. CoNT addresses bottlenecks that prevent contrastive learning from being widely adopted in generation tasks from three aspects -- the construction of contrastive examples, the choice of the contrastive loss, and the strategy in decoding. We validate CoNT on five generation tasks with ten benchmarks, including machine translation, summarization, code comment generation, data-to-text generation and commonsense generation. Experimental results show that CoNT clearly outperforms the conventional training framework on all ten benchmarks by a convincing margin. In particular, CoNT surpasses the previously most competitive contrastive learning method for text generation by 1.50 BLEU on machine translation and 1.77 ROUGE-1 on summarization. It achieves a new state of the art on summarization, code comment generation (without external data) and data-to-text generation.
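For named entity recognition (NER), the sequence labeling-based and span-based paradigms are quite different. Previous research has shown that the two paradigms have clear complementary advantages, but, to the best of our knowledge, few models have attempted to exploit these advantages in a single NER model. In our previous work, we proposed a paradigm called Bundling Learning (BL) to address this problem. The BL paradigm bundles the two NER paradigms, enabling NER models to jointly tune their parameters through a weighted sum of each paradigm's training loss. However, three critical questions remain unresolved: When does BL work? Why does BL work? Can BL enhance existing state-of-the-art (SOTA) NER models? To address the first two questions, we implement three NER models: a sequence labeling-based model, SeqNER; a span-based model, SpanNER; and BL-NER, which bundles SeqNER and SpanNER together. We draw two conclusions regarding these two questions based on experimental results on eleven NER datasets from five domains. We then apply BL to five existing SOTA NER models, comprising three sequence labeling-based models and two span-based models, to investigate the third question. Experimental results show that BL consistently improves their performance, suggesting that a new SOTA NER system can be built by incorporating BL into current SOTA systems. Moreover, we find that BL reduces both entity boundary and type prediction errors. In addition, we compare two commonly used label tagging methods as well as three types of span semantic representations.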
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection to improving translation.
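We propose an explainable approach for relation extraction that mitigates the tension between generalization and explainability by jointly training for the two goals. Our approach uses a multi-task learning architecture, which jointly trains a classifier for relation extraction and a sequence model that labels the words in the context of the relation that explain the decisions of the relation classifier. We also convert the model outputs to rules to bring global explanations to this approach. The sequence model is trained using a hybrid strategy: supervised, when supervision from pre-existing patterns is available, and semi-supervised otherwise. In the latter case, we treat the sequence model's labels as latent variables and learn the best assignment that maximizes the performance of the relation classifier. We evaluate the proposed approach on two datasets and show that the sequence model provides labels that serve as accurate explanations for the relation classifier's decisions and, importantly, that the joint training generally improves the performance of the relation classifier. We also evaluate the performance of the generated rules and show that the new rules are a valuable addition to the manual rules and bring the rule-based system much closer to the neural models.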
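Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval-augmented models are known to excel at knowledge-intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work, we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge-intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.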
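Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best prior representation approaches primarily capture relatedness (whether two variables are linked at all) rather than similarity (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs while maximizing the distance between dissimilar inputs. This requires labeled training data, so we construct a novel, weakly supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated, general-purpose language models such as BERT to variable name representation, and thus also to related downstream tasks such as variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state of the art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we contribute a release of all data, code, and pre-trained models, aiming to provide a drop-in replacement for the variable representations used in existing or future program analyses.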
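Reverse engineers benefit from the presence of identifiers such as function names in a binary, but these are usually removed for release. Training a machine learning model to predict function names automatically is promising but fundamentally hard: unlike words in natural language, most function names occur only once. In this paper, we address this problem by introducing eXtreme Function Labeling (XFL), an extreme multi-label learning approach to selecting appropriate labels for binary functions. XFL splits function names into tokens, treating each as an informative label, akin to the problem of tagging texts in natural language. We relate the semantics of binary code to labels through DEXTER, a novel function embedding that combines static analysis-based features with local context from the call graph and global context from the entire binary. We demonstrate that XFL/DEXTER outperforms the state of the art in function labeling on a dataset of 10,047 binaries from the Debian project, achieving a precision of 83.5%. We also study combinations of XFL with alternative binary embeddings from the literature and show that DEXTER consistently performs best for this task. As a result, we demonstrate that binary function labeling can be effectively phrased in terms of multi-label learning, and that binary function embeddings benefit from including explicit semantic features.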
End-to-end (E2E) task-oriented dialogue (ToD) systems are prone to fall into the so-called 'likelihood trap', resulting in generated responses which are dull, repetitive, and often inconsistent with dialogue history. Comparing ranked lists of multiple generated responses against the 'gold response' (from training data) reveals a wide diversity in response quality, with many good responses placed lower in the ranked list. The main challenge, addressed in this work, is then how to reach beyond greedily generated system responses, that is, how to obtain and select such high-quality responses from the list of overgenerated responses at inference without availability of the gold response. To this end, we propose a simple yet effective reranking method which aims to select high-quality items from the lists of responses initially overgenerated by the system. The idea is to use any sequence-level (similarity) scoring function to divide the semantic space of responses into high-scoring versus low-scoring partitions. At training, the high-scoring partition comprises all generated responses whose similarity to the gold response is higher than the similarity of the greedy response to the gold response. At inference, the aim is to estimate the probability that each overgenerated response belongs to the high-scoring partition, given only previous dialogue history. We validate the robustness and versatility of our proposed method on the standard MultiWOZ dataset: our methods improve a state-of-the-art E2E ToD system by 2.4 BLEU, 3.2 ROUGE, and 2.8 METEOR scores, achieving new peak results. Additional experiments on the BiTOD dataset and human evaluation further ascertain the generalisability and effectiveness of the proposed framework.
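The training-time partition described above can be sketched as follows. The abstract allows any sequence-level similarity function; the token-overlap F1 used here is a stand-in chosen for illustration, and the names `token_f1` and `partition_labels` are hypothetical rather than taken from the paper.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-overlap F1 between two whitespace-tokenized strings.

    Used here only as a stand-in for the sequence-level similarity
    function; the abstract does not prescribe a particular one.
    """
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def partition_labels(candidates, greedy, gold, sim=token_f1):
    """Label each overgenerated response 1 (high-scoring) or 0 (low-scoring).

    A response is high-scoring when it is strictly more similar to the
    gold response than the greedy response is.
    """
    threshold = sim(greedy, gold)
    return [1 if sim(c, gold) > threshold else 0 for c in candidates]


# Hypothetical dialogue turn.
gold = "the hotel is in the north"
greedy = "there is a hotel"
candidates = ["the hotel is in the north", "ok"]
labels = partition_labels(candidates, greedy, gold)
```

The resulting binary labels could then supervise a reranker that sees only the dialogue history and a candidate response, since the gold response is unavailable at inference.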
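Due to exposure bias, most existing natural language generation (NLG) models trained by maximizing the likelihood objective produce poor text at inference time. In this paper, to tackle this problem, we revisit the generate-then-rank framework and propose a joint generator-ranker (JGR) training algorithm for text generation tasks. In JGR, the generator model is trained by maximizing two objectives: the likelihood of the training corpus and the expected reward given by the ranker model. Meanwhile, the ranker model takes input samples from the generator model and learns to distinguish good samples from the generation pool. The generator and ranker models are optimized alternately until convergence. In the empirical study, the proposed JGR model achieves new state-of-the-art performance on five public benchmarks covering three popular generation tasks: summarization, question generation, and response generation. We will make code, data, and models available at https://github.com/microsoft/advnlg.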
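Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies to manage the impact of such noise. The first involves aggregating redundant annotations, but comes at the expense of labeling substantially fewer examples. Second, prior work has also considered using the entire annotation budget to label as many examples as possible and then applying denoising algorithms to implicitly clean the dataset. We find a middle ground and propose an approach that reserves a fraction of the annotations to explicitly clean up highly probable error samples in order to optimize the annotation process. In particular, we allocate a large portion of the labeling budget to form an initial dataset used to train a model. This model is then used to identify the specific examples that are most likely to be incorrect, and we spend the remaining budget on relabeling them. Experiments across three model variations and four natural language processing tasks show that, when allocated the same finite annotation budget, our approach outperforms or matches both label aggregation and advanced denoising methods designed to handle noisy labels.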
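The relabeling step can be illustrated with a short sketch. The abstract does not state how likely-incorrect examples are ranked, so this example assumes the trained model's probability for the currently assigned label is the ranking signal; `select_for_relabeling` and its arguments are hypothetical names.

```python
def select_for_relabeling(assigned_label_probs, remaining_budget):
    """Pick the examples to send back to annotators.

    assigned_label_probs[i] is the trained model's probability for the
    label currently assigned to example i. Ranking by this value is an
    assumption made for illustration; the source does not specify the
    selection criterion.
    """
    # Lowest model confidence in the assigned label comes first.
    ranked = sorted(
        range(len(assigned_label_probs)),
        key=lambda i: assigned_label_probs[i],
    )
    return ranked[:remaining_budget]


# Hypothetical confidences for four crowdsourced labels, with a
# remaining budget of two relabeling annotations.
to_relabel = select_for_relabeling([0.9, 0.1, 0.5, 0.3], remaining_budget=2)
```

Here the examples at indices 1 and 3 are selected, because the model assigns the lowest probability to their current labels.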
Partial label learning (PLL) is an important problem that allows each training example to be labeled with a coarse candidate set, which well suits many real-world data annotation scenarios with label ambiguity. Despite the promise, the performance of PLL often lags behind the supervised counterpart. In this work, we bridge the gap by addressing two key research challenges in PLL -- representation learning and label disambiguation -- in one coherent framework. Specifically, our proposed framework PiCO consists of a contrastive learning module along with a novel class prototype-based label disambiguation algorithm. PiCO produces closely aligned representations for examples from the same classes and facilitates label disambiguation. Theoretically, we show that these two components are mutually beneficial, and can be rigorously justified from an expectation-maximization (EM) algorithm perspective. Moreover, we study a challenging yet practical noisy partial label learning setup, where the ground-truth may not be included in the candidate set. To remedy this problem, we present an extension PiCO+ that performs distance-based clean sample selection and learns robust classifiers by a semi-supervised contrastive learning algorithm. Extensive experiments demonstrate that our proposed methods significantly outperform the current state-of-the-art approaches in standard and noisy PLL tasks and even achieve comparable results to fully supervised learning.
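Cross-lingual machine reading comprehension (XMRC) is challenging due to the lack of training data in low-resource languages. Recent approaches use training data only in a resource-rich language such as English to fine-tune large-scale cross-lingual pre-trained language models. Because of the large differences between languages, a model fine-tuned only on the source language may not perform well on target languages. Interestingly, we observe that while the top-1 results predicted by previous approaches often fail to hit the ground-truth answers, the correct answers are often contained in the top-k predicted results. Based on this observation, we develop a two-stage approach to improve model performance. The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer. The second stage focuses on precision: an answer-aware contrastive learning (AA-CL) mechanism is developed to learn the fine differences between the accurate answer and other candidates. Our extensive experiments show that our model significantly outperforms a series of strong baselines on two cross-lingual MRC benchmark datasets.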
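With growing amounts of available textual data, the development of algorithms capable of automatically analyzing, categorizing and summarizing these data has become a necessity. In this research, we present a novel algorithm for keyword identification, i.e., the extraction of one- or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword Identification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model is able to overcome deficiencies of both supervised and unsupervised state-of-the-art approaches by offering competitive and robust performance on a variety of different datasets while requiring only a fraction of the manually labeled data required by the best-performing systems. This study also offers a thorough error analysis with valuable insights into the inner workings of the model, and an ablation study measuring the influence of specific components of the keyword identification workflow on the overall performance.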
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper https://github.com/asahi417/lm-question-generation, which are also available as a demo https://autoqg.net/.