We consider the end-to-end abstract-to-title generation problem, exploring seven recent transformer based models (including ChatGPT) fine-tuned on more than 30k abstract-title pairs from NLP and machine learning venues. As an extension, we also consider the harder problem of generating humorous paper titles. For the latter, we compile the first large-scale humor annotated dataset for scientific papers in the NLP/ML domains, comprising almost 2.5k titles. We evaluate all models using human and automatic metrics. Our human evaluation suggests that our best end-to-end system performs similarly to human authors (but arguably slightly worse). Generating funny titles is more difficult, however, and our automatic systems clearly underperform relative to humans and often learn dataset artefacts of humor. Finally, ChatGPT, without any fine-tuning, performs on the level of our best fine-tuned system.
translated by 谷歌翻译
在本文中,我们将科学文章分类为自然语言处理(NLP)和机器学习(ML)的科学文章(i)是否通过引入击败现有模型或的新型技术来扩展当前的最新技术是否(ii)他们是否主要批评现有的最新技术,即,它相对于某些属性(例如,错误的评估,错误的数据集,误导性的任务规范)不足。我们将(i)下的贡献称为具有\ enquote {正姿势}和(ii)下的贡献为具有\ enquote {负姿势}(对相关工作)。我们注释来自NLP和ML的1.5k纸以超过1.5k的论文来培训基于SCIBERT的模型,以自动根据其标题和抽象来预测论文的立场。然后,我们分析了NLP和ML的最后35年$ 35年以上的41k纸上的大规模趋势,发现随着时间的流逝,论文变得更加积极,但是负面论文也变得更加负面,我们观察到更多的负面论文,我们观察到了更多的负面论文。最近几年。在收到的引用方面,负面论文也更具影响力。
translated by 谷歌翻译
An optimal delivery of arguments is key to persuasion in any debate, both for humans and for AI systems. This requires the use of clear and fluent claims relevant to the given debate. Prior work has studied the automatic assessment of argument quality extensively. Yet, no approach actually improves the quality so far. Our work is the first step towards filling this gap. We propose the task of claim optimization: to rewrite argumentative claims to optimize their delivery. As an initial approach, we first generate a candidate set of optimized claims using a sequence-to-sequence model, such as BART, while taking into account contextual information. Our key idea is then to rerank generated candidates with respect to different quality metrics to find the best optimization. In automatic and human evaluation, we outperform different reranking baselines on an English corpus, improving 60% of all claims (worsening 16% only). Follow-up analyses reveal that, beyond copy editing, our approach often specifies claims with details, whereas it adds less evidence than humans do. Moreover, its capabilities generalize well to other domains, such as instructional texts.
translated by 谷歌翻译
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
translated by 谷歌翻译
直接使用现有的文本生成数据集进行可控生成时,我们面临的问题是没有域知识,因此可以控制的方面受到限制。一个典型的示例是,当使用CNN/Daily Mail数据集用于可控文本摘要时,没有关于摘要句子的重点的指导信息。更有用的文本生成器应利用输入文本和控制信号来指导生成,只能在对域知识的深入了解中构建。在这个愿景的激励下,我们的论文介绍了一个名为Mred的新文本生成数据集。我们的新数据集由7,089个元评论组成,其所有45k元评论句子都用9个精心定义的类别之一手动注释,包括抽象,力量,决策等。我们介绍了对开始的实验结果摘要模型,并提出了使用我们的带注释数据的方法对结构控制生成的方法。通过探索各种设置并分析模型行为相对于控制信号,我们证明了我们提出的任务的挑战以及数据集MRD的值。同时,MRD还使我们能够更好地了解元评论域。
translated by 谷歌翻译
This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsoursing (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves the state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that the paraphrase generation models trained on MultiPIT_Auto generate more diverse and high-quality paraphrases compared to their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.
translated by 谷歌翻译
学术研究是解决以前从未解决过的问题的探索活动。通过这种性质,每个学术研究工作都需要进行文献审查,以区分其Novelties尚未通过事先作品解决。在自然语言处理中,该文献综述通常在“相关工作”部分下进行。鉴于研究文件的其余部分和引用的论文列表,自动相关工作生成的任务旨在自动生成“相关工作”部分。虽然这项任务是在10年前提出的,但直到最近,它被认为是作为科学多文件摘要问题的变种。然而,即使在今天,尚未标准化了自动相关工作和引用文本生成的问题。在这项调查中,我们进行了一个元研究,从问题制定,数据集收集,方法方法,绩效评估和未来前景的角度来比较相关工作的现有文献,以便为读者洞察到国家的进步 - 最内容的研究,以及如何进行未来的研究。我们还调查了我们建议未来工作要考虑整合的相关研究领域。
translated by 谷歌翻译
Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets in text simplification are limited by a lack of annotations, unitary simplification types, and outdated models, making them unsuitable for this approach. To address these issues, we introduce the SIMPEVAL corpus that contains: SIMPEVAL_ASSET, comprising 12K human ratings on 2.4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications including generations from GPT-3.5. Training on SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. To create the SIMPEVAL datasets, we introduce RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner by leveraging an interactive interface, which ensures both consistency and accuracy in the evaluation process. Our metric, dataset, and annotation toolkit are available at https://github.com/Yao-Dou/LENS.
translated by 谷歌翻译
Legal contracts, such as employment or lease agreements, are important documents as they govern the obligations and entitlements of the various contracting parties. However, these documents are typically long and written in legalese resulting in lots of manual hours spent in understanding them. In this paper, we address the task of summarizing legal contracts for each of the contracting parties, to enable faster reviewing and improved understanding of them. Specifically, we collect a dataset consisting of pairwise importance comparison annotations by legal experts for ~293K sentence pairs from lease agreements. We propose a novel extractive summarization system to automatically produce a summary consisting of the most important obligations, entitlements, and prohibitions in a contract. It consists of two modules: (1) a content categorize to identify sentences containing each of the categories (i.e., obligation, entitlement, and prohibition) for a party, and (2) an importance ranker to compare the importance among sentences of each category for a party to obtain a ranked list. The final summary is produced by selecting the most important sentences of a category for each of the parties. We demonstrate the effectiveness of our proposed system by comparing it against several text ranking baselines via automatic and human evaluation.
translated by 谷歌翻译
最近提出的基于BERT的评估指标在标准评估基准方面表现良好,但容易受到对抗性攻击的影响,例如与事实错误有关。我们认为这(部分原因)是因为它们是语义相似性的模型。相反,我们根据自然语言推断(NLI)制定评估指标,我们认为这是更合适的建模。我们设计了一个基于偏好的对抗攻击框架,并表明我们的基于NLI的指标比最近基于BERT的指标更强大。在标准基准上,我们的基于NLI的指标的表现优于现有的摘要指标,但在SOTA MT指标下执行。但是,当我们将现有指标与NLI指标相结合时,我们可以获得更高的对抗性鲁棒性( +20%至 +30%)和较高质量的指标,如标准基准测量( +5%至 +25%)。
translated by 谷歌翻译
在线巨魔增加了社会成本,并对个人造成心理损害。随着自动化帐户利用机器人进行拖钓的扩散,目标个人用户很难在定量和定性上处理这种情况。为了解决这个问题,我们专注于自动化对抗巨魔的方法,因为对战斗巨魔的反应鼓励社区用户在不损害言论自由的情况下保持持续的讨论。为此,我们为自动反响应生成提出了一个新颖的数据集。特别是,我们构建了一个配对数据集,其中包括巨魔评论和使用标记的响应策略的反响应,该策略使我们的数据集中的模型可以通过根据指定策略改变反响应来生成响应。我们执行了三个任务来评估数据集的有效性,并通过自动和人类评估评估结果。在人类评估中,我们证明了数据集中微调的模型显示出策略控制的句子生成的性能有了显着改善。
translated by 谷歌翻译
诸如学术文章和商业报告之类的长期文件一直是详细说明重要问题和需要额外关注的复杂主题的标准格式。自动汇总系统可以有效地将长文档置于简短而简洁的文本中,以封装最重要的信息,从而在帮助读者的理解中很重要。最近,随着神经体系结构的出现,已经做出了重大的研究工作,以推动自动文本摘要系统,以及有关将这些系统扩展到长期文档领域的挑战的大量研究。在这项调查中,我们提供了有关长期文档摘要的研究的全面概述,以及其研究环境的三个主要组成部分的系统评估:基准数据集,汇总模型和评估指标。对于每个组成部分,我们在长期汇总的背景下组织文献,并进行经验分析,以扩大有关当前研究进度的观点。实证分析包括一项研究基准数据集的内在特征,摘要模型的多维分析以及摘要评估指标的综述。根据总体发现,我们通过提出可能在这个快速增长的领域中提出未来探索的方向来得出结论。
translated by 谷歌翻译
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper https://github.com/asahi417/lm-question-generation, which are also available as a demo https://autoqg.net/.
translated by 谷歌翻译
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprised of 12k dialogue turns generated by neural dialogue systems trained on three knowledgegrounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/ google/BEGIN-dataset.
translated by 谷歌翻译
The publication rates are skyrocketing across many fields of science, and it is difficult to stay up to date with the latest research. This makes automatically summarizing the latest findings and helping scholars to synthesize related work in a given area an attractive research objective. In this paper we study the problem of citation text generation, where given a set of cited papers and citing context the model should generate a citation text. While citation text generation has been tackled in prior work, existing studies use different datasets and task definitions, which makes it hard to study citation text generation systematically. To address this, we propose CiteBench: a benchmark for citation text generation that unifies the previous datasets and enables standardized evaluation of citation text generation models across task settings and domains. Using the new benchmark, we investigate the performance of multiple strong baselines, test their transferability between the datasets, and deliver new insights into task definition and evaluation to guide the future research in citation text generation. We make CiteBench publicly available at https://github.com/UKPLab/citebench.
translated by 谷歌翻译
Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: for selecting important images to focus on during training, and as a reward for reinforcement learning to improve factuality of multimodal summary generation w.r.t automatic and human evaluation. Our data and code are publicly available at https://github.com/meetdavidwan/faithful-multimodal-summ
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
自动评估摘要的连贯性具有重要意义,既可以实现成本效益的摘要评估,又可以通过选择高分候选候选摘要来提高连贯性。尽管已经提出了许多不同的方法来建模摘要相干性,但通常使用不同的数据集和指标对其进行评估。这使得很难理解他们的相对性能,并确定朝着更好的摘要连贯建模的方法。在这项工作中,我们对各种方法进行了大规模研究,以进行均匀的竞争环境建模。此外,我们介绍了两项新的分析措施,即系统内相关性和偏置矩阵,它们有助于确定相干度量的偏见,并为系统级混杂因素提供鲁棒性。尽管当前可用的自动连贯性措施都无法为所有评估指标的系统摘要分配可靠的连贯分数,但对自我监督任务进行了微调的大规模语言模型显示出令人鼓舞的结果,只要微调会考虑在内他们需要在不同的摘要长度上概括。
translated by 谷歌翻译
State-of-the-art poetry generation systems are often complex. They either consist of task-specific model pipelines, incorporate prior knowledge in the form of manually created constraints or both. In contrast, end-to-end models would not suffer from the overhead of having to model prior knowledge and could learn the nuances of poetry from data alone, reducing the degree of human supervision required. In this work, we investigate end-to-end poetry generation conditioned on styles such as rhyme, meter, and alliteration. We identify and address lack of training data and mismatching tokenization algorithms as possible limitations of past attempts. In particular, we successfully pre-train and release ByGPT5, a new token-free decoder-only language model, and fine-tune it on a large custom corpus of English and German quatrains annotated with our styles. We show that ByGPT5 outperforms other models such as mT5, ByT5, GPT-2 and ChatGPT, while also being more parameter efficient and performing favorably compared to humans. In addition, we analyze its runtime performance and introspect the model's understanding of style conditions. We make our code, models, and datasets publicly available.
translated by 谷歌翻译
最先进的抽象摘要系统经常生成\ emph {幻觉};即,不直接从源文本中推断的内容。尽管被认为是不正确的,我们发现非常令人难潮的内容是事实,即与世界知识一致。这些事实幻觉通过提供有用的背景信息,可以在摘要中受益。在这项工作中,我们提出了一种新的检测方法,将事实与实体的非事实幻觉分开。我们的方法分别使用实体的先前和后验概率,分别是预训练和芬特的屏蔽语言模型。经验结果表明,我们的方法在精度和F1分数方面大大优于两种基线%,与人类判断强烈相关。百分比对事实分类任务。此外,我们显示我们的探测器,当用作离线增强学习(RL)算法中的奖励信号时,显着提高了摘要的事实性,同时保持抽象水平。
translated by 谷歌翻译