Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets in text simplification are limited by a lack of annotations, unitary simplification types, and outdated models, making them unsuitable for this approach. To address these issues, we introduce the SIMPEVAL corpus that contains: SIMPEVAL_ASSET, comprising 12K human ratings on 2.4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications including generations from GPT-3.5. Training on SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. To create the SIMPEVAL datasets, we introduce RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner by leveraging an interactive interface, which ensures both consistency and accuracy in the evaluation process. Our metric, dataset, and annotation toolkit are available at https://github.com/Yao-Dou/LENS.
translated by 谷歌翻译
This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsoursing (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves the state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that the paraphrase generation models trained on MultiPIT_Auto generate more diverse and high-quality paraphrases compared to their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.
translated by 谷歌翻译
This report summarizes the work carried out by the authors during the Twelfth Montreal Industrial Problem Solving Workshop, held at Universit\'e de Montr\'eal in August 2022. The team tackled a problem submitted by CBC/Radio-Canada on the theme of Automatic Text Simplification (ATS).
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
We report on novel investigations into training models that make sentences concise. We define the task and show that it is different from related tasks such as summarization and simplification. For evaluation, we release two test sets, consisting of 2000 sentences each, that were annotated by two and five human annotators, respectively. We demonstrate that conciseness is a difficult task for which zero-shot setups with large neural language models often do not perform well. Given the limitations of these approaches, we propose a synthetic data generation method based on round-trip translations. Using this data to either train Transformers from scratch or fine-tune T5 models yields our strongest baselines that can be further improved by fine-tuning on an artificial conciseness dataset that we derived from multi-annotator machine translation test sets.
translated by 谷歌翻译
临床票据是记录患者信息的有效方法,但难以破译非专家的难以破译。自动简化医学文本可以使患者提供有关其健康的有价值的信息,同时节省临床医生。我们提出了一种基于词频率和语言建模的医学文本自动简化的新方法,基于富裕的外行术语的医疗本体。我们发布了一对公开可用的医疗句子的新数据集,并由临床医生简化了它们的版本。此外,我们定义了一种新颖的文本简化公制和评估框架,我们用于对我们对现有技术的方法进行大规模人类评估。我们基于在医学论坛数据上培训的语言模型的方法在保留语法和原始含义时产生更简单的句子,超越现有技术。
translated by 谷歌翻译
The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over a billion speakers, are linguistically different from English, and to date, there has not been a systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
translated by 谷歌翻译
The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.
translated by 谷歌翻译
We consider the end-to-end abstract-to-title generation problem, exploring seven recent transformer based models (including ChatGPT) fine-tuned on more than 30k abstract-title pairs from NLP and machine learning venues. As an extension, we also consider the harder problem of generating humorous paper titles. For the latter, we compile the first large-scale humor annotated dataset for scientific papers in the NLP/ML domains, comprising almost 2.5k titles. We evaluate all models using human and automatic metrics. Our human evaluation suggests that our best end-to-end system performs similarly to human authors (but arguably slightly worse). Generating funny titles is more difficult, however, and our automatic systems clearly underperform relative to humans and often learn dataset artefacts of humor. Finally, ChatGPT, without any fine-tuning, performs on the level of our best fine-tuned system.
translated by 谷歌翻译
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
translated by 谷歌翻译
迄今为止对文本生成的评估主要集中在依次创建的内容上,而不是对文本的改进。但是,写作自然是一个迭代和增量过程,需要在不同的模块化技能上进行专业知识,例如修复过时的信息或使样式更加一致。即便如此,对模型执行这些技能和编辑能力的模型能力的全面评估仍然很少。这项工作介绍了EditeVal:基于指导的,基准和评估套件,该套件利用现有的现有和新数据集自动评估编辑功能,例如使文本更具凝聚力和释义。我们评估了几种预训练的模型,这表明指令和同伴表现最好,但是大多数基准都落在监督的SOTA以下,尤其是在中和和更新信息时。我们的分析还表明,用于编辑任务的常用指标并不总是很好地关联,并且对具有最高性能的提示的优化并不一定带来对不同模型的最强鲁棒性。通过发布此基准和公开可用的排行榜挑战,我们希望在开发能够迭代和更可控制的编辑模型中解锁未来的研究。
translated by 谷歌翻译
A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
translated by 谷歌翻译
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
translated by 谷歌翻译
文本样式传输是自然语言生成中的重要任务,旨在控制生成的文本中的某些属性,例如礼貌,情感,幽默和许多其他特性。它在自然语言处理领域拥有悠久的历史,最近由于深神经模型带来的有希望的性能而重大关注。在本文中,我们对神经文本转移的研究进行了系统调查,自2017年首次神经文本转移工作以来跨越100多个代表文章。我们讨论了任务制定,现有数据集和子任务,评估,以及丰富的方法在存在并行和非平行数据存在下。我们还提供关于这项任务未来发展的各种重要主题的讨论。我们的策据纸张列表在https://github.com/zhijing-jin/text_style_transfer_survey
translated by 谷歌翻译
An optimal delivery of arguments is key to persuasion in any debate, both for humans and for AI systems. This requires the use of clear and fluent claims relevant to the given debate. Prior work has studied the automatic assessment of argument quality extensively. Yet, no approach actually improves the quality so far. Our work is the first step towards filling this gap. We propose the task of claim optimization: to rewrite argumentative claims to optimize their delivery. As an initial approach, we first generate a candidate set of optimized claims using a sequence-to-sequence model, such as BART, while taking into account contextual information. Our key idea is then to rerank generated candidates with respect to different quality metrics to find the best optimization. In automatic and human evaluation, we outperform different reranking baselines on an English corpus, improving 60% of all claims (worsening 16% only). Follow-up analyses reveal that, beyond copy editing, our approach often specifies claims with details, whereas it adds less evidence than humans do. Moreover, its capabilities generalize well to other domains, such as instructional texts.
translated by 谷歌翻译
即使在高度发达的国家,多达15-30%的人口只能理解使用基本词汇编写的文本。他们对日常文本的理解是有限的,这阻止了他们在社会中发挥积极作用,并就医疗保健,法律代表或民主选择做出明智的决定。词汇简化是一项自然语言处理任务,旨在通过更简单地替换复杂的词汇和表达方式来使每个人都可以理解文本,同时保留原始含义。在过去的20年中,它引起了极大的关注,并且已经针对各种语言提出了全自动词汇简化系统。该领域进步的主要障碍是缺乏用于构建和评估词汇简化系统的高质量数据集。我们提出了一个新的基准数据集,用于英语,西班牙语和(巴西)葡萄牙语中的词汇简化,并提供有关数据选择和注释程序的详细信息。这是第一个可直接比较三种语言的词汇简化系统的数据集。为了展示数据集的可用性,我们将两种具有不同体系结构(神经与非神经)的最先进的词汇简化系统适应所有三种语言(英语,西班牙语和巴西葡萄牙语),并评估他们的表演在我们的新数据集中。为了进行更公平的比较,我们使用多种评估措施来捕获系统功效的各个方面,并讨论其优势和缺点。我们发现,最先进的神经词汇简化系统优于所有三种语言中最先进的非神经词汇简化系统。更重要的是,我们发现最先进的神经词汇简化系统对英语的表现要比西班牙和葡萄牙语要好得多。
translated by 谷歌翻译
我们提出了两种小型无监督方法,用于消除文本中的毒性。我们的第一个方法结合了最近的两个想法:(1)使用小型条件语言模型的生成过程的指导和(2)使用释义模型进行风格传输。我们使用良好的令人措辞的令人愉快的释放器,由风格培训的语言模型引导,以保持文本内容并消除毒性。我们的第二种方法使用BERT用他们的非攻击性同义词取代毒性单词。我们通过使BERT替换具有可变数量的单词的屏蔽令牌来使该方法更灵活。最后,我们介绍了毒性去除任务的风格转移模型的第一个大规模比较研究。我们将模型与许多用于样式传输的方法进行比较。使用无监督的样式传输指标的组合以可参考方式评估该模型。两种方法都建议产生新的SOTA结果。
translated by 谷歌翻译
End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
translated by 谷歌翻译
接受高等教育对于少数族裔和新兴双语学生至关重要。但是,高等教育机构用来与准学生交流的语言通常太复杂了。具体而言,美国的许多机构发布录取申请指令远远高于典型高中毕业生的平均阅读水平,通常接近13年级或14年级。这导致学生之间不必要的障碍和获得高等教育。这项工作旨在通过简化文本来应对这一挑战。我们介绍PSAT(专业简化的录取文本),这是一个数据集,其中有112条从美国的高等教育机构中随机选择的录取说明。然后,这些文本将被专业地简化,并被各个机构招生办公室的专职员工专家进行了验证和接受。此外,PSAT带有1,883个原始简化句子对的手动对齐。结果是在与现有简化资源不同的高风险流派中评估和微调文本简化系统的首个语料库。
translated by 谷歌翻译
文本生成的广泛使用的评估指标要么与更长的文本效果不错,要么无法评估文本质量的所有方面。在本文中,我们引入了一个名为SMART的新指标,以减轻此类限制。具体而言,我们将句子视为匹配的基本单位,而不是代币,并使用句子匹配函数来匹配匹配候选和参考句子。还将候选句子与源文件中的句子进行了比较,以允许接地(例如,事实)评估。我们的结果表明,我们提出的指标与基于模型的匹配函数的系统级相关性优于萨姆瓦尔摘要元评估数据集上的所有竞争指标指标。后者不使用任何神经模型,这在模型开发阶段很有用,在这些阶段,资源可以受到限制且需要快速评估。最后,我们还进行了广泛的分析,表明我们提出的指标与较长的摘要很好地运行,并且对特定模型的偏见较小。
translated by 谷歌翻译