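Many implicit inferences exist in text depending on how it is structured, and these can critically affect the text's interpretation and meaning. One such structural aspect present in text with chronology is the order of its presentation. For narratives or stories, this is known as the narrative order. Reordering a narrative can affect the temporal, causal, event-based, and other inferences readers draw from it, which in turn can strongly affect both its interpretation and its interestingness. In this paper, we propose and investigate the task of Narrative Reordering (NAREOR), which involves rewriting a given story in a different narrative order while preserving its plot. We present a dataset, NAREORC, with human rewritings of stories from ROCStories in non-linear orders, and conduct a detailed analysis of it. Further, we propose novel task-specific training methods with suitable evaluation metrics. We perform experiments on NAREORC using state-of-the-art models such as BART and T5 and conduct extensive automatic and human evaluations. We demonstrate that although our models can perform decently, NAREOR is a challenging task with potential for further exploration. We also investigate two applications of NAREOR: generating more interesting variations of stories, and serving as adversarial sets for temporal/event-related tasks. We additionally discuss other prospective applications, such as pedagogical setups related to language skills like essay writing, and applications to medicine involving clinical narratives.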
translated by Google Translate
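We motivate and propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrate their effectiveness on generative commonsense reasoning, a.k.a. the CommonGen task, through experiments using both BART and T5 models. Through extensive automatic and human evaluation, we show that SAPPHIRE noticeably improves model performance. An in-depth qualitative analysis illustrates that SAPPHIRE effectively addresses many issues of the baseline model generations, including lack of commonsense, insufficient specificity, and poor fluency.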
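A personification is a figure of speech that endows inanimate entities with properties and actions typically seen as requiring animacy. In this paper, we explore the task of personification generation. To this end, we propose PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced generation. We curate a corpus of personifications called PersonifCorp, together with automatically generated de-personified literalizations of these personifications. We demonstrate the usefulness of this parallel corpus by training a seq2seq model to personify a given literal input. Both automatic and human evaluations show that fine-tuning with PersonifCorp leads to significant gains in personification-related qualities such as animacy and interestingness. A detailed qualitative analysis also highlights key strengths and imperfections of PINEAPPLE over baselines, demonstrating a strong ability to generate diverse and creative personifications that enhance the overall appeal of a sentence.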
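Tongue twisters are meaningful sentences that are difficult to pronounce. Automatically generating tongue twisters is challenging because the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.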
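We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.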
Knowledge about outcomes is critical for complex event understanding but is hard to acquire. We show that by pre-identifying a participant in a complex event, crowd workers are able to (1) infer the collective impact of salient events that make up the situation, (2) annotate the volitional engagement of participants in causing the situation, and (3) ground the outcome of the situation in state changes of the participants. By creating a multi-step interface and a careful quality control strategy, we collect a high quality annotated dataset of 8K short newswire narratives and ROCStories with high inter-annotator agreement (0.74-0.96 weighted Fleiss Kappa). Our dataset, POQue (Participant Outcome Questions), enables the exploration and development of models that address multiple aspects of semantic understanding. Experimentally, we show that current language models lag behind human performance in subtle ways through our task formulations that target abstract and specific comprehension of a complex event, its outcome, and a participant's influence over the event culmination.
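Pre-trained language models (PLMs) fail to generate long-form narrative text because they do not consider global structure. As a result, the generated texts are often incohesive, repetitive, or lacking in content. Recent work in story generation has reintroduced explicit content planning in the form of prompts, keywords, or semantic frames. Trained on large parallel corpora, these models can generate more logical event sequences and thus more contentful stories. However, these intermediate representations are often not in natural language and cannot be used by PLMs without fine-tuning. We propose generating story plots using off-the-shelf PLMs while maintaining the benefit of content planning, in order to generate cohesive and contentful stories. Our proposed method, ScratchPlot, first prompts a PLM to compose a content plan. Then, we generate the story's body and ending conditioned on the content plan. Furthermore, we use additional PLMs to rank the generated (story, ending) pairs. We benchmark our method against various baselines and achieve superior results in both human and automatic evaluation.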
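Despite advances in generating fluent text, existing pre-trained models tend to attach incoherent event sequences to the entities involved when generating narratives such as stories and news. We conjecture that such issues result from representing entities as static embeddings of surface words, while neglecting to model their ever-changing states, i.e., the information they carry, as the text unfolds. Therefore, we extend the Transformer model to dynamically perform entity state updates and sentence realization for narrative generation. We propose a contrastive framework to learn the state representations in a discrete space, and insert additional attention layers into the decoder to better exploit these states. Experiments on two narrative datasets show that, guided by meaningful entity states, our model can generate more coherent and diverse narratives.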
Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets in text simplification are limited by a lack of annotations, unitary simplification types, and outdated models, making them unsuitable for this approach. To address these issues, we introduce the SIMPEVAL corpus that contains: SIMPEVAL_ASSET, comprising 12K human ratings on 2.4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications including generations from GPT-3.5. Training on SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. To create the SIMPEVAL datasets, we introduce RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner by leveraging an interactive interface, which ensures both consistency and accuracy in the evaluation process. Our metric, dataset, and annotation toolkit are available at https://github.com/Yao-Dou/LENS.
Controllable Text Generation (CTG) is an emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints of practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods needs to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches has emerged in the last 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review of the common tasks, main approaches, and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
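Events in a narrative can be understood as a coherent whole via the underlying states of its participants. Often, these participant states are not explicitly mentioned in the narrative and are left to be filled in via common sense or inference. A model that understands narratives should be able to infer these implicit participant states and reason about the impact of changes to these states on the narrative. To facilitate this goal, we introduce a new crowdsourced participant states dataset, PASTA. This dataset contains valid, inferable participant states; a counterfactual perturbation to each state; and the changes to the story that would be necessary if the counterfactual were true. We introduce three state-based reasoning tasks that test the ability to infer when a state is entailed by a story, to revise a story for a counterfactual state, and to explain the most likely state change given a revised story. Our benchmarking experiments show that while today's LLMs are able to reason about states to some degree, there is large room for improvement, suggesting potential avenues for future research.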
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality - among other traits - by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human annotated and six programmatically perturbed diagnostics datasets - covering a diverse set of tasks that require reasoning skills and show that ROSCOE can consistently outperform baseline metrics.
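To enable building and testing models for long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to perform well consistently. Current models perform poorly on this task (55.4%) and lag behind human performance (93.5%).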
Storytelling and narrative are fundamental to human experience, intertwined with our social and cultural engagement. As such, researchers have long attempted to create systems that can generate stories automatically. In recent years, powered by deep learning and massive data resources, automatic story generation has shown significant advances. However, considerable challenges, like the need for global coherence in generated stories, still hamper generative models from reaching the same storytelling ability as human narrators. To tackle these challenges, many studies seek to inject structured knowledge into the generation process, which is referred to as structured knowledge-enhanced story generation. Incorporating external knowledge can enhance the logical coherence among story events, achieve better knowledge grounding, and alleviate over-generalization and repetition problems in stories. This survey provides an up-to-date and comprehensive review of this research field: (i) we present a systematic taxonomy of how existing methods integrate structured knowledge into story generation; (ii) we summarize the story corpora, structured knowledge datasets, and evaluation metrics involved; (iii) we give multidimensional insights into the challenges of knowledge-enhanced story generation and cast light on promising directions for future study.
This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsourcing (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definitions, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that paraphrase generation models trained on MultiPIT_Auto generate more diverse and higher-quality paraphrases than their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.
Narrative summarization aims to produce a distilled version of a narrative to describe its most salient events and characters. Summarizing a narrative is challenging as it requires an understanding of event causality and character behaviors. To encourage research in this direction, we propose NarraSum, a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries. Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum. We hope that this dataset will promote future research in summarization, as well as broader studies of natural language understanding and generation. The dataset is available at https://github.com/zhaochaocs/narrasum.
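The task of inserting text into a specified position in a passage, known as fill in the blank (FITB), is useful for a variety of applications where writers interact with a natural language generation (NLG) system to craft text. While previous work has tackled this problem with models trained specifically for it, a more useful model is one that can effectively perform _both_ FITB and continuation. In this work, we evaluate the feasibility of using a single model to do both tasks. We show that models pre-trained with a FITB-style objective are capable of both tasks, while models pre-trained for continuation are not. Finally, we show how FITB models can be easily fine-tuned to allow fine-grained control over the length and word choice of the generation.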
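Non-parallel text style transfer is an important task in natural language generation. However, previous studies concentrate on the token or sentence level, such as sentence sentiment and formality transfer, and neglect long style transfer at the discourse level. Long texts usually involve more complicated author linguistic preferences, such as discourse structures, than sentences do. In this paper, we formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style while maintaining the source semantics. To tackle this problem, we propose a generation model named StoryTrans, which leverages discourse representations to capture source content information and transfer it to target styles with learnable style embeddings. We use an additional training objective to disentangle stylistic features from the learned discourse representations, to prevent the model from degenerating into an auto-encoder. Moreover, to enhance content preservation, we design a mask-and-fill framework to explicitly fuse style-specific keywords of the source text into the generation. Furthermore, we construct new datasets for this task in Chinese and English, respectively. Extensive experiments show that our model outperforms strong baselines in overall performance on style transfer and content preservation.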
We propose the Detailed Outline Control (DOC) framework for improving long-range plot coherence when automatically generating several-thousand-word-long stories. DOC consists of two complementary components: a detailed outliner and a detailed controller. The detailed outliner creates a more detailed, hierarchically structured outline, shifting creative burden from the main drafting procedure to the planning stage. The detailed controller ensures the more detailed outline is still respected during generation by controlling story passages to align with outline details. In human evaluations of automatically generated stories, DOC substantially outperforms a strong Re3 baseline (Yang et al., 2022) on plot coherence (22.5% absolute gain), outline relevance (28.2%), and interestingness (20.7%). Humans also judged DOC to be much more controllable in an interactive generation setting.
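We challenge AI models to "demonstrate understanding" of the sophisticated multimodal humor of The New Yorker Caption Contest. Concretely, we develop three carefully circumscribed tasks that involve grasping potentially complex and unexpected relationships between image and caption, and similarly complex and unexpected allusions to the wide variety of human experience; these are the hallmarks of a New Yorker-caliber cartoon. We investigate vision-and-language models that take the cartoon pixels and caption directly as input, as well as language-only models for which we circumvent image processing by providing textual descriptions of the image. Even with the rich, multifaceted annotations we provide for the cartoon images, we identify performance gaps between high-quality machine learning models (e.g., a fine-tuned, 175B-parameter language model) and humans. We publicly release our corpora, including annotations describing the image's locations/entities, what is unusual about the scene, and an explanation of the joke.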