Often clickbait articles have a title that is phrased as a question or a vague teaser that entices the user to click on the link and read the article to find the explanation. We developed a system that automatically finds the answer or explanation to the clickbait hook in the article text, so that the user does not need to read through the text themselves. We fine-tune an extractive question answering model (RoBERTa) and an abstractive one (T5), using data scraped from the 'StopClickbait' Facebook pages and Reddit's 'SavedYouAClick' subforum. We find that both the extractive and abstractive models improve significantly after fine-tuning. The extractive model performs slightly better according to ROUGE scores, while the abstractive one has a slight edge in terms of BERTScore.
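A minimal sketch of the comparison described above, using off-the-shelf Hugging Face checkpoints as stand-ins for the paper's fine-tuned models; the checkpoint names, prompt format, and example title/article are illustrative assumptions, not the paper's setup:

```python
# Sketch: extractive (RoBERTa) vs. abstractive (T5) clickbait "spoiling",
# scored with ROUGE and BERTScore. Checkpoints are generic stand-ins, not
# the fine-tuned models from the paper.
from transformers import pipeline
import evaluate

title = "You won't believe what this fruit does to your health"   # hypothetical example
article = "A new study suggests that eating blueberries daily may modestly lower blood pressure."
reference_spoiler = "Eating blueberries daily may modestly lower blood pressure."

# Extractive: treat the clickbait title as the question, the article as context.
extractive = pipeline("question-answering", model="deepset/roberta-base-squad2")
ext_spoiler = extractive(question=title, context=article)["answer"]

# Abstractive: let a seq2seq model generate the spoiler from title + article.
abstractive = pipeline("text2text-generation", model="t5-base")
abs_spoiler = abstractive(f"question: {title} context: {article}", max_new_tokens=32)[0]["generated_text"]

# Compare both outputs against the reference with ROUGE and BERTScore.
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
for name, pred in [("extractive", ext_spoiler), ("abstractive", abs_spoiler)]:
    r = rouge.compute(predictions=[pred], references=[reference_spoiler])
    b = bertscore.compute(predictions=[pred], references=[reference_spoiler], lang="en")
    print(name, round(r["rougeL"], 3), round(b["f1"][0], 3))
```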
The Query-Focused Text Summarization (QFTS) task aims at building systems that generate the summary of a text document based on a given query. A key challenge in addressing this task is the lack of large labeled data for training the summarization model. In this paper, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models on a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task in both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques to pre-trained transformer-based summarization models, including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task while setting new state-of-the-art results on a set of automatic and human evaluation metrics.
The general QA field has been developing its methodology with reference to the Stanford Question Answering Dataset (SQuAD) as the key benchmark. However, compiling factual questions is accompanied by time- and labour-consuming annotation, limiting the potential size of the training data. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and has also been tested for creating SQuAD-formatted QA on other domains, such as news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole of Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
This paper presents an open-domain question answering system for Romanian that answers COVID-19 related questions. The QA system pipeline involves automatic question processing, automatic query generation, web search for the top 10 most relevant documents, and answer extraction using a BERT model for extractive QA trained on our manually created COVID-19 dataset. The paper presents the QA system, its integration with Romanian language technologies, the COVID-19 dataset, and different evaluations of the QA performance.
Factual consistency is an essential quality of text summarization models in practical settings. Existing work on evaluating this dimension can be broadly divided into two lines of research: entailment-based metrics and question answering (QA)-based metrics. However, the differing experimental setups proposed in recent work lead to contrasting conclusions as to which paradigm performs best. In this work, we conduct an extensive comparison of entailment-based and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that yields an average improvement over previous QA-based metrics on the SummaC factual consistency benchmark. Our solution also improves upon the best entailment-based metric and achieves state-of-the-art performance on this benchmark. Furthermore, we find that QA-based and entailment-based metrics offer complementary signals, and we combine the two into a single learned metric for a further boost. Through qualitative and quantitative analyses, we identify question generation and answerability classification as two key components for future work on QA-based metrics.
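The core idea of combining a QA-based signal with an entailment-based signal can be sketched as follows. This is not the QAFactEval implementation: the checkpoints, the hand-written question-answer pair, and the plain averaging are illustrative assumptions.

```python
# Sketch: combine a QA-based consistency signal (does the source support the
# answers implied by the summary?) with an entailment-based signal.
from transformers import pipeline

source = "The company reported a 12% rise in quarterly revenue, driven by cloud sales."
summary = "Quarterly revenue rose 12%, driven by cloud sales."
# QA-based metrics generate questions from the summary automatically; here one
# (question, answer-in-summary) pair is written by hand for illustration.
qa_pairs = [("By how much did quarterly revenue rise?", "12%")]

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
nli = pipeline("text-classification", model="roberta-large-mnli")

# QA signal: answer each question against the SOURCE and check that it matches
# the answer taken from the summary (crude containment check; real metrics use
# learned answer-overlap scorers).
qa_scores = []
for question, summary_answer in qa_pairs:
    source_answer = qa(question=question, context=source)["answer"]
    qa_scores.append(float(summary_answer.lower() in source_answer.lower()
                           or source_answer.lower() in summary_answer.lower()))
qa_signal = sum(qa_scores) / len(qa_scores)

# Entailment signal: 1.0 if the top NLI label says the source entails the summary.
nli_out = nli({"text": source, "text_pair": summary})
top = nli_out[0] if isinstance(nli_out, list) else nli_out
ent_signal = 1.0 if top["label"] == "ENTAILMENT" else 0.0

print("combined consistency score:", (qa_signal + ent_signal) / 2)
```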
Large pre-trained language models have recently enabled open-ended generation frameworks (e.g., prompt-to-text NLG) to tackle a variety of tasks going beyond traditional data-to-text generation. While this framework is more general, it is under-specified and often leads to a lack of controllability, restricting its real-world usage. We propose a new grounded keys-to-text generation task: the task is to generate a factual description about an entity given a set of guiding keys and grounding passages. To address this task, we introduce a new dataset, called EntDeGen. Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for the factual correctness of generated descriptions. Our EntDescriptor model is equipped with strong rankers to fetch helpful passages and generate entity descriptions. Experimental results show a good correlation (60.14) between our proposed metric and human judgments of factuality. Our rankers significantly improved the factual correctness of generated descriptions (15.95% and 34.51% relative gains in recall and precision). Finally, our ablation study highlights the benefit of combining keys and groundings.
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper at https://github.com/asahi417/lm-question-generation, which are also available as a demo at https://autoqg.net/.
Question answering systems these days typically use template-based language generation. While adequate for domain-specific tasks, such systems are too restrictive and predefined for domain-independent systems. This paper proposes a system that outputs a full-length answer given a question and the extracted factoid answer (a short span such as a named entity) as input. Our system uses constituency and dependency parse trees of the question. A transformer-based grammatical error correction model, GECToR (2020), is used as a post-processing step for better fluency. We compare our system against (i) a modified pointer-generator (SOTA) and (ii) a fine-tuned dialogue model (DialoGPT). We also test our approach on yes-no questions with promising results. Our model generates more accurate and fluent answers than the state-of-the-art (SOTA) approaches. Evaluation is done on the NewsQA and SQuAD datasets, with scores improving by 0.4 and 0.9 percentage points, respectively. Inference time is also reduced by 85% compared to the SOTA. The improved datasets used for our evaluation will be released as part of this research contribution.
Large pre-trained language models have recently conquered the area of natural language processing. As an alternative to the predominant masked language modelling introduced in BERT, the T5 model introduced a more general training objective, namely sequence-to-sequence transformation, which includes masked language modelling but more naturally fits text generation tasks such as machine translation, summarization, open-domain question answering, text simplification, dialogue systems, etc. The monolingual variants of the T5 model are limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. In contrast, we trained two different-sized T5-type sequence-to-sequence models for the morphologically rich Slovene language using much less resources and analyzed their behaviour. Concerning classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model, but they should be considered for generative tasks.
Questions asked by humans during a conversation often contain contextual dependencies, i.e., explicit or implicit references to previous dialogue turns. These dependencies take the form of coreferences (e.g., via pronoun use) or ellipses, and can make understanding difficult for automated systems. One way to facilitate the understanding and subsequent treatment of such questions is to rewrite them into out-of-context forms, i.e., forms that can be understood without the conversational context. We propose CoQAR, a corpus containing 4.5K conversations from the conversational question answering dataset CoQA, for a total of 53K follow-up question-answer pairs. Each original question was manually annotated with at least 2 out-of-context rewritings. CoQAR can be used for the supervised training of three tasks: question paraphrasing, question rewriting, and conversational question answering. In order to assess the quality of CoQAR's rewritings, we conducted several experiments consisting in training and evaluating models for these three tasks. Our results support the idea that question rewriting can be used as a preprocessing step for question answering models, thereby improving their performance.
Information extraction from scholarly articles is a challenging task due to the sizable document length and implicit information hidden in text, figures, and citations. Scholarly information extraction has various applications in exploration, archival, and curation services for digital libraries and knowledge management systems. We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles. Our approach condenses the article's full text to property-value pairs as a segmented text snippet called a structured summary. We also present a sizable scholarly dataset combining structured summaries retrieved from a scholarly knowledge graph and corresponding publicly available scientific articles, which we openly publish as a resource for the research community. Our results show that structured summarization is a suitable approach for targeted information extraction that complements other commonly used methods such as question answering and named entity recognition.
Automatic question answering is an important yet challenging task in e-commerce, where users post millions of questions about products they are interested in purchasing. Hence, there is great demand for automatic answer generation systems that provide quick responses using related information about the product. There are three sources of knowledge available for answering a user-posted query: reviews, duplicate or similar questions, and specifications. Effectively utilizing these sources would greatly help us answer complex questions. However, there are two main challenges in exploiting them: (i) the presence of irrelevant information and (ii) the sentiment ambiguity of reviews and similar questions. Through this work, we propose a novel pipeline (MSQAP) that utilizes the rich information present in the above sources by separately performing relevancy and ambiguity prediction before generating a response. Experimental results show that our relevancy prediction model (BERT-QA) outperforms all other variants, with an improvement of 12.36% in F1 score over the BERT baseline. Our generation model (T5-QA) outperforms the baselines on all content-preservation metrics such as BLEU and ROUGE, with average improvements of 35.02% and 198.75% on these metrics over the highest-performing baseline (HSSC-q). Human evaluation of our pipeline shows that our method yields a 30.7% improvement in accuracy over the generation model (T5-QA), so that our full pipeline approach (MSQAP) provides more accurate answers. To the best of our knowledge, this is the first work in the e-commerce domain that automatically generates natural language answers by combining information from specifications, similar questions, and review data.
The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models, whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. Our work proposes a new conceptualization of text plans as a sequence of question-answer (QA) pairs. We enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for both content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and converting input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint into the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives that do not resort to planning and allow tighter control of the generated output.
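One way to picture the input-blueprint-output tuples described above is the following sketch; the "<q> ... <a> ..." linearization and the example texts are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: turning a (document, summary) pair plus automatically generated QA
# pairs into an input-blueprint-output training tuple for a seq2seq model.
from dataclasses import dataclass

@dataclass
class BlueprintExample:
    document: str
    qa_blueprint: list[tuple[str, str]]  # ordered (question, answer) pairs
    summary: str

    def to_seq2seq_pair(self) -> tuple[str, str]:
        # The blueprint is prepended to the target so the decoder first plans
        # (what to say, in what order) and then realizes the summary.
        plan = " ".join(f"<q> {q} <a> {a}" for q, a in self.qa_blueprint)
        return self.document, f"{plan} <summary> {self.summary}"

example = BlueprintExample(
    document="The city council approved the new bike-lane budget on Tuesday.",
    qa_blueprint=[("What was approved?", "the new bike-lane budget"),
                  ("When was it approved?", "Tuesday")],
    summary="The council approved the bike-lane budget on Tuesday.",
)
source, target = example.to_seq2seq_pair()
print(target)
```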
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprised of 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/google/BEGIN-dataset.
Existing metrics for evaluating the quality of automatically generated questions such as BLEU, ROUGE, BERTScore, and BLEURT compare the reference and predicted questions, providing a high score when there is a considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering and a span scorer module, in which we use pre-trained models from the existing literature, and therefore, our metric can be used without further training. We show that RQUGE has a higher correlation with human judgment without relying on the reference question. RQUGE is shown to be significantly more robust to several adversarial corruptions. Additionally, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on the synthetic data generated by a question generation model and re-ranked by RQUGE.
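A rough sketch of the answerability idea behind such a reference-free metric follows; the QA checkpoint and the token-overlap scorer are stand-ins (RQUGE uses its own QA module and a learned span scorer), and the example texts are made up.

```python
# Sketch: score a candidate question by answering it against the context and
# comparing the predicted span with the gold answer. Token-level F1 stands in
# for a learned span scorer; the QA checkpoint is a generic stand-in.
from transformers import pipeline

def token_f1(prediction: str, gold: str) -> float:
    pred, gold_toks = prediction.lower().split(), gold.lower().split()
    common = sum(min(pred.count(t), gold_toks.count(t)) for t in set(pred))
    if not pred or not gold_toks or common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = "Marie Curie received the Nobel Prize in Physics in 1903."
gold_answer = "1903"
candidate_question = "In which year did Marie Curie receive the Nobel Prize in Physics?"

predicted = qa(question=candidate_question, context=context)["answer"]
print("answerability score:", token_f1(predicted, gold_answer))
```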
Existing general-purpose machine translation and natural language generation evaluation metrics have several issues in the context of question answering (QA) systems. To build robust QA systems, we need an equally robust evaluation system that can verify whether a model's prediction for a question is similar to the ground-truth annotation. The ability to compare similarity based on semantics rather than pure string overlap is important for comparing models fairly and for pointing to more realistic acceptance criteria in real-life applications. We build on the first work, to the best of our knowledge, that uses transformer-based model metrics to assess semantic answer similarity and achieves higher correlation with human judgement in the absence of lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US public figures. As far as we are aware, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
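A small sketch of the bi-encoder vs. cross-encoder contrast for semantic answer similarity; the sentence-transformers checkpoints below are general-purpose stand-ins, not the models trained on the paper's name-pair dataset, and the answer strings are made-up examples.

```python
# Sketch: scoring semantic similarity between a gold answer and a predicted
# answer with (a) a bi-encoder and (b) a cross-encoder. The checkpoints are
# generic sentence-transformers models, not the paper's fine-tuned ones.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

gold_answer = "Joe Biden"
predicted_answer = "Joseph R. Biden Jr."   # little lexical overlap, same referent

# Bi-encoder: embed both strings independently, then compare with cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = bi_encoder.encode([gold_answer, predicted_answer], convert_to_tensor=True)
print("bi-encoder cosine:", util.cos_sim(emb[0], emb[1]).item())

# Cross-encoder: score the pair jointly in a single forward pass.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
print("cross-encoder score:", cross_encoder.predict([(gold_answer, predicted_answer)])[0])
```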
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
The quest for health information has flooded the web with consumers' health-related questions. Consumers often use overly descriptive and peripheral information to express their medical condition or other healthcare needs, which adds to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and extract the key information of the original question. To address this issue, we introduce a new dataset, CHQ-Summ, that contains 1507 domain-expert-annotated consumer health questions and corresponding summaries. The dataset is derived from a community question-answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark multiple state-of-the-art summarization models on the dataset to demonstrate its effectiveness.
In recent years, low-resource machine reading comprehension (MRC) has made significant progress, with models achieving remarkable performance on datasets in various languages. However, none of these models have been tailored for the Urdu language. This work explores the semi-automated creation of a dataset for Urdu question answering (UQuAD1.0) by combining a machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset for the extractive machine reading comprehension task, consisting of 49K question-answer pairs in question, passage, and answer format. In UQuAD1.0, 45,000 QA pairs were generated via machine translation of the original SQuAD1.0, and approximately 4,000 pairs via crowdsourcing. In this study, we used two types of MRC models: rule-based baselines and advanced Transformer-based models. However, we found that the latter outperform the others; thus, we decided to concentrate on Transformer-based architectures. Using XLM-RoBERTa and multilingual BERT, we obtain F1 scores of 0.66 and 0.63, respectively.
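The F1 scores cited above follow the standard SQuAD-style extractive evaluation; a minimal sketch of computing it with the `evaluate` library is shown below, where the prediction and reference strings are made-up examples.

```python
# Sketch: SQuAD-style exact-match and token-level F1, as typically used to
# report extractive MRC results such as the 0.66 / 0.63 F1 above.
import evaluate

squad_metric = evaluate.load("squad")
predictions = [{"id": "q1", "prediction_text": "لاہور"}]
references = [{"id": "q1", "answers": {"text": ["لاہور"], "answer_start": [17]}}]
print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```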
To enable building and testing models for long document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).