Recent research shows that synthetic data as a source of supervision helps pretrained language models (PLMs) transfer learning to new target tasks/domains. However, this idea is less explored for spatial language. We provide two new data resources on multiple spatial language processing tasks. The first dataset is synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL). Compared to previous SQA datasets, we include a larger variety of spatial relation types and spatial expressions. Our data generation process is easily extendable with new spatial expression lexicons. The second one is a real-world SQA dataset with human-generated questions built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations. We show that pretraining with automatically generated data significantly improves the SOTA results on several SQA and SpRL benchmarks, particularly when the training data in the target domain is small.
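Question answering over temporal knowledge graphs (TKGQA) has recently attracted growing interest. TKGQA requires temporal reasoning techniques to extract the relevant information from temporal knowledge bases. The only existing TKGQA dataset, CronQuestions, consists of temporal questions based on facts from a fixed time period, where a temporal knowledge graph (TKG) spanning the same period can be fully used for answer inference, allowing TKGQA models to use even future knowledge to answer questions based on past facts. In real-world scenarios, however, it is also common that, given the knowledge available so far, we want TKGQA systems to answer questions about the future. As humans constantly seek plans for the future, building TKGQA systems for answering such forecasting questions is important. Nevertheless, this has remained unexplored in previous research. In this paper, we propose a novel task: forecasting question answering over temporal knowledge graphs. We also propose a large-scale TKGQA benchmark dataset for this task, ForecastTKGQuestions. It includes three types of questions, i.e., entity prediction, yes-no, and fact reasoning questions. For every forecasting question in our dataset, QA models can only access the TKG information before the timestamp annotated in the given question for answer inference. We find that state-of-the-art TKGQA methods perform poorly on forecasting questions, and that they are unable to answer yes-no questions and fact reasoning questions. To this end, we propose ForecastTKGQA, a TKGQA model that employs a TKG forecasting module for future inference, to answer all three types of questions. Experimental results show that ForecastTKGQA outperforms recent TKGQA methods on the entity prediction questions, and that it is also highly effective in answering the other two types of questions.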
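The access restriction described above (a QA model may only use TKG facts dated before the timestamp annotated in the question) can be sketched as a simple filter. This is an illustrative sketch only: the `Fact` layout, the `facts_before` helper, the integer timestamps, and the toy facts are all assumptions made for the example, not part of the dataset or its tooling.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    # Hypothetical quadruple layout for a temporal fact.
    subject: str
    relation: str
    obj: str
    timestamp: int  # assumed integer time index


def facts_before(tkg, question_time):
    """Return only the facts strictly earlier than the question timestamp."""
    return [fact for fact in tkg if fact.timestamp < question_time]


# Toy graph (invented for illustration).
toy_tkg = [
    Fact("A", "visits", "B", 1),
    Fact("B", "negotiates_with", "C", 3),
    Fact("A", "signs_agreement_with", "C", 5),
]

# A forecasting question annotated with timestamp 5 may only see earlier facts.
visible = facts_before(toy_tkg, question_time=5)
```

In this sketch the fact at the question's own timestamp is hidden as well, which matches a strict "before the timestamp" reading; a benchmark could define the boundary differently.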
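Web search is an essential way for humans to obtain information, but it is still a great challenge for machines to understand the contents of web pages. In this paper, we introduce the task of structural reading comprehension (SRC) on the web. Given a web page and a question about it, the task is to find the answer from the web page. This task requires a system to understand not only the semantics of the text but also the structure of the web page. Moreover, we propose WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs collected from 6.4K web pages. Along with the QA pairs, our dataset also provides the corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and baselines are publicly available at https://x-lance.github.io/websrc/.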
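Knowledge-enriched language representation learning has shown promising performance across various knowledge-intensive NLP tasks. However, existing knowledge-based language models are all trained with monolingual knowledge graph data, which limits their application to more languages. In this work, we present a novel framework to pretrain knowledge-based multilingual language models (KMLMs). We first generate a large amount of code-switched synthetic sentences and reasoning-based multilingual training data using the Wikidata knowledge graphs. Then, based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to facilitate knowledge learning, which allows the language models not only to memorize factual knowledge but also to learn useful logical patterns. Our pretrained KMLMs demonstrate significant performance improvements on a wide range of knowledge-intensive cross-lingual tasks, including named entity recognition, factual knowledge retrieval, relation classification, and a new task designed by us, namely logical reasoning. Our code and pretrained language models will be made publicly available.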
Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HOTPOTQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HOTPOTQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present COMMONSENSEQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from CONCEPTNET (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.
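Recently, there has been increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, contain objects that are co-referential with the text, with complementary information coming from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to, and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark and show that they achieve promising performance while lagging considerably behind human performance, leaving large room for future work on this challenging new task.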
Pre-trained Language Models (PLMs), which are trained on large text corpora through self-supervised learning, have yielded promising performance on various tasks in Natural Language Processing (NLP). However, although PLMs with huge numbers of parameters can effectively capture rich knowledge learned from massive training text and benefit downstream tasks at the fine-tuning stage, they still have limitations, such as poor reasoning ability due to the lack of external knowledge. Researchers have tried incorporating knowledge into PLMs to tackle these issues. In this paper, we present a comprehensive review of Knowledge-Enhanced Pre-trained Language Models (KE-PLMs) to provide a clear insight into this thriving field. We introduce appropriate taxonomies for Natural Language Understanding (NLU) and Natural Language Generation (NLG), respectively, to highlight the focus of these two kinds of tasks. For NLU, we take several types of knowledge into account and divide them into four categories: linguistic knowledge, text knowledge, knowledge graph (KG), and rule knowledge. The KE-PLMs for NLG are categorized into KG-based and retrieval-based methods. Finally, we point out some promising future directions of KE-PLMs.
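Knowledge base question answering (KBQA) aims to answer a question over a knowledge base (KB). Early studies mainly focused on answering simple questions over KBs and achieved great success. However, their performance on complex questions is still far from satisfactory. Therefore, in recent years, researchers have proposed a large number of novel methods that look into the challenges of answering complex questions. In this survey, we review recent advances in KBQA with a focus on solving complex questions, which usually contain multiple subjects, express compound relations, or involve numerical operations. In detail, we begin by introducing the complex KBQA task and the relevant background. Then, we describe benchmark datasets for the complex KBQA task and introduce the construction process of these datasets. Next, we present two mainstream categories of methods for complex KBQA, namely semantic parsing-based (SP-based) methods and information retrieval-based (IR-based) methods. Specifically, we illustrate their procedures with flow designs and discuss their major differences and similarities. After that, we summarize the challenges that these two categories of methods encounter when answering complex questions, and explain the advanced solutions and techniques used in existing work. Finally, we conclude and discuss several promising directions related to complex KBQA for future research.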
Machine reading comprehension (MRC) is a long-standing topic in natural language processing (NLP). The MRC task aims to answer a question based on a given context. Recent studies focus on multi-hop MRC, a more challenging extension of MRC in which answering a question requires combining several disjoint pieces of information spread across the context. Due to the complexity and importance of multi-hop MRC, a large number of studies have focused on this topic in recent years, so a review of the related literature is both necessary and worthwhile. This study aims to investigate recent advances in multi-hop MRC approaches based on 31 studies from 2018 to 2022. To this end, the multi-hop MRC problem definition is introduced first; then 31 models are reviewed in detail, with a strong focus on their multi-hop aspects. The models are also categorized based on their main techniques. Finally, a fine-grained, comprehensive comparison of the models and techniques is presented.
Triplet extraction aims to extract entities and their corresponding relations in unstructured text. Most existing methods train an extraction model on high-quality training data, and hence are incapable of extracting relations that were not observed during training. Generalizing the model to unseen relations typically requires fine-tuning on synthetic training data, which is often noisy and unreliable. In this paper, we argue that reducing triplet extraction to a template filling task over a pre-trained language model can equip the model with zero-shot learning capabilities and enable it to leverage the implicit knowledge in the language model. Embodying these ideas, we propose a novel framework, ZETT (ZEro-shot Triplet extraction by Template infilling), that is based on end-to-end generative transformers. Our experiments show that without any data augmentation or pipeline systems, ZETT can outperform previous state-of-the-art models with 25% fewer parameters. We further show that ZETT is more robust in detecting entities and can be combined with automatically generated templates for relations.
Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA.
Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or "chain-of-thought" (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP), which learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.
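For readers unfamiliar with backward chaining, the idea the abstract refers to can be sketched in a few lines. This is a minimal illustration, not the LMLP method: atoms are ground strings with no variables or unification (a simplification of what Prolog does), and the `backward_chain` function, the rule format, and the toy knowledge base are all assumptions made for the example.

```python
def backward_chain(goal, facts, rules, visiting=None):
    """Return True if `goal` can be derived from `facts` using `rules`.

    `rules` maps a head atom to a list of bodies; each body is a list of
    atoms that must all hold for the head to hold.
    """
    if visiting is None:
        visiting = frozenset()
    if goal in facts:
        return True
    if goal in visiting:
        # The goal is already being proven further up the call chain,
        # so stop here to avoid looping forever on cyclic rules.
        return False
    visiting = visiting | {goal}
    # Work backwards: try each rule whose head matches the goal and
    # recursively prove every atom in its body.
    for body in rules.get(goal, []):
        if all(backward_chain(sub, facts, rules, visiting) for sub in body):
            return True
    return False


# Toy knowledge base (invented for illustration).
facts = {"parent(ann,bob)", "parent(bob,cid)"}
rules = {
    "grandparent(ann,cid)": [["parent(ann,bob)", "parent(bob,cid)"]],
    "ancestor(ann,cid)": [["grandparent(ann,cid)"]],
}

provable = backward_chain("ancestor(ann,cid)", facts, rules)
```

Starting from the goal and expanding rule bodies until known facts are reached is what distinguishes backward chaining from forward chaining, which starts from the facts and derives everything it can.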
Natural Language Processing (NLP) has been revolutionized by the use of Pre-trained Language Models (PLMs) such as BERT. Despite setting new records in nearly every NLP task, PLMs still face a number of challenges including poor interpretability, weak reasoning capability, and the need for a lot of expensive annotated data when applied to downstream tasks. By integrating external knowledge into PLMs, Knowledge-Enhanced Pre-trained Language Models (KEPLMs) have the potential to overcome the above-mentioned limitations. In this paper, we examine KEPLMs systematically through a series of studies. Specifically, we outline the common types and different formats of knowledge to be integrated into KEPLMs, detail the existing methods for building and evaluating KEPLMs, present the applications of KEPLMs in downstream tasks, and discuss the future research directions. Researchers will benefit from this survey by gaining a quick and comprehensive overview of the latest developments in this field.
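While neural language models often perform surprisingly well on natural language understanding (NLU) tasks, their strengths and limitations remain poorly understood. Controlled synthetic tasks are thus an increasingly important resource for diagnosing model behavior. In this work we focus on story understanding, a core competency for NLU systems. However, the main synthetic resource for story understanding, the bAbI benchmark, lacks a systematic mechanism for controllable task generation. We develop Dyna-bAbI, a dynamic framework providing fine-grained control over task generation in bAbI. We demonstrate our ideas by constructing three new tasks requiring compositional generalization, an important evaluation setting absent from the original benchmark. We tested both special-purpose models developed for bAbI and state-of-the-art pre-trained methods, and found that while both approaches solve the original tasks (>99% accuracy), neither succeeded in the compositional generalization setting, indicating the limitations of the original training data. We explored ways to augment the original data and found that, although diversifying the training data was more useful than simply increasing dataset size, it was still insufficient for driving robust compositional generalization (with <70% accuracy for complex compositions). Our results underscore the importance of highly controllable task generators for creating robust NLU systems through a virtuous cycle of model and data development.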
For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.
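Most existing approaches for knowledge base question answering (KBQA) focus on a specific underlying knowledge base, either because of inherent assumptions in the approach or because evaluating it on a different knowledge base requires non-trivial changes. However, many popular knowledge bases share similarities in their underlying schemas that can be leveraged to facilitate generalization across knowledge bases. To achieve this generalization, we introduce a KBQA framework based on a 2-stage architecture that explicitly separates semantic parsing from the knowledge base interaction, facilitating transfer learning across datasets and knowledge graphs. We show that pretraining on datasets with a different underlying knowledge base can provide significant performance gains and reduce sample complexity. Our approach achieves comparable or state-of-the-art performance for LC-QuAD (DBpedia), WebQSP (Freebase), SimpleQuestions (Wikidata) and MetaQA (WikiMovies-KG).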
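Constructing query graphs from natural language questions is an important step in answering complex questions over knowledge graphs (complex KGQA). In general, a question can be answered correctly if its query graph is built correctly, since the right answer is then retrieved by issuing the query graph against the KG. Therefore, this paper focuses on query graph generation from natural language questions. Existing approaches for query graph generation ignore the semantic structure of a question, resulting in a large number of noisy query graph candidates that undermine prediction accuracy. In this paper, we define six semantic structures from common questions in KGQA and develop a novel Structure-BERT to predict the semantic structure of a question. By doing so, we can first filter out noisy candidate query graphs and then rank the remaining candidates with a BERT-based ranking model. Extensive experiments on two popular benchmarks, MetaQA and WebQuestionsSP (WSP), demonstrate the effectiveness of our method compared to the state of the art.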
Structured tabular data exist across nearly all fields. Reasoning tasks over these data aim to answer questions or determine the truthfulness of hypothesis sentences by understanding the semantic meaning of a table. While previous works have devoted significant efforts to the tabular reasoning task, they always assume there are sufficient labeled data. However, constructing reasoning samples over tables (and related text) is labor-intensive, especially when the reasoning process is complex. When labeled data is insufficient, the performance of models declines severely. In this paper, we propose a unified framework for unsupervised complex tabular reasoning (UCTR), which generates sufficient and diverse synthetic data with complex logic for tabular reasoning tasks, assuming no human-annotated data at all. We first utilize a random sampling strategy to collect diverse programs of different types and execute them on tables based on a "Program-Executor" module. To bridge the gap between the programs and natural language sentences, we design a powerful "NL-Generator" module to generate natural language sentences with complex logic from these programs. Since a table often occurs with its surrounding texts, we further propose novel "Table-to-Text" and "Text-to-Table" operators to handle joint table-text reasoning scenarios. This way, we can adequately exploit the unlabeled table resources to obtain a well-performing reasoning model under an unsupervised setting. Our experiments cover different tasks (question answering and fact verification) and different domains (general and specific), showing that our unsupervised methods can achieve up to 93% of the performance of supervised models. We also find that, as a data augmentation technique, it can substantially boost supervised performance in low-resource domains. Our code is available at https://github.com/leezythu/UCTR.
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages Visual Genome scene graph structures to create 22M diverse reasoning questions, which all come with functional programs that represent their semantics. We use the programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate question biases. Accompanying the dataset is a suite of new metrics that evaluate essential qualities such as consistency, grounding and plausibility. A careful analysis is performed for baselines as well as state-of-the-art models, providing fine-grained results for different question types and topologies. Whereas a blind LSTM obtains a mere 42.1%, and strong VQA models achieve 54.1%, human performance tops at 89.3%, offering ample opportunity for new research to explore. We hope GQA will provide an enabling resource for the next generation of models with enhanced robustness, improved consistency, and deeper semantic understanding of vision and language.