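Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks (column grounding, value grounding, and column-value mapping) and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set, Spider-Realistic, based on the Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. StruG brings significant improvement over BERT-Large in all settings. Compared with existing pretraining methods such as GraPPa, StruG achieves similar performance on Spider and outperforms all baselines on the more realistic sets. The Spider-Realistic dataset is available at https://doi.org/10.5281/zenodo.5205322.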
translated by 谷歌翻译
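Text-to-SQL parsing is an essential and challenging task. The goal of text-to-SQL parsing is to convert a natural language (NL) question into its corresponding Structured Query Language (SQL) query based on the evidence provided by relational databases. Early text-to-SQL parsing systems from the database community achieved noticeable progress at the cost of heavy human engineering and user interaction with the systems. In recent years, deep neural networks have significantly advanced this task through neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query. Subsequently, large pre-trained language models have taken the state of the art in text-to-SQL parsing to a new level. In this survey, we present a comprehensive review of deep learning approaches for text-to-SQL parsing. First, we introduce the text-to-SQL parsing corpora, which can be categorized as single-turn and multi-turn. Second, we provide a systematic overview of pre-trained language models and existing methods for text-to-SQL parsing. Third, we present the challenges faced by text-to-SQL parsing and explore some potential future directions in this field.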
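Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics of natural language into SQL queries and its practical application in building natural language interfaces to database systems. The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. Recent advances have addressed these challenges to different extents. However, a comprehensive survey of this task is still lacking. To this end, we review recent progress on text-to-SQL with respect to datasets, methods, and evaluation, and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions. We hope that this survey can serve as quick access to existing work and motivate future research.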
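The information in tables can be an important complement to text, making table-based question answering (QA) systems of great value. The intrinsic complexity of handling tables often adds an extra burden to both model design and data annotation. In this paper, we aim to develop a simple table-based QA model with minimal annotation effort. Because table-based QA requires both alignment between questions and tables and the ability to perform complicated reasoning over multiple table elements, we propose an omnivorous pretraining approach that consumes both natural and synthetic data to endow models with these respective abilities. Specifically, given freely available tables, we leverage retrieval to pair them with relevant natural sentences for mask-based pretraining, and we synthesize NL questions by converting SQL sampled from tables for pretraining with a QA loss. We perform extensive experiments in both few-shot and full settings, and the results clearly demonstrate the superiority of our model OmniTab: the best multitasking approach achieves an absolute gain of 16.2% and 2.7% in the 128-shot and full settings respectively, also establishing a new state of the art on WikiTableQuestions. Detailed ablations and analyses reveal different characteristics of natural and synthetic data, shedding light on future directions in omnivorous pretraining. Code, pretraining data, and pretrained models are available at https://github.com/jzbjyb/omnitab.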
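Recent progress in language model pre-training has achieved great success by leveraging large-scale unstructured textual data. However, applying pre-training to structured tabular data remains a challenge because large-scale, high-quality tabular data is not available. In this paper, we propose TAPEX to show that table pre-training can be achieved by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries and their execution outputs. TAPEX addresses the data scarcity challenge by guiding the language model to mimic a SQL executor on a diverse, large-scale, and high-quality synthetic corpus. We evaluate TAPEX on four benchmark datasets. Experimental results demonstrate that TAPEX outperforms previous table pre-training approaches by a large margin and achieves new state-of-the-art results on all of them. This includes improvements in weakly supervised WikiSQL denotation accuracy to 89.5% (+2.3%), WikiTableQuestions denotation accuracy to 57.5% (+4.8%), SQA denotation accuracy to 74.5% (+3.5%), and TabFact accuracy to 84.2% (+3.2%). To our knowledge, this is the first work to exploit table pre-training via synthetic executable programs and to achieve new state-of-the-art results on various downstream tasks.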
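As a rough illustration of how a synthetic "SQL executor" training pair could be built, the sketch below executes a query over a small table and pairs the query plus flattened table with the execution result. It is a minimal sketch, not the authors' pipeline: the function name `make_executor_example`, the table name `t`, and the `col : ... row i : ...` flattening format are all assumptions made for illustration, and SQLite stands in for whatever executor is actually used.

```python
import sqlite3


def make_executor_example(columns, rows, sql):
    """Build one (source, target) text pair for executor-style pre-training.

    Hypothetical helper: the table is loaded into an in-memory SQLite
    database under the assumed name `t`, the query is executed, and the
    result is serialized as the target string.
    """
    conn = sqlite3.connect(":memory:")
    try:
        col_defs = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f"CREATE TABLE t ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
        result = conn.execute(sql).fetchall()
    finally:
        conn.close()

    # Flatten the table into a single string (assumed format).
    header = " | ".join(columns)
    body = " ".join(
        f"row {i} : " + " | ".join(str(v) for v in row)
        for i, row in enumerate(rows, start=1)
    )
    source = f"{sql} col : {header} {body}"
    # The target is the execution output, i.e. what the model should mimic.
    target = ", ".join(str(v) for r in result for v in r)
    return source, target


source, target = make_executor_example(
    ["city", "population"],
    [("Oslo", 700000), ("Bergen", 285000)],
    "SELECT city FROM t WHERE population > 500000",
)
print(target)  # Oslo
```

A model trained on many such pairs sees only the text of `source` and must produce `target`, which is the sense in which it imitates the executor.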
The task of text-to-SQL is to convert a natural language question to its corresponding SQL query in the context of relational tables. Existing text-to-SQL parsers generate a "plausible" SQL query for an arbitrary user question, thereby failing to correctly handle problematic user questions. To formalize this problem, we conduct a preliminary study on the observed ambiguous and unanswerable cases in text-to-SQL and summarize them into 6 feature categories. Correspondingly, we identify the causes behind each category and propose requirements for handling ambiguous and unanswerable questions. Following this study, we propose a simple yet effective counterfactual example generation approach for the automatic generation of ambiguous and unanswerable text-to-SQL examples. Furthermore, we propose a weakly supervised model DTE (Detecting-Then-Explaining) for error detection, localization, and explanation. Experimental results show that our model achieves the best result on both real-world examples and generated examples compared with various baselines. We will release data and code for future research.
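Recently, pre-trained models have significantly improved the performance of various NLP tasks by leveraging large-scale text corpora to improve the contextual representation ability of neural networks. Large pre-trained language models have also been applied to table semantic parsing. However, existing pre-training approaches have not carefully explored the explicit interaction between a question and the corresponding database schema, which is a key ingredient for uncovering their semantic and structural correspondence. Furthermore, question-aware representation learning in the schema grounding context has received less attention in pre-training objectives. To alleviate these issues, this paper designs two novel pre-training objectives that impose the desired inductive bias on the learned representations for table pre-training. We further propose a schema-aware curriculum learning approach to mitigate the impact of noise and learn effectively from the pre-training data in an easy-to-hard manner. We evaluate our pre-trained framework by fine-tuning it on two benchmarks, Spider and SQUALL. The results demonstrate the effectiveness of our pre-training objectives and curriculum compared with a variety of baselines.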
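As the future moves toward data-centric decision-making, seamless access to databases is of utmost importance. There is extensive research on creating efficient text-to-SQL (Text2SQL) models for accessing data in databases. Natural language is one of the best interfaces for bridging the gap between data and results by accessing the database efficiently, especially for non-technical users. It opens the door to, and creates tremendous interest among, users whether or not they are well versed in technical skills or query languages. Even though numerous deep learning-based algorithms have been proposed or studied, it remains very challenging to build a generic model that solves data query problems using natural language in real-world scenarios. The reason is the use of different datasets in different studies, each of which comes with its own limitations and assumptions. At the same time, we lack a thorough understanding of these proposed models and their limitations with respect to the specific datasets they are trained on. In this paper, we try to present a holistic overview of 24 neural network models studied in the last few years, including their architectures involving convolutional neural networks, recurrent neural networks, pointer networks, reinforcement learning, generative models, and so on. We also give an overview of 11 datasets that are widely used to train models for Text2SQL technologies. We also discuss future application possibilities of Text2SQL technologies for seamless data queries.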
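Natural language interfaces to databases (NLIDBs), where users pose queries in natural language (NL), are crucial for enabling non-experts to gain insights from data. Developing such interfaces, by contrast, depends on experts who often code heuristics for mapping NL to SQL. Alternatively, NLIDBs based on machine learning models rely on supervised examples of NL-to-SQL mappings (NL-SQL pairs) used as training data. Such examples are again procured using experts, which typically involves more than a one-off interaction. Namely, each data domain in which the NLIDB is deployed may have different characteristics and therefore require either dedicated heuristics or domain-specific training examples. To this end, we propose an alternative approach for training machine learning-based NLIDBs using weak supervision. We use the recently proposed question decomposition representation called QDMR, an intermediate between NL and formal query languages. Recent work has shown that non-experts are generally successful in translating NL to QDMR. We consequently use NL-QDMR pairs, along with the question answers, as supervision for automatically synthesizing SQL queries. The NL questions and synthesized SQL are then used to train NL-to-SQL models, which we test on five benchmark datasets. Extensive experiments show that our solution, requiring zero expert annotations, performs competitively with models trained on expert-annotated data.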
We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task where different complex SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and the exact same programs in the train set and the test set. We experiment with various state-of-the-art models and the best model achieves only 12.4% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task are publicly available at https://yale-lily.github.io/spider.
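Recent research on text-to-SQL semantic parsing relies on either the parser itself or a simple heuristic-based approach to understand the natural language query (NLQ). When synthesizing a SQL query, no explicit semantic information about the NLQ is available to the parser, which leads to poor generalization performance. In addition, without lexical-level fine-grained query understanding, linking between the query and the database can only rely on fuzzy string matching, which leads to suboptimal performance in real applications. In view of this, we present a general-purpose, modular neural semantic parsing framework based on token-level fine-grained query understanding. Our framework consists of three modules: a named entity recognizer (NER), a neural entity linker (NEL), and a neural semantic parser (NSP). By jointly modeling the query and the database, the NER model analyzes user intent and identifies entities in the query. The NEL model links typed entities to the schema and cell values in the database. The parser model leverages the available semantic information and linking results and synthesizes tree-structured SQL queries based on a dynamically generated grammar. Experiments on SQUALL, a newly released semantic parsing dataset, show that we can achieve 56.8% execution accuracy on the WikiTableQuestions (WTQ) test set, which outperforms the state-of-the-art model by 2.7%.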
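The importance of building text-to-SQL parsers that can be applied to new databases has long been acknowledged, and a critical step toward this goal is schema linking, i.e., properly recognizing mentions of unseen columns or tables when generating SQL. In this work, we propose a novel framework to elicit relational structures from large-scale pre-trained language models (PLMs) via a probing procedure based on the Poincaré distance metric, and use the induced relations to augment current graph-based parsers for better schema linking. Compared with commonly used rule-based methods for schema linking, we find that probing relations can robustly capture semantic correspondences, even when the surface forms of mentions and entities differ. Moreover, our probing procedure is entirely unsupervised and requires no additional parameters. Extensive experiments show that our framework sets new state-of-the-art performance on three benchmarks. We empirically verify through qualitative analysis that our probing procedure can indeed find the desired relational structures.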
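For readers unfamiliar with the metric named above, the standard Poincaré distance between two points $u, v$ inside the unit ball is $d(u, v) = \operatorname{arcosh}\left(1 + \frac{2\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$. The sketch below computes only this textbook formula; how the abstract's probing procedure applies it to PLM representations is not stated there, so nothing here should be read as the authors' implementation. The function name is illustrative.

```python
import math


def poincare_distance(u, v):
    """Standard Poincaré-ball distance between two points with norm < 1."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("both points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Distance from the origin to (0.5, 0) is arcosh(5/3) = ln 3.
print(poincare_distance([0.0, 0.0], [0.5, 0.0]))
```

Distances grow rapidly as points approach the boundary of the ball, which is why this geometry is often used for hierarchical or tree-like relations.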
Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed two shortcomings: illogical synthetic SQL queries from independent column sampling and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show that these models have significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
The robustness of Text-to-SQL parsers against adversarial perturbations plays a crucial role in delivering highly reliable applications. Previous studies along this line primarily focused on perturbations in the natural language question side, neglecting the variability of tables. Motivated by this, we propose the Adversarial Table Perturbation (ATP) as a new attacking paradigm to measure the robustness of Text-to-SQL models. Following this proposition, we curate ADVETA, the first robustness evaluation benchmark featuring natural and realistic ATPs. All tested state-of-the-art models experience dramatic performance drops on ADVETA, revealing models' vulnerability in real-world practices. To defend against ATP, we build a systematic adversarial training example generation framework tailored for better contextualization of tabular data. Experiments show that our approach not only brings the best robustness improvement against table-side perturbations but also substantially empowers models against NL-side perturbations. We release our benchmark and code at: https://github.com/microsoft/ContextualSP.
Current SQL generators based on pre-trained language models struggle to answer complex questions requiring domain context or understanding fine-grained table structure. Humans would deal with these unknowns by reasoning over the documentation of the tables. Based on this hypothesis, we propose DocuT5, which uses off-the-shelf language model architecture and injects knowledge from external `documentation' to improve domain generalization. We perform experiments on the Spider family of datasets that contain complex questions that are cross-domain and multi-table. Specifically, we develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes, and 49.2% are due to a lack of domain knowledge. We proposed DocuT5, a method that captures knowledge from (1) table structure context of foreign keys and (2) domain knowledge through contextualizing tables and columns. Both types of knowledge improve over state-of-the-art T5 with constrained decoding on Spider, and domain knowledge produces state-of-the-art comparable effectiveness on Spider-DK and Spider-SYN datasets.
Structured tabular data exist across nearly all fields. Reasoning task over these data aims to answer questions or determine the truthiness of hypothesis sentences by understanding the semantic meaning of a table. While previous works have devoted significant efforts to the tabular reasoning task, they always assume there are sufficient labeled data. However, constructing reasoning samples over tables (and related text) is labor-intensive, especially when the reasoning process is complex. When labeled data is insufficient, the performance of models will suffer an unendurable decline. In this paper, we propose a unified framework for unsupervised complex tabular reasoning (UCTR), which generates sufficient and diverse synthetic data with complex logic for tabular reasoning tasks, assuming no human-annotated data at all. We first utilize a random sampling strategy to collect diverse programs of different types and execute them on tables based on a "Program-Executor" module. To bridge the gap between the programs and natural language sentences, we design a powerful "NL-Generator" module to generate natural language sentences with complex logic from these programs. Since a table often occurs with its surrounding texts, we further propose novel "Table-to-Text" and "Text-to-Table" operators to handle joint table-text reasoning scenarios. This way, we can adequately exploit the unlabeled table resources to obtain a well-performed reasoning model under an unsupervised setting. Our experiments cover different tasks (question answering and fact verification) and different domains (general and specific), showing that our unsupervised methods can achieve at most 93% performance compared to supervised models. We also find that it can substantially boost the supervised performance in low-resourced domains as a data augmentation technique. Our code is available at https://github.com/leezythu/UCTR.
Fact verification has attracted a lot of research attention recently, e.g., in journalism, marketing, and policymaking, as misinformation and disinformation online can sway one's opinion and affect one's actions. While fact-checking is a hard task in general, in many cases, false statements can be easily debunked based on analytics over tables with reliable information. Hence, table-based fact verification has recently emerged as an important and growing research area. Yet, progress has been limited due to the lack of datasets that can be used to pre-train language models (LMs) to be aware of common table operations, such as aggregating a column or comparing tuples. To bridge this gap, in this paper we introduce PASTA, a novel state-of-the-art framework for table-based fact verification via pre-training with synthesized sentence-table cloze questions. In particular, we design six types of common sentence-table cloze tasks, including Filter, Aggregation, Superlative, Comparative, Ordinal, and Unique, based on which we synthesize a large corpus consisting of 1.2 million sentence-table pairs from WikiTables. PASTA uses a recent pre-trained LM, DeBERTaV3, and further pretrains it on our corpus. Our experimental results show that PASTA achieves new state-of-the-art performance on two table-based fact verification benchmarks: TabFact and SEM-TAB-FACTS. In particular, on the complex set of TabFact, which contains multiple operations, PASTA largely outperforms the previous state of the art by 4.7 points (85.6% vs. 80.9%), and the gap between PASTA and human performance on the small TabFact test set is narrowed to just 1.5 points (90.6% vs. 92.1%).
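Recent advances in deep learning have greatly propelled research on semantic parsing. Improvements have since been made in many downstream tasks, including natural language interfaces to web APIs and text-to-SQL generation, among others. However, despite its close connection to these tasks, research on question answering over knowledge bases (KBQA) has progressed comparatively slowly. We identify and attribute this to two unique challenges of KBQA: schema-level complexity and fact-level complexity. In this survey, we situate KBQA in the broader literature of semantic parsing and give a comprehensive account of how existing KBQA approaches attempt to address these unique challenges. Regardless of the unique challenges, we argue that we can still take much inspiration from the semantic parsing literature, which has been overlooked by existing research on KBQA. Based on our discussion, we can better understand the bottlenecks of current KBQA research and shed light on promising directions for KBQA to keep up with the semantic parsing literature, particularly in the era of pre-trained language models.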
In this paper, we study the problem of knowledge-intensive text-to-SQL, in which domain knowledge is necessary to parse expert questions into SQL queries over domain-specific tables. We formalize this scenario by building a new Chinese benchmark KnowSQL consisting of domain-specific questions covering various domains. We then address this problem by presenting formulaic knowledge, rather than by annotating additional data examples. More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing. Experiments using ReGrouP demonstrate a significant 28.2% improvement overall on KnowSQL.
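Generalizability to new databases is of vital importance to text-to-SQL systems, which aim to parse human utterances into SQL statements. Existing works achieve this goal by using exact matching to identify lexical matches between question words and schema items. However, these methods fail in more challenging scenarios, such as synonym substitution, in which the surface forms of corresponding question words and schema items differ. In this paper, we propose a framework named ISESL-SQL to iteratively build a semantically enhanced schema-linking graph between question tokens and database schemas. First, we extract a schema-linking graph from PLMs through a probing procedure in an unsupervised manner. Then, the schema-linking graph is further optimized during training through a deep graph learning method. Meanwhile, we also design an auxiliary task called graph regularization to improve the schema information mentioned in the schema-linking graph. Extensive experiments on three benchmarks demonstrate that ISESL-SQL consistently outperforms the baselines, and further investigations show its generalizability and robustness.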
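The limitation of exact matching described above can be made concrete with a small sketch. The code is a generic illustration of exact-match schema linking, not the ISESL-SQL method or any specific baseline: the function name `exact_match_links`, the whitespace tokenization, and the example schema are all assumptions.

```python
def exact_match_links(question, schema_items):
    """Link question token spans to schema items by exact lexical match.

    Returns (start, end, item) triples, where [start, end) indexes the
    whitespace tokens of the lower-cased question. Illustrative only.
    """
    tokens = question.lower().replace("?", " ").split()
    links = []
    for item in schema_items:
        # Treat underscores in schema names as word separators.
        words = item.lower().replace("_", " ").split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                links.append((i, i + n, item))
    return links


schema = ["singer", "age", "song_name"]

# Surface forms match, so both mentions are linked.
print(exact_match_links("What is the age of each singer?", schema))
# [(6, 7, 'singer'), (3, 4, 'age')]

# Synonym substitution: same intent, but no lexical overlap, so no links.
print(exact_match_links("How old is each vocalist?", schema))
# []
```

The second query shows the failure case: "old" and "vocalist" refer to the same columns but are never linked, which is the gap that probing-based or learned linking aims to close.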