Recently, there has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we first examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed two shortcomings: illogical synthetic SQL queries from independent column sampling and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, our experiments show that these models have significant accuracy boosts on popular benchmarks, including new state-of-the-art performance on Spider.
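Semantic parsing datasets are expensive to collect. Moreover, even the questions pertinent to a given domain, which are the input of a semantic parsing system, might not be readily available, especially in cross-domain semantic parsing. This makes data augmentation even more challenging. Existing methods synthesize new data using hand-crafted or induced rules, which require substantial engineering effort and linguistic expertise to achieve good coverage and precision, and this limits scalability. In this work, we propose a purely neural approach to data augmentation for semantic parsing that completely removes the need for grammar engineering while achieving higher semantic parsing accuracy. Furthermore, our method can synthesize data in the zero-shot setting, where only a new domain schema is available without any input-output examples from the new domain. On the Spider cross-domain text-to-SQL semantic parsing benchmark, we achieve state-of-the-art performance on the development set (77.2% accuracy) using our zero-shot augmentation.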
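Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics of natural language into SQL queries and its practical application in building natural language interfaces to database systems. The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. Recent advances have addressed these challenges to different extents. However, a comprehensive survey of this task is still lacking. To this end, we review recent progress on text-to-SQL in terms of datasets, methods, and evaluation, and provide this systematic survey, which addresses the aforementioned challenges and discusses potential future directions. We hope that this survey can serve as quick access to existing work and motivate future research.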
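Recently, pre-trained models have significantly improved the performance of various NLP tasks by leveraging large-scale text corpora to improve the contextual representation ability of neural networks. Large pre-trained language models have also been applied in the area of table semantic parsing. However, existing pre-training approaches have not carefully explored the explicit interaction relationships between a question and the corresponding database schema, which are a key ingredient for uncovering their semantic and structural correspondence. Furthermore, question-aware representation learning in the schema grounding context has received less attention in pre-training objectives. To alleviate these issues, this paper designs two novel pre-training objectives that impose the desired inductive bias on the learned representations for table pre-training. We further propose a schema-aware curriculum learning approach to mitigate the impact of noise and to learn effectively from the pre-training data in an easy-to-hard manner. We evaluate our pre-trained framework by fine-tuning it on two benchmarks, Spider and SQUALL. The results demonstrate the effectiveness of our pre-training objectives and curriculum compared with a variety of baselines.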
Current SQL generators based on pre-trained language models struggle to answer complex questions requiring domain context or understanding fine-grained table structure. Humans would deal with these unknowns by reasoning over the documentation of the tables. Based on this hypothesis, we propose DocuT5, which uses an off-the-shelf language model architecture and injects knowledge from external "documentation" to improve domain generalization. We perform experiments on the Spider family of datasets, which contain complex questions that are cross-domain and multi-table. Specifically, we develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes and 49.2% are due to a lack of domain knowledge. DocuT5 captures knowledge from (1) the table structure context of foreign keys and (2) domain knowledge obtained by contextualizing tables and columns. Both types of knowledge improve over state-of-the-art T5 with constrained decoding on Spider, and domain knowledge produces effectiveness comparable to the state of the art on the Spider-DK and Spider-SYN datasets.
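Natural language interfaces to databases (NLIDBs), where users pose queries in natural language (NL), are crucial for enabling non-experts to gain insights from data. Developing such interfaces, by contrast, depends on experts who often code heuristics for mapping NL to SQL. Alternatively, NLIDBs based on machine learning models rely on supervised examples of NL-to-SQL mappings (NL-SQL pairs) used as training data. Such examples are again procured from experts, which typically involves more than a one-off interaction. Namely, each data domain in which the NLIDB is deployed may have different characteristics and therefore require either dedicated heuristics or domain-specific training examples. To this end, we propose an alternative approach for training machine learning-based NLIDBs using weak supervision. We use the recently proposed question decomposition representation called QDMR, an intermediate between NL and formal query languages. Recent work has shown that non-experts are generally successful in translating NL to QDMR. We consequently use NL-QDMR pairs, along with the question answers, as supervision for automatically synthesizing SQL queries. The NL questions and synthesized SQL are then used to train NL-to-SQL models, which we test on five benchmark datasets. Extensive experiments show that our solution, which requires zero expert annotations, performs competitively with models trained on expert-annotated data.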
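With data-centric decision-making shaping the future, seamless access to databases is essential. There is extensive research on creating effective text-to-SQL (Text2SQL) models for accessing data in databases. Natural language is one of the best interfaces for bridging the gap between data and results by enabling effective database access, especially for non-technical users. It opens the door to, and generates great interest among, users who are not well versed in technical skills or query languages. Even though many deep learning-based algorithms have been proposed or studied, using natural language to solve data query problems in real working scenarios remains very challenging. The reason is that different studies use different datasets, each of which comes with its own limitations and assumptions. At the same time, we lack a thorough understanding of these proposed models and of their limitations with respect to the specific datasets on which they are trained. In this paper, we present a holistic overview of 24 neural network models studied in recent years, including their architectures, which involve convolutional neural networks, recurrent neural networks, pointer networks, reinforcement learning, generative models, and others. We also give an overview of 11 datasets that are widely used to train models for Text2SQL technologies. Finally, we discuss future application possibilities of Text2SQL technologies for seamless data querying.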
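Text-to-SQL parsing is an essential and challenging task. The goal of text-to-SQL parsing is to convert a natural language (NL) question into its corresponding Structured Query Language (SQL) query based on the evidence provided by a relational database. Early text-to-SQL parsing systems from the database community achieved noticeable progress at the cost of heavy human engineering and user interaction with the systems. In recent years, deep neural networks have significantly advanced this task through neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query. Subsequently, large pre-trained language models have taken the state of the art in text-to-SQL parsing to a new level. In this survey, we present a comprehensive review of deep learning approaches to text-to-SQL parsing. First, we introduce the text-to-SQL parsing corpora, which can be categorized as single-turn and multi-turn. Second, we provide a systematic overview of pre-trained language models and existing methods for text-to-SQL parsing. Third, we present the challenges faced by text-to-SQL parsing and explore some potential future directions in this field.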
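This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural network-based approaches (called SUN). From the data uncertainty perspective, a single SQL query can be learned from multiple semantically equivalent questions. Unlike previous methods that are limited to one-to-one mapping, we propose a data uncertainty constraint to explore the underlying complementary semantic information among multiple semantically equivalent questions (many-to-one) and to learn robust feature representations with reduced spurious associations. In this way, we can reduce the sensitivity of the learned representations and improve the robustness of the parser. From the model uncertainty perspective, there is often structural information (dependence) among the weights of neural networks. To improve the generalizability and stability of neural text-to-SQL parsers, we propose a model uncertainty constraint that refines the query representations by enforcing the output representations of different perturbed encoding networks to be consistent with each other. Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms strong competitors and achieves new state-of-the-art results. For reproducibility, we release our code and data at https://github.com/alibabaresearch/damo-convai/tree/main/main/sunsql.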
We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task where different complex SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and the exact same programs in the train set and the test set. We experiment with various state-of-the-art models and the best model achieves only 12.4% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task are publicly available at https://yale-lily.github.io/spider.
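Automatic SQL generation has been an active research area that aims to streamline access to databases by letting users write natural language with a given intent instead of writing SQL. Current state-of-the-art (SOTA) methods for semantic parsing depend on LLMs to achieve high predictive accuracy on benchmark datasets. This reduces their applicability, since LLMs require expensive GPUs. Furthermore, SOTA methods are ungrounded and thus not guaranteed to always generate valid SQL. Here we propose T5QL, a new SQL generation method that improves performance on benchmark datasets when using smaller LMs (namely T5-Base) compared with SOTA methods. Additionally, T5QL is guaranteed to always output valid SQL, because it uses a context-free grammar to constrain SQL generation. Finally, we show that dividing semantic parsing into two tasks, candidate SQL generation and candidate re-ranking, is a promising research avenue that can reduce the need for large LMs.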
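Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks (column grounding, value grounding, and column-value mapping) and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set, Spider-Realistic, based on the Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. StruG brings significant improvement over BERT-Large in all settings. Compared with existing pretraining methods such as GRAPPA, StruG achieves similar performance on Spider and outperforms all baselines on the more realistic sets. The Spider-Realistic dataset is available at https://doi.org/10.5281/zenodo.5205322.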
Subject to the huge semantic gap between natural and formal languages, neural semantic parsing is typically bottlenecked by its complexity of dealing with both input semantics and output syntax. Recent works have proposed several forms of supplementary supervision but none is generalized across multiple formal languages. This paper proposes a unified intermediate representation (IR) for graph query languages, named GraphQ IR. It has a natural-language-like expression that bridges the semantic gap and formally defined syntax that maintains the graph structure. Therefore, a neural semantic parser can more precisely convert user queries into GraphQ IR, which can be later losslessly compiled into various downstream graph query languages. Extensive experiments on several benchmarks including KQA Pro, Overnight, GrailQA, and MetaQA-Cypher under standard i.i.d., out-of-distribution, and low-resource settings validate GraphQ IR's superiority over the previous state-of-the-arts with a maximum 11% accuracy improvement.
Parsing natural language questions into executable logical forms is a useful and interpretable way to perform question answering on structured data such as knowledge bases (KB) or databases (DB). However, existing approaches on semantic parsing cannot adapt to both modalities, as they suffer from the exponential growth of the logical form candidates and can hardly generalize to unseen data. In this work, we propose Uni-Parser, a unified semantic parser for question answering (QA) on both KB and DB. We introduce the primitive (relation and entity in KB, and table name, column name and cell value in DB) as an essential element in our framework. The number of primitives grows linearly with the number of retrieved relations in KB and DB, preventing us from dealing with exponential logic form candidates. We leverage the generator to predict final logical forms by altering and composing top-ranked primitives with different operations (e.g. select, where, count). With sufficiently pruned search space by a contrastive primitive ranker, the generator is empowered to capture the composition of primitives, enhancing its generalization ability. We achieve competitive results on multiple KB and DB QA benchmarks more efficiently, especially in the compositional and zero-shot settings.
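Recent advances in deep learning have greatly propelled research on semantic parsing. Improvements have since been made in many downstream tasks, including natural language interfaces to web APIs and text-to-SQL generation, among others. However, despite its close connection with these tasks, research on question answering over knowledge bases (KBQA) has progressed comparatively slowly. We identify and attribute this to two unique challenges of KBQA: schema-level complexity and fact-level complexity. In this survey, we situate KBQA in the broader literature on semantic parsing and give a comprehensive account of how existing KBQA approaches attempt to address these unique challenges. Regardless of the unique challenges, we argue that we can still take much inspiration from the semantic parsing literature, which has been overlooked by existing research on KBQA. Based on our discussion, we can better understand the bottlenecks of current KBQA research and shed light on promising directions for KBQA to keep up with the semantic parsing literature, particularly in the era of pre-trained language models.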
Conversational text-to-SQL is designed to translate multi-turn natural language questions into their corresponding SQL queries. Most state-of-the-art conversational text-to-SQL methods are incompatible with generative pre-trained language models (PLMs), such as T5. In this paper, we present a two-stage unified MultI-task Generation frAmework (MIGA) that leverages PLMs' ability to tackle conversational text-to-SQL. In the pre-training stage, MIGA first decomposes the main task into several related sub-tasks and then unifies them into the same sequence-to-sequence (Seq2Seq) paradigm with task-specific natural language prompts to boost the main task from multi-task training. Later in the fine-tuning stage, we propose four SQL perturbations to alleviate the error propagation problem. MIGA tends to achieve state-of-the-art performance on two benchmarks (SparC and CoSQL). We also provide extensive analyses and discussions to shed light on some new perspectives for conversational text-to-SQL.
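Recent progress in language model pre-training has achieved great success by leveraging large-scale unstructured textual data. However, applying pre-training to structured tabular data remains a challenge because large-scale, high-quality tabular data is not available. In this paper, we propose TAPEX to show that table pre-training can be achieved by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries and their execution outputs. TAPEX addresses the data scarcity challenge by guiding the language model to mimic a SQL executor on a diverse, large-scale, and high-quality synthetic corpus. We evaluate TAPEX on four benchmark datasets. Experimental results demonstrate that TAPEX outperforms previous table pre-training approaches by a large margin and achieves new state-of-the-art results. This includes improving the weakly supervised WikiSQL denotation accuracy to 89.5% (+2.3%), the WikiTableQuestions denotation accuracy to 57.5% (+4.8%), the SQA denotation accuracy to 74.5% (+3.5%), and the TabFact accuracy to 84.2% (+3.2%). To our knowledge, this is the first work to exploit table pre-training via synthetic executable programs and to achieve new state-of-the-art results on various downstream tasks.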
Structured tabular data exist across nearly all fields. Reasoning task over these data aims to answer questions or determine the truthiness of hypothesis sentences by understanding the semantic meaning of a table. While previous works have devoted significant efforts to the tabular reasoning task, they always assume there are sufficient labeled data. However, constructing reasoning samples over tables (and related text) is labor-intensive, especially when the reasoning process is complex. When labeled data is insufficient, the performance of models will suffer an unendurable decline. In this paper, we propose a unified framework for unsupervised complex tabular reasoning (UCTR), which generates sufficient and diverse synthetic data with complex logic for tabular reasoning tasks, assuming no human-annotated data at all. We first utilize a random sampling strategy to collect diverse programs of different types and execute them on tables based on a "Program-Executor" module. To bridge the gap between the programs and natural language sentences, we design a powerful "NL-Generator" module to generate natural language sentences with complex logic from these programs. Since a table often occurs with its surrounding texts, we further propose novel "Table-to-Text" and "Text-to-Table" operators to handle joint table-text reasoning scenarios. This way, we can adequately exploit the unlabeled table resources to obtain a well-performed reasoning model under an unsupervised setting. Our experiments cover different tasks (question answering and fact verification) and different domains (general and specific), showing that our unsupervised methods can achieve at most 93% performance compared to supervised models. We also find that it can substantially boost the supervised performance in low-resourced domains as a data augmentation technique. Our code is available at https://github.com/leezythu/UCTR.
Text-to-SQL semantic parsing is an important NLP task, which greatly facilitates the interaction between users and the database and becomes the key component in many human-computer interaction systems. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Upon MultiSpider, we further identify the lexical and structural challenges of text-to-SQL (caused by specific language properties and dialect sayings) and their intensity across different languages. Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. Qualitative and quantitative analyses are conducted to understand the reason for the performance drop of each language. Besides the dataset, we also propose a simple schema augmentation framework SAVe (Schema-Augmentation-with-Verification), which significantly boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
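The importance of building text-to-SQL parsers that can be applied to new databases has long been acknowledged, and a critical step toward this goal is schema linking, i.e., properly recognizing mentions of unseen columns or tables when generating SQL. In this work, we propose a novel framework to elicit relational structures from large-scale pre-trained language models (PLMs) via a probing procedure based on the Poincaré distance metric, and we use the induced relations to augment graph-based parsers for better schema linking. Compared with commonly used rule-based methods for schema linking, we find that probing relations can robustly capture semantic correspondences, even when the surface forms of mentions and entities differ. Moreover, our probing procedure is entirely unsupervised and requires no additional parameters. Extensive experiments show that our framework sets new state-of-the-art performance on three benchmarks. Through qualitative analysis, we empirically verify that our probing procedure can indeed find the desired relational structures.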