SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset

Saihao Huang

^{1}

, Lijie Wang

^{2}

, Zhenghua Li

^{1}

, Zeyang Liu

^{1}

, Chenhui Dou

^{1}

,
Fukang Yan

^{1}

, Xinyan Xiao

^{2}

, Hua Wu

^{2}

, Min Zhang

^{1}

^{1}

Institute of Artificial Intelligence, School of Computer Science and Technology,
Soochow University, China;

^{2}

Baidu Inc, Beijing, China

^{1}

{shhuang21,zyliu20,chdou21,fkyan21}@stu.suda.edu.cn

^{1}

{zhli13,minzhang}@suda.edu.cn

^{2}

{wanglijie,xiaoxinyan,wu_hua}@baidu.com

Abstract

As the first session-level Chinese dataset, CHASE contains two separate parts, i.e., 2,003 sessions manually constructed from scratch (CHASE-C), and 3,456 sessions translated from English SParC (CHASE-T). We find the two parts are highly discrepant and incompatible as training and evaluation data. In this work, we present SeSQL, yet another large-scale session-level text-to-SQL dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch. In order to guarantee data quality, we adopt an iterative annotation workflow to facilitate intense and in-time review of previous-round natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to be used as the largest dataset for single-round multi-DB text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL by employing three competitive session-level parsers, and present detailed analysis.

1 Introduction

Text-to-SQL parsing aims to automatically transform natural language (NL) questions into SQL queries based on given databases (DBs) Tang and Mooney (2001). As a key technology in an NL interface for relational DBs, it has attracted increasing attention from both academic and industrial community. Researchers have done many solid and interesting fundamental works on both dataset construction Zhong et al. (2017); Yu et al. (2018) and parsing model innovation Zhang et al. (2019); Wang et al. (2020a).

Figure 1: An example session from SeSQL.

Previous studies mainly focus on the single-round text-to-SQL parsing, where the input questions are context-independent. Popular single-round datasets include WikiSQL Zhong et al. (2017) and Spider Yu et al. (2018) for English, and DuSQL Wang et al. (2020b) for Chinese.

However, in a real-world setting, it is usually difficult for users to meet their information need via a single stand-alone question. On the one hand, users usually have several related questions to ask at the same time, instead of a single one. On the other hand, possibly due to unfamiliarity toward the database or the system, users may need several trials until they find the suitable NL question.

Therefore, recent works go beyond single-round text-to-SQL parsing and start to tackle session-level text-to-SQL parsing Yu et al. (2019b); Cai and Wan (2020), similar to the trend from single-round question answering (QA) to context-dependent QA Bertomeu et al. (2006). Figure 1 shows a session-level example. Given a relational DB $D$ , a user asks a sequence of questions, denoted by $Q = q_{1}, . . ., q_{n}$ , and the text-to-SQL engine produces a sequence of SQL queries, denoted by $Y = y_{1}, . . ., y_{n}$ . Questions in the same session are usually thematically related, and contextually dependent via ellipsis or co-reference as well Bertomeu et al. (2006). When generating $y_{j}$ , the parser needs to not only look at $q_{j}$ , but also heavily rely on the previous questions.

So far, previous researchers have constructed two session-level text-to-SQL datasets, i.e., SParC Yu et al. (2019b) in English and CHASE Guo et al. (2021) in Chinese. SParC, containing 4,298 sessions and 12,726 question/SQL pairs, is built by extending the single-round Spider Yu et al. (2018).

As the first session-level Chinese dataset, CHASE contains 5,459 sessions and 17,940 question/SQL pairs Guo et al. (2021). The major problem of CHASE is that it adopted a hybrid construction method. Only 2,003 sessions are manually constructed from scratch (CHASE-C), whereas 3,456 correspond to a part of SParC after translating DBs and questions (CHASE-T). As shown in our experiments, CHASE-C and CHASE-T are highly discrepant and incompatible as training and evaluation data, possibly due to culture and language gaps. Moreover, only using CHASE-C may be insufficient to support model training.

This work presents SeSQL (/\textprimstressseskju:l/), yet another large-scale session-level Chinese text-to-SQL dataset. SeSQL contains 5,028 sessions and 27,012 question/SQL pairs. All sessions are constructed manually from scratch. This paper describes the construction methodology and process of SeSQL and presents detailed data analysis. We summarize contributions of this work as follows.

[label =(0),leftmargin=*]
SeSQL has three important features. First, based on several annotation trials, we adopt an iterative annotation workflow to encourage careful review of previous submissions, which we find is very useful for improving data quality. Second, we design seven categories of thematic transition for explicitly guiding annotators to creating next-round SQL queries. Third, we follow CHASE and explicitly annotate the context-dependent types of adjacent NL questions, such as ellipses and co-reference.
We complete 17,704 context-dependent questions into corresponding context-independent ones, resulting in 27,012 context-independent questions. This leads to two advantages. On the one hand, SeSQL provides the largest dataset for single-round multi-DB text-to-SQL parsing. On the other hand, SeSQL can also support research on question completion techniques.
We conduct benchmark session-level experiments on SeSQL, employing three competitive text-to-SQL models, i.e., EditSQL Zhang et al. (2019), IGSQL Cai and Wan (2020), and EX-RATSQL Guo et al. (2021).

We will release SeSQL and the code for research usage at http://xyz.

2 Related Works

Session-level text-to-SQL datasets. To date, there exist two representative session-level text-to-SQL datasets, i.e., English SParC Yu et al. (2019b) and Chinese CHASE Guo et al. (2021). SParC reuses questions in the single-round dataset Spider Yu et al. (2018) as guidance for annotators to create question sequences. The basic idea is to transform an original Spider question into a sequence of simpler questions, with the goal of answering the original question. As pointed out by Guo et al. (2021), this construction method leads to two biases: 1) high proportion of context-independent questions, and 2) high proportion of easy SQL queries.

As the first session-level Chinese dataset, CHASE is composed of two separate parts, i.e., CHASE-C and CHASE-T Guo et al. (2021). For CHASE-C, they reuse 120 DBs from single-round DuSQLWang et al. (2020b), and question/SQL pairs are created from scratch by 12 college students. For CHASE-T, they reuse a part of English SParC and employ 11 college students to translate DBs and question sequences into Chinese. However, as shown in our experiments, CHASE-C and CHASE-T exhibit different characteristics due to culture and language gaps. Moreover, it is inevitable that CHASE-T inherits the biases of SParC.

Conversational text-to-SQL parsing belongs to a different task from session-level text-to-SQL parsing, and is also known as DB-based conversational QA. CoSQL Yu et al. (2019a) is an English dataset for this task. Besides generating SQL queries, the model can ask NL questions to users for clarifying ambiguities.

Session-level text-to-SQL parsing approaches. Due to space limitation, we briefly introduce four representative approaches for session-level text-to-SQL parsing. EditSQL Zhang et al. (2019) generates a current-round SQL query by editing a previous-round query. Its encoder is designed to model interaction between the current-round question and all previous questions. IGSQL Cai and Wan (2020) extends EditSQL by introducing a graph encoder to model DB items together with those mentioned in questions. Hui et al. (2021) propose to jointly model the question sequence, DB items, and their interactions via a dynamic graph. Guo et al. (2021) propose extended RATSQL (EX-RATSQL), a session-level variant of RATSQL Wang et al. (2020a), by simply concatenating all previous questions as inputs.

Figure 2: Illustration of how to complete a current-round NL question and a next-round SQL query.

3 Dataset Construction

The construction of SeSQL mainly consists of five steps: 1) DB collection and cleansing, 2) initial SQL query creation, 3) subsequent question/SQL generation, 4) review and final question creation, and 5) completing context-dependent questions.

We first introduce our overall annotation workflow in Section 3.1, and then detail the five steps in Sections 3.2-3.5, and finally discuss other annotation details in Section 3.7.

3.1 An Iterative Annotation Workflow

In the early stage of this work, we observed that one annotator tended to have a very limited number of ways for advancing a session, probably due to thinking habits and background knowledge. In other words, annotators usually followed a few fixed patterns to ask new questions in order to improve annotation speed. Therefore, if we let one annotator to complete a whole session, the constructed data would probably contain strong annotator-related biases Mor et al. (2019) and be less diverse.

To deal with this issue, we adopted an iterative annotation workflow, as illustrated in Figure 2. The basic idea is that one annotator only completes one NL question and one SQL query, and previous submissions are intensively reviewed by subsequent annotators. There are six possible subtasks for an annotator to complete at a time.

Subtask 1: knowing the context. The annotator first reads all previous NL questions in order to 1) know what the session is about, and 2) avoid asking identical or similar questions.

Subtask 2: checking the previous submission. The annotator must carefully check and correct the submission of the previous annotator, which usually consists of two parts, i.e., a current-round SQL query, and a previous-round NL question (if not first-round). We find that this step is very important for avoiding error accumulation.

Subtask 3: writing an NL question. The annotator writes a qualified NL question for the current-round SQL query. On the one hand, the question should correctly and exactly express the meaning of the SQL query. On the other hand, the question should be expressed in a flexible and natural manner, imitating human conversation in real life.

Categories	Examples
Categories	current-round SQL query $y_{j}$	next-round SQL query $y_{j + 1}$
Changing SELECT	SELECT name FROM movie	SELECT name, score, type FROM movie
Changing conditions	SELECT count(*) FROM cinema WHERE score > 3.5	SELECT count(*) FROM cinema WHERE score > 3.5 AND room_number > 10
Changing tables	SELECT name FROM movie	SELECT name FROM cinema
Changing display	SELECT name FROM movie	SELECT name FROM movie ORDER BY score DESC LIMIT 3
Combining queries	SELECT name FROM cinema ORDER BY score DESC LIMIT 100	{SELECT name FROM cinema ORDER BY score DESC LIMIT 100 } EXCEPT {SELECT name FROM cinema WHERE type = "thriller"}
Hybrid of the above	SELECT name FROM movie	SELECT name, score FROM movie WHERE review_number > 1000
Unrelated	SELECT name, address FROM cinema ORDER BY score DESC LIMIT 1	SELECT name FROM movie

Table 1: Seven categories of thematic transition for creating the next-round SQL query.

Subtask 4: composing an SQL query. The annotator composes a new next-round SQL query, which is detailed in Section 3.4.

Subtask 5: verifying corrections via interaction. If the annotator (A) finds and corrects mistakes in the previous submission of another annotator (B). Our annotation tool will deliver the original submission along with the corrections to annotator B for his confirmation. If annotator B agrees with A, then the issue is settled; otherwise, a senior annotator is called to make a final decision.

Subtask 6: making a request for ending a session. An annotator may make a request for ending a session after completing subtask 2, when he fails to think of anything more to ask. Then a senior annotator handles the request.

Following Yu et al. (2019b), we require the number of question/SQL pairs in a session should range between 3 and 10. A session is automatically terminated if the number reaches 10.

Discussion. The one-annotator-one-session workflow, adopted by CHASE, means that a session is completed by a single annotator.

As we discussed in Section 1, it may introduce strong annotator-related bias, since annotator usually have a limited number of ways to advance a session. We observe that our iterative workflow can effectively alleviate this issue. Another advantage of our iterative workflow is that a previous submission is reviewed timely, which can avoid error accumulation and improve data quality. In contrast, data review can only be performed only after a whole session is completed in CHASE. Nevertheless, it would be very expensive to compare the two workflows via strict quantitative experiments, which is beyond the scope of this work.

3.2 DB Collection and Cleansing (S1)

Collecting DBs is a non-trivial work. For simplicity, we reuse all 201 DBs with 813 tables of the DuSQL dataset¹¹1The license and data is at https://www.luge.ai/#/luge/dataDetail?id=13. Wang et al. (2020b).

After looking into the data, we find that there are a lot of noises in the original DBs of DuSQL, which is also pointed out by Guo et al. (2021). Most noises fall into four categories: 1) primary or foreign keys are not given; 2) the value type of a cell does not match its column type; 3) some cells do not have values; 4) a duplicate value occurs in the primary key column.

In order to improve the quality of DBs and make sure that all legal SQL queries can be successfully executed, our six senior annotators have manually checked and corrected all DBs²²2Please note that we ask annotators not to introduce identification information and ask them to anonymize the existing identification information. before real annotation. Each annotator handles about 35 DBs.

3.3 Initial SQL Query Creation (S2)

Creating suitable initial queries are crucial for session-level text-to-SQL data creation, since they directly influence subsequent annotations. The suitability of initial queries depends on two aspects, i.e., simplicity and diversity. Regarding to the first aspect, we find that queries at easy and medium difficulty levels are the most appropriate as initial queries. We follow definition of difficulty levels in Yu et al. (2018). The second aspect indicates that initial queries should cover as many SQL keywords as possible. In order to satisfy both aspects, we induced 60 SQL query templates from single-round Spider and DuSQL, and each template contains some slots corresponding to masked table/column names and cell values. Given a DB, we require the initial query matches one of the templates (simplicity), and create at most one initial query for one template (diversity).

We create 5,028 valid initial queries in total, among which 1,761 are from DuSQL, and 3,267 are written from scratch by our senior annotators.

3.4 Subsequent Question/SQL Creation (S3)

As discussed in Section 3.1, subsequent questions and queries are created by multiple randomly selected annotators, each contributing one current-round NL question and one next-round SQL query. This subsection focuses on how to create a next-round SQL query given existing context.

Similar to context-dependent QA Bertomeu et al. (2006), it is crucial to make as realistic as possible the thematic transition and context dependency between adjacent utterances, where theme refers to users’ information need, and context dependency is concerned with manners in reusing previous content. In this step (i.e., S3), we mainly consider the thematic transition, since reusing previous content is usually a natural choice for annotators. As for the context dependency information, we follow CHASE and create explicit annotation (see Section 3.6).

Seven categories of thematic transition. To capture theme change and encourage diversity, we design seven transition categories to represent the relationship between the current-round and next-round queries, i.e., $y_{j}$ and $y_{j + 1}$ , as illustrated in Table 1. Please note that we also allow annotators to compose a thematically “unrelated” query, which sometimes happens in real-world scenario.

Figure 2 illustrates concrete operations in the bottom left corner. Given $y_{j}$ , the annotator first selects a transition category; then our annotation tool suggests several potential SQL templates according to $y_{j}$ and the selected category; finally the annotator selects an SQL template and fills it with DB elements to complete an SQL query.

3.5 Review & Final Question Creation (S4)

Datasets	General Information						Challenge Information		Contextual Annotation		Question Completion
Datasets	Language	DBs	Tables	Sequences	Pairs	Avg. round	Dep. Ratio	Easy Ratio	Thematic	Dependency	Question Completion
SparC	English	200	1,020	4,298	12,726	3.0	52.5	40.1	\XSolidBrush	\XSolidBrush	\XSolidBrush
CHASE	Chinese	280	1,280	5,459	17,940	3.3	64.7	27.7	\Checkmark	\Checkmark	\XSolidBrush
CHASE-C		120	462	2,003	7,694	3.8	71.2	18.6
CHASE-T		160	818	3,456	10,246	3.0	57.8	37.4
SeSQL	Chinese	201	813	5,028	27,012	5.4	65.5	13.6	\Checkmark	\Checkmark	\Checkmark

Table 2: Statistics of existing cross-domain context-dependent datasets. “Avg. round” represents the average number of rounds in a sequence. “Dep. Ratio” represents the ratio of the context-dependent questions, and “Easy ratio” represents the ratio of queries in the easy level. Please note that CHASE only partially released coarse-grained thematic transition relations.

This step is performed by our senior annotators. If an ordinary annotator makes a request to terminate a session, the annotation tool will transmit the request to a senior annotator. If the senior annotator agrees, then he must carefully review all previous questions and SQL queries, and correct all found mistakes. After that, he writes an NL question for the final-round SQL query.

3.6 Completing Context-Dep Questions (S5)

In order to capture context dependency and make our dataset more widely applicable, we perform this step in a separate manner after all sessions are completed via the above four steps (S1-S4).

Each session is then assigned to one senior annotator. The annotator first goes through all NL questions, and decides whether each question is context-dependent. Then, the context-dependent question is rewritten into a corresponding context-independent one. There are in total 17,704 context-dependent questions, accounting for 65.5% of all questions (see Table 3). As a result, SeSQL can serve as a single-round Chinese text-to-SQL dataset as well, like DuSQL Wang et al. (2020b). Moreover, it can also support reserach on question completion techniques.

Context dependency types. Inspired by CHASE Guo et al. (2021) and context-dependent QA Bertomeu et al. (2006), we ask annotators to explicitly annotate the way that a context-dependent question depends on its previous questions. There are five types, i.e., independent, co-reference, ellipsis, hybrid of co-reference and ellipsis, and others. Such annotation can help us to better understand results of text-to-SQL parsers.

3.7 Other Annotation Details

Annotators and Training. We recruit 28 undergraduate students as our part-time annotators, and 6 master students as senior annotators, including three co-authors of this paper. All of them come from the computer science department of our university and are familiar with the SQL language.

Before real annotation, we train all annotators for several times so that they understand the text-to-SQL parsing task, the annotation workflow, and the annotation tool, etc. During real annotation, we have also held several meetings to discuss common mistakes and settle disputes. Our annotation project lasts for about half a year.

Annotation tool. We build an online browser-based annotation tool to facilitate this work. Figure 4 in Appendix B shows the annotation interface.

Payment. All annotators were paid for their work based on the quality and quantity of their annotations. According to the annotation time recorded by our annotation tool, the average salary per hour is 25 RMB for ordinary annotators, and 35 RMB for senior annotators.³³3The average salary is about 20 RMB for a part-time KFC employee in our city. A total of 106K RMB is paid to annotators.

4 Analysis of SeSQL

Basic statistics. As shown in Table 2, SeSQL contains 5,028 unique question sequences over 201 DBs, with 27,012 questions annotated with their corresponding SQL queries. First of all, compared with English text-to-SQL datasets, both SeSQL and CHASE have a larger number of sessions and question/query pairs. Second, both SeSQL and CHASE are more challenging, due to higher percentage of context-dependent and non-easy questions.

Compared with CHASE, SeSQL contains more question/query rounds per session. We believe this owes to the seven categories of thematic transition that we design, which makes it more flexible for annotators to create next-round SQL queries. Moreover, SeSQL has overall higher percentages of context-dependent and non-easy questions.

Looking into CHASE, as we earlier discussed, only 2,003 sessions and 7,694 question/query pairs (i.e., CHASE-C) are annotated from scratch, which are much fewer than SeSQL. In the experiments, we show that CHASE-C and CHASE-T are highly discrepant and incompatible as training and evaluation data.

Finally, SeSQL provides corresponding context-independent questions for all context-dependent ones, and thus can also serve as a single-round text-to-SQL dataset.

Figure 3: Thematic transition distribution in SeSQL.

Thematic transition. We compute the thematic transition distributions in Figure 3. We find that the most frequently occurring transitions are Changing conditions and Changing SELECT, which are two very common contextual thematic relations in conversational QA systems Bertomeu et al. (2006). Meanwhile, it can be seen that transitions of Changing tables and Combining queries rarely occur. In the former case, the next-round SQL query raises another related topic, and its NL question is usually context-independent. The latter case usually leads to a very complex next-round SQL query.

Datasets	Indep.	Core.	Elli.	Both	Others
SParC	47.5	31.6	25.9	5.0	0
CHASE	35.3	35.7	28.5	0.5	0
CHASE-C	28.8	39.8	30.9	0.5	0
CHASE-T	42.2	31.4	24.7	1.7	0
SeSQL	34.5	13.4	35.0	12.5	4.6

Table 3: Context dependency distributions of existing datasets, where the reported results of SParC, CHASE, CHASE-C and CHASE-T are from Guo et al. (2021) .

Context dependency. Table 3 shows distribution of context dependency types of different datasets. Following previous studies, there are five types of context dependency, i.e., independent (indep.), co-reference (core.), ellipsis (elli.), hybrid of co-reference and ellipsis (both) ⁴⁴4The values of “both” for other three datasets are inferred from their reported results of “Core.” and “Elli.”., and others. First, the proportion of context-independent questions in SeSQL is much lower than SParc and CHASE-T, and is only 5.7% higher than CHASE-C. Second, SeSQL has the highest percentage of questions with ellipsis, and of questions with both co-reference and ellipsis. Finally, the remaining 4.6% of questions in SeSQL are related with previous questions in other ways than co-reference and ellipsis. From the distribution analysis, we believe that compared with CHASE, SeSQL can be used as a new and complementary resource for research on session-level text-to-SQL parsing.

Split Set	# DB	# Sequence	# Pair
Train	160	4,002	21,454
Dev	17	425	2,279
Test	24	601	3,279

Table 4: Dataset split statistics.

5 Experiments

Datasets. According to the cross-domain setting, we split SeSQL such that there is no DB overlap in train/dev/test sets. Since our DBs are from DuSQL, we follow its DB split for three sets of SeSQL. Table 4 shows the data split statistics.

Evaluation metrics. We use two popular metrics to evaluate model performances: Question-level Match (QM), the exact matching score over all questions, and Interaction-level Match (IM), the exact matching score over all interactions. The exact matching score is 1 for a question only if all its predicted SQL clauses are correct, and 1 for an interaction only if the exact matching score for every question in the interaction is 1.

Benchmark approaches. We adopt several competitive models that have published the corresponding source codes as the baseline approaches, i.e., EditSQL Zhang et al. (2019), IGSQL Cai and Wan (2020) and extended RATSQL (EX-RATSQL) Guo et al. (2021) for the context-dependent setting, as well as RATSQL Wang et al. (2020a) and LGESQL Cao et al. (2021) for the context-independent setting. Due to space limitation, we show their implementation details in Appendix A.

Models	QM		IM
Models	Dev	Test	Dev	Test
EditSQL Zhang et al. (2019)	57.2	52.6	27.3	22.6
IGSQL Cai and Wan (2020)	63.3	59.5	35.0	29.0
EX-RATSQL Guo et al. (2021)	56.6	50.4	18.9	17.0
RATSQL Wang et al. (2020a)	65.6	56.5	-	-
LGESQL Cao et al. (2021)	76.8	71.0	-	-

Table 5: Performances of base approaches over all questions (QM) and all interactions (IM). All reported results are the average over two runs.

Training Data	SeSQL		CHASE-C		CHASE-T
Training Data	QM	IM	QM	IM	QM	IM
CHASE	12.9	0.3	33.4	9.9	43.9	24.6
CHASE-C	4.0	0.1	35.7	11.0	22.5	6.1
CHASE-T	12.1	0.3	19.4	2.5	42.3	23.8
SeSQL	61.7	32.9	22.0	4.0	30.4	15.0
SeSQL + CHASE-C	62.5	33.2	39.3	14.0	33.8	15.4
SeSQL + CHASE-T	63.8	34.2	24.6	3.7	46.5	28.5

Table 6: Performances of IGSQL on dev sets of three datasets using different training data. All results are the average over three runs.

Models	Thematic Transition							Context Dependency					Round Number
Models	SEL.	Cons.	Tab.	Display	Com.	Hybrid	Unrel.	Indep.	Core.	Elli.	Both	Others	1	2	3	4	$\geq$ 5
EditSQL	51.3	47.6	40.7	50.4	40.0	31.0	59.6	63.2	51.2	45.4	47.3	51.0	65.4	57.3	50.7	45.4	46.9
IGSQL	59.5	55.0	40.7	59.8	55.0	43.0	65.6	67.4	59.3	53.4	56.0	60.3	68.6	64.6	58.1	55.2	53.9
EX-RATSQL	48.1	43.5	51.2	48.2	45.0	37.0	58.9	63.0	47.3	41.9	44.4	50.6	64.8	58.0	46.6	42.4	44.1

Table 7: Fine-grained QM results on the test set of SeSQL. All reported results are the average over two runs.

5.1 Results

Overall performances. Table 5 shows the overall performances of five baseline models, where the first row shows performances of three session-level models (i.e., EditSQL, IGSQL and EX-RATSQL) on the SeSQL’s session-level data, and the second row shows performances of RATSQL and LGESQL on the single-round data of SeSQL. IGSQL and LGESQL have achieved the best performances on the session-level and single-round data, respectively. But the results on the session-level data are far from satisfactory, reflected in two aspects. First, the best performances on IM, the primary metric in the session-level setting, only achieves 29.0% on the test set. Second, the best QM accuracy achieved by IGSQL is 59.5%, where the best QM accuracy on the single-round data is 71.0%. That is, there is a large room for both QM and IM improvements on SeSQL. We believe SeSQL can facilitate the research on session-level text-to-SQL parsing.

Comparison between SeSQL and CHASE. We use different combination of SeSQL and CHASE as training data, and use three separate dev sets, in order to understand data similarity and discrepancy. To avoid DB overlap, which would corrupt the cross-DB text-to-SQL parsing task, we remove all DBs that also appears in any of the three dev sets from each training data, along with corresponding question/SQL pairs. Table 6 shows the results.

First of all, it is clear that CHASE-C and CHASE-T are highly discrepant and incompatible. Using whole CHASE as training data leads to performances drop on CHASE-C dev set, compared with using only CHASE-C. In other words, the extra CHASE-T only introduces more noisy information than helpful information. However, using whole CHASE increases performances on CHASE-T dev set, compared with using only CHASE-T.

Second, using only SeSQL as training data achieves acceptable cross-dataset performances on both CHASE-C dev set, which is much higher than using CHASE-T as training data. The same trend goes to CHASE-T dev set. This indicates that SeSQL possesses a higher level of generalization ability.

Third, using both SeSQL and CHASE-C as training data leads to higher performances on CHASE-C dev set than using only CHASE-C. Similarly, using both SeSQL and CHASE-T as training data leads to higher performances on CHASE-T dev set than using only CHASE-T as well. Such consistent improvement indicates that SeSQL is of higher quality and compatible with both CHASE-C and CHASE-T.

Finally, using either CHASE-C or CHASE-T as extra training data increases QM and IM on SeSQL dev set slightly, compared with using only SeSQL. We suspect this may be due to the increased data volume added by both datasets.

Despite SeSQL improves cross-dataset generalization, the model generalization ability across different datasets is still weak, even if these datasets are built on the same DBs (e.g., SeSQL and CHASE-C). We believe SeSQL can facilitate the research of text-to-SQL parsing, especially on the cross-dataset generalization of text-to-SQL models.

5.2 Analysis

According to the fine-grained annotation information, we report QM results on SeSQL’s test set in Table 7. There are three main findings that are applicable to all baseline models. First, among all thematic transitions, all models do not perform well on transitions of Combining queries (Com.) and hybrid of other transitions. As described in Section 4, these transitions usually result in complex query generation. Second, as shown in the column of “Context Dependency”, QM performances on context-independent (Indep.) pairs is higher than that on context-dependent pairs, i.e., the other four dependency types. Furthermore, all base models do not perform well on questions that omit important historical information, i.e., labeled as “ellipse”. This proves that how to effectively use historical information is challenging. Finally, due to the difficulty increase in SQL generation, QM performances decreases as the round increases, which is consistent with the conclusions in other session-level datasets, e.g., SParC and CHASE.

Then we analyze significance of fine-grained annotations by comparing different models. From Table 7, there are three interesting findings to verify the importance of fine-grained annotations in revealing the effectiveness of model components. First, both IGSQL and EditSQL perform better than EX-RATSQL on non-first round questions, as they refer to the previous-round SQL query during the generation of the current-round SQL query. As all we known, in the context-dependent setting, the historical questions and generated SQL queries are very important to the current SQL generation. Second, IGSQL outperforms EditSQL on all transition types and dependency types, where IGSQL incorporates a graph encoder into EditSQL to model DB schema items together with items mentioned in historical questions. The performances on these fine-grained annotations verify that this graph encoder effectively captures historical information of questions and DB schema items. Third, EX-RATSQL, which only uses a relation-aware transformer to model the historical questions and DB schema items, performs best on the transition type of Tab., in which there is weak correlation between the historical rounds and the current round. Based on the above conclusions, we believe these annotations can help to reveal the advantages and limitations of the model, so as to help to improve models.

6 Conclusions

This paper presents SeSQL, yet another large-scale session-level Chinese text-to-SQL dataset. We describe its construction methodology and process in detail, and present detailed analysis about it. We conduct benchmark experiments with three representative session-level parsers, and prove that SeSQL exhibits several important features compared with CHASE. First, all 5,028 sessions are manually constructed from scratch, whereas only 2,003 sessions in CHASE-C are manually constructed from scratch. Second, being used as extra training data, SeSQL can consistently improve performances on both CHASE-C and CHASE-T. This indicates SeSQL is of higher quality and has stronger generalization ability. Third, by completing context-dependent questions, SeSQL provides 27,012 context-independent question/SQL pairs, and thus can be used as a solid dataset for future research on single-round text-to-SQL parsing.

References

Bertomeu et al. (2006) Núria Bertomeu, Hans Uszkoreit, Anette Frank, Hans-Ulrich Krieger, and Brigitte Jörg. 2006. Contextual phenomena and thematic relations in database QA dialogues: results from a Wizard-of-Oz experiment. In Proceedings of NAACL-HLT, pages 1–8.
Cai and Wan (2020) Yitao Cai and Xiaojun Wan. 2020. IGSQL: Database schema interaction graph based neural model for context-dependent text-to-SQL generation. In Proceedings of EMNLP, pages 6903–6912.
Cao et al. (2021) Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: Line graph enhanced text-to-SQL model with mixed local and non-local relations. In Proceedings of ACL, pages 2541–2555.
Cui et al. (2021) Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, and Ziqing Yang. 2021. Pre-training with whole word masking for Chinese BERT. TASLP, 29:3504–3514.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Guo et al. (2021) Jiaqi Guo, Ziliang Si, Yu Wang, Qian Liu, Ming Fan, Jian-Guang Lou, Zijiang Yang, and Ting Liu. 2021. Chase: A large-scale and pragmatic Chinese dataset for cross-database context-dependent text-to-SQL. In Proceedings of ACL, pages 2316–2331.
Hui et al. (2021) Binyuan Hui, Ruiying Geng, Qiyu Ren, Binhua Li, Yongbin Li, Jian Sun, Fei Huang, Luo Si, Pengfei Zhu, and Xiaodan Zhu. 2021. Dynamic hybrid relation network for cross-domain context-dependent semantic parsing. arXiv preprint arXiv:2101.01686.
Mor et al. (2019) Geva Mor, Goldberg Yoav, and Berant Jonathan. 2019. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of EMNLP-IJCNLP.
Scholak et al. (2021) Torsten Scholak, Raymond Li, Dzmitry Bahdanau, Harm de Vries, and Chris Pal. 2021. DuoRAT: Towards simpler text-to-SQL models. In Proceedings of NAACL, pages 1313–1321.
Tang and Mooney (2001) Lappoon R Tang and Raymond J Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In Proceedings of ECML, pages 466–477.
Wang et al. (2020a) Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020a. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of ACL, pages 7567–7578.
Wang et al. (2020b) Lijie Wang, Ao Zhang, Kun Wu, Ke Sun, Zhenghua Li, Hua Wu, Min Zhang, and Haifeng Wang. 2020b. DuSQL: A large-scale and pragmatic Chinese text-to-SQL dataset. In Proceedings of EMNLP, pages 6923–6935.
Yu et al. (2019a) Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, et al. 2019a. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In Proceedings of EMNLP-IJCNLP, pages 1962–1979.
Yu et al. (2020) Tao Yu, Rui Zhang, Alex Polozov, Christopher Meek, and Ahmed Hassan Awadallah. 2020. SCoRe: Pre-training for context representation in conversational semantic parsing. In Proceedings of ICLR.
Yu et al. (2018) Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of EMNLP, pages 3911–3921.
Yu et al. (2019b) Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, et al. 2019b. SParC: Cross-domain semantic parsing in context. In Proceedings of ACL, pages 4511–4523.
Zhang et al. (2019) Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of EMNLP-IJCNLP, pages 5338–5349.
Zheng et al. (2022) Yanzhao Zheng, Haibin Wang, Baohua Dong, Xingjun Wang, and Changshan Li. 2022. HIE-SQL: History information enhanced network for context-dependent text-to-SQL semantic parsing. arXiv preprint arXiv:2203.07376.
Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv:1709.00103.

Appendices

Figure 4: User interface of our annotation tool.

Appendix A Benchmark Approaches

In the context-dependent setting, many researchers focus on parsing model innovation Yu et al. (2020); Hui et al. (2021); Zheng et al. (2022). In this work, we adopt Edit-SQL Zhang et al. (2019), IGSQL Cai and Wan (2020) and extended RATSQL (EX-RATSQL) Guo et al. (2021) as our benchmark approaches to evaluate their performances on our new dataset - SeSQL. For the three baseline approaches, we use default parameter settings in their released codes. Please note that in our experiments, we use fixed values given in the released source codes for hyper-parameters, that is, we do not perform hyper-parameter search to find best hyper-parameter values. Meanwhile, following Guo et al. (2021), we use BERT-base Devlin et al. (2019) to enhance the three parsers.

Edit-SQL⁵⁵5https://github.com/ryanzhumich/editsql utilizes the interaction history by editing the previous predicted query to improve the generation quality, as the adjacent NL questions are often linguistically dependent and their corresponding SQL queries tend to overlap. In the decoding process, they view an SQL query as a sequence and use an editing mechanism to reuse the previous generated SQL query at the token level. In the encoder, in order to deal with complex table structures, they employ an question-table encoder to incorporate the context of the user question and the table schema.

It takes about 7 days to train a basic BERT enhanced EditSQL model on a V100 GPU card. The EditSQL model has about 115M parameters.

IGSQL⁶⁶6https://github.com/headacheboy/IGSQL incorporates a graph encoder into EditSQL to capture historical information of user questions and database schema items. In encoding phase, they not only use an interaction encoder to capture historical information of user NL questions, but also use a DB schema interaction graph encoder to utilize historical information of DB schema items. In decoding phase, for making the prediction of SQL tokens, IGSQL introduces a gate mechanism to weigh the importance score of vocabularies from different sources, including DB schema items and the previous generated SQL query.

With a V100 GPU card, it takes about 6 days to train a basic BERT enhanced IGSQL model. And an IGSQL model has about 110M parameters.

EX-RATSQL⁷⁷7https://github.com/xjtu-intsoft/chase/tree/main/Benchmark_Approaches/DuoratChar is an extension of RATSQL Wang et al. (2020a), which uses a relation-aware transformer encoder and a gramma-based decoder to generate SQL Queries and performs well in the context-independent setting. Compared with RATSQL, EX-RATSQL adopts a simple concatenation context modeling approach to concatenate current question and all prior context-dependent questions with a special symbol [SEP] from the back forward. In other respects, it keeps up with RATSQL. The code of EX-RATSQL is based on DuoRAT Scholak et al. (2021) and we set batch size as 12 and max steps as 200,000 in our experiments. Besides it, we use default values for other hyper-parameters.

It also takes about 8 days to train a basic BERT enhanced EX-RATSQL model on a V100 GPU card. An EX-RATSQL model has about 135M parameters.

In the context-independent setting, we use a widely-used approach RATSQL Wang et al. (2020a) and a SOTA approach LGESQL Cao et al. (2021) as the benchmark approaches to understand the characteristics of our dataset. We evaluate RATSQL and LGESQL on all context-independent pairs.

RATSQL⁸⁸8https://github.com/microsoft/rat-sql utilizes a relation-aware transformer encoder to better model the connections between DB schemas and NL questions. Then it uses a grammar-based decoder to ensure the grammaticality of the generated SQL queries. In our experiments, we set batch size as 4 and random initialize seed as 0. We use default values for other parameter in the released code, and use the fixed values (listed in the released code) for hyper-parameters. Similarly, we use Chinese BERT-wwm Cui et al. (2021) to enhance our RATSQL model.

Training RATSQL is also expensive. It takes about 6 days to train a basic BERT-wwm enhanced RATSQL model on a V100 GPU card. A RATSQL model has about 168M parameters.

LGESQL⁹⁹9https://github.com/rhythmcao/text2sql-lgesql.git takes advantages of Dual RGAT to jointly encode the questions and the schemas. Compared with RATSQL, LGESQL pays more attention to the topological structure of edges and applies an edge-centered line graph to enhance the encoding of 1-hop edge features. In addition, it comes up with a graph pruning method as an auxiliary task to help the encoder improve the discriminative capability. Later, it uses a grammar-based decoder to generate SQL Queries as well. In our experiments, we leave the default hyperparameters unchanged.

Training a basic BERT enhanced LGESQL model spends 8 days on a V100 GPU card. The LGESQL model has about 148M parameters.

Appendix B Annotation Tool

Figure 4 shows the user interface of our annotation tool. The details of annotating process have been mentioned in Section 3. During the process of annotating and checking question/SQL sequences, this annotation tool helps to improve annotation speed and data quality.

Appendix C More Annotation Examples

#	Question&SQL Query	Context Dependency	Thematic Transition
Session 1
$q_{1}^{1}$	请列出所有频道的相关收视信息。	Independent	–
	(Please list the relevant information of all channel.)
$y_{1}^{1}$	select * from 频道收视
	(select * from channel_ratings)
$q_{2}^{1}$	市场份额最高的那个呢？	Dependent (Ellipsis)	Changing display
	(Tell me the one with the highest market share.)
$y_{2}^{1}$	select * from 频道收视 order by 市场份额 desc limit 1
	(select * from channel_ratings order by market_share desc limit 1)
$q_{3}^{1}$	还是把市场份额不低于10%的都列出来吧。	Dependent (Ellipsis)	Changing conditions
	(Now list all relevent information about channels with a market share of no less than 10% instead.)
$y_{3}^{1}$	select * from 频道收视 where 市场份额 >= 0.1
	(select * from channel_ratings where market_share >= 0.1)
$q_{4}^{1}$	其中直播关注度超过0.1%的呢？	Dependent (Both)	Changing conditions
	(How about counting only those in the above results with more than 0.1% live streaming attention?)
$y_{4}^{1}$	select * from 频道收视 where 市场份额 >= 0.1 and 直播关注度 > 0.001
	(select * from channel_ratings where market_share >= 0.1 and live_streaming_attention > 0.001)
$q_{5}^{1}$	帮我再查一下市场份额高于均值的吧。	Dependent (Ellipsis)	Changing conditions
	(I want to know the information of the channels whose market share is above average now, can you work out?)
$y_{5}^{1}$	select * from 频道收视 where 市场份额 > ( select avg(市场份额) from 频道收视 )
	(select * from channel_ratings where market_share > ( select avg( market_share ) from channel_ratings ))
Session 2
$q_{1}^{2}$	将所有基金公司按照注册资金由多到少排一下序，并告诉我它们对应的封闭式与开放式基金之和分别是多少。	Independent	–
	(Sort all funds according to their registered capital from most to least and show me the sum of their closed-end and open-end funds.)
$y_{1}^{2}$	select 名称, 封闭式基金数量 + 开放式基金数量 from 基金公司 order by 注册资本（万） desc
	(select name , sum( num_closed-end_funds + num_open-end_funds ) from fund_company order by registered_capital desc)
$q_{2}^{2}$	顺便给出它们分别有多少亚洲债券基金吧。	Dependent (Both)	Changing SELECT
	(By the way, how many Asian Bond Fund do they have?)
$y_{2}^{2}$	select 名称, 亚洲债券基金数量, 封闭式基金数量 + 开放式基金数量 from 基金公司 order by 注册资本（万） desc
	(select name , Aisan_bond_fund , sum( num_closed-end_funds + num_open-end_funds ) from fund_company order by registered_capital desc )
$q_{3}^{2}$	这亚洲债券基金好像没什么参考价值，换成注册资金看看呢？	Dependent (Ellipsis)	Changing SELECT
	(The Asian Bond Fund seems useless, can you replace it with registered captial? )
$y_{3}^{2}$	select 名称, 注册资本（万）, 封闭式基金数量 + 开放式基金数量 from 基金公司 order by 注册资本（万） desc
	(select name , registered_captial , sum( num_closed-end_funds + num_open-end_funds ) from fund_company order by registered_capital desc)
$q_{4}^{2}$	再把上述结果按注册资金从少到多排一下序让我瞅瞅吧。	Dependent (Coreference)	Changing display
	(Put the above result in reverse order and show me that.)
$y_{4}^{2}$	select 名称, 注册资本（万）, 封闭式基金数量 + 开放式基金数量 from 基金公司 order by 注册资本（万） asc
	(select name , registered_captial , sum( num_closed-end_funds + num_open-end_funds ) from fund_company order by registered_capital asc)
Session 3
$q_{1}^{3}$	电影分很多种类型，麻烦将各类型按对应的电影数从多到少排序。	Independent	–
	(As we all know, there are plenty of movie genres, can you sort them according to the number of movies they have from most to least?)
$y_{1}^{3}$	select 类型 from 电影 group by 类型 order by count(*) desc
	(select genre from movie group by genre order by count(*) desc)
$q_{2}^{3}$	那是哪种类型一百分钟以上的电影最多？	Dependent (Others)	Changing conditions
	(Which genre has the most movies over 100 minutes?)
$y_{2}^{2}$	select 类型 from 电影 where 片长（分钟） > 100 group by 类型 order by count(*) desc limit 1
	(select genre from movie where length > 100 group by genre order by count(*) desc limit 1)
$q_{3}^{3}$	如果只统计票价超过50元的呢？	Dependent (Ellipsis)	Changing conditions
	(What if we only count movies that cost more than ￥50?)
$y_{3}^{3}$	select 电影.类型 from 电影 join 电影上映 on 电影上映.电影id = 电影.词条id where 电影上映.票价（元） > 50
	group by 电影.类型 order by count(*) desc limit 1
	(select movie.genre from movie join movie_released on movie_released.id = movie.id where movie_released.cost > 50
	group by movie.genre order by count(*) desc limit 1)
$q_{4}^{3}$	又有哪些类型符合条件的电影超过三部？	Dependent (Both)	Changing conditions
	(Which movie genres have more than three movies that meet the criteria?)
$y_{4}^{3}$	select 电影.类型 from 电影 join 电影上映 on 电影上映.电影id = 电影.词条id where 电影上映.票价（元） > 50
	group by 电影.类型 having count(*) > 3
	(select movie.genre from movie join movie_released on movie_released.id = movie.id where movie_released.cost > 50
	group by movie.genre having count(*) > 3)
Session 4
$q_{1}^{4}$	社交软件有多少个？	Independent	–
	(How many social software?)
$y_{1}^{4}$	select count(*) from 社交APP
	(select count(*) from social_APP)
$q_{2}^{4}$	占内存30MB以上的呢？	Dependent (Ellipsis)	Changing conditions
	(How about memory usage occupied over 30M?)
$y_{2}^{2}$	select count(*) from 社交APP where 软件大小（M） > 30
	(select count(*) from social_APP where memory_usage > 30)
$q_{3}^{4}$	各公司旗下分别有多少这样的软件？	Dependent (Coreference)	Changing conditions
	(How many software mentioned above does each company have?)
$y_{3}^{4}$	select 公司.名称, count(*) from 公司 join 社交APP on 社交APP.母公司id = 公司.词条id where 社交APP.软件大小（M） > 30
	group by 公司.名称
	(select company.name, count(*) from company join social_APP on social_APP.parent_company_id=company.id
	where social_APP.memory_usage > 30 group by company.name)
$q_{4}^{4}$	分别总共有多少用户注册呢？	Dependent (Both)	Changing SELECT
	(How many users are registered in total respectively?)
$y_{4}^{4}$	select 公司.名称, count(*), sum(社交APP.注册用户量（亿）) from 公司 join 社交APP on 社交APP.母公司id = 公司.词条id
	where 社交APP.软件大小（M） > 30 group by 公司.名称
	(select company.name, count(*), sum(social_APP.num_registed_user) from company join social_APP on social_APP.parent_company_id = company.id
	where social_APP.memory_usage > 30 group by company.name)

Table 8: Question sequence examples in Ours.