Building the Intent Landscape of Real-World Conversational Corpora with Extractive Question-Answering Transformers

Jean-Philippe Corbeil Mia Taige Li Bell Canada Hadi Abdi Ghavidel Bell Canada

Abstract

For companies with customer service, mapping intents inside their conversational data is crucial in building applications based on natural language understanding (NLU). Nevertheless, there is no established automated technique to gather the intents from noisy online chats or voice transcripts. Simple clustering approaches are not suited to intent-sparse dialogues. To solve this intent-landscape task, we propose an unsupervised pipeline that extracts the intents and the taxonomy of intents from real-world dialogues. Our pipeline mines intent-span candidates with an extractive Question-Answering ELECTRA model and leverages sentence embeddings to apply a low-level density clustering followed by a top-level hierarchical clustering. Our results demonstrate the generalization ability of an ELECTRA large model fine-tuned on the SQuAD2 dataset to understand dialogues. With the right prompting question, this model achieves a rate of linguistic validation on intent spans beyond 85%. Furthermore, we reconstruct the intent schemes of five domains from the MultiDoGo dataset with an average recall of 94.3%.

1 Introduction

With the rise of pre-trained NLU models in the last few years, call centers now have robust tools within their reach to optimize their customer operations. Two main applications are chatbots and head-intent detection, both of which require defining a taxonomy of intents relevant to the business. In most cases, this is an overwhelming task for companies, despite their having a large amount of readily available transcripts and chats.

Current approaches in the field of intent discovery Popov et al. (2019); Vedula et al. (2019, 2020); Chatterjee and Sengupta (2020); Zhang et al. (2021); Hang (2021) tend to focus on datasets with intent-dense utterances — e.g. ATIS Hemphill et al. (1990) or SNIPS Coucke et al. (2018). However, this type of dataset is rarely available when starting the design of NLU applications. Usually, companies have dialogues between an agent and a customer, of which most utterances do not have a clear intent prompt. We characterize these real-world datasets as intent-sparse. We propose the intent landscape as the task to extract the intents and the taxonomy of intents from intent-sparse dialogue datasets. We consider this task as a generalization of the intent discovery task, which focuses on labelled and unlabelled utterance corpora to find known and unknown intents.

How can we automatically extract intents from a noisy, intent-sparse dialogue corpus? To answer this question, we propose a pipeline that extracts intent spans from conversational data and organizes them into a data-driven intent hierarchy.

Our contributions are fivefold:

We defined the intent landscape task.
We designed a pipeline that extracts intent spans from real-world dialogues and maps the relevant intents into a hierarchy of clusters.
We proposed a strategy to estimate the count of each cluster based on semantic similarity.
We show that the ELECTRA large Clark et al. (2020) fine-tuned on the SQuAD2 Question-Answering dataset Rajpurkar et al. (2018) has the ability to extract relevant intent spans from conversations.
We observed flaws in the original intent schemes of the MultiDoGo dataset Peskov et al. (2019), and we discovered a few new intent clusters in each domain.

We first review the previous works. We present our methodology afterwards: the dataset, our pipeline and the experiments. We then report each of our results and discuss our main findings. We conclude with a few closing remarks.

2 Related Work

When looking for intents in text, one usually needs to assess whether open intents exist, i.e. whether a predefined taxonomy can confine all intents in the scope of the current dataset. With the assumption that there exist no open intents, intent detection becomes a supervised classification task Vedula et al. (2019). Some works on this front focus on finding and analyzing intents from social media such as online forums and microblogs Agarwal and Sureka (2017); Gupta et al. (2014); Wang et al. (2015); Chen et al. (2013); while others look at intent detection as part of Spoken Language Understanding, and study it jointly with slot filling. A typical architecture used in this field is the attention-based RNN Goo et al. (2018); Mesnil et al. (2015); Zhang and Wang (2016), and in recent years many have explored the benefits of adversarial learning Kim et al. (2017); Liu and Lane (2017); Yu and Lam (2018).

On the other hand, many previous works focus on capturing unknown intents in given data. According to Zhang et al. (2021), there are two tasks on this end: open intent detection and open intent discovery¹. Open intent detection is an n+1-class classification with n known intent classes and one open intent class. Many works in this area use a threshold-based method for such decision Hendrycks and Gimpel (2016); Shu et al. (2017); Liang et al. (2017), but there are also attempts to use geometrical features to alleviate the problem of relying on the presence of unknown intents in the train set Lin and Xu (2019); Zhang et al. (2021). However, the downside of this task is that there is no way to differentiate one open intent from another, as they are all placed into one umbrella category.

¹ The terms "intent discovery" and "intent mining" have been used interchangeably in the literature, both referring to the task of identifying types of unknown intents in text. We will use "intent discovery" in our paper.

Unlike the previously mentioned tasks, open intent discovery does not require a predefined intent taxonomy. Researchers in this field leverage both unsupervised and supervised learning and domain-expert knowledge. Vedula et al. (2019) develops a pipeline incorporating an attention-based LSTM and a CRF, trained on data collected from Stack Exchange, to discover unknown intents by sequentially tagging the action-object pairs determined by the CRF; Cai et al. (2017) uses unsupervised clustering and domain knowledge on online medical forums to map out an intent taxonomy for classifying medical intents online. The limitation of these two works is that both rely on data in the form of short, intent-dense prompts. In reality, open intent discovery often needs to be done on noisy, intent-sparse data such as conversation dialogues. There are various approaches in the literature to tackle this problem. Popov et al. (2019) takes intent discovery as a topic modelling task, which can process input data as a whole, therefore avoiding the need for intent-dense prompts. However, Hang (2021) points out that this approach relies on additional linguistic features to help the performance, as conversational dialogues tend to be shorter in length and contain less latent semantic information when compared to documents and paragraphs used in topic modelling. Chatterjee and Sengupta (2020) uses a domain-agnostic pre-trained Dialog Act Classifier, where any utterance tagged as QUESTION or INFORMATION by the classifier is considered a potential intent candidate and passed on to downstream clustering. Nonetheless, the assumption that all intents are captured within utterances tagged as one of these two classes limits their approach. Vedula et al. (2020) proposes a 3-stage system that can be trained on known intents in a known domain to leverage knowledge transfer and discover unknown intents in other domains, with the assumption that the chosen known intents and their relation with the domain are an appropriate representation of the unknown.

Our work presents a cleaner way to extract unknown intents from conversation dialogues with fewer assumptions about their characteristics. We focus only on the syntactical integrity of intent phrases, and we do not assume anything about their semantics. Moreover, we also map the intents into relevant categories at many levels of granularity.

3 Methodology

3.1 Dataset

We need a conversational dataset to evaluate our pipeline with the following characteristics: covering a few domains, annotated intents, having two channels (e.g. agent and customer) and textual data about a real-life customer service interaction with multiple turns. In the literature, there are mainly three types of high-quality conversational datasets that are available for research purposes: intent-dense utterance datasets Coucke et al. (2018); Hemphill et al. (1990), task-oriented dialogue datasets Budzianowski et al. (2018); Peskov et al. (2019); Eric et al. (2019); Zang et al. (2020); Rastogi et al. (2020); Chen et al. (2021) and open-dialogue datasets Li et al. (2017); Rashkin et al. (2018); Zhang et al. (2018). We discard the open-dialogue datasets since we need labelled intents in the dataset. To work with actual conversations, we cannot rely on the intent-dense utterance datasets, which already consider a selection of certain types of utterances. Therefore, we selected the task-oriented dialogue datasets, which closely match our criteria. Despite being cleaner than real-world conversations, the structure of these conversations will still emphasize intents from a customer (see Table 1). To augment the sparsity of intents and have real customer intentions, we ignore any intents that are general conversational markers (e.g. opening greeting, closing greeting and confirmation) or system markers (e.g. out-of-domain) rather than head intents. By doing this, we focus the evaluation of our pipeline on real domain-specific intents.

We chose the MultiDoGo dataset containing six different domains and multi-turn conversations. Its name stands for "Multi-Domain Goal-oriented dialogues" and it is a dataset compiled by the AWS Labs Peskov et al. (2019). It contains dialogues from 6 different domains: airline, media, insurance, finance, software, and fast food. We have three different sets of data: annotated at the turn level, annotated at the sentence level and unannotated. The first two were labelled only on the customer channel and are split into three sets: train, dev and test. We leverage the test sets of this annotated data to extract the intent schemes for each domain. We discarded 7 conversational markers or technical markers that are present in all the intent sets: openinggreeting, closinggreeting, confirmation, rejection, contentonly, thankyou and outofdomain. For our experiments, we are using the unannotated dataset since it contains the full dialogues (customer and agent channels).

customer: hello
agent: Hello there! Welcome to Inflamites Cable/Media service, how may I help you today?
customer: i want to purchase new cable service
agent: …

Table 1: Example of dialogue formatted from the MultiDoGo media dataset.

3.2 Pipeline

The pipeline is illustrated in Figure 1. It is composed of five steps:

Intent Extraction: Extract $N$ intent-like candidates based on the span extraction done with an extractive Question-Answering model.
Validation of Spans: Check that the span corresponds to a phrase expressing an intent. We impose four main criteria: all candidates answer the question (i.e. no unanswerable token), the Part-of-Speech tags contain an action-object phrase, the sentence has an appropriate form (i.e. length and clean formatting), and the span comes from a customer utterance only.
Sentence Embedding: Encode the spans into sentence embeddings.
Clustering Low-Level: Apply a density-based clustering to find the meaningful clusters from densely packed regions. We assume that frequent fine-grained intent spans are relevant to consider as low-level intents.
Clustering Top-Level: Apply a hierarchical clustering on the low-level cluster-center embeddings to establish the structure of the fine-grained intent spans into high-level intents.

Figure 1: Diagram of the Intent-Landscape Pipeline.

Model Size	Creator	Name	F1	EM
small	mrm8488	`electra-small-finetuned-squadv2`	73.4	69.7
base	PremalMatalia	`electra-base-best-squad2`	83.2	79.3
large	ahotrod	`electra_large_discriminator_squad2_512`	90.0	87.0

Table 2: Top Question-Answering models fine-tuned on SQuAD2, selected on the HuggingFace Hub.

3.2.1 Extractive Question-Answering Model

The extractive Question-Answering (QA) task relies on two inputs: a question and a context paragraph. Both are provided to the transformer encoder with the SEP token as a delimiter. The goal is to extract a span of text — start index to end index inside the token sequence — from the context paragraph that answers the question. It is common to use a ranking strategy to consider the top $K$ potential answers. In our work, we leverage the pipeline implementation from HuggingFace to generate our candidates Wolf et al. (2020). We can usually answer the question with a span of text from the paragraph. However, the SQuAD2 dataset Rajpurkar et al. (2018) was built with unanswerable questions as well. The authors used the empty string ("") as a span answer for a question that we cannot answer with the context paragraph. We refer in this paper to the unanswerable situation as "impossible" as in "impossible to answer". We use this in our validation step Ignore any Impossible (see Section 3.3.1).
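As a rough illustration of the top-$K$ ranking idea, the sketch below scores every candidate span by the sum of its start and end scores and keeps the best $K$. It is not the HuggingFace implementation: the function names, the additive scoring, the `max_span_len` parameter and the convention that the span (0, 0) on the first special token stands for the empty "impossible" answer are all assumptions made for this example.

```python
from typing import List, Tuple

Span = Tuple[int, int, float]  # (start index, end index, score)


def top_k_spans(
    start_scores: List[float],
    end_scores: List[float],
    k: int = 10,
    max_span_len: int = 15,  # hypothetical cap on span length, in tokens
) -> List[Span]:
    """Rank candidate spans by start score + end score and keep the top k.

    Index 0 is assumed to be the special first token; the span (0, 0)
    stands for the empty "impossible" answer.
    """
    n = len(start_scores)
    candidates: List[Span] = []
    for start in range(n):
        for end in range(start, min(n, start + max_span_len)):
            if start == 0 and end > 0:
                continue  # only the span (0, 0) may touch the special token
            candidates.append((start, end, start_scores[start] + end_scores[end]))
    candidates.sort(key=lambda span: span[2], reverse=True)
    return candidates[:k]


def has_impossible(spans: List[Span]) -> bool:
    """True if the empty-answer span (0, 0) is among the candidates."""
    return any(start == 0 and end == 0 for start, end, _ in spans)
```

With such a ranking, a dialogue whose top $K$ contains the (0, 0) span would be the kind of case that the Ignore any Impossible validation discards.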

3.2.2 Dialogue Pre-processing

To form the context paragraph, we concatenate the utterances at the conversation level according to their turn order. We also prepend the channel name to each utterance, separated by a colon ":", and we append a line return at the end. We hypothesize that this format would help the question-answering transformer model leverage similar dialogue patterns seen during its pre-training. The dialogue string takes the appearance of the example in Table 1.
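A minimal sketch of this formatting step follows. The function name `format_dialogue` and the representation of a dialogue as a list of (channel, utterance) pairs are assumptions for illustration only.

```python
from typing import Iterable, Tuple


def format_dialogue(turns: Iterable[Tuple[str, str]]) -> str:
    """Build the QA context paragraph from turns given in conversation order.

    Each utterance is prefixed with its channel name and a colon,
    and terminated by a line return.
    """
    return "".join(f"{channel}: {utterance}\n" for channel, utterance in turns)


context = format_dialogue([
    ("customer", "hello"),
    ("agent", "Hello there! How may I help you today?"),
    ("customer", "i want to purchase new cable service"),
])
```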

3.2.3 Sentence Embedding

We encode the candidate spans into sentence embeddings using the sentence-transformer bi-encoder approach Reimers and Gurevych (2019). It uses an encoding transformer (e.g. BERT) with a pooling layer on all contextual embeddings, trained in a Siamese fashion under the cosine similarity.

3.2.4 Low-level Density Clustering

Since we do not want a fixed number of clusters and we want to focus on dense regions of the embedding space, we used the HDBSCAN algorithm McInnes et al. (2017) to determine the low-level clusters based on the cosine distance, given by the formula: $1 - \text{cosine similarity}$. HDBSCAN has the advantage of capturing the hierarchical structure of the space, which lets it determine the distance threshold of each cluster; relying on a single global threshold is a limitation of the DBSCAN algorithm Ester et al. (1996). It is also faster than OPTICS Ankerst et al. (1999), which we experimented with and found to produce closely matching outputs in our pipeline. These density-clustering techniques do not rely on a fixed number of clusters, but they filter out noise by focusing on clusters with more than min_cluster_size examples. We use 2 by default in our experiments, but larger domains (airline and media) were less noisy with slightly higher values (4 and 3, respectively).
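The sketch below only illustrates the distance used for clustering: it builds the pairwise cosine-distance matrix of a set of span embeddings. The clustering itself is left to a library. Passing such a matrix to a density-clustering implementation as a precomputed metric is one possible setup, not necessarily the one used in our experiments, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, i.e. 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def cosine_distance_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise cosine distances with a zero diagonal."""
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(embeddings[i], embeddings[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```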

3.2.5 Top-Level Hierarchical Clustering

To build the hierarchy of intents on low-level cluster centers, we rely on hierarchical clustering Ward Jr (1963) by average link on cosine distance. We compute each low-level cluster center as the average of all its members. The main hyperparameter is the distance_threshold, which sets the cutting point to form the top-level clusters. We manually tune it between $0.2$ and $0.5$, which roughly translates in cosine distance to many small clusters and few large clusters, respectively. We select the value that gives the most relevant clusters visually on the t-SNE 2D plot (e.g. each blob should have the same colour), and we validate by inspection that it keeps the clusters semantically homogeneous (e.g. all order intents should form a top-level cluster order).
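To make the role of distance_threshold concrete, here is a small pure-Python sketch of average-link agglomerative clustering on a precomputed distance matrix, together with the cluster-center average. It is a didactic, quadratic-time version with illustrative function names. Library implementations such as scikit-learn's `AgglomerativeClustering` (with `linkage="average"` and `distance_threshold`) offer the same behaviour, but this section does not state which implementation was used.

```python
from typing import List, Sequence


def cluster_center(members: Sequence[Sequence[float]]) -> List[float]:
    """Center of a cluster: the average of its member embeddings."""
    dim = len(members[0])
    return [sum(m[d] for m in members) / len(members) for d in range(dim)]


def average_link_clusters(
    dist: Sequence[Sequence[float]], distance_threshold: float
) -> List[List[int]]:
    """Merge the closest pair of clusters until no pair is below the threshold."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average link: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] >= distance_threshold:
            break  # the cutting point is reached
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters
```

A low threshold stops the merging early and leaves many small top-level clusters; a high threshold lets merging continue and yields few large ones.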

3.2.6 Manual Mapping After Top-Level

From the top-level clustering, we observe imperfections in the hierarchy because of the variability of the distance of discrimination between clusters, i.e. some pairs of clusters tend to be closer than others from a semantic similarity perspective. Our experiments include a final manual step of cleaning the automated outputs from the pipeline based on semantic meaning, which includes merging similar clusters and grouping one cluster contained in another. We also included that step to match the intents in the test sets for evaluation purposes. We leave further automation and refinement of this step as future work.

3.2.7 Count Estimation Strategy

Most of the time, we still have many dialogues that we cannot attach to an intent after the clustering. We observed in our experiments that this usually concerns between 10% and 25% of the dialogues. To complete the estimation of intent volumes, we apply a completion strategy by passing through the candidate spans of these dialogues, and we look at the similarity between these and the low-level cluster centers. We consider the low-level cluster centers since they are specific and give a more precise similarity. We take the minimum similarity scores across the cluster centers, and if it is below some threshold, we have found a cluster to assign to this dialogue. We named that threshold the force_cluster_threshold. We can tune this threshold given the amount of noise we can tolerate. In our experiments, we aim to keep accurate and homogeneous clusters. Thus, we keep that threshold between $0.2$ and $0.3$.
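One possible reading of this completion strategy is sketched below. It assumes that the score being minimized is the cosine distance ($1 - \text{cosine similarity}$) between a candidate span and a low-level cluster center, so that a small value means a close match; this interpretation, the function names and the dictionary layout are assumptions of the sketch.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, i.e. 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def force_cluster(
    span_embeddings: Sequence[Sequence[float]],
    centers: Dict[str, Sequence[float]],
    force_cluster_threshold: float = 0.25,
) -> Optional[str]:
    """Assign an unclustered dialogue to its nearest low-level cluster.

    Returns the cluster label whose center is closest to any candidate span
    of the dialogue, or None when even the closest center is too far away.
    """
    best_label, best_dist = None, float("inf")
    for span in span_embeddings:
        for label, center in centers.items():
            d = cosine_distance(span, center)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label if best_dist < force_cluster_threshold else None
```

Dialogues for which the function returns None stay unassigned and do not contribute to any intent count.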

3.3 Experiments

3.3.1 Question-Answering Validation Experiment

First, we experiment with the QA models to understand the boundaries of their application in the intent-landscape pipeline. We change two parameters (model size and question prompting) and measure their impact on the intent linguistic validation rates in the validation step of the pipeline. For the model sizes, we considered all three sizes of the ELECTRA model Clark et al. (2020): small, base and large. We selected the models fine-tuned on the SQuAD2 dataset Rajpurkar et al. (2018) with the highest F1 score on the HuggingFace Hub² Wolf et al. (2020), listed in Table 2.

² https://huggingface.co/models

Nonetheless, we still need a well-formulated question to query the intent-like spans from the customer. Thus, we select three different prompting questions:

Q1: What is the main reason of the call mentionned by the customer?
Q2: What can the agent help the customer with?
Q3: What is the customer’s first intent?

With Q1, we emphasize what the customer mentions during the call, and we specify that we are looking for the "main reason". On the other hand, we formulate Q2 to look for what the agent can do for the customer using the verb "help". Finally, the prompt Q3 ignores the call aspect and only asks for the customer’s "first intent" in technical terms.

Figure 2: Validations applied to span extracted by the extractive Question-Answering model.

We quantify the quality of the extraction and the generalization of the QA model to the dialogue domain based on the Validation of Spans in Figure 1. In Figure 2, we break down that step into four components:

Ignore any Impossible: Remove the dialogue if any of its candidate spans is the impossible marker in the top $K$ extracted by the QA model.
POS validations: Validate the candidate span based on what we expect from an intent span. We assume that both an action and an object are present. We relate these to the VERB and NOUN Parts-of-Speech, respectively. We leveraged the industrial-grade English POS tagger from spaCy Honnibal and Montani (2017). For instance, we expect the intent text spans to be extracted similarly to "I [want]$_{\text{VERB}}$ black [smartphone]$_{\text{NOUN}}$". Moreover, the pronoun "I" could be dropped in this example, and the span would still be considered valid.
Sentence validations: We check that the span does not contain any dialogue format artefacts like channel prefixes ("customer: " and "agent: ") as well as a line return. Afterwards, we remove any span below 2 tokens or beyond 12 tokens based on whitespace tokenization. We assume that only one word cannot be an intent, and a span above 12 tokens means a lack of precision from the QA model.
Customer Channel Validation: Remove the span candidate if it is not from the customer channel of the dialogue.
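The span-level validations above can be sketched as three small predicates. This is a simplified illustration: POS tags are passed in as (token, tag) pairs so that the tagger itself stays outside the example, the span position is assumed to be available as a character offset into the formatted dialogue, and all function names are illustrative.

```python
from typing import List, Tuple

CHANNEL_PREFIXES = ("customer: ", "agent: ")


def pos_valid(tagged_span: List[Tuple[str, str]]) -> bool:
    """POS validation: the span needs an action (VERB) and an object (NOUN)."""
    tags = {pos for _, pos in tagged_span}
    return "VERB" in tags and "NOUN" in tags


def sentence_valid(span: str, min_tokens: int = 2, max_tokens: int = 12) -> bool:
    """Sentence validation: no dialogue-format artefacts, 2 to 12 whitespace tokens."""
    if "\n" in span or any(prefix in span for prefix in CHANNEL_PREFIXES):
        return False
    return min_tokens <= len(span.split()) <= max_tokens


def from_customer(dialogue: str, span_start: int) -> bool:
    """Customer channel validation: the span starts inside a customer utterance."""
    line_start = dialogue.rfind("\n", 0, span_start) + 1  # 0 when on the first line
    return dialogue.startswith("customer: ", line_start)
```

For example, a span starting inside the line "customer: i want to purchase new cable service" passes the channel check, whereas a span taken from an agent line does not.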

We count all the remaining dialogues containing an intent-candidate span after applying all validations sequentially. Then, we calculate the absolute percentages by dividing by the initial count of dialogues for all validation steps. We consider a combination of a model and a prompt question better if it finds the most dialogues with at least one valid intent span.
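The absolute percentages can be computed as in the following sketch, where each validation is modelled as a function that takes the candidate spans of one dialogue and returns the spans that survive. The function name, the step names and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Step = Tuple[str, Callable[[List[str]], List[str]]]


def validation_rates(
    candidates_per_dialogue: Sequence[List[str]], steps: Sequence[Step]
) -> Dict[str, float]:
    """Share of the initial dialogues that still hold a candidate after each step."""
    total = len(candidates_per_dialogue)
    if total == 0:
        raise ValueError("at least one dialogue is required")
    remaining = [list(spans) for spans in candidates_per_dialogue]
    rates: Dict[str, float] = {}
    for name, keep in steps:
        # Validations are applied sequentially, each on the previous output.
        remaining = [keep(spans) for spans in remaining]
        rates[name] = sum(1 for spans in remaining if spans) / total
    return rates


# Toy example: three dialogues, two validation steps.
toy_dialogues = [["i want a plan", "hello"], ["", "pay my bill"], ["ok"]]
toy_steps = [
    # The whole dialogue is dropped when any candidate is the empty answer.
    ("ignore_any_impossible", lambda spans: [] if "" in spans else spans),
    ("sentence_length", lambda spans: [s for s in spans if 2 <= len(s.split()) <= 12]),
]
toy_rates = validation_rates(toy_dialogues, toy_steps)
```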

3.3.2 Intent Scheme Recovery Experiment

In this experiment, we apply the pipeline in Figure 1 on five of the six domains in MultiDoGo. We consider that the ground-truth intents are associated with our top-level intents extracted by the pipeline. To validate the pipeline experimentally, we need to find these associations. Therefore, we manually assess the mapping from the resulting top-cluster spans to the known domain-specific intents. We recover the domain intents from the labelled test set data and remove the conversational and system markers. We map each extracted top-level intent span (e.g. i need to my seat assignment) based on how close it is to an actual domain intent (e.g. getseatinfo). If it means the same, or the span can fall within its scope without any other match, we associate them. Otherwise, we assign the OTHER intent, which is our marker for a newly discovered intent. Two authors participated in this effort, and a review session set the final associations. Finally, we compute the recall of intents, counting an intent as found if at least one top-level span matched it.
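The recall computation can be sketched as follows. The intent names come from the airline scheme, but the cluster names and the mapping itself are invented for illustration and do not reproduce our actual mapping; counting new intents as the number of top-level clusters mapped to OTHER is likewise an assumption of the sketch.

```python
from typing import Dict, Set, Tuple

OTHER = "OTHER"  # marker for a newly discovered intent


def intent_recall(mapping: Dict[str, str], scheme: Set[str]) -> Tuple[float, int]:
    """Return (recall over the known scheme, number of clusters mapped to OTHER).

    An intent counts as found if at least one top-level cluster maps to it.
    """
    if not scheme:
        raise ValueError("the intent scheme must not be empty")
    found = {intent for intent in mapping.values() if intent in scheme}
    new = sum(1 for intent in mapping.values() if intent == OTHER)
    return len(found) / len(scheme), new


airline_scheme = {"changeseatassignment", "getboardingpass", "bookflight", "getseatinfo"}
hypothetical_mapping = {
    "seat assignment": "getseatinfo",
    "boarding pass": "getboardingpass",
    "book a flight": "bookflight",
    "lost baggage": OTHER,
}
recall, new_intents = intent_recall(hypothetical_mapping, airline_scheme)
```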

3.3.3 Cluster-Quality Experiment

To demonstrate the quality of the extracted clusters, we report the classification metrics Precision, Recall and F1-score on the annotated test sets. We consider only the intents with support above 10 to ensure significant results. We use the semantic similarity based on cosine as a zero-shot classification applied on the set of valid span candidates $\vec{s}$ after both the QA and the validation step (see Figure 1).

We select the estimated intent $\hat{y}$ by picking the most similar low-level cluster center $\vec{c}_{i}$ (see equation 1). We assign "unlabeled" to any example with all similarity scores below $0.4$. Then, we trace the top-level cluster from the low-level one using the taxonomy. To get our final metrics, we use our manual mapping to compute the classification scores with the ground truth (see Appendix A).

$$\hat{y} = \operatorname*{arg\,max}_{i} \left( \frac{\vec{s} \cdot \vec{c}_{i}}{\|\vec{s}\| \, \|\vec{c}_{i}\|} \right) \tag{1}$$
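Equation 1 and the 0.4 cut-off translate into a few lines of code. The sketch below is illustrative: the function names, the dictionary of cluster centers and the `min_similarity` parameter name are assumptions, and the step that traces a low-level cluster to its top-level cluster is left out.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_span(
    span_embedding: Sequence[float],
    centers: Dict[str, Sequence[float]],
    min_similarity: float = 0.4,
) -> str:
    """Zero-shot label: the most similar low-level cluster center, or "unlabeled"."""
    scores = {
        label: cosine_similarity(span_embedding, center)
        for label, center in centers.items()
    }
    if not scores:
        return "unlabeled"
    best = max(scores, key=scores.get)  # arg max over the cluster centers
    return best if scores[best] >= min_similarity else "unlabeled"
```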

4 Results

In this section, we present the results of three experiments: question-answering extraction by linguistic validations, intent scheme recovery by our intent-landscape pipeline, and validation of the quality of intent-span clusters through zero-shot similarity classification.

4.1 Question-Answering Validation Experiment

Figure 3: Validations done on the 10 intent-like span candidates for all MultiDoGo dialogues extracted with the ELECTRA models fine-tuned on SQuAD2 (small, base and large). The absolute percentages are computed based on the presence of any valid candidate for each conversation remaining after each validation divided by the initial amount of conversations.

We observe drastic changes in the results in Figure 3 across the combinations of model sizes, prompting questions and domains. The large model achieves the best results with the first prompt Q1, which is formulated with general terms like "main reason of the call" and asks to focus on the "customer" only. This combination reaches a final validation rate above 85%, except in the fast-food domain. On the other hand, the small model can achieve a considerable validation rate with the last prompt Q3 right before the stage named Customer Channel Validation. However, it fails at this last validation step. Therefore, we hypothesize that the small model does not have a good representation of the dialogue structure in a text, which is more abstract and should emerge in the last layers. With the wrong prompting, its performance is also less reliable. Our results show that the base model can perform well in the insurance and finance domains. In the other domains, this model size seems to struggle by relying heavily on the Question-Answering unanswerable span (i.e. ""), which reduces the number of dialogues from the first validation step, Ignore any Impossible. In general, the Q2 prompt performed the worst at the last stage of our validation process, i.e. Customer Channel Validation. We hypothesize that mentioning both the agent and the customer inside the question confuses the model.

Our results highlight that the fast-food domain is more challenging. When observing the intents and the dialogues, we noticed that the agents and the customers use particular terms around the meals. On top of this, we noted similar sentence structures about ordering food. Thus, we argue that these two aspects make it difficult for the transfer learning of the QA models to generalize to that domain. We speculate that a conversational design grounded in best practices would contain only a single order intent, combined with named-entity extraction on the meal names.

In section 4.2, we base our experiments on the best combination formed by the large model with the first question Q1. We also ignore the fast-food domain for the reasons mentioned above.

4.2 Intent Scheme Recovery Experiment

In Table 3, we summarize our findings on the intents of all five domains using our pipeline. Overall, we observed a recall of 94.3% on average, which indicates that we found the vast majority of intents. Our pipeline only missed one intent in media and one in software. We also discovered between 3 and 6 new intents per domain. In Appendix A, we share our mapping, the t-SNE 2D plots of the clusters and the clustering hyperparameters.

Domain	Total	Found	New	Recall
airline	4	4	4	100%
media	6	5	3	83%
insurance	3	3	6	100%
finance	10	10	5	100%
software	8	7	5	88%

Table 3: Summary of results with the mappings based on the top clusters, compared with intent schemes (Total column ignoring intents that are conversation markers or outofdomain). The Total, Found and New are numbers of intents. The recall of found intents by our pipeline in the original intent scheme is the percent of found/total.

4.3 Cluster-Quality Experiment

Intent	P	R	F1
changeseatassignment	0.92	0.33	0.48
getboardingpass	0.96	1.0	0.98
bookflight	0.89	0.96	0.92
getseatinfo	0.15	0.93	0.26

Table 4: Airline report on selected intents (support above 10).

Intent	P	R	F1
startserviceintent	0.8	0.68	0.73
viewbillsintent	0.95	1.0	0.97

Table 5: Media report on selected intents (support above 10).

Intent	P	R	F1
checkclaimstatus	0.91	0.98	0.95
getproofofinsurance	0.98	1.0	0.99
reportbrokenphone	0.76	1.0	0.86

Table 6: Insurance report on selected intents (support above 10).

Intent	P	R	F1
reportlostcard	0.82	1.0	0.9
updateaddress	0.9	1.0	0.95
checkbalance	0.97	0.99	0.98
transfermoney	0.91	0.97	0.94
disputecharge	0.67	0.98	0.8

Table 7: Finance report on selected intents (support above 10).

Intent	P	R	F1
reportbrokensoftware	0.64	0.97	0.77
softwareupdate	0.9	0.49	0.63
expensereport	0.62	0.96	0.75
startorder	0.61	0.95	0.74
checkserverstatus	0.88	0.91	0.9

Table 8: Software report on selected intents (support above 10).

We displayed the classification results for all domains in Tables 4, 5, 6, 7 and 8. We note that most of the Precisions, Recalls and F1-scores are above $0.9$ , which is considerable for a zero-shot setting and indicates a high-quality clustering. There are a few lower values that are related to the incompatibility between our pipeline and the flat taxonomy design of the MultiDoGo dataset.

In the following paragraphs, we use the centered dot " $\cdot$ " to indicate that the intents on both sides have a relatively high similarity score compared to other intents in the domain. Two pairs of intents that particularly caught our attention are changeseatassignment $\cdot$ getseatinfo from the airline domain, and startserviceintent $\cdot$ getinformationintent from the media domain. In the former case, both intents frequently contain the phrase "seat assignment", making the two semantically similar. In the latter case, both intents are syntactically similar, following the patterns "I want …" and "I’d like to …". Due to their semantic or syntactic similarities, the distances between the clusters of these pairs of intents are much smaller than those between other intents in the same domain. Given the flat taxonomy design of the MultiDoGo dataset, it is more difficult for the pipeline to tell these pairs of similar intents apart when the other intents are much further away from each other. As a result of this problem, we discovered that utterances such as "I would like to buy 10 Gb plan", which belongs to startserviceintent, are labelled as getinformationintent.

This issue can be solved by constructing a two-stage taxonomy, in which the first level maximizes the dissimilarity between intents at a high level, and the second level targets more fine-grained distinctions. For example, in the software domain the intents can be naturally organized into three broader groups based on their similarity: softwareupdate $\cdot$ reportbrokensoftware $\cdot$ checkserverstatus, expensereport $\cdot$ getpromotions, and startorder $\cdot$ changeorder $\cdot$ stoporder.

5 Conclusion

In conclusion, we proposed a pipeline to solve our intent-landscape task by extracting the intents and an intent taxonomy from a corpus composed of real-life customer-service dialogues. First, we experimented with the first two stages that extract and validate intent-like spans from dialogues. We showed the generalization ability of the Question-Answering ELECTRA large model to a dialogue context, with a linguistic validation of the extracted intent spans above 85%. We recovered the intent schemes from five domains with our pipeline at an average recall of 94.3%. Finally, we suggested that a two-level taxonomy could alleviate flaws in the airline, media, and software schemes. We have seen two limitations: particular domains like fast food, and the manual mapping step after our top-level clusters. We leave these two for future work. Reasonable approaches to tackling these two issues could be leveraging domain adaptation and upgrading semantic similarity with natural language inference.

Acknowledgements

We thank Bell Canada (BCE inc.) as well as everyone involved: Nassim Guerroumi, Ryan Levman, Stephanie Maccio, Jeff Kurys, Michel Richer, and Alan Khalil.

References

S. Agarwal and A. Sureka (2017) Characterizing linguistic attributes for automatic classification of intent based racist/radicalized posts on tumblr micro-blogging website. CoRR abs/1701.04931.
M. Ankerst, M. M. Breunig, H. Kriegel, and J. Sander (1999) OPTICS: ordering points to identify the clustering structure. ACM Sigmod record 28 (2), pp. 49–60.
P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, U. Stefan, R. Osman, and M. Gašić (2018) MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
R. Cai, B. Zhu, L. Ji, T. Hao, J. Yan, and W. Liu (2017) An cnn-lstm attention approach to understanding user query intent from online health communities. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 430–437.
A. Chatterjee and S. Sengupta (2020) Intent mining from past conversations for conversational agent. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4140–4152.
D. Chen, H. Chen, Y. Yang, A. Lin, and Z. Yu (2021) Action-based conversations dataset: a corpus for building more in-depth task-oriented dialogue systems. arXiv preprint arXiv:2104.00783.
Z. Chen, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh (2013) Identifying intention posts in discussion forums. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 1041–1050.
K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190, pp. 12–16.
M. Eric, R. Goel, S. Paul, A. Sethi, S. Agarwal, S. Gao, and D. Hakkani-Tur (2019) MultiWOZ 2.1: multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.
M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231.
C. Goo, G. Gao, Y. Hsu, C. Huo, T. Chen, K. Hsu, and Y. Chen (2018) Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 753–757.
V. Gupta, D. Varshney, H. Jhamtani, D. Kedia, and S. Karwa (2014) Identifying purchase intent from social posts. Proceedings of the International AAAI Conference on Web and Social Media 8 (1), pp. 180–186.
S. Hang (2021) Clustering short texts: categorizing initial utterances from customer service dialogue agents.
C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The atis spoken language systems pilot corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’90, USA, pp. 96–101.
D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. CoRR abs/1610.02136.
M. Honnibal and I. Montani (2017) spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
Y. Kim, K. Stratos, and D. Kim (2017) Adversarial adaptation of synthetic or stale data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1297–1307.
Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) Dailydialog: a manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.
S. Liang, Y. Li, and R. Srikant (2017) Principled detection of out-of-distribution examples in neural networks. CoRR abs/1706.02690.
T. Lin and H. Xu (2019) Deep unknown intent detection with margin loss. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5491–5496.
B. Liu and I. R. Lane (2017) Multi-domain adversarial learning for slot filling in spoken language understanding. CoRR abs/1711.11310.
L. McInnes, J. Healy, and S. Astels (2017) Hdbscan: hierarchical density based clustering. Journal of Open Source Software 2 (11), pp. 205.
G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig (2015) Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3), pp. 530–539.
D. Peskov, N. Clarke, J. Krone, B. Fodor, Y. Zhang, A. Youssef, and M. Diab (2019) Multi-domain goal-oriented dialogues (multidogo): strategies toward curating and annotating large scale dialogue data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4526–4536.
A. Popov, V. Bulatov, D. Polyudova, and E. Veselova (2019) Unsupervised dialogue intent detection via hierarchical topic model. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, pp. 932–938.
P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789.
H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2018) Towards empathetic open-domain conversation models: a new benchmark and dataset. arXiv preprint arXiv:1811.00207.
A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2020) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8689–8696.
N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
L. Shu, H. Xu, and B. Liu (2017) DOC: deep open classification of text documents. CoRR abs/1709.08716.
N. Vedula, R. Gupta, A. Alok, and M. Sridhar (2020) Automatic discovery of novel intents & domains from text utterances. arXiv preprint arXiv:2006.01208.
N. Vedula, N. Lipka, P. Maneriker, and S. Parthasarathy (2019) Towards open intent discovery for conversational text. arXiv preprint arXiv:1904.08524.
J. Wang, G. Cong, X. Zhao, and X. Li (2015) Mining user intents in twitter: a semi-supervised approach to inferring intent categories for tweets. Proceedings of the AAAI Conference on Artificial Intelligence 29 (1).
J. H. Ward Jr (1963) Hierarchical grouping to optimize an objective function. Journal of the American statistical association 58 (301), pp. 236–244.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45.
Q. Yu and W. Lam (2018) Product question intent detection using indicative clause attention and adversarial learning. In Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ’18, New York, NY, USA, pp. 75–82. ISBN 9781450356565.
X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, and J. Chen (2020) MultiWOZ 2.2: a dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL 2020, pp. 109–117.
H. Zhang, X. Li, H. Xu, P. Zhang, K. Zhao, and K. Gao (2021) TEXTOIR: an integrated and visualized platform for text open intent recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 167–174.
H. Zhang, H. Xu, and T. Lin (2021) Deep open intent classification with adaptive decision boundary. Proceedings of the AAAI Conference on Artificial Intelligence 35 (16), pp. 14374–14382.
S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243.
X. Zhang and H. Wang (2016) A joint model of intent determination and slot filling for spoken language understanding. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pp. 2993–2999. ISBN 9781577357704.

Appendix A Appendix

Top Cluster	Intent	Volume
can you pls send me my boarding pass	getboardingpass	101
i need to check my seat assignment	getseatinfo	91
i want book a flight ticket	bookflight	90
i need to check my seat	getseatinfo	49
i already book a ticket but i change this seat	changeseatassignment	23
i need to change seat	changeseatassignment	17
i need your help	OTHER	13
i need to my seat assignment	getseatinfo	13
boarding pass to be sent to my email address	getboardingpass	11
i wanna change my seat assignment	changeseatassignment	10
could you tell me my seat arrangement	getseatinfo	9
i need middle seat	changeseatassignment	8
i need ticket for chennai	bookflight	5
i need a boarding pas	getboardingpass	5
i want to know saet no	getseatinfo	3
i have an upcoming flight	OTHER	3
i already book a ticket but i want to chnage it	OTHER	3
i want to book one way ticket	bookflight	2
i need flight under $300	bookflight	1
i want to changing the window seat	changeseatassignment	1
i need a plan in ticked booking	OTHER	1
i wand a plain ticket	bookflight	1
i am need a flight from newyork to texas	bookflight	1

Table 9: Manual mapping from top clusters to intents for the airline domain, with estimated volumes on the test set.
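
The per-intent volume implied by such a mapping is the sum of the volumes of all top clusters assigned to the same intent. A minimal sketch in Python, using the rows of Table 9 as input; the names `TABLE_9` and `intent_volumes` are illustrative assumptions and not part of the pipeline described in this paper:

```python
from collections import Counter

# Rows of Table 9 (airline): (top-cluster utterance, mapped intent, volume).
# Utterances are kept verbatim, including their original spelling.
TABLE_9 = [
    ("can you pls send me my boarding pass", "getboardingpass", 101),
    ("i need to check my seat assignment", "getseatinfo", 91),
    ("i want book a flight ticket", "bookflight", 90),
    ("i need to check my seat", "getseatinfo", 49),
    ("i already book a ticket but i change this seat", "changeseatassignment", 23),
    ("i need to change seat", "changeseatassignment", 17),
    ("i need your help", "OTHER", 13),
    ("i need to my seat assignment", "getseatinfo", 13),
    ("boarding pass to be sent to my email address", "getboardingpass", 11),
    ("i wanna change my seat assignment", "changeseatassignment", 10),
    ("could you tell me my seat arrangement", "getseatinfo", 9),
    ("i need middle seat", "changeseatassignment", 8),
    ("i need ticket for chennai", "bookflight", 5),
    ("i need a boarding pas", "getboardingpass", 5),
    ("i want to know saet no", "getseatinfo", 3),
    ("i have an upcoming flight", "OTHER", 3),
    ("i already book a ticket but i want to chnage it", "OTHER", 3),
    ("i want to book one way ticket", "bookflight", 2),
    ("i need flight under $300", "bookflight", 1),
    ("i want to changing the window seat", "changeseatassignment", 1),
    ("i need a plan in ticked booking", "OTHER", 1),
    ("i wand a plain ticket", "bookflight", 1),
    ("i am need a flight from newyork to texas", "bookflight", 1),
]


def intent_volumes(rows):
    """Sum the volumes of all top clusters mapped to the same intent."""
    totals = Counter()
    for _utterance, intent, volume in rows:
        totals[intent] += volume
    return totals


totals = intent_volumes(TABLE_9)
for intent, volume in totals.most_common():
    print(f"{intent}\t{volume}")
```

For Table 9, this yields 165 for getseatinfo, 117 for getboardingpass, 100 for bookflight, 59 for changeseatassignment, and 20 for OTHER, for a total volume of 461.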

Top Cluster	Intent	Volume
i want new cable service connection	startserviceintent	192
i like to purchase a new internet connection service	startserviceintent	125
i want internet plan	startserviceintent	59
i would like to sign up new service	startserviceintent	43
you have cleared all my queries	OTHER	22
i want view my bill	viewbillsintent	18
i want to phone service	startserviceintent	8
i need help	OTHER	5
i wanna buy a new purchase from you	startserviceintent	5
i purchase plan	startserviceintent	2
i request you to sign up to new connection	startserviceintent	1
i want to know my data usage bills	viewdatausageintent	1
i need intrnet service	startserviceintent	1
my bill keeps going up	viewbillsintent	1
i want cancel cable service	cancelserviceintent	1
internet is very slow	OTHER	1
bill keeps going up	viewbillsintent	1
i liked phone connetion	startserviceintent	1

Table 10: Manual mapping from top clusters to intents for the media domain, with estimated volumes on the test set.

Top Cluster	Intent	Volume
i want proof of insurance for my car	getproofofinsurance	221
i need status of my claim	checkclaimstatus	114
my phone screen is broken	reportbrokenphone	103
the year is also wrong	OTHER	10
i meet this work	OTHER	8
i need your help	OTHER	4
i need my cliam status	checkclaimstatus	2
my ssn number is lost	OTHER	1
i have bought a new car	OTHER	1
flxing screen	reportbrokenphone	1
my car met accident and bumper was damaged	OTHER	1
please fix my sreen	reportbrokenphone	1

Table 11: Manual mapping from top clusters to intents for the insurance domain, with estimated volumes on the test set.

Top Cluster	Intent	Volume
my credit card was lost	reportlostcard	120
i want to change address on my account	updateaddress	88
need to check my account balance	checkbalance	64
i have error on my credit card bill	disputecharge	60
i need help from you	OTHER	26
i want to transfer money to another account	transfermoney	25
i want to close my account	closeaccount	13
i wanted to inquire whether new lower rates are availble to me	checkoffereligibility	13
my old card expired	replacecard	9
i need to block my credit card	reportlostcard	8
i need transfer money from my a/c to another	transfermoney	8
want to know the banks routing number	getroutingnumber	8
i made a purchase of $400	OTHER	7
i need to order checks	orderchecks	5
some amount debited my card	disputecharge	4
my cars missed	reportlostcard	4
need to check my balance in my a/c	checkbalance	4
there is an error	OTHER	3
i need a education loan	OTHER	3
i have cheack my balance	checkbalance	2
i have check my account	OTHER	1
i want transfer	transfermoney	1
i need to check my balanace	checkbalance	1

Table 12: Manual mapping from top clusters to intents for the finance domain, with estimated volumes on the test set.

Top Cluster	Intent	Volume
i want to reimburse my travel expense	expensereport	119
my skype application is not working	reportbrokensoftware	48
i need to buy keyboards	startorder	47
my outlook software is not working properly	reportbrokensoftware	40
i want check the global status of servers	checkserverstatus	32
i want to set up new recurring orders	startorder	24
please check my puchasing item	OTHER	21
please provide dollar value for food and hotel	OTHER	20
i need software update	softwareupdate	19
my application not working properly	reportbrokensoftware	18
i want buy a musical instruments	startorder	14
i need this model psr-e363	startorder	10
missing periodic software updates	softwareupdate	10
my travel expenses is very high	OTHER	9
i need a help	OTHER	8
i need a some information	OTHER	6
my whatsapp messages are not send	reportbrokensoftware	5
i need travel ticket toreimbursement	expensereport	4
i want reorder basic keybords 10 pieces to 15 pieces	startorder	4
i want report on your software on outlook	reportbrokensoftware	3
what is the procedure to summit my expenses	expensereport	2
i want to cancel one item in my order list	stoporder	2
facing the server error	checkserverstatus	2
i need to report my travel expanses	expensereport	1
i want finance help	getpromotion	1
hike message is not received	reportbrokensoftware	1
i have an issue in my skpe app	reportbrokensoftware	1

Table 13: Manual mapping from top clusters to intents for the software domain, with estimated volumes on the test set.

Figure 4: t-SNE 2D plot for the airline domain, with the colorbar indicating the top-level cluster and the surface area color indicating the manually defined cluster. The names are intents from the test set with a support higher than 10.

Domain	min_cluster_size	distance_threshold	force_cluster_threshold
airline	4	0.29	0.3
media	3	0.42	0.2
insurance	2	0.5	0.2
finance	2	0.45	0.2
software	2	0.5	0.3

Table 14: Cluster hyperparameters.
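
For reuse, the values in Table 14 can be stored as a per-domain configuration. A minimal sketch in Python: the field names follow the table header, while the `ClusterParams` container and the `params_for` helper are hypothetical names introduced for illustration only. How each value is consumed by the clustering stages is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterParams:
    """Hypothetical container; field names follow the header of Table 14."""
    min_cluster_size: int
    distance_threshold: float
    force_cluster_threshold: float


# Values copied from Table 14, one entry per domain.
CLUSTER_PARAMS = {
    "airline": ClusterParams(4, 0.29, 0.3),
    "media": ClusterParams(3, 0.42, 0.2),
    "insurance": ClusterParams(2, 0.5, 0.2),
    "finance": ClusterParams(2, 0.45, 0.2),
    "software": ClusterParams(2, 0.5, 0.3),
}


def params_for(domain: str) -> ClusterParams:
    """Look up the hyperparameters of a domain, failing loudly if unknown."""
    try:
        return CLUSTER_PARAMS[domain]
    except KeyError:
        known = ", ".join(sorted(CLUSTER_PARAMS))
        raise ValueError(f"unknown domain {domain!r}; expected one of: {known}") from None


print(params_for("airline"))
```

Keeping the values in one immutable mapping avoids scattering domain-specific constants across the clustering code.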