Focus-Driven Contrastive Learning for Medical Question Summarization

Ming Zhang^{1,2}, Shuai Dou^{3}, Ziyang Wang^{1,2}, Yunfang Wu^{1,3}

^{1} MOE, Key Laboratory of Computational Linguistics, Peking University

^{2} School of Software and Microelectronics, Peking University

^{3} School of Computer Science, Peking University

^{1} {zhangming,wzy232303}@stu.pku.edu.cn

^{2} {monkdou,wuyf}@pku.edu.cn

*Corresponding author.

Abstract

Automatic medical question summarization can significantly help the system to understand consumer health questions and retrieve correct answers. The Seq2Seq model based on maximum likelihood estimation (MLE) has been applied in this task, but it faces two general problems: the model cannot capture the question focus well, and the traditional MLE strategy lacks the ability to understand sentence-level semantics. To alleviate these problems, we propose a novel question focus-driven contrastive learning framework (QFCL). Specifically, we propose an easy and effective approach to generate hard negative samples based on the question focus, and exploit contrastive learning at both encoder and decoder to obtain better sentence-level representations. On three medical benchmark datasets, our proposed model achieves new state-of-the-art results, and obtains a performance gain of 5.33, 12.85 and 3.81 points over the baseline BART model on the three datasets respectively. Further human judgement and detailed analysis show that our QFCL model learns better sentence representations with the ability to distinguish different sentence meanings, and generates high-quality summaries by capturing the question focus.

1 Introduction

Input question: consumer health question (CHQ)

subject: gender dysphoria message: no health care on my son suffering from gender dysphoria what can we do to help him he worked out of high school no problems now not working and about shutting himself in his room 24/7 theres nothing this condition in our area we live in [location].no help in area what can we do he has had bad thoughts already please help us with some sort of info thank yuo [name] [location]

Golden summary: frequently asked question (FAQ):

Where can I find information on treatment and resources for gender dysphoria?

Summary by BART (baseline):

What are the treatments for weight loss?

Summary by our model:

What are the treatments for gender dysphoria?

Table 1: An example of medical question summarization in MeqSum dataset, where the question focus is highlighted in green. Summaries generated by BART and our model are also listed.

A growing number of health questions are raised by consumers on websites nowadays. These questions are usually written in natural language and include detailed and peripheral information not related to the answers. Summaries of such questions can greatly improve the performance in retrieving relevant answers Ben Abacha and Demner-Fushman (2019). Accordingly, the medical question summarization task is defined as summarizing the consumer health questions (CHQ) into frequently asked questions (FAQ), which are shorter but retain the essential information of the original question needed to get correct answers. An example of medical question summarization is shown in Table 1.

Seq2Seq neural models have been widely used in abstractive summarization Nallapati et al. (2016); Lewis et al. (2020); Zhang et al. (2020) and show promising potential. They have also been applied to medical question summarization, where they achieve the current state-of-the-art results. Ben Abacha and Demner-Fushman (2019) apply the pointer-generator model for this task. Yadav et al. (2021) present a reinforcement learning framework with a question-type identification reward and a question-focus recognition reward. Mrini et al. (2021) propose a multitask learning method by treating recognizing question entailment as an auxiliary task.

Figure 1: Sketch of our proposed contrastive learning framework. $M_{s}$ and $M_{h}$ represent the memory banks that contain simple negative samples and hard negative samples respectively. $R_{f}$, $R_{c}$ and $R_{g}$ denote the sentence representations of the FAQ, the CHQ and the generated summary. $L_{ctrS}$ and $L_{ctrH}$ are the contrastive learning losses on simple negative samples and hard negative samples respectively. $+$ indicates the positive sample, and $-$ indicates the negative sample.

In the medical question summarization task, the input question CHQ is always lengthy and contains redundant information, where some salient medical entities and the semantic focus of the question are vital to understanding users’ intention. However, capturing the question focus remains challenging for existing methods. As shown in Table 1, the focus "gender dysphoria" is wrongly replaced by "weight loss" in the summary generated by the fine-tuned BART, resulting in a completely different meaning from the original sentence.

For the medical question summarization task, the generated question summary is required to be semantically close to the reference question. However, most current pre-trained models such as BART Lewis et al. (2020) adopt maximum likelihood estimation (MLE) and mainly focus on the accuracy of predicting masked tokens, which does not guarantee the semantic similarity or dissimilarity of whole sentences. To address this issue, some previous works adopt reinforcement learning (RL) in the text summarization task Li et al. (2019); Paulus et al. (2018), but RL suffers from the noisy gradient estimation problem Greensmith et al. (2004), which makes the training process unstable and sensitive to hyper-parameters.

To alleviate these problems, we propose a novel question focus-driven contrastive learning (QFCL) framework for medical question summarization, as illustrated in Figure 1. In our model, we introduce a "double anchors" strategy for contrastive learning, by utilizing the sentence representation of CHQ as an anchor and the generated summary as another anchor, and regarding the golden reference FAQ as the positive sample. In addition, we present a "focus-driven hard negatives generator" to construct hard negative samples, by replacing the focus phrases with other phrases sharing the same attribute.

Through contrastive learning, we minimize the distance between the CHQ/generated summary and the golden reference, and maximize the distance between the CHQ/generated summary and other negative samples. By using the double anchors, our model is able to extract sentence-level semantic features to alleviate the problem of MLE. With the help of the hard negatives generator, the model learns to pay more attention to the question focus and thus produces higher-quality summaries.

We conduct extensive experiments on three medical question summarization datasets: MeqSum Ben Abacha and Demner-Fushman (2019), HealthCareMagic and iCliniq Zeng et al. (2020). Our proposed model outperforms previous best results by a wide margin, achieving new state-of-the-art results on all three datasets. Compared with the baseline BART, our model brings a relative performance gain of $12.2\%$, $28.7\%$ and $9.6\%$ on MeqSum, iCliniq and HealthcareMagic respectively. Through analysis, we show that our model becomes significantly better at distinguishing the semantics of generated summaries from those of negative samples, and that it generates high-quality summaries capturing more question focuses.

Figure 2: The overall framework of QFCL. $L_{ctrC}$ and $L_{ctrG}$ are the contrastive learning losses on the two anchors respectively.

2 Related Work

2.1 Medical Question Summarization

The medical question summarization task is defined by Ben Abacha and Demner-Fushman (2019). They construct a benchmark dataset, MeqSum, and apply a pointer-generator model to generate question summaries. At the question summarization campaign of MEDIQA-21 organized by Ben Abacha et al. (2021), almost all approaches rely on the fine-tuning of pre-trained transformer models. Transfer learning, knowledge bases, and ensemble methods are widely utilized by participating teams to achieve better performance He et al. (2021); Yadav et al. (2021); Mrini et al. (2021b); Sänger et al. (2021). In this paper, we also base our method on the strong pre-trained BART model.

Recently, Yadav et al. (2021) propose an RL framework with two question-aware semantic rewards: a question-type identification reward (QTR) and a question-focus recognition reward (QFR). QTR identifies whether the question types are consistent with the gold question, and QFR is designed to capture the question focus. But in their work, the question types and question focuses in the dataset must be manually labeled, which is both time-consuming and labor-intensive for large-scale datasets such as HealthcareMagic and iCliniq. Moreover, the RL training process is unstable. Mrini et al. (2021) claim an equivalence between medical question summarization and recognizing question entailment (RQE), and employ multi-task learning to train the model not only to perform next-word prediction but also to carry out question entailment recognition. These two studies demonstrate that pre-trained models achieve better performance after capturing the underlying sentence semantics of generated questions. Different from these works, we exploit contrastive learning to obtain focus-aware question representations.

2.2 Contrastive Learning

Different from the traditional methods which learn representations at the pixel level for computer vision tasks, contrastive learning encodes high-level features to distinguish different objects and has achieved great success Henaff (2020); Chen et al. (2020); Misra and van der Maaten (2020); He et al. (2020). It has also been applied in several NLP tasks such as machine translation Pan et al. (2021), pre-training Chi et al. (2021) and question answering Yang et al. (2021). In the field of summarization, Liu and Liu (2021) present a contrastive framework to bridge the gap between the learning objective and evaluation metrics, and Cao and Wang (2021) design several negative sample construction strategies to solve the factual inconsistency problem. In contrast, we use the MoCo structure to handle the large volume of negative samples, and propose a new negative sample construction method.

Chen et al. (2020) show that a large number of negative samples can improve the performance of contrastive learning, but it also places a heavy burden on computation cost. To address this issue, He et al. (2020) propose MoCo, which maintains a queue as the memory bank to store negative samples. MoCo adopts two encoders with the same structure, a key encoder and a query encoder, where the key encoder is momentum updated from the query encoder.

3 Model

Given an input question CHQ, which is written by consumers and contains lengthy and complex information, the medical question summarization task aims to automatically generate a question summary that is a frequently asked question (FAQ), capturing the essential information to help efficiently retrieve correct answers. A more detailed structure of our proposed QFCL model is presented in Figure 2.

3.1 Contrastive Learning Architecture

We employ the pre-trained BART Lewis et al. (2020) as our basic model to generate question summaries. For contrastive learning, we adopt the MoCo architecture He et al. (2020), which contains a key encoder $E_{k}$ with the same structure as the BART encoder $E_{q}$, and a queue to store a large volume of simple negative samples. The simple negative samples in the queue are progressively replaced by the current mini-batch of representations extracted from the key encoder. All samples in the queue will be used as negative samples in the next batch. In addition, QFCL employs a hard negatives generator to generate hard negative samples.
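The queue behaviour described above can be illustrated with a minimal sketch. It assumes a plain first-in-first-out buffer; the class name `NegativeQueue`, its methods, and the toy vectors are hypothetical and are not taken from the paper's implementation.

```python
from collections import deque


class NegativeQueue:
    """Fixed-size FIFO memory bank of key-encoder representations (illustrative)."""

    def __init__(self, max_size):
        # A deque with maxlen silently evicts the oldest entries when full.
        self._items = deque(maxlen=max_size)

    def enqueue(self, batch_reprs):
        """Add the representations of the current mini-batch."""
        self._items.extend(batch_reprs)

    def negatives(self):
        """Return the stored representations, oldest first."""
        return list(self._items)


# Toy usage: a queue of size 4 receiving batches of 3 and 2 vectors.
queue = NegativeQueue(max_size=4)
queue.enqueue([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
queue.enqueue([[0.7, 0.8], [0.9, 1.0]])
stored = queue.negatives()
print(len(stored), stored[0])
```

In this sketch the oldest vector is dropped once the second batch arrives, which mirrors the progressive replacement of old negatives by the newest mini-batch.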

In our model, the BART encoder $E_{q}$ and the decoder are updated via back propagation by combining three types of loss functions, as described in the subsequent sections. The parameters of $E_{k}$ receive no gradient updates; instead, they are moved slowly towards those of $E_{q}$:

$$\theta_{k} \leftarrow m\,\theta_{k} + (1 - m)\,\theta_{q} \tag{1}$$

where $m$ is a momentum coefficient.
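As a small illustration of Equation 1, the sketch below applies the momentum update to parameters stored as plain Python lists. The function name `momentum_update` and the toy values are assumptions made for this example; a real implementation would iterate over the tensors of both encoders.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Return m * theta_k + (1 - m) * theta_q, element by element."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]


# Toy usage with a deliberately small m so the movement is visible.
theta_q = [1.0, -2.0]   # query-encoder parameters (trained by back propagation)
theta_k = [0.0, 0.0]    # key-encoder parameters (never receive gradients)
theta_k = momentum_update(theta_k, theta_q, m=0.9)
print(theta_k)
```

With $m$ close to 1, the key encoder changes only slightly per step, which keeps the representations stored in the queue consistent with each other.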

At inference time, only the BART encoder and decoder are retained; the other parts, such as the key encoder, the queue, and the hard negatives generator, are all discarded.

3.2 Simple Negative Samples

In the medical question summarization task, the input question CHQ should be semantically close to its reference summary FAQ but different from other question summaries. Therefore, we regard the CHQ $c_{i}$ in the $i$ -th pair as the anchor, FAQ $f_{i}$ in the same pair as the positive sample and randomly select $f_{j}$ from other different pairs to serve as simple negative samples.

Let $R_{s}$ denote the average decoded output of an arbitrary sentence $s$ , the objective function of the simple contrastive learning is defined as:

$$L_{ctrCS} = -\log \frac{e^{\mathrm{sim}(R_{ci}, R_{fi})/\tau}}{\sum_{R_{fj} \in M_{s}} e^{\mathrm{sim}(R_{ci}, R_{fj})/\tau}} \tag{2}$$

where $R_{ci}$ indicates the sentence representation of the $i$-th CHQ extracted from $E_{q}$, and $R_{fi}$ and $R_{fj}$ are extracted from the key encoder $E_{k}$ for the $i$-th and $j$-th FAQ respectively. The operation $\mathrm{sim}$ calculates the cosine similarity, and $\tau$ is a temperature hyper-parameter. $M_{s}$ is the memory bank which contains one positive sample and $K$ simple negative samples in the queue with respect to an anchor.
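The loss in Equation 2 can be sketched in a few lines of standard-library Python. This is an illustrative version under stated assumptions: representations are plain lists of floats, the function names `cosine` and `contrastive_loss` are hypothetical, and the memory bank is formed as the positive sample followed by the negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, tau=0.07):
    """-log of the softmax weight of the positive over the whole memory bank."""
    logits = [cosine(anchor, x) / tau for x in [positive] + negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]


# Toy usage: the loss is small when the positive is closest to the anchor,
# and large when a negative is closer than the positive.
anchor = [1.0, 0.0]
loss_aligned = contrastive_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_confused = contrastive_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_aligned, loss_confused)
```

Because the positive sample also appears in the denominator, the loss is never negative and approaches zero only when the positive dominates every negative in the bank.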

3.3 Focus-Driven Hard Negative Samples

Figure 3: The method of hard negative samples generation.

The above simple negative samples are randomly selected. As claimed by Kalantidis et al. (2020), hard negative samples that are more similar to positive samples can facilitate the model to get better performance. Inspired by this, we build a bridge between hard sample generation and question focus prediction.

3.3.1 Question Focus Identification

As mentioned before, the question focus is essential to understanding a consumer health question. If some focus phrases are missing in the generated summary, the semantics will drift far away from the user’s original intention. So we construct difficult negative samples based on the question focus to enhance contrastive learning. Specifically, we replace the focus phrases with other phrases of the same attribute, and keep the other words of the sentence unchanged. An example of hard negative sample generation is shown in Figure 3.

One issue for our method is how to automatically annotate the question focus. Yadav et al. (2021) manually labeled the question focus in the MeqSum dataset. However, this is quite time-consuming and labor-intensive, driving us to find a method which can automatically mark the question focus in larger datasets, such as HealthcareMagic and iCliniq. We analyzed the manually labeled MeqSum dataset, and found that in 340 of the total 500 records ($68\%$), the question focuses are the overlapping phrases between CHQ and FAQ. Accordingly, we hypothesize that phrases appearing in both the source question and the golden summary have a high probability of being key-phrases. This idea has also been shown to be effective by Li et al. (2020).

Since the question focus is usually a phrase rather than a single word, we need to split one sentence into phrases. We apply the chunker Akbik et al. (2018) to the CHQ and FAQ text, and record the chunk label of each phrase. Then the consistent phrases appearing both in CHQ and FAQ are labeled as the question focuses.
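The overlap rule can be sketched as follows. The sketch assumes the chunker output is already available as lists of `(phrase, chunk_label)` pairs; the function name `identify_focus`, the case-insensitive matching, and the hand-written chunks (which are illustrative and not real chunker output) are assumptions of this example. No filtering of function words is attempted.

```python
def identify_focus(chq_chunks, faq_chunks):
    """Return the FAQ phrases (with chunk labels) that also occur in the CHQ."""
    chq_phrases = {phrase.lower() for phrase, _ in chq_chunks}
    return [(phrase, label) for phrase, label in faq_chunks
            if phrase.lower() in chq_phrases]


# Hand-written chunks loosely based on the example in Table 1.
chq_chunks = [("my son", "NP"), ("suffering", "VP"), ("from", "PP"),
              ("gender dysphoria", "NP")]
faq_chunks = [("Where", "ADVP"), ("can I find", "VP"), ("information", "NP"),
              ("on", "PP"), ("treatment and resources", "NP"), ("for", "PP"),
              ("gender dysphoria", "NP")]

focus = identify_focus(chq_chunks, faq_chunks)
print(focus)
```

Keeping the chunk label with each focus phrase matters for the next step, where replacements are drawn from phrases with the same label.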

3.3.2 Hard Negative Sample Generation

We constructed a dictionary by concatenating all phrases of the FAQ sentences in the training set. To generate hard negative samples, the question focuses are randomly replaced by other phrases of the same chunk label from the dictionary. As shown in Figure 3, “breast cancer” is replaced by “diabetes” since they share the same label “NP”. We repeat this process $N_{h}$ times to construct $N_{h}$ different hard negative samples for each CHQ-FAQ pair.
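A minimal sketch of this replacement procedure is given below, assuming chunked FAQs are lists of `(phrase, chunk_label)` pairs. The names `build_phrase_dictionary` and `make_hard_negatives`, the chunk labels, and the toy training set are hypothetical; the sketch also does not enforce that the generated samples are distinct from one another.

```python
import random


def build_phrase_dictionary(faq_chunk_lists):
    """Group all FAQ phrases of the training set by chunk label."""
    by_label = {}
    for chunks in faq_chunk_lists:
        for phrase, label in chunks:
            by_label.setdefault(label, set()).add(phrase)
    return {label: sorted(phrases) for label, phrases in by_label.items()}


def make_hard_negatives(faq_chunks, focus_phrases, dictionary, n_h, rng):
    """Build n_h copies of the FAQ with each focus phrase swapped for a
    different phrase that carries the same chunk label."""
    negatives = []
    for _ in range(n_h):
        pieces = []
        for phrase, label in faq_chunks:
            if phrase in focus_phrases:
                candidates = [p for p in dictionary.get(label, []) if p != phrase]
                if candidates:
                    phrase = rng.choice(candidates)
            pieces.append(phrase)
        negatives.append(" ".join(pieces))
    return negatives


# Toy training set with hand-written chunk labels.
train_faqs = [
    [("What", "WHNP"), ("are", "VP"), ("the treatments", "NP"),
     ("for", "PP"), ("breast cancer", "NP")],
    [("What", "WHNP"), ("causes", "VP"), ("diabetes", "NP")],
    [("How", "WHADVP"), ("to treat", "VP"), ("gender dysphoria", "NP")],
]
dictionary = build_phrase_dictionary(train_faqs)
hard_negatives = make_hard_negatives(
    train_faqs[0], {"breast cancer"}, dictionary, n_h=3, rng=random.Random(0))
print(hard_negatives)
```

Only the focus phrase changes, so each hard negative stays lexically close to the reference FAQ while its meaning differs.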

3.3.3 Contrastive Learning on Hard Negative Samples

The sentence representation of hard sample $R_{h}$ is extracted from the key encoder $E_{k}$ . We define the hard loss function of contrastive learning as:

$$L_{ctrCH} = -\log \frac{e^{\mathrm{sim}(R_{ci}, R_{fi})/\tau}}{\sum_{R_{h} \in M_{h}} e^{\mathrm{sim}(R_{ci}, R_{h})/\tau}} \tag{3}$$

where $M_{h}$ denotes the memory bank containing one positive sample and $N_{h}$ hard negative samples.

This loss function forces the model to not only shorten the distance between CHQ and FAQ, but also expand the gap between the CHQ and hard negative samples. In this way, we achieve the goal of making the model pay more attention to the question focus, and obtain a focus-aware representation.

3.4 Contrastive Learning at Decoder

An imbalance in the above method is that contrastive learning is only applied at the encoder. We fine-tuned BART on the iCliniq dataset, and found that the decoder lacks the ability to separate the representation of the generated summary from those of the positive samples and unrelated negative samples, as shown by $s_{g\_faq}^{+}$, $s_{g\_sim}^{-}$ and $s_{g\_hard}^{-}$ in Figure 4. Therefore, we try to improve the similarity between the generated summary and its reference FAQ, and at the same time enlarge the dissimilarity between the generated summary and other unrelated questions.

Specifically, we regard the generated summary as an extra anchor, and denote the representation of the generated summary as $g_{i}$. Since the output summary should be semantically consistent with the corresponding FAQ, we consider the representation of the FAQ $f_{i}$ in the same pair as the positive sample, select the simple negative samples randomly from the queue, and generate hard negative samples using the hard negatives generator. The objective functions of the contrastive losses $L_{ctrGS}$ and $L_{ctrGH}$ at the decoder end are defined in the same way as Equations 2 and 3, except that the anchor $c_{i}$ is replaced by the other anchor $g_{i}$.

3.5 Overall Objective Function

For predicting next tokens in the generated summary, we use the cross entropy loss $L_{c e}$ :

$$L_{ce} = -\frac{1}{|T|} \sum_{t \in T} \log p(y_{t} \mid x, y_{1:t-1}, \theta) \tag{4}$$

In our model, the overall loss function consists of five parts: the cross entropy loss $L_{c e}$ and four different loss functions of contrastive learning: $L_{c t r C S}$ , $L_{c t r C H}$ for the anchor at the encoder end, $L_{c t r G S}$ , $L_{c t r G H}$ for the anchor at the decoder end. We define the contrastive learning loss with respect to these two anchors as:

$$\begin{aligned} L_{ctrC} &= \alpha L_{ctrCS} + \beta L_{ctrCH} \\ L_{ctrG} &= \alpha L_{ctrGS} + \beta L_{ctrGH} \end{aligned} \tag{5}$$

where $\alpha$ and $\beta$ are hyper-parameters that control the balance between simple negatives and hard ones. The contrastive learning losses at the encoder and decoder are weighted equally, and the overall loss is defined as:

$$L = L_{ce} + L_{ctrC} + L_{ctrG} \tag{6}$$
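The way the five loss terms combine can be sketched as below. The per-anchor weighting follows Equation 5; the final unweighted sum of the cross entropy loss and the two anchor losses is an assumption of this sketch, since the text only states that the encoder and decoder losses are weighted equally. The function names and the numeric inputs are illustrative.

```python
def anchor_loss(simple_loss, hard_loss, alpha=1.0, beta=0.5):
    """Equation 5: weighted sum of the simple and hard contrastive losses."""
    return alpha * simple_loss + beta * hard_loss


def overall_loss(l_ce, l_ctr_cs, l_ctr_ch, l_ctr_gs, l_ctr_gh,
                 alpha=1.0, beta=0.5):
    """Combine the cross entropy loss with both anchor losses."""
    l_ctr_c = anchor_loss(l_ctr_cs, l_ctr_ch, alpha, beta)  # encoder anchor (CHQ)
    l_ctr_g = anchor_loss(l_ctr_gs, l_ctr_gh, alpha, beta)  # decoder anchor (summary)
    # Assumption: the three terms are summed without additional weights.
    return l_ce + l_ctr_c + l_ctr_g


# Toy usage with made-up loss values.
total = overall_loss(l_ce=2.0, l_ctr_cs=1.0, l_ctr_ch=0.4,
                     l_ctr_gs=0.8, l_ctr_gh=0.2)
print(total)
```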

4 Experiments

4.1 Datasets

We conduct experiments on three English benchmark medical question summarization datasets: MeqSum, HealthCareMagic and iCliniq. MeqSum is a high-quality dataset from NIH (www.nlm.nih.gov/medlineplus), constructed by Ben Abacha and Demner-Fushman (2019). Mrini et al. (2021a) extracted the HealthCareMagic and iCliniq datasets from MedDialog Zeng et al. (2020), which are collected automatically from the online healthcare service platforms www.healthcaremagic.com and www.icliniq.com. MeqSum’s and HealthcareMagic’s summaries are written by medical experts in formal style, while iCliniq’s are patient-written. We list some statistics of these datasets in Table 2. Following previous works, we adopt ROUGE Lin (2004) (https://pypi.org/project/py-rouge) as the evaluation metric.

Dataset	Train	Dev	Test	Length
MeqSum	400	100	500	60.8/10.1
HealthCareMagic	181,122	22,641	22,642	82.8/9.7
iCliniq	24,851	3,105	3,106	89.7/12.3

Table 2: Statistics of three medical question summarization datasets. Length indicates the average length of CHQ/FAQ.

Model	MeqSum			iCliniq			HealthCareMagic
Model	R1	R2	RL	R1	R2	RL	R1	R2	RL
ProphetNet + QTR + QFR Yadav et al. (2021)	45.52	27.54	48.19	-	-	-	-	-	-
MTL + Data augmentation Mrini et al. (2021)	49.20	29.50	44.80	54.20	36.90	49.10	45.90	24.30	42.90
BART Lewis et al. (2020)	46.17	28.05	43.75	48.79	25.47	44.69	42.33	23.07	39.60
BART + S	49.30	31.78	46.89	56.58	36.43	52.06	44.35	24.73	41.46
BART + S + H	49.96	32.72	47.66	58.26	40.08	55.34	45.52	25.71	42.51
BART + S + H + D (QFCL)	51.48	34.16	49.08	60.09	43.22	57.54	46.42	26.47	43.41

Table 3: Experimental results on three medical question summarization datasets. S denotes the contrastive learning on simple negative samples at the encoder end; H denotes the contrastive learning on hard negative samples at the encoder end; D denotes the decoder end’s contrastive learning. The top group lists the existing state-of-the-art results on three datasets, and the bottom group shows our ablation study on different components.

4.2 Training Details

We utilize BART-large Lewis et al. (2020) from Hugging Face (huggingface.co/facebook/bart-large) as our pre-trained model. The learning rate of the BART baseline is set to 3e-5, the same as Mrini et al. (2021). For contrastive learning in QFCL, the learning rate is tuned to 1e-5. The betas of the Adam optimizer are set to 0.9 and 0.999, and the batch size is set to 16. The number of hard negative samples $N_{h}$ is set to 64. For MoCo, the queue size $K$ is set to 4096, the temperature $\tau$ is 0.07, and the momentum coefficient $m$ is 0.999. In Equation 5, $\alpha$ and $\beta$ are set to 1 and 0.5 respectively through grid search on the MeqSum development set. All experiments were performed on a single NVIDIA RTX 3090 GPU. The average runtimes per epoch for MeqSum, iCliniq and HealthcareMagic are 0.1h, 0.6h and 4.2h respectively.

4.3 Overall Performance

We report our experimental results in Table 3. Our model achieves new state-of-the-art results on all three datasets. Compared with the previous best results, we obtain an improvement in ROUGE-L score of 0.89 on MeqSum, 8.44 on iCliniq, and 0.51 on HealthcareMagic.

MTL+Data augmentation Mrini et al. (2021) obtains the previous state-of-the-art results on iCliniq and HealthcareMagic by utilizing question entailment data to augment the summarization data. In contrast, our method does not need other classification models or external data. ProphetNet+QTR+QFR Yadav et al. (2021), a reinforcement learning-based framework with question-aware rewards, holds the previous best ROUGE-L result on MeqSum. Compared with the best previous score for each metric on MeqSum in Table 3, our method performs consistently better, with improvements of 2.28 on R1 and 4.66 on R2 (both over MTL+Data augmentation) and 0.89 on RL (over ProphetNet+QTR+QFR). We did not compare with Yadav et al. (2021) on the other two datasets, since their method requires manually labeled question focuses and question types.

4.4 Ablation Study

We perform an ablation study to evaluate the impact of the different components employed in QFCL, and report the results in Table 3. For the MeqSum dataset, whose small size may make training unstable, we conducted five separate experiments and computed the average ROUGE score of the five checkpoints as the final result. Compared with the base BART model, we obtain an absolute improvement of 5.33 ROUGE-L points on average. A t-test on these five ROUGE scores gives a p-value below 1e-2, indicating that this improvement is significant. On iCliniq the absolute improvement is 12.85 points, and on HealthcareMagic it is 3.81 points. Relative to BART, the improvements of our model are $12.2\%$, $28.7\%$ and $9.6\%$ on MeqSum, iCliniq and HealthcareMagic respectively.
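The absolute and relative gains quoted above can be checked against the ROUGE-L columns of Table 3 with a short calculation. The dictionary layout and the function name `improvement` are choices made for this example; because Table 3 reports rounded scores, the recomputed relative gains agree with the quoted percentages only up to rounding.

```python
# ROUGE-L scores from Table 3: (BART baseline, QFCL).
rouge_l = {
    "MeqSum": (43.75, 49.08),
    "iCliniq": (44.69, 57.54),
    "HealthCareMagic": (39.60, 43.41),
}


def improvement(baseline, model):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = model - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


gains = {name: improvement(b, q) for name, (b, q) in rouge_l.items()}
for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.2f} points, +{relative:.2f}% relative")
```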

The results demonstrate that each component of our model is helpful. On MeqSum, there is an increase of 3.15 points for BART+S compared to the baseline, indicating that contrastive learning on simple negative samples largely improves model performance. BART+S+H brings a further increase of 0.77 points, and the highest ROUGE-L score is obtained when all three parts are used in our model. This suggests that each component in QFCL contributes positively, and that metrics like ROUGE, which evaluate the similarity between whole sentences, benefit from our contrastive learning strategy.

4.5 Human Evaluation

To quantitatively assess the results, we compare our method with the baseline BART through human judgement. We randomly selected 50 samples from each of the three datasets, and hired 3 graduate students to assign each generated summary to one of the following categories: ’Incorrect’, ’Acceptable’, and ’Perfect’. We compute the average number of samples in each category across annotators, and report the result in Table 4. The average Spearman correlation coefficient between the three annotators is 0.68, which indicates reasonable agreement in our annotation data. The evaluation results show that our model generates a higher proportion of perfect samples and a lower proportion of incorrect ones, by enhancing the model’s ability to capture sentence semantics and question focuses.

Model	MeqSum			iCliniq			HealthCareMagic
Model	I	A	P	I	A	P	I	A	P
BART	28.7	17.3	4.0	12.3	17.0	20.7	20.7	20.3	9.0
QFCL	12.0	18.0	20.0	6.3	17.7	26.0	5.7	16.3	28.0

Table 4: Human evaluation of the summaries generated by BART and QFCL respectively. The metric I means the number of incorrect samples, A means acceptable, P means perfect.

4.6 Case Study

To clearly show the output question summaries, we list two samples to compare our model with BART in Table 5. In Case 1, BART captures the question focus "Ampicillin" but misses "drink alcohol", and in Case 2 it misses the question focus "breast milk". In contrast, our model successfully extracts multiple question focuses from the lengthy CHQ, and generates summaries that conform more closely to the meaning of the original questions.

Case 1
CHQ	MESSAGE: Is it okay to drink alcohol in moderation when taking Ampicillin. I was told it negates any medical effect of the drug
FAQ	Can I drink alcohol while taking Amoxicillin?
BART	What are the side effects of Ampicillin?
QFCL	Is it okay to drink alcohol with Ampicillin?
Case 2
CHQ	Hi….. I have 3 month old baby girl…… I don t have breast milk from the beginning due to some reason. I can not give formula milk to baby…… So right now i m giving buffelo milk …….. What else i should give her for better nourishment????? ……. She has constipation problem may be due to milk but i cant give her breastmilk or formula ……. How to overcome it?????……… Please help me
FAQ	Suggest ways to feed newborn other than breast milk
BART	Suggest treatment for constipation in a child
QFCL	Suggest better nourishment for baby other than breast milk

Table 5: Examples of generated question summaries by BART and our QFCL model. The question focuses are highlighted.

Figure 4: Correlation between sentence representation similarities and epoch numbers on dev set. The red lines are about the anchor CHQ. $s_{c\_faq}^{+}$ is the average cosine similarity between CHQ and related FAQ, $s_{c\_sim}^{-}$ is between CHQ and simple negative samples (other FAQs), $s_{c\_hard}^{-}$ is between CHQ and hard negative samples. The green lines are about the anchor of generated summary. $s_{g\_faq}^{+}$ is the average cosine similarity between the generated summary and FAQ, $s_{g\_sim}^{-}$ is between generated summary and simple negatives, $s_{g\_hard}^{-}$ is between generated summary and hard negatives. The epoch number equaling 0 denotes the initial pre-trained model.

5 Model Analysis

5.1 Correlation of Sentence Representations

Since the auxiliary structures are discarded at the inference stage, we make a further analysis to check whether the retained model has the ability to distinguish different sentence-level semantics when facing unseen data. We train QFCL and BART on the training set for 20 epochs and save each checkpoint, and evaluate these checkpoints on the development set.

Four types of sentence representations are extracted from these checkpoints: CHQ’s representation $R_{c}$ , FAQ’s representation $R_{f}$ , hard negatives’ representation $R_{h}$ , and the generated summary’s representation at decoder end $R_{g}$ . Then we calculate the cosine similarity between them, and draw the relationship between these similarity scores and the epoch numbers, as shown in Figure 4.

Regarding the anchor CHQ in the curve of iCliniq, $s_{c\_faq}^{+}$, $s_{c\_sim}^{-}$ and $s_{c\_hard}^{-}$ are very close to each other at epoch 0, suggesting that the initial encoder lacks the ability to capture different semantics. With the increase of training steps, $s_{c\_faq}^{+}$ changes smoothly, while $s_{c\_sim}^{-}$ decreases sharply to near zero and $s_{c\_hard}^{-}$ decreases gradually and converges at a middle level between $s_{c\_faq}^{+}$ and $s_{c\_sim}^{-}$. This suggests that, powered by contrastive learning, our model has learned to distinguish sentences of different meanings at the encoder end.

With the generated summary as another anchor, we find that $s_{g\_faq}^{+}$, $s_{g\_sim}^{-}$ and $s_{g\_hard}^{-}$ are all near 0 initially, which indicates that the decoder is also weak in representing sentence-level semantics. After training, $s_{g\_faq}^{+}$ increases significantly, $s_{g\_hard}^{-}$ converges between $s_{g\_faq}^{+}$ and $s_{g\_sim}^{-}$, and $s_{g\_sim}^{-}$ stays very low all the time. This suggests that the decoder has strengthened its power to distinguish different semantics, just as the encoder has.

Another chart in Figure 4 shows this relationship for the BART baseline. The similarities between the anchor and the positive samples and between the anchor and the negative samples are very close, and never improve significantly as training progresses. This suggests that the BART baseline is relatively weak at distinguishing sentences of different meanings at both the encoder and the decoder, since it only focuses on the prediction of next tokens.

We also draw this correlation curve on MeqSum and HealthcareMagic. The curve of HealthcareMagic is similar to iCliniq. On MeqSum, our model can still distinguish sentences with different semantics better than the baseline, but the signal is not as significant as iCliniq or HealthcareMagic due to the limited size of training set.

Model	C1	C2	C3	C4	C5	Mean
BART	33.37	40.76	39.78	35.34	36.21	37.09
QFCL	47.41	42.24	45.20	45.20	47.17	45.44

Table 6: Accuracy of question focuses in generated summaries. C1-C5 means 5 different checkpoints trained by each model.

5.2 Capturing Question Focus

To study whether our model pays more attention to the question focus, we evaluate the accuracy of question focuses in generated summaries. We use the sequence labeling model trained by Yadav et al. (2021) to predict question focuses on the MeqSum dataset, and regard the 812 predicted question focuses in test set as the gold-standard. For QFCL and BART, we train five checkpoints and generate summaries on these checkpoints, and compute the accuracy of question focuses on test set. As shown in Table 6, the average accuracy is 37.09% for BART and 45.44% for QFCL. Our model exceeds the baseline by 8.35 points for question focus generation. P-value of t-test on these two sets of results is 1.04e-3, indicating that this improvement is statistically significant.
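The mean accuracies and their gap can be recomputed from Table 6 with the standard library, as sketched below. The text does not state which t-test variant was used; the sketch assumes an independent two-sample test with unpooled variances and computes only the t statistic (the p-value is not computed here, since the standard library has no t distribution). The function name `two_sample_t` is hypothetical.

```python
import math
import statistics

# Question focus accuracy of the five checkpoints (Table 6).
bart = [33.37, 40.76, 39.78, 35.34, 36.21]
qfcl = [47.41, 42.24, 45.20, 45.20, 47.17]


def two_sample_t(a, b):
    """Unpooled two-sample t statistic: (mean_b - mean_a) / standard error."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / standard_error


mean_gap = statistics.mean(qfcl) - statistics.mean(bart)
t_stat = two_sample_t(bart, qfcl)
print(f"mean gap = {mean_gap:.2f} points, t = {t_stat:.2f}")
```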

6 Conclusion

In this paper, we introduce a novel question focus-based contrastive learning framework QFCL for medical question summarization. In the proposed model, we adopt a "double anchor" strategy, by considering both the input question CHQ and the generated summary as comparing anchors. And we exploit a "hard negatives generator" to generate hard negative samples based on the question focus. Our model significantly improves the performance on three medical question summarization datasets, and achieves new state-of-the-art results. In the future, we would like to find a more effective way to do question focus recognition.

Acknowledgement

This work is supported by the National Hi-Tech R&D Program of China (No. 2020AAA0106600), the National Natural Science Foundation of China (62076008) and the Key Project of Natural Science Foundation of China (61936012).

Appendix A Ethical Consideration

The datasets used in our work are all publicly available. We use BART as our base model, which is released under the Apache-2.0 license. The datasets and the method should be used for research purposes only, not for commercial purposes.

The personal information in the datasets has been hidden through preprocessing. For example, the name of the consumer was converted to a placeholder [name] and the address was converted to [location], as shown in Table 1.

As current models cannot guarantee that the generated summaries fully conform to the intention of the consumers, the method in our paper should only be used as an auxiliary tool, so as to avoid misleading suggestions.

References

A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1638–1649.
A. Ben Abacha and D. Demner-Fushman (2019) On the summarization of consumer health questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2228–2234.
A. Ben Abacha, Y. Mrabet, Y. Zhang, C. Shivade, C. Langlotz, and D. Demner-Fushman (2021) Overview of the MEDIQA 2021 shared task on summarization in the medical domain. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, pp. 74–85.
S. Cao and L. Wang (2021) CLIFF: contrastive learning for improving faithfulness and factuality in abstractive summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 6633–6649.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607.
Z. Chi, L. Dong, F. Wei, N. Yang, S. Singhal, W. Wang, X. Song, X. Mao, H. Huang, and M. Zhou (2021) InfoXLM: an information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 3576–3588.
E. Greensmith, P. L. Bartlett, and J. Baxter (2004) Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5 (9).
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
Y. He, M. Chen, and S. Huang (2021) Damo_nlp at MEDIQA 2021: knowledge-based preprocessing and coverage-oriented reranking for medical question summarization. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, pp. 112–118.
O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192.
Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus (2020) Hard negative mixing for contrastive learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 21798–21809.
M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880.
H. Li, J. Zhu, J. Zhang, C. Zong, and X. He (2020) Keywords-guided abstractive sentence summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8196–8203.
S. Li, D. Lei, P. Qin, and W. Y. Wang (2019) Deep reinforcement learning with distributional semantic rewards for abstractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6038–6044.
C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
Y. Liu and P. Liu (2021) SimCLS: a simple framework for contrastive learning of abstractive summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, pp. 1065–1072.
I. Misra and L. van der Maaten (2020) Self-supervised learning of pretext-invariant representations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6706–6716.
K. Mrini, F. Dernoncourt, W. Chang, E. Farcas, and N. Nakashole (2021a) Joint summarization-entailment optimization for consumer health question understanding. In Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations, Online, pp. 58–65.
K. Mrini, F. Dernoncourt, S. Yoon, T. Bui, W. Chang, E. Farcas, and N. Nakashole (2021b) UCSD-adobe at MEDIQA 2021: transfer learning and answer sentence selection for medical summarization. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, pp. 257–262.
K. Mrini, F. Dernoncourt, S. Yoon, T. Bui, W. Chang, E. Farcas, and N. Nakashole (2021) A gradually soft multi-task and data-augmented approach to medical question understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 1505–1515.
R. Nallapati, B. Zhou, C. dos Santos, Ç. Gülçehre, and B. Xiang (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 280–290.
X. Pan, M. Wang, L. Wu, and L. Li (2021) Contrastive learning for many-to-many multilingual neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 244–258.
R. Paulus, C. Xiong, and R. Socher (2018) A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
M. Sänger, L. Weber, and U. Leser (2021) WBI at MEDIQA 2021: summarizing consumer health questions with generative transformers. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, pp. 86–95.
S. Yadav, D. Gupta, A. Ben Abacha, and D. Demner-Fushman (2021) Reinforcement learning for abstractive question summarization with question-aware semantic rewards. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, pp. 249–255.
S. Yadav, M. Sarrouti, and D. Gupta (2021) NLM at MEDIQA 2021: transfer learning-based approaches for consumer question and multi-answer summarization. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, pp. 291–301.
N. Yang, F. Wei, B. Jiao, D. Jiang, and L. Yang (2021) XMoCo: cross momentum contrastive learning for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 6120–6129.
G. Zeng, W. Yang, Z. Ju, Y. Yang, S. Wang, R. Zhang, M. Zhou, J. Zeng, X. Dong, R. Zhang, H. Fang, P. Zhu, S. Chen, and P. Xie (2020) MedDialog: large-scale medical dialogue datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9241–9250.
J. Zhang, Y. Zhao, M. Saleh, and P. Liu (2020) Pegasus: pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pp. 11328–11339.