Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods

Potsawee Manakul and Mark J. F. Gales
Department of Engineering, University of Cambridge
pm574@cam.ac.uk, mjfg@eng.cam.ac.uk
Abstract

Automatic summary assessment is useful for both machine-generated and human-produced summaries. Automatically evaluating the summary text given the document enables, for example, summary generation system development and detection of inappropriate summaries. Summary assessment can be run in a number of modes: ranking summary generation systems; ranking summaries of a particular document; and estimating the quality of a document-summary pair on an absolute scale. Existing datasets with annotation for summary assessment are usually based on news summarization datasets such as CNN/DailyMail or XSum. In this work, we describe a new dataset, the podcast summary assessment corpus, a collection of podcast summaries that were evaluated by human experts at TREC2020. Compared to existing summary assessment data, this dataset has two unique aspects: (i) long-input documents based on speech podcasts; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus. First, we examine existing assessment methods, including model-free and model-based methods, and provide benchmark results for this long-input summary assessment dataset. Second, with the aim of filtering reference summary-document pairings for training, we apply summary assessment for data selection. The experimental results on these two aspects provide interesting insights on the summary assessment and generation tasks. The podcast summary assessment data is available at https://github.com/potsawee/podcast_summary_assessment under the CC-BY-4.0 license.

1 Introduction

Summarization or summary generation aims to compress a document into a concise summary that conveys the important information, while summary assessment or evaluation aims to estimate the quality of the summary text given the document. With the advances in deep learning, a variety of automatic summary generation models have been proposed see-etal-2017-get; lewis-etal-2020-bart; zhang2020pegasus. However, automatic summary assessment has received less attention.

Firstly, automatic assessment such as ROUGE lin-2004-rouge allows researchers to quickly compare and rank summary generation models as it has been shown to have a high/moderate correlation with human judgements at the system-level. Secondly, automatic assessment can also be applied to rank a set of summaries for the document, i.e. summary-level evaluation. The definitions of system-level and summary-level are provided in Section 5.1. Thirdly, instead of ranking, another assessment task is to evaluate the quality of a document-summary pair on an absolute scoring scale. This is a regression task which has applications such as assessing summaries of English learners xia-etal-2019-automatic, or selecting good document-summary pairs for training generation systems.

In this work, we compile and release summaries of podcasts and associated human judgements from the Spotify Podcast Challenge at TREC2020 jones_trec2020, which is based on podcast data of more than 100,000 episodes for training summary generation systems clifton-etal-2020-100000. The Podcast Summary Assessment dataset consists of long documents, e.g. the average number of words is more than 6000, meaning that some assessment methods may fail to correlate well with human judgements. Using this new dataset for assessment, we provide benchmark results of standard and recent assessment methods as measured by system-level and summary-level correlations.

In addition, we link the summary assessment task to the summary generation task. Creator-provided podcast descriptions have been used as the reference summaries in training summary generation models manakul2020cued_speech; however, human evaluation suggests that up to half of the descriptions are judged as only fair or bad (see Section 4.2). Thus, it is a challenge to select appropriate and high-quality training examples for the generation task. In this work, we propose using summary assessment to tackle this data selection problem, and we provide baseline results and insights based on supervised assessment models. The main contributions of this paper are:

  • We assemble and release Podcast Summary Assessment, a summary assessment dataset based on a large podcast summarization dataset from the podcast challenge at TREC2020. The data provides a diverse assessment resource beyond the scope of news articles.

  • We provide benchmark results including several assessment methods on the new dataset.

  • We link the assessment task to the generation task, and we provide baseline results.

2 Related Work: Assessment Methods

Our notation is \mathbf{x} = document, \mathbf{y} = candidate summary, \mathbf{y}^{*} = reference summary, and z = quality of the summary. We categorize summary assessment methods along two axes: first, f(\mathbf{y},\mathbf{y}^{*}) vs. f(\mathbf{y},\mathbf{x}), i.e. whether the summary is compared against the reference summary or the document; second, unsupervised vs. supervised approaches. In this section, we provide the details of the methods used in this work. A literature review of recent summary assessment or evaluation methods can be found in koto2022ffci.

2.1 Summary and Reference f(\mathbf{y},\mathbf{y}^{*})

Typically, datasets for developing summary generation systems contain a set of documents \mathbf{X}=\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},...\} and reference summaries \mathbf{Y}^{*}=\{\mathbf{y}^{*(1)},\mathbf{y}^{*(2)},...\}. Generation systems are trained to maximise the likelihood of the reference summaries such that \theta_{\tt{ml}}=\text{argmax}_{\theta}[P(\mathbf{Y}^{*}|\mathbf{X};\theta)]. Consequently, standard summary assessment methods take the form f(\mathbf{y},\mathbf{y}^{*}).

2.1.1 Unsupervised f(\mathbf{y},\mathbf{y}^{*})

By far the most commonly used f(\mathbf{y},\mathbf{y}^{*}) method is ROUGE lin-2004-rouge, which is model-free and based on the n-gram overlap between \mathbf{y} and \mathbf{y}^{*}. Other variants of n-gram based methods include BLEU papineni-etal-2002-bleu. Despite their robustness, n-gram based methods cannot take word semantics into account. Model-based word-level representation matching methods such as BERTScore BERTScore or MoverScore zhao-etal-2019-moverscore have been proposed to incorporate word semantics. This idea can be extended to sentence-level representation matching, e.g. with Sentence-BERT reimers-gurevych-2019-sentence. Rather than n-gram matching or representation matching, triple matching has also been proposed goodrich_triple.
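
To make the n-gram overlap idea concrete, the following is a minimal sketch of a ROUGE-L style F1 score based on the longest common subsequence (LCS). It is a simplification under stated assumptions: whitespace tokenization, lowercasing, no stemming, and a balanced F1; the function names are illustrative and this is not the official ROUGE implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 (assumption: whitespace tokens, no stemming)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on surface tokens, two summaries that paraphrase each other with different words would receive a low score, which is the limitation that representation matching methods aim to address.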

2.1.2 Supervised f(\mathbf{y},\mathbf{y}^{*})

Methods such as BLEURT sellam-etal-2020-bleurt or COMET rei-etal-2020-comet are trained to predict human scores given \mathbf{y} and \mathbf{y}^{*}. However, it is tedious to collect both human scores and \mathbf{y}^{*}, making it less practical. Thus, we omit this type of approach.

2.2 Summary and Document f(\mathbf{y},\mathbf{x})

In a practical scenario, such as assessing a human’s summarization skill where no reference summary \mathbf{y}^{*} is available, the document \mathbf{x} has to be used in assessing the summary \mathbf{y}.

2.2.1 Unsupervised f(\mathbf{y},\mathbf{x})

Question-Answering. To assess the faithfulness aspect, wang-etal-2020-asking proposed QAGS, the first QA-based method. Given \mathbf{y}, noun-phrases are extracted. For each noun-phrase, a question is generated through \text{noun}+\mathbf{y}\rightarrow\text{question}, and the answer conditioned on \mathbf{x} is compared to the answer conditioned on \mathbf{y}, e.g. using word-overlap F1.

\text{QA-score}=\mathop{E}_{Q\sim P(Q|\mathbf{y})}\left[D\left(P(A|Q,\mathbf{x}),P(A|Q,\mathbf{y})\right)\right] (1)

A concurrent and similar QA-based method called FEQA was also proposed by durmus-etal-2020-feqa. Because QAGS is a precision-based metric (i.e. it generates questions from \mathbf{y} and checks for consistency against \mathbf{x}), scialom-etal-2021-questeval proposed QuestEval, which is a combination of QAG-Precision and QAG-Recall.
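
As an illustration of Eq. 1, the sketch below averages a word-overlap F1 between the answer obtained from the document and the answer obtained from the summary, over a set of generated questions. The callable `answer_fn(question, context)` is a hypothetical stand-in for a question answering model, and the question list is assumed to be produced by a separate question generation step; neither is the implementation used in QAGS.

```python
from collections import Counter


def token_f1(answer_a, answer_b):
    """Word-overlap F1 between two answer strings."""
    a, b = answer_a.lower().split(), answer_b.lower().split()
    if not a or not b:
        # Two empty answers agree; one empty answer does not.
        return float(a == b)
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def qa_score(questions, answer_fn, document, summary):
    """Mean agreement between answers conditioned on x and on y.

    `answer_fn` is a hypothetical QA model interface (an assumption).
    """
    if not questions:
        return 0.0
    scores = [
        token_f1(answer_fn(q, document), answer_fn(q, summary))
        for q in questions
    ]
    return sum(scores) / len(scores)
```

Here the expectation over questions in Eq. 1 is approximated by a sample mean, and D is instantiated as token-level F1, which is one of several possible choices.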

Entailment. The textual entailment task is, given a premise/context \mathbf{x} and a hypothesis \mathbf{y}, to predict one of three possible relations: entail, neutral, or contradict. A common training dataset is Multi-Genre Natural Language Inference (MNLI), which is a crowd-sourced collection of 433k sentence pairs. maynez-etal-2020-faithfulness showed that BERT fine-tuned on MNLI achieves the highest Spearman correlation with human judgements on faithfulness and factuality.

Other unsupervised f(\mathbf{y},\mathbf{x}) approaches include Language Model Score. For example, yuan2021bartscore proposed a conditional LM score \text{BARTScore}=\sum_{t=1}^{M}\omega_{t}\log P(\mathbf{y}_{t}|\mathbf{y}_{<t},\mathbf{x};\theta), where \omega_{t} is a weight such as TF-IDF for each word.
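
The conditional LM score above reduces to a weighted sum once the per-token log-probabilities are available. The sketch below assumes these log-probabilities have already been computed by some seq2seq model (not shown), and defaults to uniform weights when none are given; the function name is illustrative.

```python
import math


def weighted_lm_score(token_log_probs, weights=None):
    """Weighted sum of per-token log-probabilities of the summary.

    token_log_probs[t] is assumed to be log P(y_t | y_<t, x) from a model.
    """
    if weights is None:
        weights = [1.0] * len(token_log_probs)  # assumption: uniform weights
    if len(weights) != len(token_log_probs):
        raise ValueError("weights and token_log_probs must have equal length")
    return sum(w * lp for w, lp in zip(weights, token_log_probs))


# Two summary tokens with probabilities 0.5 and 0.25 under the model.
score = weighted_lm_score([math.log(0.5), math.log(0.25)])
```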

2.2.2 Supervised f(\mathbf{y},\mathbf{x})

Supervised approaches require ground-truth scores (human judgements) \mathbf{Z}^{*}=\{z^{*(1)},{z}^{*(2)},...\} to train regression models \theta_{\tt reg}:

\theta_{\tt reg}=\text{argmax}_{\theta}[P(\mathbf{Z}^{*}|\mathbf{X},\mathbf{Y};\theta)] (2)

For example, xia-etal-2019-automatic collected English learners’ summaries from a real examination and had the summaries graded by professional examiners. Kernel Ridge Regression, LSTM, and CNN models were trained using this data. bao2020end trained fully connected, CNN, LSTM, and BERT-based models on simulated CNN/DailyMail, BillSum, arXiv, and BigPatent data. They created the simulated data by negative sampling, e.g. randomly shuffling summaries or corrupting summaries at the word level. Similarly, kryscinski-etal-2020-evaluating proposed the FactCC metric by fine-tuning a BERT classifier on adversarial data to distinguish between faithful and unfaithful summaries. wu-etal-2020-unsupervised constructed negative samples with respect to linguistic qualities and informativeness, and they trained BERT-based models using contrastive learning.

3 Related Work: Data

DUC 2001-2003 (https://duc.nist.gov/data.html) and TAC 2008-2010 datasets dang2008OverviewOT; dang2009OverviewOT consist of summaries and human evaluation from news articles. Despite their size, the systems in these corpora are extractive and no longer match current abstractive summarization systems.

Recently, the summaries of CNN/DailyMail hermann2015teaching and XSum narayan-etal-2018-dont are annotated to address the lack of summary assessment resource. For example, maynez-etal-2020-faithfulness collected annotation for XSum summaries on faithfulness and factuality aspects. QAGS wang-etal-2020-asking released annotation for CNNDM and XSum on faithfulness. NeR18 grusky-etal-2018-newsroom has human annotation for some of its summaries. RealSum bhandari-etal-2020-evaluating and SummEval fabbri2021summeval annotated recent advanced summarization systems for CNNDM. In addition, a corpus for summary assessment for English learners was collected xia-etal-2019-automatic. Human evaluation assesses the quality of summaries on one or more aspects as follows:

  • Informative (relevance) = how much salient information is presented in the summary, and it should contain little or no redundancy.

  • Faithfulness = whether the information in the summary can be inferred from the document. An unfaithful summary contains hallucination, which can be categorized into (i) intrinsic hallucination, which is when information is manipulated inaccurately; and (ii) extrinsic hallucination, which is when information is added.

  • Factuality (consistency) = whether the information in the summary (regardless of its presence in the document) is right or wrong.

  • Fluency = how good the language usage is, e.g. no grammatical errors.

  • Coherence = collective quality of all sentences, e.g. how well the sentences are connected.

Overall quality is typically assessed as a combination of the aspects. We summarize existing datasets and their annotation aspects in Table 1.

| Corpus | Data | Size† | Annotation |
|---|---|---|---|
| TAC2008 | News | 2736 (57\times48) | Fluency, Relevance, Overall |
| TAC2009 | News | 2420 (55\times44) | Fluency, Relevance, Overall |
| TAC2010 | News | 1978 (43\times46) | Fluency, Relevance, Overall |
| XSum Faithfulness | News (XSum) | 2500 (5\times500) | Faithfulness, Factuality |
| QAGS | News (CNNDM, XSum) | 235, 239 | Faithfulness |
| NeR18 | News | 420 (7\times60) | Coherence, Fluency, Relevance, Informative |
| RealSum | News (CNNDM) | 2500 (25\times100) | Coverage |
| SummEval | News (CNNDM) | 1600 (16\times100) | Coherence, Faithfulness, Fluency, Relevance |
| English Learner | English Exam | 411 | Informative, Coherence, Fluency |
| Podcast Summary Assessment | Podcast | 3580 (20\times179) | 4-point scale Overall (Informative & Fluency) and 8 binary attributes (e.g. names, topic, etc.) |

Table 1: Summary of Datasets. † Size is given as #systems \times #documents.

4 Podcast Summary Assessment Data

The corpus is a collection of podcast summaries generated by recent summarization systems at the Spotify Podcast Challenge at TREC2020 jones_trec2020. The summary assessment corpus consists of 179 podcast episodes (i.e. documents). All episodes have summaries from 20 systems (19 summarization systems + 1 creator description), and human evaluation was performed by NIST (https://www.nist.gov/) assessors for the TREC2020 challenge, resulting in 3580 annotated document-summary pairs in total.

4.1 Summarization Systems

20 summarization systems jones_trec2020; zheng2020two; manakul2020cued_speech; song2020automatic; glasgow_trec; hk_uu_trec are:

Reference = R1. The creator-provided description has been used as the reference summary in training podcast summarization systems.

Extractive systems = E1, E2, E3.

Abstractive systems = A1, A2, A3,…, A16.

Extractive systems are based on TextRank mihalcea-tarau-2004-textrank, while abstractive systems use a form of deep learning and pre-trained seq2seq models including BART lewis-etal-2020-bart and T5 raffel2020exploring. Full details of all the systems can be found in jones_trec2020.

| | #sentences | #words |
|---|---|---|
| Transcript | 303\pm258 | 6375\pm5092 |
| Summary | 5.9\pm9.2 | 98\pm75 |

Table 2: Length (Avg.\pmStd.) based on nltk tokenizer.

4.2 Human Annotation

The summaries were judged by NIST assessors on a 4-point scale (Excellent-Good-Fair-Bad). An excellent summary should be informative, have no redundancy, and be fluent. A bad summary does not convey any salient information (not informative), or is not factually correct. More details about the annotation guideline can be found in jones_trec2020.

Fig. 1 shows the distribution of human scores. It can be seen that around a quarter of creator descriptions are graded Bad. This implies that the data used in training summarization systems is noisy, and it motivates our work in Section 6.2.

Additionally, the annotation includes 8 binary attributes such as whether the summary contains topic information. This work has not utilized this annotation. More information can be found in Appendix B and jones_trec2020.

Figure 1: The distribution of human scores. Panels: (a) All systems; (b) Creator Description.

5 Assessment Method Evaluation

5.1 Evaluation Metrics

Following the notation in deutsch-etal-2021-towards, let x_{i}^{j} and y_{i}^{j} be two scores of metrics X and Y for the summary output by system i\in\{1,...,N\} on the document j\in\{1,...,M\}. Correlations are:

  • System-level (aka Corpus-level)

    \rho=\text{Corr}\left(\left\{\frac{\sum_{j}x_{i}^{j}}{M},\frac{\sum_{j}y_{i}^{j}}{M}\right\}_{i=1}^{N}\right) (3)

  • Summary-level (aka Sentence-level)

    \rho=\frac{1}{M}\sum_{j}\text{Corr}\left(\left\{x_{i}^{j},y_{i}^{j}\right\}_{i=1}^{N}\right) (4)

  • All test examples

    \rho=\text{Corr}\left(\left\{x_{i}^{j},y_{i}^{j}\right\}_{i=1,j=1}^{i=N,j=M}\right) (5)

Correlation in Eq. 5 is used in Section 6 where all document-summary pairs are evaluated together on an absolute scale. Note that Eq. 5 is only applicable when the assessment method gives a score on an absolute scale. For example, the ROUGE score per one document is not comparable across different documents, i.e. it is not on an absolute scale.

5.2 Assessment Method Setup

Implementation details are given in Appendix A.

ROUGE and TripleMatching: ROUGE-1,2,L typically show the same ranking trend, so as a simple unsupervised baseline we report ROUGE-L F1 similar to jones_trec2020. Instead of n-gram matching such as ROUGE or BLEU, we follow goodrich_triple in extracting a set of triples (Subj-Relation-Obj) from two texts, and we compute the F1-score of the triple overlap.
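
The triple-overlap F1 can be sketched as a set comparison. The example assumes the (Subj, Relation, Obj) triples have already been extracted by an information extraction tool and that matching is by exact equality; both are simplifying assumptions, and the function name is illustrative.

```python
def triple_f1(candidate_triples, reference_triples):
    """F1 of the overlap between two collections of (subj, rel, obj) triples."""
    cand, ref = set(candidate_triples), set(reference_triples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(cand), matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy triples (hypothetical): one of the two candidate triples is matched.
candidate = [("host", "interviews", "author"), ("author", "wrote", "book")]
reference = [("host", "interviews", "author")]
score = triple_f1(candidate, reference)
```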

Question-Answering (QA): We follow QAG in Eq. 1 wang-etal-2020-asking. For question generation, BART fine-tuned to NewsQA trischler-etal-2017-newsqa is used. For question answering, BERT (max #words = 512) and Longformer (max #words = 4096) fine-tuned to SQuAD2.0 are used.

Entailment: We train BERT/Longformer on the MNLI corpus williams-etal-2018-broad. At inference time, document \mathbf{x} (context) and summary \mathbf{y} (hypothesis) are concatenated as the input, and the entailment probability is used as the summary score.

CNN model: Because the documents are long, we use a sentence-level similarity grid as the input to our CNN model. The document and summary are split into sentences, and each sentence is encoded into a sentence representation via Sentence-BERT reimers-gurevych-2019-sentence. Cell (i,j) in the similarity grid is the cosine similarity between doc-sent_{i} and summary-sent_{j}. The CNN uses a ResNet18 backbone.
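
A minimal sketch of building the sentence-level similarity grid is given below. It assumes the sentence vectors have already been produced by a sentence encoder (not shown) and are passed in as plain lists of floats; resizing the grid and feeding it to the CNN are outside the scope of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_grid(doc_vecs, sum_vecs):
    """grid[i][j] = cosine similarity of document sentence i and summary sentence j."""
    return [[cosine(d, s) for s in sum_vecs] for d in doc_vecs]


# Toy 2-dimensional vectors standing in for sentence embeddings.
grid = similarity_grid([[1, 0], [0, 1], [1, 1]], [[1, 0], [0, 2]])
```

The grid has one row per document sentence and one column per summary sentence, so its size varies with the input and needs to be brought to a fixed size before being used as an image-like input.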

BERT devlin-etal-2019-bert and Longformer beltagy2020longformer: We fine-tune sequence classification weights where the input is \mathbf{x} concatenated with \mathbf{y} and the target is z. When [\mathbf{x};\mathbf{y}] exceeds the model’s max length, we first truncate \mathbf{x}.
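
The truncation rule can be sketched on token lists as follows. The sketch assumes that the beginning of the document is kept, and that three positions are reserved for special tokens; both are assumptions, since the text only states that the document is truncated first.

```python
def truncate_pair(doc_tokens, sum_tokens, max_length, num_special=3):
    """Shorten [x; y] to fit max_length, truncating the document x first.

    num_special is an assumed count of special tokens added by the model.
    """
    budget = max_length - num_special
    if budget <= 0:
        raise ValueError("max_length is too small")
    # The document only gets the space left over after the summary.
    doc_kept = doc_tokens[:max(budget - len(sum_tokens), 0)]
    # The summary is shortened only if it alone exceeds the budget.
    sum_kept = sum_tokens[:budget - len(doc_kept)]
    return doc_kept, sum_kept


doc_kept, sum_kept = truncate_pair(list(range(10)), ["a", "b", "c"], max_length=8)
```

With this rule, a summary that nearly fills the model's input leaves little or no room for the document, which is relevant when interpreting results from models with a short maximum length.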

In the weakly supervised setting, z is ROUGE-L(\mathbf{y},\mathbf{y}^{*}). In the supervised setting, z is human score: Excellent=3, Good=2, Fair=1, Bad=0. Because 3,580 assessment examples is small for training a deep learning model, we perform a 5-fold cross-validation in our supervised training experiments. Also, we perform 5-fold cross-validation 5 times with different data shuffles, and we report the mean of 5 runs (and the standard deviation in Section 6 where we focus on supervised models).
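
The grade mapping and the repeated cross-validation protocol can be sketched as below. The fold assignment by striding over a shuffled index list and the seed handling are illustrative assumptions, not the exact procedure used in the experiments.

```python
import random

# Mapping from the 4-point human grade to the regression target z.
GRADE_TO_SCORE = {"Excellent": 3, "Good": 2, "Fair": 1, "Bad": 0}


def repeated_kfold(num_examples, num_folds=5, num_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) with a fresh shuffle per repeat."""
    rng = random.Random(seed)
    for repeat in range(num_repeats):
        indices = list(range(num_examples))
        rng.shuffle(indices)
        folds = [indices[f::num_folds] for f in range(num_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield repeat, f, train_idx, test_idx


splits = list(repeated_kfold(3580))
```

With 3580 examples and 5 folds, each test fold holds 716 examples, and every example is used for testing exactly once per repeat.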

5.3 Correlation against Human Judgements

Compared to existing data such as SummEval or RealSum, the podcast summarization task is more abstractive, and its document length is about 10 times longer. Hence, we benchmark automatic assessment methods discussed in Section 5.2. The results are presented in Table 3 and Fig. 2.

Unsupervised with Reference. The methods achieve a high correlation. Because the references are abstractive, ROUGE and TripleMatching with reference generally yield higher scores for abstractive systems, as shown in Fig. 2(a) and 2(b).

Unsupervised with Document. Not only do these methods show a low correlation, but their correlation with human judgements is negative when including both extractive and abstractive systems. As shown in Fig. 2(c) to Fig. 2(h), these methods give overly high scores to extractive systems. The summary of an extractive system by default has a high lexical overlap with the document, suggesting that although question answering (QA) and entailment approaches are not designed to rely directly on lexical overlap, they appear to give a high score to a summary with a high lexical overlap.

Another point is that when the input document is much longer (e.g. 6375 words on average for podcast transcripts) than the limit of a model (e.g. 512 for BERT), the entailment system performs poorly, but this can be mitigated by using a base entailment model with a larger limit such as Longformer. For the question-answering approach, we observe that swapping the question answering model from BERT to Longformer does not show an improvement. This is likely because the question answering model is trained on SQuAD2.0 data, where most answers are within BERT’s length limit.

Supervised with Document. First, a baseline CNN model is trained in a weakly supervised fashion using ROUGE-L(\mathbf{y},\mathbf{y}^{*}) as the target. We show that this weakly supervised approach yields a considerably higher correlation than unsupervised approaches, and it is able to learn not to score extractive systems too high. Second, we show that supervised training yields models with the highest correlation among the approaches without reference, and a correlation similar to that of ROUGE-L(\mathbf{y},\mathbf{y}^{*}) can be achieved. Third, we compare supervised BERT and supervised Longformer. Both systems take the concatenated [\mathbf{x};\mathbf{y}] with \mathbf{x} being truncated first for long inputs. The fact that these two systems achieve a similar performance level suggests that the systems may learn to use the signal only from \mathbf{y}, i.e. they rely on the fluency/coherence aspect rather than the informativeness aspect.

| Method | Against | Type | System-level (Inc.) | System-level (Exc.) | Summary-level (Inc.) | Summary-level (Exc.) |
|---|---|---|---|---|---|---|
| ROUGE-L (\mathbf{y},\mathbf{y}^{*}) | Ref | Unsupervised | 0.905 | 0.864 | 0.350 | 0.246 |
| TripleMatching (\mathbf{y},\mathbf{y}^{*}) | Ref | Unsupervised | 0.838 | 0.746 | 0.079 | 0.052 |
| ROUGE-L (\mathbf{y},\mathbf{x}) | Doc | Unsupervised | -0.200 | 0.364 | -0.036 | 0.250 |
| TripleMatching (\mathbf{y},\mathbf{x}) | Doc | Unsupervised | -0.159 | 0.453 | -0.123 | 0.143 |
| QA approach [B-512] | Doc | Unsupervised | -0.112 | 0.517 | -0.045 | 0.123 |
| QA approach [L-4096] | Doc | Unsupervised | -0.115 | 0.503 | -0.071 | 0.118 |
| Entailment [B-512] | Doc | Unsupervised | 0.356 | 0.114 | 0.102 | 0.021 |
| Entailment [L-4096] | Doc | Unsupervised | -0.192 | 0.392 | -0.105 | -0.059 |
| CNN model | Doc | Weakly Supervised | 0.728 | 0.563 | 0.171 | 0.019 |
| CNN model | Doc | Supervised | 0.901 | 0.902 | 0.299 | 0.183 |
| BERT model | Doc | Supervised | 0.905 | 0.869 | 0.237 | 0.156 |
| Longformer model | Doc | Supervised | 0.909 | 0.896 | 0.278 | 0.196 |

Table 3: Spearman correlation (19 systems, excluding the creator description). Inc./Exc. = Including/Excluding extractive summaries. Pearson correlation results are provided in Appendix C.
Figure 2: Scatter plots and best fitted lines on abstractive systems. Blue = abstractive systems, Orange = extractive systems, Green = bart-large-cnn system. Panels: (a) ROUGE-L (\mathbf{y},\mathbf{y}^{*}); (b) TripleMatch (\mathbf{y},\mathbf{y}^{*}); (c) ROUGE-L (\mathbf{y},\mathbf{x}); (d) TripleMatch (\mathbf{y},\mathbf{x}); (e) QA [BERT-512]; (f) QA [Long-4096]; (g) Entail [BERT-512]; (h) Entail [Long-4096]; (i) CNN (ROUGE); (j) CNN (Human); (k) BERT; (l) Longformer.

6 Assessment Method for Data Selection

6.1 Absolute Score Prediction

Another use case of summary assessment is to predict the quality on an absolute scale. On the podcast data, a direct application is to select appropriate document-description pairs for training summarization models (discussed in Section 6.2).

Baselines: Supervised Approach

Because methods such as ROUGE, QA, or entailment do not predict a score on an absolute scale, they are not applicable. Hence, we focus on the performance of supervised approaches. We perform 5-fold cross-validation training. In Table 4, we show the correlation against human judgements and the RMSE when computed on all test samples. Despite a similar correlation at system-level and summary-level, the CNN model achieves the highest correlation as well as the lowest variance in performance when evaluating on all test samples.

| Model | Spearman (\uparrow) | RMSE (\downarrow) |
|---|---|---|
| CNN | 0.431\pm0.005 | 0.884\pm0.003 |
| BERT | 0.353\pm0.061 | 0.909\pm0.024 |
| Longformer | 0.397\pm0.041 | 0.900\pm0.019 |

Table 4: Absolute score prediction baselines.

Next, we investigate the impact of pre-training the CNN with negative samples as done in bao2020end. For pre-training, we use the CNNDM dataset where we assign 1.0 to real summaries and 0.0 to randomly selected summaries. When using the pre-trained model on podcast data, the prediction is scaled up by \times3.0. As shown in Table 5, the pre-trained and fine-tuned models perform worse than the vanilla model. We found that the mean prediction of the pre-trained model is 2.75, which is close to 3.0, suggesting that the negative sampling task is too different from the podcast task.
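
The negative sampling setup can be sketched as follows. The sketch assumes that each negative example pairs a document with a summary drawn uniformly from a different document, which is one plausible reading of "randomly selected summaries"; the function names are illustrative.

```python
import random


def build_negative_samples(documents, summaries, seed=0):
    """Return (document, summary, label): 1.0 for real pairs, 0.0 for mismatched pairs."""
    if len(documents) != len(summaries) or len(documents) < 2:
        raise ValueError("need at least two aligned document-summary pairs")
    rng = random.Random(seed)
    examples = []
    for i, doc in enumerate(documents):
        examples.append((doc, summaries[i], 1.0))
        # Assumption: draw the negative summary from a different document.
        j = rng.randrange(len(documents) - 1)
        if j >= i:
            j += 1
        examples.append((doc, summaries[j], 0.0))
    return examples


def to_podcast_scale(prediction, factor=3.0):
    """Map a prediction in [0, 1] onto the 0 to 3 human score scale."""
    return prediction * factor
```

A model trained on this task mostly needs to tell whether a summary belongs to a document at all, which may explain why its predictions saturate near the top of the scale on real podcast summaries.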

| CNN Model | Spearman (\uparrow) | RMSE (\downarrow) |
|---|---|---|
| Trained | 0.431\pm0.005 | 0.884\pm0.003 |
| Pre-trained | -0.005 | 1.954 |
| + Fine-tuned | 0.400\pm0.006 | 0.915\pm0.005 |

Table 5: Impact of pre-training.

Seen vs. Unseen Data

So far, we have performed cross-validation training where samples are all-shuffled. Here, we investigate other scenarios, including: (i) when a system is held out entirely such that no summaries from a particular system are seen at training; (ii) when some documents are held out. Again, we train each configuration 5 times. In Table 6, the results show that the RMSE is the highest when the creator description set (R1) is held out. Note that holding out an extractive system such as E1 is expected to yield a low correlation because most extractive summaries are graded either just fair or bad (83% for E1, 95% for E2, and 97% for E3), but their RMSE values are not the worst. Next, when there are unseen documents at inference time (i.e. held-out documents), the performance is also worse than all-shuffled.

| n-fold | Held-out | Spearman (\uparrow) | RMSE (\downarrow) |
|---|---|---|---|
| all-shuffled | random | 0.431\pm0.005 | 0.884\pm0.003 |
| system | E1 | 0.233\pm0.034 | 0.780\pm0.036 |
| system | E3 | 0.129\pm0.055 | 0.878\pm0.079 |
| system | A7 | 0.473\pm0.062 | 0.968\pm0.053 |
| system | A12 | 0.540\pm0.023 | 0.958\pm0.020 |
| system | A16 | 0.434\pm0.043 | 0.888\pm0.022 |
| system | R1 | 0.245\pm0.049 | 1.035\pm0.027 |
| document | document | 0.242\pm0.044 | 0.964\pm0.057 |

Table 6: Different ways of held-out splits.
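
The held-out splits can be sketched with a single helper that separates examples by system or by document. The dictionary keys `"system"` and `"document"` are hypothetical field names chosen for this illustration.

```python
def hold_out(examples, key, held_out_values):
    """Split examples so that those matching held_out_values are never seen in training."""
    train = [ex for ex in examples if ex[key] not in held_out_values]
    test = [ex for ex in examples if ex[key] in held_out_values]
    return train, test


# Toy data: 3 systems x 4 documents (hypothetical identifiers).
examples = [
    {"system": s, "document": d}
    for s in ["R1", "E1", "A1"]
    for d in range(4)
]
train_sys, test_sys = hold_out(examples, "system", {"R1"})
train_doc, test_doc = hold_out(examples, "document", {0, 1})
```

In the all-shuffled setting, by contrast, the same document and the same system can appear on both sides of the split, which makes that setting easier than either held-out scenario.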

The results in Table 6 motivate us to further investigate the scenario where there are unseen document and creator description pairs. Hence, we use all 3580 summary assessment examples as train/valid sets (80%/20%), and we make use of 150 document-description pairs as the test set. Spotify released these 150 documents/episodes in the initial phase of TREC2020, and we call this set test150.

We found that this unseen scenario appears very challenging for the model. 22 out of 50 training runs (each run differs in its train/valid data shuffle) have a negative correlation on test150, and the average correlation of all 50 runs is close to zero at 0.011 as shown in Table 7.

| Method | Spearman (\uparrow) | RMSE (\downarrow) |
|---|---|---|
| SingleModel | 0.011\pm0.090 | 1.100\pm0.040 |
| Ensemble | 0.109 | 1.034 |

Table 7: Results on unseen doc-description (test150).

Ensemble Performance and Uncertainty

To achieve the best performance, we use an ensemble by averaging the predictions of the single models. The ensemble achieves 0.109 in Spearman correlation on test150. In addition to the performance gain, the ensemble allows us to investigate uncertainty. Initial uncertainty results in Fig. 3 show that when the models agree, the predictions are more reliable than when they do not. This suggests that uncertainty could further help the data selection task in future work.
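
Averaging the single-model predictions and measuring their disagreement can be sketched as below. Using the standard deviation across models as the uncertainty measure is an assumption made for this illustration; the text does not specify the measure.

```python
from statistics import mean, pstdev


def ensemble_predict(model_predictions):
    """model_predictions[m][n] is the score from model m for example n.

    Returns the per-example mean (ensemble prediction) and the per-example
    standard deviation across models (assumed uncertainty measure).
    """
    per_example = list(zip(*model_predictions))
    means = [mean(p) for p in per_example]
    spreads = [pstdev(p) for p in per_example]
    return means, spreads


# Three models, two examples: the models agree on the first example only.
means, spreads = ensemble_predict([[1.0, 2.0], [1.0, 0.0], [1.0, 1.0]])
```

Examples with a small spread could then be treated as more reliable when selecting training data.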

Figure 3: Uncertainty results on test150. Panels: (a) \rho and Uncertainty; (b) RMSE and Uncertainty.

6.2 Summary Generation Training

Because the podcast summarization dataset does not have perfect or gold summaries for training and evaluating summary generation models, previous work filtered the training set, down from 105k to 60k examples, using simple heuristics manakul2020cued_speech (more information in Appendix B). This filtered set is called the brass set. In this work, we investigate whether assessment models can be used to perform data selection.

We use the ensemble system (in Table 7) for selecting document-description training examples. We run the system on the entire podcast summarization training set of 105k examples. We create a top set where we keep the 60k examples with the highest assessment scores and a bottom set where we keep the 60k examples with the lowest scores.
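
The selection step amounts to ranking the training examples by their predicted assessment score. A minimal sketch, with an illustrative function name and toy scores:

```python
def select_top_bottom(scores, k):
    """Return indices of the k highest-scoring and k lowest-scoring examples."""
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], order[-k:]


# Toy assessment scores for five training examples.
top, bottom = select_top_bottom([0.2, 2.5, 1.1, 0.9, 1.8], k=2)
```

Note that with 105k examples and k = 60k, the two sets necessarily share about 15k mid-ranked examples, so the top and bottom sets are not disjoint.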

We train BART on each training set in Table 8 using the best configuration described in manakul2021_longspan: ORC-pad-rand is applied to select sentences at training time, and model-based MCS is applied at inference time. Note that we keep the same valid/test sets.

| Train-set | Size | Description |
|---|---|---|
| All | 105k | All training examples |
| Brass | 60k | Selection based on heuristics |
| Top | 60k | Examples of highest score |
| Bottom | 60k | Examples of lowest score |

Table 8: Training sets for summarization systems.

Impact on Assessment System Score

We generate summaries of the summarization testset (1027 examples) using BART trained with different training sets. Then, we predict the summary quality score. Table 9 and Fig. 4 show that using the assessment model to select the training set shifts the summarization model towards generating summaries that have a higher or lower assessment score at inference time. Therefore, this simple training data selection via an assessment method can guide the summarization model.

| Summarization Model | Average Score (Train-set) | Average Score (Test-set) |
|---|---|---|
| All | 1.053 | 0.941 |
| Brass | 1.083 | 0.956 |
| Top | 1.236 | 0.982 |
| Bottom | 0.867 | 0.900 |

Table 9: Assessment score in range [0.0, 3.0] predicted by our ensemble system on testset.
Figure 4: Cumulative density plot of the assessment scores on testset.

Impact on ROUGE

Despite the generated summaries of BART trained on top-score set obtaining the highest assessment system score, the performance measured by ROUGE (in Table 10) does not show an improvement over BART trained on all/brass/bottom sets.

It should be noted that the testset contains all EGFB grades (Excellent, Good, Fair, Bad), and a higher ROUGE score may only indicate that generated summaries are lexically closer to the summaries in the testset. Figure 2(a) also reveals that the correlation between ROUGE and human judgement is low or even negative when considering only top systems, e.g. system-level \rho = -0.28 for top-7 systems. We suggest that more attention is required when comparing high performing systems using ROUGE.

| Summarization Model | R1 | R2 | RL |
|---|---|---|---|
| All | 28.46 | 11.19 | 20.08 |
| Brass | 27.28 | 9.82 | 19.00 |
| Top | 27.22 | 9.81 | 18.87 |
| Bottom | 27.52 | 10.43 | 19.37 |

Table 10: Summarization system development results.

7 Conclusion

This work has assembled and released a new resource for summary assessment. The corpus is unique in that the data consists of podcast episodes, instead of news articles which have received more attention. This corpus has two interesting aspects: the documents are long, and there is a challenge in applying summary assessment methods to improve the summary generation task. We provide benchmark results of existing assessment methods on this new corpus as a baseline for future work. In addition, we apply model-based supervised assessment methods to select data for the generation task, and we provide initial results and insights based on the new corpus.

References

Appendix A Implementation

TripleMatching: We use Stanford’s CoreNLP OpenIE angeli-etal-2015-leveraging as the information extraction module.

CNN: The network has ResNet18 backbone he2016deep followed by a dropout layer (p=0.2) and a linear layer (1000\rightarrow 1). Sentence similarity grid is obtained via cosine-similarity between every pair of document sentences and summary sentences. Each similarity grid is resized to 640\times 32. The sentence representation is based on Sentence-BERT (bert-large-nli-mean-tokens).

BERT/Longformer: For supervised training, we use pre-trained weights from HuggingFace as follows: "bert-large-uncased" for BERT and "allenai/longformer-base-4096" for Longformer.

Supervised Training: We use the Adam optimizer with 10^{-5} learning rate, and we adopt early stopping, i.e. stop training when the validation loss does not improve.

Appendix B More information on Data

8 binary attributes: In addition to overall scores, NIST annotators also labelled 8 binary attributes (Yes/No questions) for each summary as follows jones_trec2020: (1) names of the main people included?; (2) any additional information about the people mentioned?; (3) main topic included?; (4) format of the podcast mentioned?; (5) context on title?; (6) redundant information?; (7) good written English?; (8) good start and end points?

brass set: jones_trec2020 filtered the entire summarization training set using three heuristics: (1) too long (>750 characters) or too short (<20 characters); (2) description too similar to other descriptions; (3) description too similar to its show description. Similarity is calculated using sklearn.

Appendix C Additional correlation results

| Method | System-lvl (Inc.) | System-lvl (Exc.) | Summary-lvl (Inc.) | Summary-lvl (Exc.) |
|---|---|---|---|---|
| R-L (\mathbf{y},\mathbf{y}^{*}) | 0.919 | 0.868 | 0.326 | 0.226 |
| TripleM (\mathbf{y},\mathbf{y}^{*}) | 0.815 | 0.762 | 0.069 | 0.047 |
| R-L (\mathbf{y},\mathbf{x}) | -0.465 | 0.427 | -0.137 | 0.224 |
| TripleM (\mathbf{y},\mathbf{x}) | -0.556 | 0.440 | -0.197 | 0.140 |
| QA [B-512] | -0.305 | 0.666 | -0.069 | 0.113 |
| QA [L-4096] | -0.422 | 0.629 | -0.100 | 0.107 |
| Entail [B-512] | 0.453 | 0.212 | 0.119 | 0.024 |
| Entail [L-4096] | -0.515 | 0.345 | -0.109 | -0.061 |
| CNN (weakly) | 0.685 | 0.565 | 0.198 | 0.021 |
| CNN | 0.838 | 0.889 | 0.315 | 0.191 |
| BERT | 0.907 | 0.841 | 0.267 | 0.180 |
| Longformer | 0.922 | 0.926 | 0.295 | 0.203 |

Table 11: Pearson’s r (complementary to Tab. 3).