We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent work in response generation has adopted metrics from machine translation that compare a model's generated response against a single target response. We show that these metrics correlate very weakly with human judgments in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses of existing metrics, and offer recommendations for the future development of automatic evaluation metrics for dialogue systems.
BLEU is the de facto standard machine translation (MT) evaluation metric. However, because BLEU computes a geometric mean of n-gram precisions, it often correlates poorly with human judgment at the sentence level. Therefore, several smoothing techniques have been proposed. This paper systematically compares 7 smoothing techniques for sentence-level BLEU. Three of them are first proposed in this paper, and they correlate better with human judgments at the sentence level than other smoothing techniques. Moreover, we also compare the performance of using the 7 smoothing techniques in statistical machine translation tuning.
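As a concrete illustration of the idea, the sketch below computes a sentence-level BLEU score with simple add-one smoothing of the n-gram precisions. It is a minimal Python example of one plausible smoothing strategy, not a reproduction of the seven techniques compared in the paper; the function names and the choice of add-one smoothing are this sketch's own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing of the n-gram precisions.

    `candidate` and `reference` are token lists. This is a simplified
    illustration, not one of the paper's exact smoothing variants.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(candidate, n)
        ref_ngrams = ngrams(reference, n)
        overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # Add-one smoothing keeps higher-order precisions non-zero,
        # which avoids the score collapsing to 0 on short sentences.
        precision = (overlap + 1.0) / (total + 1.0)
        log_precisions.append(math.log(precision))
    # Brevity penalty, as in corpus-level BLEU.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(smoothed_sentence_bleu("the cat sat on the mat".split(),
                             "the cat is on the mat".split()))
```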
We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. We evaluate METEOR by measuring the correlation between the metric scores and human judgments of translation quality. We compute the Pearson R correlation value between its scores and human quality assessments of the LDC TIDES 2003 Arabic-to-English and Chinese-to-English datasets. We perform segment-by-segment correlation, and show that METEOR gets an R correlation value of 0.347 on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination. We also perform experiments to show the relative contributions of the various mapping modules.
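For reference, the combination step described above is commonly written as follows, where P and R are unigram precision and recall over the matched unigrams; the constants are the defaults reported in the original METEOR paper (later METEOR versions tune these parameters).

```latex
F_{\mathrm{mean}} = \frac{10\,P\,R}{R + 9P}, \qquad
\mathrm{Penalty} = 0.5\left(\frac{\#\mathrm{chunks}}{\#\mathrm{matched\ unigrams}}\right)^{3}, \qquad
\mathrm{Score} = F_{\mathrm{mean}} \times (1 - \mathrm{Penalty})
```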
In this paper, we present ParaEval, an automatic evaluation framework that uses paraphrases to improve the quality of machine translation evaluations. Previous work has focused on fixed n-gram evaluation metrics coupled with lexical identity matching. ParaEval addresses three important issues: support for paraphrase/synonym matching, recall measurement, and correlation with human judgments. We show that ParaEval correlates significantly better than BLEU with human assessment in measurements for both fluency and adequacy.
In this paper we describe two new objective automatic evaluation methods for machine translation. The first method is based on longest common subsequence between a candidate translation and a set of reference translations. Longest common subsequence takes into account sentence level structure similarity naturally and identifies longest co-occurring in-sequence n-grams automatically. The second method relaxes strict n-gram matching to skip-bigram matching. Skip-bigram is any pair of words in their sentence order. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. The empirical results show that both methods correlate with human judgments very well in both adequacy and fluency.
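A minimal sketch of the first (LCS-based) method follows, assuming a single reference and an F-measure combination; the `beta` parameter and function names are this example's own simplifications rather than the exact published formulation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_fscore(candidate, reference, beta=1.0):
    """LCS-based precision, recall, and F-measure in the spirit of the first method."""
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(lcs_fscore("police killed the gunman".split(),
                 "police kill the gunman".split()))
```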
We argue that the machine translation community is overly reliant on the Bleu machine translation evaluation metric. We show that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and give two significant counterexamples to Bleu's correlation with human judgments of quality. This offers new potential for research which was previously deemed unpromising by an inability to improve upon Bleu scores.
Recognizing and generating paraphrases is an important component in many natural language processing applications. A well-established technique for automatically extracting paraphrases leverages bilingual corpora to find meaning-equivalent phrases in a single language by "pivoting" over a shared translation in another language. In this paper we revisit bilingual pivoting in the context of neural machine translation and present a paraphrasing model based purely on neural networks. Our model represents paraphrases in a continuous space, estimates the degree of semantic relatedness between text segments of arbitrary length, and generates candidate paraphrases for any source input. Experimental results across tasks and datasets show that neural paraphrases outperform those obtained with conventional phrase-based pivoting approaches.
We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.
We present a survey of machine translation (MT) evaluation, covering both manual and automatic evaluation methods. Traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. Advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria, among others. We classify automatic evaluation methods into two categories: the lexical similarity scenario and the application of linguistic features. The lexical similarity scenario covers edit distance, precision, recall, F-measure, and word order. Linguistic features can be divided into syntactic features and semantic features. Syntactic features include part-of-speech tags, phrase types, and sentence structure, while semantic features include named entities, synonyms, textual entailment, paraphrases, semantic roles, and language models. Deep learning models for evaluation are very recent. Subsequently, we also introduce the evaluation methods for MT evaluation itself, including different correlation scores, as well as the recent MT quality estimation (QE) tasks. This paper differs from existing work [GALEprogram2009, EuroMatrixProject2007] in several respects: it presents the latest developments in MT evaluation measures, a different classification from manual to automatic evaluation measures, an introduction to recent QE tasks for MT, and a concise structuring of the content. We hope this work can help MT researchers easily find the metrics best suited to the development of their particular MT models, and help MT evaluation researchers get a general picture of how MT evaluation research has developed. Moreover, we hope this work can also shed some light on other evaluation tasks in the NLP field beyond translation.
This paper studies the impact of paraphrases on the accuracy of automatic evaluation. Given a reference sentence and a machine-generated sentence, we seek to find a paraphrase of the reference sentence that is closer in wording to the machine output than the original reference. We apply our paraphrasing method in the context of machine translation evaluation. Our experiments show that the use of a paraphrased synthetic reference refines the accuracy of automatic evaluation. We also found a strong connection between the quality of automatic paraphrases as judged by humans and their contribution to automatic evaluation.
Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).
Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collecting human annotations to measure consensus, a new automated metric that captures consensus, and two new datasets: PASCAL-50S and ABSTRACT-50S that contain 50 sentences describing each image. Our simple metric captures human judgment of consensus better than existing metrics across sentences generated by various sources. We also evaluate five state-of-the-art image description approaches using this new protocol and provide a benchmark for future comparisons. A version of CIDEr named CIDEr-D is available as a part of MS COCO evaluation server to enable systematic evaluation and benchmarking.
Automatic Machine Translation (MT) evaluation metrics have traditionally been evaluated by the correlation of the scores they assign to MT output with human judgments of translation performance. Different types of human judgments, such as Fluency, Adequacy, and HTER, measure varying aspects of MT performance that can be captured by automatic MT metrics. We explore these differences through the use of a new tunable MT metric: TER-Plus, which extends the Translation Edit Rate evaluation metric with tunable parameters and the incorporation of morphology, synonymy and paraphrases. TER-Plus was shown to be one of the top metrics in NIST's Metrics MATR 2008 Challenge, having the highest average rank in terms of Pearson and Spearman correlation. Optimizing TER-Plus to different types of human judgments yields significantly improved correlations and meaningful changes in the weight of different types of edits, demonstrating significant differences between the types of human judgments.
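For context, the base Translation Edit Rate that TER-Plus extends is the widely cited edit-distance ratio (this formula is the standard TER definition, not something stated in the abstract above):

```latex
\mathrm{TER} = \frac{\#\,\text{edits (insertions, deletions, substitutions, shifts)}}{\text{average number of reference words}}
```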
We describe two metrics for automatic evaluation of machine translation quality. These metrics, BLEU and NEE, are compared to human judgments of the quality of translations of Arabic, Chinese, French, and Spanish documents into English.
We consider the problem of learning general-purpose, paraphrastic sentence embeddings in the setting of Wieting et al. (2016b). We use neural machine translation to generate sentential paraphrases via back-translation of bilingual sentence pairs. We evaluate the paraphrase pairs by their ability to serve as training data for learning paraphrastic sentence embeddings. We find that the data quality is stronger than prior work based on bitext and on par with manually-written English paraphrase pairs, with the advantage that our approach can scale up to generate large training sets for many languages and domains. We experiment with several language pairs and data sources, and develop a variety of data filtering techniques. In the process, we explore how neural machine translation output differs from human-written sentences, finding clear differences in length, the amount of repetition, and the use of rare words.
Motivated by recent findings on the probabilistic modeling of acceptability judgments, we propose syntactic log-odds ratio (SLOR), a normalized language model score, as a metric for reference-less fluency evaluation of natural language generation output at the sentence level. We further introduce WPSLOR, a novel WordPiece-based version, which harnesses a more compact language model. Even though word-overlap metrics like ROUGE are computed with the help of hand-written references, our reference-less methods obtain a significantly higher correlation with human fluency scores on a benchmark dataset of compressed sentences. Finally, we present ROUGE-LM, a reference-based metric which is a natural extension of WPSLOR to the case of available references. We show that ROUGE-LM achieves a significantly higher correlation with human judgments than all baseline metrics, including WPSLOR on its own.
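For reference, SLOR normalizes a sentence's language model score by subtracting the unigram log-probability and dividing by sentence length; a common way to write it (following the acceptability-modeling work this paper builds on) is

```latex
\mathrm{SLOR}(S) = \frac{\log p_{M}(S) - \log p_{u}(S)}{|S|},
\qquad p_{u}(S) = \prod_{t \in S} p(t)
```

where p_M(S) is the sentence probability under the language model and p_u(S) is the product of the unigram probabilities of its tokens.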
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
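For reference, the corpus-level score this paper introduces (BLEU) is standardly written as a brevity-penalized geometric mean of modified n-gram precisions:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r\\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

where p_n are the modified n-gram precisions, w_n are uniform weights (typically 1/N with N = 4), c is the candidate length, and r is the effective reference length.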
Comparisons of automatic evaluation metrics for machine translation are usually conducted on corpus level using correlation statistics such as Pearson's product moment correlation coefficient or Spearman's rank order correlation coefficient between human scores and automatic scores. However, such comparisons rely on human judgments of translation qualities such as adequacy and fluency. Unfortunately, these judgments are often inconsistent and very expensive to acquire. In this paper, we introduce a new evaluation method, ORANGE, for evaluating automatic machine translation evaluation metrics automatically without extra human involvement other than using a set of reference translations. We also show the results of comparing several existing automatic metrics and three new automatic metrics using ORANGE.
Although neural machine translation (NMT) yields promising translation performance, it unfortunately suffers from over-translation and under-translation problems [Tu et al., 2016], the study of which has become a research hotspot in NMT. Currently, these studies mainly apply the dominant automatic evaluation metrics, such as BLEU, to evaluate overall translation quality with respect to both adequacy and fluency. However, they are unable to accurately measure the ability of NMT systems to handle the above-mentioned problems. In this paper, we propose two quantitative metrics, Otem and Utem, to automatically evaluate system performance in terms of over-translation and under-translation, respectively. Both metrics are based on the proportion of mismatched n-grams between the gold reference and the system translation. We evaluate these two metrics by comparing them against human evaluations, where the values of the Pearson correlation coefficient reveal their strong correlation. Moreover, in-depth analyses on various translation systems indicate some inconsistency between BLEU and our proposed metrics, highlighting the necessity and significance of our metrics.
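The sketch below illustrates the underlying idea of scoring over- and under-translation from mismatched n-gram proportions. The function and variable names are this example's own, and the exact Otem/Utem formulas in the paper differ in detail (for instance, in how multiple n-gram orders and multiple references are handled).

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Return a Counter of all n-grams in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def over_under_mismatch(candidate, reference, n=1):
    """Rough illustration of over-/under-translation as mismatched n-gram proportions.
    Not the exact Otem/Utem definitions from the paper."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    # N-grams the candidate produces more often than the reference licenses (over-translation).
    over = sum(max(c - ref[g], 0) for g, c in cand.items())
    # Reference n-grams the candidate fails to cover (under-translation).
    under = sum(max(r - cand[g], 0) for g, r in ref.items())
    return over / max(sum(cand.values()), 1), under / max(sum(ref.values()), 1)

print(over_under_mismatch("the the cat cat sat".split(),
                          "the cat sat on the mat".split()))
```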
We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences. Unlike most metrics, we compute a similarity score between items across the two sentences. We then find a maximum weight matching between the items such that each item in one sentence is mapped to at most one item in the other sentence. This general framework allows us to use arbitrary similarity functions between items, and to incorporate different information in our comparison, such as n-grams, dependency relations, etc. When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
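A minimal sketch of the general framework (maximum weight bipartite matching over item-level similarities) is given below, assuming NumPy and SciPy are available. The toy token_similarity function is a hypothetical placeholder for the richer n-gram and dependency-based similarities the metric actually uses, so this is an illustration of the matching idea rather than the published metric.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def token_similarity(a, b):
    """Toy similarity between two tokens; a stand-in for the real metric's
    combination of n-gram, lemma, POS, and dependency information."""
    return 1.0 if a == b else (0.5 if a.lower() == b.lower() else 0.0)

def max_weight_match_score(candidate, reference):
    """Precision/recall-style score from a maximum weight bipartite matching
    between candidate and reference tokens (a sketch of the general framework)."""
    sim = np.array([[token_similarity(c, r) for r in reference] for c in candidate])
    # linear_sum_assignment minimizes cost, so negate to obtain a maximum weight matching.
    rows, cols = linear_sum_assignment(-sim)
    total = sim[rows, cols].sum()
    precision = total / len(candidate)
    recall = total / len(reference)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(max_weight_match_score("The cat sat".split(), "the cat was sitting".split()))
```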