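Past dictionary-based approaches to translating a phrase from a language L1 into a language L2 require grammar rules to restructure the initial translations. This paper introduces a novel method that translates a given L1 phrase that is not found in the dictionary into L2 without using any grammar rules. We require at least one L1-L2 bilingual dictionary and n-gram data in L2. The average manual evaluation score of our translations is 4.29/5.00, which implies very high quality.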
translated by Google Translate
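The idea in the phrase-translation abstract above (dictionary lookups ranked by target-language n-gram data, with no grammar rules) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `translate_phrase`, the exhaustive search over word orders, and the toy dictionary and n-gram counts are all assumptions made for illustration.

```python
from itertools import permutations, product


def translate_phrase(phrase, dictionary, ngram_counts):
    """Return the candidate L2 phrase with the highest n-gram count, or None.

    Hypothetical helper: `dictionary` maps an L1 word to a list of L2 words,
    and `ngram_counts` maps an L2 phrase to its corpus frequency.
    """
    words = phrase.split()
    # Every source word needs at least one dictionary translation.
    options = [dictionary.get(w, []) for w in words]
    if not all(options):
        return None
    best, best_count = None, 0
    for choice in product(*options):        # one translation per source word
        for order in permutations(choice):  # try every word order
            candidate = " ".join(order)
            count = ngram_counts.get(candidate, 0)
            # Candidates never seen in the n-gram data are rejected.
            if count > best_count:
                best, best_count = candidate, count
    return best


# Toy, made-up data for illustration only.
toy_dictionary = {"casa": ["house", "home"], "blanca": ["white"]}
toy_ngrams = {"white house": 120, "white home": 3, "house white": 1}
result = translate_phrase("casa blanca", toy_dictionary, toy_ngrams)
```

Trying every permutation is only feasible for short phrases; the n-gram counts stand in for the reordering that grammar rules would otherwise provide.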
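This paper presents approaches for automatically creating a large number of new bilingual dictionaries for low-resource languages, especially resource-poor ones, from a single input bilingual dictionary. Our algorithms use available Wordnets and a machine translator (MT) to produce translations of words in a source language into many target languages. Since our approaches rely on just one input dictionary, available Wordnets and an MT, they are applicable to any bilingual dictionary as long as one of the two languages is English or has a Wordnet linked to the Princeton WordNet. Starting with 5 available bilingual dictionaries, we create 48 new bilingual dictionaries. Of these, 30 language pairs are not supported by the popular MTs Google and Bing.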
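This paper examines approaches for generating lexical resources for endangered languages. Our algorithms construct bilingual dictionaries and multilingual thesauruses using public Wordnets and a machine translator (MT). Since our work relies on only one bilingual dictionary between an endangered language and an "intermediate helper" language, it is applicable to languages that lack many existing resources.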
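Manually constructing a Wordnet is a difficult task that requires years of expert time. As a first step toward automatically constructing full Wordnets, we propose approaches for generating Wordnet synsets for both resource-rich and resource-poor languages, using publicly available Wordnets, a machine translator and/or a single bilingual dictionary. Our algorithms translate the synsets of existing Wordnets into a target language T, then apply a ranking method to the translation candidates to find the best translations in T. Our approaches are applicable to any language that has at least one existing bilingual dictionary translating from English into it.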
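Bilingual dictionaries are expensive resources, and not many are available when one of the languages is resource-poor. In this paper, we propose algorithms for creating new reverse bilingual dictionaries from existing bilingual dictionaries in which English is one of the two languages. Our algorithms exploit the similarity between word-concept pairs, computed with the English Wordnet, to produce reverse dictionary entries. Since our algorithms rely on available bilingual dictionaries, they are applicable to any bilingual dictionary as long as one of the two languages has a Wordnet-type lexical ontology.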
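Neural machine translation (NMT) models have been effective on large bilingual datasets. However, existing methods and techniques show that model performance depends heavily on the number of examples in the training data. For many languages, having a corpus of that size is a far-fetched dream. Taking inspiration from monolingual speakers who explore new languages with bilingual dictionaries, we investigate the applicability of bilingual dictionaries to languages with an extremely small bilingual corpus or none at all. In this paper, we explore methods that use bilingual dictionaries with an NMT model to improve translation for extremely low-resource languages. We extend this work to multilingual systems that exhibit zero-shot properties. We present a detailed analysis of the effects of dictionary quality, training dataset size, language family and other factors on translation quality. Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines.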
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
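Two key assumptions shape the usual view of ranked retrieval: (1) that searchers can choose query words that might appear in the documents they wish to see, and (2) that ranking the retrieved documents will suffice because searchers will be able to recognize the ones they wished to find. When the documents to be searched are in a language the searcher does not know, neither assumption holds. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art in cross-language information retrieval and outlines some open research questions.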
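Indonesian is an agglutinative language because it has a complex word formation process. A translation model for this language therefore requires a mechanism that operates below the word level, referred to as the sub-word level. This compounding process leads to a rare word problem, since the vocabulary size explodes. We propose a strategy to address the unique word problem of neural machine translation (NMT) systems in which Indonesian is one of the paired languages. Our approach uses a rule-based method to transform a word into its root and accompanying affixes so as to retain its meaning and context. A rule-based algorithm has further advantages: it does not require corpus data and only applies the standard rules of Indonesian. Our experiments confirm that this method is practical. It reduces the vocabulary size significantly, by up to 57%, and on English-to-Indonesian translation this strategy improves on a similar NMT system that does not use the technique by up to 5 BLEU points.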
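While end-to-end neural machine translation (NMT) has achieved impressive progress, noisy input usually makes models fragile and unstable. Generating adversarial examples as augmented data has proved useful in alleviating this problem. Existing methods for adversarial example generation (AEG) operate at the word level or the character level. In this paper, we propose a phrase-level adversarial example generation (PAEG) method to enhance the robustness of the model. Our method uses a gradient-based strategy to substitute phrases at vulnerable positions in the source input. We verify our method on three benchmarks: the LDC Chinese-English, IWSLT14 German-English, and WMT14 English-German tasks. Experimental results show that our approach significantly improves performance compared with previous methods.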
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character ngram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English→German and English→Russian by up to 1.1 and 1.3 BLEU, respectively.
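The byte pair encoding segmentation mentioned in the abstract above can be illustrated by learning merge operations from a word-frequency table. The following is a minimal sketch and not the paper's reference implementation: the function name `learn_bpe`, the `</w>` end-of-word marker, and the toy word frequencies are assumptions chosen for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from {word: frequency}."""
    # Start from characters, with an end-of-word marker on every word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


# Toy frequencies, made up for illustration.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_freqs, 3)
```

Applying the learned merges in order to an unseen word splits it into known subword units, which is what lets a fixed-vocabulary model handle rare and unknown words.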
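Extracting frequent words from a collection of texts is done on a large scale in many fields. Extracting phrases, on the other hand, is less common because of the complications inherent in the task, the most significant being double counting: words or phrases are counted when they appear inside longer phrases that are themselves counted. Several papers on phrase mining describe solutions to this problem. However, they either require a list of so-called quality phrases to be available to the extraction process, or they require human interaction to identify those quality phrases during the process. We present a method that eliminates double counting without the need to identify a list of quality phrases. In the context of a set of texts, we define a principal phrase as a phrase that does not cross punctuation marks, neither starts nor ends with a stop word, occurs frequently within those texts without being double counted, and is meaningful to the user. Our method can identify such principal phrases independently, without human input, and can extract them from any texts. An R package called PHM has been developed to implement this method.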
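The boundary conditions in the principal-phrase definition above (no crossing of punctuation, no stop word at either end, a minimum frequency) can be sketched as a candidate filter. This is only an illustration in Python, not the R package's method, and it deliberately does not remove double counting. The function name `candidate_phrases`, the tiny stop-word list, and the thresholds are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is"}


def candidate_phrases(texts, max_len=3, min_count=2):
    """Count frequent word n-grams that satisfy the boundary conditions."""
    counts = Counter()
    for text in texts:
        # Splitting on punctuation guarantees no phrase crosses a punctuation mark.
        for segment in re.split(r"[^\w\s]+", text.lower()):
            words = segment.split()
            for n in range(1, max_len + 1):
                for i in range(len(words) - n + 1):
                    gram = words[i:i + n]
                    # Reject phrases that start or end with a stop word.
                    if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                        continue
                    counts[" ".join(gram)] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


toy_texts = ["The price of oil rose. The price of oil fell, sharply."]
candidates = candidate_phrases(toy_texts)
```

On the toy input, "price" and "oil" are each counted again inside "price of oil", which is exactly the double counting that the method described above is designed to eliminate.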
Gender-inclusive language is important for achieving gender equality in languages with gender inflections, such as German. While stirring some controversy, it is increasingly adopted by companies and political institutions. A handful of tools have been developed to help people use gender-inclusive language by identifying instances of the generic masculine and providing suggestions for more inclusive reformulations. In this report, we define the underlying tasks in terms of natural language processing, and present a dataset and measures for benchmarking them. We also present a model that implements these tasks, by combining an inclusive language database with an elaborate sequence of processing steps via standard pre-trained models. Our model achieves a recall of 0.89 and a precision of 0.82 in our benchmark for identifying exclusive language; and one of its top five suggestions is chosen in real-world texts in 44% of cases. We sketch how the area could be further advanced by training end-to-end models and using large language models; and we urge the community to include more gender-inclusive texts in their training data in order to not present an obstacle to the adoption of gender-inclusive language. Through these efforts, we hope to contribute to restoring justice in language and, to a small extent, in reality.
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection to improving translation.
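Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce a new multilingual retrieval model for this task, Cross-Language Ontology-Based Similarity Analysis (CL-OSA). CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs such as Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.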
With 84.75 million Filipinos online, the ability for models to process online text is crucial for developing Filipino NLP applications. To this end, spelling correction is a crucial preprocessing step for downstream processing. However, the lack of data prevents the use of language models for this task. In this paper, we propose an N-Gram + Damerau Levenshtein distance model with automatic rule extraction. We train the model on 300 samples, and show that despite limited training data, it achieves good performance and outperforms other deep learning approaches in terms of accuracy and edit distance. Moreover, the model (1) requires little compute power, (2) trains in little time, thus allowing for retraining, and (3) is easily interpretable, allowing for direct troubleshooting, highlighting the success of traditional approaches over more complex deep learning models in settings where data is unavailable.
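The edit-distance component named in the abstract above can be illustrated with the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions, and swaps of adjacent characters. This is a minimal sketch and not the paper's model: it omits the n-gram component and the automatic rule extraction, and the names `osa_distance` and `correct` and the toy lexicon are assumptions.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def correct(word, lexicon):
    """Return the lexicon entry closest to `word`; ties break alphabetically."""
    return min(lexicon, key=lambda w: (osa_distance(word, w), w))


# Toy lexicon, made up for illustration.
toy_lexicon = ["salamat", "maganda", "kumusta"]
suggestion = correct("slamat", toy_lexicon)
```

A nearest-neighbour lookup like `correct` needs no training data, which is one reason distance-based approaches remain attractive when few labeled samples exist.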
This paper describes Meteor Universal, released for the 2014 ACL Workshop on Statistical Machine Translation. Meteor Universal brings language specific evaluation to previously unsupported target languages by (1) automatically extracting linguistic resources (paraphrase tables and function word lists) from the bitext used to train MT systems and (2) using a universal parameter set learned from pooling human judgments of translation quality from several language directions. Meteor Universal is shown to significantly outperform baseline BLEU on two new languages, Russian (WMT13) and Hindi (WMT14).
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
A Machine Translation (MT) system generally aims to render source-language text automatically in a target language while retaining the original context, using various Natural Language Processing (NLP) techniques. One of these NLP methods is Statistical Machine Translation (SMT), which uses probabilistic and statistical techniques to analyze information and perform the conversion. This paper discusses the development of bilingual SMT models for translating English to fifteen low-resource Indian Languages (ILs) and vice versa. At the outset, all 15 languages are briefly described in relation to our experimental needs. Further, a detailed analysis of the Samanantar and OPUS datasets for model building, along with the standard benchmark dataset (Flores-200) for fine-tuning and testing, is carried out as part of our experiment. Different preprocessing approaches are proposed in this paper to handle noise in the datasets. To create the system, the MOSES open-source SMT toolkit is explored. Distance reordering is utilized with the aim of understanding the rules of grammar and context-dependent adjustments through a phrase reordering categorization framework. In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
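This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource because (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are available for a text fragment and for a sentence; (iii) the sentences are free-form and real-world; and (iv) the parallel texts are multilingual. We set up a fill-in-the-blank game for humans to evaluate the quality of the automatic image selection process of our dataset. We show the utility of the dataset on two automatic tasks: (i) fill-in-the-blank; (ii) lexical translation. Results of the human evaluation and of the automatic models demonstrate that images can be a useful complement to the textual context. The dataset will benefit research on the visual grounding of words, especially in the context of free-form sentences, and can be obtained from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.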