Domain Adaptation (DA) techniques aim at enabling machine learning methods to learn effective classifiers for a "target" domain when the only available training data belongs to a different "source" domain. In this paper we present the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification. DCI derives term representations in a vector space common to both domains, where each dimension reflects the term's distributional correspondence to a pivot, i.e., to a highly predictive term that behaves similarly across domains. Term correspondence is quantified by means of a distributional correspondence function (DCF). We propose a number of efficient DCFs that are motivated by the distributional hypothesis, i.e., the hypothesis according to which terms with similar meaning tend to have similar distributions in text. Experiments show that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification. DCI also brings about a significantly reduced computational cost, and requires a smaller amount of human intervention. As a final contribution, we discuss a more challenging formulation of the domain adaptation problem, in which both the cross-domain and cross-lingual dimensions are tackled simultaneously.
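A minimal sketch of the indexing step, assuming a term-by-document occurrence matrix and using the cosine as one plausible DCF (the paper proposes several; all names here are illustrative):

```python
import numpy as np

def cosine_dcf(t, p):
    """One plausible distributional correspondence function: the cosine
    between the document-occurrence profiles of a term and a pivot."""
    return (t @ p) / (np.linalg.norm(t) * np.linalg.norm(p) + 1e-12)

def dci_embed(term_doc, pivot_rows):
    """Re-represent every term as its vector of correspondences to the
    pivots; pivots play the same role in source and target domains, so the
    resulting space is shared across domains."""
    pivots = term_doc[pivot_rows]                       # |P| x docs
    return np.array([[cosine_dcf(t, p) for p in pivots] for t in term_doc])

# term_doc = np.random.rand(1000, 200)   # hypothetical 1000 terms, 200 docs
# X = dci_embed(term_doc, pivot_rows=[3, 17, 42])   # 1000 terms x 3 pivots
```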
Polylingual Text Classification (PLC) consists of automatically classifying, according to a common set of classes C, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each document via its corresponding language-specific classifier. In order to boost the classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We tackle multilabel PLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system where all documents, irrespective of their language, are classified by the same (second-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by the first-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available polylingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are publicly available.
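The two-tier scheme is easy to sketch with scikit-learn on toy data; the variable names, toy corpora and base learner below are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
data = {lang: (rng.random((60, 100)), rng.integers(0, 2, 60))
        for lang in ("en", "de", "fr")}                 # toy stand-in corpora

# First tier: one calibrated, language-dependent classifier per language.
tier1 = {lang: CalibratedClassifierCV(LinearSVC()).fit(X, y)
         for lang, (X, y) in data.items()}

# Second tier: all documents meet in the language-independent space of
# posterior probabilities and are classified by a single meta-classifier.
Z = np.vstack([tier1[lang].predict_proba(X) for lang, (X, _) in data.items()])
y = np.concatenate([y for _, (_, y) in data.items()])
tier2 = LinearSVC().fit(Z, y)
```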
We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of inter-language correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.
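A hedged sketch of the structural-correspondence step, assuming documents from both languages already share one bag-of-words space in which each pivot and its oracle translation occupy the same column; all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def scl_projection(X, pivot_cols, k=50):
    """Train one linear predictor per pivot ("does the pivot occur in this
    document?") and compress the stacked weight vectors with an SVD; the top
    singular vectors define a shared, cross-lingual projection."""
    W = []
    for p in pivot_cols:
        y = (X[:, p] > 0).astype(int)        # pivot-occurrence pseudo-labels
        Xm = X.copy()
        Xm[:, p] = 0                         # hide the pivot itself
        W.append(SGDClassifier(loss="log_loss").fit(Xm, y).coef_.ravel())
    U, _, _ = np.linalg.svd(np.array(W).T, full_matrices=False)
    return U[:, :k]                          # project documents via X @ theta

# theta = scl_projection(X_unlabeled, pivot_cols=[5, 12, 40], k=2)
```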
In recent years, great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose the Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data in a resource-rich source language to low-resource languages where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take input from a shared feature extractor in order to learn hidden representations that are simultaneously indicative for the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms state-of-the-art systems.
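A PyTorch sketch of the two-branch architecture, using the common gradient-reversal trick for the adversarial part (the paper's exact training scheme may differ; class and variable names are illustrative):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lamb * grad, None        # flip gradients into the extractor

class ADANSketch(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, n_classes=2):
        super().__init__()
        self.extractor = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.sentiment = nn.Linear(hidden, n_classes)  # trained on source labels
        self.language = nn.Linear(hidden, 2)           # adversarial discriminator

    def forward(self, avg_embedding, lamb=1.0):
        h = self.extractor(avg_embedding)    # shared, ideally language-invariant
        return self.sentiment(h), self.language(GradReverse.apply(h, lamb))
```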
We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words, which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data.
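The cross-lingual term of the objective can be sketched as pulling together the mean ("bag-of-words") embeddings of an aligned sentence pair; the monolingual noise-contrastive terms it regularizes, and the vocabulary sizes below, are assumptions of the sketch:

```python
import torch
from torch import nn

emb_en = nn.Embedding(10000, 100)            # hypothetical vocabulary sizes
emb_de = nn.Embedding(12000, 100)

def bilbowa_xling_loss(en_ids, de_ids):
    s = emb_en(en_ids).mean(dim=0)           # sentence as mean of word vectors
    t = emb_de(de_ids).mean(dim=0)
    return ((s - t) ** 2).sum()              # squared distance between the bags

# loss = bilbowa_xling_loss(torch.tensor([4, 9, 2]), torch.tensor([7, 1]))
# loss.backward()  # pulls both embedding tables toward each other
```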
Natural language processing is Anglocentric, while the demand for models that work in languages other than English is greater than ever. However, the task of transferring a model from one language to another can be expensive in terms of annotation cost, engineering time and effort. In this paper, we present a simple and effective general framework for transferring neural models from English to other languages. The framework relies on task representations as a form of weak supervision, and is model and task agnostic, meaning that many existing neural architectures can be ported to other languages with minimal effort. The only requirements are unlabeled parallel data and a loss defined over task representations. We evaluate our framework by transferring an English sentiment classifier to three different languages. On a battery of tests, we show that our models outperform a number of strong baselines and rival state-of-the-art results, which rely on more complex approaches and significantly more resources and data. In addition, we find that the framework proposed in this paper is able to capture semantically rich and meaningful representations across languages, despite the lack of direct supervision.
Despite interest in using cross-lingual knowledge to learn word embeddings for various tasks, a systematic comparison of the possible approaches is lacking in the literature. We perform an extensive evaluation of four popular approaches of inducing cross-lingual embeddings, each requiring a different form of supervision, on four typologically different language pairs. Our evaluation setup spans four different tasks, including intrinsic evaluation on monolingual and cross-lingual similarity, and extrinsic evaluation on downstream semantic and syntactic applications. We show that models which require expensive cross-lingual knowledge almost always perform better, but cheaply supervised models often prove competitive on certain tasks.
Cross-lingual embeddings are becoming increasingly important in multilingual NLP. Recently, it has been shown that these embeddings can be effectively learned by aligning two disjoint monolingual vector spaces through linear transformations, using no more than a small bilingual dictionary as supervision. In this work, we propose to apply an additional transformation after the initial alignment step, which moves cross-lingual synonyms towards a middle point between them. By applying this transformation, our aim is to obtain a better cross-lingual integration of the vector spaces. In addition, and surprisingly, the monolingual spaces also improve by this transformation. This is in contrast to the original alignment, which is typically learned such that the structure of the monolingual spaces is preserved. Our experiments confirm that the resulting cross-lingual embeddings outperform state-of-the-art models in both monolingual and cross-lingual evaluation tasks.
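A sketch under simplifying assumptions, with Xs and Xt holding the row-aligned embeddings of the dictionary pairs: step 1 is the usual orthogonal (Procrustes) alignment, and step 2 averages each pair, which is the extra transformation restricted to the dictionary entries (the paper learns a mapping that generalizes it to the full vocabulary):

```python
import numpy as np

def procrustes(Xs, Xt):
    """Orthogonal W minimizing ||Xs @ W - Xt||_F over the dictionary pairs."""
    U, _, Vt = np.linalg.svd(Xs.T @ Xt)
    return U @ Vt

def meet_in_the_middle(Xs, Xt):
    W = procrustes(Xs, Xt)                   # step 1: align the two spaces
    return (Xs @ W + Xt) / 2                 # step 2: move pairs to midpoints
```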
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the sentence embeddings using English-annotated data only, and to transfer it to any of the 93 languages without any modification. Our approach sets a new state of the art in zero-shot cross-lingual natural language inference for all 14 languages of the XNLI dataset. We also achieve very competitive results in cross-lingual document classification (the MLDoc dataset). Our sentence embeddings are likewise well suited to parallel corpus mining, establishing a new state of the art on the BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search, even for low-resource languages. Our PyTorch implementation, pre-trained encoder and the multilingual test set will be freely available.
Average word embeddings are a common baseline for more sophisticated sentence embedding techniques. However, they typically fall short of the performance of more complex models such as InferSent. Here, we generalize the concept of average word embeddings to power mean word embeddings. We show that the concatenation of different types of power mean word embeddings considerably closes the gap to state-of-the-art methods monolingually and substantially outperforms these more complex techniques cross-lingually. In addition, our proposed method outperforms different recently proposed baselines, such as SIF and Sent2Vec, by a solid margin, thus constituting a much harder-to-beat monolingual baseline. Our data and code are publicly available.
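The recipe is a few lines of numpy: concatenate several generalized means of the word vectors (p = 1 is the ordinary average; p = ±∞ give the coordinate-wise max and min). The sign-preserving root below is one way to extend the formula to negative embedding values; the choice of powers is illustrative:

```python
import numpy as np

def power_mean(word_vecs, p):
    v = np.asarray(word_vecs, dtype=float)
    if p == float("inf"):
        return v.max(axis=0)
    if p == float("-inf"):
        return v.min(axis=0)
    m = np.mean(np.sign(v) * np.abs(v) ** p, axis=0)
    return np.sign(m) * np.abs(m) ** (1.0 / p)

def sentence_embedding(word_vecs, ps=(1.0, float("inf"), float("-inf"))):
    # concatenating the means triples the dimensionality for three powers
    return np.concatenate([power_mean(word_vecs, p) for p in ps])

# sentence_embedding(np.random.randn(7, 300)).shape   # -> (900,)
```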
The lack of Chinese sentiment corpora limits the research progress on Chinese sentiment classification. However, there are many freely available English sentiment corpora on the Web. This paper focuses on the problem of cross-lingual sentiment classification, which leverages an available English corpus for Chinese sentiment classification by using the English corpus as training data. Machine translation services are used for eliminating the language gap between the training set and test set, and English features and Chinese features are considered as two independent views of the classification problem. We propose a co-training approach to making use of unlabeled Chinese data. Experimental results show the effectiveness of the proposed approach, which can outperform the standard inductive classifiers and the transductive classifiers.
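A hedged sketch of the co-training loop over the two views (English features from machine-translated text, Chinese features from the originals); the round count, growth size and base learner are illustrative choices, not the paper's settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xen, Xzh, y, Xen_u, Xzh_u, rounds=10, grow=5):
    for _ in range(rounds):
        c_en = LogisticRegression(max_iter=1000).fit(Xen, y)
        c_zh = LogisticRegression(max_iter=1000).fit(Xzh, y)
        if len(Xen_u) == 0:
            break
        # pick the unlabeled examples either view is most confident about
        conf = np.maximum(c_en.predict_proba(Xen_u).max(axis=1),
                          c_zh.predict_proba(Xzh_u).max(axis=1))
        pick = np.argsort(-conf)[:grow]
        pseudo = c_en.predict(Xen_u[pick])   # pseudo-label and move them over
        Xen = np.vstack([Xen, Xen_u[pick]])
        Xzh = np.vstack([Xzh, Xzh_u[pick]])
        y = np.concatenate([y, pseudo])
        keep = np.setdiff1d(np.arange(len(Xen_u)), pick)
        Xen_u, Xzh_u = Xen_u[keep], Xzh_u[keep]
    return c_en, c_zh
```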
How do we parse languages for which no treebanks are available? This contribution addresses the cross-lingual viewpoint on statistical dependency parsing, in which we attempt to make use of resource-rich source-language treebanks to build and adapt models for under-resourced target languages. We outline the benefits, and indicate the drawbacks, of the current major approaches. We emphasize synthetic treebanking: the automatic creation of target-language treebanks by means of annotation projection and machine translation. We present competitive results in cross-lingual dependency parsing using a combination of various techniques that contribute to the overall success of the method. We further include a detailed discussion of the impact of part-of-speech label accuracy on parsing results, which provides guidance for practical applications of cross-lingual methods to truly under-resourced languages.
Cross-lingual sentiment classification aims to adapt sentiment resources in a resource-rich language to a resource-poor language. In this study, we propose a representation learning approach which simultaneously learns vector representations for the texts in both the source and the target languages. Different from previous research, which only obtains bilingual word embeddings, our Bilingual Document Representation Learning model, BiDRL, directly learns document representations. Both semantic and sentiment correlations are utilized to map the bilingual texts into the same embedding space. The experiments are based on the multilingual multi-domain Amazon review dataset. We use English as the source language and Japanese, German and French as the target languages. The experimental results show that BiDRL outperforms the state-of-the-art methods for all the target languages.
Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models: an unsupervised one that relies only on monolingual data, and a supervised one that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, and on unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.
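The supervised objective can be illustrated by how a translation language modeling (TLM) training example might be built: concatenate a parallel sentence pair and mask tokens on both sides, so that predicting a masked word can draw on the other language's context. The token symbols and masking rate here are assumptions of the sketch:

```python
import random

def tlm_example(src_tokens, tgt_tokens, p=0.15, mask="[MASK]", sep="[/s]"):
    """Build (inputs, labels) for one concatenated parallel pair."""
    tokens = src_tokens + [sep] + tgt_tokens
    inputs, labels = [], []
    for tok in tokens:
        if tok != sep and random.random() < p:
            inputs.append(mask)
            labels.append(tok)               # predict the original token
        else:
            inputs.append(tok)
            labels.append(None)              # no loss on unmasked positions
    return inputs, labels

# tlm_example("the cat sat".split(), "le chat était assis".split())
```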
We introduce a distribution-based model to learn bilingual word embeddings from monolingual data. It is simple, effective, and does not require any parallel data or any seed lexicon. We take advantage of the fact that word embeddings usually come in the form of dense real-valued low-dimensional vectors, so their distribution can be accurately estimated. A novel cross-lingual learning objective is proposed which directly matches the distribution of word embeddings in one language with that in the other language. During the joint learning process, we dynamically estimate the distributions of word embeddings in the two languages and minimize the dissimilarity between them through standard back-propagation. Our learned bilingual word embeddings group each word and its translations together in the shared vector space. We demonstrate the utility of the learned embeddings on the task of finding word-to-word translations from monolingual corpora. Our model achieves encouraging performance on both related languages and substantially different languages.
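As one concrete choice of distribution dissimilarity that could be minimized between the two embedding populations, a Gaussian-kernel maximum mean discrepancy (MMD) is sketched below; the paper defines its own dynamically estimated measure, so this is an illustration of the idea rather than its objective:

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """MMD between two samples x, y of shape (batch, dim)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# emb_en, emb_fr: batches sampled from the two embedding tables;
# loss = gaussian_mmd(emb_en, emb_fr); loss.backward() nudges them together
```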
We present a novel, count-based approach to obtaining inter-lingual word representations based on inverted indexing of Wikipedia. We present experiments applying these representations to 17 datasets in document classification, POS tagging, dependency parsing, and word alignment. Our approach has the advantage that it is simple, computationally efficient and almost parameter-free, and, more importantly, it enables multi-source cross-lingual learning. In 14/17 cases, we improve over using state-of-the-art bilingual embeddings.
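A minimal sketch of the representation: each word becomes a row indexed by the inter-lingually linked Wikipedia articles (concepts) it occurs in, so words from different languages live in the same concept space. Input structures and names are illustrative:

```python
import numpy as np

def inverted_index_vectors(concept_texts, vocab):
    """concept_texts: {concept_id: tokens of that article in some language}.
    Returns a |vocab| x |concepts| occurrence matrix; running this once per
    language with the same concept ids yields directly comparable rows."""
    concepts = sorted(concept_texts)
    row = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(concepts)))
    for j, c in enumerate(concepts):
        for w in set(concept_texts[c]):
            if w in row:
                M[row[w], j] = 1.0           # word w occurs in concept c
    return M
```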
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English) and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, two of which are based on machine translation systems, and two of which use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among the available baselines.
In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semi-supervised learning, and adapt the translated model to better fit the data distribution of the target language.
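A hedged sketch of the model-translation step only: each source feature's weight is spread over its candidate translations according to translation probabilities, which the paper's EM algorithm would then re-estimate against unlabeled target-language data. Data structures are assumptions of the sketch:

```python
def translate_model(src_weights, trans_probs):
    """src_weights: {src_word: weight}; trans_probs: {src: {tgt: P(tgt|src)}}.
    Returns target-language feature weights, with ambiguity handled by
    splitting weight mass across the candidate translations."""
    tgt_weights = {}
    for s, w in src_weights.items():
        for t, p in trans_probs.get(s, {}).items():
            tgt_weights[t] = tgt_weights.get(t, 0.0) + p * w
    return tgt_weights

# translate_model({"good": 1.3}, {"good": {"bueno": 0.7, "bien": 0.3}})
```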
This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in the case that a sufficient number of training examples is available for each new language and in the case that no training examples are available for some language. Experimental results on the bilingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bilingual training, terminology translation and profile-based translation.