Domain Adaptation (DA) techniques aim at enabling machine learning methods to learn effective classifiers for a "target" domain when the only available training data belongs to a different "source" domain. In this paper we present the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification. DCI derives term representations in a vector space common to both domains, where each dimension reflects the term's distributional correspondence to a pivot, i.e., to a highly predictive term that behaves similarly across domains. Term correspondence is quantified by means of a distributional correspondence function (DCF). We propose a number of efficient DCFs that are motivated by the distributional hypothesis, i.e., the hypothesis according to which terms with similar meaning tend to have similar distributions in text. Experiments show that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification. DCI also brings about a significantly reduced computational cost, and requires a smaller amount of human intervention. As a final contribution, we discuss a more challenging formulation of the domain adaptation problem, in which both the cross-domain and cross-lingual dimensions are tackled simultaneously.
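As a rough illustration of the idea, the sketch below computes a simple cosine-based DCF over term-document occurrence profiles and projects every term onto the pivot dimensions. The matrix layout and function names are hypothetical, and the actual DCFs proposed in the paper may differ:

```python
import numpy as np

def cosine_dcf(term_vec, pivot_vec):
    """Cosine between the document-occurrence profiles of a term and a pivot.

    A high value means the two terms tend to occur in the same documents,
    which the distributional hypothesis links to similarity of meaning.
    """
    norm = np.linalg.norm(term_vec) * np.linalg.norm(pivot_vec)
    return float(term_vec @ pivot_vec) / norm if norm > 0 else 0.0

def project_terms(term_doc, pivot_ids):
    """Map every term to a vector of DCF scores against the chosen pivots.

    term_doc:  (n_terms, n_docs) binary occurrence matrix for one domain
    pivot_ids: row indices of the pivot terms in that matrix
    Computing this independently for the source and target domains yields
    a common, pivot-aligned space of dimension len(pivot_ids).
    """
    return np.array([[cosine_dcf(term_doc[t], term_doc[p])
                      for p in pivot_ids]
                     for t in range(term_doc.shape[0])])
```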
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
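The inductive process the survey describes fits in a few lines of scikit-learn; the toy documents and labels below are invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The machine learning approach in miniature: documents are turned into
# weighted term vectors (document representation) and a classifier is
# fit on the preclassified examples (classifier construction).
docs = ["cheap meds online now", "meeting notes attached", "win a prize"]
labels = ["spam", "ham", "spam"]
clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)
print(clf.predict(["free prize inside"]))   # -> ['spam']
```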
We introduce an architecture for learning joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the sentence embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our approach sets a new state of the art in zero-shot cross-lingual natural language inference for all 14 languages of the XNLI dataset. We also achieve very competitive results in cross-lingual document classification (MLDoc dataset). Our sentence embeddings are also strong at parallel corpus mining, establishing a new state of the art for the BUCC shared task in 3 of its 4 language pairs. Finally, we introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search, even for low-resource languages. Our PyTorch implementation, pre-trained encoder, and the multilingual test set will be made freely available.
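A minimal sketch of the zero-shot transfer setup described above; the `embed` callable stands in for the shared multilingual encoder, which is not reproduced here:

```python
from sklearn.linear_model import LogisticRegression

def zero_shot_transfer(embed, en_sents, en_labels, xx_sents):
    """Fit a classifier on English sentence embeddings only, then score
    sentences of another language without any modification or
    target-language labels.

    embed: callable mapping a list of sentences (in any of the covered
    languages) to an (n, d) array in the shared multilingual space,
    i.e. a stand-in for the BiLSTM encoder described above.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(en_sents), en_labels)    # English annotated data only
    return clf.predict(embed(xx_sents))    # any of the other languages
```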
This paper presents the results of the WMT16 shared tasks, which included five machine translation (MT) tasks (standard news, IT-domain, biomedical, multimodal, pronoun), three evaluation tasks (metrics, tuning, run-time estimation of MT quality), an automatic post-editing task, and a bilingual document alignment task. This year, 102 MT systems from 24 institutions (plus 36 anonymized online systems) were submitted to the 12 translation directions in the news translation task. The IT-domain task received 31 submissions from 12 institutions in 7 directions, and the biomedical task received 15 submissions from 5 institutions. Evaluation was both automatic and manual (relative ranking and 100-point scale assessments). The quality estimation task had three sub-tasks, with a total of 14 teams submitting 39 entries. The automatic post-editing task had a total of 6 teams submitting 11 entries.
We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article reveals that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.
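The first of the two tasks, bilingual lexicon extraction, reduces to nearest-neighbor search in the induced shared space; a minimal sketch, assuming the BWEs have already been learned:

```python
import numpy as np

def extract_lexicon(src_emb, tgt_emb, src_words, tgt_words, k=1):
    """For every source word, return its k nearest target words by
    cosine similarity in the shared bilingual embedding space.

    src_emb, tgt_emb: (n, d) arrays of word vectors in that space
    src_words, tgt_words: vocabularies aligned row-by-row with them
    """
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = s @ t.T                                # all cosine similarities
    top = np.argsort(-sims, axis=1)[:, :k]        # best k per source word
    return {src_words[i]: [tgt_words[j] for j in top[i]]
            for i in range(len(src_words))}
```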
Ensemble methods using multiple classifiers have proven to be among the most successful approaches for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of experiments using three ensemble-based models, testing each with multiple configurations and algorithms. This includes a rigorous application of meta-classification models for NLI, achieving state-of-the-art results on several large data sets, evaluated in both intra-corpus and cross-corpus modes.
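A classifier-stacking ensemble of the kind examined here can be sketched with scikit-learn; the feature views and estimators below are illustrative choices, not the paper's exact configuration:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each base classifier sees a different feature view of the texts; a
# meta-classifier is then trained on their cross-validated outputs.
base = [
    ("word_svm", make_pipeline(TfidfVectorizer(analyzer="word"),
                               LinearSVC())),
    ("char_svm", make_pipeline(TfidfVectorizer(analyzer="char_wb",
                                               ngram_range=(2, 4)),
                               LinearSVC())),
]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000))
# stack.fit(train_texts, train_l1_labels)   # L1 = native language labels
# stack.predict(test_texts)
```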
We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data.
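The cross-lingual term can be sketched as below, assuming the two monolingual embedding matrices are trained elsewhere by their noise-contrastive objectives; this shows only the regularizer computed on one sampled sentence pair, not the full training loop:

```python
import numpy as np

def bilbowa_xling_loss(src_emb, tgt_emb, src_ids, tgt_ids):
    """Sampled bag-of-words cross-lingual regularizer (a sketch of the
    idea): pull the mean word vectors of an aligned sentence pair
    together, so the two monolingual models share one space.

    src_emb, tgt_emb: (vocab, d) embedding matrices for each language
    src_ids, tgt_ids: word indices of one aligned sentence pair sampled
                      from the (much smaller) parallel corpus
    """
    diff = src_emb[src_ids].mean(axis=0) - tgt_emb[tgt_ids].mean(axis=0)
    return float(diff @ diff)    # squared L2 distance between the bags
```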
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
How do we parse the languages for which no treebanks are available? This contribution addresses the cross-lingual viewpoint on statistical dependency parsing, in which we attempt to make use of resource-rich source language treebanks to build and adapt models for under-resourced target languages. We outline the benefits and indicate the drawbacks of the current major approaches. We emphasize synthetic treebanking: the automatic creation of target language treebanks by means of annotation projection and machine translation. We present competitive results in cross-lingual dependency parsing using a combination of various techniques that contribute to the overall success of the method. We further include a detailed discussion about the impact of part-of-speech label accuracy on parsing results, which provides guidance for practical applications of cross-lingual methods to truly under-resourced languages.
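Annotation projection, one ingredient of synthetic treebanking, can be sketched as follows for the simplified case of a 1-to-1 word alignment; real projection schemes must additionally handle unaligned and many-to-many tokens:

```python
def project_dependencies(src_heads, align):
    """Direct annotation projection (simplified sketch): copy each
    dependency edge from a parsed source sentence onto the target
    sentence through a 1-to-1 word alignment.

    src_heads: src_heads[i] = head index of source token i (-1 = root)
    align:     dict mapping source token index -> target token index
    Returns tgt_heads with None where nothing could be projected.
    """
    n_tgt = max(align.values()) + 1 if align else 0
    tgt_heads = [None] * n_tgt
    for dep, head in enumerate(src_heads):
        if dep in align and (head == -1 or head in align):
            tgt_heads[align[dep]] = -1 if head == -1 else align[head]
    return tgt_heads
```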
In this paper we report the results of our research on dense distributed representations of text data. We propose two novel neural models for learning such representations. The first model learns document-level representations, while the second learns word-level representations. For document-level representations we propose Binary Paragraph Vectors: neural network models for learning binary representations of text documents, which can be used for fast document retrieval. We provide a thorough evaluation of these models and demonstrate that they outperform the seminal method in the field on an information retrieval task. We also report strong results in a transfer learning setting, where our models are trained on a generic text corpus and then used to infer codes for documents from a domain-specific dataset. In contrast to previously proposed approaches, Binary Paragraph Vector models learn embeddings directly from raw text data. For word-level representations we propose Disambiguated Skip-gram: a neural network model for learning multi-sense word embeddings. Representations learned by this model can be used in downstream tasks such as part-of-speech tagging or identification of semantic relations. On the word sense induction task, Disambiguated Skip-gram outperforms state-of-the-art models on three benchmark datasets. Our model has an elegant probabilistic interpretation. Moreover, unlike previous models of this kind, it is differentiable with respect to all of its parameters and can be trained with backpropagation. In addition to quantitative results, we present a qualitative evaluation of Disambiguated Skip-gram, including two-dimensional visualizations of selected word-sense embeddings.
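The retrieval use of binary document codes can be sketched as follows, assuming real-valued codes have already been inferred; thresholding and Hamming ranking are the generic ideas, not the paper's exact procedure:

```python
import numpy as np

def binarize(codes_real):
    """Threshold real-valued document codes into compact binary codes."""
    return (codes_real > 0).astype(np.uint8)

def hamming_rank(query_bits, index_bits):
    """Fast retrieval with binary codes: rank the indexed documents by
    Hamming distance to the query code (smallest distance first).

    query_bits: (d,) binary code of the query document
    index_bits: (n, d) binary codes of the indexed documents
    """
    dists = np.count_nonzero(index_bits != query_bits, axis=1)
    return np.argsort(dists)
```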
Recently, efficient distributed numerical representation models (word embeddings) combined with modern machine learning algorithms have yielded considerable improvements in automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of these models and algorithms to this specific problem through experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations (fastText, XGBoost, SVM, and Keras' CNN) and noteworthy word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data, and evaluated their suitability to the hierarchical context in particular with specific measures. FastText achieved an ${}_{LCA}F_1$ of 0.893 on a single-label version of the RCV1 dataset. The analysis indicates that using word embeddings and their variations is a very promising approach for HTC.
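A supervised fastText model of the kind evaluated here can be trained with the official Python bindings; the file path, hyperparameters, and example text below are hypothetical:

```python
import fasttext  # pip install fasttext

# train.txt holds one example per line in fastText's supervised format:
#   __label__CCAT __label__C15 rising quarterly profits lift shares ...
# Multiple labels per line let one flat model cover the label hierarchy;
# loss="ova" (one-vs-all) makes multi-label prediction well-behaved.
model = fasttext.train_supervised(
    input="train.txt", lr=0.5, epoch=25, wordNgrams=2, loss="ova")

labels, probs = model.predict("central bank raises interest rates", k=3)
print(labels, probs)  # top-3 predicted category labels with scores
```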
This paper presents the results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation quality, and a metrics task. This year, 143 machine translation systems from 23 institutions were submitted to the ten translation directions in the standard translation task. An additional 6 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had four subtasks, with a total of 10 teams submitting 57 entries.