Domain Adaptation (DA) techniques aim at enabling machine learning methods to learn effective classifiers for a "target" domain when the only available training data belong to a different "source" domain. In this paper we present the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification. DCI derives term representations in a vector space common to both domains, where each dimension reflects the distributional correspondence of the term to a pivot, i.e., to a highly predictive term that behaves similarly across domains. Term correspondence is quantified by means of a distributional correspondence function (DCF). We propose a number of efficient DCFs that are motivated by the distributional hypothesis, i.e., the hypothesis according to which terms with similar meaning tend to have similar distributions in text. Experiments show that DCI obtains better performance than current state-of-the-art techniques for cross-lingual and cross-domain sentiment classification. DCI also brings about a significantly reduced computational cost, and requires a smaller amount of human intervention. As a final contribution, we discuss a more challenging formulation of the domain adaptation problem, in which both the cross-domain and cross-lingual dimensions are tackled simultaneously.
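As a rough illustration of the idea, the sketch below embeds a term as its vector of correspondences to a set of pivots, using plain cosine similarity between document-occurrence profiles as the DCF. Both this particular DCF and the toy data are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_dcf(term_profile, pivot_profile):
    """A simple distributional correspondence function (DCF): cosine
    similarity between the document-occurrence profiles of a term and
    a pivot. (Illustrative; the paper proposes several refined DCFs.)"""
    num = term_profile @ pivot_profile
    den = np.linalg.norm(term_profile) * np.linalg.norm(pivot_profile)
    return 0.0 if den == 0 else num / den

def dci_embed(X, term_idx, pivot_idxs):
    """Represent a term as its vector of correspondences to the pivots.
    X is a binary term-document occurrence matrix (terms x documents)."""
    return np.array([cosine_dcf(X[term_idx], X[p]) for p in pivot_idxs])

# Toy occurrence matrix: 4 terms x 6 documents.
X = np.array([
    [1, 1, 0, 0, 1, 0],   # term 0: a "positive" pivot
    [1, 1, 0, 0, 1, 1],   # term 1: co-occurs with the positive pivot
    [0, 0, 1, 1, 0, 1],   # term 2: a "negative" pivot
    [0, 0, 1, 1, 0, 0],   # term 3: co-occurs with the negative pivot
], dtype=float)

# Term 1's representation in pivot space: high correspondence to the
# positive pivot, low correspondence to the negative one.
vec = dci_embed(X, term_idx=1, pivot_idxs=[0, 2])
```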
We discuss \emph{cross-lingual text quantification} (CLTQ), the task of performing text quantification (i.e., estimating the relative frequency $p_{c}(D)$ of all classes $c \in \mathcal{C}$ in a set $D$ of unlabelled documents) when training documents are available for a source language $\mathcal{S}$ but not for the target language $\mathcal{T}$ for which quantification needs to be performed. CLTQ has never been discussed before in the literature; we establish baseline results for the binary case by combining state-of-the-art quantification methods with methods capable of generating cross-lingual vector representations of the source and target documents involved. We present experimental results, obtained on publicly available datasets for cross-lingual sentiment classification, which show that the proposed methods can perform CLTQ with surprisingly high accuracy.
Polylingual Text Classification (PLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when naively classifying each document via its corresponding language-specific classifier. In order to boost the classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We approach multilabel PLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system where all documents, irrespective of language, are classified by the same (second-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by the first-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available polylingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
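A minimal sketch of the two-tier idea, assuming two languages, toy two-dimensional features, and scikit-learn logistic regressors in both tiers (the actual method uses calibrated classifiers and multilabel data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_lang_data(shift, n=60):
    """Toy language-specific features: the class signal is the same,
    but each language lives in its own region of feature space."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 0.3, (n, 2)) + np.c_[y, 1 - y] + shift
    return X, y

(X_en, y_en), (X_it, y_it) = make_lang_data(0.0), make_lang_data(5.0)

# First tier: one language-specific classifier per language.
clf_en = LogisticRegression().fit(X_en, y_en)
clf_it = LogisticRegression().fit(X_it, y_it)

# Second tier: all documents, regardless of language, are mapped to the
# common space of first-tier posterior probabilities ("funnelled").
Z = np.vstack([clf_en.predict_proba(X_en), clf_it.predict_proba(X_it)])
y_all = np.concatenate([y_en, y_it])
meta = LogisticRegression().fit(Z, y_all)

# A new Italian document is classified via its posterior vector.
pred = meta.predict(clf_it.predict_proba(X_it[:5]))
```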
We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of inter-language correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.
In recent years, sentiment classification for English has achieved great success, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such a rich supply of labeled data. To tackle sentiment classification in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) that transfers the knowledge learned from labeled data in a resource-rich source language to a low-resource language where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take their input from a shared feature extractor, so as to learn hidden representations that are simultaneously indicative for the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification show that ADAN significantly outperforms state-of-the-art systems.
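An adversarial branch of this kind is typically wired to the shared extractor through a gradient-reversal step. The following toy scalar example (not the paper's code; all names and numbers are illustrative) shows the effect of flipping the discriminator's gradient before it reaches the shared parameter:

```python
lam = 1.0           # reversal strength (a hyperparameter)
x, lang = 2.0, 1.0  # one toy input and its language label

def train(reverse, steps=30):
    """Scalar 'feature extractor' h = w*x trained against a squared-error
    language 'discriminator' loss 0.5*(h - lang)**2, with or without
    gradient reversal on the path back to w."""
    w = 0.6
    for _ in range(steps):
        h = w * x
        g_h = h - lang        # dLoss/dh of the discriminator loss
        if reverse:
            g_h = -lam * g_h  # gradient-reversal step
        w -= 0.1 * g_h * x    # chain rule: dh/dw = x
    return w * x

h_plain = train(False)  # no reversal: the feature converges to the label
h_grl = train(True)     # with reversal: the feature is pushed away from it
```

Without reversal, the shared parameter learns to predict the language; with reversal, it is driven in the opposite direction, which is what makes the representation language-invariant when a real classifier branch is trained alongside.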
Natural language processing is heavily Anglo-centric, while the demand for models that work in languages other than English is greater than ever. Yet the task of transferring a model from one language to another can be expensive in terms of annotation cost, engineering time, and effort. In this paper we present a general framework for transferring neural models from English to other languages simply and effectively. The framework relies on task representations as a form of weak supervision, and is model- and task-agnostic, meaning that many existing neural architectures can be ported to other languages with minimal effort. The only requirements are unlabeled parallel data and a loss defined over task representations. We evaluate our framework by transferring an English sentiment classifier to three different languages. On a battery of tests, we find that our models outperform a number of strong baselines and rival state-of-the-art results, which rely on more complex approaches and significantly more resources and data. Additionally, we find that the proposed framework is able to capture semantically rich and meaningful representations across languages, despite the lack of direct supervision.
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
The exponential increase in the availability of online reviews and recommendations makes sentiment classification an interesting topic in academic and industrial research. Reviews can span so many different domains that it is difficult to gather annotated training data for all of them. Hence, this paper studies the problem of domain adaptation for sentiment classifiers, whereby a system is trained on labeled reviews from one source domain but is meant to be deployed on another. We propose a deep learning approach which learns to extract a meaningful representation for each review in an unsupervised fashion. Sentiment classifiers trained with this high-level feature representation clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products. Furthermore, this method scales well and allowed us to successfully perform domain adaptation on a larger industrial-strength dataset of 22 domains.
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a BPE vocabulary shared across all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting sentence embeddings using English annotated data only, and to transfer it to any of the 93 languages without any modification. Our approach sets a new state of the art in cross-lingual natural language inference for all 14 languages of the XNLI dataset. We also achieve very competitive results in cross-lingual document classification (on the MLDoc dataset). Our sentence embeddings also prove strong at parallel corpus mining, establishing a new state of the art in the BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new test set of aligned sentences in 122 languages, based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search, even for low-resource languages. Our PyTorch implementation, pre-trained encoder, and multilingual test set will be made freely available.
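The transfer recipe described above, train once on English and apply everywhere, can be sketched as follows with simulated language-agnostic embeddings (the toy data and the nearest-centroid classifier are illustrative assumptions, not the actual encoder):

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(label, n):
    """Simulated joint embedding space: sentences of the same class land
    in the same region, regardless of the language they were written in."""
    centre = np.array([1.0, 0.0]) if label == 1 else np.array([0.0, 1.0])
    return centre + rng.normal(0, 0.1, (n, 2))

# Train a nearest-centroid classifier on English embeddings only.
X_pos_en, X_neg_en = embed(1, 20), embed(0, 20)
c_pos, c_neg = X_pos_en.mean(0), X_neg_en.mean(0)

def classify(X):
    d_pos = np.linalg.norm(X - c_pos, axis=1)
    d_neg = np.linalg.norm(X - c_neg, axis=1)
    return (d_pos < d_neg).astype(int)

# Apply it, without any modification, to "German" sentences that land in
# the same shared space.
X_de = np.vstack([embed(1, 10), embed(0, 10)])
y_de = np.array([1] * 10 + [0] * 10)
acc = (classify(X_de) == y_de).mean()
```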
The lack of Chinese sentiment corpora limits the research progress on Chinese sentiment classification. However, there are many freely available English sentiment corpora on the Web. This paper focuses on the problem of cross-lingual sentiment classification, which leverages an available English corpus for Chinese sentiment classification by using the English corpus as training data. Machine translation services are used for eliminating the language gap between the training set and test set, and English features and Chinese features are considered as two independent views of the classification problem. We propose a co-training approach to making use of unlabeled Chinese data. Experimental results show the effectiveness of the proposed approach, which can outperform the standard inductive classifiers and the transductive classifiers.
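A minimal co-training loop over two views might look as follows; the synthetic features, the top-10 confidence selection, and the round count are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

y_lab = np.tile([0, 1], 10)     # 20 labeled documents
y_unlab = np.tile([0, 1], 100)  # 200 unlabeled documents (truth held out)

def view(y, scale=2.0):
    """One feature view: class signal plus noise."""
    return rng.normal(0, 0.5, (len(y), 2)) + scale * np.c_[y, 1 - y]

V1_lab, V2_lab = view(y_lab), view(y_lab)    # "English" / "Chinese" views
V1_un, V2_un = view(y_unlab), view(y_unlab)

X1, X2, y = V1_lab, V2_lab, y_lab
pool = np.arange(len(y_unlab))
for _ in range(3):  # a few co-training rounds
    c1 = LogisticRegression().fit(X1, y)
    c2 = LogisticRegression().fit(X2, y)
    # Each view's classifier pseudo-labels its most confident unlabeled
    # examples, which then augment the training pool of BOTH views.
    for clf, V in ((c1, V1_un), (c2, V2_un)):
        conf = clf.predict_proba(V[pool]).max(axis=1)
        top = pool[np.argsort(conf)[-10:]]   # 10 most confident examples
        X1 = np.vstack([X1, V1_un[top]])
        X2 = np.vstack([X2, V2_un[top]])
        y = np.concatenate([y, clf.predict(V[top])])
        pool = np.setdiff1d(pool, top)

final = LogisticRegression().fit(np.hstack([X1, X2]), y)
acc = final.score(np.hstack([V1_un, V2_un]), y_unlab)
```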
Ensemble methods using multiple classifiers have proven to be among the most successful approaches for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of experiments using three ensemble-based models, testing each with multiple configurations and algorithms. This includes a rigorous application of meta-classification models for NLI, achieving state-of-the-art results on several large data sets, evaluated in both intra-corpus and cross-corpus modes.
This paper presents the results of the WMT16 shared tasks, which included five machine translation (MT) tasks (standard news, IT-domain, biomedical, multimodal, pronoun), three evaluation tasks (metrics, tuning, run-time estimation of MT quality), an automatic post-editing task, and a bilingual document alignment task. This year, 102 MT systems from 24 institutions (plus 36 anonymized online systems) were submitted to the 12 translation directions in the news translation task. The IT-domain task received 31 submissions from 12 institutions in 7 directions, and the biomedical task received 15 submissions from 5 institutions. Evaluation was both automatic and manual (relative ranking and 100-point scale assessments). The quality estimation task had three sub-tasks, with a total of 14 teams submitting 39 entries. The automatic post-editing task had a total of 6 teams submitting 11 entries.
The lack of labeled data always poses challenges for tasks where machine learning is involved. Semi-supervised and cross-domain approaches represent the most common ways to overcome this difficulty. Graph-based algorithms have been widely studied during the last decade and have proved to be very effective at solving the data limitation problem. This paper explores one of the most popular state-of-the-art graph-based algorithms, label propagation, together with its modifications previously applied to sentiment classification. We study the impact of modified graph structure and parameter variations and compare the performance of graph-based algorithms in cross-domain and semi-supervised settings. The results provide a strategy for selecting the most favourable algorithm and learning paradigm on the basis of the available labeled and unlabeled data.
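A bare-bones version of label propagation, the algorithm discussed above, on a toy six-node graph (the affinities and iteration count are illustrative):

```python
import numpy as np

def label_propagation(W, y_init, labeled, n_iter=50):
    """Iteratively propagate label distributions over a similarity graph.
    W: symmetric affinity matrix; y_init: one-hot rows for labeled nodes,
    zeros elsewhere; labeled: boolean mask of nodes to clamp each step."""
    P = W / W.sum(axis=1, keepdims=True)  # row-normalised transition matrix
    F = y_init.copy()
    for _ in range(n_iter):
        F = P @ F
        F[labeled] = y_init[labeled]      # clamp the known labels
    return F.argmax(axis=1)

# Toy graph: two 3-node clusters joined by a weak edge; one labeled
# node per cluster.
W = np.array([
    [0.00, 1, 1, 0.01, 0, 0],
    [1.00, 0, 1, 0.00, 0, 0],
    [1.00, 1, 0, 0.00, 0, 0],
    [0.01, 0, 0, 0.00, 1, 1],
    [0.00, 0, 0, 1.00, 0, 1],
    [0.00, 0, 0, 1.00, 1, 0],
])
y_init = np.zeros((6, 2))
y_init[0, 0] = 1.0  # node 0 labeled class 0
y_init[3, 1] = 1.0  # node 3 labeled class 1
labeled = np.array([True, False, False, True, False, False])

labels = label_propagation(W, y_init, labeled)
```

Each unlabeled node ends up with the label of the cluster it sits in, which is exactly the behaviour that makes the method attractive when labeled data are scarce.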
For many text classification tasks, the lack of labeled data in the target domain poses a major problem. Although a classifier for a target domain can be trained on labeled text data from a related source domain, the accuracy of such a classifier is usually lower in the cross-domain setting. Recently, string kernels have obtained state-of-the-art results in various text classification tasks, such as native language identification and automatic essay scoring. Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains. In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches that further improve the results of string kernels in the cross-domain setting. By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification.
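A minimal character p-gram string kernel of the general kind used in this line of work; the p-gram range and the histogram-intersection form are illustrative assumptions:

```python
from collections import Counter

def pgrams(s, p):
    """Multiset of character p-grams of a string."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def intersection_kernel(s1, s2, p_range=(2, 3, 4)):
    """Histogram-intersection kernel over character p-grams: the more
    p-grams two texts share (with multiplicity), the larger the value."""
    k = 0
    for p in p_range:
        g1, g2 = pgrams(s1, p), pgrams(s2, p)
        k += sum(min(g1[g], g2[g]) for g in g1.keys() & g2.keys())
    return k

k_same = intersection_kernel("the movie was great", "the film was great")
k_diff = intersection_kernel("the movie was great",
                             "terrible acting all round")
```

Because the kernel works at the character level, it needs no domain-specific vocabulary, which is one intuition behind its robustness across domains.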
We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
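A rough sketch of the word-cluster pipeline described above, with k-means over class-conditional word profiles standing in for the Information Bottleneck procedure (the data, sizes, and the clustering substitute are all assumptions made for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
n_words, n_docs, n_clusters = 100, 80, 5

# Word profiles: a toy P(class | word) over 2 classes, mostly skewed
# toward one class or the other.
word_class = rng.dirichlet([0.3, 0.3], size=n_words)
clusters = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(word_class)

# Toy documents: bag-of-words counts, with each document's class taken
# from the majority leaning of its words.
counts = rng.poisson(0.5, (n_docs, n_words))
y = (counts @ word_class[:, 1] > counts @ word_class[:, 0]).astype(int)

# Compress each document from n_words dimensions to n_clusters
# dimensions: one count per word cluster.
doc_clusters = np.zeros((n_docs, n_clusters))
for c in range(n_clusters):
    doc_clusters[:, c] = counts[:, clusters == c].sum(axis=1)

# The compact word-cluster representation still supports an SVM.
svm = LinearSVC().fit(doc_clusters, y)
acc = svm.score(doc_clusters, y)
```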