Distance Metric Learning (DML) has attracted much attention in image processing in recent years. This paper analyzes its impact on the supervised fine-tuning of language models for Natural Language Processing (NLP) classification tasks under few-shot learning settings. We investigated several DML loss functions in training RoBERTa language models on well-known SentEval Transfer Tasks datasets. We also analyzed the possibility of using proxy-based DML losses during model inference. Our systematic experiments show that, under few-shot learning settings, proxy-based DML losses in particular can positively affect the fine-tuning and inference of a supervised language model. Models tuned with a combination of CCE (categorical cross-entropy loss) and ProxyAnchor Loss have, on average, the best performance and outperform models trained with CCE alone by about 3.27 percentage points, and by up to 10.38 percentage points depending on the training dataset.
This paper analyzes the influence of Distance Metric Learning (DML) loss functions on the supervised fine-tuning of language models for classification tasks. We experimented with well-known datasets from SentEval Transfer Tasks. Our experiments show that applying a DML loss function can increase the performance of RoBERTa-large models on downstream classification tasks in few-shot scenarios. Models fine-tuned with SoftTriple loss can achieve better results than models with a standard categorical cross-entropy loss function by about 2.89 percentage points, with gains ranging from 0.04 to 13.48 percentage points depending on the training dataset. Additionally, we conducted a comprehensive analysis with explainability techniques to assess the models' reliability and explain their results.
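We introduce a new loss function, TripleEntropy, based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general-knowledge pre-trained language models. This loss function can improve a strong RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02% to 2.29%. Thorough tests on popular datasets indicate a steady gain. The fewer samples in the training dataset, the higher the gain: for small datasets it is 0.78%, for medium-sized 0.86%, for large 0.20%, and for extra-large 0.04%.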
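Recently, Supervised Contrastive Learning (SCL) has been shown to achieve excellent performance in most classification tasks. In SCL, a neural network is trained to optimize two objectives: pull an anchor and positive samples together in the embedding space, and push the anchor apart from the negatives. However, these two objectives may conflict, requiring trade-offs between them during optimization. In this work, we formulate the SCL problem as a multi-objective optimization problem for the fine-tuning phase of the RoBERTa language model. Two methods are used to solve the optimization problem: (i) the linear scalarization (LS) method, which minimizes a weighted linear combination of per-task losses; and (ii) the Exact Pareto Optimal (EPO) method, which finds the intersection of the Pareto front with a given preference vector. We evaluate our approach on several GLUE benchmark tasks without using data augmentation, memory banks, or generated adversarial examples. The empirical results show that the proposed learning strategy significantly outperforms a strong, competitive contrastive learning baseline.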
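Contrastive learning techniques have been widely used in computer vision as a means of augmenting datasets. In this paper, we extend the use of these contrastive learning embeddings to sentiment analysis tasks and demonstrate that fine-tuning on these embeddings provides an improvement over fine-tuning on BERT-based embeddings, achieving higher benchmarks when evaluated on the DynaSent dataset. We also explore how our fine-tuned models perform on cross-domain benchmark datasets. Additionally, we explore upsampling techniques to achieve a more balanced class distribution and further improve our benchmark tasks.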
Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state-of-the-art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions, and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement and reference TensorFlow code is released at https://t.ly/supcon.
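To make the pull-together / push-apart idea concrete, the following is a minimal pure-Python sketch of one common way to write a supervised contrastive loss, in which the log-probabilities of an anchor's positives are averaged outside the logarithm. It is not the released TensorFlow reference code; the function name `supcon_loss`, the default temperature of 0.1, and the toy inputs are assumptions made for illustration only.

```python
import math

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have a positive.

    Illustrative sketch: assumes non-zero embedding vectors and at least
    two samples in the batch.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        normed.append([x / norm for x in v])

    total, n_anchors = 0.0, 0
    for i, zi in enumerate(normed):
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(zi, zj)) / temperature
            for j, zj in enumerate(normed)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Negative mean log-probability of the positives for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0
```

With embeddings that already form two tight clusters, labels that agree with the clusters should give a much lower loss than labels that cut across them, which is the behaviour the abstract describes.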
BERT (Devlin et al., 2018) and RoBERTa have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, both require that the two sentences are fed into the network together, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.
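The "about 50 million" figure follows from counting unordered pairs: a model that scores two sentences jointly needs one forward pass per pair, whereas with sentence embeddings each sentence is encoded once and pairs are compared with a cheap cosine similarity. The sketch below illustrates both points in plain Python. The helper names and the toy vectors are hypothetical stand-ins and are not real SBERT outputs.

```python
import math

def num_pairs(n):
    """Number of unordered sentence pairs among n sentences."""
    return n * (n - 1) // 2

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_pair(embeddings):
    """Return (i, j, score) for the pair of precomputed embeddings
    with the highest cosine similarity."""
    best = None
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            score = cosine_similarity(embeddings[i], embeddings[j])
            if best is None or score > best[2]:
                best = (i, j, score)
    return best

print(num_pairs(10_000))  # 49995000, i.e. roughly 50 million
```

The pairwise loop is still quadratic in the number of sentences, but each step is a small vector operation rather than a full transformer forward pass, which is where the reported speed-up comes from.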
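Contrastive learning has emerged as a powerful representation learning method and facilitates various downstream tasks, especially when supervised data is limited. How to construct effective contrastive samples through data augmentation is key to its success. Unlike in vision tasks, data augmentation methods for contrastive learning have not been investigated sufficiently in language tasks. In this paper, we propose a novel approach to constructing contrastive samples for language tasks using text summarization. We use these samples for supervised contrastive learning to obtain better text representations, which greatly benefit text classification tasks with limited annotations. To further improve the method, we mix up samples from different classes and add an extra regularization, named MIXSUM, in addition to the cross-entropy loss. Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News, and IMDB) demonstrate the effectiveness of the proposed contrastive learning framework with summarization-based data augmentation and MIXSUM regularization.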
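We provide the first exploration of sentence embeddings from text-to-text transformers (T5). Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks cast as sequence-to-sequence mapping problems, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods for extracting T5 sentence embeddings: two utilize only the T5 encoder and one uses the full T5 encoder-decoder model. To support our investigation, we establish a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark. Our encoder-only models outperform Sentence-BERT and SimCSE sentence embeddings on both SentEval and SentGLUE transfer tasks, including semantic textual similarity (STS). Scaling up T5 from millions to billions of parameters is found to produce consistent further improvements. Finally, our encoder-decoder method achieves a new state of the art on STS when using sentence embeddings. Our models are released at https://tfhub.dev/google/collections/sentence-t5/1.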
Cross-entropy loss has served as the main objective function for classification-based tasks. Widely deployed for learning neural network classifiers, it offers both effectiveness and a probabilistic interpretation. Recently, after the success of self-supervised contrastive representation learning methods, supervised contrastive methods have been proposed to learn representations and have shown superior and more robust performance compared to training solely with cross-entropy loss. However, cross-entropy loss is still needed to train the final classification layer. In this work, we investigate the possibility of learning both the representation and the classifier using one objective function that combines the robustness of contrastive learning and the probabilistic interpretation of cross-entropy loss. First, we revisit a previously proposed contrastive-based objective function that approximates cross-entropy loss and present a simple extension to learn the classifier jointly. Second, we propose a new version of supervised contrastive training that jointly learns the parameters of the classifier and the backbone of the network. We empirically show that our proposed objective functions yield a significant improvement over the standard cross-entropy loss, with more training stability and robustness in various challenging settings.
We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship: an anchor point x is similar to a set of positive points Y, and dissimilar to a set of negative points Z, and a loss defined over these distances is minimized. While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc. Even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy loss improves on state-of-the-art results for three standard zero-shot learning datasets, by up to 15 percentage points, while converging three times as fast as other triplet-based losses.
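As a rough illustration of how a proxy replaces sampled positives and negatives, the sketch below scores a single anchor against one proxy per class using an NCA-style objective: the distance to the anchor's own class proxy is pushed down while the distances to all other proxies are pushed up. Choosing the NCA form, the squared Euclidean distance, and fixed (rather than learned) proxies are all assumptions made for this example, and `proxy_nca_loss` is a hypothetical name, not the authors' code.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def proxy_nca_loss(embedding, label, proxies):
    """NCA-style loss for one anchor against one proxy per class.

    proxies maps each class label to its proxy vector. In real training the
    proxies would be learnable parameters; here they are fixed toy values.
    Requires at least two classes so that a negative proxy exists.
    """
    d_pos = squared_distance(embedding, proxies[label])
    neg_logits = [
        -squared_distance(embedding, proxy)
        for cls, proxy in proxies.items()
        if cls != label
    ]
    # Stable log-sum-exp over the negative proxies.
    m = max(neg_logits)
    log_neg = m + math.log(sum(math.exp(s - m) for s in neg_logits))
    # Equals -log( exp(-d_pos) / sum_over_negatives exp(-d_neg) ).
    return d_pos + log_neg
```

Because every anchor is compared only with the small, fixed set of proxies, no triplet mining is needed, which matches the motivation given in the abstract. Note that this particular form can take negative values, since the positive proxy is excluded from the denominator.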
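Patronizing and condescending language (PCL) has a large harmful impact and is difficult to detect, both for human judges and for existing NLP systems. At SemEval-2022 Task 4, we propose a novel Transformer-based model and its ensembles to accurately understand such language context for PCL detection. To facilitate comprehension of the subtle and subjective nature of PCL, two fine-tuning strategies are applied to capture discriminative features from diverse linguistic behaviour and categorical distribution. The system achieves remarkable results on the official ranking, including 1st and 5th place in the subtasks.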
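Product matching is a fundamental step for the global understanding of consumer behavior in e-commerce. In practice, product matching refers to the task of deciding whether two product offers from different data sources (e.g., retailers) represent the same product. Standard pipelines use a previous stage called blocking, where for a given product offer a set of potential matching candidates is retrieved based on similar characteristics (e.g., same brand, category, flavor, etc.). Among these similar product candidates, those that are not a match can be considered hard negatives. We present Block-SCL, a strategy that uses the blocking output to make the most of supervised contrastive learning (SCL). Concretely, Block-SCL builds enriched batches using the hard-negative samples obtained in the blocking stage. These batches provide a strong training signal, leading the model to learn more meaningful sentence embeddings for product matching. Experimental results on several public datasets demonstrate that Block-SCL achieves state-of-the-art results despite using only short product titles as input, no data augmentation, and a lighter transformer backbone than competing methods.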
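Learning novel classes from very few labeled samples has attracted increasing attention in the machine learning area. Recent research on either meta-learning-based or transfer-learning-based paradigms demonstrates that gaining information on a good feature space can be an effective solution for achieving favorable performance on few-shot tasks. In this paper, we propose a simple but effective paradigm that decouples the tasks of learning feature representations and classifiers, and learns only the feature embedding architecture from base classes via the typical transfer-learning training strategy. To maintain both the generalization ability across base and novel classes and the discrimination ability within each class, we propose a dual-path feature learning scheme that effectively combines structural similarity with contrastive feature construction. In this way, inner-class alignment and inter-class uniformity can be well balanced, resulting in improved performance. Experiments on three popular benchmarks show that, when combined with a simple prototype-based classifier, our method can still achieve promising results for both standard and generalized few-shot problems in either an inductive or transductive inference setting.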
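For readers unfamiliar with prototype-based classifiers, the sketch below shows the usual nearest-class-mean idea in plain Python: each class is represented by the mean of its few labeled support embeddings, and a query is assigned to the class with the closest prototype. The abstract does not specify the classifier's details, so the squared Euclidean distance, the function names, and the toy vectors are assumptions for illustration only.

```python
def class_prototypes(support_embeddings, support_labels):
    """Mean embedding per class, computed from the labeled support set."""
    sums, counts = {}, {}
    for vec, label in zip(support_embeddings, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(query, prototypes):
    """Assign the query to the class whose prototype is nearest
    (squared Euclidean distance)."""
    return min(
        prototypes,
        key=lambda c: sum((q - p) ** 2 for q, p in zip(query, prototypes[c])),
    )

# Toy 2-way, 2-shot episode with hypothetical embeddings.
support = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = ["a", "a", "b", "b"]
protos = class_prototypes(support, labels)
```

Because the classifier has no trainable parameters of its own, its accuracy depends entirely on the quality of the embedding space, which is why such a classifier is a natural probe for a decoupled feature-learning scheme.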
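We present MetricBERT, a BERT-based model that learns to embed text under a well-defined similarity metric while simultaneously adhering to the "traditional" masked-language task. We focus on downstream tasks of learning similarities, where we show that MetricBERT outperforms state-of-the-art alternatives, sometimes by a substantial margin. We conduct extensive evaluations of our method and its different variants, showing that our training objective is highly beneficial over a traditional contrastive loss, a standard cosine similarity objective, and six other baselines. As an additional contribution, we publish a dataset of video game descriptions along with a set of similarity annotations crafted by a domain expert.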
Deep Metric Learning (DML) learns a non-linear semantic embedding from input data that brings similar pairs together while keeping dissimilar data away from each other. To this end, many different methods have been proposed in the last decade, with promising results in various applications. The success of a DML algorithm greatly depends on its loss function. However, no loss function is perfect, and each deals only with some aspects of an optimal similarity embedding. Besides, the generalizability of DML to unseen categories during the test stage is an important matter that is not considered by existing loss functions. To address these challenges, we propose novel approaches to combine different losses built on top of a shared deep feature extractor. The proposed ensemble of losses forces the deep model to extract features that are consistent with all losses. Since the selected losses are diverse and each emphasizes different aspects of an optimal semantic embedding, our effective combining methods yield a considerable improvement over any individual loss and generalize well on unseen categories. There is no limitation in choosing loss functions, and our methods can work with any set of existing ones. Besides, they can optimize each loss function as well as its weight in an end-to-end paradigm with no need to adjust any hyper-parameter. We evaluate our methods on some popular datasets from the machine vision domain in conventional Zero-Shot-Learning (ZSL) settings. The results are very encouraging and show that our methods outperform all baseline losses by a large margin on all datasets.
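Self-supervised learning methods, such as contrastive learning, have attracted great attention in natural language processing. They use pairs of training data augmentations to build a classification task for an encoder with good representation ability. However, the construction of learning pairs for contrastive learning is much harder in NLP tasks. Previous works generate word-level changes to form pairs, but small transformations may cause notable changes in the meaning of sentences owing to the discrete and sparse nature of natural language. In this paper, adversarial training is performed to generate challenging and harder adversarial examples over the embedding space of NLP as learning pairs. Using contrastive learning improves the generalization ability of adversarial training because the contrastive loss can make the sample distribution more uniform. At the same time, adversarial training also enhances the robustness of contrastive learning. Two novel frameworks, supervised contrastive adversarial learning (SCAL) and unsupervised SCAL (USCAL), are proposed, which yield learning pairs by utilizing adversarial training for contrastive learning. The label-based loss of supervised tasks is exploited to generate adversarial examples, while unsupervised tasks bring the contrastive loss. To validate the effectiveness of the proposed framework, we apply it to Transformer-based models for natural language understanding, sentence semantic textual similarity, and adversarial learning tasks. Experimental results on GLUE benchmark tasks show that our fine-tuned supervised method outperforms BERT$_{base}$ by over 1.75%. We also evaluate our unsupervised method on semantic textual similarity (STS) tasks, where it obtains 77.29% with BERT$_{base}$. In terms of robustness, our approach achieves state-of-the-art results on multiple adversarial datasets for NLI tasks.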
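Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. A sufficient amount of annotated data is available for high-resource languages such as English or Chinese. In less popular languages, multilingual models have to be used, which offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks. We evaluate our method on eight linguistic tasks in Polish, comparing it with the best available multilingual sentence encoders.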
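Deep Metric Learning (DML) serves to learn an embedding function that projects semantically similar data into nearby regions of the embedding space, and plays a vital role in many applications such as image retrieval and face recognition. However, the performance of DML methods often depends heavily on the sampling method used to choose effective data from the embedding space during training. In practice, the embeddings are obtained by some deep model, and the embedding space often contains barren areas due to the absence of training points, resulting in the so-called "missing embedding" issue. This issue may impair sample quality, which leads to degraded DML performance. In this work, we investigate how to alleviate the "missing embedding" issue to improve sampling quality and achieve effective DML. To this end, we propose a Densely-Anchored Sampling (DAS) scheme that treats each embedding with a corresponding data point as an "anchor" and exploits the anchor's nearby embedding space to densely produce embeddings without data points. Specifically, we propose to exploit the embedding space around a single anchor with Discriminative Feature Scaling (DFS) and around multiple anchors with Memorized Transformation Shifting (MTS). In this way, by combining the embeddings with and without data points, we are able to provide more embeddings to facilitate the sampling process, thus boosting the performance of DML. Our method is effortlessly integrated into existing DML frameworks and improves them without bells and whistles. Extensive experiments on three benchmark datasets demonstrate the superiority of our method.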
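Existing pre-trained models have achieved state-of-the-art performance on various text classification tasks. These models have proven useful in learning universal language representations. However, advanced pre-trained models cannot effectively distinguish the semantic discrepancy between similar texts, which has a great influence on the performance of hard-to-distinguish classes. To address this problem, we propose a novel Contrastive Learning with Label Distance (CLLD) in this work. Inspired by recent advances in contrastive learning, we specifically design a classification method with label distance for learning contrastive classes. CLLD ensures flexibility within the subtle differences that lead to different label assignments, and at the same time generates distinct representations for classes that are similar to each other. Extensive experiments on public benchmarks and internal datasets demonstrate that our method improves the performance of pre-trained models on classification tasks. Importantly, our experiments suggest that the learned label distance alleviates the adversarial nature between classes.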