Distantly-Supervised Named Entity Recognition (DS-NER) alleviates the data scarcity problem in NER by automatically generating training samples. Unfortunately, distant supervision may induce noisy labels, undermining the robustness of the learned models and restricting their practical application. To relieve this problem, recent works adopt self-training teacher-student frameworks that gradually refine the training labels and improve the generalization ability of NER models. However, we argue that the performance of current self-training frameworks for DS-NER is severely held back by their plain designs, namely inadequate student learning and coarse-grained teacher updating. In this paper, we make the first attempt to alleviate these issues by proposing: (1) adaptive teacher learning, which jointly trains two teacher-student networks and considers both consistent and inconsistent predictions between the two teachers, thus promoting comprehensive student learning; and (2) a fine-grained student ensemble, which updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, enhancing consistent predictions on each model fragment against noise. To verify the effectiveness of our method, we conduct experiments on four DS-NER datasets. The results demonstrate that our method significantly surpasses previous SOTA methods.
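The fragment-wise teacher update described in this abstract can be illustrated with a minimal sketch. It assumes that a "fragment" is a named group of parameters and that each fragment has its own momentum; the abstract does not say how fragments are delimited or how the averaging rate is chosen, so the names `ema_update_fragments`, `encoder`, `classifier`, and the momentum values are hypothetical.

```python
from typing import Dict, List

# Assumption: a model is a dict mapping a fragment name to a flat list of weights.
Params = Dict[str, List[float]]


def ema_update_fragments(teacher: Params, student: Params,
                         momentum: Dict[str, float]) -> Params:
    """Return a new teacher in which every fragment is a moving average
    of the old teacher fragment and the matching student fragment."""
    updated: Params = {}
    for name, t_weights in teacher.items():
        m = momentum[name]  # hypothetical per-fragment momentum
        s_weights = student[name]
        updated[name] = [m * t + (1.0 - m) * s
                         for t, s in zip(t_weights, s_weights)]
    return updated


# Toy values chosen only for illustration.
teacher = {"encoder": [1.0, 1.0], "classifier": [0.0]}
student = {"encoder": [0.0, 2.0], "classifier": [1.0]}
momentum = {"encoder": 0.5, "classifier": 0.75}

new_teacher = ema_update_fragments(teacher, student, momentum)
print(new_teacher)
```

Giving each fragment its own momentum is one plausible reading of "fine-grained"; a single shared momentum would reduce this sketch to the usual mean-teacher update.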
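The lack of labeled data is a major obstacle to relation extraction. Semi-supervised relation extraction (SSRE), which annotates unlabeled samples as additional training data, has proven to be a promising approach. Along this line, almost all prior studies adopt multiple models and make the annotations more reliable by taking the intersection set of the predictions from these models. However, the difference set, which contains rich information about the unlabeled data, has long been neglected by prior studies. In this paper, we propose to learn not only from the consensus but also from the disagreement among different models in SSRE. To this end, we develop a simple and general multi-teacher distillation (MTD) framework that can be easily integrated into any existing SSRE method. Specifically, we first let the teachers correspond to the multiple models and select the samples in the intersection set from the last iteration of the SSRE method to augment the labeled data as usual. We then transfer the class distributions of the samples in the difference set as soft labels to guide the student. Finally, we use the trained student model for prediction. Experimental results on two public datasets show that our framework significantly improves the performance of the base SSRE methods at a fairly low computational cost.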
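The split between the intersection set and the difference set can be sketched as follows. This is a minimal illustration, not the MTD implementation: it assumes that teachers "agree" when their top-1 classes match and that the soft label for a disagreement sample is the plain average of the teacher distributions. Both choices, and the name `split_by_agreement`, are assumptions.

```python
from typing import Dict, List, Tuple

Dist = List[float]  # a class probability distribution


def argmax(dist: Dist) -> int:
    return max(range(len(dist)), key=dist.__getitem__)


def split_by_agreement(
    teacher_preds: List[List[Dist]],
) -> Tuple[Dict[int, int], Dict[int, Dist]]:
    """teacher_preds[t][i] is teacher t's distribution for unlabeled sample i.

    Returns (hard, soft): hard labels for samples on which all teachers agree,
    and averaged soft labels for samples on which they disagree."""
    hard: Dict[int, int] = {}
    soft: Dict[int, Dist] = {}
    n_teachers = len(teacher_preds)
    for i in range(len(teacher_preds[0])):
        dists = [preds[i] for preds in teacher_preds]
        labels = {argmax(d) for d in dists}
        if len(labels) == 1:
            # Consensus: use the shared prediction as a pseudo-label.
            hard[i] = labels.pop()
        else:
            # Disagreement: keep the class distribution as a soft label.
            soft[i] = [sum(d[c] for d in dists) / n_teachers
                       for c in range(len(dists[0]))]
    return hard, soft


teacher_a = [[0.9, 0.1], [0.75, 0.25]]
teacher_b = [[0.8, 0.2], [0.0, 1.0]]
hard, soft = split_by_agreement([teacher_a, teacher_b])
print(hard, soft)
```

In a full pipeline the hard-labeled samples would be added to the labeled set, while the soft labels would feed a distillation loss for the student.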
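Named entity recognition (NER) is an important task in natural language processing. However, traditional supervised NER requires large-scale annotated datasets. Distant supervision has been proposed to alleviate the heavy demand for annotated data, but datasets constructed in this way are extremely noisy and suffer from a serious unlabeled entity problem. The cross-entropy (CE) loss function is highly sensitive to unlabeled data, which leads to severe performance degradation. As an alternative, we propose a new loss function called NRCES to cope with this problem. A sigmoid term is used to mitigate the negative impact of noise. In addition, we balance the convergence and noise tolerance of the model according to the samples and the training process. Experiments on synthetic and real-world datasets show that our approach is strongly robust in the presence of severe unlabeled entity problems, achieving new state-of-the-art results on real-world datasets.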
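Consistency regularization has been widely studied in recent semi-supervised semantic segmentation methods. Benefiting from image, feature, and network perturbations, these methods have achieved remarkable performance. To make full use of these perturbations, in this work we propose a new consistency regularization framework called mutual knowledge distillation (MKD). We introduce two auxiliary mean-teacher models on top of the consistency regularization method. More specifically, we use the pseudo-labels generated by one mean teacher to supervise the other student network, achieving mutual knowledge distillation between the two branches. In addition to image-level strong and weak augmentation, we employ feature augmentation that considers implicit semantic distributions to add further perturbations to the students. The proposed framework significantly increases the diversity of the training samples. Extensive experiments on public benchmarks show that our framework outperforms previous state-of-the-art (SOTA) methods under various semi-supervised settings.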
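In addition to standard supervised learning with hard labels, auxiliary losses are often used in many supervised learning settings to improve the generalization of the model. For example, knowledge distillation adds a second, teacher-imitation loss to model training, where the teacher may be a pretrained model that outputs a richer distribution than the labels. Similarly, in settings with limited labeled data, weak labeling information is used in the form of labeling functions. Auxiliary losses are introduced here to cope with labeling functions that may be noisy, rule-based approximations of the true labels. We address the problem of learning to combine these losses in a principled manner. We introduce AMAL, which uses meta-learning on a validation metric to learn instance-specific weights for an optimal mixing of the losses. Experiments in a number of knowledge distillation and rule-denoising domains show that AMAL provides noticeable gains over competitive baselines in those domains. We empirically analyze our method and share insights into the mechanisms through which it provides performance gains.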
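Knowledge distillation was originally introduced to utilize additional supervision from a single teacher model for training the student model. To boost student performance, some recent variants attempt to exploit diverse knowledge sources from multiple teachers. However, existing studies mainly integrate knowledge from multiple sources by averaging the teacher predictions or combining them with other label-free strategies, which may mislead the student in the presence of low-quality teacher predictions. To tackle this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns a sample-wise reliability to each teacher prediction with the help of ground-truth labels, so that teacher predictions close to the one-hot labels are assigned large weights. In addition, CA-MKD incorporates intermediate layers to further improve student performance. Extensive experiments show that CA-MKD consistently outperforms all state-of-the-art methods across various teacher-student architectures.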
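A minimal sketch of sample-wise teacher weighting follows. The abstract only states that teacher predictions close to the one-hot label receive large weights; the specific rule used here (a softmax over the negative cross-entropy of each teacher against the ground-truth label) is an assumption chosen for illustration and is not claimed to be the CA-MKD formula. The function names are hypothetical.

```python
import math
from typing import List

Dist = List[float]


def cross_entropy(probs: Dist, label: int) -> float:
    # Clamp to avoid log(0) for a teacher that assigns zero mass to the label.
    return -math.log(max(probs[label], 1e-12))


def confidence_weights(teacher_probs: List[Dist], label: int) -> List[float]:
    """Assumed rule: weight_k is proportional to exp(-CE(teacher_k, label)),
    so teachers closer to the ground truth get larger weights."""
    scores = [math.exp(-cross_entropy(p, label)) for p in teacher_probs]
    total = sum(scores)
    return [s / total for s in scores]


def weighted_soft_target(teacher_probs: List[Dist],
                         weights: List[float]) -> Dist:
    """Combine the teacher distributions into a single distillation target."""
    n_classes = len(teacher_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, teacher_probs))
            for c in range(n_classes)]


# One sample with ground-truth class 0: the first teacher is right,
# the second is wrong and should therefore count for less.
teachers = [[0.9, 0.1], [0.3, 0.7]]
weights = confidence_weights(teachers, label=0)
target = weighted_soft_target(teachers, weights)
print(weights, target)
```

Compared with uniform averaging, which would give the target `[0.6, 0.4]` here, the weighted target moves toward the teacher that agrees with the label.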
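As many fine-tuned pre-trained language models (PLMs) with promising performance are generously released, investigating better ways to reuse these models is vital, since it can greatly reduce the computational cost of retraining and the potential environmental side effects. In this paper, we explore a novel model reuse paradigm, Knowledge Amalgamation (KA). Without human annotations, KA aims to merge the knowledge of different teacher models, each specializing in a different classification problem, into a versatile student model. To achieve this, we design the Model Uncertainty-aware Knowledge Amalgamation (MUKA) framework, which uses Monte-Carlo Dropout to identify potentially adequate teachers and to approximate the golden supervision that guides the student. Experimental results show that MUKA achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKA generalizes well to complex settings with multiple teacher models, heterogeneous teachers, and even cross-dataset teachers.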
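The role of Monte-Carlo Dropout in choosing a teacher can be sketched on a toy linear classifier. Everything below is an assumption made for illustration: dropout is applied to the input features, uncertainty is measured as the entropy of the averaged prediction, and the "adequate" teacher is simply the one with the lowest entropy. The abstract does not specify MUKA's actual uncertainty measure or selection rule.

```python
import math
import random
from typing import List

Matrix = List[List[float]]  # one row of weights per class


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_mean(weights: Matrix, x: List[float], n_passes: int,
                    drop_prob: float, rng: random.Random) -> List[float]:
    """Average class probabilities over stochastic passes with input dropout."""
    mean = [0.0] * len(weights)
    keep_scale = 1.0 / (1.0 - drop_prob)
    for _ in range(n_passes):
        # Fresh dropout mask for every pass (inverted-dropout scaling).
        mask = [0.0 if rng.random() < drop_prob else keep_scale for _ in x]
        logits = [sum(w * xi * m for w, xi, m in zip(row, x, mask))
                  for row in weights]
        mean = [acc + p / n_passes for acc, p in zip(mean, softmax(logits))]
    return mean


def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_teacher(teachers: List[Matrix], x: List[float], n_passes: int = 20,
                 drop_prob: float = 0.2, seed: int = 0) -> int:
    """Return the index of the teacher with the lowest predictive entropy."""
    rng = random.Random(seed)
    scores = [entropy(mc_dropout_mean(w, x, n_passes, drop_prob, rng))
              for w in teachers]
    return min(range(len(scores)), key=scores.__getitem__)


unsure_teacher = [[0.0, 0.0], [0.0, 0.0]]     # always predicts uniformly
confident_teacher = [[4.0, 0.0], [0.0, 4.0]]  # responds strongly to feature 0
chosen = pick_teacher([unsure_teacher, confident_teacher], x=[1.0, 0.0])
print(chosen)
```

In a real setting the stochastic passes would come from a neural teacher with its dropout layers kept active at inference time.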
Information Extraction (IE) aims to extract structured information from heterogeneous sources. IE from natural language texts includes sub-tasks such as Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE). Most IE systems require a comprehensive understanding of sentence structure, implied semantics, and domain knowledge to perform well; thus, IE tasks always need adequate external resources and annotations. However, obtaining more human annotations takes time and effort. Low-Resource Information Extraction (LRIE) strives to use unsupervised data, reducing the required resources and human annotation. In practice, existing systems either use self-training schemes to generate pseudo labels, which causes the gradual drift problem, or leverage consistency regularization methods, which inevitably suffer from confirmation bias. To alleviate the confirmation bias caused by the lack of feedback loops in existing LRIE learning paradigms, we develop a Gradient Imitation Reinforcement Learning (GIRL) method that encourages pseudo-labeled data to imitate the gradient descent direction on labeled data, so that pseudo-labeled data achieve optimization capabilities similar to those of labeled data. Based on how well the pseudo-labeled data imitate the instructive gradient descent direction obtained from labeled data, we design a reward to quantify the imitation process and bootstrap the optimization capability of pseudo-labeled data through trial and error. Beyond the learning paradigm, GIRL is not limited to specific sub-tasks, and we leverage GIRL to solve all IE sub-tasks (named entity recognition, relation extraction, and event extraction) in low-resource settings (semi-supervised IE and few-shot IE).
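Recently, many semi-supervised object detection (SSOD) methods have adopted the teacher-student framework and achieved state-of-the-art results. However, the teacher network is tightly coupled with the student network, since the teacher is an exponential moving average (EMA) of the student, which causes a performance bottleneck. To address the coupling problem, we propose a Cycle Self-Training (CST) framework for SSOD, which consists of two teachers T1 and T2 and two students S1 and S2. Based on these networks, a cycle self-training mechanism is built, i.e., S1 $\rightarrow$ T1 $\rightarrow$ S2 $\rightarrow$ T2 $\rightarrow$ S1. For S $\rightarrow$ T, we also use the EMA weights of the student to update the teacher. For T $\rightarrow$ S, instead of directly supervising its own student S1 (S2), the teacher T1 (T2) generates pseudo-labels for the student S2 (S1), which loosens the coupling effect. Moreover, owing to the property of EMA, the teacher is likely to accumulate the biases of the student and make its mistakes irreversible. To mitigate this problem, we also propose a distribution consistency reweighting strategy, in which pseudo-labels are reweighted according to the distribution consistency between teachers T1 and T2. With this strategy, the two students S2 and S1 can be trained robustly with noisy pseudo-labels and avoid confirmation bias. Extensive experiments demonstrate the superiority of CST, which consistently improves AP over the baseline and outperforms state-of-the-art methods by 2.1% absolute AP with scarce labeled data.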
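In the context of multi-modal knowledge distillation research, existing methods mainly focus on learning only the final output of the teacher. As a result, deep differences remain between the teacher network and the student network. It is necessary to force the student network to learn the modality relationship information of the teacher network. To effectively exploit the knowledge transferred from the teacher to the student, a novel modality relation distillation paradigm is adopted that models the relationship information among different modalities, that is, it learns the teacher's modality-level Gram matrix.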
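A modality-level Gram matrix can be sketched in a few lines. The sketch assumes that each modality is summarized by one feature vector, that the Gram entry (i, j) is the inner product of the features of modalities i and j, and that the student is trained with a mean squared difference between the two Gram matrices. The loss form and the function names are assumptions, since the abstract only names the Gram matrix as the learning target.

```python
from typing import List

Vector = List[float]


def gram_matrix(features: List[Vector]) -> List[List[float]]:
    """features[m] is the feature vector of modality m.
    Entry (i, j) is the inner product between modalities i and j."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def gram_distillation_loss(teacher_feats: List[Vector],
                           student_feats: List[Vector]) -> float:
    """Assumed loss: mean squared difference between the Gram matrices."""
    g_teacher = gram_matrix(teacher_feats)
    g_student = gram_matrix(student_feats)
    n = len(g_teacher)
    return sum((g_teacher[i][j] - g_student[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


# Two modalities; the teacher uses 3-d features and the student 2-d features.
teacher_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # orthogonal modalities
student_feats = [[1.0, 0.0], [1.0, 0.0]]            # collapsed modalities
loss = gram_distillation_loss(teacher_feats, student_feats)
print(loss)
```

Because the Gram matrix has one row and column per modality, the teacher and student feature widths do not need to match, which is convenient when the student is much smaller than the teacher.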
AI-powered medical imaging has recently attracted enormous attention due to its ability to provide fast-paced healthcare diagnoses. However, it usually suffers from a lack of high-quality datasets due to high annotation cost, inter-observer variability, human annotator error, and errors in computer-generated labels. Deep learning models trained on noisily labelled datasets are sensitive to the noise type and generalize poorly to unseen samples. To address this challenge, we propose a Robust Stochastic Knowledge Distillation (RoS-KD) framework, which mimics the notion of learning a topic from multiple sources to deter the learning of noisy information. More specifically, RoS-KD learns a smooth, well-informed, and robust student manifold by distilling knowledge from multiple teachers trained on overlapping subsets of the training data. Our extensive experiments on popular medical imaging classification tasks (cardiopulmonary disease and lesion classification) using real-world datasets show the performance benefit of RoS-KD, its ability to distill knowledge from many popular large networks (ResNet-50, DenseNet-121, MobileNet-V2) into a comparatively small network, and its robustness to adversarial attacks (PGD, FGSM). In particular, RoS-KD achieves >2% and >4% improvement in F1-score on the lesion classification and cardiopulmonary disease classification tasks, respectively, when the underlying student is ResNet-18, compared with a recent competitive knowledge distillation baseline. Additionally, on the cardiopulmonary disease classification task, RoS-KD outperforms most of the SOTA baselines by a ~1% gain in AUC score.
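Although image-text retrieval (ITR) equipped with vision-and-language pretraining (VLP) has made remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios, where high latency is unacceptable. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face the following two challenges: 1) because GPU memory is limited and too many negative samples must be optimized when handling cross-modal fusion features, existing approaches are difficult to apply directly to cross-modal tasks; 2) it is inefficient to statically optimize the student network from different hard samples, which have different effects on distillation learning and student network optimization. We try to overcome these challenges in two ways. First, to achieve multi-modal contrastive learning and balance the training cost and effect, we propose to use the teacher network to estimate the difficult samples for the student, so that the student absorbs the powerful knowledge of the pre-trained teacher and masters the knowledge from hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation, which dynamically learns samples of different difficulties so as to better balance the difficulty of the knowledge and the self-learning ability of the student. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129$\times$ compared with existing ITR models.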
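Transfer learning from large-scale pre-trained models has become essential for many computer vision tasks. Recent studies have shown that datasets like ImageNet are weakly labeled, since images containing multiple object classes are assigned a single label. This ambiguity biases models toward a single prediction, which could result in the suppression of classes that tend to co-occur in the data. Inspired by the language emergence literature, we propose multi-label iterated learning (MILe), which uses the framework of iterated learning to incorporate the inductive biases of multi-label learning from single labels. MILe is a simple yet effective procedure that builds a multi-label description of the image by propagating binary predictions through successive generations of teacher and student networks with a learning bottleneck. Experiments show that our approach yields systematic benefits on ImageNet accuracy as well as ReaL F1 score, which indicates that MILe deals with label ambiguity better than the standard training procedure, even when fine-tuning from self-supervised weights. We also show that MILe is effective at reducing label noise, achieving state-of-the-art performance on real-world large-scale noisy data such as WebVision. Furthermore, MILe improves performance in class-incremental settings such as IIRC, and it is robust to distribution shifts. Code: https://github.com/rajeswar18/mile.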
Although continually extending an existing NMT model to new domains or languages has attracted intensive interest in recent years, the equally valuable problem of continually improving a given NMT model in its own domain by leveraging knowledge from an unlimited number of existing NMT models has not yet been explored. To facilitate the study, we propose a formal definition of the problem, named knowledge accumulation for NMT (KA-NMT), with corresponding datasets and evaluation metrics, and develop a novel method for KA-NMT. We investigate a novel knowledge detection algorithm to identify beneficial knowledge from existing models at the token level, and propose to learn from beneficial knowledge and learn against other knowledge simultaneously to improve learning efficiency. To alleviate catastrophic forgetting, we further propose to transfer knowledge from the previous version of the given model to the current one. Extensive experiments show that our proposed method significantly and consistently outperforms representative baselines under homogeneous, heterogeneous, and malicious model settings for different language pairs.
Zero-shot cross-lingual named entity recognition (NER) aims at transferring knowledge from annotated, resource-rich data in source languages to unlabeled, resource-lean data in target languages. Existing mainstream methods based on the teacher-student distillation framework ignore the rich and complementary information in the intermediate layers of pre-trained language models, and domain-invariant information is easily lost during transfer. In this study, a mixture of short-channel distillers (MSD) method is proposed to let the rich hierarchical information in the teacher model fully interact and to transfer knowledge to the student model sufficiently and efficiently. Concretely, a multi-channel distillation framework is designed for sufficient information transfer by aggregating multiple distillers as a mixture. In addition, an unsupervised method adopting parallel domain adaptation is proposed to shorten the channels between the teacher and student models and preserve domain-invariant features. Experiments on four datasets across nine languages demonstrate that the proposed method achieves new state-of-the-art performance on zero-shot cross-lingual NER and shows great generalization and compatibility across languages and fields.
To improve the performance of the dual-encoder retriever, one effective approach is knowledge distillation from the cross-encoder ranker. Existing works construct the candidate passages following the supervised learning setting where a query is paired with a positive passage and a batch of negatives. However, through empirical observation, we find that even the hard negatives from advanced methods are still too trivial for the teacher to distinguish, preventing the teacher from transferring abundant dark knowledge to the student through its soft label. To alleviate this issue, we propose ADAM, a knowledge distillation framework that can better transfer the dark knowledge held in the teacher with Adaptive Dark exAMples. Different from previous works that only rely on one positive and hard negatives as candidate passages, we create dark examples that all have moderate relevance to the query through mixing-up and masking in discrete space. Furthermore, as the quality of knowledge held in different training instances varies as measured by the teacher's confidence score, we propose a self-paced distillation strategy that adaptively concentrates on a subset of high-quality instances to conduct our dark-example-based knowledge distillation to help the student learn better. We conduct experiments on two widely-used benchmarks and verify the effectiveness of our method.
Label noise is ubiquitous in various machine learning scenarios such as self-labeling with model predictions and erroneous data annotation. Many existing approaches are based on heuristics such as sample losses, which might not be flexible enough to achieve optimal solutions. Meta learning based methods address this issue by learning a data selection function, but can be hard to optimize. In light of these pros and cons, we propose Selection-Enhanced Noisy label Training (SENT) that does not rely on meta learning while having the flexibility of being data-driven. SENT transfers the noise distribution to a clean set and trains a model to distinguish noisy labels from clean ones using model-based features. Empirically, on a wide range of tasks including text classification and speech recognition, SENT improves performance over strong baselines under the settings of self-training and label corruption.
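Self-training has shown great potential in semi-supervised learning. Its core idea is to use the model learned on labeled data to generate pseudo-labels for unlabeled samples and, in turn, teach itself. To obtain valid supervision, active attempts typically employ a momentum teacher for pseudo-label prediction, yet they observe the confirmation bias issue, in which incorrect predictions may provide wrong supervision signals and accumulate during training. The primary cause of this drawback is that the prevailing self-training framework guides the current state with previous knowledge, because the teacher is updated only with the past student. To alleviate this problem, we propose a novel self-training strategy that allows the model to learn from the future. Concretely, at each training step, we first virtually optimize the student (i.e., caching the gradients without applying them to the model weights), then update the teacher with the virtual future student, and finally ask the teacher to produce pseudo-labels for the current student as guidance. In this way, we manage to improve the quality of the pseudo-labels and thus boost performance. We also develop two variants of our future-self-training (FST) framework by peeping at the future both deeply (FST-D) and widely (FST-W). Taking unsupervised domain adaptive semantic segmentation and semi-supervised semantic segmentation as instances, we experimentally demonstrate the effectiveness and superiority of our approach under a wide range of settings. Code will be made publicly available.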
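The order of operations in one future-self-training step can be shown on a deliberately tiny model. The sketch assumes a scalar student weight, a squared-error loss toward the teacher's pseudo-label, and a teacher whose "prediction" is simply its own weight. None of these simplifications comes from the abstract, which only fixes the three-stage order (virtual student step, teacher update from the virtual student, real student step). The name `fst_step` and all numeric values are hypothetical.

```python
from typing import Tuple


def grad(weight: float, pseudo_label: float) -> float:
    """Gradient of the toy loss (weight - pseudo_label) ** 2."""
    return 2.0 * (weight - pseudo_label)


def fst_step(student: float, teacher: float,
             lr: float, momentum: float) -> Tuple[float, float]:
    """One training step that lets the teacher peek at a future student."""
    # 1) Virtual update: compute where the student would go, without
    #    committing the result to the real student weight.
    virtual_student = student - lr * grad(student, teacher)
    # 2) Update the teacher with a moving average of the virtual future student.
    new_teacher = momentum * teacher + (1.0 - momentum) * virtual_student
    # 3) Train the current (unchanged) student on the updated teacher's label.
    new_student = student - lr * grad(student, new_teacher)
    return new_student, new_teacher


student_w, teacher_w = fst_step(student=0.0, teacher=1.0, lr=0.25, momentum=0.5)
print(student_w, teacher_w)
```

In a standard momentum-teacher loop, step 2 would use the past student instead of `virtual_student`; swapping that one argument is the whole difference in this toy version.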
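We present a simple yet effective method for training a named entity recognition (NER) model that operates on business telephone conversation transcripts, which contain noise due to the nature of spoken conversation and the artifacts of automatic speech recognition. We first fine-tune LUKE, a state-of-the-art NER model, on a limited number of transcripts [...] weakly labeled data and a small amount of human-annotated data. The resulting model achieves high accuracy while also satisfying the practical constraints for inclusion in a commercial telephony product: real-time performance when deployed on cost-effective CPUs rather than GPUs.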
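Deep learning-based approaches have shown remarkable performance in the 3D object detection task. However, they suffer a catastrophic performance drop on the originally trained classes when they incrementally learn new classes without revisiting the old data. This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection approaches in real-world scenarios, where continuous learning systems are needed. In this paper, we study the unexplored yet important class-incremental 3D object detection problem and present the first solution, SDCoT, a novel static-dynamic co-teaching method. SDCoT alleviates the catastrophic forgetting of old classes via a static teacher, which provides pseudo annotations for old classes in the new samples and regularizes the current model by extracting previous knowledge with a distillation loss. At the same time, SDCoT consistently learns the underlying knowledge from new data via a dynamic teacher. We conduct extensive experiments on two benchmark datasets and demonstrate the superior performance of SDCoT over baseline approaches in several incremental learning scenarios.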