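Few-shot bioacoustic event detection is the task of detecting the occurrence time of a novel sound given only a few examples. Previous methods employ metric learning to build a latent space from the labeled parts of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both positive and negative events during model optimization. Training with negative events, which are larger in volume than positive events, can increase the generalization ability of the model. In addition, we use transductive inference on the validation set during training for better adaptation to novel classes. We conduct ablation studies on our proposed method with different setups for input features, training data, and hyper-parameters. Our final system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the 34.02 achieved by the baseline prototypical network. Using the proposed method, our submitted system ranks 2nd in DCASE2022-T5. The code of this paper is fully open-sourced at https://github.com/haoheliu/dcase_2022_task_5.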
Translated by Google Translate
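Few-shot sound event detection is the task of detecting sound events despite having only a few labeled examples of the class of interest. This framework is particularly useful in bioacoustics, where there is often a need to annotate very long recordings but expert annotator time is limited. This paper presents an overview of the second edition of the few-shot bioacoustic sound event detection task included in the DCASE 2022 challenge. A detailed description of the task objectives, dataset, and baselines is presented, together with the main results obtained and the characteristics of the submitted systems. The task received submissions from 15 different teams, of which 13 scored higher than the baselines. The highest F-score was 60% on the evaluation set, a large improvement over last year's edition. High-performing methods made use of prototypical networks and transductive learning, and addressed the variable length of events across all target classes. Furthermore, by analyzing results on each of the subsets, we identify the main difficulties that the systems face and conclude that few-shot bioacoustic sound event detection remains an open challenge.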
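Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large but weakly labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify the audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types and relies solely on weakly labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate separation performance, we test our model on MUSDB18 while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held out from training. In both cases, the model achieves source-to-distortion ratio (SDR) performance comparable to current supervised models.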
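The source-to-distortion ratio mentioned above can be illustrated with a minimal sketch. It assumes the plain energy-ratio form, SDR = 10 log10(||s||^2 / ||s - s_hat||^2), which is simpler than the projection-based variants used by standard separation benchmarks; the function name `sdr_db` and the toy signals are illustrative and not taken from the paper.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in decibels: reference energy over error energy."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps avoids division by zero for a perfect estimate (assumption, not from the paper)
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: the estimate is the reference scaled by 0.9.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
score = sdr_db(reference, estimate)
```

A higher value means the separated estimate is closer to the reference source; here the error energy is one hundredth of the signal energy.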
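Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. They are useful for audio content analysis, speech recognition, audio indexing, and music information retrieval. In recent years, most research articles have adopted segmentation-by-classification. This technique divides audio into small frames and performs classification on each frame individually. In this paper, we present a novel approach called You Only Hear Once (YOHO), which is inspired by the YOLO algorithm popularly adopted in computer vision. We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification. This is done by having separate output neurons to detect the presence of an audio class and to predict its start and end points. Compared with the state-of-the-art convolutional recurrent neural network, the relative improvement in F-measure for YOHO ranged from 1% to 6% across multiple datasets for audio segmentation and sound event detection. Because the output of YOHO is more end-to-end and has fewer neurons to predict, inference is at least 6 times faster than segmentation-by-classification. In addition, because this approach predicts acoustic boundaries directly, post-processing and smoothing are about 7 times faster.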
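To make the regression formulation concrete, the sketch below decodes per-bin outputs into events for a single class. The output layout is an assumption made for illustration: each time bin is taken to emit a `(presence, start, end)` triple, with start and end expressed as fractions of the bin duration. The function names, the threshold, and the merging rule are hypothetical and are not taken from the YOHO paper or code.

```python
def decode_bins(outputs, bin_seconds, threshold=0.5):
    """Turn per-bin (presence, start, end) triples into absolute (start, end) times."""
    events = []
    for index, (presence, start, end) in enumerate(outputs):
        if presence < threshold:
            continue  # the class is judged absent in this bin
        offset = index * bin_seconds
        events.append((offset + start * bin_seconds, offset + end * bin_seconds))
    return events


def merge_events(events, max_gap=0.0):
    """Join events whose gap is at most max_gap seconds."""
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Four bins of 2 seconds each for one class; the third bin is below threshold.
outputs = [(0.9, 0.5, 1.0), (0.8, 0.0, 0.25), (0.1, 0.0, 0.0), (0.7, 0.25, 0.75)]
events = merge_events(decode_bins(outputs, bin_seconds=2.0))
```

Because boundaries are read directly from the regression outputs, the only post-processing in this sketch is joining events that touch across a bin edge.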
Metric-based meta-learning is one of the de facto standards in few-shot learning. It consists of representation learning and metric calculation designs. Previous works construct class representations in different ways, varying from mean output embeddings to covariances and distributions. However, using point embeddings in space lacks expressivity and cannot capture class information robustly, while complex statistical modeling makes metric design difficult. In this work, we use tensor fields (``areas'') to model classes from a geometrical perspective for few-shot learning. We present a simple and effective method, dubbed hypersphere prototypes (HyperProto), in which class information is represented by hyperspheres of dynamic size with two sets of learnable parameters: the hypersphere's center and its radius. Extending from points to areas, hyperspheres are much more expressive than embeddings. Moreover, it is more convenient to perform metric-based classification with hypersphere prototypes than with statistical modeling, as we only need to calculate the distance from a data point to the surface of the hypersphere. Following this idea, we also develop two variants of prototypes under other measurements. Extensive experiments and analysis on few-shot learning tasks across NLP and CV, and comparison with 20+ competitive baselines, demonstrate the effectiveness of our approach.
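The point-to-surface distance can be sketched in a few lines. This is a minimal illustration that assumes a Euclidean metric and a signed distance (negative inside the sphere); the centers and radii are fixed toy values rather than learned parameters, and the names `surface_distance` and `nearest_hypersphere` are hypothetical.

```python
import math


def surface_distance(point, center, radius):
    """Signed Euclidean distance from a point to the hypersphere surface."""
    return math.dist(point, center) - radius


def nearest_hypersphere(point, spheres):
    """Pick the class whose hypersphere surface is closest (most negative wins)."""
    return min(spheres, key=lambda label: surface_distance(point, *spheres[label]))


def nearest_center(point, spheres):
    """Point-prototype rule for comparison: ignore the radius."""
    return min(spheres, key=lambda label: math.dist(point, spheres[label][0]))


# Toy classes: a wide class "a" and a compact class "b" (values are illustrative).
spheres = {"a": ([0.0, 0.0], 2.0), "b": ([3.0, 0.0], 0.5)}
query = [1.8, 0.0]
by_surface = nearest_hypersphere(query, spheres)
by_center = nearest_center(query, spheres)
```

In this toy case the query lies inside the wide sphere of class "a", so the surface rule assigns it to "a", while the plain nearest-center rule would assign it to "b". That difference is the extra expressivity the radius provides.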
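Few-shot visual recognition refers to recognizing novel visual concepts from a few labeled instances. Many few-shot visual recognition methods adopt the metric-based meta-learning paradigm, comparing the query representation with class representations to predict the category of the query instance. However, current metric-based methods generally treat all instances equally and consequently often obtain biased class representations, since not all instances are equally significant when instance-level representations are summarized into a class-level representation. For example, some instances may contain unrepresentative information, such as too much background or information about unrelated concepts, which skews the results. To address these issues, we propose a novel metric-based meta-learning framework termed the instance-adaptive class representation learning network (ICRL-Net) for few-shot visual recognition. Specifically, we develop an adaptive instance revaluing network that addresses the biased representation issue when generating the class representation, by learning and assigning adaptive weights to different instances according to their relative significance in the support set of the corresponding class. Additionally, we design an improved bilinear instance representation and incorporate two novel structural losses, i.e., an intra-class instance clustering loss and an inter-class representation distinguishing loss, to further regulate the instance revaluation process and refine the class representation. We conduct extensive experiments on four commonly adopted few-shot benchmarks: the miniImageNet, tieredImageNet, CIFAR-FS, and FC100 datasets. The experimental results, compared with state-of-the-art approaches, demonstrate the superiority of our ICRL-Net.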
We propose prototypical networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set, given only a small number of examples of each new class. Prototypical networks learn a metric space in which classification can be performed by computing distances to prototype representations of each class. Compared to recent approaches for few-shot learning, they reflect a simpler inductive bias that is beneficial in this limited-data regime, and achieve excellent results. We provide an analysis showing that some simple design decisions can yield substantial improvements over recent approaches involving complicated architectural choices and meta-learning. We further extend prototypical networks to zero-shot learning and achieve state-of-the-art results on the CU-Birds dataset.
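Classification by distance to class prototypes can be sketched as follows. The example assumes the inputs are already embedded (the learned embedding network is omitted), takes each prototype as the mean of its class's support embeddings, and applies a softmax over negative squared Euclidean distances. The toy vectors and function names are illustrative only.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support):
    """support maps each class label to its list of support embeddings."""
    return {label: mean_vector(vectors) for label, vectors in support.items()}


def classify(query, prototypes):
    """Softmax over negative squared distances to each prototype."""
    labels = list(prototypes)
    logits = [-squared_euclidean(query, prototypes[label]) for label in labels]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    probabilities = {label: w / total for label, w in zip(labels, weights)}
    return max(probabilities, key=probabilities.get), probabilities


# A 2-way, 2-shot episode with 2-dimensional toy embeddings.
support = {"cat": [[0.9, 0.1], [1.1, -0.1]], "dog": [[-1.0, 0.2], [-0.8, 0.0]]}
prototypes = build_prototypes(support)
label, probabilities = classify([0.7, 0.0], prototypes)
```

New classes need no retraining in this scheme: adding a class only requires computing one more mean from its support examples.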
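Sound event detection (SED) has gained increasing attention owing to its wide application in surveillance, video indexing, and other areas. Existing SED models mainly generate frame-level predictions, converting the task into a sequence multi-label classification problem. A critical issue with frame-based models is that they pursue the best frame-level prediction rather than the best event-level prediction. In addition, they need post-processing and cannot be trained in an end-to-end way. This paper first presents the one-dimensional Detection Transformer (1D-DETR), inspired by the Detection Transformer for image object detection. Furthermore, given the characteristics of SED, an audio query branch and a one-to-many matching strategy for fine-tuning the model are added to 1D-DETR to form the Sound Event Detection Transformer (SEDT). To our knowledge, SEDT is the first event-based and end-to-end SED model. Experiments are conducted on the URBAN-SED dataset and the DCASE2019 Task 4 dataset, and both show that SEDT can achieve competitive performance.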
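Few-shot open-set recognition aims to classify both seen and novel images given only limited training data for the seen classes. The challenge of this task is that the model is required not only to learn a discriminative classifier that classifies the predefined classes from few training examples, but also to reject inputs from unseen classes that never appear at training time. In this paper, we propose to solve the problem from two novel aspects. First, instead of learning the decision boundaries between seen classes, as is done in standard closed-set classification, we reserve space for unseen classes, so that images located in these areas are recognized as belonging to unseen classes. Second, to learn such decision boundaries effectively, we propose to utilize the background features of seen classes. As these background regions do not contribute significantly to the decision in closed-set classification, it is natural to use them as pseudo unseen classes for classifier learning. Our extensive experiments show that the proposed method not only outperforms multiple baselines but also sets new state-of-the-art results on three popular benchmarks, namely tieredImageNet, miniImageNet, and Caltech-UCSD Birds-200-2011 (CUB).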
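Improving generalization is a major challenge in audio classification because labeled data is scarce. Self-supervised learning (SSL) methods tackle this by leveraging unlabeled data to learn useful features for downstream classification tasks. In this work, we propose an augmented contrastive SSL framework to learn invariant representations from unlabeled data. Our method applies various perturbations to the unlabeled input data and uses contrastive learning to learn representations that are robust to such perturbations. Experimental results on the AudioSet and DESED datasets show that our framework significantly outperforms state-of-the-art SSL and supervised learning methods on sound/event classification tasks.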
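We study the problem of few-shot open-set recognition (FSOR), which learns a recognition system capable of both fast adaptation to new classes with limited labeled examples and rejection of unknown negative samples. Traditional large-scale open-set methods have been shown to be ineffective for the FSOR problem because of data limitations. Current FSOR methods typically calibrate few-shot closed-set classifiers to be sensitive to negative samples so that they can be rejected via thresholding. However, threshold tuning is a challenging process, as different FSOR tasks may require different rejection powers. In this paper, we instead propose task-adaptive negative class envision for FSOR, which integrates threshold tuning into the learning process. Specifically, we augment the few-shot closed-set classifier with additional negative prototypes generated from the few-shot examples. By incorporating few-shot class correlations into the negative generation process, we are able to learn dynamic rejection boundaries for FSOR tasks. In addition, we extend our method to generalized few-shot open-set recognition (GFSOR), which requires classification on both many-shot and few-shot classes as well as rejection of negative samples. Extensive experiments on public benchmarks validate our methods on both problems.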
Nearest-Neighbor (NN) classification has proven to be a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support sample based on its similarity to different base prototypes. Then, we perform NN classification using these discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient non-learning based method can outperform, or is at least comparable to, SOTA methods that need additional learning steps.
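We address the problem of few-shot semantic segmentation (FSS), which aims to segment objects of novel classes in a target image given only a few annotated samples. Although recent advances have been made by incorporating prototype-based metric learning, existing methods still show limited performance under extreme intra-class object variations and semantically similar inter-class objects because of their poor feature representation. To tackle this problem, we propose a dual prototypical contrastive learning approach tailored to the FSS task to capture representative semantic features effectively. The main idea is to make the prototypes more discriminative by increasing the inter-class distance while reducing the intra-class distance in the prototype feature space. To this end, we first present a class-specific contrastive loss with a dynamic prototype dictionary that stores the class-aware prototypes during training, so that prototypes of the same class become similar and prototypes of different classes become dissimilar. Furthermore, we introduce a class-agnostic contrastive loss that enhances the generalization ability to unseen categories by compressing the feature distribution of each semantic class within an episode. We demonstrate that the proposed dual prototypical contrastive learning approach outperforms state-of-the-art FSS methods on the PASCAL-5i and COCO-20i datasets. The code is available at: https://github.com/kwonjunn01/dpcl1.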
Many modern computer vision algorithms suffer from two major bottlenecks: scarcity of data and learning new tasks incrementally. While the model is trained on new batches of data, it loses its ability to classify the previous data correctly, a phenomenon termed catastrophic forgetting. Conventional methods have tried to mitigate catastrophic forgetting of previously learned data, but at the cost of compromised training in the current session. State-of-the-art generative replay based approaches use complicated structures such as generative adversarial networks (GANs) to deal with catastrophic forgetting. Additionally, training a GAN with few samples may lead to instability. In this work, we present a novel method to deal with these two major hurdles. Our method identifies a better embedding space with an improved contrasting loss to make classification more robust. Moreover, our approach is able to retain previously acquired knowledge in the embedding space even when trained with new classes. We update the class prototypes of previous sessions during training in such a way that they still represent the true class means. This is of prime importance, as our classification rule is based on the nearest class mean classification strategy. We demonstrate our results by showing that the embedding space remains intact after training the model with new classes. We show that our method performs better than existing state-of-the-art algorithms in terms of accuracy across different sessions.
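Fall detection for the elderly is a well-researched problem with several proposed solutions, including wearable and non-wearable techniques. While existing techniques have excellent detection rates, their adoption by the target population is lacking because of the need to wear devices and because of user privacy concerns. Our paper provides a novel, non-wearable, non-intrusive, and scalable solution for fall detection, deployed on an autonomous mobile robot equipped with a microphone. The proposed method uses ambient sound input recorded in people's homes. We specifically target the bathroom environment, as it is highly prone to falls and existing techniques cannot be deployed there without jeopardizing user privacy. The present work develops a solution based on a Transformer architecture that takes noisy sound input from bathrooms and classifies it into fall/no-fall classes with an accuracy of 0.8673. Furthermore, the proposed approach is extendable to indoor environments other than bathrooms and is suitable for deployment in elderly homes, hospitals, and rehabilitation facilities without requiring the user to wear any device or to be constantly "watched" by sensors.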
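Audio tagging is an active research area with a wide range of applications. Since the release of AudioSet, great progress has been made in advancing model performance, mostly through the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models with AudioSet, yet they have not received the attention they deserve. To fill this gap, in this work we present PSLA, a collection of training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation, together with their design choices. By training an EfficientNet with these techniques, we obtain a single model (with 13.6M parameters) and an ensemble model that achieve mean average precision (mAP) scores of 0.444 and 0.474 on AudioSet, respectively, outperforming the previous best system, which reached 0.439 with 81M parameters. In addition, our model also achieves a new state-of-the-art mAP of 0.567 on FSD50K.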
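For readers unfamiliar with the metric, here is a minimal sketch of mean average precision for multi-label tagging. It assumes the common definition in which per-class average precision is the mean of the precision values at the rank of each positive clip, and mAP is the unweighted mean over classes. Tie handling and other details of the official AudioSet evaluation may differ, and the scores below are toy values.

```python
def average_precision(scores, labels):
    """Mean of precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """Unweighted mean of per-class average precision."""
    values = [
        average_precision(scores, labels)
        for scores, labels in zip(scores_per_class, labels_per_class)
    ]
    return sum(values) / len(values)


# Two classes scored over the same four clips (toy values).
scores_per_class = [[0.9, 0.8, 0.3, 0.1], [0.2, 0.7, 0.6, 0.4]]
labels_per_class = [[1, 0, 1, 0], [0, 1, 1, 0]]
map_score = mean_average_precision(scores_per_class, labels_per_class)
```

The first class has positives at ranks 1 and 3, giving an average precision of (1 + 2/3) / 2, while the second class ranks both positives first and scores 1.0.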
Despite the great progress made by deep CNNs in image semantic segmentation, they typically require a large number of densely annotated images for training and are difficult to generalize to unseen object categories. Few-shot segmentation has thus been developed to learn to perform segmentation from only a few annotated examples. In this paper, we tackle the challenging few-shot segmentation problem from a metric learning perspective and present PANet, a novel prototype alignment network to better utilize the information of the support set. Our PANet learns class-specific prototype representations from a few support images within an embedding space and then performs segmentation over the query images by matching each pixel to the learned prototypes. With non-parametric metric learning, PANet offers high-quality prototypes that are representative of each semantic class and at the same time discriminative between different classes. Moreover, PANet introduces a prototype alignment regularization between support and query. With this, PANet fully exploits knowledge from the support and provides better generalization on few-shot segmentation. Significantly, our model achieves mIoU scores of 48.1% and 55.7% on PASCAL-5i for the 1-shot and 5-shot settings respectively, surpassing the state-of-the-art method by 1.8% and 8.6%.
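The prototype-matching step can be sketched as follows. The example assumes that features have already been extracted and flattened to one vector per pixel, forms each prototype by averaging the support features under a binary mask, and labels each query pixel by cosine similarity to the nearest prototype. The prototype alignment regularization is omitted, and all names and values are illustrative.

```python
import math


def masked_average(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask."""
    selected = [feature for feature, keep in zip(features, mask) if keep]
    return [sum(column) / len(selected) for column in zip(*selected)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # small constant guards against zero vectors


def segment(query_features, prototypes):
    """Label each query pixel with the most similar prototype."""
    return [
        max(prototypes, key=lambda label: cosine(feature, prototypes[label]))
        for feature in query_features
    ]


# One support image flattened to four pixels with 2-dimensional features.
support_features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
foreground_mask = [1, 1, 0, 0]
background_mask = [1 - m for m in foreground_mask]
prototypes = {
    "foreground": masked_average(support_features, foreground_mask),
    "background": masked_average(support_features, background_mask),
}
prediction = segment([[0.7, 0.1], [0.2, 0.6]], prototypes)
```

Because matching is non-parametric, segmenting a new class only requires computing prototypes from its support masks.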
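Keyword spotting is the task of detecting a keyword in streaming audio. Conventional keyword spotting targets the classification of predefined keywords, but there is growing attention to few-shot (query-by-example) keyword spotting, e.g., N-way classification given M-shot support samples. Moreover, in real-world scenarios there can be utterances from unexpected categories (open-set) that need to be rejected rather than classified as one of the N classes. Combining the two needs, we tackle few-shot open-set keyword spotting with a new benchmark setting named splitGSC. We propose episode-known dummy prototypes based on metric learning to better detect the open set, and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets). Our D-ProtoNets show clear margins over recent few-shot open-set recognition (FSOSR) approaches on the proposed splitGSC. We also verify our method on a standard benchmark, miniImageNet, where D-ProtoNets show the state-of-the-art open-set detection rate in FSOSR.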
Few-shot segmentation (FSS) aims to segment unseen classes using a few annotated samples. Typically, a prototype representing the foreground class is extracted from annotated support image(s) and is matched to features representing each pixel in the query image. However, models learnt in this way are insufficiently discriminatory, and often produce false positives: misclassifying background pixels as foreground. Some FSS methods try to address this issue by using the background in the support image(s) to help identify the background in the query image. However, the backgrounds of these images are often quite distinct, and hence the support image background information is uninformative. This article proposes a method, QSR, that extracts the background from the query image itself, and as a result is better able to discriminate between foreground and background features in the query image. This is achieved by modifying the training process to associate prototypes with class labels, including known classes from the training data and latent classes representing unknown background objects. This class information is then used to extract a background prototype from the query image. To successfully associate prototypes with class labels and extract a background prototype that is capable of predicting a mask for the background regions of the image, the machinery for extracting and using foreground prototypes is induced to become more discriminative between different classes. Experiments for both 1-shot and 5-shot FSS on both the PASCAL-5i and COCO-20i datasets demonstrate that the proposed method results in a significant improvement in performance for the baseline methods it is applied to. As QSR operates only during training, these improved results are produced with no extra computational complexity during testing.
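Learning novel classes from very few labeled samples has attracted increasing attention in machine learning. Recent research based on either the meta-learning or the transfer-learning paradigm demonstrates that obtaining a good feature space can be an effective solution for achieving favorable performance on few-shot tasks. In this paper, we propose a simple but effective paradigm that decouples the tasks of learning feature representations and learning classifiers, and learns only the feature embedding architecture from the base classes via a typical transfer-learning training strategy. To maintain both generalization ability across base and novel classes and discrimination ability within each class, we propose a dual-path feature learning scheme that effectively combines structural similarity with contrastive feature construction. In this way, intra-class alignment and inter-class uniformity can be well balanced, resulting in improved performance. Experiments on three popular benchmarks show that, when combined with a simple prototype-based classifier, our method can still achieve promising results on both standard and generalized few-shot problems in either inductive or transductive inference settings.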