跨言扬声器风格的转移旨在提取给定参考语音的语音样式,可以在任意目标扬声器的音色中复制。有关此主题的现有方法已经探索了利用语音级样式标签通过全球或本地规模样式表示进行样式转移。但是,有声读物数据集通常以本地韵律和全球类型的形式进行特征,并且很少伴有发言级风格的标签。因此,正确地将阅读方式转移到不同的扬声器上仍然是一项具有挑战性的任务。本文旨在介绍块的多尺度跨言式风格模型,以捕获有声读物的全球类型和本地韵律。此外,通过使用拟议的可切换对手分类器来解开扬声器的音色和样式,提取的阅读样式可适应不同扬声器的音色。实验结果证实,该模型设法将给定的阅读方式转移到新的目标扬声器上。在局部韵律和全球流派类型预测指标的支持下,进一步揭示了所提出的方法在多扬声器有声读物中的潜力。
translated by 谷歌翻译
在本文中,我们介绍了2022年多模式情感分析挑战(MUSE)的解决方案,其中包括Muse-Humor,Muse-Rection和Muse Surns Sub-Challenges。 2022年穆斯穆斯(Muse 2022)着重于幽默检测,情绪反应和多模式的情感压力,利用不同的方式和数据集。在我们的工作中,提取了不同种类的多模式特征,包括声学,视觉,文本和生物学特征。这些功能由Temma和Gru融合到自发机制框架中。在本文中,1)提取了一些新的音频功能,面部表达功能和段落级文本嵌入以进行准确的改进。 2)我们通过挖掘和融合多模式特征来显着提高多模式情感预测的准确性和可靠性。 3)在模型培训中应用有效的数据增强策略,以减轻样本不平衡问题并防止模型形成学习有偏见的主题字符。对于博物馆的子挑战,我们的模型获得了0.8932的AUC分数。对于Muse Rection子挑战,我们在测试集上的Pearson相关系数为0.3879,它的表现优于所有其他参与者。对于Muse Surst Sub-Challenge,我们的方法在测试数据集上的唤醒和价值都优于基线,达到了0.5151的最终综合结果。
translated by 谷歌翻译
伪标记已被证明是一种有希望的半监督学习(SSL)范式。现有的伪标记方法通常假定培训数据的类别分布是平衡的。但是,这种假设远非现实的场景,现有的伪标记方法在班级不平衡的背景下遭受了严重的性能变性。在这项工作中,我们在半监督设置下研究伪标记。核心思想是使用偏置自适应分类器自动吸收由班级失衡引起的训练偏差,该分类器将原始线性分类器与偏置吸引子配合使用。偏置吸引子设计为适应训练偏见的轻巧残留网络。具体而言,通过双级学习框架来学习偏见吸引子,以便偏见自适应分类器能够符合不平衡的训练数据,而线性分类器可以为每个类提供无偏的标签预测。我们在各种不平衡的半监督设置下进行了广泛的实验,结果表明我们的方法可以适用于不同的伪标记模型,并且优于先前的艺术。
translated by 谷歌翻译
Colonoscopy, currently the most efficient and recognized colon polyp detection technology, is necessary for early screening and prevention of colorectal cancer. However, due to the varying size and complex morphological features of colonic polyps as well as the indistinct boundary between polyps and mucosa, accurate segmentation of polyps is still challenging. Deep learning has become popular for accurate polyp segmentation tasks with excellent results. However, due to the structure of polyps image and the varying shapes of polyps, it is easy for existing deep learning models to overfit the current dataset. As a result, the model may not process unseen colonoscopy data. To address this, we propose a new state-of-the-art model for medical image segmentation, the SSFormer, which uses a pyramid Transformer encoder to improve the generalization ability of models. Specifically, our proposed Progressive Locality Decoder can be adapted to the pyramid Transformer backbone to emphasize local features and restrict attention dispersion. The SSFormer achieves stateof-the-art performance in both learning and generalization assessment.
translated by 谷歌翻译
分层分类旨在将对象对类别的层次进行。例如,可以根据订单,家庭和物种的三级层次分类来分类鸟类。现有方法通过将其解耦为几个多级分类任务来常见地解决分层分类。但是,这种多任务学习策略未能充分利用不同层次结构的各种类别之间的相关性。在本文中,我们提出了基于深度学习的统一概率框架的标签层次转换,以解决层次分类。具体地,我们明确地学习标签层次转换矩阵,其列向量表示两个相邻层次结构之间的类的条件标签分布,并且可以能够编码嵌入类层次结构中的相关性。我们进一步提出了混淆损失,这鼓励分类网络在训练期间学习不同标签层次结构的相关性。所提出的框架可以适用于任何现有的深网络,只有轻微的修改。我们尝试具有各种层次结构的三个公共基准数据集,结果证明了我们的方法超出现有技术的优势。源代码将公开可用。
translated by 谷歌翻译
随着深度学习技术的快速发展和计算能力的提高,深度学习已广泛应用于高光谱图像(HSI)分类领域。通常,深度学习模型通常包含许多可训练参数,并且需要大量标记的样品来实现最佳性能。然而,关于HSI分类,由于手动标记的难度和耗时的性质,大量标记的样本通常难以获取。因此,许多研究工作侧重于建立一个少数标记样本的HSI分类的深层学习模型。在本文中,我们专注于这一主题,并对相关文献提供系统审查。具体而言,本文的贡献是双重的。首先,相关方法的研究进展根据学习范式分类,包括转移学习,积极学习和少量学习。其次,已经进行了许多具有各种最先进的方法的实验,总结了结果以揭示潜在的研究方向。更重要的是,虽然深度学习模型(通常需要足够的标记样本)和具有少量标记样本的HSI场景之间存在巨大差距,但是通过深度学习融合,可以很好地表征小样本集的问题方法和相关技术,如转移学习和轻量级模型。为了再现性,可以在HTTPS://github.com/shuguoj/hsi-classification中找到纸张中评估的方法的源代码.git。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
Despite significant progress in object categorization, in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited sized class vocabularies and typically requires separation between supervised and unsupervised classes, allowing former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above mentioned challenges and address problems of supervised, zero-shot, generalized zero-shot and open set recognition using a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. Distance constraints ensure that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We illustrate that resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open set recognition, with up to 310K class vocabulary on Animal with Attributes and ImageNet datasets.
translated by 谷歌翻译