Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task in computer vision. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without any further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel framework called CLIP-ES for WSSS. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to mitigate noise and focus on confident regions. Our proposed framework dramatically reduces the cost of training for WSSS and shows the capability of localizing objects in CLIP. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
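The softmax-into-GradCAM idea in stage 1 can be illustrated with a toy gradient computation (a minimal sketch, not the authors' code; names and shapes are assumptions). The key effect: differentiating the softmax probability of the target class, rather than its raw logit, yields negative gradients on competing class logits, so non-target classes are actively suppressed in the resulting CAM.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_target_prob(logits, target):
    """Gradient of the softmax probability of `target` w.r.t. all logits:
    d p_t / d z_j = p_t * (1[j == t] - p_j)."""
    p = softmax(logits)
    g = -p[target] * p
    g[target] += p[target]
    return g

logits = np.array([3.0, 1.0, 0.5])  # class 0 = target; others = confusing classes
g = grad_target_prob(logits, target=0)
# The target logit receives a positive gradient...
assert g[0] > 0
# ...while non-target logits receive negative gradients, i.e. their
# contributions are suppressed -- an effect plain (pre-softmax) GradCAM
# lacks, since d z_t / d z_j = 0 for j != t.
assert np.all(g[1:] < 0)
```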
Despite the tremendous progress of Masked Autoencoders (MAE) in developing vision tasks such as image and video, exploring MAE in large-scale 3D point clouds remains challenging due to the inherent irregularity. In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm. The core idea is to apply a \textbf{G}enerative \textbf{D}ecoder for MAE (GD-MAE) that automatically merges the surrounding context to restore the masked geometric knowledge in a hierarchical fusion manner. In doing so, our approach is free from introducing the heuristic design of decoders and enjoys the flexibility of exploring various masking strategies. The corresponding part costs less than \textbf{12\%} latency compared with conventional methods, while achieving better performance. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: Waymo, KITTI, and ONCE. Consistent improvement on downstream detection tasks illustrates strong robustness and generalization capability. Not only does our method achieve state-of-the-art results, but remarkably, we achieve comparable accuracy even with \textbf{20\%} of the labeled data on the Waymo dataset. The code will be released at \url{https://github.com/Nightmare-n/GD-MAE}.
Semi-supervised object detection has made significant progress with the development of Mean-Teacher-driven self-training. Despite the promising results, the label mismatch problem is not yet fully explored in previous works, leading to severe confirmation bias during self-training. In this paper, we propose a simple yet effective LabelMatch framework from two different yet complementary perspectives, i.e., distribution level and instance level. For the former, it is reasonable to approximate the class distribution of the unlabeled data from that of the labeled data according to Monte Carlo sampling. Guided by this weakly supervised cue, we introduce a re-distribution mean teacher, which leverages adaptive label-distribution-aware confidence thresholds to generate unbiased pseudo labels to drive student learning. For the latter, there exists an overlooked label assignment ambiguity problem across teacher-student models. To remedy this issue, we present a novel label assignment mechanism for self-training frameworks, namely proposal self-assignment, which injects the proposals from the student into the teacher and generates accurate pseudo labels to match each proposal in the student model accordingly. Experiments on both the MS COCO and PASCAL VOC datasets demonstrate the considerable superiority of our proposed framework over other state-of-the-art methods. Code will be available at https://github.com/hikvision-research/ssod.
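The distribution-level idea can be sketched as follows (an illustrative toy, with all names and the exact thresholding rule being assumptions): estimate per-class ratios from the labeled set, then pick a per-class confidence threshold on the unlabeled predictions so that the kept pseudo-labels match that class distribution, instead of using one fixed global threshold.

```python
import numpy as np

def adaptive_thresholds(scores, labels, class_ratios):
    """Pick a per-class score threshold so the number of pseudo-labels kept
    for each class matches that class's ratio estimated from labeled data.
    scores/labels: predicted confidence and class for each unlabeled box."""
    n = len(scores)
    thresholds = {}
    for c, ratio in enumerate(class_ratios):
        cls_scores = np.sort(scores[labels == c])[::-1]  # descending
        keep = min(int(round(ratio * n)), len(cls_scores))
        thresholds[c] = cls_scores[keep - 1] if keep > 0 else np.inf
    return thresholds

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.95, 0.5])
labels = np.array([0,   0,   1,   1,   0,    1])
th = adaptive_thresholds(scores, labels, class_ratios=[0.5, 0.5])
# Each class keeps exactly its share of the pseudo-labels, even though
# class 1's scores are uniformly lower than class 0's.
assert (scores[labels == 0] >= th[0]).sum() == 3
assert (scores[labels == 1] >= th[1]).sum() == 3
```

A single global threshold (say 0.7) would keep three class-0 boxes but only one class-1 box here, which is the kind of biased pseudo-labeling the adaptive rule avoids.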
We present a novel embedding field \emph{PREF} as a compact representation to facilitate neural signal modeling and reconstruction tasks. Pure multi-layer perceptron (MLP) based neural techniques are biased towards low-frequency signals and rely on deep layers or Fourier encoding to avoid losing details. PREF instead employs a compact and physically explainable encoding field based on the phasor formulation of the Fourier embedding space. We conduct comprehensive experiments to demonstrate the advantages of PREF over the latest spatial embedding techniques. We then develop a highly efficient frequency learning framework using an approximated inverse Fourier transform scheme along with a novel Parseval regularizer. Extensive experiments show that our efficient and compact frequency-based neural signal processing technique is on par with, or even better than, the state-of-the-art in 2D image completion, 3D SDF surface regression, and 5D radiance field reconstruction.
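The spectral bias PREF addresses can be seen with the generic frequency-encoding trick it builds on (a hedged sketch only; PREF itself uses a learned phasor-volume field, not this fixed positional encoding): mapping a coordinate through sinusoids of growing frequency makes nearby inputs far more distinguishable than their raw distance, which is what lets a network fit high-frequency detail.

```python
import numpy as np

def fourier_embed(x, num_freqs=4):
    """Map a scalar coordinate to a Fourier feature vector
    [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_freqs-1."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

e0, e1 = fourier_embed(0.0), fourier_embed(0.001)
# The two inputs are nearly identical in coordinate space, but the
# high-frequency components amplify their difference in embedding space.
assert np.linalg.norm(e1 - e0) > abs(0.001 - 0.0)
```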
With the emergence of massive unlabeled data, unsupervised feature selection has attracted increasing attention. The distribution of samples, and the latent benefit of training the learning method on samples in a more effective order, need to be considered to improve the robustness of the method. Self-paced learning is an effective approach for controlling the order of sample training. In this study, an unsupervised feature selection method is proposed by integrating self-paced learning with a subspace learning framework. Moreover, the local manifold structure is preserved, and the redundancy of features is constrained by two regularization terms. The $l_{2,1/2}$-norm is applied to the projection matrix, which aims to retain discriminative features and further alleviate the effects of noise in the data. An iterative method is then presented to solve the optimization problem. The convergence of the method is proved theoretically and demonstrated experimentally. The proposed method is compared with other state-of-the-art algorithms on nine real-world datasets. The experimental results show that the proposed method can improve the performance of clustering methods and outperforms the other compared algorithms.
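The row-sparsity regularizer mentioned above can be written out directly (a minimal sketch; the matrix here is illustrative): the $l_{2,1/2}$-norm of a projection matrix sums the square roots of the $l_2$ norms of its rows, so rows with small norms are pushed hard toward exactly zero, and zeroed rows correspond to discarded features.

```python
import numpy as np

def l_2_half_norm(W):
    """l_{2,1/2} regularizer on a projection matrix W:
    sum over rows i of ||w_i||_2 ^ (1/2)."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.sum(np.sqrt(row_norms))

W = np.array([[3.0, 4.0],   # ||row||_2 = 5 -> contributes sqrt(5)
              [0.0, 0.0]])  # an already-zeroed (deselected) feature row
assert np.isclose(l_2_half_norm(W), np.sqrt(5.0))
```

The concave square root penalizes shrinking an already-small row more strongly than shrinking a large one, which is why this norm yields sparser rows than the more common $l_{2,1}$-norm.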
In this paper we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, 22,400+ hours in total. We collect the data from YouTube and Podcast, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from its corresponding video subtitles, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev for cross-validation purposes in training, Test_Net collected from the Internet as a matched test, and Test_Meeting recorded from real meetings as a more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.
The unified streaming and non-streaming two-pass (U2) end-to-end model for speech recognition has shown great performance in terms of streaming capability, accuracy, real-time factor (RTF), and latency. In this paper, we present U2++, an enhanced version of U2 that further improves the accuracy. The core idea of U2++ is to use both the forward and the backward information of the label sequences at training time to learn richer information, and to combine the forward and backward predictions at decoding time to give more accurate recognition results. We also propose a new data augmentation method named SpecSub to help the U2++ model be more accurate and robust. Our experiments show that, compared with U2, U2++ exhibits faster convergence in training, better robustness to the decoding method, and a consistent 5%-8% word error rate reduction over U2. On AISHELL-1, U2++ achieves a 4.63% character error rate (CER) in the non-streaming setting and 5.05% in the streaming setting with 320ms latency. To the best of our knowledge, 5.05% is the best published streaming result on the AISHELL-1 test set.
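SpecSub can be sketched in a few lines (an illustrative sketch under assumed parameter names, not the WeNet implementation): instead of zeroing out a time span as SpecAugment's time masking does, a span of frames is replaced with an earlier span from the same utterance, so the substituted region stays acoustically plausible.

```python
import numpy as np

def spec_sub(feats, start, length, shift):
    """Replace feats[start:start+length] with the chunk `shift` frames
    earlier in the same utterance (requires start >= shift)."""
    out = feats.copy()
    out[start:start + length] = feats[start - shift:start - shift + length]
    return out

feats = np.arange(10, dtype=float)[:, None]  # 10 frames, 1-dim feature
aug = spec_sub(feats, start=5, length=3, shift=4)
assert np.allclose(aug[5:8, 0], [1.0, 2.0, 3.0])   # frames 5..7 now copy 1..3
assert np.allclose(aug[:5], feats[:5])             # frames outside the span
assert np.allclose(aug[8:], feats[8:])             # are left untouched
```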
Recently, 3D deep learning models have been shown to be susceptible to adversarial attacks, like their 2D counterparts. Most state-of-the-art (SOTA) 3D adversarial attacks perturb 3D point clouds. To reproduce these attacks in a physical scenario, the generated adversarial 3D point cloud needs to be reconstructed into a mesh, which leads to a significant drop in its adversarial effect. In this paper, we propose a strong 3D adversarial attack named Mesh Attack that addresses this problem by directly perturbing the mesh of a 3D object. To leverage the most effective gradient-based attacks, a differentiable sampling module is introduced, which back-propagates the gradient from the point cloud to the mesh. To further ensure adversarial mesh examples without outliers that remain 3D-printable, three mesh losses are adopted. Extensive experiments demonstrate that the proposed scheme outperforms SOTA 3D attacks by a significant margin. We also achieve SOTA performance under various defenses. Our code is available at: https://github.com/cuge1995/mesh-attack.
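The differentiable sampling module rests on a simple fact, sketched below (a toy of the idea only, not the paper's code): a point sampled on a triangle is a barycentric combination of its vertices, so any gradient on the sampled point cloud flows back to the mesh vertices by the chain rule with the barycentric weights as coefficients.

```python
import numpy as np

def sample_on_triangle(verts, bary):
    """Sample a surface point p = b0*v0 + b1*v1 + b2*v2. Because p is
    linear in the vertices, dL/dv_i = b_i * dL/dp."""
    return bary @ verts  # (3,) barycentric weights times (3, 3) vertices

verts = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
bary = np.array([0.2, 0.3, 0.5])   # non-negative, sums to 1
p = sample_on_triangle(verts, bary)
assert np.allclose(p, [0.3, 0.5, 0.0])

# Chain rule: each vertex receives its barycentric weight times dL/dp,
# which is what lets a point-cloud attack loss perturb the mesh directly.
dL_dp = np.array([1.0, 1.0, 1.0])
dL_dverts = np.outer(bary, dL_dp)
assert np.allclose(dL_dverts[2], [0.5, 0.5, 0.5])
```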
In this paper, we present an open source, production-first, and production-ready speech recognition toolkit named WeNet, in which a new two-pass approach is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between the research on and the production of E2E speech recognition models. WeNet provides an efficient way to ship ASR applications in several real-world scenarios, which is its main difference from, and advantage over, other open source E2E speech recognition toolkits. Our approach employs a dynamic chunk-based attention strategy in the transformer layers, allowing arbitrary right context length in the hybrid CTC/attention architecture. The inference latency can be easily controlled by simply changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to obtain the final result. Our experiments on the AISHELL-1 dataset using WeNet show that our model achieves a 5.03% relative character error rate (CER) reduction in non-streaming ASR compared with a standard non-streaming transformer. After model quantization, our model achieves reasonable RTF and latency.
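The chunk-based attention strategy can be sketched as a mask-construction routine (an illustrative sketch; function and parameter names are assumptions, not WeNet's API): each frame attends to all frames in its own chunk and in all previous chunks, so latency is bounded by the chunk size alone.

```python
import numpy as np

def chunk_attention_mask(num_frames, chunk_size):
    """Boolean (num_frames, num_frames) mask: entry (i, j) is True iff
    frame i may attend to frame j, i.e. j lies at or before the end of
    frame i's chunk."""
    idx = np.arange(num_frames)
    chunk_end = (idx // chunk_size + 1) * chunk_size  # first invisible frame
    return idx[None, :] < chunk_end[:, None]

m = chunk_attention_mask(num_frames=6, chunk_size=2)
assert m[0, 1] and not m[0, 2]   # frame 0 sees its own chunk only
assert m[3, 3] and not m[3, 4]   # frame 3 sees chunks 0-1, not chunk 2
assert m[5].all()                # the last frame sees everything
```

Setting `chunk_size = num_frames` recovers full (non-streaming) attention, which is how one model can serve both modes.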
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.
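The frustum-shifting step can be sketched geometrically (a minimal sketch under assumed names, not the OBMO code): sliding the 3D box center along the camera ray through it scales the whole center by $(z + \Delta z)/z$, which leaves the projected 2D location $u = f\,x/z$ unchanged; that invariance is exactly the depth ambiguity the soft pseudo labels are built from.

```python
import numpy as np

def shift_along_frustum(center, deltas):
    """Generate pseudo-label centers by sliding the box center along the
    camera ray through it: center' = center * (z + dz) / z."""
    cx, cy, cz = center
    return [np.array([cx, cy, cz]) * (cz + dz) / cz for dz in deltas]

center = np.array([2.0, 1.0, 10.0])          # (x, y, z) in camera coordinates
pseudo = shift_along_frustum(center, deltas=[-1.0, 1.0])
f = 700.0                                    # arbitrary focal length
u = f * center[0] / center[2]
for p in pseudo:
    # Every shifted center projects to the same 2D pixel as the original...
    assert np.isclose(f * p[0] / p[2], u)
# ...while covering a range of depths around the hard label.
assert np.isclose(pseudo[0][2], 9.0) and np.isclose(pseudo[1][2], 11.0)
```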