A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
translated by 谷歌翻译
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.
translated by 谷歌翻译
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to facilitate coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ on open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. Code will be available at https://github.com/CVMI-Lab/PLA.
translated by 谷歌翻译
Weakly supervised detection of anomalies in surveillance videos is a challenging task. Going beyond existing works that have deficient capabilities to localize anomalies in long videos, we propose a novel glance and focus network to effectively integrate spatial-temporal information for accurate anomaly detection. In addition, we empirically found that existing approaches that use feature magnitudes to represent the degree of anomalies typically ignore the effects of scene variations, and hence result in sub-optimal performance due to the inconsistency of feature magnitudes across scenes. To address this issue, we propose the Feature Amplification Mechanism and a Magnitude Contrastive Loss to enhance the discriminativeness of feature magnitudes for detecting anomalies. Experimental results on two large-scale benchmarks UCF-Crime and XD-Violence manifest that our method outperforms state-of-the-art approaches.
translated by 谷歌翻译
Deep learning has attained remarkable success in many 3D visual recognition tasks, including shape classification, object detection, and semantic segmentation. However, many of these results rely on manually collecting densely annotated real-world 3D data, which is highly time-consuming and expensive to obtain, limiting the scalability of 3D recognition tasks. Thus, we study unsupervised 3D recognition and propose a Self-supervised-Self-Labeled 3D Recognition (SL3D) framework. SL3D simultaneously solves two coupled objectives, i.e., clustering and learning feature representation to generate pseudo-labeled data for unsupervised 3D recognition. SL3D is a generic framework and can be applied to solve different 3D recognition tasks, including classification, object detection, and semantic segmentation. Extensive experiments demonstrate its effectiveness. Code is available at https://github.com/fcendra/sl3d.
translated by 谷歌翻译
Most existing 3D point cloud object detection approaches heavily rely on large amounts of labeled training data. However, the labeling process is costly and time-consuming. This paper considers few-shot 3D point cloud object detection, where only a few annotated samples of novel classes are needed with abundant samples of base classes. To this end, we propose Prototypical VoteNet to recognize and localize novel instances, which incorporates two new modules: Prototypical Vote Module (PVM) and Prototypical Head Module (PHM). Specifically, as the 3D basic geometric structures can be shared among categories, PVM is designed to leverage class-agnostic geometric prototypes, which are learned from base classes, to refine local features of novel categories.Then PHM is proposed to utilize class prototypes to enhance the global feature of each object, facilitating subsequent object localization and classification, which is trained by the episodic training strategy. To evaluate the model in this new setting, we contribute two new benchmark datasets, FS-ScanNet and FS-SUNRGBD. We conduct extensive experiments to demonstrate the effectiveness of Prototypical VoteNet, and our proposed method shows significant and consistent improvements compared to baselines on two benchmark datasets.
translated by 谷歌翻译
3D场景由大量背景点主导,这对于主要需要集中在前景对象的检测任务是多余的。在本文中,我们分析了现有的稀疏3D CNN的主要组成部分,发现3D CNN忽略了数据的冗余,并在下降过程中进一步扩大了数据,这带来了大量的多余和不必要的计算间开销。受到这一点的启发,我们提出了一个名为“空间修剪稀疏卷积”(SPS-CONV)的新型卷积操作员,其中包括两个变体,空间修剪的Submanifold稀疏卷积(SPSS-CONV)和空间修剪的常规稀疏卷积(SPRS-CONV),包括这是基于动态确定冗余降低关键领域的想法。我们验证该幅度可以作为确定摆脱基于学习方法的额外计算的关键领域的重要提示。提出的模块可以轻松地将其纳入现有的稀疏3D CNN中,而无需额外的架构修改。关于Kitti,Waymo和Nuscenes数据集的广泛实验表明,我们的方法可以在不损害性能的情况下实现超过50%的GFLOPS。
translated by 谷歌翻译
在本文中,我们从经验上研究了如何充分利用低分辨率框架以进行有效的视频识别。现有方法主要集中于开发紧凑的网络或减轻视频输入的时间冗余以提高效率,而压缩框架分辨率很少被认为是有希望的解决方案。一个主要问题是低分辨率帧的识别准确性不佳。因此,我们首先分析低分辨率帧上性能降解的根本原因。我们的主要发现是,降级的主要原因不是在下采样过程中的信息丢失,而是网络体系结构和输入量表之间的不匹配。通过知识蒸馏(KD)的成功,我们建议通过跨分辨率KD(RESKD)弥合网络和输入大小之间的差距。我们的工作表明,RESKD是一种简单但有效的方法,可以提高低分辨率帧的识别精度。没有铃铛和哨子,RESKD在四个大规模基准数据集(即ActivityNet,FCVID,Mini-Kinetics,sopeings soseings ossings v2)上,就效率和准确性上的所有竞争方法都大大超过了所有竞争方法。此外,我们广泛地展示了其对最先进的体系结构(即3D-CNN和视频变压器)的有效性,以及对超低分辨率帧的可扩展性。结果表明,RESKD可以作为最先进视频识别的一般推理加速方法。我们的代码将在https://github.com/cvmi-lab/reskd上找到。
translated by 谷歌翻译
由于没有大型配对的文本形状数据,这两种方式之间的大量语义差距以及3D形状的结构复杂性,因此文本指导的3D形状生成仍然具有挑战性。本文通过引入2D图像作为垫脚石来连接两种方式并消除对配对的文本形状数据的需求,提出了一个名为“图像”的新框架,称为“垫脚石”(ISS)。我们的关键贡献是一种两阶段的功能空间对准方法,它通过利用具有多视图Supperions的预训练的单视重构造(SVR)模型来映射剪辑功能以形成形状:首先将剪辑图像剪辑剪辑功能到详细信息 - SVR模型中的丰富形状空间,然后将剪辑文本功能映射到形状空间,并通过鼓励输入文本和渲染图像之间的剪辑一致性来优化映射。此外,我们制定了一个文本制定的形状样式化模块,以用新颖的纹理打扮出输出形状。除了从文本上生成3D Shape生成的现有作品外,我们的新方法是在不需要配对的文本形状数据的情况下创建形状的一般性。实验结果表明,我们的方法在忠诚度和与文本一致性方面优于最先进的和我们的基线。此外,我们的方法可以通过逼真的和幻想结构和纹理对生成的形状进行样式化。
translated by 谷歌翻译
随着移动设备的快速开发,现代使用的手机通常允许用户捕获4K分辨率(即超高定义)图像。然而,对于图像进行示范,在低级视觉中,一项艰巨的任务,现有作品通常是在低分辨率或合成图像上进行的。因此,这些方法对4K分辨率图像的有效性仍然未知。在本文中,我们探索了Moire模式的删除,以进行超高定义图像。为此,我们提出了第一个超高定义的演示数据集(UHDM),其中包含5,000个现实世界4K分辨率图像对,并对当前最新方法进行基准研究。此外,我们提出了一个有效的基线模型ESDNET来解决4K Moire图像,其中我们构建了一个语义对准的比例感知模块来解决Moire模式的尺度变化。广泛的实验表明了我们的方法的有效性,这可以超过最轻巧的优于最先进的方法。代码和数据集可在https://xinyu-andy.github.io/uhdm-page上找到。
translated by 谷歌翻译