A recent study has revealed a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method brings significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
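To make the "simplex equiangular tight frame" structure concrete, the following is a minimal sketch of the standard textbook construction, not the paper's regularizer. It assumes the feature dimension equals the number of classes, and the function names are hypothetical. Each class vector has unit norm, and every pair of distinct vectors has cosine similarity -1/(K-1), which is the equiangular, maximally separated configuration described above.

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K that form a simplex equiangular tight frame.

    Assumption for illustration: the feature dimension equals K.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Row i is the i-th basis vector minus the mean of all basis vectors,
    # rescaled so that it has unit norm.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


num_classes = 4
vectors = simplex_etf(num_classes)
norms = [math.sqrt(sum(a * a for a in v)) for v in vectors]
pairwise = [
    cosine(vectors[i], vectors[j])
    for i in range(num_classes)
    for j in range(i + 1, num_classes)
]
# Every entry of `pairwise` equals -1 / (num_classes - 1) up to rounding.
```

Comparing the pairwise cosines of learned feature centers against this reference value is one simple way to see how far a trained segmentation model deviates from the symmetric structure.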
translated by 谷歌翻译
The transformer architecture, which has recently seen booming applications in vision tasks, has pivoted away from the widespread convolutional paradigm. Relying on a tokenization process that splits inputs into multiple tokens, transformers are capable of extracting pairwise relationships among tokens using self-attention. Although the tokenizer is the foundational building block of transformers, what makes for a good tokenizer has not been well understood in computer vision. In this work, we investigate this uncharted problem from an information trade-off perspective. In addition to unifying and understanding existing structural modifications, our derivation leads to better design strategies for vision tokenizers. The proposed Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. Furthermore, a regularization objective, TokenProp, is added to the standard training regime. Through extensive experiments on various transformer architectures, we observe both improved performance and intriguing properties of these two plug-and-play designs with negligible computational overhead. These observations further indicate the importance of the commonly omitted tokenizer designs in vision transformers.
Deep Neural Networks (DNNs) are vulnerable to black-box adversarial attacks, which are highly transferable. This threat comes from the distribution gap between adversarial and clean samples in the feature space of the target DNNs. In this paper, we use Deep Generative Networks (DGNs) with a novel training mechanism to eliminate the distribution gap. The trained DGNs align the distribution of adversarial samples with that of clean ones for the target DNNs by translating pixel values. Unlike previous work, we propose a more effective pixel-level training constraint to make this achievable, thus enhancing robustness on adversarial samples. Further, a class-aware feature-level constraint is formulated for integrated distribution alignment. Our approach is general and applicable to multiple tasks, including image classification, semantic segmentation, and object detection. We conduct extensive experiments on different datasets. Our strategy demonstrates its effectiveness and generality against black-box attacks.
Generative adversarial networks (GANs) have achieved great success in image inpainting yet still have difficulty handling large missing regions. In contrast, iterative algorithms, such as autoregressive and denoising diffusion models, have to be deployed with massive computing resources to achieve decent results. To overcome the respective limitations, we present a novel spatial diffusion model (SDM) that uses a few iterations to gradually deliver informative pixels to the entire image, largely enhancing the inference efficiency. Also, thanks to the proposed decoupled probabilistic modeling and spatial diffusion scheme, our method achieves high-quality large-hole completion. On multiple benchmarks, we achieve new state-of-the-art performance. Code is released at https://github.com/fenglinglwb/SDM.
Existing 3D scene stylization methods employ an arbitrary style reference to transfer textures and colors as styles without establishing meaningful semantic correspondences. We present Reference-Based Non-Photorealistic Radiance Fields, i.e., Ref-NPR. It is a controllable scene stylization method utilizing radiance fields to stylize a 3D scene, with a single stylized 2D view taken as reference. To achieve decent results, we propose a ray registration process based on the stylized reference view to obtain pseudo-ray supervision in novel views, and exploit the semantic correspondence in content images to fill occluded regions with perceptually similar styles. Combining these operations, Ref-NPR generates non-photorealistic and continuous novel view sequences with a single reference while obtaining reasonable stylization in occluded regions. Experiments show that Ref-NPR significantly outperforms other scene and video stylization methods in terms of both visual quality and semantic correspondence. Code and data will be made publicly available.
In dense image segmentation tasks (e.g., semantic, panoptic), existing methods can hardly generalize well to unseen image domains, classes beyond the predefined set, and variations in image resolution and quality. Motivated by these observations, we construct a large-scale entity segmentation dataset to explore fine-grained entity segmentation, with a strong focus on open-world and high-quality dense segmentation. The dataset contains images spanning diverse image domains and resolutions, along with high-quality mask annotations for training and testing. Given the high-quality and high-resolution nature of the dataset, we propose CropFormer for high-quality segmentation, which can improve mask prediction using high-resolution image crops that provide more fine-grained image details than the full image. CropFormer is the first query-based Transformer architecture that can effectively ensemble mask predictions from multiple image crops, by learning queries that can associate the same entities across the full image and its crop. With CropFormer, we achieve a significant AP gain of 1.9 on the challenging fine-grained entity segmentation task. The dataset and code will be released at http://luqi.info/entityv2.github.io/.
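In this paper, we propose Generalized Parametric Contrastive Learning (GPaCo/PaCo), which works well on both imbalanced and balanced data. Based on theoretical analysis, we observe that the supervised contrastive loss tends to be biased toward high-frequency classes and thus increases the difficulty of imbalanced learning. We introduce a set of parametric, class-wise learnable centers to rebalance from an optimization perspective. Further, we analyze our GPaCo/PaCo loss under a balanced setting. Our analysis demonstrates that GPaCo/PaCo can adaptively enhance the intensity of pushing samples of the same class close together as more samples are pulled toward their corresponding centers, which benefits hard example learning. Experiments on long-tailed benchmarks establish a new state of the art for long-tailed recognition. On full ImageNet, models ranging from CNNs to vision transformers trained with the GPaCo loss show better generalization performance and stronger robustness compared with MAE models. Moreover, GPaCo can be applied to the semantic segmentation task, and clear improvements are observed on the 4 most popular benchmarks. Our code is available at https://github.com/dvlab-research/parametric-contrastive-learning.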
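In this paper, we present a simple seq2seq formulation for view synthesis, in which we take a set of ray points as input and output the colors corresponding to the rays. Directly applying a standard transformer to this seq2seq formulation has two limitations. First, standard attention cannot successfully fit the volumetric rendering procedure, so high-frequency components are missing in the synthesized views. Second, applying global attention to all rays and pixels is extremely inefficient. Inspired by the neural radiance field (NeRF), we propose NeRF attention (NeRFA) to address the above problems. On the one hand, NeRFA treats the volumetric rendering equation as a soft feature modulation procedure. In this way, the feature modulation enhances the transformer with a NeRF-like inductive bias. On the other hand, NeRFA performs multi-stage attention to reduce the computational overhead. Furthermore, the NeRFA model adopts ray and pixel transformers to learn the interactions between rays and pixels. NeRFA demonstrates superior performance over NeRF and NerFormer on four datasets: DeepVoxels, Blender, LLFF, and CO3D. In addition, NeRFA establishes a new state of the art under two settings: single-scene view synthesis and category-centric novel view synthesis. The code will be made publicly available.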
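For readers unfamiliar with the volumetric rendering equation mentioned above, here is a minimal sketch of the standard discrete NeRF quadrature along a single ray. It is not the NeRFA attention module itself; the function names and the toy inputs are assumptions for illustration. Each sample receives a weight equal to its opacity times the transmittance accumulated before it, and the same weights can be used to blend colors or, hypothetically, per-sample feature vectors, which is the sense in which the equation acts as a soft modulation.

```python
import math


def rendering_weights(densities, deltas):
    """Per-sample weights of the discrete volume rendering quadrature.

    densities: non-negative volume densities sigma_i at samples along a ray.
    deltas: distances delta_i between consecutive samples.
    """
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha  # light remaining after this segment
    return weights


def composite(weights, values):
    """Weighted sum of per-sample vectors (e.g., RGB colors or features)."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Toy ray with two samples of equal density and spacing (illustrative values).
weights = rendering_weights([1.0, 1.0], [1.0, 1.0])
color = composite(weights, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
```

The weights never sum to more than one, and samples behind dense regions receive little weight because the transmittance has already decayed.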
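Unsupervised domain adaptation in semantic segmentation has been proposed to alleviate the reliance on expensive pixel-wise annotations. It leverages a labeled source domain dataset together with unlabeled target domain images to learn a segmentation network. In this paper, we observe two main issues with the existing domain-invariant learning framework. (1) Distracted by feature distribution alignment, the network cannot focus on the segmentation task. (2) Fitting the source domain data well compromises target domain performance. To address these issues, we propose DecoupleNet, which alleviates overfitting to the source domain and enables the final model to focus more on the segmentation task. Furthermore, we put forward Self-Discrimination (SD) and introduce an auxiliary classifier to learn more discriminative target domain features with pseudo labels. Finally, we propose Online Enhanced Self-Training (OEST) to contextually enhance the quality of pseudo labels in an online manner. Experiments show that our method outperforms existing state-of-the-art methods, and extensive ablation studies verify the effectiveness of each component. Code is available at https://github.com/dvlab-research/decouplenet.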
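Multi-object tracking (MOT) requires detecting and associating objects across frames. Unlike tracking via detected bounding boxes or tracking objects as points, we propose tracking objects as pixel-wise distributions. We instantiate this idea in a transformer-based architecture, P3AFormer, with pixel-wise propagation, prediction, and association. P3AFormer propagates pixel-wise features guided by flow information to pass messages between frames. Furthermore, P3AFormer adopts a meta-architecture to produce multi-scale object feature maps. During inference, a pixel-wise association procedure is proposed to recover object connections across frames based on the pixel-wise predictions. P3AFormer achieves 81.2% MOTA on the MOT17 benchmark, making it the first transformer network in the literature to reach 80% MOTA. P3AFormer also outperforms state-of-the-art methods on the MOT20 and KITTI benchmarks.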