Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.
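A minimal sketch of the OBMO idea, assuming a standard pinhole camera model: scaling a 3D box center by $(z+\delta)/z$ moves it along the viewing ray, so its 2D projection is unchanged while its depth shifts. The function name, the shift set, and the exponential quality score are illustrative assumptions; the paper designs two dedicated scoring strategies.

```python
import numpy as np

def obmo_pseudo_labels(center, depth_shifts):
    """Shift a 3D box center along the camera viewing ray (a sketch of the
    OBMO idea: same 2D appearance, different plausible depths).

    center: (3,) box center in camera coordinates (x, y, z).
    depth_shifts: signed depth offsets in meters (hypothetical values;
    the paper's exact shift set may differ).
    Returns a list of (pseudo_center, quality_score) pairs.
    """
    center = np.asarray(center, dtype=float)
    z = center[2]
    pseudo = []
    for dz in depth_shifts:
        new_z = z + dz
        if new_z <= 0:  # behind the camera: discard
            continue
        # Scaling by new_z / z moves the center along the viewing ray,
        # so its 2D projection (u = fx * x / z + cx) stays fixed.
        new_center = center * (new_z / z)
        # One possible quality score: decay with relative depth deviation
        # (an assumption, standing in for the paper's scoring strategies).
        score = np.exp(-abs(dz) / max(z, 1e-6))
        pseudo.append((new_center, score))
    return pseudo

# Example: a car centered 20 m ahead, with +/-0.5 m and +/-1 m pseudo depths.
for c, s in obmo_pseudo_labels([1.5, 1.0, 20.0], [-1.0, -0.5, 0.5, 1.0]):
    print(np.round(c, 3), round(s, 3))
```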
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task in computer vision. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without any further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel framework called CLIP-ES for WSSS. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to mitigate noise and focus on confident regions. Our proposed framework dramatically reduces the cost of training for WSSS and shows the capability of localizing objects in CLIP. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while requiring only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
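A minimal sketch of the softmax-GradCAM idea in a toy setup (not CLIP itself): back-propagating the softmax probability of the target class, rather than its raw logit, lets high-scoring non-target classes suppress the target's gradient. The linear head standing in for CLIP's text embeddings and all shapes are assumptions for illustration.

```python
import torch

def softmax_gradcam(feature_maps, classifier_weights, target_idx):
    """Toy sketch of softmax-GradCAM: CAM weights come from gradients of
    the softmax probability, so competing classes damp the target signal.

    feature_maps: (C, H, W) activations, requires_grad=True.
    classifier_weights: (num_classes, C) linear head standing in for the
    CLIP text embeddings (an assumption for this toy example).
    """
    pooled = feature_maps.mean(dim=(1, 2))            # GAP -> (C,)
    logits = classifier_weights @ pooled              # (num_classes,)
    prob = torch.softmax(logits, dim=0)[target_idx]   # softmax, not raw logit
    grads, = torch.autograd.grad(prob, feature_maps)
    weights = grads.mean(dim=(1, 2))                  # channel importances
    cam = torch.relu((weights[:, None, None] * feature_maps).sum(dim=0))
    return cam / (cam.max() + 1e-8)

feats = torch.randn(8, 7, 7, requires_grad=True)
head = torch.randn(4, 8)
print(softmax_gradcam(feats, head, target_idx=2).shape)  # torch.Size([7, 7])
```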
We present a framework for ranking images within their class based on the strength of spurious cues present. By measuring the gap in accuracy between the highest and lowest ranked images (we call this the spurious gap), we assess spurious feature reliance for $89$ diverse ImageNet models, finding that even the best models underperform on images with weak spurious presence. However, the effect of spurious cues varies far more dramatically across classes, emphasizing the crucial, often overlooked, class-dependence of the spurious correlation problem. While most spurious features we observe are clarifying (i.e., improving test-time accuracy when present, as is typically expected), we surprisingly find many cases of confusing spurious features, where models perform better when they are absent. We then close the spurious gap by training new classification heads on low-ranked (i.e., lacking common spurious cues) images, resulting in improved effective robustness to distribution shifts (ObjectNet, ImageNet-R, ImageNet-Sketch). We also propose a second metric to assess feature reliability, finding that spurious features are generally less reliable than non-spurious (core) ones, though again, spurious features can be more reliable for certain classes. To enable our analysis, we annotated $5,000$ feature-class dependencies over {\it all} of ImageNet as core or spurious using minimal human supervision. Finally, we show the feature discovery and spuriosity ranking framework can be extended to other datasets like CelebA and WaterBirds in a lightweight fashion with only linear layer training, leading to the discovery of a previously unknown racial bias in CelebA hair classification.
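A minimal sketch of the spurious-gap metric, assuming per-image spuriosity scores are already available (in the paper they come from annotated neuron activations); the bucket fraction and function name are illustrative assumptions.

```python
import numpy as np

def spurious_gap(spuriosity, correct, k=0.1):
    """Accuracy on the most spurious images minus accuracy on the least
    spurious images of one class (a sketch of the paper's spurious gap).

    spuriosity: (n,) per-image spuriosity scores for one class.
    correct: (n,) booleans, whether the model classified each image right.
    k: fraction of images in the top/bottom buckets (0.1 is an assumption).
    """
    order = np.argsort(spuriosity)
    m = max(1, int(k * len(order)))
    low_acc = correct[order[:m]].mean()    # weak spurious cues
    high_acc = correct[order[-m:]].mean()  # strong spurious cues
    return high_acc - low_acc

rng = np.random.default_rng(0)
scores = rng.random(500)
# Toy model that is right more often when the spurious cue is strong.
hits = rng.random(500) < (0.6 + 0.3 * scores)
print(f"spurious gap: {spurious_gap(scores, hits):.3f}")
```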
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. In order to further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCap, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io.
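A minimal sketch of how a CLIP-style visual semantic prior can guide optimization: push embeddings of novel-view renders toward the embedding of the single input photo so invisible regions stay semantically consistent. The encoder here is a toy stand-in (a real setup would use a pre-trained CLIP image encoder), and the loss form is an assumption, not ELICIT's exact objective.

```python
import torch
import torch.nn.functional as F

def semantic_prior_loss(encode_image, rendered, reference):
    """Cosine-similarity loss between a rendered view and the reference
    photo in an image-embedding space (sketch of a CLIP semantic prior).

    encode_image: any image encoder mapping (B, 3, H, W) -> (B, D).
    """
    z_render = F.normalize(encode_image(rendered), dim=-1)
    z_ref = F.normalize(encode_image(reference), dim=-1)
    # Maximizing cosine similarity == minimizing (1 - cos).
    return (1.0 - (z_render * z_ref).sum(dim=-1)).mean()

# Toy stand-in encoder (assumption; swap in a CLIP image encoder in practice).
toy_encoder = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
render = torch.rand(2, 3, 32, 32)
photo = torch.rand(2, 3, 32, 32)
print(semantic_prior_loss(toy_encoder, render, photo))
```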
Despite their apparent ability to distinguish samples from the target distribution, deep neural networks perform poorly at detecting out-of-distribution data. To remedy this deficiency, state-of-the-art solutions choose to train deep networks on an auxiliary dataset of outliers. Various training criteria for these auxiliary outliers have been proposed based on heuristic intuitions. However, we find that these intuitively designed outlier training criteria can hurt in-distribution learning and eventually lead to inferior performance. To this end, we identify three causes of the in-distribution incompatibility: contradictory gradients, false likelihoods, and distribution shift. Based on our new understanding, we propose a new out-of-distribution detection method by adapting both the top-level design of the deep model and the loss function. Our method achieves in-distribution compatibility by reducing the interference with the probabilistic characteristics of in-distribution features. On several benchmarks, our method not only achieves state-of-the-art out-of-distribution detection performance, but also improves in-distribution accuracy.
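To make the setting concrete, here is a sketch of the kind of heuristic auxiliary-outlier criterion the abstract critiques (the classic Outlier Exposure objective, plainly not this paper's compatibility-aware method): cross-entropy on in-distribution data plus a term pushing outlier predictions toward uniform. The weighting `lam` is a hypothetical hyper-parameter.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_in, labels_in, logits_out, lam=0.5):
    """Outlier-Exposure-style criterion (illustrative baseline, not the
    proposed method): CE on in-distribution batches, plus a penalty that
    flattens the model's predictions on auxiliary outliers.
    """
    ce = F.cross_entropy(logits_in, labels_in)
    log_probs_out = F.log_softmax(logits_out, dim=-1)
    # KL to the uniform distribution reduces (up to a constant) to the
    # negative mean log-probability over classes.
    uniform_term = -log_probs_out.mean()
    return ce + lam * uniform_term

logits_id = torch.randn(8, 10)
labels_id = torch.randint(0, 10, (8,))
logits_ood = torch.randn(8, 10)
print(outlier_exposure_loss(logits_id, labels_id, logits_ood))
```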
The scene graph generation (SGG) task aims to detect all objects and their pairwise visual relationships in a given image. Although SGG has achieved remarkable progress over the last few years, almost all existing SGG models follow the same training paradigm: they treat both object and predicate classification in SGG as a single-label classification problem, where the ground truths are one-hot target labels. However, this prevailing training paradigm overlooks two characteristics of current SGG datasets: 1) For positive samples, some specific subject-object instances may have multiple reasonable predicates. 2) For negative samples, there are numerous missing annotations. Regardless of these two characteristics, SGG models are easily confused and make wrong predictions. To this end, we propose a novel model-agnostic Label Semantic Knowledge Distillation (LS-KD) for unbiased SGG. Specifically, LS-KD dynamically generates a soft label for each subject-object instance by fusing a predicted Label Semantic Distribution (LSD) with its original one-hot target label. The LSD reflects the correlations between this instance and multiple predicate categories. Meanwhile, we propose two different strategies to predict the LSD: iterative self-KD and synchronous self-KD. Extensive ablations and results on three SGG tasks demonstrate the superiority and generality of our proposed LS-KD, which consistently achieves decent trade-off performance across different predicate categories.
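A minimal sketch of the core LS-KD step: fusing the predicted label semantic distribution with the one-hot ground truth into a soft target. The fixed fusion weight `alpha` is an assumption for illustration; the paper obtains the LSD via iterative or synchronous self-KD rather than a fixed mix.

```python
import torch
import torch.nn.functional as F

def fuse_soft_label(pred_logits, gt_index, alpha=0.7):
    """Build a soft multi-predicate target for each subject-object pair
    by mixing the one-hot label with the predicted LSD (sketch).

    pred_logits: (B, num_predicates) predicate logits.
    gt_index: (B,) ground-truth predicate indices.
    alpha: hypothetical fusion weight between label and LSD.
    """
    lsd = F.softmax(pred_logits, dim=-1)              # predicted correlations
    one_hot = F.one_hot(gt_index, pred_logits.shape[-1]).float()
    return alpha * one_hot + (1.0 - alpha) * lsd      # soft target

predicate_logits = torch.randn(4, 51)                 # e.g. 50 predicates + bg
gt = torch.tensor([3, 17, 0, 42])
soft = fuse_soft_label(predicate_logits, gt)
print(soft.sum(dim=-1))                               # rows still sum to 1
```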
Two-stage detectors have gained much popularity in 3D object detection. Most two-stage 3D detectors utilize grid points, voxel grids, or sampled keypoints for RoI feature extraction in the second stage. Such methods, however, are inefficient at handling unevenly distributed and sparse outdoor points. This paper addresses this problem in three aspects. 1) Dynamic Point Aggregation. We propose patch search to quickly search the points in a local region for each 3D proposal. Farthest voxel sampling is then applied to sample the points evenly. In particular, the voxel size varies with distance to accommodate the uneven distribution of points. 2) RoI-graph Pooling. We build local graphs on the sampled points to better model contextual information and mine point relations through iterative message passing. 3) Visual Features Augmentation. We introduce a simple yet effective fusion strategy to compensate sparse LiDAR points with limited semantic cues. Based on these modules, we construct Graph R-CNN as our second stage, which can be applied to existing one-stage detectors to consistently improve detection performance. Extensive experiments show that Graph R-CNN outperforms state-of-the-art 3D detection models by a large margin on both the KITTI and Waymo Open Dataset. We rank first on the KITTI BEV car detection leaderboard. The code will be available at \url{https://github.com/nightmare-n/graphrcnn}.
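A minimal sketch of the distance-adaptive voxelization idea behind farthest voxel sampling: voxel size grows with range, so sparse far-away points are not over-thinned. The range bins and sizes are hypothetical hyper-parameters, and the paper additionally runs farthest sampling over the kept voxels.

```python
import numpy as np

def range_binned_voxel_sample(points, edges=(0.0, 20.0, 40.0, np.inf),
                              sizes=(0.1, 0.2, 0.4)):
    """Keep one point per voxel, with coarser voxels at larger range
    (sketch; edges/sizes are illustrative assumptions).

    points: (N, 3) point coordinates.
    """
    dists = np.linalg.norm(points[:, :3], axis=1)
    kept = []
    for lo, hi, size in zip(edges[:-1], edges[1:], sizes):
        shell = points[(dists >= lo) & (dists < hi)]
        if len(shell) == 0:
            continue
        keys = np.floor(shell[:, :3] / size).astype(np.int64)
        _, idx = np.unique(keys, axis=0, return_index=True)
        kept.append(shell[np.sort(idx)])              # one point per voxel
    return np.concatenate(kept, axis=0)

pts = np.random.randn(2000, 3) * 20.0                 # toy outdoor-ish cloud
print(len(range_binned_voxel_sample(pts)), "of", len(pts), "points kept")
```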
Data poisoning considers an adversary that distorts the training set of machine learning algorithms for malicious purposes. In this work, we bring to light a conjecture about the fundamentals of data poisoning, which we call the Lethal Dose Conjecture. The conjecture states: If $n$ clean training samples are needed for accurate predictions, then in a size-$N$ training set, only $\Theta(N/n)$ poisoned samples can be tolerated while ensuring accuracy. Theoretically, we verify this conjecture in multiple cases. We also offer a more general perspective on this conjecture through distribution discrimination. Deep Partition Aggregation (DPA) and its extension, Finite Aggregation (FA), are recent approaches for provable defenses against data poisoning, where they make predictions through the majority vote of many base models trained from different subsets of the training set using a given learner. The conjecture implies that both DPA and FA are (asymptotically) optimal -- given the most data-efficient learner, they can transform it into one of the most robust defenses against data poisoning. This outlines a practical approach to developing stronger defenses against poisoning by finding data-efficient learners. Empirically, as a proof of concept, we show that by simply using different data augmentations for the base learners, we can respectively double and triple the certified robustness of DPA on CIFAR-10 and GTSRB without sacrificing accuracy.
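A minimal sketch of the DPA vote that the conjecture shows is asymptotically optimal: each base model is trained on one of $k$ disjoint partitions, so a poisoned sample lands in exactly one partition and can flip at most one vote. The certified-radius formula below ignores tie-breaking details, and the constant-classifier "models" are toy stand-ins.

```python
import numpy as np

def dpa_predict(models, x, num_classes):
    """Plurality vote over base models trained on disjoint partitions
    (sketch of DPA), plus a gap-based certified radius.

    models: callables mapping an input to a class index.
    """
    votes = np.bincount([m(x) for m in models], minlength=num_classes)
    pred = int(votes.argmax())
    # Each poison flips at most one vote, so the prediction survives
    # roughly half the vote gap (tie-breaking details omitted).
    runner_up = int(np.sort(votes)[-2])
    certified_radius = int((votes[pred] - runner_up) // 2)
    return pred, certified_radius

# Toy base "models": constant classifiers standing in for trained nets.
toy_models = [(lambda x, c=c: c) for c in [0, 0, 0, 1, 2, 0, 1, 0]]
print(dpa_predict(toy_models, x=None, num_classes=3))  # (0, 1)
```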
Data poisoning attacks aim to manipulate model behavior by distorting the training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of the data, thereby restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training the base classifiers. This reduces the worst-case impact of poisoned samples and thereby improves the certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87%, and 4.77%, respectively, while keeping the same clean accuracies as DPA, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning.
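A minimal sketch of Finite Aggregation's subset construction: hash samples into $k \cdot d$ small disjoint buckets, then give each of the $k$ base classifiers the union of $d$ buckets, so every sample influences $d$ (not one) models while each poison's worst-case influence stays tightly bounded. The seeded-permutation "hashes" and the values of `k`, `d` are illustrative assumptions.

```python
import numpy as np

def finite_aggregation_subsets(sample_ids, k=4, d=2, seed=0):
    """Build k overlapping training subsets from k*d disjoint buckets
    (sketch of FA's spreading construction).

    sample_ids: identifiers of the training samples.
    k: number of base classifiers; d: spread degree (assumed values).
    """
    rng = np.random.default_rng(seed)
    bucket_of = rng.integers(0, k * d, size=len(sample_ids))   # hash h(x)
    # Spreading hash: each bucket is assigned to d of the k classifiers.
    assign = {b: rng.choice(k, size=d, replace=False) for b in range(k * d)}
    subsets = [[] for _ in range(k)]
    for sid, b in zip(sample_ids, bucket_of):
        for clf in assign[b]:
            subsets[clf].append(sid)
    return subsets

subs = finite_aggregation_subsets(list(range(100)))
print([len(s) for s in subs])   # overlapping subsets, each ~ d/k of the data
```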
Point cloud upsampling aims to densify sparse point sets acquired from 3D sensors, thereby providing a denser representation of the underlying surface. Existing methods divide the input points into small patches and upsample each patch separately, however, ignoring the global spatial consistency between patches. In this paper, we present a novel method, PC$^2$-PU, which explores patch-to-patch and point-to-point correlations for more effective and robust point cloud upsampling. Specifically, our network has two appealing designs: (i) We take adjacent patches as supplementary inputs to compensate for the structural information lost within a single patch, and introduce a Patch Correlation Module to capture the differences and similarities between patches. (ii) After augmenting each patch's geometry, we further introduce a Point Correlation Module to reveal the relationships of points inside each patch so as to maintain local spatial consistency. Extensive experiments on both synthetic and real scanned datasets demonstrate that our method surpasses previous upsampling methods, particularly with noisy inputs. The code and data are available at \url{https://github.com/chenlongwhu/pc2-pu.git}.
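A minimal sketch of the patch construction with supplementary neighbors: alongside the target kNN patch, also return adjacent patches so an upsampler can restore structure lost by cutting patches independently. The patch size, neighbor count, and the heuristic of seeding neighbor patches from the nearest points just outside the target patch are all illustrative assumptions.

```python
import numpy as np

def patch_with_neighbors(points, seed_idx, patch_size=64, num_neighbors=2):
    """Extract a kNN patch around a seed plus its adjacent patches
    (sketch of supplying neighboring patches as supplementary inputs).

    points: (N, 3) point cloud; seed_idx: index of the target patch seed.
    """
    d = np.linalg.norm(points - points[seed_idx], axis=1)
    target = points[np.argsort(d)[:patch_size]]        # kNN patch
    # Neighbor seeds: nearest points just outside the target patch
    # (a heuristic assumption, not the paper's exact seeding rule).
    outside = np.argsort(d)[patch_size:patch_size + num_neighbors]
    neighbors = [points[np.argsort(
        np.linalg.norm(points - points[s], axis=1))[:patch_size]]
        for s in outside]
    return target, neighbors

cloud = np.random.rand(1024, 3)
tgt, nbrs = patch_with_neighbors(cloud, seed_idx=0)
print(tgt.shape, [n.shape for n in nbrs])   # (64, 3) [(64, 3), (64, 3)]
```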