Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects at different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Code has been released at https://github.com/mrsempress/OBMO.
Translated by Google Translate
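The frustum shift described in the OBMO abstract above can be illustrated with a minimal sketch. It assumes a pinhole camera and moves only the box center along its viewing ray, so the projected center stays fixed while the depth changes. The function names, the intrinsics, and the linear quality score are hypothetical choices for illustration; they are not the two scoring strategies of the paper, which the abstract does not specify.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def project(center: Point3D, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = center
    return (fx * x / z + cx, fy * y / z + cy)


def shift_along_ray(center: Point3D, dz: float) -> Point3D:
    """Move a box center along its viewing ray so that x/z and y/z are preserved."""
    x, y, z = center
    new_z = z + dz
    if z <= 0 or new_z <= 0:
        raise ValueError("depth must stay positive")
    scale = new_z / z
    return (x * scale, y * scale, new_z)


def pseudo_labels(center: Point3D, offsets: Sequence[float],
                  tolerance: float) -> List[Tuple[Point3D, float]]:
    """Build (shifted center, score) pairs.

    The score is an assumed linear decay in |dz|, used here only as a
    placeholder for a label quality measure.
    """
    labels = []
    for dz in offsets:
        shifted = shift_along_ray(center, dz)
        score = max(0.0, 1.0 - abs(dz) / tolerance)
        labels.append((shifted, score))
    return labels


if __name__ == "__main__":
    original = (2.0, 1.0, 20.0)
    for shifted, score in pseudo_labels(original, [-2.0, 0.0, 2.0], tolerance=4.0):
        # The projected center is the same for every shifted label.
        print(shifted, score, project(shifted, 700.0, 700.0, 600.0, 180.0))
```

Because the box dimensions are held fixed in this sketch, the projected 2D box would change slightly in size as the depth changes; only the projected center is exactly invariant.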
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task in computer vision. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training (CLIP) models to localize different categories with only image-level labels and without any further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel framework called CLIP-ES for WSSS. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to mitigate noise and focus on confident regions. Our proposed framework dramatically reduces the cost of training for WSSS and shows the capability of localizing objects in CLIP. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time previous methods need for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
Despite the tremendous progress of Masked Autoencoders (MAE) in vision tasks on images and videos, exploring MAE in large-scale 3D point clouds remains challenging due to their inherent irregularity. In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm. The core idea is to apply a \textbf{G}enerative \textbf{D}ecoder for MAE (GD-MAE) that automatically merges the surrounding context to restore the masked geometric knowledge in a hierarchical fusion manner. In doing so, our approach avoids heuristic decoder design and enjoys the flexibility of exploring various masking strategies. The corresponding part costs less than \textbf{12\%} latency compared with conventional methods, while achieving better performance. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: Waymo, KITTI, and ONCE. Consistent improvement on downstream detection tasks illustrates strong robustness and generalization capability. Our method not only achieves state-of-the-art results but, remarkably, also reaches comparable accuracy with only \textbf{20\%} of the labeled data on the Waymo dataset. The code will be released at \url{https://github.com/Nightmare-n/GD-MAE}.
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io .
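Despite their remarkable ability to classify in-distribution samples, deep neural networks perform poorly at detecting out-of-distribution data. To address this defect, state-of-the-art solutions train deep networks on an auxiliary dataset of outliers. Various training criteria for these auxiliary outliers have been proposed based on heuristic intuitions. However, we find that these intuitively designed outlier training criteria can hurt in-distribution learning and eventually lead to inferior performance. To this end, we identify three causes of in-distribution incompatibility: contradictory gradients, false likelihood, and distribution shift. Based on this new understanding, we propose a new out-of-distribution detection method that adapts both the top-level design of deep models and the loss function. Our method achieves in-distribution compatibility by reducing interference with the probabilistic characteristics of in-distribution features. On several benchmarks, our method not only achieves state-of-the-art out-of-distribution detection performance but also improves in-distribution accuracy.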
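In recent years, neuroscientists have been interested in the development of brain-computer interface (BCI) devices. Patients with motor disorders may benefit from BCIs as a means of communication and for restoring motor function. Electroencephalography (EEG) is one of the most commonly used methods for assessing neuronal activity. Deep neural networks (DNNs) have shown significant advantages in many computer vision applications. Toward the eventual use of DNNs, we present a shallow neural network that mainly uses two convolutional neural network (CNN) layers, has relatively few parameters, and quickly learns spectral-temporal features from EEG. We compared this model with three other neural network models of different depths on a mental arithmetic task performed in an eyes-closed state, adapted for patients with motor disorders and declining visual function. Experimental results show that the shallow CNN model outperformed all other models and achieved the highest classification accuracy of 90.68%. It is also more robust in cross-subject classification: the standard deviation of accuracy is only 3%, compared with 15.6% for the conventional method.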
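Safe reinforcement learning (RL) studies problems in which an intelligent agent must not only maximize reward but also avoid exploring unsafe regions. In this study, we propose CUP, a novel policy optimization method based on the Constrained Update Projection framework that enjoys rigorous safety guarantees. Central to the development of CUP are newly proposed surrogate functions along with performance bounds. Compared with previous safe RL methods, CUP offers the following benefits: 1) CUP generalizes the surrogate functions to the generalized advantage estimator (GAE), leading to strong empirical performance; 2) CUP unifies performance bounds, providing better understanding and interpretability of some existing algorithms; 3) CUP provides a non-convex implementation using only first-order optimizers, which does not require any strong approximation on the convexity of the objectives. To validate CUP, we compared it with a comprehensive list of safe RL baselines on a wide range of tasks. Experiments show the effectiveness of CUP in terms of both reward and safety constraint satisfaction. We have released the source code of CUP at https://github.com/rl-boxes/safe-rl/tree/main/cup.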
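For readers unfamiliar with the generalized advantage estimator (GAE) named in the CUP abstract, the standard GAE recursion can be sketched as follows. This shows only the well-known estimator, not the CUP surrogate functions or projection step, which the abstract does not detail. The function name, argument layout, and default values of `gamma` and `lam` are assumptions made for illustration.

```python
from typing import List, Sequence


def generalized_advantage_estimates(rewards: Sequence[float],
                                    values: Sequence[float],
                                    gamma: float = 0.99,
                                    lam: float = 0.95) -> List[float]:
    """Compute GAE advantages for one trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(generalized_advantage_estimates([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

Setting `lam` to 0 reduces each advantage to the one-step TD error, while `lam` equal to 1 gives the full discounted return minus the value baseline; intermediate values trade bias against variance.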
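In this paper, we present a novel object-level mapping system that can simultaneously segment, track, and reconstruct objects in dynamic scenes. It can further predict and complete their full geometries by conditioning on reconstructions from depth inputs and a category-level shape prior, with the aim that completed object geometry leads to better object reconstruction and tracking accuracy. For each incoming RGB-D frame, we perform instance segmentation to detect objects and build data associations between the detections and the existing object maps. A new object map is created for each unmatched detection. For each matched object, we jointly optimize its pose and latent geometry representation using geometric residuals and differential rendering residuals with respect to its shape prior and completed geometry. Our approach shows better tracking and reconstruction performance than methods that use traditional volumetric mapping or learned shape priors. We evaluate its effectiveness through quantitative and qualitative tests on both synthetic and real-world sequences.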
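In this paper, we present a tightly coupled visual-inertial object-level multi-instance dynamic SLAM system. Even in extremely dynamic scenes, it can robustly optimize for the camera pose, velocity, and IMU biases and build a dense, object-level 3D reconstruction map. Thanks to its robust sensor and object tracking, our system can robustly track and reconstruct the geometry, semantics, and motion of arbitrary objects by incrementally fusing associated color, depth, semantic, and foreground object probabilities. In addition, when an object is lost or moves outside the camera field of view, our system can reliably recover its pose upon re-observation. We demonstrate the robustness and accuracy of our method through quantitative and qualitative tests on real-world data sequences.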
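E-commerce has gone a long way in empowering merchants through the internet. To store goods efficiently and arrange marketing resources properly, it is important for them to make accurate gross merchandise value (GMV) predictions. However, making accurate predictions is non-trivial given the lack of digitized data. In this paper, we present a solution to better forecast GMV inside the Alipay app. Building on graph neural networks (GNNs), which are well suited to correlating different entities to enrich information, we propose Gaia, a GNN model with temporal shift aware attention. Gaia leverages the sales information of relevant e-sellers and learns neighbor correlations based on temporal dependencies. In tests on Alipay's real dataset and in comparison with other baselines, Gaia shows the best performance. Gaia has also been deployed in a simulated online environment, where it likewise achieves a large improvement over the baselines.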