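Glass is very common in our daily life. Existing computer vision systems neglect it, which may have severe consequences; for example, a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass. In this paper, we pose an important problem: detecting glass surfaces from a single RGB image. To address this problem, we construct the first large-scale glass detection dataset (GDD) and propose a novel glass detection network, called GDNet-B, which explores abundant contextual cues from a large field of view through a novel large-field contextual feature integration (LCFI) module and integrates high-level and low-level boundary features with a boundary feature enhancement (BFE) module. Extensive experiments demonstrate that GDNet-B achieves satisfying glass detection results on images both within and beyond the GDD test set. We further validate the effectiveness and generalization capability of GDNet-B by applying it to other vision tasks, including mirror segmentation and salient object detection. Finally, we show potential applications of glass detection and discuss possible future research directions.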
translated by 谷歌翻译
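Sparse general matrix-matrix multiplication (SpGEMM) is a fundamental building block in many scientific applications. A key task in SpGEMM is to compute or predict the structure of the output matrix (i.e., the number of nonzero elements in each output row) for efficient memory allocation and load balancing, which affects the overall performance of SpGEMM. Existing work either computes the output structure exactly or adopts upper-bound or sampling-based methods to predict it. However, these methods either require too much execution time or are not accurate enough. In this paper, we propose a novel sampling-based method that achieves better accuracy at low cost compared with the existing sampling-based method. The method first predicts the compression ratio of SpGEMM from the number of intermediate products (denoted FLOP) and the number of nonzero elements (denoted NNZ) of the same sampled result matrix. The predicted output structure is then obtained by dividing the FLOP of each output row by the predicted compression ratio. We also propose a reference design of the existing sampling-based method with optimized computational overhead, in order to demonstrate the better accuracy of the proposed method. We construct 625 test cases with various matrix dimensions and sparsity structures to evaluate the prediction accuracy. Experimental results show that the absolute relative errors of the proposed method and the reference design are 1.56% and 8.12%, respectively, rising to 25% and 156% in the worst case.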
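The two-step prediction described in the SpGEMM abstract (estimate a compression ratio from sampled rows, then divide each row's FLOP by it) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sparsity-pattern representation (lists of column indices), the uniform row sampling, and all function names are assumptions made here.

```python
import random


def row_flop(a_rows, b_rows):
    """FLOP per output row: for each nonzero A[i, k], add the nnz of row k of B."""
    return [sum(len(b_rows[k]) for k in cols) for cols in a_rows]


def row_nnz_exact(a_cols, b_rows):
    """Exact nnz of one output row: size of the union of the touched rows of B."""
    out = set()
    for k in a_cols:
        out.update(b_rows[k])
    return len(out)


def predict_row_nnz(a_rows, b_rows, sample_size, seed=0):
    """Predict the nnz of every output row of C = A * B from a row sample.

    a_rows[i] lists the column indices of the nonzeros in row i of A,
    b_rows[k] lists the column indices of the nonzeros in row k of B.
    Uniform sampling of rows of A is an assumption of this sketch.
    """
    flop = row_flop(a_rows, b_rows)
    rng = random.Random(seed)
    sampled = rng.sample(range(len(a_rows)), min(sample_size, len(a_rows)))
    # FLOP and NNZ are measured on the same sampled result rows.
    s_flop = sum(flop[i] for i in sampled)
    s_nnz = sum(row_nnz_exact(a_rows[i], b_rows) for i in sampled)
    ratio = s_flop / s_nnz if s_nnz else 1.0  # predicted compression ratio
    return [f / ratio for f in flop], ratio


# Tiny example: 3x3 sparsity patterns.
A = [[0, 1], [1], [0, 1, 2]]
B = [[0, 1], [1, 2], [2]]
pred, ratio = predict_row_nnz(A, B, sample_size=3)
```

Only the sampled rows need the exact (and more expensive) union computation; every other row reuses the cheap FLOP count, which is where the cost saving of a sampling-based predictor comes from.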
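When an agent interacts with a multi-agent environment, dealing with a variety of previously unseen opponents is challenging. Modeling the behaviors, goals, or beliefs of opponents can help the agent adjust its policy to adapt to different opponents. In addition, it is important to consider opponents that are learning simultaneously or are capable of reasoning. However, existing work usually tackles only one of these opponent types. In this paper, we propose model-based opponent modeling (MBOM), which employs an environment model to adapt to all kinds of opponents. MBOM simulates the recursive reasoning process in the environment model and imagines a set of improving opponent policies. To represent the opponent policy effectively and accurately, MBOM further mixes the imagined opponent policies according to their similarity to the real behaviors of opponents. Empirically, we show that MBOM adapts more effectively than existing methods across a variety of tasks with different types of opponents, i.e., fixed policy, naïve learner, and reasoning learner.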
In the field of cross-modal retrieval, single encoder models tend to perform better than dual encoder models, but they suffer from high latency and low throughput. In this paper, we present a dual encoder model called BagFormer that utilizes a cross-modal interaction mechanism to improve recall performance without sacrificing latency and throughput. BagFormer achieves this through bag-wise interactions, which transform text to a more appropriate granularity and incorporate entity knowledge into the model. Our experiments demonstrate that BagFormer achieves results comparable to state-of-the-art single encoder models in cross-modal retrieval tasks, while also offering efficient training and inference with 20.72 times lower latency and 25.74 times higher throughput.
Pedestrian detection in the wild remains a challenging problem, especially for scenes containing serious occlusion. In this paper, we propose a novel feature learning method in the deep learning framework, referred to as Feature Calibration Network (FC-Net), to adaptively detect pedestrians under various occlusions. FC-Net is based on the observation that the visible parts of pedestrians are selective and decisive for detection, and is implemented as a self-paced feature learning framework with a self-activation (SA) module and a feature calibration (FC) module. In a new self-activated manner, FC-Net learns features which highlight the visible parts and suppress the occluded parts of pedestrians. The SA module estimates pedestrian activation maps by reusing classifier weights, without any additional parameters, resulting in an extremely parsimonious model that reinforces the semantics of features, while the FC module calibrates the convolutional features for adaptive pedestrian representation in both pixel-wise and region-based ways. Experiments on the CityPersons and Caltech datasets demonstrate that FC-Net improves detection performance on occluded pedestrians by up to 10% while maintaining excellent performance on non-occluded instances.
Event cameras, offering high temporal resolutions and high dynamic ranges, have brought a new perspective to address common challenges (e.g., motion blur and low light) in monocular depth estimation. However, how to effectively exploit the sparse spatial information and rich temporal cues from asynchronous events remains a challenging endeavor. To this end, we propose a novel event-based monocular depth estimator with recurrent transformers, namely EReFormer, which is the first pure transformer with a recursive mechanism to process continuous event streams. Technically, for spatial modeling, a novel transformer-based encoder-decoder with a spatial transformer fusion module is presented, having better global context information modeling capabilities than CNN-based methods. For temporal modeling, we design a gate recurrent vision transformer unit that introduces a recursive mechanism into transformers, improving temporal modeling capabilities while alleviating the expensive GPU memory cost. The experimental results show that our EReFormer outperforms state-of-the-art methods by a margin on both synthetic and real-world datasets. We hope that our work will attract further research to develop stunning transformers in the event-based vision community. Our open-source code can be found in the supplemental material.
This paper presents an approach that reconstructs a hand-held object from a monocular video. In contrast to many recent methods that directly predict object geometry by a trained network, the proposed approach does not require any learned prior about the object and is able to recover more accurate and detailed object geometry. The key idea is that the hand motion naturally provides multiple views of the object and the motion can be reliably estimated by a hand pose tracker. Then, the object geometry can be recovered by solving a multi-view reconstruction problem. We devise an implicit neural representation-based method to solve the reconstruction problem and address the issues of imprecise hand pose estimation, relative hand-object motion, and insufficient geometry optimization for small objects. We also provide a newly collected dataset with 3D ground truth to validate the proposed approach.
This paper focuses on the prevalent performance imbalance in the stages of incremental learning. To avoid obvious stage learning bottlenecks, we propose a brand-new stage-isolation based incremental learning framework, which leverages a series of stage-isolated classifiers to perform the learning task of each stage without the interference of others. To be concrete, to aggregate multiple stage classifiers as a uniform one impartially, we first introduce a temperature-controlled energy metric for indicating the confidence score levels of the stage classifiers. We then propose an anchor-based energy self-normalization strategy to ensure the stage classifiers work at the same energy level. Finally, we design a voting-based inference augmentation strategy for robust inference. The proposed method is rehearsal free and can work for almost all continual learning scenarios. We evaluate the proposed method on four large benchmarks. Extensive results demonstrate the superiority of the proposed method in setting up new state-of-the-art overall performance. Code is available at https://github.com/iamwangyabin/ESN.
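The abstract does not define its temperature-controlled energy metric. As a rough illustration only, the sketch below uses the standard energy score from energy-based models, E(x) = -T * log(sum_i exp(f_i(x) / T)), where lower energy indicates higher confidence. Whether this matches the paper's exact definition is an assumption, and the helper `most_confident_stage` is a hypothetical name introduced here, not part of the released code.

```python
import math


def energy(logits, temperature=1.0):
    """Standard energy score: -T * log(sum(exp(l / T))), computed stably."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def most_confident_stage(stage_logits, temperature=1.0):
    """Hypothetical helper: index of the stage classifier with the lowest energy."""
    energies = [energy(logits, temperature) for logits in stage_logits]
    return min(range(len(energies)), key=energies.__getitem__)


# Two stage classifiers scoring the same input; the second is far more peaked.
stage = most_confident_stage([[0.0, 0.0], [5.0, 0.0]])
```

Comparing raw energies across classifiers is only meaningful if they operate on the same scale, which is the motivation the abstract gives for its anchor-based energy self-normalization step; that step is not reproduced here.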
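Advancing object detection toward open-vocabulary and few-shot transfer has long been a challenge in computer vision research. This work explores a continual learning approach that enables a detector to expand its zero/few-shot capabilities through multi-dataset vision-language pre-training. Using natural language as the knowledge representation, we explore methods for accumulating a "visual vocabulary" from different training datasets and unify the task as a language-conditioned detection framework. Specifically, we propose a novel language-aware detector, OmDet, and a novel training mechanism. The proposed multimodal detection network resolves the technical challenges of multi-dataset joint training and generalizes to an arbitrary number of training datasets without requiring manual label taxonomy merging. Experimental results on COCO, Pascal VOC, and WIDER Face/Pedestrian confirm its efficacy: joint training achieves scores on par with or higher than training separately. Moreover, we pre-train on a vocabulary of more than 4 million unique objects and evaluate the resulting model on the 35 downstream tasks of ODinW. The results show that OmDet achieves state-of-the-art fine-tuned performance on ODinW. Analysis shows that, by scaling up the proposed pre-training method, OmDet continues to improve its zero/few-shot tuning performance, suggesting a promising direction for further scaling.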
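We pose a new question: can an agent learn to combine actions from previous tasks to complete new tasks, as humans do? In contrast to imitation learning, there is no expert data, only data collected through environment exploration. Compared with offline reinforcement learning, the problem of data distribution shift is more severe. The action sequence that solves a new task may be a combination of trajectory segments from multiple training tasks; in other words, the test task and its solution strategy do not exist directly in the training data. This makes the problem more difficult. We propose a Memory-related Multi-task Method (M3) to address this problem. The method consists of three stages. First, task-agnostic exploration is carried out to collect data. Unlike previous methods, we organize the exploration data into a knowledge graph. We design a model based on the exploration data to extract action effect features and save them in memory, while an action prediction model is trained. Second, for a new task, the action effect features stored in memory are used to generate candidate actions through a feature decomposition-based approach. Finally, a multi-scale candidate action pool and the action prediction model are fused to generate a strategy that completes the task. Experimental results show that our proposed method significantly improves performance compared with the baseline.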