In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods. The code is available at: https://github.com/LiWentomng/boxlevelset.
Detecting abnormal crowd motion emerging from complex interactions of individuals is paramount to ensuring the safety of crowds. Crowd-level abnormal behaviors (CABs), e.g., counter flow and crowd turbulence, have proven to be crucial causes of many crowd disasters. Over the past decade, video anomaly detection (VAD) techniques have achieved remarkable success in detecting individual-level abnormal behaviors (e.g., sudden running, fighting and stealing), but research on VAD for CABs is rather limited. Unlike individual-level anomalies, CABs usually do not exhibit a salient difference from normal behaviors when observed locally, and the scale of CABs can vary from one scenario to another. In this paper, we present a systematic study to tackle the important problem of VAD for CABs with a novel crowd motion learning framework, the multi-scale motion consistency network (MSMC-Net). MSMC-Net first captures the spatial and temporal crowd motion consistency information in a graph representation. Then, it simultaneously trains multiple feature graphs constructed at different scales to capture rich crowd patterns. An attention network is used to adaptively fuse the multi-scale features for better CAB detection. For the empirical study, we consider three large-scale crowd event datasets: UMN, Hajj and Love Parade. Experimental results show that MSMC-Net can substantially improve the state-of-the-art performance on all the datasets.
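In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted much research attention. In this paper, we propose a novel single-shot box-supervised instance segmentation approach, which delicately integrates the classical level-set model with a deep neural network. Specifically, our proposed method iteratively learns a series of level sets through a continuous Chan-Vese energy-based function in an end-to-end manner. A simple mask-supervised SOLOv2 model is adapted to predict the instance-aware mask map as the level set for each instance. Both the input image and its deep features are employed as input data to evolve the level-set curves, where a box projection function is used to obtain the initial boundary. By minimizing the fully differentiable energy function, the level set for each instance is iteratively optimized within its corresponding bounding box annotation. Experimental results on four challenging benchmarks demonstrate the leading performance of our proposed approach to robust instance segmentation in various scenarios. The code is available at: https://github.com/liwentomng/boxlevelset.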
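As a rough illustration of the region-based Chan-Vese energy mentioned above, the following minimal sketch evaluates the energy of a soft mask restricted to a bounding box. It is not the authors' implementation: the function name `chan_vese_energy`, the pure-Python list representation, the omission of any length or regularization term, and the toy 4x4 image are all assumptions made for illustration.

```python
def chan_vese_energy(image, phi, box):
    """Region-based Chan-Vese energy of a soft mask inside a box (sketch).

    image: 2D list of floats (grayscale intensities)
    phi:   2D list of floats in [0, 1], soft foreground membership
    box:   (top, left, bottom, right), with bottom and right exclusive
    """
    top, left, bottom, right = box
    # Collect (intensity, membership) pairs for pixels inside the box only.
    pixels = [(image[r][c], phi[r][c])
              for r in range(top, bottom)
              for c in range(left, right)]
    eps = 1e-8  # avoids division by zero when a region is empty
    fg_weight = sum(p for _, p in pixels)
    bg_weight = sum(1.0 - p for _, p in pixels)
    # Membership-weighted mean intensity inside and outside the mask.
    c_in = sum(v * p for v, p in pixels) / (fg_weight + eps)
    c_out = sum(v * (1.0 - p) for v, p in pixels) / (bg_weight + eps)
    # Energy is low when each region is close to its own mean intensity.
    return sum(p * (v - c_in) ** 2 + (1.0 - p) * (v - c_out) ** 2
               for v, p in pixels)


# Hypothetical toy data: a bright 2x2 object centered in a dark 4x4 image.
image = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 1.0, 0.0],
         [0.0, 1.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
aligned_mask = [row[:] for row in image]      # mask that matches the object
shifted_mask = [[1.0, 1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0]]         # mask that misses the object
box = (0, 0, 4, 4)
energy_aligned = chan_vese_energy(image, aligned_mask, box)
energy_shifted = chan_vese_energy(image, shifted_mask, box)
```

In this toy setting the mask aligned with the object yields a lower energy than the shifted one, which is the property that gradient-based minimization of such an energy would exploit.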
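Semantic segmentation of driving-scene images is vital for autonomous driving. Although encouraging performance has been achieved on daytime images, performance on nighttime images is less satisfactory due to insufficient exposure and the lack of labeled data. To address these issues, we present an add-on module called dual image-adaptive learnable filters (DIAL-Filters) to improve semantic segmentation under nighttime driving conditions, aiming to exploit the intrinsic features of driving-scene images under different illuminations. DIAL-Filters consist of two parts: an image-adaptive processing module (IAPM) and a learnable guided filter (LGF). With DIAL-Filters, we design both unsupervised and supervised frameworks for nighttime driving-scene segmentation, which can be trained in an end-to-end manner. Specifically, the IAPM module consists of a small convolutional neural network with a set of differentiable image filters, with which each image can be adaptively enhanced for better segmentation under different illuminations. The LGF is employed to enhance the output of the segmentation network to obtain the final segmentation result. DIAL-Filters are lightweight and efficient, and they can be readily applied to both daytime and nighttime images. Our experiments show that DIAL-Filters can significantly improve supervised segmentation performance on the ACDC_Night and NightCity datasets, while demonstrating state-of-the-art performance on unsupervised nighttime semantic segmentation on the Dark Zurich and Nighttime Driving testbeds.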
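Box-supervised instance segmentation has recently attracted considerable research effort, while receiving little attention in the aerial image domain. Compared with general object collections, aerial objects have large intra-class variances and inter-class similarity with complex backgrounds. Moreover, there are many tiny objects in high-resolution satellite images. As a result, the recent pairwise affinity modeling methods inevitably involve noisy supervision and yield inferior results. To tackle these problems, we propose a novel aerial instance segmentation approach, which drives the network to learn a series of level-set functions for aerial objects with only box annotations in an end-to-end manner. Instead of learning pairwise affinity, the level-set method with a carefully designed energy function treats object segmentation as curve evolution, which is able to accurately recover object boundaries and prevent interference from indistinguishable backgrounds and similar objects. Experimental results show that the proposed approach outperforms state-of-the-art box-supervised instance segmentation methods. The source code is available at https://github.com/liwentomng/boxLevelset.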
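In contrast to generic objects, aerial targets are often non-axis-aligned, with arbitrary orientations and cluttered surroundings. Unlike mainstream approaches that regress bounding-box orientations, this paper proposes an effective adaptive points learning approach that takes advantage of an adaptive points representation to capture the geometric information of arbitrarily oriented instances. To this end, three oriented conversion functions are presented to facilitate classification and localization with accurate orientation. Moreover, we propose an effective quality assessment and sample assignment scheme for adaptive points learning, which selects representative oriented point samples during training and is able to capture non-axis-aligned features from adjacent objects or background noise. A spatial constraint is introduced to penalize outlier points for robust adaptive learning. Experimental results on four challenging aerial datasets, including DOTA, HRSC2016, UCAS-AOD and DIOR-R, demonstrate the efficacy of our proposed approach. The source code is available at: https://github.com/liwentomng/orientedreppoints.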
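In complex scenes, especially at urban traffic intersections, a deep understanding of entity relationships and motion behaviors is important for achieving high-quality planning. We propose D2-TPred, a trajectory prediction approach with respect to traffic lights, which uses a spatial dynamic interaction graph (SDG) and a behavior dependency graph (BDG) to handle the problem of discontinuous dependency in the spatial-temporal space. Specifically, the SDG is used to capture spatial interactions through sub-graphs of different agents with dynamic and changeable characteristics in each frame. The BDG is used to infer motion tendency by modeling the implicit dependency of the current state on prior behaviors, especially for discontinuous motions corresponding to acceleration, deceleration, or turning direction. Moreover, we present a new dataset, called VTP-TL, for vehicle trajectory prediction under traffic lights. Our experimental results show that, compared with other trajectory prediction algorithms, our model achieves improvements of 20.45% and 20.78% in terms of ADE and FDE, respectively. The dataset and code are available at: https://github.com/vtp-tl/d2-tpred.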
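For readers unfamiliar with the two metrics, ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE (final displacement error) is that distance at the last step. The sketch below computes both for a single 2D trajectory. The helper names `ade_fde` and `relative_improvement` are hypothetical, and reading the reported percentages as relative error reductions against a baseline is an assumption, since the abstract does not say how they are computed.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for one trajectory given as lists of (x, y) points."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-step Euclidean distance between prediction and ground truth.
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, truth)]
    ade = sum(dists) / len(dists)  # mean error over all time steps
    fde = dists[-1]                # error at the final time step
    return ade, fde


def relative_improvement(baseline_error, our_error):
    """Percentage reduction in error relative to a baseline (assumed convention)."""
    return 100.0 * (baseline_error - our_error) / baseline_error


# Hypothetical toy trajectories, three time steps each.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(predicted, ground_truth)
```

In practice both metrics are averaged over all agents and scenes in the test set; the single-trajectory version is kept here only for brevity.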
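Pedestrian trajectory prediction is an important technique for autonomous driving and has become a research hotspot in recent years. Previous methods mainly rely on the positional relationships of pedestrians to model social interaction, which is clearly insufficient to represent the complex cases that arise in real situations. In addition, most existing work introduces the scene interaction module as an independent branch and embeds the social interaction features in the process of trajectory generation, rather than performing social interaction and scene interaction simultaneously, which may undermine the rationality of trajectory prediction. In this paper, we propose a new prediction model named social soft attention graph convolution network (SSAGCN), which aims to simultaneously handle social interactions among pedestrians and scene interactions between pedestrians and environments. In detail, when modeling social interaction, we propose a new social soft attention function, which fully considers various interaction factors among pedestrians and can distinguish the influence of the pedestrians around an agent based on different factors in various situations. For the physical interaction, we propose a new sequential scene sharing mechanism. The influence of the scene on one agent at each moment can be shared with other neighbors through social soft attention, so the influence of the scene is expanded in both the spatial and temporal dimensions. With the help of these improvements, we successfully obtain socially and physically acceptable predicted trajectories. Experiments on publicly available datasets demonstrate the effectiveness of SSAGCN, which achieves state-of-the-art results.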
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, while its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
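The attack success rate (ASR) reported above is commonly measured as the fraction of trigger-stamped test inputs that the model classifies as the attacker's target label. The sketch below shows one way to compute it from model outputs. The function name `attack_success_rate` is hypothetical, and excluding samples whose true label already equals the target is a common convention assumed here, not something the abstract specifies.

```python
def attack_success_rate(predictions, true_labels, target_label):
    """Fraction of triggered inputs classified as the target label (sketch).

    predictions:  model outputs on trigger-stamped inputs
    true_labels:  original labels of those inputs
    target_label: label the backdoor is meant to force
    """
    # Assumed convention: skip inputs that already belong to the target
    # class, since predicting the target for them is not evidence of a backdoor.
    eligible = [p for p, t in zip(predictions, true_labels)
                if t != target_label]
    if not eligible:
        raise ValueError("no eligible samples outside the target class")
    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


# Hypothetical toy outputs: four triggered inputs, backdoor target is class 0.
predictions = [0, 0, 1, 0]
true_labels = [1, 2, 2, 0]
asr = attack_success_rate(predictions, true_labels, target_label=0)
```

An ASR close to 1.0, as reported for DOORPING, would mean that nearly every eligible triggered input is pushed to the target class.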