Smart Paper Notes

LWSIS: LiDAR-guided Weakly Supervised Instance Segmentation for Autonomous Driving

Xiang Li , Junbo Yin , Botian Shi , Yikang Li , Ruigang Yang , Jianbin Shen

Categories: Computer Vision | Artificial Intelligence

2022-12-07

Image instance segmentation is a fundamental research topic in autonomous driving and is crucial for scene understanding and road safety. Advanced learning-based approaches often rely on costly 2D mask annotations for training. In this paper, we present a more artful framework, LiDAR-guided Weakly Supervised Instance Segmentation (LWSIS), which leverages off-the-shelf 3D data, i.e., point clouds together with 3D boxes, as natural weak supervision for training 2D image instance segmentation models. LWSIS not only exploits the complementary information in multimodal data during training, but also significantly reduces the annotation cost of dense 2D masks. In detail, LWSIS consists of two crucial modules, Point Label Assignment (PLA) and Graph-based Consistency Regularization (GCR). The former automatically assigns the 3D point cloud as 2D point-wise labels, while the latter further refines the predictions by enforcing geometry and appearance consistency of the multimodal data. Moreover, we conduct a secondary instance segmentation annotation on nuScenes, named nuInsSeg, to encourage further research on multimodal perception tasks. Extensive experiments on nuInsSeg, as well as on the large-scale Waymo dataset, show that LWSIS can substantially improve existing weakly supervised segmentation models by involving 3D data only during training. Additionally, LWSIS can be incorporated into 3D object detectors like PointPainting to boost 3D detection performance for free. The code and dataset are available at https://github.com/Serenos/LWSIS.
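
The point label assignment idea in the abstract above (3D points become 2D point-wise labels) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the points are already in the camera frame, a pinhole camera model, and an axis-aligned 3D box (real annotations are oriented boxes). All function and parameter names are hypothetical.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]
Box3D = Tuple[Point3D, Point3D]  # (min corner, max corner); axis-aligned by assumption


def project_to_image(p: Point3D, fx: float, fy: float,
                     cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-frame point; None if it is behind the camera."""
    x, y, z = p
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def inside_box(p: Point3D, box: Box3D) -> bool:
    """True if the point lies inside the axis-aligned box."""
    lo, hi = box
    return all(lo[i] <= p[i] <= hi[i] for i in range(3))


def assign_point_labels(points: List[Point3D], box: Box3D,
                        fx: float, fy: float, cx: float, cy: float,
                        width: int, height: int) -> List[Tuple[float, float, int]]:
    """Return (u, v, label) for every point that projects into the image.

    label is 1 if the 3D point lies inside the 3D box (foreground), else 0.
    """
    labels = []
    for p in points:
        uv = project_to_image(p, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        if 0 <= u < width and 0 <= v < height:
            labels.append((u, v, 1 if inside_box(p, box) else 0))
    return labels
```

The resulting sparse (u, v, label) triples could then supervise a 2D mask head only at the labelled pixels, which is why no dense 2D mask annotation is needed in this setting.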

Translated by Google Translate

Multimodal Virtual Point 3D Detection

Tianwei Yin , Xingyi Zhou , Philipp Krähenbühl

Categories: Computer Vision | Machine Learning | Robotics

2021-11-12

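Lidar-based sensing drives current autonomous vehicles. Despite rapid progress, current Lidar sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one or two measurements. This is an issue, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into Lidar-based 3D recognition. Our approach takes a set of 2D detections to generate dense 3D virtual points that augment an otherwise sparse 3D point cloud. These virtual points integrate naturally into any standard Lidar-based 3D detector alongside regular Lidar measurements. The resulting multimodal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/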
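
One way to picture the generation of virtual points from a 2D detection is the sketch below. It is a simplified illustration under stated assumptions, not the released implementation: each sampled mask pixel borrows the depth of the nearest real Lidar point projected into the same mask (an illustrative choice that the abstract does not confirm), and the camera is a pinhole model. All names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[float, float]
ProjectedPoint = Tuple[float, float, float]  # (u, v, depth) of a real Lidar return
Point3D = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift a pixel with known depth back to a 3D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def virtual_points(mask_pixels: List[Pixel], lidar_uvd: List[ProjectedPoint],
                   fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Create one virtual 3D point per sampled pixel of a single 2D detection.

    mask_pixels: pixels sampled inside the detection mask.
    lidar_uvd:   real Lidar points that project into the same mask.
    """
    if not lidar_uvd:
        return []  # no depth evidence for this detection
    points = []
    for u, v in mask_pixels:
        # Assumed rule: copy the depth of the nearest projected Lidar point.
        _, _, depth = min(lidar_uvd,
                          key=lambda q: (q[0] - u) ** 2 + (q[1] - v) ** 2)
        points.append(unproject(u, v, depth, fx, fy, cx, cy))
    return points
```

The virtual points would then be concatenated with the real measurements before being passed to a Lidar-based detector.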

Multimodal Transformer for Automatic 3D Annotation and Object Detection

Chang Liu , Xiaoyan Qian , Binxiao Huang , Xiaojuan Qi , Edmund Lam , Siew-Chong Tan , Ngai Wong

Categories: Computer Vision

2022-07-20

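Despite the growing number of datasets collected for training 3D object detection models, annotating 3D boxes on LiDAR scans still requires significant human effort. To automate the annotation and facilitate the production of various customized datasets, we propose an end-to-end multimodal transformer (MTrans) autolabeler, which leverages both LiDAR scans and images to generate precise 3D box annotations from weak 2D bounding boxes. To alleviate the pervasive sparsity problem that hinders existing autolabelers, MTrans densifies the sparse point clouds by generating new 3D points based on 2D image information. With a multi-task design, MTrans segments the foreground and background, densifies the LiDAR point clouds, and regresses 3D boxes simultaneously. Experimental results verify the effectiveness of MTrans for improving the quality of the generated labels. By enriching the sparse point clouds, our method achieves 4.48% and 4.03% better 3D AP on KITTI moderate and hard samples, respectively, than the state-of-the-art autolabeler. MTrans can also be extended to improve the accuracy of 3D object detection, resulting in a remarkable 89.45% AP on KITTI hard samples. Code is at https://github.com/cliu2/mtrans.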

Point Cloud Instance Segmentation with Semi-supervised Bounding-Box Mining

Yongbin Liao , Hongyuan Zhu , Yanggang Zhang , Chuangguan Ye , Tao Chen , Jiayuan Fan

Categories: Computer Vision | Artificial Intelligence

2021-11-30

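Point cloud instance segmentation has made huge progress with the emergence of deep learning. However, these methods are usually data-hungry and depend on expensive, time-consuming dense point cloud annotations. The use of unlabeled or weakly labeled data to alleviate the annotation cost remains under-explored for this task. In this paper, we introduce the first semi-supervised point cloud instance segmentation framework (SPIB), which uses both labeled and unlabeled bounding boxes as supervision. Specifically, our SPIB architecture involves a two-stage learning procedure. In the first stage, a bounding box proposal generation network is trained in a semi-supervised setting with perturbation consistency regularization (SPCR). The regularization enforces invariance of the bounding box predictions under different perturbations applied to the input point clouds, providing self-supervision for network learning. In the second stage, the bounding box proposals obtained with SPCR are grouped into subsets, and instance masks are mined inside each subset with a novel semantic propagation module and a property consistency graph module. Moreover, we introduce a novel occupancy-ratio-guided refinement module to refine the instance masks. Extensive experiments on the challenging ScanNet v2 dataset demonstrate that our method achieves competitive performance compared with recent fully supervised methods.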

Box2Mask: Box-supervised Instance Segmentation via Level-set Evolution

Wentong Li , Wenyu Liu , Jianke Zhu , Miaomiao Cui , Risheng Yu , Xiansheng Hua , Lei Zhang

Categories: Computer Vision

2022-12-03

In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods. The code is available at: https://github.com/LiWentomng/boxlevelset.
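
To make "minimizing the level-set energy function" more concrete, the sketch below evaluates the region term of the classical Chan-Vese energy for a soft mask. Whether Box2Mask uses exactly this form is an assumption; the length regularizer and the restriction to the bounding box are omitted, and the function name is hypothetical.

```python
from typing import List

Grid = List[List[float]]


def chan_vese_energy(image: Grid, mask: Grid) -> float:
    """Region term of the classical Chan-Vese energy.

    image: per-pixel intensities (or one feature channel).
    mask:  per-pixel soft membership in [0, 1], where 1 means inside the object.
    Returns sum over pixels of m*(I - c_in)^2 + (1 - m)*(I - c_out)^2, where
    c_in and c_out are the membership-weighted means inside and outside.
    """
    pairs = [(i, m) for img_row, mask_row in zip(image, mask)
             for i, m in zip(img_row, mask_row)]
    w_in = sum(m for _, m in pairs)
    w_out = sum(1.0 - m for _, m in pairs)
    c_in = sum(i * m for i, m in pairs) / w_in if w_in > 0 else 0.0
    c_out = sum(i * (1.0 - m) for i, m in pairs) / w_out if w_out > 0 else 0.0
    return sum(m * (i - c_in) ** 2 + (1.0 - m) * (i - c_out) ** 2
               for i, m in pairs)
```

A mask that separates the image into two homogeneous regions yields a low energy, while a mask that mixes the regions yields a high one, so gradient descent on this quantity pulls the mask boundary toward the object boundary.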

Box2Seg: Learning Semantics of 3D Point Clouds with Box-Level Supervision

Yan Liu , Qingyong Hu , Yinjie Lei , Kai Xu , Jonathan Li , Yulan Guo

Categories: Computer Vision

2022-01-09

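Learning dense point-wise semantics from unstructured 3D point clouds with fewer labels, although a realistic problem, has been under-explored in the literature. While existing weakly supervised methods can effectively learn semantics from only a small fraction of point-level annotations, we find that vanilla bounding-box-level annotation is also informative for semantic segmentation of large-scale 3D point clouds. In this paper, we introduce a neural architecture, termed Box2Seg, that learns point-level semantics of 3D point clouds with bounding-box-level supervision. The key to our approach is to generate accurate pseudo labels by exploring the geometric and topological structure inside and outside each bounding box. Specifically, an attention-based self-training (AST) technique and Point Class Activation Mapping (PCAM) are used to estimate the pseudo labels. The network is then further trained and refined with the pseudo labels. Experiments on two large-scale benchmarks, S3DIS and ScanNet, demonstrate the competitive performance of the proposed method. In particular, the proposed network can be trained with cheap, or even off-the-shelf, bounding-box-level annotations and subcloud-level tags.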

Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning

Tianfang Sun , Zhizhong Zhang , Xin Tan , Yanyun Qu , Yuan Xie , Lizhuang Ma

Categories: Computer Vision

2022-09-16

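Weakly supervised point cloud semantic segmentation methods, which require 1% or fewer labels while aiming for almost the same performance as fully supervised approaches, have recently attracted extensive research attention. A typical solution in this framework is to use self-training or pseudo labeling to mine supervision from the point cloud itself, which ignores the critical information in images. In fact, cameras are widely present in LiDAR scenarios, and this complementary information is highly important for 3D applications. In this paper, we propose a novel cross-modality weakly supervised method for 3D segmentation that incorporates complementary information from unlabeled images. We design a dual-branch network equipped with an active labeling strategy to maximize the power of a tiny fraction of labels and to directly realize 2D-to-3D knowledge transfer. We then establish a cross-modal self-training framework from an Expectation-Maximization (EM) perspective, which iterates between pseudo-label estimation and parameter updating. In the M-step, we propose cross-modal association learning, which mines complementary supervision from images by reinforcing the cycle consistency between 3D points and 2D superpixels. In the E-step, a pseudo-label self-rectification mechanism filters noisy labels, providing more accurate labels so that the networks can be fully trained. Extensive experimental results demonstrate that our method even outperforms state-of-the-art fully supervised competitors with less than 1% actively selected annotations.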

Deep Level Set for Box-supervised Instance Segmentation in Aerial Images

Wentong Li , Yijie Chen , Wenyu Liu , Jianke Zhu

Categories: Computer Vision

2021-12-07

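Box-supervised instance segmentation has recently attracted considerable research effort, yet it has received little attention in the aerial image domain. Compared with general object collections, aerial objects have large intra-class variance and inter-class similarity, with complex backgrounds. Moreover, high-resolution satellite images contain many tiny objects. As a result, recent pairwise affinity modeling methods inevitably involve noisy supervision and produce inferior results. To tackle these problems, we propose a novel aerial instance segmentation approach that drives the network to learn a series of level-set functions for aerial objects with only box annotations, in an end-to-end fashion. Instead of learning pairwise affinity, the level-set method with carefully designed energy functions treats object segmentation as curve evolution, which can accurately recover object boundaries and prevent interference from indistinguishable backgrounds and similar objects. Experimental results demonstrate that the proposed approach outperforms state-of-the-art box-supervised instance segmentation methods. The source code is available at https://github.com/liwentomng/boxLevelset.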

Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes

Julian Chibane , Francis Engelmann , Tuan Anh Tran , Gerard Pons-Moll

Categories: Computer Vision

2022-06-02

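Current 3D segmentation methods rely heavily on large-scale point cloud datasets, which are notoriously laborious to annotate. Few attempts have been made to circumvent the need for dense per-point annotations. In this work, we study weakly supervised 3D semantic instance segmentation. The key idea is to leverage 3D bounding box labels, which are easier and faster to annotate. Indeed, we show that it is possible to train dense segmentation models using only bounding box labels. At the core of our method, Box2Mask, lies a deep model, inspired by classical Hough voting, that directly votes for bounding box parameters, together with a clustering method specifically tailored to bounding box votes. This goes beyond commonly used center votes, which do not fully exploit the bounding box annotations. On the ScanNet test set, our weakly supervised model attains leading performance among weakly supervised approaches (+18 mAP@50). Remarkably, it also achieves 97% of the mAP@50 score of current fully supervised models. To further illustrate the practicality of our work, we train Box2Mask on the recently released ARKitScenes dataset, which is annotated with 3D bounding boxes only, and show, for the first time, compelling 3D instance segmentation masks.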

EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection

Zhe Liu , Tengteng Huang , Bingling Li , Xiwu Chen , Xi Wang , Xiang Bai

Categories: Computer Vision

2021-12-21

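Recently, fusing LiDAR point clouds and camera images to improve the performance and robustness of 3D object detection has received increasing attention, as these two modalities are naturally strongly complementary. In this paper, we propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion (CB-Fusion) module and a Multi-Modal Consistency (MC) loss. More concretely, the proposed CB-Fusion module enriches the semantic information of point features with image features in a cascaded, bi-directional interaction fusion manner, leading to more comprehensive and discriminative feature representations. The MC loss explicitly enforces consistency between the predicted scores to obtain more comprehensive and reliable confidence scores. Experimental results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over state-of-the-art methods. In addition, we emphasize a critical but easily overlooked problem: the performance and robustness of a 3D detector in sparser scenes. Extensive experiments show that EPNet++ outperforms existing SOTA methods by remarkable margins in highly sparse point cloud cases, which may be a viable direction for reducing the high cost of LiDAR sensors. Code will be released in the future.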

Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization

Xizhe Xue , Dongdong Yu , Lingqiao Liu , Yu Liu , Ying Li , Zehuan Yuan , Ping Song , Mike Zheng Shou

Categories: Computer Vision

2022-08-18

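Open-world instance segmentation (OWIS) aims to segment class-agnostic instances from images and has a wide range of real-world applications, such as autonomous driving. Most existing approaches follow a two-stage pipeline: class-agnostic detection first, followed by class-specific mask segmentation. In contrast, this paper proposes a single-stage framework that produces a mask for each instance directly. In addition, instance mask annotations in existing datasets can be noisy. To overcome this issue, we introduce a new regularization loss. Specifically, we first train an extra branch to perform the auxiliary task of predicting foreground regions (i.e., regions belonging to any object instance), and then encourage the prediction of the auxiliary branch to be consistent with the predictions of the instance masks. The key insight is that such a cross-task consistency loss can act as an error-correcting mechanism that combats errors in the annotations. Furthermore, we find that the proposed cross-task consistency loss can be applied to images without any annotation, lending itself to a semi-supervised learning method. Through extensive experiments, we demonstrate that the proposed method achieves impressive results in both fully supervised and semi-supervised settings. Compared with SOTA methods, the proposed method improves the $AP_{100}$ score by 4.75% and 4.05% in the two $\rightarrow$ UVO settings. In the semi-supervised case, our model trained with only 30% labeled data even outperforms its fully supervised counterpart trained with 50% labeled data. The code will be released soon.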
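
The cross-task consistency idea above can be sketched as follows. The abstract does not state the exact loss, so this example makes two assumptions: the instance masks are merged into a foreground map by a per-pixel maximum, and agreement is measured with a Dice loss. Masks are flattened lists of probabilities, and all names are hypothetical.

```python
from typing import List


def union_of_instances(instance_masks: List[List[float]]) -> List[float]:
    """Per-pixel maximum over all instance mask probabilities."""
    return [max(values) for values in zip(*instance_masks)]


def dice_loss(pred: List[float], target: List[float], eps: float = 1e-6) -> float:
    """1 - Dice coefficient between two soft masks (0 means identical)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def cross_task_consistency(foreground_pred: List[float],
                           instance_masks: List[List[float]]) -> float:
    """Penalize disagreement between the auxiliary foreground prediction
    and the union of the predicted instance masks."""
    return dice_loss(foreground_pred, union_of_instances(instance_masks))
```

Because the loss compares two predictions of the same network rather than a prediction and a label, it can also be computed on unlabeled images, which matches the semi-supervised use described in the abstract.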

Frustum PointNets for 3D Object Detection from RGB-D Data

Charles R. Qi , Wei Liu , Chenxia Wu , Hao Su , Leonidas J. Guibas

Categories:

2017-11-22

In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefiting from learning directly on raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on the KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. (Author footnote: majority of the work done as an intern at Nuro, Inc.) [Figure: pipeline from depth to point cloud, from a 2D region (from CNN) to a 3D frustum, and to a 3D box (from PointNet).]

RWSeg: Cross-graph Competing Random Walks for Weakly Supervised 3D Instance Segmentation

Shichao Dong , Ruibo Li , Jiacheng Wei , Fayao Liu , Guosheng Lin

Categories: Computer Vision

2022-08-10

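Instance segmentation on 3D point clouds has been attracting increasing attention due to its wide range of applications, especially in scene understanding. However, most existing methods require fully annotated training data, and manually preparing ground-truth labels at the point level is very cumbersome and labor-intensive. To address this issue, we propose a novel weakly supervised method, RWSeg, that requires labeling each object with only one point. With these sparse weak labels, we introduce a unified framework with two branches that propagate semantic and instance information, respectively, to unknown regions using self-attention and random walks. Furthermore, we propose a Cross-graph Competing Random Walks (CGCRW) algorithm, which encourages competition among different instance graphs to resolve ambiguities between closely placed objects and to improve instance assignment. RWSeg can generate high-quality instance-level pseudo labels. Experimental results on the ScanNet-v2 and S3DIS datasets show that our approach achieves performance comparable to fully supervised methods and outperforms previous weakly supervised methods by large margins. This is the first work to bridge the gap between weak and full supervision in this area.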
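
The propagation of one-point-per-object labels to unlabeled points can be illustrated with a generic random-walk label propagation on a graph. This is a textbook-style sketch and not the CGCRW algorithm itself: each instance spreads a score from its seed point along weighted edges, and every point is finally assigned to the instance with the highest score, which is a simple form of competition. The graph, the restart weight `alpha`, and all names are hypothetical.

```python
from typing import Dict, List


def propagate_instances(adjacency: List[List[float]], seeds: Dict[int, int],
                        num_instances: int, alpha: float = 0.8,
                        iters: int = 50) -> List[int]:
    """Assign every node to an instance by random-walk score propagation.

    adjacency: symmetric non-negative edge weights between points.
    seeds:     {node index: instance id} for the single labelled point per object.
    Iterates F <- alpha * P @ F + (1 - alpha) * Y, with P the row-normalized
    adjacency and Y the one-hot seed matrix, then takes the arg-max per node.
    """
    n = len(adjacency)
    transition = []
    for row in adjacency:
        total = sum(row)
        transition.append([w / total if total > 0 else 0.0 for w in row])
    seed_scores = [[1.0 if seeds.get(i) == k else 0.0
                    for k in range(num_instances)] for i in range(n)]
    scores = [row[:] for row in seed_scores]
    for _ in range(iters):
        scores = [[alpha * sum(transition[i][j] * scores[j][k] for j in range(n))
                   + (1.0 - alpha) * seed_scores[i][k]
                   for k in range(num_instances)] for i in range(n)]
    return [max(range(num_instances), key=lambda k: scores[i][k])
            for i in range(n)]
```

On a chain of four points with strong edges inside each object and a weak edge between the two objects, the two unlabeled points inherit the label of the seed they are strongly connected to.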

Deep Learning for 3D Point Clouds: A Survey

Yulan Guo , Hanyun Wang , Qingyong Hu , Hao Liu , Li Liu , Mohammed Bennamoun

Categories:

2019-12-27

Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks: 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.

VIN: Voxel-based Implicit Network for Joint 3D Object Detection and Segmentation for Lidars

Yuanxin Zhong , Minghan Zhu , Huei Peng

Categories: Computer Vision

2021-07-07

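This paper presents a unified neural network structure for joint 3D object detection and point cloud segmentation. We leverage rich supervision from both detection and segmentation labels rather than using only one of them. In addition, we propose an extension to single-stage object detectors based on the implicit functions widely used in 3D scene and object understanding. The extension branch takes the final feature map from the object detection module as input and produces an implicit function that generates the semantic distribution of each point relative to its corresponding voxel center. We demonstrate the performance of our structure on nuScenes-lidarseg, a large-scale outdoor dataset. Compared with detection-only solutions, our approach achieves results competitive with existing methods in both 3D object detection and point cloud segmentation. Experiments also validate the capability of the proposed method for efficient weakly supervised semantic segmentation.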

From Points to Parts: 3D Object Detection from Point Cloud with Part-aware and Part-aggregation Network

Shaoshuai Shi , Zhe Wang , Jianping Shi , Xiaogang Wang , Hongsheng Li

Categories:

2019-07-08

3D object detection from LiDAR point clouds is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A^2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. First, the part-aware stage for the first time fully utilizes free-of-charge part supervision derived from 3D ground-truth boxes to simultaneously predict high-quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our newly designed RoI-aware point cloud pooling module, which results in an effective representation for encoding the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A^2 net outperforms all existing 3D detection methods and achieves a new state of the art on the KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
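
The "free-of-charge part supervision derived from 3D ground-truth boxes" can be pictured as the relative position of each foreground point inside its box. The sketch below shows one plausible way to compute such a target, assuming a box parameterized by center, size, and yaw about the vertical axis, and a normalization to the range [0, 1]; the exact convention used by the authors is not given in the abstract, and the names are hypothetical.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]
Box3D = Tuple[float, float, float, float, float, float, float]  # cx, cy, cz, l, w, h, yaw


def part_location(point: Point3D, box: Box3D) -> Tuple[float, float, float]:
    """Relative location of a point inside an oriented box.

    Returns values in [0, 1] per axis for points inside the box,
    with (0.5, 0.5, 0.5) at the box center.
    """
    cx, cy, cz, length, width, height, yaw = box
    dx, dy, dz = point[0] - cx, point[1] - cy, point[2] - cz
    # Rotate the offset by -yaw to express it in the box's own frame.
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    local_x = cos_y * dx + sin_y * dy
    local_y = -sin_y * dx + cos_y * dy
    return (local_x / length + 0.5, local_y / width + 0.5, dz / height + 0.5)
```

Since the target depends only on the point and the annotated box, it comes at no extra labelling cost, which is the sense in which the abstract calls it free of charge.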

Panoptic Segmentation: A Review

Omar Elharrouss , Somaya Al-Maadeed , Nandhini Subramanian , Najmath Ottakath , Noor Almaadeed , Yassine Himeur

Categories: Computer Vision

2021-11-19

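Image segmentation for video analysis plays an essential role in different research fields, such as smart cities, healthcare, computer vision, geoscience, and remote sensing applications. In this regard, significant effort has recently been devoted to developing novel segmentation strategies; one of the latest outstanding achievements is panoptic segmentation, which results from the fusion of semantic and instance segmentation. Panoptic segmentation is currently under study to help gain a more nuanced knowledge of image scenes for video surveillance, crowd counting, autonomous driving, and medical image analysis, and a deeper understanding of scenes in general. To that end, we present in this paper what is, to the best of the authors' knowledge, the first comprehensive review of existing panoptic segmentation methods. Accordingly, a well-defined taxonomy of existing panoptic techniques is provided, based on the nature of the adopted algorithms, the application scenarios, and the primary objectives. Moreover, the use of panoptic segmentation for annotating new datasets by pseudo-labeling is discussed. Ablation studies are then carried out to understand the panoptic methods from different perspectives. In addition, evaluation metrics suitable for panoptic segmentation are discussed, and a comparison of the performance of existing solutions is provided to inform the state of the art and identify its limitations and strengths. Finally, the current challenges the technology faces and the future trends likely to attract considerable interest in the near future are elaborated, which can serve as a starting point for upcoming research studies. The papers provided with code are available at: https://github.com/elharroussomar/awesome-panoptic-egation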

Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

Shaoqing Xu , Dingfu Zhou , Jin Fang , Pengcheng Wang , Liangjun Zhang

Categories: Computer Vision

2022-12-10

LiDAR-based 3D object detectors have achieved impressive performance on many benchmarks; however, multi-sensor fusion-based techniques promise to further improve the results. PointPainting, a recently proposed framework, adds the semantic information from the 2D image to the 3D LiDAR points through a painting operation to boost detection performance. However, due to the limited resolution of 2D feature maps, a severe boundary-blurring effect occurs when the 2D semantic segmentation is re-projected into the 3D point cloud. To handle this limitation, a general multimodal fusion framework, MSF, is proposed to fuse the semantic information from both the 2D image and the 3D point cloud scene parsing results. Specifically, MSF includes three main modules. First, SOTA off-the-shelf 2D/3D semantic segmentation approaches are employed to generate the parsing results for 2D images and 3D point clouds. The 2D semantic information is then re-projected into the 3D point cloud using the calibration parameters. To handle the misalignment between the 2D and 3D parsing results, an AAF module is proposed to fuse them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then sent to the subsequent 3D object detector. Furthermore, we propose a DFF module to aggregate deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparison with different baselines. The experimental results show that the proposed fusion strategies significantly improve detection performance compared with methods using only point clouds and methods using only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets new SOTA results on the nuScenes testing benchmark.
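
The painting operation mentioned above can be sketched in a few lines. This is an illustrative simplification, not the PointPainting or MSF code: it assumes points already expressed in the camera frame, a pinhole camera, and a dense per-pixel score map indexed as `seg_scores[v][u]`. Points that fall outside the image receive all-zero scores. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def paint_points(points: Sequence[Tuple[float, float, float]],
                 seg_scores: List[List[List[float]]],
                 fx: float, fy: float, cx: float, cy: float) -> List[tuple]:
    """Append per-class 2D segmentation scores to each 3D point.

    seg_scores[v][u] holds the class scores of pixel (u, v).
    Returns tuples (x, y, z, score_0, ..., score_{C-1}).
    """
    height = len(seg_scores)
    width = len(seg_scores[0])
    num_classes = len(seg_scores[0][0])
    painted = []
    for x, y, z in points:
        scores = [0.0] * num_classes  # default for points the camera cannot see
        if z > 0:
            u = math.floor(fx * x / z + cx)
            v = math.floor(fy * y / z + cy)
            if 0 <= u < width and 0 <= v < height:
                scores = list(seg_scores[v][u])
        painted.append((x, y, z, *scores))
    return painted
```

Because each point simply copies the scores of the pixel it lands on, errors near object boundaries in the 2D map transfer directly to the 3D points, which is the boundary-blurring issue the abstract refers to.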

EfficientLPS: Efficient LiDAR Panoptic Segmentation

Kshitij Sirohi , Rohit Mohan , Daniel Büscher , Wolfram Burgard , Abhinav Valada

Categories: Computer Vision | Machine Learning | Robotics

2021-02-16

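Panoptic segmentation of point clouds is a crucial task that enables autonomous vehicles to comprehend their vicinity using their highly accurate and reliable LiDAR sensors. Existing top-down approaches tackle this problem either by combining independent task-specific networks or by translating methods from the image domain, ignoring the intricacies of LiDAR data and thus often yielding sub-optimal performance. In this paper, we present the novel top-down Efficient LiDAR Panoptic Segmentation (EfficientLPS) architecture, which addresses multiple challenges in segmenting LiDAR point clouds, including distance-dependent sparsity, severe occlusions, large scale variations, and re-projection errors. EfficientLPS comprises a novel shared backbone that encodes features with strengthened geometric transformation modeling capacity and aggregates semantically rich, range-aware multi-scale features. It incorporates new scale-invariant semantic and instance segmentation heads, along with a panoptic fusion module supervised by our proposed panoptic periphery loss function. Additionally, we formulate a regularized pseudo-labeling framework to further improve the performance of EfficientLPS by training on unlabeled data. We benchmark our proposed model on two large-scale LiDAR datasets: nuScenes, for which we also provide ground-truth annotations, and SemanticKITTI. Notably, EfficientLPS sets a new state of the art on both datasets.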

Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration

Liqi Yan , Qifan Wang , Siqi Ma , Jingang Wang , Changbin Yu

Categories: Computer Vision

2022-12-15

Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos, namely STC-Seg. Concretely, STC-Seg demonstrates four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance the mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves the robustness to different object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reflects the tip of an iceberg about the innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
