In this paper we illustrate how to perform both visual object tracking and semi-supervised video object segmentation, in real-time, with a single simple approach. Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask solely relies on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 55 frames per second. Despite its simplicity, versatility and fast speed, our strategy allows us to establish a new state of the art among real-time trackers on VOT-2018, while at the same time demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017. The project website is http://www.robots.ox.ac.uk/~qwang/SiamMask.
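In this paper we introduce SiamMask, a framework that performs both visual object tracking and video object segmentation in real time with the same simple method. We improve the offline training procedure of popular fully-convolutional Siamese approaches by augmenting their loss with a binary segmentation task. Once offline training is complete, SiamMask requires only a single bounding box for initialization and can carry out visual object tracking and segmentation simultaneously at high frame rates. Moreover, we show that the framework can be extended to handle multiple object tracking and segmentation by simply re-using the multi-task model in a cascaded fashion. Experimental results show that our approach is highly efficient, running at around 55 frames per second. It yields real-time state-of-the-art results on visual object tracking benchmarks, while at the same time demonstrating competitive performance at high speed on video object segmentation benchmarks.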
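Estimating the target extent poses a fundamental challenge in visual object tracking. Typically, trackers are box-centric and rely entirely on a bounding box to define the target in the scene. In practice, objects often have complex shapes and are not aligned with the image axes. In these cases, a bounding box does not provide an accurate description of the target and often contains a majority of background pixels. We propose a segmentation-centric tracking pipeline that not only produces a highly accurate segmentation mask, but also works internally with segmentation masks instead of bounding boxes. As a result, our tracker is able to better learn a target representation that clearly differentiates the target from background content in the scene. To achieve the robustness required in challenging tracking scenarios, we propose a separate instance localization component that is used to condition the segmentation decoder when producing the output mask. We infer a bounding box from the segmentation mask, validate our tracker on challenging tracking datasets, and achieve a new state of the art on LaSOT with an AUC score of 69.7%. Since most tracking datasets do not contain mask annotations, we cannot use them to evaluate the predicted segmentation masks. Instead, we validate our segmentation quality on two popular video object segmentation datasets.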
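The step of inferring a bounding box from a segmentation mask can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the mask is a plain list of rows holding 0/1 values, and the names `mask_to_bbox` and `foreground_ratio` are hypothetical. The second function measures how much of the tight axis-aligned box is actually covered by the target, which makes the "majority of background pixels" observation concrete.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # rows of 0/1 values (assumed representation)


def mask_to_bbox(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return the tight box (x_min, y_min, x_max, y_max), inclusive.

    Returns None when the mask contains no foreground pixel.
    """
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def foreground_ratio(mask: Mask) -> float:
    """Fraction of the pixels inside the tight box that belong to the target."""
    box = mask_to_bbox(mask)
    if box is None:
        return 0.0
    x0, y0, x1, y1 = box
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    foreground = sum(1 for row in mask for v in row if v)
    return foreground / box_area
```

For a thin diagonal object such as `[[1, 0, 0], [0, 1, 0], [0, 0, 1]]`, the tight box is `(0, 0, 2, 2)` and only one third of its pixels are foreground.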
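Siamese-based trackers have achieved promising performance on visual object tracking tasks. Most existing Siamese-based trackers contain two separate tracking branches: a classification branch and a bounding box regression branch. In addition, image segmentation provides an alternative way to obtain a more accurate target region. In this paper, we propose a novel tracker with two stages: detection and segmentation. The detection stage locates the target with a Siamese network. Then, given the coarse state estimate from the first stage, a segmentation module produces more accurate tracking results. We conduct experiments on four benchmarks. Our method achieves state-of-the-art results, with 52.6% on VOT2016, 51.3% on VOT2018, and 39.0% on VOT2019, respectively.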
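The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using the video itself as the sole training data. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform stochastic gradient descent online to adapt the weights of the network, which severely compromises the speed of the system. In this paper we equip a basic tracking algorithm with a fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame rates beyond real time and, despite its extreme simplicity, achieves state-of-the-art performance on multiple benchmarks.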
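Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but they are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker, D3S2, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, and the other assuming a rigid object, to simultaneously achieve robust online target segmentation. Overall tracking reliability is further increased by decoupling the object and feature scale estimation. Without per-dataset fine-tuning, and trained only for segmentation as the primary output, D3S2 outperforms all published trackers on the recent short-term tracking benchmark VOT2020 and performs very close to the state-of-the-art trackers on GOT-10k, TrackingNet, OTB100 and LaSOT. D3S2 outperforms the leading segmentation tracker SiamMask on video object segmentation benchmarks and performs on par with top video object segmentation algorithms.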
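Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems. It entails estimating the trajectory of a target in an image sequence, given only its initial location and segmentation, or a rough approximation in the form of a bounding box. Discriminative Correlation Filters (DCFs) and deep Siamese Networks (SNs) have emerged as the dominant tracking paradigms and have led to significant progress. Following the rapid evolution of visual object tracking over the last decade, this survey presents a systematic and thorough review of more than 90 DCF and Siamese trackers, based on results on nine tracking benchmarks. First, we present the background theory of the core DCF and Siamese tracking formulations. Then, we distinguish and comprehensively review the shared as well as the specific open research challenges in both tracking paradigms. Furthermore, we thoroughly analyze the performance of DCF and Siamese trackers on nine benchmarks, covering different experimental aspects of visual tracking: datasets, evaluation metrics, performance, and speed comparisons. We conclude the survey with recommendations and suggestions for the open challenges identified in our analysis.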
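Unmanned aerial vehicle (UAV)-based visual object tracking has enabled a wide range of applications and has attracted increasing attention in the field of intelligent transportation systems because of its versatility and effectiveness. As an emerging force in the revolutionary trend of deep learning, Siamese networks shine in UAV-based object tracking thanks to their promising balance of accuracy, robustness, and speed. Owing to the development of embedded processors and the gradual optimization of deep neural networks, Siamese trackers have been researched extensively and have seen preliminary integration with UAVs. However, because of the limited onboard computational resources of UAVs and complex real-world conditions, aerial tracking with Siamese networks still faces severe obstacles in many respects. To further explore the deployment of Siamese networks in UAV-based tracking, this work presents a comprehensive review of leading-edge Siamese trackers, along with an exhaustive UAV-specific analysis based on evaluation with a typical UAV onboard processor. Onboard tests are then conducted to validate the feasibility and efficacy of representative Siamese trackers in real-world UAV deployment. Furthermore, to better promote the development of the tracking community, this work analyzes the limitations of existing Siamese trackers and conducts additional experiments, represented by low-illumination evaluations. Finally, the prospects of Siamese tracking for UAV-based intelligent transportation systems are discussed in depth. The unified framework of leading-edge Siamese trackers, i.e., the code library, and the results of their experimental evaluations are available at https://github.com/vision4robotics/siamesetracking4uav.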
While recent years have witnessed astonishing improvements in visual tracking robustness, the advancements in tracking accuracy have been limited. As the focus has been directed towards the development of powerful classifiers, the problem of accurate target state estimation has been largely overlooked. In fact, most trackers resort to a simple multi-scale search in order to estimate the target bounding box. We argue that this approach is fundamentally limited since target estimation is a complex task, requiring high-level knowledge about the object. We address this problem by proposing a novel tracking architecture, consisting of dedicated target estimation and classification components. High-level knowledge is incorporated into the target estimation through extensive offline learning. Our target estimation component is trained to predict the overlap between the target object and an estimated bounding box. By carefully integrating target-specific information, our approach achieves previously unseen bounding box accuracy. We further introduce a classification component that is trained online to guarantee high discriminative power in the presence of distractors. Our final tracking framework sets a new state-of-the-art on five challenging benchmarks. On the new large-scale TrackingNet dataset, our tracker ATOM achieves a relative gain of 15% over the previous best approach, while running at over 30 FPS. Code and models are available at https://github.com/visionml/pytracking.
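The overlap that the target estimation component is trained to predict can be illustrated with a small sketch. It assumes that overlap is measured as intersection-over-union (IoU) between two axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function name `iou` and the box convention are illustrative choices and are not taken from the pytracking code base.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed convention


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero area yield an overlap of 0.
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction, `(0, 0, 2, 2)` and `(1, 1, 3, 3)`, share an intersection of area 1 out of a union of area 7, so their IoU is 1/7.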
In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a novel algorithm called MaskTrack R-CNN for this task. Our new method introduces a new tracking branch to Mask R-CNN to jointly perform the detection, segmentation and tracking tasks. Finally, we evaluate the proposed method and several strong baselines on our new dataset. Experimental results clearly demonstrate the advantages of the proposed algorithm and reveal insight for future improvement. We believe the video instance segmentation task will motivate the community along the line of research for video understanding.
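Tracking requires building a discriminative model of the target at the inference stage. An effective way to achieve this is online learning, which can comfortably outperform models that are trained only offline. Recent research shows that visual tracking benefits significantly from the unification of visual tracking and segmentation, owing to its pixel-level discrimination. However, performing online learning for such a unified model poses great challenges. A segmentation model cannot easily learn from the prior information given in the visual tracking scenario. In this paper, we propose TrackMLP: a novel meta-learning method optimized to learn from only partial information, in order to resolve this challenge. Our model is capable of extensively exploiting limited prior information and therefore has much stronger target-background discriminability than other online learning methods. Empirically, we show that our model achieves state-of-the-art performance and a tangible improvement over competing models. Our model achieves average overlaps of 66.0%, 67.1%, and 68.5% on the VOT2019, VOT2018, and VOT2016 datasets, which are 6.4%, 7.3%, and 6.4% higher than our baseline. Code will be made publicly available.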
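Semi-supervised video object segmentation (VOS) refers to segmenting the target object in the remaining frames given its annotation in the first frame, and has been actively studied in recent years. The key challenge lies in finding effective ways to exploit the spatio-temporal context of past frames to help learn a discriminative target representation for the current frame. In this paper, we propose a novel Siamese network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical frames to the current frame. Technically, we use the transformer encoder and decoder to handle the past frames and the current frame separately: the encoder encodes robust spatio-temporal context of the target object from the past frames, while the decoder takes the feature embedding of the current frame as the query to retrieve the target from the encoder output. To further enhance the target representation, a feature interaction module (FIM) is devised to promote the information flow between the encoder and decoder. Moreover, we employ the Siamese architecture to extract backbone features of both past and current frames, which enables feature reuse and is more efficient than existing methods. Experimental results on three challenging benchmarks validate the superiority of SITVOS over state-of-the-art methods.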
Siamese network based trackers formulate tracking as convolutional feature cross-correlation between a target template and a search region. However, Siamese trackers still have an accuracy gap compared with state-of-the-art algorithms and they cannot take advantage of features from deep networks, such as ResNet-50 or deeper. In this work we prove the core reason comes from the lack of strict translation invariance. By comprehensive theoretical analysis and experimental validations, we break this restriction through a simple yet effective spatial aware sampling strategy and successfully train a ResNet-driven Siamese tracker with significant performance gain. Moreover, we propose a new model architecture to perform layer-wise and depth-wise aggregations, which not only further improves the accuracy but also reduces the model size. We conduct extensive ablation studies to demonstrate the effectiveness of the proposed tracker, which currently obtains the best results on five large tracking benchmarks, including OTB2015, VOT2018, UAV123, LaSOT, and TrackingNet. Our model will be released to facilitate further research.
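The cross-correlation formulation in the first sentence can be sketched in a few lines. This is a simplified, single-channel illustration in plain Python, not the paper's architecture: real Siamese trackers correlate multi-channel deep features (the paper does so depth-wise, one channel at a time), whereas here `search` and `template` are small 2-D grids standing in for feature maps. The names `cross_correlate` and `argmax_2d` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a single-channel feature map (assumed representation)


def cross_correlate(search: Grid, template: Grid) -> Grid:
    """Slide `template` over `search` (valid positions only).

    Each output cell is the sum of element-wise products between the
    template and the search window whose top-left corner is at that cell.
    """
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response: Grid = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            row.append(sum(search[i + u][j + v] * template[u][v]
                           for u in range(th) for v in range(tw)))
        response.append(row)
    return response


def argmax_2d(grid: Grid) -> Tuple[int, int]:
    """Return (row, col) of the largest value in the response map."""
    positions = ((i, j) for i, row in enumerate(grid) for j in range(len(row)))
    return max(positions, key=lambda p: grid[p[0]][p[1]])
```

The peak of the response map marks the window of the search region that best matches the template; in a tracker, that location is then mapped back to image coordinates.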
We pose video object segmentation as spectral graph clustering in space and time, with one graph node for each pixel and edges forming local space-time neighborhoods. We claim that the strongest cluster in this video graph represents the salient object. We start by introducing a novel and efficient method based on 3D filtering for approximating the spectral solution, as the principal eigenvector of the graph's adjacency matrix, without explicitly building the matrix. This key property allows us to have a fast parallel implementation on GPU, orders of magnitude faster than classical approaches for computing the eigenvector. Our motivation for a spectral space-time clustering approach, unique in video semantic segmentation literature, is that such clustering is dedicated to preserving object consistency over time, which we evaluate using our novel segmentation consistency measure. Further on, we show how to efficiently learn the solution over multiple input feature channels. Finally, we extend the formulation of our approach beyond the segmentation task, into the realm of object tracking. In extensive experiments we show significant improvements over top methods, as well as over powerful ensembles that combine them, achieving state-of-the-art on multiple benchmarks, both for tracking and segmentation.
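For readers unfamiliar with the spectral view, the principal eigenvector of an adjacency matrix can be obtained by power iteration, sketched below. This is explicitly not the method described above, which approximates the same eigenvector with 3D filtering and never builds the matrix; the sketch builds a small dense matrix purely for illustration, and the function name `principal_eigenvector` and the iteration count are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]  # dense, symmetric, non-negative adjacency matrix (assumed)


def principal_eigenvector(adj: Matrix, iters: int = 100) -> List[float]:
    """Approximate the leading eigenvector of `adj` by power iteration.

    Starts from a uniform unit vector and repeatedly multiplies by the
    matrix, renormalising to unit length after every step.
    """
    n = len(adj)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # matrix without edges: nothing to amplify
            break
        v = [x / norm for x in w]
    return v
```

On a toy graph where nodes 0 to 2 are tightly connected and node 3 hangs off node 2 by a weak edge, the entries for the three clustered nodes come out much larger than the entry for node 3, which is the sense in which the strongest cluster dominates the eigenvector.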
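We consider the task of semi-supervised video object segmentation (VOS). Our approach mitigates shortcomings of previous VOS work by addressing detail preservation and temporal consistency through visual warping. In contrast to prior work that uses full optical flow, we introduce a new foreground-targeted visual warping approach that learns flow fields from VOS data. We train a flow module to capture detailed motion between frames using two weakly supervised losses. Our object-centric approach of warping previous foreground object masks to their positions in the target frame enables detailed mask refinement with fast runtimes and without extra flow supervision. It can also be integrated directly into state-of-the-art segmentation networks. On the DAVIS17 and YouTubeVOS benchmarks, we outperform state-of-the-art offline methods that do not use extra data, as well as many online methods that do use extra data. Qualitatively, we also show that our approach produces segmentations with high detail and temporal consistency.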
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
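Despite the extensive adoption of machine learning for visual object tracking, recent learning-based approaches have largely overlooked the fact that visual tracking is, by its nature, a sequence-level task. They rely heavily on frame-level training, which inevitably induces inconsistency between training and testing in terms of both data distributions and task objectives. This work introduces a sequence-level training strategy for visual tracking based on reinforcement learning, and discusses how a sequence-level design of data sampling, learning objectives, and data augmentation can improve the accuracy and robustness of tracking algorithms. Our experiments on standard benchmarks including LaSOT, TrackingNet, and GOT-10k demonstrate that four representative tracking models, SiamRPN++, SiamAttn, TransT, and TrDiMP, consistently improve when the proposed methods are incorporated into training, without modifying their architectures.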
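Understanding human-object interactions is fundamental in First Person Vision (FPV). Visual tracking algorithms that follow the objects manipulated by the camera wearer can provide useful information for effectively modeling such interactions. In recent years, the computer vision community has significantly improved the performance of tracking algorithms for a large variety of target objects and scenarios. Despite a few previous attempts to exploit trackers in the FPV domain, a methodical analysis of the performance of state-of-the-art trackers is still missing. This research gap raises the question of whether current solutions can be used "off the shelf" or whether more domain-specific investigations should be carried out. This paper aims to provide answers to such questions. We present the first systematic investigation of single object tracking in FPV. Our study extensively analyses the performance of 42 algorithms, including generic object trackers and baseline FPV-specific trackers. The analysis is carried out by focusing on different aspects of the FPV setting, introducing new performance measures, and considering FPV-specific tasks. The study is made possible by the introduction of TREK-150, a novel benchmark dataset composed of 150 densely annotated video sequences. Our results show that object tracking in FPV poses new challenges to current visual trackers. We highlight the factors causing this behavior and point out possible research directions. Despite these difficulties, we show that trackers bring benefits to FPV downstream tasks that require short-term object tracking. We expect that generic object tracking will gain popularity in FPV as new, FPV-specific methodologies are investigated.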
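Video segmentation, i.e., partitioning video frames into multiple segments or objects, plays a critical role in a broad range of practical applications, such as visual effects assistance in movies, scene understanding in autonomous driving, and virtual background creation in video conferencing, to name a few. Recently, owing to the renaissance of connectionism in computer vision, there has been an influx of deep learning based approaches dedicated to video segmentation that deliver compelling performance. In this survey, we comprehensively review two basic lines of research in this area, namely generic object segmentation (of unknown categories) in videos and video semantic segmentation, by introducing their respective task settings, background concepts, perceived need, development history, and main challenges. We also provide a detailed overview of representative literature on both methods and datasets. Additionally, we present quantitative performance comparisons of the reviewed methods on benchmark datasets. Finally, we point out a set of unsolved open issues in this field and suggest possible opportunities for further research.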
Visual object tracking has been a fundamental topic in recent years and many deep learning based trackers have achieved state-of-the-art performance on multiple benchmarks. However, most of these trackers can hardly achieve top performance at real-time speed. In this paper, we propose the Siamese region proposal network (Siamese-RPN), which is trained end-to-end offline with large-scale image pairs. Specifically, it consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork that includes a classification branch and a regression branch. In the inference phase, the proposed framework is formulated as a local one-shot detection task. We can pre-compute the template branch of the Siamese subnetwork and formulate the correlation layers as trivial convolution layers to perform online tracking. Thanks to the proposal refinement, the traditional multi-scale test and online fine-tuning can be discarded. The Siamese-RPN runs at 160 FPS while achieving leading performance in the VOT2015, VOT2016 and VOT2017 real-time challenges.