Endoscopy is a routine imaging technique used both for diagnosis and for minimally invasive surgical treatment. Artifacts such as motion blur, bubbles, specular reflections, floating objects and pixel saturation impede the visual interpretation and automated analysis of endoscopy videos. Given the widespread use of endoscopy across different clinical applications, we argue that robust and reliable identification of such artifacts and automated restoration of corrupted video frames is a fundamental medical imaging problem. Existing state-of-the-art methods only address the detection and restoration of selected artifacts; however, typical endoscopy videos contain numerous artifacts, which motivates a comprehensive solution. We propose a fully automatic framework that can: 1) detect and classify six different primary artifacts, 2) provide a quality score for each frame, and 3) restore mildly corrupted frames. To detect the different artifacts, our framework uses a fast multi-scale, single-stage convolutional neural network detector. We introduce a quality metric to assess frame quality and to predict image restoration success. A generative adversarial network with carefully chosen regularization is finally used to restore corrupted frames. Our detector yields the highest mean average precision (mAP at a 5% threshold) of 49.0 with the lowest computation time of 88 ms, enabling accurate real-time processing. Our restoration models for blind deblurring, saturation correction and inpainting show significant improvements over previous methods. On a set of 10 test videos, our approach retains an average of 68.7% of frames, 25% more than are retained from the raw videos.
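As an illustration only, the sketch below aggregates per-frame artifact detections into a single quality score by penalising the weighted fraction of the frame they cover; the class list, weights and formula are hypothetical stand-ins, not the metric proposed in the paper.

```python
def frame_quality_score(detections, frame_area, weights=None):
    """Hypothetical per-frame quality score in the spirit of the framework:
    down-weight a frame by the artifact area it contains, with class-specific
    weights. The class names, weights and formula are illustrative assumptions,
    not the paper's metric. `detections` is a list of (artifact_class, box_area)."""
    if weights is None:
        weights = {"blur": 1.0, "bubbles": 0.5, "specularity": 0.7,
                   "saturation": 0.9, "contrast": 0.8, "misc_artifact": 0.6}
    penalty = sum(weights.get(cls, 1.0) * area / frame_area
                  for cls, area in detections)
    return max(0.0, 1.0 - penalty)  # 1.0 = clean frame, 0.0 = heavily corrupted
```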
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008-2012. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
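The paired bootstrap mentioned above can be implemented along the following lines: resample the test images with replacement, re-evaluate both methods on each resample, and count how often one method comes out ahead. This is a minimal sketch of the general technique; the challenge organisers' exact protocol may differ.

```python
import numpy as np

def bootstrap_significance(metric_a, metric_b, image_ids, n_boot=1000, seed=0):
    """Paired bootstrap over test images: resample image ids with replacement,
    recompute both methods' scores on each resample, and estimate how often
    method A beats method B. metric_a/metric_b map an array of image ids to a score."""
    rng = np.random.default_rng(seed)
    image_ids = np.asarray(image_ids)
    wins = 0
    for _ in range(n_boot):
        sample = rng.choice(image_ids, size=len(image_ids), replace=True)
        if metric_a(sample) > metric_b(sample):
            wins += 1
    return wins / n_boot  # fraction of resamples in which A outperforms B
```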
The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.
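For reference, a common way to compute the per-class average precision used in VOC-style evaluation is to take the area under the recall-precision curve after enforcing a monotone precision envelope. The sketch below follows that standard recipe; the official development kit remains the authoritative implementation.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the recall-precision curve with the usual monotone
    (interpolated) precision envelope. `recall` must be sorted in
    ascending order, with `precision` aligned to it."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # accumulate area only where recall actually changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```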
We present a weakly supervised model that jointly performs both semantic and instance segmentation, a particularly relevant problem given the substantial cost of obtaining pixel-perfect annotation for these tasks. In contrast to many popular instance segmentation methods that are based on object detectors, our method does not predict any overlapping instances. Moreover, we are able to segment both "thing" and "stuff" classes, and thereby explain all the pixels in the image. "Thing" classes are weakly supervised with bounding boxes, and "stuff" classes with image-level tags. We obtain state-of-the-art results on Pascal VOC for both full and weak supervision (the latter achieving roughly 95% of fully supervised performance). Furthermore, we present the first weakly supervised results on Cityscapes for both semantic and instance segmentation. Finally, we use our weakly supervised framework to analyse the relationship between annotation quality and prediction performance, which is of interest to dataset creators.
The Mapillary Vistas Dataset is a novel, large-scale street-level image dataset containing 25 000 high-resolution images annotated into 66 object categories with additional, instance-specific labels for 37 classes. Annotation is performed in a dense and fine-grained style by using polygons for delineating individual objects. Our dataset is 5× larger than the total amount of fine annotations for Cityscapes and contains images from all around the world, captured at various conditions regarding weather, season and daytime. Images come from different imaging devices (mobile phones, tablets, action cameras, professional capturing rigs) and differently experienced photographers. In such a way, our dataset has been designed and compiled to cover diversity, richness of detail and geographic extent. As default benchmark tasks, we define semantic image segmentation and instance-specific image segmentation, aiming to significantly further the development of state-of-the-art methods for visual road-scene understanding.
In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is the simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a new algorithm for this task, called MaskTrack R-CNN. Our new method introduces a new tracking branch into Mask R-CNN to jointly perform the detection, segmentation and tracking tasks simultaneously. Finally, we evaluate the proposed method and several strong baselines on our new dataset. The experimental results clearly demonstrate the advantages of the proposed algorithm and reveal insights for future improvement. We believe the video instance segmentation task will motivate the community along the line of research for video understanding.
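The tracking branch must assign each detection in the current frame to a previously seen instance or start a new identity. The toy sketch below does this with embedding similarity alone; a real implementation would also draw on additional cues such as detection confidence and box overlap, which are omitted here.

```python
import numpy as np

def associate_instances(curr_emb, memory_emb, new_thresh=0.5):
    """Toy association step: match each current detection embedding to the most
    similar previously seen instance, or start a new identity if the best
    cosine similarity falls below `new_thresh`. `memory_emb` is a mutable list
    of stored embeddings whose list index serves as the instance id."""
    ids = []
    next_id = len(memory_emb)
    for e in curr_emb:
        if not memory_emb:
            ids.append(next_id); memory_emb.append(e); next_id += 1
            continue
        sims = [float(e @ m / (np.linalg.norm(e) * np.linalg.norm(m) + 1e-8))
                for m in memory_emb]
        best = int(np.argmax(sims))
        if sims[best] >= new_thresh:
            ids.append(best)
            memory_emb[best] = e          # refresh the stored embedding
        else:
            ids.append(next_id); memory_emb.append(e); next_id += 1
    return ids
```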
We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier trained on carefully annotated box annotations. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at http://tinyurl.com/hollywood2tubes.
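For intuition, an overlap between a spatio-temporal proposal and point annotations can be approximated by the fraction of annotated points that fall inside the proposal's box on their frame. The paper defines its own measure; the sketch below is only a simple stand-in.

```python
def point_proposal_overlap(points, tube):
    """Simplified overlap between a spatio-temporal proposal and point
    annotations: the fraction of annotated points lying inside the proposal's
    box on the corresponding frame. `points` maps frame index -> (x, y);
    `tube` maps frame index -> (x1, y1, x2, y2)."""
    covered = 0
    for frame, (px, py) in points.items():
        box = tube.get(frame)
        if box is None:
            continue
        x1, y1, x2, y2 = box
        if x1 <= px <= x2 and y1 <= py <= y2:
            covered += 1
    return covered / max(len(points), 1)
```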
We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. The aim of our work is to revive the interest of the community in a more unified view of image segmentation.
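The PQ metric itself is compact enough to state directly: predicted and ground-truth segments are matched when their IoU exceeds 0.5, and PQ averages the IoU of the matched pairs while penalising unmatched segments. A minimal transcription:

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """Panoptic quality for one class: `matched_ious` holds the IoU of each
    matched (predicted, ground-truth) segment pair (all > 0.5 by construction),
    n_fp counts unmatched predicted segments, n_fn unmatched ground-truth ones."""
    tp = len(matched_ious)
    if tp + n_fp + n_fn == 0:
        return 0.0
    return sum(matched_ious) / (tp + 0.5 * n_fp + 0.5 * n_fn)
```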
Average precision (AP), the area under the recall-precision (RP) curve, is the standard performance measure for object detection. Despite its wide acceptance, it has a number of shortcomings, the most important of which are (i) the inability to distinguish very different RP curves, and (ii) the lack of a direct measure of bounding box localization accuracy. In this paper, we propose "Localization Recall Precision (LRP) Error", a new metric designed specifically for object detection. LRP Error is composed of three components related to localization, false negative (FN) rate and false positive (FP) rate. Based on LRP, we introduce "Optimal LRP", the minimum achievable LRP error, which represents the best achievable configuration of a detector in terms of recall-precision and the tightness of the boxes. In contrast to AP, which considers precisions over the entire recall domain, Optimal LRP determines the "best" confidence score threshold for a class, balancing the trade-off between localization and recall-precision. In our experiments, we show that, for state-of-the-art (SOTA) object detectors, Optimal LRP provides more discriminative information than AP. We also demonstrate that the best confidence score thresholds vary significantly across classes and detectors. Moreover, we present LRP results for a simple online video object detector built on a SOTA still-image object detector, and show that class-specific optimized thresholds improve accuracy over the common approach of using a single general threshold for all classes. At https://github.com/cancam/LRP we provide source code that computes LRP for the PASCAL VOC and MSCOCO datasets; it can easily be adapted to other datasets as well.
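A simplified form of the LRP error combining the three components named above is sketched below; the exact weighting used in the paper may differ from this formulation.

```python
def lrp_error(tp_ious, n_fp, n_fn, tau=0.5):
    """Simplified LRP-style error: a localisation term over true positives
    (IoUs normalised by the matching threshold `tau`), plus the false-positive
    and false-negative counts, averaged over all detections and ground truths.
    `tp_ious` holds the IoU of each true-positive match (all >= tau).
    Lower is better; 0 means perfect boxes with no FPs or FNs."""
    n_tp = len(tp_ious)
    total = n_tp + n_fp + n_fn
    if total == 0:
        return 0.0
    loc = sum((1.0 - iou) / (1.0 - tau) for iou in tp_ious)
    return (loc + n_fp + n_fn) / total
```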
We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the COCO [32] label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization to provide a point of comparison for future work. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. The data set can be found at https://research.google.com/youtube-bb. We hope the availability of such large curated corpus will spur new advances in video object detection and tracking.
Evaluating multi-target tracking based on ground truth data is a surprisingly challenging task. Erroneous or ambiguous ground truth annotations, numerous evaluation protocols, and the lack of standardized benchmarks make a direct quantitative comparison of different tracking approaches rather difficult. The goal of this paper is to raise awareness of common pitfalls related to objective ground truth evaluation. We investigate the influence of different annotations, evaluation software, and training procedures using several publicly available resources, and point out the limitations of current definitions of evaluation metrics. Finally, we argue that the development of an extensive standardized benchmark for multi-target tracking is an essential step toward a more objective comparison of tracking approaches.
State-of-the-art learning based boundary detection methods require extensive training data. Since labelling object boundaries is one of the most expensive types of annotations, there is a need to relax the requirement to carefully annotate images to make both the training more affordable and to extend the amount of training data. In this paper we propose a technique to generate weakly supervised annotations and show that bounding box annotations alone suffice to reach high-quality object boundaries without using any object-specific boundary annotations. With the proposed weak supervision techniques we achieve the top performance on the object boundary detection task, outperforming by a large margin the current fully supervised state-of-the-art methods.
Automatically recognizing and localizing wide ranges of human actions has crucial importance for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including the THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task. In THUMOS 2014, we elevated action recognition to a more practical level by introducing temporally untrimmed videos. These also include 'background videos' which share similar scenes and backgrounds as action videos, but are devoid of the specific actions. The three editions of the challenge organized in 2013-2015 have made THUMOS a common benchmark for action classification and detection and the annual challenge is widely attended by teams from around the world. In this paper we describe the THUMOS benchmark in detail and give an overview of data collection and annotation procedures. We present the evaluation protocols used to quantify results in the two THUMOS tasks of action classification and temporal detection. We also present results of submissions to the THUMOS 2015 challenge and review the participating approaches. Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos (www.thumos.info), and how well methods trained on trimmed videos generalize to untrimmed videos. We conclude by proposing several directions and improvements for future THUMOS challenges.
Autonomous vehicles require knowledge of the surrounding road layout, which can be predicted by state-of-the-art CNNs. This work addresses the current lack of data for determining lane instances, which are needed for various driving manoeuvres. The main issue is the time-consuming manual labelling process, typically applied per image. We note that driving the car is itself a form of annotation. We therefore propose a semi-automated method that allows pre-labelling an image sequence by exploiting an estimate of the road plane in 3D, based on where the car has driven, and projecting labels from this plane into all images of the sequence. The average labelling time per image is reduced to 5 seconds, and data capture requires only an inexpensive dash-cam. We are releasing a dataset of 24,000 images and additionally show experimental semantic and instance segmentation results.
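The projection step of such a pipeline amounts to a standard pinhole projection of 3D points on the estimated road plane into each image of the sequence, given that frame's camera pose. A generic sketch (the paper's plane estimation and calibration details are not reproduced here):

```python
import numpy as np

def project_plane_points(points_world, K, R, t):
    """Project 3D points (e.g. lane annotations drawn on the estimated road
    plane) into one image, given that frame's intrinsics K (3x3) and
    world-to-camera pose R (3x3), t (3,). Returns pixel coordinates for the
    points in front of the camera and the corresponding visibility mask."""
    pts_cam = R @ points_world.T + t.reshape(3, 1)   # 3 x N, camera frame
    in_front = pts_cam[2] > 1e-6                     # keep points ahead of the camera
    pix = K @ pts_cam[:, in_front]
    pix = pix[:2] / pix[2]                           # perspective divide -> pixels
    return pix.T, in_front
```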
This paper presents a method for fast bounding box annotation of object detection datasets. The procedure consists of two stages: the first stage manually annotates a part of the dataset, and the second stage uses a model trained on the first-stage annotations to propose annotations for the remaining samples. We experimentally study which first/second-stage split minimizes the total workload. In addition, we introduce a new, fully labelled object detection dataset collected from indoor scenes. Compared to other indoor datasets, our collection has more class categories, varied backgrounds and lighting conditions, occlusion, and high intra-class variation. We train deep-learning-based object detectors using several state-of-the-art models and compare them in terms of speed and accuracy. The fully annotated dataset is released free of charge to the research community.
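The first/second-stage trade-off studied above can be thought of with a simple effort model: manual time for the seed set plus correction time for the model-proposed annotations on the rest. The sketch below is a hypothetical illustration of that trade-off, not the paper's measured cost model; every name and the linear cost assumption are illustrative.

```python
def total_annotation_effort(n_manual, n_total, t_manual, t_correct, correction_rate):
    """Hypothetical effort model for a two-stage annotation scheme:
    fully hand-annotate n_manual images, then only correct the model's
    proposals on the rest. correction_rate(n_manual) should return the
    expected fraction of proposed boxes needing manual correction for a
    model trained on n_manual images."""
    remaining = n_total - n_manual
    return n_manual * t_manual + remaining * t_correct * correction_rate(n_manual)
```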
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they were collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest dataset (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotate visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth, comprehensive statistics about the dataset, validate the quality of the annotations, and study how the performance of several modern models evolves with increasing amounts of training data. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the fields of image classification, object detection, and visual relationship detection.
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. Recently, a new benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of collecting existing and new data and creating a framework for the standardized evaluation of multiple object tracking methods [28]. The first release of the benchmark focuses on multiple people tracking, since pedestrians are by far the most studied object in the tracking community. This paper accompanies a new release of the MOTChallenge benchmark. Unlike the initial release, all videos of MOT16 have been carefully annotated following a consistent protocol. Moreover, it not only offers a significant increase in the number of labeled boxes, but also provides multiple object classes besides pedestrians and the level of visibility for every single object of interest.
Training 3D object detectors for autonomous driving has been limited to small datasets due to the effort required to generate annotations. Reducing both task complexity and the amount of task switching performed by annotators is key to reducing the effort and time required to generate 3D bounding box annotations. This paper introduces a novel ground truth generation method that combines human supervision with pretrained neural networks to produce per-instance 3D point cloud segmentation, 3D bounding boxes, and class annotations. Annotators provide object anchor clicks which act as seeds to generate instance segmentation results in 3D. The points belonging to each instance are then used to regress the object centroid, bounding box dimensions, and object orientation. Our proposed annotation scheme requires 30× less human annotation time. We use the KITTI 3D object detection dataset to evaluate the efficiency and quality of our annotation scheme. We also test the proposed scheme on previously unseen data from the Autonomoose self-driving vehicle to demonstrate the generalization capabilities of the networks.
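For intuition, the quantities regressed from each instance's points (centroid, box dimensions, yaw) can also be derived geometrically; the sketch below does so with a PCA-based heuristic and is not the network-based regression used in the paper.

```python
import numpy as np

def box_from_instance_points(points):
    """Illustrative geometric post-processing of an instance's 3D points
    (N x 3 array, z up): centroid = mean, yaw from the dominant ground-plane
    direction via PCA, extents from the axis-aligned span after rotation."""
    centroid = points.mean(axis=0)
    xy = points[:, :2] - centroid[:2]
    cov = np.cov(xy.T)                                  # 2x2 ground-plane covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    main = eigvecs[:, np.argmax(eigvals)]               # dominant direction
    yaw = float(np.arctan2(main[1], main[0]))
    c, s = np.cos(-yaw), np.sin(-yaw)
    aligned = np.array([[c, -s], [s, c]]) @ xy.T        # rotate into box axes
    length, width = np.ptp(aligned[0]), np.ptp(aligned[1])
    height = np.ptp(points[:, 2])
    return centroid, (length, width, height), yaw
```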
This paper addresses the task of semantic instance segmentation under open-set conditions, where input images can contain both known and unknown object classes. The training process of existing semantic instance segmentation methods requires annotation masks for all object instances, which is expensive or even infeasible to obtain in realistic scenarios where the number of categories may grow without bound. In this paper, we present a novel open-set semantic instance segmentation approach capable of segmenting all known and unknown object classes in an image, based on the output of an object detector trained on the known object classes. We formulate the problem within a Bayesian framework, where the posterior distribution is approximated with a simulated annealing optimization equipped with an efficient image partition sampler. We show empirically that our method is competitive with state-of-the-art supervised methods on known classes, while performing well on unknown classes when compared with unsupervised methods.
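The optimization referred to above is a standard simulated-annealing loop over candidate image partitions. The generic sketch below shows the accept/reject logic under a cooling schedule; the paper's actual partition sampler and posterior are not reproduced.

```python
import math
import random

def simulated_annealing(initial, propose, log_posterior, steps=10000, t0=1.0, t_min=1e-3):
    """Generic simulated annealing: `propose` perturbs the current state
    (e.g. merges/splits segments of an image partition) and `log_posterior`
    scores it. Uphill moves are always accepted; downhill moves are accepted
    with Boltzmann probability under a linearly decaying temperature."""
    state, score = initial, log_posterior(initial)
    best, best_score = state, score
    for i in range(steps):
        temp = max(t_min, t0 * (1.0 - i / steps))      # linear cooling schedule
        cand = propose(state)
        cand_score = log_posterior(cand)
        if cand_score >= score or random.random() < math.exp((cand_score - score) / temp):
            state, score = cand, cand_score
            if score > best_score:
                best, best_score = state, score
    return best, best_score
```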