A statistical attention localization (SAL) method is proposed in this work to facilitate the object classification task. SAL consists of three steps: 1) preliminary attention window selection via decision statistics, 2) attention map refinement, and 3) rectangular attention region finalization. SAL computes soft decision scores of local squared windows and uses them to identify salient regions in Step 1. To accommodate objects of various sizes and shapes, SAL refines the preliminary result and obtains an attention map of more flexible shape in Step 2. Finally, SAL yields a rectangular attention region using the refined attention map and bounding box regularization in Step 3. As an application, we adopt E-PixelHop, an object classification solution based on successive subspace learning (SSL), as the baseline. We apply SAL to obtain a cropped and resized attention region as an alternative input. Classification results on the whole image as well as on the attention region are combined to achieve the highest classification accuracy. Experiments on the CIFAR-10 dataset are given to demonstrate the advantage of the SAL-assisted object classification method.
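As a rough illustration of Step 1, the sketch below accumulates per-window soft decision scores into a pixel-level attention map. It assumes a caller-supplied `score_fn` standing in for the classifier's soft decision on each window; the window size and stride are arbitrary choices, not the paper's.

```python
import numpy as np

def window_attention_map(image, score_fn, win=8, stride=4):
    """Step 1 of SAL (sketch): slide a square window over the image,
    score each crop with a soft-decision function, and accumulate the
    scores into a per-pixel attention map.

    `score_fn` is a hypothetical callable mapping a (win, win, C) crop
    to a scalar confidence in [0, 1].
    """
    h, w = image.shape[:2]
    attn = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            s = score_fn(image[y:y + win, x:x + win])
            attn[y:y + win, x:x + win] += s
            count[y:y + win, x:x + win] += 1.0
    attn /= np.maximum(count, 1.0)          # average overlapping scores
    return attn / max(attn.max(), 1e-8)     # normalize to [0, 1]
```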
In this work, the design of robust learning systems that offer stable performance under a wide range of supervision degrees is investigated. We choose the image classification problem as an illustrative example and focus on the design of modularized systems that consist of three learning modules: representation learning, feature learning, and decision learning. We discuss ways to adjust each module so that the design is robust with respect to different training sample numbers. Based on these ideas, we propose two families of learning systems. One adopts the classical histogram of oriented gradients (HOG) features while the other uses successive-subspace-learning (SSL) features. We test their performance against LeNet-5, an end-to-end optimized neural network, on the MNIST and Fashion-MNIST datasets. The number of training samples per image class varies from the extremely weak supervision condition (i.e., 1 labeled sample per class) to the strong supervision condition (i.e., 4096 labeled samples per class) with a gradual transition in between (i.e., $2^n$, $n = 0, 1, \cdots, 12$). Experimental results show that the two families of modularized learning systems have more robust performance than LeNet-5: they both outperform LeNet-5 for small $n$ and achieve performance comparable with LeNet-5 for large $n$.
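For illustration, a minimal HOG-feature pipeline of this modular flavor can be sketched with off-the-shelf tools; the HOG parameters and the linear SVM decision module below are our illustrative choices, not the paper's exact design.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    """Extract HOG descriptors for a batch of grayscale images
    (e.g., 28x28 MNIST digits). Parameter values are illustrative."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(4, 4),
            cells_per_block=(2, 2), block_norm="L2-Hys")
        for img in images
    ])

# Usage sketch: HOG representation module + a linear decision module.
# X_train: (N, 28, 28) float array, y_train: (N,) labels, assumed given.
# clf = LinearSVC().fit(hog_features(X_train), y_train)
# acc = clf.score(hog_features(X_test), y_test)
```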
Using deep learning models to diagnose cancer from histology data presents several challenges. Cancer grading and localization of regions of interest (ROIs) in these images normally rely on both image- and pixel-level labels, the latter requiring a costly annotation process. Deep weakly supervised object localization (WSOL) methods provide different strategies for low-cost training of deep learning models. Using only image-level annotations, these methods can be trained to classify an image and to produce class activation maps (CAMs) for ROI localization. This paper provides a review of state-of-the-art DL methods for WSOL. We propose a taxonomy in which these methods are divided into bottom-up and top-down methods according to the information flow in the models. Although the latter have seen limited progress, recent bottom-up methods are currently driving much progress with deep WSOL methods. Early works focused on designing different spatial pooling functions. However, these methods reached limited localization accuracy and unveiled a major limitation: the under-activation of CAMs, which leads to high false negative localization. Subsequent works aim to alleviate this issue and recover the complete object. Methods from our taxonomy are evaluated and compared in terms of classification and localization accuracy on two challenging histology datasets. Overall, the results indicate poor localization performance, particularly for generic methods that were initially designed to process natural images. Methods designed to address the challenges of histology data yield good results. However, all methods suffer from high false positive/negative localization. Four key challenges are identified for applying deep WSOL methods in histology, including the under/over-activation of CAMs, sensitivity to thresholding, and model selection.
In the weakly supervised localization setting, supervision is given as image-level labels. We propose to employ an image classifier $F$ and to train a generative network $G$ that, given an input image, produces a per-pixel weight map indicating the location of the object within the image. The network $G$ is trained by minimizing the discrepancy between the output of the classifier $F$ on the original image and its output on the masked image. The scheme requires a regularization term that ensures that $G$ does not produce a uniform weight map, as well as an early stopping criterion to prevent $G$ from over-segmenting the image. Our results show that this method outperforms existing localization methods by a considerable margin on challenging fine-grained classification datasets, as well as on a generic image recognition dataset. In addition, the obtained weight maps are also state-of-the-art for weakly supervised segmentation on fine-grained classification datasets.
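A minimal sketch of this training objective in PyTorch follows; the MSE fidelity term, the mean-mask regularizer, and the weight `lam` are illustrative stand-ins for the paper's exact formulation.

```python
import torch
import torch.nn.functional as F_nn

def mask_training_loss(f, g, images, lam=1.0):
    """Train the generator g so that the classifier f responds to the
    masked image as it does to the original, while a regularizer
    penalizes trivially uniform (all-ones) weight maps.
    f is typically kept frozen; only g is updated.
    """
    with torch.no_grad():
        target = f(images)                    # F's output on the original
    masks = torch.sigmoid(g(images))          # (N, 1, H, W) weight maps
    masked = images * masks                   # keep only weighted pixels
    fidelity = F_nn.mse_loss(f(masked), target)
    sparsity = masks.mean()                   # discourages uniform weights
    return fidelity + lam * sparsity
```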
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model called OverFeat.
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
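The CAM construction itself is simple enough to sketch. The snippet below (PyTorch, using a torchvision resnet18 as a stand-in backbone whose globally pooled conv features feed a single linear layer, the structure the CAM paper requires) weights the final feature maps by the target class's classifier weights:

```python
import torch
from torchvision.models import resnet18

def class_activation_map(model, image, class_idx):
    """Compute a CAM (sketch): weight the final conv feature maps by the
    classifier weights of the target class, then sum over channels."""
    feats = {}
    def hook(_m, _i, out):
        feats["conv"] = out                       # (1, C, h, w)
    h = model.layer4.register_forward_hook(hook)
    with torch.no_grad():
        model(image)                              # image: (1, 3, H, W)
    h.remove()
    fmap = feats["conv"][0]                       # (C, h, w)
    w = model.fc.weight[class_idx]                # (C,)
    cam = torch.relu(torch.einsum("c,chw->hw", w, fmap))
    return cam / cam.max().clamp(min=1e-8)

# Usage sketch:
# model = resnet18(weights="IMAGENET1K_V1").eval()
# cam = class_activation_map(model, preprocessed_image, class_idx=243)
```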
Single-frame InfraRed Small Target (SIRST) detection has been a challenging task due to a lack of inherent characteristics, imprecise bounding box regression, a scarcity of real-world datasets, and sensitive localization evaluation. In this paper, we propose a comprehensive solution to these challenges. First, we find that the existing anchor-free label assignment method is prone to mislabeling small targets as background, leading to their omission by detectors. To overcome this issue, we propose an all-scale pseudo-box-based label assignment scheme that relaxes the constraints on scale and decouples the spatial assignment from the size of the ground-truth target. Second, motivated by the structured prior of feature pyramids, we introduce the one-stage cascade refinement network (OSCAR), which uses the high-level head as soft proposals for the low-level refinement head. This allows OSCAR to process the same target in a cascade coarse-to-fine manner. Finally, we present a new research benchmark for infrared small target detection, consisting of the SIRST-V2 dataset of real-world, high-resolution single-frame targets, the normalized contrast evaluation metric, and the DeepInfrared toolkit for detection. We conduct extensive ablation studies to evaluate the components of OSCAR and compare its performance to state-of-the-art model-driven and data-driven methods on the SIRST-V2 benchmark. Our results demonstrate that a top-down cascade refinement framework can improve the accuracy of infrared small target detection without sacrificing efficiency. The DeepInfrared toolkit, dataset, and trained models are available at https://github.com/YimianDai/open-deepinfrared to advance further research in this field.
Weakly supervised object localization (WSOL) aims to localize the object region using only image-level labels as supervision. Recently, a new paradigm has emerged that accomplishes the localization task by generating a foreground prediction map (FPM). Existing FPM-based methods use cross-entropy (CE) to evaluate the foreground prediction map and to guide the learning of the generator. We argue for using activation values to achieve more efficient learning. This is based on the experimental observation that, for a trained network, CE converges to zero when the foreground mask covers only part of the object region, whereas the activation value keeps increasing until the mask expands to the object boundary, which suggests that more of the object region can be learned by using activation values. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint module (AMC) is designed to facilitate the learning of the generator by suppressing background activation values. Meanwhile, by using foreground region guidance and an area constraint, BAS can learn the whole region of the object. Furthermore, in the inference phase, we consider the prediction maps of different categories together to obtain the final localization result. Extensive experiments show that BAS achieves significant and consistent improvements over the baseline methods on the CUB-200-2011 and ILSVRC datasets.
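A rough sketch of the background-suppression idea follows; the term weights and the normalization are illustrative simplifications, not the paper's exact AMC formulation.

```python
import torch

def bas_objective(act_map, fg_mask, alpha=1.0, beta=1.0):
    """Simplified BAS-style objective (sketch).

    act_map: (N, 1, H, W) activation values for the target class.
    fg_mask: (N, 1, H, W) foreground prediction map in [0, 1].
    """
    # Activation surviving outside the predicted foreground is penalized,
    # pushing the mask to cover everything that activates the class.
    bg_act = (act_map * (1.0 - fg_mask)).sum(dim=(1, 2, 3))
    total = act_map.sum(dim=(1, 2, 3)).clamp(min=1e-8)
    suppression = (bg_act / total).mean()
    # An area term keeps the mask from trivially covering the whole image.
    area = fg_mask.mean()
    return alpha * suppression + beta * area
```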
Camera traps have revolutionized the study of many animal species that were previously nearly impossible to observe due to their habitat or behavior. They are generally cameras fixed to a tree that take a short sequence of images when triggered. Deep learning has the potential to overcome the workload by automating image classification according to taxon or empty images. However, a standard deep neural network classifier fails because animals often represent only a small portion of the high-definition images. That is why we propose a workflow named Weakly Object Detection Faster-RCNN+FPN suited to this challenge. The model is weakly supervised because it requires only the animal taxon label per image but does not require any manual bounding box annotations. First, it automatically performs weakly supervised bounding box annotation using motion from multiple frames. Then, it trains a Faster-RCNN+FPN model using this weak supervision. Experimental results are obtained on two datasets from Papua New Guinea and Missouri biodiversity monitoring campaigns, and then on an easily reproducible testbed.
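The motion-based weak annotation step might look roughly like the following frame-differencing sketch in OpenCV; the thresholds and morphology are illustrative, and the paper's exact procedure may differ.

```python
import cv2
import numpy as np

def motion_bounding_box(frames, thresh=25, min_area=500):
    """Derive weak bounding boxes from motion across a short image
    sequence. Frames are grayscale uint8 arrays of identical shape,
    as produced by a triggered camera trap."""
    median = np.median(np.stack(frames), axis=0).astype(np.uint8)
    boxes = []
    for frame in frames:
        diff = cv2.absdiff(frame, median)            # motion residual
        _, motion = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        motion = cv2.dilate(motion, None, iterations=2)
        contours, _ = cv2.findContours(motion, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        big = [c for c in contours if cv2.contourArea(c) >= min_area]
        if big:
            x, y, w, h = cv2.boundingRect(np.vstack(big))
            boxes.append((x, y, x + w, y + h))
    return boxes  # one weak box per frame where motion was found
```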
Using only global annotations such as the image class, weakly supervised learning methods allow CNN classifiers to jointly classify an image and yield the regions of interest associated with the predicted class. However, without any guidance at the pixel level, such methods may yield inaccurate regions. This problem is known to be more challenging with histology images than with natural ones, since objects are less salient, structures have more variations, and foreground and background regions have stronger similarities. Therefore, methods from the computer vision literature for the visual interpretation of CNNs may not directly apply. In this work, we propose a simple yet efficient method based on a composite loss function that leverages information from the fully negative samples. Our new loss function contains two complementary terms: the first exploits positive evidence collected from the CNN classifier, while the second leverages the fully negative samples from the training dataset. In particular, we equip a pre-trained classifier with a decoder that allows refining the regions of interest. The same classifier is exploited to collect both the positive and the negative evidence at the pixel level to train the decoder. This enables taking advantage of the fully negative samples that occur naturally in the data, without any additional supervision signal and using only the image class as supervision. Compared to several related methods, over the public GlaS benchmark for colon cancer and the Camelyon16 patch-based benchmark for breast cancer, using three different backbones, we show the substantial improvements introduced by our method. Our results demonstrate the benefits of using both negative and positive evidence, i.e., the evidence obtained from the classifier and the evidence naturally available in the dataset. We provide an ablation study of both terms. Our code is publicly available.
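A simplified sketch of such a composite loss follows; the seed-collection step (`pos_seeds`), the weighting `lam`, and the exact term forms are our illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def composite_loss(cls_logits, labels, pix_map, pos_seeds, is_negative,
                   lam=1.0):
    """Composite objective (sketch).

    cls_logits : (N, K) image-level classifier outputs.
    pix_map    : (N, 1, H, W) decoder foreground probabilities in [0, 1].
    pos_seeds  : (N, 1, H, W) binary positive-evidence map harvested from
                 the classifier (e.g., thresholded activation peaks).
    is_negative: (N,) bool, True for fully negative samples.
    """
    cls_term = F.cross_entropy(cls_logits, labels)
    # Positive evidence: seed pixels flagged by the classifier should be
    # predicted as foreground by the decoder.
    seeds = pos_seeds.float()
    bce = F.binary_cross_entropy(pix_map, seeds, reduction="none")
    pos_term = (bce * seeds).sum() / seeds.sum().clamp(min=1.0)
    # Negative evidence: every pixel of a fully negative image is background.
    neg_w = is_negative.view(-1, 1, 1, 1).float()
    neg_bce = F.binary_cross_entropy(pix_map, torch.zeros_like(pix_map),
                                     reduction="none")
    neg_term = (neg_bce * neg_w).mean()
    return cls_term + pos_term + lam * neg_term
```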
Weakly supervised object localization (WSOL) relaxes the requirement of dense annotations for object localization by supervising its learning process with image-level classification labels. However, current WSOL methods suffer from excessive activation of background locations and need post-processing to obtain the localization mask. This paper attributes these issues to the unawareness of background cues and proposes the background-aware classification activation map (B-CAM) to simultaneously learn localization scores of both the object and the background with only image-level labels. In our B-CAM, two image-level features, aggregated from the pixel-level features of the potential background and object locations respectively, are used to purify the object features from the object-related background and to represent the features of pure-background samples. Then, based on these two features, an object classifier and a background classifier are learned to determine the binary object localization mask. Our B-CAM can be trained in an end-to-end manner based on a proposed stagger classification loss, which not only improves object localization but also suppresses background activation. Experiments show that our B-CAM outperforms one-stage WSOL methods on the CUB-200, OpenImages, and VOC2012 datasets.
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
Vehicle re-identification (Re-ID) aims to retrieve images with the same vehicle ID across different cameras. Current part-level feature learning methods typically detect vehicle parts via uniform division, outside tools, or attention modeling. However, such part features often require expensive additional annotations and cause sub-optimal performance in case of unreliable part mask predictions. In this paper, we propose a weakly supervised Part Attention Network (PANet) and a Part-Mentored Network (PMNet) for vehicle Re-ID. Firstly, PANet localizes vehicle parts via part-relevant channel recalibration and cluster-based mask generation, without vehicle part supervision information. Secondly, PMNet leverages teacher-guided learning to distill vehicle part-specific features from PANet and performs multi-scale global and part feature extraction. During inference, PMNet can adaptively extract discriminative part features without part localization by PANet, preventing unstable part mask predictions. We formulate this Re-ID problem as a multi-task problem and adopt homoscedastic uncertainty to learn the optimal weighting of ID losses. Experiments are conducted on two public benchmarks, showing that our approach outperforms recent methods that require no extra annotations, by an average increase of 3.0% in CMC@5 and over 1.4% in mAP on VeRi-776. Moreover, our method extends to the occluded vehicle Re-ID task and exhibits good generalization ability.
Most existing weakly supervised semantic segmentation methods that take image-level class labels as supervision rely highly on the initial class activation map (CAM) generated from a standard classification network. In this paper, a novel "Progressive Patch Learning" approach is proposed to improve the local-detail extraction of the classification network, producing CAMs that better cover the entire object rather than only the most discriminative regions, as in CAMs obtained from conventional classification models. "Patch Learning" destructs feature maps into patches and independently processes each local patch in parallel before the final aggregation. Such a mechanism forces the network to find weak information in scattered discriminative local parts, achieving enhanced sensitivity to local details. "Progressive Patch Learning" further extends the feature destruction and patch learning to multi-level granularities. Cooperating with a multi-stage optimization strategy, this "Progressive Patch Learning" mechanism implicitly provides the model with feature-extraction ability across different locality granularities. As an alternative to the implicit multi-granularity progressive fusion approach, we also propose an explicit method to simultaneously fuse features of different granularities in a single model, further enhancing the CAM quality for complete object coverage. Our proposed method achieves outstanding performance on the PASCAL VOC 2012 dataset (e.g., 69.6% mIoU on the test set), surpassing most existing weakly supervised semantic segmentation methods. The code will be made publicly available at https://github.com/tyroneli/ppl_wsss.
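The feature-destruction step can be sketched as a tensor reshaping that splits a feature map into a grid of independent patches; the granularity `grid` below is an illustrative choice.

```python
import torch

def destruct_into_patches(feat, grid=2):
    """'Patch learning' feature destruction (sketch): split a feature map
    into a grid of non-overlapping patches to be processed independently
    before aggregation.

    feat: (N, C, H, W) with H and W divisible by `grid`.
    Returns: (N * grid * grid, C, H // grid, W // grid)
    """
    n, c, h, w = feat.shape
    ph, pw = h // grid, w // grid
    patches = feat.unfold(2, ph, ph).unfold(3, pw, pw)  # (N, C, g, g, ph, pw)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()
    return patches.view(n * grid * grid, c, ph, pw)

# Each patch batch can then be pushed through the shared classification
# head in parallel, with the per-patch outputs aggregated afterwards.
```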
This paper presents the first attempt to learn semantic boundary detection using image-level class labels as supervision. Our method starts by estimating coarse areas of object classes through attentions drawn by an image classification network. Since boundaries will locate somewhere between such areas of different classes, our task is formulated as a multiple instance learning (MIL) problem, where pixels on a line segment connecting areas of two different classes are regarded as a bag of boundary candidates. Moreover, we design a new neural network architecture that can learn to estimate semantic boundaries reliably even with uncertain supervision given by the MIL strategy. Our network is used to generate pseudo semantic boundary labels of training images, which are in turn used to train fully supervised models. The final model trained with our pseudo labels achieves an outstanding performance on the SBD dataset, where it is as competitive as some of previous arts trained with stronger supervision.
Convolutional networks for image classification progressively reduce resolution until the image is represented by tiny feature maps in which the spatial structure of the scene is no longer discernible. Such loss of spatial acuity can limit image classification accuracy and complicate the transfer of the model to downstream applications that require detailed scene understanding. These problems can be alleviated by dilation, which increases the resolution of output feature maps without reducing the receptive field of individual neurons. We show that dilated residual networks (DRNs) outperform their non-dilated counterparts in image classification without increasing the model's depth or complexity. We then study gridding artifacts introduced by dilation, develop an approach to removing these artifacts ('degridding'), and show that this further increases the performance of DRNs. In addition, we show that the accuracy advantage of DRNs is further magnified in downstream applications such as object localization and semantic segmentation.
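The core trade-off is easy to demonstrate: swapping stride for dilation keeps the output resolution while the receptive field still grows. A minimal PyTorch illustration (of dilation itself, not the DRN architecture):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
dilated = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=2, dilation=2)

print(strided(x).shape)  # torch.Size([1, 64, 28, 28]): resolution halved
print(dilated(x).shape)  # torch.Size([1, 64, 56, 56]): resolution kept,
                         # while the 3x3 kernel now spans a 5x5 region
```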
Pedestrian detection in the wild remains a challenging problem especially when the scene contains significant occlusion and/or low resolution of the pedestrians to be detected. Existing methods are unable to adapt to these difficult cases while maintaining acceptable performance. In this paper we propose a novel feature learning model, referred to as CircleNet, to achieve feature adaptation by mimicking the process humans looking at low resolution and occluded objects: focusing on it again, at a finer scale, if the object can not be identified clearly for the first time. CircleNet is implemented as a set of feature pyramids and uses weight sharing path augmentation for better feature fusion. It targets at reciprocating feature adaptation and iterative object detection using multiple top-down and bottom-up pathways. To take full advantage of the feature adaptation capability in CircleNet, we design an instance decomposition training strategy to focus on detecting pedestrian instances of various resolutions and different occlusion levels in each cycle. Specifically, CircleNet implements feature ensemble with the idea of hard negative boosting in an end-to-end manner. Experiments on two pedestrian detection datasets, Caltech and CityPersons, show that CircleNet improves the performance of occluded and low-resolution pedestrians with significant margins while maintaining good performance on normal instances.
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training. To this end, previous methods adopt a common pipeline: they generate pseudo masks from class activation maps (CAMs) and use such masks to supervise segmentation networks. However, it is challenging to derive comprehensive pseudo masks that cover the whole extent of objects due to the local property of CAMs, i.e., they tend to focus solely on small discriminative object parts. In this paper, we associate the locality of CAMs with the texture-biased property of convolutional neural networks (CNNs). Accordingly, we propose to exploit shape information to supplement the texture-biased CNN features, thereby encouraging mask predictions to be not only comprehensive but also well-aligned with object boundaries. We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities, in order to generate reliable pseudo masks to supervise the model. Importantly, our model is end-to-end trained within a single-stage framework and is therefore efficient in terms of training cost. Through extensive experiments on PASCAL VOC 2012, we validate the effectiveness of our method in producing precise and shape-aligned segmentation results. Specifically, our model surpasses the existing state-of-the-art single-stage approaches. Moreover, it also achieves a new state-of-the-art performance over multi-stage approaches when adopted in a simple two-stage pipeline, without bells and whistles.
We consider the problem of abnormality localization for clinical applications. While deep learning has driven much recent progress in medical imaging, many clinical challenges are not fully addressed, limiting its broader usage. While recent methods report high diagnostic accuracies, physicians have concerns about adopting these algorithms in diagnostic decision-making, largely due to the general lack of interpretability and explainability of algorithmic decisions. One potential way to address this problem is to further train these models to localize abnormalities in addition to classifying them. However, doing this accurately requires a large number of disease localization annotations by clinical experts, a task that is prohibitively expensive to accomplish for most applications. In this work, we approach these problems with a new attention-driven weakly supervised algorithm comprising a hierarchical attention mining framework that unifies activation- and gradient-based visual attention in a holistic manner. Our key algorithmic innovations include the design of explicit ordinal attention constraints, enabling principled model training in a weakly supervised fashion, while also facilitating the generation of visual-attention-driven model explanations via localization cues. On two large-scale chest X-ray datasets (NIH ChestX-ray14 and CheXpert), we demonstrate significant localization performance improvements over the current state of the art while also achieving competitive classification performance. Our code is available at https://github.com/oyxhust/ham.
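The gradient-based half of such attention is typically computed along the lines of Grad-CAM; the sketch below is the textbook formulation, not the paper's exact module.

```python
import torch

def grad_cam(model, layer, image, class_idx):
    """Gradient-based visual attention (Grad-CAM sketch): weight the
    feature maps of `layer` by the spatially averaged gradients of the
    target class score, then take the positive part."""
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    score = model(image)[0, class_idx]     # image: (1, 3, H, W)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)
    cam = torch.relu((weights * feats["a"]).sum(dim=1))   # (1, h, w)
    return cam / cam.max().clamp(min=1e-8)
```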