This paper investigates the role of saliency in improving the classification accuracy of convolutional neural networks (CNNs) when scarce training data is available. Our approach consists of adding a saliency branch to an existing CNN architecture, which is used to modulate the standard bottom-up visual features from the original image input, acting as an attentional mechanism that guides the feature extraction process. The main aim of the method is to enable the effective training of fine-grained recognition models with limited training samples and to improve performance on the task, thereby reducing the need for annotating large datasets. Most saliency methods are evaluated on their ability to generate saliency maps, and not on their functionality in a complete vision pipeline; our proposed pipeline allows evaluating saliency methods on the high-level task of object recognition. We perform extensive experiments on various fine-grained datasets (flowers, birds, cars, and dogs) under different conditions and show that saliency can considerably improve a network's performance, especially for scarce training data. Furthermore, the experiments show that saliency methods that obtain improved saliency maps (as measured on traditional saliency benchmarks) also translate into saliency methods that yield improved performance gains when applied to an object recognition pipeline.
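To make the modulation idea concrete, here is a minimal PyTorch sketch of a saliency branch that multiplicatively gates a backbone's bottom-up feature maps; the layer sizes, the sigmoid gating, and all names are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class SaliencyModulatedCNN(nn.Module):
    """Minimal sketch: a saliency branch whose single-channel map
    multiplicatively gates the backbone's bottom-up feature maps."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Bottom-up branch: an ordinary convolutional feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        # Saliency branch: predicts one gating map from the RGB input.
        self.saliency = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):
        feats = self.backbone(x)         # (B, 128, H, W)
        gate = self.saliency(x)          # (B, 1, H, W), values in [0, 1]
        feats = feats * gate             # attentional modulation of features
        pooled = feats.mean(dim=(2, 3))  # global average pooling
        return self.head(pooled)

logits = SaliencyModulatedCNN(num_classes=102)(torch.randn(2, 3, 64, 64))
```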
Recognizing fine-grained sub-categories such as birds and dogs is extremely challenging due to the highly localized and subtle differences in some specific parts. Most previous works rely on object/part level annotations to build part-based representation, which is demanding in practical applications. This paper proposes an automatic fine-grained recognition approach which is free of any object/part annotation at both training and testing stages. Our method explores a unified framework based on two steps of deep filter response picking. The first picking step is to find distinctive filters which respond to specific patterns significantly and consistently, and learn a set of part detectors via iteratively alternating between new positive sample mining and part model retraining. The second picking step is to pool deep filter responses via spatially weighted combination of Fisher Vectors. We conditionally pick deep filter responses to encode them into the final representation, which considers the importance of filter responses themselves. Integrating all these techniques produces a much more powerful framework, and experiments conducted on CUB-200-2011 and Stanford Dogs demonstrate the superiority of our proposed algorithm over the existing methods.
Recent advances in fine-grained recognition utilize attention maps to localize objects of interest. Although many methods can generate attention maps, most of them rely on sophisticated loss functions or complicated processing procedures. In this work, we propose a simple and straightforward attention generation model based on the output activations of classifiers. The advantage of our model is that it can be easily trained with image-level labels and a softmax loss function. More specifically, multiple linear local classifiers are first adopted to perform fine-grained classification at each location of the high-level CNN feature maps. The attention map is generated by aggregating and max-pooling the output activations. The attention map then serves as a surrogate target object mask to train those local classifiers, similar to training models for semantic segmentation. Our model achieves state-of-the-art results on three benchmark datasets, namely 87.9% on the CUB-200-2011 dataset, 94.1% on the Stanford Cars dataset, and 92.1% on the FGVC-Aircraft dataset, demonstrating its effectiveness in fine-grained recognition tasks.
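A minimal sketch of this attention generation, assuming the local classifiers are realized as a 1x1 convolution and that aggregation means taking the maximum class activation at each spatial location; the normalization and all names are illustrative.

```python
import torch
import torch.nn as nn

def attention_from_local_classifiers(feature_map, num_classes=200):
    """Sketch: a 1x1 convolution acts as a linear classifier at every
    spatial position; the attention map is the maximum class activation
    at each location, rescaled to [0, 1]."""
    # Randomly initialized here; stands in for the trained local classifiers.
    local_cls = nn.Conv2d(feature_map.shape[1], num_classes, kernel_size=1)
    scores = local_cls(feature_map)            # (B, num_classes, H, W)
    attn, _ = scores.max(dim=1, keepdim=True)  # strongest class response per location
    attn = attn - attn.amin(dim=(2, 3), keepdim=True)
    attn = attn / (attn.amax(dim=(2, 3), keepdim=True) + 1e-6)
    return attn                                # (B, 1, H, W)

feats = torch.randn(2, 512, 14, 14)            # high-level CNN feature map
print(attention_from_local_classifiers(feats).shape)
```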
Weakly supervised image segmentation is an important task in computer vision. A key problem is how to obtain high-quality object locations from image-level labels. Class activation mapping is a common approach that can be used to generate high-precision object location cues. However, these location cues are generally very sparse and small, so they cannot provide effective information for image segmentation. In this paper, we propose a saliency-guided image segmentation network to address this problem. We employ a self-attention method to generate fine saliency maps, and use the location cues as seeds that are grown into pixel-level label regions through a seeded region growing method. During seed growing, we use the saliency values to weight the similarity between pixels in order to control the growing, so the saliency information helps generate discriminative object regions while the influence of erroneously salient pixels is effectively suppressed. Experimental results on the common segmentation dataset PASCAL VOC 2012 demonstrate the effectiveness of our method.
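The following is a minimal NumPy sketch of saliency-weighted seeded region growing in the spirit described above; the exact weighting rule, the threshold, and all names are assumptions, not the paper's formulation.

```python
import numpy as np
from collections import deque

def saliency_weighted_region_grow(image, saliency, seeds, thresh=0.15):
    """Sketch: a neighbor joins the region when its color distance to the
    current pixel, scaled by how non-salient it is, stays below a threshold."""
    h, w = saliency.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque(seeds)                     # seeds: list of (row, col) tuples
    for r, c in seeds:
        mask[r, c] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                color_dist = np.linalg.norm(image[nr, nc] - image[r, c])
                # Low saliency inflates the distance, suppressing growth
                # into background; high saliency encourages growth.
                if color_dist * (1.0 - saliency[nr, nc]) < thresh:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask

img = np.random.rand(32, 32, 3)
sal = np.random.rand(32, 32)
print(saliency_weighted_region_grow(img, sal, seeds=[(16, 16)]).sum())
```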
Fine-grained object classification attracts increasing attention in multimedia applications. However, it is a quite challenging problem due to the subtle inter-class difference and large intra-class variation. Recently, visual attention models have been applied to automatically localize the discriminative regions of an image for better capturing critical difference, which have demonstrated promising performance. Unfortunately, without consideration of the diversity in attention process, most of existing attention models perform poorly in classifying fine-grained objects. In this paper, we propose a diversified visual attention network (DVAN) to address the problem of fine-grained object classification, which substantially relieves the dependency on strongly-supervised information for learning to localize discriminative regions compared with attention-less models. More importantly, DVAN explicitly pursues the diversity of attention and is able to gather discriminative information to the maximal extent. Multiple attention canvases are generated to extract convolutional features for attention. An LSTM recurrent unit is employed to learn the attentiveness and discrimination of attention canvases. The proposed DVAN has the ability to attend the object from coarse to fine granularity, and a dynamic internal representation for classification is built up by incrementally combining the information from different locations and scales of the image. Extensive experiments conducted on CUB-2011, Stanford Dogs and Stanford Cars datasets have demonstrated that the proposed diversified visual attention network achieves competitive performance compared to the state-of-the-art approaches, without using any prior knowledge, user interaction or external resource in training and testing.
Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding foreground object or object parts (where) to extract discriminative features (what). In this paper, we propose to apply visual attention to fine-grained classification task using deep neural network. Our pipeline integrates three types of attention: the bottom-up attention that proposes candidate patches, the object-level top-down attention that selects relevant patches for a certain object, and the part-level top-down attention that localizes discriminative parts. We combine these attentions to train domain-specific deep nets, then use them to improve both the what and the where aspects. Importantly, we avoid using expensive annotations like bounding box or part information end-to-end. The weak supervision constraint makes our work easier to generalize. We have verified the effectiveness of the method on the subsets of the ILSVRC2012 dataset and the CUB200-2011 dataset. Our pipeline delivered significant improvements and achieved the best accuracy under the weakest supervision condition. The performance is competitive against other methods that rely on additional annotations.
Fine-grained object classification aims to distinguish objects from subordinate categories belonging to the same entry-level object category. The task is challenging due to (1) the difficulty of obtaining training images with ground-truth labels, and (2) the subtle variations among different subordinate categories. It is well known that the characteristic features of different subordinate categories reside in local parts of object instances. In fact, careful part annotations are available in many fine-grained classification datasets. However, manually annotating object parts requires expertise, and such annotations are also hard to generalize to new fine-grained classification tasks. In this work, we propose a weakly supervised part detection network (PartNet) that is able to detect discriminative local parts for fine-grained classification. A vanilla PartNet is built on top of a base subnetwork, with two parallel upper network streams that respectively compute, for local regions of interest (RoIs), scores of classification probabilities (over subordinate categories) and detection probabilities (over a specified number of discriminative part detectors). The image-level prediction is obtained by aggregating element-wise products of these region-level probabilities. To generate a diverse set of RoIs as the input of PartNet, we propose a simple Discretized Part Proposal module (DPP) that directly targets proposing candidates of discriminative local parts, rather than bridging via object-level proposals. Experiments on the benchmark CUB-200-2011 and Oxford Flowers 102 datasets demonstrate the efficacy of our proposed method for both discriminative part detection and fine-grained classification. In particular, we achieve the new state-of-the-art performance on the CUB-200-2011 dataset when ground-truth part annotations are not available.
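One plausible reading of the two-stream aggregation, loosely following WSDDN-style weakly supervised detection; the softmax axes and the way detection mass weights each RoI are assumptions for illustration, not the published PartNet formulation.

```python
import torch
import torch.nn.functional as F

def partnet_aggregate(cls_logits, det_logits):
    """Sketch: softmax over classes within each RoI, softmax over RoIs
    within each part detector, then sum RoI class distributions weighted
    by each RoI's total detection mass to get an image-level score."""
    # cls_logits: (num_rois, num_classes); det_logits: (num_rois, num_detectors)
    cls_prob = F.softmax(cls_logits, dim=1)  # which class each RoI resembles
    det_prob = F.softmax(det_logits, dim=0)  # which RoIs each detector fires on
    roi_weight = det_prob.sum(dim=1, keepdim=True)  # (num_rois, 1)
    return (cls_prob * roi_weight).sum(dim=0)       # (num_classes,)

image_scores = partnet_aggregate(torch.randn(300, 200), torch.randn(300, 8))
print(image_scores.shape)
```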
Recognizing fine-grained categories (e.g., bird species) is difficult due to the challenges of discriminative region localization and fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that region detection and fine-grained feature learning are mutually correlated and thus can reinforce each other. In this paper, we propose a novel recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way. The learning at each scale consists of a classification sub-network and an attention proposal sub-network (APN). The APN starts from full images, and iteratively generates region attention from coarse to fine by taking previous predictions as a reference, while a finer scale network takes as input an amplified attended region from previous scales in a recurrent way. The proposed RA-CNN is optimized by an intra-scale classification loss and an inter-scale ranking loss, to mutually learn accurate region attention and fine-grained representation. RA-CNN does not need bounding box/part annotations and can be trained end-to-end. We conduct comprehensive experiments and show that RA-CNN achieves the best performance in three fine-grained tasks, with relative accuracy gains of 3.3%, 3.7%, 3.8%, on CUB Birds, Stanford Dogs and Stanford Cars, respectively.
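The inter-scale ranking loss can be sketched as a pairwise hinge on the ground-truth class probability, encouraging the finer scale to be more confident than the coarser one; tensor shapes and the margin value here are illustrative.

```python
import torch
import torch.nn.functional as F

def interscale_ranking_loss(coarse_logits, fine_logits, labels, margin=0.05):
    """Sketch of a pairwise inter-scale ranking loss: the finer scale
    should assign the ground-truth class a higher probability than the
    coarser scale, by at least a margin."""
    p_coarse = F.softmax(coarse_logits, dim=1)
    p_fine = F.softmax(fine_logits, dim=1)
    idx = labels.unsqueeze(1)
    pt_coarse = p_coarse.gather(1, idx).squeeze(1)  # p_true at coarse scale
    pt_fine = p_fine.gather(1, idx).squeeze(1)      # p_true at fine scale
    return F.relu(pt_coarse - pt_fine + margin).mean()

loss = interscale_ranking_loss(torch.randn(4, 200), torch.randn(4, 200),
                               torch.randint(0, 200, (4,)))
```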
Part models of object categories are essential for challenging recognition tasks, where differences in categories are subtle and only reflected in appearances of small parts of the object. We present an approach that is able to learn part models in a completely unsupervised manner, without part annotations and even without given bounding boxes during learning. The key idea is to find constellations of neural activation patterns computed using convolutional neural networks. In our experiments, we outperform existing approaches for fine-grained recognition on the CUB200-2011, NA Birds, Oxford PETS, and Oxford Flowers datasets in case no part or bounding box annotations are available, and achieve state-of-the-art performance for the Stanford Dogs dataset. We also show the benefits of neural constellation models as a data augmentation technique for fine-tuning. Furthermore, our paper unites the areas of generic and fine-grained classification, since our approach is suitable for both scenarios. The source code of our method is available online at http://www.inf-cv.uni-jena.de/part_discovery
Fine-grained categorization, which aims to distinguish subordinate-level categories such as bird species or dog breeds, is an extremely challenging task. This is due to two main issues: how to localize discriminative regions for recognition and how to learn sophisticated features for representation. Neither of them is easy to handle if there is insufficient labeled data. We leverage the fact that a subordinate-level object already has other labels in its ontology tree. These "free" labels can be used to train a series of CNN-based classifiers, each specialized at one grain level. The internal representations of these networks have different regions of interest, allowing the construction of multi-grained descriptors that encode informative and discriminative features covering all the grain levels. Our multiple granularity framework can be learned with the weakest supervision, requiring only image-level labels and avoiding the use of labor-intensive bounding box or part annotations. Experimental results on three challenging fine-grained image datasets demonstrate that our approach outperforms state-of-the-art algorithms, including those requiring strong labels.
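A minimal sketch of building a multi-grained descriptor by concatenating pooled representations from one network per grain level; the stub networks, pooling choice, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def multigrained_descriptor(image, grain_nets):
    """Sketch: one CNN per ontology grain level (e.g. family, genus,
    species); the multi-grained descriptor concatenates their pooled
    internal representations."""
    feats = []
    for net in grain_nets:
        fmap = net(image)                    # (B, C, H, W)
        feats.append(fmap.mean(dim=(2, 3)))  # global average pooling
    return torch.cat(feats, dim=1)           # (B, sum of channel counts)

stub = lambda: nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
desc = multigrained_descriptor(torch.randn(2, 3, 64, 64), [stub(), stub(), stub()])
print(desc.shape)   # (2, 96)
```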
Fine-grained image classification is a challenging task due to the large intra-class variance and small inter-class variance, aiming at recognizing hundreds of sub-categories belonging to the same basic-level category. Most existing fine-grained image classification methods generally learn part detection models to obtain the semantic parts for better classification accuracy. Despite achieving promising results, these methods mainly have two limitations: (1) not all the parts obtained through the part detection models are beneficial and indispensable for classification, and (2) fine-grained image classification requires more detailed visual descriptions which could not be provided by the part locations or attribute annotations. To address the above two limitations, this paper proposes the two-stream model combining vision and language (CVL) for learning latent semantic representations. The vision stream learns deep representations from the original visual information via deep convolutional neural network. The language stream utilizes the natural language descriptions which can point out the discriminative parts or characteristics for each image, and provides a flexible and compact way of encoding the salient visual aspects for distinguishing sub-categories. Since the two streams are complementary, combining them further achieves better classification accuracy. Compared with 12 state-of-the-art methods on the widely used CUB-200-2011 dataset for fine-grained image classification, the experimental results demonstrate that our CVL approach achieves the best performance.
Fine-grained image recognition is a challenging computer vision problem, due to the small inter-class variations caused by highly similar subordinate categories, and the large intra-class variations in poses, scales and rotations. In this paper, we prove that selecting useful deep descriptors contributes well to fine-grained image recognition. Specifically, a novel Mask-CNN model without the fully connected layers is proposed. Based on the part annotations, the proposed model consists of a fully convolutional network to both locate the discriminative parts (e.g., head and torso), and more importantly generate weighted object/part masks for selecting useful and meaningful convolutional descriptors. After that, a three-stream Mask-CNN model is built for aggregating the selected object- and part-level descriptors simultaneously. Thanks to discarding the parameter-redundant fully connected layers, our Mask-CNN has a small feature dimensionality and efficient inference speed compared with other fine-grained approaches. Furthermore, we obtain a new state-of-the-art accuracy on two challenging fine-grained bird species categorization datasets, which validates the effectiveness of both the descriptor selection scheme and the proposed Mask-CNN model.
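The descriptor selection step can be sketched as mask-weighted pooling of convolutional descriptors; treating the mask as a soft weight and the epsilon guard are assumptions made here for illustration.

```python
import torch

def masked_descriptor_pooling(feature_map, mask):
    """Sketch: each spatial position of the conv feature map is one deep
    descriptor; the (soft) part mask keeps descriptors that fall on the
    part and discards the rest, followed by average pooling."""
    # feature_map: (B, C, H, W); mask: (B, 1, H, W), values in [0, 1]
    weighted = feature_map * mask
    pooled = weighted.sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)
    return pooled                                 # (B, C)

desc = masked_descriptor_pooling(torch.randn(2, 512, 14, 14),
                                 torch.rand(2, 1, 14, 14))
print(desc.shape)
```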
Computer-vision-based fine-grained recognition has received great attention in recent years. Existing works focus on discriminative part localization and feature learning. In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many salient parts of the object as possible. Then, we compute the classification probabilities that can be obtained by using the individual parts for object classification. Finally, by extracting effective features from each part, combining them, and feeding them into a classifier for recognition, we obtain an accuracy on the CUB200-2011 bird dataset that is higher than the state of the art.
Fine-grained classification is challenging due to the difficulty of finding discriminative features; finding those subtle traits that fully characterize the object is not straightforward. To handle this circumstance, we propose a novel self-supervision mechanism to effectively localize informative regions without the need for bounding-box/part annotations. Our model, called NTS-Net for Navigator-Teacher-Scrutinizer Network, consists of a Navigator agent, a Teacher agent and a Scrutinizer agent. In consideration of the intrinsic consistency between the informativeness of a region and its probability of being the ground-truth class, we design a novel training paradigm which enables the Navigator to detect the most informative regions under the guidance of the Teacher. After that, the Scrutinizer scrutinizes the regions proposed by the Navigator and makes predictions. Our model can be viewed as a multi-agent cooperation in which the agents benefit from each other and make progress together. NTS-Net can be trained end-to-end, while providing accurate fine-grained classification predictions as well as highly informative regions during inference. We achieve state-of-the-art performance on extensive benchmark datasets.
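A sketch of the Navigator-Teacher consistency as a pairwise ranking loss: regions that the Teacher deems more probable for the ground-truth class should receive higher informativeness scores from the Navigator. The hinge form, margin, and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def navigator_ranking_loss(region_info_scores, region_true_probs, margin=0.1):
    """Sketch: if region i has a higher ground-truth-class probability
    (Teacher) than region j, the Navigator's informativeness score for i
    should also be higher, enforced by a pairwise hinge loss."""
    # Both inputs: (num_regions,) for a single image.
    diff_info = region_info_scores[:, None] - region_info_scores[None, :]
    teacher_order = (region_true_probs[:, None] > region_true_probs[None, :]).float()
    return (teacher_order * F.relu(margin - diff_info)).mean()

loss = navigator_ranking_loss(torch.randn(6), torch.rand(6))
```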
We propose a new method for fine-grained object recognition that employs part-level annotations and deep convolutional neural networks (CNNs) in a unified framework. Although both schemes have been widely used to boost recognition performance, due to the difficulty in acquiring detailed part annotations, strongly supervised fine-grained datasets are usually too small to keep pace with the rapid evolution of CNN architectures. In this paper, we solve this problem by exploiting inexhaustible web data. The proposed method improves classification accuracy in two ways: more discriminative CNN feature representations are generated using a training set augmented by collecting a large number of part patches from weakly supervised web images; and more robust object classifiers are learned using a multi-instance learning algorithm jointly on the strong and weak datasets. Despite its simplicity, the proposed method delivers a remarkable performance improvement on the CUB200-2011 dataset compared to baseline part-based R-CNN methods, and achieves the highest accuracy on this dataset even in the absence of test image annotations.
Attention-based learning for fine-grained image recognition remains a challenging task, where most existing methods treat each object part in isolation while neglecting the correlations among them. In addition, the multi-stage or multi-scale mechanisms involved make existing methods less efficient and hard to train end-to-end. In this paper, we propose a novel attention-based convolutional neural network (CNN) which regulates multiple object parts among different input images. Our method first learns multiple attention region features of each input image through a one-squeeze multi-excitation (OSME) module, and then applies a multi-attention multi-class constraint (MAMC) in a metric learning framework. For each anchor feature, MAMC functions by pulling same-attention same-class features closer, while pushing away features of different attentions or different classes. Our method can be trained end-to-end and is highly efficient, requiring only one training stage. Moreover, we introduce Dogs-in-the-Wild, a comprehensive dog species dataset that surpasses similar existing datasets in terms of category coverage, data volume and annotation quality. The dataset will be released upon acceptance to facilitate research on fine-grained image recognition. Extensive experiments are conducted to show the substantial improvements of our method on four benchmark datasets.
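A simplified sketch of the pull/push structure of MAMC; the actual method uses an n-pair formulation, whereas this contrastive form with an assumed margin only illustrates which pairs are pulled together and which are pushed apart.

```python
import torch
import torch.nn.functional as F

def pull_push_loss(features, class_ids, attn_ids, margin=0.5):
    """Sketch: features sharing both class and attention head are pulled
    together; all other pairs are pushed beyond a margin. The batch must
    contain at least one positive pair for the pull term to be defined."""
    feats = F.normalize(features, dim=1)
    dists = torch.cdist(feats, feats)          # pairwise Euclidean distances
    same = (class_ids[:, None] == class_ids[None, :]) & \
           (attn_ids[:, None] == attn_ids[None, :])
    eye = torch.eye(len(feats), dtype=torch.bool)
    pull = dists[same & ~eye].pow(2).mean()            # positives: draw closer
    push = F.relu(margin - dists[~same]).pow(2).mean() # negatives: repel
    return pull + push

loss = pull_push_loss(torch.randn(8, 128),
                      torch.tensor([0, 0, 1, 1, 0, 0, 1, 1]),
                      torch.tensor([0, 1, 0, 1, 0, 1, 0, 1]))
```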
We propose a new method for the task of fine-grained visual categorization. The method builds a model of the base-level category that can be fitted to images, producing high-quality foreground segmentation and mid-level part localizations. The model can be learnt from the typical datasets available for fine-grained categorization, where the only annotation provided is a loose bounding box around the instance (e.g. bird) in each image. Both segmentation and part localizations are then used to encode the image content into a highly-discriminative visual signature. The model is symbiotic in that part discovery/localization is helped by segmentation and, conversely, the segmentation is helped by the detection (e.g. part layout). Our model builds on top of the part-based object category detector of Felzenszwalb et al., and also on the powerful GrabCut segmentation algorithm of Rother et al., and adds a simple spatial saliency coupling between them. In our evaluation, the model improves the categorization accuracy over the state-of-the-art. It also improves over what can be achieved with an analogous system that runs segmentation and part-localization independently.
Learning visual representations from web data has recently attracted attention for object recognition. Previous studies have mainly focused on overcoming label noise and data bias and have shown promising results by learning directly from web data. However, we argue that it might be better to transfer knowledge from existing human labeling resources to improve performance at nearly no additional cost. In this paper, we propose a new semi-supervised method for learning via web data. Our method has the unique design of exploiting strong supervision, i.e., in addition to standard image-level labels, our method also utilizes detailed annotations including object bounding boxes and part landmarks. By transferring as much knowledge as possible from existing strongly supervised datasets to weakly supervised web images, our method can benefit from sophisticated object recognition algorithms and overcome several typical problems found in webly-supervised learning. We consider the problem of fine-grained visual categorization, in which existing training resources are scarce, as our main research objective. Comprehensive experimentation and extensive analysis demonstrate encouraging performance of the proposed approach, which, at the same time, delivers a new pipeline for fine-grained visual categorization that is likely to be highly effective for real-world applications.
We propose 'Hide-and-Seek', a weakly-supervised framework that aims to improve object localization in images and action localization in videos. Most existing weakly-supervised methods localize only the most discriminative parts of an object rather than all relevant parts, which leads to suboptimal performance. Our key idea is to hide patches in a training image randomly, forcing the network to seek other relevant parts when the most discriminative part is hidden. Our approach only needs to modify the input image and can work with any network designed for object localization. During testing, we do not need to hide any patches. Our Hide-and-Seek approach obtains superior performance compared to previous methods for weakly-supervised object localization on the ILSVRC dataset. We also demonstrate that our framework can be easily extended to weakly-supervised action localization.
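A minimal sketch of the patch-hiding augmentation; the paper replaces hidden patches with the dataset mean, while this version zeroes them for brevity, and the grid size and hiding probability are illustrative.

```python
import torch

def hide_patches(images, grid=4, hide_prob=0.5):
    """Sketch: split each training image into a grid of patches and zero
    out each patch independently with some probability. At test time,
    images are left untouched."""
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    out = images.clone()
    keep = torch.rand(b, grid, grid) > hide_prob   # which patches survive
    for i in range(grid):
        for j in range(grid):
            rows = slice(i * ph, (i + 1) * ph)
            cols = slice(j * pw, (j + 1) * pw)
            out[:, :, rows, cols] *= keep[:, i, j].view(b, 1, 1, 1).float()
    return out

hidden = hide_patches(torch.randn(2, 3, 224, 224), grid=4)
```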
Salient object detection has increasingly attracted attention as an important component or step in several pattern recognition and image processing tasks. Although a variety of powerful saliency models have been intensively proposed, they usually involve heavy feature (or model) engineering based on priors (or assumptions) about the properties of objects and backgrounds. Inspired by the effectiveness of recently developed feature learning, we provide a novel Deep Image Saliency Computing (DISC) framework for fine-grained image saliency computing. In particular, we model image saliency from both coarse- and fine-level observations, and utilize deep convolutional neural networks (CNNs) to learn saliency representations in a progressive manner. Specifically, our saliency model is built upon two stacked CNNs. The first CNN generates a coarse-level saliency map by taking the overall image as input, roughly identifying saliency regions in the global context. Furthermore, we integrate superpixel-based local context information in the first CNN to refine the coarse-level saliency map. Guided by the coarse saliency map, the second CNN focuses on the local context to produce fine-grained and accurate saliency maps while preserving object details. For a testing image, the two CNNs collaboratively conduct the saliency computing in one shot. Our DISC framework is capable of uniformly highlighting the objects of interest from complex backgrounds while well preserving object details. Extensive experiments on several standard benchmarks suggest that DISC outperforms other state-of-the-art methods and generalizes well across datasets without additional training. The executable version of DISC is available online: http://vision.sysu.edu.cn/projects/DISC.
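A minimal sketch of the two-stacked-CNN idea: the second network consumes the image concatenated with the coarse map to produce the refined map. The layers are illustrative stubs, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CoarseToFineSaliency(nn.Module):
    """Sketch of a DISC-style two-stage design: a first CNN predicts a
    coarse saliency map from the whole image; a second CNN refines it,
    conditioned on that coarse map."""

    def __init__(self):
        super().__init__()
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )
        self.fine = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),  # RGB + coarse map
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        coarse_map = self.coarse(x)                     # global context
        fine_map = self.fine(torch.cat([x, coarse_map], dim=1))
        return coarse_map, fine_map

coarse, fine = CoarseToFineSaliency()(torch.randn(2, 3, 128, 128))
```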