On ground platforms, many saliency models have been developed to perceive the visual world as human beings do. However, they may not fit drones, which often look at the world from many unusual viewpoints. To address this problem, this paper proposes a Crowdsourced Multi-path Network (CMNet) that transfers ground knowledge for spatiotemporal saliency prediction in aerial videos. To train CMNet, we first collect and fuse the eye-tracking data of 24 subjects on 1,000 aerial videos to annotate the ground-truth salient regions. Inspired by the crowdsourced annotations in the eye-tracking experiments, we design a multi-path architecture for CMNet, in which each path is initialized under the supervision of a classic ground saliency model. After that, the most representative paths are selected in a data-driven manner and then fused and simultaneously fine-tuned on aerial videos. In this manner, the prior knowledge in various classic ground saliency models can be transferred into CMNet so as to improve its capability of processing aerial videos. Finally, the spatial predictions given by CMNet are adaptively refined with a spatiotemporal saliency optimization algorithm. Experimental results show that the proposed approach outperforms ten state-of-the-art models in predicting visual saliency in aerial videos.
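To make the multi-path idea concrete, below is a minimal PyTorch sketch of several parallel saliency paths fused by a 1x1 convolution; the layer sizes, number of paths, and the path-selection and fine-tuning procedure are illustrative assumptions, not the CMNet architecture itself.

```python
# A minimal sketch of the multi-path idea behind CMNet (hypothetical layer
# sizes; the paper's exact architecture, path selection and fine-tuning
# schedule are not reproduced here).
import torch
import torch.nn as nn

class MultiPathSaliency(nn.Module):
    def __init__(self, in_channels=3, num_paths=4):
        super().__init__()
        # Each path could be pre-trained to mimic one classic ground
        # saliency model before joint fine-tuning on aerial videos.
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(32, 1, kernel_size=3, padding=1),
            )
            for _ in range(num_paths)
        ])
        # Fusion of the selected paths into a single saliency map.
        self.fuse = nn.Conv2d(num_paths, 1, kernel_size=1)

    def forward(self, x):
        per_path = torch.cat([p(x) for p in self.paths], dim=1)  # (B, P, H, W)
        return torch.sigmoid(self.fuse(per_path))                # (B, 1, H, W)

if __name__ == "__main__":
    frames = torch.randn(2, 3, 224, 224)      # a batch of aerial frames
    print(MultiPathSaliency()(frames).shape)  # torch.Size([2, 1, 224, 224])
```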
In the field of human fixation prediction, dozens of computational saliency models have been proposed to reveal certain saliency characteristics under different assumptions and definitions. As a consequence, saliency model benchmarking often requires several evaluation metrics so as to assess saliency models from multiple perspectives simultaneously. However, most computational metrics are not designed to directly measure the perceptual similarity between saliency maps, so that the evaluation results may sometimes be inconsistent with subjective impressions. To address this problem, this paper first conducts extensive subjective tests to find out how the visual similarity between saliency maps is perceived by human beings. Based on the crowdsourced data collected in these tests, we summarize several key factors in evaluating saliency maps and quantify the performance of existing metrics. Inspired by these factors, we propose to learn a saliency evaluation metric based on a two-stream convolutional neural network using the crowdsourced perceptual judgments. Specifically, the relative score of each pair from the crowdsourced data is utilized to regularize the network during training. By capturing the key factors shared by various subjects in comparing saliency maps, the learned metric aligns better with the human perception of saliency maps, making it a good complement to existing metrics. Experimental results validate that the learned metric generalizes to the comparison of saliency maps from new images, new datasets, new models and synthetic data. Due to its effectiveness, the learned metric can also be used to facilitate the development of new models for fixation prediction.
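The pairwise regularization can be illustrated with a small PyTorch sketch: a scorer compares a saliency map against a reference map, and crowdsourced relative judgments are imposed through a margin ranking loss. The network and feature sizes are hypothetical, and the paper's two-stream design is simplified here into a single stacked-input CNN.

```python
# A minimal sketch of using crowdsourced pairwise judgements to regularize a
# learned saliency-evaluation metric (hypothetical network sizes).
import torch
import torch.nn as nn

class MapSimilarityScorer(nn.Module):
    """Scores how similar a predicted saliency map is to a reference map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, pred_map, ref_map):
        return self.net(torch.cat([pred_map, ref_map], dim=1)).squeeze(1)

scorer = MapSimilarityScorer()
ranking_loss = nn.MarginRankingLoss(margin=0.1)

# For one crowdsourced comparison: subjects judged map_a closer to the
# reference than map_b, so the learned score of map_a should be higher.
ref, map_a, map_b = (torch.rand(4, 1, 64, 64) for _ in range(3))
target = torch.ones(4)  # +1 means "first argument should rank higher"
loss = ranking_loss(scorer(map_a, ref), scorer(map_b, ref), target)
loss.backward()
```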
Thanks to advances in deep learning and large-scale annotated data, visual saliency models have made a great leap in performance in recent years. Despite tremendous efforts and huge breakthroughs, however, models still fall short of reaching human-level accuracy. In this work, I explore the landscape of the field, emphasizing new deep saliency models, benchmarks and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Furthermore, factors that contribute to the gap between models and humans are identified, and the remaining issues that need to be addressed to build the next generation of more powerful saliency models are discussed. Some specific questions addressed include: in what ways do current models fail, how can they be remedied, what can be learned from cognitive studies of attention, how do explicit saliency judgments relate to fixations, how can fair model comparison be performed, and what are the important applications of saliency models?
In the past decades, hundreds of saliency models have been proposed for fixation prediction, along with dozens of evaluation metrics. However, existing metrics, which are often heuristically designed, may draw conflicting conclusions when comparing saliency models. As a consequence, it becomes somewhat confusing to select metrics when comparing new models with the state-of-the-art. To address this problem, we propose a data-driven metric for the comprehensive evaluation of saliency models. Instead of heuristically designing such a metric, we first conduct extensive subjective tests to find out how saliency maps are assessed by human beings. Based on the user data collected in the tests, nine representative evaluation metrics are directly compared by quantifying their performance in assessing saliency maps. Moreover, we propose to learn a data-driven metric using a Convolutional Neural Network. Compared with existing metrics, experimental results show that the data-driven metric performs the most consistently with human beings in evaluating saliency maps as well as saliency models.
It is believed that eye movements in free-viewing of natural scenes are directed by both bottom-up visual saliency and top-down visual factors. In this paper, we propose a novel computational framework to simultaneously learn these two types of visual features from raw image data using a multiresolution convolutional neural network (Mr-CNN) for predicting eye fixations. The Mr-CNN is trained directly on image regions centered at fixation and non-fixation locations over multiple resolutions, using raw image pixels as inputs and eye fixation attributes as labels. Diverse top-down visual features can be learned in the higher layers. Meanwhile, bottom-up visual saliency can also be inferred by combining information over multiple resolutions. Finally, the optimal integration of bottom-up and top-down cues can be learned in the last logistic regression layer to predict eye fixations. The proposed approach achieves state-of-the-art results over four publicly available benchmark datasets, demonstrating the superiority of our work.
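A minimal sketch of this patch-based, multi-resolution formulation might look as follows; the three resolutions, the 42x42 patch size and the branch layers are assumptions, with the final linear layer playing the role of the logistic regression over concatenated multi-resolution features.

```python
# A minimal sketch of a multi-resolution, patch-based fixation classifier in
# the spirit of Mr-CNN (hypothetical sizes and layers).
import torch
import torch.nn as nn

class MrCNN(nn.Module):
    def __init__(self, num_resolutions=3, patch_size=42):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, 16, 5), nn.ReLU(inplace=True), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3), nn.ReLU(inplace=True), nn.MaxPool2d(2),
                nn.Flatten(),
            )
            for _ in range(num_resolutions)
        ])
        feat_dim = self._feat_dim(patch_size)
        # Final layer acts as logistic regression over concatenated features.
        self.classifier = nn.Linear(num_resolutions * feat_dim, 1)

    def _feat_dim(self, patch_size):
        with torch.no_grad():
            return self.branches[0](torch.zeros(1, 3, patch_size, patch_size)).shape[1]

    def forward(self, patches):
        # patches: one (B, 3, S, S) batch per resolution, all centered on the
        # same candidate fixation location.
        feats = torch.cat([b(p) for b, p in zip(self.branches, patches)], dim=1)
        return torch.sigmoid(self.classifier(feats))  # probability of fixation

patches = [torch.randn(8, 3, 42, 42) for _ in range(3)]
print(MrCNN()(patches).shape)  # torch.Size([8, 1])
```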
Understanding and predicting the human visual attentional mechanism is an active area of research in the fields of neuroscience and computer vision. In this work, we propose DeepFix, a first-of-its-kind fully convolutional neural network for accurate saliency prediction. Unlike classical works which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture semantics at multiple scales while taking global context into account, using network layers with very large receptive fields. Generally, fully convolutional nets are spatially invariant, which prevents them from modeling location-dependent patterns (e.g. centre-bias). Our network overcomes this limitation by incorporating a novel Location Biased Convolutional layer. We evaluate our model on two challenging eye fixation datasets -- MIT300 and CAT2000 -- and show that it outperforms other recent approaches by a significant margin.
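One common way to realize a location-biased convolution is to concatenate constant coordinate maps to the feature maps before a standard convolution, as in the CoordConv-style sketch below; the actual Location Biased Convolutional layer in DeepFix may differ, so this is only an illustrative approximation.

```python
# A minimal sketch of a location-biased convolution: constant location maps
# are concatenated to the features so the (otherwise translation-invariant)
# convolution can learn position-dependent patterns. The bias maps here are a
# hypothetical choice (normalized x/y coordinates).
import torch
import torch.nn as nn

class LocationBiasedConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # +2 input channels for the x and y coordinate maps.
        self.conv = nn.Conv2d(in_channels + 2, out_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, ys, xs], dim=1))

feat = torch.randn(2, 64, 28, 28)
print(LocationBiasedConv(64, 64)(feat).shape)  # torch.Size([2, 64, 28, 28])
```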
Recently, the improvement of visual saliency prediction has been impressive with the emergence of deep convolutional neural networks (DCNNs). One promising direction for the next step of improvement is to fully characterize the multi-scale saliency-influential factors with computationally friendly modules within DCNN architectures. In this work, we propose an end-to-end dilated inception network (DINet) for visual saliency prediction. It captures multi-scale contextual features effectively with very limited extra parameters. Instead of utilizing parallel standard convolutions with different kernel sizes as the existing inception modules do, our proposed dilated inception module (DIM) uses parallel dilated convolutions with different dilation rates, which can significantly reduce the computation load while enriching the diversity of receptive fields in the feature maps. Moreover, the performance of our saliency model is further improved by using a set of linear normalization-based probability distribution distance metrics as loss functions. As a result, we can formulate saliency prediction as a probability distribution prediction task for global saliency inference instead of a typical pixel-wise regression problem. Experimental results on several challenging saliency benchmark datasets demonstrate that our DINet with the proposed loss functions achieves state-of-the-art performance with a short inference time.
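A minimal sketch of such a dilated inception module in PyTorch is shown below; the branch channel counts and dilation rates are illustrative assumptions rather than the paper's exact configuration.

```python
# A minimal sketch of a dilated inception module (DIM): parallel 3x3
# convolutions with different dilation rates replace parallel convolutions
# with different kernel sizes (channel counts and rates are illustrative).
import torch
import torch.nn as nn

class DilatedInceptionModule(nn.Module):
    def __init__(self, in_channels, branch_channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        # Same spatial size per branch (padding == dilation for 3x3 kernels),
        # so the multi-scale context features can simply be concatenated.
        return torch.cat([b(x) for b in self.branches], dim=1)

x = torch.randn(1, 512, 30, 40)
print(DilatedInceptionModule(512)(x).shape)  # torch.Size([1, 192, 30, 40])
```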
Saliency detection models aiming to quantitatively predict human eye-attended locations in the visual field have been receiving increasing research interest in recent years. Unlike traditional methods that rely on hand-designed features and contrast inference mechanisms, this paper proposes a novel framework to learn saliency detection models from raw image data using deep networks. The proposed framework mainly consists of two learning stages. At the first learning stage, we develop a stacked denoising autoencoder (SDAE) model to learn robust, representative features from raw image data in an unsupervised manner. The second learning stage aims to jointly learn optimal mechanisms to capture the intrinsic mutual patterns as the feature contrast and to integrate them for final saliency prediction. Given the input of pairs of a center patch and its surrounding patches, represented by the features learned at the first stage, an SDAE network is trained under the supervision of eye fixation labels, which achieves both contrast inference and contrast integration simultaneously. Experiments on three publicly available eye tracking benchmarks and comparisons with 16 state-of-the-art approaches demonstrate the effectiveness of the proposed framework.
Saliency in Context (SALICON) is an ongoing effort that aims at understanding and predicting visual attention. Conventional saliency models typically rely on low-level image statistics to predict human fixations. While these models perform significantly better than chance, there is still a large gap between model prediction and human behavior. This gap is largely due to the limited capability of models in predicting eye fixations with strong semantic content, the so-called semantic gap. This paper presents a focused study to narrow the semantic gap with an architecture based on a Deep Neural Network (DNN). It leverages the representational power of high-level semantics encoded in DNNs pretrained for object recognition. Two key components are fine-tuning the DNNs with an objective function based on the saliency evaluation metrics, and integrating information at different image scales. We compare our method with 14 saliency models on 6 public eye tracking benchmark datasets. Results demonstrate that our DNNs can automatically learn features for saliency prediction that surpass the state-of-the-art by a big margin. In addition, our model ranks top to date under all seven metrics on the MIT300 challenge set.
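The two key components can be illustrated with a small sketch: a stand-in backbone run at two image scales whose outputs are integrated, trained with a saliency-metric-style objective such as the KL divergence between normalized maps. The backbone, scales and exact objective are simplified assumptions, not the SALICON implementation.

```python
# A minimal sketch of two-scale integration plus a metric-based objective
# (KL divergence between saliency maps treated as spatial distributions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(       # stand-in for a pretrained DNN
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, img):
        fine = self.backbone(img)
        coarse = self.backbone(F.interpolate(img, scale_factor=0.5,
                                             mode="bilinear", align_corners=False))
        coarse = F.interpolate(coarse, size=fine.shape[-2:],
                               mode="bilinear", align_corners=False)
        return fine + coarse                  # integrate the two scales

def kld_loss(pred, target, eps=1e-8):
    """KL divergence between predicted and ground-truth spatial distributions."""
    b = pred.shape[0]
    p = F.softmax(pred.view(b, -1), dim=1)
    q = target.view(b, -1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)
    return (q * torch.log((q + eps) / (p + eps))).sum(dim=1).mean()

model = TwoScaleSaliency()
img, fixation_density = torch.randn(2, 3, 96, 128), torch.rand(2, 1, 96, 128)
loss = kld_loss(model(img), fixation_density)
loss.backward()
```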
In this paper, we propose an integrated model of semantic-aware and contrast-aware saliency, which combines bottom-up and top-down cues for effective saliency estimation and eye fixation prediction. The proposed model processes visual information along two pathways. The first pathway aims to capture the attractive semantic information in images, especially the presence of meaningful objects and object parts such as human faces. The second pathway is based on multi-scale online feature learning and information maximization, which learns an adaptive sparse representation of the input and detects the high-contrast salient patterns within the image context. The two pathways characterize long-term and short-term attention cues and are dynamically integrated using maxima normalization. We investigate two instantiations of the semantic pathway, namely an end-to-end deep neural network solution and a dynamic feature integration solution, which yield the SCA and SCAFI models, respectively. Experimental results on artificial images and popular benchmark datasets demonstrate the superior performance and better plausibility of the proposed models over both classic approaches and recent deep models.
In this work, we contribute to video saliency research in two ways. First, we introduce a new benchmark for predicting human eye movements during dynamic scene free-viewing, which has long been called for in this field. Our dataset, named DHF1K (Dynamic Human Fixation), consists of 1K high-quality, elaborately selected video sequences spanning a large range of scenes, motions, object types and background complexity. Existing video saliency datasets lack variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments. In contrast, DHF1K makes a significant leap in terms of scalability, diversity and difficulty, and is expected to boost video saliency modeling. Second, we propose a novel video saliency model that augments the CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. We thoroughly examine the performance of our model, with respect to state-of-the-art saliency models, on three large-scale datasets (i.e., DHF1K, Hollywood2, UCF sports). Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that our model outperforms other competitors.
Computational saliency models for still images have gained significant popularity in recent years. Saliency prediction from videos, on the other hand, has received relatively little interest from the community. Motivated by this, in this work, we study the use of deep learning for dynamic saliency prediction and propose the so-called spatio-temporal saliency networks. The key to our models is the architecture of two-stream networks where we investigate different fusion mechanisms to integrate spatial and temporal information. We evaluate our models on the DIEM and UCF-Sports datasets and present highly competitive results against the existing state-of-the-art models. We also carry out some experiments on a number of still images from the MIT300 dataset by exploiting the optical flow maps predicted from these images. Our results show that considering inherent motion information in this way can be helpful for static saliency estimation.
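A minimal sketch of a two-stream network with simple late fusion is given below; the per-stream layers and the particular fusion scheme are illustrative, whereas the paper investigates several fusion mechanisms.

```python
# A minimal sketch of a two-stream spatio-temporal saliency network: one
# stream for appearance (RGB frame), one for motion (optical flow), merged by
# a 1x1 convolution (hypothetical layer sizes).
import torch
import torch.nn as nn

def make_stream(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(inplace=True),
    )

class TwoStreamSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial = make_stream(in_channels=3)    # RGB frame
        self.temporal = make_stream(in_channels=2)   # optical flow (dx, dy)
        self.fuse = nn.Conv2d(32, 1, kernel_size=1)  # late fusion

    def forward(self, frame, flow):
        feats = torch.cat([self.spatial(frame), self.temporal(flow)], dim=1)
        return torch.sigmoid(self.fuse(feats))

frame, flow = torch.randn(2, 3, 112, 112), torch.randn(2, 2, 112, 112)
print(TwoStreamSaliency()(frame, flow).shape)  # torch.Size([2, 1, 112, 112])
```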
As an important problem in computer vision, salient object detection (SOD) in images has attracted an increasing amount of research effort over the years. Not surprisingly, recent advances in SOD are dominated by deep learning-based solutions (named deep SOD), as reflected by hundreds of papers. To facilitate an in-depth understanding of deep SOD, in this paper we provide a comprehensive survey covering various aspects ranging from algorithm taxonomy to unsolved open issues. In particular, we first review SOD algorithms from different perspectives, including network architecture, level of supervision, learning paradigm and object/instance-level detection. Following that, we summarize existing SOD evaluation datasets and metrics. Then, we carefully compile a thorough benchmark of SOD methods based on previous work and provide a detailed analysis of the comparison results. Moreover, we study the performance of SOD algorithms under different attributes by constructing a novel SOD dataset with rich attribute annotations, which has rarely been explored before. We further analyze, for the first time in the field, the robustness and transferability of deep SOD models with respect to adversarial attacks. We also look into the influence of input perturbations, and the generalization and hardness of existing SOD datasets. Finally, we discuss several open issues and challenges of SOD and point out possible research directions in the future. All the saliency prediction maps, our constructed dataset with annotations, and codes for evaluation are publicly available at https://github.com/wenguanwang/SODsurvey.
Detecting conspicuous image content is a challenging task in the field of computer vision. In existing studies, most approaches focus on estimating saliency only with the cues from the input image. However, such "intrinsic" cues are often insufficient to distinguish targets and distractors that may share some common visual attributes. To address this problem, we present an approach to estimate image saliency by measuring the joint visual surprise from intrinsic and extrinsic contexts. In this approach, a hierarchical context model is first built on a database of 31.2 million images, where a Gaussian mixture model (GMM) is trained for each leaf node to encode the prior knowledge on "what is where" in a specific scene. For a testing image that shares similar spatial layout within a scene, the pre-trained GMM can serve as an extrinsic context model to measure the "surprise" of an image patch. Since human attention may quickly shift between different surprising locations, we adopt a Markov chain to model a surprise-driven attention-shifting process so as to infer the salient patches that can best capture human attention. Experiments show that our approach outperforms 19 state-of-the-art methods in fixation prediction.
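The extrinsic-context idea can be sketched with an off-the-shelf Gaussian mixture model: patches that are unlikely under a scene's "what is where" prior receive a high surprise score. The patch features below (mean color plus normalized position) are hypothetical, and the hierarchical context model and Markov-chain attention shifting are not reproduced.

```python
# A minimal sketch of scoring "surprise" as negative log-likelihood under a
# pre-trained Gaussian mixture model of scene context (features are made up).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Each row: [mean R, mean G, mean B, x, y] of a patch from scene-similar images.
context_patches = rng.random((5000, 5))
gmm = GaussianMixture(n_components=8, random_state=0).fit(context_patches)

# Patches of a test image that shares a similar spatial layout.
test_patches = rng.random((100, 5))
surprise = -gmm.score_samples(test_patches)   # negative log-likelihood
most_surprising = np.argsort(surprise)[::-1][:5]
print(most_surprising, surprise[most_surprising])
```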
Most saliency estimation methods aim to explicitly model low-level conspicuity cues such as edges or blobs and may additionally incorporate top-down cues using face or text detection. Data-driven methods for training saliency models using eye-fixation data are increasingly popular, particularly with the introduction of large-scale datasets and deep architectures. However, current methods in this latter paradigm use loss functions designed for classification or regression tasks whereas saliency estimation is evaluated on topographical maps. In this work, we introduce a new saliency map model which formulates a map as a generalized Bernoulli distribution. We then train a deep architecture to predict such maps using novel loss functions which pair the softmax activation function with measures designed to compute distances between probability distributions. We show in extensive experiments the effectiveness of such loss functions over standard ones on four public benchmark datasets, and demonstrate improved performance over state-of-the-art saliency methods.
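Pairing a spatial softmax with a distribution distance can be sketched as follows: the network output is normalized into a categorical (generalized Bernoulli) distribution over pixel locations and compared with the normalized fixation map. The cross-entropy form below is one such pairing among those the paper studies.

```python
# A minimal sketch of a loss that treats a saliency map as a generalized
# Bernoulli (categorical) distribution over pixel locations.
import torch
import torch.nn.functional as F

def saliency_cross_entropy(logits, fixation_map, eps=1e-8):
    """logits, fixation_map: (B, 1, H, W). Returns a scalar loss."""
    b = logits.shape[0]
    log_p = F.log_softmax(logits.view(b, -1), dim=1)   # predicted distribution
    q = fixation_map.view(b, -1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)         # target distribution
    return -(q * log_p).sum(dim=1).mean()

logits = torch.randn(4, 1, 48, 64, requires_grad=True)
fix_map = torch.rand(4, 1, 48, 64)
loss = saliency_cross_entropy(logits, fix_map)
loss.backward()
print(float(loss))
```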
Data-driven saliency has recently attracted much attention thanks to the use of convolutional neural networks for predicting gaze fixations. In this paper, we go beyond the standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and propose a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of our solution is a convolutional LSTM that focuses on the most salient regions of the input image to iteratively refine the predicted saliency map. Additionally, to tackle the center bias typical of human eye fixations, our model can learn a set of prior maps generated with Gaussian functions. We show, through an extensive evaluation, that the proposed architecture outperforms the current state of the art on public saliency prediction datasets. We further study the contribution of each key component to demonstrate their robustness in different scenarios.
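The learned priors can be sketched as a set of 2D Gaussians with trainable centers and widths, rendered on the feature grid; how many priors are used and how they are injected into the network are assumptions here, and the convolutional LSTM refinement of the full model is not shown.

```python
# A minimal sketch of learnable Gaussian prior maps for modeling center bias
# (number of priors and their use downstream are illustrative assumptions).
import torch
import torch.nn as nn

class LearnedGaussianPriors(nn.Module):
    def __init__(self, num_priors=8):
        super().__init__()
        self.mu = nn.Parameter(torch.rand(num_priors, 2))          # centers in [0, 1]
        self.log_sigma = nn.Parameter(torch.zeros(num_priors, 2))  # log std devs

    def forward(self, height, width):
        ys = torch.linspace(0, 1, height).view(1, height, 1)
        xs = torch.linspace(0, 1, width).view(1, 1, width)
        sigma = self.log_sigma.exp()
        dy = (ys - self.mu[:, 0].view(-1, 1, 1)) / sigma[:, 0].view(-1, 1, 1)
        dx = (xs - self.mu[:, 1].view(-1, 1, 1)) / sigma[:, 1].view(-1, 1, 1)
        return torch.exp(-0.5 * (dx ** 2 + dy ** 2))  # (num_priors, H, W)

priors = LearnedGaussianPriors()(30, 40)
print(priors.shape)  # torch.Size([8, 30, 40])
```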
Visual saliency is a useful cue to locate the conspicuous image content. To estimate saliency, many approaches have been proposed to detect the unique or rare visual stimuli. However, such bottom-up solutions are often insufficient since the prior knowledge, which often indicates a biased selectivity on the input stimuli, is not taken into account. To solve this problem, this paper presents a novel approach to estimate image saliency by learning the prior knowledge. In our approach, the influences of the visual stimuli and the prior knowledge are jointly incorporated into a Bayesian framework. In this framework, the bottom-up saliency is calculated to pop-out the visual subsets that are probably salient, while the prior knowledge is used to recover the wrongly suppressed targets and inhibit the improperly popped-out distractors. Compared with existing approaches, the prior knowledge used in our approach, including the foreground prior and the correlation prior, is statistically learned from 9.6 million images in an unsupervised manner. Experimental results on two public benchmarks show that such statistical priors are effective to modulate the bottom-up saliency to achieve impressive improvements when compared with 10 state-of-the-art methods.
Nearly all existing visual saliency models so far have focused on predicting a universal saliency map shared by all observers. However, psychological studies suggest that the visual attention of different observers can vary significantly under certain circumstances, especially when a scene is composed of multiple salient objects. To study such heterogeneous visual attention patterns across observers, we first construct a personalized saliency dataset and explore the correlations between visual attention, personal preferences and image contents. Specifically, we propose to decompose a personalized saliency map (referred to as PSM) into a universal saliency map (referred to as USM), which can be predicted by existing saliency detection models, and a new discrepancy map across users that characterizes personalized saliency. We then present two solutions for predicting such discrepancy maps, i.e., a multi-task convolutional neural network (CNN) framework and an extended CNN with person-specific information encoded filters (CNN-PIEF). Extensive experimental results demonstrate the effectiveness of our models for PSM prediction as well as their generalization capability to unseen observers.
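A minimal multi-task sketch of the PSM = USM + discrepancy decomposition is given below: a shared backbone takes the image together with a universal saliency map from any existing model, and one head per observer predicts that observer's discrepancy map. The backbone and head sizes are illustrative assumptions.

```python
# A minimal sketch of personalized saliency as USM plus a person-specific
# discrepancy map, with one prediction head per observer (hypothetical sizes).
import torch
import torch.nn as nn

class PersonalizedSaliency(nn.Module):
    def __init__(self, num_persons=10):
        super().__init__()
        self.backbone = nn.Sequential(          # shared across all observers
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        # One discrepancy-prediction head per observer (multi-task setting).
        self.heads = nn.ModuleList(
            [nn.Conv2d(32, 1, 3, padding=1) for _ in range(num_persons)]
        )

    def forward(self, image, usm, person_id):
        feats = self.backbone(torch.cat([image, usm], dim=1))
        discrepancy = self.heads[person_id](feats)
        return usm + discrepancy                 # personalized saliency map

image, usm = torch.randn(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
psm = PersonalizedSaliency()(image, usm, person_id=3)
print(psm.shape)  # torch.Size([1, 1, 64, 64])
```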
Image-based salient object detection (SOD) has been extensively studied in the past decades. However, video-based SOD is much less explored since there is a lack of large-scale video datasets in which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos (64 minutes). In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in video can be defined as objects that consistently pop out throughout the video, and objects with such attributes can be unambiguously annotated by combining the manually annotated object/region masks with the eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD by using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner, which automatically infer a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. Experimental results show that the proposed unsupervised approach outperforms 30 state-of-the-art models on the proposed dataset, including 19 image-based & classic (unsupervised or non-deep learning), 6 image-based & deep learning, and 5 video-based & unsupervised ones. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.
This paper presents a novel deep architecture for saliency prediction. Current state-of-the-art models for saliency prediction employ fully convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural Network (CNN). Our model is composed of three main blocks: a feature extraction CNN, a feature encoding network that weights low- and high-level feature maps, and a prior learning network. We compare our solution with state-of-the-art saliency models on two public benchmark datasets. Results show that our model outperforms the state of the art under all evaluation metrics on the SALICON dataset, which is currently the largest public dataset for saliency prediction, and achieves competitive results on the MIT300 benchmark. Code is available at https://github.com/marcellacornia/mlnet.
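A minimal sketch of combining features from different CNN depths is shown below: lower- and higher-level maps are resized to a common resolution, concatenated, and weighted by a 1x1 encoding convolution. The toy backbone stands in for the feature extraction CNN, and the prior learning network is omitted.

```python
# A minimal sketch of multi-level feature combination for saliency prediction
# (toy backbone; layer sizes are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(2))
        # 1x1 "feature encoding" convolution weights the stacked levels.
        self.encode = nn.Conv2d(16 + 32 + 64, 1, kernel_size=1)

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        size = f3.shape[-2:]
        fused = torch.cat([F.adaptive_avg_pool2d(f1, size),
                           F.adaptive_avg_pool2d(f2, size), f3], dim=1)
        return torch.relu(self.encode(fused))

x = torch.randn(1, 3, 128, 128)
print(MultiLevelSaliency()(x).shape)  # torch.Size([1, 1, 16, 16])
```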