When people try to find particular objects in natural scenes they make extensive use of knowledge about how and where objects tend to appear in a scene. Although many forms of such "top-down" knowledge have been incorporated into saliency map models of visual search, surprisingly, the role of object appearance has been infrequently investigated. Here we present an appearance-based saliency model derived in a Bayesian framework. We compare our approach with bottom-up saliency algorithms as well as the state-of-the-art Contextual Guidance model of Torralba et al. (2006) at predicting human fixations. Although the two top-down approaches use very different types of information, they achieve similar performance; each is substantially better than the purely bottom-up models. Our experiments reveal that a simple model of object appearance can predict human fixations quite well, even making the same mistakes as people.
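A minimal sketch of the kind of top-down term such an appearance-based Bayesian model relies on: a location is scored by how much more likely its local features are under a target-appearance model than under a generic scene model. The Gaussian appearance models and the three-dimensional feature map below are illustrative assumptions, not the paper's actual features.

```python
import numpy as np
from scipy.stats import multivariate_normal

def appearance_saliency(feature_map, target_mean, target_cov, scene_mean, scene_cov):
    """Top-down appearance term p(features | target) / p(features), evaluated at
    every pixel of an (H, W, D) feature map.  Locations whose features are much
    more likely under the target-appearance model than under the generic scene
    model receive high saliency."""
    h, w, d = feature_map.shape
    f = feature_map.reshape(-1, d)
    p_target = multivariate_normal.pdf(f, mean=target_mean, cov=target_cov)
    p_scene = multivariate_normal.pdf(f, mean=scene_mean, cov=scene_cov)
    return (p_target / (p_scene + 1e-12)).reshape(h, w)

# Toy usage with colour features and hand-picked Gaussian appearance models.
feature_map = np.random.rand(60, 80, 3)
sal = appearance_saliency(feature_map,
                          target_mean=np.array([0.9, 0.2, 0.2]), target_cov=0.02 * np.eye(3),
                          scene_mean=np.array([0.5, 0.5, 0.5]), scene_cov=0.10 * np.eye(3))
```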
We present a novel unified framework for both static and space-time saliency detection. Our method is a bottom-up approach that computes so-called local regression kernels (i.e., local descriptors) from the given image (or video), which measure the likeness of a pixel (or voxel) to its surroundings. Visual saliency is then computed using this "self-resemblance" measure. The framework results in a saliency map where each pixel (or voxel) indicates the statistical likelihood of saliency of a feature matrix given its surrounding feature matrices. As a similarity measure, matrix cosine similarity (a generalization of cosine similarity) is employed. State-of-the-art performance is demonstrated on commonly used human eye fixation data (static scenes (N. Bruce & J. Tsotsos, 2006) and dynamic scenes (L. Itti & P. Baldi, 2006)) and on some psychological patterns.
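A rough sketch of the self-resemblance idea, using plain gradient patches in place of the paper's local steering kernels: each pixel's feature matrix is compared with its neighbours' via matrix cosine similarity, and pixels that resemble their surroundings little become salient. The descriptor, window sizes, and the final pooling are simplifications, not the published algorithm.

```python
import numpy as np
from scipy import ndimage

def self_resemblance_saliency(gray, patch=7, search=3):
    """Toy self-resemblance saliency on a grayscale float image."""
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    h, w = gray.shape
    r = patch // 2

    def descriptor(y, x):
        # feature matrix for a pixel: stacked gradient values in its patch
        return np.stack([gx[y - r:y + r + 1, x - r:x + r + 1],
                         gy[y - r:y + r + 1, x - r:x + r + 1]],
                        axis=-1).reshape(patch * patch, 2)

    sal = np.zeros((h, w))
    for y in range(r + search, h - r - search):
        for x in range(r + search, w - r - search):
            F = descriptor(y, x)
            sims = []
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    if dy == 0 and dx == 0:
                        continue
                    G = descriptor(y + dy, x + dx)
                    # matrix cosine similarity: Frobenius inner product over norms
                    sims.append(np.sum(F * G) /
                                (np.linalg.norm(F) * np.linalg.norm(G) + 1e-8))
            # low likeness to the surroundings -> high saliency
            sal[y, x] = 1.0 / (np.exp(np.array(sims)).sum() + 1e-8)
    return sal
```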
To detect visually salient elements of complex natural scenes, computational bottom-up saliency models commonly examine several feature channels, such as color and orientation, in parallel. They compute a separate feature map for each channel and then linearly combine these maps to produce a master saliency map. However, only a few studies have investigated how different feature dimensions contribute to overall visual saliency. We address this integration issue and propose to use covariance matrices of simple image features (known as region covariance descriptors in the computer vision community; Tuzel, Porikli, & Meer, 2006) as meta-features for saliency estimation. As low-dimensional representations of image patches, region covariances capture local image structures better than standard linear filters, but, more importantly, they naturally provide nonlinear integration of different features by modeling their correlations. We also show that first-order statistics of features can easily be incorporated into the proposed approach to improve performance. Our experimental evaluation on several benchmark data sets demonstrates that the proposed approach outperforms state-of-the-art models on various tasks, including prediction of human eye fixations, salient object detection, and image retargeting.
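A compact illustration of region covariances used as meta-features, under simplifying assumptions: each block of the image is summarised by the covariance of a handful of per-pixel features, and blocks whose covariance deviates from the rest are marked salient. The feature set, the block tiling, and the plain Frobenius distance (standing in for a proper metric on covariance matrices) are all stand-ins, not the published model.

```python
import numpy as np
from scipy import ndimage

def region_covariance_saliency(gray, block=16):
    """Toy region-covariance saliency for a grayscale image."""
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    h, w = gray.shape
    feats = np.stack([gray, np.abs(gx), np.abs(gy)], axis=-1)  # per-pixel features

    covs, coords = [], []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            f = feats[y:y + block, x:x + block].reshape(-1, feats.shape[-1])
            covs.append(np.cov(f, rowvar=False))   # covariance descriptor of the block
            coords.append((y, x))
    covs = np.array(covs)

    sal = np.zeros((h, w))
    for i, (y, x) in enumerate(coords):
        # Frobenius distance to every other block's covariance descriptor
        d = np.linalg.norm(covs - covs[i], axis=(1, 2))
        sal[y:y + block, x:x + block] = d.mean()
    return sal
```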
It has been suggested that saliency mechanisms play a role in perceptual organization. This work evaluates the plausibility of a recently proposed generic principle for visual saliency: that all saliency decisions are optimal in a decision-theoretic sense. The discriminant saliency hypothesis is combined with the classical assumption that bottom-up saliency is a center-surround process to derive a (decision-theoretic) optimal saliency architecture. Under this architecture, the saliency of each image location is equated to the discriminant power of a set of features with respect to the classification problem that opposes stimuli at center and surround. The optimal saliency detector is derived for various stimulus modalities, including intensity, color, orientation, and motion, and shown to make accurate quantitative predictions of various psychophysics of human saliency for both static and motion stimuli. These include some classical nonlinearities of orientation and motion saliency and a Weber law that governs various types of saliency asymmetries. The discriminant saliency detectors are also applied to various saliency problems of interest in computer vision, including the prediction of human eye fixations on natural scenes, motion-based saliency in the presence of ego-motion, and background subtraction in highly dynamic scenes. In all cases, the discriminant saliency detectors outperform previously proposed methods from both the saliency and the general computer vision literatures.
Visual attention is a process that enables biological and machine vision systems to select the most relevant regions from a scene. Relevance is determined by two components: 1) top-down factors driven by task and 2) bottom-up factors that highlight image regions that are different from their surroundings. The latter are often referred to as "visual saliency." Modeling bottom-up visual saliency has been the subject of numerous research efforts during the past 20 years, with many successful applications in computer vision and robotics. Available models have been tested with different datasets (e.g., synthetic psychological search arrays, natural images or videos) using different evaluation scores (e.g., search slopes, comparison to human eye tracking) and parameter settings. This has made direct comparison of models difficult. Here, we perform an exhaustive comparison of 35 state-of-the-art saliency models over 54 challenging synthetic patterns, three natural image datasets, and two video datasets, using three evaluation scores. We find that although model rankings vary, some models consistently perform better. Analysis of datasets reveals that existing datasets are highly center-biased, which influences some of the evaluation scores. Computational complexity analysis shows that some models are very fast, yet yield competitive eye movement prediction accuracy. Different models often have common easy/difficult stimuli. Furthermore, several concerns in visual saliency modeling, eye movement datasets, and evaluation scores are discussed and insights for future work are provided. Our study allows one to assess the state of the art, helps to organize this rapidly growing field, and sets a unified comparison framework for gauging future efforts, similar to the PASCAL VOC challenge in the object recognition and detection domains.
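One of the evaluation scores routinely used in such comparisons is an ROC/AUC measure of how well a saliency map separates fixated from non-fixated locations. The sketch below shows a bare-bones version of that idea; the fixation format, the random negative sampling, and the absence of any center-bias correction are simplifying assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fixation_auc(saliency_map, fixations, n_negatives=1000, seed=0):
    """AUC-style score: treat the saliency map as a classifier separating
    fixated pixels from randomly sampled non-fixated pixels."""
    rng = np.random.default_rng(seed)
    h, w = saliency_map.shape
    pos = np.array([saliency_map[y, x] for (y, x) in fixations])
    neg = saliency_map[rng.integers(0, h, n_negatives),
                       rng.integers(0, w, n_negatives)]
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, scores)
```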
Organisms use the process of selective attention to optimally allocate their computational resources to the instantaneously most relevant subsets of a visual scene, ensuring that they can parse the scene in real time. Many models of bottom-up attentional selection assume that elementary image features, like intensity, color and orientation, attract attention. Gestalt psychologists, however, argue that humans perceive whole objects before they analyze individual features. This is supported by recent psychophysical studies showing that objects predict eye fixations better than features. In this report we present a neurally inspired algorithm of object-based, bottom-up attention. The model rivals the performance of state-of-the-art, non-biologically-plausible feature-based algorithms (and outperforms biologically plausible feature-based algorithms) in its ability to predict perceptual saliency (eye fixations and subjective interest points) in natural scenes. The model achieves this by computing saliency as a function of proto-objects that establish the perceptual organization of the scene. All computational mechanisms of the algorithm have direct neural correlates, and our results provide evidence for the interface theory of attention.
This paper presents a novel approach to visual saliency that relies on a contextually adapted representation produced through adaptive whitening of color and scale features. Unlike previous models, the proposal is grounded on the specific adaptation of the basis of low-level features to the statistical structure of the image. Adaptation is achieved through decorrelation and contrast normalization in several steps of a hierarchical approach, in compliance with coarse features described in biological visual systems. Saliency is simply computed as the square of the vector norm in the resulting representation. The performance of the model is compared with several state-of-the-art approaches in predicting human fixations using three different eye-tracking datasets. When this measure is referenced to the performance of human priority maps, the model proves to be the only one able to maintain the same behavior across different datasets, showing itself to be free of biases. Moreover, it is able to predict a wide set of relevant psychophysical observations that, to our knowledge, have not been reproduced together by any other model before.
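A toy version of the whitening-plus-squared-norm recipe described above, with strong simplifications: the published model whitens a richer multi-scale decomposition of color and scale features, whereas here only the raw color channels are ZCA-whitened over the image before the squared vector norm is taken per pixel.

```python
import numpy as np

def whitened_norm_saliency(rgb):
    """Decorrelate (ZCA-whiten) the colour channels over the whole image and
    return the squared norm of the whitened representation at each pixel."""
    h, w, c = rgb.shape
    X = rgb.reshape(-1, c).astype(float)
    X -= X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    zca = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T  # ZCA whitening matrix
    Xw = X @ zca.T
    return (Xw ** 2).sum(axis=1).reshape(h, w)  # saliency = squared vector norm
```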
Despite significant recent progress, the best available visual saliency models still lag behind human performance in predicting eye fixations in free viewing of natural scenes. The majority of models are based on low-level visual features, and the importance of top-down factors has not yet been fully explored or modeled. Here, we combine low-level features such as orientation, color, intensity, and the saliency maps of previous best bottom-up models with top-down cognitive visual features (e.g., faces, humans, cars, etc.) and learn a direct mapping from those features to eye fixations using Regression, SVM, and AdaBoost classifiers. Through extensive experiments over three benchmark eye-tracking datasets using three popular evaluation scores, we show that our boosting model outperforms 27 state-of-the-art models and is so far the closest to the accuracy of the human model for fixation prediction. Furthermore, our model successfully detects the most salient object in a scene without sophisticated image processing such as region segmentation.
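A minimal sketch of the learned feature-to-fixation mapping, using the AdaBoost variant mentioned above on synthetic data: every pixel contributes a feature vector (low-level responses plus top-down detector outputs) with a fixated/not-fixated label, a classifier is trained, and its probability output becomes the learned saliency. The feature dimensions and data here are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical training data: one row of per-pixel feature values
# (e.g. intensity, colour, orientation, a face-detector response) per sample,
# labelled 1 for fixated pixels and 0 for non-fixated pixels.
rng = np.random.default_rng(0)
X_train = rng.random((2000, 4))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 3] + 0.1 * rng.standard_normal(2000)) > 0.8

clf = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)

# At test time, every pixel's feature vector is scored; the probability of the
# "fixated" class becomes the learned saliency value for that pixel.
X_test = rng.random((5, 4))
saliency_values = clf.predict_proba(X_test)[:, 1]
```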
We introduce a saliency model based on two key ideas. The first is to consider local and global image patch rarities as two complementary processes. The second is based on our observation that, for different images, one of the RGB and Lab color spaces outperforms the other in saliency detection. We propose a framework that measures patch rarities in each color space and combines them in a final map. For each color channel, the input image is first partitioned into non-overlapping patches, and each patch is then represented by a vector of coefficients that linearly reconstruct it from a learned dictionary of patches from natural scenes. Next, two measures of saliency (local and global) are calculated and fused to indicate the saliency of each patch. Local saliency is the distinctiveness of a patch from its surrounding patches. Global saliency is the inverse of a patch's probability of occurring over the entire image. The final saliency map is built by normalizing and fusing the local and global saliency maps of all channels from both color systems. Extensive evaluation over four benchmark eye-tracking datasets shows the significant advantage of our approach over 10 state-of-the-art saliency models.
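A single-channel sketch of the global-rarity half of this idea. It deviates from the published model in several labelled ways: overlapping patches from the image itself replace non-overlapping patches coded against a natural-scene dictionary, only global rarity is computed (local rarity against neighbouring patches is omitted), and a Gaussian model of the codes stands in for the paper's probability estimate.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

def global_patch_rarity(gray, patch=8, n_atoms=32):
    """Code each patch against a learned dictionary and score how improbable
    its coefficient vector is under a Gaussian model of all codes."""
    patches = extract_patches_2d(gray, (patch, patch))
    X = patches.reshape(len(patches), -1).astype(float)
    X -= X.mean(axis=1, keepdims=True)                   # remove patch DC component

    dico = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0,
                                       batch_size=256, random_state=0)
    codes = dico.fit_transform(X)                        # reconstruction coefficients

    mu = codes.mean(axis=0)
    cov = np.cov(codes, rowvar=False) + 1e-6 * np.eye(n_atoms)
    diff = codes - mu
    rarity = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)  # Mahalanobis distance

    out_h = gray.shape[0] - patch + 1
    out_w = gray.shape[1] - patch + 1
    return rarity.reshape(out_h, out_w)                  # rarer patch -> more salient
```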
In this paper, we present a unified statistical framework for modeling both saccadic eye movements and visual saliency. By analyzing the statistical properties of human eye fixations on natural images, we found that human attention is sparsely distributed and usually deployed to locations with abundant structural information. This observation inspired us to model saccadic behavior and visual saliency based on super-Gaussian component (SGC) analysis. Our model sequentially obtains SGCs using projection pursuit and generates eye movements by selecting the location with maximum SGC response. Beyond simulating human saccadic behavior, we also demonstrate effectiveness and robustness superior to the state of the art through extensive experiments on synthetic patterns and human eye fixation benchmarks. Multiple key issues in saliency modeling research, such as individual differences and the effects of scale and blur, are explored in this paper. Based on extensive qualitative and quantitative experimental results, we show the promising potential of statistical approaches for human behavior research.
Many successful models for predicting attention in a scene involve three main steps: convolution with a set of filters, a center-surround mechanism and spatial pooling to construct a saliency map. However, integrating spatial information and justifying the choice of various parameter values remain open problems. In this paper we show that an efficient model of color appearance in human vision, which contains a principled selection of parameters as well as an innate spatial pooling mechanism, can be generalized to obtain a saliency model that outperforms state-of-the-art models. Scale integration is achieved by an inverse wavelet transform over the set of scale-weighted center-surround responses. The scale-weighting function (termed ECSF) has been optimized to better replicate psychophysical data on color appearance, and the appropriate sizes of the center-surround inhibition windows have been determined by training a Gaussian Mixture Model on eye-fixation data, thus avoiding ad-hoc parameter selection. Additionally, we conclude that the extension of a color appearance model to saliency estimation adds to the evidence for a common low-level visual front-end for different visual tasks.
The technique of visual saliency detection supports video surveillance systems by reducing redundant information and highlighting the critical, visually important regions. It follows that the information content of an image might be of great importance in characterizing visual saliency. However, the majority of existing methods extract contrast-like features without considering the contribution of information content. Based on the hypothesis that information divergence leads to visual saliency, a two-stage framework for saliency detection, namely the information divergence model (IDM), is introduced in this paper. The term "information divergence" is used to express the non-uniform distribution of visual information in an image. The first stage extracts sparse features by employing independent component analysis (ICA) and a difference-of-Gaussians (DoG) filter. The second stage improves the Bayesian surprise model to compute information divergence across an image. A visual saliency map is finally obtained from the information divergence. Experiments are conducted on natural image databases, psychological patterns, and video surveillance sequences. The results show the effectiveness of the proposed method by comparing it with 13 state-of-the-art visual saliency detection methods.
A bottom-up visual saliency detector is proposed, following a decision-theoretic formulation of saliency previously developed for top-down processing (object recognition) [5]. The saliency of a given location of the visual field is defined as the power of a Gabor-like feature set to discriminate between the visual appearance of 1) a neighborhood centered at that location (the center) and 2) a neighborhood that surrounds it (the surround). Discrimination is defined in an information-theoretic sense, and the optimal saliency detector is derived for a class of stimuli that complies with known statistical properties of natural images, so as to achieve a computationally efficient solution. The resulting saliency detector is shown to replicate the fundamental properties of the psychophysics of pre-attentive vision, including stimulus pop-out, inability to detect feature conjunctions, asymmetries with respect to feature presence vs. absence, and compliance with Weber's law. It is also shown that the detector produces better predictions of human eye fixations than two previously proposed bottom-up saliency detectors.
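A simplified illustration of the center-surround discrimination intuition: at each sampled location the distribution of a feature response in a small center window is compared with the distribution in a larger surround window, and the easier the two are to tell apart, the higher the saliency. The published detector computes the mutual information between features and the center/surround label in closed form; the histogram-and-KL version below, the grid sampling, and the choice of feature map are simplifications.

```python
import numpy as np
from scipy import ndimage

def center_surround_kl(feature_map, center=8, surround=24, bins=16):
    """Toy center-surround discriminability via KL divergence of histograms."""
    h, w = feature_map.shape
    edges = np.linspace(feature_map.min(), feature_map.max() + 1e-8, bins + 1)
    sal = np.zeros((h, w))
    step = center  # evaluate on a coarse grid to keep the loop short
    for y in range(surround, h - surround, step):
        for x in range(surround, w - surround, step):
            c = feature_map[y - center:y + center, x - center:x + center].ravel()
            s = feature_map[y - surround:y + surround, x - surround:x + surround].ravel()
            pc, _ = np.histogram(c, bins=edges)
            ps, _ = np.histogram(s, bins=edges)
            pc = (pc + 1e-8) / (pc + 1e-8).sum()
            ps = (ps + 1e-8) / (ps + 1e-8).sum()
            sal[y - step // 2:y + step // 2,
                x - step // 2:x + step // 2] = np.sum(pc * np.log(pc / ps))
    return sal

# The feature map could be a simple band-pass (center-surround) response, e.g.:
# response = ndimage.gaussian_filter(image, 2) - ndimage.gaussian_filter(image, 4)
```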
Most bottom-up models that predict human eye fixations are based on contrast features. The saliency model of Itti, Koch and Niebur is an example of such contrast-saliency models. Although the model has been successfully compared to human eye fixations, we show that it lacks preciseness in the prediction of fixations on mirror-symmetrical forms. The contrast model gives high response at the borders, whereas human observers consistently look at the symmetrical center of these forms. We propose a saliency model that predicts eye fixations using local mirror symmetry. To test the model, we performed an eye-tracking experiment with participants viewing complex photographic images and compared the data with our symmetry model and the contrast model. The results show that our symmetry model predicts human eye fixations significantly better on a wide variety of images including many that are not selected for their symmetrical content. Moreover, our results show that especially early fixations are on highly symmetrical areas of the images. We conclude that symmetry is a strong predictor of human eye fixations and that it can be used as a predictor of the order of fixation.
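A crude sketch of why a symmetry measure highlights the centers of mirror-symmetric forms rather than their contrast edges: each window is correlated with its left-right and top-bottom mirrored versions, and the correlation peaks at the symmetry center. The published model uses a gradient-based isophote symmetry operator at multiple scales; this windowed correlation is only an illustrative stand-in.

```python
import numpy as np

def mirror_symmetry_saliency(gray, half=8):
    """Local mirror-symmetry score: correlation of a window with its mirrors."""
    h, w = gray.shape
    sal = np.zeros((h, w))
    for y in range(half, h - half):
        for x in range(half, w - half):
            win = gray[y - half:y + half, x - half:x + half].astype(float)
            win = win - win.mean()
            denom = (win ** 2).sum() + 1e-8
            lr = (win * win[:, ::-1]).sum() / denom   # left-right symmetry
            tb = (win * win[::-1, :]).sum() / denom   # top-bottom symmetry
            sal[y, x] = max(0.0, (lr + tb) / 2.0)
    return sal
```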
We propose a novel algorithm to detect visual saliency from video signals by combining spatial and temporal information with statistical uncertainty measures. The main novelty of the proposed method is twofold. First, separate spatial and temporal saliency maps are generated, where the computation of temporal saliency incorporates a recent psychological study of human visual speed perception. Second, the spatial and temporal saliency maps are merged into one using a spatiotemporally adaptive entropy-based uncertainty weighting approach. The spatial uncertainty weighting incorporates the characteristics of proximity and continuity of spatial saliency, while the temporal uncertainty weighting takes into account the variations of background motion and local contrast. Experimental results show that the proposed spatiotemporal uncertainty weighting algorithm significantly outperforms state-of-the-art video saliency detection models.
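An illustrative take on uncertainty-weighted fusion of the two maps: a map's local entropy is used as a generic per-pixel uncertainty estimate, and the more uncertain map contributes less at that pixel. The paper's actual spatial and temporal uncertainty models (proximity/continuity, background motion, local contrast) are not reproduced here.

```python
import numpy as np
from scipy.ndimage import generic_filter

def local_entropy(m, size=9, bins=16):
    """Local entropy of a map, used as a crude per-pixel uncertainty proxy."""
    edges = np.linspace(m.min(), m.max() + 1e-8, bins + 1)
    def ent(values):
        p, _ = np.histogram(values, bins=edges)
        total = p.sum()
        p = p[p > 0] / total
        return -(p * np.log(p)).sum()
    return generic_filter(m, ent, size=size)

def fuse_spatiotemporal(sal_spatial, sal_temporal):
    """Blend two saliency maps pixelwise, down-weighting the locally more
    uncertain (higher-entropy) map."""
    w_s = 1.0 / (local_entropy(sal_spatial) + 1e-6)
    w_t = 1.0 / (local_entropy(sal_temporal) + 1e-6)
    return (w_s * sal_spatial + w_t * sal_temporal) / (w_s + w_t)
```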
How do we decide where to look next? During natural, active vision, we move our eyes to gather task-relevant information from the visual scene. Information theory provides an elegant framework for investigating how visual stimulus information combines with prior knowledge and task goals to plan an eye movement. We measured eye movements as observers performed a shape-learning and shape-matching task, for which the task-relevant information was tightly controlled. Using computational models, we probe the underlying strategies used by observers when planning their next eye movement. One strategy is to move the eyes to locations that maximize the total information gained about the shape, which is equivalent to reducing global uncertainty. Observers' behavior may appear highly similar to this strategy, but a rigorous analysis of sequential fixation placement reveals that observers may instead be using a local rule: fixate only the most informative locations, that is, reduce local uncertainty.
A biologically motivated computational model of bottom-up visual selective attention was used to examine the degree to which stimulus salience guides the allocation of attention. Human eye movements were recorded while participants viewed a series of digitized images of complex natural and artificial scenes. Stimulus dependence of attention, as measured by the correlation between computed stimulus salience and fixation locations, was found to be significantly greater than that expected by chance alone and, furthermore, was greatest for eye movements that immediately follow stimulus onset. The ability of three modeled stimulus features (color, intensity and orientation) to guide attention was examined and found to vary with image type. Additionally, the effect of the drop in visual sensitivity as a function of eccentricity on stimulus salience was examined, modeled, and shown to be an important determinant of attentional allocation. Overall, the results indicate that stimulus-driven, bottom-up mechanisms contribute significantly to attentional guidance under natural viewing conditions.
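A simple way to express the eccentricity effect discussed above is to attenuate a saliency map with a Gaussian falloff around the current fixation; the parameter values below are illustrative, not those fitted in the study.

```python
import numpy as np

def eccentricity_weighted(saliency_map, fixation, sigma_deg=8.0, px_per_deg=30.0):
    """Attenuate saliency with distance (in visual degrees) from the current fixation."""
    h, w = saliency_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fy, fx = fixation
    ecc_deg = np.hypot(ys - fy, xs - fx) / px_per_deg   # eccentricity in degrees
    weight = np.exp(-0.5 * (ecc_deg / sigma_deg) ** 2)  # sensitivity falloff
    return saliency_map * weight
```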
A hierarchical definition of optical variability is proposed that links physical magnitudes to visual saliency and yields a more reductionist interpretation than previous approaches. This definition is shown to be grounded on the classical efficient coding hypothesis. Moreover, we propose that a major goal of contextual adaptation mechanisms is to ensure the invariance of the behavior that the contribution of an image point to optical variability elicits in the visual system. This hypothesis and the necessary assumptions are tested through the comparison with human fixations and state-of-the-art approaches to saliency in three open access eye-tracking datasets, including one devoted to images with faces, as well as in a novel experiment using hyperspectral representations of surface reflectance. The results on faces yield a significant reduction of the potential strength of semantic influences compared to previous works. The results on hyperspectral images support the assumptions to estimate optical variability. As well, the proposed approach explains quantitative results related to a visual illusion observed for images of corners, which does not involve eye movements.
Many experiments have shown that the human visual system makes extensive use of contextual information for facilitating object search in natural scenes. However, the question of how to formally model contextual influences is still open. On the basis of a Bayesian framework, the authors present an original approach to attentional guidance by global scene context. The model comprises two parallel pathways: one pathway computes local features (saliency) and the other computes global (scene-centered) features. The contextual guidance model of attention combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes. According to feature-integration theory (Treisman & Gelade, 1980), the search for objects requires slow serial scanning because attention is necessary to integrate low-level features into single objects. Current computational models of visual attention based on saliency maps have been inspired by this approach, as it allows a simple and direct implementation of bottom-up attentional mechanisms that are not task specific. Computational models of image saliency provide some predictions about which regions are likely to attract observers' attention. These models work best in situations in which the image itself provides little semantic information and in which no specific task is driving the observer's exploration. In real-world images, the semantic content of the scene, the co-occurrence of objects, and task constraints have been shown to play a key role in modulating where attention and eye movements go.
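A minimal sketch of how the two pathways can be combined in that spirit: the bottom-up saliency map is modulated by a scene-based prior over where the target is likely to appear. In the full model the prior is predicted from learned global scene features; here it is simply assumed to be given (the horizontal-band prior in the usage example is a made-up placeholder).

```python
import numpy as np

def contextual_guidance(saliency_map, context_prior):
    """Pointwise product of bottom-up saliency and a scene-context prior,
    normalised to a fixation probability map."""
    combined = saliency_map * context_prior
    return combined / (combined.sum() + 1e-12)

# Example: a prior that concentrates probability on a horizontal band,
# e.g. where pedestrians tend to appear in a street scene.
h, w = 120, 160
ys = np.arange(h)[:, None]
context_prior = np.exp(-0.5 * ((ys - 70) / 12.0) ** 2) * np.ones((h, w))
fixation_prob = contextual_guidance(np.random.rand(h, w), context_prior)
```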
A model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. Results on natural images are compared with experimental eye tracking data revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts.
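A toy version of the self-information principle: estimate the probability of each pixel's local feature from its frequency over the image and set saliency to -log p. The model described above learns sparse (ICA-like) features and estimates their joint likelihood within a neighbourhood; quantised intensity is used here only to show the -log p(feature) idea.

```python
import numpy as np

def self_information_saliency(gray, bins=64):
    """Saliency = -log p(feature), with p estimated from the whole image."""
    q = np.digitize(gray, np.linspace(gray.min(), gray.max(), bins))  # quantised feature per pixel
    counts = np.bincount(q.ravel(), minlength=bins + 2).astype(float)
    p = counts / counts.sum()
    return -np.log(p[q] + 1e-12)   # rare feature values receive high saliency
```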