Face tracking serves as the crucial initial step in mobile applications that aim to analyse target faces over time. However, this problem has received little attention, mainly due to the scarcity of dedicated face tracking benchmarks. In this work, we introduce MobiFace, the first dataset for single face tracking in mobile situations. It consists of 80 unedited live-streaming mobile videos captured by 70 different smartphone users in fully unconstrained environments. Over 95K bounding boxes are manually labelled. The videos are carefully selected to cover typical smartphone usage, and are annotated with 14 attributes, including 6 newly proposed ones and 8 commonly seen in object tracking. 36 state-of-the-art trackers, including facial landmark trackers, generic object trackers and trackers that we have fine-tuned or improved, are evaluated. The results suggest that mobile face tracking cannot be solved with existing approaches. In addition, we show that fine-tuning on the MobiFace training data significantly boosts the performance of deep learning-based trackers, suggesting that MobiFace captures the unique characteristics of mobile face tracking. Our goal is to offer the community a diverse dataset to enable the design and evaluation of mobile face trackers. The dataset, annotations and the evaluation server will be available at https://mobiface.github.io/.
Over the past few years, automatic facial micro-expression analysis has attracted increasing attention from experts across different disciplines because of its potential applications in various fields such as clinical diagnosis, forensic investigation and security systems. Advances in computer algorithms and video acquisition technology have made machine analysis of facial micro-expressions possible, whereas a few decades ago it was primarily the domain of psychiatrists, where the analysis was largely manual. Indeed, although the study of facial micro-expressions is well established in psychology, from a computational point of view it remains a relatively new problem. In this survey, we present a comprehensive review of the state-of-the-art databases and methods for micro-expression spotting and recognition. The individual stages involved in automating these tasks are also described and reviewed in detail. In addition, we discuss the challenges and future directions of this growing field of automatic facial micro-expression analysis.
With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions, and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting caused by a lack of sufficient training data, and expression-unrelated variations such as illumination, head pose and identity. In this paper, we provide a comprehensive survey of deep FER, including datasets and algorithms, that offers insight into these intrinsic problems. First, we describe the standard pipeline of a deep FER system, with the related background knowledge and suggestions for applicable implementations at each stage. We then introduce the available datasets that are widely used in the literature and provide accepted data selection and evaluation principles for them. For the state of the art in deep FER, we review existing novel deep neural networks and related training strategies designed for FER based on both static images and dynamic image sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized. We then extend the survey to additional related issues and application scenarios. Finally, we review the remaining challenges and corresponding opportunities in this field, as well as future directions for the design of robust deep FER systems.
Recognizing pedestrian attributes is an important task in the computer vision community due to the key role it plays in video surveillance. Many algorithms have been proposed to handle this task. The goal of this paper is to review existing works based on traditional methods or deep learning networks. Firstly, we introduce the background of pedestrian attribute recognition (PAR for short), including the fundamental concepts of pedestrian attributes and the corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criteria. Thirdly, we analyse the concepts of multi-task learning and multi-label learning, and explain the relationship between these two learning paradigms and pedestrian attribute recognition. We also review some popular network architectures that have been widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attribute-group and part-based methods, etc. Fifthly, we show some applications that take pedestrian attributes into consideration and achieve better performance. Finally, we summarize the paper and give several possible research directions for pedestrian attribute recognition. The project page of this paper can be found at https://sites.google.com/view/ahu-pedestrianattributes/.
Micro-expressions (MEs) are rapid, involuntary facial expressions which reveal emotions that people do not intend to show. Studying MEs is valuable as recognizing them has many important applications, particularly in forensic science and psychotherapy. However, analyzing spontaneous MEs is very challenging due to their short duration and low intensity. Automatic ME analysis includes two tasks: ME spotting and ME recognition. For ME spotting, previous studies have focused on posed rather than spontaneous videos. For ME recognition, the performance of previous studies is low. To address these challenges, we make the following contributions: (i) We propose the first method for spotting spontaneous MEs in long videos (by exploiting feature difference contrast). This method is training free and works on arbitrary unseen videos. (ii) We present an advanced ME recognition framework, which outperforms previous work by a large margin on two challenging spontaneous ME databases (SMIC and CASMEII). (iii) We propose the first automatic ME analysis system (MESR), which can spot and recognize MEs from spontaneous video data. Finally, we show our method outperforms humans in the ME recognition task by a large margin, and achieves comparable performance to humans at the very challenging task of spotting and then recognizing spontaneous MEs.
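As a rough illustration of the feature-difference-contrast idea for training-free ME spotting, the sketch below compares each frame against the average of its head and tail frames and contrasts that difference against the local baseline. The feature choice (e.g. LBP histograms), distance, and parameter names are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def spot_micro_expressions(features, k, threshold):
    """Training-free ME spotting sketch.

    features: (T, D) array of per-frame descriptors (e.g. LBP histograms).
    k: half the expected micro-expression duration, in frames.
    threshold: contrast level above which a frame is flagged as a peak.
    """
    T = len(features)
    fd = np.zeros(T)
    for i in range(k, T - k):
        # Average feature of the head (i - k) and tail (i + k) frames.
        avg = 0.5 * (features[i - k] + features[i + k])
        # Dissimilarity of the current frame to that average
        # (chi-squared distance, common for histogram features).
        fd[i] = np.sum((features[i] - avg) ** 2 / (features[i] + avg + 1e-8))
    # Contrast each difference value against its local baseline so that
    # slow head movements and lighting drift are suppressed.
    contrast = np.zeros(T)
    for i in range(k, T - k):
        contrast[i] = fd[i] - 0.5 * (fd[i - k] + fd[i + k])
    # Frames whose contrasted difference exceeds the threshold are
    # candidate micro-expression peaks.
    return np.where(contrast > threshold)[0]
```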
We present a multi-stream bi-directional recurrent neural network for fine-grained action detection. Recently, two-stream convolutional neural networks (CNNs) trained on stacked optical flow and image frames have been successful for action recognition in videos. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion and also suppresses background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. Our motion streams use pixel trajectories of a frame as raw features, in which the displacement values corresponding to a moving scene point are at the same spatial position across several frames. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label. We test on two action detection datasets: the MPII Cooking 2 Dataset, and a new MERL Shopping Dataset that we introduce and make available to the community with this paper. The results demonstrate that our method significantly outperforms state-of-the-art action detection methods on both datasets.
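A minimal PyTorch sketch of the multi-stream-plus-bidirectional-LSTM design: per-stream CNN features (full-frame/cropped, appearance/motion) are concatenated per frame and fed to a bi-directional LSTM that emits per-frame action scores. The stream count, feature dimensions and layer sizes here are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MultiStreamBiLSTM(nn.Module):
    """Fuse per-frame features from several CNN streams, then model
    long-term temporal dynamics with a bidirectional LSTM."""

    def __init__(self, num_streams=4, feat_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.bilstm = nn.LSTM(num_streams * feat_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, stream_feats):
        # stream_feats: list of (B, T, feat_dim) tensors, one per stream.
        x = torch.cat(stream_feats, dim=-1)   # (B, T, num_streams * feat_dim)
        h, _ = self.bilstm(x)                 # (B, T, 2 * hidden)
        return self.classifier(h)             # per-frame action scores

feats = [torch.randn(2, 16, 512) for _ in range(4)]  # toy stream features
scores = MultiStreamBiLSTM()(feats)                  # (2, 16, 10)
```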
Large-scale datasets have successively proven their fundamental importance in several research fields, especially for early progress in emerging topics. In this paper, we focus on the problem of visual speech recognition, also known as lip reading, which has received increasing attention in recent years. We present a naturally distributed large-scale lip-reading benchmark, named LRW-1000, which contains 1,000 classes with about 745,187 samples from more than 2,000 individual speakers. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters. To the best of our knowledge, it is the largest word-level lip-reading dataset and also the only public large-scale Mandarin lip-reading dataset. The dataset aims to cover "natural" variations over different speech modes and imaging conditions, so as to incorporate the challenges encountered in practical applications. The benchmark shows large variation across several aspects, including the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender and make-up. Besides a detailed description of the dataset and its collection pipeline, we evaluate popular lip-reading methods and perform a comprehensive analysis of the results from several aspects. The results demonstrate the consistency and challenges of our dataset, which may open up some new promising directions for future work. The dataset and corresponding code will be publicly available for research use.
Facial pain expression is an important modality for assessing pain, especially when a patient's verbal ability to communicate is impaired. Facial-muscle-based action units (AUs), defined by the Facial Action Coding System (FACS), have been widely studied and are highly reliable as a method for detecting facial expressions (FEs), including valid detection of pain. Unfortunately, FACS coding by humans is a very time-consuming task, which limits its clinical use. Significant progress in automatic facial expression recognition (AFER) has led to numerous successful applications in FACS-based affective computing problems. However, only a few studies have reported on automatic pain detection (APD), and its application in clinical settings is still far from reality. In this paper, we review the progress in research that contributes to automatic pain detection, with a focus on 1) the framework-level similarity between spontaneous AFER and APD problems; 2) the evolution of system design, including recently developed deep learning methods; 3) strategies and considerations for developing a FACS-based pain detection framework from existing research; and 4) the most relevant databases available for AFER and APD research. We attempt to present key considerations in extending a general AFER framework to an APD framework in clinical settings. In addition, performance metrics for evaluating an AFER or APD system are highlighted.
Over the last five years, methods based on Deep Convolutional Neural Networks (DCNNs) have shown impressive performance improvements for object detection and recognition problems. This has been made possible due to the availability of large annotated datasets, a better understanding of the non-linear mapping between input images and class labels, as well as the affordability of GPUs. In this paper, we present the design details of a deep learning system for unconstrained face recognition, including modules for face detection, association, alignment and face verification. The quantitative performance evaluation is conducted using the IARPA Janus Benchmark A (IJB-A), the JANUS Challenge Set 2 (JANUS CS2), and the LFW dataset. The IJB-A dataset includes real-world unconstrained faces of 500 subjects with significant pose and illumination variations, which are much harder than the Labeled Faces in the Wild (LFW) and YouTube Faces (YTF) datasets. JANUS CS2 is the extended version of IJB-A, which contains not only all the images/frames of IJB-A but also the original videos. Some open issues regarding DCNNs for face verification problems are then discussed.
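For the verification module, a common choice with deep features, and a plausible reading of this kind of pipeline, is cosine similarity between L2-normalised embeddings produced after detection, association and alignment. The sketch below is illustrative; the threshold and function names are ours, not the paper's:

```python
import numpy as np

def verify(feat_a, feat_b, threshold=0.5):
    """Declare two deep face features 'same identity' when the cosine
    similarity of their L2-normalised vectors exceeds a threshold
    tuned on a development set."""
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    score = float(a @ b)
    return score, score > threshold
```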
We introduce an approach for learning human actions as interactions between persons and objects in realistic videos. Previous work typically represents actions with low-level features such as image gradients or optical flow. In contrast, we explicitly localize in space and track over time both the object and the person, and represent an action as the trajectory of the object w.r.t. the person position. Our approach relies on state-of-the-art techniques for human detection [32], object detection [10], and tracking [39]. We show that this results in human and object tracks of sufficient quality to model and localize human-object interactions in realistic videos. Our human-object interaction features capture the relative trajectory of the object w.r.t. the human. Experimental results on the Coffee & Cigarettes dataset [25], the video dataset of [19] and the Rochester Daily Activities dataset [29] show that (i) our explicit human-object model is an informative cue for action recognition; (ii) it is complementary to traditional low-level descriptors such as 3D-HOG [23] extracted over human tracks. We show that combining our human-object interaction features with 3D-HOG improves over their individual performance as well as over the state-of-the-art [23], [29].
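A hedged sketch of a relative-trajectory feature of this kind: the object's track is expressed in the person's coordinate frame. The box format and the normalisation by person size (for scale invariance) are our assumptions for illustration:

```python
import numpy as np

def interaction_feature(person_boxes, object_boxes):
    """Build a human-object interaction descriptor from detector+tracker
    output: the object trajectory relative to the person, normalised by
    person size.

    Boxes are (T, 4) arrays of [x, y, w, h] per frame.
    """
    pc = person_boxes[:, :2] + person_boxes[:, 2:] / 2   # person centres
    oc = object_boxes[:, :2] + object_boxes[:, 2:] / 2   # object centres
    scale = person_boxes[:, 2:].mean(axis=1, keepdims=True)
    rel = (oc - pc) / scale       # (T, 2) relative trajectory
    return rel.ravel()            # flatten into a fixed-length feature
```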
This paper presents a robust visual tracking method based on a hybrid convolutional neural network model (SDT). To handle abrupt and fast motion, a prior map is generated to facilitate localization of the region of interest (ROI) before the deep tracker is executed. A top-down saliency model with nineteen shallow cues is used to construct the prior map, with the combination weights learned online. Moreover, besides the holistic network, four local networks are trained to learn different components of the target. The four generated local heatmaps help rectify the holistic map by eliminating distracters, thereby avoiding drift. Furthermore, to guarantee high-quality instances for online updating, a priority update strategy is adopted by casting the problem as a label-noise problem. A selection probability is designed by considering the confidence values and a biologically inspired memory for temporal information integration. Experiments are conducted both qualitatively and quantitatively on a set of challenging image sequences. Comparative studies demonstrate that the proposed algorithm outperforms other state-of-the-art methods.
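A minimal sketch of the prior-map step, assuming the shallow saliency cue maps are combined linearly with online-learned weights and the peak of the combined map seeds the ROI; cue maps and weights below are placeholders:

```python
import numpy as np

def prior_map(cue_maps, weights):
    """Combine shallow saliency cue maps (each an HxW array) with
    online-learned weights; the peak of the combined map suggests the
    ROI to search before running the deep tracker."""
    combined = sum(w * m for w, m in zip(weights, cue_maps))
    peak = np.unravel_index(np.argmax(combined), combined.shape)
    return combined, peak

# Toy usage: three random cue maps with equal weights.
maps = [np.random.rand(60, 80) for _ in range(3)]
prior, (row, col) = prior_map(maps, [1 / 3] * 3)
```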
This paper presents a face liveness detection system against spoofing with photographs, videos, and 3D models of a valid user in a face recognition system. Anti-spoofing clues inside and outside a face are both exploited in our system. The inside-face clues of spontaneous eye-blinks are employed for anti-spoofing of photographs and 3D models. The outside-face clues of scene context are used for anti-spoofing of video replays. The system does not need user collaborations, i.e. it runs in a non-intrusive manner. In our system, the eyeblink detection is formulated as an inference problem of an undirected conditional graphical framework which models contextual dependencies in blink image sequences. The scene context clue is found by comparing the difference of regions of interest between the reference scene image and the input one, which is based on the similarity computed by local binary pattern descriptors on a series of fiducial points extracted in scale space. Extensive experiments are carried out to show the effectiveness of our system.
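A rough sketch of the scene-context test, assuming per-region LBP histograms compared with a chi-squared distance; the LBP parameters and decision threshold are illustrative, not the paper's:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def scene_changed(ref_patch, cur_patch, threshold=0.2):
    """Compare LBP histograms of the same region of interest in the
    reference scene image and the input one; a large distance suggests
    the background has been replaced, e.g. by a screen replaying video.

    ref_patch, cur_patch: grayscale 2D arrays of the region of interest.
    """
    def lbp_hist(img):
        # 'uniform' LBP with 8 neighbours yields codes in [0, 9].
        lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        return hist

    h1, h2 = lbp_hist(ref_patch), lbp_hist(cur_patch)
    # Chi-squared distance between the two LBP histograms.
    dist = 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-8))
    return dist > threshold
```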
Fine grained video action analysis often requires reliable detection and tracking of various interacting objects and human body parts, denoted as Interactional Object Parsing. However, most of the previous methods based on either independent or joint object detection might suffer from high model complexity and challenging image content, e.g., illumination/pose/appearance/scale variation, motion, occlusion, etc. In this work, we propose an end-to-end system based on recurrent neural networks to perform frame-by-frame interactional object parsing, which can alleviate the difficulty in an incremental/progressive manner. Our key innovation is that, instead of jointly outputting all object detections at once, for each frame we use a set of long short-term memory (LSTM) nodes to incrementally refine the detections. After passing through each LSTM node, more object detections are consolidated and thus more contextual information can be utilized to localize more difficult objects. The object parsing results are further utilized to form object-specific action representations for fine grained action detection. Extensive experiments on two benchmark fine grained activity datasets demonstrate that our proposed algorithm achieves better interacting object detection performance, which in turn boosts the action recognition performance over the state-of-the-art.
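A simplified sketch of the incremental refinement idea, not the published architecture: the frame feature is passed through a chain of LSTM steps, each of which conditions on what has been detected so far and emits further box hypotheses. Dimensions, the fixed step count, and the output head are assumptions:

```python
import torch
import torch.nn as nn

class IncrementalParser(nn.Module):
    """Emit object detections step by step instead of all at once; later
    steps see the context accumulated in the LSTM state and can localize
    harder objects."""

    def __init__(self, feat_dim=256, hidden=256, boxes_per_step=5, steps=3):
        super().__init__()
        self.hidden = hidden
        self.steps = steps
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, boxes_per_step * 4)

    def forward(self, frame_feat):
        # frame_feat: (B, feat_dim) feature of the current frame.
        B = frame_feat.size(0)
        h = frame_feat.new_zeros(B, self.hidden)
        c = frame_feat.new_zeros(B, self.hidden)
        outputs = []
        for _ in range(self.steps):
            h, c = self.cell(frame_feat, (h, c))
            # Each step consolidates earlier hypotheses (carried in h, c)
            # and predicts additional boxes.
            outputs.append(self.head(h).view(B, -1, 4))
        return outputs
```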
Unlike conventional facial expressions, microexpressions are instantaneous and involuntary reflections of human emotion. Because microexpressions are fleeting, lasting only a few frames within a video sequence, they are difficult to perceive and interpret correctly, and they are highly challenging to identify and categorize automatically. Existing recognition methods are often ineffective at handling subtle face displacements, which can be prevalent in typical microexpression applications due to the constant movements of the individuals being observed. To address this problem, a novel method called the Facial Dynamics Map is proposed to characterize the movements of a microexpression at different granularities. Specifically, an algorithm based on optical flow estimation is used to perform pixel-level alignment for microexpression sequences. Each expression sequence is then divided into spatiotemporal cuboids at the chosen granularity. We also present an iterative optimization strategy to calculate the principal optical flow direction of each cuboid for a better representation of the local facial dynamics. With these principal directions, the resulting Facial Dynamics Map can characterize a microexpression sequence. Finally, a classifier is developed to identify the presence of microexpressions and to categorize different types. Experimental results on four benchmark datasets demonstrate higher recognition performance and improved interpretability.
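A hedged sketch of the per-cuboid principal-direction step, assuming an iterative scheme that keeps only flow vectors consistent with the current mean direction; the exact optimization in the paper may differ:

```python
import numpy as np

def principal_direction(flow_cuboid, iters=5):
    """Estimate the principal optical-flow direction of one
    spatiotemporal cuboid.

    flow_cuboid: (N, 2) array of per-pixel flow vectors pooled over
    the cuboid. Iteratively re-estimating the mean over consistent
    vectors suppresses noise and outliers.
    """
    keep = np.ones(len(flow_cuboid), dtype=bool)
    direction = flow_cuboid.mean(axis=0)
    for _ in range(iters):
        direction = flow_cuboid[keep].mean(axis=0)
        norm = np.linalg.norm(direction) + 1e-8
        # Keep flow vectors whose projection onto the mean direction
        # is positive, i.e. vectors roughly agreeing with it.
        keep = flow_cuboid @ (direction / norm) > 0
        if not keep.any():
            break
    return direction
```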
In this paper, we present an end-to-end system for the unconstrained face verification problem based on deep convolutional neural networks (DCNNs). The end-to-end system consists of three modules for face detection, alignment and verification, and is evaluated using the newly released IARPA Janus Benchmark A (IJB-A) dataset and its extended version, the JANUS Challenge Set 2 (JANUS CS2) dataset. The IJB-A and CS2 datasets include real-world unconstrained faces of 500 subjects with significant pose and illumination variations, which are much harder than the Labeled Faces in the Wild (LFW) and YouTube Faces (YTF) datasets. Results of experimental evaluations for the proposed system on the IJB-A dataset are provided.
Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems will be reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of the hand-crafted systems and the large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) finally briefs some important yet underdeveloped issues.
Video-based person re-identification plays a central role in realistic security and video surveillance. In this paper we propose a novel Accumulative Motion Context (AMOC) network for addressing this important problem, which effectively exploits long-range motion context to robustly identify the same person under challenging conditions. Given a video sequence of the same or different persons, the proposed AMOC network jointly learns appearance representations and motion context from a collection of adjacent frames using a two-stream convolutional architecture. AMOC then accumulates clues from the motion context by recurrent aggregation, allowing effective information flow among adjacent frames and capturing the dynamic gist of the persons. The architecture of AMOC is end-to-end trainable, and thus the motion context can be adapted to complement appearance clues under unfavorable conditions (e.g., occlusions). Extensive experiments are conducted on three public benchmark datasets, i.e., the iLIDS-VID, PRID-2011 and MARS datasets, to investigate the performance of AMOC. The experimental results demonstrate that the proposed AMOC network significantly outperforms the state of the art in video-based re-identification and confirm the advantage of exploiting long-range motion context, clearly validating our motivation.
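A minimal sketch of the accumulation idea, not the published AMOC architecture: fused appearance and motion-context features are aggregated over time by a recurrent layer (a GRU stands in for its recurrent aggregation), and the final state serves as the sequence-level person embedding; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AccumulativeMotionSketch(nn.Module):
    """Fuse per-frame appearance features with per-pair motion-context
    features and accumulate them recurrently into one embedding."""

    def __init__(self, app_dim=256, mot_dim=256, hidden=256, embed=128):
        super().__init__()
        self.fuse = nn.Linear(app_dim + mot_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.embed = nn.Linear(hidden, embed)

    def forward(self, app_feats, mot_feats):
        # app_feats: (B, T, app_dim) appearance-stream features.
        # mot_feats: (B, T, mot_dim) motion-context-stream features.
        x = torch.relu(self.fuse(torch.cat([app_feats, mot_feats], dim=-1)))
        _, h = self.rnn(x)          # accumulate clues over time
        return self.embed(h[-1])    # (B, embed) person embedding
```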
Biometrics, an integral component of Identity Science, is widely used in several large-scale, country-wide projects to provide a meaningful way of recognizing individuals. Among existing modalities, ocular biometric traits such as iris, periocular, retina, and eye movement have received significant attention in the recent past. Iris recognition is used in the Unique Identification Authority of India's Aadhaar Program and the United Arab Emirates' border security programs, whereas periocular recognition is used to augment the performance of face or iris recognition when only the ocular region is present in the image. This paper reviews the research progression in these modalities. The paper discusses existing algorithms, the limitations of each biometric trait, and information fusion approaches which combine ocular modalities with other modalities. We also propose a path forward to advance research on ocular recognition by (i) improving the sensing technology, (ii) heterogeneous recognition for addressing interoperability, (iii) utilizing advanced machine learning algorithms for better representation and classification, (iv) developing algorithms for ocular recognition at a distance, (v) using multimodal ocular biometrics for recognition, and (vi) encouraging benchmarking standards and open-source software development.
Driven by rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks: action recognition infers human actions (the present state) from complete action executions, while action prediction forecasts actions (the future state) from incomplete action executions. These two tasks have recently become particularly prevalent topics because of their explosively emerging real-world applications, such as visual surveillance, autonomous vehicles, entertainment, and video retrieval. Substantial effort has been devoted over the past decades to building robust and effective frameworks for action recognition and prediction. In this paper, we survey the complete state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also discussed systematically.