Smart Paper Notes

Voice-Face Homogeneity Tells Deepfake

Harry Cheng , Yangyang Guo , Tianyi Wang , Qi Li , Xiaojun Chang , Liqiang Nie

Categories: Computer Vision | Artificial Intelligence

2022-03-04

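Detecting forged videos is highly desirable given the abuse of deepfakes. Existing detection approaches explore specific artifacts in deepfake videos and fit certain data well. However, the advancing techniques behind these artifacts keep challenging the robustness of traditional deepfake detectors, and progress on the generalizability of these approaches has stalled. To address this issue, and given the empirical finding that the identities behind voices and faces are often mismatched in deepfake videos while voices and faces are homogeneous to some extent, we propose to perform deepfake detection from an unexplored voice-face matching view. To this end, a voice-face matching method is devised to measure the matching degree of the two. Nevertheless, training on specific deepfake datasets makes the model overfit certain traits of deepfake algorithms. We instead advocate a method that quickly adapts to unseen forgeries through a pre-training then fine-tuning paradigm. Specifically, we first pre-train the model on a generic audio-visual dataset and then fine-tune it on downstream deepfake data. We conduct extensive experiments on three widely used deepfake datasets: DFDC, FakeAVCeleb, and DeepfakeTIMIT. Our method obtains significant performance gains compared with other state-of-the-art competitors. It is also worth noting that our method already achieves competitive results when fine-tuned on limited deepfake data.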

Translated by Google Translate

FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection

Gil Knafo , Ohad Fried

Categories: Computer Vision

2022-12-01

Video synthesis methods have improved rapidly in recent years, allowing easy creation of synthetic humans. This poses a problem, especially in the era of social media, as synthetic videos of speaking humans can be used to spread misinformation in a convincing manner. Thus, there is a pressing need for accurate and robust deepfake detection methods that can detect forgery techniques not seen during training. In this work, we explore whether this can be done by leveraging a multi-modal, out-of-domain backbone trained in a self-supervised manner and adapted to the video deepfake domain. We propose FakeOut, a novel approach that relies on multi-modal data throughout both the pre-training phase and the adaptation phase. We demonstrate the efficacy and robustness of FakeOut in detecting various types of deepfakes, especially manipulations that were not seen during training. Our method achieves state-of-the-art results in cross-manipulation and cross-dataset generalization. This study shows that, perhaps surprisingly, training on out-of-domain videos (i.e., videos with no speaking humans) can lead to better deepfake detection systems. Code is available on GitHub.

Fighting Malicious Media Data: A Survey on Tampering Detection and Deepfake Detection

Junke Wang , Zhenxin Li , Chao Zhang , Jingjing Chen , Zuxuan Wu , Larry S. Davis , Yu-Gang Jiang

Categories: Computer Vision

2022-12-12

Online media data, in the forms of images and videos, are becoming mainstream communication channels. However, recent advances in deep learning, particularly deep generative models, open the doors for producing perceptually convincing images and videos at a low cost, which not only poses a serious threat to the trustworthiness of digital information but also has severe societal implications. This motivates a growing interest of research in media tampering detection, i.e., using deep learning techniques to examine whether media data have been maliciously manipulated. Depending on the content of the targeted images, media forgery could be divided into image tampering and Deepfake techniques. The former typically moves or erases the visual elements in ordinary images, while the latter manipulates the expressions and even the identity of human faces. Accordingly, the means of defense include image tampering detection and Deepfake detection, which share a wide variety of properties. In this paper, we provide a comprehensive review of the current media tampering detection approaches, and discuss the challenges and trends in this field for future research.

Deep Convolutional Pooling Transformer for Deepfake Detection

Tianyi Wang , Harry Cheng , Kam Pui Chow , Liqiang Nie

Categories: Computer Vision | Artificial Intelligence

2022-09-12

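Recently, deepfakes have drawn considerable public attention because of security and privacy concerns in social media digital forensics. As the deepfake videos spreading widely on the Internet become more realistic, traditional detection techniques fail to distinguish real from fake. Most existing deep learning methods focus mainly on local features and relations within the face image, using convolutional neural networks as a backbone. However, local features and relations are insufficient for a model to learn enough general information for deepfake detection. As a result, existing deepfake detection methods have hit a bottleneck in further improving detection performance. To address this issue, we propose a deep convolutional Transformer that incorporates decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we employ the rarely discussed image keyframes in model training to improve performance, and we visualize the gap in feature quantity between key frames and normal frames caused by video compression. Finally, we illustrate transferability with extensive experiments on several deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines in both within-dataset and cross-dataset experiments.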

Audio-Visual Person-of-Interest DeepFake Detection

Davide Cozzolino , Matthias Nießner , Luisa Verdoliva

Categories: Computer Vision

2022-04-06

Face manipulation technology is advancing very rapidly, and new methods are being proposed day by day. The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. Our key insight is that each person has specific biometric characteristics that a synthetic generator is unlikely to reproduce. Accordingly, we extract high-level audio-visual biometric features that characterize the identity of a person, and use them to create a person-of-interest (POI) deepfake detector. We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity. As a result, when the video and/or audio of a person is manipulated, its representation in the embedding space becomes inconsistent with the real identity, allowing reliable detection. Training is carried out exclusively on real talking-face videos, so the detector does not depend on any specific manipulation method and yields the highest generalization ability. In addition, our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos because it builds only on high-level semantic features. Experiments on a wide variety of datasets confirm that our method ensures SOTA performance, with an average improvement in AUC of around 3%, 10%, and 4% for high-quality, low-quality, and attacked videos, respectively. https://github.com/grip-unina/poi-forensics
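The idea that a manipulated clip drifts away from the real identity in embedding space can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the toy 2-D embeddings, and the threshold of 0.5 are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def poi_score(test_embedding, reference_embeddings):
    """Highest similarity between the test clip and any real reference clip
    of the claimed identity (hypothetical scoring rule)."""
    return max(cosine_similarity(test_embedding, ref)
               for ref in reference_embeddings)

def is_suspicious(test_embedding, reference_embeddings, threshold=0.5):
    """Flag the clip when it is not close to any real reference.
    The threshold is an arbitrary illustrative value."""
    return poi_score(test_embedding, reference_embeddings) < threshold

# Toy reference set for one identity (made-up embeddings).
references = [[1.0, 0.0], [0.9, 0.1]]
consistent_clip = [1.0, 0.0]
inconsistent_clip = [0.0, 1.0]
```

In this sketch a clip is compared only against real reference clips of the claimed person, which mirrors the abstract's point that no fake videos are needed at training time.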

Landmark Enhanced Multimodal Graph Learning for Deepfake Video Detection

Zhiyuan Yan , Peng Sun , Yubo Lang , Shuo Du , Shanzhuo Zhang , Wei Wang

Categories: Computer Vision

2022-09-12

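With the rapid development of face forgery technology, deepfake videos have attracted widespread attention in digital media. Perpetrators heavily exploit these videos to spread disinformation and make misleading statements. Most existing deepfake detection methods focus mainly on texture features, which are susceptible to external fluctuations such as illumination and noise. Detection methods based on facial landmarks, by contrast, are more robust to external variables but lack sufficient detail. Therefore, how to effectively mine distinctive features in the spatial, temporal, and frequency domains and fuse them with facial landmarks for forged video detection remains an open question. To this end, we propose a Landmark Enhanced Multimodal Graph Neural Network (LEM-GNN) based on multi-modal information and the geometric features of facial landmarks. Specifically, at the frame level, we design a fusion mechanism to mine a joint representation of spatial and frequency-domain elements while introducing geometric facial features to enhance the robustness of the model. At the video level, we first treat each frame of a video as a node in a graph and encode temporal information into the edges of the graph. Then, by applying the message passing mechanism of the graph neural network (GNN), the multi-modal features are effectively combined to obtain a comprehensive representation of the video forgery. Extensive experiments show that our method consistently outperforms the state of the art (SOTA) on widely used benchmarks.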

Deepfake Detection via Joint Unsupervised Reconstruction and Supervised Classification

Bosheng Yan , Chang-Tsun Li , Xuequan Lu

Categories: Computer Vision

2022-11-24

Deep learning has enabled realistic face manipulation (i.e., deepfake), which poses significant concerns over the integrity of the media in circulation. Most existing deep learning techniques for deepfake detection achieve promising performance in the intra-dataset evaluation setting (i.e., training and testing on the same dataset), but are unable to perform satisfactorily in the inter-dataset evaluation setting (i.e., training on one dataset and testing on another). Most previous methods use a backbone network to extract global features for making predictions and employ only binary supervision (i.e., indicating whether the training instances are fake or authentic) to train the network. Classification based merely on the learning of global features often leads to weak generalizability to unseen manipulation methods. In addition, the reconstruction task can improve the learned representations. In this paper, we introduce a novel approach for deepfake detection that considers the reconstruction and classification tasks simultaneously to address these problems. This method shares the information learned by one task with the other, focusing on an aspect that existing works rarely consider, and hence boosts the overall performance. In particular, we design a two-branch Convolutional AutoEncoder (CAE), in which the Convolutional Encoder used to compress the feature map into the latent representation is shared by both branches. The latent representation of the input data is then fed to a simple classifier and to the unsupervised reconstruction component simultaneously. Our network is trained end-to-end. Experiments demonstrate that our method achieves state-of-the-art performance on three commonly used datasets, particularly in the cross-dataset evaluation setting.
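A shared encoder feeding both a classifier and a reconstruction branch is commonly trained with a weighted sum of two losses. The formulation below is an illustrative assumption, not taken from the paper: the symbols $E$ (shared encoder), $D$ (decoder), $C$ (classifier), the weight $\lambda$, and the choice of binary cross-entropy and squared error are all hypothetical.

```latex
\begin{aligned}
z &= E(x), \qquad \hat{x} = D(z), \qquad \hat{y} = C(z), \\
\mathcal{L}_{\mathrm{cls}} &= -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr], \\
\mathcal{L}_{\mathrm{rec}} &= \lVert x - \hat{x} \rVert_2^2, \\
\mathcal{L} &= \mathcal{L}_{\mathrm{cls}} + \lambda \, \mathcal{L}_{\mathrm{rec}}.
\end{aligned}
```

Here $x$ is the input, $y \in \{0, 1\}$ is the real/fake label, and gradients from both terms flow into the same encoder $E$, which is how one task can share what it learns with the other.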

Deepfake Face Traceability with Disentangling Reversing Network

Jiaxin Ai , Zhongyuan Wang , Baojin Huang , Zhen Han

Categories: Computer Vision

2022-07-08

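Deepfake faces not only violate the privacy of personal identity but also confuse the public and cause great social harm. Current deepfake detection stops at distinguishing real from fake and cannot trace the original genuine face corresponding to a fake face; that is, it cannot trace the source of evidence. Deepfake countermeasure technology for judicial forensics urgently calls for deepfake traceability. This paper raises an interesting question about face deepfakes: active forensics that "knows it and how it happened". Given that deepfake faces do not completely discard the features of the original faces, especially facial expressions and poses, we argue that the original faces can be approximately inferred from their deepfake counterparts. Correspondingly, we design a disentangling reversing network that decouples the latent space features of deepfake faces, under the supervision of fake-original face pair samples, to infer the original faces in reverse.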

Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion

Shruti Agarwal , Liwen Hu , Evonne Ng , Trevor Darrell , Hao Li , Anna Rohrbach

Categories: Computer Vision | Artificial Intelligence | Natural Language Processing

2021-12-21

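In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic forensic approach that discovers clues beyond discrepancies in visual quality, thereby handling both simpler cheapfakes and visually persuasive deepfakes. In this work, our goal is to verify that the purported person seen in the video is indeed who they claim to be, by detecting anomalous correspondences between their facial movements and the words they are saying. We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others. We use interpretable Action Units (AUs) to capture a person's face and head movement, as opposed to deep CNN visual features, and we are the first to use word-conditioned facial motion analysis. Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation. We further demonstrate our method's effectiveness on a range of fakes not seen in training, including those without video manipulation, which were not addressed in prior work.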

Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario

Yukai Wang , Chunlei Peng , Decheng Liu , Nannan Wang , Xinbo Gao

Categories: Computer Vision

2022-07-05

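In recent years, with the rapid development of face editing and generation, more and more fake videos are circulating on social media, which has caused extreme public concern. Existing face forgery detection methods based on the frequency domain find that GAN-forged images show obvious grid-like visual artifacts in the frequency spectrum compared with real images. For synthesized videos, however, these methods are confined to single frames and pay little attention to the most discriminative regions and the temporal frequency clues across different frames. To take full advantage of the rich information in video sequences, this paper performs video forgery detection in both the spatial and temporal frequency domains and proposes a Discrete Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation. FCAN-DCT consists of a backbone network and two branches: a Compact Feature Extraction (CFE) module and a Frequency Temporal Attention (FTA) module. We conduct thorough experimental evaluations on two visible light (VIS) datasets, WildDeepfake and Celeb-DF (v2), and on our self-built video forgery dataset DeepfakeNIR, which is the first video forgery dataset in the near-infrared modality. The experimental results demonstrate the effectiveness of our method in detecting forged videos in both VIS and NIR scenarios.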
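To make the notion of a temporal frequency clue concrete, the sketch below applies an unnormalized one-dimensional DCT-II along the time axis of a single pixel's intensity sequence. It is only an illustration of the transform itself, not the FCAN-DCT architecture; the function names and the toy signals are assumptions.

```python
import math

def dct_ii(signal):
    """Unnormalized 1-D DCT-II: X[k] = sum_i x[i] * cos(pi/N * (i + 0.5) * k)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k)
            for i, x in enumerate(signal))
        for k in range(n)
    ]

def dominant_frequency(signal):
    """Index of the DCT coefficient with the largest magnitude."""
    coeffs = dct_ii(signal)
    return max(range(len(coeffs)), key=lambda k: abs(coeffs[k]))

# Hypothetical intensity of one pixel over four consecutive frames.
steady_pixel = [1.0, 1.0, 1.0, 1.0]        # temporally smooth
flickering_pixel = [1.0, -1.0, 1.0, -1.0]  # frame-to-frame flicker
```

A temporally smooth pixel concentrates its energy in the lowest coefficient, while frame-to-frame flicker shifts energy toward the highest one, which is the kind of cue a per-frame detector cannot observe.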

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Junwen Xiong , Yu Zhou , Peng Zhang , Lei Xie , Wei Huang , Yufei Zha

Categories: Artificial Intelligence

2022-03-04

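Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Owing to their respective characteristics, independently designed architectures have been widely used for each individual task. This may make the representations learned by the model task-specific and inevitably limits the generalization ability of features based on multi-modal modeling. Recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Therefore, motivated by the goal of bridging multi-modal associations across audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modeling.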

Deep Learning for Deepfakes Creation and Detection: A Survey

Thanh Thi Nguyen , Quoc Viet Hung Nguyen , Dung Tien Nguyen , Duc Thanh Nguyen , Thien Huynh-The , Saeid Nahavandi , Thanh Tam Nguyen , Quoc-Viet Pham , Cuong M. Nguyen

Categories: Computer Vision | Machine Learning

2019-09-25

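Deep learning has been successfully applied to solve a variety of complex problems, ranging from big data analytics to computer vision and human-level control. However, advances in deep learning have also been used to create software that can threaten privacy, democracy, and national security. One such deep learning-powered application that has recently emerged is deepfake. Deepfake algorithms can create fake images and videos that humans cannot distinguish from authentic ones. Technologies that can automatically detect and assess the integrity of digital visual media are therefore indispensable. This paper presents a survey of the algorithms used to create deepfakes and, more importantly, of the methods proposed in the literature to date to detect them. We present extensive discussions of the challenges, research trends, and directions related to deepfake technologies. By reviewing the background of deepfakes and state-of-the-art deepfake detection methods, this study provides a comprehensive overview of deepfake techniques and facilitates the development of new and more robust methods to deal with increasingly challenging deepfakes.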

AVA-AVD: Audio-visual Speaker Diarization in the Wild

Eric Zhongcong Xu , Zeyang Song , Chao Feng , Mang Ye , Mike Zheng Shou

Categories: Computer Vision

2021-11-29

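Audio-visual speaker diarization aims to detect "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets focus mainly on indoor environments such as meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on in-the-wild videos, we annotate speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging because of its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. How to handle off-screen and on-screen speakers together remains a critical challenge. To overcome it, we propose a novel Audio-Visual Relation Network (AVR-Net), which introduces an effective modality mask to capture discriminative information based on visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, for diarization. Our data and code will be made publicly available.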

Digital and Physical Face Attacks: Reviewing and One Step Further

Chenqi Kong , Shiqi Wang , Haoliang Li

Categories: Computer Vision

2022-09-29

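With the rapid progress of the past five years, face authentication has become the most pervasive biometric recognition method. Thanks to its highly accurate recognition performance and user-friendly usage, automatic face recognition (AFR) has exploded into a plethora of practical applications, including device unlocking, check-in, and financial payment. Despite the tremendous success of face authentication, a variety of face presentation attacks (FPA), such as print attacks, replay attacks, and 3D mask attacks, have raised concerns about trust. Besides physical face attacks, face videos and images are vulnerable to a wide variety of digital attack techniques launched by malicious hackers, posing a potential threat to the public at large. Owing to unrestricted access to enormous numbers of digital face images and videos and the easy-to-use face manipulation tools circulating on the Internet, non-expert attackers without any prior professional skills can readily create sophisticated fake faces, leading to numerous dangerous applications such as financial fraud, impersonation, and identity theft. This survey aims to build the integrity of face forensics by providing a thorough analysis of the existing literature and highlighting the issues that require further attention. In this paper, we first comprehensively survey both physical and digital face attack types and datasets. We then review the latest and most advanced progress in existing counter-attack methods and highlight their current limitations. Moreover, we outline future research directions for existing and upcoming challenges in the face forensics community. Finally, we discuss the necessity of joint physical and digital face attack detection, which has never been studied in previous surveys.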

MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection

Davide Alessandro Coccomini , Giorgos Kordopatis Zilos , Giuseppe Amato , Roberto Caldelli , Fabrizio Falchi , Symeon Papadopoulos , Claudio Gennaro

Categories: Computer Vision

2022-11-20

In this paper, we introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes. Previous approaches disregard such information either by using simple a-posteriori aggregation schemes, i.e., average or max operation, or using only one identity for the inference, i.e., the largest one. On the contrary, the proposed approach builds on a Spatio-Temporal TimeSformer combined with a Convolutional Neural Network backbone to capture spatio-temporal anomalies from the face sequences of multiple identities depicted in a video. This is achieved through an Identity-aware Attention mechanism that attends to each face sequence independently based on a masking operation and facilitates video-level aggregation. In addition, two novel embeddings are employed: (i) the Temporal Coherent Positional Embedding that encodes each face sequence's temporal information and (ii) the Size Embedding that encodes the size of the faces as a ratio to the video frame size. These extensions allow our system to adapt particularly well in the wild by learning how to aggregate information of multiple identities, which is usually disregarded by other methods in the literature. It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people and demonstrates ample generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection.
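The Size Embedding is described as encoding face size as a ratio to the video frame size. A minimal sketch of how such a ratio could be turned into an index for an embedding lookup is shown below. The area-based ratio, the bounding-box format `(x1, y1, x2, y2)`, the bucket count, and both function names are assumptions for illustration and are not taken from the MINTIME code.

```python
def face_size_ratio(face_box, frame_width, frame_height):
    """Ratio of face bounding-box area to frame area.
    face_box is (x1, y1, x2, y2) in pixels (assumed format)."""
    x1, y1, x2, y2 = face_box
    face_area = max(0, x2 - x1) * max(0, y2 - y1)
    return face_area / (frame_width * frame_height)

def size_bucket(ratio, num_buckets=10):
    """Map a ratio in [0, 1] to an integer index that could select
    a row of a learned embedding table (hypothetical design)."""
    ratio = min(max(ratio, 0.0), 1.0)
    return min(int(ratio * num_buckets), num_buckets - 1)

# A 100x100 face and a 500x500 face in a 1000x500 frame.
small_face = face_size_ratio((0, 0, 100, 100), 1000, 500)
large_face = face_size_ratio((0, 0, 500, 500), 1000, 500)
```

Using a ratio rather than absolute pixels keeps the index comparable across videos of different resolutions, which matches the size-invariance goal stated in the abstract.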

Detecting Deepfake by Creating Spatio-Temporal Regularity Disruption

Jiazhi Guan , Hang Zhou , Mingming Gong , Youjian Zhao , Errui Ding , Jingdong Wang

Categories: Computer Vision

2022-07-21

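Despite encouraging progress in deepfake detection, generalization to unseen forgery types remains a significant challenge because of the limited forgery clues explored during training. In contrast, we notice a common phenomenon in deepfakes: fake video creation inevitably disrupts the statistical regularity of the original videos. Inspired by this observation, we propose to boost the generalization of deepfake detection by distinguishing the "regularity disruption" that does not appear in real videos. Specifically, by carefully examining the spatial and temporal properties, we propose to disrupt a real video through a Pseudo-fake Generator and create a wide range of pseudo-fake videos for training. This practice allows us to achieve deepfake detection without using fake videos and improves generalization ability in a simple and efficient manner. To jointly capture the spatial and temporal disruptions, we propose a Spatio-Temporal Enhancement block to learn the regularity disruption across space and time on our self-created videos. In comprehensive experiments, our method exhibits excellent performance on several datasets.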

Deepfake Video Detection with Spatiotemporal Dropout Transformer

Daichi Zhang , Fanzhao Lin , Yingying Hua , Pengju Wang , Dan Zeng , Shiming Ge

Categories: Computer Vision | Artificial Intelligence

2022-07-14

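While the abuse of deepfake technology has recently caused serious concern, detecting deepfake videos remains a challenge because of the highly photo-realistic synthesis of each frame. Existing image-level approaches often focus on single frames and ignore the spatiotemporal cues hidden in deepfake videos, resulting in poor generalization and robustness. The key to a video-level detector is to fully exploit the spatiotemporal inconsistency distributed in local facial regions across different frames of a deepfake video. Inspired by this, the paper proposes a simple yet effective patch-level approach that facilitates deepfake video detection via a spatiotemporal dropout transformer. The approach reorganizes each input video into a bag of patches, which is then fed into a vision transformer to obtain a robust representation. Specifically, a spatiotemporal dropout operation is proposed to fully explore patch-level spatiotemporal cues and to serve as an effective data augmentation that further enhances the model's robustness and generalization ability. The operation is flexible and can easily be plugged into existing vision transformers. Extensive experiments demonstrate the effectiveness of our approach against 25 state-of-the-art methods, with impressive robustness, generalizability, and representation ability.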
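The step of reorganizing a video into a bag of patches with dropout can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's operation: patches are represented only by their `(frame, row, col)` coordinates, and the two-stage sampling (drop whole frames, then drop patches inside the kept frames) along with all function and parameter names is hypothetical.

```python
import random

def video_to_patches(num_frames, grid_h, grid_w):
    """Enumerate patch coordinates (t, row, col) for a video whose
    frames are each split into a grid_h x grid_w grid."""
    return [(t, r, c)
            for t in range(num_frames)
            for r in range(grid_h)
            for c in range(grid_w)]

def spatiotemporal_dropout(patches, keep_frames, keep_per_frame, rng):
    """Keep a random subset of frames (temporal dropout), then a random
    subset of patches within each kept frame (spatial dropout)."""
    frames = sorted({t for t, _, _ in patches})
    kept_frames = sorted(rng.sample(frames, keep_frames))
    bag = []
    for t in kept_frames:
        in_frame = [p for p in patches if p[0] == t]
        bag.extend(rng.sample(in_frame, keep_per_frame))
    return bag

# 8 frames, each split into a 4x4 grid; keep 4 frames and 6 patches per frame.
patches = video_to_patches(8, 4, 4)
bag = spatiotemporal_dropout(patches, 4, 6, random.Random(0))
```

Because a different subset is drawn on every call, the same video yields many different bags, which is why such an operation can double as data augmentation.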

Cross-Domain Local Characteristic Enhanced Deepfake Video Detection

Zihan Liu , Hanyi Wang , Shilin Wang

Categories: Computer Vision

2022-11-07

As ultra-realistic face forgery techniques emerge, deepfake detection has attracted increasing attention due to security concerns. Many detectors cannot achieve accurate results when detecting unseen manipulations despite excellent performance on known forgeries. In this paper, we are motivated by the observation that the discrepancies between real and fake videos are extremely subtle and localized, and inconsistencies or irregularities can exist in some critical facial regions across various information domains. To this end, we propose a novel pipeline, Cross-Domain Local Forensics (XDLF), for more general deepfake video detection. In the proposed pipeline, a specialized framework is presented to simultaneously exploit local forgery patterns from space, frequency, and time domains, thus learning cross-domain features to detect forgeries. Moreover, the framework leverages four high-level forgery-sensitive local regions of a human face to guide the model to enhance subtle artifacts and localize potential anomalies. Extensive experiments on several benchmark datasets demonstrate the impressive performance of our method, and we achieve superiority over several state-of-the-art methods on cross-dataset generalization. We also examined the factors that contribute to its performance through ablations, which suggests that exploiting cross-domain local characteristics is a noteworthy direction for developing more general deepfake detectors.

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

Yake Wei , Di Hu , Yapeng Tian , Xuelong Li

Categories: Computer Vision | Artificial Intelligence

2022-08-20

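Sight and hearing are two senses that play a vital role in human communication and scene understanding. To mimic human perception, audio-visual learning, which aims to develop computational approaches that learn from both audio and visual modalities, has been a flourishing field. A comprehensive survey that systematically organizes and analyzes research in the audio-visual field is needed. Starting from an analysis of the foundations of audio-visual cognition, we introduce several key findings that have inspired our computational studies. We then systematically review recent audio-visual learning studies and divide them into three categories: audio-visual boosting, cross-modal perception, and audio-visual collaboration. Through our analysis, we find that the consistency of audio-visual data across semantic, spatial, and temporal aspects supports the above studies. To revisit the current development of the audio-visual learning field, we further propose a new perspective on audio-visual scene understanding, and then discuss and analyze feasible future directions for the field. Overall, this survey reviews and looks ahead at the current audio-visual learning field from different aspects. We hope it can provide researchers with a better understanding of this area. A website containing the continuously updated survey is available at https://gewu-lab.github.io/audio-visual-learning/.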

MARLIN: Masked Autoencoder for facial video Representation LearnINg

Zhixi Cai , Shreya Ghosh , Kalin Stefanov , Abhinav Dhall , Jianfei Cai , Hamid Rezatofighi , Reza Haffari , Munawar Hayat

Categories: Computer Vision

2022-11-12

This paper proposes a self-supervised approach to learn universal facial representations from videos, that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder, that learns highly robust and generic facial embeddings from abundantly available non-annotated web crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions which mainly include eyes, nose, mouth, lips, and skin to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in low data regime. Our codes and pre-trained models will be made public.
