Smart Paper Notes

Progressive Scene Text Erasing with Self-Supervision

Xiangcheng Du , Zhao Zhou , Yingbin Zheng , Xingjiao Wu , Tianlong Ma , Cheng Jin

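Category: Computer Vision

2022-07-23

Scene text erasing aims to remove text content from scene images, and current state-of-the-art text erasing models are trained on large-scale synthetic data. Although data synthesis engines can provide vast numbers of annotated training samples, there are differences between synthetic and real-world data. In this paper, we employ self-supervision for feature representation on unlabeled real-world scene text images. A novel pretext task is designed to keep the text masks of image variants consistent. We design the Progressive Erasing Network to remove residual text. The scene text is erased progressively by leveraging the intermediate generated results, which provide the foundation for subsequent higher-quality results. Experiments show that our method significantly improves the generalization of the text erasing task and achieves state-of-the-art performance on public benchmarks.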

Translated by Google Translate
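
The pretext task above asks for consistent text masks across variants of the same image. A minimal sketch of such a consistency term follows, assuming the variant is a horizontal flip and the penalty is a mean absolute difference; both choices, and the names hflip and consistency_loss, are illustrative assumptions rather than the authors' formulation.

```python
from typing import List

Mask = List[List[float]]  # rows of per-pixel text probabilities in [0, 1]


def hflip(mask: Mask) -> Mask:
    """Mirror a mask left-to-right (the assumed image variant)."""
    return [list(reversed(row)) for row in mask]


def consistency_loss(mask_orig: Mask, mask_flipped_view: Mask) -> float:
    """Mean absolute difference between the mask of the original image and
    the mask predicted on the flipped image, after undoing the flip."""
    aligned = hflip(mask_flipped_view)
    total = 0.0
    count = 0
    for row_a, row_b in zip(mask_orig, aligned):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count
```

A mask predictor that is perfectly equivariant to the flip gives a loss of zero, so minimizing this term on unlabeled images pushes the two predictions to agree.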

Stroke-Based Scene Text Erasing Using Synthetic Data for Training

Zhengmi Tang , Tomo Miyazaki , Yoshihiro Sugaya , Shinichiro Omachi

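Category: Computer Vision

2021-04-23

Scene text erasing, which replaces text regions in natural images with plausible content, has drawn significant attention in the computer vision community in recent years. Scene text erasing involves two potential subtasks: text detection and image inpainting. Both subtasks require considerable data to achieve better performance; however, the lack of a large-scale real-world scene text removal dataset prevents existing methods from realizing their potential. To compensate for the lack of paired real-world data, we made extensive use of synthetic text after additional enhancement and subsequently trained our model only on the dataset generated by the improved synthetic text engine. Our proposed network contains a stroke mask prediction module and a background inpainting module that can extract the text strokes from a cropped text image as relatively small holes, preserving more background content for better inpainting results. The model can partially erase text instances in a scene image given a bounding box, or work with an existing scene text detector for automatic scene text erasing. Qualitative and quantitative evaluation on the SCUT-Syn, ICDAR2013, and SCUT-EnsText datasets demonstrates that our method significantly outperforms existing state-of-the-art methods even when trained on real-world data.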
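
The idea of treating strokes as small holes can be illustrated with a simple compositing step: only pixels under the stroke mask are taken from the inpainted output, and every other pixel keeps its original value. This is a sketch under that assumption; the function name composite and the list-of-rows image layout are hypothetical and not taken from the paper.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def composite(original: Image, inpainted: Image, stroke_mask: Image) -> Image:
    """Blend per pixel: mask value 1 takes the inpainted pixel,
    mask value 0 keeps the original background pixel."""
    return [
        [m * p + (1.0 - m) * o for o, p, m in zip(row_o, row_p, row_m)]
        for row_o, row_p, row_m in zip(original, inpainted, stroke_mask)
    ]
```

Because the mask covers only the strokes rather than the whole bounding box, most of the background inside the box is copied through unchanged.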

Don't Forget Me: Accurate Background Recovery for Text Removal via Modeling Local-Global Context

Chongyu Liu , Lianwen Jin , Yuliang Liu , Canjie Luo , Bangdong Chen , Fengjun Guo , Kai Ding

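Category: Computer Vision

2022-07-21

Text removal has attracted increasing attention due to its various applications in privacy protection, document restoration, and text editing, and it has made significant progress with deep neural networks. However, most existing methods often produce inconsistent results for complex backgrounds. To address this issue, we propose a contextual-guided text removal network, termed CTRNet. CTRNet explores both low-level structure and high-level discriminative context features as prior knowledge to guide the background restoration process. We further propose a Local-Global Content Modeling (LGCM) block with CNNs and a Transformer encoder to capture local features and establish long-range relationships among pixels globally. Finally, we incorporate LGCM with context guidance for feature modeling and decoding. Experiments on the benchmark datasets SCUT-EnsText and SCUT-Syn show that CTRNet significantly outperforms existing state-of-the-art methods. Furthermore, a qualitative experiment on examination papers also demonstrates the generalization ability of our method. The code and supplementary materials are available at https://github.com/lcy0604/ctrnet.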

A Scene-Text Synthesis Engine Achieved Through Learning from Decomposed Real-World Data

Zhengmi Tang , Tomo Miyazaki , Shinichiro Omachi

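Category: Computer Vision

2022-09-06

Scene text image synthesis techniques, which aim to compose text instances naturally on background scene images, are very appealing for training deep neural networks because they can provide accurate and comprehensive annotation information. Prior studies have explored generating synthetic text images on two-dimensional and three-dimensional surfaces using rules derived from real-world observations. Some of these studies have proposed generating scene text images by learning; however, owing to the lack of a suitable training dataset, only unsupervised frameworks that learn from existing real-world data have been explored, which may not lead to robust performance. To ease this dilemma and facilitate research on learning-based scene text synthesis, we propose a real-world dataset prepared from public benchmarks, with three types of annotations: quadrilateral-level bounding boxes, stroke-level text masks, and text-erased images. Using the DecompST dataset, we propose an image synthesis engine that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet first predicts regions suitable for text embedding. TAANet then adaptively changes the geometry and color of the text instance according to the context of the background. Our comprehensive experiments verify the effectiveness of the proposed method for generating pretraining data for scene text detectors.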

Exploring Stroke-Level Modifications for Scene Text Editing

Yadong Qu , Qingfeng Tan , Hongtao Xie , Jianjun Xu , Yuxin Wang , Yongdong Zhang

Category: Computer Vision

2022-12-05

Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text. However, due to the complicated background textures and various text styles, existing methods fall short in generating clear and legible edited text images. In this study, we attribute the poor editing performance to two problems: 1) Implicit decoupling structure. Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously. 2) Domain gap. Due to the lack of edited real scene text images, the network can only be well trained on synthetic pairs and performs poorly on real-world images. To handle the above problems, we propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL). Firstly, we generate stroke guidance maps to explicitly indicate regions to be edited. Different from the implicit one by directly modifying all the pixels at image level, such explicit instructions filter out the distractions from background and guide the network to focus on editing rules of text regions. Secondly, we propose a Semi-supervised Hybrid Learning to train the network with both labeled synthetic images and unpaired real scene text images. Thus, the STE model is adapted to real-world datasets distributions. Moreover, two new datasets (Tamper-Syn2k and Tamper-Scene) are proposed to fill the blank of public evaluation datasets. Extensive experiments demonstrate that our MOSTEL outperforms previous methods both qualitatively and quantitatively. Datasets and code will be available at https://github.com/qqqyd/MOSTEL.

DGFont++: Robust Deformable Generative Networks for Unsupervised Font Generation

Xinyuan Chen , Yangchen Xie , Li Sun , Yue Lu

Category: Computer Vision | Artificial Intelligence

2022-12-30

Automatic font generation without human experts is a practical and significant problem, especially for some languages that consist of a large number of characters. Existing methods for font generation are often in supervised learning. They require a large number of paired data, which are labor-intensive and expensive to collect. In contrast, common unsupervised image-to-image translation methods are not applicable to font generation, as they often define style as the set of textures and colors. In this work, we propose a robust deformable generative network for unsupervised font generation (abbreviated as DGFont++). We introduce a feature deformation skip connection (FDSC) to learn local patterns and geometric transformations between fonts. The FDSC predicts pairs of displacement maps and employs the predicted maps to apply deformable convolution to the low-level content feature maps. The outputs of FDSC are fed into a mixer to generate final results. Moreover, we introduce contrastive self-supervised learning to learn a robust style representation for fonts by understanding the similarity and dissimilarities of fonts. To distinguish different styles, we train our model with a multi-task discriminator, which ensures that each style can be discriminated independently. In addition to adversarial loss, another two reconstruction losses are adopted to constrain the domain-invariant characteristics between generated images and content images. Taking advantage of FDSC and the adopted loss functions, our model is able to maintain spatial information and generates high-quality character images in an unsupervised manner. Experiments demonstrate that our model is able to generate character images of higher quality than state-of-the-art methods.

StrokeNet: Stroke Assisted and Hierarchical Graph Reasoning Networks

Lei Li , Kai Fan , Chun Yuan

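Category: Computer Vision

2021-11-23

Scene text detection remains a challenging task, as there may be extremely small or low-resolution strokes, and closely spaced or arbitrarily shaped text. In this paper, we propose StrokeNet to detect text effectively by capturing fine-grained strokes and inferring structural relations between the hierarchical representations in a graph. Unlike existing approaches that represent text regions by a series of points or rectangular boxes, we directly localize the strokes of each text instance through a Stroke Assisted Prediction Network (SAPN). In addition, a Hierarchical Relation Graph Network (HRGN) is adopted to perform relational reasoning and predict the likelihood of linkages, effectively splitting close text instances and grouping node classification results into arbitrarily shaped text regions. We introduce a small dataset with stroke-level annotations, namely SynthStroke, for offline pre-training of our model. Experiments on a wide range of benchmarks verify the state-of-the-art performance of our method. Our dataset and code will be made available.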
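
Grouping nodes by predicted link likelihood can be sketched with a union-find structure: links whose probability passes a threshold merge their endpoints, and each resulting component is one text instance. The function group_nodes, the threshold value, and the (node, node, probability) link format are assumptions made for illustration; the abstract does not describe how HRGN performs this step.

```python
from typing import List, Tuple


def group_nodes(num_nodes: int,
                links: List[Tuple[int, int, float]],
                threshold: float = 0.5) -> List[List[int]]:
    """Merge nodes joined by links with probability >= threshold and
    return the connected components as sorted lists of node indices."""
    parent = list(range(num_nodes))

    def find(x: int) -> int:
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, prob in links:
        if prob >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())
```

Weak links are simply ignored, which is how two adjacent text instances stay separate even when their nodes are spatially close.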

Zoom-to-Inpaint: Image Inpainting with High-Frequency Details

Soo Ye Kim , Kfir Aberman , Nori Kanazawa , Rahul Garg , Neal Wadhwa , Huiwen Chang , Nikhil Karnad , Munchurl Kim , Orly Liba

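Category: Computer Vision

2020-12-17

Although deep learning has enabled a huge leap forward in image inpainting, current methods often fail to synthesize realistic high-frequency details. In this paper, we propose applying super-resolution to coarsely reconstructed outputs, refining them at high resolution, and then downscaling the output to the original resolution. By introducing high-resolution images to the refinement network, our framework is able to reconstruct finer details that are usually smoothed out due to spectral bias (the tendency of neural networks to reconstruct low frequencies better than high frequencies). To assist training of the refinement network on large upscaled holes, we propose a progressive learning technique in which the size of the missing regions increases as training progresses. Our zoom-in, refine, and zoom-out strategy, combined with high-resolution supervision and progressive learning, constitutes a framework-agnostic approach for enhancing high-frequency details that can be applied to any CNN-based inpainting method. We provide qualitative and quantitative evaluations along with an ablation analysis to show the effectiveness of our approach. This seemingly simple yet powerful approach outperforms state-of-the-art inpainting methods. Our code is available at https://github.com/google/zoom-to-inpaint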
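
The progressive learning technique grows the missing region as training proceeds. One way to express such a curriculum is a schedule that maps the training step to a hole size. The sketch below assumes a linear ramp and a hypothetical hole_size function; the abstract does not state which schedule the authors use.

```python
def hole_size(step: int, total_steps: int, min_size: int, max_size: int) -> int:
    """Linearly grow the hole side length from min_size at step 0
    to max_size at the last training step."""
    if total_steps <= 1:
        return max_size
    fraction = step / (total_steps - 1)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range steps
    return round(min_size + fraction * (max_size - min_size))
```

For example, with 101 steps and sizes from 16 to 128 pixels, the hole starts at 16, reaches 72 at the midpoint, and ends at 128.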

Structure-guided Image Outpainting

Xi Wang , Weixi Cheng , Wenliang Jia

Category: Computer Vision | Artificial Intelligence

2022-12-21

Deep learning techniques have made considerable progress in image inpainting, restoration, and reconstruction in the last few years. Image outpainting, also known as image extrapolation, lacks attention and practical approaches to be fulfilled, owing to difficulties caused by large-scale area loss and less legitimate neighboring information. These difficulties have made outpainted images handled by most of the existing models unrealistic to human eyes and spatially inconsistent. When upsampling through deconvolution to generate fake content, the naive generation methods may lead to results lacking high-frequency details and structural authenticity. Therefore, as our novelties to handle image outpainting problems, we introduce structural prior as a condition to optimize the generation quality and a new semantic embedding term to enhance perceptual sanity. We propose a deep learning method based on Generative Adversarial Network (GAN) and condition edges as structural prior in order to assist the generation. We use a multi-phase adversarial training scheme that comprises edge inference training, contents inpainting training, and joint training. The newly added semantic embedding loss is proved effective in practice.

Contrastive Attention Network with Dense Field Estimation for Face Completion

Xin Ma , Xiaoqiang Zhou , Huaibo Huang , Gengyun Jia , Zhenhua Chai , Xiaolin Wei

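Category: Computer Vision

2021-12-20

Most modern face completion approaches adopt an autoencoder or its variants to restore missing regions in face images. Encoders are often used to learn powerful representations, which play an important role in meeting the challenges of sophisticated learning tasks. Specifically, various kinds of masks often appear in face images in the wild, forming complex patterns, especially during the difficult period of COVID-19. It is hard for encoders to capture such powerful representations in this complex situation. To address this challenge, we propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders. It can encode contextual semantics from full-resolution images and obtain more discriminative representations. To handle geometric variations of face images, a dense correspondence field is integrated into the network. We further propose a multi-scale decoder with a novel dual attention fusion module (DAF), which can combine the restored and known regions in an adaptive manner. This multi-scale architecture helps the decoder carry the discriminative representations learned by the encoders into the images. Extensive experiments clearly demonstrate that the proposed approach not only achieves more appealing results than state-of-the-art methods but also improves the performance of masked face recognition.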

Arbitrary Shape Text Detection via Segmentation with Probability Maps

Shi-Xue Zhang , Xiaobin Zhu , Lei Chen , Jie-Bo Hou , Xu-Cheng Yin

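Category: Computer Vision

2022-08-26

Arbitrary shape text detection is a challenging task due to widely varying sizes and aspect ratios, arbitrary orientations or shapes, inaccurate annotations, and so on, and it has attracted considerable attention recently. However, accurate pixel-level annotation of text is a formidable effort, and existing scene text detection datasets provide only coarse-grained boundary annotations. Consequently, there are always many misclassified text pixels or background pixels, which degrades the performance of segmentation-based text detection methods. Generally speaking, whether a pixel belongs to text is highly related to its distance from the adjacent annotation boundary. Based on this observation, in this paper we propose an innovative and robust segmentation-based detection method via probability maps for accurately detecting text instances. Specifically, we adopt a Sigmoid Alpha Function (SAF) to transfer the distances between boundaries and their inside pixels into a probability map. However, a single probability map cannot cover complex probability distributions well because of the uncertainty of coarse-grained text boundary annotations. Therefore, we adopt a group of probability maps computed by a series of Sigmoid Alpha Functions to describe the possible probability distributions. In addition, we propose an iterative model that learns to predict and assimilate probability maps, providing enough information to reconstruct text instances. Finally, a simple region growing algorithm is adopted to aggregate the probability maps into complete text instances. Experimental results demonstrate that our method achieves state-of-the-art detection accuracy on several benchmarks.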
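
To make the distance-to-probability idea concrete, the sketch below maps each pixel's distance from the annotation boundary to a value in [0, 1) with a rescaled sigmoid, and computes one map per steepness value alpha. The exact form of the Sigmoid Alpha Function is not given in the abstract, so the formula used here, and the names sigmoid_alpha and probability_maps, are illustrative assumptions only.

```python
import math
from typing import List


def sigmoid_alpha(distance: float, alpha: float) -> float:
    """Assumed stand-in for the SAF: 0 on the boundary (distance 0),
    approaching 1 deep inside the text region; alpha sets the steepness."""
    return 2.0 / (1.0 + math.exp(-alpha * distance)) - 1.0


def probability_maps(distances: List[float], alphas: List[float]) -> List[List[float]]:
    """Compute one probability map per alpha over the same distances."""
    return [[sigmoid_alpha(d, a) for d in distances] for a in alphas]
```

Using several alpha values yields a group of maps that disagree most near the boundary, which is where coarse annotations are least reliable.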

Texture Memory-Augmented Deep Patch-Based Image Inpainting

Rui Xu , Minghao Guo , Jiaqi Wang , Xiaoxiao Li , Bolei Zhou , Chen Change Loy

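Category: Computer Vision | Machine Learning

2020-09-28

Patch-based methods and deep networks have both been employed to tackle the image inpainting problem, each with its own strengths and weaknesses. Patch-based methods can restore a missing region with high-quality texture by searching for nearest-neighbor patches in the unmasked regions. However, these methods produce problematic content when recovering large missing regions. Deep networks, on the other hand, show promising results in completing large regions. Nonetheless, the results often lack faithful and sharp details that resemble the surrounding area. By bringing together the best of both paradigms, we propose a new deep inpainting framework in which texture generation is guided by a texture memory of patch samples extracted from unmasked regions. The framework has a novel design that allows texture memory retrieval to be trained together with the deep inpainting network. In addition, we introduce a patch distribution loss to encourage high-quality patch synthesis. The proposed method shows superior performance both qualitatively and quantitatively on three challenging image benchmarks: the Places, CelebA-HQ, and Paris Street-View datasets.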
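
The nearest-neighbor patch search that classical patch-based methods rely on can be sketched in a few lines: each candidate patch from the unmasked region is compared with the query using the sum of squared differences, and the closest one wins. The function nearest_patch and the flattened-patch representation are assumptions for illustration; this is the generic search, not the paper's learned texture memory retrieval.

```python
from typing import List, Sequence


def nearest_patch(query: Sequence[float],
                  candidates: List[Sequence[float]]) -> int:
    """Return the index of the candidate patch with the smallest sum of
    squared differences to the query. Patches are flattened pixel lists."""
    best_index = -1
    best_distance = float("inf")
    for index, candidate in enumerate(candidates):
        distance = sum((q - c) ** 2 for q, c in zip(query, candidate))
        if distance < best_distance:
            best_index = index
            best_distance = distance
    return best_index
```

This exhaustive search is not differentiable, which is one reason a framework that trains retrieval jointly with the network needs a different design.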

Image Synthesis with Disentangled Attributes for Chest X-Ray Nodule Augmentation and Detection

Zhenrong Shen , Xi Ouyang , Bin Xiao , Jie-Zhi Cheng , Qian Wang , Dinggang Shen

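Category: Computer Vision

2022-07-19

Lung nodule detection in chest X-ray (CXR) images is used for early screening of lung cancer. Deep-learning-based computer-assisted diagnosis (CAD) systems can support radiologists in nodule screening on CXR. However, training such robust and accurate CAD systems requires large-scale and diverse medical data with high-quality annotations. To alleviate the limited availability of such datasets, lung nodule synthesis methods have been proposed for data augmentation. Nevertheless, previous methods lack the ability to generate nodules whose size attribute matches what the detector requires. To address this issue, we introduce a novel lung nodule synthesis framework that decomposes nodule attributes into three main aspects: shape, size, and texture. A GAN-based shape generator first models nodule shapes by generating diverse shape masks. The subsequent size modulation then enables quantitative control over the diameters of the generated nodule shapes at pixel-level granularity. Finally, a coarse-to-fine gated convolutional texture generator synthesizes visually plausible nodule textures conditioned on the modulated shape masks. Moreover, we propose to synthesize nodule CXR images by controlling the disentangled nodule attributes for data augmentation, in order to better compensate for nodules that are easily missed in the detection task. Our experiments demonstrate the enhanced image quality, diversity, and controllability of the proposed lung nodule synthesis framework. We also validate the effectiveness of our data augmentation in greatly improving nodule detection performance.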

TextDCT: Arbitrary-Shaped Text Detection via Discrete Cosine Transform Mask

Yuchen Su , Zhiwen Shao , Yong Zhou , Fanrong Meng , Hancheng Zhu , Bing Liu , Rui Yao

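Category: Computer Vision

2022-06-27

Arbitrary-shaped scene text detection is a challenging task due to the wide variation of text in font, size, color, and orientation. Most existing regression-based methods regress the masks or contour points of text regions to model text instances. However, regressing complete masks requires high training complexity, and contour points are not sufficient to capture the details of highly curved text. To tackle these limitations, we propose a novel lightweight anchor-free text detection framework called TextDCT, which adopts the discrete cosine transform (DCT) to encode text masks as compact vectors. Further, considering the imbalanced number of training samples among pyramid layers, we employ only a single-level head for top-down prediction. To model multi-scale text in a single-level head, we introduce a novel positive sampling strategy that treats the shrunk text region as positive samples, and we design a feature awareness module (FAM) for spatial awareness and scale awareness by fusing rich contextual information and focusing on more significant features. Moreover, we propose a segmented non-maximum suppression (S-NMS) method that can filter out low-quality mask regressions. Extensive experiments on four challenging datasets demonstrate that TextDCT achieves competitive performance in both accuracy and efficiency. Specifically, TextDCT achieves an F-measure of 85.1 at 17.2 frames per second (FPS) on CTW1500 and an F-measure of 84.9 at 15.1 FPS on Total-Text.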
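
Encoding a mask as a compact DCT vector can be sketched with an orthonormal 2D DCT-II: transform the binary mask, keep only a small block of low-frequency coefficients, and invert the transform to recover an approximate mask. The sketch below keeps the top-left keep x keep block, which is a simplifying assumption; the coefficient selection order, the vector length, and the function names encode_mask and decode_mask are not taken from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def encode_mask(mask: Matrix, keep: int) -> List[float]:
    """2D DCT of the mask, flattened to the keep x keep low-frequency block."""
    h, w = len(mask), len(mask[0])
    coeffs = matmul(matmul(dct_matrix(h), mask), transpose(dct_matrix(w)))
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


def decode_mask(vector: List[float], keep: int, h: int, w: int) -> List[List[int]]:
    """Zero-pad the coefficients, invert the DCT, and threshold at 0.5."""
    coeffs = [[0.0] * w for _ in range(h)]
    for index, value in enumerate(vector):
        coeffs[index // keep][index % keep] = value
    recon = matmul(matmul(transpose(dct_matrix(h)), coeffs), dct_matrix(w))
    return [[1 if value >= 0.5 else 0 for value in row] for row in recon]
```

With keep equal to the mask size the round trip is lossless; smaller values trade boundary detail for a shorter vector, which is the compression the abstract refers to.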

Progressive Update Guided Interdependent Networks for Single Image Dehazing

Aupendu Kar , Sobhan Kanti Dhara , Debashis Sen , Prabir Kumar Biswas

Category: Computer Vision

2020-08-04

Images with haze of different varieties often pose a significant challenge to dehazing. Therefore, guidance by estimates of haze parameters related to the variety would be beneficial and their progressive update jointly with haze reduction will allow effective dehazing. To this end, we propose a multi-network dehazing framework containing novel interdependent dehazing and haze parameter updater networks that operate in a progressive manner. The haze parameters, transmission map and atmospheric light, are first estimated using specific convolutional networks allowing color-cast handling. The estimated parameters are then used to guide our dehazing module, where the estimates are progressively updated by novel convolutional networks. The updating takes place jointly with progressive dehazing by a convolutional network that invokes inter-step dependencies. The joint progressive updating and dehazing gradually modify the haze parameter estimates toward achieving effective dehazing. Through different studies, our dehazing framework is shown to be more effective than image-to-image mapping or predefined haze formation model based dehazing. Our dehazing framework is qualitatively and quantitatively found to outperform the state-of-the-art on synthetic and real-world hazy images of several datasets with varied haze conditions.
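
The two haze parameters named above, the transmission map and the atmospheric light, are the terms of the widely used atmospheric scattering model, in which a hazy pixel is I = J * t + A * (1 - t). The sketch below applies and inverts that model for a single pixel value. Whether the paper uses exactly this form is an assumption, and the function names and the t_min floor are hypothetical.

```python
def add_haze(clear: float, transmission: float, airlight: float) -> float:
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    return clear * transmission + airlight * (1.0 - transmission)


def remove_haze(hazy: float, transmission: float, airlight: float,
                t_min: float = 0.1) -> float:
    """Invert the model: J = (I - A) / max(t, t_min) + A.
    The floor t_min avoids dividing by a near-zero transmission."""
    return (hazy - airlight) / max(transmission, t_min) + airlight
```

The inversion is only as good as the estimates of t and A, which is why updating those estimates jointly with the dehazing step can help.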

Coarse-to-fine Task-driven Inpainting for Geoscience Images

Huiming Sun , Jin Ma , Qing Guo , Song Shaoyue , Yuewei Lin , Hongkai Yu

Category: Computer Vision

2022-11-20

The processing and recognition of geoscience images have wide applications. Most existing research focuses on understanding high-quality geoscience images by assuming that all the images are clear. However, in many real-world cases, geoscience images might contain occlusions introduced during image acquisition. This problem amounts to the image inpainting problem in computer vision and multimedia. To the best of our knowledge, all existing image inpainting algorithms learn to repair the occluded regions for better visualization quality; they are excellent for natural images but not good enough for geoscience images because they ignore the geoscience-related tasks. This paper aims to repair the occluded regions for better geoscience task performance and advanced visualization quality simultaneously, without changing the currently deployed deep-learning-based geoscience models. Because of the complex context of geoscience images, we propose a coarse-to-fine encoder-decoder network with coarse-to-fine adversarial context discriminators to reconstruct the occluded image regions. Due to the limited data of geoscience images, we use a MaskMix-based data augmentation method to exploit more information from limited geoscience image data. The experimental results on three public geoscience datasets for remote sensing scene recognition, cross-view geolocation, and semantic segmentation tasks, respectively, show the effectiveness and accuracy of the proposed method.

V-LinkNet: Learning Contextual Inpainting Across Latent Space of Generative Adversarial Network

Jireh Jam , Connah Kendrick , Vincent Drouard , Kevin Walker , Moi Hoon Yap

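Category: Computer Vision

2022-01-02

Deep learning methods outperform traditional methods in image inpainting. To generate contextual textures, researchers are still working to improve existing methods and propose models that can extract, propagate, and reconstruct features similar to ground-truth regions. Furthermore, the lack of a high-quality feature transfer mechanism in deeper layers contributes to persistent aberrations in the generated inpainted regions. To address these limitations, we propose the V-LinkNet cross-space learning strategy network. To improve the learning of contextualized features, we design a loss model that uses two encoders. In addition, we propose a recursive residual transition layer (RSTL). The RSTL extracts high-level semantic information and propagates it down the layers. Finally, we compare inpainting performance on the same face with different masks and on different faces with the same masks. To improve image inpainting reproducibility, we propose a standard protocol to overcome biases from varying masks and images. We investigate the V-LinkNet components experimentally. Our results surpass the state of the art when evaluated on CelebA-HQ with the standard protocol. In addition, our model generalizes well when evaluated on the Paris Street View and Places2 datasets with the standard protocol.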

SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception

Qi Qi , Kunqian Li , Haiyong Zheng , Xiang Gao , Guojia Hou , Kun Sun

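Category: Computer Vision

2022-01-08

Due to wavelength-dependent light attenuation, refraction, and scattering, underwater images usually suffer from color distortion and blurred details. However, because only a limited number of paired underwater images with undistorted reference images are available, training deep enhancement models for diverse degradation types is quite difficult. To boost the performance of data-driven approaches, it is essential to establish more effective learning mechanisms that mine richer supervisory information from limited training samples. In this paper, we propose a novel underwater image enhancement network, called SGUIE-Net, in which we introduce semantic information as high-level guidance across different images that share common semantic regions. Accordingly, we propose a semantic region-wise enhancement module that perceives the degradation of different semantic regions at multiple scales and feeds it back to the global attention features extracted at the original scale. This strategy helps achieve robust and visually pleasing enhancement of different semantic objects, which is attributable to the guidance of semantic information for differentiated enhancement. More importantly, for degradation types that are uncommon in the training sample distribution, the guidance connects them with already well-learned types according to their semantic relevance. Extensive experiments on public datasets and our proposed dataset demonstrate the impressive performance of SGUIE-Net. The code and the proposed dataset are available at https://trentqq.github.io/sguie-net.html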

Multi-stage progressive image restoration

Category:

Image restoration tasks demand a complex balance between spatial details and high-level contextualized information while recovering images. In this paper, we propose a novel synergistic design that can optimally balance these competing goals. Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs, thereby breaking down the overall recovery process into more manageable steps. Specifically, our model first learns the contextualized features using encoder-decoder architectures and later combines them with a high-resolution branch that retains local information. At each stage, we introduce a novel per-pixel adaptive design that leverages in-situ supervised attention to reweight the local features. A key ingredient in such a multi-stage architecture is the information exchange between different stages. To this end, we propose a two-faceted approach where the information is not only exchanged sequentially from early to late stages, but lateral connections between feature processing blocks also exist to avoid any loss of information. The resulting tightly interlinked multi-stage architecture, named MPRNet, delivers strong performance gains on ten datasets across a range of tasks including image deraining, deblurring, and denoising. The source code and pre-trained models are available at https://github.com/swz30/MPRNet.

Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

Longlong Jing , Yingli Tian

Category:

2019-02-16

Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. The paper concludes with a set of promising future directions for self-supervised visual feature learning.
