Video understanding is a growing field and a subject of intense research. It includes many tasks that require understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, because unconstrained videos have a long and complicated temporal structure. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore, it is crucial to design a contextual, explainable video representation extraction method that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human-perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.
translated by 谷歌翻译
The biomedical imaging world is notorious for working with small amounts of data, which frustrates state-of-the-art efforts from the computer vision and deep learning worlds. With large datasets it is easier to make progress, as we have seen with natural images. The same holds for microscopy videos of neuron cells moving in a culture. This problem presents several challenges: it can be difficult to grow and maintain the culture for days, and it is expensive to acquire the materials and equipment. In this work, we explore how to alleviate this data scarcity problem by synthesizing the videos. We therefore take the recent video diffusion model and use it to synthesize videos of cells from our training dataset. We then analyze the model's strengths and consistent shortcomings to guide us in making video generation as high-quality as possible. To improve on this task, we propose modifying the denoising function and adding motion information (dense optical flow) so that the model has more context about how video frames transition and how each pixel changes over time.
Although unsupervised domain adaptation methods have achieved remarkable performance in semantic scene segmentation for visual perception in self-driving cars, these approaches remain impractical in real-world use cases. In practice, the segmentation models may encounter new data that they have not seen before. In addition, the data previously used to train the segmentation models may be inaccessible due to privacy concerns. To address these problems, we propose a Continual Unsupervised Domain Adaptation (CONDA) approach that allows the model to continuously learn and adapt as new data arrive. Moreover, our approach is designed so that it does not require access to previous training data. To avoid the catastrophic forgetting problem and maintain the performance of the segmentation models, we present a novel Bijective Maximum Likelihood loss that imposes a constraint on shifts in the predicted segmentation distribution. The experimental results on the benchmark of continual unsupervised domain adaptation demonstrate the advanced performance of the proposed CONDA method.
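In this work, we investigate the problem of face reconstruction given a facial feature representation extracted from a black-box face recognition engine. This is a very challenging problem in practice because of the limited, abstracted information available from the engine. We therefore introduce a new method named Attention-based Bijective Generative Adversarial Networks in a Distillation framework (DAB-GAN) to synthesize the faces of a subject given his/her extracted face recognition features. Given any unconstrained facial features of a subject, DAB-GAN can reconstruct his/her face in high definition. The DAB-GAN method includes a novel attention-based generative structure with the newly defined Bijective Metrics Learning approach. The framework starts by introducing a bijective metric so that distance measurement and the metric learning process can be adopted directly in the image domain for the image reconstruction task. The information from the black-box face recognition engine is optimally exploited using a global distillation process. Then, an attention-based generator is presented to obtain a highly robust generator that synthesizes realistic faces with ID preservation. We have evaluated our method on challenging face recognition databases, i.e., CelebA, LFW, AgeDB, and CFP-FP, and consistently achieved state-of-the-art results. The advancement of DAB-GAN is also demonstrated on both image realism and ID preservation properties.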
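This paper aims to tackle Multiple Object Tracking (MOT), an important problem in computer vision that remains challenging due to many practical issues, especially occlusions. We propose a new real-time Depth Perspective-aware Multiple Object Tracking (DP-MOT) approach to tackle the occlusion problem in MOT. A simple yet efficient Subject-Ordered Depth Estimation (SODE) is first proposed to automatically order the depth positions of detected subjects in a 2D scene in an unsupervised manner. Using the output of SODE, a new Active pseudo-3D Kalman filter, a simple but effective extension of the Kalman filter with dynamic control variables, is then proposed to dynamically update the movement of objects. In addition, a new high-order association approach is presented in the data association step to incorporate first-order and second-order relationships between the detected objects. The proposed approach consistently achieves state-of-the-art performance compared to recent MOT methods on standard MOT benchmarks.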
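
For readers unfamiliar with control variables in a Kalman filter, the following is a minimal sketch of the textbook predict/update steps with a control input. It is not the Active pseudo-3D Kalman filter of DP-MOT, whose details the abstract does not give; the function names, the constant-velocity state, and the acceleration control `u` are illustrative assumptions.

```python
import numpy as np

def kalman_predict(x, P, F, Q, B=None, u=None):
    """Textbook predict step: x' = F x + B u, P' = F P F^T + Q."""
    x_pred = F @ x
    if B is not None and u is not None:
        # The control term lets an external signal steer the motion model.
        x_pred = x_pred + B @ u
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Textbook update step with measurement z."""
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Hypothetical 1D example: state = [position, velocity], time step of 1.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])                # maps an acceleration into the state
H = np.array([[1.0, 0.0]])                  # only the position is observed
x0, P0 = np.array([0.0, 1.0]), np.eye(2)
x_pred, P_pred = kalman_predict(x0, P0, F, np.zeros((2, 2)), B, np.array([2.0]))
x_new, P_new = kalman_update(x_pred, P_pred, np.array([2.0]), H, np.eye(1))
```

In this toy setup the control input changes the predicted state before the measurement is fused, which is the general role a dynamic control variable plays in such a filter.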
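In this paper, we leverage the human perception process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities, i.e., (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract scene element descriptions of both human and non-human objects (e.g., animals, vehicles, etc.) and of visual and non-visual elements (e.g., relations, activities, etc.). Furthermore, we propose to train our VLCap under a contrastive learning VL loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that our VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.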
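Hardware-based acceleration is a widespread approach to facilitating many computationally intensive mathematical operations. This paper proposes an FPGA-based architecture to accelerate the convolution operation, a complex and expensive computing step that appears in many Convolutional Neural Network models. We target the design at the standard convolution operation, intending to launch the product as an edge-AI solution. The purpose of the project is to produce an FPGA IP core that can process one convolutional layer at a time. Because Verilog HDL is the primary design language of the architecture, system developers can deploy the IP core. The experimental results show that a single computing core synthesized on a simple edge computing FPGA board can offer 0.224 GOPS. When the board is fully utilized, 4.48 GOPS can be achieved.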
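
To put the two throughput figures in context, here is a small back-of-the-envelope sketch. The ratio 4.48 / 0.224 = 20 follows directly from the numbers above; reading it as about 20 parallel cores assumes linear scaling, which the abstract does not state. The layer dimensions and the convention of counting one multiply-accumulate as two operations are also assumptions made for illustration.

```python
def conv_layer_ops(h_out, w_out, c_in, c_out, k):
    """Operation count of a standard convolution layer.

    Assumed convention: each multiply-accumulate counts as two operations.
    """
    return 2 * h_out * w_out * c_in * c_out * k * k

def latency_seconds(ops, gops):
    """Time to execute `ops` operations at a throughput of `gops` GOPS."""
    return ops / (gops * 1e9)

single_core_gops = 0.224   # reported for one computing core
full_board_gops = 4.48     # reported for the fully utilized board

# Ratio of the two reported figures (about 20 under linear scaling).
implied_scale = full_board_gops / single_core_gops

# Hypothetical layer: 32x32 output, 3 input channels, 16 filters, 3x3 kernel.
ops = conv_layer_ops(32, 32, 3, 16, 3)
t_single = latency_seconds(ops, single_core_gops)
t_board = latency_seconds(ops, full_board_gops)
```

Under these assumptions, the same layer runs about 20 times faster on the fully utilized board than on a single core.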
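Self-training for crowd counting has not been attentively explored, even though it is one of the important challenges in computer vision. In practice, fully supervised methods usually require a large amount of manual annotation. To address this challenge, this work introduces a new approach that uses existing datasets with ground truth to produce more robust predictions on unlabeled datasets in crowd counting, a setting known as domain adaptation. While the network is trained with labeled data, unlabeled samples from the target domain are also added to the training process. In this process, the entropy map is computed and minimized, in addition to an adversarial training process designed in parallel. Experiments on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets demonstrate that our method yields more general improvements in the cross-domain setting.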
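
As a rough illustration of what computing and minimizing an entropy map can look like, the sketch below computes a generic pixel-wise Shannon entropy over per-pixel probability vectors, as is common in entropy-minimization domain adaptation. The abstract does not say how the entropy map is defined for crowd-density predictions, so the `(K, H, W)` probability layout and both function names are assumptions.

```python
import numpy as np

def entropy_map(probs, eps=1e-12):
    """Pixel-wise Shannon entropy in nats.

    probs: array of shape (K, H, W) whose values sum to 1 over axis 0
    (an assumed layout). Returns an (H, W) map; high values mark pixels
    where the prediction is uncertain.
    """
    return -np.sum(probs * np.log(probs + eps), axis=0)

def entropy_loss(probs):
    """Mean of the entropy map, usable as an unsupervised loss on target data."""
    return float(entropy_map(probs).mean())

# Toy predictions on a 2x2 image with K = 2 outcomes per pixel.
uniform = np.full((2, 2, 2), 0.5)                                      # maximally uncertain
confident = np.stack([np.full((2, 2), 0.99), np.full((2, 2), 0.01)])   # nearly certain

loss_uniform = entropy_loss(uniform)
loss_confident = entropy_loss(confident)
```

Minimizing such a loss on unlabeled target-domain samples pushes the network toward confident predictions there, which is the usual motivation for entropy minimization.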
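Disentangled representations have been commonly adopted for Age-invariant Face Recognition (AiFR) tasks. However, these methods have several limitations: (1) they require large-scale face recognition (FR) training data with age labels, which is limited in practice; (2) they rely on heavy deep network architectures for high performance; and (3) they are usually evaluated on age-related face databases while neglecting the standard large-scale FR databases needed to guarantee robustness. This work presents a novel Lightweight Attentive Angular Distillation (LIAAD) approach to large-scale lightweight AiFR that overcomes these limitations. Given two teachers with different specialized knowledge, LIAAD introduces a learning paradigm that efficiently distills the age-invariant attentive and angular knowledge from those teachers into a lightweight student network, making it more powerful, with higher FR accuracy and robustness against the age factor. Consequently, the LIAAD approach is able to take advantage of FR datasets both with and without age labels to train an AiFR model. Unlike prior distillation methods, which mainly focus on accuracy and compression ratios in closed-set problems, LIAAD aims to solve the open-set problem, i.e., large-scale face recognition. Evaluations on LFW, IJB-B and IJB-C Janus, AgeDB, and MegaFace-FGNet demonstrate the efficiency of the proposed approach on a lightweight structure. This work also presents a new longitudinal face aging (LogiFace) database (the database will be made available) for further studies of age-related facial problems in the future.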
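Flow-based generative models have recently become one of the most efficient approaches to modeling data generation. They are constructed from a sequence of invertible and tractable transformations. Glow first introduced a simple type of generative flow using an invertible $1 \times 1$ convolution. However, the $1 \times 1$ convolution has limited flexibility compared to standard convolutions. In this paper, we propose a novel invertible $n \times n$ convolution approach that overcomes the limitations of the invertible $1 \times 1$ convolution. In addition, our proposed network is not only tractable and invertible but also uses fewer parameters than standard convolutions. Experiments on the CIFAR-10, ImageNet, and Celeb-HQ datasets show that our invertible $n \times n$ convolution helps to significantly improve the performance of generative models.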
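
For context, the invertible $1 \times 1$ convolution that the abstract takes as its starting point can be sketched in a few lines of NumPy. This shows only the Glow-style baseline, not the proposed $n \times n$ method, whose construction the abstract does not describe; the function names and the `(C, H, W)` tensor layout are illustrative assumptions.

```python
import numpy as np

def conv1x1_forward(x, weight):
    """Invertible 1x1 convolution on x of shape (C, H, W).

    The same C x C matrix is applied to the channel vector at every pixel,
    so the log-determinant of the Jacobian is H * W * log|det(weight)|.
    """
    _, h, w = x.shape
    y = np.einsum("ij,jhw->ihw", weight, x)
    _, logabsdet = np.linalg.slogdet(weight)
    return y, h * w * logabsdet

def conv1x1_inverse(y, weight):
    """Exact inverse: apply the inverse matrix at every pixel."""
    return np.einsum("ij,jhw->ihw", np.linalg.inv(weight), y)

rng = np.random.default_rng(0)
# Random orthogonal initialization via QR, so |det(weight)| starts at 1.
weight, _ = np.linalg.qr(rng.normal(size=(3, 3)))
x = rng.normal(size=(3, 4, 4))

y, logdet = conv1x1_forward(x, weight)
x_rec = conv1x1_inverse(y, weight)
```

Because the matrix only mixes channels at a single pixel, it has no spatial receptive field, which is one way to read the limited flexibility mentioned above.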