Compared with the image-based static facial expression recognition (SFER) task, the dynamic facial expression recognition (DFER) task based on video sequences is closer to natural expression-recognition scenarios. However, DFER is usually more challenging. One of the main reasons is that video sequences often contain frames with different expression intensities, especially for facial expressions in the real world, whereas the images in SFER frequently present uniform and high expression intensity. However, if expressions with different intensities are treated equally, the features learned by the network will have large intra-class and small inter-class differences, which is harmful to DFER. To tackle this problem, we propose the global convolution-attention block (GCA) to rescale the channels of the feature maps. In addition, we introduce the intensity-aware loss (IAL) in the training process to help the network distinguish samples with relatively low expression intensity. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39k) indicate that our method outperforms the state-of-the-art DFER approaches. The source code will be made publicly available.
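The abstract does not spell out the GCA block or the IAL term, so the following PyTorch sketch only illustrates the general ideas under stated assumptions: an SE-style block that rescales feature-map channels, and an assumed intensity-aware penalty that targets samples whose target-class score barely exceeds the hardest non-target class (typical of low-intensity expressions). All names and the exact loss form are illustrative, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelRescale(nn.Module):
    """SE-style channel rescaling, a stand-in for the idea of re-weighting
    the channels of the feature maps (the paper's GCA block may differ)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                            # per-channel rescaling


def intensity_aware_loss(logits, targets):
    """Assumed form of an intensity-aware objective: cross-entropy plus a margin
    penalty that grows when the target class barely beats the hardest non-target
    class, which is typical of low-intensity samples."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    target_p = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    hardest_p = probs.scatter(1, targets.unsqueeze(1), 0.0).max(dim=1).values
    return ce + F.relu(1.0 - (target_p - hardest_p)).mean()
```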
Dynamic facial expression recognition (DFER) in the wild is an extremely challenging task due to the large number of noisy frames in video sequences. Previous works focus on extracting more discriminative features but ignore distinguishing the key frames from the noisy ones. To tackle this problem, we propose a noise-robust dynamic facial expression recognition network (NR-DFERNet), which can effectively reduce the interference of noisy frames on the DFER task. Specifically, at the spatial stage, we devise a dynamic-static fusion module (DSF) that introduces dynamic features into static features to learn more discriminative spatial features. To suppress the impact of target-irrelevant frames, we introduce a novel dynamic class token (DCT) for the transformer at the temporal stage. Moreover, we design a snippet-based filter (SF) at the decision stage to reduce the effect of too many neutral frames on the classification of non-neutral sequences. Extensive experimental results demonstrate that our NR-DFERNet outperforms the state-of-the-art methods on both the DFEW and AFEW benchmarks.
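As a rough illustration of the dynamic class token idea described above, the sketch below derives the transformer's class token from the frame features themselves instead of using a fixed learned parameter; the actual DCT design may differ, and the module name and pooling choice here are assumptions.

```python
import torch
import torch.nn as nn

class DynamicClassToken(nn.Module):
    """Derives the class token from the frame features themselves (here by mean
    pooling plus a linear projection) instead of using a fixed learned token."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_tokens):                              # (B, T, D)
        cls = self.proj(frame_tokens.mean(dim=1, keepdim=True))   # (B, 1, D)
        return torch.cat([cls, frame_tokens], dim=1)              # (B, T + 1, D)
```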
Facial micro-expressions (MEs) are involuntary facial motions that reveal people's real feelings and play an important role in the early intervention of mental illness, national security, and many human-computer interaction systems. However, existing micro-expression datasets are limited and usually pose some challenges for training good classifiers. To model the subtle facial muscle motions, we propose a robust micro-expression recognition (MER) framework, namely the muscle motion-guided network (MMNet). Specifically, a continuous attention (CA) block is introduced to focus on modeling local subtle muscle motion patterns with little identity information, which is different from most previous methods that directly extract features from complete video frames with much identity information. Moreover, we design a position calibration (PC) module based on the vision transformer. By adding the position embeddings of the face generated by the PC module at the end of the two branches, the PC module can help add position information to the facial muscle motion patterns for MER. Extensive experiments on three public micro-expression datasets demonstrate that our approach outperforms state-of-the-art methods by a large margin.
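A minimal sketch of the motion-centric modeling idea, under the assumption that muscle motion can be approximated by the difference between an onset and an apex frame followed by spatial attention; this is not the paper's exact CA block, and all names and layer choices are illustrative.

```python
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    """Models muscle motion as the difference between an onset and an apex frame,
    then applies a simple spatial attention so static identity appearance
    contributes little to the final feature."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)
        self.attn = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, onset, apex):             # both (B, C, H, W)
        motion = apex - onset                   # subtle muscle movement
        feat = torch.relu(self.conv(motion))
        weights = torch.sigmoid(self.attn(feat))
        return feat * weights                   # attend to moving regions
```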
Bird's-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems. Recently, plenty of works have been proposed, following a similar paradigm consisting of three essential components, i.e., camera feature extraction, BEV feature construction, and task heads. Among the three components, BEV feature construction is BEV-specific compared with 2D tasks. Existing methods aggregate the multi-view camera features to the flattened grid in order to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features of different heights. For example, the barrier is located at a low height while the truck is located at a high height. In this paper, we propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights. Instead of flattening the BEV space, we first sample along the height dimension to build the global and local BEV slices. Then, the features of BEV slices are aggregated from the camera features and merged by the attention mechanism. Finally, we fuse the merged local and global BEV features by a transformer to generate the final feature map for task heads. The purpose of local BEV slices is to emphasize informative heights. In order to find them, we further propose a LiDAR-guided sampling strategy to leverage the statistical distribution of LiDAR to determine the heights of local slices. Compared with uniform sampling, LiDAR-guided sampling can determine more informative heights. We conduct detailed experiments to demonstrate the effectiveness of BEV-SAN. Code will be released.
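The following sketch shows one plausible way to merge per-height BEV slices with an attention mechanism, as described above; the module name, the softmax-over-slices weighting, and the tensor layout are assumptions, and the actual BEV-SAN merging and transformer fusion are more involved.

```python
import torch
import torch.nn as nn

class SliceAttentionFusion(nn.Module):
    """Fuses per-height BEV slices with attention weights predicted from each
    slice, so informative heights dominate the merged BEV feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, slices):                  # (B, S, C, H, W): S height slices
        b, s, c, h, w = slices.shape
        scores = self.score(slices.reshape(b * s, c, h, w)).view(b, s, 1, h, w)
        weights = torch.softmax(scores, dim=1)  # attention over the slice axis
        return (slices * weights).sum(dim=1)    # (B, C, H, W) fused BEV feature
```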
Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties between text and image, it is challenging for a network to well align text and pixel-level features. Existing approaches use pretrained models to facilitate learning, yet separately transfer the language/vision knowledge from the pretrained models, ignoring the multi-modal corresponding information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To effectively transfer the multi-modal knowledge, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to the irrelevant ones. Experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms the state-of-the-art performance without any post-processing. The code will be released.
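A minimal sketch of what a text-to-pixel contrastive objective could look like, assuming a sentence embedding is compared against per-pixel embeddings via cosine similarity and supervised by the ground-truth mask; the exact loss in CRIS may differ, and the function signature and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(text_feat, pixel_feat, mask, tau=0.1):
    """text_feat:  (B, D)       sentence embedding
    pixel_feat: (B, D, H, W) per-pixel embeddings
    mask:       (B, H, W)    binary ground-truth mask of the referred object
    Pulls the text embedding toward pixels inside the mask and pushes it away
    from the rest via a pixel-wise binary cross-entropy on cosine similarities."""
    t = F.normalize(text_feat, dim=1)
    p = F.normalize(pixel_feat, dim=1)
    sim = torch.einsum('bd,bdhw->bhw', t, p) / tau   # scaled cosine similarity map
    return F.binary_cross_entropy_with_logits(sim, mask.float())
```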
Benefiting from the intrinsic supervision information exploitation capability, contrastive learning has achieved promising performance in the field of deep graph clustering recently. However, we observe that two drawbacks of the positive and negative sample construction mechanisms limit further improvement of existing algorithms. 1) The quality of positive samples heavily depends on the carefully designed data augmentations, while inappropriate data augmentations would easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are not reliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) by mining the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function to pull close the samples from the same cluster while pushing away those from other clusters by maximizing and minimizing the cross-view cosine similarity between positive and negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with the existing state-of-the-art algorithms.
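The sketch below illustrates the described objective in a simplified, assumed form: high-confidence nodes are pulled toward their cross-view counterparts while being pushed away from the centers of other high-confidence clusters via an InfoNCE-style loss. Names, shapes, and the temperature are illustrative rather than CCGC's exact formulation.

```python
import torch
import torch.nn.functional as F

def cluster_guided_contrastive_loss(z1, z2, pos_idx, neg_centers, tau=0.5):
    """z1, z2:      (N, D) node embeddings from the two graph views
    pos_idx:     (P,)   indices of high-confidence nodes used as positives
    neg_centers: (K, D) centers of the other high-confidence clusters
    Maximizes cross-view similarity of positives and minimizes their similarity
    to the negative cluster centers with an InfoNCE-style loss."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    centers = F.normalize(neg_centers, dim=1)
    pos = (z1[pos_idx] * z2[pos_idx]).sum(dim=1, keepdim=True) / tau  # (P, 1)
    neg = z1[pos_idx] @ centers.t() / tau                             # (P, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=z1.device)
    return F.cross_entropy(logits, labels)  # the positive sits at index 0
```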
To generate high-quality rendered images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that the rendered pixels at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which makes the supersampling an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features to significantly improve their temporal stability. Then a reconstruction network based on a multi-scale U-Net with skip connections is adopted for reconstruction and generation of the desired high-resolution image. Experimental results and comparisons have shown that our proposed method can generate higher-quality supersampling results than current state-of-the-art methods without increasing the total number of ray-tracing samples.
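As a simplified illustration of temporal accumulation, the sketch below predicts a per-pixel blend weight from the current and (already warped) previous feature maps and accumulates them; the paper's temporal accumulation network computes feature correlations and is more elaborate, so treat the module name and design here as assumptions.

```python
import torch
import torch.nn as nn

class TemporalAccumulation(nn.Module):
    """Predicts a per-pixel blend weight from the current and previous feature
    maps and accumulates them to stabilize the reconstruction over time."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, cur, prev):               # both (B, C, H, W); prev = warped history
        alpha = self.weight_net(torch.cat([cur, prev], dim=1))
        return alpha * cur + (1.0 - alpha) * prev
```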
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
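One plausible way to realize the soft boundary labels mentioned above is to spread a Gaussian around the annotated start/end frame indices so that adjacent sampled frames still receive partial supervision; the function below is an assumed illustration, not SSRN's exact label construction.

```python
import torch

def soft_boundary_labels(num_frames, start_idx, end_idx, sigma=1.0):
    """Places a Gaussian around the annotated start/end frame indices so that
    frames adjacent to the (possibly dropped) true boundaries still receive
    partial supervision. Returns two normalized (num_frames,) distributions."""
    t = torch.arange(num_frames, dtype=torch.float32)
    start = torch.exp(-(t - start_idx) ** 2 / (2 * sigma ** 2))
    end = torch.exp(-(t - end_idx) ** 2 / (2 * sigma ** 2))
    return start / start.sum(), end / end.sum()
```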
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
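The per-sample occlusion/blend weight described above can be folded into standard volume rendering as sketched below, following common dynamic-NeRF practice rather than the exact $\text{D}^4$NeRF equations; the function name and blending form are assumptions.

```python
import torch

def composite_static_dynamic(rgb_s, sigma_s, rgb_d, sigma_d, blend, deltas):
    """rgb_s, rgb_d:     (N, 3) per-sample colors of the static / dynamic branches
    sigma_s, sigma_d: (N,)   densities          blend: (N,) weight toward dynamic
    deltas:           (N,)   distances between adjacent ray samples
    Blends the two branches per sample, then applies standard volume rendering."""
    sigma = blend * sigma_d + (1.0 - blend) * sigma_s
    rgb = blend[:, None] * rgb_d + (1.0 - blend)[:, None] * rgb_s
    alpha = 1.0 - torch.exp(-sigma * deltas)
    ones = torch.ones_like(alpha[:1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                       # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)    # composited pixel color
```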
Deploying reliable deep learning techniques in interdisciplinary applications needs learned models to output accurate and (even more importantly) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We have an opposite claim that explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in those neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network dubbed NeuroExplainer, with applications to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, our NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) in network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer leads to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
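As a toy stand-in for the explainability-oriented regularizers mentioned above, the sketch below penalizes the entropy of a subject-level attention distribution to encourage sparse, focused attention; the real NeuroExplainer constraints for fidelity, sparsity, and stability are derived from domain knowledge and are more specific, so this is only an assumed illustration.

```python
import torch

def attention_entropy_regularizer(attn, eps=1e-8):
    """attn: (B, N) attention weights over N cortical regions, summing to 1 per
    subject. Lower entropy means fewer, more focused peaks, i.e. a sparser
    explanation; add this term (suitably weighted) to the classification loss."""
    p = attn.clamp_min(eps)
    return -(p * p.log()).sum(dim=1).mean()
```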