In video person re-identification (Re-ID), the network must consistently extract the features of the target person across successive frames. Existing methods tend to focus only on how to exploit temporal information, which often leads to networks being fooled by similar appearances and identical backgrounds. In this paper, we propose a Disentanglement and Switching and Aggregation Network (DSANet), which separates identity-representing features from camera-characteristic features and pays more attention to the ID information. We also introduce an auxiliary task that uses a new pair of features created through switching and aggregation to increase the network's capability to handle various camera scenarios. Furthermore, we devise a Target Localization Module (TLM) that extracts features robust to changes in the target's position as the frames progress, and a Frame Weight Generation (FWG) module that reflects temporal information in the final representation. Various loss functions for disentanglement learning are designed so that each component of the network can cooperate while satisfactorily performing its own role. Quantitative and qualitative results from extensive experiments demonstrate the superiority of DSANet over state-of-the-art methods on three benchmark datasets.
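The temporal-weighting idea behind a frame weight generation step can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, and the per-frame scores are assumed to be given, whereas in practice they would come from a small learned head.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_frames(frame_feats, frame_scores):
    """Weight per-frame features by a quality score and sum them.

    frame_feats:  (T, C) one feature vector per frame
    frame_scores: (T,)   scalar score per frame (assumed given here)
    Returns a (C,) clip-level representation.
    """
    w = softmax(frame_scores)                  # weights sum to 1 over frames
    return (w[:, None] * frame_feats).sum(0)   # weighted temporal average
```

Frames judged uninformative (e.g., heavily occluded) receive near-zero weight, so the clip representation is dominated by reliable frames.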
Skeleton-based action recognition has attracted considerable attention due to the compact skeletal structure of the human body. Many recent methods have achieved remarkable performance using graph convolutional networks (GCNs) and convolutional neural networks (CNNs), which extract spatial and temporal features, respectively. Although spatial and temporal dependencies in the human skeleton have been explored, spatio-temporal dependency is rarely considered. In this paper, we propose the Inter-Frame Curve Network (IFC-Net) to effectively leverage the spatio-temporal dependency of the human skeleton. Our proposed network consists of two novel elements: 1) the Inter-Frame Curve (IFC) module; and 2) Dilated Graph Convolution (D-GC). The IFC module increases the spatio-temporal receptive field by identifying meaningful node connections between every pair of adjacent frames and generating spatio-temporal curves based on the identified connections. The D-GC allows the network to have a large spatial receptive field, focusing specifically on the spatial domain. The kernels of D-GC are computed from the given adjacency matrices of the graph and reflect a large receptive field in a manner similar to dilated CNNs. Our IFC-Net combines these two modules and achieves state-of-the-art performance on three skeleton-based action recognition benchmarks: NTU-RGB+D 60, NTU-RGB+D 120, and Northwestern-UCLA.
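One plausible reading of the dilated-CNN analogy for graph convolution is aggregating from nodes at graph distance exactly k, skipping nearer neighbors. The sketch below is our own illustration under that assumption, not the paper's kernel construction:

```python
import numpy as np

def exact_k_hop(A, k):
    """Adjacency of nodes at graph distance exactly k (self excluded).

    Like a dilated CNN kernel, the filter skips nearer neighbors and
    aggregates from farther ones, enlarging the receptive field.
    """
    n = A.shape[0]
    hop = np.linalg.matrix_power(np.eye(n) + A, k) > 0       # within k hops
    prev = np.linalg.matrix_power(np.eye(n) + A, k - 1) > 0  # within k-1 hops
    return (hop & ~prev).astype(float)

def dilated_graph_conv(X, A, W, k):
    """Mean-aggregate node features X over the exact-k-hop
    neighborhood, then project with weight matrix W."""
    Ak = exact_k_hop(A, k)
    deg = Ak.sum(1, keepdims=True)
    deg[deg == 0] = 1.0              # avoid division by zero for isolated rows
    return (Ak / deg) @ X @ W
```

On a skeleton graph, k = 2 would let an elbow joint aggregate directly from the shoulder-adjacent and wrist-adjacent joints without passing through intermediate hops.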
Occluded person re-identification (Re-ID) in images captured by multiple cameras is challenging because the target person is occluded by pedestrians or objects, especially in crowded scenes. In addition to the processes performed during holistic person Re-ID, occluded person Re-ID involves the removal of obstacles and the detection of partially visible body parts. Most existing methods utilize off-the-shelf pose or parsing networks as pseudo labels, which are prone to error. To address these issues, we propose a novel Occlusion Correction Network (OCNet) that corrects features through relational-weight learning and obtains diverse and representative features without using external networks. In addition, we present the simple concept of a center feature to provide an intuitive solution to pedestrian occlusion scenarios. Furthermore, we suggest a Separation Loss (SL) that encourages global features and part features to focus on different parts. We conduct extensive experiments on five challenging benchmark datasets for occluded and holistic Re-ID tasks to demonstrate that our method achieves performance superior to state-of-the-art methods, especially in occluded scenes.
Neural Radiance Fields (NeRF) have exhibited outstanding three-dimensional (3D) reconstruction quality via novel view synthesis from multi-view images and paired calibrated camera parameters. However, previous NeRF-based systems have been demonstrated under strictly controlled settings, with little attention paid to less ideal scenarios, including the presence of noise such as exposure, illumination changes, and blur. In particular, although blur frequently occurs in real situations, NeRF methods that can handle blurred images have received little attention. The few studies that have investigated NeRF for blurred images have not considered geometric and appearance consistency in 3D space, which is one of the most important factors in 3D reconstruction. This leads to inconsistency and degradation of the perceptual quality of the constructed scene. Hence, this paper proposes DP-NeRF, a novel clean NeRF framework for blurred images, which is constrained by two physical priors. These priors are derived from the actual blurring process during image acquisition by the camera. DP-NeRF proposes a rigid blurring kernel to impose 3D consistency using the physical priors, and an adaptive weight proposal to refine the color composition error in consideration of the relationship between depth and blur. We present extensive experimental results for synthetic and real scenes with two types of blur: camera motion blur and defocus blur. The results demonstrate that DP-NeRF successfully improves the perceptual quality of the constructed NeRF while ensuring 3D geometric and appearance consistency. We further demonstrate the effectiveness of our model with a comprehensive ablation analysis.
Unsupervised video object segmentation aims to segment a target object in a video without a ground-truth mask in the initial frame. This challenging task requires extracting features of the most salient common objects within a video sequence. The difficulty can be addressed by using motion information such as optical flow, but using only the information between adjacent frames results in poor connectivity between distant frames and poor performance. To solve this problem, we propose a novel prototype memory network architecture. The proposed model effectively extracts RGB and motion information by extracting superpixel-based component prototypes from the input RGB images and optical flow maps. In addition, the model scores the usefulness of the component prototypes in each frame based on a self-learning algorithm, adaptively stores the most useful prototypes in memory, and discards obsolete prototypes. We use the prototypes in the memory bank to predict the mask of the next query frame, which enhances the association between distant frames to aid accurate mask prediction. Our method is evaluated on three datasets, achieving state-of-the-art performance. We demonstrate the effectiveness of the proposed model through various ablation studies.
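The adaptive store-and-discard step described above can be sketched as a score-ranked memory bank. This is a minimal illustration under our own assumptions: the usefulness scores are learned in the paper but are treated as given here, and the function names are hypothetical.

```python
import numpy as np

def update_memory(protos, scores, new_protos, new_scores, capacity):
    """Merge new component prototypes into the bank, keep the top
    `capacity` by usefulness score, and discard the rest (the
    'obsolete' prototypes).

    protos, new_protos: (N, C) / (M, C) prototype feature vectors
    scores, new_scores: (N,)   / (M,)   usefulness scores
    """
    all_p = np.concatenate([protos, new_protos], axis=0)
    all_s = np.concatenate([scores, new_scores], axis=0)
    keep = np.argsort(-all_s)[:capacity]   # indices of highest-scoring prototypes
    return all_p[keep], all_s[keep]
```

A fixed capacity keeps memory cost bounded over long sequences while the score ranking preserves the prototypes most relevant to distant-frame matching.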
Unsupervised video object segmentation (VOS) aims to detect the most salient object in a video sequence at the pixel level. In unsupervised VOS, most state-of-the-art methods leverage motion cues obtained from optical flow maps, in addition to appearance cues, to exploit the property that salient objects usually have distinctive movement compared to the background. However, because they are overly dependent on motion cues, which may be unreliable in some cases, they cannot achieve stable prediction. To reduce this motion dependency of existing two-stream VOS methods, we propose a novel motion-as-option network that optionally utilizes motion cues. Additionally, to fully exploit the property of the proposed network that motion is not always required, we introduce a collaborative network learning strategy. On all public benchmark datasets, our proposed network delivers state-of-the-art performance with real-time inference speed.
Feature similarity matching, which transfers the information of the reference frame to the query frame, is a key component in semi-supervised video object segmentation. If surjective matching is adopted, background distractors can easily appear and degrade performance. Bijective matching mechanisms try to prevent this by restricting the amount of information transferred to the query frame, but they have two limitations: 1) surjective matching cannot be fully exploited because it is converted to bijective matching at test time; and 2) searching for the optimal hyperparameters requires manual tuning at test time. To overcome these limitations while ensuring reliable information transfer, we introduce an equalized matching mechanism. To prevent the reference frame information from being overly referenced, the potential contribution to the query frame is equalized by simply applying a softmax operation along the query dimension. On public benchmark datasets, our proposed approach achieves performance comparable to state-of-the-art methods.
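The equalization step can be sketched in a few lines of numpy. This is an illustrative reading, not the paper's code: we assume a pixel-wise similarity matrix between reference and query frames, apply the softmax along the query axis so that no single reference pixel dominates the transfer, and then renormalize per query pixel to aggregate features.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def equalized_matching(sim, ref_feats):
    """sim: (R, Q) similarity between R reference and Q query pixels.
    ref_feats: (R, C) reference features to transfer.

    Softmax along the query axis caps every reference pixel's total
    contribution at 1, suppressing background distractors; weights are
    then renormalized per query pixel before feature aggregation.
    """
    w = softmax(sim, axis=1)                       # each row sums to 1
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)  # per-query normalization
    return w.T @ ref_feats                         # (Q, C) transferred features
```

Unlike a surjective softmax over the reference axis, this formulation needs no test-time top-k hyperparameter, which matches the abstract's motivation of avoiding manual tuning.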
NIR-to-VIS face recognition identifies faces across two different domains by extracting domain-invariant features. However, it is a challenging problem due to the differing characteristics of the two domains and the lack of NIR face datasets. To reduce the domain discrepancy while using existing face recognition models, we propose a "Relation Module" that can simply be added to any face recognition model. Local features extracted from a face image contain information about each component of the face. Given the two different domain characteristics, using the relationships between local features is more domain-invariant than using the features as they are. In addition to these relationships, positional information, such as the distance from the lips to the chin or from eye to eye, also provides domain-invariant information. In our Relation Module, a relation layer implicitly captures these relationships, and a coordinates layer models the positional information. Furthermore, our proposed triplet loss with conditional margin reduces intra-class variation during training and leads to further improvement. Unlike general face recognition models, our additional module does not need to be pre-trained on a large-scale dataset; it is fine-tuned only on the CASIA NIR-VIS 2.0 database. With the proposed module, we achieve improvements of 14.81% in rank-1 accuracy and 15.47% in verification rate at 0.1% FAR compared to two baseline models.
The region proposal task is to generate a set of candidate regions that contain objects. In this task, the most important goal is to propose as many ground-truth candidates as possible within a fixed number of proposals. In a typical image, however, there are too few hard negative examples compared with the vast number of easy negatives, so a region proposal network struggles to learn from hard negatives. Because of this problem, the network tends to propose hard negatives as candidates while failing to propose ground-truth candidates, which leads to poor performance. In this paper, we propose a Negative Region Proposal Network (NRPN) to improve the Region Proposal Network (RPN). The NRPN learns from the RPN's false positives and provides hard negative examples to the RPN. Our proposed NRPN reduces false positives and improves RPN performance. An RPN trained with the NRPN achieves improved performance on the PASCAL VOC 2007 dataset.
RGB-D salient object detection (SOD) has recently attracted attention because it is an important preprocessing operation for various vision tasks. However, despite advances in deep learning-based methods, RGB-D SOD remains challenging due to the large domain gap between RGB images and depth maps, as well as low-quality depth maps. To solve this problem, we propose a novel Superpixel Prototype Sampling Network (SPSN) architecture. The proposed model splits the input RGB image and depth map into component superpixels to generate component prototypes. We design a prototype sampling network so that the network samples only the prototypes corresponding to salient objects. In addition, we propose a reliance selection module that recognizes the quality of each RGB and depth feature map and adaptively weights them in proportion to their reliability. The proposed method makes the model robust to inconsistencies between RGB images and depth maps and eliminates the influence of non-salient objects. Our method is evaluated on five popular datasets, achieving state-of-the-art performance. We demonstrate the effectiveness of the proposed method through comparative experiments.
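The reliability-proportional weighting can be sketched as a simple gated fusion. This is illustrative only: in the paper the reliability would be predicted by a learned reliance selection module, whereas here the two scores are assumed to be given scalars.

```python
import numpy as np

def fuse_by_reliance(f_rgb, f_depth, r_rgb, r_depth):
    """Blend RGB and depth feature maps in proportion to their
    predicted reliability, so a low-quality depth map is down-weighted.

    f_rgb, f_depth: feature maps of identical shape
    r_rgb, r_depth: scalar reliability scores (assumed given here)
    """
    e = np.exp([r_rgb, r_depth])
    g_rgb, g_depth = e / e.sum()           # softmax gates, sum to 1
    return g_rgb * f_rgb + g_depth * f_depth
```

When the depth score is much lower than the RGB score, the gate drives the depth contribution toward zero, which mirrors the abstract's goal of tolerating inconsistent or noisy depth maps.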