我们解决了人搜索的任务,即从一组原始场景图像中进行本地化和重新识别查询人员。最近的方法通常是基于Oimnet(在人搜索上的先驱工作)建立的,该作品学习了执行检测和人重新识别(REID)任务的联合人物代表。为了获得表示形式,它们从行人提案中提取特征,然后将其投射到具有L2归一化的单位超晶体上。这些方法还结合了所有积极的建议,这些建议与地面真理充分重叠,同样可以学习REID的人代表。我们发现1)L2归一化而不考虑特征分布会退化人的判别能力,而2)正面建议通常也描绘了背景混乱和人的重叠,这可能会将嘈杂的特征编码为人的表示。在本文中,我们介绍了解决上述局限性的Oimnet ++。为此,我们引入了一个新颖的归一化层,称为Protonorm,该层校准了行人建议的特征,同时考虑了人ID的长尾分布,使L2归一化的人表示具有歧视性。我们还提出了一种本地化感知的特征学习计划,该方案鼓励更好地调整的建议在学习歧视性表示方面做出更多的贡献。对标准人员搜索基准的实验结果和分析证明了Oimnet ++的有效性。
translated by 谷歌翻译
很少有细粒度的分类和人搜索作为独特的任务和文学作品,已经分别对待了它们。但是,仔细观察揭示了重要的相似之处:这两个任务的目标类别只能由特定的对象细节歧视;相关模型应概括为新类别,而在培训期间看不到。我们提出了一个适用于这两个任务的新型统一查询引导网络(QGN)。QGN由一个查询引导的暹罗引文和兴奋子网组成,该子网还重新进行了所有网络层的查询和画廊功能,一个查询实习的区域建议特定于特定于特定的本地化以及查询指导的相似性子网络子网本网络用于公制学习。QGN在最近的一些少数细颗粒数据集上有所改善,在幼崽上的其他技术优于大幅度。QGN还对人搜索Cuhk-Sysu和PRW数据集进行了竞争性执行,我们在其中进行了深入的分析。
translated by 谷歌翻译
人员搜索旨在同时本地化和识别从现实,无折叠图像的查询人员。为了实现这一目标,最先进的模型通常在两级探测器上添加重新ID分支,如更快的R-CNN。由于ROI对准操作,该管道产生了有希望的准确性,因为重新ID特征与相应的对象区域明确对齐,但在此同时,由于致密物体锚,它引入了高计算开销。在这项工作中,我们通过引入以下专用设计,提出了一种无限制的方法来有效地解决这一具有挑战性的任务。首先,我们选择一个无锚的探测器(即,FCO)作为我们框架的原型。由于缺乏致密物体锚,与现有人搜索模型相比,它表现出明显更高的效率。其次,当直接容纳这种免费探测器的人搜索时,在学习强大的RE-ID功能方面存在几种主要挑战,我们将其总结为不同级别的未对准问题(即规模,区域和任务)。为了解决这些问题,我们提出了一个对齐的特征聚合模块来生成更辨别性和强大的功能嵌入。因此,我们将我们的模型命名为特征对齐的人搜索网络(SimblePs)。第三,通过调查基于锚和无锚模型的优点,我们进一步增强了带有ROI对齐头的对比,这显着提高了重新ID功能的鲁棒性,同时仍然保持模型高效。在两个具有挑战性的基准(即Cuhk-Sysu和PRW)上进行的广泛实验表明,我们的框架实现了最先进的或竞争性能,同时呈现更高的效率。所有源代码,数据和培训的型号可用于:https://github.com/daodaofr/alignps。
translated by 谷歌翻译
Person Search aims to simultaneously localize and recognize a target person from realistic and uncropped gallery images. One major challenge of person search comes from the contradictory goals of the two sub-tasks, i.e., person detection focuses on finding the commonness of all persons so as to distinguish persons from the background, while person re-identification (re-ID) focuses on the differences among different persons. In this paper, we propose a novel Sequential Transformer (SeqTR) for end-to-end person search to deal with this challenge. Our SeqTR contains a detection transformer and a novel re-ID transformer that sequentially addresses detection and re-ID tasks. The re-ID transformer comprises the self-attention layer that utilizes contextual information and the cross-attention layer that learns local fine-grained discriminative features of the human body. Moreover, the re-ID transformer is shared and supervised by multi-scale features to improve the robustness of learned person representations. Extensive experiments on two widely-used person search benchmarks, CUHK-SYSU and PRW, show that our proposed SeqTR not only outperforms all existing person search methods with a 59.3% mAP on PRW but also achieves comparable performance to the state-of-the-art results with an mAP of 94.8% on CUHK-SYSU.
translated by 谷歌翻译
人搜索是一项具有挑战性的任务,旨在实现共同的行人检测和人重新识别(REID)。以前的作品在完全和弱监督的设置下取得了重大进步。但是,现有方法忽略了人搜索模型的概括能力。在本文中,我们采取了进一步的步骤和现在的域自适应人员搜索(DAPS),该搜索旨在将模型从标记的源域概括为未标记的目标域。在这种新环境下出现了两个主要挑战:一个是如何同时解决检测和重新ID任务的域未对准问题,另一个是如何在目标域上训练REID子任务而不可靠的检测结果。为了应对这些挑战,我们提出了一个强大的基线框架,并使用两个专用设计。 1)我们设计一个域对齐模块,包括图像级和任务敏感的实例级别对齐,以最大程度地减少域差异。 2)我们通过动态聚类策略充分利用未标记的数据,并使用伪边界框来支持目标域上的REID和检测训练。通过上述设计,我们的框架在MAP中获得了34.7%的地图,而PRW数据集的TOP-1则达到80.6%,超过了直接转移基线的大幅度。令人惊讶的是,我们无监督的DAPS模型的性能甚至超过了一些完全和弱监督的方法。该代码可在https://github.com/caposerenity/daps上找到。
translated by 谷歌翻译
人员搜索是人重新识别(RE-ID)的扩展任务。但是,大多数现有的一步人搜索工作尚未研究如何使用现有的高级RE-ID模型来提高由于人员检测和重新ID的集成而促进了一步人搜索性能。为了解决这个问题,我们提出了更快,更强大的一步人搜索框架,教师导师的解解网络(TDN),使单步搜索享受现有的重新ID研究的优点。所提出的TDN可以通过将高级人的RE-ID知识转移到人员搜索模型来显着提高人员搜索性能。在提议的TDN中,为了从重新ID教师模型到单步搜索模型的更好的知识转移,我们通过部分解除两个子任务来设计一个强大的一步人搜索基础框架。此外,我们提出了一种知识转移桥模块,以弥合在重新ID模型和一步人搜索模型之间不同的输入格式引起的比例差距。在测试期间,我们进一步提出了与上下文人员战略的排名来利用全景图像中的上下文信息以便更好地检索。两个公共人员搜索数据集的实验证明了该方法的有利性能。
translated by 谷歌翻译
Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for FOUR different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.
translated by 谷歌翻译
Recent years witnessed the breakthrough of face recognition with deep convolutional neural networks. Dozens of papers in the field of FR are published every year. Some of them were applied in the industrial community and played an important role in human life such as device unlock, mobile payment, and so on. This paper provides an introduction to face recognition, including its history, pipeline, algorithms based on conventional manually designed features or deep learning, mainstream training, evaluation datasets, and related applications. We have analyzed and compared state-of-the-art works as many as possible, and also carefully designed a set of experiments to find the effect of backbone size and data distribution. This survey is a material of the tutorial named The Practical Face Recognition Technology in the Industrial World in the FG2023.
translated by 谷歌翻译
从图像中学习代表,健壮和歧视性信息对于有效的人重新识别(RE-ID)至关重要。在本文中,我们提出了一种基于身体和手部图像的人重新ID的端到端判别深度学习的复合方法。我们仔细设计了本地感知的全球注意力网络(Laga-Net),这是一个多分支深度网络架构,由一个用于空间注意力的分支组成,一个用于渠道注意。注意分支集中在图像的相关特征上,同时抑制了无关紧要的背景。为了克服注意力机制的弱点,与像素改组一样,我们将相对位置编码整合到空间注意模块中以捕获像素的空间位置。全球分支机构打算保留全球环境或结构信息。对于打算捕获细粒度信息的本地分支,我们进行统一的分区以水平在Conv-Layer上生成条纹。我们通过执行软分区来检索零件,而无需明确分区图像或需要外部线索,例如姿势估计。一组消融研究表明,每个组件都会有助于提高拉加网络的性能。对四个受欢迎的人体重新ID基准和两个公开可用的手数据集的广泛评估表明,我们的建议方法始终优于现有的最新方法。
translated by 谷歌翻译
人员搜索旨在共同本地化和识别来自自然的查询人员,不可用的图像,这在过去几年中在计算机视觉社区中积极研究了这一图像。在本文中,我们将在全球和本地围绕目标人群的丰富的上下文信息中阐述,我们分别指的是场景和组上下文。与以前的作品单独处理这两种类型的作品,我们将它们利用统一的全球本地上下文网络(GLCNet),其具有直观的功能增强。具体地,以多级方式同时增强重新ID嵌入和上下文特征,最终导致人员搜索增强,辨别特征。我们对两个人搜索基准(即Cuhk-Sysu和PRW)进行实验,并将我们的方法扩展到更具有挑战性的环境(即,在MovieIenet上的字符搜索)。广泛的实验结果表明,在三个数据集上的最先进方法中提出的GLCNET的一致性改进。我们的源代码,预先训练的型号,以及字符搜索的新设置可以:https://github.com/zhengpeng7/llcnet。
translated by 谷歌翻译
Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential over conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in realtime. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections and Pose Aware Identity Embedding for jointly pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to localize whole-body keypoints accurately and tracks humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
translated by 谷歌翻译
对无监督对象发现的现有方法(UOD)不会向大大扩展到大型数据集,而不会损害其性能的近似。我们提出了一种新颖的UOD作为排名问题的制定,适用于可用于特征值问题和链接分析的分布式方法的阿森纳。通过使用自我监督功能,我们还展示了UOD的第一个有效的完全无监督的管道。对Coco和OpenImages的广泛实验表明,在每个图像中寻求单个突出对象的单对象发现设置中,所提出的LOD(大规模对象发现)方法与之相当于或更好地中型数据集的艺术(最多120K图像),比能够缩放到1.7M图像的唯一其他算法超过37%。在每个图像中寻求多个对象的多对象发现设置中,所提出的LOD平均精度(AP)比所有其他用于从20K到1.7M图像的数据的方法更好。使用自我监督功能,我们还表明该方法在OpenImages上获得最先进的UOD性能。我们的代码在HTTPS://github.com/huyvvo/lod上公开提供。
translated by 谷歌翻译
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
translated by 谷歌翻译
人员搜索是一个有关的任务,旨在共同解决人员检测和人员重新识别(RE-ID)。虽然最先前的方法侧重于学习稳健的个人功能,但由于照明,大构成方差和遮挡,仍然很难区分令人困惑的人。上下文信息实际上是人们搜索任务,这些任务在减少混淆方面搜索。为此,我们提出了一个名为注意上下文感知嵌入(ACAE)的新颖的上下文特征头,这增强了上下文信息。 Acae反复审查图像内部和图像内的该人员,以查找类似的行人模式,允许它隐含地学会找到可能的共同旅行者和有效地模范上下文相关的实例的关系。此外,我们提出了图像记忆库来提高培训效率。实验上,ACAE在基于不同的一步法时显示出广泛的促销。我们的整体方法实现了最先进的结果与先前的一步法。
translated by 谷歌翻译
人搜索是多个子任务的集成任务,例如前景/背景分类,边界框回归和人员重新识别。因此,人搜索是一个典型的多任务学习问题,尤其是在以端到端方式解决时。最近,一些作品通过利用各种辅助信息,例如人关节关键点,身体部位位置,属性等,这带来了更多的任务并使人搜索模型更加复杂。每个任务的不一致的趋同率可能会损害模型优化。一个直接的解决方案是手动为不同的任务分配不同的权重,以补偿各种融合率。但是,鉴于人搜索的特殊情况,即有大量任务,手动加权任务是不切实际的。为此,我们提出了一种分组的自适应减肥方法(GALW)方法,该方法会自动和动态地调整每个任务的权重。具体而言,我们根据其收敛率对任务进行分组。同一组中的任务共享相同的可学习权重,这是通过考虑损失不确定性动态分配的。对两个典型基准(Cuhk-Sysu and Prw)的实验结果证明了我们方法的有效性。
translated by 谷歌翻译
人员搜索统一人员检测和人重新识别(重新ID),以从全景画廊图像找到查询人员。一个主要挑战来自于不平衡的长尾人身份分布,这可以防止一步人搜索模型学习歧视性人员特征,以获得最终重新识别。但是,探索了如何解决一步人员搜索的重型不平衡的身份分布。设计用于长尾分类任务的技术,例如,图像级重新采样策略很难被有效地应用于与基于检测的多个多个多的人检测和重新ID子任务共同解决人员检测和重新ID子任务 - 框架框架。为了解决这个问题,我们提出了一个子任务主导的传输学习(STL)方法。 STL方法解决了主导的重新ID子批次的预测阶段的长尾问题,并通过转移学习来改善普试模型的一步人搜索。我们进一步设计了一个多级ROI融合池层,以提高一步人搜索的人特征的辨别能力。 Cuhk-Sysu和Prw Datasets的广泛实验证明了该方法的优越性和有效性。
translated by 谷歌翻译
零拍摄对象检测(ZSD),将传统检测模型扩展到检测来自Unseen类别的对象的任务,已成为计算机视觉中的新挑战。大多数现有方法通过严格的映射传输策略来解决ZSD任务,这可能导致次优ZSD结果:1)这些模型的学习过程忽略了可用的看不见的类信息,因此可以轻松地偏向所看到的类别; 2)原始视觉特征空间并不合适,缺乏歧视信息。为解决这些问题,我们开发了一种用于ZSD的新型语义引导的对比网络,命名为Contrastzsd,一种检测框架首先将对比学习机制带入零拍摄检测的领域。特别地,对比度包括两个语义导向的对比学学习子网,其分别与区域类别和区域区域对之间形成对比。成对对比度任务利用从地面真理标签和预定义的类相似性分布派生的附加监督信号。在那些明确的语义监督的指导下,模型可以了解更多关于看不见的类别的知识,以避免看到概念的偏见问题,同时优化视觉功能的数据结构,以更好地辨别更好的视觉语义对齐。广泛的实验是在ZSD,即Pascal VOC和MS Coco的两个流行基准上进行的。结果表明,我们的方法优于ZSD和广义ZSD任务的先前最先进的。
translated by 谷歌翻译
人重新识别(Reid)旨在从不同摄像机捕获的图像中检索一个人。对于基于深度学习的REID方法,已经证明,使用本地特征与人物图像的全局特征可以帮助为人员检索提供强大的特征表示。人类的姿势信息可以提供人体骨架的位置,有效地指导网络在这些关键领域更加关注这些关键领域,也可能有助于减少来自背景或闭塞的噪音分散。然而,先前与姿势相关的作品提出的方法可能无法充分利用姿势信息的好处,并没有考虑不同当地特征的不同贡献。在本文中,我们提出了一种姿势引导图注意网络,一个多分支架构,包括一个用于全局特征的一个分支,一个用于中粒体特征的一个分支,一个分支用于细粒度关键点特征。我们使用预先训练的姿势估计器来生成本地特征学习的关键点热图,并仔细设计图表卷积层以通过建模相似关系来重新评估提取的本地特征的贡献权重。实验结果表明我们对歧视特征学习的方法的有效性,我们表明我们的模型在几个主流评估数据集上实现了最先进的表演。我们还对我们的网络进行了大量的消融研究和设计不同类型的比较实验,以证明其有效性和鲁棒性,包括整体数据集,部分数据集,遮挡数据集和跨域测试。
translated by 谷歌翻译
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
translated by 谷歌翻译
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
translated by 谷歌翻译