We consider the problem of improving the quality of human instance segmentation masks for a given test image using keypoint estimation. We compare two alternative approaches. The first approach is a test-time adaptation (TTA) method, where we allow test-time modification of the segmentation network's weights using a single unlabeled test image. In this approach, we do not assume test-time access to the labeled source dataset. More specifically, our TTA method consists of using the keypoint estimates as pseudo labels and backpropagating them to adjust the backbone weights. The second approach is a training-time generalization (TTG) method, where we permit offline access to the labeled source dataset but not the test-time modification of weights. Furthermore, we do not assume the availability of any images from or knowledge about the target domain. Our TTG method consists of augmenting the backbone features with those generated by the keypoints head and feeding the aggregate vector to the mask head. Through a comprehensive set of ablations, we evaluate both approaches and identify several factors limiting the TTA gains. In particular, we show that in the absence of a significant domain shift, TTA may hurt and TTG shows only a small gain in performance, whereas for a large domain shift, TTA gains are smaller and dependent on the heuristics used, while TTG gains are larger and robust to architectural choices.
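The TTG idea described above (concatenating backbone features with keypoint-head features and feeding the result to the mask head) can be illustrated with a toy example. This is only a sketch under stated assumptions: the names `concat_features` and `mask_head`, the linear-plus-sigmoid head, and all numbers are hypothetical and are not taken from the paper.

```python
import math


def concat_features(backbone_feat, keypoint_feat):
    """Concatenate per-location feature vectors along the channel axis.

    Both inputs are lists with one feature vector (a list of floats) per
    spatial location. Hypothetical helper, not the authors' code.
    """
    if len(backbone_feat) != len(keypoint_feat):
        raise ValueError("feature maps must cover the same number of locations")
    return [b + k for b, k in zip(backbone_feat, keypoint_feat)]


def mask_head(features, weights, bias=0.0):
    """Toy mask head: a linear layer followed by a sigmoid per location."""
    probs = []
    for vec in features:
        logit = sum(w * v for w, v in zip(weights, vec)) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Two spatial locations, two backbone channels, one keypoint-head channel.
backbone = [[0.2, -0.1], [0.5, 0.3]]
keypoints = [[1.0], [0.0]]

fused = concat_features(backbone, keypoints)        # three channels per location
probs = mask_head(fused, weights=[1.0, 1.0, 2.0])   # assumed, untrained weights
```

In this toy setup the location with a strong keypoint response receives the higher mask probability, which is the kind of effect the aggregate vector is meant to enable. A real mask head would be convolutional and trained on the labeled source dataset.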
Translated by Google Translate
Mask R-CNN
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.
Bottom-up human pose estimation methods have difficulties in predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet: a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to solve the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small persons. The feature pyramid in HigherHRNet consists of feature map outputs from HRNet and upsampled higher-resolution outputs through a transposed convolution. HigherHRNet outperforms the previous best bottom-up method by 2.5% AP for medium persons on COCO test-dev, showing its effectiveness in handling scale variation. Furthermore, HigherHRNet achieves a new state-of-the-art result on COCO test-dev (70.5% AP) without using refinement or other post-processing techniques, surpassing all existing bottom-up methods. HigherHRNet even surpasses all top-down methods on CrowdPose test (67.6% AP), suggesting its robustness in crowded scenes. The code and models are available at https://github.com/HRNet/Higher-HRNet-Human-Pose-Estimation.
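We present a conceptually simple, flexible, and universal visual perception head for a variety of visual tasks, e.g., classification, object detection, instance segmentation, and pose estimation, and for different frameworks, such as one-stage or two-stage pipelines. Our approach effectively identifies an object in an image while simultaneously generating a high-quality bounding box, a contour-based segmentation mask, or a set of keypoints. The method, called UniHead, views different visual perception tasks as dispersible points learned via a transformer encoder architecture. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations. It directly outputs the final set of predictions in the form of multiple points, allowing us to perform different visual tasks in different frameworks with the same head design. We show extensive evaluations on ImageNet classification and on all three tracks of the COCO suite of challenges, including object detection, instance segmentation, and pose estimation. Without bells and whistles, UniHead can unify these visual tasks via a single visual head design and achieve performance comparable to expert models developed for each task. We hope our simple and universal UniHead will serve as a solid baseline and help promote universal visual perception research. Code and models are available at https://github.com/sense-x/unihead.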
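Fine-grained localization of clinicians in the operating room (OR) is a key component in designing a new generation of OR support systems. Computer vision models for person pixel-based segmentation and body keypoint detection are needed to better understand the clinical activities and the spatial layout of the OR. This is challenging, not only because OR images differ greatly from traditional vision datasets, but also because data and annotations are hard to collect and generate in the OR due to privacy concerns. To address these concerns, we first study how pose estimation and instance segmentation can be performed on low-resolution images with downsampling factors from 1x to 12x. Second, to address the domain shift and the lack of annotations, we propose a novel unsupervised domain adaptation method, called AdaptOR, to adapt a model from an in-the-wild labeled source domain to a statistically different unlabeled target domain. We propose to exploit explicit geometric constraints on different augmentations of the unlabeled target-domain image to generate accurate pseudo labels, and to use these pseudo labels to train the model on high- and low-resolution OR images in a self-training framework. Furthermore, we propose disentangled feature normalization to handle the statistically different source and target domain data. Extensive experimental results with detailed ablation studies on two OR datasets, MVOR+ and TUM-OR-test, show the effectiveness of our approach against strongly constructed baselines, especially on low-resolution, privacy-preserving OR images. Finally, we show the generality of our method as a semi-supervised learning (SSL) method on the large-scale COCO dataset, where we achieve comparable results with as little as 1% of labeled supervision against a model trained with 100% labeled supervision.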
Multi-person pose estimation has improved considerably in recent years, especially with the development of convolutional neural networks. However, there still exist many challenging cases, such as occluded keypoints, invisible keypoints, and complex backgrounds, which cannot be well addressed. In this paper, we present a novel network structure called Cascaded Pyramid Network (CPN), which aims to relieve the problem posed by these "hard" keypoints. More specifically, our algorithm includes two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network which can successfully localize the "simple" keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints. Our RefineNet tries to explicitly handle the "hard" keypoints by integrating all levels of feature representations from the GlobalNet together with an online hard keypoint mining loss. In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by our CPN for keypoint localization in each human bounding box. Based on the proposed algorithm, we achieve state-of-the-art results on the COCO keypoint benchmark, with an average precision of 73.0 on the COCO test-dev dataset and 72.1 on the COCO test-challenge dataset, which is a 19% relative improvement compared with 60.5 from the COCO 2016 keypoint challenge. Code and the detection results are publicly available for further research.
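The online hard keypoint mining loss mentioned above can be sketched as follows: compute a loss per keypoint, then average only over the hardest ones. This is a minimal illustration under assumptions, not the CPN implementation. The function name `ohkm_loss`, the `top_k` parameter, and the use of squared coordinate error (rather than a heatmap loss) are simplifications introduced here.

```python
def ohkm_loss(pred, target, top_k):
    """Average the squared error over only the top_k hardest keypoints.

    pred and target are equal-length lists of (x, y) keypoint coordinates.
    Hypothetical helper; working on coordinates instead of heatmaps is an
    assumption made to keep the sketch short.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of keypoints")
    if not 1 <= top_k <= len(pred):
        raise ValueError("top_k must be between 1 and the number of keypoints")
    # One loss value per keypoint.
    per_keypoint = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred, target)
    ]
    # Keep only the largest losses, i.e. the "hard" keypoints.
    hardest = sorted(per_keypoint, reverse=True)[:top_k]
    return sum(hardest) / top_k


pred = [(0, 0), (1, 1), (5, 5)]
target = [(0, 0), (1, 2), (2, 1)]
loss_hard = ohkm_loss(pred, target, top_k=2)   # ignores the perfectly predicted keypoint
loss_all = ohkm_loss(pred, target, top_k=3)    # plain mean over all keypoints
```

With `top_k` smaller than the number of keypoints, keypoints that are already well localized contribute nothing, so the training signal concentrates on the occluded or otherwise difficult ones.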
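We propose an end-to-end trainable approach for multi-instance pose estimation, called POET (POse Estimation Transformer). Combining a convolutional neural network with a transformer encoder-decoder architecture, we formulate multi-instance pose estimation from images as a direct set prediction problem. Our model is able to directly regress the poses of all individuals, using a bipartite matching scheme. POET is trained using a set-based global loss that consists of a keypoint loss, a visibility loss, and a class loss. POET reasons about the relations between multiple detected individuals and the full image context to directly predict their poses in parallel. We show that POET achieves high accuracy on the COCO keypoint detection task while having fewer parameters and higher inference speed than other bottom-up and top-down approaches. Moreover, we show successful transfer learning when applying POET to animal pose estimation. To the best of our knowledge, this model is the first end-to-end trainable multi-instance pose estimation method, and we hope it will serve as a simple and promising alternative.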
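The bipartite matching step used in set prediction can be illustrated with a small example that pairs predicted poses with ground-truth poses so that the total cost is minimal. This is a sketch under assumptions: set-prediction models typically use the Hungarian algorithm, whereas the code below swaps in a brute-force search over permutations, which is only practical for a handful of poses. The names `pose_cost` and `match_poses` and the L1 keypoint cost are hypothetical.

```python
from itertools import permutations


def pose_cost(pred_pose, gt_pose):
    """L1 distance between two poses given as lists of (x, y) keypoints."""
    return sum(
        abs(px - gx) + abs(py - gy)
        for (px, py), (gx, gy) in zip(pred_pose, gt_pose)
    )


def match_poses(pred_poses, gt_poses):
    """Find the one-to-one assignment of predictions to ground truth with minimal cost.

    Returns (assignment, cost), where assignment[g] is the index of the
    prediction matched to ground-truth pose g. Brute force, so only
    suitable for very small sets.
    """
    if len(pred_poses) < len(gt_poses):
        raise ValueError("need at least as many predictions as ground-truth poses")
    best_assignment, best_cost = (), float("inf")
    for perm in permutations(range(len(pred_poses)), len(gt_poses)):
        cost = sum(pose_cost(pred_poses[p], gt_poses[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Three predicted poses and two ground-truth poses, two keypoints each.
preds = [[(10, 10), (12, 12)], [(0, 0), (1, 1)], [(50, 50), (51, 51)]]
gts = [[(0, 1), (1, 1)], [(10, 11), (12, 12)]]
assignment, total_cost = match_poses(preds, gts)
```

Here the first ground-truth pose is matched to prediction 1 and the second to prediction 0, while prediction 2 stays unmatched, as an extra "no object" prediction would in a set-based loss.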
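This paper presents a new method for keypoint detection and instance association using a Transformer. Bottom-up multi-person pose estimation models need to detect keypoints and learn associative information between keypoints. We argue that these problems can be entirely solved by a Transformer. Specifically, the self-attention in a Transformer measures dependencies between any pair of locations, which can provide association information for keypoint grouping. However, naive attention patterns are still not subjectively controlled, so there is no guarantee that keypoints will always attend to the instances to which they belong. To address this, we propose a novel approach of supervising self-attention for multi-person keypoint detection and instance association. By using instance masks to supervise self-attention to be instance-aware, we can assign the detected keypoints to their corresponding instances based on pairwise attention scores, without using predefined offset vector fields or embeddings as in CNN-based bottom-up models. An additional benefit of our method is that instance segmentation results for any number of people can be obtained directly from the supervised attention matrix, thereby simplifying the pixel assignment pipeline. Experiments on the COCO multi-person keypoint detection challenge and the person instance segmentation task demonstrate the effectiveness and simplicity of the proposed method and show a promising way to control self-attention behavior for specific purposes.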
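Segmenting highly overlapping image objects is challenging, because there is typically no distinction between real object contours and occlusion boundaries in images. Unlike previous instance segmentation methods, we model image formation as a composition of two overlapping layers, and propose the Bilayer Convolutional Network (BCNet), where the top layer detects occluding objects (occluders) and the bottom layer infers partially occluded instances (occludees). The explicit modeling of the occlusion relationship with a bilayer structure naturally decouples the boundaries of the occluding and occluded instances, and considers the interaction between them during mask regression. We investigate the efficacy of the bilayer structure using two popular convolutional network designs, namely the Fully Convolutional Network (FCN) and the Graph Convolutional Network (GCN). Furthermore, we formulate bilayer decoupling using the vision transformer (ViT), by representing instances in the image as separate learnable occluder and occludee queries. Experiments with one/two-stage and query-based object detectors, using various backbones and network layer choices, validate the generalization ability of bilayer decoupling, as shown on image instance segmentation benchmarks (COCO, KINS, COCOA) and video instance segmentation benchmarks (YTVIS, OVIS, BDD100K MOTS), especially for heavy occlusion cases. Code and data are available at https://github.com/lkeab/bcnet.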
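We observe that human poses exhibit strong group-wise structural correlation and spatial coupling due to the biological constraints of different body parts. This group-wise structural correlation can be explored to improve the accuracy and robustness of human pose estimation. In this work, we develop a self-constrained prediction-verification network to characterize and learn the structural correlation between keypoints during training. During the inference stage, the feedback information from the verification network allows us to further optimize the pose prediction, which significantly improves the performance of human pose estimation. Specifically, we partition the keypoints into groups according to the biological structure of the human body. Within each group, the keypoints are further partitioned into two subsets: high-confidence base keypoints and low-confidence terminal keypoints. We develop a self-constrained prediction-verification network to perform forward and backward predictions between these keypoint subsets. One fundamental challenge in pose estimation, as well as in generic prediction tasks, is that there is no mechanism for us to verify whether the obtained pose estimation or prediction results are accurate, since the ground truth is not available. Once successfully learned, the verification network serves as an accuracy verification module for the forward pose prediction. During the inference stage, it can be used to guide the local optimization of the pose estimation results of low-confidence keypoints, with the self-constrained loss on high-confidence keypoints as the objective function. Our extensive experimental results on the benchmark MS COCO and CrowdPose datasets demonstrate that the proposed method can significantly improve the pose estimation results.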
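We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose estimation datasets, demonstrate that our method significantly improves upon the state of the art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.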
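Automatic nuclei segmentation and classification plays a vital role in digital pathology. However, previous works are mostly built on data with limited diversity and small sizes, making the results questionable or misleading in actual downstream tasks. In this paper, we aim to build a reliable and robust method capable of dealing with data from the "clinical wild". Specifically, we study and design a new method to simultaneously detect, segment, and classify nuclei from Haematoxylin and Eosin (H&E) stained histopathology data, and evaluate our approach using the recent largest dataset: PanNuke. We address the detection and classification of each nucleus as a novel semantic keypoint estimation problem to determine the center point of each nucleus. Next, the corresponding class-agnostic masks for the nuclei center points are obtained using dynamic instance segmentation. By decoupling two simultaneous challenging tasks, our method can benefit from class-aware detection and class-agnostic segmentation, leading to a significant performance boost. We demonstrate the superior performance of our proposed approach for nuclei segmentation and classification across 19 different tissue types, delivering new benchmark results.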
In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models are publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.
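This paper presents a unified framework to (i) locate the ball, (ii) predict the pose, and (iii) segment the instance mask of players in team sports scenes. These problems are of high interest in automated sports analytics, production, and broadcast. A common practice is to solve each problem individually by exploiting universal state-of-the-art models, e.g., Panoptic-DeepLab for player segmentation. In addition to the increased complexity resulting from the multiplication of single-task models, the use of off-the-shelf models also impedes performance due to the complexity and specificity of team sports scenes, such as strong occlusion and motion blur. To circumvent these limitations, our paper proposes to train a single model that simultaneously predicts the ball and the player mask and pose by combining the part intensity fields and the spatial embeddings principles. Part intensity fields provide the ball and player locations, as well as the player joint locations. Spatial embeddings are then exploited to associate player instance pixels with their respective player center, and also to group player joints into skeletons. We demonstrate the effectiveness of the proposed model on the DeepSport basketball dataset, achieving performance comparable to the state-of-the-art models that address each individual task separately.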
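In this paper, we propose a simple attention mechanism, which we call Box-Attention. It enables spatial interaction between grid features, as sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. BoxeR computes attention weights on these boxes by considering their grid structure. Notably, BoxeR-2D naturally reasons about box information alongside content information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye view plane for 3D end-to-end object detection. Our experiments demonstrate that the proposed BoxeR-2D achieves better results on COCO detection and reaches performance comparable to the well-established and highly optimized Mask R-CNN on COCO instance segmentation. BoxeR-3D already obtains compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. The code will be released.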
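This paper presents HoughNet, a one-stage, anchor-free, voting-based, bottom-up object detection method. Inspired by the Generalized Hough Transform, HoughNet determines the presence of an object at a certain location by the sum of the votes cast on that location. Votes are collected from both near and long-distance locations based on a log-polar vote field. Thanks to this voting mechanism, HoughNet is able to integrate both near and long-range, class-conditional evidence for visual recognition, thereby generalizing and enhancing current object detection methodology, which typically relies on only local evidence. On the COCO dataset, HoughNet's best model achieves 46.4 $AP$ (and 65.1 $AP_{50}$), performing on par with the state of the art in bottom-up object detection and outperforming most major one-stage and two-stage methods. We further validate the effectiveness of our proposal in other visual detection tasks, namely video object detection, instance segmentation, 3D object detection, and keypoint detection for human pose estimation, as well as an additional "labels to photo" image generation task, where the integration of our voting module consistently improves performance in all cases. Code is available at https://github.com/nerminsamet/houghnet.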
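The voting idea, where the score at a location is the sum of the votes cast on it from other locations, can be sketched on a tiny grid. This is a heavily simplified illustration and not HoughNet's implementation: the log-polar vote field is replaced by an assumed dictionary of fixed offsets and weights, there is a single evidence map instead of per-class, per-region maps, and the name `accumulate_votes` is hypothetical.

```python
def accumulate_votes(evidence, vote_field):
    """Sum weighted votes at every grid location.

    evidence is a 2D list of scores. vote_field maps an offset (dy, dx) to
    a weight: location (y, x) receives weight * evidence[y + dy][x + dx]
    from each offset that falls inside the grid.
    """
    height, width = len(evidence), len(evidence[0])
    votes = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for (dy, dx), weight in vote_field.items():
                sy, sx = y + dy, x + dx
                if 0 <= sy < height and 0 <= sx < width:
                    votes[y][x] += weight * evidence[sy][sx]
    return votes


# Evidence sits around the center, not on it.
evidence = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Assumed vote field: the location itself plus its four direct neighbors.
vote_field = {(0, 0): 1.0, (-1, 0): 0.5, (1, 0): 0.5, (0, -1): 0.5, (0, 1): 0.5}
votes = accumulate_votes(evidence, vote_field)
```

The center cell has no local evidence of its own, yet it collects the highest vote total from its neighbors, which illustrates how voting can bring in evidence that a purely local detector would miss.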
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as a feature representation. However, the information in this layer may be too coarse spatially to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [22], where we improve state-of-the-art from 49.7 mean AP^r [22] to 60.0; keypoint localization, where we get a 3.3 point boost over [20]; and part labeling, where we show a 6.6 point gain over a strong baseline.
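The hypercolumn definition above, the vector of activations of all units above a pixel, can be sketched with plain nested lists. This is a minimal illustration under assumptions: coarser layers are read with a nearest-neighbor lookup chosen here for brevity (an interpolating upsampler would be the smoother alternative), and the function name `hypercolumn` and the toy feature maps are hypothetical.

```python
def hypercolumn(feature_maps, y, x, height, width):
    """Collect the activations of every layer at image pixel (y, x).

    feature_maps is a list of layers; each layer is a list of channels;
    each channel is a 2D list whose size may be smaller than the image
    (height x width). Coarser maps are read by nearest-neighbor lookup,
    which is an assumption of this sketch.
    """
    column = []
    for layer in feature_maps:
        for channel in layer:
            layer_h, layer_w = len(channel), len(channel[0])
            # Map the image coordinate onto this layer's coarser grid.
            ly = min(y * layer_h // height, layer_h - 1)
            lx = min(x * layer_w // width, layer_w - 1)
            column.append(channel[ly][lx])
    return column


# A fine layer (1 channel, 4x4) and a coarse layer (2 channels, 2x2).
fine = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
coarse = [[[0.1, 0.2], [0.3, 0.4]], [[-1, -2], [-3, -4]]]

descriptor = hypercolumn([fine, coarse], y=2, x=3, height=4, width=4)
```

The resulting descriptor has one entry per channel across all layers, so a per-pixel classifier built on it sees both the precise early-layer response and the coarser, more semantic late-layer responses.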
Applications in the field of augmented reality or robotics often require joint localisation and 6D pose estimation of multiple objects. However, most algorithms need one network per object class to be trained in order to provide the best results. Analysing all visible objects demands multiple inferences, which is memory- and time-consuming. We present a new single-stage architecture called CASAPose that determines 2D-3D correspondences for pose estimation of multiple different objects in RGB images in one pass. It is fast and memory efficient, and achieves high accuracy for multiple objects by exploiting the output of a semantic segmentation decoder as control input to a keypoint recognition decoder via local class-adaptive normalisation. Our new differentiable regression of keypoint locations significantly contributes to a faster closing of the domain gap between real test and synthetic training data. We apply segmentation-aware convolutions and upsampling operations to increase the focus inside the object mask and to reduce mutual interference of occluding objects. For each inserted object, the network grows by only one output segmentation map and a negligible number of parameters. We outperform state-of-the-art approaches in challenging multi-object scenes with inter-object occlusion and synthetic training.
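We propose BAPose, a novel bottom-up approach that achieves state-of-the-art results for multi-person pose estimation. Our end-to-end trainable framework leverages a disentangled multi-scale waterfall architecture and incorporates adaptive convolutions to infer keypoints more precisely in crowded scenes with occlusions. The multi-scale representations, obtained by the disentangled waterfall module in BAPose, leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields of view comparable to spatial pyramid configurations. Our results on the challenging COCO and CrowdPose datasets demonstrate that BAPose is an efficient and robust framework for multi-person pose estimation, achieving significant improvements over state-of-the-art accuracy.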
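Image segmentation for video analysis plays an essential role in different research fields such as smart cities, healthcare, computer vision, geoscience, and remote sensing applications. In this regard, a significant effort has recently been devoted to developing novel segmentation strategies; one of the latest outstanding achievements is panoptic segmentation. The latter results from the fusion of semantic and instance segmentation. Explicitly, panoptic segmentation is currently under study to help gain a more nuanced knowledge of image scenes for video surveillance, crowd counting, self-driving, and medical image analysis, and a deeper understanding of scenes in general. To that end, we present in this paper what is, to the best of the authors' knowledge, the first comprehensive review of existing panoptic segmentation methods. Accordingly, a well-defined taxonomy of existing panoptic techniques is established based on the nature of the adopted algorithms, the application scenarios, and the primary objectives. Moreover, the use of panoptic segmentation for annotating new datasets by pseudo-labeling is discussed. Ablation studies are then carried out to understand the panoptic methods from different perspectives. In addition, evaluation metrics suitable for panoptic segmentation are discussed, and a comparison of the performance of existing solutions is provided to inform the state of the art and identify their limitations and strengths. Lastly, the current challenges the technology faces and the future trends likely to attract considerable interest in the near future are elaborated, which can serve as a starting point for upcoming research studies. The papers provided with code are available at: https://github.com/elharroussomar/awesome-panoptic-egation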