Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separated approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find the previous metric PartPQ biases to PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part feature and things/stuff feature, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure such task from both pixel-region and part-whole perspectives. It can also decouple the error for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation qualities. We design a new part-whole interaction method using masked cross attention. Finally, the extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
translated by 谷歌翻译
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
translated by 谷歌翻译
Time-series anomaly detection is an important task and has been widely applied in the industry. Since manual data annotation is expensive and inefficient, most applications adopt unsupervised anomaly detection methods, but the results are usually sub-optimal and unsatisfactory to end customers. Weak supervision is a promising paradigm for obtaining considerable labels in a low-cost way, which enables the customers to label data by writing heuristic rules rather than annotating each instance individually. However, in the time-series domain, it is hard for people to write reasonable labeling functions as the time-series data is numerically continuous and difficult to be understood. In this paper, we propose a Label-Efficient Interactive Time-Series Anomaly Detection (LEIAD) system, which enables a user to improve the results of unsupervised anomaly detection by performing only a small amount of interactions with the system. To achieve this goal, the system integrates weak supervision and active learning collaboratively while generating labeling functions automatically using only a few labeled data. All of these techniques are complementary and can promote each other in a reinforced manner. We conduct experiments on three time-series anomaly detection datasets, demonstrating that the proposed system is superior to existing solutions in both weak supervision and active learning areas. Also, the system has been tested in a real scenario in industry to show its practicality.
translated by 谷歌翻译
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
translated by 谷歌翻译
参考图像分割(RIS)旨在通过输出给定文本描述的相应对象掩码连接图像和语言,这是一项基本的视觉语言任务。尽管RIS取得了很多进展,但在这项工作中,我们还是探索了一个基本问题:“如果描述是错误的或文本描述的误导怎么办?”。我们将这样的句子称为否定句子。但是,我们发现现有作品无法处理此类设置。为此,我们提出了一种新颖的RIS,称为Robust Robust Toemustring图像分割(R-RIS)。除了定期给出的文本输入外,它还考虑了否定句子输入。我们通过增加输入负面句子和一个新的指标来统一两种输入类型,提出三个不同的数据集。此外,我们设计了一个名为RefSegformer的新的基于变压器的模型,在其中引入了基于令牌的视觉和语言融合模块。通过添加额外的空白令牌,可以轻松地将此类模块扩展到我们的R-RIS设置。我们提出的RefSegormer在三个常规RIS数据集和三个R-RIS数据集上实现了新的最新结果,这是用于进一步研究的新基线。项目页面位于\ url {https://lxtgh.github.io/project/robust_ref_seg/}。
translated by 谷歌翻译
在本文中,我们专注于探索有效的方法,以更快,准确和域的不可知性语义分割。受到相邻视频帧之间运动对齐的光流的启发,我们提出了一个流对齐模块(FAM),以了解相邻级别的特征映射之间的\ textit {语义流},并将高级特征广播到高分辨率特征有效地,有效地有效。 。此外,将我们的FAM与共同特征的金字塔结构集成在一起,甚至在轻量重量骨干网络(例如Resnet-18和DFNET)上也表现出优于其他实时方法的性能。然后,为了进一步加快推理过程,我们还提出了一个新型的封闭式双流对齐模块,以直接对齐高分辨率特征图和低分辨率特征图,在该图中我们将改进版本网络称为SFNET-LITE。广泛的实验是在几个具有挑战性的数据集上进行的,结果显示了SFNET和SFNET-LITE的有效性。特别是,建议的SFNET-LITE系列在使用RESNET-18主链和78.8 MIOU以120 fps运行的情况下,使用RTX-3090上的STDC主链在120 fps运行时,在60 fps运行时达到80.1 miou。此外,我们将四个具有挑战性的驾驶数据集(即CityScapes,Mapillary,IDD和BDD)统一到一个大数据集中,我们将其命名为Unified Drive细分(UDS)数据集。它包含不同的域和样式信息。我们基准了UDS上的几项代表性作品。 SFNET和SFNET-LITE仍然可以在UDS上取得最佳的速度和准确性权衡,这在如此新的挑战性环境中是强大的基准。所有代码和模型均可在https://github.com/lxtgh/sfsegnets上公开获得。
translated by 谷歌翻译
以前的多任务密集预测研究开发了复杂的管道,例如在多个阶段进行多模式蒸馏或为每个任务寻找任务关系上下文。这些方法以外的核心洞察力是最大程度地利用每个任务之间的相互作用。受到最近基于查询的变压器的启发,我们提出了一条更简单的管道,称为Multi-Querti-Transformer(MQTRANSFORMER),该管道配备了来自不同任务的多个查询,以促进多个任务之间的推理并简化交叉任务管道。我们没有在不同任务之间建模每个像素上下文的密集上下文,而是寻求特定于任务的代理,以通过每个查询编码与任务相关的上下文进行编码的多个查询执行交叉任务推理。 MQTRANSFORMER由三个关键组件组成:共享编码器,交叉任务注意和共享解码器。我们首先将每个任务与任务相关且具有比例意识的查询对每个任务进行建模,然后将功能提取器的图像功能输出和与任务相关的查询功能都馈入共享编码器,从而从图像功能中编码查询功能。其次,我们设计了一个交叉任务注意模块,以从两个角度来推理多个任务和特征量表之间的依赖项,包括相同尺度的不同任务和同一任务的不同尺度。然后,我们使用共享解码器逐渐使用来自不同任务的合理查询功能来逐步完善图像功能。对两个密集的预测数据集(NYUD-V2和Pascal-Context)的广泛实验结果表明,该方法是一种有效的方法,并实现了最新结果。
translated by 谷歌翻译
全景部分分割(PPS)旨在将泛型分割和部分分割统一为一个任务。先前的工作主要利用分离的方法来处理事物,物品和部分预测,而无需执行任何共享的计算和任务关联。在这项工作中,我们旨在将这些任务统一在架构层面上,设计第一个名为Panoptic-Partformer的端到端统一方法。特别是,由于视觉变压器的最新进展,我们将事物,内容和部分建模为对象查询,并直接学会优化所有三个预测作为统一掩码的预测和分类问题。我们设计了一个脱钩的解码器,以分别生成零件功能和事物/东西功能。然后,我们建议利用所有查询和相应的特征共同执行推理。最终掩码可以通过查询和相应特征之间的内部产品获得。广泛的消融研究和分析证明了我们框架的有效性。我们的全景局势群体在CityScapes PPS和Pascal Context PPS数据集上实现了新的最新结果,至少有70%的GFLOPS和50%的参数降低。特别是,在Pascal上下文PPS数据集上采用SWIN Transformer后,我们可以通过RESNET50骨干链和10%的改进获得3.4%的相对改进。据我们所知,我们是第一个通过\ textit {统一和端到端变压器模型来解决PPS问题的人。鉴于其有效性和概念上的简单性,我们希望我们的全景贡献者能够充当良好的基准,并帮助未来的PPS统一研究。我们的代码和型号可在https://github.com/lxtgh/panoptic-partformer上找到。
translated by 谷歌翻译
人类时尚理解是一项至关重要的计算机视觉任务,因为它具有用于现实世界应用的全面信息。这种关注人类时装细分和属性识别。与以前的作品相反,将每个任务分别建模为多头预测问题,我们的见解是通过Vision Transformer建模将这两个任务用一个统一的模型桥接,以使每个任务受益。特别是,我们介绍了分割的对象查询和属性预测的属性查询。查询及其相应的功能都可以通过掩码预测链接。然后,我们采用两流查询学习框架来学习解耦的查询表示。我们为属性流设计了一种新颖的多层渲染模块,以探索更细粒度的功能。解码器设计与DETR具有相同的精神。因此,我们将提出的方法\ textit {fahsionformer}命名。在三个人类时尚数据集上进行的广泛实验说明了我们方法的有效性。特别是,在\ textit {a intivit {a intim trictric(ap $^{\ text {mask}} _ {_ {\ text {iou+f text {iou+f textiT { } _1} $)用于分割和属性识别}。据我们所知,我们是人类时装分析的第一个统一的端到端视觉变压器框架。我们希望这种简单而有效的方法可以作为时尚分析的新灵活基准。代码可从https://github.com/xushilin1/fashionformer获得。
translated by 谷歌翻译
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow model, relation networks. Besides, benefited from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal transformer consists of two components: Temporal Query Encoder (TQE) to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to obtain current frame detection results. These designs boost the strong baseline deformable DETR by a significant margin (2 %-4 % mAP) on the ImageNet VID dataset. TransVOD yields comparable performances on the benchmark of ImageNet VID. Then, we present two improved versions of TransVOD including TransVOD++ and TransVOD Lite. The former fuses object-level information into object query via dynamic convolution while the latter models the entire video clips as the output to speed up the inference time. We give detailed analysis of all three models in the experiment part. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0 % mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7 % mAP while running at around 30 FPS on a single V100 GPU device. Code and models will be available for further research.
translated by 谷歌翻译