The long-standing theory that a colour-naming system evolves under the dual pressure of efficient communication and perceptual mechanism is supported by a growing number of linguistic studies, including an analysis of four decades of diachronic data from the Nafaanra language. This inspires us to explore whether artificial intelligence could evolve and discover a similar colour-naming system by optimising communication efficiency, represented here by high-level recognition performance. We propose a novel colour quantisation transformer, CQFormer, that quantises colour space while maintaining the accuracy of machine recognition on the quantised images. Given an RGB image, the Annotation Branch maps it into an index map before generating the quantised image with a colour palette, while the Palette Branch uses a key-point detection approach to find suitable palette colours within the whole colour space. By interacting with colour annotation, CQFormer balances machine vision accuracy against colour perceptual structure, such as a distinct and stable colour distribution for the discovered colour system. Interestingly, we observe a consistent evolution pattern between our artificial colour system and the basic colour terms across human languages. Our method also offers an efficient way to compress image storage while maintaining high performance in high-level recognition tasks such as classification and detection. Extensive experiments demonstrate the superior performance of our method with extremely low bit-rate colours. We will release the source code soon.
Translated by Google Translate
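The index-map-plus-palette representation described in the CQFormer abstract can be illustrated with a minimal sketch. In CQFormer both the index map and the palette are learned; the nearest-colour assignment below is only a stand-in chosen for illustration, and all function names (`nearest_index`, `build_index_map`, `render`, `bits_per_pixel`) are hypothetical rather than taken from the paper's code.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def nearest_index(pixel: Pixel, palette: List[Pixel]) -> int:
    """Index of the palette colour closest to `pixel` (squared RGB distance).

    Assumption: nearest-colour assignment replaces the learned Annotation Branch.
    """
    return min(
        range(len(palette)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, palette[i])),
    )


def build_index_map(image: Image, palette: List[Pixel]) -> List[List[int]]:
    """Map every pixel to a palette index; this is the 'index map'."""
    return [[nearest_index(px, palette) for px in row] for row in image]


def render(index_map: List[List[int]], palette: List[Pixel]) -> Image:
    """Rebuild the quantised image by looking each index up in the palette."""
    return [[palette[i] for i in row] for row in index_map]


def bits_per_pixel(palette_size: int) -> int:
    """Bits needed to store one palette index (at least 1)."""
    return max(1, (palette_size - 1).bit_length())


if __name__ == "__main__":
    palette = [(0, 0, 0), (255, 255, 255)]  # hypothetical 1-bit palette
    image = [[(10, 10, 10), (250, 240, 245)], [(200, 200, 200), (0, 30, 0)]]
    index_map = build_index_map(image, palette)
    print(index_map, bits_per_pixel(len(palette)))
```

Storing the index map and the palette instead of full RGB values is what makes the low bit-rate setting possible: a palette of 2 colours needs 1 bit per pixel, against 24 bits for unquantised RGB.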
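Color and structure are the two pillars that together constitute an image. Interested in the critical structures for neural network recognition, we isolate the influence of color by limiting the color space to just a few bits, and find structures that enable network recognition under such constraints. To this end, we propose a color quantization network, ColorCNN, which learns to structure an image in a limited color space by minimizing the classification loss. Building on the architecture and insights of ColorCNN, we introduce ColorCNN+, which supports multiple color space size configurations and addresses the earlier issues of poor recognition accuracy and undesirable visual fidelity under large color spaces. Through a novel imitation learning approach, ColorCNN+ learns to cluster colors like traditional color quantization methods. This reduces overfitting and helps both visual fidelity and recognition accuracy under large color spaces. Experiments verify that ColorCNN+ achieves very competitive results in most circumstances, preserving both the key structures for network recognition and visual fidelity with accurate colors. We further discuss the differences between key structures and accurate colors, and their specific contributions to network recognition. For potential applications, we show that ColorCNNs can be used as image compression methods for network recognition.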
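The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, several pioneering works have recently adapted Transformer-like architectures to the computer vision (CV) field and demonstrated their effectiveness on various CV tasks. Relying on their competitive modeling capability, visual Transformers compare favorably with modern convolutional neural networks (CNNs). In this paper, we provide a comprehensive review of three hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of differences in training settings and target tasks, we also evaluate these methods under different configurations, rather than only on various benchmarks, for easy and intuitive comparison. Furthermore, we reveal a series of essential aspects that may enable Transformers to stand out from numerous architectures, such as slack high-level semantic embeddings that bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.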
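In the real world, challenging illumination conditions (low light, underexposure, and overexposure) not only produce an unpleasant visual appearance but also impair computer vision tasks. Existing light-adaptive methods usually handle each condition separately. Moreover, most of them operate on raw images or oversimplify the camera image signal processing (ISP) pipeline. By decomposing the light transformation pipeline into local and global ISP components, we propose a lightweight, fast Illumination Adaptive Transformer (IAT), which comprises two transformer-style branches: a local estimation branch and a global ISP branch. While the local branch estimates the pixel-wise local components relevant to illumination, the global branch defines learnable queries that attend to the whole image to decode the parameters. Our IAT can also perform both object detection and semantic segmentation under various lighting conditions. We have extensively evaluated IAT on multiple real-world datasets across 2 low-level tasks and 3 high-level tasks. With only 90k parameters and a processing time of 0.004 s (excluding high-level modules), IAT consistently achieves superior performance. Code is available at https://github.com/cuiziteng/illumination-aptive-transformer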
Astounding results from Transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. Among their salient benefits, Transformers can model long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we provide an analysis of open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges in applying Transformer models to computer vision.
Continual Learning (CL) is a field dedicated to devising algorithms capable of lifelong learning. Overcoming the disruption of previously acquired knowledge, a drawback of deep learning models known as catastrophic forgetting, is a hard challenge. Currently, deep learning methods attain impressive results when the modeled data does not undergo a considerable distributional shift across subsequent learning sessions, but whenever such systems are exposed to this incremental setting, performance drops very quickly. Overcoming this limitation is fundamental: first, it would allow us to build truly intelligent systems that show both stability and plasticity; second, it would remove the onerous need to retrain these architectures from scratch on the updated data. In this thesis, we tackle the problem from multiple directions. In a first study, we show that in rehearsal-based techniques (systems that use a memory buffer), the quantity of data stored in the rehearsal buffer is a more important factor than the quality of the data. Secondly, we propose one of the early works on incremental learning for ViT architectures, comparing functional, weight, and attention regularization approaches and proposing a novel, effective asymmetric loss. Finally, we present a study on pretraining and how it affects performance in Continual Learning, raising some questions about the effective progression of the field. We conclude with some future directions and closing remarks.
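To make the idea of a rehearsal memory buffer concrete, here is a minimal sketch of a fixed-capacity buffer filled by reservoir sampling, a common choice in rehearsal-based continual learning. The abstract does not say which sampling scheme the thesis uses, so reservoir sampling is an assumption here, and the class name `RehearsalBuffer` and its methods are hypothetical.

```python
import random
from typing import Any, List


class RehearsalBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling.

    Assumption: reservoir sampling is used only as an illustrative policy;
    every sample seen so far ends up in the buffer with equal probability.
    """

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.samples: List[Any] = []
        self.seen = 0  # number of samples offered to the buffer so far
        self._rng = random.Random(seed)

    def add(self, sample: Any) -> None:
        self.seen += 1
        if len(self.samples) < self.capacity:
            # Buffer not full yet: always store the sample.
            self.samples.append(sample)
        else:
            # Replace a stored sample with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def replay(self, k: int) -> List[Any]:
        """Draw up to k stored samples to mix into the current training batch."""
        return self._rng.sample(self.samples, min(k, len(self.samples)))


if __name__ == "__main__":
    buffer = RehearsalBuffer(capacity=3)
    for item in range(10):  # stand-ins for (input, label) pairs
        buffer.add(item)
    print(buffer.samples, buffer.replay(2))
```

The `capacity` parameter is the quantity knob discussed in the abstract: a larger buffer keeps more past samples available for replay, independent of how carefully each one is selected.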
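Image Coding for Machines (ICM) aims to compress images for AI task analysis rather than to satisfy human perception. Learning a feature that is both general (for AI tasks) and compact (for compression) is pivotal to its success. In this paper, we attempt to develop an ICM framework by learning universal features while also considering compression. We name such features omnipotent features, and the corresponding framework Omni-ICM. Considering that self-supervised learning (SSL) improves feature generalization, we integrate it with the compression task in the Omni-ICM framework to learn omnipotent features. However, it is non-trivial to coordinate semantic modeling in SSL with redundancy removal in compression, so we design a novel information filtering (IF) module that co-optimizes instance discrimination and entropy minimization to adaptively drop information that is weakly related to AI tasks (e.g., some texture redundancy). Unlike previous task-specific solutions, Omni-ICM can directly support AI task analysis based on the learned omnipotent features, without joint training or extra transformation. Although simple and intuitive, Omni-ICM significantly outperforms existing traditional and learning-based codecs on multiple fundamental vision tasks.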
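The Visual Question Answering (VQA) task uses both visual image and language analysis to answer a textual question about an image. It has been a popular research topic with a growing number of real-world applications over the last decade. This paper describes our recent research on AliceMind-MMU (Alibaba's encoder-decoders from the Machine Intelligence Lab of DAMO Academy - MultiMedia Understanding), which obtains similar or even slightly better results than humans on VQA. This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with the corresponding expertise plays an important role in raising the performance of our VQA architecture to the human level. Extensive experiments and analysis are conducted to demonstrate the effectiveness of the new research work.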
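With the rise of deep convolutional neural networks, object detection has achieved prominent advances in the past few years. However, such prosperity cannot conceal the unsatisfactory state of Small Object Detection (SOD), one of the notoriously challenging tasks in computer vision, owing to the poor visual appearance and noisy representation caused by the intrinsic structure of small targets. In addition, large-scale datasets for benchmarking small object detection methods remain a bottleneck. In this paper, we first conduct a thorough review of small object detection. Then, to catalyze the development of SOD, we construct two large-scale Small Object Detection dAtasets (SODA), SODA-D and SODA-A, which focus on driving and aerial scenarios respectively. SODA-D includes 24704 high-quality traffic images and 277596 instances of 9 categories. For SODA-A, we collect 2510 high-resolution aerial images and annotate 800203 instances over 9 categories. To the best of our knowledge, the proposed datasets are the first attempt at large-scale benchmarks with a vast collection of annotated instances tailored for multi-category SOD. Finally, we evaluate the performance of mainstream methods on SODA. We expect the released benchmarks to facilitate the development of SOD and spawn more breakthroughs in this field. Datasets and code will be available soon at: https://shaunyuan22.github.io/soda.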
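Zero-shot object detection (ZSD), the task of extending conventional detection models to detect objects from unseen categories, has emerged as a new challenge in computer vision. Most existing approaches tackle the ZSD task with a strict mapping-transfer strategy, which may lead to suboptimal ZSD results: 1) the learning process of these models ignores the available unseen-class information, and thus can easily be biased towards the seen categories; 2) the original visual feature space is not well structured and lacks discriminative information. To address these issues, we develop a novel semantic-guided contrastive network for ZSD, named ContrastZSD, a detection framework that first brings the contrastive learning mechanism into the realm of zero-shot detection. In particular, ContrastZSD incorporates two semantic-guided contrastive learning subnets that contrast between region-category and region-region pairs respectively. The pairwise contrastive tasks take advantage of additional supervision signals derived from both ground-truth labels and pre-defined class similarity distributions. Under the guidance of this explicit semantic supervision, the model can learn more knowledge about unseen categories to avoid the bias problem towards seen concepts, while optimizing the data structure of visual features to be more discriminative for better visual-semantic alignment. Extensive experiments are conducted on two popular ZSD benchmarks, PASCAL VOC and MS COCO. Results show that our method outperforms the previous state of the art on both ZSD and generalized ZSD tasks.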
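Labeling data is often expensive and time-consuming, especially for tasks such as object detection and instance segmentation, which require dense labeling of the image. While few-shot object detection is about training a model on novel (unseen) object classes with little data, it still requires prior training on many labeled examples of base (seen) classes. On the other hand, self-supervised methods aim at learning representations from unlabeled data that transfer well to downstream tasks such as object detection. Combining few-shot and self-supervised object detection is a promising research direction. In this survey, we review and characterize the most recent approaches to few-shot and self-supervised object detection. Then, we give our main takeaways and discuss future research directions. Project page: https://gabrielhuang.github.io/fsod-survey/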
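We present OSFormer, the first one-stage transformer framework for camouflaged instance segmentation (CIS). OSFormer is based on two key designs. First, we design a location-sensing transformer (LST) to obtain the location label and instance-aware parameters by introducing location-guided queries and a blend-convolution feed-forward network. Second, we develop a coarse-to-fine fusion (CFF) to merge diverse context information from the LST encoder and the CNN backbone. Coupling these two components enables OSFormer to efficiently blend local features and long-range context dependencies for predicting camouflaged instances. Compared with two-stage frameworks, our OSFormer reaches 41% AP and achieves good convergence efficiency without requiring enormous training data, i.e., only 3040 samples under 60 epochs. Code link: https://github.com/pjlallen/osformer.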
Strong lensing in galaxy clusters probes properties of dense cores of dark matter halos in mass, studies the distant universe at flux levels and spatial resolutions otherwise unavailable, and constrains cosmological models independently. The next-generation large-scale sky imaging surveys are expected to discover thousands of cluster-scale strong lenses, which would lead to unprecedented opportunities for applying cluster-scale strong lenses to solve astrophysical and cosmological problems. However, the size of the dataset makes it challenging for astronomers to identify and extract strong lensing signals, particularly strongly lensed arcs, because of their complexity and variety. Hence, we propose a framework to detect cluster-scale strongly lensed arcs, which contains a transformer-based detection algorithm and an image simulation algorithm. We embed prior information about cluster-scale strongly lensed arcs into the training data through simulation and then train the detection algorithm with simulated images. We use the trained transformer to detect strongly lensed arcs in simulated and real data. Results show that our approach achieves a 99.63% accuracy rate, a 90.32% recall rate, an 85.37% precision rate, and a 0.23% false positive rate in detecting strongly lensed arcs in simulated images, and detects almost all strongly lensed arcs in real observation images. In addition, using an interpretation method, we show that our method identifies important information embedded in simulated data. As a next step, to test the reliability and usability of our approach, we will apply it to available observations (e.g., DESI Legacy Imaging Surveys) and simulated data of upcoming large-scale sky surveys, such as Euclid and the CSST.
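The four rates quoted in the abstract follow from the standard confusion-matrix definitions. The sketch below shows how they relate; the function name `detection_metrics` and the counts in the usage example are hypothetical and are not the paper's data.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary detection metrics from confusion-matrix counts.

    tp/fp: detections that are / are not real arcs,
    tn/fn: non-detections that are correctly rejected / missed arcs.
    """
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "recall": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
        "false_positive_rate": _ratio(fp, fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the formulas.
    for name, value in detection_metrics(tp=8, fp=2, tn=88, fn=2).items():
        print(f"{name}: {value:.4f}")
```

When negatives greatly outnumber positives, accuracy and the false positive rate can both look very good while precision stays noticeably lower, which is why all four figures are worth reporting together.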
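Compressive learning (CL) is an emerging framework that integrates signal acquisition via compressed sensing (CS) with machine learning, performing inference tasks directly on a small number of measurements. It can be a promising alternative to classical image-domain methods and offers great advantages in memory saving and computational efficiency. However, previous attempts at CL are not only limited to a fixed CS ratio, which lacks flexibility, but also limited to MNIST/CIFAR-like datasets, and do not scale to complex real-world high-resolution (HR) data or vision tasks. In this paper, we propose a novel transformer-based compressive learning framework for large-scale images with arbitrary CS ratios, dubbed TransCL. Specifically, TransCL first uses a learnable block-based compressed sensing strategy and proposes a flexible linear projection strategy so that CL can be performed on large-scale images in an efficient block-by-block manner with arbitrary CS ratios. Then, regarding the CS measurements from all blocks as a sequence, a pure transformer-based backbone is deployed to perform vision tasks with various task-oriented heads. Our analysis shows that TransCL exhibits strong resistance to interference and robust adaptability to arbitrary CS ratios. Extensive experiments on complex HR data demonstrate that the proposed TransCL achieves state-of-the-art performance in image classification and semantic segmentation tasks. In particular, TransCL with a CS ratio of 10% obtains almost the same performance as when operating directly on the original data, and still obtains satisfactory performance even at an extremely low CS ratio of 1%. The source code of our proposed TransCL is available at https://github.com/mc-e/transcl/.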
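Block-based compressed sensing with a variable CS ratio can be sketched as follows. This is an illustration under stated assumptions, not TransCL's implementation: the measurement matrix here is random Gaussian rather than learned, and taking the first rows of one shared matrix to support arbitrary ratios is an assumed design, since the abstract does not detail its linear projection strategy. All names (`split_blocks`, `make_matrix`, `measure`) are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def split_blocks(image: Matrix, block: int) -> List[List[float]]:
    """Cut a 2-D image into non-overlapping block x block tiles, each flattened.

    Assumes the image height and width are multiples of `block`.
    """
    height, width = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(block) for j in range(block)]
        for r in range(0, height, block)
        for c in range(0, width, block)
    ]


def make_matrix(block: int, seed: int = 0) -> Matrix:
    """Full-size (n x n) measurement matrix, n = block * block.

    Assumption: random Gaussian entries stand in for learned weights.
    """
    rng = random.Random(seed)
    n = block * block
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def measure(image: Matrix, phi: Matrix, block: int, ratio: float) -> Matrix:
    """Return one measurement vector per block at the requested CS ratio.

    The first m = round(ratio * n) rows of `phi` are used, so any ratio
    can be served by the same matrix (assumed design).
    """
    n = block * block
    m = max(1, round(ratio * n))
    rows = phi[:m]
    return [
        [sum(a * x for a, x in zip(row, blk)) for row in rows]
        for blk in split_blocks(image, block)
    ]


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    phi = make_matrix(block=2)
    sequence = measure(image, phi, block=2, ratio=0.5)
    print(len(sequence), len(sequence[0]))  # number of blocks, measurements per block
```

The resulting list of per-block measurement vectors is the kind of sequence that a transformer backbone could consume as tokens, which is the role the abstract assigns to the CS measurements from all blocks.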
Diagram object detection is the key basis of practical applications such as textbook question answering. Because a diagram mainly consists of simple lines and color blocks, its visual features are sparser than those of natural images. In addition, diagrams usually express diverse knowledge, so they contain many low-frequency object categories. As a result, traditional data-driven detection models are not well suited to diagrams. In this work, we propose a gestalt-perception transformer model for diagram object detection, based on an encoder-decoder architecture. Gestalt perception comprises a series of laws that explain human perception: the human visual system tends to perceive patches in an image that are similar, close, or connected without abrupt directional changes as a single perceptual object. Inspired by these ideas, we build a gestalt-perception graph in the transformer encoder, composed of diagram patches as nodes and the relationships between patches as edges. This graph aims to group these patches into objects via the laws of similarity, proximity, and smoothness implied in these edges, so that meaningful objects can be detected effectively. The experimental results demonstrate that the proposed GPTR achieves the best results in the diagram object detection task. Our model also obtains results comparable to those of competitors in natural image object detection.
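Prior works have proposed several strategies to reduce the computational cost of the self-attention mechanism. Many of these works consider decomposing the self-attention procedure into regional and local feature extraction procedures, each of which incurs a much smaller computational complexity. However, regional information is typically obtained only at the expense of undesirable information loss caused by down-sampling. In this paper, we propose a novel Transformer architecture that aims to mitigate the cost issue, named Dual Vision Transformer (Dual-ViT). The new architecture incorporates a critical semantic pathway that can more efficiently compress token vectors into global semantics with a reduced order of complexity. Such compressed global semantics then serve as useful prior information for learning finer pixel-level details through another constructed pixel pathway. The semantic pathway and pixel pathway are then integrated and jointly trained, spreading the enhanced self-attention information in parallel through both pathways. Dual-ViT is thereby able to reduce the computational complexity without compromising much accuracy. We empirically demonstrate that Dual-ViT provides superior accuracy to SOTA Transformer architectures with reduced training complexity. Source code is available at https://github.com/yehli/imagenetmodel.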
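Detecting 3D objects from point clouds is a practical yet challenging task that has attracted increasing attention recently. In this paper, we propose a Label-Guided auxiliary training method for 3D object detection (LG3D), which serves as an auxiliary network to enhance the feature learning of existing 3D object detectors. Specifically, we propose two novel modules: a Label-Annotation-Inducer that maps annotations and point clouds in bounding boxes to task-specific representations, and a Label-Knowledge-Mapper that helps enrich the original features to obtain detection-critical representations. The proposed auxiliary network is discarded at inference and thus incurs no extra computational cost at test time. We conduct extensive experiments on both indoor and outdoor datasets to verify the effectiveness of our approach. For example, our proposed LG3D improves VoteNet by 2.5% and 3.1% mAP on the SUN RGB-D and ScanNetV2 datasets, respectively.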
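Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations become more complex, integrating multi-source information from different sensors and representing features in a unified view becomes vitally important. BEV perception inherits several advantages, as representing surrounding scenes in BEV is intuitive and fusion-friendly, and representing objects in BEV is most desirable for subsequent modules such as planning and/or control. The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground-truth annotations in the BEV grid; (c) how to formulate the pipeline to incorporate features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios. In this survey, we review the most recent work on BEV perception and provide an in-depth analysis of different solutions. Moreover, several systematic designs of BEV approaches from industry are also described. Furthermore, we introduce a full suite of practical guidelines to improve the performance of BEV perception tasks, covering camera, LiDAR, and fusion inputs. Finally, we point out future research directions in this area. We hope this report sheds some light on the community and encourages more research effort on BEV perception. We keep an active repository to collect the most recent work and provide a toolbox for a bag of tricks at https://github.com/openperceptionx/bevperception-survey-recipe.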
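We present a conceptually simple, flexible, and universal visual perception head for various visual tasks, e.g., classification, object detection, instance segmentation, and pose estimation, and for different frameworks, such as one-stage or two-stage pipelines. Our approach effectively identifies an object in an image while simultaneously generating a high-quality bounding box, a contour-based segmentation mask, or a set of keypoints. The method, called UniHead, views different visual perception tasks as dispersible points learning via the transformer encoder architecture. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations. It directly outputs the final set of predictions in the form of multiple points, allowing us to perform different visual tasks in different frameworks with the same head design. We show extensive evaluations on ImageNet classification and all three tracks of the COCO suite, including object detection, instance segmentation, and pose estimation. Without bells and whistles, UniHead unifies these visual tasks via a single visual head design and achieves performance comparable to expert models developed for each task. We hope our simple and universal UniHead will serve as a solid baseline and help promote universal visual perception research. Code and models are available at https://github.com/sense-x/unihead.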
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the VOD pipeline, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current-frame detection results. These designs boost the strong Deformable DETR baseline by a significant margin (2%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object query via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiment part. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single V100 GPU device. Code and models will be available for further research.