Evaluating neural network performance is critical to deep neural network design but a costly procedure. Neural predictors provide an efficient solution by treating architectures as samples and learning to estimate their performance on a given task. However, existing predictors are task-dependent, predominantly estimating neural network performance on image classification benchmarks. They are also search-space dependent; each predictor is designed to make predictions for a specific architecture search space with predefined topologies and set of operations. In this paper, we propose a novel All-in-One Predictor (AIO-P), which aims to pretrain neural predictors on architecture examples from multiple, separate computer vision (CV) task domains and multiple architecture spaces, and then transfer to unseen downstream CV tasks or neural architectures. We describe our proposed techniques for general graph representation, efficient predictor pretraining and knowledge infusion techniques, as well as methods to transfer to downstream tasks/spaces. Extensive experimental results show that AIO-P can achieve Mean Absolute Error (MAE) and Spearman's Rank Correlation (SRCC) below 1% and above 0.5, respectively, on a breadth of target downstream CV tasks with or without fine-tuning, outperforming a number of baselines. Moreover, AIO-P can directly transfer to new architectures not seen during training, accurately rank them and serve as an effective performance estimator when paired with an algorithm designed to preserve performance while reducing FLOPs.
translated by Google Translate
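The two metrics quoted in the AIO-P abstract, Mean Absolute Error (MAE) and Spearman's Rank Correlation (SRCC), can be computed as in the following minimal sketch. It is not taken from the paper; the function names and the toy accuracy values are illustrative assumptions, and ties are handled with average ranks.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between ground-truth and predicted performance."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rank_correlation(y_true, y_pred):
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    ra, rb = average_ranks(y_true), average_ranks(y_pred)
    ma, mb = mean(ra), mean(rb)
    cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rb))
    var_a = sum((a - ma) ** 2 for a in ra)
    var_b = sum((b - mb) ** 2 for b in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical accuracies of four architectures (illustrative values only).
true_acc = [0.70, 0.72, 0.75, 0.71]
pred_acc = [0.69, 0.73, 0.74, 0.72]
mae = mean_absolute_error(true_acc, pred_acc)
srcc = spearman_rank_correlation(true_acc, pred_acc)
```

MAE measures how close the predicted values are, while SRCC only measures whether the predictor orders architectures correctly, which is why the abstract reports both.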
Predicting neural architecture performance is a challenging task and is crucial to neural architecture design and search. Existing approaches either rely on neural performance predictors which are limited to modeling architectures in a predefined design space involving specific sets of operators and connection rules, and cannot generalize to unseen architectures, or resort to zero-cost proxies which are not always accurate. In this paper, we propose GENNAPE, a Generalized Neural Architecture Performance Estimator, which is pretrained on open neural architecture benchmarks, and aims to generalize to completely unseen architectures through combined innovations in network representation, contrastive pretraining, and fuzzy clustering-based predictor ensemble. Specifically, GENNAPE represents a given neural network as a Computation Graph (CG) of atomic operations which can model an arbitrary architecture. It first learns a graph encoder via Contrastive Learning to encourage network separation by topological features, and then trains multiple predictor heads, which are soft-aggregated according to the fuzzy membership of a neural network. Experiments show that GENNAPE pretrained on NAS-Bench-101 can achieve superior transferability to 5 different public neural network benchmarks, including NAS-Bench-201, NAS-Bench-301, MobileNet and ResNet families under no or minimum fine-tuning. We further introduce 3 challenging newly labelled neural network benchmarks: HiAML, Inception and Two-Path, which can concentrate in narrow accuracy ranges. Extensive experiments show that GENNAPE can correctly discern high-performance architectures in these families. Finally, when paired with a search algorithm, GENNAPE can find architectures that improve accuracy while reducing FLOPs on three families.
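The soft aggregation of predictor heads by fuzzy membership can be illustrated with a small sketch. The abstract does not give the formula, so this uses the standard fuzzy C-means membership as a stand-in; the function names, the fuzzifier `m`, and the distances to cluster centres are all assumptions made for illustration.

```python
def fuzzy_memberships(distances, m=2.0):
    """Standard fuzzy C-means membership from distances to each cluster centre.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)); memberships sum to 1.
    """
    if any(d == 0 for d in distances):
        # A point sitting exactly on one or more centres belongs only to those.
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


def soft_aggregate(head_predictions, memberships):
    """Membership-weighted sum of the per-cluster predictor head outputs."""
    return sum(u * p for u, p in zip(memberships, head_predictions))


# Hypothetical example: an embedding three times closer to cluster 0 than cluster 1.
u = fuzzy_memberships([1.0, 3.0])
prediction = soft_aggregate([0.9, 0.7], u)
```

With this weighting, a network near one cluster centre is scored almost entirely by that cluster's head, while a network between clusters receives a blend.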
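Neural Architecture Search (NAS) has been widely adopted to design accurate and efficient image classification models. However, applying NAS to a new computer vision task still requires a huge amount of effort. This is because 1) previous NAS research has over-prioritized image classification while largely ignoring other tasks; 2) many NAS works focus on optimizing task-specific components that cannot be favorably transferred to other tasks; and 3) existing NAS methods are typically designed to be "proxyless" and require significant effort to integrate with each new task's training pipeline. To tackle these challenges, we propose FBNetV5, a NAS framework that can search for neural architectures for a variety of vision tasks with much reduced computational cost and human effort. Specifically, we design 1) a search space that is simple yet inclusive and transferable; 2) a multitask search process that is disentangled from the target tasks' training pipelines; and 3) an algorithm to simultaneously search for architectures for multiple tasks with a computational cost agnostic to the number of tasks. We evaluate the proposed FBNetV5 on three fundamental vision tasks: image classification, object detection, and semantic segmentation. Models searched by FBNetV5 in a single run of search outperform the previous state of the art in all three tasks: image classification (e.g., +1.3% ImageNet top-1 accuracy under the same FLOPs compared to FBNetV3), semantic segmentation (e.g., +1.8% higher ADE20K val. mIoU than SegFormer with 3.6x fewer FLOPs), and object detection (e.g., +1.1% COCO val. mAP with 1.2x fewer FLOPs compared to YOLOX).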
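In object detection models, the detection backbone consumes more than half of the overall inference cost. Recent research attempts to reduce this cost by optimizing the backbone architecture with the help of Neural Architecture Search (NAS). However, existing NAS methods for object detection require hundreds to thousands of GPU hours of searching, making them impractical in fast-paced research and development. In this work, we propose a novel zero-shot NAS method to address this issue. The proposed method, named ZenDet, automatically designs efficient detection backbones without training network parameters, reducing the architecture design cost to nearly zero while delivering state-of-the-art (SOTA) performance. Under the hood, ZenDet maximizes the differential entropy of detection backbones, leading to a better feature extractor for object detection under the same computational budget. After merely one GPU day of fully automatic design, ZenDet produces SOTA detection backbones on multiple detection benchmark datasets with little human intervention. Compared to a ResNet-50 backbone, ZenDet is +2.0% better in mAP when using the same amount of FLOPs/parameters, and is 1.54 times faster on an NVIDIA V100 at the same mAP. Code and pre-trained models will be released later.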
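Deep learning technologies have demonstrated remarkable effectiveness across a wide range of tasks, and deep learning holds the potential to advance a multitude of applications, including in edge computing, where deep models are deployed on edge devices to enable instant data processing and response. A key challenge is that while the application of deep models often incurs substantial memory and computational costs, edge devices typically offer only very limited storage and computational capabilities, which may vary substantially across devices. These characteristics make it difficult to build deep learning solutions that unleash the potential of edge devices while complying with their constraints. A promising approach to addressing this challenge is to automate the design of effective deep learning models that are lightweight, require only a small amount of storage, and incur only low computational overheads. This survey offers comprehensive coverage of studies of design automation techniques for deep learning models targeting edge computing. It provides an overview and comparison of the key metrics commonly used to quantify the proficiency of models in terms of effectiveness, lightness, and computational cost. The survey then covers three categories of state-of-the-art deep model design automation techniques: automated neural architecture search, automated model compression, and joint automated design and compression. Finally, the survey covers open issues and directions for future research.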
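Recently, a large number of deep learning-based methods have been successfully applied to various remote sensing image (RSI) recognition tasks. However, most existing advances of deep learning methods in the RSI field rely heavily on features extracted by manually designed backbone networks, which severely hinders the potential of deep learning models owing to the complexity of RSI and the limitations of prior knowledge. In this paper, we investigate a new design paradigm for backbone architectures in RSI recognition tasks, including scene classification, land-cover classification, and object detection. We propose a one-shot architecture search framework based on a weight-sharing strategy and an evolutionary algorithm, called RSBNet, which consists of three stages. First, a supernet constructed in a layer-wise search space is pretrained on a self-assembled large-scale RSI dataset using an ensemble single-path training strategy. Next, the pretrained supernet is equipped with different recognition heads through a switchable recognition module and fine-tuned separately on the target datasets to obtain task-specific supernets. Finally, we search for the optimal backbone architecture for each recognition task using an evolutionary algorithm, without any network training. Extensive experiments on five benchmark datasets for different recognition tasks show the effectiveness of the proposed search paradigm and demonstrate that the searched backbones can flexibly adapt to different RSI recognition tasks and achieve impressive performance.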
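Aiming towards a holistic understanding of multiple downstream tasks simultaneously requires extracting features with better transferability. Although many recent self-supervised pre-training methods have achieved impressive performance on various vision tasks under the prevailing pretrain-finetune paradigm, their generalization capacity to multi-task learning scenarios is yet to be explored. In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, namely semantic segmentation, drivable area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. We surprisingly find that their performance is sub-optimal or even lags far behind the single-task baseline, which may be due to the differences in training objectives and architectural design inherent in the pretrain-finetune paradigm. To overcome this dilemma and avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, in which off-the-shelf pretrained models can be effectively adapted without increasing the training overhead. During the adapt stage, we use learnable multi-scale adapters to dynamically adjust the pretrained model weights under the supervision of multi-task objectives while leaving the pretrained knowledge untouched. Furthermore, we regard the vision-language pre-training model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors into the multi-task model via task-specific prompting and alignment between visual and textual features.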
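Transfer learning enables knowledge from a source task to be re-used to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models: pre-training a model for image classification on the ILSVRC dataset, and then fine-tuning on any target task. However, previous systematic studies of transfer learning have been limited, and the circumstances in which it is expected to work are not fully understood. In this paper, we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured-output task types relevant to modern computer vision applications. In total we carry out over 2000 transfer learning experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations: (1) for most tasks there exists a source which significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should include the image domain of the target dataset to achieve best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types.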
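Architectural advancements in deep neural networks have led to remarkable leaps forward across a broad array of computer vision tasks. Instead of relying on human expertise, neural architecture search (NAS) has emerged as a promising avenue toward automating the design of architectures. While recent achievements in image classification have suggested opportunities, the promise of NAS has yet to be thoroughly assessed on the more challenging task of semantic segmentation. The main challenges of applying NAS to semantic segmentation arise from two aspects: (i) the high-resolution images to be processed; (ii) the additional requirement of real-time inference speed (i.e., real-time semantic segmentation) for applications such as autonomous driving. To meet such challenges, we propose a surrogate-assisted multi-objective method in this paper. Through a series of customized prediction models, our method effectively transforms the original NAS task into an ordinary multi-objective optimization problem. Followed by a hierarchical pre-screening criterion for in-fill selection, our method progressively obtains a set of efficient architectures that trade off segmentation accuracy against inference speed. Empirical evaluations on three benchmark datasets, together with an application using the Huawei Atlas 200 DK, suggest that our method can identify architectures that significantly outperform existing state-of-the-art architectures designed both manually by human experts and automatically by other NAS methods.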
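The trade-off between segmentation accuracy and inference speed is a two-objective problem whose solutions form a Pareto front. The sketch below shows only a generic non-dominated filter, not the paper's surrogate models or pre-screening criterion; the candidate tuples of (accuracy, latency in milliseconds) are hypothetical values.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives and
    strictly better on one. Each candidate is (accuracy, latency):
    higher accuracy is better, lower latency is better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical (accuracy, latency_ms) pairs for four architectures.
candidates = [(0.70, 10.0), (0.75, 20.0), (0.72, 25.0), (0.68, 9.0)]
front = pareto_front(candidates)
```

Here (0.72, 25.0) is dropped because (0.75, 20.0) is both more accurate and faster; the remaining three candidates each represent a different accuracy/speed compromise.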
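Deep neural networks (DNNs) are built by sequentially applying linear and non-linear processes. Combining linear and non-linear procedures is essential for generating a sufficiently deep feature space. Most non-linear operators are derivations of activation functions or pooling functions. Mathematical morphology is a branch of mathematics that provides non-linear operators for a variety of image processing problems. In this paper, we investigate the utility of integrating these operations into an end-to-end deep learning framework. DNNs are designed to acquire a realistic representation for a particular task. Morphological operators give topological descriptors that convey salient information about the shapes of objects depicted in images. We propose a meta-learning-based approach to incorporate morphological operators into DNNs. The learned architecture demonstrates how our novel morphological operations significantly increase DNN performance on various tasks, including image classification and edge detection.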
Jitendra Malik once said, "Supervision is the opium of the AI researcher". Most deep learning techniques heavily rely on extreme amounts of human labels to work effectively. In today's world, the rate of data creation greatly surpasses the rate of data annotation. Full reliance on human annotations is just a temporary means to solve current closed problems in AI. In reality, only a tiny fraction of data is annotated. Annotation Efficient Learning (AEL) is a study of algorithms to train models effectively with fewer annotations. To thrive in AEL environments, we need deep learning techniques that rely less on manual annotations (e.g., image, bounding-box, and per-pixel labels), but learn useful information from unlabeled data. In this thesis, we explore five different techniques for handling AEL.
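Although parameter-efficient tuning (PET) methods have shown great potential on natural language processing (NLP) tasks, their effectiveness with large-scale ConvNets on computer vision (CV) tasks remains under-studied. This paper proposes Conv-Adapter, a PET module designed for ConvNets. Conv-Adapter is lightweight, domain-transferable, and architecture-agnostic, with generalized performance on different tasks. When transferring to downstream tasks, Conv-Adapter learns task-specific feature modulation on the intermediate representations of the backbone while keeping the pre-trained parameters frozen. By introducing only a small number of learnable parameters, e.g., only 3.5% of the full fine-tuning parameters of ResNet50, Conv-Adapter outperforms previous PET baseline methods and achieves performance comparable to or surpassing full fine-tuning on 23 classification tasks from various domains. It also shows superior performance on few-shot classification, with an average margin of 3.39%. Beyond classification, Conv-Adapter generalizes to detection and segmentation tasks with more than a 50% reduction in parameters but performance comparable to traditional full fine-tuning.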
In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark at both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
We propose a new neural network design paradigm, Reversible Column Network (RevCol). The main body of RevCol is composed of multiple copies of subnetworks, called columns, between which multi-level reversible connections are employed. This architectural scheme gives RevCol very different behavior from conventional networks: during forward propagation, features in RevCol are gradually disentangled as they pass through each column, and their total information is maintained rather than compressed or discarded as in other networks. Our experiments suggest that CNN-style RevCol models can achieve very competitive performance on multiple computer vision tasks such as image classification, object detection and semantic segmentation, especially with a large parameter budget and a large dataset. For example, after ImageNet-22K pre-training, RevCol-XL obtains 88.2% ImageNet-1K accuracy. Given more pre-training data, our largest model RevCol-H reaches 90.0% on ImageNet-1K, 63.8% APbox on the COCO detection minival set, and 61.0% mIoU on ADE20k segmentation. To our knowledge, these are the best COCO detection and ADE20k segmentation results among pure (static) CNN models. Moreover, as a general macro architecture design, RevCol can also be introduced into transformers or other neural networks, which is demonstrated to improve performance in both computer vision and NLP tasks. We release code and models at https://github.com/megvii-research/RevCol
In this work, we introduce Panoptic-DeepLab, a simple, strong, and fast system for panoptic segmentation, aiming to establish a solid baseline for bottom-up methods that can achieve performance comparable to two-stage methods while yielding fast inference speed. In particular, Panoptic-DeepLab adopts dual-ASPP and dual-decoder structures specific to semantic and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model (e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. As a result, our single Panoptic-DeepLab simultaneously ranks first on all three Cityscapes benchmarks, setting the new state of the art of 84.2% mIoU, 39.0% AP, and 65.5% PQ on the test set. Additionally, equipped with MobileNetV3, Panoptic-DeepLab runs nearly in real time with a single 1025 × 2049 image (15.8 frames per second), while achieving competitive performance on Cityscapes (54.1% PQ on the test set). On the Mapillary Vistas test set, our ensemble of six models attains 42.7% PQ, outperforming the challenge winner in 2018 by a healthy margin of 1.5%. Finally, our Panoptic-DeepLab also performs on par with several top-down approaches on the challenging COCO dataset. For the first time, we demonstrate that a bottom-up approach can deliver state-of-the-art results on panoptic segmentation.
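For readers unfamiliar with the PQ figures quoted above, Panoptic Quality for one class is the sum of IoUs over matched segment pairs divided by the number of true positives plus half the false positives and half the false negatives, where a predicted and a ground-truth segment match when their IoU exceeds 0.5. The sketch below computes that ratio from already-matched segments; the function name and the input values are illustrative assumptions, and the matching step itself is not shown.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ for one class, given the IoUs of matched (true-positive) segment pairs.

    PQ = sum(IoU over TP) / (|TP| + 0.5 * |FP| + 0.5 * |FN|)
    """
    num_true_positives = len(matched_ious)
    denominator = (
        num_true_positives
        + 0.5 * num_false_positives
        + 0.5 * num_false_negatives
    )
    if denominator == 0:
        # No predicted and no ground-truth segments for this class.
        return 0.0
    return sum(matched_ious) / denominator


# Hypothetical class with two matched segments, one spurious prediction,
# and one missed ground-truth segment.
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
```

The benchmark number is the average of this per-class quantity over all classes.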
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
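This paper investigates the task of 2D whole-body human pose estimation, which aims to localize dense landmarks on the entire human body, including the body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, that takes into account the hierarchical structure of the full human body and addresses the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to improve both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity to the searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, namely COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.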
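Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g., image classification on CIFAR or ImageNet. This makes the performance of NAS approaches in more diverse areas poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite for evaluating methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far afield from their original development domain. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we show the need for the more robust NAS evaluation that NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several hypotheses recently promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.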
Sensors in cyber-physical systems often capture interconnected processes and thus emit correlated time series (CTS), the forecasting of which enables important applications. The key to successful CTS forecasting is to uncover the temporal dynamics of time series and the spatial correlations among time series. Deep learning-based solutions exhibit impressive performance at discerning these aspects. In particular, automated CTS forecasting, where the design of an optimal deep learning architecture is automated, enables forecasting accuracy that surpasses what has been achieved by manual approaches. However, automated CTS solutions remain in their infancy and are only able to find optimal architectures for predefined hyperparameters and scale poorly to large-scale CTS. To overcome these limitations, we propose SEARCH, a joint, scalable framework, to automatically devise effective CTS forecasting models. Specifically, we encode each candidate architecture and accompanying hyperparameters into a joint graph representation. We introduce an efficient Architecture-Hyperparameter Comparator (AHC) to rank all architecture-hyperparameter pairs, and we then further evaluate the top-ranked pairs to select a final result. Extensive experiments on six benchmark datasets demonstrate that SEARCH not only eliminates manual efforts but also is capable of better performance than manually designed and existing automatically designed CTS models. In addition, it shows excellent scalability to large CTS.
Recently, Neural Architecture Search (NAS) has successfully identified neural network architectures that exceed human-designed ones on large-scale image classification. In this paper, we study NAS for semantic image segmentation. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network structure that controls the spatial resolution changes. This choice simplifies the search space, but becomes increasingly problematic for dense image prediction which exhibits a lot more network level architectural variations. Therefore, we propose to search the network level structure in addition to the cell level structure, which forms a hierarchical architecture search space. We present a network level search space that includes many popular designs, and develop a formulation that allows efficient gradient-based architecture search (3 P100 GPU days on Cityscapes images). We demonstrate the effectiveness of the proposed method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Auto-DeepLab, our architecture searched specifically for semantic image segmentation, attains state-of-the-art performance without any ImageNet pretraining. (Footnote: work done while an intern at Google.)