Feature reuse has been a key technique in lightweight convolutional neural network (CNN) design. Current methods usually utilize a concatenation operator to keep large channel numbers cheaply (thus large network capacity) by reusing feature maps from other layers. Although concatenation is parameter- and FLOPs-free, its computational cost on hardware devices is non-negligible. To address this, this paper provides a new perspective to realize feature reuse via the structural re-parameterization technique. A novel hardware-efficient RepGhost module is proposed for implicit feature reuse via re-parameterization, instead of using the concatenation operator. Based on the RepGhost module, we develop our efficient RepGhost bottleneck and RepGhostNet. Experiments on ImageNet and COCO benchmarks demonstrate that the proposed RepGhostNet is much more effective and efficient than GhostNet and MobileNetV3 on mobile devices. Specifically, our RepGhostNet surpasses GhostNet 0.5x by 2.5% Top-1 accuracy on the ImageNet dataset with fewer parameters and comparable latency on an ARM-based mobile phone.
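As a rough illustration of structural re-parameterization in general (a minimal sketch, not the RepGhost implementation), the pure-Python example below folds a two-branch training-time block into a single convolution for inference. It assumes a single 1-D channel, a 3-tap convolution followed by batch normalization in one branch, and batch normalization on the identity in the other; the function names and all numeric values are hypothetical.

```python
import math

def bn_affine(gamma, beta, mean, var, eps=1e-5):
    """Return (scale, shift) such that inference-time BN(x) == scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    return scale, beta - mean * scale

def conv1d(x, kernel, bias=0.0):
    """'Same' 1-D cross-correlation with zero padding, plus a scalar bias."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(len(x))]

def two_branch(x, kernel, bn_conv, bn_id):
    """Training-time form: BN(conv(x)) + BN(x)."""
    s1, b1 = bn_affine(*bn_conv)
    s2, b2 = bn_affine(*bn_id)
    return [s1 * c + b1 + s2 * v + b2 for c, v in zip(conv1d(x, kernel), x)]

def fuse(kernel, bn_conv, bn_id):
    """Inference-time form: fold both branches into one kernel and one bias."""
    s1, b1 = bn_affine(*bn_conv)
    s2, b2 = bn_affine(*bn_id)
    fused = [s1 * k for k in kernel]
    fused[len(kernel) // 2] += s2  # the identity branch becomes a centre tap
    return fused, b1 + b2

# Hypothetical input and parameters (gamma, beta, running mean, running var).
x = [1.0, -2.0, 0.5, 3.0]
kernel = [0.2, -0.5, 0.1]
bn_conv = (1.5, 0.1, 0.3, 2.0)
bn_id = (0.8, -0.2, 0.1, 0.5)

fused_kernel, fused_bias = fuse(kernel, bn_conv, bn_id)
reference = two_branch(x, kernel, bn_conv, bn_id)
deployed = conv1d(x, fused_kernel, fused_bias)
```

Both forms compute the same output, so the extra branch costs nothing at inference time; this is the property that lets an addition-based design avoid a concatenation operator.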
Translated by Google Translate
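Deploying convolutional neural networks (CNNs) on mobile devices is difficult due to limited memory and computation resources. We aim to design efficient neural networks for heterogeneous devices, including CPUs and GPUs, by exploiting the redundancy in feature maps, which has rarely been investigated in neural architecture design. For CPU-like devices, we propose a novel CPU-efficient Ghost (C-Ghost) module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of linear transformations at low cost to generate many ghost feature maps that can fully reveal the information underlying the intrinsic features. The proposed C-Ghost module can be taken as a plug-and-play component to upgrade existing convolutional neural networks. C-Ghost bottlenecks are designed to stack C-Ghost modules, and then the lightweight C-GhostNet can be easily established. We further consider efficient networks for GPU devices. Without involving too many GPU-inefficient operations (e.g., depth-wise convolution) in a building stage, we propose to utilize stage-wise feature redundancy to formulate a GPU-efficient Ghost (G-Ghost) stage structure. The features in a stage are split into two parts: the first part is processed using the original block with fewer output channels to generate intrinsic features, and the other part is generated using cheap operations by exploiting stage-wise redundancy. Experiments conducted on benchmarks demonstrate the effectiveness of the proposed C-Ghost module and G-Ghost stage. C-GhostNet and G-GhostNet achieve the optimal trade-off between accuracy and latency for CPU and GPU, respectively. Code is available at https://github.com/huawei-noah/cv-backbones.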
Deploying convolutional neural networks (CNNs) on embedded devices is difficult due to the limited memory and computation resources. The redundancy in feature maps is an important characteristic of successful CNNs, but has rarely been investigated in neural architecture design. This paper proposes a novel Ghost module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of linear transformations at low cost to generate many ghost feature maps that could fully reveal the information underlying the intrinsic features. The proposed Ghost module can be taken as a plug-and-play component to upgrade existing convolutional neural networks. Ghost bottlenecks are designed to stack Ghost modules, and then the lightweight GhostNet can be easily established. Experiments conducted on benchmarks demonstrate that the proposed Ghost module is an impressive alternative to convolution layers in baseline models, and our GhostNet can achieve higher recognition performance (e.g. 75.7% top-1 accuracy) than MobileNetV3 with similar computational cost on the ImageNet ILSVRC-2012 classification dataset. Code is available at https://github.com/huawei-noah/ghostnet.
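To make the saving from cheap operations concrete, here is a back-of-the-envelope sketch that compares the weight count of an ordinary convolution with that of a Ghost-style layer. It assumes that a fraction 1/s of the output maps are intrinsic (produced by an ordinary convolution) and that the remaining maps come from per-channel d x d linear operations; the function names, the ratio `s`, the kernel sizes, and the channel counts are illustrative assumptions, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k, s=2, d=3):
    """Weights in a Ghost-style layer.

    c_out // s intrinsic maps come from an ordinary k x k convolution;
    each of the remaining (s - 1) * (c_out // s) ghost maps comes from
    one cheap per-channel d x d linear operation.
    """
    if c_out % s != 0:
        raise ValueError("c_out must be divisible by s")
    intrinsic = c_out // s
    primary = conv_params(c_in, intrinsic, k)
    cheap = (s - 1) * intrinsic * d * d
    return primary + cheap

# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernels.
standard = conv_params(64, 128, 3)
ghost = ghost_params(64, 128, 3, s=2, d=3)
compression = standard / ghost
```

With these numbers the Ghost-style layer needs roughly half the weights, i.e. the compression ratio is close to `s` whenever the cheap operations are much smaller than the primary convolution.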
Currently, neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.
We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet [12] on the ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ∼13× actual speedup over AlexNet while maintaining comparable accuracy.
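The channel shuffle operation can be sketched on a flat list of channel indices: view the channels as a (groups, channels-per-group) grid, transpose it, and flatten it again, so that the next group convolution sees channels from every group. This is a minimal pure-Python sketch; the function name and the six-channel example are assumptions, and a real implementation would apply the same permutation to tensors.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Equivalent to reshaping to (groups, per_group), transposing to
    (per_group, groups), and flattening back to one dimension.
    """
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle(list(range(6)), groups=2)
```

After the shuffle, each contiguous block of channels contains one member of every original group.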
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small, which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder, Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state-of-the-art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 20% compared to MobileNetV2. MobileNetV3-Small is 6.6% more accurate compared to a MobileNetV2 model with comparable latency. MobileNetV3-Large detection is over 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 34% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
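Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with the latency of the network when deployed on a mobile device. Therefore, we perform an extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone, MobileOne, with an inference time under 1 ms on an iPhone 12 and 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance among efficient architectures while being faster on mobile devices. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks (image classification, object detection, and semantic segmentation), with significant improvements in latency and accuracy compared to existing efficient architectures when deployed on a mobile device.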
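In the pursuit of ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. Building resource-efficient general-purpose networks is of great interest because of their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection, and segmentation tasks reveal the merits of the proposed approach, which outperforms state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% and a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K. The code and models are publicly available at https://t.ly/_vu9.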
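Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and the model design, e.g., the attention mechanism, ViT-based models are generally slower than lightweight convolutional networks. Therefore, deploying ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves $79.2\%$ top-1 accuracy on ImageNet-1K with only $1.6$ ms inference latency on an iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2$\times 1.4$ ($1.6$ ms, $74.7\%$ top-1), and our largest model, EfficientFormer-L7, obtains $83.3\%$ accuracy with only $7.0$ ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.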
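Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.4 mAP (from 40.4 to 45.8) on COCO detection and by 8.2% mIoU (from 38.8% to 47.0%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and by 3.5% mIoU (from 45.2% to 48.7%) on ADE20K segmentation under similar latency. Code will be released soon.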
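In the past few years, significant improvements have been made in the field of neural architecture search (NAS). However, it is still challenging to search for efficient networks because of the gap between the constraint used during search and the real inference time. To search for a high-performance network with low inference time, several previous works set a computational complexity constraint for the search algorithm. However, many factors affect the speed of inference (e.g., FLOPs, MAC), and the correlation between any single indicator and latency is not strong. Recently, some re-parameterization (Rep) techniques have been proposed to convert multi-branch architectures into inference-friendly single-path architectures. Nevertheless, the multi-branch architectures are still human-defined and inefficient. In this work, we propose a new search space that is suitable for structural re-parameterization techniques. RepNAS, a one-stage NAS approach, is presented to efficiently search for the optimal diverse branch block (ODBB) of each layer under a branch number constraint. Our experimental results show that the searched ODBB can easily surpass the manually designed diverse branch block (DBB) with efficient training. Code and models will be available soon.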
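We present a multigrid-in-channels (MGIC) approach that tackles the quadratic growth of the number of parameters with respect to the number of channels in standard convolutional neural networks (CNNs). Our approach thereby addresses the redundancy in CNNs that is also exposed by the success of lightweight CNNs. Lightweight CNNs can achieve accuracy comparable to standard CNNs with fewer parameters; however, the number of weights still scales quadratically with the CNN's width. Our MGIC architectures replace each CNN block with an MGIC counterpart that utilizes a hierarchy of nested grouped convolutions of small group size to address this. Hence, our proposed architectures scale linearly with respect to the network's width while retaining full coupling of the channels as in standard CNNs. Our extensive experiments on image classification, segmentation, and point cloud classification show that applying this strategy to different architectures such as ResNet and MobileNetV3 reduces the number of parameters while obtaining similar or better accuracy.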
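We revisit the existing excellent Transformers from the perspective of practical application. Most of them are not even as efficient as the basic ResNet series and deviate from realistic deployment scenarios. This may be because the current criteria for measuring computation efficiency, such as FLOPs or parameters, are one-sided, sub-optimal, and hardware-insensitive. Thus, this paper directly treats the TensorRT latency on specific hardware as an efficiency metric, which provides more comprehensive feedback involving computational capacity, memory cost, and bandwidth. Based on a series of controlled experiments, this work derives four practical guidelines for TensorRT-oriented and deployment-friendly network design, e.g., early CNN and late Transformer at stage level, and early Transformer and late CNN at block level. Accordingly, a family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT. Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers across diverse visual tasks, e.g., image classification, object detection, and semantic segmentation. For example, at 82.7% ImageNet-1k top-1 accuracy, TRT-ViT is 2.7$\times$ faster than CSWin and 2.0$\times$ faster than Twins. On the MS-COCO object detection task, TRT-ViT achieves performance comparable to Twins, while the inference speed is increased by 2.8$\times$.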
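Although residual connections enable training very deep neural networks, they are not friendly for online inference due to their multi-branch topology. This encourages many researchers to design DNNs without residual connections at inference. For example, RepVGG re-parameterizes the multi-branch topology into a VGG-like (single-branch) model when deploying, showing great performance when the network is relatively shallow. However, RepVGG cannot transform ResNet to VGG equivalently, because re-parameterization methods can only be applied to linear blocks and the non-linear layers (ReLU) have to be put outside of the residual connection, which results in limited representation ability, especially for deeper networks. In this paper, we aim to remedy this problem and propose to remove the residual connection in a vanilla ResNet equivalently by a reserving and merging (RM) operation on the ResBlock. Specifically, the RM operation allows input feature maps to pass through the block while reserving their information and merges all the information at the end of each block, which can remove residual connections without changing the original output. As a plug-in method, the RM operation basically has three advantages: 1) its implementation makes it naturally friendly for high-ratio network pruning; 2) it helps break the depth limitation of RepVGG; 3) it leads to a network (RMNet) with a better accuracy-speed trade-off compared to ResNet and RepVGG. We believe the idea behind the RM operation can inspire many insights on model design for the community in the future. Code is available at: https://github.com/fxmeng/rmnet.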
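A better trade-off between accuracy and efficiency has been a challenging problem in object detection. In this work, we are dedicated to studying key optimizations and neural network architecture choices for object detection to improve accuracy and efficiency. We investigate the applicability of the anchor-free strategy to lightweight object detection models. We enhance the backbone structure and design a lightweight structure for the neck, which improves the feature extraction ability of the network. We improve the label assignment strategy and loss function to make training more stable and efficient. Through these optimizations, we create a new family of real-time object detectors, named PP-PicoDet, which achieves superior performance on object detection for mobile devices. Our models achieve better trade-offs between accuracy and latency compared to other popular models. PicoDet-S with only 0.99M parameters achieves 30.6% mAP, which is an absolute 4.8% improvement in mAP while reducing mobile CPU inference latency by 55% compared to YOLOX-Nano, and an absolute 7.1% improvement in mAP compared to NanoDet. It reaches 123 FPS on a mobile ARM CPU (using Paddle Lite) when the input size is 320. PicoDet-L with only 3.3M parameters achieves 40.9% mAP, which is an absolute 3.7% improvement in mAP while being 44% faster than YOLOv5s. As shown in Figure 1, our models far outperform the state-of-the-art results for lightweight object detection. Code and pre-trained models are available at https://github.com/paddlepaddle/paddledentions.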
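Recent progress in vision Transformers exhibits great success in various tasks, driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range, and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/hornet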
Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. Due to this, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices. However, existing approaches are too resource demanding for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS, surpass state-of-the-art models both designed manually and generated automatically. FBNet-B achieves 74.1% top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3[17] with similar accuracy. Despite higher accuracy and lower latency than MnasNet[20], we estimate FBNet-B's search cost is 420x smaller than MnasNet's, at only 216 GPU-hours. Searched for different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher accuracy than MobileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized FBNet, the iPhone-X-optimized model achieves a 1.4x speedup on an iPhone X. FBNet models are open-sourced at https://github.com/facebookresearch/mobile-vision.
* Work done while interning at Facebook.
Figure 1. Differentiable neural architecture search (DNAS) for ConvNet design. DNAS explores a layer-wise space in which each layer of a ConvNet can choose a different block. The search space is represented by a stochastic super net. The search process trains the stochastic super net using SGD to optimize the architecture distribution. Optimal architectures are sampled from the trained distribution. The latency of each operator is measured on target devices and used to compute the loss for the super net.
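The role of the per-operator latency measurements can be sketched as follows. Each layer holds one logit per candidate block, a softmax turns the logits into a distribution, and the expected latency of the super net is the probability-weighted sum of measured block latencies, which is a smooth function of the logits and can therefore enter a gradient-based loss. This is a simplified pure-Python illustration under stated assumptions: the latency table and logits are hypothetical, taking the most probable block per layer stands in for sampling from the trained distribution, and how the latency term is combined with the task loss is left out because the text above does not specify it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_latency(arch_logits, latency_table):
    """Sum over layers of the probability-weighted latency of each candidate block."""
    total = 0.0
    for logits, latencies in zip(arch_logits, latency_table):
        probs = softmax(logits)
        total += sum(p * t for p, t in zip(probs, latencies))
    return total

def most_likely_architecture(arch_logits):
    """Index of the highest-probability block in each layer."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in arch_logits]

# Hypothetical measured latencies (ms): 2 layers x 3 candidate blocks.
latency_ms = [[1.0, 2.0, 4.0], [0.5, 1.5, 3.0]]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
trained_logits = [[2.0, 0.0, -1.0], [0.0, 3.0, 0.0]]

before = expected_latency(uniform_logits, latency_ms)
after = expected_latency(trained_logits, latency_ms)
chosen = most_likely_architecture(trained_logits)
```

Because the lookup table is measured once per operator on the target device, evaluating the latency term during search requires no further on-device runs.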
Recently, neural architecture search (NAS) has achieved great success on classification tasks for mobile devices. The backbone network for object detection is usually obtained from the image classification task. However, an architecture searched on the classification task is sub-optimal because of the gap between image classification and object detection. Meanwhile, work on backbone network architecture search for mobile object detection is limited, mainly because the backbone always requires expensive ImageNet pre-training. Accordingly, it is necessary to study network architecture search for mobile object detection without expensive pre-training. In this work, we propose a backbone network architecture search algorithm for mobile object detection, an evolutionary optimization method based on non-dominated sorting for NAS scenarios. It can quickly search for a backbone network architecture within given constraints, and it better addresses the sub-optimality of linearly combining accuracy and computational cost. The proposed approach can search backbone networks with different depths, widths, or expansion sizes via a weight mapping technique, making it possible to use NAS for mobile detection tasks much more efficiently. In our experiments, we verify the effectiveness of the proposed approach on YoloX-Lite, a lightweight version of the object detection framework. Under similar computational complexity, the backbone network architecture we search for achieves 2.0% higher mAP than MobileDet. Our improved backbone network can reduce the computational effort while improving the accuracy of the object detection network. To prove its effectiveness, a series of ablation studies has been carried out and the working mechanism has been analyzed in detail.
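In this work, we propose a methodology to accurately evaluate and compare the performance of efficient neural network building blocks for computer vision in a hardware-aware manner. Our comparison uses Pareto fronts based on networks randomly sampled from a design space to capture the underlying accuracy/complexity trade-offs. We show that our approach matches the information obtained by previous comparison paradigms, but provides more insight into the relationship between hardware cost and accuracy. We use our methodology to analyze different building blocks and evaluate their performance on a range of embedded hardware platforms. This highlights the importance of benchmarking building blocks as a preselection step in the design process of a neural network. We show that choosing the right building block can speed up inference by up to a factor of 2x on specific hardware ML accelerators.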
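A Pareto front over sampled networks can be extracted with a single sorted pass, as in the minimal sketch below. It assumes each sampled network is summarized by a (hardware cost, accuracy) pair where lower cost and higher accuracy are better; the function name and the sample values are hypothetical and not taken from the work described above.

```python
def pareto_front(points):
    """Return the non-dominated (cost, accuracy) pairs, sorted by cost.

    A point is kept only if it is more accurate than every point that is
    at least as cheap.
    """
    front = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; on ties, visit the more accurate point first.
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            front.append((cost, accuracy))
            best_accuracy = accuracy
    return front

# Hypothetical samples: (latency in ms, top-1 accuracy in %).
samples = [(10, 70.0), (12, 69.0), (15, 74.0), (15, 73.0), (20, 73.5), (8, 65.0)]
front = pareto_front(samples)
```

Comparing the fronts obtained for two building blocks on the same hardware then shows which block offers the better trade-off at each cost level, rather than at a single operating point.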
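Downsampling is widely adopted to achieve a good trade-off between accuracy and latency for visual recognition. Unfortunately, the commonly used pooling layers are not learned and thus cannot preserve important information. As another dimension reduction method, adaptive sampling weights and processes the regions that are relevant to the task, and is thus able to better preserve useful information. However, the use of adaptive sampling has been limited to certain layers. In this paper, we show that using adaptive sampling in the building blocks of a deep neural network can improve its efficiency. In particular, we propose SSBNet, which is built by inserting sampling layers repeatedly into existing networks such as ResNet. Experimental results show that the proposed SSBNet can achieve competitive image classification and object detection performance on the ImageNet and COCO datasets. For example, SSB-ResNet-RS-200 achieves 82.6% accuracy on the ImageNet dataset, which is 0.6% higher than the baseline ResNet-RS-152 with similar complexity. Visualization shows the advantage of SSBNet in allowing different layers to focus on different positions, and ablation studies further validate the advantage of adaptive sampling over uniform methods.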