高效的视频架构是在具有有限计算资源的设备上部署视频识别系统的关键。不幸的是,现有的视频架构通常是计算密集的,不适合这些应用。最近的X3D工作通过沿着多个轴扩展手工制作的图像架构,介绍了一系列高效的视频模型系列,例如空间,时间,宽度和深度。虽然在概念上的大空间中操作,但x3d一次搜索一个轴,并且仅探索了一组总共30个架构,这不足以探索空间。本文绕过了现有的2D架构,并直接搜索了一个细粒度空间中的3D架构,其中共同搜索了块类型,滤波器编号,扩展比和注意力块。采用概率性神经结构搜索方法来有效地搜索如此大的空间。动力学和某事物的评估 - 某事-V2基准确认我们的AutoX3D模型在类似的拖鞋中的准确性高达1.3%的准确性优于现有的模型,并在达到类似的性能时降低计算成本高达X1.74。
translated by 谷歌翻译
Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. Due to this, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices. However, existing approaches are too resource demanding for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize Con-vNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS surpass state-of-the-art models both designed manually and generated automatically. FBNet-B achieves 74.1% top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3[17] with similar accuracy. Despite higher accuracy and lower latency than MnasNet[20], we estimate FBNet-B's search cost is 420x smaller than MnasNet's, at only 216 GPUhours. Searched for different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher accuracy than Mo-bileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized FBNet, the iPhone-Xoptimized model achieves a 1.4x speedup on an iPhone X. FBNet models are open-sourced at https://github. com/facebookresearch/mobile-vision. * Work done while interning at Facebook.… Figure 1. Differentiable neural architecture search (DNAS) for ConvNet design. DNAS explores a layer-wise space that each layer of a ConvNet can choose a different block. The search space is represented by a stochastic super net. The search process trains the stochastic super net using SGD to optimize the architecture distribution. Optimal architectures are sampled from the trained distribution. The latency of each operator is measured on target devices and used to compute the loss for the super net.
translated by 谷歌翻译
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8× and 5.5× fewer multiply-adds and parameters for similar accuracy as previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code will be available at: https: //github.com/facebookresearch/SlowFast.
translated by 谷歌翻译
大多数对象检测框架都使用最初设计用于图像分类的主链体系结构,通常在Imagenet上具有预训练的参数。但是,图像分类和对象检测本质上是不同的任务,并且不能保证分类的最佳主链也适用于对象检测。最近的神经体系结构搜索(NAS)研究表明,自动设计专门用于对象检测的骨干有助于提高整体准确性。在本文中,我们引入了一种神经体系结构适应方法,该方法可以优化给定的主链以进行检测目的,同时仍允许使用预训练的参数。我们建议除了每个块的输出通道尺寸外,还通过搜索特定操作和层数来调整微体系结构。重要的是要找到最佳的通道深度,因为它极大地影响了特征表示功能和计算成本。我们使用搜索的主链进行对象检测进行实验,并证明我们的主链在可可数据集上的手动设计和搜索的最新骨干均优于手动设计和搜索的骨干。
translated by 谷歌翻译
现有的光流估计器通常采用通常用于图像分类的网络体系结构作为提取人均功能的编码器。但是,由于任务之间的自然差异,用于图像分类的架构可能是最佳的流量估计。为了解决此问题,我们建议一种名为Falownas的神经体系结构搜索方法,以自动找到用于流估计任务的更好的编码器体系结构。我们首先设计一个合适的搜索空间,包括各种卷积运算符,并构建一个体重共享的超级网络,以有效评估候选体系结构。然后,为了更好地训练超级网络,我们提出了特征对齐蒸馏,该蒸馏利用训练有素的流量估计器来指导超级网络的训练。最后,利用资源约束的进化算法找到最佳体系结构(即子网络)。实验结果表明,从超级网络继承的权重的发现的结构达到了4.67 \%f1-able kitti上的误差,这是RAFT基线的8.4 \%降低,超过了先进的手工制作的型号GMA和AGFlow,同时降低模型的复杂性和延迟。源代码和训练有素的模型将在https://github.com/vdigpku/flownas中发布。
translated by 谷歌翻译
深度学习技术在各种任务中都表现出了出色的有效性,并且深度学习具有推进多种应用程序(包括在边缘计算中)的潜力,其中将深层模型部署在边缘设备上,以实现即时的数据处理和响应。一个关键的挑战是,虽然深层模型的应用通常会产生大量的内存和计算成本,但Edge设备通常只提供非常有限的存储和计算功能,这些功能可能会在各个设备之间差异很大。这些特征使得难以构建深度学习解决方案,以释放边缘设备的潜力,同时遵守其约束。应对这一挑战的一种有希望的方法是自动化有效的深度学习模型的设计,这些模型轻巧,仅需少量存储,并且仅产生低计算开销。该调查提供了针对边缘计算的深度学习模型设计自动化技术的全面覆盖。它提供了关键指标的概述和比较,这些指标通常用于量化模型在有效性,轻度和计算成本方面的水平。然后,该调查涵盖了深层设计自动化技术的三类最新技术:自动化神经体系结构搜索,自动化模型压缩以及联合自动化设计和压缩。最后,调查涵盖了未来研究的开放问题和方向。
translated by 谷歌翻译
Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blcoks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together. * The work was done during an internship at SenseTime.
translated by 谷歌翻译
语义细分是计算机视觉中的一个流行研究主题,并且在其上做出了许多努力,结果令人印象深刻。在本文中,我们打算搜索可以实时运行此问题的最佳网络结构。为了实现这一目标,我们共同搜索深度,通道,扩张速率和特征空间分辨率,从而导致搜索空间约为2.78*10^324可能的选择。为了处理如此大的搜索空间,我们利用差异架构搜索方法。但是,需要离散地使用使用现有差异方法搜索的体系结构参数,这会导致差异方法找到的架构参数与其离散版本作为体系结构搜索的最终解决方案之间的离散差距。因此,我们从解决方案空间正则化的创新角度来缓解离散差距的问题。具体而言,首先提出了新型的解决方案空间正则化(SSR)损失,以有效鼓励超级网络收敛到其离散。然后,提出了一种新的分层和渐进式解决方案空间缩小方法,以进一步实现较高的搜索效率。此外,我们从理论上表明,SSR损失的优化等同于L_0-NORM正则化,这说明了改善的搜索评估差距。综合实验表明,提出的搜索方案可以有效地找到最佳的网络结构,该结构具有较小的模型大小(1 m)的分割非常快的速度(175 fps),同时保持可比较的精度。
translated by 谷歌翻译
We revisit the one-shot Neural Architecture Search (NAS) paradigm and analyze its advantages over existing NAS approaches. Existing one-shot method, however, is hard to train and not yet effective on large scale datasets like ImageNet. This work propose a Single Path One-Shot model to address the challenge in the training. Our central idea is to construct a simplified supernet, where all architectures are single paths so that weight co-adaption problem is alleviated. Training is performed by uniform path sampling. All architectures (and their weights) are trained fully and equally. Comprehensive experiments verify that our approach is flexible and effective. It is easy to train and fast to search. It effortlessly supports complex search spaces (e.g., building blocks, channel, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency). It is thus convenient to use for various needs. It achieves start-of-the-art performance on the large dataset ImageNet.Equal contribution. This work is done when Haoyuan Mu and Zechun Liu are interns at MEGVII Technology.
translated by 谷歌翻译
自我关注架构被出现为最近提高视力任务表现的最新进步。手动确定自我关注网络的架构依赖于专家的经验,无法自动适应各种场景。同时,神经结构搜索(NAS)显着推出了神经架构的自动设计。因此,需要考虑使用NAS方法自动发现更好的自我关注架构。然而,由于基于细胞的搜索空间统一和缺乏长期内容依赖性,直接使用现有的NAS方法来搜索关注网络是具有挑战性的。为了解决这个问题,我们提出了一种基于全部关注的NAS方法。更具体地,构造阶段明智的搜索空间,其允许为网络的不同层采用各种关注操作。为了提取全局特征,提出了一种使用上下文自动回归来发现全部关注架构的自我监督的搜索算法。为了验证所提出的方法的功效,我们对各种学习任务进行了广泛的实验,包括图像分类,细粒度的图像识别和零拍摄图像检索。经验结果表明,我们的方法能够发现高性能,全面关注架构,同时保证所需的搜索效率。
translated by 谷歌翻译
自2020年推出以来,Vision Transformers(VIT)一直在稳步打破许多视觉任务的记录,通常被描述为``全部'''替换Convnet。而且对于嵌入式设备不友好。此外,最近的研究表明,标准的转话如果经过重新设计和培训,可以在准确性和可伸缩性方面与VIT竞争。在本文中,我们采用Convnet的现代化结构来设计一种新的骨干,以采取行动,以采取行动特别是我们的主要目标是为工业产品部署服务,例如仅支持标准操作的FPGA董事会。因此,我们的网络仅由2D卷积组成,而无需使用任何3D卷积,远程注意插件或变压器块。在接受较少的时期(5x-10x)训练时,我们的骨干线超过了(2+1)D和3D卷积的方法,并获得可比的结果s在两个基准数据集上具有vit。
translated by 谷歌翻译
最近,已经成功地应用于各种遥感图像(RSI)识别任务的大量基于深度学习的方法。然而,RSI字段中深度学习方法的大多数现有进步严重依赖于手动设计的骨干网络提取的特征,这严重阻碍了由于RSI的复杂性以及先前知识的限制而受到深度学习模型的潜力。在本文中,我们研究了RSI识别任务中的骨干架构的新设计范式,包括场景分类,陆地覆盖分类和对象检测。提出了一种基于权重共享策略和进化算法的一拍架构搜索框架,称为RSBNet,其中包括三个阶段:首先,在层面搜索空间中构造的超空网是在自组装的大型中预先磨削 - 基于集合单路径培训策略进行缩放RSI数据集。接下来,预先培训的SuperNet通过可切换识别模块配备不同的识别头,并分别在目标数据集上进行微调,以获取特定于任务特定的超网络。最后,我们根据没有任何网络训练的进化算法,搜索最佳骨干架构进行不同识别任务。对于不同识别任务的五个基准数据集进行了广泛的实验,结果显示了所提出的搜索范例的有效性,并证明搜索后的骨干能够灵活地调整不同的RSI识别任务并实现令人印象深刻的性能。
translated by 谷歌翻译
神经结构搜索(NAS)已被广泛采用设计准确,高效的图像分类模型。但是,将NAS应用于新的计算机愿景任务仍然需要大量的努力。这是因为1)以前的NAS研究已经过度优先考虑图像分类,同时在很大程度上忽略了其他任务; 2)许多NAS工作侧重于优化特定于任务特定的组件,这些组件不能有利地转移到其他任务; 3)现有的NAS方法通常被设计为“Proxyless”,需要大量努力与每个新任务的培训管道集成。为了解决这些挑战,我们提出了FBNetv5,这是一个NAS框架,可以在各种视觉任务中寻找神经架构,以降低计算成本和人力努力。具体而言,我们设计1)一个简单但包容性和可转换的搜索空间; 2)用目标任务培训管道解开的多址搜索过程; 3)一种算法,用于同时搜索具有计算成本不可知的多个任务的架构到任务数。我们评估所提出的FBNetv5目标三个基本视觉任务 - 图像分类,对象检测和语义分割。 FBNETV5在单一搜索中搜索的模型在所有三个任务中都表现优于先前的议定书 - 现有技术:图像分类(例如,与FBNetv3相比,在与FBNetv3相比的同一拖鞋下的1 + 1.3%Imageet Top-1精度。 (例如,+ 1.8%较高的Ade20k Val。Miou比SegFormer为3.6倍的拖鞋),对象检测(例如,+ 1.1%Coco Val。与yolox相比,拖鞋的1.2倍的地图。
translated by 谷歌翻译
在NAS领域中,可分构造的架构搜索是普遍存在的,因为它的简单性和效率,其中两个范例,多路径算法和单路径方法主导。多路径框架(例如,DARTS)是直观的,但遭受内存使用和培训崩溃。单路径方法(例如,e.g.gdas和proxylesnnas)减轻了内存问题并缩小了搜索和评估之间的差距,但牺牲了性能。在本文中,我们提出了一种概念上简单的且有效的方法来桥接这两个范式,称为相互意识的子图可差架构搜索(MSG-DAS)。我们框架的核心是一个可分辨动的Gumbel-Topk采样器,它产生多个互斥的单路径子图。为了缓解多个子图形设置所带来的Severer Skip-Connect问题,我们提出了一个Dropblock-Identity模块来稳定优化。为了充分利用可用的型号(超级网和子图),我们介绍了一种记忆高效的超净指导蒸馏,以改善培训。所提出的框架击中了灵活的内存使用和搜索质量之间的平衡。我们展示了我们在想象中和CIFAR10上的方法的有效性,其中搜索的模型显示了与最近的方法相当的性能。
translated by 谷歌翻译
Conventional neural architecture search (NAS) approaches are based on reinforcement learning or evolutionary strategy, which take more than 3000 GPU hours to find a good model on CIFAR-10. We propose an efficient NAS approach learning to search by gradient descent. Our approach represents the search space as a directed acyclic graph (DAG). This DAG contains billions of sub-graphs, each of which indicates a kind of neural architecture. To avoid traversing all the possibilities of the sub-graphs, we develop a differentiable sampler over the DAG. This sampler is learnable and optimized by the validation loss after training the sampled architecture. In this way, our approach can be trained in an end-to-end fashion by gradient descent, named Gradient-based search using Differentiable Architecture Sampler (GDAS). In experiments, we can finish one searching procedure in four GPU hours on CIFAR-10, and the discovered model obtains a test error of 2.82% with only 2.5M parameters, which is on par with the state-of-the-art. Code is publicly available on GitHub: https://github.com/D-X-Y/NAS-Projects.
translated by 谷歌翻译
Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture -Channel-Separated Convolutional Network (CSN) -which is simple, efficient, yet accurate. On Sports1M, Kinetics, and Something-Something, our CSNs are comparable with or better than the state-of-the-art while being 2-3 times more efficient.
translated by 谷歌翻译
最近,变压器和多层感知器(MLP)体系结构在各种视觉任务上取得了令人印象深刻的结果。但是,如何有效地结合这些操作员形成高性能混合视觉体系结构仍然是一个挑战。在这项工作中,我们通过提出一种新型的统一体系结构搜索方法来研究卷积,变压器和MLP的可学习组合。我们的方法包含两个关键设计,以实现高性能网络的搜索。首先,我们以统一的形式对截然不同的可搜索运算符进行建模,从而使操作员能够用相同的配置参数进行表征。这样,总体搜索空间规模大大减少,总搜索成本变得负担得起。其次,我们提出上下文感知的倒数采样模块(DSM),以减轻不同类型的操作员之间的差距。我们提出的DSM能够更好地适应不同类型的操作员的功能,这对于识别高性能混合体系结构很重要。最后,我们将可配置的运算符和DSM集成到统一的搜索空间中,并使用基于增强学习的搜索算法进行搜索,以充分探索操作员的最佳组合。为此,我们搜索一个基线网络并扩大规模,以获得一个名为UNINET的模型系列,该模型的准确性和效率比以前的Convnets和Transformers更好。特别是,我们的UNET-B5在ImageNet上获得了84.9%的TOP-1精度,比效应网络-B7和Botnet-T7分别少了44%和55%。通过在Imagenet-21K上进行预处理,我们的UNET-B6获得了87.4%,表现优于SWIN-L,拖鞋少51%,参数减少了41%。代码可在https://github.com/sense-x/uninet上找到。
translated by 谷歌翻译
最近,自我关注操作员将卓越的性能作为视觉模型的独立构建块。然而,现有的自我关注模型通常是手动设计的,从CNN修改,并仅通过堆叠一个操作员而获得。很少探索相结合不同的自我关注操作员和卷积的更广泛的建筑空间。在本文中,我们探讨了具有权重共享神经结构搜索(NAS)算法的新颖建筑空间。结果架构被命名为Triomet,用于组合卷积,局部自我关注和全球(轴向)自我关注操作员。为了有效地搜索在这个巨大的建筑空间中,我们提出了分层采样,以便更好地培训超空网。此外,我们提出了一种新的重量分享策略,多头分享,专门针对多头自我关注运营商。我们搜索的Tri of将自我关注和卷积相结合优于所有独立的模型,在想象网分类上具有较少的拖鞋,自我关注比卷积更好。此外,在各种小型数据集上,我们观察对自我关注模型的劣等性能,但我们的小脚仍然能够匹配这种情况下的最佳操作员,卷积。我们的代码可在https://github.com/phj128/trionet提供。
translated by 谷歌翻译
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short-and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of subconvolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
translated by 谷歌翻译
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significantly gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-theart on Sports-1M, Kinetics, UCF101, and HMDB51.
translated by 谷歌翻译