Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.
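The idea of a fixed sparse attention pattern can be illustrated with a minimal sketch. It is not ViTCoD's pruning-and-polarization procedure, which the abstract does not specify in detail. It assumes a simplified scheme in which an averaged attention map is thresholded once to keep the largest entries, and the resulting mask is reused for every input. The function names `build_fixed_mask` and `masked_attention`, and the choice to always keep the diagonal, are hypothetical.

```python
import math

def build_fixed_mask(avg_attn, sparsity):
    """Keep roughly the largest (1 - sparsity) fraction of an averaged attention map."""
    n = len(avg_attn)
    keep = max(1, round((1.0 - sparsity) * n * n))
    ranked = sorted(
        ((avg_attn[i][j], i, j) for i in range(n) for j in range(n)),
        reverse=True,
    )
    mask = [[False] * n for _ in range(n)]
    for _, i, j in ranked[:keep]:
        mask[i][j] = True
    for i in range(n):
        # Assumption: every token keeps attention to itself so no row is empty.
        mask[i][i] = True
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention that only computes the unmasked entries."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        cols = [j for j in range(n) if mask[i][j]]
        scores = [
            sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d) for j in cols
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        out.append(
            [sum(e / z * v[j][t] for e, j in zip(exps, cols)) for t in range(len(v[0]))]
        )
    return out
```

Because the mask does not depend on the input, it can be computed offline once, which is what separates this setting from the on-the-fly sparsity prediction described for NLP Transformers.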
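Deep learning techniques have achieved great success in image super-resolution (SR) with fixed scales. To improve real-world applicability, many models have also been proposed to restore SR images with arbitrary scale factors, including asymmetric ones, where images are resized to different scales along the horizontal and vertical directions. Most models are optimized only for the unidirectional upscaling task and assume a predefined downscaling kernel for the low-resolution (LR) inputs, whereas recent models based on invertible neural networks (INNs) significantly increase upscaling accuracy by jointly optimizing the downscaling and upscaling cycle. However, limited by the INN architecture, these models are constrained to fixed integer scale factors and require one model per scale. In this work, a simple and effective invertible arbitrary rescaling network (IARN) is proposed that achieves arbitrary image rescaling by training only one model, without increasing model complexity. Using innovative components such as position-aware scale encoding and preemptive channel splitting, the network is optimized to convert the non-invertible rescaling cycle into an effectively invertible process. It is shown to achieve state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in the LR outputs. It is also shown to perform well on tests with asymmetric scales using the same network architecture.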
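Semantic segmentation is a popular research topic in computer vision, and many efforts on it have produced impressive results. In this paper, we aim to search for an optimal network structure that can run in real time for this problem. To this end, we jointly search the depth, channels, dilation rate, and feature spatial resolution, which yields a search space of about 2.78*10^324 possible choices. To handle such a large search space, we leverage differentiable architecture search methods. However, the architecture parameters searched by existing differentiable methods need to be discretized, which causes a discretization gap between the architecture parameters found by the differentiable method and the discretized version used as the final solution of the architecture search. We therefore alleviate the discretization gap from the new perspective of solution space regularization. Specifically, a novel Solution Space Regularization (SSR) loss is first proposed to effectively encourage the supernet to converge to its discrete counterpart. Then, a new hierarchical and progressive solution space shrinking method is proposed to further improve search efficiency. In addition, we show theoretically that optimizing the SSR loss is equivalent to L_0-norm regularization, which accounts for the improved search-evaluation gap. Comprehensive experiments show that the proposed search scheme can efficiently find an optimal network structure that achieves a very fast segmentation speed (175 FPS) with a small model size (1 M) while maintaining comparable accuracy.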
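The discretization gap mentioned in this abstract can be made concrete with a small sketch. It does not implement the SSR loss, whose form the abstract does not give. It assumes the common differentiable-search setup in which each choice has a continuous architecture parameter, the supernet mixes choices with softmax weights, and the final architecture keeps only the arg-max choice. The helper names and the L1 distance used as a gap measure are illustrative assumptions.

```python
import math

def softmax(alphas):
    """Continuous mixing weights over candidate choices."""
    top = max(alphas)
    exps = [math.exp(a - top) for a in alphas]
    z = sum(exps)
    return [e / z for e in exps]

def discretize(probs):
    """One-hot vector that keeps only the highest-weighted choice."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def discretization_gap(alphas):
    """L1 distance between the soft weights and their one-hot version (assumed measure)."""
    probs = softmax(alphas)
    one_hot = discretize(probs)
    return sum(abs(p - o) for p, o in zip(probs, one_hot))
```

Under this measure, two equally weighted choices give a gap of 1.0, while a supernet whose weights are already nearly one-hot gives a gap close to 0, which is the regime a regularizer of this kind would push toward.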
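Neural architecture search (NAS) has demonstrated remarkable success in finding efficient deep neural networks (DNNs) within a given supernet. In parallel, the lottery ticket hypothesis has shown that DNNs contain small subnetworks that can be trained from scratch to reach accuracy comparable to or higher than that of the original DNNs. As a result, it is now common practice to develop efficient DNNs through a pipeline that first searches and then prunes. However, doing so usually requires a search-train-prune-retrain process and thus incurs prohibitive computational cost. In this paper, we discover for the first time that efficient DNNs and their lottery subnetworks (i.e., lottery tickets) can be identified directly from a supernet, which we term SuperTickets, through a two-in-one training scheme that jointly performs architecture search and parameter pruning. Moreover, we develop a progressive and unified SuperTickets identification strategy that allows the connectivity of subnetworks to change during supernet training, achieving better accuracy-efficiency trade-offs than conventional sparse training. Finally, we evaluate whether SuperTickets identified on one task transfer well to other tasks, validating their potential for handling multiple tasks simultaneously. Extensive experiments and ablation studies on three tasks and four benchmark datasets show that the proposed SuperTickets achieve better accuracy-efficiency trade-offs than typical NAS and pruning pipelines. Code and pretrained models are available at https://github.com/rice-eic/supertickets.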
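Few-shot fine-grained learning aims to classify a query image into one of a set of support categories that differ only in fine-grained details. Although learning the local differences of different objects with deep neural networks has been successful, how to exploit query-support cross-image object semantic relations in Transformer-based architectures remains under-explored in the few-shot fine-grained scenario. In this work, we propose a Transformer-based double-helix model, namely HelixFormer, to mine cross-image object semantic relations in a bidirectional and symmetrical manner. HelixFormer consists of two steps: 1) a Relation Mining Process (RMP) across the different branches, and 2) a Representation Enhancement Process (REP) within each branch. With the designed RMP, each branch can extract fine-grained object-level Cross-image Semantic Relation Maps (CSRMs) using information from the other branch, ensuring better cross-image interaction in semantically related local object regions. Furthermore, with the aid of the CSRMs, the developed REP strengthens the extracted features of the discovered semantically related local regions in each branch, improving the model's ability to distinguish subtle feature differences between fine-grained objects. Extensive experiments on five public fine-grained benchmarks show that HelixFormer effectively enhances cross-image object semantic relation matching for recognizing fine-grained objects, achieving better performance than most state-of-the-art methods in both the 1-shot and 5-shot scenarios. Our code is available at: https://github.com/jiakangyuan/helixformer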
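Neural networks (NNs) with intensive multiplications (e.g., convolutions and Transformers) are capable yet power-hungry, which impedes their wider deployment on resource-constrained devices. Multiplication-free networks, which follow a common practice in energy-efficient hardware implementation and parameterize NNs with more efficient operators (e.g., bitwise shifts and additions), have therefore gained growing attention. However, multiplication-free networks usually underperform their multiplication-based counterparts in terms of achieved accuracy. To this end, this work advocates hybrid NNs that combine powerful yet costly multiplications with efficient yet less powerful operators to get the best of both worlds, and proposes ShiftAddNAS, which can automatically search for more accurate and more efficient NNs. ShiftAddNAS highlights two enablers. Specifically, it integrates (1) the first hybrid search space that incorporates both multiplication-based and multiplication-free operators, facilitating the development of hybrid NNs that are both accurate and efficient; and (2) a novel weight sharing strategy that enables effective weight sharing among operators that follow heterogeneous distributions (e.g., Gaussian for convolutions vs. Laplacian for add operators), leading simultaneously to a largely reduced supernet size and better searched networks. Extensive experiments and ablation studies on various models, datasets, and tasks consistently validate the efficacy of ShiftAddNAS, e.g., achieving up to +4.7% higher accuracy or a +4.9 better BLEU score compared with state-of-the-art NNs, while delivering up to 93% or 69% energy and latency savings, respectively. Code and pretrained models are available at https://github.com/rice-eic/shiftaddnas.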
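To illustrate why shift-and-add operators avoid multiplications, here is a minimal sketch assuming integer activations and weights rounded to signed powers of two. It is a generic illustration of the shift idea, not the operators or search space of ShiftAddNAS. The function names are hypothetical, and right shifts floor the result, so products with fractional weights are approximate.

```python
import math

def quantize_to_power_of_two(w):
    """Return (sign, exponent) such that sign * 2**exponent approximates w."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))  # nearest power of two in the log domain
    return sign, exponent

def shift_mul(x, sign, exponent):
    """Multiply integer x by sign * 2**exponent using only shifts and negation."""
    if sign == 0:
        return 0
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return shifted if sign > 0 else -shifted

def shift_add_dot(xs, weights):
    """Dot product in which each multiplication is replaced by a shift and an add."""
    total = 0
    for x, w in zip(xs, weights):
        sign, exponent = quantize_to_power_of_two(w)
        total += shift_mul(x, sign, exponent)
    return total
```

For example, `shift_add_dot([8, 16, 4], [0.5, -0.25, 2.0])` evaluates `(8 >> 1) - (16 >> 2) + (4 << 1)`, which matches the exact dot product because every weight is already a power of two.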
Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references. Due to the scarcity of human-labeled QE data, previous works attempted to utilize the abundant unlabeled parallel corpora to produce additional training data with pseudo labels. In this paper, we demonstrate a significant gap between parallel data and real QE data: for QE data, it is strictly guaranteed that the source side is original texts and the target side is translated (namely translationese). However, for parallel data, it is indiscriminate and the translationese may occur on either source or target side. We compare the impact of parallel data with different translation directions in QE data augmentation, and find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart. Moreover, since the WMT corpus lacks direction information for each parallel sentence, we train a classifier to distinguish source- and target-original bitext, and carry out an analysis of their difference in both style and domain. Together, these findings suggest using source-original parallel data for QE data augmentation, which brings a relative improvement of up to 4.0% and 6.4% compared to undifferentiated data on sentence- and word-level QE tasks respectively.
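We describe JD Explore Academy's submission to the WMT 2022 shared general translation task. We participated in all high-resource tracks and one medium-resource track, including Chinese-English, German-English, Czech-English, Russian-English, and Japanese-English. We push the limit of our previous work, bidirectional training for translation, by scaling up two main factors, namely language pairs and model size, resulting in the Vega-MT system. As for language pairs, we extend the "bidirectional" setting to a "multidirectional" one that covers all participating languages, in order to exploit common knowledge across languages and transfer it to the downstream bilingual tasks. As for model size, we scale the Transformer up to an extremely large model with nearly 4.7 billion parameters to fully enhance the model capacity of Vega-MT. In addition, we adopt data augmentation strategies, such as cycle translation for monolingual data and bidirectional self-training for bilingual and monolingual data, to comprehensively exploit both kinds of data. To adapt Vega-MT to the general-domain test set, generalization tuning is designed. According to the official automatic scores of the constrained systems, in terms of the sacreBLEU shown in Figure 1, we obtained 1st place on {Zh-En (33.5), En-Zh (49.7), De-En (33.7), En-De (37.8), Cs-En (54.9), En-Cs (41.4), and En-Ru (32.7)}, 2nd place on {Ru-En (45.1) and Ja-En (25.6)}, and 3rd place on {En-Ja (41.5)}; in terms of COMET, we obtained 1st place on {Zh-En (45.1), En-Zh (61.7), De-En (58.0), En-De (63.2), Cs-En (74.7), Ru-En (64.9), En-Ru (69.6), and En-Ja (65.1)} and 2nd place on {En-Cs (95.3) and Ja-En (40.6)}. Models will be released through GitHub and the OmniForce platform to facilitate the MT community.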
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of the multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.