Deep learning technology has made great progress in multi-view 3D reconstruction tasks. At present, most mainstream solutions establish the mapping between views and shape of an object by assembling the networks of 2D encoder and 3D decoder as the basic structure while they adopt different approaches to obtain aggregation of features from several views. Among them, the methods using attention-based fusion perform better and more stable than the others, however, they still have an obvious shortcoming -- the strong independence of each view during predicting the weights for merging leads to a lack of adaption of the global state. In this paper, we propose a global-aware attention-based fusion approach that builds the correlation between each branch and the global to provide a comprehensive foundation for weights inference. In order to enhance the ability of the network, we introduce a novel loss function to supervise the shape overall and propose a dynamic two-stage training strategy that can effectively adapt to all reconstructors with attention-based fusion. Experiments on ShapeNet verify that our method outperforms existing SOTA methods while the amount of parameters is far less than the same type of algorithm, Pix2Vox++. Furthermore, we propose a view-reduction method based on maximizing diversity and discuss the cost-performance tradeoff of our model to achieve a better performance when facing heavy input amount and limited computational cost.
translated by 谷歌翻译
与卷积神经网络(CNN)相比,视觉变压器(VIT)表现出了有希望的性能,但是VIT的训练比CNN难得多。在本文中,我们定义了几个指标,包括动态数据比例(DDP)和知识同化率(KAR),以研究训练过程,并将其分为三个时期:形成,增长和探索。特别是,在训练的最后阶段,我们观察到只有很小的训练示例用于优化模型。鉴于VIT的数据渴望的性质,我们提出了一个简单但重要的问题:在培训的每个阶段,是否有可能提供丰富的``有效''培训示例吗?为了解决这个问题,我们需要解决两个关键问题,即\ ie,如何衡量单个培训示例的``有效性'',以及如何系统地生成足够数量的``有效''示例。为了回答第一个问题,我们发现训练样本的``困难''可以作为衡量培训样本的``有效性''的指标。为了解决第二个问题,我们建议在这些演化阶段动态调整训练数据的``难度''分布。为了实现这两个目的,我们提出了一个新颖的以数据为中心的VIT培训框架,以动态测量训练样本的``难度'',并为不同培训阶段的模型生成``有效的''样品。此外,为了进一步扩大``有效''样品的数量,并减轻了VIT的后期训练阶段的过度拟合问题,我们提出了一种称为Patcherasing的补丁级擦除策略。广泛的实验证明了提出的以数据为中心的VIT培训框架和技术的有效性。
translated by 谷歌翻译
去耦时尚表示是指将空间和时间特征分解成尺寸无关的因素。尽管以前的基于RGB-D的运动识别方法通过紧密耦合的多模态时空表示来实现了有希望的性能,但由于紧密的时空缠绕的建模,它们仍然在小数据设置下遭受(i)优化困难;(ii)信息冗余通常包含与分类弱相关的大量边际信息; (iii)由晚期融合不足引起的多模态起峰型信息之间的低相互作用。为了缓解这些缺点,我们建议去除并循环基于RGB-D的运动识别的时空表示。具体而言,我们解开了学习时空表示的任务到3个子任务:(1)通过解耦的空间和时间建模网络学习高质量和维度独立特征。 (2)重新汇总解耦表示,以确定更强的时空依赖。 (3)引入跨型自适应后融合(CAPF)机制,用于从RGB-D数据捕获跨模态时空信息。这些新颖设计的无缝组合形成了强大的时空表示,而不是在四个公共运动数据集上的最先进的方法实现了更好的性能。我们的代码可在https://github.com/damo-cv/motionrgbd获得。
translated by 谷歌翻译
通过利用仅偏置模型的输出来调整学习目标,可以有效地显示了基于组合的脱叠方法。在本文中,我们专注于这些基于集合的方法的偏见模型,这起到了重要作用,但在现有文献中没有大量关注。从理论上讲,我们证明了脱结性能可能因偏见模型的不准确性估计而受损。凭经验,我们表明现有的偏见模型在产生准确的不确定性估计方面不足。这些发现的动机,我们建议在唯一的模型上进行校准,从而实现基于三阶段的脱叠框架,包括偏置建模,模型校准和脱叠。 NLI的实验结果和事实验证任务表明,我们提出的三阶段脱叠框架始终如一地优于传统的两级,以分配的准确性。
translated by 谷歌翻译
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, trading-off model accuracy and constrained resources still need further improvements. This work rethinks the essential unity of efficient Inverted Residual Block in MobileNetv2 and effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance though sharing the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Massive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass \textbf{SoTA} CNN-/Transformer-based models, while trading-off the model accuracy and efficiency well.
translated by 谷歌翻译
Decompilation aims to transform a low-level program language (LPL) (eg., binary file) into its functionally-equivalent high-level program language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, with the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing the idea of NMT. They formulate the decompilation process as a translation problem between LPL and HPL, aiming to reduce the human cost required to develop decompilation tools and improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP, that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.
translated by 谷歌翻译
Image Virtual try-on aims at replacing the cloth on a personal image with a garment image (in-shop clothes), which has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images, however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth warps to the unreasonable body part. Based on the in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which can generate semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantics guidance and pose prior, various complexities of texture are selectively blending with human parts in a copy-and-paste manner. Then, the Generative Module (GM) is utilized to take charge of synthesizing the final try-on image and learning to de-occlusion jointly. In comparison to the state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
translated by 谷歌翻译
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
translated by 谷歌翻译
It has been observed in practice that applying pruning-at-initialization methods to neural networks and training the sparsified networks can not only retain the testing performance of the original dense models, but also sometimes even slightly boost the generalization performance. Theoretical understanding for such experimental observations are yet to be developed. This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. Specifically, this work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned according to different rates at the initialization. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound gets better as the pruning fraction gets larger. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. Up to our knowledge, this is the \textbf{first} generalization result for pruned neural networks, suggesting that pruning can improve the neural network's generalization.
translated by 谷歌翻译
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from the previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterize the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network. The contribution over previous work is that not only the bias is allowed to be updated by gradient descent under our setting but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
translated by 谷歌翻译