This work studies the training of one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, unlike in previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results characterizes the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can converge as fast as the original network. The contribution over previous work is twofold: the bias is allowed to be updated by gradient descent, and a finer analysis improves the width required to ensure the network's closeness to its NTK. Secondly, a generalization bound for the networks after training is provided. A width-sparsity dependence is presented which yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves on the previous bound for shallow networks' generalization. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK, and the bounds from previous works yield vacuous generalization bounds, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, a trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted. This leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
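As a rough illustration of the quantities this abstract refers to, the sketch below computes an empirical NTK Gram matrix and its smallest eigenvalue for a one-hidden-layer ReLU network with a constant bias initialization. It is a sketch under stated assumptions, not the paper's construction: the parameterization $f(x) = \frac{1}{\sqrt{m}} \sum_r a_r \,\mathrm{ReLU}(w_r^\top x - b_r)$ with fixed $a_r \in \{-1, +1\}$, the Gaussian weight initialization, the unit-norm random inputs, and the names `empirical_ntk` and `demo` are all assumed for illustration.

```python
import numpy as np

def empirical_ntk(X, W, b):
    """Empirical NTK Gram matrix with respect to (W, b).

    Assumed model: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r.x - b_r),
    with a_r in {-1, +1} held fixed, so a_r**2 = 1 drops out of the kernel.
    """
    m = W.shape[0]
    # (n, m) activation pattern: which neurons fire on which input
    act = (X @ W.T - b > 0).astype(float)
    # Gradients w.r.t. w_r contribute x_i.x_j; gradients w.r.t. b_r contribute 1
    return (X @ X.T + 1.0) * (act @ act.T) / m

def demo(n=20, d=10, m=4096, B=0.5, seed=0):
    """Build random unit-norm data and a constant-bias initialization (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    W = rng.standard_normal((m, d))
    b = np.full(m, B)                        # all biases start at the constant B
    H = empirical_ntk(X, W, b)
    lam_min = np.linalg.eigvalsh(H)[0]       # eigvalsh returns ascending eigenvalues
    active = np.mean(X @ W.T - b > 0)        # fraction of active (input, neuron) pairs
    return H, lam_min, active
```

Calling `demo(B=0.0)` and `demo(B=0.5)` side by side shows how a larger bias constant reduces the fraction of active neurons in this toy setup, which is the sense of "sparsification" assumed here.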
translated by Google Translate
Recent years have witnessed tremendous progress in 3D GANs for generating view-consistent, photo-realistic radiance fields. Yet, high-quality generation of human radiance fields remains challenging, partially due to the limited human-related priors adopted in existing methods. We present HumanGen, a novel 3D human generation scheme with detailed geometry and $360^{\circ}$ realistic free-view rendering. It explicitly marries 3D human generation with various priors from the 2D generator and 3D reconstructor of humans through the design of an "anchor image". We introduce a hybrid feature representation using the anchor image to bridge the latent space of HumanGen with the existing 2D generator. We then adopt a pronged design to disentangle the generation of geometry and appearance. With the aid of the anchor image, we adapt a 3D reconstructor for fine-grained detail synthesis and propose a two-stage blending scheme to boost appearance generation. Extensive experiments demonstrate the effectiveness of our method for state-of-the-art 3D human generation in terms of geometry details, texture quality, and free-view performance. Notably, HumanGen can also incorporate various off-the-shelf 2D latent editing methods, seamlessly lifting them into 3D.
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can achieve real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera, which is capable of generating depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices; their detailed description is provided in this paper.
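Imaging exams, such as chest radiography, yield a small set of common findings and a much larger set of uncommon findings. While a trained radiologist can learn the visual presentation of rare conditions by studying a few representative examples, teaching a machine to learn from such a "long-tailed" distribution is much more difficult, as standard methods are easily biased toward the most frequent classes. In this paper, we present a comprehensive benchmark study of the long-tailed learning problem in the specific domain of thorax diseases on chest X-rays. We focus on learning from naturally distributed chest X-ray data, optimizing classification accuracy not only on the common "head" classes but also on the rare yet critical "tail" classes. To this end, we introduce a challenging new long-tailed chest X-ray benchmark to facilitate the development of long-tailed learning methods for medical image classification. The benchmark consists of two chest X-ray datasets for 19- and 20-way thorax disease classification, containing classes with as many as 53,000 and as few as 7 labeled training images. We evaluate both standard and state-of-the-art long-tailed learning methods on this new benchmark, analyze which aspects of these methods are most beneficial for long-tailed medical image classification, and summarize insights for future algorithm design. The datasets, trained models, and code are available at https://github.com/vita-group/longtailcxr.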
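Matching-based methods, especially those based on space-time memory, are significantly ahead of other solutions in semi-supervised video object segmentation (VOS). However, continuously growing and redundant template features lead to inefficient inference. To alleviate this, we propose a novel Sequential Weighted Expectation-Maximization (SWEM) network to greatly reduce the redundancy of memory features. Unlike previous methods, which only detect feature redundancy between frames, SWEM merges both intra-frame and inter-frame similar features by leveraging the sequential weighted EM algorithm. Further, adaptive weights for frame features give SWEM the flexibility to represent hard samples, improving the discrimination of templates. Besides, the proposed method maintains a fixed number of template features in memory, which ensures a stable inference complexity for the VOS system. Extensive experiments on the commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3% $\mathcal{J}\&\mathcal{F}$) of SWEM. Code is available at: https://github.com/lmm077/swem.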
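Recently, MLP-like vision models have achieved promising performance on mainstream visual recognition tasks. In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield good representation power for deep recognition models. However, existing MLP-like models fuse tokens through static fusion operations, lacking adaptability to the contents of the tokens to be mixed. Thus, customary information fusion procedures are not effective enough. To this end, this paper presents an efficient MLP-like network architecture, dubbed DynaMixer, which resorts to dynamic information fusion. Critically, we propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed. To reduce the time complexity and improve the robustness, a dimensionality reduction technique and a multi-segment fusion mechanism are adopted. Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K dataset without extra training data, performing favorably against state-of-the-art vision MLP models. When the number of parameters is reduced to 26M, it still achieves 82.7% top-1 accuracy, surpassing existing MLP-like models of similar capacity. The code is available at \url{https://github.com/ziyuwwang/dynamixer}.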
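The idea of generating a mixing matrix from token content, rather than using a fixed learned one, can be sketched as follows. This is a minimal illustration under assumptions and not the released DynaMixer code: the single-segment form, the softmax row normalization, and the names `dynamic_mix`, `W_reduce`, and `W_gen` are hypothetical choices made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_mix(X, W_reduce, W_gen):
    """Mix N tokens with a matrix generated from the tokens themselves.

    X:        (N, D) token features
    W_reduce: (D, d) projection that reduces the channel dimension (d << D)
    W_gen:    (N*d, N*N) projection that produces the mixing logits
    Returns the mixed tokens (N, D) and the mixing matrix P (N, N).
    """
    N, _ = X.shape
    reduced = X @ W_reduce                     # (N, d): cheaper summary of each token
    logits = reduced.reshape(-1) @ W_gen       # (N*N,): depends on all token contents
    P = softmax(logits.reshape(N, N), axis=-1) # each row is a distribution over tokens
    return P @ X, P
```

Because `P` is recomputed from `X` on every call, two different inputs are mixed with two different matrices, in contrast to a static token-mixing layer whose matrix is the same for every input.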
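Contrastive learning approaches have achieved great success in learning visual representations with few labels of the target classes. This implies a tantalizing possibility of scaling them up beyond a curated "seed" benchmark by incorporating more unlabeled images from internet-scale external sources to enhance their performance. In practice, however, a larger amount of unlabeled data requires more computing resources, due to the bigger model size and longer training needed. Moreover, open-world unlabeled data usually follows an implicit long-tailed class or attribute distribution, and many of the samples do not belong to the target classes. Blindly leveraging all unlabeled data can hence lead to data imbalance as well as distraction issues. This motivates us to seek a principled approach to strategically select unlabeled data from an external source, in order to learn generalizable, balanced and diverse representations for the relevant classes. In this work, we present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK), which follows three simple principles: (1) tailness, which encourages sampling examples from tail classes by sorting the empirical contrastive loss expectation (ECLE) of samples over random data augmentations; (2) proximity, which rejects out-of-distribution outliers that may distract training; and (3) diversity, which ensures diversity in the set of sampled examples. Empirically, using ImageNet-100-LT (without labels) as the seed dataset and two "noisy" external data sources, we demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features, as evaluated via linear classifier evaluation in full-shot and few-shot settings. The code is available at: \url{https://github.com/vita-group/mak}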
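The diversity principle builds on the k-center idea, which can be illustrated with the standard greedy farthest-first heuristic. The sketch below covers only that component, under assumptions: it does not model the tailness (ECLE) or proximity terms, it is not the MAK implementation, and the function name `greedy_k_center` and its arguments are hypothetical.

```python
import numpy as np

def greedy_k_center(candidates, seed_feats, k):
    """Pick k candidate indices by farthest-first traversal.

    candidates: (n, d) features of the external, unlabeled pool
    seed_feats: (s, d) features of the seed set, treated as already covered
    Returns the list of chosen indices into `candidates`, in selection order.
    """
    # Distance from every candidate to its nearest already-covered point
    diffs = candidates[:, None, :] - seed_feats[None, :, :]
    dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    chosen = []
    for _ in range(k):
        i = int(np.argmax(dist))      # the least-covered candidate so far
        chosen.append(i)
        # The new center now covers its neighbours: shrink their distances
        dist = np.minimum(dist, np.linalg.norm(candidates - candidates[i], axis=1))
    return chosen
```

Each step adds the candidate farthest from everything selected so far, so near-duplicates of the seed set or of earlier picks are chosen last.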
This paper focuses on designing efficient models with few parameters and low FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even when the same framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods; e.g., our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1, surpassing \textbf{SoTA} CNN-/Transformer-based models while trading off model accuracy and efficiency well.
We aim to bridge the gap between our common-sense few-sample human learning and large-data machine learning. We derive a theory of human-like few-shot learning from von Neumann-Landauer's principle. Modelling human learning is difficult, as how people learn varies from one person to another. Under commonly accepted definitions, we prove that all human or animal few-shot learning, and major models of such learning including the Free Energy Principle and Bayesian Program Learning, approximate our theory under the Church-Turing thesis. We find that deep generative models such as the variational autoencoder (VAE) can be used to approximate our theory and perform significantly better than baseline models, including deep neural networks, for image recognition, low-resource language processing, and character recognition.
Despite significant progress in object categorization in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited-size class vocabularies and typically requires separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above-mentioned challenges and address the problems of supervised, zero-shot, generalized zero-shot and open set recognition using a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. Distance constraints ensure that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We illustrate that the resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open set recognition, with up to a 310K class vocabulary, on the Animals with Attributes and ImageNet datasets.