Our environment is highly dynamic and full of uncertainty, which hinders the widespread adoption of machine-led Intelligent Decision-Making (IDM) in real-world scenarios. IDM should therefore be capable of continuously learning new skills and generalizing efficiently across a wider range of applications. IDM benefits from new approaches and theoretical breakthroughs toward Artificial General Intelligence (AGI) that break the barriers between tasks and applications. Recent research has thoroughly examined the Transformer neural architecture as a backbone foundation model and its generalization to various tasks, including computer vision, natural language processing, and reinforcement learning. We therefore argue that a foundation decision model (FDM) can be established by formulating various decision-making tasks as a sequence decoding task using the Transformer architecture; this would be a promising way to advance the application of IDM to more complex real-world tasks. In this paper, we elaborate on how a foundation decision model improves the efficiency and generalization of IDM. We also discuss potential applications of an FDM in multi-agent game AI, production scheduling, and robotics tasks. Finally, through a case study, we demonstrate our realization of the FDM, DigitalBrain (DB1) with 1.2 billion parameters, which achieves human-level performance over 453 tasks, including text generation, image captioning, video game playing, robotic control, and traveling salesman problems. As a foundation decision model, DB1 is a first step towards more autonomous and efficient real-world IDM applications.
Recent years have witnessed tremendous progress in 3D GANs for generating view-consistent, photo-realistic radiance fields. Yet high-quality generation of human radiance fields remains challenging, partly because existing methods adopt only limited human-related priors. We present HumanGen, a novel 3D human generation scheme with detailed geometry and $360^{\circ}$ realistic free-view rendering. It explicitly marries 3D human generation with various priors from the 2D generator and 3D reconstructor of humans through the design of an "anchor image". We introduce a hybrid feature representation using the anchor image to bridge the latent space of HumanGen with the existing 2D generator. We then adopt a pronged design to disentangle the generation of geometry and appearance. With the aid of the anchor image, we adapt a 3D reconstructor for fine-grained detail synthesis and propose a two-stage blending scheme to boost appearance generation. Extensive experiments demonstrate the effectiveness of our approach for state-of-the-art 3D human generation in terms of geometry details, texture quality, and free-view performance. Notably, HumanGen can also incorporate various off-the-shelf 2D latent editing methods, seamlessly lifting them into 3D.
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking, and many other mobile tasks. Thus, it is crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single-image depth estimation solutions that achieve real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset collected with the ZED stereo camera, which is capable of generating depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA-resolution depth maps at up to 27 FPS while achieving high-fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile device; their detailed description is provided in this paper.
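Deep learning-based methods for retinal lesion segmentation usually require a large amount of precise pixel-level annotated data. However, coarse annotations such as circles or ellipses that outline the lesion area can be six times more efficient to produce than pixel-level annotations. This paper therefore proposes an annotation refinement network that converts a coarse annotation into a pixel-level segmentation mask. Our main novelty is the application of the prototype learning paradigm to enhance generalization across different datasets or lesion types. We also introduce a prototype weighing module to handle challenging cases in which the lesion is overly small. The proposed method was trained on the publicly available IDRID dataset and then generalized to the public DDR dataset and our real-world private dataset. Experiments show that our method substantially improves the initial coarse mask and outperforms the non-prototypical baseline by a large margin. In addition, we demonstrate the usefulness of the prototype weighing module in both cross-dataset and cross-class settings.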
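Spectral methods, which represent data points by the eigenvectors of kernel matrices or graph Laplacian matrices, have become a primary tool for unsupervised data analysis. In many application scenarios, parametrizing the spectral embedding by a neural network that can be trained on batches of data samples offers a promising way to achieve automatic out-of-sample extension as well as computational scalability. This approach was taken in the original SpectralNet paper (Shaham et al., 2018), which we call SpecNet1. The current paper introduces a new neural network approach, named SpecNet2, for computing spectral embeddings; it optimizes an equivalent objective of the eigen-problem and removes the orthogonalization layer used in SpecNet1. SpecNet2 also allows the sampling of rows and columns of the graph affinity matrix to be separated, by tracking the neighbors of each data point through the gradient formula. Theoretically, we show that any local minimizer of the new orthogonalization-free objective reveals the leading eigenvectors. Furthermore, we prove global convergence for this new orthogonalization-free objective under a batch-based gradient descent method. Numerical experiments demonstrate the improved performance and computational efficiency of SpecNet2 on simulated data and image datasets.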
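Contrastive learning approaches have achieved great success in learning visual representations with few labels of the target classes. This suggests the tantalizing possibility of scaling them beyond a curated "seed" benchmark by incorporating more unlabeled images from internet-scale external sources to improve their performance. In practice, however, a larger amount of unlabeled data requires more computing resources because of the larger model size and longer training needed. Moreover, open-world unlabeled data usually follows an implicit long-tailed class or attribute distribution, and many samples do not belong to the target classes at all. Blindly leveraging all unlabeled data can therefore lead to data imbalance as well as distraction issues. This motivates us to seek a principled approach for strategically selecting unlabeled data from external sources, in order to learn generalizable, balanced, and diverse representations for the relevant classes. In this work, we present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK), which follows three simple principles: (1) tailness, which encourages sampling examples from tail classes by sorting samples according to their empirical contrastive loss expectation (ECLE) over random data augmentations; (2) proximity, which rejects out-of-distribution outliers that may distract training; and (3) diversity, which ensures diversity within the set of sampled examples. Empirically, using ImageNet-100-LT (without labels) as the seed dataset and two "noisy" external data sources, we demonstrate that MAK consistently improves both the overall representation quality and the class balancedness of the learned features, as evaluated by linear classifier evaluation in full-shot and few-shot settings. Code is available at: https://github.com/vita-group/mak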
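The diversity principle above is named after K-center selection. As a rough illustration only, the following sketch shows the generic greedy k-center heuristic on toy 2D feature vectors. It assumes Euclidean distance; the function name `greedy_k_center` and the toy points are hypothetical, and this is not the MAK implementation, which additionally applies the tailness and proximity criteria.

```python
import math


def greedy_k_center(points, seed_indices, budget):
    """Greedy k-center: repeatedly pick the point farthest from the selected set.

    points       -- list of feature vectors (tuples of floats)
    seed_indices -- indices already selected (e.g. the seed dataset); must be non-empty
    budget       -- number of additional points to pick
    """
    selected = list(seed_indices)
    # Distance from every point to its nearest already-selected center.
    min_dist = [min(math.dist(p, points[s]) for s in selected) for p in points]
    picked = []
    for _ in range(budget):
        # The farthest point is the one least covered by the current centers.
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        picked.append(idx)
        # Adding a center can only shrink each point's nearest-center distance.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return picked


# Toy usage: index 0 plays the role of the seed set.
toy_points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
print(greedy_k_center(toy_points, seed_indices=[0], budget=2))  # [3, 4]
```

In this toy run the heuristic first picks the point farthest from the seed and then the point farthest from both, which is the sense in which the selected set is spread out.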
Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D, based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
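One plausible reading of "disentangling" multi-scale aggregation is to build a separate adjacency matrix for each scale k that links only nodes exactly k hops apart, so that nearby joints are not counted again at larger scales. The sketch below illustrates that reading on a toy chain of four joints. The function names and the toy skeleton are hypothetical, and the abstract does not confirm that this is the exact construction used in MS-G3D.

```python
from collections import deque


def hop_distances(neighbors, src):
    """Breadth-first search: shortest hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def k_hop_adjacency(neighbors, k):
    """Matrix with 1 at (i, j) only when j is exactly k hops from i, else 0."""
    n = len(neighbors)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j, d in hop_distances(neighbors, i).items():
            if d == k:
                matrix[i][j] = 1
    return matrix


# Toy skeleton: a chain of four joints 0 - 1 - 2 - 3 (adjacency lists).
chain = [[1], [0, 2], [1, 3], [2]]
for row in k_hop_adjacency(chain, 2):
    print(row)
```

Unlike powers of the adjacency matrix, which also count walks that return to close neighbors, each matrix here covers one neighborhood ring only, so a model can weight every scale independently.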
We present Second Thought, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thought not only achieves superior performance in three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and ease for interactive error correction. Extensive human evaluations further confirm its effectiveness.
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, unlike in previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results characterizes the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can converge as fast as the original network. The contribution over previous work is that not only is the bias allowed to be updated by gradient descent in our setting, but a finer analysis is also given that improves the width required to ensure the network's closeness to its NTK. Secondly, a generalization bound for the networks after training is provided. A width-sparsity dependence is presented, which yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves on the previous bound for the generalization of shallow networks. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, a trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4x relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.
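To make the idea of a delayed but less noisy preconditioner concrete, here is a minimal sketch under stated assumptions. It is not the authors' DP^2 algorithm: the clipping rule, the Gaussian noise, the RMSProp-style scaling, and every name (`noisy_mean_gradient`, `DelayedPreconditioner`, `delay`) are illustrative choices, and no privacy accounting is performed. The point it shows is that averaging several noisy gradients before refreshing the preconditioner reduces the noise in the preconditioner, at the cost of it being stale.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale a gradient so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, clip_norm, noise_std, rng):
    """Clip per-example gradients, sum them, add Gaussian noise, then average."""
    n = len(per_example_grads)
    total = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, clip_norm)):
            total[i] += g
    return [(t + rng.gauss(0.0, noise_std)) / n for t in total]


class DelayedPreconditioner:
    """Refreshes a per-coordinate scale only once every `delay` observed gradients."""

    def __init__(self, dim, delay, eps=1e-8):
        self.delay = delay
        self.eps = eps
        self.buffer = [0.0] * dim
        self.count = 0
        self.scale = [1.0] * dim  # identity scaling until the first refresh

    def observe(self, noisy_grad):
        """Accumulate a noisy gradient; refresh the scale when the buffer is full."""
        for i, g in enumerate(noisy_grad):
            self.buffer[i] += g
        self.count += 1
        if self.count == self.delay:
            # Averaging `delay` noisy gradients shrinks the noise variance.
            averaged = [b / self.delay for b in self.buffer]
            self.scale = [abs(a) + self.eps for a in averaged]
            self.buffer = [0.0] * len(self.buffer)
            self.count = 0

    def apply(self, noisy_grad):
        """Precondition a gradient with the current (possibly stale) scale."""
        return [g / s for g, s in zip(noisy_grad, self.scale)]


rng = random.Random(0)
preconditioner = DelayedPreconditioner(dim=2, delay=4)
for _ in range(4):
    grad = noisy_mean_gradient([[3.0, 4.0], [0.0, 1.0]], 1.0, 0.1, rng)
    preconditioner.observe(grad)
print(preconditioner.apply(grad))
```

A larger `delay` averages more noisy gradients per refresh but leaves the scale stale for longer, which is the trade-off between delay and privacy noise reduction that the abstract refers to.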