最近,Graph神经网络(GNNS)已成为聚光灯作为强大的工具,可以有效地在图形结构化数据上执行各种推理任务。随着现实图表的大小继续扩展,GNN训练系统面临可扩展性挑战。分布式培训是一种流行的方法,可以通过扩展CPU节点来应对这一挑战。但是,对基于磁盘的GNN培训的关注不多,该培训可以通过利用NVME SSD等高性能存储设备来以更具成本效益的方式扩展单节点系统。我们观察到,主内存和磁盘之间的数据移动是基于SSD的训练系统中的主要瓶颈,并且常规的GNN训练管道是不错的选择,而无需考虑此开销。因此,我们提出了Ginex,这是第一个基于SSD的GNN训练系统,可以在单台计算机上处​​理数十亿个图形数据集。受到编译器优化的检查员执行模型的启发,Ginex通过分开样品和收集阶段来重组GNN训练管道。这种分离使Ginex能够实现一种可证明的最佳替换算法,即被称为Belady的算法,用于存储器中的Caching特征向量,该算法是I/O访问的主要部分。根据我们对40亿尺度图数据集的评估,Ginex平均比SSD扩展的Pytorch几何得出了2.11倍的训练吞吐量(最大最高2.67倍)。
translated by 谷歌翻译
尽管电子保健记录(EHR)丰富,但其异质性限制了医疗数据在构建预测模型中的利用。为了应对这一挑战,我们提出了通用医疗预测框架(UNIHPF),该框架不需要医疗领域知识和对多个预测任务的最小预处理。实验结果表明,UNIHPF能够构建可以从不同EHR系统处理任何形式的医疗数据的大规模EHR模型。我们的框架在多源学习任务(包括转移和汇总学习)中大大优于基线模型,同时在单个医疗数据集中接受培训时也会显示出可比的结果。为了凭经验证明我们工作的功效,我们使用各种数据集,模型结构和任务进行了广泛的实验。我们认为,我们的发现可以为对EHR的多源学习提供进一步研究提供有益的见解。
translated by 谷歌翻译
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decodert takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
translated by 谷歌翻译
Mix-up training approaches have proven to be effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community expands mix-up methods into two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the arbitrary path, leaving the randomization domain unexplored. In this paper, inspired by the superior qualities of each direction over one another, we introduce a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, our method balances speed, simplicity, and accuracy. We name our method R-Mix following the concept of "Random Mix-up". We demonstrate its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. Finally, in order to address the question of whether there exists a better decision protocol, we train a Reinforcement Learning agent that decides the mix-up policies based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for a fully automatic mix-up. Our code is released at [https://github.com/minhlong94/Random-Mixup].
translated by 谷歌翻译
Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks, as pursued in recent VL-NLE models. While current models offer impressive performance on task accuracy and explanation plausibility, they suffer from a range of issues: Some models feature a modular design where the explanation generation module is poorly integrated with a separate module for task-answer prediction, employ backbone models trained on limited sets of tasks, or incorporate ad hoc solutions to increase performance on single datasets. We propose to evade these limitations by applying recent advances in large-scale multi-task pretraining of generative Transformer models to the problem of VL-NLE tasks. Our approach outperforms recent models by a large margin, with human annotators preferring the generated explanations over the ground truth in two out of three evaluated datasets. As a novel challenge in VL-NLE research, we propose the problem of multi-task VL-NLE and show that jointly training on multiple tasks can increase the explanation quality. We discuss the ethical implications of high-quality NLE generation and other issues in recent VL-NLE research.
translated by 谷歌翻译
Our long term goal is to use image-based depth completion to quickly create 3D models from sparse point clouds, e.g. from SfM or SLAM. Much progress has been made in depth completion. However, most current works assume well distributed samples of known depth, e.g. Lidar or random uniform sampling, and perform poorly on uneven samples, such as from keypoints, due to the large unsampled regions. To address this problem, we extend CSPN with multiscale prediction and a dilated kernel, leading to much better completion of keypoint-sampled depth. We also show that a model trained on NYUv2 creates surprisingly good point clouds on ETH3D by completing sparse SfM points.
translated by 谷歌翻译
Multilayer perceptrons (MLPs) learn high frequencies slowly. Recent approaches encode features in spatial bins to improve speed of learning details, but at the cost of larger model size and loss of continuity. Instead, we propose to encode features in bins of Fourier features that are commonly used for positional encoding. We call these Quantized Fourier Features (QFF). As a naturally multiresolution and periodic representation, our experiments show that using QFF can result in smaller model size, faster training, and better quality outputs for several applications, including Neural Image Representations (NIR), Neural Radiance Field (NeRF) and Signed Distance Function (SDF) modeling. QFF are easy to code, fast to compute, and serve as a simple drop-in addition to many neural field representations.
translated by 谷歌翻译
Knowledge about space and time is necessary to solve problems in the physical world: An AI agent situated in the physical world and interacting with objects often needs to reason about positions of and relations between objects; and as soon as the agent plans its actions to solve a task, it needs to consider the temporal aspect (e.g., what actions to perform over time). Spatio-temporal knowledge, however, is required beyond interacting with the physical world, and is also often transferred to the abstract world of concepts through analogies and metaphors (e.g., "a threat that is hanging over our heads"). As spatial and temporal reasoning is ubiquitous, different attempts have been made to integrate this into AI systems. In the area of knowledge representation, spatial and temporal reasoning has been largely limited to modeling objects and relations and developing reasoning methods to verify statements about objects and relations. On the other hand, neural network researchers have tried to teach models to learn spatial relations from data with limited reasoning capabilities. Bridging the gap between these two approaches in a mutually beneficial way could allow us to tackle many complex real-world problems, such as natural language processing, visual question answering, and semantic image segmentation. In this chapter, we view this integration problem from the perspective of Neuro-Symbolic AI. Specifically, we propose a synergy between logical reasoning and machine learning that will be grounded on spatial and temporal knowledge. Describing some successful applications, remaining challenges, and evaluation datasets pertaining to this direction is the main topic of this contribution.
translated by 谷歌翻译
We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input. Specifically, we define two sets of latent codes, where one controls a pre-defined factor (``known''), and the other controls the remaining factors (``unknown''). The sampled latent codes from the two sets jointly bi-modulate the convolution kernels to guide the generator to synthesize diverse results. Experiments demonstrate the superiority of our method over state-of-the-arts in result diversity and generation controllability.
translated by 谷歌翻译
深度神经网络(DNN)的训练过程通常是用阶段进行管道的,用于在CPU上进行数据制备,然后对GPU等加速器进行梯度计算。在理想的管道中,端到端训练吞吐量最终受到加速器的吞吐量的限制,而不是数据准备。过去,DNN训练管道通过使用使用轻巧,有损的图像格式(如JPEG)编码的数据集实现了近乎最佳的吞吐量。但是,随着高分辨率,无损编码的数据集变得越来越流行,对于需要高精度的应用程序,由于CPU上的低通量图像解码,在数据准备阶段出现了性能问题。因此,我们提出了L3,这是一种用于高分辨率,高通量DNN训练的定制轻巧,无损的图像格式。 L3的解码过程在加速器上有效平行,从而最大程度地减少了在DNN培训期间进行数据制备的CPU干预。 L3比最流行的无损图像格式PNG获得了9.29倍的数据准备吞吐量,用于NVIDIA A100 GPU上的CityScapes数据集,该数据集可导致1.71倍更高的端到端训练吞吐量。与JPEG和WebP相比,两种流行的有损图像格式,L3分别以同等的度量性能为Imagenet提供高达1.77倍和2.87倍的端到端训练吞吐量。
translated by 谷歌翻译