我们在大图中介绍了图形神经网络(GNNS)的分布式全批量培训的顺序聚合和换算(SAR)方案。最近,GNN的大规模培训是基于非学习消息传递的基于采样的方法和方法主导的。另一方面,SAR是一种分布式技术,可以直接在整个大图上培训任何GNN类型。 SAR中的关键创新是分布式顺序修补方案,其在后向通过期间依次重新构造,然后在后向通行证期间释放禁止的大型GNN计算图。这导致优异的记忆缩放行为,其中每个工作人员的内存消耗与工人的数量线性地下降,即使对于密集连接的图形。使用SAR,我们报告了最大的全批量GNN培训应用到目前为止,并随着工人数量的增加而展示了大的内存节省。我们还基于内核融合和注意力矩阵的一般技术提出了一种优化了基于关注的模型的运行时和内存效率。我们表明,与SAR相结合,我们的优化注意核导致了基于关注的GNN的显着加速和内存节省。
translated by 谷歌翻译
图形神经网络(GNN)已被证明是分析非欧国人图数据的强大工具。但是,缺乏有效的分布图学习(GL)系统极大地阻碍了GNN的应用,尤其是当图形大且GNN相对深时。本文中,我们提出了GraphTheta,这是一种以顶点为中心的图形编程模型实现的新颖分布式和可扩展的GL系统。 GraphTheta是第一个基于分布式图处理的GL系统,其神经网络运算符以用户定义的功能实现。该系统支持多种培训策略,并在分布式(虚拟)机器上启用高度可扩展的大图学习。为了促进图形卷积实现,GraphTheta提出了一个名为NN-Tgar的新的GL抽象,以弥合图形处理和图形深度学习之间的差距。提出了分布式图引擎,以通过混合平行执行进行随机梯度下降优化。此外,除了全球批次和迷你批次外,我们还为新的集群批次培训策略提供了支持。我们使用许多网络大小的数据集评估GraphTheta,范围从小,适度到大规模。实验结果表明,GraphTheta可以很好地扩展到1,024名工人,用于培训内部开发的GNN,该工业尺度的Aripay数据集为14亿个节点和41亿个属性边缘,并带有CPU虚拟机(Dockers)群的小群。 (5 $ \ sim $ 12GB)。此外,GraphTheta比最先进的GNN实现获得了可比或更好的预测结果,证明其学习GNN和现有框架的能力,并且可以超过多达$ 2.02 \ tims $ $ 2.02 \ times $,具有更好的可扩展性。据我们所知,这项工作介绍了文献中最大的边缘属性GNN学习任务。
translated by 谷歌翻译
Training Graph Neural Networks, on graphs containing billions of vertices and edges, at scale using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume and potential performance loss. DistGNN-MB employs a novel Historical Embedding Cache combined with compute-communication overlap to address this challenge. On a 32-node (64-socket) cluster of $3^{rd}$ generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively, on 32 compute nodes. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely-used DistDGL. DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively, as compute nodes scale from 2 to 32.
translated by 谷歌翻译
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training remain preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
translated by 谷歌翻译
开发用于训练图形的可扩展解决方案,用于链路预测任务的Neural网络(GNNS)由于具有高计算成本和巨大内存占用的高数据依赖性,因此由于高数据依赖性而具有挑战性。我们提出了一种新的方法,用于缩放知识图形嵌入模型的培训,以满足这些挑战。为此,我们提出了以下算法策略:自给自足的分区,基于约束的负采样和边缘迷你批量培训。两者都是分区策略和基于约束的负面采样,避免在训练期间交叉分区数据传输。在我们的实验评估中,我们表明,我们基于GNN的知识图形嵌入模型的缩放解决方案在基准数据集中实现了16倍的加速,同时将可比的模型性能作为标准度量的非分布式方法。
translated by 谷歌翻译
Graph Neural Networks(GNNs) are a family of neural models tailored for graph-structure data and have shown superior performance in learning representations for graph-structured data. However, training GNNs on large graphs remains challenging and a promising direction is distributed GNN training, which is to partition the input graph and distribute the workload across multiple machines. The key bottleneck of the existing distributed GNNs training framework is the across-machine communication induced by the dependency on the graph data and aggregation operator of GNNs. In this paper, we study the communication complexity during distributed GNNs training and propose a simple lossless communication reduction method, termed the Aggregation before Communication (ABC) method. ABC method exploits the permutation-invariant property of the GNNs layer and leads to a paradigm where vertex-cut is proved to admit a superior communication performance than the currently popular paradigm (edge-cut). In addition, we show that the new partition paradigm is particularly ideal in the case of dynamic graphs where it is infeasible to control the edge placement due to the unknown stochastic of the graph-changing process.
translated by 谷歌翻译
Using graph neural networks for large graphs is challenging since there is no clear way of constructing mini-batches. To solve this, previous methods have relied on sampling or graph clustering. While these approaches often lead to good training convergence, they introduce significant overhead due to expensive random data accesses and perform poorly during inference. In this work we instead focus on model behavior during inference. We theoretically model batch construction via maximizing the influence score of nodes on the outputs. This formulation leads to optimal approximation of the output when we do not have knowledge of the trained model. We call the resulting method influence-based mini-batching (IBMB). IBMB accelerates inference by up to 130x compared to previous methods that reach similar accuracy. Remarkably, with adaptive optimization and the right training schedule IBMB can also substantially accelerate training, thanks to precomputed batches and consecutive memory accesses. This results in up to 18x faster training per epoch and up to 17x faster convergence per runtime compared to previous methods.
translated by 谷歌翻译
While many systems have been developed to train Graph Neural Networks (GNNs), efficient model inference and evaluation remain to be addressed. For instance, using the widely adopted node-wise approach, model evaluation can account for up to 94% of the time in the end-to-end training process due to neighbor explosion, which means that a node accesses its multi-hop neighbors. On the other hand, layer-wise inference avoids the neighbor explosion problem by conducting inference layer by layer such that the nodes only need their one-hop neighbors in each layer. However, implementing layer-wise inference requires substantial engineering efforts because users need to manually decompose a GNN model into layers for computation and split workload into batches to fit into device memory. In this paper, we develop Deep Graph Inference (DGI) -- a system for easy and efficient GNN model inference, which automatically translates the training code of a GNN model for layer-wise execution. DGI is general for various GNN models and different kinds of inference requests, and supports out-of-core execution on large graphs that cannot fit in CPU memory. Experimental results show that DGI consistently outperforms layer-wise inference across different datasets and hardware settings, and the speedup can be over 1,000x.
translated by 谷歌翻译
图形神经网络(GNNS)将深度神经网络(DNN)的成功扩展到非欧几里德图数据,实现了各种任务的接地性能,例如节点分类和图形属性预测。尽管如此,现有系统效率低,培训数十亿节点和GPU的节点和边缘训练大图。主要瓶颈是准备GPU数据的过程 - 子图采样和特征检索。本文提出了一个分布式GNN培训系统的BGL,旨在解决一些关键思想的瓶颈。首先,我们提出了一种动态缓存引擎,以最小化特征检索流量。通过协同设计缓存政策和抽样顺序,我们发现低开销和高缓存命中率的精美斑点。其次,我们改善了曲线图分区算法,以减少子图采样期间的交叉分区通信。最后,仔细资源隔离减少了不同数据预处理阶段之间的争用。关于各种GNN模型和大图数据集的广泛实验表明,BGL平均明显优于现有的GNN训练系统20.68倍。
translated by 谷歌翻译
Graph neural networks (GNNs) have received great attention due to their success in various graph-related learning tasks. Several GNN frameworks have then been developed for fast and easy implementation of GNN models. Despite their popularity, they are not well documented, and their implementations and system performance have not been well understood. In particular, unlike the traditional GNNs that are trained based on the entire graph in a full-batch manner, recent GNNs have been developed with different graph sampling techniques for mini-batch training of GNNs on large graphs. While they improve the scalability, their training times still depend on the implementations in the frameworks as sampling and its associated operations can introduce non-negligible overhead and computational cost. In addition, it is unknown how much the frameworks are 'eco-friendly' from a green computing perspective. In this paper, we provide an in-depth study of two mainstream GNN frameworks along with three state-of-the-art GNNs to analyze their performance in terms of runtime and power/energy consumption. We conduct extensive benchmark experiments at several different levels and present detailed analysis results and observations, which could be helpful for further improvement and optimization.
translated by 谷歌翻译
图形神经网络(GNNS)已成为处理机器学习任务的有效方法,它为构建推荐系统带来了一种新方法,其中可以将推荐任务作为用户 - 项目的链接预测问题提出, 。培训基于GNN的推荐系统(GNNRECSYS)在大图上会引起大型内存足迹,很容易超过典型服务器上的DRAM容量。现有的解决方案诉诸分布式子图培训,这是由于动态构建子图和各个子图的大量冗余的高成本而效率低下。新兴的Intel Optane持久记忆使一台机器以可承受的成本具有最多6 TB的存储器,从而使单机器Gnnrecsys训练可行,从而消除了分布式培训中的效率低下。与DRAM相比,将Optane用于Gnnrecsys的一个主要问题是Optane相对较低的带宽。由于其主要的计算内核稀疏且内存访问密集,因此这种限制可能对Gnnrecsys工作量的高性能特别有害。为了了解Optane是否适合Gnnrecsys培训,我们对Gnnrecsys工作负载进行了深入的表征和全面的基准测试研究。我们的基准测试结果表明,经过正确配置后,基于Optane的单机器GNNRECSYS训练优于大幅度的培训,尤其是在处理深度GNN模型时。我们分析了加速度的来源,提供有关如何为GNNRECSYS工作负载配置Optane的指导,并讨论进一步优化的机会。
translated by 谷歌翻译
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs. The large data sizes of graphs and their vertex features make scalable training algorithms and distributed memory systems necessary. Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges. We propose a highly parallel training algorithm that scales to large processor counts. In our solution, the large adjacency and vertex-feature matrices are partitioned among processors. We exploit the vertex-partitioning of the graph to use non-blocking point-to-point communication operations between processors for better scalability. To further minimize the parallelization overheads, we introduce a sparse matrix partitioning scheme based on a hypergraph partitioning model for full-batch training. We also propose a novel stochastic hypergraph model to encode the expected communication volume in mini-batch training. We show the merits of the hypergraph model, previously unexplored for GCN training, over the standard graph partitioning model which does not accurately encode the communication costs. Experiments performed on real-world graph datasets demonstrate that the proposed algorithms achieve considerable speedups over alternative solutions. The optimizations achieved on communication costs become even more pronounced at high scalability with many processors. The performance benefits are preserved in deeper GCNs having more layers as well as on billion-scale graphs.
translated by 谷歌翻译
深度学习领域目睹了对极端计算和内存密集型神经网络的显着转变。这些较新的较大模型使研究人员能够推进各种领域的最先进的工具。这种现象刺激了在更多的硬件加速器上产生了针对神经网络的分布式训练的算法。在本文中,我们讨论并比较了当前的最先进的框架,以实现大规模的分布式深度学习。首先,我们调查分布式学习中的当前实践,并确定所使用的不同类型的并行性。然后,我们提出了对大型图像和语言培训任务的性能进行了经验结果。此外,我们解决了他们的统计效率和内存消耗行为。根据我们的结果,我们讨论了阻碍性能的每个框架的算法和实现部分。
translated by 谷歌翻译
数据增强可帮助神经网络通过放大培训集来更好地推广,但它仍然是如何有效增强图数据以增强GNN的性能的开放问题(图形神经网络)。虽然大多数现有图形常规程序专注于通过添加/删除边缘来操纵图形拓扑结构,但我们提供了一种增强节点功能以获得更好性能的方法。我们提出标志(图中的免费大规模对抗动力增强),它在训练期间迭代地增强了基于梯度的对冲扰动的节点特征。通过使模型不变地在输入数据中的小波动中,我们的方法有助于模型推广到分布外的样本,并在测试时间提高模型性能。标志是图形数据的通用方法,它普遍存在节点分类,链路预测和图形分类任务中。标志也是非常灵活和可扩展的,并且可以使用任意GNN骨架和大规模数据集进行可部署。我们通过广泛的实验和消融研究证明了我们方法的功效和稳定性。我们还提供了直观的观察,以更深入地了解我们的方法。
translated by 谷歌翻译
图形神经网络(GNNS)在学习从图形结构数据中展示了成功,其中包含欺诈检测,推荐和知识图形推理。然而,培训GNN有效地具有挑战性,因为:1)GPU存储器容量有限,对于大型数据集可能不足,而2)基于图形的数据结构导致不规则的数据访问模式。在这项工作中,我们提供了一种统计分析的方法,并确定了GNN培训前更频繁地访问的数据。我们的数据分层方法不仅利用输入图的结构,而且还从实际GNN训练过程中获得了洞察力,以实现更高的预测结果。通过我们的数据分层方法,我们还提供了一种新的数据放置和访问策略,以进一步最大限度地减少CPU-GPU通信开销。我们还考虑了多GPU GNN培训,我们也展示了我们在多GPU系统中的策略的有效性。评估结果表明,我们的工作将CPU-GPU流量降低了87-95%,并通过数亿节点和数十亿边缘的图表提高了现有解决方案的GNN训练速度。
translated by 谷歌翻译
在过去几年中,人们对代表性学习的图形神经网络(GNN)的兴趣不大。GNN提供了一个一般有效的框架,可以从图形结构化数据中学习。但是,GNN通常仅使用一个非常有限的邻域的信息来避免过度光滑。希望为模型提供更多信息。在这项工作中,我们将个性化Pagerank(PPR)的极限分布纳入图形注意力网络(GATS)中,以反映较大的邻居信息,而无需引入过度光滑。从直觉上讲,基于个性化Pagerank的消息聚合对应于无限的许多邻里聚合层。我们表明,对于四个广泛使用的基准数据集,我们的模型优于各种基线模型。我们的实施已在线公开。
translated by 谷歌翻译
在过去几年中,培训最先进的神经网络的记忆要求远远超过了现代硬件加速器的DRAM能力。这仍然需要开发有效的算法,并在大规模的基于GPU的集群上并行培训这些神经网络。由于在现代GPU上的计算相对便宜,因此在这些并行训练算法中设计和实现极其有效的通信对于提取最大性能至关重要。本文介绍了Axonn,一个并行深度学习框架,用于利用异步和消息驱动的执行来安排每个GPU上的神经网络操作,从而降低GPU空闲时间并最大限度地提高硬件效率。通过使用CPU存储器作为划痕空间来定期在训练期间定期卸载数据,AXONN能够将GPU存储器消耗降低四次。这使我们可以将每个GPU的参数数量增加四次,从而减少通信量并将性能提高超过13%。在48-384 NVIDIA TESLA V100 GPU的大型变压器模型上进行了12-100亿参数,Axonn实现了理论峰的49.4-54.78%的每GPU吞吐量,并将培训时间减少22-37天(15-25与最先进的加速度)。
translated by 谷歌翻译
TensorFlow GNN(TF-GNN)是张量曲线的图形神经网络的可扩展库。它是从自下而上设计的,以支持当今信息生态系统中发生的丰富的异质图数据。Google的许多生产模型都使用TF-GNN,最近已作为开源项目发布。在本文中,我们描述了TF-GNN数据模型,其KERAS建模API以及相关功能,例如图形采样,分布式训练和加速器支持。
translated by 谷歌翻译
模型大小的范围不断增加,并且持续改进性能使大型模型时代的到来的到来。在本报告中,我们通过潜入培训目标和培训方法来探讨大型模型培训如何运作。具体而言,培训目标描述了如何利用Web规模数据来开发基于自我监督的学习以及基于分布式培训的培训方法,开发出极强的大型模型,描述了如何使大型模型培训成为现实。我们将现有的培训方法总结为三个主要类别:训练并行性,节省记忆技术和模型稀疏设计。训练并行性可以根据发生的并行性维度分类为数据,管道和张量并行性。节省记忆的技术是正交的,并且与训练并行性互补。和模型稀疏设计以恒定的计算成本进一步扩大模型大小。在https://github.com/qhliu26/bm-training提供了不断更新的大型模型培训清单。
translated by 谷歌翻译
In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises of a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of December 2022, the GitHub repository has reached 2,000 stars and 380 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest of exploring more powerful PE for Transformers and GNNs in a robust experimental setting.
translated by 谷歌翻译