Training Graph Neural Networks, on graphs containing billions of vertices and edges, at scale using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume and potential performance loss. DistGNN-MB employs a novel Historical Embedding Cache combined with compute-communication overlap to address this challenge. On a 32-node (64-socket) cluster of $3^{rd}$ generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively, on 32 compute nodes. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely-used DistDGL. DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively, as compute nodes scale from 2 to 32.
translated by 谷歌翻译
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training remain preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
translated by 谷歌翻译
图形神经网络(GNN)已被证明是分析非欧国人图数据的强大工具。但是,缺乏有效的分布图学习(GL)系统极大地阻碍了GNN的应用,尤其是当图形大且GNN相对深时。本文中,我们提出了GraphTheta,这是一种以顶点为中心的图形编程模型实现的新颖分布式和可扩展的GL系统。 GraphTheta是第一个基于分布式图处理的GL系统,其神经网络运算符以用户定义的功能实现。该系统支持多种培训策略,并在分布式(虚拟)机器上启用高度可扩展的大图学习。为了促进图形卷积实现,GraphTheta提出了一个名为NN-Tgar的新的GL抽象,以弥合图形处理和图形深度学习之间的差距。提出了分布式图引擎,以通过混合平行执行进行随机梯度下降优化。此外,除了全球批次和迷你批次外,我们还为新的集群批次培训策略提供了支持。我们使用许多网络大小的数据集评估GraphTheta,范围从小,适度到大规模。实验结果表明,GraphTheta可以很好地扩展到1,024名工人,用于培训内部开发的GNN,该工业尺度的Aripay数据集为14亿个节点和41亿个属性边缘,并带有CPU虚拟机(Dockers)群的小群。 (5 $ \ sim $ 12GB)。此外,GraphTheta比最先进的GNN实现获得了可比或更好的预测结果,证明其学习GNN和现有框架的能力,并且可以超过多达$ 2.02 \ tims $ $ 2.02 \ times $,具有更好的可扩展性。据我们所知,这项工作介绍了文献中最大的边缘属性GNN学习任务。
translated by 谷歌翻译
图形神经网络(GNNS)将深度神经网络(DNN)的成功扩展到非欧几里德图数据,实现了各种任务的接地性能,例如节点分类和图形属性预测。尽管如此,现有系统效率低,培训数十亿节点和GPU的节点和边缘训练大图。主要瓶颈是准备GPU数据的过程 - 子图采样和特征检索。本文提出了一个分布式GNN培训系统的BGL,旨在解决一些关键思想的瓶颈。首先,我们提出了一种动态缓存引擎,以最小化特征检索流量。通过协同设计缓存政策和抽样顺序,我们发现低开销和高缓存命中率的精美斑点。其次,我们改善了曲线图分区算法,以减少子图采样期间的交叉分区通信。最后,仔细资源隔离减少了不同数据预处理阶段之间的争用。关于各种GNN模型和大图数据集的广泛实验表明,BGL平均明显优于现有的GNN训练系统20.68倍。
translated by 谷歌翻译
开发用于训练图形的可扩展解决方案,用于链路预测任务的Neural网络(GNNS)由于具有高计算成本和巨大内存占用的高数据依赖性,因此由于高数据依赖性而具有挑战性。我们提出了一种新的方法,用于缩放知识图形嵌入模型的培训,以满足这些挑战。为此,我们提出了以下算法策略:自给自足的分区,基于约束的负采样和边缘迷你批量培训。两者都是分区策略和基于约束的负面采样,避免在训练期间交叉分区数据传输。在我们的实验评估中,我们表明,我们基于GNN的知识图形嵌入模型的缩放解决方案在基准数据集中实现了16倍的加速,同时将可比的模型性能作为标准度量的非分布式方法。
translated by 谷歌翻译
最近,图形卷积网络(GCNS)已成为用于分析非欧几里德图数据的最先进的算法。然而,实现有效的GCN训练,特别是在大图中挑战。原因是许多折叠的原因:1)GCN训练引发了大量的内存占用。大图中的全批量培训甚至需要数百到数千千兆字节的内存,以缓冲中间数据进行反向传播。 2)GCN培训涉及内存密集型数据减少和计算密集型功能/渐变更新操作。这种异构性质挑战当前的CPU / GPU平台。 3)图形的不规则性和复杂的训练数据流共同增加了提高GCN培训系统效率的难度。本文提出了一种混合架构来解决这些挑战的混合架构。具体地,GCNEAR采用基于DIMM的存储系统,提供易于级别的存储器容量。为了匹配异构性质,我们将GCN培训操作分类为内存密集型减少和计算密集型更新操作。然后,我们卸载将操作减少到DIMM NMES,充分利用高聚合的本地带宽。我们采用具有足够计算能力的CAE来处理更新操作。我们进一步提出了几种优化策略来处理GCN任务的不规则,提高GCNEAR的表现。我们还提出了一种多GCNEAR系统来评估GCNEAR的可扩展性。
translated by 谷歌翻译
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs. The large data sizes of graphs and their vertex features make scalable training algorithms and distributed memory systems necessary. Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges. We propose a highly parallel training algorithm that scales to large processor counts. In our solution, the large adjacency and vertex-feature matrices are partitioned among processors. We exploit the vertex-partitioning of the graph to use non-blocking point-to-point communication operations between processors for better scalability. To further minimize the parallelization overheads, we introduce a sparse matrix partitioning scheme based on a hypergraph partitioning model for full-batch training. We also propose a novel stochastic hypergraph model to encode the expected communication volume in mini-batch training. We show the merits of the hypergraph model, previously unexplored for GCN training, over the standard graph partitioning model which does not accurately encode the communication costs. Experiments performed on real-world graph datasets demonstrate that the proposed algorithms achieve considerable speedups over alternative solutions. The optimizations achieved on communication costs become even more pronounced at high scalability with many processors. The performance benefits are preserved in deeper GCNs having more layers as well as on billion-scale graphs.
translated by 谷歌翻译
我们在大图中介绍了图形神经网络(GNNS)的分布式全批量培训的顺序聚合和换算(SAR)方案。最近,GNN的大规模培训是基于非学习消息传递的基于采样的方法和方法主导的。另一方面,SAR是一种分布式技术,可以直接在整个大图上培训任何GNN类型。 SAR中的关键创新是分布式顺序修补方案,其在后向通过期间依次重新构造,然后在后向通行证期间释放禁止的大型GNN计算图。这导致优异的记忆缩放行为,其中每个工作人员的内存消耗与工人的数量线性地下降,即使对于密集连接的图形。使用SAR,我们报告了最大的全批量GNN培训应用到目前为止,并随着工人数量的增加而展示了大的内存节省。我们还基于内核融合和注意力矩阵的一般技术提出了一种优化了基于关注的模型的运行时和内存效率。我们表明,与SAR相结合,我们的优化注意核导致了基于关注的GNN的显着加速和内存节省。
translated by 谷歌翻译
图形神经网络(GNNS)已成为处理机器学习任务的有效方法,它为构建推荐系统带来了一种新方法,其中可以将推荐任务作为用户 - 项目的链接预测问题提出, 。培训基于GNN的推荐系统(GNNRECSYS)在大图上会引起大型内存足迹,很容易超过典型服务器上的DRAM容量。现有的解决方案诉诸分布式子图培训,这是由于动态构建子图和各个子图的大量冗余的高成本而效率低下。新兴的Intel Optane持久记忆使一台机器以可承受的成本具有最多6 TB的存储器,从而使单机器Gnnrecsys训练可行,从而消除了分布式培训中的效率低下。与DRAM相比,将Optane用于Gnnrecsys的一个主要问题是Optane相对较低的带宽。由于其主要的计算内核稀疏且内存访问密集,因此这种限制可能对Gnnrecsys工作量的高性能特别有害。为了了解Optane是否适合Gnnrecsys培训,我们对Gnnrecsys工作负载进行了深入的表征和全面的基准测试研究。我们的基准测试结果表明,经过正确配置后,基于Optane的单机器GNNRECSYS训练优于大幅度的培训,尤其是在处理深度GNN模型时。我们分析了加速度的来源,提供有关如何为GNNRECSYS工作负载配置Optane的指导,并讨论进一步优化的机会。
translated by 谷歌翻译
图形神经网络(GNN)的输入图的大小不断增加,突显了使用多GPU平台的需求。但是,由于计算不平衡和效率较低的通信,现有的多GPU GNN解决方案遭受了劣质性能。为此,我们提出了MGG,这是一种新型的系统设计,可以通过以GPU为中心的软件管道在多GPU平台上加速GNN。 MGG探讨了通过细粒度计算通信管道中隐藏GNN工作负载中远程内存访问延迟的潜力。具体而言,MGG引入了管​​道感知工作负载管理策略和混合数据布局设计,以促进通信局限性重叠。 MGG实现以优化的管道为中心的内核。它包括工作负载交织和基于经经的映射,以进行有效的GPU内核操作管道和专门的内存设计以及优化,以更好地数据访问性能。此外,MGG还结合了轻巧的分析建模和优化启发式方法,以动态提高运行时不同设置的GNN执行性能。全面的实验表明,MGG在各种GNN设置上的最先进的多GPU系统要比最先进的多GPU系统:平均比具有统一虚拟内存设计的多GPU系统快3.65倍,平均比DGCL框架快7.38倍。
translated by 谷歌翻译
本文介绍了一种通过张量 - 训练(TT)分解来更紧凑地表示图形神经网络(GNN)表的新方法。我们考虑(a)缺乏节点特征的图形数据,从而在训练过程中学习嵌入的情况; (b)我们希望利用GPU平台,即使对于大型内存GPU,也需要较小的桌子来减少主机到GPU的通信。 TT的使用实现了嵌入的紧凑参数化,使其足够小,甚至可以完全适合现代GPU,即使是大量图形。当与明智的初始化和分层图分区结合使用时,这种方法可以将嵌入矢量的大小降低1,659次,至81,362次,在大型公开可用的基准数据集中,可以实现可比性或更高的准确性或更高的准确性和在多GPU系统上的显着速度。在某些情况下,我们的模型在输入上没有明确的节点功能甚至可以匹配使用节点功能的模型的准确性。
translated by 谷歌翻译
While many systems have been developed to train Graph Neural Networks (GNNs), efficient model inference and evaluation remain to be addressed. For instance, using the widely adopted node-wise approach, model evaluation can account for up to 94% of the time in the end-to-end training process due to neighbor explosion, which means that a node accesses its multi-hop neighbors. On the other hand, layer-wise inference avoids the neighbor explosion problem by conducting inference layer by layer such that the nodes only need their one-hop neighbors in each layer. However, implementing layer-wise inference requires substantial engineering efforts because users need to manually decompose a GNN model into layers for computation and split workload into batches to fit into device memory. In this paper, we develop Deep Graph Inference (DGI) -- a system for easy and efficient GNN model inference, which automatically translates the training code of a GNN model for layer-wise execution. DGI is general for various GNN models and different kinds of inference requests, and supports out-of-core execution on large graphs that cannot fit in CPU memory. Experimental results show that DGI consistently outperforms layer-wise inference across different datasets and hardware settings, and the speedup can be over 1,000x.
translated by 谷歌翻译
最近,Graph神经网络(GNNS)已成为聚光灯作为强大的工具,可以有效地在图形结构化数据上执行各种推理任务。随着现实图表的大小继续扩展,GNN训练系统面临可扩展性挑战。分布式培训是一种流行的方法,可以通过扩展CPU节点来应对这一挑战。但是,对基于磁盘的GNN培训的关注不多,该培训可以通过利用NVME SSD等高性能存储设备来以更具成本效益的方式扩展单节点系统。我们观察到,主内存和磁盘之间的数据移动是基于SSD的训练系统中的主要瓶颈,并且常规的GNN训练管道是不错的选择,而无需考虑此开销。因此,我们提出了Ginex,这是第一个基于SSD的GNN训练系统,可以在单台计算机上处​​理数十亿个图形数据集。受到编译器优化的检查员执行模型的启发,Ginex通过分开样品和收集阶段来重组GNN训练管道。这种分离使Ginex能够实现一种可证明的最佳替换算法,即被称为Belady的算法,用于存储器中的Caching特征向量,该算法是I/O访问的主要部分。根据我们对40亿尺度图数据集的评估,Ginex平均比SSD扩展的Pytorch几何得出了2.11倍的训练吞吐量(最大最高2.67倍)。
translated by 谷歌翻译
在过去十年中,已经开发出新的深度学习(DL)算法,工作负载和硬件来解决各种问题。尽管工作量和硬件生态系统的进步,DL系统的编程方法是停滞不前的。 DL工作负载从DL库中的高度优化,特定于平台和不灵活的内核,或者在新颖的操作员的情况下,通过具有强大性能的DL框架基元建立参考实现。这项工作介绍了Tensor加工基元(TPP),一个编程抽象,用于高效的DL工作负载的高效,便携式实现。 TPPS定义了一组紧凑而多才多艺的2D张镜操作员(或虚拟张量ISA),随后可以用作构建块,以在高维张量上构建复杂的运算符。 TPP规范是平台 - 不可行的,因此通过TPPS表示的代码是便携式的,而TPP实现是高度优化的,并且特定于平台。我们展示了我们使用独立内核和端到端DL&HPC工作负载完全通过TPPS表达的方法的效力和生存性,这在多个平台上优于最先进的实现。
translated by 谷歌翻译
Dynamic Graph Neural Networks (DGNNs) have been broadly applied in various real-life applications, such as link prediction and pandemic forecast, to capture both static structural information and temporal characteristics from dynamic graphs. Combining both time-dependent and -independent components, DGNNs manifest substantial parallel computation and data reuse potentials, but suffer from severe memory access inefficiency and data transfer overhead under the canonical one-graph-at-a-time training pattern. To tackle the challenges, we propose PiPAD, a $\underline{\textbf{Pi}}pelined$ and $\underline{\textbf{PA}}rallel$ $\underline{\textbf{D}}GNN$ training framework for the end-to-end performance optimization on GPUs. From both the algorithm and runtime level, PiPAD holistically reconstructs the overall training paradigm from the data organization to computation manner. Capable of processing multiple graph snapshots in parallel, PiPAD eliminates the unnecessary data transmission and alleviates memory access inefficiency to improve the overall performance. Our evaluation across various datasets shows PiPAD achieves $1.22\times$-$9.57\times$ speedup over the state-of-the-art DGNN frameworks on three representative models.
translated by 谷歌翻译
图形神经网络(GNNS)在学习从图形结构数据中展示了成功,其中包含欺诈检测,推荐和知识图形推理。然而,培训GNN有效地具有挑战性,因为:1)GPU存储器容量有限,对于大型数据集可能不足,而2)基于图形的数据结构导致不规则的数据访问模式。在这项工作中,我们提供了一种统计分析的方法,并确定了GNN培训前更频繁地访问的数据。我们的数据分层方法不仅利用输入图的结构,而且还从实际GNN训练过程中获得了洞察力,以实现更高的预测结果。通过我们的数据分层方法,我们还提供了一种新的数据放置和访问策略,以进一步最大限度地减少CPU-GPU通信开销。我们还考虑了多GPU GNN培训,我们也展示了我们在多GPU系统中的策略的有效性。评估结果表明,我们的工作将CPU-GPU流量降低了87-95%,并通过数亿节点和数十亿边缘的图表提高了现有解决方案的GNN训练速度。
translated by 谷歌翻译
Deep learning based recommendation models (DLRM) are widely used in several business critical applications. Training such recommendation models efficiently is challenging primarily because they consist of billions of embedding-based parameters which are often stored remotely leading to significant overheads from embedding access. By profiling existing DLRM training, we observe that only 8.5% of the iteration time is spent in forward/backward pass while the remaining time is spent on embedding and model synchronization. Our key insight in this paper is that access to embeddings have a specific structure and pattern which can be used to accelerate training. We observe that embedding accesses are heavily skewed, with almost 1% of embeddings represent more than 92% of total accesses. Further, we observe that during training we can lookahead at future batches to determine exactly which embeddings will be needed at what iteration in the future. Based on these insight, we propose Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation. We designed an Oracle Cacher, a new system component which uses our lookahead algorithm to generate optimal cache update decisions and provide strong consistency guarantees. Our experiments using three datasets and two models shows that our approach provides a speed up of up to 6.2x compared to state of the art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.
translated by 谷歌翻译
近年来,在平衡(超级)图分配算法的设计和评估中取得了重大进展。我们调查了过去十年的实用算法的趋势,用于平衡(超级)图形分区以及未来的研究方向。我们的工作是对先前有关该主题的调查的更新。特别是,该调查还通过涵盖了超图形分区和流算法来扩展先前的调查,并额外关注并行算法。
translated by 谷歌翻译
最近,作为基于图形机器学习的骨干的图形神经网络(GNN)展示了各个域(例如,电子商务)的巨大成功。然而,由于基于高稀疏和不规则的图形操作,GNN的性能通常不令人满意。为此,我们提出,TC-GNN,基于GNN加速框架的第一个GPU张量核心单元(TCU)。核心思想是将“稀疏”GNN计算与“密集”TCU进行调和。具体地,我们对主流GNN计算框架中的稀疏操作进行了深入的分析。我们介绍了一种新颖的稀疏图翻译技术,便于TCU处理稀疏GNN工作量。我们还实现了一个有效的CUDA核心和TCU协作设计,以充分利用GPU资源。我们将TC-GNN与Pytorch框架完全集成,以便于编程。严格的实验在各种GNN型号和数据集设置的最先进的深图库框架上平均显示了1.70倍的加速。
translated by 谷歌翻译
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. Tensor-Flow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, generalpurpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that Tensor-Flow achieves for several real-world applications.
translated by 谷歌翻译