While many systems have been developed to train Graph Neural Networks (GNNs), efficient model inference and evaluation remain to be addressed. For instance, using the widely adopted node-wise approach, model evaluation can account for up to 94% of the time in the end-to-end training process due to neighbor explosion, which means that a node accesses its multi-hop neighbors. On the other hand, layer-wise inference avoids the neighbor explosion problem by conducting inference layer by layer such that the nodes only need their one-hop neighbors in each layer. However, implementing layer-wise inference requires substantial engineering efforts because users need to manually decompose a GNN model into layers for computation and split workload into batches to fit into device memory. In this paper, we develop Deep Graph Inference (DGI) -- a system for easy and efficient GNN model inference, which automatically translates the training code of a GNN model for layer-wise execution. DGI is general for various GNN models and different kinds of inference requests, and supports out-of-core execution on large graphs that cannot fit in CPU memory. Experimental results show that DGI consistently outperforms layer-wise inference across different datasets and hardware settings, and the speedup can be over 1,000x.
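The contrast between node-wise and layer-wise inference described above can be sketched on a toy graph. This is only an illustration under stated assumptions, not DGI's implementation: the graph, the scalar features and weights, the mean-aggregation layer, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph as adjacency lists; every node lists itself (self-loop).
GRAPH: Dict[int, List[int]] = {
    0: [0, 1, 2],
    1: [1, 0, 3],
    2: [2, 0, 3],
    3: [3, 1, 2],
}
FEATURES: Dict[int, float] = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
WEIGHTS: List[float] = [0.5, 2.0]  # one scalar weight per layer (assumption)


def apply_layer(inputs: Dict[int, float], node: int, weight: float) -> float:
    """Mean-aggregate the one-hop neighbors of `node`, scale, then apply ReLU."""
    neighbors = GRAPH[node]
    mean = sum(inputs[n] for n in neighbors) / len(neighbors)
    return max(0.0, weight * mean)


def layer_wise_inference() -> Dict[int, float]:
    """Compute one layer for ALL nodes before moving on to the next layer.

    Each layer reads only one-hop neighbors of the previous layer's output.
    """
    hidden = dict(FEATURES)
    for weight in WEIGHTS:
        hidden = {node: apply_layer(hidden, node, weight) for node in GRAPH}
    return hidden


def node_wise_inference(node: int, depth: int = len(WEIGHTS)) -> Tuple[float, int]:
    """Compute one node's output by recursing into its multi-hop neighborhood.

    Returns (value, number of input-feature reads). The read count grows
    multiplicatively with depth, which is the neighbor explosion.
    """
    if depth == 0:
        return FEATURES[node], 1
    total, reads = 0.0, 0
    for neighbor in GRAPH[node]:
        value, sub_reads = node_wise_inference(neighbor, depth - 1)
        total += value
        reads += sub_reads
    mean = total / len(GRAPH[node])
    return max(0.0, WEIGHTS[depth - 1] * mean), reads


if __name__ == "__main__":
    layer_wise = layer_wise_inference()
    for v in GRAPH:
        value, reads = node_wise_inference(v)
        print(v, layer_wise[v], value, reads)
```

Both routines produce the same outputs; the difference is that the node-wise recursion recomputes shared intermediate results for every target node, while the layer-wise loop computes each intermediate value once.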
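Graph Neural Networks (GNNs) have shown success in learning from graph-structured data, with applications in fraud detection, recommendation, and knowledge graph reasoning. However, training GNNs efficiently is challenging because: 1) GPU memory capacity is limited and can be insufficient for large datasets, and 2) the graph-based data structure causes irregular data access patterns. In this work, we provide a method to statistically analyze and identify the more frequently accessed data ahead of GNN training. Our data tiering method not only utilizes the structure of the input graph, but also draws on insight gained from the actual GNN training process to achieve a higher prediction result. With our data tiering method, we additionally provide a new data placement and access strategy to further minimize the CPU-GPU communication overhead. We also take multi-GPU GNN training into account and demonstrate the effectiveness of our strategy in a multi-GPU system. The evaluation results show that our work reduces CPU-GPU traffic by 87-95% and improves GNN training speed over existing solutions on graphs with hundreds of millions of nodes and billions of edges.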
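Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training large graphs with billions of nodes and edges on GPUs. The main bottleneck is the process of preparing data for GPUs: subgraph sampling and feature retrieval. This paper proposes BGL, a distributed GNN training system designed to address the bottlenecks with a few key ideas. First, we propose a dynamic cache engine to minimize feature retrieval traffic. By co-designing the caching policy and the sampling order, we find a sweet spot of low overhead and high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 20.68x on average.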
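Recently, Graph Neural Networks (GNNs) have been in the spotlight as a powerful tool that can effectively serve various inference tasks on graph-structured data. As the size of real-world graphs continues to grow, GNN training systems face a scalability challenge. Distributed training is a popular approach to address this challenge by scaling out CPU nodes. However, not much attention has been paid to disk-based GNN training, which can scale up a single-node system in a more cost-effective manner by leveraging high-performance storage devices such as NVMe SSDs. We observe that data movement between main memory and disk is the primary bottleneck in an SSD-based training system, and that the conventional GNN training pipeline is sub-optimal because it does not take this overhead into account. Thus, we propose Ginex, the first SSD-based GNN training system that can process billion-scale graph datasets on a single machine. Inspired by the inspector-executor execution model in compiler optimization, Ginex restructures the GNN training pipeline by separating the sample and gather stages. This separation enables Ginex to realize a provably optimal replacement algorithm, known as Belady's algorithm, for caching feature vectors in memory, which account for the dominant portion of I/O accesses. According to our evaluation with four billion-scale graph datasets, Ginex achieves 2.11x higher training throughput on average (up to 2.67x) than SSD-extended PyTorch Geometric.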
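Belady's algorithm, mentioned above, needs the full future access sequence, which is why separating sampling from gathering makes it applicable. A minimal sketch follows, assuming the sequence of feature-vector IDs is already known; the function name and the rule that every missed item is inserted into the cache are assumptions made for illustration, not Ginex's actual code.

```python
from typing import Dict, Hashable, List, Sequence

NEVER = float("inf")  # marker for "this key is not accessed again"


def belady_misses(accesses: Sequence[Hashable], capacity: int) -> int:
    """Count cache misses when evicting the entry reused farthest in the future."""
    # next_use[i] = index of the next access to accesses[i] after position i.
    next_use: List[float] = [NEVER] * len(accesses)
    last_seen: Dict[Hashable, int] = {}
    for i in range(len(accesses) - 1, -1, -1):
        key = accesses[i]
        next_use[i] = last_seen.get(key, NEVER)
        last_seen[key] = i

    cache: Dict[Hashable, float] = {}  # cached key -> index of its next use
    misses = 0
    for i, key in enumerate(accesses):
        if key in cache:
            cache[key] = next_use[i]  # hit: refresh the next-use position
            continue
        misses += 1
        if capacity <= 0:
            continue  # nothing can be cached
        if len(cache) >= capacity:
            # Evict the cached key whose next use lies farthest ahead.
            victim = max(cache, key=lambda k: cache[k])
            del cache[victim]
        cache[key] = next_use[i]
    return misses


if __name__ == "__main__":
    # Hypothetical access trace of feature-vector IDs.
    trace = [1, 2, 3, 1, 2, 4, 1, 2]
    print(belady_misses(trace, capacity=2))
    print(belady_misses(trace, capacity=3))
```

The backward pass that fills `next_use` plays the role of the inspector: it can only run once the whole access sequence has been produced.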
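Graph neural networks (GNNs) have been demonstrated to be a powerful tool for analyzing non-Euclidean graph data. However, the lack of efficient distributed graph learning (GL) systems severely hinders applications of GNNs, especially when graphs are big and GNNs are relatively deep. Herein, we present GraphTheta, a novel distributed and scalable GL system implemented in a vertex-centric graph programming model. GraphTheta is the first GL system built upon distributed graph processing, with neural network operators implemented as user-defined functions. The system supports multiple training strategies and enables highly scalable big-graph learning on distributed (virtual) machines. To facilitate graph convolution implementations, GraphTheta puts forward a new GL abstraction named NN-TGAR to bridge the gap between graph processing and graph deep learning. A distributed graph engine is proposed to conduct stochastic gradient descent optimization with hybrid-parallel execution. Moreover, we add support for a new cluster-batched training strategy in addition to global-batch and mini-batch. We evaluate GraphTheta using a number of datasets with network sizes ranging from small and modest to large scale. Experimental results show that GraphTheta scales well to 1,024 workers for training an in-house developed GNN on an industry-scale Alipay dataset of 1.4 billion nodes and 4.1 billion attributed edges, using a cluster of CPU virtual machines (dockers) with small memory each ($5 \sim 12$GB). Moreover, GraphTheta obtains comparable or better prediction results than state-of-the-art GNN implementations, demonstrating that it learns GNNs as well as existing frameworks, and can outperform them by up to $2.02\times$ with better scalability. To the best of our knowledge, this work presents the largest edge-attributed GNN learning task in the literature.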
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training remain preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
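The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for multi-GPU platforms. However, existing multi-GPU GNN solutions suffer from inferior performance due to imbalanced computation and inefficient communication. To this end, we propose MGG, a novel system design that accelerates GNNs on multi-GPU platforms via a GPU-centric software pipeline. MGG explores the potential of hiding remote memory access latency in GNN workloads through fine-grained computation-communication pipelining. Specifically, MGG introduces a pipeline-aware workload management strategy and a hybrid data layout design to facilitate communication-computation overlapping. MGG implements an optimized pipeline-centric kernel. It includes workload interleaving and warp-based mapping for efficient GPU kernel operation pipelining, and specialized memory designs and optimizations for better data access performance. In addition, MGG incorporates lightweight analytical modeling and optimization heuristics to dynamically improve GNN execution performance for different settings at runtime. Comprehensive experiments demonstrate that MGG outperforms state-of-the-art multi-GPU systems across various GNN settings: on average 3.65x faster than multi-GPU systems with a unified virtual memory design and on average 7.38x faster than the DGCL framework.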
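Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, have demonstrated great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory due to highly sparse and irregular graph-based operations. To this end, we propose TC-GNN, the first GNN acceleration framework based on GPU Tensor Core Units (TCUs). The core idea is to reconcile the "sparse" GNN computation with the "dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique to facilitate TCU processing of sparse GNN workloads. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the PyTorch framework for ease of programming. Rigorous experiments show an average 1.70x speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.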
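TensorFlow GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occur in today's information ecosystems. Many production models at Google use TF-GNN, and it has recently been released as an open-source project. In this paper, we describe the TF-GNN data model, its Keras modeling API, and relevant capabilities such as graph sampling, distributed training, and accelerator support.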
Graph Neural Networks (GNNs) are a class of neural networks designed to extract information from the graphical structure of data. Graph Convolutional Networks (GCNs) are a widely used type of GNN for transductive graph learning problems which apply convolution to learn information from graphs. GCN is a challenging algorithm from an architecture perspective due to inherent sparsity, low data reuse, and massive memory capacity requirements. Traditional neural algorithms exploit the high compute capacity of GPUs to achieve high performance for both inference and training. The architectural decision to use a GPU for GCN inference is a question explored in this work. GCN on both CPU and GPU was characterized in order to better understand the implications of graph size, embedding dimension, and sampling on performance.
Training Graph Neural Networks, on graphs containing billions of vertices and edges, at scale using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume and potential performance loss. DistGNN-MB employs a novel Historical Embedding Cache combined with compute-communication overlap to address this challenge. On a 32-node (64-socket) cluster of $3^{rd}$ generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively, on 32 compute nodes. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely-used DistDGL. DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively, as compute nodes scale from 2 to 32.
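Many real-world graphs contain time-domain information. Temporal Graph Neural Networks capture temporal information, as well as structural and contextual information, in the generated dynamic node embeddings. Researchers have shown that these embeddings achieve state-of-the-art performance in many different tasks. In this work, we propose TGL, a unified framework for large-scale offline Temporal Graph Neural Network training, in which users can compose various Temporal Graph Neural Networks with simple configuration files. TGL comprises five main components: a temporal sampler, a mailbox, a node memory module, a memory updater, and a message passing engine. We design a Temporal-CSR data structure and a parallel sampler to efficiently sample temporal neighbors to form training mini-batches. We propose a novel random chunk scheduling technique that mitigates the problem of obsolete node memory when training with a large batch size. To address the limitation that current TGNNs are evaluated only on small-scale datasets, we introduce two large-scale real-world datasets with 0.2 and 1.3 billion temporal edges. We evaluate the performance of TGL on four small-scale datasets with a single GPU and on the two large datasets with multiple GPUs, for both link prediction and node classification tasks. We compare TGL with the open-source code of five methods and show that TGL achieves similar or better accuracy with an average 13x speedup. Our temporal parallel sampler achieves an average 173x speedup on a multi-core CPU compared with the baselines. On a 4-GPU machine, TGL can train one epoch of more than one billion temporal edges within 1-10 hours. To the best of our knowledge, this is the first work that proposes a general framework for large-scale Temporal Graph Neural Network training on multiple GPUs.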
Graph convolutional network (GCN) has been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that grows exponentially with the number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of each node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as follows: at each step, it samples a block of nodes that associate with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while being able to achieve comparable test accuracy with previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M data with 2 million nodes and 61 million edges, which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs 1961 seconds) and uses much less memory (2.2GB vs 11.2GB). Furthermore, for training a 4-layer GCN on this data, our algorithm can finish in around 36 minutes while all the existing GCN training algorithms fail to train due to the out-of-memory issue. Furthermore, Cluster-GCN allows us to train much deeper GCN without much time and memory overhead, which leads to improved prediction accuracy: using a 5-layer Cluster-GCN, we achieve state-of-the-art test F1 score 99.36 on the PPI dataset, while the previous best result was 98.71 by [16]. Our code is publicly available at https://github.com/google-research/google-research/tree/master/cluster_gcn.
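The batching step described above (pick a cluster, keep only edges inside it) can be sketched as follows. This is a simplified illustration under assumptions: the clusters are written by hand rather than produced by a graph clustering algorithm, each batch uses a single cluster, and the function names are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Sequence

Adjacency = Dict[int, List[int]]


def induced_subgraph(adj: Adjacency, nodes: Sequence[int]) -> Adjacency:
    """Keep only the edges whose endpoints both lie inside `nodes`."""
    keep = set(nodes)
    return {v: [u for u in adj[v] if u in keep] for v in nodes}


def cluster_batches(
    adj: Adjacency, clusters: Sequence[Sequence[int]], seed: int = 0
) -> Iterator[Adjacency]:
    """Yield one induced subgraph per cluster, visiting clusters in random order."""
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)
    for index in order:
        yield induced_subgraph(adj, clusters[index])


# Hypothetical graph: two triangles joined by the single edge 2-3.
ADJ: Adjacency = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}
CLUSTERS = [[0, 1, 2], [3, 4, 5]]  # assumed output of a clustering step

if __name__ == "__main__":
    for batch in cluster_batches(ADJ, CLUSTERS):
        print(batch)
```

Because neighbor lookups never leave the batch, the memory needed per step is bounded by the cluster size; the cost is that between-cluster edges such as 2-3 are ignored within a batch.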
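Training graph neural networks (GNNs) for large-scale node classification is challenging. A key difficulty lies in obtaining accurate hidden node representations while avoiding the neighborhood explosion problem. Here, we propose a new technique, named feature momentum (FM), that uses a momentum step to incorporate historical embeddings when updating feature representations. We develop two specific algorithms, known as GraphFM-IB and GraphFM-OB, that consider in-batch and out-of-batch data, respectively. GraphFM-IB applies FM to in-batch sampled data, while GraphFM-OB applies FM to out-of-batch data that are in the 1-hop neighborhood of in-batch data. We provide a rigorous convergence analysis for GraphFM-IB and theoretical insight into GraphFM-OB regarding the estimation error of feature embeddings. Empirically, we observe that GraphFM-IB can effectively alleviate the neighborhood explosion problem of existing methods. In addition, GraphFM-OB achieves promising performance on multiple large-scale graph datasets.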
Using graph neural networks for large graphs is challenging since there is no clear way of constructing mini-batches. To solve this, previous methods have relied on sampling or graph clustering. While these approaches often lead to good training convergence, they introduce significant overhead due to expensive random data accesses and perform poorly during inference. In this work we instead focus on model behavior during inference. We theoretically model batch construction via maximizing the influence score of nodes on the outputs. This formulation leads to optimal approximation of the output when we do not have knowledge of the trained model. We call the resulting method influence-based mini-batching (IBMB). IBMB accelerates inference by up to 130x compared to previous methods that reach similar accuracy. Remarkably, with adaptive optimization and the right training schedule IBMB can also substantially accelerate training, thanks to precomputed batches and consecutive memory accesses. This results in up to 18x faster training per epoch and up to 17x faster convergence per runtime compared to previous methods.
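Recently, Graph Convolutional Networks (GCNs) have become state-of-the-art algorithms for analyzing non-Euclidean graph data. However, realizing efficient GCN training is challenging, especially on large graphs. The reasons are manifold: 1) GCN training incurs a substantial memory footprint. Full-batch training on large graphs can require hundreds to thousands of gigabytes of memory to buffer the intermediate data for back-propagation. 2) GCN training involves both memory-intensive data reduction and computation-intensive feature/gradient update operations. This heterogeneous nature challenges current CPU/GPU platforms. 3) The irregularity of graphs and the complex training dataflow jointly increase the difficulty of improving the efficiency of a GCN training system. This paper presents GCNear, a hybrid architecture to tackle these challenges. Specifically, GCNear adopts a DIMM-based memory system to provide easy-to-scale memory capacity. To match the heterogeneous nature, we categorize GCN training operations as memory-intensive Reduce operations and computation-intensive Update operations. We then offload Reduce operations to on-DIMM NMEs, making full use of the high aggregated local bandwidth. We adopt a CAE with sufficient computation capacity to process Update operations. We further propose several optimization strategies to deal with the irregularity of GCN tasks and improve GCNear's performance. We also propose a Multi-GCNear system to evaluate the scalability of GCNear.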
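Subgraph-based graph representation learning (SGRL) has recently been proposed to deal with some fundamental challenges encountered by canonical graph neural networks (GNNs), and has demonstrated advantages in many important data science applications such as link, relation, and motif prediction. However, current SGRL approaches suffer from scalability issues since they require extracting subgraphs for each training or test query. Recent solutions that scale up canonical GNNs may not apply to SGRL. Here, we propose a novel framework, SUREL, for scalable SGRL by co-designing the learning algorithm and its system support. SUREL adopts walk-based decomposition of subgraphs and reuses the walks to form subgraphs, which substantially reduces the redundancy of subgraph extraction and supports parallel computation. Experiments over six homogeneous, heterogeneous, and higher-order graphs with millions of nodes and edges demonstrate the effectiveness and scalability of SUREL. In particular, compared to SGRL baselines, SUREL achieves a 10$\times$ speed-up with comparable or even better prediction performance; compared to canonical GNNs, SUREL improves prediction accuracy by 50%.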
Graph neural networks (GNNs) have received great attention due to their success in various graph-related learning tasks. Several GNN frameworks have then been developed for fast and easy implementation of GNN models. Despite their popularity, they are not well documented, and their implementations and system performance have not been well understood. In particular, unlike the traditional GNNs that are trained based on the entire graph in a full-batch manner, recent GNNs have been developed with different graph sampling techniques for mini-batch training of GNNs on large graphs. While they improve the scalability, their training times still depend on the implementations in the frameworks as sampling and its associated operations can introduce non-negligible overhead and computational cost. In addition, it is unknown how much the frameworks are 'eco-friendly' from a green computing perspective. In this paper, we provide an in-depth study of two mainstream GNN frameworks along with three state-of-the-art GNNs to analyze their performance in terms of runtime and power/energy consumption. We conduct extensive benchmark experiments at several different levels and present detailed analysis results and observations, which could be helpful for further improvement and optimization.
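Recently, graph neural networks (GNNs) have shown prominent performance in graph representation learning by leveraging knowledge from both graph structure and node features. However, most of them have two major limitations. First, GNNs can learn higher-order structural information by stacking more layers, but cannot handle large depth due to the over-smoothing issue. Second, it is not easy to apply these methods to large graphs due to the expensive computation cost and high memory usage. In this paper, we present node-adaptive feature smoothing (NAFS), a simple non-parametric method that constructs node representations without parameter learning. NAFS first extracts the features of each node together with its neighbors at different hops by feature smoothing, and then adaptively combines the smoothed features. In addition, the constructed node representation can be further enhanced by an ensemble of smoothed features extracted via different smoothing strategies. We conduct experiments on four benchmark datasets in two different application scenarios: node clustering and link prediction. Remarkably, NAFS with feature ensemble outperforms state-of-the-art GNNs on these tasks and mitigates the aforementioned two limitations of most learning-based GNN counterparts.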
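We present the Sequential Aggregation and Rematerialization (SAR) scheme for distributed full-batch training of Graph Neural Networks (GNNs) on large graphs. Large-scale training of GNNs has recently been dominated by sampling-based methods and methods based on non-learnable message passing. SAR, on the other hand, is a distributed technique that can train any GNN type directly on an entire large graph. The key innovation in SAR is the distributed sequential rematerialization scheme, which sequentially reconstructs and then frees pieces of the prohibitively large GNN computational graph during the backward pass. This results in excellent memory scaling behavior, where the memory consumption per worker goes down linearly with the number of workers, even for densely connected graphs. Using SAR, we report the largest applications of full-batch GNN training to date, and demonstrate large memory savings as the number of workers increases. We also present a general technique based on kernel fusion and the attention matrix to optimize both the runtime and memory efficiency of attention-based models. We show that, coupled with SAR, our optimized attention kernels lead to significant speedups and memory savings in attention-based GNNs.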