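Sampling is a critical operation in graph neural network (GNN) training that helps reduce cost. Previous literature has explored improving sampling algorithms through mathematical and statistical methods. However, there is a gap between sampling algorithms and hardware. Without considering hardware, algorithm designers optimize sampling only at the algorithm level and miss the great potential of improving the efficiency of existing sampling algorithms by leveraging hardware features. In this paper, we first propose a unified programming model for mainstream sampling algorithms, termed GNNSampler, which covers the critical processes of sampling algorithms in various categories. Second, to leverage hardware features, we choose data locality as a case study and explore the data locality among nodes and their neighbors in a graph to alleviate irregular memory access in sampling. Third, we implement locality-aware optimizations in GNNSampler for various sampling algorithms to optimize the general sampling process. Finally, we conduct experiments on large graph datasets to analyze the correlation among training time, accuracy, and hardware-level metrics. Extensive experiments show that our method applies to mainstream sampling algorithms and substantially reduces training time, especially on large-scale graphs.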
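Large-scale graphs are ubiquitous in real-world scenarios and can be trained on with graph neural networks (GNNs) to generate representations for downstream tasks. Given the abundant information and complex topology of a large-scale graph, we argue that redundancy exists in such graphs and degrades training efficiency. Unfortunately, limited model scalability severely restricts the efficiency of training on large-scale graphs with vanilla GNNs. Despite recent advances in sampling-based training methods, sampling-based GNNs generally overlook the redundancy issue, and training these models on large-scale graphs still takes an intolerable amount of time. We therefore propose to drop redundancy and improve the efficiency of training on large-scale graphs with GNNs by rethinking the inherent characteristics of a graph. In this paper, we propose a once-for-all method, termed DropReef, to drop the redundancy in large-scale graphs. Specifically, we first conduct preliminary experiments to explore potential redundancy in large-scale graphs. Next, we present a metric to quantify the neighbor heterophily of all nodes in a graph. Based on both experimental and theoretical analysis, we reveal the redundancy in a large-scale graph, namely nodes with high neighbor heterophily and a large number of neighbors. We then propose DropReef to detect and drop the redundancy in large-scale graphs once and for all, which helps reduce training time without sacrificing model accuracy. To demonstrate the effectiveness of DropReef, we apply it to recent state-of-the-art sampling-based GNNs for training on large-scale graphs, chosen for their high accuracy. With DropReef, the training efficiency of these models improves greatly. DropReef is highly compatible and is performed offline, so it benefits both present and future state-of-the-art sampling-based GNNs to a significant extent.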
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training, which distributes the training workload across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training are still only preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating the various optimization techniques used in it. First, distributed GNN training approaches are classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work, are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
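Graph convolutional networks (GCNs) have emerged as the state-of-the-art graph learning model. However, running GCN inference over large graph datasets can be notoriously challenging, which limits their application to large real-world graphs and hinders the exploration of deeper and more sophisticated GCNs. This is because real-world graphs can be extremely large and sparse. Furthermore, node degrees tend to follow a power-law distribution, so the adjacency matrices are highly irregular, which causes prohibitive inefficiencies in both data processing and data movement and thus substantially limits the achievable GCN acceleration efficiency. To this end, this paper proposes a GCN algorithm and accelerator co-design framework dubbed GCoD, which can largely alleviate the aforementioned GCN irregularity and boost GCN inference efficiency. Specifically, on the algorithm level, GCoD integrates a split-and-conquer GCN training strategy that polarizes the graph to be either denser or sparser in local neighborhoods without compromising model accuracy. The resulting adjacency matrices (mostly) have only two levels of workload and enjoy largely enhanced regularity, which makes them easier to accelerate. On the hardware level, we further develop a dedicated two-pronged accelerator with a separate engine for each of the denser and sparser workloads, further boosting overall utilization and acceleration efficiency. Extensive experiments and ablation studies validate that GCoD consistently achieves speedups of 15286x, 294x, 7.8x, and 2.5x compared with CPUs, GPUs, and the prior-art GCN accelerators HyGCN and AWB-GCN, respectively, while maintaining or even improving task accuracy.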
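Graph neural networks (GNNs) have been shown to be a powerful tool for analyzing non-Euclidean graph data. However, the lack of efficient distributed graph learning (GL) systems severely hinders applications of GNNs, especially when graphs are large and GNNs are relatively deep. Here we present GraphTheta, a novel distributed and scalable GL system implemented in a vertex-centric graph programming model. GraphTheta is the first GL system built on distributed graph processing with neural network operators implemented as user-defined functions. The system supports multiple training strategies and enables highly scalable learning on large graphs across distributed (virtual) machines. To facilitate graph convolution implementations, GraphTheta puts forward a new GL abstraction named NN-TGAR to bridge the gap between graph processing and graph deep learning. A distributed graph engine is proposed to carry out stochastic gradient descent optimization with hybrid-parallel execution. Moreover, we add support for a new cluster-batched training strategy in addition to global-batch and mini-batch training. We evaluate GraphTheta on a number of datasets with network sizes ranging from small and modest to large scale. Experimental results show that GraphTheta scales well to 1,024 workers when training an in-house developed GNN on an industry-scale Alipay dataset of 1.4 billion nodes and 4.1 billion attributed edges, using a cluster of CPU virtual machines (dockers) with small memory each (5$\sim$12GB). Moreover, GraphTheta obtains comparable or better prediction results than state-of-the-art GNN implementations, demonstrating that it learns GNNs as well as existing frameworks do, and it can outperform them by up to $2.02\times$ with better scalability. To the best of our knowledge, this work presents the largest edge-attributed GNN learning task in the literature.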
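Recently, graph neural networks (GNNs) have been in the spotlight as a powerful tool that can effectively perform various inference tasks on graph-structured data. As the size of real-world graphs continues to grow, GNN training systems face a scalability challenge. Distributed training is a popular approach to this challenge that scales out CPU nodes. However, little attention has been paid to disk-based GNN training, which can scale up a single-node system more cost-effectively by leveraging high-performance storage devices such as NVMe SSDs. We observe that data movement between main memory and disk is the primary bottleneck in an SSD-based training system, and that the conventional GNN training pipeline is sub-optimal because it does not take this overhead into account. We therefore propose Ginex, the first SSD-based GNN training system that can process billion-scale graph datasets on a single machine. Inspired by the inspector-executor execution model in compiler optimization, Ginex restructures the GNN training pipeline by separating the sample and gather stages. This separation enables Ginex to realize a provably optimal replacement algorithm, known as Belady's algorithm, for caching feature vectors in memory, which account for the dominant portion of I/O accesses. According to our evaluation on four billion-scale graph datasets, Ginex achieves 2.11x higher training throughput on average (up to 2.67x) than SSD-extended PyTorch Geometric.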
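As a rough illustration of why separating sampling from gathering matters: once the full sequence of future feature accesses is known, Belady's farthest-in-future eviction becomes possible. The sketch below is a toy version in Python, not Ginex's implementation; the function name `belady_misses` and the access trace of node IDs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


def belady_misses(accesses, capacity):
    """Count cache misses under Belady's (farthest-in-future) replacement.

    `accesses` is the complete, known-in-advance sequence of keys
    (here: node IDs whose feature vectors would be fetched).
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    # For every key, record the positions at which it will be accessed.
    positions = defaultdict(deque)
    for i, key in enumerate(accesses):
        positions[key].append(i)

    def next_use(k):
        # Position of the next access to k, or infinity if never used again.
        return positions[k][0] if positions[k] else float("inf")

    cache = set()
    misses = 0
    for key in accesses:
        positions[key].popleft()  # consume the current access
        if key in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            # Evict the cached key whose next use lies farthest in the future.
            cache.remove(max(cache, key=next_use))
        cache.add(key)
    return misses


trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(belady_misses(trace, capacity=3))
```

The policy needs the whole future trace, which is why it is usually only a theoretical baseline; knowing the sampled node IDs before the gather stage is what makes it usable in this setting.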
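Graph neural networks (GNNs) have shown convincing performance in learning powerful node representations that preserve both node attributes and graph structural information. However, many GNNs run into effectiveness and efficiency problems when they are designed with deeper network structures or must handle large graphs. Several sampling algorithms have been proposed to improve and accelerate GNN training, yet they neglect to explain where the GNN performance gain comes from. Measuring the information within graph data can help sampling algorithms keep high-value information while removing redundant information and even noise. In this paper, we propose a Metric-Guided (MeGuide) subgraph learning framework for GNNs. MeGuide employs two novel metrics, Feature Smoothness and Connection Failure Distance, to guide subgraph sampling and mini-batch training. Feature Smoothness is designed to analyze node features in order to retain the most valuable information, while Connection Failure Distance measures structural information to control the size of subgraphs. We demonstrate the effectiveness and efficiency of MeGuide in training various GNNs on multiple datasets.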
Graph Convolutional Networks (GCNs) are powerful models for learning representations of attributed graphs. To scale GCNs to large graphs, state-of-the-art methods use various layer sampling techniques to alleviate the "neighbor explosion" problem during minibatch training. We propose GraphSAINT, a graph sampling based inductive learning method that improves training efficiency and accuracy in a fundamentally different way. By changing perspective, GraphSAINT constructs minibatches by sampling the training graph, rather than the nodes or edges across GCN layers. In each iteration, a complete GCN is built from the properly sampled subgraph. Thus, we ensure a fixed number of well-connected nodes in all layers. We further propose a normalization technique to eliminate bias, and sampling algorithms for variance reduction. Importantly, we can decouple the sampling from the forward and backward propagation, and extend GraphSAINT with many architecture variants (e.g., graph attention, jumping connection). GraphSAINT demonstrates superior performance in both accuracy and training time on five large graphs, and achieves new state-of-the-art F1 scores for PPI (0.995) and Reddit (0.970).
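The idea of building each minibatch from a sampled subgraph, rather than sampling neighbors layer by layer, can be sketched in a few lines. This is a deliberately simplified illustration that assumes uniform node sampling over an adjacency-list graph; the samplers and the bias-eliminating normalization described in the abstract are not reproduced, and `sample_induced_subgraph` is a hypothetical name.

```python
import random


def sample_induced_subgraph(adj, budget, rng):
    """Pick `budget` nodes uniformly at random and return the induced subgraph.

    `adj` maps each node to a list of its neighbors. Uniform sampling is an
    assumption made for this sketch only.
    """
    nodes = rng.sample(sorted(adj), budget)
    keep = set(nodes)
    # Keep only edges whose two endpoints were both sampled, so a complete
    # GCN can be run on the subgraph with the same node set in every layer.
    return {u: [v for v in adj[u] if v in keep] for u in nodes}


# Toy graph: a ring of 10 nodes.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
subgraph = sample_induced_subgraph(ring, budget=4, rng=random.Random(0))
print(subgraph)
```

Because the node set is fixed before any layer runs, the per-iteration cost depends on the subgraph size rather than on the number of layers.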
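Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, have demonstrated great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory because graph-based operations are highly sparse and irregular. To this end, we propose TC-GNN, the first GNN acceleration framework based on GPU Tensor Core Units (TCUs). The core idea is to reconcile the "sparse" GNN computation with the "dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique that makes it easier for TCUs to process sparse GNN workloads. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the PyTorch framework for ease of programming. Rigorous experiments show an average 1.70x speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.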
Using graph neural networks for large graphs is challenging since there is no clear way of constructing mini-batches. To solve this, previous methods have relied on sampling or graph clustering. While these approaches often lead to good training convergence, they introduce significant overhead due to expensive random data accesses and perform poorly during inference. In this work we instead focus on model behavior during inference. We theoretically model batch construction via maximizing the influence score of nodes on the outputs. This formulation leads to optimal approximation of the output when we do not have knowledge of the trained model. We call the resulting method influence-based mini-batching (IBMB). IBMB accelerates inference by up to 130x compared to previous methods that reach similar accuracy. Remarkably, with adaptive optimization and the right training schedule IBMB can also substantially accelerate training, thanks to precomputed batches and consecutive memory accesses. This results in up to 18x faster training per epoch and up to 17x faster convergence per runtime compared to previous methods.
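Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training on large graphs with billions of nodes and edges using GPUs. The main bottleneck is the process of preparing data for the GPUs: subgraph sampling and feature retrieval. This paper proposes BGL, a distributed GNN training system designed to address these bottlenecks with a few key ideas. First, we propose a dynamic cache engine to minimize feature retrieval traffic. By co-designing the caching policy and the sampling order, we find a sweet spot of low overhead and high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 20.68x on average.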
Graph convolutional network (GCN) has been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that grows exponentially with the number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of each node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as follows: at each step, it samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search to this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while achieving test accuracy comparable to previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M dataset with 2 million nodes and 61 million edges, which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs 1961 seconds) and uses much less memory (2.2GB vs 11.2GB). Furthermore, for training a 4-layer GCN on this data, our algorithm can finish in around 36 minutes, while all the existing GCN training algorithms fail to train due to the out-of-memory issue. Finally, Cluster-GCN allows us to train much deeper GCNs without much time and memory overhead, which leads to improved prediction accuracy: using a 5-layer Cluster-GCN, we achieve a state-of-the-art test F1 score of 99.36 on the PPI dataset, while the previous best result was 98.71 by [16]. Our code is publicly available at https://github.com/google-research/google-research/tree/master/cluster_gcn.
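To make the cost argument concrete, the sketch below compares the set of nodes a multi-layer GCN would touch when neighborhoods expand freely with the set it touches when the search is restricted to one cluster. It is a toy illustration only: the cluster assignment is given by hand rather than produced by a graph clustering algorithm, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(adj, seeds, hops, allowed=None):
    """Return all nodes reachable from `seeds` within `hops` hops.

    If `allowed` is given, neighbors outside that set are ignored, which
    mimics restricting the neighborhood search to a single cluster.
    """
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        next_frontier = set()
        for u in frontier:
            for v in adj[u]:
                if (allowed is None or v in allowed) and v not in visited:
                    next_frontier.add(v)
        visited |= next_frontier
        frontier = next_frontier
    return visited


# Toy graph: a ring of 8 nodes, split by hand into two clusters.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
cluster = {0, 1, 2, 3}

print(sorted(receptive_field(ring, [0], hops=2)))                   # unrestricted
print(sorted(receptive_field(ring, [0], hops=2, allowed=cluster)))  # within cluster
```

However many hops (layers) are used, the restricted receptive field can never exceed the cluster size, which is the source of the memory bound the abstract describes.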
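Graph neural networks (GNNs) have shown success in learning from graph-structured data, with applications to fraud detection, recommendation, and knowledge graph reasoning. However, training GNNs efficiently is challenging because 1) GPU memory capacity is limited and can be insufficient for large datasets, and 2) the graph-based data structure causes irregular data access patterns. In this work, we provide a method to statistically analyze and identify the more frequently accessed data ahead of GNN training. Our data tiering method uses not only the structure of the input graph but also insights gained from the actual GNN training process to achieve better prediction results. With our data tiering method, we additionally provide a new data placement and access strategy to further minimize CPU-GPU communication overhead. We also consider multi-GPU GNN training and demonstrate the effectiveness of our strategy in a multi-GPU system. The evaluation results show that our work reduces CPU-GPU traffic by 87-95% and improves GNN training speed over existing solutions on graphs with hundreds of millions of nodes and billions of edges.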
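Graph neural networks (GNNs) have achieved great success in many graph-based applications. However, the enormous size and high sparsity of graphs hinder their application in industrial scenarios. Although some scalable GNNs have been proposed for large-scale graphs, they adopt a fixed $k$-hop neighborhood for each node and therefore face the over-smoothing issue when large propagation depths are used for nodes within sparse regions. To tackle this issue, we propose a new GNN architecture, Graph Attention Multi-Layer Perceptron (GAMLP), which can capture the underlying correlations between different scales of graph knowledge. We have deployed GAMLP with the Angel platform, and we further evaluate GAMLP on both real-world datasets and large-scale industrial datasets. Extensive experiments on these 14 graph datasets demonstrate that GAMLP achieves state-of-the-art performance while enjoying high scalability and efficiency. Specifically, it improves predictive accuracy by 1.3\% on our large-scale Tencent Video dataset while achieving up to $50\times$ training speedup. In addition, it ranks first on the leaderboards of the largest homogeneous and heterogeneous graphs (i.e., ogbn-papers100M and ogbn-mag) of the Open Graph Benchmark.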
Graph neural networks (GNNs) have received great attention due to their success in various graph-related learning tasks. Several GNN frameworks have then been developed for fast and easy implementation of GNN models. Despite their popularity, they are not well documented, and their implementations and system performance have not been well understood. In particular, unlike the traditional GNNs that are trained based on the entire graph in a full-batch manner, recent GNNs have been developed with different graph sampling techniques for mini-batch training of GNNs on large graphs. While they improve the scalability, their training times still depend on the implementations in the frameworks as sampling and its associated operations can introduce non-negligible overhead and computational cost. In addition, it is unknown how much the frameworks are 'eco-friendly' from a green computing perspective. In this paper, we provide an in-depth study of two mainstream GNN frameworks along with three state-of-the-art GNNs to analyze their performance in terms of runtime and power/energy consumption. We conduct extensive benchmark experiments at several different levels and present detailed analysis results and observations, which could be helpful for further improvement and optimization.
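Graph convolutional networks (GCNs) have achieved impressive empirical advances across a variety of semi-supervised node classification tasks. Despite this success, training GCNs on large graphs suffers from computational and memory issues. A potential path around these obstacles is sampling-based methods, in which a subset of nodes is sampled at each layer. Although recent studies have demonstrated the effectiveness of sampling-based methods, these works lack theoretical convergence guarantees under realistic settings and cannot fully leverage the information of evolving parameters during optimization. In this paper, we describe and analyze a general doubly variance reduction schema that can accelerate any sampling method under a memory budget. The motivation for the proposed schema is a careful analysis of the variance of sampling methods, which shows that the induced variance can be decomposed into a node embedding approximation variance (zeroth-order variance) during forward propagation and a first-order variance during backward propagation. We theoretically analyze the convergence of the proposed schema and show that it enjoys an $\mathcal{O}(1/T)$ convergence rate. We complement our theoretical results by integrating the proposed schema into different sampling methods and applying them to different large real-world graphs.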
While many systems have been developed to train Graph Neural Networks (GNNs), efficient model inference and evaluation remain to be addressed. For instance, using the widely adopted node-wise approach, model evaluation can account for up to 94% of the time in the end-to-end training process due to neighbor explosion, which means that a node accesses its multi-hop neighbors. On the other hand, layer-wise inference avoids the neighbor explosion problem by conducting inference layer by layer such that the nodes only need their one-hop neighbors in each layer. However, implementing layer-wise inference requires substantial engineering efforts because users need to manually decompose a GNN model into layers for computation and split workload into batches to fit into device memory. In this paper, we develop Deep Graph Inference (DGI) -- a system for easy and efficient GNN model inference, which automatically translates the training code of a GNN model for layer-wise execution. DGI is general for various GNN models and different kinds of inference requests, and supports out-of-core execution on large graphs that cannot fit in CPU memory. Experimental results show that DGI consistently outperforms layer-wise inference across different datasets and hardware settings, and the speedup can be over 1,000x.
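The layer-wise scheme described above can be sketched as follows: every node's output for one layer is computed before the next layer starts, so each step reads only one-hop neighbors. This is a minimal sketch with scalar features and a made-up averaging layer, not DGI's code; `layerwise_inference` and `mean_layer` are hypothetical names, and batching to fit device memory is omitted.

```python
def mean_layer(self_value, neighbor_values):
    """Toy layer: average a node's value with the mean of its neighbors."""
    if not neighbor_values:
        return self_value
    return 0.5 * self_value + 0.5 * sum(neighbor_values) / len(neighbor_values)


def layerwise_inference(adj, features, layers):
    """Apply each layer to every node before moving on to the next layer."""
    h = dict(features)
    for layer in layers:
        # Each node reads only the previous layer's values of its one-hop
        # neighbors, so no multi-hop neighborhood is ever materialized.
        h = {u: layer(h[u], [h[v] for v in adj[u]]) for u in adj}
    return h


# Toy graph: a path 0 - 1 - 2 with scalar input features.
path = {0: [1], 1: [0, 2], 2: [1]}
features = {0: 0.0, 1: 2.0, 2: 4.0}
print(layerwise_inference(path, features, [mean_layer, mean_layer]))
```

The trade-off is that intermediate values for all nodes must be stored between layers, which is where splitting the workload into batches comes in for graphs that do not fit in memory.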
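Social bots are automated accounts on social networks that attempt to behave like humans. Although graph neural networks (GNNs) have been widely applied to social bot detection, state-of-the-art approaches rely heavily on domain expertise and prior knowledge to design a dedicated neural network architecture for a specific classification task. However, involving too many nodes and network layers in the model design usually causes the over-smoothing problem and a lack of embedding discrimination. In this paper, we propose RoSGAS, a novel Reinforced and Self-supervised GNN Architecture Search framework that adaptively pinpoints the most suitable multi-hop neighborhood and the number of layers in the GNN architecture. More specifically, we treat social bot detection as a user-centric subgraph embedding and classification task. We use a heterogeneous information network to represent user connectivity by leveraging account metadata, relationships, behavioral features, and content features. RoSGAS uses a multi-agent deep reinforcement learning (RL) mechanism to navigate the search for the optimal neighborhood and network layers so that the subgraph embedding is learned individually for each target user. A nearest neighbor mechanism is developed to accelerate the RL training process, and RoSGAS learns more discriminative subgraph embeddings with the aid of self-supervised learning. Experiments on 5 Twitter datasets show that RoSGAS outperforms state-of-the-art approaches in accuracy, training efficiency, and stability, and generalizes better when handling unseen samples.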
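The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for multi-GPU platforms. However, existing multi-GPU GNN solutions suffer from inferior performance due to imbalanced computation and inefficient communication. To this end, we propose MGG, a novel system design that accelerates GNNs on multi-GPU platforms via a GPU-centric software pipeline. MGG explores the potential of hiding remote memory access latency in GNN workloads through fine-grained computation-communication pipelining. Specifically, MGG introduces a pipeline-aware workload management strategy and a hybrid data layout design to facilitate communication-computation overlap. MGG implements an optimized pipeline-centric kernel, which includes workload interleaving and warp-based mapping for efficient pipelining of GPU kernel operations, along with specialized memory designs and optimizations for better data access performance. In addition, MGG incorporates lightweight analytical modeling and optimization heuristics to dynamically improve GNN execution performance for different settings at runtime. Comprehensive experiments demonstrate that MGG outperforms state-of-the-art multi-GPU systems across various GNN settings: it is on average 3.65x faster than multi-GPU systems with a unified virtual memory design and on average 7.38x faster than the DGCL framework.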
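Graph neural networks (GNNs) tend to suffer from high computation costs because the scale of graph data and the number of model parameters are growing exponentially, which restricts their utility in practical applications. To this end, some recent works focus on sparsifying GNNs with the lottery ticket hypothesis (LTH) to reduce inference costs while maintaining performance. However, LTH-based methods have two major drawbacks: 1) they require exhaustive and iterative training of dense models, resulting in an extremely large training computation cost, and 2) they only trim graph structures and model parameters but ignore the node feature dimension, where significant redundancy exists. To overcome these limitations, we propose a comprehensive graph gradual pruning framework termed CGP. It is realized by designing a during-training graph pruning paradigm that dynamically prunes GNNs within a single training process. Unlike LTH-based methods, the proposed CGP approach requires no re-training, which significantly reduces computation costs. Furthermore, we design a co-sparsifying strategy to comprehensively trim all three core elements of GNNs: graph structures, node features, and model parameters. Meanwhile, to refine the pruning operation, we introduce a regrowth process into the CGP framework to re-establish connections that were pruned but are important. The proposed CGP is evaluated on a node classification task across 6 GNN architectures, including shallow models (GCN and GAT), shallow-but-deep-propagation models (SGC and APPNP), and deep models (GCNII and ResGCN), on a total of 14 real-world graph datasets, including large-scale graph datasets from the challenging Open Graph Benchmark. Experiments show that our proposed strategy greatly improves both training and inference efficiency while matching or even exceeding the accuracy of existing methods.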