Rethinking Efficiency and Redundancy in
Training Large-scale Graphs

Xin Liu12, Xunbin Xiong3, Mingyu Yan12, Runzhen Xue12, Shirui Pan4, Xiaochun Ye12, and Dongrui Fan12 1 Xin Liu and Xunbin Xiong contribute equally.2 Mingyu Yan is the corresponding author. 1SKLP, Institute of Computing Technology, Chinese Academy of Sciences, China
2University of Chinese Academy of Sciences, China
3School of Information Science and Technology, ShanghaiTech University, China
4School of ICT, Griffith University, Australia

Abstract

Large-scale graphs are ubiquitous in real-world scenarios and can be trained by Graph Neural Networks (GNNs) to generate representation for downstream tasks. Given the abundant information and complex topology of a large-scale graph, we argue that redundancy exists in such graphs and will degrade the training efficiency. Unfortunately, the model scalability severely restricts the efficiency of training large-scale graphs via vanilla GNNs. Despite recent advances in sampling-based training methods, sampling-based GNNs generally overlook the redundancy issue. It still takes intolerable time to train these models on large-scale graphs. Thereby, we propose to drop redundancy and improve efficiency of training large-scale graphs with GNNs, by rethinking the inherent characteristics in a graph.

In this paper, we pioneer to propose a once-for-all method, termed DropReef, to drop the redundancy in large-scale graphs. Specifically, we first conduct preliminary experiments to explore potential redundancy in large-scale graphs. Next, we present a metric to quantify the neighbor heterophily of all nodes in a graph. Based on both experimental and theoretical analysis, we reveal the redundancy in a large-scale graph, i.e., nodes with high neighbor heterophily and a great number of neighbors. Then, we propose DropReef to detect and drop the redundancy in large-scale graphs once and for all, helping reduce the training time while ensuring no sacrifice in the model accuracy. To demonstrate the effectiveness of DropReef, we apply it to recent state-of-the-art sampling-based GNNs for training large-scale graphs, owing to the high precision of such models. With DropReef leveraged, the training efficiency of models can be greatly promoted. DropReef is highly compatible and is offline performed, benefiting the state-of-the-art sampling-based GNNs in the present and future to a significant extent.

graph neural networks, efficiency, redundancy, graph sampling

I Introduction

Learning graph data has become a hot spot of recent research and has drawn increasing attention in the deep learning domain. To capture information from complex graphs, various methods are abundantly proposed, among which Graph Neural Networks (GNNs) [37] are superior exemplars for tackling diverse tasks [12, 48, 38, 29]. Owing to the outstanding performance and explainability of GNN [35, 52], recent years have witnessed the emergence of diverse GNN variants, e.g., Graph Convolutional Networks (GCNs) [21], Graph Attention Networks (GATs) [43], and Graph Isomorphism Networks (GINs) [49].

However, it generally takes nontrivial efforts to train a well-expressive GNN. Conventionally, training of GNNs, especially GCNs, is generally performed in a full-batch manner [7], which introduces undesirable training efficiency and storage consumption [21]. Moreover, as the scale of real-world graph data rapidly grows by the day, the conventional training method cannot even work properly because of the “out of memory” (OOM) issue. Thereby, efforts have been made to improve the training efficiency of GNNs [1, 27]. Notably, sampling-based GNN models [12, 3, 17, 60, 2, 7, 51, 14, 9, 30, 56, 50] are emphatically studied and focused on by researchers, owing to their great capacity of training graphs in an efficient manner. Taking a graph as the input, sampling-based models utilize a well-designed sampler to acquire a part of the whole graph by step, i.e., a subgraph, and train the subgraphs in a mini-batch manner, instead of directly performing training on the original one. Overall, sampling-based training methods ensure an acceptable convergence rate and the training efficiency.

Unfortunately, the ever-growing graph scale has been challenging the efficiency of the sampling-based training method. Previously, typical graph-related tasks such as node classification are always conducted on small datasets [39], resulting in a limited power in terms of model scalability. Recent literature proposes to train well-expressive models on large-scale graphs [7, 51]. Nevertheless, it takes hundreds or thousands of times longer on a large graph to yield minor promotion in terms of accuracy than on a small one. In addition, previous investigations [26, 28] have discovered that the sampling process is becoming a bottleneck to the training of GNNs, which further results in low-efficiency training of large-scale graphs. Thereby, a question arises: What exactly degrades efficiency of training large-scale graphs?

We answer that: Redundancy in large-scale graphs can slow down the training of GNNs. We further argue that, dropping redundancy in large-scale graphs can significantly increase the efficiency of training large-scale graphs with GNNs. In the graph-related domain, previous literature [45, 18, 32, 53, 19, 11] proposes to detect and reduce redundant information (also regarded as noise) from a graph to promote the model performance or yield a robust model for facing adversarial attacks. However, none of a large-scale graph is considered or applied for analysis, owing to its inherent complexity. In addition, most of the literature [45, 32, 19] pays close attention to the model accuracy and omits the potential of improving efficiency. Thus, we propose to accelerate the training of large-scale graphs with GNNs by dropping redundancy. Moreover, considering the diversity of GNN variants, we believe that a well-designed technique with high compatibility can benefit state-of-the-art models in the present and future better, instead of designing a completely new GNN. Thereby, all we need is a once-for-all method to detect and drop redundancy in large-scale graphs while ensuring no sacrifice in the model accuracy.

In this paper, we pioneer to propose DropReef, a novel method to Drop the Redundancy in large-scale graphs once and for all, helping improve the efficiency of training large-scale graphs with GNNs. Specifically, we first conduct preliminary experiments to explore potential redundancy in large-scale graphs. Next, we propose to quantify the neighbor heterophily of all nodes in a graph by a novel metric, termed weighted neighbor heterophily (WNH), according to the fact that nodes with a heterophilic neighboring distribution are hard to classify accurately by vanilla GNNs. Based on the analysis, we argue that redundancy in a large-scale graph is nodes with high WNH and a great number of neighbors. Finally, we present DropReef to detect and drop the redundancy in large-scale graphs once and for all. Nodes with high WNH and a great number of neighbors are regarded as redundant nodes and are dropped from the training set, thus yielding a significant acceleration in training with no sacrifice in the model accuracy.

Contributions of this paper are summarized as follows:

We first conduct preliminary experiments to quantify the neighbor distribution and explore the potential redundancy (i.e., information that is of no benefit to the model training) in large-scale graphs. Experimental results indicate two findings: the neighbor distribution of nodes is generally imbalanced, and redundancy exists in node regions that are densely connected. Then, we present a novel metric, termed weighted neighbor heterophily, to quantify the neighbor heterophily for nodes in a graph. Based on both experimental and theoretical analysis, we reveal the redundancy in a large-scale graph, i.e., nodes with high neighbor heterophily and a great number of neighbors.
We propose DropReef, a once-for-all method to drop the redundancy in large-scale graphs and improve the efficiency of training large-scale graphs with GNNs. DropReef is easy to use and can be offline performed. One can directly train the large-scale graphs that have been processed by DropReef to witness a considerable acceleration in terms of GNNs training straightforwardly.
We conduct experiments on four real-world graphs in large scale and apply DropReef to three state-of-the-art sampling-based GNNs. Experimental results have demonstrated the effectiveness of DropReef in terms of efficiency and accuracy. DropReef is highly compatible and helps reduce on average 26.80% and up to 83.01% of the training time among four large-scale graphs while ensuring no sacrifice in the model accuracy.

Ii Preliminary and Motivation

In this section, we first introduce the fundamental of GNNs and sampling-based training methods. Next, we make an observation on the topology of large-scale graphs and argue that the potential redundancy heavily exists in such graphs. Then, we put forward our motivation and highlight the opportunity to improve the training efficiency, revealed by preliminary experiments.

Ii-a Fundamental of GNNs

GNNs are prevalent deep learning models for handling graph-related tasks [47]. Studies [44, 12, 49, 51, 54] have previously revealed the superior performance of GNNs on tasks of classification and prediction. In general, GNN models utilize the same computing pattern to learn the hidden information from a graph, despite diverse variants of GNNs. Herein, we take a widely used model, i.e., GCN [21], as an exemplar for the formulated introduction. Given the adjacency matrix A and the initial feature matrix H $^{0}$ as the input, GCN captures the hidden representation from a graph and propagates the information by layers in an iterative manner. The forward propagation is formulated as follows:

H^{l} = σ ({˜ D}^{- \frac{1}{2}} ˜ A {˜ D}^{- \frac{1}{2}} H^{l - 1} W^{l - 1})

(1)

where $˜ A$ denotes the normalized adjacency matrix, and $˜ D$ denotes the degree matrix of $˜ A$ . $σ$ denotes a nonlinear activation function. W $^{l - 1}$ and H $^{l - 1}$ denote the learnable weight matrix and the hidden feature matrix in the (l-1)-th layer, respectively. To sum up, training of GNNs consists of the forward propagation for learning the representation (as given in Equation (1)) and the backward propagation for updating the model parameters (i.e., W).

Ii-B Sampling-based Training Methods

Learning the representation from a large-scale graph takes nontrivial cost in terms of computation and storage under the propagation pattern given in Equation (1) since the hidden feature H in each layer is computed based on $˜ A$ of the whole huge graph. Thereby, sampling-based training methods [12, 3, 17, 60, 2, 7, 51, 14, 9, 30, 56, 50] are proposed to diminish the original large graph to smaller ones and feed them to a GNN in a batched manner, thus reducing the computation and storage cost of GNN training. Classical sampling-based training methods propose to select nodes from all neighbors of one node in a random manner [12], or select nodes from one node batch according to the pre-computed sampling distribution [3].

Fig. 1: Quantification on four large-scale graphs. Please note that we sort the nodes by their neighbors’ amount and pick the top 50% ones for analysis. In each subplot, the selected top 50% of nodes are divided into quintiles. For example, “0-10%” in the x-axis denotes the top quintile of nodes in the top 50% of nodes. Bars in each subplot are the ratio of the number of neighbors (abbr. # Neighbors) of the top 50% of nodes to the number of neighbors of total nodes in a graph. The curves in each subplot are the quantification of the average number of neighbors of the top 50% of nodes.

Nevertheless, such classical methods treat each node identically during sampling and omit the correlation between nodes being sampled in general. Moreover, experiments in such methods are generally conducted on small graphs. As the improvement, recent literature [7, 50, 51] proposes to extend GNN training to larger graphs and deeper model structures. They sample subgraphs and construct mini-batches based on the sampled ones. Training of GNN is subsequently conducted on these mini-batches. Given an un-directed graph $G (V, E)$ , the above process can be formulated as follows:

N o d e S e t : N \leftarrow {Sample} (V, b u d g e t)

(2)

S u b g r a p h : G_{s} (V_{s}, E_{s}) \leftarrow % {Construct} (N_{1}, N_{2}, \dots)

(3)

where $N$ denotes a node set that is sampled from the original graph $G$ . The size of $N$ is restricted to $b u d g e t$ . Then, in each mini-batch, a subgraph $G_{s}$ is constructed based on the sampled node sets and is fed for subsequent training. These subgraph-based sampling methods are generally superior in accuracy since they consider the correlation between nodes when constructing a subgraph [51]. The sampled subgraphs are dense and have many frequently accessed nodes (i.e., nodes with a lot of neighbors), which, however, can introduce redundancy in sampling as well as training of GNNs.

Ii-C Observations on Potential Redundancy

Large-scale graphs are complex in terms of graph topology, making the performance of GNN on such graphs distinct from small ones [16]. Specifically, in large-scale graphs, dense local communities [4] are more easily formed by node regions in which nodes are densely connected to other nodes in a neighboring region, compared to small graphs. In such regions, nodes, especially in a central position, share a great number of edges with other neighboring nodes [8], yielding many high-degree nodes in large-scale graphs.

Definition 1 High-degree nodes (abbr. HD nodes) are nodes with a great number of direct neighbors (one-hop neighbors).

Generally, abundant information exists in such large-scale graphs, and so does the redundancy (i.e., information that is of no benefit to the model training). HD nodes in large-scale graphs result in an imbalanced neighbor distribution. Since HD nodes are sampled with a high probability, it takes nontrivial cost to construct subgraphs based on HD nodes and their neighbors, given their densely connected patterns. Therefore, we propose to analyze such densely connected nodes and quantify the potential redundancy.

We quantify the overall neighbor distribution in large-scale graphs, i.e., Reddit [12], Yelp [51], Amazon [51], and OGBN-Products (abbr. Products) [16]. Detailed information about datasets is given in Table I. We first sort nodes in a graph by their neighbors’ amount. Then, we quantify the ratio of neighbors’ amount of the HD nodes, more specifically, the top 50% of nodes with a great number of neighbors, to the neighbors’ amount of the total nodes in a graph. Please note that total neighbors’ amount of all nodes in a graph generally far exceeds the total number of nodes in a statistical sense, meaning that different nodes might have a great number of the same neighbors. The quantification allows us to analyze node regions that are possibly dense in a fine-grained manner. As illustrated in Figure 1, the neighbors’ amount of the top 10% of nodes dominates a considerable ratio in all large-scale graphs, and is appreciably higher than nodes in other parts. The top 10% of nodes are HD nodes, concluding that the distribution of the node connected density is imbalanced in a graph and is centralized in regions composed of HD nodes. These observations indicate that subgraphs constructed by such HD nodes are generally dense. We suppose that such dense subgraphs have a higher probability of containing redundancy and bring about greater overhead than ones in a sparse pattern. Thereby, our supposition can be proved if the model accuracy does not decline significantly after dropping HD nodes.

Remark In this paper, only the one-hop neighbors of a node are considered.

Fig. 2: Comparison on the training time and the test accuracy among four large-scale graphs, with GraphSAINT used as the backbone. Please note that the training process consists of sampling and pure model training processes. The sampling process is accomplished by a random node sampler. Bars denote the relative training time, while dotted lines denote the variation of the test accuracy. Original denotes the model is normally trained, while Dropped denotes that a naive drop method is performed on datasets before they are fed to the model. Training configurations between Original and Dropped are completely consistent to the official configurations in GraphSAINT.

Ii-D Opportunities to Improve Efficiency

Preliminary experiments are conducted to highlight the influence of dropping HD nodes on four large-scale graphs. Naively, we drop the top 5% of training nodes on each dataset before they are fed to the model for training. We compare the training time and the test accuracy between the original model training and training with the naive drop method equipped. The only difference is that a part of HD nodes are dropped in datasets in the latter case. As illustrated in Figure 2, we find that the acceleration of the training time is nontrivial. Dropping HD nodes reduced 80% of the training time on Amazon dataset, and average 25% of training time on the other three datasets. Nevertheless, the significant reduction in the training time does not bring about a dramatic decline in the model accuracy. For example, the accuracy decline ratio on Reddit dataset is 4.12%.

According to the preliminary experiments, dropping a part of HD nodes from the total training nodes can generally reduce the time cost of training a large-scale graph, especially for graphs in which HD nodes concentrate in the top 10% of nodes (e.g., Yelp and Amazon datasets). In addition, the decline in the model accuracy is slight on all datasets, which can be attributed to the fact that the dropped HD nodes contain redundant information to a large extent. Motivated by the above findings, we argue that designing a more precise drop method to remove HD nodes containing redundant information can attach both efficient training and trivial sacrifice in model accuracy, instead of naively dropping the top 5% of nodes. And it will be best to have this method performed once and for all, avoiding the occupation of online computation and storage resources. Therefore, all we need is an accurate drop method that can be offline performed to detect and decrease redundancy in large-scale graphs.

Iii Methodology

In this section, we first define a metric, termed weighted neighbor heterophily, to quantify the neighbor heterophily for nodes in a graph. Based on the proposed metric, we define the data redundancy and redundant nodes in a graph accordingly. Next, we propose DropReef, a once-for-all method to detect and drop the redundant nodes in large-scale graphs. DropReef consists of three subprocesses and is performed offline. After processing by DropReef, the training efficiency on large-scale graphs is considerably promoted.

Iii-a The Proposed Metric

Iii-A1 Neighbor Heterophily

This subsubsection introduces a key component of the metric, i.e., neighbor heterophily.

In real-world graphs, homophily is a basic principle to describe the distribution of nodes’ classes that connected nodes within a region generally belong to the same class [33]. A straightforward example that a person shares common interests and habits with his friends can well describe this “birds of a feather flock together” phenomenon [34]. We further derive a connected pattern for a neighbor region from this phenomenon. As illustrated in Figure 3(a), a homophilic graph is obtained by treating a person and his friends (with the same interests and habits) as nodes in the same class. In another case, given a node v that is densely connected with its neighbors, labels of v’s neighbors are diverse in which major of the neighbors’ classes is different from the class to which v belongs. The above connected pattern, which is given in Figure 3(b), is an unusual yet possible case in an unrestricted graph, despite the homophily principle. However, many existing GNNs, e.g., GCN [21] and LGNN [6], are designed under a strong assumption of homophily [59], and trends to overfit the majority classes, rendering undesirable learning representation on minority ones [40]. Therefore, supposing that a GNN is trained for node classification, node v is harder to be accurately classified than the node m (given their connected patterns in Figure 3).

Fig. 3: An illustration of two distributions of nodes’ classes. Please note that colors represent classes, and nodes in the same color belong to the same class.

Definition 2 Neighbor heterophily is measurable value to reflect the heterophily for a node within its neighboring region.

We argue that nodes with heterophilic neighbors can hardly be of benefit to the model performance. And such nodes can be detected by calculating the difference between them and their neighbors. Recent literature [59] presented a metric to empirically reflect the edge homophily ratio for a whole graph from a global perspective. In our case, we propose to calculate the difference between a node and its neighbors from a local perspective by quantifying the neighbor heterophily within its neighboring region, considering that nodes with high neighbor heterophily are in a fairly smaller proportion than normal ones. The neighbor heterophily $H e t e_{v}$ for a given node v can be quantified as follows:

H e t e_{v} = \frac{1}{D_{v}} \sum u \in N (v) ∥ c_{v} - c_{u} ∥_{2}

(4)

where $D_{v}$ and $N (v)$ denote the degree and the neighbor set of node $v$ , respectively. $c_{v} \in R^{N_{c}}$ denotes a label vector extended by the class to which node $v$ belongs, where $N_{c}$ denotes the number of all possible classes of nodes.

Proposition

For node $v$ , $H e t e_{v}$ is associated with the distribution of nodes’ classes in $v$ ’s neighboring region. A higher $H e t e_{v}$ indicates a more heterophilic neighboring distribution of $v$ . In the task of node classification, the upper bound of $H e t e_{v}$ in single-class classification is $\sqrt{2}$ and in multi-class classification is $\sqrt{N_{c}}$ . $_{■}$

Proof

For single-class node prediction, assuming that node $v$ and its neighbor $u$ belong to different classes, we can simply derive the 2-Norm of ( $c_{v} - c_{u}$ ):

∥ c_{v} - c_{u} ∥_{2} = \sqrt{2}

(5)

Next, we denote Diff as the number of v’s neighbors that belong to different classes from v, and calculate $H e t e_{v}$ according to Equation (4):

\begin{matrix} H e t e_{v} & = \frac{1}{D_{v}} \sum u \in N (v) ∥ c_{v} - c_{u} ∥_{2} = \sqrt{2} \frac{Diff}{D_{v}} \end{matrix}

(6)

Based on Equation (6), $H e t e_{v}$ is positively correlated with Diff since D $_{v}$ is a constant when the node v is given. Therefore, a more heterophilic neighboring distribution of v indicates a larger Diff, thus boosting $H e t e_{v}$ accordingly. It is also observed that the upper bound of $H e t e_{v}$ is $\sqrt{2}$ since $\frac{Diff}{D_{v}}$ will not exceed 1.

Proof in a multi-class situation is same to the single-class one. Assuming that node $v$ and its neighbor $u$ belong to different classes, we denote Max as the maximum of $∥ c_{v} - c_{u} ∥_{2}$ , and give the upper bound of $H e t e_{v}$ as follows:

H e t e_{v}

= \frac{1}{D_{v}} \sum u \in N (v) ∥ c_{v} - c_{u} ∥_{2} \leq Max \frac{Diff}{D_{v}}

(7)

Possible maximums of Max and Diff can be simultaneously acquired in one particular case: Any two of v and its neighbors do not belong to the same class. In this case, we can derive the value of Max as $\sqrt{N_{c}}$ and Diff as D $_{v}$ , thus inferring the upper bound of $H e t e_{v}$ as $\sqrt{N_{c}}$ .

$_{■}$

Iii-A2 Linking Probability

This subsubsection introduces a key component of the metric, i.e., linking probability.

In social networks, link prediction is a task to predict the existence of the edge between two entity nodes. In this task, a classifier is learned to mark edges with true or false labels [13, 25]. In the GNN-related domain, a GNN model is trained to yield linking probabilities associated with edges by ingesting features and the adjacency matrix of the target graph [55, 20]. The generated linking probabilities are regarded as the existence probabilities for unobserved edges between nodes in general. Considering a case of sampling subgraphs, a designed sampler iteratively samples nodes before the nodes’ amount exceeds the budget. Subsequently, a subgraph is constructed by recovering existing edges between the sampled nodes. Therefore, edges are unobserved for the sampler before the subgraph is built accordingly. Inspired by the weighted message passing in GATs [43], we propose to add linking probabilities to $H e t e_{v}$ since each neighbor has a unique impact on node $v$ . A demonstration will be given in Section III-A3 to reveal that $H e t e_{v}$ with linking probabilities added is more explicable than a naive $H e t e_{v}$ for exceptive cases.

Iii-A3 Definition of the Metric

We propose a metric, termed weighted neighbor heterophily (WNH), to reflect the neighbor heterophily for a given node v with all individual edges between v and its neighbors considered simultaneously. The proposed metric is calculated as follows:

W N H_{v} = \frac{1}{D_{v}} \sum u \in N (v) p_{v u} \cdot ∥ c_{v} - c_{u} ∥_{2}

(8)

where $p_{v u}$ denotes the linking probability between nodes $v$ and $u$ . As the improvement of the naive neighbor heterophily given in Equation (4), we add linking probabilities served as the weight into the proposed metric, making the proposed metric more explicable than a naive one for exceptive cases.

Fig. 4: Overlap between the top 10% of nodes with high WNH and the top 50% of nodes with large neighbors’ amounts on four large-scale graphs. Please note that nodes are independently sorted by two metrics, i.e., WNH and the neighbors’ amount. The value “45” in the top-left grid denotes that in Reddit, about 45% of nodes with high WNH are simultaneously the top 10% of nodes with large neighbors’ amounts.

Proposition

The weighted neighbor heterophily (WNH) is able to distinguish impacts made by individual neighbors compared to the naive neighbor heterophily. $_{■}$

Proof

We assume that label vectors of v and its two neighbors u and m, i.e., c $_{v}$ , c $_{u}$ , and c $_{m}$ are $[0, 0, 1, 1, 0, 0]$ , $[1, 1, 0, 0, 0, 0]$ , and $[0, 0, 0, 0, 1, 1]$ , respectively. 2-Norm of (c $_{v}$ - c $_{u}$ ) and (c $_{v}$ - c $_{m}$ ) are equal, implying that impacts made by two individual neighbors cannot be distinguished. However, WNH introduces the linking probability associated with the edge between v and u (or m) to reflect a distinct impact. $_{■}$

The proposed WNH is utilized to detect the neighbor heterophily for nodes in a large-scale graph. A higher WNH indicates that the neighboring distribution of the node is more heterophilic than a normal one. Therefore, a node with a higher WNH cannot be accurately classified even with a well-trained GNN. We remark that the computation of WNH for all nodes in a large-scale graph is offline performed before sampling, requiring no online resource.

Iii-B The Proposed Method: DropReef

Prior to the release of DropReef, we propose to detect the redundant nodes in a large-scale graph. In section II-D, we have conducted a preliminary experiment (given in Figure 2) to naively drop the top 5% of nodes on four large-scale graphs. We found the decline in the model accuracy is slight, and thus argued that the dropped HD nodes contain redundant information to a large extent.

To support the argument, we conduct experiments to quantify the overlap between HD nodes and nodes with high WNH. For each dataset, we calculate WNH for all nodes and select the top 10% ones sorted by their WNH. Quantification of the overlap is given in Figure 4. Over 70% of nodes with high WNH are simultaneously HD nodes, indicating that a node with a great number of neighbors will also have a heterophilic neighboring distribution with a high probability. Therefore, based on the quantification, we can define the data redundancy and redundant nodes in training large-scale graphs with GNNs.

Definition 3 Redundancy in data denotes some information benefits little to the model performance (both training efficiency and model accuracy) but takes nontrivial cost for its computation. Therefore, redundant nodes can be further defined as HD nodes with large values in terms of WNH at the same time.

INPUT

Original Graph $G (V, E)$ , Edge Prob. Matrix P, Hyperparameters ${T H}_{W N H}$ , ${T H}_{D E G}$ ;

OUTPUT

Low-redundancy Graph $G^{'} (V^{'}, E^{'})$ ;

V^{'} = V

;

V_{d r o p} = \emptyset

;

V_{t r} \leftarrow

get_training_nodes

(V)

;

6 for $v \in V_{t r}$ do

W N H_{v} = 0

;

c_{v} \leftarrow

get_label

(v)

;

D_{v} \leftarrow

get_degree

(v)

;

N_{v} \leftarrow

get_neighbors

(v)

;

11 for $u \in N_{v}$ do

c_{u} \leftarrow

get_label

(u)

;

p_{v u} \leftarrow

get_edge_prob

(v, u)

;

{WNH}_{v} = (\sum_{u} p_{v u} \times ∥ c_{v} - c_{u} ∥_{2}) / D_{v}

;

15 end for

17 end for

18Save the WNH for further usage and free the storage for $v \in V_{t r}$ do

19 if $D_{v} \geq {T H}_{D E G} a n d {WNH}_{v} \geq {T H}_{W N H}$ then

Vdrop←\emphadd_node(Vdel,v)

;

21 end if

23 end for

24for $v \in V_{d r o p}$ do

25 for $u \in N_{v}$ do

E′←\emphremove_edge(E,v,u)

27 end for

V′←\emphremove_node(v)

;

29 end for

Algorithm 1 DropReef (Offline Performed)

Moreover, it is observed that there is a uniform distribution of high WNH nodes on Products. Nevertheless, the uniform distribution has no effect on dropping redundant nodes. We can adopt a flexible dropping process on all training nodes uniformly by adjusting hyperparameters.

Given two hyperparameters and a precomputed edge probability matrix P, DropReef is offline performed on a large-scale graph to drop redundant nodes. A detailed process is given in Algorithm 1. DropReef consists of three subprocesses: computing metrics, detecting redundancy and dropping nodes.

Computing metrics (i.e., WNH) for all training nodes requires all edge probabilities between training nodes and their neighbors. The probability matrix P is precomputed by jointly using a graph auto-encoder (GAE) [20] and a batched probability predictor [24]. Predicting edge probabilities in a batched manner can generally mitigate the storage consumption of a graph with tens of millions of edges [24].

Detecting redundancy aims to select redundant nodes from all training nodes based on two hyperparameters, i.e., TH $_{W N H}$ and TH $_{D E G}$ . Specifically, TH $_{W N H}$ is used to filter nodes with high WNH since such nodes are hard to classify accurately. TH $_{D E G}$ is a restriction on the minimum node degree, ensuring only nodes with a great number of neighbors will be considered to drop. According to two hyperparameters, nodes with high WNH and a great number of neighbors are regarded as redundant nodes.

Dropping redundant nodes includes removing all edges associated with the redundant nodes and dropping these nodes from the training set. The output is a low-redundancy graph $G^{'}$ that can be directly fed to GNNs for training.

Remark We drop redundant nodes on the training set only. All subprocesses in DropReef are offline performed to generate a low-redundancy graph. Training such graphs with some popular GNNs will witness a considerable acceleration straightforwardly.

Iv Experiment

In this section, we conduct extensive experiments to demonstrate the effectiveness of DropReef. We give a detailed analysis based on the results and further discuss the adjustment of DropReef to offer a way for performing one’s own DropReef. In addition, we quantify the sampled subgraphs and reveal the impact of DropReef on the subgraphs.

Iv-a Experimental Setting

Dataset	#Node	#Edge	Degree	Task Type	Train/Val/Test
Reddit	232,965	11,606,919	50	single-cls class.	0.66/0.10/0.24
Yelp	716,847	6,977,410	10	multi-cls class.	0.75/0.10/0.15
Amazon	1,598,960	132,169,734	83	multi-cls class.	0.85/0.05/0.10
Products	2,449,029	61,859,140	51	multi-cls class.	0.08/0.02/0.90

TABLE I: Statistics on large-scale graph datasets.

Fig. 5: Comparison on the training time among four large-scale graphs. We apply DropReef to three sampling-based GNNs to yield a speedup. Please note that Original denotes training the baselines, while DropReef denotes that DropReef was applied before training the baselines. In addition, training configurations between Original and DropReef are completely consistent to the official configurations of the corresponding baseline.

Model

Dataset

Training Time(s.)

(Original Case)

Training Time(s.)

(Use DropReef)

Time

Reduction

Test Accuracy

(Original Case)

Test Accuracy

(Use DropReef)

Accuracy

Gap

Drop Node

Ratio

Drop Edge

Ratio

Cluster-GCN

175.72

160.65

8.58%

0.9600

0.9604

↑

0.04%

8.45%

29.02%

Yelp

185.17

164.17

11.34%

0.6099

0.6158

↑

0.97%

18.12%

65.12%

Amazon

1400.37

605.04

56.79%

0.7566

0.7576

↑

0.13%

5.00%

77.35%

Products

161.39

156.14

3.26%

0.7474

0.7516

↑

0.56%

7.59%

23.46%

PGS-GNN

45.80

36.71

19.86%

0.9549

0.9571

↑

0.02%

5.73%

24.43%

Yelp

831.26

697.17

16.13%

0.6256

0.6280

↑

0.38%

4.54%

44.59%

Amazon

4163.68

2338.21

43.84%

0.7759

0.7849

↑

1.15%

5.00%

77.35%

Products

618.75

536.44

13.30%

0.7389

0.7474

↑

1.15%

8.02%

16.26%

GraphSAINT

36.79

29.73

19.19%

0.9615

0.9576

↓

0.41%

1.88%

9.12%

Yelp

197.22

137.96

30.05%

0.6412

0.6358

↓

0.85%

3.05%

34.19%

Amazon

1462.98

248.50

83.01%

0.7650

0.7787

↑

1.79%

5.00%

77.35%

Products

628.94

512.76

18.47%

0.7852

0.7898

↑

0.15%

7.59%

23.46%

TABLE II: Comparison on the training time and the test accuracy between training original models and with DropReef equipped among four large-scale graphs.

Datasets and Platform. We propose to demonstrate the effectiveness of DropReef in terms of efficiency and accuracy on the node classification task. We conduct experiments on four real-world graphs in large scale: Reddit [12], Yelp [51], Amazon [51], and OGBN-Products (abbr. Products) [16]. Among these four datasets, the smallest graph has more than two hundred thousand nodes and five million edges, while the largest has more than one and a half million nodes and one hundred million edges. Please see Table I for more detailed statistics, where “single-cls class.” denotes the single-class classification task. All experiments are conducted on a Linux server equipped with dual 24-core Intel Xeon CPU E5-2650 v4 CPUs and an NVIDIA Tesla V100 GPU (16 GB memory).

Baselines. We apply DropReef to three popular sampling-based GNNs, i.e., Cluster-GCN [7], parallel graph sampling-based GNN (abbr. PGS-GNN) [50], and GraphSAINT [51], that have the capacity of training large-scale graphs. Specifically, we choose the random node sampler for GraphSAINT to perform sampling. In addition to the hyperparameters of DropReef, all experiments use official training configurations, e.g., sampling size, batch size, and learning rate. We remark that comparisons between training original baselines and with DropReef equipped are conducted under completely consistent configurations other than the input graph since the input graph was processed for dropping redundant nodes by DropReef.

Iv-B Experimental Results

Quantification of the shared neighbors between two nodes in a graph. A deeper color means a larger number of shared neighbors within a node region. — (a) Quantification on a sampled subgraph (vanilla case). Note that x and y axes are node indices.

We demonstrate the effectiveness of DropReef on accelerating the training through a comparison of the relative training time as shown in Figure 5. Please note that the training process consists of sampling and pure model training since they are typically mixed in a batched computing process. With the assistance of DropReef, redundant nodes are dropped from the training set of all large-scale graphs, thus yielding a remarkable acceleration in terms of training. Overall, applying DropReef helps reduce on average 26.80% and up to 83.01% of the training time. For each baseline, the (training) time reduction varies from approx. 10% to 80% among four large-scale graphs. Notably, the time reduction on Amazon dataset is significant. We notice that the time reduction on Products dataset among three baselines is less than on other graphs (given that the size of Products dataset is large) since only 8% of nodes are split as the training set. A detailed performance is given in Table II. Experiments are repeated three times to yield the average value. Since DropReef removed redundant nodes from the training set, we argue that a larger training set of a graph can generally result in a greater gain from DropReef.

With respect to the model accuracy, the reduction of the training time does not bring about a drastic decline in the test accuracy. On the contrary, the test accuracy on most large-scale graphs is promoted after dropping redundancy, which can be attributed to the fact that the dropped redundant nodes are of no benefit to the model accuracy. We remark that DropReef removed redundant nodes from the training set only, leaving the validation and test sets unprocessed. Performance in terms of the time reduction and the test accuracy thereby demonstrates the effectiveness of DropReef in both efficiency and accuracy. Please note that Drop Node Ratio denotes the ratio of the number of dropped nodes to the number of all training nodes, and Drop Edge Ratio denotes the ratio of the number of the removed edges associated with the dropped nodes to the number of edges associated with all training nodes. We have the two metrics discussed in section IV-D.

Iv-C Impact on Subgraphs

DropReef has promoted the training efficiency by dropping redundant nodes from the training set, after which sampling and model training is performed on the low-redundancy graph. Since the batched training is conducted on sampled subgraphs, we thereby analyze the impact of DropReef on sampled subgraphs. Given the stochasticity of sampling, sampled subgraphs inevitably include dense node regions in a vanilla case. Such regions introduce considerable training time and are supposed to be dropped by DropReef.

Fig. 7: Quantization on graph statistics.

To analyze a subgraph, we first conduct sampling on Amazon via a random node sampler provided in GraphSAINT. We sample two subgraphs in the same size for a vanilla case (regular sampling) and our case (apply DropReef before sampling) analysis. For each sampled subgraph, we compute the number of shared neighbors for any two nodes in the subgraph. For instance, given two nodes $v$ and $u$ , n $_{v u}$ denotes the number of shared neighbors between $v$ and $u$ . A large n $_{v u}$ indicates that a dense node region exists in the neighboring area of $v$ and $u$ . As illustrated in Figure 6 (a)&(b), after applying DropReef, n $_{v u}$ in the sampled subgraph has a more balanced distribution than a vanilla case. The number of dense node regions in the sampled low-redundancy subgraph is less than in the vanilla subgraph, as suggested by two exemplars of quantified results of subgraphs in two cases (given in Figure 6 (c)). We also quantify graph statistics, i.e., clustering coefficient (abbr. CC) and the number of closed triads (abbr. #CT), of sampled subgraphs in two cases via Stanford Network Analysis Platform (SNAP) [22]. CC is used to measure the cliquishness of a typical neighborhood in a graph [46]. #CT is used to reveal the number of potential dense connections in a node region composed of triad structures (among three nodes with any two of them connected) [10]. We compute two statistics (average) based on 1000 sampled subgraphs in two cases. As illustrated in Figure 7, the sampled low-redundancy subgraphs have a more balanced distribution than vanilla ones from the graph statistical perspective.

Iv-D Discussion

In addition to the size of the training set, another factor that affects the time reduction is the hyperparameters used in DropReef. Drop Node Ratio and Drop Edge Ratio are metrics controlled by these hyperparameters. As previously introduced in section III-B, DropReef detects redundant nodes with the help of two thresholds, i.e., TH $_{W N H}$ and TH $_{D E G}$ . TH $_{W N H}$ is used to select nodes with high WNH, while TH $_{D E G}$ is used to select nodes with a great number of neighbors. Nodes concurrently exceeding the two thresholds are regarded as redundant nodes. With respect to the adjustment of the two thresholds, we have analyzed the upper bound of $H e t e_{v}$ in Proposition III-A1. WNH can be viewed as a weighted version of $H e t e_{v}$ and will not exceed the upper bound of $H e t e_{v}$ . Moreover, we have quantified the average number of neighbors for each quintile of these HD nodes in Figure 1. The average degree of a graph is also a key value to construct the bound for TH $_{D E G}$ . One can perform a unique DropReef on large-scale graphs by adjusting the hyperparameters based on the provided materials. We remark that turning down the two thresholds by a big margin can yield a more significant acceleration of the training time but will result in an undesirable decline in the model accuracy. Therefore, adjusting hyperparameters has been a trade-off between time and accuracy.

V Related Work

Related work of the proposed DropReef contains three main veins, i.e., graph sampling, graph sparsification, and dropout.

Graph Sampling: Graph sampling methods are popularly utilized to improve the training efficiency of GNNs. These methods sample nodes and construct a subgraph in each mini-batch, making it possible that the batch-performed GNN training takes less computation and storage cost than the full-batch training. Typically, graph sampling methods designed to train homogeneous graphs can be classified into three categories [28], i.e., node-wise, layer-wise, and subgraph-based sampling methods. Node-wise sampling methods, such as GraphSAGE [12], sample a fixed fraction of neighbors for each node in a graph in a random manner. They reduce the training cost by restricting the sampling size. Layer-wise sampling methods, e.g., FastGCN [3], AS-GCN [17], and LADIES [60], simultaneously sample a fixed number of nodes by layers to train a layered GNN in a top-down manner. As the improvement of node-wise sampling methods, layer-wise sampling methods omit the focus on the neighborhoods of one single node in a graph, at the same time, perform according to the pre-computed sampling probability. Subgraph-based sampling methods, e.g., Cluster-GCN [7] and GraphSAINT [51], sample subgraphs in each mini-batch by using graph partition algorithms or conditionally sampled nodes and edges, extending GNN training to large-scale graphs. DropReef is flexible to be equipped with mainstream sampling-based GNNs, boosting the performance of existing state-of-the-art models. Considering the drastically growing cost of training large-scale graphs, DropReef is capable of improving the efficiency and even the accuracy of sampling-based GNNs.

Graph Sparsification: In the GNN-related domain, graph sparsification methods propose to remove particular edges in a graph to improve the model accuracy or reduce the redundant computation. Typically, DropEdge [36] randomly removes edges to avoid over-fitting and over-smoothing issues. AdaptiveGCN [23] and NeuralSparse [58] use learnable models to remove task-irrelevant edges. GAUG [57] and PTDNet [31] predict the quality of edges for the purpose of denoising, with neural networks leveraged. The above insightful efforts provide the promotion in terms of model accuracy. Moreover, UGS [5] prunes both edges and the model weight to yield speedup in GNN inference. FastGAT [41] proposes to sparsify a graph to reduce attention coefficients’ amount for attention-based GNNs by significantly removing useless edges. DropReef primarily distinguishes from graph sparsification methods as we drop redundant nodes rather than a considerable amount of edges in a graph. In addition, DropReef is a once-for-all procedure that requires no online resource to train a special model for sparsifying the graph. Furthermore, most existing graph sparsification methods generally conduct experiments on small datasets, while DropReef aims for the acceleration of training large-scale graphs.

Dropout: In the deep learning domain, Dropout is proposed to randomly omit hidden units in a neural network layer for mitigating the over-smoothing issue [15]. As an easy-to-use technique, Dropout is popularly utilized in diverse neural network models and achieves significant improvements compared to other regularization methods [42]. DropReef can be considered a variant of Dropout customized for the graph domain. In our case, the objective for dropout is redundant nodes, which is a unique component of the graph data, compared to hidden units in a vanilla case. Moreover, the major function of Dropout is to mitigate the over-smoothing issue in the model training, while DropReef has succeeded in achieving a two-fold function, i.e., both the high efficiency and accuracy are well ensured.

Vi Conclusion

In this paper, we propose a once-for-all method, termed DropReef, to detect and drop the redundancy in large-scale graphs. By offline performing DropReef, redundant nodes together with their associated edges are dropped from the training set of a graph. We verify the effectiveness of DropReef using three popular sampling-based GNNs that have the capacity of training large-scale graphs. Experimental results demonstrate that DropReef succeeds in improving the efficiency of training large-scale graphs by GNNs, through dropping redundancy. As the scale of real-world graph data rapidly grows by the day, DropReef can be opportunely used to transform a large-scale graph into a low-redundancy one, benefiting the state-of-the-art methods designed for training large-scale graphs to a significant extent.

References

[1] S. Abadal, A. Jain, R. Guirado, J. López-Alonso, and E. Alarcón (2021) Computing graph neural networks: a survey from algorithms to accelerators. ACM Computing Surveys (CSUR) 54 (9), pp. 1–38. Cited by: §I.
[2] J. Bai and et al. (2021) Ripple walk training: a subgraph-based training framework for large and deep graph neural network. In 2021 International Joint Conference on Neural Networks, pp. 1–8. Cited by: §I, §II-B.
[3] J. Chen, T. Ma, and C. Xiao (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, Cited by: §I, §II-B, §V.
[4] Q. Chen and et al. (2013) Detecting local community structures in complex networks based on local degree central nodes. Physica A: Statistical Mechanics and its Applications 392 (3), pp. 529–537. Cited by: §II-C.
[5] T. Chen, Y. Sui, X. Chen, A. Zhang, and Z. Wang (2021) A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning, pp. 1695–1706. Cited by: §V.
[6] Z. Chen, L. Li, and J. Bruna (2019) Supervised community detection with line graph neural networks. In 7th International Conference on Learning Representations, ICLR 2019,, Cited by: §III-A1.
[7] W. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C. Hsieh (2019) Cluster-gcn: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266. Cited by: §I, §I, §II-B, §II-B, §IV-A, §V.
[8] A. Clauset (2005) Finding local community structure in networks. Physical review E 72 (2), pp. 026132. Cited by: §II-C.
[9] W. Cong, R. Forsati, and et al. (2020) Minimal variance sampling with provable guarantees for fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1393–1403. Cited by: §I, §II-B.
[10] D. Easley and J. Kleinberg (2010) Networks, crowds, and markets: reasoning about a highly connected world. Cambridge university press. Cited by: §IV-C.
[11] N. Entezari, S. A. Al-Sayouri, A. Darvishzadeh, and E. E. Papalexakis (2020) All you need is low (rank) defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 169–177. Cited by: §I.
[12] W. L. Hamilton, R. Ying, and et al. (2017) Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035. Cited by: §I, §I, §II-A, §II-B, §II-C, §IV-A, §V.
[13] M. A. Hasan and M. J. Zaki (2011) A survey of link prediction in social networks. In Social network data analytics, pp. 243–275. Cited by: §III-A2.
[14] A. Hasanzadeh, E. Hajiramezanali, S. Boluki, M. Zhou, N. Duffield, K. Narayanan, and X. Qian (2020) Bayesian graph neural networks with adaptive connection sampling. In International conference on machine learning, pp. 4094–4104. Cited by: §I, §II-B.
[15] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §V.
[16] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. Advances in neural information processing systems 33, pp. 22118–22133. Cited by: §II-C, §II-C, §IV-A.
[17] W. Huang, T. Zhang, Y. Rong, and J. Huang (2018) Adaptive sampling towards fast graph representation learning. Advances in Neural Information Processing Systems 31, pp. 4558–4567. Cited by: §I, §II-B, §V.
[18] Z. Jia, S. Lin, R. Ying, J. You, J. Leskovec, and A. Aiken (2020) Redundancy-free computation for graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 997–1005. Cited by: §I.
[19] W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang (2020) Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 66–74. Cited by: §I.
[20] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §III-A2, §III-B.
[21] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations ICLR 2017, Cited by: §I, §I, §II-A, §III-A1.
[22] J. Leskovec and R. Sosič (2016) SNAP: a general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology 8 (1), pp. 1. Cited by: §IV-C.
[23] D. Li, T. Yang, L. Du, Z. He, and L. Jiang (2021) AdaptiveGCN: efficient gcn through adaptively sparsifying graphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3206–3210. Cited by: §V.
[24] G. Li, C. Xiong, A. Thabet, and B. Ghanem (2020) Deepergcn: all you need to train deeper gcns. arXiv preprint arXiv:2006.07739. Cited by: §III-B.
[25] D. Liben-Nowell and J. Kleinberg (2007) The link-prediction problem for social networks. Journal of the American society for information science and technology 58 (7), pp. 1019–1031. Cited by: §III-A2.
[26] Z. Lin, C. Li, Y. Miao, Y. Liu, and Y. Xu (2020) Pagraph: scaling gnn training on large graphs via computation-aware caching. In Proceedings of the 11th ACM Symposium on Cloud Computing, pp. 401–415. Cited by: §I.
[27] X. Liu, M. Yan, L. Deng, G. Li, X. Ye, D. Fan, S. Pan, and Y. Xie (2022) Survey on graph neural network acceleration: an algorithmic perspective. arXiv preprint arXiv:2202.04822. Cited by: §I.
[28] X. Liu, M. Yan, and et al. (2021) Sampling methods for efficient training of graph convolutional networks: a survey. IEEE/CAA Journal of Automatica Sinica 9 (2), pp. 205–234. Cited by: §I, §V.
[29] Y. Liu, Z. Li, S. Pan, C. Gong, C. Zhou, and G. Karypis (2021) Anomaly detection on attributed networks via contrastive self-supervised learning. IEEE transactions on neural networks and learning systems 33 (6), pp. 2378–2392. Cited by: §I.
[30] Z. Liu, Z. Wu, Z. Zhang, J. Zhou, S. Yang, L. Song, and Y. Qi (2020) Bandit samplers for training graph neural networks. Advances in Neural Information Processing Systems 33, pp. 6878–6888. Cited by: §I, §II-B.
[31] D. Luo, W. Cheng, W. Yu, B. Zong, J. Ni, H. Chen, and X. Zhang (2021) Learning to drop: robust graph neural network via topological denoising. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 779–787. Cited by: §V.
[32] D. Luo, W. Cheng, W. Yu, B. Zong, J. Ni, H. Chen, and X. Zhang (2021) Learning to drop: robust graph neural network via topological denoising. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 779–787. Cited by: §I.
[33] M. McPherson, L. Smith-Lovin, and J. M. Cook (2001) Birds of a feather: homophily in social networks. Annual review of sociology 27 (1), pp. 415–444. Cited by: §III-A1.
[34] M. Newman (2018) Networks. Oxford university press. Cited by: §III-A1.
[35] P. E. Pope, S. Kolouri, M. Rostami, C. E. Martin, and H. Hoffmann (2019) Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10772–10781. Cited by: §I.
[36] Y. Rong, W. Huang, T. Xu, and J. Huang (2019) Dropedge: towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903. Cited by: §V.
[37] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §I.
[38] M. Schlichtkrull, T. N. Kipf, and et al. (2018) Modeling relational data with graph convolutional networks. In European semantic web conference, pp. 593–607. Cited by: §I.
[39] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §I.
[40] M. Shi, Y. Tang, and et al. (2020) Multi-class imbalanced graph convolutional network learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Cited by: §III-A1.
[41] R. S. Srinivasa, C. Xiao, L. Glass, J. Romberg, and J. Sun (2020) Fast graph attention networks using effective resistance based graph sparsification. arXiv preprint arXiv:2006.08796. Cited by: §V.
[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §V.
[43] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations ICLR 2018, Cited by: §I, §III-A2.
[44] S. Wan, C. Gong, P. Zhong, S. Pan, G. Li, and J. Yang (2020) Hyperspectral image classification with context-aware dynamic graph convolutional network. IEEE Transactions on Geoscience and Remote Sensing 59 (1), pp. 597–612. Cited by: §II-A.
[45] L. Wang, W. Yu, W. Wang, W. Cheng, W. Zhang, H. Zha, X. He, and H. Chen (2019) Learning robust representations with graph denoising policy network. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 1378–1383. Cited by: §I.
[46] D. J. Watts and S. H. Strogatz (1998) Collective dynamics of ‘small-world’networks. nature 393 (6684), pp. 440–442. Cited by: §IV-C.
[47] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip (2020) A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32 (1), pp. 4–24. Cited by: §II-A.
[48] Z. Wu, S. Pan, and et al. (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 753–763. Cited by: §I.
[49] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations ICLR 2019, Cited by: §I, §II-A.
[50] H. Zeng, H. Zhou, and et al. (2019) Accurate, efficient and scalable graph embedding. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 462–471. Cited by: §I, §II-B, §II-B, §IV-A.
[51] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. Prasanna (2020) GraphSAINT: graph sampling based inductive learning method. In International Conference on Learning Representations, Cited by: §I, §I, §II-A, §II-B, §II-B, §II-C, §IV-A, §IV-A, §V.
[52] H. Zhang, B. Wu, X. Yuan, S. Pan, H. Tong, and J. Pei (2022) Trustworthy graph neural networks: aspects, methods and trends. arXiv preprint arXiv:2205.07424. Cited by: §I.
[53] H. Zhang, J. Reed, H. Ritter, Z. DeVito, Z. Yu, T. Karaletsos, H. He, G. Dai, G. Huang, A. Ussery, et al. (2022) Understanding gnn computational graph: a coordinated computation, io, and memory perspective. Proceedings of Machine Learning and Systems 4. Cited by: §I.
[54] M. Zhang and et al. (2018) Link prediction based on graph neural networks. Advances in neural information processing systems 31. Cited by: §II-A.
[55] M. Zhang and et al. (2018) Link prediction based on graph neural networks. Advances in neural information processing systems 31. Cited by: §III-A2.
[56] Q. Zhang, D. Wipf, Q. Gan, and L. Song (2021) A biased graph neural network sampler with near-optimal regret. Advances in Neural Information Processing Systems 34. Cited by: §I, §II-B.
[57] T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah (2020) Data augmentation for graph neural networks. arXiv preprint arXiv:2006.06830. Cited by: §V.
[58] C. Zheng, B. Zong, W. Cheng, and et al. (2020) Robust graph representation learning via neural sparsification. In ICML, pp. 11458–11468. Cited by: §V.
[59] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra (2020) Beyond homophily in graph neural networks: current limitations and effective designs. Advances in Neural Information Processing Systems 33, pp. 7793–7804. Cited by: §III-A1, §III-A1.
[60] D. Zou, Z. Hu, and et al. (2019) Layer-dependent importance sampling for training deep and large graph convolutional networks. Advances in neural information processing systems 32. Cited by: §I, §II-B, §V.

Rethinking Efficiency and Redundancy in Training Large-scale Graphs

Abstract

I Introduction

Ii Preliminary and Motivation

Ii-a Fundamental of GNNs

Ii-B Sampling-based Training Methods

Ii-C Observations on Potential Redundancy

Ii-D Opportunities to Improve Efficiency

Iii Methodology

Iii-a The Proposed Metric

Iii-A1 Neighbor Heterophily

Proposition

Proof

Iii-A2 Linking Probability

Iii-A3 Definition of the Metric

Proposition

Proof

Iii-B The Proposed Method: DropReef

Iv Experiment

Iv-a Experimental Setting

Iv-B Experimental Results

Iv-C Impact on Subgraphs

Iv-D Discussion

V Related Work

Vi Conclusion

References

Rethinking Efficiency and Redundancy in
Training Large-scale Graphs