A Survey on Temporal Graph Representation Learning and Generative Modeling

Shubham Gupta
Department of Computer Science
IIT Delhi
shubham.gupta@cse.iitd.ac.in
&Srikanta Bedathur
Department of Computer Science
IIT Delhi
srikanta@cse.iitd.ac.in
Abstract

Temporal graphs represent the dynamic relationships among entities and occur in many real life application like social networks, e-commerce, communication, road networks, biological systems, and many more. They necessitate research beyond the work related to static graphs in terms of their generative modeling and representation learning. In this survey, we comprehensively review the neural time-dependent graph representation learning and generative modeling approaches proposed in recent times for handling temporal graphs. Finally, we identify the weaknesses of existing approaches and discuss research proposal of our recently published paper Tigger Gupta et al. (2022).

1 Introduction

Traditionally static graphs have been the de facto data structures in many real-world settings like social networks, biological networks, computer networks, routing networks, geographical weather networks, interaction networks, co-citation networks, traffic networks, and knowledge graphs Aggarwal and Wang (2012); Aggarwal (2011); Cook and Holder (2006). These graphs are used to represent the relationships between various entities. Major tasks like community detection, graph classification, entity classification, link prediction, and combinatorial optimization are established research areas in this domain. These tasks have applications in recommendation systems Konstas et al. (2009), anomaly detectionNoble and Cook (2003), information retrieval using knowledge graphs Färber et al. (2018), drug discovery Goyal et al. (2020) , traffic prediction Zhao et al. (2020), molecule fingerprinting Duvenaud et al. (2015), protein interface prediction Fout et al. (2017) and combinatorial optimization Peng et al. (2021). Recently, graph neural networks Kipf and Welling (2017); Hamilton et al. (2017); Veličković et al. (2018); Xu et al. (2019c); You et al. (2019) have been developed to improve the state-of-the-art in these applications. Moreover, much success has been achieved in terms of quality and scalability by the graph generative modeling methods You et al. (2018b, 2019); Liao et al. (2019a).

However, most of these datasets often have the added dimension of time. The researchers marginalize the temporal dimension to generate a static graph to execute the above tasks. Nonetheless, many tasks like future link predictionTylenda et al. (2009), time of future link prediction Trivedi et al. (2017) and dynamic node classificationKumar et al. (2019) require temporal attributes as well. This has led to recent advancements in temporal graphs in terms of defining and computing temporal properties Holme (2015). Moreover, algorithmic problems like travelling salesman problem Michail and Spirakis (2016), minimum spanning treesHuang et al. (2015), core decompositionWu et al. (2015), maximum clique Himmel et al. (2016) have been adopted to temporal graphs. Recent research on graphs has focused on dynamic representation learning Rossi et al. (2020); Zhang et al. (2020); Trivedi et al. (2019); Xu et al. (2021) and achieved high fidelity on the downstream tasks. Research in temporal graph generative spaceZhou et al. (2020); Zeno et al. (2021) is in its early stages and requires focus, especially on scalability.

Many surveys exist that separately study techniques on graph representation learning Chen et al. (2020a, b); Khoshraftar and An (2022a, b); Wu et al. (2021), temporal graph representation learning Shehnepoor et al. (2022); Barros et al. (2021); Kazemi et al. (2020), and graph generative modeling Guo and Zhao (2020); Faez et al. (2020). But this survey is a first attempt to unify these inter-related areas. It aims to be an initial point for beginners interested in the temporal graph machine learning domain. In this report, we outline the following-

  • We initiate the discussion with definitions and preliminaries of temporal graphs.

  • We then present the summary of graph representation learning methods and the static graph generative methods that are prerequisites to explore similar approaches for temporal graphs.

  • We discuss in-depth the temporal graph representation approaches proposed in recent literature.

  • Subsequently, we outline the existing temporal graph generative methods and highlight their weaknesses.

  • Finally, we propose the problem formulation of our recently published temporal graph generative model TiggerGupta et al. (2022).

2 Definitions and Preliminaries

This section will formalize the definition of temporal graphs and their various representations. Furthermore, we will explain a temporal graph’s node and edge attributes. We will then discuss the various tasks under temporal graph setting and frequent metrics used in literature.

2.1 Temporal Graph

Temporal graphs are an effective data structure to represent the evolving topology/relationships between various entities across time dimensions. However, temporal graphs are not only used to describe the phenomenon which simulates the evolving links but also the underlying process which triggers the entity addition or removal from the topology. In addition, they also represent the evolution of entity/edge attributes over time. For example, temporal graph-based modeling can explain the sign-up behavior of a user in a shopping network and the causes of churn apart from their shopping behavior. In this survey, we also use temporal graphs to characterize those dynamic graphs that are not evolving anymore.

A temporal graph is either represented in continuous space or discrete space. In the below subsections, we describe the distinction between the two.

2.1.1 Continuous Time Temporal Graph

A continuous-time temporal graph is a stream of dyadic events happening sequentially.

(1)

Each is a temporal event tuple at timestamp with attributes . These attributes can be continuous, categorical, both, or none at all. Each event is defined depending on the type of dataset and tasks. For an interaction network like transaction network, shopping network, and communication networks, each is an interaction between a node and at a time where and is a collection of entities/nodes in the network. In a general framework, is a variable across time. Nodes can be added and deleted, which is represented by the event . An event is specified as node addition event where a node with attributes at timestamp has been added to the network. A repeated node addition event is interpreted as a node update event with new attributes if the node is already existing in the network. For example, the role of an assistant professor node can change to associate professor node in a university network. Similarly, a node deletion event is also possible.

In temporal graphs like knowledge graphs, co-citation networks, biological networks, transport network, each edge will have a time-span i.e, an edge can be added to the network at time and removed from where . is the last observed time-stamp in network. In such cases, we will assume that each edge event is either a link addition event () at in the network or a link deletion event () at in the network.

Figure 1 shows an example of one of the interaction networks. Event representation is often designed as per the requirement in the existing literature. For example, Trivedi et al. (2019) add a variable k in the event tuple . Here, is a binary variable signifying whether this event is an association event (the permanent link between two nodes in the network) or a communication event (interaction between two nodes). We encapsulate such intricacies in the variable to simplify the presentation.

Temporal Interaction Network
Figure 1: Temporal Interaction Network

2.1.2 Discrete-Time Temporal Graph

Generally, a temporal graph is generated by accumulating evolution across a window of consecutive timestamps to extract the desired information or apply static graph modeling techniques. We divide the time axis in equal lengths and perform aggregation to create a graph along with each such temporal window. Figure 2 displays one such aggregated temporal graph. This representation is known as a discrete-time temporal graph.

Evolving Temporal Network
Figure 2: Evolving Temporal Network

A discrete-time temporal graph is defined as follows:

(2)

Here, each is a static network at time where is a collection of nodes in time-window and is the collection of edges . is the node feature matrix, where is the dimension of the feature vector for each node. Similarly, is edge feature matrix and is the dimension of the edge feature vector. Please note that the meaning of the notation is dataset specific. Essentially, it is a custom representation of the period over which the graph has been observed. For example, can be a month if each snapshot is collected for each month. In many cases, it can simply be the mean of the time-window or last timestamp in the time-window as well.

2.2 Attributes

Attributes are a rich source of information, except for the structure of a graph. For example, a non-attributed co-citation network will only provide information about frequent co-authors and similarities between them, but an attributed network containing roles and titles of every node will also offer a holistic view of the evolving interest of co-authors and help in predicting the next co-author or next research area for co-authors. Attributes, in general, can take a categorical form like gender and location, which are represented as 1-hot vectors. They can also make a continuous form like age. For example, in a Wikipedia network Kumar et al. (2019), each interaction’s attributes are word vector of text edits made in the wiki article. And the class (label) of each user explains whether the user has been banned from editing the page. Sometimes these attributes are present as meta-data in the network. Frey and Dueck (2007) airline network dataset contains the meta-information in the form of the city’s latitude, longitude, and population.

2.3 Tasks and Evaluating Metrics

The primary objective in representation learning is to project each node and edge into d-dimension vector space. It achieves it by learning a time-dependent function where is the set of nodes in the temporal network which has been observed until time T. Generally for future prediction tasks, but time-dependent requirement can be dropped to learn only a node-specific representation Nguyen et al. (2018). If the function allows only an existing node as an argument, then this setting in literature is generally known as transductive representation learning Hamilton et al. (2017). Often this is desirable not to have this restriction since frequent model training is not possible in many real-life systems, and it is usually required for the trained model to generate representations for unseen nodes . This setting is recognized as inductive representation learning Hamilton et al. (2017). The argument in the function encapsulates graph/node/edge attributes and is simplified according to design choices. Suppose we assume that temporal graph in the continuous domain, then can be approximated as a sequence of event streams or only those events in which another argument and its neighbors are involved. Neighbors of a node v in a temporal graph do not have a universal definition like static graphs. Most papers typically define the neighbourhood of a node at given time as a set of nodes which are hops away in topology from node and their time of interactions are within of given time . , and are application specific parameters. Trivedi et al. (2019); Xu et al. (2021); Rossi et al. (2020). This definition typically selects the recent interactions of node v before time t. Zhang et al. (2020); Kumar et al. (2019) use the node ’s all previous interactions to learn its representation at time t. This is the special case of and .

Every representation learning primarily focuses on learning the node representation and then using these representations to learn the edge embeddings. Edge representation is learnt by learning a function which aggregates the node representation of its two end nodes and their attributes, including the edge in the consideration. Most often, is a concatenate/min/max/mean operator. It can also be a neural network-based function to learn the aggregation. We note that t is an argument to the function , allowing the to learn a time-dependent embedding function.

Similarly, discrete-time temporal graph snapshots are encoded into low-dimensional representation. Like the edge aggregator function , a graph-level aggregator is learned, taking all nodes in that graph as input.

The most frequent tasks using graph/node/edge representations are graph classification, node classification, and future link prediction. These are known as downstream tasks. These tasks cover many applications in recommendation systems, traffic prediction, anomaly detection, and combinatorial optimization. In recent work, we have also observed additional tasks like event time prediction Trivedi et al. (2019) and clustering Zhang et al. (2017). We will now detail each task and the metrics used to evaluate the model efficiency for these tasks. Please note that we overload the notation for each task.

2.3.1 Node Classification

Given a temporal graph and node , a function is trained to output its label. Formally,

Where C is a set of categories possible of a node in a temporal graph . Often a node can be represented as a time-dependent dimensional vector , can be approximated to

Node classification can be conducted both in transductive and inductive learning depending upon the argument node to the function . In the transductive setting, the argument node is already seen during training. It is unseen in the case of inductive learning. Accuracy, F1, and AUC are frequent metrics for evaluating node classification. The AUC is more favorable since it provides a reliable evaluation even in case high-class imbalance. Anomaly detection is a typical case where class imbalance is observed, and the positive class typically belongs to less than of the population.

2.3.2 Future Link Prediction

Given a temporal graph , two nodes and , a future timestamp t, a function is learned which predicts the probability of these two nodes linking at a time where is the latest timestamp observed in the .

Similarly, we can also predict the attributes of this future link. As simplified in the node classification, since the node representations and are dependent on and time , we can write as follows:

We further observe that in most future link prediction settings, node representations for are approximated as

The argument is often not required in future link prediction functions for such settings since is simply predicting the probability of link formation in the future. In most methods, is the cosine similarity function between node embeddings or a neural function that aggregates the information from these embeddings. In some instances, Trivedi et al. (2019), is approximated by the temporal point processes Rasmussen (2018), which also allows for predicting the time of link formation as well. Like node classification, we can categorize the future link prediction task into transductive and inductive settings, depending upon whether the node in consideration is already seen during training.

Future link prediction is evaluated in 2 significant settings. Researchers typically choose either of these two. In the first set-up, a link prediction task is considered a classification task. The test data consists of an equal number of positive and negative links. Positive links are the actual links present in the future subgraph or test graph. Generally, a test graph is split chronologically from the training graph to evaluate the model performance. Therefore, edges in this test graph or subgraph are considered positive examples. From the sample test graph, an equal number of non-edge node pairs are sampled as negative links. Accuracy, f1, or AUC are used to evaluate the performance of . In another setting, future link prediction is seen as a ranking problem. For every test node, its most probable future neighbors are ranked and compared with actual future neighbors. In a slightly different framework, the top K possible edges are ranked in a test graph and compared with the ground truth edges. Preferred metrics in these ranking tasks are mean reciprocal rank(MRR), mean average precision(MAP) Wang et al. (2016), precision@K, recall@K.

2.3.3 Event Time Prediction

Trivedi et al. (2019) introduces a rather novel task of prediction of the time of the link in consideration. This task has applications in the recommendation system. Learning which items will be purchased by a user at a particular time t will result in better recommendations, optimized product shipment routing, and a better user experience. Mean absolute error (MAE) is the metric for evaluating this task.

3 Literature Review

We first summarize the prevalent static graph representation learning methods. We will later see that temporal graph representation methods are direct extensions of these approaches.

3.1 Static Graph Representation Learning Methods

These methods are divided into two main categories, a) Random walks based methods and b) Graph neural network. There are other categories as well, like factorization based approaches Ahmed et al. (2013) Belkin and Niyogi (2002) and Ou et al. (2016). However, these are generally not used due to associated scalability problems and the inability to use available attributes. Moreover, random walk and GNN-based methods are also superior in quality. For this subsection, we assume as a static graph where is the node-set and is the edge set. is number of nodes, and is number of edges. We denote the hop neighbourhood of node as and as input feature vector for node . Also, the bold small case variable denotes a vector, and the bold large case variable denotes a matrix.

3.1.1 Random Walk based Methods

Node representation is often the reflection of the graph structure, i.e., the more similar the representation, the higher the chances of the corresponding nodes to co-occur in random walks. This intuition provides the unsupervised learning objective to learn the node representation. Building on this, DeepWalk Perozzi et al. (2014) provided the first random walk-based method. They run the random walks from each node . Suppose one such k length random walk sequence is . Using the skip-gram objective Mikolov et al. (2013a), they learn the representation by optimizing the following loss objective.

(3)

where is node representation, is also the node representation, but it’s not used in the downstream tasks. This setup is similar to Mikolov et al. (2013a). is the window size of window centred at . is often re-written using negative sampling method Mikolov et al. (2013b) to avoid the computationally expensive operation in denominator of softmax as follows-

(4)

Where is the sigmoid function and K is number of negative samples, typically 5, and is a probability distribution over . It is often based on the degree of and the task. DeepWalk compute these losses for every node in each random walk and update the by using gradient descents methods Sutskever et al. (2013). We note that the next node is selected uniformly from the current node’s neighborhood in each random walk. DeepWalk also shows that these learned representations can be utilized in downstream tasks like node classification and missing link predictions. LINE is a direct extension of DeepWalk. They modify the DeepWalk loss by restricting co-occurring nodes to be directly connected. Furthermore, they add the following loss as well.

(5)

where is the edge set. This loss forces the neighbouring nodes to have similar representation. Node2Vec Grover and Leskovec (2016) uses the negative sampling Mikolov et al. (2013a) instead of hierarchical softmax to compute the expensive operation in the denominator of softmax. Furthermore, they introduce Breadth-First Search(BFS) and Depth First Search(DFS) biased random walks to learn the node representations. In BFS-based random walks, nodes near in terms of their hop distance are sampled more frequently. This biased sampling leads to learning community structures having similar embeddings for nearby nodes. In DFS-based random walks, nodes which are far are more likely to be sampled. This random walk assigns similar embeddings to nodes having similar roles/structures in the network.

These methods directly work with node IDs and do not factor in the node features and associated meta-data. Thus, these approaches are not extendable to unseen nodes in the network since new node IDs are absent during training. Furthermore, these are unsupervised approaches, so embeddings cannot also be learned using available supervision on the nodes/edges. These challenges limit the practical uses of the above methods.

3.1.2 Graph Neural Network based Methods

Kipf and Welling (2017) introduced Graph Convolutional Network (GCN) to learn the node representation based on graph adjacency matrix and node features. Below equation represents the layer wise message passing in multi-layer Graph Neural Network.

(6)

where is a node feature matrix, i.e. each row corresponding to the feature vector of a node , is a node representation matrix at layer. is a trainable weight matrix for message passing from layer to . where is the adjacency matrix corresponding to the graph . is a diagonal matrix, where each diagonal element is the degree of the corresponding node in the matrix. is a non-linear activation like Sigmoid, ReLU etc. Note that is added to add the self-loops in formulation. This formulation allows the node representation at the next layer to be an aggregation of self features and feature of neighbour nodes. L layer network will cause node representation to be impacted by the L-hops neighbours, which is evident from the formulae.

This proposed approach requires supervision, as the input network needs to have a few nodes labeled to learn the . Authors train the for each layer by cross-entropy loss after applying softmax over for each labeled node . This approach can not incorporate the unseen nodes as training requires a full adjacency matrix. This requirement also causes scalability issues for large graphs.

Hamilton et al. (2017) identified that equation 6 essentially averages the node representation of the target node and its 1-hop neighbor nodes to learn the node representation for each node at the next layer. So, a complete matrix formulation is not needed. Furthermore, the W matrix is also not dependent on node identity. This relaxation motivates the inductive setting as well. Hamilton et al. (2017) authors proposed a method GraphSage with the node-level layer-wise message propagation formulation given below.

(7)

where , is the feature vector for the node . CONCAT and are function which are defined as per the requirement. AGGREGATE simply learns the single representation from the set of neighbour node representations. GraphSage uses MEAN, MAX and RNN based aggregator functions. CONCAT is simply the concatenation of two embeddings. Also, note that after each layer propagation, s are normalized using l-2 norm. The above formulation is effective since it allows the weights matrices to adjust importance on the current target node and neighbor nodes representations to learn the subsequent layer representation for the target node. Additionally, this formulation is inductive since node identity or adjacency matrix is not required. GraphSage utilizes supervised loss on node labels by applying MLP followed by softmax operation to convert the node embedding to a probability vector in node labels space. GraphSage additionally proposes unsupervised loss similar to equation 4 where node u and v co-occurs on a short length sampled random walks.

In the above formulation, neighbours are treated equal in aggregator functions like Max or Mean Pool. Veličković et al. (2018) proposed an attention based aggregator, namely GAT to compute the relative importance of each neighbour. Specifically, the authors proposed the following to compute the node representation of node at layer .

(8)

indicates the importance of message from node to node . Here, a is a trainable weight matrix. Additionally, GAT introduces the multi-head attention similar as Mikolov et al. (2013b) to utilize the self-attention based learning process. So, the final aggregation function using K attention heads becomes as follows:

(9)

Finally, Xu et al. (2019b) proves that aggregator functions used in GraphSage and GCN like max-pool and mean-pool are less powerful in graph isomorphism task than sum aggregator by showing that mean-pool and max-pool can produce similar representation for different node multi-sets. They propose the following simpler GNN formulation -

(10)

where g is graph embedding and L is number of layers in GNN and is a learnable irrational parameter. They also show that this formulation is as powerful in graph isomorphic test as WL-test if node features are from countable setLeman and Weisfeiler (1968).

The above formulations assign similar embeddings to nodes with similar neighborhoods with similar attributes, even if they are distant in the network. However, embeddings need to account for the node’s position in the network in many settings. Applications like routing, which involves the number of hops/distance between nodes for the end objective, are examples of this requirement. Specifically, the node’s position in the network should also be a factor in the learned embeddings. PGNN You et al. (2019) proposed a concept of anchor nodes to learn position-aware node embeddings. Assuming K anchor nodes as and distances of node from these nodes respectively as then a node can be represented using a position encoded vector of size K. More anchor nodes can provide better location estimates in different network regions. PGNN generalizes the concept of an anchor node with an anchor set, which contains a set of nodes. Node ’s distance from an anchor set is the minimum of distances from all nodes in the anchor set. We denote anchor set as . Each contains nodes sampled from . We note that each anchor set can contain a different no of nodes. These K anchor sets create a K size position encoded vector for all nodes. These vectors are used along with original node features to encode each node. However, since each dimension in the position vector is linked with an anchor set, changing the order of the anchor set/position vector should not change the meaning. This constraint requires using a permutation invariant function aggregator in the GNN. So, PGNN introduces the following formulation for position-aware node representation.

(11)

where is K size position aware representation for node v. w is a trainable parameter. Also, is the shortest path between node and . If is more than a certain threshold than it is assumed to be infinity. This assumption speeds up the all pair shortest path computation.

The above GNNs methods work well in graphs having high homophily. Zhu et al. (2020) defines homophily as the ratio of number of edges connecting nodes having the same label to the total number of edges. Many networks like citation networks have high homophily, but networks like dating networks have very low homophily or heterophily Zhu et al. (2020). They show that state-of-the-art GNN methods work very well on high homophily networks but perform worse than simple MLPs in heterophily networks. Their method H2GCN proposes the following simple modifications in GNNs to improve their performance in heterophily networks.

  1. Target node ’s embedding (ego embedding) should not be averaged with neighbourhood embeddings to compute the embedding at next layer, as done in GCN. GraphSage style concatenation is better. This is similar to the skip-connections in deep neural networks, as to increase the depth of the network.

    (12)
  2. Instead of using 1-hop neighbour’s embedding at each layer to compute the ego-embedding, H2GCN proposes to use higher order neighbour as well as follows:-

    (13)

    where denotes the nodes which are at hops away from node .

  3. Instead of using only the last layer embedding as final embedding for each node, H2GCN proposes combination of representation at all GNN layers.

    (14)

    where is input feature representation of node .

Schlichtkrull et al. (2018) observed that the proposed GNN architectures apply only to in-homogeneous networks where each node and edge is of a single type. For example, friendship networks. Thus, they proposed a GCN extension, namely RGCN, applicable for heterogeneous networks like knowledge graphs where entities and relations can be multiple types. RGCN suggested that instead of using the same weight matrix for each neighbor during neighborhood aggregation at each layer, use a separate weight matrix for each node type. Since number of unique relations and nodes are typically around millions, they propose either using block-diagonal matrices or decomposing using basis matrices and learning the coefficient of basis matrices for each entity and relations. Finally, Chiang et al. (2019) proposed a method ClusterGCN to learn GNNs on large-scale networks which have millions of nodes and edges. They proposed to first cluster nodes based on any standard clustering approach and then randomly select multiple node-groups from the clusters and create an induced subgraph using these chosen nodes. Now, update the weights of GNN by running gradient descent on this sub-graph. Repeat the process of randomly sampling node groups, creating an induced sub-graph, and training the GNN until convergence.

3.2 Deep Generative Models for Static Graphs

Till now, we have viewed the static graph representation approaches. We now shortly summarize the deep generative models for static graphs. Given a collection of input graphs which are assumed to be sampled from an unknown underlying distribution , the goal of generative methods is to learn a probability distribution which is highly similar to and produces graphs having highly similar structural properties as input graphs. Traditional graph generative models assume some prior structural form of the graph like degree distribution, diameter, community structure, or clustering coefficients. Examples include Erdős-Rényi Karoński and Ruciński (1997) graphs, small-world models Watts DJ (1998), and scale-free graphs Albert and Barabási (2002). Prior assumptions about the graph structures can be encoded using these approaches. Still, these approaches do not apply to practical applications like drug discovery, molecular property prediction, and modeling a friendship network. These models cannot automatically learn from the data. Learning a generative model on graphs is a challenging problem since the number of nodes varies in each graph of the input data. Furthermore, search space and running time complexity to generate edges is often quadratic in and . Additionally, in naive graph representation, any observed node ordering of a graph has probability, i.e., a single graph can be represented using possible node ordering. So, the learned generative model should be able to navigate this large space, which is not the case with images, text, and other domains. We initiate the discussion with NetGAN Bojchevski et al. (2018) which learn a generative model from the single input network. Then we compare the methods of MolGAN De Cao and Kipf (2018), DeepGMG Li et al. (2018), GraphRNN You et al. (2018a), GraphGen Goyal et al. (2020) and GRAN Liao et al. (2019b). These methods take a collection of graphs as input to learn the generative model. Among these, GraphGen and GRAN currently, are state-of-the-art methods

NetGAN samples a collection of random walks of max length using the sampling approach in Node2Vec and train a WGAN Arjovsky et al. (2017) on this collection. A GAN architecture primarily consists of a generator and a discriminator. The discriminator scores the probability of a given random walk as real. A discriminator is trained by collecting sampled random walks and generated random walks by generator, where its task is to assign high probabilities to real walks and low probabilities to synthetic walks. The discriminator and generator are trained in tandem until the generator generates walks indistinguishable from real walks and confuses the discriminator. The discriminator architecture LSTM based, where each node in a sequence is encoded using a one-hot vector where the size of the vector is . Once the LSTM unit processes a sequence, it outputs a logit that provides the sequence scores. Generator architecture is a bit tricky since it involves a stochastic operation of sampling the next node during random walk sequence generation, i.e., . Basically, a vector is sampled from the normal distribution. is transformed to a memory vector where is a MLP based function. This is used to initialize the LSTM cell along with vector which outputs the next memory state and probability distribution over the next node . From this multinomial distribution , the next node is sampled, which along with is passed to LSTM cell to output . This process repeats until the generator samples the length node sequence. To enable backpropagation using this method, the next node sampling from multinomial distribution is replaced with Gumbel-softmax trick Jang et al. (2017) which is essentially where s are sampled from Gumbel distribution with 0 mean and 1 scale. Forward pass is computed using and backpropagation is computed using continuous . is a temperature parameter. Low works as argmax, and very high works as uniform distribution. Once the WGAN is trained, NetGAN samples a collection of random walk sequences from the generator. It creates a synthetic graph by selecting top edges by frequency counts. NetGAN uses multiple graph properties like max node degree, assortativity, triangle count, power-law exponent, clustering coefficients, and characteristic path length to compare synthetic graph with original graph . We note that NetGAN’s node space is the same as the input graph ’s node space . Due to this limitation, its usages are limited to applications requiring samples with similar properties as the input graph but with the same node space. Next, we look at the MolGAN De Cao and Kipf (2018) which is similar to NetGAN in terms of using GAN. MolGAN takes a collection of graphs as input instead of a single graph. Further, MolGAN is specifically designed for molecular graphs, which will be clear with the design choices.

MolGAN utilizes the GAN architecture to learn the generator. Generator takes a noise vector and transform it using MLPs to a probability-based adjacency matrix and node feature matrix. Using Gumbel-softmax, adjacency and node feature matrices are sampled from probability matrices. Finally, a GNN-based discriminator tries to classify actual graphs and generated graphs. As we see, in MolGAN generator produces a whole graph, but in NetGAN, it was used to produce a random walk. Finally, as MolGAN noted that there are available software packages 111http://www.rdkit.org/ to evaluate the generated molecules in terms of desired chemical properties. These scores act as rewards in MolGAN to provide additional supervision to the generator. This model’s parameters are trained using a deep deterministic reinforcement-learning framework Lillicrap et al. (2016). This method is not scalable since it requires computation and memory. Furthermore, a graph can be represented using adjacency matrices, leading to significant training challenges. We now describe methods that solve these problems.

DeepGMG Li et al. (2018) observes graph generation as a sequential process. This process generates one node at a time and decides to connect the new node to existing nodes based on the current graph state and new node state. DeepGMG employs GNNs to model the states. Specifically, assuming an existing graph having nodes and edges, newly added node , it works as follows:

(15)

For each round of new node addition, an L-layer GNN is executed on the existing graph to compute embeddings of nodes and graphs. Using graph embedding, first, a decision is taken to add a new node or not. If yes, then a decision is taken whether to add edges using this node or not. If yes, then using embedding of existing nodes and new nodes, a score is calculated, and subsequently, a probability distribution is computed over the edges of node with existing nodes. From this distribution, edges are sampled. Moreover, this process repeats till a decision is taken not to add a new node. The embedding of a new node is initialized using its features and graph state. This approach is also computationally expensive, since it runs a GNN and softmax operation over existing nodes for each new node. It has a lower memory requirement since it no longer needs to store the whole adjacency matrix. This process is trained using Maximum likelihood over the training graphs. MolGAN and DeepGMG evaluate the generative models’ performance by visual inspection and using offline available quality scores provided by chemical software packages.

GraphRNN You et al. (2018a) follows similar graph representation approach as DeepGMG but replaces GNN with Recurrent neural architecture and uses Breadth-First Search(BFS) based graph representation. Furthermore, it introduced a comprehensive evaluation pipeline based on graph structural properties. Its contributions are summarized as follows:

  • Graph Representation: Unlike previous methods, GraphRNN considers the BFS node ordering of each permutation to reduce the number of possible permutation as many permutations map to a single BFS ordering. Although a graph can still have multiple BFS sequences, permutation space is drastically reduced. This approach has two-fold benefits.

    1. Training needs to be performed over possible BFS sequences instead of all possible graph permutations.

    2. A major issue with DeepGMG was that possible edges are computed for each node, with all previous nodes causing computations. But GraphRNN makes the following observation in any BFS sequence, i.e. whenever a new node is added in the BFS node sequence where it does not make an edge with a node for , we can safely say that any node will not make an edge with . This observation implies that we do not need to consider all previously generated BFS nodes for possible edges with the new node. We can empirically calculate signifying latest generated nodes are needed for a possible edge with the new node. This reduces the computation to .

  • Hierarchical Recurrent Architecture: GraphRNN uses two-level RNNs for modelling each BFS sequence. Primary RNN decides the new node and its type. Node type also includes a stop node to signal the completion of the graph generation process. The hidden state of the primary RNN initializes the hidden state of the secondary RNN, which sequentially processes over the latest nodes in the BFS order sequence to create possible edges. These 2 RNNs are trained together using the maximum likelihood objective.

  • Metrics: GraphRNN introduced Maximum Mean Discrepancy (MMD) Gretton et al. (2012) based metrics to compute the quantitative performance of a graph generator. For input graphs, a node degree distribution is calculated for each graph in the input graph set and generated graph set. MMD is used to compute the distance between these two distributions. GraphRNN shows the MMD distances for degree distributions, clustering coefficient distribution, and four-node size orbit counts.

GRAN Liao et al. (2019b) also follows a somewhat similar procedure as DeepGMG of learning edges for each new node with previously generated nodes during the graph generation process. GRAN observes the graph generation process of DeepGMG as creating a lower triangular part of adjacency matrix, i.e., generating 1 row at a time starting from the first row. This process requires sequential computations. To scale this approach for large graphs k, instead of generating 1 row at a time, they generate blocks of B rows, i.e., sequential computation. Furthermore, they drop the recurrent architecture to model the sequential steps used in DeepGMG and GraphRNN. This step facilitates parallel training across sequential steps. At each sequential step, the main task is to discover edges between nodes within the new block and edges between existing nodes and new nodes. To do this, GRAN creates augmented edges between nodes of the new block. They also create augmented edges between new nodes and all existing nodes. Finally, they use rows of lower triangular adjacency matrix as features for existing nodes and 0 for new nodes. They transform these features to low dimensions using a transformation matrix. Finally, GRAN runs a round of GNN updates on this graph to compute each node’s embedding. Using these embeddings, they learn the Bernoulli distribution over each augmented edge. These learned distributions are utilized to sample edges. We note that GRAN has used MLP over a difference of embeddings of nodes during message passing and computation of attention coefficients. The overall time complexity of GRAN is the same as DeepGMG. Since the architecture is parallelizable, GRAN can train and generate graphs for large-scale datasets compared to GraphRNN.

Finally, GraphGen introduces a minimum DFS code-based graph sequence representation. Minimum DFS code is the canonical label of a graph capturing structure and the node/edge labels. Canonical labels of two isomorphic graphs are the same. A DFS code sequence is of length where is number of edges. Each edge having node labels and edge label in the sequence is represented as where , is time of discovery of node during DFS traversal. Essentially, DFS codes can be lexicographically ordered. The smallest DFS code among possible DFS codes of a graph is called the minimum DFS Code. Minimum DFS code is an interesting concept, and for thorough details, we refer to GraphGen Goyal et al. (2020). This representation drastically reduces the possible permutations of each graph, which speeds up the training process. Using the maximum likelihood objective, an LSTM-based sequence generator is trained over the minimum DFS codes generated from input graphs.

3.3 Temporal Graph Representation Learning Methods

We now summarize the temporal graph representation learning based methods. Overall, we can classify these methods into two major categories, Snapshot/discrete graph based methods Sankar et al. (2020); Goyal et al. (2018); Zhou et al. (2018); Singer et al. (2019); Pareja et al. (2020) and Continuous time/event-stream based temporal graphs Lu et al. (2019); Liu et al. (2020); Zuo et al. (2018); Kumar et al. (2019); Zhang et al. (2020); Trivedi et al. (2019); Xu et al. (2021); Rossi et al. (2020); Wang et al. (2021); Zhang et al. (2017).We discuss both of these categories separately.

3.3.1 Snapshot/Discrete Graph based Methods

For the following discussion in this subsection, we assume the following notation-

A dynamic graph is represented as a collection of snapshots where each , is node set at time and similarly is edge set, adjacency matrix and node feature matrix at time . and are number of nodes and edges in graph .

Naïve method involves running a static graph embedding approach over each graph snapshot and aligning these embeddings across snapshots using certain heuristics Hamilton et al. (2016). This is a very expensive operation. DynGem Goyal et al. (2018) proposes an auto-encoder based approach which initializes weights and node embeddings of using . Mainly, an autoencoder network (MLP based) takes two nodes of an edge in represented by their neighbourhood vector and computes a dimensional vector representation . A decoder further takes as input and reconstructs the original as . Following loss is optimized at each timestamp to compute the parameters.

(16)

where are -norm and -norm computed over networks weights to reduce the over-fitting. is a autoencoder reconstruction loss at snapshot and is a first order proximity loss to preserve the local structure. We note that parameters of autoencoder for are initialized using parameters of autoencoder for inducing stability over graph/node embeddings over consecutive snapshots and reducing the training computation assuming graph structures doesn’t change drastically at consecutive snapshots. tNodeEmbed Singer et al. (2019) is a similar method but instead of using autoencoder they directly learn node embedding matrix for each snapshot by optimizing node classification task or edge reconstruction task. They also use the following loss to align the consecutive as follows:

(17)

where first term forces stable consecutive embeddings and second terms requires being a rotation matrix. Further, they employ recurrent neural network over node embeddings at each timestamp to connect time-dependent embeddings for each node and to learn its final representation. DynamicTriad Zhou et al. (2018) is a similar method based on regularizing embeddings of consecutive snapshots, but it models an additional phenomenon of triadic closure process. Assuming 3 nodes in a evolving social network where are connected but are not, i.e. is a common friend of and . Depending upon the social habits, it might introduce and or not. This signifies that the higher the number of common nodes between and , the higher the chances of them being connected at the next snapshot. DynamicTriad models this phenomena by defining a strength of with at snapshot as follows:

(18)

where and denotes the tie strength of with and respectively at time . Utilizing this, DynamicTriad defines a following probability of becoming a close triad at snapshot given they are open triad at snapshot given is the common neighbour.

(19)

Since and can have multiple common neighbours, any of the neighbour can connect, thus closing the triads with all the neighbours. But in real world, which neighbour(s) closed the triad is(are) unknown. To accommodate this, they introduce a vector of length . is number of common neighbours of, i.e. length of set . Formally, where denotes that will connect at time under the influence of . Finally, they introduce probabilities that will connect at, given that they are not connected and in open triad(s) at the time .

(20)

where denotes the connecting/not connecting at next snapshot . Summation over denotes the iterating over all possible configuration of common neighbour(s) causing the connection between . We note that at-least 1 entry in should non-zero to enable edge creation between at next step. Finally, their loss optimization function is:

(21)

where is set of edges which form at and is set of edges not existing at . is a ranking loss over edges which preserves the structural information of corresponding snapshot and is embedding stability constraint over embeddings of each node. These methods can’t directly utilize the node features and only focuses on the first order proximity of graph structures, ignoring the higher order structures. EvolveGCN Pareja et al. (2020) and DySAT Sankar et al. (2020) utilize the graph neural networks to calculate the node embeddings over each snapshot. Thus, these methods are additionally capable of using node features to model the higher order neighbourhood structure. We note that these features are also dynamic, i.e. can evolve over time.

EvolveGCN is a natural temporal extension of GCN. Equation 6 is rewritten as follows:

(22)

Where sub-script denotes the time of the corresponding snapshot, we note that in EvolveGCN formulation, are not learned using GNN but are the output of recurrent external network which incorporates the current as well the past snapshots’ information. The parameters of a GCN at each snapshot are controlled by a recurrent model, while node embeddings at corresponding snapshots are learned using these parameters. Specifically, EvolveGCN proposes two variants of calculation.

  1. This variant treats as hidden state of a RNN cell. Specifically,

    (23)

    where is the input to the RNN and is the hidden state at previous timestamp. This variant is useful if node features are strong and plays important role in end task.

  2. This variant treat as output of a RNN cell. Specifically,

    (24)

    where is a input which was the output of RNN cell at previous timestamp.

The parameters are trained end-to-end using any loss associated with the node classification/link classification/link prediction tasks.

Similar to EvolveGCN, DySAT Sankar et al. (2020) is a gnn based model, namely GAT. DySAT runs two attention blocks. First, it runs a GAT style GNN over each snapshot separately. This provides the static node embeddings at each snapshot, which captures the structural and attributes based information. Then, DySAT run a temporal based self-attention block i.e. for each node at time , this module takes it’s all previous embeddings and compute a new embedding using self-attention which also incorporate temporal modalities. Specifically,

  • Structural Attention : For each , in equation 8 is replaced as:

    (25)

    where is a graph input which denotes the weight on the edge . Please note that in GAT message is computed from self node, which is not the case here.

  • Temporal Self-attention: We assume the learned node representation using structural attention for each node as and . is input to the temporal self-attention block and is the output. Also, is the final output of node at snapshot which will be used for downstream tasks. Following similar design as Vaswani et al. (2017), is used as query, key and value, thus the name self-attention.

    Specifically, are matrices which are used to transform to the corresponding query, key and value space. Essentially, query and key are used to compute an attention value for each timestamp with previous timestamps include . Using these attention values, a final value is computed for the timestamp by aggregating using attention weights over previous timestamp values, including . Specifically,

    (26)

    where and .

We note that the entire architecture is trained end to end by running random walks over each snapshot and using similar unsupervised loss as in DeepWalk for nodes co-occurring in the walks.

3.3.2 Continuous Time Graph/Event-Stream Graph-based Methods

Now, we turn our focus to the temporal embedding method for continuous graphs where each edge between two nodes is an instantaneous event. We will first discuss methods that model evolving network topology Liu et al. (2020); Zuo et al. (2018); Lu et al. (2019). Then we discuss methods that additionally model graph attributes as well. These include bi-partite interactions only methods Zhang et al. (2020); Kumar et al. (2019) and general interaction networks Xu et al. (2021); Wang et al. (2021); Trivedi et al. (2019); Zhang et al. (2017); Rossi et al. (2020). HTNE Zuo et al. (2018) notes that snapshot-based methods model the temporal network at pre-defined windows, thus ignoring the network/neighborhood formation process. HTNE remarks that neighborhood formation for each node is a vital process that excites the neighborhood formation for other nodes. For example, in the case of a co-authorship network, co-authors of a Ph.D. student will be her advisor or colleagues in the same research lab. However, as that student becomes a professor in the future, her students might collaborate with students of her previous colleagues, thus exciting edges between nodes. Furthermore, a neighborhood sequence of co-authors for that Ph.D. student might indicate the evolving research interests which will influence the future co-authors.

For the below discussion, we denote a temporal graph as where is the collection of nodes in the network, and is a feature matrix for nodes where and is the feature dimension. HTNE visualizes the neighbourhood formation as a sequence of events, where each event is a edge formation between source and target node at time . A neighbourhood formation sequence of a node can be represented as where is number of interactions/edges of node . We observe that a node can interaction multiple times with the same node but at different timestamps. HTNE assumes a as a event sequence for each node and models these event sequences using marked temporal point processes(TPP) Rizoiu et al. (2017). TPP is a great tool to model the sequential event sequence, where past events can impact the next event in the sequence. TPPs are mostly characterized by the conditional intensity function . defines an event arrival rate at time given the past event history . number of expected events in a infinitesimal time interval .

(27)

Particularly, HTNE utilizes temporal hawkes point process to model the neighbourhood formation sequence for each node in the network. Below formulation define the hawkes conditional intensity for edge between node and node at time .

(28)

where is the base rate of edge formation event between node and . is the importance of node s historical neighbour in possible edge creation with node . is the time decay kernel to reduce the impact of old neighbours in current edge formation. Finally, following loss is optimized to train the node embeddings.

(29)

HTNE applies exponential over to enable positive values for conditional intensity, as negative event rates are not possible. Furthermore, enables softmax loss which can be optimized using negative sampling similar to previously seen approaches to enable faster training. MDNE Lu et al. (2019) is a similar approach which additionally models the growth of the network i.e. number of edges at each timestamp apart from edge formation process. Specifically, given number of nodes by time t, the following defines the additional new edges at time :

(30)

where are learnable parameters. encodes the linking rate of each node with every other node in the network. Additionally, they add neighborhood factors of node and while calculating the intensity function for edge formation between node and in eq. 28. Finally, they add a loss term in eq. 29 on mean square error of predicted and ground truth new edges at each time t. FiTNE Liu et al. (2020) is a random walk-based method that defines a k length temporal walk over temporal graph as and there is no constraint over times of straight edges. Essentially, FiTNE follows a similar approach as DeepWalk,Node2Vec. It collects a collection of random walks from temporal graph and learns node embeddings using a skip-gram-based learning framework by similar unsupervised loss. When running a random walk, a transition probability is defined over consecutive edges as follows:

(31)

where is the set of outgoing edges from . Finally, they have proposed unbiased/time difference-based formulations for . These methods do not utilize the graph attributes and generate static node embeddings. Specifically, learned node embeddings are not a function of time. Thus, we now summarize methods that utilize graph attributes and learn node embeddings as a function of time .

JODIE Kumar et al. (2019) and TigeCMN Zhang et al. (2020) are bipartite interactions network-based methods with specific focus on user-item interaction paradigm. Likewise, their architecture actively involves design choices for users and items. Consequently, these methods are not applicable in non-bipartite interaction networks. A major difference between bipartite and non-bipartite networks is the no interactions between the same type nodes(user-user, item-item). JODIE remarks that current temporal embedding methods produce a static embedding using the temporal network, as we saw in previous approaches. In recommendation applications, this will ensure similar recommendations even if the user revisits the site after one hr/1 day/1 month, and so on. This problem implies node embeddings should be a function of time, , i.e., node embedding of a user should model the changing intent over time. Additionally, JODIE models the stationary nature of the user’s intent. Since users interact with a million items every day, recommendation time should be sublinear in order of number of items. JODIE’s architecture models all of these aspects. We note that following notation for a bipartite graph where is collection of user nodes, is collection of item nodes and is collection of interactions where denotes the observed interaction and is represented as . Assuming as embedding of node at time , as embedding of node at time just before i.e. updated embedding of node after latest interaction at time , as static embedding of node , JODIE defines following formulation of user and item nodes’ embeddings after observing an interaction :

(32)

where are time difference of last interaction of with respectively. s are modelled using an item RNN and similarly s using a user RNN. We note that all user’s RNNs share the same parameters, and similarly all item’s RNNs share the same parameters. Finally, JODIE proposes the following projection operator for calculating the projected node embedding at where t is the last interaction time and is time spent after last interaction.

(33)

In order to train the network, JODIE utilizes the future interactions of user nodes i.e. given a last interaction time with item for user and its future interaction time with item j, can we predict the item embedding just before , which will same as actual item ’s embedding . Predicted item embedding at time is calculated given the last interaction of node at time with node .

(34)

The architecture is trained end-to-end using the mean square error between predicted item embedding and actual item embedding. Further, it adds a regularization term over the change in consecutive dynamic embeddings for users and items. JODIE notes that since the architecture directly predicts the future item embedding, using LSH techniques Tsai and Yang (2014), the nearest item can be searched in constant time. TigeCMN Zhang et al. (2020) utilizes memory networks to store each node’s past interactions instead of a single latent vector in order to improve the systems’ performance. Specifically, TigeCMN creates a value memory matrix for each user and a value memory matrix for each item . These memory matrices have K slots of dimension. Furthermore, there exists a key memory matrix for all users and another key memory matrix for all items. Assuming a interaction having feature vector between user and item at time , memory matrix corresponding to user is updated as follows:

(35)

where erases values from using interaction embedding . is a add vector which updates the . The same approach is followed to update the value memory matrix for item i. To compute the user ’s embedding at time , its value matrix is passed through self attention layer to provide context-aware embedding for each slot and finally mean-pooled to provide dynamic embedding which is concatenated with static node embedding (computed using transformation on one-hot vector representation) to provide . Finally, this network is trained end to end using cosine similarity between user and item of each interaction in conjunction with samples from negative interactions similar to GraphSage’s unsupervised loss. We note that node embeddings learnt in TigeCMN are not function of time, i.e. the embedding corresponding to each node will not change after its last interaction. Another method IGE Zhang et al. (2017) uses skip-gram Mikolov et al. (2013a) style loss for training embeddings. First, they introduce a induced list containing all interactions by each node . IGE defines context window of size for each node using its neighbours in the induced list similar to word2vec method. Finally,IGE uses similar loss as skip-gram using negative sampling with these induced lists. Essentially, IGE observes each induced list as sentence or sequence of words to train their embeddings.

We now summarize the temporal graph representation methods, which utilize both dynamic graph topology and associated attributes to learn the node representation of non-bipartite temporal graphs. Specifically, DyRep Trivedi et al. (2019) remarks that most temporal graphs exhibit two dynamic processes realizing at different(same) timescale, mainly association process and interaction process. Association process signifies the topological changes in the graph, and Communication process implies the interaction/information exchange between connected(maybe disconnected) nodes. DyRep observes that these two processes are interleaving, i.e., an association event will impact the future communications between nodes, and a communication event between nodes can excite the association event. DyRep notes that association events have more global impact since they change the topology. In contrast, communications events are local, although indirectly capable of global impact by exciting topological changes in the network. Assuming and k=0 implies association event and k= 1 implies communication event. We note that permanent topological changes are potentially using k=0. The following formulation shows the node embedding update corresponding to the event .

(36)

where is the last event time by node and is the aggregated embedding of node ’s neighbourhood just before time . We note that node embedding update formulation comprises three main principles.

  1. Self-Propagation: A node representation should evolve from its previous representation.

  2. Localized embedding propagation: A event(association/communication) between nodes must be the outcome of neighbourhood of the node through which the information propagated.

  3. Exogenous Drive: Finally, a global process can update the node representation between successive events involving that node.

is calculated using the neighbourhood of node as:

(37)

where is the latest node embedding of node before time . denotes the weight of each structural neighbour of node which reflect the tendency of node to communicate more with associated nodes, i.e. is higher for those neighbours to which node has communicated more frequently. Although, we note that a more weightage should have been provided for neighbours among frequent communication neighbours with the latest exchanges. Anyway, for detailed calculation of, we refer to Algorithm 1 of DyRep. Finally, DyRep utilizes temporal point processes to model an event . Formally,

(38)

where signifies the most recent updated node representation of node . is a soft-plus function parameterized for each event type . Finally, DyRep utilizes following loss for training models’ parameters.

(39)

We note that the second term in the loss signifies the survival probability for events that do not happen till time . This term is computed by sampling techniques. For more details on the sampling procedure, we refer to Algorithm 2 of DyRep. Further, this approach can handle unseen nodes, i.e., transductive and inductive settings. We note that DyRep’s methodology requires datasets that contain both evolution events and communication events. We see it as a major limitation since most of the available datasets do not have association events. We now focus on methods that require only interaction events and do not majorly differentiate between association and communication events. TGAT Xu et al. (2021) is a self-attention based method similar to GAT but uses novel functional time encoding technique to encode time in an embedding space. Although previous approaches have projected time into embedding space using separate weight matrices, TGAT utilizes Bochner’s theorem to propose the following time encoding:

(40)

where is a hyperparameter and are learnable parameters. For more details on the derivation of this encoding using Bochner’s theorem, we refer to Xu et al. (2019a). Finally, TGAT defines neighbourhood of node at time as . So, TGAT defines the following temporal GAT layer at time -

(41)

TGAT extends this formulation to multi(k)-head attention by using separate for each attention head as follows:

(42)

Finally, is the final node representation of node at time using layer TGAT. This model is trained, similar to GraphSage, using link prediction or node classification loss. TGAT’s temporal GNN layer is inductive, i.e., it can predict the embedding of unseen nodes since its parameters are not node dependent. Furthermore, it allows incorporating edge features. Moreover, it can also use evolving node attributes .

TGN Rossi et al. (2020) attempts to unify the ideas proposed in previous approaches and provides a general framework for representation learning in continuous temporal graphs. This framework is inductive and consists of independent exchangeable modules. For example, there is a module for embedding time to a vector. This module is independent of the rest and is replaceable with a domain understanding-based embedding function. They also define a new event type which is node addition/updation where a node with attributes is added/updated in the temporal graph at time . TGN defines the following modules:

  • Memory: A memory vector is kept for each node . This memory stores the compressed information about the node and is updated only during a event involving the node . For a new node, its memory is initialized to 0.

  • Message Function: Following messages are computed when a interaction occurs.

    (43)

    where is corresponding edge attributes and is the time different between and last interaction time of node . In case, the event is node addition/updation , the following message is computed.

  • Message Aggregation: TGN processes interactions in a batch instead of one by one to parallelize the process. Given messages for a node , its aggregated message is calculated at time , given that the batch contains messages from to :

    (44)
  • Memory Updater: After computing messages for each node involving events in the current batch, memory is updates using . MEM can be any neural based architecture. TGN notes that recurrent neural architecture is more suitable since current memory depends upon the previous memory.

  • Embedding: This is the last module whose objective is to compute the embedding of required nodes using their neighbourhood and memory states. Specifically, TGN proposes the following general formulation to compute the embedding of a node at time .

    (45)

    where denotes the latest attributes of node , denotes the features of the latest interaction between node and . TGN remarks that TGAT layer can be utilized here or any other simpler approximation as well basis the requirement. TGN also proposes following L layer GNN based formulation which is simple and fast.

    (46)

This architecture can be trained using link-prediction or node classification tasks. TGN observes that during training, there is an issue of target variable leak when predicting links in current batch, since memory of nodes will be updated using events of this batch. To encounter this, TGN creates a message storage which stores the events occurring in last batch corresponding to each node in the graph. Now given the current batch during training, they update the memory of nodes in this batch using the events from message storage and compute loss in this batch by using link prediction objective. This loss is used to update the model’s parameter. Now, message storage is updated for nodes occurring in the current batch by replacing their events with events of the current batch.

CAW Wang et al. (2021) critiques that currently proposed methods work well in inductive setting only when there are rich node attributes available. CAW remarks that any temporal representation learning method, even in the absence of node features and identity, when trained on a temporal graph governed by certain dynamic laws, should perform well on an unseen graph governed by the same dynamic laws. For example, if a triadic closure process and feed-forward loop process 222if node a excites node b and node c, then node b and c will also link in the future govern link formation in the training graph. Then, a trained model on such data should be able to produce a similar performance on an unseen graph governed by these two laws even if node attributes are not available. Current methods rely heavily on attributes, and thus, an inductive setting cannot model these processes in the absence of node identity/attributes. CAW proposes a method that anonymizes node identity by exploiting temporal random walks. Specifically, a random walk backtracks each consecutive step, i.e. each consecutive edge in the walk has a lower timestamp than the current edge. Further, is a collection of random walks of length starting from node at time . is a size vector where each index specifies the count of node in position across every . So, given a target link at time , denotes the relative identity of node wrt to and . We note that must occur at-least in any random walk . CAW notes such representation traces the evolution of motifs in random walks without explicitly using node identity and attributes. Finally, each random walk can be represented as:

(47)

The following formulation is used to encode the target link involving nodes at time .

(48)

These target link embeddings are usable in future link prediction tasks. Finally, CAW observes that the layer can additionally incorporate node/edge attributes. We note that the CAW model is not applicable in node classification.

3.4 Deep Generative Models for Temporal Graphs

The current research in deep generative models for temporal graphs is lacking and still in a nascent stage. We briefly overview the non-neural/neural methods below.

Erdős Rényi graph model and small-world model are de-facto models for static graphs. However, similar works are not available for temporal graphs. Few works have attempted to formulate the generation process of a network to understand the spreading of diseases via contact using temporal networks. Holme (2013) has proposed an approach where they start by generating a static network, and for each link in the network, they generate an active time span and start time. Further, they sample the sequence of contact times from the inter-event time distribution. Finally, they overlay this sequence with the active time span of each link to generate an interaction network. Perra et al. (2012) introduces activity-driven temporal network generative approach. At each timestamp, they start with N disconnected nodes. Each node b becomes active with probability , also defined as an active potential, and connects with randomly chosen m nodes. At the next timestamp, the same process is repeated. Vestergaard et al. (2014) proposes a memory-based method for both node and link activation. It stores a time delta since the last interaction for every node i and for every link between node i and j. Initially, in an N-size network, all nodes are inactive. A node can become active with probability and initiate a link with inactive node j with probability . A link can de-activate with probability . Both and are control parameters. and are memory kernels of a power-law distribution. More recently, DYMOND Zeno et al. (2021) presented a non-neural, 3-node motif-based approach for the same problem. They assume that each type of motif follows a time-independent exponentially distributed arrival rate and learn the parameters to fit the observed arrival rate. TagGen Zhou et al. (2020) models temporal graphs by converting them into equivalent static graphs by merging node-IDs with their timestamps interactions (temporal edge) and connecting only those nodes in the resulting static graph that satisfy a specified temporal neighborhood constraint. They performed random walks on this transformed graph and modified them using heuristics-based local operations to generate many synthetic random walks. They further learn a discriminator to differentiate between actual random walks and synthetic random walks. Finally, the synthetic random walks classified by a discriminator as real are collected and combined to construct a synthetic static graph. Finally, they convert the sampled static graph to a temporal interaction graph by detaching node identity from time. These approaches suffer from the following limitations:

Weak Temporal Modeling: DYMOND makes two key assumptions: first, the arrival rate of motifs is exponential; and second, the structural configuration of a motif remains the same throughout the observed time horizon. These assumptions do not hold in practice – motifs may evolve with time and could arrive at time-dependent rates. This type of modeling leads to poor fidelity of structural and temporal properties of the generated graph. TagGen, on the other hand, does not model the graph evolution rate explicitly. It assumes that the timestamps in the input graph are discrete random variables, prohibiting TagGen from generating new(unseen in the source graph) timestamps. More critically, the generated graph duplicates many edges from the source graph – our experiments found up to edge overlap between the generated and the source graph. While TagGen generates graphs that exhibit high fidelity of graph structural and temporal interaction properties, unfortunately, it achieves them by generating graphs indistinguishable from the source graph due to their poor modeling of interaction times.

Poor Scalability to Large Graphs: Both TagGen and DYMOND are limited to graphs where the number of nodes is less than 10000, and the number of unique timestamps is below 200. However, real graphs are not only of much larger size but also grow with significantly high interaction frequency Paranjape et al. (2017). In such scenarios, the critical design choice of TagGen to convert the temporal graph into a static graph fails to scale over long time horizons since the number of nodes in the resulting static graph multiplies linearly with the number of timestamps. Further, TagGen also requires the computation of the inverse of a matrix, where is the number of nodes in the equivalent static graph to impute node-node similarity. This complexity adds to the quadratic increase in memory consumption and even higher cost of matrix inversion, thus making TagGen unscalable. On the other hand, DYMOND has a complexity, where is the number of nodes and is the number of timestamps.

Lack of Inductive Modelling: Inductivity allows the transfer of knowledge to unseen graphs Hamilton et al. (2017). In the context of generative graph modeling, inductive modeling is required to (1) upscale or downscale the source graph to a generated graph of a different size, and (2) prevent leakage of node-identity from the source graph. Both TagGen and DYMOND rely on one-to-one mapping from source graph node IDs to the generated graph and hence are non-inductive.

4 Proposed Work: Scalable Generative Modeling for Temporal Interaction Graphs

We now explain the problem statement, for which TiggerGupta et al. (2022) has proposed a solution.

4.1 Problem formulation

Problem 1 (Temporal Interaction Graph Generator)
Definition 1 (Temporal Interaction Graph)

A temporal interaction graph is defined as where is a set of nodes and is a set of temporal edges . is the maximum time of interaction.

Input: A temporal interaction graph .
Output: Let there be a hidden joint distribution of structural and temporal properties, from which the given has been sampled. Our goal is to learn this hidden distribution. Towards that end, we want to learn a generative model that maximizes the likelihood of generating . This generative model, in turn, can be used to generate new graphs that come from the same distribution as , but not itself.

The above problem formulation is motivated by the one-shot generative modelling paradigm, i.e., it only requires one temporal graph to learn the hidden joint distribution of structural and temporal interaction graph properties. Defining the joint distribution of temporal and structural properties is hard. In general, these properties are characterized by inter-interaction time distribution and evolution of static graph properties like degree distribution, power law exponent, number of connected components, the largest connected component, distribution of pair wise shortest distances, closeness centrality etc. Typically, a generative model optimizes over one of these properties under the assumption that the remaining properties are correlated and hence would be implicitly modelled. For example, DYMOND uses small structural motifs and TagGen uses random walks over the transformed static graph.

5 Conclusion

This survey presents a unified view of graph representation learning, temporal graph representation learning, graph generative modeling, and temporal graph generative modeling. We first introduced a taxonomy of temporal graphs and their tasks. Further, we explained static graph embedding techniques and, building over them, explained the temporal graph embedding techniques. Finally, we surveyed graph generative methods and highlighted the lack of similar research in temporal graph generative models. Inspired by this, we introduced the problem statement of our recently published paper Tigger.

References

  • [1] C. C. Aggarwal and H. Wang (2012) Managing and mining graph data. Springer Publishing Company, Incorporated. External Links: ISBN 1461425603 Cited by: §1.
  • [2] C. C. Aggarwal (2011) Social network data analytics. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 1441984615 Cited by: §1.
  • [3] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola (2013) Distributed large-scale natural graph factorization. New York, NY, USA, pp. 37–48. External Links: ISBN 9781450320351, Link, Document Cited by: §3.1.
  • [4] R. Albert and A. Barabási (2002) Statistical mechanics of complex networks. Reviews of modern physics 74 (1), pp. 47. Cited by: §3.2.
  • [5] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. pp. 214–223. Cited by: §3.2.
  • [6] C. D. T. Barros, M. R. F. Mendonça, A. B. Vieira, and A. Ziviani (2021-11) A survey on embedding dynamic graphs. ACM Comput. Surv. 55 (1). External Links: ISSN 0360-0300, Link, Document Cited by: §1.
  • [7] M. Belkin and P. Niyogi (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. pp. . External Links: Link Cited by: §3.1.
  • [8] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann (2018) NetGAN: generating graphs via random walks. pp. 609–618. Cited by: §3.2.
  • [9] F. Chen, Y. Wang, B. Wang, and C.-C. J. Kuo (2020) Graph representation learning: a survey. APSIPA Transactions on Signal and Information Processing 9 (1). External Links: Document, Link Cited by: §1.
  • [10] F. Chen, Y. Wang, B. Wang, and C.-C. J. Kuo (2020) Graph representation learning: a survey. APSIPA Transactions on Signal and Information Processing 9, pp. e15. External Links: Document Cited by: §1.
  • [11] W. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C. Hsieh (2019-07) Cluster-gcn. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. External Links: ISBN 9781450362016, Link, Document Cited by: §3.1.2.
  • [12] D. J. Cook and L. B. Holder (2006) Mining graph data. John Wiley &; Sons, Inc., Hoboken, NJ, USA. External Links: ISBN 0471731900 Cited by: §1.
  • [13] N. De Cao and T. Kipf (2018) MolGAN: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models. Cited by: §3.2, §3.2.
  • [14] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. . External Links: Link Cited by: §1.
  • [15] F. Faez, Y. Ommi, M. S. Baghshah, and H. R. Rabiee (2020) Deep graph generators: a survey. arXiv. External Links: Document, Link Cited by: §1.
  • [16] M. Färber, F. Bartscherer, C. Menne, and A. Rettinger (2018) Linked data quality of DBpedia, Freebase, Opencyc, Wikidata, and YAGO. Semantic Web 9 (1), pp. 77–129. External Links: Link, Document Cited by: §1.
  • [17] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §1.
  • [18] B. J. Frey and D. Dueck (2007) Clustering by passing messages between data points. Science 315 (5814), pp. 972–976. External Links: Document, ISSN 0036-8075, Link, https://science.sciencemag.org/content/315/5814/972.full.pdf Cited by: §2.2.
  • [19] N. Goyal, H. V. Jain, and S. Ranu (2020) GraphGen: a scalable approach to domain-agnostic labeled graph generation. New York, NY, USA, pp. 1253–1263. External Links: ISBN 9781450370233, Link, Document Cited by: §1, §3.2, §3.2.
  • [20] P. Goyal, N. Kamra, X. He, and Y. Liu (2018) DynGEM: deep embedding method for dynamic graphs. ArXiv abs/1805.11273. Cited by: §3.3.1, §3.3.
  • [21] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (25), pp. 723–773. External Links: Link Cited by: 3rd item.
  • [22] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. New York, NY, USA, pp. 855–864. External Links: ISBN 9781450342322, Link, Document Cited by: §3.1.1.
  • [23] X. Guo and L. Zhao (2020) A systematic survey on deep generative models for graph generation. arXiv. External Links: Document, Link Cited by: §1.
  • [24] S. Gupta, S. Manchanda, S. Bedathur, and S. Ranu (2022-Jun.) Tigger: scalable generative modelling for temporal interaction graphs. Proceedings of the AAAI Conference on Artificial Intelligence 36 (6), pp. 6819–6828. External Links: Link, Document Cited by: A Survey on Temporal Graph Representation Learning and Generative Modeling, 5th item, §4.
  • [25] W. L. Hamilton, J. Leskovec, and D. Jurafsky (2016-08) Diachronic word embeddings reveal statistical laws of semantic change. Berlin, Germany, pp. 1489–1501. External Links: Link, Document Cited by: §3.3.1.
  • [26] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 1025–1035. External Links: ISBN 9781510860964 Cited by: §1, §2.3, §3.1.2, §3.4.
  • [27] A. Himmel, H. Molter, R. Niedermeier, and M. Sorge (2016) Enumerating maximal cliques in temporal graphs. CoRR abs/1605.03871. External Links: Link, 1605.03871 Cited by: §1.
  • [28] P. Holme (2013-07) Epidemiologically optimal static networks from temporal network data. PLOS Computational Biology 9 (7), pp. 1–10. External Links: Document, Link Cited by: §3.4.
  • [29] P. Holme (2015-09) Modern temporal network theory: a colloquium. The European Physical Journal B 88 (9). External Links: ISSN 1434-6036, Link, Document Cited by: §1.
  • [30] S. Huang, A. W. Fu, and R. Liu (2015) Minimum spanning trees in temporal graphs. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, New York, NY, USA, pp. 419–430. External Links: ISBN 9781450327589, Link, Document Cited by: §1.
  • [31] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. External Links: Link Cited by: §3.2.
  • [32] M. Karoński and A. Ruciński (1997) The origins of the theory of random graphs. In The Mathematics of Paul Erdös I, pp. 311–336. External Links: ISBN 978-3-642-60408-9, Document, Link Cited by: §3.2.
  • [33] S. M. Kazemi, R. Goel, K. Jain, I. Kobyzev, A. Sethi, P. Forsyth, and P. Poupart (2020) Representation learning for dynamic graphs: a survey.. J. Mach. Learn. Res. 21 (70), pp. 1–73. Cited by: §1.
  • [34] S. Khoshraftar and A. An (2022) A survey on graph representation learning methods. arXiv. External Links: Document, Link Cited by: §1.
  • [35] S. Khoshraftar and A. An (2022) A survey on graph representation learning methods. arXiv. External Links: Document, Link Cited by: §1.
  • [36] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §1, §3.1.2.
  • [37] I. Konstas, V. Stathopoulos, and J. M. Jose (2009) On social networks and collaborative recommendation. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, New York, NY, USA, pp. 195–202. External Links: ISBN 9781605584836, Link, Document Cited by: §1.
  • [38] S. Kumar, X. Zhang, and J. Leskovec (2019) Predicting dynamic embedding trajectory in temporal interaction networks. pp. 1269–1278. Cited by: §1, §2.2, §2.3, §3.3.2, §3.3.2, §3.3.
  • [39] A. Leman and B. Weisfeiler (1968) A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsiya 2 (9), pp. 12–16. Cited by: §3.1.2.
  • [40] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. W. Battaglia (2018) Learning deep generative models of graphs. ArXiv abs/1803.03324. Cited by: §3.2, §3.2.
  • [41] R. Liao, Y. Li, Y. Song, S. Wang, W. L. Hamilton, D. Duvenaud, R. Urtasun, and R. S. Zemel (2019) Efficient graph generation with graph recurrent attention networks. pp. 4257–4267. External Links: Link Cited by: §1.
  • [42] R. Liao, Y. Li, Y. Song, S. Wang, C. Nash, W. L. Hamilton, D. Duvenaud, R. Urtasun, and R. Zemel (2019) Efficient graph generation with graph recurrent attention networks. Cited by: §3.2, §3.2.
  • [43] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. External Links: Link Cited by: §3.2.
  • [44] Z. Liu, D. Zhou, Y. Zhu, J. Gu, and J. He (2020) Towards fine-grained temporal network representation via time-reinforced random walk. Cited by: §3.3.2, §3.3.2, §3.3.
  • [45] Y. Lu, X. Wang, C. Shi, P. S. Yu, and Y. Ye (2019) Temporal network embedding with micro- and macro-dynamics. New York, NY, USA, pp. 469–478. External Links: ISBN 9781450369763, Link, Document Cited by: §3.3.2, §3.3.2, §3.3.
  • [46] O. Michail and P. G. Spirakis (2016) Traveling salesman problems in temporal graphs. Theoretical Computer Science 634, pp. 1–23. External Links: ISSN 0304-3975, Document, Link Cited by: §1.
  • [47] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. Cited by: §3.1.1, §3.3.2.
  • [48] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. pp. . External Links: Link Cited by: §3.1.1, §3.1.2.
  • [49] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim (2018) Continuous-time dynamic network embeddings. In Companion Proceedings of the The Web Conference 2018Proceedings of the 13th International Conference on Web Search and Data MiningProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, CanadaProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery; Data MiningProceedings of the Web Conference 2021Proceedings of the 22nd International Conference on World Wide WebAdvances in Neural Information Processing SystemsProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningProceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data MiningICLRProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningProceedings of the 24th International Conference on World Wide WebProceedings of the 30th International Conference on Machine LearningAdvances in Neural Information Processing SystemsProceedings of the 31st International Conference on Neural Information Processing Systems7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USAThe Semantic WebNatureProceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018NeurIPSICANNICMLProceedings of The Web Conference 20205th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track ProceedingsProceedings of the 34th International Conference on Machine Learning - Volume 704th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track ProceedingsAAAIAAAIProceedings of the Thirty-Fourth AAAI Conference on Artificial IntelligenceIJCAIProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery; Data MiningProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)Advances in Neural Information Processing SystemsProceedings of the 28th ACM International Conference on Information and Knowledge ManagementProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019ICML 2020 Workshop on Graph Representation LearningProceedings of The Web Conference 20209th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021Proceedings of the 2017 ACM on Conference on Information and Knowledge Management2014 IEEE International Conference on Image Processing (ICIP)Advances in Neural Information Processing SystemsProceedings of the 14th ACM International Conference on Web Search and Data MiningProceedings of Neural Information Processing Systems, NeurIPSAdvances in Neural Information Processing SystemsProceedings of the 34th International Conference on Machine Learning, J. G. Dy, A. Krause, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, R. Garnett, T. Dietterich, S. Becker, Z. Ghahramani, S. Dasgupta, D. McAllester, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger, K. Chaudhuri, R. Salakhutdinov, A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, M. Alam, Y. Bengio, Y. LeCun, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin, D. Precup, and Y. W. Teh (Eds.), WWW ’18WSDM ’20KDD ’16Proceedings of Machine Learning ResearchKDD ’20WWW ’21WWW ’13KDD ’16KDD ’14KDD ’16WWW ’15Proceedings of Machine Learning ResearchNIPS’17Proceedings of Machine Learning ResearchWWW ’20ICML’17KDD ’18CIKM ’19WWW ’20CIKM ’17WSDM ’21Proceedings of Machine Learning Research, Vol. 8014282697303370, Republic and Canton of Geneva, CHE. External Links: ISBN 9781450356404, Link, Document Cited by: §2.3.
  • [50] C. C. Noble and D. J. Cook (2003) Graph-based anomaly detection. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, New York, NY, USA, pp. 631–636. External Links: ISBN 1581137370, Link, Document Cited by: §1.
  • [51] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu (2016) Asymmetric transitivity preserving graph embedding. New York, NY, USA, pp. 1105–1114. External Links: ISBN 9781450342322, Link, Document Cited by: §3.1.
  • [52] A. Paranjape, A. R. Benson, and J. Leskovec (2017-02) Motifs in temporal networks. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. External Links: ISBN 9781450346757, Link, Document Cited by: §3.4.
  • [53] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. B. Schardl, and C. E. Leiserson (2020) EvolveGCN: evolving graph convolutional networks for dynamic graphs. Cited by: §3.3.1, §3.3.
  • [54] Y. Peng, B. Choi, and J. Xu (2021-06) Graph Learning for Combinatorial Optimization: A Survey of State-of-the-Art. Data Sci. Eng. 6 (2), pp. 119–141. External Links: ISSN 2364-1541, Document Cited by: §1.
  • [55] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. New York, NY, USA, pp. 701–710. External Links: ISBN 978-1-4503-2956-9, Link, Document Cited by: §3.1.1.
  • [56] N. Perra, B. Gonçalves, and R. Pastor-Satorras (2012) Activity driven modeling of time varying networks. Sci Rep 2, pp. 469 (en). External Links: Link, Document Cited by: §3.4.
  • [57] J. G. Rasmussen (2018) Lecture notes: temporal point processes and the conditional intensity function. arXiv: Methodology. Cited by: §2.3.2.
  • [58] M. Rizoiu, Y. Lee, S. Mishra, and L. Xie (2017) A tutorial on hawkes processes for events in social media. CoRR abs/1708.06401. External Links: Link, 1708.06401 Cited by: §3.3.2.
  • [59] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein (2020) Temporal graph networks for deep learning on dynamic graphs. Cited by: §1, §2.3, §3.3.2, §3.3.2, §3.3.
  • [60] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang (2020) DySAT: deep neural representation learning on dynamic graphs via self-attention networks. New York, NY, USA, pp. 519–527. External Links: ISBN 9781450368223, Link, Document Cited by: §3.3.1, §3.3.1, §3.3.
  • [61] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. Cham, pp. 593–607. External Links: ISBN 978-3-319-93417-4 Cited by: §3.1.2.
  • [62] S. Shehnepoor, R. Togneri, W. Liu, and M. Bennamoun (2022) Spatio-temporal graph representation learning for fraudster group detection. arXiv. External Links: Document, Link Cited by: §1.
  • [63] U. Singer, I. Guy, and K. Radinsky (2019) Node embedding over temporal graphs. Cited by: §3.3.1, §3.3.
  • [64] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013-17–19 Jun) On the importance of initialization and momentum in deep learning. Atlanta, Georgia, USA, pp. 1139–1147. External Links: Link Cited by: §3.1.1.
  • [65] R. Trivedi, H. Dai, Y. Wang, and L. Song (2017) Know-evolve: deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 3462–3471. Cited by: §1.
  • [66] R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha (2019) DyRep: learning representations over dynamic graphs. External Links: Link Cited by: §1, §2.1.1, §2.3.2, §2.3.3, §2.3, §2.3, §3.3.2, §3.3.2, §3.3.
  • [67] Y. Tsai and M. Yang (2014-10) Locality preserving hashing. pp. 2988–2992. External Links: Document, ISSN 2381-8549 Cited by: §3.3.2.
  • [68] T. Tylenda, R. Angelova, and S. Bedathur (2009) Towards time-aware link prediction in evolving social networks. In Proceedings of the 3rd Workshop on Social Network Mining and Analysis, SNA-KDD ’09, New York, NY, USA. External Links: ISBN 9781605586762, Link, Document Cited by: §1.
  • [69] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. pp. . External Links: Link Cited by: 2nd item.
  • [70] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph Attention Networks. International Conference on Learning Representations. Note: accepted as poster External Links: Link Cited by: §1, §3.1.2.
  • [71] C. L. Vestergaard, M. Génois, and A. Barrat (2014-10) How memory generates heterogeneous dynamics in temporal networks. Physical Review E 90 (4). External Links: ISSN 1550-2376, Link, Document Cited by: §3.4.
  • [72] D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. New York, NY, USA, pp. 1225–1234. External Links: ISBN 9781450342322, Link, Document Cited by: §2.3.2.
  • [73] Y. Wang, Y. Chang, Y. Liu, J. Leskovec, and P. Li (2021) Inductive representation learning in temporal networks via causal anonymous walks. CoRR abs/2101.05974. External Links: Link, 2101.05974 Cited by: §3.3.2, §3.3.2, §3.3.
  • [74] S. S. Watts DJ (1998) Collective dynamics of ’small-world’ networks. Cited by: §3.2.
  • [75] H. Wu, J. Cheng, Y. Lu, Y. Ke, Y. Huang, D. Yan, and H. Wu (2015) Core decomposition in large temporal graphs. In 2015 IEEE International Conference on Big Data (Big Data), Vol. , pp. 649–658. External Links: Document Cited by: §1.
  • [76] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2021-01) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32 (1), pp. 4–24 (English). External Links: Document, ISSN 2162-237X Cited by: §1.
  • [77] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan (2019) Self-attention with functional time representation learning. pp. 15889–15899. Cited by: §3.3.2.
  • [78] D. Xu, C. Ruan, E. Körpeoglu, S. Kumar, and K. Achan (2021) A temporal kernel approach for deep learning with continuous-time information. External Links: Link Cited by: §1, §2.3, §3.3.2, §3.3.2, §3.3.
  • [79] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. External Links: Link Cited by: §3.1.2.
  • [80] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. External Links: Link Cited by: §1.
  • [81] J. You, R. Ying, and J. Leskovec (2019) Position-aware graph neural networks. pp. 7134–7143. External Links: Link Cited by: §1, §3.1.2.
  • [82] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec (2018) GraphRNN: generating realistic graphs with deep auto-regressive models. Cited by: §3.2, §3.2.
  • [83] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec (2018) GraphRNN: generating realistic graphs with deep auto-regressive models. pp. 5694–5703. External Links: Link Cited by: §1.
  • [84] G. Zeno, T. La Fond, and J. Neville (2021) DYMOND: dynamic motif-nodes network generative model. New York, NY, USA, pp. 718–729. External Links: ISBN 9781450383127, Link, Document Cited by: §1, §3.4.
  • [85] Y. Zhang, Y. Xiong, X. Kong, and Y. Zhu (2017) Learning node embeddings in interaction graphs. New York, NY, USA, pp. 397–406. External Links: ISBN 9781450349185, Link, Document Cited by: §2.3, §3.3.2, §3.3.2, §3.3.
  • [86] Z. Zhang, J. Bu, M. Ester, J. Zhang, C. Yao, Z. Li, and C. Wang (2020) Learning temporal interaction graph embedding via coupled memory networks. New York, NY, USA, pp. 3049–3055. External Links: ISBN 9781450370233, Link, Document Cited by: §1, §2.3, §3.3.2, §3.3.2, §3.3.
  • [87] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, and H. Li (2020) T-gcn: a temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 21 (9), pp. 3848–3858. External Links: Document Cited by: §1.
  • [88] D. Zhou, L. Zheng, J. Han, and J. He (2020) A data-driven graph generative model for temporal interaction networks. New York, NY, USA, pp. 401–411. External Links: ISBN 9781450379984, Link, Document Cited by: §1, §3.4.
  • [89] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang (2018) Dynamic network embedding by modeling triadic closure process. Cited by: §3.3.1, §3.3.
  • [90] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra (2020) Beyond homophily in graph neural networks: current limitations and effective designs. arXiv: Learning. Cited by: §3.1.2.
  • [91] Y. Zuo, G. Liu, H. Lin, J. Guo, X. Hu, and J. Wu (2018) Embedding temporal network via neighborhood formation. New York, NY, USA, pp. 2857–2866. External Links: ISBN 9781450355520, Link, Document Cited by: §3.3.2, §3.3.