This paper studies the problem of embedding very large information networks into low-dimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. Most existing graph embedding methods do not scale to real-world information networks, which usually contain millions of nodes. In this paper, we propose a novel network embedding method called "LINE," which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and the global network structure. An edge-sampling algorithm is proposed that addresses a limitation of classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference. Empirical experiments demonstrate the effectiveness of LINE on a variety of real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient: it is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. The source code of LINE is available online.
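As a hedged illustration of the objective described above, the sketch below shows one stochastic update of a second-order-proximity model with negative sampling, applied to an edge drawn with probability proportional to its weight so that each sampled edge can be treated as binary. The function and variable names are our own, and the learning rate and number of negatives are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_step(emb, ctx, u, v, neg_vertices, lr=0.025):
    """One SGD step on a sampled edge (u, v): pull v's context vector towards
    u's embedding and push the contexts of sampled noise vertices away."""
    grad_u = np.zeros_like(emb[u])
    g = lr * (1.0 - sigmoid(emb[u] @ ctx[v]))      # positive example
    grad_u += g * ctx[v]
    ctx[v] += g * emb[u]
    for n in neg_vertices:                          # negative examples
        g = -lr * sigmoid(emb[u] @ ctx[n])
        grad_u += g * ctx[n]
        ctx[n] += g * emb[u]
    emb[u] += grad_u
```

In a fuller implementation, edges would typically be drawn in O(1) per sample via an alias table and negatives from a degree-based noise distribution; both are common choices here rather than details stated in the abstract.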
We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advances in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk's latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines that are given a full view of the network, especially in the presence of missing information. DeepWalk's representations can provide $F_1$ scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm that builds useful incremental results and is trivially parallelizable. These qualities make it suitable for a broad class of real-world applications such as network classification and anomaly detection.
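As a hedged sketch of the walk-generation step described above, the snippet below produces truncated random walks that can be fed to a skip-gram model exactly as if they were sentences. The walk count, walk length, and window size are common defaults used here for illustration, not parameters quoted from the abstract.

```python
import random
import networkx as nx

def deepwalk_corpus(G, num_walks=10, walk_length=40, seed=0):
    """Generate truncated random walks; each walk plays the role of a sentence."""
    rng = random.Random(seed)
    walks, nodes = [], list(G.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(v) for v in walk])
    return walks

# The corpus is then passed to a skip-gram trainer, e.g. (assuming gensim >= 4.0):
# from gensim.models import Word2Vec
# model = Word2Vec(walks, vector_size=128, window=5, sg=1, min_count=0)
```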
Many Network Representation Learning (NRL) methods have recently been proposed to learn vector representations for vertices in a network. In this paper, we summarize most existing NRL methods into a unified two-step framework consisting of proximity matrix construction and dimension reduction. We focus on the analysis of the proximity matrix construction step and conclude that an NRL method can be improved by exploring higher-order proximities when building the proximity matrix. We propose the Network Embedding Update (NEU) algorithm, which implicitly approximates higher-order proximities with a theoretical approximation bound and can be applied to any NRL method to enhance its performance. We conduct experiments on multi-label classification and link prediction tasks. Experimental results show that NEU yields consistent and significant improvements over a number of NRL methods with almost negligible additional running time on all three publicly available datasets. The source code of this paper can be obtained from https://github.com/thunlp/NEU.
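A hedged sketch of the kind of update NEU performs is shown below: the learned embedding matrix is enriched with terms that mix in one- and two-step neighborhood information through the normalized adjacency matrix, which implicitly injects higher-order proximities. The weighting coefficients here are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def neu_update(A_norm, R, lam1=0.5, lam2=0.25):
    """Enhance embeddings R (N x d) with higher-order proximity information.

    A_norm is the row-normalized adjacency matrix (N x N).  Each application
    mixes in one- and two-step neighborhood averages of the current embeddings,
    so repeated applications approximate increasingly high-order proximities.
    """
    AR = A_norm @ R               # one-step neighborhood average
    return R + lam1 * AR + lam2 * (A_norm @ AR)
```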
Embedding a web-scale information network into a low-dimensional vector space facilitates tasks such as link prediction, classification, and visualization. Past research has addressed the problem of extracting such embeddings by adapting methods designed for words to graphs, without defining a clearly comprehensible graph-related objective. Yet, as we show, the objectives used in past works implicitly utilize similarity measures among graph nodes. In this paper, we carry the similarity orientation of previous works to its logical conclusion; we propose VERtex Similarity Embeddings (VERSE), a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network. While its default, scalable version does so by sampling similarity information, we also develop a variant that uses the full similarity information per vertex. Our experimental study on standard benchmarks and real-world datasets demonstrates that VERSE, instantiated with diverse similarity measures, outperforms state-of-the-art methods in terms of precision and recall on major data mining tasks and supersedes them in time and space efficiency, while the scalable sampling-based variant achieves results as good as those of the non-scalable full variant.
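To make the objective concrete, the hedged sketch below computes the loss of the full (non-sampled) variant under one common instantiation: for each vertex, the KL divergence between a chosen similarity distribution over all other vertices (for example, a personalized-PageRank row) and the softmax of embedding dot products. The similarity choice and the dense formulation are assumptions made for illustration; the scalable version would instead sample from these distributions.

```python
import numpy as np

def verse_full_loss(W, S, eps=1e-12):
    """KL(S || softmax(W W^T)) summed over vertices.

    W: N x d embedding matrix.
    S: N x N matrix whose rows are the target similarity distributions
       (each row sums to 1, e.g. personalized PageRank from that vertex).
    """
    logits = W @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return float(np.sum(S * (np.log(S + eps) - np.log(P + eps))))
```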
Network embedding aims to learn latent, low-dimensional vector representations of network nodes that effectively support various network analysis tasks. While existing network embedding techniques mainly focus on preserving the network topology to learn node representations, recently proposed attributed network embedding algorithms attempt to combine rich node content information with the network topology to improve the quality of the embedding. In practice, networks often have sparse content, incomplete node attributes, and a discrepancy between the node attribute feature space and the network structure space, which severely degrades the performance of existing methods. In this paper, we propose a unified framework for attributed network embedding, attri2vec, which learns node embeddings by discovering a latent node attribute subspace via a network-structure-guided transformation performed on the original attribute space. The resulting latent subspace can respect the network structure in a more consistent way, yielding high-quality node representations. We formulate an optimization problem that is solved by an efficient stochastic gradient descent algorithm with time complexity linear in the number of nodes. We study a series of linear and non-linear transformations performed on node attributes and verify their effectiveness on various types of networks. Another advantage of attri2vec is its ability to handle the out-of-sample problem: new nodes can be embedded from their attributes through the learned mapping function. Experiments on various types of networks confirm that attri2vec outperforms state-of-the-art baselines on node classification, node clustering, and out-of-sample link prediction tasks. The source code of this paper is available at https://github.com/daokunzhang/attri2vec.
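The mapping described above can be summarized, in a heavily hedged form, as transforming raw attributes into a structure-respecting subspace and reusing the same mapping for unseen nodes. The function below is an illustrative sketch of that idea only; the choice of a single linear layer with a tanh non-linearity is our assumption, and the parameters W would be trained so that the transformed attributes predict each node's structural context.

```python
import numpy as np

def attri2vec_map(X, W, activation=np.tanh):
    """Map raw node attributes X (N x m) into a d-dimensional latent subspace
    with parameters W (m x d).  Because the mapping depends only on attributes,
    the same function can embed out-of-sample nodes that never appeared in the
    training network."""
    return activation(X @ W)
```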
Network representation learning (NRL) methods aim to map each vertex into a low-dimensional space by preserving the local and global structure of a given network, and they have received significant attention in recent years thanks to their success in several challenging problems. Although various approaches have been proposed to compute node embeddings, many successful methods benefit from random walks, which transform a given network into a collection of node sequences, and then learn node representations by predicting the context of each vertex within the sequences. In this paper, we introduce a general framework to enhance the embeddings of nodes obtained by random-walk-based methods. Similar to the notion of topical word embeddings in NLP, the proposed method assigns each vertex to a topic with the help of various statistical models and community detection methods, and then generates enhanced, community-aware representations. We evaluate our method on two downstream tasks: node classification and link prediction. Experimental results demonstrate that the combination of vertex and topic embeddings outperforms widely known baseline NRL methods.
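A hedged sketch of the enhancement step follows: once each vertex has been assigned to a topic or community, its final representation is the concatenation of its own embedding with the embedding of its community. The function and argument names are illustrative, and concatenation is one plausible reading of how the two embeddings are combined.

```python
import numpy as np

def community_enhanced_embeddings(node_emb, community_emb, membership):
    """Concatenate each node's embedding (N x d1) with the embedding of the
    community it belongs to (K x d2), given a membership array of length N
    holding each node's community index."""
    return np.hstack([node_emb, community_emb[membership]])
```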
Network embedding is an important method for learning low-dimensional representations of vertices in networks, aiming to capture and preserve the network structure. Almost all existing network embedding methods adopt shallow models. However, since the underlying network structure is complex, shallow models cannot capture the highly non-linear network structure, resulting in sub-optimal network representations. Therefore, how to find a method that can effectively capture the highly non-linear network structure and preserve the global and local structure is an open yet important problem. To solve this problem, in this paper we propose a Structural Deep Network Embedding method, namely SDNE. More specifically, we first propose a semi-supervised deep model, which has multiple layers of non-linear functions and is thereby able to capture the highly non-linear network structure. Then we propose to exploit the first-order and second-order proximity jointly to preserve the network structure. The second-order proximity is used by the unsupervised component to capture the global network structure, while the first-order proximity is used as supervised information in the supervised component to preserve the local network structure. By jointly optimizing them in the semi-supervised deep model, our method can preserve both the local and global network structure and is robust to sparse networks. Empirically, we conduct experiments on five real-world networks, including a language network, a citation network, and three social networks. The results show that, compared to the baselines, our method can reconstruct the original network significantly better and achieves substantial gains in three applications, i.e., multi-label classification, link prediction, and visualization.
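The hedged PyTorch sketch below illustrates the structure described above: a deep autoencoder whose reconstruction of adjacency rows provides the unsupervised, second-order term, plus a supervised Laplacian term that pulls the hidden representations of connected vertices together. Layer sizes and the weighting hyperparameters are placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """Encoder-decoder over adjacency rows; the bottleneck y is the embedding."""
    def __init__(self, n_nodes, hidden=256, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_nodes, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, n_nodes), nn.Sigmoid())

    def forward(self, x):
        y = self.encoder(x)
        return y, self.decoder(y)

def semi_supervised_loss(x, x_hat, y, adj_batch, beta=5.0, alpha=1e-4):
    # Second-order (unsupervised): reconstruct adjacency rows, penalising errors
    # on observed edges more heavily than on zero entries.
    b = torch.ones_like(x)
    b[x > 0] = beta
    loss_2nd = torch.sum(((x_hat - x) * b) ** 2)
    # First-order (supervised): connected vertices get close embeddings,
    # expressed through the graph Laplacian of the batch adjacency matrix.
    lap = torch.diag(adj_batch.sum(dim=1)) - adj_batch
    loss_1st = 2.0 * torch.trace(y.t() @ lap @ y)
    return loss_2nd + alpha * loss_1st
```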
Network embedding algorithms are able to learn latent feature representations of nodes, transforming networks into lower-dimensional vector representations. Typical key applications that have been effectively addressed using network embeddings include link prediction, multi-label classification, and community detection. In this paper, we present BiasedWalk, a scalable, unsupervised feature learning algorithm based on biased random walks that sample context information about each node in the network. Our random-walk-based sampling can behave like breadth-first search (BFS) and depth-first search (DFS) sampling, with the goal of capturing both homophily and role equivalence between the nodes of the network. We have performed a detailed experimental evaluation, comparing the performance of the proposed algorithm against various baseline methods on several datasets and learning tasks. The experimental results demonstrate that the proposed method outperforms the baselines on most of the tasks and datasets.
With the widespread use of information technologies, information networks are becoming increasingly popular for capturing complex relationships across various disciplines, such as social networks, citation networks, telecommunication networks, and biological networks. Analyzing these networks sheds light on different aspects of social life, such as social structure, information diffusion, and communication patterns. In reality, however, the large scale of information networks often makes network analysis tasks computationally expensive or intractable. Network representation learning has recently been proposed as a new learning paradigm that embeds network vertices into a low-dimensional vector space while preserving the network topology, vertex content, and other side information. This allows the original network to be easily handled in the new vector space for further analysis. In this survey, we perform a comprehensive review of the current literature on network representation learning in the fields of data mining and machine learning. We propose a new taxonomy to categorize and summarize existing network representation learning techniques according to the underlying learning mechanisms, the network information intended to be preserved, and the algorithmic designs and methodologies. We summarize the evaluation protocols used for validating network representation learning, including published benchmark datasets, evaluation methods, and open-source algorithms. We also perform an empirical study to compare the performance of representative algorithms on common datasets and analyze their computational complexity. Finally, we suggest promising research directions to facilitate future work.
Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure. Recently, significant progress has been made toward this emerging network analysis paradigm. In this survey, we focus on categorizing and then reviewing the current development of network embedding methods, and we point out future research directions. We first summarize the motivation of network embedding. We discuss classical graph embedding algorithms and their relationship with network embedding. Afterwards, and primarily, we provide a comprehensive overview of a large number of network embedding methods in a systematic manner, covering structure- and property-preserving network embedding methods, network embedding methods with side information, and advanced information-preserving network embedding methods. Moreover, several evaluation approaches for network embedding and some useful online resources, including network datasets and software, are also reviewed. Finally, we discuss the framework of exploiting these network embedding methods to build an effective system and point out some potential future directions.
Graphs are a natural abstraction for many problems where nodes represent entities and edges represent relationships between entities. An important research area that has emerged over the past decade is the use of graphs as a vehicle for non-linear dimensionality reduction, in a manner similar to earlier manifold-learning approaches, for downstream data processing, machine learning, and visualization. In this systematic and comprehensive experimental survey, we benchmark several popular network representation learning methods on two key tasks: link prediction and node classification. We examine the performance of 12 unsupervised embedding methods on 15 datasets. To the best of our knowledge, the scale of our study, both in terms of the number of methods and the number of datasets, is the largest to date. Our results reveal several key insights about work in this area so far. First, we find that certain baseline methods (task-specific heuristics as well as classical manifold methods), which are often dismissed or not considered by prior efforts, can be competitive on certain types of datasets if they are appropriately tuned. Second, we find that recent methods based on matrix factorization offer a small but relatively consistent advantage over alternatives (e.g., random-walk-based methods) from a qualitative standpoint. Specifically, we find that MNMF, a community-preserving embedding method, is the most competitive method for the link prediction task, while NetMF is the most competitive baseline for node classification. Third, no single method completely outperforms the other embedding methods on both the node classification and link prediction tasks. We also provide several in-depth analyses that reveal the settings under which certain algorithms perform well (e.g., the role of neighborhood context on performance), to guide users.
Network embedding (or graph embedding) has been widely used in many real-world applications. However, existing methods mainly focus on networks with single-typed nodes/edges and cannot scale well to large networks. Many real-world networks consist of billions of nodes and edges of multiple types, and each node is associated with different attributes. In this paper, we formalize the problem of embedding learning for Attributed Multiplex Heterogeneous Networks and propose a unified framework to address it. The framework supports both transductive and inductive learning. We also provide a theoretical analysis of the proposed framework, showing its connections to previous work and demonstrating its better generalization ability. We conduct systematic evaluations of the proposed framework on four different types of challenging datasets: Amazon, YouTube, Twitter, and Alibaba. Experimental results demonstrate that, with the embeddings learned by the proposed framework, we can achieve statistically significant improvements (e.g., a 5.99-28.23% lift in F1 score; p << 0.01, t-test) over previous state-of-the-art methods for link prediction. The framework has also been successfully deployed in the recommendation system of Alibaba, a world-leading e-commerce company. Results of offline A/B tests on product recommendation further confirm the effectiveness and efficiency of the framework in practice.
Heterogeneous information network embedding aims to embed heterogeneous information networks (HINs) into low-dimensional spaces, in which each vertex is represented as a low-dimensional vector, and both the global and local network structures of the original space are preserved. However, most existing heterogeneous information network embedding models adopt the dot product to measure proximity in the low-dimensional space, and thus they can only preserve the first-order proximity and are insufficient to capture the global structure. Compared with homogeneous information networks, there are multiple types of links (i.e., multiple relations) in HINs, and the link distribution with respect to relations is highly skewed. To address these challenging issues, we propose a novel heterogeneous information network embedding model, PME, based on metric learning to capture both first-order and second-order proximities in a unified way. To alleviate the potential geometrical inflexibility of existing metric learning approaches, we propose to build object and relation embeddings in separate object and relation spaces rather than in a common space. Afterwards, we learn embeddings by first projecting vertices from the object space to the corresponding relation space and then calculating the proximity between the projected vertices. To overcome the heavy skewness of the link distribution with respect to relations and avoid "over-sampling" or "under-sampling" for each relation, we propose a novel loss-aware adaptive sampling approach for the model optimization. Extensive experiments have been conducted on a large-scale HIN dataset, and the experimental results show the superiority of our proposed PME model in terms of prediction accuracy and scalability.
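The projection-then-distance idea can be illustrated with the hedged snippet below: both endpoint embeddings are projected into the space of the relation connecting them, and their proximity is a weighted Euclidean distance in that space, so each relation gets its own geometry. The use of a single projection matrix per relation and the edge-weight factor are our reading of the abstract, not details quoted from the paper.

```python
import numpy as np

def relation_space_distance(x_u, x_v, M_r, w_uv=1.0):
    """Distance between objects u and v under relation r.

    x_u, x_v: object-space embeddings (d,); M_r: projection matrix of relation r
    (k x d); w_uv: optional edge weight.  A smaller value means the typed edge
    (u, r, v) is considered more plausible."""
    return w_uv * np.linalg.norm(M_r @ x_u - M_r @ x_v)
```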
Topological information is essential for studying the relationships between nodes in a network. Recently, Network Representation Learning (NRL), which projects a network into a low-dimensional vector space, has shown advantages in analyzing large-scale networks. However, most existing NRL methods are designed to preserve the local topology of a network but fail to capture its global topology. To tackle this issue, we propose a new NRL framework, named HSRL, that helps existing NRL methods capture both the local and global topological information of a network. Specifically, HSRL recursively compresses an input network into a series of smaller networks using a community-aware compression strategy. Then, an existing NRL method is used to learn node embeddings for each compressed network. Finally, the node embeddings of the input network are obtained by concatenating the node embeddings from all compressed networks. Empirical studies on link prediction over five real-world datasets demonstrate the advantages of HSRL over state-of-the-art methods.
Recently, unsupervised network representation learning (UNRL) approaches have made considerable progress on graphs through flexible random-walk strategies, new optimization objectives, and deep architectures. However, there is no common ground for a systematic comparison of embeddings to understand their behavior on different graphs and tasks. In this paper, we theoretically group different approaches under a unifying framework and empirically investigate the effectiveness of different network representation methods. In particular, we argue that most UNRL approaches either explicitly or implicitly model the neighborhood context information of a node. Consequently, we propose a framework that casts a variety of approaches based on random walks, matrix factorization, and deep learning into a unified, context-based optimization function. We systematically group the methods based on their similarities and differences. We study the differences among these methods in detail and use them to explain their performance differences on downstream tasks. We conduct a large-scale empirical study considering nine popular and recent UNRL techniques and eleven real-world datasets with varying structural properties, on two common tasks: node classification and link prediction. We find that no single method is a clear winner and that the choice of a suitable method depends on certain properties of the embedding methods, the task, and the structural properties of the underlying graph. In addition, we report common pitfalls in evaluating UNRL methods and make recommendations for experimental design and interpretation of results.
Graphs, such as social networks, word co-occurrence networks, and communication networks, occur naturally in various real-world applications. Analyzing them yields insight into the structure of society, language, and different patterns of communication. Many approaches have been proposed to perform the analysis. Recently, methods which use the representation of graph nodes in vector space have gained traction from the research community. In this survey, we provide a comprehensive and structured analysis of various graph embedding techniques proposed in the literature. We first introduce the embedding task and its challenges such as scalability, choice of dimensionality, and features to be preserved, and their possible solutions. We then present three categories of approaches based on factorization methods, random walks, and deep learning, with examples of representative algorithms in each category and analysis of their performance on various tasks. We evaluate these state-of-the-art methods on a few common datasets and compare their performance against one another. Our analysis concludes by suggesting some potential applications and future directions. We finally present the open-source Python library we developed, named GEM (Graph Embedding Methods, available at https://github.com/palash1992/GEM), which provides all presented algorithms within a unified interface to foster and facilitate research on the topic.
Recent interest in graph embedding methods has focused on learning a single representation for each node in a graph. But can a node really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the node's role in a different local community in which the node participates. These representations allow for improved reconstruction of the nuanced relationships that occur in graphs; we illustrate this phenomenon through state-of-the-art results on link prediction tasks over a variety of graphs, reducing the error by up to 90%. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.
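As a hedged illustration of the ego-network decomposition mentioned above, the snippet below approximates the local communities a node participates in by the connected components of its ego network once the node itself is removed; each component would then give rise to one persona with its own representation. This is one plausible reading of the principle, using networkx for brevity, not the paper's exact procedure.

```python
import networkx as nx

def persona_split(G, node):
    """Return the local communities node participates in, approximated by the
    connected components of its ego network with the ego excluded; each
    component corresponds to one persona of the node."""
    ego_minus_ego = nx.ego_graph(G, node, center=False)
    return [set(component) for component in nx.connected_components(ego_minus_ego)]
```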
Network embedding, which aims to embed a network into a low-dimensional vector space while preserving its inherent structural properties, has attracted considerable attention recently. Most existing embedding methods embed nodes as point vectors in a low-dimensional continuous space. In this way, the formation of an edge is deterministic and determined only by the positions of the nodes. However, the formation and evolution of real-world networks are full of uncertainties, which makes these methods suboptimal. To address this problem, we propose a novel Deep Variational Network Embedding in Wasserstein Space (DVNE) in this paper. The proposed method learns a Gaussian distribution in the Wasserstein space as the latent representation of each node, which can simultaneously preserve the network structure and model the uncertainty of nodes. Specifically, we use the 2-Wasserstein distance as the similarity measure between the distributions, which preserves the transitivity in the network well at a linear computational cost. Moreover, our method captures the mathematical relevance of the mean and variance through the deep variational model, so that the mean vectors capture the positions of the nodes and the variances capture their uncertainties. Additionally, our method captures both the local and global network structure by preserving the first-order and second-order proximity in the network. Our experimental results demonstrate that our method can effectively model the uncertainty of nodes in networks and show substantial gains on real-world applications such as link prediction and multi-label classification compared with state-of-the-art methods.
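For diagonal Gaussians, the 2-Wasserstein distance mentioned above has a simple closed form that can be evaluated in time linear in the embedding dimension, which is presumably what makes it attractive here; the hedged snippet below computes it under that diagonal-covariance assumption.

```python
import numpy as np

def wasserstein2(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between N(mu1, diag(sigma1^2)) and
    N(mu2, diag(sigma2^2)); the closed form combines the distance between means
    with the distance between the element-wise standard deviations."""
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2))
```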
Prediction tasks over nodes and edges in networks require careful effort in engineering the features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving the network neighborhoods of nodes. We define a flexible notion of a node's network neighborhood and design a biased random walk procedure that efficiently explores diverse neighborhoods. Our algorithm generalizes prior work based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way of efficiently learning state-of-the-art, task-independent representations in complex networks.
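The biased walk can be made concrete with the hedged sketch below: at each step, the probability of moving to a candidate neighbor of the current node depends on whether the move returns to the previous node, stays within the previous node's neighborhood (a BFS-like move), or moves further away (a DFS-like move), controlled by the return and in-out parameters p and q; the default values shown are illustrative.

```python
import random

def biased_walk_step(G, prev, cur, p=1.0, q=1.0, rng=random):
    """Choose the next node of a second-order biased random walk on graph G
    (e.g. a networkx graph), given the previous and current nodes."""
    neighbors = list(G.neighbors(cur))
    if not neighbors:
        return None
    weights = []
    for candidate in neighbors:
        if candidate == prev:
            weights.append(1.0 / p)            # return to the previous node
        elif G.has_edge(candidate, prev):
            weights.append(1.0)                # stay close to prev (BFS-like)
        else:
            weights.append(1.0 / q)            # move outward (DFS-like)
    return rng.choices(neighbors, weights=weights, k=1)[0]
```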
Embedding network data into a low-dimensional vector space has shown promising performance for many real-world applications, such as node classification and entity retrieval. However, most existing methods have focused only on leveraging the network structure. For social networks, besides the network structure, there also exists rich information about the social actors, such as user profiles in friendship networks and textual content in citation networks. This rich attribute information about social actors reveals the homophily effect, which exerts a huge influence on the formation of social networks. In this paper, we explore the rich evidence source of attributes in social networks to improve network embedding. We propose a generic Social Network Embedding framework (SNE), which learns representations for social actors (i.e., nodes) by preserving both the structural proximity and the attribute proximity. While the structural proximity captures the global network structure, the attribute proximity accounts for the homophily effect. To justify our proposal, we conduct extensive experiments on four real-world social networks. Compared to state-of-the-art network embedding approaches, SNE learns more informative representations, achieving substantial gains on the tasks of link prediction and node classification. Specifically, SNE significantly outperforms node2vec with an 8.2% relative improvement on the link prediction task and a 12.7% gain on the node classification task.