Unlike tabular data, features in network data are interconnected within a domain-specific graph. Examples of this setting include gene expression overlaid on a protein interaction network (PPI) and user opinions in a social network. Network data is typically high-dimensional (large number of nodes) and often contains outlier snapshot instances and noise. In addition, it is often non-trivial and time-consuming to annotate instances with global labels (e.g., disease or normal). How can we jointly select discriminative subnetworks and representative instances for network data without supervision? We address these challenges within an unsupervised framework for joint subnetwork and instance selection in network data, called UISS, via a convex self-representation objective. Given an unlabeled network dataset, UISS identifies representative instances while ignoring outliers. It outperforms state-of-the-art baselines on both discriminative subnetwork selection and representative instance selection, achieving up to 10% accuracy improvement on all real-world data sets we use for evaluation. When employed for exploratory analysis in RNA-seq network samples from multiple studies it produces interpretable and informative summaries.
translated by 谷歌翻译
多视图无监督的特征选择(MUF)已被证明是一种有效的技术,可降低多视图未标记数据的维度。现有方法假定所有视图都已完成。但是,多视图数据通常不完整,即,某些视图中显示了一部分实例,但并非所有视图。此外,学习完整的相似性图,作为现有MUFS方法中重要的有前途的技术,由于缺少的观点而无法实现。在本文中,我们提出了一个基于互补的和共识学习的不完整的多视图无监督的特征选择方法(C $^{2} $ IMUFS),以解决上述问题。具体而言,c $^{2} $ imufs将功能选择集成到扩展的加权非负矩阵分解模型中,配备了自适应学习视图和稀疏的$ \ ell_ {2,p} $ - norm-norm,它可以提供更好的提供适应性和灵活性。通过从不同视图得出的多个相似性矩阵的稀疏线性组合,介绍了互补学习引导的相似性矩阵重建模型,以在每个视图中获得完整的相似性图。此外,c $^{2} $ imufs学习了跨不同视图的共识聚类指示器矩阵,并将其嵌入光谱图术语中以保留本地几何结构。现实世界数据集的全面实验结果证明了与最新方法相比,C $^{2} $ IMUF的有效性。
translated by 谷歌翻译
由于巨大的未标记数据的出现,现在已经增加了更加关注无监督的功能选择。需要考虑使用更有效的顺序使用样品训练学习方法的样本和潜在效果的分布,以提高该方法的鲁棒性。自定步学习是考虑样本培训顺序的有效方法。在本研究中,通过整合自花枢学习和子空间学习框架来提出无监督的特征选择。此外,保留了局部歧管结构,并且特征的冗余受到两个正则化术语的约束。 $ l_ {2,1 / 2} $ - norm应用于投影矩阵,旨在保留歧视特征,并进一步缓解数据中噪声的影响。然后,提出了一种迭代方法来解决优化问题。理论上和实验证明了该方法的收敛性。将所提出的方法与九个现实世界数据集上的其他技术的算法进行比较。实验结果表明,该方法可以提高聚类方法的性能,优于其他比较算法。
translated by 谷歌翻译
在过去十年中,图形内核引起了很多关注,并在结构化数据上发展成为一种快速发展的学习分支。在过去的20年中,该领域发生的相当大的研究活动导致开发数十个图形内核,每个图形内核都对焦于图形的特定结构性质。图形内核已成功地成功地在广泛的域中,从社交网络到生物信息学。本调查的目标是提供图形内核的文献的统一视图。特别是,我们概述了各种图形内核。此外,我们对公共数据集的几个内核进行了实验评估,并提供了比较研究。最后,我们讨论图形内核的关键应用,并概述了一些仍有待解决的挑战。
translated by 谷歌翻译
Selecting subsets of features that differentiate between two conditions is a key task in a broad range of scientific domains. In many applications, the features of interest form clusters with similar effects on the data at hand. To recover such clusters we develop DiSC, a data-driven approach for detecting groups of features that differentiate between conditions. For each condition, we construct a graph whose nodes correspond to the features and whose weights are functions of the similarity between them for that condition. We then apply a spectral approach to compute subsets of nodes whose connectivity differs significantly between the condition-specific feature graphs. On the theoretical front, we analyze our approach with a toy example based on the stochastic block model. We evaluate DiSC on a variety of datasets, including MNIST, hyperspectral imaging, simulated scRNA-seq and task fMRI, and demonstrate that DiSC uncovers features that better differentiate between conditions compared to competing methods.
translated by 谷歌翻译
特征选择通过识别最具信息性功能的子集来减少数据的维度。在本文中,我们为无监督的特征选择提出了一种创新的框架,称为分形Automencoders(FAE)。它列举了一个神经网络,以确定全球探索能力和局部挖掘的多样性的信息。架构上,FAE通过添加一对一的评分层和小子神经网络来扩展AutoEncoders,以便以无监督的方式选择特征选择。通过这种简洁的建筑,Fae实现了最先进的表演;在十四个数据集中的广泛实验结果,包括非常高维数据,已经证明了FAE对未经监督特征选择的现有现代方法的优越性。特别是,FAE对基因表达数据探索具有实质性优势,通过广泛使用的L1000地标基因将测量成本降低约15美元。此外,我们表明FAE框架与应用程序很容易扩展。
translated by 谷歌翻译
我们考虑了从节点观测值估算多个网络拓扑的问题,其中假定这些网络是从相同(未知)随机图模型中绘制的。我们采用图形作为我们的随机图模型,这是一个非参数模型,可以从中绘制出潜在不同大小的图形。图形子的多功能性使我们能够解决关节推理问题,即使对于要恢复的图形包含不同数量的节点并且缺乏整个图形的精确比对的情况。我们的解决方案是基于将最大似然惩罚与Graphon估计方案结合在一起,可用于增强现有网络推理方法。通过引入嘈杂图抽样信息的强大方法,进一步增强了所提出的联合网络和图形估计。我们通过将其性能与合成和实际数据集中的竞争方法进行比较来验证我们提出的方法。
translated by 谷歌翻译
图形离群值检测是一项具有许多应用程序的新兴但至关重要的机器学习任务。尽管近年来算法扩散,但缺乏标准和统一的绩效评估设置限制了它们在现实世界应用中的进步和使用。为了利用差距,我们(据我们所知)(据我们所知)第一个全面的无监督节点离群值检测基准为unod,并带有以下亮点:(1)评估骨架从经典矩阵分解到最新图形神经的骨架的14个方法网络; (2)在现实世界数据集上使用不同类型的注射异常值和自然异常值对方法性能进行基准测试; (3)通过在不同尺度的合成图上使用运行时和GPU存储器使用算法的效率和可扩展性。基于广泛的实验结果的分析,我们讨论了当前渠道方法的利弊,并指出了多个关键和有希望的未来研究方向。
translated by 谷歌翻译
子空间聚类是将大约位于几个低维子空间的数据样本集合集合的经典问题。此问题的当前最新方法基于自我表达模型,该模型表示样品是其他样品的线性组合。但是,这些方法需要足够广泛的样品才能准确表示,这在许多应用中可能不一定是可以访问的。在本文中,我们阐明了这个常见的问题,并认为每个子空间中的数据分布在自我表达模型的成功中起着至关重要的作用。我们提出的解决此问题的解决方案是由数据扩展在深神经网络的概括力中的核心作用引起的。我们为无监督和半监督的设置提出了两个子空间聚类框架,这些框架使用增强样品作为扩大词典来提高自我表达表示的质量。我们提出了一种使用一些标记的样品进行半监督问题的自动增强策略,该问题取决于数据样本位于多个线性子空间的联合以下事实。实验结果证实了数据增强的有效性,因为它显着提高了一般自我表达模型的性能。
translated by 谷歌翻译
图表可以模拟实体之间的复杂交互,它在许多重要的应用程序中自然出现。这些应用程序通常可以投入到标准图形学习任务中,其中关键步骤是学习低维图表示。图形神经网络(GNN)目前是嵌入方法中最受欢迎的模型。然而,邻域聚合范例中的标准GNN患有区分\ EMPH {高阶}图形结构的有限辨别力,而不是\ EMPH {低位}结构。为了捕获高阶结构,研究人员求助于主题和开发的基于主题的GNN。然而,现有的基于主基的GNN仍然仍然遭受较少的辨别力的高阶结构。为了克服上述局限性,我们提出了一个新颖的框架,以更好地捕获高阶结构的新框架,铰接于我们所提出的主题冗余最小化操作员和注射主题组合的新颖框架。首先,MGNN生成一组节点表示W.R.T.每个主题。下一阶段是我们在图案中提出的冗余最小化,该主题在彼此相互比较并蒸馏出每个主题的特征。最后,MGNN通过组合来自不同图案的多个表示来执行节点表示的更新。特别地,为了增强鉴别的功率,MGNN利用重新注射功能来组合表示的函数w.r.t.不同的主题。我们进一步表明,我们的拟议体系结构增加了GNN的表现力,具有理论分析。我们展示了MGNN在节点分类和图形分类任务上的七个公共基准上表现出最先进的方法。
translated by 谷歌翻译
We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA) or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on underdetermined problems ($p \gg N$ with tens of observations) and scales to large datasets (e.g., $N=100,000$; $p=1,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.
translated by 谷歌翻译
当节点具有人口统计属性时,概率图形模型中社区结构的推理可能不会与公平约束一致。某些人口统计学可能在某些检测到的社区中过度代表,在其他人中欠代表。本文定义了一个新的$ \ ell_1 $ -regulared伪似然方法,用于公平图形模型选择。特别是,我们假设真正的基础图表​​中存在一些社区或聚类结构,我们寻求从数据中学习稀疏的无向图形及其社区,使得人口统计团体在社区内相当代表。我们的优化方法使用公平的人口统计奇偶校验定义,但框架很容易扩展到其他公平的定义。我们建立了分别,连续和二进制数据的高斯图形模型和Ising模型的提出方法的统计一致性,证明了我们的方法可以以高概率恢复图形及其公平社区。
translated by 谷歌翻译
由于需要经济的储存和二元法规的效率,因此无监督的哈希对二元表示学习引起了很多关注。它旨在编码锤子空间中的高维特征,并在实例之间保持相似性。但是,大多数现有方法在基于多种的方法中学习哈希功能。这些方法捕获了数据的局部几何结构(即成对关系),并且在处理具有不同语义信息的实际特征(例如颜色和形状)的真实情况时缺乏令人满意的性能。为了应对这一挑战,在这项工作中,我们提出了一种有效的无监督方法,即共同个性化的稀疏哈希(JPSH),以进行二进制表示学习。具体来说,首先,我们提出了一个新颖的个性化哈希模块,即个性化的稀疏哈希(PSH)。构建了不同的个性化子空间,以反映不同群集的特定类别属性,同一群集中的自适应映射实例与同一锤子空间。此外,我们为不同的个性化子空间部署稀疏约束来选择重要功能。我们还收集了其他群集的优势,以避免过度拟合,以构建PSH模块。然后,为了在JPSH中同时保留语义和成对的相似性,我们将基于PSH和歧管的哈希学习纳入无缝配方中。因此,JPSH不仅将这些实例与不同的集群区分开,而且还保留了集群中的本地邻里结构。最后,采用了交替优化算法,用于迭代捕获JPSH模型的分析解决方案。在四个基准数据集上进行的大量实验验证了JPSH是否在相似性搜索任务上优于几个哈希算法。
translated by 谷歌翻译
Transfer learning aims at improving the performance of target learners on target domains by transferring the knowledge contained in different but related source domains. In this way, the dependence on a large number of target domain data can be reduced for constructing target learners. Due to the wide application prospects, transfer learning has become a popular and promising area in machine learning. Although there are already some valuable and impressive surveys on transfer learning, these surveys introduce approaches in a relatively isolated way and lack the recent advances in transfer learning. Due to the rapid expansion of the transfer learning area, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize the existing transfer learning researches, as well as to summarize and interpret the mechanisms and the strategies of transfer learning in a comprehensive way, which may help readers have a better understanding of the current research status and ideas. Unlike previous surveys, this survey paper reviews more than forty representative transfer learning approaches, especially homogeneous transfer learning approaches, from the perspectives of data and model. The applications of transfer learning are also briefly introduced. In order to show the performance of different transfer learning models, over twenty representative transfer learning models are used for experiments. The models are performed on three different datasets, i.e., Amazon Reviews, Reuters-21578, and Office-31. And the experimental results demonstrate the importance of selecting appropriate transfer learning models for different applications in practice.
translated by 谷歌翻译
通过图形结构表示数据标识在多个数据分析应用中提取信息的最有效方法之一。当调查多模式数据集时,这尤其如此,因为通过各种传感策略收集的记录被考虑并探索。然而,经典曲线图信号处理基于根据热扩散机构配置的信息传播的模型。该系统提供了对多模式数据分析不适用于多模式数据分析的数据属性的若干约束和假设,特别是当考虑从异构源收集的大规模数据集,因此结果的准确性和稳健性可能会受到严重危害。在本文中,我们介绍了一种基于流体扩散的图表定义模型。该方法提高了基于图形的数据分析的能力,以考虑运行方案中现代数据分析的几个问题,从而为对考试记录的记录底层的现象提供了一种精确,多才多艺的,有效地理解平台,以及完全利用记录的多样性提供的潜力,以获得数据的彻底表征及其意义。在这项工作中,我们专注于使用这种流体扩散模型来驱动社区检测方案,即根据节点中的节点中的相似性将多模式数据集分为多个组中。在不同应用场景中测试真正的多模式数据集实现的实验结果表明,我们的方法能够强烈优先于多媒体数据分析中的社区检测的最先进方案。
translated by 谷歌翻译
Multi-task learning (MTL) aims to improve the performance of multiple related prediction tasks by leveraging useful information from them. Due to their flexibility and ability to reduce unknown coefficients substantially, the task-clustering-based MTL approaches have attracted considerable attention. Motivated by the idea of semisoft clustering of data, we propose a semisoft task clustering approach, which can simultaneously reveal the task cluster structure for both pure and mixed tasks as well as select the relevant features. The main assumption behind our approach is that each cluster has some pure tasks, and each mixed task can be represented by a linear combination of pure tasks in different clusters. To solve the resulting non-convex constrained optimization problem, we design an efficient three-step algorithm. The experimental results based on synthetic and real-world datasets validate the effectiveness and efficiency of the proposed approach. Finally, we extend the proposed approach to a robust task clustering problem.
translated by 谷歌翻译
使用机器学习算法从未标记的文本中提取知识可能很复杂。文档分类和信息检索是两个应用程序,可以从无监督的学习(例如文本聚类和主题建模)中受益,包括探索性数据分析。但是,无监督的学习范式提出了可重复性问题。初始化可能会导致可变性,具体取决于机器学习算法。此外,关于群集几何形状,扭曲可能会产生误导。在原因中,异常值和异常的存在可能是决定因素。尽管初始化和异常问题与文本群集和主题建模相关,但作者并未找到对它们的深入分析。这项调查提供了这些亚地区的系统文献综述(2011-2022),并提出了共同的术语,因为类似的程序具有不同的术语。作者描述了研究机会,趋势和开放问题。附录总结了与审查的作品直接或间接相关的文本矢量化,分解和聚类算法的理论背景。
translated by 谷歌翻译
Multi-view unsupervised feature selection has been proven to be efficient in reducing the dimensionality of multi-view unlabeled data with high dimensions. The previous methods assume all of the views are complete. However, in real applications, the multi-view data are often incomplete, i.e., some views of instances are missing, which will result in the failure of these methods. Besides, while the data arrive in form of streams, these existing methods will suffer the issues of high storage cost and expensive computation time. To address these issues, we propose an Incremental Incomplete Multi-view Unsupervised Feature Selection method (I$^2$MUFS) on incomplete multi-view streaming data. By jointly considering the consistent and complementary information across different views, I$^2$MUFS embeds the unsupervised feature selection into an extended weighted non-negative matrix factorization model, which can learn a consensus clustering indicator matrix and fuse different latent feature matrices with adaptive view weights. Furthermore, we introduce the incremental leaning mechanisms to develop an alternative iterative algorithm, where the feature selection matrix is incrementally updated, rather than recomputing on the entire updated data from scratch. A series of experiments are conducted to verify the effectiveness of the proposed method by comparing with several state-of-the-art methods. The experimental results demonstrate the effectiveness and efficiency of the proposed method in terms of the clustering metrics and the computational cost.
translated by 谷歌翻译
Research in Graph Signal Processing (GSP) aims to develop tools for processing data defined on irregular graph domains. In this paper we first provide an overview of core ideas in GSP and their connection to conventional digital signal processing, along with a brief historical perspective to highlight how concepts recently developed in GSP build on top of prior research in other areas. We then summarize recent advances in developing basic GSP tools, including methods for sampling, filtering or graph learning. Next, we review progress in several application areas using GSP, including processing and analysis of sensor network data, biological data, and applications to image processing and machine learning.
translated by 谷歌翻译
网络分析一直是揭示大量对象之间关系和交互的强大工具。然而,它在准确识别重要节点节点相互作用的有效性受到快速增长的网络规模的挑战,数据以空前的粒度和规模收集。克服这种高维度的共同智慧是将节点崩溃成较小的群体,并在小组级别进行连通性分析。将努力分为两个阶段不可避免地打开了一致性的差距,并降低了效率。共识学习是通用知识发现的新常态,并具有多个可用的数据源。为此,本文以组合多个数据源来开发同时分组和连接分析的统一框架。该算法还保证了统计上最佳的估计器。
translated by 谷歌翻译