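We propose a method called integrated diffusion for combining multimodal datasets, that is, data gathered via several different measurements on the same system, to create a joint data diffusion operator. Because real-world data suffer from both local and global noise, we introduce mechanisms to optimally compute a diffusion operator that reflects the combined information from both modalities. We show the utility of this joint operator in data denoising, visualization, and clustering, where it integrates and analyzes multimodal data better than other methods. We apply our method to multi-omic data generated from blood cells, measuring both gene expression and chromatin accessibility. Our approach better visualizes the geometry of the joint data, captures known cross-modality associations, and identifies known cellular populations. More generally, integrated diffusion is broadly applicable to multimodal datasets generated in many medical and biological systems.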
translated by Google Translate
In applications such as social, energy, transportation, sensor, and neuronal networks, high-dimensional data naturally reside on the vertices of weighted graphs. The emerging field of signal processing on graphs merges algebraic and spectral graph theoretic concepts with computational harmonic analysis to process such signals on graphs. In this tutorial overview, we outline the main challenges of the area, discuss different ways to define graph spectral domains, which are the analogues to the classical frequency domain, and highlight the importance of incorporating the irregular structures of graph data domains when processing signals on graphs. We then review methods to generalize fundamental operations such as filtering, translation, modulation, dilation, and downsampling to the graph setting, and survey the localized, multiscale transforms that have been proposed to efficiently extract information from high-dimensional data on graphs. We conclude with a brief discussion of open issues and possible extensions.
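As a rough illustration of what "filtering" a signal on a graph can mean, the sketch below builds the combinatorial Laplacian L = D - W of a small weighted graph, measures how strongly a signal varies across edges through the quadratic form x^T L x, and applies a simple polynomial low-pass filter (I - tau L)^steps. The function names, the step size `tau`, and the four-vertex path graph are illustrative assumptions and are not taken from the tutorial itself.

```python
def laplacian(W):
    """Combinatorial Laplacian L = D - W of a symmetric weight matrix W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def matvec(M, x):
    """Matrix-vector product for small list-of-lists matrices."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def smoothness(W, x):
    """Quadratic form x^T L x; small values mean x varies little across edges."""
    Lx = matvec(laplacian(W), x)
    return sum(a * b for a, b in zip(x, Lx))


def low_pass(W, x, tau=0.25, steps=5):
    """Apply the polynomial filter (I - tau * L)^steps to the signal x.

    tau is an assumed step size; it should stay below 2 / lambda_max(L)
    so that high-frequency components are damped rather than amplified.
    """
    L = laplacian(W)
    y = list(x)
    for _ in range(steps):
        Ly = matvec(L, y)
        y = [yi - tau * li for yi, li in zip(y, Ly)]
    return y


# Path graph on four vertices with unit weights, and an oscillating signal.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
signal = [1.0, -1.0, 1.0, -1.0]
filtered = low_pass(W, signal)
```

Comparing `smoothness(W, signal)` with `smoothness(W, filtered)` shows the filter attenuating the rapidly varying part of the signal while leaving its sum unchanged.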
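The integration of multimodal data presents a challenge when the study of a given phenomenon by different instruments or under different conditions generates distinct but related domains. Many existing data integration methods assume a known one-to-one correspondence between domains across the entire dataset, which may be unrealistic. Furthermore, existing manifold alignment methods are not suited to cases where the data contain domain-specific regions, i.e., where a certain portion of the data has no counterpart in the other domain. We propose Diffusion Transport Alignment (DTA), a semi-supervised manifold alignment method that exploits prior knowledge of correspondences between only a few points to align the domains. By building a diffusion process, DTA finds a transportation plan between data measured from two heterogeneous domains with different feature spaces, which by assumption share a similar geometric structure arising from the same underlying data-generating process. DTA can also compute a partial alignment in a data-driven fashion, yielding accurate alignments when some data are measured in only one domain. We empirically demonstrate that DTA outperforms other methods at aligning multimodal data in this semi-supervised setting. We also show empirically that the alignment obtained by DTA can improve the performance of machine learning tasks such as domain adaptation, inter-domain feature mapping, and exploratory data analysis, while outperforming competing methods.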
Selecting subsets of features that differentiate between two conditions is a key task in a broad range of scientific domains. In many applications, the features of interest form clusters with similar effects on the data at hand. To recover such clusters we develop DiSC, a data-driven approach for detecting groups of features that differentiate between conditions. For each condition, we construct a graph whose nodes correspond to the features and whose weights are functions of the similarity between them for that condition. We then apply a spectral approach to compute subsets of nodes whose connectivity differs significantly between the condition-specific feature graphs. On the theoretical front, we analyze our approach with a toy example based on the stochastic block model. We evaluate DiSC on a variety of datasets, including MNIST, hyperspectral imaging, simulated scRNA-seq and task fMRI, and demonstrate that DiSC uncovers features that better differentiate between conditions compared to competing methods.
We present a new technique called "t-SNE" that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets.
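A major challenge in embedding or visualizing clinical patient data is the heterogeneity of variable types, including continuous lab values, categorical diagnostic codes, and missing or incomplete data. In particular, in EHR data some variables are missing not at random (MNAR): they were deliberately not collected and are therefore themselves a source of information. For example, lab tests may be deemed necessary for some patients on the basis of a suspected diagnosis, but not for others. Here we present the MURAL forest, an unsupervised random forest for representing data with disparate variable types (e.g., categorical, continuous, MNAR). MURAL forests consist of a set of decision trees in which node-splitting variables are chosen at random, such that the marginal entropy of all other variables is minimized by the split. This allows us to also split on MNAR variables and discrete variables in a way that is consistent with the continuous variables. The end goal is to learn a MURAL embedding of patients using the average tree distances between those patients. These distances can be fed to a nonlinear dimensionality reduction method such as PHATE to derive visualizable embeddings. While such methods are ubiquitous for continuous-valued datasets (such as single-cell RNA sequencing), they have not been used extensively on mixed-variable data. We showcase the use of our method on one artificial and two clinical datasets. We show that with our approach we can visualize and classify data more accurately than with competing approaches. Finally, we show that MURAL can also be used to compare cohorts of patients via the recently proposed tree-sliced Wasserstein distances.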
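Representing data by means of graph structures is one of the most effective ways to extract information in many data analysis applications. This is especially true when investigating multimodal datasets, where records collected by means of diverse sensing strategies are taken into account and explored. However, classic graph signal processing is based on a model of information propagation that is configured according to a heat diffusion mechanism. This model imposes several constraints and assumptions on the data properties that may not hold for multimodal data analysis, especially when large-scale datasets collected from heterogeneous sources are considered, so the accuracy and robustness of the results may be severely jeopardized. In this paper, we introduce a novel model for graph definition based on fluid diffusion. The proposed approach improves the ability of graph-based data analysis to account for several issues of modern data analysis in operational scenarios. It thereby provides a platform for precise, versatile, and efficient understanding of the phenomena underlying the records under examination, and for fully exploiting the potential offered by the diversity of the records to obtain a thorough characterization of the data and their significance. In this work, we focus on using this fluid diffusion model to drive a community detection scheme, i.e., to divide multimodal datasets into groups according to the similarity among nodes in an unsupervised fashion. Experimental results obtained on real multimodal datasets in diverse application scenarios show that our method strongly outperforms state-of-the-art schemes for community detection in multimodal data analysis.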
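We introduce a new measure of local curvature on point-cloud data called diffusion curvature. Our measure uses the framework of diffusion maps, including the data diffusion operator, to structure point-cloud data and defines local curvature based on the laziness of a random walk starting at a point or region of the data. We show that this laziness relates directly to volume comparison results from Riemannian geometry. We then extend this scalar curvature notion to an entire quadratic form using neural network estimation based on the diffusion map of the point-cloud data. We show applications on toy data, on single-cell data, and on estimating local Hessian matrices of neural network loss landscapes.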
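The manifold scattering transform is a deep feature extractor for data defined on a Riemannian manifold. It is one of the first examples of extending convolutional neural network-like operators to general manifolds. The initial work on this model focused primarily on its theoretical stability and invariance properties, but did not provide methods for its numerical implementation except in the case of two-dimensional surfaces with predefined meshes. In this work, we present practical schemes, based on the theory of diffusion maps, for implementing the manifold scattering transform on datasets arising in naturalistic systems, such as single-cell genetics, where the data are a high-dimensional point cloud modeled as lying on a low-dimensional manifold. We show that our methods are effective for signal classification and manifold classification tasks.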
Many scientific fields study data with an underlying structure that is a non-Euclidean space. Some examples include social networks in computational social sciences, sensor networks in communications, functional networks in brain imaging, regulatory networks in genetics, and meshed surfaces in computer graphics. In many applications, such geometric data are large and complex (in the case of social networks, on the scale of billions), and are natural targets for machine learning techniques. In particular, we would like to use deep neural networks, which have recently proven to be powerful tools for a broad range of problems from computer vision, natural language processing, and audio analysis. However, these tools have been most successful on data with an underlying Euclidean or grid-like structure, and in cases where the invariances of these structures are built into networks used to model them. Geometric deep learning is an umbrella term for emerging techniques attempting to generalize (structured) deep neural models to non-Euclidean domains such as graphs and manifolds. The purpose of this paper is to overview different examples of geometric deep learning problems and present available solutions, key difficulties, applications, and future research directions in this nascent field.
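Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may go undetected. We propose three topologically motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns simultaneously across multiple scales. Eigenscores ($\mathrm{eig}_i$) rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data, using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in the data and selecting the genes that are coherently expressed at those scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing the separation of genes with different roles in a bifurcation process (e.g., pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics datasets. The methods validate previously identified genes and detect additional genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and a visualization of the relationships between them.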
Graph convolution is the core of most Graph Neural Networks (GNNs) and usually approximated by message passing between direct (one-hop) neighbors. In this work, we remove the restriction of using only the direct neighbors by introducing a powerful, yet spatially localized graph convolution: Graph diffusion convolution (GDC). GDC leverages generalized graph diffusion, examples of which are the heat kernel and personalized PageRank. It alleviates the problem of noisy and often arbitrarily defined edges in real graphs. We show that GDC is closely related to spectral-based models and thus combines the strengths of both spatial (message passing) and spectral methods. We demonstrate that replacing message passing with graph diffusion convolution consistently leads to significant performance improvements across a wide range of models on both supervised and unsupervised tasks and a variety of datasets. Furthermore, GDC is not limited to GNNs but can trivially be combined with any graph-based model or algorithm (e.g. spectral clustering) without requiring any changes to the latter or affecting its computational complexity. Our implementation is available online.
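The generalized graph diffusion mentioned above can be sketched, under stated assumptions, as a weighted sum of powers of a random-walk matrix, S = sum_k theta_k T^k, with the personalized PageRank weights theta_k = alpha (1 - alpha)^k, followed by thresholding to keep the result sparse. The helper names, the truncation length `num_terms`, the threshold `eps`, and the toy path graph are illustrative choices, not the authors' released implementation.

```python
def matmul(A, B):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def random_walk_matrix(adj):
    """Column-stochastic transition matrix T = A D^{-1} (assumes no isolated nodes)."""
    n = len(adj)
    deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / deg[j] for j in range(n)] for i in range(n)]


def ppr_diffusion(adj, alpha=0.15, num_terms=50):
    """Truncated personalized PageRank series S = sum_k alpha (1 - alpha)^k T^k."""
    n = len(adj)
    T = random_walk_matrix(adj)
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # T^0 = I
    S = [[0.0] * n for _ in range(n)]
    coeff = alpha
    for _ in range(num_terms):
        for i in range(n):
            for j in range(n):
                S[i][j] += coeff * power[i][j]
        power = matmul(power, T)
        coeff *= 1.0 - alpha
    return S


def sparsify(S, eps):
    """Zero out entries below eps so the diffused graph stays sparse."""
    return [[s if s >= eps else 0.0 for s in row] for row in S]


# Path graph 0 - 1 - 2 - 3: nodes 0 and 3 are three hops apart.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
S = ppr_diffusion(adj)
S_sparse = sparsify(S, eps=0.05)
```

The diffused matrix `S` assigns a positive weight to the pair (3, 0) even though the original graph has no such edge, which is the sense in which diffusion reaches beyond one-hop neighbors; `S_sparse` could then stand in for the adjacency matrix in a downstream graph-based model.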
Research in Graph Signal Processing (GSP) aims to develop tools for processing data defined on irregular graph domains. In this paper we first provide an overview of core ideas in GSP and their connection to conventional digital signal processing, along with a brief historical perspective to highlight how concepts recently developed in GSP build on top of prior research in other areas. We then summarize recent advances in developing basic GSP tools, including methods for sampling, filtering or graph learning. Next, we review progress in several application areas using GSP, including processing and analysis of sensor network data, biological data, and applications to image processing and machine learning.
Deep learning has achieved a remarkable performance breakthrough in several fields, most notably in speech recognition, natural language processing, and computer vision. In particular, convolutional neural network (CNN) architectures currently produce state-of-the-art performance on a variety of image analysis tasks such as object detection and recognition. Most deep learning research has so far focused on dealing with 1D, 2D, or 3D Euclidean-structured data such as acoustic signals, images, or videos. Recently, there has been an increasing interest in geometric deep learning, which attempts to generalize deep learning methods to non-Euclidean structured data such as graphs and manifolds, with a variety of applications from the domains of network analysis, computational social science, or computer graphics. In this paper, we propose a unified framework that generalizes CNN architectures to non-Euclidean domains (graphs and manifolds) and learns local, stationary, and compositional task-specific features. We show that various non-Euclidean CNN methods previously proposed in the literature can be considered as particular instances of our framework. We test the proposed method on standard tasks from the realms of image, graph, and 3D shape analysis and show that it consistently outperforms previous approaches.
A graph is a highly generic and diverse representation, suitable for almost any data processing problem. Spectral graph theory has been shown to provide powerful algorithms, backed by solid linear algebra theory. It can thus be extremely useful to design deep network building blocks with spectral graph characteristics. For instance, such a network allows the design of optimal graphs for certain tasks or obtaining a canonical orthogonal low-dimensional embedding of the data. Recent attempts to solve this problem were based on minimizing Rayleigh-quotient type losses. We propose a different approach of directly learning the eigenspace. A severe problem of the direct approach, applied in batch learning, is the inconsistent mapping of features to eigenspace coordinates in different batches. We analyze the degrees of freedom of learning this task using batches and propose a stable alignment mechanism that can work both with batch changes and with graph-metric changes. We show that our learnt spectral embedding is better in terms of NMI, ACC, Grassmann distance, orthogonality, and classification accuracy, compared to state-of-the-art methods. In addition, the learning is more stable.
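There has been intense recent activity in the embedding of very high-dimensional and nonlinear data structures, much of it in the data science and machine learning literature. We survey this activity in four parts. In the first part, we cover nonlinear methods such as principal curves, multidimensional scaling, local linear methods, ISOMAP, graph-based methods and diffusion mapping, kernel-based methods, and random projections. The second part is concerned with topological embedding methods, in particular mapping topological properties into persistence diagrams and the Mapper algorithm. Another type of dataset with tremendous growth is very high-dimensional network data. The task considered in part three is how to embed such data in a vector space of moderate dimension so that the data become amenable to traditional techniques such as clustering and classification. Arguably, this is where the contrast between algorithmic machine learning methods and statistical modeling (so-called stochastic block modeling) is greatest. In the paper, we discuss the pros and cons of the two approaches. The final part of the survey deals with embedding in $\mathbb{R}^2$, i.e., visualization. Three methods are presented: $t$-SNE, UMAP, and LargeVis, based on the methods in parts one, two, and three, respectively. The methods are illustrated and compared on two simulated datasets: one consisting of a triplet of noisy Ranunculoid curves, and one consisting of networks of increasing complexity generated with stochastic block models and with two types of nodes.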
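Nearest-neighbor graphs are widely used to capture the geometry or topology of a dataset. One of the most common strategies for constructing such a graph is to select a fixed number k of nearest neighbors (kNN) for each point. However, the kNN heuristic may become inappropriate when the sampling density or noise level varies across the dataset. Strategies that try to address this typically introduce additional parameters that need to be tuned. We propose a simple approach for constructing an adaptive neighborhood graph from a single parameter, based on quadratically regularized optimal transport. Our numerical experiments show that graphs constructed in this manner perform favorably in unsupervised and semi-supervised learning applications.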
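For context, the fixed-k baseline that this abstract contrasts against can be sketched as follows. This is only the plain kNN heuristic, not the proposed optimal-transport construction; the function name `knn_graph` and the toy one-dimensional points are illustrative assumptions.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points.

    Uses Euclidean distance; ties are broken by the smaller index.
    """
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Three densely sampled points and one distant point.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
graph = knn_graph(points, k=2)
```

With a fixed k, the distant point at 10.0 is still linked to two points at distances 8 and 9, which illustrates why the heuristic can behave poorly when sampling density varies.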
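Semi-supervised learning has received attention from researchers because it allows the structure of unlabeled data to be exploited to achieve competitive classification results with far fewer labels than supervised approaches. The Local and Global Consistency (LGC) algorithm is one of the best-known graph-based semi-supervised learning (GSSL) classifiers. Notably, its solution can be written as a linear combination of the known labels. The coefficients of this linear combination depend on a parameter $\alpha$, which determines the decay over time when labeled vertices are reached in a random walk. In this work, we discuss how removing the self-influence of a labeled instance may be beneficial, and how it relates to the leave-one-out error. Moreover, we propose to minimize this leave-one-out loss with automatic differentiation. Within this framework, we propose methods to estimate label reliability and diffusion rate. Optimizing the diffusion rate is done more efficiently in a spectral representation. Results show that the label reliability approach is competitive with robust L1-norm methods, and that removing the diagonal entries reduces the risk of overfitting and leads to a suitable criterion for parameter selection.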
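The "linear combination of the known labels" can be made concrete with a small sketch. It assumes the standard LGC propagation matrix P = (I - alpha S)^{-1} with S = D^{-1/2} W D^{-1/2}, approximated here by a truncated Neumann series, and it models "removing self-influence" simply by zeroing the diagonal of P. The function names, `num_terms`, and the six-node path graph are illustrative assumptions; the paper's automatic-differentiation procedure is not reproduced.

```python
import math


def matmul(A, B):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def normalized_affinity(W):
    """S = D^{-1/2} W D^{-1/2} (assumes every node has positive degree)."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagation_matrix(W, alpha=0.9, num_terms=200):
    """Truncated Neumann series sum_k (alpha S)^k, approximating (I - alpha S)^{-1}."""
    n = len(W)
    S = normalized_affinity(W)
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # (alpha S)^0
    P = [[0.0] * n for _ in range(n)]
    for _ in range(num_terms):
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
        term = [[alpha * v for v in row] for row in matmul(term, S)]
    return P


def lgc_predict(P, Y, drop_self=False):
    """Scores F = P Y; with drop_self, a node's own label does not vote for it."""
    n = len(P)
    if drop_self:
        P = [[0.0 if i == j else P[i][j] for j in range(n)] for i in range(n)]
    F = matmul(P, Y)
    return [max(range(len(row)), key=row.__getitem__) for row in F]


# Path graph on six nodes; node 0 is labeled class 0, node 5 class 1.
n = 6
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
Y = [[1.0, 0.0]] + [[0.0, 0.0]] * 4 + [[0.0, 1.0]]
P = propagation_matrix(W)
pred = lgc_predict(P, Y)
pred_loo = lgc_predict(P, Y, drop_self=True)
```

In this toy setting each class has a single labeled node, so dropping the diagonal makes each labeled node be classified by the other class's label, while predictions for the unlabeled nodes are unchanged. That is the leave-one-out reading of the diagonal removal described in the abstract.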
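The need to efficiently compare and represent datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets. We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples. Existing methods typically assume a known data alignment and compare such operators in a pointwise manner. Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new, theoretically motivated distance based on a lower bound of the log-Euclidean metric. Our framework facilitates the comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities. Our log-Euclidean signature (LES) distance recovers meaningful structural differences and outperforms competing methods in various application domains.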
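Dimensionality reduction (DR) techniques help analysts understand patterns in high-dimensional spaces. These techniques, often represented by scatter plots, are employed in diverse scientific domains and facilitate similarity analysis among clusters and data samples. For datasets containing many granularities, or when the analysis follows the information visualization mantra, hierarchical DR techniques are the most suitable approach, since they present major structures first and details on demand. However, current hierarchical DR techniques are not fully capable of addressing the problems raised in the literature, because they do not preserve the mental map of the projection across hierarchical levels or are not suitable for most data types. This work presents HUMAP, a novel hierarchical dimensionality reduction technique designed to be flexible in preserving local and global structures and to preserve the mental map throughout hierarchical exploration. We provide empirical evidence of our technique's superiority over current hierarchical approaches and present two case studies to demonstrate its strengths.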