Nonnegative matrix factorization (NMF) has become a workhorse for signal and data analytics, triggered by its model parsimony and interpretability. Perhaps a bit surprisingly, the understanding of model identifiability -- the major reason behind the interpretability in many applications such as topic mining and hyperspectral imaging -- had been rather limited until recent years. Beginning from around 2010, identifiability research on NMF has progressed considerably: the signal processing (SP) and machine learning (ML) communities have identified many interesting and important identifiability results. NMF identifiability has a great impact on many aspects of practice, such as avoiding ill-posed formulations and designing algorithms with performance guarantees. On the other hand, there is no tutorial paper that introduces NMF from an identifiability viewpoint. In this paper, we aim to fill this gap by offering a comprehensive and deep tutorial on model identifiability of NMF, as well as its connections to algorithms and applications. This tutorial will help researchers and graduate students grasp the essence and insights of NMF, thereby avoiding typical "pitfalls" that are often caused by unidentifiable NMF formulations. The paper will also help practitioners pick/design suitable factorization tools for their own problems.
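As a concrete anchor for the model discussed above, the sketch below fits the basic (unconstrained) NMF model X ≈ WH with nonnegative factors using the classical multiplicative updates for the Frobenius loss. It is a minimal illustration, not one of the identifiability-aware algorithms surveyed in the tutorial; all names and parameter choices are illustrative.

```python
import numpy as np

def nmf_mu(X, r, n_iter=500, eps=1e-9, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates for the Frobenius loss:
    find nonnegative W (n x r) and H (r x m) with X ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # multiplicative update keeps H >= 0
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # multiplicative update keeps W >= 0
    return W, H

# Toy usage on an exactly rank-3 nonnegative matrix.
rng = np.random.default_rng(1)
X = rng.random((50, 3)) @ rng.random((3, 40))
W, H = nmf_mu(X, 3)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # small relative fit error
```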
Tensors or {\em multi-way arrays} are functions of three or more indices $(i,j,k,\cdots)$ -- similar to matrices (two-way arrays), which are functions of two indices $(r,c)$ for (row, column). Tensors have a rich history, stretching over almost a century, and touching upon numerous disciplines; but they have only recently become ubiquitous in signal and data analytics at the confluence of signal processing, statistics, data mining and machine learning. This overview article aims to provide a good starting point for researchers and practitioners interested in learning about and working with tensors. As such, it focuses on fundamentals and motivation (using various application examples), aiming to strike an appropriate balance of breadth {\em and depth} that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing tensor algorithms and software. Some background in applied optimization is useful but not strictly required. The material covered includes tensor rank and rank decomposition; basic tensor factorization models and their relationships and properties (including fairly good coverage of identifiability); broad coverage of algorithms ranging from alternating optimization to stochastic gradient; statistical performance analysis; and applications ranging from source separation to collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning.
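To make the notion of rank decomposition concrete, here is a minimal sketch of a rank-r CP (canonical polyadic) decomposition of a 3-way tensor computed by alternating least squares -- one of the basic workhorse algorithms this kind of overview covers -- written in bare-bones form, with illustrative names and no convergence or conditioning safeguards.

```python
import numpy as np

def cp_als(T, r, n_iter=200, seed=0):
    """Rank-r CP decomposition by alternating least squares:
    T[i,j,k] ~ sum_f A[i,f] * B[j,f] * C[k,f]."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A, B, C = (rng.standard_normal((n, r)) for n in (I, J, K))
    T1 = T.reshape(I, J * K)                       # mode-1 unfolding
    T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)    # mode-2 unfolding
    T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)    # mode-3 unfolding
    khatri_rao = lambda X, Y: (X[:, None, :] * Y[None, :, :]).reshape(-1, X.shape[1])
    for _ in range(n_iter):
        A = T1 @ np.linalg.pinv(khatri_rao(B, C).T)   # least-squares update per factor
        B = T2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = T3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Toy usage: an exactly rank-3 tensor should be fit with small relative error.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((10, 3)), rng.random((12, 3)), rng.random((8, 3))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(T, 3)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(T))
```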
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models -- including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation -- which exploits a certain tensor structure in their low-order observable moments (typically, of second and third order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decompositions of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the matrix case). An analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
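The core computational step described above is power iteration on an orthogonally decomposable symmetric tensor. Below is a minimal sketch of that step with deflation, on an exactly decomposable toy tensor; it omits the paper's robustness analysis and the whitening of raw moments, and the names and toy construction are illustrative.

```python
import numpy as np

def tensor_apply(T, u):
    """T(I, u, u) for a symmetric 3-way tensor T."""
    return np.einsum('ijk,j,k->i', T, u, u)

def tensor_power_method(T, r, n_restart=10, n_iter=100, seed=0):
    """Extract T ~ sum_j lam_j * v_j (x) v_j (x) v_j by power iteration
    with random restarts and deflation."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    T = T.copy()
    lams, vecs = [], []
    for _ in range(r):
        best_val, best_u = -np.inf, None
        for _ in range(n_restart):
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            for _ in range(n_iter):
                u = tensor_apply(T, u)
                u /= np.linalg.norm(u)
            val = np.einsum('ijk,i,j,k->', T, u, u, u)
            if val > best_val:
                best_val, best_u = val, u
        lams.append(best_val)
        vecs.append(best_u)
        T = T - best_val * np.einsum('i,j,k->ijk', best_u, best_u, best_u)  # deflate
    return np.array(lams), np.array(vecs)

# Toy check: build an orthogonally decomposable tensor and recover its weights.
d, r = 5, 3
Q = np.linalg.qr(np.random.default_rng(3).standard_normal((d, d)))[0][:, :r]
lam_true = np.array([3.0, 2.0, 1.0])
T = sum(l * np.einsum('i,j,k->ijk', q, q, q) for l, q in zip(lam_true, Q.T))
lams, vecs = tensor_power_method(T, r)
print(np.round(lams, 3))   # approximately [3., 2., 1.]
```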
Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity, which typically scale exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment.
Estimating the joint probability mass function (PMF) of a set of random variables lies at the heart of statistical learning and signal processing. Without structural assumptions, such as modeling the variables as a Markov chain, a tree, or another graphical model, joint PMF estimation is often considered mission impossible -- the number of unknowns grows exponentially with the number of variables. But who gives us the structural model? Is there a generic, "non-parametric" way to control joint PMF complexity without relying on a priori structural assumptions about the underlying probability model? Can the operational structure be discovered without biasing the analysis up front? If we only observe random subsets of the variables, can we still estimate the joint PMF of all of them? This paper shows, perhaps surprisingly, that if the joint PMF of any three variables can be estimated, then the joint PMF of all the variables can be provably recovered under relatively mild conditions. The result is reminiscent of Kolmogorov's extension theorem -- consistent specification of lower-dimensional distributions induces a unique probability measure for the entire process. The difference is that for processes of limited complexity (the rank of the high-dimensional PMF), a complete characterization can be obtained from three-dimensional distributions alone. In fact, not all three-dimensional PMFs are needed; and under more stringent conditions, even two-dimensional ones suffice. Exploiting multilinear algebra, this paper proves that such high-dimensional PMF completion can be guaranteed -- several pertinent identifiability results are derived. It also provides a practical and efficient algorithm to carry out the recovery task. Judiciously designed simulations and real-data experiments on movie recommendation and data classification are presented to showcase the effectiveness of the approach.
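A small illustration of the model underlying the result: a finite-rank joint PMF is a nonnegative CP decomposition, and every three-variable marginal inherits the same decomposition, which is why three-dimensional statistics can suffice. The snippet below just builds such a PMF and checks this property; it is not the paper's recovery algorithm, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# A rank-R "naive Bayes" joint PMF of N discrete variables:
# P(x1,...,xN) = sum_r lam[r] * prod_n A[n][x_n, r]
N, R, K = 4, 3, 5                      # variables, rank, alphabet size
lam = rng.dirichlet(np.ones(R))
A = [rng.dirichlet(np.ones(K), size=R).T for _ in range(N)]   # K x R, columns are conditional PMFs

# Full joint PMF (only feasible here because N and K are tiny).
joint = np.einsum('r,ar,br,cr,dr->abcd', lam, A[0], A[1], A[2], A[3])
assert np.isclose(joint.sum(), 1.0)

# Any three-variable marginal is itself a nonnegative CP-decomposed 3-way tensor,
# e.g. P(x1, x2, x3): this is the low-order statistic recovery can work from.
marg_123 = joint.sum(axis=3)
cp_123 = np.einsum('r,ar,br,cr->abc', lam, A[0], A[1], A[2])
print(np.allclose(marg_123, cp_123))   # True
```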
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.
State aggregation is a model reduction method rooted in control theory and reinforcement learning. It reduces the complexity of engineered systems by mapping the system's states onto a small number of meta-states. In this paper, we study the unsupervised estimation of an unknown state aggregation structure from Markov trajectories. We formulate the state aggregation of a Markov process as a nonnegative factorization model, in which the left and right factor matrices correspond to aggregation and disaggregation distributions, respectively. By leveraging techniques developed in the context of topic modeling, we propose an efficient polynomial-time algorithm for computing the estimated state aggregation model. Under certain "anchor state" assumptions, we show that one can reliably recover the state aggregation structure from sampled transitions with high probability. Sharp divergence error bounds are proved for the estimated aggregation and disaggregation distributions, and experiments on Manhattan traffic data are provided.
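The following sketch only illustrates the model, not the paper's estimator: a soft state aggregation writes the transition matrix as a product of row-stochastic aggregation and disaggregation factors, and the empirical transition frequencies from a trajectory are the statistic such a factorization-based estimator would work from. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Soft state aggregation for a Markov chain on S states with m meta-states:
#   P[s, s'] = sum_k U[s, k] * V[k, s']
# rows of U are aggregation distributions, rows of V are disaggregation distributions.
S, m = 20, 3
U = rng.dirichlet(np.ones(m), size=S)        # S x m, rows sum to 1
V = rng.dirichlet(np.ones(S), size=m)        # m x S, rows sum to 1
P = U @ V                                    # a rank-m stochastic transition matrix

# Simulate a trajectory and form empirical transition frequencies: this is the
# matrix a nonnegative factorization of rank m would then be applied to.
T_len, s = 100_000, 0
counts = np.zeros((S, S))
for _ in range(T_len):
    s_next = rng.choice(S, p=P[s])
    counts[s, s_next] += 1
    s = s_next
P_hat = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
print(np.abs(P_hat - P).max())               # small for long trajectories
```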
Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations -- it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observations is sometimes the words in a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications.
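Below is a compact sketch of an observable-operator construction of the kind used by such spectral algorithms, applied to exact population moments of a toy HMM rather than to samples; it checks that sequence probabilities computed from the SVD-based operators match the forward algorithm. This is a simplified illustration under generic rank conditions, not a full reimplementation of the paper's method, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic HMM: m hidden states, n observation symbols.
m, n = 3, 6
T = rng.dirichlet(np.ones(m), size=m).T        # T[i, j] = P(h' = i | h = j)
O = rng.dirichlet(np.ones(n), size=m).T        # O[x, j] = P(x | h = j)
pi = rng.dirichlet(np.ones(m))                 # initial state distribution

# Population moments of the first three observations.
P1 = O @ pi                                    # P1[x] = P(x1 = x)
P21 = O @ T @ np.diag(pi) @ O.T                # P21[b, a] = P(x2 = b, x1 = a)
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T for x in range(n)]

# Observable-operator representation from an SVD of P21.
U = np.linalg.svd(P21)[0][:, :m]
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

def seq_prob(xs):
    """P(x1, ..., xt) from the observable operators."""
    v = b1
    for x in xs:
        v = B[x] @ v
    return float(binf @ v)

def forward_prob(xs):
    """Reference value from the standard forward recursion."""
    alpha = pi.copy()
    for i, x in enumerate(xs):
        alpha = alpha if i == 0 else T @ alpha
        alpha = O[x] * alpha
    return float(alpha.sum())

print(seq_prob([1, 4, 2]), forward_prob([1, 4, 2]))   # the two agree
```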
We introduce a dynamic generative model, the Bayesian allocation model (BAM), which establishes explicit connections between nonnegative tensor factorization (NTF), graphical models of discrete probability distributions and their Bayesian extensions, and topic models such as latent Dirichlet allocation. BAM is based on a Poisson process whose events are marked by using a Bayesian network, where the conditional probability tables of this network are then integrated out analytically. We show that the resulting marginal process turns out to be a Polya urn, an integer-valued self-reinforcing process. This urn process, which we name the Polya-Bayes process, obeys certain conditional independence properties that provide further insight into the properties of NTF. These insights also allow us to develop space-efficient simulation algorithms that respect the potential sparsity of the data: we propose a class of importance sampling algorithms for computing NTF and approximating its marginal likelihood, which is useful for model selection. The resulting methods can also be viewed as model-scoring methods for topic models and for discrete Bayesian networks with hidden variables. The new algorithms have favorable properties in the sparse data regime when contrasted with variational algorithms, which become more accurate as the total sum of the elements of the observed tensor goes to infinity. We illustrate the performance on several examples and numerically study the behavior of the algorithms for various data regimes.
In the Nonnegative Matrix Factorization (NMF) problem we are given an n × m nonnegative matrix M and an integer r > 0. Our goal is to express M as AW where A and W are nonnegative matrices of size n × r and r × m respectively. In some applications, it makes sense to ask instead for the product AW to approximate M, i.e. to (approximately) minimize $\|M - AW\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm; we refer to this as Approximate NMF. This problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where A and W are computed using a variety of local search heuristics. Vavasis recently proved that this problem is NP-hard. (Without the restriction that A and W be nonnegative, both the exact and approximate problems can be solved optimally via the singular value decomposition.) We initiate a study of when this problem is solvable in polynomial time. Our results are the following: 1. We give a polynomial-time algorithm for exact and approximate NMF for every constant r. Indeed NMF is most interesting in applications precisely when r is small. 2. We complement this with a hardness result: if exact NMF can be solved in time $(nm)^{o(r)}$, then 3-SAT has a sub-exponential time algorithm. This rules out substantial improvements to the above algorithm. 3. We give an algorithm that runs in time polynomial in n, m and r under the separability condition identified by Donoho and Stodden in 2003. The algorithm may be practical since it is simple and noise tolerant (under benign assumptions). Separability is believed to hold in many practical settings. To the best of our knowledge, this last result is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input, and we believe that this will be an interesting and important direction for future work.
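Under separability, the anchor (extreme) columns can be located by simple geometric procedures. The sketch below uses the successive projection idea -- repeatedly pick the column of largest norm and project its direction out -- which is one standard way to exploit separability; it is an illustrative sketch rather than the exact algorithm of this paper, and all names are illustrative.

```python
import numpy as np

def spa(X, r):
    """Successive Projection Algorithm: select r columns of X that
    (approximately) generate the data cone under a separability assumption."""
    R = X.astype(float).copy()
    anchors = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))   # farthest remaining column
        anchors.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                          # project out its direction
    return anchors

# Toy usage: hide r "pure" columns of W among mixtures and recover their indices.
rng = np.random.default_rng(0)
W = rng.random((30, 4))
H = np.hstack([np.eye(4), rng.dirichlet(np.ones(4), size=60).T])   # separable H
perm = rng.permutation(H.shape[1])
X = W @ H[:, perm]
print(sorted(spa(X, 4)), sorted(np.where(perm < 4)[0]))   # same index sets
```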
Algorithms such as Latent Dirichlet Allocation (LDA) have achieved significant progress in modeling word-document relationships. These algorithms assume each word in the document was generated by a hidden topic and explicitly model the word distribution of each topic as well as the prior distribution over topics in the document. Given these parameters, the topics of all words in the same document are assumed to be independent. In this paper, we propose modeling the topics of words in the document as a Markov chain. Specifically, we assume that all words in the same sentence have the same topic, and successive sentences are more likely to have the same topics. Since the topics are hidden, this leads to using the well-known tools of Hidden Markov Models for learning and inference. We show that incorporating this dependency allows us to learn better topics and to disambiguate words that can belong to different topics. Quantitatively, we show that we obtain better perplexity in modeling documents with only a modest increase in learning and inference complexity.
We propose a new method of estimation in topic models, which is not a variation on the existing simplex-finding algorithms and which estimates the number of topics K from the observed data. We derive new finite-sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios in which our estimator is minimax adaptive. Our finite-sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to grow with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the existing ones, despite the fact that we start with the computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., "the" is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words.
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning [15] and in theory [27] have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD -- just as NMF has come to replace SVD in many applications.
We propose a general algorithmic framework for constrained matrix and tensor factorization, which is widely used in signal processing and machine learning. The new framework is a hybrid between alternating optimization (AO) and the alternating direction method of multipliers (ADMM): each matrix factor is updated in turn, using ADMM, hence the name AO-ADMM. This combination can naturally accommodate a great variety of constraints on the factor matrices, and almost all possible loss measures for the fitting. Computation caching and warm start strategies are used to ensure that each update is evaluated efficiently, while the outer AO framework exploits recent developments in block coordinate descent (BCD)-type methods which help ensure that every limit point is a stationary point, as well as faster and more robust convergence in practice. Three special cases are studied in detail: non-negative matrix/tensor factorization, constrained matrix/tensor completion, and dictionary learning. Extensive simulations and experiments with real data are used to showcase the effectiveness and broad applicability of the proposed framework.
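For the plain NMF special case, the idea can be sketched as follows: each factor is updated by an ADMM solve of a nonnegative least-squares subproblem, with the Gram matrix cached (Cholesky-factored once per outer update). This is a minimal sketch under common default choices (e.g., the rho heuristic below); it includes none of the other constraint/loss options the framework supports, and all names are illustrative.

```python
import numpy as np

def admm_nnls(W, X, H0, rho=None, n_iter=20):
    """ADMM for the inner subproblem min_{H >= 0} 0.5 * ||X - W H||_F^2."""
    r = W.shape[1]
    rho = rho if rho is not None else np.trace(W.T @ W) / r   # common heuristic
    L = np.linalg.cholesky(W.T @ W + rho * np.eye(r))         # cached factorization
    WtX = W.T @ X
    H, U = H0.copy(), np.zeros_like(H0)
    for _ in range(n_iter):
        Ht = np.linalg.solve(L.T, np.linalg.solve(L, WtX + rho * (H - U)))  # LS step
        H = np.maximum(0.0, Ht + U)                                         # projection step
        U = U + Ht - H                                                      # dual update
    return H

def ao_admm_nmf(X, r, n_outer=100, seed=0):
    """Alternating optimization with ADMM inner solves: X ~ W H, W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W, H = rng.random((n, r)), rng.random((r, m))
    for _ in range(n_outer):
        H = admm_nnls(W, X, H)
        W = admm_nnls(H.T, X.T, W.T).T
    return W, H

X = np.random.default_rng(1).random((60, 4)) @ np.random.default_rng(2).random((4, 50))
W, H = ao_admm_nmf(X, 4)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))   # small relative fit error
```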
Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed as the mixed membership Dirichlet model, first introduced by Airoldi et al. (2008). This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g., singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model.
This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
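For concreteness, here is a bare-bones stochastic gradient loop on a least-squares problem, one randomly sampled example per step with a diminishing step size -- the prototypical SG method such an article analyzes. The data, step-size schedule, and iteration count are illustrative, not recommendations from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression data.
n, d = 10_000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Plain stochastic gradient descent on the least-squares loss.
w = np.zeros(d)
for t in range(1, 200_001):
    i = rng.integers(n)
    grad = (X[i] @ w - y[i]) * X[i]            # gradient of 0.5 * (x_i^T w - y_i)^2
    w -= (0.01 / (1.0 + 1e-4 * t)) * grad      # diminishing step size
print(np.linalg.norm(w - w_true))              # small after enough passes
```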
We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. We first develop a novel "stratified" SGD variant (SSGD) that applies to general loss-minimization problems in which the loss function can be expressed as a weighted sum of "stratum losses." We establish sufficient conditions for convergence of SSGD using results from stochastic approximation theory and regenerative process theory. We then specialize SSGD to obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using MapReduce. DSGD has good speed-up behavior and handles a wide variety of matrix factorizations. We describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms.
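The per-entry update that DSGD distributes is ordinary SGD on the observed squared error. The sketch below shows that single-machine update on a small completion-style problem, without the stratification and MapReduce distribution that are the paper's contribution; hyperparameters and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-rank ground truth and a sparse set of observed entries (collaborative-filtering style).
n, m, r = 200, 150, 5
A_true = rng.standard_normal((n, r))
B_true = rng.standard_normal((m, r))
obs = [(i, j) for i in range(n) for j in range(m) if rng.random() < 0.3]
vals = {(i, j): A_true[i] @ B_true[j] for (i, j) in obs}

# Plain SGD on the observed squared error with L2 regularization.
A = 0.1 * rng.standard_normal((n, r))
B = 0.1 * rng.standard_normal((m, r))
lr, lam = 0.02, 0.01
for epoch in range(30):
    rng.shuffle(obs)
    for (i, j) in obs:
        err = A[i] @ B[j] - vals[(i, j)]
        A[i], B[j] = A[i] - lr * (err * B[j] + lam * A[i]), B[j] - lr * (err * A[i] + lam * B[j])

rmse = np.sqrt(np.mean([(A[i] @ B[j] - vals[(i, j)]) ** 2 for (i, j) in obs]))
print(rmse)   # small training RMSE on the observed entries
```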
Sparsity is an important modeling tool that expands the applicability of convex formulations for data analysis, however it also creates significant challenges for efficient algorithm design. In this paper we investigate the generalized conditional gradient (GCG) algorithm for solving sparse optimization problems, demonstrating that, with some enhancements, it can provide a more efficient alternative to current state-of-the-art approaches. After studying the convergence properties of GCG for general convex composite problems, we develop efficient methods for evaluating polar operators, a subroutine that is required in each GCG iteration. In particular, we show how the polar operator can be efficiently evaluated in learning low-rank matrices, instantiated with detailed examples on matrix completion and dictionary learning. A further improvement is achieved by interleaving GCG with fixed-rank local subspace optimization. A series of experiments on matrix completion, multi-class classification, and multi-view dictionary learning shows that the proposed method can significantly reduce the training cost of current alternatives.
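Below is a minimal sketch of the underlying conditional-gradient (Frank-Wolfe) iteration for a trace-norm-constrained matrix completion problem, where the polar operator reduces to a single top singular pair; it omits the GCG enhancements described above (regularized formulation, interleaved fixed-rank local optimization). Names and the oracle choice of the trace-norm budget are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix completion over a trace-norm ball via the basic conditional gradient method.
n, m, r = 80, 60, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
mask = rng.random((n, m)) < 0.4                 # observed entries
tau = np.linalg.svd(M, compute_uv=False).sum()  # trace-norm budget (oracle choice here)

X = np.zeros((n, m))
for t in range(200):
    G = mask * (X - M)                          # gradient of 0.5 * ||P_Omega(X - M)||_F^2
    U, s, Vt = np.linalg.svd(-G)                # polar operator: top singular pair of -G
    S = tau * np.outer(U[:, 0], Vt[0])          # extreme point of the trace-norm ball
    gamma = 2.0 / (t + 2.0)                     # standard Frank-Wolfe step size
    X = (1 - gamma) * X + gamma * S
print(np.linalg.norm(mask * (X - M)) / np.linalg.norm(mask * M))   # observed-entry fit
```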
Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation -- every co-clustering is associated with an approximation of the original data matrix, and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as k-means and co-clustering algorithms such as information-theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and compression of categorical data matrices.