2018-03-03 translated by 谷歌翻译

2016-07-06 Tensors or {\em multi-way arrays} are functions of three or more indices$(i,j,k,\cdots)$ -- similar to matrices (two-way arrays), which are functionsof two indices $(r,c)$ for (row,column). Tensors have a rich history,stretching over almost a century, and touching upon numerous disciplines; butthey have only recently become ubiquitous in signal and data analytics at theconfluence of signal processing, statistics, data mining and machine learning.This overview article aims to provide a good starting point for researchers andpractitioners interested in learning about and working with tensors. As such,it focuses on fundamentals and motivation (using various application examples),aiming to strike an appropriate balance of breadth {\em and depth} that willenable someone having taken first graduate courses in matrix algebra andprobability to get started doing research and/or developing tensor algorithmsand software. Some background in applied optimization is useful but notstrictly required. The material covered includes tensor rank and rankdecomposition; basic tensor factorization models and their relationships andproperties (including fairly good coverage of identifiability); broad coverageof algorithms ranging from alternating optimization to stochastic gradient;statistical performance analysis; and applications ranging from sourceseparation to collaborative filtering, mixture and topic modeling,classification, and multilinear subspace learning.
translated by 谷歌翻译

2012-10-29 translated by 谷歌翻译

2012-03-03 Mixture models are a fundamental tool in applied statistics and machinelearning for treating data taken from multiple subpopulations. The currentpractice for estimating the parameters of such models relies on local searchheuristics (e.g., the EM algorithm) which are prone to failure, and existingconsistent methods are unfavorable due to their high computational and samplecomplexity which typically scale exponentially with the number of mixturecomponents. This work develops an efficient method of moments approach toparameter estimation for a broad class of high-dimensional mixture models withmany components, including multi-view mixtures of Gaussians (such as mixturesof axis-aligned Gaussians) and hidden Markov models. The new method leads torigorous unsupervised learning results for mixture models that were notachieved by previous works; and, because of its simplicity, it offers a viablealternative to EM for practical deployment.
translated by 谷歌翻译

2017-12-01 translated by 谷歌翻译

2012-12-19 Topic models provide a useful method for dimensionality reduction andexploratory data analysis in large text corpora. Most approaches to topic modelinference have been based on a maximum likelihood objective. Efficientalgorithms exist that approximate this objective, but they have no provableguarantees. Recently, algorithms have been introduced that provide provablebounds, but these algorithms are not practical because they are inefficient andnot robust to violations of model assumptions. In this paper we present analgorithm for topic model inference that is both provable and practical. Thealgorithm produces results comparable to the best MCMC implementations whilerunning orders of magnitude faster.
translated by 谷歌翻译

2018-11-06 translated by 谷歌翻译

2008-11-26 Hidden Markov Models (HMMs) are one of the most fundamental and widely usedstatistical tools for modeling discrete time series. In general, learning HMMsfrom data is computationally hard (under cryptographic assumptions), andpractitioners typically resort to search heuristics which suffer from the usuallocal optima issues. We prove that under a natural separation condition (boundson the smallest singular value of the HMM parameters), there is an efficientand provably correct algorithm for learning HMMs. The sample complexity of thealgorithm does not explicitly depend on the number of distinct (discrete)observations---it implicitly depends on this quantity through spectralproperties of the underlying HMM. This makes the algorithm particularlyapplicable to settings with a large number of observations, such as those innatural language processing where the space of observation is sometimes thewords in a language. The algorithm is also simple, employing only a singularvalue decomposition and matrix multiplications.
translated by 谷歌翻译

2019-03-11 translated by 谷歌翻译 In the Nonnegative Matrix Factorization (NMF) problem we are given an n × m nonnegative matrix M and an integer r > 0. Our goal is to express M as AW where A and W are nonnegative matrices of size n × r and r × m respectively. In some applications, it makes sense to ask instead for the product AW to approximate M-i.e. (approximately) minimize M − AW F where F denotes the Frobenius norm; we refer to this as Approximate NMF. This problem has a rich history spanning quantum mechanics, probability theory, data analysis , polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where A and W are computed using a variety of local search heuristics. Vavasis recently proved that this problem is NP-complete. (Without the restriction that A and W be nonnegative, both the exact and approximate problems can be solved optimally via the singular value decomposition.) We initiate a study of when this problem is solvable in polynomial time. Our results are the following: 1. We give a polynomial-time algorithm for exact and approximate NMF for every constant r. Indeed NMF is most interesting in applications precisely when r is small. 2. We complement this with a hardness result, that if exact N M F can be solved in time (nm) o(r) , 3-SAT has a sub-exponential time algorithm. This rules out substantial improvements to the above algorithm. 3. We give an algorithm that runs in time polynomial in n, m and r under the separablity condition identified by Donoho and Stodden in 2003. The algorithm may be practical since it is simple and noise tolerant (under benign assumptions). Separability is believed to hold in many practical settings. To the best of our knowledge, this last result is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input and we believe that this will be an interesting and important direction for future work.
translated by 谷歌翻译 Algorithms such as Latent Dirichlet Allocation (LDA) have achieved significant progress in modeling word document relationships. These algorithms assume each word in the document was generated by a hidden topic and explicitly model the word distribution of each topic as well as the prior distribution over topics in the document. Given these parameters , the topics of all words in the same document are assumed to be independent. In this paper, we propose modeling the topics of words in the document as a Markov chain. Specifically, we assume that all words in the same sentence have the same topic, and successive sentences are more likely to have the same topics. Since the topics are hidden , this leads to using the well-known tools of Hidden Markov Models for learning and inference. We show that incorporating this dependency allows us to learn better topics and to disambiguate words that can belong to different topics. Quantitatively, we show that we obtain better perplexity in model-ing documents with only a modest increase in learning and inference complexity.
translated by 谷歌翻译

2018-05-17 translated by 谷歌翻译 We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., "the" is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong base-lines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words.
translated by 谷歌翻译 Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning  and in theory  have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD-just as NMF has come to replace SVD in many applications.
translated by 谷歌翻译 We propose a general algorithmic framework for constrained matrix and tensor factorization, which is widely used in signal processing and machine learning. The new framework is a hybrid between alternating optimization (AO) and the alternating direction method of multipliers (ADMM): each matrix factor is updated in turn, using ADMM, hence the name AO-ADMM. This combination can naturally accommodate a great variety of constraints on the factor matrices, and almost all possible loss measures for the fitting. Computation caching and warm start strategies are used to ensure that each update is evaluated efficiently, while the outer AO framework exploits recent developments in block coordinate descent (BCD)-type methods which help ensure that every limit point is a stationary point, as well as faster and more robust convergence in practice. Three special cases are studied in detail: non-negative matrix/tensor factorization, constrained matrix/tensor completion, and dictionary learning. Extensive simulations and experiments with real data are used to showcase the effectiveness and broad applicability of the proposed framework.
translated by 谷歌翻译 Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed as the mixed membership Dirichlet model, first introduced by Airoldi et al. (2008). This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g., singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model.
translated by 谷歌翻译 This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
translated by 谷歌翻译 We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. We first develop a novel "stratified" SGD variant (SSGD) that applies to general loss-minimization problems in which the loss function can be expressed as a weighted sum of "stratum losses." We establish sufficient conditions for convergence of SSGD using results from stochastic approximation theory and regenerative process theory. We then specialize SSGD to obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using MapReduce. DSGD has good speed-up behavior and handles a wide variety of matrix factorizations. We describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms. 1
translated by 谷歌翻译 Sparsity is an important modeling tool that expands the applicability of convex formulations for data analysis, however it also creates significant challenges for efficient algorithm design. In this paper we investigate the generalized conditional gradient (GCG) algorithm for solving sparse optimization problems-demonstrating that, with some enhancements, it can provide a more efficient alternative to current state of the art approaches. After studying the convergence properties of GCG for general convex composite problems, we develop efficient methods for evaluating polar operators, a subroutine that is required in each GCG iteration. In particular, we show how the polar operator can be efficiently evaluated in learning low-rank matrices, instantiated with detailed examples on matrix completion and dictionary learning. A further improvement is achieved by interleaving GCG with fixed-rank local subspace optimization. A series of experiments on matrix completion, multi-class classification, and multi-view dictionary learning shows that the proposed method can significantly reduce the training cost of current alternatives.
translated by 谷歌翻译 Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation-every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a vari-c 2007 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha. BANERJEE, DHILLON, GHOSH, MERUGU AND MODHA ety of problem domains and also describe novel co-clustering applications such as missing value prediction and compression of categorical data matrices.
translated by 谷歌翻译
${authors} 分类：${tags}
${pubdate} \${abstract_cn}
translated by 谷歌翻译