Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other. Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA in which both the observations and the independent components are over a finite alphabet. In this work we consider a generalization of this framework in which an observation vector is decomposed into its independent components (as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow's minimal redundancy representation problem and is considered an open problem. We propose several theorems and show that this NP-hard problem can be accurately solved with a branch-and-bound search tree algorithm, or tightly approximated with a series of linear problems. Our contribution provides the first efficient and constructive set of solutions to Barlow's problem. The minimal redundancy representation (also known as a factorial code) has many applications, mainly in the fields of neural networks and deep learning. Binary ICA (BICA) is also shown to have applications in several domains, including medical diagnosis, multi-cluster assignment, network tomography and internet resource management. In this work we show that this formulation further applies to multiple disciplines in source coding, such as predictive coding, distributed source coding and coding of large-alphabet sources.
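For reference, Barlow's minimal redundancy (factorial code) objective described above is usually stated as minimizing the total marginal entropy of an invertible representation; a sketch of this standard formulation, with notation assumed rather than quoted from the paper:

\[
\min_{g \ \text{invertible}} \ \sum_{i=1}^{d} H(Y_i), \qquad Y = g(X),
\]

where the gap $\sum_i H(Y_i) - H(X)$ is the redundancy of the representation, and a factorial code is one that drives this gap to zero.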
Independent component analysis (ICA) is a statistical tool that decomposes an observed random vector into components that are as statistically independent as possible. ICA over finite fields is a special case of ICA in which both the observations and the decomposed components take values in a finite alphabet. This problem is also known as minimal redundancy representation or factorial coding. In this work we focus on linear methods for ICA over finite fields. We introduce a basic lower bound that places a fundamental limit on the ability of any linear solution to this problem. Based on this bound, we propose an algorithm that outperforms all currently known methods. Importantly, the overhead of our suggested algorithm (compared with the lower bound) typically decreases as the problem size grows. In addition, we provide sub-optimal variants of our suggested method that significantly reduce the computational complexity at a relatively small cost in performance. Finally, we discuss the universal abilities of linear transformations of a random vector, compared with existing non-linear solutions.
The Information Bottleneck (IB) is a conceptual method for extracting the most compact, yet informative, representation of a set of variables with respect to a target. It generalizes the notion of minimal sufficient statistics from classical parametric statistics to a broader information-theoretic sense. The IB curve defines the optimal trade-off between representation complexity and its predictive power. Specifically, it is achieved by minimizing the mutual information (MI) between the representation and the original variables, subject to a minimal level of MI between the representation and the target. This problem is known to be NP-hard in general. One important exception is the multivariate Gaussian case, for which the Gaussian IB (GIB) admits an analytical closed-form solution, similar to Canonical Correlation Analysis (CCA). In this work we introduce a Gaussian lower bound to the IB curve; we find an embedding of the data which maximizes its "Gaussian part", on which we apply the GIB. This embedding provides an efficient (and practical) representation of any arbitrary data-set (in the IB sense), which in addition holds the favorable properties of a Gaussian distribution. Importantly, we show that the optimal Gaussian embedding is bounded from above by non-linear CCA. This provides a fundamental limit on our ability to Gaussianize arbitrary data-sets and solve complex problems by linear methods.
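For context, the IB trade-off mentioned above is typically posed as a variational problem over the stochastic mapping $p(t \mid x)$; a standard formulation (notation assumed here, not quoted from the paper):

\[
\min_{p(t \mid x)} \ I(X;T) - \beta\, I(T;Y),
\]

where $X$ is the original variable, $Y$ the target, $T$ the compressed representation, and sweeping $\beta \ge 0$ traces the IB curve of complexity $I(X;T)$ versus predictive power $I(T;Y)$.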
Canonical correlation analysis (CCA) is a linear representation learning method that seeks maximally correlated variables in multi-view data. Non-linear CCA extends this notion to a broader family of transformations, which is more powerful for many real-world applications. Given the joint probability distribution, the alternating conditional expectations (ACE) algorithm provides an optimal solution to the non-linear CCA problem. However, when only a finite number of observations is available, it suffers from limited performance and an increased computational burden. In this work, we introduce an information-theoretic framework for the non-linear CCA problem (ITCCA), which extends the classical ACE approach. Our suggested framework seeks compressed representations of the data that allow a maximal level of correlation. This way we control the trade-off between the flexibility and the complexity of the representation. Compared with non-linear alternatives in the finite-sample regime, our approach demonstrates favorable performance at a reduced computational burden. In addition, ITCCA provides theoretical bounds and optimality conditions, as we establish fundamental connections to rate-distortion theory, the information bottleneck and remote source coding. Moreover, it implies a "soft" dimensionality reduction, as the compression level is measured (and controlled) by the mutual information between the original noisy data and the signal we extract.
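For context, the ACE solution referenced above alternates conditional-expectation updates of the two transformations; a hedged sketch of that iteration in standard notation (not quoted from the paper):

\[
\rho = \max_{f,g}\ \mathbb{E}\big[f(X)\,g(Y)\big] \quad \text{s.t.} \quad \mathbb{E}[f(X)]=\mathbb{E}[g(Y)]=0,\ \ \mathbb{E}[f^2(X)]=\mathbb{E}[g^2(Y)]=1,
\]

with the alternating updates $f(x) \propto \mathbb{E}[g(Y) \mid X=x]$ and $g(y) \propto \mathbb{E}[f(X) \mid Y=y]$, each followed by re-centering and re-normalization; the finite-sample difficulty noted above arises because these conditional expectations must be estimated from data.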
In this paper we review some recent interactions between harmonic analysis and data compression. The story goes back of course to Shannon's R(D) theory in the case of Gaussian stationary processes, which says that transforming into a Fourier basis followed by block coding gives an optimal lossy compression technique; practical developments like transform-based image compression have been inspired by this result. In this paper we also discuss connections perhaps less familiar to the Information Theory community, growing out of the field of harmonic analysis. Recent harmonic analysis constructions, such as wavelet transforms and Gabor transforms, are essentially optimal transforms for transform coding in certain settings. Some of these transforms are under consideration for future compression standards. We discuss some of the lessons of harmonic analysis in this century. Typically, the problems and achievements of this field have involved goals that were not obviously related to practical data compression, and have used a language not immediately accessible to outsiders. Nevertheless, through an extensive generalization of what Shannon called the "sampling theorem," harmonic analysis has succeeded in developing new forms of functional representation which turn out to have significant data compression interpretations. We explain why harmonic analysis has interacted with data compression, and we describe some interesting recent ideas in the field that may affect data compression in the future.
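As a reminder of the Gaussian R(D) result this review builds on, the rate-distortion function of a stationary Gaussian process with power spectral density $S(\omega)$ under squared-error distortion is given by reverse water-filling (stated here for reference, not quoted from the paper):

\[
D(\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \min\{\theta,\, S(\omega)\}\, d\omega, \qquad
R(\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \max\Big\{0,\ \tfrac{1}{2}\log \frac{S(\omega)}{\theta}\Big\}\, d\omega,
\]

and it is achieved by coding in the Fourier (eigen) basis, which is the optimality property that transform coding inherits.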
A unified view of the area of sparse signal processing is presented in tutorial form by bringing together various fields in which the property of sparsity has been successfully exploited. For each of these fields, various algorithms and techniques, which have been developed to leverage sparsity, are described succinctly. The common potential benefits of significant reduction in sampling rate and processing manipulations through sparse signal processing are revealed. The key application domains of sparse signal processing are sampling, coding, spectral estimation, array processing, component analysis, and multipath channel estimation. In terms of the sampling process and reconstruction algorithms, linkages are made with random sampling, compressed sensing, and rate of innovation. The redundancy introduced by channel coding in finite and real Galois fields is then related to over-sampling with similar reconstruction algorithms. The error locator polynomial (ELP) and iterative methods are shown to work quite effectively for both sampling and coding applications. The methods of Prony, Pisarenko, and MUltiple SIgnal Classification (MUSIC) are next shown to be targeted at analyzing signals with sparse frequency domain representations. Specifically, the relations of the approach of Prony to an annihilating filter in rate of innovation and ELP in coding are emphasized; the Pisarenko and MUSIC methods are further improvements of the Prony method under noisy environments. The iterative methods developed for sampling and coding applications are shown to be powerful tools in spectral estimation. Such narrowband spectral estimation is then related to multi-source location and direction of arrival estimation in array processing. Sparsity in unobservable source signals is also shown to facilitate source separation in sparse component analysis; the algorithms developed in this area such as linear programming and matching pursuit are also widely used in compressed sensing. Finally, the multipath channel estimation problem is shown to have a sparse formulation; algorithms similar to sampling and coding are used to estimate typical multicarrier communication channels.
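Several of the linkages above (random sampling, compressed sensing, ELP-based decoding) reduce to recovering a sparse vector from a small number of linear measurements; the canonical convex relaxation, stated here for reference with assumed notation:

\[
\hat{x} = \arg\min_{x} \ \|x\|_1 \quad \text{subject to} \quad y = \Phi x,
\]

where $y \in \mathbb{R}^m$ are the measurements, $\Phi \in \mathbb{R}^{m \times n}$ with $m \ll n$ is the sensing matrix, and $x$ is assumed sparse; a noisy variant replaces the constraint with $\|y - \Phi x\|_2 \le \epsilon$.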
Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation-every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and compression of categorical data matrices.
A discrete denoising algorithm estimates the input sequence to a discrete memoryless channel (DMC) based on the observation of the entire output sequence. For the case in which the DMC is known and the quality of the reconstruction is evaluated with a given single-letter fidelity criterion, we propose a discrete denoising algorithm that does not assume knowledge of statistical properties of the input sequence. Yet, the algorithm is universal in the sense of asymptotically performing as well as the optimum denoiser that knows the input sequence distribution, which is only assumed to be stationary and ergodic. Moreover, the algorithm is universal also in a semi-stochastic setting, in which the input is an individual sequence, and the randomness is due solely to the channel noise. The proposed denoising algorithm is practical, requiring a linear number of register-level operations and sub-linear working storage size relative to the input data length.
This article reviews the principle of minimum description length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed attention within the statistics community. Here we review both the practical and the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines we find many interesting interpretations of popular frequentist and Bayesian procedures. As we show, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate the MDL principle by considering problems in regression, nonparametric curve estimation, cluster analysis, and time series analysis. Because model selection in linear regression is an extremely common problem that arises in many applications, we present detailed derivations of several MDL criteria in this context and discuss their properties through a number of examples. Our emphasis is on the practical application of MDL, and hence we make extensive use of real datasets. In writing this review, we tried to make the descriptive philosophy of MDL natural to a statistics audience by examining classical problems in model selection. In the engineering literature, however, MDL is being applied to ever more exotic modeling situations. As a principle for statistical modeling in general, one strength of MDL is that it can be intuitively extended to provide useful tools for new problems. The principle of parsimony, or Occam's razor, implicitly motivates the process of data analysis and statistical modeling and is the soul of model selection. Formally, the need for model selection arises when investigators must decide among model classes based on data. These classes might be indistinguishable from the standpoint of existing subject knowledge or scientific theory, and the selection of a particular model class implies the confirmation or revision of a given theory. To implement the parsimony principle, one must quantify "parsimony" of a model relative to the available data. Applying this measure to a number of candidates, we search for a concise model that provides a good fit to the data. Rissanen (1978) distilled such thinking in his principle of minimum description length (MDL): Choose the model that gives the shortest description of data. In this framework a concise model is one that is easy to describe, whereas a good fit implies that the model captures or describes the important features evident in the data. MDL has its intellectual roots in the algorithmic or descriptive complexity theory of Kolmogorov, Chaitin, and Solomonoff.
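The "shortest description" criterion above is most easily seen in its two-part form; a standard statement, with notation assumed here:

\[
\hat{M} = \arg\min_{M \in \mathcal{M}} \ \Big\{ L(M) + L(D \mid M) \Big\},
\]

where $L(M)$ is the code length needed to describe the model (including its fitted parameters) and $L(D \mid M) = -\log_2 p(D \mid \hat\theta_M)$ is the code length of the data under that model; a concise model keeps the first term small, while a good fit keeps the second small.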
We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon's basic source coding theorem. The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms. We assess the performance of the minimum description length criterion both from the vantage point of quality of data compression and accuracy of statistical inference. Context tree modeling, density estimation, and model selection in Gaussian linear regression serve as examples.
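For reference, the normalized maximized likelihood coding mentioned above assigns to a sequence $x^n$ the probability (standard form, not quoted from the paper):

\[
\hat{p}_{\mathrm{NML}}(x^n) = \frac{p\big(x^n \mid \hat\theta(x^n)\big)}{\sum_{y^n} p\big(y^n \mid \hat\theta(y^n)\big)},
\]

so that the stochastic complexity $-\log \hat{p}_{\mathrm{NML}}(x^n)$ equals the maximized negative log-likelihood plus the logarithm of the normalizing sum, the latter acting as the parametric complexity of the model class.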
This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
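The SG method at the center of this review is, in its plainest form, the iteration (standard notation, assumed here):

\[
w_{k+1} = w_k - \alpha_k \, \nabla f_{i_k}(w_k), \qquad i_k \sim \mathrm{Uniform}\{1,\dots,n\},
\]

where the objective is the empirical risk $F(w)=\tfrac{1}{n}\sum_{i=1}^{n} f_i(w)$, $\alpha_k$ is the step size, and the single-sample gradient $\nabla f_{i_k}(w_k)$ is an unbiased estimate of $\nabla F(w_k)$; the noise-reduction and second-order techniques discussed in the paper can be read as modifications of this basic update.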
The deterministic annealing approach to clustering and its extensions have demonstrated substantial performance improvement over standard supervised and unsupervised learning methods in a variety of important applications including compression, estimation, pattern recognition and classification, and statistical regression. The method offers three important features: 1) the ability to avoid many poor local optima; 2) applicability to many different structures/architectures; and 3) the ability to minimize the right cost function even when its gradients vanish almost everywhere, as in the case of the empirical classification error. It is derived within a probabilistic framework from basic information theoretic principles (e.g., maximum entropy and random coding). The application-specific cost is minimized subject to a constraint on the randomness (Shannon entropy) of the solution, which is gradually lowered. We emphasize intuition gained from analogy to statistical physics, where this is an annealing process that avoids many shallow local minima of the specified cost and, at the limit of zero "temperature," produces a nonrandom (hard) solution. Alternatively, the method is derived within rate-distortion theory, where the annealing process is equivalent to computation of Shannon's rate-distortion function, and the annealing temperature is inversely proportional to the slope of the curve. This provides new insights into the method and its performance, as well as new insights into rate-distortion theory itself. The basic algorithm is extended by incorporating structural constraints to allow optimization of numerous popular structures including vector quantizers, decision trees, multilayer perceptrons, radial basis functions, and mixtures of experts. Experimental results show considerable performance gains over standard structure-specific and application-specific training methods. The paper concludes with a brief discussion of extensions of the method that are currently under investigation.
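The annealing step described above can be summarized by an entropy-constrained free-energy minimization and the resulting Gibbs association probabilities; a standard sketch (not quoted from the paper):

\[
\min_{p(c \mid x)} \ \big\{ D - T H \big\}
\quad\Longrightarrow\quad
p(c \mid x) = \frac{\exp\!\big(-d(x, c)/T\big)}{\sum_{c'} \exp\!\big(-d(x, c')/T\big)},
\]

where $D = \sum_{x,c} p(x)\,p(c \mid x)\, d(x,c)$ is the expected distortion, $H$ is the Shannon entropy of the assignments, and gradually lowering the temperature $T$ hardens the soft assignments toward a nonrandom solution.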
A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.
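To make the hard-clustering iteration concrete, here is a minimal sketch in Python, assuming squared Euclidean distance as the running Bregman divergence; the function names and the synthetic data are illustrative, not taken from the paper.

```python
# A minimal sketch of Bregman hard clustering in the spirit of the algorithm
# described above, assuming squared Euclidean distance as the running Bregman
# divergence. Names and data below are illustrative only.
import numpy as np

def squared_euclidean(X, mu):
    # d_phi(x, mu) for phi(x) = ||x||^2 reduces to the squared Euclidean distance.
    return ((X - mu) ** 2).sum(axis=1)

def bregman_hard_cluster(X, k, divergence, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster whose centroid has the
        # smallest Bregman divergence to it.
        dists = np.stack([divergence(X, mu) for mu in centroids], axis=1)
        labels = dists.argmin(axis=1)
        # Update step: for every Bregman divergence, the loss-minimizing
        # representative of a cluster is its arithmetic mean.
        centroids = np.stack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Toy usage: two well-separated Gaussian blobs.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(size=(50, 2)), data_rng.normal(size=(50, 2)) + 4.0])
labels, centroids = bregman_hard_cluster(X, k=2, divergence=squared_euclidean)
```

The update step relies on the property highlighted in the paper that, for any Bregman divergence, the cluster representative minimizing the within-cluster loss is the arithmetic mean, which is what keeps the iteration as simple as classical kmeans.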
We study two families of error-correcting codes defined in terms of very sparse matrices. "MN" (MacKay-Neal) codes are recently invented, and "Gallager codes" were first investigated in 1962, but appear to have been largely forgotten, in spite of their excellent properties. The decoding of both codes can be tackled with a practical sum-product algorithm. We prove that these codes are "very good," in that sequences of codes exist which, when optimally decoded, achieve information rates up to the Shannon limit. This result holds not only for the binary-symmetric channel but also for any channel with symmetric stationary ergodic noise. We give experimental results for binary-symmetric channels and Gaussian channels demonstrating that practical performance substantially better than that of standard convolutional and concatenated codes can be achieved; indeed, the performance of Gallager codes is almost as close to the Shannon limit as that of turbo codes.
A major topic in the theory of evolutionary algorithms and, more generally, of randomized black-box optimization techniques, is runtime analysis. Runtime analysis aims to understand the performance of a given heuristic on a given problem by bounding the number of function evaluations it needs to identify a solution of a desired quality. As in general algorithm theory, this running-time perspective is most insightful when complemented by a meaningful complexity theory, which reveals the limits of algorithmic solutions. In the context of discrete black-box optimization, several black-box complexity models have been developed to analyze the best possible performance that a black-box optimization algorithm can achieve on a given problem. These models differ in the classes of algorithms to which the resulting lower bounds apply. In this way, black-box complexity contributes to a better understanding of how exactly algorithmic choices (for example, the amount of memory used by a heuristic, its selection pressure, or the strategy it uses to create new candidate solutions) influence performance. In this chapter we review the different black-box complexity models that have been proposed in the literature, survey the bounds obtained within these models, and discuss how the interplay between runtime analysis and black-box complexity can inspire new algorithmic solutions to problems studied in evolutionary computation. We also discuss in this chapter several interesting open questions for future work.
The F-measure, which has originally been introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.
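For reference, the measure being maximized is, in its binary form (stated here for completeness):

\[
F_\beta = \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}},
\qquad
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FN} + \mathrm{FP}},
\]

a non-decomposable function of the whole prediction vector, which is why no closed-form plug-in rule exists and why decomposable surrogates such as Hamming loss can incur a large regret.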
Tensors or {\em multi-way arrays} are functions of three or more indices $(i,j,k,\cdots)$ -- similar to matrices (two-way arrays), which are functions of two indices $(r,c)$ for (row, column). Tensors have a rich history, stretching over almost a century, and touching upon numerous disciplines; but they have only recently become ubiquitous in signal and data analytics at the confluence of signal processing, statistics, data mining and machine learning. This overview article aims to provide a good starting point for researchers and practitioners interested in learning about and working with tensors. As such, it focuses on fundamentals and motivation (using various application examples), aiming to strike an appropriate balance of breadth {\em and depth} that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing tensor algorithms and software. Some background in applied optimization is useful but not strictly required. The material covered includes tensor rank and rank decomposition; basic tensor factorization models and their relationships and properties (including fairly good coverage of identifiability); broad coverage of algorithms ranging from alternating optimization to stochastic gradient; statistical performance analysis; and applications ranging from source separation to collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning.
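As an anchor for the rank-decomposition material surveyed here, recall the canonical polyadic (CP) model of a three-way tensor (standard notation, assumed):

\[
\underline{X} \approx \sum_{f=1}^{F} a_f \circ b_f \circ c_f,
\qquad
x_{ijk} \approx \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf},
\]

where $\circ$ denotes the outer product and the smallest $F$ for which equality holds is the tensor rank; alternating optimization and stochastic gradient, both covered in the overview, are the workhorse methods for fitting the factor matrices $A$, $B$, $C$.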
Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. The problem mostly manifests itself in subscriber-based settings, where user-specific ensembles need to be stored on personal devices with strict storage limitations (such as cellular devices). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, while being small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.