A unified view of the area of sparse signal processing is presented in tutorial form by bringing together various fields in which the property of sparsity has been successfully exploited. For each of these fields, various algorithms and techniques, which have been developed to leverage sparsity, are described succinctly. The common potential benefits of significant reduction in sampling rate and processing manipulations through sparse signal processing are revealed. The key application domains of sparse signal processing are sampling, coding, spectral estimation, array processing, component analysis, and multipath channel estimation. In terms of the sampling process and reconstruction algorithms, linkages are made with random sampling, compressed sensing, and rate of innovation. The redundancy introduced by channel coding in finite and real Galois fields is then related to over-sampling with similar reconstruction algorithms. The error locator polynomial (ELP) and iterative methods are shown to work quite effectively for both sampling and coding applications. The methods of Prony, Pisarenko, and MUltiple SIgnal Classification (MUSIC) are next shown to be targeted at analyzing signals with sparse frequency domain representations. Specifically, the relations of the approach of Prony to an annihilating filter in rate of innovation and ELP in coding are emphasized; the Pisarenko and MUSIC methods are further improvements of the Prony method under noisy environments. The iterative methods developed for sampling and coding applications are shown to be powerful tools in spectral estimation. Such narrowband spectral estimation is then related to multi-source location and direction of arrival estimation in array processing. Sparsity in unobservable source signals is also shown to facilitate source separation in sparse component analysis; the algorithms developed in this area such as linear programming and matching pursuit are also widely used in compressed sensing. Finally, the multipath channel estimation problem is shown to have a sparse formulation; algorithms similar to sampling and coding are used to estimate typical multicarrier communication channels.
translated by 谷歌翻译
A fundamental problem in neural network research, as well as in many other disciplines, is finding a suitable representation of multivariate data, i.e. random vectors. For reasons of computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. In other words, each component of the representation is a linear combination of the original variables. Well-known linear transformation methods include principal component analysis, factor analysis, and projection pursuit. Independent component analysis (ICA) is a recently developed method in which the goal is to find a linear representation of nongaussian data so that the components are statistically independent, or as independent as possible. Such a representation seems to capture the essential structure of the data in many applications, including feature extraction and signal separation. In this paper, we present the basic theory and applications of ICA, and our recent work on the subject. 1 Motivation Imagine that you are in a room where two people are speaking simultaneously. You have two microphones, which you hold in different locations. The microphones give you two recorded time signals, which we could denote by x 1 (t) and x 2 (t), with x 1 and x 2 the amplitudes, and t the time index. Each of these recorded signals is a weighted sum of the speech signals emitted by the two speakers, which we denote by s 1 (t) and s 2 (t). We could express this as a linear equation: x 1 (t) = a 11 s 1 + a 12 s 2 (1) x 2 (t) = a 21 s 1 + a 22 s 2 (2) where a 11 , a 12 , a 21 , and a 22 are some parameters that depend on the distances of the microphones from the speakers. It would be very useful if you could now estimate the two original speech signals s 1 (t) and s 2 (t), using only the recorded signals x 1 (t) and x 2 (t). This is called the cocktail-party problem. For the time being, we omit any time delays or other extra factors from our simplified mixing model. As an illustration, consider the waveforms in Fig. 1 and Fig. 2. These are, of course, not realistic speech signals, but suffice for this illustration. The original speech signals could look something like those in Fig. 1 and the mixed signals could look like those in Fig. 2. The problem is to recover the data in Fig. 1 using only the data in Fig. 2. Actually, if we knew the parameters a i j , we could solve the linear equation in (1) by classical methods. The point is, however, that if you don't know the a i j , the problem is considerably more difficult. One approach to solving this problem would be to use some information on the statistical properties of the signals s i (t) to estimate the a ii. Actually, and perhaps surprisingly, it turns out that it is enough to assume that s 1 (t) and s 2 (t), at each time instant t, are statistically independent. This is not an unrealistic assumption in many cases, 1
translated by 谷歌翻译
Blind source separation (BSS) is a challenging problem in real-world environments where sources are time delayed and convolved. The problem becomes more difficult in very reverberant conditions, with an increasing number of sources, and geometric configurations of the sources such that finding directionality is not sufficient for source separation. In this paper, we propose a new algorithm that exploits higher order frequency dependencies of source signals in order to separate them when they are mixed. In the frequency domain, this formulation assumes that dependencies exist between frequency bins instead of defining independence for each frequency bin. In this manner, we can avoid the well-known frequency permutation problem. To derive the learning algorithm, we define a cost function, which is an extension of mutual information between multivariate random variables. By introducing a source prior that models the inherent frequency dependencies, we obtain a simple form of a multivariate score function. In experiments, we generate simulated data with various kinds of sources in various environments. We evaluate the performances and compare it with other well-known algorithms. The results show the proposed algorithm outperforms the others in most cases. The algorithm is also able to accurately recover six sources with six microphones. In this case, we can obtain about 16-dB signal-to-interference ratio (SIR) improvement. Similar performance is observed in real conference room recordings with three human speakers reading sentences and one loudspeaker playing music. Index Terms-Blind source separation (BSS), cocktail party problem, convolutive mixture, frequency domain, higher order dependency, independent component analysis, permutation problem.
translated by 谷歌翻译
We introduce the independent factor analysis (IFA) method for recovering independent hidden sources from their observed mixtures. IFA generalizes and uniies ordinary factor analysis (FA), principal component analysis (PCA), and independent component analysis (ICA), and can handle not only square noiseless mixing, but also the general case where the number of mixtures diiers from the number of sources and the data are noisy. IFA is a two-step procedure. In the rst step, the source densities, mixing matrix and noise covariance are estimated from the observed data by maximum likelihood. For this purpose we present an expectation-maximization (EM) algorithm, which performs unsupervised learning of an associated probabilistic model of the mixing situation. Each source in our model is described by a mixture of Gaussians, thus all the probabilistic calculations can be performed analytically. In the second step, the sources are reconstructed from the observed data by an optimal non-linear estimator. A variational approximation of this algorithm is derived for cases with a large number of sources, where the exact algorithm becomes intractable. Our IFA algorithm reduces to the one for ordinary FA when the sources become Gaussian, and to an EM algorithm for PCA in the zero-noise limit. We derive an additional EM algorithm speciically for noiseless IFA. This algorithm is shown to be superior to ICA since it can learn arbitrary source densities from the data. Beyond blind separation, IFA can be used for modeling multi-dimensional data by a highly constrained mixture of Gaussians, and as a tool for non-linear signal encoding.
translated by 谷歌翻译
Independent component analysis (ICA) is a recently developed, useful extension of standard principal component analysis (PCA). The ICA model is utilized mainly in blind separation of unknown source signals from their linear mixtures. In this application only the source signals which correspond to the coefficients of the ICA expansion are of interest. In this paper, we propose neural structures related to multilayer feedforward networks for performing complete ICA. The basic ICA network consists of whitening, separation, and basis vector estimation layers. It can be used for both blind source separation and estimation of the basis vectors of ICA. We consider learning algorithms for each layer, and modify our previous nonlinear PCA type algorithms so that their separation capabilities are greatly improved. The proposed class of networks yields good results in test examples with both artificial and real-world data. Index Terms-Blind source separation, independent component analysis, neural networks, principal component analysis, signal processing, unsupervised learning.
translated by 谷歌翻译
Capturing statistical regularities in complex, high-dimensional data is an important problem in machine learning and signal processing. Models such as PCA and ICA make few assumptions about the structure in the data, have good scaling properties, but are limited to representing linear statistical regularities and assume that the distribution of the data is stationary. For many natural, complex signals, the latent variables often exhibit residual dependencies as well as non-stationary statistics. Here we present a hierarchical Bayesian model that is able to capture higher-order non-linear structure and represent non-stationary data distributions. The model is a generalization of ICA in which the basis function coefficients are no longer assumed to be independent; instead, the dependencies in their magnitudes are captured by a set of density components. Each density component describes a common pattern of deviation from the marginal density of the pattern ensemble; in different combinations, they can describe non-stationary distributions. Adapting the model to image or audio data yields a non-linear, distributed code for higher-order statistical regularities that reflect more abstract, invariant properties of the signal. 3 Learning Density Components
translated by 谷歌翻译
Blind signal separation (BSS) and independent component analysis (ICA) are emerging techniques of array processing and data analysis that aim to recover unobserved signals or "sources" from observed mixtures (typically, the output of an array of sensors), exploiting only the assumption of mutual independence between the signals. The weakness of the assumptions makes it a powerful approach, but it requires us to venture beyond familiar second-order statistics. The objectives of this paper are to review some of the approaches that have been recently developed to address this exciting problem, to illustrate how they stem from basic principles, and to show how they relate to each other.
translated by 谷歌翻译
An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed magnitude spectrum and a time-varying gain. Each sound source, in turn, is modeled as a sum of one or more components. The parameters of the components are estimated by minimizing the reconstruction error between the input spectrogram and the model, while restricting the component spectrograms to be nonneg-ative and favoring components whose gains are slowly varying and sparse. Temporal continuity is favored by using a cost term which is the sum of squared differences between the gains in adjacent frames, and sparseness is favored by penalizing nonzero gains. The proposed iterative estimation algorithm is initialized with random values, and the gains and the spectra are then alternatively updated using multiplicative update rules until the values converge. Simulation experiments were carried out using generated mixtures of pitched musical instrument samples and drum sounds. The performance of the proposed method was compared with independent subspace analysis and basic nonnegative matrix factorization, which are based on the same linear model. According to these simulations , the proposed method enables a better separation quality than the previous algorithms. Especially, the temporal continuity criterion improved the detection of pitched musical sounds. The sparseness criterion did not produce significant improvements. Index Terms-Acoustic signal analysis, audio source separation, blind source separation, music, nonnegative matrix factorization, sparse coding, unsupervised learning.
translated by 谷歌翻译
In this paper we present a convolutive basis decomposition method and its application on simultaneous speakers separation from monophonic recordings. The model we propose is a convolutive version of the non-negative matrix factorization algorithm. Due to the non-negativity constraint this type of coding is very well suited for intuitively and efficiently representing magnitude spectra. We present results that reveal the nature of these basis functions and we introduce their utility in separating monophonic mixtures of known speakers. IEEE Transactions on Audio, Speech and Language Processing This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved. Abstract-In this paper we present a convolutive basis decomposition method and its application on simultaneous speakers separation from monophonic recordings. The model we propose is a convolutive version of the non-negative matrix factorization algorithm. Due to the non-negativity constraint this type of coding is very well suited for intuitively and efficiently representing magnitude spectra. We present results that reveal the nature of these basis functions and we introduce their utility in separating monophonic mixtures of known speakers.
translated by 谷歌翻译
Adaptation of Bayesian models for single channel source separation and its application to voice / music separation in popular songs. Abstract-Probabilistic approaches can offer satisfactory solutions to source separation with a single channel, provided that the models of the sources match accurately the statistical properties of the mixed signals. However, it is not always possible to train such models. To overcome this problem, we propose to resort to an adaptation scheme for adjusting the source models with respect to the actual properties of the signals observed in the mix. In this paper, we introduce a general formalism for source model adaptation which is expressed in the framework of Bayesian models. Particular cases of the proposed approach are then investigated experimentally on the problem of separating voice from music in popular songs. The obtained results show that an adaptation scheme can improve consistently and significantly the separation performance in comparison with nonadapted models. Index Terms-Adaptive Wiener filtering, Bayesian model, expectation maximization (EM), Gaussian mixture model (GMM), maximum a posteriori (MAP), model adaptation, single-channel source separation, time-frequency masking.
translated by 谷歌翻译
声学数据提供从生物学和通信到海洋和地球科学等领域的科学和工程见解。我们调查了机器学习(ML)的进步和变革潜力,包括声学领域的深度学习。 ML是用于自动检测和利用模式印度的广泛的统计技术家族。相对于传统的声学和信号处理,ML是数据驱动的。给定足够的训练数据,ML可以发现特征之间的复杂关系。通过大量的训练数据,ML candiscover模型描述复杂的声学现象,如人类语音和混响。声学中的ML正在迅速发展,具有令人瞩目的成果和未来的重大前景。我们首先介绍ML,然后在五个声学研究领域强调MLdevelopments:语音处理中的源定位,海洋声学中的源定位,生物声学,地震探测和日常场景中的环境声音。
translated by 谷歌翻译
This paper gives an overview of automatic speak er recognition technology, with an emphasis on text-independent recognition. Speak er recognition has been studied actively for several decades. W e give an overview of both the classical and the state-of-the-art methods. W e start with the fundamentals of automatic speak er recognition, concerning feature extraction and speak er modeling. W e elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. W e also provide an overview of this recent development and discuss the evaluation methodology of speak er recognition systems. W e conclude the paper with discussion on future directions.
translated by 谷歌翻译
In ordinary independent component analysis, the components are assumed to be completely independent, and they do not necessarily have any meaningful order relationships. In practice, however, the estimated "independent" components are often not at all independent. We propose that this residual dependence structure could be used to define a topographic order for the components. In particular, a distance between two components could be defined using their higher-order correlations, and this distance could be used to create a topographic representation. Thus we obtain a linear decomposition into approximately independent components, where the dependence of two components is approximated by the proximity of the components in the topographic representation.
translated by 谷歌翻译
New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field is also carefully analyzed.
translated by 谷歌翻译
Algorithms for data-driven learning of domain-specific overcomplete dictionaries are developed to obtain maximum likelihood and maximum a posteriori dictionary estimates based on the use of Bayesian models with concave/Schur-concave (CSC) negative log priors. Such priors are appropriate for obtaining sparse representations of environmental signals within an appropriately chosen (environmentally matched) dictionary. The elements of the dictionary can be interpreted as concepts, features, or words capable of succinct expression of events encountered in the environment (the source of the measured signals). This is a generalization of vector quantization in that one is interested in a description involving a few dictionary entries (the proverbial "25 words or less"), but not necessarily as succinct as one entry. To learn an environmentally adapted dictionary capable of concise expression of signals generated by the environment , we develop algorithms that iterate between a representative set of sparse representations found by variants of FOCUSS and an update of the dictionary using these sparse representations. Experiments were performed using synthetic data and natural images. For complete dictionaries, we demonstrate that our algorithms have im-Neural Computation 15, 349-396 (2003) proved performance over other independent component analysis (ICA) methods, measured in terms of signal-to-noise ratios of separated sources. In the overcomplete case, we show that the true underlying dictionary and sparse sources can be accurately recovered. In tests with natural images, learned overcomplete dictionaries are shown to have higher coding efficiency than complete dictionaries; that is, images encoded with an over-complete dictionary have both higher compression (fewer bits per pixel) and higher accuracy (lower mean square error).
translated by 谷歌翻译
Most of audio source separation methods are developed for a particular scenario characterized by the number of sources and channels and the characteristics of the sources and the mixing process. In this paper we introduce a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints. While this framework generalizes several existing audio source separation methods, it also allows to imagine and implement new efficient methods that were not yet reported in the literature. We first introduce the framework by describing the model structure and constraints, explaining its generality, and summarizing its algorithmic implementation using a generalized expectation-maximization algorithm. Finally, we illustrate the above-mentioned capabilities of the framework by applying it in several new and existing configurations to different source separation problems. We have released a software tool named Flexible Audio Source Separation Toolbox (FASST) implementing a baseline version of the framework in Matlab.
translated by 谷歌翻译
A number of current face recognition algorithms use face representations found by unsupervised statistical methods. Typically these methods find a set of basis images and represent faces as a linear combination of those images. Principal component analysis (PCA) is a popular example of such methods. The basis images found by PCA depend only on pairwise relationships between pixels in the image database. In a task such as face recognition, in which important information may be contained in the high-order relationships among pixels, it seems reasonable to expect that better basis images may be found by methods sensitive to these high-order statistics. Independent component analysis (ICA), a generalization of PCA, is one such method. We used a version of ICA derived from the principle of optimal information transfer through sigmoidal neurons. ICA was performed on face images in the FERET database under two different architectures, one which treated the images as random variables and the pixels as outcomes, and a second which treated the pixels as random variables and the images as outcomes. The first architecture found spatially local basis images for the faces. The second architecture produced a factorial face code. Both ICA representations were superior to representations based on PCA for recognizing faces across days and changes in expression. A classifier that combined the two ICA representations gave the best performance. Index Terms-Eigenfaces, face recognition, independent component analysis (ICA), principal component analysis (PCA), unsupervised learning.
translated by 谷歌翻译
This letter presents theoretical, algorithmic, and experimental results about nonnegative matrix factorization (NMF) with the Itakura-Saito (IS) divergence. We describe how IS-NMF is underlaid by a well-defined statistical model of superimposed gaussian components and is equivalent to maximum likelihood estimation of variance parameters. This setting can accommodate regularization constraints on the factors through Bayesian priors. In particular, inverse-gamma and gamma Markov chain priors are considered in this work. Estimation can be carried out using a space-alternating generalized expectation-maximization (SAGE) algorithm ; this leads to a novel type of NMF algorithm, whose convergence to a stationary point of the IS cost function is guaranteed. We also discuss the links between the IS divergence and other cost functions used in NMF, in particular, the Euclidean distance and the generalized Kullback-Leibler (KL) divergence. As such, we describe how IS-NMF can also be performed using a gradient multiplicative algorithm (a standard algorithm structure in NMF) whose convergence is observed in practice, though not proven. Finally, we report a furnished experimental comparative study of Euclidean-NMF, KL-NMF, and IS-NMF algorithms applied to the power spectrogram of a short piano sequence recorded in real conditions, with various initializations and model orders. Then we show how IS-NMF can successfully be employed for denoising and upmix (mono to stereo conversion) of an original piece of early jazz music. These experiments indicate that IS-NMF correctly captures the semantics of audio and is better suited to the representation of music signals than NMF with the usual Euclidean and KL costs.
translated by 谷歌翻译
We present an energy-based model that uses a product of generalised Student-t distributions to capture the statistical structure in datasets. This model is inspired by and particularly applicable to "natural" datasets such as images. We begin by providing the mathematical framework, where we discuss complete and overcomplete models, and provide algorithms for training these models from data. Using patches of natural scenes we demonstrate that our approach represents a viable alternative to "indepen-dent components analysis" as an interpretive model of biological visual systems. Although the two approaches are similar in flavor there are also important differences, particularly when the representations are overcom-plete. By constraining the interactions within our model we are also able to study the topographic organization of Gabor-like receptive fields that are learned by our model. Finally, we discuss the relation of our new approach to previous work-in particular Gaussian Scale Mixture models, and variants of independent components analysis.
translated by 谷歌翻译