Two features distinguish the Bayesian approach to learning models from data. First, beliefs derived from background knowledge are used to select a prior probability distribution for the model parameters. Second, predictions of future observations are made by integrating the model's predictions with respect to the posterior parameter distribution obtained by updating this prior to take account of the data. For neural network models, both these aspects present difficulties: the prior over network parameters has no obvious relation to our prior knowledge, and integration over the posterior is computationally very demanding. I address the first problem by defining classes of prior distributions for network parameters that reach sensible limits as the size of the network goes to infinity. In this limit, the properties of these priors can be elucidated. Some priors converge to Gaussian processes, in which functions computed by the network may be smooth, Brownian, or fractionally Brownian. Other priors converge to non-Gaussian stable processes. Interesting effects are obtained by combining priors of both sorts in networks with more than one hidden layer. The problem of integrating over the posterior can be solved using Markov chain Monte Carlo methods. I demonstrate that the hybrid Monte Carlo algorithm, which is based on dynamical simulation, is superior to methods based on simple random walks. I use a hybrid Monte Carlo implementation to test the performance of Bayesian neural network models on several synthetic and real data sets. Good results are obtained on small data sets when large networks are used in conjunction with priors designed to reach limits as network size increases, confirming that with Bayesian learning one need not restrict the complexity of the network based on the size of the data set. A Bayesian approach is also found to be effective in automatically determining the relevance of inputs.
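As a rough numerical illustration of the prior-limit idea (not the thesis's exact construction), the sketch below assumes a one-hidden-layer tanh network with hidden-to-output weights scaled by 1/sqrt(H); prior draws then have an output variance that stabilises as the hidden layer grows, consistent with convergence to a Gaussian process. All weight scales are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)            # 1-D inputs

def sample_prior_function(n_hidden, sigma_w=5.0, sigma_v=1.0):
    """Draw one function from a one-hidden-layer network prior.

    Hidden-to-output weights are scaled by 1/sqrt(n_hidden), the scaling
    under which the prior over functions approaches a Gaussian process
    as n_hidden grows.
    """
    w = rng.normal(0, sigma_w, size=(n_hidden, 1))                  # input-to-hidden weights
    b = rng.normal(0, sigma_w, size=n_hidden)                       # hidden biases
    v = rng.normal(0, sigma_v / np.sqrt(n_hidden), size=n_hidden)   # output weights
    h = np.tanh(x[:, None] @ w.T + b)                               # hidden activations, (50, H)
    return h @ v                                                    # network output at each x

# The empirical prior variance of the output stabilises as the network grows,
# while individual draws look increasingly like Gaussian process samples.
for H in (1, 10, 100, 10_000):
    draws = np.stack([sample_prior_function(H) for _ in range(200)])
    print(f"H={H:6d}  prior variance at x=0: {draws[:, 25].var():.3f}")
```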
Structured additive regression models are perhaps the most commonly used class of models in statistical applications. It includes, among others, (generalized) linear models, (generalized) additive models, smoothing spline models, state space models, semiparametric regression, spatial and spatiotemporal models, log-Gaussian Cox processes and geostatistical and geoadditive models. We consider approximate Bayesian inference in a popular subset of structured additive regression models, latent Gaussian models, where the latent field is Gaussian, controlled by a few hyperparameters and with non-Gaussian response variables. The posterior marginals are not available in closed form owing to the non-Gaussian response variables. For such models, Markov chain Monte Carlo methods can be implemented, but they are not without problems, in terms of both convergence and computational time. In some practical applications, the extent of these problems is such that Markov chain Monte Carlo sampling is simply not an appropriate tool for routine analysis. We show that, by using an integrated nested Laplace approximation and its simplified version, we can directly compute very accurate approximations to the posterior marginals. The main benefit of these approximations is computational: where Markov chain Monte Carlo algorithms need hours or days to run, our approximations provide more precise estimates in seconds or minutes. Another advantage with our approach is its generality, which makes it possible to perform Bayesian analysis in an automatic, streamlined way, and to compute model comparison criteria and various predictive measures so that models can be compared and the model under study can be challenged.
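A minimal sketch of the basic ingredient, a Laplace (Gaussian) approximation to a posterior, applied here to a toy Poisson model with a single Gaussian latent effect. This illustrates the approximation idea only, not the integrated nested Laplace approximation itself; the data and precision value are invented.

```python
import numpy as np

# Toy model: y_i ~ Poisson(exp(eta)),  eta ~ N(0, 1/tau)
y = np.array([3, 5, 4, 6, 2])
tau = 1.0  # prior precision (a "hyperparameter")

# Find the posterior mode of eta by Newton's method on the negative log posterior.
eta = 0.0
for _ in range(50):
    grad = -(np.sum(y) - len(y) * np.exp(eta) - tau * eta)
    hess = len(y) * np.exp(eta) + tau            # curvature of the negative log posterior
    eta -= grad / hess

# Laplace approximation: posterior ~= Gaussian centred at the mode with
# variance given by the inverse curvature at the mode.
post_sd = 1.0 / np.sqrt(len(y) * np.exp(eta) + tau)
print(f"Laplace approximation: eta | y ~ N({eta:.3f}, {post_sd:.3f}^2)")
```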
Gaussian Process Dynamical Models for Human Motion
This thesis introduces Gaussian process dynamical models (GPDMs) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed form, which leads to modeling both dynamics and observation mappings as Gaussian processes. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We train the model on human motion capture data in which each pose is 62-dimensional, and synthesize new motions by sampling from the posterior distribution. A comparison of forecasting results between different covariance functions and sampling methods is provided, and we demonstrate a simple application of GPDMs to filling in missing data. Finally, to account for latent space uncertainty, we explore different prior settings on the hyperparameters and show some preliminary GPDM learning results using a Monte Carlo expectation-maximization algorithm.
Acknowledgements: First and foremost, this research would never have happened without the technical contributions from my co-supervisors Aaron Hertzmann and David Fleet. Aaron's work on style-based inverse kinematics piqued my interest in machine learning techniques, and David's course on visual motion analysis is where this work first started. They were responsible for motivating this project, and spent countless hours with me to refine the research into a publishable form. I have learned a great deal from working with both of them. This thesis is influenced by insightful comments from Allan Jepson, brief discussions with Radford Neal, and talks given by Geoff Hinton. In addition, thanks to Raquel Urtasun for explaining to me the issues in applying this research to human tracking, which became one of the motivations for Chapter 5. Many thanks to my fellow students at the DGP: Abhishek, Alex, and Eron for co-founding DAG. Anand and Patricio for providing non-stop comic relief. Gonzalo, Jacky, Mike M., Mike P., Nigel, Noah, and Patrick for being great targets to distract without having to stand up. Faisal, Joe, and Mike N. for many actual research conversations. Azeem, Bowen, Irene, Kevin, Mike W., Shahzad, and Winnie for many conversations on subjects ranging from adopting puppies to world domination. Naiqi and Qinxin for many conversations no one else in the lab understands. Dan V., Sam, Tovi, and Tristan for regular food and drink excursions. Anastasia and Marge for keeping the lab sane. Thanks to the DGP research staff, especially John Hancock for year-round technical support. Last but not least, thanks to all the professors at the DGP, especially Ravin, Aaron, and Karan for having good taste in food.
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
The Gaussian process latent variable model (GP-LVM) provides a flexible approach for non-linear dimensionality reduction that has been widely applied. However, the current approach for training GP-LVMs is based on maximum likelihood, where the latent projection variables are maximised over rather than integrated out. In this paper we present a Bayesian method for training GP-LVMs by introducing a non-standard variational inference framework that allows us to approximately integrate out the latent variables and subsequently train a GP-LVM by maximising an analytic lower bound on the exact marginal likelihood. We apply this method for learning a GP-LVM from i.i.d. observations and for learning non-linear dynamical systems where the observations are temporally correlated. We show that a benefit of the variational Bayesian procedure is its robustness to overfitting and its ability to automatically select the dimensionality of the non-linear latent space. The resulting framework is generic, flexible and easy to extend for other purposes, such as Gaussian process regression with uncertain or partially missing inputs. We demonstrate our method on synthetic data and standard machine learning benchmarks, as well as challenging real world datasets, including high resolution video data.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of statistical techniques for automatically detecting and exploiting patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is developing rapidly, with compelling results and significant future promise. We first introduce ML, then highlight ML developments in five acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, seismic detection, and environmental sounds in everyday scenes.
Network models are widely used to represent relations between interacting units or actors. Network data often exhibit transitivity, meaning that two actors that have ties to a third actor are more likely to be tied than actors that do not, homophily by attributes of the actors or dyads, and clustering. Interest often focuses on finding clusters of actors or ties, and the number of groups in the data is typically unknown. We propose a new model, the latent position cluster model, under which the probability of a tie between two actors depends on the distance between them in an unobserved Euclidean 'social space', and the actors' locations in the latent social space arise from a mixture of distributions, each corresponding to a cluster. We propose two estimation methods: a two-stage maximum likelihood method and a fully Bayesian method that uses Markov chain Monte Carlo sampling. The former is quicker and simpler, but the latter performs better. We also propose a Bayesian way of determining the number of clusters that are present by using approximate conditional Bayes factors. Our model represents transitivity, homophily by attributes and clustering simultaneously and does not require the number of clusters to be known. The model makes it easy to simulate realistic networks with clustering, which are potentially useful as inputs to models of more complex systems of which the network is part, such as epidemic models of infectious disease. We apply the model to two networks of social relations. A free software package in the R statistical language, latentnet, is available to analyse data by using the model.
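A minimal generative sketch of the latent position cluster model described above, assuming a logistic link on an intercept minus latent distance and two Gaussian clusters of actor positions; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 2                                   # actors, latent dimensions

# Latent positions drawn from a two-component mixture (the "clusters").
labels = rng.integers(0, 2, size=n)
centres = np.array([[-2.0, 0.0], [2.0, 0.0]])
z = centres[labels] + 0.5 * rng.standard_normal((n, d))

# Tie probability decays with latent distance: logit P(y_ij = 1) = alpha - ||z_i - z_j||.
alpha = 1.0
dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
p = 1.0 / (1.0 + np.exp(-(alpha - dist)))

y = (rng.random((n, n)) < p).astype(int)       # adjacency matrix
np.fill_diagonal(y, 0)

# Within-cluster ties come out far more likely than between-cluster ties,
# which is how the model captures clustering and transitivity together.
same = labels[:, None] == labels[None, :]
print("within-cluster tie rate :", y[same].mean().round(3))
print("between-cluster tie rate:", y[~same].mean().round(3))
```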
We consider prediction and uncertainty analysis for systems which are approximated using complex mathematical models. Such models, implemented as computer codes, are often generic in the sense that by a suitable choice of some of the model's input parameters the code can be used to predict the behaviour of the system in a variety of specific applications. However, in any specific application the values of necessary parameters may be unknown. In this case, physical observations of the system in the specific context are used to learn about the unknown parameters. The process of fitting the model to the observed data by adjusting the parameters is known as calibration. Calibration is typically effected by ad hoc fitting, and after calibration the model is used, with the fitted input values, to predict the future behaviour of the system. We present a Bayesian calibration technique which improves on this traditional approach in two respects. First, the predictions allow for all sources of uncertainty, including the remaining uncertainty over the fitted parameters. Second, they attempt to correct for any inadequacy of the model which is revealed by a discrepancy between the observed data and the model predictions from even the best-fitting parameter values. The method is illustrated by using data from a nuclear radiation release at Tomsk, and from a more complex simulated nuclear accident exercise.
1. Overview
1.1. Computer models and calibration
Various sciences use mathematical models to describe processes that would otherwise be very difficult to analyse, and these models are typically implemented in computer codes. Often, the mathematical model is highly complex, and the resulting computer code is large and may be expensive in terms of the computer time required for a single run. Nevertheless, running the computer model will be much cheaper than making direct observations of the process. Sacks, Welch, Mitchell and Wynn (1989) have given several examples. The codes that we consider are deterministic, i.e. running the code with the same inputs always produces the same output. Computer models are generally designed to be applicable to a wide range of particular contexts. However, to use a model to make predictions in a specific context it may be necessary first to calibrate the model by using some observed data. To illustrate this process we introduce a simple example. Two more examples are described in detail in Section 2.2. To decide on a dose regime (e.g. size, frequency and release rates of tablets) for a new drug, a pharmacokinetic model is used. This models the movement of the drug through various 'compartments' of the patient's body and its eventual elimination (e.g. by chemical reactions
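In the usual formulation, the field data are modelled as the code output at the true calibration parameters plus a model-discrepancy term plus observation error, z_i = η(x_i, θ) + δ(x_i) + ε_i. The sketch below, with a made-up one-parameter "simulator", shows why ignoring the discrepancy biases an ad hoc least-squares calibration; it is illustrative only and is not the paper's Gaussian-process-based procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulator(x, theta):
    """Stand-in for an expensive computer code with calibration input theta."""
    return theta * np.sin(x)

theta_true = 1.3
x_obs = np.linspace(0, 3, 15)

# Field observations = code at the true theta + model discrepancy + noise.
discrepancy = 0.1 * x_obs                     # systematic inadequacy of the code
z = simulator(x_obs, theta_true) + discrepancy + 0.05 * rng.standard_normal(15)

# Naive calibration ("ad hoc fitting"): least squares over theta, ignoring the discrepancy.
thetas = np.linspace(0.5, 2.0, 301)
sse = [np.sum((z - simulator(x_obs, t))**2) for t in thetas]
theta_hat = thetas[np.argmin(sse)]
print(f"true theta = {theta_true}, least-squares theta = {theta_hat:.3f}")
# The bias in theta_hat illustrates why the Bayesian approach models the
# discrepancy explicitly and carries parameter uncertainty into predictions.
```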
Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations [13, 78, 31]. The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to show up precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided.
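A minimal sketch of the Gaussian process regression computation that underlies such models, using a squared-exponential covariance and Gaussian observation noise; the hyperparameter values are placeholders rather than learned.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq = (a[:, None] - b[None, :])**2
    return variance * np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(20)
x_star = np.linspace(-3, 3, 100)

noise = 0.1**2
K = rbf(x, x) + noise * np.eye(len(x))        # covariance of noisy training targets
K_s = rbf(x_star, x)

# Standard GP regression equations: posterior mean and variance of f(x_star) given the data.
alpha = np.linalg.solve(K, y)
mean = K_s @ alpha
var = np.diag(rbf(x_star, x_star) - K_s @ np.linalg.solve(K, K_s.T))
print(mean[:5], var[:5])
```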
Factor analysis, principal component analysis, mixtures of gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observations and derivations made by many previous authors and introducing a new way of linking discrete and continuous state models using a simple nonlinearity. Through the use of other nonlinearities, we show how independent component analysis is also a variation of the same basic generative model. We show that factor analysis and mixtures of gaussians can be implemented in autoencoder neural networks and learned using squared error plus the same regularization term. We introduce a new model for static data, known as sensible principal component analysis, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode for inference and learning for all the basic models.
1 A Unifying Review
Many common statistical techniques for modeling multidimensional static data sets and multidimensional time series can be seen as variants of one underlying model. As we will show, these include factor analysis, principal component analysis (PCA), mixtures of gaussian clusters, vector quantization, independent component analysis models (ICA), Kalman filter models (also known as linear dynamical systems), and hidden Markov models (HMMs). The relationships between some of these models have been noted in passing in the recent literature. For example, Hinton, Revow, and Dayan (1995) note that FA and PCA are closely related, and Digalakis, Rohlicek, and Ostendorf (1993) relate the forward-backward algorithm for HMMs to
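A minimal sketch of the single underlying generative model: a linear-Gaussian state space model x_t = A x_{t-1} + w_t, y_t = C x_t + v_t, which reduces to factor analysis/PCA-like static models when the dynamics are switched off and to an HMM when the state is made discrete. The matrices below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)
T, k, p = 100, 2, 5                       # time steps, latent dimension, observed dimension

A = np.array([[0.99, -0.10],              # latent dynamics (set A = 0 for factor analysis)
              [0.10,  0.99]])
C = rng.standard_normal((p, k))           # observation (loading) matrix
Q = 0.01 * np.eye(k)                      # state noise covariance
R = 0.10 * np.eye(p)                      # observation noise covariance

x = np.zeros((T, k))
y = np.zeros((T, p))
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(k), Q)
    y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(p), R)

print(y[:3])
# Kalman filtering/smoothing gives exact inference for x given y in this model;
# the discrete-state analogue is the forward-backward algorithm for HMMs.
```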
Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be non-linearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets.
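A minimal sketch of the dual view described above: with the linear mappings marginalised out, the columns of the data matrix are modelled as independent Gaussian processes over the latent coordinates, so the objective depends on the latent points only through a kernel matrix; a linear kernel gives DPPCA and a nonlinear kernel gives the GP-LVM. Dimensions and the noise level below are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, Q = 40, 5, 2
Y = rng.standard_normal((N, D))           # centred data: N points in D dimensions
X = rng.standard_normal((N, Q))           # candidate latent coordinates
beta = 10.0                               # inverse noise variance

def gplvm_neg_log_lik(X, Y, beta):
    """Negative log-likelihood (up to an additive constant) of DPPCA / the
    linear-kernel GP-LVM: each column of Y ~ N(0, X X^T + beta^{-1} I)."""
    D_out = Y.shape[1]
    K = X @ X.T + np.eye(len(X)) / beta
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (D_out * logdet + np.trace(np.linalg.solve(K, Y @ Y.T)))

print(gplvm_neg_log_lik(X, Y, beta))
# Maximum-likelihood training optimises this quantity over X (and kernel
# hyperparameters); swapping X X^T for a nonlinear kernel yields the GP-LVM.
```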
Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment.
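A minimal generative sketch of the inter-battery factor analysis structure described above, in which both views share common latent factors and each view has additional view-specific factors plus noise; dimensions and loadings are illustrative and no inference is performed here.

```python
import numpy as np

rng = np.random.default_rng(10)
n, d1, d2, k_shared, k1, k2 = 500, 10, 8, 2, 3, 2

z  = rng.standard_normal((n, k_shared))     # factors shared by both data sets
z1 = rng.standard_normal((n, k1))           # factors specific to view 1
z2 = rng.standard_normal((n, k2))           # factors specific to view 2

W1, B1 = rng.standard_normal((k_shared, d1)), rng.standard_normal((k1, d1))
W2, B2 = rng.standard_normal((k_shared, d2)), rng.standard_normal((k2, d2))

X1 = z @ W1 + z1 @ B1 + 0.1 * rng.standard_normal((n, d1))
X2 = z @ W2 + z2 @ B2 + 0.1 * rng.standard_normal((n, d2))
print(X1.shape, X2.shape)
# Bayesian IBFA/CCA methods infer z, z1, z2 and the loadings from X1 and X2 alone,
# with group-wise sparsity deciding which factors are shared and which are specific.
```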
A new information-theoretic approach is presented for finding the pose of an object in an image. The technique does not require information about the surface properties of the object, besides its shape, and is robust with respect to variations of illumination. In our derivation, few assumptions are made about the nature of the imaging process. As a result the algorithms are quite general and can foreseeably be used in a wide variety of imaging situations. Experiments are presented that demonstrate the approach registering magnetic resonance (MR) images with computed tomography (CT) images, aligning a complex 3D object model to real scenes including clutter and occlusion, tracking a human head in a video sequence and aligning a view-based 2D object model to real images. The method is based on a formulation of the mutual information between the model and the image called EMMA. As applied here the technique is intensity-based, rather than feature-based. It works well in domains where edge or gradient-magnitude based methods have difficulty, yet it is more robust than traditional correlation. Additionally, it has an efficient implementation that is based on stochastic approximation. Finally, we will describe a number of additional real-world applications that can be solved efficiently and reliably using EMMA. EMMA can be used in machine learning to find maximally informative projections of high-dimensional data. EMMA can also be used to detect and correct corruption in magnetic resonance images (MRI). This report describes research done in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Abstract: Over the last 30 years the problems of image registration and recognition have proven more difficult than even the most pessimistic might have predicted. Progress has been hampered by the sheer complexity of the relationship between an object and its image, which involves the object's shape, surface properties, position, and illumination. Changes in illumination can radically alter the intensity and shading of an image. Nevertheless, the human visual system can use shading both for recognition and image interpretation. We will present a measure for comparing objects and images that uses shading information, yet is explicitly insensitive to changes in illumination. This measure is unique in that it compares 3D object models directly to raw images. No pre-processing or edge detection is required. We will show that when the mutual information between model and image is large they are likely to be aligned. Toward making this technique a reality we have defined a concrete and efficient technique for evaluating entropy called EMMA. In our derivation of mutual information based alignment few assumptions are made about the
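A minimal sketch of the core quantity, mutual information between model and image intensities, estimated here from a joint histogram; the thesis's EMMA estimator uses stochastic approximation with kernel density estimates, which is not reproduced, and the images below are synthetic.

```python
import numpy as np

def mutual_information(u, v, bins=32):
    """Histogram estimate of I(U; V) between two intensity images of equal shape."""
    joint, _, _ = np.histogram2d(u.ravel(), v.ravel(), bins=bins)
    p_uv = joint / joint.sum()
    p_u = p_uv.sum(axis=1, keepdims=True)
    p_v = p_uv.sum(axis=0, keepdims=True)
    nz = p_uv > 0
    return np.sum(p_uv[nz] * np.log(p_uv[nz] / (p_u @ p_v)[nz]))

rng = np.random.default_rng(6)
img = rng.random((64, 64))
same_scene = np.exp(img) + 0.05 * rng.standard_normal((64, 64))   # nonlinearly related
unrelated = rng.random((64, 64))

# Alignment by maximisation of mutual information: the related (registered) pair
# scores much higher than an unrelated pair even though the intensity relationship
# is nonlinear and would defeat plain correlation.
print("MI, related pair  :", mutual_information(img, same_scene))
print("MI, unrelated pair:", mutual_information(img, unrelated))
```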
The problem of dimensionality reduction arises in many fields of information processing, including machine learning, data compression, scientific visualization, pattern recognition, and neural computation. Here we describe locally linear embedding (LLE), an unsupervised learning algorithm that computes low dimensional, neighborhood preserving embeddings of high dimensional data. The data, assumed to be sampled from an underlying manifold, are mapped into a single global coordinate system of lower dimensionality. The mapping is derived from the symmetries of locally linear reconstructions, and the actual computation of the embedding reduces to a sparse eigenvalue problem. Notably, the optimizations in LLE, though capable of generating highly nonlinear embeddings, are simple to implement, and they do not involve local minima. In this paper, we describe the implementation of the algorithm in detail and discuss several extensions that enhance its performance. We present results of the algorithm applied to data sampled from known manifolds, as well as to collections of images of faces, lips, and handwritten digits. These examples are used to provide extensive illustrations of the algorithm's performance, both successes and failures, and to relate the algorithm to previous and ongoing work in nonlinear dimensionality reduction.
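A minimal dense-matrix sketch of the three LLE steps (nearest neighbours, locally linear reconstruction weights, bottom eigenvectors); a standard regulariser is added to the local Gram matrices, and sparse eigensolvers and other efficiency details are omitted.

```python
import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Locally linear embedding of the rows of X (dense, illustrative implementation)."""
    n = len(X)
    # Step 1: k nearest neighbours of each point (excluding itself).
    d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: weights that best reconstruct each point from its neighbours.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                            # centre neighbours on x_i
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(n_neighbors)     # regularise the local Gram matrix
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs[i]] = w / w.sum()                      # weights sum to one

    # Step 3: embedding from the bottom eigenvectors of (I - W)^T (I - W),
    # discarding the constant eigenvector.
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 1:n_components + 1]

X = np.random.default_rng(7).standard_normal((200, 5))
print(lle(X).shape)                                      # (200, 2)
```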
We introduce a very general method for high dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random-projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition that is implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random-projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high dimensional classifiers via an extensive simulation study, which reveals its excellent finite sample performance.
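A minimal sketch of the ensemble described above, assuming a 1-nearest-neighbour base classifier and a leave-one-out estimate of test error for selecting the projection within each group; the group size, number of groups, and fixed 0.5 voting threshold are illustrative simplifications of the data-driven choices in the paper.

```python
import numpy as np

rng = np.random.default_rng(8)

def knn1_predict(X_tr, y_tr, X_te):
    d2 = ((X_te[:, None, :] - X_tr[None, :, :])**2).sum(-1)
    return y_tr[np.argmin(d2, axis=1)]

def loo_error(X, y):
    d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # leave-one-out: ignore the point itself
    return np.mean(y[np.argmin(d2, axis=1)] != y)

def rp_ensemble(X_tr, y_tr, X_te, d=2, n_groups=20, group_size=10):
    p = X_tr.shape[1]
    votes = np.zeros(len(X_te))
    for _ in range(n_groups):
        best_err, best_A = np.inf, None
        for _ in range(group_size):               # keep the best projection in the group
            A = rng.standard_normal((p, d)) / np.sqrt(p)
            err = loo_error(X_tr @ A, y_tr)
            if err < best_err:
                best_err, best_A = err, A
        votes += knn1_predict(X_tr @ best_A, y_tr, X_te @ best_A)
    return (votes / n_groups > 0.5).astype(int)   # simple majority vote as the threshold

# Toy high-dimensional problem: only the first two of 100 features carry signal.
n, p = 200, 100
X = rng.standard_normal((n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_te = rng.standard_normal((100, p))
y_te = (X_te[:, 0] + X_te[:, 1] > 0).astype(int)
print("test error:", np.mean(rp_ensemble(X, y, X_te) != y_te))
```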
Data with mixed-type (metric-ordinal-nominal) variables are typical for social stratification, i.e. partitioning a population into social classes. Approaches to cluster such data are compared, namely a latent class mixture model assuming local independence and dissimilarity-based methods such as k-medoids. The design of an appropriate dissimilarity measure and the estimation of the number of clusters are discussed as well, comparing the Bayesian information criterion with dissimilarity-based criteria. The comparison is based on a philosophy of cluster analysis that connects the problem of a choice of a suitable clustering method closely to the application by considering direct interpretations of the implications of the methodology. The application of this philosophy to economic data from the 2007 US Survey of Consumer Finances demonstrates techniques and decisions required to obtain an interpretable clustering. The clustering is shown to be significantly more structured than a suitable null model. One result is that the data-based strata are not as strongly connected to occupation categories as is often assumed in the literature.
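A minimal sketch of the dissimilarity-based route discussed above: a Gower-style dissimilarity over mixed numeric/nominal variables followed by a crude k-medoids search. The paper's careful treatment of ordinal variables, variable weighting, and the choice of the number of clusters is omitted, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(9)

def gower(X_num, X_cat):
    """Gower-style dissimilarity: range-scaled L1 distance on numeric columns,
    simple mismatch on categorical columns, averaged over all variables."""
    col_range = X_num.max(0) - X_num.min(0)
    d_num = np.abs(X_num[:, None, :] - X_num[None, :, :]) / col_range
    d_cat = (X_cat[:, None, :] != X_cat[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

def k_medoids(D, k=2, n_iter=50):
    """Very small PAM-style search: repeatedly reassign points and recentre medoids."""
    medoids = list(rng.choice(len(D), k, replace=False))
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        for j in range(k):
            members = np.where(labels == j)[0]
            medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
    return np.array(medoids), np.argmin(D[:, medoids], axis=1)

# Toy mixed-type data: two numeric columns plus one nominal column.
X_num = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
X_cat = np.vstack([rng.integers(0, 2, (50, 1)), rng.integers(1, 3, (50, 1))])
D = gower(X_num, X_cat)
medoids, labels = k_medoids(D, k=2)
print("cluster sizes:", np.bincount(labels))
```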
This paper is a survey of the theory and methods of photogrammetric bundle adjustment, aimed at potential implementors in the computer vision community. Bundle adjustment is the problem of refining a visual reconstruction to produce jointly optimal structure and viewing parameter estimates. Topics covered include: the choice of cost function and robustness; numerical optimization including sparse Newton methods, linearly convergent approximations, updating and recursive methods; gauge (datum) invariance; and quality control. The theory is developed for general robust cost functions rather than restricting attention to traditional nonlinear least squares.
One of the most fundamental problems in machine learning is comparing examples: given a pair of objects, we want to return a value indicating their (dis)similarity. Similarity is often task-specific, and predefined distances can perform poorly, which has motivated work on metric learning. However, the ability to learn a similarity-sensitive distance function also presupposes a rich, discriminative representation of the objects at hand. In this thesis we make contributions at both ends. In the first part of the thesis, assuming the data have a good representation, we propose a formulation for metric learning that, compared with prior work, more directly attempts to optimize k-NN accuracy. We also present extensions of this formulation to metric learning for kNN regression, asymmetric similarity learning, and discriminative learning of Hamming distances. In the second part, we consider the situation of a limited computational budget, where optimizing over the space of possible metrics is infeasible but access to a label-aware distance metric is still required. We propose a simple, computationally inexpensive estimation method, discuss its theoretical motivation, and report experimental results. In the final part, we turn to the representation question and consider group-equivariant convolutional neural networks (GCNNs). Equivariance to symmetry transformations is explicitly encoded in GCNNs; classical CNNs are the simplest example. In particular, we propose an SO(3)-equivariant neural network architecture for spherical data that operates entirely in Fourier space, and we also provide a formalism for the design of fully Fourier neural networks that are equivariant to the action of any continuous compact group.