This web site aims to provide an overview of resources concerned with probabilistic modeling, inference and learning based on Gaussian processes. Although Gaussian processes have a long history in the field of statistics, they seem to have been employed extensively only in niche areas. With the advent of kernel machines in the machine learning community, models based on Gaussian processes have become commonplace for problems of regression (kriging) and classification as well as a host of more specialized applications.
| [ Books | Events | Other Web Sites | Software | Research Papers ] |
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Chris Williams, the MIT Press, 2006, online version.
Statistical Interpolation of Spatial Data: Some Theory for Kriging, Michael L. Stein, Springer, 1999.
Statistics for Spatial Data (revised edition), Noel A. C. Cressie, Wiley, 1993
Spline Models for Observational Data, Grace Wahba, SIAM, 1990
The Gaussian Processes in Practice workshop at Bletchley Park, U.K., June 12-13 2006.
The Open Problems in Gaussian Processes for Machine Learning workshop at nips*05 in Whistler, December 10th, 2005.
The Gaussian Process Round Table meeting in Sheffield, June 9-10, 2005.
The kernel-machines web site.
Wikipedia entry on Gaussian processes.
The ai-geostats web site for spatial statistics and geostatistics.
Andreas Geiger has written a simple Gaussian process regression Java applet, illustrating the behaviour of covariance functions and hyperparameters.
|
Other software that way be useful for implementing Gaussian process models:
Below is a collection of papers relevant to learning in Gaussian process models. The papers are ordered according to topic, with occational papers occuring under multiple headings.
| [ Tutorials | Regression | Classification | Covariance Functions | Model Selection | Approximations | Stats | Learning Curves | RKHS | Reinforcement Learning | Other Topics | Applications ] |
D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, UK, 2003. chapter 45.
Comment: A short introduction to GPs, emphasizing the relationships to paramteric models (RBF networks, neural networks, splines).
D. J. C. MacKay. Gaussian processes - a replacement for supervised neural networks?. Tutorial lecture notes for NIPS 1997, 1997.
D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168 of NATO ASI Series, pages 133-165. Springer, Berlin, 1998.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.
Comment: The initial chapters contain significant amounts of tutorial material. The whole book, including all chapters are freely available online.
M. Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(2):69-106, 2004.
Abstract: Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated.
C. K. I. Williams. Gaussian processes. In M. A. Arbib, editor, Handbook of Brain Theory and Neural Networks, pages 466-470. The MIT Press, second edition, 2002.
C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning in Graphical Models, pages 599-621. The MIT Press, Cambridge, MA, 1999. Previously (1998) published by Kluwer Academic Press.
Abstract: The main aim of this paper is to provide a tutorial on regression with Gaussian processes. We start from Bayesian linear regression, and show how by a change of viewpoint one can se this method as a Gaussian process predictor based on priors over functions, rather than on priors over parameters. This leads to a more general discussion of Gaussian processes in section 4. Section 5 deals with further issues, including hierarchical modelling and the setting of the parameters that control the Gaussian process, the covariance functions for neural network models and the use of Gaussian processes in classification problems.
P. Boyle and M. Frean. Dependent Gaussian processes. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 217-224. The MIT Press, 2005.
Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs.
P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10. The MIT Press, Cambridge, MA, 1998.
Abstract: Gaussian processes provide natural non-parametric prior distributions over regression functions. In this paper we consider regression problems where there is noise on the output, and the variance of the noise depends on the inputs. If we assume that the noise is a smooth functon of the inputs, then it is natural to model the noise variance using a second Gaussian process, in addition to the Gaussian process governing the noise-free output value. We show that prior uncertainty about the parameters controlling both processes can be handled and that the posterior distribution of the noise rate can be sampled from using Markov chain Monte Carlo methods. Our results on a sythetic data set give a posterior noise variance that well-approximates the true variance.
R. B. Gramacy. tgp: An R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models. Journal of Statistical Software, 19, 2007.
Abstract: The tgp package for R is a tool for fully Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian processes with jumps to the limiting linear model. Special cases also implemented include Bayesian linear models, linear CART, stationary separable and isotropic Gaussian processes. In addition to inference and posterior prediction, the package supports the (sequential) design of experiments under these models paired with several objective criteria. 1-d and 2-d plotting, with higher dimension projection and slice capabilities, and tree drawing functions (requiring maptree and combinat packages), are also provided for visualization of tgp objects.
R. B. Gramacy, Herbert K. H. Lee, and William MacReady. Parameter space exploration with Gaussian process trees. In 21st International Conference on Machine Learning, pages 353-360. Omnipress and ACM Digital Library, 2004.
Abstract: Computer experiments often require dense sweeps over input parameters to obtain a qualitative understanding of their response. Such sweeps can be prohibitively expensive, and are unnecessary in regions where the response is easily predicted; well-chosen designs could allow a mapping of the response with far fewer simulation runs. Thus, there is a need for computationally inexpensive surrogate models and an accompanying method for selecting small designs. We explore a general methodology for addressing this need that uses non-stationary Gaussian processes. Binary trees partition the input space to facilitate non-stationarity and a Bayesian interpretation provides an explicit measure of predictive uncertainty that can be used to guide sampling. Our methods are illustrated on several examples, including a motivating example involving computational fluid dynamics simulation of a NASA reentry vehicle.
E. Meeds and S. Osindero. An alternative infinite mixture of Gaussian process experts. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 883-890. The MIT Press, Cambridge, MA, 2006.
Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective 'gating network' for the different experts.
A. O'Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, Series B, 40(1):1-42, 1978.
A. O'Hagan. Bayesian Inference, volume 2B of Kendall's Advanced Theory of Statistics. Arnold, London, 1994. chapter 10.48.
T. Plate. Accuracy versus interpretability in flexible modeling: implementing a tradeoff using Gaussian process models. Behaviourmetrika, 26(1):29-50, 1999.
Abstract: One of the widely acknowledged drawbacks of flexible statistical models is that the fitted models are often extremely difficult to interpret. However, if flexible models are constrained to be additive the fitted models are much easier to interpret, as each input can be considered independently. The problem with additive models is that they cannot provide an accurate model if the phenomenon being modeled is not additive. This paper shows that a tradeoff between accuracy and additivity can be implemented easily in Gaussian process models, which are a type of flexible model closely related to feedforward neural networks. One can fit a series of Gaussian process models that begins with the completely flexible and are constrained to be progressively more additive, and thus progressively more interpretable. Observations of how the degree of non-additivity and the test error change as the models become more additive give insight into the importance of interactions in a particular model. Fitted models in the series can also be interpreted graphically with a technique for visualizing the effects of inputs in non-additive models that was adapted from plots for generalized additive models. This visualization technique shows the overall effects of different inputs and also shows which inputs are involved in interactions and how strong those interactions are.
C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-linear Regression. PhD thesis, Department of Computer Science, University of Toronto, 1996.
Abstract: This thesis develops two Bayesian learning methods relying on Gaussian processes and a rigorous statistical approach for evaluating such methods. In these experimental designs the sources of uncertainty in the estimated generalisation performances due to both variation in training and test sets are accounted for. The framework allows for estimation of generalisation performance as well as statistical tests of significance for pairwise comparisons. Two experimental designs are recommended and supported by the DELVE software environment. Two new non-parametric Bayesian learning methods relying on Gaussian process priors over functions are developed. These priors are controlled by hyperparameters which set the characteristic length scale for each input dimension. In the simplest method, these parameters are fit from the data using optimization. In the second, fully Bayesian method, a Markov chain Monte Carlo technique is used to integrate over the hyperparameters. One advantage of these Gaussian process methods is that the priors and hyperparameters of the trained models are easy to interpret. The Gaussian process methods are benchmarked against several other methods, on regression tasks using both real data and data generated from realistic simulations. The experiments show that small datasets are unsuitable for benchmarking purposes because the uncertainties in performance measurements are large. A second set of experiments provide strong evidence that the bagging procedure is advantageous for the Multivariate Adaptive Regression Splines (MARS) method. The simulated datasets have controlled characteristics which make them useful for understanding the relationship between properties of the dataset and the performance of different methods. The dependency of the performance on available computation time is also investigated. It is shown that a Bayesian approach to learning in multi-layer perceptron neural networks achieves better performance than the commonly used early stopping procedure, even for reasonably short amounts of computation time. The Gaussian process methods are shown to consistently outperform the more conventional methods.
C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In T. G. Diettrich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. The MIT Press, 2002.
Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using a input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets - thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.
E. Snelson, C. E. Rasmussen, and Z. Ghahramani. Warped Gaussian processes. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. The MIT Press.
Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.
P. Sollich and C. K. I. Williams. Using the equivalent kernel to understand Gaussian process regression. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17. The MIT Press, 2005.
Abstract: The equivalent kernel [Silverman, 1984] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.
F. Vivarelli and C. K. I Williams. Discovering hidden features with Gaussian processes regression. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. The MIT Press, 1999.
Abstract: In Gaussian process regression the covariance between the outputs at input locations x and x' is usually assumed to depend on the distance (x-x')W(x-x'), where W is a positive definite matrix. W is often taken to be diagonal, but if we allow W to be a general positive definite matrix which can be tuned on the basis of training data, then an eigen-analysis of W shows that we are effectively creating hidden features, where the dimensionality of the hidden-feature space is determined by the data. We demonstrate the superiority of predictions using the general matrix over those based on a diagonal matrix on two test problems.
C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514-520. The MIT Press, Cambridge, MA, 1996.
Abstract: The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior over functions. We investigate the use of a Gaussian process prior over functions, which permits the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results.
H. Zhu, C. K. I. Williams, R. J. Rohwer, and M. Morciniec. Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop, editor, Neural Networks and Machine Learning. Springer-Verlag, Berlin, 1998. See also technical report NCRG/97/011, Aston University.
Abstract: The problem of regression under Gaussian assumptions is treated generally. The relationship between Bayesian prediction, regularization and smoothing is elucidated. The ideal regression is the posterior mean and its computation scales as O(n3), where n is the same size. We show that the optimal m-dimensional linear model under a given prior is spanned by the first m eigenfunctions of a covaraince operator, which is a trace-class operator. This is an infinite dimensional analogue of principal component analysis. The importance of Hilbert space methods to practical statistics is also discussed.
Y. Altun, T. Hofmann, and A. J. Smola. Gaussian process classification for segmenting and annotating sequences.. In C. E. Brodley, editor, Proceedings of the Twenty-first International Conference on Machine Learning (ICML 2004). ACM, 2004.
D. Barber and C. K. I. Williams. Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. The MIT Press.
Abstract: The full Bayesian method for applying neural networks to a prediction problem is to set up the prior/hyperprior structure for the net and then perform the necessary integrals. However, these integrals are not tractable analytically, and Markov Chain Monte Carlo (MCMC) methods are slow, especially if the parameter space is high-dimensional. Using Gaussian processes we can approximate the weight space integral analytically, so that only a small number of hyperparameters need be integrated over by MCMC methods. We have applied this idea to classification problems, obtaining excellent results on the real-world problems investigated so far.
N. Choudhuri, S. Ghosal, and A. Roy. Nonparametric binary regression using a Gaussian process prior. (unpublished), 2005.
L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther. Efficient approaches to Gaussian process classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. The MIT Press.
Abstract: We present three simple approximations for the calculation of the posterior mean in Gaussian Process classification. The first two methods are related to mean field ideas known in Statistical Physics. The third approach is based on Bayesan online approach which was motivated by recent results in the Statistical Mechanics of Neural Networks. We present simulation results showing: 1. that the mean field Bayesian evidence may be used for hyperparameter tuning and 2. that the online approach may achieve a low training error fast.
L. Csató and M. Opper. Sparse representation for Gaussian process models. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. The MIT Press.
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world datasets indicate that efficiency of the approach.
L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(2):641-669, 2002.
Abstract: We develop an approach for sparse representations of gaussian process (GP) models (which are Bayesian types of kernel machines) in order to overcome their limitations for large data sets. The method is based on a combination of a Bayesian on-line algorithm, together with a sequential construction of a relevant subsample of the data that fully specifies the prediction of the GP model. By using an appealing parameterization and projection techniques in a reproducing kernel Hilbert space, recursions for the effective parameters and a sparse gaussian approximation of the posterior process are obtained. This allows for both a propagation of predictions and Bayesian error measures. The significance and robustness of our approach are demonstrated on a variety of experiments.
M. N. Gibbs and D. J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458-1464, 2000.
Abstract: Gaussian processes are a promising non-linear interpolation tool [williams-95,Williams and Rasmussen 1996], but it is not straightforward to solve classification problems with them. In this paper the variational methods of [jaakkola-jordan-96] are applied to Gaussian processes to produce an efficient Bayesian binary classifier.
M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790-1817, 2006.
Abstract: It is well known in the statistics literature that augmenting binary and polychotomous response models with Gaussian latent variables enables exact Bayesian analysis via Gibbs sampling from the parameter posterior. By adopting such a data augmentation strategy, dispensing with priors over regression coefficients in favour of Gaussian Process (GP) priors over functions, and employing variational approximations to the full posterior we obtain efficient computational methods for Gaussian Process classification in the multi-class setting. The model augmentation with additional latent variables ensures full a posteriori class coupling whilst retaining the simple a priori independent GP covariance structure from which sparse approximations, such as multi-class Informative Vector Machines (IVM), emerge in a very natural and straightforward manner. This is the first time that a fully Variational Bayesian treatment for multi-class GP classification has been developed without having to resort to additional explicit approximations to the non-Gaussian likelihood term. Empirical comparisons with exact analysis via MCMC and Laplace approximations illustrate the utility of the variational approximation as a computationally economic alternative to full MCMC and it is shown to be more accurate than the Laplace approximation.
A. Kapoor, K. Grauman, R. Urtasun, and T. Darell. Active learning with Gaussian processes for object categorization. In Proceedings of the International Conference in Cmputer Vision, 2007.
Abstract: Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) are powerful regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions defined based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. The uncertainty model provided by GPs offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We derive a novel active category learning method based on our probabilistic regression model, and show that a significant boost in classification performance is possible, especially when the amount of training data for a category is ultimately very small.
H.-C. Kim and Z. Ghahramani. The EM-EP algorithm for Gaussian process classification. In Proceedings of the Workshop on Probabilistic Graphical Models for Classification (at ECML), 2003.
Abstract: Gaussian process classifiers (GPCs) are fully statistical kernel classification models derived from Gaussian processes for regression. In GPCs, the probability of belonging to a certain class at an input location is monotonically related to the value of some latent function at that location. Starting from a prior over this latent function, the data are used to infer both the posterior over the latent function and the values of hyperparameters determining various aspects of the function. GPCs can also be viewed as graphical models with latent variables. Based on the work of [Minka 2001, Opper and Winther 2000], we present an approximate EM algorithm, the EM-EP algorithm for learning both the latent function and the hyperparameters of a GPC. The algorithm alternates the following steps until convergence. In the E-step, given the hyperparameters, a density for the latent variables defining the latent function is computed via the Expectation-Propagation (EP) algorithm [Minka 2001, Opper and Winther 2000]. In the M-step, given the density for the latent values, the hyperparameters are selected to maximize a variational lower bound on the marginal likelihood (i.e. the model evidence). This algorithm is found to converge in practice and provides an efficient Bayesian framework for learning hyperparameters of the kernel. We examine the role of various different hyperparameters which model labeling errors, the lengthscales (i.e. relevances) of different features, and sharpness of the decision boundary. The added flexibility these provide results in signicantly improved performance. Experimental results on synthetic and real data sets show that the EM-EP algorithm works well, with GPCs giving equal or better performance than support vector machines (SVMs) on all data sets tested.
M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704, 2005.
Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace's method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace's method.
M. Kuss and C. E. Rasmussen. Assessing approximations for Gaussian process classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 699-706, Cambridge, MA, 2006. The MIT Press.
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace's method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.
N. D. Lawrence and M. I. Jordan. Semi-supervised learning via Gaussian processes. In L. K. Saul, Y. Weiss, and Bottou L, editors, Advances in Neural Information Processing Systems 17, pages 753-760, Cambridge, MA, 2005. The MIT Press.
Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a "null category noise model" (NCNM) inspired by ordered cate- gorical noise models. The noise model re ects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits.
R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475-501. Oxford University Press, 1998.
Abstract: Gaussian processes are a natural way of specifying prior distributions over functions of one or more input variables. When such a function defines the mean response in a regression model with Gaussian errors, inference can be done using matrix computations, which are feasible for datasets of up to about a thousand cases. The covariance function of the Gaussian process can be given a hierarchical prior, which allows the model to discover high-level properties of the data, such as which inputs are relevant to predicting the response. Inference for these covariance hyperparameters can be done using Markov chain sampling. Classification models can be defined using Gaussian processes for underlying latent values, which can also be sampled within the Markov chain. Gaussian processes are in my view the simplest and most obvious way of defining flexible Bayesian regression and classification models, but despite some past usage, they appear to have been rather neglected as a general-purpose technique. This may be partly due to a confusion between the properties of the function being modeled and the properties of the best predictor for this unknown function.
M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655-2684, 2000.
Abstract: We derive a mean-field algorithm for binary classification with gaussian processes that is based on the TAP approach originally proposed in statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error, which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler "naive" mean-field theory and support vector machines (SVMs) as limiting cases. For both mean-field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show that one may get state-of-the-art performance by using the leave-one-out estimator for model selection and the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The second result is taken as strong support for the internal consistency of the mean-field approach.
M. Opper and O. Winther. Mean field methods for classification with Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 309-315, Cambridge, MA, 1999. The MIT Press.
Abstract: We discuss the application of TAP mean field methods known from the Statistical Mechanics of diordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowledge about the distribution of inputs is needed. Simulation results for the Sonar data set are given.
T. Peña Centeno and N. D. Lawrence. Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research, 7:455-491, 2006.
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher's discriminant analysis. We relate Rayleigh's coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the 'regression to the labels' assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher's discriminant, and with the incorporation of a prior we can apply Bayes' rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher's discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233-269, 2002.
Abstract: Approximate Bayesian Gaussian process (GP) classification techniques are powerful non-parametric learning methods, similar in appearance and performance to support vector machines. Based on simple probabilistic models, they render interpretable results and can be embedded in Bayesian frameworks for model selection, feature selection, etc. In this paper, by applying the PAC-Bayesian theorem of McAllester (1999a), we prove distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques. We also provide a new and much simplified proof for this powerful theorem, making use of the concept of convex duality which is a backbone of many machine learning techniques. We instantiate and test our bounds for two particular GPC techniques, including a recent sparse method which circumvents the unfavourable scaling of standard GP algorithms. As is shown in experiments on a real-world task, the bounds can be very tight for moderate training sample sizes. To the best of our knowledge, these results provide the tightest known distribution-free error bounds for approximate Bayesian GPC methods, giving a strong learning-theoretical justification for the use of these techniques.
M. Seeger and M. I. Jordan. Sparse Gaussian process classification with multiple classes. Technical Report TR 661, Department of Statistics, University of California at Berkeley, 2004.
Abstract: Sparse approximations to Bayesian inference for nonparametric Gaussian Process models scale linearly in the number of training points, allowing for the application of these powerful kernel-based models to large datasets. We show how to generalize the binary classification Informative Vector Machine (IVM) to multiple classes. In contrast to earlier efficient approaches to kernel-based non-binary classification, our method is a principled approximation to Bayesian inference which yields valid uncertainty estimates and allows for hyperparameter adaption via marginal likelihood maximization. While most earlier proposals suggest fitting independent binary discriminants to heuristically chosen partitions of the data and combining these in a heuristic manner, our method operates jointly on the data for all classes. Crucially, we still achieve a linear scaling in both the number of classes and the number of training points.
R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In 24th International Conference on Machine Learning, 2007.
Abstract: Supervised learning is dicult with high dimensional input spaces and very small training sets, but accurate classcation may be possible if the data lie on a low-dimensional manifold. Gaussian Process Latent Variable Models can discover low dimensional manifolds given only a small number of examples, but learn a latent space without regard for class labels. Existing methods for discriminative manifold learning (e.g., LDA, GDA) do constrain the class distribution in the latent space, but are generally deterministic and may not generalize well with limited training data. We introduce a method for Gaussian Process Classcation using latent variable models trained with discriminative priors over the latent space, which can learn a discriminative latent space from a small training set.
C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998. Accompanying code.
Abstract: We consider the problem of assigning an input vector x to one of m classes by predicting P(c|x) for c=1,...,m. For a two-class problem, the probability of class 1 given x is esimated by σ(y(x)), where σ(y)=1/(1+e-y). A Gaussian process prior is placed on y(x), and is combined with the training data to obtain predictions for the x points. We provide a Bayesian treatment, integrating over uncertainty in y and in the parameters that control the Gaussian process prior; the necessary integration over y is carried out using Laplace's approximation. The method is generalized to multi-class problems (m>2) using the softmax function. We demonstrate the effectiveness of the method on a number of datasets.
P. Abrahamsen. A review of Gaussian random fields and correlation functions. Technical Report 917, Norwegian Computing Center, Oslo, 1997.
R. J. Adler. The Geometry of Random Fields. John Wiley & Sons, Chichester, 1981.
J. L. Doob. The elementary Gaussian processes. Annals of Mathematical Statistics, 15(3):229-282, 1944.
M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Department of Physics, University of Cambridge, 1997.
Abstract: Bayesian inference offers us a powerful tool with which to tackle the problem of data modelling. However the performance of Bayesian methods is crucially dependent on being able to find good models for our data. The principal focus of this thesis is the development of models based on Gaussian process priors. Such models, which can be thought of as the infinite extension of several existing finite models have the flexibility to model complex phenomena while being mathematically simple. In thesis, I present a review of the theory of Gaussian processes and their covariance functions and demonstrate how they fit into the Bayesian framework. The efficient implementation of a Gaussian process is discussed with particular reference to approximate methods for matrix inversion based on the work of Skilling (1993). Several regression problems are examined. Non-stationary covariance functions are developed for the regression of neuron spike data and the use of Gaussian processes to model the potential energy surfaces of weakly bound molecules is discussed. Classification methods based on Gaussian processes are implemented using variational methods. Existing bounds (Jaakkola and Jordan 1996) for the sigmoid function are used to tackle binary problems and multi-dimensional bounds on the softmax function are presented for the multiple class case. The performance of the variational classifier is compared with that of other methods using the CRABS and PIMA datasets (Ripley 1996) and the problem of predicting the cracking of welds based on their chemical composition is also investigated. The theoretical calculation of the density of states of crystal structures is discussed in detail. Three possible approaches to the problem are described based on free energy minimization, Gaussian processes and the theory of random matrices. Results from these approaches are compared with the state-of-the-art techniques (Pickard 1997).
G. Matheron. The intrinsic random functions and their applications. Advances in Applied Probability, 5:439-468, 1973.
W. Meiring, P. Monestiez, P. D. Sampson, and Guttorp. P. Developments in the modelling of nonstationary spatial covariance structure for space-time monitoring data. In E. Y. Baafi and N. Schofield, editors, Geostatistics Wollongong '96, volume 1, pages 162-173. Kluwer, 1997.
R. M. Neal. Bayesian Learning for Neural Networks. Springer, New York, 1996. chapter 2.
Abstract: Artificial ``neural networks'' are now widely used as flexible models for regression and classification applications, but questions remain regarding what these models mean, and how they can safely be used when training data is limited. Bayesian Learning for Neural Networks shows that Bayesian methods allow complex neural network models to be used without fear of the ``overfitting'' that can occur with traditional neural network learning methods. Insight into the nature of these complex Bayesian models is provided by a theoretical investigation of the priors over functions that underlie them. Use of these models in practice is made possible using Markov chain Monte Carlo techniques. Both the theoretical and computational aspects of this work are of wider statistical interest, as they contribute to a better understanding of how Bayesian methods can be applied to complex problems. Presupposing only basic knowledge of probability and statistics, this book should be of interest to many researchers in Statistics, Engineering, and Artificial Intelligence. Software for Unix systems that implements the methods described is freely available over the Internet.
Comment: Gaussian processes are not the main topic of this book but, chapter 2 entitled "Priors for Infinite Networks" contains a characterization of Gaussian and non-Gaussian limits of priors over functions generated from neural networks. See also Williams 1998.
C. J. Paciorek and M. J. Schervish. Nonstationary covariance functions for Gaussian process regression. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. The MIT Press, 2004.
P. D. Sampson and P. Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108-119, 1992.
A. M. Schmidt and A. O'Hagan. Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society B, 65(3):745-758, 2003.
I. J. Schoenberg. Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3):522-536, 1938.
C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10:1203-1216, 1998.
Abstract: For neural networks with a wide class of weight priors, it can be shown that in the limit of an infinite number of hidden units, the prior over functions tends to a Gaussian process. In this article, analytic forms are derived for the covariance function of the Gaussian processes corresponding to networks with sigmoidal and gaussian hidden units. This allows predictions to be made efficiently using networks with an infinite number of hidden units and shows, somewhat paradoxically, that it may be easier to carry out Bayesian prediction with infinite networks rather than finite ones.
A. M. Yaglom. Stationary Random Functions. Prentice-Hall, Englewood Cliffs, NJ, 1962.
D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Compuration, 11(5):1035-1068, 1999.
K. V. Mardia and R. J. Marshall. Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika, 71(1):135-146, 1984.
Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In C. E. Brodley, editor, Proceedings of Twenty-first International Conference on Machine Learning, 2004.
A. Schwaighofer, V. Tresp, and K. Yu. Learning Gaussian process kernels via hierarchical bayes. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA, 2005. The MIT Press.
M. Seeger. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 603-609, Cambridge, MA, 2000. The MIT Press.
M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233-269, 2002.
Abstract: Approximate Bayesian Gaussian process (GP) classification techniques are powerful non-parametric learning methods, similar in appearance and performance to support vector machines. Based on simple probabilistic models, they render interpretable results and can be embedded in Bayesian frameworks for model selection, feature selection, etc. In this paper, by applying the PAC-Bayesian theorem of McAllester (1999a), we prove distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques. We also provide a new and much simplified proof for this powerful theorem, making use of the concept of convex duality which is a backbone of many machine learning techniques. We instantiate and test our bounds for two particular GPC techniques, including a recent sparse method which circumvents the unfavourable scaling of standard GP algorithms. As is shown in experiments on a real-world task, the bounds can be very tight for moderate training sample sizes. To the best of our knowledge, these results provide the tightest known distribution-free error bounds for approximate Bayesian GPC methods, giving a strong learning-theoretical justification for the use of these techniques.
P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46(1-3):21-52, 2002.
Abstract: I describe a framework for interpreting Support Vector Machines (SVMs) as maximum a posteriori (MAP) solutions to inference problems with Gaussian Process priors. This probabilistic interpretation can provide intuitive guidelines for choosing a lsquogoodrsquo SVM kernel. Beyond this, it allows Bayesian methods to be used for tackling two of the outstanding challenges in SVM classification: how to tune hyperparameters\u2014the misclassification penalty C, and any parameters specifying the ernel\u2014and how to obtain predictive class probabilities rather than the conventional deterministic class label predictions. Hyperparameters can be set by maximizing the evidence; I explain how the latter can be defined and properly normalized. Both analytical approximations and numerical methods (Monte Carlo chaining) for estimating the evidence are discussed. I also compare different methods of estimating class probabilities, ranging from simple evaluation at the MAP or at the posterior average to full averaging over the posterior. A simple toy application illustrates the various concepts and techniques.
S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes,. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. The MIT Press.
S. Sundararajan and S. Sathiya Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13:1103-1118, 2001.
F. Vivarelli and C. K. I Williams. Discovering hidden features with Gaussian processes regression. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. The MIT Press, 1999.
Abstract: In Gaussian process regression the covariance between the outputs at input locations x and x' is usually assumed to depend on the distance (x-x')W(x-x'), where W is a positive definite matrix. W is often taken to be diagonal, but if we allow W to be a general positive definite matrix which can be tuned on the basis of training data, then an eigen-analysis of W shows that we are effectively creating hidden features, where the dimensionality of the hidden-feature space is determined by the data. We demonstrate the superiority of predictions using the general matrix over those based on a diagonal matrix on two test problems.
G. Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society, Series B, 40:364-372, 1978.
L. Csató. Gaussian Processes - Iterative Sparse Approximations. PhD thesis, Neural Computing Research Group, Aston University, 2002.
L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther. Efficient approaches to Gaussian process classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. The MIT Press.
Abstract: We present three simple approximations for the calculation of the posterior mean in Gaussian Process classification. The first two methods are related to mean field ideas known in Statistical Physics. The third approach is based on Bayesan online approach which was motivated by recent results in the Statistical Mechanics of Neural Networks. We present simulation results showing: 1. that the mean field Bayesian evidence may be used for hyperparameter tuning and 2. that the online approach may achieve a low training error fast.
L. Csató and M. Opper. Sparse representation for Gaussian process models. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. The MIT Press.
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world datasets indicate that efficiency of the approach.
L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(2):641-669, 2002.
Abstract: We develop an approach for sparse representations of gaussian process (GP) models (which are Bayesian types of kernel machines) in order to overcome their limitations for large data sets. The method is based on a combination of a Bayesian on-line algorithm, together with a sequential construction of a relevant subsample of the data that fully specifies the prediction of the GP model. By using an appealing parameterization and projection techniques in a reproducing kernel Hilbert space, recursions for the effective parameters and a sparse gaussian approximation of the posterior process are obtained. This allows for both a propagation of predictions and Bayesian error measures. The significance and robustness of our approach are demonstrated on a variety of experiments.
G. Ferrari Trecate, C. K. I. Williams, and M. Opper. Finite-dimensional approximation of Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 218-224. The MIT Press, 1999.
M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Department of Physics, University of Cambridge, 1997.
Abstract: Bayesian inference offers us a powerful tool with which to tackle the problem of data modelling. However the performance of Bayesian methods is crucially dependent on being able to find good models for our data. The principal focus of this thesis is the development of models based on Gaussian process priors. Such models, which can be thought of as the infinite extension of several existing finite models have the flexibility to model complex phenomena while being mathematically simple. In thesis, I present a review of the theory of Gaussian processes and their covariance functions and demonstrate how they fit into the Bayesian framework. The efficient implementation of a Gaussian process is discussed with particular reference to approximate methods for matrix inversion based on the work of Skilling (1993). Several regression problems are examined. Non-stationary covariance functions are developed for the regression of neuron spike data and the use of Gaussian processes to model the potential energy surfaces of weakly bound molecules is discussed. Classification methods based on Gaussian processes are implemented using variational methods. Existing bounds (Jaakkola and Jordan 1996) for the sigmoid function are used to tackle binary problems and multi-dimensional bounds on the softmax function are presented for the multiple class case. The performance of the variational classifier is compared with that of other methods using the CRABS and PIMA datasets (Ripley 1996) and the problem of predicting the cracking of welds based on their chemical composition is also investigated. The theoretical calculation of the density of states of crystal structures is discussed in detail. Three possible approaches to the problem are described based on free energy minimization, Gaussian processes and the theory of random matrices. Results from these approaches are compared with the state-of-the-art techniques (Pickard 1997).
M. N. Gibbs and D. J. C. MacKay. Efficient implementation of Gaussian processes. Technical report, Department of Physics, Cavendish Laboratory, Cambridge University, 1997.
Abstract: Neural networks and Bayesian inference provide a useful framework within which to solve regression problems. However their parameterisation means that the Bayesian analysis of neural networks can be difficult. In this paper, we investigate a method for regression using Gaussian Process Priors which allows exact Bayesian analysis using matrix manipulation for fixed values of hyperparameters. We discuss in detail the workings of the method and we detail a range of mathematical and numerical techniques that are useful in applying Gaussian Processes to general problems including efficient approximate matrix inversion methods developed by Skilling.
S. S. Keerthi and W. Chu. A matching pursuit approach to sparse Gaussian process regression. In Advances in Neural Information Processing Systems 18, 2006.
Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions.
N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 609-616, Cambridge, MA, 2003. The MIT Press.
T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
T. P. Minka. Expectation propagation for approximate Bayesian inference. In J. S. Breese and D. Koller, editors, Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362-369. Morgan Kaufmann, 2001.
R. M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Department of Statistics, University of Toronto, 1997.
Abstract: Gaussian processes are a natural way of defining prior distributions over functions of one or more input variables. In a simple nonparametric regression problem, where such a function gives the mean of a Gaussian distribution for an observed response, a Gaussian process model can easily be implemented using matrix computations that are feasible for datasets of up to about a thousand cases. Hyperparameters that define the covariance function of the Gaussian process can be sampled using Markov chain methods. Regression models where the noise has a t distribution and logistic or probit models for classification applications can be implemented by sampling as well for latent values underlying the observations. Software is now available that implements these methods using covariance functions with hierarchical parameterizations. Models defined in this way can discover high-level properties of the data, such as which inputs are relevant to predicting the response.
R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475-501. Oxford University Press, 1998.
Abstract: Gaussian processes are a natural way of specifying prior distributions over functions of one or more input variables. When such a function defines the mean response in a regression model with Gaussian errors, inference can be done using matrix computations, which are feasible for datasets of up to about a thousand cases. The covariance function of the Gaussian process can be given a hierarchical prior, which allows the model to discover high-level properties of the data, such as which inputs are relevant to predicting the response. Inference for these covariance hyperparameters can be done using Markov chain sampling. Classification models can be defined using Gaussian processes for underlying latent values, which can also be sampled within the Markov chain. Gaussian processes are in my view the simplest and most obvious way of defining flexible Bayesian regression and classification models, but despite some past usage, they appear to have been rather neglected as a general-purpose technique. This may be partly due to a confusion between the properties of the function being modeled and the properties of the best predictor for this unknown function.
M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655-2684, 2000.
Abstract: We derive a mean-field algorithm for binary classification with gaussian processes that is based on the TAP approach originally proposed in statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error, which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler "naive" mean-field theory and support vector machines (SVMs) as limiting cases. For both mean-field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show that one may get state-of-the-art performance by using the leave-one-out estimator for model selection and the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The second result is taken as strong support for the internal consistency of the mean-field approach.
M. Opper and O. Winther. Mean field methods for classification with Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 309-315, Cambridge, MA, 1999. The MIT Press.
Abstract: We discuss the application of TAP mean field methods known from the Statistical Mechanics of diordered systems to Bayesian classification models with Gaussian processes. In contrast to previous approaches, no knowledge about the distribution of inputs is needed. Simulation results for the Sonar data set are given.
J. Quiñonero-Candela. Learning with Uncertainty-Gaussian Processes and Relevance Vector Machines. PhD thesis, Informatics and Mathematical Modelling, Technical Univeristy of Denmark, 2004.
J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935-1959, 12 2005.
Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.
Comment: Errata: there is a mistake in computational complexity for taking the derivative of the log marginal likelihood wrt all elements of Xu, stated as O(dnm2), in line -12 on page 1953. A more careful derivation reduces this to O(dnm+nm2).
Joaquin Quiñonero-Candela and Ole Winther. Incremental Gaussian processes. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. The MIT Press.
A. Schwaighofer and V. Tresp. Transductive and inductive methods for approximate Gaussian process regression. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. The MIT Press, 2003.
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, Institute of Adaptive and Neural Computation, University of Edinburgh, 2003.
M. Seeger, C. K. I. Williams, and N. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In C.M. Bishop and B. J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003.
Y. Shen, A. Ng, and M. Seeger. Fast Gaussian process regression using KD-trees. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1227-1234. The MIT Press, Cambridge, MA, 2006.
A. J. Smola and P. L. Bartlett. Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Diettrich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619-625. The MIT Press, 2001.
E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2007.
Abstract: Gaussian process (GP) models are widely
used to perform Bayesian nonlinear regression and classification—tasks that
are central to many machine learning problems. A GP is nonparametric, meaning
that the complexity of the model grows as more data points are received.
Another attractive feature is the behaviour of the error bars. They naturally
grow in regions away from training data where we have high uncertainty about
the interpolating function.
In their standard form GPs have several
limitations, which can be divided into two broad categories: computational
difficulties for large data sets, and restrictive modelling assumptions for
complex data sets. This thesis addresses various aspects of both of these
problems.
The training cost for a GP has O(N3) complexity,
where N is the number of training data points. This is due to an inversion of
the N × N covariance matrix. In this thesis we develop several new
techniques to reduce this complexity to O(NM2), where M is a user
chosen number much smaller than N. The sparse approximation we use is based
on a set of M ‘pseudo-inputs’ which are optimised together with
hyperparameters at training time. We develop a further approximation based on
clustering inputs that can be seen as a mixture of local and global
approximations.
Standard GPs assume a uniform noise variance. We use our
sparse approximation described above as a way of relaxing this assumption. By
making a modification of the sparse covariance function, we can model input
dependent noise. To handle high dimensional data sets we use supervised
linear dimensionality reduction. As another extension of the standard GP, we
relax the Gaussianity assumption of the process by learning a nonlinear
transformation of the output space. All these techniques further increase the
applicability of GPs to real complex data sets.
We present empirical
comparisons of our algorithms with various competing techniques, and suggest
problem dependent strategies to follow in practice.
Comment: See also the matlab implementation spgp.
E. Snelson and Z Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1259-1266. The MIT Press, Cambridge, MA, 2006.
Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M<<N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M2N) training cost and O(M2) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M, i.e. very sparse solutions, and it significantly outperforms other approaches in this regime.
V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719-2741, 2000.
Abstract: The Bayesian committee machine (BCM) is a novel approach to combining estimators that were trained on different data sets. Although the BCM can be applied to the combination of any kind of estimators, the main foci are gaussian process regression and related systems such as regularization networks and smoothing splines for which the degrees of freedom increase with the number of training data. Somewhat surprisingly, we find that the performance of the BCM improves if several test points are queried at the same time and is optimal if the number of test points is at least as large as the degrees of freedom of the estimator. The BCM also provides a new solution for on-line learning with potential applications to data mining. We apply the BCM to systems with fixed basis functions and discuss its relationship to gaussian process regression. Finally, we show how the ideas behind the BCM can be applied in a non-Bayesian setting to extend the input-dependent combination of estimators.
M. J. Wainwright, E. B. Sudderth, and A. S. Willsky. Tree-based modeling and estimation of Gaussian processes on graphs with cycles. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. The MIT Press.
C. K. I. Williams, C. E. Rasmussen, A. Schwaighofer, and V. Tresp. Observations on the Nyström method for Gaussian process prediction. Technical report, University of Edinburgh, 2002.
C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Diettrich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682-688. The MIT Press, 2001.
O. Winther. Bayesian Mean Field Algorithms for Neural Networks and Gaussian Processes. PhD thesis, University of Copenhagen, 1988.
S. A. Wood, W. Jiang, and M. Tanner. Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika, 89(3):513-528, 2002.
D. Barry. Nonparametric Bayesian regression. The Annals of Statistics, 14(3):934-953, 1986.
B. J. N. Blight and L. Ott. A Bayesian approach to model inadequacy for polynomial regression. Biometrika, 62(1):79-88, 1975.
N. A. C. Cressie. Statistics for Spatial Data. John Wiley & Sons, New York, 1993.
P. J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-based geostatistics (with discussion). Applied Statistics, 47:299-350, 1998.
M. S. Handcock and M. L. Stein. A Bayesian analysis of kriging. Technometrics, 35(4):403-410, 1993.
J. T. Kent and K. V. Mardia. The link between Kriging and thin-plate splines. In F. P. Kelly, editor, Probability, Statsitics and Optimization, pages 325-339. Wiley, 1994.
D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52(6):119-139, 1951.
G. M. Laslett. Kriging and splines: An empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89(426):391-409, 1994.
A. O'Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, Series B, 40(1):1-42, 1978.
B. W. Silverman. Spline smoothing: The equivalent variable kernel method. Annals of Statistics, 12(3):898-916, 1984.
B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion). Journal of the Royal Statistical Society, Series B, 47(1):1-52, 1985.
M. L. Stein. A kernel approximation to the kriging predictor of a spatial process. Ann. Inst. Statist. Math, 43(1):61-75, 1991.
M. L. Stein. Interpolation of Spatial Data. Springer, New York, 1999.
G. Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society, Series B, 40:364-372, 1978.
S. J. Yakowitz and F. Szidarovszky. A comparison of kriging with nonparametric regression methods. Journal of Multivariate Analysis, 16:21-53, 1985.
A. S. Young. A Bayesian approach to prediction using polynomials. Biometrika, 64(2):309-317, 1977.
T. Choi and M. J. Schervish. Posterior consistency in nonparametric regression problems under Gaussian process priors. Technical Report 809, Department of Statistics, CMU, 2004.
S. Kakade, M. Seeger, and P. Foster. Worst-case bounds for Gaussian process models. In Advances in Neural Information Processing Systems 18. The MIT Press, 2006.
Abstract: We present a competitive analysis of some non-parametric Bayesian algorithms in a worst-case online learning setting, where no probabilistic assumptions about the generation of the data are made. We consider models which use a Gaussian process prior (over the space of all functions) and provide bounds on the regret (under the log loss) for commonly used non-parametric Bayesian algorithms-including Gaussian regression and logistic regression-which show how these algorithms can perform favorably under rather general conditions. These bounds explicitly handle the infinite dimensionality of these non-parametric classes in a natural way. We also make formal connections to the minimax and minimum description length (MDL) framework. Here, we show precisely how Bayesian Gaussian regression is a minimax strategy.
D. Malzahn and M. Opper. A variational approach to learning curves. In T. G. Diettrich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. The MIT Press, 2002.
M. Opper and F. Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302-308. The MIT Press, 1999.
P. Sollich. Approximate learning curves for Gaussian processes. In ICANN99 - Ninth International Conference on Artificial Neural Networks, pages 437-442, London, 1999. IEE.
P. Sollich. Learning curves for Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Neural Information Processing Systems, Vol. 11. The MIT Press, 1999.
P. Sollich and A. Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14:1393-1428, 2002.
Abstract: We consider the problem of calculating learning curves (i.e., average generalization performance) of gaussian processes used for regression. On the basis of a simple expression for the generalization error, in terms of the eigenvalue decomposition of the covariance function, we derive a number of approximation schemes. We identify where these become exact and compare with existing bounds on learning curves; the new approximations, which can be used for any input space dimension, generally get substantially closer to the truth. We also study possible improvements to our approximations. Finally, we use a simple exactly solvable learning scenario to show that there are limits of principle on the quality of approximations and bounds expressible solely in terms of the eigenvalue spectrum of the covariance function.
I Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. on Information Theory, 51(1):128-142, 2005.
C. K. I. Williams and F. Vivarelli. Upper and lower bounds on the learning curve for Gaussian proccesses. Machine Learning, 40:77-102, 2000.
Abstract: In this paper we introduce and illustrate non-trivial upper and lower bounds on the learning curves for one-dimensional Guassian Processes. The analysis is carried out emphasising the effects induced on the bounds by the smoothness of the random process described by the Modified Bessel and the Squared Exponential covariance functions. We present an explanation of the early, linearly-decreasing behavior of the learning curves and the bounds as well as a study of the asymptotic behavior of the curves. The effects of the noise level and the lengthscale on the tightness of the bounds are also discussed.
M. G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2:299-312, 2001.
Abstract: In this paper, we present classes of kernels for machine learning from a statistics perspective. Indeed, kernels are positive definite functions and thus also covariances. After discussing key properties of kernels, as well as a new formula to construct kernels, we present several important classes of kernels: anisotropic stationary kernels, isotropic stationary kernels, compactly supported kernels, locally stationary kernels, nonstationary kernels, and separable nonstationary kernels. Compactly supported kernels and separable nonstationary kernels are of prime interest because they provide a computational reduction for kernel-based methods. We describe the spectral representation of the various classes of kernels and conclude with a discussion on the characterization of nonlinear maps that reduce nonstationary kernels to either stationarity or local stationarity.
G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495-502, 1970.
G. Wahba. Spline Models for Observational Data, volume 59. Society for Industrial and Applied Mathematics, Philadelphia, 1990.
Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, Hebrew University, Jerusalem, Israel, April 2005.
Abstract: Machine Learning is a field of research aimed at constructing intelligent machines that gain and improve their skills by learning and adaptation. As such, Machine Learning research addresses several classes of learning problems, including for instance, supervised and unsupervised learning. Arguably, the most ubiquitous and realistic class of learning problems, faced by both living creatures and artificial agents, is known as Reinforcement Learning. Reinforcement Learning problems are characterized by a long-term interaction between the learning agent and a dynamic, unfamiliar, uncertain, possibly even hostile environment. Mathematically, this interaction is modeled as a Markov Decision Process (MDP). Probably the most significant contribution of this thesis is in the introduction of a new class of Reinforcement Learning algorithms, which leverage the power of a statistical set of tools known as Gaussian Processes. This new approach to Reinforcement Learning offers viable solutions to some of the major limitations of current Reinforcement Learning methods, such as the lack of confidence intervals for performance predictions, and the difficulty of appropriately reconciling exploration with exploitation. Analysis of these algorithms and their relationship with existing methods also provides us with new insights into the assumptions underlying some of the most popular Reinforcement Learning algorithms to date.
Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In T. Fawcett and N. Mishra, editors, 20th International Conference on Machine Learning. AAAI Press, 2003.
Abstract: We present a novel Bayesian approach to the problem of value function estimation in con- tinuous state spaces. We derne a probabilistic generative model for the value function by imposing a Gaussian prior over value functions and assuming a Gaussian noise model. Due to the Gaussian nature of the random processes involved, the posterior distribution of the value function is also Gaussian and is therefore described entirely by its mean and covariance. We derive exact expressions for the posterior process moments, and utilizing an ecient sequential sparsfication method, we describe an on-line algorithm for learning them. We demonstrate the operation of the algorithm on a 2-dimensional continuous spatial navigation domain.
Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In 22nd International Conference on Machine Learning, pages 201-208, Bonn, Germany, August 2005.
Abstract: Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.
Y. Engel, P. Szabo, and D. Volkinshtein. Learning to control an octopus arm with Gaussian process temporal difference methods. In Yair Weiss, Bernhard Schölkopf, and John C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 347-354, Cambridge, MA, U.S.A., 2006. The MIT Press.
Abstract: The Octopus arm is a highly versatile and complex limb. How the Octopus controls such a hyper-redundant arm (not to mention eight of them!) is as yet unknown. Robotic arms based on the same mechanical principles may render present day robotic arms obsolete. In this paper, we tackle this control problem using an online reinforcement learning algorithm, based on a Bayesian approach to policy evaluation known as Gaussian process temporal difference (GPTD) learning. Our substitute for the real arm is a computer simulation of a 2-dimensional model of an Octopus arm. Even with the simplifications inherent to this model, the state space we face is a high-dimensional one. We apply a GPTDbased algorithm to this domain, and demonstrate its operation on several learning tasks of varying degrees of difficulty.
M. Ghavamzadeh and Y. Engel. Bayesian policy gradient algorithms. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 457-464, Cambridge, MA, U.S.A., 2007. The MIT Press.
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost.
M. Ghavamzadeh and Y. Engel. Bayesian actor-critic algorithms. In 24th International Conference on Machine Learning, pages 297-304, Corvallis, OR, U.S.A., June 2007.
Abstract: We present a new actor-critic learning model in which a Bayesian class of non-parametric critics, using Gaussian process temporal difference learning is used. Such critics model the state-action value function as a Gaussian process, allowing Bayes' rule to be used in computing the posterior distribution over state-action value functions, conditioned on the observed data. Appropriate choices of the prior covariance (kernel) between state-action values and of the parametrization of the policy allow us to obtain closed-form expressions for the posterior distribution of the gradient of the average discounted return with respect to the policy parameters. The posterior mean, which serves as our estimate of the policy gradient, is used to update the policy, while the posterior covariance allows us to gauge the reliability of the update.
J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and B. Likar. Predictive control with Gaussian process models. In B. Zajc and M. Tkal, editors, Proceedings of IEEE Region 8 Eurocon 2003: Computer as a Tool, pages 352-356, Piscataway, 2003. IEEE.
Abstract: This paper describes model-based predictive control based on Gaussian processes.Gaussian process models provide a probabilistic non-parametric modelling approach for black-box identification of non-linear dynamic systems. It offers more insight in variance of obtained model response, as well as fewer parameters to determine than other models. The Gaussian processes can highlight areas of the input space where prediction quality is poor, due to the lack of data or its complexity, by indicating the higher variance around the predicted mean. This property is used in predictive control, where optimisation of control signal takes the variance information into account. The predictive control principle is demonstrated on a simulated example of nonlinear system.
C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. The MIT Press, Cambridge, MA, 2004.
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019-1041, 2005.
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach.
W. Chu and Z Ghahramani. Preference learning with Gaussian processe. In 22nd International Conference on Machine Learning, 2005.
Abstract: In this paper, we propose a probabilistic kernel approach to preference learning based on Gaussian processes. A new likelihood function is proposed to capture the preference relations in the Bayesian framework. The generalized formulation is also applicable to tackle many multiclass problems. The overall approach has the advantages of Bayesian methods for model selection and probabilistic prediction. Experimental results compared against the constraint classification approach on several benchmark datasets verify the usefulness of this algorithm.
R. Der and D. Lee. Beyond Gaussian processes: On the distributions of infinite networks. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 275-282. The MIT Press, Cambridge, MA, 2006.
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.
A. Girard, C. Rasmussen, J.Quiñonero-Candela, and R. Murray-Smith. Multiple-step ahead prediction for non linear dynamic systems - a Gaussian process treatment wih propagation of the uncertainty. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. The MIT Press.
Abstract: We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. k-step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form y_t = f(y_t-1,...,y_t-L), the prediction of y at time t + k is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction
T. Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. In 20th International Conference on Machine Learning, 2003.
Abstract: We formulate the problem of solving stochastic linear operator equations in a Bayesian Gaussian process (GP) framework. The solution is obtained in the spirit of a collocation method based on noisy evaluations of the target function at randomly drawn or deliberately chosen points. Prior knowledge about the solution is encoded in terms of the covariance kernel of the GP. As in GP regression, analytical expressions for the mean and variance of the estimated target function are obtained from which the solution to the operator equation follows by a manipulation of the kernel. Linear initial and boundary value constraints can be enforced by embedding the non-parametric model in a form that automatically satisfies the boundary conditions. The method is illustrated on a noisy linear first-order ordinary differential equation with initial condition and on a noisy second-order partial differential equation with Dirichlet boundary conditions.
N. Lawrence. Gaussian process latent variable models for visualization of high dimensional data. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 329-336. The MIT Press, 2004.
Abstract: In this paper we introduce a new underlying probabilistic model for principal component analysis (PCA). Our formulation interprets PCA as a particular Gaussian process prior on a mapping from a latent space to the observed data-space. We show that if the prior’s covariance function constrains the mappings to be linear the model is equivalent to PCA, we then extend the model by considering less restrictive covariance functions which allow non-linear mappings. This more general Gaussian process latent variable model (GPLVM) is then evaluated as an approach to the visualisation of high dimensional data for three different data-sets. Additionally our non-linear algorithm can be further kernelised leading to ‘twin kernel PCA’ in which a mapping between feature spaces occurs.
N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.
Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be non-linearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets.
R. Murray-Smith and A. Girard. Gaussian process priors with ARMA noise models. In Irish Signals and Systems Conference, pages 147-153, Maynooth, 2001.
A. O'Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29:245-260, 1991.
C. E. Rasmussen. Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, editors, Bayesian Statistics 7, pages 651-659. Oxford University Press, 2003.
Abstract: Hybrid Monte Carlo (HMC) is often the method of choice for computing Bayesian integrals that are not analytically tractable. However the success of this method may require a very large number of evaluations of the (un-normalized) posterior and its partial derivatives. In situations wh