Microsoft Research, Cambridge, U.K.

Published as: “Bayesian inference: An introduction to principles and practice in machine learning.” In O. Bousquet, U. von Luxburg, and G. Rätsch (Eds.), Advanced Lectures on Machine Learning, pp. 41–62. Springer.
Year of publication: 2004
This version typeset: June 26, 2006
Available from: http://www.miketipping.com/papers.htm
Correspondence: mail@miketipping.com

Abstract

This article gives a basic introduction to the principles of Bayesian inference in a machine learning context, with an emphasis on the importance of marginalisation for dealing with uncertainty. We begin by illustrating concepts via a simple regression task, and then relate those ideas to practical, contemporary techniques with a description of ‘sparse Bayesian’ models and the ‘relevance vector machine’.

1 Introduction

What is meant by “Bayesian inference” in the context of machine learning? To assist in answering that question, let’s start by proposing a conceptual task: we wish to learn a model of the relationship between pairs of variables A and B from some given number of example instances of them. Indeed, many machine learning problems are of the type “given A, what is B?”.¹

Verbalising what we typically treat as a mathematical task raises an interesting question in itself. How do we answer “what is B?”? Within the appealingly well-defined and axiomatic framework of propositional logic, we ‘answer’ the question with complete certainty, but this logic is clearly too rigid to cope with the realities of real-world modelling, where uncertainty over ‘truth’ is ubiquitous. Our measurements of both the dependent (B) and independent (A) variables are inherently noisy and inexact, and the relationships between the two are invariably non-deterministic. This is where probability theory comes to our aid, as it furnishes us with a principled and consistent framework for meaningful reasoning in the presence of uncertainty. We might think of probability theory, and in particular Bayes’ rule, as providing us with a “logic of uncertainty” [1].

In our example, given A we would ‘reason’ about the likelihood of the truth of B (let’s say B is binary for example) via its conditional probability P(B|A): that is, “what is the probability of B given that A takes a particular value?”. An appropriate answer might be “B is true with probability 0.6”. One of the primary tasks of ‘machine learning’ is then to approximate P(B|A) with some appropriately specified model based on a given set of corresponding examples of A and B.²

¹ In this article we will focus exclusively on such ‘supervised learning’ tasks, although of course there are other modelling applications which are equally amenable to Bayesian inferential techniques.

² In many learning methods, this conditional probability approximation is not made explicit, though such an interpretation may exist. However, one might consider it a significant limitation if a particular machine learning procedure cannot be expressed coherently within a probabilistic framework.
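To make the idea of approximating P(B|A) from examples concrete, the following is a minimal sketch under simplifying assumptions that are not part of the article: A is taken to be a discrete label, B is binary, and the "model" is nothing more than the relative frequency of B being true for each observed value of A. The function name `estimate_conditional` and the data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def estimate_conditional(examples):
    """Estimate P(B = True | A = a) by relative frequency for each observed a.

    `examples` is an iterable of (a, b) pairs with hashable `a` and boolean `b`.
    This is an illustrative estimator, not the method developed in the article.
    """
    counts = defaultdict(lambda: [0, 0])  # a -> [times B was true, total count]
    for a, b in examples:
        counts[a][0] += int(b)
        counts[a][1] += 1
    return {a: n_true / n_total for a, (n_true, n_total) in counts.items()}


# Hypothetical example set of corresponding (A, B) pairs.
examples = [
    ("x", True), ("x", True), ("x", True), ("x", False), ("x", False),
    ("y", False), ("y", False), ("y", True), ("y", False),
]

p_b_given_a = estimate_conditional(examples)
for a, p in sorted(p_b_given_a.items()):
    print(f"P(B = True | A = {a!r}) is approximately {p:.2f}")
```

Here three of the five examples with A = "x" have B true, so the estimate for that value is 0.6, the kind of answer described above. Such a frequency table cannot say anything about values of A it has never seen, which is one motivation for the parameterised models the article goes on to discuss.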


It is in the modelling procedure where Bayesian inference comes to the fore. We typically (though not exclusively) deploy some form of parameterised model for our conditional probability:

$$P(B|A) = f(A; \mathbf{w}), \tag{1}$$

where $\mathbf{w}$ denotes a vector of all the ‘adjustable’ parameters in the model. Then, given a set $D$ of $N$ examples of our variables, $D = \{A_n, B_n\}_{n=1}^{N}$, a conventional approach would involve the maximisation of some measure of ‘accuracy’ (or minimisation of some measure of ‘loss’) of our model for $D$ with respect to the adjustable parameters. We then can make predictions, given A, for unknown B by evaluating $f(A; \mathbf{w})$ with parameters $\mathbf{w}$ set to their optimal values. Of...