Understanding Bayesian Optimization

The Prior and Its Possibilities

I discussed Bayesian optimization, which is a way of optimizing a function that has no known formula but can be evaluated.  Bayesian optimization uses Bayesian inference and thus has prior, likelihood, and posterior distributions.  One common application is tuning hyperparameters in machine learning: given a data set to learn on, the hyperparameters are the inputs to a function whose output is some assessment of the machine learning model’s performance on the data, such as F1-score, precision and recall, or accuracy.
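As a concrete sketch of what such a function might look like, the snippet below scores a model by cross-validated accuracy as a function of two hyperparameters; the data set, model, and metric are placeholder assumptions of mine, not something prescribed by Bayesian optimization.

```python
# A minimal sketch of the "black box" Bayesian optimization would optimize:
# hyperparameters go in, a performance score comes out.  The model, data set,
# and metric here are placeholder assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(n_estimators: int, max_depth: int) -> float:
    """Return mean cross-validated accuracy for one hyperparameter setting."""
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Each call is one expensive function evaluation from the optimizer's view.
print(objective(n_estimators=100, max_depth=3))
```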

The following links helped me understand Bayesian optimization:

Practical Bayesian Optimization of Machine Learning Algorithms (paper)
Gaussian Processes for Dummies (blog post)
A Tutorial on Bayesian Optimization for Machine Learning (slides)
Introduction to Bayesian Optimization (slides)
Tutorial: Gaussian process models for machine learning (slides)

The prior distribution in Bayesian optimization is a Gaussian process.  This terminology was confusing to me at first, since I thought that Bayesian optimization was basically synonymous with Gaussian processes, but the Gaussian process is specifically the prior (and, after conditioning on observations, the posterior) over the unknown function.  Gaussian processes are also used as a machine learning method in their own right, for example in Gaussian process regression.

What exactly the prior and the posterior are doing was confusing to me, as well as to some people in the class.  At first I thought that the prior was a distribution over the parameters of a function.  For example, if you had a polynomial with parameterized coefficients, theta, then I thought that the prior and the posterior were distributions on these coefficients theta, but this is not correct.  In other words, I initially thought that the prior had to specify a particular family of functions.

The prior distribution and the posterior distribution are instead distributions on function values, without regard to the type or family of function.  Each possible function value at a given input is assigned a probability by the distribution.  A common choice of prior on the function values is a normal distribution centered at 0 with some variance, which means the most probable function under the prior is the constant function y = 0.  The posterior distribution then gives a probability distribution on possible function values after function evaluations are performed and incorporated through the likelihood function, which is jointly normal in the observed function evaluations.
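To make the idea of a distribution on function values concrete, here is a small sketch of my own (not taken from any of the linked sources) that draws function values from a zero-mean Gaussian process prior on a grid of inputs and then conditions on a few observations to get the posterior; the kernel, grid, and observed points are arbitrary assumptions.

```python
# Sketch: Gaussian process prior and posterior over function values.
# The kernel, inputs, and observations below are arbitrary assumptions.
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two sets of 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

xs = np.linspace(0.0, 5.0, 50)           # inputs where we want function values
K = sq_exp_kernel(xs, xs)

# Prior: the function values at xs are jointly normal with mean 0.
prior_samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)

# Posterior: condition on a few observed evaluations (x_obs, y_obs).
x_obs = np.array([1.0, 3.0, 4.5])
y_obs = np.array([0.5, -0.2, 1.0])
K_oo = sq_exp_kernel(x_obs, x_obs) + 1e-6 * np.eye(len(x_obs))  # jitter
K_so = sq_exp_kernel(xs, x_obs)
post_mean = K_so @ np.linalg.solve(K_oo, y_obs)
post_cov = K - K_so @ np.linalg.solve(K_oo, K_so.T)
```

The prior samples wander around 0, while the posterior mean passes through the observed points and the posterior variance shrinks near them.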

There is an animated GIF on page 28 of the Introduction to Bayesian Optimization slides linked above that gives a nice visualization of this phenomenon.

Since the normal prior centered at 0 is usually chosen, I wonder how much the choice of prior changes things.  In a machine learning context, the prior could encode the average results that given hyperparameters produce across many machine learning projects.  There are statistics to calculate the relative effects of the prior and the likelihood on the posterior distribution.  It is often the case that the prior has little effect while the likelihood has a greater effect, especially with more data; thus, optimizing the prior distribution could be a lot of effort for little gain.


The Meaning and Selection of the Kernel

In Bayesian optimization, there is a function that determines the covariance between any two function evaluation points.  This function determines the covariance matrix used in the multivariate Gaussian distribution for the likelihood function.  The covariance function is called a kernel, and there are many kernels used in Gaussian processes.

The paper that we discussed in class, Practical Bayesian Optimization of Machine Learning Algorithms, considers two kernels: the squared exponential kernel and the Matérn 5/2 kernel.  The authors argue that the squared exponential kernel is too smooth for hyperparameter tuning in machine learning, while the Matérn kernel is more appropriate.  The paper shows the Matérn kernel performing better in some empirical experiments.
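To make the comparison concrete, here is a sketch of the two covariance functions written for the one-dimensional distance r = |x − x′|; the length scale is an arbitrary assumption on my part, not a value from the paper.

```python
# Sketch of the two kernels, evaluated on the distance r = |x - x'|.
# The length scale is an arbitrary assumption.
import numpy as np

def squared_exponential(r, length_scale=1.0):
    """Very smooth: sample functions are infinitely differentiable."""
    return np.exp(-0.5 * (r / length_scale) ** 2)

def matern_5_2(r, length_scale=1.0):
    """Rougher: sample functions are only twice differentiable."""
    s = np.sqrt(5.0) * r / length_scale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

r = np.linspace(0.0, 3.0, 7)
print(squared_exponential(r))
print(matern_5_2(r))
```

Both decay with distance, but the Matérn 5/2 kernel produces less smooth sample functions, which is the property the authors argue is better suited to objective surfaces in hyperparameter tuning.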

How was this kernel really chosen?  Did the authors simply try several kernels in their experiments and pick the one that performed best?  That doesn’t really give a provable way to choose a good kernel.  We wondered whether there was some mathematically formal way to prove that one kernel is better than another, but we thought that this was probably not the case.  Maybe it could be proven under certain conditions, such as convexity or smoothness assumptions.  Intuitively, similar choices of hyperparameters will probably produce similar results, so kernels that assume a somewhat smooth function are probably appropriate.

One of the drawbacks of Bayesian optimization for machine learning hyperparameter tuning is that Bayesian optimization has its own hyperparameters that could be tuned, so it just pushes the problem of hyperparameter tuning back a level.  How are the hyperparameters of Bayesian optimization meant to be tuned?  The choice of kernel, acquisition function, prior distribution, and stopping criteria are examples of hyperparameters of Bayesian optimization.

The Meaning of the Acquisition Function

In Bayesian optimization, an acquisition function is used to choose the next point for function evaluation.  The paper that we discussed in class mentions three acquisition functions: probability of improvement, expected improvement, and upper confidence bound.

The definitions of these functions are very abstract and mathematical, and I had some difficulty interpreting what the functions do, which makes reasoning about them intuitively difficult.  The paper focuses on the expected improvement function, and I found an alternative formula for it on slide 35 of the presentation Introduction to Bayesian Optimization.  The formula is the following:

\int \max(0, y_{\text{best}} - y) \, p(y \mid x; \theta, D) \, dy

From this formula, it is clearer that expected improvement chooses the point that maximizes the expected amount by which the function evaluation at that point will fall below the best y value found so far; evaluations that do not fall below the best value count as zero improvement.  Perhaps the other acquisition functions can be written in a similar manner, giving a better understanding of what they do.
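For a Gaussian posterior, this integral has a well-known closed form, which I sketch below for a minimization problem; here mu and sigma stand for the Gaussian process posterior mean and standard deviation at a candidate point and are assumed inputs, not something computed in this snippet.

```python
# Expected improvement for minimization, assuming the posterior at a
# candidate point x is Normal(mu, sigma**2).  This is the standard closed
# form of the integral above, not code taken from the paper.
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """E[max(0, y_best - y)] when y ~ Normal(mu, sigma**2)."""
    if sigma <= 0.0:
        return max(0.0, y_best - mu)  # no uncertainty: improvement is deterministic
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# A candidate with a slightly worse predicted mean than the current best can
# still have positive expected improvement because of its uncertainty.
print(expected_improvement(mu=0.3, sigma=0.2, y_best=0.25))
```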

The paper did not present a mathematically formal way of choosing the best acquisition function.  Maybe under certain conditions, a given acquisition function can be shown to be optimal.