{"id":182,"date":"2018-03-27T00:37:53","date_gmt":"2018-03-27T00:37:53","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/optml\/?p=182"},"modified":"2018-03-27T00:39:21","modified_gmt":"2018-03-27T00:39:21","slug":"understanding-bayesian-optimization","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/03\/27\/understanding-bayesian-optimization\/","title":{"rendered":"Understanding Bayesian Optimization"},"content":{"rendered":"<div class=\"entry-content\">\n<h3 class=\"entry-title\">The Prior and its Possibilities<\/h3>\n<p>I discussed Bayesian optimization, which is a way of optimizing a function that does not have a formula but can be evaluated.\u00a0 Bayesian optimization use Bayesian inference and thus have prior, likelihood, and posterior distributions.\u00a0 Bayesian optimization can be used to optimize hyperparameters in machine learning.\u00a0 Given a data set for learning on, the hyperparameters are the input to a function.\u00a0 The output to the function is some assessment of the machine learning model\u2019s performance on the data such as F1-score, precision and recall, accuracy, and so on.<\/p>\n<p>The following links were used in my understanding of Bayesian optimization:<\/p>\n<p><a href=\"https:\/\/papers.nips.cc\/paper\/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf\">PRACTICAL BAYESIAN OPTIMIZATION OF MACHINE LEARNING ALGORITHMS<\/a><br \/>\n<a href=\"http:\/\/katbailey.github.io\/post\/gaussian-processes-for-dummies\/\" target=\"_blank\" rel=\"noopener\">Gaussian Processes for Dummies (blog post)<\/a><br \/>\n<a href=\"https:\/\/www.iro.umontreal.ca\/~bengioy\/cifar\/NCAP2014-summerschool\/slides\/Ryan_adams_140814_bayesopt_ncap.pdf\" target=\"_blank\" rel=\"noopener\">A Tutorial on Bayesian Optimization for Machine Learning (slides)<br \/>\n<\/a><a href=\"http:\/\/gpss.cc\/gpmc17\/slides\/LancasterMasterclass_1.pdf\">Introduction to Bayesian Optimization (slides)<\/a><a 
href=\"http:\/\/mlg.eng.cam.ac.uk\/tutorials\/06\/es.pdf\" target=\"_blank\" rel=\"noopener\"><br \/>\nTutorial: Gaussian process models for machine learning (slides)<\/a><\/p>\n<p>The prior distribution in Bayesian optimization is called a Gaussian process on the prior.\u00a0 This terminology was confusing to me at first since I thought that Bayesian optimization was basically synonymous with Gaussian processes, but I think the prior distribution is called a Gaussian process.\u00a0 <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/gaussian_process.html\" target=\"_blank\" rel=\"noopener\">Gaussian processes are used as a machine learning method.<\/a><\/p>\n<p>What exactly the posterior and the prior are doing was confusing to me as well as some people in the class.\u00a0 At first I thought that the prior was a distribution on functions.\u00a0 For example, if you had a polynomial function with parameterized coefficients, theta, then I thought that the prior and the posterior were a distribution on these coefficients theta, but this is not correct.\u00a0 Thus, initially, I thought that the prior must specify a family of functions.<\/p>\n<p>The prior distribution and the posterior distribution are distributions on <em>function values<\/em> without regard to the type or family of function.\u00a0 Hence, each possible function value for a given input has a probability assigned by the distribution.\u00a0 A common choice for a prior distribution on the function values is a normal distribution centered at 0 with some variance.\u00a0 This means that the function that maximizes the probability of this distribution is y = 0.\u00a0 The posterior distribution gives a probability distribution on possible function values after function evaluations are performed and incorporated with the likelikhood function, which is jointly normal on the observed function evaluations.<\/p>\n<p>There is an animated GIF in some slides on page 28 that I found that gives a nice visualization of 
this phenomenon: <a href=\"http:\/\/gpss.cc\/gpmc17\/slides\/LancasterMasterclass_1.pdf\">Introduction to Bayesian Optimization (slides)<\/a>.<\/p>\n<p>Since the normal prior centered at 0 is usually chosen, I wonder how much the choice of prior actually matters.\u00a0 In a machine learning context, the prior could encode the average results that given hyperparameter settings produce across many machine learning projects.\u00a0 There are statistics for calculating the relative effects of the prior and the likelihood on the posterior distribution.\u00a0 It is often the case that the prior has little effect while the likelihood has a greater effect, especially with more data; thus, optimizing the prior distribution could be a lot of effort for little gain.<\/p>\n<div class=\"entry-content\">\n<h3 class=\"entry-title\">The Meaning and Selection of the Kernel<\/h3>\n<p>In Bayesian optimization, there is a function that determines the covariance between any two function evaluation points.\u00a0 This function determines the covariance matrix used in the multivariate Gaussian distribution for the likelihood function.\u00a0 The covariance function is called a kernel, and <a href=\"http:\/\/www.gaussianprocess.org\/gpml\/chapters\/RW.pdf#section.4.2\">there are many kernels used in Gaussian processes<\/a>.<\/p>\n<p>The paper that we discussed in class, <a href=\"https:\/\/papers.nips.cc\/paper\/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf\">Practical Bayesian Optimization of Machine Learning Algorithms<\/a>, discussed two kernels, the squared exponential kernel and the Mat\u00e9rn 5\/2 kernel.\u00a0 The authors argue that the squared exponential kernel is too smooth for hyperparameter tuning in machine learning while the Mat\u00e9rn kernel is more appropriate.\u00a0 The paper shows the Mat\u00e9rn kernel performing better in some 
empirical experiments.<\/p>\n<p>How was this kernel really chosen?\u00a0 Did the authors simply try several kernels in their experiments and find the one that performed best?\u00a0 This doesn\u2019t really give a provable way to choose a good kernel.\u00a0 We wondered whether there was some mathematically formal way to prove that one kernel is better than another, but we thought that this was probably not the case.\u00a0 Maybe it could be proven under certain conditions, such as convexity or smoothness assumptions.\u00a0 Intuitively, similar choices of hyperparameters will probably produce similar results, so kernels that assume a somewhat smooth function are probably appropriate.<\/p>\n<p>One of the drawbacks of Bayesian optimization in machine learning hyperparameter tuning is that Bayesian optimization has its own hyperparameters that could be tuned, so it just pushes the problem of hyperparameter tuning back a level.\u00a0 How are the hyperparameters of Bayesian optimization meant to be tuned?\u00a0 The choice of kernel, acquisition function, prior distribution, and stopping criteria are examples of hyperparameters for Bayesian optimization.<\/p>\n<\/div>\n<h3 class=\"entry-title\">The Meaning of the Acquisition Function<\/h3>\n<p>In Bayesian optimization, an acquisition function is used to choose the next point for function evaluation.\u00a0 The paper that we discussed in class mentions three acquisition functions: probability of improvement, expected improvement, and upper confidence bound.<\/p>\n<p>The definitions of these functions are abstract and mathematical, and I had some difficulty interpreting what the functions do.\u00a0 This makes reasoning about them intuitively difficult.\u00a0 The paper focuses on the expected improvement function, and I found an alternative formula for it on slide 35 of the presentation <a href=\"http:\/\/gpss.cc\/gpmc17\/slides\/LancasterMasterclass_1.pdf\">Introduction to Bayesian Optimization<\/a>.\u00a0 This formula is the 
following.<\/p>\n<p>EI(x) = \u222b max(0, y_{best} \u2212 y) p(y | x; \u03b8, D) dy<\/p>\n<p>From this formula, it is clearer that the expected improvement criterion chooses the point that maximizes the expected amount by which the function evaluation will fall below the best y value found so far (assuming minimization); maximizing only the probability of falling below y_{best}, regardless of the margin, is the probability-of-improvement criterion instead.\u00a0 Perhaps the other formulas can be written in a similar manner, giving a better understanding of what they do.<\/p>\n<p>The paper did not present a mathematically formal way of choosing the best acquisition function.\u00a0 Maybe under certain conditions, a given acquisition function can be shown to be optimal.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The Prior and its Possibilities I discussed Bayesian optimization, which is a way of optimizing a function that does not have a formula but can be evaluated.\u00a0 Bayesian optimization uses Bayesian inference and thus has prior, likelihood, and posterior distributions.\u00a0 Bayesian optimization can be used to optimize hyperparameters in machine learning.\u00a0 Given a data set &hellip; <a href=\"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/03\/27\/understanding-bayesian-optimization\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Understanding Bayesian 
Optimization<\/span><\/a><\/p>\n","protected":false},"author":148,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-182","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9CQAE-2W","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/182","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/users\/148"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/comments?post=182"}],"version-history":[{"count":2,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/182\/revisions"}],"predecessor-version":[{"id":184,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/182\/revisions\/184"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/media?parent=182"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/categories?post=182"},{"
taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/tags?post=182"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}