{"id":424,"date":"2018-04-26T07:11:10","date_gmt":"2018-04-26T07:11:10","guid":{"rendered":"http:\/\/wordpress.cs.vt.edu\/optml\/?p=424"},"modified":"2018-04-26T07:11:10","modified_gmt":"2018-04-26T07:11:10","slug":"federated-optimization-distributed-machine-learning-for-on-device-intelligence","status":"publish","type":"post","link":"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/04\/26\/federated-optimization-distributed-machine-learning-for-on-device-intelligence\/","title":{"rendered":"Federated Optimization: Distributed Machine Learning for On-Device Intelligence"},"content":{"rendered":"<p align=\"center\"><strong>Introduction<\/strong><\/p>\n<p>Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. This work\u00a0introduces\u00a0an additional approach: Federated Learning, which\u00a0leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally computed updates via\u00a0a central coordinating server.<\/p>\n<p>State-of-the-art optimization algorithms are typically inherently sequential. Moreover, they usually rely on performing a large number of very fast iterations. 
However, a round of communication is much more time-consuming than a single iteration of the algorithm, which has led to the development of novel algorithms specialized for distributed optimization.<\/p>\n<p align=\"center\"><strong>Problem Formulation<\/strong><\/p>\n<p>The model proposed in this paper considers the problem of minimizing an objective function that has the form of a sum:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-487 aligncenter\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.23.56-AM.png\" alt=\"\" width=\"397\" height=\"64\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.23.56-AM.png 472w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.23.56-AM-300x48.png 300w\" sizes=\"auto, (max-width: 397px) 100vw, 397px\" \/><\/p>\n<p>Examples of problems with this structure include:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-488 aligncenter\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.24.33-AM.png\" alt=\"\" width=\"579\" height=\"107\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.24.33-AM.png 725w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.24.33-AM-300x55.png 300w\" sizes=\"auto, (max-width: 579px) 100vw, 579px\" \/><\/p>\n<p align=\"center\"><strong>The Setting of Federated Optimization<\/strong><\/p>\n<p>Since communication efficiency is of utmost importance, algorithms for federated optimization must be designed to handle the following characteristics.<\/p>\n<p>Massively Distributed:<\/p>\n<p>Data points are stored across a large number of nodes K. 
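As a concrete illustration of the sum-structured objective and of such a massively distributed, unbalanced partition, here is a minimal NumPy sketch; the choice of a logistic-loss f_i and all names and sizes are my own illustrative assumptions, not from the paper:

```python
import numpy as np

# Hedged sketch: the objective f(w) = (1/n) * sum_i f_i(w), with f_i chosen
# here as a logistic-loss term, and the data split unevenly across K nodes.
# All names and sizes are illustrative assumptions, not from the paper.
rng = np.random.default_rng(0)
n, d, K = 1000, 20, 50                    # examples, features, nodes

X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def f_i(w, i):
    """Loss of a single example i (one possible instance of f_i)."""
    return np.log1p(np.exp(-y[i] * (X[i] @ w)))

def f(w):
    """The sum-structured objective f(w) = (1/n) * sum_i f_i(w)."""
    return float(np.mean([f_i(w, i) for i in range(n)]))

# An unbalanced, non-IID partition: node k holds n_k examples, n_k varying wildly.
sizes = rng.multinomial(n, rng.dirichlet(np.ones(K)))
partition = np.split(np.arange(n), np.cumsum(sizes)[:-1])

# f decomposes over nodes: f(w) = sum_k (n_k / n) * F_k(w),
# where F_k is the average loss over node k's local data.
w = np.zeros(d)
F = [np.mean([f_i(w, i) for i in P]) if len(P) else 0.0 for P in partition]
assert np.isclose(sum(len(P) / n * Fk for P, Fk in zip(partition, F)), f(w))
```

The decomposition checked in the last line is the per-node reformulation f(w) = &Sigma;<sub>k<\/sub> (n<sub>k<\/sub>\/n) F<sub>k<\/sub>(w) that the post uses later.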
In particular, the number of nodes can be much bigger than the average number of training examples stored on a given node (n\/K).<\/p>\n<p>Non-IID:<\/p>\n<p>Data on each node may be drawn from a different distribution; that is, the data points available locally are far from being a representative sample of the overall distribution.<\/p>\n<p>Unbalanced:<\/p>\n<p>Different nodes may vary by orders of magnitude in the number of training examples they hold.<\/p>\n<p align=\"center\"><strong>Federated Learning<\/strong><\/p>\n<p>Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices by bringing model training to the device as well.<\/p>\n<p>It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.<\/p>\n<p style=\"text-align: center\"><strong>Distributed settings and desirable\u00a0algorithmic properties<\/strong><\/p>\n<p>We consider two distributed settings in this work. 
On a single machine, we compute the execution time as<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-460 alignnone aligncenter\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.31.16-PM.png\" alt=\"\" width=\"168\" height=\"27\" \/><\/p>\n<p>where\u00a0<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-461\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.34.53-PM.png\" alt=\"\" width=\"51\" height=\"24\" \/>\u00a0is\u00a0the number of iterations algorithm A needs to converge to some fixed \u03b5 accuracy, and <img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-462\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.36.04-PM.png\" alt=\"\" width=\"27\" height=\"25\" \/>\u00a0is\u00a0the time needed for a single iteration.<\/p>\n<p>In the naturally distributed setting, the execution time also includes the communication overhead c of each iteration, where c &gt;&gt; <img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-462\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.36.04-PM.png\" alt=\"\" width=\"27\" height=\"25\" \/>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-463 aligncenter\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.36.45-PM.png\" alt=\"\" width=\"222\" height=\"30\" \/><\/p>\n<p>Desirable algorithmic properties for the non-IID, unbalanced, and massively-distributed setting are:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-464\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-2.png\" alt=\"\" width=\"1887\" height=\"577\" 
srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-2.png 1887w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-2-300x92.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-2-768x235.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-2-1024x313.png 1024w\" sizes=\"auto, (max-width: 1887px) 100vw, 1887px\" \/><\/p>\n<div class=\"page\" title=\"Page 15\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<p>Property (A) is valuable in any optimization setting. Properties (B) and (C) are extreme cases of the federated optimization setting (non-IID, unbalanced, and sparse), whereas (D) is an extreme case of the classic distributed optimization setting (large amounts of IID data per machine). Thus, (D) is the least important property for algorithms in the federated optimization setting.<\/p>\n<p style=\"text-align: center\"><strong>SVRG algorithm on a single node<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-466\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/svrg.png\" alt=\"\" width=\"1812\" height=\"610\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/svrg.png 1812w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/svrg-300x101.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/svrg-768x259.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/svrg-1024x345.png 1024w\" sizes=\"auto, (max-width: 1812px) 100vw, 1812px\" \/><\/p>\n<p>The <u><b>central idea<\/b><\/u> is that the algorithm evaluates two stochastic gradients, \u2207fi(w) and \u2207fi(wt), to estimate the change of the gradient of the entire function between points wt and w, namely 
\u2207f(w)\u2212 \u2207f(wt).<\/p>\n<p>In the distributed setting, we modify the objective as follows:<\/p>\n<p>The original problem formulation is\u00a0<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-470\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-3.png\" alt=\"\" width=\"117\" height=\"32\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-3.png 310w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-3-300x82.png 300w\" sizes=\"auto, (max-width: 117px) 100vw, 117px\" \/>.\u00a0If we define\u00a0<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-469\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.45.53-PM.png\" alt=\"\" width=\"190\" height=\"57\" \/>\u00a0then\u00a0<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-468\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.45.58-PM.png\" alt=\"\" width=\"353\" height=\"61\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.45.58-PM.png 492w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.45.58-PM-300x52.png 300w\" sizes=\"auto, (max-width: 353px) 100vw, 353px\" \/><\/p>\n<p>Therefore, the objective is changed to\u00a0<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-467\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-25-at-11.46.05-PM.png\" alt=\"\" width=\"195\" height=\"59\" \/><\/p>\n<p style=\"text-align: center\"><strong>Distributed Approximate Newton algorithm (DANE)<\/strong><\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-472 
aligncenter\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-4.png\" alt=\"\" width=\"1999\" height=\"652\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-4.png 1999w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-4-300x98.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-4-768x250.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-4-1024x334.png 1024w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p>The main idea of DANE is to form a local subproblem that depends only on the local data and the gradient of the entire function, which can be computed in a single round of communication.<\/p>\n<p style=\"text-align: center\"><strong>Naive Federated SVRG<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-473\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-5.png\" alt=\"\" width=\"1999\" height=\"783\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-5.png 1999w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-5-300x118.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-5-768x301.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-5-1024x401.png 1024w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-475\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-6.png\" alt=\"\" width=\"1999\" height=\"437\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-6.png 1999w, 
https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-6-300x66.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-6-768x168.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-6-1024x224.png 1024w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p>This proposition proves the effectiveness of the naive federated SVRG algorithm. However, the algorithm is inherently stochastic, and the result is valid only for an identical sequence of samples. This work further improves this algorithm to obtain the federated SVRG.<\/p>\n<p style=\"text-align: center\"><strong>The Federated SVRG<\/strong><\/p>\n<p>The notations used in this work are summarized below:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-476\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture2-2.png\" alt=\"\" width=\"1999\" height=\"818\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture2-2.png 1999w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture2-2-300x123.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture2-2-768x314.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture2-2-1024x419.png 1024w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p>Federated SVRG improves on the naive federated SVRG through four parameters:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-477\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-7.png\" alt=\"\" width=\"1999\" height=\"849\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-7.png 1999w, 
https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-7-300x127.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-7-768x326.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-7-1024x435.png 1024w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-482\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.32-AM.png\" alt=\"\" width=\"332\" height=\"36\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.32-AM.png 332w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.32-AM-300x33.png 300w\" sizes=\"auto, (max-width: 332px) 100vw, 332px\" \/><\/p>\n<p>Since the local data sizes nk are not identical, the stepsize hk is set inversely proportional to nk, ensuring that each node makes progress of roughly the same magnitude overall.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-481\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.40-AM.png\" alt=\"\" width=\"660\" height=\"34\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.40-AM.png 764w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.40-AM-300x15.png 300w\" sizes=\"auto, (max-width: 660px) 100vw, 660px\" \/><\/p>\n<div class=\"page\" title=\"Page 21\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<p>If K is\u00a0more than 1, the values of Gk are in general biased estimates of \u2207f(wt).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone 
size-full wp-image-483\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.07.11-AM.png\" alt=\"\" width=\"864\" height=\"246\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.07.11-AM.png 864w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.07.11-AM-300x85.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.07.11-AM-768x219.png 768w\" sizes=\"auto, (max-width: 864px) 100vw, 864px\" \/><\/p>\n<div class=\"page\" title=\"Page 21\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<p>This motivates the aggregation of updates from nodes proportional to nk, the number of data points available locally.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-480\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.45-AM.png\" alt=\"\" width=\"590\" height=\"33\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.45-AM.png 590w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.45-AM-300x17.png 300w\" sizes=\"auto, (max-width: 590px) 100vw, 590px\" \/><\/p>\n<div class=\"page\" title=\"Page 22\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<p>Because of the uneven distribution of nonzeros for a particular feature,<\/p>\n<div class=\"page\" title=\"Page 22\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<p>repeatedly sampling and overshooting the magnitude of the gradient will likely cause the iterative process to diverge quickly. 
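Putting the two preceding modifications together, a simplified FSVRG-style round (local SVRG steps with stepsize h\/nk per node, then aggregation of updates proportional to nk) can be sketched as follows; this is my own illustration on a toy least-squares problem, and it omits the diagonal scalings that address the sparsity issue just described:

```python
import numpy as np

# Simplified FSVRG-style sketch on least squares: local SVRG steps with
# stepsize h/n_k per node, updates aggregated proportionally to n_k.
# Omits the diagonal scaling matrices; all names and sizes are illustrative.
rng = np.random.default_rng(1)
n, d = 400, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(w, i):                  # gradient of f_i(w) = 0.5 * (a_i @ w - b_i)^2
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):                  # gradient of f(w) = (1/n) * sum_i f_i(w)
    return A.T @ (A @ w - b) / n

# Unbalanced partition of the data across K = 4 nodes (sizes 50, 120, 90, 140).
parts = np.split(rng.permutation(n), [50, 170, 260])
n_k = np.array([len(p) for p in parts])

w, h = np.zeros(d), 0.5            # iterate and base stepsize
for _ in range(30):                # communication rounds
    g = full_grad(w)               # full gradient at the snapshot w_t (one round)
    updates = []
    for k, P in enumerate(parts):
        wk, hk = w.copy(), h / n_k[k]          # stepsize inversely prop. to n_k
        for i in rng.permutation(P):           # one local pass over node k's data
            wk -= hk * (grad_i(wk, i) - grad_i(w, i) + g)   # SVRG direction
        updates.append(n_k[k] / n * (wk - w))  # weight the update by n_k / n
    w = w + sum(updates)           # aggregate proportionally to n_k

w_star = np.linalg.lstsq(A, b, rcond=None)[0]  # exact least-squares solution
```

On this toy problem the aggregated iterate ends up close to the least-squares solution; the full algorithm additionally applies the per-variable weights and scalings described in the surrounding text.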
So we scale stochastic gradients by a diagonal matrix Sk.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-479\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.50-AM.png\" alt=\"\" width=\"655\" height=\"34\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.50-AM.png 655w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Screen-Shot-2018-04-26-at-12.04.50-AM-300x16.png 300w\" sizes=\"auto, (max-width: 655px) 100vw, 655px\" \/><\/p>\n<div class=\"page\" title=\"Page 22\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<p>If a variable appears in the data on every node, we simply take the average. However, the fewer nodes a particular variable appears on, the more we want to trust those few nodes to inform us about a meaningful update to this variable, or, alternatively, to take a longer step. Hence the per-variable scaling of aggregated updates.<\/p>\n<p style=\"text-align: center\"><strong>Experiments<\/strong><\/p>\n<p>They provide results on a dataset based on public Google+ posts, clustered by user, simulating each user as an independent node.<\/p>\n<p>The following figure shows the frequency of different features across nodes. 
In particular, over 88% of features are present on fewer than 1000 nodes, which reflects the sparsity structure characteristic of the federated setting.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-484 aligncenter\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-9.png\" alt=\"\" width=\"408\" height=\"314\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-9.png 1060w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-9-300x231.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-9-768x591.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-9-1024x788.png 1024w\" sizes=\"auto, (max-width: 408px) 100vw, 408px\" \/><\/p>\n<p>Figure 1: Features vs. appearance on nodes. The x-axis is a feature index, and the y-axis represents the number of nodes where a given feature is present.<\/p>\n<\/div>\n<\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-485 aligncenter\" src=\"http:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-10.png\" alt=\"\" width=\"1999\" height=\"750\" srcset=\"https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-10.png 1999w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-10-300x113.png 300w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-10-768x288.png 768w, https:\/\/wordpress.cs.vt.edu\/optml\/wp-content\/uploads\/sites\/69\/2018\/04\/Picture1-10-1024x384.png 1024w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p>Figure 2: Rounds of communication vs. 
objective function (left) and test prediction error (right).<\/p>\n<\/div>\n<p>Figure 2 shows that FSVRG converges to the optimal test classification accuracy in just 30 iterations.\u00a0Since the core reason other methods fail to converge is the non-IID data distribution, the experiments also test the FSVRGR algorithm, i.e., FSVRG with data randomly reshuffled among the same number of nodes, which illustrates that the algorithm is robust\u00a0to the challenges present in federated optimization.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. This work\u00a0introduces\u00a0an additional approach: Federated Learning, which\u00a0leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally computed updates via\u00a0a central coordinating server. State-of-the-art optimization algorithms are typically inherently sequential. Moreover, they &hellip; <a href=\"https:\/\/wordpress.cs.vt.edu\/optml\/2018\/04\/26\/federated-optimization-distributed-machine-learning-for-on-device-intelligence\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Federated Optimization: Distributed Machine Learning for On-Device 
Intelligence<\/span><\/a><\/p>\n","protected":false},"author":169,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-424","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9CQAE-6Q","_links":{"self":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/424","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/users\/169"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/comments?post=424"}],"version-history":[{"count":12,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/424\/revisions"}],"predecessor-version":[{"id":491,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/posts\/424\/revisions\/491"}],"wp:attachment":[{"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/media?parent=424"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/categories?post=424"},{
"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cs.vt.edu\/optml\/wp-json\/wp\/v2\/tags?post=424"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}