Title | ROG: A High Performance and Robust Distributed Training System for Robotic IoT |
Author | |
Corresponding Author | Zhao,Shixiong |
DOI | |
Publication Years | 2022
|
Conference Name | 55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
|
ISSN | 1072-4451
|
ISBN | 978-1-6654-7428-3
|
Source Title | |
Volume | 2022-October
|
Pages | 336-353
|
Conference Date | 1-5 Oct. 2022
|
Conference Place | Chicago, IL, USA
|
Publication Place | 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
|
Publisher | |
Abstract | Critical robotic tasks such as rescue and disaster response are more prevalently leveraging ML (Machine Learning) models deployed on a team of wireless robots, on which data parallel (DP) training over Internet of Things of these robots (robotic IoT) can harness the distributed hardware resources to adapt their models to changing environments as soon as possible. Unfortunately, due to the need for DP synchronization across all robots, the instability in wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often leads to severe stall of robots, which affects the training accuracy within a tight time budget and wastes energy stalling. Existing methods to cope with the instability of datacenter networks are incapable of handling such straggler effect. That is because they are conducting model-granulated transmission scheduling, which is much more coarse-grained than the granularity of transient network instability in real-world robotic IoT networks, making a previously reached schedule mismatch with the varying bandwidth during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer's parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way the ML training process can update partial and the most important gradients of a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved about 4.9% to 6.5% training accuracy gain compared with the baselines and saved 20.4% to 50.7% of the energy to achieve the same training accuracy. |
Keywords | |
SUSTech Authorship | Others
|
Language | English
|
URL | [Source Record] |
Indexed By | |
Funding Project | HK RGC GRF["17202318","17207117"]
; HK ITF[GHP/169/20SZ]
; NSFC[62132009]
|
WOS Research Area | Computer Science
; Engineering
|
WOS Subject | Computer Science, Hardware & Architecture
; Engineering, Electrical & Electronic
|
WOS Accession No | WOS:000886530600020
|
EI Accession Number | 20224613109619
|
EI Keywords | Bandwidth
; Budget control
; Energy efficiency
; Internet of things
; Robots
; Wave transmission
|
ESI Classification Code | Energy Conservation:525.2
; Information Theory and Signal Processing:716.1
; Radio Systems and Equipment:716.3
; Data Communication, Equipment and Techniques:722.3
; Computer Software, Data Handling and Applications:723
; Robotics:731.5
|
Scopus EID | 2-s2.0-85141723019
|
Data Source | Scopus
|
PDF url | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9923782 |
Citation statistics |
Cited Times [WOS]:0
|
Document Type | Conference paper |
Identifier | http://kc.sustech.edu.cn/handle/2SGJ60CL/411872 |
Department | Southern University of Science and Technology |
Affiliation | 1. The University of Hong Kong, Department of Computer Science, Hong Kong, Hong Kong 2. Tsinghua University, Beijing, China 3. Institute of Software, Chinese Academy of Sciences, Beijing, China 4. Eee, Southern University of Science and Technology, China 5. Pujiang Lab, Shanghai, China |
Recommended Citation GB/T 7714 |
Guan,Xiuxian,Sun,Zekai,Deng,Shengliang,et al. ROG: A High Performance and Robust Distributed Training System for Robotic IoT[C]. 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA:IEEE COMPUTER SOC,2022:336-353.
|
Files in This Item: | There are no files associated with this item. |
|