Title

ROG: A High Performance and Robust Distributed Training System for Robotic IoT

Author
Corresponding Author
Zhao,Shixiong
DOI
Publication Years
2022
Conference Name
55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
ISSN
1072-4451
ISBN
978-1-6654-7428-3
Source Title
Volume
2022-October
Pages
336-353
Conference Date
1-5 Oct. 2022
Conference Place
Chicago, IL, USA
Publication Place
10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
Publisher
Abstract
Critical robotic tasks such as rescue and disaster response are more prevalently leveraging ML (Machine Learning) models deployed on a team of wireless robots, on which data parallel (DP) training over Internet of Things of these robots (robotic IoT) can harness the distributed hardware resources to adapt their models to changing environments as soon as possible. Unfortunately, due to the need for DP synchronization across all robots, the instability in wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often leads to severe stall of robots, which affects the training accuracy within a tight time budget and wastes energy stalling. Existing methods to cope with the instability of datacenter networks are incapable of handling such straggler effect. That is because they are conducting model-granulated transmission scheduling, which is much more coarse-grained than the granularity of transient network instability in real-world robotic IoT networks, making a previously reached schedule mismatch with the varying bandwidth during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer's parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way the ML training process can update partial and the most important gradients of a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved about 4.9% to 6.5% training accuracy gain compared with the baselines and saved 20.4% to 50.7% of the energy to achieve the same training accuracy.
Keywords
SUSTech Authorship
Others
Language
English
Indexed By
Funding Project
HK RGC GRF["17202318","17207117"] ; HK ITF[GHP/169/20SZ] ; NSFC[62132009]
WOS Research Area
Computer Science ; Engineering
WOS Subject
Computer Science, Hardware & Architecture ; Engineering, Electrical & Electronic
WOS Accession No
WOS:000886530600020
EI Accession Number
20224613109619
EI Keywords
Bandwidth ; Budget control ; Energy efficiency ; Internet of things ; Robots ; Wave transmission
ESI Classification Code
Energy Conservation:525.2 ; Information Theory and Signal Processing:716.1 ; Radio Systems and Equipment:716.3 ; Data Communication, Equipment and Techniques:722.3 ; Computer Software, Data Handling and Applications:723 ; Robotics:731.5
Scopus EID
2-s2.0-85141723019
Data Source
Scopus
PDF URL
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9923782
Citation statistics
Cited Times [WOS]:0
Document Type
Conference paper
Identifier
http://kc.sustech.edu.cn/handle/2SGJ60CL/411872
Department
Southern University of Science and Technology
Affiliation
1.The University of Hong Kong,Department of Computer Science,Hong Kong,Hong Kong
2.Tsinghua University,Beijing,China
3.Institute of Software,Chinese Academy of Sciences,Beijing,China
4.EEE,Southern University of Science and Technology,China
5.Pujiang Lab,Shanghai,China
Recommended Citation
GB/T 7714
Guan,Xiuxian,Sun,Zekai,Deng,Shengliang,et al. ROG: A High Performance and Robust Distributed Training System for Robotic IoT[C]. 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA:IEEE COMPUTER SOC,2022:336-353.

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.