Access Restriction Open

Author Petrović, Darko ♦ Shahmirzadi, Omid ♦ Ropars, Thomas ♦ Schiper, André
Source CiteSeerX
Content type Text
File Format PDF
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Direct Advantage ♦ Remote Memory Access ♦ Higher-level Send Receive Interface ♦ On-chip RMA ♦ Message-passing Application ♦ Pipelined K-ary Tree Algorithm ♦ State-of-the-art Broadcast Algorithm ♦ On-chip Message Passing Buffer ♦ Collective Operation ♦ RMA-based Collective Operation ♦ Main Research Direction ♦ Message-passing Many-core Chip ♦ Efficient Broadcast Algorithm ♦ Analytical Evaluation Highlight ♦ Analytical Evaluation ♦ Scalability Issue ♦ On-chip Broadcast ♦ High-performance RMA-based Broadcast ♦ LogP-based Model ♦ Many-core Chip ♦ Hardware Feature ♦ Intel Single-chip Cloud Computer ♦ Message-passing Programming Model ♦ Experimental Result ♦ Intel SCC ♦ Full Advantage ♦ Improves Latency ♦ Future Message-passing Many-core Architecture ♦ Broadcast Algorithm
Description Many-core chips with more than 1000 cores are expected by the end of the decade. To overcome scalability issues related to cache coherence at such a scale, one of the main research directions is to leverage the message-passing programming model. The Intel Single-Chip Cloud Computer (SCC) is a prototype of a message-passing many-core chip. It offers the ability to move data between on-chip Message Passing Buffers (MPBs) using Remote Memory Access (RMA). The performance of message-passing applications is directly affected by the efficiency of collective operations, such as broadcast. In this paper, we study how to make use of the MPBs to implement an efficient broadcast algorithm for the SCC. We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Using a LogP-based model, we present an analytical evaluation that compares our algorithm with the state-of-the-art broadcast algorithms implemented for the SCC. As predicted by the model, experimental results show that OC-Bcast attains almost three times better throughput and improves latency by at least 27%. Furthermore, the analytical evaluation highlights the benefits of our approach: OC-Bcast takes direct advantage of RMA, unlike the other considered broadcast algorithms, which are based on a higher-level send/receive interface. This leads us to the conclusion that RMA-based collective operations are needed to take full advantage of the hardware features of future message-passing many-core architectures.
Educational Role Student ♦ Teacher
Age Range Above 22 years
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study
Learning Resource Type Article
Publisher Date 2012-01-01
Publisher Institution In Proc. 24th ACM Symp. on Parallelism in Alg. and Arch. (SPAA’12)
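
Illustrative note: the Description above refers to a pipelined k-ary tree broadcast. The sketch below is not the authors' OC-Bcast implementation; it only shows one common way to lay out a k-ary tree over core ranks and to split a message into chunks that could be forwarded down the tree in a pipeline. The function names (`kary_children`, `kary_parent`, `split_into_chunks`), the heap-style numbering, and the example values are assumptions made for illustration.

```python
from typing import List, Optional


def kary_children(rank: int, k: int, num_cores: int, root: int = 0) -> List[int]:
    """Children of `rank` in a k-ary tree over `num_cores` ranks (hypothetical layout)."""
    # Renumber ranks so that the broadcast root has relative id 0.
    rel = (rank - root) % num_cores
    first = k * rel + 1
    # Heap-style numbering: node r has children k*r + 1 .. k*r + k, if they exist.
    rel_children = range(first, min(first + k, num_cores))
    return [(c + root) % num_cores for c in rel_children]


def kary_parent(rank: int, k: int, num_cores: int, root: int = 0) -> Optional[int]:
    """Parent of `rank` in the same tree, or None for the root."""
    rel = (rank - root) % num_cores
    if rel == 0:
        return None
    return ((rel - 1) // k + root) % num_cores


def split_into_chunks(message: bytes, chunk_size: int) -> List[bytes]:
    """Split a message into fixed-size chunks (the last one may be shorter)."""
    # In a pipelined broadcast, a node can forward chunk i to its children
    # while its parent is already providing chunk i + 1.
    return [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]


if __name__ == "__main__":
    # Example values only: 48 ranks, fan-out k = 3, root rank 0.
    for r in range(5):
        print(r, kary_parent(r, 3, 48), kary_children(r, 3, 48))
    print(len(split_into_chunks(b"x" * 1000, 256)))
```

A larger fan-out k gives a shallower tree but more children per node; how that trade-off plays out on the SCC is what the paper's LogP-based model and experiments address, and it is not captured by this sketch.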