Scalable Multicore Systems:
Interprocessor Communication and Memory Architecture

Manolis G.H. Katevenis, Spyros Lyberis, Vassilis Papaefstathiou, Stamatis Kavadias, Dimitrios S. Nikolopoulos, Dionisios Pnevmatikatos, Manolis Marazakis, George Kalokairinos, George Nikiforos, Christoforos Kachris, Xiaojun Yang, Dimitris Tsaliagkos, Pranav Tendulkar and Michael Zampetakis

© copyright 2006-2013 by FORTH, IEEE, ACM, and Springer

64 Formic boards in a 4x4x4 (3D) mesh interconnection, connected to 2 ARM systems and 1 XUP board

The "Formic" 512-core emulator and its "Myrmics" Runtime System:

Our Formic system, shown in the photograph on the right, consists of 64 FPGA-based Formic boards, interconnected through a 4x4x4 three-dimensional cube network, and emulates a 512-core system. Formic is described in the publication referenced below.
For detailed information about the Formic board and system, and for downloadable design files, please visit:
Myrmics is a parallel, task-based Runtime System for Formic. For a description, related publications, and the Myrmics downloadable code, please visit:

  • S. Lyberis, G. Kalokerinos, M. Lygerakis, V. Papaefstathiou, D. Tsaliagkos, M. Katevenis, D. Pnevmatikatos and D. Nikolopoulos: "Formic: Cost-efficient and Scalable Prototyping of Manycore Architectures", Proc. of the IEEE 20th Int. Symposium on Field-Programmable Custom Computing Machines (FCCM'12), Toronto Canada, May 2012, pp. 61-64; DOI: 10.1109/FCCM.2012.20
    - Preprint in PDF (2.1 MBytes); © Copyright 2012 by IEEE.
    ABSTRACT: Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture on FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) to model caches, (ii) insufficient inter-board connectivity for scaling the design or (iii) the board is too expensive. We address these shortcomings by designing a new FPGA board for multicore architecture prototyping, which explicitly targets scalability and cost-efficiency. Formic has a 35% bigger FPGA, three times more SRAM, four times more links and costs at most half as much when compared to the popular Xilinx XUPV5 prototyping platform. We build and test a 64-board system by developing a 512-core, MicroBlaze-based, non-coherent hardware prototype with DMA capabilities, with full network-on-chip in a 3D-mesh topology. We believe that Formic offers significant advantages over existing academic and commercial platforms that can facilitate hardware prototyping for future manycore architectures.
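
    The figures quoted above (64 boards, a 4x4x4 mesh, 512 cores) imply 8 cores per board and a worst-case distance of 9 board-to-board hops. The following Python sketch merely illustrates this arithmetic; the linear numbering of boards and cores and the helper names are assumptions made for the example, not part of the Formic design files.

```python
from itertools import product

DIM = 4                            # boards per mesh dimension (4x4x4, from the text)
BOARDS = DIM ** 3                  # 64 boards
CORES = 512                        # emulated cores (from the text)
CORES_PER_BOARD = CORES // BOARDS  # 8 cores per board

def board_coords(board_id):
    """Map a linear board id (0..63) to (x, y, z); this numbering is an assumption."""
    if not 0 <= board_id < BOARDS:
        raise ValueError("board id out of range")
    x = board_id % DIM
    y = (board_id // DIM) % DIM
    z = board_id // (DIM * DIM)
    return x, y, z

def hops(src, dst):
    """Minimal hop count between two boards in a mesh without wraparound links."""
    return sum(abs(a - b) for a, b in zip(board_coords(src), board_coords(dst)))

def core_to_board(core_id):
    """Board hosting a given core, assuming cores are numbered board by board."""
    return core_id // CORES_PER_BOARD

# Longest minimal path over all board pairs (corner to opposite corner).
diameter = max(hops(s, d) for s, d in product(range(BOARDS), repeat=2))
print(CORES_PER_BOARD, diameter)  # prints "8 9"
```

    The hop count is the Manhattan distance between board coordinates, which is the minimum for any mesh without wraparound links, whatever routing algorithm the actual network-on-chip uses.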

    Recent Work:

  • V. Papaefstathiou, M. Katevenis, D. S. Nikolopoulos, D. Pnevmatikatos: "Prefetching and Cache Management using Task Lifetimes", Proceedings of the 27th ACM International Conference on Supercomputing (ICS'13), Eugene, Oregon, USA, 10-14 June 2013, pp. 325-334; ISBN: 978-1-4503-2130-3; DOI: 10.1145/2464996.2465443.
    - Preprint in PDF (460 KBytes); © Copyright 2013 by ACM.
    ABSTRACT: Task-based dataflow programming models and runtimes emerge as promising candidates for programming multicore and manycore architectures. These programming models analyze dynamically task dependencies at runtime and schedule independent tasks concurrently to the processing elements. In such models, cache locality, which is critical for performance, becomes more challenging in the presence of fine-grain tasks, and in architectures with many simple cores.
    This paper presents a combined hardware-software approach to improve cache locality and offer better performance in terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms HW-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.

  • M. Katevenis, V. Papaefstathiou, S. Kavadias, D. Pnevmatikatos, F. Silla, and D. S. Nikolopoulos: "Explicit Communication and Synchronization in SARC", To appear in IEEE Micro Magazine (IEEE Micro), Special Issue: European Multicore Processing Projects, September/October 2010.
    - Preprint in PDF (640 KBytes); © Copyright 2010 by IEEE.

    ABSTRACT: SARC merges cache controller and network interface functions by relying on a single hardware primitive: each access checks the tag and the state of the addressed line for possible occurrence of events that may trigger responses like coherence actions, RDMA, synchronization, or configurable event notifications. The fully virtualized and protected user-level API is based on specially marked lines in the scratchpad space that respond as command buffers, counters, or queues. The runtime system maps communication abstractions of the programming model to data transfers among local memories using remote write or read DMA and into task synchronization and scheduling using notifications, counters, and queues. The on-chip network provides efficient communication among these configurable memories, using advanced topologies and routing algorithms, and providing for process variability in NoC links. We simulate benchmark kernels on a full-system simulator to compare speedup and network traffic against cache-only systems with directory-based coherence and prefetchers. Explicit communication provides 10 to 40% higher speedup on 64 cores, and reduces network traffic by factors of 2 to 4, thus economizing on energy and power; lock and barrier latency is reduced by factors of 3 to 5.
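
    To illustrate the kind of counter-based synchronization mentioned in the abstract, here is a minimal software analogy in Python. It is only a sketch under stated assumptions: the class and method names are hypothetical, callbacks stand in for hardware notifications, and nothing here reflects the actual SARC hardware interface.

```python
class CompletionCounter:
    """Toy software model of a counter that fires notifications when it reaches zero."""

    def __init__(self, expected_bytes):
        if expected_bytes <= 0:
            raise ValueError("expected_bytes must be positive")
        self.remaining = expected_bytes
        self.waiters = []  # callbacks standing in for notification targets

    def on_zero(self, callback):
        """Register a callback to run once all expected bytes have arrived."""
        if self.remaining == 0:
            callback()
        else:
            self.waiters.append(callback)

    def arrive(self, nbytes):
        """Record the arrival of nbytes of remotely written data."""
        if nbytes <= 0 or nbytes > self.remaining:
            raise ValueError("unexpected transfer size")
        self.remaining -= nbytes
        if self.remaining == 0:
            waiters, self.waiters = self.waiters, []
            for callback in waiters:
                callback()

log = []
counter = CompletionCounter(expected_bytes=256)
counter.on_zero(lambda: log.append("task ready"))
for chunk in (64, 64, 128):  # three transfers; their arrival order does not matter
    counter.arrive(chunk)
print(log)  # prints "['task ready']"
```

    The point of the analogy is that the consumer does not poll for individual transfers: it is notified once, when the total amount of expected data has arrived, regardless of how the data was split into transfers.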

  • G. Kalokairinos, V. Papaefstathiou, G. Nikiforos, S. Kavadias, M. Katevenis, D. Pnevmatikatos, and X. Yang: "Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability", To appear in Transactions on High-Performance Embedded Architectures and Compilers (Transactions on HiPEAC), Special Issue: SAMOS2009 Best Papers, Springer Verlag LNCS 2010.
    - Preprint in PDF (370 KBytes); © Copyright 2010 by Springer.

    [Figure: Xilinx XUPV5 FPGA-based Prototype]

    ABSTRACT: We present the hardware design and implementation of a local memory system for individual processors inside future chip multiprocessors (CMP). Our memory system supports both implicit communication via caches, and explicit communication via directly accessible local ("scratchpad") memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks that lie near each processor, so that portions of them operate as 2nd level (local) cache, while the rest operate as scratchpad. We also strive to merge the communication subsystems required by the cache and scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor interacts with the NI at user-level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters, and queues. We describe the NI design, the hardware cost, and the latencies of our FPGA-based prototype implementation that integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller. One-way, end-to-end, user-level communication completes within about 20 clock cycles for short transfer sizes.

    The prototype includes multiple Xilinx XUPV5 processor boards, containing 4 MicroBlaze cores per board, interconnected via a Xilinx ML325 switch board that contains 3 parallel crossbars, using 3 RocketIO (2.5 Gbps) links per board.

    Support for Explicit Communication and Synchronization:

  • S. Kavadias, M. Katevenis, M. Zampetakis, and D. S. Nikolopoulos: "On-chip Communication and Synchronization Mechanisms with Cache-Integrated Network Interfaces", Proc. 7th ACM International Conference on Computing Frontiers (CF-2010), Bertinoro, Italy, 17-19 May 2010, pp. 217-226, ISBN: 978-1-4503-0044-5 (ranked, by the PC Co-Chairs, as one of the top three papers of the Conference)
    - Preprint in PDF (390 KBytes); © Copyright 2010 by ACM.

    ABSTRACT: Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds --the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a mechanism for software configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, memory barriers for explicitly-selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated the on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.

  • M. Katevenis: "Replicate and Migrate Objects in the Runtime, not Cache Lines or Pages in Hardware", Invited Talk at the Barcelona Multicore Workshop 2010 (BMW 2010), Barcelona, Spain, 21-22 Oct. 2010.
    - Slides available in PDF (1.5 MBytes); © Copyright 2010 by FORTH.

    ABSTRACT: Tasks of parallel or parallelized programs cooperate with each other by exchanging data, which get transferred from one local or cache memory to another. If we let hardware prefetchers and cache coherence perform these transfers, significant network bandwidth (and energy) are consumed, especially under directory-based coherence, and extra latencies occur when prefetchers fail to correctly predict software behavior. Recent advances in programming models and runtime systems allow runtime libraries to know when specific software objects should be transferred, from where to where, during task scheduling and execution, thus explicitly managing locality and economizing on network packets and energy.
    We argue that the runtime tables that contain such knowledge for explicit communication fulfill goals analogous to coherence directories, and can thus obviate hardware coherence. Furthermore, these runtime tables also serve functions analogous to page tables, and thus traditional virtual memory could perhaps be replaced by a simpler scheme, used for protection purposes only. In such new systems, the runtime instructs the hardware to replicate or migrate entire (variable-size) "objects", rather than individual cache lines or pages one at a time. When a large data structure spans several such objects, inter-object pointers are a problem. We argue for a new breed of parallel data structures and algorithms that operate in units of objects that are larger than the traditional small data structure nodes, in a way analogous to what the database community did long ago for disk-resident data.

  • M. Katevenis: "Towards Unified Mechanisms for Inter-Processor Communication", Keynote Presentation at the IEEE Int. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS2008), Samos, Greece, 21-24 July 2008.
    - Slides available in PDF (130 KBytes); © Copyright 2008 by FORTH.
  • M. Katevenis: "Interprocessor Communication seen as Load-Store Instruction Generalization", in The Future of Computing, essays in memory of Stamatis Vassiliadis, K. Bertels e.a. Editors, Delft, The Netherlands, 28 Sep. 2007, pp. 55-68.
    - Available in PDF (3.7 MBytes) - Slides in PDF (40 KBytes); © Copyright 2007 by FORTH.
  • C. Villavieja, M. Katevenis, N. Navarro, D. Pnevmatikatos, A. Ramirez, S. Kavadias, V. Papaefstathiou, and D. S. Nikolopoulos: "Hardware Support for Explicit Communication in Scalable CMP's", Technical Report UPC-DAC-RR-CAP-2009-1, UPC, BSC, and FORTH-ICS, January 2009.
    - Available in PDF (420 KBytes); © Copyright 2009 by UPC, BSC, and FORTH.

    Hardware Prototypes for Interprocessor Communication Mechanisms:

    - Tightly-coupled Network Interfaces (2008-2010)

  • G. Kalokairinos, V. Papaefstathiou, G. Nikiforos, S. Kavadias, M. Katevenis, D. Pnevmatikatos, and X. Yang: "FPGA Implementation of a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability", Proc. IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (IC-SAMOS2009), Samos, Greece, 20-23 July 2009, ISBN 978-1-4244-4501-1, pp. 149-156.
    - Preprint in PDF (440 KBytes) © Copyright 2009 by IEEE; Slides in PDF (550 KBytes) © Copyright 2009 by FORTH.

    This conference paper is extended and superseded by the Transactions on HiPEAC journal paper.

  • G. Nikiforos, G. Kalokairinos, V. Papaefstathiou, S. Kavadias, D. Pnevmatikatos, and M. Katevenis, "A run-time Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability", In the 6th HiPEAC Industrial Workshop on Embedded Computing, THALES Research and Development - Palaiseau, Paris, France, 26 November 2008.
    - Available in PDF (270 KBytes) - Slides in PDF (210 KBytes) © Copyright 2008 by FORTH.

    This paper is superseded by the Transactions on HiPEAC journal paper.

    - Loosely-coupled Network Interfaces (2006-2007)

  • V. Papaefstathiou, D. Pnevmatikatos, M. Marazakis, G. Kalokairinos, A. Ioannou, M. Papamichael, S. Kavadias, G. Mihelogiannakis, and M. Katevenis: "Prototyping Efficient Interprocessor Communication Mechanisms", Proc. IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (IC-SAMOS2007), Samos, Greece, 16-19 July 2007.
    - Preprint in PDF (130 KBytes); © Copyright 2007 by IEEE.

    [Figure: DiniGroup Xilinx VII-Pro FPGA-based Prototype]

    ABSTRACT: Parallel computing systems are becoming widespread and grow in sophistication. Besides simulation, rapid system prototyping becomes important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.

    The prototype includes eight x86 nodes, each with a 10Gbps PCI-X RDMA-capable NIC (DiniGroup Virtex-II Pro boards), interconnected via four Xilinx ML325 switch boards (variable-size buffered crossbars), using four RocketIO (2.5 Gbps) links per node.

  • V. Papaefstathiou, G. Kalokairinos, A. Ioannou, M. Papamichael, G. Mihelogiannakis, S. Kavadias, E. Vlahos, D. Pnevmatikatos, and M. Katevenis: "An FPGA-based Prototyping Platform for Research in High-Speed Interprocessor Communication", In the 2nd HiPEAC Industrial Workshop on Embedded Computing, Philips (NXP), Eindhoven, Netherlands, 17 October 2006.
    - Available in PDF (200 KBytes) - Slides in PDF (1 MByte) © Copyright 2006 by FORTH.

    This paper is superseded by the IEEE IC-SAMOS 2007 conference paper.

    Other Papers, Posters, and Related Work:

  • C. Kachris, G. Nikiforos, V. Papaefstathiou, S. Kavadias, and M. Katevenis: "Low-latency Explicit Communication and Synchronization in Scalable Multi-core Clusters", Short paper and poster presented at the IEEE International Conference on Cluster Computing (CLUSTER2010), Hersonissos, Crete, Greece, 20-24 September 2010.
    - Preprint in PDF (430 KBytes) © Copyright 2010 by IEEE.
    - Poster in PDF (660 KBytes) © Copyright 2010 by FORTH.
  • M. Katevenis, V. Papaefstathiou, S. Kavadias, G. Nikiforos, D. Pnevmatikatos, D. Nikolopoulos, and C. Kachris: "Explicit Communication and Synchronization in SARC", Poster presented at the HiPEAC Innovation Event, Edinburgh, UK, 3-5 May 2010 (ranked 3rd out of 19 in the poster competition).
    - Available in PDF (290 KBytes); © Copyright 2010 by FORTH.
  • M. Marazakis, V. Papaefstathiou, and A. Bilas: "Optimization and Bottleneck Analysis of Network Block I/O in Commodity Storage Systems", In Proc. 21st ACM International Conference on Supercomputing (ICS2007), Seattle, Washington, USA, 16-20 June 2007.
    - Preprint in PDF (210 KBytes); © Copyright 2007 by ACM.
  • M. Marazakis, V. Papaefstathiou, G. Kalokairinos, and A. Bilas: "Experiences from Debugging a PCI-X-based RDMA-capable NIC", In the 3rd Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies (RAIT2006) - In conjunction with IEEE International Conference on Cluster Computing (CLUSTER2006), Barcelona, Spain, 25-28 September 2006.
    - Preprint in PDF (120 KBytes); © Copyright 2006 by IEEE.
  • M. Marazakis, K. Xinidis, V. Papaefstathiou, and A. Bilas, "Efficient Remote Block-level I/O over an RDMA-capable NIC", In Proc. 20th ACM International Conference on Supercomputing (ICS2006), Queensland, Australia, 28 June - 1 July 2006.
    - Preprint in PDF (110 KBytes); © Copyright 2006 by ACM.

    Past Work on IPC: The Telegraphos Project (1993-97)

    Telegraphos -- from the Greek words "tele" (remote) and "grapho" (write) -- was a project on low-latency, high-throughput interprocessor communication. During that project, in 1993-1997, at FORTH-ICS CARV Laboratory, workstation clustering prototypes were designed and built, including processor-network interfaces for remote-write based, protected, user-level communication.

    Projects - Funding - Acknowledgements

    This work is currently (2010-2012) being conducted mostly within the ENCORE (#248647) project on "ENabling technologies for a programmable many-CORE", and in cooperation with the TEXT (#261580) project, both funded by the European Union FP7 Programme. In the period 2006-2009, this work was conducted mostly within the SARC European integrated project on "Scalable computer ARChitecture", funded by the European Union FP6 Programme (#027648). Financial support, especially for hardware prototyping, was also provided by the FP6 Marie-Curie project UNiSIX (MC #509595). Our work in general, and the ENCORE and SARC projects in particular, are within the framework of the HiPEAC Network of Excellence.

    Angelos Bilas, Alex Ramirez, and Georgi Gaydadjiev helped us shape our ideas; we deeply thank them. We also thank, for their participation and assistance: M. Ligerakis, M. Marazakis, M. Papamichael, E. Vlahos, G. Mihelogiannakis, and A. Ioannou.

    We also deeply thank the Xilinx University Program for donating to us a number of FPGA chips, boards, and licences for the Xilinx EDA tools.

    © Copyright 2006-2013 by IEEE or ACM or Springer or FORTH:
    These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the IEEE or ACM or Springer or FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the IEEE or of the ACM or of the Springer or of the Foundation for Research & Technology - Hellas (FORTH), as appropriate. To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.

    Last updated: June 2013, by M. Katevenis.