© copyright 2006-2013 by FORTH, IEEE, ACM, and Springer
Our Formic system, shown in the photograph on the right,
consists of 64 FPGA-based Formic boards,
interconnected through a 4x4x4 three-dimensional cube network,
and emulates a 512-core system.
Formic is described in the publication referenced below.
For detailed information about the Formic board and system, and for downloadable design files, please visit: http://formic-board.com
Myrmics is a parallel, task-based Runtime System for Formic. For a description, related publications, and the Myrmics downloadable code, please visit: http://myrmics.com
ABSTRACT: Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture on FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) to model caches, (ii) insufficient inter-board connectivity for scaling the design or (iii) the board is too expensive. We address these shortcomings by designing a new FPGA board for multicore architecture prototyping, which explicitly targets scalability and cost-efficiency. Formic has a 35% bigger FPGA, three times more SRAM, four times more links and costs at most half as much when compared to the popular Xilinx XUPV5 prototyping platform. We build and test a 64-board system by developing a 512-core, MicroBlaze-based, non-coherent hardware prototype with DMA capabilities, with full network-on-chip in a 3D-mesh topology. We believe that Formic offers significant advantages over existing academic and commercial platforms that can facilitate hardware prototyping for future manycore architectures.
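As an illustration of the figures quoted above (64 boards in a 4x4x4 arrangement, 512 cores in total), the following is a minimal Python sketch of how boards in such a 3D mesh could be addressed. The linear board numbering, the absence of wrap-around links, and the function names `board_coords`, `neighbors`, and `hop_count` are assumptions made for this example only; they are not taken from the Formic design files.

```python
# Illustrative sketch only: numbering scheme and helper names are hypothetical.
DIM = 4                             # boards per dimension (4x4x4, from the text)
BOARDS = DIM ** 3                   # 64 boards
CORES = 512                         # emulated cores (from the text)
CORES_PER_BOARD = CORES // BOARDS   # 512 / 64 = 8 cores per board


def board_coords(board_id):
    """Map a linear board id (0..63) to (x, y, z); x varies fastest (assumed)."""
    if not 0 <= board_id < BOARDS:
        raise ValueError("board_id out of range")
    x = board_id % DIM
    y = (board_id // DIM) % DIM
    z = board_id // (DIM * DIM)
    return (x, y, z)


def neighbors(coords):
    """Return the directly linked boards, assuming a mesh without wrap-around."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            c = list(coords)
            c[axis] += step
            if 0 <= c[axis] < DIM:
                result.append(tuple(c))
    return result


def hop_count(a, b):
    """Minimum number of board-to-board hops between two boards in the mesh."""
    return sum(abs(p - q) for p, q in zip(a, b))
```

Under these assumptions a corner board has three neighbors, an interior board has six, and the longest minimal route (corner to opposite corner) takes nine hops.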
ABSTRACT: Task-based dataflow programming models and runtimes emerge as promising candidates for programming multicore and manycore architectures. These programming models dynamically analyze task dependencies at runtime and schedule independent tasks concurrently on the processing elements. In such models, cache locality, which is critical for performance, becomes more challenging in the presence of fine-grain tasks, and in architectures with many simple cores.
This paper presents a combined hardware-software approach to improve cache locality and offer better performance in terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms HW-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.
ABSTRACT: SARC merges cache controller and network interface functions by relying on a single hardware primitive: each access checks the tag and the state of the addressed line for possible occurrence of events that may trigger responses like coherence actions, RDMA, synchronization, or configurable event notifications. The fully virtualized and protected user-level API is based on specially marked lines in the scratchpad space that respond as command buffers, counters, or queues. The runtime system maps communication abstractions of the programming model to data transfers among local memories using remote write or read DMA and into task synchronization and scheduling using notifications, counters, and queues. The on-chip network provides efficient communication among these configurable memories, using advanced topologies and routing algorithms, and providing for process variability in NoC links. We simulate benchmark kernels on a full-system simulator to compare speedup and network traffic against cache-only systems with directory-based coherence and prefetchers. Explicit communication provides 10 to 40% higher speedup on 64 cores, and reduces network traffic by factors of 2 to 4, thus economizing on energy and power; lock and barrier latency is reduced by factors of 3 to 5.
ABSTRACT: We present the hardware design and implementation of a local memory system for individual processors inside future chip multiprocessors (CMP). Our memory system supports both implicit communication via caches, and explicit communication via directly accessible local ("scratchpad") memories and remote DMA (RDMA). We provide run-time configurability of the SRAM blocks that lie near each processor, so that portions of them operate as 2nd level (local) cache, while the rest operate as scratchpad. We also strive to merge the communication subsystems required by the cache and scratchpad into one integrated Network Interface (NI) and Cache Controller (CC), in order to economize on circuits. The processor interacts with the NI at user-level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters, and queues. We describe the NI design, the hardware cost, and the latencies of our FPGA-based prototype implementation that integrates four MicroBlaze processors, each with 64 KBytes of local SRAM, a crossbar NoC, and a DRAM controller. One-way, end-to-end, user-level communication completes within about 20 clock cycles for short transfer sizes.
The prototype includes multiple Xilinx XUPV5 processor boards, containing 4 MicroBlaze cores per board, interconnected via a Xilinx ML325 switch board that contains 3 parallel crossbars, using 3 RocketIO (2.5 Gbps) links per board.
ABSTRACT: Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds (the flexibility of caches and the efficiency of scratchpad memories): on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a mechanism for software configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, memory barriers for explicitly-selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated the on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.
ABSTRACT: Tasks of parallel or parallelized programs cooperate with each other by exchanging data, which get transferred from one local or cache memory to another. If we let hardware prefetchers and cache coherence perform these transfers, significant network bandwidth (and energy) are consumed, especially under directory-based coherence, and extra latencies occur when prefetchers fail to correctly predict software behavior. Recent advances in programming models and runtime systems allow runtime libraries to know when specific software objects should be transferred, from where to where, during task scheduling and execution, thus explicitly managing locality and economizing on network packets and energy.
We argue that the runtime tables that contain such knowledge for explicit communication fulfill goals analogous to coherence directories, and can thus obviate hardware coherence. Furthermore, these runtime tables also serve functions analogous to page tables, and thus traditional virtual memory could perhaps be replaced by a simpler scheme, used for protection purposes only. In such new systems, the runtime instructs the hardware to replicate or migrate entire (variable-size) "objects", rather than individual cache lines or pages one at a time. When a large data structure spans several such objects, inter-object pointers are a problem. We argue for a new breed of parallel data structures and algorithms that operate in units of objects that are larger than the traditional small data structure nodes, in a way analogous to what the database community did long ago for disk-resident data.
This conference paper is extended and superseded by the Transactions of HiPEAC journal paper.
This paper is superseded by the Transactions of HiPEAC journal paper.
ABSTRACT: Parallel computing systems are becoming widespread and grow in sophistication. Besides simulation, rapid system prototyping becomes important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as Remote DMA, Remote Queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.
The prototype includes eight x86 nodes, each with a 10Gbps PCI-X RDMA-capable NIC (DiniGroup Virtex-II Pro boards), interconnected via four Xilinx ML325 switch boards (variable-size buffered crossbars), using four RocketIO (2.5 Gbps) links per node.
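The link figures quoted for the prototypes can be checked with a short calculation. The sketch below assumes that bundled links simply add their raw bit rates; the `coding_efficiency` parameter (0.8, as for an 8b/10b line code) is an assumption introduced here to show how a payload estimate might be derived, and the source does not state which line coding the prototypes use. The function names are hypothetical.

```python
# Illustrative sketch only: function names and coding_efficiency are assumptions.
LINK_RAW_GBPS = 2.5  # raw rate of one RocketIO link, as quoted in the text


def bundle_raw_gbps(num_links, link_raw_gbps=LINK_RAW_GBPS):
    """Aggregate raw bit rate of a bundle of identical links."""
    if num_links < 1:
        raise ValueError("a bundle needs at least one link")
    return num_links * link_raw_gbps


def bundle_payload_gbps(num_links, coding_efficiency=0.8,
                        link_raw_gbps=LINK_RAW_GBPS):
    """Estimated payload rate after line-coding overhead (assumed efficiency)."""
    return bundle_raw_gbps(num_links, link_raw_gbps) * coding_efficiency
```

With four links per node the raw aggregate is 10 Gbps, which matches the 10 Gbps NIC quoted above; the three links per board of the XUPV5-based prototype give 7.5 Gbps raw.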
This paper is superseded by the IEEE IC-SAMOS 2007 conference paper.
Angelos Bilas, Alex Ramirez, and Georgi Gaydadjiev helped us shape our ideas; we deeply thank them. We also thank, for their participation and assistance: M. Ligerakis, M. Marazakis, M. Papamichael, E. Vlahos, G. Mihelogiannakis, and A. Ioannou.
We also deeply thank the Xilinx University Program for donating to us a number of FPGA chips, boards, and licences for the Xilinx EDA tools.
Last updated: June 2013, by M. Katevenis.