Distributed Vector Architecture: Beyond a Single Vector-IRAM
Stefanos Kaxiras, Rabin Sugumar, Jim Schwarzmeier
Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, Colorado, June 1, 1997
Distributed Vector Architecture: Fine Grain Parallelism with Efficient Communication
Stefanos Kaxiras and Rabin Sugumar
University of Wisconsin-Madison Computer Sciences Dept. Tech. Report 1339, February 1997.
Also available from CRAY Research.
As processing power continues to increase while memory access latency and bandwidth become serious bottlenecks, processors and DRAM will be packaged ever more tightly together, possibly on a single chip. Such integration would give a processor orders of magnitude better bandwidth and latency to local memory than to remote memory. In this situation, an on-chip vector unit is advantageous because it can make efficient use of the high internal bandwidth. However, real-life vector applications, which have enormous memory requirements, would not fit in the non-expandable memory of a single integrated device, and their performance would be determined primarily by the amount of remote traffic they require. We propose a solution for running large vector applications on multiple vector-capable, tightly integrated processor-memory nodes. The vector processors of the individual nodes cooperate to work as a single larger vector processor, while the vector application occupies the memory of all nodes. The physical vector registers of the nodes combine to form larger architectural registers. Vector operations on the architectural registers are distributed among the nodes, each of which operates on its assigned elements. One of the novel contributions of our work is a variable, program-defined mapping of elements of the architectural vector registers to elements of the physical vector registers. This capability considerably reduces remote traffic for loading and storing physical vector registers to and from the distributed memory of the system. We introduce the notion of mapping vectors to specify this variable mapping. We present heuristics for selecting traffic-efficient mapping vectors and appropriate memory interleavings. Simulation results show that the DIstributed Vector Architecture (DIVA) we propose has the potential to result in lower remote traffic than other approaches.
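The effect of a mapping vector on remote traffic can be illustrated with a small sketch. This is not the simulator or the heuristics from the paper. It assumes block-interleaved memory, reduces the mapping to "which node holds element i" (ignoring the position within a physical register and any per-node register capacity), and all function names and parameters are hypothetical.

```python
def home_node(addr, num_nodes, block):
    """Node that owns word address `addr` under an assumed block interleaving."""
    return (addr // block) % num_nodes


def fixed_mapping(length, num_nodes):
    """Fixed mapping: architectural element i always lives on node i mod num_nodes."""
    return [i % num_nodes for i in range(length)]


def matched_mapping(base, stride, length, num_nodes, block):
    """Program-defined mapping: place each element on the home node of its address."""
    return [home_node(base + i * stride, num_nodes, block) for i in range(length)]


def remote_loads(base, stride, mapping, num_nodes, block):
    """Count elements of a strided vector load that must cross nodes."""
    return sum(
        1
        for i, node in enumerate(mapping)
        if home_node(base + i * stride, num_nodes, block) != node
    )


# Unit-stride load of 16 elements on 4 nodes, memory interleaved in 4-word blocks.
NODES, BLOCK, LENGTH = 4, 4, 16
fixed = remote_loads(0, 1, fixed_mapping(LENGTH, NODES), NODES, BLOCK)
matched = remote_loads(0, 1, matched_mapping(0, 1, LENGTH, NODES, BLOCK), NODES, BLOCK)
print(fixed, matched)  # prints "12 0"
```

In this toy case the fixed mapping leaves only 4 of the 16 elements local, while the mapping matched to the interleaving makes every element local. A real mapping vector would also have to respect the length of each node's physical registers, which this sketch does not model.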