This paper argues for the effectiveness of vector architectures vis-à-vis superscalar and VLIW architectures for multimedia applications on embedded devices, where the key requirements are low power consumption, small code size, and a simple design and implementation. The paper evaluates three components of the design space (the ISA, the vectorizing compiler, and the processor microarchitecture) in some detail using the EEMBC benchmarks.

The VIRAM ISA is a vector load-store instruction set implemented as a simple coprocessor extension to MIPS. To utilize the hardware resources effectively, it supports narrow data types, allowing the vector datapath to be subdivided into narrower lanes. Special instructions were added for permuting elements within a vector as well as for faster context switches. The authors' evaluation shows that usage is spread fairly evenly across many of the instructions.

The VIRAM compiler was built from the vectorizing compilers used in Cray supercomputers. A two-pass scheme was designed to generate code for the narrow vector data types. The compiler was able to vectorize nearly all operations in the benchmarks considered, with the exception of Cjpeg and Djpeg, and the average vector length achieved in these workloads was also satisfactory. Because it does not need the loop unrolling and software pipelining that conventional compilers rely on, the VIRAM compiler achieves significant code size compaction. The compiler does not, however, perform basic block scheduling, so the authors manually produced optimized code for the performance evaluation.

VIRAM was built with 13 MB of on-chip DRAM. To hide the access latency, the memory system was deeply (15-stage) pipelined. Unlike scalar operations, vector operations can easily tolerate high latency and exploit increased bandwidth, so trading latency for throughput worked well in this case. The on-chip memory was backed by off-chip memory, with data transfers under software control. Four parallel lanes, each with a maximum 64-bit datapath, were connected to each functional unit. VIRAM performed better than most of the other systems under consideration on these workloads.
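The vectorization style described above can be sketched in plain Python: a strip-mined SAXPY loop that processes at most MVL elements per simulated vector instruction, the shape a vectorizing compiler emits for loops of arbitrary length. The value of MVL and the function name are illustrative assumptions, not VIRAM's actual parameters.

```python
MVL = 64  # hypothetical maximum vector length (illustrative, not VIRAM's exact figure)

def saxpy(a, x, y):
    """Strip-mined y = a*x + y over vectors of arbitrary length.

    Each trip of the outer loop sets the effective vector length to
    min(MVL, remaining elements) and then performs one simulated vector
    multiply-add covering that many elements at once.
    """
    out = list(y)
    i = 0
    while i < len(x):
        vl = min(MVL, len(x) - i)  # analogue of setting the vector length register
        # one simulated vector instruction operating on vl elements
        out[i:i + vl] = [a * xv + yv for xv, yv in zip(x[i:i + vl], out[i:i + vl])]
        i += vl
    return out
```

Because the vector length register absorbs the final short strip, no scalar cleanup loop is needed, which is one reason vector code stays compact relative to unrolled VLIW code.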
In fact, the optimized VIRAM (after basic block scheduling) outperforms the rest by a wide margin. When the clock rates of the other systems were scaled down to VIRAM's clock rate, VIRAM's performance advantage appeared even larger. The VIRAM design also exhibits good scalability in the number of vector lanes. Extensive use of static circuits and a low clock rate make VIRAM highly power efficient as well.

The authors deviate from the recommended EEMBC benchmarking guidelines, which call for running each benchmark many times over, a methodology that clearly advantages cache-based systems. The authors argue that the target machines for these applications are real-time embedded devices. It is unclear, however, why one would use a general-purpose superscalar or a full-blown VLIW architecture in such devices. We discussed the possibility of building CMPs with non-identical cores in this context, but it appeared that such a design would be highly complex and hard to test and verify.

Historical note: the IRAM project at Berkeley started out of a graduate class. The basic objective was to take a fresh look at the decoupling of memory and processor. Designing on-chip DRAM was a step toward bringing the two closer together and thus providing a large improvement in memory bandwidth.