This paper argues for the effectiveness of vector architectures vis-à-vis superscalar and VLIW architectures for multimedia applications on embedded devices, where the key requirements are low power consumption, small code size, and a simple design and implementation. The paper evaluates three components of the design space (the ISA, the vectorizing compiler, and the processor microarchitecture) in some detail using the EEMBC benchmarks.

The VIRAM ISA is a vector load-store instruction set implemented as a simple coprocessor extension to MIPS. To utilize the hardware resources effectively, it supports narrow data types, allowing the vector datapath to be subdivided into narrower lanes. Special instructions were added for permuting elements within a vector as well as for faster context switches. The authors' evaluation shows that usage is spread fairly evenly across many of the instructions.

The VIRAM compiler was built from the vectorizing compilers used in Cray supercomputers. A two-pass scheme was designed to generate code for the narrow vector data types. The compiler was able to vectorize nearly all operations in the benchmarks considered, with the exception of Cjpeg and Djpeg, and the average vector length achieved in these workloads was also satisfactory. Because it does not need the loop unrolling and software pipelining that conventional compilers rely on, the VIRAM compiler achieves significant code size compaction. The compiler does not, however, perform basic block scheduling, so the authors manually produced optimized code for the performance evaluation.

VIRAM was built with 13 MB of on-chip DRAM. To hide the access latency, the memory system was deeply (15-stage) pipelined. Unlike scalar operations, vector operations can easily tolerate high latency and exploit increased bandwidth, so trading latency for throughput worked well in this case. The on-chip memory was backed by off-chip memory, with data transfers under software control. Four parallel lanes, each with a maximum 64-bit datapath, were connected to each functional unit. VIRAM performed better than most of the other systems under consideration on these workloads.
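The vectorization style described above can be sketched in plain Python: a strip-mined SAXPY loop that processes at most MVL elements per simulated vector instruction, the shape a vectorizing compiler emits for loops of arbitrary length. The value of MVL and the function name are illustrative assumptions, not VIRAM's actual parameters.

```python
MVL = 64  # hypothetical maximum vector length (illustrative, not VIRAM's exact figure)

def saxpy(a, x, y):
    """Strip-mined y = a*x + y over vectors of arbitrary length.

    Each trip of the outer loop sets the effective vector length to
    min(MVL, remaining elements) and then performs one simulated vector
    multiply-add covering that many elements at once.
    """
    out = list(y)
    i = 0
    while i < len(x):
        vl = min(MVL, len(x) - i)  # analogue of setting the vector length register
        # one simulated vector instruction operating on vl elements
        out[i:i + vl] = [a * xv + yv for xv, yv in zip(x[i:i + vl], out[i:i + vl])]
        i += vl
    return out
```

Because the vector length register absorbs the final short strip, no scalar cleanup loop is needed, which is one reason vector code stays compact relative to unrolled VLIW code.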
In fact, the optimized VIRAM (after basic block scheduling) outperforms the rest by a wide margin. When the clock rates of the other systems were scaled down to VIRAM's clock rate, VIRAM's performance advantage appeared even larger. The VIRAM design also exhibits good scalability in the number of vector lanes. Extensive use of static circuits and a low clock rate make VIRAM highly power efficient as well.

The authors deviate from the recommended EEMBC benchmarking guidelines, which call for running each benchmark many times over, a methodology that clearly advantages cache-based systems. The authors argue that the target machines for these applications are real-time embedded devices. It is unclear, however, why one would use a general-purpose superscalar or a full-blown VLIW architecture in such devices. We discussed the possibility of building CMPs with non-identical cores in this context, but it appeared that such a design would be highly complex and hard to test and verify.

Historical note: the IRAM project at Berkeley started out of a graduate class. The basic objective was to take a fresh look at the decoupling of memory and processor. Designing on-chip DRAM was a step toward bringing the two closer together and thus providing a large improvement in memory bandwidth.