The 45th ACM Technical Symposium on Computer Science Education (SIGCSE), 2014

Raghuraman Balasubramanian, Zachary York, Matthew Dorran, Aritra Biswas, Timur Girgin, Karthikeyan Sankaralingam

Paper

This paper details the creation of a hands-on introductory course that reflects the dramatic growth and diversity in computer science. Our aim was to give students an end-to-end perspective on computer system design by having them build one. We report on a two-year exercise in using the Arduino platform to build a series of hands-on projects. We have used these projects in two course instances and have obtained detailed student feedback, which we analyze and present in this paper. The instructions, code, and videos developed are available open-source here.

The 20th International Symposium on High-Performance Computer Architecture (HPCA), 2014

Raghuraman Balasubramanian, Karthikeyan Sankaralingam

Paper

This paper introduces a novel end-to-end platform called PERSim that allows FPGA-accelerated full-system simulation of complete programs on prototype hardware, with detailed fault injection that can capture gate delays and digital logic behavior of arbitrary circuits and provides full coverage. We use PERSim and report on five case studies spanning a diverse spectrum of reliability techniques, including wearout prediction/detection (FIRST, Wearmon, TRIX), transient faults, and permanent faults (Sampling-DMR). PERSim provides unprecedented capability to study these techniques quantitatively when applied to a full processor and when running complete programs. These case studies demonstrate PERSim’s robustness and flexibility: such a diverse set of techniques can be studied uniformly with common metrics like area overhead, power overhead, and detection latency. PERSim provides many new insights, of which two important ones are: i) We discover an important modeling “hole”: when the true logic delay behavior is considered, non-critical paths can directly transition into logic faults, so delay-based detection/prediction mechanisms targeted at critical paths alone are insufficient. ii) When Sampling-DMR is evaluated in a real system running full applications, detection latency is orders of magnitude lower than the previously reported model-based worst-case latency (107 seconds vs. 0.84 seconds), thus dramatically strengthening Sampling-DMR’s effectiveness. The framework is released open source (coming soon) and runs on the Zynq platform.

The 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013

Raghuraman Balasubramanian, Karthikeyan Sankaralingam

Paper | Talk PPT | Lightning Talk | Poster

Hardware failure due to wearout is a growing concern. Circuit failure prediction is an approach that is effective if it meets the following requirements: low design complexity, low overheads, generality (supporting various types of wearout including soft and hard breakdown), and high accuracy. State-of-the-art techniques, which typically detect and measure low-level circuit properties like gate delay, cannot deliver on all four requirements. Moving away from the paradigm of measuring circuit delays is key to satisfying the four design requirements. Our insight is to virtually age the processor and thus manifest a wearout fault early: we convert the delay degradation into a logic fault, expose the fault, and then detect it. To virtually age the processor, we reduce the supply voltage, which effectively mirrors wearout. For fault exposure, we observe that faults in critical paths are naturally exposed, and we develop a technique to expose faults along the non-critical paths using clock phase shifting logic. Our system, Aged-SDMR, combines these two mechanisms to expose wearout faults early and detects them using Sampling-DMR. We also develop principles to combine these two mechanisms with any detection technique. We implement a prototype system based on the OpenRISC processor on a Xilinx Zynq FPGA. We demonstrate that Aged-SDMR is practical and delivers on all four requirements, has area and energy overheads of 9% and 0.7% respectively, takes at most 0.4 days to detect failure after onset, and has a configurable early warning window. More generally, Aged-SDMR provides the capability for low-overhead DMR execution without any missed errors and with 100% coverage. It is likely to find broad uses within reliability and elsewhere.