SOSP 09: http://www.sigops.org/sosp/sosp09/papers/kadav-sosp09.pdf
 



Hardware unreliability is a real and significant problem. Random crashes caused by misbehaving devices are commonplace. Users and sysadmins managing large grids and data centers attest to abrupt, random server crashes.



Quoting an industry source: “We have encountered hangs, interrupt storms in USB stacks due to some chipsets not behaving according to the USB OHCI specification. The same goes for graphics chipsets.” This happens because devices fail to obey their specifications, while device drivers make assumptions about the values devices return. A value outside the expected range can then cause a system panic and/or corruption.
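As an illustration of the kind of assumption described above, here is a minimal sketch in C of a driver helper that refuses to trust an index reported by a device. The names `rx_ring`, `RING_SIZE`, and `rx_ring_read_len` are hypothetical and chosen only for this example. They do not come from any real driver or from the paper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring size, chosen only for this illustration. */
#define RING_SIZE 16u

/* Hypothetical driver state: one recorded packet length per ring slot. */
struct rx_ring {
    uint16_t len[RING_SIZE];
};

/*
 * Look up the packet length for a slot index reported by the device.
 * The index is treated as untrusted input: a device that violates its
 * specification may report any value, so the index is range-checked
 * before it is used to address memory.
 *
 * Returns 0 and stores the length in *out on success, or -1 if an
 * argument is NULL or the device-supplied index is out of range.
 */
int rx_ring_read_len(const struct rx_ring *ring, uint32_t dev_index,
                     uint16_t *out)
{
    if (ring == NULL || out == NULL)
        return -1;
    if (dev_index >= RING_SIZE)
        return -1; /* device misbehaved; report an error instead of crashing */
    *out = ring->len[dev_index];
    return 0;
}
```

Without the range check, a single out-of-spec value from the device would turn into an out-of-bounds memory access in the kernel. With it, the driver can report the fault and carry on.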


Studies of Windows servers at Microsoft demonstrate the scope of the problem. In one study of Windows servers, eight percent of systems suffered from a storage or network adapter failure. Many of these failures are transient: hardware vendors repeatedly report that the majority of returned devices operate correctly, and retrying an operation often succeeds. In a separate study, nine percent of all unplanned reboots of servers at Microsoft were caused by adapter or hardware failures.
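Because many failures are transient and a retry often succeeds, a driver can wrap a device operation in a bounded retry loop. The following is a minimal sketch under stated assumptions, not a definitive implementation. The names `dev_op_fn`, `retry_dev_op`, and `flaky_op` are hypothetical, and `flaky_op` only simulates a device that fails a fixed number of times before recovering.

```c
#include <stddef.h>

/* Hypothetical device operation: returns 0 on success, nonzero on failure. */
typedef int (*dev_op_fn)(void *ctx);

/*
 * Try a device operation up to max_attempts times.
 * Returns 0 as soon as one attempt succeeds, or -1 if op is NULL or
 * every attempt fails. The bound keeps a permanently failed device
 * from hanging the caller in an endless loop.
 */
int retry_dev_op(dev_op_fn op, void *ctx, unsigned max_attempts)
{
    if (op == NULL)
        return -1;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        if (op(ctx) == 0)
            return 0;
    }
    return -1;
}

/*
 * Simulated transient fault for illustration: ctx points to an int
 * holding the number of failures remaining before the device recovers.
 */
int flaky_op(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

A real driver would typically also wait between attempts and log the failure. Both are left out here to keep the sketch short.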

The story so far...

All about sources of hardware bugs

CMOS problems

Electromagnetic interference, radiation, wear-out, insufficient burn-in

Firmware code bugs

Read the paper here: http://pages.cs.wisc.edu/~kadav/papers/carb-sosp09.pdf