Issue / Programmer Fix (as mentioned) / Our Possible Fix
Handling Error Conditions (Most systems give blue screen)
handle error condition, perform online recovery
Shadow Drivers
Halt System (call panic(), BUG_ON, ASSERT)
Don't halt, recover and then err out
Shadow Drivers
Handle all possible hardware fault codes
Handle hardware fault codes so the thread can move forward gracefully when an error occurs; use common error routines to handle hardware errors.
Check return values of hardware functions and recover as required.
Proactively check for hardware failure often
1. Sanity check hardware responses
-- Check for a -1 return value (all bits set: 1111...) and halt driver progress immediately.
-- Check status against valid possibilities (driver specific)
-- Check data patterns in DMA areas to verify that DMA has occurred(?)
2. Time transactions (I/O, state changes, etc.)
3. Check for liveness of hardware (using some no-op)
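
The sanity checks on hardware responses can be sketched roughly as below. This is only an illustration: the status bits, the `check_status` name and the rule that READY and BUSY never appear together are assumptions for an imaginary device, since valid status values are driver specific.

```c
#include <stdint.h>

/* Hypothetical status bits for an imaginary device (driver specific). */
#define STATUS_READY      0x01u
#define STATUS_BUSY       0x02u
#define STATUS_ERROR      0x04u
#define STATUS_VALID_MASK (STATUS_READY | STATUS_BUSY | STATUS_ERROR)

enum hw_check { HW_OK, HW_GONE, HW_BAD_STATUS };

/* Classify a raw 32-bit status word read from the device. */
static enum hw_check check_status(uint32_t raw)
{
    /* All bits set (-1): treat the device as absent and stop driver progress. */
    if (raw == UINT32_MAX)
        return HW_GONE;
    /* Any bit outside the known set is not a valid possibility. */
    if (raw & ~STATUS_VALID_MASK)
        return HW_BAD_STATUS;
    /* Assumption of this sketch: READY and BUSY together is contradictory. */
    if ((raw & STATUS_READY) && (raw & STATUS_BUSY))
        return HW_BAD_STATUS;
    return HW_OK;
}
```

A caller would stop issuing I/O on HW_GONE and enter its error recovery path on HW_BAD_STATUS.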

Simple Error Recovery
Use a different simple, fast error recovery for each situation.

Report all errors
Report all errors for complete diagnosis.
Always log error messages to event log.

Hide recoverable faults from user applications
Handle recovery and errors transparently as much as possible.
Driver should attempt to reset the device and re-issue the request.
Out of memory conditions and resource depletions should be handled gracefully.
Should fail in such a way that higher level system software can recover.
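
A minimal sketch of "reset the device and re-issue the request" follows. The `dev_ops` structure, the retry budget and the fake device are all hypothetical names introduced for illustration; a real driver would plug in its own submit and reset routines.

```c
#define MAX_RETRIES 3  /* assumed retry budget, not from the source notes */

/* Hypothetical device operations supplied by the driver. */
struct dev_ops {
    int  (*submit)(void *ctx);  /* returns 0 on success, nonzero on fault */
    void (*reset)(void *ctx);   /* returns the device to a known state */
};

/*
 * Try the request; on failure reset the device and re-issue it.
 * Returns 0 if some attempt succeeded (the fault stays hidden from the
 * caller), otherwise the last error code so higher-level software can
 * decide how to recover.
 */
static int submit_with_recovery(const struct dev_ops *ops, void *ctx)
{
    int err = -1;
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
        err = ops->submit(ctx);
        if (err == 0)
            return 0;
        ops->reset(ctx);
    }
    return err;
}

/* Fake device for illustration: fails a fixed number of times, then works. */
struct fake_dev { int failures_left; int resets; };

static int fake_submit(void *ctx)
{
    struct fake_dev *d = ctx;
    if (d->failures_left > 0) {
        d->failures_left--;
        return -5;              /* arbitrary error code */
    }
    return 0;
}

static void fake_reset(void *ctx)
{
    struct fake_dev *d = ctx;
    d->resets++;
}
```

The loop is bounded on purpose, so a device that never recovers produces an error for the caller instead of an endless reset cycle.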

Ensure asserts are tested in the free build. Assert code is normally compiled only into the debug build, so put the asserted checks in regular code as well.
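
One possible way to keep assert-style checks active in the free build is sketched below. `CHECK_OR_FAIL` and `set_length` are hypothetical names; the point is only that the check is ordinary code that is not compiled out by NDEBUG and that it returns an error instead of halting.

```c
#include <stdio.h>

/*
 * Hypothetical macro. Unlike assert(), it is not removed when NDEBUG is
 * defined, so the check also runs in the free build. It logs the failed
 * condition and makes the enclosing function return `err`.
 */
#define CHECK_OR_FAIL(cond, err)                                  \
    do {                                                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "check failed: %s (%s:%d)\n",         \
                    #cond, __FILE__, __LINE__);                   \
            return (err);                                         \
        }                                                         \
    } while (0)

/* Example routine: store a length only if it is plausible. */
static int set_length(unsigned *dst, unsigned len, unsigned max_len)
{
    CHECK_OR_FAIL(dst != NULL, -1);
    CHECK_OR_FAIL(len <= max_len, -2);
    *dst = len;
    return 0;
}
```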

Excessive spin on hardware
Don't spin on hardware unless critical.
Never block or hold interrupts for a long time.
Run operations in parallel.
Make the driver resource hierarchy simple.


Drivers should be designed for non-modal operation. (Do not expect things to happen in a certain order.)
Driver logic should be event driven.
No assumptions made on ordering of events
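
As a rough illustration of event-driven logic with no ordering assumptions, the sketch below records each event as it arrives and derives readiness from the recorded state. The event names and the `dev_state` fields are invented for the example.

```c
#include <stdbool.h>

/* Hypothetical events an imaginary device can raise, in any order. */
enum dev_event { EV_LINK_UP, EV_LINK_DOWN, EV_FW_LOADED, EV_REMOVED };

struct dev_state {
    bool link_up;
    bool fw_loaded;
    bool removed;
};

/* Record the event; no event is required to precede any other. */
static void handle_event(struct dev_state *s, enum dev_event ev)
{
    if (s->removed)
        return;                 /* events after removal are ignored */
    switch (ev) {
    case EV_LINK_UP:   s->link_up = true;   break;
    case EV_LINK_DOWN: s->link_up = false;  break;
    case EV_FW_LOADED: s->fw_loaded = true; break;
    case EV_REMOVED:
        s->removed = true;
        s->link_up = false;
        break;
    }
}

/* Readiness depends only on the current state, not on arrival order. */
static bool dev_ready(const struct dev_state *s)
{
    return !s->removed && s->link_up && s->fw_loaded;
}
```

Delivering EV_FW_LOADED before or after EV_LINK_UP leaves the device in the same ready state, which is the property the notes ask for.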

Minimize device usage - Number of devices is large. (redundant hardware)
- Hot add/remove (cannot always guarantee presence of device)
- Minimize:
-- Size of PCI BARs (ioremap)
-- Common communication buffers
- Driver may not get unloaded when hardware is removed.

Input Verification and Exiting - Minimal exit paths to centralize error handling and recovery. (Use structured exception handling in Windows)
- All input to driver routines must be verified
- Minimize complex nesting (to avoid subtle bugs).
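
In C, where structured exception handling is not available, the same idea is often expressed with a single cleanup label. The sketch below assumes a hypothetical `drv_write` entry point and an invented `MAX_XFER` limit; it verifies every input up front and leaves through one exit path.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_XFER 4096u  /* assumed upper bound on one transfer */

/*
 * Hypothetical write entry point. Every argument is verified first, and
 * every path leaves through the single `out` label so that cleanup lives
 * in one place and nesting stays shallow.
 */
static int drv_write(const void *buf, size_t len, unsigned channel,
                     unsigned num_channels)
{
    int err = 0;
    void *bounce = NULL;

    if (buf == NULL || len == 0 || len > MAX_XFER) {
        err = -1;
        goto out;
    }
    if (channel >= num_channels) {
        err = -2;
        goto out;
    }

    bounce = malloc(len);
    if (bounce == NULL) {       /* out of memory is handled gracefully */
        err = -3;
        goto out;
    }
    memcpy(bounce, buf, len);
    /* ... a real driver would hand `bounce` to the hardware here ... */

out:
    free(bounce);               /* free(NULL) is a no-op */
    return err;
}
```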
 

Serviceability - Able to diagnose issues without impacting availability of the fault-tolerant system.
- Ability to enable diagnosis on a running system.
- Minimal impact on normal performance (when diagnosis is not required)
- Diagnosis tools must produce clear results and also be used during the development stage.
- Diagnosis should be capable of being performed remotely.
- Trace produced should be categorized and be of variable verbosity.
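
A categorized, variable-verbosity trace that can be switched on at run time might look like the sketch below. The category bits, the `trace_mask` and `trace_level` variables and the `drv_trace` function are assumptions made for illustration; a real driver would expose the two settings through whatever control interface it already has.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical trace categories, one bit each. */
#define TRACE_IO   0x1u
#define TRACE_IRQ  0x2u
#define TRACE_PM   0x4u

/* Run-time settings. A zero mask means tracing is off and nearly free. */
static unsigned trace_mask  = 0;
static int      trace_level = 0;   /* higher value = more verbose */

static int trace_enabled(unsigned cat, int level)
{
    return (trace_mask & cat) != 0 && level <= trace_level;
}

/* Returns the number of characters written, or 0 if filtered out. */
static int drv_trace(unsigned cat, int level, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (!trace_enabled(cat, level))
        return 0;
    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);
    va_end(ap);
    return n;
}
```

With the mask at zero the only cost on the normal path is one comparison, which matches the "minimal impact when diagnosis is not required" point.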

Testing (Comprehensive Testing Required) Perform:
Load and Stress Testing
Failure testing
Power failure testing
Plug and Play testing
Duration Testing



Sun Table

Uses the term "Hardening": integrating fault management capabilities into I/O device drivers




Issue / Programmer Fix (as mentioned) / Our Possible Fix

Any data the driver reads from the device may be corrupt (PIO or DMA)
Especially pointers, memory offsets and array indexes.
Also check packet lengths, status words, channel IDs


Avoid releasing bad data to the system. Range check packet lengths against the size of the buffer, check status bits for impossible values, and check channel IDs for improbable values.
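
The range checks above could be sketched as follows. The descriptor layout, the status bits and the channel count are hypothetical; they stand in for whatever the real device defines.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receive descriptor as read from device-written memory. */
struct rx_desc {
    uint16_t length;
    uint8_t  status;
    uint8_t  channel;
};

#define RX_STATUS_DONE   0x01u
#define RX_STATUS_ERR    0x02u
#define RX_STATUS_KNOWN  (RX_STATUS_DONE | RX_STATUS_ERR)
#define NUM_CHANNELS     8u   /* assumed channel count */

/* Return true only if every field is plausible for a buffer of buf_size. */
static bool rx_desc_ok(const struct rx_desc *d, size_t buf_size)
{
    if (d->length == 0 || d->length > buf_size)
        return false;           /* packet length outside the buffer */
    if (d->status & ~RX_STATUS_KNOWN)
        return false;           /* undefined status bits are set */
    if (!(d->status & RX_STATUS_DONE))
        return false;           /* device has not completed this entry */
    if (d->channel >= NUM_CHANNELS)
        return false;           /* improbable channel ID */
    return true;
}
```

Data is passed up to the rest of the system only after `rx_desc_ok` returns true.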


Data from device is only available once


Do not re-read from the device. Store the data in the device's state; re-reading can trigger undesirable events.
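
A small sketch of reading once and keeping the value in the device's state is shown below. The `dev_soft` structure, the accessor type and the fake read-to-clear register are all invented for the example.

```c
#include <stdint.h>

/* Hypothetical per-device state holding a cached copy of the register. */
struct dev_soft {
    uint32_t irq_cause;   /* value captured from the device */
    int      have_cause;  /* nonzero once irq_cause is valid */
};

/* Hypothetical register accessor supplied by the driver. */
typedef uint32_t (*reg_read_fn)(void *hw);

/* Read the register at most once; later callers get the cached copy. */
static uint32_t get_irq_cause(struct dev_soft *s, reg_read_fn rd, void *hw)
{
    if (!s->have_cause) {
        s->irq_cause = rd(hw);
        s->have_cause = 1;
    }
    return s->irq_cause;
}

/* Fake read-to-clear register: a second read would return 0. */
struct fake_hw { uint32_t cause; int reads; };

static uint32_t fake_read_clear(void *hw)
{
    struct fake_hw *h = hw;
    uint32_t v = h->cause;
    h->cause = 0;
    h->reads++;
    return v;
}
```

The driver would clear `have_cause` when it starts handling the next event, so each event costs exactly one device read.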
Waiting in infinite loop. Should not drain system resources if device is busy.
- All loops are bounded.
- Use callbacks instead of waiting on resources.
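
A bounded wait can be sketched as below. `wait_ready`, the `ready_fn` type and the fake device are hypothetical; the iteration limit is passed in because a sensible bound is device specific.

```c
#include <stdbool.h>

/* Hypothetical predicate: asks the device whether it is ready. */
typedef bool (*ready_fn)(void *hw);

/*
 * Poll at most `limit` times. Returns 0 once the device reports ready,
 * or -1 if the budget runs out, so a busy or dead device cannot hang
 * the caller.
 */
static int wait_ready(ready_fn is_ready, void *hw, int limit)
{
    for (int i = 0; i < limit; i++) {
        if (is_ready(hw))
            return 0;
        /* a real driver would delay or yield here instead of spinning */
    }
    return -1;
}

/* Fake device: becomes ready after the counter reaches zero. */
static bool ready_after(void *hw)
{
    int *countdown = hw;
    if (*countdown == 0)
        return true;
    (*countdown)--;
    return false;
}
```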

Streams can be dismantled - Check existence of streams before each use.
Corruption of device data - Perform integrity checks using checksums and CRCs to check received data. Tests are usually device specific. Some tests already in place - e.g. 
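
As a generic illustration of an integrity check (not one of the existing tests the notes refer to), the sketch below verifies a received payload against a CRC-32 using the common reflected polynomial 0xEDB88320. The choice of CRC-32 and the `frame_ok` helper are assumptions; a real device defines its own checksum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, init and final XOR ~0). */
static uint32_t crc32_calc(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical helper: accept the payload only if its CRC matches. */
static bool frame_ok(const uint8_t *payload, size_t len, uint32_t expected)
{
    return crc32_calc(payload, len) == expected;
}
```

The bitwise form is slow but short; a driver on a hot path would more likely use a table-driven or hardware-assisted version.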
DMA isolation - Should not perform incorrect DMA and corrupt memory (device/system).
Handling stuck interrupts - Avoid looping on interrupt status bits
Driver should return interrupt unclaimed if interrupt is not legitimate
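
The two points on stuck interrupts can be sketched as below. The return codes, the pending mask and the `drv_isr` signature are hypothetical; a real handler would use the names its operating system defines.

```c
#include <stdint.h>

enum irq_ret { IRQ_UNCLAIMED = 0, IRQ_CLAIMED = 1 };

/* Hypothetical: the four interrupt causes this driver knows about. */
#define INT_PENDING_MASK 0x0Fu

/*
 * `status` is the interrupt status word, read from the device once.
 * `handled` counts serviced causes (it stands in for real work).
 */
static enum irq_ret drv_isr(uint32_t status, unsigned *handled)
{
    uint32_t pending = status & INT_PENDING_MASK;

    /* All ones suggests a missing device; zero means it is not ours. */
    if (status == UINT32_MAX || pending == 0)
        return IRQ_UNCLAIMED;

    /* Service each pending cause once; never loop re-reading status. */
    for (unsigned bit = 0; bit < 4; bit++) {
        if (pending & (1u << bit))
            (*handled)++;
    }
    return IRQ_CLAIMED;
}
```

Because the status word is examined once and the loop runs a fixed four times, a status bit that is stuck on cannot trap the handler.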

Device Failure - Driver must free up resources if there is a device failure. Also, close all minor devices and detach driver instances.