Issue | Programmer Fix (as mentioned) | Our Possible Fix |
Handling error conditions (most systems give a blue screen) | Handle the error condition and perform online recovery | Shadow Drivers |
Halting the system (calling panic(), BUG_ON, ASSERT) | Don't halt; recover and then error out | Shadow Drivers |
Handle all possible hardware fault codes | Handle hardware fault codes so that the thread can move forward gracefully when an error occurs; use common error routines to handle hardware errors. | Check return values of hardware functions and recover as required. |
Proactively check for hardware failure often | 1. Sanity check hardware responses -- Check for a -1 return value (all bits set, 111111) and halt driver progress immediately. -- Check status against valid possibilities (driver specific). -- Check data patterns in DMA areas to confirm that DMA has occurred (?). 2. Time transactions (I/O, state changes, etc.). 3. Check for liveness of hardware (using some no-op). | - Check return values of hardware functions and recover as required. |
Simple error recovery | Use simple, fast error recovery suited to each situation | |
Report all errors | Report all errors for complete diagnosis. Always log error messages to the event log. | |
Hide recoverable faults from user applications | Handle recovery and errors as transparently as possible. The driver should attempt to reset the device and re-issue the request. Out-of-memory conditions and resource depletion should be handled gracefully. The driver should fail in such a way that higher-level system software can recover. | |
Ensure asserts are tested in the free build. Assert code is normally put only in the debug build. | Put asserts in regular code | |
Excessive spinning on hardware | Don't spin on hardware unless critical. Never block or hold interrupts for a long time. Run operations in parallel. Keep the driver resource hierarchy simple. | |
Drivers should be designed for non-modal operation (not expecting things to happen in a certain order) | Driver logic should be event driven, with no assumptions made about the ordering of events | |
Minimize device usage | - The number of devices is large (redundant hardware). - Hot add/remove (the presence of a device cannot always be guaranteed). - Minimize: -- Size of PCI BARs (ioremap) -- Common communication buffers. - The driver may not get unloaded when hardware is removed. | |
Input Verification and Exiting | - Minimize exit paths to centralize error handling and recovery (use structured exception handling on Windows). - All input to driver routines must be verified. - Minimize complex nesting (to avoid subtle bugs). | |
Serviceability | Able to diagnose an issue without impacting the availability of the fault-tolerant system. - Ability to enable diagnosis on a running system. - Minimal impact on normal performance (when diagnosis is not required). - Diagnosis tools must produce clear results and should also be used during the development stage. - Diagnosis should be capable of being performed remotely. - Trace output should be categorized and of variable verbosity. | |
Testing (Comprehensive Testing Required) | Perform: load and stress testing, failure testing, power failure testing, Plug and Play testing, duration testing |
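Several rows above (sanity checking hardware responses, not halting, resetting the device and re-issuing the request) can be illustrated with a small user-space sketch. Everything named here is an assumption made for illustration: the status bit layout, the `hw_result` codes, the `issue_with_recovery` helper and the simulated `fake_dev` device are hypothetical and do not come from any real driver API. A real driver would read the status through its platform's register accessors and would also log each failure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status-register layout, for illustration only. */
#define STATUS_READY      0x01u
#define STATUS_ERROR      0x02u
#define STATUS_VALID_MASK (STATUS_READY | STATUS_ERROR)
#define REG_ALL_ONES      0xFFFFFFFFu /* the "-1" pattern often read from absent hardware */

enum hw_result { HW_OK = 0, HW_RETRY, HW_GONE, HW_BAD_STATUS };

/* Classify one raw status read without halting the system. */
static enum hw_result classify_status(uint32_t raw)
{
    if (raw == REG_ALL_ONES)
        return HW_GONE;          /* stop driver progress immediately */
    if (raw & ~STATUS_VALID_MASK)
        return HW_BAD_STATUS;    /* bits outside the valid set */
    if (raw & STATUS_ERROR)
        return HW_RETRY;         /* device reported a recoverable error */
    return HW_OK;
}

typedef uint32_t (*read_status_fn)(void *ctx);
typedef void (*reset_fn)(void *ctx);

/*
 * Check the status up to max_attempts times, resetting the device after
 * each recoverable failure. Returns the last classification instead of
 * panicking, so the caller can error out gracefully. If max_attempts is
 * not positive, no read is made and HW_RETRY is returned.
 */
static enum hw_result issue_with_recovery(read_status_fn read_status,
                                          reset_fn reset, void *ctx,
                                          int max_attempts)
{
    enum hw_result r = HW_RETRY;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        r = classify_status(read_status(ctx));
        if (r == HW_OK || r == HW_GONE)
            return r;            /* success, or nothing left to reset */
        reset(ctx);
    }
    return r;
}

/* Simulated device, used only to exercise the sketch. */
struct fake_dev {
    const uint32_t *script; /* status values returned by successive reads */
    int len;
    int pos;
    int resets;
};

static uint32_t fake_read_status(void *ctx)
{
    struct fake_dev *d = ctx;

    assert(d->len > 0);      /* scaffolding check: script must not be empty */
    uint32_t v = d->script[d->pos];
    if (d->pos < d->len - 1) /* after the script ends, repeat the last value */
        d->pos++;
    return v;
}

static void fake_reset(void *ctx)
{
    struct fake_dev *d = ctx;
    d->resets++;
}
```

With a script of `{STATUS_ERROR, STATUS_READY}`, `issue_with_recovery` resets the simulated device once and then reports `HW_OK`; with a script that only returns all ones it reports `HW_GONE` without attempting a reset.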
Issue | Programmer Fix (as mentioned) | Our Possible Fix |
Any data read from the device may be corrupt (PIO or DMA), especially pointers, memory offsets and array indexes. Also check packet lengths, status words and channel IDs. | Avoid releasing bad data to the system. Range check packet lengths against the size of the buffer, check status bits for impossible values, and check channel IDs for improbable values. | |
Data from device is only available once | No re-reading from the device. Data should be stored in the device's state. Re-reading can trigger undesirable events. | |
Waiting in an infinite loop. The driver should not drain system resources if the device is busy. | - All loops are bounded. - Use callbacks instead of waiting on resources. | |
Streams can be dismantled | Check the existence of streams before each use. | |
Corruption of device data | Perform integrity checks on received data using checksums and CRCs. Tests are usually device specific. Some tests already in place - e.g. | |
DMA isolation | Should not perform incorrect DMA and corrupt memory (device/system). | |
Handling stuck interrupts | Avoid looping on interrupt status bits. The driver should return the interrupt as unclaimed if it is not legitimate. | |
Device Failure | Driver must free up resources if there is a device failure. Also, close all minor devices and detach driver instances. |
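The checks in this table (range checking data read from the device, bounding every loop, and leaving illegitimate interrupts unclaimed) might look roughly like the following user-space sketch. The descriptor layout, `MAX_CHANNELS`, `IRQ_PENDING_MASK` and all function names are hypothetical, chosen only to make the example self-contained. A real driver would usually bound its polling by elapsed time, with a delay between polls, rather than by a bare iteration count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHANNELS     8u    /* hypothetical device limit */
#define IRQ_PENDING_MASK 0x0Fu /* hypothetical pending-cause bits */

/* Hypothetical receive descriptor as read from the device (PIO or DMA). */
struct rx_desc {
    uint32_t length;  /* packet length claimed by the device */
    uint8_t  channel; /* channel ID claimed by the device */
};

/* Reject descriptors whose fields cannot be valid before using them. */
static bool rx_desc_is_sane(const struct rx_desc *d, size_t buf_size)
{
    if (d->length == 0 || d->length > buf_size)
        return false; /* would read past the receive buffer */
    if (d->channel >= MAX_CHANNELS)
        return false; /* impossible channel ID */
    return true;
}

typedef bool (*ready_fn)(void *ctx);

/* Poll at most max_polls times instead of waiting forever. */
static bool wait_ready_bounded(ready_fn is_ready, void *ctx, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (is_ready(ctx))
            return true;
    }
    return false; /* caller treats this as a device failure */
}

enum irq_result { IRQ_UNCLAIMED = 0, IRQ_CLAIMED = 1 };

/*
 * Inspect the interrupt status once. An all-ones read or a status with
 * no pending cause is not a legitimate interrupt for this device, so it
 * is left unclaimed rather than looped on.
 */
static enum irq_result handle_irq(uint32_t irq_status)
{
    if (irq_status == 0xFFFFFFFFu || (irq_status & IRQ_PENDING_MASK) == 0)
        return IRQ_UNCLAIMED;
    return IRQ_CLAIMED;
}

/* Simulated busy device, used only to exercise the sketch. */
struct fake_busy {
    int polls_until_ready; /* number of polls that report "not ready" */
    int polls_seen;
};

static bool fake_is_ready(void *ctx)
{
    struct fake_busy *b = ctx;

    assert(b != NULL); /* scaffolding check only */
    b->polls_seen++;
    return b->polls_seen > b->polls_until_ready;
}
```

A descriptor that claims 256 bytes against a 128-byte buffer is rejected by `rx_desc_is_sane`, and `wait_ready_bounded` gives up after `max_polls` attempts when the simulated device never becomes ready.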