Augmenting Frame-Based Vision With Temporal Context
Matthew Dutson
cs.wisc.edu/~dutson/defense.sozi.html

Questions?
Thank you!

Outline
1. Frame-Based Vision
2. Bandwidth
3. Compute
4. Stability and Robustness

Why frames?
• Dataset variety and scale
• Training resource requirements
• Flexibility (image or video)

Augmenting Frame-Based Vision
Can we use temporal context to improve vision systems, while retaining the advantages of frame-based processing?

Analog Imaging
Limited options for storage and processing

Digital Imaging
Greatly increased flexibility

Conventional Video Codecs

Single-Photon Sensors
+ No read noise
+ High speed
+ Good low-light performance
+ High dynamic range
− High noise in each frame (breaks codecs)
− Torrent of data

Conventional Event Cameras
[Diagram labels: Threshold-based detector, Analog comparator, Brightness]
+ Low bandwidth
+ Low latency
+ High temporal resolution
− No intensity information
− Limited compatibility with downstream tasks

Generalized Event Cameras
[Diagram labels: Spatiotemporal support, Noise-aware thresholding, Integrator]

Implications
• Corner detection
• Pose estimation
• Object detection
• Segmentation

From Sensing to Processing
[Panels: First frame, Second frame]

Convolutional Networks
[Diagram labels: Linear, Activation, Event neuron, Output value, Input vector, Difference, Accumulator, Best estimate, Policy, Delta, Value]

Vision Transformers
[Diagram labels: Query-key product, Layer norm, Linear, Add, MLP, Transpose, Softmax, Attention-value product]
[Legend: Existing Transformer operation, Buffer module, Gate module, Delta gate module, Row-sparse (gated tokens), Column-sparse, Matrix multiplication order]

Results
• Conventional: 35.8 Gops/frame
• Event: 2.06 Gops/frame

Versatility
• Object detection
• Enhancement
• Optical flow
• Pose estimation

Frame-Wise Prediction

Instability

Non-Robustness
Segmentation (DeepLabv3+) under impulse noise

Goal
Adapt frame-based networks for stability and robustness

Oracle Bound
Within the oracle bound, zero error is the global loss minimizer

Collapse Bound
Within the collapse bound, a static prediction is the global loss minimizer

[Diagram legend: Interpolation, Stabilization, Existing layer, Controller layer. Diagram labels: Resize, Fuse, Stabilizer 1]

Adding a Controller

Controller
Use context to dynamically adjust the decay

Spatial Fusion
[Diagram labels: Softmax, Neighborhood, Controller, Stabilized, Current]

Denoising
[Panels: Input, Before/after]

Image Enhancement
[Panels: Input, Before/after]

Oversmoothing

Depth Estimation
[Panels: Input, Before/after]

Segmentation
[Panels: Input, Before/after]

See final slide for image and video attributions

Threads
• Bandwidth
• Compute
• Stability & Robustness

• Generalized Event Cameras. CVPR 2024. Varun Sundar*, Matthew Dutson*, Andrei Ardelean, Claudio Bruschini, Edoardo Charbon, and Mohit Gupta
• Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks. NeurIPS 2025. Matthew Dutson, Nathan Labiosa, Yin Li, and Mohit Gupta
• Event Neural Networks. ECCV 2022. Matthew Dutson, Yin Li, and Mohit Gupta
• Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers. ICCV 2023. Matthew Dutson, Yin Li, and Mohit Gupta

Themes
• Event-based processing
• Exploiting temporal redundancy
• Augmenting existing networks

Frame-Based Readout
• 1 MP SPAD at 100 kHz: 100 Gbps
• USB 3.1: 10 Gbps
• 1 TB SSD: ~40 Gbps, 80 s
• 8 TB HDD: ~1 Gbps, 640 s
• 32 GB RAM: ~200 Gbps, 2.56 s

Low-Speed Capture
30 FPS, 240 Mbps for 1 MP

Conventional Events
290 Mbps for 1 MP

Generalized Events Reconstruction
3000 FPS, 495 Mbps for 1 MP

Low Light
3000 FPS, 315 Mbps for 1 MP

On-Chip Implementation

Denoising (NAFNet)

No Stability Penalty

Oracle State

Collapse State

Proposed Loss

A Simple Starting Point

Design Goals
• Don't modify the original architecture
• Inject stability adapters
• Use an online (streaming) approach

Attributions
https://commons.wikimedia.org/wiki/File:Movie_film_35_soundtrack.jpg
https://commons.wikimedia.org/wiki/File:Matrixw.jpg
https://openaccess.thecvf.com/content_ICCV_2017/papers/Galoogahi_Need_for_Speed_ICCV_2017_paper.pdf
https://commons.wikimedia.org/wiki/File:Bilateral_Filter.jpg

Defining Stability

Defining Robustness
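
The Frame-Based Readout and Low-Speed Capture figures can be checked with a short back-of-envelope calculation. The sketch below is not from the slides: it assumes 1 bit per pixel for each SPAD frame, 8 bits per pixel for the 30 FPS capture, and decimal units (1 TB = 10^12 bytes). Under those assumptions, the times listed next to each storage medium appear to be the time needed to fill it at the raw 100 Gbps sensor rate. The helper names `data_rate_bps` and `fill_time_s` are hypothetical.

```python
# Back-of-envelope check of the readout numbers.
# Assumptions (not stated on the slides): 1 bit/pixel per SPAD frame,
# 8 bits/pixel for low-speed capture, decimal units (1 GB = 10**9 bytes).

MEGAPIXEL = 1_000_000


def data_rate_bps(pixels: int, frames_per_second: int, bits_per_pixel: int) -> int:
    """Raw, uncompressed data rate in bits per second."""
    return pixels * frames_per_second * bits_per_pixel


def fill_time_s(capacity_bytes: int, rate_bps: int) -> float:
    """Seconds until a medium of the given capacity is full at rate_bps."""
    return capacity_bytes * 8 / rate_bps


spad_rate = data_rate_bps(MEGAPIXEL, 100_000, 1)  # 1 MP SPAD at 100 kHz
low_speed_rate = data_rate_bps(MEGAPIXEL, 30, 8)  # 1 MP at 30 FPS

capacities_bytes = {
    "32 GB RAM": 32 * 10**9,
    "1 TB SSD": 1 * 10**12,
    "8 TB HDD": 8 * 10**12,
}

print(f"SPAD readout: {spad_rate / 1e9:.0f} Gbps")
print(f"Low-speed capture: {low_speed_rate / 1e6:.0f} Mbps")
for name, capacity in capacities_bytes.items():
    print(f"{name}: full after {fill_time_s(capacity, spad_rate):.2f} s")
```

The listed write bandwidths (~40 Gbps, ~1 Gbps, ~200 Gbps) do not enter the fill-time calculation. Of the media listed, only RAM has a write rate above 100 Gbps, which is one way to read the "torrent of data" point.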
  1. Title
  2. Outline
  3. Analog Imaging
  4. Digital Imaging
  5. Why frames?
  6. Augmenting Frame-Based Vision
  7. Threads
  8. Themes
  9. Outline
  10. Conventional Video Codecs
  11. Single-Photon Sensors
  12. Single-Photon Sensors
  13. Frame-Based Readout
  14. Conventional Event Cameras
  15. Low-Speed Capture
  16. Conventional Events
  17. Conventional Event Cameras
  18. Generalized Event Cameras
  19. High-Speed Reconstruction
  20. Low-Light Reconstruction
  21. On-Chip Implementation
  22. Implications
  23. Outline
  24. From Sensing to Processing
  25. From Sensing to Processing
  26. Convolutional Networks
  27. Vision Transformers
  28. CNN Results
  29. Versatility
  30. Outline
  31. Frame-Wise Prediction
  32. Instability
  33. Non-Robustness
  34. Goal
  35. Defining Stability
  36. Defining Robustness
  37. Proposed Loss
  38. Oversmoothing
  39. Oracle Bound
  40. Collapse Bound
  41. No Stability Penalty
  42. Oracle State
  43. Collapse State
  44. Design Goals
  45. A Simple Starting Point
  46. Adding a Controller
  47. Controller
  48. Spatial Fusion
  49. Denoising
  50. Denoising
  51. Image Enhancement
  52. Depth Estimation
  53. Segmentation
  54. Questions
  55. Attributions