Augmenting Frame-Based Vision With Temporal Context
Matthew Dutson
cs.wisc.edu/~dutson/defense.sozi.html
Questions?
Thank you!
Outline
1. Frame-Based Vision
2. Bandwidth
3. Compute
4. Stability and Robustness
Why frames?
• Dataset variety and scale
• Training resource requirements
• Flexibility (image or video)
Augmenting Frame-Based Vision
Can we use temporal context to improve vision systems, while retaining the advantages of frame-based processing?
Analog Imaging
Limited options for storage and processing
Digital Imaging
Greatly increased flexibility
Conventional Video Codecs
Single-Photon Sensors
+ No read noise
+ High speed
+ Good low-light performance
+ High dynamic range
− High noise in each frame (breaks codecs)
− Torrent of data
Conventional Event Cameras
Diagram: threshold-based detector built from an analog comparator on the brightness signal (a code sketch follows the list below)
+ Low bandwidth
+ Low latency
+ High temporal resolution
− No intensity information
− Limited compatibility with downstream tasks
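For intuition, here is a minimal per-pixel sketch of the threshold-based detector (Python). The log-brightness reference, threshold value, and polarity convention are illustrative assumptions, not the parameters of any particular sensor.

    import math

    def conventional_events(brightness_stream, threshold=0.2):
        """Emit (time, polarity) events whenever log brightness moves more than
        `threshold` away from the level at the last event. Illustrative model."""
        events = []
        reference = None
        for t, brightness in brightness_stream:   # brightness > 0
            log_b = math.log(brightness)
            if reference is None:
                reference = log_b
                continue
            if abs(log_b - reference) > threshold:
                events.append((t, 1 if log_b > reference else -1))
                reference = log_b                 # comparator resets to the new level
        return events

Note that each event carries only a timestamp and a polarity, which is why the list above marks "no intensity information" as a drawback.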
Generalized Event Cameras
Spatiotemporal support
Noise-aware thresholding
Integrator
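The labels above (integrator, noise-aware thresholding, spatiotemporal support) suggest an integrate-and-threshold readout on raw photon counts. The sketch below shows one way such a rule could look, using a Poisson-noise margin over a purely temporal support; the test statistic, bootstrap window, and parameter k are assumptions for illustration and are not taken from the paper.

    def generalized_events(photon_counts, k=3.0):
        """Integrate per-pixel photon counts and emit an event (carrying an
        intensity estimate) only when the accumulated count is inconsistent,
        beyond k Poisson standard deviations, with the rate implied by the
        previous event. Illustrative sketch only."""
        events = []
        rate = None                 # photons per frame implied by the last event
        acc, n = 0, 0               # count and frames integrated since the last event
        for t, count in enumerate(photon_counts):
            acc += count
            n += 1
            if rate is None:        # bootstrap: first event after a fixed window
                if n >= 16:
                    rate = acc / n
                    events.append((t, rate))
                    acc, n = 0, 0
                continue
            expected = rate * n
            sigma = max(expected, 1.0) ** 0.5
            if abs(acc - expected) > k * sigma:
                rate = acc / n      # event carries the new intensity estimate
                events.append((t, rate))
                acc, n = 0, 0
        return events

Unlike the conventional scheme, the emitted events carry intensity estimates, which keeps them compatible with standard downstream tasks.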
Implications
Corner detection
Pose estimation
Object detection
Segmentation
Attributions
From Sensing to Processing
Figure: first frame vs. second frame
Convolutional Networks
Diagram: event neuron wrapping a linear layer and activation. The neuron keeps a best-estimate input vector and an output accumulator; each new input value is compared against the stored estimate, a policy decides whether the difference is significant, and only those deltas are transmitted and accumulated into the output value.
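A minimal sketch of the event-neuron idea for a linear layer, assuming NumPy: keep a stored input ("best estimate") and an output accumulator, and propagate only the deltas that a simple threshold policy deems significant. The thresholding policy and the handling of activations are simplified placeholders here, not the exact design from the paper.

    import numpy as np

    class EventLinear:
        """Linear layer that updates its output incrementally from input deltas."""
        def __init__(self, weight, bias, threshold=1e-2):
            self.weight, self.bias, self.threshold = weight, bias, threshold
            self.x_ref = None       # best estimate of the input seen so far
            self.y_acc = None       # accumulator holding the current output value

        def __call__(self, x):
            if self.x_ref is None:                    # first frame: dense compute
                self.x_ref = x.copy()
                self.y_acc = x @ self.weight.T + self.bias
                return self.y_acc
            delta = x - self.x_ref                    # difference from the stored input
            delta = np.where(np.abs(delta) > self.threshold, delta, 0.0)   # policy
            self.x_ref += delta                       # update the best estimate
            self.y_acc += delta @ self.weight.T       # incremental update of the output
            return self.y_acc

On redundant video most entries of `delta` are zero; an actual implementation skips those entries, which is where the per-frame compute savings come from.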
Vision Transformers
Diagram: existing Transformer operations (layer norm, linear projections, transpose, query-key product, softmax, attention-value product, residual adds, MLP) augmented with buffer, gate, and delta gate modules. Gated tokens make the large matrix products row-sparse or column-sparse, depending on the matrix multiplication order.
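A minimal sketch of a token gate with its buffer, assuming NumPy: tokens whose change since their last transmitted value exceeds a threshold are selected, the buffer is refreshed for those tokens, and downstream operations run only on the selected subset. The error metric and threshold policy here are illustrative.

    import numpy as np

    class TokenGate:
        """Select only the tokens that changed enough since they were last processed."""
        def __init__(self, threshold=0.5):
            self.threshold = threshold
            self.buffer = None                        # reference copy of every token

        def __call__(self, tokens):                   # tokens: (num_tokens, dim)
            if self.buffer is None:                   # first frame: all tokens pass
                self.buffer = tokens.copy()
                return tokens, np.arange(tokens.shape[0])
            error = np.linalg.norm(tokens - self.buffer, axis=1)
            idx = np.flatnonzero(error > self.threshold)
            self.buffer[idx] = tokens[idx]            # refresh the buffer for gated tokens
            return tokens[idx], idx                   # downstream ops touch only these rows

The index set returned by the gate is what makes the query-key and attention-value products sparse in rows or columns, depending on which operand the gated tokens enter.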
Results
Conventional: 35.8 Gops/frame
Event: 2.06 Gops/frame
Versatility
Object detection
Enhancement
Optical flow
Pose estimation
Frame-Wise Prediction
Instability
Non-Robustness
Segmentation (DeepLabv3+) under impulse noise
Goal
Adapt frame-based networks for stability and robustness
Oracle Bound
Within the oracle bound, zero error is the global loss minimizer
Collapse Bound
Within the collapse bound, a static prediction is the global loss minimizer
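One way to make the two bounds concrete: suppose the objective adds a temporal-stability penalty with weight \lambda to the per-frame task loss (the notation below is illustrative and not necessarily the exact formulation used here):

    \mathcal{L} \;=\; \sum_t \ell_{\text{task}}\!\left(\hat{y}_t,\, y_t\right)
                \;+\; \lambda \sum_t \left\lVert \hat{y}_t - \hat{y}_{t-1} \right\rVert

Under this form, the oracle bound is the regime of \lambda (relative to how quickly the ground truth moves) in which predicting \hat{y}_t = y_t exactly still minimizes \mathcal{L}; the collapse bound is the regime in which a static prediction minimizes \mathcal{L}, since the stability term then vanishes.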
Diagram: stabilizer modules (Stabilizer 1, ...) injected between the existing layers, with resize and fuse operations connecting them. Legend: interpolation, stabilization, existing layer, controller layer.
Adding a Controller
Controller
Use context to dynamically adjust the decay
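A minimal sketch of the interpolation-plus-controller idea, assuming NumPy: the stabilizer blends the previous stabilized features with the current ones using a decay alpha, and a small controller chooses alpha per element from context (here, simply the magnitude of the frame-to-frame change). The controller's inputs and form are placeholders, not the adapter design from the paper.

    import numpy as np

    class Stabilizer:
        """Exponential blending with a decay chosen per-element by a controller."""
        def __init__(self, controller):
            self.controller = controller
            self.state = None                         # previous stabilized features

        def __call__(self, features):
            if self.state is None:
                self.state = features.copy()
                return self.state
            alpha = self.controller(features, self.state)     # decay in [0, 1]
            self.state = alpha * self.state + (1.0 - alpha) * features
            return self.state

    def simple_controller(current, previous, scale=5.0):
        """Toy controller: trust the current frame more where it changed a lot."""
        change = np.abs(current - previous)
        return 1.0 / (1.0 + scale * change)           # large change -> small decay

Setting the decay to a constant recovers the simple starting point: a fixed exponential moving average over the features.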
Spatial Fusion
Diagram: the controller produces softmax weights over a spatial neighborhood, fusing the stabilized and current features.
Denoising: input vs. before/after (two examples)
Image Enhancement: input vs. before/after
Oversmoothing
Depth Estimation: input vs. before/after
Segmentation: input vs. before/after
See final slide for image and video attributions
Threads
Bandwidth
Compute
Stability & Robustness
Generalized Event Cameras
CVPR 2024
Varun Sundar*, Matthew Dutson*,
Andrei Ardelean, Claudio Bruschini,
Edoardo Charbon, and Mohit Gupta
Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks
NeurIPS 2025
Matthew Dutson, Nathan Labiosa,
Yin Li, and Mohit Gupta
Event Neural Networks
ECCV 2022
Matthew Dutson, Yin Li, and
Mohit Gupta
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
ICCV 2023
Matthew Dutson, Yin Li, and
Mohit Gupta
Themes
• Event-based processing
• Exploiting temporal redundancy
• Augmenting existing networks
Frame-Based Readout
1 MP SPAD at 100 kHz → 100 Gbps of raw data
USB 3.1: 10 Gbps
1 TB SSD: ~40 Gbps (filled in 80 s at 100 Gbps)
8 TB HDD: ~1 Gbps (filled in 640 s)
32 GB RAM: ~200 Gbps (filled in 2.56 s)
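A quick back-of-the-envelope check of these numbers (Python), assuming one bit per pixel per binary SPAD frame:

    pixels = 1e6                    # 1 MP sensor
    frame_rate = 100e3              # 100 kHz binary frames
    bits_per_pixel = 1              # assumption: 1-bit SPAD frames
    rate_bps = pixels * frame_rate * bits_per_pixel
    print(rate_bps / 1e9, "Gbps")   # 100.0

    for name, capacity_bytes in [("1 TB SSD", 1e12),
                                 ("8 TB HDD", 8e12),
                                 ("32 GB RAM", 32e9)]:
        seconds = capacity_bytes * 8 / rate_bps       # time to fill at 100 Gbps
        print(f"{name}: {seconds:.2f} s")             # 80 s, 640 s, 2.56 s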
Low-Speed Capture
30 FPS
240 Mbps for 1 MP
Conventional Events
290 Mbps for 1 MP
Generalized Events Reconstruction
3000 FPS
495 Mbps for 1 MP
Low Light
3000 FPS
315 Mbps for 1 MP
On-Chip Implementation
Denoising (NAFNet)
No Stability Penalty
Oracle State
Collapse State
Proposed Loss
A Simple Starting Point
Design Goals
Don't modify the original architecture
Inject stability adapters
Use an online (streaming) approach
https://commons.wikimedia.org/wiki/File:Movie_film_35_soundtrack.jpg
https://commons.wikimedia.org/wiki/File:Matrixw.jpg
https://openaccess.thecvf.com/content_ICCV_2017/papers/Galoogahi_Need_for_Speed_ICCV_2017_paper.pdf
https://commons.wikimedia.org/wiki/File:Bilateral_Filter.jpg
Defining Stability
Defining Robustness