From Label Maps to Label Strokes:
Image Parsing from Sparsely Labeled Training Data

Shengqi Zhu    Yiqing Yang    Li Zhang
University of Wisconsin – Madison
Abstract
This paper proposes a novel image parsing framework that solves the semantic pixel labeling problem using only label strokes. The input to our framework is a set of training images, each of which has zero, one, or multiple semantic label strokes; the output is a trained image parser, which can be applied to new test images without any user interaction. Our framework is based on a network of voters, each of which aggregates both a self voting vector and a neighborhood context. The voters are parameterized using sparse convex coding. We learn the parameters by minimizing a regularized energy function that propagates label information through the (labeled and unlabeled) training data while taking into account context interactions. A backward composition algorithm is proposed for efficient gradient computation. Our framework is capable of handling label strokes and scales to a code book of millions of bases. Our experimental results show the effectiveness of the framework on both synthetic examples and real-world datasets.
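To make the voting idea concrete, here is a minimal toy sketch of one aggregation step: a pixel blends its own label-score vector with the mean of its neighbors' scores. All names (`self_vote`, `neighbor_votes`, `alpha`) are illustrative assumptions, not from the paper; the actual voters are parameterized via sparse convex coding over a learned code book and trained by minimizing the regularized energy described above.

```python
import math

def softmax(scores):
    """Turn raw label scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate_vote(self_vote, neighbor_votes, alpha=0.6):
    """Blend a pixel's own label scores with its neighbors' mean scores.

    alpha weights self evidence against neighborhood context; this mixing
    weight is a hypothetical stand-in for the learned voter parameters.
    """
    k = len(self_vote)
    context = [sum(v[c] for v in neighbor_votes) / len(neighbor_votes)
               for c in range(k)]
    combined = [alpha * s + (1.0 - alpha) * c
                for s, c in zip(self_vote, context)]
    return softmax(combined)

# Example: 3 classes; two neighbors reinforce the pixel's weak preference
# for class 0, so the aggregated vote picks class 0.
probs = aggregate_vote([1.0, 0.2, 0.1],
                       [[0.9, 0.3, 0.0], [1.1, 0.1, 0.2]])
label = probs.index(max(probs))  # -> 0
```

In the full framework this aggregation happens over a network of voters and the parameters are learned jointly, with gradients computed efficiently by the backward composition algorithm.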
Publication Shengqi Zhu, Yiqing Yang, Li Zhang. From Label Maps to Label Strokes: Semantic Segmentation for Street Scenes from Incomplete Training Data, IEEE Workshop on Computer Vision for Converging Perspectives, December 2013. [PDF 1.2 MB][Slides 27 MB]
Acknowledgements This work is supported in part by NSF EFRI-BSBA-0937847, NSF IIS-0845916, IIS-0916441, a Sloan Research Fellowship, a Packard Fellowship for Science and Engineering, and a gift donation from Adobe Systems Inc.
Supplementary Video
Both training and testing treat video frames as independent images; temporal inter-frame motion is not used in the current work so that our method can be applied to single images.
Per-class accuracy under label stroke missing rate k (not all regions need to have a stroke):

             k=0%               k=20%              k=40%              k=60%
           Train   Test       Train   Test       Train   Test       Train   Test
Building    88%    87%         88%    87%         87%    86%         87%    87%
Car         53%    28%         51%    28%         57%    28%         52%    27%
Road        95%    95%         95%    95%         95%    96%         94%    95%
Sidewalk    55%    35%         53%    33%         60%    34%         57%    36%
Sky         94%    88%         94%    88%         95%    89%         95%    90%
Tree        68%    56%         66%    55%         67%    56%         62%    52%
Overall     84%    75%         83%    75%         84%    75%         83%    75%
Future Work
Small objects are not yet handled well and need further research. We hope that this framework will make it easier to collect training data for processing "big" visual data.