Tell Me What You See and I will Show You Where It Is

Jia Xu¹      Alexander G. Schwing²      Raquel Urtasun²,³

¹University of Wisconsin-Madison         ²University of Toronto           ³TTI Chicago



We tackle the problem of weakly labeled semantic segmentation, where the only source of annotation is a set of image tags encoding which classes are present in the scene. This is an extremely difficult problem, as no pixel-wise labelings are available, not even at training time. In this paper, we show that this problem can be formalized as an instance of learning in a latent structured prediction framework, where the graphical model encodes the presence and absence of each class as well as the assignment of semantic labels to superpixels. As a consequence, we are able to leverage standard algorithms with good theoretical properties. We demonstrate the effectiveness of our approach on the challenging SIFT-flow dataset and show an improvement of 7% in average per-class accuracy over the state of the art.
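To make the coupling between image tags and superpixel labels concrete, the following is a minimal sketch and not the paper's model or code. It assumes a simplified reading of the abstract: a class tagged as absent may not be assigned to any superpixel. Pairwise terms, learning, and the latent-variable inference are all omitted. The function names `label_superpixels` and `implied_tags`, and the score values, are hypothetical.

```python
def label_superpixels(scores, present):
    """Pick, for each superpixel, the best-scoring class among the tagged ones.

    scores:  list of per-superpixel lists, scores[j][c] = score of class c
             for superpixel j (hypothetical unary scores).
    present: list of booleans, present[c] is True if class c is tagged.
    """
    allowed = [c for c, flag in enumerate(present) if flag]
    if not allowed:
        raise ValueError("at least one class must be tagged as present")
    # Assumption: absent classes are simply excluded from the argmax.
    return [max(allowed, key=lambda c: row[c]) for row in scores]


def implied_tags(labels, num_classes):
    """Recover the presence/absence vector implied by a superpixel labeling."""
    used = set(labels)
    return [c in used for c in range(num_classes)]


# Three superpixels, three classes; class 1 is tagged as absent.
scores = [[0.9, 0.2, 0.1],
          [0.1, 0.8, 0.3],
          [0.2, 0.7, 0.6]]
present = [True, False, True]
labels = label_superpixels(scores, present)
print(labels)                   # [0, 2, 2]
print(implied_tags(labels, 3))  # [True, False, True]
```

Note that this sketch enforces only one direction of the consistency: it never uses an absent class, but it does not guarantee that every tagged class is assigned to at least one superpixel.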


  • Jia Xu, Alexander G. Schwing, Raquel Urtasun. Tell Me What You See and I will Show You Where It Is. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. PDF, BibTeX.

  • Source code

    Email Jia for the link.


    We thank Sanja Fidler and Vikas Singh for helpful discussions. This work was partially funded by NSF RI 1116584 and ONR-N00014-13-1-0721.