Computer Sciences Dept.

Text-to-Picture Synthesis Project


One challenge in artificial intelligence is to enable natural interactions between people and computers via multiple modalities. It is often desirable to convert information between modalities; a familiar example is the conversion between text and speech via speech synthesis and speech recognition. Such conversion between other modalities, however, remains rare. In particular, relatively little research has considered transforming general text into pictorial representations.

This project will develop general-purpose Text-to-Picture synthesis algorithms that automatically generate pictures from natural language sentences so that the picture conveys the main meaning of the text. Unlike prior systems that require hand-crafted narrative descriptions of a scene, our algorithms will generate static or animated pictures that represent important objects, spatial relations, and actions for general text. Key components include extracting important information from text, generating corresponding images for each piece of information, composing the images into a coherent picture, and evaluation. Our approach uses statistical machine learning and draws ideas from automatic machine translation, text summarization, text-to-speech synthesis, computer vision, and graphics.
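
To make these components concrete, here is a minimal Python sketch of the three stages: extracting key phrases from a sentence, retrieving a picture for each phrase, and composing the pictures into a layout. The stopword list, the IMAGE_INDEX lookup table, and the left-to-right compose() rule are illustrative placeholders, not the project's actual models.

    # Minimal sketch of a text-to-picture pipeline; the keyword rule, image
    # index, and layout rule below are placeholders for learned components.
    STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "on"}

    # Hypothetical local index mapping words to image files.
    IMAGE_INDEX = {"girl": "girl.png", "kicked": "kick_icon.png", "ball": "ball.png"}

    def extract_keywords(sentence, k=3):
        """Pick up to k salient words (here: simply the first k non-stopwords)."""
        words = [w.strip(".,").lower() for w in sentence.split()]
        return [w for w in words if w not in STOPWORDS][:k]

    def retrieve_image(word):
        """Look up a picture for a word; fall back to rendering the word as text."""
        return IMAGE_INDEX.get(word, f"<text:{word}>")

    def compose(images, width=600, icon=150):
        """Place images left to right, a stand-in for the learned layout step."""
        gap = (width - icon * len(images)) // (len(images) + 1)
        return [(img, gap + i * (icon + gap)) for i, img in enumerate(images)]

    sentence = "The girl kicked the ball."
    picture = compose([retrieve_image(w) for w in extract_keywords(sentence)])
    print(picture)  # [('girl.png', 37), ('kick_icon.png', 224), ('ball.png', 411)]

In the full system, each of these placeholders is replaced by a statistical component learned from data, as described above.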

Text-to-picture synthesis is likely to have a number of important impacts. First, it has the potential to improve literacy across a range of groups, including children who need additional support in learning to read and adults who are learning a second language. Second, it may serve as an assistive communication tool for people with disabilities such as dyslexia or brain injury, and as a universal language when communicating simultaneously with many people who speak different languages. Third, it can serve as a summarization tool for rapidly browsing long text documents. Our research will foster collaboration between researchers in computer science and other disciplines, including psychology and education.

Findings

  • It is possible to use a "cut-and-paste" approach to produce understandable pictures for general English, although the level of detail is limited [AAAI 2007].

  • A more structured "ABC" layout can better convey the subject, action, and object in simple sentences. We formalized layout optimization as a sequence labeling problem, with conditional random fields as the machine learning method [CoNLL 2008]; a toy sketch of this formulation appears after this list.

  • Verbs are difficult to visualize. One possibility is to use hand-drawn action icons and to link each verb to its constituents as predicted by semantic role labeling [NIPS 2009 Symposium].
    Example: "The girl called the king a frog."

  • How do people interpret a picture? We conducted a large-scale image disambiguation user study. Image ambiguity is a major challenge for pictorial systems: a single picture of a red pepper will most likely evoke the concept 'pepper' rather than the intended 'red pepper' in human viewers, whereas three different pictures of red peppers robustly evoke 'red pepper'. This behavior can be modeled by Bayesian inference over a concept hierarchy [UW CS TR 2011]; a toy worked example appears after this list.
    Now you see cheese.

    Now you see Swiss cheese.


  • Text-to-picture can be used to improve reading comprehension in young children. Emerging readers must learn to map words and phrases onto their remembered experiences. Having children manipulate toys that correspond to what they are reading supports this mapping, but it introduces its own problem for the classroom, namely providing a classroom full of children with manipulatives. In collaboration with Art Glenberg (Arizona State University), we demonstrated that having first- and second-grade children manipulate images of toys on a computer screen benefits their comprehension as much as physical manipulation of the toys does [Instructional Science 2009].
    Children manipulate the icons in the picture while reading the story.

  • Text-to-picture can improve early math learning as well [Scientific Studies of Reading 2011].

  • Our iPad text-to-picture app has been used in therapy for people with communicative disorders.
    A University of Wisconsin Speech and Hearing Clinic client (right) using a text-to-picture iPad app. Photo courtesy of Jamie Murray-Branch, 2011.
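
The "ABC" layout finding above can be illustrated with a toy sequence labeling example: each token, described by its semantic role, is tagged with a layout slot A (left, subject), B (center, action), or C (right, object). The sketch below uses the third-party sklearn-crfsuite package rather than the project's own CRF code, and the features and training sequences are invented for illustration.

    # Toy "ABC" layout as sequence labeling with a CRF (requires sklearn-crfsuite).
    import sklearn_crfsuite

    def features(role, position):
        # Each token is described by an (assumed) semantic role and its position.
        return {"role": role, "position": str(position)}

    # Training sequences: semantic roles per token -> layout slots A/B/C.
    train_roles = [["ARG0", "VERB", "ARG1"], ["ARG0", "VERB", "ARG1"]]
    train_slots = [["A", "B", "C"], ["A", "B", "C"]]
    X = [[features(r, i) for i, r in enumerate(seq)] for seq in train_roles]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, train_slots)

    # Predict layout slots for a new sentence's semantic roles.
    test = [[features(r, i) for i, r in enumerate(["ARG0", "VERB", "ARG1"])]]
    print(crf.predict(test))  # e.g. [['A', 'B', 'C']]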

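The image disambiguation finding can likewise be worked through with a toy Bayesian calculation over two concepts: the broad 'pepper' gets a higher prior than the specific 'red pepper', but each independent red-pepper picture is more likely under 'red pepper'. The priors and likelihoods below are invented numbers, not estimates from the study, and the real model operates over a full concept hierarchy.

    # Toy Bayesian account of why several pictures disambiguate a concept.
    priors = {"pepper": 0.8, "red pepper": 0.2}      # broad concept is a priori more likely
    likelihood = {"pepper": 0.3, "red pepper": 0.9}  # chance a concept is shown as a red-pepper picture

    def posterior(n_images):
        """Posterior over concepts after n independent red-pepper pictures."""
        scores = {c: priors[c] * likelihood[c] ** n_images for c in priors}
        z = sum(scores.values())
        return {c: round(s / z, 2) for c, s in scores.items()}

    print(posterior(1))  # {'pepper': 0.57, 'red pepper': 0.43} -- 'pepper' still wins
    print(posterior(3))  # {'pepper': 0.13, 'red pepper': 0.87} -- 'red pepper' dominates
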
Publications


  • Michael Maynord, Jitrapon Tiachunpun, Xiaojin Zhu, Charles R. Dyer, Kwang-Sung Jun, and Jake Rosin. An Image-To-Speech iPad App. Department of Computer Sciences Technical Report TR1774, University of Wisconsin-Madison. 2012. [link]

  • Arthur Glenberg, Jonathan Willford, Bryan Gibson, Andrew Goldberg, and Xiaojin Zhu. Improving reading to improve math. Scientific Studies of Reading, 2011.

  • Jake Rosin, Andrew Goldberg, Xiaojin Zhu, and Charles Dyer. A Bayesian model for image sense ambiguity in pictorial communication systems. Department of Computer Sciences Technical Report, University of Wisconsin-Madison. 2011.

  • Raman Arora, Charles R. Dyer, Yu Hen Hu, and Nigel Boston. Distributed curve matching in camera networks using projective joint invariant signatures. Proc. 4th ACM/IEEE International Conference on Distributed Smart Cameras. Atlanta, GA, 2010.

  • Andrew B. Goldberg, Jake Rosin, Xiaojin Zhu, and Charles R. Dyer. Toward Text-to-Picture Synthesis. In NIPS 2009 Symposium on Assistive Machine Learning for People with Disabilities, 2009.
    [pdf]

  • Arthur Glenberg, Andrew B. Goldberg, and Xiaojin Zhu. Improving early reading comprehension using embodied CAI. Instructional Science, 2009.
    [link]

  • Andrew B. Goldberg, Xiaojin Zhu, Charles R. Dyer, Mohamed Eldawy, and Lijie Heng. Easy as ABC? Facilitating pictorial communication via semantically enhanced layout. In Twelfth Conference on Computational Natural Language Learning (CoNLL), 2008.
    If you have pictures for individual words in a sentence, how do you compose them to best convey the meaning of the sentence? We learn an "ABC" layout using semantic role labeling and conditional random fields, and conduct a user study. [pdf]

  • Xiaojin Zhu, Andrew Goldberg, Mohamed Eldawy, Charles Dyer, and Bradley Strock. A text-to-picture synthesis system for augmenting communication. In The Integrated Intelligence Track of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), 2007.
    Synthesizing a picture from general, unrestricted natural language text to convey the gist of the text. This is an overview and proof-of-concept paper. [pdf]

Faculty

Graduate Students

Undergraduate Students

  • Benjamin Burchfield
  • Valerie Lo
  • Molly Maloney
  • Michael Maynord
  • Mia Mueller
  • Peter Ney
  • Bradley Strock
  • Jitrapon Tiachunpun
  • Nicholas Wharton
  • Steve Yazicioglu

Collaborators

  • Arthur Glenberg, Department of Psychology, Arizona State University.
  • Jamie Murray-Branch, Department of Communicative Disorders, University of Wisconsin-Madison.
  • Julie Gamradt, Department of Communicative Disorders, University of Wisconsin-Madison.
  • Katie Hustad, Department of Communicative Disorders, University of Wisconsin-Madison.
  • Josh Tenenbaum, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology.
  • Suman Banerjee, Department of Computer Sciences, University of Wisconsin-Madison.


This project is based upon work supported by the National Science Foundation under Grant No. IIS-0711887, and by the Wisconsin Alumni Research Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

 