ICML 2007 Tutorial,
Corvallis, OR, USA
Xiaojin Zhu, University
of Wisconsin, Madison
Why can we learn from unlabeled data for supervised learning tasks? Do
unlabeled data always help? What are the popular semi-supervised learning
methods, and how do they work? How do they relate to each other? What are
the research trends? In this tutorial we address these questions. We will
examine state-of-the-art methods, including generative models, multiview
learning (e.g., co-training), graph-based learning (e.g., manifold regularization),
transductive SVMs and so on. We also offer some advice for practitioners.
Finally we discuss the connection between semi-supervised machine learning
and natural learning. The emphasis of the tutorial is on the intuition
behind each method and the assumptions it needs.
(preliminary, subject to change)
Introduction to semi-supervised learning (15min)
What is semi-supervised learning and transductive learning? Why can
we ever learn a classifier from unlabeled data? Does unlabeled data
always help? Which semi-supervised learning methods are out there?
Which one should I use? Answers to these questions set the stage
for a detailed look at individual algorithms.
Semi-supervised learning algorithms
In fact, we will focus on classification algorithms that use both
labeled and unlabeled data. Several families of algorithms will be
discussed, each of which makes different model assumptions:
Self-training (10 min)
Probably the earliest semi-supervised learning method. Still extensively
used in the natural language processing community.
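As a rough illustration of the usual self-training loop (train on the labeled data, label the unlabeled points the classifier is most confident about, add them to the training set, repeat), here is a minimal sketch. The 1-D nearest-centroid classifier, the margin-based confidence measure, and the names self_train, min_margin and max_rounds are assumptions made for this example, not part of the tutorial material.

```python
def centroid_fit(xs, ys):
    """Return the mean of each class (a 1-D nearest-centroid classifier)."""
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members)
    return means


def centroid_predict(means, x):
    """Return (label, margin). Assumes at least two classes.

    The margin is how much closer x is to the nearest centroid than to the
    second nearest; it serves as a crude confidence score.
    """
    ranked = sorted(means, key=lambda label: abs(x - means[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - means[runner_up]) - abs(x - means[best])
    return best, margin


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0, max_rounds=10):
    """Repeatedly move confidently predicted points into the training set."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        means = centroid_fit(xs, ys)
        confident, rest = [], []
        for x in pool:
            label, margin = centroid_predict(means, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:
            break  # nothing left that the classifier is sure about
        for x, label in confident:
            xs.append(x)   # the predicted label is treated as if it were true
            ys.append(label)
        pool = [x for x, _ in rest]
    return centroid_fit(xs, ys), pool
```

Note that a mistaken early prediction is fed back as a training label, which is the main risk of this family of methods.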
Generative models (30 min)
Mixtures of Gaussian or multinomial distributions, Hidden Markov Models,
and pretty much any generative model can be used for semi-supervised learning.
We will also look into the EM algorithm, which is often used for training
generative models when there is unlabeled data.
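To make the role of EM concrete, here is a minimal sketch for a two-class mixture of 1-D Gaussians. Labeled points keep their given class, while unlabeled points receive soft class memberships in the E-step. The function name semi_supervised_em, the variance floor min_var, and the fixed iteration count are assumptions for this example; a practical implementation would work in the log domain and monitor the likelihood.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def semi_supervised_em(labeled, unlabeled, n_iter=50, min_var=1e-3):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.

    Returns {class: (prior, mean, variance)}.
    """
    classes = [0, 1]
    # Initialise from the labeled data alone (the supervised estimate).
    params = {}
    for c in classes:
        xs = [x for x, y in labeled if y == c]
        mean = sum(xs) / len(xs)
        var = max(sum((x - mean) ** 2 for x in xs) / len(xs), min_var)
        params[c] = (len(xs) / len(labeled), mean, var)

    points = [x for x, _ in labeled] + list(unlabeled)
    n_total = len(points)
    for _ in range(n_iter):
        # E-step: posterior class probabilities for the unlabeled points.
        resp = []
        for x in unlabeled:
            joint = [params[c][0] * gaussian_pdf(x, params[c][1], params[c][2])
                     for c in classes]
            total = sum(joint)
            if total == 0.0:
                resp.append([0.5, 0.5])  # numerical underflow: fall back to uniform
            else:
                resp.append([j / total for j in joint])
        # M-step: weighted estimates; labeled points carry weight 0 or 1.
        new_params = {}
        for c in classes:
            weights = [1.0 if y == c else 0.0 for _, y in labeled]
            weights += [r[c] for r in resp]
            mass = sum(weights)
            mean = sum(w * x for w, x in zip(weights, points)) / mass
            var = sum(w * (x - mean) ** 2 for w, x in zip(weights, points)) / mass
            new_params[c] = (mass / n_total, mean, max(var, min_var))
        params = new_params
    return params
```

If the mixture assumption is wrong for the data at hand, the unlabeled points can pull the estimates away from a good classifier, which is one reason unlabeled data does not always help.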
S3VMs (40 min)
Originally called Transductive SVMs, they are now called Semi-Supervised
SVMs to emphasize the fact that they are capable of induction too, not
just transduction. The idea is simple and elegant: find a decision
boundary in 'low density' regions. However, the optimization problem
behind it is difficult, so we will discuss the various optimization
techniques for S3VMs, including the one used in SVM-light, the Convex-Concave
Procedure (CCCP), Branch-and-Bound, the continuation method, etc.
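For reference, one common way to write the linear S3VM objective is sketched below. The notation is an assumption for this outline: l labeled points (x_i, y_i) with y_i in {-1, +1}, u unlabeled points, and trade-off weights C and C^{*}.

```latex
\[
\min_{w,\,b}\;\; \frac{1}{2}\lVert w \rVert^{2}
  + C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i f(x_i)\bigr)
  + C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j) \rvert\bigr),
\qquad f(x) = w^{\top} x + b .
\]
```

The last term penalizes unlabeled points that fall inside the margin, which pushes the boundary toward low density regions. Because that term is non-convex in w, the problem is hard to optimize, and in practice a class-balance constraint is often added to rule out the trivial solution that assigns every unlabeled point to one class.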
Graph-based methods (30 min)
Here one constructs a graph over the labeled and unlabeled examples, and
assumes that two strongly connected examples tend to have the same label.
The graph Laplacian matrix is a central quantity. We will discuss
representative algorithms, including manifold regularization.
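As a small illustration of the graph assumption, the sketch below builds the (unnormalized) graph Laplacian L = D - W and runs a simple iterative label propagation in which every unlabeled node repeatedly takes the weighted average of its neighbors. This is a harmonic-function style method chosen for brevity, not manifold regularization itself, and the function names and the dense list-of-lists weight matrix are assumptions for this example.

```python
def graph_laplacian(weights):
    """Unnormalized Laplacian L = D - W for a symmetric weight matrix."""
    n = len(weights)
    return [[(sum(weights[i]) if i == j else 0.0) - weights[i][j]
             for j in range(n)]
            for i in range(n)]


def propagate_labels(weights, labels, n_iter=200):
    """weights: symmetric n x n matrix; labels: dict node -> 0.0 or 1.0.

    Labeled nodes stay clamped; each unlabeled node is set to the weighted
    average of its neighbors. Assumes every unlabeled node has an edge.
    """
    n = len(weights)
    f = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(n_iter):
        for i in range(n):
            if i in labels:
                continue
            degree = sum(weights[i])
            f[i] = sum(weights[i][j] * f[j] for j in range(n)) / degree
    return f


# A chain of five nodes with unit edge weights; the two ends are labeled.
chain = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(5)] for i in range(5)]
scores = propagate_labels(chain, {0: 1.0, 4: 0.0})
predicted = [1 if s > 0.5 else 0 for s in scores]
```

On this chain the scores fall off smoothly from the node labeled 1 to the node labeled 0, so nodes closer to a labeled example inherit its label.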
Multiview learning (15 min)
Exemplified by the Co-Training algorithm, these methods employ multiple
'views' of the same problem, and require that different views produce similar
predictions.
Other approaches (10 min)
Metric-based model selection, tree-based learning, information-based methods,
Related problems (10 min)
Regression with unlabeled data, clustering with side information, classification
with positive and unlabeled data, dimensionality reduction with side information,
inferring the label missing mechanism, etc.
Semi-supervised learning in nature (60 min)
Long before computers came around and machine learning became a discipline,
learning occurred in nature. Is semi-supervised learning part
of it? Research in this area has just begun. We will look
at a few case studies, ranging from infant word learning to the human visual
system and human categorization behavior.
Challenges for the future (20 min)
There are many open questions. What new algorithms can we design, and what
new assumptions can we make? How can we efficiently perform semi-supervised learning
for very large problems? What special methods are needed for structured
output domains? Can we find a way to guarantee that unlabeled data
would not decrease performance? What can we borrow from natural learning?
We suggest these as a few potential research directions.
WHO SHOULD ATTEND
Researchers who want an intuitive overview of the field and want to get up to speed
with the latest research directions, and practitioners who wish to take
advantage of unlabeled data in addition to labeled data to build better
machine learning systems.
ABOUT THE INSTRUCTOR
Xiaojin Zhu is an Assistant Professor in Computer Sciences at the University
of Wisconsin, Madison. His research interests are statistical machine
learning (in particular semi-supervised learning) and its applications
to natural language analysis. He received a Ph.D. in Language
Technologies from CMU in 2005, with thesis research on graph-based semi-supervised
learning. His current research projects aim at bridging the different approaches
in semi-supervised learning, and making them more effective for practitioners.
He has taught several graduate and undergraduate courses in AI, machine
learning and NLP at the University of Wisconsin, Madison.