Latent Dirichlet Allocation with Topic-in-Set Knowledge

Figure: soft set z-label

Overview

This software implements an extension of LDA [2] which incorporates "topic-in-set knowledge", or z-labels [1]. These allow the user to supply (possibly noisy) labels, or sets of allowed labels, for the latent topic assignments z of individual words. Inference is done with collapsed Gibbs sampling [3]. This code can also be used to run "standard" LDA, similar to [3] (see example code).

The model is implemented as a Python C extension module.
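Conceptually, a z-label enters the collapsed Gibbs sampler as a simple reweighting of the usual sampling distribution: candidate topics inside a word's label set are favored with strength eta, while topics outside it are down-weighted by (1 - eta). The snippet below is a minimal illustrative sketch of that idea in pure Python; the actual module is implemented in C, and the function and count-array names here are my own, not part of its API.

import numpy as np

def zlabel_weight(topic, allowed, eta):
    """Soft z-label weight: eta for topics in the allowed set,
    (1 - eta) for topics outside it; unconstrained positions get weight 1."""
    if allowed is None:
        return 1.0
    return eta if topic in allowed else 1.0 - eta

def sample_topic(n_dt, n_wt, d, w, allowed, eta, alpha, beta, rng):
    """One collapsed Gibbs draw for a single word position, assuming this
    word's current assignment has already been removed from the counts.

    n_dt : (D, T) document-topic counts
    n_wt : (W, T) word-topic counts
    """
    T = n_dt.shape[1]
    W = n_wt.shape[0]
    # standard collapsed Gibbs conditional for LDA ...
    p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta)
    # ... multiplied by the soft z-label weight of each candidate topic
    p = p * np.array([zlabel_weight(t, allowed, eta) for t in range(T)])
    p = p / p.sum()
    return rng.choice(T, p=p)

# e.g. a word whose z-label restricts it to topics {0, 2}, with eta = 0.95
rng = np.random.default_rng(0)
n_dt = np.ones((2, 4))
n_wt = np.ones((6, 4))
print(sample_topic(n_dt, n_wt, d=0, w=3, allowed=[0, 2], eta=0.95,
                   alpha=0.1, beta=0.1, rng=rng))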

Code

zlabelLDA.tgz

Requirements

To build and install the module, you will need a working C compiler and a Python installation with NumPy. See README.txt for further details.

Example usage

from numpy import *
from zlabelLDA import zlabelLDA

# model hyperparameters: T topics, vocabulary of W word types
(T,W) = (4,6)
alpha = .1 * ones((1,T)) # Dirichlet prior on document-topic distributions
beta = .1 * ones((T,W))  # Dirichlet prior on topic-word distributions

# corpus of documents (each document is a list of word indices in 0..W-1)
docs = [[1,1,2],
        [1,1,1,1,2],
        [3,3,3,3,5,5,5],
        [3,3,3,3,4,4,4],
        [0,0,0,0,0],
        [0,0,0,0]]

# z-label strength
eta = .95 # confidence in our labels

# z-labels: one entry per word position; an entry is only treated as a
# z-label when it is a list of allowed topics, otherwise it is ignored
zs = [[0,0,0],
      [0,0,0,0,0],
      [[0],[0],0,0,0,0,0],  # first two words of doc 2 constrained to topic 0
      [[1],[1],0,0,0,0,0],  # first two words of doc 3 constrained to topic 1
      [0,0,0,0,0],
      [0,0,0,0]]

# set number of samples, random number generator seed
(numsamp,randseed) = (100,194582)

# Do inference to estimate topics
(phi,theta,sample) = zlabelLDA(docs,zs,eta,alpha,beta,numsamp,randseed)
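To run "standard" LDA on the same corpus, one can presumably supply z-labels that are never lists, so that every word position is unconstrained; the call below reuses the arguments defined above.

# "standard" LDA: no constraints, since non-list z-label entries are ignored
zs_plain = [[0]*len(doc) for doc in docs]
(phi,theta,sample) = zlabelLDA(docs,zs_plain,eta,alpha,beta,numsamp,randseed)

Here phi and theta are presumably the estimated topic-word and document-topic distributions, and sample is the final topic assignment drawn by the Gibbs chain.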

Questions/Comments/Bugs

Open up your Python interpreter and e-mail me at:
'@'.join(['andrzeje','.'.join(['cs','wisc','edu'])])

References

[1] Latent Dirichlet Allocation with Topic-in-Set Knowledge
Andrzejewski, D. and Zhu, X.
NAACL 2009 Workshop on Semi-supervised Learning for NLP (NAACL-SSLNLP 2009)

[2] Latent Dirichlet Allocation
Blei, D. M., Ng, A. Y., and Jordan, M. I.
Journal of Machine Learning Research (JMLR), 3, Mar. 2003, 993-1022.

[3] Finding Scientific Topics
Griffiths, T., and Steyvers, M.
Proceedings of the National Academy of Sciences (PNAS), 101, 2004, 5228-5235.