Topic Modeling with the Dirichlet Forest Prior

[Figure: Must-Link example]

Overview

This software implements the Dirichlet Forest (DF) Prior [1] within the LDA model for discrete count data. When combined with LDA [2], the Dirichlet Forest Prior allows the user to encode domain knowledge (must-links and cannot-links between words) into the prior on topic-word multinomials. The inference method is collapsed Gibbs sampling [3]. This code can also be used to do "standard" LDA, equivalent to [3], by applying no domain knowledge or by setting the "strength" parameter eta to 1; a sketch of this appears at the end of the example usage below.

The code implements DF-LDA as a Python C++ extension module.

Code

DF-LDA.tgz

Requirements

To build and install the module, see README.txt for the list of requirements and further details.

Example usage

import DirichletForest as DF
from numpy import *

# Model parameters (see paper for meanings)
(alpha,beta,eta) = (1, .01, 100)

# Number of topics, size of vocab
(T,W) = (2,3)

# Vocabulary
vocab = ['apple','banana','motorcycle']

# Read docs
docs = DF.readDocs('example.docs')

# Build DF, apply constraints 
df = DF.DirichletForest(alpha,beta,eta,T,W,vocab)

# Must-Link between apple and banana
df.merge('apple','banana')

# Cannot-Link between apple and motorcycle
df.split('apple','motorcycle')

# Do inference on docs
(numsamp, randseed) = (50, 821945)
df.inference(docs,numsamp,randseed)

# Output results
print 'Top 3 words from learned topics'
df.printTopics(N=3)
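
As mentioned in the Overview, the same code can be used to run "standard" LDA [3] by simply applying no must-link or cannot-link constraints (or by setting eta to 1). A minimal sketch of this, reusing the API and the example.docs file from the example above:

import DirichletForest as DF
from numpy import *

# Same hyperparameters as above; with no constraints applied,
# eta has no effect (here it is also set to 1, per the Overview)
(alpha,beta,eta) = (1, .01, 1)

# Number of topics, size of vocab
(T,W) = (2,3)

# Vocabulary
vocab = ['apple','banana','motorcycle']

# Read docs
docs = DF.readDocs('example.docs')

# Build DF, but apply *no* merge()/split() constraints, so the
# prior reduces to a symmetric Dirichlet and the model is standard LDA [3]
df = DF.DirichletForest(alpha,beta,eta,T,W,vocab)

# Do inference on docs
(numsamp, randseed) = (50, 821945)
df.inference(docs,numsamp,randseed)

# Output results
print 'Top 3 words from learned topics (standard LDA)'
df.printTopics(N=3)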

Questions/Comments/Bugs

Open up your Python interpreter and e-mail me at:
'@'.join(['andrzeje','.'.join(['cs','wisc','edu'])])

References

[1] Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors
Andrzejewski, D., Zhu, X., and Craven, M.
Proceedings of the 26th International Conference on Machine Learning (ICML 2009)

[2] Latent Dirichlet Allocation
Blei, D. M., Ng, A. Y., and Jordan, M. I.
Journal of Machine Learning Research (JMLR), 3, Mar. 2003, 993-1022.

[3] Finding Scientific Topics
Griffiths, T., and Steyvers, M.
Proceedings of the National Academy of Sciences (PNAS), 101 (suppl. 1), 2004, 5228-5235.