Topic Modeling with the Dirichlet Forest Prior

[Figure: Must-Link example]

This software implements the Dirichlet Forest (DF) prior [1] within the LDA model [2] for discrete count data. Combined with LDA, the DF prior lets the user encode domain knowledge (must-links and cannot-links between words) into the prior on the topic-word multinomials. Inference is done via collapsed Gibbs sampling [3]. The code can also run "standard" LDA, either by supplying no domain knowledge or by setting the "strength" parameter eta to 1; in either case the model is equivalent to [3].
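
For orientation, the collapsed Gibbs sampler of [3] resamples each token's topic assignment z_i conditioned on all of the others. In the notation of [3] (T topics, W vocabulary words, matching the variables used in the example below), the update is

P(z_i = j | z_{-i}, w) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}

where n_{-i,j}^{(w_i)} is the number of times word w_i is assigned to topic j (excluding token i) and n_{-i,j}^{(d_i)} is the number of topic-j assignments in document d_i. Roughly speaking, the DF prior replaces the first (topic-word) factor with a term defined over the sampled Dirichlet-tree structure, which is where the must-link and cannot-link constraints enter; see [1] for the exact form.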

The code implements DF-LDA as a Python C++ extension module.




To build and install the module, see README.txt for the required dependencies and detailed instructions.
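
Once the build succeeds, a minimal smoke test (nothing package-specific, just an import check) confirms the extension is on your Python path:

# Smoke test: this only succeeds if the compiled extension module
# was built and installed correctly
import DirichletForest as DF
print(DF.__file__)  # location of the installed module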

Example usage

import DirichletForest as DF
from numpy import *

# Model parameters (see paper for meanings)
(alpha,beta,eta) = (1, .01, 100)

# Number of topics, size of vocab
(T,W) = (2,3)

# Vocabulary
vocab = ['apple','banana','motorcycle']

# Read docs from a corpus file (supply your own filename)
docs = DF.readDocs('')

# Build DF, apply constraints 
df = DF.DirichletForest(alpha,beta,eta,T,W,vocab)

# Must-Link between apple and banana
df.merge('apple','banana')

# Cannot-Link between apple and motorcycle
df.split('apple','motorcycle')

# Do inference on docs
(numsamp, randseed) = (50, 821945)
df.inference(docs,numsamp,randseed)

# Output results
print('Top 3 words from learned topics')
df.printTopics(N=3)
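
As noted above, "standard" LDA is a special case. A minimal sketch (assuming the same DirichletForest API as the example above) simply skips the merge/split constraint calls and sets eta to 1:

import DirichletForest as DF

# Same setup, but eta = 1 neutralizes the constraint "strength"
(alpha, beta, eta) = (1, .01, 1)
(T, W) = (2, 3)
vocab = ['apple','banana','motorcycle']

docs = DF.readDocs('')  # supply your own corpus filename

# No merge()/split() calls, so the prior reduces to a symmetric
# Dirichlet and sampling is equivalent to standard LDA [3]
df = DF.DirichletForest(alpha, beta, eta, T, W, vocab)
df.inference(docs, 50, 821945)
df.printTopics(N=3)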


To try it out, open up your Python interpreter and run the example above. Questions, comments, and bug reports can be e-mailed to me at:


[1] Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors
Andrzejewski, D., Zhu, X., and Craven, M.
Proceedings of the 26th International Conference on Machine Learning (ICML 2009)

[2] Latent Dirichlet Allocation
Blei, D. M., Ng, A. Y., and Jordan, M. I.
Journal of Machine Learning Research (JMLR), 3, Mar. 2003, 993-1022.

[3] Finding Scientific Topics
Griffiths, T., and Steyvers, M.
Proceedings of the National Academy of Sciences (PNAS), 101 (Suppl. 1), Apr. 2004, 5228-5235.