This code supports the models used in

K. Noto and M. Craven, Learning Regulatory Network Models that
Represent Regulator States and Roles. To appear in Lecture Notes in
Bioinformatics (see file noto.lnbi2003.pdf).

Please refer to this paper.


**************************************************************
HOW TO RUN THE grn EXECUTABLE:

grn -r ROLES-FILE -c CONDITIONS-FILE -x ASSAY-FILE
    [-t TESTSET-ASSAY-FILE] [-d DOT-OUTPUT-FILE]

(use ``grn --help'' to see a list of options/argument formats)

(See FILE-FORMATS for a description of each file's format)

ROLES-FILE: contains information about which regulators/regulatees
should be included in the network, plus how each regulator is
PREDICTED (or known) to act on each regulatee.  These roles
(e.g. Activator or Repressor) will affect how the CPDs for regulatee
nodes are initialized.

CONDITIONS-FILE: list of conditions to use in the network (and all
their discrete possible values)

ASSAY-FILE: an array-by-{condition,gene} matrix that fixes the order
of experimental conditions, genes, and arrays, and gives the value of
each condition and the expression of each gene in each array.

DOT-OUTPUT-FILE: After training, if a filename is provided, grn will
print "dot" markup to the given file which will indicate the network
structure and CPDs for each hidden and regulatee node.


**************************************************************
OUTPUT OF THE PROGRAM:

grn will echo input files and options to standard ERROR, as well as
progress messages (if the verbose option is set to true), and will
write only TEST SET RESULTS to standard OUTPUT (and only if a test set
assay file is provided, of course).
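
Because results and diagnostics go to different streams, a wrapper
script can capture them separately.  The following is a minimal
sketch, not part of this distribution: the helper names
build_grn_command and run_grn and the file names in the usage note are
hypothetical, and only the -r, -c, -x, -t and -d options described
above are assumed.

```python
import subprocess


def build_grn_command(roles, conditions, assay, testset=None, dot=None, exe="grn"):
    """Build the argument list for grn from the options described above."""
    cmd = [exe, "-r", roles, "-c", conditions, "-x", assay]
    if testset is not None:
        cmd += ["-t", testset]   # optional test set assay file
    if dot is not None:
        cmd += ["-d", dot]       # optional "dot" output file
    return cmd


def run_grn(cmd):
    """Run grn and return (standard output, standard error) as text."""
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Standard output holds only the test set results;
    # standard error holds the echoed inputs and progress messages.
    return done.stdout, done.stderr
```

For example, build_grn_command("roles.txt", "conditions.txt",
"train.txt", testset="test.txt") yields the argument list for a run
with a test set; passing it to run_grn requires the grn executable to
be on the search path.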

The evaluation of the network works as follows.  For each array in the
TEST SET, the trained network is given the REGULATOR expression and
the experimental conditions' values.  The network is then queried for
each regulatee node, yielding a probability distribution over the
expression states of the regulatee (note that regulatees with only
one state are removed from the network).  The following measurements
are taken:

1)	Given the actual test set expression, we calculate the
probability distribution over the expression states for
the regulatee, and we measure the accuracy of our
prediction (the probability distribution query result)
as the dot product of these distributions.  The ERROR
is measured as one minus this dot product.

2)	Given our predicted probability distribution over states,
which are represented as Gaussians in a mixture, we can
predict the ACTUAL EXPRESSION level as the mean of the
most likely Gaussian(s).  We measure the difference between
the actual expression level and each Gaussian's mean (weighted
by its likelihood in our prediction) and square the result.
This is the SQUARED ERROR.

3)	We calculate the probability of each test set regulatee 
expression level, given the network and the test set data
for the experimental conditions and the regulator expression.
We keep the total product of all these probabilities in
the (natural) log scale.
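
The three measurements can be illustrated for a single (array,
regulatee) pair.  This is a sketch under stated assumptions, not the
code used by grn: the function names are hypothetical, the
distribution over states given the actual expression is taken to be
the mixture posterior, and the squared error is read as the squared
difference between the actual expression and the prediction-weighted
average of the Gaussian means.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def state_posterior(x, weights, means, sds):
    """Distribution over states given the actual expression x (assumed form)."""
    dens = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    return [d / total for d in dens]


def score_example(x, predicted, weights, means, sds):
    """Return (error, squared error, log probability) for one example.

    predicted is the network's distribution over the regulatee's states.
    """
    actual = state_posterior(x, weights, means, sds)
    # 1) ERROR: one minus the dot product of the two distributions.
    error = 1.0 - sum(p * a for p, a in zip(predicted, actual))
    # 2) SQUARED ERROR: actual level vs. prediction-weighted mean (assumed reading).
    expected = sum(p * m for p, m in zip(predicted, means))
    squared_error = (x - expected) ** 2
    # 3) Natural log of the density of x under the predicted mixture.
    density = sum(p * gaussian_pdf(x, m, s)
                  for p, m, s in zip(predicted, means, sds))
    return error, squared_error, math.log(density)
```

A confident, correct prediction gives an error near zero: with states
centered at 0 and 10 (standard deviation 1 each), an actual expression
of 0, and a predicted distribution of [1.0, 0.0], both the error and
the squared error are essentially zero.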

The output of the grn executable, then, is four numbers:

1)	The number of examples (test set arrays X regulatees)
2)	The total ERROR (incorrect predictions)
3)	The SUM SQUARED ERROR
4)	The log probability of the test regulatee expression, 
given the trained network, test regulator expression,
and test experimental conditions' values.
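
As an illustration of how per-example measurements roll up into these
four numbers, here is a small sketch.  The function name summarize and
the tuple layout (error, squared error, log probability) are
assumptions made for the example, not the program's internal
representation.

```python
def summarize(scores):
    """Combine per-example (error, squared_error, log_prob) tuples.

    Returns (number of examples, total error, sum squared error,
    total log probability).  Adding log probabilities corresponds to
    multiplying the probabilities themselves.
    """
    scores = list(scores)
    return (
        len(scores),
        sum(s[0] for s in scores),
        sum(s[1] for s in scores),
        sum(s[2] for s in scores),
    )
```

With 10 test set arrays and 20 regulatees, the list passed to
summarize would hold 200 tuples, one per (array, regulatee) pair.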
		


**************************************************************
CREATING ASSAY FILES:

The program caf combines an experimental-conditions data file and a
gene-expression data file (I use a transposed version of the output
from Irizarry's RMA algorithm from bioconductor.org).  It also limits
the file's size by including only genes mentioned in a ``roles'' file
as either a regulator or regulatee.  Note that the names of the arrays
must be identical in the conditions data file and the expression data
file.
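
The combination step can be pictured as a join on array name.  The
sketch below works on in-memory dictionaries rather than on the actual
file formats (see FILE-FORMATS); the function name combine_assay and
the data layout are hypothetical and chosen only for illustration.

```python
def combine_assay(conditions, expression, roles_genes):
    """Join condition values and gene expression by array name.

    conditions:  {array name: {condition name: value}}
    expression:  {array name: {gene name: expression level}}
    roles_genes: set of genes named as a regulator or regulatee
    """
    if set(conditions) != set(expression):
        mismatched = sorted(set(conditions) ^ set(expression))
        raise ValueError("array names differ: " + ", ".join(mismatched))
    combined = {}
    for array, values in conditions.items():
        row = dict(values)
        # Keep only the genes that appear in the roles file.
        row.update({gene: level
                    for gene, level in expression[array].items()
                    if gene in roles_genes})
        combined[array] = row
    return combined
```

Raising an error on mismatched array names mirrors the requirement
that the names be identical in both input files.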


**************************************************************
CREATING STATES FILES:

The program genstates reads expression data and generates a Gaussian
mixture model for each gene from these expression values.  The
algorithm is summarized below:

------------------------------------------------------------------
for cross-validation folds 1 to 5:
	
	hold aside new TESTING expression values 
		(rest are TRAINING expression values)
	calculate the mean/variance of the TRAINING data, 
		these are the parameters of a 1-Gaussian mixture.
	use an E-M algorithm with a few random restarts to 
		estimate the parameters for a 2-Gaussian and a
		3-Gaussian mixture

	evaluate the probability of the TESTING data on 
		each of the Gaussian mixture models

(Now we have some probabilities for the held-aside data,
	given the number of Gaussians in a Gaussian mixture model)

tentatively select the 1-Gaussian mixture.

IF the probability of a 2-Gaussian mixture is higher and a 
	t-test indicates the difference is statistically 
	significant, select the 2-Gaussian mixture

IF the probability of a 3-Gaussian mixture is higher and a
	t-test indicates the difference is statistically 
	significant, select the 3-Gaussian mixture
------------------------------------------------------------------
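
The summary above can be sketched for a single gene as follows.  This
is an illustration, not the genstates source: it assumes a paired,
one-sided t-test at the 5% level across the five folds, compares each
larger mixture against the currently selected one, and applies only a
floor on the standard deviation (MIN_SD, a made-up value) in place of
the full mixture constraints.  All function names are hypothetical.

```python
import math
import random

MIN_SD = 0.1        # hypothetical floor on each Gaussian's standard deviation
T_CRITICAL = 2.132  # one-sided 5% critical t value for 4 d.o.f. (5 folds only)


def pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_likelihood(data, model):
    """Total log density of data under a list of (weight, mean, sd)."""
    return sum(math.log(max(sum(w * pdf(x, m, s) for w, m, s in model), 1e-300))
               for x in data)


def fit_single(data):
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return [(1.0, mean, max(math.sqrt(var), MIN_SD))]


def fit_em(data, k, rng, restarts=3, iterations=50):
    """Fit a k-Gaussian mixture by EM; keep the best of a few restarts."""
    best, best_ll = None, -math.inf
    start_sd = fit_single(data)[0][2]
    for _ in range(restarts):
        model = [(1.0 / k, m, start_sd) for m in rng.sample(data, k)]
        for _ in range(iterations):
            resp = []  # E-step: responsibility of each Gaussian per point
            for x in data:
                dens = [w * pdf(x, m, s) for w, m, s in model]
                total = sum(dens) or 1e-300
                resp.append([d / total for d in dens])
            model = []  # M-step: re-estimate weight, mean, sd
            for j in range(k):
                nj = max(sum(r[j] for r in resp), 1e-12)
                mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
                model.append((nj / len(data), mean, max(math.sqrt(var), MIN_SD)))
        ll = log_likelihood(data, model)
        if ll > best_ll:
            best, best_ll = model, ll
    return best


def significantly_higher(a, b):
    """Paired one-sided t-test: is mean(a - b) significantly above zero?"""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        return mean > 0.0
    return mean / math.sqrt(var / n) > T_CRITICAL


def select_num_gaussians(values, folds=5, seed=0):
    rng = random.Random(seed)
    values = list(values)
    rng.shuffle(values)
    scores = {1: [], 2: [], 3: []}
    for f in range(folds):
        test = values[f::folds]                      # held-aside TESTING values
        train = [x for i, x in enumerate(values) if i % folds != f]
        scores[1].append(log_likelihood(test, fit_single(train)))
        for k in (2, 3):
            scores[k].append(log_likelihood(test, fit_em(train, k, rng)))
    chosen = 1                                       # tentative selection
    for k in (2, 3):
        if significantly_higher(scores[k], scores[chosen]):
            chosen = k
    return chosen
```

Calling select_num_gaussians on the list of expression values for one
gene returns 1, 2 or 3.  The hard-coded critical value is only valid
for five folds; a different fold count would need a different value.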

The genstates program uses our Gaussian mixture constraints:
specifically, a minimum weight and a minimum standard deviation for
each Gaussian in the mixture, and a requirement that each Gaussian
have the highest density within some number of standard deviations of
its mean.  These are inductive biases used to eliminate certain
possible mixtures that don't make sense under our interpretation of
Gaussians as gene expression states.  See the ``mixture_constraints''
files.
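
One way such constraints could be checked is sketched below.  The
threshold values, the function name satisfies_constraints, and the
reading of the third constraint (the weighted density of each Gaussian
must be the largest at every sampled point within dominance_sds
standard deviations of its mean) are assumptions made for this
example; the actual values and definitions are in the
``mixture_constraints'' files.

```python
import math


def pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def satisfies_constraints(model, min_weight=0.1, min_sd=0.25,
                          dominance_sds=1.0, steps=20):
    """Check a mixture given as a list of (weight, mean, sd) triples."""
    for j, (w, m, s) in enumerate(model):
        # Minimum weight and minimum standard deviation.
        if w < min_weight or s < min_sd:
            return False
        # Sample points across [m - c*s, m + c*s] and require that this
        # Gaussian's weighted density is at least that of every other one.
        for i in range(steps + 1):
            x = m - dominance_sds * s + 2.0 * dominance_sds * s * i / steps
            own = w * pdf(x, m, s)
            for k, (w2, m2, s2) in enumerate(model):
                if k != j and own < w2 * pdf(x, m2, s2):
                    return False
    return True
```

Under these assumed thresholds, two well-separated, equally weighted
Gaussians pass, while a broad Gaussian whose center is dominated by a
narrow one fails the third check.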

Try ``genstates -?'' for help.
	
		
