Procedure to calculate the MDL criterion for most or all partitions of the data: 2nd step.

The first step was to calculate the parsimony score of each group and of all (or most) partitions using the perl script. A file is now available that contains a list of partitions with their parsimony scores and their number of groups. I will assume this file is named "partition_scores".

In R, source the file "MDL.r" with the command
    > source("MDL.r")
Then use the command
    > find.MDL("partition_scores")
which will create a new file called "MDL.out" containing the list of partitions and their MDL scores, sorted with the best MDL scores first. These are the basic scores, based on the separate description of each tree. If you want to describe the trees (other than the first one) by their NNI differences with the first tree, then you can do the following.

Procedure to follow for the MDL approach with NNI description of trees:

Input: "alltrees" file with the list of one MP tree for each group.

1. Choose the best N partitions from the basic MDL analysis. The NNI-based MDL score will be calculated for these N partitions, not for the others (calculating the NNI distance between 2 trees is actually NP-hard, and a heuristic exists but is still time-consuming).

2. Make a list of all groups appearing in the N best partitions.
   Example: partition "1211" is the partition that combines genes 1, 3 and 4 together and puts gene 2 by itself in another group. This partition has two groups: "1011" and "0100".
   Another example: partition "1232" has 3 groups: group "1000" with the first gene only, group "0101" with genes number 2 and 4, and group "0010" with gene 3 only.

3. Get the index of each group appearing in the selected partitions. Groups are ordered in the following way:
         1  10000000
         2  01000000
         3  11000000
       ...
       254  01111111
       255  11111111
   (This example has 8 genes, and so 2^8-1 = 255 nonempty groups of genes.)
   The correspondence between group index and group description is obtained with:
    > group2index("11101101")
    183
    > group2index("00010010")
    72
    > group2index("11101100")
    55
    > group2index("00010001")
    136
    > group2index("00000010")
    64
   In this example, I had 2 partitions: one with groups 183 and 72, and one with groups 55, 136 and 64.
   (Note that 183+72 = 55+136+64 = 255 = 2^8-1 is not by chance: the groups of a partition together contain each gene exactly once.)

4. Create a new tree file by selecting the trees in "alltrees" with the indices found in 3. Call this new file "sometrees", for instance. In the previous example, the new file would have only 5 trees, numbers 55, 64, 72, 136 and 183.

5. Now use COMPONENT to get the distance between all pairs of trees in the file "sometrees" (created in 4). Use the heuristic d_us.
   Example: the example above might lead to
       tree
        2| 3            (now tree 1 --> 55
        3| 3 3               tree 2 --> 64  etc.)
        4| 6 6 5
        5| 7 6 6 5
          ___________
           1 2 3 4

6. Create a list of vectors, one vector for each partition. For a partition with k groups, the vector will contain the k-1 distances between one specific tree and all the others.
   Example: for partition 183/72, the distance between trees 3 and 5 is 6. For partition 55/136/64, the distances between tree 1 and trees 2 and 4 are 3 and 6. The list is: list(c(6),c(3,6)). In R, we can save this list with
    > nni.distance = list(c(6),c(3,6))
   Note: for the partition with a single group combining all the genes, the vector of NNI distances is the empty vector c(). If unavailable, use the vector c(NA).

7. Calculate the NNI-based MDL score of the selected partitions:
    > find.MDL.nni("MDL.out", nni.dist=nni.distance)
   where MDL.out was created with the basic MDL criterion. The selected partitions will be the first N partitions appearing in this file, where N is the length of the list "nni.distance". The output will be in the file "MDLnni.out".
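
Steps 2, 3, 4 and 6 can be done by hand, but for more than a few partitions a short R sketch may help. The code below is only an illustration on the 8-gene example above. The helper names partition_groups, group_index and nni_vector are hypothetical and are not part of "MDL.r". The bit ordering (gene i has weight 2^(i-1)) is inferred from the group2index examples, and taking the lowest-numbered tree of each partition as the reference tree is an assumption. The partition strings "11121121" and "11121132" are my own encoding of the partitions 183/72 and 55/136/64.

```r
# Hypothetical helpers for illustration; they are NOT defined in MDL.r.

# Step 2: split a partition string such as "1211" into its groups, each
# written as a 0/1 membership string ("1011" and "0100").
# Assumes one character per gene.
partition_groups <- function(partition) {
  labels <- strsplit(partition, "")[[1]]
  sapply(unique(labels),
         function(g) paste(as.integer(labels == g), collapse = ""),
         USE.NAMES = FALSE)
}

# Step 3: index of a group, assuming gene i carries weight 2^(i-1)
# (this reproduces the group2index values quoted in the text).
group_index <- function(group) {
  bits <- as.integer(strsplit(group, "")[[1]])
  sum(bits * 2^(seq_along(bits) - 1))
}

# The two selected partitions of the example.
partitions <- c("11121121", "11121132")

# Step 4: indices of the trees to copy from "alltrees" into "sometrees".
tree_ids <- sort(unique(unlist(
  lapply(partitions, function(p) sapply(partition_groups(p), group_index)))))

# Step 5 (result only): the pairwise distances from the example, typed in
# by hand in column order of the lower triangle.
d <- matrix(0, 5, 5, dimnames = list(as.character(tree_ids),
                                     as.character(tree_ids)))
d[lower.tri(d)] <- c(3, 3, 6, 7,  3, 6, 6,  5, 6,  5)
d <- d + t(d)  # make the matrix symmetric

# Step 6: distances between the reference tree (assumed here to be the
# lowest-numbered tree of the partition) and the other trees.
nni_vector <- function(partition, d) {
  idx <- sort(sapply(partition_groups(partition), group_index))
  if (length(idx) == 1) return(c())  # single group: empty vector
  unname(d[as.character(idx[1]), as.character(idx[-1])])
}

nni.distance <- lapply(partitions, nni_vector, d = d)
```

The resulting nni.distance can then be passed to find.MDL.nni as in step 7, provided the partitions are listed in the same order as in "MDL.out".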