affils1.clusters.clusters:   1696
affils2.clusters.clusters:   1897
affils2.clusters.singletons: 129
affils3.clusters.clusters:   1678 clusters (with 2283 strings)
affils3.clusters.singletons: 243 strings
Total: 12723 (includes 6935 outliers)

1st pass outliers (affils.tmp.outliers): This has further been reduced to 1662 clusters with 2238 strings. The remaining are singleton strings, which I did not write out to a file.

I forgot to make a small change in the code. Now: the total number of clusters is 12085.

Time taken: 10 hrs. (The first pass takes 4 to 5 hours. After that, the time drops drastically, to much less than an hour for clustering each set.)

Approach:
The first pass is more of a data-cleaning operation than clustering. It generates affils.tmp.clusters and affils.tmp.outliers. Then affils.tmp.clusters is split into three parts (the number three was chosen arbitrarily, for convenience), and each part is clustered separately. The three parts are affils1.clusters, affils2.clusters, and affils3.clusters. Clustering each part generates two files: affils?.clusters.clusters and affils?.clusters.singletons.

In addition, we also clustered affils.tmp.outliers to generate two more files, affils.outliers.clusters and affils.outliers.outliers. (The latter file does not exist in this run. Please calculate its count as the number of remaining strings that are not accounted for anywhere else.)
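
The split step and the "remaining strings" calculation described above can be sketched roughly as follows. This is only an illustration, not the code used for this run: the function names split_into_parts and unaccounted_strings are hypothetical, and the sketch assumes each input file holds one record per line and that the parts are contiguous and near-equal in size, none of which is stated in these notes.

```python
def split_into_parts(lines, n_parts=3):
    """Split a list of lines into n_parts contiguous, near-equal chunks.

    Assumption: one record per line; the real split may have been done
    differently (the notes only say the three parts were arbitrary).
    """
    base, extra = divmod(len(lines), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts get one additional line each.
        size = base + (1 if i < extra else 0)
        parts.append(lines[start:start + size])
        start += size
    return parts


def unaccounted_strings(total_strings, accounted_counts):
    """Return the number of strings not counted in any output file."""
    remaining = total_strings - sum(accounted_counts)
    if remaining < 0:
        raise ValueError("accounted counts exceed the total")
    return remaining
```

For example, if the 6935 first-pass outliers are counted as strings and 2238 of them ended up in clusters, unaccounted_strings(6935, [2238]) gives the number of leftover singleton strings under that reading.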