nClusters and maxnClusters Version 1.1
======================================

Program maxnclusters first calls "nclusters" to mine nClusters, then
removes non-maximal nClusters in a post-processing step.

Usage:
    maxnclusters nclusters data_filename min_row_size min_clmn_size delta 1 max_overlap output_filename

Parameters:

1. data_filename: If the specified name is xxx, then two files should
   exist: "xxx.data" and "xxx.names".

   Data format (similar to the data format used by the UCI machine
   learning repository):

   xxx.data:  Each line represents an object as a set of attribute
              values separated by commas.
   xxx.names: The first line contains the number of objects (genes).
              The second line contains the number of attributes (tissues).
              The remaining lines state whether each attribute is
              continuous or nominal. Each line is of the form:
                  attribute-name: continuous/nominal/...

2. min_row_size: the minimum number of objects in a cluster.

3. min_clmn_size: the minimum number of attributes in a cluster.

4. delta: the distance threshold. On every attribute in a cluster, the
   objects in that cluster are at most delta times the attribute range
   apart from each other.

5. max_overlap: the maximum overlap allowed between adjacent bins of an
   attribute. Suggested value: 0.9. If the program takes too long, try
   lowering this value.

6. output_filename: the file that receives the generated subspace
   clusters. Each cluster takes two lines. The first line contains the
   set of attributes in the cluster, and the second line contains the
   ids of the objects in the cluster. The format is as follows:

       #attributes attr1 attr2 ... attrk
       #objects obj1 obj2 ... objm

   where #attributes is the number of attributes in the cluster,
   followed by the attributes themselves, and #objects is the number of
   objects in the cluster, followed by the ids of the objects. The id of
   an object is its line number in "xxx.data" minus 1.

Credits:

These programs were written by LIU Guimei.
During the project period, she was partially supported by FRC grant
"R-252-040-238-101 & R-252-060-238-133: Pattern Spaces: Theory,
Algorithms, and Applications", MOE T1 grant "R-252-000-274-112:
Graph-Based Protein Function Prediction", and SERC PSF grant
"SERC 072 101 0016: Pattern Spaces: Theory, Techniques and Applications".

If you use these programs, please cite:

Guimei Liu, Jinyan Li, Kelvin Sim, Limsoon Wong. Distance-Based Subspace
Clustering with Flexible Dimension Partitioning. Proceedings of the 23rd
IEEE International Conference on Data Engineering, pages 1250--1254,
Istanbul, Turkey, April 2007.

Guimei Liu, Kelvin Sim, Jinyan Li, Limsoon Wong. Efficient Mining of
Distance-Based Subspace Clusters. Statistical Analysis and Data Mining,
submitted.
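
Example (illustrative sketch, not part of the distributed programs):

The output format and the delta threshold described under Parameters can
be checked with a short Python script such as the one below. It is only a
sketch under stated assumptions: the function names (parse_data,
parse_clusters, within_delta) are hypothetical, all attributes are assumed
to be continuous, attributes in the output file are assumed to be written
as 0-based integer column indices, and "attribute range" is assumed to
mean the maximum minus the minimum of that attribute over all objects.
None of these details is confirmed by this README.

```python
from typing import List, Tuple

# (attribute indices, object ids); both assumed to be 0-based integers.
Cluster = Tuple[List[int], List[int]]


def parse_data(text: str) -> List[List[float]]:
    """Parse the contents of xxx.data: one object per line, comma-separated.

    Assumes every attribute is continuous (numeric).
    """
    return [[float(v) for v in line.split(",")]
            for line in text.splitlines() if line.strip()]


def parse_clusters(text: str) -> List[Cluster]:
    """Parse the output file: two lines per cluster.

    Line 1: #attributes attr1 ... attrk
    Line 2: #objects obj1 ... objm
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    clusters = []
    for attr_line, obj_line in zip(lines[0::2], lines[1::2]):
        attrs = [int(a) for a in attr_line[1:]]
        objs = [int(o) for o in obj_line[1:]]
        # The leading count on each line must match the ids that follow.
        if int(attr_line[0]) != len(attrs) or int(obj_line[0]) != len(objs):
            raise ValueError("count does not match number of ids listed")
        clusters.append((attrs, objs))
    return clusters


def within_delta(rows: List[List[float]], cluster: Cluster,
                 delta: float) -> bool:
    """Check the delta condition on every attribute of the cluster.

    Assumption: attribute range = max - min over all objects in the data.
    """
    attrs, objs = cluster
    for a in attrs:
        column = [row[a] for row in rows]
        attr_range = max(column) - min(column)
        # An object id is its line number in xxx.data minus 1, so it
        # indexes directly into rows.
        values = [rows[o][a] for o in objs]
        if max(values) - min(values) > delta * attr_range:
            return False
    return True


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    rows = parse_data("1.0,10.0,5.0\n1.2,10.5,9.0\n1.1,10.2,1.0\n5.0,20.0,3.0\n")
    for cluster in parse_clusters("2 0 1\n3 0 1 2\n"):
        print(cluster, within_delta(rows, cluster, delta=0.1))
```

If the attributes in a real output file turn out to be written as names
rather than indices, parse_clusters would need to map them to columns
using "xxx.names" first.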