第一期:数据挖掘
 

Data Mining: Foundation, Techniques and Applications
(Organized by Key Lab of Data and Knowledge Engineering in the Renmin University of China)

Lecturers:

Anthony K. H. Tung (鄧锦浩),
Assistant Professor,
School of Computing,
National University of Singapore,

Singapore

Cuiping Li (李翠平),
Associate Professor,
School of Information,
Renmin University of China,
China

 

Date

Session

Topics

4th Dec 2007

Morning

Afternoon

Night

5th Dec 2007

Morning

Afternoon

6th Dec 2007

Morning

Association Rule, Frequent Pattern (II)Classification & Regression (I)

Afternoon

Classification & Regression (II), Clustering (I)

Night

Clustering (II)

7th Dec 2007

Morning

Skyline/Dominance Relationship Analysis

Afternoon

Searching and Mining High Dimensional Data

Night

Searching and Mining Sequences

8th Dec 2007

Morning

Searching and Mining Trees

Afternoon

Searching and Mining Graphs

Timing:

Morning: 9am-12pm
Afternoon: 2pm-5pm
Night: 7pm-10pm

Texts:   

[HMS01]:   Principles of Data Mining, David Hand, Heikki Mannila and Padhraic Smyth, The MIT Press, 2001
[HK01]: Data Mining: Concepts and Techniques,  Second Edition, J. Han and M. Kamber,  Morgan Kaufmann, 2006
[Mit97]: Machine Learning, Tom M. Mitchell, McGrawHill, 1997.

Readings: (Tentative)


Introduction/A Quick Overview  


Machine Learning and Statistic


Database Techniques I (Indexing)

·         A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching." SIGMOD Conference 1984: 47-57

·         Nick Roussopoulos, Steve Kelley, and F. Vincent. “Nearest Neighbor Queries”. Proc. of ACM-SIGMOD, pages 71--79, May 1995.

·         R. Baeza-Yates, et al, Modern Information Retrieval (Acm Press Series), 1999, Chapter 8.1-8.3


Database Techniques II (Pre-Computation)

·         J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichaxt, and M. Venkatrao. “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”. In DMKD, pages 29-53. Kluwer Academic Publishers, 1997

·         Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Implementing Data Cubes Efficiently. SIGMOD Conference 1996: 205-216

·         H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. “Optimal Histograms with Quality Guarantees”. VLDB 1998

·         HK06: Chapter 3-4


Association Rule, Frequent Pattern (I)

·         http://www.cs.helsinki.fi/u/salmenki/dmcourse/chap12.pdf  Chapter 2

·         Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li, "New Algorithms for Fast Discovery of Association Rules", 3rd International Conference on Knowledge Discovery and Data Mining (KDD), pp 283-286, Newport, California, August, 1997. (Journal Version in 2000)
Note: The first paper the introduce depth first search for frequent pattern mining.

·         P.Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, D. “Turbo-Charging Vertical Mining of Large Database.” Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.
Note: Tested on a dataset of 1.2GB, one of the largest that I have seen  in data mining literatures. (Code Available Here)

·         Gao Cong, Beng Chin Ooi, Kian-Lee Tan, Anthony K. H. Tung, "Go Green: Recycle and Reuse Frequent Patterns". International Conference on Data Engineering (ICDE'2004), Boston, 2004

Optional References:

·         Zijian Zheng, Ron Kohavi, Llew Mason. Real World Performance of Association Rule Algorithms. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 

·         H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung, "ItCompress: An Iterative Semantic Compression Algorithm".International Conference on Data Engineering (ICDE'2004), Boston, 2004.


Association Rule, Frequent Pattern (II)

·         Mohammed J. Zaki, “Efficient Enumeration of Frequent Sequences”, 7th International Conference on Information and Knowledge Management, pp 68-75, Washington DC, November 1998.
Note: The first paper that introduce depth first search for sequential pattern mining.

·         Kevin S. Beyer, Raghu Ramakrishnan: Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD Conference 1999: 359-370.
Note: The first paper that use pointer-based algorithm for frequent pattern mining.

·         Mohammed J. Zaki, Ching-Jui Hsiao, CHARM: An Efficient Algorithm for Closed Itemset Mining, 2nd SIAM International Conference on Data Mining, Arlington, April 2002.

·         Cuiping Li, Gao Cong, Anthony K.H. Tung, Shan Wang. “Incremental Maintenance of Quotient Cube for Median”. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04) 2004

·         HK06: Chapter 5.5


Classification & Regression (I)


Classification & Regression (II)


Clustering (I)

  • HK06: Chapter 7.1- 7.7

Optional References:


Clustering (II)

·         HK06:  Chapter 7.9- 7.11

·                     Anthony K. H. Tung, Jean Hou, Jiawei Han,  “Clustering in the Presence of Obstacles”. In Proc. of 17th International Conference on Data Engineering (ICDE’01, Heidelberg, Germany  p359-367.

·                     Anthony K. H. Tung,  Raymond T. Ng, Laks V. S. Lakshmanan,, Jiawei Han,  “Constrained Clustering on Large Database” , Proc. 8th Intl. Conf. on Database Theory (ICDT’01), London, UK, Jan. 2001, p405-419.  

·         Wen Jin,  Anthony K. H. Tung , Jiawei. Han, “Finding Top-n Local Outliers in Large Database”, in 7th  ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (SIGKDD’01)


Ranking: Skyline/Dominance Relationship Analysis

·         Stephan Börzsönyi, Donald Kossmann, Konrad Stocker: "The Skyline Operator". ICDE 2001: 421-430

·         Papadias, D., Tao, Y., Fu, G., Seeger, B. "An Optimal and Progressive Algorithm for Skyline Queries". Proceedings of the ACM Conference on the Management of Data (SIGMOD), pp. 443-454, San Diego, CA, June 9-12, 2003.

·         Cuiping Li, Beng Chin Ooi, Anthony K. H. Tung, Shan Wang. "DADA: A Data Cube for Dominant Relationship Analysis", In Proceedings, ACM SIGMOD Int'l. Conference on Management of Data , Chicago, 2006 (SIGMOD'06)

·         Chee-Yong Chan, H. V. Jagadish, Kian-Lee Tan, Anthony K. H. Tung, Zhenjie Zhang. "Finding k-Dominant Skylines in High Dimensional Space", In Proceedings, ACM SIGMOD Int'l. Conference on Management of Data , Chicago, 2006 (SIGMOD'06)

·         Cuiping Li, Anthony K. H. Tung, Wen Jin, Martin Ester. "On Dominating Your Neighborhood Profitably". accepted and to appear in the 33rd Very Large Data Bases conference (VLDB'07), Vienna, 2007

Optional References:

·         Chee-Yong Chan, H. V. Jagadish, Anthony K.H. Tung, Kian-Lee Tan, Zhenjie Zhang. "On High Dimensional Skylines ". In EDBT 2006

·         Wen Jin, Anthony K. H. Tung, Martin Ester and Jiawei Han,''On Efficient Processing of Subspace Skyline Queries on High Dimensional Data'', in  Proc. 2007 Int'l.Conf.on Scientific and Statistical Database Management (SSDBM'07), Canada, July. 2007.


Searching and Mining High Dimensional Data

·         Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search: A Matching Based Approach",  VLDB'06.

·         C. Xia, H. Lu B. C. Ooi, and  J. Hu  "GORDER: An Efficient Method for KNN Join Processing" VLDB 2004

·         Anthony K. H. Tung, Xin Xu, Beng Chin Ooi. "CURLER: Finding and Visualizing Nonlinear Correlated Clusters" In Proceedings, SIGMOD'05, Baltimore, Maryland 2005.

·         Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". In Proceedings SIGMOD'05,Baltimore, Maryland 2005

     Optional References:

·         Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki, "CARPENTER: Finding Closed Patterns in Long Biological Datasets", In Proceedings KDD'03, Washington, DC, USA, August 24-27, 2003.

·         Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang. "FARMER: Finding Interesting Rule Groups in Microarray Datasets". Iin SIGMOD'04, June 13-18, 2004, Maison de la Chimie, Paris, France.

·         Feng Pang, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery".  SSDBM 2004 Santorini Island Greece.

·         Gao Cong, Kian-Lee Tan, Anthony K.H. Tung, Feng Pan. “Mining Frequent Closed Patterns in Microarray Data”. In IEEE International Conference on Data Mining, (ICDM). 2004


Searching and Mining Sequences

·         Benjarath Phoophakdee, Mohammed J. Zaki: "Genome-scale disk-based suffix tree indexing". SIGMOD Conference 2007: 833-844

·         Chen Li, Bin Wang, and Xiaochun Yang . "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams". Accept and to Appear in VLDB 2007.

·         nT.L. Bailey and Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proc. Int. Conf. Intell. Syst. Mol. Biol., volume 2, pages 28--36. 1994

·         Heikki Mannila, Christopher Meek: Global partial orders from sequential data. KDD 2000: 161-168

    Optional References:

·         U. Keich and P. Pevzner. Subtle motifs: defining the limits of motif finding algorithms. Bioinformatics, in press, 2002.

·         B Ma, J Tromp, M Li. PatternHunter: faster and more sensitive homology search    - Bioinformatics, 2002 - Oxford Univ Press

·         Xin Xu, Ying Lu, Anthony K.H. Tung, Wei Wang. "Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles". Accepted and to appear in ICDE 2006.


Searching and Mining Trees

·         Philip Bille . A survey on tree edit distance and related problems. Theoretical Computer Science. Volume 337 ,  Issue 1-3  (June 2005)

·         Rui Yang, Panos Kalnis, Anthony K. H. Tung: Similarity Evaluation on Tree-structured Data. SIGMOD 2005.

·         Hisashi Kashima Teruo Koyanagi. Kernels for Semi-Structured Data. Proceedings of the Nineteenth International Conference on Machine Learning. 291-298. 2002.

·         Mohammed J. Zaki. "Efficiently mining frequent trees in a forest". Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. Pg. 71-80. 2002

    Optional References:

·         JP Vert. "A tree kernel to analyze phylogenetic profiles"  - Bioinformatics, 2002 - Oxford Univ Press


Searching and Mining Graphs

·         Hisashi Kashima, Koji Tsuda, Akihiro Inokuchi: Marginalized Kernels Between Labeled Graphs. ICML 2003: 321-328

·         Raymond J. W. Willett P. "Maximum common subgraph isomorphism algorithms for the matching of chemical structures".Journal of Computer-Aided Molecular Design, Volume 16, Number 7, July 2002 , pp. 521-533(13)

·         Bei Yu, Guoliang Li, Karen Sollins, Anthony K. H. Tung. Effective Keyword-based Selection of Relational Databases. ACM SIGMOD 2007.

·         X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003..

·         X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, in Proc. 2005 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'05), Baltimore, Maryland, June 2005

·         HK06: Chapter 9.1

Optional References:

·         H Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 1997

·         H Bunke, K Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 1998 - ece.concordia.ca