B. Research

B.1 Research Programme

My long-term research goal is to change the way scholars produce and consume academic research. This research falls under Digital Libraries (DL), an interdisciplinary field that combines many areas of computer science: natural language processing (NLP), (web) information retrieval (IR), multimedia (MM), human computer interaction (HCI), machine learning (ML) and databases (DB). Over my 5 ½ years at NUS, I have solidified my group’s track record as a world-class group that publishes in top conferences and journals in the areas of NLP, IR and MM. Where we do not have the proper expertise, I have collaborated with other internationally renowned teams that do (A/P Hari Sundaram of Arizona State Univ., MM and HCI expert; A/P Dongwon Lee). I have sought diversity in collaboration as well as in funding sources to ensure that my group thrives on the interdisciplinary culture of Digital Libraries and can survive fluctuations in funding and initiatives. In other collaborations, my group has done the bulk of the work in order to strategically network with other, more established research groups (see “Statement on Co-Authorship”, C.2.d).

Being an interdisciplinary researcher is difficult. Often, the fields to which one contributes do not see the value in the research, as it lies on the fringe. I’m happy to report that this is not true in my case; rather, I feel that I have succeeded quite well thus far. My overall citations are very healthy, surpassing those of many of my fellow tenure applicants in my department (by Google Scholar; see C.3.a), many of whom are not interdisciplinary in focus.

I’m also satisfied that the visibility of my group has increased, recognized in the form of successful alumni (Google, I²R, Univ. of Maryland), well-placed internships for students (Columbia Univ., Google), and editorial and professional board positions for myself (Information Retrieval, Association for Computational Linguistics).

To further encourage growth and the impact of my research, it is clear that publications are not enough. This is obvious to me as a scholar who studies the dissemination of scholarly works. The successful research team is active on all fronts: publication, tools, datasets, advocacy, consultancy and active dissemination. My group's activity mirrors this: we contribute to the research community through a number of means aside from publications (see “Non-publication Research Highlights”, C.2.c). The next phase of my group’s development will be to build, develop and maintain high impact software systems that will catalyze long-term changes in how scholars do research. This effort has started in the past year, resulting in the release of useful and utilized open source software (ParsCit; publication #28). Building these systems will not only provide impact to my research group and the host institution, but also create a platform to find and solve research problems that we do not yet know exist. Research experience with web-based systems has shown this to be the case: YouTube, Facebook, Flickr and Del.icio.us all have access to enormous datasets that have real-world impact yet abound in interesting research problems.

In the short term, my group is acquiring experience in enterprise-level coding while maintaining our publication profile in interesting cross-disciplinary research: in particular, we are exploring slide analysis and fine-grained multimedia alignment, both problems that are not yet significantly explored by research groups, yet crucial for enabling future systems. A case in point is our recent work on slide-to-paper alignment (SlideSeer; publication #18) that has created a small stir in the NLP and DL communities. In the longer term, we will be building the infrastructure to bring these research prototypes to the next generation digital library and (perhaps more importantly) seeding the user pool for these DL projects.

System building is an inherently dangerous activity for academics to undertake: academic colleagues may not view such activities as impactful; and often they are not, as few research systems have demonstrably made a real impact. Why is my bid different? I feel that the reason behind many such failures is a lack of understanding of the end user. As a DL expert, HCI issues of user analysis and multiple rounds of testing are at the forefront of my concerns in developing a system. I am actively seeking the support of users and past DL system developers in creating usable DL systems that will place users in the front seat. I have maintained my group’s publishing record during this time of development but feel that the system development is of higher priority because it stands to create a much larger impact in the long run.

I plan on doing this with new monies that my group has secured as collaborators on the large MDA grants from Prof. Chua Tat-Seng. I’m actively involved in other nascent collaborations to secure the long-term funding and manpower necessary to build a larger, world-class group (see “Future Plans”, C.4). Currently, I am pursuing seven different grants as a Co-PI or collaborator (including ones with the CSIDM (NUS-Chinese Academy of Sciences) and NUS-CALIT2, each worth approximately 10M SGD in total).

B.2 Research Contributions

In addition to the published or accepted publications, I am preparing two premium journal publications for submission and have one other Rank-1 conference publication under review.

Please note that my primary target area is Digital Libraries, and the top conference for that field, JCDL, is considered a rank 2 conference by SoC and NUS. While I agree with this ranking, JCDL is nonetheless the correct venue for much of the research that my group has been doing. I have published regularly in this venue and have served on its program committee since 2005.

(a) Publications (70 total; IDs are numbered sequentially for ease of reference):

· Books Edited: 1

1. Hwee Tou Ng, Mun-Kew Leong, Min-Yen Kan and Donghong Ji (Eds.) (2006) Information Retrieval Technology. Lecture Notes in Computer Science, Volume 4182.

· Book Chapters: 1

2. Min-Yen Kan (2005) Using multi-document summarisation to assist in semi-structured literature retrieval: A case study in consumer healthcare. In Theng, Yin Leng and Foo, Schubert (Eds.) "Design and Usability of Digital Libraries: Case Studies in the Asia Pacific", Idea Group Publishing.

· Journal publications (premium): 2

3. Min-Yen Kan, Ye Wang, Denny Iskandar, Tin Lay Nwe and Arun Shenoy (2008) LyricAlly: Automatic Synchronization of Textual Lyrics to Acoustic Music Signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(2), February. pp. 338-349.

4. Hang Cui, Min-Yen Kan and Tat-Seng Chua (2007) Soft Pattern Matching Models for Definitional Question Answering, ACM Transactions on Information Systems (TOIS), 25(2). April.

· Journal publications (leading): 4

5. Yee Fan Tan and Min-Yen Kan (2008) Record Linkage in Digital Library Metadata. In Communications of the ACM (CACM), Technical Opinion Column, 51(2), pp 91-94, February.

6. Shiren Ye, Tat-Seng Chua, Min-Yen Kan and Long Qiu (2007) Document concept lattice for text understanding and summarization, Information Processing and Management, 43(6), pp. 1643-1662.

7. Wei Lu and Min-Yen Kan (2007) Supervised Categorization of Javascript using Program Analysis Features, Information Processing and Management, 43(2).

8. Noemie Elhadad, Min-Yen Kan, Judith Klavans, and Kathleen McKeown (2005) Customization in a Unified Framework for Summarizing Medical Literature, Journal of Artificial Intelligence in Medicine, 33 (2), pp. 179-198.

· Conference Papers (1st tier): 8

9. Hendra Setiawan, Min-Yen Kan and Haizhou Li (2007) Ordering Phrases with Function Words, In Proceedings of the Association for Computational Linguistics (ACL 07). Prague, Czech Republic, June.

10. Long Qiu, Min-Yen Kan and Tat-Seng Chua (2006) Paraphrase Recognition via Dissimilarity Significance Classification, In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP '06), Sydney, Australia, July 2006.

11. Cui Hang, Min-Yen Kan and Tat-Seng Chua (2005) Generic Soft Pattern Models for Definitional Question Answering. Proc. of ACM SIG on Information Retrieval (SIGIR 05). Brazil, August 2005.

12. Cui Hang, Renxu Sun, Keya Li, Min-Yen Kan and Tat-Seng Chua (2005) Question Answering Passage Retrieval Using Dependency Relations. Proc. of ACM SIG on Information Retrieval (SIGIR 05). Brazil, August 2005.

13. Wang Ye, Min-Yen Kan, Tin Lay Nwe, Arun Shenoy and Jun Yin (2004) LyricAlly: Automatic Synchronization of Acoustic Musical Signals and Textual Lyrics. In Proceedings of ACM Multimedia 2004 (MM '04), New York, USA, 10-16 October.

14. Hang Cui, Min-Yen Kan and Tat-Seng Chua (2004) Unsupervised Learning of Soft Patterns for Generating Definitions from Online News. In Proceedings of the 13th International World Wide Web Conference (WWW2004), May 2004. New York, New York, USA.

15. Judith L. Klavans and Min-Yen Kan (1998) Role of Verbs in Document Analysis. In Proceedings of COLING/ACL 98, Montréal, Québec, Canada: Aug. 1998. pp. 680-686. (Posted to cmp-lg archives)

16. Min-Yen Kan, Judith L. Klavans and Kathleen R. McKeown (1998) Linear Segmentation and Segment Relevance. Proceedings of 6th International Workshop of Very Large Corpora (WVLC-6), Montréal, Québec, Canada: August 1998. pp. 197-205.

· Conference Papers (2nd tier): 8

17. Jin Zhao, Min-Yen Kan and Yin Leng Theng (2008) Math Information Retrieval: User Requirements and Prototype Implementation. In Proceedings of the Joint Conference on Digital Libraries (JCDL '08). Pittsburgh, Pennsylvania, June, pages 187-196.

18. Min-Yen Kan (2007) SlideSeer: A Digital Library of Aligned Document and Presentation Pairs, In Proceedings of the Joint Conference on Digital Libraries (JCDL '07). Vancouver, Canada, June.

19. Su Yan, Dongwon Lee, Min-Yen Kan and C. Lee Giles (2007) Adaptive Sorted Neighborhood Methods for Efficient Record Linkage, In Proceedings of the Joint Conference on Digital Libraries (JCDL '07). Vancouver, Canada, June.

20. Min-Yen Kan and Danny C. C. Poo (2005) Detecting and supporting known item queries in online public access catalogs. Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL 05). Denver, 7-11 June 2005. pp. 91-99.

21. Bageshree Shevade, Hari Sundaram and Min-Yen Kan (2005) A Collaborative Annotation Framework. In Proceedings of the International Conference on Multimedia and Expo (ICME '05), Amsterdam, Netherlands, July 2005.

22. Jeffry Komarjaya, Danny C.C. Poo and Min-Yen Kan (2004) Corpus-Based Query Expansion in Online Public Access Catalogs. In Proceedings of the European Conference on Digital Libraries (ECDL '04), Bath, United Kingdom, 12-17 September.

23. Andre W. Kushniruk, Min-Yen Kan, Kathleen R. McKeown, Judith L. Klavans, Desmond Jordan, Mark LaFlamme and Vimla L. Patel (2002) Usability Evaluation of an Experimental Text Summarization System and Three Search Engines: Implications for the Reengineering of Health Care Interfaces. In Proceedings of the American Medical Informatics Association Annual Symposium (AMIA 2002), San Antonio, Texas, USA: November 2002.

24. Min-Yen Kan and Judith L. Klavans (2002) Using Librarian Techniques in Automatic Text Summarization for Information Retrieval. Proceedings of the Joint Conference on Digital Libraries (JCDL 2002), Portland, Oregon, USA: July 2002. pp. 36-45.

· Conference Papers (Others): 34

25. Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan and Dongwon Lee (2008). Efficient Web-Based Linkage of Short to Long Forms. International Workshop on the Web and Databases (WebDB), Vancouver, Canada, June 2008.

26. Guo Min Liew and Min-Yen Kan (2008) Slide Image Retrieval: A Preliminary Study. In Proceedings of the Joint Conference on Digital Libraries (JCDL '08). Pittsburgh, Pennsylvania, June, pages 359-362. Short paper.

27. Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev and Yee Fan Tan (2008) The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Language Resources and Evaluation Conference (LREC 08). Marrakesh, Morocco, May.

28. Isaac G. Councill, C. Lee Giles and Min-Yen Kan (2008) ParsCit: An open-source CRF reference string parsing package. In Language Resources and Evaluation Conference (LREC 08). Marrakesh, Morocco, May.

29. Long Qiu, Min-Yen Kan and Tat-Seng Chua (2008) Modeling Context in Scenario Template Creation, In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP '08), Hyderabad, India.

30. Thuy Dung Nguyen and Min-Yen Kan (2007). Keyphrase Extraction in Scientific Publications. In Proc. of International Conference on Asian Digital Libraries (ICADL '07). Hanoi, Vietnam, December.

31. Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee and Yi Zhang (2007). Web Based Linkage. In Proc. of Workshop on Web Information and Data Management (WIDM '07). Lisboa, Portugal, September.

32. Ergin Elmacioglu, Yee Fan Tan, Su Yan, Min-Yen Kan and Dongwon Lee (2007). PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features. In Proceedings of SemEval 2007 Workshop, Association for Computational Linguistics (ACL), Prague, Czech Republic, June.

33. Jesse Prabawa Gozali and Min-Yen Kan (2007) A Rich OPAC User Interface with AJAX, In Proceedings of the Joint Conference on Digital Libraries (JCDL '07). Vancouver, Canada, June. Short paper.

34. Bang Viet Nguyen and Min-Yen Kan (2007) Functional Faceted Web Query Classification. In Proc. of Query Log Analysis: Social and Technological Challenges, Banff, Canada, May.

35. Ziheng Lin and Min-Yen Kan (2007) Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization, In Proceedings of Textgraphs-2: Workshop on Graph-based Methods for Natural Language Processing, Rochester, NY, USA, April.

36. Denny Iskandar, Ye Wang, Min-Yen Kan and Haizhou Li (2006) Syllabic Level Automatic Synchronization of Music Signals and Text Lyrics, In Proceedings of ACM Multimedia (MM '06), Santa Barbara, CA, USA, October 2006.

37. Shi-yong Neo, Jin Zhao, Min-Yen Kan and Tat-Seng Chua (2006) Video Retrieval using High Level Features: Exploiting Query Matching and Confidence-based Weighting, In Proceedings of the Conference on Image and Video Retrieval (CIVR), Tempe, Arizona, USA, July 2006.

38. Fei Wang and Min-Yen Kan (2006) NPIC: Hierarchical synthetic image classification using image search and generic features, In Proceedings of the Conference on Image and Video Retrieval (CIVR), Tempe, Arizona, USA, July 2006.

39. Yee Fan Tan, Min-Yen Kan and Dongwon Lee (2006). Search Engine Driven Author Disambiguation. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Chapel Hill, North Carolina, USA, June 2006. (Short Paper).

40. Yee Fan Tan, Min-Yen Kan and Hang Cui (2006) Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context (MWEmc), Trento, Italy, April 2006, pages 47-54.

41. Renxu Sun, Jing Jiang, Yee Fan Tan, Hang Cui, Tat-Seng Chua and Min-Yen Kan (2005) Using Syntactic and Semantic Relation Analysis in Question Answering, In Proceedings of the 14th Text Retrieval Conference (TREC), Gaithersburg, Maryland, USA, November 2005.

42. Min-Yen Kan and Hoang Oanh Nguyen Thi (2005) Fast webpage classification using URL features. In Proc. of Conf. on Info and Knowledge Management (CIKM '05). Bremen, Germany, November 2005. Poster Paper.

43. Wei Lu and Min-Yen Kan (2005) Supervised Categorization of Javascript using Program Analysis Features. In Proc. of Asia Information Retrieval Symposium (AIRS 05). Jeju Island, Korea, October 2005.

44. Yijue How and Min-Yen Kan (2005) Optimizing predictive text entry for short message service on mobile phones. In Proc. of Human Computer Interfaces International (HCII 05). Las Vegas, July 2005.

45. Hang Cui, Min-Yen Kan, Tat-Seng Chua and Jing Xiao (2004) A Comparative Study on Sentence Retrieval for Definitional Question Answering. In Proceedings of the Workshop on Information Retrieval for Question Answering (IR4QA), SIGIR '04. Sheffield, United Kingdom.

46. Chee How Lee, Min-Yen Kan and Sandra Lai (2004) Stylistic and Lexical Co-training for Web Block Classification. In Proceedings of Workshop on Web Information and Data Management (WIDM '04), Washington, D.C., USA, 12-13 November.

47. Min-Yen Kan (2004) Web Page Classification Without the Web Page. In Proceedings of the 13th International World Wide Web Conference (WWW2004), May 2004. New York, New York, USA. Poster Paper.

48. Long Qiu, Min-Yen Kan and Tat-Seng Chua (2004) A Public Reference Implementation of the RAP Anaphora Resolution Algorithm. In Proceedings of the Language Resources and Evaluation Conference 2004 (LREC 04), Lisbon, Portugal.

49. Simon Lok and Min-Yen Kan (2003) Employing Natural Language Summarization and Automated Layout for Effective Presentation and Navigation of Information Retrieval Results. Proceedings of the 12th International World Wide Web Conference (WWW2003), May 2003. Poster paper.

50. Andre W. Kushniruk, Min-Yen Kan, Kathleen R. McKeown, Judith L. Klavans and Vimla L. Patel (2002) Evaluating the Content and Usability of an Experimental Text Summarization System and Three Web-Based Search Engines. In Proceedings of the Human Factors and Ergonomics 46th Annual Meeting (HFES 2002), Baltimore, Maryland, USA: September 2002.

51. Min-Yen Kan and Kathleen R. McKeown (2002) Corpus-trained text generation for summarization. Proceedings of the Second International Natural Language Generation Conference (INLG 2002), Harriman, New York, USA: July 2002. pp. 1-8.

52. Min-Yen Kan, Judith L. Klavans and Kathleen R. McKeown (2002) Using the Annotated Bibliography as a Resource for Indicative Summarization. In Proceedings of the Language Resources and Evaluation Conference (LREC 2002), Las Palmas, Spain: May 2002. pp. 1746-1752.

53. Min-Yen Kan, Kathleen R. McKeown and Judith L. Klavans (2001) Domain-specific informative and indicative summarization for information retrieval. In Proceedings of the Document Understanding Workshop (DUC 2001), New Orleans, USA: September 2001.

54. Kathleen R. McKeown, Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Min-Yen Kan, Barry Schiffman and Simone Teufel (2001) Columbia Multi-Document Summarization: Approach and Evaluation. In Proceedings of the Document Understanding Workshop (DUC 2001), New Orleans, USA: September 2001.

55. Min-Yen Kan, Kathleen R. McKeown and Judith L. Klavans (2001) Applying Natural Language Generation to Indicative Summarization. In Proceedings of 8th European Workshop on Natural Language Generation, Toulouse, France: July 2001. pp. 92-100.

56. Vasileios Hatzivassiloglou, Judith L. Klavans, Melissa L. Holcombe, Regina Barzilay, Min-Yen Kan, and Kathleen R. McKeown (2001) Simfinder: A Flexible Clustering Tool for Summarization. In Proceedings of the Workshop on Summarization in NAACL '01, Pittsburgh, Pennsylvania, USA: June 2001.

57. Judith L. Klavans, Kathleen R. McKeown, Min-Yen Kan, and Susan Lee (1998) Resources for the Evaluation of Summarization Techniques. Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain: May 1998.

58. Pascale Fung, Min-Yen Kan and Yurie Horita (1996) Extracting Japanese Domain and Technical Terms is Relatively Easy. Second International Conference in New Methods for Language Processing, (NEMLP) Bilkent, Turkey: September 1996. pp. 148-159.

· Technical Reports: 11

59. Jesse Prabawa Gozali and Min-Yen Kan (2007). Rich and Dynamic Library Catalogs: A Case Study of Online Search Interfaces. National University of Singapore Department of Computer Science Technical Report, TRA 8/07.

60. Ziheng Lin, Tat-Seng Chua, Min-Yen Kan, Wee Sun Lee, Long Qiu and Shiren Ye (2007). NUS at DUC 2007: Using Evolutionary Models of Text. In Proceedings of the Document Understanding Conference (DUC '07), Rochester, NY, USA.

61. Min-Yen Kan and Hoang Oanh Nguyen Thi (2005) Fast webpage classification using URL features. National University of Singapore Department of Computer Science Technical Report, TRC 8/05.

62. Yee Fan Tan, Min-Yen Kan and Hang Cui (2005) Extending corpus-based identification of light verb constructions using a supervised learning framework. National University of Singapore Department of Computer Science Technical Report, TRB 8/05.

63. Hang Cui, Keya Li, Renxu Sun, Tat-Seng Chua and Min-Yen Kan (2004) National University of Singapore at the TREC-13 Question Answering Main Task. In Proceedings of TREC 13.

64. Hui Yang, Hang Cui, Mstislav Maslennikov, Long Qiu, Min-Yen Kan, Tat-Seng Chua (2003) QUALIFIER In TREC-12 QA Main Task. In Proceedings of TREC 12, pages 480-488.

65. Min-Yen Kan (2003) Metadata extraction and text categorization using Universal Resource Locator expansions. National University of Singapore Department of Computer Science Technical Report, TR 10/03.

66. Min-Yen Kan, Judith L. Klavans, Kathleen R. McKeown (2001) Synthesizing composite topic structure trees for multiple domain specific documents. Columbia University Computer Science Technical Report, CUCS-003-01.

67. Min-Yen Kan (2001) Combining visual layout and lexical cohesion features for text segmentation. Columbia University Computer Science Technical Report, CUCS-002-01.

68. Min-Yen Kan and Kathleen R. McKeown (1999) Information Extraction and Summarization: Domain Independence through Focus Types. Columbia University Computer Science Technical Report, CUCS-030-99.

69. Martin Braschler, Min-Yen Kan, Peter Schäuble and Judith L. Klavans (1999) The Eurospider Retrieval System and the TREC-8 Cross-Language Task. Proceedings of TREC-8, Gaithersburg, Maryland, USA: Nov. 1999.

· Ph.D. Thesis: 1

70. Min-Yen Kan, Automatic text summarization as applied to information retrieval: Using indicative and informative summaries, New York, New York, USA: Feb 2003.

· Patents – International: 2

71. "Method for partitioning natural language texts into topical, multi-paragraph segments" M. Kan, J. Klavans and K. McKeown, U.S. Patent 6,473,730, October 2002.

72. “An Automatic System for Forced Alignment between Polyphonic Song and Textual Lyrics”, Y. Wang, M. Kan, T. Nwe, A. Shenoy, J. Yin, US Provisional Application No: 60/582,736, September 2004.

· Patents – National: 1

73. “System for synthetic image classification”, M. Kan, W. Fei, Invention Disclosure, August 2006.

(b) Statement on Significance of Publications

Below I highlight five of my publications that I feel show the significance of my work as an interdisciplinary researcher who bridges and brings together experts in other areas of computer science. I repeat the bibliographic details of each publication for ease of reference, and give my assessment of its significance below each record. Sample versions of these publications are included in my supplementary dossier, under the research portfolio section.

1. (publication #18) Min-Yen Kan (2007) SlideSeer: A Digital Library of Aligned Document and Presentation Pairs, In Proceedings of the Joint Conference on Digital Libraries (JCDL '07). Vancouver, Canada, June.

This work introduced a new modality of digital scholarly documents in the form of aligned presentation and document pairs. I formalized the problem as aligning two streams of text units and showed how an automatic system can be built from component documents and presentations found on the web. This paper featured all aspects of building the implemented system, from backend processing, core alignment to user interface design. It has already been introduced as requisite reading in a few digital library programs (notably, Rick Furuta’s course at Texas A&M).

2. (publication #4) Hang Cui, Min-Yen Kan and Tat-Seng Chua (2007) Soft Pattern Matching Models for Definitional Question Answering, ACM Transactions on Information Systems (TOIS), 25(2). April.

This was the culmination of my co-supervision of my Ph.D. student Hang Cui. Here we formalize soft pattern models, a stochastic model for improving approximate text matching, of use to many applications where regular expressions on language tokens are the norm. Using soft pattern models of different forms, applications can greatly improve recall. By using this technology, the use of expensive manual effort can be greatly reduced. We give two formal models of such technology in this work, and show a case study in definitional question answering, where significant improvements are realized.

3. (publication #47) Min-Yen Kan (2004) Web Page Classification Without the Web Page. In Proceedings of the 13th International World Wide Web Conference (WWW2004), May 2004. New York, New York, USA. Poster Paper.

Although only a poster paper, this work received a lot of attention from industry, as it introduced a lightweight method to classify millions of webpages in minutes, by just analyzing their URLs. Previous approaches all downloaded the webpages or analyzed their linking structure to determine a classification, taking hours for thousands of webpages. This work introduced and formalized another form of analysis (URL analysis) that has since been used in numerous web classification studies.

4. (publication #13) Wang Ye, Min-Yen Kan, Tin Lay Nwe, Arun Shenoy and Jun Yin (2004) LyricAlly: Automatic Synchronization of Acoustic Musical Signals and Textual Lyrics. In Proceedings of ACM Multimedia 2004 (MM '04), New York, USA, 10-16 October.

This was the first of three publications where I worked jointly with Dr Wang Ye and his student team on lyric and music alignment (all the other authors were students from his team). I was responsible for the text processing core, for porting the NLP technology to achieve the project goals, and for coordinating the editing and structuring of the paper. The paper won the best student paper award, and has since culminated in a top-tier journal paper.

5. (publication #15) Judith L. Klavans and Min-Yen Kan (1998) Role of Verbs in Document Analysis. In Proceedings of COLING/ACL 98, Montréal, Québec, Canada: Aug. 1998. pp. 680-686.

This was the pioneering work on verb analysis that I did with one of my co-supervisors as a PhD student. It represents one of the earliest works showing how verbs (as a part of speech) can make a valuable contribution to downstream processing. Earlier works focused mostly on noun and noun modifier analysis, due to the richer ontological resources and their relatively low number of different word senses. We showed that verb analysis could be helpful in classifying documents to genres and event types, paving the way for later work in verb-centric analyses of sentences, such as semantic role labeling.

(c) Non-publication Research Highlights

I would also like to highlight six contributions that demonstrate my research contribution and initiative, and differentiate my work from that of other colleagues.

1. Digital Anthologies Special Interest Group (dAnth SIG). External to SoC, I organize and lead an informal consortium of several universities (including Cambridge, Univ. of Michigan, Hiroshima Univ., Macquarie Univ. and Penn State Univ.) in dealing with issues that concern digital anthologies, an important sub-branch of digital libraries. This collaboration has grown into a birds-of-a-feather meeting at the ACL (rank 1) NLP conference and led to joint publications describing canonical datasets and tools for digital libraries, where I am the lead author.

2. CHIME text processing seminar. I organize the CHIME lab’s text processing seminar, started in Sem I, 2004. You are welcome to visit the seminar home page at http://wing.comp.nus.edu.sg/chime/textSeminar.html, which contains the list of talks, abstracts and their slides. To date we have had over 70 meetings. Our speakers are drawn from text processing staff at SoC, I²R, DSO, CLIPS-IMAG and NTU. The mass mailing about the seminar reaches over 100 participants across various Singaporean institutions. We have hosted a number of overseas visitors as part of this seminar series, including researchers from industry (Microsoft Research, Google, Baidu, Yahoo!, USC/ISI, IBM) and prominent universities in Multimedia, NLP and machine learning (Univ. of Toronto, Arizona State Univ., MIT, CMU).

3. Research Infrastructure. I have organized a department-internal framework for doing natural language and web-related research. This framework consists of tools, datasets and papers, all organized under one access point, so that students and staff doing NLP research have a complete listing of tools in one place. The catalog of the research tools and corpora is searchable by regular web search engines. This allows new researchers (HYP, UROPs and graduate students) to get a head start in their research areas. Profs. Chua Tat-Seng, Lee Wee Sun, Tan Chew Lim and Ng Hwee Tou have all benefited from this infrastructure work. I invite you to poll them on how much impact this repository has had on their research. To date, over 178 separate tools and datasets have been installed. This resource is often searched by external researchers and students and highlights SoC’s interests in this research area.

4. Promotion and deployment of publicly available outcomes. A successful research group also encourages other research groups to use the output of the research not only in theory but also in practice. I have encouraged my group of students to publicize their research implementations and to make them available for public use.

· Corpora: These are my research group’s contribution back to the community to help standardize evaluation measures as well as garner more impact and citations for my research group. NUS SMS corpus: we compiled and released a corpus of over 10,000 Short Message Service (SMS) messages. This has been downloaded and used by over 11 parties internationally and accessed by researchers around the world (including Brown University and the University of Lausanne) for their research in language change and student projects. Locally we have used this corpus as part of a HYP research project looking into effective SMS writing. Other corpora released in past years include corpora for 1) Javascript annotation, 2) light verb annotation, 3) synthetic image search, 4) automatic keyphrasing, and 5) scholarly documents. The keyphrase corpus is being used at the Univ. of Waikato (a leader in this research area) and the synthetic image corpus and tool has attracted industry attention in Europe (stalled in negotiations with ILO). In this upcoming year, we are planning to make additional corpora available on 1) scenario template generation and 2) a re-release of the scholarly document collection, due to substantial material additions.

· Tools: I encourage my students to make their tools into web services and downloadable tools. Defsearch, a research output of a PhD alumnus under my co-supervision, attracted interest from Google. PARCELS, a web page block classifier, was developed as an open-source Sourceforge project, to encourage the advanced user community to use it. JavaRAP, another tool, has been packaged as an installable package for Linux, to make it easy for other natural language processing groups to install and use. MeURLin, a webpage classifier, has attracted the interest of Prof. Steve Reiss of Brown University. All of these tools have web pages associated with them to encourage the research community to learn about them and to use them. Other tools recently released to the public by our group include tools for anaphora resolution, sentence similarity and light verb detection. In the past year, we have increased this tool support by making open source tools for citation parsing, in a collaborative effort with the CiteSeer team at Penn State Univ., launched in May this year. In addition, an HYP student’s project, a new user interface for library catalogs, was also fully implemented in 2008. This has led to us fielding it in the NUS libraries and at the Colorado State University, again increasing my group’s impact and showing our capability for real-world implementation as well as research. In this upcoming year, we plan to make additional tools available. These tools are central to open-source digital library initiatives and natural language processing research. These include software packages for 1) record linkage, 2) title page metadata parsing, 3) sentence similarity, 4) paraphrase recognition and 5) keyphrase extraction.

· Federated Evaluation: Evaluation of research is vital, and it is a bottleneck especially for natural language processing research. I have been collaborating with my former advisor at Columbia University to form an evaluation federation, in which students and staff from participating universities can assist each other in their evaluations. So far, we have assisted CU with two evaluations and have had their help with two as well.

5. Graphical Models Reading Group. Along with A/P Lee Wee Sun, I organized a reading group for graduate students interested in graphical models for learning from 2005 to 2007. We met biweekly to push students towards a better understanding of the techniques they are using. As a result of this reading group, some students in my group and in Prof. Ng Hwee Tou’s group have formed a Natural Language Processing Reading Group on their own initiative. Leadership of the reading group has since passed to Chieu Hai Leong, a postdoc under Lee Wee Sun.

6. Digital Library Collection Building. With the cooperation of Workshop and Central Facilities and collaborators worldwide, we have begun building a sizeable collection of scholarly research papers in computer science. We are now mirroring the CiteSeerX database of papers and the ACL Anthology, and we hold the metadata for DBLP. This will bring the number of scientific papers accessible to my group to over 2M documents, a very large collection by international standards. This collection serves not only our research agenda but also the public good for all computer science researchers.

(d) Statement on Co-Authorship:

While most of my publications are joint with students (as is typical for my discipline), I also publish, research and program independently of my students’ research topics. I have also had the time to do research entirely on my own as a single author (SlideSeer, publication #18; MeURLin, publication #47), as evidenced above in my research significance statement. In the past year, I have taken the lead role, in terms of writing and coordination, in two large collaborative papers across multiple universities (ACL Anthology, publication #27; ParsCit, publication #28), although we have agreed to list authors in alphabetical order.

B.3 Research Performance Indicators

(a) Citations (current as of Aug 2008)

I offer a comparison between my own citation counts and those from our department’s historical statistics. I conclude that I have been relatively successful compared against my peers overall. I believe this is due to my standing as an interdisciplinary researcher.

Please note that for a researcher in computer science, journal publications are not as important, as such publications often take 1-2 years from submission to publication. Given how fast my discipline changes, conference publications are the preferred vehicle for scholarly dissemination, and citations to these sources should be considered most telling. As such, I offer citation analysis only for sources that include conference information, namely CiteSeerX and Google Scholar. Also, fortunately, my name is relatively rare (I know of no other person in computer science who has my name), so the records from these search engines are clean.

Our department also keeps statistics for past tenure cases, so I’ve included these statistics (max/average/min) in the second line for your comparison.

CiteSeerX:

Me: 153 citations (excludes self cites) for 41 documents indexed, 32 cites for top paper

Computer Science Dept. (Max/Avg/Min): 242/129/41 for all papers, 198/56/8 for top paper

Self-interpretation of CiteSeerX results: I come in a bit above the average for total paper citations, compared to other peers that have applied for tenure. While my top paper is not cited as much as others, I interpret it as a good sign, showing that my citations are not due to any single paper, but rather to continuous contribution to the community.

Google Scholar:

Me: 789 citations for all papers (~40 documents indexed), 74 cites for top paper

Computer Science Dept. (Max/Avg/Min): 753/341/108 for all papers, 533/120/29 for top paper

Self-interpretation of Google Scholar results: Here, the results mirror those of CiteSeerX, except that I’ve surpassed the highest number of citations in Scholar among all others in my department who have applied for tenure. While I do not feel that such a single indicator should be taken very seriously, I feel it does lend credibility to my case as an excellent and visible (impactful) researcher. A more detailed analysis of the results shows a long-tail effect: much of my more recent work is still gathering citations. No single paper dominates my citation counts. Overall, over ½ of my publications have been cited in Scholar, again reinforcing my belief that my contributions to research excellence are continuous and long-term.

(b) Research grants

I have been successful in securing monies internally within the University in past years, and have large (100K+) long-term projects in progress.

Currently, I am in the midst of securing collaborations with different partners to ensure the long-term security of my research unit. These grants are typically large (1-10M SGD), involving multiple PIs. Please see the “Grants in Preparation” section on the next page for details.

· Grants Approved

External Grants (Governmental)

1. Interactive Media Search (IDM R&D programme)

Collaborator (Lead PI: Prof. Chua Tat-Seng), 1.9M SGD, Ongoing, started on 1 Nov 2007

My group will be working on hierarchical indexing of semi-structured data (e.g., patent and legal documents). My group is allocated about 425K of the whole budget.

External Grants (Industry)

1. Document Information Mining for Digital Libraries (HP)

Co-PI (Lead PI: Prof Tan Chew Lim), 23,500 SGD, Completed, started on 1 Oct 2006.

SERC Human Factors Engineering (HFE) Pilot program

1. Empirical Usability Studies with E-Learning Systems: Towards Executable Cognitive User Models as Design and Usability Evaluation Aids

Co-PI (Lead PI: Prof. Theng Yin Leng, NTU), 24,650 SGD, Completed, started on 26 Jan 2007.

OAP (Inbound) Fellow

I applied to the University’s OAP research funds over the last two terms and obtained funding to bring young and promising researchers to SoC for both networking and collaboration.

1. Visit of A/P Dongwon Lee (Dec 2006) – exceeded proposed goals. Prof. Lee met with a number of faculty members and gave 2 public seminars at NUS and one at NTU.

2. Visit of A/P Hari Sundaram (Aug 2007) – exceeded proposed goals. Prof. Sundaram partially funded his own visit and gave 2 public seminars (one at I2R and one at NUS).

Faculty Research Grants – previously termed Academic Research Fund

1. Mathematical Equation Indexing, Search and Retrieval

PI, 39,500 SGD, Ongoing, started 1 Feb 2007

2. Natural Language Query Analysis for Web Queries

PI, 41,955 SGD, Completed, started 10 Feb 2006.

3. Corpus-Based Query Expansion in Online Public Access Catalogs

(joint with Danny Poo of IS)

PI, 31,000 SGD, Completed, started 1 Jul 2003.

4. Towards multi document indicative summarization via automated metadata extraction

PI, 23,250 SGD, Completed, started 14 Jan 2003.

ICITI Interfaculty research equipment grant

1. Joint NUS Libraries / School of Computing Research Servers

Co-PI (joint with Danny Poo of IS), 60,000 SGD, Completed, started 1 May 2004.

· Grants in preparation

Large International Projects

1. NUS-CALIT2 (Co-PI, total project ~10M SGD)

2. Chinese Singapore Interactive Digital Media (CSIDM) (Co-PI, total project ~10M SGD)

Smaller International Projects

3. University of Maryland Text Analysis for IT Trends

4. Information Retrieval Facility for Scholarly Document Spidering

5. NUS-Duke Graduate Medical School Full Text Analysis

Local Industry Projects

6. Sentiment Analysis for Unnamed Source (withheld by request)

Local University Projects

7. Mobile Multimedia URC -- PI: Dr Wang Ye (Collaborator)

· Pro Bono Research Collaborations (Unfunded)

1. Internet Archive Chinese Interface Translation Project - International

2. NUH Evidence Based Nursing Project - Local

3. World Scientific Data Cleaning Project - Local

(c) Research Awards and Prizes

Our entry in the international Document Understanding Conference (DUC), a yearly competition for automatic summarization, placed 3rd in the update summary task in 2007 among 24 participants.

I was nominated by my department to submit an application for the NUS Young Investigator Award in 2007/2008.

My first Ph.D. graduate, Cui Hang, who is now working at Yahoo! Research and was previously at Google China and India, won the SoC Best Thesis Award in 2007.

Our participation in the international Text Retrieval Conference (TREC) was awarded 1st prize in the definitional question answering task among 28 teams in 2004, and came in 2nd among 33 teams in 2003. This was joint work with my student, Cui Hang, and Prof. Chua Tat-Seng.

Our team’s effort (working together with Dr. Wang Ye) won the best student paper award at ACM Multimedia 2004 (rank 1 conference).

Our entry to the Web Persons Search (WEPS) task came in 3rd place among 17 participants. This was a competition to automatically find and differentiate web pages about people with the same name (e.g., Michael Jordan). Our work was in collaboration with A/P Dongwon Lee and his student at Pennsylvania State Univ.

(d) Membership of editorial boards; conference committees

Note that many conferences use the terms “reviewer” and “programme committee member” synonymously. I have purposely relegated such memberships to the “Service as a reviewer” item below; the memberships I list here involve organizing the conference, in particular its scientific committee.

· International Journal

• Information Retrieval, editorial board member (since 2007)

• International Journal on Digital Libraries, special issue organizer for “Very Large Digital Libraries”, joint with Ee Peng Lim (Nanyang Technological Univ.) and Dongwon Lee (Pennsylvania State Univ.)

• International Journal of Computational Linguistics and Chinese Language Processing, special issue organizer for “Cross-Lingual Information Retrieval and Question Answering”, joint with Wen-Hsiang Lu (Nat’l Cheng-Kung Univ., Taiwan) and Jianfeng Gao (Microsoft Research Asia)

· Conferences – International

(please see “Service to International Academic Community” D.2.a for details)

• Association for Computational Linguistics (ACL) 2008 (rank 1), area chair for Information Extraction.

• Association for Computational Linguistics (ACL) 2009 (rank 1), local committee

• ACM Special Interest Group on Information Retrieval (SIGIR) 2008 (rank 1), registration chair

• Asia Information Retrieval Symposium (AIRS) 2006, publication chair

(e) Service as a reviewer

(See note in (d) above)

· International journal

• Journal of Artificial Intelligence Research (JAIR), 2007, reviewer

• Data and Knowledge Engineering Special Issue, reviewer

• Information Processing and Management, reviewer

• International Journal of Information Technology, 2005, 2007, reviewer

· Conferences – International

• Association for Computational Linguistics (ACL; rank 1) 2003, 2005 - reviewer

• Empirical Methods in Natural Language Processing (EMNLP; rank 1) 2005, 2006, 2007, 2008 - reviewer

• International Joint Conference on Artificial Intelligence (IJCAI; rank 1) 2005, 2007 - reviewer

• Joint Conference on Digital Libraries (JCDL; rank 2) 2005, 2006, 2007, 2008 - program committee

• EACL (rank 2) 2008 - reviewer

• COLING (rank 2) 2004 - reviewer

• Human Language Technologies / North American Chapter of the Association for Computational Linguistics (HLT-NAACL, rank 2) 2006, 2007 – reviewer

• Conference on Image and Video Retrieval (CIVR) 2006 - reviewer

• Workshop on Multi-source, Multilingual Information Extraction and Summarization (MMIES) 2007, 2008 – program committee

• AAAI Workshop on Event Extraction and Synthesis, 2006 – program committee

• SIG on Information Retrieval (SIGIR) 2006 – poster reviewer

• Intelligent User Interfaces (IUI) 2006, 2007 – program committee

• Recent Advances in Natural Language Processing (RANLP) 2005 - reviewer

• International Joint Conference on Natural Language Processing (IJCNLP) 2004 - program committee

• International Natural Language Generation Conference (INLG) 2004 - reviewer

• Workshop on Web Information and Data Management (WIDM) 2003, 2004, 2005, 2008 - reviewer

· Conferences – Regional

• Australasian Language Technology Workshop (ALTW) 2006 - reviewer

• Knowledge Discovery and Language Learning (KDLL) 2006 - program committee

• Asia Information Retrieval Symposium (AIRS) 2005 - reviewer

• International Conference on Asian Digital Libraries (ICADL) - 2003, 2007, program committee

(f) Consultancy

• Evaluated a project proposal for the Hong Kong Research Grants Council, Mar 2008

• Evaluated a project proposal for the Malaysian Multimedia Super Corridor (MSC), Jul 2004

(g) Invited presentations at scholarly meetings/workshops

· International:

1. NLP Lecture

Host: Dr Dina Demner, MD, Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine (Bethesda, Maryland, USA, 7 Apr 2008)

2. Web IR/NLP Group (WING) @ NUS

Host: Chin-Yew Lin and Ming Zhou, Microsoft Research Asia, Workshop on Web-Scale Natural Language Processing (Daejeon, Korea, 21-22 Feb 2008)

3. Elements of Enterprise NLP

Host: C Anantaram, Tata Consultancy Services, Workshop on Natural Language Applications in Enterprise Class Systems (New Delhi, India, 4-5 Oct 2007)

4. Linked Anthology Proposal (joint with Brett Powley, Macquarie Univ.)

Host: ACL Executive Board, at ACL Conference (Prague, Czech Republic; 24 June 2007)

5. Recent Directions in List and Definitional Question Answering

Host: Dongwon Lee and Prasenjit Mitra, at Pennsylvania State Univ (20 Feb 2006)

6. Text processing for Web page classification and SMS optimization

Host: Hari Sundaram, at Arizona State University (24 Feb 2005)

7. Question Answering Research at NUS

Host: Kathleen R. McKeown, at Columbia University (17 May 2004)

· Regional

1. Latest Trends in Web Research and Web Data Mining

Host: Chiew Ying Oi, World Scientific Publishers (Singapore, 23 September 2008)

2. Automatic Text Summarization in Information Retrieval

Host: Dekai Wu, at Hong Kong University of Science and Technology (21 March 2003).

3. Automatic Text Summarization in Information Retrieval

Hosts: Wai Lam and Ee Peng Lim, at Chinese University of Hong Kong, SEEM (19 March 2003).

· National

1. NLP in Information Retrieval and Machine Translation

Host: Stephane Bressan, at NUS for Malay Indonesian NLP workshop (13 Nov 2007)

2. Next generation Digital Libraries

Host: at SERC Human Factors Engineering Workshop (24 Aug 2006)

3. Trends in automatic text summarization

Host: Chieu Hai Leong, at DSO National Labs (14 Oct 2003)

B.4 Future Plans

With tenure, I hope to be able to concentrate more on putting research into practice. My long-term goal is to move away from paper count and to emphasize impact. There are many interdisciplinary real-world problems that sorely need the rigor and methods of traditional theoretical fields such as machine learning and natural language processing. I plan to channel more energy into Scholarly Digital Libraries, my personal research focus, and to finish deploying a better full-text digital library system, a long-term effort of mine. This system, ForeCite, will give my group a cutting-edge advantage in exploring as yet undiscovered problems in academic information seeking, publishing and collaborative work, which can influence not only the whole of computer science but other disciplines as well.

Putting research into practice also gives my group the means to build good ties with industrial partners, which helps show the relevance of our research, ensure practical impact and provide a destination for graduating students. This is already happening with the work that my group has done pro bono (see “Pro Bono Research Collaborations”, B.3.b). As opportunities arise, I also need to have the right talent within my group to meet these targets. Ensuring that my advisement spans the whole spectrum (see “Student Supervision” A.2.b) helps to achieve this.

If my tenure is approved, I plan to take a sabbatical right away. The current plan is to take it at UC Irvine, as it is a key partner in our NUS-CALIT2 initiative (see earlier discussion in “Grants in Preparation” B.3.b). I have liaised with the PIs internally as well as those at UC Irvine and have secured their blessings for this move. I think it is a particularly good idea, as it advances my research agenda and creates better ties between UC Irvine and NUS for the collaborative proposal.

NUS-CALIT2 needs to have visiting faculty going in both directions. Currently, we host faculty from UC Irvine, but not vice versa. We need to participate in the interchange if we want our voices to be heard in the US arena. Thus we need an ambassador (or more if possible) to facilitate this interchange. Many of our current faculty have permanent residence in Singapore with families that are unable to relocate. Having me in a CALIT2 position for a whole year may make it easier for NUS visitors to be accommodated during the period I am stationed there.

UC Irvine is home to the Donald Bren School, a faculty that mirrors an information school, blending together Informatics, Computer Science and Statistics. This blend of disciplines is an ideal environment for my work, which combines practical aspects (Human Computer Interaction, Digital Libraries, system building) and theoretical aspects (Natural Language Processing, Machine Learning, Information Retrieval).

It is also ideally situated in California, relatively close to many local industries as well as Silicon Valley. This makes it easy for my group’s system building to translate into impact on real-world products and devices. “The hallmark of our research is an empirical grounding in real-world practice” is a motto from the Bren School, which I think also characterizes the type of research I practice.