Sat Feb 10 15:01:20 GMT-8 2007 : Added SALSA paper from last week's lecture and added some short descriptions of certain papers.

Note that the syllabus and readings are still in flux for the time being. Readings marked with a "*" will be present in the course pack. Readings in small print are primary materials (i.e., original conference and journal papers); read these after the secondary materials (i.e., textbook chapters) if possible.

The textbook slide sets linked from this page are for reference only. I do not vouch for the correctness of the material presented in any of the linked slide sets. I will likely use a composite of slides compiled from these and various other sources, but these may not be made available on the internet due to preparation time constraints.

The hyperlinks here all work as of Tue Dec 19 23:16:21 GMT-8 2006, when I updated this page. Use a search engine with the appropriate text if the links below stop working.

Date Description Deadlines
Week 0:
Prerequisites

(Please read before coming to class and be familiar with the material)

Readings:

  • *P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 1 "Mathematical Foundations" of Modeling the Internet and the Web. Wiley. (Covers the basic math foundation needed for the course. The topics introduced here are a nutshell version of most of the material we will cover in more depth in class. Warning: this is a dense chapter; expect to read it a couple of times. Contents: Probability from a Bayesian Perspective, Parameter Estimation from Data, Mixture Models and the Expectation Maximization Algorithm, Graphical Models, Classification, Clustering, Power Laws)
Week 1:
(8 Jan)
Introduction to Web-Based Searches

Readings:

  • P. Baldi, P. Frasconi and P. Smyth (2003) Chapters 2 and 3 "Basic WWW Technologies" and "Web Graphs" of Modeling the Internet and the Web. Wiley. (You should already be familiar with Chapter 2's material from the Hypermedia or equivalent pre-requisite, so you should spend more time reading Chapter 3's material).
    [ Chapter 2 slides (.ppt) ] [ Chapter 3 slides (.ppt) ]
  • C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 19 "Web Search Basics" of Introduction to Information Retrieval. Cambridge UP. (Caution: may be too advanced for this stage in our course. Skim and re-read closer to the end of the course.)
    [ .pdf of Chapter 19 ]
    [ Chapter 19 slides Part 1 (.pdf) ] [ Chapter 19 slides Part 2 (.pdf) ]
  • *S. Lawrence and C.L. Giles (1999). Accessibility of information on the web. Nature, Vol. 400(8), pp. 107-109. (Short note describing how articles easily available on the internet (self-archived) create larger impact)
    [ Link ]
  • *A.-L. Barabási and R. Albert (1999). Emergence of scaling in random networks. Science, Vol. 286. Pre-print
    [ arXiv link ]
Week 2:
(15 Jan)
Intro to IR and Vector-Space Model

Readings:

  • P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.3 "Content-Based Ranking" of Modeling the Internet and the Web. Wiley. (There is a link to the .pdf for the whole of Chapter 4 provided by the authors as their sample chapter. We will be using this chapter as the basic overview for the next couple of weeks.)
    [ .pdf of Chapter 4 from UC Irvine ]
    [ Chapter 4 slides (.ppt) ]
  • C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 7 "Vector Space Retrieval" of Introduction to Information Retrieval. Cambridge UP. (Covers the same as the Baldi book but in more depth.)
    [ .pdf of Chapter 6 ] [ .pdf of Chapter 7 ]
    [ Chapter 6 slides (.pdf) ] [ Chapter 7 slides (.pdf) ]
  • *G. Salton (1972). Dynamic document processing. Communications of the ACM, Vol. 15(7), pp. 658-668.
    [ ACM Portal Link ]
Week 3:
(22 Jan)
Probabilistic IR Model and Language Modeling and Tutorial 0 - Math Foundations

Tutorial 0 will be offered both before (5:00-6:00pm) and after (8:30-9:30pm) class. It will cover Baldi et al., Sections 1.1-1.3. The other sections will be taught later or covered in other SoC modules.

Readings:

  • C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapters 11-12 "Probabilistic Information Retrieval" and "Language Models for Information Retrieval" of Introduction to Information Retrieval. Cambridge UP. (These two chapters should be considered the primary source for this week; skip 11.3.4, 11.4.2-11.5, 12.4)
    [ .pdf of Chapter 11 ] [ Chapter 11 slides (.ppt) ]
    [ .pdf of Chapter 12 ] [ Chapter 12 slides (.ppt) ]
  • P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.4 "Probabilistic Retrieval" of Modeling the Internet and the Web. Wiley.
  • *K. Sparck Jones, S. Walker and S.E. Robertson (1998). A probabilistic model of information retrieval: development and status. Technical Report 446, Cambridge University Computer Laboratory. (This is a very complete description of probabilistic IR from the people who pioneered it; you can just read Sections 2 & 4; if you want to know more about relevance feedback, read Sections 5 and 6)
    [ CiteSeer@NUS Link ]
  • J.M. Ponte and W.B. Croft (1998). A language modeling approach to information retrieval. ACM SIGIR 1998, pp. 275-281. (Discusses the language modeling approach to IR; there is still much more to be done here with increasingly large data sets)
    [ CiteSeer@NUS Link ]
Assignment #1 out
Week 4:
(29 Jan)
Improving Search I - LSA and Adaptive Search and Tutorial - Retrieval 1

Readings:

  • P. Baldi, P. Frasconi and P. Smyth (2003) Section 4.5 "Latent Semantic Analysis" of Modeling the Internet and the Web. Wiley.
  • C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 18 "Dimensionality Reduction and Latent Semantic Indexing" of Introduction to Information Retrieval. Cambridge UP. (Covers the same material as the Baldi et al. book, but in more depth)
    [ .pdf of Chapter 18 ]
    [ Chapter 18 slides (.pdf) ]
  • *S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41(6), pp. 391-407. (An expanded version of the original paper that pioneered dimensionality reduction)
    [ CiteSeer@NUS Link ]
  • T. Hofmann (1999). Probabilistic latent semantic indexing. ACM SIGIR 99. (The breakthrough paper that is the basis for newer Bayesian approaches to dimensionality reduction)
    [ ACM Portal Link ]
Week 5:
(5 Feb)
Improving Search II - Use of Links and Structures

Readings:

  • P. Baldi, P. Frasconi and P. Smyth (2003) Chapter 5 "Link Analysis" in Modeling the Internet and the Web. Wiley.
    [ Chapter 5 slides (.ppt) ]
  • C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapter 21 "Link Analysis" of Introduction to Information Retrieval. Cambridge UP.
    [ .pdf of Chapter 21 ]
    [ Chapter 21 slides (.pdf) ]
  • *S. Brin and L. Page (1998). The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th International World Wide Web Conference (WWW7), Brisbane, Australia, pp. 107-117. (This is the original paper on the PageRank algorithm)
    [ CiteSeer@NUS link ]
  • *T.H. Haveliwala (2002). Topic-Sensitive PageRank. Proceedings of the 11th International World Wide Web Conference (WWW2002), Honolulu, Hawaii, USA. (Making PageRank biased to some "basis" topics by playing with the teleportation factor)
    [ CiteSeer@NUS Link ]
  • J. Kleinberg (1998). Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms. (Describes HITS, which decouples the two ends of each directed edge into hub and authority roles when calculating prestige)
    [ CiteSeer@NUS Link ]
  • R. Lempel and S. Moran (2000). The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Proceedings of the 9th International World Wide Web Conference (WWW9). (Bringing Kleinberg's HITS to a bipartite framework and explaining its benefit to Tightly Knit Communities)
    [ CiteSeer@NUS Link ]
Week 6:
(12 Feb)
Improving Search III - Relations and Passage Retrieval and Tutorial - Retrieval 2

Readings:

  • *R.M. Tong, L.A. Appelbaum, V.N. Askman and J.F. Cunningham (1987). Conceptual information retrieval using RUBRIC. Proceedings of the 10th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'87), New Orleans, Louisiana, USA, pp. 247-253. (An early paper that discusses how thesaural knowledge can be integrated into IR; from the pre-WordNet era)
    [ ACM Portal Link ]
  • *H. Yang, T.S. Chua, S. Wang and C.K. Koh (2003). Structured use of external knowledge for event-based open-domain question-answering. 26th Int'l ACM SIGIR Conference. (Putting together resources in a unified manner)
    [ Link to CMU's copy ]
  • *H. Cui, J.R. Wen, J.Y. Nie and W.Y. Ma (2002). Probabilistic query expansion using query logs. Proceedings of the 11th International World Wide Web Conference (WWW2002), Honolulu, Hawaii, USA. (Query expansion from another external resource: query logs)
    [ CiteSeer@NUS Link ]
  • *H. Yang and T.S. Chua (2003). QUALIFIER: question answering by lexical fabric and external resources. Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 03). (Density-based methods for passage retrieval leading up to question answering)
    [ CiteSeer@NUS Link ]
  • *H. Cui, R. Sun, K. Li, M.Y. Kan and T.S. Chua (2005). Question Answering Passage Retrieval Using Dependency Relations. ACM SIGIR, pp. 400-407. (Better ranking based on grammatical dependencies between words in a passage)
    [ Link to SoC's copy ]
Mid-semester Break (Mon 19 Feb - Fri 23 Feb 2007)
Week 7:
(26 Feb)
Question Answering

Readings:

  • *L. Hirschman, M. Light, E. Breck and J.D. Burger (1999). Deep read: a reading comprehension system. Proceedings of the 37th Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland, USA, pp. 325-332.
    [ CiteSeer@NUS Link ]
  • *D. Moldovan and A. Novischi (2002). Lexical chains for question answering. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan.
    [ CiteSeer@NUS Link ]
  • *E. Voorhees (2002). Overview of the TREC 2002 Question Answering Track. In Notebook of the Eleventh Text REtrieval Conference (TREC 2002), pp. 115-123.
    [ CiteSeer@NUS Link ]
  • Hang Cui, Min-Yen Kan and Tat-Seng Chua (2004) Unsupervised Learning of Soft Patterns for Generating Definitions from Online News. In Proceedings of the 13th International World Wide Web Conference (WWW2004), May 2004. New York, New York, USA.
    [ From Min's Home Page ]
Assignment #1 due
Week 8:
(5 Mar)
Summarization I

Readings:

  • *J. Kupiec, J. Pedersen and F. Chen (1995). A trainable document summarizer. Proceedings of the 18th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), Seattle, Washington, USA, pp. 68-73. (The work that moved summarization away from heuristic approaches and recast it as a learning problem)
    [ CiteSeer@NUS Link ]
  • *T. Nomoto and Y. Matsumoto (2001). A new approach to unsupervised text summarization. Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01), New Orleans, Louisiana, USA, pp. 26-34. (Great paper showing a use of X-means clustering for summarization)
    [ ACM Portal Link ]
Assignment #2 out
Week 9:
(12 Mar)
Summarization II and Tutorial - Summarization

Readings:

  • *G. Erkan and D. Radev (2004) LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. of AI Research. Vol. 22. (Viewing documents as a graph and using PageRank to compute n-best sentences)
    [ Link to UMich copy ]
  • H. Jing and K. McKeown (1999). The decomposition of human-written summary sentences. Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 129-136. (Describes how an HMM can be used to align abstracts to source articles)
    [ CiteSeer@NUS Link ]
  • K. Knight and D. Marcu (2000) Statistics-Based Summarization Step One: Sentence Compression. Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), pages 703-710. (Combines NLP and the noisy channel model to create a sentence compression scheme)
    [ CiteSeer@NUS Link ]
Week 10:
(19 Mar)
Text Categorization

Readings:

  • C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapters 13-14 "Text classification and Naïve Bayes" and "Vector space classification" of Introduction to Information Retrieval. Cambridge UP.
    [ .pdf of Chapter 13 ] [ .pdf of Chapter 14 ]
    [ Chapter 13 slides (.pdf) ] [ Chapter 14 slides (.pdf) ]
  • *Y. Yang and J.O. Pedersen (1997). A comparative study on feature selection in text categorization. Proceedings of the 14th International Conference on Machine Learning (ICML'97), Nashville, Tennessee, USA, pp. 412-420.
    [ CiteSeer@NUS Link ]
  • *Y. Yang and X. Liu (1999). A re-examination of text categorization methods. Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, California, USA, pp. 42-49.
    [ CiteSeer@NUS Link ]
Assignment #1 returned
Week 11:
(26 Mar)
Text Clustering and Tutorial - Categorization

Readings:

  • C. Manning, P. Raghavan and H. Schütze (c. 2007) Chapters 16-17 "Partitional Clustering" and "Hierarchical Clustering" of Introduction to Information Retrieval. Cambridge UP.
    [ .pdf of Chapter 16 ] [ .pdf of Chapter 17 ]
    [ Chapter 16 slides (.pdf) ] [ Chapter 17 slides (.ppt) ]
  • *J.M. Liu and T.S. Chua (2001). Building semantic perceptron net for topic spotting. Proceedings of the 39th Meeting of the Association for Computational Linguistics (ACL'01), Toulouse, France, pp. 370-377.
    [ CiteSeer@NUS Link ]
Week 12:
(2 Apr)
Named Entity Recognition

Readings:

  • *G. Zhou and J. Su (2002). Named Entity Recognition using an HMM-based Chunk Tagger. Proceedings of the 40th Meeting of the Association for Computational Linguistics (ACL'02), pp. 473-480.
    [ CiteSeer@NUS Link ]
  • *S. Baluja, V. Mittal and R. Sukthankar (1999). Applying machine learning for high performance named-entity extraction. Pacific Association for Computational Linguistics (PACLING'99), Waterloo, Canada.
    [ CiteSeer@NUS Link ]
Week 13:
(9 Apr)
Information Extraction

Readings:

  • *C. Cardie (1997). Empirical methods in information extraction. AI Magazine, 18(4): 65-79. Special Issue on Natural Language Processing.
    [ CiteSeer@NUS Link ]
  • *S. Soderland (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, Vol. 34(1-3), pp. 233-272.
    [ CiteSeer@NUS Link ]
  • *J. Xiao, T.S. Chua and J.M. Liu (2003). A global rule induction approach to information extraction. ICTAI 2003.
    [ IEEE Xplore Link ]
Assignment #2 due
Reading Week (Sat 14 Apr - 20 Apr 2007)
Final Exam (Mon 30 Apr, evening)

Min-Yen Kan <kanmy@comp.nus.edu.sg> Created on: Mon Dec 1 19:36:22 2003 | RCS: $Id: syllabus.html,v 1.2 2004/08/11 06:00:38 kanmy Exp kanmy $ | Version: 1.0 | Last modified: Tue Apr 3 10:29:01 2007