COMPGI15 - Information Retrieval & Data Mining

Note: Whilst every effort is made to keep the syllabus and assessment records correct, the precise details must be checked with the lecturer(s).

Code
COMPGI15 (Also taught as: COMPM052)
Year
4
Prerequisites

Term
2
Taught By
Jun Wang (100%)
Aims
The course offers an entry-level study of information retrieval and data mining techniques. It is about how to find relevant information and subsequently extract meaningful patterns from it. While the basic theories and mathematical models of information retrieval and data mining are covered, the course focuses primarily on practical algorithms for textual document indexing, relevance ranking, web usage mining, and text analytics, as well as their performance evaluation. Practical retrieval and data mining applications such as web search engines, personalisation and recommender systems, business intelligence, and fraud detection will also be covered.
Learning Outcomes
Students are expected to master both the theoretical and practical aspects of information retrieval and data mining. More specifically, students will understand:

  1. The basic concepts and processes of information retrieval systems and data mining techniques.
  2. The common algorithms and techniques for information retrieval (document indexing and retrieval, query processing, etc.).
  3. The quantitative evaluation methods for IR systems and data mining techniques.
  4. The popular probabilistic retrieval methods and the probability ranking principle.
  5. The techniques and algorithms used in practical retrieval and data mining systems, such as those in web search engines and the Amazon book and Last.FM recommender systems.
  6. The challenges and existing techniques for the emerging topics of MapReduce, portfolio retrieval, and online advertising.

Content:

Overview of the fields

Study some basic concepts of information retrieval and data mining, such as the concept of relevance, association rules, and knowledge discovery. Understand the conceptual models of an information retrieval and knowledge discovery system.

Indexing

Introduce various indexing techniques for textual information items, such as inverted indices, tokenisation, stemming, and stop-word removal.
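
As an illustration of the indexing topic, the following is a minimal sketch of an inverted index, not course material. The stop-word list, the function names, and the whitespace-based tokeniser are assumptions made for this example, and stemming is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?\"'()")
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    result = None
    for term in terms:
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])
```

For example, indexing three short documents and calling `boolean_and(index, ["cat", "dog"])` returns only the ids of documents that mention both terms.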

Retrieval Methods

Study popular retrieval models: 1. Boolean, 2. Vector space, 3. Binary independence, 4. Language modelling. Also covered are the probability ranking principle and other commonly used techniques, including relevance feedback, pseudo-relevance feedback, and query expansion.
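
To make the vector space model concrete, here is a small sketch that ranks documents by the cosine similarity of TF-IDF vectors. It assumes raw term counts for TF and `log(N / df)` for IDF, which is only one of several common weighting schemes; all function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Return document positions ordered by decreasing similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    query_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(query_tokens).items()}
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    # Ties are broken by document position so the ordering is deterministic.
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

Documents that share no terms with the query score zero and fall to the end of the ranking.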

Evaluation of Retrieval Performance

Measurements: Average precision, NDCG, etc. "Cranfield paradigm" and TREC conferences.
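
The two named measures can be sketched as follows. This is an illustrative version only: it assumes binary relevance for average precision and the linear-gain form of DCG with a log2 discount (other formulations, such as an exponential gain, are also in use).

```python
import math


def average_precision(ranked_ids, relevant_ids):
    """Sum precision@k at each rank k holding a relevant document,
    then divide by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def dcg(gains):
    """Discounted cumulative gain of graded relevance values in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents in descending order of relevance has an NDCG of 1; any other ordering scores lower.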

Personalisation and Usage Mining

Study basic techniques for collaborative filtering and recommender systems, such as memory-based approaches, probabilistic latent semantic analysis (PLSA), and personalised web search through click-through data.
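
A memory-based (user-based) collaborative filter can be sketched as below. The sketch assumes non-negative ratings stored as nested dictionaries (user -> item -> rating), cosine similarity over co-rated items, and a similarity-weighted average for prediction; the names and data layout are hypothetical, and practical systems typically add mean-centring and neighbourhood selection.

```python
import math


def user_similarity(ratings_a, ratings_b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user with positive similarity rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = user_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total > 0 else None
```

The prediction is pulled towards the ratings of the users whose past ratings most resemble those of the target user.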

Data Mining

Study basic techniques, algorithms, and systems of data mining and analytics, including frequent pattern, correlation, and association analysis, anomaly detection, and click-through modelling.
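
For association analysis, the two standard rule measures, support and confidence, can be computed directly from a list of transactions. The sketch below is illustrative only and assumes a non-empty list of transactions, each given as a set of items; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    matches = sum(1 for t in transactions if itemset <= set(t))
    return matches / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0
```

Frequent-pattern algorithms such as Apriori avoid enumerating every itemset by pruning candidates whose subsets already fall below a minimum support threshold.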

Emerging Areas

Peer-to-peer information retrieval and MapReduce; Machine translation; Online (web) Advertising; Learning to Rank; Portfolio retrieval and Risk Management.

Method of Instruction:

Lecture presentations, Practical exercises

Assessment:

The course has the following assessment components:

  • Written Examination (2.5 hours, 60%)
  • Coursework Section (1 piece, 40%)

To pass this course, students must:

  • Obtain an overall pass mark of 50% for all sections combined

The examination rubric is:
Answer any THREE of four or five questions

Resources:

Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008.

Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison-Wesley, 2006.

Managing Gigabytes (2nd Ed.), Ian H. Witten, Alistair Moffat and Timothy C. Bell, Morgan Kaufmann, San Francisco, California, 1999.

Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer, 2006.

course website