School of Computing
Department of Computer Science
CS5344: Big Data Analytics
For the readings, we will
denote the book Mining of Massive Datasets as MMD, and the data mining
reference text as DM. For topics covered in MMD, we have used the slides
provided by the authors (with minor modifications).
Week 1 – 15 Jan 2016
· Introduction & Overview [slides]
· This lecture will give an introduction to Big Data
Analytics Technology, followed by an overview of the module.
Week 2 – 22 Jan 2016
· Basics of Hadoop and MapReduce [slides]
· This lecture will give a quick overview of
· Reading: MMD Chapter 2: Large-Scale File Systems
Week 3 – 29 Jan 2016
· Don’t they look the same? [slides]
· One of the common tasks in analytics is to
find objects that look alike. In forensic text analysis, e.g., plagiarism
detection, significant textual overlap between two documents suggests that
one was likely copied or derived from the other. In online retailing, users
are constantly shown recommended items, e.g., if you are browsing for a book
on Big Data Analytics, you are likely to be recommended books whose titles
contain keywords such as “Big Data” or “Analytics”. This lecture will look at
some methods for similar item detection.
· Reading: MMD Chapter 3: Finding Similar Items.
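As a rough illustration of similar item detection, the sketch below compares documents by the Jaccard similarity of their character shingles, one of the ideas covered in MMD Chapter 3. The function names, the shingle length k=5, and the sample strings are assumptions made for this example, not material from the lecture.

```python
def shingles(text, k=5):
    """Return the set of k-character shingles of a lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    if len(normalised) < k:
        return {normalised}
    return {normalised[i:i + k] for i in range(len(normalised) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical documents, chosen only to show a similar and a dissimilar pair.
doc1 = "big data analytics for everyone"
doc2 = "big data analytics for beginners"
doc3 = "cooking with seasonal vegetables"

sim_close = jaccard(shingles(doc1), shingles(doc2))
sim_far = jaccard(shingles(doc1), shingles(doc3))
print(f"doc1 vs doc2: {sim_close:.2f}")
print(f"doc1 vs doc3: {sim_far:.2f}")
```

Comparing every pair of documents this way does not scale to large collections, which is why techniques such as minhashing and locality-sensitive hashing are used to approximate it.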
Week 4 – 5 Feb 2016
· The human in the Big Data Loop [slides]
· Lecture by Professor H.V. Jagadish, University of
· We desire to have systems that are fully automatic. However, in dealing with
big data analytics, full automation is not an option. This lecture will explain
the importance of keeping humans in the big data loop.
Week 5 – 12 Feb 2016
· What did most people buy, and why? [slides]
· In many applications, it is important to find
objects/events that co-occur, or how occurrences of some objects/events may
influence the occurrences of other objects/events. For example, a supermarket
may want to find items that are frequently purchased together in a single
transaction, say diapers and beer. Knowing this, it can deploy a clever marketing
strategy to increase its profit margin, say, by giving a discount on diapers while
raising the price of beer. It is also interesting to figure out why people buy
these items together: if they buy diapers, they probably have young children,
which means they are likely to be “grounded” at home looking after them, and so
they may be bored and end up drinking (and watching TV) to pass the time.
This lecture will look at methods for finding such “frequent itemsets” and the
corresponding “association rules”. While we will not talk much about “why”, it is
important to understand the reasons for such co-occurrences for the answers to be meaningful.
· Reading: MMD Chapter 6: Frequent Itemsets.
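To make “frequent itemsets” and “association rules” concrete, here is a minimal sketch that counts item pairs by brute force and computes the confidence of one rule. It is a naive counter rather than the A-Priori algorithm from MMD Chapter 6, and the baskets, function names, and support threshold are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count every item pair and keep those appearing in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical (alphabetical) representation.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    (baskets containing both) / (baskets containing the antecedent)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


# Hypothetical transactions, echoing the diapers-and-beer example.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

pairs = frequent_pairs(baskets, min_support=2)
conf = confidence(baskets, "diapers", "beer")
print(pairs)
print(f"confidence(diapers -> beer) = {conf:.2f}")
```

Counting all pairs directly needs memory proportional to the number of distinct pairs, which is the cost that A-Priori reduces by first discarding infrequent single items.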
Week 6 – 19 Feb 2016
This week, there will be a 45-min to 1-hour test, followed by an hour of lecture.
· Is BIG necessarily better? [slides]
· This lecture will illustrate with real examples
that having more (big) data does not necessarily lead to better analytics
(discovery). In fact, it is important to be able to know what you are looking
for, to ask the right questions, and to use the right tools.
· A short article on the subject. See another article on Google Flu
Week 7 – 26 Feb 2016
Recess Week. No lecture.
Week 8 – 4 Mar 2016
· What should we recommend? [slides]
· Recommendation systems are common today. This
lecture details some of the methods used in recommendation systems.
· Reading: MMD Chapter 9: Recommendation Systems; see also the article on The Long Tail
Week 9 – 11 Mar 2016
· Is MapReduce all that we have? [slides]
· This lecture will look at what lies beyond
MapReduce. In particular, we will examine alternative big data
technologies, e.g., NoSQL, Hive, Pig, etc. In addition, we will also look at limitations
Week 10 – 18 Mar 2016
· Who is the authority on the Internet? [slides]
· The graph is a widely used representation for social networks, the internet, web pages, and
so on. It is thus important to be able to analyze graphs efficiently. For
example, in social networks, it may be useful to know the hub nodes (e.g.,
a person who has authority or influence over others). This lecture will study
algorithms for such analysis.
· Reading: MMD Chapter 5: Link Analysis
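One well-known way to score the importance of nodes in a graph is PageRank, which is covered in MMD Chapter 5. The sketch below is a small power-iteration version under stated assumptions: the graph, the damping factor beta=0.85, the fixed iteration count, and the function name are illustrative choices, and dead ends are handled by spreading their rank evenly over all nodes.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each node to the list of nodes it points to;
    every node must appear as a key, even if it has no out-links.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleport term: every node receives an equal share of (1 - beta).
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = beta * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: distribute its rank evenly so total rank stays 1.
                share = beta * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

In this toy graph, node C is linked to by three of the four pages and ends up with the highest score, while D, which nothing links to, keeps only its teleport share.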
Week 11 – 25 Mar 2016
Good Friday. No lecture.
Week 12 – 1 Apr 2016
· How many groups do we have? Which group do you belong to? [slides]
· Very often, we need to segment the data into
groups. For example, in business, it makes sense to organize your customers
into groups, perhaps with different needs. In this way, each individual can be
targeted more effectively. For social networks, it may also be useful to identify
the communities within the network. We will study cluster analysis for both
structured and graph data.
· Reading: MMD Chapter 7: Clustering; Chapter
10: Analysis of Social Networks. The paper on the DBSCAN algorithm can be downloaded;
it received the most impactful paper award at KDD 2014.
Week 13 – 8 Apr 2016
· A picture is worth a thousand words
· Guest lecture by Dr Zhao Shengdong, Department of Computer Science, School of
· An important aspect of big data is the presentation
of the answers. This lecture will review information visualization as a
tool for presenting results in a way that is more appealing to end-users.
Week 14 – 15 Apr 2016
This week, there will be a 45-min to 1-hour test, followed by a 1-hour lecture.
· How am I going to browse through all these? [slides]
· Very often, users may not know what they want.
They may simply try “looking” at the data. However, such “exploration” tasks
have to be done in real-time as the users are not going to sit around for an
hour to see some preliminary results.
This lecture will investigate methods for “real-time data exploration”.