School of Computing

Department of Computer Science

CS5344: Big Data Analytics Technology




For the readings, we will denote the book Mining of Massive Datasets as MMD, and the data mining reference text as DM. For topics covered in MMD, we have used the slides provided by the authors (with minor modifications).

Week 1 – 15 Jan 2016

·       Introduction & Overview [slides]

·       This lecture will give an introduction to Big Data Analytics Technology, followed by an overview of the module.

·       Reading: MMD Chapter 1: Data Mining; DM Chapter 1: What is it all about?; Article on Big Data and Its Technical Challenges

Week 2 – 22 Jan 2016

·       Basics of Hadoop and MapReduce [slides]

·       This lecture will give a quick overview of MapReduce/Hadoop.

·       Reading: MMD Chapter 2: Large-Scale File Systems and MapReduce
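As a rough illustration of the MapReduce programming model, the sketch below simulates the map, shuffle, and reduce steps of a word count in plain Python on a single machine. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this example only.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (single process, no cluster)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, for illustration only.
counts = word_count(["big data", "Big analytics"])
```

In a real cluster the map and reduce calls would run in parallel on different machines, and the framework would perform the shuffle over the network.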

Week 3 – 29 Jan 2016

·       Don’t they look the same? [slides]

·       One of the common tasks in analytics is finding objects that look alike. In forensic text analysis (e.g., plagiarism detection), two documents that share significant overlapping text suggest that one was likely copied or derived from the other. In online retailing, users are constantly shown recommended items; e.g., if you are browsing for a book on Big Data Analytics, you are likely to be recommended books whose titles contain keywords such as “Big Data” or “Analytics”. This lecture will look at some methods for similar item detection.

·       Reading: MMD Chapter 3: Finding Similar Items.
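One simple way to quantify how much two documents overlap is the Jaccard similarity of their character shingles. The sketch below is a minimal illustration of that idea; the shingle length `k` and the sample strings are arbitrary choices for this example, and for large collections a scalable variant (e.g., minhashing) would be needed instead of exact set comparison.

```python
def shingles(text, k=3):
    """Return the set of all length-k character substrings (k-shingles) of text."""
    normalized = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical strings: {"ab","bc","cd"} vs {"bc","cd","de"} share 2 of 4 shingles.
similarity = jaccard(shingles("abcd", k=2), shingles("bcde", k=2))
```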

Week 4 – 5 Feb 2016

·       The human in the Big Data Loop [slides]

·       Guest Lecture by Professor H.V. Jagadish, University of Michigan

·       Traditionally, we desire to have systems that are fully automatic. However, in dealing with big data analytics, full automation is not an option. This lecture will explain the importance of keeping humans in the big data loop.

Week 5 – 12 Feb 2016

·       What did most people buy, and why? [slides]

·       In many applications, it is important to find objects/events that co-occur, or to understand how occurrences of some objects/events may influence the occurrences of others. For example, a supermarket may be interested in finding items that are frequently purchased in a single transaction, say diapers and beer. Knowing this, it can deploy a clever marketing strategy to increase its profit margin, say by giving a discount on diapers but raising the price of beer. It is also interesting to figure out why people buy these items together: if they buy diapers, they have young children, which means they are likely to be “grounded” at home looking after them, and as such they may be bored and end up drinking (and watching TV) to pass the time … This lecture will look at methods to find such “frequent itemsets” and the corresponding “association rules”. While we will not talk much about “why”, it is important to understand the reasons for such co-occurrences for the answers to be meaningful.

·       Reading: MMD Chapter 6: Frequent Itemsets.
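To make “frequent itemsets” and “association rules” concrete, the sketch below counts frequent pairs in a handful of baskets and computes the confidence of one rule. It follows a two-pass idea (count single items first, then count only pairs of frequent items); the baskets, the support threshold, and the function names are made up for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return (frequent pair counts, single-item counts) for a list of baskets."""
    # Pass 1: count how many baskets contain each item.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs whose members are both frequent on their own.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    pairs = {pair: c for pair, c in pair_counts.items() if c >= min_support}
    return pairs, item_counts


# Hypothetical transactions, for illustration only.
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["diapers", "bread"],
    ["beer", "chips"],
]
pairs, item_counts = frequent_pairs(baskets, min_support=2)

# Confidence of the rule diapers -> beer:
# (baskets with both items) / (baskets with diapers).
conf_diapers_beer = pairs[("beer", "diapers")] / item_counts["diapers"]
```

Here the pair (beer, diapers) appears in 2 of the 4 baskets, and 2 of the 3 baskets containing diapers also contain beer.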

Week 6 – 19 Feb 2016

·       This week, there will be a 45-min to 1-hour test, followed by a 1-hour lecture.

·       Is BIG necessarily better? [slides]

·      This lecture will illustrate with real examples that having more (big) data does not necessarily lead to better analytics (discovery). In fact, it is important to be able to know what you are looking for, to ask the right questions, and to use the right tools.

·       Reading: A short article on the subject. See another article on Google Flu Trends.

Week 7 – 26 Feb 2016

·       Recess Week. No lecture.

Week 8 – 4 Mar 2016

·       What should we recommend? [slides]

·       Recommendation systems are common today. This lecture details some of the methods used in recommendation systems.

·       Reading: MMD Chapter 9: Recommendation Systems; Article on The Long Tail

Week 9 – 11 Mar 2016

·       Is MapReduce all that we have? [slides]

·       This lecture will look at what lies beyond MapReduce. In particular, we will examine alternative big data technologies, e.g., NoSQL, Hive, and Pig. We will also look at the limitations of MapReduce.

Week 10 – 18 Mar 2016

·       Who is the authority in the Internet? [slides]

·       Graphs are a widely used representation for social networks, the Internet, web pages, and so on. It is thus important to be able to analyze graphs efficiently. For example, in social networks, it may be useful to know the hub nodes (e.g., a person who has authority or influence over others). This lecture will study algorithms for such analysis.

·       Reading: MMD Chapter 5: Link Analysis
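As one example of link analysis, the sketch below runs a basic PageRank power iteration on a tiny graph. It is a simplified illustration, not a scalable implementation: the damping factor `beta`, the iteration count, the dead-end handling, and the three-node graph are assumptions for this example, and every link target is assumed to also appear as a key of `links`.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Power-iteration PageRank; links maps each node to the nodes it points to."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Teleport term: every node receives (1 - beta) / n.
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        for node in nodes:
            targets = links[node]
            if targets:
                # Split this node's damped rank evenly among its out-links.
                share = beta * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead end: spread its damped rank evenly over all nodes.
                share = beta * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph, B ends up with the lowest rank because it receives only half of A's rank, while C collects from both A and B.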

Week 11 – 25 Mar 2016

·       Good Friday. No lecture.

Week 12 – 1 Apr 2016

·      How many groups do we have? Which group do you belong to? [slides]

·       Very often, we need to segment the data into groups. For example, in business, it makes sense to organize your customers into groups, perhaps with different needs. In this way, each individual can be targeted more effectively. For social networks, it may also be useful to identify the communities within the network. We will study cluster analysis for both structured and graph data.

·       Reading: MMD Chapter 7: Clustering; Chapter 10: Analysis of Social Networks. The DBSCAN paper can be downloaded from here; it received the most impactful paper award at KDD 2014.

Week 13 – 8 Apr 2016

·       A picture is worth a thousand words? [slides]

·       Guest lecture by Dr Zhao Shengdong, Department of Computer Science, School of Computing

·       An important aspect of big data is the presentation of the answers. This lecture will review information visualization as a tool for making that presentation more appealing to end users.

Week 14 – 15 Apr 2016

·       For this week, there will be a 45-min to 1-hour test, followed by a 1-hour lecture.

·       How am I going to browse through all these? [slides]

·       Very often, users may not know what they want. They may simply try “looking” at the data. However, such “exploration” tasks have to be done in real-time as the users are not going to sit around for an hour to see some preliminary results.  This lecture will investigate methods for “real-time data exploration”.

·       Readings:

o   Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster

o   Online Aggregation

o   MapReduce Online
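The general flavor of incremental exploration can be sketched as follows: process the data in random order and report a running estimate with an error bound that tightens as more rows are seen. This is only an illustration of the idea, not the method of any of the readings above; the function name, batch size, and the normal-approximation interval (which ignores the finite-population correction) are assumptions for this example.

```python
import math
import random


def online_mean(values, batch_size=100, z=1.96, seed=0):
    """Yield (rows_seen, running_mean, half_width) after each batch of rows."""
    order = list(values)
    random.Random(seed).shuffle(order)  # random order makes each prefix a sample

    n, mean, m2 = 0, 0.0, 0.0  # Welford's running mean and sum of squared deviations
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % batch_size == 0 or n == len(order):
            variance = m2 / (n - 1) if n > 1 else 0.0
            # Approximate 95% interval half-width under a normal approximation.
            yield n, mean, z * math.sqrt(variance / n)


# Hypothetical data: the integers 0..999, reported after every 250 rows.
estimates = list(online_mean(range(1000), batch_size=250))
```

A user watching these estimates can stop early once the interval is narrow enough, rather than waiting for the full scan.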