Community Detection Project (Advanced) Summer 2016

RI3009B: CT and Big-Data Community Discovery (Advanced), (CD2)
School of Computing, National University of Singapore
(RI3009B, Summer 2016)
Prof. Leong Hon Wai

Community Detection Project (Group Project)

Project Overview

In this project, each project group is required to

find suitable big-data (from any domain that they are interested in),
formulate how they will define and build an interconnection graph,
choose a community detection algorithm(s),
compute communities or clusters, using the community detection algorithm,
visualize the clusters (using any appropriate software tool)
do some analysis (manual or computer-assisted) of the communities.
Perform further data analysis to better understand the interesting communities.
Write a project report, and
Give a short presentation of your work and the results you obtain.

Also see these slides CD2-Project-Intro.pdf on the project overview. To get you started, Prof Kal Ng has created for all of you, a self-starter guide called Cluster-Guide-v2.pdf Use this guide as a starting point for your project.

You will need to explore MORE than this, but this gets you started on the way.

Project Groups

You work in project groups, 3 students per group. The grouping into 15 groups has already been done (via random assignment, taking into account some constraints as explained in the lectures). You can find the groupings on IVLE.

As explained in the lectures, we encourage you to start with a planning meeting, identify strengths and weaknesses, identify project risks (and how to remedy them); break up tasks and assign project roles, establish communication channels and so on.

Project Data:

You are encouraged to find suitable big-data (from any domain that they are interested in). And from the chosen data, you should formulate how you will define and build an interconnection graph.

As a starting point, we give you the following sample input graphs:

edges.csv small test graph, (6 nodes and 7 edges),
G1.txt network of FB account K, (?? nodes and 178 edges),
G2.txt network of FB account U, (?? nodes and 2390 edges),

You will need to find and use your own big-data sets for your project.

There are many websites that distribute large collections of data, and there are many portals which list these websites. As a start, these answers in Quora give a good starting point to access these websites:

https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public

Here are some additional data sources that were not mentioned in the answers:

Also, there is a dataset on human mobility pattern that Prof Leong and his colleagues are very interested to try out. Those interested in possible future research on community detection should try this dataset.

They are described in this paper:
"Uncovering Patterns of Inter-Urban Trip and Spatial Interaction from Social Media Check-In Data"
(http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0086026)

IMPORTANT NOTE ON DATASET SIZE:

Please note that your project time is SHORT (only a few days). So, you should NOT use very BIG datasets. Instead try to look for some of the smaller ones, or first select smaller subsets (if you cannot find smaller ones).

But not too small though. How big is the limit? It depends on the algorithm you use and its complexity. You should use the lessons learnt from Lecture 1 on scalability to find a suitable dataset size that is feasible for your algorithm.

Community Detection Algorithms

In this course, we covered 4 community detection (CD) algorithms, namely,

SLC (Single-Linkage Clustering),
GN (Girvan-Newman) algorithm,
MCL (Markov CLustering) algorithm,
CPM (Clique Percolation Method),

As it turns out, all 4 CD algorithms (and much more) have been implemented in R in different packages. (I thank Prof. Kal Ng for help with finding these info for me.) However, it was found that CPM algorithm cannot run on the slightly bigger datasets, and so we have excluded CPM algorithm running on R.

I encourage you to use alternative (non-R) implementations of CPM or other algorithms -- since they are likely to be more efficient and can run on bigger graphs. You will need to alter your workflow a little bit in that case. *It is a good thing to learn anyway.*

Your Assigned CD Algorithms and Your Experimentations

Each group will select their own CD algorithm(s) to use for analysis. You are NOT limited to only the ones that are taught in this course. Of course, it is a good idea to first check if there are already existing good implementations of the algorithm (in R and in other languages/platform).

Do this step AS SOON AS POSSIBLE.

You will need to spend some time and experiments to gain a deep understanding of the behavior of your CD algorithm on the different datasets. Use the datasets to help you with this. We also encourage you to create your own small/medium-size datasets to deepen your understanding.

For SLC, you will need to learn how to find out how/where to "cut" the dendrogram to get "good" clusters. You may also try out other similar linkage-type clustering, such as "complete", "average" or "ward.D". (See more details in http://www.inside-r.org/r-doc/stats/hclust.)
For GN algorithm, you can also try out different values of the parameters to gain a deeper understanding of the algorithm. Learn more about "modularity" and if that can be used in any way to improve the results of the clustering. (See http://www.inside-r.org/packages/cran/igraph/docs/edge.betweenness.community)
For MCL, you will need to learn what are good values of the parameters e for expansion and r for inflation. (The default values are e=2, and r=2.) But these may not work well for all cases. (See details in https://cran.r-project.org/web/packages/MCL/MCL.pdf)

Visualizing the Communities Detected

After the communities are computed, it is useful to visualize them in a suitable way. Prof. Kal have covered several algorithms for doing this. You can use some of them for this. Also, you can use pals system (that you have learnt in Lab-Assignment 1); it was developed by my former student, Yao Yujian (and modified by Prof. Kal Ng and his student, Lim Yun Kai).

It turns out that R also has graph layout/visualization APIs (under the general note "plot()"). Note that "plot(C)" will plot C according to the type of object C; namely,

if C is of type graph, then it just output the graph,
if C is of type merge, then plot prints out a dendrogram,
if C is of type cluster, then plot prints the communities,
and so on.

The default may not be very good -- so you should try out different layout algorithm options in the plot command. Some of them includes random, circle, etc which are not good for our purpose. Again, you may experiment with some of them to see which one is suitable. (See details in http://www.inside-r.org/packages/cran/igraph/docs/layout.)

Analysis of the Communities Detected, Gaining New Insights, if any.

The most interesting part of the project is to find and analyze the interesting communities among all the communities you have found. (Use any method you think is appropriate.) You may also need to extract appropriate data so that you can visualize/analyze just the individual community you wish to study. And use these information to confirming old knowledge or gaining new insights.

This step requires some creative thinking from your group member -- to ask the right questions, and to use the right data, and may include additional data both within and outside the community you have detected.

We hope you will do some interesting analysis and have interesting insights to report.

Your Project Deliverables and Presentation...

Project Deliverables: This section describe your project deliverables (including project report and project software) and how to submit them.

The deadline for submission of this is 11:55pm, Friday, 05-Aug-2016.

Project Presentation: Final project presentation is on Saturday, 06-Aug-2016, 10am--1:00pm.
Each project group will be given 15 minutes to present their work + 3 minutes for Q&A (Question and Answers). The venue and detailed schedule for the presentation will be announced soon.

There will be two parallel sessions -- Groups 1-7 in SESSION A, and Groups 8-14 in SESSION B. All students in Groups 1-7 must attend ALL of SESSION A and join in the evaluation of Groups 1-7 (except for their own group). The same for students in Group 8-14. So, each project is evaluated by two teaching staff and at least 18 other students.

The presentation schedule (and grading form) can be found below:

Session A: Groups 1-7 -- here
Session B: Groups 8-14 -- here

(Read CD2-Project-M3-M4.pdf for some presentation tips.)

References: (Community Detection, Graph Layout, etc)

Community Detection References:

Cluster-Guide-v2.pdf, by Prof Kal Ng. (Updated version.)
http://www.sixhat.net/finding-communities-in-networks-with-r-and-igraph.html
http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
Graph Visualization:
"CS3230R Report; The Pals System," by Yao Yujian -- Yao-YuJian-report.pdf
Graph Layout in R -- http://www.inside-r.org/packages/cran/igraph/docs/layout
Graph Layout in R -- http://igraph.org/r/doc/plot.common.html

Project, (RI3009B (Summer 2016), SoC, NUS)

RI3009B: CT and Big-Data Community Discovery (Advanced), (CD2) School of Computing, National University of Singapore (RI3009B, Summer 2016) Prof. Leong Hon Wai Community Detection Project (Group Project)