RI2001A: CT and Big-Data Community Discovery (Basic), (CD1)
School of Computing, National University of Singapore
(RI2001A, Summer 2016)
Prof. Leong Hon Wai
Community Detection Project (Group Project)
See Project Presentation details below (under Presentation).
Project Overview
In this project, each group is given some input (dataset or graphs).
For each of these input, they will
- first form a graph (if needed),
- compute communities or clusters, using a community detection algorithm,
- visualize the clusters and do some analysis (manual or computer-assisted).
Each group will write a report on their work and also give a short presentation of
their work and the results they obtained.
Also see these slides CD1-Project.pdf on the project overview.
To get you started, Prof Kal Ng has created for all of you, a self-starter guide called
Cluster-Guide-v2.pdf
[Note: The updated version includes how to convert the output
to make it suitable for use with the software pals. So download the newer version
of Cluster-Guide-v2.pdf]
Use this guide as a starting point for your project.
You will need to explore MORE than this, but this gets you started on the way.
Project Groups
You work in project groups, 3 students per group.
The grouping into 15 groups has already been done (via random
assignment, taking into account some constraints as explained
in the lectures). You can find the groupings on IVLE.
As explained in the lectures, we encourage you to start with a
planning meeting, identify strengths and weaknesses, identify
project risks (and how to remedy them); break up tasks and
assign project roles, establish communication channels and so on.
Project Data
We will use the following input datasets for this project:
- edges.csv small test graph, (6 nodes and 7 edges),
- G1.txt network of FB account K, (?? nodes and 178 edges),
- G2.txt network of FB account U, (?? nodes and 2390 edges),
- G3.txt tweet graph from Assignment, (?? nodes and 260 edges),
You can download all these datasets from CD1projdata.
Community Detection Algorithms
In this course, we covered 4 community detection (CD) algorithms, namely,
- SLC (Single-Linkage Clustering),
- GN (Girvan-Newman) algorithm,
- MCL (Markov CLustering) algorithm,
- CPM (Clique Percolation Method),
As it turns out, all 4 CD algorithms (and much more) have been implemented in R in different packages. (I thank Prof. Kal Ng for help with finding these info for me.) However, it was found that CPM algorithm cannot run on the slightly bigger datasets, and so we have excluded CPM algorithm running on R.
Your Assigned CD Algorithms and Your Experimentations
Each group has been assigned a CD algorithm from {SLC, GN, MCL}.
See it at
CD-Alg-to-Group.pdf.
If you seriously object to this assignment, please talk to Prof Leong or send him an email with appropriate email-heading and with a good reason for wanting to chat your assigned CD algorithm. (And those who are *very* keen on using CPM will have to download and install CPM on their computer and run it separately. But, get Prof. Leong's permission first, before proceeding with this option.)
You will need to spend some time and experiments to gain a deep understanding of the behavior of your CD algorithm on the different datasets. Use the datasets to help you with this. We also encourage you to create your own small/medium-size datasets to deepen your understanding.
- For SLC, you will need to learn how to find out how/where to "cut" the dendrogram to get "good" clusters. You may also try out other similar linkage-type clustering, such as "complete", "average" or "ward.D".
(See more details in
http://www.inside-r.org/r-doc/stats/hclust.)
- For GN algorithm, you can also try out different values of the parameters to gain a deeper understanding of the algorithm. Learn more about "modularity" and if that can be used in any way to improve the results of the clustering.
(See
http://www.inside-r.org/packages/cran/igraph/docs/edge.betweenness.community)
- For MCL, you will need to learn what are good values of the parameters
e for expansion and r for inflation.
(The default values are e=2, and r=2.) But these may not work well for all cases.
(See details in
https://cran.r-project.org/web/packages/MCL/MCL.pdf)
Visualizing the Communities Detected
After the communities are computed, it is useful to visualize them in a suitable way.
One way you have learnt in Lab-Assignment 1 is to the the "pals" system developed by Yao Yujian
and later modified by Prof. Kal Ng and his student, Lim Yun Kai. You can do that, of course.
It turns out that R also has graph layout/visualization APIs (under the general note "plot()").
Note that "plot(C)" will plot C according to the type of object C; namely,
- if C is of type graph, then it just output the graph,
- if C is of type merge, then plot prints out a dendrogram,
- if C is of type cluster, then plot prints the communities,
- and so on.
Also, you can choose to one of several layout algorithms to position the nodes.
Some of them includes random, circle, etc which are not good for our purpose.
Again, you may experiment with some of them to see which one is suitable.
(See details in
http://www.inside-r.org/packages/cran/igraph/docs/layout.)
Analysis of the Communities Detected, Gaining New Insights, if any.
The last (and in my opinion, most interesting) part of the project will be to manually
analyze the communities detected in the hope of confirming old knowledge or gaining new insights.
This step requires some creative thinking from your group member -- to ask the right questions, and to use the right data, and may include additional data both within and outside the community you have detected.
We hope you will do some interesting analysis and have interesting insights to report.
Your Project Deliverables and Presentation...
Project Deliverable: Your project deliverables (including project report and project software) and how to submit them.
The deadline for submission of this is 11:00am, Friday, 29-July-2016.
Project Presentation: Final project presentation is on Friday, 29-July-2016, 2pm--5:30pm.
Each project group will be given 15 minutes to present their work + 3 minutes for Q&A (Question and Answers). The venue and detailed schedule for the presentation will be announced soon.
There will be two parallel sessions -- Groups 1-7 in SESSION A, and Groups 8-15 in SESSION B.
All students in Groups 1-7 must attend ALL of SESSION A and join in the evaluation of Groups 1-7
(except for their own group). The same for students in Group 8-15.
So, each project is evaluated by two teaching staff and at least 18 other students.
The presentation schedule (and grading form) can be found below:
- Session A: Groups 1-7 -- here
- Session B: Groups 8-15 -- here
References: (Community Detection, Graph Layout, etc)
Community Detection References:
- Cluster-Guide-v2.pdf, by Prof Kal Ng. (Updated version.)
- http://www.sixhat.net/finding-communities-in-networks-with-r-and-igraph.html
- http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
Graph Visualization:
- "CS3230R Report; The Pals System," by Yao Yujian --
Yao-YuJian-report.pdf
- Graph Layout in R --
http://www.inside-r.org/packages/cran/igraph/docs/layout
- Graph Layout in R --
http://igraph.org/r/doc/plot.common.html
Project, (RI2001A (Summer 2016), SoC, NUS)