RI3009B: CT and Big-Data Community Discovery (Advanced), (CD2)
School of Computing, National University of Singapore
(RI3009B, Summer 2016)
Prof. Leong Hon Wai
Community Detection Project (Group Project)
Project Overview
In this project, each project group is required to
- find suitable big-data (from any domain that they are interested in),
- formulate how they will define and build an interconnection graph,
- choose a community detection algorithm(s),
- compute communities or clusters, using the community detection algorithm,
- visualize the clusters (using any appropriate software tool)
- do some analysis (manual or computer-assisted) of the communities.
- Perform further data analysis to better understand the interesting communities.
- Write a project report, and
- Give a short presentation of your work and the results you obtain.
Also see these slides CD2-Project-Intro.pdf on the project overview.
To get you started, Prof Kal Ng has created for all of you, a self-starter guide called
Cluster-Guide-v2.pdf
Use this guide as a starting point for your project.
You will need to explore MORE than this, but this gets you started on the way.
Project Groups
You work in project groups, 3 students per group.
The grouping into 15 groups has already been done (via random
assignment, taking into account some constraints as explained
in the lectures). You can find the groupings on IVLE.
As explained in the lectures, we encourage you to start with a
planning meeting, identify strengths and weaknesses, identify
project risks (and how to remedy them); break up tasks and
assign project roles, establish communication channels and so on.
Project Data:
You are encouraged to find suitable big-data (from any domain that they are interested in).
And from the chosen data, you should formulate how you will
define and build an interconnection graph.
As a starting point, we give you the following sample input graphs:
- edges.csv small test graph, (6 nodes and 7 edges),
- G1.txt network of FB account K, (?? nodes and 178 edges),
- G2.txt network of FB account U, (?? nodes and 2390 edges),
You will need to find and use your own big-data sets for your project.
There are many websites that distribute large collections of data, and
there are many portals which list these websites.
As a start, these answers in Quora give a good starting point to access
these websites:
Here are some additional data sources that were not mentioned in the answers:
Also, there is a dataset on human mobility pattern that Prof Leong and his colleagues are
very interested to try out. Those interested in possible
future research on community detection should try this dataset.
They are described in this paper:
"Uncovering Patterns of Inter-Urban Trip and Spatial Interaction from Social Media Check-In Data"
(http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0086026)
IMPORTANT NOTE ON DATASET SIZE:
Please note that your project time is SHORT (only a few days).
So, you should NOT use very BIG datasets.
Instead try to look for some of the smaller ones, or
first select smaller subsets (if you cannot find smaller ones).
But not too small though.
How big is the limit? It depends on the algorithm you use and its complexity.
You should use the lessons learnt from Lecture 1 on scalability to find a
suitable dataset size that is feasible for your algorithm.
Community Detection Algorithms
In this course, we covered 4 community detection (CD) algorithms, namely,
- SLC (Single-Linkage Clustering),
- GN (Girvan-Newman) algorithm,
- MCL (Markov CLustering) algorithm,
- CPM (Clique Percolation Method),
As it turns out, all 4 CD algorithms (and much more) have been
implemented in R in different packages. (I thank Prof. Kal Ng for help
with finding these info for me.) However, it was found that CPM
algorithm cannot run on the slightly bigger datasets, and so we have
excluded CPM algorithm running on R.
I encourage you to use alternative (non-R) implementations of CPM or
other algorithms -- since
they are likely to be more efficient and can run on bigger graphs. You
will need to alter your workflow a little bit in that case. *It is a good thing to learn anyway.*
Your Assigned CD Algorithms and Your Experimentations
Each group will select their own CD algorithm(s) to use for analysis.
You are NOT limited to only the ones that are taught in this course.
Of course, it is a good idea to first check if there are already existing good
implementations of the algorithm (in R and in other languages/platform).
Do this step AS SOON AS POSSIBLE.
You will need to spend some time and experiments to gain a deep
understanding of the behavior of your CD algorithm on the different
datasets. Use the datasets to help you with this. We also encourage you
to create your own small/medium-size datasets to deepen your
understanding.
- For SLC, you will need to learn how to find out how/where to "cut"
the dendrogram to get "good" clusters. You may also try out other
similar linkage-type clustering, such as "complete", "average" or
"ward.D".
(See more details in
http://www.inside-r.org/r-doc/stats/hclust.)
- For GN algorithm, you can also try out different values of the
parameters to gain a deeper understanding of the algorithm. Learn more
about "modularity" and if that can be used in any way to improve the
results of the clustering.
(See
http://www.inside-r.org/packages/cran/igraph/docs/edge.betweenness.community)
- For MCL, you will need to learn what are good values of the parameters
e for expansion and r for inflation.
(The default values are e=2, and r=2.) But these may not work well for all cases.
(See details in
https://cran.r-project.org/web/packages/MCL/MCL.pdf)
Visualizing the Communities Detected
After the communities are computed, it is useful to visualize them in a suitable way.
Prof. Kal have covered several algorithms for doing this. You can use some of them for this.
Also, you can use pals system (that you have learnt in Lab-Assignment 1); it was developed
by my former student, Yao Yujian (and modified by Prof. Kal Ng and his student, Lim Yun Kai).
It turns out that R also has graph layout/visualization APIs (under the general note "plot()").
Note that "plot(C)" will plot C according to the type of object C; namely,
- if C is of type graph, then it just output the graph,
- if C is of type merge, then plot prints out a dendrogram,
- if C is of type cluster, then plot prints the communities,
- and so on.
The default may not be very good -- so you should try out different
layout algorithm options in the plot command. Some of them includes
random, circle, etc which are not good for our purpose.
Again, you may experiment with some of them to see which one is
suitable.
(See details in
http://www.inside-r.org/packages/cran/igraph/docs/layout.)
Analysis of the Communities Detected, Gaining New Insights, if any.
The most interesting part of the project is to find and analyze
the interesting communities among all the communities you have found.
(Use any method you think is appropriate.)
You may also need to extract appropriate data so that you can visualize/analyze
just the individual community you wish to study.
And use these information to confirming old knowledge or gaining new insights.
This step requires some creative thinking from your group member -- to ask the right questions, and to use the right data, and may include additional data both within and outside the community you have detected.
We hope you will do some interesting analysis and have interesting insights to report.
Your Project Deliverables and Presentation...
Project Deliverables: This section describe
your project deliverables (including project report and project software) and how to submit them.
The deadline for submission of this is 11:55pm, Friday, 05-Aug-2016.
Project Presentation: Final project presentation is on Saturday, 06-Aug-2016, 10am--1:00pm.
Each project group will be given 15 minutes to present their work + 3
minutes for Q&A (Question and Answers). The venue and detailed
schedule for the presentation will be announced soon.
There will be two parallel sessions -- Groups 1-7 in SESSION A, and Groups 8-14 in SESSION B.
All students in Groups 1-7 must attend ALL of SESSION A and join in the evaluation of Groups 1-7
(except for their own group). The same for students in Group 8-14.
So, each project is evaluated by two teaching staff and at least 18 other students.
The presentation schedule (and grading form) can be found below:
- Session A: Groups 1-7 -- here
- Session B: Groups 8-14 -- here
(Read CD2-Project-M3-M4.pdf for some presentation tips.)
References: (Community Detection, Graph Layout, etc)
Community Detection References:
- Cluster-Guide-v2.pdf, by Prof Kal Ng. (Updated version.)
-
http://www.sixhat.net/finding-communities-in-networks-with-r-and-igraph.html
-
http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
Graph Visualization:
- "CS3230R Report; The Pals System," by Yao Yujian --
Yao-YuJian-report.pdf
- Graph Layout in R --
http://www.inside-r.org/packages/cran/igraph/docs/layout
- Graph Layout in R --
http://igraph.org/r/doc/plot.common.html
Project, (RI3009B (Summer 2016), SoC, NUS)