CMC
Version 2.0
=============

This package contains the following programs:

1/ cd_ppi:  A method to assign reliability to edges in a PPI network.
         The reliability is assigned based on the concept of
         interated CD distance. See [2].

2/ filterNadd_ppi: A method for cleansing a PPI network. It uses
         cd_ppi to assign reliability to edges in a PPI network.
         Those below a specified threshold are removed as false positives.
         It also evaluates protein pairs not in the PPI network and
         decides if they are possible false negatives. Those pairs
         scoring above a specified threshold are added into the network.
         See [1,2]
         
3/ CMC:  A method to identify protein complexes from a PPI network.
         The method is based on merging of maximal clique. See [1].


A detailed set of instructions for these programs can be found after
the references below.


Credits:

This package was implemented by LIU Guimei and is supported in part
by URC grant "R-252-000-274-112: Graph-Based Protein Function Prediction"
and by NRF grant "R252-000-218-281: Informatics and Search Algorithms
for Lipidomics - Novel Tools and Applications".

If you use this program, please cite the following references:

[1] Guimei Liu, Limsoon Wong, Hon Nian Chua. 
    Complex Discovery from Weighted PPI Networks.
    Bioinformatics, 2009 (to appear).

[2] Guimei Liu, Jinyan Li, Limsoon Wong. 
    Assessing and Predicting Protein Interactions Using Both Local and
    Global Network Topological Metrics. 
    Proceedings of 19th International Conference on Genome Informatics (GIW),
    pages 138--149, Gold Coast, Australia, 3 December 2008.

[3] Hon Nian Chua, Limsoon Wong. 
    Increasing the Reliability of Protein Interactomes. 
    Drug Discovery Today, 13(15/16):652--658, August 2008


===cd_ppi===

cd_ppi:  A method to assign reliability to edges in a PPI network.
         The reliability is assigned based on the concept of
         interated CD distance. See [2].

Usage of cd_ppi:

  cd_ppi  train_ppi_file  test_ppi_file  method  #iterations  output_filename


Parameters: 

1. train_ppi_file: contains protein protein interactions. 
	Each line represents an interaction, and contains a pair of proteins. 
	The program uses this file to calculate the score of every protein pair. 
2. test_ppi_file: contains the set of interactions to be assessed.
	Its format is the same as that of "train_ppi_file". The program will 
	calculate the score of the proteins pairs in this file, and output 
	those protein pairs with non-zero score. If "train_ppi_file" and 
	"test_ppi_file" are the same file, then the program is assessing 
	the reliability of the interactions in "train_ppi_file". If the 
	value of "test_ppi_file" is "NULL", then the program will predict 
	new protein interactions that are not in "train_ppi_file", and
        output those proteins pairs that are not in "train_ppi_file" and 
	their interacting scores. 

3. method: takes the following values: 
                 CD: CD-distance            (See [2,3])
            AdjstCd: Adjusted CD-distance   (See [2])
                 FS: FS-weight              (See [3])
    
4. nmax_iterations: is the number of iterations. 

5. output_filename: Each line contains a pair of proteins and their
	 interacting score. 


====filterNadd_ppi=====

filterNadd_ppi: A method for cleansing a PPI network. It uses
         cd_ppi to assign reliability to edges in a PPI network.
         Those below a specified threshold are removed as false positives.
         It also evaluates protein pairs not in the PPI network and
         decides if they are possible false negatives. Those pairs
         scoring above a specified threshold are added into the network.
         See [1,2]
         
Usage of filterNadd_ppi:

filterNadd_ppi ppi_filename \
	#iterations  filter_method  filter_min_score \
	add_method  add_min_score \
	output_file

This program calls the "cd_ppi" program to calculate PPI scores. 

Parameters:

1. ppi_filename: contains the set of interactions. 
	Each line represents an interaction, and contains a pair of proteins. 
	The program uses this file to calculate the score of 
	every protein pair. 

2. #iterations: number of iterations for the chosen scoring methods. 
	Suggested value: 2. 

3. filter_method: the scoring method used to remove non-reliable interactions. 
	It takes the following values: 
                 CD: CD-distance             (See [3])
            AdjstCd: Adjusted CD-distance    (See [1,2])
                 FS: FS-weight               (SEe [3])

4. filter_min_score: the threshold used to filter interactions with low 
	score from "ppi_filename".  If its value is between 0 and 1, 
	then all the interactions with score lower than its value are removed. 
	If its value is an integer k that is larger than 1, then only the 
	top-k interactions are retained. 

5. add_method: the scoring method used to add new interactions. 
	It takes the following values: 
                 CD: CD-distance
            AdjstCd: Adjusted CD-distance 
                 FS: FS-weight              

6. add_min_score: the threshold used to add new predicted interactions. 
	If its value falls in (0, 1], then those proteins pairs that are 
	not in "ppi_filename" but have a score no less than the given 
	value are added to the final output file. If its value is
        an integer k that is larger than 1, then only the top-k new 
	proteins pairs will be added. If do not want to add new interactions,
	set its value to 0. 

7. output_file: contains the final proteins pairs. Each line contains 
	a pair of proteins and their score. 
                  

Example: 

1. The following command uses AdjstCD to calculate score, and removes 
interactions with 0 score, and add the top-1000 new interactions to the
PPI network.  

	filterNadd_ppi  dip.ppi.txt 1 AdjstCD 0  AdjstCD 1000  ppi.score.txt  

2. Same as the above command except that it uses iterated AdjstCD to
calculate score. The number of iterations is 2. 

	filterNadd_ppi  dip.ppi.txt 2 AdjstCD 0 AdjstCD  1000  ppi.score.txt


====CMC=====

CMC:  A method to identify protein complexes from a PPI network.
         The method is based on merging of maximal clique. See [1].

Usage of CMC:

CMC ppi_score_filename \
    min_deg_ratio \
    min_size \
    overlap_thres \
    merge_thres \
    output_file

This program calls the "quasiCliques" program to find maximal cliques 

Parameters:

1. ppi_score_filename: contains the set of interactions and their scores.
	 Each line represents an interaction, 
         and contains a pair of proteins and their score.

2. min_deg_ratio: set it to 1

3. min_size: the minimum size of the clusters generated

4. overlap_thres: the threshold used to remove or merge highly
	 overlapped clusters. Given two clusters C1 and C2, if
         the overlap between C1 and C2 is no less than filter_score*|C2|,
	 then C2 will either be removed or merged.

5. merge_thres: the threshold used to remove or merge highly 
	 overlapped clusters. Given two clusters C1 and C2, if
         the overlap between C1 and C2 is no less than filter_score*|C2|,
         and the inter-connectivity between C1 and C2 is no less than 
         merge_thres, then C2 is merged with C1, otherwise C2 is removed.

6. output_file: contains the list of clusters generated. 
	Each line represents a cluster. The string 
        before ':' is the identifier of the cluster, followed by the
	set of proteins in this cluster. 


Example:

   CMC  ppi.score.txt  1 4 0.5 0.25  clusters.txt