A Correlated Motif Approach for Finding Short Linear Motifs from Protein-Protein Interaction Data
Soon Heng Tan, Hugo Willy, Wing-Kin Sung, See-Kiong Ng
Short binding motifs are an important class of interaction switches in biological circuits and disease pathways. However, the biological experiments needed to find these binding motifs are often laborious and expensive. With the availability of protein interaction data, novel binding motifs can be discovered computationally by applying standard motif extraction algorithms to a set of protein sequences that interact with a common protein or with a group of proteins with similar properties. The underlying assumption is that proteins with common interacting partners share some common binding motifs. Although novel binding motifs have been discovered with this approach, it is not applicable when a protein interacts with very few other proteins, or when prior knowledge of such protein groupings is unavailable or erroneous. Experimental noise in the input interaction data can further degrade the already poor performance of such approaches.
We propose a novel approach that finds correlated short sequence motifs from protein-protein interaction data to circumvent these limitations. Correlated motifs are motifs that consistently co-occur only in pairs of interacting protein sequences and that could interact with each other, directly or indirectly, to mediate the interactions. We adopt the (l,d)-motif model and formulate the search for correlated motifs as an (l,d)-motif pair finding problem. We present both an exact algorithm, D-MOTIF, and an approximation algorithm, D-STAR, to solve this problem. Evaluation on extensive simulated data showed that our approach not only eliminates the need for any prior protein grouping but is also more robust in extracting motifs from noisy interaction data. On real biological datasets, we were able to extract correlated motifs that correspond to the actual binding interfaces of two different classes of proteins.
The correlated motif approach outlined in this paper is able to find correlated linear motifs from sparse and noisy interaction data. This, in turn, will help further elucidate the various biological pathways mediated by linear binding motifs.
Program availability : DSTAR.zip.
The ZIP file contains the following files:
Binaries : DSTAR.Window.exe, DSTAR.LinuxRedHat, DSTAR.Solaris
Sample sequence input file (in FASTA format) : C2-BIND_Seq.fasta
Sample input interaction data : C2-BIND.int
Sample output file : C2-BIND.out
Perl interactive script to run the program : D-STAR_Interactive.pl
A README file.
To run the program from the command line, type
<DSTAR-binary> <fasta-file> <interaction-file> [<option> <option-value>]
where <option> can be
-l <PATTERN_WIDTH> Length of the motif to search for
-d <DIST_THRES> Maximum number of mismatches allowed from the centroid of the star (equal to 2d in the (l,d) definition)
-L <MIN_SIZE_L> Minimum size of the left star
-R <MIN_SIZE_R> Minimum size of the right star
-I <MIN_NUM_OF_INTER> Minimum number of interactions between the proteins in the left and right stars
-N <NUM_OF_RESULT> Maximum number of results to compute
-o <OUTPUT_FILE_NAME> Write the output to this file; otherwise the output is sent to stdout
-s <LEFT_PROT_NAME> <RIGHT_PROT_NAME> Run DSTAR only on the interaction <left_protein_name,right_protein_name>
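The remark that the -d threshold equals 2d in the (l,d) definition can be read as a triangle-inequality argument: if every occurrence of a motif lies within d mismatches of the (unobserved) motif, then any two occurrences lie within 2d mismatches of each other. The Python sketch below illustrates that bound, assuming the centroid of a star is itself one of the occurrences. The motif and occurrences are made-up strings for illustration, not data from the program.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical (l,d)-motif with l = 8 and d = 1.
motif = "ACDEFGHK"
d = 1

# Two hypothetical occurrences, each one mismatch away from the motif.
occurrence_1 = "ACDEFGHR"  # differs from the motif at the last position
occurrence_2 = "TCDEFGHK"  # differs from the motif at the first position

assert hamming(motif, occurrence_1) <= d
assert hamming(motif, occurrence_2) <= d

# The occurrences can differ from each other by up to 2d mismatches,
# so a threshold measured from an observed centroid must allow 2d.
assert hamming(occurrence_1, occurrence_2) <= 2 * d
```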
For example:
DSTAR.Window.exe C2-BIND_Seq.fasta C2-BIND.int -l 8 -d 2 -L 6 -R 6 -I 8 -N 100 -o try.out
or, for a single interaction (here only the interaction <0,2>):
DSTAR.Window.exe C2-BIND_Seq.fasta C2-BIND.int -l 8 -d 2 -L 6 -R 6 -I 8 -N 100 -o try.out -s 0 2
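When the program is driven from a script, the argument list can be assembled programmatically. The following Python sketch is not part of the DSTAR distribution. The function name build_dstar_command and its keyword names are hypothetical, and only the flags listed above are used. It builds the argument list without running anything.

```python
from typing import List, Optional, Tuple

def build_dstar_command(
    binary: str,
    fasta: str,
    interactions: str,
    *,
    width: Optional[int] = None,        # -l
    dist: Optional[int] = None,         # -d
    min_left: Optional[int] = None,     # -L
    min_right: Optional[int] = None,    # -R
    min_inter: Optional[int] = None,    # -I
    max_results: Optional[int] = None,  # -N
    output: Optional[str] = None,       # -o
    single: Optional[Tuple[str, str]] = None,  # -s <left> <right>
) -> List[str]:
    """Return the argument list for one DSTAR run (hypothetical helper)."""
    cmd = [binary, fasta, interactions]
    options = (
        ("-l", width), ("-d", dist), ("-L", min_left), ("-R", min_right),
        ("-I", min_inter), ("-N", max_results), ("-o", output),
    )
    for flag, value in options:
        if value is not None:  # omit options that were not given
            cmd += [flag, str(value)]
    if single is not None:
        left, right = single
        cmd += ["-s", str(left), str(right)]
    return cmd

cmd = build_dstar_command(
    "DSTAR.Window.exe", "C2-BIND_Seq.fasta", "C2-BIND.int",
    width=8, dist=2, min_left=6, min_right=6,
    min_inter=8, max_results=100, output="try.out", single=("0", "2"),
)
print(" ".join(cmd))
```

The resulting list could then be passed to subprocess.run(cmd, check=True) on a machine where the binary is available.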
Note that each interaction in the interaction file has the form
<protein_ID><TAB><protein_ID>
where <TAB> is a tab character and protein_ID is the string following '>' in the FASTA file.
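Because the interaction file refers to proteins by their FASTA identifiers, it can be useful to check the two files against each other before a run. The Python sketch below is one way to do this. It is not part of the distribution and makes several assumptions: there is one interaction per line, the identifier is the whole header line after '>', and blank lines are ignored. The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the text after '>' on each FASTA header line."""
    return {line[1:].strip() for line in lines if line.startswith(">")}

def parse_interactions(
    lines: Iterable[str], known_ids: Set[str]
) -> List[Tuple[str, str]]:
    """Parse tab-separated ID pairs and check each ID against the FASTA IDs."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no interaction
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated IDs, got {len(fields)}"
            )
        for protein_id in fields:
            if protein_id not in known_ids:
                raise ValueError(
                    f"line {number}: unknown protein ID {protein_id!r}"
                )
        pairs.append((fields[0], fields[1]))
    return pairs

# Small in-memory example with made-up sequences.
ids = fasta_ids([">0\n", "MKVLA\n", ">1\n", "GGSTP\n", ">2\n", "AAKLM\n"])
pairs = parse_interactions(["0\t2\n", "1\t2\n"], ids)
```

With real files, the same functions can be fed open file handles, for example open("C2-BIND_Seq.fasta") and open("C2-BIND.int").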