FastTagger
Version 1.0
28 June 2009
============


Usage:

   FastTagger data_file min_maf window_len min_r2 max_len \
	 merge_window_len max_covered_times mem_size max_tagSNP_num \
	 output_file 

Parameters:

1. data_file: if its value is xxx, 
              then two files should exist: xxx.maf and xxx.matrix
              (they are generated by the ConvertData program). 

              a) File ".maf": each line describes a single SNP in the
                 following format:

                 id	position	allele1	allele2	MAF

                 where allele1 and allele2 are the two alleles of the SNP
                 (allele1 corresponds to 0 and allele2 corresponds to 1
                 in file ".matrix") and MAF is the minor allele frequency
                 of the SNP.
                
              b) File ".matrix": each line contains the alleles of one SNP
                 over all samples. Each allele is coded as 0 or 1, and the
                 corresponding actual allele can be found in file ".maf".
                 The total number of 0s and 1s in each line equals the
                 number of samples. There is a one-to-one correspondence
                 between the lines of ".maf" and ".matrix": the same line
                 number in the two files refers to the same SNP.
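
The two input files can be read side by side. The following Python
sketch is only an illustration and is not part of the FastTagger package:
the names Snp, parse_maf_line, parse_matrix_line, load_dataset and
minor_allele_frequency are hypothetical. It assumes that the fields of
".maf" are separated by whitespace, that positions are integers, and that
any character other than 0 or 1 in a ".matrix" line is a separator.

```python
from typing import List, NamedTuple, Tuple


class Snp(NamedTuple):
    snp_id: str
    position: int
    allele1: str  # coded as 0 in ".matrix"
    allele2: str  # coded as 1 in ".matrix"
    maf: float


def parse_maf_line(line: str) -> Snp:
    """Parse one ".maf" line: id, position, allele1, allele2, MAF."""
    snp_id, position, allele1, allele2, maf = line.split()
    return Snp(snp_id, int(position), allele1, allele2, float(maf))


def parse_matrix_line(line: str) -> List[int]:
    """Parse one ".matrix" line into a list of 0/1 alleles.

    Only the characters 0 and 1 are kept, so the sketch works whether or
    not the alleles are separated by whitespace (an assumption).
    """
    return [int(ch) for ch in line if ch in "01"]


def load_dataset(prefix: str) -> Tuple[List[Snp], List[List[int]]]:
    """Read prefix.maf and prefix.matrix and check that they line up."""
    with open(prefix + ".maf") as maf_file:
        snps = [parse_maf_line(line) for line in maf_file if line.strip()]
    with open(prefix + ".matrix") as matrix_file:
        rows = [parse_matrix_line(line) for line in matrix_file if line.strip()]
    if len(snps) != len(rows):
        raise ValueError("'.maf' and '.matrix' must have the same number of lines")
    return snps, rows


def minor_allele_frequency(row: List[int]) -> float:
    """Frequency of the rarer of the two alleles in one ".matrix" row."""
    freq_of_1 = sum(row) / len(row)
    return min(freq_of_1, 1.0 - freq_of_1)
```

With such a loader, the MAF column of ".maf" can be cross-checked against
the corresponding ".matrix" row before running FastTagger.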


2. min_maf: the minimum frequency of the minor allele. Suggested value: 0.05

3. window_len: the maximum distance between correlated SNPs, in units of
            1000 bases. If it is set to 100, then the maximum distance is
            100,000 bases. Suggested value: 100

4. min_r2: the minimum correlation threshold. Suggested value: 0.8-1
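
For a pair of SNPs, the correlation can be illustrated with the standard
pairwise r^2 statistic on two 0/1 rows of the ".matrix" file. The Python
sketch below assumes that this standard definition is what min_r2 is
compared against for length-1 rules; the function name pairwise_r2 is
hypothetical and the code is not taken from FastTagger itself. The
returned direction follows the flag convention described for file
".rule.bin" (1 if allele 1 goes with allele 1, 0 otherwise).

```python
from typing import Sequence, Tuple


def pairwise_r2(x: Sequence[int], y: Sequence[int]) -> Tuple[float, int]:
    """Return (r^2, direction) for two SNPs given as 0/1 allele rows.

    direction is 1 when allele 1 of x tends to occur with allele 1 of y,
    and 0 when allele 1 of x tends to occur with allele 0 of y.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("rows must be non-empty and of equal length")
    p_x = sum(x) / n  # frequency of allele 1 at the first SNP
    p_y = sum(y) / n  # frequency of allele 1 at the second SNP
    p_xy = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    denominator = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denominator == 0:
        raise ValueError("r^2 is undefined when a SNP has only one allele")
    d = p_xy - p_x * p_y  # deviation from independence
    direction = 1 if d >= 0 else 0
    return d * d / denominator, direction
```

Under this definition, a pair of SNPs would pass the threshold when the
first returned value is at least min_r2.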

5. max_len: the maximum number of SNPs on the left hand side of a rule.
           Suggested value: <=3

6. merge_window_len: this threshold is used to merge equivalent SNPs. 
           Only equivalent SNPs within the specified distance are merged
           together. Suggested value: set it to the same value as
           the window_len parameter. 

7. max_covered_times: the maximum number of times a SNP may be tagged by
           other SNPs. This parameter reduces the running time at the cost
           of missing some rules. The basic idea is that once a SNP has
           been covered enough times by other SNPs, it is no longer
           considered as the right hand side of new rules. This should not
           considerably increase the number of tag SNPs selected. If the
           value of this parameter is 0, then there is no restriction on
           the number of times a SNP is covered, i.e. all the rules are
           generated.
           Suggested value: 0

8. mem_size: the maximum amount of memory that the program may use.
           If its value is 0, then there is no restriction on memory size.
           Suggested value: 0

9. max_tagSNP_num: the desired number of tag SNPs. If it is set to 0,
           then the tag SNP selection algorithm stops when all the SNPs
           are covered; otherwise, the algorithm stops when the desired
           number of tag SNPs is reached.

10. output_file: the name of the output files. Six files are generated:

                a) ".SNPlineno.txt": contains the line numbers of the SNPs
                   in files ".maf" and ".matrix" that satisfy the min_maf
                   threshold.

                b) ".SNPid.txt": contains the information of the SNPs
                   that satisfy the min_maf threshold. Its format is as
                   follows:

                   id	position	allele1	allele2

                   where allele1 and allele2 are the two alleles of the SNP;
                   allele1 corresponds to 0 and allele2 corresponds to 1.

                c) ".merged-ids.txt": each line describes one set of SNPs
                   that are merged together. The first number is the number
                   of SNPs in the set, followed by the order of each SNP in
                   the set. Here, the order of a SNP is its line number in
                   file ".SNPid.txt" minus 1.

                d) ".rule.bin": a binary file that contains the generated
                   rules. Every SNP is represented by its line number in
                   file ".SNPid.txt" minus 1.

                   Length-1 rule format:

                               1 SNP1 SNP2 r2 flag

                   This actually represents two rules: SNP1=>SNP2 and
                   SNP2=>SNP1.
                   If flag=1, then 1<=>1, 0<=>0;
                   if flag=0, then 1<=>0, 0<=>1.
                   r2 is a real number between 0 and 1, and it is the
                   correlation value between SNP1 and SNP2.

                   If the left hand side (LHS) contains more than one SNP,
                   the format is as follows:

                               k   SNP_i1 ... SNP_ik
                               t   SNP_j1   R2_j1    flag_j1
                                   SNP_j2   R2_j2    flag_j2
                                    ...
                                   SNP_jt   R2_jt    flag_jt

                   where k is the number of SNPs on the left hand side and
                   t is the number of SNPs that can be inferred from the
                   left hand side.
                   The values of flag_j1, ..., flag_jt are set as follows.
                   There are 2^k possible combinations of alleles of
                   SNP_i1, ..., SNP_ik on the left hand side, and each
                   combination can be mapped to an integer. When we
                   calculate r^2, we map each combination to one of the two
                   alleles of SNP_j1 (0 or 1). Let n=a1a2...ak be the binary
                   number whose digits a1, a2, ..., ak are the alleles of
                   SNP_i1, SNP_i2, ..., SNP_ik. If n is mapped to allele 1 of
                   SNP_j1, then the n-th bit of flag_j1 (counting from the
                   least significant bit, starting at 0) is set to 1. For
                   example, suppose there are 3 SNPs on the left hand side,
                   so there are 8 combinations. The mappings between these
                   8 combinations and the two alleles of SNP_j1 are as
                   follows:
                    
                                         000 -> 0
                                         001 -> 0
                                         010 -> 1
                                         011 -> 0
                                         100 -> 0
                                         101 -> 1
                                         110 -> 0
                                         111 -> 1
                              
                   Then the value of flag_j1 is 10100100 in binary, i.e. 164
                   (bits 2, 5 and 7 are set).
                   Note that in this file, each SNP may actually represent
                   multiple SNPs that are merged with it. As a result,
                   each rule may represent multiple rules,
                   but not every one of them is valid. You need to use the
                   distance constraint to filter out invalid rules.
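
The flag encoding of multi-SNP rules can be reproduced with a few bit
operations. The Python sketch below is only an illustration: the function
names encode_flag and predicted_allele are hypothetical, and it assumes
that a1 is the most significant digit of n (the worked example uses only
palindromic combinations, so it does not settle the digit order).

```python
from typing import Dict, Sequence


def encode_flag(mapping: Dict[str, int]) -> int:
    """Build a flag from a mapping such as {"010": 1, "011": 0, ...}.

    Each key is an allele combination a1a2...ak written as a binary
    string; bit n of the flag is set when combination n maps to allele 1.
    """
    flag = 0
    for combination, allele in mapping.items():
        if allele == 1:
            flag |= 1 << int(combination, 2)
    return flag


def predicted_allele(flag: int, lhs_alleles: Sequence[int]) -> int:
    """Look up the right hand side allele for the given LHS alleles."""
    n = 0
    for allele in lhs_alleles:  # a1 first, i.e. most significant digit
        n = (n << 1) | allele
    return (flag >> n) & 1


# The mapping from the example in this section.
example_mapping = {
    "000": 0, "001": 0, "010": 1, "011": 0,
    "100": 0, "101": 1, "110": 0, "111": 1,
}
example_flag = encode_flag(example_mapping)  # 164, as in the example
```

Here predicted_allele(164, [1, 0, 1]) looks up bit 5 of the flag and
returns 1, in agreement with the line "101 -> 1" of the example.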
                                 

                e) ".tagSNP.txt": each line contains a tag SNP. Every
                   tag SNP is represented by its line number in file
                   ".SNPid.txt" minus 1.

                f) ".tagrule.txt": contains the tagging rules. Every SNP on
                   the left hand side of a tagging rule must appear in
                   file ".tagSNP.txt". The format of each rule is as
                   follows:

                               k   SNP_i1 ... SNP_ik
                               t   SNP_j1   R2_j1    flag_j1	...	SNP_jt   R2_jt    flag_jt

                   Note that a SNP on the left hand side of the rule
                   represents only itself, while a SNP on the right hand
                   side may represent a set of SNPs that are merged with it. 
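
The tagging rules can be read back with a small parser. The Python sketch
below is only an illustration and is not part of the FastTagger package:
the names TagRule and parse_tag_rules are hypothetical. It assumes that
".tagrule.txt" is a text file whose fields are separated by whitespace
(spaces, tabs or line breaks) and that every rule, including rules with
k=1, follows the k/t layout shown above.

```python
from typing import List, NamedTuple, Tuple


class TagRule(NamedTuple):
    lhs: List[int]                     # tag SNPs on the left hand side
    rhs: List[Tuple[int, float, int]]  # (SNP, r2, flag) for each tagged SNP


def parse_tag_rules(text: str) -> List[TagRule]:
    """Parse the contents of a ".tagrule.txt" file into a list of rules.

    The input is treated as one stream of whitespace-separated fields, so
    the exact placement of line breaks does not matter (an assumption).
    """
    tokens = text.split()
    rules = []
    pos = 0
    while pos < len(tokens):
        k = int(tokens[pos])
        pos += 1
        lhs = [int(token) for token in tokens[pos:pos + k]]
        pos += k
        t = int(tokens[pos])
        pos += 1
        rhs = []
        for _ in range(t):
            snp, r2, flag = tokens[pos:pos + 3]
            rhs.append((int(snp), float(r2), int(flag)))
            pos += 3
        rules.append(TagRule(lhs, rhs))
    return rules
```

Every SNP number in a parsed rule is a line number in file ".SNPid.txt"
minus 1, so it can be used directly as a 0-based index into the lines of
that file.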


Example: FastTagger ENm010 0.05  100 0.95  3  100  0  0  0  temp
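
When FastTagger is driven from a script, the positional arguments are
easy to get in the wrong order. The Python sketch below builds the
argument list in the order given under "Usage"; the helper name
fasttagger_command and its keyword names are hypothetical, and the
default values simply repeat the example above.

```python
from typing import List, Optional


def fasttagger_command(
    data_file: str,
    output_file: str,
    min_maf: float = 0.05,
    window_len: int = 100,
    min_r2: float = 0.95,
    max_len: int = 3,
    merge_window_len: Optional[int] = None,
    max_covered_times: int = 0,
    mem_size: int = 0,
    max_tagsnp_num: int = 0,
    executable: str = "FastTagger",
) -> List[str]:
    """Return the FastTagger command line as a list of strings."""
    if merge_window_len is None:
        # Suggested value: the same as window_len.
        merge_window_len = window_len
    middle = [
        min_maf, window_len, min_r2, max_len,
        merge_window_len, max_covered_times, mem_size, max_tagsnp_num,
    ]
    return [executable, data_file, *map(str, middle), output_file]
```

The resulting list can be passed to subprocess.run(command, check=True)
from the Python standard library, provided that the FastTagger executable
can be found on the search path.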
       

Credits:

This package was implemented by LIU Guimei and is supported in
part by SERC PSF grant 072-101-0016. 

If you use this program, please cite the following reference:

[1] Guimei Liu, Yue Wang, Limsoon Wong.
FastTagger: An Efficient Algorithm for Genome-Wide Tag SNP Selection.
Manuscript, February 2009.