FastTagger Version 1.0 28 June 2009 ============ Usage: FastTagger data_file min_maf window_len min_r2 max_len \ merge_window_len max_covered_times mem_size max_tagSNP_num \ output_file parameters: 1. data_file: if its value is xxx, then two files should exist: xxx.maf and xxx.matrix (they are generated by the ConvertData program). a) File ".maf": Each line contains information of a single SNP, and its format is as follows: id position allele1 allele2 MAF where allele1 and allele2 are the two alleles of the SNP, allele1 corresponds to 0 and allele2 corresponds to 1 in file ".matrix", MAF is the minor allele frequency of the SNP. b) file ".matrix": Each line contains the alleles of a SNP over all samples. Each allele is either 0 or 1, and the corresponding actual allele can be find in file ".maf". The total number of 0s and 1s in each line is equal to the number of samples. There is a one-to-one correspondance between the SNPs in file ".maf" and ".matrix". SNPs at the same line in the two files refer to the same SNP. 2. min_maf: the minimum frequency of the minor allele. Suggested value: 0.05 3. window_len: the maximum distance between correlated SNPs (in 1000 bases). If set it to 100, then the maximum distance is 100,000. Suggested value: 100 4. min_r2: the minimum correlation threshold. Suggested value: 0.8-1 5. max_len: the maximum length of the SNP set on the left hand side of the rules. Suggested value: <=3 6. merge_window_len: this threshold is used to merge equivalent SNPs. Only equivalent SNPs within the specified distance are merged together. Suggested value: set it to the same value as the window_len parameter. 7. max_covered_times: the maximum times of a SNP being tagged by other SNPs. The purpose of this parameter is to speed up the running time in the cost of missing some rules. The basic idea is that if a SNP has been covered by enough number of times by other SNPs, then this SNP will not be considered as right hand side in the future. Hopefully this will not increase the number of tag SNPs selected considerably. If the value of this parameter is 0, then no restriction on the number of times that a SNP is being covered, i.e. all the rules are generated. Suggested value: 0 8. mem_size: the maximum size of the memory can be used by the program. If its value is 0, then no restriction on memory size. Suggested value: 0 9. max_tagSNP_num: the desired number of tag SNPs. If it is set to 0, then the tag SNP selection algorithm stops when all the SNPs are covered, otherwise, the algorithm stops when the desired number of tag SNPs is reached. 10. output_file: the name of the output files. Six files are generated: a) ".SNPlineno.txt": contains the line no of the SNPs in file ".maf" and ".matrix" that satisfies the min_maf threshold. b) ".SNPid.txt": contains the information of the SNPs that satisfies the min_maf threshold. Its format is as follows: id position allele1 allele2 where allele1 and allele2 are the two alleles of the SNP, allele1 corresponds to 0 and allele2 corresponds to 1 c) ".merged-ids.txt": Each line contains a set of SNPs being merged together. The first number is the number of SNPs being merged together, followed by the order of the SNPs being merged together. Here, the order of a SNP is its line number in file ".SNPid.txt" minus 1. d) ".rule.bin": is a binary file, contains the rules generated. Every SNP is represented by its line no minus 1 in file ".SNPid.txt". length-1 rule format: 1 SNP1 SNP2 r2 flag This actually represent two rules: SNP1=>SNP2 and SNP2=>SNP1. If flag=1, then 1<=>1, 0<=>0; if flag=0, then 1<=>0, 0<=>1. r2 is a real number between 0 and 1, and it is the correlation value between SNP1 and SNP2. If LHS contains more than one SNP, the format is as follows: k SNP_i1 ... SNP_ik t SNP_j1 R2_j1 flag_j1 SNP_j2 R2_j2 flag_j2 ... SNP_jt R2_jt flat_jt Where k is the length of left hand side, t is the number of SNPs that can be inferred from left hand side. The values of flag_j1, ... flag_jt are set as follows. There are 2^k possible combinations of alleles of SNP_i1, .... SNP_ik on the left hand side, and each combination can be mapped to an integer. When we calculate r^2, we map each combintaion to the two alleles of SNP_j1 (0 or 1). Let n=a1a2..ak, where a1, a2, ... a_k are alleles of SNP_i1, SNP_i2, ..., SNP_ik. If n is mapped to 1 of SNP_j1, then the n-th bit of flag_j1 is set to 1. For example, suppose there are 3 SNPs on the left hand side, so ther are 8 combinations. The mappings between these 8 combinations and the two alleles of SNP_j1 are as follows: 000 -> 0 001 -> 0 010 -> 1 011 -> 0 100 -> 0 101 -> 1 110 -> 0 111 -> 1 Then the value of falg_j1 is 10100100 = 164 Note that in this file, each SNP may actually represent multiple SNPs that are merged with it. As a result, each rule may represent multiple rules, but not every rule is valid. You need to use the distance constraint to filter invalid rules. e) ".tagSNP.txt": Each line contains a tag SNP. Here every tag SNP is represented by its its line no in file ".SNPid.txt" minus 1. f) ".tagrule.txt": contains the tagging rules. Every SNP on the left hand side of a tagging rule must appear in file ".tagSNP.txt". The format of each rule is as follows: k SNP_i1 ... SNP_ik t SNP_j1 R2_j1 flag_j1 ... SNP_jt R2_jt flat_jt Note that a SNP on the left hand side of the rule represents only itself, while a SNP on the right hand side may represent a set of SNPs that are merged with it. Example: FastTagger ENm010 0.05 100 0.95 3 100 0 0 0 temp Credits: This package was implemented by LIU Guimei and is supported in part by SERC PSF grant 072-101-0016. If you use this program, please cite the following reference: [1] Guimei Liu, Yue Wang, Limsoon Wong. FastTagger: An Efficient Algorithm for Genome-Wide Tag SNP Selection. Manuscript, February 2009.