/**************************************************************************
 *                           eCEO Source code                             *
 **************************************************************************/

The software in this package is provided "as is" without further support.
However, we are still interested in hearing what you use it for and how.
We welcome comments, suggestions and collaboration. Please contact:

Zhengkui Wang (wangzhengkui@nus.edu.sg)

This README covers the following topics:

1. Prerequisites
2. Input Format
3. How to compile the source code
4. How to run the software in a Hadoop cluster
5. Examples
6. How to run on Cloud

================
1. Prerequisites
================

Hadoop version: 0.20.2

===============
2. Input Format
===============

The user should create a space-delimited file containing the case-control
genotype data as the input for the program. Each line corresponds to one
individual. The first column is the sample ID, an integer. The last column
contains the disease status of each individual, coded as 0 or 1. The
columns in between hold the genotype data, coded as 0, 1 or 2.

The following is a sample data file for 5 individuals (3 cases and 2
controls), each genotyped for 8 SNPs:

1 2 1 0 2 1 0 1 0 1
2 0 1 2 0 2 1 2 2 1
3 1 2 0 1 2 0 1 1 1
4 0 2 1 2 1 2 0 0 0
5 1 0 1 1 2 1 2 1 0

*** Users can modify the source code for different kinds of data. We will
soon release a version of the software that supports different data
formats. ***

=================================
3. How to compile the source code
=================================

Import the source code into your Java programming tool, e.g. Eclipse, and
modify the source code according to your needs. When it is ready,
right-click your project and choose:

Export -> Java -> JAR file -> select the export destination -> Finish

The compiled jar file will be generated.

=================================================
4. How to run the jar file in your Hadoop cluster
=================================================

1) Create the input folder in the HDFS filesystem:

>> hadoop fs -mkdir inputfolder/

2) Put the data file into the input folder in the file system:

>> hadoop fs -put localdatafile inputfolder/

3) Run the program:

>> hadoop jar jar_path/***.jar sg.edu.nus.GeneProcessor inputfolder/ preprocess_output_folder/ two-locus_analysis_output_folder/ three-locus_analysis_output_folder/ topk_retrieval_from_two-locus_output_folder/

There are five args in the execution command:

1st arg: the original data path
2nd arg: the output path of the data preprocessing results
3rd arg: the output path of the two-locus analysis results
4th arg: the output path of the three-locus analysis results
5th arg: the output path of the top-k results retrieved from the
         two-locus analysis results

==========
5. Example
==========

We have provided an example for users to try our software. In this folder
we provide GeneSnp_greedy.jar and GeneSnp_squarechopping.jar, compiled jar
files for 100 SNPs using the greedy and square-chopping distribution
models. The file data100.txt contains the original data for 100 SNPs with
2000 samples. The compiled jar program will perform the data
preprocessing, the two-locus analysis, the three-locus analysis, and the
retrieval of the top 10 values from the two-locus analysis results.
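Before uploading data100.txt (or your own data) to HDFS, you may want to
sanity-check that it follows the input format described in Section 2. The
shell sketch below is not part of eCEO; the file name and the checks are
illustrative only, and a tiny inline sample stands in for your real data:

```shell
# Hypothetical pre-flight check (not part of eCEO): validate that a
# space-delimited genotype file follows the layout from Section 2.
# A tiny two-individual sample stands in for your real data file.
cat > sample.txt <<'EOF'
1 2 1 0 2 1 0 1 0 1
2 0 1 2 0 2 1 2 2 1
EOF

check=$(awk '
NR == 1 { ncols = NF }                 # remember the width of the first row
NF != ncols      { bad = 1 }           # every row must have the same width
$1 !~ /^[0-9]+$/ { bad = 1 }           # first column: integer sample ID
$NF !~ /^[01]$/  { bad = 1 }           # last column: disease status 0 or 1
{ for (i = 2; i < NF; i++)             # middle columns: genotypes 0, 1 or 2
    if ($i !~ /^[012]$/) bad = 1 }
END { print (bad ? "FORMAT ERROR" : "OK") }
' sample.txt)
echo "$check"
```

If the script prints OK, the file matches the expected layout and can be
uploaded with hadoop fs -put; otherwise, fix the offending rows first.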
For a simple test, users can use the following commands (change the folder
paths if you want):

$HADOOP_HOME/bin/hadoop fs -mkdir genesnp/input100
$HADOOP_HOME/bin/hadoop fs -put data100.txt genesnp/input100/

Run the greedy model:

$HADOOP_HOME/bin/hadoop jar GeneSnp_greedy.jar sg.edu.nus.GeneProcessor genesnp/input100 genesnp/output1 genesnp/output2 genesnp/output3 genesnp/output4

Run the square-chopping model:

$HADOOP_HOME/bin/hadoop jar GeneSnp_squarechopping.jar sg.edu.nus.GeneProcessor genesnp/input100 genesnp/output1 genesnp/output2 genesnp/output3 genesnp/output4

The results are stored in HDFS as follows:

genesnp/output1/  - results after the data preprocessing
genesnp/output2/  - results after the two-locus analysis
genesnp/output3/  - results after the three-locus analysis
genesnp/output4/  - results after the top-k retrieval

======================
6. How to run on Cloud
======================

Besides running the program on your own Hadoop cluster, users can also run
it in the cloud, on any Hadoop cluster provided by a cloud provider, e.g.
Amazon Elastic Compute Cloud (EC2) or Amazon Elastic MapReduce.

1) Get a Hadoop cluster running in the cloud.

a) For how to use Amazon Elastic Compute Cloud, users can download the
documentation from
http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/.
We suggest using the tool shipped with the Hadoop package to launch a
cluster with Hadoop already set up. The tool is under
hadooppackage/src/contrib/ec2/bin, where its instructions can also be
found. Alternatively, users can find instructions at
http://wiki.apache.org/hadoop/AmazonEC2#AutomatedScripts.

b) For Amazon Elastic MapReduce, users can download the documentation
from http://aws.amazon.com/elasticmapreduce/.

2) How to run our program.

a) Upload your jar file and your data to the cloud.
b) Then run the jar file exactly as you would on your own cluster; the
instructions can be found above.

More information and examples can be found at
http://www.comp.nus.edu.sg/~wangzk/eCEO.html

We welcome all collaborations!

Credits:

This package was implemented by Zhengkui Wang. If you use this program,
please cite the following reference:

Zhengkui Wang, Yue Wang, Kian-Lee Tan, Limsoon Wong, Divyakant Agrawal.
eCEO: An efficient Cloud Epistasis cOmputing model in genome-wide
association study. Bioinformatics, 27(8):1045-1051, April 2011.