/**************************************************************************
 *                           eCEO Source code                             *
 **************************************************************************/

The software in this package is provided "as is" without further support.
However, we are still interested in hearing what you use it for and how.
We welcome comments, suggestions and collaboration. Please contact:

Zhengkui Wang (wangzhengkui@nus.edu.sg)

This README covers the following topics:

1. Prerequisites
2. Input Format
3. How to compile the source code
4. How to run the software in a Hadoop cluster
5. Examples
6. How to run on Cloud

================
1. Prerequisites
================

Hadoop version: 0.20.2

===============
2. Input Format
===============

The user should create a space-delimited file containing the case-control
genotype data as the input for the program. Each line corresponds to one
individual. The first column is the sample ID, an integer. The last column
contains the disease status of each individual, coded as 0 or 1. The
columns in between hold the genotype data, coded as 0, 1 or 2.

The following is a sample data file for 5 individuals (3 cases and 2
controls), each genotyped for 8 SNPs:

1 2 1 0 2 1 0 1 0 1
2 0 1 2 0 2 1 2 2 1
3 1 2 0 1 2 0 1 1 1
4 0 2 1 2 1 2 0 0 0
5 1 0 1 1 2 1 2 1 0

*** Users can modify the source code for different kinds of data. We will
soon release a version of the software that supports different data
formats. ***

=================================
3. How to compile the source code
=================================

Import the source code into your Java programming tool, e.g. Eclipse, and
modify the source code according to your needs. When it is ready,
right-click your project and choose:

Export -> Java -> JAR file -> select the export destination -> Finish

The compiled jar file will be generated.

=================================================
4. How to run the jar file in your Hadoop cluster
=================================================

1) Create the input folder in the HDFS filesystem:

>> hadoop fs -mkdir inputfolder/

2) Put the data file into the input folder in the file system:

>> hadoop fs -put localdatafile inputfolder/

3) Run the program:

>> hadoop jar jar_path/***.jar sg.edu.nus.GeneProcessor inputfolder/ preprocess_output_folder/ two-locus_analysis_output_folder/ three-locus_analysis_output_folder/ topk_retrieval_from_two-locus_output_folder/

There are five args in the execution command:

1st arg: the original data path
2nd arg: the output path of the data preprocessing results
3rd arg: the output path of the two-locus analysis results
4th arg: the output path of the three-locus analysis results
5th arg: the output path of the top-k results retrieved from the
         two-locus analysis results

==========
5. Example
==========

We have provided an example for users to try our software. In this folder
we provide GeneSnp_greedy.jar and GeneSnp_squarechopping.jar, compiled jar
files for 100 SNPs using the greedy and square-chopping distribution
models. The file data100.txt contains the original data for 100 SNPs with
2000 samples. The compiled jar program will perform the data
preprocessing, the two-locus analysis, the three-locus analysis, and the
retrieval of the top 10 values from the two-locus analysis results.
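Before uploading data100.txt (or your own data) to HDFS, you may want to
sanity-check that it follows the input format described in Section 2. The
shell sketch below is not part of eCEO; the file name and the checks are
illustrative only, and a tiny inline sample stands in for your real data:

```shell
# Hypothetical pre-flight check (not part of eCEO): validate that a
# space-delimited genotype file follows the layout from Section 2.
# A tiny two-individual sample stands in for your real data file.
cat > sample.txt <<'EOF'
1 2 1 0 2 1 0 1 0 1
2 0 1 2 0 2 1 2 2 1
EOF

check=$(awk '
NR == 1 { ncols = NF }                 # remember the width of the first row
NF != ncols      { bad = 1 }           # every row must have the same width
$1 !~ /^[0-9]+$/ { bad = 1 }           # first column: integer sample ID
$NF !~ /^[01]$/  { bad = 1 }           # last column: disease status 0 or 1
{ for (i = 2; i < NF; i++)             # middle columns: genotypes 0, 1 or 2
    if ($i !~ /^[012]$/) bad = 1 }
END { print (bad ? "FORMAT ERROR" : "OK") }
' sample.txt)
echo "$check"
```

If the script prints OK, the file matches the expected layout and can be
uploaded with hadoop fs -put; otherwise, fix the offending rows first.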
For a simple test, users can use the following commands (change the folder
paths if you want):

$HADOOP_HOME/bin/hadoop fs -mkdir genesnp/input100
$HADOOP_HOME/bin/hadoop fs -put data100.txt genesnp/input100/

Run the greedy model:

$HADOOP_HOME/bin/hadoop jar GeneSnp_greedy.jar sg.edu.nus.GeneProcessor genesnp/input100 genesnp/output1 genesnp/output2 genesnp/output3 genesnp/output4

Run the square-chopping model:

$HADOOP_HOME/bin/hadoop jar GeneSnp_squarechopping.jar sg.edu.nus.GeneProcessor genesnp/input100 genesnp/output1 genesnp/output2 genesnp/output3 genesnp/output4

The results are stored in HDFS as follows:

genesnp/output1/  - results after the data preprocessing
genesnp/output2/  - results after the two-locus analysis
genesnp/output3/  - results after the three-locus analysis
genesnp/output4/  - results after the top-k retrieval

======================
6. How to run on Cloud
======================

Besides running the program on your own Hadoop cluster, users can also run
it in the cloud, on any Hadoop cluster provided by a cloud provider, e.g.
Amazon Elastic Compute Cloud (EC2) or Amazon Elastic MapReduce.

1) Get a Hadoop cluster running in the cloud.

a) For how to use Amazon Elastic Compute Cloud, users can download the
documentation from
http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/.
We suggest using the tool shipped with the Hadoop package to launch a
cluster with Hadoop already set up. The tool is under
hadooppackage/src/contrib/ec2/bin, where its instructions can also be
found. Alternatively, users can find instructions at
http://wiki.apache.org/hadoop/AmazonEC2#AutomatedScripts.

b) For Amazon Elastic MapReduce, users can download the documentation
from http://aws.amazon.com/elasticmapreduce/.

2) How to run our program.

a) Upload your jar file and your data to the cloud.
b) Then run the jar file exactly as you would on your own cluster; the
instructions can be found above.

More information and examples can be found at
http://www.comp.nus.edu.sg/~wangzk/eCEO.html

We welcome all collaborations!

Credits:

This package was implemented by Zhengkui Wang. If you use this program,
please cite the following reference:

Zhengkui Wang, Yue Wang, Kian-Lee Tan, Limsoon Wong, Divyakant Agrawal.
eCEO: An efficient Cloud Epistasis cOmputing model in genome-wide
association study. Bioinformatics, 27(8):1045-1051, April 2011.