PE-Assembler Read Me

 

0.     Downloading PA

 

1.     Installing PA

 

2.     Input files

a)     lib.info file

b)    sequence data

 

3.     Running PA

a)     hash.exe

b)    err.exe (1)

c)     err.exe (2)

d)    build.exe

e)     scaf.exe

f)      fill.exe

g)     Asm

 

4.     Output files

a)     Initial contigs file

b)    Graph files

c)     Scaffolds

d)    Final contigs file

 

 

 

0.     Download PA

 

PE-Assembler can be downloaded here. The package includes source code, manual (this document in MS Word format) and a sample data set. 

 


1.     Installing PA

 

PE-Assembler is written in C++ and therefore will require a machine with GNU C++ pre-installed.

 

Installing PE-Assembler is fairly straight-forward. Create a main directory (eg: /work/PA/) with two sub-directories ‘src’ (eg: /work/PA/src) and ‘bin’. (eg: /work/PA/bin)  Copy all source files to ‘src’ directory and run ‘compile.sh’. The necessary executable files will be built in ‘bin’ directory.

 

Unzipping the downloaded .tgz file should automatically create the above directory structure.

 

The program is tested in linux environment but should be compatible in any other platforms with GNU C++.

 

 

2.     Input files

 

Each assembly is contained in a single directory. (Data directory) This directory should contain all necessary input files as well as required configuration files. All output files created during the assembly is placed in the same directory.

 

a)      lib.info file

 

lib.info file contains all details about the input sequence data. The first line of file should be the number of libraries/files. Following lines contain information regarding each library. The format is as follows. All entries are <tab> separated.

 

<Number of libraries>

<filename1> <minspan1> <maxspan1> <avespan1> <stddev1> <minoverlap1> <olmm1> <mapmm1> 

<filename2> <minspan2> <maxspan2> <avespan2> <stddev2> <minoverlap2> <olmm2> <mapmm2> 

.

.

 

Important: Always list the library with least insert size first. This is assumed to be the ‘base’ library and expected to have the highest sequencing coverage. Remaining libraries are mainly used to resolve ambiguities due to longer repeats.

 

Filename should not consist of any directory information. All files should reside in the same directory as lib.info file.

 

minspan and maxspan are the minimum and maximum span of the reads. Note that span is measured from beginning of the 5’ read to the beginning of 3’ read. Essentially in this case ,

 

 

span = insert_size – 1 x read_length

 

 

 

 

 

 

 

 


Initially you may need to roughly estimate the span size. But once a single run of the assembly is complete you will be able to accurately gauge the minimum and maximum spans.

 

Average span and standard deviation need not be entered by the user. Leave these values as -1. The scaffolding step will measure these values and update the lib.info file automatically.

 

<minoverlap> is the minimum partial overlap considered for 3’ extension. Similar to k value in de Bruijn graph based methods, this value needs to be set carefully. Setting a high value will cause many overlaps to go undetected and hence produce a highly disconnected assembly, where as setting a low value may cause excessive branch and slow down the program. You may use 2/3 x read length as a default value. Note that this value needs to be strictly smaller than the read length.

 

<olmm> indicates how many mismatches should be allowed for detecting partial overlaps. <mapmm> indicates how many mismatches are allowed for mapping the entire read within the assembled contigs. The current implementation of the hash table fully supports only up to a single mismatch. Therefore some overlaps/maps with 2 or more mismatches may go undetected.

 

b)      Sequence data

 

Sequence data should be in fasta format. Each even line should start with ‘>’ to mark the fasta header. This should be immediately followed in the line below by 5’ read sequence and 3’ read sequence, separated by a <tab>. Example format is given below.

 

>0

tatgttgggttcaaggttaaattgag      ctggttaaaaccatgcgtgcttctat

>1

gaggtgcttagctcatctggcgagaa      ccgtcgcttacacgataccgaccgtt

>2

acatttgtgcagcgtagtgcagtttt      ggccaaaatcagcctggcttaaaggc

>3

atgcgttcgccaggttgaatggtgaa      tgagggctggattatcattcatttag

>4

ctgcgtcggagtcggtcgggttaatc      gcccccttatgaagaaaggaggcgtc

 

 

The content of the fasta header is irrelevant. Both reads are assumed to have come from the same strand and 5’ read should be listed in the first column followed by the 3’ read.

 

To further clarify the orientation: assume both reads are mapped back to the source genome. Both reads should map to the same strand. If it maps to ‘+’ strand, the coordinate of the first read should be less than that of the second read. In case it maps to ‘-’ strand, the coordinate of first read should be higher than that of the second read.

 

All reads in the same file (library) should be of the exact length.

 


3.     Running PA

 

PE-Assembler is executed as a series of separate steps. These are outline below in order of execution.

 

Important: Most programs below outputs significant amount of data to standard output. You are advised to direct standard output to a file using ‘>’.

 

 

a)      hash.exe

 

Usage: ./hash.exe <data dir>

 

hash.exe creates two hash tables per each input sequence file. These hash tables are used for finding overlaps and mapping reads back to the assembled contigs. .hashext file is a alternate version of the .hash file used for finding less stringent overlaps for gap filling stage.

 

 

b)      err.exe (1)

 

Usage: ./err.exe <data dir>

 

err.exe is the program responsible for calculating k-mer frequency and detecting solid/repeat reads. The program is run twice using different parameters.

 

When run with parameters above, err.exe produces output file ‘kmer.freq.dist’. User is required to determine ‘solid kmer cutoff’ and ‘repeat kmer cutoff’ by observing this plot.

 

Figure 2

 

Ideally the plot should have a well defined main peak as in Figure 2. Pick the x-values at the two troughs on either side as the ‘solid cutoff’ (~35 in this case) and ‘repeat cutoff’ (~190). However in some cases a peak may not be clearly visible. In such a case you may use ‘3’ as the solid cutoff and use the point at which curve approaches 0 as the repeat cutoff. You may also choose to ignore the read screening by passing nocorrect option to build.exe.

 


c)      err.exe (2)

 

Usage: ./err.exe <data dir> <solid cutoff> <repeat cutoff>

 

Running err.exe again with these two addition parameters will perform read screening based on the kmer.freq file. It will output a list of solid/non-repeat reads in to the file <filename1>.correct where <filename> is the first sequence data file in lib.info

 

 

d)      build.exe

 

Usage: ./build.exe <data dir> <no. of threads> [-nocorrect]

 

build.exe assembles the raw paired-end reads into an initial set of contigs. This step can be run in parallel by setting no. of threads to value higher than 1.

It is advised that you have at least the same number of idle cores in your machine. Spawning more threads than the number of cores may slow down the program. Running multiple threads does not consume significantly more memory.

 

By default program fill start each contig with a ‘solid read’ as identified by previous step. You may disable this requirement by passing the option -nocorrect  to the program.

 

 

e)      scaf.exe

 

Usage: ./scaf.exe <data dir> <no. of threads>

 

scaf.exe carries out series of steps of which the final result in a set(s) of contig orderings. Initially scaf.exe maps all tags back to the assemble genome to estimate the sequencing coverage and demarcate repeat regions. At the next stage it identifies all chimeric mappings and builds a set of edges which link contigs. The step uses these edges to order the set of contigs in to (multiple) scaffold(s).

 

First two steps can be run in parallel across multiple cores. <no. of threads> parameter determines how many threads should be launched.

 

The scaf.exe produces a subdirectory ‘./contigs’ within the <data dir>. This data is used by the next step and therefore should remain as it is.

 

 

f)        fill.exe

 

Usage: ./fill.exe <data dir> <no. of threads>

 

fill.exe takes scaffold info and attempts to fill the gaps in between each contigs. The final result is the file ‘fill.list’ which lists all gaps in the scaffold its status. (filled/unfilled, overlap. Etc)

 

fill.exe is able to run in parallel, filling multiple gaps at once.

 

 

g)      Asm

 

Usage: java –cp <main dir> pa.Asm <data dir>

 

Asm is a java program which combines the initial contigs and gap filling information to obtain the final set of contigs.

 

The final set of contigs will be written to ‘contigs.fasta’ in <data dir>

 

This is the end of the assembly process.

 


5.           Output files

 

a)      Initial contigs file

 

This fasta file contains immediate set of contigs resulting from build.exe

 

b)      Graph files

 

<Explanation will be added later>

 

c)      Scaffolds

 

<Explanation will be added later>

 

d)      Final contigs file

 

This is final set of contigs, in fasta format.