ABMapper

ABMapper: A suffix-array based spliced alignment tool

ABMapper is a portable, easy-to-use package for spliced alignment, junction site detection, and read mapping. The core module is written in C++ and wrapped in Perl scripts.


Publication

Documentation

Examples

Discussion


Background

Next-generation RNA sequencing (RNA-seq) platforms generate far more data per experiment than traditional sequencing technology. The resulting millions of short, sometimes paired-end, reads bring new challenges: they must be mapped back onto the genome to determine gene expression levels and to detect splicing events.

Popular current tools such as BWA and Bowtie are fast and have low memory requirements. However, one drawback of these fast mappers is that they cannot handle reads that span a splice junction, because such a read is not contiguous along the transcribed region of the genome.

Although SplitSeek, TopHat and SpliceMap can perform spliced alignment, they rely on the output of an existing mapper and do not support original splice junction mapping.

ABMapper was developed as an in-house standalone tool specifically for mapping sequencing reads that span across splicing junctions, or reads that have multiple putative locations within the genome. It adopts a fast suffix-array algorithm and a dual-seed strategy to find all putative locations of reads.
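
As an illustration of the idea only, the following minimal Python sketch looks up two seeds of a read in a suffix array. It is not ABMapper's C++ implementation: the function names, the toy sequences, the naive in-memory index build, and the choice of the first and last k bases as the two seeds are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, seed):
    """Return every start position of seed in text using binary search on the suffix array."""
    # Truncating sorted suffixes to len(seed) keeps them sorted, so bisect applies.
    prefixes = [text[i:i + len(seed)] for i in sa]
    lo = bisect_left(prefixes, seed)
    hi = bisect_right(prefixes, seed)
    return sorted(sa[lo:hi])


def dual_seed_hits(text, sa, read, k):
    """Look up a seed from each end of the read (assumed seed placement)."""
    head_hits = find_occurrences(text, sa, read[:k])
    tail_hits = find_occurrences(text, sa, read[-k:])
    return head_hits, tail_hits


# Toy reference (hypothetical sequences): two "exons" separated by an "intron".
exon1 = "ACGTTGCA"
intron = "GTAAGTTTTTCCAG"
exon2 = "TCGGATCC"
genome = exon1 + intron + exon2
sa = build_suffix_array(genome)

# A read spanning the junction is not contiguous in the genome ...
read = exon1[-4:] + exon2[:4]
assert genome.find(read) == -1

# ... but each of its two seeds can still be located separately.
head_hits, tail_hits = dual_seed_hits(genome, sa, read, 4)
print(head_hits, tail_hits)  # [4] [22]
```

The two hit positions are farther apart than the read is long, which is the signal that the read crosses a junction rather than lying within a single exon.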

ABMapper can be used not only for splice junction detection but also for exonic alignment.


Version History

2010.11.10 version 2.0.3
Bugs fixed:
1. Seed occurrences were generated incorrectly for reverse-complementary (RC) reads, so hit reads had duplicated fragments in the output file (*_tbl.txt).
2. total_count, the total number of hits over all searched reads, was computed incorrectly.
3. The current working directory was parsed incorrectly in the subroutine, which could cause the call to the “ABMapper” program to fail.
Enhancement:
1. Added an error check when the wrapper, ‘runABMapper.pl’, loads the program, to ensure that all executable files are in the same working directory and have the correct access privileges.

2010.09.10 version 2.0.2
Fixed a bug in the output of unhit reads

2010.09.07 version 2.0.1

Fixed some bugs
Released Windows 32-bit and 64-bit binaries with source code

2010.09.02 version 2.0

Changes:
1. Validation of input files
When the user specifies an invalid input file in filelist.txt, the program ignores that file and continues to run. A warning message instructs the user to exit first, supply a valid file, and re-run the program.
2. Hard-disk based suffix array
Given the reference sequences, we build suffix arrays on the hard disk and generate index files, which are then loaded into main memory. For example, for the 24 human genome sequences, it takes about 20 minutes to build the index files for all the sequences and only about 1 minute to load them into main memory.
Note that once the index files have been generated in the first run of ABMapper, they do not need to be rebuilt in subsequent runs. The index files are about four times the size of the sequences: the 24 human genome sequences total about 3 GB, so the index files require about 12 GB of hard-disk space.
3. Seeding strategy
Given a read P, we place two seeds of length k in it. For example, the first seed, at the beginning of the read, is P[1..k]. If a seed has too many occurrences, it very likely falls in a repeat. We then lengthen the seed by several nucleotides, e.g. to P[1..k+2], which reduces the number of occurrences.
4. Intermediate structure of fragment nodes and pair info nodes
After one genome sequence is processed, the intermediate structures are output to files, and the space in memory is released.
5. Output
There are three output types regarding file outputs:
Output = 0, no file output, and only screen outputs;
Output = 1, file outputs for all the genome sequences <= user specified maximum output;
Output = 2, file outputs for each genome sequence <= user specified maximum output.

6. Fixed BED output bugs
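
The seed-lengthening rule described under "Seeding strategy" above can be sketched in a few lines of Python. This is a hedged illustration, not ABMapper's code: the names `count_occurrences` and `extend_seed`, the `max_hits` threshold, and the toy sequences are hypothetical, and a naive scan stands in for the suffix-array lookup. The 1-based seed P[1..k] corresponds to `read[:k]` here, and the default step of 2 follows the P[1..k+2] example.

```python
def count_occurrences(text, seed):
    """Count (possibly overlapping) occurrences of seed in text with a naive scan."""
    return sum(
        1
        for i in range(len(text) - len(seed) + 1)
        if text.startswith(seed, i)
    )


def extend_seed(text, read, k, max_hits, step=2):
    """Lengthen the seed read[:k] by `step` bases while it occurs more than max_hits times.

    max_hits is an assumed, user-chosen threshold; the seed never grows
    beyond the read itself.
    """
    length = k
    while length < len(read) and count_occurrences(text, read[:length]) > max_hits:
        length = min(length + step, len(read))
    return read[:length]


# Toy reference with a dinucleotide repeat (hypothetical sequence).
genome = "ACACACACACGTTT"
read = "ACACGTTT"

# "AC" occurs 5 times and "ACAC" 4 times, so the seed grows until it is unique.
print(extend_seed(genome, read, k=2, max_hits=1))  # ACACGT
```

A higher `max_hits` stops the extension earlier: with `max_hits=4` the same call returns `ACAC`, trading specificity for a shorter seed.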

2010.08.02 Fixed SAM output bug

2010.07.18 version 1.01