DeLoxer: Trim LoxP sequence from mate-pair libraries

This script is designed to trim a set of reads such that they do not contain a specified sequence, much like a typical adapter trimming program. However, it also classifies paired-end reads as either mate-pair or non-mate-pair based on whether or not the specified sequence occurred between the two reads. This allows one to select the subset of read pairs most likely to be true mate pairs.

The DeLoxer script was written in support of a new mate-pair sequencing protocol using circularization by Cre-Lox recombination and Illumina paired-end sequenceing. This protocol results in mate-pairs separated by the LoxP circularization adapter. DeLoxer uses the presence of this adapter sequence to classify the reads and then trims off the separator since it is not of genomic origin. For more information, see the publication here: http://dx.doi.org/10.1093/nar/gkr1000

The name is a portmanteau of "Cre-Lox" and "detox".

Download

You can download the latest released version of DeLoxer at this URL: http://genomes.sdsc.edu/downloads/deloxer/delox.tar.gz

If you want to modify the code, you can clone the git repository: git clone http://genomes.sdsc.edu/downloads/deloxer/.git.

Usage

The standard usage of DeLoxer is this:

$ Rscript delox.R adapter.fasta read1.fastq read2.fastq output_basename

adapter.fasta should be a FASTA sequence file containing exactly one sequence (any additional sequences after the first are ignored). This is the adapter sequence. The LoxP sequence used in the paper is included with DeLoxer in a file called loxp-adapter.fasta. The next two options are FASTQ files, containing the first and second reads of a paired-end sequencing run. (There are also single-end and interleaved paired-end modes, which I do not cover here.) The final argument is the "base name" of the output files. All of DeLoxer's output files will begin with this base name, and they will have various suffixes such as "read1_mate.fastq".

DeLoxer will align each read against this adapter sequence and its reverse complement using the overlap or "ends-free" alignment strategy, which is identical to a global alignment except that gaps at sequence ends are not penalized. Note that this is a true alignment, allowing for gaps and mismatches, so sequencing errors are tolerated. For each read, the aligned region will be trimmed off, and based on where the aligned region occurred, each read pair is classified as either mate-pair, non-mate-pair, or adapter-negative. If a read is too short after trimming, it is discarded entirely and its mate is saved as an unpaired read. Each of these categories is saved in separate FASTQ output files for use in downstream analysis.

DeLoxer has a number of options. They include alignment parameters, read orientation, number of parallel processes, and classification scoring thresholds. The defaults for these options are shown on the help screen (run delox --help).

Current status

Completed features

Single-read trimmer
Paired-end read trimmer
Mate-pair classifier
Parallel operation

Empirical Performance

DeLoxer can trim the LoxP sequence from and classify approximately 80,000,000 pairs of 100-bp reads (1 Illumina HiSeq lane) in 13 hours on a 48-core Opteron server.

Prerequisites

DeLoxer is written in the R programming language, and uses a number of R packages, both from CRAN and BioConductor:

An simple R script to install all these packages is included with the distribution. It is called install-required-packages.R and should either be run with Rscript or sourced from within an interactive R session.

DeLoxer was tested and run on R version 2.12.1 (2010-12-16), though it will probably run on any recent version of R. It has only been tested on Linux.

Test suite

The DeLoxer distribution comes with a small test suite that makes sure DeLoxer is classifying read pairs correctly. The test script test/run_tests.sh also provides an example invocation of DeLoxer for those looking to use it for their own purposes. I recommend running the test script and examining the output. This will give you a better idea of how DeLoxer works.

Simulating an equivalent blunt-end mate-pair library with "BluntLoxer"

The DeLoxer distribution includes an additional script "bluntlox.R". This script performs the same alignments as DeLoxer, but instead of trimming and classifying, it simply excises the matching region of the read and concatenates the parts before and after the match. The result is a new file or pair of files containing all the same reads, with the adapter sequence excised. This can be used to simulate an experiment that uses blunt-end ligation for circularization but has otherwise identical performance to a recombination-based circularization method. Usage is almost identical to the delox.R script, but some options are removed because they are no longer relevant.

Future improvements

The following features might be implemented if someone ever asks for them.

Short fragments

DeLoxer currently assumes that the two reads in a pair do not overlap (or overlap only slightly). For short fragments, where the read tails are reverse-complementary to each other, DeLoxer may incorrectly classify these in the adapter-negative category. Specifically, if the adapter aligns the to head of one read and the tail of the other, the pair will be marked as adapter-negative, since it is ambiguous. However, the exact circumstances under which such misclassification could occur are such that the read pair would likely not be informative as a mate-pair anyway. A future update to DeLoxer may create a new category for these "ambiguous" reads, simply to avoid putting them in the same category as adapter-negative reads.

To circumvent this limitation, one should use a program such as SeqPrep to merge reads with overlapping tails, and then only run DeLoxer on the remaining pairs.

Performance

R does not have any features for reading or writing sequence data in a streaming fashion, so DeLoxer reads all the sequence data at once into memory before beginning any processing. Obviously this is sub-optimal, and it would be more efficient to read, process, and write the reads (or read pairs) one by one. This could be implemented in Python, but the BioPython pairwise alignment module is more difficult to use for "trimming" applications. Possibly, a combination of Python and R could be used, using the rpy2 python module to pass data to and from R. In any case, the current version is fast enough for our purposes.

Saving more sequence data

When the LoxP sequence occurs in the middle of a read, then either the prefix or suffix is discarded, despite the fact that it contains real genomic sequence. Currently this is not a major problem because with 100 bp reads, only one of them can be long enough. But with longer reads, it may become useful to keep both prefix and suffix.

Multiple adapters

Right now, DeLoxer assumes only a single adapter sequence. If your prep results in two or more possible adapter sequences, DeLoxer will probably not be suitable.

Also, Deloxer assumes that the adapter sequence will never occur multiple times in a single read, and is not guaranteed to produce correct results for reads that have multiple instances of the adapter sequence in them.

Other actions

It would be nice to be able to do something other than trim the adapter. For example, any of the following might all be useful in some cases:

Replace with lowercase letters
Replace with all N
Set quality to zero
Save alignment information to a separate file
Do nothing at all (classify reads without modification).

I currently have no pressing need for any of the above, but I may be willing to implement them by request if someone else needs them.

Author

Ryan C. Thompson rct@thompsonclan.org