<h1>DeLoxer: Trim LoxP sequence from mate-pair libraries</h1>

<p>This script is designed to trim a set of reads such that they do not
contain a specified sequence, much like a typical adapter trimming
program. However, it also classifies paired-end reads as either
mate-pair or non-mate-pair based on whether or not the specified
sequence occurred <em>between</em> the two reads. This allows one to select
the subset of read pairs most likely to be true mate pairs.</p>

<p>The DeLoxer script was written in support of a new mate-pair
sequencing protocol using circularization by Cre-Lox recombination and
Illumina paired-end sequenceing. This protocol results in mate-pairs
separated by the LoxP circularization adapter. DeLoxer uses the
presence of this adapter sequence to classify the reads and then trims
off the separator since it is not of genomic origin. For more
information, see the publication here:
<a href="http://dx.doi.org/10.1093/nar/gkr1000">http://dx.doi.org/10.1093/nar/gkr1000</a></p>

<p>The name is a portmanteau of "Cre-Lox" and "detox".</p>

<h2>Download</h2>

<p>You can download the latest released version of DeLoxer at this URL:
<a href="http://genomes.sdsc.edu/downloads/deloxer/delox.tar.gz">http://genomes.sdsc.edu/downloads/deloxer/delox.tar.gz</a></p>

<p>If you want to modify the code, you can clone the git repository: 
<code>git clone http://genomes.sdsc.edu/downloads/deloxer/.git</code>.</p>

<h2>Usage</h2>

<p>The standard usage of DeLoxer is this:</p>

<pre><code>$ Rscript delox.R adapter.fasta read1.fastq read2.fastq output_basename
</code></pre>

<p><code>adapter.fasta</code> should be a FASTA sequence file containing exactly
<em>one</em> sequence (any additional sequences after the first are
<em>ignored</em>). This is the adapter sequence. The LoxP sequence used in
the paper is included with DeLoxer in a file called
<code>loxp-adapter.fasta</code>. The next two options are FASTQ files, containing
the first and second reads of a paired-end sequencing run. (There are
also single-end and interleaved paired-end modes, which I do not cover
here.) The final argument is the "base name" of the output files. All
of DeLoxer's output files will begin with this base name, and they
will have various suffixes such as "read1_mate.fastq".</p>

<p>DeLoxer will align each read against this adapter sequence <em>and</em> its
reverse complement using the overlap or "ends-free" alignment
strategy, which is identical to a global alignment except that gaps at
sequence ends are not penalized. Note that this is a true <em>alignment</em>,
allowing for gaps and mismatches, so sequencing errors are tolerated.
For each read, the aligned region will be trimmed off, and based on
where the aligned region occurred, each read pair is classified as
either mate-pair, non-mate-pair, or adapter-negative. If a read is too
short after trimming, it is discarded entirely and its mate is saved
as an unpaired read. Each of these categories is saved in separate
FASTQ output files for use in downstream analysis.</p>

<p>DeLoxer has a number of options. They include alignment parameters, read
orientation, number of parallel processes, and classification scoring
thresholds. The defaults for these options are shown on the help
screen (run <code>delox --help</code>).</p>

<h2>Current status</h2>

<h3>Completed features</h3>

<ul>
<li>Single-read trimmer</li>
<li>Paired-end read trimmer</li>
<li>Mate-pair classifier</li>
<li>Parallel operation</li>
</ul>

<h3>Empirical Performance</h3>

<p>DeLoxer can trim the LoxP sequence from and classify approximately
80,000,000 pairs of 100-bp reads (1 Illumina HiSeq lane) in 13 hours
on a 48-core Opteron server.</p>

<h2>Prerequisites</h2>

<p>DeLoxer is written in the
<a href="http://www.r-project.org/">R programming language</a>, and uses a number
of R packages, both from <a href="http://cran.r-project.org/">CRAN</a> and
<a href="http://www.bioconductor.org/">BioConductor</a>:</p>

<ul>
<li><a href="http://cran.r-project.org/web/packages/optparse/index.html">optparse</a></li>
<li><a href="http://cran.r-project.org/web/packages/multicore/index.html">multicore</a></li>
<li><a href="http://cran.r-project.org/package=foreach">foreach</a></li>
<li><a href="http://cran.r-project.org/web/packages/iterators/index.html">iterators</a></li>
<li><a href="http://cran.r-project.org/web/packages/doMC/index.html">doMC</a></li>
<li><a href="http://www.bioconductor.org/packages/release/bioc/html/ShortRead.html">ShortRead</a></li>
</ul>

<p>An simple R script to install all these packages is included with the
distribution. It is called <code>install-required-packages.R</code> and should
either be run with <code>Rscript</code> or sourced from within an interactive R
session.</p>

<p>DeLoxer was tested and run on R version 2.12.1 (2010-12-16), though it
will probably run on any recent version of R. It has only been tested
on Linux.</p>

<h2>Test suite</h2>

<p>The DeLoxer distribution comes with a small test suite that makes sure
DeLoxer is classifying read pairs correctly. The test script
<code>test/run_tests.sh</code> also provides an example invocation of DeLoxer for
those looking to use it for their own purposes. I recommend running
the test script and examining the output. This will give you a better
idea of how DeLoxer works.</p>

<h2>Simulating an equivalent blunt-end mate-pair library with "BluntLoxer"</h2>

<p>The DeLoxer distribution includes an additional script "bluntlox.R".
This script performs the same alignments as DeLoxer, but instead of
trimming and classifying, it simply excises the matching region of the
read and concatenates the parts before and after the match. The result
is a new file or pair of files containing all the same reads, with the
adapter sequence excised. This can be used to simulate an experiment
that uses blunt-end ligation for circularization but has otherwise
identical performance to a recombination-based circularization method.
Usage is almost identical to the delox.R script, but some options are
removed because they are no longer relevant.</p>

<h2>Future improvements</h2>

<p>The following features might be implemented if someone ever asks for
them.</p>

<h3>Short fragments</h3>

<p>DeLoxer currently assumes that the two reads in a pair do not overlap
(or overlap only slightly). For short fragments, where the read tails
are reverse-complementary to each other, DeLoxer may incorrectly
classify these in the adapter-negative category. Specifically, if the
adapter aligns the to head of one read and the tail of the other, the
pair will be marked as adapter-negative, since it is ambiguous.
However, the exact circumstances under which such misclassification
could occur are such that the read pair would likely not be
informative as a mate-pair anyway. A future update to DeLoxer may
create a new category for these "ambiguous" reads, simply to avoid
putting them in the same category as adapter-negative reads.</p>

<p>To circumvent this limitation, one should use a program such as
<a href="https://github.com/jstjohn/SeqPrep">SeqPrep</a> to merge reads with
overlapping tails, and then only run DeLoxer on the remaining pairs.</p>

<h3>Performance</h3>

<p>R does not have any features for reading or writing sequence data in a
streaming fashion, so DeLoxer reads all the sequence data at once into
memory before beginning any processing. Obviously this is sub-optimal,
and it would be more efficient to read, process, and write the reads
(or read pairs) one by one. This could be implemented in Python, but
the BioPython pairwise alignment module is more difficult to use for
"trimming" applications. Possibly, a combination of Python and R could
be used, using the rpy2 python module to pass data to and from R. In
any case, the current version is fast enough for our purposes.</p>

<h3>Saving more sequence data</h3>

<p>When the LoxP sequence occurs in the middle of a read, then either the
prefix or suffix is discarded, despite the fact that it contains real
genomic sequence. Currently this is not a major problem because with
100 bp reads, only one of them can be long enough. But with longer
reads, it may become useful to keep both prefix and suffix.</p>

<h3>Multiple adapters</h3>

<p>Right now, DeLoxer assumes only a single adapter sequence. If your prep
results in two or more possible adapter sequences, DeLoxer will
probably not be suitable.</p>

<p>Also, Deloxer assumes that the adapter sequence will never occur
multiple times in a single read, and is not guaranteed to produce
correct results for reads that have multiple instances of the adapter
sequence in them.</p>

<h3>Other actions</h3>

<p>It would be nice to be able to do something other than trim the
adapter. For example, any of the following might all be useful in some
cases:</p>

<ul>
<li>Replace with lowercase letters</li>
<li>Replace with all N</li>
<li>Set quality to zero</li>
<li>Save alignment information to a separate file</li>
<li>Do nothing at all (classify reads without modification).</li>
</ul>

<p>I currently have no pressing need for any of the above, but I may be
willing to implement them by request if someone else needs them.</p>

<h2>Author</h2>

<p>Ryan C. Thompson <a href="&#x6D;&#x61;&#105;&#x6C;&#116;&#111;:&#x72;&#99;&#116;&#64;&#x74;&#x68;&#111;&#109;&#x70;&#x73;&#111;&#110;&#x63;&#x6C;&#x61;&#x6E;&#x2E;&#111;&#x72;&#x67;">&#x72;&#99;&#116;&#64;&#x74;&#x68;&#111;&#109;&#x70;&#x73;&#111;&#110;&#x63;&#x6C;&#x61;&#x6E;&#x2E;&#111;&#x72;&#x67;</a></p>




