Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

Oct 2018 update: Porechop is officially unsupported

While I'm happy Porechop has so many users, it has always been a bit klugey and a pain to maintain. I don't have the time to give it the attention it deserves, so I'm going to now officially declare Porechop as abandonware (though the unanswered issues and pull requests reveal that it already has been for some time). I've added a known issues section to the README to outline what I think is wrong with Porechop and how a reimplementation should look. I may someday (no promises though :stuck_out_tongue:) try to rewrite it from a blank canvas to address its faults.

Requirements
Installation
- Install from source
- Build and run without installation
Quick usage examples
How it works
Known adapters
Full usage
Known issues
Acknowledgements
License

Requirements

Linux or macOS
Python 3.4 or later
C++ compiler
- If you're using GCC, version 4.9.1 or later is required (check with g++ --version).
- Recent versions of Clang and ICC should also work (C++14 support is required).

I haven't tried to make Porechop run on Windows, but it should be possible. If you have any success on this front, let me know and I'll add instructions to this README!

Installation

Install from source

Running the setup.py script will compile the C++ components of Porechop and install a porechop executable:

git clone https://github.com/rrwick/Porechop.git
cd Porechop
python3 setup.py install
porechop -h

Notes:

If the install command (python3 setup.py install) complains about permissions, you may need to run it with sudo.
Install just for your user: python3 setup.py install --user
- If you get a strange "can't combine user with prefix" error, read this.
Install to a specific location: python3 setup.py install --prefix=$HOME/.local
Install with pip (local copy): pip3 install path/to/Porechop
Install with pip (from GitHub): pip3 install git+https://github.com/rrwick/Porechop.git
If you'd like to specify which compiler to use, set the CXX variable: export CXX=g++-6; python3 setup.py install
Porechop includes ez_setup.py for users who don't have setuptools installed, though that script is deprecated. So if you run into any installation problems, make sure setuptools is installed on your computer: pip3 install setuptools

Build and run without installation

Running make in Porechop's directory compiles the C++ components without installing an executable. You can then run the program by calling the porechop-runner.py script directly.

git clone https://github.com/rrwick/Porechop.git
cd Porechop
make
./porechop-runner.py -h

Quick usage examples

Basic adapter trimming: porechop -i input_reads.fastq.gz -o output_reads.fastq.gz

Trimmed reads to stdout, if you prefer: porechop -i input_reads.fastq.gz > output_reads.fastq

Demultiplex barcoded reads: porechop -i input_reads.fastq.gz -b output_dir

Demultiplex barcoded reads, straight from Albacore output directory: porechop -i albacore_dir -b output_dir

Also works with FASTA: porechop -i input_reads.fasta -o output_reads.fasta

More verbose output: porechop -i input_reads.fastq.gz -o output_reads.fastq.gz --verbosity 2

Got a big server? porechop -i input_reads.fastq.gz -o output_reads.fastq.gz --threads 40

How it works

Find matching adapter sets

Porechop first aligns a subset of reads (default 10000 reads, change with --check_reads) to all known adapter sets. Adapter sets with at least one high identity match (default 90%, change with --adapter_threshold) are deemed present in the sample.

Identity in this step is measured over the full length of the adapter. E.g. in order to qualify for a 90% match, an adapter could be present at 90% identity over its full length, or it could be present at 100% identity over 90% of its length, but a 90% identity match over 90% of the adapter length would not be sufficient.
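The arithmetic behind that example can be checked with a short sketch. It assumes identity is simply the number of matching bases divided by the total adapter length. The function name and the 100-base adapter are hypothetical, chosen only to keep the numbers round, and this is not Porechop's actual code.

```python
def full_length_identity(matching_bases: int, adapter_length: int) -> float:
    """Percent identity measured over the whole adapter (assumed definition)."""
    if adapter_length <= 0:
        raise ValueError("adapter_length must be positive")
    return 100.0 * matching_bases / adapter_length


ADAPTER_LENGTH = 100       # hypothetical adapter length, for illustration only
ADAPTER_THRESHOLD = 90.0   # default --adapter_threshold

# 90% identity across the full adapter: 90 of 100 bases match.
full_span = full_length_identity(90, ADAPTER_LENGTH)
# 100% identity across 90% of the adapter: again 90 of 100 bases match.
partial_exact = full_length_identity(90, ADAPTER_LENGTH)
# 90% identity across 90% of the adapter: only 81 of 100 bases match.
partial_inexact = full_length_identity(81, ADAPTER_LENGTH)

for label, identity in [
    ("90% identity, full length", full_span),
    ("100% identity, 90% of length", partial_exact),
    ("90% identity, 90% of length", partial_inexact),
]:
    print(label, identity, identity >= ADAPTER_THRESHOLD)
```

The first two cases reach the 90% threshold and the third falls short at 81%, which is why it would not be sufficient.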

The alignment scoring scheme used in this and subsequent alignments can be modified using the --scoring_scheme option (default: match = 3, mismatch = -6, gap open = -5, gap extend = -2).
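One consequence of these defaults can be worked out by hand: with match = 3 and mismatch = -6, an ungapped alignment only scores above zero when more than two thirds of its positions match. The sketch below checks that arithmetic. It deliberately leaves gaps out, because the text above doesn't say how gap open and gap extend combine for a gap of a given length. The function name is hypothetical.

```python
MATCH_SCORE = 3       # default match score
MISMATCH_SCORE = -6   # default mismatch score


def ungapped_score(matches: int, mismatches: int) -> int:
    """Score of an alignment with no gaps under the default scheme."""
    return MATCH_SCORE * matches + MISMATCH_SCORE * mismatches


# Two thirds matching is the break-even point: 20 * 3 - 10 * 6 = 0.
break_even = ungapped_score(20, 10)
# 90% matching scores comfortably above zero: 9 * 3 - 1 * 6 = 21.
high_identity = ungapped_score(9, 1)
# 60% matching scores below zero: 6 * 3 - 4 * 6 = -6.
low_identity = ungapped_score(6, 4)

print(break_even, high_identity, low_identity)
```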

Trim adapters from read ends

The first and last bases in each read (default 150 bases, change with --end_size) are aligned to each present adapter set. When a long enough (default 4, change with --min_trim_size) and strong enough (default 75%, change with --end_threshold) match is found, the read is trimmed. A few extra bases (default 2, change with --extra_end_trim) past the adapter match are removed as well, to make sure no adapter sequence is left behind.

Identity in this step is measured over the aligned part of the adapter, not its full length. E.g. if the last 5 bases of an adapter exactly match the first 5 bases of a read, that counts as a 100% identity match and those bases will be trimmed off. This allows Porechop to effectively trim partially present barcodes.

The default --end_threshold is low (75%) because false positives (trimming off some sequence that wasn't really an adapter) shouldn't be too much of a problem with long reads, as only a tiny fraction of the read is lost.
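The end-trimming decision described in this section can be sketched as follows. This covers only the decision for the start of a read, given an alignment that has already been found; the real alignment is done by Porechop's C++ code. The EndHit type, the function names, the reading of --min_trim_size as a count of aligned adapter bases, and the use of >= for the threshold are all assumptions made for illustration.

```python
from typing import NamedTuple


class EndHit(NamedTuple):
    """Hypothetical summary of an adapter alignment at the start of a read."""
    read_end: int       # index just past the last read base covered by the adapter
    aligned_bases: int  # adapter bases taking part in the alignment
    matches: int        # matching bases within the aligned part


def aligned_identity(hit: EndHit) -> float:
    """Percent identity over the aligned part of the adapter only."""
    return 100.0 * hit.matches / hit.aligned_bases


def trim_read_start(read: str, hit: EndHit, min_trim_size: int = 4,
                    end_threshold: float = 75.0, extra_end_trim: int = 2) -> str:
    """Return the read with the adapter (plus a few extra bases) removed,
    or the unchanged read if the hit is too short or too weak."""
    if hit.aligned_bases < min_trim_size:
        return read
    if aligned_identity(hit) < end_threshold:
        return read
    return read[hit.read_end + extra_end_trim:]


read = "ACGTA" + "G" * 10
# The last 5 adapter bases exactly match the first 5 read bases: 100% identity.
print(trim_read_start(read, EndHit(read_end=5, aligned_bases=5, matches=5)))
# Only 3 aligned bases: shorter than --min_trim_size, so nothing is trimmed.
print(trim_read_start(read, EndHit(read_end=3, aligned_bases=3, matches=3)))
```

Trimming the end of a read would mirror this, removing bases from the other side of the hit.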

Split reads with internal adapters

The entirety of each read is aligned to the present adapter sets to spot cases where an adapter is in the middle of the read, indicating a chimera. When a strong enough match is found (default 85%, change with --middle_threshold), the read is split. If the resulting parts are too short (default less than 1000 bp, change with --min_split_read_size), they are discarded.

The default --middle_threshold (85%) is higher than the default --end_threshold (75%) because false positives in this step (splitting a read that is not chimeric) could be more problematic than false positives in the end trimming step. If false negatives (failing to split a chimera) are worse for you than false positives (splitting a non-chimera), you should reduce this threshold (e.g. --middle_threshold 75).

Extra bases are also removed next to the hit, and how many depends on the side of the adapter. If we find an adapter that's expected at the start of a read, it's likely that what follows is good sequence but what precedes it may not be. Therefore, a few bases are trimmed after the adapter (default 10, change with --extra_middle_trim_good_side) and more bases are trimmed before the adapter (default 100, change with --extra_middle_trim_bad_side). If the found adapter is one we'd expect at the end of the read, then the "good side" is before the adapter and the "bad side" is after the adapter.

Here is a real example of the "good" and "bad" sides of an adapter. The adapter is in the middle of this snippet (SQK-NSK007_Y_Top at about 90% identity). The bases to the left are the "bad" side and their repetitive nature is clear. The bases to the right are the "good" side and represent real biological sequence.

TGTTGTTGTTGTTATTGTTGTTATTGTTGTTGTATTGTTGTTATTGTTGTTGTTGTACATTGTTATTGTTGTATTGTTGTTATTGTTGTTGTATTATCGGTGTACTTCGTTCAGTTACGTATTACTATCGCTATTGTTTGCAGTGAGAGGTGGCGGTGAGCGTTTTCAAATGGCCCTGTACAATCATGGGATAACAACATAAGGAACGGACCATGAAGTCACTTCT
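The splitting and side-dependent trimming described in this section can be sketched as below. It assumes the position of the adapter hit is already known and that it is already classified as a start-type or end-type adapter. The function and parameter names are hypothetical, not Porechop's internals.

```python
from typing import List


def split_at_internal_adapter(
    read: str,
    adapter_start: int,      # index of the first read base in the adapter hit
    adapter_end: int,        # index just past the last read base in the hit
    is_start_adapter: bool,  # True if this adapter is expected at a read start
    good_side_trim: int = 10,         # default --extra_middle_trim_good_side
    bad_side_trim: int = 100,         # default --extra_middle_trim_bad_side
    min_split_read_size: int = 1000,  # default --min_split_read_size
) -> List[str]:
    """Split a chimeric read around one internal adapter hit."""
    if is_start_adapter:
        # Good sequence follows a start adapter, so trim more before it.
        trim_before, trim_after = bad_side_trim, good_side_trim
    else:
        # Good sequence precedes an end adapter, so trim more after it.
        trim_before, trim_after = good_side_trim, bad_side_trim
    left = read[:max(0, adapter_start - trim_before)]
    right = read[adapter_end + trim_after:]
    # Pieces shorter than the minimum size are discarded.
    return [piece for piece in (left, right) if len(piece) >= min_split_read_size]


read = "A" * 2000 + "N" * 30 + "C" * 2000
pieces = split_at_internal_adapter(read, 2000, 2030, is_start_adapter=True)
print([len(piece) for piece in pieces])
```

With a start-type adapter, 100 bases are lost from the piece before the adapter and 10 from the piece after it, so the two pieces here are 1900 and 1990 bases long.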

Discard reads with internal adapters

If you run Porechop with --discard_middle, the reads with internal adapters will be thrown out instead of split.

If you plan on using your reads with Nanopolish, then the --discard_middle option is required. This is because Nanopolish first runs nanopolish index to find a one-to-one association between FASTQ reads and fast5 files. If you ran Porechop without --discard_middle, then you could end up with multiple separate FASTQ reads which are from a single fast5, and this is incompatible with Nanopolish.
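If you drive Porechop from a script, a small helper can make sure --discard_middle is always set for Nanopolish-bound reads. This is only a sketch: the helper name is hypothetical, the file names are the placeholders from the quick usage examples, and it builds the command without running it.

```python
import shlex
from typing import List


def porechop_command_for_nanopolish(input_reads: str, output_reads: str) -> List[str]:
    """Build a Porechop command line that discards reads with internal adapters."""
    return ["porechop", "-i", input_reads, "-o", output_reads, "--discard_middle"]


command = porechop_command_for_nanopolish("input_reads.fastq.gz", "output_reads.fastq.gz")
# Print a copy-pasteable version of the command.
print(" ".join(shlex.quote(arg) for arg in command))
# To actually run it (requires porechop on your PATH):
#   import subprocess; subprocess.run(command, check=True)
```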

This option is also recommended if you are trimming reads from a demultiplexed barcoded sequencing
