As a use case, we will use a 30X subsample of the sequencing data from one Escherichia coli sample (Biosample: SAMN10819847), available from a recent study that compared different long-read technologies for the hybrid assembly of 137 bacterial genomes [4].

Get the data

We have made this subsampled dataset available on Figshare.

# Download data from figshare
wget -O reads.zip https://ndownloader.figshare.com/articles/14036585/versions/4

# Unzip
unzip reads.zip
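
Before moving on, it can help to confirm that the archive unpacked as expected. The sketch below is only a convenience check: `check_reads` is a hypothetical helper, the file names are taken from the preprocessing command used later in this quickstart, and it assumes the archive unpacks them into the current directory.

```shell
# Hypothetical helper: report which expected read files exist (and are non-empty)
# in a directory. File names come from the preprocessing command in this quickstart;
# their location after unzip is an assumption.
check_reads() {
  dir="${1:-.}"
  missing=0
  for f in SRR8482585_30X_1.fastq.gz SRR8482585_30X_2.fastq.gz SRX5299443_30X.fastq.gz; do
    if [ -s "$dir/$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Check the current directory and print a hint if anything is missing
check_reads . || echo "Some read files are missing; check the download and unzip steps."
```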

Now we have the necessary data to perform the quickstart.


The pipeline always uses the fastq file name as the prefix for sub-folders and output files. For instance, if the input is a fastq file named SRR7128258.fastq, the output files and directories will have the string “SRR7128258” in their names.
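
To illustrate the naming rule, the following sketch derives a prefix from a file name in plain shell. It is not the pipeline's own code: `fastq_prefix` is a hypothetical helper, and stripping `.gz` and `.fq` as well as `.fastq` is an assumption made for illustration.

```shell
# Hypothetical helper, not part of the pipeline: drop the directory and
# common FASTQ extensions to obtain the kind of prefix described above.
fastq_prefix() {
  name=$(basename "$1")
  name=${name%.gz}      # assumption: a trailing .gz is ignored
  name=${name%.fastq}
  name=${name%.fq}      # assumption: .fq is treated like .fastq
  echo "$name"
}

fastq_prefix SRR7128258.fastq   # prints SRR7128258
```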


Remember that the pipeline can always be executed with a configuration file. In fact, this is the best way to run it, since all parameters are kept together in a single file.

Preprocessing the data

Outputs will be written to the preprocessed_reads directory.


Whenever you use a REGEX (glob-style pattern) to match files, for example “illumina/SRR9847694_{1,2}.fastq.gz” or “illumina/SRR*.fastq.gz”, it MUST ALWAYS be inside double quotes. Otherwise, the shell may expand the pattern before the pipeline receives it.

Remember: the pipeline does not concatenate the reads. Whenever you use a pattern such as * with unpaired reads, the pipeline will process each matching file separately.
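
The effect of the quotes can be seen with a small shell experiment, independent of the pipeline. In this sketch, `count_args` is a hypothetical helper and the two empty files exist only for the demonstration.

```shell
# Hypothetical helper: print how many arguments a command actually receives.
count_args() { echo "$#"; }

# Create two empty demo files in a temporary directory
demo=$(mktemp -d)
mkdir "$demo/illumina"
touch "$demo/illumina/SRR_A.fastq.gz" "$demo/illumina/SRR_B.fastq.gz"

# Unquoted: the shell expands the pattern into one argument per matching file
unquoted=$(count_args "$demo"/illumina/SRR*.fastq.gz)
# Quoted: the pattern is passed through as a single, unexpanded argument
quoted=$(count_args "$demo/illumina/SRR*.fastq.gz")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
rm -rf "$demo"
```

With the quotes in place, the pattern reaches the pipeline as a single unexpanded string.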

# Running for both Illumina and nanopore data
nextflow run fmalmeida/ngs-preprocess \
  -profile docker \
  --output preprocessed_reads \
  --max_cpus 4 \
  --shortreads "SRR8482585_30X_{1,2}.fastq.gz" \
  --shortreads_type "paired" \
  --fastp_correct_pairs \
  --fastp_merge_pairs \
  --nanopore_fastq "SRX5299443_30X.fastq.gz" \
  --lreads_min_length 1000 \
  --lreads_min_quality 10


These parameters can also be set in a configuration file. See Configuration File.
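
As a rough sketch, the command above could be expressed as a configuration file like the one written below. The file name `quickstart.config` is hypothetical, and the one-to-one mapping of each command-line flag to an entry in the `params` scope is an assumption based on how Nextflow handles `--` parameters; the pipeline's own configuration template remains the authoritative reference.

```shell
# Write a minimal Nextflow config mirroring the command-line parameters above.
# Assumption: each --flag corresponds to a key of the same name under params.
cat > quickstart.config <<'EOF'
params {
  output              = 'preprocessed_reads'
  max_cpus            = 4
  shortreads          = 'SRR8482585_30X_{1,2}.fastq.gz'
  shortreads_type     = 'paired'
  fastp_correct_pairs = true
  fastp_merge_pairs   = true
  nanopore_fastq      = 'SRX5299443_30X.fastq.gz'
  lreads_min_length   = 1000
  lreads_min_quality  = 10
}
EOF

echo "Wrote quickstart.config"
```

The file could then be passed to the pipeline with Nextflow's `-c` option, for example `nextflow run fmalmeida/ngs-preprocess -profile docker -c quickstart.config`.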

Using test profile

As of version v2.5, users can also use a pre-configured test profile, which automatically loads a list of SRA run IDs for download.

# Running for both short and long reads data
nextflow run fmalmeida/ngs-preprocess -profile docker,test


Now you can use these datasets to, for example, assemble and annotate a genome. For this, check out the MpGAP and Bacannot pipelines, which we’ve developed for such tasks.