Quickstart

As a use case, we will use a 30X subsample of one of the Escherichia coli sequencing datasets (Biosample: SAMN10819847) made available by a recent study that compared different long-read technologies for the hybrid assembly of 137 bacterial genomes [1].

Get the data

We have made this subsampled dataset available on Figshare.

# Download data from figshare
wget -O reads.zip https://ndownloader.figshare.com/articles/14036585/versions/4

# Unzip
unzip reads.zip

Now we have the necessary data to perform the quickstart.
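
Before moving on, it can help to confirm that the input files are in place. The sketch below is only a convenience check, not part of the pipeline: the function name check_inputs is hypothetical, and it assumes the archive unpacks the three files named in the preprocessing command of this quickstart directly into the current directory.

```shell
# Hypothetical helper: report expected input files that are missing or empty.
# Assumption: the Figshare archive unpacks these three files into the given directory.
check_inputs() {
    dir="${1:-.}"
    missing=0
    for f in SRR8482585_30X_1.fastq.gz SRR8482585_30X_2.fastq.gz SRX5299443_30X.fastq.gz; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of problem files (0 means all present).
    return "$missing"
}

check_inputs . || echo "Some inputs are missing; re-check the download and unzip steps."
```

If any file is reported, re-running the wget and unzip steps is the first thing to try.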

Where do my outputs go?

The pipeline always uses the fastq file name as the prefix for sub-folders and output files. For instance, given a fastq file named SRR7128258.fastq, the output files and directories will contain the string "SRR7128258" in their names.
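
The naming rule above can be mimicked with a small shell sketch, which may help when predicting where to look for results. The function name fastq_prefix is hypothetical, and stripping a trailing .gz or .fq is an assumption on our part; the pipeline's own logic may differ in such details.

```shell
# Hypothetical helper: derive the prefix a fastq file name would contribute.
# Assumption: the directory part and the .fastq/.fq(.gz) extension are dropped.
fastq_prefix() {
    name="$(basename "$1")"   # drop any leading directory
    name="${name%.gz}"        # drop an optional .gz suffix
    name="${name%.fastq}"     # drop .fastq
    name="${name%.fq}"        # drop .fq
    printf '%s\n' "$name"
}

fastq_prefix "SRR7128258.fastq"   # prints SRR7128258
```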

Preprocessing the data

Outputs will be written to the preprocessed_reads directory (set with --output).

Watch your REGEX

Whenever you use a REGEX (pattern) to match input files, for example "illumina/SRR9847694_{1,2}.fastq.gz" or "illumina/SRR*.fastq.gz", it MUST ALWAYS be enclosed in double quotes, so that the pattern reaches the pipeline intact instead of being expanded by the shell.

Remember: the pipeline does not concatenate the reads. Whenever you use a pattern such as * with unpaired reads, the pipeline processes each matching file separately.
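
The effect of the double quotes can be seen with a short shell sketch that is independent of the pipeline. The file names and the count_args helper are hypothetical and exist only for this illustration.

```shell
# Create a scratch directory with two empty placeholder files (hypothetical names).
tmpdir="$(mktemp -d)"
touch "$tmpdir/SRR_A.fastq.gz" "$tmpdir/SRR_B.fastq.gz"

# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

count_args "$tmpdir"/SRR*.fastq.gz     # unquoted pattern: the shell expands it first, prints 2
count_args "$tmpdir/SRR*.fastq.gz"     # quoted pattern: passed through as one argument, prints 1

# Remove the scratch directory.
rm -rf "$tmpdir"
```

In the unquoted case the program receives a list of file names instead of the pattern, which is why the pattern must be quoted when it is given to the pipeline.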

# Running for both Illumina and nanopore data
nextflow run fmalmeida/ngs-preprocess \
    -profile docker \
    --output preprocessed_reads \
    --max_cpus 4 \
    --shortreads "SRR8482585_30X_{1,2}.fastq.gz" \
    --shortreads_type "paired" \
    --fastp_correct_pairs \
    --fastp_merge_pairs \
    --nanopore_fastq "SRX5299443_30X.fastq.gz" \
    --lreads_min_length 1000 \
    --lreads_min_quality 10

Using test profile

As of version v2.5, users can also use a pre-configured test profile, which automatically loads a list of SRA run IDs for download.

# Running for both short and long reads data
nextflow run fmalmeida/ngs-preprocess -profile docker,test

Afterwards

Now you can use these datasets to, for example, assemble and annotate a genome. For this, check out the MpGAP and Bacannot pipelines, which we have developed for such tasks.