Quickstart
Overview
As a use case, we will use a 30X subsample of the Escherichia coli sequencing data (Biosample: SAMN10819847) from a recent study that compared different long-read technologies for the hybrid assembly of 137 bacterial genomes [4].
Get the data
We have made this subsampled dataset available on Figshare.
# Download data from figshare
wget -O reads.zip https://ndownloader.figshare.com/articles/14036585/versions/4
# Unzip
unzip reads.zip
Now we have the necessary data to perform the quickstart.
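Before moving on, it can help to confirm that the expected files are present. The sketch below assumes the archive unpacks into the current directory and contains the three file names used by the commands later in this quickstart; `check_reads` is a hypothetical helper, not part of the pipeline.

```shell
# Hypothetical helper: report whether each expected input file exists and is non-empty.
check_reads() {
    missing=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "found: $f"
        else
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}

# File names assumed from the preprocessing command used in this quickstart.
check_reads SRR8482585_30X_1.fastq.gz SRR8482585_30X_2.fastq.gz SRX5299443_30X.fastq.gz \
    || echo "some input files are missing; re-run the download step"
```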
Note
The pipeline always uses the fastq file name as the prefix for sub-folders and output files. For instance, for a fastq file named SRR7128258.fastq, the output files and directories will contain the string “SRR7128258” in their names.
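To anticipate the prefix a given input will produce, the naming rule can be approximated in the shell. This is only a sketch: `fastq_prefix` is a hypothetical helper, and the assumption that the pipeline strips exactly the `.gz`, `.fastq` and `.fq` extensions is ours, not taken from the pipeline's source.

```shell
# Hypothetical helper: approximate the output prefix derived from a fastq file name.
# Assumption: the directory part and the .gz / .fastq / .fq extensions are dropped.
fastq_prefix() {
    name=$(basename "$1")
    name=${name%.gz}
    name=${name%.fastq}
    name=${name%.fq}
    echo "$name"
}

fastq_prefix SRR7128258.fastq   # prints SRR7128258
```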
Tip
Remember, the pipeline can always be executed with a configuration file. In fact, this is the best way to run it: once the parameters are stored in a properly written configuration file, launching the pipeline becomes straightforward.
Preprocessing the data
Outputs will be written to the preprocessed_reads directory.
Warning
Whenever you use a pattern (glob) to match files, for example “illumina/SRR9847694_{1,2}.fastq.gz” or “illumina/SRR*.fastq.gz”, it MUST ALWAYS be inside double quotes, so that the shell passes the pattern to the pipeline unexpanded.
Remember: the pipeline does not concatenate the reads. Whenever you use a pattern such as * with unpaired reads, the pipeline will process each matched file separately.
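The effect of the quotes can be seen with a small shell experiment that does not involve the pipeline at all. The file names and the `count_args` helper below are hypothetical and serve only to show how many arguments a program receives in each case.

```shell
# Create two empty placeholder files in a scratch directory (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/sample_1.fastq.gz" "$demo_dir/sample_2.fastq.gz"

# Hypothetical helper: print the number of arguments it receives.
count_args() { echo "$#"; }

# Unquoted: the shell expands the pattern first, so two file names are passed.
unquoted=$(count_args "$demo_dir"/sample_*.fastq.gz)

# Quoted: the pattern is passed untouched as a single argument.
quoted=$(count_args "$demo_dir/sample_*.fastq.gz")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
rm -r "$demo_dir"
```

With the quotes in place, the program itself receives the pattern and decides how to resolve it, which is what the pipeline expects.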
# Running for both Illumina and nanopore data
nextflow run fmalmeida/ngs-preprocess \
-profile docker \
--output preprocessed_reads \
--max_cpus 4 \
--shortreads "SRR8482585_30X_{1,2}.fastq.gz" \
--shortreads_type "paired" \
--fastp_correct_pairs \
--fastp_merge_pairs \
--nanopore_fastq "SRX5299443_30X.fastq.gz" \
--lreads_min_length 1000 \
--lreads_min_quality 10
Note
These parameters can also be set via a configuration file. See Configuration File.
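As an illustration, the same parameters can be stored in a YAML file and passed with Nextflow's generic `-params-file` option. This is a sketch under assumptions: the file name `params.yml` is arbitrary, the two fastp flags are assumed to be plain boolean parameters, and the pipeline's own Configuration File page may recommend a different mechanism.

```shell
# Write the parameters from the command above into a YAML file.
# Assumption: flags given without a value on the command line are booleans.
cat > params.yml <<'EOF'
output: "preprocessed_reads"
max_cpus: 4
shortreads: "SRR8482585_30X_{1,2}.fastq.gz"
shortreads_type: "paired"
fastp_correct_pairs: true
fastp_merge_pairs: true
nanopore_fastq: "SRX5299443_30X.fastq.gz"
lreads_min_length: 1000
lreads_min_quality: 10
EOF
echo "wrote params.yml"
```

The pipeline could then be launched with `nextflow run fmalmeida/ngs-preprocess -profile docker -params-file params.yml`. Because the pattern lives inside the file, the shell never gets a chance to expand it.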
Using test profile
As of version v2.5, users can also use a pre-configured test profile, which automatically loads a list of SRA run IDs for download.
# Running for both short and long reads data
nextflow run fmalmeida/ngs-preprocess -profile docker,test
Afterwards
Now you can use these datasets to, for example, assemble and annotate a genome. For this, check out the MpGAP and Bacannot pipelines, which we developed for these tasks.