- In this worksheet you will learn how to use Flye to assemble long reads into a genome. Long reads come from third-generation sequencers such as PacBio (SMRT) or Oxford Nanopore (ONT).
- Flye can also assemble metagenomes sequenced with long-read technologies (metaFlye). This is not covered here but is described on the Flye GitHub page and in the associated paper.
- This worksheet also covers basic filtering of low-quality reads using Filtlong.
## Suggested prerequisites
- It is recommended that you complete the Concepts in Computer Programming and UNIX tutorial (basics) tutorials before starting.
- Some knowledge of Flye is useful. You can read the paper here, and the GitHub page with the manual and other documents can be found here.
- Installing Flye through conda is easiest, so it is suggested that you have followed the Setting up and using conda tutorial.
## Dataset
- This demonstration uses one of the samples from Hikichi et al.: the FASTQ file from this Zenodo record.
## Steps
- Create a folder to store the raw data and the genome assembly
mkdir Flye_demo
cd Flye_demo
- Download the fastq raw data into this folder
wget https://zenodo.org/record/4534098/files/DRR187567.fastq.bz2
- Note: this is a large file and will take a few minutes, depending on your internet speed
- Note: if you don't have wget you can install it via conda (search for "wget conda" on Google to find the instructions)
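If wget is missing, curl is often already installed and can fetch the same file. The following is a minimal sketch, assuming a POSIX shell; the function name `pick_downloader` is hypothetical and only illustrates the check. It prints the command to run instead of running it.

```shell
# Hypothetical helper: report which download tool is available on this machine.
pick_downloader() {
    if command -v wget >/dev/null 2>&1; then
        echo "wget"
    elif command -v curl >/dev/null 2>&1; then
        echo "curl"
    else
        echo "none"
    fi
}

url="https://zenodo.org/record/4534098/files/DRR187567.fastq.bz2"

# Print the suggested command rather than running it, so nothing is downloaded by accident.
case "$(pick_downloader)" in
    wget) echo "wget -c $url" ;;
    curl) echo "curl -L -C - -O $url" ;;
    none) echo "Install wget or curl first (for example via conda)" ;;
esac
```

The `-c` option of wget and the `-C -` option of curl both ask the tool to resume a partial download, which helps with a file of this size.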
- Filtlong and Flye accept gzipped files (files that end in .gz) but not bzip2 files (files that end in .bz2), which is what we downloaded. We need to convert the file before we proceed
bzcat DRR187567.fastq.bz2 | gzip -c > DRR187567.fastq.gz
- Note: Most sequencers produce .gz files, so this step isn't necessary if your file is already in that format.
- Note: This can take a while as the file is quite large
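After a long conversion it can be worth confirming that the new file is intact before moving on. The sketch below assumes a standard FASTQ layout of four lines per read; the helper name `count_reads_gz` is hypothetical.

```shell
# Hypothetical helper: count FASTQ records in a gzipped file (assumes 4 lines per record).
count_reads_gz() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

fq="DRR187567.fastq.gz"
if [ -f "$fq" ]; then
    # gzip -t exits non-zero if the archive is truncated or corrupt.
    gzip -t "$fq" && echo "$fq is a valid gzip file"
    echo "reads: $(count_reads_gz "$fq")"
else
    echo "$fq not found; run the conversion step first"
fi
```

A read count that is not a whole number suggests the file was cut off part way through a record.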
- Install Flye using conda
- It is recommended to always install packages in their own environments, so here we will create an environment and install Flye in one step, then activate it.
mamba create -n Flye -c bioconda flye -y
mamba activate Flye
- Install filtlong into the Flye environment
- Since we always use these tools together and the packages don't interfere with each other, it is safe to have both in the same environment
mamba install -c bioconda filtlong -y
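Before continuing, you may want to confirm that both programs are visible in the active environment. This is a minimal sketch; `check_tools` is a hypothetical helper name, and it only checks that the commands are on your PATH, not that they run correctly.

```shell
# Hypothetical helper: report which of the named programs are on PATH.
# Returns 0 if all were found and 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools flye filtlong || echo "Activate the Flye environment or re-run the install steps"
```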
- Filtlong offers many filtering options (see overview here), but we will apply two main filters:
- --min_length 1000 to remove any reads shorter than 1000 bp
- --keep_percent 90 to keep the best 90% of the data based on read quality (Filtlong measures this percentage in bases rather than in number of reads)
filtlong --min_length 1000 --keep_percent 90 DRR187567.fastq.gz | gzip > DRR187567_filtered.fastq.gz
- Note the use of `| gzip > DRR187567_filtered.fastq.gz` to pass the output of filtlong to the gzip program (which compresses it for use in Flye) and write the result to a file named DRR187567_filtered.fastq.gz
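To see what the filtering did, you can compare simple statistics for the raw and filtered files. The sketch below assumes four lines per FASTQ record; `fastq_stats_gz` is a hypothetical helper name.

```shell
# Hypothetical helper: print read count, total bases and shortest read for a gzipped FASTQ.
# Assumes 4 lines per record, so the sequence is line 2 of each record.
fastq_stats_gz() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 {
            n++
            len = length($0)
            bases += len
            if (n == 1 || len < min) min = len
        }
        END { printf "reads=%d bases=%.0f min_len=%d\n", n, bases, min }'
}

for fq in DRR187567.fastq.gz DRR187567_filtered.fastq.gz; do
    if [ -f "$fq" ]; then
        echo "$fq: $(fastq_stats_gz "$fq")"
    fi
done
```

After filtering, `min_len` should be at least 1000. Dividing `bases` by the expected genome size gives a rough estimate of the sequencing depth that goes into the assembly.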
- Assemble the genome with Flye. The full list of options you can pass to this assembler is here, but we will run a basic assembly.
- We know this is an MRSA sample, so the genome should be around 2.8 Mbp, which we can pass to Flye to help in the assembly
- This isn't a requirement but can improve accuracy
- We use `--nano-corr` as these are nanopore reads which we filtered (corrected)
- Replace with `--nano-raw` if you do not filter
- Note: Flye's documentation describes `--nano-corr` as the mode for reads that were error-corrected with another tool. Filtlong removes low-quality reads but does not correct the remaining ones, so consider `--nano-raw` if the assembly looks poor
- `-t` indicates the number of threads to use; we have set this to 4 here, but it should be set to whatever your machine can handle
flye --genome-size 2.8m --out-dir DRR187567_flye -t 4 --nano-corr DRR187567_filtered.fastq.gz
- Assembly takes a long time (average 1-2 hours on a standard laptop).
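If you are unsure how many threads your machine can handle, the shell can report the number of logical CPUs. This is a sketch only: `detect_threads` is a hypothetical helper, and the command is printed rather than run so that you can check it first.

```shell
# Hypothetical helper: detect logical CPUs, falling back to 1 if detection fails.
detect_threads() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

threads="$(detect_threads)"

# Print the assembly command with the detected thread count; remove 'echo' to run it.
echo flye --genome-size 2.8m --out-dir DRR187567_flye -t "$threads" --nano-corr DRR187567_filtered.fastq.gz
```

Using every CPU can make a laptop unresponsive for the duration of the assembly, so you may prefer a lower value than the one detected.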
- Look at the basic statistics of the final assembly
cat DRR187567_flye/assembly_info.txt
- The genome looks good initially, as it has assembled into 2 contigs (chromosome and plasmid), both with high coverage (over 100x) and both closed (the circular column is Y).
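The same checks can be made with a short awk summary instead of reading the table by eye. This sketch assumes that assembly_info.txt is tab-separated, starts with a header line beginning with `#`, and has the contig length, coverage and circular flag in columns 2, 3 and 4; `summarise_assembly` is a hypothetical helper name.

```shell
# Hypothetical helper: summarise the assembly_info.txt file written by Flye.
# Assumes tab-separated columns: name, length, coverage, circular (Y/N), ...
summarise_assembly() {
    awk -F '\t' '
        /^#/ { next }
        {
            contigs++
            total += $2
            if ($4 == "Y") circular++
            if ($3 < 100) low_cov++
        }
        END {
            printf "contigs=%d total_length=%.0f circular=%d below_100x=%d\n", contigs, total, circular, low_cov
        }' "$1"
}

info="DRR187567_flye/assembly_info.txt"
if [ -f "$info" ]; then
    summarise_assembly "$info"
else
    echo "$info not found; run the assembly first"
fi
```

For an assembly like the one described above, you would hope to see the number of circular contigs equal to the number of contigs, and no contigs below 100x.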
- Deactivate your mamba environment when finished
mamba deactivate
## Post assembly steps
Once assembly is finished you can do a more in-depth check of the quality and completeness using the BUSCO and Bandage worksheet.