Assembling a genome from long reads (e.g. ONT) using Flye

  • In this worksheet you will learn how to use Flye to assemble long reads into a genome. Long reads come from third-generation sequencing platforms such as PacBio SMRT or Oxford Nanopore Technologies (ONT).
    • Flye can also assemble metagenomes sequenced with long-read technologies (metaFlye). This is not covered here but is described in the Flye GitHub repository and the associated paper.
  • This worksheet also covers basic filtering of low-quality reads using Filtlong.

Suggested prerequisites

Dataset

Steps

  1. Create a folder to store the raw data and the genome assembly
    mkdir Flye_demo
    cd Flye_demo
    
  2. Download the fastq raw data into this folder
    wget https://zenodo.org/record/4534098/files/DRR187567.fastq.bz2
    
    • Note: this is a large file and will take a few minutes, depending on your internet speed
    • Note: if you don't have wget you can install it via conda (search for “wget conda” to find the instructions)
  3. Filtlong and Flye work with gzipped files (files that end in .gz) but cannot read bzip2 files (files that end in .bz2), which is what we downloaded. We need to convert the file before we proceed
    bzcat DRR187567.fastq.bz2 | gzip -c > DRR187567.fastq.gz
    
    • Note: Most sequencers produce .gz files, so this step isn't necessary if your file is already in that format.
    • Note: This can take a while as the file is quite large
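Before moving on, it can be worth confirming that the conversion produced an intact file. The sketch below is one way to do that, assuming the standard gzip and awk tools are available. count_fastq_reads is a helper name introduced here for illustration, and it assumes every FASTQ record spans exactly four lines.

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file.
# Assumes the usual 4-lines-per-record FASTQ layout (no wrapped sequences).
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Only run the checks if the converted file from step 3 is present.
if [ -f DRR187567.fastq.gz ]; then
    # gzip -t tests the integrity of the compressed file without extracting it
    gzip -t DRR187567.fastq.gz && echo "DRR187567.fastq.gz is a valid gzip file"
    echo "Reads: $(count_fastq_reads DRR187567.fastq.gz)"
fi
```

Because the file is large, both checks take a little while, as each one reads the whole archive.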
  4. Install Flye using conda
    • It is recommended to always install packages in their own environments, so here we will create an environment and install Flye in one step.
      mamba create -n Flye -c bioconda flye -y
      mamba activate Flye 
      
  5. Install filtlong into the Flye environment
    • Since we always use these tools together and the packages don't interfere with each other, it is safe to have both in the same environment
      mamba install -c bioconda filtlong -y
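A quick way to confirm that both tools ended up in the active environment is to check that they are on the PATH. The following is a minimal sketch; check_tool is a hypothetical helper name introduced here, not part of Flye or Filtlong.

```shell
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found (is the Flye environment activated?)"
        return 1
    fi
}

# Check both tools; '|| true' keeps the script going if one is missing.
check_tool flye || true
check_tool filtlong || true
```

If both are found, running flye --version and filtlong --version prints the installed versions, which is useful to record alongside the assembly.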
      
  6. Filtlong can do a lot of filtering (see overview here) but we will do two primary filtering steps:
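    • --min_length 1000 to remove any reads shorter than 1000 bp
    • --keep_percent 90 to keep only the best 90% of the reads, measured in bases rather than read count, as ranked by Filtlong's read score (which takes read length and base quality into account)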
      filtlong --min_length 1000 --keep_percent 90 DRR187567.fastq.gz | gzip > DRR187567_filtered.fastq.gz
      
    • Note the use of | gzip > DRR187567_filtered.fastq.gz: Filtlong writes the filtered reads to standard output, the pipe passes them to gzip for compression, and the compressed result is saved as DRR187567_filtered.fastq.gz for use in Flye
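Since --keep_percent is measured in bases, comparing total bases (not just read counts) before and after filtering is the more informative check. The sketch below assumes gzip and awk are available and that each FASTQ record spans four lines; fastq_stats is a hypothetical helper name introduced for this example.

```shell
# Hypothetical helper: print the read count and total bases of a gzipped FASTQ.
# Assumes 4 lines per record; the sequence is line 2 of each record.
fastq_stats() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
        END { printf "reads=%d bases=%.0f\n", reads, bases }'
}

# Report statistics for the raw and filtered files, skipping any that are missing.
for f in DRR187567.fastq.gz DRR187567_filtered.fastq.gz; do
    if [ -f "$f" ]; then
        echo "$f: $(fastq_stats "$f")"
    fi
done
```

The filtered file should report fewer reads and fewer bases than the raw file; if the two lines are identical, the filtering step probably did not run as intended.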
  7. Assemble the genome with Flye. The full list of options you can pass to this assembler is here, but we will do a basic assembly.
    • We know this is an MRSA sample and thus the genome should be around 2.8Mbp, which we can pass to Flye to help in the assembly
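      • This isn't a requirement but can improve accuracy
    • We use --nano-corr as these are Nanopore reads which we filtered and are treating as corrected. Note that filtering removes poor reads but does not correct errors in the reads that remain; Flye intends --nano-corr for reads that have been error-corrected
    • Replace with --nano-raw if you do not filter
    • -t indicates the number of threads to use; we have set this to 4 here, but it should be set to whatever your machine can handle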
      flye --genome-size 2.8m --out-dir DRR187567_flye -t 4 --nano-corr DRR187567_filtered.fastq.gz

    • Assembly takes a long time (average 1-2 hours on a standard laptop).
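Rather than hard-coding the thread count, the number of CPU cores can be detected at run time. This is a sketch under the assumption that getconf is available (it is on most Linux and macOS systems); the fallback of 4 simply mirrors the value used in the command above, and the Flye options are the same ones already shown.

```shell
# Detect the number of online CPU cores; fall back to 4 if getconf fails.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# Guard against empty or non-numeric output (e.g. "undefined").
case "$threads" in
    ''|*[!0-9]*) threads=4 ;;
esac
echo "Using $threads threads"

# Run the same assembly as above, but only if Flye and the filtered reads exist.
if command -v flye >/dev/null 2>&1 && [ -f DRR187567_filtered.fastq.gz ]; then
    flye --genome-size 2.8m --out-dir DRR187567_flye -t "$threads" \
        --nano-corr DRR187567_filtered.fastq.gz
fi
```

On a laptop you may prefer to use one or two fewer threads than the detected count so the machine stays responsive during the assembly.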
  8. Look at the basic statistics of the final assembly
    cat DRR187567_flye/assembly_info.txt
    
    • The genome looks good initially, as it has assembled into 2 contigs (chromosome and plasmid) with high coverage (both over 100x), and both are closed (the circ. column is Y).
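The same checks (number of contigs, coverage, circularity) can be summarised automatically. The sketch below assumes the tab-separated assembly_info.txt layout that Flye writes, with a header line starting with # and the length, coverage and circularity in columns 2 to 4. summarise_assembly is a hypothetical helper name, and the 100x threshold is taken from the observation above rather than from any Flye recommendation.

```shell
# Hypothetical helper: summarise a Flye assembly_info.txt file.
# Assumes tab-separated columns: seq_name, length, cov., circ., repeat,
# mult., alt_group, graph_path, with a '#'-prefixed header line.
summarise_assembly() {
    awk -F'\t' '
        /^#/ { next }                        # skip the header line
        {
            contigs++
            total += $2                      # column 2: contig length
            if ($4 == "Y") circular++        # column 4: circular (Y/N)
            if ($3 < 100) lowcov++           # column 3: mean coverage
        }
        END {
            printf "contigs=%d total_bp=%.0f circular=%d below_100x=%d\n",
                contigs, total, circular, lowcov
        }' "$1"
}

# Summarise the assembly from step 7 if it exists.
if [ -f DRR187567_flye/assembly_info.txt ]; then
    summarise_assembly DRR187567_flye/assembly_info.txt
fi
```

For an assembly like the one described above, you would expect two contigs, both circular, none below 100x, and a total length close to the 2.8 Mbp estimate.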
  9. Deactivate your mamba environment when finished
    mamba deactivate
    

Post assembly steps

Once assembly is finished you can do a more in-depth check of the quality and completeness using the BUSCO and Bandage worksheet.