Assembling a genome from long reads (e.g. ONT) using Flye

In this worksheet you will learn how to use Flye to assemble long reads into a genome. Long reads come from third-generation sequencers such as PacBio SMRT or Oxford Nanopore (ONT) instruments.
- Flye can also assemble metagenomes sequenced with long-read technologies (metaFlye). This is not covered here but is described on the Flye GitHub page and in the associated paper.
This worksheet also covers basic filtering of poor-quality reads using Filtlong.

Suggested prerequisites

It is recommended that you complete the Concepts in Computer Programming and UNIX tutorial (basics) tutorials before starting.
Some knowledge of Flye is useful. You can read the paper here, and the GitHub page with the manual and other documents can be found here.
Installing Flye through conda is easiest, so it's suggested that you follow the Setting up and using conda tutorial first.

Dataset

This demonstration uses one of the samples from Hikichi et al., namely the FASTQ file from this Zenodo record.

Steps

Create a folder to store the raw data and the genome assembly
```
mkdir Flye_demo
cd Flye_demo
```
Download the fastq raw data into this folder
```
wget https://zenodo.org/record/4534098/files/DRR187567.fastq.bz2
```
- Note: this is a large file and will take a few minutes, depending on your internet speed
- Note: if you don't have wget you can install it via conda (search for "wget conda" online to find the instructions)
Filtlong and Flye can read gzip-compressed FASTQ files (ending in .gz), but our download is bzip2-compressed (ending in .bz2). We need to convert it before we proceed
```
bzcat DRR187567.fastq.bz2 | gzip -c > DRR187567.fastq.gz
```
- Note: Most sequencers produce .gz files, so this isn't necessary if your file is already in that format.
- Note: This can take a while as the file is quite large
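Because the conversion is slow, it can be worth checking that the new .gz file is intact before moving on. The sketch below is one way to do that: `count_fastq_reads` is a hypothetical helper (not part of Flye or Filtlong) that tests the archive with `gzip -t` and then counts reads, assuming the usual four lines per FASTQ record. It is demonstrated on a tiny made-up file so it can be tried without the real data.

```shell
# Count the reads in a gzipped FASTQ file after checking the archive is intact.
# Assumes the usual four-lines-per-read FASTQ layout.
count_fastq_reads() {
    gzip -t "$1" || return 1    # non-zero exit if the archive is truncated or corrupt
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Tiny made-up file (two reads) so the helper can be tried without the real data.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nACGTACGT\n+\nIIIIIIII\n' | gzip -c > "$demo_dir/demo.fastq.gz"
count_fastq_reads "$demo_dir/demo.fastq.gz"    # prints 2
```

On the real data the call would be `count_fastq_reads DRR187567.fastq.gz`.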
Install Flye using conda
- It is recommended to always install packages in their own environments, so here we will create an environment and install Flye in one step.
```
mamba create -n Flye -c bioconda flye -y
mamba activate Flye 
```
Install filtlong into the Flye environment
- Since we always use these tools together and the packages don't interfere with each other, it is safe to have both in the same environment
```
mamba install -c bioconda filtlong -y
```
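Before moving on, it can help to confirm that both tools are reachable from the active environment. The following is a minimal sketch: `require_tools` is a hypothetical helper that only uses the shell's built-in `command -v` lookup and reports any command it cannot find on the PATH.

```shell
# Report which of the named commands are missing from PATH.
# Returns 0 when everything is found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# With the Flye environment active, both tools should be found.
if require_tools flye filtlong; then
    echo "tools found"
else
    echo "activate the Flye environment (or re-run the install) and try again"
fi
```

If either tool is reported as missing, the usual cause is that the environment has not been activated in the current terminal.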
Filtlong can do a lot of filtering (see overview here), but we will use two primary filtering steps:
- --min_length 1000 to remove any reads shorter than 1000 bp
- --keep_percent 90 to keep only the best 90% of the reads (Filtlong measures this percentage in bases rather than read count, and scores reads on both length and base quality)
```
filtlong --min_length 1000 --keep_percent 90 DRR187567.fastq.gz | gzip > DRR187567_filtered.fastq.gz
```
- Note the use of | gzip > DRR187567_filtered.fastq.gz: Filtlong writes the filtered reads to standard output, the pipe passes them to gzip (which compresses them for use in Flye), and the result is saved as DRR187567_filtered.fastq.gz
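To see what the filtering actually did, the read sets can be summarised before and after. The sketch below assumes four-line FASTQ records; `fastq_length_summary` is a hypothetical helper name, and the demo file is made up so the function can be tried without the real data.

```shell
# Summarise read count, total bases and length range for a gzipped FASTQ file.
# Assumes four-line FASTQ records, so the sequence is line 2 of every 4.
fastq_length_summary() {
    gzip -cd "$1" | awk '
        NR % 4 == 2 {
            len = length($0); n++; total += len
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        END { printf "reads=%d bases=%d min=%d max=%d\n", n, total, min, max }'
}

# Tiny made-up file with reads of 4 bp and 8 bp.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nACGTACGT\n+\nIIIIIIII\n' | gzip -c > "$demo_dir/demo.fastq.gz"
fastq_length_summary "$demo_dir/demo.fastq.gz"    # prints: reads=2 bases=12 min=4 max=8
```

Running it on DRR187567.fastq.gz and then on DRR187567_filtered.fastq.gz gives a quick comparison; after --min_length 1000 the reported minimum should be at least 1000.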
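Assemble the genome with Flye. The full list of options you can pass to this assembler is here, but we will do a basic assembly.
- We know this is an MRSA sample and thus the genome should be around 2.8 Mbp, which we can pass to Flye to help in the assembly
  - This isn't a requirement but can improve accuracy
- We use --nano-corr as these are nanopore reads which we have filtered
  - Replace with --nano-raw if you do not filter
  - Note: Flye's documentation describes --nano-corr as being for reads that were error-corrected by another tool. Filtlong removes reads but does not alter the bases of the reads it keeps, so --nano-raw may be the more appropriate choice even after filtering; check the Flye manual for your version
- -t sets the number of threads to use; we have set this to 4 here but it should be set to whatever your machine can handle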

```
flye --genome-size 2.8m --out-dir DRR187567_flye -t 4 --nano-corr DRR187567_filtered.fastq.gz
```

Assembly takes a long time (average 1-2 hours on a standard laptop).
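Given the run time, it can be worth estimating the expected coverage before starting: total bases in the reads divided by the expected genome size. The sketch below does that arithmetic with awk. `estimate_coverage` is a hypothetical helper, the genome size is whatever you expect (2800000 for the 2.8 Mbp used here), and the demo input is made up.

```shell
# Rough expected coverage = total bases in the reads / expected genome size.
# $1: gzipped FASTQ file, $2: expected genome size in bp (e.g. 2800000)
estimate_coverage() {
    gzip -cd "$1" | awk -v genome="$2" '
        NR % 4 == 2 { total += length($0) }    # sequence lines only
        END { printf "%.1f\n", total / genome }'
}

# Tiny made-up file: 12 bases in total, with a pretend 4 bp genome.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nACGTACGT\n+\nIIIIIIII\n' | gzip -c > "$demo_dir/demo.fastq.gz"
estimate_coverage "$demo_dir/demo.fastq.gz" 4    # prints 3.0
```

For this worksheet the call would be `estimate_coverage DRR187567_filtered.fastq.gz 2800000`. A very low figure suggests the assembly is likely to be fragmented regardless of settings.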

Look at the basic statistics of the final assembly
```
cat DRR187567_flye/assembly_info.txt
```
- The genome looks good initially, as it has assembled into 2 contigs (chromosome and plasmid), both with high coverage (over 100x) and both closed (the circular column, circ., is Y).
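For assemblies with many contigs, reading the table by eye gets tedious. The sketch below prints one line per contig. It assumes the tab-separated layout described in the Flye documentation (name, length, coverage and a Y/N circular flag as the first four columns); `summarise_assembly_info` is a hypothetical helper, and the demo table contains made-up values, not the results for this sample.

```shell
# Print one line per contig from a Flye assembly_info.txt-style table.
# Assumed columns (tab-separated): name, length, coverage, circular (Y/N), ...
summarise_assembly_info() {
    awk -F '\t' '
        /^#/ { next }    # skip the header line
        {
            shape = ($4 == "Y") ? "circular" : "linear"
            printf "%s\t%d bp\t%dx\t%s\n", $1, $2, $3, shape
        }' "$1"
}

# Made-up example values, NOT the results for DRR187567.
demo_dir=$(mktemp -d)
printf '#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n' > "$demo_dir/assembly_info.txt"
printf 'contig_1\t5000\t120\tY\tN\t1\t*\t1\n' >> "$demo_dir/assembly_info.txt"
printf 'contig_2\t300\t80\tN\tN\t1\t*\t2\n' >> "$demo_dir/assembly_info.txt"
summarise_assembly_info "$demo_dir/assembly_info.txt"
```

On the real output the call would be `summarise_assembly_info DRR187567_flye/assembly_info.txt`.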
Deactivate your mamba environment when finished
```
mamba deactivate
```

## Post assembly steps

Once assembly is finished you can do a more in-depth check of the quality and completeness using the BUSCO and Bandage worksheet.