- In this worksheet you will learn how to use PopPUNK to assign samples to existing epidemiological clusters, and detect new clusters
Suggested prerequisites
- It is recommended that you have followed the Concepts in Computer Programming and UNIX tutorial (basics) tutorials before starting.
- A knowledge of the PopPUNK tool is useful. You can read the paper here and access the manuals here
- The associated databases can be found here
- Installing PopPUNK through conda is easiest so its suggested you have followed the Setting up and using conda tutorial.
Dataset
- This demonstration uses a genome assembly and a unassembled read set of Haemophilus influenzae, both downloaded from the ENA
Steps
- Create a directory for your analyses and step into it
mkdir poppunk_demo
cd poppunk_demo
- Download the sample assembly and read file
- You can save this directly to your terminal current working directory by using the wget command (wget can be installed via conda).
wget https://www.ebi.ac.uk/ena/browser/api/fasta/AP022846.1 -O AP022846.1.fa
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/032/SRR11108932/SRR11108932_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR111/032/SRR11108932/SRR11108932_2.fastq.gz
- We need to create a sample map file which is the in the form
. Data location can be a single file (e,g, fasta) or two files (e.g. read1 and read 2, separated by a tab). - These commands will automatically create the data file for our two samples (file will be called
input.txt
)
- These commands will automatically create the data file for our two samples (file will be called
printf "AP022846\tAP022846.1.fa\n" >input.txt
printf "SRR11108932\tSRR11108932_1.fastq.gz\tSRR11108932_1.fastq.gz\n" >>input.txt
- PopPUNK allows for clustering of query samples against known clusters of the given species. We will use the Haemophilus influenzae dataset which is stored in the PopPUNK database
- You can save this directly to your terminal current working directory by using the wget command (wget can be installed via conda).
- If doing analysis with PopPUNK, always download the reference sequences file for your species, if there is one
wget https://ftp.ebi.ac.uk/pub/databases/pp_dbs/Haemophilus_influenzae_v1_refs.tar.bz2
- Extract the database from the compressed file
- We use the
tar
command to extract from tar files -xjf
tells the program to do a full extract and that it is a bz2 compressed file on top of the tar compression
- We use the
tar -xjf Haemophilus_influenzae_v1_refs.tar.bz2
- Install PopPUNK using conda
- It is recommended to always install packages in their own environments so here will we create an enironment and install PopPunk in one step.
mamba create -n poppunk -c bioconda poppunk -y
mamba activate poppunk
- Assign the query samples to existing clusters, or make new clusters if they are not related to existing ones
--db
is our dataset file as downloaded from the PopPUNK database--query
is the file outlining our samples and their locations on our computer--output
is the folder to store the output files in--threads
is the number of threads to dedicate to the process. My computer has 8 threads so I am dedicating 7
poppunk_assign --db Haemophilus_influenzae_v1_refs --query input.txt --output poppunk_clusters --threads 7
- Once finished we can look at the primary output file in poppunk_clusters
poppunk_clusters_clusters.csv
lists the cluster numbers for each of our inputs
- You can look at the bottom of the
Haemophilus_influenzae_v1_refs_clusters
file in theHaemophilus_influenzae_v1_refs
folder. The final line will list the last cluster number already assigned. If any of your samples are assigned a cluster number higher than this, then it sits in a new cluster not previously found.
- You can visualise the clusters in different programs such as microreact or cytoscape. The full list is outlined on the PopPUNK visualisation page
- We will create a visulation for use with microreact
--ref-db
is our dataset file as downloaded from the PopPUNK database--query-db
is the file output from step 7--output
is the name of the fodler to store the output files for visualisation--microreact
ensures the program creates files compatible with microreact
poppunk_visualise --ref-db Haemophilus_influenzae_v1_refs --query-db poppunk_clusters --output example_viz --microreact
- Open the microreact page and upload the
example_viz.microreact
from theexample_viz
folder.- You can now explore the data and look for the cluster relationships in the networks.
- Note, for our data this is quite messy as the samples were small and few. It is suggested to do this analysis with a lot of samples which have a good sampling depth
- Deactivate your mamba environment when finished
mamba deactivate