- In this worksheet you will learn how to build a pangenome as well as a core genome alignment using Roary in Galaxy
Required prerequisite(s)
- You must create an account on https://usegalaxy.eu/ and log in to that account
Suggested prerequisite(s)
- An understanding of how to use Galaxy. Some good guidance: https://www.youtube.com/watch?v=uVNdyrVDYYU
- An understanding of how Roary works: https://academic.oup.com/bioinformatics/article/31/22/3691/240757
Dataset
- This demonstration uses the GFF3 files of a set of closely related bacterial whole genome sequences
- Download this dataset and unzip it to a folder on your computer.
- GFF3 files are annotation files, laying out where each gene starts and stops and its predicted function. It is created by most annotation software such as Bakta.
- You can follow the tutorial on genome assembly and annotation (viral and bacterial) from my Pathogenic Genomics Course if you wish to learn more about this.
Steps
- In your web browser, navigate to https://usegalaxy.eu/
- Log in to your account using the ‘Login or Register’ button in the top navigation bar
- We want to upload all the genome GFF3 files together into a single collection so we can process them together.
- Click ‘Upload Data’ on the left of the screen and then select ‘Collection’ at the top.
- Drag and drop all 6 GFF3 files from the dataset into the box and when finished press ‘Start’
- Once upload is finished, click ‘Build’
- Enter a name for your collection in the ‘Name:’ box and then click ‘Create collection’
- It can take a few minutes for your collection to be built but you should see it appear in green in the right hand history section.
- In the lefthand side menu, in the search box under ‘Tools’ type roary
- Click on ‘Roary the pangenome pipeline - Quickly generate a core gene alignment from gff3 files’
- The Roary tool will now appear in the centre of the screen. This tool will compare multiple gene sets from sequenced genomes and construct their shared pangenome.
- Under ‘Individual gff files or a dataset collection’ click the dropdown and select ‘Collection’
- Under ‘Dataset collection to submit to Roary’ ensure your collection is in the dropdown menu
- We are assuming that these genomes are all from the same species so we will leave the default percentage cut-offs as they are.
- If these were from different species we may want to decrease these cut-offs to be more permissive in finding homologs and not over splitting our data.
- Of note though, pangenomes are meant to be for species level analyses only and Roary is not built for genus level comparisons. If you wish to have a core genome alignment from genus level data, I would recommend Proteinortho or SCARAP
- You can request additional outputs using the checkboxes but for now we will leave these blank and get the default outputs
- Click ‘Run Tool’ to start the analysis.
- This can take a while so be patient
- When done, Roary will produce three output files (Click the eye symbol on each to see their contents):
- Summary statistics: How many genes are in the core, shell and cloud of the pangenome. We would usually expect a large core genome (>1000 genes). Do you have that here? If not, why do you think not?
- Core Gene Alignment: The alignment of all core genes across the genomes. This can be input to phylogenetic programs as outlined in other worksheets.
- Gene Presence Absence: A tab delimited file of each gene group in the pangenome, its predicted function and its distribution across the genomes.