Prokka is a tool that facilitates the fast annotation of prokaryotic genomes.
The goals of this tutorial are to:
Download and extract the latest version of prokka:
cd ~/
git clone https://github.com/tseemann/prokka.git
We also will need some dependencies such as bioperl:
sudo apt-get -y install bioperl libdatetime-perl libxml-simple-perl libdigest-md5-perl
This may take a little while.
and we need an XML package from perl
sudo bash
export PERL_MM_USE_DEFAULT=1
export PERL_EXTUTILS_AUTOINSTALL="--defaultdeps"
perl -MCPAN -e 'install "XML::Simple"'
exit
Now, you should be able to add Prokka to your $PATH
and set up the index for the sequence database:
export PATH=$PATH:$HOME/prokka/bin
prokka --setupdb
To make sure the database loaded directly:
prokka --listdb
You should see something like:
tx160085@js-157-212:~$ prokka --listdb
[17:04:15] Looking for databases in: /home/tx160085/prokka/bin/../db
[17:04:15] * Kingdoms: Archaea Bacteria Mitochondria Viruses
[17:04:15] * Genera: Enterococcus Escherichia Staphylococcus
[17:04:15] * HMMs: HAMAP
[17:04:15] * CMs: Bacteria Viruses
Prokka uses a core set of the Uniprot-DB Kingdom sets against which it blasts your samples. It is possible to search in a more specific dataset, e.g. the genus Enterococcus, by adding a few flags to the command.
–usegenus –genus Enterococcus
Question: What do you think you would do for adding to the default databases?
Prokka should be good to go now– you can check to make sure that all is well by typing prokka
. This should print the help screen with all available options. You can find out more about Prokka databases here.
Make a new directory for the annotation:
cd ~/
mkdir annotation
cd annotation
Link the metagenome assembly file into this directory:
ln -fs ~/mapping/subset_assembly.fa .
Now it is time to run Prokka! There are tons of different ways to specialize the running of Prokka. We are going to keep it simple for now, though. It will take a little bit to run.
prokka subset_assembly.fa --outdir prokka_annotation --prefix metagG --metagenome --kingdom Bacteria
Question: Look at the results of the prokka analysis as it prepares your output file. What types of categories are you seeing flash by on the screen?
Don’t worry, the program tends to pause here:
Running: cat prokka_annotation\/sprot\.faa | parallel --gnu --plain -j 6 --block 242000
--recstart '>' --pipe blastp -query --db /home/tx160085/prokka/bin/../db/kingdom/Bacteria/sprot
-evalue 1e-06 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > prokka_annotation\/sprot\.blast 2> /dev/null
This will generate a new folder called prokka_annotation
in which will be a series of files, which are detailed here.
In particular, we will be using the *.ffn
file to assess the relative read coverage within our metagenomes across the predicted genomic regions.
Question: Take a moment and look inside the output files.:
cd ~/annotation/prokka_annotation
less -S *.fsa
less reminders:
*Press space_bar to page down *Press q to exit the less commands
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm. See Kraken Home Page for more information.
Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. See the Prodigal home page for more info. Citation
Prodigal is already installed inside the prokka wrapper, but sometimes it is handy to generate a standalone .gff file for annotation.
cd ~
git clone https://github.com/DerrickWood/kraken.git
cd ~/kraken
mkdir ~/kraken/bin
./install_kraken.sh ~/kraken/bin
export PATH=$PATH:$HOME/kraken/bin
mkdir ~/KRAKEN
cd ~/KRAKEN
wget http://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
tar -xvf minikraken.tgz
cd ~/annotation
mkdir kraken_annotation
kraken --db ~/KRAKEN/minikraken_20141208/ --threads 2 --fasta-input subset_assembly.fa --output kraken_annotation/subset_assembly.kraken
kraken-translate --db ~/KRAKEN/minikraken_20141208/ kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.labels
Kraken has now provided a taxonomic assignment to all of the clusters.
To generate a summary table:
cd ~/annotation
kraken-report --db ~/KRAKEN/minikraken_20141208 kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.report
The top of the file lists all the unclassified sequences, to look at the file and skip over these, do the following:
grep -v ^U ~/annotation/kraken_annotation/subset_assembly.kraken.report | head -n20
The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
- Percentage of reads covered by the clade rooted at this taxon
- Number of reads covered by the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-‘.
- NCBI taxonomy ID
- indented scientific name
Example output:
tx160085@js-157-212:~/annotation/kraken_annotation$ grep -v ^U subset_assembly.kraken.report | head -n20
89.60 8311 8311 U 0 unclassified
10.40 965 0 - 1 root
10.40 965 3 - 131567 cellular organisms
10.37 962 43 D 2 Bacteria
6.51 604 0 P 200918 Thermotogae
6.51 604 0 C 188708 Thermotogae
6.51 604 0 O 2419 Thermotogales
6.51 604 8 F 188709 Thermotogaceae
5.11 474 0 G 28236 Petrotoga
5.11 474 0 S 69499 Petrotoga mobilis
5.11 474 474 - 403833 Petrotoga mobilis SJ95
1.22 113 0 G 1184396 Mesotoga
1.22 113 0 S 1184387 Mesotoga prima
1.22 113 113 - 660470 Mesotoga prima MesG1.Ag.4.2
0.04 4 0 G 651456 Kosmotoga
0.04 4 0 S 651457 Kosmotoga olearia
0.04 4 4 - 521045 Kosmotoga olearia TBF 19.5.1
0.02 2 1 G 2335 Thermotoga
0.01 1 0 S 177758 Thermotoga lettingae
0.01 1 1 - 416591 Thermotoga lettingae TMO
Why use Kraken?
For a simulated metagenome of 100 bp reads in its fastest mode of operation, , Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken’s accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision. Citation
However, kraken is only as sensitive as the provided database, so for unusual samples, a custom database needs to be constructed . The accuracy is very sensitive to the quantity of samples in the database.
cd ~
wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux
tar -xvf v2.6.3.tar.gz
chmod 775 ~/prodigal.linux
Using prodigal with the same set of data, we can get a list of predicted genes.
cd ~/annotation
mkdir prodigal_annotation
~/prodigal.linux -p meta -a prodigal_annotation/subset_assembly.faa -d prodigal_annotation/subset_assembly.fna -f gff -o prodigal_annotation/subset_assembly.gff -i subset_assembly.fa