Annotation with Prokka

Prokka is a tool that facilitates the fast annotation of prokaryotic genomes.

The goals of this tutorial are to:

Install Prokka
Use Prokka to annotate our genomes

Installing Prokka

Download the latest version of Prokka by cloning its repository:

cd ~/
git clone https://github.com/tseemann/prokka.git

We will also need some dependencies, such as BioPerl:

sudo apt-get -y install bioperl libdatetime-perl libxml-simple-perl libdigest-md5-perl

This may take a little while.

We also need an XML package from Perl:

sudo bash
export PERL_MM_USE_DEFAULT=1
export PERL_EXTUTILS_AUTOINSTALL="--defaultdeps"
perl -MCPAN -e 'install "XML::Simple"'
exit

Now, you should be able to add Prokka to your $PATH and set up the index for the sequence database:

export PATH=$PATH:$HOME/prokka/bin
prokka --setupdb
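The export above only lasts for the current shell session. Below is a minimal sketch for making it persistent, assuming Prokka was cloned into ~/prokka and that your shell reads ~/.bashrc at startup. The helper name add_to_path_once is hypothetical and not part of Prokka.

```shell
# Hypothetical helper: append a PATH export to a shell startup file,
# but only if that exact line is not already present.
add_to_path_once() {
    dir=$1
    rc_file=$2
    line="export PATH=\$PATH:$dir"
    # -x matches the whole line, -F treats the pattern as a fixed string
    if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Assumption: Prokka lives in ~/prokka and your shell reads ~/.bashrc.
add_to_path_once "$HOME/prokka/bin" "$HOME/.bashrc"
```

Running the helper twice leaves a single line in the file, so it is safe to repeat.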

To make sure the database loaded correctly:

prokka --listdb

You should see something like:

tx160085@js-157-212:~$ prokka --listdb
[17:04:15] Looking for databases in: /home/tx160085/prokka/bin/../db
[17:04:15] * Kingdoms: Archaea Bacteria Mitochondria Viruses
[17:04:15] * Genera: Enterococcus Escherichia Staphylococcus
[17:04:15] * HMMs: HAMAP
[17:04:15] * CMs: Bacteria Viruses

Prokka uses a core set of UniProt kingdom databases, against which it BLASTs your sequences. It is possible to search a more specific dataset, e.g. the genus Enterococcus, by adding a few flags to the command:

--usegenus --genus Enterococcus
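For context, here is a hedged sketch of where those flags sit in a complete command. The input file contigs.fa, the output directory, and the prefix are placeholders, not files from this tutorial. The command is only printed unless Prokka and the input file are both present.

```shell
genus="Enterococcus"
input="contigs.fa"            # placeholder input file (assumption)
outdir="prokka_${genus}"      # placeholder output directory (assumption)

# --usegenus asks Prokka to use a genus-specific BLAST database;
# --genus names it (it must be listed under "Genera" by `prokka --listdb`).
cmd="prokka $input --outdir $outdir --prefix $genus --usegenus --genus $genus"

if command -v prokka >/dev/null 2>&1 && [ -f "$input" ]; then
    $cmd
else
    echo "Would run: $cmd"
fi
```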

Question: How do you think you would add to the default databases?

Prokka should be good to go now. You can check that all is well by typing prokka, which should print the help screen with all available options. You can find out more about Prokka databases here.
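Beyond the help screen, it can be useful to confirm that the external programs Prokka calls are on your $PATH. The sketch below uses a hypothetical helper, missing_tools. The list of program names is an assumption about a typical Prokka setup, so adjust it to match your installation.

```shell
# Hypothetical helper: print the names of programs that are not on $PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Print the missing names without the leading space; empty if none.
    echo "$missing" | sed 's/^ //'
}

# Assumption: these are the external programs your Prokka run will call.
not_found=$(missing_tools prokka blastp makeblastdb hmmscan aragorn prodigal parallel)
if [ -n "$not_found" ]; then
    echo "Not on PATH: $not_found"
else
    echo "All listed tools were found."
fi
```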

Running Prokka

Make a new directory for the annotation:

cd ~/
mkdir annotation
cd annotation

Link the metagenome assembly file into this directory:

ln -fs ~/mapping/subset_assembly.fa .
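Because ln -fs succeeds even when its target does not exist, a quick sanity check on the link can save a failed run. This is a minimal sketch; the helper name looks_like_fasta is hypothetical.

```shell
# Hypothetical helper: succeed only if the file exists (following symlinks),
# is non-empty, and its first character is the FASTA header marker ">".
looks_like_fasta() {
    [ -s "$1" ] && [ "$(head -c 1 "$1")" = ">" ]
}

if looks_like_fasta subset_assembly.fa; then
    echo "subset_assembly.fa looks OK"
else
    echo "subset_assembly.fa is missing, empty, or not FASTA (check the link target)"
fi
```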

Now it is time to run Prokka! There are tons of different ways to specialize the running of Prokka. We are going to keep it simple for now, though. It will take a little bit to run.

prokka subset_assembly.fa --outdir prokka_annotation --prefix metagG --metagenome --kingdom Bacteria

Question: Look at the results of the prokka analysis as it prepares your output file. What types of categories are you seeing flash by on the screen?

Don’t worry, the program tends to pause here:

Running: cat prokka_annotation\/sprot\.faa | parallel --gnu --plain -j 6 --block 242000
--recstart '>' --pipe blastp -query - -db /home/tx160085/prokka/bin/../db/kingdom/Bacteria/sprot
-evalue 1e-06 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > prokka_annotation\/sprot\.blast 2> /dev/null

This will generate a new folder called prokka_annotation containing a series of output files, which are detailed here.

In particular, we will be using the *.ffn file to assess the relative read coverage within our metagenomes across the predicted genomic regions.
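As a quick first look at the results, the sketch below counts the sequences in the *.ffn file and tallies feature types in the *.gff file. Both helper names are hypothetical, and the loops assume you run them from inside ~/annotation/prokka_annotation after Prokka has finished.

```shell
# Count FASTA records (one per header line starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tally feature types (column 3) in a GFF3 file, stopping at the
# embedded "##FASTA" section that Prokka appends to its .gff output.
tally_gff_features() {
    awk -F'\t' '
        /^##FASTA/ { exit }
        /^#/       { next }
        NF >= 3    { n[$3]++ }
        END        { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}

# Assumption: the current directory is the Prokka output folder.
for f in *.ffn; do
    if [ -e "$f" ]; then
        echo "$f: $(count_fasta_records "$f") feature sequences"
    fi
done
for f in *.gff; do
    if [ -e "$f" ]; then
        tally_gff_features "$f"
    fi
done
```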

Question: Take a moment and look inside the output files:

cd ~/annotation/prokka_annotation
less -S *.fsa

less reminders:

Press the space bar to page down
Press q to exit less

Questions?

What can I annotate with prokka?
Alternatives?
How do I submit my annotated files to GenBank? To EBI?
Why is it called Prokka?

Alternative Annotation Tools (if Time Allows or for Homework)

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm. See Kraken Home Page for more information.

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. See the Prodigal home page for more info. Citation

Prodigal is already bundled with Prokka and run as part of its pipeline, but sometimes it is handy to generate a standalone .gff file of gene predictions.

Install Kraken

cd ~
git clone https://github.com/DerrickWood/kraken.git
cd ~/kraken
mkdir ~/kraken/bin
./install_kraken.sh ~/kraken/bin
export PATH=$PATH:$HOME/kraken/bin

Install Kraken Mini DB

mkdir ~/KRAKEN
cd ~/KRAKEN
wget http://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
tar -xvf minikraken.tgz

Running Kraken

cd ~/annotation
mkdir kraken_annotation

kraken --db ~/KRAKEN/minikraken_20141208/ --threads 2 --fasta-input subset_assembly.fa --output kraken_annotation/subset_assembly.kraken

kraken-translate --db ~/KRAKEN/minikraken_20141208/  kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.labels

Kraken has now attempted a taxonomic assignment for every contig in the assembly.
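To see how many sequences actually received a label, you can count the status codes in the raw output. This sketch relies on the Kraken output format, in which the first tab-separated column is C for classified or U for unclassified. The helper name kraken_status_counts is hypothetical.

```shell
# Summarize a raw Kraken output file: column 1 is "C" (classified)
# or "U" (unclassified), with one line per input sequence.
kraken_status_counts() {
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified=%d unclassified=%d\n", c, u }
    ' "$1"
}

result=kraken_annotation/subset_assembly.kraken   # path used in this tutorial
if [ -f "$result" ]; then
    kraken_status_counts "$result"
fi
```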

To generate a summary table:

cd ~/annotation
kraken-report --db ~/KRAKEN/minikraken_20141208 kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.report

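The first line of the report summarizes the unclassified sequences. To look at the top of the file, do the following:

grep -v ^U ~/annotation/kraken_annotation/subset_assembly.kraken.report | head -n20

Note that grep -v ^U only drops lines that begin with U, which is how unclassified sequences are marked in the raw .kraken output. In the report the rank code sits in the fourth column, so the unclassified line still appears, as in the example output below.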

The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

Percentage of reads covered by the clade rooted at this taxon

Number of reads covered by the clade rooted at this taxon

Number of reads assigned directly to this taxon

A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-’.

NCBI taxonomy ID

Indented scientific name

Example output:

tx160085@js-157-212:~/annotation/kraken_annotation$ grep -v ^U subset_assembly.kraken.report | head -n20
 89.60  8311    8311    U       0       unclassified
 10.40  965     0       -       1       root
 10.40  965     3       -       131567    cellular organisms
 10.37  962     43      D       2           Bacteria
  6.51  604     0       P       200918        Thermotogae
  6.51  604     0       C       188708          Thermotogae
  6.51  604     0       O       2419              Thermotogales
  6.51  604     8       F       188709              Thermotogaceae
  5.11  474     0       G       28236                 Petrotoga
  5.11  474     0       S       69499                   Petrotoga mobilis
  5.11  474     474     -       403833                    Petrotoga mobilis SJ95
  1.22  113     0       G       1184396               Mesotoga
  1.22  113     0       S       1184387                 Mesotoga prima
  1.22  113     113     -       660470                    Mesotoga prima MesG1.Ag.4.2
  0.04  4       0       G       651456                Kosmotoga
  0.04  4       0       S       651457                  Kosmotoga olearia
  0.04  4       4       -       521045                    Kosmotoga olearia TBF 19.5.1
  0.02  2       1       G       2335                  Thermotoga
  0.01  1       0       S       177758                  Thermotoga lettingae
  0.01  1       1       -       416591                    Thermotoga lettingae TMO
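Given the tab-delimited field layout described above, a single rank can be pulled out of the report with awk. The sketch below is one way to do this and is not part of Kraken itself. The helper name report_rows_for_rank is hypothetical.

```shell
# Print percentage, clade count, and name for every row of a
# kraken-report file whose rank code (column 4) matches the given code.
report_rows_for_rank() {
    awk -F'\t' -v rank="$2" '
        $4 == rank {
            name = $6
            sub(/^ +/, "", name)      # strip the indentation from the name
            pct = $1
            gsub(/ /, "", pct)        # strip padding from the percentage
            print pct "\t" $2 "\t" name
        }
    ' "$1"
}

report=kraken_annotation/subset_assembly.kraken.report   # path used in this tutorial
if [ -f "$report" ]; then
    report_rows_for_rank "$report" G     # genus-level rows only
fi
```

Passing U instead of G returns only the unclassified summary line, and filtering on any other code skips it.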

Why use Kraken?

For a simulated metagenome of 100 bp reads, in its fastest mode of operation, Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken’s accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision. Citation

However, Kraken is only as sensitive as the provided database, so for unusual samples a custom database needs to be constructed. Accuracy is very sensitive to the number of reference sequences in the database.

Install Prodigal

cd ~
wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux
chmod 775 ~/prodigal.linux

Running Prodigal

Using prodigal with the same set of data, we can get a list of predicted genes.

cd ~/annotation
mkdir prodigal_annotation
~/prodigal.linux -p meta -a prodigal_annotation/subset_assembly.faa -d prodigal_annotation/subset_assembly.fna -f gff -o prodigal_annotation/subset_assembly.gff -i subset_assembly.fa
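Once Prodigal has finished, the GFF file can be summarized directly. The sketch below counts predicted CDS features per contig, assuming the standard GFF layout with the sequence name in column 1 and the feature type in column 3. The helper name genes_per_contig is hypothetical.

```shell
# Count predicted CDS features per contig in a Prodigal GFF file.
genes_per_contig() {
    awk -F'\t' '
        /^#/        { next }
        $3 == "CDS" { n[$1]++ }
        END         { for (c in n) print c "\t" n[c] }
    ' "$1" | sort
}

gff=prodigal_annotation/subset_assembly.gff   # path used in this tutorial
if [ -f "$gff" ]; then
    genes_per_contig "$gff" | head -n 10
fi
```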