Annotation with Prokka

Prokka is a tool that facilitates the fast annotation of prokaryotic genomes.

The goals of this tutorial are to:

  • Install Prokka
  • Use Prokka to annotate our genomes

Installing Prokka

Download and extract the latest version of prokka:

cd ~/
git clone https://github.com/tseemann/prokka.git

We also will need some dependencies such as bioperl:

sudo apt-get -y install bioperl libdatetime-perl libxml-simple-perl libdigest-md5-perl

This may take a little while.

and we need an XML package from perl

sudo bash
export PERL_MM_USE_DEFAULT=1
export PERL_EXTUTILS_AUTOINSTALL="--defaultdeps"
perl -MCPAN -e 'install "XML::Simple"'
exit

Now, you should be able to add Prokka to your $PATH and set up the index for the sequence database:

export PATH=$PATH:$HOME/prokka/bin
prokka --setupdb

To make sure the database loaded directly:

prokka --listdb

You should see something like:

tx160085@js-157-212:~$ prokka --listdb
[17:04:15] Looking for databases in: /home/tx160085/prokka/bin/../db
[17:04:15] * Kingdoms: Archaea Bacteria Mitochondria Viruses
[17:04:15] * Genera: Enterococcus Escherichia Staphylococcus
[17:04:15] * HMMs: HAMAP
[17:04:15] * CMs: Bacteria Viruses

Prokka uses a core set of the Uniprot-DB Kingdom sets against which it blasts your samples. It is possible to search in a more specific dataset, e.g. the genus Enterococcus, by adding a few flags to the command.

–usegenus –genus Enterococcus

Question: What do you think you would do for adding to the default databases?

Prokka should be good to go now– you can check to make sure that all is well by typing prokka. This should print the help screen with all available options. You can find out more about Prokka databases here.

Running Prokka

Make a new directory for the annotation:

cd ~/
mkdir annotation
cd annotation

Link the metagenome assembly file into this directory:

ln -fs ~/mapping/subset_assembly.fa .

Now it is time to run Prokka! There are tons of different ways to specialize the running of Prokka. We are going to keep it simple for now, though. It will take a little bit to run.

prokka subset_assembly.fa --outdir prokka_annotation --prefix metagG --metagenome --kingdom Bacteria

Question: Look at the results of the prokka analysis as it prepares your output file. What types of categories are you seeing flash by on the screen?

Don’t worry, the program tends to pause here:

Running: cat prokka_annotation\/sprot\.faa | parallel --gnu --plain -j 6 --block 242000
--recstart '>' --pipe blastp -query --db /home/tx160085/prokka/bin/../db/kingdom/Bacteria/sprot
-evalue 1e-06 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > prokka_annotation\/sprot\.blast 2> /dev/null

This will generate a new folder called prokka_annotation in which will be a series of files, which are detailed here.

In particular, we will be using the *.ffn file to assess the relative read coverage within our metagenomes across the predicted genomic regions.

Question: Take a moment and look inside the output files.:

cd ~/annotation/prokka_annotation
less -S *.fsa

less reminders:

*Press space_bar to page down *Press q to exit the less commands

Questions?

  • What can I annotate with prokka?
  • Alternatives?
  • How do I submit my annotated files to Genbank? EBI??
  • Why is it called Prokka?

Alternative Annotation Tools (if Time Allows or for Homework)

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm. See Kraken Home Page for more information.

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. See the Prodigal home page for more info. Citation

Prodigal is already installed inside the prokka wrapper, but sometimes it is handy to generate a standalone .gff file for annotation.

Install Kraken

cd ~
git clone https://github.com/DerrickWood/kraken.git
cd ~/kraken
mkdir ~/kraken/bin
./install_kraken.sh ~/kraken/bin
export PATH=$PATH:$HOME/kraken/bin

Install Kraken Mini DB

mkdir ~/KRAKEN
cd ~/KRAKEN
wget http://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
tar -xvf minikraken.tgz

Running Kraken

cd ~/annotation
mkdir kraken_annotation

kraken --db ~/KRAKEN/minikraken_20141208/ --threads 2 --fasta-input subset_assembly.fa --output kraken_annotation/subset_assembly.kraken


kraken-translate --db ~/KRAKEN/minikraken_20141208/  kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.labels

Kraken has now provided a taxonomic assignment to all of the clusters.

To generate a summary table:

cd ~/annotation
kraken-report --db ~/KRAKEN/minikraken_20141208 kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.report

The top of the file lists all the unclassified sequences, to look at the file and skip over these, do the following:

grep -v ^U ~/annotation/kraken_annotation/subset_assembly.kraken.report | head -n20

The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

  1. Percentage of reads covered by the clade rooted at this taxon
  2. Number of reads covered by the clade rooted at this taxon
  3. Number of reads assigned directly to this taxon
  4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-‘.
  5. NCBI taxonomy ID
  6. indented scientific name

Example output:

tx160085@js-157-212:~/annotation/kraken_annotation$ grep -v ^U subset_assembly.kraken.report | head -n20
 89.60  8311    8311    U       0       unclassified
 10.40  965     0       -       1       root
 10.40  965     3       -       131567    cellular organisms
 10.37  962     43      D       2           Bacteria
  6.51  604     0       P       200918        Thermotogae
  6.51  604     0       C       188708          Thermotogae
  6.51  604     0       O       2419              Thermotogales
  6.51  604     8       F       188709              Thermotogaceae
  5.11  474     0       G       28236                 Petrotoga
  5.11  474     0       S       69499                   Petrotoga mobilis
  5.11  474     474     -       403833                    Petrotoga mobilis SJ95
  1.22  113     0       G       1184396               Mesotoga
  1.22  113     0       S       1184387                 Mesotoga prima
  1.22  113     113     -       660470                    Mesotoga prima MesG1.Ag.4.2
  0.04  4       0       G       651456                Kosmotoga
  0.04  4       0       S       651457                  Kosmotoga olearia
  0.04  4       4       -       521045                    Kosmotoga olearia TBF 19.5.1
  0.02  2       1       G       2335                  Thermotoga
  0.01  1       0       S       177758                  Thermotoga lettingae
  0.01  1       1       -       416591                    Thermotoga lettingae TMO

Why use Kraken?

For a simulated metagenome of 100 bp reads in its fastest mode of operation, , Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken’s accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision. Citation

However, kraken is only as sensitive as the provided database, so for unusual samples, a custom database needs to be constructed . The accuracy is very sensitive to the quantity of samples in the database.

Install Prodigal

cd ~
wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux
tar -xvf v2.6.3.tar.gz
chmod 775 ~/prodigal.linux

Running Prodigal

Using prodigal with the same set of data, we can get a list of predicted genes.

cd ~/annotation
mkdir prodigal_annotation
~/prodigal.linux -p meta -a prodigal_annotation/subset_assembly.faa -d prodigal_annotation/subset_assembly.fna -f gff -o prodigal_annotation/subset_assembly.gff -i subset_assembly.fa

LICENSE: This documentation and all textual/graphic site content is released under Creative Commons - 0 (CC0) -- fork @ github.