Tutorials:
Lead Instructors: Harriet Alexander and C. Titus Brown
Co-Instructors: Jessica Blanton, Adelaide Rhodes, Shawn Higdon, Jessica Mizzi, Phillip Brooks, Veronika Kivenson
These are the online materials for the environmental metagenomics workshop run as part of the Data Intensive Biology Summer Institute at UC Davis (DIBSI).
We will be using HackMD to take collective notes throughout the course.
11am: Meet in Valley Hall (room TBA), setup, and introductions.
Noon - 1:15: Lunch
1:15 - 2pm: Group meeting with all of DIBSI
2pm - 4pm:
Homework: Skim the Critical Assessment of Metagenome Interpretation (CAMI) Paper
9am - Noon:
1:15 - 4pm:
9am - Noon:
Noon - 1:15: Lunch
1:15 - 4pm:
Resources:
The github repository for this workshop is public at https://github.com/ngs-docs/2017-ucsc-metagenomics
For you:
This is intended to be a safe and friendly place for learning!
Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html
In particular, please ask questions, because I guarantee you that your question will help others!
Harriet Alexander - postdoc at UC Davis.
Phil Brooks - postdoc at UC Davis.
Titus Brown - prof at UC Davis in the School of Vet Med.
Shannon Joslin - grad student at UC Davis.
Taylor Reiter - grad student at UC Davis.
Basic rules:
Place the sticky notes where we can see them from the back of the room – e.g. on the back of your laptop.
What we’re going to do here is walk through starting up a running computer (an “instance”) on the Jetstream service.
Below, we’ve provided screenshots of the whole process. You can click on them to zoom in a bit. The important areas to fill in are circled in red.
Some of the details may vary – for example, if you have your own XSEDE account, you may want to log in with that – and the name of the operating system or “image” may also vary from “Ubuntu 16.04” depending on the workshop.
First, go to the Jetstream application at https://use.jetstream-cloud.org/application.
Now:
Choose “XSEDE” as your account provider (it should be the default) and click on “Continue”.
Fill in the username, which is ‘tx160085’ for the ANGUS workshop, and then enter the password (which we will tell you in class).
Now, this is something you only need to do once if you have your own account - but if you’re using a shared account like tx160085, you will need a way to keep your computers separate from everyone else’s.
We’ll do this with Projects, which give you a bit of a workspace in which to keep things that belong to “you”.
Click on “Projects” up along the top.
Enter your name into the Project Name, and something simple like “ANGUS” into the description. Then click ‘create’.
Now, select ‘New’ and then “Instance” from the dropdown menu to start up a new machine.
Enter “Ubuntu 16.04” into the search bar - make sure it’s from June 21st, 2017.
Change the name to reflect what we’re doing - “workshop tutorial”, for example, but it doesn’t matter – and select ‘m1.medium’.
It will now be booting up! This will take 2-10 minutes, depending. Just wait! Don’t reload or anything.
Now, you can either click “Open Web Shell”, or, if you know how to use secure shell (ssh), you can ssh in as user ‘tx160085’ on the IP address of the machine - see circled information below. Click here to access the tutorial for setting up an ssh connection from your local computer to your Jetstream instance.
There’s a possibility that you’ll be confronted with this when you log in to Jetstream:
A refresh of the page should get you past it. Please try not to actually move any instances to a new project; it’s probably someone else’s and it could confuse them :)
You can save your workspace so you can return to your instance at a later time without losing any of your files or information stored in memory, similar to putting your physical computer to sleep. At the Instance Details screen, select the “Suspend” button.
This will open up a dialogue window. Select the “Yes, suspend this instance” button.
It may take Jetstream a few minutes to process, so wait until the progress bar says “Suspended.”
To wake-up your instance, select the “Resume” button.
This will open up a dialogue window. Select the “Yes, resume this instance” button.
It may take Jetstream a few minutes to process, so wait until the progress bar says “Active.”
You can shut down your workspace so you can return to your instance another day without losing any of your files, similar to shutting down your physical computer. You will retain your files, but you will lose any information stored in memory, such as your history on the command line. At the Instance Details screen, select the “Stop” button.
This will open up a dialogue window. Select the “Yes, stop this instance” button.
It may take Jetstream a few minutes to process, so wait until the progress bar says “Shutoff.”
To start your instance again, select the “Start” button.
This will open up a dialogue window. Select the “Yes, start this instance” button.
It may take Jetstream a few minutes to process, so wait until the progress bar says “Active.”
To completely remove your instance, you can select the “Delete” button from the instance details page.
This will open up a dialogue window. Select the “Yes, delete this instance” button.
It may take Jetstream a few minutes to process your request. The instance should disappear from the project when it has been successfully deleted.
The shell is a program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination.
There are many reasons to learn about the shell.
Unix is user-friendly. It’s just very selective about who its friends are.
The shell is already available on Mac and Linux. For Windows, you’ll have to download a separate program.
On Mac the shell is available through Terminal (Applications -> Utilities -> Terminal). Go ahead and drag the Terminal application to your Dock for easy access.
For Windows, we’re going to be using Git Bash. Download and install Git Bash on your computer, then open up the program.
We will spend most of our time learning about the basics of the shell by manipulating some experimental data.
Now we’re going to download the data for the tutorial. For this you’ll need internet access, because you’re going to get it off the web.
Open the shell.
Enter the command:
git clone https://github.com/edamame-course/edamame-data.git
This command will grab all of the data needed for this workshop from the internet. (We’re not going to talk about git right now, but it’s a tool for doing version control.)
Now let’s go into that directory:

cd
cd edamame-data

The command cd stands for ‘change directory’.

In this directory, there should be some things we just downloaded. Let’s check. Type:

ls

ls stands for ‘list’ and it lists the contents of a directory.

There are a few directories there, but not too many. Let’s go look in the shell directory.

cd shell
ls

In there, all mixed up together, are files and directories/folders. If we want to know which is which, we can type:

ls -F

Anything with a "/" after it is a directory. Things with a "*" after them are programs (executables). If there’s nothing after the name, it’s a regular file.

You can also use the command ls -l to see whether items in a directory are files or directories. ls -l gives a lot more information too, such as the size of each file.

So, we can see that we have several files, directories and a program. Great!

Most programs take additional arguments that control their exact behavior. For example, -F and -l are arguments to ls. The ls program, like many programs, takes a lot of arguments. But how do we know what the options to a particular command are?

Most commonly used shell programs have a manual. You can access the manual using the man program. Try entering:

man ls

This will open the manual page for ls. Use the space key to go forward and b to go backwards. When you are done reading, just hit q to quit.

Programs that are run from the shell can get extremely complicated. To see an example, open up the manual page for the find program. No one can possibly learn all of these arguments, of course, so you will probably find yourself referring back to the manual page frequently.
As you’ve already seen, you can move around in different directories or folders at the command line. Why would you want to do this, rather than just navigating around the normal way?

When you’re working with bioinformatics programs, you’re working with your data, and it’s key to have that data in the right place and to make sure the program has access to it. Many of the problems people run into with command-line bioinformatics programs come from not having the data in the place the program expects it to be.

Let’s practice moving around a bit.

We’re going to work in that shell directory we just downloaded.

First let’s navigate there using the regular way by clicking on the different folders.

First we did something like go to the folder of our username. Then we opened 'edamame-data', then 'shell'.

Let’s draw out how that went.

Now let’s draw some of the other files and folders we could have clicked on.
This is called a hierarchical file system structure, like an upside down tree with root (/) at the base that looks like this.
Exercise
Now we’re going to try a hunt. Move around in the ‘hidden’ directory and try to find the file 'youfoundit.txt'
By default, the ls command lists the contents of the working directory (i.e. the directory you are in). You can always find the directory you are in using the pwd command. However, you can also give ls the names of other directories to view. Navigate to the home directory if you are not already there.

Type:

cd

Then enter the command:

ls edamame-data

This will list the contents of the edamame-data directory without you having to navigate there.

The cd command works in a similar way. Try entering:

cd
cd edamame-data/shell/hidden

and you will jump directly to hidden without having to go through the intermediate directories.

Exercise
Try finding the 'anotherfile.txt' file without changing directories.
Navigate to the home directory. Typing out directory names can waste a lot of time. When you start typing out the name of a directory and then hit the tab key, the shell will try to fill in the rest of the directory name. For example, enter:

cd e<tab>

The shell will fill in the rest of the directory name for edamame-data. Now go to edamame-data/shell/MiSeq and type:

ls C01<tab><tab>

When you hit the first tab, nothing happens. The reason is that there are multiple files in this directory which start with C01. Thus, the shell does not know which one to fill in. When you hit tab again, the shell will list the possible choices.

Tab completion can also fill in the names of programs. For example, enter e<tab><tab>. You will see the name of every program that starts with an e. One of those is echo. If you enter ec<tab> you will see that tab completion works.
There are some shortcuts which you should know about. Dealing with the home directory is very common, so in the shell the tilde character, "~", is a shortcut for your home directory. Navigate back to the shell directory:

cd
cd edamame-data
cd shell

Then enter the command:

ls ~

This prints the contents of your home directory, without you having to type the full path. The shortcut .. always refers to the directory above your current directory. Thus:

ls ..

prints the contents of /home/username/edamame-data. You can chain these together, so:

ls ../../

prints the contents of /home/username, which is your home directory. Finally, the special directory . always refers to your current directory. So, ls, ls ., and ls ././././. all do the same thing: they print the contents of the current directory. This may seem like a useless shortcut right now, but we’ll see when it is needed in a little while.

To summarize: while you are in the shell directory, the commands ls ~, ls ~/., ls ../../, and ls /home/username all do exactly the same thing. These shortcuts are not necessary; they are provided for your convenience.
Navigate to the MiSeq directory using:

cd ~/edamame-data/shell/MiSeq

This directory contains our FASTQ files and some other ones we’ll need for analyses. If we type ls, we will see that there are a bunch of files with long file names. Some of them end with .fastq.

The * character is a shortcut for “everything”. Thus, if you enter ls *, you will see all of the contents of a given directory. Now try this command:

ls *fastq

This lists every file that ends with fastq. This command:

ls /usr/bin/*.sh

lists every file in /usr/bin that ends in the characters .sh.

We have paired-end sequencing, so for every sample we have two files. If we want to see just the files for the reverse-direction sequencing (the ones with an R in the name), we can use:

ls *R*fastq

This lists every file in the current directory whose name contains the letter R and ends with fastq. There are twenty such files, which we would expect because we have 20 samples.

So how does this actually work? Well... when the shell (bash) sees a word that contains the * character, it automatically looks for filenames that match the given pattern. In this case, it identified every file matching that pattern. Then, it replaced the *R*fastq with that list of files, separated by spaces.

What happens if you do R*fastq?
Short Exercise
Do each of the following using a single ls command, without navigating to a different directory.
1. List all of the files in /bin that start with the letter 'c'
2. List all of the files in /bin that contain the letter 'a'
3. List all of the files in /bin that end with the letter 'o'
BONUS: List all of the files in /bin that contain the letter 'a' or the letter 'c'
(One possible set of answers is sketched just below.)
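Here is one possible set of answers, sketched with shell wildcards (the bracket pattern in the bonus line is just one way to do it) - peek only after trying:

ls /bin/c*        # files that start with 'c'
ls /bin/*a*       # files that contain an 'a'
ls /bin/*o        # files that end with 'o'
ls /bin/*[ac]*    # BONUS: files that contain an 'a' or a 'c'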
You can easily access previous commands. Hit the up arrow. Hit it again. You can step backwards through your command history. The down arrow takes you forwards in the command history.

^-C will cancel the command you are writing, and give you a fresh prompt.

^-R will do a reverse-search through your command history. This is very useful.

You can also review your recent commands with the history command. Just enter:

history

to see a numbered list of recent commands, including the history command you just issued. You can reuse one of these commands directly by referring to the number of that command.
If your history looked like this:
259 ls *
260 ls /usr/bin/*.sh
261 ls *R*fastq
then you could repeat command #260 by simply entering:
!260
(that’s an exclamation mark).
Short Exercise
We now know how to switch directories, run programs, and look at the contents of directories, but how do we look at the contents of files?

The easiest way to examine a file is to just print out all of its contents using the program cat. Enter the following command:

cat C01D01F_sub.fastq

This prints out the contents of the C01D01F_sub.fastq file.
Short Exercises
1. Print out the contents of the ~/edamame-data/shell/MiSeq/Centralia_mapping_files/Collapsed_Centralia_full_map.txt file. What does this file contain?
2. Without changing directories (you should still be in edamame-data), use one short command to print the contents of all of the files in the /home/username/edamame-data/shell/MiSeq directory.

Make sure we’re in the right place for the next set of lessons. We want to be in the MiSeq directory. Check if you’re there with pwd, and if not, navigate there. One way to do that would be:

cd ~/edamame-data/shell/MiSeq
cat is a terrific program, but when the file is really big, it can be annoying to use. The program less is useful for this case. Enter the following command:

less C01D01F_sub.fastq

less opens the file, and lets you navigate through it. The commands are identical to the man program.

Some commands in less:

| key | action |
| ------- | ---------- |
| "space" | go forward |
| "b" | go backward |
| "g" | go to the beginning |
| "G" | go to the end |
| "q" | quit |

less also gives you a way of searching through files. Just hit the “/” key to begin a search. Enter the word you would like to search for and hit enter. It will jump to the next location where that word is found. Try searching the dictionary.txt file for the word “cat”. If you hit “/” then “enter”, less will just repeat the previous search. less searches from the current location and works its way forward. If you are at the end of the file and search for the word “cat”, less will not find it. You need to go to the beginning of the file and search.

For instance, let’s search for the sequence 1101:14341 in our file. You can see that we go right to that sequence and can see what it looks like.

Remember, the man program actually uses less internally and therefore uses the same commands, so you can search documentation using “/” as well!
There’s another way that we can look at files, and in this case, just look at part of them. This can be particularly useful if we just want to see the beginning or end of the file, or see how it’s formatted.

The commands are head and tail, and they let you look at the beginning and end of a file, respectively.

head C01D01F_sub.fastq
tail C01D01F_sub.fastq

The -n option to either of these commands can be used to print the first or last n lines of a file. To print the first/last line of the file use:

head -n 1 C01D01F_sub.fastq
tail -n 1 C01D01F_sub.fastq
We showed a little of how to search within a file using less. We can also search within files without even opening them, using grep. grep is a command-line utility for searching plain-text data sets for lines matching a string or regular expression.

Let’s give it a try!

Let’s search for the sequence 22029:7208 in the C01D01F_sub.fastq file:

grep 22029:7208 C01D01F_sub.fastq

We get back the whole line that had '22029:7208' in it. What if we wanted all four lines, the whole FASTQ record for that sequence, back instead?

grep -A 3 22029:7208 C01D01F_sub.fastq

The -A flag stands for “after match”, so it returns the line that matches plus the three lines after it. The -B flag returns that number of lines before the match.

Exercise
Search for the sequence 'TTATCCGGATTTATTGGGTTTAAAGGGT' in the C01D01F_sub.fastq file and, in the output, include both the sequence name and the sequence, e.g.:

@M00967:43:000000000-A3JHG:1:2114:11799:28499 1:N:0:188
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCAG

(One possible answer is sketched just below.)
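One possible answer (a sketch): in FASTQ format the sequence name sits on the line immediately above the sequence, so grep’s -B flag can pull it in:

grep -B 1 TTATCCGGATTTATTGGGTTTAAAGGGT C01D01F_sub.fastq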
We’re excited: we have all these sequences that we care about that we just got from the FASTQ files. That is a really important motif that is going to help us answer our important question. But all those sequences just went whizzing by with grep. How can we capture them?

We can do that with something called “redirection”. The idea is that we’re redirecting the output that would go to the terminal (all the stuff that went whizzing by) to something else. In this case, we want to save it to a file, so that we can look at it later.

The redirection command for putting something in a file is >

Let’s try it out and put all the sequences that contain 'TTATCCGGATTTATTGGGTTTAAAGGGT' from all the files into another file called 'good-data2.txt'.

grep -B 2 TTATCCGGATTTATTGGGTTTAAAGGGT *.fastq > good-data2.txt

The prompt should sit there a little bit, and then it should look like nothing happened. But type ls. You should have a new file called good-data2.txt. Take a look at it and see if it has what you think it should.

There’s one more useful redirection command that we’re going to show, and that’s called the pipe command: |. It’s probably not a key on your keyboard you use very much. What | does is take the output that is scrolling by on the terminal and run it through another command. When it was all whizzing by before, we wished we could just slow it down and look at it, like we can with less. Well, it turns out that we can! We pipe the grep command through less:

grep TTATCCGGATTTATTGGGTTTAAAGGGT *.fastq | less

Now we can use the arrows to scroll up and down and use q to get out.
We can also do something tricky and use the command wc. wc stands for ‘word count’. It counts the number of lines, words, or characters. So, we can use it to count the number of lines we’re getting back from our grep command, and that will magically tell us how many sequences we’re finding:

grep TTATCCGGATTTATTGGGTTTAAAGGGT *.fastq | wc

That tells us the number of lines, words, and characters in the output. If we just want the number of lines, we can use the -l flag for ‘lines’:

grep TTATCCGGATTTATTGGGTTTAAAGGGT *.fastq | wc -l

Redirecting is not super intuitive, but it’s really powerful for stringing together these different commands, so you can do whatever you need to do.

The philosophy behind these command-line programs is that none of them really do anything all that impressive on their own. BUT when you start chaining them together, you can do some really powerful things really efficiently. If you want to be proficient at using the shell, you must learn to become proficient with the pipe and redirection operators: |, >, >>.
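We haven’t shown >> yet; it appends to a file instead of overwriting it. A tiny sketch (notes.txt here is just a made-up file name):

echo "first line" > notes.txt     # '>' creates (or overwrites) notes.txt
echo "second line" >> notes.txt   # '>>' appends to notes.txt
cat notes.txt                     # shows both lines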
Now we can move around in the file structure, look at files, search files, and redirect output. But what if we want to do normal things like copy files, move them around, or get rid of them? Sure, we could do most of these things without the command line, but what fun would that be?! Besides, it’s often faster to do it at the command line, and sometimes you’ll be on a remote server like Amazon where you won’t have another option.

The good-data2.txt file we just made holds the sequences we care about. This is an important file, so we want to make a copy so we don’t lose it.

Let’s copy the file using the cp command. Make sure you are in the MiSeq directory and enter:

cp good-data2.txt good-data2.backup.txt

Now good-data2.backup.txt has been created as a copy of good-data2.txt.

Let’s make a backup directory where we can put this file.

The mkdir command is used to make a directory. Just enter mkdir followed by a space, then the directory name.

mkdir backup

We can now move our backed-up file into this directory. We can move files around using the command mv. Enter this command:

mv good-data2.backup.txt backup/

This moves good-data2.backup.txt into the directory backup/; the full path would be ~/edamame-data/shell/MiSeq/backup.

The mv command is also how you rename files. Since this file is so important, let’s rename it:

mv backup/good-data2.backup.txt backup/good-data2.backup_IMPORTANT

Now the file name has been changed to good-data2.backup_IMPORTANT. Let’s delete the backup file now:

rm backup/good-data2.backup_IMPORTANT

The rm command removes the file. Be careful with this command. It doesn’t just nicely put the files in the Trash; they’re really gone.
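If you want a small safety net, rm’s -i option asks for confirmation before each deletion. A sketch (the file name here is made up):

rm -i backup/some-old-file.txt   # prompts before deleting
alias rm='rm -i'                 # some people alias rm so it always asks first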
Short Exercise
Do the following:
1. Rename the good-data2.backup_IMPORTANT file to good-data2.backup.txt (you may need to re-create it with cp first, since we just deleted it).
2. Create a directory in the MiSeq directory called new.
3. Then, copy the good-data2.backup.txt file into new.

By default, rm will NOT delete directories. You can tell rm to delete a directory using the -r option. Let’s delete that new directory we just made. Enter the following command:

rm -r new
Commands like ls, rm, echo, and cd are just ordinary programs on the computer. A program is just a file that you can execute. The program which tells you the location of a particular program. For example:

which ls

will return “/bin/ls”. Thus, we can see that ls is a program that sits inside of the /bin directory. Now enter:

which find

You will see that find is a program that sits inside of the /usr/bin directory.

So... when we enter a program name, like ls, and hit enter, how does the shell know where to look for that program? How does it know to run /bin/ls when we enter ls? The answer is that when we enter a program name and hit enter, there are a few standard places that the shell automatically looks. If it can’t find the program in any of those places, it will print an error saying “command not found”. Enter the command:

echo $PATH

This will print out the value of the PATH environment variable. (More on environment variables later.) Notice that a list of directories, separated by colon characters, is printed. These are the places the shell looks for programs to run. If your program is not in this list, then an error is printed. The shell ONLY checks in the places listed in the PATH environment variable.
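If you ever need to run programs from a directory that isn’t in your PATH, you can add it for the current session. A sketch, assuming a made-up ~/my-scripts directory:

export PATH=$PATH:$HOME/my-scripts   # append the directory to the search path
echo $PATH                           # the new directory now appears at the end of the list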
Navigate to the shell directory and list the contents. You will notice that there is a program (executable file) called hello.sh in this directory. Now, try to run the program by entering:

hello.sh

You should get an error saying that hello.sh cannot be found. That is because the directory /home/username/edamame-data/shell is not in the PATH. You can run the hello.sh program by entering:

./hello.sh

Remember that . is a shortcut for the current working directory. This tells the shell to run the hello.sh program which is located right here. So, you can run any program by entering the path to that program. You can run hello.sh equally well by specifying:

/home/username/edamame-data/shell/hello.sh

or by entering:

~/edamame-data/shell/hello.sh

When there are no / characters in the program name, the shell assumes you want to look in one of the default places for the program.
The find program can be used to find files based on arbitrary criteria. Navigate to the data directory and enter the following command:

find . -print

This prints the name of every file or directory, recursively, starting from the current directory. Let’s exclude all of the directories:

find . -type f -print

This tells find to locate only files. Now try these commands:

find . -type f -name "*1*"
find . -type f -name "*1*" -or -name "*2*" -print
find . -type f -name "*1*" -and -name "*2*" -print

The find command can acquire a list of files and perform some operation on each file. Try this command out:

find . -type f -exec grep Volume {} \;

This command finds every file starting from . . Then it searches each file for a line which contains the word “Volume”. The {} refers to the name of each file. The trailing \; is used to terminate the command. This command is slow, because it is calling a new instance of grep for each item the find returns.

A faster way to do this is to use the xargs command:

find . -type f -print | xargs grep Volume

find generates a list of all the files we are interested in, then we pipe them to xargs. xargs takes the items given to it and passes them as arguments to grep. xargs generally only creates a single instance of grep (or whatever program it is running).
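If you only want the names of the files that contain a match (rather than every matching line), grep’s -l option does that; for example:

find . -type f -print | xargs grep -l Volume   # just the file names that contain "Volume"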
backtick, xargs: Example find all files with certain text
alias -> rm -i
variables -> use a path example
.bashrc
du
ln
ssh and scp
Regular Expressions
Permissions
Chaining commands together
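We won’t cover the topics above in detail here, but as a rough sketch of a few of them (the paths and names below are illustrative only):

DATA=~/edamame-data/shell/MiSeq             # variables: store a path and reuse it
ls $DATA
echo "There are `ls $DATA | wc -l` items"   # backticks: use one command's output inside another
echo "alias ll='ls -l'" >> ~/.bashrc        # .bashrc: settings here load in every new shell
du -sh $DATA                                # du: disk space used by a directory
ln -s $DATA/C01D01F_sub.fastq link.fastq    # ln -s: make a symbolic link (shortcut) to a file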
The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.
Start up an m1.medium instance running Ubuntu 16.04 on Jetstream.
All of the commands below can be copied and pasted.
Copy and paste the following commands:
sudo apt-get update && sudo apt-get -y install python ncbi-blast+
(make sure to hit enter after the paste – sometimes the last line doesn’t paste completely.)
This updates the software list and installs the Python programming language and NCBI BLAST+.
First! We need some data. Let’s grab the mouse and zebrafish RefSeq
protein data sets from NCBI, and put them in our home directory. If you’ve just logged
in, you should be there already, but if you’re unsure, just run cd
and hit enter. Now,
we’ll use curl
to download the files:
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.1.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.2.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.3.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.1.protein.faa.gz
If you look at the files in the current directory, you should see the four files you just downloaded, along with some other directories:
ls -l
should show you:
total 29056
drwxr-xr-x 2 titus titus 4096 May 5 08:26 Desktop
drwxr-xr-x 2 titus titus 4096 Jun 14 18:03 Documents
drwxr-xr-x 2 titus titus 4096 Jun 14 18:03 Downloads
-rw-rw-r-- 1 titus titus 3610407 Jun 14 18:11 mouse.1.protein.faa.gz
-rw-rw-r-- 1 titus titus 6080985 Jun 14 18:11 mouse.2.protein.faa.gz
-rw-rw-r-- 1 titus titus 7520591 Jun 14 18:11 mouse.3.protein.faa.gz
drwxr-xr-x 2 titus titus 4096 Jun 14 18:03 Music
drwxr-xr-x 2 titus titus 4096 Jun 14 18:03 Pictures
drwxr-xr-x 2 titus titus 4096 Jun 14 18:03 Public
drwxr-xr-x 2 titus titus 4096 Jun 14 18:03 Templates
drwxr-xr-x 2 titus titus 4096 Jun 14 18:03 Videos
-rw-rw-r-- 1 titus titus 12500932 Jun 14 18:11 zebrafish.1.protein.faa.gz
All four of the files are FASTA protein files (that’s what the .faa
suggests) that are compressed with gzip
(that’s what the .gz means).
Uncompress them:
gunzip *.faa.gz
and let’s look at the first few sequences in the file:
head mouse.1.protein.faa
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.
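For reference, a FASTA record looks roughly like this (a made-up example, not an actual mouse protein):

>example_protein_1 a made-up description line
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
APILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ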
Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”
head -11 mouse.1.protein.faa > mm-first.fa
So now, for example, you can do cat mm-first.fa to see the contents of that file (or less mm-first.fa).
Now let’s BLAST these two sequences against the entire zebrafish protein data set. First, we need to tell BLAST that the zebrafish sequences are (a) a database, and (b) a protein database. That’s done by calling ‘makeblastdb’:
makeblastdb -in zebrafish.1.protein.faa -dbtype prot
Next, we call BLAST to do the search:
blastp -query mm-first.fa -db zebrafish.1.protein.faa
This should run pretty quickly, but you’re going to get a lot of output!!
To save it to a file instead of watching it go past on the screen,
ask BLAST to save the output to a file that we’ll name mm-first.x.zebrafish.txt
:
blastp -query mm-first.fa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt
and then you can ‘page’ through this file at your leisure by typing:
less mm-first.x.zebrafish.txt
(Type spacebar to move down, and ‘q’ to get out of paging mode.)
Let’s do some more sequences (this one will take a little longer to run):
head -500 mouse.1.protein.faa > mm-second.fa
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt
will compare the first 83 sequences. You can look at the output file with:
less mm-second.x.zebrafish.txt
(and again, type ‘q’ to get out of paging mode.)
To get an output format that reads well into downstream applications, it is helpful to add the flag -outfmt 6
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.tbl.txt -outfmt 6
To see the results:
head mm-second.x.zebrafish.tbl.txt | less -S
The -S option to less keeps long lines from wrapping, so you can scroll left and right across the screen with the arrow keys. Type ‘q’ to exit less.
To find out how to customize blast outputs, it is helpful to look at the BLAST® Command Line Applications User Manual.
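If I remember the BLAST+ defaults correctly, -outfmt 6 produces twelve tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. You can also ask for a custom subset of columns, e.g.:

blastp -query mm-second.fa -db zebrafish.1.protein.faa -outfmt "6 qseqid sseqid pident evalue" | head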
Notes:
- you can execute multiple commands at a time;
- you might see a warning like “Selenocysteine (U) at position 310 replaced by X” - what does this mean?
- why did it take longer to BLAST mm-second.fa than mm-first.fa?
Things to mention and discuss:
- blastp options and -help.

Reminder: shut down your instance!
Other topics to discuss:
(Jetstream startup instructions HERE.)
—
You should now be logged into your Jetstream computer! You should see something like this:
ubuntu@ip-172-30-1-252:~$
this is the command prompt.
First, let’s install software for short read quality assessment, trimming and python virtual environments:
sudo apt-get -y update && \
sudo apt-get -y install trimmomatic python-pip \
samtools zlib1g-dev ncurses-dev python-dev
sudo apt-get install -y python3.5-dev python3.5-venv make \
libc6-dev g++ zlib1g-dev
wget -c http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
cd FastQC
chmod +x fastqc
cd
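Note: the unzip command above creates a ~/FastQC directory. If the fastqc command isn’t found when you try to run it later, one workaround is to add that directory to your PATH for the session:

export PATH=$PATH:~/FastQC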
Now, create a python 3.5 virtual environment and install software within:
python3.5 -m venv ~/py3
. ~/py3/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer
pip install -U https://github.com/dib-lab/sourmash/archive/master.zip
Let’s also run a Jupyter Notebook. First, configure it a teensy bit more securely, and also have it run in the background:
jupyter notebook --generate-config
cat >>~/.jupyter/jupyter_notebook_config.py <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'
c.NotebookApp.port = 8000
EOF
Now, run!
jupyter notebook &
On the Jetstream webshell, you can get the Web page address for the jupyter notebook you just launched. Press enter, then execute
echo http://$(hostname):8000/
Copy the output from echo and paste it into the url bar of a new tab in your web browser.
Note, the password is ‘davis’.
Note
If your network blocks port 8000 (e.g. cruznet at UCSC), you can run:
ssh -N -f -L localhost:8000:localhost:8000 username@remotehost
to tunnel the remote Jupyter notebook server over SSH.
We’re going to be using a subset of data from Hu et al., 2016. This paper from the Banfield lab samples some relatively low diversity environments and finds a bunch of nearly complete genomes.
(See DATA.html for a list of the data sets we’re using in this tutorial.)
We’ve loaded subsets of the data onto an Amazon location for you, to
make everything faster for today’s work. We’re going to put the
files on your computer locally under the directory ~/data
:
mkdir ~/data
Next, let’s grab part of the data set:
cd ~/data
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_1.fastq.gz
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_2.fastq.gz
Let’s make sure we downloaded all of our data intact, using md5sum:
md5sum SRR1976948_1.fastq.gz SRR1976948_2.fastq.gz
You should see this:
37bc70919a21fccb134ff2fefcda03ce SRR1976948_1.fastq.gz
29919864e4650e633cc409688b9748e2 SRR1976948_2.fastq.gz
Now if you type:
ls -l
you should see something like:
total 346936
-rw-rw-r-- 1 ubuntu ubuntu 169620631 Oct 11 23:37 SRR1976948_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 185636992 Oct 11 23:38 SRR1976948_2.fastq.gz
These are 1 million read subsets of the original data, taken from the beginning of each file.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. This makes it easy to accidentally alter or delete your raw data. Let’s fix that before we go on any further:
chmod u-w *
We’ll talk about what these files are below.
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir ~/work
cd ~/work
Now, make a “virtual copy” of the data in your working directory by linking it in –
ln -fs ~/data/* .
These are FASTQ files – let’s take a look at them:
less SRR1976948_1.fastq.gz
(use the spacebar to scroll down, and type ‘q’ to exit ‘less’)
Question:
Links:
We’re going to use FastQC to summarize the data. We already installed ‘fastqc’ on our computer for you.
Now, run FastQC on two files:
fastqc SRR1976948_1.fastq.gz
fastqc SRR1976948_2.fastq.gz
Now type ‘ls’:
ls -d *fastqc.zip*
to list the files, and you should see:
SRR1976948_1_fastqc.zip
SRR1976948_2_fastqc.zip
Inside each of the FastQC directories you will find reports from FastQC. You can download these files using your Jupyter Notebook console, if you like; or you can look at these copies of them:
Questions:
Links:
There are several caveats about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first 100,000 sequences in each file).
Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via apt-get.
The first thing we’ll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE SRR1976948_1.fastq.gz \
SRR1976948_2.fastq.gz \
SRR1976948_1.qc.fq.gz s1_se \
SRR1976948_2.qc.fq.gz s2_se \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
You should see output that looks like this:
...
Input Read Pairs: 1000000 Both Surviving: 885734 (88.57%) Forward Only Surviving: 114262 (11.43%) Reverse Only Surviving: 4 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.
Links:
Run FastQC again on the trimmed files:
fastqc SRR1976948_1.qc.fq.gz
fastqc SRR1976948_2.qc.fq.gz
And now view my copies of these files:
Let’s take a look at the output files:
less SRR1976948_1.qc.fq.gz
(again, use spacebar to scroll, ‘q’ to exit less).
MultiQC aggregates results across many samples into a single report for easy comparison.
Install MultiQC within the py3 environment:
pip install multiqc
Now, run MultiQC on both the untrimmed and trimmed files within the work directory:
multiqc .
And now you should see output that looks like this:
[INFO ] multiqc : This is MultiQC v1.0
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching '.'
Searching 15 files.. [####################################] 100%
[INFO ] fastqc : Found 4 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : multiqc_report.html
[INFO ] multiqc : Data : multiqc_data
[INFO ] multiqc : MultiQC complete
Now we can view the output file using Jupyter Notebook.
Questions:
Optional: K-mer Spectral Error Trimming
MEGAHIT is a very fast, quite good assembler designed for metagenomes.
First, install it:
cd ~/
git clone https://github.com/voutcn/megahit.git
cd megahit
make
Now, download some data:
mkdir ~/data
cd ~/data
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948.abundtrim.subset.pe.fq.gz
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1977249.abundtrim.subset.pe.fq.gz
These are data that have been run through k-mer abundance trimming (see the K-mer Spectral Error Trimming section) and subsampled so that we can run an assembly in a fairly short time period.
Now, finally, run the assembler!
mkdir ~/assembly
cd ~/assembly
ln -fs ../data/*.subset.pe.fq.gz .
~/megahit/megahit --12 SRR1976948.abundtrim.subset.pe.fq.gz,SRR1977249.abundtrim.subset.pe.fq.gz \
-o combined
This will take about 15 minutes; at the end you should see output like this:
... 7713 contigs, total 13168567 bp, min 200 bp, max 54372 bp, avg 1707 bp, N50 4305 bp
... ALL DONE. Time elapsed: 899.612093 seconds
The output assembly will be in combined/final.contigs.fa
.
Some figures for your consideration: The first two come from work by Dr. Sherine Awad on analyzing the data from Shakya et al (2014). The third comes from an analysis of read search vs contig search of a protein database.
Let’s first take a look at the assembly:
less combined/final.contigs.fa
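A quick sanity check is to count the contigs by counting FASTA header lines:

grep -c '^>' combined/final.contigs.fa   # number of contigs in the assembly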
Now we can run a few stats on our assembly. To do this we will use QUAST:
cd ~/
git clone https://github.com/ablab/quast.git -b release_4.5
export PYTHONPATH=$(pwd)/quast/libs/
Now, run QUAST on the assembly:
cd ~/assembly
~/quast/quast.py combined/final.contigs.fa -o combined-report
cat combined-report/report.txt
What does this say about our assembly? What do the stats not tell us? For thought and discussion: a few nice slides from Jessica Blanton on assessing assembly quality.
At this point we can do a bunch of things:
Prokka is a tool that facilitates the fast annotation of prokaryotic genomes.
The goals of this tutorial are to:
Download and extract the latest version of prokka:
cd ~/
git clone https://github.com/tseemann/prokka.git
We also will need some dependencies such as bioperl:
sudo apt-get -y install bioperl libdatetime-perl libxml-simple-perl libdigest-md5-perl
This may take a little while.
and we need an XML package from perl
sudo bash
export PERL_MM_USE_DEFAULT=1
export PERL_EXTUTILS_AUTOINSTALL="--defaultdeps"
perl -MCPAN -e 'install "XML::Simple"'
exit
Now, you should be able to add Prokka to your $PATH
and set up the index for the sequence database:
export PATH=$PATH:$HOME/prokka/bin
prokka --setupdb
To make sure the database loaded directly:
prokka --listdb
You should see something like:
tx160085@js-157-212:~$ prokka --listdb
[17:04:15] Looking for databases in: /home/tx160085/prokka/bin/../db
[17:04:15] * Kingdoms: Archaea Bacteria Mitochondria Viruses
[17:04:15] * Genera: Enterococcus Escherichia Staphylococcus
[17:04:15] * HMMs: HAMAP
[17:04:15] * CMs: Bacteria Viruses
Prokka uses a core set of the Uniprot-DB Kingdom sets against which it blasts your samples. It is possible to search in a more specific dataset, e.g. the genus Enterococcus, by adding a few flags to the command.
--usegenus --genus Enterococcus
Question: What do you think you would do for adding to the default databases?
Prokka should be good to go now; you can check to make sure that all is well by typing prokka. This should print the help screen with all available options. You can find out more about Prokka databases here.
Make a new directory for the annotation:
cd ~/
mkdir annotation
cd annotation
Link the metagenome assembly file into this directory:
ln -fs ~/mapping/subset_assembly.fa .
Now it is time to run Prokka! There are tons of different ways to specialize the running of Prokka. We are going to keep it simple for now, though. It will take a little bit to run.
prokka subset_assembly.fa --outdir prokka_annotation --prefix metagG --metagenome --kingdom Bacteria
Question: Look at the results of the prokka analysis as it prepares your output file. What types of categories are you seeing flash by on the screen?
Don’t worry, the program tends to pause here:
Running: cat prokka_annotation\/sprot\.faa | parallel --gnu --plain -j 6 --block 242000
--recstart '>' --pipe blastp -query --db /home/tx160085/prokka/bin/../db/kingdom/Bacteria/sprot
-evalue 1e-06 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > prokka_annotation\/sprot\.blast 2> /dev/null
This will generate a new folder called prokka_annotation
in which will be a series of files, which are detailed here.
In particular, we will be using the *.ffn
file to assess the relative read coverage within our metagenomes across the predicted genomic regions.
Question: Take a moment and look inside the output files.:
cd ~/annotation/prokka_annotation
less -S *.fsa
less reminders:
- Press the space bar to page down
- Press q to exit less
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm. See Kraken Home Page for more information.
Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. See the Prodigal home page for more info. Citation
Prodigal is already installed inside the prokka wrapper, but sometimes it is handy to generate a standalone .gff file for annotation.
cd ~
git clone https://github.com/DerrickWood/kraken.git
cd ~/kraken
mkdir ~/kraken/bin
./install_kraken.sh ~/kraken/bin
export PATH=$PATH:$HOME/kraken/bin
mkdir ~/KRAKEN
cd ~/KRAKEN
wget http://ccb.jhu.edu/software/kraken/dl/minikraken.tgz
tar -xvf minikraken.tgz
cd ~/annotation
mkdir kraken_annotation
kraken --db ~/KRAKEN/minikraken_20141208/ --threads 2 --fasta-input subset_assembly.fa --output kraken_annotation/subset_assembly.kraken
kraken-translate --db ~/KRAKEN/minikraken_20141208/ kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.labels
Kraken has now provided a taxonomic assignment to all of the clusters.
To generate a summary table:
cd ~/annotation
kraken-report --db ~/KRAKEN/minikraken_20141208 kraken_annotation/subset_assembly.kraken > kraken_annotation/subset_assembly.kraken.report
The top of the file lists all the unclassified sequences; to look at the file while skipping over these, do the following:
grep -v ^U ~/annotation/kraken_annotation/subset_assembly.kraken.report | head -n20
The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
- Percentage of reads covered by the clade rooted at this taxon
- Number of reads covered by the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-‘.
- NCBI taxonomy ID
- indented scientific name
Example output:
tx160085@js-157-212:~/annotation/kraken_annotation$ grep -v ^U subset_assembly.kraken.report | head -n20
89.60 8311 8311 U 0 unclassified
10.40 965 0 - 1 root
10.40 965 3 - 131567 cellular organisms
10.37 962 43 D 2 Bacteria
6.51 604 0 P 200918 Thermotogae
6.51 604 0 C 188708 Thermotogae
6.51 604 0 O 2419 Thermotogales
6.51 604 8 F 188709 Thermotogaceae
5.11 474 0 G 28236 Petrotoga
5.11 474 0 S 69499 Petrotoga mobilis
5.11 474 474 - 403833 Petrotoga mobilis SJ95
1.22 113 0 G 1184396 Mesotoga
1.22 113 0 S 1184387 Mesotoga prima
1.22 113 113 - 660470 Mesotoga prima MesG1.Ag.4.2
0.04 4 0 G 651456 Kosmotoga
0.04 4 0 S 651457 Kosmotoga olearia
0.04 4 4 - 521045 Kosmotoga olearia TBF 19.5.1
0.02 2 1 G 2335 Thermotoga
0.01 1 0 S 177758 Thermotoga lettingae
0.01 1 1 - 416591 Thermotoga lettingae TMO
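Because the report is tab-delimited, you can slice it with standard shell tools. For example, assuming the rank code is the fourth tab-separated field as described above, this pulls out just the genus-level lines:

awk -F'\t' '$4 == "G"' ~/annotation/kraken_annotation/subset_assembly.kraken.report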
Why use Kraken?
For a simulated metagenome of 100 bp reads, in its fastest mode of operation Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken’s accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision. Citation
However, Kraken is only as sensitive as the provided database, so for unusual samples a custom database needs to be constructed. The accuracy depends strongly on what the database contains.
cd ~
wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux
chmod 775 ~/prodigal.linux
Using prodigal with the same set of data, we can get a list of predicted genes.
cd ~/annotation
mkdir prodigal_annotation
~/prodigal.linux -p meta -a prodigal_annotation/subset_assembly.faa -d prodigal_annotation/subset_assembly.fna -f gff -o prodigal_annotation/subset_assembly.gff -i subset_assembly.fa
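As a quick check of the output (each predicted protein gets its own FASTA header), you can count the predicted genes with:

grep -c '^>' prodigal_annotation/subset_assembly.faa   # number of predicted protein sequences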
sourmash is our lab’s implementation of an ultra-fast lightweight approach to nucleotide-level search and comparison, called MinHash.
You can read some background about MinHash sketches in this paper: Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.
Create / log into an m1.medium Jetstream instance, and run these two commands:
cd
curl -O https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz
tar xzf microbe-genbank-sbt-k31-2017.05.09.tar.gz
– they take a long time :).
K-mers are a fairly simple concept that turn out to be tremendously powerful.
A “k-mer” is a word of DNA that is k long:
ATTG - a 4-mer
ATGGAC - a 6-mer
Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences – e.g. given a sequence of length 16, you could extract 11 k-mers of length six from it like so:
AGGATGAGACAGATAG
becomes the following set of 6-mers:
AGGATG
GGATGA
GATGAG
ATGAGA
TGAGAC
GAGACA
AGACAG
GACAGA
ACAGAT
CAGATA
AGATAG
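Here’s a tiny shell sketch of that sliding-window idea, using the same 16-base sequence (the variable names are just for illustration):

sequence=AGGATGAGACAGATAG
k=6
for i in $(seq 0 $(( ${#sequence} - k ))); do echo "${sequence:$i:$k}"; done   # prints all 11 6-mers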
k-mers are most useful when they’re long, because then they’re specific. That is, if you have a 31-mer taken from a human genome, it’s pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 4^31 possible 31-mers, and 4^31 = 4,611,686,018,427,387,904. So, you know, a lot.)
The important concept here is that long k-mers are species specific. We’ll go into a bit more detail later.
We’ve already run into k-mers before, as it turns out - when we were doing genome assembly. One of the three major ways that genome assembly works is by taking reads, breaking them into k-mers, and then “walking” from one k-mer to the next to bridge between reads. To see how this works, let’s take the 16-base sequence above, and add another overlapping sequence:
AGGATGAGACAGATAG
TGAGACAGATAGGATTGC
One way to assemble these together is to break them down into k-mers – the two sequences become the following set of 6-mers:
AGGATG
GGATGA
GATGAG
ATGAGA
TGAGAC
GAGACA
AGACAG
GACAGA
ACAGAT
CAGATA
AGATAG -> off the end of the first sequence
GATAGG <- beginning of the second sequence
ATAGGA
TAGGAT
AGGATT
GGATTG
GATTGC
and if you walk from one 6-mer to the next based on 5-mer overlap, you get the assembled sequence:
AGGATGAGACAGATAGGATTGC
Graphs of many k-mers together are called De Bruijn graphs, and assemblers like MEGAHIT and SOAPdenovo are De Bruijn graph assemblers - they use k-mers underneath.
Computers love k-mers because there’s no ambiguity in matching them. You either have an exact match, or you don’t. And computers love that sort of thing!
Basically, it’s really easy for a computer to tell if two reads share a k-mer, and it’s pretty easy for a computer to store all the k-mers that it sees in a pile of reads or in a genome.
So, we’ve said long k-mers (say, k=31 or longer) are pretty species specific. Is that really true?
Yes! Check out this figure from the MetaPalette paper:
Here, Koslicki and Falush show that k-mer similarity works to group microbes by genus at k=40. If you go longer (say k=50) then you see only very little similarity between different species.
So, one thing you can do is use k-mers to compare genomes to genomes, or read data sets to read data sets: data sets that have a lot of similarity probably are similar or even the same genome.
One metric you can use for these comparisons is the Jaccard similarity, which is calculated by asking how many k-mers are shared between two samples vs how many k-mers in total are in the combined samples:
only k-mers in both samples
----------------------------
all k-mers in either or both samples
A Jaccard similarity of 1 means the samples are identical; a Jaccard similarity of 0 means the samples are completely different.
This is a great measure and it can be used to search databases and cluster unknown genomes and all sorts of other things! The only real problem with it is that there are a lot of k-mers in a genome – a 5 Mbp genome (like E. coli) has about 5 million k-mers!
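As a rough sketch of that calculation, suppose you had written the k-mers of two samples to the made-up files kmers_A.txt and kmers_B.txt, one k-mer per line:

sort -u kmers_A.txt > A.sorted
sort -u kmers_B.txt > B.sorted
shared=$(comm -12 A.sorted B.sorted | wc -l)      # k-mers present in both samples
total=$(sort -u A.sorted B.sorted | wc -l)        # k-mers present in either or both samples
echo "scale=4; $shared / $total" | bc             # the Jaccard similarity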
About a year ago, Ondov et al. (2016) showed that MinHash approaches could be used to estimate Jaccard distance using only a small fraction (1 in 10,000 or so) of all the k-mers.
The basic idea behind MinHash is that you pick a small subset of k-mers to look at, and you use those as a proxy for all the k-mers. The trick is that you pick the k-mers randomly but consistently: so if a chosen k-mer is present in two data sets of interest, it will be picked in both. This is done using a clever trick that we can try to explain to you in class - but either way, trust us, it works!
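As a very loose illustration of “pick k-mers randomly but consistently” (this is not how sourmash is implemented, just the flavor of the idea): hash every k-mer and keep only the numerically smallest hashes. Because the same k-mer always hashes to the same value, two samples that share a k-mer will both keep it. Using the same made-up kmers_A.txt file as above:

while read kmer; do echo "$(printf '%s' "$kmer" | md5sum | cut -d' ' -f1) $kmer"; done < kmers_A.txt | sort | head -10   # the 10 smallest hashes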
We have implemented a MinHash approach in our sourmash software, which can do some nice things with samples. We’ll show you some of these things next!
To install sourmash, run:
sudo apt-get -y update && \
sudo apt-get install -y python3.5-dev python3.5-venv make \
libc6-dev g++ zlib1g-dev
this installs Python 3.5.
Now, create a local software install and populate it with Jupyter and other dependencies:
python3.5 -m venv ~/py3
. ~/py3/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer
pip install -U https://github.com/dib-lab/sourmash/archive/master.zip
Download some reads and a reference genome:
mkdir ~/data
cd ~/data
wget http://public.ged.msu.edu.s3.amazonaws.com/ecoli_ref-5m-trim.pe.fq.gz
wget https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
Compute a scaled MinHash signature from our reads:
mkdir ~/sourmash
cd ~/sourmash
sourmash compute --scaled 10000 ~/data/ecoli_ref*pe*.fq.gz -o ecoli-reads.sig -k 31
Use case: how much of the read content is contained in the reference genome?
Build a signature for an E. coli genome:
sourmash compute --scaled 10000 -k 31 ~/data/ecoliMG1655.fa.gz -o ecoli-genome.sig
and now evaluate containment, that is, what fraction of the read content is contained in the genome:
sourmash search -k 31 ecoli-reads.sig ecoli-genome.sig --containment
and you should see:
# running sourmash subcommand: search
loaded query: /home/ubuntu/data/ecoli_ref-5m... (k=31, DNA)
loaded 1 signatures from ecoli-genome.sig
1 matches:
similarity match
---------- -----
46.6% /home/ubuntu/data/ecoliMG1655.fa.gz
Why are only 50% or so of our k-mers from the reads in the genome!? Any ideas?
Try the reverse - why is it bigger?
sourmash search -k 31 ecoli-genome.sig ecoli-reads.sig --containment
(...but 99% of our k-mers from the genome are in the reads!?)
This is basically because of sequencing error! Illumina data contains a lot of errors, and the erroneous k-mers in the reads don’t show up in the reference genome.
Suppose that we have a collection of signatures (made with sourmash compute
as above) and we want to search it with our newly assembled
genome (or the reads, even!). How would we do that?
Let’s grab a sample collection of 50 E. coli genomes and unpack it –
mkdir ecoli_many_sigs
cd ecoli_many_sigs
curl -O -L https://github.com/dib-lab/sourmash/raw/master/data/eschericia-sigs.tar.gz
tar xzf eschericia-sigs.tar.gz
rm eschericia-sigs.tar.gz
cd ../
This will produce 50 files named ecoli-N.sig in the ecoli_many_sigs directory –
ls ecoli_many_sigs
Let’s turn this into an easily-searchable database with sourmash index
–
sourmash index -k 31 ecolidb ecoli_many_sigs/*.sig
What does the database look like and how does the search work?
sourmash search (and gather) compare the query against the database (a sequence bloom tree) and report matches based on containment.
One point to make with this is that the search can quickly narrow down which signatures match your query, without losing any matches. It’s a clever example of how computer scientists can actually make life better :).
And now we can search!
sourmash search ecoli-genome.sig ecolidb.sbt.json -n 20
You should see output like this:
# running sourmash subcommand: search
select query k=31 automatically.
loaded query: /home/tx160085/data/ecoliMG165... (k=31, DNA)
loaded SBT ecolidb.sbt.json
Searching SBT ecolidb.sbt.json
49 matches; showing first 20:
similarity match
---------- -----
75.4% NZ_JMGW01000001.1 Escherichia coli 1-176-05_S4_C2 e117605...
72.2% NZ_GG774190.1 Escherichia coli MS 196-1 Scfld2538, whole ...
71.4% NZ_JMGU01000001.1 Escherichia coli 2-011-08_S3_C2 e201108...
70.1% NZ_JHRU01000001.1 Escherichia coli strain 100854 100854_1...
69.0% NZ_JH659569.1 Escherichia coli M919 supercont2.1, whole g...
64.9% NZ_JNLZ01000001.1 Escherichia coli 3-105-05_S1_C1 e310505...
63.0% NZ_MOJK01000001.1 Escherichia coli strain 469 Cleandata-B...
62.9% NZ_MOGK01000001.1 Escherichia coli strain 676 BN4_676_1_(...
62.0% NZ_JHDG01000001.1 Escherichia coli 1-176-05_S3_C1 e117605...
59.9% NZ_MIWF01000001.1 Escherichia coli strain AF7759-1 contig...
52.7% NZ_KE700241.1 Escherichia coli HVH 147 (4-5893887) acYxy-...
51.7% NZ_APWY01000001.1 Escherichia coli 178200 gec178200.conti...
49.3% NZ_LVOV01000001.1 Escherichia coli strain swine72 swine72...
49.3% NZ_MIWP01000001.1 Escherichia coli strain K6412 contig_00...
49.0% NZ_LQWB01000001.1 Escherichia coli strain GN03624 GCID_EC...
48.9% NZ_JHGJ01000001.1 Escherichia coli O45:H2 str. 2009C-4780...
48.1% NZ_CP011331.1 Escherichia coli O104:H4 str. C227-11, comp...
47.7% NZ_JHNB01000001.1 Escherichia coli O103:H25 str. 2010C-45...
47.7% NZ_JHRE01000001.1 Escherichia coli strain 302014 302014_1...
47.6% NZ_JHHE01000001.1 Escherichia coli O103:H2 str. 2009C-327...
This is a quick way of identifying what genome is in the signature.
Adjust plotting (this is a bug in sourmash :) –
echo 'backend : Agg' > matplotlibrc
Compare all the signatures:
sourmash compare ecoli_many_sigs/* -o ecoli_cmp
and then plot:
sourmash plot --labels ecoli_cmp
which will produce two files, ecoli_cmp.matrix.png and ecoli_cmp.dendro.png, which you can then download via FileZilla and view on your local computer.
Now open jupyter notebook and visualize:
from IPython.display import Image
Image("ecoli_cmp.matrix.png")
Here’s a PNG version:
At the beginning, we downloaded and unpacked a GenBank index of all the microbial genomes – you can see a basic description in CTB’s blog post – this one contains sketches of all ~100k GenBank microbes. (See available sourmash databases for more information.)
After this database is unpacked, it produces a file genbank-k31.sbt.json and a whole bunch of hidden files in the directory .sbt.genbank-k31.
Next, run the ‘gather’ command to see what’s in your ecoli genome –
at the command line run:
sourmash gather -k 31 ecoli-reads.sig ../genbank-k31.sbt.json -o ecoli-reads-gather-results.csv
and you should get:
# running sourmash subcommand: sbt_gather
loaded query: /home/tx160085/data/ecoli_ref-... (k=31, DNA)
loaded SBT ../genbank-k31.sbt.json
overlap p_query p_match
--------- ------- --------
4.9 Mbp 46.6% 100.0% APIN01000001.1 Escherichia coli str. ...
3.4 Mbp 32.4% 1.5% AQAX01000001.1 Escherichia coli P0304...
2.5 Mbp 23.6% 0.7% JHDK01000001.1 Escherichia coli 1-110...
160.0 kbp 1.5% 0.6% AMPE01000001.1 Citrobacter sp. L17 co...
0.5 Mbp 4.8% 0.4% BBVS01000001.1 Escherichia albertii D...
3.3 Mbp 31.2% 0.4% APYP01000001.1 Escherichia coli P0299...
3.6 Mbp 34.5% 0.4% AQCO01000001.1 Escherichia coli MP021...
4.3 Mbp 40.9% 0.4% JONB01000001.1 Escherichia coli 3-020...
2.8 Mbp 26.0% 0.5% AZPL01000001.1 Shigella flexneri 2001...
2.8 Mbp 26.0% 0.2% MIIY01000001.1 Shigella sp. FC1737 NO...
2.2 Mbp 20.7% 0.2% AFET01000004.1 Escherichia coli AA86 ...
found less than 10.0 kbp in common. => exiting
found 11 matches total;
the recovered matches hit 49.2% of the query
You can use this on metagenomes (assembled and unassembled) as well; you’ve just got to make the signature files.
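For example, to run gather on your own assembly you would first sketch it with sourmash compute and then search the resulting signature. This is only a sketch – the -k/--scaled values should match the database you are searching, and option names can vary between sourmash versions:
# sketch an assembly (here the subset assembly used elsewhere in this workshop)
sourmash compute -k 31 --scaled 10000 subset_assembly.fa -o subset_assembly.sig
# then ask which known genomes it contains
sourmash gather -k 31 subset_assembly.sig ../genbank-k31.sbt.json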
To see this in action, here is gather running on a signature generated from some sequences that assemble (but don’t align to known genomes) from the Shakya et al. 2013 mock metagenome paper.
wget https://github.com/dib-lab/sourmash/raw/master/doc/_static/shakya-unaligned-contigs.sig
sourmash gather -k 31 shakya-unaligned-contigs.sig ../genbank-k31.sbt.json
This should yield:
# running sourmash subcommand: gather
loaded query: mqc500.QC.AMBIGUOUS.99.unalign... (k=31, DNA)
loaded SBT genbank-k31.sbt.json
overlap p_query p_match
--------- ------- --------
1.4 Mbp 11.0% 58.0% JANA01000001.1 Fusobacterium sp. OBRC1 c
1.0 Mbp 7.7% 25.9% CP001957.1 Haloferax volcanii DS2 plasmi
0.9 Mbp 7.5% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, comp
0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum ruminis
0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris subsp.
0.6 Mbp 4.9% 11.1% CP001252.1 Shewanella baltica OS223, com
0.6 Mbp 4.8% 27.3% AP008226.1 Thermus thermophilus HB8 geno
0.6 Mbp 4.4% 11.2% CP000031.2 Ruegeria pomeroyi DSS-3, comp
480.0 kbp 3.8% 7.6% CP000875.1 Herpetosiphon aurantiacus DSM
410.0 kbp 3.3% 10.5% CH959317.1 Sulfitobacter sp. NAS-14.1 sc
1.4 Mbp 10.9% 11.8% LN831027.1 Fusobacterium nucleatum subsp
0.5 Mbp 4.1% 5.3% CP000753.1 Shewanella baltica OS185, com
420.0 kbp 3.3% 7.7% FNDZ01000023.1 Proteiniclasticum ruminis
150.0 kbp 1.2% 4.5% CP015081.1 Deinococcus radiodurans R1 ch
150.0 kbp 1.2% 8.2% CP000969.1 Thermotoga sp. RQ2, complete
290.0 kbp 2.3% 4.1% CH959311.1 Sulfitobacter sp. EE-36 scf_1
1.2 Mbp 9.4% 5.0% CP013328.1 Fusobacterium nucleatum subsp
110.0 kbp 0.9% 3.5% FREL01000833.1 Enterococcus faecalis iso
0.6 Mbp 5.0% 2.8% CP000527.1 Desulfovibrio vulgaris DP4, c
340.0 kbp 2.7% 3.3% KQ235732.1 Fusobacterium nucleatum subsp
70.0 kbp 0.6% 1.2% CP000850.1 Salinispora arenicola CNS-205
60.0 kbp 0.5% 0.7% CP000270.1 Burkholderia xenovorans LB400
50.0 kbp 0.4% 2.6% CP001080.1 Sulfurihydrogenibium sp. YO3A
50.0 kbp 0.4% 3.2% L77117.1 Methanocaldococcus jannaschii D
found less than 40.0 kbp in common. => exiting
found 24 matches total;
the recovered matches hit 73.4% of the query
In our recent preprint using this, we showed that
It is straightforward to build your own databases for use with search and gather; this is of interest if you have dozens or hundreds of sequencing data sets in your group. Ping us if you want us to write that up.
There are many tools like Kraken and Kaiju that can do taxonomic classification of individual reads from metagenomes; these seem to perform well (albeit with high false positive rates) in situations where you don’t necessarily have the genome sequences that are in the metagenome. Sourmash, by contrast, can estimate which known genomes are actually present, so that you can extract them and map/align to them. It seems to have a very low false positive rate and is quite sensitive to strains.
Above, we’ve shown you a few things that you can use sourmash for. Here is a (non-exclusive) list of other uses that we’ve been thinking about –
Chat with Luiz, Phil, or Titus if you are interested in these use cases!
A common approach following metagenome assembly is binning, a process by which assembled contigs are collected into groups or ‘bins’ that might then be assigned some taxonomic affiliation. There are many different tools that can be used for binning (see CAMI review for more details). Here, we will be using MaxBin and MetaBAT, which are both user friendly and highly cited. To use these binners, we will first need to map our data against the assembled metagenome using bwa and then estimate relative abundances by contig. We will then inspect the bins generated by MaxBin and MetaBAT using VizBin.
MaxBin
cd
curl https://downloads.jbei.org/data/microbial_communities/MaxBin/getfile.php?MaxBin-2.2.2.tar.gz > MaxBin-2.2.2.tar.gz
tar xzvf MaxBin-2.2.2.tar.gz
cd MaxBin-2.2.2
cd src
make
cd ..
./autobuild_auxiliary
export PATH=$PATH:~/MaxBin-2.2.2
MetaBAT
cd
curl -L https://bitbucket.org/berkeleylab/metabat/downloads/metabat-static-binary-linux-x64_v0.32.4.tar.gz > metabatv0.32.4.tar.gz
tar xvf metabatv0.32.4.tar.gz
Time to finally run the Binners! Note: MaxBin can take a lot of time to run and bin your metagenome. As this is a workshop, we are doing three things that sacrifice quality for speed.
First, generate per-contig read counts from the sorted BAM files. (Q. Why are read counts per contig not a completely accurate abundance measure?)
cd ~/mapping
for i in *sorted.bam
do
samtools idxstats $i > $i.idxstats
cut -f1,3 $i.idxstats > $i.counts
done
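Each resulting .counts file is a two-column table of contig name and mapped read count (columns 1 and 3 of the idxstats output). You can peek at them to check – the filenames follow the sorted BAM names from the loop above:
head *.counts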
–
MaxBin uses read coverage & tetranucleotide frequencies for each contig, and marker gene counts for each bin.
First, we will get a list of the count files that we have to pass to MaxBin:
mkdir ~/mapping/binning
cd ~/mapping/binning
ls ../*counts > abundance.list
Now, on to the actual binning
run_MaxBin.pl -contig ../subset_assembly.fa -abund_list abundance.list -max_iteration 5 -out mbin
This will generate a series of files. Take a look at the files generated. In particular you should see a series of numbered *.fasta files (e.g. mbin.001.fasta). These are the different genome bins predicted by MaxBin.
Take a look at the mbin.summary file. What is shown?
Now, we are going to generate a concatenated file that contains all of our genome bins put together. We will change the fasta header name to include the bin number so that we can tell them apart later.
for file in mbin.*.fasta
do
num=${file//[!0-9]/}
sed -e "/^>/ s/$/ ${num}/" mbin.$num.fasta >> maxbin_binned.concat.fasta
done
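As a quick sanity check (not part of the original workflow), you can count how many bins MaxBin produced and how many contigs ended up in the concatenated file:
# number of bins
ls mbin.*.fasta | wc -l
# total number of binned contigs
grep -c ">" maxbin_binned.concat.fasta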
And finally make an annotation file for visualization
echo label > maxbin_annotation.list
grep ">" maxbin_binned.concat.fasta |cut -f2 -d ' '>> maxbin_annotation.list
–
MetaBAT uses read coverage, coverage variance, & tetranucleotide frequencies for each contig. This is done with a custom script
cd ~/mapping
~/metabat/jgi_summarize_bam_contig_depths --outputDepth depth_var.txt *sorted.bam
Set up another “binning” directory for this tool’s results
mkdir ~/mapping/binning_metabat
cd ~/mapping/binning_metabat
Run MetaBAT script
Note that we are outputting info to a logfile
~/metabat/metabat -i ../subset_assembly.fa -a ../depth_var.txt --verysensitive -o metabat -v > log.txt
Make the .fasta file of all binned sequences
for file in metabat.*.fa
do
num=${file//[!0-9]/}
sed -e "/^>/ s/$/ ${num}/" metabat.$num.fa >> metabat_binned.concat.fasta
done
Make an annotation file of the bin numbers for annotation in VizBin
echo label > metabat_annotation.list
grep ">" metabat_binned.concat.fasta |cut -f2 -d ' '>> metabat_annotation.list
Now that we have our binned data from both MetaBAT and MaxBin there are several different things we can do. One thing we might want to do is check the quality of the binning – a useful tool for this is CheckM. Today, for the sake of time, we will visualize the bins that we just generated using VizBin.
First, install VizBin:
cd
sudo apt-get install libatlas3-base libopenblas-base default-jre
curl -L https://github.com/claczny/VizBin/blob/master/VizBin-dist.jar?raw=true > VizBin-dist.jar
VizBin can run in OSX or Linux but is very hard to install on Windows. To simplify things, we are going to run VizBin in the web desktop through Jetstream (which is ... a bit clunky). So, go back to Jetstream and open up the web desktop.
Open the terminal through the desktop simulator and open VizBin:
java -jar VizBin-dist.jar
This should prompt VizBin to open in another window. First we will look at the output of the MaxBin binning. Click the choose button to open a file browser and navigate to the binning folder (~/mapping/binning). There you will find the concatenated binned fasta file (maxbin_binned.concat.fasta). Upload this file and hit run.
What do you see? Read up a bit on VizBin to see how the visualization is generated.
Now, upload the maxbin_annotation.list file as an annotation file to VizBin. The annotation file contains the bin id for each of the contigs in the assembly that were binned.
Now, do the same for MetaBAT!
Compare the results of the two binning methods.
Salmon is one of a breed of new, very fast RNAseq counting packages. Like Kallisto and Sailfish, Salmon counts fragments without doing up-front read mapping. Salmon can be used with edgeR and others to do differential expression analysis (if you are quantifying RNAseq data).
Today we will use it to get a handle on the relative distribution of genomic reads across the predicted protein regions.
The goals of this tutorial are to:
Extra resources:
Download and extract the latest version of Salmon and add it to your PATH:
cd
. ~/py3/bin/activate
pip install pandas
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.7.2/Salmon-0.7.2_linux_x86_64.tar.gz
tar -xvzf Salmon-0.7.2_linux_x86_64.tar.gz
cd Salmon-0.7.2_linux_x86_64
export PATH=$PATH:$HOME/Salmon-0.7.2_linux_x86_64/bin
Make a new directory for the quantification of data with Salmon:
mkdir ~/quant
cd ~/quant
Grab the nucleotide (*ffn) predicted protein regions from Prokka and link them here. Also grab the trimmed sequence data (*fq):
ln -fs ~/annotation/prokka_annotation/metagG.ffn .
ln -fs ~/annotation/prokka_annotation/metagG.gff .
ln -fs ~/data/*.abundtrim.subset.pe.fq.gz .
Create the salmon index:
salmon index -t metagG.ffn -i transcript_index --type quasi -k 31
Salmon requires that paired reads be separated into two files. We can split the reads using the split-paired-reads.py script from the khmer package:
pip install khmer
for file in *.abundtrim.subset.pe.fq.gz
do
tail=.fq.gz
BASE=${file/$tail/}
split-paired-reads.py $BASE$tail -1 ${file/$tail/}.1.fq -2 ${file/$tail/}.2.fq
done
Now, we can quantify our reads against this reference:
for file in *.pe.1.fq
do
tail1=.abundtrim.subset.pe.1.fq
tail2=.abundtrim.subset.pe.2.fq
BASE=${file/$tail1/}
salmon quant -i transcript_index --libType IU \
-1 $BASE$tail1 -2 $BASE$tail2 -o $BASE.quant;
done
(Note that --libType must come before the read files!)
This will create a bunch of directories named after the fastq files that we just pushed through. Take a look at what files there are within one of these directories:
find SRR1976948.quant -type f
Now, the quant.sf files actually contain the relevant information about expression – take a look:
head -10 SRR1976948.quant/quant.sf
The first column contains the transcript names, and the fourth column is what we will want down the road - the normalized counts (TPM). However, they’re not in a convenient location / format for use; let’s fix that.
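For example, you can pull out just the transcript names and TPM values from one sample to see what we are after (a quick, optional look; column order is as described above):
cut -f 1,4 SRR1976948.quant/quant.sf | head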
Download the gather-counts.py script:
curl -L -O https://raw.githubusercontent.com/ngs-docs/2016-metagenomics-sio/master/gather-counts.py
and run it:
python2 ./gather-counts.py
This will give you a bunch of .counts files, which are processed from the quant.sf files and named for the directory from which they emanate.
Now, we can create one file:
for file in *counts
do
name=${file%%.*}
sed -e "s/count/$name/g" $file > tmp
mv tmp $file
done
paste *counts |cut -f 1,2,4 > Combined-counts.tab
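Take a quick look at the combined table to confirm that it has one transcript column plus one TPM column per sample:
head Combined-counts.tab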
In Jupyter Notebook, open a new Python3 notebook and name it. Then, into the first cell enter:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In another cell (to make sure we are in the correct directory) enter:
cd ~/quant
Now, we can read our data into a pandas dataframe. This is similar to dataframes in R or an Excel spreadsheet. Paste the following into a new cell.:
count_df=pd.read_table('Combined-counts.tab', index_col='transcript')
count_df
And, finally we can plot it!:
fig, ax = plt.subplots(1) #set up a figure and axis handle
count_df.plot(kind='scatter', x='SRR1976948', y='SRR1977249', ax=ax) #plot scatter plot
The wonderful thing about Jupyter notebooks is that you can go back and modify a plot really easily! Let’s try a few modifications with the above plot.
Download bwa:
cd
curl -L https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2/download > bwa-0.7.15.tar.bz2
Unpack and build it:
tar xjvf bwa-0.7.15.tar.bz2
cd bwa-0.7.15
make
Install it:
sudo cp bwa /usr/local/bin
Now, go to a new directory and grab the data:
mkdir ~/mapping
cd ~/mapping
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948.abundtrim.subset.pe.fq.gz
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1977249.abundtrim.subset.pe.fq.gz
And extract the files:
for file in *fq.gz
do
gunzip $file
done
We will also need the assembly; rather than rebuilding it, you can download a copy that we saved for you:
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/subset_assembly.fa.gz
gunzip subset_assembly.fa.gz
First, we will need to index the megahit assembly:
bwa index subset_assembly.fa
The reads are in paired-end/interleaved format, so you’ll need to add the -p flag to tell bwa that these are paired-end data.
Map the reads:
for i in *fq
do
bwa mem -p subset_assembly.fa $i > ${i}.aln.sam
done
First, index the assembly for samtools:
samtools faidx subset_assembly.fa
Then, convert both SAM files to BAM files:
for i in *.sam
do
samtools import subset_assembly.fa $i $i.bam
samtools sort $i.bam $i.bam.sorted
samtools index $i.bam.sorted.bam
done
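If you want a quick summary of how well the reads mapped, samtools flagstat works on the sorted BAM files (an optional check; the long filenames come from the loop above):
for i in *.sorted.bam
do
samtools flagstat $i
done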
Find a contig name to visualize:
grep -v ^@ SRR1976948.abundtrim.subset.pe.fq.aln.sam | \
cut -f 3 | sort | uniq -c | sort -n | tail
Pick one e.g. k99_13588.
Now execute:
samtools tview SRR1976948.abundtrim.subset.pe.fq.aln.sam.bam.sorted.bam subset_assembly.fa -p k99_13588:400
(use arrow keys to scroll, ‘q’ to quit; for a key to what you are looking at, see the pileup format page: https://en.wikipedia.org/wiki/Pileup_format.)
Look at it in both mappings:
samtools tview SRR1977249.abundtrim.subset.pe.fq.aln.sam.bam.sorted.bam subset_assembly.fa -p k99_13588:400
Why is the mapping so good??
Note: no strain variation :).
Grab some untrimmed data:
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_1.fastq.gz
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_2.fastq.gz
Now align this untrimmed data:
gunzip -c SRR1976948_1.fastq.gz | head -800000 > SRR1976948.1
gunzip -c SRR1976948_2.fastq.gz | head -800000 > SRR1976948.2
bwa aln subset_assembly.fa SRR1976948.1 > SRR1976948_1.untrimmed.sai
bwa aln subset_assembly.fa SRR1976948.2 > SRR1976948_2.untrimmed.sai
bwa sampe subset_assembly.fa SRR1976948_1.untrimmed.sai SRR1976948_2.untrimmed.sai SRR1976948.1 SRR1976948.2 > SRR1976948.untrimmed.sam
i=SRR1976948.untrimmed.sam
samtools import subset_assembly.fa $i $i.bam
samtools sort $i.bam $i.bam.sorted
samtools index $i.bam.sorted.bam
And now look:
samtools tview SRR1976948.untrimmed.sam.bam.sorted.bam subset_assembly.fa -p k99_13588:500
You can also use ‘Tablet’ to view the downloaded BAM file - see the Tablet paper.
(Note, this won’t work with amplified data.)
Extra resources:
At the command line, create a new directory and extract some data:
cd /mnt
mkdir slice
cd slice
We’re going to work with half the read data set for speed reasons –
gunzip -c ../mapping/SRR1976948.abundtrim.subset.pe.fq.gz | \
head -6000000 > SRR1976948.half.fq
In a Jupyter Notebook (go to ‘http://‘ + machine name + ‘:8000’, password ‘davis’), create a new Python notebook “conda root” and run:
cd /mnt/slice
and then in another cell:
!load-into-counting.py -M 4e9 -k 31 SRR1976948.kh SRR1976948.half.fq
and in another cell:
!abundance-dist.py SRR1976948.kh SRR1976948.half.fq SRR1976948.dist
and in yet another cell:
%matplotlib inline
import numpy
from pylab import *
dist1 = numpy.loadtxt('SRR1976948.dist', skiprows=1, delimiter=',')
plot(dist1[:,0], dist1[:,1])
axis(ymax=10000, xmax=1000)
Then:
python2 ~/khmer/sandbox/calc-median-distribution.py SRR1976948.kh \
SRR1976948.half.fq SRR1976948.readdist
And:
python2 ~/khmer/sandbox/slice-reads-by-coverage.py SRR1976948.kh SRR1976948.half.fq slice.fq -m 0 -M 60
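You can check how many reads ended up in the slice with a simple count (FASTQ records are four lines each):
echo $(( $(wc -l < slice.fq) / 4 ))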
(Re)install megahit:
cd
git clone https://github.com/voutcn/megahit.git
cd megahit
make
Go back to the slice directory and extract paired ends:
cd /mnt/slice
extract-paired-ends.py slice.fq
Assemble!
~/megahit/megahit --12 slice.fq.pe -o slice
The contigs will be in slice/final.contigs.fa.
Now we are going to bring it all together and visualize our assembly with Anvi’o. Anvi’o is a powerful and extensible tool that can be applied to pan-genomic analysis as well as metagenomic analysis. The anvi’o group has a series of fantastic online tutorials, including one on metagenomic analysis. They also run workshops periodically (schedule here) that thoroughly cover the use of the software.
Today, we are adapting their tutorial on metagenomic analysis to work with the subset dataset that we have.
The goals of this tutorial are to:
The first thing we need to do is install anvi’o. To install anvi’o we will be using Anaconda.
cd ~
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
bash Anaconda3-4.4.0-Linux-x86_64.sh
Now, follow the prompts for the Anaconda installation. To finish the install, it will ask if you would like to add anaconda to the $PATH in your .bashrc. You should say yes. Now, you just need to source your .bashrc to make sure you can use conda.
source .bashrc
Anaconda should now be installed. We will now use anaconda (conda) to install anvi’o (and all its dependencies), and create an environment in which to run it.
Now, install anvi’o using conda, create an environment in which to run it, and source the environment:
conda create -n anvio232 -c bioconda -c conda-forge gsl anvio
source activate anvio232
Anvi’o should now be installed. But, let’s double check that it worked. They have a nice little test case to check that everything is working well as follows:
anvi-self-test --suite mini
This will start anvi’o processing and ultimately generate an interactive window with the anvi’o environment. This is accessible through port 8080 (typically, though it might go to a different port, which will be specified) at your IP address.
To find your IP address you can look back at the Jetstream instance page and copy the IP address.
Now, open a new tab in your browser* and paste in the following:
*Note: This only works in Google Chrome.
[Your IP Address]:8080
This should open up the anvi’o interface which is interactive and pretty good looking.
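If your network blocks direct access to port 8080 on the instance, you can tunnel it over SSH instead, just like the Jupyter notebook tunnel used elsewhere in these materials (substitute your own username and host):
ssh -N -f -L localhost:8080:localhost:8080 username@remotehost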
Now, we just need to install a few other programs, namely, samtools and Bowtie2, which we will use for mapping and looking at our mapped data.
wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.3.2/bowtie2-2.3.2-linux-x86_64.zip
unzip bowtie2-2.3.2-linux-x86_64.zip
echo 'export PATH=$PATH:~/bowtie2-2.3.2' >> ~/.bashrc
source ~/.bashrc
sudo apt-get -y install samtools
Alright, now onto a complete re-analysis of our data with the anvi’o pipeline.
Anvi’o takes in 1) an assembly and 2) the raw read data. We have both of those already created, so let’s go ahead and download those data (trimmed reads and assemblies):
mkdir ~/anvio-work
cd ~/anvio-work
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948.abundtrim.subset.pe.fq.gz
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1977249.abundtrim.subset.pe.fq.gz
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/subset_assembly.fa.gz
And, gunzip those files:
for file in *gz
do
gunzip $file
done
We now need to get our assembly into the correct format so that anvi’o can interpret it:
anvi-script-reformat-fasta subset_assembly.fa -o anvio-contigs.fa --min-len 2000 --simplify-names --report name_conversions.txt
Take a look at the output files. What has changed?
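For instance, you can see how many contigs survived the 2,000 bp length filter and how their names were changed (quick checks on the two output files):
# how many contigs are left after filtering
grep -c ">" anvio-contigs.fa
# mapping between new and original contig names
head name_conversions.txt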
We need to map our reads to our anvi’o-corrected assembly. This is going to take a little bit of time. First, build an index for bowtie2:
bowtie2-build anvio-contigs.fa anvio-contigs
We can write a for loop to map our two datasets and produce .bam files:
for file in *fq
do
bowtie2 --threads 8 -x anvio-contigs --interleaved $file -S ${file/.fq/}.sam
samtools view -F 4 -bS ${file/.fq/}.sam > ${file/.fq/}.bam
done
As above, we need to make these data readable for anvi’o:
for file in *.bam
do
anvi-init-bam ${file} -o ${file/.bam/}.anvio.bam
done
In this step we are asking anvi’o to create a database with information about your contigs. The contig database is fairly extensible and can contain lots of different information (taxonomic, functional, etc.). Here, we are primarily asking it to do three things:
So, run the following command to generate the database:
anvi-gen-contigs-database -f anvio-contigs.fa -o anvio-contigs.db
Then, run this command to perform the hmm search and identify single copy gene content:
anvi-run-hmms -c anvio-contigs.db --num-threads 28
Now, we can layer on the coverage information from our two samples:
for file in *.anvio.bam
do
anvi-profile -i $file -c anvio-contigs.db -T 28
done
And finally, we run the merge step. This will pull all the information together and create a merged anvi’o profile. This step will also run CONCOCT (another binning algorithm) that will identify bins in our data. Finally, this step calculates the hierarchical relationship between our contigs based on a variety of parameters.
anvi-merge *ANVIO_PROFILE/PROFILE.db -o MERGED-SAMPLES -c anvio-contigs.db --enforce-hierarchical-clustering
Now we can visualize our data!
anvi-interactive -p MERGED-SAMPLES/PROFILE.db -c anvio-contigs.db
First, let’s summarize the bin information for our data. This will produce a series of text-based output files detailing some statistics on our genome bins:
anvi-summarize -p MERGED-SAMPLES/PROFILE.db -c anvio-contigs.db -o SAMPLES-SUMMARY -C CONCOCT
Take a look at the output in SAMPLES-SUMMARY. What does it report?
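A quick way to see what was generated (the exact contents depend on your anvi’o version):
ls SAMPLES-SUMMARY/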
Now you can visualize those data in the anvi’o style by simply adding the -C flag to the previous anvi-interactive command:
anvi-interactive -p MERGED-SAMPLES/PROFILE.db -c anvio-contigs.db -C CONCOCT
Now, we can actually refine the genome bins using anvi’o. This allows us to use human intuition and pattern recognition to better identify contigs that should co-occur.
anvi-refine -p MERGED-SAMPLES/PROFILE.db -c anvio-contigs.db -b Bin_4 -C CONCOCT
Finally, it is time to interact with the anvi’o interface. Here are some screenshots to help guide you in your quest.
And of course a big thank you to Meren for providing us with extra materials to help create this tutorial!
Circos is a powerful visualization tool that allows for the creation of circular graphics to display complex genomic data (e.g. genome comparisons). On top of the circular ideogram generated can be layered any number of graphical information (heatmaps, scatter plots, etc.).
The goals of this tutorial are to:
Note: Beyond this brief crash course, circos is very well-documented and has a great series of tutorials and course materials that are useful.
You’ll need to install one additional ubuntu package, libgd:
sudo apt-get -y install libgd-perl
Make a directory called circos and navigate into it. There, we will download and extract the latest version of circos:
cd
mkdir circos
cd circos
curl -O http://dib-training.ucdavis.edu.s3.amazonaws.com/metagenomics-scripps-2016-10-12/circos-0.69-3.tar.gz
tar -xvzf circos-0.69-3.tar.gz
Circos runs within Perl and as such does not need to be compiled to run. So, we can just add the location of circos to our path variable. (Alternatively, you can append this statement to the end of your .bashrc file.)
export PATH=~/circos/circos-0.69-3/bin:$PATH
Circos does, however, require quite a few additional perl modules to operate correctly. To see what modules are missing and need to be downloaded type the following:
circos -modules > modules
Now, to download all of these we will be using CPAN, a package manager for perl. We are going to pick out all the missing modules and then loop over those modules and download them using cpan.
grep missing modules |cut -f13 -d " " > missing_modules
for mod in $(cat missing_modules);
do
sudo cpan install $mod;
done
This will take a while to run. When it is done check that you now have all modules downloaded by typing:
circos -modules
If you got all ‘ok’ then you are good to go!
And with that, circos should be up and ready to go. Run the example by navigating to the examples folder within the circos folder.
cd ~/circos/circos-0.69-3/example
bash run
This will take a little bit to run but should generate a file called circos.png. Open it and you can get an idea of the huge variety of things that are possible with circos and a lot of patience. We will not be attempting anything that complex today, however.
First, let’s make a directory where we will be doing all of our work for plotting:
mkdir ~/circos/plotting
cd ~/circos/plotting
Now, link in the *gff file output from prokka (which we will use to define the location of genes in each of our genomes), the genome assembly file final.contigs.fa, and the SRR*counts files that we generated with salmon:
ln -fs ~/data/prokka_annotation/*gff .
ln -fs ~/data/final.contigs.fa .
ln -fs ~/quant/*counts .
We also need to grab a set of useful scripts and config files for this plotting exercise:
curl -L -O https://github.com/ngs-docs/2016-metagenomics-sio/raw/master/circos-build.tar.gz
tar -xvzf circos-build.tar.gz
curl -L -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/subset_assembly.fa.gz
gunzip subset_assembly.fa.gz
mv subset_assembly.fa final.contigs.fa
We are going to limit the data we are trying to visualize by taking the longest contigs from our assembly. We can do this using a script from the khmer package:
extract-long-sequences.py final.contigs.fa -l 24000 -o final.contigs.long.fa
cp ~/data/quant/*counts .
Next, we will run a script that processes the data from the files that we just gathered to create circos-acceptable files. This is really the crux of using circos: figuring out how to get your data into the correct format.
python parse_data_for_circos.py
If you are interested, take a look at the script and the input files to see how these data were manipulated.
Circos operates off of three main types of files: 1) config files that dictate the style and inputs of your circos plot, 2) a karyotype file that defines the size and layout of your “chromosomes”, and 3) any data files that you call in your config file that detail attributes you want to plot.
The above script generated our karyotype file and four different data files. What are they? How are they oriented?
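If you want to poke at those questions from the command line, list the plotting and circos-build directories and head the new files – exact names and locations depend on what parse_data_for_circos.py writes:
ls
ls circos-build/
# then head the karyotype and data files you see listed there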
Now, all that is left is actually running circos. Navigate into the circos-build directory and type circos:
cd circos-build
circos
This command should generate a circos.svg and a circos.png. Check out the circos.png!
Now, let’s take a look at the file that controls this crazy figure – circos.config.
Try changing a few parameters– colors, radius, size, to see what you can do. Again, if you are into this type of visualization, do check out the extensive tutorial.
See CTB’s Mar 2016 workshop on “repeatability” - in sum,
Among other things, this provides a set of artifacts that can be sent to your advisor, provided to your collaborators and (ultimately) published with your publication. Plus you won’t have to remember what you did - you’ll have it written down!
SRR1976948_1.fastq.gz - First 1m reads of the SRA record
SRR1976948_2.fastq.gz - First 1m reads of the SRA record
SRR1977249_1.fastq.gz - First 1m reads of the SRA record
SRR1977249_2.fastq.gz - First 1m reads of the SRA record
SRR1977296_1.fastq.gz - First 1m reads of the SRA record
SRR1977296_2.fastq.gz - First 1m reads of the SRA record
SRR1976948.abundtrim.subset.pe.fq.gz - abundtrim/subset swept/PE
SRR1977249.abundtrim.subset.pe.fq.gz - abundtrim/subset swept/PE
all-genomes.fasta.gz - all of the genomes from the study
subset-genomes.fasta.gz - ~1/8th of the genomes from the study
subset_assembly.fa.gz - assembly of the two abundtrim/subset data sets
Make a virtualenv:
python -m virtualenv ../build
. ../build/bin/activate
Install the requirements:
pip install -r requirements.txt
Run make
Look at the html/ directory.
This workshop was given on July 17th - 21st, 2017 at the University of California, Davis. Lead Instructor: Harriet Alexander. Instructors: C. Titus Brown, Phillip Brooks, Adelaide Rhodes, Jessica Blanton, Jessica Mizzi, Shawn Higdon, and Veronika Kivenson.
See https://2017-dibsi-metagenomics.readthedocs.io/en/latest/ for the rendered version of this site.
(Instructions mostly copied from Short read quality and trimming!)
Use image “Ubuntu 14.04.3”
Run:
sudo apt-get -y update && \
sudo apt-get -y install trimmomatic fastqc python-pip \
samtools zlib1g-dev ncurses-dev python-dev
Install anaconda:
curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
bash Anaconda3-4.2.0-Linux-x86_64.sh
Then update your environment and install khmer and sourmash:
source ~/.bashrc
conda install -n root pip -y
pip install https://github.com/dib-lab/khmer/archive/master.zip
pip install https://github.com/dib-lab/sourmash/archive/2017-ucsc-metagenome.zip
(See the sourmash docs for this workshop for some details on the sourmash install.)
Let’s also run a Jupyter Notebook in your home directory. Configure it a teensy bit more securely, and also have it run in the background.
Generate a config:
jupyter notebook --generate-config
Add a password, have it not run a browser, and put it on port 8000 by default:
cat >> ~/.jupyter/jupyter_notebook_config.py <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'
c.NotebookApp.port = 8000
EOF
Now, run!
jupyter notebook &
This will output some stuff; to make the prompt appear again, hit ENTER a few times.
You should now be able to visit port 8000 on your computer and see the Jupyter console; to get the URL to Jupyter, run:
echo http://$(hostname):8000/
Note
If your network blocks port 8000 (e.g. cruznet at UCSC), you can run:
ssh -N -f -L localhost:8000:localhost:8000 username@remotehost
to tunnel the remote Jupyter notebook server over SSH.
We are now ready to map and bin reads.
(Optional)
khmer documentation: http://khmer.readthedocs.io/en/latest
If you plot a k-mer abundance histogram of the samples, you’ll notice something: there’s an awful lot of unique (abundance=1) k-mers. These are erroneous k-mers caused by sequencing errors.
In a new Python3 Jupyter Notebook, run:
cd ~/work
and then
!abundance-dist-single.py -M 1e9 -k 21 SRR1976948_1.fastq.gz SRR1976948_1.fastq.gz.dist
and in another cell:
%matplotlib inline
import numpy
from pylab import *
dist1 = numpy.loadtxt('SRR1976948_1.fastq.gz.dist', skiprows=1, delimiter=',')
plot(dist1[:,0], dist1[:,1])
axis(xmax=50)
Many of these errors remain even after you do the Trimmomatic run; you can see this with:
!abundance-dist-single.py -M 1e9 -k 21 SRR1976948_1.qc.fq.gz SRR1976948_1.qc.fq.gz.dist
and then plot:
dist2 = numpy.loadtxt('SRR1976948_1.qc.fq.gz.dist', skiprows=1, delimiter=',')
plot(dist1[:,0], dist1[:,1], label='untrimmed')
plot(dist2[:,0], dist2[:,1], label='trimmed')
legend(loc='upper right')
axis(xmax=50)
This is for two reasons:
First, Trimmomatic trims based solely on the quality score, which is a statistical statement about the correctness of a base - a Q score of 30 means that, of 1000 bases with that Q score, 1 of those bases will be wrong. So, a base can have a high Q score and still be wrong! (and many bases will have a low Q score and still be correct)
Second, we trimmed very lightly - only bases that had a very low quality were removed. This was intentional because with assembly, you want to retain as much coverage as possible, and the assembler will generally figure out what the “correct” base is from the coverage.
An alternative to trimming based on the quality scores is to trim based on k-mer abundance - this is known as k-mer spectral error trimming. K-mer spectral error trimming always beats quality score trimming in terms of eliminating errors; e.g. look at this table from Zhang et al., 2014:
The basic logic is this: if you see low abundance k-mers in a high coverage data set, those k-mers are almost certainly the result of errors. (Caveat: strain variation could also create them.)
In metagenomic data sets we do have the problem that we may have very low and very high coverage data. So we don’t necessarily want to get rid of all low-abundance k-mers, because they may represent truly low abundance (but useful) data.
As part of the khmer project in my lab, we have developed an approach that sorts reads into high abundance and low abundance reads, and only error trims the high abundance reads.
This does mean that many errors may get left in the data set, because we have no way of figuring out if they are errors or simply low coverage, but that’s OK (and you can always trim them off if you really care).
To run such error trimming, use the command trim-low-abund.py (at the command line, or prefix with a ‘!’ in the notebook):
interleave-reads.py SRR1976948_1.qc.fq.gz SRR1976948_2.qc.fq.gz |
trim-low-abund.py -V -M 8e9 -C 3 -Z 10 - -o SRR1976948.trim.fq
If you can assemble your data set without k-mer trimming, there’s no reason to do it. The reason we’re error trimming here is to speed up the assembler (by removing data) and to decrease the memory requirements of the assembler (by removing a number of k-mers).
To see how many k-mers we removed, you can examine the distribution as above, or use the unique-kmers.py script:
unique-kmers.py SRR1976948_1.qc.fq.gz SRR1976948_2.qc.fq.gz
unique-kmers.py SRR1976948.trim.fq