(Jetstream startup instructions HERE.)
—
You should now be logged into your Jetstream computer! You should see something like this:
ubuntu@ip-172-30-1-252:~$
this is the command prompt.
First, let’s install software for short read quality assessment, trimming and python virtual environments:
sudo apt-get -y update && \
sudo apt-get -y install trimmomatic python-pip \
samtools zlib1g-dev ncurses-dev python-dev
sudo apt-get install -y python3.5-dev python3.5-venv make \
libc6-dev g++ zlib1g-dev
wget -c http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
cd FastQC
chmod +x fastqc
cd
Now, create a python 3.5 virtual environment and install software within:
python3.5 -m venv ~/py3
. ~/py3/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer
pip install -U https://github.com/dib-lab/sourmash/archive/master.zip
Let’s also run a Jupyter Notebook. First, configure it a teensy bit more securely, and also have it run in the background:
jupyter notebook --generate-config
cat >>~/.jupyter/jupyter_notebook_config.py <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'
c.NotebookApp.port = 8000
EOF
Now, run!
jupyter notebook &
On the Jetstream webshell, you can get the Web page address for the jupyter notebook you just launched. Press enter, then execute
echo http://$(hostname):8000/
Copy the output from echo and paste it into the url bar of a new tab in your web browser.
Note, the password is ‘davis’.
Note
If your network blocks port 8000 (e.g. cruznet at UCSC), you can run:
ssh -N -f -L localhost:8000:localhost:8000 username@remotehost
to tunnel the remote Jupyter notebook server over SSH.
We’re going to be using a subset of data from Hu et al., 2016. This paper from the Banfield lab samples some relatively low diversity environments and finds a bunch of nearly complete genomes.
(See DATA.html for a list of the data sets we’re using in this tutorial.)
We’ve loaded subsets of the data onto an Amazon location for you, to
make everything faster for today’s work. We’re going to put the
files on your computer locally under the directory ~/data
:
mkdir ~/data
Next, let’s grab part of the data set:
cd ~/data
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_1.fastq.gz
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_2.fastq.gz
Let’s make sure we downloaded all of our data using md5sum.:
md5sum SRR1976948_1.fastq.gz SRR1976948_2.fastq.gz
You should see this:
37bc70919a21fccb134ff2fefcda03ce SRR1976948_1.fastq.gz
29919864e4650e633cc409688b9748e2 SRR1976948_2.fastq.gz
Now if you type:
ls -l
you should see something like:
total 346936
-rw-rw-r-- 1 ubuntu ubuntu 169620631 Oct 11 23:37 SRR1976948_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 185636992 Oct 11 23:38 SRR1976948_2.fastq.gz
These are 1m read subsets of the original data, taken from the beginning of the file.
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. This poses an issue with creating typos or errors in raw data. Let’s fix that before we go on any further:
chmod u-w *
We’ll talk about what these files are below.
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir ~/work
cd ~/work
Now, make a “virtual copy” of the data in your working directory by linking it in –
ln -fs ~/data/* .
These are FASTQ files – let’s take a look at them:
less SRR1976948_1.fastq.gz
(use the spacebar to scroll down, and type ‘q’ to exit ‘less’)
Question:
Links:
We’re going to use FastQC to summarize the data. We already installed ‘fastqc’ on our computer for you.
Now, run FastQC on two files:
fastqc SRR1976948_1.fastq.gz
fastqc SRR1976948_2.fastq.gz
Now type ‘ls’:
ls -d *fastqc.zip*
to list the files, and you should see:
SRR1976948_1_fastqc.zip
SRR1976948_2_fastqc.zip
Inside each of the fatqc directories you will find reports from the fastqc. You can download these files using your Jupyter Notebook console, if you like; or you can look at these copies of them:
Questions:
Links:
There are several caveats about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first
Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via apt-get.
The first thing we’ll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE SRR1976948_1.fastq.gz \
SRR1976948_2.fastq.gz \
SRR1976948_1.qc.fq.gz s1_se \
SRR1976948_2.qc.fq.gz s2_se \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
You should see output that looks like this:
...
Input Read Pairs: 1000000 Both Surviving: 885734 (88.57%) Forward Only Surviving: 114262 (11.43%) Reverse Only Surviving: 4 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.
Links:
Run FastQC again on the trimmed files:
fastqc SRR1976948_1.qc.fq.gz
fastqc SRR1976948_2.qc.fq.gz
And now view my copies of these files:
Let’s take a look at the output files:
less SRR1976948_1.qc.fq.gz
(again, use spacebar to scroll, ‘q’ to exit less).
MultiQC aggregates results across many samples into a single report for easy comparison.
Install MultiQC within the py3 environment:
pip install multiqc
Now, run Mulitqc on both the untrimmed and trimmed files within the work directory:
multiqc .
And now you should see output that looks like this:
[INFO ] multiqc : This is MultiQC v1.0
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching '.'
Searching 15 files.. [####################################] 100%
[INFO ] fastqc : Found 4 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : multiqc_report.html
[INFO ] multiqc : Data : multiqc_data
[INFO ] multiqc : MultiQC complete
Now we can view the output file using Jupyter Notebook.
Questions:
Optional: K-mer Spectral Error Trimming