XenBioinfo fromIanQuigley

From TaejoonLab

Now, back to the software that we used.

General: Pretty much all of these tools are command-line tools. Most of them run from the basic Unix command line, except for DESeq, which runs inside R. We installed all of this stuff for you on the server, partly because it turns out that beginners often have a lot of trouble installing things. To do this yourself, you'll need either a machine with a Unix-style command line (Linux, or a Mac, which counts for our purposes) or, on a PC, a program like PuTTY to log in to a Linux server. I think the newest PC operating system will also have Linux functionality. Another thing to consider is that Linux itself is free, so if you've got a spare PC lying around that you don't need for anything else, you can make it your very own Linux server! I did this in order to force myself to learn how to use the commands. It worked out pretty well. To get started down the path of installing the Linux OS, go to Ubuntu.com. And if you're feeling really hardcore, you can actually just make your server live in the cloud, like with Amazon Web Services. This is largely what I do these days, and I bet everyone else will too in a few years.

Here is the general progression of installing software from the command line (sadly, it is not typically a click-and-watch-it-install kind of thing). Download the package. Click "download" if there's a button, or if you're feeling hardcore, right-click the download link, copy the URL, and fetch it with wget:

$ wget http://hosting_website.edu/the-package.tgz

That ending bit - ".tgz" - means that the developer made a gzip-compressed tarball. A tarball bundles a whole folder into a single file, and compressing it makes that file smaller for easy moving around. Uncompressing a tarball is a clunky command, sufficiently clunky that there is even an XKCD comic about it. If it's a .tgz, the command is

$ tar -xvzf the-package.tgz

A bunch of stuff will come out of it, and by the end you'll have a directory that you can then go into and see what's up.

$ cd the-package/

There is almost always a README.txt in these things. Look at it to see if they've got any advice about installing, etc.

$ more README.txt
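If you'd like to practice the tarball steps without downloading anything, here is a small sketch that builds a stand-in package, lists what is inside it, and unpacks it. The package name and the README contents are made up for the exercise.

```shell
# Work in a scratch directory so nothing important gets touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a stand-in for a downloaded package (hypothetical name and contents).
mkdir the-package
echo "Run ./configure, then make, then make install." > the-package/README.txt
tar -czf the-package.tgz the-package
rm -r the-package

# Look inside without unpacking (t = list, z = gzip, f = file name follows).
tar -tzf the-package.tgz

# Unpack it (x = extract, v = verbose) and read the README.
tar -xvzf the-package.tgz
cat the-package/README.txt
```

Listing with -t first is a handy habit, since it tells you what directory name to expect before anything lands on disk.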

Sometimes there isn't a README to tell you what to do, or the one that is there isn't all that helpful. Fortunately, installing software in a linux environment is almost always the same three commands:

$ ./configure
$ make
$ make install

So, try that first. Weird, yes. Here are some more details about it.

Sometimes, when you're installing something, the software you're trying to install needs another piece of software that isn't there yet. When one piece of software needs another to run properly, it's called a dependency, and you'll pretty much always get an error while trying to install if that dependency is missing. When you're setting up something for the first time, you may discover that some of your dependencies have dependencies of their own. There is a term for this: dependency hell. It may eat up a lot of your time. Usually what ends up happening is that you'll see an error, and then you should copy and paste that error into Google. Seriously. I guarantee you somebody else has had this problem before, and I guarantee that at least one of those people has asked the internet how to solve it. ALL COMPUTER PEOPLE PASTE ERRORS INTO GOOGLE TRYING TO FIX THINGS, AND IT GENERALLY WORKS. I cannot emphasize this enough. A Googling strategy is your greatest ally.

Also, understand that the first time you're trying to set up your computer environment, installing all those packages could take a long time. Like a day or two, maybe longer. I'm not kidding. This is why Taejoon installed all the stuff on the servers for us before the course. So, when you do this for yourself, chin up! It may well suck.
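One way to take a little of the sting out of dependency hell is to check what is already on the machine before you start. The sketch below uses the standard "command -v" lookup; the function name check_tools and the list of tools in the example are just assumptions for illustration.

```shell
# check_tools: report which of the named commands can be found on your PATH.
# Returns 0 if everything was found, 1 if anything is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example: things you typically need before building anything from source.
check_tools wget tar make gcc || echo "Install the missing ones first."
```

This only tells you whether a command exists, not whether it is the version the software wants, so the error-into-Google routine still applies.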

Okay. So, with that said, here are the different pieces of software I had you guys use in the course.

ChIPseq: To map reads, we used bowtie2 (paper). Another fine mapper is bwa mem (paper). A third great mapper is RNA STAR (paper), which is super fast, but that one has radically clunky demands and really weird documentation (documentation = "instruction manual"), so using it will be frustrating the first time out. Recall that to search the genome, each one of these aligners is going to want to make its own index, and that these indexes won't be usable by other aligners (e.g., bwa mem can't use the bowtie2 index, because it searches the genome a different way).
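To make the one-index-per-aligner point concrete, here is a sketch of the two index-building commands side by side. By default it only prints the commands (a dry run), so it is safe to paste even if neither aligner is installed. The file name genome.fa, the _bt2 prefix, and the helper names run and build_indexes are assumptions.

```shell
# run: print a command to stderr; actually execute it only when DRY_RUN=0.
run() {
    echo "+ $*" >&2
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# build_indexes: make a separate index for each aligner from the same FASTA.
build_indexes() {
    genome="$1"
    prefix="${genome%.fa}"
    # bowtie2-build takes the FASTA and a prefix for the index files it writes.
    run bowtie2-build "$genome" "${prefix}_bt2"
    # bwa writes its own index files next to the FASTA itself.
    run bwa index "$genome"
}

build_indexes genome.fa
```

Later on, bowtie2 is pointed at its index with -x genome_bt2, while bwa mem is given the path to genome.fa and finds its own index files beside it. Set DRY_RUN=0 to run the commands for real.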

We aligned ChIPseq reads to the genome. We then needed to manipulate the alignment files to make them digestible by the IGV browser. We did this with SAMtools, a piece of software that's useful for changing sequencing file formats (SAM to BAM, etc.).
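For the record, IGV wants a BAM file that is sorted by coordinate and has an index sitting next to it. A minimal sketch of that conversion follows, assuming the samtools 1.x command syntax (older versions differ) and a made-up file name. As written it only prints the commands; set DRY_RUN=0 to run them.

```shell
# run: print a command to stderr; actually execute it only when DRY_RUN=0.
run() {
    echo "+ $*" >&2
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# sam_to_igv_bam: turn an aligner's SAM output into a sorted, indexed BAM.
sam_to_igv_bam() {
    sam="$1"
    bam="${sam%.sam}.sorted.bam"
    # Sort by genome coordinate and write BAM in one step.
    run samtools sort -o "$bam" "$sam"
    # Build the .bai index that the browser looks for beside the BAM.
    run samtools index "$bam"
}

sam_to_igv_bam chip_rep1.sam
```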

We then visualized the ChIPseq data on the IGV browser. Note that one could also use GBrowse or JBrowse. The GBrowse link will take you to Xenbase, where you can actually just upload BAM files (or similar) and see them there without having to install anything, so that's not a bad option, although I always did think that GBrowse was aesthetically ugly.

Eventually the UCSC browser will support X. laevis, so you'll probably want to use that whenever they get around to it.

We then called peaks on those ChIPseq reads with HOMER and also did some other stuff ("getDistalPeaks" is from HOMER, too). Other well-known peak callers include MACS, MACS2, and SPP. HOMER is by far the easiest to use of these, but is also the biggest pain to install.

Let me sing the praises of HOMER for a minute. In the interest of disclosure, I will preface this by saying that the developer, Chris Benner at UCSD (watch his talk), is a good friend of mine. I will also reveal that Taejoon hates it, not because it doesn't do its job well but because one can get into real dependency hell trying to install it. HOMER will want functional SAMtools, BEDTools, seqlogo, and a few other things, and the install will crash without them. You'll get an error message, google it, install the dependency, and try to install HOMER again, at which point it will need another dependency and crash again, etc., for hours. Sucks, I won't deny it. Getting HOMER right will probably produce more tears and heartache than any other thing you're trying to do when you're setting up your computer for the first time.

However, once it's running, HOMER does some awesome shit. It has a ton of nifty little tools to move peaks around, annotate them, call them, and intersect them. It can make BAM files into UCSC files or bigWigs or other formats that make browsing easier. It finds motifs, which is what it was originally designed to do. It can analyze Hi-C data. It can work with custom genomes: my first laevis attempts involved me downloading the old-school EST collection (!), aligning it to the genome myself, building the gtf file myself, and giving it to HOMER to work with. There are other tools to do these jobs, but they are like a million little programs you gotta go find and install, whereas HOMER can do all those jobs by itself. The talk I gave on my stuff the last day? Everything that wasn't RNAseq was done with HOMER (after alignment). All of it.

The main HOMER site also has tons of tutorials on how to do stuff. Here are some (first one, second one) for people who have never done this kind of sequencing before. Here's one on how to find motifs, and here's another one on how to make that ginormous heatmap I put in my talk. Click around, you'll be amazed - it's a lot more informative and readable than the documentation of just about every other software package I've put on here.

Okay, enough about how great HOMER is. One thing you'll probably remember is that we used a gtf file along with the genome, both for HOMER and the browser. We used mine - which is to say, mostly Taejoon's Mayball models but with my naming convention. Because a gtf (like two similar file formats, gff3 and psl) is basically a list of addresses for exons (e.g., "nucleotides 100-150 of chromosome 1 are where exon 1 of tubb2b lives"), it is genome-version specific. Let's say a new genome version adds 10 nucleotides to the front of chromosome 1 that the last version missed. Now the gtf is wrong! So you gotta make sure these things match up right. Xenbase has a couple of genome assemblies but also several gtf/gff3s that reflect different iterations of gene models. Xenbase doesn't have mine, but you can get mine either at my site or the wiki. I expect all of this will settle down to a nice standardized one soon, but you should know about these things to stay out of trouble.
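Here is the "10 extra nucleotides" scenario as a toy you can run. It writes a one-line gtf for the tubb2b exon from the example, then shifts the start and end columns the way a new assembly would require. The chromosome name chr1, the source field, and the transcript ID are made up. Real assembly updates are rarely a single clean offset, so treat this as an illustration of why the files must match, not as a conversion tool.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)" || exit 1

# A one-line GTF for the exon in the example (made-up record, tab-separated).
printf 'chr1\texample\texon\t100\t150\t.\t+\t.\tgene_id "tubb2b"; transcript_id "tubb2b.1";\n' > old.gtf

# Suppose the new assembly adds 10 nucleotides to the front of chr1.
# Columns 4 and 5 of a GTF are start and end, so both must move by 10.
awk -F '\t' -v OFS='\t' -v chrom=chr1 -v offset=10 \
    '$1 == chrom { $4 += offset; $5 += offset } { print }' old.gtf > new.gtf

cut -f 1,4,5 old.gtf   # chr1  100  150
cut -f 1,4,5 new.gtf   # chr1  110  160
```

Used against the new genome, old.gtf would point 10 nucleotides upstream of the real exon, which is exactly the kind of quiet error that version mismatches cause.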

The rest of the ChIPseq stuff we did was all HOMER programs - findPeaks, annotatePeaks, getDistalPeaks, findMotifsGenome.
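As a rough map of how those programs chain together when you bring your own genome FASTA and gtf, here is a dry-run sketch. The file names, the helper names run and chip_peaks, and the particular options (-style factor, -size 200) are assumptions for illustration, not necessarily what we ran in the course; check the HOMER tutorials for the options that fit your experiment. makeTagDirectory is the HOMER step that reads the alignment before peak calling.

```shell
# run: print a command to stderr; actually execute it only when DRY_RUN=0.
run() {
    echo "+ $*" >&2
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# chip_peaks: alignment file -> tag directory -> peaks -> annotation and motifs.
chip_peaks() {
    sam="$1"; genome_fa="$2"; gtf="$3"
    tagdir="${sam%.sam}_tags"
    # HOMER first digests the alignment into a "tag directory".
    run makeTagDirectory "$tagdir" "$sam"
    # Call peaks; "-o auto" writes peaks.txt inside the tag directory.
    run findPeaks "$tagdir" -style factor -o auto
    # Annotate peaks against a custom genome and gene models.
    # (annotatePeaks.pl prints its table to stdout.)
    run annotatePeaks.pl "$tagdir/peaks.txt" "$genome_fa" -gtf "$gtf"
    # Look for enriched motifs under the peaks.
    run findMotifsGenome.pl "$tagdir/peaks.txt" "$genome_fa" "${tagdir}_motifs" -size 200
}

chip_peaks chip_rep1.sam genome.fa genes.gtf
```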

RNAseq: We aligned to the transcriptome with bowtie2. One could also align to the genome, but be aware that then you have to a) extract where the genes are in the genome with a gtf file or equivalent, and b) use a spliced aligner that can identify when reads cross intron-exon junctions and still align them correctly. To keep things simpler, I chose to align to a transcriptome instead, which looks sort of like a genome fasta, only instead of looking like this:

>Chromosome_1
CATATTCACTGCCACTTCTCGCTGCTTCTCGAACTCCTTCCACTCTCTGTTAATGCAAAACAGGATCTGAGCA
etc. till end of chromosome_1

It looks like this

>DDX28|ENSG00000182810|c.Park201106_X000306|JGIv7b.000000004_959733-962017+
ACAAAGCCCACGTTCAGCCGGAAAAGAGAGTAACCTTGCGTCTTCTCATTTGATATTCCAGCATCAGGAAATA
etc. till end of gene DDX28

You'd have to make a bowtie2 index out of the transcriptome fasta, too, before aligning.
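Before indexing a transcriptome fasta, it can help to peek at it from the command line. The sketch below writes a two-record toy file and then counts the records and pulls out the gene names. The first record reuses the DDX28 header above; the second record is invented, with its sequence borrowed from the chromosome example just to have some letters.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)" || exit 1

# A toy transcriptome (second record is made up for illustration).
cat > transcriptome.fa <<'EOF'
>DDX28|ENSG00000182810|c.Park201106_X000306|JGIv7b.000000004_959733-962017+
ACAAAGCCCACGTTCAGCCGGAAAAGAGAGTAACCTTGCGTCTTCTCATTTGATATTCCAGCATCAGGAAATA
>TUBB2B|hypothetical_second_entry
CATATTCACTGCCACTTCTCGCTGCTTCTCGAACTCCTTCCACTCTCTGTTAATGCAAAACAGGATCTGAGCA
EOF

# Count sequences: every record starts with a ">" header line.
grep -c '^>' transcriptome.fa

# Pull the gene name, i.e. the text between ">" and the first "|".
grep '^>' transcriptome.fa | cut -d '|' -f 1 | tr -d '>'
```

On the toy file this prints 2, then DDX28 and TUBB2B. On a real transcriptome the count is a quick sanity check that you have roughly as many sequences as you expect.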

After aligning, a step I did but one that you guys did not do is count the reads at every gene with a counter called eXpress. Another good counter that Taejoon likes is RSEM (manual). A third one that behaves differently but works much faster (on Macs or Linux only, which is why I didn't bother with it for you guys) is Kallisto (Kallisto also needs a different method for differential expression, like sleuth, since it estimates counts differently).

Once you get the counts, you'll want to make a table that looks like so:

Gene name     expt. 1 counts     expt. 2 counts     etc.

If you're using eXpress, you'll want the "effective counts" (the eff_counts column in its output). The RSEM developers want you to use a different differential expression tool (EBSeq), so maybe look at the RSEM manual a little more closely if you're going that way.
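Building that table by hand gets old fast, so here is a sketch of stitching two experiments together. The two input files are made-up, cut-down stand-ins for eXpress output (real files have more columns, and these numbers are invented); the helper name get_counts is also an assumption. The columns are looked up by header name rather than by position, so the same approach should carry over to the full files.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)" || exit 1

# Two invented, simplified stand-ins for per-experiment eXpress results.
printf 'target_id\tlength\teff_counts\nDDX28\t2285\t120.5\nTUBB2B\t1800\t3400.0\n' > expt1.xprs
printf 'target_id\tlength\teff_counts\nTUBB2B\t1800\t2900.0\nDDX28\t2285\t98.2\n' > expt2.xprs

# get_counts FILE COLUMN: print "gene<TAB>value" sorted by gene name,
# locating the target_id column and the requested column from the header row.
get_counts() {
    awk -F '\t' -v OFS='\t' -v want="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "target_id") id = i
                if ($i == want) col = i
            }
            next
        }
        { print $id, $col }' "$1" | LC_ALL=C sort -k1,1
}

get_counts expt1.xprs eff_counts > expt1.counts
get_counts expt2.xprs eff_counts > expt2.counts

# join matches rows on the gene name, so the original row order does not matter.
printf 'gene\texpt1\texpt2\n' > count_table.tsv
LC_ALL=C join -t "$(printf '\t')" expt1.counts expt2.counts >> count_table.tsv
cat count_table.tsv
```

One thing to watch: effective counts are fractional, and DESeq expects whole numbers, so you will likely need to round them before loading the table into R.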

Now that you've got that table, you'll want to install R, and after that, DESeq ("vignette", which is the word R likes to use for "manual", is here). The differential expression pipeline is then the one I gave you guys in that PowerPoint on the last day (link).