ChIPseq Baby

From TaejoonLab
Jump to: navigation, search

Basics

Moving forward, when I want you to do commands, I will indent the command and precede it with “$”. When you cut and paste the command, don’t include that sign. It will look like this:

$ This_is_the_command

So you’d cut “This_is_the_command” and paste it into the terminal. If you see two lines in a row and only one of them has a $ in front of it that means the command is one long line, so paste the whole thing.

This is going to be a tutorial on how to get chipseq data in fastq format, map it to the genome, and load it into the browser for looking at. After that, we will call peaks. Finally, we will look at the sequence in the peaks and search for overrepresented motifs. Remember, because this is a google doc the single and double quotation marker (‘ “) may look funny/not work in your terminal and you may have to replace them. Sorry!

First, I have generated a second baby dataset for you. Log into the server from your terminal:

$ ssh yourname@your_server_IP_address

Make a directory called “chipseq” to keep things tidy and move into it:

$ mkdir chipseq
$ cd chipseq

Then, copy three files from our shared directory into your new chipseq directory. Look at them first:

$ ls /usr/local/share/xenopus/chipseq_material/

You should see

Baby_chipseq_genome.fa  Baby_H3K4me3_reads.fq  Baby_input_reads.fq

Now copy these files into your home directory. You can copy everything by using “*”:

$ cp /usr/local/share/xenopus/chipseq_material/* .

Make sure everything made it in there:

$ ls ~/chipseq/

Along with bowtie2 for the mapping, we'll be using HOMER for the peak-calling. There are several ChIP peak-callers, including HOMER, MACS, MACS2, SPP, and a few others. They all basically have the same job: at a particular point in the genome, is there a big pile-up of reads? Is that clump big enough for us to decide that it's a real peak? They all use slightly different strategies, but for the most part you'll get very similar results. Where the results are likely to be different are gonna be pissy little peaks that are probably more the result of 5 minutes of extra time in fixative rather than something biologically meaningful, so don't worry about it.

As we've discussed a little, when you type a tool into the command line the computer needs to know where that tool is. When you type in "ls", you can do it from every directory, because the computer already knows where to find that command to run it. Taejoon installed HOMER for us on the servers, but at moment the computer doesn't know where that is. Give it a try:

$ findPeaks

Nope, didn't work. That's okay: it's pretty easy to tell the computer where to find HOMER. You can put it in the place where all the locations of tools are, called the PATH. You can find a little discussion on this here, but for now, run the following command:

$ PATH=$PATH:/usr/local/src/homer/bin/

Test them to make sure that PATH trick worked:

$ findPeaks

If a bunch of instructions about findPeaks show up, you did it right. We can talk about making this a permanent fix later (it involves editing a file called ".bashrc"), but for now this should work just fine.

Mapping

Now it's time to build the index. “Baby_chipseq_genome.fa” is a fake genome made from two real X. laevis genomic scaffolds. Go ahead, have a look at it:

$ more Baby_chipseq_genome.fa

Great, it's a fasta. Let's build that index now:

$ bowtie2-build Baby_chipseq_genome.fa Baby_chipseq_index

A bunch of stuff should happen, it should take a few moments.

Now it’s time to align the reads.

$ bowtie2 -x Baby_chipseq_index -U Baby_H3K4me3_reads.fq -S Baby_H3K4me3_reads.sam
$ bowtie2 -x Baby_chipseq_index -U Baby_input_reads.fq -S Baby_input_reads.sam

Consider having a quick look at these guys to make sure there are alignments hiding in the file:

$ head *sam

Okay that is pretty hard to read because the lines are so long. Let’s look at shorter ones:

$ head -n 5 *sam

SAMtools

sam files are pretty big. They not only include the unique name/address of each read, its sequence, its quality scores, plus all the information about where it maps to, the quality of the mapping, etc. It’s smart to shrink them into bam files. Moreover, to put them on a browser to look at them, you have to sort them. This is to make the browser put them in their right place easier.

$ samtools view -b -h -S Baby_input_reads.sam -o Baby_input_reads.bam
$ samtools view -b -h -S Baby_H3K4me3_reads.sam -o Baby_H3K4me3_reads.bam

These files are in binary now, so if we use “head” to look at them it will be an uninterpretable mess.

Then we need to sort them:

$ samtools sort Baby_input_reads.bam > Baby_input_reads_sorted.bam
$ samtools sort Baby_H3K4me3_reads.bam > Baby_H3K4me3_reads_sorted.bam

And now we need to index them:

$ samtools index Baby_input_reads_sorted.bam
$ samtools index Baby_H3K4me3_reads_sorted.bam

This will produce two files named

Baby_input_reads_sorted.bam.bai
Baby_H3K4me3_reads_sorted.bam.bai

IGV

Now it’s time for a little fun. We are going to take the results from your alignments and view them on a genome browser on your own computer. The browser is going to need those results, it’s going to need the genome, and it is also going to need a file that identifies which genes are where in the genome.

First, copy the bam files and the indexed files (.bai) to your own home computer:

$ scp yourname@10.120.5.246:~/chipseq/*bam* .

Note that the two stars means you’ll copy everything that has a “bam” in it: this includes files that end in “bam” and also “bam.bai”. If you have some other file in that directory with a name that includes those letters - “Alabama.txt”, say - that too will get copied over. IGV actually will only take sorted bams, so if you try to load the unsorted one it will get mad at you.

You will have to copy a few other things next: you will need the entire genome for this exercise, the one that you indexed, which is the same one you aligned the chipseq reads to (Baby_chipseq_genome.fa).

$ scp yourname@10.120.5.246:~/chipseq/*fa .

I picked “*fa” because I’m lazy and didn’t want to bother typing the whole filename. You’ll start doing this too.

Finally, you’ll need an annotation file. There are several formats of these: .psl, .gff3, or .gtf. Many genomes that have had a lot of annotations done have these already. I made one by aligning Taejoon’s gene models/transriptome to the v7.1 genome build with BLAT (we’ll go over BLAT tomorrow I think) and changed the format to be something a little more universal, .gtf. Be forwarned that my homemade .gtf can make the browser mad, so if things screw up in a few steps we can do a workaround. First, though, get that file.

$ scp yourname@10.120.5.246:/usr/local/share/xenopus/Baby_chipseq.gtf .

You’re now ready to play with a browser! Fire up the internet and go to https://www.broadinstitute.org/software/igv/download

It will probably ask you to register with an email address. If you don’t want the Broad to brag about your using their software, you can use mine, iquigley@salk.edu. For simplicity’s sake, you can use the Java webstart buttons to make it go. If you are feeling advenurous you can also download the package itself and run it from the command line; this gives you a little more control about how much memory you want to allocate to the program, etc. If you’re uploading tons of data (each experiment is called a “track”), you may want to do this.

A word of warning: IGV is Java-based, and Macs in particular get real pissy about Java. If you download the webstart script from IGV (igv_mm.jnlp) and click on it, Mac security may try to stop you. Here are their instructions to getting around that, although the App will actually be in the Downloads folder.

In the Finder, locate the app you want to open. Most apps can be found in the Applications folder. Press the Control-key and click the app icon. Choose Open from the pop-up menu. Click Open.

Once IGV is open, you will first need to load a genome. Click the “genomes” tab and “load file from genome”. Find Baby_chipseq_genome.fa and load that.

Next, you’ll start loading files. Load the .gtf file first: click on the “file” tab and select “load from file”. Select the .gtf. My homemade .gtf is clunky and for a minute you might think IGV is choking on it. It seems to work after that, though.

Now, click on the “file” tab, “load from file”, and then pick the two .bam files.

I gave you guys two-scaffold genome; one scaffold is tiny and sucks and one is pretty big. Go to the search box (the empty one next to the “go” button at the top of the browser) and type in “gata2”. You can then select that gene and the browser will take you there. Use the + and - buttons on the right to zoom in and out; if you click on the browser itself you can drag it to the left or right. Go play!

HOMER

Okay, time to call peaks. Peak-calling is how you put a defined position on the peaks for later interrogation.

There are several software packages for calling peaks: MACS, MACS2, SPP, HOMER, and more. I have used MACS (it’s okay), MACS2 (you need to basically rewrite parts of it to get it to work with genomes of many scaffolds like frog), SPP (in R, minor pain in the ass) and HOMER (very easy, we’re gonna use that one). HOMER download is here but installation is mildly complex, so we’ve installed it on the server for you. In the future, install it yourself from http://homer.salk.edu/homer/introduction/install.html

To use HOMER for peak-finding, go back to your directory on the server:

$ ssh yourname@10.120.5.246

To find peaks, a peak-caller has to get an idea of how many tags (reads) end up at a particular position in the genome, so it has to count them. The starting process for this in HOMER is to make a “tag directory” of each experiment. Go ahead and just run it empty to see the syntax and the options.

$ makeTagDirectory

Look those over. Yeah, it’s complicated! Don’t worry about it. For the first try, just run it on defaults like so:

$ makeTagDirectory Baby_input_tags Baby_input_reads_sorted.bam
$ makeTagDirectory Baby_H3K4me3_tags Baby_H3K4me3_reads_sorted.bam

Now that the tag directories are made, you are in a position to use them to call peaks.

Peak-calling can be tricky business. Each peak-caller incorporates assumptions about how high a clump of reads needs to be to be considered a peak, some of them also incorporate how steep the edges of the clump are (HOMER does). Finally, it’s smart to have a grasp of what the background looks like. Maybe some regions are repetitive and get lots of multi-aligned reads and you might be fooled into thinking a peak is there when it’s not. To control for this and other problems, having a good background is nice. Here, I have given you reads from immunoprecipitation input prior to the IP; it’s better to use IgG pulldowns. For tagged transcription factors, it’s even better to inject and ChIPseq the tag alone. Fortunately, most of these seem to all look pretty close, i.e., pretty flat. Let’s call some peaks. I am going to suggest some options. One is the style of peak: if you are looking for transcription factor peaks, you want a pretty narrow peak, because they stick to a single motif (although we often find multiple copies of the TF bound right next to each other). On the other hand, if you’re looking at histone modifications (like we are right now), those peaks are broader. Moreover, histones tend to move away from where TFs bind, leaving a big dip in the middle (go ahead and look at gata2 on the browser again). Finally, some folks (e.g. Rick Young) are excited about really big/active enhancers, which they call “superenhancers”. These will look a little different from regular histone mods and definitely different from TF peaks. All of these differences can inform how we want to call the peaks from our experiment. So let’s call some histone peaks:

$ findPeaks Baby_H3K4me3_tags -style histone -o auto -i Baby_input_tags

It will chug. When it’s done, you can go to the tag directory and have a quick look at the peakfile:

$ cd Baby_H3K4me3_tags 
$ more regions.txt

There are too many formats for genomic data. HOMER even made its own (which is actually one of the better ones). However, most other packages don’t like it, so you have to convert it to the something else. One popular option is bed format, so we are going to convert the peakfile (“regions.txt”) to bed next. We’ll do this with a little tool included in HOMER. HOMER actually has tons and tons of little neat tools like this.

$ pos2bed.pl regions.txt > Baby_H3K4me3_regions.bed

Now let’s see what kind of job it did. Log out of the server and go back to the terminal window on your own computer. Use scp to get the bedfile onto your computer:

$ scp yourname@10.120.5.246:~/chipseq/Baby_H3K4me3_tags/Baby_H3K4me3_regions.bed .

And now that it’s here, go back to the IGV browser and load the bedfile into the browser with “load from file”.

Presto! You now have peaks in the browser and called peaks below. Cruise around on the browser and see what sorts of features HOMER thinks are actual H3K4me3 peaks. Because you selected the “-style histone” option, if there is a detectable dip in the peak (like where TFs might bind) the whole peak will be centered on that dip.

Next session, we will manipulate the peaks: we’ll get a bedfile that is only overlapping promoters, and we’ll get a different one that is everything NOT in a promoter. We will also to sequence motif enrichment on peaks and subsets of peaks.