ChiPseq Motif

From TaejoonLab
Jump to: navigation, search

You took fastq reads from an H3K4me3 experiment, made a genomic bowtie2 index from a fasta file, aligned the reads, converted the aligned sam files to bam files, sorted them, indexed them, and put them on IGV. You also called H3K4me3 peaks over background with findPeaks, converted them into bed files with pos2bed.pl. Nice! You guys did such a great job yesterday that today I am going to throw some typos into the commands. Just like real life! When they don’t work, you will have to fuss with them and figure out what went wrong. It may be that the target directory isn’t specified, or that I used the wrong option. Fortunately, if you run each command by itself, you will get that short instructional thing about how to use it. Should it not work, try running the command alone and looking at the syntax. You’ll figure it out.

First, let’s normalize the bam file that you’re looking at on the browser to a standard library size. We’re doing this because if you run different experiments and sequence to different read depths, the pileup of reads could be dramatically higher when they are not actually enriched any differently. There is a standardized file format for genome browsers called “bedGraph”. HOMER has a nice tool to do just that. Log into the server first:

$ ssh yourname@10.120.5.246

Then make your way to your directory with the chipseq alignments.

$ cd ~/chipseq

Because you logged off from the server, you'll need to reset those PATHs, so type the following:

$ PATH=$PATH:/usr/local/src/homer/bin/
$ PATH=$PATH:/usr/local/src/weblogo/weblogo/

If you're feeling particularly savvy, you can edit the PATH file so that you never have to do this again. You'll need to go to your home directory and edit a file called .bashrc:

$ cd
$ nano .bashrc

Look at all that stuff! .bashrc sets your user environment, so it makes text colors and stuff like that. It also tells the computer where to look for executables. Scroll to the bottom of that file and then paste in the two lines (note that they're the same as above).

PATH=$PATH:/usr/local/src/homer/bin/
PATH=$PATH:/usr/local/src/weblogo/weblogo/

Once you've pasted them, exit nano, and save when it asks. I've removed the "$" at the front of each line since they're going into a text file and are not actually commands.

Congratulations! You've now added HOMER and seqlogo to your PATH permanently.

Great, back to the task at hand. We're going to normalize the library size of the ChIPseq data and make it a little prettier to look at on the browser by making a bedGraph file. To do so, use this simple command (make sure to run it from the directory that has the Baby tag directories in it):

$ makeUCSCfile Baby_H3K4me3_tags/ -o auto

Log out of the server, and copy it back to your own computer:

$ scp yourname@10.120.5.246:~/chipseq/Baby_H3K4me3_tags/Baby_H3K4me3_tags.ucsc.bedGraph.gz .

If you shut down IGV last night, fire it up again, and load the baby genome, the gtf file, the baby sorted bam files, and now also load the new bedGraph file you just made. Now, when you do other chipseqs, you can do this with all of them and be able to compare them directly. You’re probably noticing that the peaks aren’t as discrete - I think this has to do with how the script using a sliding-window approach to add up the reads - so maybe play with all the makeUCSCfile settings to see how the outputs look different.

You probably noticed that most, but not all, H3K4me3 peaks in the browser lay on top of the transcriptional start sites. Suppose you only want the ones that are definitively promoters, based on the gene models. Even on that scaffold there are a lot and you wouldn’t want to do that by hand, and forget about the rest of the genome, where there are about 22,000.

Fortunately, there is an easy script in HOMER to do this. Go to the directory with the H3K4me3 peakfile (after you called the peaks in HOMER it is probably called “regions.txt”).

The tool we are going to use is called getDistalPeaks. Run it first to have a look at the options:

$ getDistalPeaks.pl

You’ll note among other options is one called -prox. getDistalPeaks was originally designed to get peaks far away from a feature, but the developer put in this option so you can get close ones. You’ll note that there are other options to get sequence in and around various features (gene bodies, etc.)

HOMER was designed for relatively well-studied genomes. Mouse, human, etc. Frog is not one of those (nor axlotl). Fortunately, I was able to bully the developer into including some custom genome functionality. This requires the fasta of the genome sequence and the annotation file (.gtf). To make things exceptionally smooth, I generated a .gtf file that is solely for the two scaffolds in your genome reference/index. You probably used it last night when you fired up IGV for the first time, make sure you know where it is.

So now to get the subset of the peaks that are around Taejoon’s promoters, you will want to tell getDistalPeaks the appropriate parameters. I’ll assume you have all been paying attention and know where your peakfile from yesterday is and that you haven’t deleted anything in the past day, so go to that directory and run the following command:

$ getDistalPeaks.pl regions.txt ~/chipseq/Baby_chipseq_genome.fa -gtf ~/chipseq/Baby_chipseq.gtf -d 1000 -prox > H3K4me3_promoters.only.txt

This will work if the paths to the genome and the gtf file are correct. If it doesn't work, recheck those paths and put in the right ones! Once run, this operation should keep some, but not all, of the H3K4me3 peaks. Maybe you should run wc -l on it to find out.

Now that you’ve isolated only the promoter peaks, turn the peakfile into a bed file.

$ pos2bed.pl H3K4me3_promoters_only.txt -bed

And then download the bed file from the server onto your own computer. Next, load it into IGV along with the other stuff from yesterday: the original Baby_chipseq_genome.fa, the sorted bam files, the other bed files. Cruise around; you should find that this bed file only contains promoters actually straddling transcriptional start sites and not any others.

Now let’s do the reverse and only get H3K4me3 peaks that are far away from transcriptional start sites:

$ getDistalPeaks.pl regions.txt ~/chipseq/Baby_chipseq_genome.fa -gtf ~/chipseq/Baby_chipseq.gtf -d 1000  > H3K4me3_distal_only.txt

Convert that peakfile to a bedfile that IGV can interpret:

$ pos2bed.pl H3K4me3_distal_only.txt -bed

Get that from the server onto your own machine:

$ scp yourname@10.120.5.246:/tag_dir_with_H3K4me3_peak_files/H3K4me3_distal_only.bed .

Remember that the directory is something you named - it'll probably be "ChIPseq/H3K4me3_tags" or something like that. Load that into IGV, too. See how the features combined add up to the original peakfile.

Now you have 3 peakfiles: you have one that’s got all the peaks, one that has peaks close to transcriptional start sites (promoters) and another has peaks far away from them (probably enhancers). Unfortunately, you still have to scroll through the damn browser to get a list of genes that the peaks are in! HOMER comes with a nifty tool called annotatePeaks that you can use to figure out what sequence features (in this case, protein-coding genes) are nearest to the peaks, how far away they might be, etc. For downstream exercises, we'll need the whole genome and a gtf to match, so I put those things in the shared folder (/usr/local/share/xenopus/). For your convenience, I've also replaced the appropriate parts of the cut-and-paste commands below.

$ annotatePeaks.pl regions.txt /usr/local/share/xenopus/LAEVIS_7.1_1kb.fa -gtf /usr/local/share/xenopus/Mayball_v7b.gtf

It should spit out a bunch of stuff. Now use “>” to put it into a file:

$ annotatePeaks.pl regions.txt /usr/local/share/xenopus/LAEVIS_7.1_1kb.fa -gtf /usr/local/share/xenopus/Mayball_v7b.gtf > regions_annotated.txt

Copy this file from the server back to your computer, fire up excel and open the regions.annotated.txt file and look at what sort of annotation information you get.

Maybe you want to see what the general shape of the tag density and nucleotide frequencies look like across all promoters. Unfortunately, that small number of peaks (~200) is not gonna look like much. We’ve been using the “Baby” datasets so that the exercises wouldn’t be too computationally-intensive (for the mapping, etc), but for this one I'm going to give you not just the H3K4me3 peaks from that little piece of the genome but instead from the entire genome. Log back into the server and copy the following peakfile to your ~/chipseq directory:

$ cp /usr/local/share/xenopus/H3K4me3_regions_annotated.txt .

We're also going to want the whole genome this time, so you'll want to refer to it and a gtf that contains information for all genes, and not just those on that one lonely scaffold we were looking at earlier. So our instructions to HOMER will reflect that. Let's first get only the H3K4me3 peaks on promoters, like we did with the baby dataset above:

$ getDistalPeaks.pl H3K4me3_regions_annotated.txt  /usr/local/share/xenopus/LAEVIS_7.1_1kb.fa -gtf /usr/local/share/xenopus/Mayball_v7b.gtf -d 1000 -prox > all_H3K4me3_promoters.txt

Please again note that this is laevis genome version 7.1, along with Taejoon's Mayball gene models and my naming convention. The v9.1 genome is much better, but it isn't incorporated into HOMER yet. Fortunately, you can just get the fasta file for the genome, and a gtf from me (or the gff3 from Xenbase, which will serve the same function) and run HOMER as we were doing earlier. That's why we're using the genome and gtf in the share, and not the baby one we used earlier.

While we’re at it, let’s get all the peaks that aren’t near promoters, too:

$ getDistalPeaks.pl H3K4me3_regions_annotated.txt   /usr/local/share/xenopus/LAEVIS_7.1_1kb.fa -gtf /usr/local/share/xenopus/Mayball_v7b.gtf -d 1000 > all_H3K4me3_enhancers.txt

I bet the peaks around promoters are pretty good, and I bet the other ones won’t be as clean. (This is the same bet as betting that Taejoon did a good job when he made the Mayball models. If he did a crappy job, his promoters won't overlap with our measured H3K4me3.) To figure this out, let’s look at a histogram of the reads. HOMER has a function to do this, it’s called -hist. You need to tell it what bin size to count the reads in (10 should be fine, option is -hist 10). You also have to tell it what tag directory to use to count, so you’ll need to use the H3K4me3 tag directory, too (option is -d /path/to/tag/dir). Just for kicks, throw in the nucleotide frequencies (option is -nuc). We also want to have a large enough window around the TSS to figure out what’s going on, so we’ll use +/-2500 bases. Finally, to count the reads ("tags") in these various positions, HOMER is gonna need a tag directory. I've made one for the whole H3K4me3 experiment, and I made it the exact same way you guys made tag directories with the Baby dataset earlier. Command to get those histograms:

$ annotatePeaks.pl all_H3K4me3_promoters.txt /usr/local/share/xenopus/LAEVIS_7.1_1kb.fa -gtf /usr/local/share/xenopus/Mayball_v7b.gtf -hist 10 -d /usr/local/share/xenopus/H3K4me3_v7.1_tags -nuc -size 5000  > H3K4me3_promoters_hist.txt

And let’s go ahead and do the same on the H3K4me3 peaks that aren’t found in promoters:

$ annotatePeaks.pl all_H3K4me3_enhancers.txt /usr/local/share/xenopus/LAEVIS_7.1_1kb.fa -gtf /usr/local/share/xenopus/Mayball_v7b.gtf -hist 10 -d /usr/local/share/xenopus/H3K4me3_v7.1_tags -nuc -size 5000 > H3K4me3_enhancers_hist.txt

Now let’s log out of the server and copy those files to the laptop in front of you:

$ scp yourname@your_server_ip_address:~/chipseq/*hist.txt .

Open those files with excel and start making scatterplots. First use the first two columns ("distance from center" vs. "coverage"), which will be the position and then the tag density: make a scatterplot.

Next, use the first column and last four columns, which will be the nucleotide frequencies.

Finally, copy one of the coverage histogram plots and paste it on top of the other. Then you’ll see why everybody says H3K4me3 mostly hangs around promoters and not elsewhere.


Next, let's have a look at motifs. HOMER's primary function is to look for motif enrichment in sequences that you give it. How do we get those sequences? We use a peakfile (or a bed file) and use the positions to extract sequence from the genome. HOMER will do all of this for us, which is great.

Go back to the server. For this exercise, you get to choose if you want to look for motifs in all the promoters, all the enhancers, or maybe just the few peaks from the baby dataset. Up to you!

$ findMotifsGenome.pl peak_file_you_choose /usr/local/share/xenopus/LAEVIS_7.1_1kb.fa output_dir_name_you_choose

Let her rip! It will take a few minutes, and after that we'll discuss how to get it back to your machine to look at it.