Halobacterium salinarum MeDiChI demo
MeDiChI is an algorithm for inferring protein-DNA binding site from ChIP-chip data. This demo will walk a new user through the process of moving data into the R statistical environment, finding model-fit and peaks, and visualizing the results using the Gaggle genome browser.
In our example, we reproduce a finding from the paper Prevalence of transcription promoters within archaeal operons and coding sequences (Koide et al, 2009) that the succinate dehydrogenase complex (sdhCDBA) exhibits conditionally altered transcript levels of constituent genes.
Set up prerequisites
Warning: For the time being, an update to the Gaggle R Goose will have to be installed manually. Linux and Mac OS X users can download gaggle_1.12.0.tar.gz and type R CMD INSTALL gaggle_1.12.0.tar.gz. Windows users can download gaggle_1.12.0.zip and select Packages|Install package(s) from local zip files... from the menu in the R GUI. Windows users will also have to install rJava, RUnit, and graph from CRAN. (install.packages(c("rJava","RUnit","graph")))
The genome.browser.support.R script requires that a few libraries be installed on your system, namely
MeDiChI, RSQLite, and Gaggle R Goose. The script will attempt to automatically install RSQLite and gaggle. MeDiChI must be manually installed.
- Install MeDiChI and its dependencies (see the MeDiChI page for up-to-date instructions)
- RSQLite should be automatically installed by genome.browser.support.R. For more information see RSQLite.
- The Gaggle R Goose should be automatically installed by genome.browser.support.R. More informations is here: R Goose.
Find and visualize protein-DNA binding sites
- Run the genome.browser.support.R script. In R, type:
- Connect R to the Gaggle. A Gaggle Boss should start automatically. If not, launch a Gaggle Boss and try gaggleInit() again.
Launch the genome browser with ChIP-chip and expression data for Halobacterium salinarum NRC-1. The genome browser should register with the Boss and be visible in automatically download a data file to your machine.
The genes are shown in the center, forward strand in yellow and reverse in orange. The reference transcript signal is plotted in blue with segmentation overlaid in red. Two time points, at optical density 0.4 and 1.3, are plotted as log ratios relative to the reference. The tracks plotted in green are ChIP-chip for TFBd at two different resolutions.
- Broadcast a dataset description from the genome browser to R.
In the genome browser, select the Gaggle|Show Gaggle Toolbar menu. Verify that the connection status indicator is green. Select Description of dataset: Halobacterium tiling array from the Gaggle Data menu and select R as the target. Click the broadcast button (on the far right) to broadcast a description of the currently open dataset to R.
- To receive the broadcast, go to R and type:
ds = getDatasetDescription()
- Get a list of track names contained in the data set:
- We'll look for sites where the transcription factor TFBd binds with the genome by analyzing the 500 base-pair resolution ChIP-chip track. Pull the track data into the R environment:
track = getTrackDataAsMatrix(ds, 'ChIP-chip TFBd 500bp')
- Inspect the shape and size of the track data:
- In the genome browser, search for sdh. Using the crosshairs mouse tool, you can confirm that the central point of this gene cluster is near position 978,000 on the main chromosome. We'll look at a 20,000 base pair window around this coordinate at a resolution of 10 b.p..
- Run the MeDiChI deconvolution algorithm:
fit <- chip.deconv(track, where="chromosome",
fit.res=10, center=978000, wind=20000, max.steps=100, n.boot=10,
Here, we're using MeDiChI's chip.deconv function to fit a model to the data in our track.
There are two kernels loaded by the script: kernel.halo.hires for the high-resolution Nimblegen microarray platform, and kernel.halo.lowres for the 500bp tiling arrays. We'll use the lower resolution kernel in the following steps. For other array platforms, kernels can be created using MeDiChI's generate.binding.profile function.
The remaining parameters control the algorithm used to search for the best fitting arrangement of peaks. More help on the MeDiChI is available in the R help system by typing help("MeDiChI") or help("chip.deconv"), the MeDiChI website or the MeDiChI paper.
- Broadcast the curve-fit (or profile) generated by MeDiChI as a Gaggle Data Matrix:
broadcastProfile(fit, kernel.halo.lowres, name="MeDiChI profile 500bp")
The genome browser will pop up a dialog as shown here. Click OK.
You'll probably want to adjust the visual properties of the new track. Right-click on the track and select Track visual properties | MeDiChI profile 500bp. Adjust the color to your liking. In the Overlay menu, select 500bp to overlay the new track on the existing ChIP-chip track.
- Broadcast profile to the Genome Browser with the name medichi_curve_fit.
broadcastPeaks(fit, name="MeDiChI peaks 500bp")
- Now, open the visual properties dialog again. Select 500bp in the Overlay menu. Change the renderer to Vertical Bar. Make the color something bold and increase the opacity. If the alignment looks off, make sure the Range of Values has a minimum of 0.0. You might also want to adjust the vertical positioning of the tracks.
Results should look something like this: