These instructions go along with a screencast demonstrating the Gaggle Genome Browser. In this example, we'll get the E. coli genome from the UCSC microbial genome browser and some tiling array data from NCBI's GEO database. We'll manipulate data in R and import the data into GGB, and finally do a little exploration.
Navigate to http://gaggle.systemsbiology.net/docs/geese/genomebrowser/ and click on the ‘Launch’ button in the lower left. This starts GGB using Java Web Start. It will also ensure that a Gaggle Boss is running, starting one if necessary.
This step should happen automatically. If not, start a Gaggle Boss. Selectfrom the GGB menu. The indicator light on the toolbar's left side should be green to indicate that we're connected. If it's red, click it. GGB will try to connect.
In the GGB menu, select. Type ‘coli’ and Select and click ‘OK’. For more detail on this step see getting started.
For this example, we'll acquire data from NCBI's GEO database. The Bernard Palsson lab at UCSD ran tiling array experiments on E. coli in series GSE15534. We'll use the first log-phase sample GSM389294.
The table shown on the page has two columns, probe ID and value, where value is the log2 signal intensity of each probe, which is exactly what we need for plotting in GGB. A Nimblegen pair file is also available, and we could use that instead. But since they've gone to the trouble of extracting the useful columns and converting intensities to log2, let's use the table.
At the bottom of the page, click the link marked ‘View full table...’. Oddly, it seems that we have to copy and paste, which is slightly inconvenient. Copy the table from the web page and paste it into a text editor that can handle 371,034 lines. Save the file as ‘GSM389294’. Alternatively, use the browser's save feature to save the page as HTML, then use the UNIX grep command to extract the table from the surrounding markup:
grep ECOL GSM389294.html > GSM389294.txt
Either way, we should now have a tab-delimited text file with columns for probe ID and log2 intensity. For consistency with the rest of the example, the first line of the file should contain column headers, ‘ID_REF’ and ‘VALUE’.
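If the table was extracted with grep, the header row will be missing, because it does not contain the string ECOL. The following is a minimal sketch of adding the column names in R. It uses a small temporary file with three rows in place of the real download, and the variable names (tmp, raw, out) are illustrative only.

```r
# Stand-in for the grep-extracted file: two tab-delimited columns, no header.
# Only three rows are used here for illustration, not the full data set.
tmp <- tempfile(fileext = ".txt")
writeLines(c("ECOL00P000000000\t7.44",
             "ECOL00P000000025\t8.36",
             "ECOL00P000000050\t7.94"), tmp)

# Read without a header and supply the expected column names ourselves.
raw <- read.table(tmp, sep = "\t", header = FALSE, quote = "",
                  col.names = c("ID_REF", "VALUE"),
                  stringsAsFactors = FALSE)

# Write the table back out with a header line, tab-delimited as before.
out <- tempfile(fileext = ".txt")
write.table(raw, out, sep = "\t", quote = FALSE, row.names = FALSE)

# Inspect the first line to confirm the header is in place.
readLines(out, n = 1)
```

With the real data, tmp would be the path of the extracted file and out the path of the file to read in the next step.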
We need to map the intensity values to the genome via their probe IDs, which is the purpose of the platform file GPL8387. Happily, the button below the table says ‘Download full table...’, which saves us a bit of bother. Download that and put it in the same folder as the sample file. Now that we have the data, we need to get it into the right form.
The R project, a statistical computing environment, has powerful facilities for manipulating and reshaping data. We've also built a library that lets you feed data directly from R into GGB. Go to the directory containing your data files and start the R environment.
probes <- read.table("GPL8387-2180.txt", sep="\t", header=T, quote=NULL)
head(probes)
                ID RANGE_STRAND RANGE_START RANGE_END  RANGE_GB ORF
1 ECOL00P000000000            +           1        50 NC_000913
2 ECOL00P000000025            +          26        75 NC_000913
3 ECOL00P000000050            +          51       100 NC_000913
4 ECOL00P000000075            +          76       125 NC_000913
5 ECOL00P000000100            +         101       150 NC_000913
6 ECOL00P000000125            +         126       175 NC_000913
                                            SEQUENCE
1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA
2 CAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
3 AAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
4 AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACT
5 TAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA
6 AAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAG
dim(probes)
[1] 371034      7
sample <- read.table("GSM389294", sep="\t", header=T, quote=NULL)
dim(sample)
[1] 371034      2
head(sample)
            ID_REF VALUE
1 ECOL00P000000000  7.44
2 ECOL00P000000025  8.36
3 ECOL00P000000050  7.94
4 ECOL00P000000075  7.53
5 ECOL00P000000100  6.75
6 ECOL00P000000125  7.89
In R, the merge function will join two tables on a common key.
merged <- merge(probes, sample, by.x="ID", by.y="ID_REF")
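By default, merge performs an inner join, so probes that have no matching row in the sample are silently dropped. The sketch below illustrates this on two tiny made-up tables and shows one way to list unmatched probes. The toy_ names are illustrative and are chosen so that they do not overwrite the real tables.

```r
# Toy stand-ins for the platform and sample tables (only a few rows).
toy_probes <- data.frame(
  ID = c("ECOL00P000000000", "ECOL00P000000025", "ECOL00P000000050"),
  RANGE_STRAND = "+",
  RANGE_START = c(1, 26, 51),
  RANGE_END = c(50, 75, 100),
  stringsAsFactors = FALSE)
toy_sample <- data.frame(
  ID_REF = c("ECOL00P000000025", "ECOL00P000000000"),
  VALUE = c(8.36, 7.44),
  stringsAsFactors = FALSE)

# Inner join on the probe ID: only probes present in both tables are kept.
toy_merged <- merge(toy_probes, toy_sample, by.x = "ID", by.y = "ID_REF")

# Probes on the platform with no measurement in the sample.
unmatched <- setdiff(toy_probes$ID, toy_sample$ID_REF)

nrow(toy_merged)
unmatched
```

In the real data both tables have 371,034 rows, so comparing nrow(merged) with nrow(probes) is a quick check that nothing was lost in the join.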
Next, we extract the columns we need, creating a data.frame with columns for sequence name, strand, start, end, and value (which will hold the log2 intensities).
track <- data.frame(sequence="chr", strand=merged$RANGE_STRAND, start=merged$RANGE_START, end=merged$RANGE_END, value=merged$VALUE)
dim(track)
[1] 371034      5
head(track)
  sequence strand start end value
1      chr      +     1  50  7.44
2      chr      +    26  75  8.36
3      chr      +    51 100  7.94
4      chr      +    76 125  7.53
5      chr      +   101 150  6.75
6      chr      +   126 175  7.89
For viewing in GGB, most data will need to be cast into this form. Now that the data munging is done, the next step is to get the data into the genome browser.
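Before sending a table to the browser, it can be worth checking that it really has this shape. The function below is a hypothetical helper written for this example. It is not part of GGB or the Gaggle R goose, and the rules it enforces (strand is + or -, start is not greater than end) are assumptions based on the track built above.

```r
# Hypothetical helper: checks that a data.frame has the five columns used in
# this example and that their contents look plausible.
check_track <- function(track) {
  required <- c("sequence", "strand", "start", "end", "value")
  missing_cols <- setdiff(required, names(track))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Assumed rules: strand is "+" or "-", coordinates and values are numeric,
  # and each feature starts at or before its end.
  stopifnot(all(track$strand %in% c("+", "-")),
            is.numeric(track$start),
            is.numeric(track$end),
            all(track$start <= track$end),
            is.numeric(track$value))
  invisible(TRUE)
}

# A two-row example in the same form as the track built above.
example_track <- data.frame(sequence = "chr", strand = c("+", "+"),
                            start = c(1, 26), end = c(50, 75),
                            value = c(7.44, 8.36))
check_track(example_track)
```

The call stops with an error if a column is missing or a rule is violated, and otherwise returns TRUE invisibly.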
Warning: For the time being, an update to the Gaggle R Goose will have to be installed before continuing.
The script genome.browser.support.R contains several functions which help GGB interoperate with R. Load the script with the following command (in R). The script will load several packages that it depends on. If it complains that MeDiChI is missing, that's OK. We won't be using MeDiChI in this example.
First we need R to connect to the Gaggle using the R goose:
Next we need to get some information from GGB to the R side. In the GGB's Gaggle toolbar, selectin the Gaggle Data chooser. Select as the target and broadcast.
Back in R, we need to receive the broadcast like so:
ds <- getDatasetDescription()
This gives R enough information to interact with the SQLite database where the genome browser stores its data. Now, we'll broadcast the track data to the genome browser.
addTrack(ds, track, name="GSM389294")
GGB responds by popping up a dialog. Click ‘OK’, and the new track should appear in the GGB after a pause to digest the data. Now comes the fun part. Let's explore the data.
Scrolling through the dataset, several interesting features pop out. Near position 253,400 there's a short signal in the forward strand between genes b0235 (yafp) and b0234 (ykfj). Zooming in, we can see the individual probes. Seeing these might lead us to investigate whether this is a non-coding RNA, particularly if its expression changes in conjunction with some biological change.
In the region [278333, 290612] we can see a transposon bracketed by insertion elements. Interestingly, we see signal in both strands around the insertion sequences, which might be an artifact of palindromic sequence.
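The same kind of region can also be examined numerically in R. The following sketch pulls out the probes that fall within a coordinate range and compares the mean signal on each strand. The data frame is a small made-up example in the five-column form used above, and probes_in_region is a hypothetical helper, not a GGB function.

```r
# Small made-up track in the five-column form (values are illustrative).
demo_track <- data.frame(
  sequence = "chr",
  strand = c("+", "-", "+", "+", "-"),
  start  = c(278301, 278351, 280001, 290551, 290626),
  end    = c(278350, 278400, 280050, 290600, 290675),
  value  = c(7.1, 9.3, 8.8, 9.0, 6.9),
  stringsAsFactors = FALSE)

# Hypothetical helper: keep probes lying entirely within [from, to].
probes_in_region <- function(track, from, to) {
  track[track$start >= from & track$end <= to, ]
}

region <- probes_in_region(demo_track, 278333, 290612)

# Mean log2 intensity per strand within the region.
strand_means <- tapply(region$value, region$strand, mean)
strand_means
```

Applied to the real track, the same two calls would summarize the signal on each strand across the transposon region.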
Visualizing a single data series is great, but to see dynamics, we'll need transcription data from several conditions. And really interesting results often come from mashing up different kinds of data. The Palsson lab dataset includes 14 samples under four distinct conditions, which we could visualize together. A great next step might be to find transcription factor binding sites, maybe from ChIP-chip. Or we could try integrating quantitative proteomics data. Any data that can be related to coordinates on the genome can be visualized with the Gaggle Genome Browser.