Gaggle genome browser E. coli demo

Step-by-step instructions

These instructions go along with a screencast demonstrating the Gaggle Genome Browser. In this example, we'll get the E. coli genome from the UCSC microbial genome browser and some tiling array data from NCBI's GEO database. We'll manipulate data in R and import the data into GGB, and finally do a little exploration.

  1. Start a new project with the E. coli genome

    Start genome browser

    Navigate to http://gaggle.systemsbiology.net/docs/geese/genomebrowser/ and click on the ‘Launch’ button in the lower left. This starts GGB using Java Web Start. It will also ensure that a Gaggle Boss is running, starting one if necessary.

    Launching GGB from the web page

    Connect to Gaggle

    This step should happen automatically. If not, start a Gaggle Boss. Select View|Toolbars|Show Gaggle Toolbar from the GGB menu. The indicator light on the toolbar's left side should be green to indicate that we're connected. If it's red, click it. GGB will try to connect.

    Gaggle Genome Browser - Gaggle toolbar

    New Project

    In the GGB menu, select File|New Dataset. Type ‘coli’, select Escherichia coli K12 MG1655, and click ‘OK’. For more detail on this step, see getting started.

  2. Download expression data from GEO

    For this example, we'll acquire data from NCBI's GEO database. The Bernard Palsson lab at UCSD ran tiling array experiments on E. coli in series GSE15534. We'll use the first log-phase sample GSM389294.

    Download sample GSM389294.

    NCBI GEO

    The table shown on the page has two columns, probe ID and VALUE, where VALUE is the log2 signal intensity of each probe. That is exactly what we need for plotting in GGB. A Nimblegen pair file is also available, and we could use that instead. But since they've gone to the trouble of extracting the useful columns and converting intensities to log2, let's use the table.

    At the bottom of the page, click the link marked ‘View full table...’. The page offers no download link here, so we have to copy and paste, which is slightly inconvenient. Copy the table from the web page and paste it into a text editor that can handle 371,034 lines. Save the file as ‘GSM389294’, the name used in the R code below. Alternatively, use the browser's Save Page As feature and use the UNIX grep command to extract the table from the surrounding markup:

    grep ECOL GSM389294.html > GSM389294

    Either way, we should now have a tab-delimited text file with columns for probe ID and log2 intensity. For consistency with the rest of the example, the first line of the file should contain the column headers ‘ID_REF’ and ‘VALUE’. Note that grep keeps only the probe rows, so add the header line by hand if it is missing.
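
    For readers without grep at hand, the same extraction can be sketched in R. This is a minimal sketch, not part of the original workflow: the helper name extract_geo_table and the temporary demo files are assumptions made for illustration, and the demo page fragment is made up. It keeps lines containing ‘ECOL’, as the grep command does, and writes the header line explicitly.

```r
# Hypothetical helper: pull probe rows out of a saved GEO page and
# write a tab-delimited file that starts with the ID_REF/VALUE header.
extract_geo_table <- function(infile, outfile, pattern = "ECOL") {
  lines <- readLines(infile, warn = FALSE)
  # Keep only lines that contain a probe ID such as ECOL00P000000000
  rows <- grep(pattern, lines, value = TRUE)
  writeLines(c("ID_REF\tVALUE", rows), outfile)
  invisible(length(rows))
}

# Self-contained demonstration on a made-up page fragment
demo_in <- tempfile(fileext = ".html")
demo_out <- tempfile()
writeLines(c("<html><pre>",
             "ID_REF\tVALUE",
             "ECOL00P000000000\t7.44",
             "ECOL00P000000025\t8.36",
             "</pre></html>"), demo_in)
n_rows <- extract_geo_table(demo_in, demo_out)
demo <- read.table(demo_out, sep = "\t", header = TRUE, quote = "")
```

    With the real saved page, the call would look like extract_geo_table("GSM389294.html", "GSM389294"), assuming the page was saved under that name.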

    Download platform GPL8387.

    We need to map the intensity values to the genome via their probe IDs, which is the purpose of the platform file GPL8387. Happily, the button below the table says, “Download full table...”, which saves us a bit of bother. Download that and put it in the same folder as the sample file. Now that we have the data, we need to get it into the right form.

  3. Data manipulation in R

    The R project, a statistical computing environment, has powerful facilities for manipulating and reshaping data. We've also built a library that lets you feed data directly from R into GGB. Go to the directory containing your data files and start the R environment.

    Read the platform file

    # Read the GPL8387 platform table (tab-delimited, quoting disabled)
    probes <- read.table("GPL8387-2180.txt", sep="\t", header=TRUE, quote="")
    head(probes)
                    ID RANGE_STRAND RANGE_START RANGE_END  RANGE_GB ORF
    1 ECOL00P000000000            +           1        50 NC_000913    
    2 ECOL00P000000025            +          26        75 NC_000913    
    3 ECOL00P000000050            +          51       100 NC_000913    
    4 ECOL00P000000075            +          76       125 NC_000913    
    5 ECOL00P000000100            +         101       150 NC_000913    
    6 ECOL00P000000125            +         126       175 NC_000913    
                                                SEQUENCE
    1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA
    2 CAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
    3 AAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
    4 AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACT
    5 TAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA
    6 AAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAG
    dim(probes)
    [1] 371034      7
    

    Read the sample file

    # Read the sample table saved earlier (probe ID and log2 intensity)
    sample <- read.table("GSM389294", sep="\t", header=TRUE, quote="")
    dim(sample)
    [1] 371034      2
    head(sample)
                ID_REF VALUE
    1 ECOL00P000000000  7.44
    2 ECOL00P000000025  8.36
    3 ECOL00P000000050  7.94
    4 ECOL00P000000075  7.53
    5 ECOL00P000000100  6.75
    6 ECOL00P000000125  7.89
    

    Join and reshape track data

    In R, the merge function will join two tables on a common key.

    merged <- merge(probes, sample, by.x="ID", by.y="ID_REF")
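
    By default, merge performs an inner join, so a probe ID that appears in only one table is silently dropped. In this example both tables have 371,034 rows and so does the result, which means nothing was lost. The sketch below uses made-up toy tables (only the column names are taken from the real files) to show the join behaviour and one way to list IDs that found no partner.

```r
# Toy stand-ins for the platform and sample tables (values are made up;
# the column names match the ones used in this example)
toy_probes <- data.frame(ID = c("P1", "P2", "P3"),
                         RANGE_STRAND = c("+", "+", "-"),
                         RANGE_START = c(1, 26, 51),
                         RANGE_END = c(50, 75, 100),
                         stringsAsFactors = FALSE)
toy_sample <- data.frame(ID_REF = c("P3", "P1", "P4"),
                         VALUE = c(7.9, 7.4, 6.1),
                         stringsAsFactors = FALSE)

# Inner join: only IDs present in both tables survive, and the
# result is sorted by the key column
toy_merged <- merge(toy_probes, toy_sample, by.x = "ID", by.y = "ID_REF")

# Probes without a measurement, and measurements without a probe
unmatched_probes <- setdiff(toy_probes$ID, toy_sample$ID_REF)
unmatched_values <- setdiff(toy_sample$ID_REF, toy_probes$ID)
```

    Here toy_merged keeps only P1 and P3, while P2 and P4 show up in the two unmatched vectors.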
    

    Next we need to extract the columns we need, creating a data.frame with columns for sequence name, strand, start, end, and value (which will hold the log2 intensities).

    track <- data.frame(sequence="chr", strand=merged$RANGE_STRAND,
     start=merged$RANGE_START, end=merged$RANGE_END, value=merged$VALUE)
    dim(track)
    [1] 371034      5
    head(track)
      sequence strand start end value
    1      chr      +     1  50  7.44
    2      chr      +    26  75  8.36
    3      chr      +    51 100  7.94
    4      chr      +    76 125  7.53
    5      chr      +   101 150  6.75
    6      chr      +   126 175  7.89
    

    For viewing in GGB, most data will need to be cast into this form. Now that the data munging is done, the next step is to get the data into the genome browser.
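
    Before broadcasting, it can be worth checking that a track really has this shape. The following is a minimal sketch under stated assumptions: check_track is a hypothetical helper, not part of the Gaggle R goose, and it assumes strands are encoded as ‘+’ and ‘-’ as in the platform file shown above. The toy track holds made-up rows.

```r
# Hypothetical helper (not part of the Gaggle R goose): basic sanity
# checks on a track data.frame; stops with an error if a check fails
check_track <- function(track) {
  required <- c("sequence", "strand", "start", "end", "value")
  stopifnot(all(required %in% names(track)))
  # Assumption: strands are encoded as "+" and "-"
  stopifnot(all(track$strand %in% c("+", "-")))
  stopifnot(all(track$start <= track$end))
  stopifnot(is.numeric(track$value), !any(is.na(track$value)))
  invisible(TRUE)
}

# Made-up rows in the same five-column form as the track above
toy_track <- data.frame(sequence = "chr",
                        strand = c("+", "+", "-"),
                        start = c(1, 26, 51),
                        end = c(50, 75, 100),
                        value = c(7.44, 8.36, 7.94))
track_ok <- check_track(toy_track)
```

    With the real data, check_track(track) would be called once after the data.frame is built.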

  4. Broadcast new track data to GGB

    Warning: For the time being, an update to the Gaggle R Goose will have to be installed before continuing.

    Connect R and the Gaggle Genome Browser through Gaggle

    The script genome.browser.support.R contains several functions which help GGB interoperate with R. Load the script with the following command (in R). The script will load several packages that it depends on. If it complains that MeDiChI is missing, that's OK. We won't be using MeDiChI in this example.

    source("http://gaggle.systemsbiology.net/R/genome.browser.support.R")

    First, we need R to connect to the Gaggle using the R goose:

    gaggleInit()

    Broadcast dataset description

    Next, we need to get some information from GGB to the R side. In GGB's Gaggle toolbar, select Description of dataset: Escherichia coli K12 MG1655 in the Gaggle Data chooser. Select R as the target and broadcast.

    Broadcast dataset description to R

    Back in R, we need to receive the broadcast like so:

    ds <- getDatasetDescription()

    This gives R enough information to interact with the SQLite database where the genome browser stores its data. Now, we'll broadcast the track data to the genome browser.

    Broadcast track data to GGB

    addTrack(ds, track, name="GSM389294")

    GGB responds by popping up a dialog. Click ‘OK’, and the new track should appear in the GGB after a pause to digest the data. Now comes the fun part. Let's explore the data.

    GGB Escherichia coli

  5. Explore

    Scrolling through the dataset, several interesting features pop out. Near position 253,400 there's a short signal in the forward strand between genes b0235 (yafp) and b0234 (ykfj). Zooming in, we can see the individual probes. Seeing these might lead us to investigate whether this is a non-coding RNA, particularly if its expression changes in conjunction with some biological change.
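
    The same region can also be inspected on the R side by subsetting the track data.frame. This is a minimal sketch: probes_in_region is a hypothetical helper, and the toy track contains made-up values placed around position 253,400 purely for illustration.

```r
# Hypothetical helper: rows of a track that overlap [from, to] on one strand
probes_in_region <- function(track, from, to, strand = "+") {
  track[track$strand == strand & track$end >= from & track$start <= to, ]
}

# Made-up values; coordinates chosen to sit around position 253,400
toy_track <- data.frame(sequence = "chr",
                        strand = c("+", "+", "-", "+"),
                        start = c(253301, 253376, 253376, 260001),
                        end = c(253350, 253425, 253425, 260050),
                        value = c(6.1, 11.2, 6.4, 7.0))
hits <- probes_in_region(toy_track, 253300, 253500)
```

    With the real data, probes_in_region(track, 253300, 253500) would list the forward-strand probes in that window along with their log2 intensities.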

    GGB Unannotated transcript example

    In the region [278333, 290612] we can see a transposon bracketed by insertion elements. Interestingly, we see signal in both strands around the insertion sequences, which might be an artifact of palindromic sequence.

    GGB Unannotated transcript example

    More fun with data

    Visualizing a single data series is great, but to see dynamics, we'll need transcription data from several conditions. And really interesting results often come from mashing up different kinds of data. The Palsson lab dataset includes 14 samples under four distinct conditions, which we could visualize together. A great next step might be to find transcription factor binding sites, maybe from ChIP-chip. Or we could try integrating quantitative proteomics data. Any data that can be related to coordinates on the genome can be visualized with the Gaggle Genome Browser.
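
    As a starting point for viewing several samples together, the merge-and-reshape step from section 3 could be wrapped in a function and applied to each sample table. This is only a sketch: make_track is a hypothetical helper, the toy tables and the names sampleA and sampleB are made-up placeholders, and the addTrack call is left commented out because it needs a live GGB session.

```r
# Hypothetical helper wrapping the merge-and-reshape step from section 3
make_track <- function(probes, sample) {
  merged <- merge(probes, sample, by.x = "ID", by.y = "ID_REF")
  data.frame(sequence = "chr", strand = merged$RANGE_STRAND,
             start = merged$RANGE_START, end = merged$RANGE_END,
             value = merged$VALUE)
}

# Toy platform table plus two made-up samples (placeholder names)
toy_probes <- data.frame(ID = c("P1", "P2"), RANGE_STRAND = c("+", "-"),
                         RANGE_START = c(1, 26), RANGE_END = c(50, 75))
toy_samples <- list(
  sampleA = data.frame(ID_REF = c("P1", "P2"), VALUE = c(7.4, 8.4)),
  sampleB = data.frame(ID_REF = c("P1", "P2"), VALUE = c(9.1, 6.0)))

# One track per sample, kept in a named list
tracks <- lapply(toy_samples, make_track, probes = toy_probes)

# With a live GGB session, each track could then be sent in turn:
# for (name in names(tracks)) addTrack(ds, tracks[[name]], name = name)
```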
