Day 4 (~ 4hr)

Data analysis with Gaggle – hypothesis formulation

Introduction and Overview

The class has generated a time series of microarray data for Halobacterium in response to copper stress. In order to quickly generate powerful biological insights and formulate hypotheses to drive future research, we will analyze the copper stress data using a software suite called Gaggle that was developed in-house at the ISB.

Within this software package, we will upload the dataset, normalize, filter by significance, and cluster the data to identify copper response genes. Then we will functionally classify the resultant gene list in the KEGG database and formulate hypotheses for our next experiment. For this analysis, we will be analyzing pre-loaded ISB data and comparing it to the data generated by the class.

Software Tools

  1. the Gaggle boss.
  2. the Data Matrix Viewer (DMV).
  3. Cytoscape: Halobacterium NRC-1 network.
  4. the R statistical environment.
  5. MeV.
  6. the Firegoose extension to Firefox.
  7. Gaggle Genome Browser: (Halo tiling reference) (Halo tiling as heatmap)
  8. GTC (Gaggle Tool Creator) multispecies Gaggle

The R environment should already be set up on course laptops. If not, you'll need to install the gaggle r-goose package plus an additional utility script.


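# Skip this step if you're using a pre-configured course laptop!
# Load the Bioconductor installer script, install the core packages,
# then install the gaggle package.
source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite('gaggle')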

To start up gaggle connectivity from R and load the utility script:

 
library(gaggle)
gaggleInit()
source("http://gaggle.systemsbiology.net/R/gaggleUtil.R")

The Gaggle Boss

The Gaggle Boss serves as a hub that routes data from one program to another.

[Screenshot: Gaggle Boss on Windows]

[Screenshot: Gaggle Boss on Mac OS X]

Most of the Gaggle programs are packaged as webstarts, meaning they can be started by clicking on a link in a web browser.

  1. Click here to start the boss.

Data Matrix Viewer (DMV)

The DMV provides a spreadsheet-like view of numeric matrices. We'll use it to work with microarray data.

  1. Start the DMV.
  2. Load the Copper time series data into the DMV:
    1. Find the ISB copper time series data in the experiment tree. (Experiments/environmental/metals/copper/isb)
    2. Click on the red “9” in the upper left-hand corner of the DMV. These data have been pre-normalized with respect to the starting conditions. (See the “extra credit” section for how we did this in R.)
    [Screenshot: Data Matrix Viewer]
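
The normalization procedure actually used for the course data is described in the “extra credit” material and is not reproduced here. As a rough illustration only, the sketch below shows one common way to express a time series relative to its starting condition: the log10 ratio of each time point to a reference column. The matrix values, the gene names, and the helper name log10_ratio_to_reference are all made up for this example.

```r
# Toy intensity matrix: rows are genes, columns are time points.
# All names and values here are invented for illustration.
intensity <- matrix(
  c(100, 200, 1000,
    500, 500,   50),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("t0", "t1", "t2"))
)

# Hypothetical helper (not part of Gaggle): log10 ratio of every column
# to a reference column. Dividing a matrix by a vector of length nrow(m)
# divides each row by that row's reference value.
log10_ratio_to_reference <- function(m, ref_col = 1) {
  log10(m / m[, ref_col])
}

ratios <- log10_ratio_to_reference(intensity)
print(ratios)
```

With this convention the reference column is zero for every gene, positive values indicate induction relative to the start, and negative values indicate repression.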

Filter the data by statistical significance

  1. Open “R” (by double-clicking on the "R 2.7.1" desktop icon).
  2. (An error message about the "wrong java version" may appear if Java 1.6.x is installed. As long as version 1.5 or higher is installed, this message can safely be disregarded.)
  3. Check to make sure that the R goose appears in the Boss panel. [Screenshot: Boss with DMV and R connected]
  4. In the DMV, select the R goose in the dropdown menu in the upper left.
  5. Still in the DMV, select the “lambdas” tab and click on the “all” button in the upper right hand corner. This will select all of the cells in the data matrix.
  6. Click on the “M” in the upper left corner of the DMV (next to the dropdown menu) to broadcast the lambda significance values to R. [Screenshot: DMV]
  7. In R, type the command:
    
    lambdas = getMatrix()
    
    This will retrieve the data matrix from the DMV. Our goal now is to find significantly differentially expressed genes. We do this by filtering on the lambda values and broadcasting a list of genes that pass the filter back to the DMV.
  8. In DMV, click the "Clear" button to clear selections.
  9. In R, type the command:
    
    broadcast (filterMatrix (lambdas, 50, 2))
    
    This filters the data so that only genes with a significance value greater than 50 at 2 or more time points are included in the final list. These parameters may need adjustment according to the data. In the end, these genes will be selected in the DMV. (The filterMatrix function is explained further here.) [Screenshot: Filtering by statistical significance in R]
  10. In the “log 10 ratios” tab, some genes should now be selected; how many depends on the data and the filtering parameters used. (This number appears in the box next to the “clear” button in the upper right corner of the DMV.) [Screenshot: DMV]
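
To make the filtering rule in step 9 concrete, here is a minimal sketch in base R of a filter that behaves as described there: keep the genes whose value exceeds a threshold in at least a given number of columns. This is not the source of filterMatrix from gaggleUtil.R, which may differ in detail; the function name filter_by_count and the toy lambda values are assumptions made for this example.

```r
# Toy matrix of lambda (significance) values: rows are genes, columns are
# time points. Names and values are invented for illustration.
lambdas_toy <- matrix(
  c(60, 75, 10,
    55, 20, 30,
     5,  8, 12,
    90, 95, 99),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), c("t1", "t2", "t3"))
)

# Hypothetical stand-in for filterMatrix(): return the names of rows whose
# value is greater than `threshold` in at least `min_count` columns.
filter_by_count <- function(m, threshold, min_count) {
  passes <- rowSums(m > threshold) >= min_count
  rownames(m)[passes]
}

selected <- filter_by_count(lambdas_toy, 50, 2)
print(selected)
```

Raising the threshold or the required number of time points shrinks the list, which is one way to explore how sensitive the selection is to the parameters mentioned in step 9.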

Cluster the data

We'll use the clustering capabilities of the microarray data analysis tool MeV.

  1. Start MeV.
  2. In DMV, click the "log10 ratios" tab so that you broadcast ratios, not lambdas.
  3. In DMV, select "Multiple Array Viewer" in the broadcast target dropdown menu.
  4. Click “M” to broadcast your selected genes to MeV for clustering.
  5. In MeV, reset the color scale limits
    1. Click “Display/Set color scale limits”
    2. In the “Color Range Selection” box, type -0.5 as the lower limit and 0.5 as the upper limit. Click “OK”.
    3. Check to make sure that your heatmap in MeV has brightened.
    [Screenshot: MeV]
  6. In the upper row of MeV, click “HCL” to perform hierarchical clustering. In the pop-up dialog, uncheck the box for Sample Tree. We will cluster genes only. Select “Pearson's Correlation” as the distance measure to be used by the clustering algorithm. Click “OK”.
    (Note: at least with the 2008 class data, Euclidean Distance appears to work better than Pearson's Correlation; with Pearson's Correlation, VNG700 clusters in the wrong group.) [Screenshot: MeV]
  7. In the data analysis tree to the left, navigate to Analysis Results / HCL / HCL Tree.
  8. To make the gene labels readable:
    1. Click “Display/Set element size/custom”.
    2. In the dialog box, change the element height to 10.
  9. Examine the tree structure. Which clusters contain genes which are induced by copper? Repressed?
  10. Select (by clicking on the tree branches to the left of the heat-map) the cluster containing induced genes. Branches higher in the tree are more inclusive; branches lower down select higher correlations.
  11. What genes are induced? Let's broadcast the induced cluster back to the DMV.
    [Screenshot: MeV]
  12. First, clear selections in the DMV. In the MeV window, right-click the heatmap and select “Broadcast Gene List to Gaggle” to select the genes of interest back in the DMV.
  13. Plot these genes in the DMV. Are they truly induced by copper? Why or why not?
    [Screenshot: DMV with induced genes selected]
  14. Repeat steps 10 through 13 to broadcast other clusters of interest, whether induced or repressed, but do not clear the genes already selected in the DMV between broadcasts. The goal is to collect all of the genes of interest in the DMV as a single selection.
    [Screenshot: DMV plot]
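
The clustering steps above are done in MeV, but the same idea can be sketched in base R to see why the choice of distance measure matters. The example below clusters a toy matrix of log10 ratios twice, once with a Pearson correlation distance (1 - r) and once with Euclidean distance, then collects the genes in two clusters into one selection. The gene names and values are made up, and average linkage is an assumption; the text does not say which linkage MeV uses.

```r
# Toy log10 ratios: rows are genes, columns are time points (invented values).
ratios_toy <- matrix(
  c( 0.1,  0.4,  0.8,  1.0,    # geneA: rises strongly
     0.0,  0.1,  0.2,  0.25,   # geneB: rises with the same shape, smaller size
    -0.1, -0.5, -0.8, -1.0,    # geneC: falls strongly
     0.0, -0.1, -0.2, -0.3),   # geneD: falls, smaller size
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), paste0("t", 1:4))
)

# Pearson correlation distance between genes: similar shapes are close,
# regardless of magnitude. cor() works on columns, so transpose first.
pearson_dist <- as.dist(1 - cor(t(ratios_toy)))

# Euclidean distance between genes: similar magnitudes are close.
euclid_dist <- dist(ratios_toy, method = "euclidean")

# Cluster genes and cut each tree into two groups (average linkage assumed).
pearson_groups <- cutree(hclust(pearson_dist, method = "average"), k = 2)
euclid_groups  <- cutree(hclust(euclid_dist,  method = "average"), k = 2)
print(pearson_groups)
print(euclid_groups)

# Collect the genes from two clusters into one selection.
induced   <- names(pearson_groups)[pearson_groups == pearson_groups["geneA"]]
repressed <- names(pearson_groups)[pearson_groups == pearson_groups["geneC"]]
genes_of_interest <- union(induced, repressed)
```

Correlation distance groups genes by the shape of their response, while Euclidean distance is also sensitive to its size, so a weakly responding gene can land in a different cluster depending on the measure. It is worth checking both when a gene appears to be in the wrong group.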

Functional classifications of copper-responsive genes.

The Firegoose is an extension for the popular Firefox web browser which allows exchange of data with web based resources. If the Firegoose is not already installed, follow the firegoose install instructions.

  1. Open the Firegoose (start Firefox).
  2. Click “Gaggle” in the Firegoose toolbar to connect it to the other Geese.
  3. Select the Firegoose as the broadcast target in the DMV dropdown menu.
  4. Broadcast the genes of interest from the DMV to the Firegoose by clicking “B” in the upper left corner.
  5. Optionally, set the default species. (This should be done already.) Click on the colored Gaggle button on the left of the toolbar. Select Set Default Species. Set it to “Halobacterium sp. NRC-1”. Click OK.
  6. On the toolbar, point the righthand Firegoose menu to the “KEGG Pathway” target.
  7. On the left side of the Firegoose toolbar, it should say “gaggle: NameList (165)” (or some number of genes of interest). This means that your genes of interest have been received by the Firegoose. Click on the Broadcast button.
    [Screenshot: Firegoose]
  8. What functional categories are in the Pathway Search Result? Which genes from your initial clustering analysis are missing from this list?
  9. Broadcast genes of interest to the Halobacterium Annotated Proteome database. In the Firegoose, select “Halo Annotations” in the target menu and click “Broadcast”.
  10. What are the functions of the copper-induced and copper-repressed genes? Click on the links in the Function column to explore the annotations further. To which Pfam family does each gene belong? To which COG?
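
Step 8 asks which functional categories appear and which genes are missing from the pathway result. If the gene list and the pathway hits are copied into R, both questions reduce to a tally and a set difference. The sketch below assumes a simple two-column table of hits; the gene names and the gene-to-pathway assignments are invented and are not real Halobacterium annotations.

```r
# Genes selected by the clustering analysis (hypothetical names).
selected_genes <- c("geneA", "geneB", "geneC", "geneD", "geneE")

# Hypothetical pathway search result: one row per gene that mapped to a pathway.
pathway_hits <- data.frame(
  gene    = c("geneA", "geneB", "geneD"),
  pathway = c("ABC transporters", "ABC transporters", "Oxidative phosphorylation"),
  stringsAsFactors = FALSE
)

# How many selected genes fall in each functional category?
category_counts <- table(pathway_hits$pathway)
print(category_counts)

# Which selected genes did not map to any pathway?
missing_genes <- setdiff(selected_genes, pathway_hits$gene)
print(missing_genes)
```

Genes that do not map to any pathway are often the ones of unknown function, which makes them natural candidates for the hypothesis questions that follow.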

Hypothesis formulation.

  1. Of the genes of known function, which are repressed and induced by copper? What about the genes of unknown function?
  2. Which of these genes respond most strongly (i.e., are induced or repressed most significantly)?
  3. Gathering all of the information you have learned from your analysis, can you come up with a conclusion and a hypothesis?
  4. How should these hypotheses be tested?
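
One way to approach question 2 is to rank genes by the size of their largest change. The sketch below takes, for each gene, the log10 ratio with the greatest absolute value across time points and sorts genes by that magnitude. This is only one possible reading of "respond most strongly"; ranking by the lambda significance values would be an equally reasonable alternative. The matrix and gene names are made up.

```r
# Toy log10 ratios: rows are genes, columns are time points (invented values).
ratios_toy <- matrix(
  c( 0.10,  0.9,  0.40,
    -0.20, -1.2, -0.60,
     0.05,  0.1, -0.05),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("t1", "t2", "t3"))
)

# For each gene, keep the ratio with the largest absolute value (sign preserved).
peak_response <- apply(ratios_toy, 1, function(x) x[which.max(abs(x))])

# Rank genes from strongest to weakest response and label the direction.
ranked    <- peak_response[order(abs(peak_response), decreasing = TRUE)]
direction <- ifelse(ranked > 0, "induced", "repressed")

print(data.frame(gene = names(ranked), peak_log10_ratio = ranked,
                 direction = direction, row.names = NULL))
```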

Other Gaggle Tools:

Cytoscape: Halobacterium NRC-1 network

Unnormalized class data: DMV
