The Scenario

A researcher undertakes a study of the physiological response in H. salinarum to changes in oxygen level. She does a series of 61 microarrays under conditions of varying oxygen concentration.

What can we learn from this data?

Part 1: (Optional) Finding and clustering differentially expressed genes

The first step of this analysis is to find genes whose expression changes under conditions where the level of oxygen has been perturbed relative to a reference condition. Then we use a clustering algorithm to group together genes whose expression changes in similar ways.

This part of the analysis exercises the Gaggle and a few desktop tools, namely R, MeV, and the DMV. The list of genes computed in Part 1 has been encoded in the Gaggle microformat and embedded in this page. The reader can skip directly to Part 2 to analyze these genes using the Firegoose.

Start Gaggle tools

  1. Start the Gaggle Boss.
  2. Start the DMV.
  3. Start R. Connect to the Gaggle by typing the following at the R command prompt:
    (or just cut-n-paste!)

Load the microarray data into the DMV

  1. In the DMV, open the environmental folder on the left and select oxygen. A button labeled with a red 61 should appear in the top of the window. Click the button to load microarray data for the 61 oxygen conditions.
  2. The DMV should open two tabs: lambdas and log10 ratios.

For each condition we have two measurements per gene: a log10 ratio and a lambda value.

Broadcast ratios to R

  1. In the DMV, make sure the log10 ratios tab is selected.
  2. Click the All button to select all rows in the log10 matrix.
  3. Click Update and select R in the drop-down list of geese.
  4. Broadcast the matrix to R by clicking the button marked M.

Normalize microarray data

  1. In R, assign the name ratios to the matrix we just broadcast.
  2. Note that on some platforms (Windows) the matrix ready message fails to appear. Type "dim(ratios)" to verify the size of the matrix.
  3. In one step, normalize ratios to a mean of 0 and a standard deviation of 1 and broadcast the normalized matrix back to the DMV. Normalizing makes expression profiles easier to compare during clustering.

matrix ready, dimension 2400 x 61
> ratios <- getMatrix()

> dim(ratios)
[1] 2400   61

> broadcast(normalize(ratios), "log10_ratios_normalized")
  1. A new tab labeled "log10_ratios_normalized" should appear in the DMV.
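
To make the normalization step concrete, here is a minimal sketch in base R. The source of the Gaggle normalize() function is not shown in this tutorial, so this assumes that each gene (row) is centered to a mean of 0 and scaled to a standard deviation of 1. The name normalize_rows_sketch and the toy matrix are hypothetical.

```r
# Sketch only: NOT the Gaggle normalize() source.
# Assumption: normalization is applied per gene (per row).
normalize_rows_sketch <- function(m) {
  # scale() works column-wise, so transpose, scale, and transpose back
  out <- t(scale(t(m), center = TRUE, scale = TRUE))
  # drop the bookkeeping attributes that scale() attaches
  attr(out, "scaled:center") <- NULL
  attr(out, "scaled:scale") <- NULL
  out
}

# Toy stand-in for the 2400 x 61 matrix of log10 ratios
set.seed(1)
toy <- matrix(rnorm(5 * 8, mean = 0.3, sd = 2), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("cond", 1:8)))

norm <- normalize_rows_sketch(toy)
rowMeans(norm)       # each row mean is numerically 0
apply(norm, 1, sd)   # each row standard deviation is 1
```

A row with no variation at all would produce NaN values under this scheme, so a real implementation may need to handle that case.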

Broadcast lambdas to R

  1. Select the lambdas tab in the DMV.
  2. Press All to select all rows.
  3. R should still be selected in the drop-down list of geese.
  4. Press M to broadcast the matrix to R.

Find differentially expressed genes

Out of the ~2400 unique genes in H. salinarum, we want to find those significantly differentially expressed under our experimental conditions. To do this, we apply a threshold to the lambda values in R.

  1. Assign the matrix just broadcast to a variable as before. This time call it lambdas.
  2. Again, verify the dimensions of the matrix.
  3. The filterMatrix function returns a list of row names where the data in the row passes a threshold. Here we require that 6 conditions have a lambda value of at least 50. The FALSE parameter indicates that we don't care whether the 6 conditions are consecutive. The choice of these parameters is somewhat arbitrary and some judgement must be applied.
  4. Broadcast these genes back to the DMV.

matrix ready, dimension 2400 x 61
> lambdas <- getMatrix()

> dim(lambdas)
[1] 2400   61

> sig_genes <- filterMatrix(lambdas, 50, 6, FALSE)

> sig_genes

  [1] "VNG0013C" "VNG0014C" "VNG0017H" "VNG0018H" "VNG0022H" "VNG0027H"
  [7] "VNG0028C" "VNG0029H" "VNG0033H" "VNG0043H" "VNG0049H" "VNG0053H"
[445] "VNG6432H" "VNG6439H" "VNG6441H"

> broadcast(sig_genes)
  1. In the log10_ratios_normalized tab of the DMV, the 447 rows that passed the filter should be selected.
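
The filtering rule described above can be sketched in base R as follows. This is one reading of the description, not the Gaggle filterMatrix() source. The function name filter_rows_sketch, its argument names, and the toy data are hypothetical, and the sketch assumes the matrix has no missing values.

```r
# Sketch only: keep rows where at least `min_count` values reach `threshold`.
# If `consecutive` is TRUE, those values must be adjacent conditions.
filter_rows_sketch <- function(m, threshold, min_count, consecutive = FALSE) {
  passes <- apply(m, 1, function(row) {
    hit <- row >= threshold
    if (consecutive) {
      runs <- rle(hit)  # lengths of runs of TRUE / FALSE
      any(runs$values & runs$lengths >= min_count)
    } else {
      sum(hit) >= min_count
    }
  })
  rownames(m)[passes]
}

toy <- rbind(
  geneA = c(60, 10, 70, 80, 5, 90),   # four values >= 50, longest run is 2
  geneB = c(55, 65, 75, 10, 10, 10),  # three adjacent values >= 50
  geneC = c(1, 2, 3, 4, 5, 6)         # no values >= 50
)

filter_rows_sketch(toy, 50, 3, FALSE)  # geneA and geneB pass
filter_rows_sketch(toy, 50, 3, TRUE)   # only geneB passes
```

Raising the threshold or the required count returns fewer genes, which is why the choice of parameters calls for some judgement.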

Cluster submatrix using MeV

We use clustering to group together genes with similar expression profiles.

  1. Start MeV (multiexperiment viewer).
  2. In DMV, make sure the log10_ratios_normalized tab is selected and the 447 rows from the previous step are selected. Click M to broadcast the submatrix to MeV.
  3. A green and red heat-map should appear in MeV.
  4. Press the PCA button and click "OK" to analyze the expression profiles using principal component analysis. PCA is a technique that projects data of high dimensionality onto fewer axes.
  5. Open the PCA genes folder (left of screen) and open the subfolders Projections on the PC axes, Components 1,2,3, and 2D Views. The axis that corresponds best to response to oxygen is axis 1. Click on "1,2" to plot the genes against these axes.
  6. Use the mouse to draw an ellipse around the genes to the left of the Y-axis. Select as many as possible without crossing the Y-axis.

Clustering by principal component analysis
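
For readers who want to see the idea outside MeV, the sketch below uses base R prcomp() on toy data. It is not MeV's implementation, and the toy profiles and the variable names are hypothetical. Splitting on the sign of the first axis stands in for drawing an ellipse on one side of the Y-axis.

```r
# Sketch only: base R prcomp() on toy data, not MeV's PCA.
set.seed(42)
n_cond <- 10
trend <- seq(-1, 1, length.out = n_cond)  # hypothetical oxygen-like trend

# Six toy genes that follow the trend and six that oppose it
up   <- t(replicate(6,  trend + rnorm(n_cond, sd = 0.1)))
down <- t(replicate(6, -trend + rnorm(n_cond, sd = 0.1)))
toy <- rbind(up, down)
rownames(toy) <- c(paste0("up", 1:6), paste0("down", 1:6))

# Rows are genes, columns are conditions
pca <- prcomp(toy, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]  # each gene projected onto PC axes 1 and 2

# The sign of a principal axis is arbitrary, so which group lands on the
# left can differ between runs or tools.
left  <- rownames(scores)[scores[, "PC1"] < 0]
right <- rownames(scores)[scores[, "PC1"] > 0]

plot(scores, main = "Genes on PC axes 1 and 2")
abline(v = 0, lty = 2)
```

In this toy data the two groups separate cleanly along axis 1. Real expression data is noisier, which is why the tutorial selects the genes by hand.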

Highlight the selected cluster in the DMV

  1. Return momentarily to the DMV. Press "Clear" to clear the current selections.
  2. Broadcasting the clustered genes is slightly awkward. Back in MeV, right-click on "1,2" (on the left) and select Launch new session. A heat-map will appear containing just the genes in the selected cluster.
  3. Open the Gaggle menu and click "Broadcast" to broadcast the gene names.
  4. In the DMV, 222 genes should be selected (give or take a few depending on how generous an ellipse you drew).

A similar procedure can be used to select the genes activated by oxygen, which lie on the right side of the Y-axis.

Resulting gene clusters

The results of this part of the analysis are a pair of gene clusters broadly classified into those whose expression is correlated with oxygen levels and those whose expression is anticorrelated with oxygen. Cluster 1 holds the genes induced by the absence of oxygen. Cluster 2 holds genes induced by the presence of oxygen.
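
As a rough illustration of what correlated and anticorrelated mean here, the sketch below computes a Pearson correlation between each gene's profile and a vector of oxygen levels. The oxygen values and the two toy genes are hypothetical. The tutorial itself does not provide per-condition oxygen measurements, and this is not the method used to build the clusters.

```r
# Sketch only: toy data with hypothetical oxygen levels per condition.
set.seed(7)
oxygen <- c(0, 5, 10, 15, 20, 15, 10, 5)

toy <- rbind(
  aerobic_like   =  oxygen / 10 + rnorm(8, sd = 0.1),  # rises with oxygen
  anaerobic_like = -oxygen / 10 + rnorm(8, sd = 0.1)   # falls with oxygen
)

# Pearson correlation of each gene (row) with the oxygen vector
r <- apply(toy, 1, cor, y = oxygen)
sign(r)  # positive: correlated with oxygen, negative: anticorrelated
```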

Cluster 1: Anaerobic genes

These 222 genes were found to be activated under anaerobic conditions.

Cluster 2: Aerobic genes

These 223 genes were found to be repressed under anaerobic conditions and active under aerobic conditions.

Institute for Systems Biology