Biological Relationship Analysis
To identify biologically significant patterns in microarray data, we developed an approach called “Biological Relationship Analysis”. We have used this approach to construct a superoxide response network in E. coli (Blanchard et al. 2007). Biological relationship analysis consists of four steps:
1. Developing biological relationship data sets.
2. Partitioning the microarray results into groups of genes with similar expression patterns using clustering methods.
3. Applying a statistical test to determine whether genes that share a relationship are overrepresented in a group.
4. Repeating steps 2 and 3 to determine the appropriate clustering method and number of clusters.
Developing biological relationship data sets
Our biological relationship data sets are derived from research databases, published data sets, and data sets we have built from results in our laboratory. The biological relationship data sets used in our E. coli microarray analyses are derived from transcriptional regulatory relationships in RegulonDB, metabolic pathway relationships in KEGG, protein complex and metabolic relationships in EcoCyc, Gene Ontology relationships, common operon relationships, and protein-protein relationships identified by mass spectrometry. Here is a link to the E. coli data sets in our recently submitted manuscript. We have also written out the protocols for creating these data sets in Arabidopsis.
Statistical tests of the biological relationships in microarray data
We currently use clustering methods implemented in the MultiExperiment Viewer to partition the microarray data into groups of genes with similar expression patterns. We then determine whether the genes in a cluster share a common biological relationship more often than would be expected by chance. This is done with the hypergeometric distribution, which gives the probability of finding the observed number of genes carrying a given biological relationship term in a test group (e.g. a cluster), given the proportion of genes carrying that term in the larger population (e.g. all genes on the array). This statistical test is implemented in a Perl program called GeneMerge. We have written a program that runs this statistical test on all biological relationship data sets in all clusters.
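
The test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GeneMerge code or our own program: the function names (`hypergeom_tail`, `term_enrichment`, `enrichment_for_all`) and the dictionary layout of the inputs are hypothetical, and no multiple-testing correction is applied. For each term, it computes the probability of seeing at least the observed number of annotated genes in a cluster when the cluster is drawn at random from all genes on the array.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the population (e.g. all genes on the array)
    K: population genes carrying the term
    n: genes in the test group (e.g. a cluster)
    k: test-group genes carrying the term
    """
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def term_enrichment(cluster, population, annotations):
    """Return {term: p-value} for every term found at least once in the cluster.

    annotations maps a relationship term to the genes that carry it
    (hypothetical layout, chosen for this sketch).
    """
    population = set(population)
    cluster = set(cluster) & population
    results = {}
    for term, genes in annotations.items():
        annotated = set(genes) & population
        k = len(annotated & cluster)
        if k > 0:
            results[term] = hypergeom_tail(
                len(population), len(annotated), len(cluster), k
            )
    return results


def enrichment_for_all(clusters, population, data_sets):
    """Run the test on every relationship data set in every cluster.

    clusters:  {cluster_name: genes}
    data_sets: {data_set_name: {term: genes}}
    """
    return {
        (cluster_name, set_name): term_enrichment(genes, population, annotations)
        for cluster_name, genes in clusters.items()
        for set_name, annotations in data_sets.items()
    }
```

As a usage example with made-up gene labels: if 3 of 10 genes on the array carry a term and all 3 fall in a 3-gene cluster, `hypergeom_tail(10, 3, 3, 3)` gives 1/120, the chance of drawing exactly those three genes at random.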