Microarray Data Mining: Facing the Challenges
Gregory Piatetsky-Shapiro
KDnuggets and U. Mass Lowell
gregory at kdnuggets.com
Pablo Tamayo
MIT / Broad Institute
tamayo at broad.mit.edu
1. MOLECULAR BIOLOGY AND DNA
All organisms on Earth, except for viruses, consist of cells. Yeast,
for example, has one cell, while humans have trillions of cells. The
cells of eukaryotes, such as yeast and humans, have a nucleus, and
inside the nucleus there is DNA, which
encodes the “program” for making future organisms. DNA has
coding and non-coding segments, and coding segments, called
“genes”, specify the structure of proteins, which are large
molecules, like hemoglobin, that do the essential work in every
organism. Practically all cells in the same organism have the same
genes, but these genes can be expressed differently at different
times and under different conditions. Genes make proteins in two
steps. First, DNA is transcribed into messenger RNA or mRNA,
which in turn is translated into proteins. The different patterns of
gene expression, which follow carefully tuned biological programs
according to tissue type, developmental stage, environment and
genetic background, account for the huge variety of different cell
states and types. Virtually all major differences in cell state or
type are correlated with changes in the mRNA levels of many
genes.
2. MICROARRAYS: AN OVERVIEW
In recent years there has been an explosion in the rate of
acquisition of biomedical data. Advances in molecular genetics
technologies, such as DNA microarrays [1-8], allow us for the first
time to obtain a "global" view of the cell. For example, we can
now routinely investigate the molecular state of a cell by
simultaneously measuring the expression of tens of thousands of
genes using DNA microarrays.
Different types of microarrays use different technologies for
measuring mRNA expression levels; a detailed description of these
technologies is beyond the scope of this paper. Here we will focus
on the analysis of data from Affymetrix arrays, which are
currently among the most popular commercial arrays. The
methodology for analysis of data from other arrays would be
similar, but would use different technology-specific data
preparation and cleaning steps.
Figure 1: Affymetrix GeneChip® (right),
its grid (center) and a cell in a grid (left).
This type of microarray is a silicon chip that can measure the
expression levels of thousands of genes simultaneously. This is
done by hybridizing a complex mixture of mRNAs (derived from
tissue or cells) to microarrays that display probes for different
genes tiled in a grid-like fashion. Hybridization events are
detected using a fluorescent dye and a scanner that can detect
fluorescence intensities. The scanners and associated software
perform various forms of image analysis to measure and report
raw gene expression values. This allows for a quantitative readout
of gene expression on a gene-by-gene basis. As of 2003, there are
one-chip microarrays that measure expression of over 30,000
genes, covering most of the human genome.
Microarrays have opened the possibility of creating data sets of
molecular information to represent many systems of biological or
clinical interest. Gene expression profiles can be used as inputs to
large-scale data analysis, for example, to serve as fingerprints to
build more accurate molecular classification, to discover hidden
taxonomies or to increase our understanding of normal and
disease states.
The first generation of microarray analysis methodologies
developed over the last 5 years has demonstrated that expression
data can be used in a variety of class discovery or class prediction
biomedical problems including those relevant to tumor
classification [10-14]. Machine learning and statistical techniques
applied to gene expression data have been used to address the
questions of distinguishing tumor morphology, predicting post-
treatment outcome, and finding molecular markers for disease.
Today the microarray-based classification of different
morphologies, lineages and cell histologies can be performed
successfully in many instances. The performance in predicting
treatment outcome or drug response has been more limited but
some of the results are quite promising. Most results of
microarray analysis still require further experimental validation
and follow-up study, and many current efforts are directed toward
that goal. In a few cases the results of microarray analysis
have found their way into more serious consideration in clinical
Figure 2: An example raw microarray image for a single
sample (image courtesy of Affymetrix). The intensity of
the image on the left is translated by microarray software
into numbers like the ones on the right.
Gene        Value
D26018_at   193
D26067_at   -70
D26068_at   144
D26129_at   33
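As a minimal sketch of how a readout like the one in Figure 2 might be handled downstream, the Python snippet below parses the four gene/value pairs shown there into a dictionary. The whitespace-separated text layout and the function name `parse_expression` are assumptions made for illustration; they do not describe an actual Affymetrix file format.

```python
# Illustrative only: RAW mirrors the Figure 2 excerpt, not a real
# Affymetrix output file. parse_expression is a hypothetical helper.
RAW = """\
Gene Value
D26018_at 193
D26067_at -70
D26068_at 144
D26129_at 33
"""


def parse_expression(text):
    """Return {probe_id: value} from whitespace-separated 'Gene Value' lines."""
    values = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        probe_id, value = line.split()
        values[probe_id] = int(value)
    return values


profile = parse_expression(RAW)

# Collect probes whose reported value is below zero.
negative = sorted(p for p, v in profile.items() if v < 0)

print(profile)
print("probes with negative values:", negative)
```

Note that the reported values can be negative (D26067_at is -70 in the excerpt), so any later processing step should not assume non-negative inputs.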
SIGKDD Explorations. Volume 5, Issue 2 - Page 1