Abstract
Background
Several microarray studies have produced gene lists for predicting the outcome of various disease treatments. These lists overlap only to a small extent, suggesting that the findings of any single study are unstable. It has been proposed that the underlying pathways are nonetheless alike, and that the expression of gene sets, rather than of individual genes, may be more informative for prediction and for understanding the basic biological processes.
Results
In this study, we examined the stability of prognostic traits based on gene sets instead of single genes. We used breast cancer data from five microarray studies concerning the risk of metastasis. The traits of the individual genes in each set are aggregated using a set statistic. The resulting prognostic gene sets were as predictive as individual genes but showed more stable rankings under bootstrap replication within datasets. In addition, the gene sets produced more stable classifiers across different datasets and were easier to interpret, because they consider gene expression in the context of the neighboring genes in the pathway.
Conclusion
Currently, most studies of gene expression focus on individual genes. This study demonstrates that an alternative approach, analyzing the data with predefined gene sets, can lower errors and may give further insight into the fundamental biological pathways.
Introduction
Roughly 12% of women in the US will develop breast cancer in the course of their lives (Wang et al, pp. 4). Despite the progress made in treatment options such as chemotherapy, Wang et al (4) assert that breast cancer still ranks as the second most dangerous cancer in women. This points to a growing need for better methods of predicting the prognosis of the disease. With the introduction of large genetic databases and the falling cost of such studies, scientists must select from a large pool of candidate prognostic markers drawn from many breast cancer gene expression reports.
Methods
The study evaluated several techniques for deriving traits from gene sets. The data and the techniques are described below.
Data
Five breast cancer datasets from NCBI GEO were used for the study: GSE2034, GSE4922, GSE6532, GSE7390, and GSE11121 (Abraham et al, pp. 2). All five were generated on Affymetrix HG-U133A microarrays. The data first underwent a quality control procedure in which some values were eliminated based on their variance, leaving each dataset with 22,215 probesets. Each dataset was then normalized independently.
Data Composition
Each of the five datasets includes both lymph-node-negative and lymph-node-positive patients. The patients were placed into two categories, low risk and high risk. Information gathered from patients with more than five years was removed from the dataset, as these patients were considered non-informative.
Set Statistics
The purpose of the set statistic is to reduce the set's expression matrix to a single vector, which is then used as a trait for classification. The aim is for the statistic to reflect the expression levels of the whole set in a useful way. The set statistics used here are unsupervised, i.e. they do not consider the metastatic class, unlike statistics such as the t-test.
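To make the idea of reducing a set's expression matrix to one vector concrete, the following is a minimal sketch in Python of three unsupervised set statistics named later in this essay (centroid, median, and medoid). The function names, the tiny example matrix, and the use of Euclidean distance for the medoid are assumptions made for illustration, not details taken from the study.

```python
from statistics import mean, median

def set_centroid(expr):
    """Per-sample mean over the genes in the set (rows = genes, columns = samples)."""
    return [mean(col) for col in zip(*expr)]

def set_median(expr):
    """Per-sample median over the genes in the set."""
    return [median(col) for col in zip(*expr)]

def set_medoid(expr):
    """Profile of the gene with the smallest total distance to the other genes.

    Euclidean distance is an assumption made for this sketch.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(expr, key=lambda g: sum(dist(g, h) for h in expr))

# Hypothetical gene set: 3 genes (rows) measured in 4 samples (columns).
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 5.0, 4.0],
    [6.0, 2.0, 4.0, 1.0],
]

centroid = set_centroid(expr)  # one value per sample
med = set_median(expr)         # one value per sample
medoid = set_medoid(expr)      # an actual gene profile from the set
```

Each function returns a vector with one entry per sample, so a whole gene set can be passed to a classifier as a single trait. None of the three looks at class labels, which is what makes them unsupervised.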
Results and Discussion
The study found that the bias from internal and external validation was identical, indicating that the centroid classifier did not considerably affect the data. The analysis focused on how the ranks of a single trait varied, since the desired traits are those that are highly ranked on average and show little variability about that average (Abraham et al, pp. 6). If a trait has a low average rank and great variability, it may occasionally appear at the top of the list by chance when the study is repeated, which marks it as an erratic predictor. Conversely, traits that exhibit high average ranks and small variability can be excellent predictors.
In an assessment of the variability of the ranks, the top gene sets showed less variation than the top genes. This indicates that lists of prognostic genes are comparatively unstable, as even the best-ranked genes varied more than the top gene sets within the same dataset. From these observations, Wang et al concluded that gene set traits are more stable (Wang et al, pp. 6).
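The rank-stability idea discussed above can be sketched as follows: resample the patients with replacement, re-rank the traits on each resample, and summarize each trait by the mean and standard deviation of its rank. This is only an illustrative sketch. The scoring rule (absolute difference between class means), the function names, and the toy data are assumptions, and the study's actual ranking statistic may differ.

```python
import random
from statistics import mean, pstdev

def score(values, labels):
    """Absolute difference between class means (an assumed, simple ranking score)."""
    high = [v for v, y in zip(values, labels) if y == 1]
    low = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(high) - mean(low))

def ranks_from_scores(scores):
    """Rank 1 goes to the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def bootstrap_rank_stability(features, labels, n_boot=200, seed=0):
    """Return (mean rank, rank standard deviation) for each trait."""
    rng = random.Random(seed)
    n = len(labels)
    all_ranks = []
    while len(all_ranks) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample needs both risk classes to be scored
        scores = [score([f[i] for i in idx], ys) for f in features]
        all_ranks.append(ranks_from_scores(scores))
    return [(mean(r), pstdev(r)) for r in zip(*all_ranks)]

# Toy data: rows are traits, columns are patients (0 = low risk, 1 = high risk).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0],  # separates the classes well
    [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0],  # unrelated to the classes
    [1.0, 2.0, 1.0, 2.0, 1.5, 2.5, 1.5, 2.5],  # weakly related
]
stability = bootstrap_rank_stability(features, labels, n_boot=50, seed=1)
for name, (avg, sd) in zip(["trait_0", "trait_1", "trait_2"], stability):
    print(name, round(avg, 2), round(sd, 2))
```

A trait that is ranked near the top on average with a small standard deviation is the kind of stable predictor the text describes. A large standard deviation signals that a good rank in any single run may be due to chance.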
The study also examined variability across datasets and the significance of the traits. Wang et al used two tools: rank correlation of the centroid classifier's weights, and the agreement of the gene set lists. Both approaches have been examined by Bavelstad et al (Wang et al, pp. 3) for measuring the correspondence between gene sets and the implications of this measure. The study computed Spearman's rank correlation between the classifier weights obtained from the normalized datasets. The results show that the rank correlation for the weights of the set centroid, median, medoid, and t-statistic is higher than for the individual genes that make up the sets. This suggests that classifiers built from gene set traits are more stable than those built from individual genes, and are less likely to overfit.
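As an illustration of the first tool, the sketch below computes Spearman's rank correlation between two weight vectors by ranking each vector (averaging ranks for ties) and then taking the Pearson correlation of the ranks. The weight values are invented for the example and stand in for classifier weights of the same traits learned on two different datasets.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical classifier weights for the same four traits in two datasets.
weights_a = [0.9, -0.2, 0.4, 0.1]
weights_b = [1.5, -0.1, 0.3, 0.2]
rho = spearman(weights_a, weights_b)
```

A correlation close to 1 means the two datasets order the traits in nearly the same way, which is the sense in which the classifiers are called stable. Values near 0 mean the ordering learned on one dataset says little about the other.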
An examination of how well the ranked lists produced from each dataset agreed with one another supported the conclusion that lists of individual genes show greater variability and are therefore very unstable.
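One simple way to quantify agreement between two ranked lists is the fraction of items shared by their top k entries. The sketch below assumes this top-k overlap as the agreement measure. The study may define agreement differently, and the list contents are hypothetical identifiers.

```python
def top_k_overlap(list_a, list_b, k):
    """Fraction of the top-k items of two ranked lists that appear in both."""
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    return len(top_a & top_b) / k

# Hypothetical ranked lists from two datasets, best-ranked item first.
ranked_1 = ["g1", "g2", "g3", "g4"]
ranked_2 = ["g2", "g1", "g5", "g3"]

overlap_top2 = top_k_overlap(ranked_1, ranked_2, 2)
overlap_top3 = top_k_overlap(ranked_1, ranked_2, 3)
```

Lists that keep a high overlap as k grows agree across datasets, while lists whose overlap stays low are unstable in the sense described above.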
The study shows that classifiers based on gene sets, rather than individual genes, have the same predictive ability for breast cancer but are more stable, and may give a deeper understanding of the biological system underlying breast cancer prognosis. A possible reason is that the expression of any gene is a function of both its contextual regulation and natural variability arising from germ-line variation and differences in individual tumor responses (Wang et al, 13).
Kela et al, in their study of outcome signature genes in breast cancer, posit that variation in host-tumor response among breast cancer patients lowers prognostic accuracy, because variability between patients is not always taken into account when prognostic models are generated (Kela et al, 175). This suggests that there is prognostic value in large-scale sampling of gene sets rather than individual genes. This is consistent with current trends in the prognosis of breast cancer in women (Elledge et al, pp. 709).
Works Cited
Abraham, Gad, Kowalczyk, Adam, Loi, Sherene, Haviv, Izhak, Zobel, Justin. Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. 2010. BMC Bioinformatics, 11:277.
Bavelstad, Hege, Nygard, Stale, Storvold, Hege, Aldrin, Magne, Borgan, Ornulf, Frigessi, Arnoldo, Lingjaerde, Ole Christian. Predicting survival from microarray data: a comparative study. 2007. Bioinformatics, Vol. 23, Issue 16, pp. 2080-2087.
Elledge, Richard, Clark, Gary, Chamness, Gary, Osborne, C. Kent. Tumor Biologic Factors and Breast Cancer Prognosis Among White, Hispanic, and Black Women in the United States. 1994. Journal of the National Cancer Institute, Vol. 86, Issue 9, pp. 705-712.
Kela, Itai, Ein-Dor, Liat, Getz, Gad, Givol, David, Domany, Eytan. Outcome signature genes in breast cancer: is there a unique set? 2004. Bioinformatics, Vol. 21, Issue 2.
Wang, Dan, Miecznikowski, Jeffrey, Liu, Song, Sucheston, Lara, Gold, David. Comparative survival analysis of breast cancer microarray studies identifies important prognostic genetic pathways. 2010. Bioinformatics 10:573.