Abstract
|
In microarray cancer datasets, gene analysis and classification are essential tasks, because gene expression data are high-dimensional and contain redundant information, irrelevant features, and noise. The main contribution of this paper is therefore the selection of a concise subset of informative genes to improve processing speed and prediction performance. A two-phase hybrid approach is proposed that combines the Principal Component Analysis (PCA) algorithm with Partial Decision Tree (PART) rules. PCA is applied to identify a small set of the most discriminating genes, while the PART rules are used to classify the microarray data into two or more classes. Eleven datasets with different numbers of classes and genes are used: Breast Cancer, CNS, Colon, Leukemia, Leukemia_3C, Leukemia_4C, Lung, Lymphoma, MLL, Ovarian, and SRBCT. The data analysis is conducted using the full training method and cross-validation with 2 to 10 folds. Experimental analysis shows that gene selection with PCA reduced the computational complexity and yielded the smallest subset of genes prior to classification. In addition, the PART classifier, when combined with PCA, ran faster and showed a remarkable improvement in classification accuracy.
|
Keywords
|
Principal Component Analysis (PCA) algorithm, Partial Decision Tree (PART) rules, Microarray data, Classification, Gene selection, Data mining
|