TY - GEN
T1 - Towards a Mixed Learning Strategy for Discovering New Gene Signatures in Breast Cancer Prognosis
AU - Cola-Pilicita, Cristhian
AU - Martínez-Mejía, Mateo
AU - Alba, Eduardo
AU - Marrero-Ponce, Yovani
AU - Pérez-Pérez, Noel
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - This work develops a mixed-learning method that combines a filter-based metaheuristic searcher with a shallow learning classifier to reduce the feature space while maximizing breast cancer prognosis classification performance. The searcher used a genetic algorithm together with the average symmetrical uncertainty (aSU) and average ReliefF (aReliefF) filter functions. Averaging the filter scores allowed us to measure the per-feature ("per capita") relevance of a group of features (genes). The proposed method was validated on a data set with 396 instances. The most effective classification scheme was the random forest model with 60 tree predictors and the aReliefF objective function. This configuration achieved average area under the receiver operating characteristic curve (AUC) scores of 0.854 and 0.874 for the training and test stages, respectively, making it the best-performing breast cancer prognosis classification scheme among those evaluated. In addition, we identified a set of master genes by intersecting the features deemed relevant under both objective functions. However, evaluating this subset on the test set with the top-performing classification scheme yielded lower performance (AUC=0.829), indicating that additional genes are needed to maximize classification effectiveness.
KW - Genetic algorithm
KW - Metaheuristics
KW - Naive Bayes
KW - Random forest
KW - ReliefF
KW - Shallow learning
KW - Symmetrical uncertainty
KW - k-nearest neighbors
UR - http://www.scopus.com/inward/record.url?scp=85211153038&partnerID=8YFLogxK
U2 - 10.1109/ARGENCON62399.2024.10735947
DO - 10.1109/ARGENCON62399.2024.10735947
M3 - Conference contribution
AN - SCOPUS:85211153038
T3 - 2024 7th IEEE Biennial Congress of Argentina, ARGENCON 2024
BT - 2024 7th IEEE Biennial Congress of Argentina, ARGENCON 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th IEEE Biennial Congress of Argentina, ARGENCON 2024
Y2 - 18 September 2024 through 20 September 2024
ER -