TY - GEN
T1 - Fine-Tuning Wav2Vec2 for Low-Resource Kichwa Automatic Speech Recognition
AU - Santamaria, Christian
AU - Grijalva, Felipe
AU - Parra, Carla
AU - Rosero, Karen
AU - Vega-Sánchez, José
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - In recent years, advancements in artificial intelligence (AI) have significantly accelerated the development of natural language processing and automatic speech recognition (ASR) systems for high-resource languages, raising concerns about the marginalization of ancestral and underrepresented languages. In this context, this work explores the fine-tuning of the Wav2Vec 2.0 model, developed by Meta AI, for ASR in Kichwa, a low-resource language spoken in the Ecuadorian Andes. The training process utilized two datasets totaling approximately 8 hours of audio, segmented into clips ranging from 1.5 to 5 seconds, with manually aligned transcriptions created using ELAN software. Fine-tuning was performed using the Connectionist Temporal Classification (CTC) loss function. After multiple experiments, a two-tailed Wilcoxon signed-rank test revealed no statistically significant improvement when applying SpecAugment. The best-performing model, trained without data augmentation, achieved promising results on the test set: a Word Error Rate (WER) of 0.262, a Character Error Rate (CER) of 0.120, and a Match Error Rate (MER) of 0.401. These findings demonstrate the viability of adapting pre-trained self-supervised models to low-resource settings and underscore the potential of ASR technologies to support greater linguistic inclusivity in artificial intelligence.
AB - In recent years, advancements in artificial intelligence (AI) have significantly accelerated the development of natural language processing and automatic speech recognition (ASR) systems for high-resource languages, raising concerns about the marginalization of ancestral and underrepresented languages. In this context, this work explores the fine-tuning of the Wav2Vec 2.0 model, developed by Meta AI, for ASR in Kichwa, a low-resource language spoken in the Ecuadorian Andes. The training process utilized two datasets totaling approximately 8 hours of audio, segmented into clips ranging from 1.5 to 5 seconds, with manually aligned transcriptions created using ELAN software. Fine-tuning was performed using the Connectionist Temporal Classification (CTC) loss function. After multiple experiments, a two-tailed Wilcoxon signed-rank test revealed no statistically significant improvement when applying SpecAugment. The best-performing model, trained without data augmentation, achieved promising results on the test set: a Word Error Rate (WER) of 0.262, a Character Error Rate (CER) of 0.120, and a Match Error Rate (MER) of 0.401. These findings demonstrate the viability of adapting pre-trained self-supervised models to low-resource settings and underscore the potential of ASR technologies to support greater linguistic inclusivity in artificial intelligence.
KW - Audio
KW - Automatic Speech Recognition
KW - Connectionist Temporal Classification
KW - Deep Learning
KW - Fine-Tuning
KW - Kichwa
UR - https://www.scopus.com/pages/publications/105032529153
U2 - 10.1109/ETCM67548.2025.11304301
DO - 10.1109/ETCM67548.2025.11304301
M3 - Conference contribution
AN - SCOPUS:105032529153
T3 - ETCM 2025 - 9th Ecuador Technical Chapters Meeting
BT - ETCM 2025 - 9th Ecuador Technical Chapters Meeting
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th Ecuador Technical Chapters Meeting, ETCM 2025
Y2 - 21 October 2025 through 24 October 2025
ER -