TY - GEN
T1 - Building a Generalized Framework for Analyzing Public Procurement Data from the Kapak Database
AU - Ulloa, Sthefano
AU - Riofrío, Daniel
AU - Grijalva, Felipe
AU - Fuertes, Mateo
AU - Guerrero, Melisa
AU - Vega-Sánchez, José
AU - Alba, Pavel
AU - Pérez-Pérez, Noel
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Public procurement systems generate large, complex datasets that can reveal corruption risks; however, analyzing these datasets is challenging due to unstructured formats, fragmented sources, and technical barriers. This paper introduces a data pipeline that streamlines the processing of procurement data from Ecuador's Official Public Procurement System (SOCE, for its acronym in Spanish), expanding on the Kapak project, which uses big data and data science to promote transparency. Specifically, Kapak implements a web crawler that periodically collects extensive procurement data from the SOCE website. Nevertheless, the raw format of the collected data, comprising base64-encoded USHAY files, fragmented documents, and scattered JSON files, hinders effective analysis. To address this, our pipeline automates decoding, file reconstruction, and dataset consolidation, thereby ensuring that the data is ready for analysis. Subsequently, we evaluated the pipeline on Reverse Electronic Auction (REA) documents, using TF-IDF and cosine similarity to detect patterns among high-risk procurement processes. The analysis specifically focused on technical specifications, bid documents, and stakeholder Question and Answer (Q&A) records. Overall, the pipeline offers a generalizable, modular framework that shifts focus from preprocessing to analysis. In addition, it supports integration with advanced Natural Language Processing (NLP) models, making it a valuable tool for corruption detection and public procurement oversight.
AB - Public procurement systems generate large, complex datasets that can reveal corruption risks; however, analyzing these datasets is challenging due to unstructured formats, fragmented sources, and technical barriers. This paper introduces a data pipeline that streamlines the processing of procurement data from Ecuador's Official Public Procurement System (SOCE, for its acronym in Spanish), expanding on the Kapak project, which uses big data and data science to promote transparency. Specifically, Kapak implements a web crawler that periodically collects extensive procurement data from the SOCE website. Nevertheless, the raw format of the collected data, comprising base64-encoded USHAY files, fragmented documents, and scattered JSON files, hinders effective analysis. To address this, our pipeline automates decoding, file reconstruction, and dataset consolidation, thereby ensuring that the data is ready for analysis. Subsequently, we evaluated the pipeline on Reverse Electronic Auction (REA) documents, using TF-IDF and cosine similarity to detect patterns among high-risk procurement processes. The analysis specifically focused on technical specifications, bid documents, and stakeholder Question and Answer (Q&A) records. Overall, the pipeline offers a generalizable, modular framework that shifts focus from preprocessing to analysis. In addition, it supports integration with advanced Natural Language Processing (NLP) models, making it a valuable tool for corruption detection and public procurement oversight.
KW - automated workflows
KW - big data
KW - corruption risk detection
KW - data pipeline
KW - Ecuador
KW - Kapak project
KW - Natural Language Processing (NLP)
KW - public accountability
KW - Public procurement
KW - Reverse Electronic Auction (REA)
KW - SOCE
KW - transparency
UR - https://www.scopus.com/pages/publications/105032511307
U2 - 10.1109/ETCM67548.2025.11304396
DO - 10.1109/ETCM67548.2025.11304396
M3 - Conference contribution
AN - SCOPUS:105032511307
T3 - ETCM 2025 - 9th Ecuador Technical Chapters Meeting
BT - ETCM 2025 - 9th Ecuador Technical Chapters Meeting
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th Ecuador Technical Chapters Meeting, ETCM 2025
Y2 - 21 October 2025 through 24 October 2025
ER -