TY - JOUR
T1 - Use and misuse of trait imputation in ecology: the problem of using out-of-context imputed values
AU - Gorné, Lucas Damián
AU - Aguirre-Gutiérrez, Jesús
AU - Souza, Fernanda C.
AU - Swenson, Nathan G.
AU - Kraft, Nathan Jared Boardman
AU - Schwantes Marimon, Beatriz
AU - Baker, Timothy R.
AU - Ferreira de Lima, Renato A.
AU - Vilanova, Emilio
AU - Álvarez-Dávila, Esteban
AU - Monteagudo Mendoza, Abel
AU - Flores Llampazo, Gerardo Rafael
AU - dos Santos, Rubens Manoel
AU - Boenisch, Gerhard
AU - Araujo-Murakami, Alejandro
AU - Rivas-Torres, Gonzalo
AU - Ramírez-Angulo, Hirma
AU - dos Santos Prestes, Nayane Cristina
AU - Morandi, Paulo S.
AU - Cerruto Ribeiro, Sabina
AU - Wesley, Wesley Jonatar
AU - Disney, Mathias
AU - Di Fiore, Anthony
AU - Marimon-Junior, Ben Hur
AU - Feldpausch, Ted R.
AU - Malhi, Yadvinder
AU - Phillips, Oliver L.
AU - Galbraith, David
AU - Díaz, Sandra
N1 - Publisher Copyright: © 2025 The Author(s). Ecography published by John Wiley & Sons Ltd on behalf of Nordic Society Oikos.
PY - 2025
Y1 - 2025
N2 - Despite the progress in the measurement and accessibility of plant trait information, acquiring sufficiently complete data from enough species to answer broad-scale questions in plant functional ecology and biogeography remains challenging. A common way to overcome this challenge is by imputation, or 'gap-filling', of trait values. This has proven appropriate when focusing on the overall patterns emerging from the database being imputed. However, some applications force the imputation procedure out of its original scope, using imputed values independently from the imputation context, and specific trait values for a given species are used as input for computing new variables. We tested the performance of three widely used imputation methods (Bayesian hierarchical probabilistic matrix factorization, multiple imputation by chained equations with predictive mean matching, and Rphylopars) on a database of tropical tree and shrub traits. By applying a leave-one-out procedure, we assessed the accuracy and precision of the imputed values and found that out-of-context use of imputed values may bias the estimation of different variables. We also found that low redundancy (i.e. low predictability of a new value on the basis of existing values) in the dataset, not uncommon for empirical datasets, is likely the main cause of low accuracy and precision in the imputed values. We therefore suggest the use of a leave-one-out procedure to test the quality of the imputed values before any out-of-context application of the imputed values, and make practical recommendations to avoid the misuse of imputation procedures. Furthermore, we recommend not publishing gap-filled datasets, publishing instead only the empirical data, together with the imputation method applied and the corresponding script to reproduce the imputation. This will help avoid the spread of imputed data, whose accuracy, precision, and source are difficult to assess and track, into the public domain.
KW - BHPMF
KW - Rphylopars
KW - gap-filling
KW - imputation
KW - mice
KW - plant trait
KW - sparse matrix
UR - http://www.scopus.com/inward/record.url?scp=85216987228&partnerID=8YFLogxK
U2 - 10.1111/ecog.07520
DO - 10.1111/ecog.07520
M3 - Article
AN - SCOPUS:85216987228
SN - 0906-7590
JO - Ecography
JF - Ecography
ER -