TY - JOUR

T1 - LEGO-based generalized set of two linear algebraic 3D bio-macro-molecular descriptors

T2 - Theory and validation by QSARs

AU - Marrero-Ponce, Yovani

AU - Teran, Julio E.

AU - Contreras-Torres, Ernesto

AU - García-Jacas, César R.

AU - Perez-Castillo, Yunierkis

AU - Cubillan, Nestor

AU - Pérez-Giménez, Facundo

AU - Valdés-Martini, José R.

N1 - Publisher Copyright:
© 2019

PY - 2020/1/21

Y1 - 2020/1/21

N2 - Novel 3D protein descriptors based on bilinear, quadratic and linear algebraic maps in R^n are proposed. The latter employs the kth 2-tuple (dis)similarity matrix to codify information related to covalent and non-covalent interactions in these biopolymers. The calculation of the inter-amino acid distances is generalized by using several dissimilarity coefficients, where normalization procedures based on the simple stochastic and mutual probability schemes are applied. A new local-fragment approach based on amino acid types and amino acid groups is proposed to characterize regions of interest in proteins. Topological and geometric macromolecular cutoffs are defined using local and total indices to highlight non-covalent interactions existing between the side chains of each amino acid. Moreover, the calculation of local and total indices is generalized through a LEGO approach that uses several aggregation operators. Collinearity and variability analyses are performed to evaluate every generalizing component applied to the definition of these novel indices. These experiments aim to reduce the number of molecular descriptors (MDs) used to build prediction models. The predictive power of the proposed indices was evaluated using two benchmark datasets: protein folding rate and protein secondary structural classification. The proposed MDs are modeled using the following strategies: Multiple Linear Regression (MLR) and Support Vector Machine (SVM), respectively. The best regression model developed for the folding rate of proteins yields a cross-validation coefficient of 0.875 (Test Set) and the best model developed for secondary structural classification obtained 98% of instances correctly classified (Test Set). These statistical parameters are superior to the ones obtained with existing MDs reported in the literature. Overall, the new theoretical generalization enhanced the information extraction into the MDs, allowing a better correlation between these two evaluated benchmark datasets and the proposed indices. The optimal theoretical configurations defined for the calculation of these MDs consider low collinearity and less information redundancy among them. These theoretical configurations and the software are available at http://tomocomd.com/mulims-mcompas.

KW - 3D-protein descriptor

KW - Aggregation operator

KW - Amino acid interaction

KW - Folding rate

KW - Machine learning

KW - Metrics

KW - Normalization procedure

KW - Protein structural classes

KW - Two-linear algebraic forms

UR - http://www.scopus.com/inward/record.url?scp=85073065776&partnerID=8YFLogxK

U2 - 10.1016/j.jtbi.2019.110039

DO - 10.1016/j.jtbi.2019.110039

M3 - Article

C2 - 31589877

AN - SCOPUS:85073065776

SN - 0022-5193

VL - 485

JO - Journal of Theoretical Biology

JF - Journal of Theoretical Biology

M1 - 110039

ER -