Feature integration and semi-supervised learning for functional enzyme classification using Spark K-means

González Valle, Yadelis; Galpert, Deborah; Molina-Ruiz, Reinaldo; Aguero-Chapin, Guillermin

Revista Cubana de Ciencias Informáticas

On-line version ISSN 2227-1899

Abstract

GONZALEZ VALLE, Yadelis; GALPERT, Deborah; MOLINA-RUIZ, Reinaldo and AGUERO-CHAPIN, Guillermin. Feature integration and semi-supervised learning for functional enzyme classification by using Spark K-means. Rev cuba cienc informat [online]. 2020, vol.14, n.4, pp. 134-161. Epub Dec 01, 2020. ISSN 2227-1899.

The functional classification of enzymes has been a field of great interest in bioinformatics for several years. Any such classification must cope with the scarce information available for some classes, the imbalance between classes, and the growing number of enzymes to be classified. In this article we investigate the use of semi-supervised and unsupervised clustering algorithms to group similar enzyme sequences, based on the integration of alignment-free protein descriptors computed with the k-mers method for different values of k. Four algorithms that group enzymes according to their enzymatic function were implemented in Spark. They are based on transformations of existing methods: Global Logic Combinatorial, K-means, and Ensemble Clustering. Clustering quality was measured with the silhouette index as an internal measure and the F-measure as an external measure. In the experiment, 58 functionally characterized sequences out of 501 enzymes of the Glycosyl Hydrolase-70 (GH-70) family from the CAZy database were taken as the reference for comparing the results of the implemented clustering methods. This family has high biotechnological value and can cause losses in the millions in sugar production. The silhouette index values obtained were moderate, but better than those obtained with the K-means method. The best F-measure, 0.9, was achieved by the Ensemble Clustering method combined with semi-supervised learning.
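To make the idea of integrating alignment-free k-mer descriptors for several values of k more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names `kmer_frequencies` and `integrated_descriptor`, the use of relative frequencies, the restriction to the 20 standard amino acids, and the choice of simple concatenation as the integration step are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids form the k-mer vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_VALID = set(AMINO_ACIDS)


def kmer_frequencies(sequence, k):
    """Relative frequency of every possible k-mer, in a fixed lexicographic order.

    Windows containing a non-standard residue (for example 'X') are skipped.
    """
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    windows = [w for w in windows if set(w) <= _VALID]
    counts = Counter(windows)
    total = max(len(windows), 1)  # avoid division by zero for short sequences
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[kmer] / total for kmer in vocabulary]


def integrated_descriptor(sequence, k_values=(1, 2)):
    """Concatenate the k-mer frequency vectors for each k (hypothetical integration step)."""
    vector = []
    for k in k_values:
        vector.extend(kmer_frequencies(sequence, k))
    return vector


# Toy sequence, not taken from the GH-70 data set.
descriptor = integrated_descriptor("ACAC", k_values=(1, 2))
print(len(descriptor))  # 20 one-mers + 400 two-mers
```

Fixed-length vectors of this kind are the sort of input that a K-means implementation, such as the one in Spark MLlib, can cluster directly. How the article weights or combines the descriptors for each k is not stated in the abstract.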

Keywords: Enzyme clustering; K-means centroids; unsupervised learning; semi-supervised learning.
