<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>2227-1899</journal-id>
<journal-title><![CDATA[Revista Cubana de Ciencias Informáticas]]></journal-title>
<abbrev-journal-title><![CDATA[Rev cuba cienc informat]]></abbrev-journal-title>
<issn>2227-1899</issn>
<publisher>
<publisher-name><![CDATA[Editorial Ediciones Futuro]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S2227-18992019000300091</article-id>
<title-group>
<article-title xml:lang="es"><![CDATA[Mejoras en la clasificación de interacciones de proteínas de secuencias de la Arabidopsis Thaliana utilizando técnicas de bases de datos desbalanceadas]]></article-title>
<article-title xml:lang="en"><![CDATA[Improvements in the classification of protein-protein interactions of Arabidopsis Thaliana sequences using unbalanced database techniques]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Chavez Cardenas]]></surname>
<given-names><![CDATA[María del Carmen]]></given-names>
</name>
<xref ref-type="aff" rid="Af1"/>
</contrib>
</contrib-group>
<aff id="Af1">
<institution><![CDATA[Universidad Central Marta Abreu de Las Villas]]></institution>
<addr-line><![CDATA[ ]]></addr-line>
<country>Cuba</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>09</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>09</month>
<year>2019</year>
</pub-date>
<volume>13</volume>
<numero>3</numero>
<fpage>91</fpage>
<lpage>106</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_arttext&amp;pid=S2227-18992019000300091&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_abstract&amp;pid=S2227-18992019000300091&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_pdf&amp;pid=S2227-18992019000300091&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="es"><p><![CDATA[RESUMEN Un reto de las comunidades científicas en el área del aprendizaje automatizado lo constituye una correcta clasificación en conjuntos de datos no balanceados. En problemas de Bioinformática es muy común tener grandes bases de casos, en la mayoría de las veces estas son desbalanceadas, siendo la clase minoritaria casi siempre la de principal interés de investigación. Varios métodos de aprendizaje automático se han desarrollado para hacer frente al problema de las clases no balanceadas. Se tienen técnicas al nivel de los algoritmos y otras enfocadas a los datos. Entre los métodos dirigidos al procesamiento de los datos se destacan los que se centran en intentar balancear los conjuntos, reduciendo la clase con mayor cantidad de ejemplos, o ampliando la de menor cantidad, conocidas como under-sampling y over-sampling respectivamente. Se pretende mejorar la clasificación para la base de datos de interacciones de proteínas para la planta Arabidopsis Thaliana obtenida por el Departamento de Biología de Sistemas de Plantas de la Universidad de Ghent, la cual presenta desbalance de clases. En este trabajo se realiza una experimentación aplicando un compendio de diferentes investigaciones orientadas a la edición de los conjuntos de entrenamiento con lo cual se logra mejorar la clasificación de interacciones de proteínas.]]></p></abstract>
<abstract abstract-type="short" xml:lang="en"><p><![CDATA[ABSTRACT Correct classification on imbalanced data sets remains a challenge for the machine learning community. Bioinformatics problems commonly involve large case bases, and most of these are imbalanced, with the minority class almost always being the one of main research interest. Several machine learning methods have been developed to address the class imbalance problem; some operate at the level of the algorithms and others focus on the data. Among the data-level methods, the most prominent try to balance the sets, either by reducing the class with more examples or by enlarging the class with fewer, known as under-sampling and over-sampling respectively. This work aims to improve classification on the protein-protein interaction database for the plant Arabidopsis thaliana obtained by the Department of Plant Systems Biology at the University of Ghent, which exhibits class imbalance. Experiments are carried out applying a compendium of different research approaches oriented to editing the training sets, with which the classification of protein-protein interactions is improved.]]></p></abstract>
<kwd-group>
<kwd lng="es"><![CDATA[Clasificación]]></kwd>
<kwd lng="es"><![CDATA[conjuntos de datos desbalanceados]]></kwd>
<kwd lng="es"><![CDATA[aprendizaje automatizado]]></kwd>
<kwd lng="es"><![CDATA[interacciones de proteínas]]></kwd>
<kwd lng="en"><![CDATA[Classification]]></kwd>
<kwd lng="en"><![CDATA[unbalanced data sets]]></kwd>
<kwd lng="en"><![CDATA[Machine Learning]]></kwd>
<kwd lng="en"><![CDATA[Protein-Protein Interactions]]></kwd>
</kwd-group>
</article-meta>
</front><back>
<ref-list>
<ref id="B1">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Albisua]]></surname>
<given-names><![CDATA[I. A]]></given-names>
</name>
</person-group>
<source><![CDATA[The quest for the optimal class distribution: An approach for enhancing the effectiveness of learning via resampling methods for imbalanced data sets]]></source>
<year>2013</year>
<page-range>45-63</page-range><publisher-name><![CDATA[Springer-Verlag Berlin Heidelberg]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Barandela]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Strategies for learning in class imbalance problems]]></article-title>
<source><![CDATA[Pattern Recognit]]></source>
<year>2003</year>
<volume>36</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>849-51</page-range></nlm-citation>
</ref>
<ref id="B3">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Batista]]></surname>
<given-names><![CDATA[G. P]]></given-names>
</name>
</person-group>
<source><![CDATA[A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data.]]></source>
<year>2002</year>
<page-range>139-46</page-range><publisher-name><![CDATA[International Conference on Machine Learning]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bonet]]></surname>
<given-names><![CDATA[C., I]]></given-names>
</name>
</person-group>
<source><![CDATA[Modelo para la clasificación de secuencias, en problemas de la bioinformática, usando técnicas de inteligencia artificial.]]></source>
<year>2008</year>
<publisher-loc><![CDATA[Cuba]]></publisher-loc>
<publisher-name><![CDATA[Universidad Central &#8220;Marta Abreu&#8221; de Las Villas]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Charte]]></surname>
<given-names><![CDATA[F. R]]></given-names>
</name>
</person-group>
<source><![CDATA[A First Approach to Deal with Imbalance in Multi-label Datasets.]]></source>
<year>2013</year>
<volume>8073</volume>
<page-range>150-60</page-range><publisher-name><![CDATA[Springer-Verlag Berlin Heidelberg]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B6">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Chávez]]></surname>
<given-names><![CDATA[M. d.]]></given-names>
</name>
</person-group>
<source><![CDATA[Modelos de Redes Bayesianas en el Estudio de Secuencias Genómicas y otros Problemas Biomédicos.]]></source>
<year>2008</year>
<publisher-loc><![CDATA[Cuba]]></publisher-loc>
<publisher-name><![CDATA[Universidad Central &#8220;Marta Abreu&#8221; de Las Villas]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Eibe]]></surname>
<given-names><![CDATA[F. M]]></given-names>
</name>
</person-group>
<source><![CDATA[The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann]]></source>
<year>2016</year>
<edition>Fourth Edition</edition>
</nlm-citation>
</ref>
<ref id="B8">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[García]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Modelo de un Sistema de Razonamiento Basado en Casos para el Análisis en la Gestión de Riesgos.]]></article-title>
<source><![CDATA[Serie Científica De La Universidad De Las Ciencias Informáticas]]></source>
<year>2011</year>
<volume>4</volume>
</nlm-citation>
</ref>
<ref id="B9">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[González]]></surname>
<given-names><![CDATA[O. M]]></given-names>
</name>
</person-group>
<source><![CDATA[Clasificadores Supervisados basados en Patrones Emergentes para Bases de Datos con Clases Desbalanceadas.]]></source>
<year>2014</year>
</nlm-citation>
</ref>
<ref id="B10">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Krawczyk]]></surname>
<given-names><![CDATA[B. W]]></given-names>
</name>
</person-group>
<source><![CDATA[Cost-sensitive decision tree ensembles for effective imbalanced classification.]]></source>
<year>2014</year>
</nlm-citation>
</ref>
<ref id="B11">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kubat]]></surname>
<given-names><![CDATA[M. M]]></given-names>
</name>
</person-group>
<source><![CDATA[Addressing the Curse of Imbalanced Training Sets: One-sided Selection.]]></source>
<year>1997</year>
<page-range>179-86</page-range></nlm-citation>
</ref>
<ref id="B12">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Laurikkala]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
</person-group>
<source><![CDATA[Improving identification of difficult small classes by balancing class distribution.]]></source>
<year>2001</year>
<page-range>63-6</page-range><publisher-name><![CDATA[Springer-Verlag Berlin Heidelberg]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B13">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[López]]></surname>
<given-names><![CDATA[V. F]]></given-names>
</name>
</person-group>
<source><![CDATA[On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed.]]></source>
<year>2014</year>
<volume>14</volume>
</nlm-citation>
</ref>
<ref id="B14">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mark]]></surname>
<given-names><![CDATA[H. E]]></given-names>
</name>
</person-group>
<source><![CDATA[The WEKA data mining software: an update.]]></source>
<year>2011</year>
<page-range>10-8</page-range><publisher-name><![CDATA[SIGKDD Explorations]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B15">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Menardi]]></surname>
<given-names><![CDATA[G. T]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Training and assessing classification rules with imbalanced data.]]></article-title>
<source><![CDATA[Data Mining and Knowledge Discovery]]></source>
<year>2014</year>
<volume>28</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>92-122</page-range></nlm-citation>
</ref>
<ref id="B16">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Moreno]]></surname>
<given-names><![CDATA[J. R]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[SMOTE-I: mejora del algoritmo SMOTE para balanceo de clases minoritarias.]]></article-title>
<source><![CDATA[Actas de Talleres de las Jornadas de Ingeniería de Software y Bases de Datos]]></source>
<year>2009</year>
<volume>3</volume>
<numero>1</numero>
<issue>1</issue>
</nlm-citation>
</ref>
<ref id="B17">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pérez]]></surname>
<given-names><![CDATA[G., D]]></given-names>
</name>
</person-group>
<source><![CDATA[Algoritmos supervisados para la detección de ortólogos con manejo del desbalance]]></source>
<year>2013</year>
</nlm-citation>
</ref>
<ref id="B18">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Quinlan]]></surname>
<given-names><![CDATA[J. K]]></given-names>
</name>
</person-group>
<source><![CDATA[Improved use of continuous attributes in C4.5.]]></source>
<year>2006</year>
<page-range>907-11</page-range></nlm-citation>
</ref>
<ref id="B19">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ramentol]]></surname>
<given-names><![CDATA[E. C]]></given-names>
</name>
</person-group>
<source><![CDATA[SMOTE-RSB: a hybrid preprocessing approach based on oversampling and under-sampling for high imbalanced data-sets using SMOTE and rough sets theory.]]></source>
<year>2009</year>
<publisher-name><![CDATA[Springer Verlag London]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B20">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Swets]]></surname>
<given-names><![CDATA[J. A]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Better decisions through science]]></article-title>
<source><![CDATA[Scientific American]]></source>
<year>2000</year>
<month>a</month>
<volume>283</volume>
<page-range>82-7</page-range></nlm-citation>
</ref>
<ref id="B21">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Swets]]></surname>
<given-names><![CDATA[J. A]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Psychological science can improve diagnostic decisions.]]></article-title>
<source><![CDATA[Psychological Science in the Public Interest]]></source>
<year>2000</year>
<month>b</month>
<volume>1</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>1-26</page-range></nlm-citation>
</ref>
<ref id="B22">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Tomek]]></surname>
<given-names><![CDATA[I]]></given-names>
</name>
</person-group>
<article-title xml:lang=""><![CDATA[Two modifications of CNN.]]></article-title>
<source><![CDATA[IEEE Transactions on Systems, Man and Cybernetics]]></source>
<year>1976</year>
<volume>SMC-6</volume>
<numero>11</numero>
<issue>11</issue>
<page-range>769-72</page-range></nlm-citation>
</ref>
<ref id="B23">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Urcelay]]></surname>
<given-names><![CDATA[G. B]]></given-names>
</name>
</person-group>
<source><![CDATA[Reconocimiento de Patrones]]></source>
<year>2014</year>
<page-range>1-27</page-range></nlm-citation>
</ref>
<ref id="B24">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Vapnik]]></surname>
<given-names><![CDATA[V]]></given-names>
</name>
</person-group>
<source><![CDATA[The Nature of Statistical Learning Theory.]]></source>
<year>1995</year>
<publisher-loc><![CDATA[New York]]></publisher-loc>
<publisher-name><![CDATA[Springer-Verlag New York]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B25">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Wilson]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
</person-group>
<source><![CDATA[Asymptotic Properties of Nearest Neighbor Rules Using Edited Data.]]></source>
<year>1972</year>
<page-range>408-21</page-range><publisher-name><![CDATA[IEEE Transactions on Systems, Man, and Cybernetics]]></publisher-name>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
