SciELO - Scientific Electronic Library Online

 


Revista Cubana de Ciencias Informáticas

On-line version ISSN 2227-1899

Abstract

NUNEZ-ARCIA, Yaisel; DIAZ-DE-LA-PAZ, Lisandra and GARCIA-MENDOZA, Juan Luis. Algorithm to correct instance level anomalies in large volumes of data using MapReduce. Rev cuba cienc informat [online]. 2016, vol.10, n.3, pp.105-118. ISSN 2227-1899.

Instance-level data quality problems directly affect the decision making and performance of organizations. As the volume of information grows, so does the probability that such problems occur in the data. This paper presents an algorithm to correct instance-level anomalies in big data sources with a semi-structured or structured format. K-means was used as the clustering method, a modification of the Levenshtein distance was applied to calculate the edit distance between strings, and the MapReduce model for distributed programming was used to handle the volume of data. To improve data quality, four phases were proposed: identification of the data source type, the data format, and the problem to be solved; pre-processing of the input data; data clustering; and data cleansing.

Keywords: data quality; data cleansing; big data; K-means algorithm; MapReduce.
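
The clustering and cleansing phases described in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation. It rests on several assumptions: the standard Levenshtein distance stands in for the paper's modified variant, which the abstract does not specify; the cluster center is taken to be the medoid (the member closest to all others), since a mean of strings is not defined; and the map and reduce steps are plain Python functions rather than jobs on a MapReduce cluster. All function names are hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein edit distance (assumption: the paper uses a modified variant)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def map_assign(value, centers):
    """Map step: emit (index of the nearest center, value)."""
    nearest = min(range(len(centers)),
                  key=lambda k: levenshtein(value, centers[k]))
    return nearest, value


def reduce_medoid(members):
    """Reduce step: pick the member with the smallest total distance to its group."""
    return min(members,
               key=lambda m: sum(levenshtein(m, other) for other in members))


def cluster_and_clean(values, centers, iterations=5):
    """Run a few map/reduce rounds, then map every value to its cluster center."""
    centers = list(centers)
    for _ in range(iterations):
        groups = defaultdict(list)
        for value in values:
            key, member = map_assign(value, centers)
            groups[key].append(member)
        # Keep the old center when a cluster received no members.
        centers = [reduce_medoid(groups[k]) if groups[k] else centers[k]
                   for k in range(len(centers))]
    return {value: centers[map_assign(value, centers)[0]] for value in values}


if __name__ == "__main__":
    records = ["Havana", "Habana", "Havanna", "Havana",
               "Santiago", "Santiago", "Santigo"]
    cleaned = cluster_and_clean(records, centers=["Habana", "Santigo"])
    for raw, fixed in cleaned.items():
        print(f"{raw} -> {fixed}")
```

In this toy run the misspelled variants are replaced by the most representative spelling of their cluster. In a distributed setting, the map step would run in parallel over partitions of the input and the reduce step would run once per cluster key.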


 

All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.