Using K-means algorithm for regression curve in big data system for business environment

Naoui, Mohammed Anouar; Lejdel, Brahim; Ayad, Mouloud


Revista Cubana de Ciencias Informáticas

Online version ISSN 2227-1899

Rev cuba cienc informat vol.14 no.2 Havana Apr.-Jun. 2020 Epub 01-Jun-2020

Original article

Using K-means algorithm for regression curve in big data system for business environment


Mohammed Anouar Naoui¹ (ORCID 0000-0003-1653-531X), Brahim Lejdel² (ORCID 0000-0003-1779-0689), Mouloud Ayad³* (ORCID 0000-0001-9858-8612)

¹ LIMPAF Laboratory, Computer Science Department, Faculty of Sciences and Applied Sciences, University of Bouira.

² Computer Science Department, University of El-Oued.

³ LPM3E Laboratory, Faculty of Sciences and Applied Sciences, University of Bouira.

ABSTRACT

Predictive analysis is quickly becoming a decisive advantage across a wide range of business activities. It involves methods and technologies that let organizations identify models or patterns in data. Big data brings enormous benefits to business processes. However, big data properties such as volume, velocity, variety, variation and veracity render existing data analysis techniques insufficient. Big data analysis requires fusing regression techniques from data mining with those of machine learning. Big data regression is an important field for many researchers, and several aspects, methods and techniques have been proposed. In this context, we propose regression curve models for big data systems. Our proposal is based on a cooperative MapReduce architecture. We offer Map and Reduce algorithms for curve regression: in the Map phase, data are transformed into a linear model; in the Reduce phase, a k-means algorithm clusters the results of the Map phase. K-means is one of the most popular partitional clustering algorithms; it is simple, statistical and considerably scalable, and its asymptotic running time is linear in any variable of the problem. This approach combines the advantages of regression and clustering methods in big data: the regression method extracts mathematical models, and the k-means algorithm selects the best mathematical models as clusters.

Key words: Cooperation MapReduce algorithm; Big Data; Regression Curve; k-means algorithm; Business environmental scanning


INTRODUCTION

Regression analysis (Golberg et al., 2004) is a statistical methodology that describes the relationship between variables (attributes). For example, in business marketing, regression analysis can explain the relationship between the price and quality of products, or predict the potential sales of a new product given its price. Regression analysis is mostly used for continuous values. Linear regression (Bollobás, 1990) describes the relationship between variables with a linear model. Let y be the response variable and x the predictor variable; the model is:

y = ax + b (1)

where a and b are estimated by the method of least squares, which minimizes the squared error and yields the best-fitting line. Let D be a data set, D = {(x1, y1), (x2, y2), ..., (xi, yi), ..., (xn, yn)}, with sample means $\bar{x}$ and $\bar{y}$. Then:

$b = \bar{y} - a\bar{x}$ (2)

$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ (3)
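As a minimal sketch of how the least-squares estimates of a and b in equation (1) can be computed, the following Python snippet uses the standard closed-form formulas for simple linear regression. The function name `fit_line` and the sample points are illustrative assumptions, not part of the original work.

```python
from statistics import fmean


def fit_line(points):
    """Return (a, b) for y = a*x + b using ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Sum of squared deviations of x; zero means a vertical set of points.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    a = sxy / sxx          # slope
    b = y_bar - a * x_bar  # intercept
    return a, b


# Illustrative data lying exactly on y = 2x + 1 (not taken from the paper).
D = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
a, b = fit_line(D)
print(a, b)
```

Computing the intercept from the slope and the two sample means keeps the fit to a single pass over the deviations, which is convenient when each node only holds a small sub data set.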

Multiple linear regression

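The relationship between a response variable y and more than one predictor variable x_1, ..., x_n is described by a linear model. The general equation is:

$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b$ (4)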

Non Linear Regression: Curve regression

Often the relationship between variables is far from linear. In that case curve models are the most widely used. Several mathematical models can describe a curve relationship, such as the power, exponential, logistic and polynomial models. Table 1 presents these curve models.

Table 1 Curve regression models.

Once the model to adopt has been chosen, the curve must be transformed into a linear relation. Several linearization methods are listed in Table 2:

Table 2 Linearization Curve regression models.
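To illustrate what linearization means in practice, here is a hedged Python sketch for two of the curve families named above. It assumes the textbook logarithmic transforms (ln y = ln A + B x for the exponential model y = A e^{Bx}, and ln y = ln A + B ln x for the power model y = A x^B); these may differ in notation from the entries of Table 2. All function names and data are hypothetical.

```python
import math
from statistics import fmean


def fit_line(xs, ys):
    """Least-squares slope and intercept for ys = slope * xs + intercept."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def fit_exponential(xs, ys):
    """Fit y = A * exp(B * x) by regressing ln(y) on x (requires y > 0)."""
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope


def fit_power(xs, ys):
    """Fit y = A * x**B by regressing ln(y) on ln(x) (requires x > 0, y > 0)."""
    slope, intercept = fit_line([math.log(x) for x in xs],
                                [math.log(y) for y in ys])
    return math.exp(intercept), slope


# Synthetic, noise-free data used only to check that parameters are recovered.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
A_exp, B_exp = fit_exponential(xs, [2.0 * math.exp(0.5 * x) for x in xs])
A_pow, B_pow = fit_power(xs, [3.0 * x ** 1.5 for x in xs])
print(A_exp, B_exp, A_pow, B_pow)
```

Note that fitting in log space minimizes the error of ln y rather than of y itself, so on noisy data the recovered parameters can differ from a direct nonlinear fit.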

Big data MapReduce Algorithms

MapReduce (Dean et al., 2010) primitives implement parallel processing. MapReduce is composed of two algorithms, Map and Reduce. The Map algorithm takes a set of data and converts it into another set of data: it takes (key, value) pairs and emits (key, value) pairs to the Reduce algorithm. The input of the Reduce algorithm is the output of the Map algorithm. A MapReduce system consists of a master, called the JobTracker, and a set of slave servers, called TaskTrackers (Shafer et al., 2010; Martha et al., 2013). Hadoop (Krishna, 2010) provides a MapReduce runtime with fault tolerance and dynamic flexibility support.
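The Map and Reduce contract described above can be sketched in a few lines of single-process Python. This is only an in-memory illustration of the (key, value) data flow, not Hadoop's API; the function names and the word-count task are assumptions chosen for brevity.

```python
from collections import defaultdict


def map_words(key, value):
    """Map: take a (line number, text) pair and emit one (word, 1) pair per word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_counts(key, values):
    """Reduce: take a word and all of its emitted counts, return (word, total)."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn, group intermediate pairs by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


records = [(0, "big data regression"), (1, "big data clustering")]
counts = run_mapreduce(records, map_words, reduce_counts)
print(counts)
```

In a real cluster the grouping step is performed by the framework between the Map and Reduce tasks, and each task may run on a different node.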

The essential question of our work is:

What model can represent a regression curve in a big data system?

This paper is organized as follows. Section 2 presents related work on linear models, curve regression and the k-means algorithm. Section 3 presents our proposal: the mathematical model, the Map and Reduce algorithms and the workflow architecture. Section 4 shows the validation and results of our proposal on the UniversalBank data set. Finally, section 6 concludes the paper.

RELATED WORK

Several studies address regression, linear or curve, in big data (Jun et al., 2015; Oancea et al., 2015; Ma et al., 2015; Neyshabouri et al., 2016). Some propose mathematical approaches for regression in big data (Jun et al., 2015; Ma et al., 2015; Neyshabouri et al., 2016). Others propose MapReduce algorithms and their implementation in big data systems (Oancea et al., 2015).

Linear model

Jun et al. (2015) presented a divided regression analysis using multiple linear regression, where the regression form is:

(5)

Fig. 1 Dividing Big Data using Sampling Method (^{Jun et al.,2015}).

The authors use random sampling to divide big data into sub-samples; they consider that all attributes have an equal chance of being selected in the sample (Figure 1). Oancea et al. (2015) present a way to solve linear regression in big data: they propose a MapReduce algorithm based on the least squares error and implement it with R-Studio and the RHadoop library. Ma et al. (2015) presented leveraging for big data regression. Leverage appears when a data point A is moved up or down and the corresponding fitted value moves proportionally; the proportionality constant is called the leverage effect (Figure 2).

Fig. 2 Example of leverage point A.

They propose two algorithms for linear regression, Weighted Leveraging and Unweighted Leveraging, and discuss the advantages of these algorithms in big data systems. Neyshabouri et al. (2016) present an algorithm for nonlinear regression in big data systems based on a lexicographical splitting graph (Wang et al., 2015). This algorithm divides n data into 2n possible partitions to construct a sequence of piecewise linear models and combines them (Willems et al., 1996). Another proposal relies on Cover's theorem (Cover, 1965), under which a training data set that is not linearly separable can be transformed nonlinearly into one that is linearly separable. This work divides the data set into a training data set and a test data set. The proposed algorithm generates a huge number (10^4 to 10^6) of random intermediate features from the predictor matrix of the training data set, and the training and test data sets are used to choose predictive intermediate features by regularized linear or logistic regression.

K-means algorithm

The k-means algorithm takes an input parameter k and partitions a set of attributes into k clusters. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity (Han, 2011).

The k-means algorithm computes the square error criterion:

$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$ (6)

where E is the sum of the square error for all attributes, p is the point in space representing a given attribute, and m_i is the mean of cluster C_i.
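The square error criterion can be made concrete with a short Python sketch: for each cluster C_i it computes the mean m_i and adds up the squared Euclidean distance of every point p in C_i to that mean. The function name and the sample clusters are illustrative assumptions.

```python
def squared_error(clusters):
    """Sum, over all clusters, of squared distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        if not points:
            continue  # an empty cluster contributes nothing
        dim = len(points[0])
        # m_i: coordinate-wise mean of the points in this cluster.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(
            sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points
        )
    return total


# Two illustrative clusters of 2-D points (not taken from the paper).
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(5.0, 5.0), (5.0, 7.0), (5.0, 9.0)],
]
E = squared_error(clusters)
print(E)
```

A lower E indicates more compact clusters, which is why the iteration of k-means stops once this value no longer decreases.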

PROPOSITION

Linear model for curve regression

Let X = {x_1, x_2, ..., x_n} be a data set following a curve model, divided into m sub data sets {x^1, x^2, ..., x^m} in a big data architecture. The first step of our mathematical model converts the curve model into a linear model by linearization, as presented in Table 2. Each sub data set in {x^1, x^2, ..., x^m} is converted into a linear model in {Z^1, Z^2, ..., Z^m}, where Z^i = {z^i_0, z^i_1, z^i_2, ..., z^i_l}. The general model of sub data set i is expressed in terms of y^i, a^i_j, z^i_j and b^i:

$y^i = \sum_{j=0}^{l} a^i_j z^i_j + b^i$ (7)

This step returns the vectors {v_1, v_2, ..., v_m}, where v_i = (a^i_0, a^i_1, a^i_2, ..., a^i_l, b^i) (Figure 3).

Fig. 3 Dividing Data and convert into linear model.

Map algorithm

The curve model data are divided among m nodes in the big data architecture. The Map algorithm transforms the data of each node into a linear model, as described in section 3.1.
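A rough sketch of this Map step is given below. It assumes, purely for illustration, that the curve is exponential (y = A e^{ax}), so that the linearized variable is z = ln y, and that the data are dealt round-robin to the nodes; the paper does not specify either choice. The helper names `split_into_nodes` and `map_linearize_and_fit` are hypothetical.

```python
import math
from statistics import fmean


def split_into_nodes(data, m):
    """Deal the records round-robin into m sub data sets (a stand-in for data nodes)."""
    return [data[i::m] for i in range(m)]


def map_linearize_and_fit(node_id, sub_data):
    """Map: linearize an assumed exponential curve and fit z = a*x + b with z = ln(y)."""
    xs = [x for x, _ in sub_data]
    zs = [math.log(y) for _, y in sub_data]
    x_bar, z_bar = fmean(xs), fmean(zs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / sxx
    b = z_bar - a * x_bar
    # Emit the node id as key and the parameter vector v_i = (a, b) as value.
    return node_id, (a, b)


# Synthetic data on y = 1.5 * exp(0.2 * x), split across 4 hypothetical nodes.
data = [(float(x), 1.5 * math.exp(0.2 * x)) for x in range(1, 21)]
vectors = [
    map_linearize_and_fit(i, sub)
    for i, sub in enumerate(split_into_nodes(data, 4))
]
print(vectors)
```

Each node only needs its own sub data set to produce its vector, so the Map calls are independent and can run in parallel.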

Select clusters by Reduce k-means algorithm

After the linear regression of each sub data set in node i has been determined, we apply the Reduce k-means algorithm. It performs hard clustering: each linear model is assigned to exactly one cluster, which makes it possible to select the best linear models. The Reduce k-means algorithm proceeds as follows. First, it randomly selects k of the vectors v_i (i = 1, ..., m), each of which initially represents a cluster mean or center. Each remaining vector v_i is assigned to the cluster C_j to which it is most similar, based on the distance between v_i and the cluster mean. The algorithm then computes the new mean of each cluster. This process iterates until the criterion function converges.
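The procedure just described corresponds to the classic k-means iteration, which could be sketched as follows for the parameter vectors returned by the Map phase. This is a minimal single-machine illustration under assumed names (`kmeans`, `dist2`) and made-up input vectors; it stops when assignments no longer change, which is one common way to detect convergence of the criterion.

```python
import random


def dist2(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, max_iter=100, seed=0):
    """Cluster vectors into k groups; return (centers, assignment)."""
    rng = random.Random(seed)
    # Pick k of the input vectors at random as the initial cluster centers.
    centers = [tuple(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # Hard assignment: each vector goes to its single nearest center.
        new_assignment = [
            min(range(k), key=lambda j: dist2(v, centers[j])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the means will not move
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = [v for v, c in zip(vectors, assignment) if c == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Illustrative (slope, intercept) vectors, not results from the paper.
vectors = [(0.040, -6.5), (0.041, -6.4), (0.033, -5.5),
           (0.032, -5.6), (0.037, -6.1), (0.036, -6.2)]
centers, assignment = kmeans(vectors, k=3)
print(centers, assignment)
```

Because the initial centers are chosen at random, different seeds can lead to different final clusters; running the algorithm several times and keeping the result with the lowest square error is a common remedy.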

Work flow of our architecture

The workflow architecture (Figure 4) presents the data nodes (Data Node 1, Data Node 2, ..., Data Node m) and the algorithms executed on them. The Map algorithm (Map algo 1, Map algo 2, ..., Map algo m) executes on each node to extract a linear model. In the Reduce phase, the algorithm (Reduce algo) extracts k clusters (C_1, C_2, ..., C_k). The resulting linear models are listed in Table 3.

Fig. 4 Work flow architecture.

Table 3 Results of linear models.

Apply k-means algorithm

The second step of our proposal applies the Reduce k-means algorithm. We select 3 clusters (k=3). Our algorithm takes the linear model parameters extracted by Map Algorithm 2 and constructs 3 clusters. For example, in node 5, C1 = (-6.496063, 0.0403190), C2 = (-5.524988, 0.0325660) and C3 = (-6.151511, 0.0368305). The results for m=4 to m=10 are shown in Figure 6.

Fig. 6 Results of k-means clustering for nodes (k=3).

DISCUSSION OF OUR APPROACH

Our approach is a complete approach to the regression problem in big data: it covers mathematical models, as in the works of (Jun et al., 2015; Ma et al., 2015; Neyshabouri et al., 2016), as well as a MapReduce algorithm and architecture, as in (Oancea et al., 2015). Moreover, our approach combines two important problems of data mining and machine learning: regression and clustering. The Map algorithm solves the curve regression problem by converting the curve model into a linear model, and the Reduce k-means algorithm handles the clustering problem. A big data architecture is composed of various nodes, and each node returns a linear model. Consequently, the Reduce k-means algorithm selects the best k clusters, which describe the linear models.

CONCLUSION AND FUTURE WORK

In this paper, we have proposed curve regression for big data systems. Data in our architecture are divided into sub data sets, each assigned to a node. The first algorithm of our approach converts the curve model into a linear model: each node converts its sub data set into a linear model. In the second step, we apply the k-means algorithm for each node in order to extract clusters. We validated our approach on the UniversalBank data set: we calculated the linear model parameters and obtained 3 clusters for each node. Our approach combines regression with clustering in a big data architecture: the results extracted by the Map algorithm are input to the Reduce k-means algorithm, which selects the clusters that best represent the regression model.

REFERENCES

Bollobás, B. Linear analysis. Cambridge: Cambridge University Press, 1990. 10p.

Cover, T. M. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 1965: (3), p. 326-334.

Dean, J., and Ghemawat, S. MapReduce: a flexible data processing tool. Communications of the ACM, 2010: 53(1), p. 72-77.

Golberg, M. A., and Cho, H. A. Introduction to regression analysis. WIT Press, 2004. 3p.

Han, J., Pei, J., and Kamber, M. Data mining: concepts and techniques. Elsevier, 2011.

Jun, S., Lee, S. J., and Ryu, J. B. A Divided Regression Analysis for Big Data. Statistics, 2015: 9(5).

Krishna, K. Open source implementation of MapReduce, 2010. [Online]. Teck Kaizen 2010. [Accessed on: April, 2019]. Available at: http://kktechkaizen.blogspot.com/2012/07/apache-hadoop-open-source-mapreduce.html

Ma, P., and Sun, X. Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics, 2015: 7(1), p. 70-76.

Naoui, M. A., Mcheick, H., and Kazar, O. Mobile Agent approach based on mobile strategic environmental Scanning using Android and JADELEAP systems. In Electrical and Computer Engineering (CCECE), 2014 IEEE 27th Canadian Conference on. IEEE, 2014: p. 1-7.

Neyshabouri, M. M., Demir, O., Delibalta, I., and Kozat, S. S. Highly efficient nonlinear regression for big data with lexicographical splitting. Signal, Image and Video Processing, 2016, p. 1-8.

Oancea, B. Linear Regression With R And HADOOP. International Conference: CKS Challenges of the Knowledge Soc, 2015, p. 1007.

Shafer, J., Rixner, S., and Cox, A. L. The hadoop distributed filesystem: Balancing portability and performance. In Performance Analysis of Systems and Software (ISPASS), 2010: p. 122-133.

Martha, V., Zhao, W., and Xu, X. h-MapReduce: A Framework for Workload Balancing in MapReduce. IEEE 27th International Conference on Advanced Information Networking and Applications, 2013: p. 637-644.

Wang, Y., Li, Y., Xiong, M., and Jin, L. Random Bits Regression: a Strong General Predictor for Big Data. arXiv preprint arXiv, 2015.

Willems, F. M., Shtarkov, Y. M., and Tjalkens, T. J. Context weighting for general finite-context sources. IEEE Transactions on Information Theory, 1996: 42(5), p. 1514-1520.

Received: December 16, 2019; Accepted: March 31, 2020

* Corresponding author: manouarn@yahoo.com

There is no conflict of interest in this work.

Mohammed Anouar Naoui: Contributed to the proposed approach, covering the architecture and the algorithm.

Brahim Lejdel: Contributed to the supervision and improvement of the architecture.

Mouloud Ayad: Contributed to the co-supervision and improvement of the algorithm.