<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1815-5928</journal-id>
<journal-title><![CDATA[Ingeniería Electrónica, Automática y Comunicaciones]]></journal-title>
<abbrev-journal-title><![CDATA[EAC]]></abbrev-journal-title>
<issn>1815-5928</issn>
<publisher>
<publisher-name><![CDATA[Universidad Tecnológica de La Habana José Antonio Echeverría, Cujae]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1815-59282017000200007</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[An approach to the numerical solution of one-dimensional heat equation on SoC FPGA]]></article-title>
<article-title xml:lang="es"><![CDATA[Computación con esténcil para la aproximación a la solución numérica de la ecuación de calor sobre SoC FPGA]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Castaño]]></surname>
<given-names><![CDATA[Luis]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Osorio]]></surname>
<given-names><![CDATA[Gustavo]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Universidad Nacional de Colombia, Facultad de Ingeniería y Arquitectura]]></institution>
<addr-line><![CDATA[Manizales ]]></addr-line>
<country>Colombia</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>08</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>08</month>
<year>2017</year>
</pub-date>
<volume>38</volume>
<numero>2</numero>
<fpage>83</fpage>
<lpage>93</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_arttext&amp;pid=S1815-59282017000200007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_abstract&amp;pid=S1815-59282017000200007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_pdf&amp;pid=S1815-59282017000200007&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[Stencil computation is a common kernel in scientific computing. FPGA-based heterogeneous systems have been used to overcome the performance limitations that memory bandwidth imposes on stencil algorithms in CPU- and GPU-based systems. The performance improvement is achieved by combining several data-flow optimization techniques that take advantage of the inherent parallelism of the FPGA. However, the array architectures used for some two-dimensional problems require a considerable number of FPGAs for mesh sizes that a CPU- or GPU-based system can handle with suitable performance at a lower cost. With the development of high-level synthesis tools, algorithms are implemented on FPGAs with a better design flow than traditional logic design. In that case, optimization techniques are applied at the software level. This paper presents a system designed to evaluate the performance of a stencil computation algorithm on an SoC FPGA at the hardware level. The data path performs the stencil computation using a one-dimensional array of processing elements and registers. System performance is evaluated for the numerical approximation of the solution of a heat transfer problem modeled with the one-dimensional heat equation. The proposed architectures are implemented on a ZedBoard Zynq Evaluation and Development Kit using Vivado Design Suite and Xilinx SDK.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[La computación con esténcil es un esquema muy usado en la computación científica. Se han desarrollado sistemas heterogéneos basados en FPGA para superar las limitaciones debidas al ancho de banda de memoria en los sistemas computacionales basados en CPU o GPU. El mejoramiento del desempeño es logrado mediante el uso de varias técnicas de optimización del flujo de datos, tomando ventaja del paralelismo inherente de los FPGA. Sin embargo, las arquitecturas usadas en problemas bidimensionales involucran el uso de una cantidad considerable de FPGA, para tamaños de malla que pueden ser procesados en sistemas basados en CPU o GPU con un desempeño aceptable a menor costo. Con el desarrollo de herramientas de diseño de alto nivel, la implementación de algoritmos sobre FPGA es realizada con un mejor flujo de diseño que con el diseño lógico tradicional. En este caso las técnicas de optimización se desarrollan a nivel de software. En este documento se presenta un sistema diseñado para evaluar el desempeño de la computación con esténcil sobre una FPGA a nivel de hardware. El camino de datos es diseñado para el empleo de un arreglo unidimensional de elementos de proceso y registros para reducir el número de operaciones de transferencia de datos de memoria. El desempeño del sistema es evaluado para la aproximación a la solución numérica de un problema de transferencia de calor, modelado con la ecuación de calor para el caso unidimensional. Las arquitecturas propuestas son implementadas sobre una ZedBoard empleando Vivado y el Xilinx SDK.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[FPGA]]></kwd>
<kwd lng="en"><![CDATA[stencil computation]]></kwd>
<kwd lng="en"><![CDATA[heat equation]]></kwd>
<kwd lng="en"><![CDATA[finite differences]]></kwd>
<kwd lng="es"><![CDATA[FPGA]]></kwd>
<kwd lng="es"><![CDATA[computación con esténcil]]></kwd>
<kwd lng="es"><![CDATA[ecuación de calor]]></kwd>
<kwd lng="es"><![CDATA[diferencias finitas]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <p align="right"><font face="Verdana" size="2"><b>ORIGINAL PAPER</b></font></p>     <p align="justify">&nbsp;</p>     <p align="justify">&nbsp; </p> 	    <p align="justify"><font size="4"><strong><font face="verdana">An approach to the numerical solution of one&#45;dimensional heat equation on SoC FPGA</font></strong></font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><font face="verdana" size="3"><b>Computaci&oacute;n con est&eacute;ncil para la aproximaci&oacute;n a la soluci&oacute;n num&eacute;rica de la ecuaci&oacute;n de calor sobre SoC FPGA</b></font></p>  	    <p align="justify">&nbsp;</p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><font face="verdana" size="2"><b>Luis Casta&ntilde;o, Gustavo Osorio</b></font></p> 	    <p align="justify"><font face="verdana" size="2">Facultad de Ingenier&iacute;a y Arquitectura, Universidad Nacional de Colombia, Manizales, Colombia.</font></p> 	    ]]></body>
<body><![CDATA[<p align="justify">&nbsp;</p> 	    <p align="justify">&nbsp;</p> 	<hr align="JUSTIFY" size="1" noshade>     <p align="justify"><font face="verdana" size="2"><b>ABSTRACT</b></font></p>  	    <p align="justify"><font face="verdana" size="2">Stencil computation is a common kernel in scientific computing. FPGA&#45;based heterogeneous systems have been used to overcome the performance limitations that memory bandwidth imposes on stencil algorithms in CPU&#45; and GPU&#45;based systems. The performance improvement is achieved by combining several data&#45;flow optimization techniques that take advantage of the inherent parallelism of the FPGA. However, the array architectures used for some two&#45;dimensional problems require a considerable number of FPGAs for mesh sizes that a CPU&#45; or GPU&#45;based system can handle with suitable performance at a lower cost. With the development of high&#45;level synthesis tools, algorithms are implemented on FPGAs with a better design flow than traditional logic design. In that case, optimization techniques are applied at the software level. This paper presents a system designed to evaluate the performance of a stencil computation algorithm on an SoC FPGA at the hardware level. The data path performs the stencil computation using a one&#45;dimensional array of processing elements and registers. System performance is evaluated for the numerical approximation of the solution of a heat transfer problem modeled with the one&#45;dimensional heat equation. 
The proposed architectures are implemented in a ZedBoard Zynq Evaluation and Development Kit using Vivado Design Suite and Xilinx SDK.</font></p>  	    <p align="justify"><font face="verdana" size="2"><b>Key words: </b>FPGA, stencil computation, heat equation, finite differences</font></p>  	<hr align="JUSTIFY" size="1" noshade>     <p align="justify"><font face="verdana" size="2"><b>RESUMEN</b></font></p>  	    <p align="justify"><font face="verdana" size="2">La computaci&oacute;n con est&eacute;ncil es un esquema muy usado en la computaci&oacute;n cient&iacute;fica. Se han desarrollado sistemas heterog&eacute;neos basados en FPGA para superar las limitaciones debidas al ancho de banda de memoria en los sistemas computacionales basados en CPU o GPU. El mejoramiento del desempe&ntilde;o es logrado mediante el uso de varias t&eacute;cnicas de optimizaci&oacute;n del flujo de datos, tomando ventaja del paralelismo inherente de los FPGA. Sin embargo, las arquitecturas usadas en problemas bidimensionales involucran el uso de una cantidad considerable de FPGA, para tama&ntilde;os de malla que pueden ser procesados en sistemas basados en CPU o GPU con un desempe&ntilde;o aceptable a menor costo. Con el desarrollo de herramientas de dise&ntilde;o de alto nivel, la implementaci&oacute;n de algoritmos sobre FPGA es realizada con un mejor flujo de dise&ntilde;o que con el dise&ntilde;o l&oacute;gico tradicional. En este caso las t&eacute;cnicas de optimizaci&oacute;n se desarrollan a nivel de software. En este documento se presenta un sistema dise&ntilde;ado para evaluar el desempe&ntilde;o de la computaci&oacute;n con est&eacute;ncil sobre una FPGA a nivel de hardware. El camino de datos es dise&ntilde;ado para el empleo de un arreglo unidimensional de elementos de proceso y registros para reducir el n&uacute;mero de operaciones de transferencia de datos de memoria. 
El desempe&ntilde;o del sistema es evaluado para la aproximaci&oacute;n a la soluci&oacute;n num&eacute;rica de un problema de transferencia de calor, modelado con la ecuaci&oacute;n de calor para el caso unidimensional. Las arquitecturas propuestas son implementadas sobre una ZedBoard empleando Vivado y el Xilinx SDK.</font></p>  	    <p align="justify"><font face="verdana" size="2"><b>Palabras claves:</b> FPGA, computaci&oacute;n con est&eacute;ncil, ecuaci&oacute;n de calor, diferencias finitas</font></p>  	<hr align="JUSTIFY" size="1" noshade>     <p align="justify">&nbsp;</p>     <p align="justify">&nbsp;</p>     ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="3"><b>1.&#45; INTRODUCTION</b></font></p>     <p align="justify">&nbsp;</p>  	    <p align="justify"><font face="verdana" size="2">Stencil computation is a common kernel in scientific computing, particularly in linear algebra algorithms, partial differential equations (PDEs), and image processing. It is an efficient way to approximate the solution of a PDE with the explicit finite difference scheme &#91;10&#93;. However, on multi&#45;core CPU and GPU based systems the performance of stencil&#45;based algorithms is limited by the gap between peak computational throughput and peak memory bandwidth &#91;3&#93;. For this reason, the implementation and optimization of stencil algorithms have been of interest. Cache&#45;based optimization techniques have been developed for CPU and GPU based systems to overcome these limitations by exploiting temporal and spatial locality, as can be found in &#91;5, 6, 9, 11&#93;. However, some performance limitations remain despite the use of these optimization methods &#91;3&#93;.</font></p>  	    <p align="justify"><font face="verdana" size="2">FPGA&#45;based accelerators are therefore used as an alternative, given that these devices have shown better performance with lower power consumption &#91;1, 2&#93;. The performance improvement of the stencil computing scheme using FPGAs is studied in &#91;1, 3, 4, 7, 8&#93;. FPGA&#45;based systems take advantage of their inherent parallelism and improve performance by combining several data&#45;flow optimization techniques. For instance, the grid array architectures proposed in &#91;3, 7&#93; use streaming and pipelining to accelerate stencil computation. 
However, such architectures require a considerable number of FPGAs to simulate problems with mesh sizes that a CPU or GPU can handle with suitable performance at a lower cost.</font></p>  	    <p align="justify"><font face="verdana" size="2">The use of FPGA&#45;based systems has always posed multiple challenges, from number representation to the complexity of the design flow. The recent development of design tools has made it possible to overcome many of these challenges. In &#91;1&#93;, Schmitt <i>et al.</i> demonstrate the feasibility of handling a stencil computation for a 4096&times;4096 grid on a single FPGA using a High&#45;Level Synthesis (HLS) tool for system design. In that case, optimization techniques are applied at the software level.</font></p>  	    <p align="justify"><font face="verdana" size="2">In this work, we present a hardware&#45;level implementation and optimization of a stencil algorithm on an SoC FPGA. A custom IP in the programmable logic (PL) section interacts with an ARM core that acts as the host processor. System performance is evaluated for the numerical approximation of the solution of the one&#45;dimensional heat equation on a single FPGA. A baseline architecture uses a stencil kernel as the processing element (PE), and a control unit sequences the stencil algorithm. To parallelize the system, a data path is developed with a one&#45;dimensional array of processing elements with feedback through a register bank, with the aim of reducing resource utilization and memory transfer operations. The system analysis shows the performance achieved, in terms of processing time for the stencil algorithm, with four implemented architectures. The processing time is compared with that of the sequential algorithm written in C running on one of the ARM Cortex&#45;A9 cores of the SoC FPGA. The problem and the system are described in sections 2 and 3, respectively. 
Numerical results and performance analysis are presented in sections 4 and 5. Finally, conclusions and future work are presented in section 6.</font></p>  	    <p align="justify">&nbsp;</p>  	    <p align="justify"><font face="verdana" size="3"><b>2.&#45; PROBLEM DESCRIPTION</b></font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><font face="verdana" size="2">Consider the PDE shown in <a href="#ec1">(1)</a>.</font></p>  	    ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2"><a name="ec1"/><img width="227" height="29" src="/img/revistas/eac/v38n2/e0107217.gif"></font></p>  	    
<p align="justify"><font face="verdana" size="2">This expression is a 1D parabolic PDE that models the heat distribution over time in a bar of length L. Given the initial&#45;value and boundary&#45;condition problem shown in <a href="#ec2">(2)</a>, the solution of the equation gives the temperature variation over the space&#45;time domain.</font></p>  	    <p align="justify"><font face="verdana" size="2"><a name="ec2"/><img width="176" height="52" src="/img/revistas/eac/v38n2/e0207217.gif"></font></p>  	    
<p align="justify"><font face="verdana" size="2">An approximation to the solution of this equation is obtained with the explicit finite difference method. Defining <i>J</i> and <i>N</i> as the number of discretization points in the space and time domains, respectively, the approximate solution is obtained using <a href="#ec3">(3)</a>. From this expression, the stencil kernel circuit shown in <a href="#fig1">Figure 1</a> is obtained.</font></p>  	    <p align="justify"><font face="verdana" size="2"><a name="ec3"/><img width="269" height="24" src="/img/revistas/eac/v38n2/e0307217.gif"></font></p>  	    
<p align="justify"><font face="verdana" size="2">This kernel is used to calculate each one of the mesh points as shown in <a href="#alg1">Algorithm 1</a>.</font><font face="verdana" size="2">&nbsp;</font></p>  	    <p align="justify"><font face="verdana" size="2"><a name="alg1"/><b>Algorithm 1.</b> Pseudo&#45;code for the stencil computation to obtain the approach to the numerical solution of heat equation with the explicit scheme.</font></p>  	<hr align="JUSTIFY">  	    <p align="justify"><font face="verdana" size="2">for n from 0 to N&#45;1 do</font></p>  	    <p align="justify"><font face="verdana" size="2">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for j from 1 to J&#45;2 do</font></p>  	    <p align="justify"><font face="verdana" size="2">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img width="191" height="17" src="/img/revistas/eac/v38n2/i0107217.gif"></font></p>  	    
]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;end for</font></p>  	    <p align="justify"><font face="verdana" size="2">end for</font></p>  	    <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>  	    <p align="center"><a name="fig1"/><img src="/img/revistas/eac/v38n2/f0107217.jpg"> 	    
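<p align="justify"><font face="verdana" size="2">As an illustration of Algorithm 1, the following Python sketch applies the explicit scheme to a small mesh. It assumes that the kernel has the usual forward&#45;time, centred&#45;space form u<sub>j</sub><sup>n+1</sup> = r(u<sub>j&#45;1</sub><sup>n</sup> + u<sub>j+1</sub><sup>n</sup>) + (1 &#45; 2r)u<sub>j</sub><sup>n</sup>, where r stands for the constant factor of the scheme. The names <i>stencil_kernel</i>, <i>solve_heat_1d</i> and <i>r</i> are hypothetical and are not taken from the implementation described in this paper.</font></p>

```python
def stencil_kernel(left, center, right, r):
    """Three-point stencil: two multiplications and two additions.

    Assumed form of the update (not read from the equation image):
    u_new = r * (left + right) + (1 - 2r) * center, where (1 - 2r) would
    be a precomputed constant in a hardware kernel.
    """
    return r * (left + right) + (1.0 - 2.0 * r) * center


def solve_heat_1d(u0, n_steps, r):
    """Return the full mesh as a list of rows, one row per time step.

    u0      : initial values, including the two fixed boundary points
    n_steps : number of time steps to compute (N in Algorithm 1)
    r       : hypothetical name for the scheme constant; the explicit
              scheme is stable for r <= 0.5
    """
    mesh = [list(u0)]
    for _ in range(n_steps):
        prev = mesh[-1]
        new = list(prev)  # boundary points are copied unchanged
        for j in range(1, len(prev) - 1):  # interior points 1 .. J-2
            new[j] = stencil_kernel(prev[j - 1], prev[j], prev[j + 1], r)
        mesh.append(new)
    return mesh


if __name__ == "__main__":
    # Small example: a single hot point in the middle of a 5-point bar.
    for row in solve_heat_1d([0.0, 0.0, 4.0, 0.0, 0.0], n_steps=2, r=0.25):
        print(row)
```

<p align="justify"><font face="verdana" size="2">Each row of the returned mesh corresponds to one time step, so the nested loops map directly onto the outer (time) and inner (space) loops of the pseudo&#45;code.</font></p>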
<p align="justify"><font face="verdana" size="2">In this work, the Algorithmic State Machine (ASM) method is used for the Register Transfer Level (RTL) logic design of a circuit that performs the stencil algorithm on an SoC FPGA. Variations of the data path are made to evaluate the improvement in system performance.</font></p>  	    <p align="justify">&nbsp;</p>  	    <p align="justify"><font face="verdana" size="3"><b>3.&#45; SYSTEM DESCRIPTION</b></font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><font face="verdana" size="2">The system is implemented on a ZedBoard Zynq Evaluation and Development Kit using Vivado Design Suite. The design takes advantage of the architecture of the Xilinx XC7Z020CLG484&#45;1 SoC FPGA. The ZYNQ&#45;7 processing system (PS), which contains the ARM Cortex&#45;A9 MPCore, interacts with a custom IP created for the programmable logic (PL) section. The system block diagram is shown in <a href="#fig2">Figure 2</a>.</font></p>  	    <p align="center"><a name="fig2"/><img src="/img/revistas/eac/v38n2/f0207217.jpg"> 	    
]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">The ARM core acts as the host processor where the main application runs. The custom IP computes the approximate numerical solution of the PDE using the stencil scheme. It is fully described in VHDL and connected to the ZYNQ&#45;7 PS in a block design created with the Vivado IP integrator tool. The PS and PL sections communicate through an AXI4&#45;Lite interface with a fixed 32&#45;bit data width. Numbers are represented in a customized 32&#45;bit floating&#45;point format with rounding to nearest. The floating&#45;point adders and multipliers used in the stencil kernel are described as combinational circuits; therefore, they add no output latency in terms of system clock cycles. The control unit is a finite state machine that coordinates the sequence of the stencil algorithm.</font></p>  	    <p align="justify"><font face="verdana" size="2">The source code for the PS is written in C using the Xilinx Software Development Kit (SDK). The mesh size <i>(J &times; N)</i> is defined through the serial console. The initial values and boundary conditions are set and written to the RAM by the host application. A control signal is sent to the control unit of the custom IP to start processing the data in memory. A status signal from the control unit tells the host application that the process has finished. The results are then read from the PL into the PS and stored in a text file on the SD card.</font></p>  	    <p align="justify"><font face="verdana" size="2">A baseline architecture is designed for a sequential implementation of the stencil algorithm. 
To optimize the system performance for the implemented stencil algorithm and exploit the FPGA features, variations of this architecture are proposed.</font></p>  	    <p align="justify"><font face="verdana" size="2"><b>3.1.&#45; BASELINE ARCHITECTURE</b></font></p>  	    <p align="justify"><font face="verdana" size="2">In this architecture (A<sub>1</sub>), the registers <i>R0</i>, <i>R1</i>, and <i>R2</i> are connected in cascade to allow data streaming from memory. The term u<sub>j</sub><sup>n+1</sup> calculated by the stencil kernel is saved to the RAM through a multiplexer. The block diagram is shown in <a href="#fig3">Figure 3</a>.</font></p>  	    <p align="center"><a name="fig3"/><img src="/img/revistas/eac/v38n2/f0307217.jpg">  	    
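<p align="justify"><font face="verdana" size="2">The data streaming through the cascaded registers can be pictured with a small behavioural model. The Python sketch below is only an illustration under assumptions: it treats <i>R0</i>, <i>R1</i> and <i>R2</i> as a three&#45;stage shift register fed with one memory word per step, and it produces a stencil window once all three registers hold valid data. The function name, the order in which the registers map to the left, centre and right points, and the one&#45;word&#45;per&#45;step timing are hypothetical and are not taken from the VHDL description.</font></p>

```python
def stream_windows(memory_row):
    """Yield (index, left, center, right) while data shift through R0-R1-R2.

    Behavioural model only: one word is read from memory per step and the
    registers shift in cascade (R2 <- R1 <- R0 <- memory word).
    """
    r0 = r1 = r2 = None
    for address, word in enumerate(memory_row):
        r2, r1, r0 = r1, r0, word  # cascade shift
        if r2 is not None:
            # R1 holds the point being updated; its index is address - 1.
            yield address - 1, r2, r1, r0


if __name__ == "__main__":
    for j, left, center, right in stream_windows([10.0, 20.0, 30.0, 40.0, 50.0]):
        print(j, left, center, right)
```

<p align="justify"><font face="verdana" size="2">In this model every memory word is read exactly once per time step, and each interior point j receives its two neighbours without any additional memory access.</font></p>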
<p align="justify"><font face="verdana" size="2">The flow chart for the stencil algorithm implemented is shown in <a href="#fig4">Figure 4</a>. Operations outside the dashed line are executed in PS and those found within the dashed line are executed in PL.</font></p>  	    <p align="center"><a name="fig4"/><img src="/img/revistas/eac/v38n2/f0407217.jpg">     	    
<p align="justify"><font face="verdana" size="2">A performance counter is used to determine the number of clock cycles used to calculate all mesh points. This amount also can be calculated from the state machine sequence as shown in <a href="#ec4">(4)</a>.</font></p>  	    <p align="justify"><font face="verdana" size="2"><a name="ec4"/><img width="258" height="23" src="/img/revistas/eac/v38n2/e0407217.gif"></font></p>  	    
]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2"><b>3.2.&#45; HARDWARE LEVEL OPTIMIZATIONS</b></font></p>  	    <p align="justify"><font face="verdana" size="2">To increase the number of space&#45;domain points that can be processed in one clock cycle, more registers and PEs are used. <a href="#fig5">Figure 5</a> shows the implementation for an 8&times;N mesh, with six PEs and a register bank for eight values. The control unit for this architecture (A<sub>2</sub>) has fewer states because the J terms for the time step <i>n+1</i> are obtained concurrently. To keep the results available for calculating the values of the next time step without RAM access, the PE outputs are also stored at the same time in the register bank through internal multiplexers. With this configuration, the inner loop of the algorithm is simplified.</font></p>  	    <p align="center"><a name="fig5"/><img src="/img/revistas/eac/v38n2/f0507217.jpg">  	    
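<p align="justify"><font face="verdana" size="2">A software analogy of this register bank with feedback is sketched below in Python. It assumes one PE per interior point (six PEs for an eight&#45;value bank) and a kernel of the hypothetical form r(u<sub>j&#45;1</sub> + u<sub>j+1</sub>) + (1 &#45; 2r)u<sub>j</sub>; all names are illustrative. Every PE output is computed from the current bank contents before any register is overwritten, which is how the sketch mimics the concurrent update in hardware.</font></p>

```python
def step_register_bank(bank, r):
    """One time step: every PE reads the bank, then all registers update.

    bank : list of J values held in the register bank (boundaries included)
    r    : hypothetical scheme constant
    """
    # Each list element plays the role of one PE output (J - 2 PEs).
    pe_outputs = [
        r * (bank[j - 1] + bank[j + 1]) + (1.0 - 2.0 * r) * bank[j]
        for j in range(1, len(bank) - 1)
    ]
    # Feedback: the outputs replace the interior registers simultaneously;
    # the two boundary registers keep their values.
    return [bank[0], *pe_outputs, bank[-1]]


def run(bank, n_steps, r, store_all=True):
    """Iterate without re-reading memory; optionally keep every time step."""
    stored = [list(bank)] if store_all else []
    for _ in range(n_steps):
        bank = step_register_bank(bank, r)
        if store_all:
            stored.append(bank)
    return stored if store_all else bank
```

<p align="justify"><font face="verdana" size="2">The <i>store_all</i> flag is a hypothetical switch that separates the two storage policies: writing every time step out, or keeping only the final contents of the bank.</font></p>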
<p align="justify"><font face="verdana" size="2">The concurrent processing improves the algorithm performance, but data storage is still sequential given that the system has only one RAM. Therefore, a memory structure that allows concurrent storage is proposed as shown in <a href="#fig6">Figure 6</a>. In this architecture (A<sub>3</sub>) the inner loop is suppressed from the control unit sequence.</font></p>  	    <p align="center"><a name="fig6"/><img src="/img/revistas/eac/v38n2/f0607217.jpg">  	    
<p align="justify"><font face="verdana" size="2">Although these architectures perform better than the baseline architecture, the value of <i>J</i> is limited by the maximum number of PEs that the physical resources of the FPGA allow. To handle problems with larger mesh sizes on a single FPGA, the architecture (A<sub>4</sub>) shown in <a href="#fig7">Figure 7</a> is proposed. This architecture uses the cascade registers of the baseline architecture for the continuous reading of the data. In addition, a set of stencil cores and RAM blocks is used for data processing and storage. The arrangement of the registers, the PEs, and the RAMs makes it possible to perform the spatial and temporal sweeps while handling the data dependencies.</font></p>  	    <p align="center"><a name="fig7"/><img src="/img/revistas/eac/v38n2/f0707217.jpg" width="504" height="149"> 	    
<p align="justify">&nbsp;</p>  	    <p align="justify"><font face="verdana" size="3"><b>4.&#45; NUMERICAL RESULTS</b></font></p> 	    <p align="justify">&nbsp;</p> 	    ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">For the initial values, a ramp function is generated and sent from the PS section. The results of the numerical approximation of the 1D heat equation are stored on the SD card. Data values are written to a text file with a precision of 15 decimal digits. To visualize the results, the mesh is plotted using GNU Octave. <a href="/img/revistas/eac/v38n2/f0807217.jpg">Figure 8</a> shows the meshes obtained with A<sub>4</sub> for 256 points in the space domain and 8, 16, 32, 64, 128, and 256 iterations.</font></p>  	    
<p align="justify"><font face="verdana" size="2">The percent error with respect to the CPU results for the same mesh sizes, initial values, and boundary conditions is shown in <a href="/img/revistas/eac/v38n2/f0907217.jpg">Figure 9</a>. Although the error up to iteration 256 does not exceed 7&times;10<sup>&#45;6</sup> %, it accumulates as the number of time steps increases.</font></p>  	    
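<p align="justify"><font face="verdana" size="2">The accumulation of rounding error can be reproduced qualitatively in software. The Python sketch below is an illustration under several assumptions and is not the measurement procedure used for Figure 9: it emulates single precision by rounding every intermediate result to IEEE 754 binary32 with the standard <i>struct</i> module (the customized 32&#45;bit format of the custom IP may behave differently), it uses a kernel of the hypothetical form r(u<sub>j&#45;1</sub> + u<sub>j+1</sub>) + (1 &#45; 2r)u<sub>j</sub> with an arbitrary r, and it takes a double&#45;precision run as the reference.</font></p>

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def step(u, r, rnd):
    """One explicit time step; rnd is applied after every operation."""
    c = rnd(1.0 - 2.0 * r)
    new = list(u)  # boundary points are copied unchanged
    for j in range(1, len(u) - 1):
        s = rnd(u[j - 1] + u[j + 1])
        new[j] = rnd(rnd(r * s) + rnd(c * u[j]))
    return new


def max_percent_error(u0, n_steps, r):
    """Largest percent error of the rounded run w.r.t. double precision."""
    ref = list(u0)
    low = [to_f32(v) for v in u0]
    r32 = to_f32(r)
    for _ in range(n_steps):
        ref = step(ref, r, lambda v: v)   # reference: no extra rounding
        low = step(low, r32, to_f32)      # emulated single precision
    return max(
        abs(a - b) / abs(b) * 100.0
        for a, b in zip(low, ref)
        if b != 0.0  # skip points where the reference is exactly zero
    )


if __name__ == "__main__":
    # Ramp in the interior with zero boundaries (an arbitrary test case).
    u0 = [0.0] + [float(j) for j in range(1, 15)] + [0.0]
    for n in (8, 16, 32, 64):
        print(n, max_percent_error(u0, n, 0.3))
```

<p align="justify"><font face="verdana" size="2">Calling <i>max_percent_error</i> for increasing numbers of time steps gives a rough view of how the discrepancy between the two precisions evolves; the absolute values depend on the assumed rounding model and should not be compared directly with the hardware results.</font></p>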
<p align="justify"><font face="verdana" size="2">&nbsp;</font></p>  	    <p align="justify"><font face="verdana" size="3"><b>5.&#45; SYSTEM PERFORMANCE</b></font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><font face="verdana" size="2">The processing time is measured using an internal counter enabled by the control unit. The processing time in microseconds for 8, 16 and 32 points in the space domain and 512 iterations is shown in <a href="#tab1">Table 1</a>. The time for A<sub>3</sub> does not vary with the number of PEs, given that it depends only on the number of iterations. The speed&#45;up achieved with the parallel architectures (A<sub>2</sub> and A<sub>3</sub>) is calculated in relation to the baseline architecture (A<sub>1</sub>).</font></p>  	    <p align="center"><a name="tab1"/><img src="/img/revistas/eac/v38n2/t0107217.gif">  	    
<p align="justify"><font face="verdana" size="2">To determine the system performance in terms of FLOPS, it is known that the stencil scheme implemented for the PE has four floating&#45;point operations. Therefore, with a 100 MHz clock the system has a peak performance of 400 MFLOPS for A<sub>1</sub> and 12 GFLOPS for A<sub>2</sub> and A<sub>3</sub>. However, considering the number of mesh points, the stencil floating&#45;point operations and the processing time, the performance achieved is shown in <a href="#tab2">Table 2</a>.</font></p>  	    <p align="center"><a name="tab2"/><img src="/img/revistas/eac/v38n2/t0207217.gif">  	    
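<p align="justify"><font face="verdana" size="2">The peak figures follow from simple arithmetic, sketched below in Python. The sketch assumes one PE per interior mesh point, so that the 32&#45;point case uses 30 PEs (by analogy with the six PEs of the 8&#45;point bank), and it assumes that the achieved performance counts four operations for each updated interior point. The helper names and that counting convention are assumptions rather than details stated in the text.</font></p>

```python
FLOPS_PER_PE = 4        # floating-point operations in one stencil kernel
CLOCK_HZ = 100_000_000  # 100 MHz system clock


def peak_flops(num_pe, clock_hz=CLOCK_HZ, flops_per_pe=FLOPS_PER_PE):
    """Peak rate if every PE delivers one result per clock cycle."""
    return num_pe * flops_per_pe * clock_hz


def achieved_flops(j_points, n_steps, time_us, flops_per_pe=FLOPS_PER_PE):
    """Operations actually performed divided by the measured time.

    Assumes (J - 2) interior points are updated in each of N time steps.
    """
    total_ops = (j_points - 2) * n_steps * flops_per_pe
    return total_ops / (time_us * 1e-6)


if __name__ == "__main__":
    print(peak_flops(1) / 1e6, "MFLOPS with a single PE")
    print(peak_flops(30) / 1e9, "GFLOPS with 30 PEs")
```

<p align="justify"><font face="verdana" size="2">With these assumptions, a single PE gives 4 &times; 100 MHz = 400 MFLOPS and 30 PEs give 12 GFLOPS. The achieved figure is lower whenever the measured time includes cycles in which the PEs are idle, for example while data are being stored.</font></p>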
<p align="justify"><font face="verdana" size="2">The speed&#45;up achieved in comparison with the sequential algorithm written in C running on Linux on an Intel Xeon E5&#45;2667 at 2.90 GHz with 32 GB of RAM is shown in <a href="#tab3">Table 3</a>. The values of t<sub>CPU</sub> are the elapsed times that the CPU takes to perform the nested loop.</font></p>  	    ]]></body>
<body><![CDATA[<p align="center"><a name="tab3"/><img src="/img/revistas/eac/v38n2/t0307217.gif">  	    
<p align="justify"><font face="verdana" size="2">Performance can be improved for A<sub>2</sub> if only the results of iteration N are stored in the RAM. The processing time in microseconds and the speed&#45;up achieved without storing the whole mesh for 512 iterations are shown in <a href="#tab4">Table 4</a>.</font></p>  	    <p align="center"><a name="tab4"/><img src="/img/revistas/eac/v38n2/t0407217.gif">  	    
<p align="justify"><font face="verdana" size="2">Although architectures A<sub>2</sub> and A<sub>3</sub> improve the performance, they are limited in the mesh size they support. Architectures A<sub>1</sub> and A<sub>4</sub>, on the other hand, can handle problems of up to 256 points in the space domain and 512 iterations. <a href="#fig10">Figure 10</a> shows the processing time in microseconds using A<sub>4</sub> as a function of the number of points in the space domain, for different numbers of iterations. The processing time grows proportionally with both the number of iterations and the number of points in the space domain.</font></p>  	    <p align="center"><a name="fig10"/><img src="/img/revistas/eac/v38n2/f1007217.jpg">  	    
<p align="justify"><font face="verdana" size="2">From these results, the speed&#45;up achieved in comparison with the sequential algorithm written in C running on the ARM core at 667 MHz is calculated. <a href="#fig11">Figure 11a</a> shows the speed&#45;up when a two&#45;dimensional array is used and all the mesh points are stored in the RAM. The ARM has memory limitations when the algorithm is implemented with a two&#45;dimensional array; therefore, the plot presents the comparison only for the allowed mesh sizes. <a href="#fig11">Figure 11b</a> shows the speed&#45;up when a vector is used and only the terms of the last iteration are stored in the RAM.</font></p>  	    <p align="center"><a name="fig11"/><img src="/img/revistas/eac/v38n2/f1107217.jpg">  	    
<p align="justify"><font face="verdana" size="2">The FPGA resource utilization with respect to the number of PEs is summarized in <a href="#tab5">Table 5</a> for A<sub>1</sub>, A<sub>2</sub> and A<sub>3</sub>. This report corresponds to an implementation using a 65536&times;32 RAM for A<sub>1</sub> and A<sub>2</sub> and 512&times;32 RAM blocks for A<sub>3</sub>.</font></p>  	    <p align="center"><a name="tab5"/><img src="/img/revistas/eac/v38n2/t0507217.gif"> 	 	    
<p align="justify">&nbsp;</p> 	    ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="3"><b>6.&#45; CONCLUSIONS AND FUTURE WORK</b></font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><font face="verdana" size="2">This paper presents a system designed for the numerical approximation of the solution of a parabolic PDE for a 1D heat transfer problem with initial values and boundary conditions, using the explicit finite difference method. The implementation uses the SoC architecture of the Xilinx XC7Z020CLG484&#45;1 FPGA on the ZedBoard.</font></p>  	    <p align="justify"><font face="verdana" size="2">Four different architectures based on the stencil computation scheme are described. The performance analysis shows the improvement in processing time achieved for the stencil algorithm with the proposed architectures in relation to the embedded ARM processor. The speed&#45;up factors show that the implemented architectures offer different degrees of performance optimization, depending on the memory structure and control sequence. In all cases, the register array makes it possible to take advantage of spatial and temporal locality, reducing the need for memory transfer operations.</font></p>  	    <p align="justify"><font face="verdana" size="2">For future work, a deeper performance analysis in terms of accuracy, precision, data transfer, scalability and power consumption could be performed. This evaluation should allow a performance comparison with CPU and GPU based systems. Alternatively, variations of the implemented architectures could be designed to address 2D heat transfer problems using parabolic and elliptic PDEs. 
In addition, the obtained results could be compared with the software implementation of the stencil algorithm using high level synthesis tools.</font></p>  	    <p align="justify">&nbsp;</p>  	    <p align="justify"><font face="verdana" size="3"><b>ACKNOWLEDGEMENTS</b></font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><font face="verdana" size="2">This work was supported by the AE&amp;CC research group of the Instituto Tecnol&oacute;gico Metropolitano through the project P14208. Luis Casta&ntilde;o acknowledges the financial support by the scholarship Estudiantes Sobresalientes de Posgrado Universidad Nacional de Colombia.</font></p>  	    <p align="justify">&nbsp;</p>  	    ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="3"><b>REFERENCES</b></font></p> 	    <p align="justify">&nbsp;</p> 	     <!-- ref --><p align="justify"><font face="verdana" size="2">1.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    Schmitt C, Schmid M, Hannig F, Teich J, Kuckuk S, K&ouml;stler H. Generation    of Multigrid&#45;based Numerical Solvers for FPGA Accelerators. 2nd International    Workshop on High&#45;Performance Stencil Computations. Amsterdam; The Netherlands.    2015 Jan: 9&#45;15.    </font></p>  	     <!-- ref --><p align="justify"><font face="verdana" size="2">2.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    Usui T, Kobayashi R, Kise K. A Challenge of Portable and High&#45;Speed FPGA    Accelerator. In: Sano K, Soudris D, H&uuml;bner M, Diniz P. Applied Reconfigurable    Computing: 11th International Symposium. Bochum(Germany): Springer International    Publishing; 2015 Apr: 383&#45;392.    </font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">3.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Sano K, Hatsuda Y, Yamamoto S. Multi&#45;FPGA Accelerator for Scalable Stencil Computation with Constant Memory&#45;Bandwidth. IEEE Transactions on Parallel and Distributed Systems. 2014; 25(3): 695&#45;705.    </font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">4.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Kobayashi R, Kise K. Scalable stencil&#45;computation accelerator by employing multiple FPGA. IPSJ Transactions on Advanced Computing Systems. 2013; 6(4): 1&#45;13.    </font></p>  	     ]]></body>
<body><![CDATA[<!-- ref --><p align="justify"><font face="verdana" size="2">5.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    Bandishti V, Pananilath I, Bondhugula U. Tiling Stencil Computations to Maximize    Parallelism. In: International Conference for High Performance Computing, Networking,    Storage and Analysis. Salt Lake City(USA): IEEE Computer Society Press. 2012    Nov: 1&#45;11.    </font></p>  	     <!-- ref --><p align="justify"><font face="verdana" size="2">6.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    Cecilia JM, Abell&aacute;n JL, Fern&aacute;ndez J, Acacio ME, Garc&iacute;a    JM, Ujald&oacute;n M. Stencil computations on heterogeneous platforms for the    Jacobi method: GPUs versus Cell BE. The Journal of Supercomputing. Springer    US. 2012; 62(2): 787&#45;803.    </font></p>  	     <!-- ref --><p align="justify"><font face="verdana" size="2">7.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    Kobayashi R, Takamaeda&#45;Yamazaki S, Kise K. Towards a Low&#45;Power Accelerator    of Many FPGAs for Stencil Computations. International Conference on Networking    and Computing. Naha(Japan): IEEE. 2012 Dec: 343&#45;349.    </font></p>  	     <!-- ref --><p align="justify"><font face="verdana" size="2">8.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    Sano K, Hatsuda Y, Yamamoto S. Scalable Streaming&#45;Array of Simple Soft&#45;Processors    for Stencil Computations with Constant Memory&#45;Bandwidth. Annual International    Symposium on Field&#45;Programmable Custom Computing Machines.    Salt Lake City (USA): IEEE. 2011 May: 234&#45;241.    </font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">9.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Strzodka R, Shaheen M, Pajak D, Seidel H. Cache Accurate Time Skewing in Iterative Stencil Computations. International Conference on Parallel Processing. IEEE. 2011 Sep: 571&#45;581.    </font></p>  	     ]]></body>
<body><![CDATA[<!-- ref --><p align="justify"><font face="verdana" size="2">10.&nbsp;&nbsp;&nbsp; Brodtkorb    AR, Dyken C, Hagen TR, Hjelmervik JM, Storaasli OO. State&#45;of&#45;the&#45;art    in heterogeneous computing. Scientific Programming. IOS Press. 2010; 18(1):    1&#45;33.    </font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">11.&nbsp;&nbsp;&nbsp; Datta K, Kamil S, Williams S, Oliker L, Shalf J, Yelick K. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors. SIAM Review. 2009; 51(1): 129&#45;159.    </font></p>      <p align="justify">&nbsp;</p>     <p align="justify">&nbsp;</p>     <p align="justify"><font face="verdana" size="2">Received: December 15, 2016&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    <br>   Approved: April 10, 2017</font></p>     <p align="justify">&nbsp;</p>     <p align="justify">&nbsp;</p>  	    ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2"><b>Luis Casta&ntilde;o,</b> Electronic Engineer, MEng Industrial Automation, PhD(c), Facultad de Ingenier&iacute;a y Arquitectura, Universidad Nacional de Colombia, Manizales, Colombia. E&#45;mail: <a href="mailto:lfcastanol@unal.edu.co">lfcastanol@unal.edu.co</a>. Professor, Facultad de Ingenier&iacute;as, Instituto Tecnol&oacute;gico Metropolitano, Medell&iacute;n, Colombia. E&#45;mail: <a href="mailto:luiscastano@itm.edu.co">luiscastano@itm.edu.co</a>.</font></p>       ]]></body><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Schmitt]]></surname>
<given-names><![CDATA[C]]></given-names>
</name>
<name>
<surname><![CDATA[Schmid]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Hannig]]></surname>
<given-names><![CDATA[F]]></given-names>
</name>
<name>
<surname><![CDATA[Teich]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Kuckuk]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Köstler]]></surname>
<given-names><![CDATA[H]]></given-names>
</name>
</person-group>
<source><![CDATA[Generation of Multigrid-based Numerical Solvers for FPGA Accelerators]]></source>
<year>2015</year>
<conf-name><![CDATA[2nd International Workshop on High-Performance Stencil Computations]]></conf-name>
<conf-date>2015 Jan</conf-date>
<conf-loc>Amsterdam </conf-loc>
<page-range>9-15</page-range></nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Usui]]></surname>
<given-names><![CDATA[T]]></given-names>
</name>
<name>
<surname><![CDATA[Kobayashi]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Kise]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A Challenge of Portable and High-Speed FPGA Accelerator]]></article-title>
<person-group person-group-type="editor">
<name>
<surname><![CDATA[Sano]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
<name>
<surname><![CDATA[Soudris]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
<name>
<surname><![CDATA[Hübner]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Diniz]]></surname>
<given-names><![CDATA[P]]></given-names>
</name>
</person-group>
<source><![CDATA[Applied Reconfigurable Computing: 11th International Symposium]]></source>
<year>2015</year>
<month>Apr</month>
<page-range>383-392</page-range><publisher-loc><![CDATA[Bochum ]]></publisher-loc>
<publisher-name><![CDATA[Springer International Publishing]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sano]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
<name>
<surname><![CDATA[Hatsuda]]></surname>
<given-names><![CDATA[Y]]></given-names>
</name>
<name>
<surname><![CDATA[Yamamoto]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory-Bandwidth]]></article-title>
<source><![CDATA[IEEE Transactions on Parallel and Distributed Systems]]></source>
<year>2014</year>
<volume>25</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>695-705</page-range></nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kobayashi]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Kise]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Scalable stencil-computation accelerator by employing multiple FPGA]]></article-title>
<source><![CDATA[IPSJ Transactions on Advanced Computing Systems]]></source>
<year>2013</year>
<volume>6</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>1-13</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bandishti]]></surname>
<given-names><![CDATA[V]]></given-names>
</name>
<name>
<surname><![CDATA[Pananilath]]></surname>
<given-names><![CDATA[I]]></given-names>
</name>
<name>
<surname><![CDATA[Bondhugula]]></surname>
<given-names><![CDATA[U]]></given-names>
</name>
</person-group>
<source><![CDATA[Tiling Stencil Computations to Maximize Parallelism]]></source>
<year>2012</year>
<conf-name><![CDATA[International Conference for High Performance Computing, Networking, Storage and Analysis]]></conf-name>
<conf-date>2012 Nov</conf-date>
<conf-loc>Salt Lake City </conf-loc>
<page-range>1-11</page-range></nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Cecilia]]></surname>
<given-names><![CDATA[JM]]></given-names>
</name>
<name>
<surname><![CDATA[Abellán]]></surname>
<given-names><![CDATA[JL]]></given-names>
</name>
<name>
<surname><![CDATA[Fernández]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Acacio]]></surname>
<given-names><![CDATA[ME]]></given-names>
</name>
<name>
<surname><![CDATA[García]]></surname>
<given-names><![CDATA[JM]]></given-names>
</name>
<name>
<surname><![CDATA[Ujaldón]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE]]></article-title>
<source><![CDATA[The Journal of Supercomputing]]></source>
<year>2012</year>
<volume>62</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>787-803</page-range></nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kobayashi]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Takamaeda-Yamazaki]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Kise]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<source><![CDATA[Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations]]></source>
<year>2012</year>
<conf-name><![CDATA[International Conference on Networking and Computing]]></conf-name>
<conf-date>2012 Dec</conf-date>
<conf-loc>Naha </conf-loc>
<page-range>343-349</page-range></nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sano]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
<name>
<surname><![CDATA[Hatsuda]]></surname>
<given-names><![CDATA[Y]]></given-names>
</name>
<name>
<surname><![CDATA[Yamamoto]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
</person-group>
<source><![CDATA[Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Memory-Bandwidth]]></source>
<year>2011</year>
<conf-name><![CDATA[Annual International Symposium on Field-Programmable Custom Computing Machines]]></conf-name>
<conf-date>2011 May</conf-date>
<conf-loc>Salt Lake City </conf-loc>
<page-range>234-241</page-range></nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="confpro">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Strzodka]]></surname>
<given-names><![CDATA[R]]></given-names>
</name>
<name>
<surname><![CDATA[Shaheen]]></surname>
<given-names><![CDATA[M]]></given-names>
</name>
<name>
<surname><![CDATA[Pajak]]></surname>
<given-names><![CDATA[D]]></given-names>
</name>
<name>
<surname><![CDATA[Seidel]]></surname>
<given-names><![CDATA[H]]></given-names>
</name>
</person-group>
<source><![CDATA[Cache Accurate Time Skewing in Iterative Stencil Computations]]></source>
<year>2011</year>
<conf-name><![CDATA[International Conference on Parallel Processing]]></conf-name>
<conf-date>2011 Sep</conf-date>
<conf-loc> </conf-loc>
<page-range>571-581</page-range></nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brodtkorb]]></surname>
<given-names><![CDATA[AR]]></given-names>
</name>
<name>
<surname><![CDATA[Dyken]]></surname>
<given-names><![CDATA[C]]></given-names>
</name>
<name>
<surname><![CDATA[Hagen]]></surname>
<given-names><![CDATA[TR]]></given-names>
</name>
<name>
<surname><![CDATA[Hjelmervik]]></surname>
<given-names><![CDATA[JM]]></given-names>
</name>
<name>
<surname><![CDATA[Storaasli]]></surname>
<given-names><![CDATA[OO]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[State-of-the-art in heterogeneous computing]]></article-title>
<source><![CDATA[Scientific Programming]]></source>
<year>2010</year>
<volume>18</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>1-33</page-range></nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Datta]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
<name>
<surname><![CDATA[Kamil]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Williams]]></surname>
<given-names><![CDATA[S]]></given-names>
</name>
<name>
<surname><![CDATA[Oliker]]></surname>
<given-names><![CDATA[L]]></given-names>
</name>
<name>
<surname><![CDATA[Shalf]]></surname>
<given-names><![CDATA[J]]></given-names>
</name>
<name>
<surname><![CDATA[Yelick]]></surname>
<given-names><![CDATA[K]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors]]></article-title>
<source><![CDATA[SIAM Review]]></source>
<year>2009</year>
<volume>51</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>129-159</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
