1.- INTRODUCTION
The use of elliptic curves in cryptography was proposed independently by Miller [1] and Koblitz [2] when they discovered that the set of points satisfying the curve equation, together with point addition as the group law, forms a suitable group for building discrete logarithm systems. Since then, many protocols based on elliptic curves have been developed and included in several standards [3-6]. From an implementation perspective, the main advantage of elliptic curve cryptography relies on its relatively small key-length requirement compared to that of systems based on integer factorization or discrete logarithms in the multiplicative group of finite fields. Shorter keys translate into lower storage requirements and smaller computation times, two valuable features in general, and especially for embedded platforms where memory space and processing capabilities are usually constrained.
Scalar multiplication is the most important operation in elliptic curve protocols. For this reason, numerous mechanisms aimed at improving the performance of this operation have been proposed. Some of them reduce the number of point addition and point doubling operations required to compute a scalar multiplication. Others explore different elliptic curve point representations leading to efficient addition and doubling formulas, while a third group focuses on minimizing the computational cost of the underlying finite field arithmetic [7]. These alternatives are not mutually exclusive; in fact, in most practical cases they are combined with each other to obtain better results. Whatever the case, a common way to boost performance is to complement those algorithmic improvements with the proper use of any specific feature of the selected implementation platform that provides processing acceleration. In this sense, the ARM Cortex-A family of processors comes equipped with NEON, a Single Instruction Multiple Data (SIMD) extension that can be exploited to speed up elliptic curve arithmetic on ARM-powered devices. In particular, this work focuses on using NEON instructions to accelerate operations in the underlying finite fields on top of which elliptic curves are built.
Different approaches can be followed to implement finite field arithmetic using NEON. A first option is to parallelize computations within a single field operation. Inconveniences arise with this alternative when operands are represented in the conventional non-redundant (full-radix) form, since most SIMD architectures, including NEON, do not support carry propagation across data items that are processed in parallel. Accordingly, several implementations adopt the reduced-radix (redundant) representation suggested in [8] to ease the handling of carry propagation. However, as stated in [9], such an approach leads to more intermediate partial products being computed. For that reason, and although there are clever proposals [9,10] achieving parallelization within a single field operation involving full-radix operands, this work explores a second way of using the NEON engine: performing two field operations in parallel, as described for the attribute-based encryption scheme implemented in [11]. With this approach, the non-redundant representation does not suffer from carry-propagation issues. That is, a dual multi-precision finite field operation can be split into a sequence of consecutive dual single-precision computations executed in separate iterations, so carry values can be passed from one iteration to the next until the entire multi-precision computation finishes.
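As a minimal illustration of this idea, the following portable C sketch (our own, not the authors' code) advances two independent multi-precision additions in lockstep. Each loop iteration performs one single-precision step per operation, and each operation keeps its own carry variable: the property that makes the non-redundant representation workable across SIMD lanes.

```c
#include <assert.h>
#include <stdint.h>

#define NW 8  /* number of 32-bit words, e.g. 256-bit operands */

/* Two independent multi-precision additions, r0 = a0 + b0 and
 * r1 = a1 + b1, advance in lockstep.  The two carry chains stay
 * separate, so no cross-lane carry is ever needed. */
static void dual_add(uint32_t r0[NW], const uint32_t a0[NW], const uint32_t b0[NW],
                     uint32_t r1[NW], const uint32_t a1[NW], const uint32_t b1[NW])
{
    uint64_t c0 = 0, c1 = 0;            /* independent per-lane carries */
    for (int i = 0; i < NW; i++) {
        c0 += (uint64_t)a0[i] + b0[i];  /* lane 0 single-precision step */
        c1 += (uint64_t)a1[i] + b1[i];  /* lane 1 single-precision step */
        r0[i] = (uint32_t)c0;  c0 >>= 32;
        r1[i] = (uint32_t)c1;  c1 >>= 32;
    }
}
```

On NEON hardware the two lanes of a Q register play the role of the two carry chains; here ordinary scalar code just makes the structure explicit.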
In this paper we construct NEON-based dual field multiplication and squaring routines and apply them to elliptic curve point arithmetic, keeping the implementation flexible enough to support several bit-lengths at run time.
The rest of this paper is organized as follows: Section 2 briefly discusses some previous results related to the subject of this work. Section 3 gives an overview of elliptic curve arithmetic. Section 4 presents our implementation of NEON-based field operations as well as their application to elliptic curve arithmetic. Section 5 shows the timing results obtained from the experiments conducted on the ARM Cortex-A9 processing system embedded in the Xilinx XC7Z020 device. Finally, concluding remarks are provided in Section 6.
2.- RELATED WORK
Several works targeting SIMD-based implementations of cryptographic primitives have been reported in the literature. In particular, the use of NEON vectorization has been proposed in [12-15] to speed up elliptic curve arithmetic. In [11] the authors also proposed the use of NEON in the context of an attribute-based encryption scheme that exploits the computation of bilinear pairings on elliptic curves. Other works, like [9,10], used NEON instructions to implement modular multiplication and modular squaring primitives, which are common to several cryptographic schemes including elliptic curves. The particular scenarios targeted by these works are quite diverse. For example, in [12] a reduced-radix representation is used to perform NEON-based multiplications to accelerate Curve25519 and Ed25519 curve arithmetic. NEON vectorization was applied across two independent multiplications inside point arithmetic formulas, as well as within a single multiplication for those that could not be paired. The authors of [13] implemented a GLV-based scalar multiplication for the Ted127-glv4 curve in which interleaved ARM-NEON instructions were used to perform independent 128-bit multiplications in parallel. The application of NEON vectorization to boost the computational performance of elliptic curves defined over binary fields has been studied in [14]. Specifically, NEON was used to accelerate the polynomial multiplication of two vectors of eight 8-bit polynomials producing 128-bit products. This primitive was then used in point arithmetic on random and Koblitz binary curves. In [15], interleaved ARM-NEON instructions were employed to speed up multiplications over
The works summarized above exemplify the successful application of SIMD techniques on ARM processors to speed up cryptography. Along with NEON, most of them employed algorithmic optimizations applicable to their particular scenarios and, from an implementation perspective, all of them targeted fixed-length implementations with code optimized for a particular bit-length. That is the main difference with respect to our proposal. Our intention is to exploit NEON vectorization while keeping the implementation flexible enough to allow scalability. Thus, the library implemented in this work is able to switch at run time between curves of the same family but of different sizes while keeping the same NEON-based processing core.
3.- ELLIPTIC CURVE ARITHMETIC
Let
The most important operation used in elliptic curve protocols is the scalar point multiplication. It is denoted by
[Algorithm 1: binary (double-and-add) scalar multiplication; input/output specification and step listing lost in extraction.]
Addition and doubling formulas for computing
Operations involved in the above equations are defined in the underlying finite field
3.1.- PROJECTIVE COORDINATES
Equations (1) and (2) are said to be the affine equations for point addition and point doubling respectively since they act on points represented in affine coordinates. In most cases field inversion is the most expensive operation. For example, in the base field
Field inversions can be avoided at the cost of extra field multiplications by employing projective coordinates. As long as the total multiplication count does not exceed the inversion-to-multiplication ratio, projective equations will perform better than affine formulas. A point in projective coordinates is represented by a triple
Inversion-free point addition and doubling formulas are derived from affine equations by simple substitutions of
Note that expressions for
Table 1 summarizes multiplication and squaring counts for the standard and Jacobian coordinate equations used in this work. Recall from Algorithm 1 that a point doubling always occurs at each iteration of the scalar multiplication loop while point additions are only performed when the corresponding bit of the scalar
4.- NEON-BASED ELLIPTIC CURVE ARITHMETIC
One alternative to boost the performance of field arithmetic is to take advantage of any specific feature available in the selected implementation platform. In this sense, our work focuses on devices populated with ARM processors provided with the NEON extended instruction set.
Typical sizes for the operands used in elliptic curve schemes are currently in the order of 256, 384 and 512 bits [24]. However, most ARM processors exhibit a 32-bit architecture [25]. Thus, a 256 × 256 multiplication, for example, has to be split into several 32 × 32 multiplications and 32-bit additions that can be handled by conventional ARM instructions. Fortunately, the ARM Cortex-A series processors come equipped with NEON, a 128-bit Single Instruction Multiple Data extension, whose instructions can be very helpful to speed up the field operations underlying elliptic curve arithmetic. The NEON engine is built around 16 registers of 128 bits (Q0 ~ Q15) that can be accessed as 32 registers of 64 bits (D0 ~ D31). Every register can hold a vector of n lanes of m bits each. Refer to [26] for the allowed combinations of n and m.
Most NEON instructions act over n lanes in parallel. They perform the same operation between equivalent lanes of the input vectors and store the results in the corresponding lanes of the output vector. Nevertheless, not all instructions support all possible combinations of n and m. For example, the vmull instruction illustrated in Figure 1 does not support 64-bit lanes. Even though we can specify a 128-bit register as the destination operand and two 64-bit registers as source operands, vmull is only allowed to process lanes of up to 32 bits. We must also point out that there is no support for carry propagation between lanes, which is a major inconvenience unless a redundant numeric representation is used. Notwithstanding the above, NEON provides a useful degree of flexibility by means of the so-called instruction shapes. Most NEON arithmetic instructions come in at least two of four different shapes: normal, long, wide and narrow. We only emphasize the long case due to its relevance for this work (see [26] for further details). Long-shape arithmetic instructions act on source vectors of n lanes of m bits each and produce an output vector of n lanes of 2m bits each. This is clear in the vmull example: observe in Figure 1 that while the input vectors have 2 lanes of 32 bits, the result is a vector also with 2 lanes, but of 64 bits each. This shows how one can simultaneously perform two 32 × 32 multiplications, obtaining two 64-bit products as the result.
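The long-shape behaviour of vmull can be modelled in portable C as follows. This is a didactic sketch; on real hardware it corresponds to the `vmull.u32 Qd, Dn, Dm` instruction, available through the `vmull_u32` intrinsic of `arm_neon.h`.

```c
#include <assert.h>
#include <stdint.h>

/* Model of long-shape vmull.u32: two 32-bit lanes are multiplied
 * pairwise and widen to two 64-bit lanes, with no interaction
 * (and no carry) between lanes. */
typedef struct { uint32_t lane[2]; } d_u32x2;   /* a 64-bit D register  */
typedef struct { uint64_t lane[2]; } q_u64x2;   /* a 128-bit Q register */

static q_u64x2 vmull_u32_model(d_u32x2 a, d_u32x2 b)
{
    q_u64x2 r;
    r.lane[0] = (uint64_t)a.lane[0] * b.lane[0];  /* lane 0: 32x32 -> 64 */
    r.lane[1] = (uint64_t)a.lane[1] * b.lane[1];  /* lane 1: 32x32 -> 64 */
    return r;
}
```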
It would be logical to think that the use of the vmull instruction would double the speed of a 256 × 256 multiplication, since we would be able to perform two 32 × 32 multiplications at once. However, partial products must be accumulated to obtain the final result, which would require carry values propagating across lanes, a feature not supported by NEON. Alternatively, we could compute only the partial products in parallel and later perform carry propagation sequentially, but this would incur extra time penalties that mitigate any benefit provided by the parallel multiplication step. A different approach is not to parallelize a single 256 × 256 multiplication but to perform two independent 256 × 256 multiplications simultaneously. The rest of this section discusses in detail how to use such an approach in the context of elliptic curves. One of our design goals is to build an implementation flexible enough to switch at run time between different bit-lengths. For this purpose, rather than constructing a fixed-length implementation fully coded with NEON instructions, we propose the implementation of some NEON-based kernels that are properly inserted into C code to achieve a balance between speed and flexibility.
4.1.- POINT ARITHMETIC OVER
As suggested above, wherever two field multiplications act on independent data, it is possible to parallelize them by means of NEON instructions. The same principle applies to field squarings. Fortunately, point arithmetic equations, especially in projective coordinates, can be conveniently arranged to extract many independent field operations. In this sense, we propose Algorithm 2, in which operations have been carefully scheduled to break data dependencies, grouping the two squarings into one pair and eight of the nine field multiplications required to perform point addition in standard coordinates into four other pairs. The multiplications and squarings inside each pair can now be computed in parallel, assuming we are provided with the proper functions to do so.
[Algorithm 2: point addition in standard coordinates with operations scheduled into dual pairs (14 steps); input/output specification and step listing lost in extraction.]
Hereafter we generically refer to the parallel computation of two field multiplications as a dual field multiplication. When a concrete finite field needs to be specified, then the word field will be substituted by the notation used to identify that particular finite field. That is, a dual
Although we only exemplify the parallelization opportunities of point addition in standard coordinates, it is worthwhile to mention that similar analyses were conducted for the remaining cases. As a result, we built procedures for point doubling in standard coordinates as well as for point addition and point doubling in Jacobian coordinates with computational costs of 3Md+1Sd+2M, 4Md+1Sd+1S and 1Md+2Sd+1M respectively. This emphasizes the need to build functions to compute dual field multiplications and squarings. We now continue with the implementation of such functions by using NEON instructions.
4.1.1.- DUAL FIELD MULTIPLICATION IN
Best choices to perform multiplications in
Multi-precision integer multiplications are commonly computed through the well-known schoolbook method [28]. The inputs of this method consist of two
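For reference, a plain C version of the operand-scanning (schoolbook) multiplication reads as follows. This is the sequential baseline that the dual NEON kernels accelerate, not the authors' exact code.

```c
#include <assert.h>
#include <stdint.h>

/* Schoolbook n-word multiplication: each 32-bit word of a[] is multiplied
 * by each word of b[], and the 64-bit partial products are accumulated
 * into a 2n-word result t[], propagating carries word by word. */
static void mp_mul(uint32_t t[], const uint32_t a[], const uint32_t b[], int n)
{
    for (int i = 0; i < 2 * n; i++) t[i] = 0;
    for (int i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            /* a[i]*b[j] + t[i+j] + carry fits in 64 bits:
             * (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1 */
            uint64_t acc = (uint64_t)a[i] * b[j] + t[i + j] + carry;
            t[i + j] = (uint32_t)acc;
            carry    = acc >> 32;
        }
        t[i + n] = (uint32_t)carry;
    }
}
```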
The first step towards a dual
Figure 2 graphically illustrates the arithmetic kernel of neon_dual_mac2. Register Q0 holds a
Load and store instructions that move data between ARM memory and NEON vectors were omitted from Figure 2 to keep the diagram as simple as possible. However, it is worthwhile to mention that the time required for these data transfers obviously contributes to the total execution time of the dual multi-precision multiplication.
The second part of the SOS modular multiplication algorithm consists of the Montgomery reduction phase. A single Montgomery reduction takes as inputs an
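To make the reduction phase concrete, here is a generic word-by-word Montgomery reduction in portable C. It is a sketch of the textbook reduction used by the SOS method, not the authors' NEON kernel; the helper name and parameter layout are ours.

```c
#include <assert.h>
#include <stdint.h>

/* Montgomery reduction: given a 2n-word value t[] and an odd n-word
 * modulus M with n0inv = -M[0]^(-1) mod 2^32, produce
 * u = t * 2^(-32n) mod M in u[].  A final conditional subtraction of M
 * may still be required by the caller. */
static void mont_redc(uint32_t u[], uint32_t t[], const uint32_t M[],
                      uint32_t n0inv, int n)
{
    uint32_t extra = 0;                      /* overflow beyond t[2n-1] */
    for (int i = 0; i < n; i++) {
        uint32_t m = t[i] * n0inv;           /* makes t[i] vanish mod 2^32 */
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            uint64_t acc = (uint64_t)m * M[j] + t[i + j] + carry;
            t[i + j] = (uint32_t)acc;
            carry    = acc >> 32;
        }
        /* propagate the carry into the upper half of t[] */
        for (int k = i + n; carry != 0 && k < 2 * n; k++) {
            uint64_t acc = (uint64_t)t[k] + carry;
            t[k]  = (uint32_t)acc;
            carry = acc >> 32;
        }
        extra += (uint32_t)carry;
    }
    for (int i = 0; i < n; i++) u[i] = t[i + n];
    (void)extra;  /* a nonzero extra also forces a final subtraction of M */
}
```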
Figure 4 shows the neon_dual_carry functionality. The input parameters are the pointers pt and pu pointing to the position of arrays t[] and u[] respectively from which it is desired to start the propagation of the corresponding carry values. The length
Algorithm 3 shows how the above pieces are tied together to form a mixed C/NEON dual SOS Montgomery modular multiplication. This mixed approach allowed us to build a flexible and scalable solution that supports different field sizes at run time. Step 7 performs the dual multi-precision multiplication phase, while step 12 computes the dual Montgomery reduction. Note that a correction is applied to the values resulting from Montgomery reduction if they are greater than or equal to the modulus. Also note that this final correction step is not parallelized, since both values do not always need to be corrected at the same time. Even though comparisons
[Algorithm 3: mixed C/NEON dual SOS Montgomery modular multiplication; body lost in extraction. Surviving fragments indicate input-array definitions and a final block commented "/* Verifying if final corrections are required. */".]
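The final correction step of the modular multiplication can be sketched as follows. The helper names are ours; as in Algorithm 3, the comparison and subtraction are performed sequentially per value rather than in parallel.

```c
#include <assert.h>
#include <stdint.h>

#define NW 8  /* words per operand; illustrative */

/* Returns 1 if u >= M (both NW little-endian 32-bit words). */
static int mp_geq(const uint32_t u[NW], const uint32_t M[NW])
{
    for (int i = NW - 1; i >= 0; i--) {
        if (u[i] > M[i]) return 1;
        if (u[i] < M[i]) return 0;
    }
    return 1;  /* equal */
}

/* Final correction after Montgomery reduction: subtract M once if needed. */
static void mp_cond_sub(uint32_t u[NW], const uint32_t M[NW])
{
    if (!mp_geq(u, M)) return;
    uint64_t borrow = 0;
    for (int i = 0; i < NW; i++) {
        uint64_t d = (uint64_t)u[i] - M[i] - borrow;
        u[i]   = (uint32_t)d;
        borrow = (d >> 32) & 1;  /* borrow out on underflow */
    }
}
```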
4.1.2.- DUAL FIELD SQUARING IN
Multi-precision squarings are usually more efficient than generic multi-precision multiplications of an integer
The dual multi-precision squaring implementation is also based on a nested loops structure in which two simultaneous multiply-and-accumulate operations need to be performed. However, as shown in equation (5), an extra multiplication by
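The standard squaring optimization can be sketched in portable C as follows: each cross product a[i]·a[j] with i < j is computed once and doubled, and the diagonal squares are added afterwards. This mirrors the structure discussed above but is not the authors' NEON code.

```c
#include <assert.h>
#include <stdint.h>

/* n-word squaring: cross products appear twice in the schoolbook
 * expansion, so they are computed once and doubled, roughly halving
 * the single-precision multiplication count versus a generic multiply. */
static void mp_sqr(uint32_t t[], const uint32_t a[], int n)
{
    for (int i = 0; i < 2 * n; i++) t[i] = 0;
    /* cross products a[i]*a[j] for i < j */
    for (int i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (int j = i + 1; j < n; j++) {
            uint64_t acc = (uint64_t)a[i] * a[j] + t[i + j] + carry;
            t[i + j] = (uint32_t)acc;
            carry    = acc >> 32;
        }
        t[i + n] = (uint32_t)carry;
    }
    /* double the cross-product part: t = 2*t */
    uint64_t c = 0;
    for (int i = 0; i < 2 * n; i++) {
        uint64_t acc = ((uint64_t)t[i] << 1) + c;
        t[i] = (uint32_t)acc;
        c    = acc >> 32;
    }
    /* add the squares a[i]^2 on the diagonal */
    c = 0;
    for (int i = 0; i < n; i++) {
        uint64_t acc = (uint64_t)a[i] * a[i] + t[2 * i] + c;
        t[2 * i] = (uint32_t)acc;
        acc = (acc >> 32) + t[2 * i + 1];
        t[2 * i + 1] = (uint32_t)acc;
        c = acc >> 32;
    }
}
```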
The subroutine neon_dual_sqr_mac2 takes the same input parameters as those of neon_dual_mac2. However, the operands to be pairwise multiplied are not located in different arrays. They come from different locations of the same array a[] in the case of operands
The subroutine neon_dual_mac is much simpler. It is actually a reduced version of neon_dual_mac2, since it does not handle any input carry values. This can be easily appreciated in Figure 6. Note that neon_dual_mac only uses a single vmlal instruction to compute the multiply-and-accumulate operations
Replacing the multi-precision multiplication loop at step 7 of Algorithm 3 by the squaring procedure shown in Algorithm 4 turns the dual SOS modular multiplication into a dual SOS modular squaring operation. Although it is evident that neon_dual_sqr_mac2 is much more complex than neon_dual_mac2, the shorter inner loops involved in the squaring algorithm ensure the expected improvement in terms of speed.
[Algorithm 4: dual multi-precision squaring loop; input/output specification and body lost in extraction.]
Before closing this section, it is worthwhile to point out a situation that sometimes arises in the point arithmetic formulas. Until now we have only considered pairing multiplications and squarings separately; that is, grouping either two multiplications or two squarings to later perform them simultaneously. Sometimes this is simply not possible. However, it could be the case that one unpaired multiplication can be performed in parallel with an unpaired squaring. In such a situation it is advantageous to treat the squaring as an ordinary multiplication and execute it as part of a dual NEON-based multiplication.
4.2.- APPLYING DUAL OPERATIONS IN POINT ARITHMETIC OVER
We find it worthwhile to devote this section to evaluating the impact of dual NEON-based operations on the performance of elliptic curve arithmetic over
Point addition and doubling formulas in
4.2.1.- NEON-BASED
Dual NEON operations cannot be directly applied to point arithmetic in
Let
Algorithm 5 shows the resulting
[Algorithm 5: input/output specification and body lost in extraction.]
Further refinements can be achieved by using the lazy reduction technique [36]. By allowing intermediate values to grow above the modulus, the number of Montgomery reductions can be decreased to only one at the end of the multiplication procedure. We followed two approaches to implement lazy reduction. When possible we selected the prime characteristic
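The idea behind lazy reduction can be shown with single-word toy operands: in an expression such as a·b + c·d mod p, the double-width products are accumulated first and only one reduction is applied at the end. The function names and the toy parameter sizes below are ours.

```c
#include <assert.h>
#include <stdint.h>

/* Lazy reduction, toy version: one reduction for two products.
 * Requiring a, b, c, d < p < 2^31 keeps the accumulator within 64 bits;
 * with multi-precision operands the same idea trades several Montgomery
 * reductions for a single one. */
static uint32_t muladd_lazy(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                            uint32_t p)
{
    uint64_t acc = (uint64_t)a * b + (uint64_t)c * d;  /* no reduction yet */
    return (uint32_t)(acc % p);                        /* reduce once */
}

/* Reference version: reduce after every product. */
static uint32_t muladd_eager(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t p)
{
    return (uint32_t)((((uint64_t)a * b) % p + ((uint64_t)c * d) % p) % p);
}
```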
Algorithm 6 shows the procedure for multiplications in
[Algorithm 6: input/output specification and body lost in extraction.]
4.2.2.- NEON-BASED
The Karatsuba-Ofman multiplication combined with reduction by
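A sketch of the Karatsuba-Ofman trick for quadratic extension field multiplication is shown below over a small toy prime. We assume an extension constructed as F_p(i) with i² = −1 (which requires p ≡ 3 mod 4); whether this matches the authors' exact extension polynomial is an assumption, but the three-multiplication structure is the same.

```c
#include <assert.h>
#include <stdint.h>

#define P 1000003ULL  /* toy prime; 1000003 % 4 == 3 */

typedef struct { uint64_t c0, c1; } fp2;   /* element c0 + c1*i */

/* (a0 + a1*i)(b0 + b1*i) with three base-field multiplications
 * instead of four (Karatsuba-Ofman). */
static fp2 fp2_mul(fp2 a, fp2 b)
{
    uint64_t v0 = (a.c0 * b.c0) % P;                    /* a0*b0 */
    uint64_t v1 = (a.c1 * b.c1) % P;                    /* a1*b1 */
    uint64_t v2 = ((a.c0 + a.c1) * (b.c0 + b.c1)) % P;  /* (a0+a1)(b0+b1) */
    fp2 r;
    r.c0 = (v0 + P - v1) % P;            /* a0*b0 - a1*b1, using i^2 = -1 */
    r.c1 = (v2 + 2 * P - v0 - v1) % P;   /* cross terms a0*b1 + a1*b0 */
    return r;
}
```

With multi-precision operands each of the three products would be one (possibly dual, lazily reduced) Montgomery multiplication.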
5.- IMPLEMENTATION RESULTS
This section evaluates the impact of the dual NEON-based finite field arithmetic on the performance of a software implementation of elliptic curve primitives. Experiments were carried out on the Xilinx XC7Z020 Zynq-7000 device. The XC7Z020 chip integrates a dual-core ARM Cortex-A9 processing system running at 667 MHz.
[Table 2: elliptic curves used in the experiments (Curve | Bit-length | …); entries lost in extraction.]
Table 3 presents the average execution times of the pure C and the mixed C/NEON variants of field multiplications (mul) and squarings (sqr) in both
[Table 3: average execution times and speedup factors of the pure C and mixed C/NEON variants of field multiplication (mul), multiplication with lazy reduction (mul-lr) and squaring (sqr), for each field and bit-length; numeric entries lost in extraction. (*) lr stands for lazy reduction.]
As shown in Table 3, for the three bit-lengths considered in this work, performing two single
Let us now inspect the timing results of elliptic curve point arithmetic. Table 4 shows the values corresponding to point addition (Add) and point doubling (Dbl) on the elliptic curves over
[Table 4: point addition (Add) and point doubling (Dbl) timings of the pure C and mixed C/NEON variants in affine, standard and Jacobian coordinates, per bit-length, with speedup factors; numeric entries lost in extraction.]
Timing results for point arithmetic on elliptic curves over
Although affine operations experience some speedup when using NEON-based
[Table 5: point addition (Add) and point doubling (Dbl) timings of the pure C and mixed C/NEON variants in affine, standard and Jacobian coordinates, per bit-length, with speedup factors; numeric entries lost in extraction.]
Finally, Table 6 and Table 7 show the timings obtained for scalar multiplication in
[Table 6: scalar multiplication timings of the pure C and mixed C/NEON variants in affine, standard and Jacobian coordinates, per bit-length, with speedup factors; numeric entries lost in extraction.]
[Table 7: scalar multiplication timings of the pure C and mixed C/NEON variants in affine, standard and Jacobian coordinates, per bit-length, with speedup factors; numeric entries lost in extraction.]
5.1.- COMPARISON WITH RELATED WORKS
Even though the goal of most works summarized in Section 2 was to accelerate elliptic curve arithmetic by using NEON vectorization, the choices of curves, underlying fields, point arithmetic formulas and scalar representations were quite diverse. Such different choices have a direct impact on the resulting performance. Therefore, it is very difficult to establish a fair comparison that evaluates only the influence of the NEON-based approach employed by each of those works. The parameter setup closest to our settings is that of [11]. Consequently, our implementation timings of NEON-based finite field and point arithmetic are compared against their results in Table 8. The comparison only involves curve arithmetic in Jacobian coordinates, since this was the point representation considered in that work.
Timings are given in 10³ clock cycles. Columns 4-8 and 9-13 correspond to the base field and its extension field, respectively (group labels lost in extraction).

Work | Implementation | Bit-length | mul | sqr | Add | Dbl | [k]P | mul | sqr | Add | Dbl | [k]P
---|---|---|---|---|---|---|---|---|---|---|---|---
[11]a | NEON (fixed length) | 254 | - | - | - | - | 1698 | 2.29 | 2.00 | - | - | 2933
This workb | Mixed C/NEON (scalable) | 254 | 3.8* | 3.4* | 22.7 | 15 | 6592 | 6.1 | 4.2 | 64.5 | 38 | 17551
 | | 384 | 8.5* | 7.5* | 50.1 | 32.7 | 20710 | 13.8 | 9 | 140.3 | 80.9 | 54489
 | | 510 | 13.8* | 11.9* | 80 | 51.5 | 44939 | 21.1 | 14.5 | 216.2 | 125 | 114537
a. Galaxy Note (ARM v7) Exynos 4 Cortex-A9 at 1.4 GHz.
b. Zynq ZC7Z020 (ARM v7) Cortex-A9 at 667 MHz.
* Refers to dual
As shown in Table 8, our mixed C/NEON scalar multiplication in
6.- CONCLUSIONS
Achieving implementations of cryptographic primitives with good performance is crucial to ensure a smooth end-user experience. This is especially relevant in the context of embedded solutions, where the processing resources are somewhat limited compared to those of modern general-purpose computers. Thus, taking advantage of any platform-specific feature that provides acceleration can be decisive. Accordingly, this work evaluated the use of the NEON instruction set of ARM Cortex-A processors to speed up elliptic curve arithmetic. We followed the approach of parallelizing the underlying finite field operations using NEON instructions, since we observed that elliptic curve point arithmetic formulas, especially in projective coordinates, involve several field multiplications and squarings that can be executed simultaneously. After implementing such NEON-based field operations, it was shown that using them in the context of elliptic curves boosted the performance of scalar multiplication between