A NEW OPTIMIZED IMPLEMENTATION OF A FAST INTRA PREDICTION MODE DECISION ALGORITHM FOR HEVC STANDARD

Abstract: A new and more powerful video compression standard, called H.265/HEVC (High Efficiency Video Coding), was developed during the last years. This standard brings several improvements compared to H.264/AVC (Advanced Video Coding). In the intra prediction block, 33 directional intra prediction modes were included in H.265, in addition to the planar and DC modes, instead of the 8 directional modes used in H.264; this more efficient coding comes at the price of a higher computational complexity in the new standard. Therefore, one of the main issues for an embedded implementation of HEVC is the reduction of the encoding time. In this paper, an embedded implementation of a fast intra prediction algorithm is performed on ARM processors under an embedded Linux operating system. Experimental results comparing the original HM16.7 with the proposed algorithm show that the encoding time was reduced by an average of 61.5%, with an increase of 1.19% in the bit rate and a small degradation of 0.05% in the PSNR.


INTRODUCTION
Video has become more popular among consumers than any other type of content. Statistics from the video marketing industry indicated that, by 2020, online video would make up more than 80% of all consumer internet traffic. In addition, the trend toward higher video quality is moving to UHD (Ultra High Definition). Hence, powerful video compression standards are needed that can support higher resolutions, cope with the massive use of video and ensure the bandwidth required to transmit video content. For this purpose, the HEVC standard was announced in 2013. This standard can support video resolutions up to 8K x 4K UHD [1]. It also offers up to 50% (70% in the latest versions) more data compression at the same level of quality compared to H.264/AVC (Advanced Video Coding), previously the most widely used standard [2].
In HEVC video coding, each picture is split into slices, slice segments and tiles containing the CTUs (Coding Tree Units). Each CTU is divided into one or more CUs (Coding Units) of size 2N x 2N (N = 4, 8, 16 and 32). Each CU is then partitioned into one or more PUs (Prediction Units) and one or more TUs (Transform Units), forming a Recursive QuadTree (RQT) whose root is the CU. This partitioning method makes the coding process more flexible [3] compared to the H.264/AVC standard, which was based on macroblocks with smaller sizes and fewer partitioning options.
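As an illustration of this recursive partitioning, the following minimal C++ sketch splits a 64x64 CTU into leaf CUs down to the minimum 8x8 size. The names (`CU`, `partitionCU`) and the `splitDecision` callback are hypothetical stand-ins for the encoder's real RD-based split decision, not HM's actual API.

```cpp
#include <cassert>
#include <vector>

// Sketch of HEVC's recursive quadtree CU partitioning: a CTU is split
// into four equally sized sub-CUs until the minimum CU size (8x8) is
// reached or the (stubbed) split decision says stop.
struct CU { int x, y, size; };

void partitionCU(int x, int y, int size,
                 bool (*splitDecision)(int size),
                 std::vector<CU>& leaves) {
    if (size > 8 && splitDecision(size)) {
        int h = size / 2;                 // recurse into the four quadrants
        partitionCU(x,     y,     h, splitDecision, leaves);
        partitionCU(x + h, y,     h, splitDecision, leaves);
        partitionCU(x,     y + h, h, splitDecision, leaves);
        partitionCU(x + h, y + h, h, splitDecision, leaves);
    } else {
        leaves.push_back({x, y, size});   // leaf CU of the quadtree
    }
}
```

With a decision that always splits, a 64x64 CTU decomposes into sixty-four 8x8 leaf CUs; with one that never splits, the CTU itself is the single leaf.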
Three types of pictures are used in the H.265 standard: I frames, which use only intra prediction, and P and B frames, which are used in inter prediction (motion estimation and compensation). The output of the prediction block (the residual signal) is the difference between the predicted picture and the reference picture. This residual is transformed with a 2D DCT (two-dimensional Discrete Cosine Transform) or an ICT (Integer Cosine Transform) and then quantized; finally, the transformed and quantized coefficients are compressed using the CABAC (Context-Adaptive Binary Arithmetic Coding) algorithm, which involves binarization, context modelling and arithmetic coding [4].
The present work focuses on an optimized method of implementing the overall model proposed in [5] on ARM processors, to further accelerate the intra prediction encoding process for a more recent version of the HM reference software, HM16.7. First, the modifications of the intra prediction process are made in HM16.7. Second, for software performance testing, these modifications are run on a standard platform (Intel Core i7, 3.6 GHz, 4 GB of RAM) under the Linux (Ubuntu 16.04) operating system, following the common test conditions and software reference configurations provided by the JCT-VC [6]. Finally, an embedded implementation of the proposed models is performed on an octa-core ARM processor under an optimized Linux operating system. As a result, the encoding process on the ARM processor was accelerated by around 3 times compared to the original intra prediction process.
The rest of the paper is organized as follows. Related works are reviewed in section 2. Section 3 describes the complexity of the intra prediction algorithm in the HEVC/H.265 video compression standard. Section 4 provides the description of the proposed intra prediction algorithms. Section 5 analyzes the hardware implementation of the proposed model, with a brief presentation of the platform used; finally, we provide the experimental results and comparisons with other works.

RELATED WORKS
As mentioned in the introduction, the HEVC standard has introduced many methods that make this video compression standard more powerful. Thanks to these newly integrated methods, HEVC/H.265 can achieve double the compression ratio of H.264 at the same quality, but at the cost of a large increase in computational complexity. Many studies have evaluated the complexity of the HEVC standard. Earlier work [7] profiled the HEVC encoder on different platforms for the all-intra configuration (inter prediction disabled). The results show that, after the transform/quantization block, which consumes a large share of the encoding time through the RDOQ (Rate Distortion Optimized Quantization) process, the prediction block (intra and inter prediction) takes the largest amount of time (almost half of the total encoding time) compared to the other blocks. Concerning the random access configuration, where inter prediction is enabled, Bossen et al. [8] analyzed the complexity of HEVC for both the encoder and the decoder using different experimental tests. Their results show that the prediction block consumes more than 60% of the encoding time, especially in the Rate Distortion (RD) cost functions, represented by the TComRdCost class, which account for 40% of the encoding time. This high percentage demonstrates the complexity of the prediction block, and more particularly of the general coder control block, which is responsible for the decision and which most often uses the RD cost functions to decide the best partitioning and modes for the prediction. To reduce this complexity, an active area of research has concentrated on accelerating the intra prediction process. Venugopal et al. [9] proposed a fast intra prediction model based on rapid template matching for the HEVC standard.
The best template match is derived from the reconstructed samples by selecting the three best Template Matches (TMs) in the sense of minimizing the SSD (Sum of Squared Differences); the averaged superposition of these three best TMs is used as the prediction of the current Prediction Block (PB). This method yields a bit rate gain of 1.15% with a run-time increase of 33%. Another way to reduce the complexity of HEVC intra prediction is to reduce the number of modes verified in the decision process [8,9]. For instance, Xie et al. [10] performed a set of tests on splitting and not splitting CUs with different QP (Quantization Parameter) values, together with an analysis of the candidate list and of the modes most often selected as the best ones. Based on these tests, the authors propose two algorithms: the first is a fast CU division algorithm based on a set of predefined threshold values; the second is a fast intra mode decision algorithm that excludes some modes from the candidate list to reduce the computational complexity of the intra process. The combination of the two algorithms can save up to 49.6% of the encoding time and also improves the bit rate by 1.3%, but with a PSNR loss of 0.31 dB. Another work [11] was based on the correlation between the CTU texture and the optimum CU partition. The authors of [11] proposed two algorithms: CDRP (CTU Depth Range Prediction), which reduces the splitting of the CU partition, and IPMS (Intra Prediction Mode Selection), which optimizes the intra prediction decision process by minimizing the number of candidate modes. The combination of the two algorithms reduces the running time by 60% with a 1.45% increase in the bit rate. In another paper, Azgin et al.
[12] reported an optimized hardware architecture for intra prediction covering the 4 x 4, 8 x 8, 16 x 16 and 32 x 32 sizes for the angular prediction modes only. This block was then implemented on an FPGA. The distinctive characteristic of this work is that the multiplication function is implemented using DSP (Digital Signal Processor) blocks instead of adders and shifters. As a result, this hardware implementation consumes up to 36.66% less energy than the original one and can process up to 55 full HD (High Definition) video frames per second. Kibeya et al. [5] designed two fast intra prediction algorithms optimizing the RMD (Rough Mode Decision) process. The goal of their work was to reduce the encoding time needed for the intra prediction decision by minimizing the number of candidate modes evaluated in the decision stage. The modifications were performed on the HM10.0 (HEVC test Model) reference software on a standard platform with an Intel Core i7-3770 processor under the Windows 7 operating system. This work reduced the encoding time by an average of 46.13% compared to the original HEVC algorithm. More details of this model are presented in section 3.
Embedded implementations of such a powerful video compression standard as HEVC on ARM processors are a great challenge, because of its high performance, its powerful efficiency and its very high level of complexity, and because it targets a wide variety of mobile and consumer applications, including mobile phones and tablets. Almost all the ARM implementations of the HEVC standard performed in the last few years concern the decoding process, which is much easier than encoding. For instance, Smei et al. [13] exploit the parallel tools of the HEVC standard (tiles and slices) to perform a pipelined decoding method and implement it on a dual-core ARM platform (Zedboard) [13]. This method was able to reduce the decoding time by 30% compared to the sequential method. Another embedded implementation of the HEVC decoder applying an optimized parallel method was reported by Liu et al. [14] for multi-view video decoding using multi-threading. The results show that the proposed method is 5 times faster on the ARM platforms.

INTRA PREDICTION IN HEVC
The main goal of the intra prediction block [15] is to achieve a higher coding efficiency by minimizing the spatial redundancies between the adjacent samples using the reconstructed reference samples of each video frame. In the intra prediction, three main steps are executed as shown in Fig. 1.
After the construction of the reference sample array, the intra prediction is performed on the current sample using one of the 35 intra prediction modes shown in Fig. 2. Finally, in the post-processing step, a filter is applied on the current block to reduce the discontinuities between the current and the reference blocks. The 35 intra modes (see Fig. 2) used in the second step (sample prediction) consist of:
- the planar mode (No. 0) and the DC mode (No. 1), used mainly for smooth regions;
- 33 angular prediction directions (No. 2-34), mostly used for objects with directional structures. This type of intra mode is obtained by projecting the current sample onto the reference sample array along one of these 33 directions.
Before executing these three steps, the best intra mode must be chosen with minimal overhead; that is, among the 35 intra prediction modes, we have to choose the one giving the best trade-off between a low distortion and a low bit rate. In that regard, three steps are performed:
1. Before the full Rate Distortion Optimization (RDO), a rough mode decision based on the Hadamard transform is executed for all possible intra prediction modes by calculating the RD-cost (equation (1)) [17] from the SATD (Sum of Absolute Transformed Differences, equation (2)):

J_pred = SATD + λ × R, (1)

where SATD is the distortion, λ the Lagrangian multiplier parameter and R the bit rate needed to encode the prediction mode;

SATD = (1/2) × Σ_{i,j} |(H × (C − R) × H)_{i,j}|, (2)

where C_{i,j} are the current pixels, R_{i,j} the reference pixels and H the Hadamard transform matrix. As a result, a Prediction Mode Table (PMT) of the best candidates is generated.
2. The RD-cost (equation (3)) of each candidate in the PMT is computed [17]:

J_mode = SSD + λ × R, (3)

where SSD is the sum of squared differences between the current and the original samples, λ the Lagrangian multiplier parameter and R the number of bits needed to encode the prediction mode.
3. The mode with the minimal RD-cost is selected as the best intra prediction mode.
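The SATD used in the rough mode decision can be sketched in C++ for a 4x4 block: the residual C − R is transformed with the 4x4 Hadamard matrix H on both sides, and the absolute transform coefficients are summed and halved. The function and type names (`satd4x4`, `Block4`) are illustrative, not HM's actual API.

```cpp
#include <cassert>
#include <cstdlib>
#include <array>

using Block4 = std::array<std::array<int, 4>, 4>;

// SATD of a 4x4 block: (1/2) * sum of |H * (C - R) * H|.
int satd4x4(const Block4& cur, const Block4& ref) {
    static const int H[4][4] = {
        { 1,  1,  1,  1 },
        { 1, -1,  1, -1 },
        { 1,  1, -1, -1 },
        { 1, -1, -1,  1 },
    };
    int d[4][4], t[4][4];
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            d[i][j] = cur[i][j] - ref[i][j];      // residual C - R
    for (int i = 0; i < 4; ++i)                    // t = H * d
        for (int j = 0; j < 4; ++j) {
            t[i][j] = 0;
            for (int k = 0; k < 4; ++k) t[i][j] += H[i][k] * d[k][j];
        }
    int sum = 0;
    for (int i = 0; i < 4; ++i)                    // accumulate |t * H|
        for (int j = 0; j < 4; ++j) {
            int m = 0;
            for (int k = 0; k < 4; ++k) m += t[i][k] * H[k][j];
            sum += std::abs(m);
        }
    return sum / 2;                                // halve, as in equation (2)
}
```

For identical blocks the SATD is 0; a constant residual of 1 on every pixel concentrates all energy in the DC coefficient and gives an SATD of 8.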

DESCRIPTION OF THE PROPOSED FAST INTRA PREDICTION PROCESS
The modifications performed in the intra prediction process consist of three steps executed before choosing the best intra prediction mode. For that, two algorithms are designed:

FAST INTRA-PREDICTION ALGORITHM BASED ON EARLY DETECTION OF ZERO TRANSFORM AND QUANTIZED COEFFICIENTS
This method uses threshold values to decide the best mode among the 35 modes without assessing all of them. The thresholds assumed here are the same as those reported by Kibeya et al. [5]. The steps described in Fig. 3 are then followed to find the best mode among the 35 possible ones.
The SATD is calculated for each candidate i in turn: if the calculated SATD value is smaller than the threshold, this mode is selected as the best one and the search stops; otherwise, the SATD is calculated for the next candidate.
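The early-termination rule described above can be sketched as follows. The names and the threshold value are placeholders for illustration; the actual per-size thresholds come from [5] and are not reproduced here.

```cpp
#include <cassert>
#include <vector>
#include <climits>

// Candidate modes are scanned in order; the first mode whose SATD falls
// below the threshold is accepted immediately, skipping the rest.
struct ModeCost { int mode; int satd; };

// Returns the chosen mode; 'evaluated' reports how many SATDs were computed.
int chooseModeEarlyExit(const std::vector<ModeCost>& candidates,
                        int threshold, int* evaluated) {
    int bestMode = -1, bestSatd = INT_MAX;
    *evaluated = 0;
    for (const ModeCost& c : candidates) {
        ++*evaluated;                    // one SATD computation per candidate
        if (c.satd < bestSatd) { bestSatd = c.satd; bestMode = c.mode; }
        if (c.satd < threshold) break;   // early exit: good enough mode found
    }
    return bestMode;
}
```

The point of the sketch is the break: with a well-chosen threshold, most blocks terminate after a few candidates instead of evaluating all 35 modes.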

FAST INTRA-PREDICTION ALGORITHM BASED ON REFINEMENT OPERATION
As mentioned in [5], a series of statistical experiments was performed using different Qp values (22, 27, 32 and 37) for all classes (A, B, C, D and E). From these experiments, two principal observations were made: the intra modes most often selected include planar, DC, purely horizontal and purely vertical; and the set of most used modes depends on the PU size. Table 1 summarizes the most used modes for each PU size, as reported in [5]. The idea emerging from these results is to reduce the number of intra modes in the RMD process and to calculate the SATD-based RD-cost only for the modes most used for each PU size (shown in Table 1). The method is detailed in the following steps: 1. Prediction stage: instead of calculating the Hadamard transform for all 35 modes as in the original intra prediction, the proposed model performs these computations only on the candidate modes listed in Table 1. The mode with the minimal SATD is selected as the best one and stored in a variable called "first-best-mode".
2. Refinement stage: the goal of this step is to refine the first one by calculating the SATD of the two neighboring directional modes; as a result, an R-list composed of three candidates is generated. More details concerning this step are presented in Fig. 4. 3. Final decision stage: this last step is similar to that of the original HEVC method (described in section 2): the SSD-based RDO cost is calculated for the three modes of the R-list obtained from the second step, and the mode with the minimal RD-cost is finally selected as the best intra mode.
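The three stages above can be sketched in C++. The cost callbacks stand in for the encoder's real SATD and SSD-based RD computations, and the neighbor clamping for non-angular modes is a simplifying assumption of this sketch, not the paper's exact rule.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>
#include <functional>

int decideBestMode(const std::vector<int>& candidates,
                   const std::function<double(int)>& satdCost,
                   const std::function<double(int)>& rdCost) {
    // Stage 1 (prediction): first-best-mode = argmin SATD over the
    // reduced candidate set (Table 1), not all 35 modes.
    int firstBest = candidates.front();
    for (int m : candidates)
        if (satdCost(m) < satdCost(firstBest)) firstBest = m;

    // Stage 2 (refinement): R-list = first-best-mode plus its two
    // neighboring angular directions (modes 2..34; clamped at the ends).
    std::vector<int> rList = { firstBest };
    if (firstBest > 2) rList.push_back(firstBest - 1);
    if (firstBest >= 2 && firstBest < 34) rList.push_back(firstBest + 1);

    // Stage 3 (final decision): minimal SSD-based RD-cost over the R-list.
    int best = rList.front();
    for (int m : rList)
        if (rdCost(m) < rdCost(best)) best = m;
    return best;
}
```

Only the three R-list modes ever reach the expensive RDO stage, which is where the encoding-time saving comes from.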

HM (HEVC TEST MODEL) SOFTWARE
In order to demonstrate and study the coding performance of the HEVC standard, the JCT-VC developed a reference software, the HEVC test Model (HM), with various features that help users to check compliance.
HM is provided as source code developed in C++ that includes both encoder and decoder functionalities and can be built on various platforms. The modifications performed in this work concern the HM16.7 version. The JCT-VC has introduced a set of common test conditions and software reference configurations to be used in experiments, in order to make the comparison of experimental results easier. Three configuration files are commonly used: in All Intra, all pictures are coded as I frames using only the intra prediction process; in Random Access and Low Delay, both intra and inter prediction are used. In this paper, the experiments are performed using only the All Intra configuration file, to test the intra prediction process with different test sequences (classes A to E) and different quantization parameter values.
To run HM16.7 on the ARM platform, a cross-compilation of HM must first be performed on a host machine, in order to generate an executable file supported by the target machine (the ARM processor). For that, the cross toolchain corresponding to the target platform must be installed; the binary files of the HEVC encoder are then copied to the SD card and executed with the different test sequences.

DESCRIPTION OF THE ARM PLATFORM
The development platform selected for our embedded implementation is the Banana Pi M3 board [18], a very low-cost single-board computer that supports a variety of operating systems, including Android and Arch Linux.
The operating system used in this work is embedded Linux, chosen for its low cost (freely available source code), ease of customization and stable kernel; it is installed on the SD card. We designed a Linux operating system from scratch, scaled so that it consumes little power and few hardware resources by reducing the size of the kernel image, so that the execution of HM16.7 benefits from the eight cores of the platform and a large amount of memory remains available to run the code. This approach effectively optimized the implementation of HM16.7.
After preparing and booting the image using a dumb-terminal emulation program, the Linux operating system is started and the /home directory can be chosen to execute the HEVC software.

OVERVIEW
After testing the two algorithms (algorithm 1 and algorithm 2) implemented in HM16.7 on the Intel processor, different implementation approaches were analyzed and tested on ARM processors in order to highlight the performance of the algorithms on this platform using the optimized embedded Linux system. The test sequences used in both experiments are described in Table 2. For each platform, we performed 20 different runs for the original intra algorithm of HM16.7 and 20 for the modified fast intra algorithm. The 20 runs used several test sequences of different classes, from A to E (Table 2), with different quantization parameter values (22, 27, 32 and 37) according to the common JCT-VC test conditions, and the All-Intra-Main configuration file. The parameters FEN (fast encoder decision), FDM (fast decision for merge RD cost), RDOQ (Rate Distortion Optimized Quantization), SAO (Sample Adaptive Offset) and AMP (Asymmetric Motion Partitions) are all enabled. The rest of the parameters are described in the All-Intra configuration file [19].
In order to compare the performance of the modified model with that of the original one, we used the Bjontegaard delta bit rate (BDBR) and the Bjontegaard delta peak signal-to-noise ratio (BDPSNR) [20] to evaluate both the bit rate efficiency and the quality of the output video encoded with the proposed method. The YUV PSNR is calculated as follows:

PSNR_YUV = (6 × PSNR_Y + PSNR_U + PSNR_V) / 8, (4)

where PSNR_Y is the PSNR of the luma component, and PSNR_U and PSNR_V are the PSNRs of the chroma components.
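Equation (4) is a simple weighted average in which the luma PSNR counts six times as much as each chroma PSNR; as a minimal helper (function name is ours, not HM's):

```cpp
#include <cassert>

// Weighted YUV PSNR of equation (4): luma weighted 6x vs. each chroma.
double psnrYuv(double psnrY, double psnrU, double psnrV) {
    return (6.0 * psnrY + psnrU + psnrV) / 8.0;
}
```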
Additionally, the computational complexity of the proposed intra algorithm is evaluated against the original one in terms of encoding time, calculated using the equation below:

ΔT = (T_org − T_prop) / T_org,

where T_org is the encoding time of the original model and T_prop the encoding time of the proposed model.
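The ΔT formula above, expressed as a percentage, is the time-saving metric reported in the tables; a one-line helper (name is ours):

```cpp
#include <cassert>

// Encoding-time saving of the proposed model vs. the original, in percent.
double timeSavingPercent(double tOrig, double tProp) {
    return (tOrig - tProp) / tOrig * 100.0;
}
```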

RESULTS AND COMPARISONS
The experimental results, presented in Tables 3 and 4, are obtained using the method described in section 5.1. Table 3 presents the experimental results on the Intel platform (Intel Core i7, 3.6 GHz, 4 GB of RAM) under the Linux (Ubuntu 16.04) operating system.
The encoding time and the bit rate are improved by up to 67% and 3%, respectively, compared to the original intra prediction algorithm, while the quality of the output video is decreased by a very small percentage, 0.05% on average. In comparison with the results of Kibeya et al. [5], the proposed intra prediction algorithms perform better in the HM16.7 reference software than in the HM10.0 version: the average encoding time saving is 64.898% for the present implementation versus 46.13% for the implementation of Kibeya et al. [5], an additional improvement of around 20%. Table 4 gives the encoding time, BDBR and BDPSNR results of the proposed models compared with the original model on the ARM platform, based on the Bjontegaard method. As can be seen, the proposed model achieves an average encoding time gain of 61.5% compared to the original HM16.7. This shows that the embedded implementation of these algorithms on the ARM platform reduces the computational complexity by a maximum of 63% and a minimum of 59.31%. Additionally, the bit rate changes by a maximum of 2.1% in BDBR and a minimum of 0.21%. On the other hand, the loss in PSNR is a negligible 0.049%. From these results, we can conclude that the embedded implementation of the intra prediction algorithms on ARM processors performs very well, speeding up the encoding process by around 3 times.
For classes C, D and E, we can see in Table 4 that the quantization parameter (Qp) value has an influence on the bit rate and the encoding time: when the Qp value increases, the bit rate increases but the encoding time decreases. Table 5 gives the results of the embedded implementations of the original and the proposed HM16.7 intra prediction algorithms on the ARM platform, in comparison with previously reported results [5]. In the work of Kibeya et al. [5], the fast intra prediction model was executed on an Intel Core i7-3770 @ 3.4 GHz CPU with 12 GB of RAM for the HM10.0 version. The bit rate, PSNR and encoding time saving comparisons are given in Table 5. With the optimized Linux operating system used to run the fast intra prediction algorithms, the results show that the encoding time saving is better on the ARM processor, with a gain of about 15.37%, in addition to the significant bit rate gain of 1.19%; the loss in PSNR is also lower here, by a factor of 0.2%. Another comparison of the encoding time, bit rate and PSNR of the proposed implementation with the results reported in [10] and [11] is performed. The results, shown in Table 6, demonstrate the performance in terms of acceleration and video quality of our embedded algorithm compared to state-of-the-art fast methods. As can be seen, the average time saving of the implemented method exceeds that of the algorithms proposed by Xie et al. [10] by 12% and that of Zhu et al. [11] by 1.86%. Additionally, compared to these methods, we obtain a lower degradation of the PSNR of the output video (Table 6). On the other hand, their bit rates are better optimized, but by negligible values (0.11% and 0.26%, respectively, for Xie et al. [10] and Zhu et al. [11]).

CONCLUSION
This paper presents an ARM platform implementation of algorithms concerning the selection of the best mode among the 35 intra modes defined in the intra prediction block of the HEVC/H.265 video encoding standard. The focus has been on the RMD process, where the number of verified modes is reduced in order to minimize the computational complexity of the HEVC standard. These methods had previously been applied to HM10.0 on an Intel Core platform and achieved an average encoding time reduction of 46.13% compared to the original algorithm.
In this paper, this method is applied to HM16.7 instead of HM10.0 and is implemented on an ARM architecture under an optimized embedded Linux operating system. As a result, the run time was reduced by an average of 61.5%, with a maximum bit rate increase of 2.1% in the HM16.7 reference software.
Our short-term perspective is to optimize further, in terms of both the algorithm and the embedded implementation, with the objective of reaching a real-time implementation of the HEVC/H.265 standard on different platforms by optimizing the most complex functionalities of this standard.