Performance Evaluation of H.264/AVC Decoders



Wei-Yao Tai†, Tang-Hsun Tu‡, and Chih-Wen Hsueh
Embedded System and Wireless Network Laboratory
†Graduate Institute of Computer Science and Information Engineering
‡Graduate Institute of Networking and Multimedia
National Taiwan University
Taipei, Taiwan 106, R.O.C.
E-mail: {r96922110, r96944013, cwhsueh}@ntu.edu.tw


Abstract

  In recent years, considerable research effort in multimedia has been devoted to improving the performance of H.264/AVC decoding. Multicore platforms in particular have been widely adopted in multimedia systems to improve performance. We therefore focus on accelerating decoding on multicore platforms: we adopt existing multi-threaded decoding approaches and port the decoder to Linux. We evaluate the performance on single-core, dual-core, and quad-core platforms. Experimental results show that our decoder achieves speedups of 262%, 328%, and 333%, respectively, over the standard reference software, and that its performance is close to that of the decoder running on Windows on the same platforms. We also compare the same source code compiled with the Intel compiler and with gcc. This report thus compares the same decoding approach running on Windows XP and Linux 2.6, built with different compilers, on different multicore platforms. The experimental results are a useful reference for H.264 applications on Linux multicore platforms.
  Keywords: multimedia, H.264, performance evaluation, compiler, Linux, Windows XP, multi-threading, multicore

1 Introduction

  H.264, or Advanced Video Coding (AVC) [6], is a standard for video compression. In recent years, considerable research effort in multimedia has been devoted to improving the performance of H.264/AVC decoding, and multicore platforms have been widely adopted in multimedia systems for the same purpose. Since the first applications of H.264, parallelized decoding on multicore platforms has been increasingly expected to improve performance. Because the H.264 decoding components are tightly coupled, however, only a few parallelization efforts have been effective. As shown in Figure 1, the decoder in [10] was successfully modified to parallelize the execution of the deblocking filter component. It achieves overall speedups of 21% and 34% on dual-core and quad-core platforms, respectively, over a single-core platform. However, this was done only on Windows XP.
  We therefore focus on accelerating decoding on multicore platforms. We adopt the multi-threaded decoding approaches of [8, 9], which yield significant performance improvements on multicore platforms, and port the decoder to Linux, which might be more cost-effective in many embedded applications. We evaluate the performance on single-core, dual-core, and quad-core platforms. Experimental results show that our decoder achieves speedups of 262%, 328%, and 333%, respectively, over the reference software [3], and that its performance is close to that of the decoder running on Windows XP on the same platforms.
  In evaluating the performance on the Windows and Linux operating systems, we found that the compiler plays an important role in optimizing performance. We therefore also compare the performance of the same source code compiled with the Intel compiler [5] and with the gcc compiler [1]. This report provides comparisons of the same state-of-the-art decoding approach running on Windows XP and Linux 2.6, built with different compilers, on different multicore platforms.
  Section 2 describes the experiment design, Section 3 presents the results, and Section 4 concludes this report.

2 Experiment Design

  Figure 1 shows the block diagram of H.264 decoding. The diagram is drawn along a time axis, so that vertically overlapping components can be executed in parallel. The main components are entropy decoding, inverse quantization, the inverse discrete cosine transform (IDCT), prediction (inter-frame motion compensation and intra prediction), and the deblocking filter [8]. Since all of the components except the deblocking filter are executed sequentially, it is difficult to apply multi-threading to exploit parallelism and speed up decoding. We therefore focus on parallelizing the deblocking filter. Starting from the H.264 standard reference software [3], the H.264 decoder has been optimized with many advanced features [9, 11].
Figure 1: Block diagram of H.264 decoding.
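  How much a decoder can gain when only the deblocking filter runs in parallel can be estimated with Amdahl's law. The sketch below is not part of the original experiments: it assumes that a fixed fraction p of the single-core decoding time parallelizes perfectly and that synchronization is free, and the function names are illustrative. Fitting p to the 21% dual-core speedup reported for the decoder in [10] gives p of roughly 0.35, and the same p predicts a quad-core speedup of about 35%, close to the reported 34%.

```python
def amdahl_speedup(p, n):
    """Overall speedup on n cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(speedup, n):
    """Invert Amdahl's law: the fraction p that yields `speedup` on n cores."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Fit p to the 21% dual-core speedup quoted from [10], then predict quad-core.
p = parallel_fraction(1.21, 2)
quad = amdahl_speedup(p, 4)
print(f"parallel fraction ~ {p:.3f}, predicted quad-core speedup ~ {quad:.3f}x")
```

  Under these assumptions the speedup can never exceed 1 / (1 - p), however many cores are added, which is why parallelizing a single component yields a bounded overall gain.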

  We borrow the decoder with the parallel deblocking filter algorithm [10], port it from Windows XP to Linux 2.6.20.17, and run it on a quad-core machine with an Intel(R) Core(TM)2 Quad CPU at 2.4 GHz, 4 MB of cache, 1 GB of RAM, and a 160 GB hard disk. The CPU cores can be disabled individually, so the same machine can serve as a platform with fewer cores: we disable two cores to obtain a dual-core machine and three cores to obtain a single-core machine. We also upgrade the Linux kernel to the latest version, 2.6.26.3, for performance comparison. Multi-threading is implemented with the open POSIX [4] Threads library [2], pthread. Note that the pthread library for Windows [7] is a different implementation. We run the deblocking filter component with 1, 2, and 4 threads on the single-core, dual-core, and quad-core platforms, respectively.
  We also adopt the ten quite different H.264 bitstreams of [10], listed in Table 1, for our experiments. The bitstreams cover three resolution levels, 1280×720 (720p), 720×480 (480p), and 352×288 (CIF), and are encoded in the H.264/AVC baseline profile with the reference software JM73. For each video stream and each combination of compiler, operating system, and multicore platform, we report the frames per second (FPS) averaged over 1000 runs.
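  As an illustration of running one component with 1, 2, or 4 threads, the following sketch splits the macroblock rows of a frame into contiguous bands and hands each band to one worker thread. It is a structural sketch only, written in Python for brevity rather than in C with pthread: `split_rows`, `filter_rows`, and `run_filter` are hypothetical names, the per-row work is a placeholder, and the band partitioning is an assumption that does not reproduce the parallel deblocking algorithm of [10].

```python
import threading


def split_rows(total_rows, n_threads):
    """Split row indices 0..total_rows-1 into n_threads contiguous ranges."""
    base, extra = divmod(total_rows, n_threads)
    ranges, start = [], 0
    for i in range(n_threads):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def filter_rows(rows, done):
    """Placeholder for filtering a band of macroblock rows (hypothetical)."""
    for row in rows:
        done[row] = True  # each thread writes only to its own rows


def run_filter(total_rows, n_threads):
    """Process all rows with n_threads workers and wait for them to finish."""
    done = [False] * total_rows
    workers = [
        threading.Thread(target=filter_rows, args=(rows, done))
        for rows in split_rows(total_rows, n_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # the frame is complete only after every band is done
    return done


# A 1280x720 frame has 720 / 16 = 45 macroblock rows.
for n in (1, 2, 4):
    assert all(run_filter(45, n))
```

  The join at the end of every frame is the synchronization point: each additional thread shortens the bands but adds one more worker to wait for.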
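  The averaging step can be written down explicitly. The sketch below assumes that each run yields the wall-clock time needed to decode a known number of frames and that the reported figure is the arithmetic mean of the per-run FPS values. The report does not state how the timings were collected, and the function names and sample numbers are made up for illustration.

```python
def fps(frames, seconds):
    """Frames per second for a single decoding run."""
    return frames / seconds


def mean_fps(frames, run_times):
    """Arithmetic mean of per-run FPS over all recorded run times."""
    rates = [fps(frames, t) for t in run_times]
    return sum(rates) / len(rates)


# Hypothetical example: three runs of a 300-frame stream (not measured data).
run_times = [2.0, 2.5, 2.0]
print(f"{mean_fps(300, run_times):.1f} FPS")
```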

3 Experiment Result

  In the following performance figures, the y axis is the frame rate in frames per second (FPS) and the x axis is the index of the 10 video streams. We use four-character legends to denote the experiment configurations. As summarized in Table 2, the first character, 'l' or 'd', stands for the reference software (ldecod) or our decoder, respectively. The second character, 'i' or 'g', stands for compilation with the Intel C compiler (icc) or the GNU C compiler (gcc), respectively. The third character is the number of cores. The fourth character denotes the operating system. For example, "lg2a" stands for the reference software compiled with gcc and running on a dual-core machine under Windows XP.
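  The legend scheme is simple enough to decode mechanically. The sketch below covers only what is stated in the text: Table 2 is not reproduced here, so the only operating-system code it maps is 'a' (Windows XP, taken from the "lg2a" example), and any other code is passed through with a pointer to Table 2. The function name `decode_legend` is illustrative.

```python
DECODERS = {"l": "reference software (ldecod)", "d": "our decoder"}
COMPILERS = {"i": "icc", "g": "gcc"}
# Only 'a' is given in the text; the other codes are defined in Table 2.
KNOWN_OS = {"a": "Windows XP"}


def decode_legend(legend):
    """Split a four-character legend such as 'lg2a' into its four fields."""
    if len(legend) != 4:
        raise ValueError("legend must have exactly four characters")
    decoder, compiler, cores, os_code = legend
    return {
        "decoder": DECODERS[decoder],
        "compiler": COMPILERS[compiler],
        "cores": int(cores),
        "os": KNOWN_OS.get(os_code, f"code '{os_code}' (see Table 2)"),
    }


print(decode_legend("lg2a"))
```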

  Figure 2(a) shows the experiment results on the single-core platform. Our decoders on Linux are 3 to 4 times faster than the reference software, and the best ones running on Linux are close to those running on Windows XP. We run further experiments with the reference software and our decoder, both compiled with icc and gcc, on Linux 2.6.26.3. As shown in Figure 2(b), icc is 20% to 30% faster than gcc for our decoder on the single-core machine. The compiler effect is less significant for the reference software, where icc is 15% to 20% faster than gcc. Moreover, our decoder gains up to a 250% speedup over the reference software with both compilers.
  The kernel effect with the GNU C compiler is shown in Figure 2(c). The reference software is not much affected by the kernel version, whereas our decoder is 35% to 40% faster on the new Linux kernel 2.6.26.3 than on the old Linux kernel 2.6.20.17. As shown in Figure 2(d), the results with the Intel C compiler are similar to those with the GNU C compiler. Moreover, our decoder performs similarly on Linux 2.6.26.3 and Windows XP on a single core. As the figures show, on a single core our decoder performs best when compiled with the Intel C compiler and run on Linux kernel 2.6.26.3 or Windows XP.
  Next, we modify our decoder to run in parallel on multicore platforms, using multi-threading for the deblocking filter, in the best Linux environment. As shown in Figure 2(e), the dual-core and quad-core runs are both 15% to 20% faster than the single-core run, and the best performance is 300% faster than the reference software. Most of the results show that the more cores are used, the larger the gain in decoding performance. However, the dual-core experiments run faster than the quad-core ones for bitstreams 8, 9, and 10. Since these three bitstreams are small, the reason might be that for small bitstreams the synchronization overhead outweighs the speedup. Figure 2(f) further shows that our decoder performs similarly on Linux 2.6.26.3 and Windows XP for each number of cores. Synchronization overhead might also explain why, on the dual core, our decoder on Linux slightly outperforms the one on Windows XP, whereas on the quad core the opposite holds.
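  Percentages and "times faster" figures are easy to confuse, so it helps to fix a convention. The sketch below assumes that a speedup of x% means the decoder's FPS is (1 + x/100) times the baseline FPS. Under that reading, the 262% single-core speedup quoted in the introduction corresponds to 3.62 times the reference FPS, which agrees with the 3-to-4-times range reported for Figure 2(a). The report does not state its definition, and the FPS values in the example are placeholders, not measurements.

```python
def speedup_percent(fps_new, fps_ref):
    """Relative FPS gain over the baseline, in percent (assumed definition)."""
    return (fps_new / fps_ref - 1.0) * 100.0


def fps_ratio(percent):
    """FPS ratio implied by a speedup percentage under the same definition."""
    return 1.0 + percent / 100.0


# The single-core figure from the introduction: 262% -> about 3.62x.
print(f"{fps_ratio(262):.2f}x")
# Placeholder FPS values, not measured data.
print(f"{speedup_percent(36.2, 10.0):.0f}%")
```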
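  The synchronization argument can be made concrete with a toy cost model. The sketch below is an illustration under stated assumptions, not a measurement: it models the time per frame as a serial part, a parallel part divided evenly among threads, and a fixed synchronization cost for every thread beyond the first. All names and numbers are hypothetical. With these numbers, four threads win on the larger frame but lose to two threads on the smaller one, which is the pattern observed for the small bitstreams.

```python
def frame_time(t_serial, t_parallel, n_threads, sync_cost):
    """Modelled time per frame: serial work + split parallel work + sync."""
    return t_serial + t_parallel / n_threads + sync_cost * (n_threads - 1)


def best_thread_count(t_serial, t_parallel, sync_cost, choices=(1, 2, 4)):
    """Thread count from `choices` with the lowest modelled frame time."""
    return min(choices, key=lambda n: frame_time(t_serial, t_parallel, n, sync_cost))


# Hypothetical per-frame costs in milliseconds (not measured data).
SYNC = 0.05
large = dict(t_serial=6.5, t_parallel=3.5)    # e.g. a high-resolution frame
small = dict(t_serial=0.65, t_parallel=0.35)  # e.g. a low-resolution frame

print("large frame:", best_thread_count(sync_cost=SYNC, **large), "threads")
print("small frame:", best_thread_count(sync_cost=SYNC, **small), "threads")
```

  In this model, going from two to four threads pays off only when the parallel work per frame exceeds eight times the per-thread synchronization cost, so the break-even point moves with frame size.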
Figure 2: Decoding performance in frames per second (FPS) for the ten bitstreams, panels (a) to (f).

4 Conclusion

  We present experiment results for an H.264/AVC decoder, modified from the reference software JM73, in which the deblocking filter is parallelized with multi-threading, on single, dual, and quad cores. The experiments also cover the decoder compiled with the Intel C and GNU C compilers and running on Windows XP and on Linux kernels 2.6.20.17 and 2.6.26.3. The results show that the Linux kernel version, the compiler, and the number of CPU cores all affect the overall performance. Our decoder can outperform the reference software by up to 300% and performs similarly on the Linux and Windows platforms. With a fixed compiler, operating system, and number of cores, parallelization and synchronization might be the most significant factors in improving performance on multicore platforms. Moreover, how multiple threads are scheduled also affects the performance. There are still many directions for future work on improving the performance of H.264/AVC decoding on multicore platforms, such as software pipelining.

References

[1] GCC, the GNU Compiler Collection. http://gcc.gnu.org/.
[2] GNU C Library. http://www.gnu.org/software/libc.
[3] H.264/AVC JM Reference Software. http://iphome.hhi.de/suehring/tml.
[4] IEEE POSIX Certification Authority. http://standards.ieee.org/regauth/posix.
[5] Intel(R) Compilers. http://www.intel.com/cd/software/products/asmo-na/eng/compilers/284132.htm.
[6] ISO/IEC 14496-10, International Standard of Joint Video Specification. Coding of Audiovisual Objects-Part 10: Advanced Video Coding. 2003.
[7] POSIX Threads (pthreads) for Win32. http://sourceware.org/pthreads-win32. 2005.
[8] I. E. Richardson. H.264 and MPEG-4 Video Compression. Wiley, 1st edition, ISBN 0-470-84837-5, Aug. 2003.
[9] S.-W. Wang, Y.-T. Yang, C.-Y. Li, Y.-S. Tung, and J.-L. Wu. The Optimization of H.264/AVC Baseline Decoder on Low-Cost TriMedia DSP Processor. Proceedings of SPIE, 5558, 2004.
[10] S.-S. Yang, S.-W. Wang, and J.-L. Wu. A Parallel Algorithm for H.264/AVC Deblocking Filter Based on Limited Error Propagation Effect. IEEE International Conference on Multimedia and Expo, pages 1858–1861, Jul. 2007.
[11] X. Zhou, E. Q. Li, and Y.-K. Chen. Implementation of H.264 Decoder on General-Purpose Processors with Media Instructions. Proceedings of SPIE Conference on Image and Video Communication and Processing, 5022, Jan. 2003.

technique_report/wuh264/h264_tech_report.txt · Last modified: 2009/04/10 14:42 by crilit