THE ART AND SCIENCE OF GPU AND MULTI-CORE PROGRAMMING
DOI: https://doi.org/10.47839/ijc.11.1.552

Keywords: Performance, Speedup, Parallelism, Temporal and Spatial Locality.

Abstract

This paper examines the computational programming issues that arise from the introduction of GPUs and multi-core computer systems. The discussion and analysis examine the implications of two principles, spatial and temporal locality, that provide useful metrics to guide programmers in designing and implementing efficient sequential and parallel application programs. Spatial and temporal locality represent a science of information flow and are relevant to the development of highly efficient computational programs. The art of high-performance programming is to take combinations of these principles, unravel the bottlenecks and latencies associated with each manufacturer's architecture, and develop appropriate coding and/or task-scheduling schemes to mitigate or eliminate those latencies.