THE ART AND SCIENCE OF GPU AND MULTI-CORE PROGRAMMING
DOI: https://doi.org/10.47839/ijc.11.1.552

Keywords: Performance, Speedup, Parallelism, Temporal and Spatial Locality.

Abstract

This paper examines the computational programming issues that arise from the introduction of GPUs and multi-core computer systems. The discussion and analysis examine the implications of two principles, spatial and temporal locality, that provide useful metrics to guide programmers in designing and implementing efficient sequential and parallel application programs. Spatial and temporal locality represent a science of information flow and are relevant to the development of highly efficient computational programs. The art of high-performance programming is to take combinations of these principles, unravel the bottlenecks and latencies associated with each manufacturer's architecture, and develop appropriate coding and/or task-scheduling schemes to mitigate or eliminate those latencies.