Abstract
Deep learning (DL) is one of the key technologies in the artificial intelligence (AI) domain. Deep learning neural networks (DLNN) profit greatly from the overall exponential data growth, while at the same time the computational effort for training and inference increases strongly. Most of the computational time in DLNNs is consumed by the convolution step, which is based on a general matrix multiplication (GEMM). To accelerate DLNN computation, various highly optimized GEMM implementations for graphics processing units (GPUs) have been presented in recent years [1]. Most of these approaches are GPU-hardware-specific implementations of the GEMM software kernel and do not take into account the performance dependency on the layout of the training data. To achieve maximum performance, the parameters of the GEMM algorithm have to be tuned both for the specific GPU hardware and for the data layout of the training task. In this paper we present a two-step autotuning approach for GPU-based GEMM algorithms. In the first step, the kernel parameter search space is pruned by several performance criteria; the remaining candidates are then processed by a modified simulated annealing in order to find the best kernel parameter combination with respect to the GPU hardware and the task-specific data layout. We evaluated the proposed approach on 160 different input problems and achieved an average speedup of around 12 over the state-of-the-art implementation from NVIDIA (cuBLAS) on an NVIDIA GTX 1080 Ti accelerator card.
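The abstract describes the two tuning steps only at a high level. The following minimal Python sketch illustrates the general idea of such a two-step tuner: an exhaustive pruning pass over a kernel parameter space, followed by simulated annealing over the surviving candidates. All parameter names, pruning criteria, and the cost model here are illustrative assumptions, not the paper's actual implementation; in a real tuner, benchmark() would compile and time the GEMM kernel on the target GPU for the task-specific matrix layout.

import itertools
import math
import random

# Hypothetical kernel parameter space (tile sizes and thread-block shape);
# the paper's actual parameter set is not given in the abstract.
PARAM_SPACE = {
    "tile_m": [16, 32, 64, 128],
    "tile_n": [16, 32, 64, 128],
    "tile_k": [4, 8, 16],
    "threads_x": [8, 16, 32],
    "threads_y": [8, 16, 32],
}

MAX_THREADS_PER_BLOCK = 1024  # CUDA hardware limit, used as one pruning criterion


def prune(space):
    # Step 1: enumerate all combinations and drop those that violate
    # simple hardware/performance criteria (thread limit, tiles not
    # divisible by the thread-block shape).
    candidates = []
    for tm, tn, tk, tx, ty in itertools.product(*space.values()):
        if tx * ty > MAX_THREADS_PER_BLOCK:
            continue
        if tm % tx != 0 or tn % ty != 0:
            continue
        candidates.append((tm, tn, tk, tx, ty))
    return candidates


def benchmark(params):
    # Stand-in cost model so the sketch runs; a real tuner would compile
    # the GEMM kernel with `params` and time it on the target GPU.
    tm, tn, tk, tx, ty = params
    return abs(tm - 64) + abs(tn - 64) + abs(tk - 8) + abs(tx * ty - 256) / 32


def anneal(candidates, steps=500, t0=10.0, cooling=0.99):
    # Step 2: simulated annealing over the pruned candidates. Worse
    # candidates are accepted with Boltzmann probability exp(-delta/T),
    # so the search can escape local optima while the temperature is high.
    current = random.choice(candidates)
    current_cost = benchmark(current)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        neighbour = random.choice(candidates)
        cost = benchmark(neighbour)
        if cost < current_cost or random.random() < math.exp((current_cost - cost) / t):
            current, current_cost = neighbour, cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost


if __name__ == "__main__":
    pruned = prune(PARAM_SPACE)
    params, cost = anneal(pruned)
    print(f"best parameters: {params}, cost: {cost:.2f}")

Restricting the annealing moves to the pruned set keeps every evaluated configuration launchable on the target hardware, which is presumably why the pruning pass precedes the stochastic search.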
References
[1] Bergstra, J., Bastien, F., Breuleux, O., Lamblin, P., Pascanu, R., Delalleau, O., Desjardins, G., Warde-Farley, D., Goodfellow, I., Bergeron, A., et al.: Theano: Deep Learning on GPUs with Python. NIPS 2011, BigLearning Workshop, Granada, Spain (2011)
[2] Kurzak, J., Tomov, S., Dongarra, J.: Autotuning GEMMs for Fermi (2011)
[3] Garcia, V., Debreuve, E., Barlaud, M.: Fast k Nearest Neighbour Search Using GPU. Available at: http://vincentfpgarcia.github.io/kNN-CUDA/, accessed 13.06.2018
[4] MAGMA project page. http://icl.cs.utk.edu/magma/, accessed 13.06.2018
[5] Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated Empirical Optimization of Software and the ATLAS Project (2001)
[6] Vuduc, R., Demmel, J.W., Yelick, K.A.: OSKI: A Library of Automatically Tuned Sparse Matrix Kernels (2005)
[7] Tillmann, M., Karcher, T., Dachsbacher, C., Tichy, W.F.: Application-independent Autotuning for GPUs (2013)
[8] Nukada, A., Matsuoka, S.: Auto-Tuning 3-D FFT Library for CUDA GPUs (2009)
[9] Wang, R., Gu, T., Li, M.: Performance Prediction Based on Statistics of Sparse Matrix-Vector Multiplication on GPUs (2017)
[10] Baskaran, M.M., Ramanujam, J., Sadayappan, P.: Automatic C-to-CUDA Code Generation for Affine Programs (2010)
[11] Khan, M., Basu, P., Rudy, G., Hall, M., Chen, C., Chame, J.: A Script-Based Autotuning Compiler System to Generate High-Performance CUDA Code (2013)
[12] Volkov, V., Demmel, J.W.: Benchmarking GPUs to Tune Dense Linear Algebra (2008)
[13] Li, Y., Dongarra, J., Tomov, S.: A Note on Auto-tuning GEMM for GPUs (2009)
[14] Nath, R., Tomov, S., Dongarra, J.: An Improved MAGMA GEMM for Fermi Graphics Processing Units (2010)
[15] Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Performance, Design, and Autotuning of Batched GEMM for GPUs (2016)
[16] Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs (2017)
[17] Li, S., Amenta, N.: Brute-Force k-Nearest Neighbors Search on the GPU (2015)
[18] Anzt, H., Haugen, B., Kurzak, J., Luszczek, P., Dongarra, J.: Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs (2015)
[19] Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs (2010)
[20] Best Practices for GPU Programming. https://www12.informatik.uni-erlangen.de/edu/map/ss08/talks/Best Practices for GPU Programming.ppt, accessed 26.2.2018
[21] Volkov, V.: Understanding Latency Hiding on GPUs (2016)
[22] Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines (1988)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2019 The Author(s)
About this paper
Cite this paper
Sailer, J., Frey, C., Kühnert, C. (2019). GPU GEMM-Kernel Autotuning for scalable machine learners. In: Beyerer, J., Kühnert, C., Niggemann, O. (eds) Machine Learning for Cyber Physical Systems. Technologien für die intelligente Automation, vol 9. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58485-9_8
DOI: https://doi.org/10.1007/978-3-662-58485-9_8
Publisher Name: Springer Vieweg, Berlin, Heidelberg
Print ISBN: 978-3-662-58484-2
Online ISBN: 978-3-662-58485-9
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)