Abstract
Deep learning (DL) is one of the key technologies in the artificial intelligence (AI) domain. Deep learning neural networks (DLNN) profit greatly from the overall exponential data growth, while at the same time the computational effort for training and inference increases strongly. Most of the computational time in DLNNs is consumed by the convolution step, which is based on a general matrix multiplication (GEMM). To accelerate DLNN computation, various highly optimized GEMM implementations for graphics processing units (GPUs) have been presented in recent years [1]. Most of these approaches are GPU-hardware-specific implementations of the GEMM software kernel and do not take into account the performance dependency on the layout of the training data. To achieve maximum performance, the parameters of the GEMM algorithm have to be tuned both for the specific GPU hardware and for the data layout of the training task. In this paper we present a two-step autotuning approach for GPU-based GEMM algorithms. In the first step, the kernel parameter search space is pruned by several performance criteria; the remaining candidates are then processed by a modified simulated annealing in order to find the best kernel parameter combination with respect to the GPU hardware and the task-specific data layout. We evaluated the proposed approach on 160 different input problems and achieved an average speedup of around 12 over the state-of-the-art implementation from NVIDIA (cuBLAS) on an NVIDIA GTX 1080 Ti accelerator card.
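The abstract describes the two tuning steps only at a high level. The following minimal Python sketch illustrates the general idea of such a two-step tuner: an exhaustive pruning pass over a kernel parameter space, followed by simulated annealing over the surviving candidates. All parameter names, pruning criteria, and the cost model here are illustrative assumptions, not the paper's actual implementation; in a real tuner, benchmark() would compile and time the GEMM kernel on the target GPU for the task-specific matrix layout.

import itertools
import math
import random

# Hypothetical kernel parameter space (tile sizes and thread-block shape);
# the paper's actual parameter set is not given in the abstract.
PARAM_SPACE = {
    "tile_m": [16, 32, 64, 128],
    "tile_n": [16, 32, 64, 128],
    "tile_k": [4, 8, 16],
    "threads_x": [8, 16, 32],
    "threads_y": [8, 16, 32],
}

MAX_THREADS_PER_BLOCK = 1024  # CUDA hardware limit, used as one pruning criterion


def prune(space):
    # Step 1: enumerate all combinations and drop those that violate
    # simple hardware/performance criteria (thread limit, tiles not
    # divisible by the thread-block shape).
    candidates = []
    for tm, tn, tk, tx, ty in itertools.product(*space.values()):
        if tx * ty > MAX_THREADS_PER_BLOCK:
            continue
        if tm % tx != 0 or tn % ty != 0:
            continue
        candidates.append((tm, tn, tk, tx, ty))
    return candidates


def benchmark(params):
    # Stand-in cost model so the sketch runs; a real tuner would compile
    # the GEMM kernel with `params` and time it on the target GPU.
    tm, tn, tk, tx, ty = params
    return abs(tm - 64) + abs(tn - 64) + abs(tk - 8) + abs(tx * ty - 256) / 32


def anneal(candidates, steps=500, t0=10.0, cooling=0.99):
    # Step 2: simulated annealing over the pruned candidates. Worse
    # candidates are accepted with Boltzmann probability exp(-delta/T),
    # so the search can escape local optima while the temperature is high.
    current = random.choice(candidates)
    current_cost = benchmark(current)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        neighbour = random.choice(candidates)
        cost = benchmark(neighbour)
        if cost < current_cost or random.random() < math.exp((current_cost - cost) / t):
            current, current_cost = neighbour, cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost


if __name__ == "__main__":
    pruned = prune(PARAM_SPACE)
    params, cost = anneal(pruned)
    print(f"best parameters: {params}, cost: {cost:.2f}")

Restricting the annealing moves to the pruned set keeps every evaluated configuration launchable on the target hardware, which is presumably why the pruning pass precedes the stochastic search.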
References
[1] Bergstra, J., Bastien, F., Breuleux, O., Lamblin, P., Pascanu, R., Delalleau, O., Desjardins, G., Warde-Farley, D., Goodfellow, I., Bergeron, A., et al.: Theano: Deep Learning on GPUs with Python. NIPS 2011, BigLearning Workshop, Granada, Spain (2011)
[2] Kurzak, J., Tomov, S., Dongarra, J.: Autotuning GEMMs for Fermi (2011)
[3] Garcia, V., Debreuve, E., Barlaud, M.: Fast k Nearest Neighbour Search Using GPU. Available at: http://vincentfpgarcia.github.io/kNN-CUDA/, accessed 13.06.2018
[4] MAGMA project page. http://icl.cs.utk.edu/magma/, accessed 13.06.2018
[5] Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated Empirical Optimization of Software and the ATLAS Project (2001)
[6] Vuduc, R., Demmel, J.W., Yelick, K.A.: OSKI: A Library of Automatically Tuned Sparse Matrix Kernels (2005)
[7] Tillmann, M., Karcher, T., Dachsbacher, C., Tichy, W.F.: Application-independent Autotuning for GPUs (2013)
[8] Nukada, A., Matsuoka, S.: Auto-Tuning 3-D FFT Library for CUDA GPUs (2009)
[9] Wang, R., Gu, T., Li, M.: Performance Prediction Based on Statistics of Sparse Matrix-Vector Multiplication on GPUs (2017)
[10] Baskaran, M.M., Ramanujam, J., Sadayappan, P.: Automatic C-to-CUDA Code Generation for Affine Programs (2010)
[11] Khan, M., Basu, P., Rudy, G., Hall, M., Chen, C., Chame, J.: A Script-Based Autotuning Compiler System to Generate High-Performance CUDA Code (2013)
[12] Volkov, V., Demmel, J.W.: Benchmarking GPUs to Tune Dense Linear Algebra (2008)
[13] Li, Y., Dongarra, J., Tomov, S.: A Note on Auto-tuning GEMM for GPUs (2009)
[14] Nath, R., Tomov, S., Dongarra, J.: An Improved MAGMA GEMM for Fermi Graphics Processing Units (2010)
[15] Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Performance, Design, and Autotuning of Batched GEMM for GPUs (2016)
[16] Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J.: Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs (2017)
[17] Li, S., Amenta, N.: Brute-Force k-Nearest Neighbors Search on the GPU (2015)
[18] Anzt, H., Haugen, B., Kurzak, J., Luszczek, P., Dongarra, J.: Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs (2015)
[19] Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs (2010)
[20] Best Practices for GPU Programming. https://www12.informatik.uni-erlangen.de/edu/map/ss08/talks/Best Practices for GPU Programming.ppt, accessed 26.2.2018
[21] Volkov, V.: Understanding Latency Hiding on GPUs (2016)
[22] Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines (1988)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2019 The Author(s)
About this paper
Cite this paper
Sailer, J., Frey, C., Kühnert, C. (2019). GPU GEMM-Kernel Autotuning for scalable machine learners. In: Beyerer, J., Kühnert, C., Niggemann, O. (eds) Machine Learning for Cyber Physical Systems. Technologien für die intelligente Automation, vol 9. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58485-9_8
DOI: https://doi.org/10.1007/978-3-662-58485-9_8
Publisher Name: Springer Vieweg, Berlin, Heidelberg
Print ISBN: 978-3-662-58484-2
Online ISBN: 978-3-662-58485-9
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)