LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)
In conjunction with the HiPEAC’14 Conference
Monday (morning & afternoon), January 20, 2014, Vienna, Austria
Room location: Ballroom D
The recent success of advanced mobile platforms, such as iOS, Android, and Windows Phone, coincides with the rising challenge of ensuring a long battery life, and accompanies a larger trend away from increasing processor clock speeds in favour of increasing parallelism. That high performance computing (HPC) is also strongly motivated in this area, as witnessed by the recent Green500 List, illustrates the timeliness and ubiquity of topics relating to power-efficient hardware and software design. The LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM), co-located with HiPEAC 2014 in Vienna, intends to foster dialogue and interaction among researchers addressing contemporary issues in low-power GPU and many-core software and hardware design.
Approaches to the design challenges of power-efficient GPU and many-core computing will be addressed and include topics such as:
- Heterogeneous Many-core Architectures including Mobile and Embedded Platforms
- GPU Programming Models, APIs, Languages, Tools and Compilers
- Low-Power Application Case Studies and Performance Evaluations
- “Green” High Performance Computing
The full-day PEGPUM 2014 workshop is organised under the auspices of the LPGPU FP7 project (lpgpu.org).
10:00-10:05 Georgios Keramidas (Think Silicon Ltd.) “Welcome”
11:00-11:30 Coffee Break
11:30-12:00 Prashant Sharma (Samsung, UK) “A practitioners view of challenges faced with power and performance on mobile GPUs” (Slides)
12:00-12:30 Biagio Cosenza (University of Innsbruck, Austria) “An Insight into the Insieme Compiler Automatic Partitioning for Heterogeneous Platforms” (Slides)
12:30-13:00 Chi Ching Chi (Technische Universität Berlin) “Power and Energy Efficiency of Video Decoding on Multi-core Architectures” (Slides)
14:00-15:00 HiPEAC Keynote
15:00-15:30 Ayal Zaks (Intel Corporation, Israel) “OpenCL heterogeneous portability – theory and practice” (Slides)
15:30-15:50 Georgios Keramidas and Iakovos Stamoulis (Think Silicon Ltd.) “Nema3D: An OpenGL/OpenCL Embedded Programmable Engine” (Slides)
15:50-16:10 Jan Lucas (Technische Universität Berlin) “DART: A Decoupled Architecture Exploiting Temporal SIMD” (Slides)
16:10-16:30 Sam Martin (Geomerics Ltd.) “Mobile Rendering with On-Chip Memory”
16:30-17:00 Coffee Break
17:00-17:30 Alex Ramirez (Barcelona Supercomputing Center, Spain) “Mont-Blanc: Building supercomputers from commodity embedded chips” (Slides)
17:30-17:50 Ralph Potter (Codeplay Software Ltd. / University of Bath, UK) “Fusing GPU kernels with a novel single-source C++ API” (Slides)
17:50-18:10 Biao Wang (Technische Universität Berlin) “Parallel H.264/AVC Motion Compensation for GPUs using OpenCL” (Slides)
Speakers, Titles and Abstracts
Title: Mont-Blanc: Building supercomputers from commodity embedded chips
Speaker: Alex Ramirez (Barcelona Supercomputing Center, Spain)
Abstract: During the 1990s, microprocessors replaced vector supercomputers as soon as they implemented the features required for HPC, owing to their lower cost and higher energy efficiency. Nowadays, we may be ready to take the next step and see how embedded processors, which are now implementing the features required for HPC, may replace current microprocessors thanks to their lower cost and higher energy efficiency. The Mont-Blanc project leads a new approach towards energy-efficient supercomputers based on mobile and embedded technologies.
Title: A practitioners view of challenges faced with power and performance on mobile GPUs
Speaker: Prashant Sharma (Samsung, UK)
Abstract: Earlier GPUs were used only on desktops and in games consoles, but with the advent of smartphones with greater computing capabilities, many high performance GPU tasks such as gaming, video processing and maps have become possible on current mobiles. With current trends, mobile processing power is surpassing battery capacity. Even with more mobile processing, we want our mobile battery to run longer on a single charge. There is a great need to utilise mobile GPU power efficiently. Application developers, users and mobile manufacturers should be aware of how they can use the mobile GPU effectively with power consumption, performance and quality in consideration. This talk, with a target audience comprising both game and application developers and general mobile users, will briefly explain the tradeoffs between power consumption, performance and display quality on mobile devices. It will also discuss differences between desktop and mobile GPU architectures from a power consumption point of view. It will highlight common programming mistakes which lead to high power consumption and hence low battery life. It will also focus on how to program for mobile while considering power consumption. Understanding mobile GPU power consumption would not only help in better utilisation of the current battery but would also help to improve the power management of mobile devices in the future.
Title: Many-core DSP architectures
Speaker: Gerard Rauwerda (CTO & co-founder Recore Systems)
Abstract: Future DSP processing systems demand more computation power and sufficient flexibility to support different applications. Many-core DSP architectures in combination with efficient resource scheduling create new opportunities for developing power-aware and reliable embedded systems for these high-performing applications. Recore Systems develops many-core DSP (sub)systems integrating lean DSP processing cores, on-chip memories and scalable interconnect technology. To use those many-core systems effectively and efficiently, we grasp the challenges of run-time scheduling and parallel programming.
Title: OpenCL heterogeneous portability – theory and practice
Speaker: Ayal Zaks (Intel Corporation, Israel)
Abstract: OpenCL is designed to provide portability across heterogeneous devices. Yet some aspects of portability are still challenging. We examine several current trends in research and industry striving to further improve OpenCL’s support for heterogeneous portability.
Title: Performance portability for embedded GPUs
Speaker: Simon McIntosh-Smith (Head of the microelectronics group at the University of Bristol, UK)
Abstract: As the range of embedded GPUs from ARM, Imagination, Qualcomm, Nvidia and others grows, the challenge for software developers also grows. Developing GPU computing applications which are not only functionally portable, but also performance portable, across this diverse range of GPUs becomes a critical issue which must be solved in order for software developers to more broadly adopt embedded GPU computing. The wide range in scale of the performance of embedded GPUs further complicates the issue. In this talk we will look at some recent work at the University of Bristol which exploits OpenCL to evaluate performance portability across embedded GPU platforms and across GPUs ranging from a few GFLOPS to 100 GFLOPS.
Title: An Insight into the Insieme Compiler Automatic Partitioning for Heterogeneous Platforms
Speaker: Biagio Cosenza (University of Innsbruck, Austria)
Abstract: Unleashing the full potential of heterogeneous systems, consisting of multi-core CPUs and GPUs, is a challenging task due to the differences in processing capabilities, memory availability, and communication latencies of different computational resources. The Insieme Compiler manages these differences by deriving a prediction model based on machine learning (ANN and SVM) which incorporates static program features as well as dynamic, input-sensitive features. This talk describes how this approach has been used to perform automatic input-sensitive task partitioning of OpenCL programs and discusses the energy vs performance trade-off on similar platforms.
Title: Parallel H.264/AVC Motion Compensation for GPUs using OpenCL
Speaker: Biao Wang (Technische Universität Berlin)
Abstract: Motion Compensation (MC) is one of the most compute-intensive parts of H.264/AVC video decoding. It exposes massive parallelism which can benefit from Graphics Processing Units (GPUs). However, the divergence caused by different interpolation modes in MC can lead to a significant performance penalty on GPUs. In this work, we propose a novel multi-stage approach to parallelize the MC kernel for GPUs using OpenCL. The proposed approach mitigates the divergence by exploiting the fact that different interpolation modes share common computation stages. In addition, the optimized kernel has been integrated into an ffmpeg decoder that supports H.264/AVC high profile. We evaluated our kernel on GPUs with different architectures shipped by AMD, Intel, and Nvidia. Compared to a CPU implementation, our kernel achieves maximum speedups of 3.27 and 3.59 for 1080p and 2160p videos, respectively. Furthermore, we applied zero copy optimization for integrated GPUs from AMD and Intel to eliminate memory copy overhead between CPU and GPU.
Title: Nema3D: An OpenGL/OpenCL Embedded Programmable Engine
Speaker: Georgios Keramidas and Iakovos Stamoulis (Think Silicon Ltd.)
Abstract: Nema3D is the new programmable core designed by Think Silicon Ltd. (www.think-silicon.com). Nema3D is a multithreaded processing core powered by an intelligent software-hardware codesign approach, in-house compiler support (LLVM-based), a highly reconfigurable architecture, and new low-power architectural-level techniques. Nema3D is designed to support the latest API standards of the Khronos group (OpenGL ES 3.0 and OpenCL 1.2) in the same silicon footprint, while support for image and vision acceleration capabilities is under investigation. As part of this presentation, the design philosophy and the architectural organization of the Nema3D core will be outlined.
Title: DART: A Decoupled Architecture Exploiting Temporal SIMD
Speaker: Jan Lucas (Technische Universität Berlin)
Abstract: GPUs can offer very high performance and good energy efficiency on some applications. Many applications, however, do not perform well. The high area and energy efficiency is reached by grouping threads into groups called warps and running the threads from one warp in lockstep. This way, with only one instruction per warp fetched, decoded and issued, as many operations as there are threads in the warp can be executed. In conventional GPU implementations, spatial SIMD units are used to execute warps. This results in underutilization of the execution units if the threads from one warp follow different control flow paths (branch divergence). This talk presents DART, a novel architecture for GPUs based on Temporal SIMD. By using a temporal implementation of SIMD, it can offer better utilization of the execution units under branch divergence. The details of the DART architecture will be explained, and benchmark results comparing the DART GPU and conventional GPUs will be presented.
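To make the utilization argument concrete, the following is a minimal, idealized sketch in C++ and not a description of the DART design itself. It assumes a hypothetical warp width of 32 (`kWarpWidth`) and assumes that a temporal SIMD lane can skip inactive threads at no cost; the function names are illustrative only.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical warp width, chosen only for this illustration.
constexpr std::size_t kWarpWidth = 32;
using ActiveMask = std::bitset<kWarpWidth>;

// Build a mask in which the first n threads of the warp are active,
// i.e. they follow the control flow path currently being executed.
ActiveMask first_n_active(std::size_t n) {
    assert(n <= kWarpWidth);
    ActiveMask mask;
    for (std::size_t i = 0; i < n; ++i) mask.set(i);
    return mask;
}

// Spatial SIMD: one warp instruction occupies all kWarpWidth lanes for one
// cycle, whether or not a lane's thread is active.
std::size_t spatial_lane_cycles(const ActiveMask&) {
    return kWarpWidth;
}

// Fraction of those lanes doing useful work for this instruction.
double spatial_utilization(const ActiveMask& mask) {
    return static_cast<double>(mask.count()) / kWarpWidth;
}

// Idealized temporal SIMD (assumption): a single lane steps through the
// warp's threads one per cycle and skips inactive threads entirely.
std::size_t temporal_lane_cycles(const ActiveMask& mask) {
    return mask.count();
}
```

Under these assumptions, a warp with 8 of 32 threads active occupies 32 lane-cycles on the spatial unit at 25% utilization, while the idealized temporal lane spends only 8 cycles on the same instruction.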
Title: Power and Energy Efficiency of Video Decoding on Multi-core Architectures
Speaker: Chi Ching Chi (Technische Universität Berlin)
Abstract: In this talk we present how modern power states influence the power consumption of realtime HEVC video decoding. In realtime applications a set amount of operations has to be performed within a time frame, allowing the processor to go idle when the task has been performed. Processor architectures and off-chip memory have incorporated many low-power states, which allow the processor to consume less energy at lower activity levels. On x86 processors this has resulted primarily in so-called P-states and C-states, which control the power consumption when active and idle, respectively. Each of these states has a different transition time, power consumption, and performance level, introducing the new problem of choosing when to use which state. In research, conflicting strategies such as “race to idle” and running longer at a lower clock have been proposed as the best solution. An evaluation has been performed to find which technique is better for HEVC decoding. Analysis has been performed on different systems ranging from desktop to ultra-mobile platforms.
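The trade-off between “race to idle” and running longer at a lower clock can be illustrated with simple energy arithmetic. The C++ sketch below is a simplified model and not the evaluation method of the talk: it ignores state transition times, and the `OperatingPoint` struct, the function name and all numbers in the usage note are hypothetical.

```cpp
#include <cassert>

// Hypothetical description of one active operating point (e.g. a P-state).
struct OperatingPoint {
    double frequency_hz;    // clock frequency while active
    double active_power_w;  // average power drawn while decoding
};

// Energy in joules spent during one realtime period: the processor is
// active until the work is done, then idles for the rest of the period.
// Simplification (assumption): transitions between states are free.
double period_energy_j(double work_cycles, const OperatingPoint& op,
                       double idle_power_w, double period_s) {
    assert(op.frequency_hz > 0.0);
    const double active_s = work_cycles / op.frequency_hz;
    assert(active_s <= period_s);  // the realtime deadline must be met
    return op.active_power_w * active_s +
           idle_power_w * (period_s - active_s);
}
```

With made-up numbers (1e9 cycles of work per 1 s period, 1 W idle power), a 2 GHz point at 8 W spends 8 * 0.5 + 1 * 0.5 = 4.5 J, whereas a 1 GHz point at 3 W spends 3 J, so the slower clock wins. If the fast point drew only 5 W and idle power were 0 W, racing to idle would cost 2.5 J and win instead, which is why the choice depends on the measured characteristics of each platform.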
Title: Mobile Rendering with On-Chip Memory
Speaker: Sam Martin (Geomerics Ltd.)
Abstract: Rendering models developed on power-rich architectures tend to underperform on mobile devices due to the significant differences in available memory bandwidth. Improvements to memory bandwidth are difficult and are expected to be incremental rather than revolutionary. We would do better to rethink our approach for mobile than to wait for process and hardware improvements to bridge the gap. In this talk, we discuss two novel extensions to OpenGL ES, proposed by ARM, that allow detailed control of fast on-chip memory. We show that although the amount of memory available is tiny by most standards, its direct exposure has huge utility. To demonstrate this we show an on-chip variant of deferred lighting, a bandwidth-intensive technique most commonly associated with high-end console and PC titles. We also discuss how the on-chip memory offers new ways to express computations as a set of chained jobs, and how this new programming model is uniquely expressive.
Title: Fusing GPU kernels with a novel single-source C++ API
Speaker: Ralph Potter (Codeplay Software Ltd. / University of Bath, UK)
Abstract: Ongoing and rapidly maturing compiler and API research by Codeplay aims to provide a higher-level, single-source, industry-focused C++-based interface to OpenCL. We investigate opportunities for compiler-based kernel fusion utilizing features from C++11 including lambda functions; variadic templates; and lazy evaluation using std::bind expressions.
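As a rough illustration of the idea of fusing element-wise kernels with C++11 features, the sketch below composes stages on the host using a variadic template, lambdas and `std::function`. It is not Codeplay's API and generates no OpenCL code; `Stage`, `fuse` and `run_kernel` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A stage maps one input element to one output element.
using Stage = std::function<float(float)>;

// Base case: a single stage needs no fusion.
inline Stage fuse(Stage f) { return f; }

// Recursive case (variadic template): apply the first stage, then the
// fusion of the remaining stages, as one combined per-element function.
template <typename... Rest>
Stage fuse(Stage f, Rest... rest) {
    Stage tail = fuse(Stage(rest)...);
    return [f, tail](float x) { return tail(f(x)); };
}

// Stand-in for launching a kernel: one pass over the data, one output buffer.
std::vector<float> run_kernel(const Stage& k, const std::vector<float>& in) {
    assert(static_cast<bool>(k));  // the stage must hold a callable
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = k(in[i]);
    return out;
}
```

For example, with `Stage scale = std::bind(std::multiplies<float>(), std::placeholders::_1, 2.0f);` and `Stage offset = [](float x) { return x + 1.0f; };`, calling `run_kernel(fuse(scale, offset), in)` makes a single pass and produces the same values as `run_kernel(offset, run_kernel(scale, in))`, which makes two passes and materialises an intermediate buffer. The `std::bind` expression is only evaluated when the fused stage is finally applied to an element.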
Organisers and their affiliations
Georgios Keramidas, Think Silicon Ltd.
Stefanos Kaxiras, Uppsala University
Ben Juurlink and Mauricio Alvarez Mesa, TU Berlin
Paul Keir, Codeplay Software Ltd.
Think Silicon Ltd.,
Patras Science Park,
Tel.: +30 2610 911543
Fax: +30 2610 911544
email: g.keramidas AT think-silicon DOT com