PEGPUM 2015

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2015), in conjunction with the HiPEAC 2015 Conference
Wednesday, January 21, 2015 (morning and afternoon), Amsterdam, The Netherlands
Room location: E105/E106

Context

The recent success of advanced mobile platforms coincides with the rising challenge of ensuring long battery life, and accompanies a larger trend away from increasing processor clock speeds in favour of increasing parallelism. High performance computing (HPC) is strongly motivated in this area too, as witnessed by the recent Green500 List project, which illustrates the timeliness and ubiquity of topics relating to low-power computing. In recent years we have seen the introduction of new computing platforms that include multicore CPUs, manycore GPUs and application-specific accelerators, some of them specifically addressing low-power mobile applications. The design of applications, architectures, and the supporting programming tools for low-power parallel computing systems is an open and very active research field. The proposed HiPEAC 2015 workshop on low-power computing therefore intends to foster dialogue and interaction among researchers from academia and industry addressing contemporary challenges in low-power parallel software and hardware design.

Topics

Approaches to the design challenges of low-power parallel computing will be addressed and may include topics such as:

Heterogeneous multicore and GPGPU architectures
Mobile and embedded platforms
Application case studies
Graphics and compute interaction
Programming models and tools
High performance computing

Further information on the LPGPU project and on previous editions of the PEGPUM workshop (HiPEAC 2013, HiPEAC 2014) can be found at lpgpu.org.

Affiliation

The full-day PEGPUM 2015 workshop is organised under the auspices of the LPGPU FP7 project (lpgpu.org).

Programme

8:45-10:00 HiPEAC Keynote

10:00-10:05 Welcome. Mauricio Alvarez-Mesa. TU Berlin

10:05-10:35 Keynote 1 “Low power GPUs: a view from the industry”: Edvard Sørgård, Senior Principal Graphics Architect at ARM Trondheim, Norway. [slides]

10:35-11:00 “Low-power parallel processing on GPUs: looking back and forward”. Ben Juurlink. TU Berlin. [slides]

11:00-11:30 Coffee Break

11:30-12:00 “Tools and dataflow-based programming models for heterogeneous MPSoCs”. Jerónimo Castrillón. TU Dresden. [slides]

12:00-12:30 “Implementing Khronos SYCL for OpenCL”, Ralph Potter, Codeplay Software Ltd. [Slides]

12:30-13:00 “Ultra Low Power GPUs for IoT Devices”, Georgios Keramidas, Think Silicon. [slides]

13:00-14:00 Lunch

14:00-14:30 Keynote 2 “Low power GPU computing: the state of the union”, Simon McIntosh-Smith. Head of the Microelectronics Group and Senior Lecturer, University of Bristol. [Slides]

14:30-15:00 “Scalarization and Temporal SIMT in GPUs: reducing redundant operations for better performance and higher energy efficiency”, Jan Lucas, TU Berlin. [Slides]

15:00-15:30 “Accelerating Renderscript applications using OpenCL SPIR”, Lukas Kuklinek, Codeplay Software Ltd. [slides]

15:30-16:00 “Building heterogeneous UVMs (unified virtual memories) without the overhead”. Konstantinos Koukos, Uppsala University

16:00-16:30 Coffee break

16:30-17:00 “RePhrase/ParaPhrase: Engineering Software for Heterogeneous Multicore”. Kevin Hammond. [slides]

17:00-17:30 “Semi-Automatic Refactoring for (Heterogeneous) Parallel Programs”. Chris Brown.

17:30-18:00 “REPARA: Reengineering for Heterogeneous Parallelism for Performance and Energy in C++”, Daniel Garcia. [slides]

Speakers, Titles and Abstracts

Keynote 1: Low power GPUs: a view from the industry

Presenter: Edvard Sørgård, Senior Principal Graphics Architect at ARM-Norway (Trondheim)
Abstract: Power has become the number one challenge for growth across the industry. As a major industry GPU provider we face this every day. What have we learned and where are we going next?

Keynote 2: Low power GPU computing: the state of the union

Presenter: Simon McIntosh-Smith, University of Bristol, UK.
Abstract: While GPU computing on the desktop has existed for over 10 years now, low power GPU (LPGPU) computing is a much more recent phenomenon. In this talk we will review recent progress in this area, and analyse recent trends in the LPGPU space.

Low-power parallel processing on GPUs: looking back and forward

Presenter: Ben Juurlink. TU Berlin, Germany
Abstract: Low-Power Parallel Computing on GPUs (LPGPU) is a European STREP project that ran from September 2011 to October 2014. In this talk we will highlight some of the major achievements of the project as well as the research avenues the project has kick-started.

Tools and dataflow-based programming models for heterogeneous MPSoCs

Presenter: Jerónimo Castrillón. TU Dresden, Germany
Abstract: Programming models based on dataflow or process networks are a good match for streaming applications, common in the signal processing, multimedia and automotive domains. In such models, parallelism is expressed explicitly, which makes them well-suited for programming parallel machines. Since today’s applications are no longer static, expressive programming models are needed, such as those based on Kahn Process Networks (KPNs). In these models, tasks cannot be handled as black boxes, but have to be analyzed, profiled and traced to characterize their behavior. This is especially important in the case of heterogeneous platforms with many processors of multiple different types. In this presentation we present a tool flow to deal with KPN applications and give insights into mapping algorithms for heterogeneous platforms. We also address the issue of producing a mapping under soft real-time constraints, aiming at obtaining energy-efficient solutions.
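For readers unfamiliar with Kahn Process Networks, the following is a minimal illustrative sketch and is not taken from the talk or its tool flow. It assumes three hypothetical processes (producer, scale, collect) connected by unbounded FIFO channels, with Python threads standing in for processors and a None token as an end-of-stream convention chosen only for this sketch.

```python
import queue
import threading


def producer(out_ch, n):
    """Emit the tokens 0..n-1, then an end-of-stream marker."""
    for i in range(n):
        out_ch.put(i)
    out_ch.put(None)  # end-of-stream marker (a convention of this sketch)


def scale(in_ch, out_ch, factor):
    """Multiply every incoming token by a constant factor."""
    while True:
        token = in_ch.get()  # blocking read: the defining rule of a KPN
        if token is None:
            out_ch.put(None)
            return
        out_ch.put(token * factor)


def collect(in_ch, results):
    """Append every incoming token to a result list."""
    while True:
        token = in_ch.get()
        if token is None:
            return
        results.append(token)


def run_network(n=5, factor=3):
    """Wire producer -> scale -> collect and run each process in a thread."""
    a, b = queue.Queue(), queue.Queue()  # unbounded FIFO channels
    results = []
    processes = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=collect, args=(b, results)),
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return results
```

Because each process only performs blocking reads on its input channel, the result of run_network does not depend on how the threads are scheduled, which is the property that makes such networks attractive when mapping onto heterogeneous processors.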

Implementing Khronos SYCL for OpenCL

Presenter: Ralph Potter, Codeplay Software and The University of Bath, UK.
Abstract: Software development for mobile, SoC, and heterogeneous multicore architectures can be a challenging undertaking. Yet, with mainstream hardware now increasingly heterogeneous, ever greater numbers of developers demand improved language and tools support. SYCL (pronounced “sickle”) is a royalty-free, cross-platform C++ abstraction layer from Khronos that builds on the underlying concepts, portability and efficiency of OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL enables single source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration, and then re-use them throughout their source code on different types of data. In this talk Codeplay will describe SYCL and the prototype SYCL implementation they have developed within the LPGPU project.

Ultra Low Power GPUs for IoT Devices

Presenter: Georgios Keramidas, Think Silicon, Greece.
Abstract: In this presentation, we will introduce Nema|t, a tiny ultra-low-power OpenGL|ES GPU suitable for wearables and the broader IoT market. Nema|t integrates graphics hardware accelerators and a multi-threaded processing unit based on a “green” VLIW instruction set. Nema|t is scalable and able to deliver impressive graphics performance per milliwatt. Core programmability adds flexibility, allowing multiple software graphics APIs to be supported. Combined with Think Silicon’s proprietary framebuffer and texture compression, Nema|t brings graphics rendering capabilities to systems with scarce resources (silicon area, memory). The presentation will focus on the SW/HW co-design methodology developed by Think Silicon, which aims to provide the graphics performance required by each target application (or device) at minimal power consumption.

Scalarization and Temporal SIMT in GPUs: reducing redundant operations for better performance and higher energy efficiency

Presenter: Jan Lucas, TU Berlin, Germany.
Abstract: Scalarization is a technique to remove certain redundant operations from GPU kernels. In this presentation, we will analyze why these redundant operations are common in GPU kernels and how they can be recognized by the compiler. We will then explain how Scalarization can be integrated into different GPU architectures. In this part of the presentation we will show the benefits of combining Scalarization with Temporal SIMT based GPU architectures. Finally, we will show how Scalarization results in significant improvements in performance and energy efficiency in many common GPU applications.
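The kind of redundancy the abstract refers to can be illustrated with a small simulation. This is only a sketch under stated assumptions and not the method presented in the talk: the hypothetical function execute_warp models one instruction issued for all lanes of a warp, and it detects uniform operands at run time, whereas the abstract describes recognition by the compiler.

```python
import operator


def execute_warp(op, a_lanes, b_lanes):
    """Execute one binary instruction for every lane of a simulated warp.

    Returns (per-lane results, number of ALU operations performed).
    If both operands hold the same value in every lane, the operation is
    computed once and the result is shared by all lanes.
    """
    uniform = len(set(a_lanes)) == 1 and len(set(b_lanes)) == 1
    if uniform:
        value = op(a_lanes[0], b_lanes[0])  # one scalar operation
        return [value] * len(a_lanes), 1
    # Operands differ between lanes: every lane needs its own operation.
    results = [op(a, b) for a, b in zip(a_lanes, b_lanes)]
    return results, len(a_lanes)


# A base address plus a block offset is identical in every lane.
uniform_results, uniform_ops = execute_warp(operator.add, [1024] * 4, [64] * 4)

# A thread index times an element size differs per lane.
vector_results, vector_ops = execute_warp(operator.mul, [0, 1, 2, 3], [4] * 4)
```

In this toy model the uniform address computation costs one operation instead of four, while the per-thread index computation still costs one operation per lane.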

Accelerating Renderscript applications using OpenCL SPIR

Presenter: Lukas Kuklinek, Codeplay Software Ltd. UK.
Abstract: We present a system which enables unmodified RenderScript applications to run on OpenCL-enabled devices. RenderScript is a compute API for Android by Google. Although the two technologies are similar in many ways, there are still some important differences our system has to address in order to bridge the gap between them. The system consists of a compiler and a runtime system. The compiler translates RenderScript kernels into SPIR, a non-source encoding of OpenCL kernels. In the process, it has to deal with the differences between kernel interfaces, different sets of built-in functions and lack of address space annotations in RenderScript which are required by SPIR. The runtime maps RenderScript Java API calls to the corresponding OpenCL calls.

Building heterogeneous UVMs (unified virtual memories) without the overhead

Presenter: Konstantinos Koukos, Uppsala University, Sweden.
Abstract: Currently GPUs do not implement cache coherence, and therefore the programmer has the burden of keeping the data valid among threads. State-of-the-art proposals implement coherence protocols that maintain strong sequential consistency in the GPU and in the whole system. Our observation is that GPUs do not require such complicated coherence protocols, since they can tolerate long latencies and the data-parallel model is mostly streaming. Therefore we focus our implementation on a heterogeneous race-free memory model, which significantly simplifies synchronization and provides a uniform virtual memory address range across devices at much lower energy and complexity cost. We implement two variations of the protocol, one for the CPU and one for the GPU, each optimized for the demands of its device. The GPU version is simpler, since it is focused on throughput, while the CPU one is more elaborate and latency-aware. For the classification we use page-level regions kept at the TLBs, where we also keep valid/invalid bits for each cache line of the page. This makes bulk reset of these bits easier during self-invalidation (synchronization). The overall design results in a directory-less and broadcast-less protocol that is much simpler to implement and verify.
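The per-page valid bits and their bulk reset can be pictured with a toy model. This is a sketch under assumptions, not the protocol from the talk: the class names PageEntry and Tlb, the 4 KiB page size and the 64-byte cache line size are all hypothetical, and the classification step mentioned in the abstract is not modelled.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
LINE_SIZE = 64    # assumed cache line size in bytes


class PageEntry:
    """TLB entry holding one valid bit per cache line of a page."""

    def __init__(self):
        self.valid_bits = 0  # bit i set means cache line i is valid

    def is_valid(self, line):
        return bool((self.valid_bits >> line) & 1)

    def mark_valid(self, line):
        self.valid_bits |= 1 << line

    def self_invalidate(self):
        self.valid_bits = 0  # bulk reset: all lines of the page at once


class Tlb:
    """Toy TLB that tracks cache line validity at page granularity."""

    def __init__(self):
        self.entries = {}

    def access(self, address):
        """Return True on a valid line, False if it must be fetched."""
        page, offset = divmod(address, PAGE_SIZE)
        line = offset // LINE_SIZE
        entry = self.entries.setdefault(page, PageEntry())
        hit = entry.is_valid(line)
        entry.mark_valid(line)  # the line is valid after the access
        return hit

    def synchronize(self):
        """Self-invalidate every tracked page at a synchronization point."""
        for entry in self.entries.values():
            entry.self_invalidate()
```

Keeping the bits together in one word per page is what makes the reset a single assignment per entry in this model, with no directory lookup and no broadcast to other devices.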

RePhrase/ParaPhrase: Engineering Software for Heterogeneous Multicore

Presenter: Kevin Hammond. University of St Andrews, UK.
Abstract: This talk will introduce the EU ParaPhrase and RePhrase projects. (Near-)future data-intensive applications will need to consider large-scale parallelism as an essential part of their design and development process. RePhrase aims to dramatically simplify this process over the state-of-the-art using a flexible semi-automated development approach that will be built around emerging pattern-based parallel programming technology that has been developed in the ParaPhrase project. Pattern-based programming enables abstraction over low-level parallelism details, including thread creation, communication, synchronisation, and scheduling; and also over data placement, access, migration and replication. This makes it ideal to address the intrinsic complexity of data-intensive applications with respect to parallelism and data management. It will be supplemented by advanced refactoring, program analysis, testing, verification, dynamic adaptivity mechanisms and performance monitoring/measurement tools as part of a coherent software development methodology.

Semi-Automatic Refactoring for (Heterogeneous) Parallel Programs

Presenter: Chris Brown. University of St Andrews, UK.
Abstract: Modern multicore systems offer huge computing potential. Exploiting large parallel systems is still a very challenging task, however, especially as many software developers still use overly-sequential programming models. In this talk, I will present a radical and novel approach to introducing and tuning parallelism for heterogeneous shared-memory systems (comprising a mixture of CPUs and GPUs) that combines algorithmic skeletons, machine learning, and refactoring tool support. Specifically, I will show how to use skeletons to model the parallelism, machine learning to predict the optimal configuration and mapping, and refactoring to introduce the parallelism into the application. Finally, I will demonstrate our tools on a number of applications, showing that we can easily obtain comparable results to hand-tuned optimised versions.
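As a rough illustration of what an algorithmic skeleton is, the sketch below defines two hypothetical skeletons, farm and pipeline, using only the Python standard library. It is not the tooling demonstrated in the talk; the n_workers parameter merely stands in for the kind of configuration value that the abstract says is predicted by machine learning.

```python
from concurrent.futures import ThreadPoolExecutor


def pipeline(*stages):
    """Compose stages so that each input flows through them in order."""
    def run(item):
        for stage in stages:
            item = stage(item)
        return item
    return run


def farm(worker, inputs, n_workers):
    """Apply the same worker to independent inputs using a pool of threads.

    Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, inputs))


def increment(x):
    return x + 1


def double(x):
    return x * 2


# A farm whose worker is a two-stage pipeline, run with two workers.
results = farm(pipeline(increment, double), range(4), n_workers=2)
```

The application code only names the pattern and the sequential stages; thread creation, scheduling and result ordering are hidden inside the skeleton, so the configuration can be tuned without touching the stages.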

REPARA: Reengineering for Heterogeneous Parallelism for Performance and Energy in C++

Presenter: Daniel Garcia. University Carlos III of Madrid, Spain.
Abstract: The REPARA project aims to help the transformation and deployment of new and legacy applications in parallel heterogeneous computing architectures while maintaining a balance between application performance, energy efficiency and source code maintainability. To achieve this goal, we have defined a full workflow starting from existing source code to application bundles that can be deployed on heterogeneous platforms. During this presentation I will provide details on the current status of the project with special attention to software refactoring tools, approaches for application partitioning and run-time integration.

Organisers and their affiliations

Mauricio Alvarez-Mesa and Ben Juurlink, TU Berlin
Georgios Keramidas, Think Silicon Ltd.
Paul Keir, Codeplay Software Ltd.

Contact

Mauricio Alvarez-Mesa

Technische Universität Berlin
Embedded Systems Architecture Group (AES)
Department of Electrical Engineering and Computer Science
Building E-N / Office: E-N 601
Einsteinufer 17, 10587 Berlin
Germany

Tel: +49.30.314-21357
Fax: +49.30.314-22943
email: mauricio.alvarezmesa (AT) tu-berlin.de
