The LPGPU2 consortium had a very productive kick-off meeting in Prague on January 21-22, 2016, immediately after the HiPEAC conference. Pictured, from left to right: Marco Starace (Samsung), Philip Harmer (Samsung), Prashant Sharma (Samsung), Mehdi Goli (Codeplay), Andrew Richards (Codeplay), Jan Lucas (TU Berlin), Luke Iwanski (Codeplay), Iakovos Stamoulis (Think Silicon), Georgios Keramidas (Think Silicon), Mauricio Alvarez-Mesa (TU Berlin / Spin Digital). Also present, but not in the picture, were the project coordinator Ben Juurlink (TU Berlin), who took the photo, and Kit Lam (Samsung), who was probably getting coffee or tea.
The program of the upcoming PEGPUM 2016 workshop is now online. PEGPUM is now organized by the LPGPU2 project. The workshop will feature six talks, with speakers from both industry and academia. We hope to see you soon at HiPEAC 2016 in Prague.
The PEGPUM workshop will return at HiPEAC 2016 in Prague. This is the fourth installment of this successful workshop, and it will feature many interesting talks. It will also be the first workshop of the new LPGPU2 project. Please stay tuned for more news and check out the workshop page.
The paper “Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency” by Lucas et al. has been accepted for publication in ACM TACO (http://dl.acm.org/citation.cfm?id=2811402). As original work, it will also be presented at the annual HiPEAC conference, the premier European networking event on topics central to ACM TACO, which was attended by more than 600 scientists in 2015. HiPEAC 2016 will be held in Prague (https://www.hipeac.org/2016/prague).
Hierarchical Private/Shared Classification: the Key to Simple and Efficient Coherence for Clustered Cache Hierarchies
Alberto Ros and Stefanos Kaxiras
Hierarchical clustered cache designs are becoming an appealing alternative for multicores. Grouping cores and their caches in clusters reduces network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability. While such architectures can be formed recursively by replicating a base design pattern, keeping the whole hierarchy coherent requires more effort and consideration. The reason is that, in hierarchical coherence, even basic operations must be recursive. As a consequence, intermediate-level caches behave both as directories and as leaf caches. This leads to an explosion of states, protocol races, and protocol complexity. While there have been previous efforts to extend directory-based coherence to hierarchical designs, their increased complexity and verification cost are a serious impediment to their adoption.
We aim to address these concerns by encapsulating all hierarchical complexity in a simple function: that of determining when a data block is shared entirely within a cluster (sub-tree of the hierarchy) and is private from the outside. This allows us to eliminate complex recursive operations that span the hierarchy and instead employ simple coherence mechanisms such as self-invalidation and write-through—now restricted to operate within the cluster where a data block is shared.
We examine two inclusivity options and discuss the relation of our approach to the recently proposed Hierarchical-Race-Free (HRF) memory models. Finally, comparisons to both hierarchical directory-based MOESI and TokenCMP protocols show that, despite its simplicity, our approach results in competitive performance and significantly decreased network traffic.
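The core classification function described in the abstract can be illustrated with a toy sketch: given the set of cores sharing a block, find the smallest cluster (sub-tree of the hierarchy) that contains all of them — outside that cluster the block is private, so coherence actions need only run inside it. This is only an illustration of the idea, not the paper's protocol; the two-level hierarchy and names below are made up.

```python
# Toy sketch (not the paper's implementation): classify a cache block as
# private to the smallest sub-tree that contains all of its sharers.
# The two-level hierarchy and node names below are invented for illustration.

# root -> {clusterA, clusterB} -> cores 0..3
PARENT = {
    "core0": "clusterA", "core1": "clusterA",
    "core2": "clusterB", "core3": "clusterB",
    "clusterA": "root", "clusterB": "root",
}

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def owning_cluster(sharers):
    """Smallest sub-tree (lowest common ancestor) containing all sharers.
    The block is private from the outside of this cluster, so simple
    mechanisms such as self-invalidation and write-through can be
    restricted to operate within it."""
    paths = [ancestors(c) for c in sharers]
    common = set(paths[0]).intersection(*paths[1:])
    # The lowest common ancestor is the first common node along any path.
    return next(n for n in paths[0] if n in common)

print(owning_cluster(["core0"]))           # core0   (private to one core)
print(owning_cluster(["core0", "core1"]))  # clusterA (shared only inside A)
print(owning_cluster(["core0", "core2"]))  # root    (globally shared)
```

The appeal of this formulation is that the classification is a single, non-recursive question per block, instead of recursive directory operations spanning every level of the hierarchy.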
Andrew Richards, CEO of Codeplay, will be giving a talk on designing heterogeneous programming systems at Multicore Day 2014 in Kista, Sweden, on October 8, 2014. Engineers can achieve huge performance improvements and power savings by accelerating software on a range of processing cores in a system. But different processor cores can have very different architectures, which makes integrating such systems hard. In Codeplay’s research in projects such as LPGPU, we have found that standards-based approaches make integration of different technologies easier. But which standards should engineers use, and how can they work together?
This talk will discuss the existing standards, how they can work together, and how engineers can combine them into a single, easy-to-use programming model for software developers.
The paper “Parallel H.264/AVC Motion Compensation for GPUs using OpenCL” by Biao Wang, Mauricio Alvarez-Mesa, Chi Ching Chi, and Ben Juurlink has been accepted as a Transactions Letter at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) and will appear in an upcoming issue.
Abstract: Motion compensation is one of the most compute-intensive parts of H.264/AVC video decoding. It exposes massive parallelism that can benefit from Graphics Processing Units (GPUs). Control and memory divergence, however, may lead to performance penalties on GPUs. In this paper, we propose two GPU motion compensation kernels, implemented in OpenCL, that mitigate the divergence effect. In addition, the motion compensation kernels have been integrated into a complete and optimized H.264/AVC decoder that supports the H.264/AVC high profile. We evaluated our kernels on GPUs with different architectures from AMD, Intel, and Nvidia. Compared to the fastest CPU used in this paper, our kernel achieves a 2.0× speedup on a discrete Nvidia GPU at the kernel level. However, when the overheads of memory copies and the OpenCL runtime are included, no speedup is gained at the application level.
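Why control divergence costs performance on a GPU can be shown with a toy cost model: lanes in a warp execute in lockstep, so a warp whose lanes take different sides of a branch pays for both sides, while a branch-uniform warp pays once. Grouping work items so each warp is uniform (one generic mitigation; this is not the paper's kernels, and the warp width and data are invented) removes the extra passes:

```python
# Toy SIMT cost model (not the paper's kernels): count serialized branch
# passes per warp. A mixed warp executes both branch sides; a uniform warp
# executes one. Warp width and branch pattern below are hypothetical.

WARP = 8  # hypothetical warp width

def warp_passes(conditions):
    """Passes for one warp: 1 if all lanes agree, 2 if the warp diverges."""
    return len(set(conditions))

def total_passes(work, warp=WARP):
    return sum(warp_passes(work[i:i + warp]) for i in range(0, len(work), warp))

# Interleaved branch outcomes (e.g. alternating block types in motion
# compensation) make every warp diverge...
interleaved = [i % 2 == 0 for i in range(64)]
# ...while sorting work by branch outcome makes every warp uniform.
grouped = sorted(interleaved)

print(total_passes(interleaved))  # 16  (8 warps, 2 passes each)
print(total_passes(grouped))      # 8   (8 warps, 1 pass each)
```

The same reasoning applies to memory divergence: arranging work so that neighboring lanes touch neighboring data lets accesses coalesce instead of serializing.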
At the Game AI Conference 2014, held in Vienna, Austria, Codeplay Ltd. and AiGameDev.com KG teamed up again to deliver multiple talks on applying OpenCL and SYCL to creating massively parallel AI. The presentations by Bjoern Knafla (Codeplay), Gordon Brown (Codeplay), and Alex Champandard (AiGameDev) were featured on the first day of the conference, as part of the Technology Workshop on July 7th, to a sold-out audience of some of the best developers from Europe and beyond.